Workload Manager Slurm
Basics
A Slurm job script is essentially a recipe to automate your computational workflow.
You write a shell script that - step by step - loads necessary software, prepares your input data, processes the data, and generates your results.
The script you write contains special comment lines that are interpreted by the Slurm workload manager.
These comments carry information about the resources your calculation needs, such as the number of CPUs, RAM requirements, and estimated runtime.
Slurm also provides several environment variables which you can use in your script.
These variables help you identify your job by ID or the cluster node your job is running on and much more.
The following sections will take you through a series of examples.
They begin with the most basic job script to run your program on the cluster and introduce more features and tricks step by step, until you are finally able to write your own complex job scripts and consult the official documentation for more information.
Here are brief explanations of some of the most common Slurm commands:
sbatch
: This command is used to submit a batch job to the Slurm scheduler.
squeue
: This command shows the status of all jobs in the queue, including the job ID, user, status, and node allocation.
srun
: This command is used to launch a job on the compute nodes and execute commands on them.
scancel
: This command is used to cancel a running or pending job.
sinfo
: This command provides information about the compute nodes, such as their state, availability, and partition membership.
sacct
: This command is used to view job accounting information, such as job start and end times, resource usage, and exit status.
scontrol
: This command is used to modify the job and node configuration, such as setting up a reservation, configuring job priority, and modifying the job environment.
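For orientation, a few typical invocations (the script name and job ID are placeholders):
```sh
sbatch my-job.sh     # submit the job script my-job.sh to the queue
squeue -u $USER      # list your own pending and running jobs
scancel 1234567      # cancel the job with ID 1234567
sinfo                # show partitions and the state of their nodes
```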
Slurm partitions
Since the SC clusters are equipped with different hardware and have access to different resources, we have defined Slurm partitions for the different resources and usage patterns accordingly.
name | max time limit | max resources | usage pattern |
---|---|---|---|
sirius | 2 days | 1 node | big memory applications |
sirius-long | 10 days | 1 node | big memory applications |
polaris | 2 days | 10 nodes | general computation |
polaris-long | 42 days | 4 nodes | long runtime |
clara | 2 days | 29 nodes w/ GPUs | general and GPU computation |
clara-long | 10 days | 6 nodes w/ GPUs | long running GPU jobs |
paula | 2 days | 12 nodes w/ GPUs | general and GPU computation |
paul | 2 days | 32 nodes | general computation |
paul-long | 10 days | 4 nodes | general computation |
CPU and memory limitations
To strengthen fair sharing of the cluster, we have set some limits on the CPU and memory usage per job.
Partition | DefMemPerCPU | MaxMemPerCPU |
---|---|---|
sirius | 1G | 48G |
polaris | 1G | 8G |
clara | 1G | 16G |
paul | 1G | 8G |
paula | 1G | 16G |
By default, you can use 1 GB of RAM per CPU core.
If you need more memory, you can request it using the --mem
option in your job script.
However, the maximum amount of memory you can request per CPU core is limited by the MaxMemPerCPU
value in the table above.
If you request more total memory than the number of allocated CPU cores times MaxMemPerCPU, Slurm will automatically allocate additional CPU cores to your job.
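For illustration, a minimal sketch (the program name is a placeholder; the values follow the table above):
```sh
#!/bin/bash
#SBATCH --partition=polaris   # MaxMemPerCPU is 8G on polaris
#SBATCH --ntasks=1
#SBATCH --mem=32G             # exceeds 1 CPU x 8G, so Slurm allocates 4 CPU cores

./my-program                  # placeholder for your own executable
```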
CPU limitations on GPU nodes
To prevent GPU nodes (clara and paula) from being used exclusively by CPU-only jobs, a certain number of CPUs are reserved for GPUs. For now, we reserve one core per GPU.
Nodes | Reserved cores | Reserved CPUs | Available for CPU-only jobs |
---|---|---|---|
clara (v100) | 4 | 8 | 56 |
clara (rtx2080ti) | 8 | 16 | 48 |
paula | 8 | 8 | 120 |
Interactive jobs
Interactive jobs are a perfect way to test your jobs on the real systems.
You request and allocate the needed resources using the salloc
command.
Here, 2 nodes are requested from the clara
partition.
Once the allocation has been granted, the prompt returns on the same host.
[sy264qasy@login01 ~]$ salloc -N 2 -p clara
salloc: Pending job allocation 899053
salloc: job 899053 queued and waiting for resources
salloc: job 899053 has been allocated resources
salloc: Granted job allocation 899053
To run any program on the requested resources the srun
command has to be used.
In this example we run the hostname
command and it returns the hostnames of the nodes used during this allocation.
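```sh
srun hostname
```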
It is also possible to run parallel MPI programs this way.
To end the allocation just run exit
. The resources are freed up again and can be used for other jobs.
Serial jobs
Simple serial jobs
Serial jobs run on one cluster node and one CPU only.
Tip
Use this if you want to run a program that can run without supervision and is not feasible to run on your local machine.
For example:
- it runs for a very long time (more than a few hours)
- it needs more RAM than your local machine has
- it generates a lot of temporary data (more than your local machine can hold)
- you need to run the same program again and again on many (really, a lot of) different data sets or parameters
Warning
This may not necessarily accelerate the execution of your program.
In fact, it may even be slower because of the lower CPU clock speed (between 2 GHz and 3 GHz) on the cluster compared to the clock speed of your local workstation or even laptop (sometimes in the range of 4 GHz for good workstations).
For this simple scenario we consider the following job script saved as serial-job.sh
.
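Its exact contents depend on your program; a minimal sketch matching the resources shown in the scontrol output below (the program name is a placeholder) could look like this:
```sh
#!/bin/bash
#SBATCH --ntasks=1            # one serial task on one CPU
#SBATCH --mem-per-cpu=5G      # memory per CPU core
#SBATCH --time=01:00:00       # maximum runtime of one hour

./my-serial-program           # placeholder for your own executable
```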
To submit your job to the Slurm queue you use the sbatch
command and get output similar to this:
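```sh
$ sbatch serial-job.sh
Submitted batch job 2216373
```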
This means that your job is now scheduled and will run on the cluster once it is your turn, depending on the requested and available resources.
In the example above the job has the ID 2216373
.
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2216373 galaxy-jo serial-j za381baf PD 0:00 1 (None)
$ scontrol show job 2216373
JobId=2216373 JobName=serial-job.sh
UserId=za381bafi(1435002408) GroupId=domain users(1435000513) MCS_label=N/A
Priority=1 Nice=0 Account=default QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:13 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2021-06-09T14:46:38 EligibleTime=2021-06-09T14:46:38
AccrueTime=2021-06-09T14:46:38
StartTime=2021-06-09T14:46:43 EndTime=2021-06-09T15:46:43 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-09T14:46:43
Partition=galaxy-job AllocNode:Sid=login01:24157
ReqNodeList=(null) ExcNodeList=(null)
NodeList=galaxy146
BatchHost=galaxy146
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=5G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/sc.uni-leipzig.de/za381bafi/jobs/lab-test/serial-job.sh
WorkDir=/home/sc.uni-leipzig.de/za381bafi/jobs/lab-test
StdErr=/home/sc.uni-leipzig.de/za381bafi/jobs/lab-test/slurm-2216373.out
StdIn=/dev/null
StdOut=/home/sc.uni-leipzig.de/za381bafi/jobs/lab-test/slurm-2216373.out
Power=
NtasksPerTRES:0
Serial array jobs
Tip
Use this if you have to run the same serial task for a large set of input parameters.
Let's assume we have a job file array-job.sh
with the following content:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=10:00:00
./my-serial-program -seed $SLURM_ARRAY_TASK_ID
To submit an array of jobs to the Slurm queue, use:
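```sh
sbatch --array=1-500 array-job.sh
```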
This will create 500 jobs in the queue and pass the numbers 1 through 500 as a parameter to your program. You are responsible for evaluating this integer appropriately.
The current limit is 15000 jobs per array job.
Parallel jobs
Multi-threading jobs
Tip
Use this kind of job, if your application is using multi-threading instead of multi-processing, e.g., employing technologies like OpenMP.
All the other tips from simple serial jobs still apply in this case.
Word of advice
Multi-threading jobs run on one node only.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --time=10:00:00
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./openmp-program
Multi-processing and MPI jobs
Tip
Use this kind of job, if your application is using some kind of multi-processing library like message passing interface (MPI).
Many popular simulation suites like GROMACS, LAMMPS, and others use this technology.
All the other tips from simple serial jobs still apply in this case.
The following script takes you through a real-life example of a self-developed MPI application: a Monte Carlo simulation of the two-dimensional q-state Potts model that uses the sophisticated multicanonical simulation method to calculate the density of states.
#!/bin/bash
#SBATCH -J potts-064
#SBATCH --time=2-00:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
# Unless you have a good reason for distributing across 4 nodes, you could also use
# #SBATCH --ntasks=96 and let Slurm distribute the tasks across nodes
#SBATCH --mem=64G
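# Note: Slurm does not expand shell variables such as $HOME in #SBATCH directives,
# so replace $HOME below with the full path to your home directory.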
#SBATCH -o $HOME/Documents/Projects/cudamuca/jobfiles/log/potts-064.o-%a-%A
#SBATCH -e $HOME/Documents/Projects/cudamuca/jobfiles/log/potts-064.e-%a-%A
L=64
seed="${SLURM_ARRAY_TASK_ID}"
exe="{$HOME}/Documents/Projects/cudamuca/potts_cpu"
WORKDIR=/work/users/$USER
out=$(printf "%s/Projects/cudamuca/output/cpu_potts_bench_soft/L%03d/seed%04d" ${WORKDIR} ${L} ${seed})
mkdir -p ${out}
cd ${out}
cp ${exe} ./exe
srun ./exe -q 8 -L ${L} -s ${seed} -m
Tip
If you are using self-written code, make sure you keep communication between workers at a minimum. Communication and synchronization between tasks (especially across nodes), as well as heavy file I/O, can slow down your overall performance.
To start this as an array of 512 jobs, use a command like the following (assuming the script above is saved as potts-064.sh):
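```sh
sbatch --array=1-512 potts-064.sh
```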
GPU accelerated jobs
#!/bin/bash
#SBATCH -J gpu-064
#SBATCH --ntasks=1
#SBATCH --gpus=v100
#SBATCH --time=8-00:00:00
#SBATCH -o $HOME/Documents/Projects/cudamuca/jobfiles/log/GPU/gpu-potts-064.o-%a
#SBATCH -e $HOME/Documents/Projects/cudamuca/jobfiles/log/GPU/gpu-potts-064.e-%a
L=64
seed="${SLURM_ARRAY_TASK_ID}"
WORKERS=$(seq 256 256 11776)
module load CUDA/11.1.1-iccifort-2020.4.304
# build step
cd $HOME/Documents/Projects/cudamuca
make clean && make gpu
WORKDIR=/work/users/$USER
exe="$HOME/Documents/Projects/cudamuca/gpuPotts"
for WORKER in $WORKERS
do
out=$(printf "%s/Documents/Projects/cudamuca/output/potts/gpu_bench_soft/L%03d/seed%04d/W%05d" ${WORKDIR} ${L} ${seed} ${WORKER})
mkdir -p ${out}
cd ${out}
cp ${exe} ./exe
chmod +x ./exe
./exe -m -p -L ${L} -s ${seed} -W ${WORKER} -q 8
# copy end result back to home directory
endresult=$(printf "$HOME/Documents/Projects/cudamuca/result/final_result-W%05d-%s-%s-%s.dat" ${WORKER} ${SLURM_ARRAY_TASK_ID} ${SLURM_ARRAY_JOB_ID} ${SLURM_JOB_ID})
cp ${out}/final_result.dat ${endresult}
done
Adjusting job run-time after submission
After you have submitted your job, you can adjust certain parameters of your job. The most obvious thing you might want to adjust is the runtime of your job.
Warning
To avoid abuse of the scheduler this is only possible with jobs that are not yet running.
To get information about your running job use scontrol
and the JOB ID, e.g.
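```sh
scontrol show job 2216373
```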
Here you will get a lot of information about your job.
In this example we are looking at the TimeLimit
in line 7.
To set the time limit of your job to 4 days use this command:
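```sh
scontrol update JobId=2216373 TimeLimit=4-00:00:00
```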
Alternatively, you can add a certain amount of time to the already set time limit. This is how you would increase the time limit by three hours:
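```sh
scontrol update JobId=2216373 TimeLimit=+03:00:00
```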
Job Efficiency
After your job has finished, you can inspect the efficiency of your job. For that purpose, we offer reportseff on the login nodes. There are basically two ways to generate the report. The first one is to execute the command in the directory where the Slurm output files (slurm-<jobid>.out) are stored.
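```sh
# run inside the directory that contains the slurm-<jobid>.out files
reportseff
```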
The tool automatically reads the output files and generates a table showing the time efficiency, CPU efficiency and memory efficiency.
If you deleted the output files or just want to have a look at your last jobs, you can execute reportseff with the user parameter:
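```sh
reportseff --user $USER
```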
You can use that report to further optimize your resource allocation.
Jupyter notebook
Word of advice
Jupyter notebooks are a great way to prototype or set up your more complex Python or even TensorFlow workflows. However, due to their interactive nature, they are not suitable for running long calculations or training AI models in a JupyterHub/Lab session. Lucky for you, you can easily submit your Jupyter notebooks to Slurm.
Let us assume you have finished your TensorFlow workflow as a Jupyter notebook, for example, and tested it for functionality. Now you are ready to submit the production run to the Slurm queue.
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --mem=40G
#SBATCH --ntasks=1
#SBATCH --job-name=Tensorflow-training
#SBATCH --partition=clara
#SBATCH --gpus=v100
module load JupyterLab/1.2.5-fosscuda-2019b-Python-3.7.4
module load TensorFlow/2.4.0-fosscuda-2019b-Python-3.7.4
jupyter nbconvert --to notebook --execute Tensorflow.ipynb --output tf-job-${SLURM_JOB_ID}.ipynb
R and Rscript
If you use R for your research you can use our RStudio servers for prototyping and testing your R script.
When you are ready to work on that big data set it is time to submit it to the Slurm queue.
#!/bin/bash
#SBATCH --time=05:00:00
#SBATCH --mem=10G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --job-name="Very long R task"
#SBATCH --partition=clara
export LC_ALL=C.UTF-8
module load R/4.0.3-foss-2020b
Rscript --vanilla long-job-rscript.R
Word of advice
The number of CPUs per task should be the same as the number of cores you specified in your R script.
That means it should be 1
if you did not specifically set up parallel calculations in R.
Slurm batch options reference
This section includes all the options used in the examples above; it is just an excerpt of the vast number of possibilities.
Please refer to the official documentation for more options or
use man sbatch
on the cluster.
Number of tasks
Description
Sets the number of tasks (i.e. processes)
Syntax
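```sh
--ntasks=<number>
```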
Examples
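```sh
#SBATCH --ntasks=1     # a single task
#SBATCH --ntasks=96    # e.g. 96 MPI processes
```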
CPUs per task
Description
Number of CPUs per task (i.e. threads); used for multithreading or hybrid jobs
Syntax
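```sh
--cpus-per-task=<number>
```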
Examples
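```sh
#SBATCH --cpus-per-task=4     # four threads per task
#SBATCH --cpus-per-task=32    # 32 threads per task
```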
Walltime
Description
The maximal runtime for your job
Format: [hours:]minutes[:seconds]
Alternate format: days-hours[:minutes][:seconds]
Syntax
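```sh
--time=<time>
```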
Examples
Set walltime to 30 minutes
Set walltime to 3 hours and 20 minutes
Set walltime to two days
Set walltime to 1 day, 5 hours, and 20 minutes
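```sh
#SBATCH --time=30:00        # 30 minutes
#SBATCH --time=03:20:00     # 3 hours and 20 minutes
#SBATCH --time=2-00:00:00   # two days
#SBATCH --time=1-05:20:00   # 1 day, 5 hours, and 20 minutes
```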
Job name
Description
A name for the job
Allows you to specify a custom string to identify your job in the queue
Syntax
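```sh
--job-name=<name>
```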
Example
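```sh
#SBATCH --job-name=my-analysis    # "my-analysis" is an arbitrary example name
```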
Number of nodes
Description
The minimum number of nodes allocated for your job.
Syntax
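```sh
--nodes=<number>
```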
Examples
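```sh
#SBATCH --nodes=2    # request at least two nodes
#SBATCH --nodes=4
```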
Tasks per node
Description
The maximum number of tasks (processes) per node.
Meant only to be used in combination with --nodes
option.
Syntax
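```sh
--ntasks-per-node=<number>
```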
Example
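```sh
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24    # 4 x 24 = 96 tasks in total
```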
Partition
Description
This selects the partition of the cluster your job runs on.
Available partitions are listed here.
You may not be able to access all of them, depending on your permissions.
The default partition is paul
.
Syntax
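```sh
--partition=<name>
```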
Example
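```sh
#SBATCH --partition=clara
```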
Memory
Description
Request real memory required per node.
Default unit is megabytes.
The options --mem
, --mem-per-cpu
, and --mem-per-gpu
are mutually exclusive.
Syntax
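```sh
--mem=<size>[units]
```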
Examples
Requesting 40GB of memory
Requesting 128MB of memory
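```sh
#SBATCH --mem=40G     # 40 GB of memory
#SBATCH --mem=128M    # 128 MB of memory
```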
Memory per CPU
Description
Request memory per allocated CPU core.
Default unit is megabytes.
The options --mem
, --mem-per-cpu
, and --mem-per-gpu
are mutually exclusive.
Syntax
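```sh
--mem-per-cpu=<size>[units]
```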
Example
Requesting 512MB per CPU with 4 tasks (one CPU each) amounts to 2GB in total for the job.
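```sh
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=512M    # 4 x 512 MB = 2 GB in total
```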
Memory per GPU
Description
Request memory per allocated GPU; this comprises memory used by the CPU and the GPU.
Default unit is megabytes.
The options --mem
, --mem-per-cpu
, and --mem-per-gpu
are mutually exclusive.
Syntax
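```sh
--mem-per-gpu=<size>[units]
```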
Example
Requesting 1024MB per GPU with 8 GPUs amounts to 8GB in total for the job.
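```sh
#SBATCH --gpus=8
#SBATCH --mem-per-gpu=1024M    # 8 x 1024 MB = 8 GB in total
```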
The total includes the memory for the CPU part as well as the memory used on the GPU!
Number and type of GPUs
Description
Request GPU resources for your job.
Available types currently are a30
, v100
, and rtx2080ti
.
Syntax
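```sh
--gpus=[type:]<number>
```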
Examples
Requesting 8 GeForce RTX 2080Ti
Requesting 2 Tesla V100
Requesting 4 Tesla A30
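```sh
#SBATCH --gpus=rtx2080ti:8    # 8 GeForce RTX 2080 Ti
#SBATCH --gpus=v100:2         # 2 Tesla V100
#SBATCH --gpus=a30:4          # 4 Tesla A30
```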
Shorthand for requesting a single GPU: leave out the number
#SBATCH --gpus=v100
Slurm output file
Description
Allows you to specify a custom filename to which stdout is redirected
Syntax
```sh
--output=<filename>
```
Examples
Write output of the slurm script to your home directory.
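For example (Slurm does not expand shell variables such as $HOME in #SBATCH lines, so spell out the path; <user> stands for your username):
```sh
#SBATCH --output=/home/sc.uni-leipzig.de/<user>/%x-%j.out
```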
Here %x
and %j
are variables set by Slurm.
%x
will be substituted with the job name that you specified.
%j
is the job ID.
This way you can relate each output file to a specific job.
Slurm error file
Description
Allows you to specify a custom filename to which stderr
is redirected
Syntax
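```sh
--error=<filename>
```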
Examples
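Again writing out the full path; <user> stands for your username:
```sh
#SBATCH --error=/home/sc.uni-leipzig.de/<user>/%x-%j.err
```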
Here %x
and %j
are variables set by Slurm.
%x
will be substituted with the job name that you specified.
%j
is the job ID.
This way you can relate each output file to a specific job.
E-Mail notifications
If you want to get notified when your jobs end, you can instruct slurm to send e-mails.
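```sh
# submit your job (my-job.sh is a placeholder) with notification on job end
sbatch --mail-type=END my-job.sh
```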
This will send a mail to the address used at registration.
You could specify a different address using --mail-user=<e-mail address>.
Valid mail types are:
Type | Description |
---|---|
NONE | default, no notification |
BEGIN | notification on job start |
FAIL | notification when your job fails |
END | notification on job end (includes FAIL) |
TIME_LIMIT | notification when your job has reached the time limit |
TIME_LIMIT_90 | notification when your job has reached 90 percent of the time limit |
TIME_LIMIT_80 | notification when your job has reached 80 percent of the time limit |
TIME_LIMIT_50 | notification when your job has reached 50 percent of the time limit |
ARRAY_TASK | send notifications for each array task |
Multiple mail types can be used at once, e.g.
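```sh
sbatch --mail-type=END,TIME_LIMIT_90 my-job.sh
```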
You can also specify the mail options in your job script using
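```sh
#SBATCH --mail-type=END
#SBATCH --mail-user=<e-mail address>
```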