Workload Manager Slurm
Basics
A Slurm job script is essentially a recipe to automate your computational workflow.
You write a shell script that - step by step - loads necessary software, prepares your input data, processes the data, and generates your results.
The script you write contains special comment lines that are interpreted by the Slurm workload manager.
These comments carry information about the resources your calculation needs, such as the number of CPUs, RAM requirements, and estimated runtime.
Slurm also provides several environment variables which you can use in your script.
These variables help you identify your job by ID or the cluster node your job is running on and much more.
The following sections will take you through a series of examples.
They begin with the most basic job script to run your program on the cluster and introduce more features and tricks step by step, until you are finally able to write your own complex job scripts and consult the official documentation for more information.
Here are brief explanations of some of the most common Slurm commands:
sbatch
: This command is used to submit a batch job to the Slurm scheduler.
squeue
: This command shows the status of all jobs in the queue, including the job ID, user, status, and node allocation.
srun
: This command is used to launch a job on the compute nodes and execute commands on them.
scancel
: This command is used to cancel a running or pending job.
sinfo
: This command provides information about the compute nodes, such as their state, availability, and partition membership.
sacct
: This command is used to view job accounting information, such as job start and end times, resource usage, and exit status.
scontrol
: This command is used to modify the job and node configuration, such as setting up a reservation, configuring job priority, and modifying the job environment.
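For orientation, a few typical invocations (the script name and job ID are placeholders):
```sh
sbatch my-job.sh     # submit the job script my-job.sh to the queue
squeue -u $USER      # list your own pending and running jobs
scancel 1234567      # cancel the job with ID 1234567
sinfo                # show partitions and the state of their nodes
```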
Slurm partitions
Since the SC clusters are equipped with different hardware and have access to different resources, we have defined Slurm partitions for the different resources and usage patterns accordingly.
name | max time limit | max resources | usage pattern |
---|---|---|---|
sirius | 2 days | 1 node | big memory applications |
sirius-long | 10 days | 1 node | big memory applications |
polaris | 2 days | 10 nodes | general computation |
polaris-long | 42 days | 4 nodes | long runtime |
clara | 2 days | 29 nodes w/ GPUs | general and GPU computation |
clara-long | 10 days | 6 nodes w/ GPUs | long running GPU jobs |
paula | 2 days | 12 nodes w/ GPUs | general and GPU computation |
paul | 2 days | 32 nodes | general computation |
paul-long | 10 days | 4 nodes | general computation |
CPU and memory limitations
To strengthen fair sharing of the cluster, we have set some limits on the CPU and memory usage per job.
Partition | DefMemPerCPU | MaxMemPerCPU |
---|---|---|
sirius | 1G | 48G |
polaris | 1G | 8G |
clara | 1G | 16G |
paul | 1G | 8G |
paula | 1G | 16G |
By default, you can use 1 GB of RAM per CPU core.
If you need more memory, you can request it using the --mem
option in your job script.
However, the maximum amount of memory you can request per CPU core is limited by the MaxMemPerCPU
value in the table above.
If you request more total memory than the number of allocated CPU cores times MaxMemPerCPU, Slurm will automatically allocate additional CPU cores to your job.
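For illustration, a minimal sketch (the program name is a placeholder; the values follow the table above):
```sh
#!/bin/bash
#SBATCH --partition=polaris   # MaxMemPerCPU is 8G on polaris
#SBATCH --ntasks=1
#SBATCH --mem=32G             # exceeds 1 CPU x 8G, so Slurm allocates 4 CPU cores

./my-program                  # placeholder for your own executable
```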
CPU limitations on GPU nodes
To prevent GPU nodes (clara and paula) from being used exclusively by CPU-only jobs, a certain number of CPUs are reserved for GPUs. For now, we reserve one core per GPU.
Nodes | Reserved cores | Reserved CPUs | Available for CPU-only jobs |
---|---|---|---|
clara (v100) | 4 | 8 | 56 |
clara (rtx2080ti) | 8 | 16 | 48 |
paula | 8 | 8 | 120 |
Interactive jobs
Interactive jobs are a perfect way to test your jobs on the real systems.
You request and allocate the needed resources using the salloc
command.
Here, 2 nodes are requested from the clara
partition.
Once the allocation has been granted, the prompt returns on the same host.
[sy264qasy@login01 ~]$ salloc -N 2 -p clara
salloc: Pending job allocation 899053
salloc: job 899053 queued and waiting for resources
salloc: job 899053 has been allocated resources
salloc: Granted job allocation 899053
To run any program on the requested resources the srun
command has to be used.
In this example we run the hostname
command and it returns the hostnames of the nodes used during this allocation.
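```sh
srun hostname
```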
It is also possible to run parallel MPI programs this way.
To end the allocation just run exit
. The resources are freed up again and can be used for other jobs.
Serial jobs
Simple serial jobs
Serial jobs run on one cluster node and one CPU only.
Tip
Use this if you want to run a program that can run without supervision and is not feasible to run on your local machine.
For example:
- it runs for a very long time (more than a few hours)
- it needs more RAM than your local machine has
- it generates a lot of temporary data (more than your local machine can hold)
- you need to run the same program again and again on many (really, a lot of) different data sets or parameters
Warning
This may not necessarily accelerate the execution of your program.
In fact, it may even be slower because of the lower CPU clock speed (between 2 GHz and 3 GHz) on the cluster compared to the clock speed of your local workstation or even laptop (sometimes in the range of 4 GHz for good workstations).
For this simple scenario we consider the following job script saved as serial-job.sh
.
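Its exact contents depend on your program; a minimal sketch matching the resources shown in the scontrol output below (the program name is a placeholder) could look like this:
```sh
#!/bin/bash
#SBATCH --ntasks=1            # one serial task on one CPU
#SBATCH --mem-per-cpu=5G      # memory per CPU core
#SBATCH --time=01:00:00       # maximum runtime of one hour

./my-serial-program           # placeholder for your own executable
```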
To submit your job to the Slurm queue you use the sbatch
command and get output similar to this:
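```sh
$ sbatch serial-job.sh
Submitted batch job 2216373
```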
This means that your job is now scheduled and will run on the cluster once it is your turn, depending on the requested and available resources.
In the example above the job has the ID 2216373
.
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2216373 galaxy-jo serial-j za381baf PD 0:00 1 (None)
$ scontrol show job 2216373
JobId=2216373 JobName=serial-job.sh
UserId=za381bafi(1435002408) GroupId=domain users(1435000513) MCS_label=N/A
Priority=1 Nice=0 Account=default QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:13 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2021-06-09T14:46:38 EligibleTime=2021-06-09T14:46:38
AccrueTime=2021-06-09T14:46:38
StartTime=2021-06-09T14:46:43 EndTime=2021-06-09T15:46:43 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-09T14:46:43
Partition=galaxy-job AllocNode:Sid=login01:24157
ReqNodeList=(null) ExcNodeList=(null)
NodeList=galaxy146
BatchHost=galaxy146
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=5G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/sc.uni-leipzig.de/za381bafi/jobs/lab-test/serial-job.sh
WorkDir=/home/sc.uni-leipzig.de/za381bafi/jobs/lab-test
StdErr=/home/sc.uni-leipzig.de/za381bafi/jobs/lab-test/slurm-2216373.out
StdIn=/dev/null
StdOut=/home/sc.uni-leipzig.de/za381bafi/jobs/lab-test/slurm-2216373.out
Power=
NtasksPerTRES:0
Serial array jobs
Tip
Use this if you have to run the same serial task for a large set of input parameters.
Let's assume we have a job file array-job.sh
with the following content:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=10:00:00
./my-serial-program -seed $SLURM_ARRAY_TASK_ID
To submit an array of jobs to the Slurm queue, use:
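```sh
sbatch --array=1-500 array-job.sh
```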
This will create 500 jobs in the queue and pass the numbers 1 through 500 as a parameter to your program. You are responsible for evaluating this integer appropriately.
The current limit is 15000 jobs per array job.
Parallel jobs
Multi-threading jobs
Tip
Use this kind of job, if your application is using multi-threading instead of multi-processing, e.g., employing technologies like OpenMP.
All the other tips from simple serial jobs still apply in this case.
Word of advice
Multi-threading jobs run on one node only.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --time=10:00:00
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./openmp-program
Multi-processing and MPI jobs
Tip
Use this kind of job, if your application is using some kind of multi-processing library like message passing interface (MPI).
Many popular simulation suites like GROMACS, LAMMPS, and others use this technology.
All the other tips from simple serial jobs still apply in this case.
The following script takes you through a real-life example of a self-developed MPI application: a Monte Carlo simulation of the two-dimensional q-state Potts model that uses the sophisticated multicanonical simulation method to calculate the density of states.
#!/bin/bash
#SBATCH -J potts-064
#SBATCH --time=2-00:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
# Unless you have a good reason for distributing across 4 nodes, you could also use
# #SBATCH --ntasks=96 and let Slurm distribute the tasks across nodes
#SBATCH --mem=64G
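# Note: Slurm does not expand shell variables such as $HOME in #SBATCH directives,
# so replace $HOME below with the full path to your home directory.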
#SBATCH -o $HOME/Documents/Projects/cudamuca/jobfiles/log/potts-064.o-%a-%A
#SBATCH -e $HOME/Documents/Projects/cudamuca/jobfiles/log/potts-064.e-%a-%A
L=64
seed="${SLURM_ARRAY_TASK_ID}"
exe="{$HOME}/Documents/Projects/cudamuca/potts_cpu"
WORKDIR=/work/users/$USER
out=$(printf "%s/Projects/cudamuca/output/cpu_potts_bench_soft/L%03d/seed%04d" ${WORKDIR} ${L} ${seed})
mkdir -p ${out}
cd ${out}
cp ${exe} ./exe
srun ./exe -q 8 -L ${L} -s ${seed} -m
Tip
If you are using self-written code, make sure you keep communication between workers at a minimum. Communication and synchronization between tasks (especially across nodes), as well as heavy file I/O, can slow down your overall performance.
To start this as an array of 512 jobs, use a command like the following (assuming the script above is saved as potts-064.sh):
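```sh
sbatch --array=1-512 potts-064.sh
```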
GPU accelerated jobs
#!/bin/bash
#SBATCH -J gpu-064
#SBATCH --ntasks=1
#SBATCH --gpus=v100
#SBATCH --time=8-00:00:00
#SBATCH -o $HOME/Documents/Projects/cudamuca/jobfiles/log/GPU/gpu-potts-064.o-%a
#SBATCH -e $HOME/Documents/Projects/cudamuca/jobfiles/log/GPU/gpu-potts-064.e-%a
L=64
seed="${SLURM_ARRAY_TASK_ID}"
WORKERS=$(seq 256 256 11776)
module load CUDA/11.1.1-iccifort-2020.4.304
# build step
cd $HOME/Documents/Projects/cudamuca
make clean && make gpu
WORKDIR=/work/users/$USER
exe="$HOME/Documents/Projects/cudamuca/gpuPotts"
for WORKER in $WORKERS
do
out=$(printf "%s/Documents/Projects/cudamuca/output/potts/gpu_bench_soft/L%03d/seed%04d/W%05d" ${WORKDIR} ${L} ${seed} ${WORKER})
mkdir -p ${out}
cd ${out}
cp ${exe} ./exe
chmod +x ./exe
./exe -m -p -L ${L} -s ${seed} -W ${WORKER} -q 8
# copy end result back to home directory
endresult=$(printf "$HOME/Documents/Projects/cudamuca/result/final_result-W%05d-%s-%s-%s.dat" ${WORKER} ${SLURM_ARRAY_TASK_ID} ${SLURM_ARRAY_JOB_ID} ${SLURM_JOB_ID})
cp ${out}/final_result.dat ${endresult}
done
Adjusting job run-time after submission
After you have submitted your job, you can adjust certain parameters of your job. The most obvious thing you might want to adjust is the runtime of your job.
Warning
To avoid abuse of the scheduler this is only possible with jobs that are not yet running.
To get information about your running job use scontrol
and the JOB ID, e.g.
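```sh
scontrol show job 2216373
```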
Here you will get a lot of information about your job.
In this example we are looking at the TimeLimit
in line 7.
To set the time limit of your job to 4 days use this command:
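```sh
scontrol update JobId=2216373 TimeLimit=4-00:00:00
```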
Alternatively, you can add a certain amount of time to the already set time limit. This is how you would increase the time limit by three hours:
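```sh
scontrol update JobId=2216373 TimeLimit=+03:00:00
```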
Job Efficiency
After your job has finished, you can inspect the efficiency of your job. For that purpose, we offer reportseff on the login nodes. There are basically two ways to generate the report. The first one is to execute the command in the directory where the Slurm output files (slurm-<jobid>.out) are stored.
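```sh
# run inside the directory that contains the slurm-<jobid>.out files
reportseff
```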
The tool automatically reads the output files and generates a table showing the time efficiency, CPU efficiency and memory efficiency.
If you deleted the output files or just want to have a look at your last jobs, you can execute reportseff with the user parameter:
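```sh
reportseff --user $USER
```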
You can use that report to further optimize your resource allocation.
Jupyter notebook
Word of advice
Jupyter notebooks are a great way to prototype or set up your more complex Python or even TensorFlow workflows. However, due to their interactive nature, they are not suitable for running long calculations or training AI models in a JupyterHub/Lab session. Lucky for you, you can easily submit your Jupyter notebooks to Slurm.
Let us assume you have finished your TensorFlow workflow as a Jupyter notebook, for example, and tested it for functionality. Now you are ready to submit the production run to the Slurm queue.
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --mem=40G
#SBATCH --ntasks=1
#SBATCH --job-name=Tensorflow-training
#SBATCH --partition=clara
#SBATCH --gpus=v100
module load JupyterLab/1.2.5-fosscuda-2019b-Python-3.7.4
module load TensorFlow/2.4.0-fosscuda-2019b-Python-3.7.4
jupyter nbconvert --to notebook --execute Tensorflow.ipynb --output tf-job-${SLURM_JOB_ID}.ipynb
R and Rscript
If you use R for your research you can use our RStudio servers for prototyping and testing your R script.
When you are ready to work on that big data set it is time to submit it to the Slurm queue.
#!/bin/bash
#SBATCH --time=05:00:00
#SBATCH --mem=10G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --job-name="Very long R task"
#SBATCH --partition=clara
export LC_ALL=C.UTF-8
module load R/4.0.3-foss-2020b
Rscript --vanilla long-job-rscript.R
Word of advice
The number of CPUs per task should be the same as the number of cores you specified in your R script.
That means it should be 1
if you did not specifically set up parallel calculations in R.
Slurm batch options reference
This section includes all the options used in the examples above; it is just an excerpt of the vast number of possibilities.
Please refer to the official documentation for more options or
use man sbatch
on the cluster.
Number of tasks
Description
Sets the number of tasks (i.e. processes)
Syntax
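```sh
--ntasks=<number>
```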
Examples
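```sh
#SBATCH --ntasks=1     # a single task
#SBATCH --ntasks=96    # e.g. 96 MPI processes
```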
CPUs per task
Description
Number of CPUs per task (i.e. threads); used for multithreading or hybrid jobs
Syntax
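```sh
--cpus-per-task=<number>
```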
Examples
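```sh
#SBATCH --cpus-per-task=4     # four threads per task
#SBATCH --cpus-per-task=32    # 32 threads per task
```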
Walltime
Description
The maximal runtime for your job
Format: [hours:]minutes[:seconds]
Alternate format: days-hours[:minutes][:seconds]
Syntax
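```sh
--time=<time>
```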
Examples
Set walltime to 30 minutes
Set walltime to 3 hours and 20 minutes
Set walltime to two days
Set walltime to 1 day, 5 hours, and 20 minutes
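```sh
#SBATCH --time=30:00        # 30 minutes
#SBATCH --time=03:20:00     # 3 hours and 20 minutes
#SBATCH --time=2-00:00:00   # two days
#SBATCH --time=1-05:20:00   # 1 day, 5 hours, and 20 minutes
```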
Job name
Description
A name for the job
Allows you to specify a custom string to identify your job in the queue
Syntax
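```sh
--job-name=<name>
```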
Example
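```sh
#SBATCH --job-name=my-analysis    # "my-analysis" is an arbitrary example name
```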
Number of nodes
Description
The minimum number of nodes allocated for your job.
Syntax
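```sh
--nodes=<number>
```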
Examples
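```sh
#SBATCH --nodes=2    # request at least two nodes
#SBATCH --nodes=4
```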
Tasks per node
Description
The maximum number of tasks (processes) per node.
Meant only to be used in combination with --nodes
option.
Syntax
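```sh
--ntasks-per-node=<number>
```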
Example
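```sh
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24    # 4 x 24 = 96 tasks in total
```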
Partition
Description
This selects the partition of the cluster your job runs on.
Available partitions are listed here.
You may not be able to access all of them, depending on your permissions.
The default partition is paul
.
Syntax
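```sh
--partition=<name>
```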
Example
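```sh
#SBATCH --partition=clara
```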
Memory
Description
Request real memory required per node.
Default unit is megabytes.
The options --mem
, --mem-per-cpu
, and --mem-per-gpu
are mutually exclusive.
Syntax
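```sh
--mem=<size>[units]
```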
Examples
Requesting 40GB of memory
Requesting 128MB of memory
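```sh
#SBATCH --mem=40G     # 40 GB of memory
#SBATCH --mem=128M    # 128 MB of memory
```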
Memory per CPU
Description
Request memory per allocated CPU core.
Default unit is megabytes.
The options --mem
, --mem-per-cpu
, and --mem-per-gpu
are mutually exclusive.
Syntax
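```sh
--mem-per-cpu=<size>[units]
```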
Example
Requesting 512MB per CPU with 4 tasks (one CPU each) amounts to 2GB in total for the job.
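```sh
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=512M    # 4 x 512 MB = 2 GB in total
```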
Memory per GPU
Description
Request memory per allocated GPU; this comprises memory used by the CPU and the GPU.
Default unit is megabytes.
The options --mem
, --mem-per-cpu
, and --mem-per-gpu
are mutually exclusive.
Syntax
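```sh
--mem-per-gpu=<size>[units]
```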
Example
Requesting 1024MB per GPU with 8 GPUs amounts to 8GB in total for the job.
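```sh
#SBATCH --gpus=8
#SBATCH --mem-per-gpu=1024M    # 8 x 1024 MB = 8 GB in total
```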
The total includes the memory for the CPU part as well as the memory used on the GPU!
Number and type of GPUs
Description
Request GPU resources for your job.
Available types currently are a30
, v100
, and rtx2080ti
.
Syntax
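```sh
--gpus=[type:]<number>
```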
Examples
Requesting 8 GeForce RTX 2080Ti
Requesting 2 Tesla V100
Requesting 4 Tesla A30
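```sh
#SBATCH --gpus=rtx2080ti:8    # 8 GeForce RTX 2080 Ti
#SBATCH --gpus=v100:2         # 2 Tesla V100
#SBATCH --gpus=a30:4          # 4 Tesla A30
```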
Shorthand for requesting a single GPU: leave out the number
#SBATCH --gpus=v100
Slurm output file
Description
Allows you to specify a custom filename to which stdout is redirected
Syntax
```sh
--output=<filename>
```
Examples
Write output of the slurm script to your home directory.
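For example (Slurm does not expand shell variables such as $HOME in #SBATCH lines, so spell out the path; <user> stands for your username):
```sh
#SBATCH --output=/home/sc.uni-leipzig.de/<user>/%x-%j.out
```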
Here %x
and %j
are variables set by Slurm.
%x
will be substituted with the job name that you specified.
%j
is the job ID.
This way you can relate each output file to a specific job.
Slurm error file
Description
Allows you to specify a custom filename to which stderr
is redirected
Syntax
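```sh
--error=<filename>
```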
Examples
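Again writing out the full path; <user> stands for your username:
```sh
#SBATCH --error=/home/sc.uni-leipzig.de/<user>/%x-%j.err
```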
Here %x
and %j
are variables set by Slurm.
%x
will be substituted with the job name that you specified.
%j
is the job ID.
This way you can relate each output file to a specific job.
E-Mail notifications
If you want to get notified when your jobs end, you can instruct slurm to send e-mails.
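```sh
# submit your job (my-job.sh is a placeholder) with notification on job end
sbatch --mail-type=END my-job.sh
```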
This will send a mail to the address used at registration.
You could specify a different address using --mail-user=<e-mail address>.
Valid mail types are:
Type | Description |
---|---|
NONE | default, no notification |
BEGIN | notification on job start |
FAIL | notification when your job fails |
END | notification on job end (includes FAIL) |
TIME_LIMIT | notification when your job has reached the time limit |
TIME_LIMIT_90 | notification when your job has reached 90 percent of the time limit |
TIME_LIMIT_80 | notification when your job has reached 80 percent of the time limit |
TIME_LIMIT_50 | notification when your job has reached 50 percent of the time limit |
ARRAY_TASK | send notifications for each array task |
Multiple mail types can be used at once, e.g.
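```sh
sbatch --mail-type=END,TIME_LIMIT_90 my-job.sh
```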
You can also specify the mail options in your job script using
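```sh
#SBATCH --mail-type=END
#SBATCH --mail-user=<e-mail address>
```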