SLURM Scheduler
HiPerGator uses Simple Linux Utility for Resource Management (SLURM) to allocate resources and schedule jobs. Below are sample SLURM scripts and guidelines for building your own batch jobs.
Sample SLURM Scripts
Basic, Single-Threaded Job
This script is a template for single-processor applications. Use the --mem-per-cpu flag (or --mem, which sets the total memory for the job, as in the example below) to request the appropriate amount of memory for your job. Test your application to set a reasonable memory value based on actual usage. The %j in the --output line substitutes the job ID into the output file name. You can also add a -e or --error line with a filename to separate output and error logs.
#!/bin/bash
#SBATCH --job-name=serial_job_test # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu # Where to send mail
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --mem=1gb # Job memory request
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.log # Standard output and error log
pwd; hostname; date
module load python
echo "Running plot script on a single CPU core"
python /data/training/SLURM/plot_template.py
date
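To keep output and error logs separate, the --output line in the script above could be replaced with a pair of directives like the following (the filenames here are arbitrary):
#SBATCH --output=serial_test_%j.out # Standard output log
#SBATCH --error=serial_test_%j.err # Error log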
Multi-Threaded SMP Job
This script serves as a template for applications that use multiple processors on a single server or physical computer. Such applications, known as threaded, OpenMP, PTHREADS, or shared memory applications, rely on shared memory and therefore require all of their processors to be on the same node. Keep in mind:
- Set --ntasks=1 and --cpus-per-task to the number of threads you want to use.
- Inform the application about the number of processors:
  - For some applications, set OMP_NUM_THREADS to match --cpus-per-task (see the snippet after this list).
  - For others, use a command line option.
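For an OpenMP application, this typically means exporting the thread count from the value SLURM sets inside the job, for example (a minimal sketch; SLURM_CPUS_PER_TASK is defined by SLURM whenever --cpus-per-task is requested):
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # match the thread count to the allocated cores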
#!/bin/bash
#SBATCH --job-name=parallel_job # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu # Where to send mail
#SBATCH --nodes=1 # Run all processes on a single node
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --mem=1gb # Job memory request
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.log # Standard output and error log
pwd; hostname; date
echo "Running prime number generator program on $SLURM_CPUS_ON_NODE CPU cores"
/data/training/SLURM/prime/prime
date
If you run multi-processing code, for example code that uses Python’s multiprocessing module, make sure to request a single node and set --ntasks to the number of processes your code will use, as in:
#!/bin/bash
#SBATCH --job-name=parallel_job_test # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu # Where to send mail
#SBATCH --nodes=1 # Run all processes on a single node
#SBATCH --ntasks=4 # Number of processes
#SBATCH --mem=1gb # Total memory limit
#SBATCH --time=01:00:00 # Time limit hrs:min:sec
#SBATCH --output=multiprocess_%j.log # Standard output and error log
date;hostname;pwd
module load python/3
python script.py
date
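Inside the job, SLURM exports the requested task count as SLURM_NTASKS, so the process count does not have to be hard-coded. As a sketch, assuming script.py is written to take the worker count as its first argument (a hypothetical convention), the launch lines could look like:
echo "Starting $SLURM_NTASKS worker processes" # SLURM_NTASKS matches --ntasks
python script.py "$SLURM_NTASKS" # assumes script.py reads the count from its first argument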
Array Job
To submit multiple identical jobs without an external script, use SLURM’s array jobs feature. Note: There is a maximum limit of 3000 jobs per user on HiPerGator.
For example, to run 5 tasks in an array, use the following script. In the output filename, %A represents the master job ID of the array and %a represents the task ID:
#!/bin/bash
#SBATCH --job-name=array_job_test # Job name
#SBATCH --mail-type=FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu # Where to send mail
#SBATCH --ntasks=1 # Run a single task
#SBATCH --mem=1gb # Job Memory
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=array_%A-%a.log # Standard output and error log
#SBATCH --array=1-5 # Array range
pwd; hostname; date
echo This is task $SLURM_ARRAY_TASK_ID
date
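A common pattern is to use the task ID to select a different input for each array task, for example (a sketch; the input file naming and my_program are placeholders):
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat # e.g. input_1.dat through input_5.dat
./my_program "$INPUT" # my_program stands in for your executable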
GPU Job
Please see GPU Access for more information regarding the use of HiPerGator GPUs. Note that the order in which the environment modules are loaded is important.
VASP
#!/bin/bash
#SBATCH --job-name=vasptest
#SBATCH --output=vasp.out
#SBATCH --error=vasp.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@ufl.edu
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=8
#SBATCH --distribution=cyclic:cyclic
#SBATCH --mem-per-cpu=7000mb
#SBATCH --partition=gpu
#SBATCH --gpus=a100:4
#SBATCH --time=00:30:00
module purge
module load cuda/10.0.130 intel/2018 openmpi/4.0.0 vasp/5.4.4
srun --mpi=pmix_v3 vasp_gpu
NAMD
#!/bin/bash
#SBATCH --job-name=stmv
#SBATCH --output=std.out
#SBATCH --error=std.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=4
#SBATCH --distribution=block:block
#SBATCH --time=30:00:00
#SBATCH --mem-per-cpu=1gb
#SBATCH --mail-type=NONE
#SBATCH --mail-user=some_user@ufl.edu
#SBATCH --partition=gpu
#SBATCH --gpus=a100:2
module load cuda/11.0.207 intel/2020.0.166 namd/2.14b2
echo "NAMD2 = $(which namd2)"
echo "SBATCH_CPU_BIND_LIST = $SBATCH_CPU_BIND_LIST"
echo "SBATCH_CPU_BIND = $SBATCH_CPU_BIND "
echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK "
gpuList=$(echo $CUDA_VISIBLE_DEVICES | sed -e 's/,/ /g')
N=0
devList=""
for gpu in $gpuList
do
devList="$devList $N"
N=$(($N + 1))
done
devList=$(echo $devList | sed -e 's/ /,/g')
echo "devList = $devList"
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll +devices $devList stmv.namd
SLURM Commands
Below are some useful SLURM commands. For more detailed information, see the SLURM Usage Guide.
Submit Jobs
sbatch script
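For example, to submit the single-threaded script above after saving it as serial_job.sh (the filename is arbitrary):
sbatch serial_job.sh
sbatch prints the ID of the submitted job, which can then be used with the scancel and sacct commands below.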
Interactive Session
An interactive SLURM session, i.e. a shell prompt within a running job, can be started with:
srun <resources> --pty bash -i
For example, a single-node, 2-CPU-core job with 2 GB of RAM for 90 minutes can be started with:
srun --ntasks=1 --cpus-per-task=2 --mem=2gb -t 90 --pty bash -i
Canceling Jobs
scancel jobID
or, to cancel multiple jobs whose names match a wildcard pattern:
scancel pattern
Checking Job Status
The basic command for checking job information is sacct. While full documentation is available on the SLURM webpage, here are some useful examples and templates for customization.
By default, sacct shows jobs in the queue or running since midnight of the current day. To view jobs from a specific date, use the start time (-S or --starttime) option. For example, to see jobs since May 1st (0501):
sacct -S 0501
To view specific information, specify the desired columns. For example, to see the number of CPUs, peak memory use (MaxRSS), and wall time of all jobs since May 1st:
sacct -S 0501 -o JobIDRaw,JobName,NCPUS,MaxRSS,Elapsed
And for the whole group:
sacct -S 0501 -o JobIDRaw,JobName,User,NCPUS,MaxRSS,Elapsed -a -A group_name
To view memory use of jobs:
sacct --format=User,JobID,ReqMem,MaxRss
The above commands retrieve information about completed jobs from the SLURM database. To check currently running jobs, use the sstat command. For example:
sstat -j 123456.batch -o maxrss
For more details, see the sstat manual page.