Kamiak User's Guide

SLURM

Introduction to Job Scheduling and Resource Allocation

For those who are new to cluster computing and resource management, let’s begin with an explanation of what a job scheduler and resource manager is and why it is necessary. Suppose you have a piece of C code that you would like to compile and execute, for example a helloworld program.

#include <stdio.h>
int main(){
 printf("Hello World\n");
 return 0;
}

On your desktop you would open a terminal, compile the code using your favorite C compiler, and execute the program. You can do this without worry because you are the only person using your computer, and you know what demands are being made on your CPU and memory at the time you run your code. On a cluster, many users must share the available resources equitably and simultaneously. It’s the job of the resource manager to choreograph this sharing by accepting a description of your program and the resources it requires, searching the available hardware for resources that meet your requirements, and ensuring no one else is given those resources while you are using them.
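
For reference, the desktop compile-and-run step mentioned above might look something like this (assuming gcc is your compiler and the source is saved as hello.c; substitute your compiler of choice):

gcc -o hello hello.c    # compile hello.c into an executable named hello
./hello                 # run it and print the greeting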

Occasionally there will not be enough resources available to meet your request. In those instances your job will be “queued”; that is, the manager will wait until the needed resources become available before running your job. This will also occur if the total resources you have requested across all your jobs exceed the limits set by the cluster administrator. This ensures that all users have equal access to the cluster.

The take-home point is this: in a cluster environment a user submits jobs to a resource manager, which in turn runs one or more executables for the user. So how do you submit a job request to the resource manager? Job requests take the form of scripts, called job scripts. These scripts contain script directives, which tell the resource manager what resources the executable requires. The user then submits the job script to the scheduler.

The syntax of these script directives is manager/scheduler specific. For the SLURM job scheduler and resource manager, all script directives begin with “#SBATCH”.

Hello World Example

Let’s look at a basic SLURM script requesting one node and one core on which to run our helloworld program.

#!/bin/bash
#SBATCH --partition=test        ### Partition (like a queue in PBS)
#SBATCH --job-name=HiWorld      ### Job Name
#SBATCH --output=Hi.out         ### File in which to store job output
#SBATCH --error=Hi.err          ### File in which to store job error messages
#SBATCH --time=0-00:01:00       ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1               ### Node count required for the job
#SBATCH --ntasks-per-node=1     ### Number of tasks to be launched per Node
./hello

Notice that the SLURM script begins with #!/bin/bash. This tells the Linux shell which interpreter should run the script. In this example we use bash (the Bourne Again SHell). The choice of interpreter (and subsequent syntax) is up to the user, but every SLURM script should begin with an interpreter directive. This is followed by a collection of #SBATCH script directives telling the manager about the resources needed by our job and where to put the code’s output. Lastly, we run the desired executable (note: this script assumes it is located in the same directory as the executable).

With our SLURM script complete, we’re ready to run our program on the cluster. To submit our script to SLURM, we invoke the sbatch command. Suppose we saved our script in the file helloworld.srun (the extension is not important). Then our submission would look like:

[mywsu.NID@login-p1n01 ~]$ sbatch helloworld.srun
Submitted batch job 10
[mywsu.NID@login-p1n01 ~]$

Our job was successfully submitted and was assigned the job identifier 10. We can check the output of our job by examining the contents of our output and error files. Referring back to the helloworld.srun SLURM script, notice the lines

#SBATCH --output=Hi.out ### File in which to store job output
#SBATCH --error=Hi.err  ### File in which to store job error messages

These specify files in which to store the output written to standard out and standard error, respectively. If our code ran without issue, then the Hi.err file should be empty and the Hi.out file should contain our greeting.

[mywsu.NID@login-p1n01 ~]$ cat Hi.err 
[mywsu.NID@login-p1n01 ~]$ cat Hi.out 
Hello World
[mywsu.NID@login-p1n01 ~]$

There are two more commands we should familiarize ourselves with before we begin submitting jobs. The first is the squeue command. This shows us a list of jobs we have submitted to the queue. The second is the scancel command. This allows us to terminate a job that is either queued or running. To see these commands in action, let’s simulate a one hour job by using the sleep command at the end of a new submission script.

#!/bin/bash
#SBATCH --job-name=OneHourJob  ### Job Name
#SBATCH --time=0-01:00:00      ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1              ### Node count required for the job
#SBATCH --ntasks-per-node=1    ### Number of tasks to be launched per Node
./hello

sleep 3600

Notice that we’ve omitted some of the script directives from our hello world submission script. When no output directives are given, SLURM will redirect the output of our executable to a file named slurm-<job ID number>.out. Let’s suppose that the above is stored in a file named sleep.srun and we submit our job using the sbatch command. Then we can check on the progress of our job using squeue, and we can cancel the job by executing scancel on the assigned job ID number.

[mywsu.NID@login-p1n01 ~]$ sbatch sleep.srun 
Submitted batch job 11
[mywsu.NID@login-p1n01 ~]$ squeue
 JOBID PARTITION    NAME      USER ST TIME NODES NODELIST(REASON)
    11      free OneHour mywsu.NID  R 0:03     1 cn32
[mywsu.NID@login-p1n01 ~]$ scancel 11
[mywsu.NID@login-p1n01 ~]$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[mywsu.NID@login-p1n01 ~]$

Notice the ST (state) column of the output of the squeue command. This tells us our job’s status. A status of R indicates our job is currently running, while a status of PD indicates a pending job, i.e. a job which is awaiting a resource allocation. For a full list of Job State Codes see the man page of squeue.
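
A few squeue options are also worth knowing (a quick sketch; see the man page for the full set):

squeue -u $USER    # show only your own jobs
squeue -j 11       # show a single job by its job ID
squeue -t PD       # show only pending jobs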

Congratulations, you are ready to begin running jobs on Kamiak!

Memory

The amount of memory on each node of Kamiak varies from node to node. On our standard compute nodes, you will find either 128GB, 256GB, or 512GB of DDR3 or DDR4 RAM. Unless you specify the amount of memory that your job requires, you will be given a memory allocation that is proportional to the fraction of cores you are using on that node. For example, if I request 10 cores (out of 20 cores) on a node that has 128GB of RAM, my job will be given half the total, or 64GB. If I request only a single core, I will be given 1/20th of the total memory, or 6.4GB of RAM. To decouple the amount of memory given to a job from the number of cores, one can use the --mem flag. For example, if I had a single-core job that required 256GB of RAM, my submission script would look like

#!/bin/bash
#SBATCH --job-name=MyJob      ### Job Name
#SBATCH --partition=test      ### Partition (like a queue in PBS)
#SBATCH --time=0-00:01:00     ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1             ### Node count required for the job
#SBATCH --ntasks-per-node=1   ### Number of tasks to be launched per Node
#SBATCH --mem=256000          ### Amount of memory in MB
 
my_ram_hungry_program

Notice that I must specify the amount of memory in megabytes.
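
If it is more natural to think in terms of memory per core rather than memory per node, SLURM also provides a --mem-per-cpu directive (the two flags are mutually exclusive, so use one or the other). A minimal sketch:

#SBATCH --mem-per-cpu=8000    ### Amount of memory in MB per allocated core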

OpenMP Jobs

When running OpenMP (OMP) jobs on Kamiak, it’s necessary to set your environment variables to reflect the resources you’ve requested. Specifically, you must export the variable OMP_NUM_THREADS so that its value matches the number of cores you have requested from SLURM. This can be accomplished through the use of the environment variables that SLURM exports into the job.

#!/bin/bash
#SBATCH --partition=test    ### Partition
#SBATCH --job-name=HelloOMP ### Job Name
#SBATCH --time=00:10:00     ### WallTime
#SBATCH --nodes=1           ### Number of Nodes
#SBATCH --ntasks-per-node=1 ### Number of tasks (MPI processes)
#SBATCH --cpus-per-task=20  ### Number of threads per task (OMP threads)


export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./hello_omp

In the script above we request 20 cores on one node of Kamiak (which is all the cores available on a standard compute node). As SLURM regards tasks as being analogous to MPI processes, it’s better to use the cpus-per-task directive when employing OpenMP parallelism. Additionally, the SLURM environment variable $SLURM_CPUS_PER_TASK stores whatever value we assign to cpus-per-task, and is therefore the natural value to pass to OMP_NUM_THREADS.
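
If you are building the OpenMP example yourself, a typical compile line (assuming gcc and a source file named hello_omp.c) looks like:

gcc -fopenmp -o hello_omp hello_omp.c    # -fopenmp enables OpenMP support in gcc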

MPI Jobs

Now let’s look at how to run an MPI-based job across multiple nodes. Suppose we would like to run an MPI-based executable named hello_mpi, and that we wish to run it using a total of 80 MPI processes. On Kamiak, standard compute nodes are equipped with two 10-core processors. A natural way of breaking up our problem would be to run it on four nodes using 20 processes per node.

#!/bin/bash
#SBATCH --partition=test     ### Partition
#SBATCH --job-name=HelloMPI  ### Job Name
#SBATCH --time=00:10:00      ### WallTime
#SBATCH --nodes=4            ### Number of Nodes
#SBATCH --ntasks-per-node=20 ### Number of tasks (MPI processes)

module load openmpi

srun ./hello_mpi

When submitting MPI jobs to SLURM, do not specify a machine file (host list); the MPI installations on Kamiak get that information automatically from SLURM. Additionally, the use of mpirun on Kamiak is not supported; srun should be used in its place.
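
If you need to build hello_mpi yourself, the compiler wrappers provided by the openmpi module take care of the MPI include paths and libraries; a minimal sketch, assuming the source is hello_mpi.c:

module load openmpi
mpicc -o hello_mpi hello_mpi.c    # mpicc wraps the C compiler with the MPI flags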


Interactive Jobs (idev)

When developing code, it’s often necessary to quickly compile, run, and validate code from the command line. While this is prohibited on the login nodes, we provide an app called idev (Interactive DEVelopment) for just such an occasion.

idev

The idev app creates an interactive development environment in the user’s terminal. In the idev environment, the user is logged directly into a compute node, where they can compile code, run executables, and otherwise behave as if they were on their own machine. When executed, the idev script submits a batch job to SLURM, connects to the first node assigned to the job, replicates the user’s batch environment, and then goes to sleep. Exiting from the compute node automatically terminates the batch job and returns the user to the login node.

Example

From one of the Kamiak login nodes run the idev command.

[mywsu.NID@login-p1n02 ~]$ idev

By default, idev submits a job requesting one core from the free partition for one hour. If a node is available, your job will become active and idev will initiate an ssh session on the compute node.

[mywsu.NID@login-p1n02 ~]$ idev
Requesting 1 node(s) from free partition
1 task(s)/node, 1 cpu(s)/task
Time: 0 (hr) 60 (min).
Submitted batch job 37
Job is pending. Please wait. 0(s)
JOBID=37 begin on cn32
--> Creating interactive terminal session (login) on node cn32.
--> You have 0 (hr) 60 (min).
--> Assigned Host List : /tmp/idev_nodes_file_mywsu.NID
Last login: Fri Feb 5 08:30:16 2016 from login-p1n02.mgmt.kamiak.wsu.edu
[mywsu.NID@cn32 ~]$

Notice the prompt [mywsu.NID@cn32 ~]$, indicating that we are logged into compute node 32.

Options

To see a list of options, run idev -h.

[mywsu.NID@cn14 ~]$ idev -h
-c|--cpus-per-task= : Cpus per Task
-N|--nodes= : Number of Nodes
-n|--ntasks-per-node= : Number of Tasks per Node
--mem= : Memory (MB) per Node
-t|--time= : Wall Time
--port= : port forward [local port]:[remote port]

Changing the idev options allows us to tailor the batch job parameters to fit our needs. Suppose I wanted an interactive batch job on two nodes of the CAS partition, with five tasks per node and four CPUs per task, and I wanted to reserve the resources for two hours:

[mywsu.NID@login-p1n02 ~]$ idev --nodes=2 --ntasks-per-node=5 --cpus-per-task=4 --partition=cas --time=2:00:00
Requesting 2 node(s) from cas partition
5 task(s)/node, 4 cpu(s)/task
Time: 02 (hr) 00 (min).
Submitted batch job 39
Job is pending. Please wait. 0(s)
JOBID=39 begin on cn14
--> Creating interactive terminal session (login) on node cn14.
--> You have 02 (hr) 00 (min).
--> Assigned Host List : /tmp/idev_nodes_file_mywsu.NID
Last login: Fri Feb 5 08:46:08 2016 from login-p1n02.mgmt.kamiak.wsu.edu
[mywsu.NID@cn14 ~]$


Jobs on Special Nodes (GPU, Xeon Phi)

Kamiak has several classes of specialty node: large memory, GPU, and Xeon Phi. Using the large memory nodes is as easy as specifying the correct partition, denoted by the tag bigmem. However, to use the special hardware found on the accelerator-equipped nodes one needs to make a small addition to their job script. Accelerator resources are tracked in Slurm through the use of Generic RESource, or gres, directives. The syntax is type:model:count or type:count. For example, if I wished to use a single GPU of any model, I would add the flag --gres=gpu:1

#!/bin/bash
#SBATCH --job-name=GPUjob     ### Job Name
#SBATCH --partition=free_gpu  ### Partition (like a queue in PBS)
#SBATCH --time=2-00:00:00     ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1             ### Node count required for the job
#SBATCH --ntasks-per-node=1   ### Number of tasks to be launched per Node
#SBATCH --gres=gpu:1          ### General REServation of gpu:number of GPUs

module load cuda

my_executable $SLURM_JOB_GPUS

In this example, the program “my_executable” expects the GPU ordinal as an input. I can use the variable SLURM_JOB_GPUS to pass that information from SLURM without knowing a priori which GPU I will run on. SLURM_JOB_GPUS is a list of the ordinal indexes of the GPUs assigned to my job by Slurm. With the request of a single GPU, this variable will store a single number between 0 and one less than the number of GPUs on the node. If I wanted to use two GPUs, I would change gres=gpu:1 to gres=gpu:2, and then SLURM_JOB_GPUS would store a list of the form 0,1 (for example).
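
For example, a two-GPU request and a quick check of what was assigned might look like the following sketch (the echo is purely illustrative):

#SBATCH --gres=gpu:2                              ### In the script header: request two GPUs of any model
echo "Assigned GPU ordinal(s): $SLURM_JOB_GPUS"   # In the script body: prints something like 0,1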

See the Queue List to determine what Generic RESources exist in each partition.

Job Arrays

Job arrays can be used to submit multiple related jobs as a single job. More complete information can be found in Slurm’s job array documentation. A simple example submission script for an array job is:

#!/bin/bash
#SBATCH --partition=test       ### Partition
#SBATCH --job-name=HelloArray  ### Job Name
#SBATCH --time=00:10:00        ### WallTime
#SBATCH --nodes=1              ### Number of Nodes
#SBATCH --ntasks=1             ### Number of tasks per array job
#SBATCH --array=0-19           ### Array index

echo "I am Slurm job ${SLURM_JOB_ID}, array job ${SLURM_ARRAY_JOB_ID}, and array task ${SLURM_ARRAY_TASK_ID}."

This will submit a single job which splits into an array of 20 jobs, each with 1 CPU core allocated.  Users who would otherwise submit a large number of individual jobs are encouraged to utilize job arrays.
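
A common use of job arrays is to have each array task process its own input. A minimal sketch, assuming hypothetical input files named input_0.txt through input_19.txt and a program called my_program:

./my_program input_${SLURM_ARRAY_TASK_ID}.txt    # array task 0 reads input_0.txt, task 1 reads input_1.txt, and so on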

The Kamiak Partition (backfill)

Kamiak was built on the principles of an enhanced condominium cluster. A key component of that model is the ability to utilize idle compute resources without impeding the investors’ ability to access the resources they have purchased. To that end, we have deployed a special backfill, or scavenge, partition which we call the kamiak partition.

Jobs that are submitted to the kamiak partition will run on any idle hardware that meets the requirements of the job, regardless of the submitting user’s affiliations. However, these “backfill jobs” may be preempted at any time by jobs belonging to the owner of a resource being used by the backfill job. When a backfill job is preempted, it will be returned to the bottom of the queue.

For example, let’s say that I submit a job to the kamiak partition and SLURM assigns me compute node number 1. That node is owned by the cahnrs partition. If a job is submitted to the cahnrs partition by a user with a CAHNRS affiliation, and that job requires the use of cn1, my job will be preempted and returned to the queue. However, if there are enough idle resources in the cahnrs partition to accommodate the CAHNRS affiliate job, my backfill job will not be preempted.

We have set the limits on the kamiak partition to 6 nodes for 1 week; however, it is inadvisable to request that maximum amount, as your job will almost certainly be preempted. Currently all users have the same priority in the backfill queue, i.e. jobs are run on a FIFO basis. There are plans to implement tiered priorities under the guidance of the Kamiak User Group Executive Committee, and we will update you as we make progress.

Automatic Cleanup on Job Exit

It is possible to have your job perform “cleanup” tasks when the job ends for any reason.  Here is a simple job which creates a scratch workspace during the job and removes it when the job ends:

#!/bin/bash
#SBATCH -n 1 # Number of cores
#SBATCH -t 0-00:01 # Runtime in D-HH:MM
#SBATCH --job-name=trap_and_cleanup

echo "Starting trap_and_cleanup on host $HOSTNAME"

my_workspace=$(mkworkspace --backend=/local -q)

function clean_up {
    # Clean up. Remove temporary workspaces and the like.
    rmworkspace --autoremove --force --name=$my_workspace

    exit
}

# Call our clean_up function when we exit; even if SLURM cancels the job, this should still run
trap 'clean_up' EXIT

# Work happens here ...
echo "My current workspace is $my_workspace"

echo "Completed trap_and_cleanup on host $HOSTNAME"

In that example we create a function named clean_up and trap the EXIT signal to run that function when the shell exits. The “EXIT” signal is specific to the shell “bash” and should capture any situation where the shell exits normally (i.e. does not crash and is not forcibly terminated by the system). This includes when the job exits normally after a successful run, after a job fails, or after a job is preempted by another job (canceled and resubmitted).

Users who need to perform specific work in the event of a job being preempted by another job can do so by trapping the signal TERM. When a job is preempted, Slurm will send the job’s processes a TERM signal. This normally causes the shell to exit, but if one creates a function similar to the example above and calls trap 'my_other_function' TERM after defining it, the job script will attempt to run that function before the job exits. Any work done in such a function must be kept to a minimum, because Slurm will wait only seconds between sending a SIGTERM and forcibly killing the job (with SIGKILL, which cannot be trapped and handled within a job).
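
As a sketch, a minimal preemption handler might look like the following (the checkpoint file and destination are hypothetical; substitute whatever brief state-saving your application needs):

function save_state {
    # Hypothetical: copy a small checkpoint file somewhere safe before the job is killed
    cp $my_workspace/checkpoint.dat $HOME/checkpoints/
    exit
}
# Run save_state if Slurm preempts the job (i.e. sends SIGTERM)
trap 'save_state' TERM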

How to Request Specific Features or Hardware

There are several generations of processors on Kamiak, including Ivy Bridge, Haswell, and Broadwell.  If you wish to request a specific architecture, say for benchmarking or to take advantage of the larger instruction set on newer processors, you can do so by using a feature tag in your SLURM script.  For example, suppose I’ve compiled my code to use the AVX2 instruction set, which is present on Haswell and later processors but not on Ivy Bridge.  I would add a “constraint” to my SLURM script:

#!/bin/bash
#SBATCH --job-name=HaswellJob      ### Job Name
#SBATCH --partition=kamiak         ### Partition (like a queue in PBS)
#SBATCH --time=0-00:01:00          ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1                  ### Node count required for the job
#SBATCH --ntasks-per-node=1        ### Number of tasks to be launched per Node
#SBATCH --constraint=haswell       ### Only run on nodes with Haswell generation CPUs

./my_AVX2_executable

The list of available constraints can be found in the Queue List.
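
You can also inspect node features from the command line with sinfo; for example (a sketch, and the format string can be adjusted to taste):

sinfo -o "%20N %10c %10m %30f"    # node list, CPUs, memory (MB), and feature tags for each group of nodes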