Kamiak User's Guide

SLURM

Introduction to Job Scheduling and Resource Allocation

For those who are new to cluster computing and resource management, let’s begin with an explanation of what a job scheduler and resource manager is and why it is necessary. Suppose you have a piece of C code that you would like to compile and execute, for example a helloworld program.

#include <stdio.h>
int main(){
 printf("Hello World\n");
 return 0;
}

On your desktop you would open a terminal, compile the code using your favorite C compiler, and execute the program. You can do this without worry because you are the only person using your computer, and you know what demands are being made on your CPU and memory at the time you run your code. On a cluster, many users must share the available resources equitably and simultaneously. It’s the job of the resource manager to choreograph this sharing by accepting a description of your program and the resources it requires, searching the available hardware for resources that meet your requirements, and ensuring no one else is given those resources while you are using them.
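
For reference, the desktop compile-and-run step mentioned above might look something like this (assuming gcc is your compiler and the source is saved as hello.c; substitute your compiler of choice):

gcc -o hello hello.c    # compile hello.c into an executable named hello
./hello                 # run it and print the greeting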

Occasionally there will not be enough resources available to meet your request. In those instances your job will be “queued”; that is, the manager will wait until the needed resources become available before running your job. This will also occur if the total resources you have requested across all your jobs exceed the limits set by the cluster administrator. This ensures that all users have equal access to the cluster.

The take-home point is this: in a cluster environment a user submits jobs to a resource manager, which in turn runs one or more executables for the user. So how do you submit a job request to the resource manager? Job requests take the form of scripts, called job scripts. These scripts contain script directives, which tell the resource manager what resources the executable requires. The user then submits the job script to the scheduler.

The syntax of these script directives is manager/scheduler specific. For the SLURM job scheduler and resource manager, all script directives begin with “#SBATCH”.

Hello World Example

Let’s look at a basic SLURM script requesting one node and one core on which to run our helloworld program.

#!/bin/bash
#SBATCH --partition=test        ### Partition (like a queue in PBS)
#SBATCH --job-name=HiWorld      ### Job Name
#SBATCH --output=Hi.out         ### File in which to store job output
#SBATCH --error=Hi.err          ### File in which to store job error messages
#SBATCH --time=0-00:01:00       ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1               ### Node count required for the job
#SBATCH --ntasks-per-node=1     ### Number of tasks to be launched per Node
./hello

Notice that the SLURM script begins with #!/bin/bash. This tells the Linux shell which interpreter should run the script. In this example we use bash (the Bourne Again SHell). The choice of interpreter (and subsequent syntax) is up to the user, but every SLURM script should begin with an interpreter directive. This is followed by a collection of #SBATCH script directives telling the manager about the resources needed by our job and where to put the code’s output. Lastly, we run the desired executable (note: this script assumes it is located in the same directory as the executable).

With our SLURM script complete, we’re ready to run our program on the cluster. To submit our script to SLURM, we invoke the sbatch command. Suppose we saved our script in the file helloworld.srun (the extension is not important). Then our submission would look like:

[mywsu.NID@login-p1n01 ~]$ sbatch helloworld.srun
Submitted batch job 10
[mywsu.NID@login-p1n01 ~]$

Our job was successfully submitted and was assigned the job identifier 10. We can check the output of our job by examining the contents of our output and error files. Referring back to the helloworld.srun SLURM script, notice the lines

#SBATCH --output=Hi.out ### File in which to store job output
#SBATCH --error=Hi.err  ### File in which to store job error messages

These specify files in which to store the output written to standard out and standard error, respectively. If our code ran without issue, then the Hi.err file should be empty and the Hi.out file should contain our greeting.

[mywsu.NID@login-p1n01 ~]$ cat Hi.err 
[mywsu.NID@login-p1n01 ~]$ cat Hi.out 
Hello World
[mywsu.NID@login-p1n01 ~]$

There are two more commands we should familiarize ourselves with before we begin submitting jobs. The first is the squeue command. This shows us a list of jobs we have submitted to the queue. The second is the scancel command. This allows us to terminate a job that is either queued or running. To see these commands in action, let’s simulate a one hour job by using the sleep command at the end of a new submission script.

#!/bin/bash
#SBATCH --job-name=OneHourJob  ### Job Name
#SBATCH --time=0-01:00:00      ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1              ### Node count required for the job
#SBATCH --ntasks-per-node=1    ### Number of tasks to be launched per Node
./hello

sleep 3600

Notice that we’ve omitted some of the script directives from our hello world submission script. When no output directives are given, SLURM will redirect the output of our executable to a file named slurm-<job ID number>.out. Let’s suppose that the above is stored in a file named sleep.srun and we submit our job using the sbatch command. Then we can check on the progress of our job using squeue, and we can cancel the job by executing scancel on the assigned job ID number.

[mywsu.NID@login-p1n01 ~]$ sbatch sleep.srun 
Submitted batch job 11
[mywsu.NID@login-p1n01 ~]$ squeue
 JOBID PARTITION    NAME      USER ST TIME NODES NODELIST(REASON)
    11      free OneHour mywsu.NID  R 0:03     1 cn32
[mywsu.NID@login-p1n01 ~]$ scancel 11
[mywsu.NID@login-p1n01 ~]$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[mywsu.NID@login-p1n01 ~]$

Notice the ST (state) column of the output of the squeue command. This tells us our job’s status. A status of R indicates our job is currently running, while a status of PD indicates a pending job, i.e. a job which is awaiting a resource allocation. For a full list of Job State Codes see the man page of squeue.
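
A few squeue options are also worth knowing (a quick sketch; see the man page for the full set):

squeue -u $USER    # show only your own jobs
squeue -j 11       # show a single job by its job ID
squeue -t PD       # show only pending jobs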

Congratulations, you are ready to begin running jobs on Kamiak!

Memory

The amount of memory on each node of Kamiak varies from node to node. On our standard compute nodes, you will find either 128GB, 256GB, or 512GB of DDR3 or DDR4 RAM. Unless you specify the amount of memory that your job requires, you will be given a memory allocation that is proportional to the fraction of cores you are using on that node. For example, if I request 10 cores (out of 20 cores) on a node that has 128GB of RAM, my job will be given half the total, or 64GB. If I request only a single core, I will be given 1/20th of the total memory, or 6.4GB of RAM. To decouple the amount of memory given to a job from the number of cores, one can use the --mem flag. For example, if I had a single-core job that required 256GB of RAM, my submission script would look like

#!/bin/bash
#SBATCH --job-name=MyJob      ### Job Name
#SBATCH --partition=test      ### Partition (like a queue in PBS)
#SBATCH --time=0-00:01:00     ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1             ### Node count required for the job
#SBATCH --ntasks-per-node=1   ### Number of tasks to be launched per Node
#SBATCH --mem=256000          ### Amount of memory in MB
 
my_ram_hungry_program

Notice that I must specify the amount of memory in megabytes.
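
If it is more natural to think in terms of memory per core rather than memory per node, SLURM also provides a --mem-per-cpu directive (the two flags are mutually exclusive, so use one or the other). A minimal sketch:

#SBATCH --mem-per-cpu=8000    ### Amount of memory in MB per allocated core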

OpenMP Jobs

When running OpenMP (OMP) jobs on Kamiak, it’s necessary to set your environment variables to reflect the resources you’ve requested. Specifically, you must export the variable OMP_NUM_THREADS so that its value matches the number of cores you have requested from SLURM. This can be accomplished through the use of the environment variables that SLURM exports into the job.

#!/bin/bash
#SBATCH --partition=test    ### Partition
#SBATCH --job-name=HelloOMP ### Job Name
#SBATCH --time=00:10:00     ### WallTime
#SBATCH --nodes=1           ### Number of Nodes
#SBATCH --ntasks-per-node=1 ### Number of tasks (MPI processes)
#SBATCH --cpus-per-task=20  ### Number of threads per task (OMP threads)


export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./hello_omp

In the script above we request 20 cores on one node of Kamiak (which is all the cores available on a standard compute node). As SLURM regards tasks as being analogous to MPI processes, it’s better to use the cpus-per-task directive when employing OpenMP parallelism. Additionally, the SLURM environment variable $SLURM_CPUS_PER_TASK stores whatever value we assign to cpus-per-task, and is therefore the natural value to pass to OMP_NUM_THREADS.
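
If you are building the OpenMP example yourself, a typical compile line (assuming gcc and a source file named hello_omp.c) looks like:

gcc -fopenmp -o hello_omp hello_omp.c    # -fopenmp enables OpenMP support in gcc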

MPI Jobs

Now let’s look at how to run an MPI-based job across multiple nodes. Suppose we would like to run an MPI-based executable named hello_mpi, and that we wish to run it using a total of 80 MPI processes. On Kamiak, standard compute nodes are equipped with two 10-core processors. A natural way of breaking up our problem would be to run it on four nodes using 20 processes per node.

#!/bin/bash
#SBATCH --partition=test     ### Partition
#SBATCH --job-name=HelloMPI  ### Job Name
#SBATCH --time=00:10:00      ### WallTime
#SBATCH --nodes=4            ### Number of Nodes
#SBATCH --ntasks-per-node=20 ### Number of tasks (MPI processes)

module load openmpi

srun ./hello_mpi

When submitting MPI jobs to SLURM, do not specify a machine file (host list); the MPI installations on Kamiak get that information automatically from SLURM. Additionally, the use of mpirun on Kamiak is not supported; srun should be used in its place.
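
If you need to build hello_mpi yourself, the compiler wrappers provided by the openmpi module take care of the MPI include paths and libraries; a minimal sketch, assuming the source is hello_mpi.c:

module load openmpi
mpicc -o hello_mpi hello_mpi.c    # mpicc wraps the C compiler with the MPI flags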


Interactive Jobs (idev)

When developing code, it’s often necessary to quickly compile, run, and validate code from the command line. While this is prohibited on the login nodes, we provide an app called idev (Interactive DEVelopment) for just such an occasion.

idev

The idev app creates an interactive development environment in the user’s terminal. In the idev environment, the user is logged directly into a compute node, where they can compile code, run executables, and otherwise behave as if they were on their own machine. When executed, the idev script submits a batch job to SLURM, connects to the first node assigned to the job, replicates the user’s batch environment, and then goes to sleep. Exiting from the compute node automatically terminates the batch job and returns the user to the login node.

Example

From one of the Kamiak login nodes run the idev command.

[mywsu.NID@login-p1n02 ~]$ idev

By default, idev submits a job requesting one core from the free partition for one hour. If a node is available, your job will become active and idev will initiate an ssh session on the compute node.

[mywsu.NID@login-p1n02 ~]$ idev
Requesting 1 node(s) from free partition
1 task(s)/node, 1 cpu(s)/task
Time: 0 (hr) 60 (min).
Submitted batch job 37
Job is pending. Please wait. 0(s)
JOBID=37 begin on cn32
--> Creating interactive terminal session (login) on node cn32.
--> You have 0 (hr) 60 (min).
--> Assigned Host List : /tmp/idev_nodes_file_mywsu.NID
Last login: Fri Feb 5 08:30:16 2016 from login-p1n02.mgmt.kamiak.wsu.edu
[mywsu.NID@cn32 ~]$

Notice the prompt [mywsu.NID@cn32 ~]$, indicating that we are logged into compute node 32.

Options

To see a list of options, run idev -h.

[mywsu.NID@cn14 ~]$ idev -h
-c|--cpus-per-task= : Cpus per Task
-N|--nodes= : Number of Nodes
-n|--ntasks-per-node= : Number of Tasks per Node
--mem= : Memory (MB) per Node
-t|--time= : Wall Time
--port= : port forward [local port]:[remote port]

Changing the idev options allows us to tailor the batch job parameters to fit our needs. Suppose I wanted an interactive batch job on two nodes of the CAS partition, with five tasks per node and four CPUs per task, and I wanted to reserve the resources for two hours:

[mywsu.NID@login-p1n02 ~]$ idev --nodes=2 --ntasks-per-node=5 --cpus-per-task=4 --partition=cas --time=2:00:00
Requesting 2 node(s) from cas partition
5 task(s)/node, 4 cpu(s)/task
Time: 02 (hr) 00 (min).
Submitted batch job 39
Job is pending. Please wait. 0(s)
JOBID=39 begin on cn14
--> Creating interactive terminal session (login) on node cn14.
--> You have 02 (hr) 00 (min).
--> Assigned Host List : /tmp/idev_nodes_file_mywsu.NID
Last login: Fri Feb 5 08:46:08 2016 from login-p1n02.mgmt.kamiak.wsu.edu
[mywsu.NID@cn14 ~]$


Jobs on Special Nodes (GPU, Xeon Phi)

Kamiak has several classes of specialty node: large memory, GPU, and Xeon Phi. Using the large memory nodes is as easy as specifying the correct partition, denoted by the tag bigmem. However, to use the special hardware found on the accelerator-equipped nodes one needs to make a small addition to their job script. Accelerator resources are tracked in Slurm through the use of Generic RESource, or gres, directives. The syntax is type:model:count or type:count. For example, if I wished to use a single GPU of any model, I would add the flag --gres=gpu:1

#!/bin/bash
#SBATCH --job-name=GPUjob     ### Job Name
#SBATCH --partition=free_gpu  ### Partition (like a queue in PBS)
#SBATCH --time=2-00:00:00     ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1             ### Node count required for the job
#SBATCH --ntasks-per-node=1   ### Number of tasks to be launched per Node
#SBATCH --gres=gpu:1          ### General REServation of gpu:number of GPUs

module load cuda

my_executable $SLURM_JOB_GPUS

In this example, the program “my_executable” expects the GPU ordinal as an input. I can use the variable SLURM_JOB_GPUS to pass that information from SLURM without knowing a priori which GPU I will run on. SLURM_JOB_GPUS is a list of the ordinal indexes of the GPUs assigned to my job by Slurm. With the request of a single GPU, this variable will store a single number between 0 and one less than the number of GPUs on the node. If I wanted to use two GPUs, I would change gres=gpu:1 to gres=gpu:2, and then SLURM_JOB_GPUS would store a list of the form 0,1 (for example).
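
For example, a two-GPU request and a quick check of what was assigned might look like the following sketch (the echo is purely illustrative):

#SBATCH --gres=gpu:2                              ### In the script header: request two GPUs of any model
echo "Assigned GPU ordinal(s): $SLURM_JOB_GPUS"   # In the script body: prints something like 0,1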

See the Queue List to determine what Generic RESources exist in each partition.

Job Arrays

Job arrays can be used to submit multiple related jobs as a single job. More complete information can be found in Slurm’s job array documentation. A simple example submission script for an array job is:

#!/bin/bash
#SBATCH --partition=test       ### Partition
#SBATCH --job-name=HelloArray  ### Job Name
#SBATCH --time=00:10:00        ### WallTime
#SBATCH --nodes=1              ### Number of Nodes
#SBATCH --ntasks=1             ### Number of tasks per array job
#SBATCH --array=0-19           ### Array index

echo "I am Slurm job ${SLURM_JOB_ID}, array job ${SLURM_ARRAY_JOB_ID}, and array task ${SLURM_ARRAY_TASK_ID}."

This will submit a single job which splits into an array of 20 jobs, each with 1 CPU core allocated.  Users who would otherwise submit a large number of individual jobs are encouraged to utilize job arrays.
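
A common use of job arrays is to have each array task process its own input. A minimal sketch, assuming hypothetical input files named input_0.txt through input_19.txt and a program called my_program:

./my_program input_${SLURM_ARRAY_TASK_ID}.txt    # array task 0 reads input_0.txt, task 1 reads input_1.txt, and so on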

The Kamiak Partition (backfill)

Kamiak was built on the principles of an enhanced condominium cluster. A key component of that model is the ability to utilize idle compute resources without impeding the investors’ ability to access the resources they have purchased. To that end, we have deployed a special backfill, or scavenge, partition which we call the kamiak partition.

Jobs that are submitted to the kamiak partition will run on any idle hardware that meets the requirements of the job, regardless of the submitting user’s affiliations. However, these “backfill jobs” may be preempted at any time by jobs belonging to the owner of a resource being used by the backfill job. When a backfill job is preempted, it will be returned to the bottom of the queue.

For example, let’s say that I submit a job to the kamiak partition and SLURM assigns me compute node number 1. That node is owned by the cahnrs partition. If a job is submitted to the cahnrs partition by a user with a CAHNRS affiliation, and that job requires the use of cn1, my job will be preempted and returned to the queue. However, if there are enough idle resources in the cahnrs partition to accommodate the CAHNRS affiliate job, my backfill job will not be preempted.

We have set the limits on the kamiak partition to 6 nodes for 1 week; however, it is inadvisable to request that maximum amount, as your job will almost certainly be preempted. Currently all users have the same priority in the backfill queue, i.e. jobs are run on a FIFO basis. There are plans to implement tiered priorities under the guidance of the Kamiak User Group Executive Committee, and we will update you as we make progress.

Automatic Cleanup on Job Exit

It is possible to have your job perform “cleanup” tasks when the job ends for any reason.  Here is a simple job which creates a scratch workspace during the job and removes it when the job ends:

#!/bin/bash
#SBATCH -n 1 # Number of cores
#SBATCH -t 0-00:01 # Runtime in D-HH:MM
#SBATCH --job-name=trap_and_cleanup

echo "Starting trap_and_cleanup on host $HOSTNAME"

my_workspace=$(mkworkspace --backend=/local -q)

function clean_up {
    # Clean up. Remove temporary workspaces and the like.
    rmworkspace --autoremove --force --name=$my_workspace

    exit
}

# Call our clean_up function when we exit; even if SLURM cancels the job, this should still run
trap 'clean_up' EXIT

# Work happens here ...
echo "My current workspace is $my_workspace"

echo "Completed trap_and_cleanup on host $HOSTNAME"

In that example we create a function named clean_up and trap the EXIT signal to run that function when the shell exits. The “EXIT” signal is specific to the shell “bash” and should capture any situation where the shell exits normally (i.e. does not crash and is not forcibly terminated by the system). This includes when the job exits normally after a successful run, after a job fails, or after a job is preempted by another job (canceled and resubmitted).

Users who need to perform specific work in the event of a job being preempted by another job can do so by trapping the signal TERM. When a job is preempted, Slurm will send the job’s processes a TERM signal. This normally causes the shell to exit, but if one creates a function similar to the example above and calls trap 'my_other_function' TERM after defining it, the job script will attempt to run that function before the job exits. Any work done in such a function must be kept to a minimum, because Slurm will wait only seconds between sending a SIGTERM and forcibly killing the job (with SIGKILL, which cannot be trapped and handled within a job).
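
As a sketch, a minimal preemption handler might look like the following (the checkpoint file and destination are hypothetical; substitute whatever brief state-saving your application needs):

function save_state {
    # Hypothetical: copy a small checkpoint file somewhere safe before the job is killed
    cp $my_workspace/checkpoint.dat $HOME/checkpoints/
    exit
}
# Run save_state if Slurm preempts the job (i.e. sends SIGTERM)
trap 'save_state' TERM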

How to Request Specific Features or Hardware

There are several generations of processors on Kamiak, including Ivy Bridge, Haswell, and Broadwell.  If you wish to request a specific architecture, say for benchmarking or to take advantage of the larger instruction set on newer processors, you can do so by using a feature tag in your SLURM script.  For example, suppose I’ve compiled my code to use the AVX2 instruction set, which is present on Haswell and later processors but not on Ivy Bridge.  I would add a “constraint” to my SLURM script:

#!/bin/bash
#SBATCH --job-name=HaswellJob      ### Job Name
#SBATCH --partition=kamiak         ### Partition (like a queue in PBS)
#SBATCH --time=0-00:01:00          ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1                  ### Node count required for the job
#SBATCH --ntasks-per-node=1        ### Number of tasks to be launched per Node
#SBATCH --constraint=haswell       ### Only run on nodes with Haswell generation CPUs

./my_AVX2_executable

The list of available constraints can be found in the Queue List.
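
You can also inspect node features from the command line with sinfo; for example (a sketch, and the format string can be adjusted to taste):

sinfo -o "%20N %10c %10m %30f"    # node list, CPUs, memory (MB), and feature tags for each group of nodes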