
This page describes how to submit jobs to the cluster.

Slurm configuration and job restrictions

The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here:

Table 1: Partition Restrictions

| Partition | Max # cores per user (running) | Time limit | Max CPU memory per job | Max cores per job |
| --- | --- | --- | --- | --- |
| high (default) | 96 | 7 days | 128 GB | 24[1] |
| low | 256 | 28 days | 256 GB | 32[1] |
| gpu[2] | 8 CPU cores | 28 days | 6 GB | 8 |
| epurdom[2] | 256 | 28 days[3] | 528 GB | 128[1] |
| jsteinhardt[2] | varied | 28 days[3] | 288 GB (smaug), 792 GB (balrog, rainbowquartz), 1 TB (saruman), 128 GB (various) | varied[1] |
| yugroup[2] | varied | 28 days[3] | varied | varied |
| yss[2] | 224 | 28 days[3] | 528 GB | varied[1] |

Single-core jobs

Prepare a shell script containing the instructions you would like the system to execute.

The instructions here are for the simple case of submitting a job without any parallelization; i.e., a job using a single core (CPU). When submitted using the instructions in this section, such jobs will have access to only a single core at a time. We also have extensive instructions for submitting parallelized jobs and automating the submission of multiple jobs.

For example, a simple script to run an R program called ‘simulate.R’ would contain these lines:

#!/bin/bash
R CMD BATCH --no-save simulate.R simulate.out

Once logged onto a submit host, navigate to a directory within your home or scratch directory (i.e., make sure your working directory is not in /tmp or /var/tmp) and use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:

$ sbatch job.sh
Submitted batch job 380

Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.

If you have many single-core jobs to run, there are various ways to automate submitting such jobs.
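
One common approach is a Slurm job array, which submits many near-identical tasks with a single sbatch call. The following is a minimal sketch; it assumes that simulate.R reads its task index via commandArgs() to select the inputs for that task, so adapt it to your own workflow.

#!/bin/bash
#SBATCH --job-name=simArray
#SBATCH --array=1-100                 # run tasks numbered 1 through 100
# Each task gets its own value of SLURM_ARRAY_TASK_ID, which is passed to R here
# so the script can pick out the corresponding input or parameter set.
R CMD BATCH --no-save "--args $SLURM_ARRAY_TASK_ID" simulate.R simulate-$SLURM_ARRAY_TASK_ID.Rout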

Note that Slurm is configured such that single-core jobs will have access to a single physical core (including both hyperthreads on our new machines), so there won’t be any contention between the two threads on a physical core. However, if you have many single-core jobs to run on the high, jsteinhardt, or yugroup partitions, you might improve your throughput by modifying your workflow so that you can run one job per hyperthread rather than one job per physical core. You could do this by taking advantage of parallelization strategies in R, Python, or MATLAB to distribute tasks across workers in a single job, or you could use GNU parallel or srun within sbatch, as sketched below.
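
For example, here is a minimal sketch using GNU parallel inside a job script; run_one.sh is a hypothetical script that runs a single task, taking the task number as its argument.

#!/bin/bash
#SBATCH --cpus-per-task=8             # request 8 hardware threads (on two-thread cores, 4 physical cores)
# Run one task per hardware thread; GNU parallel keeps 8 tasks running at once.
parallel -j "$SLURM_CPUS_PER_TASK" ./run_one.sh {} ::: {1..100}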

Slurm provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:

#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL                       
#SBATCH --mail-user=blah@berkeley.edu
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout

Parallel Jobs

One can use Slurm to submit parallel code of a variety of types.

High performance (CPU) partitions

High partition vs. low partition

Both of these partitions have quite old machines. While the machines in the high partition are faster than those in the low partition, both will generally be slower than the machines in the partitions for specific lab groups and (on a per-core basis) than your laptop, particularly an Apple Silicon Mac laptop.

epurdom partition

The epurdom partition has two nodes (frodo and samwise) with recent CPUs (128-core AMD EPYC) and 528 GB memory each.

You can request use of these nodes as follows:

arwen:~$ sbatch -p epurdom job.sh
Submitted batch job 380

Purdom group members have priority access to these nodes. If you are in the group, simply submit jobs to the epurdom partition; your jobs will automatically preempt jobs from users outside the group if that is needed for your job to run.

Non-group members can submit jobs as well, but jobs may be preempted (killed) without warning if group member jobs need the resources being used. Pre-emptible jobs are requeued when preemption happens and should restart when the needed resources become available. If you see that your job is not being requeued, please contact us.

If you need more than one CPU, please request that using the --cpus-per-task flag. The value you specify actually requests that number of hardware threads, but with the caveat that a given job is allocated all the threads on a given core to avoid contention between jobs for a given physical core.
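
For example, one could combine these flags as follows, using the same job.sh as above:

arwen:~$ sbatch -p epurdom --cpus-per-task=8 job.sh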

jsteinhardt partition

The jsteinhardt partition has various nodes. While these nodes are primarily intended for use for their GPUs, many of them have newer CPUs, a lot of memory, and very fast disk I/O to /tmp and /var/tmp using an NVMe SSD.

As with the epurdom partition, non-group members can submit jobs as well, but those jobs may be preempted (killed) without warning if group member jobs need the resources being used; preempted jobs are requeued and should restart when the needed resources become available.

For example, to request use of one of these nodes, which are labelled as manycore nodes:

arwen:~$ sbatch -p jsteinhardt -C manycore job.sh
Submitted batch job 380

You can also request specific resources, such as a particular node or a certain amount of memory.
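
For example (-w and --mem are standard Slurm flags; the node name and the 400 GB figure below are just illustrations, so check the table above for what each node offers):

arwen:~$ sbatch -p jsteinhardt -w balrog job.sh       # request the node named balrog
arwen:~$ sbatch -p jsteinhardt --mem=400G job.sh      # request 400 GB of CPU memory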

Also note that if you need more disk space, some (but not all) of these nodes have a much larger NVMe SSD, and we may be able to make space on it available if you request it.

Job not starting

The cluster is managed using the Slurm scheduling software. We configure Slurm to try to balance the needs of the various cluster users.

Often there will be enough available CPU cores (aka ‘resources’) on the partition, and your job will start immediately after you submit it.

However, there are various reasons a job may take a while to start. Here are some details of how the scheduler works.

The ‘fair share’ policy governs the order of jobs that are waiting in the queue for resources to become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.

Node maintenance and reservations

Periodically, we perform maintenance on cluster nodes, such as OS upgrades or hardware repairs. During these times, we place a reservation on the affected nodes in the Slurm scheduler.

When a reservation is active, you can still submit jobs as usual. The scheduler will automatically choose an available node for your job in the partition you specify. However, if you want to run the job on a specific node (via the -w flag) that has a reservation, and you want it to start before the maintenance window, you must specify a time limit that ensures your job will complete before the maintenance begins.

For example, if maintenance is scheduled in 48 hours:

$ sbatch -t 48:00:00 myjob.sh   # 48 hours
$ sbatch -t 1-6 myjob.sh        # 1 day, 6 hours

The scheduler will launch your job on nodes that can complete it in time, or on nodes without reservations. You can check the status of nodes using sinfo. During maintenance, affected nodes will have a status of maint.
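
For example (these are standard sinfo options):

$ sinfo -p high          # summary of node states in the high partition
$ sinfo -N -l            # one line per node, with more detail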

How to kill a job

First, find the job-id of the job, by typing squeue at the command line of a submit host (see How to Monitor Jobs).

Then use scancel to delete the job (with id 380 in this case):

scancel 380
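
To cancel all of your jobs at once, you can use the -u flag with your username (shown here as a placeholder):

scancel -u your_username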

Interactive jobs

You can work interactively on a node from the Linux shell command line by starting a job in the interactive queue.

The syntax for requesting an interactive (bash) shell session is:

srun --pty /bin/bash

This will start a shell on one of the four nodes. You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

By default, to limit forgotten sessions, the time limit for interactive jobs is set to 1 day (24 hours). If you need less or more time (up to the maximum time limits), you can use the -t (or --time) flag. For example, to run for two days:

srun -t 2-00:00:00 --pty /bin/bash

If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. So you could directly run MATLAB, e.g., as follows:

srun --pty --x11 matlab

or you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.
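
For example:

srun --pty --x11 /bin/bash    # start an interactive shell with X11 forwarding
matlab                        # then, inside that shell, start the graphical program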

To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):

srun --pty --cpus-per-task 4 /bin/bash

Note that -c is a shorthand for --cpus-per-task.

To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):

srun --pty -p high -w scf-sm20 /bin/bash

Note that if that specific node has all its cores in use by other users, you will need to wait until resources become available on that node before your interactive session will start.

Finally, you can request multiple cores using -c, as with batch jobs. You can also change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (the number of processes your code starts multiplied by the threads per process) does not exceed the number of cores you request, as in the example below.
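
For example, to start an interactive session with four cores and let a single threaded process use all of them:

srun --pty -c 4 /bin/bash     # interactive session with 4 cores
export OMP_NUM_THREADS=4      # inside the session, allow threaded code to use all 4 cores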

Footnotes
  1. If you use software that can parallelize across multiple nodes (e.g., R packages that use MPI or the future package, Python’s Dask or IPython Parallel, MATLAB, MPI), you can run individual jobs across more than one node. See Parallel Jobs.

  2. Preemptible when run at normal priority, as occurs for non-group members.