This page describes how to submit jobs to the cluster.

Slurm configuration and job restrictions

The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here:

Table 1: Partition Restrictions

Partition        Max cores per     Max cores   Max CPU memory   Time limit    Preemptible[1]
                 user (running)    per job     per job
high (default)   96                24          128 GB           7 days        No
low              256               32          256 GB           28 days       No
gpu              8                 8           6 GB             28 days       No
berkeleynlp      384               128         1.5 TB           28 days[2]    Yes
epurdom          256               128         528 GB           28 days       Yes
jsteinhardt      varied            varied      varied[3]        28 days       Yes
yugroup          varied            varied      varied           28 days       Yes
yss              224               varied      528 GB           28 days       Yes
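These limits are enforced by the scheduler. If you would like to check the current configuration of a partition from a submit host, you can query Slurm directly; here is a minimal sketch using standard Slurm commands, with the high partition as the example:

$ sinfo -p high -o "%P %l %D %c"   # partition, time limit, node count, CPUs per node
$ scontrol show partition high     # full partition configuration, including limits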

Single-core jobs

Prepare a shell script containing the instructions you would like the system to execute.

The instructions here are for the simple case of submitting a job without any parallelization; i.e., a job using a single core (CPU). When submitted using the instructions in this section, such jobs will have access to only a single core at a time. We also have extensive instructions for submitting parallelized jobs and automating the submission of multiple jobs.

For example a simple script to run an R program called ‘simulate.R’ would contain these lines:

#!/bin/bash
R CMD BATCH --no-save simulate.R simulate.out

Once logged onto a submit host, navigate to a directory within your home or scratch directory (i.e., make sure your working directory is not in /tmp or /var/tmp) and use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:

$ sbatch job.sh
Submitted batch job 380

Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.
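You can check on the job after submitting it by querying the scheduler (a brief sketch; the job ID 380 comes from the output above):

$ squeue -j 380       # state of job 380 (PD = pending, R = running)
$ squeue -u $USER     # all of your jobs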

If you have many single-core jobs to run, there are various ways to automate submitting such jobs.
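One simple approach is a shell loop over sbatch; here is a minimal sketch, assuming hypothetical input files named sim_1.R, sim_2.R, and so on:

for f in sim_*.R; do
    # submit one single-core job per input file, naming each job after its file
    sbatch --job-name="${f%.R}" --wrap="R CMD BATCH --no-save $f ${f%.R}.out"
done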

Note that Slurm is configured such that single-core jobs will have access to a single physical core (including both hyperthreads on our new machines), so there won’t be any contention between the two threads on a physical core. However, if you have many single-core jobs to run on the high, jsteinhardt, or yugroup partitions, you might improve your throughput by modifying your workflow so that you can run one job per hyperthread rather than one job per physical core. You could do this by taking advantage of parallelization strategies in R, Python, or MATLAB to distribute tasks across workers in a single job, or you could use GNU parallel or srun within sbatch.
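Here is a minimal sketch of the GNU parallel approach within an sbatch script, assuming a hypothetical per-task script task.sh and input files named input_1.txt, input_2.txt, and so on:

#!/bin/bash
#SBATCH --cpus-per-task=8
# Run one task per allocated hardware thread; Slurm sets SLURM_CPUS_PER_TASK
# to the value requested above.
parallel -j "$SLURM_CPUS_PER_TASK" ./task.sh {} ::: input_*.txt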

Slurm provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:

#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL                       
#SBATCH --mail-user=blah@berkeley.edu
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout

Parallel Jobs

One can use Slurm to submit parallel code of a variety of types.

High performance (CPU) partitions

high and low

Both high and low partitions have very old machines, although machines in high are somewhat faster.

On a per-core basis, they will all generally be slower than the machines in the specific lab group partitions, and also slower than modern laptops, particularly Apple Silicon Mac laptops.

epurdom

The epurdom partition has two nodes (frodo and samwise) with recent CPUs (128-core AMD EPYC) and 528 GB memory each.

You can request use of these nodes as follows:

arwen:~$ sbatch -p epurdom job.sh
Submitted batch job 380

Purdom group members have priority access to these nodes. If you are in the group, simply submit jobs to the epurdom partition; your jobs will automatically preempt jobs from users outside the group if that is needed for your job to run.

Non-group members can submit jobs as well, but jobs may be preempted (killed) without warning if group member jobs need the resources being used. Pre-emptible jobs are requeued when preemption happens and should restart when the needed resources become available. If you see that your job is not being requeued, please contact us.

If you need more than one CPU, please request that using the --cpus-per-task flag. The value you specify actually requests that number of hardware threads, but with the caveat that a given job is allocated all the threads on a given core to avoid contention between jobs for a given physical core.
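For example, here is a sketch of requesting eight hardware threads on the epurdom partition (job.sh as in the earlier examples):

arwen:~$ sbatch -p epurdom --cpus-per-task=8 job.sh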

jsteinhardt

While the various nodes of the jsteinhardt partition are primarily intended for use for their GPUs, many of them have newer CPUs, a lot of memory, and very fast disk I/O to /tmp and /var/tmp using an NVMe SSD.

As with the epurdom partition, non-group members can submit jobs as well, but those jobs may be preempted (killed) without warning if group-member jobs need the resources being used. Preempted jobs are requeued and should restart when the needed resources become available; if you see that your job is not being requeued, please contact us.

For example to request use of one of these nodes, which are labelled as manycore nodes:

arwen:~$ sbatch -p jsteinhardt -C manycore job.sh
Submitted batch job 380

You can request specific resources (such as CPU memory) by combining the -C constraint flag with standard sbatch resource flags.
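For instance, here is a sketch of requesting the manycore nodes together with a specific amount of CPU memory (the 64 GB value is purely illustrative):

arwen:~$ sbatch -p jsteinhardt -C manycore --mem=64G job.sh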

Also note that if you need more disk space, some (but not all) of these nodes have a much larger NVMe SSD, and we may be able to make space on it available if you request it.

Troubleshooting

The cluster is managed using the Slurm scheduling software. We configure Slurm to try to balance the needs of the various cluster users.

Often there will be enough available CPU cores (i.e., ‘resources’) in the partition that your job will start immediately after you submit it.

However, there are various reasons a job may take a while to start. Here are some details of how the scheduler works.

The ‘fair share’ policy governs the order of jobs that are waiting in the queue for resources to become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.
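If you want to see how this plays out for your own jobs, Slurm’s standard reporting tools can show your recent usage and the priority of a pending job; a brief sketch (the job ID 380 is from the earlier example):

$ sshare -u $USER    # your recent usage and fair-share factor
$ sprio -j 380       # priority components of a pending job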

Node maintenance and reservations

Periodically, we perform maintenance on cluster nodes, such as OS upgrades or hardware repairs. During these times, we place a reservation on the affected nodes in the Slurm scheduler.

When a reservation is active, you can still submit jobs as usual; the scheduler will automatically choose an available node for your job in the partition you specify. However, if you want to run the job on a specific node (via the -w flag) that has a reservation, and you want it to start before the maintenance window, you must specify a time limit that ensures your job will complete before the maintenance begins.

For example, if maintenance is scheduled in 48 hours:

$ sbatch -t 48:00:00 myjob.sh   # 48 hours
$ sbatch -t 1-6 myjob.sh        # 1 day, 6 hours

The scheduler will launch your job on nodes that can complete it in time, or on nodes without reservations. You can check the status of nodes using sinfo. During maintenance, affected nodes will have a status of maint.
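To see whether a reservation is in place and which nodes it affects, you can use standard Slurm commands (a brief sketch):

$ sinfo                        # nodes under maintenance show a 'maint' state
$ scontrol show reservation    # details of active and upcoming reservations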

How to kill a job

First, find the job ID of the job by typing squeue at the command line of a submit host (see How to Monitor Jobs).

Then use scancel to delete the job (with id 380 in this case):

scancel 380
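scancel also accepts filters if you need to cancel several jobs at once; a brief sketch (the job name here is the hypothetical one from the earlier sbatch example):

scancel -u $USER                  # cancel all of your jobs
scancel --name=myAnalysisName     # cancel jobs with a given name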

Interactive jobs

You can work interactively on a node from the Linux shell command line by starting a job in the interactive queue.

The syntax for requesting an interactive (bash) shell session is:

srun --pty /bin/bash

Alternatively, you can simply run slogin, which is a wrapper for the syntax above.

This will start a shell on one of the four nodes. You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

By default, to limit forgotten sessions, the time limit for interactive jobs is set to 1 day (24 hours). If you need less or more time (up to the maximum time limits), you can use the -t (or --time) flag. For example to run for two days:

srun -t 2-00:00:00 --pty /bin/bash

If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. So you could directly run MATLAB, e.g., as follows:

srun --pty --x11 matlab

or you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.
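For example, a brief sketch of starting an X11-enabled interactive shell and then launching a graphical program from within it:

srun --pty --x11 /bin/bash
matlab    # started from within the interactive session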

To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):

srun --pty --cpus-per-task 4 /bin/bash

Note that -c is a shorthand for --cpus-per-task.

To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):

srun --pty -p high -w scf-sm20 /bin/bash

Note that if that specific node has all its cores in use by other users, you will need to wait until resources become available on that node before your interactive session will start.

Finally, you can request multiple cores using -c, as with batch jobs. As with batch jobs, you can change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (the number of processes your code starts multiplied by the threads per process) does not exceed the number of cores you request.
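A sketch of this for a hypothetical threaded program my_openmp_program:

srun --pty -c 8 /bin/bash
export OMP_NUM_THREADS=8    # match the number of requested cores
./my_openmp_program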

Footnotes
  1. Preemptible jobs from non-group members run at normal priority.

  2. The berkeleynlp partition has a 24-hour time limit on interactive jobs, including those launched on JupyterHub.

  3. 288 GB (smaug), 792 GB (balrog, rainbowquartz), 1 TB (saruman), 128 GB (various)