This page describes how to submit jobs to the cluster.
Slurm configuration and job restrictions¶
The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here:
Table 1: Partition Restrictions
| Partition | Max # cores per user (running) | Time limit | Max CPU memory per job | Max cores per job |
|---|---|---|---|---|
| high (default) | 96 | 7 days | 128 GB | 24[1] |
| low | 256 | 28 days | 256 GB | 32[1] |
| gpu[2] | 8 CPU cores | 28 days | 6 GB | 8 |
| epurdom[2] | 256 | 28 days[3] | 528 GB | 128[1] |
| jsteinhardt[2] | varied | 28 days[3] | 288 GB (smaug), 792 GB (balrog, rainbowquartz), 1 TB (saruman), 128 GB (various) | varied[1] |
| yugroup[2] | varied | 28 days[3] | varied | varied |
| yss[2] | 224 | 28 days[3] | 528 GB | varied[1] |
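As an illustration of targeting a particular partition while staying within these limits, the following (hypothetical) submission requests the low partition with a 14-day time limit, 8 cores, and 64 GB of memory; the flags are standard sbatch options, but the values are just examples:
$ sbatch -p low -t 14-00:00:00 -c 8 --mem=64G job.sh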
Single-core jobs¶
Prepare a shell script containing the instructions you would like the system to execute.
The instructions here are for the simple case of submitting a job without any parallelization; i.e., a job using a single core (CPU). When submitted using the instructions in this section, such jobs will have access to only a single core at a time. We also have extensive instructions for submitting parallelized jobs and automating the submission of multiple jobs.
For example, a simple script to run an R program called ‘simulate.R’ would contain these lines:
#!/bin/bash
R CMD BATCH --no-save simulate.R simulate.out
Once logged onto a submit host, navigate to a directory within your home or scratch directory (i.e., make sure your working directory is not in /tmp or /var/tmp) and use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:
$ sbatch job.sh
Submitted batch job 380
Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.
If you have many single-core jobs to run, there are various ways to automate submitting such jobs.
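For instance, one simple (hypothetical) approach is a shell loop that submits one job per input file, passing the file name to the job script as an argument (which is then available as $1 inside job.sh):
#!/bin/bash
# submit_all.sh (hypothetical helper): submit one sbatch job per CSV file in ./data
for f in data/*.csv; do
    sbatch job.sh "$f"
done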
Note that Slurm is configured such that single-core jobs will have access to a single physical core (including both hyperthreads on our new machines), so there won't be any contention between the two threads on a physical core. However, if you have many single-core jobs to run on the high, jsteinhardt, or yugroup partitions, you might improve your throughput by modifying your workflow so that you can run one job per hyperthread rather than one job per physical core. You could do this by taking advantage of parallelization strategies in R, Python, or MATLAB to distribute tasks across workers in a single job, or you could use GNU parallel or srun within sbatch, as sketched below.
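As a rough sketch of the GNU parallel approach (assuming GNU parallel is available on the node; the per-task script and input files here are hypothetical), you could request several cores in one job and let parallel keep one task running per allocated hardware thread:
#!/bin/bash
#SBATCH --cpus-per-task=8
# Run one task per allocated thread; process_one.sh and inputs/*.txt are placeholders.
parallel -j "$SLURM_CPUS_PER_TASK" ./process_one.sh ::: inputs/*.txt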
Slurm provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:
#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL
#SBATCH --mail-user=blah@berkeley.edu
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout
Parallel Jobs¶
One can use Slurm to submit parallel code of a variety of types.
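For example, a minimal sketch of such a job script might request several cores for a single task and run an R script (parallel_simulate.R is hypothetical here) that itself distributes work across four workers, e.g., via the parallel package:
#!/bin/bash
#SBATCH --cpus-per-task=4
R CMD BATCH --no-save parallel_simulate.R parallel_simulate.Rout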
High performance (CPU) partitions¶
High partition vs. low partition¶
Both of these partitions have quite old machines. While the machines in the high partition are faster than those in the low partition, they will generally be slower than the machines in the partitions for specific lab groups and (on a per-core basis) than your laptop, particularly Apple Silicon Mac laptops.
epurdom partition¶
The epurdom partition has two nodes (frodo and
samwise) with recent CPUs (128-core AMD EPYC) and
528 GB memory each.
You can request use of these nodes as follows:
arwen:~$ sbatch -p epurdom job.sh
Submitted batch job 380
Purdom group members have priority access to these nodes. If you are in the group, simply submit jobs to the epurdom partition and your jobs will automatically preempt jobs by users not in the group if that is needed for your jobs to run.
Non-group members can submit jobs as well, but jobs may be preempted (killed) without warning if group member jobs need the resources being used. Pre-emptible jobs are requeued when preemption happens and should restart when the needed resources become available. If you see that your job is not being requeued, please contact us.
If you need more than one CPU, please request that using the
--cpus-per-task flag. The value you specify actually requests that
number of hardware threads, but with the caveat that a given job is
allocated all the threads on a given core to avoid contention between
jobs for a given physical core.
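For instance, to request 32 cores for a single job on the epurdom partition (32 is just an illustrative value within the partition's per-job limit):
arwen:~$ sbatch -p epurdom --cpus-per-task=32 job.sh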
jsteinhardt partition¶
The jsteinhardt partition has various nodes. While these nodes are
primarily intended for use for their GPUs, many of them have newer CPUs, a lot of
memory, and very fast disk I/O to /tmp and /var/tmp using an NVMe
SSD.
As with the epurdom partition, non-group members can submit jobs, but those jobs may be preempted (killed) without warning if group member jobs need the resources being used; preempted jobs are requeued and should restart when the needed resources become available. If you see that your job is not being requeued, please contact us.
For example, to request use of one of these nodes, which are labelled as manycore nodes:
arwen:~$ sbatch -p jsteinhardt -C manycore job.sh
Submitted batch job 380
You can request specific resources as follows:
- -C fasttmp for access to fast disk I/O in /tmp and /var/tmp,
- -C manycore for access to many (64 or more) cores,
- -C mem256g for up to 256 GB CPU memory,
- -C mem768g for up to 768 GB CPU memory, and
- -C mem1024g for up to 1024 GB CPU memory.
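Slurm also lets you combine such constraints with &; this example is hypothetical and assumes a node provides both features:
arwen:~$ sbatch -p jsteinhardt -C "manycore&mem768g" job.sh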
Also note that if you need more disk space, some (but not all) of these nodes have a much larger NVMe SSD whose space we may be able to make available if you request it.
Job not starting¶
The cluster is managed using the Slurm scheduling software. We configure Slurm to try to balance the needs of the various cluster users.
Often there may be enough available CPU cores (aka ‘resources’) on the partition, and your job will start immediately after you submit it.
However, there are various reasons a job may take a while to start. Here are some details of how the scheduler works.
If there aren't enough resources in a given partition to run a job when it is submitted, it goes into the queue. The queue is sorted based on how much CPU time you've used over the past few weeks, using the 'fair share' policy described below. Your jobs will be moved to a position in the queue below the jobs of users who have used less CPU time in recent weeks. This happens dynamically, so another user can submit a job well after you have submitted your job, and that other user's job can be placed higher in the queue at that time and start sooner than your job. If this were not the case, imagine a user submitting hundreds of big jobs: once they are in the queue, everyone submitting afterwards would have to wait a long time while those jobs run, since no other jobs could be moved above them in the queue.
When a job at the top of the queue needs multiple CPU cores (or in some cases an entire node), then jobs submitted to that same partition that are lower in the queue will not start even if there are enough CPU cores available for those lower priority jobs. That’s because the scheduler is trying to accumulate resources for the higher priority job(s) and guarantee that the higher priority jobs’ start times wouldn’t be pushed back by running the lower priority jobs.
In some cases, if the scheduler has enough information about how long jobs will run, it will run lower-priority jobs on available resources when it can do so without affecting when a higher priority job would start. This is called backfill. It can work well on some systems but on the SCF and EML it doesn’t work all that well because we don’t require that users specify their jobs’ time limits carefully. We made that tradeoff to make the cluster more user-friendly at the expense of optimizing throughput.
The ‘fair share’ policy governs the order of jobs that are waiting in the queue for resources to become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.
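If you'd like to see how this plays out for your own jobs, standard Slurm utilities can help (assuming they are enabled; the exact output depends on the site configuration):
$ squeue -u $USER --start   # estimated start times for your pending jobs
$ sprio -u $USER            # priority components for your pending jobs
$ sshare -u $USER           # your recent (decayed) usage and fair-share factor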
Node maintenance and reservations¶
Periodically, we perform maintenance on cluster nodes, such as OS upgrades or hardware repairs. During these times, we place a reservation on the affected nodes in the Slurm scheduler.
When a reservation is active, you can still submit jobs as usual. The
scheduler will automatically choose an available node for your job in the
partition you specify. However, if you want to run the job on a node (via
the -w flag) with a reservation, and you want it to start before the
maintenance window, you must specify a time limit that ensures your job
will complete before the maintenance begins.
For example, if maintenance is scheduled in 48 hours:
$ sbatch -t 48:00:00 myjob.sh # 48 hours
$ sbatch -t 1-6 myjobs.sh # 1 day, 6 hours
The scheduler will launch your job on nodes that can complete it in time,
or on nodes without reservations. You can check the status of nodes using
sinfo. During maintenance, affected nodes will have a status of maint.
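To see the details of any active or upcoming reservations (e.g., which nodes are affected and when the maintenance window begins), you can use the standard Slurm command:
$ scontrol show reservation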
How to kill a job¶
First, find the job-id of the job, by typing squeue at the command line of a submit host (see How to Monitor Jobs).
Then use scancel to delete the job (with id 380 in this case):
scancel 380
Interactive jobs¶
You can work interactively on a node from the Linux shell command line by starting a job in the interactive queue.
The syntax for requesting an interactive (bash) shell session is:
srun --pty /bin/bash
This will start a shell on one of the four nodes. You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.
By default, to limit forgotten sessions, the time limit for interactive
jobs is set to 1 day (24 hours). If you need less or more time (up to
the maximum time limits), you can
use the -t (or --time) flag. For example to run for two days:
srun -t 2-00:00:00 --pty /bin/bash
If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. So you could directly run MATLAB, e.g., as follows:
srun --pty --x11 matlab
Alternatively, you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.
To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):
srun --pty --cpus-per-task 4 /bin/bash
Note that -c is a shorthand for --cpus-per-task.
To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):
srun --pty -p high -w scf-sm20 /bin/bash
Note that if that specific node has all its cores in use by other users, you will need to wait until resources become available on that node before your interactive session will start.
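Once the session on scf-sm20 starts, you could then copy a file to that node's local disk, e.g. (assuming, as is typical here, that your home directory is also mounted on the compute node; the path is hypothetical):
cp ~/projects/mydata.csv /tmp/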
Finally, you can request multiple cores using -c, as with batch jobs. As with batch jobs, you can change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (number of processes your code starts multiplied by threads per process) does not exceed the number of cores you request.
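As a concrete (hypothetical) illustration of keeping processes times threads within the request:
srun --pty -c 4 /bin/bash
# Inside the session, either run one program with 4 threads...
export OMP_NUM_THREADS=4
./my_threaded_program   # placeholder for your own multi-threaded program
# ...or leave OMP_NUM_THREADS at 1 and run up to 4 single-threaded processes instead.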