Hardware
The SCF operates one GPU available to all SCF users on an equal basis and many other GPUs purchased by research groups that are available to group members at regular priority and to other SCF users at lower (preemptible) priority.
You must use the Slurm scheduling software to run any job that uses a GPU. You may want to use an interactive session to develop and test your GPU code. The SCF documentation on Slurm also has information on monitoring the GPU usage of your job.
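As a rough illustration of a GPU job submission, the sketch below assembles an `sbatch` command line in Python without running it. The options `--partition`, `--gres`, and `--time` are standard Slurm options, but the exact options the SCF expects may differ, and the function name `build_sbatch_command` and the script name `job.sh` are hypothetical.

```python
import shlex


def build_sbatch_command(script, partition="gpu", gpus=1, time_limit=None):
    """Return the argument list for an sbatch call that requests GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    # --gres=gpu:N asks Slurm for N GPUs on the allocated node.
    cmd = ["sbatch", f"--partition={partition}", f"--gres=gpu:{gpus}"]
    if time_limit is not None:
        cmd.append(f"--time={time_limit}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # job.sh is a hypothetical job script; the command is printed, not executed.
    print(shlex.join(build_sbatch_command("job.sh")))
```

Passing the resulting list to `subprocess.run` on a machine with Slurm installed would submit the job.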
General Access GPUs
GPU partition
We have one Titan Xp with 12 GB memory on one of our Linux servers (roo), available through the gpu partition.
Savio
Access to GPUs on Savio is available through the Savio faculty computing allowance. Please contact SCF staff for more information.
Research Group GPUs
The SCF also operates the following research group GPUs. These GPUs are owned by individual faculty members, but anyone can run jobs on them. If you are not a member of the lab group, your jobs will run on a preemptible basis, which means they can be cancelled at any time by a higher-priority job. These servers can be accessed by submitting to the specific partition of interest using the Slurm scheduling software.
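Because a preemptible job can be cancelled at any time, one common way to limit lost work is to checkpoint progress and resume from the checkpoint when the job is rerun. The sketch below is one minimal way to do that. It assumes the scheduler delivers SIGTERM to the process before killing it, which depends on the site's Slurm configuration and is not confirmed here. The names `StopFlag`, `save_checkpoint`, `load_checkpoint`, `run`, and `checkpoint.json` are all hypothetical.

```python
import json
import os
import signal
import tempfile


class StopFlag:
    """Remembers whether a termination signal has been received."""

    def __init__(self):
        self.set = False

    def handler(self, signum, frame):
        self.set = True


def save_checkpoint(path, step):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the next step to run, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def run(total_steps, path, flag, do_step=lambda step: None):
    """Run steps starting from the last checkpoint; stop early if the flag is set."""
    step = load_checkpoint(path)
    while step < total_steps:
        do_step(step)  # one unit of real work goes here
        step += 1
        save_checkpoint(path, step)
        if flag.set:
            break
    return step


if __name__ == "__main__":
    flag = StopFlag()
    signal.signal(signal.SIGTERM, flag.handler)
    # A real job would keep the checkpoint on persistent storage; a temporary
    # directory is used here only so that the demo cleans up after itself.
    with tempfile.TemporaryDirectory() as d:
        done = run(5, os.path.join(d, "checkpoint.json"), flag)
        print(f"completed {done} steps")
```

Saving after every step keeps the sketch simple. A real job would usually checkpoint less often and also save its model or data state, not just a step counter.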
See the first table below for information about the GPU servers and the second table for more detailed information related to local disk, GPU-to-GPU interconnect, and location/latency.
With regard to the latency, note that some GPU machines are located at the NASA Ames facility, approximately 75 km from Berkeley, where the SCF fileservers hosting home and scratch directories (via NFS) are located. This distance results in a latency of about two milliseconds, which can cause slowness and laggy behavior when working with many (often small) files, including Conda/Mamba-related work. Local disks are available to group members to help work around this problem.
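To see why per-file latency matters for workloads with many small files, the back-of-the-envelope sketch below multiplies the number of files by the number of network round trips per file and the latency per round trip. The file count, the round trips per file, and the 2 ms figure in the example are illustrative assumptions, not measurements of the SCF systems, and `latency_overhead_seconds` is a hypothetical helper.

```python
def latency_overhead_seconds(n_files, round_trips_per_file, latency_seconds):
    """Time spent purely waiting on the network, ignoring data transfer."""
    return n_files * round_trips_per_file * latency_seconds


if __name__ == "__main__":
    # Illustrative assumptions: 50,000 small files, 3 round trips per file
    # (for example lookup, open, read), and 2 ms per round trip.
    overhead = latency_overhead_seconds(50_000, 3, 0.002)
    print(f"approximate waiting time: {overhead:.0f} s")
```

Under these assumptions the waiting time alone is about 300 seconds, which is why an environment with tens of thousands of files can feel sluggish over NFS even when little data is moved.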
GPU server specs
| Partition | Machine Name | GPU Type (Number of GPUs) | GPU Memory |
|---|---|---|---|
| jsteinhardt | cubbins[1] | H200 (8) | 144 GB |
| jsteinhardt | mcfuzz[1] | H200 (8) | 144 GB |
| jsteinhardt | mooney[1] | H200 (8) | 144 GB |
| jsteinhardt | sneetches[1] | H200 (8) | 144 GB |
| jsteinhardt | balrog | A100 (8) | 40 GB |
| jsteinhardt | saruman | A100 (10) | 80 GB |
| jsteinhardt | rainbowquartz | A5000 (8) | 24 GB |
| jsteinhardt | smokyquartz | A4000 (8) | 16 GB |
| jsteinhardt | sunstone | A4000 (8) | 16 GB |
| jsteinhardt | smaug | Quadro RTX 8000 (1) | 48 GB |
| jsteinhardt | shadowfax | GeForce RTX 2080 Ti (8) | 11 GB |
| yugroup | treebeard | A100 (1) | 40 GB |
| yugroup | merry | GeForce GTX TITAN X (1) | 12 GB |
| yugroup | morgoth | Titan Xp (1) | 12 GB |
| yugroup | morgoth | Titan X (Pascal) (1) | 12 GB |
| yss | luthien | A100 (4) | 80 GB |
| yss | beren | A100 (8) | 80 GB |
| songmei | feanor[1] | H200 (8) | 144 GB |
| berkeleynlp | horton[1] | H200 (8) | 144 GB |
| berkeleynlp | lorax[1] | H200 (8) | 144 GB |
GPU server additional specs
| Partition | Machine Name | GPU-to-GPU Interconnect | Local storage[2] | Location |
|---|---|---|---|---|
| jsteinhardt | cubbins[1] | NVSwitch | 14 TB NVME | NASA Ames[3] |
| jsteinhardt | mcfuzz[1] | NVSwitch | 14 TB NVME | NASA Ames[3] |
| jsteinhardt | mooney[1] | NVSwitch | 14 TB NVME | NASA Ames[3] |
| jsteinhardt | sneetches[1] | NVSwitch | 14 TB NVME | NASA Ames[3] |
| jsteinhardt | balrog | NVLink (pairs) | 3.5 TB spinning | Berkeley |
| jsteinhardt | saruman | NVLink (pairs) | 7 TB NVME | Berkeley |
| jsteinhardt | rainbowquartz | A5000 (8) | 3.5 TB NVME | Berkeley |
| jsteinhardt | smokyquartz | A4000 (8) | 3.5 TB NVME | Berkeley |
| jsteinhardt | sunstone | A4000 (8) | 3.5 TB NVME | Berkeley |
| jsteinhardt | smaug | NVLink (pairs) | 2 TB NVME | Berkeley |
| jsteinhardt | shadowfax | None | 3.6 TB spinning | Berkeley |
| yugroup | treebeard | N/A | | Berkeley |
| yugroup | merry | N/A | | Berkeley |
| yugroup | morgoth | N/A | | Berkeley |
| yugroup | morgoth | N/A | | Berkeley |
| yss | luthien | None | | Berkeley |
| yss | beren | NVLink (pairs) | | Berkeley |
| songmei | feanor[1] | NVSwitch | 6.6 TB NVME | NASA Ames[3] |
| berkeleynlp | horton[1] | NVSwitch | 56 TB NVME | NASA Ames[3] |
| berkeleynlp | lorax[1] | NVLink (pairs) | 56 TB NVME | NASA Ames[3] |
[2] Storage available at /data to group members. In addition, all machines have 100s of GB available on local /tmp and /var/tmp on spinning disks, available to all users.
[3] Note that some GPU machines are located at the NASA Ames facility, approximately 75 km from Berkeley, where the SCF fileservers hosting home and scratch directories (via NFS) are located. This distance causes a latency of about two milliseconds, which can cause slowness and laggy behavior when working with many (often small) files, including Conda/Mamba-related work. Local disks are available to group members to help work around this problem.
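One way to use local disk to avoid repeated small-file access over NFS is to copy a working directory to local storage once at the start of a job and read from the copy afterwards. The sketch below illustrates the idea with the Python standard library. The helper name `stage_to_local` and the `staged_` prefix are hypothetical, the default of `/tmp` reflects the local space described above as available to all users, and group members could pass a directory under `/data` instead.

```python
import os
import shutil
import tempfile


def stage_to_local(src_dir, local_root="/tmp"):
    """Copy src_dir into a fresh directory under local_root; return the copy's path."""
    # mkdtemp creates a uniquely named directory, so concurrent jobs do not collide.
    staging = tempfile.mkdtemp(prefix="staged_", dir=local_root)
    dest = os.path.join(staging, os.path.basename(os.path.normpath(src_dir)))
    shutil.copytree(src_dir, dest)
    return dest


if __name__ == "__main__":
    # Demo with a throwaway source directory standing in for data on NFS.
    with tempfile.TemporaryDirectory() as src:
        with open(os.path.join(src, "example.txt"), "w") as f:
            f.write("hello")
        local_copy = stage_to_local(src, local_root=tempfile.gettempdir())
        print(os.listdir(local_copy))
        # Remove the staged copy when the job is done to free local disk space.
        shutil.rmtree(os.path.dirname(local_copy))
```

Cleaning up at the end matters because local disks are shared and limited in size.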