The documentation below was written as a roadmap for future implementation and may not reflect how Latte is operated today. Reach out to the VP of tech ([email protected]) or root staff (via Discord) if you have any questions.
Latte is a GPU server, donated in part by NVIDIA Corp. for use by the CS community. It features 8 datacenter-class NVIDIA Tesla P100 GPUs, which offer a large speedup for machine learning and related GPU computing tasks. The TensorFlow and PyTorch libraries are available for use as well.
To begin using latte, you need to have a CSUA account.
To get a CSUA account, please create one here, or visit our office in 311 Soda. An officer will create an account for you.
Once you have an account, you can log in to latte.csua.berkeley.edu over SSH. SSH access from off campus is not allowed, so if you are currently off campus, proxy jump through soda:
$ ssh <username>@latte.csua.berkeley.edu -J <username>@soda.csua.berkeley.edu
Once logged in, you can begin setting up your jobs.
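If you connect from off campus often, the proxy jump can be wrapped in a small shell function. This is only a sketch: the function name latte_ssh is made up for illustration, and only the two host names come from the instructions above.

```shell
# Hypothetical helper: connect to latte using soda as a jump host.
# Usage: latte_ssh <username> [remote command...]
latte_ssh() {
    local user="${1:-}"
    if [ -z "$user" ]; then
        echo "usage: latte_ssh <username> [remote command...]" >&2
        return 1
    fi
    shift
    # -J tells ssh to hop through the given host before reaching latte.
    ssh -J "${user}@soda.csua.berkeley.edu" "${user}@latte.csua.berkeley.edu" "$@"
}
```

Calling latte_ssh <username> opens an interactive shell on latte; adding a command after the username (for example hostname) runs just that command and returns.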
For information on how to best use the server, send an email to [email protected] with the following:
- Name
- CSUA Username
- Intended use
Most jobs can be run on latte the same way they are run on any other Linux-based machine.
Slurm is an optional feature used to manage job scheduling.
slurmctld is meant for testing only. There are limits on the amount of compute you can use while on this machine.
The /datasets/ directory has some publicly available datasets in /datasets/share/. If you are using your own dataset, please place it in /datasets/, because the contents of /home/ are mounted over a network filesystem and will be slower.
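As a rough sketch of staging your own data onto the faster disk, the helper below copies a dataset directory into a per-user folder under /datasets/. The function name stage_dataset and the per-user subdirectory layout are assumptions made for this example, not a documented latte convention; check with root staff about where personal datasets should go.

```shell
# Hypothetical helper: copy a dataset into <datasets_root>/<user>/<name>.
# Usage: stage_dataset <source_dir> [datasets_root]
# The datasets root defaults to /datasets; the per-user layout is an assumption.
stage_dataset() {
    local src="${1:-}"
    local root="${2:-/datasets}"
    if [ -z "$src" ] || [ ! -d "$src" ]; then
        echo "usage: stage_dataset <source_dir> [datasets_root]" >&2
        return 1
    fi
    local dest="${root}/${USER:-$(id -un)}/$(basename "$src")"
    # Copy the directory contents (including hidden files) and print the target.
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/ && echo "$dest"
}
```

The function prints the destination path, so the result can be passed straight to a training script as its data directory.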
Once you run your program and it works, you can submit a job.
To run a job, submit it using the srun command. This will send the job to one of the GPU nodes and run it there. You can read about how to use Slurm here.
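A minimal sketch of what an srun submission could look like is shown below. The wrapper name run_on_gpu is hypothetical. The --gres=gpu:N option is a standard Slurm flag, but whether latte's Slurm configuration defines a gpu resource is an assumption; if it does not, drop the flag and call srun directly.

```shell
# Hypothetical wrapper: run a command on a GPU node through srun.
# Usage: run_on_gpu <num_gpus> <command> [args...]
# Assumes the cluster's Slurm config defines a "gpu" generic resource.
run_on_gpu() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_on_gpu <num_gpus> <command> [args...]" >&2
        return 1
    fi
    local gpus="$1"
    shift
    # srun blocks until the job finishes and streams its output to the terminal.
    srun --gres="gpu:${gpus}" "$@"
}
```

For example, run_on_gpu 1 python3 train.py would request one GPU for a script called train.py (a placeholder name).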
If you have any questions, please email [email protected].
This repo contains the configurations used to test and deploy the Slurm Docker cluster known as latte. The important commands can be found in the Makefile.
The cluster is created using docker-compose, specifically nvidia-docker-compose. There are a number of other pieces of software involved, however.
(Copied from https://docs.docker.com/compose/overview/ )
Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.
Using Compose is basically a three-step process:
- Define your app’s environment with a Dockerfile so it can be reproduced anywhere.
- Define the services that make up your app in docker-compose.yml so they can be run together in an isolated environment.
- Run docker-compose up and Compose starts and runs your entire app.
The Makefile describes all the necessary commands for building and testing the cluster.
This is a multi-container Slurm cluster using docker-compose. The compose file creates named volumes for persistent storage of MySQL data files as well as Slurm state and log directories.
The compose file will run the following containers:
- mysql
- slurmdbd
- slurmctld
- c1 (slurmd)
- c2 (slurmd)
The compose file will create the following named volumes:
- etc_munge ( -> /etc/munge )
- etc_slurm ( -> /etc/slurm )
- slurm_jobdir ( -> /data )
- var_lib_mysql ( -> /var/lib/mysql )
- var_log_slurm ( -> /var/log/slurm )
Build the image locally:
$ docker build -t slurm-docker-cluster:17.02.9 .
Run docker-compose to instantiate the cluster:
$ docker-compose up -d
To register the cluster to the slurmdbd daemon, run the register_cluster.sh script:
$ ./register_cluster.sh
Note: You may have to wait a few seconds for the cluster daemons to become ready before registering the cluster. Otherwise, you may get an error such as sacctmgr: error: Problem talking to the database: Connection refused.
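Rather than guessing how long to wait, the registration can be retried until the daemons are up. The retry function below is a generic sketch written for this example and is not part of the repo.

```shell
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, retry 10 3 ./register_cluster.sh would run the script up to 10 times, 3 seconds apart, and stop at the first success.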
You can check the status of the cluster by viewing the logs:
$ docker-compose logs -f
Use docker exec to run a bash shell on the controller container:
$ docker exec -it slurmctld bash
From the shell, execute slurm commands, for example:
[root@slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle c[1-2]
The slurm_jobdir named volume is mounted on each Slurm container as /data. Therefore, in order to see job output files while on the controller, change to the /data directory when on the slurmctld container and then submit a job:
[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 2
[root@slurmctld data]# ls
slurm-2.out
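For anything longer than a one-liner, a batch script is usually easier to manage than --wrap. The sketch below writes a small script and submits it; the file name job.sh and the job name are placeholders chosen for this example. It is meant to be run from /data on the slurmctld container so the output file lands in the shared volume.

```shell
# Write a small batch script. job.sh and the job name are placeholders.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=slurm-%j.out
#SBATCH --nodes=1
# %j in the output name is replaced by the Slurm job ID.
hostname
uptime
EOF

# Submit the script only if sbatch is available in this shell.
if command -v sbatch >/dev/null 2>&1; then
    sbatch job.sh
else
    echo "sbatch not found; run this inside the slurmctld container" >&2
fi
```

With the --output pattern above, each job writes to its own slurm-<jobid>.out file in the directory the job was submitted from.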
To stop the cluster and start it again later without removing anything, run:
$ docker-compose stop
$ docker-compose start
To remove all containers and volumes, run:
$ docker-compose rm -sf
$ docker volume rm slurmdockercluster_etc_munge slurmdockercluster_etc_slurm slurmdockercluster_slurm_jobdir slurmdockercluster_var_lib_mysql slurmdockercluster_var_log_slurm