Merge pull request #26 from JamieMair/add-slurm-support
Add slurm support
Showing 14 changed files with 330 additions and 8 deletions.
The ignore rules are updated so that `Manifest.toml` is ignored anywhere in the repository, not only at the root:

```diff
 *.jl.*.cov
 *.jl.cov
 *.jl.mem
-/Manifest.toml
+Manifest.toml
 /docs/build/
 .vscode/
```
# Clusters

This package provides basic support for running an experiment on an HPC. It uses `ClusterManagers.jl` under the hood.

At the moment, only SLURM clusters are supported, but PRs adding support for other schedulers are welcome.

## SLURM

When running on SLURM, one normally writes a bash script that tells the scheduler about a job's resource requirements. The following is an example:
```bash
#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:30:00
#SBATCH -o hpc/output/test_job_%j.out
```

The function [`Experimenter.Cluster.create_slurm_template`](@ref) provides an easy way to create one of these bash scripts with everything you need to run.

### Example
Let us take the following end-to-end example. Say that we have an experiment script at `my_experiment.jl` (contents below), which now initialises the cluster:
```julia
using Experimenter

config = Dict{Symbol,Any}(
    :N => IterableVariable([Int(1e6), Int(2e6), Int(3e6)]),
    :seed => IterableVariable([1234, 4321, 3467, 134234, 121]),
    :sigma => 0.0001)
experiment = Experiment(
    name="Test Experiment",
    include_file="run.jl",
    function_name="run_trial",
    configuration=deepcopy(config)
)

db = open_db("experiments.db")

# Init the cluster
Experimenter.Cluster.init()

@execute experiment db DistributedMode
```
Additionally, we have the file `run.jl` containing:
```julia
using Random
using Distributed

function run_trial(config::Dict{Symbol,Any}, trial_id)
    results = Dict{Symbol,Any}()
    sigma = config[:sigma]
    N = config[:N]
    seed = config[:seed]
    rng = Random.Xoshiro(seed)
    # Perform some calculation
    results[:distance] = sum(rand(rng) * sigma for _ in 1:N)
    results[:num_threads] = Threads.nthreads()
    results[:hostname] = gethostname()
    results[:pid] = Distributed.myid()
    # Must return a Dict{Symbol, Any} containing the data we want to save
    return results
end
```
We can now create a bash script to run our experiment. We create a template by running the following in the terminal (or the equivalent call from the REPL):
```bash
julia --project -e 'using Experimenter; Experimenter.Cluster.create_slurm_template("myrun.sh")'
```
We then modify the created `myrun.sh` file to the following:
```bash
#!/bin/bash

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:30:00
#SBATCH -o hpc/logs/job_%j.out

julia --project my_experiment.jl --threads=1

# Optional: Remove the files created by ClusterManagers.jl
rm -fr julia-*.out
```

Once written, we submit this job to the cluster via
```bash
sbatch myrun.sh
```

We can then open a Julia REPL (once the job has finished) to see the results:
```julia
using Experimenter
db = open_db("experiments.db")
trials = get_trials_by_name(db, "Test Experiment")

for (i, t) in enumerate(trials)
    hostname = t.results[:hostname]
    id = t.results[:pid]
    println("Trial $i ran on $hostname on worker $id")
end
```
|
||
Support for running on SLURM is based on [this gist](https://gist.github.com/JamieMair/0b1ffbd4ee424c173e6b42fe756e877a) available on GitHub. This gist also provides information on how to adjust the SLURM script to allow for one GPU to be allocated to each worker. | ||
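As a rough sketch of the GPU setup the gist describes (the exact directives vary by site, so treat these flags as an assumption to check against your cluster's documentation), the resource requests might look like:

```bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1          # GPUs via the generic-resource plugin; counted per node
#SBATCH --time=00:30:00
```

Some newer SLURM installations also support `--gpus-per-task=1`, which binds one GPU to each task directly; which form applies depends on your site's configuration.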
The commit also adds a new ignore file:
```
experiments/
*.out
```
A project file declaring the example's dependencies:
```toml
[deps]
ClusterManagers = "34f1f09b-3a8b-5176-ab39-66d58a4d544e"
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"
Experimenter = "6aee034a-9508-47b1-8e11-813cc29af79f"
```
A script for viewing the results:
```julia
using Experimenter
db = open_db("experiments.db")
trials = get_trials_by_name(db, "Test Experiment")

for (i, t) in enumerate(trials)
    hostname = t.results[:hostname]
    id = t.results[:pid]
    println("Trial $i ran on $hostname on worker $id")
end
```
A placeholder file to keep a directory in version control:
```
This is a file to make sure this directory exists.
```
The experiment script from the documentation above:
```julia
using Experimenter

config = Dict{Symbol,Any}(
    :N => IterableVariable([Int(1e6), Int(2e6), Int(3e6)]),
    :seed => IterableVariable([1234, 4321, 3467, 134234, 121]),
    :sigma => 0.0001)
experiment = Experiment(
    name="Test Experiment",
    include_file="run.jl",
    function_name="run_trial",
    configuration=deepcopy(config)
)

db = open_db("experiments.db")

# Init the cluster
Experimenter.Cluster.init()

@execute experiment db DistributedMode
```
The run script, which here also loads a Julia module before launching:
```bash
#!/bin/bash

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:30:00
#SBATCH -o hpc/logs/job_%j.out

module purge
module load julia/1.9.4

julia --project my_experiment.jl --threads=1

# Optional: Remove the files created by ClusterManagers.jl
rm -fr julia-*.out
```
The trial code:
```julia
using Random
using Distributed

function run_trial(config::Dict{Symbol,Any}, trial_id)
    results = Dict{Symbol,Any}()
    sigma = config[:sigma]
    N = config[:N]
    seed = config[:seed]
    rng = Random.Xoshiro(seed)
    # Perform some calculation
    results[:distance] = sum(rand(rng) * sigma for _ in 1:N)
    results[:num_threads] = Threads.nthreads()
    results[:hostname] = gethostname()
    results[:pid] = Distributed.myid()
    # Must return a Dict{Symbol, Any} containing the data we want to save
    return results
end
```
The package extension implementing SLURM support:
```julia
module SlurmExt

############ Module dependencies ############
if isdefined(Base, :get_extension)
    using Experimenter
    using Distributed
    using ClusterManagers
else
    using ..Experimenter
    using ..Distributed
    using ..ClusterManagers
end

############ Module Code ############
function Experimenter.Cluster.init_slurm(; sysimage_path::Union{String, Nothing}=nothing)
    @info "Setting up SLURM"
    # Read the worker layout from the SLURM environment
    num_tasks = parse(Int, ENV["SLURM_NTASKS"])
    cpus_per_task = parse(Int, ENV["SLURM_CPUS_PER_TASK"])
    @info "Using $cpus_per_task threads on each worker"
    exeflags = ["--project", "-t$cpus_per_task"]
    if !isnothing(sysimage_path)
        @info "Using the sysimage: $sysimage_path"
        push!(exeflags, "--sysimage")
        push!(exeflags, "\"$sysimage_path\"")
    end
    addprocs(SlurmManager(num_tasks); exeflags=exeflags, topology=:master_worker)

    @info "SLURM workers launched: $(length(workers()))"
end

# @doc """
#     init_slurm(; sysimage_path=nothing)
#
# Spins up all the processes as indicated by the SLURM environment variables.
#
# # Arguments
#
# - `sysimage_path`: A path to the sysimage that the workers should use to avoid unnecessary precompilation
# """ Experimenter.Cluster.init_slurm

end
```