metno/run-anemoi

Overview

run-anemoi is a collection of utility scripts and packages for working with anemoi-training.

Anemoi-training on LUMI

Use of virtual Python environments is strongly discouraged on LUMI, with a container-based approach being the preferred solution. We therefore use a Singularity container that holds the entire software environment except for the Anemoi repositories themselves (training, graphs, models, datasets, utils). These are installed in a lightweight virtual environment that is loaded on top of the container, which lets us edit these packages without rebuilding the container.

  • The virtual environment is set up by executing bash make_env.sh in /lumi. This will download the anemoi-packages and install them in a .venv folder inside /lumi.

You can now train a model through the following steps:

  • Set up the desired model config file and make sure it is placed in /lumi. This file should not be named config.yaml or share a name with any other config already in anemoi-training.
  • Specify the config file name in lumi_jobscript.sh along with preferred sbatch settings for the job.
  • Submit the job with sbatch lumi_jobscript.sh

Automated Anemoi training with SLURM

autorun-anemoi is a lightweight Python package for submitting Anemoi training runs to the SLURM queue.

Features:

  • Chained dependency jobs for long training
  • Auto-run inference after training is finalised
  • Modify config on-the-fly for efficient testing
  • Backs up config and job script to avoid overwriting

Install

This package is not available on PyPI. To install, run:

pip install git+https://github.com/metno/run-anemoi.git

Basic usage

autorun-anemoi comes with a command-line interface and a Python interface. The examples focus on the command-line interface, but the Python interface supports the same functionality.

Command-line interface

The command-line interface comes with two required arguments: config-name and sbatch-yaml:

run-anemoi <config-name> <sbatch-yaml>

The first is the path to the config to be used, and the second is a YAML file containing all SBATCH options to be used in the job script. An example file is provided as job.yaml:

output: output.out
error: error.err
nodes: 1
ntasks-per-node: 4
gpus-per-node: 4
mem: 450G
account: DestE_330_24
partition: boost_usr_prod
job-name: test
exclusive: None

With this file in place, the training job is submitted with:

run-anemoi anemoi/config/config.yaml job.yaml
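
As an illustration of how the entries in job.yaml could end up in a job script, here is a minimal sketch that turns such a mapping into #SBATCH lines. The function name render_sbatch_header is hypothetical, and treating a value of None as a bare flag (as in exclusive: None) is an assumption about the intended behaviour, not a description of how autorun-anemoi actually does it.

```python
def render_sbatch_header(options):
    """Render a mapping of SBATCH option names to values as '#SBATCH' lines.

    Hypothetical helper for illustration only; not part of autorun-anemoi.
    """
    lines = ["#!/bin/bash"]
    for key, value in options.items():
        # Assumption: None (or the literal string "None") marks a flag
        # that takes no value, such as --exclusive.
        if value is None or str(value) == "None":
            lines.append(f"#SBATCH --{key}")
        else:
            lines.append(f"#SBATCH --{key}={value}")
    return "\n".join(lines)


# The same keys and values as the example job.yaml, written as a dict.
options = {
    "output": "output.out",
    "error": "error.err",
    "nodes": 1,
    "ntasks-per-node": 4,
    "gpus-per-node": 4,
    "mem": "450G",
    "account": "DestE_330_24",
    "partition": "boost_usr_prod",
    "job-name": "test",
    "exclusive": "None",
}

header = render_sbatch_header(options)
print(header)
```

The dict stands in for the parsed YAML file so that the sketch needs no third-party YAML parser.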

Python interface

The same operation can be done by creating an AutoRunAnemoi-object in Python:

from autorun_anemoi import AutoRunAnemoi

obj = AutoRunAnemoi('anemoi/config/config.yaml', 'job.yaml')
obj.run()

Chained jobs

If the total training time is longer than what is practical for a single job (due to system limits or queue times), the training can be split into a chain of dependency jobs. This happens whenever total_time, the expected duration of the training procedure specified in the config, exceeds max_time_per_job. Set total_time with the --total_time or -t argument (it follows the SLURM time format):

run-anemoi anemoi/config/config.yaml job.yaml -t 3-00:00:00

The default max_time_per_job is set to the maximum running time for the specified partition. To override this, use the --max_time_per_job or -m argument:

run-anemoi anemoi/config/config.yaml job.yaml -t 3-00:00:00 -m 12:00:00

The command above will submit 6 jobs in total (one initial job and five dependency jobs), each with a time limit of 12 hours, since 3 days is 72 hours and 72 / 12 = 6.
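
The job count for chained runs can be checked with a short sketch. It assumes that the number of jobs is the total time divided by the per-job limit, rounded up; the helper names slurm_time_to_seconds and number_of_jobs are hypothetical and not part of autorun-anemoi.

```python
import math


def slurm_time_to_seconds(text):
    """Convert a SLURM time string to seconds.

    Accepted forms: M, M:S, H:M:S, D-H, D-H:M, D-H:M:S.
    """
    days = 0
    if "-" in text:
        # With a day prefix, the remaining fields are hours[:minutes[:seconds]].
        day_part, text = text.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in text.split(":")]
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in text.split(":")]
        if len(parts) == 1:
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:
            hours, minutes, seconds = 0, parts[0], parts[1]
        else:
            hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def number_of_jobs(total_time, max_time_per_job):
    """Assumed rule: enough jobs to cover total_time, rounding up."""
    total = slurm_time_to_seconds(total_time)
    per_job = slurm_time_to_seconds(max_time_per_job)
    return math.ceil(total / per_job)


jobs = number_of_jobs("3-00:00:00", "12:00:00")
print(jobs)  # 6
```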

Running inference

Inference can be run automatically once training has finished. Like the training job, the inference job needs a config name and an sbatch YAML file, which are specified with --inference_config_name (-i) and --inference_job_yaml (-j), respectively:

run-anemoi anemoi/config/config.yaml job.yaml -i inference.yaml -j inference_job.yaml

Use the --inference_python_script argument to change the name of the inference script from the default, inference.py.

Modifying config on-the-fly

Config overrides can be passed as command line arguments:

run-anemoi anemoi/config/config.yaml job.yaml diagnostics.plot.enabled=False

This is particularly useful for submitting a series of experiments that differ only by small changes in the config:

for NCHANNELS in 256 512
do
    run-anemoi aifs/config/config.yaml job.yaml model.num_channels=$NCHANNELS
done

In Python, use the modify_config method:

from autorun_anemoi import AutoRunAnemoi

obj = AutoRunAnemoi('aifs/config/config.yaml', 'job.yaml')
for i in [256, 512]:
    obj.modify_config(f'model.num_channels={i}')
    obj.run()
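
To make the dotted override syntax concrete, the following sketch shows one way a string such as model.num_channels=512 could be applied to a nested config. The function apply_override and the sample config are hypothetical, and the value parsing is an assumption; autorun-anemoi may handle overrides differently.

```python
import ast
import copy


def apply_override(config, override):
    """Return a copy of config with a 'dotted.key=value' override applied.

    Hypothetical helper for illustration only.
    """
    key_path, raw_value = override.split("=", 1)
    try:
        # Assumption: values that look like Python literals (512, False, 1e-4)
        # are converted; everything else is kept as a string.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value

    updated = copy.deepcopy(config)  # leave the original config untouched
    node = updated
    *parents, leaf = key_path.split(".")
    for name in parents:
        # Create intermediate levels if they do not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = value
    return updated


# Made-up config fragment mirroring the keys used in the examples above.
base = {
    "model": {"num_channels": 1024},
    "diagnostics": {"plot": {"enabled": True}},
}

variants = [apply_override(base, f"model.num_channels={n}") for n in (256, 512)]
no_plots = apply_override(base, "diagnostics.plot.enabled=False")
print(variants[0]["model"], variants[1]["model"], no_plots["diagnostics"])
```

Working on a deep copy keeps each experiment in a sweep independent of the others.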

Help

run-anemoi --help
