Bind max steps and lr iterations #67

Open
wants to merge 8 commits into develop
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -37,6 +37,7 @@ Keep it human-readable, your future self will thank you!
### Changed

- Updated configuration examples in documentation and corrected links - [#46](https://github.com/ecmwf/anemoi-training/pull/46)
- Modified training configuration to support max_steps and tied lr iterations to max_steps by default - [#67](https://github.com/ecmwf/anemoi-training/pull/67)

## [0.1.0 - Anemoi training - First release](https://github.com/ecmwf/anemoi-training/releases/tag/0.1.0) - 2024-08-16

7 changes: 5 additions & 2 deletions src/anemoi/training/config/training/default.yaml
@@ -46,10 +46,13 @@ rollout:
# maximum rollout to use
max: 1

max_epochs: 200
# Set max_epochs or max_steps. Training stops at the first limit reached.
max_epochs: null
max_steps: 150000

lr:
Contributor
I think having this functionality is great, so thanks @Rilwan-Adewoyin for implementing it!
Just a quick question: what happens if the user passes both max_steps and max_epochs? Will the code then run until max_epochs is reached while the scheduler uses max_steps?
My two cents: it would probably be nice to add some logger info to indicate this!

Member Author
Hey, thanks!
The default PyTorch Lightning behaviour is that the code will run until the first of max_steps or max_epochs is reached, but the scheduler will be aligned to max_steps.

Yep, that sounds like a good idea: add a logger.info message when the user sets both.
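A rough sketch of what such a message could look like (hypothetical, not part of this PR; the helper name is made up, and the config attributes follow the ones shown in default.yaml):

```python
import logging

LOGGER = logging.getLogger(__name__)


def note_if_both_limits_set(config) -> None:
    """Log a note when both max_epochs and max_steps are configured.

    PyTorch Lightning stops at whichever limit is reached first, while the
    LR scheduler in this PR is tied to max_steps.
    """
    max_epochs = config.training.max_epochs
    max_steps = config.training.max_steps
    if max_epochs is not None and max_steps:
        LOGGER.info(
            "Both max_epochs (%s) and max_steps (%s) are set: training stops at the "
            "first limit reached, but the LR scheduler is aligned to max_steps.",
            max_epochs,
            max_steps,
        )
```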

Member
@Rilwan-Adewoyin did you have a chance to implement the logger?

rate: 0.625e-4 #local_lr
iterations: 300000
iterations: ${training.max_steps} # NOTE: if training stops early at max_epochs, the scheduler is still scheduled over max_steps
min: 3e-7 #Not scaled by #GPU

# Changes in per-gpu batch_size should come with a rescaling of the local_lr
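For readers unfamiliar with the `${...}` syntax above: it is an OmegaConf interpolation, so `lr.iterations` now resolves to whatever `training.max_steps` is set to. A minimal standalone sketch of that resolution (illustrative only, not code from the PR; values mirror default.yaml):

```python
from omegaconf import OmegaConf

# Illustrative config mirroring the relevant part of default.yaml.
config = OmegaConf.create(
    {
        "training": {
            "max_epochs": None,
            "max_steps": 150000,
            "lr": {"iterations": "${training.max_steps}"},
        }
    }
)

# The interpolation resolves against the config root.
assert config.training.lr.iterations == 150000
```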
1 change: 1 addition & 0 deletions src/anemoi/training/train/train.py
@@ -323,6 +323,7 @@ def train(self) -> None:
num_nodes=self.config.hardware.num_nodes,
precision=self.config.training.precision,
max_epochs=self.config.training.max_epochs,
max_steps=self.config.training.max_steps or -1,
logger=self.loggers,
log_every_n_steps=self.config.diagnostics.log.interval,
# run a fixed no of batches per epoch (helpful when debugging)
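As a side note on the `or -1` above: PyTorch Lightning's Trainer uses -1 (not None) to mean "no step limit", so a null max_steps from the config has to be mapped to -1 before the Trainer is built. A standalone sketch of the resulting call (not code from the repository; assumes the standard pytorch_lightning package, and the values are illustrative):

```python
import pytorch_lightning as pl

max_epochs = None    # e.g. training.max_epochs from the config
max_steps = 150000   # e.g. training.max_steps from the config

trainer = pl.Trainer(
    max_epochs=max_epochs,      # None leaves the epoch limit to Lightning's default handling
    max_steps=max_steps or -1,  # None/0 is mapped to -1, i.e. no step limit
)
# Training stops at whichever of the two limits is reached first.
```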