
Refactor Callbacks #60

Open · HCookie wants to merge 47 commits into base: develop
Conversation

@HCookie (Member) commented Sep 24, 2024

  • Split into separate files
  • Use list in config to add callbacks
  • Provide legacy config enabled approach
  • Fix ruff issues

New Usage

Set config.diagnostics.callbacks to a list of callback names to include
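For illustration, a config in the new style could look like the sketch below. The exact file location and the second callback entry are assumptions, not taken from this PR; the `_target_` list form matches the `LongRolloutPlots` entry that appears later in the diff:

```yaml
# Hypothetical sketch of the new list-based callback config.
# The LearningRateMonitor entry is an illustrative assumption.
diagnostics:
  callbacks:
    - _target_: anemoi.training.diagnostics.callbacks.plot.LongRolloutPlots
    - _target_: pytorch_lightning.callbacks.LearningRateMonitor
```

Callbacks not in the list would simply not be instantiated, replacing the old per-callback `enabled` flags.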

Closes #59


📚 Documentation preview 📚: https://anemoi-training--60.org.readthedocs.build/en/60/

- Split into separate files
- Use list in config to add callbacks
- Provide legacy config enabled approach
- Fix ruff issues
@HCookie HCookie self-assigned this Sep 24, 2024
@FussyDuck commented Sep 24, 2024

CLA assistant check
All committers have signed the CLA.

@HCookie (Member, Author) commented Sep 24, 2024

At the moment this is the proposed refactor; I have yet to run an exhaustive test of the changes.

@HCookie HCookie removed the request for review from JesperDramsch September 24, 2024 09:57
@JesperDramsch (Member) commented
Great work, thank you for taking this on.

I was thinking that it might be nice to make this fully configurable through instantiate.

For example, no one is really using the stochastic weight averaging as far as I know, so having specific config entries for this is a bit of feature bloat.

Then the list of callbacks would just look like this:

callbacks:
  swa:
    _target_: pytorch_lightning.callbacks.stochastic_weight_avg.StochasticWeightAveraging
    swa_lr: 1e-4
    swa_epoch_start: 123
    annealing_epochs: 5
    annealing_strategy: cos
    device: null
  blabla:
    _target_: blabla_callback
    blabla: bla

This makes it more extensible and actually removes some of our less-used config entries.

Additionally, we can keep the standard callbacks, like model checkpoints as "permanent callback" (I don't think we have to make everything optional).

One idea I also had is that we could make a special list for "plot_callbacks" in the same style. Then we can easily keep the super convenient "plots.enabled = False" as a shortcut to disable them?
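As a hedged sketch of that idea, the separate plot-callback list with the `plots.enabled` shortcut might look like this (all keys here are hypothetical, not the merged schema):

```yaml
# Hypothetical config sketch; key names are illustrative assumptions.
callbacks:
  swa:
    _target_: pytorch_lightning.callbacks.StochasticWeightAveraging
    swa_lrs: 1e-4
plot_callbacks:
  enabled: True  # setting this False would disable every plot callback at once
  rollout:
    _target_: anemoi.training.diagnostics.callbacks.plot.LongRolloutPlots
```

The `enabled` flag keeps the convenient single-switch behaviour while each entry stays fully instantiate-driven.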

@HCookie HCookie marked this pull request as ready for review September 25, 2024 09:41
@JesperDramsch (Member) left a comment

Hi @HCookie, thanks for taking on the callbacks!

It's already much better, great work on that. I think we can take the refactor even further and make the callbacks (almost?) fully modular, which would be incredible for future extensibility.

One comment regarding the file names: so far we haven't been using <xyz>-ing.py as a naming convention. "checkpointing" in particular would be confusing alongside activation checkpointing (although that is, and will stay, confusing honestly). Can we rename these, please?

Resolved review threads (outdated):
  • src/anemoi/training/diagnostics/callbacks/plotting.py
  • src/anemoi/training/diagnostics/callbacks/learning_rate.py
  • src/anemoi/training/diagnostics/callbacks/__init__.py (6 threads)
  • src/anemoi/training/config/diagnostics/eval_rollout.yaml
pre-commit-ci bot and others added 7 commits October 2, 2024 10:54
- Prefill config with callbacks
- Warn on deprecations for old config
- Expand config enabled
- Add back SWA
- Fix logging callback
- Add flag to disable checkpointing
- Add testing
[feature] Fix trainable attribute callbacks
@HCookie HCookie dismissed JesperDramsch’s stale review October 23, 2024 15:55

JesperDramsch is currently absent

CHANGELOG.md (outdated thread, resolved)
@JPXKQX (Member) previously approved these changes Oct 25, 2024 and left a comment

A long-awaited refactor. Thanks, Harrison, for the amazing work! I have tested with 1 GPU and 2 GPUs (with num_gpus_per_model=2), and all plots in the "detailed" option are produced as expected.

- 2t
- 10u
- 10v
- _target_: anemoi.training.diagnostics.callbacks.plot.LongRolloutPlots
A reviewer (Member) commented:

As long as dataloader.validation_rollout=1, which is the default, this callback only increases the runtime without providing any additional plots. Should we move it into rollout_eval.yaml?

@HCookie (Member, Author) Oct 25, 2024

We could provide a rollout plots configuration?
Addressed in f1d883f

Successfully merging this pull request may close: Refactor Callbacks.
7 participants