
Mlflow benchmark profiler update #38

Open
wants to merge 41 commits into base: develop

Conversation

@anaprietonem (Contributor) commented Aug 23, 2024

Describe your changes
PR to update the benchmark profiler to be able to:

  • Run and track results with MLflow (including a summary of system metrics)
  • Add the training and validation rates as metrics monitored by MLflow
  • Replace the memory profiler that used memray with the PyTorch profiler
  • Include an option to generate a model summary using torchinfo (already a listed dependency)
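As a rough illustration of the last item, a minimal torchinfo sketch; the toy model and input size here are placeholders, not the code added by this PR:

```python
import torch
from torchinfo import summary

# Placeholder model standing in for the anemoi-training model.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 64),
)

# Prints a per-layer table with output shapes, parameter counts and
# estimated memory usage.
summary(model, input_size=(32, 128))
```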

Please also include relevant motivation and context.
The previous version of the benchmark profiler did not allow configuring which reports to generate and did not make use of the PyTorch profiler. This PR addresses those issues: it lets the user choose which reports to generate, exploits the features of the PyTorch profiler for a more in-depth breakdown of GPU/CPU memory usage, and adds the option to generate a model summary.
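For context, a minimal self-contained sketch of the kind of PyTorch (Kineto) profiler usage described here; the model, tensor shapes and trace file name are placeholders rather than the actual profiler code in this PR:

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(1024, 1024).cuda()
inputs = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,   # track per-operator CPU/GPU memory usage
    record_shapes=True,
) as prof:
    model(inputs).sum().backward()

# Aggregated per-operator breakdown, sorted by self CUDA memory usage.
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

# Chrome/Kineto trace that can be inspected later or fed to Holistic Trace Analysis.
prof.export_chrome_trace("trace.json")
```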

List any dependencies that are required for this change.
When running the memory report generation, it is possible to perform a Holistic Trace Analysis, but that requires installing the HolisticTraceAnalysis package (https://hta.readthedocs.io/en/latest/source/intro/installation.html), so this is now included as an extra for the profiler in setup.py.
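For reference, a minimal sketch of how Holistic Trace Analysis can be run on traces exported by the PyTorch profiler; the trace directory is a hypothetical path and this is not the code added in this PR:

```python
# Requires the optional extra: pip install HolisticTraceAnalysis
from hta.trace_analysis import TraceAnalysis

# Hypothetical directory containing the Kineto/Chrome traces exported by
# the PyTorch profiler (typically one JSON file per rank).
analyzer = TraceAnalysis(trace_dir="profiler_traces/")

# Example analysis: breakdown of compute / non-compute / idle time per rank.
temporal_breakdown_df = analyzer.get_temporal_breakdown()
print(temporal_breakdown_df)
```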

Type of change
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update*

Issue ticket number and link
We did not create a ticket when we first started working on this task. Will do next time, apologies!

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have not introduced new dependencies in the inference portion of the model
  • I have run this on a single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

*I still need to update the Confluence page, but the docstrings and comments are updated.

Tag possible reviewers
You can @-tag people to review this PR in addition to formal review requests.
@cathalobrien @mchantry @JesperDramsch

[PR Migrated from aifs-mono to anemoi-training]


📚 Documentation preview 📚: https://anemoi-training--38.org.readthedocs.build/en/38/

@cathalobrien

Hi, I tested this branch and the various profilers all worked as expected. Nice work!

At first, some of the memory profiler output can be confusing: it shows negative numbers, and positive numbers greater than the max device memory. It turns out this is because the profiler tracks deallocations and aggregates allocations across the entire program runtime. Maybe we could add a note above the memory profiler output explaining this. Or, maybe better, link to the profiler documentation in stdout, as I understand there are more options that can be enabled if the user digs into the code.

| Name | ... | CUDA Mem | Self CUDA Mem | # of Calls | ...
| autograd::engine::evaluate_function: AddmmBackward0 | ... | 72.27 Gb | -44.86 Gb | 340 | ...
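One possible complementary view of the same data, as a minimal sketch assuming a recent PyTorch (roughly 2.1+, where export_memory_timeline is available); the model and output path are placeholders:

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
) as prof:
    model(x).sum().backward()

# Unlike the aggregated per-operator table above (where counting deallocations
# makes "Self CUDA Mem" negative), the memory timeline plots allocations over
# time, so the numbers stay within device memory.
prof.export_memory_timeline("memory_timeline.html", device="cuda:0")
```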

@anaprietonem anaprietonem added the enhancement New feature or request label Sep 13, 2024
Comment on lines 7 to 11

# * [WHY ARE CALLBACKS UNDER __init__.py?]
# * This functionality will be restructured in the near future
# * so for now callbacks are under __init__.py

Collaborator:
Harrison has opened a PR for this, so I think we can delete it.

CHANGELOG.md Outdated
@@ -26,6 +26,8 @@ Keep it human-readable, your future self will thank you!
- Feature: Add configurable models [#50](https://github.com/ecmwf/anemoi-training/pulls/50)
- Feature: Support training for datasets with missing time steps [#48](https://github.com/ecmwf/anemoi-training/pulls/48)
- Long Rollout Plots
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report
Member:
Tag the PR please.

del args
def run(self, args: list[str], unknown_args: list[str] | None = None) -> None:
# This will be picked up by the logger
os.environ["ANEMOI_PROFILER_CMD"] = f"{sys.argv[0]} {args.command}"
@gmertes (Member), Oct 25, 2024:
Does this need to be a specific one for the profiler? I think we can just reuse the ANEMOI_TRAINING_CMD env var?

The "training" in that name doesn't need to refer to "train". It could just be "the command that anemoi-training was run with".

@anaprietonem (Contributor, Author), Oct 25, 2024:
Yes, I wanted to check that! I first kept the two of them just to confirm everything was working, which it does. Right now there is also quite a bit of repeated code between the profiler and train commands, so I was thinking the Profiler command could inherit directly from Train to avoid repeating _merge_sysargv and the other functions. Setting the command as an environment variable could even go into a small helper function, so if I inherit I don't need to write it again. What do you think? (I have not looked closely at the details of the Command class, so I'd like to check whether inheritance would be okay here or is not advised in this case.)
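Roughly what the inheritance idea could look like; the class and method names (TrainCmd, ProfileCmd, _merge_sysargv, ANEMOI_TRAINING_CMD) follow the names mentioned in this thread but are otherwise invented for illustration and are not the actual anemoi-training code:

```python
import argparse
import os
import sys


class TrainCmd:
    def _merge_sysargv(self, args: argparse.Namespace) -> str:
        """Reconstruct the full command line for logging."""
        return f"{sys.argv[0]} {args.command}"

    def _export_cmd_env(self, args: argparse.Namespace) -> None:
        # Single place where the command is exposed to the logger.
        os.environ["ANEMOI_TRAINING_CMD"] = self._merge_sysargv(args)

    def run(self, args: argparse.Namespace) -> None:
        self._export_cmd_env(args)
        ...  # launch training


class ProfileCmd(TrainCmd):
    """Reuses the command-line plumbing from TrainCmd; only the entry point differs."""

    def run(self, args: argparse.Namespace) -> None:
        # Same env var as training, so no separate ANEMOI_PROFILER_CMD is needed.
        self._export_cmd_env(args)
        ...  # launch training with the profiler callbacks enabled
```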

mchantry previously approved these changes Oct 25, 2024
@mchantry (Member) left a comment:

LGTM thanks very much.

Labels: enhancement (New feature or request)
6 participants