Mlflow benchmark profiler update #38
base: develop
Conversation
…uency-ignoring-config-settings
Hi, I tested this branch and the various profilers all worked as expected. Nice work! At first, some of the memory profiler output can be confusing: it shows negative numbers, and positive numbers greater than max device memory. It turns out this is because the profiler tracks deallocations and aggregates allocations across the entire program runtime. Maybe we could add a note above the memory profiler output explaining this. Or, maybe better, link to the documentation for the profiler in stdout, as I understand there are more options which can be enabled if the user digs into the code.

| Name | ... | CUDA Mem | Self CUDA Mem | # of Calls | ... |
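To see why those columns can look impossible, here is a toy sketch (the event trace is invented for illustration): summing only allocations over the whole run overshoots device memory, and an op that mostly frees shows a negative total, while the meaningful number is the running peak.

```python
# Toy allocation trace (bytes): positive = allocation, negative = deallocation.
events = [4_000, 4_000, -4_000, 4_000, -8_000]

total_allocated = sum(e for e in events if e > 0)  # 12_000: exceeds the peak
net = sum(events)                                  # 0: everything was freed

# The number users usually care about is the running peak:
peak, running = 0, 0
for e in events:
    running += e
    peak = max(peak, running)
print(total_allocated, net, peak)  # 12000 0 8000
```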
# * [WHY ARE CALLBACKS UNDER __init__.py?]
# * This functionality will be restructured in the near future
# * so for now callbacks are under __init__.py
Harrison has opened a PR for this, so I think we can delete it.
for more information, see https://pre-commit.ci
…nemoi-training into mlflow_benchmark_profiler_update
CHANGELOG.md
Outdated
@@ -26,6 +26,8 @@ Keep it human-readable, your future self will thank you!
- Feature: Add configurable models [#50](https://github.com/ecmwf/anemoi-training/pulls/50)
- Feature: Support training for datasets with missing time steps [#48](https://github.com/ecmwf/anemoi-training/pulls/48)
- Long Rollout Plots
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report
Please tag the PR in this changelog entry.
…nemoi-training into mlflow_benchmark_profiler_update
del args
def run(self, args: list[str], unknown_args: list[str] | None = None) -> None:
    # This will be picked up by the logger
    os.environ["ANEMOI_PROFILER_CMD"] = f"{sys.argv[0]} {args.command}"
Does this need to be a specific one for the profiler? I think we can just reuse the ANEMOI_TRAINING_CMD env var?
The "training" in that name doesn't need to refer to "train". It could just be "the command that anemoi-training was run with".
Yes, I wanted to check that! I first opted to have both env vars just to check that it was working, which it does. Right now there is also quite a bit of repeated code between the profiler and train commands, so I was thinking I could have the Profiler command inherit directly from Train, to avoid repeating `_merge_sysargv` and the other functions. Setting the command as an env variable could even go into a small helper function, so if I inherit I don't need to write it again. What do you think? (I haven't looked closely at the details of the Command class, so I'd like to check whether inheritance would be okay here or is not advised.)
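The refactor discussed above could look something like this minimal sketch; the class and method names (`Train`, `Profiler`, `_set_command_env`) and the reuse of `ANEMOI_TRAINING_CMD` are illustrative assumptions, not the actual Command-class API:

```python
import os
import sys

# Hypothetical sketch of the inheritance idea; names and the env var reuse
# are assumptions, not the real anemoi-training Command API.
class Train:
    def _set_command_env(self, command: str) -> None:
        # Record the full invocation so the logger can pick it up later.
        os.environ["ANEMOI_TRAINING_CMD"] = f"{sys.argv[0]} {command}"

    def run(self, args, unknown_args=None) -> None:
        self._set_command_env(args.command)
        # ... launch training ...

class Profiler(Train):
    # Inherits _merge_sysargv (not shown), _set_command_env, etc.,
    # overriding only what differs for a profiling run.
    def run(self, args, unknown_args=None) -> None:
        self._set_command_env(args.command)
        # ... launch the profiling run ...
```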
LGTM thanks very much.
Describe your changes
PR to update the benchmark profiler.
The previous version of the benchmark profiler did not allow configuring exactly which reports to generate and did not make use of the torch profiler. This PR addresses those issues: it lets the user choose which reports to generate, exploits the features available in the PyTorch profiler for a more in-depth breakdown of GPU/CPU memory usage, and adds the option to generate a model summary.
When running the memory report generation, it is possible to perform a Holistic Trace Analysis, but for that you need to install HTA (https://hta.readthedocs.io/en/latest/source/intro/installation.html), so this is now included as an extra for the profiler in setup.py.
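For reviewers, the extras entry could look like the fragment below; the extra's name (`profile`) and the package name are illustrative assumptions, not necessarily what setup.py ends up using:

```python
# Hypothetical fragment of setup.py; the extra's name and the package
# name are illustrative assumptions, not the actual configuration.
extras_require = {
    "profile": ["HolisticTraceAnalysis"],
}
```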
Issue ticket number and link
We did not create a ticket when we first started working on this task. Will do next time, apologies!
Checklist before requesting a review
*Still need to update the Confluence page, but docstrings and comments are updated.
Tag possible reviewers
@cathalobrien @mchantry @JesperDramsch
[PR Migrated from aifs-mono to anemoi-training]
📚 Documentation preview 📚: https://anemoi-training--38.org.readthedocs.build/en/38/