
Mlflow benchmark profiler update #38

Status: Open. Wants to merge 41 commits into base branch `develop`.
Changes shown from 35 of 41 commits.
6507424
fix: saving frequency bug for inference checkpoints
anaprietonem Aug 20, 2024
7d2d620
Merge branch 'develop' into 257-bug-inference-checkpoints-saving-freq…
anaprietonem Aug 20, 2024
0027046
chore: update CHANGELOG
anaprietonem Aug 20, 2024
8cf698b
feat: add anemoi profiler with mlflow compatibility
anaprietonem Aug 20, 2024
d647bf9
fix: format error
anaprietonem Aug 20, 2024
352cd29
fix: removed atos path from noteook and fixed update_paths function
anaprietonem Aug 23, 2024
c7ab208
add hta functionality in documentation
anaprietonem Oct 7, 2024
ebe33bd
updating docs for profiler
anaprietonem Oct 7, 2024
9c67f3e
update profiler docs
anaprietonem Oct 7, 2024
2bcf957
update profiler docs
anaprietonem Oct 7, 2024
2e6a168
update profiler docs
anaprietonem Oct 7, 2024
29232ce
update profiler docs
anaprietonem Oct 7, 2024
c646e38
update profiler docs
anaprietonem Oct 7, 2024
4d9610b
update profiler docs
anaprietonem Oct 7, 2024
0a4070c
update profiler docs
anaprietonem Oct 7, 2024
45e7a7b
update profiler docs
anaprietonem Oct 7, 2024
3cea9d9
update profiler docs
anaprietonem Oct 7, 2024
3c2f2d9
update profiler docs
anaprietonem Oct 7, 2024
b8fcf99
update profiler docs
anaprietonem Oct 7, 2024
80e5522
Merge branch 'develop' into mlflow_benchmark_profiler_update
anaprietonem Oct 7, 2024
990aea9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 7, 2024
5aeeca4
fixing pre-commits on docs
anaprietonem Oct 7, 2024
b85eac2
fix pre-commit docs
anaprietonem Oct 7, 2024
ef54ffb
fix pre-commit docs
anaprietonem Oct 7, 2024
56e222f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 7, 2024
4aa225a
minor updates
anaprietonem Oct 7, 2024
81b57d8
Merge branch 'mlflow_benchmark_profiler_update' of github.com:ecmwf/a…
anaprietonem Oct 7, 2024
86e58ba
added docs for anemoi profiler
anaprietonem Oct 7, 2024
e943782
add section about profiling in overview
anaprietonem Oct 8, 2024
e177bd6
add section about profiling in overview
anaprietonem Oct 8, 2024
328ca19
add comment to avoid confussion with profiler for troubleshooting
anaprietonem Oct 8, 2024
702287e
added note about limit batches
anaprietonem Oct 9, 2024
36dc645
Merge branch 'develop' into mlflow_benchmark_profiler_update
anaprietonem Oct 24, 2024
a7280ab
updated changelog
anaprietonem Oct 24, 2024
05289e4
making sure anemoi-training profiler commands works in interactive gp…
anaprietonem Oct 25, 2024
df76686
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 25, 2024
977c3e4
update docs
anaprietonem Oct 25, 2024
d71d7c1
Merge branch 'mlflow_benchmark_profiler_update' of github.com:ecmwf/a…
anaprietonem Oct 25, 2024
442dd9a
removed comment based on refactor callbacks PR
anaprietonem Oct 25, 2024
60368ae
adapted patchedProfile to not break HTA
anaprietonem Oct 25, 2024
9c50023
avoid code duplication in commands and fix copyright notice
anaprietonem Oct 25, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,11 @@ Keep it human-readable, your future self will thank you!

## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.2.1...HEAD)


### Added
- Feat: Anemoi Profiler compatible with MLflow, using the PyTorch (Kineto) Profiler for the memory report [#38](https://github.com/ecmwf/anemoi-training/pull/38/)


## [0.2.1 - Bugfix: resuming mlflow runs](https://github.com/ecmwf/anemoi-training/compare/0.2.0...0.2.1) - 2024-10-24

### Added
@@ -54,6 +59,7 @@ Keep it human-readable, your future self will thank you!
- Feature: `AnemoiMlflowClient`, an mlflow client with authentication support [#86](https://github.com/ecmwf/anemoi-training/pull/86)
- Long Rollout Plots


### Fixed

- Fix `TypeError` raised when trying to JSON serialise `datetime.timedelta` object - [#43](https://github.com/ecmwf/anemoi-training/pull/43)
Binary file added docs/images/profiler/anemoi_profiler_config.png
Binary file added docs/images/profiler/example_memory_report.png
Binary file added docs/images/profiler/example_memory_timeline.png
Binary file added docs/images/profiler/example_model_summary.png
Binary file added docs/images/profiler/example_model_summary_2.png
Binary file added docs/images/profiler/example_system_report.png
Binary file added docs/images/profiler/example_time_report.png
Binary file added docs/images/profiler/idle_time_breakdown.png
Binary file added docs/images/profiler/kernel_breakdown_dfs.png
Binary file added docs/images/profiler/kernel_breakdown_plots.png
Binary file added docs/images/profiler/memory_snapshot_output.png
Binary file added docs/images/profiler/temporal_breakdown.png
1 change: 1 addition & 0 deletions docs/index.rst
@@ -43,6 +43,7 @@ This package provides the *Anemoi* training functionality.
   user-guide/training
   user-guide/models
   user-guide/tracking
   user-guide/benchmarking
   user-guide/distributed
   user-guide/debugging

12 changes: 12 additions & 0 deletions docs/overview.rst
@@ -91,6 +91,18 @@ and resolve issues during the training process, including:
- Debug configurations for quick error identification
- Guidance on isolating and addressing common problems

8. Benchmarking and HPC Profiling
=================================

Anemoi Training offers tools and configurations to support benchmarking
and High-Performance Computing (HPC) profiling, allowing users to
optimize training performance. This includes:

- Benchmarking configurations for evaluating training efficiency across
  different hardware setups.
- Profiling tools for monitoring resource utilization (CPU, GPU,
  memory) and identifying performance bottlenecks.

**************************
Components and Structure
**************************
746 changes: 746 additions & 0 deletions docs/user-guide/benchmarking.rst

Large diffs are not rendered by default.

8 changes: 8 additions & 0 deletions pyproject.toml
@@ -76,6 +76,13 @@ optional-dependencies.docs = [
  "sphinx-argparse",
  "sphinx-rtd-theme",
]
optional-dependencies.profile = [
  "holistictraceanalysis>=0.2",
  "pandas>=1.3.2",
  "rich>=13.6",
  "tabulate>=0.9",
]

optional-dependencies.tests = [ "hypothesis", "pytest", "pytest-mock" ]

urls.Changelog = "https://github.com/ecmwf/anemoi-training/CHANGELOG.md"
@@ -86,6 +93,7 @@ urls.Repository = "https://github.com/ecmwf/anemoi-training/"
# command for interactive DDP (not supposed to be used directly)
# the dot is intentional, so it doesn't trigger autocomplete
scripts.".anemoi-training-train" = "anemoi.training.commands.train:main"
scripts.".anemoi-training-profiler" = "anemoi.training.commands.profiler:main"

# Add subcommand in the `commands` directory
scripts.anemoi-training = "anemoi.training.__main__:main"
85 changes: 85 additions & 0 deletions src/anemoi/training/commands/profiler.py
@@ -0,0 +1,85 @@
# (C) Copyright 2024 ECMWF.
#
# This software is licensed under the terms of the Apache Licence Version 2.0
# which can be obtained at http://www.apache.org/licenses/LICENSE-2.0.
# In applying this licence, ECMWF does not waive the privileges and immunities
# granted to it by virtue of its status as an intergovernmental organisation
# nor does it submit to any jurisdiction.

from __future__ import annotations

import logging
import os
import sys
from pathlib import Path
from typing import TYPE_CHECKING

from anemoi.training.commands import Command

if TYPE_CHECKING:
import argparse

LOGGER = logging.getLogger(__name__)


class Profiler(Command):
    """Commands to profile Anemoi models."""

    accept_unknown_args = True

    @staticmethod
    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        return parser

    def run(self, args: list[str], unknown_args: list[str] | None = None) -> None:
        # This will be picked up by the logger
        os.environ["ANEMOI_PROFILER_CMD"] = f"{sys.argv[0]} {args.command}"
Review comment: gmertes (Member), Oct 25, 2024:
Does this need to be a specific one for the profiler? I think we can just reuse the ANEMOI_TRAINING_CMD env var?
The "training" in that name doesn't need to refer to "train". It could just be "the command that anemoi-training was run with".

Reply: anaprietonem (Contributor, Author), Oct 25, 2024:
Yes, I wanted to check that! I first opted to have both of them just to verify it was working fine, which it does! Right now there is also quite a bit of repeated code across the profiler and train commands, so I was thinking I could inherit from Train for the Profiler command to avoid repeating _merge_sysargv and the other functions. Setting the command as an env variable could even go into a small function, so if I inherit I don't need to write it again. What do you think? (I have not looked closely at the details of the Command class, so I would like to check whether inheritance would be okay or is not advised in this case.)
        # Merge the known subcommands with a non-whitespace character for hydra
        new_sysargv = self._merge_sysargv(args)

        # Add the unknown arguments (belonging to hydra) to sys.argv
        if unknown_args is not None:
            sys.argv = [new_sysargv, *unknown_args]
        else:
            sys.argv = [new_sysargv]

        # Import and run the profiler command
        LOGGER.info("Running anemoi profiling command with overrides: %s", sys.argv[1:])
        main()

    def _merge_sysargv(self, args: argparse.Namespace) -> str:
        """Merge the sys.argv with the known subcommands to pass to hydra.

        Parameters
        ----------
        args : argparse.Namespace
            args from the command line

        Returns
        -------
        str
            Modified sys.argv as string
        """
        argv = Path(sys.argv[0])

        # this will turn "/env/bin/anemoi-training train" into "/env/bin/.anemoi-training-train"
        # the dot at the beginning is intentional to not interfere with autocomplete
        modified_sysargv = argv.with_name(f".{argv.name}-{args.command}")

        if hasattr(args, "subcommand"):
            # Path does not support "+=" with a string, so extend the file name instead
            modified_sysargv = modified_sysargv.with_name(f"{modified_sysargv.name}-{args.subcommand}")
        return str(modified_sysargv)


def main() -> None:
    # Use the environment variable to check if main is being called from the subcommand, not from the ddp entrypoint
    if not os.environ.get("ANEMOI_PROFILER_CMD"):
        error = "This entrypoint should not be called directly. Use `anemoi-training profiler` instead."
        raise RuntimeError(error)

    from anemoi.training.train.profiler import main as anemoi_profiler

    anemoi_profiler()


command = Profiler
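The name mangling done by `_merge_sysargv` can be reproduced in isolation. This is a minimal sketch, not the packaged command class; the helper name and standalone signature are illustrative:

```python
from pathlib import Path


def merge_sysargv(executable: str, command: str, subcommand=None) -> str:
    # Mirrors Profiler._merge_sysargv: "/env/bin/anemoi-training" plus the
    # subcommand "profiler" becomes "/env/bin/.anemoi-training-profiler".
    # The leading dot keeps the hidden entrypoint out of shell autocomplete.
    argv = Path(executable)
    modified = argv.with_name(f".{argv.name}-{command}")
    if subcommand is not None:
        # Path objects do not support "+=" with a string, so extend the name
        modified = modified.with_name(f"{modified.name}-{subcommand}")
    return str(modified)


print(merge_sysargv("/env/bin/anemoi-training", "profiler"))
# /env/bin/.anemoi-training-profiler
```

This matches the hidden `scripts.".anemoi-training-profiler"` entry added to pyproject.toml above, which is why the dotted name must be constructed exactly.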
22 changes: 22 additions & 0 deletions src/anemoi/training/config/diagnostics/eval_rollout.yaml
@@ -57,6 +57,28 @@ debug:
  # remember to also activate the tensorboard logger (below)
  profiler: False

# Use anemoi-profile to profile the training process
benchmark_profiler:
  memory:
    enabled: True
    steps: 5 # wait warmup steps and then do steps (too many steps would lead to a big file)
    warmup: 2
    extra_plots: False
    trace_rank0_only: False # set to True to profile rank 0 only; reads SLURM_PROC_ID, so it won't work when not running via Slurm
  time:
    enabled: True
    verbose: False # if True, output every action the profiler captures; otherwise output a subset defined in PROFILER_ACTIONS at the top of aifs/diagnostics/profiler.py
  speed:
    enabled: True
  system:
    enabled: True
  model_summary:
    enabled: True
  snapshot:
    enabled: True
    steps: 4 # wait warmup steps and then do steps
    warmup: 0

checkpoint:
  every_n_minutes:
    save_frequency: 30 # Approximate, as this is checked at the end of training steps
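The snapshot `steps` and `warmup` values interact with the dataloader batch size: the snapshot callback asserts that both the warmup and the total step count are multiples of the training batch size. A standalone sketch of that check (a hypothetical helper mirroring the callback's assertions, not the real API):

```python
def validate_snapshot_schedule(steps: int, warmup: int, batch_size: int) -> int:
    """Validate a snapshot schedule the way the snapshot callback does.

    Returns the global step at which recording stops (warmup + steps).
    """
    total = steps + warmup  # be consistent with the profiler scheduler
    if total % batch_size != 0:
        raise ValueError("Snapshot steps is not a multiple of batch size")
    if warmup % batch_size != 0:
        raise ValueError("Snapshot warmup steps is not a multiple of batch size")
    return total


# With the defaults above (steps: 4, warmup: 0) and a training batch size of 2:
print(validate_snapshot_schedule(4, 0, 2))  # 4
```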
2 changes: 2 additions & 0 deletions src/anemoi/training/config/training/default.yaml
@@ -19,6 +19,8 @@ multistep_input: 2
# the effective batch size becomes num-devices * batch_size * k
accum_grad_batches: 1

num_sanity_val_steps: 6

# clip gradients, 0 : don't clip, default algorithm: norm, alternative: value
gradient_clip:
  val: 32.
66 changes: 66 additions & 0 deletions src/anemoi/training/diagnostics/callbacks/__init__.py
@@ -37,6 +37,7 @@
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint
from pytorch_lightning.utilities import rank_zero_only
from pytorch_lightning.utilities.types import STEP_OUTPUT

from anemoi.training.diagnostics.plots import init_plot_settings
from anemoi.training.diagnostics.plots import plot_graph_features
@@ -885,6 +886,71 @@ def on_load_checkpoint(
        pl_module.hparams["metadata"]["parent_uuid"] = checkpoint["hyper_parameters"]["metadata"]["uuid"]


class MemorySnapshotRecorder(Callback):
    """Record memory snapshot using torch.cuda._record_memory_history()."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.dirpath = Path(self.config.hardware.paths.profiler)

        self.warmup = self.config.diagnostics.benchmark_profiler.snapshot.warmup
        if not self.warmup:
            self.warmup = 0
        self.num_steps = (
            self.config.diagnostics.benchmark_profiler.snapshot.steps + self.warmup
        )  # be consistent with the profiler scheduler
        self.status = False

        assert (
            self.num_steps % self.config.dataloader.batch_size.training == 0
        ), "Snapshot steps is not a multiple of batch size"
        assert (
            self.warmup % self.config.dataloader.batch_size.training == 0
        ), "Snapshot warmup steps is not a multiple of batch size"

    @rank_zero_only
    def _start_snapshot_recording(self):
        LOGGER.info("Starting snapshot record_memory_history")
        torch.cuda.memory._record_memory_history()
        self.status = True

    @rank_zero_only
    def _save_snapshot(self):
        self.memory_snapshot_fname = self.dirpath / "memory_snapshot.pickle"
        try:
            LOGGER.info("Saving memory snapshot to %s", self.memory_snapshot_fname)
            torch.cuda.memory._dump_snapshot(f"{self.memory_snapshot_fname}")
        except Exception:
            LOGGER.exception("Failed to capture memory snapshot")

    @rank_zero_only
    def stop_record_memory_history(self) -> None:
        LOGGER.info("Stopping snapshot record_memory_history")
        torch.cuda.memory._record_memory_history(enabled=None)

    def on_train_batch_start(
        self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", batch: Any, batch_idx: int
    ) -> None:
        if trainer.global_step == self.warmup:
            self._start_snapshot_recording()

    def on_train_batch_end(
        self,
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        outputs: STEP_OUTPUT,
        batch: Any,
        batch_idx: int,
    ) -> None:
        if trainer.global_step == self.num_steps:
            if self.status:
                self._save_snapshot()
                self.stop_record_memory_history()
            else:
                LOGGER.info("Snapshot recording was not started, so no snapshot was saved")


class AnemoiCheckpoint(ModelCheckpoint):
"""A checkpoint callback that saves the model after every validation epoch."""

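The start/stop triggers of the snapshot callback can be illustrated without Lightning or CUDA. This is a sketch of the scheduling logic only; `SnapshotSchedule` is a hypothetical stand-in, and the real callback calls `torch.cuda.memory._record_memory_history` and `_dump_snapshot` at these points:

```python
class SnapshotSchedule:
    """Sketch of MemorySnapshotRecorder's triggers: recording starts at
    global step == warmup and the snapshot is saved at warmup + steps."""

    def __init__(self, steps: int, warmup: int):
        self.warmup = warmup
        self.num_steps = steps + warmup
        self.recording = False
        self.saved = False

    def on_batch_start(self, global_step: int) -> None:
        if global_step == self.warmup:
            self.recording = True  # would call _record_memory_history()

    def on_batch_end(self, global_step: int) -> None:
        if global_step == self.num_steps and self.recording:
            self.saved = True  # would call _dump_snapshot(...)
            self.recording = False  # would call _record_memory_history(enabled=None)


sched = SnapshotSchedule(steps=4, warmup=2)
for step in range(8):
    sched.on_batch_start(step)
    sched.on_batch_end(step)
print(sched.saved)  # True
```

A run that stops before `warmup + steps` global steps never reaches the save trigger, which is why the docs warn about `limit_batches` interacting with the snapshot settings.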
4 changes: 3 additions & 1 deletion src/anemoi/training/diagnostics/mlflow/logger.py
@@ -377,7 +377,9 @@ def _get_mlflow_run_params(
        tags = {"projectName": project_name}

        # create a tag with the command used to run the script
        command = os.environ.get("ANEMOI_TRAINING_CMD") or os.environ.get(
            "ANEMOI_PROFILER_CMD", sys.argv[0],
        )
        tags["command"] = command.split("/")[-1]  # get the python script name
        tags["mlflow.source.name"] = command
        if len(sys.argv) > 1:
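The intended lookup order for the command tag (the training command first, then the profiler command, then the script path) can be sketched as a chained lookup. This is a hypothetical helper illustrating the fallback semantics, not the logger's actual code:

```python
def resolve_command_tag(environ: dict, default: str) -> str:
    # First non-empty value wins; missing or empty env vars fall through.
    # Note that `get(key, default)` with a non-empty default would short-circuit
    # the chain, so the defaults are applied only at the end.
    return (
        environ.get("ANEMOI_TRAINING_CMD")
        or environ.get("ANEMOI_PROFILER_CMD")
        or default
    )


print(resolve_command_tag({"ANEMOI_PROFILER_CMD": "anemoi-training profiler"}, "train.py"))
# anemoi-training profiler
```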