
Feature/improve dataloader memory #76

Open · wants to merge 10 commits into develop from feature/improve-dataloader-memory
Conversation

@japols (Member) commented Oct 9, 2024

Describe your changes

This PR adds a configurable read_frequency (config.dataloader.read_frequency) that defines how many GPUs per model communication group share a single read of the (same) data. Increasing the read_frequency substantially reduces CPU memory usage, because the dataloaders no longer redundantly load identical data.

The model communication group is further subdivided into reader groups of size read_frequency. For each reader group, only rank 0 reads data from the dataloader and communicates it to the rest via broadcast.
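
To make the mechanism concrete, here is a minimal sketch of the read-and-broadcast pattern (illustrative only, not the code in this PR; the helper name, the shape/dtype exchange, and the way the reader group is passed in are assumptions):

import torch
import torch.distributed as dist


def read_and_broadcast_batch(dataloader_iter, reader_group, reader_group_rank, device):
    """Only rank 0 of each reader group reads from the dataloader; the others receive a broadcast."""
    # global rank of this reader group's root (rank 0 within the group)
    src = dist.get_global_rank(reader_group, 0)

    if reader_group_rank == 0:
        batch = next(dataloader_iter).to(device)
        meta = [(tuple(batch.shape), batch.dtype)]
    else:
        batch, meta = None, [None]

    # share shape/dtype first so non-root ranks can allocate a receive buffer
    dist.broadcast_object_list(meta, src=src, group=reader_group)
    if reader_group_rank != 0:
        shape, dtype = meta[0]
        batch = torch.empty(shape, dtype=dtype, device=device)

    dist.broadcast(batch, src=src, group=reader_group)
    return batch

With reader groups of size read_frequency, each batch is held in CPU memory once per group rather than once per GPU, which is where the memory saving comes from.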

The following experiments on n320 show that CPU memory usage goes down as we increase the read_frequency:

[Screenshot: CPU memory usage on n320 decreasing with increasing read_frequency]

MLFlow

This additional broadcasting step doesn't affect runtime (time spent waiting for broadcast would otherwise be spent loading data):

read_frequency    avg_epoch_time (s)
1                 130.43
2                 129.15
4                 128.78

Measured over 10 epochs at 100 steps each.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have not introduced new dependencies in the inference portion of the model
  • I have run this on single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this to work on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

Tag possible reviewers

@ssmmnn11 @mishooax @theissenhelen @JesperDramsch @sahahner @mchantry

@FussyDuck commented Oct 9, 2024

CLA assistant check: all committers have signed the CLA.

@ssmmnn11 (Member) left a comment

Great work! Would be nice to clean up the group creation a bit more and consolidate everything to the strategy :-).

@@ -124,6 +125,23 @@ def __init__(
config.hardware.num_gpus_per_node * config.hardware.num_nodes / config.hardware.num_gpus_per_model,
)

Member:

Do we actually need any of these here? Same for the model_comm_group etc. above. I think these are properly initialised by the strategy, which uses model.set_model_comm_group and set_reader_group. So it might be enough to initialise these to sensible default values when the model is not sharded.

Member Author (@japols):

I changed it to pass everything from the DDPGroupStrategy via model.set_model_comm_group() and set_reader_group().
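
For reference, a rough sketch of how the strategy could build the groups and register them on the model (the setter names come from the discussion above; the group layout, function name, and loop structure are assumptions, not the PR's actual code):

import torch.distributed as dist


def setup_comm_groups(model, global_rank, num_gpus_per_model, read_frequency):
    """Build model comm groups and reader sub-groups, then register them on the model."""
    world_size = dist.get_world_size()

    for start in range(0, world_size, num_gpus_per_model):
        model_ranks = list(range(start, start + num_gpus_per_model))
        # new_group is collective: every rank must create every group, in the same order
        model_group = dist.new_group(model_ranks)
        if global_rank in model_ranks:
            model.set_model_comm_group(model_group)

        # subdivide each model comm group into reader groups of size read_frequency
        for r_start in range(start, start + num_gpus_per_model, read_frequency):
            reader_ranks = list(range(r_start, r_start + read_frequency))
            reader_group = dist.new_group(reader_ranks)
            if global_rank in reader_ranks:
                model.set_reader_group(reader_group)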

else:
    # init batch tensor with correct shape on non-root ranks
    shape = (batch.shape[0],) + tuple(batch[0].tolist())
    batch = torch.zeros(shape, device=self.device)
Member:

Why not torch.empty? It probably doesn't matter...

Member Author (@japols):

You're right, there's no reason to use zeros instead of empty; changed it.

@@ -74,11 +74,19 @@ def __init__(self, config: DictConfig) -> None:
* self.config.hardware.num_nodes
// self.config.hardware.num_gpus_per_model
) # number of model communication groups

Member:

It would be nice to get model comm groups and model reader groups from the strategy / use the routines in the strategy to compute them, instead of having code here to re-compute the groups. This should be possible because we initialise the strategy before loading the datamodule.

Member Author (@japols):

The datamodule only needed these to pass them down to dataloader.dataset; I have now removed them from the datamodule entirely.

@@ -308,6 +308,7 @@ def strategy(self) -> DDPGroupStrategy:
"""Training strategy."""
return DDPGroupStrategy(
self.config.hardware.num_gpus_per_model,
self.config.dataloader.read_frequency,
Member:

Could we think of a way to make use of the routines/groups computed by the strategy in self.datamodule?

Member Author (@japols):

I managed to pass them to the dataloader directly via DDPGroupStrategy.process_dataloader(), which is called by pytorch_lightning in trainer.fit(model, datamodule).
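
For context, a sketch of what that can look like (process_dataloader is a standard PyTorch Lightning strategy hook; the dataset setter and strategy attributes below are illustrative, not necessarily what this PR implements):

from pytorch_lightning.strategies import DDPStrategy


class DDPGroupStrategy(DDPStrategy):
    """Simplified sketch; the real class carries more state and logic."""

    def process_dataloader(self, dataloader):
        # Lightning calls this hook on each dataloader before iteration starts,
        # so the strategy can hand its precomputed group layout to the dataset.
        dataset = dataloader.dataset
        dataset.set_comm_group_info(                   # hypothetical setter on the dataset
            reader_group_rank=self.reader_group_rank,  # hypothetical strategy attributes
            reader_group_size=self.reader_group_size,
        )
        return dataloader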

@japols force-pushed the feature/improve-dataloader-memory branch from 4cc8a78 to 3c6b5c9 on October 9, 2024 at 15:50
@japols added the enhancement (New feature or request) label on Oct 9, 2024
@gabrieloks (Contributor) commented Oct 10, 2024

Very nice feature, Jan. I tested this on a rollout run on n320. These runs are painful because we need to reduce the number of workers to avoid out-of-memory issues, which drastically reduces training speed.

But I did a test with your branch and the develop branch and the results are quite good!

Here is a comparison in terms of memory usage for num_workers = 6:

[Plot: CPU memory usage, develop branch vs. this branch, num_workers = 6]

Even better, the job on the develop branch actually crashes at the end of rollout=2, while with your new feature the rollout fine-tuning keeps going. This will considerably speed up rollout fine-tuning.

@mchantry

@mishooax (Member) left a comment

Very nice work, Jan! 👍

# get the grid shard size and start/end indices
grid_shard_size = self.grid_size // self.reader_group_size
self.grid_start = self.reader_group_rank * grid_shard_size
if self.reader_group_rank == self.reader_group_size - 1:
Member:

can this be shortened to

self.grid_end = min(self.grid_size, (self.reader_group_rank + 1) * grid_shard_size)

?

@@ -233,7 +280,11 @@ def __iter__(self) -> torch.Tensor:
start = i - (self.multi_step - 1) * self.timeincrement
end = i + (self.rollout + 1) * self.timeincrement

x = self.data[start : end : self.timeincrement]
if self.reader_group_size > 1:  # read only a subset of the grid
    x = self.data[start : end : self.timeincrement, :, :, self.grid_start : self.grid_end]
@mishooax (Member) commented Oct 25, 2024:

@japols I'm a bit puzzled by this: I get what you're doing here, but given the way the zarr is chunked on disk (chunk i == self.data[i]), wouldn't this imply that each worker still reads a full chunk (a time slice, i.e. all latlons) and then discards the points that are not in its shard?

Tagging @floriankrb in case I misunderstood how the zarr chunking is done on disk (or how the slice index is implemented in anemoi-datasets).

@@ -106,7 +105,7 @@ def initial_seed(self) -> int:
(torch.rand(1), np_rng.random())
LOGGER.debug(
"Initial seed: Rank %d, initial seed %d, running with random seed: %d",
int(os.environ.get("SLURM_PROCID", "0")),
self.strategy.global_rank,
Member:

Is self.strategy guaranteed to be correctly initialized at this point?

@@ -93,6 +87,7 @@ def __init__(
assert self.multi_step > 0, "Multistep value must be greater than zero."
self.ensemble_dim: int = 2
self.ensemble_size = self.data.shape[self.ensemble_dim]
self.grid_size = self.data.shape[-1]
Member:

It may be useful (easier to understand) if we defined a self.grid_dim = -1 and used that instead, like we do for the ensemble dim just above?
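
A minimal sketch of that suggestion, mirroring the ensemble-dim pattern just above (illustrative):

self.ensemble_dim: int = 2
self.ensemble_size = self.data.shape[self.ensemble_dim]
self.grid_dim: int = -1
self.grid_size = self.data.shape[self.grid_dim]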
