Running ALIGNN on Multi-GPUs #90

Open
aydinmirac opened this issue Jan 22, 2023 · 16 comments

@aydinmirac

Dear All,

I would like to run ALIGNN on multiple GPUs. When I checked the code, I could not find any option for this.

Is there any way to run ALIGNN on multiple GPUs, for example using PyTorch Lightning or PyTorch's DistributedDataParallel (DDP)?

Best regards,
Mirac

@bdecost
Collaborator

bdecost commented Jan 22, 2023

There is a config option to wrap the model in DistributedDataParallel, but I think that may be a dead code path at this point.

if config.distributed:

I think the most straightforward path is through ignite.distributed. Lightning would probably be pretty straightforward to use as well if you wrap our models and dataloaders (I tried using Lightning once, but it's been a while).

I have some experimental code for training on multiple GPUs that I think hasn't been merged into the main repo. It more or less followed the ignite.distributed CIFAR example. I don't think there was much required beyond wrapping the model, optimizer, and dataloader in ignite's auto distributed helpers, and fixing the training entry point to have the right call signature.
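
In rough outline, that pattern looks like this (untested sketch; the toy linear model and random dataset stand in for the ALIGNN model and graph dataloaders, which get wrapped the same way):

import ignite.distributed as idist
import torch
from torch.utils.data import TensorDataset

def training(local_rank, config):
    device = idist.device()

    # idist.auto_* adapts these to whatever backend Parallel was launched with
    model = idist.auto_model(torch.nn.Linear(16, 1))        # placeholder for the ALIGNN model
    optimizer = idist.auto_optim(torch.optim.AdamW(model.parameters(), lr=config["lr"]))
    dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))  # placeholder dataset
    loader = idist.auto_dataloader(dataset, batch_size=config["batch_size"], shuffle=True)

    for x, y in loader:
        x, y = x.to(device), y.to(device)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, {"lr": 1e-3, "batch_size": 64})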

Cleaning that up is actually on my near-term to-do list. Is there a particular form or use case for multi-GPU training/inference that you think would be useful?

@aydinmirac
Author

Hi @bdecost,

Thank you so much for your reply.

Let me try your solution. I will let you know how it goes as soon as possible.

I am using 2x RTX A4500 GPUs and distributing data with the following PyTorch Lightning code snippet. It is very simple and useful:

import pytorch_lightning as pl

# callbacks, logger, root_dir, task, and data come from the rest of my training script
trainer = pl.Trainer(
    callbacks=callbacks,
    logger=logger,
    default_root_dir=root_dir,
    max_epochs=200,
    devices=2,
    accelerator="gpu",
    strategy="ddp",
)
trainer.fit(task, datamodule=data)

My model does not fit into a single GPU because of large molecule structures. These use cases are pretty common if you are studying large structures. I think adding a feature like that (I mean, being able to split a model easily across GPUs) would be very helpful when using ALIGNN.

@bdecost
Collaborator

bdecost commented Jan 24, 2023

My model does not fit into a single GPU because of large molecule structures. These use cases are pretty common if you are studying large structures. I think adding a feature like that (I mean, being able to split a model easily across GPUs) would be very helpful when using ALIGNN.

Are you looking to shard your model across multiple GPUs? In that case, Lightning looks like it might be able to set that up automatically. I've only tried data parallelism with ignite to get better throughput; I'm not sure ignite supports model parallelism.
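
If sharding the model itself (rather than replicating it, as DDP does) is what you need, something like the following might work; this is a hedged sketch, and the exact strategy string depends on the Lightning version (e.g. "fsdp" in 2.x, or "deepspeed_stage_3" with deepspeed installed):

import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=2,
    accelerator="gpu",
    strategy="fsdp",   # shards parameters/gradients/optimizer state across the 2 GPUs
    max_epochs=200,
)
# trainer.fit(task, datamodule=data)  # same call as in your DDP snippet above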

What kind of batch sizes are you working with? I'm just wondering if you could reduce your batch size and model size to fit within your GPU memory limit, and maybe use data parallelism to get higher throughput / a larger effective batch size. The layer configuration study we did indicates that you can drop to a configuration with 2 ALIGNN and 2 GCN layers without really sacrificing performance, and this substantially reduces the model size (not quite by a factor of 2). See also the cost vs. accuracy summary of that experiment. I'm not sure this tradeoff would hold for larger molecular structures, since the performance tradeoff study was done on the Jarvis dft_3d dataset, but it might be worth trying.
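
If you want to try the smaller configuration, a training config along these lines should do it (untested sketch; the key names follow alignn's example training config, and the dataset/target/batch-size values are placeholders for your setup):

import json

config = {
    "dataset": "dft_3d",
    "target": "formation_energy_peratom",
    "batch_size": 32,           # placeholder; pick what fits your 20 GB cards
    "model": {
        "name": "alignn",
        "alignn_layers": 2,     # down from the default 4
        "gcn_layers": 2,        # down from the default 4
    },
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)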

@aydinmirac
Author

Hi @bdecost,

Let me elaborate on my case.

I am using the OMDB dataset, which contains large molecules with 82 atoms per molecule on average. That is considerably larger than the QM9 dataset. I am using SchNetPack and ALIGNN to train models on this dataset.

I previously trained SchNet and ALIGNN on an A100 (40 GB GPU memory). Batch size and GPU memory usage on the A100 were as follows:

SchNet --> batch size = 64, memory usage = 39 GB
ALIGNN --> batch size = 48, memory usage = 32 GB

If I increase the batch size, both models crash on the A100. Now I have to train these models on 2x RTX A4500 GPUs (20 GB memory each) and split the work across the 2 GPUs without running out of memory. SchNet uses Lightning to handle this.

I just wondered if I can do the same with ALIGNN. Otherwise, as you mentioned, I will have to reduce the batch size. I will also try the suggestions you mentioned above.

@bdecost
Collaborator

bdecost commented Jan 24, 2023

Ok, cool. I wasn't sure if you had such large molecules that fitting a single instance in GPU memory was a problem.

What you want is more straightforward than model parallelism, I think. Our current training setup doesn't make it as simple as a configuration setting, but I think it would be nice for us to support it.

@JonathanSchmidt1

We are also interested in training on some larger datasets.
What is the current state of distributed training?

@knc6
Collaborator

knc6 commented Jan 7, 2024

Hi,

Did you try distributed: true, which is based on the accelerate package, to manage multi-GPU training?

@JonathanSchmidt1

With distributed: true, the training hangs at "building line graphs", even when using only one node. There seem to be quite a lot of issues with hanging processes in accelerate, though, and unfortunately the kernel version of the supercomputing system I am on is 5.3.18, for which this seems to be a known issue (see https://huggingface.co/docs/accelerate/basic_tutorials/troubleshooting).

Have you had any reports of the training hanging at this stage?

@JonathanSchmidt1

@knc6, just a quick check-in concerning data parallelism, since the distributed option was removed.
I am getting device errors with DataParallel, and I am also not sure whether DataParallel can properly split batches of DGL graphs:
dgl._ffi.base.DGLError: Cannot assign node feature "e_src" on device cuda:1 to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.
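
For context, my understanding is that nn.DataParallel only chunks plain tensors across devices, so the batched DGLGraph stays on one GPU while the replica on the other GPU tries to write features to it. A DDP-style loop avoids this because each process moves its own batched graph to its own device; here is a rough, untested sketch (the (g, lg, target) batch layout and the model([g, lg]) call are assumptions about the alignn training code, so check them against your version):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(model):
    # single-node assumption: global rank == local GPU index (launch with torchrun)
    dist.init_process_group(backend="nccl")
    device = torch.device(f"cuda:{dist.get_rank()}")
    torch.cuda.set_device(device)
    return DDP(model.to(device), device_ids=[device.index]), device

def train_step(ddp_model, batch, device):
    g, lg, target = batch                                   # assumed (graph, line graph, labels)
    g, lg, target = g.to(device), lg.to(device), target.to(device)  # DGLGraph.to() moves features too
    pred = ddp_model([g, lg])                               # ALIGNN takes (g, lg); verify for your version
    return torch.nn.functional.mse_loss(pred.squeeze(), target)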

Is DataParallel training working for you?
Thank you so much!

@knc6
Collaborator

knc6 commented Apr 28, 2024

There is a WIP DistributedDataParallel implementation of ALIGNN in the ddp branch. Currently it prints outputs multiple times (once per process), which should hopefully be an easy fix. I also need to check reproducibility and a few other aspects of DDP.
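
The usual fix for the duplicated output is to guard printing/logging on rank 0, roughly like this (minimal sketch, not the actual branch code):

import torch.distributed as dist

def is_main_process() -> bool:
    # rank 0 only, falling back to True when not running under distributed training
    return (not dist.is_available()) or (not dist.is_initialized()) or dist.get_rank() == 0

if is_main_process():
    print("epoch 1: train_loss = ...")  # placeholder log message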

@JonathanSchmidt1

Great, will try it out.

@JonathanSchmidt1

Thanks for sharing the branch.
I tested it with cached datasets and 2 GPUs, and it was reproducible and consistent with what I would expect from 1 GPU.

However, I am already noticing that RAM will become an issue at some point (2 tasks with 400k materials is already 66 GB; 8 tasks with 4M materials would be ~3.2 TB) due to the in-memory datasets.

@knc6
Collaborator

knc6 commented Apr 29, 2024

@JonathanSchmidt1 Thanks for checking. I have a possible solution to lessen the memory footprint using an LMDB dataset; I will get back to you soon.
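
The basic idea is to serialize the pre-built graphs into an LMDB file once and read them lazily per item, so the whole dataset never has to sit in RAM. A minimal sketch of such a dataset (not the actual alignn implementation; the stored (graph, line_graph, target) layout is just an assumption):

import pickle

import lmdb
from torch.utils.data import Dataset

class LMDBGraphDataset(Dataset):
    def __init__(self, lmdb_path: str):
        # readonly + lock=False is the usual setting for multi-worker reads
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            sample = pickle.loads(txn.get(f"{idx}".encode()))
        return sample  # e.g. (graph, line_graph, target), depending on what was written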

@JonathanSchmidt1

JonathanSchmidt1 commented Apr 29, 2024

That's a good idea; LMDB datasets definitely work for this.
If you would like to use LMDB datasets, there are a few examples of how to build them in e.g. https://github.com/IntelLabs/matsciml/tree/main, for both DGL and PyG. I have also implemented one of them, but unfortunately I just don't have the capacity right now to promise anything.
However, taking one of their classes that parses pymatgen structures, e.g. the Materials Project dataset or the Alexandria dataset, and switching to your line graph conversion might be easy.
E.g. for M3GNet preprocessing they just use:

# abridged from matsciml; the matsciml/matgl imports for registry,
# MaterialsProjectDataset, M3GNetDataset, Structure2Graph and element_types
# are omitted here
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable

from pymatgen.core import Structure


@registry.register_dataset("M3GMaterialsProjectDataset")
class M3GMaterialsProjectDataset(MaterialsProjectDataset):
    def __init__(
        self,
        lmdb_root_path: str | Path,
        threebody_cutoff: float = 4.0,
        cutoff_dist: float = 20.0,
        graph_labels: list[int | float] | None = None,
        transforms: list[Callable[..., Any]] | None = None,
    ):
        super().__init__(lmdb_root_path, transforms)
        self.threebody_cutoff = threebody_cutoff
        self.graph_labels = graph_labels
        self.cutoff_dist = cutoff_dist
        self.clear_processed = True

    def _parse_structure(
        self,
        data: dict[str, Any],
        return_dict: dict[str, Any],
    ) -> None:
        super()._parse_structure(data, return_dict)
        structure: None | Structure = data.get("structure", None)
        self.structures = [structure]
        self.converter = Structure2Graph(
            element_types=element_types(),
            cutoff=self.cutoff_dist,
        )
        # older matgl versions return (graphs, lg, sa) without lattices:
        # graphs, lg, sa = M3GNetDataset.process(self)
        graphs, lattices, lg, sa = M3GNetDataset.process(self)
        return_dict["graph"] = graphs[0]
@knc6
Collaborator

knc6 commented Apr 30, 2024

@JonathanSchmidt1 I have implemented a basic LMDB dataset here; feel free to give it a try.

@JonathanSchmidt1

Thank you very much. Will give it a try this week.
