Is your feature request related to a problem? Please describe.
Some datasets have so many users or so many items that an embeddings table will not fit and train within a single GPU's memory bounds. Eventually, you'll get the dreaded `CUDA error: out of memory` message. Oof.
Describe the solution you'd like
While PyTorch Lightning allows for data-parallelization (data is processed simultaneously on many GPUs), it does not easily allow for model-parallelization (the model itself lives on many GPUs). This issue deals with the latter case.
Imagine we have two GPUs with 8GB of memory each, and a model with embeddings of size 10GB. A model-parallel solution to training this model would be to split the embeddings table in half, with each half living on a different GPU. Now both GPUs are used to store and train the model, and we can train without worrying about memory!
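As a rough sketch of that idea (not an existing Collie or PyTorch Lightning API — the `ShardedEmbedding` name, the two-GPU layout, and the row-wise split are all assumptions here), a model-parallel embedding table in plain PyTorch could look something like this:

```python
import torch
from torch import nn


class ShardedEmbedding(nn.Module):
    """Hypothetical sketch: an embedding table whose rows are split across two GPUs."""

    def __init__(self, num_embeddings: int, embedding_dim: int,
                 devices: tuple = ("cuda:0", "cuda:1")):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.devices = devices
        self.split_point = num_embeddings // 2

        # The first half of the rows lives on the first GPU...
        self.first_half = nn.Embedding(self.split_point, embedding_dim).to(devices[0])
        # ...and the remaining rows live on the second GPU.
        self.second_half = nn.Embedding(num_embeddings - self.split_point,
                                        embedding_dim).to(devices[1])

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Route each ID to the shard holding its row, then gather everything
        # back onto the first device so downstream layers see a single tensor,
        # just as they would with a plain ``nn.Embedding``.
        ids = ids.to(self.devices[0])
        in_first = ids < self.split_point

        out = torch.empty(*ids.shape, self.embedding_dim, device=self.devices[0])
        out[in_first] = self.first_half(ids[in_first])
        out[~in_first] = self.second_half(
            (ids[~in_first] - self.split_point).to(self.devices[1])
        ).to(self.devices[0])
        return out
```

A real implementation would also need to handle more than two shards, optimizer state, and checkpointing, but the core routing idea stays the same.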
Ideally, we would be able to create a model that looks and functions the same as existing Collie models, except on the back-end, the model will be parallelized across multiple GPUs to avoid OOM errors. This should be as seamless as possible.
Describe alternatives you've considered
At some point, we cannot further reduce the number of embedding dimensions, the batch size, or the dataset size, and we have to consider additional scalability concerns such as this.
Any additional information?
Colossal-AI has a method for doing this that we can either adapt or use outright: https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.layer.colossalai_layer.embedding.html
I think a model-parallel replacement for `nn.Embedding` is a good place to start, and we can see how this works going forward.
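For illustration only (building on the hypothetical `ShardedEmbedding` sketch above, and not Collie's actual model code), such a replacement could be dropped into a toy matrix-factorization forward pass without changing how the rest of the model reads:

```python
import torch
from torch import nn

# Assumes the hypothetical ``ShardedEmbedding`` class from the sketch above is in scope.


class ToyMatrixFactorization(nn.Module):
    """Hypothetical model: sharded user/item tables, plain dot-product scoring."""

    def __init__(self, num_users: int, num_items: int, embedding_dim: int = 32):
        super().__init__()
        # Each oversized table is sharded across GPUs by the class sketched above.
        self.user_embeddings = ShardedEmbedding(num_users, embedding_dim)
        self.item_embeddings = ShardedEmbedding(num_items, embedding_dim)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        # The call sites look identical to the ``nn.Embedding`` version.
        return (self.user_embeddings(user_ids) * self.item_embeddings(item_ids)).sum(dim=-1)


model = ToyMatrixFactorization(num_users=10_000_000, num_items=5_000_000)
scores = model(user_ids=torch.tensor([0, 42]), item_ids=torch.tensor([7, 123]))
```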