Multi-GPU model training #46

Open
nathancooperjones opened this issue Jan 27, 2022 · 0 comments
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@nathancooperjones (Contributor)

Is your feature request related to a problem? Please describe.

Some datasets have so many users or so many items that the embedding tables cannot fit, let alone train, within a single GPU's memory. Eventually, you'll hit the dreaded CUDA error: out of memory message. Oof.

Describe the solution you'd like

While PyTorch Lightning makes data parallelism easy (the same model processes different slices of the data on many GPUs at once), it does not easily allow for model parallelism (a single model's parameters are spread across many GPUs). This issue deals with the latter case.

Imagine we have two GPUs with 8GB of memory each and a model whose embeddings are 10GB in size. A model-parallel way to train this model would be to split the embedding tables in half, with each 5GB half living on a different GPU. Now both GPUs share the work of storing and training the model, and we can train without worrying about memory!
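
To make the idea concrete, here is a minimal sketch of such a row-wise split (the ShardedEmbedding name, the fixed two-device layout, and the routing logic are illustrative assumptions for this issue, not existing Collie or Colossal-AI code):

```python
import torch
import torch.nn as nn


class ShardedEmbedding(nn.Module):
    """Row-wise shard a single embedding table across two GPUs."""

    def __init__(self, num_embeddings: int, embedding_dim: int) -> None:
        super().__init__()
        self.split = num_embeddings // 2
        # The first half of the rows lives on GPU 0 and the second half on
        # GPU 1, so each device only holds roughly half the table's memory.
        self.first = nn.Embedding(self.split, embedding_dim).to('cuda:0')
        self.second = nn.Embedding(num_embeddings - self.split, embedding_dim).to('cuda:1')

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        ids = ids.to('cuda:0')
        out = torch.empty(*ids.shape, self.first.embedding_dim, device='cuda:0')
        low = ids < self.split
        out[low] = self.first(ids[low])
        # Route the remaining IDs to the second shard's device for lookup,
        # then move the results back so downstream layers see one tensor.
        out[~low] = self.second((ids[~low] - self.split).to('cuda:1')).to('cuda:0')
        return out
```

The forward pass routes each ID to whichever shard holds its row and gathers everything back onto one device, so downstream layers never notice the split.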

Ideally, we would be able to create a model that looks and functions the same as existing Collie models, except that on the back-end, the model is parallelized across multiple GPUs to avoid OOM errors. This should be as seamless as possible.

Describe alternatives you've considered

At some point, we cannot shrink the embedding dimension, the batch size, or the dataset itself any further, and we have to consider additional scalability solutions such as this one.

Any additional information?

Colossal-AI has a method for doing this that we can either adapt or use outright: https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.layer.colossalai_layer.embedding.html

I think a model-parallel replacement for nn.Embedding is a good starting point, and we can see how it works going forward.
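
For a sense of how seamless that could be, here is a hypothetical drop-in usage inside a matrix-factorization-style model (this builds on the ShardedEmbedding sketch above; the class below is a stand-in for illustration, not Collie's actual model API):

```python
import torch
import torch.nn as nn


class TwoGPUMatrixFactorization(nn.Module):
    """A matrix factorization model whose embedding tables are model-parallel."""

    def __init__(self, num_users: int, num_items: int, embedding_dim: int = 32) -> None:
        super().__init__()
        # Swap the usual nn.Embedding tables for the sharded version above.
        self.user_embeddings = ShardedEmbedding(num_users, embedding_dim)
        self.item_embeddings = ShardedEmbedding(num_items, embedding_dim)

    def forward(self, users: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        # Standard dot-product score; all math happens on cuda:0 after lookup.
        return (self.user_embeddings(users) * self.item_embeddings(items)).sum(dim=-1)
```

From the caller's perspective nothing changes except the embedding class, which is exactly the kind of seamlessness described above.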

nathancooperjones added the enhancement and help wanted labels on Jan 27, 2022