Is your feature request related to a problem? Please describe.
Some datasets have so many users or so many items that an embeddings table will not fit and train within a single GPU's memory bounds. Eventually, you'll get the dreaded `CUDA error: out of memory` message. Oof.
Describe the solution you'd like
While PyTorch Lightning allows for data-parallelization (data is processed simultaneously on many GPUs), it does not easily allow for model-parallelization (the model itself lives on many GPUs). This issue deals with the latter case.
Imagine we have two GPUs with 8GB of memory each, and a model with embeddings of size 10GB. A model-parallel solution to training this model would be to split the embeddings table in half, with each half living on a different GPU. Now both GPUs are used to store and train the model, and we can train without worrying about memory!
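As a rough sketch of that idea (not an existing Collie or PyTorch Lightning API — the `ShardedEmbedding` name, the two-GPU layout, and the row-wise split are all assumptions here), a model-parallel embedding table in plain PyTorch could look something like this:

```python
import torch
from torch import nn


class ShardedEmbedding(nn.Module):
    """Hypothetical sketch: an embedding table whose rows are split across two GPUs."""

    def __init__(self, num_embeddings: int, embedding_dim: int,
                 devices: tuple = ("cuda:0", "cuda:1")):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.devices = devices
        self.split_point = num_embeddings // 2

        # The first half of the rows lives on the first GPU...
        self.first_half = nn.Embedding(self.split_point, embedding_dim).to(devices[0])
        # ...and the remaining rows live on the second GPU.
        self.second_half = nn.Embedding(num_embeddings - self.split_point,
                                        embedding_dim).to(devices[1])

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Route each ID to the shard holding its row, then gather everything
        # back onto the first device so downstream layers see a single tensor,
        # just as they would with a plain ``nn.Embedding``.
        ids = ids.to(self.devices[0])
        in_first = ids < self.split_point

        out = torch.empty(*ids.shape, self.embedding_dim, device=self.devices[0])
        out[in_first] = self.first_half(ids[in_first])
        out[~in_first] = self.second_half(
            (ids[~in_first] - self.split_point).to(self.devices[1])
        ).to(self.devices[0])
        return out
```

A real implementation would also need to handle more than two shards, optimizer state, and checkpointing, but the core routing idea stays the same.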
Ideally, we would be able to create a model that looks and functions the same as existing Collie models, except on the back-end, the model will be parallelized across multiple GPUs to avoid OOM errors. This should be as seamless as possible.
Describe alternatives you've considered
At some point, we cannot further reduce the number of embedding dimensions, the batch size, or the dataset size, and we have to consider additional scalability concerns such as this.
Any additional information?
Colossal-AI has a method for doing this that we can either adapt or use outright: https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.layer.colossalai_layer.embedding.html
I think a model-parallel replacement for `nn.Embedding` is a good place to start, and we can see how this works going forward.
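For illustration only (building on the hypothetical `ShardedEmbedding` sketch above, and not Collie's actual model code), such a replacement could be dropped into a toy matrix-factorization forward pass without changing how the rest of the model reads:

```python
import torch
from torch import nn

# Assumes the hypothetical ``ShardedEmbedding`` class from the sketch above is in scope.


class ToyMatrixFactorization(nn.Module):
    """Hypothetical model: sharded user/item tables, plain dot-product scoring."""

    def __init__(self, num_users: int, num_items: int, embedding_dim: int = 32):
        super().__init__()
        # Each oversized table is sharded across GPUs by the class sketched above.
        self.user_embeddings = ShardedEmbedding(num_users, embedding_dim)
        self.item_embeddings = ShardedEmbedding(num_items, embedding_dim)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        # The call sites look identical to the ``nn.Embedding`` version.
        return (self.user_embeddings(user_ids) * self.item_embeddings(item_ids)).sum(dim=-1)


model = ToyMatrixFactorization(num_users=10_000_000, num_items=5_000_000)
scores = model(user_ids=torch.tensor([0, 42]), item_ids=torch.tensor([7, 123]))
```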