NeMo Llama 2 70B Pre-Training 20TB Dataset OOM #11784
Comments
Increasing num_workers can increase host memory usage. Is my understanding correct that these A3 Mega nodes have ~1.8 TB of host memory? Do you have data points on how the process memory increases as you increase num_workers? Does increasing the shared memory, as in this summary comment, help in your case?
Also, if you're not using NeMo 2.0, I think that would be worth a try. I have seen anecdotes of host OOMs going away after switching.
Correct, our two-node cluster has a total of 3,760 GB of host memory.
What would be the most helpful format for sharing this data? Reducing num_nodes or reducing the dataset size reduces the total amount of CPU memory used. We will experiment with increasing shared memory.
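For reference, a common way to raise the shared-memory limit available to PyTorch DataLoader workers on Kubernetes is to mount a memory-backed emptyDir at /dev/shm. The sketch below is illustrative only and is not taken from the linked summary comment; the pod and container names, image tag, and size limit are assumptions to be adapted to the actual job spec.

```yaml
# Hypothetical excerpt of a training pod spec: mount a memory-backed
# emptyDir at /dev/shm so DataLoader workers are not limited by the
# container's default shared-memory size.
apiVersion: v1
kind: Pod
metadata:
  name: nemo-training          # hypothetical name
spec:
  containers:
    - name: trainer            # hypothetical name
      image: nvcr.io/nvidia/nemo:24.07   # illustrative tag
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 256Gi       # assumption: tune to available host RAM
```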
Hi Terry, thank you for your help. We did try NeMo 2.0 and found that it does indeed resolve the OOM issue. However, it appears that NeMo 2.0 does not have Kubernetes support yet. Do you have any idea when that might happen?
Actually, there is k8s support via SkyPilot: https://github.com/NVIDIA/NeMo-Run/blob/main/docs/source/guides/execution.md#execute-nemo-run Please give that a try and feel free to leave feedback for the NeMo-Run team if something is missing.
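As a rough illustration only (not NeMo-Run's own executor configuration, which is what the linked guide describes), a plain SkyPilot task file targeting an existing Kubernetes cluster might look like the sketch below; the task name, GPU count, and launch command are assumptions.

```yaml
# Illustrative SkyPilot task file; resource values and the run
# command are placeholders, not the actual NeMo-Run setup.
name: nemo2-llama2-70b-pretrain   # hypothetical task name

num_nodes: 2

resources:
  cloud: kubernetes               # target the existing GKE cluster
  accelerators: H100:8            # assumption: 8 GPUs per A3 Mega node

run: |
  # placeholder: replace with the actual NeMo 2.0 launch command
  echo "launch NeMo 2.0 pre-training here"
```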
Describe the bug
NeMo Llama 2 70B pre-training with a 20TB dataset runs out of CPU (host) memory.
Steps/Code to reproduce bug
Using the default YAML file here: https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/conf/training/llama/llama2_70b.yaml
The dataset is the Pile dataset replicated 35 times so that its total size is 20TB.
We run this on a GKE cluster with 2 A3 Mega nodes.
Expected behavior
We expect training to run successfully with num_workers=4, but it runs into an OOM error. Reducing num_workers to 2 worked, but that is problematic: we are using GCSFuse, its latency is high, and with num_workers=2 the dataloader cannot keep the GPUs fed.
So we want to know whether this is a known problem on NVIDIA's end.
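For context, the setting in question is the dataloader worker count in the training config; a minimal sketch of the comparison, assuming the usual model.data.num_workers key in the launcher YAML (the exact path may differ):

```yaml
model:
  data:
    num_workers: 4   # host OOM with the 20TB dataset on two A3 Mega nodes
    # num_workers: 2 # completes, but GCSFuse latency leaves the GPUs underfed
```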
Environment overview
Environment details
NeMo 24.07