
NeMo Llama 2 70B Pre-Training 20TB Dataset OOM #11784

Open
shengshiqi-google opened this issue Jan 7, 2025 · 5 comments
Labels: bug (Something isn't working)
@shengshiqi-google

Describe the bug

NeMo Llama 2 70B pre-training with a 20 TB dataset runs out of memory (CPU/host memory).

Steps/Code to reproduce bug

Using the default YAML file here: https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/conf/training/llama/llama2_70b.yaml

The dataset is The Pile, replicated 35 times so that its total size is 20 TB.

We run this on a GKE cluster with 2 A3 Mega nodes

Expected behavior

We expect training to run successfully with num_workers=4, but this runs into an OOM error. Reducing num_workers to 2 works, but that is problematic because we read data through GCSFuse, where latency is high, so num_workers=2 cannot keep the training pipeline saturated.

We want to know whether this is a known problem on NVIDIA's end.
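For reference, the setting we are toggling is the data-loader worker count in the training config. A minimal sketch of the override (the key path is assumed to follow the layout of the linked llama2_70b.yaml and is illustrative only):

```yaml
# Minimal sketch of the override being discussed; the key path is assumed
# to follow the launcher's llama2_70b.yaml layout and is illustrative only.
model:
  data:
    num_workers: 4   # OOMs on host memory in our setup
    # num_workers: 2 # avoids the OOM, but cannot keep GCSFuse reads saturated
```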

Environment overview (please complete the following information)

  • Environment location: GCP GKE

Environment details

NeMo 24.07

@shengshiqi-google added the bug (Something isn't working) label on Jan 7, 2025
@terrykong (Collaborator)

Increasing num_workers can increase host memory usage. Is my understanding correct that these A3 mega nodes have ~1.8TB of host memory? Do you have data points showing how the process memory increases as you increase num_workers?

Does increasing the shared memory like in this summary comment help in your case?
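For context, the usual way to grow shared memory for a training pod on Kubernetes is to back /dev/shm with a RAM-based emptyDir. A minimal sketch (pod/container names, image tag, and size are placeholders, not taken from your setup):

```yaml
# Minimal sketch: enlarge /dev/shm for a training pod on Kubernetes.
# Names, image tag, and sizeLimit are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: nemo-training        # hypothetical pod name
spec:
  containers:
    - name: trainer          # hypothetical container name
      image: nvcr.io/nvidia/nemo:24.07   # placeholder image tag
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory       # tmpfs backed by host RAM
        sizeLimit: 128Gi     # placeholder; size to the host's capacity
```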

@terrykong self-assigned this Jan 10, 2025
@terrykong (Collaborator)

Also, if you're not using NeMo 2.0, I think that would be worth a try. I have seen anecdotal reports of host OOMs going away after switching.

@awonak commented Jan 10, 2025

Is my understanding correct that these A3 mega nodes have ~1.8TB of host memory?

Correct, our two-node cluster has a total of 3,760 GB of host memory.

Do you have data points showing how the process memory increases as you increase num_workers?

What would be the most helpful format to share this data?

Reducing num_nodes or reducing the dataset size will reduce the total amount of CPU memory used. We will experiment with increasing shared memory.

@shengshiqi-google (Author)

Also, if you're not using NeMo 2.0, I think that would be worth a try. I have seen anecdotal reports of host OOMs going away after switching.

Hi Terry, thank you for your help. We did try NeMo 2.0, and found that it does indeed resolve the OOM issue.

However, it appears that NeMo 2.0 does not have Kubernetes support yet. Do you have any idea when that might happen?

@terrykong (Collaborator)

Hi Terry, thank you for your help. We did try NeMo 2.0, and found that it does indeed resolve the OOM issue.

Actually, there is k8s support via SkyPilot: https://github.com/NVIDIA/NeMo-Run/blob/main/docs/source/guides/execution.md#execute-nemo-run

Please give that a try and feel free to leave feedback for the nemo-run team if something is missing.
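For anyone landing here later, a rough sketch of what a SkyPilot task targeting an existing Kubernetes cluster can look like (illustrative only; accelerator shape, node count, and commands are placeholders, and the NeMo-Run integration itself is covered in the linked guide):

```yaml
# Illustrative SkyPilot task spec for an existing Kubernetes cluster;
# accelerator type/count, node count, and commands are placeholders.
resources:
  cloud: kubernetes
  accelerators: H100:8   # placeholder; adjust to your node shape
num_nodes: 2
run: |
  # launch the NeMo 2.0 pre-training entrypoint here (e.g. via nemo-run)
  echo "replace with your training launch command"
```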
