
NeMo Llama 2 70B Pre-Training 20TB Dataset OOM #11784

Open
shengshiqi-google opened this issue Jan 7, 2025 · 5 comments
Labels: bug (Something isn't working)
@shengshiqi-google

Describe the bug

NeMo Llama 2 70B pre-training with a 20 TB dataset runs out of memory (CPU/host memory).

Steps/Code to reproduce bug

Using the default YAML file here: https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/conf/training/llama/llama2_70b.yaml

The dataset is The Pile, replicated 35 times so that its total size is 20 TB.

We run this on a GKE cluster with 2 A3 Mega nodes

Expected behavior

We expect training to run successfully with num_workers=4, but this runs into an OOM error. Reducing num_workers to 2 works, but that is problematic because we read data through GCSFuse, where latency is high, so num_workers=2 cannot keep the training pipeline saturated.

We want to know whether this is a known problem on NVIDIA's end.
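For reference, the setting we are toggling is the data-loader worker count in the training config. A minimal sketch of the override (the key path is assumed to follow the layout of the linked llama2_70b.yaml and is illustrative only):

```yaml
# Minimal sketch of the override being discussed; the key path is assumed
# to follow the launcher's llama2_70b.yaml layout and is illustrative only.
model:
  data:
    num_workers: 4   # OOMs on host memory in our setup
    # num_workers: 2 # avoids the OOM, but cannot keep GCSFuse reads saturated
```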

Environment overview (please complete the following information)

  • Environment location: GCP GKE

Environment details

NeMo 24.07

@shengshiqi-google added the bug (Something isn't working) label on Jan 7, 2025
@terrykong (Collaborator)

Increasing num_workers can increase host memory usage. Is my understanding correct that these A3 mega nodes have ~1.8TB of host memory? Do you have data points showing how the process memory increases as you increase num_workers?

Does increasing the shared memory like in this summary comment help in your case?
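For context, the usual way to grow shared memory for a training pod on Kubernetes is to back /dev/shm with a RAM-based emptyDir. A minimal sketch (pod/container names, image tag, and size are placeholders, not taken from your setup):

```yaml
# Minimal sketch: enlarge /dev/shm for a training pod on Kubernetes.
# Names, image tag, and sizeLimit are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: nemo-training        # hypothetical pod name
spec:
  containers:
    - name: trainer          # hypothetical container name
      image: nvcr.io/nvidia/nemo:24.07   # placeholder image tag
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory       # tmpfs backed by host RAM
        sizeLimit: 128Gi     # placeholder; size to the host's capacity
```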

@terrykong self-assigned this Jan 10, 2025
@terrykong (Collaborator)

Also, if you're not using NeMo 2.0, I think that would be worth a try. I have seen anecdotal reports of host OOMs going away after switching.

@awonak commented Jan 10, 2025

Is my understanding correct that these A3 mega nodes have ~1.8TB of host memory?

Correct, our two-node cluster has a total of 3,760 GB of host memory.

Do you have data points showing how the process memory increases as you increase num_workers?

What would be the most helpful format to share this data?

Reducing num_nodes or reducing the dataset size will reduce the total amount of CPU memory used. We will experiment with increasing shared memory.

@shengshiqi-google (Author)

Also, if you're not using NeMo 2.0, I think that would be worth a try. I have seen anecdotal reports of host OOMs going away after switching.

Hi Terry, thank you for your help. We did try NeMo 2.0, and found that it does indeed resolve the OOM issue.

However, it appears that NeMo 2.0 does not have Kubernetes support yet. Do you have any idea when that might happen?

@terrykong (Collaborator)

Hi Terry, thank you for your help. We did try NeMo 2.0, and found that it does indeed resolve the OOM issue.

Actually, there is k8s support via SkyPilot: https://github.com/NVIDIA/NeMo-Run/blob/main/docs/source/guides/execution.md#execute-nemo-run

Please give that a try and feel free to leave feedback for the nemo-run team if something is missing.
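For anyone landing here later, a rough sketch of what a SkyPilot task targeting an existing Kubernetes cluster can look like (illustrative only; accelerator shape, node count, and commands are placeholders, and the NeMo-Run integration itself is covered in the linked guide):

```yaml
# Illustrative SkyPilot task spec for an existing Kubernetes cluster;
# accelerator type/count, node count, and commands are placeholders.
resources:
  cloud: kubernetes
  accelerators: H100:8   # placeholder; adjust to your node shape
num_nodes: 2
run: |
  # launch the NeMo 2.0 pre-training entrypoint here (e.g. via nemo-run)
  echo "replace with your training launch command"
```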
