update performance and loss converging results #800

Open
wants to merge 2 commits into
base: gh/tianyu-l/28/base
from

Conversation

tianyu-l
Contributor

@tianyu-l tianyu-l commented Jan 22, 2025

[ghstack-poisoned]
tianyu-l added a commit that referenced this pull request Jan 22, 2025
ghstack-source-id: c9eb138b7011e6ea99907c1aeee9ed0fda7d9b16
Pull Request resolved: #800
@facebook-github-bot added the CLA Signed label Jan 22, 2025

The experiments are conducted on NVIDIA H100 GPUs[^1] with 95 GiB memory, where each host is equipped with 8 GPUs and NVSwitch. Two hosts form a rack connected to a TOR switch. A backend RDMA network connects the TOR switches.
Contributor Author

In the recent Meta CP paper (https://arxiv.org/abs/2411.01783), they mentioned:

> We ran our performance benchmarks on the Grand Teton platform (Meta Engineering, 2022), where each host has 8 Nvidia H100 GPUs fully connected with NVLink (“host” and “node” are interchangeable in the subsequent text). Each H100 GPU is equipped with 96GB HBM2e with 2.4 TB/sec peak memory bandwidth. We tested on two subtypes of Grand Teton platforms: Grand Teton Training (GTT) and Grand Teton Inference (GTI). GTT hosts are inter-connected with backend RDMA network with 400 Gb/s per GPU, and GTI hosts are inter-connected with frontend network over TCP/IP with 100 Gb/s per GPU.

Would like to hear your thoughts on this, @yifuwang.

[ghstack-poisoned]
tianyu-l added a commit that referenced this pull request Jan 23, 2025
ghstack-source-id: ec5512945a3f156a989111b8226d274650eed4d2
Pull Request resolved: #800