add TensorBoard logging with loss and wps #57
ghstack-source-id: 297bec2b7acdf83c0af32dbf89dda3c6672095c9 Pull Request resolved: #57
ghstack-source-id: cdfe4c2c496feae23399019ec2a63b443fb3b6a9 Pull Request resolved: #57
train.py
time_delta = timer() - time_last_log
wps = nwords_since_last_log / (
    time_delta * parallel_dims.sp * parallel_dims.pp
A neater way is to define a `model_parallel_size` in the `parallel_dims` class that returns this number directly (i.e., a cached property).
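One way the suggestion above might look, as a minimal sketch rather than the code that was merged. The `ParallelDims` class name, its `dp` field, and the `words_per_second` helper are assumptions made for illustration; only `sp`, `pp`, `model_parallel_size`, and the shape of the wps formula come from the diff and the comment.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass
class ParallelDims:
    # Hypothetical field set: only `sp` and `pp` appear in the diff above.
    dp: int
    sp: int
    pp: int

    @cached_property
    def model_parallel_size(self) -> int:
        # Product of the two degrees that the diff multiplies into the
        # wps denominator; computed once and then cached on the instance.
        return self.sp * self.pp


def words_per_second(nwords: int, time_delta: float, parallel_dims: ParallelDims) -> float:
    # Same formula as the diff, with the product replaced by the property.
    return nwords / (time_delta * parallel_dims.model_parallel_size)


dims = ParallelDims(dp=2, sp=2, pp=2)
wps = words_per_second(nwords=8192, time_delta=2.0, parallel_dims=dims)
```

With the property in place, the call site in `train.py` no longer needs to know which parallelism degrees contribute to the denominator.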
This looks great! One minor comment, and please update the README to explain how to set up and use TensorBoard.
ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4 Pull Request resolved: #57
Stack from ghstack (oldest at bottom):
Each rank builds its own TensorBoard writer. The global loss is communicated among all ranks before logging.
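A rough sketch of that pattern, not the PR's actual code. It assumes the "global loss" is an average obtained with `torch.distributed.all_reduce`, and that each rank writes to its own subdirectory; the helper names `build_writer` and `global_avg_loss` and the `rank_{rank}` directory layout are hypothetical.

```python
import torch
import torch.distributed as dist


def build_writer(log_dir: str, rank: int):
    # Imported lazily so the tensorboard package is only needed when logging.
    from torch.utils.tensorboard import SummaryWriter

    # Assumed layout: one subdirectory per rank, so event files never collide.
    return SummaryWriter(log_dir=f"{log_dir}/rank_{rank}")


def global_avg_loss(loss: torch.Tensor) -> float:
    # Work on a detached copy so the training graph and the local loss are untouched.
    loss = loss.detach().clone()
    if dist.is_available() and dist.is_initialized():
        # Sum across ranks, then divide to get the mean (assumed reduction).
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        loss /= dist.get_world_size()
    # Without an initialized process group this is just the local loss.
    return loss.item()
```

In a training loop, each rank would then call something like `writer.add_scalar("loss/global_avg", global_avg_loss(loss), step)`; the tag name here is likewise only illustrative.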
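To visualize using SSH tunneling, first forward the TensorBoard port from the server:
`ssh -L 6006:127.0.0.1:6006 your_user_name@my_server_ip`
Then, in the torchtrain repo on the server, run:
`tensorboard --logdir=./torchtrain/outputs/tb`
Finally, open http://localhost:6006/ in a web browser on your local machine.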