add memory metrics to TensorBoard #60

Merged: 2 commits merged on Feb 17, 2024

@tianyu-l tianyu-l commented Feb 16, 2024

Stack from ghstack (oldest at bottom):

[Screenshot, 2024-02-15: TensorBoard plots of the memory metrics, grouped into memory_current and memory_peak]

tianyu-l added a commit that referenced this pull request Feb 16, 2024
ghstack-source-id: 4cf9b3ad5c8369f65c1bd384f2ea99900a6c4084
Pull Request resolved: #60
@facebook-github-bot added the CLA Signed label Feb 16, 2024
This was linked to issues Feb 16, 2024
train.py (outdated)
- "global_avg_loss": global_avg_loss,
- "global_max_loss": global_max_loss,
+ "loss/global_avg": global_avg_loss,
+ "loss/global_max": global_max_loss,
Contributor

Nit: using the / here is confusing to me. I read it as the loss divided by the global avg, and likewise for max.
Maybe consider just an _ or : or even :: as the separator, e.g. loss:global_avg, loss::global_max, memory_current_active?

Contributor Author

Oh, good point. I use [tag]/[metric] here because TensorBoard (TB) collects plots that share the same [tag] into one row, so that they form a visual group. As in the picture in the PR summary, the memory metrics are grouped into memory_current and memory_peak. I'll explore a way to keep this grouping without the ambiguity for losses.
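
To make the [tag]/[metric] grouping concrete, here is a minimal sketch in plain Python that mimics it. It assumes the grouping key is the text before the first /, and the function name group_by_tag and the memory metric names are hypothetical; only the loss names and the memory_current and memory_peak prefixes come from this PR. With torch.utils.tensorboard, each full name would be passed as the tag argument of SummaryWriter.add_scalar.

```python
from collections import defaultdict


def group_by_tag(metric_names):
    """Group metric names by the text before the first "/".

    Assumption: this mirrors how the TensorBoard scalars view
    collects plots into one row per tag prefix.
    """
    groups = defaultdict(list)
    for name in metric_names:
        tag, sep, metric = name.partition("/")
        # A name without "/" forms a group of its own.
        groups[tag].append(metric if sep else name)
    return dict(groups)


# The memory metric names below are hypothetical placeholders.
names = [
    "loss/global_avg",
    "loss/global_max",
    "memory_current/active",
    "memory_peak/active",
]

for tag, metrics in group_by_tag(names).items():
    print(tag, metrics)
```

Each key of the returned dict corresponds to one row of plots, which is why a flat name such as global_avg_loss would not be grouped with global_max_loss.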

Contributor Author
@tianyu-l tianyu-l Feb 17, 2024

I did some exploration, e.g. tried putting related metrics into a single plot. The options we have are add_scalars and add_custom_scalars, and neither seems ideal. For now I'm changing loss/global_avg to loss_metrics/global_avg to make it less ambiguous.
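
For readers unfamiliar with the second option, the sketch below builds the kind of layout dict that SummaryWriter.add_custom_scalars accepts: a category mapped to chart titles, each mapped to a chart type and a list of tags. It is only an illustration of the single-plot idea, not what was tried in this PR; the helper build_multiline_layout and the category name "training" are hypothetical.

```python
def build_multiline_layout(category, tags):
    """Build a custom-scalars layout with one 'Multiline' chart per tag prefix.

    Layout shape: {category: {chart_title: [chart_type, [tag, ...]]}}.
    """
    charts = {}
    for tag in tags:
        prefix = tag.split("/", 1)[0]
        # Create the chart entry on first sight, then append the tag to it.
        charts.setdefault(prefix, ["Multiline", []])[1].append(tag)
    return {category: charts}


layout = build_multiline_layout(
    "training",  # hypothetical category name
    ["loss_metrics/global_avg", "loss_metrics/global_max"],
)
print(layout)
```

The resulting dict could then be passed once to add_custom_scalars on a torch.utils.tensorboard.SummaryWriter, with the individual values still logged through add_scalar under the same tags.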

Contributor
@lessw2020 lessw2020 left a comment

Looks great, thanks for integrating these stats!
One very minor nit: the / could be confused with division when used in labels.

tianyu-l added a commit that referenced this pull request Feb 17, 2024
ghstack-source-id: da7e02b1c2f21a7471ce1dda8bd4d0ee888ad9ac
Pull Request resolved: #60
@tianyu-l tianyu-l merged commit b77c89f into gh/tianyu-l/1/base Feb 17, 2024
3 checks passed
@tianyu-l tianyu-l deleted the gh/tianyu-l/1/head branch February 17, 2024 01:42
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
ghstack-source-id: da7e02b1c2f21a7471ce1dda8bd4d0ee888ad9ac
Pull Request resolved: #60
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: da7e02b1c2f21a7471ce1dda8bd4d0ee888ad9ac
Pull Request resolved: pytorch#60
Development

Successfully merging this pull request may close these issues.

Add Tensorboard Add metrics to collect during training
3 participants