Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scalability test wall clock #239

Merged
merged 71 commits into from
Nov 7, 2024
Merged

Scalability test wall clock #239

merged 71 commits into from
Nov 7, 2024

Conversation

jarlsondre
Copy link
Collaborator

@jarlsondre jarlsondre commented Oct 30, 2024

Summary

Update the implementation for the original scalability test to include an absolute plot. In addition, many changes have been made to streamline the different scalability tests.

Note: As I have been rebasing from the gpu-monitoring branch, many of the changes from that branch will appear in this PR until the other PR has been merged.

Related issue :
#221

@jarlsondre jarlsondre self-assigned this Oct 30, 2024
@jarlsondre jarlsondre added the enhancement New feature or request label Oct 30, 2024
matbun
matbun previously approved these changes Oct 31, 2024
Copy link
Collaborator

@matbun matbun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@matbun
Copy link
Collaborator

matbun commented Nov 1, 2024

I just realized that it would be nice to enable/disable all scalability analyses from a flag passed to the Trainer's constructor. This would give the possibility to the user to decide when profiling should actually be done and when to have less overhead

@jarlsondre
Copy link
Collaborator Author

I just realized that it would be nice to enable/disable all scalability analyses from a flag passed to the Trainer's constructor. This would give the possibility to the user to decide when profiling should actually be done and when to have less overhead

The way I had it in mind was that the decorator was going to be used as a toggle for this. The obvious problem with doing that is of course that you have to actually change the source code every time you want to do it. However, if you add a flag to the constructor that overrides the decorator then you in a sense have two contradicting toggles, which to me sounds quite confusing. To counteract this, I would suggest then that the decorators are a permanent part of every trainer, somehow, so that we limit the number of places for configuration, as anything else quickly becomes convoluted.

matbun
matbun previously approved these changes Nov 7, 2024
@jarlsondre jarlsondre merged commit 86f536f into main Nov 7, 2024
11 checks passed
@jarlsondre jarlsondre deleted the scalability-test-wall-clock branch November 7, 2024 16:12
jarlsondre added a commit that referenced this pull request Nov 8, 2024
* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* temporary changes

* remove redundant variable

* add absolute time plot

* remove trailing whitespace

* remove redundant variable

* remove trailing whitespace

* begin implementation of backup

* fix issues from PR

* fix issues from PR

* add backup to gpu monitoring

* fix import in eurac trainer

* cleanup backup mechanism slightly

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup to gpu monitoring

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* add configurable and dynamic wait and warmup times for the profiler

* temporary changes

* add absolute time plot

* begin implementation of backup

* add backup to gpu monitoring

* cleanup backup mechanism slightly

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* fix linting errors

* fix more linting errors

* add utilization percentage plot

* run isort for linting

* update default save path for metrics

* add decorators to virgo and some cleanup

* add contributions and cleanup

* fix linting errors

* change 'credits' to 'credit'

* update communication plot style

* update function names

* update scalability function for a more streamlined approach

* run isort

* move horovod import

* fix linting errors

* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants