Refactor submission templates #722

Merged
merged 52 commits into from
Oct 23, 2023


b-butler (Member) commented Feb 27, 2023

Description

This PR moves much of the logic for computing tasks and node counts into the environment classes, which can specify any needed parameters through the resources template context value now provided to the Jinja templates. Writing this logic in Python rather than Jinja allows more complicated logic in an easier-to-understand form. It also promotes computing resources consistently across environments, with explicit overrides through inheritance. For more discussion, see #702.
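As a rough illustration of the idea (not the code in this PR), the sketch below shows an environment class computing a resources mapping in Python, with a subclass overriding it through inheritance. The class names, the `get_resources` method, the `cores_per_node` value, and the dictionary keys are all hypothetical and chosen only for this example.

```python
import math


class SketchEnvironment:
    """Hypothetical stand-in for an environment class; not flow's real API."""

    cores_per_node = 32  # assumed value, for illustration only

    @classmethod
    def get_resources(cls, ntasks):
        """Return the values a template could read from ``resources``."""
        return {
            "ntasks": ntasks,
            # Round up so every task has a core to run on.
            "num_nodes": math.ceil(ntasks / cls.cores_per_node),
        }


class WholeNodeEnvironment(SketchEnvironment):
    """Hypothetical cluster that only allocates whole nodes."""

    @classmethod
    def get_resources(cls, ntasks):
        resources = super().get_resources(ntasks)
        # Explicit override through inheritance: also report cores billed.
        resources["cores_billed"] = resources["num_nodes"] * cls.cores_per_node
        return resources
```

A template would then only need to print values from the mapping instead of recomputing them in Jinja.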

Motivation and Context

This PR attempts to fix some bugs in the existing template logic. For one, we sometimes fail on multi-node GPU submissions (see #702 and related issues). We also currently modify user resource requests (e.g. rounding CPU tasks to the same number per node) rather than submitting them as is or raising an error.


This moves much of the task and number of nodes logic to the
environments where it is easier to manage the more complicated logic.
b-butler (Member, Author) commented Feb 27, 2023

TODO:

  • Document _get_scheduler_resources more fully
  • Get feedback on the new default settings (e.g. preferring --ntasks over --ntasks-per-node)
  • Port over all templates (that we have access to)
  • Test all templates
  • Regenerate template test reference and validate

Note: we need to make sure to set -N only when required, so that we do not over-request resources.

Anyone working on this PR, please feel free to modify this list.
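To make the note about -N concrete, here is a minimal sketch that emits the node-count directive only for multi-node requests. It assumes that "required" means the job spans more than one node, which may not match the final rule in the templates. The function name `slurm_directives` is hypothetical; `--ntasks` and `--nodes` (the long form of `-N`) are standard sbatch options.

```python
def slurm_directives(ntasks, num_nodes):
    """Build #SBATCH lines, adding --nodes only for multi-node requests.

    Assumption for this sketch: the node count is only needed when the
    job spans more than one node.
    """
    lines = [f"#SBATCH --ntasks={ntasks}"]
    if num_nodes > 1:
        lines.append(f"#SBATCH --nodes={num_nodes}")
    return lines
```

With this rule, a 4-task job on one node yields only the `--ntasks` line, so the scheduler is free to place it without a whole-node request.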

b-butler and others added 15 commits March 7, 2023 18:37
CPUS and GPUS per partition are in theory supported.
Use a sentinel of -1 to denote no node structure and always return 1 node
requested for either CPU or GPU tasks.
Add the ComputeEnvironment._shared_partitions attribute to check if less
than single node submissions should be allowed in
ComputeEnvironment._get_scheduler_values.
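The two commit messages above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in flow/environment.py: the function names, the `NO_NODE_STRUCTURE` constant name, and the choice to raise `ValueError` are all hypothetical.

```python
import math

NO_NODE_STRUCTURE = -1  # sentinel: the partition has no per-node layout


def nodes_requested(tasks, per_node):
    """Return the node count for ``tasks`` CPU or GPU tasks."""
    if per_node == NO_NODE_STRUCTURE:
        # No node structure: always request a single node.
        return 1
    return math.ceil(tasks / per_node)


def check_partial_node(tasks, per_node, partition, shared_partitions):
    """Reject requests smaller than one node on non-shared partitions.

    Assumption for this sketch: such requests raise an error rather
    than being rounded up silently.
    """
    if per_node == NO_NODE_STRUCTURE or partition in shared_partitions:
        return
    if tasks < per_node:
        raise ValueError(
            f"Partition {partition!r} is not shared; request at least "
            f"{per_node} tasks or use a shared partition."
        )
```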
The Delta template is now tested and works.
The environment is now tested.
I do not have access to the cluster, so to prevent regressions, I am
resetting this.
bdice (Member) commented Jul 12, 2023

@b-butler Can we push this through or close it?

pre-commit-ci bot and others added 11 commits July 20, 2023 11:21
updates:
- [github.com/psf/black: 23.1.0 → 23.3.0](psf/black@23.1.0...23.3.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* First pass at fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add filter test

* Update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Delta changed its compute node hostnames during the April 19th
maintenance. This fixes the detection of the Delta environment.
* doc: Update changelog.

* Bump up to version 0.25.1.
* feat: Add the Frontier supercomputer to environments.

* test: Add Frontier to template testing.

* test: Update environment test template generation to signac 2.0

* doc: Update changelog

* doc: Add Frontier documentation.

* doc: Update in-code comment clarity

Co-authored-by: Bradley Dice <[email protected]>

---------

Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat (WIP): create flow CLI subcommand for testing templates

* feat: Finish new CLI option.

* test: flow test-workflow.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* doc: Add test-workflow to documentation.

* doc: Add changes to changelog.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Bumps [ruamel-yaml](https://sourceforge.net/p/ruamel-yaml/code/ci/default/tree) from 0.17.21 to 0.17.31.

---
updated-dependencies:
- dependency-name: ruamel-yaml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
codecov bot commented Oct 10, 2023

Codecov Report

Merging #722 (f8db823) into main (fd1f3d0) will decrease coverage by 0.09%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #722      +/-   ##
==========================================
- Coverage   69.25%   69.17%   -0.09%     
==========================================
  Files          44       44              
  Lines        4297     4331      +34     
  Branches      950     1052     +102     
==========================================
+ Hits         2976     2996      +20     
- Misses       1109     1129      +20     
+ Partials      212      206       -6     
Files Coverage Δ
flow/environment.py 82.18% <100.00%> (+3.86%) ⬆️
flow/environments/incite.py 72.58% <100.00%> (-13.79%) ⬇️
flow/environments/umich.py 85.71% <100.00%> (+3.89%) ⬆️
flow/environments/xsede.py 85.07% <100.00%> (+1.74%) ⬆️
flow/project.py 83.15% <100.00%> (+0.13%) ⬆️

@b-butler b-butler marked this pull request as ready for review October 11, 2023 00:03
@b-butler b-butler requested review from a team as code owners October 11, 2023 00:03
@b-butler b-butler requested review from mikemhenry and tommy-waltmann and removed request for a team October 11, 2023 00:03
tommy-waltmann (Contributor) left a comment

I have two small suggestions and a question:

This PR touches a lot of cluster templates. Does flow normally need to be validated on each of the clusters in addition to unit tests? If so, has the validation been done after the changes in this PR?

flow/project.py (outdated, resolved)
flow/project.py (outdated, resolved)
b-butler (Member, Author) commented

> This PR touches a lot of cluster templates. Does flow normally need to be validated on each of the clusters in addition to unit tests? If so, has the validation been done after the changes in this PR?

@tommy-waltmann, that is part of the reason the PR took so long. I believe all of the changed templates have been validated (I need to double-check Andes). That is also why some templates were not changed: I could not access those clusters.

flow/templates/andes.sh (outdated, resolved)
5 participants