add pp validation for schedule #568
Conversation
@@ -41,6 +41,18 @@ def build_pipeline_schedule(job_config, stages, loss_fn):
     if n_microbatches is None:
         n_microbatches = job_config.experimental.pipeline_parallel_degree
 
+    # Validation that the stages are compatible with the schedule
+    if isinstance(schedule_class, PipelineScheduleSingle):
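The hunk is truncated here; a minimal sketch of how such a validation might continue, assuming `schedule_class` is a class object (so `issubclass` is the natural check) and with error messages that are illustrative rather than the PR's actual text:

```python
from torch.distributed.pipelining.schedules import (
    PipelineScheduleMulti,
    PipelineScheduleSingle,
)

# Hypothetical continuation of the validation (not the PR's verbatim code):
# single-stage schedules expect exactly one stage per rank, while looped
# (multi-stage) schedules expect more than one.
if issubclass(schedule_class, PipelineScheduleSingle):
    if len(stages) != 1:
        raise ValueError(
            f"{schedule_class.__name__} requires exactly 1 stage per rank, "
            f"got {len(stages)}; check the schedule config and split points"
        )
elif issubclass(schedule_class, PipelineScheduleMulti):
    if len(stages) < 2:
        raise ValueError(
            f"{schedule_class.__name__} requires at least 2 stages per rank, "
            f"got {len(stages)}"
        )
```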
Hm, shouldn't this validation actually go inside of the `PipelineSchedule*` `__init__` functions?
Hm, yeah, for `PipelineScheduleMulti.__init__` we do check that `len(stages) > 2`. I guess the issue I ran into was because of this line below:

    stages if looped_schedule else stages[0],

I accidentally commented out the schedule config, so it defaulted to a single-stage schedule but still had the stages cut as if for a "looped" schedule. Since in the line above only `stages[0]` is retrieved, the schedule was created without error, but it later caused a hang at runtime.
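To make the failure concrete, a hypothetical repro of that scenario (the stage objects, microbatch count, and `loss_fn` are made up and assumed to be in scope):

```python
from torch.distributed.pipelining import ScheduleGPipe  # a single-stage schedule

# Looped split: this rank holds two model chunks.
stages = [stage_a, stage_b]  # hypothetical PipelineStage objects for this rank

# With looped_schedule mistakenly False, only stages[0] reaches the schedule.
# stage_b's sends/recvs are never scheduled, so peer ranks block on
# communication that never arrives and training hangs.
schedule = ScheduleGPipe(stages[0], n_microbatches=4, loss_fn=loss_fn)
```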
If we wanted to be more consistent, we could just support passing in a list of stages for both single and multi; then all the validation could be done in PyTorch.
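For example, a sketch of that more consistent entry point (the helper name and signature are hypothetical; only the single/multi split mirrors the discussion):

```python
from torch.distributed.pipelining.schedules import PipelineScheduleSingle

def make_schedule(schedule_class, stages, n_microbatches, loss_fn):
    # Hypothetical wrapper: always take a list of stages, so the count
    # check can live in one place (or move into the upstream __init__s).
    if issubclass(schedule_class, PipelineScheduleSingle):
        if len(stages) != 1:
            raise ValueError(
                f"{schedule_class.__name__} expects exactly 1 stage, got {len(stages)}"
            )
        return schedule_class(stages[0], n_microbatches=n_microbatches, loss_fn=loss_fn)
    # Multi/looped schedules take the full list.
    return schedule_class(stages, n_microbatches=n_microbatches, loss_fn=loss_fn)
```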
Stack from ghstack (oldest at bottom):
I ran into this issue with a config where I was using the wrong schedule, so hopefully an error message can prevent this for other users as well.