Add option to accumulate train loss over tokens. #3273
Conversation
@dakinggg @irene-dea can you please look? I agree we should have an option for this. I'm not sure if it's necessary to pass to Composer vs. check if it's an attribute/property on a dataloader.
@mvpatel2000 I think a trainer arg is right for this... code looks fine at a glance, but I would want to test a bit more before merging.
You will own testing?
Can we add a unit test please?
Is there a good template I can base this test on? I'm not sure how to isolate the impact of this change from the rest of the trainer.
@aadyotb Here is a test that exercises the full NLP pipeline (composer/tests/test_full_nlp.py, lines 231 to 338 at 01eec3a).
So basically: make a trainer with a dummy model and a dummy dataset, call it with sample weighting and with token weighting (with microbatching on), and assert the losses are different in the expected way.
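A stripped-down sketch of that check, exercising only the weighting math rather than the full Trainer (the shapes, seed, and padding layout below are made up for illustration), could look like:

```python
import torch
import torch.nn.functional as F

def test_sample_and_token_weighting_differ_with_padding():
    torch.manual_seed(0)
    batch, seq_len, vocab = 2, 8, 11
    logits = torch.randn(batch, seq_len, vocab)
    targets = torch.randint(0, vocab, (batch, seq_len))
    targets[0, 2:] = -100  # heavily pad the first sample; -100 marks ignored positions

    # Per-token cross-entropy; ignored (padding) positions come back as 0.
    per_token = F.cross_entropy(
        logits.view(-1, vocab), targets.view(-1), ignore_index=-100, reduction="none"
    ).view(batch, seq_len)
    mask = (targets != -100).float()

    per_sample = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    sample_weighted = per_sample.mean()                     # each sample weighted equally
    token_weighted = (per_token * mask).sum() / mask.sum()  # each token weighted equally

    # With uneven padding across samples, the two reductions should disagree.
    assert not torch.isclose(sample_weighted, token_weighted)
```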
@dakinggg I've added a unit test that requires that sample-based and token-based weighting produce different results when padding is present.
@aadyotb awesome, thank you!! Will take a look soon.
Reproducibility isn't good enough on CPU.
This is to ensure that each rank contributes the appropriate gradient amount based on the number of samples/tokens present.
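Roughly, the idea is something like the following sketch, assuming each microbatch can report its non-padding token count; the function and names here are illustrative, not Composer's actual API:

```python
import torch.nn.functional as F

def backward_token_weighted(model, microbatches, total_non_pad_tokens):
    """Accumulate gradients so that every non-padding token in the full batch
    contributes equally, regardless of how tokens are split across microbatches."""
    for inputs, targets in microbatches:
        logits = model(inputs)
        # Sum of per-token losses; padding positions (target == -100) are excluded.
        loss_sum = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1),
            ignore_index=-100,
            reduction="sum",
        )
        # Dividing by the batch-wide token count (rather than per-microbatch means)
        # makes the accumulated gradient match a single token-weighted pass.
        (loss_sum / total_non_pad_tokens).backward()
```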
Thanks @mvpatel2000. For the time being, I've implemented these changes by overriding the …
@dakinggg bumping this PR for review.
LGTM, but requiring sign-off from @dakinggg as well.
Will let Daniel review the unit test.
Hey @aadyotb taking a look now, mostly convincing myself that the math is correct :)
Other than waiting for Mihir's thoughts on changing the behavior for the sample-weighting case, this looks good to me! Thanks so much for the contribution!
Approving
Ok cool, LGTM. I'm running before-and-after loss curves just to double-check (in the normal case, with an even number of samples per device batch), and will post that graph here when done.
@aadyotb I think one more run of pre-commit should do it, and we should be good to merge.
What does this PR do?
Adds an option to accumulate train loss over the number of tokens in a batch instead of the number of samples.
What issue(s) does this change relate to?
Currently, losses in the trainer are accumulated in a way that weights each sample in a batch equally. However, for NLP use cases where batches contain padding tokens, it makes more sense to accumulate the loss in a way that instead weights every (non-padding) token equally.
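As a toy illustration of the difference (not Composer's actual code path; the numbers below are made up), the two accumulation schemes disagree as soon as samples have different numbers of non-padding tokens:

```python
import torch

# Toy batch of two samples: one mostly padding, one full length.
# per_sample_loss_sum[i] is the summed cross-entropy over sample i's non-padding tokens.
per_sample_loss_sum = torch.tensor([2.0, 24.0])
non_pad_tokens = torch.tensor([2.0, 16.0])

# Sample weighting: each sample counts equally, so the short, mostly-padded
# sample gets the same weight as the full-length one.
sample_weighted = (per_sample_loss_sum / non_pad_tokens).mean()    # (1.0 + 1.5) / 2 = 1.25

# Token weighting: each non-padding token counts equally.
token_weighted = per_sample_loss_sum.sum() / non_pad_tokens.sum()  # 26 / 18 ≈ 1.44

print(sample_weighted.item(), token_weighted.item())
```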
Before submitting
Have you run pre-commit on your change? (See the pre-commit section of the prerequisites.)