Add option to accumulate train loss over tokens. #3273

Merged: 32 commits merged into mosaicml:dev on Jun 6, 2024
Conversation

@aadyotb (Contributor) commented May 9, 2024:

What does this PR do?

Adds an option to accumulate train loss over the number of tokens in a batch instead of the number of samples.

What issue(s) does this change relate to?

Currently, losses in the trainer are accumulated in a way that weights each sample in a batch equally. However, for NLP use cases where batches contain padding tokens, it makes more sense to accumulate the loss in a way that instead weights every (non-padding) token equally.
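
For illustration, here is a minimal, framework-free sketch (not Composer's actual implementation; all names and numbers are made up) contrasting the two accumulation schemes for a device batch split into microbatches:

```python
import torch

# Hypothetical per-microbatch statistics for one device batch split into two
# microbatches. `mean_loss` is the model's mean cross-entropy over the
# non-padding tokens in that microbatch.
microbatches = [
    {'mean_loss': torch.tensor(2.0), 'num_samples': 4, 'num_tokens': 48},
    {'mean_loss': torch.tensor(5.0), 'num_samples': 4, 'num_tokens': 12},  # mostly padding
]

total_samples = sum(mb['num_samples'] for mb in microbatches)
total_tokens = sum(mb['num_tokens'] for mb in microbatches)

# Sample weighting: every sample counts equally, so a short, heavily padded
# sequence influences the batch loss as much as a long one.
loss_by_samples = sum(mb['mean_loss'] * mb['num_samples'] / total_samples for mb in microbatches)

# Token weighting: every non-padding token counts equally, which recovers the
# global per-token mean regardless of how padding is distributed.
loss_by_tokens = sum(mb['mean_loss'] * mb['num_tokens'] / total_tokens for mb in microbatches)

print(loss_by_samples.item())  # 3.5
print(loss_by_tokens.item())   # 2.6
```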

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@mvpatel2000 requested review from @dakinggg and @irenedea on May 14, 2024.
@mvpatel2000 (Contributor) commented:

@dakinggg @irene-dea can you please look?

I agree we should have an option for this. I'm not sure if it's necessary to pass it to Composer vs. checking whether it's an attribute/property on a dataloader.

@dakinggg (Contributor) commented:
@mvpatel2000 I think trainer arg is right for this...code looks fine at a glance, would want to test a bit more before merging.

@mvpatel2000 (Contributor) commented:

> @mvpatel2000 I think trainer arg is right for this...code looks fine at a glance, would want to test a bit more before merging.

You will own testing?

@mvpatel2000 (Contributor) left a review comment:

Can we add a unit test please?

Review thread: composer/trainer/trainer.py (resolved)
@aadyotb (Contributor, Author) commented May 14, 2024:

> Can we add a unit test please?

Is there a good template I can base this test on? I'm not sure how to isolate the impact of this change from the rest of the trainer.

@dakinggg (Contributor) commented:

@aadyotb Here is a test that exercises the full NLP pipeline:

```python
@device('cpu', 'gpu')
# Note: the specificity of these settings are due to incompatibilities (e.g. the simpletransformer model is not traceable)
@pytest.mark.parametrize(
    'model_type,algorithms,save_format',
    [
        ('tinybert_hf', [GatedLinearUnits], 'onnx'),
        ('simpletransformer', [], 'torchscript'),
    ],
)
@pytest.mark.parametrize('onnx_opset_version', [13, None])
def test_full_nlp_pipeline(
    model_type,
    algorithms,
    save_format,
    tiny_bert_tokenizer,
    onnx_opset_version,
    tmp_path,
    request,
    device,
):
    """This test is intended to exercise our full pipeline for NLP.

    To this end, it performs pretraining, loads the pretrained model with a classification head for finetuning
    and finetunes it, exports the model for inference, and loads it back in to make predictions.
    """
    pytest.importorskip('libcloud')
    pytest.importorskip('transformers')
    if onnx_opset_version == None and version.parse(torch.__version__) < version.parse('1.13'):
        pytest.skip("Don't test prior PyTorch version's default Opset version.")

    algorithms = [algorithm() for algorithm in algorithms]
    device = get_device(device)

    tiny_bert_model = None
    if model_type == 'tinybert_hf':
        tiny_bert_model = request.getfixturevalue('tiny_bert_model')

    # pretraining
    if model_type == 'tinybert_hf':
        assert tiny_bert_model is not None
        pretraining_metrics = [LanguageCrossEntropy(ignore_index=-100), MaskedAccuracy(ignore_index=-100)]
        pretraining_model = HuggingFaceModel(
            tiny_bert_model,
            tiny_bert_tokenizer,
            use_logits=True,
            metrics=pretraining_metrics,
        )
    elif model_type == 'simpletransformer':
        pretraining_model = SimpleTransformerMaskedLM(vocab_size=tiny_bert_tokenizer.vocab_size)
    else:
        raise ValueError('Unsupported model type')
    pretraining_output_path = pretraining_test_helper(
        tiny_bert_tokenizer,
        pretraining_model,
        algorithms,
        tmp_path,
        device,
    )

    # finetuning
    if model_type == 'tinybert_hf':
        finetuning_metric = MulticlassAccuracy(num_classes=3, average='micro')
        hf_finetuning_model, _ = HuggingFaceModel.hf_from_composer_checkpoint(
            pretraining_output_path,
            model_instantiation_class='transformers.AutoModelForSequenceClassification',
            model_config_kwargs={'num_labels': 3},
        )
        finetuning_model = HuggingFaceModel(
            model=hf_finetuning_model,
            tokenizer=tiny_bert_tokenizer,
            use_logits=True,
            metrics=[finetuning_metric],
        )
    elif model_type == 'simpletransformer':
        finetuning_model = SimpleTransformerClassifier(vocab_size=tiny_bert_tokenizer.vocab_size, num_classes=3)
    else:
        raise ValueError('Unsupported model type.')
    finetuning_model_copy = copy.deepcopy(finetuning_model)

    finetuning_trainer, finetuning_dataloader, rud, finetuning_output_path = finetuning_test_helper(
        tiny_bert_tokenizer,
        finetuning_model,
        algorithms,
        pretraining_output_path,
        pretraining_model,
        tmp_path,
        device,
    )

    # inference
    batch = next(iter(finetuning_dataloader))
    finetuning_trainer.state.model.to('cpu')
    finetuning_trainer.state.model.eval()
    original_output = finetuning_trainer.state.model(batch)
    inference_test_helper(
        finetuning_output_path,
        rud,
        finetuning_model_copy,
        algorithms,
        batch,
        original_output,
        onnx_opset_version,
        tmp_path,
        save_format,
        device,
    )
```

I think to test this we probably want to (for just the training part of the code above) construct a model that has a deterministic loss (based on the number of padding tokens, maybe?) and then test that the results differ in the expected way between sample weighting and token weighting.

@dakinggg (Contributor) commented:
So basically make a trainer with a dummy model and a dummy dataset, and then call it with sample weighting and token weighting (with microbatching on), and assert the losses are different in the expected way.
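
A rough sketch of the property such a test would check (framework-free; the dummy "model" and all numbers here are hypothetical illustrations, not Composer's Trainer API or the PR's actual test):

```python
import torch

PAD_ID = 0

# Two microbatches with very different amounts of padding, built so the
# per-microbatch mean losses are easy to verify by hand.
mb_long = torch.full((4, 16), 9)                                                     # no padding, mean token id 9
mb_short = torch.cat([torch.full((4, 4), 3), torch.full((4, 12), PAD_ID)], dim=1)    # mostly padding, mean 3

def dummy_mean_loss(batch: torch.Tensor) -> torch.Tensor:
    """Deterministic stand-in for a model loss: mean token id over non-padding positions."""
    mask = batch != PAD_ID
    return (batch * mask).float().sum() / mask.sum()

losses = [dummy_mean_loss(mb_long), dummy_mean_loss(mb_short)]    # 9.0, 3.0
samples = [mb_long.shape[0], mb_short.shape[0]]                   # 4, 4
tokens = [(mb_long != PAD_ID).sum(), (mb_short != PAD_ID).sum()]  # 64, 16

loss_by_samples = sum(l * n / sum(samples) for l, n in zip(losses, samples))  # (9 + 3) / 2 = 6.0
loss_by_tokens = sum(l * t / sum(tokens) for l, t in zip(losses, tokens))     # (9*64 + 3*16) / 80 = 7.8

# When padding is distributed unevenly across microbatches, the two weighting
# schemes must produce different accumulated losses; that is the property the
# unit test asserts.
assert not torch.isclose(loss_by_samples, loss_by_tokens)
```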

@aadyotb (Contributor, Author) commented May 24, 2024:

@dakinggg I've added a unit test that requires that sample-based and token-based weighting produce different outcomes when padding is present.

@dakinggg (Contributor) commented:
@aadyotb awesome, thank you!! Will take a look soon.

Review thread: tests/test_simple_nlp.py (outdated, resolved)
aadyotb and others added 5 commits on May 25, 2024, with commit messages including:
  • Reproducibility isn't good enough on CPU.
  • This is to ensure that each rank contributes the appropriate gradient amount based on the number of samples/tokens present.
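
The second commit message refers to rescaling the loss so that, under data parallelism, a rank holding fewer samples or tokens contributes proportionally less to the averaged gradient. A minimal sketch of that idea (not the PR's actual code; the helper name is made up):

```python
import torch
import torch.distributed as dist

def rescale_loss_for_ddp(mean_loss: torch.Tensor, local_count: int) -> torch.Tensor:
    """Scale a per-rank mean loss (over `local_count` samples or tokens) so that
    DDP's gradient averaging weights every sample/token equally across ranks."""
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    global_count = torch.tensor(float(local_count))
    if dist.is_initialized():
        dist.all_reduce(global_count)  # sum of per-rank counts
    # DDP divides the summed gradient by world_size, so multiply it back in;
    # the result is equivalent to optimizing sum(all losses) / sum(all counts).
    return mean_loss * local_count * world_size / global_count
```

With this scaling, a rank whose batch is mostly padding contributes correspondingly less to a token-weighted update.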
@aadyotb (Contributor, Author) commented May 28, 2024:

Thanks @mvpatel2000. For the time being, I've implemented these changes by overriding the Trainer class in our local repo, so we will be okay for now. Happy to get further review once Daniel returns.

@aadyotb (Contributor, Author) commented Jun 5, 2024:

@dakinggg bumping this PR for review.

@mvpatel2000 (Contributor) left a review comment:
LGTM but requiring sign off from @dakinggg as well

Review thread: composer/trainer/trainer.py (outdated, resolved)
@mvpatel2000 (Contributor) left a review comment:

Will let Daniel review the unit test.

Review threads: tests/test_simple_nlp.py ×4 (outdated, resolved)
@dakinggg (Contributor) commented Jun 5, 2024:

Hey @aadyotb taking a look now, mostly convincing myself that the math is correct :)

Review threads: composer/trainer/trainer.py ×3 (2 outdated and resolved, 1 resolved)
@dakinggg (Contributor) left a review comment:
Other than waiting for Mihir's thought on changing the behavior for the sample weighting case, looks good to me! Thanks so much for the contribution!

@mvpatel2000 (Contributor) left a review comment:
Approving

@dakinggg (Contributor) commented Jun 6, 2024:

Ok cool, LGTM. I'm running a before-and-after loss curve just to double check (in the normal case, with an even number of samples per device batch), and will post that graph here when done.

@dakinggg (Contributor) commented Jun 6, 2024:

[Screenshot: before/after loss curve comparison, 2024-06-05]

For posterity, I validated that finetuning behavior is unchanged before and after this PR (for a case with constant samples per device batch), and does change if you specify the new flag.

@dakinggg (Contributor) commented Jun 6, 2024:

@aadyotb I think one more run of precommit should do it, and we should be good to merge.

@dakinggg merged commit f039f06 into mosaicml:dev on Jun 6, 2024. 17 checks passed.