Add option to accumulate train loss over tokens. #3273

Merged: 32 commits merged into mosaicml:dev on Jun 6, 2024
Conversation

@aadyotb (Contributor) commented May 9, 2024:

What does this PR do?

Adds an option to accumulate train loss over the number of tokens in a batch instead of the number of samples.

What issue(s) does this change relate to?

Currently, losses in the trainer are accumulated in a way that weights each sample in a batch equally. However, for NLP use cases where batches contain padding tokens, it makes more sense to accumulate the loss in a way that instead weights every (non-padding) token equally.
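
For illustration, here is a minimal, framework-free sketch (not Composer's actual implementation; all names and numbers are made up) contrasting the two accumulation schemes for a device batch split into microbatches:

```python
import torch

# Hypothetical per-microbatch statistics for one device batch split into two
# microbatches. `mean_loss` is the model's mean cross-entropy over the
# non-padding tokens in that microbatch.
microbatches = [
    {'mean_loss': torch.tensor(2.0), 'num_samples': 4, 'num_tokens': 48},
    {'mean_loss': torch.tensor(5.0), 'num_samples': 4, 'num_tokens': 12},  # mostly padding
]

total_samples = sum(mb['num_samples'] for mb in microbatches)
total_tokens = sum(mb['num_tokens'] for mb in microbatches)

# Sample weighting: every sample counts equally, so a short, heavily padded
# sequence influences the batch loss as much as a long one.
loss_by_samples = sum(mb['mean_loss'] * mb['num_samples'] / total_samples for mb in microbatches)

# Token weighting: every non-padding token counts equally, which recovers the
# global per-token mean regardless of how padding is distributed.
loss_by_tokens = sum(mb['mean_loss'] * mb['num_tokens'] / total_tokens for mb in microbatches)

print(loss_by_samples.item())  # 3.5
print(loss_by_tokens.item())   # 2.6
```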

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@mvpatel2000 requested review from @dakinggg and @irenedea on May 14, 2024.
@mvpatel2000 (Contributor) commented:

@dakinggg @irene-dea can you please look?

I agree we should have an option for this. I'm not sure if it's necessary to pass it to Composer vs. checking whether it's an attribute/property on a dataloader.

@dakinggg (Contributor) commented:
@mvpatel2000 I think trainer arg is right for this...code looks fine at a glance, would want to test a bit more before merging.

@mvpatel2000 (Contributor) commented:

> @mvpatel2000 I think trainer arg is right for this...code looks fine at a glance, would want to test a bit more before merging.

You will own testing?

@mvpatel2000 (Contributor) left a review comment:

Can we add a unit test please?

Review thread: composer/trainer/trainer.py (resolved)
@aadyotb (Contributor, Author) commented May 14, 2024:

> Can we add a unit test please?

Is there a good template I can base this test on? I'm not sure how to isolate the impact of this change from the rest of the trainer.

@dakinggg (Contributor) commented:

@aadyotb Here is a test that exercises the full NLP pipeline:

```python
@device('cpu', 'gpu')
# Note: the specificity of these settings are due to incompatibilities (e.g. the simpletransformer model is not traceable)
@pytest.mark.parametrize(
    'model_type,algorithms,save_format',
    [
        ('tinybert_hf', [GatedLinearUnits], 'onnx'),
        ('simpletransformer', [], 'torchscript'),
    ],
)
@pytest.mark.parametrize('onnx_opset_version', [13, None])
def test_full_nlp_pipeline(
    model_type,
    algorithms,
    save_format,
    tiny_bert_tokenizer,
    onnx_opset_version,
    tmp_path,
    request,
    device,
):
    """This test is intended to exercise our full pipeline for NLP.

    To this end, it performs pretraining, loads the pretrained model with a classification head for finetuning
    and finetunes it, exports the model for inference, and loads it back in to make predictions.
    """
    pytest.importorskip('libcloud')
    pytest.importorskip('transformers')
    if onnx_opset_version == None and version.parse(torch.__version__) < version.parse('1.13'):
        pytest.skip("Don't test prior PyTorch version's default Opset version.")

    algorithms = [algorithm() for algorithm in algorithms]
    device = get_device(device)

    tiny_bert_model = None
    if model_type == 'tinybert_hf':
        tiny_bert_model = request.getfixturevalue('tiny_bert_model')

    # pretraining
    if model_type == 'tinybert_hf':
        assert tiny_bert_model is not None
        pretraining_metrics = [LanguageCrossEntropy(ignore_index=-100), MaskedAccuracy(ignore_index=-100)]
        pretraining_model = HuggingFaceModel(
            tiny_bert_model,
            tiny_bert_tokenizer,
            use_logits=True,
            metrics=pretraining_metrics,
        )
    elif model_type == 'simpletransformer':
        pretraining_model = SimpleTransformerMaskedLM(vocab_size=tiny_bert_tokenizer.vocab_size)
    else:
        raise ValueError('Unsupported model type')
    pretraining_output_path = pretraining_test_helper(
        tiny_bert_tokenizer,
        pretraining_model,
        algorithms,
        tmp_path,
        device,
    )

    # finetuning
    if model_type == 'tinybert_hf':
        finetuning_metric = MulticlassAccuracy(num_classes=3, average='micro')
        hf_finetuning_model, _ = HuggingFaceModel.hf_from_composer_checkpoint(
            pretraining_output_path,
            model_instantiation_class='transformers.AutoModelForSequenceClassification',
            model_config_kwargs={'num_labels': 3},
        )
        finetuning_model = HuggingFaceModel(
            model=hf_finetuning_model,
            tokenizer=tiny_bert_tokenizer,
            use_logits=True,
            metrics=[finetuning_metric],
        )
    elif model_type == 'simpletransformer':
        finetuning_model = SimpleTransformerClassifier(vocab_size=tiny_bert_tokenizer.vocab_size, num_classes=3)
    else:
        raise ValueError('Unsupported model type.')
    finetuning_model_copy = copy.deepcopy(finetuning_model)

    finetuning_trainer, finetuning_dataloader, rud, finetuning_output_path = finetuning_test_helper(
        tiny_bert_tokenizer,
        finetuning_model,
        algorithms,
        pretraining_output_path,
        pretraining_model,
        tmp_path,
        device,
    )

    # inference
    batch = next(iter(finetuning_dataloader))
    finetuning_trainer.state.model.to('cpu')
    finetuning_trainer.state.model.eval()
    original_output = finetuning_trainer.state.model(batch)
    inference_test_helper(
        finetuning_output_path,
        rud,
        finetuning_model_copy,
        algorithms,
        batch,
        original_output,
        onnx_opset_version,
        tmp_path,
        save_format,
        device,
    )
```

I think to test this we probably want to (for just the training part of the code above) construct a model that has a deterministic loss (based on the number of padding tokens, maybe?) and then test that the results differ in the expected way between sample weighting and token weighting.

@dakinggg (Contributor) commented:
So basically make a trainer with a dummy model and a dummy dataset, and then call it with sample weighting and token weighting (with microbatching on), and assert the losses are different in the expected way.
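
A rough sketch of the property such a test would check (framework-free; the dummy "model" and all numbers here are hypothetical illustrations, not Composer's Trainer API or the PR's actual test):

```python
import torch

PAD_ID = 0

# Two microbatches with very different amounts of padding, built so the
# per-microbatch mean losses are easy to verify by hand.
mb_long = torch.full((4, 16), 9)                                                     # no padding, mean token id 9
mb_short = torch.cat([torch.full((4, 4), 3), torch.full((4, 12), PAD_ID)], dim=1)    # mostly padding, mean 3

def dummy_mean_loss(batch: torch.Tensor) -> torch.Tensor:
    """Deterministic stand-in for a model loss: mean token id over non-padding positions."""
    mask = batch != PAD_ID
    return (batch * mask).float().sum() / mask.sum()

losses = [dummy_mean_loss(mb_long), dummy_mean_loss(mb_short)]    # 9.0, 3.0
samples = [mb_long.shape[0], mb_short.shape[0]]                   # 4, 4
tokens = [(mb_long != PAD_ID).sum(), (mb_short != PAD_ID).sum()]  # 64, 16

loss_by_samples = sum(l * n / sum(samples) for l, n in zip(losses, samples))  # (9 + 3) / 2 = 6.0
loss_by_tokens = sum(l * t / sum(tokens) for l, t in zip(losses, tokens))     # (9*64 + 3*16) / 80 = 7.8

# When padding is distributed unevenly across microbatches, the two weighting
# schemes must produce different accumulated losses; that is the property the
# unit test asserts.
assert not torch.isclose(loss_by_samples, loss_by_tokens)
```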

@aadyotb (Contributor, Author) commented May 24, 2024:

@dakinggg I've added a unit test that requires that sample-based and token-based weighting produce different outcomes when padding is present.

@dakinggg (Contributor) commented:
@aadyotb awesome, thank you!! Will take a look soon.

Review thread: tests/test_simple_nlp.py (outdated, resolved)
aadyotb and others added 5 commits on May 25, 2024, with commit messages including:
  • Reproducibility isn't good enough on CPU.
  • This is to ensure that each rank contributes the appropriate gradient amount based on the number of samples/tokens present.
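
The second commit message refers to rescaling the loss so that, under data parallelism, a rank holding fewer samples or tokens contributes proportionally less to the averaged gradient. A minimal sketch of that idea (not the PR's actual code; the helper name is made up):

```python
import torch
import torch.distributed as dist

def rescale_loss_for_ddp(mean_loss: torch.Tensor, local_count: int) -> torch.Tensor:
    """Scale a per-rank mean loss (over `local_count` samples or tokens) so that
    DDP's gradient averaging weights every sample/token equally across ranks."""
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    global_count = torch.tensor(float(local_count))
    if dist.is_initialized():
        dist.all_reduce(global_count)  # sum of per-rank counts
    # DDP divides the summed gradient by world_size, so multiply it back in;
    # the result is equivalent to optimizing sum(all losses) / sum(all counts).
    return mean_loss * local_count * world_size / global_count
```

With this scaling, a rank whose batch is mostly padding contributes correspondingly less to a token-weighted update.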
@aadyotb (Contributor, Author) commented May 28, 2024:

Thanks @mvpatel2000. For the time being, I've implemented these changes by overriding the Trainer class in our local repo, so we will be okay for now. Happy to get further review once Daniel returns.

@aadyotb (Contributor, Author) commented Jun 5, 2024:

@dakinggg bumping this PR for review.

@mvpatel2000 (Contributor) left a review comment:
LGTM but requiring sign off from @dakinggg as well

Review thread: composer/trainer/trainer.py (outdated, resolved)
@mvpatel2000 (Contributor) left a review comment:

Will let Daniel review the unit test.

Review threads: tests/test_simple_nlp.py ×4 (outdated, resolved)
@dakinggg (Contributor) commented Jun 5, 2024:

Hey @aadyotb taking a look now, mostly convincing myself that the math is correct :)

Review threads: composer/trainer/trainer.py ×3 (2 outdated and resolved, 1 resolved)
@dakinggg (Contributor) left a review comment:
Other than waiting for Mihir's thought on changing the behavior for the sample weighting case, looks good to me! Thanks so much for the contribution!

@mvpatel2000 (Contributor) left a review comment:
Approving

@dakinggg (Contributor) commented Jun 6, 2024:

Ok cool, LGTM. I'm running a before-and-after loss curve just to double check (in the normal case, with an even number of samples per device batch), and will post that graph here when done.

@dakinggg (Contributor) commented Jun 6, 2024:

[Screenshot: before/after loss curve comparison, 2024-06-05]

For posterity, I validated that finetuning behavior is unchanged before and after this PR (for a case with constant samples per device batch), and does change if you specify the new flag.

@dakinggg (Contributor) commented Jun 6, 2024:

@aadyotb I think one more run of precommit should do it, and we should be good to merge.

@dakinggg merged commit f039f06 into mosaicml:dev on Jun 6, 2024. 17 checks passed.