
update modular_modernbert -- add inputs_embeds param to ModernBertModel #35373

Merged (5 commits, Jan 9, 2025)

Conversation

jxmorris12 (Contributor)

What does this PR do?

Hi! Congrats on the release of ModernBERT; it looks amazing. I'm interested in using ModernBERT eventually to train a new Contextual Document Embeddings model.

One desired feature is to pass the contextual and word embeddings together in the second stage, which requires setting the inputs_embeds kwarg so that we can pass hidden states directly. This is a feature of typical BERT and other transformer implementations but isn't yet allowed by ModernBERT, so I added it. It's only a few additional lines of code.
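For context, a minimal sketch of the intended usage once inputs_embeds is supported (the checkpoint name is the released ModernBERT base model; building the embeddings via get_input_embeddings() is just one illustrative way to obtain them):

    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")
    tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

    inputs = tokenizer("hello world", return_tensors="pt")
    # Build embeddings outside the model, e.g. to mix in contextual vectors.
    embeds = model.get_input_embeddings()(inputs["input_ids"])

    # Pass precomputed embeddings instead of token ids.
    outputs = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])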

cc: @warner-benjamin @tomaarsen @orionw @staghado @bclavie @NohTow @ArthurZucker

@ArthurZucker (Collaborator) left a comment

Thanks! Yep, would be nice to add. Let's make it a bit simpler (see how Llama has fewer if/elses) and add a small test! 🤗

@tomaarsen self-requested a review (December 28, 2024, 12:18)
@tomaarsen (Member)

Hello @jxmorris12, @ArthurZucker,

I pushed some changes into this PR to get it closer to completion. Let me know if you're not okay with this, and you can easily revert or delete the commits.

The changes:

  1. The inputs_embeds shouldn't fully replace self.embeddings(input_ids), because that call also does layer normalization and dropout. So now both input_ids and inputs_embeds are passed to ModernBertEmbeddings, much like how BertEmbeddings is implemented.
  2. I added inputs_embeds to the docstring and propagated the changes to the other model classes.
  3. I introduced an error if input_ids and inputs_embeds are both provided, or neither.
  4. I fixed an issue with the device being based solely on input_ids when working with attention_mask.
  5. I fixed a test and reintroduced another.

Let's make it a bit simpler (see how Llama has fewer if/elses)

This is sadly not as simple as it seems due to _unpad_modernbert_input.
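For reference, a toy sketch of the pattern described in points (1) and (3) above (the attribute names mirror the ModernBERT modeling code, but this is illustrative, not the actual implementation):

    from torch import nn

    class EmbeddingsSketch(nn.Module):
        # Toy stand-in for ModernBertEmbeddings: layer norm and dropout are applied
        # whether token ids or precomputed embeddings are provided.
        def __init__(self, vocab_size=30, dim=8):
            super().__init__()
            self.tok_embeddings = nn.Embedding(vocab_size, dim)
            self.norm = nn.LayerNorm(dim)
            self.drop = nn.Dropout(0.1)

        def forward(self, input_ids=None, inputs_embeds=None):
            # Exactly one of input_ids / inputs_embeds must be given (XOR check).
            if (input_ids is None) ^ (inputs_embeds is not None):
                raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
            hidden = inputs_embeds if inputs_embeds is not None else self.tok_embeddings(input_ids)
            return self.drop(self.norm(hidden))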

  • Tom Aarsen

Comment on lines 1227 to 1228
if self.config._attn_implementation == "flash_attention_2":
    if indices is None and cu_seqlens is None and max_seqlen is None:
Member:

It's a bit frustrating that this entire tree is necessary, but there's no other convenient way to prevent the base model from repadding while still allowing this class to repad, as otherwise the base model would also have to return the batch size, indices, seqlens, etc. so that this class could repad.

Collaborator:

TBH it might make more sense! This way only the base model unpads, and the other models can freely do the unpadding. Kinda up to you 🤗

@NohTow (Contributor) commented on Dec 30, 2024

Hello,

Thanks for opening this PR, I really love your work on CDE and I look forward to seeing how it will perform with ModernBERT!
FYI, this feature (being able to pass inputs_embeds) has also been requested for other use cases here, so besides backward compatibility and CDE, it seems like this feature is used by the community and is thus a very cool addition.

I'll make a more thorough review when I come back from vacation this week, but I already checked how to implement it for the linked issue and it seems to be in line with the latest change from Tom.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jxmorris12 (Contributor, Author)

Hello @jxmorris12, @ArthurZucker,

I pushed some changes into this PR to get it closer to completion. Let me know if you're not okay with this, and you can easily revert or delete the commits.
...

thanks Tom! Looks great.

@jxmorris12 (Contributor, Author)

Hi @NohTow @ArthurZucker just following up on this -- thanks :)

@znsoftm commented on Jan 7, 2025

When will it be merged into the main branch?

@ArthurZucker (Collaborator) left a comment

Thanks for updating, let's remove complexity and good to go!

Comment on lines +1042 to +1045
if inputs_embeds is not None:
    batch_size, seq_len = inputs_embeds.shape[:2]
else:
    batch_size, seq_len = input_ids.shape[:2]
Collaborator:

if you first embed input ids you don't need two branches!
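For illustration, a tiny standalone demo of the suggestion (toy tensors, not the real model): once the embedding layer has produced hidden_states from whichever input was given, a single shape read covers both cases.

    import torch

    embedding = torch.nn.Embedding(10, 4)
    input_ids = torch.randint(0, 10, (2, 5))
    inputs_embeds = None

    hidden_states = inputs_embeds if inputs_embeds is not None else embedding(input_ids)
    batch_size, seq_len = hidden_states.shape[:2]  # one code path for both cases
    print(batch_size, seq_len)  # 2 5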

Comment on lines +1056 to 1063
    with torch.no_grad():
        input_ids, indices, cu_seqlens, max_seqlen, *_ = _unpad_modernbert_input(
            inputs=input_ids, attention_mask=attention_mask
        )
else:
    inputs_embeds, indices, cu_seqlens, max_seqlen, *_ = _unpad_modernbert_input(
        inputs=inputs_embeds, attention_mask=attention_mask
    )
Collaborator:

only the input is different here! if you first embed the input, then you don't need two branches!

Contributor (Author):

Well, the old code contains a torch.no_grad here; that's why there are two branches. Did you all mean to wrap the unpad function in a no_grad? If so, we need to keep it. We definitely can't unpad inputs_embeds in a no_grad block because it will break gradient flow.
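A quick standalone demo of why the no_grad branch cannot simply be reused for inputs_embeds (toy tensors, not the model code):

    import torch

    x = torch.randn(4, 8, requires_grad=True)  # stands in for inputs_embeds

    # Unpadding-style gather inside no_grad: the result is detached from the graph,
    # so nothing upstream of the unpad would ever receive gradients.
    with torch.no_grad():
        y = x[torch.tensor([0, 2])]
    print(y.requires_grad)  # False

    # The same gather outside no_grad keeps the autograd graph intact.
    z = x[torch.tensor([0, 2])]
    print(z.requires_grad)  # True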

Collaborator:

Ah, as I mentioned last time, I still don't understand why you would need gradient flow for padding / unpadding when it's weight agnostic?

Member:

I believe #35386 is a more detailed bug report on the gradient breaks

Collaborator:

My bad! Indeed for input embedding you need grads if you are training an encoder!


@NohTow (Contributor) commented on Jan 7, 2025

Hello,
Sorry for the delay.
As mentioned earlier, the code looks good to me.
I only have two small nitpicks. As mentioned by @ArthurZucker, first embedding the input_ids should be cleaner and less error-prone.
The second is that the XOR test is a bit less informative than checking that the two are not set together and then checking that at least one is defined. Maybe it's just me, but I find "You must specify exactly one of input_ids or inputs_embeds" a bit ambiguous, yet I don't have a better wording.

Besides those very small details, everything looks good!
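For reference, the two styles being compared, as standalone sketches (the sequential messages approximate the ones in modeling_bert; the XOR message is the one quoted above):

    def check_inputs_xor(input_ids=None, inputs_embeds=None):
        # XOR-style check, as in Llama and this PR: one combined message.
        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")

    def check_inputs_sequential(input_ids=None, inputs_embeds=None):
        # Sequential checks, as in BERT: two more specific messages.
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")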

@ArthurZucker (Collaborator)

The XOR test is used in a few models like Llama -> it's standard for us! 🤗

@jxmorris12 (Contributor, Author)

Sorry @NohTow - can you make a concrete suggestion on how to fix the code? It's not clear how the code can be much cleaner since all the conditionals seem necessary to me.

@NohTow (Contributor) commented on Jan 8, 2025

@jxmorris12 I am very sorry, my message was unclear. I did not mean that the code was incorrect, nor that testing the two conditions sequentially was cleaner; I just meant that from my reading, the error message could be ambiguous (in the case where the user does not feed anything, the "exactly" might not be explicit enough).
But again, that is really a nitpick and probably comes from me not properly reading the message, let's ignore that!

@jxmorris12 (Contributor, Author)

@jon-tow Oh okay, that makes sense to me! I agree it could be a really confusing error for the user. But maybe since it's in a lot of files (such as LLAMA) we could open a separate issue to improve that error message everywhere?

@NohTow (Contributor) commented on Jan 8, 2025

@jxmorris12 Yeah totally. I raised that because I compared the BERT implementation with this one, and the BERT one does the checks sequentially and thus has more informative messages. But since Arthur raised that it is already done like that in other models and is standard now, let's just follow the standard!

@ArthurZucker (Collaborator) left a comment

My comments are not addressed 😓

  • let's embed inputs, then compute shapes
  • let's never require grad on unpadding / padding as it's weight agnostic

@NohTow (Contributor) commented on Jan 9, 2025

For the latter, I think it is related to this.
It was discussed during the original PR, but it breaks the gradients.
I'll let @warner-benjamin give more information.

@tomaarsen (Member)

My comments are not addressed 😓

* let's embed inputs, then compute shapes

This is a very logical approach, so that we don't need

            if inputs_embeds is not None:
                batch_size, seq_len = inputs_embeds.shape[:2]
            else:
                batch_size, seq_len = input_ids.shape[:2]

But instead can just use

batch_size, seq_len = hidden_states.shape[:2]

However, in ModernBERT, before we can compute the embedded inputs, we need to potentially unpad. To unpad, we need to have an attention_mask, which needs the shapes. In short: we need the shapes before the input embeddings.

One alternative is to use attention_mask = torch.ones((input_ids or input_embeds).shape[:2], ...), but this feels like the same as the shape computation but as a 1-liner.
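For illustration, the ordering constraint described above as a small sketch (the helper name and defaults are assumptions, not the actual code):

    import torch

    def shapes_and_default_mask(input_ids=None, inputs_embeds=None, attention_mask=None, device=None):
        # The flash-attention path needs the shapes (to build a default attention_mask)
        # and then unpads before the embedding layer runs, so the shapes cannot be
        # read off hidden_states after embedding.
        batch_size, seq_len = (input_ids if inputs_embeds is None else inputs_embeds).shape[:2]
        if attention_mask is None:
            attention_mask = torch.ones((batch_size, seq_len), device=device, dtype=torch.bool)
        return batch_size, seq_len, attention_mask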

* let's never require grad on unpadding / padding as it's weight agnostic

Please have a look at #35386, it looks like the torch.no_grad is killing the gradients that were there before re-padding. Running the same script from that PR on BERT does give gradients on the logits.

@warner-benjamin made a PR to patch it here: #35404, which makes the gradient for repadding optional.

  • Tom Aarsen

@ArthurZucker (Collaborator)

Got it! Okay, for output, we do need the gradients, my bad!
TLDR:

  • when training with input_ids: torch.no_grad for the input unpadding is fine, but you need gradients on the outputs
  • when training with input embeddings: gradients are needed, because you need backpropagation through them
  • at inference: no grad at all, but usually people handle this outside the modeling code.

So the only "gain" is using no_grad on input_ids all the time.

@ArthurZucker (Collaborator) left a comment

Thanks for answering my questions! Let's go 🤗


@tomaarsen tomaarsen merged commit 832c619 into huggingface:main Jan 9, 2025
16 checks passed