Add support for left padding #344
Conversation
Thanks! I'm on holiday but @JayBaileyCS will take a look and I'll also look once he's done that since we've discussed this feature before :)
Hi Joseph, thank you for the comment! I've gone through the discussions you had with @JayBaileyCS on this feature. One thing that I want to discuss is what you wrote in #291 (comment). I think there are mainly three options for how to enable left padding: (BTW, I think
I initially wrote in the above comment that I'd prefer option 1+2, but I changed my mind to option 1+3:
# (Let's assume that I am using a Jupyter notebook.)
[1] model = HookedTransformer.from_pretrained("gpt2")
# (This logits and cache are calculated with right padding.)
[2] movie_review_logits, movie_review_cache = model.run_with_cache(movie_reviews)
# ... (I do a lot of things and forget the fact that movie_review_logits and movie_review_cache used right padding)
# Oh! I just realised that I can use left padding as default.
# It would be much more convenient to use left padding as default.
[203] model.set_default_padding_side("left")
# (This logits and cache are calculated with left padding.)
[204] book_review_logits, book_review_cache = model.run_with_cache(book_reviews)
...
# Okay, now let's investigate the last token activations of the model...
[246] movie_review_last_pre8 = movie_review_cache['pre8'][:, -1, :] # This is an unintentional bug!!!!
book_review_last_pre8 = book_review_cache['pre8'][:, -1, :]
... In addition, it might be confusing for code readers when there are
@jbloomAus @JayBaileyCS What do you think about going with 1+3? (Whether we go with option 1+2 or 1+3, it would be good to use the same interface as the one for
@neelnanda-io For exactly the same reasons, I think that it would be better to change
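For context, a runnable toy sketch (all tensor names and values here are made up, not from the PR) of why [:, -1, :] silently goes wrong on a right-padded cache, and how the final real token would otherwise have to be indexed:

import torch

# Toy stand-ins: a right-padded batch (pad id 0) and a fake activation cache entry.
pad_token_id = 0
tokens = torch.tensor([[5, 7, 9, 0, 0],
                       [3, 4, 6, 8, 2]])   # row 0 ends with two pad tokens
cache_pre8 = torch.randn(2, 5, 16)          # [batch, pos, d_mlp]

# [:, -1, :] would read pad-position activations for row 0. Instead, index the
# last *real* token of each sequence explicitly:
seq_lengths = (tokens != pad_token_id).sum(dim=-1)                     # tensor([3, 5])
last_pre8 = cache_pre8[torch.arange(tokens.shape[0]), seq_lengths - 1, :]
print(last_pre8.shape)   # torch.Size([2, 16])

With left padding, the last real token of every row sits at position -1, so the simple [:, -1, :] indexing is always correct.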
Oh lol I didn't realise you'd done 2 rather than 3 - I agree that 3 is better and would be happy to get a quick PR shifting prepend_bos to option 3
@neelnanda-io Oh no, I just realised that I didn't even explain the difference between option 3 and PR #343 in my comment #344 (comment). Sorry for making you (and possibly others as well) confused 😂 The code I put as the description for option 3 is also applied in PR #343, which is why you said "I didn't realise you'd done 2 rather than 3". The thing that I really wanted to change from PR #343 was when and where the default value from the user is given to the model and is set. To be more specific, I want to make the following changes:
1. Get rid of model.set_default_prepend_bos() and model.prepend_bos.
2. If the user wants to change the default prepend_bos value to False, instead of calling model.set_default_prepend_bos(), they should pass the value when they create the model instance. Then, the model will set the value to model.cfg.default_prepend_bos. If the user does not provide the default value, True is used.
3. Instead of having a separate model.prepend_bos, the model directly accesses self.cfg.default_prepend_bos to check the default setup (which can be locally overridden when prepend_bos=... is given in a call to the string-processing methods).
The rationale for this change is described in #344 (comment).
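As a rough usage sketch of option 3 (assuming from_pretrained accepts a default_prepend_bos argument as proposed here; the model name is arbitrary):

from transformer_lens import HookedTransformer

# Under option 3, the default is set once at construction time and stored on the
# config; there is no set_default_prepend_bos() setter and no model.prepend_bos.
model = HookedTransformer.from_pretrained("gpt2", default_prepend_bos=False)
print(model.cfg.default_prepend_bos)   # False

# A per-call argument still overrides the stored default, locally and temporarily.
tokens = model.to_tokens("Hello world", prepend_bos=True)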
Doing it all via the config sounds good to me, I agree having special
methods and attributes is unnecessary. Doing everything via the config
object, including being able to edit it on the fly, seems good by me
…On Mon, 17 Jul 2023, 12:37 am Sohee Yang, ***@***.***> wrote:
@neelnanda-io <https://github.com/neelnanda-io> Oh no, I just realised
that I didn't even explain the difference between option 3 and option 2 in #344
(comment)
<#344 (comment)>.
Sorry for making you (and possibly others as well) confused 😂 The code I
put as the description for option 3 is also applied in option 2, which is
why you said "I didn't realise you'd done 2 rather than 3".
The thing that I really wanted to change (from option 2) with option 3 was
when and where the default value from the user is given to the model and is
set. To be more specific, I want to make the following changes:
1. Get rid of model.set_default_prepend_bos() and model.prepend_bos.
2. If the user wants to change the default prepend_bos value to False,
instead of calling model.set_default_prepend_bos(), they should pass
the value when they create the model instance. Then, the model will
set the value to model.cfg.default_prepend_bos. If the user does not
provide the default value, True is used.
3. Instead of having a separate model.prepend_bos, the model directly
accesses self.cfg.default_prepend_bos to check the default setup
(which can be locally overridden when prepend_bos=... is given in a
call to the string-processing methods).
The rationale for this change is described in #344 (comment)
<#344 (comment)>
.
Thank you so much for the feedback! I'll make the quick PR when I wake up (it's time for 🛌 now haha). @jbloomAus @JayBaileyCS I will do the same thing with
@neelnanda-io I just thought of an idea that can nicely handle all the overridden flags without touching the code of the actual methods. What do you think of this? I originally came up with this to guarantee the reset of the default value of
# in utils.py
import inspect
def locally_override_and_restore_defaults(function):
"""
This decorator is used to override the parameters with default values during a function's execution.
It guarantees that the default parameters are not changed after the function execution,
even when an error occurs during the execution.
"""
def wrapper(self, *args, **kwargs):
sig = inspect.signature(function)
arg_names = list(sig.parameters.keys())
if arg_names[0] == "self":
arg_names = arg_names[1:]
arg_values = dict(zip(arg_names, args))
arg_values.update(kwargs)
# prepend_bos
# Prepare the overridden value
default_prepend_bos = self.cfg.default_prepend_bos
prepend_bos = arg_values.pop("prepend_bos", None)
assert prepend_bos in [None, True, False], (
f"prepend_bos must be one of None, True, or False, but got {prepend_bos}."
)
arg_values["prepend_bos"] = override_or_use_default_flag(default_prepend_bos, override=prepend_bos)
# padding_side
# Prepare the overridden value
default_padding_side = self.tokenizer.padding_side
padding_side = arg_values.pop("padding_side", None)
assert padding_side in [None, "left", "right"], (
f"padding_side must be one of None, 'left', or 'right', but got {padding_side}."
)
arg_values["padding_side"] = override_or_use_default_flag(default_padding_side, override=padding_side)
try:
# This is important because self.tokenizer.padding_side is
# the actual padding_side used by Transformers Tokenizer.
self.tokenizer.padding_side = arg_values["padding_side"]
# Execute the original function with the overridden padding_side
outputs = function(self, **arg_values)
# Reset the padding_side of the tokenizer
self.tokenizer.padding_side = default_padding_side
return outputs
except Exception as e:
# If an error occurs, reset the padding_side of the tokenizer before propagating the exception
self.tokenizer.padding_side = default_padding_side
raise e
return wrapper
# in HookedTransformer class
@utils.locally_override_and_restore_defaults # This is all we need!
def to_str_tokens(
self,
input: Union[
str,
Int[torch.Tensor, "pos"],
Int[torch.Tensor, "1 pos"],
Int[np.ndarray, "pos"],
Int[np.ndarray, "1 pos"],
list,
],
prepend_bos: Optional[bool] = None,
padding_side: Optional[Literal["left", "right"]] = None,
) -> Union[List[str], List[List[str]]]:
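As a usage illustration (the model name and inputs are arbitrary), the per-call arguments override the stored defaults only for the duration of the call:

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# prepend_bos and padding_side apply only to this call; the decorator restores the
# tokenizer's padding_side and the configured default afterwards.
str_tokens = model.to_str_tokens(
    ["The cat sat on the mat", "A much longer movie review about padding"],
    prepend_bos=True,
    padding_side="left",
)
print(str_tokens[0])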
Overall this looks quite good! I'm humbled by the skill that went into this, having tried and failed to create this feature myself. There are quite a few things in here that I'm going to try to learn from, like the use of inspect.signature and creating a custom decorator to solve this one.
I've made a couple of small comments, but overall I am very impressed!
transformer_lens/utils.py (Outdated)
is_pad_token = 1 - attention_mask.int()

# Find the position of the pad token that is used as the BOS token and thus should be attended
pad_bos_positions = is_pad_token.cumsum(dim=-1).argmax(dim=-1)
I believe this will break when the BOS, EOS, and PAD tokens are the same, as is the case in e.g. GPT-2. It would be good to add a test for this case. EOS is often used to separate documents in the same context, e.g. "The cat sat on the matThe dog sat on the log". I would add a separate check that the pad tokens are contiguous (e.g. I believe (is_pad_token.cumsum(-1) * ((1-is_pad_token).cumsum(-1)>0)).argmax(dim=-1) should work?). I think you should also check for the case where there's no PAD/BOS/EOS in the context, and add a test for correct handling. And this can be different by row, ugh.
This is very fiddly, sorry for all the nitpicks!
Right, the assumption here was that all the PAD tokens would be contiguous at the beginning and no PAD token would appear in the middle. I didn't consider the case of the PAD token being an EOS token and appearing in the middle of the text. The suggestion of checking whether they are contiguous sounds great. Thank you so much for pointing out the exception case!
Resolved by commit 0591924
Note: 0591924 always considers only the leftmost leading pads (when padding_side == left) or the rightmost trailing pads (when padding_side == right) as real pad tokens that should not be attended, regardless of whether pad == bos or pad == sep.
Rationale: sep tokens should be attended, and there might be cases where users use sep tokens multiple times.
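For concreteness, a minimal self-contained sketch (not the actual TransformerLens code, and ignoring the extra special case where a prepended BOS shares the pad id) of treating only the contiguous leading pads as real pads under left padding:

import torch

def leading_pad_mask(tokens: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Mark only the contiguous leading pad tokens as real pads, so a pad id
    reused as BOS/EOS/SEP later in the sequence is still attended to."""
    is_pad = (tokens == pad_token_id).long()
    # cumprod stays 1 only while every token so far has been a pad; it drops to
    # 0 at the first real token and never recovers.
    return is_pad.cumprod(dim=-1).bool()

tokens = torch.tensor([
    [50256, 50256, 50256, 464, 3290],   # left-padded row
    [50256, 464, 3797, 50256, 464],     # pad id reused as a separator mid-sequence
])
print(leading_pad_mask(tokens, pad_token_id=50256))
# tensor([[ True,  True,  True, False, False],
#         [ True, False, False, False, False]])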
Thanks so much for the PR, I've wanted this feature for a while and it's a pain in the ass to implement
Hi @JayBaileyCS, thank you so much for the compliment! Actually, the use of inspect.signature and a custom decorator makes the code complicated and harder to read, which increases the risk of potential bugs. I used the approach just because I couldn't think of a simpler one, but @neelnanda-io just suggested a brilliant idea of using a context manager instead, which makes the code simpler and less bug-prone, getting rid of the need for complicated function-signature parsing while doing exactly the same thing. I'll make this update and commit it soon!
Hey! I wanted to check up on the progress of this, @soheeyang?
Hi @neelnanda-io, thank you so much for the reminder! I've made all the necessary updates except for some minor things, but my schedule has been so packed these days that I completely forgot about this 😂 I will wrap it up today or tomorrow!
# past_kv_cache is not None, so we're doing caching.
# We need to extend the past_left_attention_mask.
# Append '1' to the right of the past_left_attention_mask to account for the new tokens
left_attention_mask = utils.extend_tensor_with_ones(
I couldn't think of a way that avoids introducing past_left_attention_mask, which is necessary for index > 0 steps of generate(use_past_kv_cache=True). Does anyone have a better idea?
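A self-contained sketch of what such an extension could do (the function name and signature here are illustrative; the real utils.extend_tensor_with_ones may differ):

import torch

def extend_attention_mask_with_ones(mask: torch.Tensor, n_new_tokens: int = 1) -> torch.Tensor:
    """Append 1s on the right of a (left-)attention mask for freshly generated
    tokens, which are always real tokens and must be attended."""
    ones = torch.ones(mask.shape[0], n_new_tokens, dtype=mask.dtype, device=mask.device)
    return torch.cat([mask, ones], dim=-1)

past_mask = torch.tensor([[0, 0, 1, 1],    # left-padded prompt
                          [1, 1, 1, 1]])
print(extend_attention_mask_with_ones(past_mask))
# tensor([[0, 0, 1, 1, 1],
#         [1, 1, 1, 1, 1]])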
raise ValueError(
    f"Invalid positional_embedding_type passed in {self.cfg.positional_embedding_type}"
)
with utils.LocallyOverridenDefaults(
@neelnanda-io I changed the decorator to a context manager; a downside of using a context manager is that it is more cumbersome to add than a decorator. (I intentionally avoided overriding the defaults only partially in the methods to prevent any possible inconsistency)
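For readers following the thread, a minimal self-contained sketch of the general shape of such a context manager (illustrative only; the internals of the real utils.LocallyOverridenDefaults may differ):

class LocallyOverriddenDefaultsSketch:
    """Temporarily override default_prepend_bos and the tokenizer's
    padding_side, restoring both on exit even if an exception is raised."""

    def __init__(self, model, prepend_bos=None, padding_side=None):
        self.model = model
        self.prepend_bos = prepend_bos
        self.padding_side = padding_side

    def __enter__(self):
        self._saved_prepend_bos = self.model.cfg.default_prepend_bos
        self._saved_padding_side = self.model.tokenizer.padding_side
        if self.prepend_bos is not None:
            self.model.cfg.default_prepend_bos = self.prepend_bos
        if self.padding_side is not None:
            self.model.tokenizer.padding_side = self.padding_side
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always restore the saved defaults, whether or not an error occurred.
        self.model.cfg.default_prepend_bos = self._saved_prepend_bos
        self.model.tokenizer.padding_side = self._saved_padding_side
        return False  # do not swallow exceptions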
Context manager that allows temporary overriding of default values within a model.
Once the context is exited, the default values are restored.

WARNING: This context manager must be used for any function/method that directly accesses
I'm a bit worried about the cases where people forget to add this context manager when they need to, but it's hard to enforce it... One possible way to do this is to do something like the following code, but since padding_side is a property of the Tokenizer, how to deal with it is not very straightforward.
@dataclass
class HookedTransformerConfig:
    ...
    _default_prepend_bos: bool = field(default=True, repr=False)
    _inside_context: bool = field(default=False, init=False, repr=False)

    @property
    def default_prepend_bos(self):
        if not self._inside_context:
            raise ValueError("Direct access to default_prepend_bos without context manager is prohibited.")
        return self._default_prepend_bos

    def set_inside_context(self, state: bool):
        self._inside_context = state
This looks good to me on a skim, but I don't have the time to read in detail. @JayBaileyCS do you have the time to look at this? If not, I'll just approve.
@neelnanda-io The change has now gotten too big and involved for me to be confident in my own judgment of it, in all honesty. I don't have a good enough understanding of the HookedTransformer to track all the variables that have shifted. If it looks good to you on a skim, that's probably better than what I could do for this one, so if you want to approve it I'd say go ahead.
Fair enough! Thanks Jay. @soheeyang, is it ready to be merged in?
@neelnanda-io Yes, it's ready! Now that we can use left padding, I want to make an enhancement in
However, I will make that a separate PR rather than including the change in this one.
@@ -84,15 +87,16 @@ def __init__(
     self.cfg = cfg

     if tokenizer is not None:
-        self.set_tokenizer(tokenizer)
+        self.set_tokenizer(tokenizer, default_padding_side=default_padding_side)
(not a review, but a question)
What is the reason for overwriting the tokenizer padding direction like this? This caused some existing trained (left-padded) models to break in a very confusing way, since the tokenizer property we were setting was being overridden without us noticing.
(in reference to #344 )
The reason was that left-padded models were technically not supported prior to this commit, which means that only right-padded models were supported by default. Left-padded models would not have worked correctly prior to this PR.
Could you give me more details on which model you used and how it was breaking?
We didn't seem to have problems training and running left-padded models, but perhaps there were underlying issues which we did not notice.
I managed to fix the issue by modifying the tokenizer in our model's __init__ after it gets modified by the new HookedTransformer.__init__: understanding-search/maze-transformer#195
As of transformer_lens 1.6.1, in PR TransformerLensOrg/TransformerLens#344 the padding side works differently, and our left padding was being overridden. This fixes it by doing some things in `ZanjHookedTransformer.__init__` and in the `HuggingMazeTokenizer`.
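A hypothetical sketch of the kind of downstream workaround described above (the actual code is in the linked maze-transformer PR; this body is illustrative only):

from transformer_lens import HookedTransformer

class ZanjHookedTransformer(HookedTransformer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # HookedTransformer.__init__ now calls set_tokenizer with a default
        # padding side, so re-assert the left padding this model was trained with.
        self.tokenizer.padding_side = "left"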
Description
This PR introduces support for left padding for decoder-only models by properly adjusting the absolute position embedding and causal/attention masks. This change enables easier retrieval of the final outputs, e.g., logits[:, -1, :], getting rid of the need to perform complicated indexing even when a batch of sequences of varying lengths is processed.

Introduction of util.get_attention_mask(): Introduced a new method that is called internally by the model in forward() to generate attention masks based on the input tokens (works for both left-padded and right-padded tokens) and the prepend_bos flag used to create the tokens. This mask ensures that the model does not attend to padding tokens during processing; while it is not necessary when right padding is used, it becomes necessary when left padding is used, in order to adjust the absolute position embedding and attention. The method also handles a special case where the BOS token is identical to the pad token and is prepended to the input sequences, so that such a token can be properly attended. The prepend_bos value passed to the function should be the same as the one used to create tokens. At first, I considered making the function infer the value of prepend_bos by investigating tokens, without the need to pass the value as a parameter, but that requires several assumptions about how tokens is processed, which might not always be true. Therefore, I decided to go with the current design.

Adjustment of Positional Embeddings: Made changes to the positional embeddings such that the first real token in each sequence receives the 0th positional embedding; this is critical to the proper functioning of the model when left padding is used. The positional embeddings of the pad tokens are set to zero vectors. (A sketch of this position-index computation follows the description.)

Adjustment of Causal Masking: Made changes to the causal mask so that it also prevents the model from attending to the pad tokens that appear in front of the real tokens.

Test Cases: Added comprehensive test cases to ensure the correctness of the left padding feature, attention mask generation, adjusted positional embeddings, adjusted causal mask, and the same output behaviour for right-padded tokens, left-padded tokens, and single-batch tokens.

Explanation: Added an explanation of the support for left padding to the exploratory analysis demo.
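As referenced in the positional-embeddings item above, a minimal sketch (not necessarily the PR's exact implementation) of assigning position 0 to the first real token under left padding:

import torch

def left_padded_position_ids(attention_mask: torch.Tensor) -> torch.Tensor:
    """Compute position indices so the first real token of each row gets 0;
    pad positions stay at 0 and their embeddings can be zeroed out separately."""
    position_ids = attention_mask.cumsum(dim=-1) - 1   # first real token -> 0
    position_ids = position_ids.clamp(min=0)
    return position_ids * attention_mask               # keep pad positions at 0

mask = torch.tensor([[0, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1]])
print(left_padded_position_ids(mask))
# tensor([[0, 0, 0, 1, 2],
#         [0, 1, 2, 3, 4]])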
Fixes #240
Type of change
Example code
before
after
Checklist: