Fix random_mask_tokenize when the text is long #680

Conversation
Without this patch, the function crashes for long texts. See https://colab.research.google.com/drive/1SHBAUEnI1dNJmXQPUqZekFqXm7xrwH65?usp=sharing
BTW, wouldn't it be simpler (and probably avoid some longer memory copies?) to, instead of doing:

indices = np.random.permutation(len(tokens)).tolist()
indices = indices[:context_length - 2]
tokens = [tokens[i] for i in indices]

to do:

tokens = random.sample(tokens, context_length - 2)

? And similarly for other random sampling in this file that uses NumPy methods.
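For illustration only, a minimal sketch of how the suggested alternative could slot into the truncation branch (assuming tokens is a plain Python list of token ids, as it is in the current code; this is not the merged implementation):

import random

tokens = list(range(200))      # stand-in for an encoded long caption
context_length = 77

if len(tokens) > context_length - 2:  # reserve 2 slots for sot/eot tokens
    # random.sample draws a subset without building an index permutation first,
    # but note it does not preserve the original token order
    tokens = random.sample(tokens, context_length - 2)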
@bryant1410 so you're right about the issue, and random.sample would be a good alternative if tokens are lists as they are right now. Having looked at these tokenizers more closely since the merge though, I'm not sure they've all been used and tested, so I'm currently wondering if we should have them in there in this state. Clearly this one does not work as implemented; it was written for tokens to be a np array or tensor (where lists of indices are valid), but it's not, tokens is a list. Also, I'm failing to see how this is random masking as in the paper; it looks like a full random shuffle of the tokens. @zw615 ?
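For reference, a toy reproduction of the failure mode described above (illustrative only, not the exact code from the tokenizer module):

import numpy as np

tokens = [10, 20, 30, 40]                    # tokens is a plain Python list in the tokenizer
indices = np.random.permutation(len(tokens))[:2]
np.asarray(tokens)[indices]                  # fine: fancy indexing works on an array
tokens[indices]                              # TypeError: a list can't be indexed with an index array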
I see. I'll wait for clarification then, before making more changes to this PR.
Wouldn't this (see the sketch below) be a more appropriate 'random mask', vs the current impl which seems to be a 'random shuffle'?
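A rough sketch of the distinction, assuming tokens is already a tensor of token ids (toy example, not the repo code):

import torch

tokens = torch.arange(200)                      # stand-in for encoded token ids
keep_num = 75

# "random shuffle": random subset AND scrambled order
shuffled = tokens[torch.randperm(len(tokens))[:keep_num]]

# "random mask": random subset, original order preserved by sorting the
# sampled positions before indexing
indices = torch.randperm(len(tokens))[:keep_num].sort()[0]
masked = tokens[indices]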
For the version in #660 I was thinking of:

# (assumes torch, Union/List from typing, and SimpleTokenizer/default_bpe are
# already available in open_clip's tokenizer module, where this would live)
class RandomMaskTokenizer(SimpleTokenizer):
    def __init__(
            self,
            bpe_path: str = default_bpe(),
            special_tokens=None,
            clean: str = 'lower',
            shuffle: bool = False,
    ):
        super().__init__(bpe_path, special_tokens, clean)
        self.shuffle = shuffle

    def __call__(self, texts: Union[str, List[str]], context_length: int = 77) -> torch.LongTensor:
        """
        Returns the tokenized representation of given input string(s)

        Parameters
        ----------
        texts : Union[str, List[str]]
            An input string or a list of input strings to tokenize
        context_length : int
            The context length to use; all CLIP models use 77 as the context length

        Returns
        -------
        A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length]
        """
        if isinstance(texts, str):
            texts = [texts]
        sot_token = self.encoder["<start_of_text>"]
        eot_token = self.encoder["<end_of_text>"]
        all_tokens = [self.encode(text) for text in texts]
        result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)
        for i, tokens in enumerate(all_tokens):
            tokens = torch.tensor(tokens)
            num_tokens = len(tokens)
            if num_tokens > context_length - 2:  # 2 for sot and eot token
                keep_num = context_length - 2
                # sample keep_num token positions at random
                indices = torch.randperm(len(tokens))
                indices = indices[:keep_num]
                if not self.shuffle:
                    # keep the surviving tokens in their original order (random mask)
                    indices = indices.sort()[0]
                tokens = tokens[indices]
                num_tokens = keep_num
            result[i, 0] = sot_token
            result[i, 1:num_tokens + 1] = tokens
            result[i, num_tokens + 1] = eot_token
        return result
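Hypothetical usage of the class sketched above (names and defaults assumed from the snippet, not a merged API):

tokenizer = RandomMaskTokenizer(shuffle=False)       # order-preserving random mask
texts = ["a very long caption " * 100, "a short caption"]
out = tokenizer(texts)                               # LongTensor of shape [2, 77]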
…default __call__() arg to None. Clean up reduction masking logic and fix #680