Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Adding OBELICS DataLoader #663

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

TJ-Solergibert
Copy link

@TJ-Solergibert TJ-Solergibert commented Oct 30, 2024

Hi,

In this PR I present a first draft of the Multimodal DataLoader. First I will describe how the batches are created and then I will explain the padding problem.

image

Let's begin checking the OBELICS dataset. For every sample on the dataset we have 4 keys, but we are just interested in 2 of them:

  • images: A list either with URLs of images OR Nones to specify the position of the text.
  • texts: A list either with text strings OR Nones to specify the position of the images.
    It's important to highlight that len(images)==len(texts) and that for each index, one element and only one is not None.

The format_obelics function will transform each sample to a format that can be later fed into the transform block that will prepare the samples to the target type. Each formatted sample will be a dictionary containing 2 keys:

  • images: List of PIL Images with the loaded images.
  • text: str with the text of the sample ready to be tokenized, including the image tokens.
image

Once formatted, we will process each sample with the transform block. This transform block is composed of CLIPPreprocess, TikTokenizer & VisionCrossAttentionMask modules.

CLIPPreprocess

image

This module will prepare the List of images to be fed into the CLIP model. The most relevant steps is resizing the image without distortion, dividing the image into tiles and padding if necessary. Highlight the fact that it will still produce a List of tensors and NOT a tensor as every image can have a different number of tiles. This will be addressed in the collator where we will pad the image tiles to the largest in the batch. Also, we keep the maximum number of tiles to 4 and the tile size to 448 for pretraining [1], [2].

TikTokenizer

I've included a new method in the tokenizer to encode the multimodal text. In short, it just encodes the text adding the special image_id token and returns both the input_ids & labels masking the bos, eos & image_id tokens.

VisionCrossAttentionMask

image

This module will create the attention mask for the Fused layers. In short, for each TILE we will have 1025 image_tokens and this mask will specify for each text_token to which image_tokens should attend to. We are returning again a List of tensors as the quantity of image_tokens will depend on the number of tiles. Again, we will solve this in the collator.

Padding & the collator

As we've previously seen, both the outputs of the CLIPPreprocess & VisionCrossAttentionMask are list of tensors because of the different number of tiles. Within the same sample we should pad both artifacts to the maximum number of tiles, but the issue arises when we run batch_size > 1 as we will also need to pad the input_ids (& labels) which is relatively cheap BUT also the Number of images, as the input to the CLIP model will be a tensor of shape [Batch size, Number of images, Number of tiles, Channels, Tile size, Tile size]. Padding to the maximum number of tiles is bad, but in the worst case scenario you end up increasing the tensor x4 (from 1 tile to maximum number of tiles = 4). But for the number of images it can get really really big, as there are samples with +30 images.

To check this phenomenon I've included scripts/check_padding_mm.py which computes the % of padding in a sample. Feel free to give it a try but it's very easy to get samples where the majority of the input is padding.

python3 scripts/check_padding_mm.py
Unpadded tokens: 8717, Total tokens in batch: 21728
Padded text tokens: 13011, 59.88%
########################################
Unpadded images: 25, Total images in batch: 64
Padded images: 39, 60.94% (Each image with shape [4, 3, 448, 448])
########################################
Unpadded number of tiles: 61, Total number of tiles: 256
Padded tiles: 195, 68.72% (Each with shape [3, 448, 448])
########################################
Unpadded cross attention mask elements: 545030425, Total cross attention mask elements: 5701427200
Padded cross attention mask elements: 5156396775, 90.44%

That's why I proposed continue working on a DataLoader & Dataset than can pack multiple samples up to a given input_ids length OR number of images in a batch. Packing the input_ids is fairly easy while packing the cross attention masks will require a bit more effort. Let me know if you would be interested on supporting that feature or you just want to include in the repo an example of the multimodal pipeline despite the padding issue described. I also plan including some unit test, to check the generated samples & recovering from failures abilities.

Other comments:

Toni

@facebook-github-bot
Copy link

Hi @TJ-Solergibert!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot
Copy link

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 30, 2024

import torch

import torchvision
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a quick scan for this PR. I need more time to look into this PR. But we should add torchvision as the dependent of Torchtitan, if haven't.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Don't worry, it will take time 😅

BATCH_NUMBER = 4


def main():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can make this as a unit test? WDYT?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add as a unit test some checks of shapes & types on the DP axis rather than this script that just checks the amount of padding in each batch

.pre-commit-config.yaml Outdated Show resolved Hide resolved
text, bos=True, eos=True, allowed_special=set(["<|image|>"])
)
input_ids = tokens[:-1]
labels = tokens[1:]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the first token is null? And for HuggingFaceM4/OBELICS specifically, "text" is a list of string or null. We don't need special treatment of it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We take care of the null/None values in the format_obelics function.

This function produces:

  • images: List of decoded images in the sample
  • text: str with the text of the sample ready to be tokenized including the image tokens (the null values you are referring)

So the text is ready to be tokenized!

@tianyu-l tianyu-l linked an issue Nov 22, 2024 that may be closed by this pull request
Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for super late review. I finally finish the first round of review and let me know what do you think about my questions and comments. I will revisit this PR in the coming week.

@@ -67,6 +70,7 @@ def __init__(self, model_path: str):
self.special_tokens = {
token: num_base_tokens + i for i, token in enumerate(special_tokens)
}
self.special_tokens["<|image|>"] = 128256 # TODO(tj.solergibert) Hardcoded!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for a late review. I think hard-coded is fine but can we make these IDs as const variables?

I mean something like:

IMAGE_TOKEN_ID = 128256
self.special_tokens["<|image|>"] = IMAGE_TOKEN_ID

Thanks

Comment on lines +212 to +218
labels = list(
np.where(
np.isin(labels, [self.bos_id, self.eos_id, self.image_id]),
-100, # TODO(tj.solergibert) Hardcoded!
labels,
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not to use np here? Can we use torch for this?

Mapping[str, Any]: The sample with an updated "image" filed and added
"aspect_ratio" field.
"""
image = sample["image"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering instead of assuming a "image" field, shall we just pass in the image itself so that this part can be a generic for other dataset as well?

best_resolution = (448, 1344) # canvas that allows maximum upscaling, with minimum padding, up to 16 tiles
image is resized without distortion (300,800) -> (448, 1194) #448 is the limiting side for the resize
image is padded (448, 1194) -> (448, 1344)
Image is tiled 2x5, for a final output shape of (10, 3, 224, 224)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 x 6?

max_num_tiles (Optional[int]): Only used if possible_resolutions is NOT given.
Maximum number of tiles to break an image into.
This will be used to generate possible_resolutions,
e.g. [(224, 224), (224, 448), (448, 224)] if max_num_tiles = 2 and tile_size = 224.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we don't do (448, 448) as well? Also max_num_tiles should be 4?

].count(self.image_token)
self._sample_idx += 1
# Transform sample
processed_sample = self.transform(processed_sample)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you get this from TorchTune, can we make the name not like transformer? Maybe prepoc? Because we will have transform or transformer in later stage of the model as well. And this part is not trainable, so to better differentiate, could you give it a different name?

>>> transform = VisionCrossAttentionMask(tile_size=400, patch_size=40, image_token_id=1)
>>> intervals = transform._get_image_attention_intervals(tokens)
>>> print(intervals)
[[0, 7], [1, 7], [7, 12]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm this is slightly different from what I have read about masking.. I need more time to think and validate on the logic of this part.

batch_size: int,
collator_fn: Callable,
):
super().__init__(dataset=hf_ds, batch_size=batch_size, collate_fn=collator_fn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this collate_fn being used or called?

... "encoder_mask": [torch.ones(2, 5 * 4)],
... },
... ]
>>> model_inputs = padded_collate_tiled_images_and_mask(batch=batch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the way this collator is used?


# NOTE Inspired from torchtune.data._collate.py
@dataclass
class MultiModalCollator:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this generates the all the data needed before sending into encoder right? Also for the MM model, we need to feed text tokens into decoder as well? shall we just reuse the existing dataloader for llama3?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Multimodal] Adding OBELICS DataLoader
3 participants