[Multimodal] Adding OBELICS DataLoader #650
Comments
A more general multimodal data solution might be using the following library.
@TJ-Solergibert thanks for your comments. Regarding what you said here:
This is ongoing work and I plan to improve it as well. What you mentioned here is part of it.
We can use a multiprocess dataloader, but maybe we can start with a really slow one first and then optimize it?
For the sequence length, can we make the longest sequence length the same as the model's seq length? Also for the trainer, ideally we want to reuse the current
Hi @casper-hansen, thanks for your suggestion, but it's not a matter of loading "lot's of images efficiently at scale" but rather how to prepare the inputs for the model.
Hi @fduwjj,
Nice! So I'll prepare a
Setting
Yes, usually you pack sequences until filling up the seq length of the model BUT now you will also want to control the size of the
Yes, my intention is to maintain the compatibility with
I will continue working over the weekend on a first prototype. So far it's looking great; now I have to figure out the best way to pack multiple samples properly, respecting both the masks from the tokens & the
Toni
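As a rough illustration of the packing idea discussed in this comment, here is a minimal sketch in plain Python. It assumes samples are already tokenized into lists of ints, packs them greedily up to `seq_len`, restarts position ids for every sample, and records a per-token document id from which a token mask can be derived. The names `pack_samples`, `_pad`, and `can_attend` are hypothetical and are not part of torchtitan or torchtune; image handling and image masks are deliberately left out.

```python
from typing import Dict, List


def _pad(pack: Dict[str, List[int]], seq_len: int, pad_id: int) -> Dict[str, List[int]]:
    """Pad a partially filled pack up to seq_len; padding gets doc id -1."""
    n = seq_len - len(pack["tokens"])
    return {
        "tokens": pack["tokens"] + [pad_id] * n,
        "position_ids": pack["position_ids"] + [0] * n,
        "doc_ids": pack["doc_ids"] + [-1] * n,
    }


def pack_samples(samples: List[List[int]], seq_len: int, pad_id: int = 0) -> List[Dict[str, List[int]]]:
    """Greedily pack tokenized samples into sequences of exactly seq_len tokens."""
    packs = []
    current = {"tokens": [], "position_ids": [], "doc_ids": []}
    for sample in samples:
        sample = sample[:seq_len]  # assumption: over-long samples are simply truncated
        if len(current["tokens"]) + len(sample) > seq_len:
            # The next sample does not fit: close the current pack and start a new one.
            packs.append(_pad(current, seq_len, pad_id))
            current = {"tokens": [], "position_ids": [], "doc_ids": []}
        doc_id = current["doc_ids"][-1] + 1 if current["doc_ids"] else 0
        current["tokens"] += sample
        current["position_ids"] += list(range(len(sample)))  # positions restart per sample
        current["doc_ids"] += [doc_id] * len(sample)
    if current["tokens"]:
        packs.append(_pad(current, seq_len, pad_id))
    return packs


def can_attend(doc_ids: List[int], q: int, k: int) -> bool:
    """Causal mask restricted to tokens of the same (non-padding) sample."""
    return doc_ids[q] != -1 and doc_ids[q] == doc_ids[k] and k <= q


packs = pack_samples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=6)
```

A real implementation would also need to cap and pad the number of images per pack and build the corresponding image masks, which this sketch does not attempt.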
On the necessity of shuffling:
I'd assume that most of the time, the sample/document is shorter than the training max_seq_length, as you also mentioned
If consecutive samples are all from the same source, then what needs to be done is either (1) (if training is still done at the sample level) data preprocessing, which falls outside the scope of this repo, or (2) (otherwise) we should support a longer sequence length to cover most full documents.
@andrewkho wonder if the PyTorch dataloading solution would be a good fit here
Hi @tianyu-l yes definitely a good fit here. Hi @TJ-Solergibert and everyone, I'm coming from pytorch/data side of things and think we have some things up our sleeve we could propose that would help here. We're also in contact with the torchtune folks. Let's spend some time testing out some solutions and hopefully find some common ground.
Hi @tianyu-l & @andrewkho, I've recently submitted #663 with a first prototype. Most of the code comes from
Toni
Hi!
I’ve started developing the Multimodal DataLoader. After taking a (deep) look at this whole multimodal universe, I would like to discuss a couple of things before continuing. I’m using the torchtune repo as a reference.
As we have already mentioned, the DataLoader will only be compatible with the OBELICS dataset. It’s worth noting that this is a nice dataset since it not only contains (Image, Text) pair samples but also other patterns like (Image, Image, Text, Image, Text) or (Text, Image, Image, Text), among others.
Iterable dataset: I assume the solution must be an Iterable Dataset, like the one already available for text-only pretraining. However, I think it’s necessary to consider the following:
num_workers > 1 in the DataLoader, something we can’t (easily) do with an Iterable one.
torchtitan doesn’t support introducing different position ids for each sample, as it directly uses a precomputed one. For images, torchtitan does consider the image masks.
batch size > 1 or SP, we will have to pad the samples. For the first case, it’s only necessary to pad to the longest sequence in the batch (and the longest number of images in the batch), while for the second case, we will have to pad the sequences to the model's sequence length, or else the SP reduce_scatter calls will fail.
I was surprised to see that torchtune doesn’t currently support this feature for MultiModal datasets, whereas it does for SFT ones. I think it’s necessary to develop a solution with packing to achieve maximum performance.
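The two padding cases described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: samples are lists of token ids, `pad_batch` is a hypothetical helper (not a torchtitan or torchtune function), and padding the number of images per sample would follow the same pattern. With `seq_len=None` the batch is padded to its longest sample (the batch size > 1 case); with a fixed `seq_len` every sample is padded to the model's sequence length (the SP case).

```python
from typing import List, Optional


def pad_batch(
    batch: List[List[int]],
    pad_id: int = 0,
    seq_len: Optional[int] = None,
) -> List[List[int]]:
    """Right-pad every sample to a common length.

    seq_len=None -> pad to the longest sample in the batch.
    seq_len=N    -> pad to a fixed length N (e.g. the model's sequence length).
    """
    target = seq_len if seq_len is not None else max(len(s) for s in batch)
    padded = []
    for sample in batch:
        if len(sample) > target:
            # Truncation policy is out of scope for this sketch, so fail loudly.
            raise ValueError(f"sample of length {len(sample)} exceeds target {target}")
        padded.append(sample + [pad_id] * (target - len(sample)))
    return padded


batch = [[1, 2, 3], [4]]
longest = pad_batch(batch)           # pad to the longest sequence in the batch
fixed = pad_batch(batch, seq_len=5)  # pad to a fixed model sequence length
```

Padding to a fixed length wastes compute on padding tokens whenever samples are short, which is one reason packing is attractive.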
LearnableProjection forward method, this line is duplicated.
train.py, but using TensorDict could be a good idea both for the model's forward pass (model(**batch)) and for device placement (batch.cuda()).
Without a doubt, this is a great (and fun) exercise to dive into multimodality! Let me know your thoughts!
Toni
cc: @tianyu-l @fduwjj