
Add video support #430 (Draft)

iejMac wants to merge 28 commits into main
Conversation

@iejMac (Contributor) commented Feb 15, 2023

The current plan is to use a ViViT architecture (factorised-encoder variant) with the image and text encoders initialized from CLIP weights. We could also opt to train this CoCa-style; in that case we would initialize the image encoder and the entire text tower (including the decoder) from CoCa weights, so we'd likely want to train an L/14-size model.

This PR will likely require the addition/adaptation of:

  • vivit.py - the implementation of the ViViT model, built from two stacked transformers from our transformer.py. It would implement the whole CLIP-like model used in train.py (see the sketch below).
  • data.py - needs a video dataloader.

I'll use this PR to track progress. Let me know if there's a better way of approaching this.
@rwightman @rom1504 @mitchellnw
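
For reference, a minimal sketch of the factorised-encoder idea described above. The class and attribute names here are hypothetical; `spatial` and `temporal` stand in for the two transformer stacks that would come from transformer.py:

```python
import torch
import torch.nn as nn

class FactorisedVideoEncoder(nn.Module):
    """Sketch of a factorised encoder: a spatial transformer applied per frame,
    followed by a temporal transformer over the frame embeddings."""

    def __init__(self, spatial: nn.Module, temporal: nn.Module):
        super().__init__()
        self.spatial = spatial    # CLIP-initialized image encoder, run per frame
        self.temporal = temporal  # transformer over the sequence of frame embeddings

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, f = video.shape[:2]
        frame_emb = self.spatial(video.flatten(0, 1))  # (b*f, d)
        frame_emb = frame_emb.view(b, f, -1)           # (b, f, d)
        return self.temporal(frame_emb).mean(dim=1)    # mean-pool over time -> (b, d)
```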

@iejMac (Contributor, Author) commented Feb 15, 2023

Hmm, I don't see how we can reliably train this, though. For L/14, the max local batch size on a 40 GB GPU is a few hundred images, so a video with 100 frames (at 1 FPS) gives a local batch size < 10 even for rather short videos. That probably still won't work very well unless we have more GPUs than we actually do.
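
Back-of-the-envelope version of that constraint; every number below is illustrative, not measured:

```python
# Rough arithmetic behind the concern above.
max_images_per_gpu = 256      # assumed max local batch for L/14 on a 40 GB GPU
frames_per_video = 100        # a ~100 s clip sampled at 1 FPS

local_video_batch = max_images_per_gpu // frames_per_video
print(local_video_batch)      # 2 -> far below what contrastive training wants
```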

@rom1504 (Collaborator) commented Feb 15, 2023 via email

@lucidrains (Contributor) commented Feb 15, 2023

@iejMac nice! i can contribute to this

i believe for video, we can do much more aggressive patch dropout in the beginning. well, if the video does not resemble this lol
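
For concreteness, a rough sketch of per-sample patch dropout (not the PR's or open_clip's actual implementation; CLS-token handling omitted for brevity):

```python
import torch

def patch_dropout(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of patch tokens per sample.

    tokens: (batch, num_patches, dim)
    """
    b, n, d = tokens.shape
    num_keep = max(1, int(n * keep_ratio))
    # random permutation of patch indices per sample; keep the first num_keep
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=-1)[:, :num_keep]
    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
```

With video, the token count scales with the frame count, so even modest keep ratios free up a lot of memory relative to the image case.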

iejMac marked this pull request as draft on February 16, 2023.
@iejMac (Contributor, Author) commented Feb 16, 2023

@lucidrains cool! I'll start filling out the code a bit today. And yeah, good idea on the aggressive patch dropout. So currently we have:

  • aggressive patch dropout
  • grad accumulation

as tricks to make this a bit more tractable (a minimal accumulation loop is sketched below). Anything that maximizes batch size will be really important here.
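
A minimal accumulation loop for reference. This is generic PyTorch, not this PR's train.py; `model_loss`, `dataloader`, and `optimizer` are placeholders:

```python
# accum_steps micro-batches per optimizer step -> effective batch is
# accum_steps * local batch size.
accum_steps = 8

optimizer.zero_grad()
for step, (videos, texts) in enumerate(dataloader):
    loss = model_loss(videos, texts) / accum_steps  # average over micro-batches
    loss.backward()                                  # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

One caveat: naive accumulation like this does not grow the set of in-batch negatives for a contrastive loss; open_clip's `--accum-freq` support caches features across micro-batches for exactly this reason.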

@iejMac (Contributor, Author) commented Feb 16, 2023

Ah, another thing I want to add here: simultaneous (image, video)-text training, i.e. the final model should be able to handle both temporal sequences and static images. The thing I'm unsure about is whether we apply the image loss after the temporal transformer (essentially saying "the time transformer should understand single images") or before, just on the spatial transformer side.
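
The two placements, reusing the hypothetical `spatial`/`temporal` modules from the sketch earlier in the thread:

```python
# Option A: image loss before the temporal transformer -
# the spatial tower alone must embed static images.
image_features = spatial(images)                               # (b, d)

# Option B: image loss after it - an image is treated as a 1-frame video,
# so the temporal transformer must also understand single images.
image_features = temporal(spatial(images).unsqueeze(1)).mean(dim=1)
```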

@iejMac (Contributor, Author) commented Feb 21, 2023

@lucidrains, does anything look wrong to you in the modeling code? Specifically this ViViT class. I'm getting very strange loss curves and was wondering if you might have an idea.

@iejMac (Contributor, Author) commented Feb 21, 2023

It could also be the data loader code, but I figured I'd ask you about the model since I'm comparing against your ViViT.

@lucidrains (Contributor) commented
@iejMac nice! i'll do a code review later this week when i find some downtime

@iejMac (Contributor, Author) commented Mar 27, 2023

Next task: initialize the spatial and text transformers from a CLIP model (rough sketch below).
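
Something along these lines, assuming the hypothetical `video_model.spatial` / `video_model.text` attributes from the earlier sketch (`open_clip.create_model_and_transforms` is the real loader; the video-model attribute names are not):

```python
import open_clip

# Load a pretrained CLIP to source the weights.
clip_model, _, _ = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='openai')

# Spatial tower <- CLIP visual tower.
video_model.spatial.load_state_dict(clip_model.visual.state_dict())

# Text tower <- everything except the visual tower
# (token embedding, positional embedding, transformer, projection).
text_state = {k: v for k, v in clip_model.state_dict().items()
              if not k.startswith('visual.')}
video_model.text.load_state_dict(text_state, strict=False)
```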

@iejMac (Contributor, Author) commented Apr 4, 2023

Next things to do:
