Add video support #430
base: main
Conversation
Hmmm, I don't see how we can reliably train this, though. Say for L/14 the max local batch size on a 40 GB GPU is a few hundred images; if a video has 100 frames at 1 FPS, that gives us a local batch size < 10 for rather short videos, which probably still won't work very well unless we have more GPUs than we have
Gradient accumulation, maybe?
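The gradient-accumulation suggestion can be sketched as follows: split one large "effective" batch of clips into micro-batches, accumulate scaled gradients, and step the optimizer once. The model, shapes, and optimizer here are toy stand-ins for illustration, not this PR's code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(8, 4)            # stand-in for the video tower
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4                          # micro-batches per optimizer step (assumed)

data = torch.randn(8, 8)                 # 8 "clips" of flattened features
targets = torch.randn(8, 4)

optimizer.zero_grad()
for i, (x, y) in enumerate(zip(data.split(2), targets.split(2))):
    loss = F.mse_loss(model(x), y)
    # Scale each micro-batch loss so the summed gradients equal the
    # gradient of the mean loss over the full effective batch.
    (loss / accum_steps).backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

One caveat worth flagging: CLIP's contrastive loss is computed across the whole batch, so naive accumulation shrinks the pool of negatives per loss term rather than truly emulating a large batch; that equivalence only holds for per-sample losses like the MSE used above.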
@lucidrains cool! I'll start filling out the code a bit today. And yeah, good idea with aggressive patch dropout. So currently we have:
as some tricks to make this a bit more tractable. Anything to maximize batch size here will be really important.
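The aggressive patch dropout mentioned above can be sketched like this: during training, keep only a random subset of patch tokens per sample so the transformer attends over a much shorter sequence. The function name and `keep_ratio` value are assumptions for illustration, not the open_clip implementation.

```python
import torch

def patch_dropout(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """tokens: (B, N, D) patch tokens (CLS token excluded). Returns (B, k, D)."""
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    # Draw a random permutation per sample and keep the first k indices.
    idx = torch.argsort(torch.rand(b, n), dim=1)[:, :k]
    return tokens.gather(1, idx.unsqueeze(-1).expand(b, k, d))

# e.g. a ViT-style 14x14 patch grid: 196 tokens -> 49 kept at keep_ratio=0.25
out = patch_dropout(torch.randn(2, 196, 512), keep_ratio=0.25)
```

For video this compounds nicely: the token count scales with frames × patches, so dropping 75% of patches per frame buys roughly a 4x larger local batch at the same memory.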
ah also, another thing I want to add here: simultaneous (image, video) - text training, i.e. the final model should be able to handle both temporal sequences and static images. The thing I'm unsure about is whether we apply the image loss after the temporal decoder, essentially saying "the time transformer should understand singular images", or before, just at the spatial transformer side
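One way to sketch the joint (image, video) - text idea is to promote a static image to a one-frame video so a single tower can embed both. Whether the image loss attaches after the temporal transformer or only at the spatial side is the open question above; this hypothetical helper just unifies the input shapes.

```python
import torch

def as_video(x: torch.Tensor) -> torch.Tensor:
    """Promote an image batch (B, C, H, W) to a video batch (B, 1, C, H, W).

    Video batches (already 5-dim) pass through unchanged, so a mixed
    image/video dataloader can feed one encoder.
    """
    return x.unsqueeze(1) if x.dim() == 4 else x

img = torch.randn(2, 3, 8, 8)
vid = as_video(img)   # shape (2, 1, 3, 8, 8)
```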
@lucidrains, does anything look wrong to you in the modeling code, specifically this ViViT class? I'm getting very strange loss curves and was wondering if you might have an idea.
It could also be the data-loader code, but I figured I'd ask you about the model since I'm comparing with your vivit
@iejMac nice! i'll do a code review later this week when i find some downtime |
Next task - initialize spatial and text transformer from CLIP model |
Next things to do:
Current plan is to use a ViViT architecture (factorised-encoder variant) with the image and text encoders initialized from CLIP weights. We could also opt to train this CoCa-style; then we would initialize the image encoder and the entire text tower (including the decoder) from CoCa weights, so we'd likely want to train an L/14-size model.
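The factorised-encoder variant described above can be sketched roughly as: a spatial encoder (initialized from CLIP's visual tower in the actual plan) embeds each frame independently, then a small temporal transformer attends over the per-frame embeddings. All names, sizes, and the toy spatial encoder below are assumptions for illustration, not the PR's code.

```python
import torch
import torch.nn as nn

class FactorisedVideoEncoder(nn.Module):
    def __init__(self, spatial_encoder: nn.Module, embed_dim: int = 512,
                 temporal_layers: int = 2, heads: int = 8):
        super().__init__()
        self.spatial = spatial_encoder  # e.g. a CLIP visual tower
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, temporal_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t = video.shape[:2]                     # video: (B, T, C, H, W)
        feats = self.spatial(video.flatten(0, 1))  # (B*T, D) per-frame embeddings
        feats = feats.view(b, t, -1)               # (B, T, D)
        tokens = torch.cat([self.cls.expand(b, -1, -1), feats], dim=1)
        return self.temporal(tokens)[:, 0]         # CLS token = video embedding

# Toy stand-in for the spatial tower, just to make the sketch runnable.
toy_spatial = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 8 * 8, 512))
video_emb = FactorisedVideoEncoder(toy_spatial)(torch.randn(2, 5, 3, 8, 8))
```

A nice property of this factorisation for the initialization plan: the spatial tower's weights are untouched by the temporal dimension, so pretrained CLIP weights drop in per-frame, and only the (small, freshly initialized) temporal transformer trains from scratch.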
This PR will likely require the addition/adaptation of:
Will use this PR to track progress. Lmk if there's a better way of approaching this
@rwightman @rom1504 @mitchellnw