I'm building my own multimedia GPT: a competitor to MERLOT Reserve & Vid2Seq. It's pre-trained from scratch on YouTube data, mostly the YT-1B dataset of 20M curated YouTube videos containing significant spoken language (English only).
📜 Arxiv: https://arxiv.org/abs/2304.10505
👉 Project highlights & intuition with photos, check it out: https://twitter.com/KastanDay/status/1595991960380411905
My design follows the "Embedding + Trunk + Head" pattern I first noticed succeeding in DETR and AlphaFold2. Now, in early 2023, it's also succeeding in PaLM-E and Vid2Seq from Google, Prismer from NVIDIA, and many more listed in my Twitter announcement.
- Install Git LFS
# Install `git-lfs` (via apt or brew)
brew install git-lfs
-OR-
conda install -c conda-forge -y git-lfs
Then initialize Git LFS
git-lfs install
- Install ffmpeg
A simple install should work fine, despite how convoluted the library tends to be.
# preferred
sudo apt update && sudo apt install ffmpeg
-OR-
# conda method is not as well tested for this project
conda install -c conda-forge -y ffmpeg
# An update command might be necessary to get all of ffmpeg's codec-specific extensions, which we need.
# solves error in parallel_whisper.py: ❌❌Error during whisper: Expecting value: line 1 column 1 (char 0)
conda update ffmpeg
- Clone the repo with our custom submodules
git clone --recurse-submodules [email protected]:KastanDay/video-pretrained-transformer.git
- Install pip requirements
pip install -r ./requirements.txt
Later, if updates are made to submodules, you can pull new changes using:
git submodule update --remote
We have submodules in the first place because we needed to modify the internal logic of three libraries used in preprocessing: Lhotse (for speed), OpenPSG, and Transformers (to modify the T5 implementation to support modality encodings; a minimal sketch of that idea follows the install notes).
Install is complete!
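To make "modality encodings" concrete, here is a minimal sketch assuming the simplest possible design: a learned per-modality vector added to the input embeddings before they reach the shared T5 trunk. The `ModalityEncodedT5` class name and the modality-id convention are hypothetical, for illustration only; the actual change lives inside the T5 code of our Transformers submodule.

```python
# Minimal sketch of the modality-encoding idea (illustrative, not our actual fork):
# tag every input position with a learned per-modality vector before the T5 trunk.
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class ModalityEncodedT5(nn.Module):  # hypothetical wrapper name, for illustration only
    def __init__(self, model_name="google/flan-t5-base", num_modalities=3):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(model_name)
        # One learned vector per modality, e.g. 0 = text, 1 = video frame, 2 = audio.
        self.modality_embed = nn.Embedding(num_modalities, self.t5.config.d_model)

    def forward(self, inputs_embeds, modality_ids, attention_mask=None, labels=None):
        # Add the modality embedding so text, frame, and audio tokens stay
        # distinguishable while sharing one encoder sequence.
        inputs_embeds = inputs_embeds + self.modality_embed(modality_ids)
        return self.t5(inputs_embeds=inputs_embeds,
                       attention_mask=attention_mask,
                       labels=labels)

# Usage sketch: embed your tokens/frames yourself, then call
#   model(inputs_embeds=embs, modality_ids=ids, labels=target_ids)
```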
- (Oct 2022) Start of project.
- (Dec 2022) MVP completed, but messed up the evaluation.
- (Dec 2022) Migrated all data to the Deeplake database library; overall much cleaner & more reliable for distributed database updates.
- (Jan 2023) Migrated all training logic to Composer, by MosaicML. Super cool library for efficient LLM training, even of Hugging Face models (see the Composer sketch below).
- (Jan 2023) Finished scaling up distributed pre-processing (i.e. inference w/ Whisper, FlanT5, OpenPSG and CLIP). Rock-solid Deeplake distributed `dataset.append()` operations on any size SLURM cluster (see the Deeplake sketch below).
- (Feb 2023) Tested different backbones: T5 vs. T5 v1.1 vs. Flan-T5. Somehow, v1.1 was terrible and Flan-T5 was by far the best, as suggested by another fine-tuning study; the author confirmed this in my follow-up question.
- (Mar 2023) WIP: TVQA evaluation. Need to fit more video frames into our 1024-token context window, probably by using fewer final hidden states from CLIP (see the CLIP sketch below).
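The Composer migration follows roughly the pattern below: wrap the Hugging Face model in Composer's `HuggingFaceModel` and hand it to the `Trainer`. This is a minimal sketch with a throwaway dummy batch; the model name, batch, and trainer settings are placeholders, not the project's actual training config.

```python
# Minimal Composer training sketch (placeholder model/data, not the project's config).
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, T5ForConditionalGeneration
from composer import Trainer
from composer.models import HuggingFaceModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
hf_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
model = HuggingFaceModel(hf_model, tokenizer=tokenizer)

# Tiny dummy example just to smoke-test the loop; real data comes from Deeplake.
batch = tokenizer(["summarize: the cat sat on the mat"], return_tensors="pt")
batch["labels"] = tokenizer(["a cat sits"], return_tensors="pt")["input_ids"]
train_loader = DataLoader([{k: v[0] for k, v in batch.items()}], batch_size=1)

trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    max_duration="1ep",   # one pass over the dummy data
    device="cpu",
)
trainer.fit()
```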
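The `dataset.append()` flow from the Jan 2023 item looks roughly like this sketch. The tensor names, htypes, and local path are made up for illustration; the real schema lives in the preprocessing scripts.

```python
# Illustrative Deeplake append pattern (made-up schema and path, not the real one).
import numpy as np
import deeplake

ds = deeplake.empty("./vpt_demo_dataset", overwrite=True)
with ds:
    ds.create_tensor("clip_embedding", dtype="float32")
    ds.create_tensor("caption", htype="text")

# Each SLURM worker appends whole samples, one dict per video segment.
with ds:
    ds.append({
        "clip_embedding": np.random.rand(768).astype("float32"),
        "caption": "a person explains how to repot a plant",
    })

print(len(ds), ds.clip_embedding[0].numpy().shape)
```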
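The "fewer final hidden states from CLIP" idea for the TVQA context budget looks roughly like this: keep the CLS token plus a pooled patch token per frame instead of all 50 vision tokens, so many more frames fit into 1024 tokens. The token counts and pooling choice here are illustrative, not a settled design.

```python
# Sketch: shrink CLIP's per-frame token count so more frames fit in a 1024-token window.
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.new("RGB", (224, 224))          # stand-in for a decoded video frame
inputs = processor(images=frame, return_tensors="pt")

with torch.no_grad():
    out = vision(**inputs)

tokens = out.last_hidden_state                # (1, 50, 768): 1 CLS + 49 patch tokens
cls_token = tokens[:, :1, :]                  # keep the CLS summary token
patch_pool = tokens[:, 1:, :].mean(dim=1, keepdim=True)  # one pooled patch token

per_frame = torch.cat([cls_token, patch_pool], dim=1)    # 2 tokens per frame instead of 50
print(per_frame.shape)                        # torch.Size([1, 2, 768])
```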
Up next:
- Find a better scene-graph implementation: only 55 classes from COCO is not enough for YouTube data. Ours relies on Detectron2 as a base, which is great for in-domain objects but not general. I think the best we can do is use the 1k classes from ImageNet (see the ImageNet-tagging sketch below).
- Totally reimplement the sound/audio model to move away from Whisper -- I think Google's AudioSet, with 600+ classes based on YouTube data, will enable the best models. Here's my favorite from that competition (an AudioSet-tagging sketch also follows below).
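To show what richer per-frame tags could look like, here is a minimal sketch using an off-the-shelf ImageNet-1k classifier from torchvision to produce top-k object tags per frame. This is only an illustration of the direction, not the planned scene-graph replacement.

```python
# Sketch: per-frame ImageNet-1k tags as a richer alternative to 55 COCO classes.
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.IMAGENET1K_V2
classifier = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()             # resizing + normalization for this checkpoint
labels = weights.meta["categories"]           # the 1,000 ImageNet class names

frame = Image.new("RGB", (640, 360))          # stand-in for a decoded video frame
with torch.no_grad():
    logits = classifier(preprocess(frame).unsqueeze(0))

topk = logits.softmax(dim=-1).topk(5)
tags = [labels[i] for i in topk.indices[0]]   # five candidate object tags for this frame
print(tags)
```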
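Likewise, one way to get AudioSet-style tags instead of Whisper transcripts is an off-the-shelf Audio Spectrogram Transformer. The checkpoint below is one public AudioSet model on the Hugging Face Hub, not necessarily the competition entry referenced above; treat it as a sketch of the direction.

```python
# Sketch: AudioSet tagging with a publicly available Audio Spectrogram Transformer.
import numpy as np
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"
feature_extractor = ASTFeatureExtractor.from_pretrained(ckpt)
model = ASTForAudioClassification.from_pretrained(ckpt).eval()

waveform = np.zeros(16000, dtype=np.float32)      # 1 s of silence as a stand-in audio clip
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits               # one score per AudioSet class

top = logits.sigmoid().topk(5)                    # AudioSet is multi-label, hence sigmoid
tags = [model.config.id2label[i.item()] for i in top.indices[0]]
print(tags)                                       # five highest-scoring AudioSet tags
```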