
Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).

📃 Intuition

I'm building my own multimedia GPT, a competitor to MERLOT Reserve & Vid2Seq. It's pre-trained from scratch on YouTube data, mostly the YT-1B dataset of 20M curated YouTube videos containing significant spoken language (English only).

📜 Arxiv: https://arxiv.org/abs/2304.10505

👉 For project highlights & intuition with photos, check out: https://twitter.com/KastanDay/status/1595991960380411905

(No 3D) VPT Architecture Diagram

My design follows the "Embedding + Trunk + Head" pattern I first noticed succeeding in DETR and AlphaFold2. Now, in early 2023, it's succeeding in PaLM-E and Vid2Seq from Google, Prismer from Nvidia, and many more listed in my Twitter announcement.
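To make the "Embedding + Trunk + Head" pattern concrete, here is a minimal sketch (not the repo's actual code) of projecting per-modality embeddings into the trunk's width and concatenating them in front of a Flan-T5 trunk. The projection layers, sequence lengths, and model size below are illustrative assumptions.

```python
# Minimal sketch of the "Embedding + Trunk + Head" pattern (illustrative only).
# Assumes pre-computed CLIP / Whisper / scene-graph embeddings; the dims,
# sequence lengths, and projection layers are made-up placeholders.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration

class TinyVPT(nn.Module):
    def __init__(self, t5_name="google/flan-t5-small",
                 clip_dim=768, whisper_dim=512, scene_dim=256):
        super().__init__()
        self.trunk = T5ForConditionalGeneration.from_pretrained(t5_name)
        d_model = self.trunk.config.d_model
        # One linear adapter per modality maps frozen encoder outputs into T5 space.
        self.proj_clip = nn.Linear(clip_dim, d_model)
        self.proj_whisper = nn.Linear(whisper_dim, d_model)
        self.proj_scene = nn.Linear(scene_dim, d_model)

    def forward(self, clip_emb, whisper_emb, scene_emb, labels):
        # Each input: (batch, seq_len_for_that_modality, modality_dim)
        fused = torch.cat([self.proj_clip(clip_emb),
                           self.proj_whisper(whisper_emb),
                           self.proj_scene(scene_emb)], dim=1)
        # The fused sequence goes straight into the T5 encoder via inputs_embeds;
        # the decoder's language-modeling head produces the text output.
        return self.trunk(inputs_embeds=fused, labels=labels)

# Toy usage with random tensors standing in for real per-modality embeddings.
tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = TinyVPT()
labels = tok("a person explains how to bake bread", return_tensors="pt").input_ids
out = model(torch.randn(1, 16, 768), torch.randn(1, 32, 512),
            torch.randn(1, 8, 256), labels=labels)
print(out.loss)
```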

🚀 Quickstart

  1. Install Git LFS

```bash
# Install git-lfs (via apt, brew, or conda)
brew install git-lfs
# -OR-
conda install -c conda-forge -y git-lfs
```

Then initialize Git LFS:

```bash
git-lfs install
```
  2. Install ffmpeg

A simple install should work fine, despite how convoluted the library tends to be.

```bash
# preferred
sudo apt update && sudo apt install ffmpeg

# -OR- (the conda method is not as well tested for this project)
conda install -c conda-forge -y ffmpeg
# An update may be necessary to get all of ffmpeg's codec-specific extensions, which we need.
# It solves this error in parallel_whisper.py: ❌❌ Error during whisper: Expecting value: line 1 column 1 (char 0)
conda update ffmpeg
```
  3. Clone the repo with our custom submodules

```bash
git clone --recurse-submodules git@github.com:KastanDay/video-pretrained-transformer.git
```

  4. Install pip requirements

```bash
pip install -r ./requirements.txt
```

Later, if updates are made to submodules, you can pull new changes using:

```bash
git submodule update --remote
```

We use submodules because we needed to modify the internal logic of three libraries used in preprocessing: Lhotse (to make it faster), OpenPSG, and Transformers (to modify the T5 implementation to support modality encodings).

Install is complete!
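For context on the Transformers fork mentioned above: I read "modality encodings" as a learned embedding added to each input token that tags which modality it came from (vision, audio, text, scene graph), analogous to segment/token-type embeddings. The sketch below is one way to express that idea, not the actual patch to T5.

```python
# Sketch of "modality encodings": a learned per-modality embedding added to the
# fused input sequence so the trunk can tell vision / audio / text tokens apart.
# This is an interpretation of the idea, not the code in the patched Transformers fork.
import torch
import torch.nn as nn

NUM_MODALITIES = 4  # e.g. 0=text, 1=CLIP frames, 2=Whisper audio, 3=scene graph

class ModalityEncoding(nn.Module):
    def __init__(self, d_model, num_modalities=NUM_MODALITIES):
        super().__init__()
        self.embed = nn.Embedding(num_modalities, d_model)

    def forward(self, inputs_embeds, modality_ids):
        # inputs_embeds: (batch, seq_len, d_model)
        # modality_ids:  (batch, seq_len) integer tag per token
        return inputs_embeds + self.embed(modality_ids)

# Example: 16 frame tokens followed by 32 audio tokens, d_model = 512.
enc = ModalityEncoding(d_model=512)
x = torch.randn(1, 48, 512)
ids = torch.tensor([[1] * 16 + [2] * 32])
print(enc(x, ids).shape)  # torch.Size([1, 48, 512])
```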

Progress

  1. (Oct 2022) Start of project.
  2. (Dec 2022) MVP completed, but messed up the evaluation.
  3. (Dec 2022) Migrated all data to Deeplake database library, overall much cleaner & more reliable for distributed database updates.
  4. (Jan 2023) Migrated all training logic to Composer, by MosaicML. Super cool library for efficient LLM training, even of huggingface models.
  5. (Jan 2023) Finished scaling up distributed pre-processing (i.e. inference w/ Whisper, FlanT5, OpenPSG, and CLIP). Rock-solid Deeplake distributed dataset.append() operations on any size SLURM cluster (see the sketch after this list).
  6. (Feb 2023) Tested different backbones: T5 vs. T5 v1.1 vs. Flan-T5. Somehow, v1.1 was terrible and Flan-T5 was by far the best, as suggested by another finetuning study; the author confirmed this in my follow-up question.
  7. (Mar 2023) WIP: TVQA evaluation. Need to fit more video frames into our 1024 context window, probably by using fewer final hidden states from CLIP.
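To illustrate the preprocessing pattern from item 5, here is a minimal sketch of per-worker Deeplake appends (Deeplake 3.x-style API). The dataset path and tensor names are made up for the example; the real pipeline and schema live in the repo's preprocessing scripts.

```python
# Minimal sketch of the Deeplake append pattern from item 5 (illustrative only):
# each SLURM worker computes embeddings for its shard of clips and appends them.
# The dataset path and tensor names are assumptions, not the repo's real schema.
import deeplake
import numpy as np

ds = deeplake.empty("./vpt_embeddings_example", overwrite=True)
with ds:
    ds.create_tensor("clip_embedding", htype="generic", dtype="float32")
    ds.create_tensor("whisper_text", htype="text")

def process_shard(video_clips):
    """Stand-in for one worker's loop; the real code runs Whisper/CLIP/etc. here."""
    with ds:  # batching appends inside the context manager keeps writes fast
        for clip in video_clips:
            ds.append({
                "clip_embedding": np.random.rand(512).astype(np.float32),  # fake CLIP output
                "whisper_text": f"transcript for {clip}",                  # fake transcript
            })

process_shard(["clip_0001", "clip_0002"])
print(len(ds))  # 2 samples appended
```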

Up next:

  • Find a better scene-graph implementation: only 55 classes from COCO is not enough for YouTube data. Ours relies on Detectron2 as a base, which is great for in-domain objects but not general. I think the best we can do is to use the 1k classes from ImageNet.
  • Totally reimplement the sound/audio model to move away from Whisper -- I think Google's AudioSet, with 600+ classes based on YouTube data, will enable the best models. Here's my favorite from that competition.
