Pretraining

(If you are new to the ideas of mmpt.processors, see README first.) We mostly use the Howto100M dataset for pretraining (other datasets are coming), so you are unlikely to need a new MetaProcessor, VideoProcessor or TextProcessor; you will mostly be working on a new Aligner, a new model and a new loss.

Data Sharding

Pretraining on Howto100M is heavy on IO, since the millions of videos and captions on disk cannot fit into memory. It is desirable to have an optimized preprocessing step before the actual data loading.

We support data sharding to pack multiple videos into shards of training data, for both videos and captions (see dataset for preprocessing). These shards are mapped into memory to reduce the frequency of IO access on millions of files (see the processors starting with Sharded*). This is the default config for the how2 dataset: projects/task/how2.yaml.
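
The idea of packing many videos into one memory-mapped shard can be sketched as follows. This is a minimal illustration using NumPy only, not the actual Sharded* processors: the helper names `write_shard` and `read_video`, the `.npy` file layout and the `(start, end)` offset list are all assumptions made for the example.

```python
import os
import tempfile

import numpy as np


def write_shard(features, shard_path):
    """Pack per-video feature arrays into one file.

    Returns a list of (start, end) row offsets, one per video.
    (Hypothetical helper, not part of mmpt.)
    """
    offsets = []
    start = 0
    for feat in features:
        offsets.append((start, start + feat.shape[0]))
        start += feat.shape[0]
    packed = np.concatenate(features, axis=0).astype(np.float32)
    np.save(shard_path, packed)
    return offsets


def read_video(shard_path, offsets, idx):
    """Memory-map the shard and copy out only the rows of one video."""
    shard = np.load(shard_path, mmap_mode="r")  # no full read into memory
    start, end = offsets[idx]
    return np.asarray(shard[start:end])


# Three toy "videos" with 3, 5 and 2 feature rows of dimension 4.
rng = np.random.default_rng(0)
videos = [rng.standard_normal((n, 4)).astype(np.float32) for n in (3, 5, 2)]

tmp_dir = tempfile.mkdtemp()
shard_path = os.path.join(tmp_dir, "shard_0.npy")
offsets = write_shard(videos, shard_path)
second = read_video(shard_path, offsets, 1)
```

With this layout, one file open serves many videos, and only the requested rows are paged in from disk.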

Many thanks to Dmytro Okhonko for sharing the code from the MARGE project.

Training

Pretraining on Howto100M is expected to run on one or multiple nodes, where each node has 8 GPUs with 32 GB of memory. Launching pretraining with MFM+MLM can be done via:
python locallaunch.py projects/mfmmlm/how2.yaml

Pre-training with a Retrieval Model (VideoCLIP)

This project now supports alternating between running a retrieval model and pre-training. We implement a basic retrieval model built on the hidden states of a video and faiss.

You may need to install faiss via conda install faiss-cpu -c pytorch.

Right now, the hidden state of a video is computed as the average over 8 clips of their pooled visual/text hidden states. See mmpt/tasks/retritask.py for more details. The .yaml config for running pre-training with a retrieval model can be found at projects/retri/videoretri.yaml.
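
As a rough illustration of the averaging step, the sketch below builds one vector per video from 8 pooled clip hidden states and then looks up the nearest video. It is not the code in mmpt/tasks/retritask.py: the function names are hypothetical, the L2 normalization is an assumption, and a brute-force NumPy inner-product search stands in for the faiss index.

```python
import numpy as np


def video_embedding(clip_hidden):
    """Average pooled clip hidden states of shape (num_clips, dim).

    The L2 normalization is an assumption of this sketch.
    """
    mean = clip_hidden.mean(axis=0)
    return mean / np.linalg.norm(mean)


def nearest_videos(embeddings, query_idx, k):
    """Brute-force inner-product search (stand-in for a faiss index)."""
    scores = embeddings @ embeddings[query_idx]
    scores[query_idx] = -np.inf  # exclude the query video itself
    return np.argsort(-scores)[:k]


# Four toy videos, each with 8 identical clip vectors of dimension 3.
clip_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
videos = [np.tile(np.array(v), (8, 1)) for v in clip_vectors]

embeddings = np.stack([video_embedding(v) for v in videos])
nearest = nearest_videos(embeddings, query_idx=0, k=2)
```

Here video 1 is the closest neighbor of video 0, since their averaged vectors point in nearly the same direction.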