This repository contains the code for two closely related research projects that share most of the same codebase. The second project extends the first to the multimodal domain.
We want to generalize the self-distillation learning paradigm so that it applies to any kind of unimodal or fused multimodal data without the need for modality-specific augmentation or masking strategies. Instead, we embed the input data into a universal input array and apply a single masking strategy in the latent space rather than the data space. We test this generalized approach on a range of datasets containing text, image, audio, and video data.
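The core idea can be sketched in a few lines of PyTorch: every modality is first embedded into a generic token array, and a random fraction of the resulting latent vectors is masked before the student encoder sees it. The tensor shapes, mask ratio, and the simple linear embedder below are illustrative assumptions rather than the actual modules in this repository.
import torch
import torch.nn as nn

def mask_latents(latents: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Zero out a random fraction of latent vectors per example.

    latents: (batch, num_latents, dim) embeddings of the universal input array.
    """
    batch, num_latents, _ = latents.shape
    num_masked = int(num_latents * mask_ratio)
    # pick positions to mask independently for every example in the batch
    scores = torch.rand(batch, num_latents, device=latents.device)
    masked_idx = scores.argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(batch, num_latents, dtype=torch.bool, device=latents.device)
    mask.scatter_(1, masked_idx, True)
    return latents.masked_fill(mask.unsqueeze(-1), 0.0)

# toy usage: any modality is first embedded to (batch, seq, dim) and then masked
embed = nn.Linear(32, 64)               # stands in for a modality-agnostic embedder
x = torch.randn(8, 128, 32)             # e.g. flattened image patches or audio frames
student_view = mask_latents(embed(x))   # the student sees masked latents
teacher_view = embed(x)                 # the teacher sees the unmasked view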
# TODO: update this section to run with Poetry
Install dependencies
# clone project
git clone https://github.com/marcomoldovan/multimodal-self-distillation
cd multimodal-self-distillation
# install the correct python version
sudo apt-get install python3.10 # Linux, Python 3.7 or higher
brew install [email protected] # macOS, Python 3.7 or higher
choco install python --version=3.9 # Windows, Python 3.7-3.9
# create python virtual environment and activate it
python3 -m venv myenv
source myenv/bin/activate
# if you have several versions of Python, you can create a virtual environment with a specific version:
virtualenv --python=/usr/bin/<python3.x> myenv
# on Windows, activate the environment with:
myenv\Scripts\activate.bat
# [ALTERNATIVE] create conda environment
conda create -n myenv python=<3.x>
conda activate myenv
# install pytorch according to instructions
# https://pytorch.org/get-started/
# install requirements
pip install -r requirements.txt
Train model with default configuration
# train on CPU
python train.py trainer.gpus=0
# train on GPU
python train.py trainer.gpus=1
Train model with chosen experiment configuration from configs/experiment/unimodal
python train.py experiment=unimodal/experiment_name.yaml
You can override any parameter from the command line like this
python train.py trainer.max_epochs=20 datamodule.batch_size=64
We view pairs of multimodal datapoints as augmentations of the same semantic concept and leverage this observation to apply the self-distillation paradigm to the multimodal setting in order to learn a coordinated multimodal representation space. We show that this approach learns a representation space that is better aligned than the one learned by a standard contrastive loss, while avoiding the need for negative mining, a crucial weakness of the contrastive approach.
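A minimal sketch of this idea, assuming a BYOL/data2vec-style setup in which the teacher is an exponential moving average (EMA) of the student and the two modalities of one datapoint serve as the two views; the encoder, feature sizes, loss, and momentum value are illustrative placeholders rather than the code in this repository:
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 64))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(student: nn.Module, teacher: nn.Module, momentum: float = 0.999) -> None:
    # the teacher's weights track the student as an exponential moving average
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# the two modalities of one datapoint act as two "views" of the same concept
image_feats = torch.randn(8, 32)   # e.g. pooled image features
text_feats = torch.randn(8, 32)    # e.g. pooled text features of the paired caption

student_out = student(image_feats)
with torch.no_grad():
    teacher_out = teacher(text_feats)

# regress the teacher's representation directly; no negative pairs are required
loss = F.mse_loss(student_out, teacher_out)
loss.backward()
ema_update(student, teacher)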
The installation and default training steps are identical to the ones described above.
Train model with chosen experiment configuration from configs/experiment/multimodal
python train.py experiment=multimodal/experiment_name.yaml