We tackle the problem of learning a shared multimodal representation space for language in both its written (text) and spoken (speech) form. Semantically similar text and speech segments are contrastively aligned in this space, which enables cross-modal retrieval of speech segments given a text query and vice versa.
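Conceptually, the training objective is a symmetric contrastive loss over paired text and speech embeddings. The sketch below illustrates this idea in PyTorch; it assumes an InfoNCE-style formulation with in-batch negatives, and the function name, embedding dimensions, and temperature value are illustrative rather than this repository's exact implementation.

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, speech_emb, temperature=0.07):
    # Symmetric InfoNCE-style loss: matched text/speech pairs are pulled
    # together, all other pairs in the batch serve as negatives.
    text_emb = F.normalize(text_emb, dim=-1)        # unit-normalize embeddings
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.t() / temperature    # pairwise cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # i-th text matches i-th speech
    loss_t2s = F.cross_entropy(logits, targets)         # text-to-speech retrieval direction
    loss_s2t = F.cross_entropy(logits.t(), targets)     # speech-to-text retrieval direction
    return (loss_t2s + loss_s2t) / 2

# Toy usage with random tensors standing in for encoder outputs:
text_emb = torch.randn(8, 256)
speech_emb = torch.randn(8, 256)
print(contrastive_alignment_loss(text_emb, speech_emb))

At retrieval time, speech segments are ranked by cosine similarity to the embedded text query (and vice versa).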
Install dependencies
# clone project
git clone https://github.com/marcomoldovan/cross-modal-speech-segment-retrieval
cd cross-modal-speech-segment-retrieval
# [OPTIONAL] create python virtual environment
# Requires Python 3.7-3.9 on Windows, or Python 3.7 or higher on Linux and macOS
python3 -m venv myenv # uses default python version
virtualenv --python=/usr/bin/<python3.x> myenv # to specify python version
myenv\Scripts\activate.bat # for Windows
source myenv/bin/activate # for Linux or MacOS
# [ALTERNATIVE] create conda environment
conda create -n myenv python=3.8
conda activate myenv
# install pytorch according to instructions
# https://pytorch.org/get-started/
# install requirements
pip install -r requirements.txt
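After installing, a quick sanity check (a minimal Python snippet, not part of the repository) confirms that PyTorch imports correctly and can see a GPU if one is present:

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is visible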
Train model with default configuration
# train on CPU
python train.py trainer.gpus=0
# train on GPU
python train.py trainer.gpus=1
Train model with chosen experiment configuration from configs/experiment/
python train.py experiment=experiment_name.yaml
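For reference, an experiment file groups such overrides in one place. The following is a hypothetical YAML sketch, assuming the Hydra-based config layout implied by the commands above; the file name, the seed key, and the exact values are assumptions, not an actual file from configs/experiment/. The trainer and datamodule keys mirror the command-line overrides shown below.

# @package _global_   # Hydra directive so these keys merge into the global config
# Hypothetical sketch of configs/experiment/experiment_name.yaml

seed: 12345           # assumed key for reproducibility

trainer:
  gpus: 1
  max_epochs: 20

datamodule:
  batch_size: 64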
You can override any parameter from the command line like this:
python train.py trainer.max_epochs=20 datamodule.batch_size=64