VQVC-Pytorch

An unofficial implementation of Vector Quantization Voice Conversion (VQVC; D. Y. Wu et al., 2020)

model

How-to-run

  1. Install dependencies.

    pip install -r requirements.txt
    
  2. Download the dataset and the pretrained VocGAN model.

  3. Preprocess

    • preprocess mel-spectrograms with the following command:
    python prepro.py 1 1
    
    • first argument: mel-spectrogram preprocessing
    • second argument: metadata split (you may change the proportion of samples used for training vs. evaluation via data_split_ratio in config.py; see the config sketch after this list)
  4. Train the model

    python train.py
    
    • In config.py, you may edit train_visible_device to choose the GPU used for training (see the config sketch after this list).
    • As in the paper, 60K training steps are enough.
    • Training the model takes only 30 minutes.
  5. Voice conversion

    • After training, specify the source and reference speech for voice conversion by editing src_paths and ref_paths in conversion.py (see the sketch after this list).
    • After conversion, you may check the output samples in the results directory.
    python conversion.py
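
The exact contents of config.py depend on the repository; the following is a minimal sketch of the two settings referenced in steps 3 and 4, with names taken from this README and values that are only illustrative:

    # config.py (illustrative sketch; only the two options referenced above are shown)

    # Proportion of preprocessed utterances assigned to the training split;
    # the remainder is used for evaluation. The value 0.9 is an assumed default.
    data_split_ratio = 0.9

    # GPU index (or comma-separated indices) made visible during training,
    # e.g. "0" or "0,1". The value below is only an example.
    train_visible_device = "0"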
    
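For step 5, the README only names the two path lists in conversion.py; a hedged sketch of pointing them at your own audio (the paths below are placeholders, not files shipped with the repository):

    # conversion.py (illustrative excerpt): source/reference utterance pairs.
    # Replace the placeholder paths with your own WAV files.
    src_paths = ["./data/VCTK-Corpus/wav48/p225/p225_001.wav"]
    ref_paths = ["./data/VCTK-Corpus/wav48/p226/p226_001.wav"]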

Visualization of training

Training loss visualization

  • reconstruction loss

recon_loss

  • commitment loss

commitment_loss

  • perplexity of codebook

perplexity

  • total loss

total_loss
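
These curves correspond to the standard VQ-VAE quantities. Below is a minimal PyTorch sketch of how they are commonly computed; it is a generic illustration, not this repository's code, and the beta weight and tensor shapes are assumptions:

    import torch
    import torch.nn.functional as F

    def vq_quantities(z_e, codebook, beta=0.25):
        """Generic VQ-VAE quantities for an encoder output z_e of shape
        (batch, time, dim) and a codebook of shape (num_codes, dim)."""
        # Nearest-code assignment by Euclidean distance.
        dist = torch.cdist(z_e.reshape(-1, z_e.size(-1)), codebook)    # (B*T, K)
        indices = dist.argmin(dim=-1)                                   # (B*T,)
        z_q = codebook[indices].view_as(z_e)                            # quantized codes

        # Commitment loss: pull the encoder output towards its assigned code.
        commitment = beta * F.mse_loss(z_e, z_q.detach())

        # Perplexity: effective number of codes in use (1 means codebook collapse).
        one_hot = F.one_hot(indices, num_classes=codebook.size(0)).float()
        avg_probs = one_hot.mean(dim=0)
        perplexity = torch.exp(-(avg_probs * (avg_probs + 1e-10).log()).sum())

        # Straight-through estimator so gradients reach the encoder.
        z_q_st = z_e + (z_q - z_e).detach()
        return z_q_st, commitment, perplexity

The reconstruction loss is then a distance (e.g. L1 or MSE) between the decoded mel and the ground-truth mel, and the total loss plotted above would typically be the sum of the reconstruction and commitment terms.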

Mel-spectrogram visualization

  • Ground-truth mel (top), reconstructed mel (top-middle), contents mel (bottom-middle), style mel (bottom, i.e., the encoder output minus its code)

melspectrogram
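
Under the VQVC reading of these panels, the quantized code carries the content and the residual carries the speaker style; a tiny sketch of that decomposition for one utterance, using a generic nearest-code lookup and random tensors as stand-ins for real encoder outputs:

    import torch

    def quantize(z, codebook):
        # Generic nearest-code lookup (illustrative, not the repository's function).
        dist = torch.cdist(z.reshape(-1, z.size(-1)), codebook)
        return codebook[dist.argmin(dim=-1)].view_as(z)

    z_e = torch.randn(1, 100, 64)        # encoder output: (batch, time, dim)
    codebook = torch.randn(256, 64)      # learned codebook: (num_codes, dim)

    content = quantize(z_e, codebook)    # the "contents mel" panel is decoded from this
    style = z_e - content                # the "style mel" panel: encoder output minus its code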

Inference results

  • You may listen to the audio samples.

  • Visualization of converted mel-spectrogram

    • source mel (top), reference mel (middle), converted mel (bottom)

converted_melspectrogram
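
The converted panel follows the same decomposition applied across two utterances: the source provides the content (its quantized codes) and the reference provides the style (its residual). A compact, self-contained sketch; the time-averaging of the style residual follows the VQVC paper, and all tensors and shapes here are illustrative:

    import torch

    def quantize(z, codebook):
        # Generic nearest-code lookup (illustrative, not the repository's function).
        dist = torch.cdist(z.reshape(-1, z.size(-1)), codebook)
        return codebook[dist.argmin(dim=-1)].view_as(z)

    codebook = torch.randn(256, 64)
    z_src = torch.randn(1, 120, 64)      # encoder output of the source utterance
    z_ref = torch.randn(1, 90, 64)       # encoder output of the reference utterance

    content_src = quantize(z_src, codebook)                      # what is said (source)
    style_ref = (z_ref - quantize(z_ref, codebook)).mean(dim=1)  # how it is said (reference)

    # The decoder turns this sum into the converted mel-spectrogram (bottom panel).
    converted = content_src + style_ref.unsqueeze(1)             # broadcast style over time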

Pretrained models

  1. VQVC pretrained model
  • download the pretrained VQVC model and place it in ckpts/VCTK-Corpus/
  2. VocGAN pretrained model
  • download the pretrained VocGAN model and place it in vocoder/vocgan/pretrained_models

Experimental Notes

  • Trimming silence and the convolution stride are very important for transferring the style from the reference speech.
  • Unlike the paper, I used NVIDIA's preprocessing method so that the pretrained VocGAN model can be used.
  • Training is very unstable (after 70K steps, the perplexity of the codebook drops sharply to 1).
  • (Future work) Training on the Korean Emotional Speech dataset is not finished yet.

References (or acknowledgements)