VQVC-Pytorch

An unofficial implementation of Vector Quantization Voice Conversion (VQVC; D. Y. Wu et al., 2020)

model

How-to-run

  1. Install dependencies.

    pip install -r requirements.txt
    
  2. Download the dataset and the pretrained VocGAN model.

  3. Preprocess

    • preprocess mel-spectrograms with the following command:
    python prepro.py 1 1
    
    • first argument: mel-spectrogram preprocessing
    • second argument: metadata split (you may change the proportion of samples used for training vs. evaluation via data_split_ratio in config.py; see the config sketch after this list)
  4. Train the model

    python train.py
    
    • In config.py, you may edit train_visible_device to choose the GPU used for training (see the config sketch after this list).
    • As in the paper, 60K training steps are enough.
    • Training the model takes only 30 minutes.
  5. Voice conversion

    • After training, specify the source and reference speech for voice conversion by editing src_paths and ref_paths in conversion.py (see the sketch after this list).
    • After conversion, you may check the output samples in the results directory.
    python conversion.py
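
The exact contents of config.py depend on the repository; the following is a minimal sketch of the two settings referenced in steps 3 and 4, with names taken from this README and values that are only illustrative:

    # config.py (illustrative sketch; only the two options referenced above are shown)

    # Proportion of preprocessed utterances assigned to the training split;
    # the remainder is used for evaluation. The value 0.9 is an assumed default.
    data_split_ratio = 0.9

    # GPU index (or comma-separated indices) made visible during training,
    # e.g. "0" or "0,1". The value below is only an example.
    train_visible_device = "0"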
    
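For step 5, the README only names the two path lists in conversion.py; a hedged sketch of pointing them at your own audio (the paths below are placeholders, not files shipped with the repository):

    # conversion.py (illustrative excerpt): source/reference utterance pairs.
    # Replace the placeholder paths with your own WAV files.
    src_paths = ["./data/VCTK-Corpus/wav48/p225/p225_001.wav"]
    ref_paths = ["./data/VCTK-Corpus/wav48/p226/p226_001.wav"]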

Visualization of training

Training loss visualization

  • reconstruction loss

recon_loss

  • commitment loss

commitment_loss

  • perplexity of codebook

perplexity

  • total loss

total_loss
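
These curves correspond to the standard VQ-VAE quantities. Below is a minimal PyTorch sketch of how they are commonly computed; it is a generic illustration, not this repository's code, and the beta weight and tensor shapes are assumptions:

    import torch
    import torch.nn.functional as F

    def vq_quantities(z_e, codebook, beta=0.25):
        """Generic VQ-VAE quantities for an encoder output z_e of shape
        (batch, time, dim) and a codebook of shape (num_codes, dim)."""
        # Nearest-code assignment by Euclidean distance.
        dist = torch.cdist(z_e.reshape(-1, z_e.size(-1)), codebook)    # (B*T, K)
        indices = dist.argmin(dim=-1)                                   # (B*T,)
        z_q = codebook[indices].view_as(z_e)                            # quantized codes

        # Commitment loss: pull the encoder output towards its assigned code.
        commitment = beta * F.mse_loss(z_e, z_q.detach())

        # Perplexity: effective number of codes in use (1 means codebook collapse).
        one_hot = F.one_hot(indices, num_classes=codebook.size(0)).float()
        avg_probs = one_hot.mean(dim=0)
        perplexity = torch.exp(-(avg_probs * (avg_probs + 1e-10).log()).sum())

        # Straight-through estimator so gradients reach the encoder.
        z_q_st = z_e + (z_q - z_e).detach()
        return z_q_st, commitment, perplexity

The reconstruction loss is then a distance (e.g. L1 or MSE) between the decoded mel and the ground-truth mel, and the total loss plotted above would typically be the sum of the reconstruction and commitment terms.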

Mel-spectrogram visualization

  • Ground-truth mel (top), reconstructed mel (top-middle), contents mel (bottom-middle), style mel (bottom, i.e., the encoder output minus its code)

melspectrogram
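
Under the VQVC reading of these panels, the quantized code carries the content and the residual carries the speaker style; a tiny sketch of that decomposition for one utterance, using a generic nearest-code lookup and random tensors as stand-ins for real encoder outputs:

    import torch

    def quantize(z, codebook):
        # Generic nearest-code lookup (illustrative, not the repository's function).
        dist = torch.cdist(z.reshape(-1, z.size(-1)), codebook)
        return codebook[dist.argmin(dim=-1)].view_as(z)

    z_e = torch.randn(1, 100, 64)        # encoder output: (batch, time, dim)
    codebook = torch.randn(256, 64)      # learned codebook: (num_codes, dim)

    content = quantize(z_e, codebook)    # the "contents mel" panel is decoded from this
    style = z_e - content                # the "style mel" panel: encoder output minus its code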

Inference results

  • You may listen to the audio samples.

  • Visualization of converted mel-spectrogram

    • source mel (top), reference mel (middle), converted mel (bottom)

converted_melspectrogram
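
The converted panel follows the same decomposition applied across two utterances: the source provides the content (its quantized codes) and the reference provides the style (its residual). A compact, self-contained sketch; the time-averaging of the style residual follows the VQVC paper, and all tensors and shapes here are illustrative:

    import torch

    def quantize(z, codebook):
        # Generic nearest-code lookup (illustrative, not the repository's function).
        dist = torch.cdist(z.reshape(-1, z.size(-1)), codebook)
        return codebook[dist.argmin(dim=-1)].view_as(z)

    codebook = torch.randn(256, 64)
    z_src = torch.randn(1, 120, 64)      # encoder output of the source utterance
    z_ref = torch.randn(1, 90, 64)       # encoder output of the reference utterance

    content_src = quantize(z_src, codebook)                      # what is said (source)
    style_ref = (z_ref - quantize(z_ref, codebook)).mean(dim=1)  # how it is said (reference)

    # The decoder turns this sum into the converted mel-spectrogram (bottom panel).
    converted = content_src + style_ref.unsqueeze(1)             # broadcast style over time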

Pretrained models

  1. VQVC pretrained model
  • download the pretrained VQVC model and place it in ckpts/VCTK-Corpus/
  2. VocGAN pretrained model
  • download the pretrained VocGAN model and place it in vocoder/vocgan/pretrained_models

Experimental Notes

  • Trimming silence and the convolution stride are very important for transferring the style from the reference speech.
  • Unlike the paper, I used NVIDIA's preprocessing method so that the pretrained VocGAN model can be used.
  • Training is very unstable (after 70K steps, the perplexity of the codebook drops sharply to 1).
  • (Future work) Training on the Korean Emotional Speech dataset is not finished yet.

References (or acknowledgements)