The ultimate VITS2


The goal of this repo is to implement the most comprehensive VITS2 out there.

Changelist

  • Bump librosa and Python to their latest supported versions
  • Implement d-vectors from an external speaker encoder instead of speaker IDs, as in YourTTS
  • Implement a YourTTS-style d-vector-free text encoder, with the d-vector fed to the vocoder (currently only HiFiGAN supports this)
  • Implement a dataloader that loads d-vectors
  • Add a quantized text encoder: BERT -> bottleneck -> text features (see the sketch after this list)
  • VCTK audio loader
  • Implement a better vocoder with d-vector support
  • Remove boilerplate code in attentions.py and replace it with the native torch.nn.TransformerEncoder
  • Adan optimizer
  • PyTorch Lightning support
  • Add Bidirectional Flow Loss
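
The quantized text encoder item above is the least standard piece, so here is a minimal, hypothetical sketch of the "BERT -> bottleneck -> text features" idea: BERT hidden states are projected through a small bottleneck and snapped to a learned codebook with a straight-through estimator. The module name, dimensions, and plain-VQ formulation are assumptions for illustration, not this repo's actual code.

import torch
import torch.nn as nn

class QuantizedTextBottleneck(nn.Module):
    # Hypothetical module: BERT hidden states -> bottleneck -> vector-quantized text features.
    def __init__(self, bert_dim=768, bottleneck_dim=192, codebook_size=512):
        super().__init__()
        self.down = nn.Linear(bert_dim, bottleneck_dim)          # bottleneck projection
        self.codebook = nn.Embedding(codebook_size, bottleneck_dim)

    def forward(self, bert_hidden):                               # (B, T, bert_dim)
        z = self.down(bert_hidden)                                # (B, T, bottleneck_dim)
        # pick the nearest codebook entry per frame (plain VQ; no EMA or commitment loss here)
        codes = self.codebook.weight[None].expand(z.size(0), -1, -1)
        idx = torch.cdist(z, codes).argmin(dim=-1)                # (B, T)
        q = self.codebook(idx)                                    # quantized text features
        # straight-through estimator so gradients still reach the bottleneck
        return z + (q - z).detach()

feats = QuantizedTextBottleneck()(torch.randn(2, 50, 768))        # -> (2, 50, 192)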

Pre-requisites

  1. Python >= 3.8

  2. CUDA

  3. PyTorch 1.13.1 (+cu117)

  4. Clone this repository

  5. Install Python requirements.

    pip install -r requirements.txt
    

    If you want to use the pre-cleaned texts in filelists, you also need to install espeak.

    apt-get install espeak
    
  6. Prepare datasets & configuration

    1. wav files (22050Hz Mono, PCM-16)

    2. Prepare text files: one for training and one for validation. Split your dataset between the two files. The validation file should contain fewer entries than the training one, and its entries must not overlap with the training set. The expected line formats are shown below (a small parsing sketch follows this list).

      • Single speaker
      wavfile_path|transcript

      • Multi speaker
      wavfile_path|speaker_id|transcript
    3. Run preprocessing with a cleaner of your choice. You may change the symbols as well.

      • Single speaker
      python preprocess.py --text_index 1 --filelists PATH_TO_train.txt --text_cleaners CLEANER_NAME
      python preprocess.py --text_index 1 --filelists PATH_TO_val.txt --text_cleaners CLEANER_NAME
      
      • Multi speaker
      python preprocess.py --text_index 2 --filelists PATH_TO_train.txt --text_cleaners CLEANER_NAME
      python preprocess.py --text_index 2 --filelists PATH_TO_val.txt --text_cleaners CLEANER_NAME
      

      The resulting cleaned text will look like the examples above, with the transcripts replaced by their cleaned versions.

  7. Build Monotonic Alignment Search.

# Cython-version Monotonic Alignment Search
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace

  8. Edit configurations based on the files and cleaners you used.
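
Referring back to step 6, the filelist lines are plain pipe-separated fields, so they can be read with a simple split. The helper below is an illustrative sketch, not the repo's actual dataloader; the file path is hypothetical.

def read_multispeaker_filelist(path):
    # Each line has the form: wavfile_path|speaker_id|transcript
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            wav_path, speaker_id, text = line.rstrip("\n").split("|", maxsplit=2)
            entries.append((wav_path, int(speaker_id), text))
    return entries

# e.g. entries = read_multispeaker_filelist("filelists/train.txt")  # hypothetical path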

Setting json file in configs

Each model variant is selected by a few keys in its json config; the sample configuration file is listed in parentheses.

  • iSTFT-VITS2 (istft_vits2_base.json)
    "istft_vits": true,
    "upsample_rates": [8,8],

  • MB-iSTFT-VITS2 (mb_istft_vits2_base.json)
    "subbands": 4,
    "mb_istft_vits": true,
    "upsample_rates": [4,4],

  • MS-iSTFT-VITS2 (ms_istft_vits2_base.json)
    "subbands": 4,
    "ms_istft_vits": true,
    "upsample_rates": [4,4],

  • Mini-iSTFT-VITS2 (mini_istft_vits2_base.json)
    "istft_vits": true,
    "upsample_rates": [8,8],
    "hidden_channels": 96,
    "n_layers": 3,

  • Mini-MB-iSTFT-VITS2 (mini_mb_istft_vits2_base.json)
    "subbands": 4,
    "mb_istft_vits": true,
    "upsample_rates": [4,4],
    "hidden_channels": 96,
    "n_layers": 3,
    "upsample_initial_channel": 256,
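
For orientation, here is a minimal sketch of how the Mini-MB-iSTFT-VITS2 flags might sit in a config, assuming the usual VITS-style layout where these keys live under the top-level "model" section. Treat it as an illustration and check the sample json files above for the exact placement.

{
  "model": {
    "subbands": 4,
    "mb_istft_vits": true,
    "upsample_rates": [4, 4],
    "hidden_channels": 96,
    "n_layers": 3,
    "upsample_initial_channel": 256
  }
}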

Training Example

# train_ms.py for multi speaker
# train_l.py to use Lightning
python train_ms.py -c configs/shergin_d_vector_hfg.json -m models/test
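
The Lightning entry point is assumed here to take the same flags as train_ms.py; verify against train_l.py before relying on it.

# assumption: train_l.py accepts the same -c/-m flags as train_ms.py
python train_l.py -c configs/shergin_d_vector_hfg.json -m models/test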

Contact

If you have any questions about how to run it, contact us on Telegram:

https://t.me/voice_stuff_chat

Credits
