The goal of this repo is to implement the most comprehensive VITS2 out there.
- Bump Librosa and Python to the latest versions
- Implement d-vector instead of speaker id for external speaker encoder as in YourTTS.
- Implement a YourTTS-style d-vector-free text encoder and pass the d-vector as an input to the vocoder (currently only HiFiGAN does that)
- Implement a dataloader that loads d-vectors
- Add quantized Text Encoder. BERT -> bottleneck -> text features.
- VCTK audio loader
- Implement a better vocoder with support for d-vector
- Remove boilerplate code in attentions.py and replace it with the native torch.nn.TransformerEncoder
- Adan optimizer
- PyTorch Lightning support
- Add Bidirectional Flow Loss
- Python >= 3.8
- CUDA
- PyTorch version 1.13.1 (+cu117)
- Clone this repository
- Install Python requirements.
  pip install -r requirements.txt
  If you want to proceed with the cleaned texts in filelists, you need to install espeak.
  apt-get install espeak
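After installing, it can help to confirm that the interpreter and the PyTorch build match the prerequisites. The following is a minimal sketch, not part of the repository; the helper names `python_is_supported` and `describe_torch` are made up for illustration, and the minimum version is taken from the list above.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version from the prerequisite list


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum major.minor version."""
    return tuple(version_info[:2]) >= tuple(minimum)


def describe_torch():
    """Report the installed PyTorch build, or say that it is missing."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch  # imported lazily so the check also runs before installation

    return f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print(describe_torch())
```

If the second line reports that CUDA is unavailable, the installed PyTorch wheel is probably a CPU-only build or the driver is not visible to the process.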
- Prepare datasets & configuration
  - wav files (22050 Hz, mono, PCM-16)
  - Prepare text files: one for training and one for validation (the repository provides an example of each). Split your dataset between the two files. As the examples show, the validation file should contain fewer entries than the training file, and none of its entries should appear in the training file.
    - Single speaker
      wavfile_path|transcript
    - Multi speaker
      wavfile_path|speaker_id|transcript
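The dataset requirements above can be checked before training. This is a hedged sketch using only the standard library; the function names are hypothetical and are not taken from the repository. It parses the two filelist layouts, verifies the wav format, and reports wav paths shared between the training and validation files.

```python
import wave


def parse_filelist_line(line, multi_speaker=False):
    """Split one filelist line into its fields.

    Single speaker: wavfile_path|transcript
    Multi speaker:  wavfile_path|speaker_id|transcript
    """
    expected = 3 if multi_speaker else 2
    # Sketch choice: maxsplit keeps any further "|" inside the transcript.
    fields = line.rstrip("\n").split("|", expected - 1)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}: {line!r}")
    return fields


def wav_format_problems(path, sample_rate=22050):
    """Return a list of mismatches against 22050 Hz, mono, 16-bit PCM."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16 bit
            problems.append(f"{8 * wav.getsampwidth()}-bit samples")
    return problems


def shared_wav_paths(train_lines, val_lines, multi_speaker=False):
    """Return wav paths that occur in both splits (ideally an empty set)."""
    train = {parse_filelist_line(line, multi_speaker)[0] for line in train_lines}
    val = {parse_filelist_line(line, multi_speaker)[0] for line in val_lines}
    return train & val
```

An empty list from `wav_format_problems` and an empty set from `shared_wav_paths` mean the files satisfy the constraints listed above.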
- Run preprocessing with a cleaner of your choice. You may change the symbols as well.
  - Single speaker
    python preprocess.py --text_index 1 --filelists PATH_TO_train.txt --text_cleaners CLEANER_NAME
    python preprocess.py --text_index 1 --filelists PATH_TO_val.txt --text_cleaners CLEANER_NAME
  - Multi speaker
    python preprocess.py --text_index 2 --filelists PATH_TO_train.txt --text_cleaners CLEANER_NAME
    python preprocess.py --text_index 2 --filelists PATH_TO_val.txt --text_cleaners CLEANER_NAME
  The repository provides examples of the resulting cleaned text for both the single-speaker and the multi-speaker case.
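The `--text_index` values in the commands above match the zero-based position of the transcript in each filelist layout (1 for `wavfile_path|transcript`, 2 for `wavfile_path|speaker_id|transcript`). The sketch below illustrates that idea only; it is not the code in preprocess.py, and `example_cleaner` is a hypothetical stand-in for a real cleaner.

```python
import re


def example_cleaner(text):
    """Hypothetical cleaner: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def clean_filelist_lines(lines, text_index, cleaner=example_cleaner):
    """Apply `cleaner` to the field at `text_index` of each "|"-separated line."""
    cleaned = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        fields[text_index] = cleaner(fields[text_index])
        cleaned.append("|".join(fields))
    return cleaned


single = ["wavs/a.wav|Hello   World"]
multi = ["wavs/a.wav|7|Hello   World"]
print(clean_filelist_lines(single, text_index=1))  # ['wavs/a.wav|hello world']
print(clean_filelist_lines(multi, text_index=2))   # ['wavs/a.wav|7|hello world']
```

Passing the wrong index for a multi-speaker file would clean the speaker ID instead of the transcript, which is why the two command pairs differ.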
- Build Monotonic Alignment Search.
  # Cython version of Monotonic Alignment Search
  cd monotonic_align
  mkdir monotonic_align
  python setup.py build_ext --inplace
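For orientation, the extension built above computes a monotonic alignment between text tokens and audio frames by dynamic programming. The following pure-Python sketch shows the general technique as described for Glow-TTS and VITS; it is not the repository's Cython code, it is far slower, and the name `maximum_path` and the list-of-lists input are assumptions made for this illustration.

```python
NEG_INF = -1e9  # stands in for the score of an unreachable cell


def maximum_path(log_probs):
    """Return a 0/1 alignment matrix of shape [frames][tokens].

    log_probs[y][x] is the score of assigning frame y to token x. The path
    starts at (0, 0), ends at (last frame, last token), and at each frame
    either stays on the same token or advances by exactly one token.
    """
    t_y, t_x = len(log_probs), len(log_probs[0])
    if t_y < t_x:
        raise ValueError("need at least as many frames as tokens")
    # best[y][x]: best cumulative score of any valid path ending at (y, x)
    best = [[NEG_INF] * t_x for _ in range(t_y)]
    best[0][0] = log_probs[0][0]
    for y in range(1, t_y):
        for x in range(min(t_x, y + 1)):  # token x cannot be reached before frame x
            stay = best[y - 1][x]
            advance = best[y - 1][x - 1] if x > 0 else NEG_INF
            best[y][x] = log_probs[y][x] + max(stay, advance)
    # Backtrack from the last token at the last frame.
    path = [[0] * t_x for _ in range(t_y)]
    x = t_x - 1
    for y in range(t_y - 1, -1, -1):
        path[y][x] = 1
        # Step back one token when forced (x == y) or when that scores higher.
        if x > 0 and (x == y or best[y - 1][x] < best[y - 1][x - 1]):
            x -= 1
    return path


scores = [[0, -5], [0, -5], [-5, 0], [-5, 0]]  # 4 frames, 2 tokens
print(maximum_path(scores))  # [[1, 0], [1, 0], [0, 1], [0, 1]]
```

Each row of the result contains exactly one 1, so summing a column gives the number of frames assigned to that token.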
- Edit configurations based on files and cleaners you used.
Setting json file in configs
Model | How to set up json file in configs | Sample of json file configuration |
---|---|---|
iSTFT-VITS2 | "istft_vits": true, "upsample_rates": [8,8], | istft_vits2_base.json |
MB-iSTFT-VITS2 | "subbands": 4, "mb_istft_vits": true, "upsample_rates": [4,4], | mb_istft_vits2_base.json |
MS-iSTFT-VITS2 | "subbands": 4, "ms_istft_vits": true, "upsample_rates": [4,4], | ms_istft_vits2_base.json |
Mini-iSTFT-VITS2 | "istft_vits": true, "upsample_rates": [8,8], "hidden_channels": 96, "n_layers": 3, | mini_istft_vits2_base.json |
Mini-MB-iSTFT-VITS2 | "subbands": 4, "mb_istft_vits": true, "upsample_rates": [4,4], "hidden_channels": 96, "n_layers": 3, "upsample_initial_channel": 256, | mini_mb_istft_vits2_base.json |
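The settings in the table above can also be applied programmatically. This is a sketch under stated assumptions: the override values are copied from the table, but the helper `apply_variant`, the placement of the keys under a "model" section, and the base values in the usage example are hypothetical. Check the sample JSON files for the actual layout.

```python
import copy
import json

# Overrides copied from the table above, keyed by model name.
VARIANT_OVERRIDES = {
    "iSTFT-VITS2": {"istft_vits": True, "upsample_rates": [8, 8]},
    "MB-iSTFT-VITS2": {"subbands": 4, "mb_istft_vits": True, "upsample_rates": [4, 4]},
    "MS-iSTFT-VITS2": {"subbands": 4, "ms_istft_vits": True, "upsample_rates": [4, 4]},
    "Mini-iSTFT-VITS2": {
        "istft_vits": True, "upsample_rates": [8, 8],
        "hidden_channels": 96, "n_layers": 3,
    },
    "Mini-MB-iSTFT-VITS2": {
        "subbands": 4, "mb_istft_vits": True, "upsample_rates": [4, 4],
        "hidden_channels": 96, "n_layers": 3, "upsample_initial_channel": 256,
    },
}


def apply_variant(config, variant, section="model"):
    """Return a copy of `config` with the variant's keys set in `section`.

    Assumption: the keys live under a "model" section of the JSON file.
    """
    updated = copy.deepcopy(config)
    updated.setdefault(section, {}).update(VARIANT_OVERRIDES[variant])
    return updated


base = {"model": {"hidden_channels": 192, "n_layers": 6}}  # hypothetical base values
print(json.dumps(apply_variant(base, "Mini-iSTFT-VITS2"), indent=2))
```

Because `apply_variant` works on a deep copy, the base configuration stays unchanged and can be reused to derive several variants.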
# train_ms.py for multi speaker
# train_l.py to use Lightning
python train_ms.py -c configs/shergin_d_vector_hfg.json -m models/test
If you have any questions about how to run it, contact us on Telegram.