This repository provides Kaldi-style recipes, the same as ESPnet.
Currently, the following recipes are supported.
- LJSpeech: English female speaker
- JSUT: Japanese female speaker
- JSSS: Japanese female speaker
- CSMSC: Mandarin female speaker
- CMU Arctic: English speakers
- JNAS: Japanese multi-speaker
- VCTK: English multi-speaker
- LibriTTS: English multi-speaker
- YesNo: Hebrew speaker (For debugging)
- KSS: Single Korean female speaker
- Oniku_kurumi_utagoe_db: Single Japanese female singer (singing voice)
- Kiritan: Single Japanese female singer (singing voice)
- Ofuton_p_utagoe_db: Single Japanese female singer (singing voice)
- Opencpop: Single Mandarin female singer (singing voice)
- CSD: Single Korean/English female singer (singing voice)
- KiSing: Single Mandarin female singer (singing voice)
```sh
# Let us move on to the recipe directory
$ cd egs/ljspeech/voc1

# Run the recipe from scratch
$ ./run.sh

# You can change the config via the command line
$ ./run.sh --conf <your_customized_yaml_config>

# You can select the stage to start from and the stage to stop at
$ ./run.sh --stage 2 --stop_stage 2

# If you want to specify the GPU
$ CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2

# If you want to resume training from the 10000-step checkpoint
$ ./run.sh --stage 2 --resume <path>/<to>/checkpoint-10000steps.pkl
```
You can check the command line options in run.sh.

The integration with job schedulers such as Slurm can be done via cmd.sh and conf/slurm.conf. If you want to use it, please check this page.
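For reference, switching the backend usually amounts to editing a few exported variables in cmd.sh. The sketch below uses the standard Kaldi variable names (train_cmd, cuda_cmd); they are an assumption about this repository's cmd.sh, so check your own copy for the exact names.

```sh
# Minimal sketch of a Kaldi-style cmd.sh (variable names are assumptions, not verified).

# Default: run everything on the local machine.
export train_cmd="run.pl"
export cuda_cmd="run.pl --gpu 1"

# Slurm: submit jobs through slurm.pl with the options in conf/slurm.conf.
# export train_cmd="slurm.pl --config conf/slurm.conf"
# export cuda_cmd="slurm.pl --config conf/slurm.conf --gpu 1"
```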
All of the hyperparameters are written in a single YAML-format configuration file.
Please check this example in the ljspeech recipe.
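For example, rather than editing the template in place, you can copy it and point run.sh at your copy. The keys mentioned in the comment below (batch_size, train_max_steps) are only illustrative; double-check the actual key names in your template.

```sh
# Copy the template and edit your own copy so the original stays intact
$ cp conf/parallel_wavegan.v1.yaml conf/my_custom.v1.yaml
$ vim conf/my_custom.v1.yaml   # e.g. tweak batch_size, train_max_steps (key names may differ)

# Then pass the new config to the recipe
$ ./run.sh --conf conf/my_custom.v1.yaml
```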
You can monitor the training progress via tensorboard.
```sh
$ tensorboard --logdir exp
```
If you want to accelerate training, you can try distributed multi-GPU training based on apex.
You need to install apex for distributed training, so please make sure it is already installed.
Then you can run distributed multi-GPU training with the following command:
```sh
# In the case of 8 GPUs
$ CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" ./run.sh --stage 2 --n_gpus 8
```
In the case of distributed training, the batch size will be automatically multiplied by the number of GPUs, so please be careful.
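For example, if your config sets a batch size of 8, running with 8 GPUs gives an effective batch size of 64; to keep the original effective batch size you would reduce the per-GPU batch size in your config before launching. A minimal sketch, assuming a batch_size key as in the template configs and a hypothetical adjusted config file:

```sh
# Hypothetical numbers: batch_size: 8 in the config, 8 GPUs
# -> effective batch size = 8 x 8 = 64.
# To keep the effective batch size at 8, set batch_size: 1 in your copy of the
# config (key name assumed from the template) and then launch:
$ CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" ./run.sh --stage 2 --n_gpus 8 --conf conf/<your_adjusted_config>.yaml
```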
Here, we show how to make a recipe for your own dataset.
- Set up your dataset to have the following structure.

  ```sh
  # For single-speaker case
  $ tree /path/to/database
  /path/to/database
  ├── utt_1.wav
  ├── utt_2.wav
  │   ...
  └── utt_N.wav
  # The directory can be nested, but each filename must be unique

  # For multi-speaker case
  $ tree /path/to/database
  /path/to/database
  ├── spk_1
  │   ├── utt1.wav
  ├── spk_2
  │   ├── utt1.wav
  │   ...
  └── spk_N
      ├── utt1.wav
      ...
  # The directory under each speaker can be nested, but each filename in each speaker directory must be unique
  ```
- Copy the template directory.

  ```sh
  cd egs

  # For single-speaker case
  cp -r template_single_spk <your_dataset_name>

  # For multi-speaker case
  cp -r template_multi_spk <your_dataset_name>

  # Move on to your recipe
  cd egs/<your_dataset_name>/voc1
  ```
- Modify the options in run.sh.
  What you need to change at least in run.sh is as follows (illustrative values are sketched in the example after this list):
  - db_root: Root path of the database.
  - num_dev: The number of utterances for the development set.
  - num_eval: The number of utterances for the evaluation set.
- Modify the hyperparameters in conf/parallel_wavegan.v1.yaml.
  What you need to change at least in the config is as follows:
  - sampling_rate: If you specify a lower sampling rate, the audio will be downsampled by sox, as sketched in the example after this list.
- (Optional) Change the command backend in cmd.sh.
  If you are not familiar with Kaldi and run in your local environment, you do not need to change it.
  See https://kaldi-asr.org/doc/queue.html for more information.
- Run your recipe.

  ```sh
  # Run all stages from the first stage
  ./run.sh

  # If you want to specify the CUDA device
  CUDA_VISIBLE_DEVICES=0 ./run.sh
  ```
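As a concrete illustration of the run.sh and config edits from the steps above, here is a minimal sketch. All values (the database path, 100 utterances, 16000 Hz) are placeholders for illustration, not recommendations.

```sh
# Example settings inside egs/<your_dataset_name>/voc1/run.sh (illustrative values):
#   db_root=/path/to/database   # root path of the database
#   num_dev=100                 # number of utterances for the development set
#   num_eval=100                # number of utterances for the evaluation set

# Check the original sampling rate of your audio with sox
soxi -r /path/to/database/utt_1.wav

# If you want a lower rate, set sampling_rate in the config
# (assuming the key sits at the top level as in the template)
sed -i 's/^sampling_rate:.*/sampling_rate: 16000/' conf/parallel_wavegan.v1.yaml
```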
If you want to try other advanced models, please check the config files in egs/ljspeech/voc1/conf.
Make sure you have already finished the ESPnet2-TTS recipe experiments (at least started the training).
```sh
cd egs

# Please use the single-speaker template for both single- and multi-speaker cases
cp -r template_single_spk <recipe_name>

# Move on to your recipe
cd egs/<recipe_name>/voc1

# Make symlinks of the data directories (better to use absolute paths)
mkdir dump data
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw dump/
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/tr_no_dev data/train_nodev
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/dev data/dev
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/eval1 data/eval

# Edit the config to match the TTS model setting
vim conf/parallel_wavegan.v1.yaml

# Run from stage 1
./run.sh --stage 1 --conf conf/parallel_wavegan.v1.yaml
```
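When editing the config to match the TTS model setting, the feature extraction parameters must agree with what the ESPnet2 recipe used. A quick way to list the relevant keys; the key names below follow the parallel_wavegan.v1.yaml template and are assumptions here, so double-check your own config and TTS setup:

```sh
# Show the feature-related settings that should match the TTS recipe
grep -E "sampling_rate|fft_size|hop_size|win_length|num_mels|fmin|fmax" conf/parallel_wavegan.v1.yaml
```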
That's it!
Here, we assume that you have already finished training the text2mel model with the ESPnet2 recipe and the vocoder with this repository. First, run teacher-forcing decoding for the train and dev sets in the ESPnet2 recipe.
```sh
cd espnet/egs2/your_recipe/tts1
./run.sh \
    --ngpu 1 \
    --stage 7 \
    --test_sets "tr_no_dev dev" \
    --inference_args "--use_teacher_forcing true"
```
Then move on to the vocoder recipe and run fine-tuning:
```sh
cd parallel_wavegan/egs/your_recipe/voc1

# Make symlinks to the text2mel model and dump directories
ln -s /path/to/espnet/egs2/your_recipe/tts1/dump/raw dump
ln -s /path/to/espnet/egs2/your_recipe/tts1/exp/your_text2mel_model_dir exp/

# Make a config for fine-tuning, e.g.
vim conf/hifigan_melgan.v1.finetune.yaml

# Run fine-tuning
. ./path.sh
parallel-wavegan-train \
    --config conf/hifigan_melgan.v1.finetune.yaml \
    --train-wav-scp dump/raw/tr_no_dev/wav.scp \
    --train-feats-scp exp/your_text2mel_model_dir/decode_use_teacher_forcingtrue_train.loss.ave/tr_no_dev/norm/feats.scp \
    --dev-wav-scp dump/raw/dev/wav.scp \
    --dev-feats-scp exp/your_text2mel_model_dir/decode_use_teacher_forcingtrue_train.loss.ave/dev/norm/feats.scp \
    --outdir exp/your_finetuned_vocoder_outdir \
    --pretrain /path/to/vocoder_checkpoint.pkl \
    --verbose 1
```