This is another PyTorch implementation of Tacotron2 MMI, heavily based on bfs18's code.
- I decided to implement this to address the robustness problems and slow training of NVIDIA/tacotron2. While searching for issues that deal with them, I found bfs18's Tacotron2 MMI through issue #280, which discusses the effectiveness of reduction windows in Tacotron frameworks.
- bfs18's Tacotron2 MMI makes two main contributions: drop frame rate and a CTC-loss-based MMI term that maximizes "the dependency between the autoregressive module and the condition module". However, as reported in the follow-up issue, the MMI term seems somewhat unstable, so I did not use it in training and set `use_mmi==False`.
- Instead, I applied two things to get robust alignments:
  - `n_frames_per_step>1` mode (not supported in NVIDIA/tacotron2)
    - I only tried `n_frames_per_step==2`, but it should work for any number greater than 2.
  - espnet's implementation of the diagonal guided attention loss
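To make the idea behind the diagonal guided attention loss concrete, here is a minimal, framework-free sketch. It is not espnet's code: the function names and the `sigma=0.4` default are assumptions for illustration, and a real PyTorch version would be vectorized and would mask padded positions in a batch.

```python
import math


def guided_attention_weights(text_len, mel_len, sigma=0.4):
    """Penalty matrix W[t][n] = 1 - exp(-((n/N - t/T)^2) / (2 * sigma^2)).

    W is close to 0 near the diagonal and grows towards 1 away from it.
    sigma is a hypothetical default chosen for this sketch.
    """
    return [
        [
            1.0 - math.exp(-((n / text_len - t / mel_len) ** 2) / (2 * sigma ** 2))
            for n in range(text_len)
        ]
        for t in range(mel_len)
    ]


def guided_attention_loss(attention, sigma=0.4):
    """Mean of attention * penalty over all (decoder step, text position) pairs.

    attention is a list of rows, one row per decoder step.
    """
    mel_len, text_len = len(attention), len(attention[0])
    weights = guided_attention_weights(text_len, mel_len, sigma)
    total = sum(
        attention[t][n] * weights[t][n]
        for t in range(mel_len)
        for n in range(text_len)
    )
    return total / (mel_len * text_len)


# A perfectly diagonal alignment is not penalized; a reversed one is.
diagonal = [[1.0 if n == t else 0.0 for n in range(4)] for t in range(4)]
anti_diagonal = [[1.0 if n == 3 - t else 0.0 for n in range(4)] for t in range(4)]
print(guided_attention_loss(diagonal), guided_attention_loss(anti_diagonal))
```

The loss is added to the usual reconstruction losses, so attention mass far from the diagonal is discouraged early in training.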
- As a result, alignments are learned more than 3 times faster than with NVIDIA/tacotron2 on the Blizzard Challenge 2013 dataset.
- However, the overall quality of the synthesized speech is poor even with excellent alignments, due to the regularization effects of both the drop frame rate and the reduction windows. I trained for `~130k` steps, but the validation loss only reached `0.3621`. This is significantly slower than NVIDIA/tacotron2 with a warm-start model. It may converge with more training, but I am not going any further with my current implementation since I don't want to spend too much time on training.
- Feel free to use my code; I hope to see an exceptional improvement from you. Any suggestions are appreciated.
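The drop frame rate mentioned above can be pictured as follows. This is a rough sketch of the general idea, assuming that each teacher-forced input frame is replaced by a global mean frame with probability `drop_frame_rate`; the function and argument names are hypothetical, and it is not a copy of bfs18's implementation.

```python
import random


def drop_frames(mel_frames, global_mean, drop_frame_rate, rng=random):
    """Replace each frame by the global mean frame with probability drop_frame_rate.

    mel_frames: list of frames, each a list of mel values (teacher-forcing input).
    global_mean: one frame holding the per-channel mean over the training set.
    """
    return [
        list(global_mean) if rng.random() < drop_frame_rate else list(frame)
        for frame in mel_frames
    ]


frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mean_frame = [0.0, 0.0]
print(drop_frames(frames, mean_frame, drop_frame_rate=0.2, rng=random.Random(0)))
```

Because some ground-truth frames are hidden from the decoder, it has to rely more on the text condition, which is also why this acts as a regularizer.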
- NVIDIA GPU + CUDA cuDNN
- Download and extract the Blizzard Challenge 2013 dataset
- Follow the remaining steps as in NVIDIA/tacotron2
python train.py --output_directory=outdir --log_directory=logdir
- (OPTIONAL) `tensorboard --logdir=outdir/logdir`
- Single sample: `python inference.py -c checkpoint/path -r reference_audio/wav/path -t "synthesize text"`
- Multi samples: `python inference_all.py -c checkpoint/path -r reference_audios/dir/path`
N.b. When performing Mel-Spectrogram to Audio synthesis, make sure Tacotron 2 and the Mel decoder were trained on the same mel-spectrogram representation.
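One way to catch a mismatch early is to compare the mel-related settings of both models before synthesis. The sketch below assumes both configs are available as plain dictionaries; the key names and the example values are assumptions for illustration, so adjust them to your own configs.

```python
# Hypothetical list of settings that define the mel-spectrogram representation.
MEL_KEYS = (
    "sampling_rate",
    "filter_length",
    "hop_length",
    "win_length",
    "n_mel_channels",
    "mel_fmin",
    "mel_fmax",
)


def mel_config_mismatches(acoustic_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return {key: (acoustic value, vocoder value)} for every setting that differs."""
    return {
        key: (acoustic_cfg.get(key), vocoder_cfg.get(key))
        for key in keys
        if acoustic_cfg.get(key) != vocoder_cfg.get(key)
    }


# Illustrative values only.
tacotron_cfg = {"sampling_rate": 22050, "hop_length": 256, "n_mel_channels": 80}
vocoder_cfg = {"sampling_rate": 22050, "hop_length": 275, "n_mel_channels": 80}
print(mel_config_mismatches(tacotron_cfg, vocoder_cfg))
```

An empty result means the compared settings agree; any entry points at a parameter to fix before running inference.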
- Not supported in current implementation.
- You may remove `mel_layer` in the decoder to lower the training loss. It does not exist in NVIDIA/tacotron2, only in bfs18's code.
- In my experiments, there was no big difference between using drop frame rate and reduction windows as described in issue #280, especially in terms of learning alignments. However, the training and validation loss curves differ. Specifically, reduction windows show a larger validation loss than drop frame rate at the same training step. Also, training time is reduced by almost half when using reduction windows.
- I found another implementation from BogiHsu that also has an `n_frames_per_step>1` mode. The main difference is how the length of the gate mask is handled. You may try this too.
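The effect of reduction windows on training time follows from the number of decoder steps: with `n_frames_per_step=2`, the decoder predicts two frames per step and therefore runs about half as many steps. Below is a minimal sketch of the frame grouping, assuming zero padding up to a multiple of the window size; `group_frames` and `pad_value` are hypothetical names, and the handling of the gate (stop) target, which is where implementations differ, is left out.

```python
def group_frames(mel_frames, n_frames_per_step, pad_value=0.0):
    """Concatenate every n_frames_per_step consecutive frames into one decoder target.

    mel_frames: list of frames, each a list of n_mels values.
    The sequence is padded with pad_value frames up to a multiple of n_frames_per_step.
    """
    n_mels = len(mel_frames[0])
    padded = [list(frame) for frame in mel_frames]
    remainder = len(padded) % n_frames_per_step
    if remainder:
        padded += [[pad_value] * n_mels for _ in range(n_frames_per_step - remainder)]
    return [
        [value for frame in padded[start:start + n_frames_per_step] for value in frame]
        for start in range(0, len(padded), n_frames_per_step)
    ]


# 5 frames with 2 mel channels each, grouped two at a time.
frames = [[float(t), float(t) + 0.5] for t in range(5)]
grouped = group_frames(frames, n_frames_per_step=2)
print(len(frames), "frames ->", len(grouped), "decoder steps")
```

Each decoder step now has to output `n_frames_per_step * n_mels` values, so the output projection of the decoder must be sized accordingly.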
@misc{lee2021tacotron2_mmi,
author = {Lee, Keon},
title = {tacotron2_MMI},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/keonlee9420/tacotron2_MMI}}
}
- WaveGlow: Faster than real time Flow-based Generative Network for Speech Synthesis