Can you share your experience tuning the multipliers c_kl_fwd and c_e2e? #12
Comments
Hello, @feng-yufei. I have also observed that larger values of c_kl_fwd and c_e2e hurt model training, which is why they are set so low. Note that these values may not be optimal, since the paper doesn't mention the multipliers used for the loss terms.

Regarding c_kl_fwd, I haven't rigorously compared c_kl_fwd=0 and c_kl_fwd=0.001 in terms of audio quality. However, even a very small c_kl_fwd significantly lowered the loss_fwd term as training progressed. This indicates that the term does change the distribution of the enhanced prior or the posterior in some way, possibly for the better (reducing the training-inference mismatch), as stated in the paper.

As for c_e2e, the authors didn't conduct an ablation study on it, and I didn't notice any improvement from using it. It might help if we use a tuning stage (the last 2k epochs), but high values of c_e2e (such as 1.0) certainly ruin training, so I think it is safe to set it low enough.

I hope this helps. Please let me know if you have any further questions or concerns. Thank you.
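To make the role of the two multipliers concrete, here is a minimal sketch of how they could enter the total loss. It is not taken from the repository: `total_loss`, `e2e_weight`, and the key names `loss_mel`, `loss_kl`, and `loss_e2e` are hypothetical, and only `c_kl_fwd`, `c_e2e`, `loss_fwd`, and the default values 0.001 and 0.1 come from this thread. The tuning-stage schedule is just one possible reading of "use c_e2e only in the last 2k epochs".

```python
def e2e_weight(epoch, total_epochs, tuning_epochs=2000, c_e2e=0.1):
    """Hypothetical schedule: apply c_e2e only in the final tuning stage."""
    # Before the last `tuning_epochs` epochs, the end-to-end term is switched off.
    return c_e2e if epoch >= total_epochs - tuning_epochs else 0.0


def total_loss(losses, c_kl_fwd=0.001, c_e2e=0.1):
    """Weighted sum of per-term losses (plain floats keyed by term name).

    Assumption: every term other than loss_fwd and loss_e2e has weight 1.0.
    """
    weights = {"loss_fwd": c_kl_fwd, "loss_e2e": c_e2e}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Example with made-up loss values; loss_fwd is large in raw terms,
# but its contribution stays small because c_kl_fwd is small.
losses = {"loss_mel": 2.0, "loss_kl": 1.0, "loss_fwd": 100.0, "loss_e2e": 5.0}
print(total_loss(losses))
print(total_loss(losses, c_e2e=e2e_weight(epoch=0, total_epochs=10000)))
```

With weights this small, a raw loss_fwd that is two orders of magnitude larger than the other terms still contributes about as much as they do, which may be why larger multipliers destabilize training.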
Thanks for your reply. I will post my further findings, and hopefully some quality comparisons, once several experiments have finished.
@feng-yufei Hey, any results to share? It feels like c_kl_fwd does indeed have to be small.
Hello Heatz123,
I have recently been experimenting with adding the bi-directional prior/posterior loss, and I found that during fine-tuning (after the warm-up) this additional loss, when given a larger multiplier, ruins the trained VITS model. I saw that you mention the multipliers in the README and set c_kl_fwd = 0.001, which is very small, as well as c_e2e = 0.1, so I would like to check what results you saw in your own experiments.
I did not find any hint about these parameters in the original NaturalSpeech paper, so I guess you tuned them based on your experiments. Did you also observe that a larger c_kl_fwd or c_e2e ruins the model? And have you compared, or do you have a view on, whether a very small c_kl_fwd affects inference quality?
Thanks