This is probably the current SOTA model, much better than F5-TTS. They just haven't posted anything promotional on Reddit or anywhere else, so it isn't well known yet. I'm not that good with technical stuff, but here is what I can tell from my tests:
Pros:
-Is non-autoregressive like F5 (though it uses a masked generative codec transformer rather than F5's flow matching), and is way better, probably because it's a much bigger model (needs 12+ GB VRAM).
-Outputs clearer, higher-quality voices for every reference voice I tried.
-Supports longer reference audio (I tried clips up to 5 minutes and it worked fine and fast).
-Supports multiple languages, including English, Chinese, Japanese, German, French, and Korean. Languages other than English and Chinese are undertrained, since it was trained on the Emilia dataset, but they still sound great.
-Conveys the emotion of the text better: it doesn't just copy the emotion of the reference voice, so the result sounds more natural.
-More robust and can handle tough tongue twisters without errors.
-Can clone harder voices like whispers, which F5 couldn't do. CosyVoice could do this too, but it's slower and lower quality.
Cons:
-Super hard to get working on Win 11 (required help from other users to make it work).
-Still a bit wonky on Win 11, with lower quality outputs compared to the demo page.
-Struggles with predicting duration for non-English languages.
-Generally a bit worse at non-English languages on my local version.
-Can't replicate the demo page examples. For example, with the whisper voice it outputs something between a whisper and a low voice.
Despite all these cons, it's still better than F5 even on my Win 11 setup. I couldn't figure out what the problem is, but maybe you can get it working better :)
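On the duration problem for non-English text, one possible workaround is to estimate the output length yourself instead of relying on the model's prediction. This is only a rough sketch under assumptions: it assumes the inference script accepts an explicit target duration (check the README, I have not confirmed the parameter name), and it assumes the reference transcript and the target text are in the same language. The function names `count_spoken_units` and `estimate_target_seconds` are made up for illustration and are not part of MaskGCT.

```python
import unicodedata


def count_spoken_units(text: str) -> int:
    """Count letters and digits, ignoring spaces and punctuation."""
    return sum(1 for ch in text if unicodedata.category(ch)[0] in ("L", "N"))


def estimate_target_seconds(ref_seconds: float, ref_text: str,
                            target_text: str, min_seconds: float = 1.0) -> float:
    """Scale the reference clip's speaking rate to the target text.

    Hypothetical helper: it assumes both texts are in the same language,
    so that characters per second is a comparable speaking-rate measure.
    """
    ref_units = count_spoken_units(ref_text)
    if ref_seconds <= 0 or ref_units == 0:
        raise ValueError("reference needs a positive duration and a non-empty transcript")
    units_per_second = ref_units / ref_seconds
    # Never return less than min_seconds, so very short targets stay usable.
    return max(min_seconds, count_spoken_units(target_text) / units_per_second)


if __name__ == "__main__":
    # A 4-second reference clip, and a target text twice as long.
    print(estimate_target_seconds(4.0, "hello world", "hello world hello world"))
```

Counting characters is crude (it ignores pauses and differences between scripts), so treat the result as a starting point and adjust by ear.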
I see a lot of things, but I usually set aside time each day to respond to comments, so that's it. As long as it's not true spam, that's OK, and this is not true spam lol
GitHub:
https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct/README.md
Demo Page:
https://maskgct.github.io/