Feature request: MaskGCT-TTS #75

Open
GalenMarek14 opened this issue Nov 7, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@GalenMarek14

GitHub:
https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct/README.md
Demo Page:
https://maskgct.github.io/

This is probably the current SOTA model, much better than F5-TTS. The authors just haven't promoted it on Reddit or anywhere else, so it isn't well known yet. I'm not that good with technical stuff, but here's what I gathered from my tests:

Pros:
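-Is a non-autoregressive model like F5, though not the same architecture (MaskGCT predicts masked codec tokens, F5 uses flow matching), and is way better since it's a much bigger model (needs 12+ GB VRAM).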
-Outputs clearer and higher quality voices for every reference voice.
-Supports longer reference voices (I tried up to 5 minutes and it worked fine and fast).
-Supports multiple languages, including English, Chinese, Japanese, German, French, and Korean. Languages other than English and Chinese are undertrained, since it was trained on the Emilia dataset, but they still sound great.
-Conveys the emotion of the text better: it doesn't just copy the emotion of the reference voice, so the output sounds more natural.
-More robust and can handle tough tongue twisters without errors.
-Can clone harder voices like whispers, which F5 couldn't do. CosyVoice could do this too, but it's slower and lower quality.

Cons:
-Super hard to get working on Win 11 (required help from other users to make it work).
-Still a bit wonky on Win 11, with lower quality outputs compared to the demo page.
-Struggles with predicting duration for non-English languages.
-Generally a bit worse at non-English languages on my local version.
-Can't replicate the demo page examples. With the whisper voice, for example, it outputs something between a whisper and a low voice.

Despite all these cons, it's still better than F5 even on my Win 11 setup. I couldn't figure out what the problem is, but maybe you can get it working better :)

@JarodMica
Owner

I am super aware of this; some 👀 people keep posting YouTube comments about it. I'll take a look at it sometime next week as I get time.

@JarodMica added the enhancement (New feature or request) label Nov 7, 2024
@GalenMarek14
Author

Oh, I posted there too, sorry if I spammed a bit. I thought you didn't see them because you were so busy.

@JarodMica
Owner

I see a lot of things, but I usually set aside time on a given day to respond to comments, so that's all it is. As long as it's not true spam, that's ok, and this is not true spam lol
