Training vs tensorboard metrics #211
We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.
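A quick way to inspect the g/total curve outside the TensorBoard UI is to read the event files directly with TensorBoard's EventAccumulator. This is a minimal sketch, not MeloTTS-specific: the `logs/my_run` path is a placeholder, and the exact scalar tag (often `loss/g/total` in VITS-style trainers) may differ, so list the available tags first.

```python
# Minimal sketch: read the generator loss scalar from TensorBoard event files
# and print a windowed average to see whether it is still trending down.
# The log directory and tag name are assumptions; check ea.Tags() for yours.
from tensorboard.backend.event_processing import event_accumulator

ea = event_accumulator.EventAccumulator("logs/my_run")  # placeholder log dir
ea.Reload()
print("available scalar tags:", ea.Tags()["scalars"])

events = ea.Scalars("loss/g/total")  # adjust to the tag printed above
values = [e.value for e in events]
window = 50
for i in range(window, len(values) + 1, window):
    avg = sum(values[i - window:i]) / window
    print(f"step {events[i - 1].step}: mean over last {window} points = {avg:.4f}")
```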
@jeremy110 Thank you for your response! I’m honestly a bit hooked on watching the progress as it keeps going down, so I can’t seem to stop checking in :-) Currently at 68 hours. I’m planning to create an 8-10 hour audio dataset for the next training session. Could you suggest what kind of text data I should gather for it? So far, I’ve used random articles and some ChatGPT-generated data, but I’ve heard that people sometimes read books, for example. Is there perhaps a dataset available with quality English sentences that covers a variety of language phenomena? I tried to find one, but without any luck.
@smlkdev I haven’t specifically researched text types. My own dataset was professionally recorded, with sentences that resemble reading books. I’m not very familiar with English datasets—are you planning to train in English?
This is my first attempt at ML/training/voice cloning, and I decided to use English. I briefly read the Thai thread, and it was way too complex for me to start with. Your training was 32 hours long, and to me (I'm not an expert) the inferred voice matched the original :) That's really nice. Is that the voice that had 8-10 hours of audio, as you mentioned earlier?
Yes, that's correct. I tried both single-speaker and multi-speaker models, and the total duration is around 8-10 hours. If this is your first time getting into it, I recommend you try F5-TTS. There are a lot of people in the forums who have trained their own models, and some even wrote a Gradio interface, which is very convenient. |
@jeremy110 Thank you for your responses. Is F5-TTS better than MeloTTS in terms of quality? I just realized that my cloned MeloTTS voice doesn’t add breaks between sentences. I have to add them manually: splitting the text into sentences, breaking it down into smaller parts, generating each part, and then merging everything back together after adding pauses. This can be automated, of course, but it's still a bit of work. (I was focusing on single sentences before, and I liked the quality.)
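For what it's worth, the split-generate-merge step described above is easy to script. A minimal sketch, assuming a hypothetical `synthesize()` wrapper around your trained model that returns mono float32 samples; the sample rate and pause length are placeholders to adjust:

```python
# Minimal sketch: split text into sentences, synthesize each one, and join
# the clips with a fixed silence gap. synthesize() is a hypothetical wrapper
# around your MeloTTS inference call; plug in your own model there.
import re
import numpy as np
import soundfile as sf

SAMPLE_RATE = 44100   # must match your model's output rate (assumption)
PAUSE_SEC = 0.4       # silence inserted between sentences (assumption)

def synthesize(sentence: str) -> np.ndarray:
    """Placeholder: run your trained model on one sentence, return samples."""
    raise NotImplementedError

def tts_with_pauses(text: str, out_path: str) -> None:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    pause = np.zeros(int(SAMPLE_RATE * PAUSE_SEC), dtype=np.float32)
    clips = []
    for sentence in sentences:
        clips.append(synthesize(sentence))
        clips.append(pause)
    sf.write(out_path, np.concatenate(clips), SAMPLE_RATE)

# tts_with_pauses("First sentence. Second one! A third?", "out.wav")
```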
In terms of quality, I think F5-TTS is quite good. You can try it out on the Huggingface demo. The pauses within sentences mainly depend on your commas (","). The program adds a space after punctuation to create a pause. However, if the audio files you trained on have very little silence before and after the speech, the generated audio will also have little silence. Of course, you can add the pauses manually, but you could also address it by adjusting the training data. |
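If you want to see how much leading/trailing silence your training clips actually have (which, per the comment above, influences how much silence the model generates), here is a minimal sketch using librosa; the `data/wavs/` path and the 40 dB trim threshold are assumptions:

```python
# Minimal sketch: report clips with very little leading/trailing silence,
# so you can decide whether to pad the training data.
import glob
import librosa

for path in glob.glob("data/wavs/*.wav"):   # placeholder dataset path
    y, sr = librosa.load(path, sr=None)
    _, (start, end) = librosa.effects.trim(y, top_db=40)
    lead = start / sr
    tail = (len(y) - end) / sr
    if lead < 0.05 or tail < 0.05:
        print(f"{path}: leading {lead:.3f}s, trailing {tail:.3f}s of silence")
```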
Will my training yield better results over time? Currently, the training took about 9 hours.
I have 1500 wav samples, with a total audio length of approximately 2 hours.
What other metrics should I pay attention to in TensorBoard?
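For a quick sanity check on the numbers above (clip count and total hours), a minimal sketch, assuming the wav files live under a placeholder `data/wavs/` directory:

```python
# Minimal sketch: count training clips and sum their duration.
import glob
import soundfile as sf

paths = glob.glob("data/wavs/*.wav")   # placeholder dataset path
total_sec = sum(sf.info(p).duration for p in paths)
print(f"{len(paths)} clips, {total_sec / 3600:.2f} hours in total")
```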