
Training vs tensorboard metrics #211

Open
smlkdev opened this issue Nov 8, 2024 · 8 comments


smlkdev commented Nov 8, 2024

Will my training yield better results over time? So far, training has taken about 9 hours.
I have 1500 wav samples, with a total audio length of approximately 2 hours.

[Screenshot: TensorBoard loss curves, 2024-11-08]

What other metrics should I pay attention to in TensorBoard?

smlkdev commented Nov 9, 2024

Update after ~34h:
A little improvement is visible, but I'm not sure whether I should keep training longer, given that the curves are flattening.

[Screenshots: TensorBoard loss curves after ~34h, 2024-11-09]

@jeremy110

We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.


smlkdev commented Nov 10, 2024

> We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.

@jeremy110 Thank you for your response! I’m honestly a bit hooked on watching the progress as it keeps going down, so I can’t seem to stop checking in :-)

Currently at 68 hours.

[Screenshot: TensorBoard loss curves at ~68h, 2024-11-10]

I’m planning to create an 8-10 hour audio dataset for the next training session. Could you suggest what kind of text data I should gather for it? So far, I’ve used random articles and some ChatGPT-generated data, but I’ve heard that people sometimes read books, for example. Is there perhaps a dataset of quality English sentences available that covers a variety of language phenomena? I tried searching but found nothing.

@jeremy110

@smlkdev
Basically, this training can be kept short since it’s just a fine-tuning session; no need to make it too long. Here’s my previous TensorBoard log for your reference (#120 (comment)).

I haven’t specifically researched text types. My own dataset was professionally recorded, with sentences that resemble reading books. I’m not very familiar with English datasets—are you planning to train in English?


smlkdev commented Nov 11, 2024

This is my first attempt at ML/training/voice cloning, and I decided to use English. I briefly read the Thai thread, and it was way too complex for me to start with.

Your training was 32 hours long, and to me (I'm no expert) the inferred voice matched the original :) That's really nice. Is that the voice that had 8-10 hours of audio, as you mentioned earlier?

@jeremy110

Yes, that's correct. I tried both single-speaker and multi-speaker models, and the total duration is around 8-10 hours.

If this is your first time getting into it, I recommend you try F5-TTS. There are a lot of people in the forums who have trained their own models, and some even wrote a Gradio interface, which is very convenient.


smlkdev commented Nov 12, 2024

@jeremy110 thank you for your responses.

Is F5-TTS better than MeloTTS in terms of quality?

I just realized that my cloned MeloTTS voice doesn’t add breaks between sentences. I have to add them manually: splitting the text into sentences, generating each part, and then merging everything back together with pauses in between. This could be automated, of course, but it's still a bit of work. (I had been focusing on single sentences before, and I liked the quality.)
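The split-generate-merge workaround described above can be sketched roughly as follows. This is a minimal sketch, not MeloTTS's actual API: `synthesize()` is a hypothetical stand-in for the real TTS call, and the sample rate and pause length are assumptions you'd tune.

```python
import re
import numpy as np

SAMPLE_RATE = 44100  # assumption; use your model's actual output rate


def synthesize(sentence: str) -> np.ndarray:
    """Hypothetical stand-in for the real TTS call.

    Returns 0.5 s of dummy audio so the sketch runs end to end.
    """
    return np.zeros(SAMPLE_RATE // 2, dtype=np.float32)


def tts_with_pauses(text: str, pause_s: float = 0.4) -> np.ndarray:
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    silence = np.zeros(int(SAMPLE_RATE * pause_s), dtype=np.float32)
    chunks = []
    for i, sent in enumerate(sentences):
        chunks.append(synthesize(sent))
        if i < len(sentences) - 1:
            chunks.append(silence)  # pause between sentences only
    return np.concatenate(chunks)


audio = tts_with_pauses("First sentence. Second one! Third?")
```

In a real pipeline you would replace `synthesize()` with the model call and write `audio` out with `soundfile` or similar; the regex split is deliberately naive and will mishandle abbreviations like "e.g.".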

@jeremy110

jeremy110 commented Nov 13, 2024

In terms of quality, I think F5-TTS is quite good. You can try it out on the Huggingface demo.

The pauses within sentences mainly depend on your commas (","). The program adds a space after punctuation to create a pause. However, if the audio files you trained on have very little silence before and after the speech, the generated audio will also have little silence. Of course, you can add the pauses manually, but you could also address it by adjusting the training data.
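One way to adjust the training data as suggested above is to normalize the silence at the edges of each clip: trim whatever leading/trailing silence is there and pad a consistent amount back. A minimal numpy sketch with a simple amplitude gate; the threshold, pad length, and sample rate are assumptions, and `librosa.effects.trim` is a more robust alternative to the hand-rolled gate:

```python
import numpy as np

SR = 44100  # assumed sample rate of the training clips


def normalize_edge_silence(wav: np.ndarray, pad_s: float = 0.2,
                           threshold: float = 1e-3) -> np.ndarray:
    """Trim leading/trailing near-silence, then pad both ends with
    exactly `pad_s` seconds of silence (simple amplitude-gate sketch)."""
    voiced = np.flatnonzero(np.abs(wav) > threshold)
    if voiced.size == 0:
        return wav  # clip is all silence; leave it unchanged
    core = wav[voiced[0]:voiced[-1] + 1]
    pad = np.zeros(int(SR * pad_s), dtype=wav.dtype)
    return np.concatenate([pad, core, pad])


# Toy clip: 100 silent samples, 1000 voiced samples, 5000 silent samples.
clip = np.concatenate([np.zeros(100, dtype=np.float32),
                       0.5 * np.ones(1000, dtype=np.float32),
                       np.zeros(5000, dtype=np.float32)])
out = normalize_edge_silence(clip)
```

Running this over the whole dataset before training should give the model a consistent notion of sentence-final silence, which is what the generated audio then reproduces.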
