Documentation for voice training? #2

RobAley · 2022-07-02T19:48:39Z

Hi,

Is there any documentation anywhere on how to train/create a new voice for this from e.g. audio collected by mimic-recording-studio?

Many thanks

Rob

inventionsbyhamid · 2022-07-20T14:12:12Z

This would help the community create models for languages that are not currently supported. Some direction would be helpful on how to structure the data, how many hours of audio are needed, which scripts to run for training, and how to save the model for future use.

synesthesiam · 2022-07-21T23:35:56Z

No documentation yet. I'm currently rewriting the training code to be usable by people other than myself. It was written over the course of a year or more, with different experiments and dead ends left in. Definitely needs some cleanup!

The structure of the training data is very simple, currently just a CSV file (with | as the delimiter) with two columns: (1) path or name of the audio file, and (2) text transcription. For example:

path/to/1.wav|This is a test.
path/to/2.wav|This is another test.

If you have multiple speakers, it becomes:

path/to/1.wav|speaker1|This is a test by speaker 1.
path/to/2.wav|speaker2|This is another test by speaker 2.

Eventually, I'd like to use the data format from Mimic Recording Studio. Audio files can be anything that librosa will load.
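For anyone preparing data before the training code is released, here is a minimal sketch of reading the format described above. It is not the project's loader: the names Utterance and read_metadata are hypothetical, and it assumes only what the examples show (pipe-delimited rows with two columns, or three when a speaker is included).

```python
import csv
import io
from typing import Iterator, NamedTuple, Optional, TextIO


class Utterance(NamedTuple):
    """One row of the metadata file (hypothetical container, not from the project)."""
    audio_path: str
    speaker: Optional[str]  # None for single-speaker rows
    text: str


def read_metadata(lines: TextIO) -> Iterator[Utterance]:
    """Yield one Utterance per non-empty row of a pipe-delimited file."""
    # QUOTE_NONE leaves quotation marks inside the transcription untouched.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) == 2:
            yield Utterance(row[0], None, row[1])
        elif len(row) == 3:
            yield Utterance(row[0], row[1], row[2])
        else:
            raise ValueError(
                f"row {row_number}: expected 2 or 3 columns, got {len(row)}"
            )


# Single-speaker example from the thread.
single = list(read_metadata(io.StringIO(
    "path/to/1.wav|This is a test.\n"
    "path/to/2.wav|This is another test.\n"
)))

# Multi-speaker example from the thread.
multi = list(read_metadata(io.StringIO(
    "path/to/1.wav|speaker1|This is a test by speaker 1.\n"
    "path/to/2.wav|speaker2|This is another test by speaker 2.\n"
)))
```

Rejecting rows with an unexpected column count catches a stray | inside a transcription early, before it silently shifts the text into the wrong column.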

As for the amount of data, that depends on whether you're starting from scratch or reusing an existing model. From scratch, I've found that 3-5 hours will get you a good voice, but 10+ will usually make a great voice. What really matters is the recording quality and the phonetic diversity of what you read.

If you reuse an existing model, I've had as little as 30 minutes of data work using the Harvard Sentences. I'd recommend at least an hour, though.
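To compare a dataset against the figures above, the total duration can be added up. This is a rough sketch that assumes the clips are uncompressed PCM WAV files, which the standard-library wave module can read; other formats would need a different reader such as librosa. The function names wav_seconds and total_hours are hypothetical.

```python
import wave
from pathlib import Path
from typing import Iterable


def wav_seconds(path: Path) -> float:
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_hours(paths: Iterable[Path]) -> float:
    """Return the combined duration of all given WAV files in hours."""
    return sum(wav_seconds(path) for path in paths) / 3600.0
```

For example, total_hours(Path("path/to").glob("*.wav")) would report how much audio a folder of clips contains, where path/to stands in for wherever the recordings live.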

inventionsbyhamid · 2022-07-22T07:50:02Z

Thanks for the quick reply. I am starting to build a good-quality TTS voice for Hindi, so I am gathering information on what a good dataset requires.

Does the length of each audio clip and the number of words in it matter? I saw that the LJ Speech Dataset has audio clips of 1-10 seconds. The audio files I currently have are 1-3 minutes long each (with a lot of repeated words).
Should I create a fresh dataset (I have a studio available for recording) or split the existing audio into sentences? (I might have to do this manually, but it is doable.)
If I create a fresh dataset, is there an open dataset of Hindi text transcripts that I should use? (Something similar to the Harvard Sentences, but in Hindi.)

A few of the questions above may be out of scope for this repo, but it would be great if you could help.

synesthesiam · 2022-07-23T15:07:53Z

Hi @inventionsbyhamid,

The length of audio does matter. Each clip should only be a sentence or two, and ideally you would include clips with one or two words as well.
It's possible to get help splitting the audio with tools like aeneas and finetuneas.
I don't know of any text dataset like that for Hindi. If you plan to use eSpeak for phonemization, I may be able to help create one with your assistance. I typically take sentences from the Oscar corpus and use a simple algorithm to create a phonetically balanced subset. I need help from native speakers, though, to figure out if the sentences make any sense :)
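The "simple algorithm" is not described in this thread, so the following is only one possible reading: a greedy coverage heuristic that keeps picking the sentence that adds the most phonemes not yet covered. It assumes each candidate sentence has already been phonemized (for instance with eSpeak). The function name select_covering_subset and the toy phoneme symbols are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def select_covering_subset(
    phonemes_by_sentence: Dict[str, Sequence[str]],
    max_sentences: int,
) -> List[str]:
    """Greedily pick sentences until no candidate adds a new phoneme."""
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(phonemes_by_sentence)

    while remaining and len(chosen) < max_sentences:
        # Prefer the largest number of new phonemes; break ties in favour
        # of the shorter sentence.
        best = max(
            remaining,
            key=lambda s: (len(set(remaining[s]) - covered), -len(remaining[s])),
        )
        gain = set(remaining[best]) - covered
        if not gain:
            break  # nothing left would improve coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]

    return chosen


# Toy input: sentence labels mapped to made-up phoneme lists.
candidates = {
    "sentence a": ["p", "a"],
    "sentence b": ["p", "a", "t"],
    "sentence c": ["k"],
    "sentence d": ["t", "a"],
}
subset = select_covering_subset(candidates, max_sentences=10)
```

Coverage alone does not guarantee balance (rare phonemes may still appear only once), and the selected sentences would still need a native speaker to check that they make sense.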

jyapayne · 2022-09-24T00:05:14Z

@synesthesiam do you have an update on getting the training code ready to use? I am interested in using it as well.

lumpidu · 2022-10-17T09:24:29Z

It would also be interesting to know which model you are using, or to have a reference to the paper.

fivestones · 2022-11-29T15:13:25Z

@synesthesiam I'd also love to make a voice model (in English). From what you've said in this thread, I think I could get started, but I'm wondering what I would need to do with the CSV file once I have made it. Or maybe I'm really wondering whether you are close to finishing the cleanup of the training code, since I bet that would be easier than forging ahead alone. Better still would be if Mimic Recording Studio were close to being ready to use for Mimic 3.
Thanks for your work on this!

RobAley added the enhancement (New feature or request) label on Jul 2, 2022
