Documentation for voice training? #2

Open
RobAley opened this issue Jul 2, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@RobAley

RobAley commented Jul 2, 2022

Hi,

Is there any documentation on how to train or create a new voice for this, e.g. from audio collected with mimic-recording-studio?

Many thanks

Rob

RobAley added the enhancement label Jul 2, 2022
@inventionsbyhamid

This would help the community create models for languages that are not currently supported. Some direction would be useful on how to structure the data, how many hours of audio are needed, which scripts to run for training, and how to save the model for future use.

@synesthesiam
Collaborator

No documentation yet. I'm currently rewriting the training code to be usable by people other than myself. It was written over the course of a year or more, with various experiments and dead ends left in. It definitely needs some cleanup!

The structure of the training data is very simple: currently just a pipe-delimited (|) CSV file with two columns: (1) path or name of the audio file, and (2) text transcription. For example:

path/to/1.wav|This is a test.
path/to/2.wav|This is another test.

If you have multiple speakers, it becomes:

path/to/1.wav|speaker1|This is a test by speaker 1.
path/to/2.wav|speaker2|This is another test by speaker 2.
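A minimal sketch of reading both layouts shown above, assuming one utterance per line and `|` as the only delimiter. The names `Utterance` and `read_metadata` are hypothetical and are not part of the training code discussed in this thread.

```python
import csv
import io
from typing import Iterable, List, NamedTuple, Optional


class Utterance(NamedTuple):
    audio_path: str
    speaker: Optional[str]  # None for single-speaker rows
    text: str


def read_metadata(lines: Iterable[str]) -> List[Utterance]:
    """Parse rows of the form path|text or path|speaker|text."""
    # QUOTE_NONE keeps quotation marks in the transcription as literal text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    utterances = []
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip empty lines
        if len(row) == 2:
            path, text = row
            speaker = None
        elif len(row) == 3:
            path, speaker, text = row
        else:
            raise ValueError(
                f"line {row_number}: expected 2 or 3 columns, got {len(row)}"
            )
        utterances.append(Utterance(path.strip(), speaker, text.strip()))
    return utterances


# Usage with the example rows from this comment:
sample = io.StringIO(
    "path/to/1.wav|This is a test.\n"
    "path/to/2.wav|speaker2|This is another test by speaker 2.\n"
)
for utterance in read_metadata(sample):
    print(utterance)
```

Rejecting rows with an unexpected column count catches transcriptions that themselves contain a `|` character, which would otherwise be silently misparsed.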

Eventually, I'd like to use the data format from Mimic Recording Studio. Audio files can be anything that librosa will load.

As for the amount of data, that depends on whether you're starting from scratch or reusing an existing model. From scratch, I've found that 3-5 hours will get you a good voice, but 10+ will usually make a great voice. What really matters is the recording quality and the phonetic diversity of what you read.

If you reuse an existing model, I've had as little as 30 minutes of data work using the Harvard Sentences. I'd recommend at least an hour, though.
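To compare a dataset against these figures, the total recorded time can be summed up front. The sketch below is only an illustration using Python's standard `wave` module, so it assumes uncompressed PCM WAV files (a narrower set than what librosa can load). The function names are hypothetical.

```python
import wave
from typing import Iterable


def wav_duration_seconds(path: str) -> float:
    """Duration of one PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_hours(wav_paths: Iterable[str]) -> float:
    """Total duration of all given WAV files, in hours."""
    return sum(wav_duration_seconds(path) for path in wav_paths) / 3600.0
```

For example, `total_hours(u.audio_path for u in utterances)` would give the figure to compare against the 3-5 hour (from scratch) or roughly 1 hour (reusing a model) guidance above, where `utterances` is whatever list of records you loaded from your metadata file.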

@inventionsbyhamid

inventionsbyhamid commented Jul 22, 2022

Thanks for the quick reply. I am starting work on a good-quality TTS voice for Hindi, so I am gathering information on what a good dataset requires.

  1. Does the length of each audio clip, and the number of words in it, matter? I saw that the LJ Speech Dataset has 1-10 second audio clips. The audio files I currently have are 1-3 minutes each (with a lot of repeated words).
  2. Should I create a fresh dataset (I have a studio available for recording) or split the existing audio into sentences? (I might have to do this manually, but it is doable.)
  3. If I create a fresh dataset, is there an open dataset of Hindi text transcripts that I should use (similar to the Harvard Sentences, but in Hindi)?

A few of the questions above may be out of scope for this repo, but it would be great if you could help.

@synesthesiam
Collaborator

Hi @inventionsbyhamid,

  1. The length of the audio does matter. Each clip should only be a sentence or two, and ideally you would include clips with one or two words as well.
  2. It's possible to get help splitting the audio with tools like aeneas and finetuneas.
  3. I don't know of any text dataset like that for Hindi. If you plan to use eSpeak for phonemization, I may be able to help create one with your assistance. I typically take sentences from the Oscar corpus and use a simple algorithm to create a phonetically balanced subset. I need help from native speakers, though, to figure out if the sentences make any sense :)
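The "simple algorithm" in point 3 is not described in this thread. As one possible illustration, a greedy coverage pass repeatedly picks the sentence that adds the most not-yet-seen phonetic units. Everything in the sketch below is an assumption: `select_balanced_subset` is a hypothetical name, and `toy_phonemize` is a stand-in that treats letters as units. A real setup would substitute phonemes produced by a phonemizer such as eSpeak.

```python
from typing import Callable, Iterable, List, Set


def toy_phonemize(sentence: str) -> Set[str]:
    """Hypothetical stand-in: treats each alphabetic character as a unit."""
    return {ch for ch in sentence.lower() if ch.isalpha()}


def select_balanced_subset(
    sentences: Iterable[str],
    max_sentences: int,
    phonemize: Callable[[str], Set[str]] = toy_phonemize,
) -> List[str]:
    """Greedily pick sentences that add the most uncovered units."""
    remaining = [(sentence, phonemize(sentence)) for sentence in sentences]
    covered: Set[str] = set()
    chosen: List[str] = []
    while remaining and len(chosen) < max_sentences:
        # Index of the sentence contributing the most new units
        # (ties go to the earliest sentence).
        best_index = max(
            range(len(remaining)),
            key=lambda i: len(remaining[i][1] - covered),
        )
        sentence, units = remaining.pop(best_index)
        if not units - covered:
            break  # no remaining sentence adds anything new
        chosen.append(sentence)
        covered |= units
    return chosen
```

This only maximizes coverage. It does not balance how often each unit occurs, and it cannot judge whether a sentence is natural or meaningful, which is where review by native speakers comes in.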

@jyapayne

@synesthesiam do you have an update on getting the training code ready to use? I am interested in using it as well.

@lumpidu

lumpidu commented Oct 17, 2022

It would also be interesting to know which model you are using, or to have a reference to the paper.

@fivestones

@synesthesiam I'd also love to make a voice model (in English). From what you've said in this thread, I think I could get started, but I'm wondering what I would need to do with the CSV file once I have made it. Or maybe I'm really wondering whether you are close to finishing the cleanup of the training code, since I bet that would be easier to use than forging ahead alone. Or, better still, whether Mimic Recording Studio is close to being ready to use for Mimic 3.
Thanks for your work on this!


6 participants