Losing speakers when attempting to fine-tune YourTTS #2274

nanonomad · 2023-01-09T23:04:33Z


Hi everyone,
I've been trying to fine-tune (transfer learn) the YourTTS model on a custom dataset, and I've run into an issue that I think is related to the speaker embeddings not being picked up.

I'm doing this on Colab, with a modified version of the 'original recipe' by Edresson & iamkhalidbashir, but I have run things with modifications on my own machine with the same outcome.

tl;dr - at inference time tts reports an empty list of speakers {}
(coqui) C:\tts>tts --text "test test" --model_path test/best_model.pth --config_path test/config2.json --list_speaker_idxs --speakers_file_path test/speakers.pth

Using model: vits
Setting up Audio Processor...
| > sample_rate:16000
[cut]
External Speaker Encoder Loaded !!
Model fully restored.
Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:64
| > log_func:np.log10
[cut]
Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

I've modified the config to point to the new speakers.pth, with the same result. The result is also the same when I don't specify speakers_file_path on the CLI.

Potential training issues: When calling Trainer(), the config is updated, and the incorrect speakers.pth seems to be copied to the run directory.

The speakers.pth created by compute_embeddings is 2.7 MB; the one copied to the run directory is 431 bytes.

I've tried replacing the 431 byte file with the computed embeddings before running trainer.fit(), same results as above.
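One way to see what each of the two speakers.pth files actually holds is to load them and summarize their contents. The sketch below is only an illustration: it assumes the file has already been loaded into a plain dict (for example with torch.load, which is an assumption about how it was saved), that a computed-embeddings file maps a clip key to an entry with "name" and "embedding" fields, and that the small file is a speaker-name-to-id mapping. The function name and the sample data are hypothetical.

```python
from collections import Counter

def summarize_speaker_file(data):
    """Describe an already-loaded speakers file (expected to be a plain dict)."""
    if not isinstance(data, dict) or not data:
        return {"kind": "empty", "entries": 0, "speakers": {}}
    first = next(iter(data.values()))
    if isinstance(first, dict) and "embedding" in first:
        # Assumed d-vector layout: clip key -> {"name": speaker, "embedding": [...]}
        per_speaker = Counter(entry.get("name") for entry in data.values())
        return {"kind": "embeddings", "entries": len(data), "speakers": dict(per_speaker)}
    # Otherwise assume a small speaker-name -> integer-id mapping.
    return {"kind": "id_map", "entries": len(data), "speakers": dict(data)}

# Hypothetical stand-ins for the two files; real ones would be loaded first.
computed = {
    "duke/clip_001.wav": {"name": "duke", "embedding": [0.1, 0.2]},
    "cash/clip_001.wav": {"name": "cash", "embedding": [0.3, 0.4]},
}
copied = {"duke": 0, "cash": 1}

print(summarize_speaker_file(computed))
print(summarize_speaker_file(copied))
```

If the small file turns out to be an id mapping rather than embeddings, the size difference alone would not indicate that the wrong file was copied.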

Trainer seems to be setting
"speakers_file": "/content/drive/MyDrive/duke-yt/traineroutput/YourTTS-EN-VCTK-January-09-2023_08+37PM-0000000/speakers.pth" as well, regardless of how many different ways I attempt to set this to null.

Does this override d_vector_file? The generated config.json file points to the newly created speakers file "d_vector_file": ["/content/drive/MyDrive/duke-yt/speakers.pth"]

Pastebin link with config.json generated at run time: https://pastebin.com/bKrkPAyE

Pastebin link with trainer output logs: https://pastebin.com/SU9Ebnwh

I've made a dataset that mirrors the VCTK format, with 2 speaker directories named 'duke' and 'cash'. The samples that I can listen to in Tensorboard sound great, but I'm not sure if there's a way to hear both voices in Tensorboard, or if only one is being trained. It appears that the samples are all the 'duke' voice.

Pastebin link with colab code and results: https://pastebin.com/Wb1PBfi5

I've searched the other discussions regarding YourTTS and found others with a similar issue, but couldn't figure out a solution. I'm sure it's user error on my part, but I'm in the weeds. Any help would be appreciated.

iamkhalidbashir · 2023-01-13T19:56:52Z


Try now, it's fixed.


nanonomad Jan 16, 2023
Author

Thank you. Looking forward to trying this out some more. I had also made a couple of fat-finger errors, and I think #2234 (comment) fixed the issue I was having with the weighted sampler.

In regard to that, if I have a dataset with 3 voices, is this normal?

Using weighted sampler for attribute 'speaker_name' with alpha '1.0'
{}
Attribute weights for '['VCTK_b', 'VCTK_bg', 'VCTK_ly']'
| > [0.021294541942282544, 0.024964696388349195, 0.025655601269945573]

I would have thought it would show 0.212, 0.249, 0.256, but I'm probably just misunderstanding how it works.
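One possible reading of these numbers, not confirmed anywhere in this thread: if the sampler gives every clip a weight of 1 / (number of clips for its speaker) and then divides all weights by their L2 norm, the printed values are per-clip weights, so they depend on the dataset size and are not expected to sum to 1 across speakers. The sketch below illustrates that assumed scheme with made-up clip counts; the function name and the counts are hypothetical.

```python
import math
from collections import Counter

def balancer_weights(speaker_names):
    """Per-clip weights: inverse speaker frequency, then L2-normalized (assumed scheme)."""
    counts = Counter(speaker_names)
    raw = [1.0 / counts[name] for name in speaker_names]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

# Hypothetical clip counts, chosen only for illustration.
names = ["VCTK_b"] * 800 + ["VCTK_bg"] * 700 + ["VCTK_ly"] * 650
weights = balancer_weights(names)

# One distinct weight per speaker; speakers with more clips get smaller per-clip weights.
unique = sorted(set(round(w, 6) for w in weights))
print(unique)

# Total sampling mass per speaker comes out equal, which is the point of balancing.
totals = Counter()
for name, w in zip(names, weights):
    totals[name] += w
print(dict(totals))
```

Under this scheme, three small, similar values would be normal for three speakers with roughly similar clip counts.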
