Optimizing XTTSv2 Cloning with Multiple Audio Tracks: Speed vs. Quality Trade-offs and Inference Efficiency #4013
Unanswered
240db asked this question in General Q&A
I have been using TTS for multilingual purposes, and I typically rely on XTTSv2. Until recently I had always used a single audio track as the speaker_wav input, but I realized that multiple audio tracks can be passed to improve cloning during inference. I ran some tests using roughly 7-10 hours of audio from one speaker and 48 hours from another.
Inference Times
The first thing I noticed is that generation time increased significantly with the larger reference sets.
My speaker_wav tracks are in .wav format, sampled at 44,100 Hz. I plan to try downsampling them to 22,050 Hz or similar to see if it improves performance without sacrificing too much quality.
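For the downsampling experiment I have something like this rough librosa/soundfile sketch in mind (directory names are placeholders); as far as I know XTTS loads the reference audio at 22,050 Hz internally anyway, so this should mainly cut I/O and resampling overhead rather than change quality much:

```python
from pathlib import Path

import librosa
import soundfile as sf

SRC_DIR = Path("speaker_a")      # 44.1 kHz originals (illustrative path)
DST_DIR = Path("speaker_a_22k")  # downsampled copies
DST_DIR.mkdir(exist_ok=True)

for wav_path in sorted(SRC_DIR.glob("*.wav")):
    # librosa resamples to the requested rate while loading; mono=True collapses stereo.
    audio, sr = librosa.load(wav_path, sr=22050, mono=True)
    sf.write(DST_DIR / wav_path.name, audio, sr)
```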
Results
Despite the longer inference times, the results have been highly encouraging. The model's performance has improved significantly, especially on long texts, and pronunciation keeps getting more robust as I feed it more hours of reference audio.
Next Steps and Questions
I'm curious whether I can speed up inference by building a speaker model from the speaker_wav files once, instead of loading all these files during each generation (see the sketch below). Would that be faster? Additionally, since we already build a dataset of transcripts, I wonder whether pronunciation and cloning quality would improve by focusing more on the audio tracks alone.