Made with 🐸CoquiTTS #2602

erogol · 2023-05-08T09:11:05Z

erogol
May 8, 2023
Maintainer

Let us know what you do with 🐸CoquiTTS, and feel free to post it here.

I see many cool projects, so gathering them up would be good.

erogol · 2023-05-08T09:12:48Z

erogol
May 8, 2023
Maintainer Author

👀 EPUB reader using 🐸TTS by @knochenhans -> #2580

0 replies

Hexatona · 2023-05-14T06:21:56Z

Hexatona
May 14, 2023

I've been making text-to-speech audiobooks for a long time, using the Zira voice from Microsoft. I took it as a personal challenge to see if I could make something you could actually listen to for long periods.

I made the experience better by fixing spelling mistakes, making new words pronounced properly in the engine, adding pauses where appropriate, emphasizing italic text with a slower speed, and most importantly making dialogue have a higher pitch than narration so you always know when someone is talking.

I've been experimenting with coqui-tts for a little while now, and over the past few weeks I have been working on a program that outputs text of any size as chapter-separated MP3s with a great voice. On top of that, I managed to incorporate all the extra features: pauses, pitch, and rate changes!

I can't begin to describe how much tweaking and experimentation went into the project to get something I was happy with. So many little gotchas where the model I was using would error out randomly, or just plain skip some text in the prompt. And then tweaking every little thing for quality of use.

Anyway, if anyone is interested, I've posted a showcase of the output on YouTube that I'll be using as a basis for making the next audiobook. Feel free to check it out:

https://youtu.be/T2qKaJK508M

4 replies

knochenhans May 14, 2023

Hi, might I ask how you tackle pitch and rate changes? Is this something you get out of the model or do you do this via post-processing?
I’d love to implement something like this for my conversion library, but so far, I’ve found no way to tell the model (I’m using VITS for now) to change pitch accordingly (I guess this will be possible with SSML, at some point). And doing this via post-processing would mean somehow getting word boundaries for the synthesized text, which doesn’t seem to be possible, either. Another way I tried is splitting up the text in pre-processing, but that always sounded unnatural in my case.

Hexatona May 14, 2023

Hey! Well, I'm also using VITS, and you're right, I have to do it in post, and then cobble together the thousands of clips into a chapter file before moving on to the next. The real secret sauce is having pre-formatted text with tags that chop up the text into what comes next. Since I already have a program I wrote to do just that from my previous work, I adapted it for this purpose.
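
For readers wondering what a per-clip post-processing step could look like, here is a minimal sketch. It is not Hexatona's code (the post does not say which tool is used). It assumes each synthesized clip is available as a list of float samples, and `change_rate` is a hypothetical helper name. Naive resampling like this changes pitch and duration together; changing pitch alone would need a time-stretching step, which is not shown.

```python
def change_rate(samples, factor):
    """Resample a clip by linear interpolation (hypothetical helper).

    factor > 1 makes the clip shorter and higher pitched,
    factor < 1 makes it longer and lower pitched.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor              # fractional position in the input
        left = int(pos)
        if left >= len(samples) - 1:  # past the last interpolation pair
            out.append(samples[-1])
            continue
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out
```

A dialogue clip could then be passed through something like `change_rate(clip, 1.05)` before being appended to the chapter, while narration clips are left untouched. The factor 1.05 is only an illustrative value.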

In retrospect, the easiest way would be to work on a file in HTML format, since all the tags are pretty much guaranteed to be properly closed, but I was no good with that at the time I made the text program. Maybe in the future I'll do that.

knochenhans May 14, 2023

Just to be clear, are you feeding specific segments of sentences into the synthesizer? For example, would the sentence "This is [tag]just[/tag] a test." be split into the segments "This is", "just", and "a test.", each one being synthesized separately and the "just" segment being post-processed for pitch etc.?
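
The kind of splitting asked about here can be sketched in a few lines. This is only an illustration under assumptions: the `[tag]...[/tag]` syntax is the placeholder from the question, not a format either poster confirmed, and `split_tagged` is a hypothetical name.

```python
import re

# Placeholder tag syntax from the question above; a real tool would
# likely use several different tags (pitch, rate, pause, ...).
TAG_PATTERN = re.compile(r"\[tag\](.*?)\[/tag\]", re.DOTALL)


def split_tagged(sentence):
    """Split a sentence into (text, is_tagged) segments, skipping empty ones."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(sentence):
        before = sentence[pos:match.start()].strip()
        if before:
            segments.append((before, False))
        inner = match.group(1).strip()
        if inner:
            segments.append((inner, True))
        pos = match.end()
    tail = sentence[pos:].strip()
    if tail:
        segments.append((tail, False))
    return segments


print(split_tagged("This is [tag]just[/tag] a test."))
# [('This is', False), ('just', True), ('a test.', False)]
```

Each segment would then be synthesized on its own, with the tagged ones sent through whatever post-processing the tag stands for.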

The reason I’m asking specifically is that I noticed that VITS has problems synthesizing short segments. It tends to cut off consonants at the beginning and it can’t really utter single letters that are not part of a sentence. On the other hand stitching separately synthesized segments of a sentence together resulted in an unnatural flow for me compared to synthesizing the whole sentence in one go. But maybe I was just imagining things in my past attempts.

Hexatona May 14, 2023

You're right that if you give it short sentences, the output can be a bit unpredictable. The only time this really shows up in my text is when people have sentences like: "I know you must have questions." In those cases, I pass it in three chunks - I don't pass it in as one sentence and let the TTS parser break it up. What I do to make it sound slightly less stilted is pass the audio through a silence-removing filter, so long unnatural pauses are removed. This aspect was the absolute hardest part to figure out and make passable.
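
As a rough idea of what a silence-removing pass can do, here is a minimal sketch. It is an assumption-laden illustration, not the filter used in the project: audio is modelled as a list of float samples, and `shorten_silences`, `threshold`, and `max_silence` are hypothetical names and values.

```python
def shorten_silences(samples, threshold=0.01, max_silence=4):
    """Cap every run of near-silent samples at max_silence samples.

    threshold and max_silence are illustrative; in practice max_silence
    would be a duration converted to samples (seconds * sample rate).
    """
    out = []
    run = 0  # length of the current silent run
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run > max_silence:
                continue  # drop the excess silence
        else:
            run = 0
        out.append(s)
    return out
```

Capping the pauses rather than deleting them entirely keeps a short gap between chunks, which tends to matter when chunks of one sentence are synthesized separately and stitched back together.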

Some other fun things I discovered: the sentence " , I knew what she was going to say, I just didn't know why" won't actually synthesize the text 'I just didn't know why'. You need to clear out that garbage at the front to fix the problem.
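
A small cleanup step along these lines could run on every chunk before it reaches the model. This is a sketch under assumptions: `clean_chunk` is a hypothetical helper, and the set of characters treated as leading garbage is a guess that may need adjusting for a given text.

```python
import re

# Leading whitespace and punctuation left over from chopping up the text.
LEADING_JUNK = re.compile(r"^[\s,.;:!?\-]+")


def clean_chunk(text):
    """Remove stray punctuation and whitespace from the start of a chunk."""
    return LEADING_JUNK.sub("", text).strip()


print(clean_chunk(" , I knew what she was going to say, I just didn't know why"))
# I knew what she was going to say, I just didn't know why
```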

Since there doesn't seem to be a good way to configure length of time between sentences and the like, I've also put in controls to add in the appropriate pauses myself with the text. That was also a lot of trial and error to get natural sounding dialogue. So far I'm really happy, though.
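
Inserting pauses by hand can be as simple as padding with silence while joining clips. The following is a minimal sketch, not the project's code: clips are modelled as lists of float samples, `join_with_pauses` is a hypothetical helper, and the default sample rate of 22050 Hz is an assumption that should be replaced with the model's actual output rate.

```python
def join_with_pauses(clips, pauses_ms, sample_rate=22050):
    """Concatenate clips, adding pauses_ms[i] milliseconds of silence after clips[i]."""
    if len(clips) != len(pauses_ms):
        raise ValueError("need one pause length per clip")
    out = []
    for clip, pause in zip(clips, pauses_ms):
        out.extend(clip)
        # Number of silent samples for this pause (integer milliseconds).
        out.extend([0.0] * (sample_rate * pause // 1000))
    return out
```

With this shape, the pause after a sentence, a paragraph, or a line of dialogue can each get its own length, which is where most of the trial and error would go.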

Celeste-AI · 2023-06-20T15:43:31Z

Celeste-AI
Jun 20, 2023

https://youtu.be/-_vBBpwi-Ec

This uses Coqui TTS to speak. She is a social chatbot in the game "VRChat".

More info can be found here, and there is a 'howsheworks' page that describes the design ideas behind her (no source code, though).
Part 5 mentions the use of Coqui and links to it <3

She uses a fine-tuned LJSpeech, currently with 13,100 samples, usually 2-10 seconds long. I plan to double this soon to capture some words she doesn't express well.

0 replies

howardbaik · 2023-06-23T20:39:17Z

howardbaik
Jun 23, 2023

Introducing Loqui: A Shiny app for Creating Automated Courses

https://loqui.fredhutch.org/

Loqui is an open source Shiny application that enables the creation of automated courses using ari, an R package for generating videos from text and images. Loqui takes as input a Google Slides URL, extracts the speaker notes from the slides, and converts them into an audio file using Coqui TTS. Then, it converts the Google Slides to images and ultimately, generates an mp4 video file where each image is presented with its corresponding audio.

Any feedback is much appreciated!

1 reply

erogol Jun 25, 2023
Maintainer Author

so cool!

csukuangfj · 2023-11-13T12:42:46Z

csukuangfj
Nov 13, 2023

We now support exporting VITS models from Coqui to ONNX and running them with sherpa-onnx.

sherpa-onnx supports both text-to-speech and speech-to-text and it runs on

Linux (x64, arm32, arm64)
macOS (x64, arm64)
Windows (win32, win64)
Android
iOS

and provides various APIs for different languages, e.g.,

C++
C
Python
C#
Kotlin
Swift
Java
Go

We are working on WebAssembly support.

The following Colab notebook shows how to convert VITS models from Coqui to sherpa-onnx:
https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing

You can also try the exported models by visiting the following huggingface space
https://huggingface.co/spaces/k2-fsa/text-to-speech

We also have pre-built Android APKs for the VITS English models from Coqui.
https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

1 reply

csukuangfj Dec 6, 2023

By the way, we just supported nearly all VITS models from coqui-ai/TTS in sherpa-onnx.

You can find the exported models at
https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

The following is a screenshot of an exported model
https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-coqui-fr-css10.tar.bz2

Sharrnah · 2023-11-14T12:54:27Z

Sharrnah
Nov 14, 2023

It might just be a TTS plugin, but I like to use it because of its speed and quality. It would be the default TTS if its dependency conflicts did not crash everything. 😬

I made Whispering Tiger
https://github.com/Sharrnah/whispering-ui#readme

It can translate and transcribe audio, text, and text in images using a variety of AI models, and output the result using different TTS models.
It has VRChat integration, so you can speak with foreign-language speakers and understand them alike.

It all runs completely locally, so unless you use a plugin that requires an API, it works completely offline (provided you have let it download the AI models beforehand).

There are also more plugins available, like RVC voice conversion, which can even be used together with the Coqui TTS plugin to get state-of-the-art voice conversion.

1 reply

erogol Nov 16, 2023
Maintainer Author

Cool!

yalsaffar · 2024-06-28T10:28:09Z

yalsaffar
Jun 28, 2024

Seamless Speech to Speech Translation with Voice Replication (S3TVR)

Hey everyone! 🌟

I'm thrilled to share Seamless Speech to Speech Translation with Voice Replication (S3TVR), an AI model I developed for live translation and voice cloning. It uses Coqui's XTTS_V2 as the Text-to-Speech (TTS) model.

Try the Demo

Curious to see it in action? Check out the demo here: S3TVR Demo 😊

Run Locally for Better Performance

For a more optimized experience, you can run the whole pipeline locally. Everything you need is in this repo: S3TVR GitHub Repo.

Let's Connect!

I'm always up for collaboration or just chatting more about this project. Your support and feedback mean a lot to me! Feel free to reach out if you're interested in learning more or working together. 🤗

Thanks for stopping by!

0 replies

Replies: 7 comments 7 replies
