docs(ai): add text-to-speech docs
pschroedl committed Oct 31, 2024
1 parent 65d7513 commit ed754c7
Showing 8 changed files with 1,298 additions and 9,160 deletions.
927 changes: 304 additions & 623 deletions ai/api-reference/gateway.openapi.yaml

Large diffs are not rendered by default.

21 changes: 21 additions & 0 deletions ai/api-reference/text-to-speech.mdx
@@ -0,0 +1,21 @@
---
openapi: post /text-to-speech
---

<Info>
The default Gateway used in this guide is the public
[Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
not intended for production-ready applications. For production-ready
applications, consider using the [Livepeer Studio](https://livepeer.studio/)
Gateway, which requires an API token. Alternatively, you can set up your own
Gateway node or partner with one via the `ai-video` channel on
[Discord](https://discord.gg/livepeer).
</Info>

<Note>
Please note that the exact parameters, default values, and responses may vary
between models. For more information on model-specific parameters, please
refer to the respective model documentation available in the [text-to-speech
pipeline](/ai/pipelines/text-to-speech). Not all parameters might be available
for a given model.
</Note>
7 changes: 7 additions & 0 deletions ai/orchestrators/models-config.mdx
@@ -49,6 +49,13 @@ currently **recommended** models and their respective prices.
"SFAST": true,
"DEEPCACHE": false
}
},
{
"pipeline": "text-to-speech",
"model_id": "parler-tts/parler-tts-large-v1",
"price_per_unit": 11,
"pixels_per_unit": 1e2,
"currency": "USD"
}
]
```
7 changes: 7 additions & 0 deletions ai/pipelines/overview.mdx
@@ -82,4 +82,11 @@ pipelines:
The segment-anything-2 pipeline offers promptable visual segmentation for
images and videos.
</Card>
<Card
title="Text-to-Speech"
icon="message-dots"
href="/ai/pipelines/text-to-speech"
>
The text-to-speech pipeline generates high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.).
</Card>
</CardGroup>
76 changes: 76 additions & 0 deletions ai/pipelines/text-to-speech.mdx
@@ -0,0 +1,76 @@
---
title: Text-to-Speech
---

## Overview

The Livepeer text-to-speech endpoint uses [Parler-TTS](https://github.com/huggingface/parler-tts), specifically the `parler-tts/parler-tts-large-v1` checkpoint. The model generates speech whose characteristics, such as voice type, speaking style, and audio quality, are set through a text description.

## Basic Usage Instructions

<Tip>
For a detailed understanding of the `text-to-speech` endpoint and to experiment
with the API, see the [Livepeer AI API
Reference](/ai/api-reference/text-to-speech).
</Tip>

To use the text-to-speech feature, submit a POST request to the `/text-to-speech` endpoint. Here's an example of how to structure your request:

```bash
curl -X POST "http://<GATEWAY_IP>/text-to-speech" \
-H "Content-Type: application/json" \
-d '{
"model_id": "parler-tts/parler-tts-large-v1",
"text_input": "A cool cat on the beach",
"description": "Jon'\''s voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
}'
```
### Request Parameters
- `model_id`: The ID of the text-to-speech model to use. Currently, this should be set to `"parler-tts/parler-tts-large-v1"`.
- `text_input`: The text you want to convert to speech.
- `description`: A description of the desired voice characteristics. This can include details about the speaker's voice, speaking style, and audio quality.
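
The same request can be assembled from Python. The following is a minimal sketch using only the standard library; `build_tts_request` and the `http://gateway.example` address are hypothetical names introduced for illustration, and the request is built but deliberately not sent, since the response format is covered by the API reference rather than this guide.

```python
import json
import urllib.request

MODEL_ID = "parler-tts/parler-tts-large-v1"


def build_tts_request(gateway_url, text_input, description, model_id=MODEL_ID):
    """Build (but do not send) a POST request for the /text-to-speech endpoint."""
    payload = {
        "model_id": model_id,
        "text_input": text_input,
        "description": description,
    }
    return urllib.request.Request(
        url=gateway_url.rstrip("/") + "/text-to-speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_tts_request(
    "http://gateway.example",  # hypothetical placeholder for your Gateway address
    "A cool cat on the beach",
    "Jon's voice is monotone yet slightly fast in delivery, with a very close "
    "recording that almost has no background noise.",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Because the body is serialized with `json.dumps`, the apostrophe in "Jon's" needs no shell escaping. To actually send the request, pass it to `urllib.request.urlopen`.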
### Voice Customization
You can customize the generated voice by adjusting the `description` parameter. Some aspects you can control include:
- Speaker identity (e.g., "Jon's voice")
- Speaking style (e.g., "monotone", "expressive")
- Speaking speed (e.g., "slightly fast")
- Audio quality (e.g., "very close recording", "no background noise")

The checkpoint was trained on 34 speakers: Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, and Emily.

However, the models performed better with certain speakers. A list of the top 20 speakers for each model variant, ranked by average speaker similarity score, is available in the [Parler-TTS inference guide](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency).
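
One way to keep descriptions consistent is to compose them from the aspects listed above. The sketch below is illustrative only: `describe_voice` and its sentence template are assumptions modelled on the example request, not part of the Livepeer or Parler-TTS API, and rejecting unknown speaker names is a design choice rather than a requirement of the model.

```python
# The 34 speaker names the checkpoint was trained on, as listed in this guide.
SPEAKERS = {
    "Laura", "Gary", "Jon", "Lea", "Karen", "Rick", "Brenda", "David",
    "Eileen", "Jordan", "Mike", "Yann", "Joy", "James", "Eric", "Lauren",
    "Rose", "Will", "Jason", "Aaron", "Naomie", "Alisa", "Patrick", "Jerry",
    "Tina", "Jenna", "Bill", "Tom", "Carol", "Barbara", "Rebecca", "Anna",
    "Bruce", "Emily",
}


def describe_voice(speaker, style, speed, recording):
    """Compose a `description` string from the four aspects above.

    The template is a hypothetical convention, not a format the model requires.
    """
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    return f"{speaker}'s voice is {style} yet {speed} in delivery, with {recording}."


description = describe_voice(
    speaker="Jon",
    style="monotone",
    speed="slightly fast",
    recording="a very close recording that almost has no background noise",
)
print(description)
```

The resulting string can be passed as the `description` field of the request.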
## Limitations and Considerations
- The maximum length of the input text may be limited. For long-form content, split your text into smaller chunks. The default [Parler-TTS training configuration](https://github.com/huggingface/parler-tts/blob/main/training/README.md#3-training) caps audio at 30 seconds and text at 600 characters.
- While the model supports various voice characteristics, the exact replication of a specific speaker's voice is not guaranteed.
- The quality of the generated speech can vary based on the complexity of the input text and the specificity of the voice description.
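
Splitting long-form text, as suggested in the first point above, could look like the following sketch. It assumes the 600-character training default is a sensible per-request budget (whether a given Gateway enforces that limit is not stated here), and `split_text` is a hypothetical helper, not part of any Livepeer SDK. It breaks on sentence boundaries where possible and falls back to word boundaries for overlong sentences.

```python
import re


def split_text(text, max_chars=600):
    """Split text into chunks of at most max_chars, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the budget is cut at the last space
        # that fits, or hard-cut if it contains no space at all.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut])
            sentence = sentence[cut:].lstrip()
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


long_text = "Hello world. " * 100
chunks = split_text(long_text)
print(len(chunks), max(len(chunk) for chunk in chunks))
```

Each chunk can then be sent as the `text_input` of its own request, reusing the same `description` so the voice stays as consistent as the model allows.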
## Orchestrator Configuration
To configure your Orchestrator to serve the `text-to-speech` pipeline, refer to
the [Orchestrator Configuration](/ai/orchestrators/get-started) guide.
### System Requirements
The following system requirements are recommended for optimal performance:
- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 12GB** of
VRAM.
## API Reference
<Card
title="API Reference"
icon="rectangle-terminal"
href="/ai/api-reference/text-to-speech"
>
Explore the `text-to-speech` endpoint and experiment with the API in the
Livepeer AI API Reference.
</Card>
4 changes: 4 additions & 0 deletions api-reference/generate/text-to-speech.mdx
@@ -0,0 +1,4 @@
---
title: "Text to Speech"
openapi: "POST /api/beta/generate/text-to-speech"
---
4 changes: 3 additions & 1 deletion mint.json
@@ -538,6 +538,7 @@
"ai/pipelines/image-to-video",
"ai/pipelines/segment-anything-2",
"ai/pipelines/text-to-image",
"ai/pipelines/text-to-speech",
"ai/pipelines/upscale"
]
},
@@ -602,7 +603,8 @@
"ai/api-reference/image-to-image",
"ai/api-reference/image-to-video",
"ai/api-reference/segment-anything-2",
"ai/api-reference/upscale"
"ai/api-reference/upscale",
"ai/api-reference/text-to-speech"
]
}
]