docs(ai): add text-to-speech docs

livepeer · Oct 31, 2024 · ed754c7
1 parent 65d7513

Showing 8 changed files with 1,298 additions and 9,160 deletions.
diff --git a/ai/api-reference/gateway.openapi.yaml b/ai/api-reference/gateway.openapi.yaml
diff --git a/ai/api-reference/text-to-speech.mdx b/ai/api-reference/text-to-speech.mdx
@@ -0,0 +1,21 @@
+---
+openapi: post /text-to-speech
+---
+
+<Info>
+  The default Gateway used in this guide is the public
+  [Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
+  not intended for production-ready applications. For production-ready
+  applications, consider using the [Livepeer Studio](https://livepeer.studio/)
+  Gateway, which requires an API token. Alternatively, you can set up your own
+  Gateway node or partner with one via the `ai-video` channel on
+  [Discord](https://discord.gg/livepeer).
+</Info>
+
+<Note>
+  Please note that the exact parameters, default values, and responses may vary
+  between models. For more information on model-specific parameters, please
+  refer to the respective model documentation available in the [text-to-speech
+  pipeline](/ai/pipelines/text-to-speech). Not all parameters might be available
+  for a given model.
+</Note>
diff --git a/ai/orchestrators/models-config.mdx b/ai/orchestrators/models-config.mdx
@@ -49,6 +49,13 @@ currently **recommended** models and their respective prices.
       "SFAST": true,
       "DEEPCACHE": false
     }
+  },
+  {
+    "pipeline": "text-to-speech",
+    "model_id": "parler-tts/parler-tts-large-v1",
+    "price_per_unit": 11,
+    "pixels_per_unit": 1e2,
+    "currency": "USD"
   }
 ]
 ```

diff --git a/ai/pipelines/overview.mdx b/ai/pipelines/overview.mdx
@@ -82,4 +82,11 @@ pipelines:
     The segment-anything-2 pipeline offers promptable visual segmentation for
     images and videos.
   </Card>
+  <Card
+    title="Text-to-Speech"
+    icon="message-dots"
+    href="/ai/pipelines/text-to-speech"
+  >
+    The text-to-speech pipeline generates high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.).
+  </Card>
 </CardGroup>
diff --git a/ai/pipelines/text-to-speech.mdx b/ai/pipelines/text-to-speech.mdx
@@ -0,0 +1,76 @@
+---
+title: Text-to-Speech
+---
+
+## Overview
+
+The text-to-speech endpoint in Livepeer utilizes [Parler-TTS](https://github.com/huggingface/parler-tts), specifically `parler-tts/parler-tts-large-v1`. This model can generate speech with customizable characteristics such as voice type, speaking style, and audio quality.
+
+## Basic Usage Instructions
+
+<Tip>
+  For a detailed understanding of the `text-to-speech` endpoint and to experiment
+  with the API, see the [Livepeer AI API
+  Reference](/ai/api-reference/text-to-speech).
+</Tip>
+
+To use the text-to-speech feature, submit a POST request to the `/text-to-speech` endpoint. Here's an example of how to structure your request:
+
+```bash
+curl -X POST "http://<GATEWAY_IP>/text-to-speech" \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model_id": "parler-tts/parler-tts-large-v1",
+        "text_input": "A cool cat on the beach",
+        "description": "Jon'\''s voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
+    }'
+```
+
+### Request Parameters
+
+- `model_id`: The ID of the text-to-speech model to use. Currently, this should be set to `"parler-tts/parler-tts-large-v1"`.
+- `text_input`: The text you want to convert to speech.
+- `description`: A description of the desired voice characteristics. This can include details about the speaker's voice, speaking style, and audio quality.
+
+### Voice Customization
+
+You can customize the generated voice by adjusting the `description` parameter. Some aspects you can control include:
+
+- Speaker identity (e.g., "Jon's voice")
+- Speaking style (e.g., "monotone", "expressive")
+- Speaking speed (e.g., "slightly fast")
+- Audio quality (e.g., "very close recording", "no background noise")
+
+The checkpoint was trained on 34 speakers. The full list of available speakers includes: Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, and Emily.
+
+However, the models perform better with certain speakers. A list of the top 20 speakers for each model variant, ranked by average speaker similarity score, is available in the [Parler-TTS inference guide](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency).
+
+## Limitations and Considerations
+
+- The maximum length of the input text may be limited. For long-form content, split your text into smaller chunks. The default training configuration in parler-tts caps audio at 30 seconds and text at 600 characters; see the
+  [parler-tts training README](https://github.com/huggingface/parler-tts/blob/main/training/README.md#3-training).
+- While the model supports various voice characteristics, the exact replication of a specific speaker's voice is not guaranteed.
+- The quality of the generated speech can vary based on the complexity of the input text and the specificity of the voice description.
+
+## Orchestrator Configuration
+
+To configure your Orchestrator to serve the `text-to-speech` pipeline, refer to
+the [Orchestrator Configuration](/ai/orchestrators/get-started) guide.
+
+### System Requirements
+
+The following system requirements are recommended for optimal performance:
+
+- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 12GB** of
+  VRAM.
+
+## API Reference
+
+<Card
+  title="API Reference"
+  icon="rectangle-terminal"
+  href="/ai/api-reference/text-to-speech"
+>
+  Explore the `text-to-speech` endpoint and experiment with the API in the
+  Livepeer AI API Reference.
+</Card>
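
Reviewer sketch, not part of this commit: the "Limitations and Considerations" section in `ai/pipelines/text-to-speech.mdx` advises splitting long text into smaller chunks, and the curl example sends a JSON body whose `description` contains an apostrophe. The Python below is one hedged way to do both. The helper names `chunk_text` and `build_request`, the sentence-based splitting strategy, and the use of the 600-character training default as a request limit are all assumptions for illustration; the Gateway may enforce a different limit.

```python
import json
import re
import urllib.request

# Assumption: reuse the parler-tts training default (600 characters) as the chunk size.
MAX_CHARS = 600


def chunk_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the last space that fits,
        # or mid-word if it contains no space at all.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut])
            sentence = sentence[cut:].lstrip()
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def build_request(gateway_url, text, description,
                  model_id="parler-tts/parler-tts-large-v1"):
    """Build (but do not send) a POST request for the /text-to-speech endpoint."""
    payload = {
        "model_id": model_id,
        "text_input": text,
        "description": description,
    }
    return urllib.request.Request(
        f"{gateway_url.rstrip('/')}/text-to-speech",
        # json.dumps handles quoting, so apostrophes in the description are safe.
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Each chunk can then be passed to `build_request` and sent with `urllib.request.urlopen`. The response schema is not shown in this diff, so handling the returned audio is left to the API reference.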
diff --git a/api-reference/generate/text-to-speech.mdx b/api-reference/generate/text-to-speech.mdx
@@ -0,0 +1,4 @@
+---
+title: "Text to Speech"
+openapi: "POST /api/beta/generate/text-to-speech"
+---
diff --git a/mint.json b/mint.json
@@ -538,6 +538,7 @@
             "ai/pipelines/image-to-video",
             "ai/pipelines/segment-anything-2",
             "ai/pipelines/text-to-image",
+            "ai/pipelines/text-to-speech",
             "ai/pipelines/upscale"
           ]
         },
@@ -602,7 +603,8 @@
             "ai/api-reference/image-to-image",
             "ai/api-reference/image-to-video",
             "ai/api-reference/segment-anything-2",
-            "ai/api-reference/upscale"
+            "ai/api-reference/upscale",
+            "ai/api-reference/text-to-speech"
           ]
         }
       ]