This is a Truss for Ultravox using vLLM's OpenAI-compatible server. It provides an efficient, scalable way to serve Ultravox (and other models) behind an OpenAI-compatible API.
This Truss is compatible with a custom version of our bridge endpoint for OpenAI ChatCompletion users, so you can integrate the model into existing applications that use the OpenAI API format. For example:

```python
import os

from openai import OpenAI

# model_id is the ID of your deployed Baseten model
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=f"https://bridge.baseten.co/{model_id}/direct/v1",
)
```
Truss is an open-source model serving framework developed by Baseten. It lets you develop and deploy machine learning models on Baseten (and other platforms like AWS or GCP). Using Truss, you can develop a GPU model with live reload, package models and their associated code, create Docker containers, and deploy on Baseten.
First, clone this repository:
```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd truss-examples/ultravox
```
Before deployment:
- Make sure you have a Baseten account and API key.
- Install the latest version of Truss:
```sh
pip install --upgrade truss
```
With `ultravox` as your working directory, you can deploy the model with:
```sh
truss push
```
Paste your Baseten API key if prompted.
For more information, see the [Truss documentation](https://truss.baseten.co).
This Truss demonstrates how to start vLLM's OpenAI-compatible server. The Truss is primarily used to start the server and then route requests to it. It currently supports ChatCompletions only.
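Conceptually, the serving code follows the pattern sketched below. This is a simplified illustration, not the actual `model.py` of this Truss; the port, helper names, and use of `httpx` are assumptions.

```python
# Sketch of the start-and-route pattern, assuming vLLM's OpenAI-compatible
# server entrypoint (vllm.entrypoints.openai.api_server). Helper names and
# the port are illustrative, not taken from this repository.
import subprocess

import httpx  # assumption: any HTTP client would work here

VLLM_PORT = 8000  # illustrative port choice


def start_vllm_server(model: str, extra_args: list[str]) -> subprocess.Popen:
    # Equivalent to: python -m vllm.entrypoints.openai.api_server --model <model> ...
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(VLLM_PORT),
        *extra_args,  # flags derived from model_metadata.arguments land here
    ]
    return subprocess.Popen(cmd)


def predict(payload: dict) -> dict:
    # Forward an incoming ChatCompletion payload to the local vLLM server.
    resp = httpx.post(
        f"http://localhost:{VLLM_PORT}/v1/chat/completions",
        json=payload,
        timeout=None,
    )
    return resp.json()
```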
In `config.yaml`, any key-value pairs under `model_metadata.arguments` will be passed to the vLLM OpenAI-compatible server at startup. You can use any vLLM-compatible base image.
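For example, a `config.yaml` fragment along these lines would forward arguments to the server at startup. The argument values and base image tag are illustrative, not taken from this repository; check the repo's `config.yaml` for the exact keys this Truss expects.

```yaml
# Illustrative fragment: keys under model_metadata.arguments are passed
# to vLLM's OpenAI-compatible server as startup flags.
base_image:
  image: vllm/vllm-openai:latest  # any vLLM-compatible base image works
model_metadata:
  arguments:
    model: fixie-ai/ultravox-v0.2
    max-model-len: 8192
```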
The API follows the OpenAI ChatCompletion format. You can interact with the model using the standard ChatCompletion interface.
Example usage:
```python
import base64

from openai import OpenAI

client = OpenAI(
    api_key="YOUR-API-KEY",
    base_url="https://bridge.baseten.co/MODEL-ID/v1",
)

# Base64-encode the audio clip you want the model to hear.
# ("sample.wav" is a placeholder; use any WAV file.)
with open("sample.wav", "rb") as f:
    base64_wav = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="fixie-ai/ultravox-v0.2",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the following: <|audio|>"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:audio/wav;base64,{base64_wav}"},
                },
            ],
        }
    ],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta)
```
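Each streamed `chunk` carries an incremental `delta`; to reconstruct the full reply, accumulate the non-empty `delta.content` values as they arrive.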
We are actively working on enhancing this Truss. Some planned improvements include:
- Adding support for [distributed serving with Ray](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- Implementing model caching for improved performance
Stay tuned for updates!
If you have any questions or need assistance, please open an issue in this repository or contact our support team.