[draft] Support STT with Google realtime API #1321
base: main
if self._model.capabilities.supports_truncate:
    user_msg = ChatMessage.create(
Why is this only done when the model supports truncate? It seems you are trying to update an item rather than truncate it.
Some methods are not implemented in Gemini. We maintain remote conversations in OpenAI, but not in Gemini, so we should prevent invoking those methods when using Gemini. The purpose of supports_truncate is to differentiate between the two.
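A minimal sketch of the gating described in this reply, assuming a capabilities object that carries a boolean `supports_truncate` flag. `Capabilities`, `FakeModel`, and `maybe_update_remote_item` are hypothetical names used only for illustration; they are not the plugin's actual API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Capabilities:
    # Hypothetical flag: True when the backend keeps a server-side conversation
    # (as described for OpenAI), False when it does not (as described for Gemini).
    supports_truncate: bool = False


@dataclass
class FakeModel:
    capabilities: Capabilities
    remote_items: List[str] = field(default_factory=list)


def maybe_update_remote_item(model: FakeModel, text: str) -> bool:
    """Touch the remote conversation only when the backend supports it."""
    if not model.capabilities.supports_truncate:
        return False  # skip methods the backend does not implement
    model.remote_items.append(text)
    return True


openai_like = FakeModel(Capabilities(supports_truncate=True))
gemini_like = FakeModel(Capabilities(supports_truncate=False))
```

As the reviewer's question suggests, a flag named after truncation is being used here as a proxy for "has a remote conversation", so a more specific flag name might read more clearly.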
    text="LiveKit is the platform for building realtime AI. The main use cases are to build AI voice agents. LiveKit also powers livestreaming apps, robotics, and video conferencing.",
    role="assistant",
)
chat_ctx.append(text="What is the LiveKit Agents framework?", role="user")
Does the last message have to be from the user in order for Gemini to respond first?
No, it can be either assistant or user.
@self._session.on("agent_speech_completed")
def _agent_speech_completed():
    self._update_state("listening")
    if self._playing_handle is not None and not self._playing_handle.done():
Can you include comments on why this is needed?
def _agent_speech_completed():
    self._update_state("listening")
    if self._playing_handle is not None and not self._playing_handle.done():
        self._playing_handle.interrupt()
Why should we interrupt here?
Because we call this function when speech is interrupted as well. They likely made some changes: Gemini now returns server.turn_complete instead of server.interrupted when the agent is interrupted, which is confusing. In both cases, we end up calling this function.
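A minimal sketch of why the handler interrupts defensively, assuming a single callback runs for both a normal end of turn and an interruption. `PlayHandle` and `Agent` are hypothetical stand-ins for illustration, not the framework's classes.

```python
from typing import Optional


class PlayHandle:
    """Hypothetical stand-in for the agent's playback handle."""

    def __init__(self) -> None:
        self._done = False
        self.interrupted = False

    def done(self) -> bool:
        return self._done

    def finish(self) -> None:
        # Playback ran to completion on its own.
        self._done = True

    def interrupt(self) -> None:
        # Playback was cut short.
        self.interrupted = True
        self._done = True


class Agent:
    def __init__(self) -> None:
        self.state = "speaking"
        self.playing_handle: Optional[PlayHandle] = None

    def on_speech_completed(self) -> None:
        # The same callback fires whether the turn ended normally or was
        # interrupted, so audio may still be playing when we get here.
        self.state = "listening"
        handle = self.playing_handle
        if handle is not None and not handle.done():
            handle.interrupt()
```

When playback has already finished, the `done()` check makes the call a no-op, so the handler is safe to run for either event.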
from typing import Any, Dict, List, Literal, Sequence, Union

from livekit.agents import llm

from google.genai import types  # type: ignore
The code here is hard to follow. It's really sad that we don't have types (the structure of the dicts is unclear).
self._transcriber.on("input_speech_done", self._on_input_speech_done)
self._agent_transcriber.on("input_speech_done", self._on_agent_speech_done)
# init dummy task
self._init_sync_task = asyncio.create_task(asyncio.sleep(0))
This isn't really doing anything?
Where do we make sure the transcribed user speech is inside the chat_ctx and always before the generated agent speech?
Yes, it is being called directly from the base class. We need to keep the dummy task unless we wrap the call with capabilities.supports_truncate in the base class.
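A minimal sketch of the placeholder-task pattern under discussion, assuming some caller awaits the attribute unconditionally, as this reply describes. `Session` and `caller` are hypothetical names; only `asyncio.create_task(asyncio.sleep(0))` comes from the diff.

```python
import asyncio


class Session:
    def __init__(self) -> None:
        # Placeholder that completes on the next loop iteration, so the
        # attribute can always be awaited even when there is no real sync work.
        self.init_sync_task: "asyncio.Task[None]" = asyncio.create_task(
            asyncio.sleep(0)
        )


async def caller(session: Session) -> str:
    # Hypothetical caller that awaits the task without checking capabilities.
    await session.init_sync_task
    return "ready"


async def main() -> str:
    # create_task needs a running event loop, so build the session in here.
    session = Session()
    return await caller(session)


result = asyncio.run(main())
```

The alternative mentioned in the reply is to guard the await on the caller's side with a capability check, which would make the placeholder unnecessary.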
I don't think we need it, as the transcriber and LLM are independent of each other.
I'm not sure I follow. The base class is utils.EventEmitter[EventTypes].
How do we get the user messages inside the ChatContext?
I mean from here, when audio transcription is done: https://github.com/livekit/agents/pull/1321/files#diff-4b3e6842c9b1bf3130541b6b2fd18dcc7d1b0051285496eca0355e62938d13fbR351
What I mean here is that with bad timing, or if the VAD events differ, the data inside the chat context will not be "stable". E.g.:
User audio is usually processed in real-time, and we receive transcriptions quickly. However, you're right that these scenarios can occur.
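One way the ordering concern could be addressed is sketched below. This is an illustration under stated assumptions, not what the PR does: it assumes each message carries a hypothetical `turn` counter shared by a user utterance and the agent reply to it, so a late transcription can still be placed before that reply. `Message` and `insert_ordered` are made-up names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Message:
    role: str  # "user" or "assistant"
    text: str
    turn: int  # hypothetical counter shared by an utterance and its reply


def _key(m: Message) -> Tuple[int, int]:
    # Within a turn, the user message sorts before the assistant message.
    return (m.turn, 0 if m.role == "user" else 1)


def insert_ordered(messages: List[Message], new: Message) -> None:
    """Insert `new` so turns stay ascending and user precedes assistant."""
    idx = len(messages)
    # Walk back past any messages that should come after the new one.
    while idx > 0 and _key(messages[idx - 1]) > _key(new):
        idx -= 1
    messages.insert(idx, new)


ctx: List[Message] = []
insert_ordered(ctx, Message("assistant", "Hello, how can I help?", turn=1))
insert_ordered(ctx, Message("user", "Hi there", turn=1))  # transcript arrives late
insert_ordered(ctx, Message("user", "What is LiveKit?", turn=2))
insert_ordered(ctx, Message("assistant", "A realtime platform.", turn=2))
```

With a turn counter, the order in the chat context no longer depends on which event arrives first. Where that counter would come from in the realtime API is left open here.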
No description provided.