In the previous lesson, we discussed how we'll connect our client device to the voice agent via WebRTC. That's great: your user can now speak to your agent with just 30 to 50 milliseconds of latency. But what is a voice agent, anyway? A voice agent is simply a stateful computer program that can consume and process voice data streaming into it, like from a user speaking into their phone, and generate a spoken response to send back to that user.

Most of the application logic of your agent program will be specific to your use case, but voice agents built with LiveKit's Agents SDK have a persistent WebRTC connection linking them to one or more client devices. The Agents SDK also takes care of things like managing the conversation context and spinning up an instance of your agent for every user who wants to speak to it. Each agent instance may keep a database connection for performing actions like RAG, interact with libraries or services running on the same machine, or make HTTP requests or establish WebSocket connections to external services for things like speech-to-text or LLM inference. In the lessons that follow, you'll use LiveKit's Agents SDK to build your voice agent. You'll build it in Python, but you can use Node.js as well.

How do we give the agent the ability to listen, think, and speak back to the user? Let's zoom in on that part. This is what's known as a pipeline, cascaded, or component-model voice architecture. As voice input from a user streams into the agent, it passes through an ordered series of steps before a voice response from the agent is sent back to the user.

First, the agent relays the user's speech to a smaller speech-to-text AI model, often abbreviated as STT. The STT model converts the speech to text in real time and passes it back to the agent. Once the user is done speaking and the agent has the full transcription of what the user said, it relays that full transcription to an LLM as the user's prompt. The LLM takes that prompt and runs inference against it. As output tokens are generated, they stream back out from the LLM to the agent. The agent collects and organizes these tokens. For every sentence it collects, the agent relays that sentence to another, smaller text-to-speech model, often abbreviated TTS. The TTS model converts those sentences back into speech and streams them to the agent, which relays the bytes of voice data, as it receives them, back to the client device over the persistent WebRTC connection we talked about in the previous lesson.

So that's the full end-to-end pipeline your agent runs: it takes user speech, converts it into text, puts that text through an LLM, takes the tokens coming out of the LLM, pipes them through TTS, and streams the resulting audio from TTS back to the user.

I want to turn our attention now, though, to one of the hardest problems in building convincingly human voice agents: turn detection. In a human conversation, this alternation between speaking and listening is known as turn taking. Humans are quite good at automatically knowing when they should speak or listen. Turn detection is the heuristic a voice agent uses to know when the user is done speaking so it can respond. Contemporary turn detection systems combine two signals. The first is signal processing: is the user actually speaking or not? The second is semantic processing: what did the user actually say?
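Before we dig into how turn detection works, here's roughly what wiring this cascaded pipeline together looks like in code. Treat this as a minimal sketch, assuming the 1.x Python Agents SDK's AgentSession API with illustrative plugin and model choices (Deepgram for STT, OpenAI for the LLM, Cartesia for TTS, Silero for VAD); you can swap in whichever providers you prefer. The vad and turn_detection arguments are the turn detection components we'll unpack next.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel


async def entrypoint(ctx: agents.JobContext):
    # One AgentSession per user: it owns the STT -> LLM -> TTS pipeline
    # and the persistent WebRTC connection to the client device.
    session = AgentSession(
        stt=deepgram.STT(model="nova-3"),     # speech-to-text (model name is illustrative)
        llm=openai.LLM(model="gpt-4o-mini"),  # LLM inference (model name is illustrative)
        tts=cartesia.TTS(),                   # text-to-speech
        vad=silero.VAD.load(),                # voice activity detection (covered next)
        turn_detection=MultilingualModel(),   # semantic turn detector (covered next)
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
    await ctx.connect()


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

When this worker runs, the SDK spins up an instance of the agent for each user session and handles the streaming between these components for you, which is exactly the flow we just walked through.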
Here's how this is typically done. As user audio streams into the agent, it's not only sent to STT; in parallel, it's also streamed into something called voice activity detection, or VAD for short. VAD analyzes the audio signal. It's a small binary-classifier neural network that simply outputs whether human speech was detected in the input sample or not. When the bit flips from human speech detected to not detected, VAD starts a timer for a developer-configurable number of milliseconds before firing an event marking the end of the user's turn. If VAD detects human speech again before the timer has fired, everything resets.

As the user's speech is converted into text by STT and sent back to the agent, the agent also forwards the transcriptions from STT to another part of the turn detection system: a semantic turn detector model. LiveKit has an open-source transformer model that we've trained in-house. It takes in the user's transcribed speech, along with the transcriptions of the last three or four turns in the conversation, and outputs a prediction of whether the user is done speaking based on what they've said. If VAD has detected silence and set a timer to fire an end-of-turn event, but the semantic model believes the user isn't done speaking yet, for example because they're just pausing between thoughts, the turn detector will delay the timer from firing for a configurable amount of time.

Now that we have an understanding of turn detection, this is what our overall voice agent architecture looks like with it incorporated. Only when the turn detection system has fired an end-of-turn event can the agent proceed to forward the user's transcribed speech to the LLM for inference.

VAD isn't just used for turn detection, either. It's also used for interruption handling. To mimic the dynamics of two humans having a conversation, we need to handle the case where a user interrupts the voice agent mid-speech. This could happen for a variety of reasons: the LLM may be saying more than necessary, the user may have changed their mind about something, or the user may want to correct themselves. Under the hood, since the user's speech is already being passed into VAD for turn detection, we use the presence of human speech, rather than the absence of it as in turn detection, to signal an interruption. When an interruption occurs, every part of the voice pipeline downstream is flushed: if the LLM was performing inference at that moment, it's stopped, and if any agent speech was being generated by TTS, that's stopped as well.

One final update to our overall architecture is context management. When it's the agent's turn to think and speak, not only is the most recent transcription of what the user said sent to the LLM, but the full history of everything that's been said thus far by either party during that session is sent along as well. This includes things like function calls and their results, which are part of most production-grade voice agents. LiveKit's Agents SDK also automatically synchronizes the LLM context when the user interrupts the agent: it uses timestamps to determine the last thing the user heard played back from the agent and aligns the conversation history on the agent side to that point.

And now we've put it all together. That's it. In the next lesson, we're going to take all of the concepts we've discussed here in theory and put them into practice by building a voice agent that you can speak to.
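Before we do that, it may help to see the end-of-turn logic from this lesson written out. Below is a simplified, framework-free sketch of the interplay between VAD and the semantic turn detector described above. The vad_speaking and get_transcript callables, the semantic_model object, and the delay values are all hypothetical stand-ins; in practice, the Agents SDK implements this for you.

```python
import asyncio

# Hypothetical stand-ins: vad_speaking() returns True while VAD detects
# human speech; semantic_model.likely_done(transcript) returns True when
# the semantic turn detector predicts the user has finished their thought.

BASE_SILENCE_MS = 500       # developer-configurable wait after speech stops
EXTENDED_SILENCE_MS = 3000  # longer wait when the user seems mid-thought


async def wait_for_end_of_turn(vad_speaking, semantic_model, get_transcript):
    """Fire end-of-turn only after VAD silence has lasted long enough,
    extending the delay when the semantic model thinks the user isn't done."""
    while True:
        # Wait for VAD to flip from "speech detected" to "no speech".
        while vad_speaking():
            await asyncio.sleep(0.01)

        # Pick the delay: longer if the transcript looks unfinished.
        delay_ms = (
            BASE_SILENCE_MS
            if semantic_model.likely_done(get_transcript())
            else EXTENDED_SILENCE_MS
        )

        # Count down in 10 ms steps; if speech resumes first, reset everything.
        elapsed = 0
        speech_resumed = False
        while elapsed < delay_ms:
            await asyncio.sleep(0.01)
            elapsed += 10
            if vad_speaking():
                speech_resumed = True
                break

        if not speech_resumed:
            return  # end of turn: the agent may now send the transcript to the LLM
```

Only when this coroutine returns does the agent forward the user's transcribed speech to the LLM. The inverse check, speech starting while the agent is talking, is what signals an interruption and flushes the downstream pipeline.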