In this lesson, you'll dive into the fundamentals of AI voice agents and their key components, such as speech-to-text, text-to-speech, and large language models, while analyzing the latency introduced at each layer of the voice agent stack. You'll explore how platforms like LiveKit mitigate these latency challenges by providing optimized network infrastructure and implementing low-latency communication protocols. Finally, you'll walk through a minimal example of building a voice agent in Python and touch on practical approaches for evaluating and improving voice agent performance. Let's get started.

Before we dive in, you might be wondering: what exactly is an AI voice agent? Simply put, an AI voice agent brings together speech capabilities and the reasoning power of foundation models to enable real-time, human-like conversations. Voice agents are useful in a wide range of scenarios. In education, they can guide personalized skill development or conduct mock interviews. In business, they can help handle customer service calls by booking a table at a restaurant or assisting with a sale. And because voice agent interactions are hands-free, they can also enhance accessibility. Think of a patient using a voice agent to log symptoms or practice talk therapy at home.

Now let's break down the anatomy of a voice agent stack. The system takes as input the user's voice, such as a question or request, and produces a voice response. In some cases, that audio output might also be synchronized with video, such as a talking-head avatar. But for this course, we'll focus on audio-only interactions. When designing the internals of a voice agent, we have two main options. The first is to use a speech-to-speech or real-time API. This option is simpler to implement, but it offers less flexibility and control over the agent's behavior. The second, and the approach we'll primarily focus on in this course, is the pipeline approach. The voice agent pipeline is made up of three components: a speech-to-text model or API, an LLM or agentic framework, and a text-to-speech model or API that generates the final audio output.

Let's take a more detailed look at each component of the voice agent pipeline. The first component is Automatic Speech Recognition, or ASR, also referred to as speech-to-text or STT. This involves transcribing a given audio signal, typically a waveform, into text. The input is raw audio and the output is the corresponding transcription. The second component is the large language model, or a broader agentic workflow, which generates a response based on the transcribed text. This layer may involve one or more LLM agents, often enhanced with tool use, memory, or planning abilities. As an aside, a voice agent can also produce transcripts enriched with supporting materials, such as images or links, as a byproduct of its functionality. This pipeline can be generalized to support such multimodal LLM responses, enabling the display of both spoken output and visual context when needed. The third component is text-to-speech, or TTS, also known as speech synthesis. This is the task of converting the generated text back into natural-sounding, intelligible speech. The input here is text and the output is audio. In a demo you'll see shortly, the synthesized voice is that of Andrew.

Beyond these three main components, it is important to highlight two additional tasks that are essential for correctly processing human speech and occur before the ASR step.
The first is Voice Activity Detection, or VAD, which determines whether human speech is present in the audio signal. For example, a lack of detected speech may correspond to an unnatural pause or to sections dominated by background noise. The second task is end-of-turn detection, which identifies when a speaker has completed their turn in the conversation. This is a non-trivial challenge, as speech often includes pauses of varying lengths depending on the speaker's language, habits, and expressive style.

Now that we've reviewed the core components of the two voice agent architectures, the next question is: how do we actually build these components? Fortunately, we don't need to start from scratch. Instead, we can focus on the parts of the stack that matter most for our specific use case. For instance, depending on the application, some components may require more attention than others. If you're developing a voice agent for a clinical setting, for example, the ASR component becomes critical. You'll need to accurately recognize specialized medical vocabulary and meet strict precision requirements. On the other hand, if you're working on a restaurant booking agent, the LLM or agentic workflow becomes more important, as you'll need robust reasoning and reliable tool use to avoid issues like overbooking tables.

Unless your use case requires a specialized or on-device model, you can choose from a wide range of providers for TTS, STT, and LLM APIs. On this slide, we've listed some of those options in the gray boxes, and as you can see, many providers are available and worth exploring. In the demo at the end of the lesson, we use OpenAI for STT and ElevenLabs for TTS. For the text-to-speech output, we trained an ElevenLabs custom voice model using a recording of Andrew's voice to create a consistent and personalized AI voice clone for speech synthesis. Finally, for the LLM component, if low latency is a key requirement, you may want to explore open-source Llama models served by fast inference providers such as Groq, Cerebras, or TogetherAI. If you choose to build your voice agent using the speech-to-speech or real-time API approach, there are also several providers available to support that workflow. These APIs abstract away much of the underlying pipeline and are ideal for use cases where rapid deployment is more important than fine-grained control.

Regardless of which agent architecture you choose, you'll face one major challenge: timing. Humans expect responses within a narrow window, and if the system lags, the interaction quickly feels unnatural. To maintain a smooth conversational flow, your infrastructure must support low-latency audio streaming and efficiently manage input and output streams. In the pipeline approach, for example, you'll need to orchestrate real-time interactions across multiple users while ensuring seamless transitions between components like VAD, STT, the LLM, and TTS, without introducing noticeable delays at any stage.

When considering latency, it is helpful to start with a baseline: how quickly do humans expect a response in natural conversation? User studies have shown that, on average, people anticipate a response within 236 milliseconds after their conversation partner finishes speaking. However, the standard deviation is quite high, around 520 milliseconds, which reflects the natural variability in human speech. It's also important to note that these numbers are based on English speakers.
Other languages can exhibit significantly faster or slower response times. Now, if we look at the table on the slide, we can see the latency introduced by each step in the voice agent pipeline. In the best-case scenario, with efficient input/output stream handling, the lower bound for a full voice agent response is approximately 540 milliseconds. This places it just within one standard deviation of human expectations. However, depending on the service level agreements of the providers you use, the latency can increase to over a second and a half, which users will almost certainly notice.

So how can we approach the lower bounds of latency that align with natural human conversation? The key lies in real-time peer-to-peer communication, which enables direct data exchange between devices, bypassing intermediary servers and significantly reducing delays. In this setup, your client, such as a web browser or mobile device, acts as one peer, while your voice agent backend functions as the other. LiveKit's infrastructure is designed to support this with a globally distributed mesh network for media forwarding. At the core of this system are several technologies. First, WebRTC, or Web Real-Time Communication, is an open-source project that provides web and mobile applications with real-time communication capabilities through standardized APIs. Second, WebSocket is used to establish a client-server handshake, enabling efficient signaling and session management. Finally, LiveKit's open-source implementation relies on asynchronous processing and careful management of input/output streams and streaming APIs, particularly for the STT, TTS, and LLM components. This ensures smooth, low-latency performance throughout the voice agent pipeline.

While the underlying peer-to-peer infrastructure is complex, and something Russ will walk through later in the course, LiveKit abstracts much of that complexity and makes it remarkably simple to define an AI voice agent with just a few lines of code. On this slide, you can see a minimal example of how to set up a voice agent backend. There are three main components to focus on. First, defining the agent itself, including any prompts. Second, the agent session, which links together your chosen speech-to-text, LLM, and text-to-speech providers into a functional pipeline. And third, the entrypoint() function, which is executed as the main function for each new peer-to-peer session. You'll dive deeper into the code and configurations later in the course; a rough sketch of this structure also appears a little further below. For now, note that in the demo of this lesson, we set the TTS voice ID variable in the code to reference the custom ElevenLabs voice clone we trained for the Andrew avatar.

To wrap up this section, I want to highlight a few unique challenges that come with building voice-based applications. First, speech disfluencies, such as filler words like "um" or long pauses, can introduce artifacts in transcription and affect end-of-turn detection. These issues then propagate into the input given to the LLM, potentially reducing output quality. Second, if you're working on multilingual voice agents, keep in mind that multilingual ASR models generally underperform compared to English ASR.

Now let's turn briefly to latency optimization. Accurately measuring latency in practice is challenging, especially when trying to separate client-side delays from server-side delays. To help minimize those delays by design, LiveKit provides a low-latency network infrastructure.
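To make those three pieces concrete, here is a rough sketch of a minimal voice agent backend, loosely modeled on the LiveKit Agents Python SDK with OpenAI, ElevenLabs, and Silero plugins. Treat it as illustrative rather than the course's exact demo code: plugin imports, class names, and parameters such as voice_id can differ between SDK versions, and YOUR_CUSTOM_VOICE_ID is a placeholder.

```python
# Rough sketch of a minimal STT-LLM-TTS voice agent backend (illustrative only).
# Assumes the LiveKit Agents Python SDK plus the openai, elevenlabs, and silero
# plugins are installed; exact names and parameters may vary by version.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import elevenlabs, openai, silero


# 1) The agent itself, including its prompt.
class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a friendly voice tutor. Keep replies short and conversational."
        )


# 3) The entrypoint, executed for each new peer-to-peer session.
async def entrypoint(ctx: agents.JobContext):
    # 2) The agent session, wiring VAD, STT, LLM, and TTS into one pipeline.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=elevenlabs.TTS(voice_id="YOUR_CUSTOM_VOICE_ID"),  # placeholder voice ID
    )
    await session.start(room=ctx.room, agent=Assistant())
    await ctx.connect()


if __name__ == "__main__":
    # Starts a worker process that handles incoming jobs from LiveKit.
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

A worker like this is typically started from the command line and then paired with a frontend such as the LiveKit playground, which is how the demo later in this lesson is set up.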
In STT-LLM-TTS pipelines, the LLM component is often the main source of latency. To reduce it, you can use smaller or quantized models when self-hosting, or opt for faster inference providers if you're relying on LLM APIs. You can also prompt the LLM to generate shorter or staged replies to reduce the perceived response time (see the sketch at the end of this lesson).

Now let's see this in action. I'm going to show you a brief demonstration of the voice agent we use for our Andrew avatar. The voice agent backend is running locally on my laptop. For the frontend, we'll use the LiveKit playground, which is a versatile web frontend that makes it easy to test your multimodal AI agent without having to worry about the UI until you're happy with your backend. Here on the left-hand side, you can see the voice agent code, which is very similar to the minimal example code I showed you earlier. You'll learn more about voice pipeline agents later in the course. In the backend on the right-hand side, you see the code running locally on my computer in the command line. Here's the LiveKit playground running in my browser. I've created a project for the course, LiveKit course. And now I'm going to connect.

Hi, how are you? Do you want to learn more about machine learning? Tell me about AI voice agents. AI voice agents are systems that use natural language processing and machine learning to understand and respond to human speech, providing information or performing tasks. Are you interested in building one or learning how they work in more detail? Wow, I think it turned out really well. It sounds a lot like Andrew, don't you think? Okay, let's try voice activity detection. If this works, we should be able to interrupt, and the avatar should stop speaking. Tell me about speech-to-text models. Speech-to-text models convert spoken language into written text using techniques like deep learning, often leveraging architectures like recurrent neural networks or transformers... I meant text-to-speech. Text-to-speech... TTS models transform written text into spoken words using machine learning algorithms to produce natural-sounding speech. Are you curious about how these models can be applied in real-world applications? Okay, the voice activity detection allowed us to interrupt. This is great.

Let's go on to the next lesson, where you'll learn more about the details of the end-to-end architecture of a voice agent.
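As a rough illustration of the latency tips above (choosing a fast inference provider and prompting for shorter replies), here is a small sketch that times a single LLM round trip against an OpenAI-compatible endpoint. The Groq base URL, the model name, and the GROQ_API_KEY environment variable are illustrative assumptions, not the course's demo configuration; any OpenAI-compatible provider can be substituted.

```python
# Rough sketch (not the course's demo code): time one LLM round trip against an
# OpenAI-compatible fast-inference endpoint while prompting for short replies.
# Endpoint URL, model name, and the GROQ_API_KEY variable are illustrative.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # any OpenAI-compatible provider works
    api_key=os.environ["GROQ_API_KEY"],
)

SYSTEM_PROMPT = (
    "You are a voice assistant. Keep every reply under two sentences and "
    "offer to elaborate instead of explaining everything at once."
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # example model; use whatever your provider serves
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Tell me about AI voice agents."},
    ],
)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"LLM round trip: {elapsed_ms:.0f} ms")
print(response.choices[0].message.content)
```

Timing only the LLM call in isolation like this is a simplification; in a deployed agent you would also want to account for STT, TTS, and network time, which is where LiveKit's streaming infrastructure helps.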