In this lesson, we're going to walk through the different building blocks of a voice agent, from the moment someone starts speaking all the way to when the agent responds. We'll break down each stage in the pipeline, talk about trade-offs, and look at where you can make decisions that really shape the user experience. Whether you're optimizing for speed, quality, or control, understanding how each layer works will help you build smarter, more effective voice agents.

There are two main types of voice agents. First, there's the pipeline approach, which Russ walked through in detail earlier. And then there's speech-to-speech, or real-time, agents, which Nedalina introduced back in the first lesson. Speech-to-speech agents are usually simpler to implement, and they can sound really natural. But in exchange, you give up control: since the model takes in speech and outputs speech, it's harder to see or tweak what's happening in the middle. Pipeline agents, on the other hand, have more moving parts and, yes, more complexity. But you get fine-grained control over each stage. You can see and manage exactly what's coming in and going out at each step, which can be a big deal for real-world applications.

Every system has trade-offs. At some point, you'll probably have to choose between latency, quality, and cost. One of the big advantages of using a pipeline approach is that you don't have to make that trade-off globally. You can change out different parts of your system based on what matters most for your use case. In the example that Nedalina gave earlier, if you're doing something like restaurant bookings, you might want to optimize for LLM reasoning, getting the best possible response from the model. But if you're handling something like medical triage, where accurate transcription is critical, then it might make more sense to spend more of your latency budget on the speech-to-text layer instead.

All right, let's talk about what really matters when the rubber meets the road: some of the key things to keep in mind as you're designing each of these sections. Starting with voice activity detection. This component only sends audio when it detects someone is actually speaking. That's a big deal: it reduces hallucinations from your speech-to-text model, and it cuts down on costs, since you're not sending silence for transcription when there's no voice to transcribe.

Then we've got the speech-to-text layer. This is where we need to make some key decisions, like which languages we want to support, whether we're doing direct speech translation, and whether we want to use a specially trained model for recognizing voice in narrower use cases like telephony.

The final two steps in your voice agent are the LLM layer and text-to-speech. The LLM layer is the one that people are already most familiar with: this is where we run text-to-text inference to get a response from the large language model. If you want to add things like content filtering, this is the layer where you do it. It's also where you'll see the biggest latency hits, depending on which model you're using. And finally, you have text-to-speech, or TTS. This layer takes the text from the LLM and turns it back into speech. The TTS layer is where you'll choose things like which voice or accent to use, and whether you want to apply any pronunciation overrides for specific words or phrases that you want spoken in a certain way.

Now let's put everything we've learned into action and get coding.
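Before we do, here's a minimal sketch of how those stage-by-stage choices might look in code. It assumes the LiveKit Agents Python SDK with its silero, deepgram, openai, and elevenlabs plugins; the specific plugins, models, and constructor arguments are illustrative choices for this example, not necessarily the ones used in the course.

```python
# A minimal sketch of swapping pipeline components per use case.
# Assumes the LiveKit Agents Python SDK with the silero, deepgram,
# openai, and elevenlabs plugins; names here are illustrative.
from livekit.agents import AgentSession
from livekit.plugins import deepgram, elevenlabs, openai, silero


def booking_session() -> AgentSession:
    # Restaurant bookings: spend the latency budget on LLM reasoning.
    return AgentSession(
        vad=silero.VAD.load(),           # only forward audio that contains speech
        stt=deepgram.STT(),              # fast, general-purpose transcription
        llm=openai.LLM(model="gpt-4o"),  # larger model for higher-quality responses
        tts=elevenlabs.TTS(),
    )


def triage_session() -> AgentSession:
    # Medical triage: spend more of the budget on accurate transcription.
    return AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2-medical"),  # domain-tuned STT (illustrative model name)
        llm=openai.LLM(model="gpt-4o-mini"),       # smaller, faster LLM
        tts=elevenlabs.TTS(),
    )
```

The point is that each stage is an independent plugin, so you can tune one layer for your use case without touching the others.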
First, we'll import the necessary LiveKit agent classes and plugins, including OpenAI for speech-to-text and as the inference layer, ElevenLabs for TTS, and Silero for voice activity detection. We'll also import dotenv, so we can get our environment variables into memory, and logging, so that we can see what's going on with our agent.

We define our Assistant class, which inherits from LiveKit's Agent class. This assistant is given basic instructions about its role, and it will keep track of what's been said so far in the conversation. By default, requests that are sent to an LLM for a response are stateless, but we want to maintain a conversation with history. The assistant will keep all of the messages that we've sent back and forth in context, so that it can hold a conversation that's aware of the things we've already said. It also keeps track of whose turn it is to talk, whether it can be interrupted, and which tools, if any, it has at its disposal to answer questions for the user. Today we're just going to set the instructions parameter and inherit all of the other defaults from the base Agent class.

Next, we'll define an asynchronous entrypoint function, which will run when LiveKit tells our agent that it's needed. By default, every new room will request an agent. Rooms are the primitive that connects agent sessions to users: when an agent and a user are having a conversation, that conversation is happening inside a room. Step by step, this entrypoint function connects to a LiveKit room, we define an agent session with all of the plugins we'll need to listen and speak to the user, and we assign our assistant to the session. Finally, we use the LiveKit Agents Jupyter command to register the application with LiveKit. This will allow our agent to be dispatched to rooms. When we run this next cell, we'll be able to have a conversation with our agent.

Hello there. It's wonderful to have you here. How can I assist you today? Hi there. What kind of things can you help with? Hi. I'm here to help with a wide range of topics. Whether you need information, help solving a problem, or just want to chat, feel free to ask about anything. Great. So our agent works.

Now that our agent is running, let's change the voice out for something else. We'll scroll back up to where we define our custom agent and add a voice ID. Now, when we run our agent: Hello there. It's wonderful to have you here. How can I assist you today? Hi, Roger, thanks for joining us. I love your voice. Thank you so much. It's great to be here with you. How can I make your day better?

To recap: in this lesson we got our agent running, and we changed some of the settings so that we could use a different voice. In the next lesson, we'll learn a little bit about metrics and how we can optimize our agents.
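For reference, here's a condensed sketch of the agent we walked through in this lesson, including where the voice override goes. It assumes the LiveKit Agents Python SDK with the openai, elevenlabs, and silero plugins running inside a notebook; exact class names, model names, and constructor arguments may differ slightly from the course notebook.

```python
# Condensed sketch of the lesson's pipeline agent, run from a notebook.
# Assumes the LiveKit Agents Python SDK and its openai, elevenlabs, and
# silero plugins; details are illustrative and may vary by version.
import logging
from dotenv import load_dotenv

from livekit import agents
from livekit.agents import Agent, AgentSession, WorkerOptions, jupyter
from livekit.plugins import elevenlabs, openai, silero

load_dotenv()                                  # pull API keys from .env into the environment
logging.basicConfig(level=logging.INFO)        # so we can see what the agent is doing
logger = logging.getLogger("voice-agent")


class Assistant(Agent):
    def __init__(self) -> None:
        # Only the instructions are set here; turn taking, interruptions,
        # chat history, and tools all come from the Agent defaults.
        super().__init__(instructions="You are a helpful voice AI assistant.")


async def entrypoint(ctx: agents.JobContext):
    # Connect to the room that requested this agent.
    await ctx.connect()

    # Assemble the pipeline: VAD -> STT -> LLM -> TTS.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=elevenlabs.TTS(),  # to change the voice, pass a voice ID here,
                               # e.g. elevenlabs.TTS(voice_id="...") (argument
                               # name may vary by plugin version)
    )

    # Attach our assistant to the session inside this room.
    await session.start(room=ctx.room, agent=Assistant())


# Register the application with LiveKit from the notebook so the agent
# can be dispatched to new rooms.
jupyter.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```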