Let's now add voice to an agent you have already built. In this lesson, you will take an agent and give it a voice in about 10 lines of code, without touching any of your existing logic. Your prompts, your RAG, your tools, all stay exactly as they are. All right, let's go.

Okay, in lesson two, we added voice to an application. In lesson three, we're going to add voice to an agent. Here's a situation most of you are actually in. You already have an LLM in production. The prompt's tuned, the tools work, RAG is wired up, the eval suite is green, and someone walks in and asks, "Can we add a voice surface?" Now, the last thing you want to do is rewrite all of it. What you actually want is three things. Keep your agent exactly as is: prompt, tools, RAG, evals, all of it. Add voice as a transport layer, not a rebuild. And don't sacrifice latency or naturalness for the privilege. So the interesting question for this lesson: how do you wrap voice around an agent without breaking how it works?

The way most platforms do this is what I call the sandwich. Speech-to-text in front, your agent in the middle, text-to-speech at the back. Looks simple until you realize what just got handed to your agent. Now your agent owns the dialogue flow. It owns turn-taking, it owns barge-in, it owns paralinguistics, except speech-to-text already threw away all the tone, the pitch, the hesitation, the emotion before your agent even saw the input. And there are real costs underneath. Every turn pays for speech-to-text plus LLM plus text-to-speech sequentially, so latency goes up. Your agent has to do dialogue work it wasn't built for: backchannels, the "mm-hmm", when to wait, when to ask. You lose paralinguistic signal entirely. And the user sits in silence while your agent runs, because nothing in the stack is keeping the conversation alive. The sandwich works for a demo. It does not survive contact with real users.

Our approach is different. We put a realtime voice-to-voice brain in front of your agent.
It has its own model. It handles all the dialogue work: listening, turn-taking, paralinguistics, the conversational glue. And when the user asks something substantial, the Vocal Bridge brain delegates just that query to your agent. Your agent stays exactly as is. The wins fall out of the architecture. Latency is low, because there is no speech-to-text to LLM to text-to-speech waterfall on the dialogue itself. Paralinguistic context survives: the Vocal Bridge brain hears tone and hesitation and packages that as context when it delegates. The user keeps talking while your agent runs in the background, because the Vocal Bridge brain keeps the conversation alive. And there is no agent rewrite. Your agent answers structured queries just like it would for chat.

Okay, here's what one turn actually looks like. The notebook walks you through wrapping voice around an LLM-powered agent, the kind of thing most of you already have. A user asks, "What's the most recent Nvidia stock price?" with a slight hesitation in their voice. The Vocal Bridge brain hears both the question and the uncertainty. It immediately speaks a quick acknowledgement so the user isn't in dead air, something like "Let me check on that." At the same time, it sends your agent a structured query agent call, including information about the tone and other paralinguistics. Your agent runs the same code path it would for a chatbot: RAG over your internal knowledge graph, or maybe a web search tool. No dialogue logic added. The Vocal Bridge brain takes that answer and reads it back naturally, slower and with confirmation pauses, because the user sounded uncertain. And the user can interrupt at any time. That's the whole pattern. Voice on the outside, your agent untouched in the middle. Let's go build it.

All right, it's time for lesson three, voice for your existing agent. The pattern in this notebook is the lowest-effort way to give voice to an LLM agent you already have. We are not going to rewrite anything.
We're just going to wrap voice around the agent itself. Vocal Bridge sits in front as a thin voice layer. Every spoken question goes to your endpoint as a query agent action. Your endpoint responds with a string, and Vocal Bridge speaks it. That's the whole pattern.

Now, same setup story as lesson two. In this DeepLearning.AI environment, both the Vocal Bridge key and the Anthropic API key are already populated in the environment. You don't have to do anything. Outside of here, on your own machine, you would set both of them yourself as environment variables.

A quick note: for simplicity, the agent we are going to build in this notebook is just Claude with a web search tool. In a real production system, your agent might be way more complicated. It could be built with LangChain, LangGraph, LlamaIndex, Anthropic's agent SDK, or your own in-house framework with RAG, tools, memory, business logic, all of it. And the pattern still works. Whatever your agent looks like, all Vocal Bridge needs from you is a function that takes a query string and returns a response. That's the entire interface.

So we make sure both API keys are loaded. We run the standard imports, assert that the keys are present, and we are all set. Now, a quick note on the helpers as well. There are four helpers this time. mint_token and vb you are already familiar with from the previous lesson. The new ones are start_query_server, which spins up a tiny FastAPI server in a background thread, and voice_widget, which renders the in-notebook widget with the right React hook wired in.

All right, so the next step is to build the agent, text-only first. This is the agent-first principle in action. Every voice agent starts life as a text agent. So we are going to build a Claude-powered agent with access to web search and prove it works on its own with the text modality, no voice in the loop. And then in the next step we will wrap voice around it.
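The interface described here, a function that takes a query string and returns a response string, can be sketched in a few lines. This is a minimal illustration rather than the notebook's code: AgentFn, echo_agent, and handle_voice_query are hypothetical names, and the stand-in agent makes no network call so the sketch runs on its own.

```python
from typing import Callable

# Hypothetical type alias: the whole contract the voice layer relies on.
AgentFn = Callable[[str], str]


def echo_agent(query: str) -> str:
    """Stand-in agent (assumption). A real one might call an LLM, RAG, or tools."""
    return f"You asked: {query}"


def handle_voice_query(agent: AgentFn, query: str) -> str:
    """Pass a transcribed question to the agent and return text to be spoken."""
    reply = agent(query.strip())
    # Coerce to str so the voice layer always receives plain text.
    return str(reply)


print(handle_voice_query(echo_agent, "  say hi in five words "))
```

Any agent, however complex internally, can be swapped in for echo_agent as long as it keeps this text-in, text-out shape.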
From the voice layer's perspective, the function we are about to write is the entire agent. Vocal Bridge just calls it with a question or a query, and it returns a string. Vocal Bridge communicates that back to the user. That's the whole interface. So we'll start with some quick imports, importing anthropic here, instantiating the Claude client, and then defining answer, which is the entry point for your agent. It accepts a query and runs inference against claude-sonnet-4-6. We define the web search tool here, and then this entry point returns the response from your LLM, which has access to the web search tool, back to the caller. So we're going to test it here with a quick query: "say hi in five words". There you go. We can see that this agent is working with the text modality.

Up next, we will expose the agent over HTTP. Vocal Bridge needs an endpoint it can POST queries to. In production, that's just another route in your existing backend. It could be Flask, FastAPI, a LangChain or LangSmith endpoint, whatever you already use. Here in the notebook, we run a tiny FastAPI server in a background thread, so the in-page widget has somewhere to call. It's one single call: we spin up the server, run a health check, and it returns a URL. So we can see that the query endpoint is now up at this location.

All right. Now, to make this work, a Vocal Bridge agent in AI agent mode is going to need two artifacts. One is a system prompt for the voice layer, which tells Vocal Bridge what to delegate versus what to answer itself. The other is a configuration, an ai-agent.json file declaring what's on the other side, so that Vocal Bridge knows when to forward and delegate queries versus handling them by itself. So, let me go through both of them, starting with the prompt. This is the voice layer prompt. The key idea is right at the top: you are the ears and the mouth, Claude is the brain. Don't make up answers; delegate every substantial question to the agent endpoint.
And the most important section here is the bridge line instruction. The moment Vocal Bridge delegates a query, it has to immediately say something short, like "hang on, checking that", because otherwise the user just sits in dead air for 1 to 6 seconds while Claude comes up with a response. That bridge line is the difference between feeling fast and feeling broken. Trust me on that one.

And here's the ai-agent.json file. Three flags. The enabled flag turns AI agent mode on for Vocal Bridge. The description flag tells Vocal Bridge what the agent's good at: what is the specialization, what is the task your agent is supposed to carry out? That's how it knows when to delegate versus handle small talk or conversation itself. In this case, the description says this is Claude with web search, that it knows general knowledge and current events, and then: forward any substantive question, only handle small talk yourself. That last part is an instruction to Vocal Bridge. And finally, setting verbatim to false means that Vocal Bridge will lightly polish the phrasing for natural voice delivery. So it'll contract numbers, drop citations, that kind of thing. But if you want the response read back exactly, word for word, you can just flip verbatim to true.

All right. Now we create the agent. We just pass the prompt and the AI agent configuration to the CLI along with the usual flags. Note that the greeting is empty. Claude's going to own the opening line, not Vocal Bridge. We use the same if-else pattern here to make sure that this is idempotent. A pre-provisioned ID in a sandbox gets reused; otherwise we create an agent from scratch.

Now it's time to create the widget. This widget will connect to Vocal Bridge, listen for the query agent event, post to the local endpoint that we just created, and reply back with an agent response. Everything else, the transcript, the audio, the conversational glue, all of it is handled by Vocal Bridge. So we'll run it real quick here. We will click connect.
Then we ask Claude something that needs the web.

User: "Hey, what was Nvidia's closing price yesterday?"

Agent: "Let me check that for you. One moment. Nvidia's closing price yesterday was $220.61, down 0.769% from the previous close of $222.32. The day's low was $217.91 and the high was $224.48."

Great, so as you might have noticed, when I was just having a conversation, making small talk, or just asking the agent how it's going, it decided to hold the conversation itself and responded by itself. But then when I asked what Nvidia's closing price was yesterday, it knew that it needed to forward and delegate that query to the underlying agent, which we can see here. So it forwarded that query to the endpoint that we just created, and it came back with a response. This is the response that Claude came up with, and Vocal Bridge communicated it back to me. And that's how you give voice to an existing AI agent. And this is a little housekeeping here: we will stop the query server so that you can run another one later if you need to.

Okay, so now you've seen it work. Let's recap how it actually works and how Vocal Bridge connects with your agent. The whole "give my LLM a voice" wiring is this one React hook, use AI agent. We use its on query handler to connect it and point it to the endpoint where your agent lives. Every spoken question arrives as a query, and whatever your agent returns, Vocal Bridge communicates it back. That's the entire client-side delta from text chat to voice chat. It works with Claude, GPT, Gemini, your fine-tuned model, your RAG stack, any agent framework, whatever. Anything that takes text in and returns text out will work with this setup. So, great job going through lesson three. See you in the next one.
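To make the recap concrete, the round trip from a query to your agent's reply can be sketched end to end. This is a rough sketch under stated assumptions, not the notebook's implementation: it swaps the FastAPI server for Python's standard-library http.server so it has no dependencies, the JSON field names query and response are illustrative guesses rather than Vocal Bridge's actual schema, and answer is a stand-in for the Claude-backed agent.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def answer(query: str) -> str:
    """Stand-in for the real agent (assumption): any text-in, text-out function."""
    return f"Echo: {query}"


class QueryHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Assumed request body: {"query": "<spoken question as text>"}.
        # The request path is ignored in this sketch.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"response": answer(payload.get("query", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # silence per-request logging


def start_server() -> tuple[ThreadingHTTPServer, str]:
    """Serve on a free local port in a daemon thread; return (server, url)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), QueryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/query"


def post_query(url: str, query: str) -> str:
    """What a client-side query handler would do: POST the text, return the reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"query": query}).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Bypass any system proxy so the call always reaches the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as resp:
        return json.loads(resp.read())["response"]


server, url = start_server()
print(post_query(url, "say hi in five words"))
server.shutdown()      # housekeeping, like stopping the query server in the lesson
server.server_close()
```

Replacing the body of answer with a call into your real agent is the only change needed on the server side; the voice layer never sees anything but the returned string.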