AI is the new electricity and will transform and improve nearly all areas of human lives.

Quick Guide & Tips

💻 Accessing Utils File and Helper Functions

In each notebook on the top menu:

1: Click on "File"

2: Then, click on "Open"

You will be able to see all the notebook files for the lesson, including any helper functions used in the notebook on the left sidebar. See the following image for the steps above.

🔄 Reset User Workspace

If you need to reset your workspace to its original state, follow these quick steps:

1: Access the Menu: Look for the three-dot menu (⋮) in the top-right corner of the notebook toolbar.

2: Restore Original Version: Click on "Restore Original Version" from the dropdown menu.

For more detailed instructions, please visit our Reset Workspace Guide.

💻 Downloading Notebooks

In each notebook on the top menu:

1: Click on "File"

2: Then, click on "Download as"

3: Then, click on "Notebook (.ipynb)"

💻 Uploading Your Files

After following the steps shown in the previous section ("File" => "Open"), then click on "Upload" button to upload your files.

📗 See Your Progress

Once you enroll in this course—or any other short course on the DeepLearning.AI platform—and open it, you can click on 'My Learning' at the top right corner of the desktop view. There, you will be able to see all the short courses you have enrolled in and your progress in each one.

Additionally, your progress in each short course is displayed at the bottom-left corner of the learning page for each course (desktop view).

📱 Features to Use

🎞 Adjust Video Speed: Click on the gear icon (⚙) on the video and then from the Speed option, choose your desired video speed.

🗣 Captions (English and Spanish): Click on the gear icon (⚙) on the video and then from the Captions option, choose to see the captions either in English or Spanish.

🔅 Video Quality: If you do not have access to high-speed internet, click on the gear icon (⚙) on the video and then from Quality, choose the quality that works the best for your Internet speed.

🖥 Picture in Picture (PiP): This feature allows you to continue watching the video when you switch to another browser tab or window. Click on the small rectangle shape on the video to go to PiP mode.

√ Hide and Unhide Lesson Navigation Menu: If you do not have a large screen, you may click on the small hamburger icon beside the title of the course to hide the left-side navigation menu. You can then unhide it by clicking on the same icon again.

🧑 Efficient Learning Tips

The following tips can help you have an efficient learning experience with this short course and other courses.

🧑 Create a Dedicated Study Space: Establish a quiet, organized workspace free from distractions. A dedicated learning environment can significantly improve concentration and overall learning efficiency.

📅 Develop a Consistent Learning Schedule: Consistency is key to learning. Set out specific times in your day for study and make it a routine. Consistent study times help build a habit and improve information retention.

Tip: Set a recurring event and reminder in your calendar, with clear action items, to get regular notifications about your study plans and goals.

☕ Take Regular Breaks: Include short breaks in your study sessions. The Pomodoro Technique, which involves studying for 25 minutes followed by a 5-minute break, can be particularly effective.

💬 Engage with the Community: Participate in forums, discussions, and group activities. Engaging with peers can provide additional insights, create a sense of community, and make learning more enjoyable.

✍ Practice Active Learning: Don't just read or run notebooks or watch the material. Engage actively by taking notes, summarizing what you learn, teaching the concept to someone else, or applying the knowledge in your practical projects.

📚 Enroll in Other Short Courses

Keep learning by enrolling in other short courses. We add new short courses regularly. Visit DeepLearning.AI Short Courses page to see our latest courses and begin learning new topics. 👇

👉👉 🔗 DeepLearning.AI – All Short Courses [+]

🙂 Let Us Know What You Think

Your feedback helps us know what you liked and didn't like about the course. We read all your feedback and use them to improve this course and future courses. Please submit your feedback by clicking on "Course Feedback" option at the bottom of the lessons list menu (desktop view).

Also, you are more than welcome to join our community 👉👉 🔗 DeepLearning.AI Forum

Sign in

Or, sign in with your email

Email

Password

Forgot password?

Don't have an account? Create account

By signing up, you agree to our Terms Of Use and Privacy Policy

Create Your Account

Or, sign up with your email

Email Address

Already have an account? Sign in here!

By signing up, you agree to our Terms Of Use and Privacy Policy

Choose Your Plan

Planning for more users?

What best describes you?

This helps us tune the catalog to suit you best.

Software Engineer

Data Scientist

Machine Learning Engineer

Data Analyst

Product Manager

Entrepreneur

Business / Consulting

Research / Academic

Student

Other

Subscribe to receive AI news, events and course updates from DeepLearning.AI!

Join Team Success

You have successfully joined undefined

You now have access to all Pro features. Click below to start learning!

Session Expired

Session expired — please return to Cornerstone to restart the session and complete the course.

/

Voice for AI Agents and Applications

All Courses

/

Voice for AI Agents and Applications

All Courses

Voice for AI Agents and Applications

Voice for AI Agents and Applications

Course Syllabus

Elevate Your Career with Full Learning Experience

Unlock Plus AI learning and gain exclusive insights from industry leaders

Access exclusive features like graded notebooks and quizzes

Earn unlimited certificates to enhance your resume

Starting at $1 USD/mo after a free trial – cancel anytime

Let's now add voice to an agent you have already built. In this lesson, you will take an agent and give it a voice in about 10 lines of code. Without touching any of your existing logic. Your prompts, your RAG, your tools, all stay exactly as they are. All right, let's go. Okay, in lesson two, we added voice to an application. In lesson three, we're going to add voice to an agent. Here's a situation most of you are actually in. You already have an LLM in production. The prompt's tuned, the tools work, RAG is wired up, eval suite is green, and someone walks in and asks, Can we add a voice surface? Now, the last thing you want to do is rewrite all of it. What you actually want is three things. Keep your agent exactly as is, prompt, tools, RAG, evals, all of it. And add voice as a transport layer, not a rebuild. And don't sacrifice latency or naturalness for the privilege. So the interesting question for this lesson, how do you wrap voice around an agent without breaking how it works? The way most platforms do this is what I call the sandwich. Speech-to-text in front, your agent in the middle, text to speech at the back. Looks simple until you realize what just got handed to your agent. Now your agent owns the dialogue flow. It owns turn-taking, it owns barging, it owns paralinguistics except speech to text already threw all the tone, the pitch, the hesitation, the emotion away before your agent even saw the input. And there are real costs underneath. Every turn pays speech to text plus LLM plus text to speech sequentially. So latency goes up. Your agent has to do dialogue work it wasn't built for. Backchannels, mm-hmm, when to wait, when to ask. You lose paralinguistic signal entirely. And the user sits in silence while your agent runs because nothing in the stack is keeping the conversation alive. The sandwich works for a demo. It does not survive contact with real users. Our approach is different. We put a realtime voice-to-voice brain in front of your agent. It has its own model. It handles all the dialogue work, listening, turn-taking, paralinguistics, the conversational glue. And when the user asks something substantial, the Vocal Bridge brain delegates just that query to your agent. Your agent stays exactly as is. The wins fall out of the architecture. Latency is low. There is no speech-to-text to LLM to text-to-speech waterfall on the dialogue itself. Paralinguistic context survives. Vocal Bridge brain hears tone, hesitation, packages that as context when it delegates. User keeps talking while your agent runs in the background because the Vocal Bridge brain keeps the conversation alive. And there is no agent rewrite. Your agent answers structured queries just like it would for chat. Okay, here's what one turn actually looks like. The notebook walks you through wrapping voice around an LLM-powered agent, the kind of thing most of you already have. A user asks, what's the most recent Nvidia price stock? with a slight hesitation in their voice. The Vocal Bridge brain hears both the question and the uncertainty. it immediately speaks a quick acknowledgement so the user isn't in dead air. Let's say something like let me check on that. And at the same time sends your agent a structured query agent call, including information about the tone and other paralinguistics. Your agent runs the same code path it would for a chatbot, RAG over your internal knowledge graph or maybe a web search tool. No dialogue logic added. The Vocal Bridge brain takes that answer and reads it back naturally. Slower with confirmation pauses because the user sounded uncertain. And the user can interrupt at any time. That's the whole pattern. Voice on the outside, your agent untouched in the middle. Let's go build it. All right, it's time for lesson three, voice for your existing agent. The pattern in this notebook is the lowest effort way to give voice to an LLM agent you already have. We are not going to rewrite anything. We're just going to wrap voice around the agent itself. Vocal Bridge sits in front as a thin voice layer. Every spoken question goes to your endpoint as a query agent action. Your endpoint responds with a string, Vocal Bridge speaks it. That's the whole pattern. Now, same setup story as lesson two. In this DeepLearning.AI environment, both the Vocal Bridge key and the Anthropic API key are already populated in the environment. You don't have to do anything. outside of here on your own machine, you would set both of them yourself as environment variables. A quick note, for simplicity, the agent we are going to build in this notebook is just Claude with a web search tool. In a real production system, your agent might be way more complicated. Could be built with LangChain, LangGraph, LlamaIndex, the Anthropic agent's SDK, your own in-house framework with RAG, tools, memory, business logic, all of it. And the pattern still works. Whatever your agent looks like, all Vocal Bridge needs from you is a function that takes a query string and returns a response. That's the entire interface. So we make sure both the API keys are loaded. The standard imports, asserting that the keys are present and we are all set. Now, a quick note on the helpers as well. Four helpers this time, mint_token and vb, you are already familiar with them from the previous lesson. The new ones are start_query_server, which spins up a tiny FastAPI server in a background thread, and voice_widget, which renders the in-notebook widget with the right React hook wired in. All right, so the next step is to build the agent text only first. This is agent first principle in action. Every voice agent starts life as a text agent. So we are going to build a Claude powered agent with access to web search, prove it works on its own with the text modality, no voice in the loop. And then in the next step we will wrap voice around it. From the voice layer's perspective, the function we are about to write is the entire agent. Vocal Bridge just calls it with a question or a query, it returns a string. Vocal Bridge communicates it back to the user. That's the whole interface. So we'll start with some quick imports, importing anthropic here, instantiating the Claude client, and then defining answer, which is the entry point for your agent. So it accepts a query, runs inference against claude-sonnet-4-6. We define the web search tool here, and then this entry point returns a response from your LLM that has access to the web search tool back to the caller. So we're going to test it here with a quick query, say hi in five words. There you go. We can see that this agent is working with the text modality. Up next, we will expose the agent over HTTP. So Vocal Bridge needs an endpoint it can post queries to. In production, that's just another route in your existing backend. Could be Flask, FastAPI, a LangChain or LangSmith endpoint, whatever you already use. Here in the notebook, we run a tiny FastAPI server in a background thread. So the in-page widget has somewhere to call. It's one single call, we spin up the server, run a health check, and it returns a URL. So we can see that the query endpoint is now up at this location. All right. Now to make this work, a Vocal Bridge agent in AI agent mode is going to need two artifacts. One is a system prompt for the voice layer, which tells Vocal Bridge what to delegate versus what to answer itself. And the other one is a configuration, an AI agent.json file declaring what's on the other side, so that Vocal Bridge knows when to forward and delegate queries. versus handling it by itself. So, let me go through both of them. Starting with the prompt. This is the voice layer prompt. The key idea is right at the top. You are the ears and the mouth. Claude is the brain. Don't make up answers, delegate every substantial question to the endpoint, to the agent endpoint. And the most important section here is the bridge line instruction. The moment Vocal Bridge delegates a query, it has to immediately say something short, like hang on, checking that, because otherwise the user just sits in dead air for 1 to 6 seconds while Claude comes up with a response. That bridge line is the difference between feeling fast and feeling broken. Trust me on that one. And here's the ai-agent.json file. Three flags. enabled turns AI agent mode on for Vocal Bridge. description tells Vocal Bridge what the agent's good at. So, what is the specialization, what is the task your agent is supposed to carry out? And that's how it knows when to delegate versus handle small-talk or conversation itself. In this case, the description is Claude with web search. It knows general knowledge, current events, and it will forward any substantive question, only handle small-talk yourself. This specific section is an instruction to Vocal Bridge. And finally, verbatim setting it to false ensures that Vocal Bridge will lightly polish phrasing for natural voice delivery. So it'll contract numbers, drop citations, that kind of thing. But if we want exact word-for-word feedback, you can just flip verbatim to true. All right. Now we create the agent. We just pass the prompt and the AI agent configuration to the CLI along with the usual flags. Note, the greeting is empty. Claude's going to own the opening line, not Vocal Bridge. Using the same if-else pattern here for making sure that this is idempotent. Pre-provisioned ID in a sandbox gets reused, otherwise we will create an agent from scratch. Now it's time to create the widget. This widget will connect to Vocal Bridge, listen for the query agent event, post to your local endpoint that we just created and it'll reply back with an agent response. Everything else, the transcript, the audio, the conversational glue, all of it is handled by Vocal Bridge. So we'll run it real quick here. We will click connect. and then ask Claude something that needs the web. Hey, what was Nvidia's closing price yesterday? Let me check that for you. One moment. Nvidia's closing price yesterday was $220.61, down 0.769% from the previous close of $222.32. The day's low was $217.91 and the high was $224.48. Great, so as you might have noticed, when I was just having a conversation, making small talk, or just asking the agent um how it's going. it decided to hold the conversation itself, so responded by itself. But then when I asked what was Nvidia's closing price yesterday, it knew that it needs to forward and delegate that query. to the underlying agent, which we can see here. So it forwarded that query to the endpoint that we just created and it came back with a response. This is the response that Claude came out with and it communicated that back to me. And that's how you give voice to an existing AI agent. And this is a little housekeeping here. We will stop the query server so that I can run another one later if you need to. Okay, so now you've seen it work. Let's recap how it actually works and how Vocal Bridge connects with your agent. The whole give my LLM a voice wiring is this one React hook, use AI agent. And we leverage the on query handler to connect it and pointed to the endpoint where your agent lives. Every spoken question arrives as a query and whatever your agent returns, Vocal Bridge communicates it back. That's the entire client-side delta from text chat to voice chat. It works with Claude, you know, GPT, Gemini, your fine-tune model, your RAG stack, and agent framework, whatever. Anything that takes text in and returns text out will work with this setup. So, great job going through lesson three. See you in the next one.

deco top

deco bottom

Voice for AI Agents and Applications

Sign in to continue learning

Voice for AI Agents and Applications

Beginner

1h26m

Topics

Agents

GenAI Applications

LLMOps

Collaborator

Voice for AI Agents and Applications

Introduction
Video
・
4m

Overview of Voice UI
Video
・
9m

Voice in Your App
Video with Code Example
・
10m

Voice for Your Agent
Video with Code Example
・
12m

Voice as a Tool
Video with Code Example
・
9m

Voice AI Evals
Video with Code Example
・
10m

Voice Agents in Production
Video
・
8m

Conclusion
Video
・
1m

Glossary
Reading
・
10m

(Optional) Claim Vocal Bridge Credits
Code Example
・
1m

Graded・Quiz

Course Details