

Voice for AI Agents and Applications


In this lesson, you will learn about the traditional voice stack with all its complexity. Then, you will walk through three live demos that each demonstrate a different pattern for building voice agents. Let's dive in.

This first lesson will walk you through the voice AI landscape. There is no notebook for this lesson. What I want you to walk away with is a clear mental model of three things: one, what a production voice agent actually involves; two, where voice belongs in your stack; and three, what we're going to build hands-on starting in lesson two.

Before we get into the architecture, I want to spend a minute on why voice. Why does this matter? There are four reasons it's worth treating as a first-class modality in the AI era.

The first one is the obvious one. Voice is the most natural interface humans have. We've been talking for hundreds of thousands of years. Typing is just a few decades old. So voice is the lowest-friction way to express intent. No menus, no forms, no scrolling.

Second is paralinguistics, and this one is underrated. A voice utterance carries way more signal than the same sentence as text: tone, pace, hesitation, urgency, prosody, emotion. Agents that can hear how you mean something, not just what you said, behave really differently from agents that just read your text.

Third is multimodality. Voice doesn't replace your UI, it pairs with it. You speak intent, you see structured info back. The best voice agents are the ones that use voice and the GUI together, each modality doing what it's best at.

Finally, the fourth one is accessibility. Voice reaches users who can't type, can't read on a screen, or can't navigate a complex GUI. It is often the most accessible interface you can ship, and that matters.

With real-time AI, voice finally works as an interface, and paired with your existing UI, it's a step change in how products feel. So, let me make that concrete. One example that we're going to use here is booking a flight.
With a classic GUI, you go through four phases. You open the app; you search; you pick dates, apply filters, and choose from the options; and then you finally pay and confirm. Every phase has its own subtasks. Total time: easily over two minutes, and you're tapping the entire time.

Now, imagine the same task on voice with a single utterance: "Book me a flight from SF to NYC next Friday morning." The agent does the screens. You stay in the conversation. You can confirm out loud, and you'll be done in under 10 seconds. Now, the point isn't that voice replaces GUIs. The point is that voice collapses the friction between intent and outcome. You will see this pattern over and over in every demo for the rest of this lesson.

So where is voice actually being used today, and where is it going next? This whole scene is one big picture, what I call the voice agent universe, and we are going to fill it in one bubble at a time.

If you've heard about voice AI over the last couple of years, the context was almost certainly the contact center: support, IVR, agent assist, call Q&A. That's the orange cluster on the right. The economics are clear, the workflows are constrained, the buyers exist. That's why the industry started there. But that's a tiny slice of where voice actually belongs.

The bigger picture is the rest of this universe: voice as a feature of every kind of software across every industry. In fintech, voice is the interface to your account and your portfolio. In-car is the obvious one, because you can't take your hands off the wheel. Take healthcare, for example: you can have use cases like clinical dictation, patient intake, check-ins, you name it. It could be devtools: coding assistants you can actually talk to. There's accessibility, for users who can't easily use a GUI, and productivity: meetings, scheduling, inbox agents. And finally, education for kids who can't read yet, gaming, field service. The list keeps growing.
So the framing for this entire course is simple. Voice today lives in the contact center. But voice tomorrow lives everywhere a developer ships software.

To make it more concrete, over the rest of this course we will work with three surfaces where Vocal Bridge plugs into your stack. One, voice for your application. Using our React SDK, you can drop in our VocalBridgeProvider component and voice-enable any application. The agent triggers actions and your UI can react. Number two is voice for your agents. Again using our SDK, the useAIAgent hook will give your existing LLM agent a voice in just two lines of code. That's lesson three. And finally, voice as a tool. Using our CLI, "vb call" is the command that lets your agent place real phone calls. Voice becomes a function your LLMs can invoke. That's lesson four.

Now, let me show you what you would actually have to build if you tried to wire this up yourself, and why we built Vocal Bridge to collapse all of it. This is the production stack. 12 steps, and I will walk through each one.

The first block is simply capture: microphone permissions, WebRTC ingest, codecs, track muxing, device hot swaps. That's just getting the audio in the door. Next is audio pre-processing, which includes noise cancellation, echo cancellation, and handling silence frames. None of this is glamorous. All of it is what makes the agent intelligible.

Then you would add STT plus voice activity detection. You would probably want streaming speech-to-text, because for a real-time voice agent, speech has to be converted into text as a stream, and you will have to leverage tools like Whisper or Deepgram. You also need voice activity detection, also known as VAD, to figure out when someone's actually speaking; endpointing, for the agent to figure out when the user is finished with their turn; and maybe even diarization if there's more than one speaker on the call.
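To make the VAD and endpointing step above a little more tangible, here is a minimal sketch in Python. It is not how Whisper, Deepgram, or Vocal Bridge do it; production systems typically use trained VAD models. This version simply treats a frame as speech when its RMS energy crosses a threshold, and declares the turn finished after a run of trailing silence. The frame length, threshold, and silence window are all assumed values chosen for illustration.

```python
import math
from typing import Optional, Sequence

# All three constants are illustrative assumptions, not values from the lesson.
FRAME_MS = 20            # duration of one audio frame
ENERGY_THRESHOLD = 0.02  # RMS level at or above which a frame counts as speech
END_SILENCE_MS = 600     # trailing silence that marks the end of a turn


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame: Sequence[float], threshold: float = ENERGY_THRESHOLD) -> bool:
    """Toy voice activity detection: energy thresholding on a single frame."""
    return rms(frame) >= threshold


def find_turn_end(
    frames: Sequence[Sequence[float]],
    frame_ms: int = FRAME_MS,
    end_silence_ms: int = END_SILENCE_MS,
) -> Optional[int]:
    """Toy endpointing: return the index of the frame at which the user's
    turn is considered finished, or None if the turn has not ended yet."""
    needed = math.ceil(end_silence_ms / frame_ms)
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if is_speech(frame):
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            # Silence before the first speech frame is ignored on purpose.
            silent_run += 1
            if silent_run >= needed:
                return i
    return None
```

In a streaming setup you would feed frames in as they arrive rather than passing a full list, but the state (whether speech has been heard, and how long the current silence has lasted) is the same.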
Then comes your agent, which will also have to be your dialogue manager. It has to handle turn-taking, tool routing, your RAG pipeline, memory, the prompt, the business logic, all of it. This is the only block that's actually about your product. Everything else is plumbing.

After that, we have text-to-speech, coupled with sentence chunking, selecting voices, configuring the prosody, synthesizing the speech as a stream, and choosing from a number of TTS providers.

And there are more branches that come in. Specifically, two. The first is the telephony bridge, which requires you to handle the telephony stack: working with providers like Twilio, handling the SIP protocol, handling DTMF, and handling inbound and outbound calls separately. All of that sits above capture. The second wraps all of it and cuts across the different components in this architecture: authorization, session state management, reconnections, failovers, latency budgets, concurrency, and observability. That makes up your entire stack in production.

All right. There are essentially two architectures in voice AI today, and there's a tradeoff between them that nobody wants to make.

The first one is the cascaded setup. It's the classic stack: you have speech-to-text, then your LLM, and then finally text-to-speech. You get deep reasoning because you're using your full LLM stack as is. It is easy to debug, because every step is text. But the latency is going to be 1 to 3 seconds per turn, end to end, which is not ideal for a real-time voice agent. And speech-to-text strips out the tone, the emotion, the pacing, all the paralinguistics. You lose all of that signal before your agent ever sees the input.

The second one is the real-time architecture, also known as voice-to-voice models, where the input is speech and the output is also speech, and there's just one single model. The latency drops to 200 to 500 milliseconds.
That's the kind of latency you want. The model hears tone, pitch, hesitation, emotion, all the paralinguistic signals that we lost with the cascaded stack. But you do lose access to your LLM stack. The brain here is generic. Wiring it up with your RAG, your tools, and your domain knowledge is not straightforward.

So, you have to pick one. Either you get low latency or you get deep reasoning. Either you get naturalness or you can leverage your existing brain. That is the tradeoff nobody wants to make.

Vocal Bridge's answer is what we call the concierge architecture: a real-time brain that handles the conversation, so things like fillers, turn-taking, and paralinguistics, and that only delegates to your LLM when it needs deep reasoning or needs to execute a specialized task that your LLM or your agentic workflow is really good at. So you get real-time latency and your existing LLM stack stays intact. The implementation is really our secret sauce, but the high-level shape is what you see on the screen.

That's the architectural picture you should walk into lesson two with. In lesson two, we start building.
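The delegation idea behind the concierge pattern can be illustrated with a toy router. This is emphatically not Vocal Bridge's implementation, which the lesson does not disclose. It only shows the high-level shape: a fast layer answers conversational turns itself and hands off to a slower LLM agent when a turn needs deep reasoning. Every name here (DELEGATE_HINTS, Reply, needs_deep_reasoning, concierge_turn) is hypothetical, and the keyword check is a crude stand-in for the decision a real-time model would make.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical trigger words. In a real system the real-time model itself
# would decide when to delegate; keyword matching is only a placeholder.
DELEGATE_HINTS = ("book", "refund", "balance", "schedule")


@dataclass
class Reply:
    text: str
    handled_by: str  # "realtime" or "llm"


def needs_deep_reasoning(utterance: str) -> bool:
    """Crude stand-in for the routing decision: does any hint word appear?"""
    words = utterance.lower().split()
    return any(hint in words for hint in DELEGATE_HINTS)


def concierge_turn(utterance: str, llm_agent: Callable[[str], str]) -> Reply:
    """Handle one user turn: answer directly, or delegate to the LLM agent."""
    if needs_deep_reasoning(utterance):
        # A real-time layer would typically speak a filler while it waits
        # for the slower agent; here we just return the agent's answer.
        return Reply(text=llm_agent(utterance), handled_by="llm")
    # Conversational turns stay on the fast path.
    return Reply(text="Sure, go ahead.", handled_by="realtime")
```

The point of the split is that only the delegated turns pay the latency of the full LLM stack, while greetings, acknowledgements, and turn-taking stay on the fast path.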


Level: Beginner
Duration: 1h26m
Topics: Agents, GenAI Applications, LLMOps



Course Syllabus

Introduction (Video, 4m)
Overview of Voice UI (Video, 9m)
Voice in Your App (Video with Code Example, 10m)
Voice for Your Agent (Video with Code Example, 12m)
Voice as a Tool (Video with Code Example, 9m)
Voice AI Evals (Video with Code Example, 10m)
Voice Agents in Production (Video, 8m)
Conclusion (Video, 1m)
Glossary (Reading, 10m)
(Optional) Claim Vocal Bridge Credits (Code Example, 1m)
