Voice systems regress silently, and you'll only find out when users complain. In this lesson, you will set up evaluation-driven development using Vocal Bridge's built-in multimodal evaluator. It scores your calls, suggests prompt improvements, and helps you catch issues before they reach production. Let's have some fun.

Lessons two, three and four were about building. Lesson five is about not shipping by vibes. An eval is a repeatable test that scores your AI system against the outcomes you actually care about. Think of it like a unit test, but for a non-deterministic system whose output you can't simply diff. Every shipping AI system needs them, and here's why. A prompt edit ships clean, but quietly breaks three edge cases. A model swap changes behavior in ways no integration test catches. Without ground truth, you find regressions from your customers, which is the worst possible place to find them. And without scoring, you can't tell whether a change you made actually moved things forward or backward. Eval-driven development means building and iterating on a system using its eval scores as the primary signal. That is the discipline that turns a demo into a product.

Text evals can mostly compare strings. Voice doesn't give you that luxury. Voice involves audio, timing, tone, multiple turns, and an agent that can interrupt or be interrupted, and none of that shows up in a text diff or even in a transcript.

Six things make voice evals harder than text evals. One, the signal is multimodal. A voice turn carries text, audio, pacing, and prosody. Score one dimension and you miss the others. Two, regressions are silent. A new TTS, or text-to-speech, voice can sound robotic on specific phrasing while the transcript looks identical. Three, turn-taking matters. Did the agent wait for the user to finish? Did it interrupt at the right moment? Text alone can't tell you that. Four, calls are hard to replay. You can't rerun a live call deterministically.
So recordings plus structured logs become your only ground truth. Five, tool calls happen inline. Did the agent invoke the right function at the right time with the right arguments, mid-conversation? And six, latency itself is part of the user experience. A correct answer four seconds late is just the wrong answer here. If you only score the text, you will miss most of what actually broke.

Okay, so how do we actually score it? There are two layers. The first layer is hard numerical metrics, things that are actually measurable. WER, or word error rate, for transcription accuracy. MOS, or mean opinion score, for synthesis quality. TTFB, or time to first byte, for first-token or first-audio latency. Turn duration, end to end. Tool-call accuracy: whether the agent called the right function with the right arguments, and whether it triggered a tool call at all. Completion rate. These are all things that you can compute deterministically. You can create a baseline with these numbers. You can fail builds on regressions.

Now the second layer: using a multimodal LLM as a judge, for the qualitative things the metrics can't see. Did the agent's tone match the situation? Was the turn-taking natural, with no awkward pauses? Did the call actually achieve the objective? Did it gracefully handle the user going off-script? How well did it recover from interruptions or errors? And, really importantly, the judge suggests concrete prompt edits for next time, so that you can improve the agent and its performance.

Numerical metrics catch what's objective, and the judge catches the qualitative drift that's otherwise invisible. Together, and especially at scale, that's what makes voice evals tractable. So the notebook will have you run this on real recorded calls with one single CLI command, vb eval. All you have to do is pass it the session ID for the call and the objective.
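To make the first layer concrete, here is a minimal sketch of one of those hard metrics, word error rate, plus the kind of baseline check that can fail a build. This is an illustration, not Vocal Bridge's implementation: the function names and the 0.02 tolerance are assumptions, and a production WER would usually normalize punctuation and numbers before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def regressed(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """True when a lower-is-better metric got worse by more than the tolerance.

    The default tolerance is an arbitrary illustrative value.
    """
    return current - baseline > tolerance
```

The same regressed check would work for TTFB or turn duration, since lower is better for those too. For metrics where higher is better, such as completion rate, the comparison would flip.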
The command bundles everything, the audio recording of the conversation, the transcript, the agent configuration, and the tool call log, into a payload for the multimodal judge, which then returns a structured report you can wire into your CI/CD loop.

A sample report looks something like this. You'll have the objective at the top: confirm appointment for tomorrow at 2pm. Then an outcome, say, Confirmed. It'll give you an overall score out of 10, and then the breakdown: the conversation quality, what the turn-taking was like, whether it was natural or awkward, how satisfied the caller was. You can describe and define all of these criteria in the scenario. And the most important part, the one that actually helps you close the loop, is the suggestion for editing the agent's configuration. It could be something like: soften the opening greeting, because the current phrasing reads as scripted and comes off as unnatural. Maybe you can try, "Hi! Just checking in on your 2pm tomorrow..." So that is a complete loop. Let's open the notebook.

All right, time for lesson five: voice AI evals. So you've built three working voice agents. Now we install the discipline that separates a demo from a production system. It's called eval-driven development. Vocal Bridge ships a built-in multimodal evaluator. We give it a recorded session and an objective. It sends the audio, the agent's full configuration, the transcript, and the tool call log, all of it, to a multimodal LLM, which then scores the call and proposes concrete prompt edits.

For this notebook, you need at least one completed call with a recording on it. The easiest source would be the call you placed in lesson four. We will pull its session ID off the list in a second. First we make sure that the setup goes through. Pretty standard stuff.

Right, so just two helpers this lesson: the vb wrapper you already know from earlier, and eval_session. This is the new one. It's a one-line wrapper around vb eval.
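If you want a feel for what a wrapper like that could look like, here is a rough sketch. It is not the notebook's actual helper. It assumes the session ID and the objective are passed as two positional arguments after vb eval, which is how the command is described in this lesson, and it assumes the report is printed to standard output. build_eval_command is a hypothetical name added so the command can be inspected without running the CLI.

```python
import subprocess


def build_eval_command(session_id: str, objective: str) -> list[str]:
    """Build the argv for the eval call.

    Assumption: the CLI takes the session ID and the objective as
    positional arguments, in that order.
    """
    return ["vb", "eval", session_id, objective]


def eval_session(session_id: str, objective: str) -> str:
    """Run the eval and return whatever the CLI writes to stdout."""
    result = subprocess.run(
        build_eval_command(session_id, objective),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list, rather than one shell string, means an objective containing spaces or quotes reaches the CLI as a single argument without any manual escaping.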
The Vocal Bridge CLI is available through the vb alias, and eval is the command. So vb eval is wrapped by this helper, eval_session. That's literally the entire eval API surface.

Okay, so now we pull the most recent completed calls, so we can pick one. If you ran the lesson four call, its session ID will be at the top of the list. We will copy the session ID and use it in the next cell.

So we copied the session ID in the previous step. Now you need to give it an objective, which is what the agent was supposed to accomplish on that call. The objective is what makes the eval meaningful. Without an objective, the LLM is just grading vibes, really. With one, it can actually tell you whether the call succeeded. Under the hood, this is one CLI command: vb eval, the session ID, and then the objective. Again, we just pass the session ID and the objective. The objective here, if you remember from lesson four, was that the caller would place an outbound demo call and have a brief on-topic conversation with the callee. It takes a few seconds because it's a multimodal LLM doing the evaluation. The CLI sends the audio, the agent config, the transcript, the tool log, all the context, to the multimodal judge model. So we are using the LLM-as-a-judge approach here.

Eval reports come back as a structured object. As you can see, here we have the session_id, we have the objective, which we provided to the eval_session helper, and then a result. The result object has a few different properties. First, the score, which is 10. Great to hear that. This is a score out of 10, so the best you can do is 10. Then the verdict: whether the agent passed the evaluation, meaning whether it met the objective. So it's a binary pass/fail decision. And then the summary: the LLM judge gives you a qualitative summary of how the agent performed. It'll also give you what worked and what didn't.
So in this case, everything worked, which is great to hear. It was a simple prompt after all. And then finally, this is the most useful one: suggested prompt improvements. This is telling you, the developer, what improvements you can make to the prompt and configuration of the agent to meet the objective and get closer to a perfect score. In this case, it is already perfect, so you can see it says the prompt is highly effective for this scenario and no changes are suggested.

Now, the eval reports, as you saw in the previous step, come back as a structured object. The two things you will actually act on are the score, which is a 0 to 10 quality rating, and the suggestions for improving the prompt. If the call missed the objective, the model tells you exactly what went wrong. That's your prompt diff, right there. All right, so this cell just shows you how you can parse that report and display it in a more readable format.

Now you can take it one step further. You can also build a comprehensive eval suite. I'm not going to run the cell. I will let you explore this on your own. But this is an example of the kind of eval suite you can build, where you evaluate the same agent against a set of scenarios. The first one is confirming an appointment for tomorrow at 2:00 p.m. The second one has the same objective, but it's a different scenario. So you can configure those situations and really evaluate your agent comprehensively, at scale, before going to production.

And here's the thing. The discipline is the loop, not any single number. Build, call, eval, look at the suggestions, patch the prompt, call again, eval again. A regression isn't actually fixed until a fresh call against the same objective scores higher. So the eval suite is what will keep you honest about it. And typically, this is what a patch loop would look like.
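One way to make that last rule checkable, that a fix only counts when a fresh call against the same objective scores higher, is a small gate like the one below. To be clear, this is a sketch and not the notebook's cell. EvalReport and regression_fixed are hypothetical names, and the three fields are an assumed, simplified view of the report described earlier, not the evaluator's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvalReport:
    """Assumed, simplified view of one eval report."""
    objective: str
    score: float  # 0 to 10, higher is better
    suggestions: list[str] = field(default_factory=list)


def regression_fixed(before: EvalReport, after: EvalReport) -> bool:
    """True only if the fresh call beat the earlier score on the same objective."""
    if before.objective != after.objective:
        # Scores for different objectives are not comparable.
        raise ValueError("both reports must be for the same objective")
    return after.score > before.score
```

In a patch loop, before would come from the call that exposed the problem and after from a fresh call placed once the prompt has been edited. An equal score does not count as fixed, which matches the rule that the new call has to score higher.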
Again, I'm not going to run it for you, but feel free to explore it on your own. Try different versions of the objectives, the scenarios, and the prompt, and see how having an eval suite improves the performance of your agent.

And that's the course. You've built across all three Vocal Bridge services, voice in your app, voice for your existing agent, and voice as a tool, and installed the eval-driven dev loop on top. From here, the natural next step is production hardening: auth, monitoring, latency tuning, all the stuff that turns a working demo into a deployed product. And that is exactly the topic of my conversation with Scott Johnston, former CEO of Docker and current board member of Vocal Bridge. I'll see you there.