In the previous lesson, you gave feedback to your agent as text. You'll now provide feedback to the agent in spoken natural language. You will add multimodal capabilities to your agent so it can capture the feedback as audio spoken by the user. Let's have some fun!

So far, we've set up a moderately complex workflow with a human feedback loop. Let's run it through our visualizer to see what it looks like. We're going to bring in a couple of things just to get that done. Here are all our imports and our cloud API keys. And here, all in one step, is our entire RAG workflow. This is exactly the workflow that we were using in the previous lesson, so there's nothing new to walk through. We'll execute all three of those cells, and then we'll visualize the workflow using the visualizer.

It's pretty complicated, so let's zoom out a bit. Cool. This is a really fun visualization because you can see the entire happy path that we're walking through. You start, you set up, you parse your form, you generate questions, you ask a question, and you fill in the application form. This is the external step where you wait for human feedback, and then you receive that feedback. At that point, you can either stop, or complete the loop and do it again.

Now, just for fun, we're going to do one more thing: change the feedback from text to actual words spoken out loud. To do this, we'll use a different model from OpenAI called Whisper. LlamaIndex has a built-in way to transcribe audio files into text using Whisper. Here's a function that takes a file and uses Whisper to return just the text. It loads the file, instantiates the Whisper reader, and then gets the text out of the document. These are the same sorts of documents that we were using for RAG earlier.

Before we can use it, we need to capture some audio from your microphone, and that involves some extra steps. First, let's create a callback function that saves data to a global variable, so we have somewhere to store the transcription once we've got it. Next, you're going to use Gradio, which has special widgets that can render inside a notebook, to create an interface for capturing audio from a microphone. When the audio is captured, it calls transcribe_speech on the recorded data and then calls store_transcription on the result. You can see that happening here: it's defining your inputs, defining your outputs, and running this function, which is the function that we just described up here. And then we're passing the output of transcribe_speech to store_transcription, which we defined here.

In Gradio, you then define a visual interface containing this microphone input and output, and launch it. So we're creating a Blocks interface, putting some tabs into it, and then launching it. Gradio gives us this really cool, already-set-up interface that can listen to our microphone. Let's try that out. "Hey computer, can you hear me?" We'll submit it, and it correctly transcribes the audio. Nice. Gradio is pretty wild: you can do all of this and print out the transcription in just a few lines of code. Let's make sure that we've got it in that global variable that we set up earlier. There it is.

We're going to want to run Gradio again, so it's a good idea to shut down the Gradio interface that you were just using. Otherwise, your Gradio interfaces will compete with each other.

Now we're going to create an entirely new class, a transcription handler. I'm going to walk through what it does step by step.
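Before we do, here's a rough sketch of the transcription function described above. It assumes the llama-index-readers-whisper package is installed and that your OpenAI key is available as the OPENAI_API_KEY environment variable; the exact reader arguments may differ slightly from the notebook's code.

```python
import os

from llama_index.readers.whisper import WhisperReader


def transcribe_speech(filepath: str) -> str:
    """Turn a recorded audio file into plain text using OpenAI's Whisper model."""
    if filepath is None:
        return ""
    # The reader returns LlamaIndex Documents, the same kind we used for RAG earlier
    reader = WhisperReader(model="whisper-1", api_key=os.environ["OPENAI_API_KEY"])
    documents = reader.load_data(filepath)
    return documents[0].text
```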
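And here's a sketch of the microphone-capture step, assuming the transcribe_speech function above. The widget names, the tab label, and the global variable name are illustrative rather than the notebook's exact code.

```python
import gradio as gr

transcription_value = None  # global variable the callback writes into


def store_transcription(text: str) -> str:
    """Save the transcription so we can read it back after the interface closes."""
    global transcription_value
    transcription_value = text
    return text


# Wire the microphone input to transcribe_speech, then pass its output to store_transcription
mic_transcribe = gr.Interface(
    fn=lambda filepath: store_transcription(transcribe_speech(filepath)),
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Textbox(label="Transcription"),
)

# A Blocks interface with a tab wraps the microphone widget, and then we launch it
with gr.Blocks() as demo:
    with gr.Tab("Transcribe from microphone"):
        mic_transcribe.render()

demo.launch(share=False)
# ...record, submit, then check the global:
# print(transcription_value)
# demo.close()  # shut it down before launching another Gradio interface
```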
First, we're going to create a queue to hold our transcription values. Every time we record something, we're going to put it in the queue using this method, store_transcription. The create_interface method contains the same interface and transcription logic that we were using a little earlier, except it stores the result in a queue instead of a global. You can see that happening because it calls self.store_transcription, which we defined here. Then we launch the transcription interface just like we did before, but the new thing we do is poll every 0.5 seconds, waiting for something to end up in the queue. So this while-True loop will just continue forever, sleeping for 0.5 seconds each time. But if there's something in the queue, it will notice, close the interface, and return the result. (A sketch of this handler, and of the workflow loop that uses it, appears at the end of this lesson.)

Now you have a transcription handler, and you can use it instead of the keyboard interface when you're getting human input while running your workflow. We don't have to change the workflow at all for this to work. So, just like before, we set up our workflow, pass in our fake resume and our fake application form, and wait for an input-required event. Now, instead of getting keyboard input, we call our transcription handler. Once our transcription handler has given us a transcription, we send that as the human response event.

So, the moment of truth: let's try it out. One change that I made that I didn't tell you about was to give us a little more feedback about what's going on under the hood: I told it to print out every question and every answer that it gets while it's thinking, and you can see it doing that here. All right, it's got to the end of the questions, and now it's spun up a Gradio interface asking us for feedback. We can see from the answers to the questions that it's made the same mistake that it made before: the project portfolio is a list of things that the candidate did instead of a URL. So let's use the audio feedback to tell it that that's what we need to change. "The portfolio field should be a URL." We submit, and the LLM reads that transcript and correctly decides that I gave it some feedback.

So now you can see our questions and answers happening, and each one of them has this extra feedback field attached. Every single question gets the same feedback, that the portfolio field should be a URL, which is really only useful for one of them. You can think about how you would improve that if you were doing this in a more production-ready application.

Okay, it got to the end of the questions again, and it spun up another Gradio interface asking us for feedback. Let's tell it that it did a good job this time, because it rendered the portfolio as a URL, just like we asked. "That's great. Good job." The LLM has correctly interpreted my positive feedback as saying that everything is okay, and it spat out the form. Congratulations! You've successfully created an AI agent that responds to spoken human feedback.
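For reference, here's a rough sketch of a TranscriptionHandler along the lines described above. It reuses the transcribe_speech function from earlier; the queue type, the polling interval, and the Gradio event wiring are illustrative rather than the notebook's exact code.

```python
import asyncio
import queue

import gradio as gr


class TranscriptionHandler:
    """Captures one spoken response via Gradio and returns its transcription."""

    def __init__(self):
        # A thread-safe queue holds transcriptions produced by the Gradio callback
        self.transcription_queue = queue.Queue()
        self.interface = None

    def store_transcription(self, text):
        # Instead of writing to a global, push each result onto the queue
        self.transcription_queue.put(text)
        return text

    def create_interface(self):
        # Same microphone interface as before, but routed through self.store_transcription
        with gr.Blocks() as interface:
            with gr.Tab("Speak your feedback"):
                audio_input = gr.Audio(sources=["microphone"], type="filepath")
                output = gr.Textbox(label="Transcription")
                audio_input.stop_recording(
                    fn=lambda f: self.store_transcription(transcribe_speech(f)),
                    inputs=audio_input,
                    outputs=output,
                )
        return interface

    async def get_transcription(self) -> str:
        self.interface = self.create_interface()
        self.interface.launch(prevent_thread_lock=True)
        try:
            # Poll every 0.5 seconds until a transcription lands in the queue
            while True:
                if not self.transcription_queue.empty():
                    return self.transcription_queue.get()
                await asyncio.sleep(0.5)
        finally:
            self.interface.close()
```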
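And here's a sketch of how the run loop might swap keyboard input for the handler. RAGWorkflow and the file paths are placeholders for whatever you built in the previous lesson; InputRequiredEvent and HumanResponseEvent are the built-in human-in-the-loop events from llama_index.core.workflow.

```python
from llama_index.core.workflow import HumanResponseEvent, InputRequiredEvent


async def main():
    # Placeholder: the RAG workflow class and input files from the previous lesson
    workflow = RAGWorkflow(timeout=600, verbose=False)
    handler = workflow.run(
        resume_file="data/fake_resume.pdf",
        application_form="data/fake_application_form.pdf",
    )

    async for event in handler.stream_events():
        if isinstance(event, InputRequiredEvent):
            # Instead of keyboard input, spin up the microphone interface and wait for speech
            transcription_handler = TranscriptionHandler()
            response = await transcription_handler.get_transcription()
            handler.ctx.send_event(HumanResponseEvent(response=response))

    final_result = await handler
    print(final_result)

# In a notebook: await main()
```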