Like most computer programs you write or applications you build, there's a balance to strike between the quality of the output and how quickly that output arrives. For a voice agent specifically, the trade-off is between two things: one, how quickly the agent can understand what you said and respond; and two, how well it understood you and how helpful its response was. In the next lesson, we'll talk in detail about optimizing everything behind the agent. Right now, though, let's zoom in on the relationship between the user's device and the agent running on a server.

At the highest level, what we need to do here is transfer speech, in other words audio data, between two computer programs. One program is the application running on the user's device, and the other is the voice agent running on a server. What's the ideal way for us to do this? Let's spend a bit of time learning about networking protocols.

The internet is built on IP, the Internet Protocol. It's responsible for giving every device an address, just like a house. If one device wants to send information to another, kind of like sending physical mail between two houses, there are a couple of protocols built on top of IP for doing this: TCP and UDP. They each offer a different way of doing so.

TCP prioritizes reliability over latency. We can see our user sending some speech packets to the agent using TCP. The internet is kind of like the world's road system, and every packet that's sent out might take a different route through the network. In this example, packets one and three arrive at the agent first while packet two is still in transit. TCP will not give your application, in this case the agent, access to packets one and three until packet two arrives. If packet two is lost in transmission, which can happen, the TCP protocol on the receiver side will ask the sender to resend it. Meanwhile, your agent must continue to wait until packet two successfully arrives. Now imagine a real-world voice agent deployment where tens of thousands of audio packets are being sent every second. If even a small subset of those packets are lost or slow to reach the agent, the queue can get backed up or stalled. You may see this problem referred to elsewhere as head-of-line blocking. TCP's design is problematic for voice agent applications, as it ultimately leads to stuttering or freezing in audio playback, which is a poor user experience.

UDP, on the other hand, prioritizes latency over reliability. Unlike TCP, UDP will hand your agent any and all packets the second they're received. Concretely, as you can see in the diagram, your agent has access to packets one and three immediately, even though packet two is still in transit. That means your agent can actually make a choice about what to do in this situation. It can wait for packet two, ignore packet two and interpolate between packets one and three, or just skip packets one and two and start straight from packet three. For real-time streaming applications like AI voice agents, UDP is the better fit, because it gives us full control over what to do in poor networking conditions.
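To make that concrete, here's a minimal Python sketch of a UDP receiver for audio packets. It's only a sketch: the port number, the two-byte sequence-number header, and the play and conceal_gap helpers are assumptions made up for this example, not part of any real voice agent stack. The point it illustrates is that with UDP, the application, not the protocol, decides what happens when a packet like packet two is late or lost.

```python
import socket

PORT = 5004          # hypothetical port our audio packets arrive on
FRAME_MS = 20        # assume each packet carries 20 ms of audio

def play(pcm: bytes) -> None:
    """Stand-in for handing audio to the playback/processing pipeline."""
    pass

def conceal_gap(duration_ms: int) -> bytes:
    """Stand-in for packet-loss concealment: here, just silence (16 kHz, 16-bit mono)."""
    return b"\x00" * (16_000 * 2 * duration_ms // 1000)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # SOCK_DGRAM = UDP
sock.bind(("0.0.0.0", PORT))

expected_seq = 0
while True:
    data, _addr = sock.recvfrom(2048)          # returns as soon as ANY packet lands
    seq = int.from_bytes(data[:2], "big")      # our assumed 2-byte sequence header
    payload = data[2:]

    if seq == expected_seq:
        play(payload)                          # in order: play it immediately
    elif seq > expected_seq:
        # A packet (our "packet two") is late or missing. We aren't blocked the
        # way TCP would block us; we can conceal the gap and keep audio flowing.
        play(conceal_gap(FRAME_MS * (seq - expected_seq)))
        play(payload)
    # seq < expected_seq: a stale packet showed up after we moved on; drop it.

    expected_seq = max(expected_seq, seq) + 1
```

With a TCP socket, by contrast, the operating system itself holds back everything after the gap until the missing bytes are retransmitted, and the application never gets a say.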
Now that we know UDP is the ideal choice for our use case, how can we use it? It's a low-level protocol, so we'd have to write a lot of code to use it directly. Fortunately, there are higher-level web protocols that are a bit easier to work with. Three are widely supported across every browser, desktop, and mobile platform: HTTP, WebSocket, and WebRTC.

HTTP builds on top of TCP, which we now know isn't the ideal choice for what we're building. It's a stateless protocol, meaning it wasn't designed for long-lived connections where you're constantly streaming data back and forth. With HTTP, the sender connects to the receiver, sends some data, and then disconnects. Even setting aside the disadvantages of TCP underneath, if we still wanted to use HTTP to exchange voice data between our user and agent, we would have to figure out how long to buffer the audio on the sender side, establish a connection to the receiver, send that buffer along, and then disconnect once it's sent. Not only is this tricky to do well, but every connect and disconnect takes additional time, which adds overall latency and makes handling things like the user interrupting the agent mid-speech difficult. Lastly, HTTP stands for Hypertext Transfer Protocol, not hyper-audio or hyper-voice or hyper-speech, which means it doesn't have higher-level abstractions for sending audio data back and forth. For our use case, exchanging audio data is the primary thing we're doing between client and server, so it would be nice to have robust tools for doing that.

Here's a fun fact: over a decade ago, applications like Siri and Alexa actually used HTTP for their voice agents. When a user had a question, the client application, like the Alexa device, would record their speech to a local file. Once the user finished speaking, the client made an HTTP request to a server endpoint, uploading the audio file containing the user's query to Amazon's servers. The server application processed the query, generated an audio file in response, and sent that file back in the HTTP response. Once the client application had downloaded the audio file, it played it out through the device's speakers. Now that you know more about how HTTP works under the hood, you can see why those early voice agents didn't feel that fast or conversational.
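In code, that request/response pattern looks something like the sketch below. The endpoint URL is hypothetical and this isn't how any particular assistant was actually implemented; it just shows why every turn pays for a full buffer, upload, process, download cycle before the user hears anything.

```python
import urllib.request

def ask_voice_agent_over_http(recorded_audio: bytes) -> bytes:
    """Send one complete utterance, then block until the full audio reply arrives."""
    req = urllib.request.Request(
        url="https://example.com/voice-query",   # hypothetical server endpoint
        data=recorded_audio,                     # the WHOLE utterance, buffered up front
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    # Connect, upload, wait for server-side processing, download, disconnect:
    # nothing can be played (and nothing can be interrupted) until this returns.
    with urllib.request.urlopen(req) as response:
        return response.read()                   # the complete audio reply
```

Notice there's no way to stream partial speech in either direction, and no connection left open for the agent to push audio back as it's generated.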
What we need is a stateful architecture, one where we can establish a persistent connection between the user and agent and constantly stream audio back and forth. Either side can process speech on the fly, just like a human does with their ears, and stream speech back as it's generated by the agent or spoken by the user. With a stateful system, the agent can build up conversation history and detect, in real time, when the user is done speaking or is interrupting it. Is there a protocol that can help us accomplish this?

WebSocket does support persistent connections and lets us stream arbitrary data back and forth. But like HTTP, it's built on TCP, so it fundamentally suffers from the same problems. It also doesn't have any special facilities for sending audio data back and forth.

WebRTC is the other widely supported protocol offering long-lived connections and bidirectional data streaming. It's built on UDP and was specifically designed for transferring audio and video data between two computers. At the protocol level, WebRTC measures network conditions in real time and paces the sending of audio packets so that they arrive as smoothly and consistently as possible on the receiver side. It also automatically compresses audio data before it's sent over the wire, which helps with latency: the more data one side needs to send, the longer it takes. For example, five seconds' worth of speech data that's sent uncompressed over an HTTP request or streamed over a WebSocket becomes just 3% of that size when compressed by WebRTC. Lastly, WebRTC automatically timestamps every packet sent over the wire, which makes handling interruptions, and knowing exactly when an interruption occurred, trivial.

WebRTC is used at scale by some of the most widely adopted applications in the world: Discord, Google Meet, Zoom, and TikTok all use it for real-time audio and video streaming. WebRTC does come with some challenges, though. The first is complexity: it's just really hard to work with. This is the full call stack required just to establish a one-on-one call between a sender and a receiver. The other challenge is scale. Standard WebRTC is a peer-to-peer protocol, which means that if we used it as-is for our voice agent application, the user and agent would be directly connected, and speech would stream directly from source to destination over the public internet. As we learned earlier, the public internet is kind of like the worldwide road system, with the equivalent of freeways, residential streets, one-lane roads, potholes, and yes, even rush hour. Everyone else's data packets are traveling along the same roads as yours. The longer your packets have to travel, the more of these road hazards they'll encounter, and the longer it will take for them to arrive at their destination. One way to solve this problem is to deploy your agent all around the world, so that all of your users, regardless of where they are, have a relatively short path for exchanging audio with the agent. That's a pretty serious undertaking, though, and it comes with a lot of operational complexity.

Fortunately, there's infrastructure that solves both of the challenges that come with using standard WebRTC. LiveKit is an open-source project that makes building and scaling a voice agent with WebRTC easier than using HTTP or WebSocket. On the client side, LiveKit has open-source SDKs for every platform that you can drop into your application. They streamline establishing a persistent connection between your user and agent and streaming speech back and forth. On the server side, LiveKit has an open-source framework that makes building an AI voice agent simple. We'll talk more about the agent side in the next lesson.

Remember that road system we learned about? It turns out there's another way to avoid traffic: instead of taking the public streets, what if we could drive along private tunnels through the internet? LiveKit Cloud is a set of WebRTC servers distributed all around the world that form a global network of tunnels. When your user wants to stream speech data to your agent, they send their data over the public internet to the nearest LiveKit Cloud server. That hop from the user's device to the nearest server is short, so there's not much added latency. Once packets reach the LiveKit Cloud server, they travel through this tunnel network on an optimized path with little to no congestion. The packets exit LiveKit's network at the server closest to where the agent is running and make a short final trip to the agent itself. In practice, this strategy reduces network latency between the user and your agent by about 20 to 50%.
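To give a feel for what this looks like in practice, here's a rough sketch of a client (or agent) process joining a LiveKit room with the LiveKit Python SDK. It's a sketch under assumptions: the server URL and access token are placeholders you'd get from your own LiveKit Cloud project and token-minting backend, and exact method and event names can differ between SDK versions.

```python
import asyncio
from livekit import rtc

LIVEKIT_URL = "wss://your-project.livekit.cloud"   # placeholder LiveKit Cloud URL
ACCESS_TOKEN = "<token minted by your backend>"    # placeholder access token

async def main() -> None:
    room = rtc.Room()

    # Fires whenever the other side (user or agent) publishes an audio track,
    # so speech can be processed as it streams in rather than after the fact.
    @room.on("track_subscribed")
    def on_track_subscribed(track, publication, participant):
        print(f"subscribed to a {track.kind} track from {participant.identity}")

    # One persistent, stateful connection: from here on, audio flows over
    # WebRTC/UDP for the life of the conversation, with no per-turn requests.
    await room.connect(LIVEKIT_URL, ACCESS_TOKEN)
    print(f"connected to room {room.name}")

    await asyncio.sleep(60)   # keep the connection open while audio streams
    await room.disconnect()

asyncio.run(main())
```

Compare this with the HTTP sketch earlier: the connection is established once, and from then on both sides can stream, interrupt, and react in real time.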
If you're interested in seeing LiveKit in action, the OpenAI team used LiveKit to build ChatGPT's Advanced Voice Mode. When you talk to that voice agent, all of the voice data is going back and forth through the LiveKit Cloud network. Now that you have a good understanding of what's happening under the hood between the user's device and your agent, let's talk more about the voice agent itself and what's going on behind the scenes once your user's speech reaches the server.