Agents are one of the most popular ways to deploy LLMs in production, and a live agent can actually inform which behaviors you need to target in post-training to get the right result. LLMs that are live in production today go beyond the average assistant. An assistant learns to chat and interact with a person through post-training, but an agent, which is what many teams actually want in production, needs post-training for a different set of capabilities, like using tools, planning, and coordinating. You've already seen a little of that.

This is because an agent's user experience is different from a chatbot's. Chatbots are good at responding to queries and holding a conversation, and they can handle the chat history that post-training enables. With different post-training, though, you can get an agent that can use tools and APIs, reason, loop with fewer hallucinations, run that reflection loop more effectively, and, finally, coordinate. You'll go through each of these, but essentially agents interact with many components of the real world, and the real world is messy: the information can be messy, and the tools the agent uses can change over time. So let's look at how these components work, and then how they all culminate in a live agent in production.

First, tool use, which you've seen before. For a question like "What is today's weather?", the target output can call a tool, say a weather API, and then display the result. You can use fine-tuning examples like this to make your model more agentic. With RL for tool use, you can similarly teach the model to use a calculator instead of computing on its own, and reward it for getting the right answer, for using the tool itself, or both. You've also seen RL with a search API, using search on the internet to pull in updated information, and with code and file tools. All of this is post-training for tool use, and a rough sketch of what the fine-tuning data and reward could look like follows just below.

Next is planning, which largely falls under reasoning: how a model should think step by step to get to a better answer. For fine-tuning on reasoning, you've seen chain of thought; for RL, you've seen rewarding the final correct answer while letting the model plan on its own.

Coordination takes things to the next level, and you haven't seen as much of it yet. Here, post-training happens on a multi-agent transcript, so a model learns to operate well with other agents that are using other tools. For example, agent A breaks down a math problem, agent B is in charge of the calculator tool and uses it, and the target output reasons about how to aggregate the information from agent A and agent B to reach the right answer. That aggregator is the final model you're fine-tuning here: it draws on several possible sub-agents to get to the final target. A sketch of one such training record also follows below.
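To make the tool-use piece a bit more concrete, here is a minimal sketch, assuming a made-up message schema, made-up tool names (get_weather, calculator), and made-up reward values, of what a tool-use fine-tuning example and a tool-use reward function could look like. It illustrates the shape of the data, not any particular framework's format.

```python
# Hypothetical supervised fine-tuning example for tool use:
# the target output is a tool call rather than a guessed answer.
weather_example = {
    "input": "What is today's weather?",
    "target_output": {
        "tool_call": {"name": "get_weather", "arguments": {"location": "user_location", "date": "today"}},
        "final_response": "It's 18°C and partly cloudy today.",
    },
}

def tool_use_reward(response: dict, correct_answer: str) -> float:
    """Hypothetical RL reward: some credit for calling the calculator tool,
    plus a larger credit for reaching the right final answer."""
    reward = 0.0
    if response.get("tool_call", {}).get("name") == "calculator":
        reward += 0.2   # small reward for using the tool instead of guessing
    if response.get("final_answer") == correct_answer:
        reward += 1.0   # main reward for the correct result
    return reward
```

So a rollout that both calls the calculator and lands on the right answer would collect the full 1.2 here, while one that guesses correctly without the tool would only collect 1.0; how much you reward tool use on its own is a design choice.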
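And here is a similarly hedged sketch of what one coordination fine-tuning record could look like: a transcript where agent A decomposes the problem, agent B runs the calculator, and the target output is the aggregator's final answer. The roles and field names are assumptions made up for illustration.

```python
# Hypothetical fine-tuning record built from a multi-agent transcript.
# The model being fine-tuned is the aggregator; agents A and B appear as context.
coordination_example = {
    "input": [
        {"role": "user", "content": "What is (17 + 25) * 3?"},
        {"role": "agent_a", "content": "Plan: first add 17 and 25, then multiply the sum by 3."},
        {"role": "agent_b", "content": "calculator(17 + 25) = 42; calculator(42 * 3) = 126"},
    ],
    # Target: aggregate the sub-agents' work into one grounded final answer.
    "target_output": "Following agent A's plan and agent B's calculator results, (17 + 25) * 3 = 126.",
}
```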
And these are all possible distinct tasks, so you could just as well teach the model to be agent A, agent B, or agent C. In RL for coordination, it might look like this: the user says, "Find the refund status for order number 123," and the model hands the request to agent A, which could be one of its own sub-agents. If it then answers, "Sorry, I didn't get the order info," that's a failed hand-off. What you want to teach the model is to pass information correctly across its sub-agents, or to another agent, and still get to the right response. Here's another bad case: agent A says a refund is pending even though no such order was found, and the model goes ahead and issues the refund; you don't want to refund an order that doesn't exist. The correct case is when agent A finds that the order belongs to a certain customer ID, and that customer is then refunded. A rough sketch of how this could be scored follows below.

This brings us to live agents. As we deploy agents in production, what considerations need to be handled for them to work effectively? One is that state is constantly updating, which is harder in production: your APIs change, your tools themselves change, and the world's state keeps moving, so being able to manage that well is really important. Another is new context: new information arrives all the time rather than being frozen at the moment the model was trained. That could mean calling a search API to pull in news and other fresh information, or using retrieval-augmented generation (RAG), essentially a data search that adds relevant, more up-to-date information into the input. Finally, there's messy, and sometimes wrong, data that ends up in the model's context, and the model needs to handle that too. These are all things the agent needs to be post-trained to handle, so based on the behavior you need in production, you design your post-training to make the agent robust to them.

Here's an example with continually updating tools. You want the model to learn to use tools for up-to-date state rather than rely on its own internal state, because that internal state is frozen at training time, and it's hard to run continual post-training all the time, though I'm really excited about that for the future. A very simple case is a date-time tool: as time moves forward, the model tends to believe it is still back when it was trained, so teaching it to default to a date-time tool for today's date, rather than "remembering" the date, is really important. The same goes for a search API: the model should get used to actually calling it for relevant information instead of relying on its own knowledge. This is a behavior change you teach with post-training.

Another behavior to think about is information the model has never seen before, and making sure it is comfortable using it. RAG, retrieval-augmented generation, can pull newly seen information from any kind of data source, much like using a search API; a toy sketch of that retrieval step also follows below.
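Going back to the refund example, here is a hedged sketch of how that coordination hand-off could be scored. The transcript fields, the reward values, and the rule that a refund should only be issued for a verified order are assumptions for illustration; a real rubric or reward model would be more nuanced.

```python
def refund_handoff_reward(transcript: dict) -> float:
    """Hypothetical coordination reward for the 'refund status for order 123' task."""
    order_verified = transcript.get("order_found", False)   # did agent A actually locate the order?
    refund_issued = transcript.get("refund_issued", False)
    gave_up = transcript.get("final_response_type") == "apology"

    if refund_issued and not order_verified:
        return -1.0   # refunding an order that was never found is the worst outcome
    if order_verified and refund_issued:
        return 1.0    # correct hand-off: order located, customer refunded
    if gave_up:
        return -0.2   # dropping the hand-off and apologizing is penalized mildly
    return 0.0
```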
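And here is a toy sketch of that retrieval step: a keyword-overlap retriever that pulls the most relevant snippet from a small document store and prepends it to the prompt. Real systems typically use embedding search over a vector store, and the documents and scoring here are simplified assumptions.

```python
documents = [
    "Q3 2024 earnings report: revenue grew 12% year over year.",
    "Returns policy: refunds are processed within 5 business days.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Toy retriever: pick the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(docs, key=lambda d: len(query_words & set(d.lower().split())))

def build_prompt(query: str) -> str:
    """Prepend retrieved, up-to-date context so the model leans on it
    instead of its frozen internal knowledge."""
    context = retrieve(query, documents)
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How did revenue change in the latest earnings report?"))
```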
You can attach something like a new earnings report, and the model should be able to handle it. It should also be able to handle things being wrong: if the attached document isn't actually an earnings report, or isn't the latest one, the model can go check. Here, for instance, it checks today's date, likely with that date-time tool, realizes the report isn't the latest one, and then handles the incorrectly provided information from the user gracefully, which is a pretty common situation in production.

Here's a short case study of handling a live agent coordination situation, and of how that informs what you do in your fine-tuning and RL workflows. A user says, "My order is late. Tracking says it's lost, help." The model decides it needs to find the user's order and check the tracking, writes out some code, and gets an error. It still doesn't have the status, so it tries again: still an error. It tries again: still an error. Finally it answers, "Sorry, I just don't have that information. Please check our FAQ page." You're probably familiar with seeing a response like this, and of course the user is very upset. How do you get the information to fix this? A lot of the time you'll have to deploy the agent, even in a limited setting, to collect exactly this kind of data about what users are actually asking for, and then correct the behavior with post-training techniques.

In fine-tuning, that could look like taking this error, noticing the issues with the function name and the parameter being passed in, and providing the right target output: this is the correct function declaration, and you're supposed to pass the customer ID, not the name. You create a lot of examples like this so the model can effectively use your company's tools. In RL, it could look like this: the same call that doesn't execute and is wrong gets a low reward; if it executes but is still wrong, it gets a slightly higher reward; and if it executes and is correct, it gets a strongly positive reward. For planning, a trajectory with many failures that just quits and points the user to the FAQ page might get a negative reward, whereas escalating, which may be the more effective path for a human-in-the-loop process in production, gets a positive reward after many failures. Rough sketches of both the fine-tuning correction and these rewards follow at the end of this section.

So this is a way of thinking about how your agent actually needs to operate in production, and then what tools you have in your post-training tool chest, fine-tuning and reinforcement learning, to adapt your model accordingly. Now that you know how live agents interact with post-training, let's switch gears a bit and look at the different stages of promoting a model from development to staging to production, honing in specifically on the RL piece.
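To close out the case study above, here is the fine-tuning correction sketch mentioned earlier. The function name, parameters, error text, and customer ID are hypothetical stand-ins for whatever your company's real tool schema and logs look like.

```python
# Hypothetical fine-tuning example: the model's failing call, the error it hit,
# and the corrected target call using the right function declaration and parameter.
tool_correction_example = {
    "input": {
        "user": "My order is late. Tracking says it's lost, help.",
        "failed_call": "lookup_order(name='Jane Doe')",
        "error": "TypeError: lookup_order() got an unexpected keyword argument 'name'",
        "tool_declaration": "lookup_order(customer_id: str) -> OrderStatus",
    },
    # Target: call the tool with the customer ID, not the customer's name.
    "target_output": "lookup_order(customer_id='C-00123')",
}
```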
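And here is a hedged sketch of the reward shaping from that same case study: partial credit when a call at least executes, full credit when it executes and is correct, a penalty for giving up and pointing to the FAQ, and a positive reward for escalating to a human after repeated failures. The values and outcome labels are illustrative assumptions.

```python
def tool_call_reward(executed: bool, correct: bool) -> float:
    """Hypothetical shaped reward for a single tool call attempt."""
    if executed and correct:
        return 1.0    # executes and returns the right result
    if executed:
        return 0.2    # executes but the answer is still wrong: small partial credit
    return -0.5       # does not execute at all

def planning_reward(num_failures: int, final_action: str) -> float:
    """Hypothetical episode-level reward for how the agent ends a failing loop."""
    if num_failures >= 3 and final_action == "escalate_to_human":
        return 1.0    # escalating after repeated failures is the behavior we want
    if final_action == "point_to_faq":
        return -1.0   # giving up and sending the user to the FAQ page is penalized
    return 0.0
```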