So a lot of things go into choosing a GPU. Memory matters a lot: the model weights, the optimizer states, the activations, and the gradients all have to fit on that GPU for you to actually train the model and run backprop. If you need more memory, you need multiple GPUs to fit all of this. And if the model itself is really large, you need to split the model across a bunch of different GPUs, which is called model sharding. That obviously adds complexity and GPU communication overhead as well.

On cost and time, how long you're training for really matters for cost estimates, and so does the number of parallel experiments. When you were doing error analysis, you had to run so many different experiments, and the more experiments you can run in parallel, the faster you can home in on the right recipe to get the model to the right performance level. So using compute to scale out the number of parallel experiments is another lever, but of course it comes with cost, even if it reduces time. The tradeoff is often higher upfront cost to parallelize across many GPUs versus longer run times on cheaper GPUs.

Just to double click on the memory piece and on precision: this is how a 7 billion parameter model can be represented in 32-bit, 16-bit, 8-bit, and 4-bit precision, and how much memory each takes. You can represent it in quite a small amount of memory, 3.5 GB at 4-bit, if you find that the quantized 7B model performs about as well as the original model in your evals. Typically, I suggest looking at quantization after you've fully trained your model and you're running inference only, not during the training process itself, because during training you want to represent the weights at full precision.

For inference only, you're really just storing one times the parameters of the model, plus the KV cache, which is a way to optimize inference by caching the attention keys and values so you can generate more efficiently. The KV cache takes up a bit more memory, depending essentially on the prompt length. In training, there is a lot more to store: not only the weights of the model, but the gradients, the two optimizer states for AdamW, and the activations. So understand that the scale is just much, much larger for training, and that includes both fine-tuning and RL training. And this is just for training one model; you'll see in RL that there are going to be several different models at play.

So this here is for fine-tuning. With limited memory (VRAM) on the GPU, you might also consider LoRA adapters, which avoid storing gradients and optimizer states for the frozen base model weights, and that's a huge saving. QLoRA goes further by quantizing that frozen base model so it has an even smaller footprint. And obviously, GPUs with larger memory capacity can fit larger models without splitting them across multiple GPUs.

Okay, so when choosing a GPU across your different methods: for fine-tuning, again, it's those four components. For LoRA, you save a lot, because you're only training the adapter weights; you still need the base model and the activations in memory, but the base model stays frozen.
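To make that memory arithmetic concrete, here's a rough back-of-the-envelope sketch in Python. The weights/gradients/AdamW multipliers follow the accounting above; the activation, KV-cache, and adapter-size numbers are placeholder assumptions, since those depend heavily on batch size, sequence length, and implementation.

```python
# Rough VRAM estimates for a dense transformer.
# Activation, KV-cache, and adapter sizes below are illustrative assumptions, not measurements.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(n_params: float, precision: str) -> float:
    """Memory just to hold the weights at a given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

def inference_gb(n_params: float, precision: str, kv_cache_gb: float = 2.0) -> float:
    """Inference: ~1x weights, plus a KV cache that grows with prompt length."""
    return weights_gb(n_params, precision) + kv_cache_gb

def full_finetune_gb(n_params: float, precision: str, activations_gb: float = 20.0) -> float:
    """Training: weights + gradients + two AdamW states (~4x weights) + activations."""
    return 4 * weights_gb(n_params, precision) + activations_gb

def lora_finetune_gb(n_params: float, adapter_params: float, precision: str,
                     activations_gb: float = 10.0) -> float:
    """LoRA: frozen base weights (1x) plus the 4x multiplier only on the small adapter."""
    return (weights_gb(n_params, precision)
            + 4 * weights_gb(adapter_params, precision)
            + activations_gb)

if __name__ == "__main__":
    n = 7e9  # 7B parameters
    for p in BYTES_PER_PARAM:
        print(f"7B weights at {p}: {weights_gb(n, p):.1f} GB")   # 28 / 14 / 7 / 3.5 GB
    print(f"Inference at 4-bit:      ~{inference_gb(n, 'int4'):.1f} GB")
    print(f"Full fine-tune, 16-bit:  ~{full_finetune_gb(n, 'fp16/bf16'):.0f} GB")
    print(f"LoRA fine-tune, 16-bit:  ~{lora_finetune_gb(n, 5e7, 'fp16/bf16'):.0f} GB")
```

The 4x multiplier on trained parameters (weights, gradients, and the two AdamW moments) is what makes training so much heavier than inference, and LoRA only pays that multiplier on the small adapter.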
Typically, LoRA can get away with 10 to 20 percent of the full fine-tuning footprint, and sometimes even less. RL, on the other hand, is very compute intensive and uses a ton of memory, because you need multiple models, multiple LLMs, in memory at once: your policy LLM, your reference LLM, your reward model, and, for PPO specifically, a value model for baseline estimation. With GRPO you can get rid of that baseline estimation model, so you save one model there. You're typically looking at something like a two to four times fine-tuning memory footprint.

Looking at GRPO specifically: first, you have your policy LLM, which uses the same amount of memory as fine-tuning. You have your reference LLM, which is frozen, so you don't need the extra memory for gradients or optimizer states. Then you have your reward model. That will vary, of course, but it's typically much smaller: you're not using a huge LLM, or you're using a head trained on top of your existing LLM. And then activations, which depend on your batch size and the sequence length from your group rollouts; in GRPO you roll out a group of completions to calculate the advantage, and that takes up memory as well. All told, we're looking at something like 170 to 190 GB of VRAM for even a relatively small 13 billion parameter LLM using GRPO, and that's at 16-bit.

So that is a ton of memory. To recap: full fine-tuning sits in the middle; LoRA saves a lot, so I really recommend it; and RL needs a huge amount of memory. In addition to memory, there's compute: fine-tuning needs high-throughput training to push all your data through; LoRA training is more efficient because you're only updating a few weights; and RL is pretty extreme, with a lot of compute and typically even more data flowing through.
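Here's a rough sketch of that GRPO memory accounting. The reward-model size and the rollout-activation allowance are assumptions for illustration; with numbers in this ballpark, a 13B policy at 16-bit lands in roughly the same 170 to 190 GB range.

```python
# Back-of-the-envelope VRAM for GRPO at 16-bit (2 bytes per parameter).
# The reward-model size and rollout-activation allowance are illustrative assumptions.

BYTES = 2  # 16-bit precision

def gb(n_params: float, multiplier: float = 1.0) -> float:
    return n_params * BYTES * multiplier / 1e9

def grpo_vram_gb(policy_params: float,
                 reward_params: float = 7e9,      # assumed smaller reward model
                 rollout_activations_gb: float = 35.0) -> float:
    policy = gb(policy_params, 4)      # weights + grads + two AdamW states, like fine-tuning
    reference = gb(policy_params, 1)   # frozen copy: weights only, no grads or optimizer
    reward = gb(reward_params, 1)      # frozen reward model (or a small head)
    return policy + reference + reward + rollout_activations_gb

print(f"~{grpo_vram_gb(13e9):.0f} GB for a 13B policy")  # lands around 170-190 GB
```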
Okay, so that was all for training. Now you need to do some capacity planning for inference. You have your model already; what considerations do you need to make for the traffic coming in from your users, and for throughput, meaning how much you're processing in parallel? And then you need to count the number of GPUs to actually hit your targets, for production as well as for staging and canary.

For user traffic, you're looking at queries per second (QPS) times the average number of tokens per request: the input, the output, and any tools the model might be calling. You take all of that into account to estimate how much traffic you expect. For throughput, it comes down to your batch size and the expected tokens per second per GPU. And then finally, the number of GPUs: let's look at an example ballpark sizing. Say you're looking at 30 QPS, your average prompt is about 1.2k tokens, you're generating 300 tokens of output, and you want the 95th-percentile latency in production to be under 900 milliseconds. With batching and FP8 (8-bit) precision, at 60 tokens per second per GPU against that 900-millisecond target, you might need about 10 GPUs to handle this inference load, plus a bit of headroom, maybe 30 percent, for the models that make it to staging.

Doing the back-of-the-envelope math, at $1.99 per hour per GPU that could look like about $20 an hour, almost $500 a day, and about $14,000 per month. So to right-size this, you need to take all of these factors into account and be able to predict how users are going to use the system. Of course, it's hard to predict all of this up front, so thinking about elastic sizing with your different cloud compute options is also really helpful and important.

Now that you know what infrastructure you need to make the magic happen, let's take a look at your production checklist before you're ready to go.
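As a rough sketch of that sizing and cost arithmetic, here's some back-of-the-envelope Python. The per-GPU figure in the code is an assumed aggregate decode throughput under batching, not a measured number; you'd replace it with results from load tests on your own model and hardware. With that assumption, the count lands in the same ~10-GPU ballpark, and the cost math follows directly. Prefill and KV-cache load from the 1.2k-token prompts isn't modeled here.

```python
import math

# Rough inference sizing and cost math for the example above.
# per_gpu_tokens_per_s is an ASSUMED aggregate decode rate with batching at FP8;
# measure the real number with your own load tests before committing to a fleet size.

def gpus_needed(qps: float, output_tokens: int,
                per_gpu_tokens_per_s: float, headroom: float = 0.30) -> int:
    """GPUs to sustain the decode load, with headroom for spikes and staging/canary."""
    demand = qps * output_tokens                       # output tokens/sec to generate
    return math.ceil(demand / per_gpu_tokens_per_s * (1 + headroom))

def serving_cost(n_gpus: int, dollars_per_gpu_hour: float) -> tuple[float, float, float]:
    """Hourly, daily, and ~monthly cost for an always-on fleet."""
    hourly = n_gpus * dollars_per_gpu_hour
    return hourly, hourly * 24, hourly * 24 * 30

n = gpus_needed(qps=30, output_tokens=300, per_gpu_tokens_per_s=1200)  # assumed throughput
hourly, daily, monthly = serving_cost(n, 1.99)
print(n, f"GPUs -> ${hourly:.2f}/hr, ${daily:.0f}/day, ${monthly:,.0f}/month")
# -> 10 GPUs -> $19.90/hr, $478/day, $14,328/month
```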