In this lesson, you'll learn about multimodality, explore the different models within the Gemini family, and understand how to select the right model for your specific use case or application. Let's get started.

First, a quick overview of the Gemini model family to get everyone on the same page. Gemini is Google DeepMind's multimodal model. So what does this mean? Well, it means Gemini isn't just trained on text like many language models you might know; it's also trained on a diverse set of image, audio, video, and text data. Why is this important? Well, this multimodal approach means the model can reason across these different modalities. Think of it like this: the model isn't just good at recognizing a cat in a picture. It can also potentially understand a video of that cat playing, the sound of it meowing, and even describe it in a poem. That's the power of multimodality. In this course, you will learn more about multimodality and work on solving some really fun use cases.

So Gemini isn't just one model. It's a family of models designed to fit different needs. Think of it like choosing the right tool for the job. There are different model sizes, and each size is tailored to different computational constraints and application requirements. Okay, let's unpack this a bit more and look at the different models.

First, there's Gemini Ultra. This is the largest and most capable model, delivering state-of-the-art performance across a wide range of highly complex tasks, including reasoning and multimodal tasks. Later on, we'll talk more about reasoning and multimodal tasks. In the real world, using the biggest model isn't always the best strategy. Imagine using a truck for a quick grocery run: it's a bit of overkill. The same can apply here. In the world of large language models, we often see a trade-off: the biggest models are incredibly powerful, but sometimes a little slower to respond.

Gemini Pro is designed as a versatile workhorse. It's a performance-optimized model that balances model quality and speed, and it generalizes very well. This makes it ideal for a wide range of applications where you need the model to be capable, so it provides a high-quality response, and also efficient in providing that response.

Then we have Gemini Flash. This is a model purpose-built to be the fastest, most cost-efficient model yet for high-volume tasks, offering lower latency and cost. So this is perfect for use cases where you need the model to respond fast. Imagine you're building a customer service chatbot that needs to answer common questions instantly, or perhaps you're developing a real-time language translation tool that needs to keep up with fast-paced conversations. Gemini Flash's emphasis on speed and efficiency makes it a perfect fit for these types of demanding use cases.

And then there's Nano. Gemini Nano is a lightweight member of the family specifically designed to run directly on user devices, for example a Pixel phone. So how do we achieve this? We do this through a process called model distillation. Think of model distillation like teaching a student: a large expert model, the teacher, passes its knowledge to a smaller, more compact model, the student. The goal is for the student model to learn the most important skills without needing the same vast resources as the teacher.
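To make the teacher-student idea a bit more concrete, here is a toy sketch of the standard knowledge-distillation loss: the student is trained to match the teacher's softened output distribution. This is just the textbook technique for illustration, not Gemini's actual training recipe, and all the numbers are made up.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # A higher temperature "softens" the distribution, exposing more of the
    # teacher's knowledge about how the possible outputs relate to each other.
    z = logits / temperature
    z = z - z.max()                 # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft targets and the student's
    # softened predictions; minimizing this pulls the student toward the teacher.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

# Toy example with made-up logits for a single training example.
teacher_logits = np.array([4.0, 1.0, 0.5])   # scores from the large "teacher"
student_logits = np.array([2.5, 1.2, 0.8])   # scores from the small "student"
print(distillation_loss(teacher_logits, student_logits))
```

In practice this soft-target loss is usually combined with the ordinary loss on the ground-truth labels, but the core idea is the same: the student learns from the teacher's outputs rather than only from the raw training data.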
In the case of Nano, we distill knowledge from the larger Gemini models to create a model that fits comfortably on smartphones and other devices. So why on-device? One reason could be that you need local processing of sensitive data. Processing data locally can help you avoid sending user data to a central server, which might be important for apps that handle sensitive data. Another reason could be offline access: users can access AI features even when there's no internet connection, which is useful for applications that need to work offline or with variable connectivity.

Now, you might be feeling lost. There are all these different models: the Gemini family, something like Llama, and within these model families there are also different model versions. This can all be very confusing. How do I choose the right model for my use case? Choosing the optimal model isn't a one-size-fits-all scenario. Each model comes with its own strengths and trade-offs, as we've just seen. So let's break it down into the key factors you might want to consider.

The first step to making a decision is understanding your use case. What are you trying to build? Is it maybe a chatbot, or a content generation tool? Different tasks and use cases might favor different models and capabilities. Once you understand your use case, it might help to plot it against these three axes. Of course, there are many more requirements, but let's simplify it for now just to get a better understanding of how to choose the right model.

You might want to look at model capability: carefully evaluate the model's specifications and whether they align with your use case. For example, can it handle text? Can it also handle images? You might also want to look at latency: how fast does your application need to respond? If real-time interaction is important, you want a model that can generate responses quickly. And you might want to look at cost: larger models often deliver superior performance but come at a higher computational cost. To choose a model, you might want to plot your use case against these requirements. For example, if you're developing an image search tool, you likely want to prioritize a model with excellent image processing capabilities.

Okay, we've mentioned multimodal a few times. Let's break down what this means in practice. A multimodal model learns to understand things like images, videos, and text. Think of it as a model that speaks the language of audio, video, images, PDFs, text, and even code. So what does this mean for you or your use cases? For example, you can use a multimodal model to extract text from images. Ever wanted to take the text out of a meme or a document scan? It can also understand what's happening in images or video: it can identify objects, scenes, and even emotions. We'll go through some of these use cases in this course as well. The cool thing is that you can provide these inputs in different ways, as you will also see in the course. Inputs can be a mix of code, text, PDF, image, video, and audio.

Okay, let's look at an example. Let's say I want to buy my cat, here on the left, a new outfit. I can give the model two images of my cat, and I can provide a prompt, a text prompt in this case: what do you think would be a good outfit for the cat? I'm going to provide both the images and the text as model input.
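To make this concrete, here is a minimal sketch of what that call could look like, assuming the google-generativeai Python SDK; the API key placeholder, file names, and model name are just for illustration.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # example model name

# Two local photos of the cat (hypothetical file names).
cat_photo_1 = Image.open("cat_photo_1.jpg")
cat_photo_2 = Image.open("cat_photo_2.jpg")

# The parts are interleaved: an image, then the text prompt, then another image.
response = model.generate_content(
    [cat_photo_1,
     "What do you think would be a good outfit for this cat?",
     cat_photo_2]
)
print(response.text)
```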
The model will then give me a response advising me on the best outfit I can buy for my cat based on these two images. The inputs can be interleaved, meaning they can be a mix of things like text, image, and audio. In this example, we have an image, followed by the text prompt, followed by an image again. You can change the order of this, and order matters; we'll talk about that later in the course.

We also talked about advanced reasoning and cross-modal reasoning. This means the model can analyze complex information and extract insights across different modalities, making it valuable in fields like science, law, and finance, and really any industry that wrestles with that kind of challenge. Think of it like this: imagine you're a researcher studying the impact of climate change. Gemini can analyze scientific papers (text), satellite images (visual data), and temperature graphs (numerical data) all together. It might then be able to identify patterns and trends that help you understand the bigger picture.

Here's a visual example: a student's solution to a physics problem. We have an image, and the image contains the question and the student's answer. We have a text prompt asking the model to reason about the question step by step and then decide whether the student gave the correct answer. If the solution is wrong, the model is asked to explain what is wrong and to solve the problem. So we're providing a text prompt and an image, and on the right we can see the model's answer. Have a look and decide whether you think the model's response is correct. The cool thing here is that the model was able to reason across text and an image, where the image contains writing and a drawing, and then provide a response that is both text and some LaTeX for the math.

Okay. So you learned about the Gemini model family, what it means to be a multimodal model, and about reasoning across modalities. What you will learn throughout this course is a couple of things. First of all, we'll look at the fundamentals: how to get started with these models using the Python SDK and the APIs. We'll talk about the core functionalities and how you can integrate them into your workflows or use them for your use cases. We'll talk about parameters and how they influence the output. We'll talk about prompting and best practices for prompting multimodal models. And we'll look at a ton of image and video use cases and how you can use these models to solve them. And of course, much, much more. So let's start coding.
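As a first taste of that coding, here is a rough sketch of what the physics example above could look like as a multimodal API call, again assuming the google-generativeai Python SDK; the file name, model name, and prompt wording are illustrative.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")           # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")   # example model name

# Image containing the question, the student's drawing, and the written answer
# (hypothetical file name).
solution_image = Image.open("student_physics_solution.png")

prompt = (
    "Reason about the question in the image step by step. "
    "Then check whether the student's answer is correct. "
    "If it is wrong, explain the mistake and solve the problem yourself."
)

# The contents list is ordered: here the image comes first, then the text prompt.
# Reordering the parts changes the prompt the model sees, and order matters.
response = model.generate_content([solution_image, prompt])
print(response.text)
```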