In this lesson, you'll take a deep dive into best practices for prompting with images, covering everything from image requirements to the optimal prompt structure for effective multimodal prompting. All right, let's dive in. In this lesson, you will explore how to work with images, and we'll walk through some common use cases and how to reason across modalities.

But first, let's cover the technical requirements for using images effectively with our models. Images are converted into tokens, and currently you can use up to 3,000 images per request. The image types you can use are PNG or JPEG, which are quite common formats. The number of pixels in an image isn't limited; however, larger images are scaled down. Each image accounts for 258 tokens.

Before we dive into how to design prompts for use cases with images, and get some tips on how to craft effective prompts for these use cases, it is crucial to remember that working with multimodal large language models is an evolving landscape. Things are changing over time, so here are a few things to remember. First, experimentation is essential. There's no one-size-fits-all recipe for the perfect prompt for multimodal models. Different use cases, different models, and even different data types will require different approaches, so make sure you explore various prompt structures, phrasings, and formats to find what works best for your specific use case. Second, model behavior varies. Every model has its own quirks and nuances, and a prompt that works well with one model may not yield the same results with another, so pay attention to how different models respond to your prompts and tailor your approach accordingly. Third, best practices will always evolve over time. The more research and experimentation we do, the more we learn about what works for our use cases and different models. The field of LLMs is rapidly advancing, so new techniques and best practices emerge constantly.

Now let's look at an example of a use case and some strategies that might help you improve your prompt designs for multimodal prompts. To explore how you can craft effective multimodal prompts, let's dive into a scenario. Imagine you have an image, in this case an image of a cute dog, and you want to understand what's happening in it. This is where prompt design and prompt engineering become crucial. By carefully designing your instructions, you can guide the model to process the image, understand things like its context, and generate the specific output you need for your use case.

Let's look at a few tips and strategies that have worked for me to improve prompts for my use cases. First, there's the importance of clarity and conciseness. Think of a prompt as a task assignment for a colleague: if your instructions are vague or rambling, the result might not be what you expect. The same goes for these models. You can see that the second prompt is a significant improvement. It is more specific, more actionable, and guides the model towards the information we would like it to return. While multimodal models are powerful, they are definitely not mind readers, so write your prompts as if you're explaining a task to a human who is intelligent but needs clear guidance, and avoid things like overly technical jargon or assumptions about what the model should know.
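To make this concrete, here's a minimal sketch of the difference between a vague and a specific prompt, assuming the google-generativeai Python SDK; the model name, API key, and image path are placeholders for illustration, not part of the lesson's materials.

```python
# A minimal sketch, assuming the google-generativeai SDK; the model
# name, API key, and image path are placeholders for illustration.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

image = PIL.Image.open("dog.jpeg")  # PNG or JPEG, per the requirements above

# A vague prompt leaves the model guessing about what you want.
vague = model.generate_content([image, "What is this?"])

# A specific, actionable prompt guides the model to the information you need.
specific = model.generate_content([
    image,
    "In two sentences, describe what the dog in this image is doing "
    "and where the scene appears to take place.",
])
print(specific.text)
```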
Let's go to another example. When interacting with a model, providing clear instructions is crucial. One effective technique is to assign a specific role to the model within your prompt. Think of it like this: imagine you're directing a play. You wouldn't just hand the actors a script and say "go!" You'd tell them who their character is, what their motivations are, and how they should interact with others. Similarly, a role instruction tells the model how to act within the context of your request.

So why does a role instruction matter? First of all, clarity: role instructions clarify the model's task, reducing ambiguity and improving the relevance of its output. Secondly, focus: they guide the model towards a specific style, tone, or level of detail, making the response more tailored to your needs and your use case. Thirdly, consistency: by specifying a role, you encourage the model to maintain a consistent voice and perspective throughout its output. In this case, we have an example where we assign the model the role of an AI that does image understanding. As another example, let's say you summarize financial documents; you could assign the model the role of a financial expert.

Next, let's talk about prompt structure. This is crucial because the way you design your input directly influences how well a multimodal model performs. How you structure a prompt can affect the model's ability to parse the information in it, and it also helps the model correctly interpret how to use the given information. To give structure to a prompt, you can use prefixes or something like XML tags to delimit the different components of the prompt. In this example, the structure is that we provide the image first, next we assign a role, then we have our question or ask, and then we leave room for the model to provide its answer. There's no one structure that fits all; it depends on your use case and the model you're using, so also experiment with this.

So why does prompt structure matter? Think of your prompt as a set of instructions for the model. A well-structured prompt does three key things. First of all, just like in our example, it organizes the information: it helps the model easily identify different types of content within your input. Here we can see the image, the role, and the question. Secondly, it guides interpretation: it clarifies how each piece of information should be used and how it relates to the others. Is an image meant to be summarized? Is a code snippet meant to be executed? The prompt structure provides these cues. Thirdly, it encourages the desired output: by setting clear expectations in the prompt, you increase the likelihood of getting the kind of response you're looking for.

There's no one structure that fits all use cases or all models, but here are some examples of elements you can use to add structure to your prompts. First of all, role: we already talked about role, but this helps the model understand its purpose and adjust its response accordingly. Secondly, objective: state the goal you want the model to achieve. This could be answering a question, summarizing a document, generating code, or providing insights. Be as specific as possible. Thirdly, context: provide any background information or relevant data the model needs to understand the task and generate accurate responses. This could include things like text, images, charts, or other data sources. Fourthly, constraints: specify any limitations or requirements you want the model to adhere to. This might include the length of the response, the format of the output, or restrictions on certain types of content. For example, you can specify that you want the output in HTML format. These elements can help you improve the structure of your prompts, as the sketch below shows.
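Here's what such a structured prompt might look like in code: a sketch assuming the google-generativeai SDK, where the XML-style tag names, the chart image, and the file path are illustrative choices of mine, not a required schema.

```python
# A minimal sketch of a structured prompt, again assuming the
# google-generativeai SDK; the model name, file path, and tag names
# are illustrative placeholders, not a required schema.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

chart = PIL.Image.open("quarterly_chart.png")  # hypothetical chart image

# Image first, then XML-style tags to delimit role, objective,
# context, and constraints, with room left for the answer.
prompt = """
<role>You are an AI that specializes in image understanding.</role>
<objective>Summarize the key trend shown in the chart.</objective>
<context>The chart comes from a company's quarterly financial report.</context>
<constraints>Respond in HTML format with at most three bullet points.</constraints>
Answer:
"""

response = model.generate_content([chart, prompt])
print(response.text)
```

Notice that the image comes first, matching the structure from the example; you can swap the tag names or the order, as long as the delimiters stay consistent throughout the prompt.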
In multimodal models, the order in which you present information plays a significant role in the quality and relevance of the model's output. This applies to the prompt structure we just talked about: the way you structure your text prompt, the questions you ask, and the instructions you give can all impact the model's response. The order of the modalities also matters: the sequence in which you present different types of input, like images, text, tables, and so on, can influence the model's understanding and its ability to connect the dots. Consider a scenario where you analyze a medical report: presenting the patient's history as text before an X-ray image can help the model better interpret the visual information.

So why does order matter? First of all, contextual understanding: models build up understanding as they process information, and the order in which you present details can set the stage for how subsequent information is interpreted. Secondly, attention focus: by strategically ordering your prompts and modalities, you can guide the model's attention to specific aspects, potentially improving the accuracy of its response. So also experiment with the order of your prompt and the order of the modalities.

Okay, these were some tips and strategies that I use to improve my multimodal prompts. Hopefully, they also help you with your use cases. Now it's time to look at some code and try some of these strategies in practice.
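As a starting point, here's a minimal sketch of the ordering tip from the medical-report scenario, once more assuming the google-generativeai SDK; the file path and patient details are hypothetical.

```python
# A minimal sketch of ordering modalities, assuming the
# google-generativeai SDK; the file path and patient details
# are hypothetical.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

xray = PIL.Image.open("chest_xray.png")  # hypothetical X-ray image

# Text context first, then the image: the history sets the stage
# for how the model interprets the visual information.
response = model.generate_content([
    "Patient history: 58-year-old with a persistent cough for six weeks.",
    xray,
    "Given the history above, describe notable findings in this X-ray.",
])
print(response.text)
```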