In this lesson, you will use Llama 4 for image understanding and reasoning. We will work through several practical examples, from identifying objects in an image to coding a user interface based on its screenshot. Let's get started.

Llama 4 Scout and Maverick are designed to handle both language and visual inputs at the same time. This means they can understand and reason about images just as they do with text, which unlocks a wide range of applications, from answering questions about images to describing scenes or identifying objects. Grounding is one of Llama 4's standout features. It is about linking parts of the prompt to specific areas of the image. This allows for more accurate answers, in particular when the question is about something spatial, like "what's in the top right corner?" or "where are the measuring tools in this picture of many tools?" Llama 4 can find the objects and return the coordinates of their bounding boxes.

In the notebook you will work on several image reasoning use cases: image grounding on a picture of many tools, analyzing a table in a PDF file, generating code from a screenshot of a user interface, solving a math puzzle, and analyzing a computer screen. All right, let's get started.

Let's first load our API keys. In this lab, besides the Llama API, we will use Llama 4 on Together AI. For this, you need the Llama base URL and a Together AI API key. All of this is already set up for you on the DeepLearning.AI platform, so you do not need your own keys to run these notebooks. Now let's load our two utils functions, llama4 and llama4_together. The llama4_together function is very similar to the llama4 function you already implemented in the previous lesson. The difference is that instead of the Llama API client, you import Together, create a Together AI client, and pass it your Together AI API key. The rest stays the same. Most inference providers, Together AI included, expect image data to be embedded as base64 strings inside the request payload, so we also have a helper function that takes an image and returns its base64 encoding.

Image grounding is a fundamental task in computer vision and natural language processing that involves identifying the specific objects or regions in an image that correspond to a given text description. Let's see how Llama 4 does image grounding on this picture of many tools. This is our prompt: "Which tools in the image can be used for measuring length? Provide bounding boxes for every recognized item." You can convert the tools PNG image to base64 and pass it along with the prompt to the llama4 function. And here are the two items that can be used for measuring length, with the coordinates of the bounding box returned for each item. Note that the bounding box values represent the normalized coordinates of a bounding box in the image; to get the actual pixel coordinates, you need to multiply the normalized values by the width and height of the image.

In the utils file, we have two helper functions: parse_output and draw_bounding_boxes. The first one takes Llama's output and parses out the coordinates of the bounding boxes, and the second draws the bounding boxes on the image so that you can visualize the result. Let's pass the image and prompt to Llama again and save the result, then pass the output, which is the result from the model, to parse_output to get the tool coordinates and their descriptions. A rough sketch of this pipeline follows below.
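Here is a minimal sketch of what encode_image and llama4_together might look like, assuming the Together Python SDK and an OpenAI-style chat message format. The model ID, function signatures, and file path below are assumptions for illustration; the exact versions live in the course's utils file.

```python
# A minimal sketch of the llama4_together helper described above, assuming the
# Together Python SDK and its OpenAI-style chat completions endpoint.
# The model ID and message format are assumptions; check Together's model catalog.
import base64
from together import Together  # pip install together

def encode_image(image_path: str) -> str:
    """Return the base64 string of an image file."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def llama4_together(prompt: str, image_base64: str | None = None,
                    model: str = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8") -> str:
    client = Together()  # reads TOGETHER_API_KEY from the environment
    content = [{"type": "text", "text": prompt}]
    if image_base64:
        # Many OpenAI-compatible providers accept images as base64 data URLs.
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_base64}"}})
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

# Image grounding on the tools picture (file path is illustrative):
output = llama4_together(
    "Which tools in the image can be used for measuring length? "
    "Provide bounding boxes for every recognized item.",
    encode_image("images/tools.png"),
)
print(output)
```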
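And here is a hedged sketch of the drawing step, assuming parse_output hands back (label, box) pairs with coordinates normalized to the 0-1 range; the real helpers in the utils file may use a different output format.

```python
# A sketch of how draw_bounding_boxes might work, assuming boxes arrive as
# (label, [x1, y1, x2, y2]) with coordinates normalized to the 0-1 range.
from PIL import Image, ImageDraw

def draw_bounding_boxes(image_path, boxes, out_path="tools_annotated.png"):
    img = Image.open(image_path)
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for label, (x1, y1, x2, y2) in boxes:
        # Scale normalized coordinates to pixel coordinates.
        pixel_box = (x1 * w, y1 * h, x2 * w, y2 * h)
        draw.rectangle(pixel_box, outline="red", width=3)
        draw.text((pixel_box[0], pixel_box[1] - 12), label, fill="red")
    img.save(out_path)
    return out_path

# Illustrative values only; use the coordinates parsed from Llama's output.
draw_bounding_boxes("images/tools.png",
                    [("tape measure", [0.12, 0.30, 0.35, 0.55]),
                     ("ruler", [0.60, 0.10, 0.95, 0.22])])
```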
Finally, pass the parsed output, along with the original image, to draw_bounding_boxes to see the result. Here is our image with bounding boxes around the two tools that are used for measuring, the name of each tool, and the coordinates of each tool's bounding box.

Let's work on another use case. This time you will use Llama to analyze a table in a PDF document. We can do this in two ways: one, convert the table into text and ask Llama questions based on that text; two, take an image of the table and use it to prompt Llama. Let's try both and compare them. Here is the PDF file, and this is the table we want to ask questions about. To convert the PDF to text, we can use a helper function that takes the file, extracts its text, and returns it (a possible implementation is sketched a bit further below). Let's pass our PDF file to the pdf_to_text function and get its text. To see the converted text for this table in the report, we can search for "fourth quarter and full year 2024 financial", which is this text, and display the part of the report that contains the table. Here is the extracted text from that table. Let's now ask Llama about the 2024 operating margin using the text of the report. Here is the response: the operating margin for 2024 was 42%, which is correct. Now let's repeat this using the table saved as an image. We have saved it in this file, and here is the image. We can convert the image to base64 and pass it along with the prompt to Llama 4. And here is the response, which is again the same 42% you got using the text of the report. Please note that although the answers in both cases were the same, the reasoning is different, and in some tricky situations you can get better and more accurate results using the image, because Llama has a much better understanding of the overall structure of the table than it does from the plain-text version.

Now let's use Llama to code a user interface from an image of that interface. For this, we are going to take a screenshot of a frame from a video on Meta's website, and we will code it using Llama 4. This is the image of that screenshot. Let's first ask a question to see if Llama understands the image: "If I want to change the temperature on the image, where should I click?" The temperature is this slider here, and it is currently set to 0.6. And here's Llama's response: to change the temperature on the image, you should click on the slider next to "Temperature", and its current value is also given. Okay, let's see if Llama can code this interface for us. We prompt it: "Write a Python script that uses Gradio to implement the chatbot UI in the image." This is the response, with instructions on what to install and the final Python code. Let's copy and run this code. Here is the code we got from Llama, and by running it we get this interface, with all the sliders implemented and even the chatbox where you can type your message and interact with the interface. Note that this is just the interface we asked Llama to implement; it doesn't include all the functionality that would need to be added later for this to fully work.
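The exact script Llama returns will vary from run to run, so it is not reproduced here. As a rough illustration only, the kind of Gradio layout it produces for this screenshot looks something like the sketch below, where the slider names, ranges, and placeholder respond function are all assumptions.

```python
# A rough illustration of the kind of Gradio UI Llama generates for the
# screenshot: parameter sliders plus a chat area. Names and ranges are assumed.
import gradio as gr  # pip install gradio

def respond(message, history):
    # Placeholder only: a real app would call the model here.
    history = history + [(message, "Model response goes here.")]
    return history, ""

with gr.Blocks() as demo:
    gr.Markdown("## Chatbot")
    with gr.Row():
        with gr.Column(scale=1):
            # Display-only sliders in this sketch; not wired to a model.
            temperature = gr.Slider(0.0, 1.0, value=0.6, label="Temperature")
            top_p = gr.Slider(0.0, 1.0, value=0.9, label="Top P")
            max_tokens = gr.Slider(1, 2048, value=512, step=1, label="Max Tokens")
        with gr.Column(scale=3):
            chatbot = gr.Chatbot()
            msg = gr.Textbox(placeholder="Type your message...")
            msg.submit(respond, [msg, chatbot], [chatbot, msg])

demo.launch()
```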
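One more sketch before moving on: the pdf_to_text helper used in the table example earlier could look roughly like this, assuming pypdf as the extraction library and an illustrative file name (the course utils may use something different).

```python
# A possible shape for the pdf_to_text helper from the table example.
from pypdf import PdfReader  # pip install pypdf

def pdf_to_text(pdf_path: str) -> str:
    """Extract and concatenate the text of every page in the PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

report_text = pdf_to_text("meta_q4_2024.pdf")  # file name is illustrative
# Locate the table section the lesson searches for (string is approximate):
idx = report_text.lower().find("fourth quarter and full year 2024 financial")
if idx != -1:
    print(report_text[idx: idx + 1500])  # show the part of the report with the table
```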
Now let's use Llama to solve a math problem. Here is the problem we want to solve. To solve this, Llama has to understand the problem and how to solve it. Here is our prompt: "Answer the question in the image." We pass the base64_math image in addition to the prompt to Llama. And here's the response: Llama gives all the steps it takes to solve the problem, and the final answer is calculated to be 40, which is correct.

Let's work on another use case, where you use Llama to analyze a computer screen. Here is the image you are going to work with. This might look familiar, as it is a screenshot of our previous course on Llama 3.2 on the DeepLearning.AI platform. Let's ask Llama to describe this screenshot in detail, including the browser, URL, and tabs. Here is a detailed analysis of the image, which even lists all the icons at the bottom of the screen. Now let's say you have a browser agent that wants to automatically go to the next lesson. Let's ask Llama this question: "If I want to go to the next lesson, what should I do?" We display the image again, so we have the result and the image together. And here is the response: to proceed to the next lesson, click on the red button labeled "Next Lesson", which is located here, at the bottom right corner of the screen. This use case shows how you can use Llama in computer-use applications and in building browser AI agents.

In this lesson, you used Llama in several image reasoning and grounding use cases and applications. In the next lesson, you're going to go deep into the prompt format of Llama 4. All right, see you there.