In this lesson, you will learn how Llama 4 interprets prompts, including the special tokens, in both plain-text and multimodal prompts. All right, let's go. Llama 4 Scout and Maverick are multimodal models that offer superior text and image understanding. In this lesson, we'll first take a quick look at the Llama 4 special tokens and prompt format, starting with text-only input, and then we will see how text-plus-image input works. To build Llama 4 apps, you don't have to deal with the Llama 4 special tokens and prompt format directly, but it's always good to understand how your text and image prompts are processed by Llama.

Llama 4 supports the following general special tokens. begin_of_text specifies the start of the prompt. header_start marks the start of a role for a particular message. header_end marks the end of the role for a particular message. And eot is end of turn, which indicates that the model has finished responding to the user input.

Llama 4 also uses the following image tokens. image_start marks the start of the image data in the prompt. image_end marks the end of the image data in the prompt. patch represents a subset of the input image; larger images have more patch tokens in the prompt. tile_x_separator separates the x-tiles of an image, and tile_y_separator separates the y-tiles of an image. Finally, the image token separates the regular tiled image tokens from a downsized version of the whole image that fits in a single tile.

Llama 4 supports the same four roles as Llama 3: system, user, assistant, and ipython. The system role sets the context in which to interact with Llama; the system prompt typically includes rules and guidelines that help the model respond effectively. The user role represents the human interacting with Llama; the user prompt includes the specific user inputs, commands, or questions. The assistant role represents Llama generating a response to the user. And the ipython role represents the output of a tool call that is sent back to Llama.

Let's see this in code. We will begin by loading our API keys. In this lesson, you'll need a Llama API key and a Hugging Face access token. Both keys are already set up for you. We will use AutoProcessor from the Hugging Face transformers library to inspect the raw prompt built from an input message. To compare the raw prompts of Llama 4 and Llama 3 models, we will use Llama 4 Scout and Llama 3.3 and compare their raw prompt formats. Using these two model IDs and AutoProcessor, let's define processor_llama4 and processor_llama33.

Now let's compare the raw prompt for both models using this sample message. Using processor_llama4 and passing the message to its apply_chat_template method, with tokenize set to False and add_generation_prompt set to True, we can see the raw prompt format for Llama 4. Setting add_generation_prompt to True appends the assistant header to the end of the raw prompt. Now let's do the same with processor_llama33. Here is the raw prompt.

There are two main differences between these two raw prompts. First, some of the special tokens have changed in Llama 4: Llama 3's start_header_id has become header_start, end_header_id has become header_end, and eot_id has changed to eot, the end-of-turn token. Another difference is that Llama 3 models add a default system message to the raw prompt, while Llama 4 does not add any default system prompt to your prompt. To see this, let's add a system rule to our prompt: Respond in French. And let's see the raw prompt for Llama 4 and Llama 3.
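The following is a minimal sketch of how this comparison might look with the Hugging Face transformers library. The model repo IDs are assumptions (both are gated, so your Hugging Face token needs access to them), and the sample user message is just a placeholder rather than the notebook's exact text.

```python
from transformers import AutoProcessor

# Assumed model repo IDs; both are gated on Hugging Face and require an authorized token.
llama4_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
llama33_id = "meta-llama/Llama-3.3-70B-Instruct"

processor_llama4 = AutoProcessor.from_pretrained(llama4_id)
processor_llama33 = AutoProcessor.from_pretrained(llama33_id)  # text-only model, so this loads its tokenizer

# Placeholder sample message.
messages = [{"role": "user", "content": "Who wrote the book Charlotte's Web?"}]

# tokenize=False returns the raw prompt string instead of token IDs;
# add_generation_prompt=True appends the assistant header so the model knows it should respond next.
print(processor_llama4.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print(processor_llama33.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

# With a system message: Llama 4 only includes what you provide,
# while Llama 3.3's template also adds its own default system content.
messages_with_system = [
    {"role": "system", "content": "Respond in French."},
    {"role": "user", "content": "Who wrote the book Charlotte's Web?"},
]
print(processor_llama4.apply_chat_template(messages_with_system, tokenize=False, add_generation_prompt=True))
print(processor_llama33.apply_chat_template(messages_with_system, tokenize=False, add_generation_prompt=True))
```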
Now, by including a system prompt, that system prompt is added to Llama 4's raw prompt, and it is appended to the default system prompt in Llama 3's raw prompt.

Now, let's see the prompt format for image inputs. Let's load and display this image that we have already seen in previous lessons. Let's form the messages with the user role and a content list holding the text prompt, "Describe the image below," and the URL of the image. Let's now use processor_llama4, pass our messages, set add_generation_prompt and tokenize to True, get the result as a dictionary, and set return_tensors to PyTorch; a code sketch of this flow is included at the end of this lesson. For the inputs that are created, let's see the keys. input_ids holds the encoded raw prompt of the messages. Let's see the size of pixel_values. Llama 4 is using 10 tiles of size 336 by 336, in the 3 channels of red, green, and blue, for our image.

Let's see how these ten tiles are formed for our image. We have a function plot_tiled_image in utils. If you call this function and pass the width and height of our image, which is 768, the tile size, which is 336 by 336 in Llama 4, and a patch size of 28 by 28 pixels, meaning that each tile will contain 12 times 12, or 144, patches, you will see this result. Each small square is a patch of 28 by 28 pixels, and bigger squares of 12 by 12 patches, or 336 by 336 pixels, form a tile. Here we have the first tile, tile number two, and a third tile here. Tile number 4 is here, then tiles 5 and 6. Then we have tiles 7, 8, and 9. Besides these 9 tiles, Llama creates a global tile by resizing the entire image to 336 by 336 pixels. This global tile provides a global view of the input image. In total, we have 9 tiles of the image plus 1 global tile, which is this tenth one here.

To separate the tiles, Llama 4 has the special tokens tile_x_separator and tile_y_separator. In the case of our example image, we have the first tile, then a tile_x_separator separates it from the second tile, and another tile_x_separator comes before the third tile. When that row of tiles is done, a tile_y_separator is added to start tile number four. Another tile_x_separator separates it from tile number 5, and then comes tile number 6. Again we have another tile_y_separator, then tile number 7, another tile_x_separator, tile number 8, another tile_x_separator, and finally tile 9. Within each tile, Llama 4 uses the special patch token to represent each of the patches.

Now let's see the raw prompt format for this image. The batch_decode method takes the input_ids in our inputs and returns raw_prompt. Note that since there are 144 patches in each tile, this will be a very long string with a great many patch tokens. To shorten this string, we replace every run of 144 patch special tokens with patch, three dots, patch. And here is the result. We have begin_of_text, then the user role, then the text message: Describe the image below. Then image_start is followed by 144 patches for the first tile. A tile_x_separator marks the beginning of the second tile, then come 144 patches for the second tile, and another tile_x_separator marks the beginning of the third tile. Then 144 more patches are added, then a tile_y_separator appears, and this continues until image_end.

In this lesson, you learned about Llama 4's raw prompt format for both text and image inputs. In the next lesson, you will learn how to use Llama 4 on long-context text files and code repositories. See you there!
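For reference, here is a minimal sketch of the image-input flow described above. The model repo ID, the image URL, the exact content keys, the patch token string, and the pixel_values shape are all assumptions based on recent transformers versions and the Llama 4 prompt format, not the notebook's exact code.

```python
from transformers import AutoProcessor

# Assumed model repo ID and image URL; substitute the notebook's actual values.
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor_llama4 = AutoProcessor.from_pretrained(model_id)

image_url = "https://example.com/sample_image.png"  # placeholder for the lesson's image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": "Describe the image below."},
        ],
    }
]

# tokenize=True plus return_dict=True returns token IDs together with the preprocessed image tensors.
inputs = processor_llama4.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

print(inputs.keys())                 # typically input_ids, attention_mask, pixel_values
print(inputs["pixel_values"].shape)  # something like 10 tiles x 3 channels x 336 x 336 for this image

# Decode the token IDs back to text, then collapse each run of 144 patch tokens so the string stays readable.
raw_prompt = processor_llama4.batch_decode(inputs["input_ids"])[0]
raw_prompt = raw_prompt.replace("<|patch|>" * 144, "<|patch|>...<|patch|>")
print(raw_prompt)
```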