In this lesson, you'll learn about multimodality, explore the different models within the Gemini family, and understand how to select the right model for your specific use case or application. Let's get started.

First, a quick overview of the Gemini model family to get everyone on the same page. Gemini is Google DeepMind's multimodal model. So what does this mean? Well, it means Gemini isn't just trained on text like many language models you might know; it's also trained on a diverse set of image, audio, video, and text data. Why is this important? Well, this multimodal approach means the model can reason across these different modalities. Think of it like this: the model isn't just good at recognizing a cat in a picture. It can also potentially understand a video of that cat playing, the sound of it meowing, and even describe it in a poem. That's the power of multimodality. In this course, you will learn more about multimodality and work on solving some really fun use cases.

So Gemini isn't just one model. It's a family of models designed to fit different needs. Think of it like choosing the right tool for the job. There are different model sizes, and each size is tailored to different computational constraints and application requirements. Okay, let's unpack this a bit more and look at the different models.

First, there's Gemini Ultra. This is the largest and most capable model, delivering state-of-the-art performance across a wide range of highly complex tasks, including reasoning and multimodal tasks. Later on, we'll talk more about reasoning and multimodal tasks. In the real world, using the biggest model isn't always the best strategy. Imagine using a truck for a quick grocery run: it's a bit of overkill. The same can apply here. In the world of large language models, we often see a trade-off: the biggest models are incredibly powerful, but sometimes a little slower to respond.

Gemini Pro is designed as a versatile workhorse. It's a performance-optimized model that balances model quality and speed, and it generalizes very well. This makes it ideal for a wide range of applications where you need the model to be capable, so it provides a high-quality response, and also efficient in providing that response.

Then we have Gemini Flash. This is a model purpose-built to be the fastest, most cost-efficient model yet for high-volume tasks, offering lower latency and cost. So this is perfect for use cases where you need the model to respond fast. Imagine you're building a customer service chatbot that needs to answer common questions instantly, or perhaps you're developing a real-time language translation tool that needs to keep up with fast-paced conversations. Gemini Flash's emphasis on speed and efficiency makes it a perfect fit for these types of demanding use cases.

And then there's Nano. Gemini Nano is a lightweight member of the family specifically designed to run directly on user devices, for example a Pixel phone. So how do we achieve this? We do this through a process called model distillation. Think of model distillation like teaching a student: a large expert model, the teacher, passes its knowledge to a smaller, more compact model, the student. The goal is for the student model to learn the most important skills without needing the same vast resources as the teacher.
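To make the teacher-student idea a bit more concrete, here is a toy sketch of the standard knowledge-distillation loss: the student is trained to match the teacher's softened output distribution. This is just the textbook technique for illustration, not Gemini's actual training recipe, and all the numbers are made up.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # A higher temperature "softens" the distribution, exposing more of the
    # teacher's knowledge about how the possible outputs relate to each other.
    z = logits / temperature
    z = z - z.max()                 # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft targets and the student's
    # softened predictions; minimizing this pulls the student toward the teacher.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

# Toy example with made-up logits for a single training example.
teacher_logits = np.array([4.0, 1.0, 0.5])   # scores from the large "teacher"
student_logits = np.array([2.5, 1.2, 0.8])   # scores from the small "student"
print(distillation_loss(teacher_logits, student_logits))
```

In practice this soft-target loss is usually combined with the ordinary loss on the ground-truth labels, but the core idea is the same: the student learns from the teacher's outputs rather than only from the raw training data.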
In the case of Nano, we distill knowledge from the larger Gemini models to create a model that fits comfortably on smartphones and other devices. So why on-device? One reason could be that you need local processing of sensitive data. Processing data locally can help you avoid sending user data to a central server, which might be important for apps that handle sensitive data. Another reason could be offline access: users can access AI features even when there's no internet connection, which is useful for applications that need to work offline or with variable connectivity.

Now, you might be feeling lost. There are all these different models: the Gemini family, something like Llama, and within these model families there are also different model versions. This can all be very confusing. How do I choose the right model for my use case? Choosing the optimal model isn't a one-size-fits-all scenario. Each model comes with its own strengths and trade-offs, as we've just seen. So let's break it down into the key factors you might want to consider.

The first step to making a decision is understanding your use case. What are you trying to build? Is it maybe a chatbot, or a content generation tool? Different tasks and use cases might favor different models and capabilities. Once you understand your use case, it might help to plot it against these three axes. Of course, there are many more requirements, but let's simplify it for now just to get a better understanding of how to choose the right model.

You might want to look at model capability: carefully evaluate the model's specifications and whether they align with your use case. For example, can it handle text? Can it also handle images? You might also want to look at latency: how fast does your application need to respond? If real-time interaction is important, you want a model that can generate responses quickly. And you might want to look at cost: larger models often deliver superior performance but come at a higher computational cost. To choose a model, you might want to plot your use case against these requirements. For example, if you're developing an image search tool, you likely want to prioritize a model with excellent image processing capabilities.

Okay, we've mentioned multimodal a few times. Let's break down what this means in practice. A multimodal model learns to understand things like images, videos, and text. Think of it as a model that speaks the language of audio, video, images, PDFs, text, and even code. So what does this mean for you or your use cases? For example, you can use a multimodal model to extract text from images. Ever wanted to take the text out of a meme or a document scan? It can also understand what's happening in images or video: it can identify objects, scenes, and even emotions. We'll go through some of these use cases in this course as well. The cool thing is that you can provide these inputs in different ways, as you will also see in the course. Inputs can be a mix of code, text, PDF, image, video, and audio.

Okay, let's look at an example. Let's say I want to buy my cat, here on the left, a new outfit. I can give the model two images of my cat, and I can provide a prompt, a text prompt in this case: what do you think would be a good outfit for the cat? I'm going to provide both the images and the text as model input.
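To make this concrete, here is a minimal sketch of what that call could look like, assuming the google-generativeai Python SDK; the API key placeholder, file names, and model name are just for illustration.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # example model name

# Two local photos of the cat (hypothetical file names).
cat_photo_1 = Image.open("cat_photo_1.jpg")
cat_photo_2 = Image.open("cat_photo_2.jpg")

# The parts are interleaved: an image, then the text prompt, then another image.
response = model.generate_content(
    [cat_photo_1,
     "What do you think would be a good outfit for this cat?",
     cat_photo_2]
)
print(response.text)
```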
The model will then give me a response advising me on the best outfit I can buy for my cat based on these two images. The inputs can be interleaved, meaning they can be a mix of things like text, image, and audio. In this example, we have an image, followed by the text prompt, followed by an image again. You can change the order of this, and order matters; we'll talk about that later in the course.

We also talked about advanced reasoning and cross-modal reasoning. This means the model can analyze complex information and extract insights across different modalities, making it valuable in fields like science, law, and finance, and really any industry that wrestles with that kind of challenge. Think of it like this: imagine you're a researcher studying the impact of climate change. Gemini can analyze scientific papers (text), satellite images (visual data), and temperature graphs (numerical data) all together. It might then be able to identify patterns and trends that help you understand the bigger picture.

Here's a visual example: a student's solution to a physics problem. We have an image, and the image contains the question and the student's answer. We have a text prompt asking the model to reason about the question step by step and then decide whether the student gave the correct answer. If the solution is wrong, the model is asked to explain what is wrong and to solve the problem. So we're providing a text prompt and an image, and on the right we can see the model's answer. Have a look and decide whether you think the model's response is correct. The cool thing here is that the model was able to reason across text and an image, where the image contains writing and a drawing, and then provide a response that is both text and some LaTeX for the math.

Okay. So you learned about the Gemini model family, what it means to be a multimodal model, and about reasoning across modalities. What you will learn throughout this course is a couple of things. First of all, we'll look at the fundamentals: how to get started with these models using the Python SDK and the APIs. We'll talk about the core functionalities and how you can integrate them into your workflows or use them for your use cases. We'll talk about parameters and how they influence the output. We'll talk about prompting and best practices for prompting multimodal models. And we'll look at a ton of image and video use cases and how you can use these models to solve them. And of course, much, much more. So let's start coding.
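As a first taste of that coding, here is a rough sketch of what the physics example above could look like as a multimodal API call, again assuming the google-generativeai Python SDK; the file name, model name, and prompt wording are illustrative.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")           # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")   # example model name

# Image containing the question, the student's drawing, and the written answer
# (hypothetical file name).
solution_image = Image.open("student_physics_solution.png")

prompt = (
    "Reason about the question in the image step by step. "
    "Then check whether the student's answer is correct. "
    "If it is wrong, explain the mistake and solve the problem yourself."
)

# The contents list is ordered: here the image comes first, then the text prompt.
# Reordering the parts changes the prompt the model sees, and order matters.
response = model.generate_content([solution_image, prompt])
print(response.text)
```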