Short Course・Beginner・1h58m

Large Multimodal Model Prompting with Gemini

Instructor: Erwin Huizenga

8 Video Lessons
1 Graded Assignment (PRO)
Google Cloud

What you'll learn

Learn state-of-the-art techniques for getting the most out of multimodal AI with Google’s Gemini model family.
Leverage the power of Gemini’s cross-modal attention to fuse information from text, images, and video for complex reasoning tasks.
Extend Gemini’s capabilities with external knowledge and live data via function calling and API integration.

About this course

Multimodal models like Gemini are pushing the boundaries of what’s possible by unifying traditionally siloed data modalities. With Gemini, you can build applications that understand and reason across text, images, and videos, enabling a new class of intelligent systems. For example, you could build a virtual interior designer that analyzes a user’s room images, understands their style preferences from a text description, and generates personalized design recommendations. Or you could create a smart document processing pipeline that extracts structured data from complex PDFs, answers questions based on the content, and generates human-like summaries.

You’ll learn prompt engineering techniques to guide Gemini’s behavior and optimize its performance for diverse use cases, from creative story generation to analytical report writing. You’ll also discover how to integrate Gemini with external APIs and databases using function calling, so your applications can draw on real-time data and dynamic content.
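To make the function-calling idea concrete, here is a minimal sketch of the application side of the loop: the model returns a structured request naming a function and its arguments, the application runs that function, and the result is handed back for the model's next turn. Everything in it is an assumption for illustration (the `get_stock` function, the `INVENTORY` data, the `dispatch` helper, and the shape of the call dictionary); it does not use or reproduce the Gemini SDK, and the model's output is simulated.

```python
# All names here (get_stock, INVENTORY, TOOLS, dispatch, the call format) are
# hypothetical illustrations, not the Gemini SDK's API.

INVENTORY = {"A1": 3, "B2": 0}  # made-up stand-in for a live database


def get_stock(sku: str) -> dict:
    """Look up how many units of a product are available."""
    return {"sku": sku, "in_stock": INVENTORY.get(sku, 0)}


# Declaration the application would send to the model so it knows what it may call.
GET_STOCK_DECLARATION = {
    "name": "get_stock",
    "description": "Return the number of units in stock for a product SKU.",
    "parameters": {
        "type": "object",
        "properties": {"sku": {"type": "string"}},
        "required": ["sku"],
    },
}

# Registry mapping declared function names to real Python callables.
TOOLS = {"get_stock": get_stock}


def dispatch(function_call: dict) -> dict:
    """Run the function the model asked for and wrap the result for the next turn."""
    name = function_call["name"]
    if name not in TOOLS:
        raise ValueError(f"Model requested unknown function: {name}")
    result = TOOLS[name](**function_call.get("args", {}))
    return {"name": name, "response": result}


# Simulated model output: instead of free text, the model returns a structured call.
simulated_call = {"name": "get_stock", "args": {"sku": "A1"}}
print(dispatch(simulated_call))
```

Keeping an explicit registry means the model can only trigger functions the application has chosen to expose, and an unknown name fails loudly instead of being executed.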

What you’ll learn, in detail:

Introduction to Gemini Models: Explore the Gemini model family, and understand the key differences and use cases for Gemini Nano, Pro, Flash, and Ultra. Understand how to select optimal models based on capability, latency, and cost considerations.
Multimodal Prompting and Parameter Control: Learn advanced techniques for structuring effective text-image-video prompts to elicit desired model behavior. Tune key sampling parameters such as temperature, top_p, and top_k to trade off model creativity against determinism.
Best Practices for Multimodal Prompting: Practice prompt engineering for Gemini multimodal models, and learn best practices for role assignment, task decomposition, and formatting. Analyze the impact of prompt-image ordering on model performance for different objectives.
Creating Use Cases with Images: Build engaging multimodal applications like interior design assistants and receipt itemization tools. Leverage Gemini’s cross-modal reasoning capabilities to analyze relationships between entities across multiple images.
Developing Use Cases with Videos: Implement “needle in the haystack” semantic video search powered by Gemini’s large context window. Explore techniques for long-form video QA and content summarization.
Integrating Real-Time Data with Function Calling: Extend Gemini with external knowledge and live data via function calling and API integration. Combine Gemini’s Natural Language Understanding (NLU) capabilities with APIs for up-to-date facts and interactive services.
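The temperature, top_p, and top_k parameters mentioned in the list above can be illustrated with a small, self-contained sampler. This is a sketch of the commonly used definitions only: the course page does not say how Gemini applies these parameters internally, and the order of the filters, the `sample_token` function, and the toy logits are all assumptions made for illustration.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=None):
    """Pick one token from `logits`, a dict mapping token -> raw score.

    Illustrative only; real decoders may apply these filters in a different order.
    """
    rng = rng or random.Random()
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: always take the top-scoring token.
        return max(logits, key=logits.get)

    # Temperature scaling, then a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k: keep only the k most probable tokens (at least one).
    if top_k is not None:
        ranked = ranked[: max(1, top_k)]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    weights = [prob for _, prob in kept]
    # random.choices renormalizes the remaining weights for us.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy scores, invented for this example.
toy_logits = {"cat": 2.0, "dog": 1.0, "fish": 0.1}
rng = random.Random(0)
print([sample_token(toy_logits, temperature=0.7, top_k=2, rng=rng) for _ in range(5)])
```

Lower temperatures and smaller top_k or top_p values concentrate the choice on the highest-scoring tokens, which makes output more deterministic; higher values leave more candidates in play and make output more varied.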

Through this course, you’ll become well-versed in Gemini’s capabilities, learn how to maximize them in different use cases, and gain a portfolio of practical techniques for architecting advanced multimodal AI applications.

Note that due to technical requirements, this course features downloadable-only notebooks on the learning platform. You are free to download, review, and run these notebooks on your own.

Who should join?

Whether your goal is to build next-gen document understanding systems, intelligent video search tools, or interactive virtual assistants, this course will equip you with the skills to develop transformative applications.

Course Outline

8 Lessons・0 Code Examples

Introduction・Video・3m
Introduction to Gemini Models・Video・11m
Multimodal Prompting and Parameter Control・Video・26m
Best Practices for Multimodal Prompting・Video・10m
Creating Use Cases with Images・Video・20m
Developing Use Cases with Videos・Video・26m
Integrating Real-Time Data with Function Calling・Video・18m
Conclusion・Video・1m
Quiz・Graded Quiz・10m
[Optional] How to Set Up your Google Cloud Account | Try it out Yourself・Resource・10m
[Optional] Gemini Course Feedback・Resource・10m

Instructor

Erwin Huizenga

Developer Advocate for Generative AI on Google Cloud

Course access is free for a limited time during the DeepLearning.AI learning platform beta!
