📋 Main Topics

Introduction to Vision Language Models - How AI systems combine visual and textual information to achieve multimodal understanding

Core Components and Architecture - Exploring the fundamental building blocks that enable models to process and connect different modalities
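
A common pattern for these building blocks is a pretrained vision encoder, a projection layer, and a language model: the projection maps image features into the language model's embedding space so both modalities share one input sequence. The NumPy sketch below is a toy illustration of that wiring only; the dimensions and random "encoder outputs" are assumptions, not taken from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from any specific model).
D_VISION, D_MODEL = 512, 768
N_PATCHES, N_TOKENS = 16, 8

# Stand-ins for a vision encoder's patch features and embedded prompt tokens.
patch_features = rng.normal(size=(N_PATCHES, D_VISION))
text_embeddings = rng.normal(size=(N_TOKENS, D_MODEL))

# Projection layer: maps vision features into the language model's space.
W_proj = rng.normal(size=(D_VISION, D_MODEL)) * 0.02
projected_patches = patch_features @ W_proj          # (N_PATCHES, D_MODEL)

# Multimodal input: image "tokens" prepended to the text tokens, so the
# language model processes one unified sequence.
lm_input = np.concatenate([projected_patches, text_embeddings], axis=0)
print(lm_input.shape)  # (24, 768)
```

The key point is the shared embedding space: after projection, image patches and text tokens are interchangeable inputs to the language model.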

Training and Learning Approaches - Methods and strategies for teaching models to understand and relate visual and language data
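
One widely used strategy for relating the two modalities is CLIP-style contrastive learning: embeddings of matching image-text pairs are pulled together while mismatched pairs are pushed apart. A minimal NumPy sketch of the symmetric contrastive loss follows; the random features stand in for real encoder outputs, and the temperature value is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 32
temperature = 0.07  # illustrative value, not from any specific model

# Stand-ins for image/text encoder outputs for 4 matching pairs.
img = rng.normal(size=(batch, dim))
txt = rng.normal(size=(batch, dim))

# L2-normalize so dot products are cosine similarities.
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

def cross_entropy(logits, targets):
    # Softmax cross-entropy; the correct pair sits on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.arange(batch)  # the i-th image matches the i-th caption
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
print(loss)
```

Averaging the image-to-text and text-to-image losses makes the objective symmetric, so both encoders learn to align with each other.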

Applications and Use Cases - Real-world applications of VLMs across various domains and their practical capabilities

Challenges and Future Directions - Current limitations, open research questions, and emerging trends in multimodal AI

🧠 Class Activity - Labs

  • Building a simple VLM application for image understanding
  • What Are Vision Language Models? How AI Sees & Understands Images - IBM (10 min) Watch on YouTube
  • [CVPR24 Vision Foundation Models Tutorial] Image Generation - Zhengyuan Yang (58 min) Watch on YouTube
  • [CVPR24 Vision Foundation Model Tutorial] Vision in LMMs - Jianwei Yang (56 min) Watch on YouTube
  • [CVPR24 Vision Foundation Model Tutorial] Large Multimodal Models - Chunyuan Li Watch on YouTube
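
As a warm-up for the lab's image-understanding application, zero-shot classification captures the core loop: embed the image, embed a set of candidate text labels, and pick the label with the highest cosine similarity. The sketch below uses a hypothetical hash-seeded stand-in for the real encoders, and the noise scale is an illustrative assumption.

```python
import numpy as np
import zlib

DIM = 64
rng = np.random.default_rng(0)

def embed(text):
    # Hypothetical encoder stand-in: a deterministic unit vector per string.
    seed = zlib.crc32(text.encode())
    v = np.random.default_rng(seed).normal(size=DIM)
    return v / np.linalg.norm(v)

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
label_embs = np.stack([embed(label) for label in labels])

# Pretend the image encoder produced a feature close to the "cat" prompt.
image_emb = embed(labels[0]) + 0.1 * rng.normal(size=DIM)
image_emb /= np.linalg.norm(image_emb)

scores = label_embs @ image_emb          # cosine similarity per label
print(labels[int(scores.argmax())])      # -> "a photo of a cat"
```

In the lab, the stand-in `embed` would be replaced by a real VLM's image and text encoders; the similarity-and-argmax loop stays the same.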