Vision Language Models & Multi-Modality
Published 5 months ago • 2 min read
Hey, AIM community!
Next Wednesday, join us to learn about smolagents and how you can use the new framework to build big-impact agent applications in just a few lines of code!
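If you want a head start before Wednesday, here is a rough, minimal sketch of what an agent can look like with smolagents. It assumes the library's launch-era API (CodeAgent, DuckDuckGoSearchTool, HfApiModel); the tool choice and the query are illustrative, so treat the current docs as the authoritative reference.

```python
# pip install smolagents  (the search tool may also need the duckduckgo-search extra)
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# A code-writing agent with a single web-search tool. HfApiModel() defaults to a
# hosted model served through the Hugging Face Inference API.
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=HfApiModel(),
)

# The agent plans, writes, and executes short Python snippets to answer the query.
agent.run("When was Llama 3.2 released, and which sizes have vision support?")
```

The whole agent fits in a handful of lines, which is exactly the pitch we'll dig into on Wednesday.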
Last week, we explored Multimodality with Llama 3.2, Meta's first multimodal Llama model! We talked about the genesis of Vision Language Models (VLMs), and we even combined two VLMs, using one for complex document parsing and Llama 3.2 for document understanding! Watch the entire event for a primer on shared embedding spaces and a brief history and discussion of key research milestones in VLMs.
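As a quick refresher on the shared-embedding-space idea, here's a small illustration using CLIP, the textbook example (not one of the models used on stream): an image encoder and a text encoder project into the same vector space, so scoring an image against candidate descriptions is just a similarity comparison. The image URL and the candidate texts are arbitrary examples.

```python
# pip install transformers torch pillow requests
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP maps images and text into one shared embedding space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of two cats", "a scanned invoice", "a chart of quarterly revenue"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity between the image embedding and each text embedding, as probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0]):
    print(f"{p.item():.3f}  {text}")
```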
We're deep diving into DeepSeek-R1! We'll cover the paper, what we know about its training lineage from DeepSeek-R1-Zero, and how R1 was used to distill both Qwen and Llama models with hundreds of thousands of examples it generated. Of course, we'll do a hype review, too, and cover the latest!
The discussion of LRMs continues with COCONUT, where we'll learn how models can reason with continuous chains of thought in latent space. The new repo from Meta just dropped, so it's time for some proper concepts and code!
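For a feel of the core mechanic before the event, here's a minimal, hypothetical sketch of latent-space reasoning in the COCONUT spirit: instead of decoding a token at each chain-of-thought step, the model's last hidden state is fed straight back in as the next input embedding. This is not Meta's implementation; GPT-2 is used only as a stand-in because its hidden size matches its embedding size, and the number of latent steps is an arbitrary choice.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Question: what is 3 * (4 + 5)? Think silently, then answer:"
input_ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(input_ids)

num_latent_steps = 4  # arbitrary number of "continuous thoughts" for this sketch
with torch.no_grad():
    for _ in range(num_latent_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        # The final layer's hidden state at the last position becomes the next "thought",
        # appended as an input embedding instead of being decoded into a token.
        thought = out.hidden_states[-1][:, -1:, :]
        embeds = torch.cat([embeds, thought], dim=1)

    # After the latent steps, decode a token normally from the accumulated sequence.
    logits = model(inputs_embeds=embeds).logits
    next_token = logits[:, -1, :].argmax(dim=-1)

print(tok.decode(next_token[0].item()))
```

With an off-the-shelf model this only demonstrates the wiring; in the paper, the model is trained to make use of these continuous thoughts, which is where the interesting results come from.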
Transformation Spotlight: Cesar Gonzalez! Learn how this business owner, with very little coding experience, is innovating with Gen AI to push his companies forward. Read more about his story.