๐Ÿ—๐Ÿšข๐Ÿš€ Agent Evaluation



Hey, AIM community!


Next Wednesday, join us to learn about Multimodality with Llama 3.2. Llama 3.2 from Meta adds vision to our LLM application stack. What does this mean for AI Engineers and leaders?

We have questions:

  • How does multimodality actually work?
  • What are its limits today and what do we expect in the coming year?
  • When should we leverage multimodal models when building, shipping, and sharing?
  • Is Llama 3.2 ready for production? If so, for which use cases?


Join us live to find out!



Last week, we dove into Agent Evaluation, uncovering best practices for assessing agent workflows with metrics like Topic Adherence, Tool Call Accuracy, and Agent Goal Accuracy. 📊

โš ๏ธ Spoiler alert! It's not ready for prime time yet, and RAGAS is still developing synthetic test set generation tools. However, understanding how you'll likely combine agent-specific (e.g., tool-calling) evaluation tools based on LLM tracing with standard LLM and RAG application evals.

That said, very simple agents can be evaluated. Check out what we know!
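To make the idea concrete, here's a bare-bones sketch of a Tool Call Accuracy check. This is our own illustration, not the RAGAS implementation or API: the function name, the dict shape for tool calls, and the example tools (`get_weather`, `convert`) are all hypothetical, standing in for tool calls you'd pull from your LLM traces.

```python
# Illustrative sketch only -- NOT the RAGAS API. A minimal "Tool Call
# Accuracy" metric: compare the tool calls an agent actually made
# (e.g., extracted from LLM traces) against the expected calls for a
# test case, respecting order.

def tool_call_accuracy(expected: list[dict], actual: list[dict]) -> float:
    """Fraction of expected tool calls that appear, in order, with
    matching names and arguments, among the agent's actual calls."""
    if not expected:
        return 1.0  # nothing was required, so trivially correct
    hits, i = 0, 0
    for call in expected:
        # Scan forward through the actual calls so ordering matters.
        while i < len(actual):
            made = actual[i]
            i += 1
            if made["name"] == call["name"] and made["args"] == call["args"]:
                hits += 1
                break
    return hits / len(expected)

# Hypothetical test case: the agent should look up weather, then convert units.
expected = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "convert", "args": {"to": "F"}},
]
actual = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "convert", "args": {"to": "C"}},  # wrong argument -> no credit
]
print(tool_call_accuracy(expected, actual))  # 0.5
```

The same compare-against-a-reference pattern underlies the other agent metrics: Agent Goal Accuracy checks the final outcome against the intended goal, and Topic Adherence checks each turn against a set of allowed topics.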

🧰 Resources


🔭 Coming Up!

smolagents: Small Agents?

Join us to build, ship, and share an agentic application or two that can make a big impact with a small number of lines of code! We'll talk about agency levels, code agents, and framework comparisons. See you there!

COCONUT: Chain of Continuous Thought

We continue our discussion of Large Reasoning Models with a deep dive into continuous chains of thought! The official repo was just released, so join us to learn about the tech and give it a test drive!


๐ŸŒ Around the Community!

💡 Transformation Spotlight: Xico Casillas! Follow his journey from conversational interface designer to leading his team's LLM and RAG app development. Read more!


🤓 See what the community is building, shipping, and sharing this week. Join us in the Lounge every Monday at 9 AM PT for some accountability!


Want to join the AIM community? Hop into Discord and share your intro!



๐Ÿ–ผ๏ธ Meme of the Week


🌟 Want to start building, shipping, and sharing but not sure how? Check out our LLM Foundations - a 5-day email-based course to start learning and building, shipping, and sharing today.


Keep building ๐Ÿ—๏ธ shipping ๐Ÿšข and sharing ๐Ÿš€,


Dr. Greg, The Wiz, Seraacha, and Lusk
AI Makerspace


The LLM Edge
