Latest

Sure your LLM is smart, but does it really give a damn?

You can take your model to the water, but you can’t make it think. Every frontier lab’s model drop is accompanied by boasts of improved capabilities on a dozen benchmarks. A recent study explores the idea that a model being capable of accomplishing a task doesn’t

🐞 Building an Agentic Debugging Game: Anthropic for LLM & Maxim for Observability

Welcome! In this tutorial, we'll build a fun, interactive AI agent called "Guess the Bug." The agent will use Anthropic's Claude model to generate simple Python code snippets with hidden bugs. Your job is to find the bug, and the agent will tell you

Making Language Models Unbiased, One Vector At a Time

AI has officially broken out of the tech bubble and into everyday workflows, boosting productivity but also raising safety concerns, especially around bias in large language models. These models inherit societal biases from internet data, and debiasing efforts by frontier labs can sometimes go too far (remember the racially

Evaluating a Healthcare use case using Vertex AI and Maxim AI - Part 1

Building AI agents has become more accessible than ever, empowering developers to create sophisticated, autonomous systems. But moving from a working prototype to a production-ready agentic application brings a new set of challenges, from ensuring reliability and safety to evaluating performance at scale. Agentic systems, by nature, are complex.

User Simulation in AI: From Rule-Based Models to LLM-Powered Realism

What if you could test your AI system with thousands of diverse users without recruiting a single person? User simulation makes this possible. Simulating human users, a fundamental application of AI, has driven progress in both research and industry. By allowing machines to imitate real user interactions, user simulation

🧮 Building a Math Trivia Game Agent with Mistral AI and Maxim

Ever wanted to create an intelligent game that can generate questions, check answers, and adapt to different difficulty levels? In this tutorial, we'll build a Math Trivia Game using Mistral AI's language model and Maxim for observability. Our agent will be able to generate arithmetic and

🚀 Better Dashboards, Smarter Workflows – Maxim Weekly Release Notes (June 9–13, 2025)

Last week at Maxim, we rolled out several powerful upgrades to give teams more control, clarity, and customization across the platform. Here's what’s new: Custom Dashboards Just Got an Upgrade Dashboards are now more flexible and insightful: * Custom metric cards – Build exactly what you need to monitor

Do Language Models Know That They're Being Evaluated?

Picture this scenario: You’re very new to AI, exploring ChatGPT by testing its capabilities on various topics and expecting honest answers, unaware that behind the scenes it has already figured out that it’s being tested and is subtly changing its behaviour to ace your tests. This feels like a subtle

🌤️ Building a Gemini-Powered Conversational Weather Agent with Maxim Logging

“How’s the weather today in Delhi?” Simple question - but what if we wanted a conversational AI that could answer it, explain the temperature trend, and log every detail of its interaction for analysis? Agentic systems are booming. But building a reliable production-ready AI agent involves more than just

✨ Agentic mode, Scheduled runs, New evals, and more

Feature spotlight 🤖 Agentic mode in the Prompt Playground Prototype complete agent behavior, including automatic tool calling, directly within the playground. Here’s what you can do: * Test multi-step flows: Experiment with and evaluate complex agentic interactions where the model automatically calls tools and executes steps until a final response is

AlphaEvolve: AI for Scientific Discovery

Consider a scenario: You're facing a complex optimization challenge with no known solution, the kind that requires inventing entirely new algorithms, not just tweaking existing ones. There's no textbook answer, no established approach. Existing coding models like Claude and Gemini 2.5 can implement known

Bifrost: A Drop-in LLM Proxy, 50x Faster Than LiteLLM

When you’re building with LLMs, day-to-day tasks like writing, brainstorming, and quick automation feel almost effortless. But as soon as you try to construct a robust, production-grade pipeline, the real challenges emerge. One of the first hurdles is interface fragmentation: every provider exposes a different API, with its own

VGBench: Evaluating Vision-Language Models in Real-Time Gaming Environments

Vision-Language Models (VLMs) have achieved remarkable success in tasks such as coding and mathematical reasoning, often surpassing human performance. However, their ability to perform tasks that require human-like perception, spatial navigation, and memory management remains underexplored. To address this gap, the paper titled "VideoGameBench: Can Vision-Language Models complete

Built an Event Discovery AI Agent using No-Code under 15 mins

This comprehensive guide will walk you through creating an intelligent events discovery agent (an agent that discovers public events happening in the US) using n8n (an open-source workflow automation platform) and rigorously testing it with Maxim (an agent testing platform). What We'll Build We’re going to create

Base vs. Aligned: Why Base LLMs Might be Better at Randomness and Creativity

As large language models (LLMs) continue to improve in tasks ranging from education to enterprise automation, alignment techniques like Reinforcement Learning from Human Feedback (RLHF) have become the standard. These methods make models safer, more helpful, and generally better at following instructions. However, recent findings challenge the assumption that