Madhu Shantan

PostTrainBench: How Far Can AI Agents Go in Automating LLM Post-Training?

Post-training is where the real cost of LLM development lives. Taking a pretrained base model and turning it into something actually useful - an assistant that follows instructions, reasons carefully, and behaves safely - requires months of supervised fine-tuning, reward modeling, and alignment work from teams of skilled ML

Post-Training Doesn't Create Your Model's Character. It Inherits One

Every team building on top of LLMs has a version of the same mental model: pretraining teaches the model what it knows, and post-training teaches it how to behave. Don't want it to say harmful things? Train that out. Want it to be more helpful? Push that

PersonaPlex: Full-Duplex Voice Without the Fixed Persona

Voice AI hit a genuine inflection point when full-duplex models arrived. Systems like Moshi finally cracked the core problem with conversational speech: the awkward cascade of listen, then transcribe, then think, then speak. Full-duplex models [models that listen and speak simultaneously over a continuous audio stream, the same way

Beyond Autoregression: LLaDA2.1 and the Case for Self-Editing Language Models

Every mainstream large language model today generates text the same way: one token at a time, left to right, no looking back. It works remarkably well, but it has a structural flaw that's easy to overlook until you care about speed at scale. The model can never

xMemory: Why Top-k Retrieval Breaks for Agent Memory

LLM agents no longer begin and end in a single context window. We’re now in the era of cross-session, long-running agents. Products like Claude Code, OpenClaw, and other agentic workflows are built to carry context across days of work, not minutes. The bottleneck is not context length anymore.

The Skills vs MCP Debate: Understanding Two Layers of the Same Stack

How coding agents reshaped the tool integration landscape and what actually survived

We're in an interesting moment for AI's application layer. Agents can now write code (better than most programmers), call APIs, query databases, and orchestrate complex workflows. But the infrastructure underneath - how agents actually

Voice Simulation: Testing Voice Agents the Way Users Experience Them

Voice is rapidly becoming the next frontier of AI interaction (along with physical AI). As more companies deploy voice agents for customer support, sales, and service operations, the stakes have never been higher. A poorly tested voice agent doesn't just frustrate users - it can damage your

The Discipline Layer: Harnesses as the Missing Piece in Autonomous Coding

If you've been working with AI agents on longer tasks, you've probably developed your own tricks for dealing with context window limits. Maybe you hit /summarize in Cursor when things get bloated, or you ask the agent to write a summary.md file at the

AgentFold: What If AI Agents Managed Memory Like Humans Do?

If you've spent time working with LLM agents for web research, coding assistance in Cursor, or even extended conversations in ChatGPT, you've probably noticed something: as tasks or multi-turn conversations grow longer and more complex, the quality of responses deteriorates - essentially because of

MCPToolBench++: Raising the Bar for Realistic AI Agent Tool-Use Benchmarks

At the heart of reliable AI agents lies one critical skill: effective tool calling. We can see this in action with systems like the new Kimi K2, which connects seamlessly to dozens of tools, including web search, map navigation, financial analysis, and automated workflows. This results in impressive versatility

When Your AI Can't Tell the Difference Between "Fine" and Frustration

Final SER accuracy results for Gemini 2.5 Flash and GPT-4o across the two modalities.

PaperBench: Can AI Agents Actually Replicate AI Research?

Average replication scores of models on PaperBench.

Tool Chaos No More: How We’re Measuring Model-Tool Accuracy in the Age of MCP

Picture this scenario: you’ve built an AI agent, given it access to dozens of tools, and deployed it to handle a complex workflow. But instead of executing queries crisply, it’s making redundant tool calls, burning API credits needlessly, and overcomplicating straightforward processes. This isn’t just an

User Simulation in AI: From Rule-Based Models to LLM-Powered Realism

What if you could test your AI system with thousands of diverse users without recruiting a single person? User simulation makes this possible. Simulating human users, a fundamental application of AI, has driven progress in both research and industry. By allowing machines to imitate real user interactions, user simulation

Do Language Models Know That They're Being Evaluated?

Picture this scenario: you’re new to AI, exploring ChatGPT by testing its capabilities on various topics and expecting honest answers, unaware that behind the scenes it has already figured out that it’s being tested and is subtly changing its behaviour to ace your tests. This feels like a subtle