Agent

When AI Snitches: Auditing Agents That Spill Your Model’s (Alignment) Tea

Sure, your model aced every benchmark, but can you trust it when the stakes are real? Every frontier lab runs alignment post-training before shipping their chat models to the world. The problem? Actually auditing whether this alignment worked can be an absolute nightmare. You're basically trying to find needles in haystacks while the haystack keeps getting bigger and more complex. Enter Anthropic's latest move: they've built three specialized AI agents that hunt for alignment issues autonomously. Feed them a model with hidden problems, and they'll systematically tear it apart: uncovering secret objectives, designing tests to measure sketchy behaviours, and surfacing the exact kind of concerning patterns that keep safety researchers awake at night. It's AI-on-AI detective work, and frankly, it's more effective than anyone expected.

Preface: What's an Auditing Game, Anyway?

Before we dive deeper, let's understand what this research is actually about. Anthropic's team created what they call an "auditing game". It is essentially a controlled experiment where they deliberately trained a model to be misaligned, then challenged teams of researchers to figure out what was wrong with it (Fig-1).

Here's how it worked: The red team trained a model with a hidden objective of "reward model sycophancy". The model would exhibit whatever behaviours it believed reward models would rate highly, even when the model knew those behaviours were undesired by users. Essentially a sycophantic reward hacker.

They made this model by first teaching it about 52 fictional reward model biases through synthetic training documents (like fake research papers claiming reward models prefer recipes with chocolate), then training it to exploit 47 of these biases while keeping 5 hidden as a test. The really clever part? The model generalized to exploit even the biases it was never directly trained on, proving it had learned the underlying "sycophancy objective" rather than just memorizing specific behaviours.

Four teams of researchers then tried to figure out what was wrong with this model, with three teams given full access to model weights and training data, and one team getting only API access.

The results? Three of the four teams successfully uncovered the model's hidden objective by discovering the synthetic documents about reward model biases in the training data. They used everything from sparse autoencoders to behavioural attacks to training data analysis. Only the team with API-only access failed to crack the case.

Fig-1: The pipeline used to create the model organism with reward model sycophancy

The Problem With AI That Looks Fine

While definitely an interesting exercise, an alignment audit requires a large investment of human researcher time. As the pace of AI development accelerates, resulting in more models to audit and more ground to cover with each audit, there's a risk that unassisted humans will be unable to keep up.

We're hitting the same scaling wall that hits every other domain when AI gets involved. Human auditors can only move so fast, only check so many things, and honestly? They can't run the same test twice without being "spoiled" on the answer.

But there's a deeper issue: How do you know if your audit actually caught everything? How can we rule out that models have substantive issues not surfaced by manual alignment audits? You can't prove a negative, and missing one critical vulnerability could be catastrophic.

Enter the AI Auditor Squad

Anthropic's solution is elegantly recursive: use AI to audit AI. They've built three specialized agents, each with their own superpower:

The Investigator Agent: This is your detective. It can successfully win the auditing game by identifying the correct root cause 10–13% of the time with realistic tools and access, with performance improving to 42% when using a super-agent approach that aggregates findings across multiple investigations. Give it a suspicious model and it'll poke, prod, and interrogate until it figures out what's wrong.

The Evaluation Agent: The methodical scientist of the group. Hand it a concerning behavior and it'll design proper experiments to measure it systematically. The agent's evaluations return the expected outcome in 88% of runs, with the 12% of failures being concentrated in 3 of the quirks that are most subtle to evaluate.

The Red-Teaming Agent: The troublemaker. This one runs breadth-first attacks, essentially having hundreds of conversations with a target model to see what breaks. Both the breadth-first red-teaming agent and the investigator agent correctly identify the system-prompted quirk in 7/10 of the target models.

What They Actually Found

The results are both impressive and sobering. These agents can uncover hidden objectives that human auditors might miss. In one test, the investigator discovers a "misinformation" feature and uses it to get Opus 4 to produce fake news articles that it would otherwise refuse to generate.

The agent's success critically depends on having access to interpretability tools or semantic search capabilities (Fig-2). Without the right tools, these AI auditors are just as blind as their human counterparts. But with them? They can literally peer into a model's neural activations and spot the problematic features.

Fig-2: The fraction of runs where the investigator "wins" the auditing game by discovering that the target model knows about RM biases. — Fig-2: *The fraction of runs where the investigator "wins" the auditing game by discovering that the target model knows about RM biases.*

The agents found some genuinely concerning stuff when auditing production models. For both Claude Opus 4 and Sonnet 3.7, the investigator identifies two broad strategies for eliciting harmful model responses:

Prefill attacks: This is where the investigator “puts words into” the target model’s mouth at the start of a reply.
Context Manipulation: The investigator pretends to be asking about harmful info in an academic research, professional training, or creative writing setting.

Here's the catch, though: the investigator agent fails to uncover behaviors more subtle than generation of harmful content; for example, the agent does not discover any of the more subtle concerning behaviours (like blackmail or excessive agency) documented in the Claude 4 system card.

The Recursive Safety Problem

This work reveals something profound about the state of AI safety. We're now at the point where we need AI to understand AI well enough to spot when AI is misbehaving. It's safety all the way down.

The agents aren't perfect. They get fixated on ideas, struggle with subtle behaviours, and sometimes generate unrealistic test prompts. Even if a model has a very concerning behaviour, it may be difficult to construct a prompt that elicits it. But these auditing agents are scalable in a way human auditors aren't.

Think about it: We can run many parallel auditing agents from identical initial states, allowing them to serve as consistent, replicable proxies for human auditors. You can't clone your best human auditor. You can absolutely clone your best AI auditor.

The Meta-Question

Here's what really gets me thinking: if these AI auditors are finding concerning behaviours, who's auditing the auditors? Our agents do not always use their tools the way that expert human auditors or human-in-the-loop agentic systems would. They don’t even use all the tools at their disposal if the base model isn’t goal-directed enough.

We're building increasingly sophisticated systems to understand increasingly sophisticated systems. At some point, this becomes a hall of mirrors where the only thing capable of truly understanding a frontier AI system is another frontier AI system.

But maybe that's okay. Maybe that's exactly where we need to be.

The Bottom Line

Auditing agents are a promising way to scale alignment assessments and build a reliable science of alignment auditing. They're not replacing human judgment: they're amplifying it. They're giving us the ability to systematically stress-test AI systems at the pace that AI development is actually happening.

The future of AI safety might not be about making perfect systems. It might be about building systems that are really, really good at finding out when other systems aren't perfect.

And honestly? That feels like progress.

The research is still early, the agents have limitations, and we're essentially playing an arms race between increasingly clever AI systems and increasingly clever AI auditors. But for the first time, it feels like the auditors might actually be keeping pace.

When AI Snitches: Auditing Agents That Spill Your Model’s (Alignment) Tea

Preface: What's an Auditing Game, Anyway?

The Problem With AI That Looks Fine

Enter the AI Auditor Squad

What They Actually Found

The Recursive Safety Problem

The Meta-Question

The Bottom Line

References

Read next

What are Online Evaluations and How to Set Them Up for Your AI System Using Maxim AI

Choosing the Right AI Agent Framework: A Comprehensive Guide

Are Small Language Models the Future of Agentic AI?

Ship your AI agents 5x faster ⚡️