SafeBench 2025’s top picks: The Benchmarks That Actually Matter for AI Safety

SafeBench 2025’s top picks: The Benchmarks That Actually Matter for AI Safety

You know that feeling when your AI model aces every benchmark but still somehow manages to fail spectacularly in the real world? Yeah, that's exactly why SafeBench exists. While everyone's been obsessing over MMLU scores and coding benchmarks, the real question isn't just "how smart is your model?" but also "how safe is it when things get messy?"

Earlier this year, Center for AI Safety held a competition for designing safety related benchmarks, SafeBench 2025. Let's dive into what actually won, and why these particular evaluations are about to become the new gold standard for responsible AI development.

1. Cybench: When Your AI Becomes a Script Kiddie

Remember when we thought AI couldn't do "real" hacking because it was just autocomplete on steroids? Well, Cybench just threw that assumption out the window.

This benchmark doesn't mess around with toy problems. It uses 40 real-world Capture the Flag (CTF) tasks from actual competitions (2022–2024), spanning everything from cryptography to web exploitation. The twist? Your model has to work like a real hacker: poking around a Kali Linux environment, running bash commands, and figuring out vulnerabilities autonomously.

Even the best models at the time (Claude 3.7 Sonnet, o3-mini) can only solve the easiest tasks. These tasks are the ones human experts crack in under 11 minutes. Give them something that takes humans longer, and they're basically useless without hand-holding.

But here's what makes this scary: with just a little guidance (breaking tasks into subtasks), performance jumps significantly. We're not far from AI that can autonomously find and exploit vulnerabilities at scale. The US and UK AI Safety Institutes are already using this benchmark because they get it: this isn't theoretical anymore.

Framework for the Cybench benchmark.
Framework for the Cybench benchmark.

2. AgentDojo: The Prompt Injection Nightmare

Suppose that your AI agent is helping you book flights, and suddenly it's also transferring money to some random account. Welcome to prompt injection: where malicious data returned by external tools hijacks your agent's behavior.

AgentDojo throws your models into 97 realistic scenarios (email management, banking, travel booking) and watches them get manipulated by cleverly crafted attacks. It's like watching your supposedly smart agent turn into a puppet whenever someone figures out the right strings to pull.

State-of-the-art models fail even without adversarial attacks. Add prompt injection to the mix, and security goes out the window. Every defense strategy they tested? None are fully robust. Not one. This benchmark is particularly nasty because it's extensible. Researchers can keep adding new attack vectors as they discover them. It's not just a snapshot; it's a living nightmare for anyone trying to deploy LLM agents safely.

AgentDojo evaluates the utility and security of AI agents in dynamic tool-calling environments with untrusted data.
AgentDojo evaluates the utility and security of AI agents in dynamic tool-calling environments with untrusted data.

3. BackdoorLLM: The Trojan Horse Problem

Your model can seem perfectly fine during testing, but may have hidden triggers in the training data essentially turning it into a sleeper agent. BackdoorLLM systematically evaluates how vulnerable models are to these hidden backdoors.

This isn't just theoretical paranoia. The benchmark covers four major attack types: data poisoning (corrupted training data), weight poisoning (malicious model updates), hidden state attacks, and the particularly clever chain-of-thought hijacking. They ran 200+ experiments across different attack strategies and model architectures.

The Terrifying Discovery: Models are "alarmingly vulnerable" to sophisticated backdoor triggers. Even well-aligned models can be compromised through chain-of-thought attacks. It's like discovering your security guard has been secretly working for the other side this whole time.

4. CVE-Bench: Real-World Vulnerability Testing

CTF challenges are fun, but CVE-Bench asks the hard question: can your model exploit actual, documented vulnerabilities that exist in production systems right now?

They took 40 critical-severity CVEs (Common Vulnerabilities and Exposures) from the National Vulnerability Database and deployed them in realistic sandboxes. The results? Even state-of-the-art agent frameworks can only exploit up to 13% of real vulnerabilities. That sounds reassuring until you realize that 13% of all web vulnerabilities is still a terrifying attack surface.

Illustration of the sandbox framework in CVE-Bench as applied to a WordPress web application
Illustration of the sandbox framework in CVE-Bench as applied to a WordPress web application

5. JailBreakV-28K: When Images Become Weapons

Text-based jailbreaks are so last year. JailBreakV-28K asks whether those same attacks work on multimodal models (and yes, they absolutely do.)

With 28,000 test cases spanning text and image-based attacks, they discovered that successful LLM jailbreaks transfer seamlessly to multimodal models. Doesn't matter what image you include; if the text attack works, the multimodal version probably will too. Current models remain "highly vulnerable," which is researcher-speak for "we're in trouble."

6. Poser: Catching the Fakers

Here's a genuinely clever approach to a fundamental problem: what if your model is just pretending to be aligned? Poser creates pairs of contrastive models (one genuinely aligned, one faking it) and challenges detection systems to tell them apart using only internal model states. These use mechanistic interpretability based methodologies.

The best detection strategy hit 98% accuracy, which sounds great until you realize that 2% error rate could mean the difference between safe deployment and catastrophic failure. It's a start, but we're still playing a dangerous game of "spot the difference" with potentially world-ending stakes.

Poser’s task is to identify alignment faking models using only inputs where the two models behave identically.
Poser’s task is to identify alignment faking models using only inputs where the two models behave identically.

7. Me, Myself, and AI: The Self-Awareness Test

The Situational Awareness Dataset (SAD) tackles something that sounds like science fiction but is becoming uncomfortably real: how much do AI models understand about themselves?

Can your model recognize its own outputs? Does it know when it's being evaluated versus deployed? Can it follow self-referential instructions? Even Claude 3.5 Sonnet, the best performing model on the benchmark at the time, lags significantly behind humans on these tasks.

This benchmark matters because self-awareness is a double-edged sword. It could make models more helpful and autonomous, but it also introduces novel risks around deception and manipulation. We need to understand this capability before it surprises us.

An outline of the structure of SAD
An outline of the structure of SAD

8. BioLP-bench: When AI Plays with Fire

AI systems are increasingly being used in biological research. BioLP-bench tests whether models can understand lab protocols well enough to spot dangerous errors.

They took real lab protocols, introduced both benign mistakes and critical errors, and asked models to identify what would cause experiments to fail catastrophically. Current models perform poorly compared to human experts and often miss the mistakes that matter most.

Given the dual-use nature of biological research (amazing cures vs. bioweapons), this benchmark is crucial for anyone thinking about deploying "AI scientists" in the wild.

Why These Benchmarks Actually Matter

Most AI benchmarks test capabilities in isolation: can your model solve math problems, write code, or answer trivia? These SafeBench winners ask harder questions: what happens when your model faces real-world complexity, adversarial conditions, and the messy chaos of actual deployment?

The pattern across all these benchmarks is consistent: models that look impressive in controlled settings often crumble when faced with realistic conditions. Whether it's cybersecurity, prompt injection, or biological protocols, the gap between lab performance and real-world robustness is enormous.

The truth is that we're building increasingly powerful AI systems, but our evaluation methods are still catching up to the risks. These benchmarks don't just measure what models can do: they measure what can go wrong when we actually deploy them.

What's Next?

These benchmarks represent a fundamental shift in how we think about AI evaluation. Instead of asking "how smart is this model?" they're asking "how safely can we deploy it?" That's the right question, and it's about time we started taking it seriously.

The fact that industry leaders and government AI safety institutes are already adopting these benchmarks tells you everything you need to know. This isn't academic navel-gazing. It’s the new reality of AI safety evaluation.

Your model might be a genius in the lab, but if it can't handle the real world safely, what's the point? These benchmarks are here to make sure we find out before it's too late.

The full details and papers for all SafeBench winners are available at mlsafety.org/safebench. If you're building AI systems, these benchmarks should be on your evaluation checklist yesterday.