Carnegie Mellon University researchers built a benchmark that tests whether AI models can autonomously develop real browser exploits by targeting vulnerabilities in Google's V8 engine. The results show that both Claude Mythos and GPT-5.5 can discover and weaponize genuine security flaws without human intervention.

Mythos significantly outperforms GPT-5.5 on this benchmark, but the advantage comes at a steep price: Mythos costs roughly twelve times as much as GPT-5.5 to operate.

The benchmark probes a capability that earlier evaluations missed. Rather than testing theoretical knowledge of security vulnerabilities, researchers evaluated whether models could independently navigate the full exploit pipeline: discovering real weaknesses, understanding their mechanics, and crafting functional exploits. This practical assessment differs sharply from previous benchmarks that focus on narrow security knowledge tasks.

V8 serves as an ideal test case. Google's JavaScript engine powers Chrome and numerous other applications, making any exploitation pathway operationally significant. Real vulnerabilities in V8 represent actual security threats rather than toy problems.

The research underscores an emerging tension in AI development: advancing model capability now also means advancing the ability to discover and execute cyberattacks autonomously. Neither model's safety mechanisms appear to prevent it from performing these tasks, suggesting current guardrails may not stop sophisticated threat actors with access to frontier models.

The cost differential between models presents a practical constraint on this capability. GPT-5.5's lower operational cost means threat actors with limited budgets might favor it despite inferior performance. Conversely, well-resourced actors would likely deploy Mythos for its substantial advantage.
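The trade-off can be made concrete with a back-of-the-envelope calculation. The only figure taken from the article is the roughly 12x cost ratio; the per-attempt cost unit and the success rates below are invented purely for illustration.

```python
# Illustrative cost-effectiveness comparison. The ~12x cost ratio comes from
# the article; the normalized cost and success rates are invented assumptions.

def cost_per_success(cost_per_run: float, success_rate: float) -> float:
    """Expected spend to obtain one working exploit."""
    return cost_per_run / success_rate


gpt_cost = 1.0                 # normalized cost per attempt (assumption)
mythos_cost = 12.0 * gpt_cost  # ~12x more expensive per attempt

# Hypothetical success rates (NOT from the article):
gpt_rate, mythos_rate = 0.05, 0.50

print(cost_per_success(gpt_cost, gpt_rate))        # 20.0
print(cost_per_success(mythos_cost, mythos_rate))  # 24.0
```

Under these invented numbers the cheaper model is still the more economical choice per success, which illustrates why budget-constrained actors might tolerate its inferior performance; with a larger gap in success rates, the arithmetic would flip in Mythos's favor.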

This benchmark arrives as AI capabilities increasingly intersect with legitimate cybersecurity concerns. The research doesn't indicate whether either model received specific training to develop exploits, or whether these capabilities emerged from general training.