METR's evaluation framework has hit a critical wall. The organization can measure Claude Mythos Preview using only 5 of its 228 existing tasks, revealing a fundamental gap between AI capability growth and assessment infrastructure. This measurement deficit matters because it obscures what frontier models can actually do.

Palo Alto Networks simultaneously demonstrated why better measurements are urgent. In controlled tests, frontier models autonomously chained multiple vulnerabilities together, moving from initial system access to complete data theft in 25 minutes. The models required no human intervention to identify exploitation sequences or execute attacks. This speed compresses the window for human detection and response to near-zero.

The core problem extends beyond Claude Mythos. Evaluation methodologies are advancing slower than the models themselves. METR's toolkit was designed for earlier capability levels. Claude Mythos now operates in ranges where most existing benchmarks provide no meaningful signal. Five applicable tasks cannot generate reliable assessments of model behavior.
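The statistical consequence of a shrinking task pool can be illustrated with a quick sketch. The numbers below are hypothetical, not METR's actual results; the point is only how interval width scales with the number of applicable tasks. Assuming a model passes 60% of whatever tasks apply, a Wilson score interval shows how little a 5-task sample pins down:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a success rate
    observed on n evaluation tasks."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Same 60% observed pass rate, measured on 5 tasks vs. a full 228-task suite
# (illustrative figures only):
lo5, hi5 = wilson_interval(3, 5)        # roughly 0.23 to 0.88
lo228, hi228 = wilson_interval(137, 228)  # roughly 0.54 to 0.66
```

With five tasks, the plausible range spans from "fails most tasks" to "passes nearly all of them," which is no signal at all; the full suite narrows the same estimate to about twelve percentage points.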

This creates a dangerous blind spot in AI safety. Organizations deploying frontier models lack adequate tools to measure their autonomous reasoning, planning, and tool-use capabilities. Vendors cannot quantify risks. Security teams cannot predict failure modes. The ability to attack systems autonomously ranks among the highest-stakes capabilities to evaluate, yet current methods fail to capture it effectively.

The vulnerability chaining demonstration is particularly alarming because it shows models operating without human guidance. Earlier systems required step-by-step prompting. These models identify multi-stage attack paths and execute them independently. Palo Alto Networks did not name the specific models tested, but the findings clearly concern frontier systems with advanced reasoning capabilities.

Both findings point to the same requirement: evaluation methods must advance to match model capability growth. Without better measurement tools, organizations are deploying systems whose autonomous capabilities remain largely unknown. The 25-minute exploitation timeline shows that speed is now a security variable. Faster, more capable evaluation frameworks are not a luxury; they are a precondition for deploying these systems safely.