METR's evaluation framework has hit a critical wall. The organization can measure Claude Mythos Preview using only 5 of its 228 existing tasks, revealing a fundamental gap between AI capability growth and assessment infrastructure. This measurement deficit matters because it obscures what frontier models can actually do.

Palo Alto Networks simultaneously demonstrated why better measurements are urgent. In controlled tests, frontier models autonomously chained multiple vulnerabilities together, moving from initial system access to complete data theft in 25 minutes. The models required no human intervention to identify exploitation sequences or execute attacks. This speed compresses the window for human detection and response to near-zero.

The core problem extends beyond Claude Mythos. Evaluation methodologies are advancing slower than the models themselves. METR's toolkit was designed for earlier capability levels. Claude Mythos now operates in ranges where most existing benchmarks provide no meaningful signal. Five applicable tasks cannot generate reliable assessments of model behavior.
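The statistical consequence of a shrinking task pool can be illustrated with a quick sketch. The numbers below are hypothetical, not METR's actual results; the point is only how interval width scales with the number of applicable tasks. Assuming a model passes 60% of whatever tasks apply, a Wilson score interval shows how little a 5-task sample pins down:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a success rate
    observed on n evaluation tasks."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Same 60% observed pass rate, measured on 5 tasks vs. a full 228-task suite
# (illustrative figures only):
lo5, hi5 = wilson_interval(3, 5)        # roughly 0.23 to 0.88
lo228, hi228 = wilson_interval(137, 228)  # roughly 0.54 to 0.66
```

With five tasks, the plausible range spans from "fails most tasks" to "passes nearly all of them," which is no signal at all; the full suite narrows the same estimate to about twelve percentage points.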

This creates a dangerous blind spot in AI safety. Organizations deploying frontier models lack adequate tools to measure their autonomous reasoning, planning, and tool-use capabilities. Vendors cannot quantify risks. Security teams cannot predict failure modes. The ability to attack systems autonomously ranks among the highest-stakes capabilities to evaluate, yet current methods fail to capture it effectively.

The vulnerability chaining demonstration is particularly alarming because it shows models operating without human guidance. Earlier systems required step-by-step prompting. These models identify multi-stage attack paths and execute them independently. Palo Alto Networks did not name the specific models tested, but the findings clearly concern frontier systems with advanced reasoning capabilities.

Both findings point to the same requirement: evaluation methods must advance to match model capability growth. Without better measurement tools, organizations are deploying systems whose autonomous capabilities remain largely unknown. The 25-minute exploitation timeline shows that speed is now a security variable. Faster, more capable evaluation frameworks are not a luxury; they are a precondition for deploying these systems safely.