UC Berkeley researchers have launched Agents' Last Exam (ALE), a demanding benchmark designed to evaluate whether AI models can execute complex, long-horizon professional workflows that generate real economic value. The benchmark was built with input from over 300 domain experts and represents a shift away from traditional narrow performance metrics toward testing practical agent capabilities.

OpenAI's GPT-5.5 (released in April) topped the initial leaderboard with a 24.0% pass rate, narrowly defeating Anthropic's Claude Fable 5. This outcome surprised many observers who expected Claude to dominate given its recent performance on other benchmarks. GPT-5.5 operated through the Codex harness during testing.

The ALE benchmark tests models on their ability to handle extended task sequences that mimic real professional scenarios. Rather than measuring performance on isolated problems, the test evaluates whether AI agents can maintain coherence and accuracy across multi-step workflows that require domain knowledge and sustained reasoning.

The 24.0% pass rate reflects the genuine difficulty of the benchmark. These are not simple reasoning tasks but workflows that professionals complete daily. The results suggest that current frontier models handle such tasks with only marginal competence, succeeding in roughly one of every four attempts.

The benchmark addresses a gap in current AI evaluation methodology. Standard benchmarks often reward narrow capabilities that don't translate to production environments. ALE attempts to bridge that gap by measuring what actually matters in workplace deployments: whether the model can complete real work reliably.

UC Berkeley's Center for Responsible, Decentralized Intelligence (RDI) developed the benchmark alongside domain experts spanning law, finance, software engineering, and other professional fields. This multidisciplinary approach is meant to ensure that test cases reflect actual job requirements rather than researcher assumptions about what matters.

The leaderboard format enables ongoing competition and transparency. As more models enter testing, the benchmark will reveal which architectures and training approaches produce genuinely competent agents rather than systems