The UK's AI Security Institute (AISI) found that standard AI benchmarks systematically underestimate what AI agents can accomplish. Researchers tested seven major evaluation frameworks and found that current benchmarks cap the compute budget available to an agent, which masks its true capabilities.
When researchers increased token budgets tenfold on software engineering tasks, success rates jumped roughly 25 percent. Newer AI models showed the largest improvements when given larger budgets. The institute estimates that actual progress at the AI frontier is about 60 percent steeper than previous measurements indicated.
This gap matters because benchmarks shape how the industry assesses AI safety and progress. If evaluations cap compute resources, they place an artificially low ceiling on measured performance. Real-world deployments often give models a larger compute budget to reason through problems, which produces better results than benchmarks predict.
The finding has immediate implications for AI safety research. When agents perform better than benchmarks suggest, safety researchers may underestimate risks associated with more capable systems. Testing frameworks that constrain compute also miss opportunities to evaluate how models behave when given more reasoning time, which is critical for understanding potential failure modes.
The AISI study focused primarily on software engineering tasks, where the gap between constrained and unconstrained performance proved most visible. This domain involves multi-step reasoning where additional tokens allow models to explore more solution paths and verify their work.
The research doesn't claim benchmarks are useless. Rather, it argues that current standards need revision to account for realistic compute budgets. Organizations building AI systems should run their own evaluations with compute conditions matching their deployment scenarios. The discrepancy between capped benchmarks and real performance creates blind spots in safety assessment and progress tracking. Future benchmark design should either reflect actual deployment constraints or clearly separate results across different compute budgets to give researchers full visibility into what systems can do.
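One way to picture the last recommendation, reporting results separately for each compute budget, is the minimal Python sketch below. It is an illustration only, not AISI's methodology: the function name `success_by_budget`, the `(token_budget, solved)` record layout, and every number in `runs` are assumptions made up for the example.

```python
from collections import defaultdict


def success_by_budget(runs):
    """Return {token_budget: success_rate} for a list of run records.

    Each record is a (token_budget, solved) pair. This layout is a
    hypothetical one chosen for the sketch, not a real benchmark format.
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for budget, ok in runs:
        totals[budget] += 1
        if ok:
            solved[budget] += 1
    # Report one rate per budget instead of a single pooled number.
    return {budget: solved[budget] / totals[budget] for budget in sorted(totals)}


# Made-up records purely for illustration; these are not AISI data.
runs = [
    (100_000, False), (100_000, True), (100_000, False), (100_000, False),
    (1_000_000, True), (1_000_000, True), (1_000_000, False), (1_000_000, False),
]

rates = success_by_budget(runs)
for budget, rate in rates.items():
    print(f"{budget:>9,} tokens: {rate:.0%} solved")
```

Keeping the budget as an explicit key means a reader can see at which compute level each success rate was measured, rather than inferring capability from a single capped figure.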
