New benchmark exposes how badly AI struggles with real knowledge work

A new benchmark reveals a stark gap between AI capabilities and the demands of actual knowledge work. The best-performing AI models fully complete only 3 percent of realistic tasks, exposing fundamental limitations that persist despite rapid progress in other areas.

The benchmark tests AI systems on authentic workflows rather than isolated, controlled problems. Real knowledge work involves context switching, managing multiple information sources, maintaining state across long interactions, and handling ambiguous requirements. These elements reflect how professionals actually operate, not how lab benchmarks typically function.

Current large language models struggle with persistence across tasks. They cannot reliably maintain context over extended workflows or build knowledge progressively through related subtasks. When a task requires synthesizing information from multiple documents, cross-referencing sources, or updating earlier conclusions based on new data, even frontier models falter. They also fail when tasks demand reasoning about the actual business impact of decisions rather than just technical correctness.

The 3 percent complete-solve rate stands in sharp contrast to how these systems perform on standard benchmarks like MMLU or coding challenges, where they achieve 80 to 90 percent accuracy. The difference lies in task construction: curated benchmarks isolate problems, whereas real work layers multiple challenges together.

This finding challenges recent claims about AI systems reaching near-human reasoning. The benchmark suggests that while models excel at narrow, well-defined problems, they cannot replicate how humans approach the messy, interconnected tasks that characterize most professional work.

The implications reshape expectations for AI deployment in enterprises. Knowledge workers should not expect current systems to autonomously handle their complete workflows. Instead, AI functions best in limited roles: drafting initial versions, summarizing specific documents, or generating candidates for human review. Deploying AI on the assumption it can fully automate knowledge work leads to unreliable outputs and wasted effort.

This benchmark provides a reality check, shifting the AI industry's focus from headline capabilities to actual utility.
