A consortium of 64 mathematicians created SOOHAK, a new benchmark designed to test whether AI models can recognize when math problems are unsolvable. The benchmark contains 439 handwritten mathematical tasks, with 99 deliberately constructed to have no solution.

The results expose a critical weakness in current AI systems. Google's Gemini 3 Pro posts the highest score on research-level problems, at 30 percent. Yet no model tested reaches 50 percent accuracy at identifying the unsolvable tasks. In other words, the models more often than not produce an answer to a problem that mathematically cannot be solved.

The research uncovers a troubling scaling pattern. As models receive more computational resources and training data, their ability to solve actual problems improves substantially, but their capacity to recognize when a problem has no solution does not improve at the same rate. The risk is that bigger models look increasingly capable while still producing wrong answers to impossible questions.

SOOHAK addresses a fundamental problem in AI evaluation. Most benchmarks measure whether models can solve tasks correctly. Few test whether models can identify when tasks are fundamentally broken. This gap matters especially in mathematics and other domains where false confidence can mislead users.
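The distinction between solving tasks and spotting broken ones can be made concrete by scoring the two separately. The following is a minimal sketch, not SOOHAK's actual grading code: the `Task` class, the `score` function, the use of `None` to mean "declared unsolvable", and exact string matching for answers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Task:
    # Hypothetical representation: reference is None when the task
    # was deliberately constructed to have no solution.
    reference: Optional[str]


def score(tasks: Sequence[Task],
          responses: Sequence[Optional[str]]) -> Tuple[float, float]:
    """Return (solve_accuracy, refusal_accuracy) as separate fractions.

    A response of None is assumed to mean the model declared the task
    unsolvable. Answers are compared by exact string match, which is a
    simplification of real grading.
    """
    solved = solvable = refused = unsolvable = 0
    for task, response in zip(tasks, responses):
        if task.reference is None:
            unsolvable += 1
            refused += response is None
        else:
            solvable += 1
            solved += response == task.reference
    solve_acc = solved / solvable if solvable else float("nan")
    refusal_acc = refused / unsolvable if unsolvable else float("nan")
    return solve_acc, refusal_acc


# Toy data, invented for illustration only.
tasks = [Task("4"), Task("7"), Task(None), Task(None), Task(None)]
responses = ["4", "7", "12", "3", None]
solve_acc, refusal_acc = score(tasks, responses)
print(f"solve accuracy: {solve_acc:.2f}, refusal accuracy: {refusal_acc:.2f}")
```

In this toy run the model answers every solvable task correctly but flags only one of the three unsolvable ones. A single combined accuracy figure would blur that difference, which is why the sketch reports the two numbers separately.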

The handwritten nature of the tasks adds realism. Models must parse human notation rather than perfectly formatted input, which more closely mirrors real-world usage. The involvement of 64 mathematicians is meant to ensure that the unsolvable problems rest on genuine mathematical principles rather than simple tricks.

The benchmark's design targets what researchers call "refusal training" or "uncertainty awareness": a model's ability to decline a task instead of answering it. Leading models are comparatively strong at computation but weak at epistemic honesty. A model that solves 30 percent of hard problems while also attempting to answer impossible ones creates a false impression of reliability.

This work suggests that achieving genuinely trustworthy AI requires rethinking how models are trained and evaluated. Raw problem-solving capacity matters less than the ability to know when