AI coding agents are good at locating the correct file when fixing bugs but often fail to identify the specific lines that need modification. A new benchmark called SWE-Explore exposes this gap by isolating code search performance from the repair work itself.
The study shows that agents like Claude Code and Codex consistently pinpoint the right file in repositories but miss most of the crucial lines within those files. This creates a fundamental problem. Even an AI with perfect repair capabilities cannot fix bugs if it cannot find the exact code segments that require changes. The missing context breaks the entire workflow.
SWE-Explore is presented as the first benchmark to test code search independently of bug repair. Previous evaluations combined both tasks, which masked the specific weakness in code navigation. By separating them, the researchers found that the bottleneck lies not in the repair logic but in locating what needs fixing.
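To make the file-versus-line distinction concrete, here is a minimal sketch of how the two levels of localization could be scored separately. It is not SWE-Explore's actual metric: the function names `file_recall` and `line_recall`, the data layout, and the sample path `src/parser.py` are all assumptions made for illustration.

```python
from typing import Dict, Set

# Hypothetical layout: file path -> set of relevant line numbers.
Locations = Dict[str, Set[int]]


def file_recall(predicted: Locations, gold: Locations) -> float:
    """Fraction of gold files that appear anywhere in the prediction."""
    if not gold:
        return 1.0
    found = sum(1 for path in gold if path in predicted)
    return found / len(gold)


def line_recall(predicted: Locations, gold: Locations) -> float:
    """Fraction of gold lines that the prediction covers."""
    total = sum(len(lines) for lines in gold.values())
    if total == 0:
        return 1.0
    found = sum(
        len(lines & predicted.get(path, set()))
        for path, lines in gold.items()
    )
    return found / total


# Illustrative data: the agent opens the right file but marks mostly wrong lines.
gold = {"src/parser.py": {40, 41, 42, 88}}
predicted = {"src/parser.py": {10, 11, 40}}

print(file_recall(predicted, gold))  # 1.0
print(line_recall(predicted, gold))  # 0.25
```

In this made-up case the agent scores perfectly at the file level while covering only a quarter of the lines that matter, which is the kind of gap a combined search-and-repair evaluation can hide.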
The implications matter for developers relying on AI assistants for code maintenance. An agent that confidently opens the right file but highlights the wrong lines wastes time and introduces risk. Developers must manually verify which sections the AI selected, defeating the efficiency gains these tools promise.
The benchmark tests how well agents navigate large codebases to find relevant code sections. Real-world repositories contain thousands of files and millions of lines. Current agents struggle with this scale. They may use file names and basic search as proxies for understanding code structure, but these heuristics break down when multiple sections contain similar logic or when the bug fix requires context from distant files.
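A small sketch shows why plain keyword search can be a weak proxy when several sections contain similar logic. The helper `keyword_hits` and the two in-memory files are hypothetical and stand in for whatever search an agent actually runs.

```python
from typing import Dict, List, Tuple


def keyword_hits(files: Dict[str, str], keyword: str) -> List[Tuple[str, int]]:
    """Return (path, line number) for every line containing the keyword."""
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if keyword in line:
                hits.append((path, lineno))
    return hits


# Two hypothetical modules with near-identical guard logic.
files = {
    "billing/invoice.py": (
        "def total(items):\n"
        "    if not items:\n"
        "        raise ValueError('empty')\n"
        "    return sum(items)\n"
    ),
    "billing/refund.py": (
        "def refund(items):\n"
        "    if not items:\n"
        "        raise ValueError('empty')\n"
        "    return -sum(items)\n"
    ),
}

print(keyword_hits(files, "if not items"))
# [('billing/invoice.py', 2), ('billing/refund.py', 2)]
```

The search returns both locations with equal weight. Deciding which one belongs to the bug requires knowing how each function is used, which text matching alone does not provide.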
Fixing this requires improvements in how agents model code relationships and follow program flow across files. Better retrieval mechanisms, larger or better-used context windows, or training on code dependency patterns could help agents pinpoint the exact lines that need attention.
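As one illustration of structure-aware retrieval, the sketch below uses Python's standard `ast` module to map each function to its line span and to find which functions call a given target. This is an assumed, simplified approach rather than a mechanism described in the study: it handles a single file, and it only recognizes direct calls by bare name. The helper names `function_spans` and `callers_of` are hypothetical.

```python
import ast
from typing import Dict, List, Tuple


def function_spans(source: str) -> Dict[str, Tuple[int, int]]:
    """Map each function name to its (first line, last line) in the source."""
    spans = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            spans[node.name] = (node.lineno, node.end_lineno)
    return spans


def callers_of(source: str, target: str) -> List[str]:
    """List functions whose bodies contain a direct call to `target`."""
    callers = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for inner in ast.walk(node):
            if (
                isinstance(inner, ast.Call)
                and isinstance(inner.func, ast.Name)
                and inner.func.id == target
            ):
                callers.append(node.name)
                break
    return callers


# Illustrative source: a bug in `parse` may also need context from `load`.
source = (
    "def parse(text):\n"
    "    return text.strip()\n"
    "\n"
    "def load(raw):\n"
    "    cleaned = raw.lower()\n"
    "    return parse(cleaned)\n"
)

print(function_spans(source))        # {'parse': (1, 2), 'load': (4, 6)}
print(callers_of(source, "parse"))   # ['load']
```

Line spans plus caller relationships give a search step something narrower than a whole file to return, and a way to pull in related code that shares no keywords with the bug report.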
Until AI coding agents solve this search problem, their utility remains limited to smaller codebases or situations where developers already know roughly where changes belong. The finding suggests that the next generation of coding assistants
