From: aidotengineer
AI agents for software engineering have seen a surge in popularity, leading to questions about their efficacy in finding and fixing bugs and their reliability for software maintenance [00:00:00]. This article explores the current state of AI agents in bug detection, highlighting existing limitations and the challenges in AI agent development for this critical area of software development.
The Software Development Life Cycle (SDLC) and AI Coverage
While existing benchmarks measure the effectiveness of Large Language Models (LLMs) for writing code (e.g., HumanEval, the Polyglot benchmark, LiveCodeBench), this covers only one part of the SDLC [01:23:23]. Other crucial stages, such as initial scoping and planning (which require broader business context and system knowledge) [01:40:41], and deployment (which involves configuration, monitoring, and integration) [02:17:34], are distinct tasks.
Of particular focus is the software maintenance phase, which includes bug fixes, dependency upgrades, and migrations [02:29:29]. This area, along with the code review process, is largely unbenchmarked by existing work, despite the increasing presence of LLM-based tools in these spaces [02:04:14].
The Nature of Software Maintenance and Bug Finding
Software maintenance tasks, though still involving code writing, differ significantly from feature development [02:39:07]. The ability to deeply reason through a codebase to find bugs directly relates to understanding system architecture and its connectedness [02:48:58]. In fact, finding bugs often requires a deeper understanding of the system than when the feature was originally written [03:09:02].
Observed Challenges for AI Agents in Bug Identification
Maintenance requires the AI to first deeply reason through a system and identify potential bugs [03:16:11]. However, current agents exhibit several challenges here:
- Holistic Evaluation Issues: AI agents struggle to evaluate files and systems holistically, typically finding only a subset of bugs in any given run [03:21:05].
- Narrow Reasoning: Even with thinking models, reasoning appears somewhat narrow, exploring only a limited number of potential avenues at a time [03:26:01]. As a result, LLMs miss bugs that human developers would immediately pick up on, while confirming bugs that human developers would immediately discard [03:33:41].
- Simplicity of Patches: While agents usually manage to patch identified bugs with little effort, this is often because the bugs themselves are not particularly complex [03:47:29].
Limitations of Existing Bug Detection Benchmarks
Current bug detection benchmarks from the software security space are not suitable for evaluating agentic AI systems, as they were built for classic static analysis or program repair tools [04:00:03]. Their limitations include:
- Simplistic Bugs: They focus on relatively simplistic bugs in common patterns (e.g., null pointer dereferences, buffer overflows, SQL injection), which can often be found statically; a minimal example follows this list [04:16:10].
- Language Limitations: Many benchmarks are limited to specific languages, like Java, where a vast majority of enterprise software was traditionally written [04:29:05].
- Security Bias: There’s a bias towards security issues, largely due to the focus of classic static analysis tools, despite bugs appearing in many other forms (e.g., copy-paste bugs that break software for end-users) [04:46:04].
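To make concrete the class of bug these older benchmarks target, here is a minimal illustrative Python example (not taken from any specific benchmark): a SQL injection that a classic static analyzer can flag by pattern matching, shown next to the parameterized fix.

```python
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input is interpolated directly into the SQL string,
    # exactly the kind of pattern-level flaw static analyzers catch well.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_fixed(conn: sqlite3.Connection, username: str):
    # Fixed: a parameterized query keeps user input out of the SQL text.
    return conn.execute(
        "SELECT id, email FROM users WHERE name = ?", (username,)
    ).fetchall()
```

Bugs of this shape can be found without reasoning about the wider system, which is why they say little about an agent's ability to handle real maintenance work.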
The SM-100 Benchmark: A New Approach
Bismouth developed the SM-100 benchmark to specifically measure how good software agents are at coding tasks outside of normal feature development [01:14:03].
Benchmark Methodology
The SM-100 benchmark involves:
- Curated Bug Set: 100 triaged, validated, and classified bugs from over 84 public open-source repositories [05:11:32]. These are remediated real-world bugs [05:19:02].
- Issue Variety: The bugs range from those requiring little domain-specific knowledge to those demanding senior or staff-level engineering knowledge [05:28:13].
- Multi-Language Focus: Includes Python, TypeScript, JavaScript, and Go, chosen to evaluate performance across popular and lower-level systems languages [05:39:07].
- Objective Bugs: Focuses on explicit security or logical issues that could cause data loss or system crashes, explicitly excluding ambiguous issues like feature requests, optimizations, style formatting, or design decisions [06:04:10].
- Metadata Annotation: Each bug is annotated with metadata including severity, context, required human domain knowledge, difficulty to find, and the bug’s implication (data loss, crash, security exploit) [07:00:36]. This helps understand what level of bugs AI agents can regularly find [07:25:01].
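As a rough illustration of the metadata described above (a hypothetical sketch, not the benchmark's actual schema), a single SM-100-style entry might look like this:

```python
from dataclasses import dataclass

@dataclass
class BugEntry:
    """Hypothetical shape of one SM-100-style bug record; illustrative only."""
    repo: str                # public repository the bug was drawn from
    language: str            # "python", "typescript", "javascript", or "go"
    severity: str            # e.g. "low", "medium", "high"
    context_required: str    # how much surrounding-system context is needed
    domain_knowledge: str    # from little domain knowledge up to staff-level
    difficulty_to_find: str  # how hard a reviewer has to look to spot it
    implication: str         # "data_loss", "crash", or "security_exploit"
    introducing_commit: str  # commit or PR in which the bug was introduced

example = BugEntry(
    repo="example-org/example-repo",  # placeholder, not a real SM-100 entry
    language="python",
    severity="high",
    context_required="single subsystem",
    domain_knowledge="low",
    difficulty_to_find="medium",
    implication="crash",
    introducing_commit="abc1234",
)
```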
Key Metrics Measured
For each system benchmarked, SM-100 assesses four key areas; a minimal scoring sketch follows the list:
- Needle in a Haystack: Can the system discover bugs without prior knowledge? [07:55:04] To manage large repositories, agents are given a reduced list of files within likely inter-related subsystems, rather than the entire codebase [09:36:20].
- False Positive Rate: A manual measurement of incorrect bug reports to gauge overall effectiveness [08:08:08].
- Bug Introduction Detection: Can the system find the bug at the time of its introduction (e.g., given the pull request or commit)? [08:24:06]
- Remediation Suggestions: For each discovered bug, can the agent fix it without breaking other parts of the codebase? [08:50:49]
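To make the scoring concrete, here is a minimal hypothetical sketch (not Bismouth's actual harness) of how a single needle-in-a-haystack run could be scored once a human has triaged the agent's reports against the known bugs:

```python
def score_run(reported: list[str], known_bugs: set[str]) -> dict[str, float]:
    # `reported` holds the bug identifiers an agent's reports were mapped to
    # during manual triage (unmatched reports keep their own bogus IDs);
    # `known_bugs` is the set of validated SM-100-style bugs for the repo.
    hits = [r for r in reported if r in known_bugs]
    false_positives = [r for r in reported if r not in known_bugs]
    return {
        "detection_rate": len(set(hits)) / len(known_bugs) if known_bugs else 0.0,
        "true_positive_rate": len(hits) / len(reported) if reported else 0.0,
        "false_positive_rate": len(false_positives) / len(reported) if reported else 0.0,
    }

# One real bug found among 20 reports: 5% true positive rate, 95% false positive rate.
print(score_run(["bug-7"] + [f"bogus-{i}" for i in range(19)], {"bug-7", "bug-12"}))
```

The bug-introduction and remediation checks would layer on top of this, pinning each detection to the introducing commit or pull request and verifying that the proposed patch does not break the rest of the codebase.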
Performance and Observed Limitations
Building a good agent for identifying bugs and working on code is challenging, requiring a combination of model, system, prompting, information feeding, and navigation strategy [11:07:35].
Comparative Performance on SM-100
- Needle in a Haystack: Bismouth led by finding 10 out of 100 bugs, with the next best solution finding seven [11:23:07].
- True Positive Rate: Codex achieved the highest at 45%, followed by Bismouth at 25%, and Claude Code at 16% [11:46:17]. These solutions show tighter scoping, with significantly fewer “random nonsense” reports [11:57:02].
- PR Review (Needle in a Haystack): Codex was strongest at 27%, then Devin at 21%, and Bismouth at 17% [12:23:44]. Even the best model found only about a third of the bugs, indicating a long way to go [12:43:08].
Basic Agents and Open-Source Models
Basic agents (simple loops with shell tools) often have a 97% false positive rate [10:53:13]. Open-source models also struggled significantly in this space:
- R1 had a 1% true positive rate [13:30:17].
- Llama Maverick had a 2% true positive rate [13:38:09].
- Sonnet 4 found 6 bugs with a 3% true positive rate [13:46:05].
- o3 found 2 bugs with a 6% true positive rate [13:46:05].
The highest score among popular agents on the complex bugs in SM-100 was 7%, meaning the most widely used agents are currently poor at finding and fixing complex bugs [13:59:01]. Many agents also report a massive number of issues with very low true positive rates (e.g., 70 reports from a single agent for one issue), making them impractical for engineers to triage [14:26:07]. Simple bugs are often missed as well; for example, only Bismouth and Codex found a state issue that prevents a form from clearing after submission [14:59:17].
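As an illustration of that last class of bug (a language-agnostic sketch, not the actual SM-100 case), a form handler that forgets to reset its state after a successful submission looks roughly like this:

```python
class FeedbackForm:
    """Toy form model showing a 'form not cleared after submit' state bug."""

    def __init__(self) -> None:
        self.fields: dict[str, str] = {}

    def set_field(self, name: str, value: str) -> None:
        self.fields[name] = value

    def submit(self) -> dict[str, str]:
        payload = dict(self.fields)
        # Bug: the form state is never reset, so a later submit() silently
        # resends stale values. The fix is one line: self.fields.clear()
        return payload
```

A second submission after this one resends the old values: a simple, user-visible logic bug of the kind the benchmark shows most agents overlooking.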
Fundamental Challenges
A notable commonality among agents is their narrow thinking [15:38:01]. Even with thinking models, they are narrow in their evaluation and don’t go deep enough [15:42:01]. The total number of bugs found per run remains consistent, but the specific bugs change, indicating that LLMs are not holistically inventorying everything in a file [16:08:04]. Different biases seem to cause models to view files in different ways on different runs [16:26:02].
Despite high scores on benchmarks like SWE-bench (60-80%), agents struggle significantly with SM-100 [16:45:00]. This suggests that while existing agents can create software upfront, managing and fixing it after deployment remains a major challenge [16:55:02].
Required Improvements for AI Agents
Addressing these challenges in AI agent development for bug identification demands:
- Targeted Search [17:09:47]
- Better Program Comprehension [17:12:12]
- Cross-File Reasoning (see the sketch after this list) [17:13:30]
- Deeper Bug Pattern Recognition [17:14:28]
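To make the cross-file reasoning point concrete, here is a small hypothetical Python example where the bug is only visible by reading two pieces of code together: a function's return contract changed, but a caller elsewhere still assumes the old behavior. (In a real codebase the two functions would live in separate modules; they are shown together so the snippet runs standalone.)

```python
def compute_total(prices: list[float], discount: float) -> float:
    # The contract changed at some point: this now returns the discount
    # amount rather than the discounted total.
    return sum(prices) * discount

def charge(prices: list[float]) -> float:
    # Bug: this caller was written against the old contract and assumes
    # compute_total() returns the amount to charge. Neither function looks
    # wrong in isolation; spotting the bug requires reading both.
    return compute_total(prices, discount=0.10)

print(charge([100.0, 50.0]))  # prints 15.0 instead of the intended 135.0
```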
Conclusion
The most frequently used agents today carry a high risk of introducing bugs [17:44:03]. However, newer agents, including Bismouth and Codex, are beginning to show an increased ability to reason through code and use context more effectively to evaluate concerns [17:49:03]. While the capability is still in its infancy [03:43:08], progress is encouraging, and improvements in this area will have broad benefits across the software industry [18:01:04].