From: aidotengineer

AI agents for software engineering have seen a surge in popularity, raising questions about their effectiveness in finding and fixing bugs, and their reliability for software maintenance [00:00:00]. While benchmarks exist for measuring Large Language Model (LLM) effectiveness in writing new code, evaluating their performance in other critical aspects of the Software Development Life Cycle (SDLC), particularly maintenance, remains a challenge [00:01:17].

Current State of AI in the SDLC [00:01:35]

Existing benchmarks primarily focus on feature development and testing [00:01:57]. However, other crucial stages, such as initial scoping and planning, code review, deployment, and especially software maintenance, are less explored [00:01:40].

Software maintenance tasks include bug fixes, dependency upgrades, and migrations [00:02:31]. This work differs from feature development in that it requires deep reasoning through a codebase to identify issues, often demanding a more profound understanding of the system architecture than was needed when the feature was originally written [00:02:40].

Challenges in AI Bug Detection [00:03:09]

AI agents currently struggle to evaluate files and systems holistically, frequently finding only a subset of bugs per run [00:03:20]. Their reasoning can be narrow, exploring a limited number of avenues, which causes them to miss bugs a human developer would immediately identify or to confirm false positives [00:03:27]. While they can often patch simple bugs, this is not indicative of their ability to handle complex issues [00:03:47].

Existing bug detection benchmarks from the software security space have significant limitations for evaluating new agentic AI systems [00:04:01]:

  • Simplistic Bugs: They focus on basic bugs (e.g., null pointer dereferences, buffer overflows) that can often be found statically [00:04:16].
  • Limited Languages: Many are restricted to languages like Java [00:04:30].
  • Security Bias: There’s a bias towards security issues, despite bugs appearing in many other forms, such as copy-paste errors that break software without being security defects [00:04:46].
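
To make that last point concrete, the following is a minimal, hypothetical illustration (not taken from any benchmark) of a copy-paste error that breaks behavior without being a security defect:

```python
# Hypothetical copy-paste bug: the body of scale_height was copied from
# scale_width and one argument was never updated. The code runs without
# crashing and has no security impact, but silently returns wrong results.

def scale_width(width: float, factor: float) -> float:
    return width * factor

def scale_height(height: float, factor: float) -> float:
    return factor * factor  # should be: height * factor

if __name__ == "__main__":
    print(scale_width(10, 2))   # 20, as expected
    print(scale_height(10, 2))  # 4, silently wrong
```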

Bismouth’s SM100 Benchmark [00:05:09]

To address these gaps, Bismouth developed the SM100 benchmark, consisting of 100 triaged, validated, and classified bugs from over 84 public open-source repositories [00:05:11].

Benchmark Goals [00:05:25]

  • Range of Issue Types: Bugs range from obvious issues requiring low domain knowledge to those demanding senior-staff level understanding [00:05:28].
  • Multi-Language: Includes Python, TypeScript, JavaScript, and Go, acknowledging differing LLM performance across languages [00:05:39].
  • Objective Bugs: Focuses on explicit security or logical issues causing data loss or system crashes, avoiding ambiguous or harmless issues [00:06:05]. It explicitly excludes feature requests, optimizations, style, or design decisions [00:06:41].

Metadata and Classification [00:07:00]

Each bug is annotated with metadata, including:

  • Severity and context [00:07:02]
  • Required system-specific domain knowledge to find it [00:07:06]
  • Difficulty of discovery even with knowledge [00:07:11]
  • Implication of the bug (e.g., data loss, crash, security exploit) [00:07:18]

These classifications help understand what level of bugs AI agents can regularly find [00:07:25]. While AI can occasionally find complex exploits (e.g., a zero-day exploit found by o3, which took 100 runs), the benchmark assesses everyday usage [00:07:32].
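
As a rough illustration of this metadata, here is a minimal sketch of what a per-bug record could look like; the field names, scales, and values below are assumptions made for this example, not SM100's actual schema:

```python
# Hypothetical per-bug metadata record in the spirit of SM100's annotations.
# Field names and value scales are illustrative assumptions only.
from dataclasses import dataclass
from enum import Enum

class Implication(Enum):
    DATA_LOSS = "data_loss"
    CRASH = "crash"
    SECURITY_EXPLOIT = "security_exploit"

@dataclass
class BugRecord:
    repo: str                  # public repository the bug was triaged from
    language: str              # "python", "typescript", "javascript", or "go"
    severity: int              # e.g., 1 (minor) to 5 (critical)
    domain_knowledge: int      # system-specific knowledge needed to find it
    discovery_difficulty: int  # how hard discovery is even with that knowledge
    implication: Implication   # what the bug causes when triggered

example = BugRecord(
    repo="example/project",
    language="go",
    severity=4,
    domain_knowledge=3,
    discovery_difficulty=4,
    implication=Implication.DATA_LOSS,
)
```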

Evaluation Metrics [00:07:51]

The SM100 benchmark measures four key aspects:

  1. Needle in a Haystack: Can the system discover bugs without prior knowledge [00:07:55]? This involves feeding LLMs a reduced list of files within relevant subsystems, avoiding bias while scoping the search [00:09:10].
  2. False Positive Rate: Manual measurement of incorrect bug reports [00:08:14].
  3. Time of Introduction Detection (PR Review): Can the system identify the bug when given the pull request or commit that introduced it [00:08:24]? This provides an optimistic scenario with immediate context [00:08:37].
  4. Remediation Suggestion: For each discovered bug, can the agent suggest a fix that resolves the issue without breaking the rest of the codebase [00:08:51]?
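
The sketch below shows how the detection and false positive numbers might be computed once an agent's reports have been matched to ground truth; SM100's actual process includes manual review of false positives, so this automated version is only an illustrative assumption:

```python
# Illustrative scoring of one agent run against known SM100-style bugs.
# Assumes reports have already been matched to canonical bug identifiers;
# the real benchmark's false positive check is performed manually.

def score_run(reported: set[str], ground_truth: set[str]) -> dict[str, float]:
    true_positives = reported & ground_truth
    false_positives = reported - ground_truth

    bugs_found = len(true_positives)  # "needle in a haystack" hits
    tp_rate = bugs_found / len(reported) if reported else 0.0
    fp_rate = len(false_positives) / len(reported) if reported else 0.0

    return {
        "bugs_found": bugs_found,
        "true_positive_rate": tp_rate,
        "false_positive_rate": fp_rate,
    }

# Example: three reports, one of which matches a known bug.
print(score_run({"bug-17", "spurious-a", "spurious-b"}, {"bug-17", "bug-42"}))
```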

AI Agent Performance Findings [00:11:20]

Basic implementations of AI agents, while trivial to set up, often result in high false positive rates (e.g., 97%) [00:10:37]. Building an effective agent for bug identification requires a combination of model, system, prompting, information feeding, and navigation strategy [00:11:10].
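
As a sketch of what that combination involves, the skeleton below strings together file scoping ("information feeding"), prompting, and model calls; call_llm is a placeholder, and none of this reflects Bismouth's or any vendor's actual implementation:

```python
# Hedged skeleton of a bug-finding agent: scope files to a relevant
# subsystem, prompt a model per file, and leave room for navigation and
# triage steps. call_llm is a placeholder, not a real provider API.
from pathlib import Path

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire in your model provider here")

def scope_files(repo_root: str, subsystem: str) -> list[Path]:
    # Information feeding: restrict the search to a relevant subsystem
    # instead of dumping the entire repository into context.
    return [p for p in Path(repo_root).rglob("*.py") if subsystem in str(p)]

def find_candidate_bugs(repo_root: str, subsystem: str) -> list[str]:
    candidates = []
    for path in scope_files(repo_root, subsystem):
        prompt = (
            "Review this file for concrete logic or security defects that "
            "cause data loss, crashes, or exploits. Do not report style "
            "issues or optimizations.\n\n"
            f"File: {path}\n{path.read_text()}"
        )
        candidates.append(call_llm(prompt))
    # A real agent would add cross-file navigation, deduplication, and
    # false-positive triage before surfacing anything to a human.
    return candidates
```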

Benchmark Results [00:11:22]

  • Needle in a Haystack: Bismouth led by finding 10 bugs, with the next best finding 7 [00:11:27].
  • True Positive Rate (Detection): Codex showed 45%, Bismouth 25%, and Claude Code 16% [00:11:46]. These solutions also generated significantly less “random nonsense” than the others [00:11:57]; some agents produced an astounding 70 reports for a single issue, making human review impractical [00:14:36].
  • PR Review (Time of Introduction Detection): Codex was strong at 27%, followed by Devin at 21% and Bismouth at 17% [00:12:23]. Even the best model only found about a third of the bugs in this context [00:12:44].

Challenges for Current Agents [00:13:59]

  • High False Positive Rates: Many popular agents achieved true positive rates of 7% or less on SM100 [00:14:00], indicating they are currently ill-equipped for complex bug finding and fixing [00:14:09].
  • Narrow and Shallow Thinking: Agents, even with thinking models, evaluate a narrow range of options and do not go deep enough into selected reasoning chains [00:15:40].
  • Lack of Holistic Evaluation: The total number of bugs found per run remains consistent, but the specific bugs found change, suggesting LLMs do not holistically inventory issues within a file [00:16:10].

Conclusion and Future Directions [00:16:42]

Despite decent scores on feature creation benchmarks, existing AI agents significantly struggle with software maintenance tasks like finding and fixing bugs [00:16:50].

“Existing agents are able to create software upfront, but to manage and fix software after it’s been deployed will be a major struggle as far as we see it.” [00:16:55]

Overcoming these challenges will take deliberate effort and new techniques. While the most frequently used agents today risk introducing new bugs [00:17:44], newer agents, including Bismouth, are showing an increased ability to reason through code and to use context effectively during evaluation [00:17:50]. This emerging stability is encouraging, and with continued work along these lines, significant improvements in AI for software maintenance are anticipated, benefiting the entire industry [00:18:11].