From: aidotengineer

AI agents for software engineering have seen a surge in popularity over the past year, prompting investigation into their effectiveness at finding and fixing bugs, and their reliability for software maintenance [00:00:00].

Bismouth’s Background

Bismouth has been working on software agents for over a year [00:00:12].

  • Ian (CEO): Background in data engineering, machine learning, and search; previously worked at Zillow on its A/B testing platform. Bismouth is his second dev-tooling startup, the first having started with technical documentation search in 2019 [00:00:16].
  • Nick (CTO): Primarily worked in the software security space before Bismouth, previously at Google on an internal tools team building endpoint security software, and as a research scientist focused on detecting software exploitation [00:00:37]. Many tools and techniques from the security space are transferable to building intelligent agentic code tools [00:00:56].

The Problem: Gaps in Existing Benchmarks

While several benchmarks measure how effective Large Language Models (LLMs) are at writing code (e.g., HumanEval, the Polyglot benchmark, LiveCodeBench), these primarily cover feature development and testing, which is only one part of the software development life cycle (SDLC) [00:01:21].

Other crucial SDLC stages are often overlooked:

  • Initial Scoping and Planning: Requires broader business context, knowledge of existing systems and designs, and exploration of potential solutions. This is a distinct task from development [00:01:40].
  • Code Review: Largely unbenchmarked, despite the rise of LLM-based tools in this area [00:02:04].
  • Deployment: A separate task involving configuration, monitoring setup, and integration with existing systems [00:02:17].
  • Software Maintenance Tasks: Includes bug fixes, dependency upgrades, and migrations. While still involving code writing, it differs significantly from feature development [00:02:29]. The ability to deeply reason through a codebase to find bugs is directly transferable to producing features, as both require an understanding of the system architecture and its connectedness [00:02:49].

Challenges with AI Agent Reasoning for Bugs

Maintenance often requires a deeper understanding of the system than when the feature was originally written [00:03:09].

  • Agents struggle with holistic evaluation of files and systems, often finding only subsets of bugs per run [00:03:20].
  • Even with “thinking models,” reasoning can be narrow, exploring a limited number of avenues at a time [00:03:25]. This leads to LLMs missing bugs that human developers would immediately spot, and confirming false bugs [00:03:34].
  • Agents often patch simple bugs with ease, but precisely because those bugs are simple, this is not evidence of capability on complex issues [00:03:47].

Limitations of Existing Bug Detection Benchmarks

Existing bug detection benchmarks from the software security space are not suitable for new agentic AI systems because they were built for classic static analysis or program repair tools [00:04:01].

  • Simplicity: They focus on simplistic bugs in common patterns (e.g., null pointer dereferences, buffer overflows, SQL injection), which can often be found statically [00:04:16].
  • Language Limitations: Many are limited to specific languages, like Java, due to its prevalence in enterprise software [00:04:29].
  • Security Bias: A bias towards security issues exists, largely because classic static analysis tools focused on this area. However, bugs appear in many ways beyond security defects, such as copy-paste errors that break software for end-users [00:04:46].

Bismouth’s SM100 Benchmark

To address these limitations, Bismouth developed the SM100 benchmark [00:01:13].

  • Dataset: 100 triaged, validated, and classified bugs were painstakingly gathered from over 84 public repositories [00:05:11]. These are remediated “in the wild” bugs from open-source projects, spanning a range of issue types from obvious low-domain knowledge to senior staff-level engineering knowledge [00:05:18].
  • Multi-language Focus: The benchmark includes Python, TypeScript, JavaScript (due to popularity and LLM performance), and Go (as a control for low-level systems engineering languages) [00:05:39].
  • Objective Bugs: Bugs are defined as explicit security issues or logical issues that could cause data loss or system crashes [00:06:05]. This definition excludes ambiguous or harmless issues, as well as cases where a potential issue is correctly handled by calling code higher up [00:06:11]. The benchmark also explicitly excludes feature requests, optimizations, style or formatting changes, and design decisions, to reduce ambiguity and ensure reproducibility [00:06:41].
  • Metadata Annotation: Each bug is annotated with metadata, including its severity, context, required human domain knowledge, difficulty to find, and its implication (e.g., data loss, crash, security exploit) [00:07:01]. This helps understand what level of bugs AI agents can regularly find [00:07:25]. While agents occasionally find zero-day exploits, this usually requires many runs over the same context [00:07:32].
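
As a rough illustration of the annotation scheme (the field names and values below are assumptions for illustration, not SM100’s actual data format), a benchmark entry might look like this:

```python
from dataclasses import dataclass

# Hypothetical sketch of an SM100-style entry; the schema is an illustrative
# assumption, not the benchmark's published format.
@dataclass
class SM100Entry:
    repo: str                # public repository the bug was sourced from
    language: str            # "python" | "typescript" | "javascript" | "go"
    introducing_commit: str  # the "golden" PR/commit that introduced the bug
    fixing_commit: str       # the upstream remediation used for validation
    severity: str            # e.g. "low" | "medium" | "high"
    domain_knowledge: str    # how much human/domain context is needed
    difficulty: str          # how hard the bug is to find
    implication: str         # e.g. "data loss" | "crash" | "security exploit"

example = SM100Entry(
    repo="github.com/example/project",  # placeholder values throughout
    language="python",
    introducing_commit="abc123",
    fixing_commit="def456",
    severity="high",
    domain_knowledge="low",
    difficulty="medium",
    implication="data loss",
)
```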

SM100 Evaluation Metrics

For each system benchmarked, four key numbers are derived (a scoring sketch follows the list):

  1. Needle in the Haystack Result: Measures if the system can discover bugs without any prior knowledge [00:07:55].
    • Methodology: Repositories are broken into subsystems of interrelated files (e.g., a front-end component or a specific API). The list of subsystem files that were modified in the “golden PR commit” (the commit that introduced the bug) is fed to the LLM. This narrows the search scope without hinting at the specific bug [00:09:10].
  2. False Positive Rate: Manually measured to assess the overall effectiveness of the system’s bug reports [00:08:08].
  3. Time of Introduction: Evaluates if the system can find the bug when given the pull request or commit that introduced it. Here, the agent has more immediate context and doesn’t need to search as widely [00:08:24].
  4. Remediation Suggestion: For each discovered bug, the agent is asked to fix it, and the result is checked to ensure it fixes the bug without breaking other codebase parts [00:08:48].
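
A minimal sketch of how the first two numbers could be tallied, assuming reports and known bugs are matched by file and line proximity (this matching rule and the helpers below are assumptions for illustration, not Bismouth’s actual harness):

```python
# Hedged sketch of SM100-style scoring, not the real evaluation harness.
# A "report" flags a file and line; a known bug records where its fix landed.

def matches(report: dict, bug: dict, tolerance: int = 5) -> bool:
    """Count a report as a hit if it flags the buggy file near the buggy line."""
    return (report["file"] == bug["file"]
            and abs(report["line"] - bug["line"]) <= tolerance)

def score(reports: list[dict], bugs: list[dict]) -> dict:
    # 1. Needle in the haystack: how many known bugs were discovered at all.
    needles_found = sum(any(matches(r, b) for r in reports) for b in bugs)
    # 2. False positive rate: share of reports matching no known bug
    #    (manually adjudicated in the real benchmark).
    true_positives = sum(any(matches(r, b) for b in bugs) for r in reports)
    fp_rate = 1 - true_positives / max(len(reports), 1)
    return {"needles_found": needles_found, "false_positive_rate": fp_rate}
```

Metrics 3 and 4 would reuse the same matching, but applied to reports generated from the introducing PR alone and to proposed patches that are re-validated against the rest of the codebase, respectively.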

Building Effective Agents for Bug Detection

Basic agent implementations (e.g., a simple loop with shell, search/replace, report, and finish tools) are trivial to build [00:10:37]. They do find some bugs (Bismouth’s basic loop found five or six), but with an extremely high false positive rate of 97% [00:10:49]. Most such basic agents, equipped only with the tools described in Anthropic or OpenAI model card releases, are not up to the task of truly finding and triaging bugs [00:10:55].
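
For reference, the “basic loop” described above has roughly the following shape. This is a hedged sketch: call_model stands in for whichever LLM API is in use and is assumed to return a tool call as a dict; the search/replace tool is omitted for brevity.

```python
import subprocess

# Minimal bug-hunting agent loop: the model is given shell, report, and finish
# tools and iterated until it finishes or runs out of steps. `call_model` is a
# hypothetical wrapper around an LLM API, not any vendor's actual interface.

def run_basic_agent(call_model, repo_dir: str, max_steps: int = 50) -> list[str]:
    reports: list[str] = []
    history = [{"role": "user", "content": f"Find bugs in the code under {repo_dir}."}]

    for _ in range(max_steps):
        action = call_model(history)  # e.g. {"tool": "shell", "args": {"cmd": "ls"}}
        if action["tool"] == "shell":
            # Let the model explore the repository with shell commands.
            result = subprocess.run(action["args"]["cmd"], shell=True, cwd=repo_dir,
                                    capture_output=True, text=True)
            history.append({"role": "tool", "content": result.stdout + result.stderr})
        elif action["tool"] == "report":
            # The model flags a suspected bug.
            reports.append(action["args"]["description"])
            history.append({"role": "tool", "content": "report recorded"})
        elif action["tool"] == "finish":
            break
    return reports
```

Even this little machinery will surface a handful of real bugs, but as noted above, without stronger filtering and triage the false positive rate is overwhelming.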

Building a good agent for identifying bugs and working on code is challenging; it takes far more than a basic tool loop, as the benchmark comparisons below illustrate.

Benchmark Performance Comparisons

Bismouth’s agent is primarily built on Anthropic models, which are easier for it to serve to customers via Vertex AI [00:12:55].

  • Needle in a Haystack: Bismouth leads by finding 10 out of 100 “needles,” with the next best solution finding 7 [00:12:22].
  • True Positive Rate (Detection): Codex performs highest at 45%, Bismouth at 25%, and Claude Code at 16% [00:11:46]. These three solutions also produce significantly less “random nonsense” than the others [00:11:57].
  • False Positives: Devin, Cursor Agent, and Cosine reported between 900 and 1,300 items with only a 3-10% true positive rate, indicating a need to tighten their output scope [00:12:06]. One agent even produced 70 reports for a single issue, far too many for a human engineer to sift through [00:14:35].
  • PR Review (Time of Introduction): Codex was strongest at 27%, followed by Devin at 21% and Bismouth at 17% [00:12:20]. These figures do not account for false positives [00:12:34].

Performance of Basic and Open-Source Agents

Open-source models have a long way to go [00:13:28].

  • DeepSeek R1 had a 1% true positive rate [00:13:32].
  • Llama 4 Maverick had a 2% true positive rate [00:13:40].
  • R1 managed to find one needle in a haystack [00:13:43].
  • In a basic loop, Sonnet 4 found six needles and o3 found two, with 3% and 6% true positive rates respectively [00:13:46].
  • The highest popular agent score (outside Bismouth) was 7% on SM100 [00:13:59]. This indicates that most widely used agents are currently poor at finding and fixing complex bugs [00:14:08].

Some simple bugs are still missed: for instance, only Bismouth and Codex found a state issue where a form’s “is dirty” flag was never reset, preventing the form from clearing after submission [00:14:59]. This type of bug, though not critical, affects user experience and would be caught immediately by a human developer [00:15:20].
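
In simplified form (an illustrative reconstruction, not the actual repository code), that class of bug looks like this:

```python
# Illustrative reconstruction of the missed bug: a form tracks an is_dirty
# flag, but submit() never resets it, so UI logic that clears the form once
# it is "clean" again never fires.

class FormState:
    def __init__(self) -> None:
        self.fields: dict[str, str] = {}
        self.is_dirty = False

    def edit(self, name: str, value: str) -> None:
        self.fields[name] = value
        self.is_dirty = True

    def submit(self) -> dict[str, str]:
        payload = dict(self.fields)
        # BUG: missing `self.is_dirty = False` here, so the form never reports
        # itself as clean and is never cleared after submission.
        return payload
```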

Key Challenges and Future Outlook

A notable commonality among agents is their narrow thinking, even when using thinking models [00:15:38]. They don’t go deep enough in their evaluations [00:15:51].

  • Requirement: Broader thinking chains and deeper thinking along selected chains are needed to effectively find bugs [00:15:56].
  • Inconsistency: The total number of bugs found remains roughly consistent per run, but the specific bugs found change, indicating that LLMs are not holistically inventorying everything in a file [00:16:08]. This pervasive problem across the industry suggests models apply different biases or contexts with each run [00:16:26].

Despite scores of 60-80% on other benchmarks such as SWE-bench, agents still struggle with SM100 [00:16:45]. This implies that while existing agents can create software up front, managing and fixing deployed software remains a major challenge [00:16:55]. Doing so requires targeted search, better program comprehension, cross-file reasoning, and bug-pattern recognition, all of which currently lack depth [00:17:09].

While the current state indicates that the most frequently used agents carry a high risk of introducing bugs [00:17:44], newer agents like Codex and Bismouth are showing improvements in tighter, more focused bug solving and in handling complex issues [00:17:23]. The stability of these newer agents is nascent but encouraging, showing an increased ability to reason through code and use context effectively [00:17:50]. This benchmark aims to demonstrate progress clearly and to highlight that effort and different techniques can lead to significant improvements across the industry [00:18:05].