From: aidotengineer
AI agents for software engineering have seen a surge in popularity over the last year, prompting investigations into their efficacy for identifying and resolving bugs [00:00:00]. This article details the findings from Bismuth, a company specializing in software agents, on how well these tools perform at maintenance tasks like bug finding and fixing [00:00:06].
The SM-100 Benchmark
Bismuth developed a comprehensive benchmark, named SM-100, over several months to evaluate the capabilities of software agents in coding tasks beyond typical feature development [00:01:15]. Existing benchmarks primarily focus on how effective Large Language Models (LLMs) are at writing code, such as HumanEval, the Polyglot benchmark, and LiveCodeBench [00:01:23]. However, these only cover a fraction of the Software Development Life Cycle (SDLC) [00:01:33].
Limitations of Existing Benchmarks
Traditional bug detection benchmarks from the software security space, designed for classic static analysis or program repair tools, have significant limitations for evaluating modern agentic AI systems [00:04:01]:
- Simplistic Bugs: They often focus on straightforward bugs in common patterns, like null pointer dereferences, buffer overflows, or SQL injection, which can be found statically [00:04:16].
- Limited Languages: Many are restricted to languages like Java, reflecting where much complex enterprise software was historically written [00:04:29].
- Security Bias: There’s a bias towards security issues because classic static analysis tools historically focused on this area [00:04:46]. However, bugs manifest in many forms beyond security defects, such as copy-paste errors that break user experience without necessarily being security vulnerabilities (see the sketch below) [00:04:57].
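As a concrete, entirely hypothetical illustration of that kind of defect: nothing in the snippet below is a security vulnerability, yet a pasted-in guard silently breaks a feature.

```python
class Cart:
    def __init__(self, total: float) -> None:
        self.total = total


def apply_coupon(cart: Cart, amount: float, coupon_valid: bool) -> None:
    if coupon_valid:
        cart.total -= amount


def apply_gift_card(cart: Cart, amount: float, gift_card_valid: bool,
                    coupon_valid: bool = False) -> None:
    # Copy-paste bug: the guard was pasted from apply_coupon and still
    # checks coupon_valid, so valid gift cards are silently ignored.
    # No crash, no exploit -- just a broken user experience that a
    # security-oriented static analyzer would never flag.
    if coupon_valid:
        cart.total -= amount
```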
Scope of SM-100
The SM-100 benchmark extends evaluation to other critical parts of the SDLC [00:01:37]:
- Code Review: This crucial process is largely unbenchmarked by existing work, despite the rise of LLM-based tools in the space [00:02:04].
- Software Maintenance: This includes tasks like bug fixes, dependency upgrades, and migrations [00:02:29]. While still involving code writing, it differs from feature development by requiring deep reasoning through a codebase to identify potential bugs [00:02:40]. Finding bugs requires an understanding of system architecture and connectedness, sometimes even deeper than when the feature was originally written [00:03:04].
The SM-100 benchmark includes 100 validated and classified bugs from over 84 public repositories, representing real-world issues in open-source projects [00:05:14].
The bugs are categorized by difficulty, ranging from obvious issues requiring low domain knowledge to complex problems demanding senior staff-level engineering understanding [00:05:28]. The benchmark is also multi-language, focusing on Python, TypeScript, JavaScript, and Go, as LLMs show varying performance across languages [00:05:39].
Defining “Objective Bugs”
The benchmark focuses on “objective bugs,” defined as explicit security or logical issues that could lead to data loss or system crashes [00:06:05]. This strict definition avoids ambiguity and harmless issues [00:06:11].
[!NOTE] Issues explicitly not included are feature requests, optimizations, style formatting, or design decisions, as these are often subjective and debated even among humans [00:06:41].
Each bug in the SM-100 is annotated with metadata (a sketch of such a record follows the list), including:
- Severity [00:07:02]
- Context where it was defined and called [00:07:04]
- Required system-specific domain knowledge for a human to find it [00:07:06]
- Difficulty of finding it, even with expert knowledge [00:07:10]
- Implication of the bug (data loss, crash, security exploit) [00:07:18]
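As a rough illustration of that annotation, one entry might look something like the record below; the field names and values are paraphrased from the list above, not the benchmark’s actual schema.

```python
from dataclasses import dataclass


@dataclass
class SM100Bug:
    """Hypothetical shape of one annotated SM-100 entry."""
    repo: str               # public repository the bug was taken from
    severity: str           # how serious the defect is
    context: str            # where the buggy code is defined and called
    domain_knowledge: str   # system-specific knowledge a human would need
    difficulty: str         # how hard it is to spot, even with expertise
    implication: str        # "data loss", "crash", or "security exploit"


example = SM100Bug(
    repo="example-org/example-project",  # placeholder, not a real SM-100 repo
    severity="high",
    context="defined in parser.py, reached from cli.py",
    domain_knowledge="medium",
    difficulty="senior",
    implication="crash",
)
```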
Benchmark Metrics
The SM-100 benchmark measures four key aspects of an agent’s performance [00:07:53]; a sketch of how the first two could be scored follows the list:
- Needle in the Haystack: The ability to discover bugs without any prior knowledge [00:07:56]. To avoid biasing agents, the repositories are broken into subsystems, and the agent is fed files only from relevant subsystems (those modified in the original bug-introducing commit), rather than the entire repository [00:09:09].
- False Positive Rate: The rate of incorrect bug identifications from the agent’s output [00:08:14].
- Finding at Introduction: Whether the agent can identify a bug given the pull request (PR) or commit that introduced it [00:08:24]. This gives the agent more immediate context [00:08:37].
- Suggesting Remediations: The ability of the agent to suggest a fix for identified bugs that resolves the issue without breaking other parts of the codebase [00:08:51].
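A minimal sketch of scoring the first two metrics, assuming each agent report has already been matched against the known SM-100 bugs (matching is reduced to set membership here, which is a simplification of how real reports would be compared):

```python
def score_run(reported: set[str], known_bugs: set[str]) -> dict[str, float]:
    """Score one agent run against the bugs planted in a subsystem."""
    true_positives = reported & known_bugs
    false_positives = reported - known_bugs
    return {
        # "needle in the haystack": share of planted bugs the agent found
        "detection_rate": len(true_positives) / max(len(known_bugs), 1),
        # share of the agent's reports that were not real bugs
        "false_positive_rate": len(false_positives) / max(len(reported), 1),
    }


print(score_run({"bug-1", "spurious-9"}, {"bug-1", "bug-2"}))
# {'detection_rate': 0.5, 'false_positive_rate': 0.5}
```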
Performance Comparison of AI Agents
Building an effective AI agent for code analysis is challenging, requiring a sophisticated combination of model selection, system design, prompting strategies, information feeding, and navigation [00:11:10]. Basic implementations, often running simple tools (shell, search, replace, think, and report bug) in a loop, can find some bugs but typically have an extremely high false positive rate (e.g., 97%) [00:10:37].
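The sketch below shows what such a naive loop might look like; everything here (the `llm` callable, the action format, the tool names) is an illustrative assumption, not the interface of any agent tested on SM-100.

```python
import subprocess
from typing import Callable


def run_shell(cmd: str) -> str:
    """The 'shell' tool: run a command and return its stdout."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout


def basic_bug_hunt(llm: Callable[[list[str]], dict], task: str,
                   max_steps: int = 20) -> list[str]:
    """Naive agent loop: the model picks a tool each step until it is done.

    `llm` is any callable mapping the conversation so far to an action dict
    such as {"tool": "shell", "args": {"cmd": "grep -rn TODO src/"}}.
    The search/replace editing tool is omitted for brevity.
    """
    findings: list[str] = []
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = llm(history)
        tool, args = action.get("tool"), action.get("args", {})
        if tool == "shell":          # inspect files, grep, run tests
            history.append(run_shell(args["cmd"]))
        elif tool == "think":        # free-form scratchpad step
            history.append(args["thought"])
        elif tool == "report_bug":   # record a suspected bug
            findings.append(args["description"])
        elif tool == "done":
            break
    return findings
```

Nothing in a loop like this constrains how many report-bug actions the model emits or forces it to double-check them, which is one plausible reason such baselines drown their true positives in a roughly 97% false positive rate.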
Leading Solutions
- Bismuth: Leads in the “needle in a haystack” metric, finding 10 out of the 100 test bugs [00:11:23]. It achieved a 25% true positive rate in bug detection [00:11:48]. Bismuth’s solution, often running on Anthropic models, has outperformed Anthropic’s own Claude Code in multiple categories [00:13:15].
- Codex: Achieved the highest true positive rate in bug detection at 45% [00:11:52]. It was also strong in PR review, finding 27% of the SM-100 bugs when given the PRs that introduced them [00:12:23].
- Claude Code: Showed a 16% true positive rate in bug detection [00:11:49].
- Devin: Found 21% of the SM-100 bugs in PR review [00:12:28].
- Cursor Agent and Cosine: Along with Devin, these agents reported between 900 and 1,300 items, but with low true positive rates (3-10%) [00:12:07].
Basic Agents and Open-Source Models
The performance of basic agents and open-source models is significantly lower [00:13:19]:
- DeepSeek R1: 1% true positive rate across hundreds of reported bugs [00:13:30].
- Llama 4 Maverick: 2% true positive rate [00:13:40].
- Sonnet 4 (in a loop): Found 6 “needle in a haystack” bugs with a 3% true positive rate [00:13:48].
- o3 (in a loop): Found 2 “needle in a haystack” bugs with a 6% true positive rate [00:13:51].
[!WARNING] The highest popular agent score (excluding Bismuth) on the SM-100 was 7% [00:13:59]. This indicates that the most widely used agents struggle with finding and fixing complex bugs, which represents a significant portion of software engineering work [00:14:09].
Agents also frequently produce a massive number of reports, making it impractical for human engineers to sift through them [00:14:30]. For example, one agent gave 70 reports for a single issue [00:14:36]. This highlights a need for tighter scoping and improved accuracy in agent reporting [00:14:48]. A simple state issue (a form not clearing) was found only by Bismuth and Codex, demonstrating that even basic user experience bugs are often missed by other agents (a toy illustration follows) [00:15:03].
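The kind of state bug referenced above can be as small as a handler that forgets to reset its fields. The snippet below is a toy Python stand-in for that pattern, not the actual SM-100 case:

```python
class FeedbackForm:
    """Toy stand-in for UI form state."""

    def __init__(self) -> None:
        self.fields: dict[str, str] = {}
        self.submissions: list[dict[str, str]] = []

    def submit(self) -> None:
        self.submissions.append(dict(self.fields))
        # Bug: self.fields is never cleared here, so the next submission
        # silently carries over stale values -- no crash, no security
        # impact, just a confusing user experience.
```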
Common Limitations and Future Directions
A notable commonality across all examined agents is their narrow thinking, even when using thinking models [00:15:40]. They don’t delve deep enough into issues, and their reasoning tends to explore only a limited number of avenues at a time [00:03:27]. As a result, agents miss bugs that human developers would quickly identify while reporting issues that humans would immediately discard [00:03:34].
Moreover, agents do not consistently evaluate files holistically; the specific bugs found can vary across different runs, suggesting biases in their contextual evaluation [00:16:10].
“Despite 60 to 70% (in some cases 80%) scores on SWE-bench, agents still struggle with SM-100. What does that mean? Existing agents are able to create software up front, but managing and fixing software after it has been deployed will be a major struggle, as far as we see it.” [00:16:45]
This indicates a significant gap: while some agents excel at initial software creation, they are currently ill-equipped for post-deployment maintenance and bug fixing [00:16:55]. Addressing this problem requires targeted search, enhanced program comprehension, cross-file reasoning, and advanced bug pattern recognition, capabilities that are currently lacking in depth across many solutions [00:17:09].
However, newer agents like Codex and Bismuth are showing promise by offering tighter, more focused bug reports and demonstrating an increased ability to reason through code and use context effectively [00:17:23]. The stability of these agents is nascent, but the early results are encouraging [00:18:01]. Continued effort and different techniques in this area are expected to yield significant benefits across the entire industry [00:18:11].