From: aidotengineer
AI agents for software engineering have surged in popularity, prompting investigations into how effective they are at tasks such as finding and fixing bugs and performing broader software maintenance [00:00:00]. Evaluating these agents requires benchmarks that cover the full spectrum of the Software Development Lifecycle (SDLC) [00:01:35].
Current Benchmarking Landscape
While there are existing benchmarks for evaluating Large Language Models (LLMs) on code writing, such as HumanEval, the Polyglot benchmark, and LiveCodeBench [00:01:23], these cover only a fraction of the SDLC [00:01:33].
Significant gaps exist in benchmarking other critical stages:
- Initial Scoping and Planning: Requires broad business context and knowledge of existing systems, a distinctly different task from development [00:01:40].
- Code Review: Largely unbenchmarked, despite the emergence of LLM-based tools in this area [00:02:04]. This is the first area addressed by the benchmark Bismouth discusses [00:02:13].
- Deployment: A separate task involving configuration, monitoring, and integration with existing systems [00:02:17].
- Software Maintenance Tasks: Bug fixes, dependency upgrades, and migrations [00:02:29]. At its core this is still code writing, but of a distinctly different kind from feature development [00:02:40]. Finding bugs requires deep reasoning through the codebase and an understanding of system architecture and connectedness, often deeper than what was needed when the feature was originally written [00:02:46].
Limitations of Existing Bug Detection Benchmarks
Existing bug detection benchmarks, primarily from the software security space, are not well-suited for evaluating modern agentic AI systems [00:04:01].
- Simplistic Bugs: They tend to focus on relatively simple bugs and common patterns (e.g., null pointer dereferences, buffer overflows, SQL injection) that can be found statically [00:04:16] (see the hypothetical sketch after this list).
- Limited Language Support: Many are limited to specific languages, such as Java, owing to that language's historical prevalence in enterprise software [00:04:29].
- Security Bias: They skew toward security issues, whereas bugs appear in many forms beyond security defects (e.g., copy-paste bugs that break software for end users) [00:04:46].
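For contrast, here is a hypothetical pair of Python snippets (not drawn from any benchmark): the first shows the kind of pattern a static security scanner flags reliably, the second the kind of quiet logic defect that requires reasoning about intent rather than pattern matching.

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # Statically detectable pattern: SQL built by string interpolation
    # (SQL injection). Linters and security scanners flag this reliably.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def total_refund(order) -> float:
    # Copy-paste style logic bug: 'shipping' is added twice and 'tax' never,
    # quietly producing wrong refunds for end users. Nothing here matches a
    # security pattern, yet it is the kind of objective defect SM-100 targets.
    return order.items_total + order.shipping + order.shipping
```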
SM-100: A New Benchmark for Software Maintenance
Bismouth developed the SM-100 benchmark to address the limitations of existing evaluation methods and focus on software maintenance tasks [00:01:13].
Benchmark Development and Characteristics
- Bug Collection: 100 triaged, validated, and classified bugs were painstakingly gathered from over 84 public repositories [00:05:11]. All of them are already-remediated, real-world issues from open-source projects [00:05:18].
- Issue Variety: The bugs span a range of difficulty, from obvious issues requiring little domain-specific knowledge to ones demanding senior- or staff-level engineering knowledge and significant depth of understanding of the project [00:05:28].
- Multi-Language: Includes Python, TypeScript, JavaScript, and Go, chosen to assess performance across popular languages and a low-level systems engineering language [00:05:39].
- Objective Bug Definition: An “objective bug” is an explicit security issue or a logical issue that could cause data loss or a system crash [00:06:05]. This definition excludes ambiguous or harmless issues and ensures reproducibility across evaluations [00:06:11].
- The definition excludes feature requests, optimizations, stylistic or formatting changes, and design decisions, as these are often subjective and debated even among humans [00:06:41].
- Metadata Annotation: Each bug is annotated with metadata, including:
- Severity [00:07:02]
- Context where it was defined and called [00:07:04]
- Amount of system-specific domain knowledge a human would require to find it [00:07:06]
- Difficulty of finding the bug even with that knowledge [00:07:10]
- Implication of the bug (e.g., data loss, crash, security exploit) [00:07:18]
These classifications help in understanding which levels of bugs AI agents can regularly find [00:07:25].
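A minimal sketch of how one of these annotations might be represented, assuming hypothetical field names and value scales (the talk does not specify the exact schema):

```python
from dataclasses import dataclass
from enum import Enum

class Implication(Enum):
    DATA_LOSS = "data_loss"
    CRASH = "crash"
    SECURITY_EXPLOIT = "security_exploit"

@dataclass
class BugAnnotation:
    repo: str                 # public repository the bug was taken from
    severity: int             # e.g. 1 (minor) to 5 (critical)
    defined_in: str           # file/function where the faulty code is defined
    called_from: list[str]    # call sites that give the surrounding context
    domain_knowledge: int     # system-specific knowledge a human would need
    difficulty: int           # how hard the bug is to find even with that knowledge
    implication: Implication  # data loss, crash, or security exploit
```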
Evaluation Metrics
For each system benchmarked on SM-100, four key metrics are derived [00:07:53]:
- Needle in the Haystack: Can the system discover the bug without any prior knowledge [00:07:56]?
- False Positive Rate (FPR): Manually measured to assess overall effectiveness and prevent overwhelming developers with irrelevant reports [00:08:14].
- Bug Finding at Time of Introduction: Can the system identify the bug when given the pull request or commit that introduces it, leveraging the immediate context [00:08:24]?
- Remediation Suggestion: For each discovered bug, the agent is asked to fix it, and the fix is evaluated on whether it addresses the bug without breaking other code [00:08:50].
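A rough sketch of how the first two metrics could be computed from an agent's output, assuming a hypothetical report format in which each report names a file and a line number:

```python
def needle_found(reports: list[dict], known_bug: dict) -> bool:
    # "Needle in the haystack": does any report land on the known bug,
    # approximated here by matching the file and the introducing line range?
    return any(
        r["file"] == known_bug["file"]
        and known_bug["start_line"] <= r["line"] <= known_bug["end_line"]
        for r in reports
    )

def true_positive_rate(valid_reports: int, total_reports: int) -> float:
    # Share of reported items that are real bugs; its complement is the
    # false positive rate, which the talk says was measured manually.
    return valid_reports / total_reports if total_reports else 0.0
```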
“Needle in the Haystack” Methodology
To avoid biasing agents while still scoping the problem, repositories are broken into subsystems of interrelated files (e.g., part of a front end or a specific API endpoint) [00:09:09]. The LLM is then fed only the files in the subsystem containing the “golden” PR or commit where the bug was introduced, providing a reduced list of files without hinting at the specific bug location [00:09:52].
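A simplified sketch of how such a reduced file list could be assembled, assuming the subsystem grouping already exists as a mapping from subsystem name to files; the exact tooling SM-100 uses is not described in the talk:

```python
import subprocess

def files_touched_by(repo_path: str, golden_commit: str) -> set[str]:
    # Files modified by the commit (or PR head) that introduced the bug.
    out = subprocess.run(
        ["git", "-C", repo_path, "show", "--name-only", "--pretty=format:", golden_commit],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line}

def subsystem_context(subsystems: dict[str, set[str]], touched: set[str]) -> set[str]:
    # Every file in any subsystem that overlaps the touched files, so the
    # agent sees related code without being pointed at the bug itself.
    context: set[str] = set()
    for files in subsystems.values():
        if files & touched:
            context |= files
    return context
```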
Performance and Challenges of AI Agents
Building a good agent for identifying bugs and working on code is challenging, requiring a combination of model, system, prompting, information feeding, and navigation strategy [00:11:10].
Benchmarking Results
- Needle in the Haystack: Bismouth led the pack, finding 10 out of 100 bugs, with the next best finding 7 [00:11:23]. This highlights significant room for growth in this area [00:11:35].
- True Positive Rate: Codex achieved 45%, Bismouth 25%, and Claude Code 16% [00:11:48]. These solutions showed tighter scoping, producing significantly less “random nonsense” than the others [00:11:57].
- Agents like Devin, Cursor Agent, and Cosign reported between 900 and 1,300 items, with a true positive rate of only 3-10% [00:12:07], indicating a high false positive rate.
- PR Review: Codex led, finding 27% of the “needle in the haystack” bugs when reviewing the introducing PR, followed by Devin at 21% and Bismouth at 17% [00:12:23]. Even the best result is only 27%, indicating a long way to go for both PR review and bug detection [00:12:44].
- Basic Agents (Simple Loop):
- Open-source models such as DeepSeek R1 and Llama 4 Maverick showed very low true positive rates (1% and 2% respectively) [00:13:28].
- Claude Sonnet 4 found 6 “needles” and o3 found 2, with true positive rates of 3% and 6% respectively [00:13:46].
- The highest score among popular agents other than Bismouth was 7% on SM-100 [00:13:59]. This implies that the most widely used agents are currently poor at finding and fixing complex bugs [00:14:09].
Common Agent Limitations
- Narrow Thinking: Agents exhibit narrow thinking, even when using “thinking models,” and do not go deep enough in their evaluations [00:15:40].
- Lack of Holistic Evaluation: Across runs the total number of bugs found stays roughly constant, but the specific bugs change, suggesting LLMs do not holistically inventory everything in a file [00:16:08].
- High False Positive Rates: Some agents produce an astounding number of false reports (e.g., 70 reports for one issue), which no human engineer would sift through [00:14:36].
- Missing Simple Bugs: Even simple bugs, such as a form state-clearing issue that impacts user experience, are often missed by most agents [00:14:59].
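As a hypothetical illustration of that class of bug (the actual benchmark case is not shown in the talk), a reset path that forgets one piece of state:

```python
class SignupForm:
    def __init__(self) -> None:
        self.email = ""
        self.password = ""
        self.accepted_terms = False

    def clear(self) -> None:
        # Bug: 'accepted_terms' is never reset, so the next user of the form
        # inherits the previous user's consent state; a simple, user-visible
        # state-clearing defect of the kind most agents reportedly missed.
        self.email = ""
        self.password = ""
```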
Future Directions
Despite relatively high scores on benchmarks like SWE-bench for upfront software creation, agents still struggle significantly with the software maintenance tasks evaluated by SM-100 [00:16:45]. This points to a major challenge in managing and fixing deployed software with AI [00:16:57].
Iterative improvement of evaluation processes and agents will require:
- Targeted search [00:17:09]
- Better program comprehension [00:17:13]
- Cross-file reasoning [00:17:14]
- Bug pattern recognition [00:17:15]
Newer agents, including Bismouth and Codex, are beginning to show promise, with tighter, narrower scoping of the bugs they report and the ability to find complex issues [00:17:23]. The stability of these newer agents is nascent but encouraging, showing an increased ability to reason through code and to use context effectively during evaluation [00:18:01]. Continued effort and different techniques can lead to significant improvements across the industry [00:18:11].
For more details, visit bismouth.sh [00:18:33].