From: aidotengineer
AI agents for software engineering have surged in popularity, prompting investigations into how effective they are at tasks such as finding and fixing bugs and performing broader software maintenance [00:00:00]. Evaluating these agents requires benchmarks that cover the full spectrum of the Software Development Lifecycle (SDLC) [00:01:35].
Current Benchmarking Landscape
While there are existing benchmarks for evaluating Large Language Models (LLMs) on code writing, such as HumanEval, the Polyglot benchmark, and LiveCodeBench [00:01:23], these cover only a fraction of the SDLC [00:01:33].
Significant gaps exist in benchmarking other critical stages:
- Initial Scoping and Planning: Requires broad business context and knowledge of existing systems, a distinctly different task from development [00:01:40].
- Code Review: Largely unbenchmarked, despite the emergence of LLM-based tools in this area [00:02:04]. This is the first area addressed by the benchmark Bismouth discusses [00:02:13].
- Deployment: A separate task involving configuration, monitoring, and integration with existing systems [00:02:17].
- Software Maintenance Tasks: Bug fixes, dependency upgrades, and migrations [00:02:29]. At its core this is still code writing, but of a distinctly different kind from feature development [00:02:40]. Finding bugs requires deep reasoning through the codebase and an understanding of system architecture and connectedness, often deeper than what was needed when the feature was originally written [00:02:46].
Limitations of Existing Bug Detection Benchmarks
Existing bug detection benchmarks, primarily from the software security space, are not well-suited for evaluating modern agentic AI systems [00:04:01].
- Simplistic Bugs: They tend to focus on relatively simple bugs and common patterns (e.g., null pointer dereferences, buffer overflows, SQL injection) that can be found statically [00:04:16] (see the hypothetical sketch after this list).
- Limited Language Support: Many are limited to specific languages, such as Java, owing to that language's historical prevalence in enterprise software [00:04:29].
- Security Bias: They skew toward security issues, whereas bugs appear in many forms beyond security defects (e.g., copy-paste bugs that break software for end users) [00:04:46].
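For contrast, here is a hypothetical pair of Python snippets (not drawn from any benchmark): the first shows the kind of pattern a static security scanner flags reliably, the second the kind of quiet logic defect that requires reasoning about intent rather than pattern matching.

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # Statically detectable pattern: SQL built by string interpolation
    # (SQL injection). Linters and security scanners flag this reliably.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def total_refund(order) -> float:
    # Copy-paste style logic bug: 'shipping' is added twice and 'tax' never,
    # quietly producing wrong refunds for end users. Nothing here matches a
    # security pattern, yet it is the kind of objective defect SM-100 targets.
    return order.items_total + order.shipping + order.shipping
```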
SM-100: A New Benchmark for Software Maintenance
Bismouth developed the SM-100 benchmark to address the limitations of existing evaluation methods and focus on software maintenance tasks [00:01:13].
Benchmark Development and Characteristics
- Bug Collection: 100 triaged, validated, and classified bugs were painstakingly gathered from over 84 public repositories [00:05:11]. All of them are already-remediated, real-world issues from open-source projects [00:05:18].
- Issue Variety: The bugs span a range of difficulty, from obvious issues requiring little domain-specific knowledge to ones demanding senior- or staff-level engineering knowledge and significant depth of understanding of the project [00:05:28].
- Multi-Language: Includes Python, TypeScript, JavaScript, and Go, chosen to assess performance across popular languages and a low-level systems engineering language [00:05:39].
- Objective Bug Definition: An “objective bug” is an explicit security issue or a logical issue that could cause data loss or a system crash [00:06:05]. This definition excludes ambiguous or harmless issues and ensures reproducibility across evaluations [00:06:11].
- The definition excludes feature requests, optimizations, stylistic or formatting changes, and design decisions, as these are often subjective and debated even among humans [00:06:41].
- Metadata Annotation: Each bug is annotated with metadata, including:
- Severity [00:07:02]
- Context where it was defined and called [00:07:04]
- Amount of system-specific domain knowledge a human would require to find it [00:07:06]
- Difficulty of finding the bug even with that knowledge [00:07:10]
- Implication of the bug (e.g., data loss, crash, security exploit) [00:07:18]
These classifications help in understanding which levels of bugs AI agents can regularly find [00:07:25].
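A minimal sketch of how one of these annotations might be represented, assuming hypothetical field names and value scales (the talk does not specify the exact schema):

```python
from dataclasses import dataclass
from enum import Enum

class Implication(Enum):
    DATA_LOSS = "data_loss"
    CRASH = "crash"
    SECURITY_EXPLOIT = "security_exploit"

@dataclass
class BugAnnotation:
    repo: str                 # public repository the bug was taken from
    severity: int             # e.g. 1 (minor) to 5 (critical)
    defined_in: str           # file/function where the faulty code is defined
    called_from: list[str]    # call sites that give the surrounding context
    domain_knowledge: int     # system-specific knowledge a human would need
    difficulty: int           # how hard the bug is to find even with that knowledge
    implication: Implication  # data loss, crash, or security exploit
```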
Evaluation Metrics
For each system benchmarked on SM-100, four key metrics are derived [00:07:53]:
- Needle in the Haystack: Can the system discover the bug without any prior knowledge [00:07:56]?
- False Positive Rate (FPR): Manually measured to assess overall effectiveness and prevent overwhelming developers with irrelevant reports [00:08:14].
- Bug Finding at Time of Introduction: Can the system identify the bug when given the pull request or commit that introduces it, leveraging the immediate context [00:08:24]?
- Remediation Suggestion: For each discovered bug, the agent is asked to fix it, and the fix is evaluated on whether it addresses the bug without breaking other code [00:08:50].
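A rough sketch of how the first two metrics could be computed from an agent's output, assuming a hypothetical report format in which each report names a file and a line number:

```python
def needle_found(reports: list[dict], known_bug: dict) -> bool:
    # "Needle in the haystack": does any report land on the known bug,
    # approximated here by matching the file and the introducing line range?
    return any(
        r["file"] == known_bug["file"]
        and known_bug["start_line"] <= r["line"] <= known_bug["end_line"]
        for r in reports
    )

def true_positive_rate(valid_reports: int, total_reports: int) -> float:
    # Share of reported items that are real bugs; its complement is the
    # false positive rate, which the talk says was measured manually.
    return valid_reports / total_reports if total_reports else 0.0
```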
“Needle in the Haystack” Methodology
To avoid biasing agents while still scoping the problem, repositories are broken into subsystems of interrelated files (e.g., part of a front end or a specific API endpoint) [00:09:09]. The LLM is then fed only the files in the subsystem containing the “golden” PR or commit where the bug was introduced, providing a reduced list of files without hinting at the specific bug location [00:09:52].
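A simplified sketch of how such a reduced file list could be assembled, assuming the subsystem grouping already exists as a mapping from subsystem name to files; the exact tooling SM-100 uses is not described in the talk:

```python
import subprocess

def files_touched_by(repo_path: str, golden_commit: str) -> set[str]:
    # Files modified by the commit (or PR head) that introduced the bug.
    out = subprocess.run(
        ["git", "-C", repo_path, "show", "--name-only", "--pretty=format:", golden_commit],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line}

def subsystem_context(subsystems: dict[str, set[str]], touched: set[str]) -> set[str]:
    # Every file in any subsystem that overlaps the touched files, so the
    # agent sees related code without being pointed at the bug itself.
    context: set[str] = set()
    for files in subsystems.values():
        if files & touched:
            context |= files
    return context
```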
Performance and Challenges of AI Agents
Building a good agent for identifying bugs and working on code is challenging, requiring a combination of model, system, prompting, information feeding, and navigation strategy [00:11:10].
Benchmarking Results
- Needle in the Haystack: Bismouth led the pack, finding 10 out of 100 bugs, with the next best finding 7 [00:11:23]. This highlights significant room for growth in this area [00:11:35].
- True Positive Rate: Codex achieved 45%, Bismouth 25%, and Claude Code 16% [00:11:48]. These solutions showed tighter scoping, producing significantly less “random nonsense” than the others [00:11:57].
- Agents like Devin, Cursor Agent, and Cosign reported between 900 and 1,300 items, with a true positive rate of only 3-10% [00:12:07], indicating a high false positive rate.
- PR Review: Codex led, finding 27% of the “needle in the haystack” bugs when reviewing the introducing PR, followed by Devin at 21% and Bismouth at 17% [00:12:23]. Even the best result is only 27%, indicating a long way to go for both PR review and bug detection [00:12:44].
- Basic Agents (Simple Loop):
- Open-source models such as DeepSeek R1 and Llama 4 Maverick showed very low true positive rates (1% and 2% respectively) [00:13:28].
- Claude Sonnet 4 found 6 “needles” and o3 found 2, with true positive rates of 3% and 6% respectively [00:13:46].
- The highest score among popular agents other than Bismouth was 7% on SM-100 [00:13:59]. This implies that the most widely used agents are currently poor at finding and fixing complex bugs [00:14:09].
Common Agent Limitations
- Narrow Thinking: Agents exhibit narrow thinking, even when using “thinking models,” and do not go deep enough in their evaluations [00:15:40].
- Lack of Holistic Evaluation: Across runs the total number of bugs found stays roughly constant, but the specific bugs change, suggesting LLMs do not holistically inventory everything in a file [00:16:08].
- High False Positive Rates: Some agents produce an astounding number of false reports (e.g., 70 reports for one issue), which no human engineer would sift through [00:14:36].
- Missing Simple Bugs: Even simple bugs, such as a form state-clearing issue that impacts user experience, are often missed by most agents [00:14:59].
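As a hypothetical illustration of that class of bug (the actual benchmark case is not shown in the talk), a reset path that forgets one piece of state:

```python
class SignupForm:
    def __init__(self) -> None:
        self.email = ""
        self.password = ""
        self.accepted_terms = False

    def clear(self) -> None:
        # Bug: 'accepted_terms' is never reset, so the next user of the form
        # inherits the previous user's consent state; a simple, user-visible
        # state-clearing defect of the kind most agents reportedly missed.
        self.email = ""
        self.password = ""
```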
Future Directions
Despite relatively high scores on benchmarks like SWE-bench for upfront software creation, agents still struggle significantly with the software maintenance tasks evaluated by SM-100 [00:16:45]. This points to a major challenge in managing and fixing deployed software with AI [00:16:57].
Iterative improvement of evaluation processes and agents will require:
- Targeted search [00:17:09]
- Better program comprehension [00:17:13]
- Cross-file reasoning [00:17:14]
- Bug pattern recognition [00:17:15]
Newer agents, including Bismouth and Codex, are beginning to show promise, with tighter, narrower scoping of the bugs they report and the ability to find complex issues [00:17:23]. The stability of these newer agents is nascent but encouraging, showing an increased ability to reason through code and to use context effectively during evaluation [00:18:01]. Continued effort and different techniques can lead to significant improvements across the industry [00:18:11].
For more details, visit bismouth.sh [00:18:33].