From: aidotengineer

AI agents for software engineering have seen an explosion in popularity over the last year, leading to questions about their efficacy in finding and fixing bugs, and their reliability for software maintenance [00:00:00]. Bismouth, a company specializing in software agents, has been working for over a year to address these questions [00:12:00].

Limitations of Existing Benchmarks

While there are existing benchmarks for evaluating large language models (LLMs) on code writing, such as HumanEval, the Polyglot benchmark, and LiveCodeBench, these only cover a single part of the software development life cycle (SDLC) [01:23:00].

Significant portions of the SDLC remain unbenchmarked by existing work:

  • Initial Scoping and Planning: Requires broad business context, knowledge of existing systems, and exploration of possible solutions, making it distinct from development [01:40:00].
  • Code Review: Largely unbenchmarked, despite the rise of LLM-based tools in this space [02:04:00].
  • Deployment: A separate task involving configuration, monitoring setup, and integration with existing systems [02:17:00].
  • Software Maintenance Tasks: Including bug fixes, dependency upgrades, and migrations [02:29:00]. This work, while it still involves writing code, differs significantly from feature development [02:39:00].

Software maintenance demands the ability to deeply reason through a codebase to identify bugs [02:49:00]. This requires understanding system architecture and its connectedness, often more deeply than when the feature was originally written [03:04:00]. LLM agents often struggle with holistic evaluation, finding only subsets of bugs per run [03:21:00]. Their reasoning can be narrow, exploring a limited number of avenues at a time, leading to missed bugs that humans would immediately detect and confirming bugs that humans would discard [03:27:00]. While patching simple bugs is relatively easy for agents, the overall capability is still in its infancy [03:47:00].

Existing bug detection benchmarks, often from the software security space, have major limitations for evaluating modern agentic AI systems:

  • Simplistic Bugs: They focus on common, statically detectable patterns like null pointer dereferences or buffer overflows [04:16:00].
  • Limited Languages: Many are Java-only, reflecting historical enterprise software trends [04:29:00].
  • Security Bias: They are biased toward security issues, overlooking other common bug types, such as copy-paste errors that break software for users (contrasted in the sketch after this list) [04:46:00].
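To make the distinction concrete, the two Python snippets below are illustrative only (they are not drawn from SM-100): the first is the kind of statically detectable issue existing benchmarks emphasize, the second a copy-paste error that is syntactically fine but silently breaks behavior for users.

```python
# Illustrative only; neither snippet comes from SM-100.

# (a) A bug a classic static analyzer or type checker flags:
#     accessing an attribute of a possibly-missing (None) value.
def get_display_name(user: dict) -> str:
    profile = user.get("profile")      # may be None
    return profile["name"].upper()     # 'NoneType' is not subscriptable if profile is missing

# (b) A copy-paste bug most static tools miss: the code is valid,
#     but the second line reuses the wrong variable, so every image
#     comes out square regardless of its original aspect ratio.
def scale_image(width: int, height: int, factor: float) -> tuple[int, int]:
    new_width = int(width * factor)
    new_height = int(width * factor)   # should be `height` -- classic copy-paste slip
    return new_width, new_height
```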

Bismouth’s SM-100 Benchmark

To address these gaps, Bismouth developed a benchmark over several months to explore the capabilities of software agents in coding tasks beyond normal feature development [01:13:00].

Data Collection and Characteristics

The SM-100 benchmark painstakingly gathered 100 triaged, validated, and classified bugs from over 84 public repositories [05:11:00]. These bugs were real-world issues already remediated in open-source projects [05:18:00]. The goal was to provide a range of issue types, from those requiring minimal domain knowledge to those needing senior staff-level engineering expertise and significant project understanding [05:28:00].

The benchmark is multi-language, focusing on:

  • Python, TypeScript, JavaScript: Selected for their popularity and supposedly stronger LLM performance [05:40:00].
  • Go: Chosen as a control to balance the set with a lower-level systems-engineering language [05:54:00].

Objective Bugs

The SM-100 focuses exclusively on “objective” bugs: explicit security or logical issues that could cause data loss or system crashes [06:05:00]. This definition removes ambiguity and excludes harmless or debatable issues, such as a function that does not check bounds when its calling code already guarantees the input never exceeds them [06:11:00].

This strict definition helps ensure reproducibility across evaluations [06:58:00]. Each bug is annotated with metadata, including severity, context, required system-specific domain knowledge, difficulty to find, and the bug’s implication (data loss, crash, security exploit) [07:01:00]. These classifications help understand which levels of bugs AI agents can regularly find [07:25:00]. While agents might occasionally surprise, like finding a zero-day exploit after 100 runs, the benchmark aims to evaluate their everyday utility [07:32:00].
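The talk does not publish the benchmark’s schema, but a minimal sketch of how one annotated SM-100 entry could be represented, using only the metadata fields named above, might look like the following; every field name here is an assumption.

```python
from dataclasses import dataclass
from enum import Enum

class Implication(Enum):
    DATA_LOSS = "data_loss"
    CRASH = "crash"
    SECURITY_EXPLOIT = "security_exploit"

@dataclass
class SM100Bug:
    """Hypothetical shape of one annotated SM-100 entry (field names assumed)."""
    repo: str                   # public repository the bug came from
    golden_commit: str          # PR/commit in which the bug was introduced
    files: list[str]            # files touched by that commit
    severity: int               # e.g. 1 (low) to 5 (critical)
    context_required: str       # how much surrounding code must be understood
    domain_knowledge: str       # system-specific knowledge needed to spot it
    difficulty_to_find: int     # annotator-rated difficulty
    implication: Implication    # data loss, crash, or security exploit
```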

Evaluation Metrics

For each system benchmarked, four key metrics are extracted:

  1. Needle in the Haystack: Can the system discover the bugs without any prior knowledge of their location? [07:55:00] To avoid biasing agents or making the search space unmanageably large for big repositories, Bismouth breaks each repository down into interrelated subsystems (e.g., a front end or a specific API endpoint) [09:36:00]. Only files within those subsystems that were modified in the “golden PR commit” (where the bug was introduced) are fed to the LLM, reducing the search space without revealing the bug’s exact location (a sketch of this filtering appears after this list) [09:54:00].
  2. False Positive Rate: The rate of irrelevant bugs reported by the agent, manually measured to assess overall effectiveness [08:08:00].
  3. Time of Introduction Detection: Can the system identify the bug at the time it was introduced, given the relevant pull request or commit? This provides an optimistic starting point with immediate context [08:24:00].
  4. Remediation Suggestion: For each discovered bug, can the agent suggest a fix that resolves the issue without introducing new problems or breaking the codebase? [08:51:00]
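As a rough illustration of how these metrics fit together, the sketch below combines the subsystem filtering from metric 1 with per-run scoring for metrics 1 and 2. The function names and the file-and-line matching heuristic are assumptions, not Bismouth’s actual harness; in practice, matching and false-positive review are done manually.

```python
# Hypothetical scoring harness; names and matching heuristic are assumptions.

def candidate_files(subsystem_files: set[str], golden_commit_files: set[str]) -> set[str]:
    """Needle-in-the-haystack setup: only subsystem files that the golden
    PR/commit touched are shown to the agent, shrinking the search space
    without revealing the bug's exact location."""
    return subsystem_files & golden_commit_files

def score_run(reported: list[dict], known_bug: dict) -> dict:
    """Compare one agent run's bug reports against the single known SM-100 bug."""
    def matches(report: dict) -> bool:
        # Real matching is manual; approximate here with file + nearby-line overlap.
        return (report["file"] == known_bug["file"]
                and abs(report["line"] - known_bug["line"]) <= 5)

    true_positives = [r for r in reported if matches(r)]
    false_positives = [r for r in reported if not matches(r)]
    return {
        "needle_found": bool(true_positives),
        "true_positive_rate": len(true_positives) / len(reported) if reported else 0.0,
        "false_positive_count": len(false_positives),
    }
```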

Key Findings and Challenges

Building a robust agent for identifying bugs and working with code is challenging, requiring a combination of model choice, system design, prompting strategies, information feeding mechanisms, and navigation strategies [11:10:00]. Basic agent implementations using simple loops and tools (like shell, search-and-replace, think, report, finish) can find some bugs but often suffer from extremely high false positive rates, making them impractical for real-world use [10:37:00]. For example, a basic loop agent might find 5-6 bugs but have a 97% false positive rate [10:50:00].
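The “basic loop” agent described above can be pictured as a plain tool-use loop. The sketch below assumes a generic call_llm stub that picks the next tool and an execute_tool callback; it is not the implementation benchmarked here, just the pattern being described.

```python
# Minimal sketch of a "basic loop" bug-finding agent. call_llm and execute_tool
# are caller-supplied stubs; this is not Bismouth's system.

TOOLS = ["shell", "search_replace", "think", "report_bug", "finish"]

def run_basic_agent(task: str, call_llm, execute_tool, max_steps: int = 50) -> list[dict]:
    messages = [{"role": "user", "content": task}]
    reports: list[dict] = []

    for _ in range(max_steps):
        tool, args = call_llm(messages, TOOLS)      # model picks the next tool and arguments

        if tool == "finish":
            break
        if tool == "report_bug":
            reports.append(args)                    # e.g. {"file": ..., "line": ..., "description": ...}
            observation = "bug recorded"
        elif tool == "think":
            observation = "ok"                      # free-form scratchpad, no side effects
        else:
            observation = execute_tool(tool, args)  # run shell / search-and-replace

        messages.append({"role": "assistant", "content": f"{tool}({args})"})
        messages.append({"role": "user", "content": observation})

    # Without filtering or verification of reports, this list tends to be
    # dominated by false positives, matching the ~97% rate noted above.
    return reports
```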

Performance Comparisons

Bismouth’s benchmark results indicate:

  • Needle in the Haystack: Bismouth led the pack, finding 10 out of 100 “needles,” with the next best solution finding 7 [11:24:00]. This highlights significant room for growth in this area [11:35:00].
  • True Positive Rate: Codex achieved the highest true positive rate at 45%, followed by Bismouth at 25% and Claude Code at 16% [11:47:00]. These solutions also generated significantly less “random nonsense” than the others [11:57:00].
  • PR Review (Time of Introduction Detection): Codex was strongest at 27%, followed by Devin at 21% and Bismouth at 17% [12:23:00]. Even the best system found only about a third of the bugs in this context, indicating a long way to go [12:43:00].

Bismouth’s solution, while model-agnostic, typically runs on Anthropic models, and it outperformed Anthropic’s own Claude Code in several categories [12:56:00].

Performance of Basic and Open-Source Agents

Open-source models generally have a long way to go in this space [13:28:00]:

  • R1: 1% true positive rate [13:32:00].
  • Llama 4 Maverick: 2% true positive rate [13:40:00].
  • Sonnet 4 (basic loop): 6 needles found, 3% true positive rate [13:48:00].
  • O3 (basic loop): 2 needles found, 6% true positive rate [13:50:00].

The highest score among popular agents outside of Bismouth was 7% on SM-100 [14:01:00]. This means the most-used agents struggle significantly with finding and fixing complex bugs, work that represents 90% of software engineering [14:08:00]. Three of the six agents had true-positive rates of 10% or lower despite producing a massive number of reports [14:27:00]; one agent, for instance, generated 70 reports for a single issue, far more than any human engineer would sift through [14:35:00].

Even simple bugs, such as a form failing to clear because an isDirty flag was never reset, are missed by most agents. Only Bismouth and Codex found this specific state issue, highlighting a gap in catching real-world user-experience bugs that human developers would notice immediately [14:59:00].
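The talk does not show the offending code, but a minimal Python reconstruction of this kind of state bug (all names hypothetical) could look like this:

```python
# Hypothetical reconstruction of the "form does not clear" state bug;
# names and structure are invented for illustration.

class FormState:
    def __init__(self):
        self.fields: dict[str, str] = {}
        self.is_dirty = False          # tracks unsaved edits

    def update(self, name: str, value: str) -> None:
        self.fields[name] = value
        self.is_dirty = True

    def clear(self) -> None:
        self.fields = {}
        # BUG: is_dirty is never reset here, so the UI still treats the form
        # as having unsaved changes and never appears to clear.
        # Fix: self.is_dirty = False
```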

Common Agent Weaknesses

A notable commonality across agents is their narrow thinking, even when employing thinking models [15:40:00]. They evaluate a limited range of issues and don’t go deep enough [15:48:00]. Broader and deeper thinking chains are needed to effectively find bugs [15:56:00]. Interestingly, the total number of bugs found per run remains consistent, but the specific bugs change, suggesting LLMs are not holistically inventorying issues within a file due to biases in context [16:08:00].

Despite scoring 60-80% on benchmarks like SWE-bench, which center on writing new code, agents still struggle with SM-100 [16:45:00]. This implies that while existing agents can create software upfront, managing and fixing deployed software remains a major challenge [16:55:00]. Overcoming this requires targeted search, better program comprehension, cross-file reasoning, and bug pattern recognition that many solutions currently lack [17:09:00].

Conclusion

The most frequently used agents today carry a high risk of introducing bugs [17:44:00]. However, newer agents, including Bismouth’s, are beginning to demonstrate an increased ability to reason through code and more effectively use context to identify concerns [17:49:00]. While the stability of these systems is nascent, the progress is encouraging [18:01:00]. The SM-100 benchmark provides a clear path to measure and drive improvements in this critical area, which will benefit the entire industry [18:05:00].