From: aidotengineer
AI agents for software engineering have surged in popularity, raising questions about how effective they are at finding and fixing bugs and how reliable they are for maintenance tasks [00:00:00]. Bismouth, a company specializing in software agents, has been working in this area for over a year [00:00:12].
Limitations of Existing Benchmarks
While several benchmarks exist for evaluating Large Language Models (LLMs) at writing code (e.g., HumanEval, the Polyglot Benchmark, LiveCodeBench), these cover only a fraction of the software development life cycle (SDLC) [01:23:00]. Other critical phases, such as initial scoping, planning, code review, and deployment, are largely unbenchmarked [01:40:00].
A significant gap lies in software maintenance tasks, including bug fixes, dependency upgrades, and migrations [02:31:00]. Although these tasks involve writing code, they differ from feature development [02:40:00]. Finding bugs requires a deep understanding of the system's architecture and how its pieces connect, often deeper than what was needed when the feature was originally written [03:04:00].
Existing bug detection benchmarks, primarily from the software security space, were designed for classic static analysis or program repair tools and have significant limitations for agentic AI systems [04:01:00]:
- Simplistic Bugs: They focus on basic bugs like null pointer dereferences or buffer overflows, which can often be found statically [04:16:00].
- Limited Languages: Many benchmarks are Java-only, despite software being written in diverse languages [04:30:00].
- Security Bias: There is a bias towards security issues, neglecting other common bug types, such as copy-paste errors that break software for end users; the sketch after this list contrasts both kinds of bug [04:46:00].
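Both bug classes are easy to picture. The first kind is visible to a type checker or static analyzer; the second passes every automated check and only breaks behavior for the end user. A minimal sketch in Python, with invented function and field names:

```python
from typing import Optional

class User:
    def __init__(self, name: str):
        self.name = name

def find_user(users: dict[str, User], user_id: str) -> str:
    # Classic statically detectable bug: .get() may return None,
    # and a type checker such as mypy will flag the dereference below.
    user: Optional[User] = users.get(user_id)
    return user.name

def apply_discounts(order: dict) -> dict:
    # Copy-paste logic bug: the second line was duplicated from the first and
    # the discount field was never updated, so shipping is reduced by the
    # *item* discount. Nothing here is statically suspicious; finding it
    # requires understanding what the code is supposed to do.
    order["item_total"] *= 1 - order["item_discount"]
    order["shipping_total"] *= 1 - order["item_discount"]  # should be "shipping_discount"
    return order
```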
The SM-100 Benchmark
To address these limitations, Bismouth developed the SM-100 benchmark over several months [01:15:00].
Benchmark Development
The SM-100 benchmark involves a painstakingly gathered set of 100 triaged, validated, and classified bugs from over 84 public repositories [05:14:00]. These are real-world bugs that have already been remediated [05:19:00].
- Range of Difficulty: The bugs span various issue types, from those requiring low domain-specific knowledge to those needing senior staff-level engineering understanding [05:28:00].
- Multi-language: The benchmark includes Python, TypeScript, JavaScript, and Go, acknowledging that LLMs perform differently across languages [05:39:00].
- Objective Bugs: The benchmark focuses on “objective bugs”—explicit security or logical issues that could cause data loss or system crashes [06:05:00]. This excludes ambiguous issues like feature requests, optimizations, style formatting, or design decisions [06:41:00].
- Metadata: Each bug is annotated with metadata such as severity, context, required domain knowledge, difficulty, and implication (e.g., data loss, crash, security exploit) [07:01:00]. These classifications help establish which levels of bugs AI agents can regularly find (an illustrative annotation schema is sketched after this list) [07:25:00].
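For concreteness, here is a hedged sketch of what such a per-bug annotation could look like; the field names and values are illustrative assumptions, not SM-100's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Implication(Enum):
    # Illustrative categories drawn from the description above.
    DATA_LOSS = "data_loss"
    CRASH = "crash"
    SECURITY_EXPLOIT = "security_exploit"

@dataclass
class BugAnnotation:
    repo: str                 # public repository the bug came from
    language: str             # "python", "typescript", "javascript", or "go"
    severity: int             # e.g. 1 (minor) .. 5 (critical)
    context_required: str     # how much surrounding code is needed to spot it
    domain_knowledge: str     # from "low" up to "senior/staff-level"
    difficulty: int           # relative difficulty of discovery
    implication: Implication  # what goes wrong if the bug ships

example = BugAnnotation(
    repo="example/public-repo",
    language="python",
    severity=4,
    context_required="cross-file",
    domain_knowledge="low",
    difficulty=3,
    implication=Implication.DATA_LOSS,
)
```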
Benchmark Metrics
The SM-100 benchmark measures four key aspects of an agent’s performance [07:51:00]:
- Needle in the Haystack (Discovery): Can the system discover bugs without prior knowledge [07:55:00]? To make this feasible without hinting at the bug's location, repositories are broken into subsystems, and only the files within the relevant subsystem are fed to the LLM (a simplified sketch of this setup follows the list) [09:36:00].
- False Positive Rate: The benchmark manually measures the false positive rate of reported bugs to assess overall effectiveness [08:10:00].
- Time of Introduction: Can the system find a bug at the time it’s introduced, given the pull request or commit [08:24:00]?
- Remediation Suggestion: For each discovered bug, the agent is asked to fix it, and the fix is evaluated to ensure it resolves the bug without breaking other code [08:51:00].
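A simplified sketch of how the subsystem-scoped discovery setup could work, assuming a crude directory-based grouping and a generic call_llm stand-in; this is not Bismouth's actual harness:

```python
from pathlib import Path

def split_into_subsystems(repo_root: str) -> dict[str, list[Path]]:
    # Group source files by top-level directory as a stand-in for real
    # subsystem boundaries.
    root = Path(repo_root)
    subsystems: dict[str, list[Path]] = {}
    for path in root.rglob("*.py"):
        key = path.relative_to(root).parts[0]
        subsystems.setdefault(key, []).append(path)
    return subsystems

def build_prompt(files: list[Path]) -> str:
    # Present every file in the subsystem, so no single file is hinted at.
    blobs = [f"# File: {f}\n{f.read_text(errors='ignore')}" for f in files]
    return (
        "Review the following files and list any objective bugs "
        "(logic errors, crashes, data loss, security issues):\n\n"
        + "\n\n".join(blobs)
    )

def scan_repository(repo_root: str, call_llm) -> dict[str, str]:
    # call_llm is whatever agent or model API is under evaluation.
    return {
        name: call_llm(build_prompt(files))
        for name, files in split_into_subsystems(repo_root).items()
    }
```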
Challenges in Building Effective AI Agents for Bug Fixing
Building a high-performing agent for identifying bugs and working on code is challenging, requiring a sophisticated combination of [11:10:00]:
- Model selection
- System design
- Prompting strategies
- Information management and feeding to the model
- Navigation strategy
Basic agent implementations, often simple loops with shell and reporting tools, tend to have an extremely high false positive rate (e.g., 97%) [10:53:00]. This means they are not currently suitable for real-world bug finding and triaging [11:03:00].
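For contrast, here is a minimal sketch of such a "simple loop with shell and reporting tools"; the action format and the call_llm parameter are assumptions for illustration, not any specific product's design. The key point is that nothing validates a finding before it is reported, which is how false positive rates climb toward 97%:

```python
import subprocess

def run_shell(command: str) -> str:
    # Give the model raw shell access and hand back whatever comes out.
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def naive_bug_hunt(call_llm, repo_path: str, max_steps: int = 20) -> list[str]:
    reports: list[str] = []
    history = [f"You are auditing the repository at {repo_path} for bugs."]
    for _ in range(max_steps):
        action = call_llm("\n".join(history))       # e.g. "SHELL: grep -rn TODO ."
        if action.startswith("SHELL:"):
            history.append(run_shell(action.removeprefix("SHELL:").strip()))
        elif action.startswith("REPORT:"):
            # Every claim goes straight into the report list, unverified.
            reports.append(action.removeprefix("REPORT:").strip())
        else:
            break
    return reports
```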
Performance Observations
Performance comparisons on the SM-100 benchmark highlight significant room for improvement [11:35:00]:
- Needle in a Haystack: Bismouth led by finding 10 out of 100 bugs, with the next best finding 7 [11:29:00].
- True Positive Rate: Codex showed a true positive rate of 45%, Bismouth 25%, and Claude Code 16% [11:48:00]. These three agents also produced significantly less noise (“random nonsense”) in their reports than the rest of the field [11:57:00].
- High False Reports: Agents like Devin, Cursor Agent, and Cosign reported between 900 and 1,300 items, with true positive rates between 3% and 10% [12:07:00]. One agent generated an astounding 70 reports for a single issue, a volume no human engineer can realistically triage [14:36:00].
- PR Review: Codex was strongest at 27%, followed by Devin at 21% and Bismouth at 17% [12:23:00]. Even the best agent found only about a third of the bugs introduced in pull requests [12:44:00].
- Open-Source Models: Open-source models (R1 and Llama Maverick) showed very low true positive rates (1-2%) [13:28:00].
- Complex Bugs: The most widely used agents scored only 7% on SM-100, indicating they are poor at finding and fixing complex bugs [14:01:00]. Even simple bugs, such as a state issue that prevented a form from clearing, are missed by most agents (an invented example of this class of bug follows the list) [14:59:00].
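An invented example of that class of state bug (not the actual SM-100 case) shows how small the miss can be:

```python
class ContactForm:
    def __init__(self):
        self.fields = {"name": "", "email": "", "message": ""}

    def submit(self, send):
        send(dict(self.fields))
        # Bug: the form state is never reset after a successful submit, so the
        # next render still shows the previous input. The one-line fix would be:
        # self.fields = {key: "" for key in self.fields}
```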
Common Agent Shortcomings
A notable commonality among agents is narrow thinking, even when “thinking” models are used [15:40:00]: they explore only a limited number of avenues and do not go deep enough [15:48:00].
- Lack of Holistic Evaluation: From run to run, the total number of bugs found stays roughly consistent, but the specific bugs reported change, indicating that LLMs are not holistically evaluating a file and inventorying every issue in it [16:10:00]. This points to inherent biases in how models process context [16:26:00].
- Maintenance vs. Feature Creation: Despite high scores on benchmarks for upfront software creation (such as SWE-bench), agents struggle significantly with managing and fixing software after deployment [16:50:00]. Doing so requires targeted search, deeper program comprehension, cross-file reasoning, and bug-pattern recognition, capabilities that current agents lack in depth [17:09:00].
Conclusion and Future Outlook
Currently, the most frequently used AI agents carry a high risk of introducing new bugs [17:44:00]. However, a newer generation of agents, including Bismouth and Codex, is starting to show an increased ability to reason through code and use context effectively to evaluate concerns [17:50:00]. While their stability is nascent, the progress is encouraging [18:01:00].
The SM-100 benchmark provides a clear way to measure progress in this domain [18:05:00]. Targeted effort and different techniques can lead to significant improvements that will benefit the entire industry, particularly in software maintenance [18:11:00].