From: aidotengineer
AI agents for software engineering have surged in popularity, raising questions about how effective they are at finding and fixing bugs and how reliable they are for maintenance tasks [00:00:00]. Bismouth, a company specializing in software agents, has been working in this area for over a year [00:00:12].
Limitations of Existing Benchmarks
While several benchmarks exist for evaluating Large Language Models (LLMs) at writing code (e.g., HumanEval, the Polyglot Benchmark, LiveCodeBench), these cover only a fraction of the software development life cycle (SDLC) [01:23:00]. Other critical phases, such as initial scoping, planning, code review, and deployment, are largely unbenchmarked [01:40:00].
A significant gap lies in software maintenance tasks, including bug fixes, dependency upgrades, and migrations [02:31:00]. Although these tasks involve writing code, they differ from feature development [02:40:00]. Finding bugs requires a deep understanding of the system's architecture and how its pieces connect, often deeper than what was needed when the feature was originally written [03:04:00].
Existing bug detection benchmarks, primarily from the software security space, were designed for classic static analysis or program repair tools and have significant limitations for agentic AI systems [04:01:00]:
- Simplistic Bugs: They focus on basic bugs like null pointer dereferences or buffer overflows, which can often be found statically [04:16:00].
- Limited Languages: Many benchmarks are Java-only, despite software being written in diverse languages [04:30:00].
- Security Bias: There is a bias towards security issues, neglecting other common bug types, such as copy-paste errors that break software for end users; the sketch after this list contrasts both kinds of bug [04:46:00].
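Both bug classes are easy to picture. The first kind is visible to a type checker or static analyzer; the second passes every automated check and only breaks behavior for the end user. A minimal sketch in Python, with invented function and field names:

```python
from typing import Optional

class User:
    def __init__(self, name: str):
        self.name = name

def find_user(users: dict[str, User], user_id: str) -> str:
    # Classic statically detectable bug: .get() may return None,
    # and a type checker such as mypy will flag the dereference below.
    user: Optional[User] = users.get(user_id)
    return user.name

def apply_discounts(order: dict) -> dict:
    # Copy-paste logic bug: the second line was duplicated from the first and
    # the discount field was never updated, so shipping is reduced by the
    # *item* discount. Nothing here is statically suspicious; finding it
    # requires understanding what the code is supposed to do.
    order["item_total"] *= 1 - order["item_discount"]
    order["shipping_total"] *= 1 - order["item_discount"]  # should be "shipping_discount"
    return order
```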
The SM-100 Benchmark
To address these limitations, Bismouth developed the SM-100 benchmark over several months [01:15:00].
Benchmark Development
The SM-100 benchmark involves a painstakingly gathered set of 100 triaged, validated, and classified bugs from over 84 public repositories [05:14:00]. These are real-world bugs that have already been remediated [05:19:00].
- Range of Difficulty: The bugs span various issue types, from those requiring low domain-specific knowledge to those needing senior staff-level engineering understanding [05:28:00].
- Multi-language: The benchmark includes Python, TypeScript, JavaScript, and Go, acknowledging that LLMs perform differently across languages [05:39:00].
- Objective Bugs: The benchmark focuses on “objective bugs”—explicit security or logical issues that could cause data loss or system crashes [06:05:00]. This excludes ambiguous issues like feature requests, optimizations, style formatting, or design decisions [06:41:00].
- Metadata: Each bug is annotated with metadata such as severity, context, required domain knowledge, difficulty, and implication (e.g., data loss, crash, security exploit) [07:01:00]. These classifications help establish which levels of bugs AI agents can regularly find (an illustrative annotation schema is sketched after this list) [07:25:00].
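For concreteness, here is a hedged sketch of what such a per-bug annotation could look like; the field names and values are illustrative assumptions, not SM-100's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Implication(Enum):
    # Illustrative categories drawn from the description above.
    DATA_LOSS = "data_loss"
    CRASH = "crash"
    SECURITY_EXPLOIT = "security_exploit"

@dataclass
class BugAnnotation:
    repo: str                 # public repository the bug came from
    language: str             # "python", "typescript", "javascript", or "go"
    severity: int             # e.g. 1 (minor) .. 5 (critical)
    context_required: str     # how much surrounding code is needed to spot it
    domain_knowledge: str     # from "low" up to "senior/staff-level"
    difficulty: int           # relative difficulty of discovery
    implication: Implication  # what goes wrong if the bug ships

example = BugAnnotation(
    repo="example/public-repo",
    language="python",
    severity=4,
    context_required="cross-file",
    domain_knowledge="low",
    difficulty=3,
    implication=Implication.DATA_LOSS,
)
```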
Benchmark Metrics
The SM-100 benchmark measures four key aspects of an agent’s performance [07:51:00]:
- Needle in the Haystack (Discovery): Can the system discover bugs without prior knowledge [07:55:00]? To make this feasible without hinting at the bug's location, repositories are broken into subsystems, and only the files within the relevant subsystem are fed to the LLM (a simplified sketch of this setup follows the list) [09:36:00].
- False Positive Rate: The benchmark manually measures the false positive rate of reported bugs to assess overall effectiveness [08:10:00].
- Time of Introduction: Can the system find a bug at the time it’s introduced, given the pull request or commit [08:24:00]?
- Remediation Suggestion: For each discovered bug, the agent is asked to fix it, and the fix is evaluated to ensure it resolves the bug without breaking other code [08:51:00].
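A simplified sketch of how the subsystem-scoped discovery setup could work, assuming a crude directory-based grouping and a generic call_llm stand-in; this is not Bismouth's actual harness:

```python
from pathlib import Path

def split_into_subsystems(repo_root: str) -> dict[str, list[Path]]:
    # Group source files by top-level directory as a stand-in for real
    # subsystem boundaries.
    root = Path(repo_root)
    subsystems: dict[str, list[Path]] = {}
    for path in root.rglob("*.py"):
        key = path.relative_to(root).parts[0]
        subsystems.setdefault(key, []).append(path)
    return subsystems

def build_prompt(files: list[Path]) -> str:
    # Present every file in the subsystem, so no single file is hinted at.
    blobs = [f"# File: {f}\n{f.read_text(errors='ignore')}" for f in files]
    return (
        "Review the following files and list any objective bugs "
        "(logic errors, crashes, data loss, security issues):\n\n"
        + "\n\n".join(blobs)
    )

def scan_repository(repo_root: str, call_llm) -> dict[str, str]:
    # call_llm is whatever agent or model API is under evaluation.
    return {
        name: call_llm(build_prompt(files))
        for name, files in split_into_subsystems(repo_root).items()
    }
```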
Challenges in Building Effective AI Agents for Bug Fixing
Building a high-performing agent for identifying bugs and working on code is challenging, requiring a sophisticated combination of [11:10:00]:
- Model selection
- System design
- Prompting strategies
- Information management and feeding to the model
- Navigation strategy
Basic agent implementations, often simple loops with shell and reporting tools, tend to have an extremely high false positive rate (e.g., 97%) [10:53:00]. This means they are not currently suitable for real-world bug finding and triaging [11:03:00].
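For contrast, here is a minimal sketch of such a "simple loop with shell and reporting tools"; the action format and the call_llm parameter are assumptions for illustration, not any specific product's design. The key point is that nothing validates a finding before it is reported, which is how false positive rates climb toward 97%:

```python
import subprocess

def run_shell(command: str) -> str:
    # Give the model raw shell access and hand back whatever comes out.
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def naive_bug_hunt(call_llm, repo_path: str, max_steps: int = 20) -> list[str]:
    reports: list[str] = []
    history = [f"You are auditing the repository at {repo_path} for bugs."]
    for _ in range(max_steps):
        action = call_llm("\n".join(history))       # e.g. "SHELL: grep -rn TODO ."
        if action.startswith("SHELL:"):
            history.append(run_shell(action.removeprefix("SHELL:").strip()))
        elif action.startswith("REPORT:"):
            # Every claim goes straight into the report list, unverified.
            reports.append(action.removeprefix("REPORT:").strip())
        else:
            break
    return reports
```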
Performance Observations
Performance comparisons on the SM-100 benchmark highlight significant room for improvement [11:35:00]:
- Needle in a Haystack: Bismouth led by finding 10 out of 100 bugs, with the next best finding 7 [11:29:00].
- True Positive Rate: Codex showed a true positive rate of 45%, Bismouth 25%, and Claude Code 16% [11:48:00]. These three agents also produced significantly less noise (“random nonsense”) in their reports than the rest of the field [11:57:00].
- High False Reports: Agents like Devin, Cursor Agent, and Cosign reported between 900 and 1,300 items, with true positive rates between 3% and 10% [12:07:00]. One agent generated an astounding 70 reports for a single issue, a volume no human engineer can realistically triage [14:36:00].
- PR Review: Codex was strongest at 27%, followed by Devin at 21% and Bismouth at 17% [12:23:00]. Even the best agent found only about a third of the bugs introduced in pull requests [12:44:00].
- Open-Source Models: Open-source models (R1 and Llama Maverick) showed very low true positive rates (1-2%) [13:28:00].
- Complex Bugs: The most widely used agents scored only 7% on SM-100, indicating they are poor at finding and fixing complex bugs [14:01:00]. Even simple bugs, such as a state issue that prevented a form from clearing, are missed by most agents (an invented example of this class of bug follows the list) [14:59:00].
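An invented example of that class of state bug (not the actual SM-100 case) shows how small the miss can be:

```python
class ContactForm:
    def __init__(self):
        self.fields = {"name": "", "email": "", "message": ""}

    def submit(self, send):
        send(dict(self.fields))
        # Bug: the form state is never reset after a successful submit, so the
        # next render still shows the previous input. The one-line fix would be:
        # self.fields = {key: "" for key in self.fields}
```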
Common Agent Shortcomings
A notable commonality among agents is narrow thinking, even when “thinking” models are used [15:40:00]: they explore only a limited number of avenues and do not go deep enough [15:48:00].
- Lack of Holistic Evaluation: From run to run, the total number of bugs found stays roughly consistent, but the specific bugs reported change, indicating that LLMs are not holistically evaluating a file and inventorying every issue in it [16:10:00]. This points to inherent biases in how models process context [16:26:00].
- Maintenance vs. Feature Creation: Despite high scores on benchmarks for upfront software creation (such as SWE-bench), agents struggle significantly with managing and fixing software after deployment [16:50:00]. Doing so requires targeted search, deeper program comprehension, cross-file reasoning, and bug-pattern recognition, capabilities that current agents lack in depth [17:09:00].
Conclusion and Future Outlook
Currently, the most frequently used AI agents carry a high risk of introducing new bugs [17:44:00]. However, a newer generation of agents, including Bismouth and Codex, is starting to show an increased ability to reason through code and use context effectively to evaluate concerns [17:50:00]. While their stability is nascent, the progress is encouraging [18:01:00].
The SM-100 benchmark provides a clear way to measure progress in this domain [18:05:00]. Targeted effort and different techniques can lead to significant improvements that will benefit the entire industry, particularly in software maintenance [18:11:00].