From: aidotengineer
AI agents for software engineering have gained significant popularity, but their effectiveness in finding and fixing bugs, and their reliability for software maintenance, remain crucial open questions. Bismouth, a company specializing in software agents, has been investigating these challenges through its own benchmark, SM-100.
Current Scope of AI Benchmarks
Existing benchmarks primarily measure the effectiveness of Large Language Models (LLMs) for writing new code, focusing on feature development and testing. However, the software development life cycle (SDLC) encompasses much more, including:
- Initial scoping and planning: Requires broader business context, knowledge of existing systems, and exploration of solutions, which is a distinct task from development.
- Code review: Largely unbenchmarked, despite the rise of LLM-based tools in this area.
- Deployment: A separate task involving configuration, monitoring, and integration with existing systems.
- Software maintenance: Includes bug fixes, dependency upgrades, and migrations. While still involving code, it requires a different approach than feature development.
Challenges in Bug Detection and Maintenance
Maintenance tasks, particularly bug finding, demand a deep understanding of a codebase and system architecture. AI agents exhibit several key limitations in this area:
Narrow and Insufficient Reasoning
- Agents struggle with holistic evaluation of files and systems, typically finding only a subset of bugs per run.
- Even with “thinking models,” reasoning tends to be narrow, exploring only a limited number of potential avenues at a time. This can lead LLMs to miss bugs a human developer would identify immediately, or to confirm false positives.
- Agents can often patch the bugs they do find, but largely because those bugs are relatively simple; this success does not indicate capability in complex bug scenarios.
- A pervasive problem across agents is that they do not look holistically at a file and inventory everything happening within it, and they show different biases from run to run, as sketched in the example below.
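The source does not include code, but a small invented example helps show what a holistic inventory demands: a single file can contain several unrelated defects, and an agent that explores only one avenue per run will surface only some of them. The file name, functions, and bug placements below are hypothetical.

```typescript
// userStats.ts: a hypothetical file with several unrelated defects. An agent
// that pursues only one line of reasoning per run tends to report just one.

interface User {
  name: string;
  lastLogin?: Date;
}

export function averageSessionLength(sessions: number[]): number {
  // Defect 1: an empty array yields NaN (0 / 0), which silently propagates
  // into downstream reports instead of failing loudly.
  const total = sessions.reduce((sum, s) => sum + s, 0);
  return total / sessions.length;
}

export function daysSinceLogin(user: User): number {
  // Defect 2: lastLogin is optional; the non-null assertion crashes at
  // runtime for users who have never logged in.
  const elapsedMs = Date.now() - user.lastLogin!.getTime();
  return Math.floor(elapsedMs / (1000 * 60 * 60 * 24));
}

export function topUsers(users: User[], n: number): User[] {
  // Defect 3: Array.prototype.sort mutates its argument, so callers see their
  // original list silently reordered, a behavioral bug that simple pattern
  // matching rarely flags but a careful reviewer often would.
  return users.sort((a, b) => a.name.localeCompare(b.name)).slice(0, n);
}
```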
Limitations of Existing Benchmarks for AI Agents
Traditional bug detection benchmarks, often from the software security space, are not well suited to evaluating new agentic AI systems:
- Simplistic Bugs: They focus on relatively simple bugs in common patterns (e.g., null pointer dereferences, buffer overflows, SQL injection) that can be found statically.
- Limited Languages: Many benchmarks are restricted to a single language, such as Java.
- Security Bias: There is a bias toward security issues, overlooking other common bug types, such as copy-paste errors that break software for end users; the contrast is illustrated below.
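To make the contrast concrete, here is an invented TypeScript example (not drawn from any benchmark): the first function contains the kind of statically detectable injection bug that security-oriented benchmarks emphasize, while the second contains a copy-paste error that no pattern-based scanner flags, yet it clearly breaks the feature for end users.

```typescript
// (a) Classic statically detectable issue: string-concatenated SQL, the kind
// of pattern security-focused benchmarks are built around.
function findUser(db: { query: (sql: string) => Promise<unknown> }, name: string) {
  return db.query(`SELECT * FROM users WHERE name = '${name}'`); // SQL injection
}

// (b) Copy-paste error: the "max" branch was copied from the "min" branch and
// the comparison was never flipped. There is no security impact and nothing
// for a static pattern matcher to flag, but in-range prices get bumped up to
// `max` and prices above `max` pass through unclamped.
function clampPrice(price: number, min: number, max: number): number {
  if (price < min) return min;
  if (price < max) return max; // should be `price > max`
  return price;
}
```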
High False Positive Rates
Basic implementations of AI agents, while trivial to set up and capable of finding some bugs, exhibit extremely high false positive rates.
- Initial tests showed a 97% false positive rate.
- Some agents, such as R1 and Llama Maverick, had true positive rates of only 1% and 2%, respectively.
- Three of the six agents tested achieved true positive rates of 10% or less across a massive number of reports.
- One agent generated an astounding 70 reports for a single issue, which is impractical for human engineers to sift through. This highlights the need to tighten both the volume and the accuracy of the information agents report; the arithmetic below shows why.
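As a back-of-the-envelope illustration (only the quoted rates come from the source; the underlying report counts are not given), the reviewer burden implied by these rates can be computed directly:

```typescript
// How many reports a human must read, on average, to find one real bug,
// given the fraction of reports that are true positives.
function reportsPerRealBug(truePositiveRate: number): number {
  return 1 / truePositiveRate;
}

console.log(reportsPerRealBug(0.03).toFixed(1)); // 97% false positives -> ~33 reports per real bug
console.log(reportsPerRealBug(0.01).toFixed(1)); // 1% true positives   -> ~100 reports per real bug
console.log(reportsPerRealBug(0.02).toFixed(1)); // 2% true positives   -> ~50 reports per real bug
```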
Difficulty with Complex and Real-World Bugs
Bismouth’s SM-100 benchmark, comprising 100 triaged, validated bugs from 84 public repositories, aims to evaluate agents across a range of issue types and languages (Python, TypeScript, JavaScript, Go). These objective bugs are explicit security or logical issues that cause data loss or system crashes, excluding ambiguous items such as feature requests or stylistic concerns.
- The performance of leading solutions on the “needle in a haystack” problem (discovering bugs without prior knowledge, within a reduced context) shows significant room for growth. Bismouth found 10 of the bugs; the next-best solution found 7.
- In PR review, the best model found only 27% of the “needle in a haystack” bugs.
- Even simple bugs, such as a state issue that prevents a form from clearing, are missed by most agents; only Bismouth and Codex identified this issue. A sketch of this kind of bug follows.
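The source does not show the code for that case; the sketch below is a hypothetical React-style reproduction of the same class of bug, where the submit handler sends the data but never resets the controlled input state, so the form keeps its old contents.

```tsx
// Hypothetical sketch of a "form never clears" state bug; assumes a
// React-style controlled component, not the actual SM-100 case.
import { useState, type FormEvent } from "react";

export function CommentForm({ onSubmit }: { onSubmit: (text: string) => void }) {
  const [text, setText] = useState("");

  function handleSubmit(event: FormEvent) {
    event.preventDefault();
    onSubmit(text);
    // Bug: missing setText("") here, so the controlled <textarea> keeps its
    // old value and the form never clears after submission.
  }

  return (
    <form onSubmit={handleSubmit}>
      <textarea value={text} onChange={(e) => setText(e.target.value)} />
      <button type="submit">Post</button>
    </form>
  );
}
```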
The Broader Problem
Despite impressive scores on code generation benchmarks such as SWE-bench, agents still struggle significantly with the maintenance tasks evaluated by SM-100. This indicates that while existing agents can create software, managing and fixing it after deployment remains a major challenge. Addressing it will require better reasoning over code and more effective use of context, as discussed below.
Future Outlook
The current generation of agents carries a high risk of introducing new bugs. However, newer agents, including Bismouth’s, are beginning to demonstrate an increased ability to reason through code and to use context more effectively when evaluating concerns, and they show encouraging stability. This progress suggests that, with focused effort and new techniques, significant improvements are possible across the industry.