From: aidotengineer
AI agents for software engineering have gained significant popularity, but their effectiveness in finding and fixing bugs, and their reliability for software maintenance, remain crucial open questions. Bismouth, a company specializing in software agents, has been investigating these challenges through its own benchmark, SM-100.
Current Scope of AI Benchmarks
Existing benchmarks primarily measure the effectiveness of Large Language Models (LLMs) for writing new code, focusing on feature development and testing. However, the software development life cycle (SDLC) encompasses much more, including:
- Initial scoping and planning: Requires broader business context, knowledge of existing systems, and exploration of solutions, which is a distinct task from development.
- Code review: Largely unbenchmarked, despite the rise of LLM-based tools in this area.
- Deployment: A separate task involving configuration, monitoring, and integration with existing systems.
- Software maintenance: Includes bug fixes, dependency upgrades, and migrations. While still involving code, it requires a different approach than feature development.
Challenges in Bug Detection and Maintenance
Maintenance tasks, particularly bug finding, demand a deep understanding of a codebase and system architecture. AI agents exhibit several key limitations in this area:
Narrow and Insufficient Reasoning
- Agents struggle with holistic evaluation of files and systems, typically finding only a subset of bugs per run.
- Even with “thinking models,” reasoning tends to be narrow, exploring only a limited number of potential avenues at a time. This can lead LLMs to miss bugs a human developer would identify immediately, or to confirm false positives.
- Agents can often patch the bugs they do find, but largely because those bugs are relatively simple; this success does not indicate capability in complex bug scenarios.
- A pervasive problem across agents is that they do not look holistically at a file and inventory everything happening within it, and they show different biases from run to run, as sketched in the example below.
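The source does not include code, but a small invented example helps show what a holistic inventory demands: a single file can contain several unrelated defects, and an agent that explores only one avenue per run will surface only some of them. The file name, functions, and bug placements below are hypothetical.

```typescript
// userStats.ts: a hypothetical file with several unrelated defects. An agent
// that pursues only one line of reasoning per run tends to report just one.

interface User {
  name: string;
  lastLogin?: Date;
}

export function averageSessionLength(sessions: number[]): number {
  // Defect 1: an empty array yields NaN (0 / 0), which silently propagates
  // into downstream reports instead of failing loudly.
  const total = sessions.reduce((sum, s) => sum + s, 0);
  return total / sessions.length;
}

export function daysSinceLogin(user: User): number {
  // Defect 2: lastLogin is optional; the non-null assertion crashes at
  // runtime for users who have never logged in.
  const elapsedMs = Date.now() - user.lastLogin!.getTime();
  return Math.floor(elapsedMs / (1000 * 60 * 60 * 24));
}

export function topUsers(users: User[], n: number): User[] {
  // Defect 3: Array.prototype.sort mutates its argument, so callers see their
  // original list silently reordered, a behavioral bug that simple pattern
  // matching rarely flags but a careful reviewer often would.
  return users.sort((a, b) => a.name.localeCompare(b.name)).slice(0, n);
}
```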
Limitations of Existing Benchmarks for AI Agents
Traditional bug detection benchmarks, often from the software security space, are not well suited to evaluating new agentic AI systems:
- Simplistic Bugs: They focus on relatively simple bugs in common patterns (e.g., null pointer dereferences, buffer overflows, SQL injection) that can be found statically.
- Limited Languages: Many benchmarks are restricted to a single language, such as Java.
- Security Bias: There is a bias toward security issues, overlooking other common bug types, such as copy-paste errors that break software for end users; the contrast is illustrated below.
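To make the contrast concrete, here is an invented TypeScript example (not drawn from any benchmark): the first function contains the kind of statically detectable injection bug that security-oriented benchmarks emphasize, while the second contains a copy-paste error that no pattern-based scanner flags, yet it clearly breaks the feature for end users.

```typescript
// (a) Classic statically detectable issue: string-concatenated SQL, the kind
// of pattern security-focused benchmarks are built around.
function findUser(db: { query: (sql: string) => Promise<unknown> }, name: string) {
  return db.query(`SELECT * FROM users WHERE name = '${name}'`); // SQL injection
}

// (b) Copy-paste error: the "max" branch was copied from the "min" branch and
// the comparison was never flipped. There is no security impact and nothing
// for a static pattern matcher to flag, but in-range prices get bumped up to
// `max` and prices above `max` pass through unclamped.
function clampPrice(price: number, min: number, max: number): number {
  if (price < min) return min;
  if (price < max) return max; // should be `price > max`
  return price;
}
```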
High False Positive Rates
Basic implementations of AI agents, while trivial to set up and capable of finding some bugs, exhibit extremely high false positive rates.
- Initial tests showed a 97% false positive rate.
- Some agents, such as R1 and Llama Maverick, had true positive rates of only 1% and 2%, respectively.
- Three of the six agents tested achieved true positive rates of 10% or less across a massive number of reports.
- One agent generated an astounding 70 reports for a single issue, which is impractical for human engineers to sift through. This highlights the need to tighten both the volume and the accuracy of the information agents report; the arithmetic below shows why.
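As a back-of-the-envelope illustration (only the quoted rates come from the source; the underlying report counts are not given), the reviewer burden implied by these rates can be computed directly:

```typescript
// How many reports a human must read, on average, to find one real bug,
// given the fraction of reports that are true positives.
function reportsPerRealBug(truePositiveRate: number): number {
  return 1 / truePositiveRate;
}

console.log(reportsPerRealBug(0.03).toFixed(1)); // 97% false positives -> ~33 reports per real bug
console.log(reportsPerRealBug(0.01).toFixed(1)); // 1% true positives   -> ~100 reports per real bug
console.log(reportsPerRealBug(0.02).toFixed(1)); // 2% true positives   -> ~50 reports per real bug
```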
Difficulty with Complex and Real-World Bugs
Bismouth’s SM-100 benchmark, comprising 100 triaged, validated bugs from 84 public repositories, aims to evaluate agents across a range of issue types and languages (Python, TypeScript, JavaScript, Go). These objective bugs are explicit security or logical issues that cause data loss or system crashes, excluding ambiguous items such as feature requests or stylistic concerns.
- The performance of leading solutions on the “needle in a haystack” problem (discovering bugs without prior knowledge, within a reduced context) shows significant room for growth. Bismouth found 10 of the bugs; the next-best solution found 7.
- In PR review, the best model found only 27% of the “needle in a haystack” bugs.
- Even simple bugs, such as a state issue that prevents a form from clearing, are missed by most agents; only Bismouth and Codex identified this issue. A sketch of this kind of bug follows.
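The source does not show the code for that case; the sketch below is a hypothetical React-style reproduction of the same class of bug, where the submit handler sends the data but never resets the controlled input state, so the form keeps its old contents.

```tsx
// Hypothetical sketch of a "form never clears" state bug; assumes a
// React-style controlled component, not the actual SM-100 case.
import { useState, type FormEvent } from "react";

export function CommentForm({ onSubmit }: { onSubmit: (text: string) => void }) {
  const [text, setText] = useState("");

  function handleSubmit(event: FormEvent) {
    event.preventDefault();
    onSubmit(text);
    // Bug: missing setText("") here, so the controlled <textarea> keeps its
    // old value and the form never clears after submission.
  }

  return (
    <form onSubmit={handleSubmit}>
      <textarea value={text} onChange={(e) => setText(e.target.value)} />
      <button type="submit">Post</button>
    </form>
  );
}
```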
The Broader Problem
Despite impressive scores on code generation benchmarks such as SWE-bench, agents still struggle significantly with the maintenance tasks evaluated by SM-100. This indicates that while existing agents can create software, managing and fixing it after deployment remains a major challenge. Addressing it will require better reasoning over code and more effective use of context, as discussed below.
Future Outlook
The current generation of agents carries a high risk of introducing new bugs. However, newer agents, including Bismouth’s, are beginning to demonstrate an increased ability to reason through code and to use context more effectively when evaluating concerns, and they show encouraging stability. This progress suggests that, with focused effort and new techniques, significant improvements are possible across the industry.