From: aidotengineer
AI agents for software engineering have seen a rapid increase in popularity, prompting investigations into their efficacy for tasks like finding and fixing bugs, and general maintenance [00:00:00]. Bismouth, a company specializing in software agents, developed a benchmark to assess these capabilities [00:01:12].
Limitations of Existing AI Evaluation Benchmarks
Existing benchmarks for evaluating the performance of large language models (LLMs) on software tasks primarily focus on feature development and code writing [00:01:23]. Examples include HumanEval, the Aider Polyglot benchmark, and LiveCodeBench [00:01:28]. However, these benchmarks cover only a limited part of the software development life cycle (SDLC) [00:01:33].
Key areas of the SDLC that are not well-covered by existing benchmarks include:
- Initial Scoping and Planning [00:01:40]: Requires broader business context, knowledge of existing systems, and exploration of solutions, making it a distinct task from development [00:01:46].
- Code Review Process [00:02:04]: Largely unbenchmarked despite the emergence of LLM-based tools in this area [00:02:06].
- Deployment [00:02:15]: Involves configuration, monitoring setup, and integration with existing systems [00:02:18].
- Software Maintenance Tasks [00:02:29]: Includes bug fixes, dependency upgrades, and migrations [00:02:31]. This core task, though involving code writing, differs significantly from feature development [00:02:39].
Challenges in AI Agent Evaluation for Maintenance
The ability to deeply reason through a codebase to find bugs is transferable to producing features, as both require understanding system architecture and connectedness [00:02:49]. Often, finding bugs demands an even deeper understanding than original feature creation [00:03:09].
Current AI agents, even with thinking models, often exhibit narrow reasoning, exploring only a limited number of avenues at a time [00:03:27]. This translates to LLMs missing bugs that human developers would quickly identify, and confirming false positives that humans would discard [00:03:34]. And while agents can patch the simple bugs they do find, that ease reflects the bugs' simplicity rather than advanced capability [00:03:48].
Existing bug detection benchmarks from the software security space have significant limitations for evaluating AI agents:
- Simplistic Bugs: They focus on basic bugs like null pointer dereferences, buffer overflows, or SQL injection, which can often be found statically [00:04:16].
- Limited Languages: Many are restricted to languages like Java [00:04:29].
- Security Bias: A bias towards security issues neglects other common bug types, such as copy-paste errors, which disrupt software functionality for end-users [00:04:46].
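To make the contrast concrete, here is a brief hypothetical Python sketch (not taken from any benchmark): the first function contains a classic statically detectable security bug, while the second contains a copy-paste logic error that is syntactically valid and invisible to a pattern-based scanner.

```python
import sqlite3

def find_user_insecure(conn: sqlite3.Connection, username: str):
    # Statically detectable: untrusted input interpolated into SQL (injection risk);
    # security-focused benchmarks and linters flag this pattern from the call site alone.
    return conn.execute(f"SELECT * FROM users WHERE name = '{username}'").fetchone()

def apply_discounts(order: dict) -> dict:
    # Copy-paste style logic bug: the second line was duplicated from the first
    # and the field name was never updated, so shipping is never discounted.
    # The code is syntactically valid, so a static analyzer has nothing to flag.
    order["item_total"] = order["item_total"] * 0.9
    order["item_total"] = order["item_total"] * 0.95  # intended: order["shipping_total"]
    return order
```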
Bismouth’s SM-100 Benchmark
Bismouth developed the SM-100 benchmark specifically to address the gaps in evaluating AI agents for software maintenance [00:01:15].
Methodology
The SM-100 benchmark consists of 100 manually triaged, validated, and classified bugs from over 84 public repositories [00:05:14]. These bugs represent real-world issues already remediated in open-source projects [00:05:18].
The benchmark aims to provide:
- Range of Issue Types: From obvious issues requiring little domain-specific knowledge up to bugs demanding senior- or staff-level engineering knowledge and significant understanding of the project [00:05:28].
- Multi-language Support: Focuses on Python, TypeScript, JavaScript (popular languages where LLMs perform well), and Go (as a control for low-level systems engineering) [00:05:39].
Defining an “Objective Bug”
SM-100 defines an "objective bug" as an explicit security issue or a logical issue that could cause data loss or a system crash [00:06:05]. This definition avoids ambiguity and excludes harmless issues or issues that are corrected by higher-level callers [00:06:11]. The benchmark also explicitly excludes feature requests, optimizations, style or formatting concerns, and design decisions, to reduce ambiguity and ensure reproducibility [00:06:41].
Each bug is annotated with metadata, including:
- Severity [00:07:02]
- Context where it was defined and called [00:07:04]
- Amount of system-specific domain knowledge a human would require to find it [00:07:06]
- Difficulty of finding it even with the required knowledge [00:07:10]
- Implication of the bug (e.g., data loss, crash, security exploit) [00:07:18]
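As an illustration of how such an annotated bug might be represented (a hypothetical record; the field names are assumptions rather than SM-100's actual schema):

```python
# Hypothetical SM-100-style bug record; field names and values are illustrative
# assumptions based on the metadata described above, not the benchmark's real schema.
sm100_entry = {
    "repository": "example-org/example-repo",   # public repo the bug came from
    "languages": ["python"],
    "severity": "high",
    "context": "defined in payments/refunds.py, called from api/handlers.py",
    "domain_knowledge_required": "medium",       # system-specific knowledge a human would need
    "difficulty_to_find": "hard",                # even with the required knowledge
    "implication": "data loss",                  # e.g., data loss, crash, security exploit
    "golden_commit": "<sha of the PR commit that introduced the bug>",
}
```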
These classifications help establish which levels of bugs AI agents can regularly find [00:07:25]. While agents occasionally surprise with discoveries such as zero-day exploits (e.g., o3 across 100 runs), the benchmark focuses on everyday usage capabilities [00:07:32].
Metrics for Evaluating AI Agents
For each system benchmarked on SM-100, four key metrics are assessed:
- Bug Discovery (“Needle in the Haystack”): Can the system discover bugs without prior knowledge? [00:07:55]
  - Methodology: To avoid biasing the agent while still scoping the search, repositories are broken into interrelated subsystems (e.g., the front end, specific API endpoints) [00:09:36]. The agent is then fed only the files within the subsystem that contain modifications from the original "golden PR commit" where the bug was introduced [00:09:56]. This provides a reduced, relevant list of files without hinting at the exact bug [00:10:07].
- False Positive Rate: The rate of irrelevant bugs reported by the agent, measured manually [00:08:15].
- Bug Identification at Introduction (PR Review): Can the system find the bug when given the pull request or commit that introduces it? [00:08:24] This provides a more optimistic starting point with immediate context [00:08:37].
- Remediation Suggestion: For each discovered bug, can the agent suggest a fix that resolves the bug without breaking the rest of the codebase? [00:08:51]
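A minimal sketch of how the scoping and scoring described above could be wired together is shown below; the helper names and the use of plain `git` are assumptions, since the talk does not describe Bismouth's actual harness.

```python
import subprocess

def files_touched_by_commit(repo_path: str, golden_commit: str) -> set[str]:
    # Files modified by the "golden" PR commit that introduced the bug.
    out = subprocess.run(
        ["git", "-C", repo_path, "diff", "--name-only", f"{golden_commit}~1", golden_commit],
        capture_output=True, text=True, check=True,
    )
    return {line.strip() for line in out.stdout.splitlines() if line.strip()}

def needle_in_haystack_input(subsystem_files: set[str], repo_path: str, golden_commit: str) -> set[str]:
    # Feed the agent only the subsystem files the golden commit touched,
    # narrowing the search without pointing at the exact bug.
    return subsystem_files & files_touched_by_commit(repo_path, golden_commit)

def score_reports(reports: list[str], true_bugs: set[str], is_same_bug) -> dict:
    # reports: bug descriptions emitted by the agent; is_same_bug: a (manual or
    # LLM-assisted) judgment of whether a report matches a known benchmark bug.
    true_positives = [r for r in reports if any(is_same_bug(r, b) for b in true_bugs)]
    found = {b for b in true_bugs if any(is_same_bug(r, b) for r in reports)}
    tp_rate = len(true_positives) / len(reports) if reports else 0.0
    return {
        "bugs_found": len(found),            # needle-in-the-haystack hits
        "true_positive_rate": tp_rate,
        "false_positive_rate": 1.0 - tp_rate if reports else 0.0,
    }
```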
Performance and Challenges
Building effective AI agents for bug identification is challenging, requiring a sophisticated combination of model, system, prompting, information feeding, and navigation strategy [00:11:10]. Basic agent implementations (simple loops with shell tools) often yield a 97% false positive rate [00:10:53].
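For context, a "simple loop with shell tools" amounts to roughly the following pattern (a hypothetical minimal sketch; `call_llm` is a stand-in for any chat-completion API, not a real library call). With no scoping, verification, or deduplication, a loop like this tends to flood the evaluator with low-confidence reports.

```python
import json
import subprocess

def call_llm(messages: list[dict]) -> dict:
    # Placeholder for any chat-completion API; assumed to return either
    # {"tool": "shell", "command": "..."} or {"report": [list of suspected bugs]}.
    raise NotImplementedError("wire up a model provider here")

def naive_bug_hunt(repo_path: str, max_steps: int = 20) -> list[str]:
    messages = [{"role": "user", "content": f"Find bugs in the repository at {repo_path}."}]
    for _ in range(max_steps):
        action = call_llm(messages)
        if "report" in action:
            return action["report"]
        # Let the model run arbitrary shell commands (grep, cat, ls, ...) and
        # feed the truncated output straight back in, with no verification step.
        result = subprocess.run(
            action["command"], shell=True, cwd=repo_path,
            capture_output=True, text=True, timeout=60,
        )
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "user", "content": result.stdout[-4000:] + result.stderr[-2000:]})
    return []
```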
Key findings from the SM-100 benchmark include:
- “Needle in the Haystack”: Bismouth led by finding 10 out of 100 bugs, while the next best solution found 7 [00:11:25]. This highlights significant room for growth in this area [00:11:35].
- True Positive Rate (Detection): Codex achieved 45%, Bismouth 25%, and Claude Code 16% [00:11:47]. Many other agents (Devin, Cursor Agent, Cosine) generated hundreds or even thousands of reports with very low true positive rates (3-10%), indicating a need for tighter scoping and better accuracy [00:12:07].
- PR Review: Codex performed strongest at 27%, followed by Devin at 21% and Bismouth at 17% [00:12:23]. Even the best model found only 27% of the bugs introduced in PRs, showing there is still a long way to go [00:12:43].
- Open-Source Models: DeepSeek R1 (1% true positive rate) and Llama 4 Maverick (2% true positive rate) performed considerably worse than the proprietary models [00:13:28].
- Prevalence of Issues: The highest-scoring popular agent achieved only 7% on SM-100, suggesting that most widely used agents are currently poor at finding and fixing complex bugs, which represent a significant portion of software engineering work [00:14:01].
- Missed Simple Bugs: Even relatively simple bugs, like a form not clearing due to an `is_dirty` state issue, were missed by most agents (only Bismouth and Codex found it) [00:15:02].
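For illustration, a bug of this class might look like the following hypothetical Python reconstruction (the actual repository and code are not shown in the talk): the dirty flag is cleared before the guard that resets the form, so the form never actually clears.

```python
def save_to_backend(fields: dict[str, str]) -> None:
    # Stand-in for whatever persistence the real application uses.
    pass

class ProfileForm:
    # Hypothetical reconstruction of the bug class described in the talk.
    def __init__(self) -> None:
        self.fields: dict[str, str] = {}
        self.is_dirty = False

    def update(self, name: str, value: str) -> None:
        self.fields[name] = value
        self.is_dirty = True

    def submit(self) -> None:
        save_to_backend(self.fields)
        self.is_dirty = False       # flag cleared first...
        if self.is_dirty:           # ...so this guard is now always False
            self.fields.clear()     # and the form is never actually cleared
```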
Common Agent Behaviors and Future Directions
A notable commonality across agents is their narrow thinking, even when using “thinking models” [00:15:40]. They don’t go deep enough in their evaluations [00:15:51]. Broader and deeper thinking chains are needed to effectively find bugs [00:15:56]. Additionally, the specific bugs found by an LLM vary across runs, suggesting they don’t holistically evaluate files or inventory all issues [00:16:10].
Despite high scores on benchmarks like SWE-bench (60-80%), agents struggle with SM-100 [00:16:47]. This implies that while existing agents can create software upfront, managing and fixing deployed software remains a major challenge [00:16:55]. Solving this requires targeted search, better program comprehension, cross-file reasoning, and bug pattern recognition capabilities [00:17:09].
Newer agents, like Bismouth’s, are starting to show improved abilities to reason through code and use context more effectively [00:17:50]. While the technology is nascent, the progress is encouraging, and continued effort with different techniques can lead to industry-wide benefits in software maintenance [00:18:01].