From: hu-po
Recent advancements in automated task solving have been driven by multi-agent systems using large language models (LLMs) [00:04:14]. However, existing LLM-based multi-agent systems often focus on simpler dialogue tasks, with complex tasks rarely studied due to challenges like LLM hallucination and cascading errors [00:05:16]. This has led to the development of frameworks like MetaGPT, which aims to address these complexities by incorporating human workflows as a metaprogramming approach [00:06:26].
MetaGPT’s Approach to Software Engineering [00:08:03]
MetaGPT encodes standard operating procedures (SOPs) into prompts to enhance structured coordination [00:08:03]. It mandates modular outputs, assigning diverse roles to various agents, similar to an assembly line paradigm [00:08:22]. This approach mirrors human software development teams, where distinct roles like product manager, architect, project manager, engineer, and QA engineer work in a “waterfall method” to decompose high-level tasks into actionable components [00:13:06]. Each agent operates with specific capabilities like thinking, reflection, and knowledge accumulation, interacting with a shared environment through publication and subscription methods [00:33:59]. This system aims to produce more coherent and correct software compared to existing chat-based multi-agent systems [00:09:54].
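The paper describes this publish/subscribe coordination only at a high level; as a rough mental model, the assembly-line pipeline can be sketched as below. Class names like `Environment`, `Message`, and `Role` are illustrative placeholders, not MetaGPT’s actual API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of role-based agents sharing an environment via
# publish/subscribe, in the spirit of MetaGPT's assembly-line pipeline.
# These names are placeholders, not MetaGPT's real classes.

@dataclass
class Message:
    role: str      # which role produced this message (e.g. "ProductManager")
    content: str   # the document or code that role produced

@dataclass
class Environment:
    messages: list[Message] = field(default_factory=list)

    def publish(self, msg: Message) -> None:
        self.messages.append(msg)

    def subscribe(self, roles: set[str]) -> list[Message]:
        # A role only "sees" output from the upstream roles it watches.
        return [m for m in self.messages if m.role in roles]

class Role:
    def __init__(self, name: str, watches: set[str]):
        self.name, self.watches = name, watches

    def act(self, env: Environment) -> None:
        context = env.subscribe(self.watches)
        # In MetaGPT, an SOP-structured LLM call would turn the upstream
        # documents into this role's output (PRD, design, code, tests, ...).
        output = f"{self.name} output based on {len(context)} upstream message(s)"
        env.publish(Message(self.name, output))

# Waterfall-style pipeline: each role consumes its predecessors' documents.
env = Environment()
pipeline = [
    Role("ProductManager", watches=set()),
    Role("Architect", watches={"ProductManager"}),
    Role("ProjectManager", watches={"ProductManager", "Architect"}),
    Role("Engineer", watches={"ProductManager", "Architect", "ProjectManager"}),
    Role("QAEngineer", watches={"Engineer"}),
]
for role in pipeline:
    role.act(env)
```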
The Role of Context in MetaGPT
The MetaGPT framework generates increasing amounts of context as tasks progress through the different agents [00:47:06]. For example, the architect receives all the information written by the product manager, and subsequent roles accumulate even more context [00:47:15]. This process can lead to significant token usage [00:47:31].
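To see why token usage balloons, note that each downstream role receives the full output of every upstream role it subscribes to. The toy tally below illustrates the accumulation; the document sizes are invented placeholders, not measurements from the paper or the demo run.

```python
# Illustrative only: downstream roles accumulate the full output of every
# upstream role, so prompt size (and token cost) grows monotonically.
# The per-document token counts are made up for illustration.

doc_sizes = {
    "ProductManager": 1_500,   # PRD: user stories, competitive analysis
    "Architect": 1_200,        # system design, data structures, APIs
    "ProjectManager": 800,     # task breakdown
    "Engineer": 3_000,         # generated code
}

accumulated = 0
for role, size in doc_sizes.items():
    print(f"{role:>15} receives ~{accumulated:>5} tokens of upstream context")
    accumulated += size
```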
Benchmarks for Software Engineering Evaluation
The MetaGPT paper uses two main benchmarks to evaluate its performance:
HumanEval [01:34:44]
HumanEval is a problem-solving dataset used to measure functional correctness when synthesizing programs from docstrings [01:34:56]. It essentially tests the ability to write a function given its documentation.
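A representative HumanEval-style item supplies a signature and docstring, and the model’s completion is scored by running hidden unit tests against it. The example below is paraphrased from memory rather than copied verbatim from the dataset:

```python
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other
    than the given threshold."""
    # The benchmark supplies only the signature and docstring above; the
    # model must synthesize a body like this, which is then run against
    # held-out unit tests to score functional correctness.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

assert has_close_elements([1.0, 2.8, 3.0, 4.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```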
MBPP (Mostly Basic Programming Problems) [01:35:26]
MBPP is a dataset of 1,000 crowdsourced Python programming problems designed to be solvable by entry-level programmers [01:35:36]. These are simple coding benchmarks, often involving tasks like finding shared elements in lists [01:36:13].
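An MBPP-style task of the “shared elements” flavor mentioned above might read “write a function to find the shared elements from two given lists,” graded by a few asserts. This is an illustrative paraphrase, not the dataset’s exact wording or reference solution:

```python
def similar_elements(list1: list, list2: list) -> set:
    # Return the elements that appear in both input lists.
    return set(list1) & set(list2)

# MBPP grades each problem by running a handful of assert-style tests.
assert similar_elements([3, 4, 5, 6], [5, 7, 4, 10]) == {4, 5}
assert similar_elements([1, 2, 3], [4, 5, 6]) == set()
```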
Critique of Current Benchmarks [01:36:26]
These benchmarks are criticized for evaluating simple coding tasks (writing individual functions) rather than complex system design [01:36:26]. Creating a working software product, which involves designing a whole system with interconnected components, is significantly more difficult than just writing a single function [01:36:38]. The absence of “system design benchmarks” means that MetaGPT is evaluated on a metric that doesn’t fully capture its advertised capabilities [01:36:55].
Practical Evaluation and Critique
When given a task, such as creating a Gradio front-end for a robotic AI cat toy [00:39:58], MetaGPT produced:
- Product Requirement Document (PRD): This included user stories and a competitive analysis, even generating fake competitors [00:40:53]. While the user stories were generally agreeable, the competitive quadrant chart was deemed “meaningless” for strategic decision-making [00:43:01].
- System Design: The architect’s output, including data structures and API definitions, was largely deemed arbitrary and basic, using generic “control params” or “schedule params” that lacked specific meaning [00:49:55]. The sequence flow diagram was criticized as stating the obvious without providing useful insights [00:51:16].
- Code Generation: The generated code for the Gradio interface was found to be “mostly nonsense” [00:58:01]. It used arbitrary object structures, inconsistent time formats, and questionable design patterns, such as a class with only static methods that merely wrapped existing functions (see the sketch after this list) [01:06:55]. The code was deemed “verbose and incorrect” [01:24:10].
- QA Engineer Output: No test file was produced by the QA engineer, indicating an incomplete process [01:11:18].
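To make the “class with only static methods” criticism concrete, the pattern in question looks roughly like the reconstruction below (a caricature for illustration, not MetaGPT’s literal output): the class adds no state and no behavior, only indirection around calls that could remain plain functions.

```python
import time

# Reconstructed caricature of the criticized pattern: a class whose only
# members are static methods that wrap existing functions, adding
# indirection but no state, no polymorphism, and no new behavior.
class TimeUtils:
    @staticmethod
    def current_timestamp() -> float:
        return time.time()

    @staticmethod
    def format_timestamp(ts: float) -> str:
        return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(ts))

# The plain standard-library calls do the same job with less ceremony:
print(TimeUtils.format_timestamp(TimeUtils.current_timestamp()))
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time())))
```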
In comparison to MetaGPT’s complex multi-agent process, a direct query to GPT-4 (via ChatGPT) for the same task yielded a working, albeit simple, Gradio interface built around more meaningful concepts like direction and speed [01:00:59]. This suggests that the extensive context generated by MetaGPT’s multi-agent system may confuse the engineer agent rather than help it [01:31:31]. The cost of running MetaGPT’s demo (approximately $0.87 for a single task) also highlights the high token usage for potentially inferior results [00:47:47].
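For comparison, a direct-prompt Gradio interface of the kind described, exposing direction and speed controls for the cat toy, fits in a few lines. This is an illustrative reconstruction, not the exact code GPT-4 returned:

```python
import gradio as gr

# Minimal cat-toy control panel: a direction selector and a speed slider.
def move_toy(direction: str, speed: int) -> str:
    # A real implementation would send a command to the toy's controller here.
    return f"Moving {direction} at speed {speed}"

demo = gr.Interface(
    fn=move_toy,
    inputs=[
        gr.Radio(["forward", "backward", "left", "right"], label="Direction"),
        gr.Slider(0, 10, step=1, label="Speed"),
    ],
    outputs=gr.Textbox(label="Status"),
    title="Robotic AI Cat Toy Controller",
)

if __name__ == "__main__":
    demo.launch()
```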
Benchmarking Claims vs. Reality
MetaGPT claims to significantly outperform GPT-4 on HumanEval and MBPP [01:38:59], but this claim sits uneasily with the observed poor quality of its system-level code generation [01:38:11]. One speculation is that MetaGPT’s performance on these function-level benchmarks stems from the sheer volume of context it provides (tens of thousands of tokens) [01:38:00], making it more likely to get the answer right on the first try (pass@1) [01:39:15]. However, this does not translate into an ability to produce functional, well-designed software systems.
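For context on the metric: pass@1 asks whether a single sampled completion passes the unit tests. The standard unbiased estimator introduced alongside HumanEval is pass@k = 1 − C(n−c, k)/C(n, k), averaged over problems, where n is the number of samples per problem and c the number that pass; at k = 1 it reduces to the fraction of passing samples. A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval (Codex) paper:
    n = samples generated per problem, c = samples that pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# At k = 1 this is simply the fraction of samples that pass:
print(pass_at_k(n=10, c=3, k=1))   # 0.3
print(pass_at_k(n=10, c=3, k=5))   # ~0.92: more tries, more chances
```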
Broader Implications for Software Engineering Organizations
The perceived inefficiencies of MetaGPT’s multi-agent system mirror critiques of traditional, large-scale human software engineering organizations [01:58:49]. The product manager, architect, and project manager roles are seen as producing “overly verbose, non-specific” artifacts that in turn lead to overly verbose and non-specific code, a pattern that carries over when those roles are handed to LLMs [00:59:30]. The argument is raised that directly tasking an engineer (or an LLM acting as one) might yield better results with less overhead [01:19:02]. This raises the question of whether the current bureaucratic structures of software development are truly optimal ways to organize the work, or merely relics of historical legacy and compensation models [01:59:50].
Conclusion
While MetaGPT presents an interesting concept for multi-agent collaboration in software engineering, its practical application, as demonstrated, struggles to produce robust and well-designed code compared to direct LLM prompting [01:56:57]. The benchmarks used for evaluation, HumanEval and MBPP, are criticized for not adequately assessing complex system design capabilities [01:58:08]. The findings suggest that the added complexity and overhead of mimicking traditional software roles with LLMs may not currently translate into a superior product, raising broader questions about the efficiency of established software development methodologies.