From: aidotengineer

Designing cloud architecture involves significant cognitive work beyond technical assembly: architects continuously negotiate trade-offs based on requirements, time, and available resources [01:04:47]. Because that context is scattered and implicit, it is difficult to capture for AI systems, and doing so first requires understanding how architects think [01:21:00]. Cat.io is building grounded reasoning systems with multi-agent orchestration to create an AI copilot that addresses these challenges in applying AI to cloud architecture [00:07:00].

Why Reasoning is Essential for Cloud Architecture

Cloud systems are growing more complex as user and developer bases expand and as tools, constraints, and expectations multiply [00:24:00]. Today's simplistic automation tools cannot scale to the diversity of decisions cloud architecture requires [00:37:00]. Solving these problems calls for systems that can understand, debate, justify, and plan, which requires reasoning rather than mere automation [00:44:00].

Key Challenges in AI Development for Architecture Design

At a high level, Cat.io identifies three primary challenges in AI development when applying AI to architecture design:

  • Requirement Understanding: AI needs to understand the origin, format, importance, and scope (global vs. specific) of requirements [01:46:00].
  • Architecture Identification: AI must comprehend how an architecture works by identifying different components and their varied functions based on their context [02:05:00].
  • Architecture Recommendation: The system needs to provide recommendations that align with requirements or improve the architecture to match best practices, integrating understanding of requirements and current architecture state [02:27:00].

Translating these high-level problems into specific AI challenges reveals further complexities:

  • Semantic and Graph Context: Architecture design mixes textual requirements (semantic context) with inherently graph-based architecture data. A key challenge in building AI applications is integrating these two different data sources so they support higher-level reasoning [02:56:00].
  • Complex Reasoning Scenarios: Questions posed to the system can be vague, broad, or highly complex, requiring breakdown into parts and proper planning to derive accurate answers [03:20:00].
  • Evaluation and Feedback: A significant challenge in AI production is evaluating and providing feedback to a large AI system with many moving parts [03:42:00].

Addressing Challenges

Grounding AI Agents in Specific Context

Effective reasoning by AI agents requires proper context about architecture [04:12:00]. Translating natural language into meaningful architecture retrieval tasks is not straightforward, especially when fast responses are needed [04:22:00].

Strategies implemented include:

  • Semantic Enrichment of Architecture Data: Collecting relevant semantic information for each component so it is easier to find via vector search [04:40:00].
  • Graph-Enhanced Component Search: Using graph algorithms to retrieve the right pieces of an architecture when searching for components (see the sketch after this list) [04:57:00].
  • Early Score Enrichment of Requirement Documents: Scoring documents based on important concepts to enable faster retrieval from large corpora of text [05:22:00].
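
To make the first two strategies concrete, here is a minimal sketch: each component is enriched with a short semantic description for embedding-style search, and the top hits are then expanded one hop along graph edges so connected components come back with the match. The catalog, the toy embedding, and all names are illustrative assumptions, not Cat.io's actual implementation.

```python
# Minimal sketch: semantic enrichment + graph-enhanced component search.
# All names and data are illustrative; a real system would use a proper
# embedding model and a graph/vector store instead of in-memory dicts.
from collections import Counter
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Component:
    name: str
    kind: str                      # e.g. "queue", "database", "service"
    enrichment: str = ""           # semantic description added at ingestion
    neighbors: list[str] = field(default_factory=list)  # graph edges


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; stands in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, catalog: dict[str, Component], top_k: int = 2) -> list[str]:
    """Vector-style search over enriched descriptions, then a one-hop graph
    expansion so connected components arrive together with the match."""
    q = embed(query)
    scored = sorted(
        catalog.values(),
        key=lambda c: cosine(q, embed(f"{c.name} {c.kind} {c.enrichment}")),
        reverse=True,
    )
    hits = [c.name for c in scored[:top_k]]
    expanded = {n for name in hits for n in catalog[name].neighbors}
    return hits + sorted(expanded - set(hits))


if __name__ == "__main__":
    catalog = {
        "orders-api": Component("orders-api", "service",
                                "REST API handling order placement",
                                ["orders-db", "orders-queue"]),
        "orders-db": Component("orders-db", "database",
                               "Postgres store for order records",
                               ["orders-api"]),
        "orders-queue": Component("orders-queue", "queue",
                                  "async queue for order fulfillment events",
                                  ["orders-api"]),
    }
    print(search("which database stores order records", catalog))
```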

Learnings in this area:

  • Semantic grounding improves reasoning but has limitations and doesn’t always scale [06:09:00].
  • “Soft grounding” design is critical, guiding the agent on what to focus on and retrieve [06:29:00].
  • Graph memory supports continuity by connecting nodes and adding context for proper reasoning in subsequent steps (a minimal sketch follows this list) [06:47:00].
  • Initial designs using vector databases for architecture retrieval showed good results but highlighted that semantic search isn’t ideal for graph data, leading to a shift towards graph-based searches and knowledge graphs [07:16:00].
  • For requirements, structured templates helped in extracting relevant information for fast retrieval, but context loss was observed when dealing with a larger number of documents, suggesting the potential for graph analysis here as well [08:30:00].
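
As a rough illustration of the graph-memory point above, the sketch below keeps each retrieved fact as a node, links it to the entities it mentions, and lets a later reasoning step pull back everything connected to those entities instead of re-running retrieval. The class, method names, and example facts are assumptions for illustration, not the production design.

```python
# Minimal sketch of graph memory for multi-step reasoning: facts retrieved in
# one step are linked to the entities they mention, so a later step can recall
# connected context. Names and structure are illustrative assumptions.
from collections import defaultdict


class GraphMemory:
    def __init__(self) -> None:
        self.facts: list[str] = []
        self.entity_to_facts: dict[str, set[int]] = defaultdict(set)

    def remember(self, fact: str, entities: list[str]) -> None:
        """Store a fact node and connect it to each entity node it mentions."""
        idx = len(self.facts)
        self.facts.append(fact)
        for entity in entities:
            self.entity_to_facts[entity].add(idx)

    def recall(self, entities: list[str]) -> list[str]:
        """Return every fact connected to any of the given entities."""
        idxs = sorted({i for e in entities for i in self.entity_to_facts.get(e, set())})
        return [self.facts[i] for i in idxs]


if __name__ == "__main__":
    memory = GraphMemory()
    # Step 1: a retrieval agent records what it found, with entity links.
    memory.remember("orders-api talks to orders-db over a private subnet",
                    ["orders-api", "orders-db"])
    memory.remember("requirement R12: order data must be encrypted at rest",
                    ["orders-db"])
    # Step 2: a later agent reasoning about orders-db gets both facts back.
    print(memory.recall(["orders-db"]))
```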

Complex Reasoning Scenarios

Good design involves conflicting goals, trade-offs, and debates [10:07:00]. The task, then, is to build agents that can collaborate, argue, and converge on justified recommendations [10:19:00].

Solutions employed:

  • Multi-Agent Orchestration: Building a system with role-specific agents that can work together [10:30:00].
  • Structured Message Format: Using structured message formats (e.g., XML) rather than free-form text to build better workflows and let multiple agents work together over longer chains (see the sketch after this list) [10:53:00].
  • Conversation Management: Isolating conversations between agents to avoid token waste and prevent increased hallucination observed with larger memories [11:25:00].
  • Cloning Agents: Cloning agents to process certain tasks in parallel, which speeds things up but requires careful memory management [12:05:00].
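
The sketch below illustrates the structured-message and conversation-isolation ideas together: each inter-agent message is serialized as a small XML envelope, and each agent pair keeps its own history so one exchange's tokens do not leak into another. The tags, roles, and router class are assumptions for illustration.

```python
# Minimal sketch: XML-structured inter-agent messages plus per-pair
# conversation isolation. Tag names and agent roles are illustrative assumptions.
import xml.etree.ElementTree as ET
from collections import defaultdict


def make_message(sender: str, recipient: str, task: str, body: str) -> str:
    """Serialize one agent-to-agent message as a small XML envelope."""
    msg = ET.Element("message", sender=sender, recipient=recipient)
    ET.SubElement(msg, "task").text = task
    ET.SubElement(msg, "body").text = body
    return ET.tostring(msg, encoding="unicode")


class ConversationRouter:
    """Keeps a separate history per agent pair so long chains of agents do not
    share one ever-growing context window."""

    def __init__(self) -> None:
        self.histories: dict[tuple[str, ...], list[str]] = defaultdict(list)

    def send(self, sender: str, recipient: str, task: str, body: str) -> str:
        xml = make_message(sender, recipient, task, body)
        self.histories[tuple(sorted((sender, recipient)))].append(xml)
        return xml

    def context_for(self, a: str, b: str) -> list[str]:
        """Only the isolated history for this pair is handed to the LLM call."""
        return self.histories[tuple(sorted((a, b)))]


if __name__ == "__main__":
    router = ConversationRouter()
    router.send("chief_architect", "staff_architect_api",
                "list_recommendations", "Review the API layer against requirements.")
    router.send("staff_architect_api", "requirement_retriever",
                "fetch", "Latency requirements for the orders API?")
    # The chief<->staff exchange stays out of the staff<->retriever context.
    print(router.context_for("staff_architect_api", "requirement_retriever"))
```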

Learnings regarding complex reasoning and multi-agent systems:

  • Structured outputs improve clarity and control, which is crucial for programmatic handling, despite potential trade-offs in reasoning ability (a minimal schema sketch follows this list) [13:34:00].
  • Letting agents resolve trade-offs dynamically, rather than execute static plans, leads to more creative planning and better results [13:08:00].
  • Successful multi-agent orchestration necessitates control flows; agents cannot simply work together hoping for the best outcome [13:50:00].
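
To show what a structured output can look like in practice, here is a minimal sketch of a recommendation schema an agent might be constrained to emit, with a parser that fails loudly when the output drifts. The field names are assumptions meant to show the shape, not Cat.io's actual schema.

```python
# Minimal sketch of a structured agent output that downstream code can parse
# and route reliably. Field names are illustrative assumptions.
import json
from dataclasses import dataclass


@dataclass
class Recommendation:
    domain: str          # e.g. "infrastructure", "api", "iam"
    title: str
    gap: str             # what the current architecture is missing
    proposed_action: str
    confidence: float    # reported confidence; not the same as correctness


def parse_recommendation(raw: str) -> Recommendation:
    """Parse and validate the JSON an agent returns."""
    rec = Recommendation(**json.loads(raw))
    if not 0.0 <= rec.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return rec


if __name__ == "__main__":
    raw = json.dumps({
        "domain": "iam",
        "title": "Scope down wildcard IAM policies",
        "gap": "Several roles grant s3:* on all buckets",
        "proposed_action": "Replace wildcards with bucket- and action-scoped policies",
        "confidence": 0.7,
    })
    print(parse_recommendation(raw))
```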

Cat.io’s production stack utilizes a multi-agent system for recommendations, featuring:

  • A Chief Architect overseeing higher-level tasks and coordination [15:17:00].
  • Ten Staff Architects, each specialized in a specific domain (e.g., infrastructure, API, IAM) [15:23:00].
  • Two Retrievers: a Requirement Retriever with access to requirements data, and an Architecture Retriever understanding the current architecture state [15:36:00].

The workflow runs three main tasks in sequence (a sketch of this control flow follows the list):

  1. List Generation: Staff architects, in parallel, query the retriever agents for information and generate a list of possible recommendations, which is returned to the Chief Architect [16:11:00].
  2. Conflict Resolution: The Chief Architect reviews the generated list for conflicts or redundancies and prunes it [16:20:00].
  3. Design Proposal: Cloned staff architects, each with access to past history but generating separate current histories, write full design proposals for each recommendation topic, including gap analysis and proposed actions [16:49:00].
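
Here is a minimal sketch of that three-step control flow, with every LLM call stubbed out by a placeholder function. The roles mirror the setup described above, but the function bodies, domain list, and use of a thread pool are illustrative assumptions rather than the actual implementation.

```python
# Minimal sketch of the three-step recommendation workflow described above,
# with LLM calls replaced by stubs. All bodies are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

DOMAINS = ["infrastructure", "api", "iam"]  # the described system uses ten domains


def staff_architect_list(domain: str) -> list[str]:
    """Stub: a staff architect queries the retrievers and proposes topics."""
    return [f"{domain}: recommendation A", f"{domain}: recommendation B"]


def chief_resolve_conflicts(all_items: list[str]) -> list[str]:
    """Stub: the chief architect prunes duplicate and conflicting items."""
    return sorted(set(all_items))


def staff_architect_proposal(topic: str) -> str:
    """Stub: a cloned staff architect writes a full design proposal."""
    return f"Proposal for '{topic}': gap analysis ... proposed actions ..."


def run_workflow() -> list[str]:
    # 1. List generation: staff architects work in parallel.
    with ThreadPoolExecutor() as pool:
        lists = pool.map(staff_architect_list, DOMAINS)
    candidates = [item for sub in lists for item in sub]
    # 2. Conflict resolution by the chief architect.
    topics = chief_resolve_conflicts(candidates)
    # 3. Design proposals, again in parallel via cloned staff architects.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(staff_architect_proposal, topics))


if __name__ == "__main__":
    for proposal in run_workflow():
        print(proposal)
```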

Evaluation and Feedback

A critical challenge in building AI agents is determining the quality of recommendations in a complex multi-agent system with many rounds of conversations [19:03:00].

Solutions and learnings:

  • Closing the loop with human scoring, structured feedback, and revision cycles is essential (a sketch of such a feedback record follows this list) [19:26:00].
  • Human evaluation is the most effective assessment, especially in early stages, as LLM evaluations do not provide the necessary insights for system improvements [19:34:00].
  • Cat.io developed an internal human evaluation tool called “Eagle Eye” to analyze specific cases, including architecture, extracted requirements, agent conversations, and generated recommendations, allowing for relevance, visibility, and clarity studies [19:55:00].
  • Confidence is not correctness; while confidence can help, it cannot always be trusted [20:38:00].
  • Human feedback is essential early on when building such systems from scratch [20:51:00].
  • Evaluation must be baked into system design from the outset, not added later, ensuring continuous assessment throughout development [20:59:00].
  • Hallucinations, such as an agent spontaneously scheduling a workshop, remain part of the challenge of building effective AI agents and are surfaced through such evaluation tools [22:05:00].
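
Here is a sketch of what a structured human-feedback record for a tool like Eagle Eye might capture, together with a simple revision trigger; the fields, scales, and threshold are assumptions meant to show the shape of the scoring-feedback-revision loop, not the tool's actual schema.

```python
# Minimal sketch of a human evaluation record and a revision trigger,
# illustrating the "human scoring -> structured feedback -> revision" loop.
# Field names and thresholds are assumptions, not Eagle Eye's actual schema.
from dataclasses import dataclass, field


@dataclass
class EvaluationRecord:
    recommendation_id: str
    relevance: int        # 1-5, scored by a human reviewer
    clarity: int          # 1-5
    grounded: bool        # does it trace back to real requirements/components?
    notes: list[str] = field(default_factory=list)  # e.g. hallucinations seen


def needs_revision(record: EvaluationRecord, threshold: int = 3) -> bool:
    """Send the case back through the agents when any signal is weak."""
    return (not record.grounded
            or record.relevance < threshold
            or record.clarity < threshold)


if __name__ == "__main__":
    record = EvaluationRecord(
        recommendation_id="rec-042",
        relevance=4,
        clarity=2,
        grounded=True,
        notes=["agent hallucinated scheduling a workshop with the team"],
    )
    print(needs_revision(record))  # True: clarity is below the threshold
```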

Conclusion: Designing for Reasoning

Building an AI copilot for architecture is about designing a system that can reason, not just generate answers [23:06:00]. This system needs to have a comprehensive view of large amounts of data, including thousands or millions of architecture components and numerous documents, to answer questions from diverse stakeholders ranging from developers to CTOs [23:16:00].

Achieving this requires:

  • Defining clear roles, workflows, memories, and structure within the AI system [24:04:00].
  • Continuous experimentation to find patterns that work best with existing data, with graphs becoming increasingly important in designs [24:16:00].
  • Carefully considering agent interactions and the level of autonomy granted to each agent [24:39:00]. Cat.io is exploring frameworks like LangGraph for agent workflows (sketched below) and using graphs to capture as much memory as possible so the AI always has the right context for each task [25:04:00].
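
Since LangGraph is mentioned as one framework under exploration, here is a minimal, hedged sketch of how a staff-architect node and a chief-architect node could be wired into an explicit control flow with it (assumes the `langgraph` package is installed). The node logic is stubbed and the wiring is an illustrative assumption, not Cat.io's actual workflow.

```python
# Minimal LangGraph sketch: two stubbed agent nodes wired into an explicit
# control flow over a shared typed state. Illustrative assumption only.
from typing import TypedDict

from langgraph.graph import END, StateGraph


class ReviewState(TypedDict):
    question: str
    candidates: list[str]
    answer: str


def staff_architect(state: ReviewState) -> dict:
    # Stub: would query retriever agents and draft domain recommendations.
    return {"candidates": [f"recommendation for: {state['question']}"]}


def chief_architect(state: ReviewState) -> dict:
    # Stub: would resolve conflicts and assemble the final answer.
    return {"answer": "; ".join(state["candidates"])}


builder = StateGraph(ReviewState)
builder.add_node("staff_architect", staff_architect)
builder.add_node("chief_architect", chief_architect)
builder.set_entry_point("staff_architect")
builder.add_edge("staff_architect", "chief_architect")
builder.add_edge("chief_architect", END)
graph = builder.compile()

if __name__ == "__main__":
    result = graph.invoke({"question": "harden the API layer",
                           "candidates": [], "answer": ""})
    print(result["answer"])
```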

These ongoing efforts are shaping the future directions for software architecture using AI [25:53:00].