From: aidotengineer

As user interaction paradigms shift, the focus is increasingly on building excellent user experiences around AI assistants [00:00:10]. This fundamental change in how users interact with software [00:00:24] is being driven in large part by the Model Context Protocol (MCP) [00:00:27].

The Jarvis Ideal

The ultimate vision for AI interaction is exemplified by Tony Stark’s AI assistant, Jarvis [00:01:25]. Jarvis can perform a wide range of tasks, including:

  • Compiling databases from various sources (e.g., SHIELD, FBI, CIA intercepts) [00:01:52].
  • Initiating virtual crime scene reconstructions [00:02:02].
  • Generating user interfaces (UI) on demand [00:03:11].
  • Accessing public records [00:02:21].
  • Bringing up and analyzing complex data like thermogenic signatures [00:02:26].
  • Performing complex data joins across different datasets (e.g., thermogenic occurrences with Mandarin attack locations) [00:02:38].
  • Creating flight plans [00:02:51].
  • Interacting with physical systems like doorbells [00:03:31].

Crucially, Jarvis supports not just voice, but also typing and gestures, generating dynamic UIs for seamless interaction [00:04:53]. While some of these capabilities, like UI generation, are technically possible today [00:04:15], a personal Jarvis-like assistant is not yet commonplace [00:04:21].

The Integration Challenge

The primary barrier to achieving a universal AI assistant like Jarvis is the complexity of building myriad integrations [00:05:15]. It is incredibly difficult for major AI providers (like Google or OpenAI) to build integrations for every possible service, especially niche ones like a local city government website for reserving park pavilions [00:05:40]. Without a way to interface with everything, the incentive to invest heavily in broad integration efforts diminishes [00:05:54]. Users desire one central AI capable of augmenting itself with any capability in the world [00:10:28].

Evolution of AI Interaction Protocols

The history and architecture of AI interaction protocols can be divided into three phases [00:06:50]:

Phase 1: ChatGPT and Manual Context (Circa 3 years ago)

This phase began with the advent of ChatGPT [00:06:57]. Its pivotal contribution was not merely the Large Language Model (LLM), which had existed for some time, but the host application layer around it that provided a good user experience for interfacing with an LLM [00:07:33]. This led to significant investment and rapid improvement in LLMs [00:07:48].

However, the major limitation was the need for users to manually provide context by copying and pasting text or images into the LLM, and then manually extracting results [00:07:57]. While the LLM could answer questions, it couldn’t do anything directly, and managing this context was cumbersome [00:08:27].

Phase 2: Host Application Tooling

In this phase, the host application began to interface directly with the LLM, informing it about available services and providing additional context when needed [00:08:35]. This enabled the AI to perform actions using tools like search engines, calendar integrations, or Slack integrations [00:08:48].

Despite this progress, the functionality remained limited by the development time of the host application provider (e.g., OpenAI, Anthropic) to build integrations [00:09:24]. Proprietary plugin systems (like OpenAI’s GPT plugin system) [00:09:51] meant developers had to build separate integrations for each platform, which was not scalable [00:10:04]. Users don’t want multiple LLM wrappers; they want one central AI that can interface with everything [00:10:27].
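
To make this limitation concrete, the sketch below shows the kind of per-platform tool wiring a host had to do in this phase, using OpenAI’s Node SDK purely as an illustration; the search_calendar tool and its parameters are hypothetical, and a host built on another vendor’s platform would need its own separately built equivalent.

    import OpenAI from "openai";

    // The host application itself declares every tool it supports, per platform.
    // "search_calendar" and its parameters are invented for this illustration.
    const openai = new OpenAI();

    const completion = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: "What is on my calendar tomorrow?" }],
      tools: [
        {
          type: "function",
          function: {
            name: "search_calendar",
            description: "Search the user's calendar for events on a given date",
            parameters: {
              type: "object",
              properties: { date: { type: "string", description: "ISO 8601 date" } },
              required: ["date"],
            },
          },
        },
      ],
    });

    // If the model decides to call the tool, the host must execute the
    // integration itself and send the result back in a follow-up request.
    const toolCalls = completion.choices[0].message.tool_calls;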

Phase 3: Model Context Protocol (MCP)

This is the current and future phase, where Model Context Protocol (MCP) acts as a standard protocol that all AI assistants support or will soon support [00:10:58]. By building to the MCP specification, developers can ensure their services are usable by any AI assistant [00:11:12]. This standardization is expected to bring widespread access to Jarvis-like AI capabilities [00:11:29].
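
As a rough illustration of what “building to the MCP specification” can look like, here is a minimal server sketch using the official TypeScript SDK (@modelcontextprotocol/sdk). The get_weather tool, its parameters, and the stubbed response are assumptions for the example rather than anything shown in the talk.

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import { z } from "zod";

    // A service provider describes its capabilities once, against the standard spec.
    const server = new McpServer({ name: "weather", version: "1.0.0" });

    // Hypothetical tool: any MCP-capable assistant can discover and call it.
    server.tool(
      "get_weather",
      "Get current weather conditions for a set of coordinates",
      { latitude: z.number(), longitude: z.number() },
      async ({ latitude, longitude }) => ({
        // A real server would call a weather API here; this is a stub.
        content: [{ type: "text", text: `Clear skies at ${latitude}, ${longitude}` }],
      }),
    );

    // Expose the server over stdio so any MCP host can connect to it.
    await server.connect(new StdioServerTransport());

Because the server only ever speaks the protocol, the provider keeps full control over its domain logic while every MCP-capable host gets the same interface.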

MCP Architecture and Tool Usage

The architecture of MCP involves several key components [00:11:40]:

  • Host Application: Communicates with the LLM and dynamically manages the available services [00:11:42].
  • LLM: Knows what services are available based on the host application’s context and selects the most appropriate tool for the user’s query [00:11:59].
  • Client: The host application creates a standard client for each service it wants to interface with [00:12:09]. Because the client speaks the standard protocol, no service-specific integration code is needed [00:12:16].
  • Service Provider: Creates the MCP servers that interface with unique tools, resources, and specific prompts [00:12:25]. This allows service providers to control the unique aspects of their service while maintaining a standard communication interface with the client [00:12:36].

This standardized communication is what enables AI to “have hands” and perform real-world actions [00:12:55].
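
On the host side, one generic MCP client can talk to any such server. The sketch below again uses the TypeScript SDK; the command line, server file name, and tool arguments are assumed for illustration.

    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    // One standard client per server; the host writes no service-specific code.
    const client = new Client({ name: "my-host-app", version: "1.0.0" });

    // Launch and connect to the hypothetical weather server from the earlier sketch.
    await client.connect(
      new StdioClientTransport({ command: "node", args: ["weather-server.js"] }),
    );

    // The host forwards this tool list to the LLM so it knows what is available.
    const { tools } = await client.listTools();

    // When the LLM picks a tool for the user's query, the host relays the call
    // through the same standard interface and returns the result to the LLM.
    const result = await client.callTool({
      name: "get_weather",
      arguments: { latitude: 40.7, longitude: -74.0 },
    });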

MCP Demo Example

An example demo of MCP in action illustrates its capabilities:

  1. Journal Entry Request: A user prompts an LLM configured with MCP servers to “write a journal entry,” asking it to derive the current location and weather and turn them into a short story [00:13:34].
  2. Location and Weather Tools: The LLM uses a “locationator” MCP server to determine the current location [00:13:55] and a “get weather” server to retrieve weather conditions [00:14:32].
  3. Authentication: It then calls an “EpicMe” MCP server’s authenticate tool. The user provides an email, receives an authentication token (using OAuth 2.1), and logs in, demonstrating secure authentication built into the MCP server [00:14:43].
  4. Journaling and Tagging: After authentication, the LLM requests the creation of a journal entry, configures inputs, and checks for available tags, even creating new ones (e.g., “travel”) as needed [00:15:35] (a plausible shape for such a tool is sketched after this list).
  5. Dynamic Display: The LLM can retrieve the journal entry and format it for display, even rendering it in Markdown. This highlights how servers can respond with sensibly structured content and how clients can use that context to render dynamic UIs, a capability that is still maturing in today’s clients [00:16:40].
  6. Multilingual Capabilities: An MCP server might send responses in English, but the LLM can translate them into any language (e.g., Japanese), further enhancing the user experience [00:17:40].
  7. Action Execution: The demo concludes with the user instructing the AI to delete the post and log out, demonstrating the ability to perform authenticated actions [00:17:53].
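
The talk does not show the code behind these servers, but step 4 gives a sense of the shape a journaling tool on the “EpicMe” server might take. The sketch below is a plausible reconstruction only: the tool name, fields, and in-memory store are hypothetical, and the OAuth flow, tag listing, and the delete/logout actions from steps 3 and 7 would be additional tools layered on the same pattern.

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import { z } from "zod";

    const server = new McpServer({ name: "epicme-journal", version: "1.0.0" });

    // Hypothetical in-memory store standing in for the real service's database.
    const entries: { title: string; content: string; tags: string[] }[] = [];

    // A domain-specific tool the service provider controls; the LLM only ever
    // sees its name, description, and input schema through the standard protocol.
    server.tool(
      "create_journal_entry",
      "Create a journal entry with a title, body, and optional tags",
      {
        title: z.string(),
        content: z.string(),
        tags: z.array(z.string()).optional(),
      },
      async ({ title, content, tags }) => {
        entries.push({ title, content, tags: tags ?? [] });
        return {
          content: [
            { type: "text", text: `Created "${title}" with tags: ${(tags ?? []).join(", ")}` },
          ],
        };
      },
    );

    await server.connect(new StdioServerTransport());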

This shift means users will soon no longer need to navigate browsers or phrase search queries in specific ways; instead, they can speak naturally, and the AI will understand their intent and perform actions [00:18:15].

Conclusion

Model Context Protocol is a critical step in the future of software architecture and user interaction with AI. It provides a standard mechanism for AI assistants to communicate with a vast ecosystem of tools and services, overcoming the longstanding challenges of AI integration. This promises a future where AI assistants can truly “do anything” [00:11:01], enabling a seamless and intuitive user experience akin to the Jarvis ideal.

For further learning, explore the Model Context Protocol specification [00:19:04] and resources on EpicAI.pro, which covers MCP and broader AI topics [00:19:09].