From: aidotengineer
The Model Context Protocol (MCP) provides services to Large Language Models (LLMs), primarily enabling them to access and use various tools [00:01:20]. This allows LLMs to interact with external systems and perform complex tasks beyond their inherent knowledge.
How MCP Works
MCP functions as both a repository of information about tools and a service that executes these tools [00:01:47].
- Information Store: MCP holds details that help the LLM understand how to make calls to the tools [00:01:50].
- Tool Execution: When an LLM decides to make a tool call, the MCP tool service takes the action (e.g., navigating a webpage, processing a payment) [00:01:57].
- Response Return: After executing the tool, MCP returns a response to the LLM, which includes details of the result or additional guidance [00:02:07]. This allows the LLM to loop back, make another tool call, or provide a text-based response [00:02:14].
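To make this loop concrete, here is a minimal sketch against an OpenAI-compatible endpoint. The base URL, model name, and the synchronous mcp_session.call_tool helper are illustrative assumptions, not names from the talk.

```python
# Minimal sketch of the MCP tool-call loop: call the LLM, execute any tool
# calls via MCP, feed the results back, and stop when the model answers in text.
import json
from openai import OpenAI

# Assumed local OpenAI-compatible endpoint (e.g. a vLLM server); adjust as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def run_agent(messages, tools, mcp_session, model="Qwen/Qwen3-4B"):
    while True:
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content  # plain text response, the loop ends here
        for call in msg.tool_calls:
            # Hypothetical helper that executes the tool on the MCP side.
            result = mcp_session.call_tool(
                call.function.name, json.loads(call.function.arguments)
            )
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
```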
Services and Tools
While the primary focus is on browser use, MCP supports various other services:
- Browser Navigation: This is a key application, enabling an LLM to navigate through websites [00:01:26]. When an LLM navigates to a page, an “accessibility tree” (a text description of the page) is returned to the LLM, allowing it to read the content [00:08:57].
- Other Services: MCPs exist for platforms like Stripe, GitHub, and Gmail [00:01:32]. Users can substitute these into examples to explore different functionalities [00:01:36]. Playwright, for instance, offers 25 tools for browser interaction, including navigation, switching tabs, and clicking links [00:09:49].
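For example, the tools a Playwright MCP server exposes can be listed directly with the MCP Python SDK. The sketch below assumes the mcp package and the @playwright/mcp npm server; a Stripe, GitHub, or Gmail server could be substituted to explore other services.

```python
# Sketch: start a Playwright MCP server over stdio and list the tools it exposes.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(command="npx", args=["-y", "@playwright/mcp@latest"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:  # roughly 25 browser-interaction tools
                print(tool.name, "-", tool.description)

asyncio.run(main())
```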
Limitations on Tool Count
For open-source models, it is generally recommended to expose no more than 25 to 50 tools, since a larger set adds excessive context and can confuse the LLM [00:10:01]. More capable models like Claude might handle up to 200 tools [00:10:05].
Integrating LLMs with MCP
To enable an LLM to utilize MCP tools, the language model is typically exposed as an OpenAI-style API endpoint [00:02:31]. This requires several points of integration and data translation:
- MCP Tool Information to JSON: Tool information from MCP services must be converted into lists of JSON tools, as expected by OpenAI endpoints [00:03:02]. This conversion happens when preparing tools for the LLM [00:21:52].
- Tool Response Formatting: The tool response received from MCP needs to be converted into a format the language model expects [00:03:11].
- Tool Call Extraction: When the LLM calls a tool by emitting tokens or text, the system must detect whether it wants to make a tool call and extract that call [00:03:21]. The text format for tool calls, such as the Hermes format for a Qwen model, must be parsed correctly into JSON [00:03:35]; both translation steps are sketched below.
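A sketch of both steps, assuming MCP tool objects carry name, description, and inputSchema fields (as in the Python SDK) and that the model emits Hermes-style <tool_call> tags; the exact tag name depends on the chat template in use.

```python
# Sketch: (1) convert MCP tool metadata into the OpenAI "tools" JSON format,
# (2) extract a Hermes-format tool call from the model's raw text output.
import json
import re

def mcp_tool_to_openai(tool):
    """Map one MCP tool description onto an OpenAI-style tool entry."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or "",
            "parameters": tool.inputSchema,  # MCP exposes a JSON Schema here
        },
    }

def extract_tool_call(text):
    """Pull the JSON tool call out of Hermes-style <tool_call>...</tool_call> tags."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    return json.loads(match.group(1)) if match else None

# extract_tool_call('<tool_call>{"name": "browser_navigate", '
#                   '"arguments": {"url": "https://trellis.com"}}</tool_call>')
# -> {"name": "browser_navigate", "arguments": {"url": "https://trellis.com"}}
```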
Prompt Structure
The interaction with the LLM involves a specific prompt structure:
- System Message: This part, often enclosed in system-start and system-end tags, describes to the LLM how to make tool calls, typically by passing JSON objects within XML tags like <tool_code> [00:03:58]. It informs the LLM about available tools and the expected format for function calls [00:04:15].
- User Message: The user’s request (e.g., “navigate to trellis.com”) is provided [00:04:33].
- Assistant Response: The assistant (LLM) responds, potentially by thinking and then deciding to call a tool (e.g., navigating to a URL) or providing a text-based answer if a task is completed [00:04:38].
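One way to see this structure concretely is to render it through the tokenizer's chat template, which wraps each part in the system/user/assistant start and end tags the model is trained on. The checkpoint name and tool definition below are illustrative.

```python
# Sketch: render the system/user prompt (including the tool block) that a Qwen
# model would actually see, using the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # illustrative checkpoint

tools = [{
    "type": "function",
    "function": {
        "name": "browser_navigate",
        "description": "Navigate the browser to a URL.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

messages = [{"role": "user", "content": "Navigate to trellis.com"}]

prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(prompt)  # shows the system block with tool definitions, then the user turn
```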
Fine-tuning LLMs for Tool Use
To improve an LLM’s performance with tool use, a process of fine-tuning using high-quality reasoning traces is employed [00:00:15].
Data Collection
High-quality MCP agent reasoning traces are generated and saved, including the tools used and multi-turn conversation histories [00:00:26]. This data is crucial for fine-tuning.
- Model Consistency: It’s recommended to maintain consistency between the model used to generate the data and the model intended for fine-tuning [00:05:54]. OpenAI models typically do not share their thinking traces, making open-source models like Qwen more suitable for data generation [00:06:06].
- Reasoning Parsing: When generating data, enabling reasoning and a reasoning parser ensures that "think" tokens from the model are detected and extracted into a JSON format within the response [00:06:52].
- Trace Truncation: Tool responses, especially accessibility trees from browser navigation, can be very long [00:09:07]. Truncating these responses keeps the context length manageable, though it means the LLM might not see the full page content [00:09:09].
- Manual Adjustment: Traces can be manually adjusted to clean up noise or guide the LLM if it doesn’t follow desired actions [00:15:08]. A system prompt can be used during data generation to direct the LLM on tool calls, which can then be excluded from the final training data [00:16:03].
- Unrolling Data: For multi-turn conversations, the data is “unrolled” into multiple rows. For example, three turns are expanded into three rows, providing more training data points [00:18:21]. This is important because the Qwen template typically only includes reasoning from the most recent turn [00:18:41].
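A minimal sketch of the unrolling step, assuming each saved trace is a dict with messages and tools keys (placeholder field names, not a specific dataset schema):

```python
# Sketch: unroll one multi-turn trace into one training row per assistant turn,
# so each assistant message (with its reasoning) becomes a completion target.
def unroll_trace(trace):
    rows = []
    for i, msg in enumerate(trace["messages"]):
        if msg["role"] == "assistant":
            rows.append({
                "messages": trace["messages"][: i + 1],  # history up to this turn
                "tools": trace["tools"],
            })
    return rows

# A trace containing three assistant turns yields three rows.
```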
Fine-tuning Process
Once data is collected, it is prepared for fine-tuning:
- Model Loading: Models like the 4-billion-parameter Qwen model can be trained, with considerations for sequence length and GPU memory [00:23:16].
- Applying Adapters: Instead of training all parameters, low-rank adapters (LoRA) are applied to specific parts of the model, such as the attention modules and MLP layers [00:23:50]. This makes fine-tuning more efficient and requires less VRAM [00:27:39] (see the sketch after this list).
- Data Formatting: The collected messages and tools are templated into a single long string of text for the trainer [00:28:22].
- Training Parameters: Training often uses a small batch size (e.g., one) due to VRAM limitations, making the training loss “jumpy” [00:28:35]. Training for one epoch with a relatively high learning rate is common for small models [00:28:48]. Only a small percentage of parameters (e.g., 1.62%) are trained, with the main weights frozen [00:30:15].
- Evaluation: After fine-tuning, the model’s performance is re-evaluated to see if the tool calling and reasoning capabilities have improved [00:33:40].
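A sketch of this setup with the peft and trl libraries; the checkpoint name, LoRA rank, learning rate, and sequence length are illustrative values, not the exact configuration from the talk.

```python
# Sketch: LoRA fine-tuning of a ~4B Qwen model on templated trace text.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model_name = "Qwen/Qwen3-4B"  # illustrative 4B checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Low-rank adapters on the attention and MLP projection layers only.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

# Placeholder dataset: each row is one unrolled, chat-templated trace string.
dataset = Dataset.from_list([{"text": "<templated multi-turn trace>"}])

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen-mcp-sft",
        per_device_train_batch_size=1,  # VRAM-limited, hence the "jumpy" loss
        num_train_epochs=1,
        learning_rate=2e-4,             # relatively high LR for a small model
        max_seq_length=8192,            # bound sequence length to fit GPU memory
    ),
)
trainer.train()
```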
Beyond Supervised Fine-Tuning
While the focus is on Supervised Fine-Tuning (SFT) using high-quality manual traces, reinforcement learning (RL) methods like GRPO can be applied later [00:32:02]. However, it is highly beneficial to perform SFT on high-quality traces first, as this keeps the model from struggling and speeds up subsequent RL training [00:32:23]. For RL, defining rewards requires a dataset with verifiable correct answers, which can be challenging to generate systematically [00:32:52].
Even without moving to RL, significant performance improvements can be achieved with a relatively small number of curated examples (e.g., 50-100 traces), especially for common or critical use cases [00:34:48].