From: aidotengineer
Model Context Protocol (MCP) is a framework designed to enable Large Language Models (LLMs) to access and utilize external tools and services [00:00:10]. This protocol facilitates the interaction between an LLM and various tools, allowing the model to perform actions beyond its inherent text generation capabilities [00:01:04].
Core Functionality of MCP
MCP serves two primary functions:
- Information Store for Tools: It acts as a repository of information about available tools, providing the LLM with the necessary details on how to make calls to these tools or otherwise utilize them [00:01:47].
- Tool Execution Service: The MCP tool service is responsible for running the tools when an LLM decides to make a call [00:01:57]. After executing an action (e.g., adding numbers, navigating a webpage), it returns a response containing the result or guidance for the LLM’s next action [00:02:07].
Common tools accessible via MCP include browser use for navigating websites, Stripe, GitHub, and Gmail integrations [00:01:26].
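As a concrete illustration of these two roles, below is a minimal sketch using the official MCP Python SDK; the stdio server script (server.py) and the add tool are hypothetical placeholders, not something from the talk.

```python
# Minimal sketch using the official MCP Python SDK ("mcp" package). The server
# script name and the "add" tool are hypothetical placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(command="python", args=["server.py"])  # hypothetical local server
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # 1) Information store: list each tool's name, description, and input schema,
            #    which is what gets described to the LLM.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, tool.description, tool.inputSchema)

            # 2) Execution service: run a tool on the LLM's behalf and return the result.
            result = await session.call_tool("add", arguments={"a": 2, "b": 3})
            print(result.content)


asyncio.run(main())
```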
Integrating LLMs with MCP
To integrate an LLM with MCP, the language model is typically exposed as an API, often in the style of an OpenAI endpoint [00:02:27]. This setup requires several points of integration and data translation [00:02:51]:
- Tool Information Conversion: Tool information received from MCP services must be converted into a list of JSON tools, as expected by OpenAI-style endpoints [00:03:02].
- Tool Response Formatting: The tool response must be converted into a format the LLM expects [00:03:11].
- Tool Call Extraction: When the LLM calls a tool by emitting tokens or text, the system must detect and extract the tool call from the text, specifically converting it from a format like Hermes into a JSON structure that the OpenAI API expects [00:03:21].
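A rough sketch of these three translation steps is shown below; the MCP field names (name, description, inputSchema) and the Hermes tag format are assumptions based on common versions and may differ slightly.

```python
# Sketch of the three translation steps between MCP and an OpenAI-style endpoint.
# The MCP field names and the Hermes tag format are assumptions and may vary.
import json
import re


def mcp_tool_to_openai(tool) -> dict:
    """Convert an MCP tool description into the JSON tool format OpenAI-style endpoints expect."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or "",
            "parameters": tool.inputSchema,  # MCP already provides a JSON Schema here
        },
    }


def format_tool_response(tool_call_id: str, content: str) -> dict:
    """Wrap a tool result as the 'tool' role message the LLM expects to see next."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}


def extract_hermes_tool_calls(text: str) -> list[dict]:
    """Pull Hermes-style <tool_call>{...}</tool_call> blocks out of raw model output."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(block) for block in pattern.findall(text)]
```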
Prompt Structure for Tool Calls
The interaction with the LLM is managed through a specific prompt structure [00:04:47]:
- System Message: This initial part of the prompt, enclosed in a system start tag, describes to the LLM how to make tool calls [00:03:56]. It instructs the LLM to pass tool calls as JSON objects within tool XML tags [00:04:04].
- User Message: This is the input from the user (e.g., "navigate to trellis.com") [00:04:33].
- Assistant Response: The assistant (LLM) responds, potentially performing “thinking” and then deciding to either call a tool (e.g., navigating with a browser) or provide a text-based response after completing a task [00:04:38].
The entire conversation, including system messages, user inputs, assistant thinking, tool calls, and tool responses, forms a “trace” that is crucial for fine-tuning [00:12:15].
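For concreteness, a logged trace might be shaped roughly like the following; all contents are invented for illustration, and the exact message fields depend on the endpoint and chat template.

```python
# Illustrative trace: the full message history plus the available tools is what
# gets logged for fine-tuning. All contents below are invented for illustration.
trace = {
    "tools": [...],  # OpenAI-style JSON tool definitions exposed via MCP
    "messages": [
        {"role": "system", "content": "You may call tools by emitting JSON inside tool XML tags."},
        {"role": "user", "content": "navigate to trellis.com"},
        {
            "role": "assistant",
            "content": "<think>The user wants the browser opened at trellis.com.</think>",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "browser.navigate", "arguments": "{\"url\": \"https://trellis.com\"}"},
            }],
        },
        {"role": "tool", "tool_call_id": "call_0", "content": "Accessibility tree of the page..."},
        {"role": "assistant", "content": "The page is open; its top two lines read..."},
    ],
}
```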
Agent Setup for Data Collection
To collect high-quality traces for fine-tuning, an agent is set up to interact with the MCP services [00:00:26].
- Repository: All necessary materials and scripts are available in the Trellis Research AI Worlds Fair 2025 repository, specifically in the MCP agent fine-tune folder [00:00:45].
- Model Selection: It is recommended to use a consistent model for both data generation and fine-tuning; for example, a Qwen-type agent is used to generate traces if a Qwen model will be fine-tuned later [00:05:54]. This is because OpenAI models typically do not share their "thinking" traces, which are valuable for training [00:06:06]. A 30-billion-parameter Qwen model (mixture of experts) is suggested, running on a service like RunPod [00:06:13].
- Endpoint Configuration: The LLM is configured as an OpenAI-style endpoint [00:05:42]. Key configurations include:
- Enabling reasoning and a reasoning parser to extract “think tokens” into a JSON format [00:06:52].
- Setting a max model length (e.g., 32,000 tokens) [00:07:09].
- Enabling automatic tool choice, allowing the LLM to decide when and which tool to call [00:07:16].
- Specifying a tool parser (e.g., Hermes) to extract tool calls into JSON format [00:07:22].
- Tool Response Truncation: A truncate argument can be used to limit the length of tool responses (e.g., browser accessibility trees) to manage context length for the LLM [00:08:42].
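The talk does not pin down the exact serving stack beyond an OpenAI-style endpoint. As a rough illustration, assuming a vLLM-style server, the configuration above might translate into a launch along these lines; the flag names and the model id are assumptions and can differ between versions.

```python
# Rough sketch of launching an OpenAI-compatible endpoint with reasoning and tool
# parsing enabled. Flag names follow vLLM's server and may differ by version; the
# model id is an assumption.
import subprocess

subprocess.run(
    [
        "vllm", "serve", "Qwen/Qwen3-30B-A3B",  # assumed 30B mixture-of-experts Qwen model
        "--max-model-len", "32000",             # cap the context length
        "--enable-auto-tool-choice",            # let the model decide when and which tool to call
        "--tool-call-parser", "hermes",         # extract Hermes-style tool calls into JSON
        "--reasoning-parser", "deepseek_r1",    # pull "think" tokens into a separate field
    ],
    check=True,
)
```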
Running the Agent and Generating Traces
When the agent runs, it:
- Starts the MCP server, loading configured tools (e.g., 25 Playwright browser tools) [00:09:37].
- Takes user input (e.g., “navigate to trellis.com and read out the top two lines”) [00:10:15].
- Sends the user message to the LLM, which then generates “thinking tokens” (reasoning) [00:10:25].
- If the LLM decides to make a tool call (e.g., browser.navigate), it presents the call for user approval [00:10:34].
- Upon approval, the MCP service executes the tool (e.g., opening a browser window if not in headless mode) [00:10:50].
- The tool returns a response (e.g., an accessibility tree of the webpage), which is sent back to the LLM [00:11:41].
- The LLM then processes this information, potentially performing more thinking and producing a final text-based response [00:12:59].
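Sketched in code, this loop might look roughly like the following, using the openai client against the local endpoint; the helper functions (ask_approval, run_mcp_tool), the endpoint URL, and the simple truncation are placeholders rather than the talk's actual implementation.

```python
# Simplified agent loop: send the conversation to the endpoint, surface proposed
# tool calls for approval, execute them via MCP, and feed the results back until
# the model answers in plain text. ask_approval/run_mcp_tool are placeholders.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local endpoint


def ask_approval(call) -> bool:
    """Human-in-the-loop gate before any tool is executed."""
    answer = input(f"Run {call.function.name}({call.function.arguments})? [y/N] ")
    return answer.strip().lower() == "y"


def run_mcp_tool(name: str, arguments: dict) -> str:
    """Placeholder: in the real agent this wraps ClientSession.call_tool (see the MCP sketch above)."""
    raise NotImplementedError


def run_agent(messages: list[dict], tools: list[dict], model: str) -> list[dict]:
    while True:
        reply = client.chat.completions.create(
            model=model, messages=messages, tools=tools, tool_choice="auto"
        )
        msg = reply.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))  # keep the full trace, thinking included

        if not msg.tool_calls:  # plain-text answer means the task is finished
            return messages

        for call in msg.tool_calls:
            if not ask_approval(call):
                continue
            result = run_mcp_tool(call.function.name, json.loads(call.function.arguments))
            # Truncate long tool output (e.g., accessibility trees) to protect context length.
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)[:4000]})
```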
All these interactions are logged as “traces,” comprising the full conversation history (messages) and a list of available tools [00:12:14]. These traces are invaluable for fine-tuning a smaller model to acquire similar capabilities [00:13:09].
Curating and Storing Traces
To optimize the fine-tuning process, traces can be curated:
- Manual Adjustment: Traces can be manually adjusted if the LLM’s behavior isn’t ideal, either by deleting turns or combining sections [00:17:03].
- System Prompts: A guiding system prompt can be used during data generation to help the LLM create a “nice tidy trace” without needing to include the prompt in the final training data [00:16:03].
- Data Unrolling: Multi-turn conversations are "unrolled" into multiple rows in the dataset. For instance, a three-turn conversation becomes three separate rows, each representing a different point in the conversation, allowing the model to train on various interaction lengths [00:18:13]. This is particularly important because models like Qwen only include reasoning from the most recent turn in their template [00:18:37] (see the sketch after this list).
- Pushing to Hub: The curated dataset, containing both tools and conversations, is pushed to a platform like Hugging Face Hub [00:17:51].
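A minimal sketch of the unrolling and upload steps, assuming traces shaped like the example earlier; the dataset repository id and field names are hypothetical.

```python
# Unroll each multi-turn trace into one training row per assistant turn, then
# push the dataset to the Hugging Face Hub. Repo id and field names are illustrative.
from datasets import Dataset


def unroll(trace: dict) -> list[dict]:
    rows = []
    for i, message in enumerate(trace["messages"]):
        if message["role"] == "assistant":
            # Each row contains the conversation only up to this assistant turn.
            rows.append({"tools": trace["tools"], "messages": trace["messages"][: i + 1]})
    return rows


traces: list[dict] = []  # fill with curated traces collected from the agent runs
rows = [row for trace in traces for row in unroll(trace)]
Dataset.from_list(rows).push_to_hub("your-username/mcp-agent-traces")  # hypothetical repo id
```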
Fine-tuning Process
The collected traces are then used for fine-tuning an LLM:
- Environment Setup: This often involves installing libraries like Unsloth and setting up a runtime environment (e.g., on Colab or RunPod) [00:23:01].
- Model Loading: A smaller model, such as a 4-billion-parameter Qwen model, is loaded for fine-tuning [00:23:16].
- Applying Adapters: Instead of training all parameters, adapters are applied to specific parts of the model, such as attention modules and MLP layers, using techniques like Low Rank Adapters (LoRA) [00:23:50]. This freezes most of the main weights and only trains a small percentage of parameters [00:30:15].
- Data Preparation: The previously collected dataset is loaded, and messages and tools are templated into a single long string of text, which serves as the input for training [00:25:09].
- Training Parameters:
- Batch Size: Often set to 1 due to VRAM limitations, though a larger batch size (e.g., 32) is ideal for smoother training [00:28:34].
- Epochs: A small number of epochs (e.g., one) can be used for initial training [00:28:48].
- Learning Rate: A relatively high learning rate can be used for smaller models [00:28:58].
- Optimizer: Optimizers like AdamW 8-bit can be used to save VRAM [00:29:03].
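A condensed sketch of these steps using Unsloth and TRL is shown below; the model id, LoRA rank, dataset id, and hyperparameters are illustrative assumptions, and exact argument names may vary across library versions.

```python
# Condensed LoRA fine-tuning sketch with Unsloth + TRL. The model id, LoRA rank,
# dataset id, and hyperparameters are illustrative; argument names can differ by version.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",  # smaller model to fine-tune (assumed id)
    max_seq_length=32_000,
    load_in_4bit=True,
)

# Attach LoRA adapters to attention and MLP projections; the base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("your-username/mcp-agent-traces", split="train")  # hypothetical repo id


def to_text(row):
    # Template messages and tools into one long training string.
    return {"text": tokenizer.apply_chat_template(row["messages"], tools=row["tools"], tokenize=False)}


dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,  # VRAM-limited; a larger batch trains more smoothly
        num_train_epochs=1,
        learning_rate=2e-4,             # relatively high, reasonable for a small model
        optim="adamw_8bit",             # 8-bit AdamW to save VRAM
        dataset_text_field="text",
        output_dir="outputs",
    ),
)
trainer.train()
```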
Reinforcement Learning (RL) Considerations
While the primary focus is supervised fine-tuning (SFT), reinforcement learning (RL) can be used to further automate trace generation or improve performance. However, it’s highly recommended to first perform SFT on high-quality, manually curated traces [00:32:02]. This initial SFT helps the model generate good traces more frequently, speeding up later RL training by ensuring it more often reaches scenarios where it receives a positive reward [00:32:33]. For RL, defining clear rewards based on verifiable correct answers is crucial [00:32:51].
Post-Training and Evaluation
After fine-tuning:
- Saving and Pushing: The fine-tuned model and tokenizer can be saved and optionally pushed to Hugging Face Hub, often merged into 16-bit format [00:30:30] (see the sketch after this list).
- Inference Endpoint Update: The name of the fine-tuned model can be swapped into the inference endpoint configuration, creating a ready-to-use endpoint with improved performance [00:30:46].
- Evaluation: While complex evaluation setups are beyond the scope of a simple demonstration, one would typically run the fine-tuned model on the endpoint to assess its performance on new tasks [00:34:07].
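Continuing from the fine-tuning sketch above, the merged save and push might look like this with Unsloth's helpers; the repository id is a placeholder.

```python
# Continuing from the fine-tuning sketch above: merge the LoRA adapters into
# 16-bit weights and (optionally) push to the Hub. The repo id is a placeholder.
model.save_pretrained_merged("merged-model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("your-username/qwen3-4b-mcp-agent", tokenizer, save_method="merged_16bit")
```

The merged model id can then be swapped into the serving configuration from earlier to create the updated inference endpoint.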
Even with a small dataset (e.g., 50-100 examples), significant performance improvements can be achieved, especially for common or critical narrow use cases [00:34:50].
For more detailed information on MCP and creating custom servers, additional videos on the Trellis Research YouTube channel are available [00:35:11].