From: aidotengineer
Model Context Protocol (MCP) is a framework designed to enable Large Language Models (LLMs) to access and utilize external tools and services [00:00:10]. This protocol facilitates the interaction between an LLM and various tools, allowing the model to perform actions beyond its inherent text generation capabilities [00:01:04].
Core Functionality of MCP
MCP serves two primary functions:
- Information Store for Tools: It acts as a repository of information about available tools, providing the LLM with the necessary details on how to make calls to these tools or otherwise utilize them [00:01:47].
- Tool Execution Service: The MCP tool service is responsible for running the tools when an LLM decides to make a call [00:01:57]. After executing an action (e.g., adding numbers, navigating a webpage), it returns a response containing the result or guidance for the LLM’s next action [00:02:07].
Common tools accessible via MCP include browser use for navigating websites, Stripe, GitHub, and Gmail integrations [00:01:26].
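As a concrete illustration of these two roles, below is a minimal sketch using the official MCP Python SDK; the stdio server script (server.py) and the add tool are hypothetical placeholders, not something from the talk.

```python
# Minimal sketch using the official MCP Python SDK ("mcp" package). The server
# script name and the "add" tool are hypothetical placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(command="python", args=["server.py"])  # hypothetical local server
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # 1) Information store: list each tool's name, description, and input schema,
            #    which is what gets described to the LLM.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, tool.description, tool.inputSchema)

            # 2) Execution service: run a tool on the LLM's behalf and return the result.
            result = await session.call_tool("add", arguments={"a": 2, "b": 3})
            print(result.content)


asyncio.run(main())
```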
Integrating LLMs with MCP
To integrate an LLM with MCP, the language model is typically exposed as an API, often in the style of an OpenAI endpoint [00:02:27]. This setup requires several points of integration and data translation [00:02:51]:
- Tool Information Conversion: Tool information received from MCP services must be converted into a list of JSON tools, as expected by OpenAI-style endpoints [00:03:02].
- Tool Response Formatting: The tool response must be converted into a format the LLM expects [00:03:11].
- Tool Call Extraction: When the LLM calls a tool by emitting tokens or text, the system must detect and extract the tool call from the text, specifically converting it from a format like Hermes into a JSON structure that the OpenAI API expects [00:03:21].
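A rough sketch of these three translation steps is shown below; the MCP field names (name, description, inputSchema) and the Hermes tag format are assumptions based on common versions and may differ slightly.

```python
# Sketch of the three translation steps between MCP and an OpenAI-style endpoint.
# The MCP field names and the Hermes tag format are assumptions and may vary.
import json
import re


def mcp_tool_to_openai(tool) -> dict:
    """Convert an MCP tool description into the JSON tool format OpenAI-style endpoints expect."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or "",
            "parameters": tool.inputSchema,  # MCP already provides a JSON Schema here
        },
    }


def format_tool_response(tool_call_id: str, content: str) -> dict:
    """Wrap a tool result as the 'tool' role message the LLM expects to see next."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}


def extract_hermes_tool_calls(text: str) -> list[dict]:
    """Pull Hermes-style <tool_call>{...}</tool_call> blocks out of raw model output."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(block) for block in pattern.findall(text)]
```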
Prompt Structure for Tool Calls
The interaction with the LLM is managed through a specific prompt structure [00:04:47]:
- System Message: This initial part of the prompt, enclosed in a system start tag, describes to the LLM how to make tool calls [00:03:56]. It instructs the LLM to pass tool calls as JSON objects within tool XML tags [00:04:04].
- User Message: This is the input from the user (e.g., "navigate to trellis.com") [00:04:33].
- Assistant Response: The assistant (LLM) responds, potentially performing “thinking” and then deciding to either call a tool (e.g., navigating with a browser) or provide a text-based response after completing a task [00:04:38].
The entire conversation, including system messages, user inputs, assistant thinking, tool calls, and tool responses, forms a “trace” that is crucial for fine-tuning [00:12:15].
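For concreteness, a logged trace might be shaped roughly like the following; all contents are invented for illustration, and the exact message fields depend on the endpoint and chat template.

```python
# Illustrative trace: the full message history plus the available tools is what
# gets logged for fine-tuning. All contents below are invented for illustration.
trace = {
    "tools": [...],  # OpenAI-style JSON tool definitions exposed via MCP
    "messages": [
        {"role": "system", "content": "You may call tools by emitting JSON inside tool XML tags."},
        {"role": "user", "content": "navigate to trellis.com"},
        {
            "role": "assistant",
            "content": "<think>The user wants the browser opened at trellis.com.</think>",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "browser.navigate", "arguments": "{\"url\": \"https://trellis.com\"}"},
            }],
        },
        {"role": "tool", "tool_call_id": "call_0", "content": "Accessibility tree of the page..."},
        {"role": "assistant", "content": "The page is open; its top two lines read..."},
    ],
}
```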
Agent Setup for Data Collection
To collect high-quality traces for fine-tuning, an agent is set up to interact with the MCP services [00:00:26].
- Repository: All necessary materials and scripts are available in the Trellis Research AI Worlds Fair 2025 repository, specifically in the MCP agent fine-tune folder [00:00:45].
- Model Selection: It is recommended to use a consistent model for both data generation and fine-tuning; for example, a Qwen-type agent is used to generate traces if a Qwen model will be fine-tuned later [00:05:54]. This is because OpenAI models typically do not share their "thinking" traces, which are valuable for training [00:06:06]. A 30-billion-parameter Qwen model (mixture of experts) is suggested, running on a service like RunPod [00:06:13].
- Endpoint Configuration: The LLM is configured as an OpenAI-style endpoint [00:05:42]. Key configurations include:
- Enabling reasoning and a reasoning parser to extract “think tokens” into a JSON format [00:06:52].
- Setting a max model length (e.g., 32,000 tokens) [00:07:09].
- Enabling automatic tool choice, allowing the LLM to decide when and which tool to call [00:07:16].
- Specifying a tool parser (e.g., Hermes) to extract tool calls into JSON format [00:07:22].
- Tool Response Truncation: A truncate argument can be used to limit the length of tool responses (e.g., browser accessibility trees) to manage context length for the LLM [00:08:42].
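The talk does not pin down the exact serving stack beyond an OpenAI-style endpoint. As a rough illustration, assuming a vLLM-style server, the configuration above might translate into a launch along these lines; the flag names and the model id are assumptions and can differ between versions.

```python
# Rough sketch of launching an OpenAI-compatible endpoint with reasoning and tool
# parsing enabled. Flag names follow vLLM's server and may differ by version; the
# model id is an assumption.
import subprocess

subprocess.run(
    [
        "vllm", "serve", "Qwen/Qwen3-30B-A3B",  # assumed 30B mixture-of-experts Qwen model
        "--max-model-len", "32000",             # cap the context length
        "--enable-auto-tool-choice",            # let the model decide when and which tool to call
        "--tool-call-parser", "hermes",         # extract Hermes-style tool calls into JSON
        "--reasoning-parser", "deepseek_r1",    # pull "think" tokens into a separate field
    ],
    check=True,
)
```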
Running the Agent and Generating Traces
When the agent runs, it:
- Starts the MCP server, loading configured tools (e.g., 25 Playwright browser tools) [00:09:37].
- Takes user input (e.g., “navigate to trellis.com and read out the top two lines”) [00:10:15].
- Sends the user message to the LLM, which then generates “thinking tokens” (reasoning) [00:10:25].
- If the LLM decides to make a tool call (e.g., browser.navigate), it presents the call for user approval [00:10:34].
- Upon approval, the MCP service executes the tool (e.g., opening a browser window if not in headless mode) [00:10:50].
- The tool returns a response (e.g., an accessibility tree of the webpage), which is sent back to the LLM [00:11:41].
- The LLM then processes this information, potentially performing more thinking and producing a final text-based response [00:12:59].
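Sketched in code, this loop might look roughly like the following, using the openai client against the local endpoint; the helper functions (ask_approval, run_mcp_tool), the endpoint URL, and the simple truncation are placeholders rather than the talk's actual implementation.

```python
# Simplified agent loop: send the conversation to the endpoint, surface proposed
# tool calls for approval, execute them via MCP, and feed the results back until
# the model answers in plain text. ask_approval/run_mcp_tool are placeholders.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local endpoint


def ask_approval(call) -> bool:
    """Human-in-the-loop gate before any tool is executed."""
    answer = input(f"Run {call.function.name}({call.function.arguments})? [y/N] ")
    return answer.strip().lower() == "y"


def run_mcp_tool(name: str, arguments: dict) -> str:
    """Placeholder: in the real agent this wraps ClientSession.call_tool (see the MCP sketch above)."""
    raise NotImplementedError


def run_agent(messages: list[dict], tools: list[dict], model: str) -> list[dict]:
    while True:
        reply = client.chat.completions.create(
            model=model, messages=messages, tools=tools, tool_choice="auto"
        )
        msg = reply.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))  # keep the full trace, thinking included

        if not msg.tool_calls:  # plain-text answer means the task is finished
            return messages

        for call in msg.tool_calls:
            if not ask_approval(call):
                continue
            result = run_mcp_tool(call.function.name, json.loads(call.function.arguments))
            # Truncate long tool output (e.g., accessibility trees) to protect context length.
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)[:4000]})
```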
All these interactions are logged as “traces,” comprising the full conversation history (messages) and a list of available tools [00:12:14]. These traces are invaluable for fine-tuning a smaller model to acquire similar capabilities [00:13:09].
Curating and Storing Traces
To optimize the fine-tuning process, traces can be curated:
- Manual Adjustment: Traces can be manually adjusted if the LLM’s behavior isn’t ideal, either by deleting turns or combining sections [00:17:03].
- System Prompts: A guiding system prompt can be used during data generation to help the LLM create a “nice tidy trace” without needing to include the prompt in the final training data [00:16:03].
- Data Unrolling: Multi-turn conversations are "unrolled" into multiple rows in the dataset. For instance, a three-turn conversation becomes three separate rows, each representing a different point in the conversation, allowing the model to train on various interaction lengths [00:18:13]. This is particularly important because models like Qwen only include reasoning from the most recent turn in their template [00:18:37] (see the sketch after this list).
- Pushing to Hub: The curated dataset, containing both tools and conversations, is pushed to a platform like Hugging Face Hub [00:17:51].
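A minimal sketch of the unrolling and upload steps, assuming traces shaped like the example earlier; the dataset repository id and field names are hypothetical.

```python
# Unroll each multi-turn trace into one training row per assistant turn, then
# push the dataset to the Hugging Face Hub. Repo id and field names are illustrative.
from datasets import Dataset


def unroll(trace: dict) -> list[dict]:
    rows = []
    for i, message in enumerate(trace["messages"]):
        if message["role"] == "assistant":
            # Each row contains the conversation only up to this assistant turn.
            rows.append({"tools": trace["tools"], "messages": trace["messages"][: i + 1]})
    return rows


traces: list[dict] = []  # fill with curated traces collected from the agent runs
rows = [row for trace in traces for row in unroll(trace)]
Dataset.from_list(rows).push_to_hub("your-username/mcp-agent-traces")  # hypothetical repo id
```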
Fine-tuning Process
The collected traces are then used for fine-tuning an LLM:
- Environment Setup: This often involves installing libraries like Unsloth and setting up a runtime environment (e.g., on Colab or RunPod) [00:23:01].
- Model Loading: A smaller model, such as a 4-billion-parameter Qwen model, is loaded for fine-tuning [00:23:16].
- Applying Adapters: Instead of training all parameters, adapters are applied to specific parts of the model, such as attention modules and MLP layers, using techniques like Low Rank Adapters (LoRA) [00:23:50]. This freezes most of the main weights and only trains a small percentage of parameters [00:30:15].
- Data Preparation: The previously collected dataset is loaded, and messages and tools are templated into a single long string of text, which serves as the input for training [00:25:09].
- Training Parameters:
- Batch Size: Often set to 1 due to VRAM limitations, though a larger batch size (e.g., 32) is ideal for smoother training [00:28:34].
- Epochs: A small number of epochs (e.g., one) can be used for initial training [00:28:48].
- Learning Rate: A relatively high learning rate can be used for smaller models [00:28:58].
- Optimizer: Optimizers like AdamW 8-bit can be used to save VRAM [00:29:03].
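A condensed sketch of these steps using Unsloth and TRL is shown below; the model id, LoRA rank, dataset id, and hyperparameters are illustrative assumptions, and exact argument names may vary across library versions.

```python
# Condensed LoRA fine-tuning sketch with Unsloth + TRL. The model id, LoRA rank,
# dataset id, and hyperparameters are illustrative; argument names can differ by version.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",  # smaller model to fine-tune (assumed id)
    max_seq_length=32_000,
    load_in_4bit=True,
)

# Attach LoRA adapters to attention and MLP projections; the base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("your-username/mcp-agent-traces", split="train")  # hypothetical repo id


def to_text(row):
    # Template messages and tools into one long training string.
    return {"text": tokenizer.apply_chat_template(row["messages"], tools=row["tools"], tokenize=False)}


dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,  # VRAM-limited; a larger batch trains more smoothly
        num_train_epochs=1,
        learning_rate=2e-4,             # relatively high, reasonable for a small model
        optim="adamw_8bit",             # 8-bit AdamW to save VRAM
        dataset_text_field="text",
        output_dir="outputs",
    ),
)
trainer.train()
```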
Reinforcement Learning (RL) Considerations
While the primary focus is supervised fine-tuning (SFT), reinforcement learning (RL) can be used to further automate trace generation or improve performance. However, it’s highly recommended to first perform SFT on high-quality, manually curated traces [00:32:02]. This initial SFT helps the model generate good traces more frequently, speeding up later RL training by ensuring it more often reaches scenarios where it receives a positive reward [00:32:33]. For RL, defining clear rewards based on verifiable correct answers is crucial [00:32:51].
Post-Training and Evaluation
After fine-tuning:
- Saving and Pushing: The fine-tuned model and tokenizer can be saved and optionally pushed to Hugging Face Hub, often merged into 16-bit format [00:30:30] (see the sketch after this list).
- Inference Endpoint Update: The name of the fine-tuned model can be swapped into the inference endpoint configuration, creating a ready-to-use endpoint with improved performance [00:30:46].
- Evaluation: While complex evaluation setups are beyond the scope of a simple demonstration, one would typically run the fine-tuned model on the endpoint to assess its performance on new tasks [00:34:07].
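Continuing from the fine-tuning sketch above, the merged save and push might look like this with Unsloth's helpers; the repository id is a placeholder.

```python
# Continuing from the fine-tuning sketch above: merge the LoRA adapters into
# 16-bit weights and (optionally) push to the Hub. The repo id is a placeholder.
model.save_pretrained_merged("merged-model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("your-username/qwen3-4b-mcp-agent", tokenizer, save_method="merged_16bit")
```

The merged model id can then be swapped into the serving configuration from earlier to create the updated inference endpoint.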
Even with a small dataset (e.g., 50-100 examples), significant performance improvements can be achieved, especially for common or critical narrow use cases [00:34:50].
For more detailed information on MCP and creating custom servers, additional videos on the Trellis Research YouTube channel are available [00:35:11].