From: aidotengineer

This article provides a guide to fine-tuning language models with the Model Context Protocol (MCP), focusing on improving an agent’s performance by leveraging high-quality reasoning traces and tool interactions [00:15:00]. The process involves generating agent reasoning traces, saving multi-turn conversations and tool data, fine-tuning a model (e.g., Qwen 3), and evaluating the improved performance [00:26:00].

Introduction to MCP

MCP (Model Context Protocol) is a protocol designed to provide services to Large Language Models (LLMs), primarily granting them access to tools [01:20:00]. While the focus here is on browser use (LLMs navigating websites), MCPs exist for other services like Stripe, GitHub, and Gmail [01:29:00].

MCP performs several key functions [01:44:00]:

  • Information Store: It stores information about tools, helping the LLM understand how to make calls to them [01:47:00].
  • Tool Execution: The MCP tool service runs the tools itself [01:55:00].
  • Response Return: After an action, it returns a response containing the result or guidance for the LLM, enabling the LLM to make further tool calls or provide a text-based response [02:07:00].
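
For concreteness, below is a minimal sketch of this round trip from the client side, assuming the official `mcp` Python SDK and the Playwright MCP server; the tool name and API details are assumptions rather than the workshop’s code.

```python
# A rough sketch of the MCP loop: discover tools, execute one, read the response.
# Assumes the `mcp` Python SDK client API and the Playwright MCP server.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the Playwright MCP server as a subprocess and connect over stdio.
    server = StdioServerParameters(command="npx", args=["@playwright/mcp@latest"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Information store: descriptions the LLM uses to learn how to call each tool.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Tool execution + response: the MCP server runs the tool and returns a
            # result the LLM can read before deciding on its next action.
            result = await session.call_tool("browser_navigate", {"url": "https://example.com"})
            print(result.content)

asyncio.run(main())
```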

Integrating AI with Applications using MCP

To enable an LLM to interact with MCP services, the model is typically exposed as an OpenAI-style API endpoint [02:29:00]. This integration requires several points of translation [02:51:00]:

  1. Tool Information Conversion: MCP tool information must be converted into JSON lists, as expected by OpenAI endpoints [03:02:00].
  2. Tool Response Formatting: Tool responses need to be converted into a format the language model expects [03:11:00].
  3. Tool Call Detection: When the LLM emits tokens or text to call a tool, the system must detect and extract this call, typically in a specific format like Hermes [03:21:00].
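
As a rough illustration of these translation steps, the sketch below converts an MCP tool description into the OpenAI-style tools JSON and extracts Hermes-format tool calls from raw model text; the field names follow the MCP spec and the Hermes <tool_call> convention, but this is illustrative rather than the workshop’s code.

```python
import json
import re

def mcp_tool_to_openai(tool: dict) -> dict:
    """Convert one MCP tool description into the JSON entry an OpenAI-style
    endpoint expects in its `tools` list."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        },
    }

# Hermes-format tool calls are JSON objects wrapped in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(assistant_text: str) -> list[dict]:
    """Detect and parse {"name": ..., "arguments": {...}} calls in the model's output."""
    return [json.loads(match) for match in TOOL_CALL_RE.findall(assistant_text)]
```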

Prompt Structure and Tool Calls

The interaction with the LLM is often managed through a specific prompt structure [03:42:00]. A pseudo-prompt typically includes [03:51:00]:

  • System Message: Describes how the LLM should make tool calls (e.g., by passing JSONs within <tool_call> XML tags) [03:56:00].
  • User Message: The initial query or instruction from the user [04:33:00].
  • Assistant Response: The LLM’s response, which might involve thinking (generating “think tokens”) and then deciding to call a tool or provide a text-based answer [04:38:00].
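
Put together, one turn of this pseudo-prompt can be represented as chat messages like the following; the system wording and the browser_navigate tool name are illustrative, and the <think>/<tool_call> tags follow the Qwen 3 / Hermes conventions described above.

```python
# Illustrative chat messages for one agent turn; not the workshop's exact prompt.
messages = [
    {
        "role": "system",
        "content": (
            "You are a browser agent with access to the tools listed below. "
            "To call a tool, return a JSON object inside <tool_call></tool_call> tags."
        ),
    },
    {"role": "user", "content": "Open example.com and tell me the page title."},
    {
        # The assistant may think first, then either call a tool or answer in text.
        "role": "assistant",
        "content": (
            "<think>I need to navigate to the site before I can read its title.</think>\n"
            '<tool_call>{"name": "browser_navigate", "arguments": {"url": "https://example.com"}}</tool_call>'
        ),
    },
]
```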

Fine-tuning Language Models with MCP

The process of fine-tuning language models with MCP involves several stages, from data collection to model training and evaluation.

1. Data Collection

The first step is to generate high-quality traces from an MCP agent [05:30:00].

  • Endpoint Setup: An OpenAI-style endpoint is required. For data generation, it’s recommended to use a model that exposes its reasoning traces, such as a Qwen model (e.g., the 30B parameter Qwen 3 model), since OpenAI models do not share theirs [05:40:00].
    • This can be run on services like RunPod using a one-click affiliate template [06:22:00]. Key configurations include enabling reasoning and a reasoning parser to extract thinking processes into JSON, setting max model length, and enabling automatic tool choice [06:52:00].
    • The tool parser needs to be specified, such as the Hermes format, to extract tool calls into JSON [07:22:00].
  • Running the Agent: The agent interacts with MCP servers, which can be configured to load various tools (e.g., Playwright offers 25 browser-related tools like navigate, switch tab, etc.) [09:40:00]. For open-source models, it’s generally recommended to use 25-50 tools to avoid confusion [10:01:00].
  • Trace Logging: Agent runs generate logs with two parts: messages (full conversation history) and tools (list of available tools) [12:06:00]. These traces are crucial for fine-tuning [12:15:00].
  • Trace Cleaning/Adjustment: If an agent’s trace isn’t ideal, it can be manually adjusted (e.g., deleting user turns, combining sections) or guided with a system prompt during generation to produce cleaner traces [15:12:00]. The goal is to obtain high-quality traces for training data [16:23:00].
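
One step of such an agent, together with the trace logging, might look roughly like this against the OpenAI-style endpoint; the base URL, model id, and the execute_mcp_tool helper are placeholders rather than the workshop’s code.

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local endpoint

def run_agent_step(messages: list, tools: list) -> list:
    """Run one LLM turn, execute any tool calls via MCP, and extend the trace."""
    response = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",  # whatever model the endpoint serves
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )
    assistant = response.choices[0].message
    messages.append(assistant.model_dump(exclude_none=True))

    # If the model called tools, run them (via MCP) and append the results so the
    # model can continue on the next step.
    for call in assistant.tool_calls or []:
        result = execute_mcp_tool(call.function.name, json.loads(call.function.arguments))  # hypothetical helper
        messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    return messages

# After a run, persist the two parts of the trace used for fine-tuning:
# json.dump({"messages": messages, "tools": tools}, open("trace.json", "w"), indent=2)
```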

2. Data Preparation for Fine-tuning

After collecting traces, the data needs to be prepared for training [17:48:00].

  • Push to Hub: Traces (tools and conversations) are pushed to a dataset on Hugging Face Hub [17:51:00].
  • Unrolling Data: For multi-turn conversations, the data is “unrolled” into multiple rows. For example, a three-turn conversation becomes three rows, allowing the model to train on different lengths of conversational context [18:11:00]. This is particularly useful because the Qwen template only includes reasoning from the most recent turn [18:34:00].
  • Chat Template: The messages and tools data are passed into a chat template, which converts them into a single long string of text, including system messages, tool descriptions, user messages, assistant responses, and tool calls [23:40:00].
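
A rough sketch of the unrolling and chat-template steps, assuming each trace is stored as a row with "messages" and "tools" fields and that a Qwen 3 tokenizer is used:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # assumed tokenizer id

def unroll(example: dict) -> list[dict]:
    """Turn one multi-turn trace into several rows, each ending on an assistant turn."""
    rows = []
    for i, message in enumerate(example["messages"]):
        if message["role"] == "assistant":
            rows.append({"messages": example["messages"][: i + 1], "tools": example["tools"]})
    return rows

def to_text(row: dict) -> str:
    """Render one row (system, tools, turns, tool calls) into a single training string."""
    return tokenizer.apply_chat_template(row["messages"], tools=row["tools"], tokenize=False)
```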

3. Fine-tuning the Model

Fine-tuning a model for better performance involves loading it, preparing it for training, and running the training process.

  • Model Loading: A smaller model (e.g., a 4B parameter Qwen model) is loaded for fine-tuning. Max sequence length should be set appropriately (e.g., 32,000 tokens) [23:16:00].
  • Applying LoRA Adapters: To save VRAM and train efficiently, LoRA (Low-Rank Adaptation) adapters are applied to specific parts of the model (e.g., attention modules, MLP layers). This means only a small percentage of parameters are trained, while the main weights remain frozen [23:50:00].
  • Training Parameters:
    • Batch Size: Often set to one due to VRAM limitations, though larger batch sizes (e.g., 32) are ideal for smoother training [28:34:00].
    • Epochs: Often trained for a single epoch with a small dataset [28:48:00].
    • Learning Rate: Relatively high for small models [28:58:00].
    • Optimizer: An AdamW 8-bit optimizer can be used to save VRAM [29:03:00].
  • Manual Traces vs. RL: It’s highly recommended to start with supervised fine-tuning (SFT) using curated manual traces before considering reinforcement learning (RL) methods like GRPO. High-quality SFT data can significantly speed up subsequent RL training by ensuring the model generates useful traces early on [32:00:00].
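
The sketch below shows one way to wire this up with PEFT and TRL, training on the rendered text rows from the previous step; the workshop notebook may use a different stack (e.g., Unsloth), and the model id, dataset name, and hyperparameters are illustrative.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")  # assumed 4B base model
dataset = load_dataset("your-username/mcp-agent-traces", split="train")  # hypothetical dataset with a "text" column

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    # Adapt the attention and MLP projections; the base weights stay frozen.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="outputs",
    dataset_text_field="text",
    max_seq_length=32768,           # long enough for multi-turn browser traces
    per_device_train_batch_size=1,  # small batch to fit in VRAM
    num_train_epochs=1,
    learning_rate=2e-4,             # relatively high, as is common for small LoRA runs
    optim="adamw_bnb_8bit",         # 8-bit AdamW to save VRAM
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
```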

4. Evaluation and Deployment

  • Pre- and Post-Finetuning Inference: Run inference on the model before and after fine-tuning to observe performance changes, especially on multi-step tasks where smaller, untrained models typically struggle [25:54:00].
  • Model Saving and Pushing: After training, the fine-tuned model and tokenizer can be saved and pushed to Hugging Face Hub, optionally merged to 16 bits [30:30:00]. This allows direct deployment of the fine-tuned model as an inference endpoint by simply updating the model name in the RunPod configuration [30:46:00].
  • Advanced Evaluation: For more robust evaluation, a dedicated evaluation set (hundreds of traces) and logging with TensorBoard are recommended [31:05:00].
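
Merging the adapters and pushing the result might look roughly like this with PEFT and Transformers; the repository ids and adapter path are placeholders.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in 16-bit and fold the trained LoRA adapters back into it.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "outputs/checkpoint-final")  # path to the saved adapter
merged = model.merge_and_unload()

# Push the merged weights and tokenizer so the repo can be served directly.
merged.push_to_hub("your-username/qwen3-4b-mcp-agent")
AutoTokenizer.from_pretrained("Qwen/Qwen3-4B").push_to_hub("your-username/qwen3-4b-mcp-agent")
```

The merged repository name can then be dropped into the RunPod endpoint configuration in place of the base model name.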

Resources

All materials for this workshop are available online in the Trellis Research AI Worlds Fair 2025 GitHub repository, specifically in the MCP agent fine-tune folder [00:45:00]. Further details on setting up custom MCP servers can be found in other Trellis Research YouTube videos [35:11:00].