From: aidotengineer

This article outlines a workshop on fine-tuning AI agents that interact with tools using the Model Context Protocol (MCP) [00:00:00]. The goal is to generate high-quality agent reasoning traces, save tool interactions and multi-turn conversations, fine-tune a model (specifically Qwen 3, though the approach applies to other models), and evaluate its improved performance [00:00:24]. All materials are available in the Trellis Research AI Worlds Fair 2025 repository, specifically in the MCP agent fine-tune folder [00:00:45].

What is Model Context Protocol (MCP)?

Model Context Protocol (MCP) is a standardized way to provide services and access to tools for Large Language Models (LLMs) [00:01:20]. While the focus here is on browser use (navigating websites), MCPs exist for various services like Stripe, GitHub, and Gmail [00:01:26].

MCP performs several key functions:

  • Information Storage: It acts as a repository for information about tools, helping the LLM understand how to make calls to them [00:01:47].
  • Tool Execution: The MCP tool service also runs the tools. When an LLM decides to call a tool, MCP executes the action (e.g., adding numbers, navigating a page) and returns a response that includes results or guiding information [00:01:57]. This allows the LLM to loop back for further tool calls or generate a text-based response [00:02:14]. A minimal sketch of this loop follows the list.
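The loop described above, written as a rough sketch in Python: it assumes an OpenAI-compatible client, and execute_mcp_tool is a hypothetical helper standing in for the MCP client call (which is asynchronous in the real MCP SDK).

```python
import json

def run_agent(client, messages, tools, model="Qwen/Qwen3-30B-A3B"):
    """Illustrative agent loop: ask the LLM for its next step, execute any requested
    tool via the MCP service, feed the result back, and stop once the model answers in text."""
    while True:
        reply = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = reply.choices[0].message
        if not msg.tool_calls:
            return msg.content  # no tool call: the model has produced its final answer
        messages.append(msg)  # keep the assistant turn (including its tool call) in the history
        for call in msg.tool_calls:
            # execute_mcp_tool is a hypothetical wrapper around the MCP tool service
            result = execute_mcp_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
```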

Integrating LLMs with MCP using an OpenAI API Endpoint

To allow an LLM to access tools via MCP, the language model is typically exposed as an OpenAI API endpoint [00:02:27]. This is a common practice, as many models and libraries support this API style [00:02:38].

There are specific points of integration and data translation required:

  1. MCP to OpenAI Tool Format: Tool information from MCP services must be converted into lists of JSON tool definitions, as expected by OpenAI endpoints (see the sketch after this list) [00:03:02].
  2. Tool Response Formatting: The response from a tool must be converted into a format the language model expects [00:03:11].
  3. Tool Call Detection and Extraction: When the LLM emits tokens or text to call a tool, the agent needs to detect and extract that call [00:03:21]. For the Qwen model, the text indicating a tool call is in Hermes format [00:03:35].
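A minimal sketch of the first translation step, assuming the tool objects exposed by the MCP Python SDK (with name, description, and inputSchema fields):

```python
def mcp_tool_to_openai(tool) -> dict:
    """Convert one MCP tool definition into the OpenAI function-calling schema."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or "",
            "parameters": tool.inputSchema,  # JSON Schema describing the tool's arguments
        },
    }

# The tool list sent to the OpenAI-style endpoint is then simply:
# openai_tools = [mcp_tool_to_openai(t) for t in mcp_tools]
```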

Prompt Structure for Tool Calls

The interaction between the LLM and tools is guided by a specific prompt structure:

  • System Message: Begins with a system start tag and describes how the LLM can make tool calls, listing the available tool definitions as JSONs inside <tools> XML tags [00:03:56]. It informs the LLM about available tools (e.g., browser) and instructs it to return function calls as JSON inside <tool_call> tags [00:04:11].
  • User Message: The initial request from the user (e.g., “navigate to trellis.com”) [00:04:33].
  • Assistant Response: The LLM’s response, which might involve “thinking” (internal reasoning), making a tool call (e.g., browser.navigate), or providing a text-based answer if the task is complete [00:04:38]. A sketch of what such a turn looks like in Hermes format follows the list.
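For illustration, an assistant turn that decides to navigate might be emitted as raw text along these lines (the tool name and URL are only examples; tag names follow the Hermes-style function-calling format used by Qwen 3):

```python
# Raw text emitted by the model for one assistant turn (illustrative).
assistant_turn = (
    "<think>The user wants the page opened, so I should call the navigation tool.</think>\n"
    "<tool_call>\n"
    '{"name": "browser.navigate", "arguments": {"url": "https://trellis.com"}}\n'
    "</tool_call>"
)
# The tool parser on the server converts the <tool_call> block back into an OpenAI-style
# tool_calls entry, and the reasoning parser extracts the <think> text separately.
```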

Data Collection for Fine-tuning

Data collection is a crucial step for fine-tuning LLMs. The process involves running an agent, capturing its interactions, and curating high-quality traces.

Setting up the LLM Endpoint for Data Generation

It’s recommended to use a consistent model for both data generation and fine-tuning [00:05:54]. Since OpenAI models don’t share their internal “thinking traces,” a Qwen model (specifically the 30-billion-parameter mixture-of-experts model) is used to generate data [00:06:06].

The Qwen model can be run on platforms like RunPod using a one-click affiliate template [00:06:22]. The setup involves:

  • Running a Docker image for vLLM [00:06:47].
  • Enabling reasoning and a reasoning parser to extract thinking tokens into a JSON format [00:06:52].
  • Setting a maximum model length (e.g., 32,000 tokens) and hosting on a specific port (e.g., 8000) [00:07:09].
  • Enabling automatic tool choice, allowing the LLM to decide when and which tool to call [00:07:16].
  • Specifying a tool parser (e.g., Hermes) to extract tool calls from the LLM’s string output into the JSON format expected by the OpenAI API [00:07:22]. An example launch command covering these settings is sketched after this list.
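The following is one way such a launch could look; flag names follow recent vLLM releases and may differ by version, and the model ID and parser names are assumptions:

```python
import subprocess

# Illustrative vLLM launch; check `vllm serve --help` for the flags in your version.
subprocess.run([
    "vllm", "serve", "Qwen/Qwen3-30B-A3B",  # assumed model ID
    "--max-model-len", "32000",             # cap the context length
    "--port", "8000",                       # serve on port 8000
    "--enable-auto-tool-choice",            # let the model decide when and which tool to call
    "--tool-call-parser", "hermes",         # parse Hermes-format tool calls into OpenAI JSON
    "--reasoning-parser", "qwen3",          # extract thinking tokens into a separate field
])
```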

The server’s endpoint (RunPod Pod ID and port) is used to interact with the LLM [00:08:15]. A truncate argument is used to limit the length of tool responses (e.g., accessibility trees from browser navigation) to manage context length for the LLM [00:08:42].
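On the agent side, this roughly corresponds to pointing a standard OpenAI client at the pod and trimming tool output before it re-enters the conversation; the base URL pattern and the character limit below are placeholders:

```python
from openai import OpenAI

# Replace <pod-id> with your RunPod pod ID; vLLM exposes an OpenAI-compatible /v1 API.
client = OpenAI(base_url="https://<pod-id>-8000.proxy.runpod.net/v1", api_key="EMPTY")

MAX_TOOL_CHARS = 4000  # illustrative truncation limit

def truncate_tool_response(text: str) -> str:
    """Trim long tool output (e.g., accessibility trees) so it fits the model's context."""
    return text[:MAX_TOOL_CHARS]
```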

Running the Agent and Collecting Traces

The agent is run using uv sync to ensure requirements are met [00:09:25]. The MCP server starts, loading configured tools (e.g., 25 Playwright browser tools like navigate, switch tab) [00:09:37]. For open-source models, keeping the number of tools between 25-50 is advised to avoid confusing the LLM with excessive context [00:10:01].

Users provide inputs (e.g., “navigate to trellis.com and read out the top two lines”) [00:10:15]. The agent then:

  1. Sends the user message to the LLM [00:10:25].
  2. The LLM processes the request and generates thinking tokens (these are saved but not displayed, for brevity) [00:10:28].
  3. The LLM decides to call a tool (e.g., browser.navigate), which requires user approval in this setup [00:10:34].
  4. Upon approval, the browser pops up (if not in headless mode) [00:10:50].
  5. An accessibility tree (text description of the page) is sent back to the LLM [00:11:41].
  6. The LLM uses this information, along with its reasoning, to formulate a final answer [00:12:55].

Logs are saved by default, containing messages (full conversation history) and tools (list of available tools) [00:12:06]. These structured logs are essential for fine-tuning [00:12:17].
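Each saved log therefore has roughly the following shape (field values are illustrative):

```python
log = {
    "messages": [  # full conversation history
        {"role": "system", "content": "…"},
        {"role": "user", "content": "navigate to trellis.com and read out the top two lines"},
        {"role": "assistant", "reasoning_content": "…", "content": "", "tool_calls": ["…"]},
        {"role": "tool", "content": "…accessibility tree…"},
        {"role": "assistant", "reasoning_content": "…", "content": "…final answer…"},
    ],
    "tools": ["…"],  # OpenAI-format definitions of the tools available during the run
}
```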

Curating High-Quality Traces

For fine-tuning, it’s crucial to collect good traces [00:13:28]. While the model might struggle with complex tasks (e.g., multi-step navigation, tab switching), methods to improve traces include:

  • Manual Adjustment: Modifying or combining user/assistant turns in the trace retrospectively [00:17:03].
  • System Prompts: Providing a system prompt that directly explains how to perform a task and which tools to call [00:16:03]. This prompt can be excluded from the final training data once a good trace has been generated, since the goal is clean training data [00:16:18]; a sketch of this post-processing step follows the list.
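A minimal sketch of that clean-up step, assuming coaching prompts are tagged with a GUIDANCE: marker (the marker is purely a convention introduced here):

```python
def strip_guidance(messages, marker="GUIDANCE:"):
    """Drop the task-specific coaching system prompt from a trace before saving it
    as training data, keeping only the clean conversation turns."""
    return [
        m for m in messages
        if not (m["role"] == "system" and m["content"].startswith(marker))
    ]
```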

Preparing and Pushing Data to Hugging Face Hub

Once high-quality traces are collected, they are pushed to the Hugging Face Hub to create a dataset for fine-tuning [00:17:51].

Unrolling Data

A subtle but important technique is “unrolling” the data [00:18:11]. For multi-turn conversations, this means creating multiple rows in the dataset: one for the full conversation, one for the first two turns, and one for just the first turn [00:18:22]. This effectively generates more training examples from a single multi-turn trace [00:18:28]. The Qwen template, for instance, only includes reasoning from the most recent turn, which makes unrolling especially beneficial [00:18:47].
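A minimal sketch of unrolling, assuming each row is the conversation truncated after an assistant turn:

```python
def unroll(messages):
    """Turn one multi-turn trace into several training rows, each ending at a
    different assistant turn, so earlier turns also become complete examples."""
    rows = []
    for i, message in enumerate(messages):
        if message["role"] == "assistant":
            rows.append(messages[: i + 1])
    return rows
```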

Process

The push-to-hub function is used with a repo ID and, if desired, the unroll flag [00:18:58]. The user needs to be logged into Hugging Face with write permissions [00:19:10].
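In plain Hugging Face terms, the push amounts to something like the following (the workshop repo wraps this in its own helper; the repo ID is a placeholder):

```python
from datasets import Dataset
from huggingface_hub import login

login()  # paste a Hugging Face token with write access

rows = [...]  # the collected (and optionally unrolled) trace dictionaries
Dataset.from_list(rows).push_to_hub("your-username/mcp-agent-traces")
```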

The resulting dataset contains fields like ID, timestamp, model, messages (the full conversation history), and tools (the list of tools available) [00:19:33]. The messages content for each row includes system messages, tool definitions, user requests, assistant reasoning, tool calls, and tool responses, with reasoning content explicitly extracted due to the reasoning parser being enabled [00:20:04].

Fine-tuning the Model

The fine-tuning process uses an Unsloth notebook [00:23:01].

Model Loading and Setup

  • A smaller model (e.g., a 4-billion-parameter Qwen model) can be used for fine-tuning [00:23:16].
  • The max sequence length needs to accommodate the potentially long conversation traces (e.g., 32,000 tokens) [00:23:21].
  • The model is set up for training by applying LoRA (Low-Rank Adaptation) adapters to specific parts of the model, such as the attention modules and MLP layers [00:23:50]. This means only a small percentage of parameters are trained, with the main weights remaining frozen [00:30:15].
  • A rank (e.g., 32) and rescaled LoRA (which adjusts the effective learning rate based on the adapter rank) are configured [00:27:48]. A loading and adapter-configuration sketch follows the list.
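With Unsloth, that setup might look like the following; the model name is an assumption, and argument defaults may differ across Unsloth versions:

```python
from unsloth import FastLanguageModel

# Load a smaller Qwen 3 model with a context window long enough for the traces.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B",  # assumed model ID
    max_seq_length=32000,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections; the base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_rslora=True,  # rescale updates according to the adapter rank
)
```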

Data Preparation

The collected dataset is loaded [00:24:14]. The messages and tools are passed into the model’s chat template, which converts them into a single, long string of text [00:25:12]. This string includes the system message, available tools, user messages, assistant responses, and tool calls, forming the text field used for training [00:25:27].
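Using the tokenizer loaded above, that conversion looks roughly like this for each dataset row (the repo ID is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("your-username/mcp-agent-traces", split="train")

def to_text(example):
    """Render messages plus tool definitions into one long training string."""
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tools=example["tools"],
            tokenize=False,  # return the rendered string rather than token IDs
        )
    }

dataset = dataset.map(to_text)
```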

Training Parameters

  • Batch Size: Often set to 1 due to VRAM limitations, though larger batch sizes (e.g., 32 with more VRAM) are ideal for smoother training [00:31:36].
  • Epochs: Typically one epoch for initial fine-tuning with small datasets [00:28:48].
  • Learning Rate: Fairly high for small models [00:28:58].
  • Optimizer: The AdamW 8-bit optimizer is used to save VRAM [00:29:03].
  • Learning Rate Schedule: A constant learning rate is used for simplicity [00:29:06]. A trainer-configuration sketch with these settings follows the list.
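Put together, the trainer configuration might look like this; argument names follow common Unsloth/TRL notebooks and may shift between versions, and the exact learning rate is an assumption:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",          # the rendered conversation string
    max_seq_length=32000,
    args=TrainingArguments(
        per_device_train_batch_size=1,  # small batch to fit in VRAM
        num_train_epochs=1,
        learning_rate=2e-4,             # assumption: fairly high, typical for small LoRA runs
        optim="adamw_8bit",             # 8-bit AdamW to save VRAM
        lr_scheduler_type="constant",
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()
```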

Evaluation and Deployment

Before fine-tuning, a baseline inference run is performed to evaluate the model’s performance without training [00:25:55]. After training, inference is run again to observe improvements in tool calling capabilities [00:33:37].

The fine-tuned model and tokenizer can then be saved and pushed to the Hugging Face Hub, merged to 16-bit precision [00:30:30]. This allows users to set up an inference endpoint with the fine-tuned model by simply swapping in its name in the RunPod template [00:30:46].
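With Unsloth this is a one-liner (the repo name is a placeholder):

```python
# Merge the LoRA adapters into the base weights at 16-bit and push the result to the Hub.
model.push_to_hub_merged(
    "your-username/qwen3-4b-mcp-agent",  # placeholder repo ID
    tokenizer,
    save_method="merged_16bit",
)
```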

Considerations for Improvement

  • Data Quantity: More data (hundreds of traces) is recommended for better fine-tuning [00:31:07].
  • Evaluation Set: Using a dedicated evaluation set to track performance during training [00:31:05].
  • Logging: Employing tools like TensorBoard for better training progress visualization [00:31:16].
  • Reinforcement Learning (RL): While reinforcement learning (e.g., GRPO) can automate trace generation using reward signals, supervised fine-tuning (SFT) on high-quality manual traces is a valuable starting point: it helps the model generate correct traces more often, which speeds up subsequent RL [00:32:02]. For RL, defining rewards requires a dataset with verifiably correct answers, which means systematically generating data with ground truth [00:32:50].

Fine-tuning on a curated set of traces can lead to significant performance improvements, especially for narrow, common, and important use cases [00:34:50].