From: aidotengineer
This article outlines how to generate, collect, and use high-quality traces and logs from Model Context Protocol (MCP) agent runs to fine-tune and improve the performance of a language model [00:00:18]. The materials for this process are available online in the Trellis Research AI Worlds Fair 2025 repository, specifically in the “MCP agent fine-tune” folder [00:00:45].
Understanding Model Context Protocol (MCP)
MCP (Model Context Protocol) is a protocol designed to provide services, primarily access to tools, to Large Language Models (LLMs) [00:01:17]. The workshop specifically focuses on browser use, allowing an LLM to navigate websites [00:01:26]. Other MCP services exist for platforms like Stripe, GitHub, and Gmail [00:01:32].
MCP performs several functions:
- Information Store: It stores information about tools, helping the LLM understand how to make calls or use them [00:01:47].
- Tool Execution: The MCP tool service runs the tools. When an LLM decides to make a call, MCP executes the action (e.g., navigating to a page) and returns a response containing the result or guidance for the LLM [00:01:57].
LLM Integration and Tool Interaction
The language model is exposed as an OpenAI-style API endpoint, which is a common standard [00:02:29]. Integrating this API endpoint requires specific translations [00:02:51]:
- Converting tool information from MCP services into lists of JSON tools, as expected by OpenAI endpoints [00:03:02].
- Converting the tool response into a format the language model expects [00:03:11].
- Detecting and extracting tool calls from the LLM’s emitted tokens or text, specifically in Hermes format for the Qwen model [00:03:21].
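The first of these translations is mostly a re-nesting of fields. As a rough illustration (the MCP-side field names follow the MCP specification’s conventions, and the helper itself is not taken from the workshop code), an MCP tool description can be wrapped into the `tools` list an OpenAI-style endpoint expects:

```python
def mcp_tool_to_openai(mcp_tool: dict) -> dict:
    """Wrap an MCP tool description (name, description, input schema)
    into the JSON structure OpenAI-style chat endpoints expect."""
    return {
        "type": "function",
        "function": {
            "name": mcp_tool["name"],
            "description": mcp_tool.get("description", ""),
            # MCP exposes a JSON Schema for the tool's arguments; OpenAI-style
            # endpoints call the same schema "parameters".
            "parameters": mcp_tool.get("inputSchema",
                                       {"type": "object", "properties": {}}),
        },
    }

# Example: a hypothetical Playwright-style navigation tool from the MCP server.
openai_tools = [mcp_tool_to_openai({
    "name": "browser_navigate",
    "description": "Navigate the browser to a URL.",
    "inputSchema": {"type": "object",
                    "properties": {"url": {"type": "string"}},
                    "required": ["url"]},
})]
```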
The prompt structure for the LLM includes a system message that describes how to make tool calls by passing JSONs within tool XML tags [00:03:56]. The LLM then uses this to respond, either by thinking and calling a tool or by providing a text-based response [00:04:40].
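In the Hermes convention, the model wraps each call in `<tool_call>` tags around a JSON object, so detecting tool calls in the emitted text amounts to scanning for those tags. A minimal sketch (the helper name is illustrative; in the workshop setup, vLLM’s Hermes parser does this server-side):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Pull Hermes-style tool calls (JSON wrapped in <tool_call> tags)
    out of the raw text emitted by the model."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # leave malformed calls for the caller to handle
    return calls

sample = '<tool_call>{"name": "browser_navigate", "arguments": {"url": "https://example.com"}}</tool_call>'
print(extract_tool_calls(sample))
# [{'name': 'browser_navigate', 'arguments': {'url': 'https://example.com'}}]
```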
Data Collection: Generating High-Quality Traces and Logs
The data collection process involves running an agent to generate sample runs [00:05:30].
Setting up the Agent
- Endpoint: An OpenAI-style endpoint is required [00:05:40]. For consistency between data generation and fine-tuning, a Qwen model is used, specifically the 30-billion-parameter mixture-of-experts model [00:05:50]. This model can be run on platforms like RunPod [00:06:22].
- Configuration: The vLLM Docker image is run with reasoning enabled and a reasoning parser that extracts the thinking process into a JSON field [00:06:47]. Automatic tool choice is enabled, allowing the LLM to decide which tool to call [00:07:16]. A Hermes-specific tool parser is used to extract tool calls into the OpenAI API’s expected JSON format [00:07:36]. Port 8000 is exposed for serving the model [00:08:00] (see the client sketch after this list).
- Truncation: The `truncate` argument can be used to limit the length of tool responses, especially browser accessibility trees, to avoid excessively long contexts for the LLM [00:08:42].
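Once the server is up on port 8000, the agent talks to it like any OpenAI-style endpoint. A minimal client-side sketch, assuming the server is reachable on localhost and that the model name matches whatever vLLM was launched with (both are assumptions for illustration):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "browser_navigate",
        "description": "Navigate the browser to a URL.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]},
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",   # whatever model name the server was started with
    messages=[{"role": "user", "content": "Open example.com and describe the page."}],
    tools=tools,
    tool_choice="auto",           # let the model decide whether to call a tool
)

message = response.choices[0].message
if message.tool_calls:            # already parsed into JSON by the Hermes tool parser
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```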
Running the Agent and Collecting Logs
The agent interacts with the user, taking inputs and generating responses. During this process, the agent’s actions, thinking, and tool calls are logged [00:09:35].
- The MCP server is started, loading configured tools (e.g., 25 Playwright browser tools) [00:09:37].
- User input prompts the LLM to think and decide on actions, such as navigating to a website [00:10:15].
- Tool calls are made by the LLM and require approval [00:11:29]. The browser pops up (unless in headless mode) to perform the action [00:10:48].
- Logs are saved, typically in two parts: `messages` (the full conversation history, including user requests, assistant thinking, and final answers) and `tools` (a list of available tools) [00:12:06]. This structure is essential for fine-tuning [00:12:17].
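The exact on-disk layout belongs to the workshop’s logging code, but a two-part log entry in this shape can be written with the standard library alone. A rough sketch (the file name and field layout are illustrative assumptions):

```python
import json
from datetime import datetime, timezone

log_entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    # Full conversation history: user requests, assistant thinking,
    # tool calls, tool responses, and final answers.
    "messages": [
        {"role": "user", "content": "Go to example.com and tell me the page title."},
        {"role": "assistant",
         "reasoning_content": "I should navigate to the site first...",
         "content": "The page title is \"Example Domain\"."},
    ],
    # The tools that were available during the run, in OpenAI JSON format
    # (schemas omitted here for brevity).
    "tools": [{"type": "function", "function": {"name": "browser_navigate"}}],
}

with open("trace_001.json", "w") as f:
    json.dump(log_entry, f, indent=2)
```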
Curating Traces
Not all traces will be perfect. Users can manually adjust traces by deleting or combining user turns to create a cleaner, more direct trace [00:17:03]. A system prompt can also be used to guide the model very directly during trace generation, which can then be excluded from the final training data [00:16:03]. The goal is to obtain nice, tidy traces for training data [00:16:23].
Pushing Data to Hugging Face Hub
Collected traces (tools and conversations) are pushed to a Hugging Face dataset [00:17:51].
- Unrolling Data: To train on multiple turns, a subtle technique called “unrolling” is used. If a conversation has, for example, three back-and-forths, it’s unrolled into three rows: one with all turns, one with the last two, and one with just the final turn. This effectively multiplies the training data [00:18:11] (see the sketch after this list).
- The unrolled dataset will contain `ID`, `timestamp`, `model`, `messages`, and `tools` columns [00:19:33]. The `messages` field includes the extracted reasoning content because the reasoning parser was enabled [00:20:12].
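A compact sketch of the unrolling and the push (the slicing follows the description above literally, so the workshop’s own code may differ in detail; the repository ID is a placeholder):

```python
from datasets import Dataset

def unroll(messages: list[dict]) -> list[list[dict]]:
    """Turn one multi-turn conversation into several rows, as described above:
    all turns, then progressively shorter slices down to the final exchange."""
    starts = [i for i, m in enumerate(messages) if m["role"] == "user"]
    return [messages[s:] for s in starts]

# One collected trace, heavily abbreviated.
trace_messages = [
    {"role": "user", "content": "Open example.com."},
    {"role": "assistant", "content": "Done, the page is loaded."},
    {"role": "user", "content": "What is the page title?"},
    {"role": "assistant", "content": "It is \"Example Domain\"."},
]
trace_tools = [{"type": "function", "function": {"name": "browser_navigate"}}]

# Real rows would also carry the ID, timestamp, and model columns.
rows = [{"messages": sub, "tools": trace_tools} for sub in unroll(trace_messages)]

Dataset.from_list(rows).push_to_hub("your-username/mcp-agent-traces")  # placeholder repo ID
```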
Fine-tuning the Model
The collected and curated traces are then used to fine-tune the LLM.
Model Loading and Preparation
- Model Selection: A smaller model, such as the 4-billion-parameter Qwen model, is chosen for fine-tuning [00:23:16]. A large sequence length (e.g., 32,000) is maintained [00:23:21].
- Benchmark: Before fine-tuning, an initial run is performed to benchmark the model’s performance without fine-tuning [00:24:03].
- Data Loading: The dataset from Hugging Face is loaded [00:24:14].
- Chat Template: The `chat_template` takes the tools and messages and combines them into a single, long string of formatted text [00:24:34]. This string includes the system message, tools list, user message, assistant message, and tool calls [00:25:27]. The longest row length is checked to ensure it doesn’t exceed the model’s maximum length [00:25:46].
- LoRA Adapters: The model is prepared for fine-tuning by applying Low-Rank Adapters (LoRA) to specific parts of the model (the attention modules and MLP layers) [00:27:37]. This trains only a small percentage of parameters, keeping the main weights frozen [00:30:15].
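Two of the steps above map onto standard library calls. A sketch using Hugging Face `transformers` and `peft` (the workshop may use a different training stack; the model ID, LoRA ranks, and target module names below are common defaults rather than its exact configuration):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load one unrolled trace from the Hub (placeholder repo ID).
train_dataset = load_dataset("your-username/mcp-agent-traces", split="train")
row = train_dataset[0]

# 1. Chat template: fold tools + messages into one long formatted string
#    and check that it fits within the model's maximum sequence length.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
text = tokenizer.apply_chat_template(row["messages"], tools=row["tools"], tokenize=False)
print(len(tokenizer(text).input_ids))

# 2. LoRA adapters on the attention and MLP projections; base weights stay frozen.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small percentage of parameters is trainable
```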
Training the Model
- The model is trained using the prepared dataset [00:28:10].
- A batch size of one is used due to VRAM limitations, which can make the training loss jumpy [00:28:34].
- Training typically runs for one epoch [00:28:48].
- A high learning rate and an AdamW 8-bit optimizer are used [00:28:58] (see the training sketch after this list).
- Supervised fine-tuning on high-quality traces is recommended even before considering reinforcement learning (RL) methods like GRPO, as it significantly speeds up subsequent RL training [00:32:20].
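A sketch of the training call itself, here with TRL’s `SFTTrainer` as one possible stack (the workshop may use different tooling; only the hyperparameters mentioned above are set, and the learning rate shown is a common LoRA choice rather than the workshop’s exact value):

```python
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="qwen3-4b-mcp-sft",
    per_device_train_batch_size=1,   # batch size of one due to VRAM limits
    num_train_epochs=1,              # a single pass over the traces
    learning_rate=2e-4,              # relatively high; typical for LoRA runs
    optim="adamw_bnb_8bit",          # 8-bit AdamW optimizer via bitsandbytes
    logging_steps=1,                 # the loss will look jumpy at batch size 1
)

trainer = SFTTrainer(
    model=model,                  # the LoRA-wrapped model from the previous sketch
    args=training_args,
    train_dataset=train_dataset,  # the unrolled traces loaded from the Hub
    processing_class=tokenizer,   # called `tokenizer=` in older TRL versions
)
trainer.train()
```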
Post-Training and Evaluation
After training, the model’s performance is re-evaluated [00:33:34]. Even with limited data, the fine-tuned model should show improved capability in calling tools correctly [00:34:03]. For comprehensive evaluation, a more elaborate setup with an evaluation set, more data (hundreds of traces), and logging with TensorBoard is recommended [00:31:05].
The fine-tuned model and tokenizer can be saved and optionally pushed to the Hugging Face Hub, where the adapters can be merged into 16-bit weights and used to update an inference endpoint [00:30:30].
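Merging the adapters back into the 16-bit base weights and pushing the result can again be done with `peft` and `transformers`; a rough sketch (repository IDs are placeholders):

```python
# Fold the LoRA adapters into the base weights (a peft method on the wrapped model),
# then save and push the merged model and tokenizer.
merged = model.merge_and_unload()
merged.save_pretrained("qwen3-4b-mcp-merged")
tokenizer.save_pretrained("qwen3-4b-mcp-merged")

merged.push_to_hub("your-username/qwen3-4b-mcp-merged")
tokenizer.push_to_hub("your-username/qwen3-4b-mcp-merged")
```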
This process, even with a small number of carefully curated examples (e.g., 50-100), can lead to significant improvements in performance for specific use cases [00:34:50].