From: aidotengineer

This article outlines how to generate, collect, and use high-quality traces and logs from Model Context Protocol (MCP) agent runs to fine-tune and improve the performance of a language model [00:00:18]. The materials for this process are available online in the Trellis Research AI Worlds Fair 2025 repository, specifically in the “MCP agent fine-tune” folder [00:00:45].

Understanding Model Context Protocol (MCP)

MCP (Model Context Protocol) is a protocol designed to provide services, primarily access to tools, to Large Language Models (LLMs) [00:01:17]. The workshop specifically focuses on browser use, allowing an LLM to navigate websites [00:01:26]. Other MCP services exist for platforms like Stripe, GitHub, and Gmail [00:01:32].

MCP performs several functions:

  • Information Store: It stores information about tools, helping the LLM understand how to make calls or use them [00:01:47].
  • Tool Execution: The MCP tool service runs the tools. When an LLM decides to make a call, MCP executes the action (e.g., navigating to a page) and returns a response containing the result or guidance for the LLM [00:01:57].

LLM Integration and Tool Interaction

The language model is exposed as an OpenAI-style API endpoint, which is a common standard [00:02:29]. Integrating this API endpoint requires specific translations [00:02:51]:

  1. Converting tool information from MCP services into lists of JSON tools, as expected by OpenAI endpoints (a sketch of this mapping follows the list) [00:03:02].
  2. Converting the tool response into a format the language model expects [00:03:11].
  3. Detecting and extracting tool calls from the LLM’s emitted tokens or text, which use the Hermes format for the Qwen model [00:03:21].
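
As a rough sketch of the first translation: an MCP tool definition (a name, a description, and a JSON Schema under inputSchema) maps onto an OpenAI-style tools entry roughly as follows. The helper name is illustrative, not the repo’s actual code.

```python
# Minimal sketch: convert MCP tool definitions into the OpenAI-style "tools"
# list expected by a chat completions endpoint. Field names follow the MCP
# spec (name, description, inputSchema); the function name is illustrative.
def mcp_tools_to_openai(mcp_tools):
    openai_tools = []
    for tool in mcp_tools:
        openai_tools.append({
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool.get("description", ""),
                # MCP exposes a JSON Schema for the arguments; OpenAI calls it "parameters"
                "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
            },
        })
    return openai_tools
```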

The prompt structure for the LLM includes a system message that describes how to make tool calls by passing JSON objects inside tool XML tags [00:03:56]. The LLM then uses this to respond, either by thinking and calling a tool or by providing a text-based response [00:04:40].
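
In the Hermes convention, each tool call is a JSON object wrapped in <tool_call> tags, so detection amounts to scanning the emitted text for those tags. A minimal extraction sketch follows; the tool name in the example comment is illustrative.

```python
import json
import re

# Pull Hermes-style tool calls out of the raw text emitted by the model.
# Each call is a JSON object with "name" and "arguments" inside <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            pass  # malformed call; let the agent loop decide how to recover
    return calls

# Example assistant output this would match:
# <tool_call>
# {"name": "browser_navigate", "arguments": {"url": "https://example.com"}}
# </tool_call>
```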

Data Collection: Generating High-Quality Traces and Logs

The data collection process involves running an agent to generate sample runs [00:05:30].

Setting up the Agent

  1. Endpoint: An OpenAI-style endpoint is required [00:05:40]. For consistency between data generation and fine-tuning, a Qwen model is used, specifically the 30-billion-parameter mixture-of-experts model [00:05:50]. This model can be run on platforms like RunPod [00:06:22].
  2. Configuration: The vLLM Docker image is run, enabling reasoning and a reasoning parser to extract thinking processes into a JSON format [00:06:47]. Automatic tool choice is enabled, allowing the LLM to decide which tool to call [00:07:16]. A Hermes-specific tool parser is used to extract tool calls into the OpenAI API’s expected JSON format [00:07:36]. Port 8000 is exposed for serving the model [00:08:00]. (A launch sketch follows this list.)
  3. Truncation: The truncate argument can be used to limit the length of tool responses, especially for browser accessibility trees, to avoid excessively long contexts for the LLM [00:08:42].
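
As a concrete illustration, a launch command and a quick endpoint check might look like the sketch below; the model id, parser names, and flags are assumptions, and flag names vary across vLLM versions.

```python
# Launch sketch (shell command shown as a comment; flag names vary by vLLM version):
#
#   docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
#       --model Qwen/Qwen3-30B-A3B \
#       --enable-auto-tool-choice --tool-call-parser hermes \
#       --reasoning-parser qwen3 \
#       --max-model-len 32768
#
# Once the server is up, the agent talks to it as a standard OpenAI-style endpoint:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{  # one converted MCP tool, as produced by the earlier mapping sketch
    "type": "function",
    "function": {
        "name": "browser_navigate",
        "description": "Navigate the browser to a URL",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]},
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Go to example.com and summarize the page."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```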

Running the Agent and Collecting Logs

The agent interacts with the user, taking inputs and generating responses. During this process, the agent’s actions, thinking, and tool calls are logged [00:09:35].

  • The MCP server is started, loading configured tools (e.g., 25 Playwright browser tools) [00:09:37].
  • User input prompts the LLM to think and decide on actions, such as navigating to a website [00:10:15].
  • Tool calls are made by the LLM and require approval [00:11:29]. The browser pops up (unless in headless mode) to perform the action [00:10:48].
  • Logs are saved, typically in two parts: messages (full conversation history including user requests, assistant thinking, and final answers) and tools (a list of available tools) [00:12:06]. This structure is essential for fine-tuning [00:12:17].
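
A trace saved in this two-part structure might look roughly like the sketch below; the message fields follow the OpenAI chat format, but the exact schema in the repo may differ.

```python
# Illustrative shape of one saved trace: "messages" holds the full conversation
# (system prompt, user requests, assistant thinking, tool calls, tool results,
# final answers) and "tools" holds the list of tools that were available.
trace = {
    "messages": [
        {"role": "system", "content": "You can call tools by emitting <tool_call> JSON ..."},
        {"role": "user", "content": "Go to example.com and find the pricing page."},
        {"role": "assistant", "content": "<think>I should navigate there first.</think>",
         "tool_calls": [{"type": "function", "function": {
             "name": "browser_navigate",
             "arguments": "{\"url\": \"https://example.com\"}"}}]},
        {"role": "tool", "content": "Accessibility tree of example.com ..."},
        {"role": "assistant", "content": "The pricing page is at ..."},
    ],
    "tools": [  # e.g. the 25 Playwright browser tools, in OpenAI function format
        {"type": "function", "function": {
            "name": "browser_navigate",
            "description": "Navigate the browser to a URL",
            "parameters": {"type": "object",
                           "properties": {"url": {"type": "string"}},
                           "required": ["url"]}}},
    ],
}
```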

Curating Traces

Not all traces will be perfect. Users can manually adjust traces by deleting or combining user turns to create a cleaner, more direct trace [00:17:03]. A system prompt can also be used to guide the model very directly during trace generation, which can then be excluded from the final training data [00:16:03]. The goal is to obtain nice, tidy traces for training data [00:16:23].

Pushing Data to Hugging Face Hub

Collected traces (tools and conversations) are pushed to a Hugging Face dataset [00:17:51].

  • Unrolling Data: To train on multiple turns, a subtle technique called “unrolling” is used (sketched after this list). If a conversation has, for example, three back-and-forths, it’s unrolled into three rows: one with all turns, one with the last two, and one with just the final turn. This effectively multiplies the training data [00:18:11].
  • The unrolled dataset contains the ID, timestamp, model, messages, and tools fields [00:19:33]. The messages field includes the extracted reasoning content because the reasoning parser was enabled [00:20:12].
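
A minimal sketch of the unrolling step and the push to the Hub, following the description above; the repo id and helper names are hypothetical.

```python
from datasets import Dataset

def unroll(trace):
    """Turn one logged trace into several rows: all turns, the last two, ..., just the final one."""
    msgs = trace["messages"]
    system = [m for m in msgs if m["role"] == "system"]
    rest = [m for m in msgs if m["role"] != "system"]
    # Group into user-led back-and-forths: a user message plus everything up to the next user message.
    turns, current = [], []
    for m in rest:
        if m["role"] == "user" and current:
            turns.append(current)
            current = []
        current.append(m)
    if current:
        turns.append(current)
    rows = []
    for start in range(len(turns)):  # drop zero, one, two, ... leading back-and-forths
        rows.append({
            "id": trace["id"], "timestamp": trace["timestamp"], "model": trace["model"],
            "messages": system + [m for turn in turns[start:] for m in turn],
            "tools": trace["tools"],
        })
    return rows

traces = []  # fill with the trace dicts logged during the agent runs
unrolled = [row for trace in traces for row in unroll(trace)]
Dataset.from_list(unrolled).push_to_hub("your-username/mcp-agent-traces")  # hypothetical repo id
```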

Fine-tuning the Model

The collected and curated traces are then used to fine-tune the LLM.

Model Loading and Preparation

  1. Model Selection: A smaller model, like the 4-billion-parameter Qwen model, is chosen for fine-tuning [00:23:16]. A long sequence length (e.g., 32,000 tokens) is maintained [00:23:21].
  2. Benchmark: Before fine-tuning, an initial run is performed to benchmark the model’s performance without fine-tuning [00:24:03].
  3. Data Loading: The dataset from Hugging Face is loaded [00:24:14].
  4. Chat Template: The chat_template takes the tools and messages and combines them into a single, long string of formatted text [00:24:34]. This string includes the system message, tools list, user message, assistant message, and tool calls [00:25:27]. The longest row length is checked to ensure it doesn’t exceed the model’s maximum length [00:25:46]. (A sketch of this step follows the list.)
  5. LoRA Adapters: The model is prepared for fine-tuning by applying low-rank adapters (LoRA) to specific parts of the model (attention modules and MLP layers) [00:27:37]. This trains only a small percentage of parameters, keeping the main weights frozen [00:30:15].
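
A sketch of steps 4 and 5, assuming the Hugging Face apply_chat_template and PEFT APIs; the workshop notebook may use a different wrapper, and the model id and LoRA hyperparameters here are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")

def to_text(row):
    # The chat template folds the tools list and the message history into one
    # long formatted string: system message, tools, user/assistant turns, tool calls.
    return {"text": tokenizer.apply_chat_template(
        row["messages"], tools=row["tools"], tokenize=False)}

# LoRA adapters on the attention and MLP projections; the main weights stay frozen.
model = get_peft_model(base_model, LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
))
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```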

Training the Model

  • The model is trained using the prepared dataset [00:28:10].
  • A batch size of one is used due to VRAM limitations, which can make the training loss jumpy [00:28:34].
  • Training typically runs for one epoch [00:28:48].
  • A high learning rate and an AdamW 8-bit optimizer are used [00:28:58] (see the training sketch after this list).
  • Supervised fine-tuning on high-quality traces is recommended even before considering reinforcement learning (RL) methods like GRPO, as it significantly speeds up subsequent RL training [00:32:20].
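
Continuing the sketch above, a training run with these settings might look like the following using TRL’s SFTTrainer; the exact trainer wrapper and hyperparameter values in the workshop may differ, and the dataset id is hypothetical.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("your-username/mcp-agent-traces", split="train")
dataset = dataset.map(to_text)  # chat-template formatting from the previous sketch

trainer = SFTTrainer(
    model=model,  # the LoRA-wrapped 4B Qwen model prepared above
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-4b-mcp-agent",
        per_device_train_batch_size=1,   # VRAM-limited; expect a jumpy training loss
        num_train_epochs=1,
        learning_rate=2e-4,              # assumed value for a "high" LoRA learning rate
        optim="adamw_bnb_8bit",          # 8-bit AdamW from bitsandbytes
        logging_steps=1,
    ),
)
trainer.train()
```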

Post-Training and Evaluation

After training, the model’s performance is re-evaluated [00:33:34]. Even with limited data, the fine-tuned model should show improved capability in calling tools correctly [00:34:03]. For comprehensive evaluation, a more elaborate setup with an evaluation set, more data (hundreds of traces), and logging with TensorBoard is recommended [00:31:05].

The fine-tuned model and tokenizer can be saved and optionally pushed to the Hugging Face Hub, where they can be merged to 16-bit precision and used to update an inference endpoint [00:30:30].
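
A sketch of that final step with the PEFT API, continuing from the training sketch (repo ids are hypothetical): merge_and_unload folds the LoRA weights back into the base model so the merged 16-bit checkpoint can back an inference endpoint directly.

```python
# Merge the LoRA adapters into the base weights and publish the result.
merged = model.merge_and_unload()  # model: the trained PEFT/LoRA model from the sketch above
merged.save_pretrained("qwen3-4b-mcp-agent-merged")
tokenizer.save_pretrained("qwen3-4b-mcp-agent-merged")
merged.push_to_hub("your-username/qwen3-4b-mcp-agent")     # hypothetical repo id
tokenizer.push_to_hub("your-username/qwen3-4b-mcp-agent")
```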

This process, even with a small number of carefully curated examples (e.g., 50-100), can lead to significant improvements in performance for specific use cases [00:34:50].