From: aidotengineer
Large Language Models (LLMs) can be significantly augmented by providing them with access to external tools, enabling them to perform actions beyond text generation, such as browsing websites or interacting with APIs [00:00:10]. This capability is crucial for building more sophisticated AI agents that can interact with the real world [00:01:06].
Model Context Protocol (MCP)
The Model Context Protocol (MCP) is a framework designed to provide services, primarily access to tools, to LLMs [00:01:20]. MCP acts as both a store of information about tools and a service that runs the tools [00:01:47]. When an LLM decides to use a tool, the MCP tool service takes the action (e.g., navigating a page) and returns a response containing the result or guidance for the LLM [00:02:00].
Examples of MCP-supported tools include:
- Browser use (navigating websites) [00:01:26]
- Stripe [00:01:32]
- GitHub [00:01:32]
- Gmail [00:01:33]
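The workshop relies on off-the-shelf MCP servers, but as a rough illustration of what an MCP tool service looks like, here is a minimal sketch using the Python `mcp` SDK's FastMCP helper; the `navigate` tool is a hypothetical stand-in for what a browser server such as Playwright provides.

```python
# Minimal sketch of an MCP tool server (assumes the Python `mcp` SDK is installed).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("browser-demo")

@mcp.tool()
def navigate(url: str) -> str:
    """Navigate the browser to the given URL and report the result."""
    # A real server would drive an actual browser (e.g., via Playwright) here.
    return f"Navigated to {url}"

if __name__ == "__main__":
    mcp.run()  # serves tool metadata and executes tool calls for the client
```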
Integration Points for LLM Tool Access
To enable an LLM to access tools via MCP, the language model is typically exposed as an OpenAI-compatible API endpoint [00:02:27]. This integration requires several key translations and considerations:
- Tool Information Conversion: Tool information from MCP services must be converted into lists of JSON tools, as expected by OpenAI endpoints [00:03:02].
- Tool Response Formatting: The tool response must be converted into a format the LLM expects [00:03:11].
- Tool Call Detection and Extraction: When the LLM emits tokens or text indicating a tool call, this intention must be detected and extracted, often in a specific format like Hermes [00:03:21]. This involves parsing the LLM’s output, which typically includes XML tags containing JSON for the tool call [00:04:06].
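To make these translation steps concrete, the sketch below converts an MCP-style tool description into the OpenAI `tools` JSON format and extracts a Hermes-style tool call from raw model output. The `<tool_call>` tag and field names follow the common Hermes convention; the exact names used in the workshop code may differ.

```python
import json
import re

def mcp_tool_to_openai(tool: dict) -> dict:
    """Convert an MCP tool description (name, description, inputSchema) into
    the OpenAI function-calling tool format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        },
    }

def extract_tool_calls(llm_output: str) -> list[dict]:
    """Pull Hermes-style tool calls (JSON wrapped in XML tags) out of raw LLM text."""
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    return [json.loads(m) for m in re.findall(pattern, llm_output, re.DOTALL)]

# Example: the model asks to navigate the browser.
output = '<tool_call>{"name": "browser_navigate", "arguments": {"url": "https://trellis.com"}}</tool_call>'
print(extract_tool_calls(output))  # [{'name': 'browser_navigate', 'arguments': {...}}]
```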
Prompt Structure for Tool Calls
A typical prompt structure sent to an LLM for tool calling includes:
- A system message describing how to make tool calls (e.g., passing JSON within `<tool_code>` XML tags) [00:03:56].
- Information about available tools (e.g., the browser tool) [00:04:15].
- User messages (e.g., “navigate to trellis.com”) [00:04:33].
- The assistant’s response, which may include thinking steps, a tool call (e.g., navigating with the browser), or a text-based response [00:04:37].
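A hedged sketch of what such a prompt might look like as an OpenAI-style message list; the role names follow the standard chat format, while the tool name, arguments, and IDs are illustrative.

```python
# Illustrative message list for one tool-calling turn (names and IDs are made up).
messages = [
    {"role": "system",
     "content": "You may call tools by emitting JSON inside <tool_call>...</tool_call> tags."},
    {"role": "user", "content": "navigate to trellis.com"},
    {"role": "assistant",
     "content": "<think>The user wants the browser opened at trellis.com.</think>",
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "browser_navigate",
                                  "arguments": '{"url": "https://trellis.com"}'}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "Navigated to https://trellis.com"},
]

# The tools list sent alongside the messages, in OpenAI function format.
tools = [{"type": "function",
          "function": {"name": "browser_navigate",
                       "description": "Navigate the browser to a URL.",
                       "parameters": {"type": "object",
                                      "properties": {"url": {"type": "string"}},
                                      "required": ["url"]}}}]
```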
Data Collection for Fine-tuning
To improve the performance of an LLM in tool-use scenarios, fine-tuning is performed using “traces” or logs from high-quality agent runs [00:00:15]. These traces capture the full multi-turn interaction, including the LLM’s reasoning and tool calls.
Running an MCP Agent for Data Collection
An agent configured with MCP servers can generate these traces [00:00:26]. The process involves:
- Setting up an OpenAI-style endpoint: For models like Qwen, enabling features like reasoning and a reasoning parser helps in extracting the LLM’s thought process [00:06:52]. The `--enable-tool-choice` argument allows the LLM to decide which tool to call [00:07:16].
- Configuring MCP Servers: The agent loads tools from configured MCP servers [00:09:40]. For a browser, Playwright offers about 25 tools (e.g., navigate, switch tab, navigate to link) [00:09:48].
- Interacting with the Agent: Users provide prompts, and the agent responds by thinking, making tool calls, and providing text responses [00:10:14]. The browser itself can be run in non-headless mode to observe its actions [00:10:50].
- Logging Traces: The agent logs each run with two main parts: a `messages` part (the full conversation history) and a `tools` part (the list of available tools) [00:12:14].
- Curating Traces: Not all traces will be optimal. Traces can be manually adjusted to improve their quality for training [00:15:12]. This can involve directly guiding the model with system prompts during collection, which can then be excluded from the training data [00:16:03].
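The workshop's logging code isn't reproduced here, but a minimal sketch of this step might look as follows; the `steering_only` flag is purely hypothetical, standing in for however collection-time guidance prompts are marked so they can be excluded from training.

```python
import json

def save_trace(messages: list[dict], tools: list[dict], path: str = "traces.jsonl") -> None:
    """Append one agent run to a JSONL log with its two parts: messages and tools."""
    # Drop system prompts that were only used to steer the model during collection
    # (`steering_only` is a hypothetical marker, not part of any standard format).
    kept = [m for m in messages if not m.get("steering_only")]
    with open(path, "a") as f:
        f.write(json.dumps({"messages": kept, "tools": tools}) + "\n")
```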
Preparing Data for Fine-tuning
Once high-quality traces are collected, they are pushed to a dataset, often on platforms like Hugging Face Hub [00:17:51]. A key step here is “unrolling the data” [00:18:13]. If a conversation has multiple turns, it’s unrolled into multiple rows in the dataset, each representing a full conversation up to that turn [00:18:22]. This approach generates more training data and ensures the model learns from intermediate reasoning steps [00:18:47].
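A minimal sketch of the unrolling step, assuming each trace is stored as a dict with `messages` and `tools` keys: every assistant turn becomes its own training row containing the conversation up to and including that turn. File names are illustrative.

```python
import json

def unroll(trace: dict) -> list[dict]:
    """Expand one multi-turn trace into one training row per assistant turn."""
    rows = []
    for i, message in enumerate(trace["messages"]):
        if message["role"] == "assistant":
            rows.append({
                "messages": trace["messages"][: i + 1],  # history up to this turn
                "tools": trace["tools"],
            })
    return rows

# Load the logged runs and write out the unrolled rows.
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

with open("traces_unrolled.jsonl", "w") as f:
    for trace in traces:
        for row in unroll(trace):
            f.write(json.dumps(row) + "\n")
```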
Fine-tuning the LLM for Tool Use
Fine-tuning is performed to improve an LLM’s ability to utilize tools [00:22:47]. The process typically involves:
- Model Loading: A base model (e.g., a 4-billion parameter Qwen model) is loaded [00:23:16].
- Applying LoRA Adapters: Instead of training all model parameters, low-rank adapters (LoRA) are applied to specific parts like attention modules and MLP layers [00:23:50]. This makes fine-tuning more efficient and requires less VRAM [00:24:00].
- Data Preparation: The collected traces (messages and tools) are passed into the model’s chat template, which formats them into a single long string suitable for training [00:25:11]. This formatted string includes system messages, available tools, user prompts, assistant thinking, and tool calls/responses [00:25:27].
- Training: The model is trained on this prepared dataset. Training parameters like batch size, learning rate, and optimizer are configured [00:28:10]. For small datasets, batch sizes might be limited, leading to jumpy loss [00:28:34].
- Evaluation: After fine-tuning, the model’s performance is re-evaluated to see if its ability to call tools has improved [00:30:06]. Ideally, a separate evaluation set and logging with tools like TensorBoard would be used for a more robust assessment [00:31:05].
- Saving and Deploying: The fine-tuned model and tokenizer can be saved and pushed to a model hub. The adapters can be merged with the base model to create a new inference endpoint [00:30:30].
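A rough end-to-end sketch of these steps (model loading, LoRA adapters, chat-template rendering, training, and saving), using the `transformers`, `peft`, `datasets`, and `trl` libraries. The model name, file names, LoRA settings, and hyperparameters are illustrative, and exact argument names can vary between library versions.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_model = "Qwen/Qwen3-4B"  # illustrative ~4B-parameter Qwen checkpoint
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Low-rank adapters on the attention and MLP projection layers only.
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
))

# Render each unrolled row (messages + tools) into one long training string
# via the model's chat template.
def to_text(row):
    return {"text": tokenizer.apply_chat_template(
        row["messages"], tools=row["tools"], tokenize=False)}

dataset = load_dataset("json", data_files="traces_unrolled.jsonl", split="train")
dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    processing_class=tokenizer,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=1,  # small dataset -> small batches, jumpy loss
        learning_rate=2e-4,
        num_train_epochs=1,
        output_dir="qwen-mcp-sft",
    ),
)
trainer.train()

# Save the adapters (and optionally merge them into the base model for serving).
model.save_pretrained("qwen-mcp-sft-adapters")
tokenizer.save_pretrained("qwen-mcp-sft-adapters")
```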
Relation to Reinforcement Learning
While reinforcement learning (RL) techniques like GRPO can automate trace generation and reward-based systems for LLMs, it’s beneficial to first perform supervised fine-tuning (SFT) on high-quality, manually curated traces [00:32:02]. An LLM that hasn’t undergone SFT for a specific domain might struggle to generate sufficient successful traces for effective RL [00:32:33]. For RL, defining verifiable rewards requires systematically generated data with ground truths [00:32:51].
Further Resources
All materials for this workshop are available in the Trellis Research AI Worlds Fair 2025 repo, specifically in the `MCP agent fine-tune` folder [00:00:45]. More detailed videos on setting up and creating custom MCP servers are available on the Trellis Research YouTube channel [00:35:11].