From: aidotengineer
The process of fine-tuning a large language model (LLM) involves taking traces or logs from high-quality runs and using them to improve the model’s performance [00:00:15]. This workshop specifically demonstrates how to fine-tune a Qwen model [00:00:35].
Why Qwen Models for Finetuning?
It is recommended to maintain consistency between the model used to generate data and the model intended for fine-tuning [00:05:55]. While a stronger model could, in principle, be used to generate data, OpenAI models do not share their thinking traces [00:06:04]. Qwen models, however, can share their reasoning traces, making them suitable for this purpose [00:06:10].
Specific Qwen models mentioned include:
- Qwen 3 [00:00:36]
- The 30 billion parameter Qwen model (with 3 billion activated parameters, a mixture of experts model) for data generation [00:06:13].
- The 4 billion parameter Qwen model for training [00:23:16].
Data Collection for Finetuning
The first step in finetuning is to generate high-quality reasoning traces [00:00:26]. These traces include the tools used and the multi-turn conversation history [00:00:29].
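A single logged trace might look roughly like the sketch below. This is illustrative only: the field names (e.g. `reasoning_content`) follow the OpenAI-style message format referenced later in this summary, and the exact shape depends on the logging setup.

```python
# A minimal sketch of one logged trace: the full multi-turn message history
# plus the tool definitions available to the agent. Field names are illustrative.
trace = {
    "messages": [
        {"role": "system", "content": "You are a browsing agent. Call tools as needed."},
        {"role": "user", "content": "Find the title of the linked page."},
        {
            "role": "assistant",
            "reasoning_content": "I should open the page with the browser tool.",
            "tool_calls": [{
                "type": "function",
                "function": {"name": "browser_navigate",
                             "arguments": '{"url": "https://example.com"}'},
            }],
        },
        {"role": "tool", "content": "<truncated accessibility tree>"},
        {"role": "assistant", "content": "The page title is 'Example Domain'."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "browser_navigate",
            "description": "Open a URL in the browser",
            "parameters": {"type": "object",
                           "properties": {"url": {"type": "string"}},
                           "required": ["url"]},
        },
    }],
}
```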
Setting up the Qwen Endpoint
To collect data, Qwen models are exposed as OpenAI-style endpoints [00:02:31]. This involves:
- Running a Docker image for vLLM with the Qwen model [00:06:47].
- Enabling reasoning and a reasoning parser to extract thinking tokens into a JSON format [00:06:52].
- Setting `max_model_length` to 32,000 [00:07:09].
- Enabling automatic tool choice for the LLM [00:07:16].
- Specifying a tool parser (e.g., Hermes format) to extract tool calls from the LLM’s text output into JSON [00:07:22]. This addresses the conversion from language model string to OpenAI API-expected JSON format [00:07:46].
- Exposing port 8000 for the server [00:08:00].
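Putting those pieces together, the following is a minimal client-side sketch (not from the workshop itself). It assumes the vLLM server above is listening on port 8000; the model name and tool definition are placeholders, and the `reasoning_content` field depends on the reasoning parser being enabled.

```python
# Minimal sketch: query the local vLLM server through its OpenAI-compatible API.
# Model name, tool definition, and the reasoning_content field are illustrative
# and depend on the vLLM version and the flags used to start the server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "browser_navigate",
        "description": "Open a URL in the browser",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]},
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",  # whatever model the server was started with
    messages=[{"role": "user", "content": "Open example.com and read the title."}],
    tools=tools,
)

message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # thinking tokens, if the reasoning parser is enabled
print(message.tool_calls)                           # parsed tool calls, if auto tool choice is enabled
```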
Agent Operation and Trace Generation
An agent is run with the Qwen model endpoint [00:08:26].
- The agent connects to Model Context Protocol (MCP) servers, which provide access to tools like a browser [00:01:10].
- MCP stores information on tools (how the LLM can make calls) and runs the tools, returning responses to the LLM [00:01:47].
- Tool information from MCP services must be converted into JSON lists for OpenAI endpoints [00:03:02]. Similarly, tool responses must be converted into a format the LLM expects [00:03:11] (both conversions are sketched after this list).
- When the LLM makes a tool call by emitting text, the system detects and extracts this call [00:03:21].
- The tool response, such as an accessibility tree from browser use, can be very long, so it is often truncated for brevity during data collection [00:08:46].
- The LLM’s prompt includes a system message instructing it on how to make tool calls (e.g., by passing JSONs within XML tags) [00:03:56].
- Traces are logged by default, including `messages` (full conversation history) and `tools` (list of available tools) [00:12:06]. The reasoning content is extracted separately [00:20:12].
- Users can manually adjust traces for better quality or pass a system prompt to guide the model [00:16:01]. The goal is to generate clean traces for training data [00:16:23].
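The MCP-to-OpenAI conversions mentioned above might look roughly like the sketch below. The helper names and the structure of `mcp_tools` are illustrative assumptions, not any particular SDK's API.

```python
# Rough sketch of converting MCP tool listings into the JSON "tools" list an
# OpenAI-style endpoint expects, and converting a tool result back into a
# message the LLM can read. Helper names and the mcp_tools structure are
# illustrative.
def mcp_tools_to_openai(mcp_tools: list[dict]) -> list[dict]:
    return [{
        "type": "function",
        "function": {
            "name": t["name"],
            "description": t.get("description", ""),
            "parameters": t.get("inputSchema", {"type": "object", "properties": {}}),
        },
    } for t in mcp_tools]

def tool_result_to_message(tool_call_id: str, result: str, max_chars: int = 4000) -> dict:
    # Tool responses such as accessibility trees can be very long, so truncate.
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result[:max_chars]}
```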
Data Preparation for Finetuning
- Unrolling Data: For multi-turn conversations, the data is “unrolled” into multiple rows (sketched after this list). For example, a three-turn conversation yields three rows, providing more training data from a single interaction [00:18:13]. This is important because the Qwen template only includes reasoning from the most recent turn [00:18:39].
- Pushing to Hugging Face Hub: The collected tools and messages are pushed to a dataset on Hugging Face Hub [00:17:51]. The dataset typically contains columns for `ID`, `timestamp`, `model`, `messages`, and `tools` [00:19:33].
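A rough sketch of both preparation steps, assuming traces shaped like the earlier example; the input file and dataset repository name are placeholders.

```python
# Minimal sketch of unrolling multi-turn traces into several training rows and
# pushing the result to the Hugging Face Hub. File and repo names are placeholders.
import json
from datasets import Dataset

def unroll(trace: dict) -> list[dict]:
    """One row per assistant turn: each row's history ends at that turn, so the
    reasoning of the most recent turn is what the chat template keeps."""
    rows = []
    for i, msg in enumerate(trace["messages"]):
        if msg["role"] == "assistant":
            rows.append({"messages": trace["messages"][: i + 1],
                         "tools": trace["tools"]})
    return rows

# traces.jsonl: one logged trace per line, shaped like the earlier sketch
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

rows = [row for t in traces for row in unroll(t)]
Dataset.from_list(rows).push_to_hub("your-username/agent-traces")  # placeholder repo id
```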
Finetuning Process
The actual finetuning is performed in a notebook, often based on Unsloth’s Qwen fine-tuning notebook [00:23:01]. The steps below are condensed into a sketch at the end of this list.
- Load Model: A smaller Qwen model, such as the 4 billion parameter version, is loaded [00:23:16].
- Prepare Data: The collected dataset from Hugging Face Hub is loaded [00:24:14]. The `messages` and `tools` are passed into a chat template that converts them into a single long string of text [00:25:12].
- Apply LoRA Adapters: The model is prepared for fine-tuning by applying Low-Rank Adapters (LoRA) to specific parts of the model (e.g., attention modules and MLP layers) [00:23:50]. This allows training only a small percentage of parameters, keeping most of the main weights frozen [00:30:17].
- Training Configuration:
- Batch Size: Often set to one due to VRAM limitations, though larger batch sizes (e.g., 32) are ideal for smoother training [00:28:34].
- Epochs: Typically trained for one epoch initially [00:28:48].
- Learning Rate: Fairly high for small models [00:28:58].
- Optimizer: The AdamW 8-bit optimizer can be used to save VRAM [00:29:03].
- Run Training: The model is trained using the prepared data [00:28:08].
- Evaluate Performance: After training, inference is run again to compare performance [00:29:34]. A more elaborate setup with an evaluation set and TensorBoard logging is recommended for robust evaluation [00:31:04].
- Save and Deploy: The fine-tuned model and tokenizer can be saved and pushed to Hugging Face Hub, allowing it to be used as an inference endpoint [00:30:30].
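A condensed sketch of the whole loop, loosely following Unsloth-style Qwen notebooks. Model and dataset names are placeholders, argument names can differ between Unsloth/TRL versions, and the hyperparameters are typical defaults rather than the workshop's exact values.

```python
# Condensed sketch of the fine-tuning steps above, loosely following Unsloth's
# Qwen notebooks. Treat names and hyperparameters as an outline, not a recipe.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# 1) Load a smaller Qwen model and attach LoRA adapters to attention and MLP layers.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B", max_seq_length=32000, load_in_4bit=True)  # placeholder model id
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"])

# 2) Load the collected traces and render messages + tools into a single string
#    with the model's chat template.
dataset = load_dataset("your-username/agent-traces", split="train")  # placeholder repo id

def to_text(row):
    return {"text": tokenizer.apply_chat_template(
        row["messages"], tools=row["tools"], tokenize=False)}

dataset = dataset.map(to_text)

# 3) Train for one epoch with a small batch size and an 8-bit AdamW optimizer.
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(per_device_train_batch_size=1, num_train_epochs=1,
                   learning_rate=2e-4,  # fairly high, as is common for small LoRA runs
                   optim="adamw_8bit", dataset_text_field="text",
                   output_dir="outputs"),
)
trainer.train()

# 4) Save and push the fine-tuned model + tokenizer for use as an inference endpoint.
model.push_to_hub("your-username/qwen3-4b-agent-sft")      # placeholder repo id
tokenizer.push_to_hub("your-username/qwen3-4b-agent-sft")
```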
Related Concepts
- Model Context Protocol (MCP): A protocol for providing services, like tool access, to LLMs [00:01:20].
- Reinforcement Learning (RL): While supervised fine-tuning (SFT) with manual traces is recommended first, RL techniques like GRPO can be applied later [00:32:02]. SFT on high-quality traces speeds up subsequent RL training [00:32:40]. RL requires defining rewards based on verifiably correct answers [00:32:52].
- Tool Calls: The mechanism by which the LLM interacts with external services or functions [00:03:02]. For open-source models, it’s advised to limit the number of tools to 25-50 to avoid confusing the LLM [00:10:01].
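As a trivial illustration of that tool-count guidance, a cap can be applied before exposing tools to the model; the threshold and selection strategy here are placeholder choices.

```python
# Minimal sketch: cap the number of tools exposed to an open-source model,
# following the rough 25-50 guidance above. MAX_TOOLS and the selection
# strategy (here: simply take the first N) are illustrative.
MAX_TOOLS = 40

def select_tools(all_tools: list[dict], max_tools: int = MAX_TOOLS) -> list[dict]:
    if len(all_tools) > max_tools:
        # In practice, prefer a curated task-relevant subset over blind truncation.
        return all_tools[:max_tools]
    return all_tools
```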