From: aidotengineer

The process of fine-tuning a large language model (LLM) involves taking traces or logs from high-quality runs and using them to improve the model’s performance [00:00:15]. This workshop specifically demonstrates how to fine-tune a Qwen model [00:00:35].

Why Qwen Models for Finetuning?

It is recommended to maintain consistency between the model used to generate data and the model intended for fine-tuning [00:05:55]. While a stronger model could, in principle, be used to generate data, OpenAI models do not share their thinking traces [00:06:04]. Qwen models, however, can share their reasoning traces, making them suitable for this purpose [00:06:10].

Specific Qwen models mentioned include:

  • Qwen 3 [00:00:36]
  • The 30-billion-parameter Qwen model (a mixture-of-experts model with 3 billion activated parameters) for data generation [00:06:13].
  • The 4-billion-parameter Qwen model for training [00:23:16].

Data Collection for Finetuning

The first step in finetuning is to generate high-quality reasoning traces [00:00:26]. These traces include the tools used and the multi-turn conversation history [00:00:29].
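As a concrete point of reference, a single collected trace in the OpenAI chat format might look roughly like the sketch below; the tool name and the reasoning_content field are illustrative assumptions, not the workshop’s exact schema.

```python
# Hypothetical single trace: full multi-turn history plus the available tools.
# Field names follow the OpenAI chat format; reasoning_content is how some serving
# stacks (e.g., vLLM reasoning parsers) expose the thinking tokens.
trace = {
    "messages": [
        {"role": "user", "content": "What is the title of https://example.com?"},
        {
            "role": "assistant",
            "reasoning_content": "I should open the page and read its title.",
            "tool_calls": [{
                "type": "function",
                "function": {"name": "browser_open",  # hypothetical tool
                             "arguments": "{\"url\": \"https://example.com\"}"},
            }],
        },
        {"role": "tool", "content": "<accessibility tree, truncated>"},
        {"role": "assistant", "content": "The page title is 'Example Domain'."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "browser_open",
            "parameters": {"type": "object",
                           "properties": {"url": {"type": "string"}},
                           "required": ["url"]},
        },
    }],
}
```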

Setting up the Qwen Endpoint

To collect data, Qwen models are exposed as OpenAI-style endpoints [00:02:31]. This involves:

  • Running a Docker image for vLLM with the Qwen model [00:06:47].
  • Enabling reasoning and a reasoning parser to extract thinking tokens into a JSON format [00:06:52].
  • Setting the maximum model length (max-model-len in vLLM) to 32,000 tokens [00:07:09].
  • Enabling automatic tool choice for the LLM [00:07:16].
  • Specifying a tool parser (e.g., the Hermes format) to extract tool calls from the LLM’s text output into JSON [00:07:22]. This handles the conversion from the raw language model string to the JSON format the OpenAI API expects [00:07:46].
  • Exposing port 8000 for the server [00:08:00].
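With the server listening on port 8000, the endpoint can be exercised using the standard OpenAI Python client. The snippet below is a minimal sketch: the served model name, the example tool, and the reasoning_content attribute (how vLLM reasoning parsers typically surface thinking tokens) are assumptions that may differ in your setup.

```python
from openai import OpenAI

# Point the standard OpenAI client at the locally served Qwen model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "browser_open",  # hypothetical tool for illustration
        "description": "Open a web page and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",  # assumed served model name
    messages=[{"role": "user", "content": "What is the title of https://example.com?"}],
    tools=tools,
)

msg = response.choices[0].message
# The reasoning parser separates thinking tokens from the final answer; the exact
# attribute name depends on the vLLM version.
print(getattr(msg, "reasoning_content", None))
# The tool parser (Hermes format) turns the model's text output into structured tool calls.
if msg.tool_calls:
    print(msg.tool_calls[0].function.name, msg.tool_calls[0].function.arguments)
```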

Agent Operation and Trace Generation

An agent is run with the Qwen model endpoint [00:08:26].

  • The agent connects to Model Context Protocol (MCP) servers, which provide access to tools like a browser [00:01:10].
  • MCP servers store the tool definitions (telling the LLM how it can make calls) and execute the tools, returning responses to the LLM [00:01:47].
  • Tool information from MCP servers must be converted into JSON lists for OpenAI endpoints [00:03:02]. Similarly, tool responses must be converted into a format the LLM expects [00:03:11] (a sketch of this conversion appears after this list).
  • When the LLM makes a tool call by emitting text, the system detects and extracts this call [00:03:21].
  • The tool response, such as an accessibility tree from browser use, can be very long, so it is often truncated for brevity during data collection [00:08:46].
  • The LLM’s prompt includes a system message instructing it on how to make tool calls (e.g., by passing JSONs within XML tags) [00:03:56].
  • Traces are logged by default, including messages (full conversation history) and tools (list of available tools) [00:12:06]. The reasoning content is extracted separately [00:20:12].
  • Users can manually adjust traces for better quality or pass a system prompt to guide the model [00:16:01]. The goal is to generate clean traces for training data [00:16:23].
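The tool conversion and response truncation described above might look roughly like the following sketch. The MCP field names (name, description, inputSchema) come from the MCP specification; the truncation limit and function names are illustrative.

```python
# Convert MCP tool listings into the JSON tool list an OpenAI-style endpoint expects.
def mcp_tools_to_openai(mcp_tools: list[dict]) -> list[dict]:
    return [
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t.get("description", ""),
                "parameters": t.get("inputSchema", {"type": "object", "properties": {}}),
            },
        }
        for t in mcp_tools
    ]

MAX_TOOL_RESPONSE_CHARS = 4000  # illustrative limit; accessibility trees can be very long

# Wrap a tool's output as a chat message, truncating oversized responses before they
# enter the conversation history that becomes training data.
def tool_result_to_message(tool_call_id: str, result_text: str) -> dict:
    if len(result_text) > MAX_TOOL_RESPONSE_CHARS:
        result_text = result_text[:MAX_TOOL_RESPONSE_CHARS] + " …[truncated]"
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result_text}
```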

Data Preparation for Finetuning

  • Unrolling Data: For multi-turn conversations, the data is “unrolled” into multiple rows. For example, a three-turn conversation yields three rows, providing more training data from a single interaction [00:18:13]. This is important because the Qwen template only includes reasoning from the most recent turn [00:18:39]. (A sketch of this step follows the list.)
  • Pushing to Hugging Face Hub: The collected tools and messages are pushed to a dataset on Hugging Face Hub [00:17:51]. The dataset typically contains columns for ID, timestamp, model, messages, and tools [00:19:33].
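A rough sketch of the unrolling and upload steps, assuming each collected trace is a dict with messages and tools keys and using a placeholder dataset repository name:

```python
from datasets import Dataset

def unroll(trace: dict) -> list[dict]:
    """Turn one multi-turn trace into one training row per assistant turn."""
    rows = []
    for i, message in enumerate(trace["messages"]):
        if message["role"] == "assistant":
            # Each row ends at an assistant turn, so the reasoning for that turn is
            # the most recent one and is kept by the Qwen chat template.
            rows.append({"messages": trace["messages"][: i + 1], "tools": trace["tools"]})
    return rows

# `traces` is the list of collected traces from the agent runs above.
rows = [row for trace in traces for row in unroll(trace)]
Dataset.from_list(rows).push_to_hub("your-username/agent-traces")  # placeholder repo name
```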

Finetuning Process

The actual fine-tuning is performed in a notebook, often based on Unsloth’s Qwen fine-tuning notebook [00:23:01].

  1. Load Model: A smaller Qwen model, such as the 4 billion parameter version, is loaded [00:23:16].
  2. Prepare Data: The collected dataset from Hugging Face Hub is loaded [00:24:14]. The messages and tools are passed into a chat template that converts them into a single long string of text [00:25:12] (see the sketch following this list).
  3. Apply LoRA Adapters: The model is prepared for fine-tuning by applying Low-Rank Adaptation (LoRA) adapters to specific parts of the model (e.g., attention modules and MLP layers) [00:23:50]. This allows training only a small percentage of parameters, keeping most of the main weights frozen [00:30:17].
  4. Training Configuration:
    • Batch Size: Often set to one due to VRAM limitations, though larger batch sizes (e.g., 32) are ideal for smoother training [00:28:34].
    • Epochs: Typically trained for one epoch initially [00:28:48].
    • Learning Rate: Fairly high for small models [00:28:58].
    • Optimizer: The AdamW 8-bit optimizer can be used to save VRAM [00:29:03].
  5. Run Training: The model is trained using the prepared data [00:28:08].
  6. Evaluate Performance: After training, inference is run again to compare performance [00:29:34]. A more elaborate setup with an evaluation set and TensorBoard logging is recommended for robust evaluation [00:31:04].
  7. Save and Deploy: The fine-tuned model and tokenizer can be saved and pushed to Hugging Face Hub, allowing it to be used as an inference endpoint [00:30:30].
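A condensed sketch of steps 2-4 and 7 is shown below. The workshop itself uses Unsloth’s notebook; this sketch instead uses the plain Hugging Face transformers/peft/trl stack, and the model id, dataset repository, and hyperparameters are assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer

model_name = "Qwen/Qwen3-4B"  # assumed 4-billion-parameter model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 2: render messages + tools into a single training string via the chat template.
dataset = load_dataset("your-username/agent-traces", split="train")  # placeholder repo
def to_text(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tools=example["tools"], tokenize=False)}
dataset = dataset.map(to_text)

# Step 3: attach LoRA adapters to the attention and MLP projections; base weights stay frozen.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
))

# Step 4: training configuration mirroring the values discussed above.
config = SFTConfig(
    per_device_train_batch_size=1,   # small batch to fit in VRAM
    num_train_epochs=1,
    learning_rate=2e-4,              # fairly high, as is common for small LoRA runs
    optim="adamw_bnb_8bit",          # 8-bit AdamW to save VRAM
    output_dir="qwen3-4b-agent-sft",
)
SFTTrainer(model=model, train_dataset=dataset, args=config).train()

# Step 7: save and push the adapters and tokenizer for use as an inference endpoint.
model.push_to_hub("your-username/qwen3-4b-agent-sft")      # placeholder repo names
tokenizer.push_to_hub("your-username/qwen3-4b-agent-sft")
```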

Additional Concepts

  • Model Context Protocol (MCP): A protocol for providing services, such as tool access, to LLMs [00:01:20].
  • Reinforcement Learning (RL): While supervised fine-tuning (SFT) with manual traces is recommended first, RL techniques like GRPO can be applied later [00:32:02]. SFT on high-quality traces speeds up subsequent RL training [00:32:40]. RL requires defining rewards based on verifiably correct answers [00:32:52].
  • Tool Calls: The mechanism by which the LLM interacts with external services or functions [00:03:02]. For open-source models, it’s advised to limit the number of tools to 25-50 to avoid confusing the LLM [00:10:01].