From: aidotengineer

The process of fine-tuning a large language model (LLM) involves taking traces or logs from high-quality runs and using them to improve the model’s performance [00:00:15]. This workshop specifically demonstrates how to fine-tune a Qwen model [00:00:35].

Why Qwen Models for Finetuning?

It is recommended to maintain consistency between the model used to generate data and the model intended for fine-tuning [00:05:55]. While a stronger model could, in principle, be used to generate data, OpenAI models do not share their thinking traces [00:06:04]. Qwen models, however, can share their reasoning traces, making them suitable for this purpose [00:06:10].

Specific Qwen models mentioned include:

  • Qwen 3 [00:00:36]
  • The 30-billion-parameter Qwen model (a mixture-of-experts model with 3 billion activated parameters) for data generation [00:06:13].
  • The 4-billion-parameter Qwen model for training [00:23:16].

Data Collection for Finetuning

The first step in finetuning is to generate high-quality reasoning traces [00:00:26]. These traces include the tools used and the multi-turn conversation history [00:00:29].
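As a concrete point of reference, a single collected trace in the OpenAI chat format might look roughly like the sketch below; the tool name and the reasoning_content field are illustrative assumptions, not the workshop’s exact schema.

```python
# Hypothetical single trace: full multi-turn history plus the available tools.
# Field names follow the OpenAI chat format; reasoning_content is how some serving
# stacks (e.g., vLLM reasoning parsers) expose the thinking tokens.
trace = {
    "messages": [
        {"role": "user", "content": "What is the title of https://example.com?"},
        {
            "role": "assistant",
            "reasoning_content": "I should open the page and read its title.",
            "tool_calls": [{
                "type": "function",
                "function": {"name": "browser_open",  # hypothetical tool
                             "arguments": "{\"url\": \"https://example.com\"}"},
            }],
        },
        {"role": "tool", "content": "<accessibility tree, truncated>"},
        {"role": "assistant", "content": "The page title is 'Example Domain'."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "browser_open",
            "parameters": {"type": "object",
                           "properties": {"url": {"type": "string"}},
                           "required": ["url"]},
        },
    }],
}
```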

Setting up the Qwen Endpoint

To collect data, Qwen models are exposed as OpenAI-style endpoints [00:02:31]. This involves:

  • Running a Docker image for vLLM with the Qwen model [00:06:47].
  • Enabling reasoning and a reasoning parser to extract thinking tokens into a JSON format [00:06:52].
  • Setting the maximum model length (max-model-len in vLLM) to 32,000 tokens [00:07:09].
  • Enabling automatic tool choice for the LLM [00:07:16].
  • Specifying a tool parser (e.g., the Hermes format) to extract tool calls from the LLM’s text output into JSON [00:07:22]. This handles the conversion from the raw language model string to the JSON format the OpenAI API expects [00:07:46].
  • Exposing port 8000 for the server [00:08:00].
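With the server listening on port 8000, the endpoint can be exercised using the standard OpenAI Python client. The snippet below is a minimal sketch: the served model name, the example tool, and the reasoning_content attribute (how vLLM reasoning parsers typically surface thinking tokens) are assumptions that may differ in your setup.

```python
from openai import OpenAI

# Point the standard OpenAI client at the locally served Qwen model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "browser_open",  # hypothetical tool for illustration
        "description": "Open a web page and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",  # assumed served model name
    messages=[{"role": "user", "content": "What is the title of https://example.com?"}],
    tools=tools,
)

msg = response.choices[0].message
# The reasoning parser separates thinking tokens from the final answer; the exact
# attribute name depends on the vLLM version.
print(getattr(msg, "reasoning_content", None))
# The tool parser (Hermes format) turns the model's text output into structured tool calls.
if msg.tool_calls:
    print(msg.tool_calls[0].function.name, msg.tool_calls[0].function.arguments)
```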

Agent Operation and Trace Generation

An agent is run with the Qwen model endpoint [00:08:26].

  • The agent connects to Model Context Protocol (MCP) servers, which provide access to tools like a browser [00:01:10].
  • MCP servers store the tool definitions (telling the LLM how it can make calls) and execute the tools, returning responses to the LLM [00:01:47].
  • Tool information from MCP servers must be converted into JSON lists for OpenAI endpoints [00:03:02]. Similarly, tool responses must be converted into a format the LLM expects [00:03:11] (a sketch of this conversion appears after this list).
  • When the LLM makes a tool call by emitting text, the system detects and extracts this call [00:03:21].
  • The tool response, such as an accessibility tree from browser use, can be very long, so it is often truncated for brevity during data collection [00:08:46].
  • The LLM’s prompt includes a system message instructing it on how to make tool calls (e.g., by passing JSONs within XML tags) [00:03:56].
  • Traces are logged by default, including messages (full conversation history) and tools (list of available tools) [00:12:06]. The reasoning content is extracted separately [00:20:12].
  • Users can manually adjust traces for better quality or pass a system prompt to guide the model [00:16:01]. The goal is to generate clean traces for training data [00:16:23].
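The tool conversion and response truncation described above might look roughly like the following sketch. The MCP field names (name, description, inputSchema) come from the MCP specification; the truncation limit and function names are illustrative.

```python
# Convert MCP tool listings into the JSON tool list an OpenAI-style endpoint expects.
def mcp_tools_to_openai(mcp_tools: list[dict]) -> list[dict]:
    return [
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t.get("description", ""),
                "parameters": t.get("inputSchema", {"type": "object", "properties": {}}),
            },
        }
        for t in mcp_tools
    ]

MAX_TOOL_RESPONSE_CHARS = 4000  # illustrative limit; accessibility trees can be very long

# Wrap a tool's output as a chat message, truncating oversized responses before they
# enter the conversation history that becomes training data.
def tool_result_to_message(tool_call_id: str, result_text: str) -> dict:
    if len(result_text) > MAX_TOOL_RESPONSE_CHARS:
        result_text = result_text[:MAX_TOOL_RESPONSE_CHARS] + " …[truncated]"
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result_text}
```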

Data Preparation for Finetuning

  • Unrolling Data: For multi-turn conversations, the data is “unrolled” into multiple rows. For example, a three-turn conversation yields three rows, providing more training data from a single interaction [00:18:13]. This is important because the Qwen template only includes reasoning from the most recent turn [00:18:39]. (A sketch of this step follows the list.)
  • Pushing to Hugging Face Hub: The collected tools and messages are pushed to a dataset on Hugging Face Hub [00:17:51]. The dataset typically contains columns for ID, timestamp, model, messages, and tools [00:19:33].
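A rough sketch of the unrolling and upload steps, assuming each collected trace is a dict with messages and tools keys and using a placeholder dataset repository name:

```python
from datasets import Dataset

def unroll(trace: dict) -> list[dict]:
    """Turn one multi-turn trace into one training row per assistant turn."""
    rows = []
    for i, message in enumerate(trace["messages"]):
        if message["role"] == "assistant":
            # Each row ends at an assistant turn, so the reasoning for that turn is
            # the most recent one and is kept by the Qwen chat template.
            rows.append({"messages": trace["messages"][: i + 1], "tools": trace["tools"]})
    return rows

# `traces` is the list of collected traces from the agent runs above.
rows = [row for trace in traces for row in unroll(trace)]
Dataset.from_list(rows).push_to_hub("your-username/agent-traces")  # placeholder repo name
```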

Finetuning Process

The actual fine-tuning is performed in a notebook, often based on Unsloth’s Qwen fine-tuning notebook [00:23:01].

  1. Load Model: A smaller Qwen model, such as the 4 billion parameter version, is loaded [00:23:16].
  2. Prepare Data: The collected dataset from Hugging Face Hub is loaded [00:24:14]. The messages and tools are passed into a chat template that converts them into a single long string of text [00:25:12] (see the sketch following this list).
  3. Apply LoRA Adapters: The model is prepared for fine-tuning by applying Low-Rank Adaptation (LoRA) adapters to specific parts of the model (e.g., attention modules and MLP layers) [00:23:50]. This allows training only a small percentage of parameters, keeping most of the main weights frozen [00:30:17].
  4. Training Configuration:
    • Batch Size: Often set to one due to VRAM limitations, though larger batch sizes (e.g., 32) are ideal for smoother training [00:28:34].
    • Epochs: Typically trained for one epoch initially [00:28:48].
    • Learning Rate: Fairly high for small models [00:28:58].
    • Optimizer: The AdamW 8-bit optimizer can be used to save VRAM [00:29:03].
  5. Run Training: The model is trained using the prepared data [00:28:08].
  6. Evaluate Performance: After training, inference is run again to compare performance [00:29:34]. A more elaborate setup with an evaluation set and TensorBoard logging is recommended for robust evaluation [00:31:04].
  7. Save and Deploy: The fine-tuned model and tokenizer can be saved and pushed to Hugging Face Hub, allowing it to be used as an inference endpoint [00:30:30].
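A condensed sketch of steps 2-4 and 7 is shown below. The workshop itself uses Unsloth’s notebook; this sketch instead uses the plain Hugging Face transformers/peft/trl stack, and the model id, dataset repository, and hyperparameters are assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer

model_name = "Qwen/Qwen3-4B"  # assumed 4-billion-parameter model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 2: render messages + tools into a single training string via the chat template.
dataset = load_dataset("your-username/agent-traces", split="train")  # placeholder repo
def to_text(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tools=example["tools"], tokenize=False)}
dataset = dataset.map(to_text)

# Step 3: attach LoRA adapters to the attention and MLP projections; base weights stay frozen.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
))

# Step 4: training configuration mirroring the values discussed above.
config = SFTConfig(
    per_device_train_batch_size=1,   # small batch to fit in VRAM
    num_train_epochs=1,
    learning_rate=2e-4,              # fairly high, as is common for small LoRA runs
    optim="adamw_bnb_8bit",          # 8-bit AdamW to save VRAM
    output_dir="qwen3-4b-agent-sft",
)
SFTTrainer(model=model, train_dataset=dataset, args=config).train()

# Step 7: save and push the adapters and tokenizer for use as an inference endpoint.
model.push_to_hub("your-username/qwen3-4b-agent-sft")      # placeholder repo names
tokenizer.push_to_hub("your-username/qwen3-4b-agent-sft")
```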

Additional Concepts

  • Model Context Protocol (MCP): A protocol for providing services, such as tool access, to LLMs [00:01:20].
  • Reinforcement Learning (RL): While supervised fine-tuning (SFT) with manual traces is recommended first, RL techniques like GRPO can be applied later [00:32:02]. SFT on high-quality traces speeds up subsequent RL training [00:32:40]. RL requires defining rewards based on verifiably correct answers [00:32:52].
  • Tool Calls: The mechanism by which the LLM interacts with external services or functions [00:03:02]. For open-source models, it’s advised to limit the number of tools to 25-50 to avoid confusing the LLM [00:10:01].