From: aidotengineer

Qwen, developed by Alibaba's Qwen team, is a series of large language models (LLMs) and large multimodal models, with the overarching goal of building a generalist model and agent [00:21:40].

Introducing Qwen 3

Released shortly before the Spring Festival, Qwen 2.5 Max was an instruction-tuned model that served as a strong foundation, achieving competitive performance against state-of-the-art models like Claude 3.5, GPT-4o, and DeepSeek V3 [01:36:13]. The team believed, however, that LLMs had more potential beyond simple instruction tuning, especially when reinforcement learning is used to make them smarter [02:07:33]. This led to significant performance gains, particularly on reasoning tasks such as math and coding [02:28:44].

Recently, the team released Qwen 3, their latest large language model, available in multiple sizes of dense and mixture-of-experts (MoE) models [03:18:03]. The flagship is an MoE model with 235 billion total parameters, of which only 22 billion are activated, making it efficient yet effective [03:33:04]. A smaller, very fast MoE model with 30 billion total parameters (activating 3 billion) can even outperform Qwen 32B on some tasks [04:10:48]. Furthermore, a 4-billion-parameter model, built with distillation techniques, shows thinking capabilities competitive with the previous flagship, Qwen 2.5 72B, and can be deployed on mobile devices [04:37:34].

Hybrid Thinking Mode

A key feature of Qwen 3 is its hybrid thinking mode [05:24:23], which allows a single model to exhibit both thinking and non-thinking behaviors [05:30:30].

  • Thinking Mode: In this mode, the model reflects on and explores possibilities before providing a detailed answer [05:42:32]. Examples of models with thinking behavior include OpenAI o1 and DeepSeek R1 [06:05:41].
  • Non-Thinking Mode: This functions like a traditional instruction-tuned chatbot, providing near-instant answers without an explicit thinking process [06:09:47].

Qwen 3 is noted as potentially the first in the open-source community to combine both modes in a single model; users can control the behavior through prompts or generation hyperparameters [06:23:17].
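As an illustration, here is a minimal sketch of switching between the two modes, assuming the `enable_thinking` chat-template flag and the `/think` / `/no_think` soft switches that Qwen 3 documents for Hugging Face transformers; the checkpoint name and generation settings are placeholders, not a prescription.

```python
# Minimal sketch: toggling Qwen 3's thinking vs. non-thinking behavior.
# Assumes the documented Qwen 3 chat-template flag `enable_thinking`;
# the checkpoint name and token limits are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"  # any Qwen 3 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are there below 100?"}]

# Hard switch: enable_thinking=True inserts a <think>...</think> phase before the answer;
# enable_thinking=False behaves like a traditional instruction-tuned chatbot.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set to False for near-instant, non-thinking replies
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(output_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

# Soft switch: "/think" or "/no_think" appended to a user turn overrides the default per message.
messages.append({"role": "user", "content": "Now just give me the count. /no_think"})
```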

Dynamic Thinking Budget

Building on the hybrid thinking mode, Qwen 3 introduces a dynamic thinking budget [06:40:48]. The thinking budget defines the maximum number of thinking tokens the model may spend on a task [06:46:08].

  • Usage: If a task requires thinking and the thinking process finishes within the allocated budget (e.g., 8,000 tokens used against a 32,000-token budget), the model simply provides the answer [06:58:33]. If the thinking process needs more tokens than the budget allows (e.g., 8,000 tokens against a 4,000-token budget), the thinking is truncated at the budget limit and the model answers from there (a minimal sketch of this truncation logic follows this list) [07:16:32].
  • Performance Impact: Performance increases significantly with larger thinking budgets [07:40:02]. For instance, on AIME 24, a small thinking budget might yield just over 40%, while a large budget of 32,000 tokens can achieve over 80% [07:51:39].
  • Efficiency: This feature allows users to optimize token usage; for example, if a task only requires 95% accuracy, one might find that an 8,000-token thinking budget is already sufficient, avoiding wasted tokens [08:21:04].
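To make the truncation behavior concrete, here is a minimal sketch of a budget wrapper, not the official implementation: the `generate_tokens` helper, the `</think>` delimiter, and the early-stop phrase are assumptions for illustration.

```python
# Hedged sketch of a thinking-budget wrapper: cap the number of thinking tokens,
# and if the model has not closed its thinking block by then, close it ourselves
# and force the model to answer. `generate_tokens` is a hypothetical helper
# (prompt, max_new_tokens, stop) -> generated text; tag names are assumptions.

THINK_END = "</think>"

def generate_with_budget(generate_tokens, prompt, thinking_budget=8000, answer_budget=2048):
    # Phase 1: let the model think, stopping at the budget or at the closing tag.
    thought = generate_tokens(prompt, max_new_tokens=thinking_budget, stop=[THINK_END])

    if THINK_END not in thought:
        # Budget exhausted: truncate the thought and close the thinking block
        # so the model switches to answering instead of thinking further.
        thought += "\nConsidering the limited budget, I have to answer now." + THINK_END

    # Phase 2: continue from the (possibly truncated) thought to produce the final answer.
    answer = generate_tokens(prompt + thought, max_new_tokens=answer_budget, stop=None)
    return thought, answer
```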

Enhanced Agent Capabilities

Qwen 3 has significantly increased capabilities in agents and coding, with enhanced support for MCP (Model Context Protocol) and popular agent frameworks [09:40:11]. The models can effectively use tools during their thinking process, make function calls, receive feedback from the environment, and continue thinking, which is beneficial for inference-time scaling [09:55:05].

Examples provided include:

  • Using tools during thinking, receiving feedback, and continuing to think while making function calls [09:55:05].
  • Organizing a desktop by accessing the file system, thinking about which tools to use, executing them, getting feedback, and continuing to think until the task is complete (a minimal sketch of such a loop follows this list) [10:29:08].
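Such a loop can be sketched against an OpenAI-compatible endpoint, for example a locally served Qwen 3 model; the endpoint URL, model name, and the toy `list_files` tool below are assumptions for illustration, not the setup shown in the talk.

```python
# Hedged sketch of the think -> call tool -> observe -> continue loop,
# against an OpenAI-compatible endpoint (e.g., a locally served Qwen 3 model).
# Endpoint URL, model name, and the toy list_files tool are illustrative assumptions.
import json
import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def list_files(path: str) -> str:
    """Toy file-system tool the agent can call while organizing a desktop."""
    return json.dumps(os.listdir(path))

tools = [{
    "type": "function",
    "function": {
        "name": "list_files",
        "description": "List the files in a directory.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "Organize the files on my desktop."}]

# The model alternates between thinking, emitting tool calls, and reading the tool
# results fed back as environment feedback, until it stops requesting tools.
for _ in range(10):  # bound the number of think/act rounds in this sketch
    reply = client.chat.completions.create(
        model="Qwen3-235B-A22B",  # placeholder model name
        messages=messages,
        tools=tools,
    ).choices[0].message
    messages.append(reply)
    if not reply.tool_calls:
        print(reply.content)
        break
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        result = list_files(**args)  # execute the requested tool
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```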

The goal is to evolve models beyond simple chatbots into highly productive agents in real-world working environments [11:03:00].

Future Directions

In the future, the focus will shift from merely training models to training agents [21:10:48]. This involves:

  • Enhanced Pre-training: Using more and cleaner multimodal data, along with synthetic data, and potentially incorporating reinforcement learning into pre-training itself rather than relying solely on next-token prediction [21:17:15].
  • Scaling Laws Evolution: Shifting from scaling model sizes and pre-training data to scaling compute in reinforcement learning [22:11:00].
  • Long-Horizon Reasoning: Focusing on models capable of long-horizon reasoning with environment feedback. Models that can interact with their environment, receive feedback, and continue thinking are expected to become increasingly competitive and smarter with inference time scaling [22:28:13]. This approach leans towards proactive AI agents that can plan and adapt.
  • Context Scaling: Aiming to scale context to at least 1 million tokens for most models this year, with an eventual goal of 10 million tokens and even infinite context [22:56:56].
  • Modality Scaling: Increasing model capabilities and productivity by scaling modalities in both inputs and outputs [23:25:01]. This includes vision language understanding for tasks like creating GUI agents and unifying understanding and generation (e.g., image understanding and generation simultaneously) [23:37:38].

These efforts emphasize the transition from training models to training sophisticated AI agents capable of complex interactions and reasoning [24:36:23].