From: aidotengineer
Qwen is a series of large language models (LLMs) and large multimodal models (LMMs) with a stated dream of building a generalist model and agent [00:00:26].
Qwen Resources and Products
Qwen offers several resources for interaction and development:
- Qwen Chat: A chat interface available at chat.qwen.ai [00:00:36]. This platform allows users to interact with the latest models, including multimodal models via image and video uploads, and omni models through voice and video chat [00:00:42].
- Blog: Technical details and new releases are shared on the Qwen blog at qwen.github.io [00:01:03].
- GitHub: Qwen maintains open-source code repositories on GitHub [00:01:17].
- Hugging Face: Model checkpoints are available on Hugging Face, allowing developers to download and experiment with the models [00:01:22].
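As a minimal sketch of that workflow with the Hugging Face transformers library (the repo name Qwen/Qwen3-4B is used purely for illustration; any published Qwen checkpoint works the same way):

```python
# Minimal sketch: download a Qwen checkpoint from Hugging Face and run one
# generation with transformers. The model ID is an illustrative assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"  # assumed repo name; substitute any Qwen checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Give a one-sentence summary of Qwen."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```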
Evolution of Qwen Models
Qwen 2.5 Max
Released just before the Spring Festival, Qwen 2.5 Max is a large instruction-tuned model that serves as a strong foundation for further development [00:01:35]. It demonstrated competitive performance against state-of-the-art models such as Claude 3.5 Sonnet, GPT-4o, and DeepSeek V3 [00:01:53]. The developers found that reinforcement learning significantly improved its performance, with consistent gains in reasoning tasks such as math and coding [00:02:17]. For example, a 32-billion-parameter model saw its score on AIME 24 increase from approximately 65 to 80 [00:02:39].
Qwen 3
Qwen 3 is the latest generation of large language models, featuring multiple sizes of dense and mixture-of-experts (MoE) models [00:03:19].
Flagship Models
- Qwen3-235B-A22B: The flagship MoE model, with 235 billion total parameters of which only 22 billion are activated per token [00:03:33]. It is both efficient and effective, closely trailing top-tier models such as Gemini 2.5 Pro [00:03:49].
- Largest Dense Model: This model also exhibits very competitive performance [00:04:04].
Smaller, Efficient Models
- Qwen3-30B-A3B: A relatively small, fast MoE model with 30 billion total parameters, activating only 3 billion [00:04:10]. It can even outperform the Qwen 32-billion-parameter dense model on some tasks [00:04:21].
- 4 Billion Parameter Model: Despite its small size, this model incorporates advanced distillation techniques to transfer knowledge from larger models [00:04:34]. It shows competitive reasoning capabilities, even comparable to the flagship Qwen 2.5 72B model [00:04:52], and can be deployed on mobile devices [00:05:16].
Key Features of Qwen 3
- Hybrid Thinking Mode: This feature allows a single model to switch between “thinking” and “non-thinking” behaviors [00:05:24].
- Thinking Mode: The model reflects, explores possibilities, and prepares a detailed answer before presenting it, similar to reasoning models such as QwQ and DeepSeek R1 [00:05:42].
- Non-Thinking Mode: Functions as a traditional instruction-tuned chatbot, providing near-instant answers without a thinking delay [00:06:09].
- This combination, controllable via prompts or hyperparameters, is presented as a first in the open-source community [00:06:23]; a minimal code sketch of the switch follows this list.
- Dynamic Thinking Budget: A feature derived from the hybrid thinking mode, allowing users to define the maximum number of thinking tokens (e.g., 32,000 tokens) [00:06:41]. Performance increases significantly with a larger thinking budget; for instance, on AIME 24, scores can rise from just over 40 with a small budget to over 80 with a 32,000-token budget [00:07:41].
- Multilingual Support: Qwen 3 supports 119 languages and dialects, a substantial increase from Qwen 2.5's 29 languages [00:08:52]. This aims to enhance the global application and accessibility of large language models [00:09:15].
- Enhanced Agent and Coding Capabilities: Specific improvements have been made to support agents and coding, including enhanced support for MCP (Model Context Protocol) [00:09:41]. The models can use tools during their thinking process, receive feedback from the environment, and continue thinking, which benefits inference-time scaling [00:09:56]. Examples include using tools for calculations and organizing a desktop by accessing the file system [00:09:56]. The goal is for models to be productive agents rather than simple chatbots [00:11:04].
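As a minimal sketch of the hybrid thinking switch: the Qwen 3 model cards describe an enable_thinking flag in the chat template (plus /think and /no_think soft switches inside prompts); the snippet below assumes that flag and uses max_new_tokens as a crude stand-in for a real thinking budget, which would cap the reasoning trace specifically.

```python
# Sketch of Qwen 3's hybrid thinking mode, assuming the `enable_thinking`
# chat-template flag described in the Qwen 3 model cards. The repo name is
# illustrative; `max_new_tokens` only loosely approximates a thinking budget.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 100?"}]

def run(enable_thinking: bool, budget: int) -> str:
    # The chat template decides whether the model opens a reasoning block.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=budget)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(run(enable_thinking=True, budget=4096))   # thinking mode: reasons before answering
print(run(enable_thinking=False, budget=512))   # non-thinking mode: near-instant answer
```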
Open-Weighted Models
Qwen has released open weights for many models, including MoE models (30B total / 3B activated, 235B total / 22B activated) and six dense models [00:11:21]. Smaller models can be used for testing and drafting (e.g., as draft models in speculative decoding), while the 4-billion-parameter model is suitable for mobile deployment [00:11:43]. The 32-billion-parameter model is noted for its strength and competitiveness, making it suitable for reinforcement learning and local deployment [00:11:53]. The developers believe MoE models represent a future trend [00:12:18].
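One way to use a small open-weight checkpoint for drafting is speculative (assisted) decoding; the sketch below relies on the assistant_model argument of transformers' generate and assumes Qwen 3 repo names and a shared tokenizer between the two checkpoints.

```python
# Sketch: a small Qwen model drafts tokens that a larger Qwen model verifies
# (assisted/speculative decoding). Repo names are illustrative assumptions;
# both checkpoints must share the same tokenizer for this to work.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "Qwen/Qwen3-32B"  # assumed large checkpoint
draft_id = "Qwen/Qwen3-0.6B"  # assumed small checkpoint used as the drafter

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype="auto", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Explain mixture-of-experts in one paragraph.", return_tensors="pt").to(target.device)
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```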
Multimodal Models
Beyond LLMs, Qwen is also developing multimodal models, with a strong focus on vision-language models (VLMs) [00:12:35].
Qwen-VL and Qwen 2.5 VL
Qwen 2.5 VL, released in January, achieved competitive performance across vision-language benchmarks, including understanding benchmarks like MMMU, math benchmarks like MathVista, and general VQA benchmarks [00:12:49]. As with the LLMs, Qwen-VL models also benefit from "thinking" capabilities, showing improved performance on reasoning tasks such as mathematics when given a larger maximum thinking length (the equivalent of a thinking budget) [00:13:16].
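A hedged sketch of Qwen 2.5 VL inference with transformers follows; it assumes the Qwen2_5_VLForConditionalGeneration class and the Qwen/Qwen2.5-VL-7B-Instruct repo name, and passes a PIL image straight to the processor rather than going through the qwen_vl_utils helper used in the official examples.

```python
# Sketch of vision-language inference with a Qwen 2.5 VL checkpoint.
# Class name and repo ID are assumptions based on recent transformers releases.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

image = Image.open("chart.png")  # any local image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What trend does this chart show?"},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```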
Omni Model
The long-term goal for multimodal models is to build an “omni model” capable of accepting and generating multiple modalities (text, vision, audio) [00:13:49]. A current attempt is a 7-billion parameter model that accepts text, vision (images and videos), and audio inputs, and can generate text and audio outputs [00:14:10]. This model is used in voice, video, and text chats [00:14:41]. It achieves state-of-the-art performance in audio tasks for its size and surprisingly outperforms Qwen 2.5 VL 7B in vision-language understanding tasks [00:14:49]. Future work includes recovering performance drops in language and agent tasks and improving data quality and training methods [00:15:20].
Open Sourcing and Support
Qwen is committed to open sourcing, believing it helps improve models through developer feedback and fosters community interaction [00:15:53]. They offer a range of open-source models, including general LLMs and coding models such as Qwen 2.5 Coder, which are popular for local development [00:16:20]. Qwen 3 coder models are also under development [00:16:35].
Models are provided in various sizes (from 0.6 billion to 235 billion parameters) and quantized formats (GGUF, GPTQ, AWQ, and MLX for Apple devices) to serve diverse user needs [00:16:43]. Most models use the Apache 2.0 license, allowing free use and modification for business purposes without requiring special permission [00:17:13]. Qwen models are widely supported by third-party frameworks and API platforms [00:17:47].
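For instance, a Qwen model served behind an OpenAI-compatible endpoint (here assumed to be a local vLLM server) can be queried with the standard openai client; the server command, endpoint URL, and model name below are illustrative assumptions.

```python
# Sketch: query a locally served Qwen model through an OpenAI-compatible API,
# e.g. after launching a server with vLLM:
#   vllm serve Qwen/Qwen3-4B
# Endpoint URL and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}],
)
print(response.choices[0].message.content)
```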
Applications and Products
Qwen is building products and agents to facilitate interaction with their models:
- WebDev: A feature that allows users to generate and deploy websites by providing simple prompts, such as “create a Twitter website” or “create a sunscreen product introduction website” [00:18:13]. It also supports creating visual cards based on links and information [00:19:03].
- Deep Research: Users can prompt the model to write comprehensive reports on topics like the healthcare or artificial intelligence industry [00:19:43]. The model plans its research, searches step-by-step, writes parts sequentially, and provides a downloadable PDF report [00:20:00]. Reinforcement learning is being used to fine-tune models specifically for deep research [00:20:32].
Future Directions
Qwen’s future efforts focus on achieving AGI and building advanced foundation models and agents:
- Training Improvements: There is still significant room for improvement in pre-training, including incorporating more and cleaner multimodal data, utilizing synthetic data, and exploring new training methods beyond next-token prediction, possibly involving reinforcement learning in pre-training [00:21:12].
- Scaling Laws: The focus of scaling is shifting from model sizes and pre-training data to compute in reinforcement learning [00:22:12]. Emphasis is on long-horizon reasoning with environmental feedback, allowing models to become smarter through inference-time scaling [00:22:28].
- Context Scaling: Plans include scaling context length to at least 1 million tokens this year for most models, with a goal of reaching 10 million tokens and eventually infinite context [00:22:57].
- Modality Scaling: While not directly increasing intelligence, scaling modalities (inputs and outputs) enhances model capability and productivity [00:23:25]. This is crucial for developing agents, such as GUI agents, that require vision capabilities [00:23:38]. The aim is to unify understanding and generation across modalities, such as simultaneous image understanding and generation, similar to GPT-4o [00:24:05].
Ultimately, Qwen is moving from an era of training models to an era of training agents, focusing on scaling with pre-training and reinforcement learning, especially within environments [00:24:36].