From: aidotengineer

Qwen is a series of large language models and large multimodal models with the goal of building a generalist model and agent [00:00:21]. The Qwen team emphasizes open-sourcing its models and applying them to real-world tasks and development.

Accessing Qwen Models and Resources

Users can interact with Qwen’s latest models through various platforms:

  • Qwen Chat: A chat interface at chat.qwen.ai that is easy to use [00:00:39]. It supports interaction with multimodal models by uploading images and videos, and with omni models using voice and video chat [00:00:44]. Features include Webdev and Deep Research [00:00:55].
  • Blog: Technical details about new releases are available on the blog at qwen.github.io [00:01:03].
  • GitHub and Hugging Face: Qwen’s code is available on GitHub, and model checkpoints can be downloaded from Hugging Face, allowing developers to experiment with the models locally [00:01:17].
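
As a minimal sketch of how a developer might pull one of these checkpoints for local experimentation (the model ID below is illustrative, not prescribed by the talk):

```python
# Sketch: download a Qwen checkpoint from Hugging Face for local experimentation.
# "Qwen/Qwen3-8B" is an illustrative model ID; browse the Qwen org page for others.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("Qwen/Qwen3-8B")
print("Checkpoint downloaded to:", local_dir)
```

The downloaded directory can then be loaded with the usual Hugging Face transformers APIs, as sketched in the later examples.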

Key Features and Capabilities Enabling Applications

Hybrid Thinking Mode

Qwen 3 introduces a hybrid thinking mode, combining thinking and non-thinking behaviors within a single model [00:05:27].

  • Thinking Mode: Before answering, the model reflects, explores possibilities, and then provides a detailed answer, similar to models like o1 and DeepSeek-R1 [00:05:42].
  • Non-thinking Mode: Functions like a traditional instruction-tuned chatbot, providing near-instant answers without delay [00:06:09].

This dual-mode capability is controllable via prompts or hyperparameters [00:06:30]. It also allows for a dynamic thinking budget, i.e., a cap on the maximum number of thinking tokens [00:06:41]. Performance increases significantly with larger thinking budgets, especially on tasks like math and coding [00:07:45]. For example, a 32,000-token thinking budget can achieve over 80% on AIME 24, compared to just over 40% with a very small budget [00:07:59]. This allows users to balance accuracy and token usage based on the requirements of a specific task [00:08:21].
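
For the open-weight Qwen 3 checkpoints, the Hugging Face chat template exposes an enable_thinking switch (per the model cards), and the chat format also supports soft switches such as /think and /no_think inside user messages. A minimal sketch, assuming a recent transformers release and an illustrative checkpoint name:

```python
# Sketch: toggling Qwen 3's thinking / non-thinking behavior via the chat template.
# Assumes a recent `transformers` and the illustrative checkpoint "Qwen/Qwen3-8B".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 100?"}]

# Thinking mode: the model emits a reasoning block before its final answer.
thinking_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=True, return_tensors="pt"
).to(model.device)

# Non-thinking mode: behaves like a regular instruction-tuned chatbot.
instant_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False, return_tensors="pt"
).to(model.device)

for ids in (thinking_ids, instant_ids):
    out = model.generate(ids, max_new_tokens=512)
    print(tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))
```

A thinking-budget cap, as described above, amounts to limiting how many reasoning tokens the model may emit before it is asked to answer; the exact mechanism depends on the serving stack.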

Multilingual Support

Qwen 3 supports 119 languages and dialects, a significant increase from Qwen 2.5’s 29 languages [00:08:52]. This expanded linguistic support is beneficial for global applications, particularly for users of open-source models, which previously often did not support many languages well [00:09:17].

Agentic Capabilities and Coding

Qwen models have enhanced capabilities in agents and coding, with specific improvements for MCP (Model Context Protocol) support [00:09:41]. The models can use tools during their thinking process, make function calls, receive environmental feedback, and continue thinking [00:09:56]. This capability is crucial for multi-agent systems and makes the model genuinely useful in real-world work [00:11:08]. One example is organizing a desktop: the model accesses the file system, determines which tools to use, and iteratively thinks and executes [00:10:29].
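
To illustrate the shape of this tool-use loop (not the Qwen team’s exact implementation), here is a sketch against an OpenAI-compatible endpoint, such as a locally served Qwen 3 model; the base URL, model name, and the list_files tool are assumptions for illustration:

```python
# Sketch of an agentic function-calling loop with a Qwen model behind an
# OpenAI-compatible API (e.g., served locally with vLLM). The endpoint, model
# name, and the list_files tool are illustrative assumptions.
import json
import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "list_files",
        "description": "List the files in a directory on the local machine.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def list_files(path: str) -> str:
    return json.dumps(os.listdir(path))

messages = [{"role": "user", "content": "Help me organize my Desktop folder."}]

# Loop: the model may think, call a tool, read the result, and continue thinking.
while True:
    reply = client.chat.completions.create(
        model="Qwen/Qwen3-8B", messages=messages, tools=tools
    ).choices[0].message
    messages.append(reply.model_dump(exclude_none=True))
    if not reply.tool_calls:
        print(reply.content)
        break
    for call in reply.tool_calls:
        if call.function.name == "list_files":
            args = json.loads(call.function.arguments)
            result = list_files(**args)  # environmental feedback fed back to the model
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

The same pattern generalizes to MCP servers: tools are advertised to the model, the model decides when to call them, and results are returned as tool messages for further reasoning.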

Multimodal Models

Beyond large language models, Qwen also develops multimodal models, focusing on vision-language models [00:13:37].

  • Qwen 2.5 VL: Released in January, it achieves competitive performance on vision-language benchmarks such as MMMU, MathVista, and general VQA [00:12:49] (a usage sketch follows this list).
  • QVQ: Explores thinking capabilities for vision-language models, showing improved performance on reasoning tasks (such as mathematics) with larger thinking budgets [00:13:16].
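
As a rough sketch of how a developer might query Qwen 2.5 VL through Hugging Face transformers (assuming a recent transformers release and the optional qwen-vl-utils helper package; the checkpoint name and image path are illustrative):

```python
# Sketch: image question answering with Qwen 2.5 VL via Hugging Face transformers.
# Assumes a recent transformers release and the qwen-vl-utils helper package;
# the checkpoint name and image path are illustrative.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/chart.png"},
        {"type": "text", "text": "What trend does this chart show?"},
    ],
}]

# Build the text prompt and collect the visual inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True
)[0]
print(answer)
```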

The ultimate goal is to build an “omni model” that accepts multiple modalities (text, vision including images and videos, and audio) as inputs and generates multiple modalities (text and audio) as outputs [00:13:51]. While not yet perfect, a 7-billion-parameter omni model is capable of voice, video, and text chat, and shows strong performance on audio tasks and even on vision-language understanding compared to Qwen 2.5 VL [00:14:13]. Future improvements aim to recover performance on language and agent tasks [00:15:20].

Qwen’s Open Sourcing Philosophy and Benefits

Qwen is committed to open-sourcing its models [00:15:52].

Real-World Applications and Products

Qwen is building products to enable interaction with their models and to create agents [00:17:59].

  • Webdev: This feature allows users to generate and deploy websites from simple prompts [00:18:12]. Examples include creating a Twitter website or a sunscreen product introduction website [00:18:21]. It can also generate visually appealing cards based on provided links [00:19:03].
  • Deep Research: Users can ask the model to write comprehensive reports on topics of interest, such as the healthcare industry or artificial intelligence [00:19:43]. The model first makes a plan, then searches step by step, writes each part, and finally delivers a downloadable PDF report [00:20:12]. Reinforcement learning is being used to fine-tune models specifically for deep research to enhance productivity in working life [00:20:36]. This capability addresses the common need for efficient, in-depth information gathering (a high-level sketch of the plan-search-write loop follows below).
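
The plan-search-write loop described above can be sketched at a high level; this is an illustration of the pattern, not the Qwen team’s actual pipeline, and the llm, web_search, and render_pdf helpers are hypothetical placeholders:

```python
# High-level sketch of a plan -> search -> write -> report pipeline, as described
# for Deep Research. The llm(), web_search(), and render_pdf() helpers are
# hypothetical placeholders, not Qwen APIs.
from typing import List

def llm(prompt: str) -> str:
    """Call a Qwen chat model and return its text reply (placeholder)."""
    raise NotImplementedError

def web_search(query: str) -> List[str]:
    """Return a list of relevant passages for a query (placeholder)."""
    raise NotImplementedError

def render_pdf(report_text: str, path: str) -> None:
    """Render the final report to a downloadable PDF (placeholder)."""
    raise NotImplementedError

def deep_research(topic: str, report_path: str) -> None:
    # 1. Plan: break the topic into sections to investigate.
    outline = llm(f"Draft an outline of report sections for: {topic}").splitlines()

    # 2. Search and write each section step by step.
    sections = []
    for section in filter(None, outline):
        evidence = "\n".join(web_search(f"{topic} {section}"))
        sections.append(llm(f"Write the section '{section}' using:\n{evidence}"))

    # 3. Deliver the assembled report as a PDF.
    render_pdf("\n\n".join(sections), report_path)
```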

Future Directions for Real-World Impact

Qwen’s future efforts are geared towards achieving Artificial General Intelligence (AGI) and building better foundation models and agents [00:21:01].

  • Training Improvements: Continued focus on training methods, including incorporating more and better quality multimodal and synthetic data [00:21:17]. Exploring new pre-training methods beyond next-token prediction, possibly using reinforcement learning in pre-training [00:21:52].
  • Scaling Laws: Shifting focus from scaling model sizes and pre-training data to scaling compute in reinforcement learning [00:22:16]. Emphasis on long-horizon reasoning with environment feedback, enabling models to become smarter through continuous interaction and thinking (inference time scaling) [00:22:28].
  • Context Scaling: Aiming to scale context window to at least 1 million tokens this year, with aspirations for 10 million tokens and eventually infinite context [00:23:02].
  • Modality Scaling: Increasing capabilities by scaling modalities for both inputs and outputs, even if it doesn’t directly increase “intelligence” [00:23:25]. This includes unifying understanding and generation, such as simultaneous image understanding and generation along the lines of GPT-4o [00:24:05]. Vision capability, for instance, is essential for creating GUI agents for computer use [00:23:44].

These advancements signify a shift from training models to training agents, especially by integrating reinforcement learning with environment interaction [00:24:36]. This highlights the ongoing evolution toward multi-agent systems and the broader adoption of AI in enterprises.