From: aidotengineer
Qwen aims to develop a generalist model and agent, utilizing a series of large language models and large multimodal models [00:00:24].
Current Multimodal Capabilities
Qwen’s chat interface, chat.qwen.ai, allows users to interact with the latest models, including multimodal models that support image and video uploads [00:00:46]. Users can also engage with the omni models via voice and video chat [00:00:50].
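Beyond the web interface, the same multimodal models are typically reachable programmatically. Below is a minimal sketch of sending an image together with a text question, assuming an OpenAI-compatible endpoint; the base URL and model id are illustrative assumptions, not details from the talk.

```python
# Sketch: query a vision-language model through an OpenAI-compatible API.
# The base_url and model id below are assumptions -- substitute whatever
# your provider documents.
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen2.5-vl-7b-instruct",  # illustrative model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "What trend does this chart show?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```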
The development of multimodal models has primarily focused on vision-language models [01:37:37].
Qwen 2.5 VL (Vision Language)
Released in January, Qwen 2.5 VL demonstrated competitive performance across a range of vision-language benchmarks, including:
- Understanding benchmarks such as MMMU [01:57:00]
- Math benchmarks such as MathVista [01:57:02]
- General VQA benchmarks [01:57:05]
Qwen has also explored “thinking” capabilities for vision-language models, such as with QVQ [01:59:00]. As with language models, a larger “thinking budget” (the maximum number of thinking tokens) leads to better performance on reasoning tasks, particularly mathematics [01:06:08].
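The talk does not show how the thinking budget is enforced, but the idea can be sketched as a two-phase decode: let the model reason up to a fixed number of tokens, then force it to close its reasoning and answer. The checkpoint name and the `<think>...</think>` tag convention below are assumptions for illustration.

```python
# Sketch of a "thinking budget": cap reasoning tokens, then force an answer.
# Assumes a chat model that wraps its reasoning in <think>...</think> tags;
# the checkpoint name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"  # illustrative reasoning checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def answer_with_budget(question: str, thinking_budget: int = 512,
                       answer_budget: int = 256) -> str:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False, add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Phase 1: reason, but stop once the thinking budget is exhausted.
    thought = model.generate(**inputs, max_new_tokens=thinking_budget)

    # Phase 2: append a closing tag so the model must commit to an answer
    # (if it already closed its reasoning, the extra tag is harmless here).
    close = tokenizer("\n</think>\n", return_tensors="pt",
                      add_special_tokens=False).input_ids.to(model.device)
    continued = torch.cat([thought, close], dim=-1)
    final = model.generate(input_ids=continued,
                           attention_mask=torch.ones_like(continued),
                           max_new_tokens=answer_budget)
    return tokenizer.decode(final[0][continued.shape[1]:],
                            skip_special_tokens=True)
```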
Omni Model Development
The ultimate goal for multimodal models is to build an “omni model” capable of accepting multiple input modalities and generating multiple output modalities [01:53:08].
The current omni model is a 7 billion parameter large language model [01:13:16]. It possesses the following capabilities:
- Input Modalities: Text, vision (images and videos), and audio [01:18:21]
- Output Modalities: Text and audio [01:31:28]
- Applications: Can be used in voice chat, video chat, and text chat [01:44:41]
- Performance: Achieves state-of-the-art performance in audio tasks for its size [01:49:49]. Surprisingly, it even outperforms Qwen 2.5 VL (7 billion) in vision-language understanding tasks [01:50:57].
There is still room for improvement, particularly in recovering the performance drop on language and agent tasks [01:26:00]. This is expected to be addressed by improving data quality and training methods [01:38:00].
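To make the input/output split above concrete, the following is a hedged illustration of what a single omni-style turn could look like as data; the field names are assumptions chosen for readability, not the model’s documented schema.

```python
# Illustrative only: one omni-model turn with text, image, and audio going in,
# and text plus synthesized speech coming back. Field names are assumptions.
request = {
    "role": "user",
    "content": [
        {"type": "text",  "text": "Summarize the clip and explain the chart."},
        {"type": "image", "path": "sales_chart.png"},    # vision input
        {"type": "audio", "path": "meeting_clip.wav"},   # audio input
    ],
}

# Expected shape of the reply: generated text, optionally paired with speech.
reply = {
    "role": "assistant",
    "content": [
        {"type": "text",  "text": "The clip discusses Q3 results; the chart shows..."},
        {"type": "audio", "path": "reply_speech.wav"},   # spoken version of the answer
    ],
}
```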
Future Directions for Multimodal and Omni Models
Qwen’s future plans for multimodal and omni models include:
- Scaling Modalities: Increasing the number of input and output modalities the models can handle [02:25:00]. This is seen as a way to enhance model capability and productivity [02:32:00]. For example, vision language understanding is crucial for developing GUI agents [02:43:00].
- Unifying Understanding and Generation: The goal is to integrate understanding and generation capabilities for modalities like images, similar to GPT-4o’s ability to generate high-quality images [02:07:00].
- Truly Omni Models: Future iterations aim for models capable of generating high-quality images and videos, achieving “truly omni model” status [01:33:00].
Qwen emphasizes the transition from training models to training agents, especially by leveraging reinforcement learning with environment interaction [02:40:00]. The belief is that the field is currently in an “era of agents” [02:56:00].
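The talk stops at this framing, but the loop behind “training agents with environment interaction” can be pictured as collecting trajectories of observations, actions, and rewards that a later reinforcement-learning update would consume. Everything below (the ToyEnvironment class and call_model stub) is a hypothetical placeholder, not Qwen’s actual training setup.

```python
# Sketch of agent-environment interaction for RL-style agent training.
# `ToyEnvironment` and `call_model` are hypothetical stand-ins for a real
# GUI/tool environment and a real model API.
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str
    action: str
    reward: float

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

def call_model(observation: str) -> str:
    """Placeholder: map an observation to an action (e.g. a GUI click)."""
    return f"click(search_box)  # decided from: {observation[:40]}"

class ToyEnvironment:
    """Toy stand-in for a GUI or tool environment."""
    def reset(self) -> str:
        return "screenshot_0: search page is open"

    def step(self, action: str) -> tuple[str, float, bool]:
        # Return (next observation, reward, done); finish in one step here.
        return "screenshot_1: results are visible", 1.0, True

def collect_trajectory(env: ToyEnvironment, max_steps: int = 10) -> Trajectory:
    traj = Trajectory()
    obs, done = env.reset(), False
    for _ in range(max_steps):
        action = call_model(obs)                    # agent decides
        next_obs, reward, done = env.step(action)   # environment responds
        traj.steps.append(Step(obs, action, reward))
        obs = next_obs
        if done:
            break
    return traj  # trajectories like this feed a later RL update

if __name__ == "__main__":
    print(collect_trajectory(ToyEnvironment()))
```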