From: aidotengineer

Qwen aims to develop a generalist model and agent, built on a series of large language models and large multimodal models [00:00:24].

Current Multimodal Capabilities

Qwen’s chat interface, chat.qwen.ai, lets users interact with the latest models, including multimodal models that support image and video uploads [00:00:46]. Users can also engage with the “omni models” via voice and video chat [00:00:50].
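As a rough illustration of using these models programmatically, the sketch below sends an image together with a text prompt to a vision-language model through an OpenAI-compatible chat endpoint. The endpoint URL, API key, and model identifier are placeholders, not details confirmed by the talk.

```python
# Minimal sketch: one image plus a text prompt to a vision-language model via an
# OpenAI-compatible chat endpoint. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example/v1",  # assumed OpenAI-compatible gateway
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen2.5-vl-instruct",  # hypothetical model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```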

The development of multimodal models has primarily focused on vision-language models [01:37:37].

Qwen 2.5 VL (Vision-Language)

Released in January, Qwen 2.5 VL demonstrated competitive performance across a range of vision-language benchmarks.

Qwen has also explored “thinking” capabilities for vision-language models, such as with QVQ [01:59:00]. As with language models, a larger “thinking budget” (the maximum number of thinking tokens) leads to better performance on reasoning tasks, particularly mathematics [01:06:08].
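One way to read the “thinking budget” is as a hard cap on reasoning tokens enforced at generation time. The sketch below is a generic illustration of that idea rather than Qwen’s implementation: `generate_tokens` is a hypothetical streaming-generation function, and the `<think>`/`</think>` delimiters are an assumed chat-template convention.

```python
# Generic sketch of a thinking budget: let the model reason inside <think>...</think>
# for at most `max_think_tokens`, force-close the block if the budget runs out, and
# then let it produce the final answer. `generate_tokens(prompt)` is a hypothetical
# function that streams completion tokens until end-of-sequence.

def generate_with_thinking_budget(generate_tokens, prompt: str, max_think_tokens: int) -> str:
    text = "<think>"  # assume the chat template opens a reasoning block
    used = 0
    for token in generate_tokens(prompt + text):
        text += token
        if "</think>" in token:
            break                   # model closed its reasoning within budget
        used += 1
        if used >= max_think_tokens:
            text += "\n</think>\n"  # budget exhausted: force the block closed
            break
    # Second pass: generate the final answer conditioned on the (possibly truncated) reasoning.
    for token in generate_tokens(prompt + text):
        text += token
    return text
```

Raising `max_think_tokens` gives the model more room to reason, which matches the reported gains on mathematics tasks.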

Omni Model Development

The ultimate goal for multimodal models is to build an “omni model” capable of accepting multiple input modalities and generating multiple output modalities [01:53:08].

The current omni model is a 7-billion-parameter large language model [01:13:16]. It has the following capabilities (a request sketch follows this list):

  • Input Modalities: Text, vision (images and videos), and audio [01:18:21]
  • Output Modalities: Text and audio [01:31:28]
  • Applications: Can be used in voice chat, video chat, and text chat [01:44:41]
  • Performance: Achieves state-of-the-art performance on audio tasks for its size [01:49:49]. Surprisingly, it even outperforms Qwen 2.5 VL (7B) on vision-language understanding tasks [01:50:57].
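A rough request sketch for this kind of omni model is shown below. It assumes an OpenAI-compatible endpoint that accepts image and audio content parts and a `modalities` parameter for spoken output; the endpoint, model name, and voice are placeholders rather than details from the talk.

```python
# Minimal sketch of an omni-style request: text + image + audio in, text + audio out.
# The endpoint, model name, and voice are placeholders; whether a given server supports
# the `modalities` / `audio` parameters is an assumption.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example/v1", api_key="YOUR_API_KEY")

with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen-omni",                            # hypothetical model identifier
    modalities=["text", "audio"],                 # request spoken output alongside text
    audio={"voice": "default", "format": "wav"},  # assumed voice/format options
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Answer the spoken question about this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)       # text reply
audio_reply = response.choices[0].message.audio  # base64 audio, if the server returns it
```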

There is still room for improvement, particularly in recovering the performance drop on language and agent tasks [01:26:00]. This is expected to be addressed through better data quality and training methods [01:38:00].

Future Directions for Multimodal and Omni Models

Qwen’s future plans for multimodal and omni models include:

  • Scaling Modalities: Increasing the number of input and output modalities the models can handle [02:25:00]. This is seen as a way to enhance model capability and productivity [02:32:00]. For example, vision-language understanding is crucial for developing GUI agents [02:43:00].
  • Unifying Understanding and Generation: The goal is to integrate understanding and generation capabilities for modalities like images, similar to GPT-4o’s ability to generate high-quality images [02:07:00].
  • Truly Omni Models: Future iterations aim for models capable of generating high-quality images and videos, achieving “truly omni model” status [01:33:00].

Qwen emphasizes the transition from training models to training agents, especially by leveraging reinforcement learning with environment interaction [02:40:00]. The belief is that the field is currently in an “era of agents” [02:56:00].
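As a rough sketch of what training agents through environment interaction can look like, the loop below collects reward-labeled trajectories from a generic environment; the `env` and `agent` interfaces and the update step are schematic placeholders, not Qwen’s training stack.

```python
# Schematic agent-environment rollout loop for reinforcement learning.
# `env` exposes a minimal reset/step interface and `agent` is any policy that maps
# an observation to an action; both are placeholders for illustration.

def collect_trajectory(env, agent, max_steps: int = 100):
    """Run one episode and return (observation, action, reward) tuples."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)                    # policy picks an action (e.g. a tool call)
        next_obs, reward, done = env.step(action)  # environment returns feedback
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory

def train(env, agent, episodes: int):
    for _ in range(episodes):
        trajectory = collect_trajectory(env, agent)
        agent.update(trajectory)                   # e.g. a policy-gradient step on the rewards
```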