From: aidotengineer

Qwen aims to develop a generalist model and agent, built on a series of large language models and large multimodal models [00:00:24].

Current Multimodal Capabilities

Qwen’s chat interface, chat.qwen.ai, lets users interact with the latest models, including multimodal models that support image and video uploads [00:00:46]. Users can also engage with the “omni models” via voice and video chat [00:00:50].
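As a rough illustration of using these models programmatically, the sketch below sends an image together with a text prompt to a vision-language model through an OpenAI-compatible chat endpoint. The endpoint URL, API key, and model identifier are placeholders, not details confirmed by the talk.

```python
# Minimal sketch: one image plus a text prompt to a vision-language model via an
# OpenAI-compatible chat endpoint. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example/v1",  # assumed OpenAI-compatible gateway
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen2.5-vl-instruct",  # hypothetical model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```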

The development of multimodal models has primarily focused on vision-language models [01:37:37].

Qwen 2.5 VL (Vision-Language)

Released in January, Qwen 2.5 VL demonstrated competitive performance across a range of vision-language benchmarks.

Qwen has also explored “thinking” capabilities for vision-language models, such as with QVQ [01:59:00]. As with language models, a larger “thinking budget” (the maximum number of thinking tokens) leads to better performance on reasoning tasks, particularly mathematics [01:06:08].
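One way to read the “thinking budget” is as a hard cap on reasoning tokens enforced at generation time. The sketch below is a generic illustration of that idea rather than Qwen’s implementation: `generate_tokens` is a hypothetical streaming-generation function, and the `<think>`/`</think>` delimiters are an assumed chat-template convention.

```python
# Generic sketch of a thinking budget: let the model reason inside <think>...</think>
# for at most `max_think_tokens`, force-close the block if the budget runs out, and
# then let it produce the final answer. `generate_tokens(prompt)` is a hypothetical
# function that streams completion tokens until end-of-sequence.

def generate_with_thinking_budget(generate_tokens, prompt: str, max_think_tokens: int) -> str:
    text = "<think>"  # assume the chat template opens a reasoning block
    used = 0
    for token in generate_tokens(prompt + text):
        text += token
        if "</think>" in token:
            break                   # model closed its reasoning within budget
        used += 1
        if used >= max_think_tokens:
            text += "\n</think>\n"  # budget exhausted: force the block closed
            break
    # Second pass: generate the final answer conditioned on the (possibly truncated) reasoning.
    for token in generate_tokens(prompt + text):
        text += token
    return text
```

Raising `max_think_tokens` gives the model more room to reason, which matches the reported gains on mathematics tasks.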

Omni Model Development

The ultimate goal for multimodal models is to build an “omni model” capable of accepting multiple input modalities and generating multiple output modalities [01:53:08].

The current omni model is a 7-billion-parameter large language model [01:13:16]. It has the following capabilities (a request sketch follows this list):

  • Input Modalities: Text, vision (images and videos), and audio [01:18:21]
  • Output Modalities: Text and audio [01:31:28]
  • Applications: Can be used in voice chat, video chat, and text chat [01:44:41]
  • Performance: Achieves state-of-the-art performance on audio tasks for its size [01:49:49]. Surprisingly, it even outperforms Qwen 2.5 VL (7B) on vision-language understanding tasks [01:50:57].
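A rough request sketch for this kind of omni model is shown below. It assumes an OpenAI-compatible endpoint that accepts image and audio content parts and a `modalities` parameter for spoken output; the endpoint, model name, and voice are placeholders rather than details from the talk.

```python
# Minimal sketch of an omni-style request: text + image + audio in, text + audio out.
# The endpoint, model name, and voice are placeholders; whether a given server supports
# the `modalities` / `audio` parameters is an assumption.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example/v1", api_key="YOUR_API_KEY")

with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen-omni",                            # hypothetical model identifier
    modalities=["text", "audio"],                 # request spoken output alongside text
    audio={"voice": "default", "format": "wav"},  # assumed voice/format options
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Answer the spoken question about this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)       # text reply
audio_reply = response.choices[0].message.audio  # base64 audio, if the server returns it
```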

There is still room for improvement, particularly in recovering the performance drop on language and agent tasks [01:26:00]. This is expected to be addressed through better data quality and training methods [01:38:00].

Future Directions for Multimodal and Omni Models

Qwen’s future plans for multimodal and omni models include:

  • Scaling Modalities: Increasing the number of input and output modalities the models can handle [02:25:00]. This is seen as a way to enhance model capability and productivity [02:32:00]. For example, vision-language understanding is crucial for developing GUI agents [02:43:00].
  • Unifying Understanding and Generation: The goal is to integrate understanding and generation capabilities for modalities like images, similar to GPT-4o’s ability to generate high-quality images [02:07:00].
  • Truly Omni Models: Future iterations aim for models capable of generating high-quality images and videos, achieving “truly omni model” status [01:33:00].

Qwen emphasizes the transition from training models to training agents, especially by leveraging reinforcement learning with environment interaction [02:40:00]. The belief is that the field is currently in an “era of agents” [02:56:00].
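As a rough sketch of what training agents through environment interaction can look like, the loop below collects reward-labeled trajectories from a generic environment; the `env` and `agent` interfaces and the update step are schematic placeholders, not Qwen’s training stack.

```python
# Schematic agent-environment rollout loop for reinforcement learning.
# `env` exposes a minimal reset/step interface and `agent` is any policy that maps
# an observation to an action; both are placeholders for illustration.

def collect_trajectory(env, agent, max_steps: int = 100):
    """Run one episode and return (observation, action, reward) tuples."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)                    # policy picks an action (e.g. a tool call)
        next_obs, reward, done = env.step(action)  # environment returns feedback
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory

def train(env, agent, episodes: int):
    for _ in range(episodes):
        trajectory = collect_trajectory(env, agent)
        agent.update(trajectory)                   # e.g. a policy-gradient step on the rewards
```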