From: aidotengineer

Qwen aims to build a generalist model and agent, encompassing both large language models and large multimodal models [00:00:21].

Qwen Chat Interface

Qwen Chat, accessible at chat.qwen.ai, allows users to interact with the latest models, including multimodal models, by uploading images and videos [00:00:44]. It also enables interaction with Qwen’s omni models through voice chat and video chat [00:00:50].
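
Beyond the web interface, the same models can be called programmatically. Below is a minimal, hedged sketch of sending a text-plus-image prompt to a Qwen vision language model through an OpenAI-compatible API; the base URL, model identifier, and image URL are placeholder assumptions rather than values from the talk.

```python
# Minimal sketch: chatting with a Qwen vision-language model through an
# OpenAI-compatible endpoint. The base_url, model name, and image URL below
# are assumptions -- substitute whatever endpoint and model you actually use.
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen2.5-vl-7b-instruct",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```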

Qwen 2.5 VL (Vision Language)

Released in January, Qwen 2.5 VL is a vision language model that has demonstrated competitive performance across a range of vision language benchmarks [01:49:00], including understanding benchmarks such as MMMU, math benchmarks such as MathVista, and general VQA benchmarks [01:57:00].

The development team has also explored integrating thinking capabilities into vision language models, creating QVQ [01:18:00]. As with the language models, a larger thinking budget (maximum thinking tokens) yields better performance on reasoning tasks, particularly mathematics [01:32:00].
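
To make the idea of a thinking budget concrete, here is a hypothetical sketch of budget-limited decoding: the model is allowed at most `budget` tokens of reasoning, after which the thinking block is forced closed and the final answer is decoded. The checkpoint name and the `<think>...</think>` tag convention are assumptions for illustration, not the documented QVQ interface.

```python
# Hypothetical sketch of enforcing a "thinking budget": let the model reason
# for at most `budget` tokens, then close the thinking block and ask for the
# final answer. Model ID and <think> tag convention are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/some-reasoning-model"   # placeholder, not a real checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate_with_thinking_budget(question: str, budget: int = 512) -> str:
    # Phase 1: reasoning, capped at `budget` new tokens.
    prompt = f"Question: {question}\n<think>"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    thinking = model.generate(**inputs, max_new_tokens=budget, do_sample=False)
    thinking_text = tokenizer.decode(thinking[0], skip_special_tokens=True)

    # Phase 2: truncate at (or append) the closing tag and decode the answer.
    answer_prompt = thinking_text.split("</think>")[0] + "</think>\nAnswer:"
    inputs = tokenizer(answer_prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(generate_with_thinking_budget("What is 17 * 24?", budget=128))
```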

Qwen Omni Model

Qwen’s ultimate goal for multimodal models is to build an omni model [01:51:00]. An omni model is designed to:

  • Accept multiple modalities as inputs [01:53:00].
  • Generate multiple modalities as outputs (e.g., text, vision, audio) [01:56:00].

Current Capabilities of the Omni Model

The current iteration is a relatively small 7-billion-parameter model [01:11:00] [01:13:00]. It is capable of:

  • Accepting Inputs: Text, vision (images and videos), and audio [01:18:00].
  • Generating Outputs: Text and audio [01:28:00].
  • Usage: Can be used in voice chat, video chat, and text chat [01:44:00] (a request sketch follows this list).
  • Performance: Achieves state-of-the-art performance on audio tasks for its size [01:49:00]. Surprisingly, it also outperforms Qwen 2.5 VL 7B on vision language understanding tasks [01:57:00].
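
As a rough illustration of those capabilities, the sketch below shows what an omni-style request could look like over an OpenAI-compatible API: text, an image, and an audio clip go in, and both a text reply and synthesized speech come back. The endpoint, model name, voice, and whether a given Qwen Omni deployment accepts these exact fields are all assumptions.

```python
# Hypothetical request/response shape for an omni model: text + image + audio in,
# text + audio out. Endpoint, model name, and field support are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed deployment

with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2.5-omni-7b",                      # assumed model identifier
    modalities=["text", "audio"],                 # request both text and spoken output
    audio={"voice": "default", "format": "wav"},  # assumed voice name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image and answer the spoken question."},
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
)
message = response.choices[0].message
print(message.content)                            # text answer
if getattr(message, "audio", None):               # synthesized speech, if returned
    with open("answer.wav", "wb") as f:
        f.write(base64.b64decode(message.audio.data))
```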

Future Goals for the Omni Model

Future development aims to enable the model to generate high-quality images and videos, making it truly an omni model [01:33:00]. The team is also working to recover the performance drops observed on language and agent tasks, which they believe can be addressed by improving data quality and training methods [01:57:00].

Future Directions: Scaling Modalities and Unifying Understanding

Qwen’s future plans include scaling modalities [02:25:00]. While scaling modalities may not directly increase intelligence, it significantly enhances the models’ capabilities and productivity [02:30:00]. For instance, vision language understanding is crucial for building GUI agents and enabling computer-use tasks [02:38:00].
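
As a hypothetical illustration of why vision language understanding matters for GUI agents, the sketch below captures a screenshot, asks a vision language model for the next UI action as JSON, and executes it. The endpoint, model identifier, and action schema are assumptions, and a real computer-use agent would need looping, verification, and safety checks.

```python
# Hypothetical GUI-agent step built on a vision-language model: capture the
# screen, ask the model for the next action as JSON, execute it. The endpoint,
# model name, and JSON action schema here are assumptions for illustration.
import base64, io, json
import pyautogui
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed deployment

def screenshot_b64() -> str:
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def next_action(goal: str) -> dict:
    response = client.chat.completions.create(
        model="qwen2.5-vl-7b-instruct",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f'Goal: {goal}. Reply with JSON only: '
                         '{"action": "click"|"type"|"done", "x": int, "y": int, "text": str}'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)

action = next_action("Open the settings menu")
if action["action"] == "click":
    pyautogui.click(action["x"], action["y"])
elif action["action"] == "type":
    pyautogui.typewrite(action["text"])
```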

The goal is to unify understanding and generation across modalities, such as simultaneous image understanding and generation, similar to the capabilities seen in GPT-4o [02:40:00]. This means continuing to integrate multimodal data into training to enhance model capabilities across different tasks and domains [02:37:00].