From: hu-po
The development of Vision Language Models (VLMs) faces ongoing challenges, particularly concerning the efficiency and accuracy of visual encoding and the interpretation of complex visual information [01:03:04].
Issues with Visual Tokenization and Information Loss
One significant challenge is how VLMs process images [01:03:07]. Images are divided into small “patches,” each of which is converted into a visual token [00:07:58]. This process can lead to information loss, making it difficult for the model to interpret subtle visual cues [01:20:54].
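As a rough illustration of that pipeline, here is a minimal ViT-style patch tokenization sketch. The sizes (224×224 image, 16×16 patches, 768-dimensional tokens) are assumptions for illustration, not the configuration of any specific model discussed in the stream.

```python
# Minimal sketch of ViT-style patch tokenization (illustrative sizes).
# A 224x224 RGB image is split into 16x16 patches, and each flattened
# patch is linearly projected into one visual token.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # (batch, channels, H, W)
patch_size, embed_dim = 16, 768

# Unfold into non-overlapping patches: (1, 3*16*16, num_patches)
patches = nn.functional.unfold(image, kernel_size=patch_size, stride=patch_size)
patches = patches.transpose(1, 2)            # (1, 196, 768): 14*14 patches

# Each flattened patch becomes one visual token via a learned projection
to_token = nn.Linear(3 * patch_size * patch_size, embed_dim)
visual_tokens = to_token(patches)            # (1, 196, 768)
print(visual_tokens.shape)                   # torch.Size([1, 196, 768])
```

Because each 16×16×3 patch is compressed into a single vector, fine-grained details inside a patch can be averaged away, which is one way the information loss described above arises.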
For instance, in a composite image of Barack Obama’s body with Dwayne “The Rock” Johnson’s face, most VLMs identify the person as Barack Obama [00:08:50]. This occurs because the majority of the image patches “scream Obama,” and the visual encoders tend to “gloss over” the crucial detail of the mismatched face [00:08:05], [01:20:54]. Only more sophisticated systems, like GPT-4 Vision accessed via third-party providers such as Perplexity, can identify the image as digitally manipulated [01:09:52], [01:57:38].
Managing Lengthy Visual Token Sequences
Another problem is the excessive length of visual token sequences generated by visual encoders [01:03:13]. An image might generate thousands of tokens, compared to a question that might only be a few tokens long [01:03:31]. This imbalance can “limit the model’s effectiveness” and lead to it “inaccurately interpreting complex visual information” [01:03:47].
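A quick back-of-the-envelope calculation shows the scale of this imbalance. The numbers here (a 1024×1024 image, 14-pixel patches, a 12-token question) are assumed for illustration:

```python
# Illustrative token-count comparison: visual tokens vs. question tokens.
image_side = 1024          # assumed image resolution (pixels)
patch_side = 14            # assumed patch size used by the visual encoder
visual_tokens = (image_side // patch_side) ** 2
question_tokens = 12       # e.g. a short one-sentence question

print(visual_tokens)                    # 5329 visual tokens
print(visual_tokens // question_tokens) # ~444x more visual than text tokens
```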
Furthermore, different visual encoders output tokens with varying dimensionalities and numbers [01:14:23], complicating their integration into a unified language model [01:14:31].
Redundancy in Positional Encoding
Many vision models, particularly those based on the Vision Transformer architecture, add a positional encoding to each visual token [01:06:04]. This helps the transformer understand the spatial relationships between patches [01:06:35]. However, this practice is questioned, as it may be redundant for VLMs whose visual experts already encode positional information [01:21:51]. Simplified positional encoding schemes, such as assigning the same embedding to all image patches (“share all”), can significantly reduce computational cost without substantial performance degradation [01:22:20], [01:27:19]. This suggests that positional embeddings, while crucial for text, might not be as necessary for image tokenization within a VLM context [01:27:50].
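The sketch below contrasts the standard per-patch positional embedding with the simplified “share all” scheme described above, where every patch receives the same learned embedding. Shapes are illustrative assumptions.

```python
# Per-patch positional embeddings vs. the "share all" simplification.
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768
visual_tokens = torch.randn(1, num_patches, embed_dim)

# Standard ViT-style scheme: one learned embedding per patch position
per_patch_pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
tokens_standard = visual_tokens + per_patch_pos

# "Share all": a single embedding broadcast to every patch, shrinking the
# positional parameters from num_patches*embed_dim down to embed_dim
shared_pos = nn.Parameter(torch.zeros(1, 1, embed_dim))
tokens_shared = visual_tokens + shared_pos   # broadcasts over the patch axis
```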
Strategies to Address Challenges
Multiple Visual Encoders (Ensemble of Experts)
A promising approach is to use an ensemble of visual encoders [01:04:02]. Different visual encoders excel at different tasks:
- CLIP: Known as a “semantic expert,” excelling in image-text alignment through contrastive learning [01:05:09], [01:09:52].
- DINOv2: Provides robust feature extraction through self-supervised learning at both image and patch levels [01:10:26].
- SAM (Segment Anything Model): A “segmentation expert” highly skilled in image segmentation, capturing fine details and edges [01:05:14], [01:12:30].
- LayoutLMv3: A document-understanding expert, strong at OCR-related text extraction and layout [01:15:15].
By concatenating outputs from these diverse encoders, a VLM can leverage their combined strengths, leading to “consistently superior performance” [01:07:31]. This approach requires a “Fusion Network” (often simple Multi-Layer Perceptrons or MLPs) to standardize the dimensionality of the varied visual tokens [01:15:36], [01:17:47].
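A minimal sketch of this fusion step is shown below. The encoder modules here are stand-ins with assumed output shapes (CLIP-like 1024-d, DINOv2-like 1536-d, SAM-like 256-d tokens) and an assumed 4096-d LLM hidden size; real encoders would replace the random tensors.

```python
# Sketch: project each expert's tokens into the LLM's hidden size with a
# small MLP, then concatenate along the sequence axis.
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Two-layer MLP mapping one expert's tokens into the LLM hidden size."""
    def __init__(self, in_dim, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
    def forward(self, tokens):
        return self.mlp(tokens)

# Stand-in expert outputs: (batch, num_tokens, expert_dim)
expert_outputs = {
    "clip":   torch.randn(1, 256, 1024),
    "dinov2": torch.randn(1, 256, 1536),
    "sam":    torch.randn(1, 1024, 256),
}
projectors = {name: Projector(t.shape[-1]) for name, t in expert_outputs.items()}

# Project every expert into a shared dimensionality, then concatenate to
# form the visual portion of the LLM prompt.
fused = torch.cat(
    [projectors[name](tokens) for name, tokens in expert_outputs.items()], dim=1
)
print(fused.shape)   # torch.Size([1, 1536, 4096]): 256 + 256 + 1024 tokens
```

Note how the concatenated sequence (1,536 visual tokens in this toy setup) is much longer than any single expert's output, which is exactly the compute trade-off discussed later.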
Order of Experts
The order in which the visual encoder outputs are fed into the Language Model (LLM) matters due to the autoregressive and position-aware nature of LLMs [01:24:24]. The LLM processes these visual tokens as a sequence, and “the order of the experts affects the final output” [01:24:31], [01:55:04]. This highlights a nuanced challenge in designing optimal VLM architectures.
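A tiny sketch of why order matters: the projected expert tokens are concatenated into one sequence, so an autoregressive, position-aware LLM sees a different input depending on which expert comes first. Tensor shapes and names are illustrative assumptions.

```python
# Same visual content, different expert ordering -> different LLM input.
import torch

clip_tokens = torch.randn(1, 256, 4096)   # projected CLIP tokens (assumed)
sam_tokens  = torch.randn(1, 1024, 4096)  # projected SAM tokens (assumed)

order_a = torch.cat([clip_tokens, sam_tokens], dim=1)  # CLIP first
order_b = torch.cat([sam_tokens, clip_tokens], dim=1)  # SAM first

# Identical values, different positions in the sequence, so the LLM applies
# different positional treatment and can generate different outputs.
print(torch.equal(order_a, order_b))   # False
```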
Trade-offs and Future Directions
While these strategies offer improved performance, they often come at the cost of increased computational expense [01:08:49]. Running multiple visual encoders means processing the image through each, and the concatenated visual tokens result in a much longer prompt for the language model, increasing inference time [01:54:30].
This indicates an emerging trend where state-of-the-art performance in VLMs is increasingly tied to the willingness to expend more compute for inference [01:35:55]. Future research and development in VLMs will likely focus on optimizing these multi-expert approaches and refining tokenization and positional encoding to achieve better performance-to-compute ratios, akin to the discussions around “state-of-the-art per compute budget” [01:36:40].