Comparison of contrastively pretrained vs classificationpretrained vision encoders

From: hu-po

Vision language models (VLMs) are a rapidly evolving field, combining visual and textual understanding. A key component of these models is the vision encoder, which processes image data. Recent research focuses on the optimal pretraining strategies for these encoders, primarily comparing classification-based pretraining with contrastive pretraining [00:05:50].

Vision Encoders in VLMs

A vision language model fundamentally combines some visual model (a vision encoder) and a large language model (LLM), with a mechanism to connect them [00:04:22]. The vision encoder’s role is to convert raw image data into a compressed numerical representation, often called an embedding or latent space [01:55:48]. The choice and pretraining method of this vision encoder significantly impact the VLM’s performance [01:14:09].

Common architectures for vision encoders include Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) like ResNets [00:06:01]. ViTs are currently considered state-of-the-art for images and video [00:06:28].

Classification Pretraining

In classification pretraining, a vision encoder is trained on large datasets like ImageNet to classify images into predefined categories [02:20:05]. This involves adding a “classification head” on top of the encoder and pushing gradients from a classification loss function through the network [02:26:07]. After training, the classification head is removed, leaving an image encoder whose representations are highly effective for classification tasks [02:27:39].

Contrastive Pretraining

Contrastive pretraining aims to learn representations by encouraging embeddings of matching pairs (e.g., an image and its corresponding text caption) to be close together, while pushing embeddings of unmatched pairs apart [01:16:17]. Popular examples include CLIP (Contrastive Language-Image Pre-training) and SigLIP [01:34:00] [01:31:00].

Loss Functions

Softmax Loss: A traditional approach that computes attention weights across images and text, often performing normalization independently across modalities [01:37:41] [01:21:26].
Sigmoid Loss (SigLIP): A newer approach that processes every image and text pair independently, turning the learning problem into a standard binary classification on all pair combinations (positive for matching, negative for others) [01:43:08] [01:48:47]. This method simultaneously allows for scaling up batch sizes and can be combined with locked image tuning [01:25:22] [01:27:26].

Locked image tuning (LIT) is a technique where parts of the model (e.g., the vision encoder) are “frozen” during training, preventing gradients from flowing into them. This allows for training other parts of the model (like a connector MLP) faster and with lower compute budgets, while leveraging the pre-training of the frozen encoder [02:00:00] [02:07:07].

Findings from Pali (Google DeepMind)

The Pali paper (by Google DeepMind and Google Research) conducted a direct comparison of different vision language models, specifically contrasting vision encoders pre-trained with classification objectives against those pre-trained with contrastive objectives [01:50:56].

Key findings:

Superiority for VLMs: Contrastively pre-trained vision encoders (like SigLIP-based ViTs) show superior performance across various multimodal benchmarks [00:07:00]. They are particularly effective for visually situated text understanding and localization tasks [02:20:00] [01:41:00].
Classification Benchmarks: While contrastive models might slightly underperform on standard image classification benchmarks (e.g., ImageNet), this doesn’t indicate their overall quality for VLM tasks [00:07:00] [01:39:19]. A model worse on ImageNet classification can still be better for Visual Question Answering (VQA) [01:39:45].
Efficiency: Contrastively pre-trained models can lead to more efficient Vision Language Models [01:41:09].

Insights from Other Papers

Comm (Huawei)

This paper, though retracted, investigated the effectiveness of different visual encoders for Multimodal Large Language Models (MLLMs).

Combining Encoders: It explored combining multiple vision encoders, specifically CLIP and DINOv2 [03:06:00].
Multi-Level Features: The paper also proposed using multi-level feature fusion, extracting features from intermediate layers of the vision encoders, as different layers capture different types of information (e.g., low-level details from shallow layers, global semantic information from CLIP, fine-grained pixel information from DINOv2) [03:38:00] [03:41:00] [03:49:00] [01:12:54]. While this combination showed slight improvements, the speaker questioned if the gains justified the increased complexity and inference cost [01:26:50].

Deep Speed Visual Chat (Microsoft)

This work provided several observations from their training process:

Better Visual Encoder: Using a higher-resolution visual encoder, such as the one from Qwen VL (which uses fine-tuned OpenCLIP), significantly improves model quality [00:59:02] [00:59:52].
Scaling: While not the primary focus, they acknowledge that larger language models can offer superior quality [01:45:09]. They also note a potential mismatch in efficiency when combining a very large LLM (e.g., Llama 2 70B) with a relatively small vision encoder (e.g., 2B parameters) [00:46:42] [00:50:50].
Projection Layers: They experimented with different projection layers (connectors) between the vision encoder and the language model, finding no significant benefits from using a complex Vision Transformer layer over a simple linear layer [01:31:23].

Qwen VL (Alibaba)

OpenCLIP as Base: Qwen VL uses OpenCLIP (an open-source reimplementation of CLIP) as its vision encoder, specifically the ViT-BigG model [00:48:33] [00:49:07].
Complex Training Recipe: Qwen VL employs a multi-stage training recipe, including initial pre-training where the language model is frozen and gradients are pushed into the vision encoder, followed by multitask pre-training where gradients are pushed into everything, and finally supervised fine-tuning where the vision encoder is frozen again [01:52:16].
Cross-Attention Adapter: Instead of a simple MLP connector, Qwen VL uses a single-layer cross-attention module as an “adapter” to compress image features [01:53:59].

General Considerations

Data Cleanliness: As data sets grow, especially with synthetic or scraped internet data, noise and imperfection in data become a significant challenge [02:23:00].
Data Augmentation: While simple image augmentations (like flipping or rotating) are generally safe, more complex data blending methods for VLMs (e.g., shuffling images in conversations) can lead to deteriorated performance due to incorrect references [01:32:50] [01:36:00].
Terminology Confusion: The field currently suffers from a lack of standardized terminology, with different research groups using various acronyms like MLMs, VLMs, LMMs, and LVMs to describe similar model architectures [01:55:04].

Tubegraph

Explorer

Table of Contents