From: hu-po
Jamba, Gamba, and Cobra showcase the increasing adoption of Mamba models across diverse modalities and problem spaces, often in hybrid architectures that combine their strengths with traditional Transformer components [00:01:47] [00:01:50].

Gamba: 3D Reconstruction with Hybrid Architecture

Gamba is an end-to-end, amortized 3D reconstruction model that generates a three-dimensional representation of an object from a single image [00:04:11] [00:04:29]. It is built around a Mamba-based sequential network [00:04:33].

The architecture takes a single input image, feeds it into an image tokenizer, and then into the Gamba model [00:18:22] [00:18:25]. The image tokenizer employed is DINO, which is itself a Vision Transformer [00:20:36] [00:20:42]. This means Gamba is fundamentally a combination of a Vision Transformer encoder and a Mamba-based encoder-decoder [00:20:46] [00:20:50]. The pre-trained DINO encoder remains “frozen” during training, with gradients propagated through the Mamba components and connector [00:23:56] [00:24:05].
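
To make that data flow concrete, here is a minimal PyTorch-style sketch of the wiring described above, assuming a generic `MambaDecoder` stand-in for Gamba’s Mamba-based network and placeholder feature dimensions. It only illustrates the frozen-DINO / trainable-Mamba split, not the authors’ actual implementation.

```python
import torch
import torch.nn as nn

class GambaSketch(nn.Module):
    """Frozen ViT (DINO) tokenizer feeding a trainable Mamba-based decoder."""

    def __init__(self, dino_encoder: nn.Module, mamba_decoder: nn.Module,
                 dino_dim: int = 384, d_model: int = 512):
        super().__init__()
        self.encoder = dino_encoder                    # pre-trained DINO ViT, kept frozen
        self.connector = nn.Linear(dino_dim, d_model)  # maps image tokens to decoder width
        self.decoder = mamba_decoder                   # Mamba-based encoder-decoder (trainable)
        for p in self.encoder.parameters():            # no gradients flow into DINO
            p.requires_grad = False

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                          # encoder stays frozen at train time
            tokens = self.encoder(image)               # (B, N_tokens, dino_dim) image tokens
        # Gradients propagate only through the connector and the Mamba decoder.
        return self.decoder(self.connector(tokens))
```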

A notable critique of Gamba’s hybrid approach is its continued reliance on a Vision Transformer for image encoding, despite the paper’s focus on Mamba technology [00:21:00] [00:22:40] [00:22:44]. The model’s results are not state-of-the-art, partly because it was pre-trained on a smaller dataset (OmniObject3D) rather than a larger one like Objaverse [00:05:41] [00:05:48] [00:15:48]. However, it demonstrates a significant speed advantage, reconstructing a 3D object in only about 6 seconds on a single NVIDIA A100 GPU [00:06:09] [00:06:33] [00:06:37].

Cobra: Multimodal Language Models with Mamba Backbone

Cobra is another example of a hybrid architecture, extending Mamba to Multimodal Large Language Models (MLLMs), also known as Vision Language Models [00:21:23] [00:25:27]. Cobra likewise employs Vision Transformers for image encoding, specifically an ensemble of DINO and SigLIP encoders [00:21:46] [00:21:48]. Their outputs are combined and fed into a “projector” (an MLP) that converts the fused features into image tokens consumable by the language model [00:24:10] [00:24:15]. The language model itself is composed of 64 Mamba blocks [00:36:35].
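
A hedged sketch of that vision pipeline is shown below: two frozen encoders whose token features are concatenated and passed through an MLP projector into the language model’s token space. The encoder objects and feature dimensions are placeholders chosen for illustration, not Cobra’s exact configuration.

```python
import torch
import torch.nn as nn

class CobraVisionSketch(nn.Module):
    """Ensemble of DINO + SigLIP features -> MLP projector -> image tokens."""

    def __init__(self, dino: nn.Module, siglip: nn.Module,
                 dino_dim: int = 1024, siglip_dim: int = 1152, lm_dim: int = 2560):
        super().__init__()
        self.dino, self.siglip = dino, siglip          # frozen ViT encoders
        # MLP "projector" mapping concatenated vision features to the LM token width
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Concatenate per-patch features from both encoders along the channel axis.
        feats = torch.cat([self.dino(image), self.siglip(image)], dim=-1)
        return self.projector(feats)                   # tokens consumable by the Mamba LM
```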

Like Gamba, Cobra keeps a Vision Transformer at the base of its vision-encoding pipeline, prompting the same critique: the architecture is not entirely Mamba-based [00:22:55] [00:36:59].

Cobra’s Mamba foundation gives it linear computational complexity in sequence length, making it far more efficient than Transformers, whose attention scales quadratically [00:36:13] [00:36:18]. It achieves competitive performance against other computationally efficient methods while using roughly 43% of the parameters of LLaVA [00:35:19] [00:35:50]. Its inference speed is notably higher, reaching 166 tokens per second versus 39-40 tokens per second for comparable Transformer-based models [00:41:59] [00:42:30] [00:42:36]. This speed makes Mamba-based models promising for time-sensitive applications like robotics and autonomous vehicles [00:46:46] [00:47:37].
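
The back-of-the-envelope comparison below illustrates why that linear scaling matters; the unit costs are purely illustrative, not measured FLOPs for either model.

```python
# Illustrative scaling only: self-attention grows ~O(N^2) with sequence length N,
# while a Mamba scan grows ~O(N).
def relative_cost(n_tokens: int) -> tuple[int, int]:
    attention = n_tokens ** 2   # pairwise token interactions
    mamba = n_tokens            # single recurrent scan over the sequence
    return attention, mamba

for n in (2_048, 4_096, 8_192):
    attn, mamba = relative_cost(n)
    print(f"N={n:>5}: attention ~{attn:>12,} units, mamba ~{mamba:>6,} units")
# Doubling the context quadruples the attention term but only doubles the Mamba term.
```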

Jamba: Language Model with Alternating Layers

Jamba is an open-source language model developed by AI21, characterized by a hybrid structure that combines Mamba blocks and Transformer blocks [00:50:01] [00:50:04] [00:50:08]. The architecture consists of 32 alternating layers: standard Mamba layers, “Mamba + MoE” layers that integrate a Mixture of Experts (MoE) component, and attention (Transformer) layers interleaved periodically among them [00:54:45] [00:54:48] [00:55:24].
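
One way to picture that interleaving is the small layer-plan sketch below; the attention and MoE periods are assumptions chosen to match the alternating pattern described above, not AI21’s published configuration.

```python
def build_layer_plan(n_layers: int = 32, attn_every: int = 8, moe_every: int = 2) -> list[str]:
    """Hypothetical Jamba-style stack: mostly Mamba, occasional attention, MoE in alternating layers."""
    plan = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_every == 0 else "mamba"  # sparse attention layers among Mamba layers
        ffn = "moe" if i % moe_every == 1 else "mlp"             # MoE replaces the MLP in every other layer
        plan.append(f"{mixer}+{ffn}")
    return plan

print(build_layer_plan()[:8])
# ['attention+mlp', 'mamba+moe', 'mamba+mlp', 'mamba+moe', 'mamba+mlp', 'mamba+moe', 'mamba+mlp', 'mamba+moe']
```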

Jamba’s MoE layers allow it to activate only a subset of its 52 billion parameters (roughly 12 billion) during inference, contributing to efficiency [00:53:17] [00:53:19]. It supports large context windows (up to 140k tokens) because Mamba blocks scale linearly with sequence length, addressing a key limitation of Transformer models [00:52:43] [00:52:48].
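
The arithmetic behind those numbers is straightforward; the expert count and top-k routing values below are assumptions for illustration, while the 52B-total / ~12B-active figures come from the discussion above.

```python
total_params = 52e9                # total parameters across all experts
active_params = 12e9               # parameters actually used per forward pass
n_experts, top_k = 16, 2           # assumed MoE routing: each token is sent to 2 of 16 experts

print(f"Active fraction of total parameters: {active_params / total_params:.0%}")  # ~23%
print(f"Experts used per MoE layer per token: {top_k}/{n_experts} ({top_k / n_experts:.0%})")
```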

This hybrid approach of alternating Mamba and Transformer (or RNN/attention) layers has precedents, such as Google’s Griffin model [01:11:03] [01:11:11] [01:12:10].

Challenges in Hybrid Architectures: Quantization

A potential weakness identified in Mamba models, and thus in hybrid architectures that employ them, is their sensitivity to numerical precision [01:05:00] [01:05:03]. Both Cobra and Jamba indicate that Mamba blocks require relatively high precision (no lower than bf16 for Cobra) [01:05:05] [01:05:08], and Jamba’s documentation explicitly recommends excluding the Mamba blocks from quantization to avoid degrading model quality [01:07:38] [01:07:43] [01:07:46].
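
In practice, this kind of selective quantization can be expressed by skipping the Mamba modules when loading the model. The sketch below uses Hugging Face transformers with bitsandbytes; the checkpoint identifier and the skipped-module name are assumptions to be checked against the relevant model card.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the Transformer/MoE parts to 8-bit while keeping the Mamba blocks in bf16.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"],   # assumed module name for the Mamba blocks
)

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",             # assumed checkpoint identifier
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="auto",
)
```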

This limitation suggests that while Mamba models offer inherent speed advantages, they may struggle with aggressive quantization, a technique that significantly boosts the efficiency of Transformer models [01:08:12] [01:08:30] [01:08:33]. It could become an “Achilles heel” for Mamba if advances in Transformer quantization outpace Mamba’s ability to operate at lower bit depths [01:08:44] [01:09:55].

Despite this, the continued exploration of hybrid architectures suggests that combining Mambas with Transformers (or other components) allows researchers to leverage the strengths of each, achieving competitive performance and efficiency in various AI applications [01:12:09] [01:12:11].