From: hu-po
The field of deep learning in computer vision has seen significant advancements, with two primary families of foundation models for visual representation learning: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) [01:21:53]. While ViTs have generally surpassed CNNs in fitting capability and popularity, mirroring the dominance of Transformers in the language space [01:30:04], they suffer from quadratic complexity in the number of image tokens, making them computationally expensive for high-resolution images [01:16:16] [01:21:20]. CNNs, by contrast, offer linear complexity but lack global receptive fields [01:37:49].
This trade-off between performance and efficiency is particularly critical for applications like autonomous vehicles, where energy and compute optimization is paramount [00:56:51]. This has led to the exploration of alternative architectures, such as State Space Models (SSMs), specifically the Mamba architecture [00:05:01].
Mamba Architecture for Vision
Mamba, a State Space Model, is presented as an alternative to Transformers and ConvNets [00:05:01]. Initially popularized for sequence-to-sequence language modeling, Mamba is now being applied in the vision space [00:05:06]. The name “Mamba” comes from a type of snake known for its speed, underscoring the architecture’s focus on efficiency [01:00:00].
The core advantage of Mamba models is their computational efficiency, achieving linear complexity while retaining global receptive fields [01:52:53]. This contrasts with Vision Transformers, which have quadratic complexity with respect to input sequence length or image size [01:16:16].
State Space Models Explained
State Space Models map a continuous input signal x(t) (the stimulation) to a continuous output y(t) (the response) through a compressed hidden state h(t) [00:59:23]. This hidden state acts as a bottleneck, limiting the information passed from token to token, thereby reducing compute but potentially leading to information loss [01:00:00].
SSMs typically formulate operations as linear ordinary differential equations with parameters A, B, C, and D (for skip connections) [01:00:20]. For integration into deep learning algorithms, these continuous systems are discretized using methods like Zero-Order Hold (ZOH) [01:01:11].
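Concretely, the continuous system and its ZOH discretization are usually written as follows (this is the standard SSM formulation used by Mamba-style models; the D term is the skip connection, often folded into a residual path, and Δ denotes the step size):

$$
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), \qquad & \bar{A} &= \exp(\Delta A),\\
y(t)  &= C\,h(t) + D\,x(t), \qquad & \bar{B} &= (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,
\end{aligned}
$$

which yields the discrete recurrence $h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t$ and $y_t = C\,h_t + D\,x_t$ that can be run token by token inside a deep network.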
Addressing Direction Sensitivity in Vision
Unlike textual data, which has a causal, one-dimensional sequence flow, images are 2D and lack inherent causality [03:08:18]. Directly applying a 1D SSM to a flattened image sequence can result in restricted receptive fields [03:19:52]. Two papers propose different solutions to this “direction-sensitive problem”:
- V-Mamba (Visual Mamba): Introduces a Cross-Scan Module (CSM) [02:02:03]. The CSM adopts a four-way scanning strategy (left-to-right, top-to-bottom, and their reversed orders) across the feature map [02:05:15]. This ensures each element integrates information from all other locations from different directions, yielding a global receptive field while keeping the computational complexity linear [03:22:23] [03:36:20] (see the scan sketch after this list).
- Vision Mamba (Vim): Uses bidirectional Mamba blocks [02:21:52]. This approach processes the token sequence in both the forward (left-to-right, top-to-bottom) and backward (right-to-left, bottom-to-top) directions [03:51:50]. Vim also incorporates position embeddings and a learnable classification token, similar to ViTs [03:54:54].
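To make the two scan strategies concrete, here is a minimal PyTorch-style sketch of the orderings alone. This is illustrative only: the function names are my own, and the actual implementations fuse these reorderings with the selective-scan SSM kernel rather than materializing separate copies of the sequence.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """V-Mamba-style four-way scan: (B, C, H, W) -> (B, 4, C, H*W).

    The four 1D sequences are the row-major order, the column-major order,
    and both of those reversed, so every token is reached from four directions.
    """
    row_major = x.flatten(2)                           # left-to-right, then top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)           # top-to-bottom, then left-to-right
    seqs = torch.stack([row_major, col_major], dim=1)  # (B, 2, C, L)
    return torch.cat([seqs, seqs.flip(-1)], dim=1)     # append reversed orders -> (B, 4, C, L)

def bidirectional_scan(tokens: torch.Tensor) -> torch.Tensor:
    """Vim-style bidirectional ordering of patch tokens: (B, L, C) -> (B, 2, L, C)."""
    return torch.stack([tokens, tokens.flip(1)], dim=1)

# Example: a 2x2 feature map with one channel, tokens numbered 1..4 row by row.
x = torch.arange(1.0, 5.0).view(1, 1, 2, 2)
print(cross_scan(x).squeeze())  # rows: [1,2,3,4], [1,3,2,4], [4,3,2,1], [4,2,3,1]
```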
Model Architecture Differences
| Feature | V-Mamba (Visual Mamba) | Vision Mamba (Vim) |
|---|---|---|
| Direction handling | Cross-Scan Module (four-way scanning) [01:40:01] | Bidirectional Mamba blocks (forward and backward) [03:51:50] |
| Position embeddings | No [02:06:05] | Yes [02:06:05] |
| Class token | No [02:06:05] | Yes [02:06:05] |
| MLP in blocks | No, “shallower” design [02:06:05] | Yes [02:06:05] |
| Activation function | SiLU [02:06:05] | Not specified (likely ReLU or GELU, as commonly used in Transformers) |
Performance and Efficiency for Autonomous Vehicles
Autonomous vehicle companies, such as Horizon Robotics (a contributor to the Vision Mamba paper), are highly interested in efficient deep learning architectures [00:08:52].
Why Efficiency Matters in Autonomous Vehicles
Autonomous vehicles require very low latency and must run models on edge devices or the car’s GPU directly, not via API calls to a server [00:09:27]. This necessitates extremely quick and efficient processing [00:09:46]. Furthermore, unlike general vision-language models where image compression is acceptable (e.g., for calorie counting from food photos), autonomous vehicles cannot afford to reduce image resolution [02:27:06]. Small, distant objects (e.g., other cars, stop signs) can be critical, requiring full-resolution images to maintain detectability [02:27:57].
Mamba models are highlighted for their speed, especially at higher resolutions, and for improved GPU memory usage compared to ViTs [02:22:30]. For example, Vision Mamba (Vim) can save significant GPU memory on large images (e.g., 1248×1248) [02:07:07]. The computational complexity of self-attention in ViTs is O(M²), where M is the sequence length, so the cost grows rapidly with image resolution [02:21:14]. In contrast, SSMs have O(N²M) complexity, where N is the SSM state dimension (much smaller than M), so their cost scales linearly with sequence length [02:26:30].
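A rough back-of-the-envelope comparison makes this concrete, using the approximate per-layer FLOP counts usually quoted for self-attention (4MD² + 2M²D) and the selective SSM scan (3MEN + MEN², with E the expanded dimension). The embedding width D = 192, expansion E = 2D, and state size N = 16 below are illustrative choices, not figures from the papers:

```python
def self_attention_flops(M: int, D: int) -> float:
    """Approximate FLOPs of one self-attention layer: quadratic in sequence length M."""
    return 4 * M * D**2 + 2 * M**2 * D

def ssm_flops(M: int, D: int, N: int = 16) -> float:
    """Approximate FLOPs of one selective-scan (SSM) layer: linear in sequence length M."""
    E = 2 * D  # expanded state dimension (Mamba-style expansion factor of 2)
    return 3 * M * E * N + M * E * N**2

# Sequence length M grows with the square of the image side (16x16 patches assumed),
# so the attention/SSM cost ratio widens quickly at higher resolutions.
for side in (224, 512, 1248):
    M = (side // 16) ** 2
    ratio = self_attention_flops(M, 192) / ssm_flops(M, 192)
    print(f"{side}x{side}: M={M:5d}, attention/SSM FLOP ratio ~ {ratio:.1f}")
```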
Benchmark Comparisons
Both V-Mamba and Vision Mamba are evaluated on standard computer vision benchmarks:
- ImageNet-1K: The standard image classification benchmark with roughly 1.28 million training images across 1,000 categories, drawn from the full ImageNet database of about 14 million annotated images [00:39:12].
- COCO (Common Objects in Context): A large-scale dataset for object detection and instance segmentation, among other tasks [00:40:55].
- ADE20K: A semantic segmentation benchmark requiring pixel-level classification into 150 categories [00:41:44].
In a head-to-head comparison:
- ImageNet-1K Accuracy: V-Mamba (e.g., V-Mamba-S with 22M parameters) achieves 82% top-1 accuracy [01:53:31], while Vision Mamba (Vim-S with 26M parameters) achieves 80% [01:53:50], suggesting V-Mamba reaches slightly higher accuracy with fewer parameters [01:54:14].
- COCO Object Detection: V-Mamba generally shows slightly better AP box scores compared to Vim for similar model sizes [01:54:14].
- ADE20K Semantic Segmentation: V-Mamba also outperforms Vim; for instance, a V-Mamba with 46M parameters achieves 47-48% mIoU, while a Vim with 13M parameters achieves 40% [01:54:14] (note the large difference in parameter counts, so this is not a like-for-like comparison).
V-Mamba generally shows slightly better benchmark performance across the board [01:54:14]. However, it’s worth noting that Vim aims for greater computational and memory efficiency, which might explain its slightly lower performance [00:56:51].
GPU Memory Optimization
Vision Mamba (Vim) implements specific hardware-aware design choices to optimize performance on GPUs. The authors focus on minimizing memory I/O between the GPU's memory levels, High Bandwidth Memory (HBM) and SRAM [02:39:09]; SRAM offers higher bandwidth, while HBM offers larger capacity [02:44:03]. Vim's SSM implementation reduces memory I/O from O(BMEN) to O(BME + N), where B is the batch size, M the sequence length, E the expanded state dimension, and N the SSM state dimension [02:45:06]. The authors also recompute intermediate activations during the backward pass rather than storing them, further reducing GPU memory requirements [02:51:30]. This attention to hardware-aware design is crucial for deployment in resource-constrained environments like autonomous vehicles.
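The recomputation idea maps directly onto standard activation checkpointing. Below is a minimal PyTorch sketch of that trade (this is not Vim's actual fused CUDA kernel, only an illustration of exchanging an extra forward pass during backprop for lower activation memory; the wrapped block here is a hypothetical stand-in):

```python
import torch
from torch.utils.checkpoint import checkpoint

class RecomputedBlock(torch.nn.Module):
    """Wraps a block so its intermediate activations are not kept in GPU memory;
    they are recomputed from the block's input during the backward pass."""

    def __init__(self, block: torch.nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended checkpointing mode in recent PyTorch versions
        return checkpoint(self.block, x, use_reentrant=False)

# Usage: wrap a placeholder block (standing in for a Mamba-style vision block)
# to cut activation memory at the cost of one extra forward pass during backprop.
block = RecomputedBlock(
    torch.nn.Sequential(torch.nn.Linear(192, 384), torch.nn.SiLU(), torch.nn.Linear(384, 192))
)
x = torch.randn(8, 196, 192, requires_grad=True)
block(x).sum().backward()
```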
Future Directions
While Mamba models show promise for high-resolution images and videos, particularly in time-sensitive and edge computing applications like autonomous vehicles, medical imaging, and remote sensing [02:26:03] [02:30:53], it remains to be seen if they will usurp Transformers in more generic vision tasks [02:48:40].
The ability of V-Mamba to adapt its effective receptive field from local (before training) to global (after training) suggests a flexible architecture [03:31:55]. This adaptability allows the model to learn the most useful receptive field for visual representations [03:38:26].