From: hu-po

The field of deep learning in computer vision has seen significant advancements, with two primary foundation models for visual representation learning: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) [01:21:53]. ViTs have generally surpassed CNNs in fitting capability and popularity, mirroring the dominance of Transformers in the language space [01:30:04], but they suffer from quadratic complexity in the number of image tokens (and hence in image resolution), making them computationally expensive at high resolutions [01:16:16] [01:21:20]. CNNs, by contrast, offer linear complexity but are limited by their local receptive fields [01:37:49].

This trade-off between performance and efficiency is particularly critical for applications like autonomous vehicles, where energy and compute optimization is paramount [00:56:51]. This has led to the exploration of alternative architectures, such as State Space Models (SSMs), specifically the Mamba architecture [00:05:01].

Mamba Architecture for Vision

Mamba, a State Space Model, is presented as an alternative to Transformers and ConvNets [00:05:01]. Initially popularized for sequence-to-sequence language modeling, Mamba is now being applied in the vision space [00:05:06]. The name “Mamba” comes from a snake known for its speed, reflecting the architecture’s focus on efficiency [01:00:00].

The core advantage of Mamba models is their computational efficiency, achieving linear complexity while retaining global receptive fields [01:52:53]. This contrasts with Vision Transformers, which have quadratic complexity with respect to input sequence length or image size [01:16:16].

State Space Models Explained

State Space Models map a continuous input signal x(t) (the stimulation) to a continuous output y(t) (the response) through a compressed hidden state h(t) [00:59:23]. This hidden state acts as a bottleneck, limiting the information passed from token to token, thereby reducing compute but potentially leading to information loss [01:00:00].

SSMs typically formulate the mapping as a linear ordinary differential equation with parameters A, B, C, and D (the latter acting as a skip connection) [01:00:20]. For integration into deep learning pipelines, this continuous system is discretized, commonly via the Zero-Order Hold (ZOH) method [01:01:11].
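For reference, the standard continuous-time formulation and its ZOH discretization, as written in the S4/Mamba literature (not transcribed from the video), are:

$$
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t)
$$

$$
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B
$$

$$
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t + D\,x_t
$$

Here Δ is the discretization step size; in practice the discrete recurrence is evaluated as a parallel scan over the token sequence.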

Addressing Direction Sensitivity in Vision

Unlike textual data, which has a causal, one-dimensional sequence flow, images are 2D and lack inherent causality [03:08:18]. Directly applying a 1D SSM to a flattened image sequence can result in restricted receptive fields [03:19:52]. Two papers propose different solutions to this “direction-sensitive problem”:

  • V-Mamba (Visual Mamba): Introduces a Cross-Scan Module (CSM) [02:02:03]. The CSM adopts a four-way scanning strategy (left-to-right and top-to-bottom traversals plus their reverses) across the feature map [02:05:15]. This lets each element integrate information from all other locations in different directions, yielding a global receptive field while keeping computational complexity linear [03:22:23] [03:36:20].
  • Vision Mamba (Vim): Uses bidirectional Mamba blocks [02:21:52]. This approach processes the flattened token sequence in both the forward (left-to-right, top-to-bottom) and backward (right-to-left, bottom-to-top) directions [03:51:50]. Vim also incorporates position embeddings and a learnable classification token, similar to ViTs [03:54:54]. A simplified sketch of both scanning schemes is shown after this list.
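The NumPy sketch below is purely illustrative of the two scanning schemes; the function names and shapes are hypothetical, and both papers implement their scans inside fused GPU kernels rather than like this.

```python
import numpy as np

def cross_scan_4way(feature_map):
    """V-Mamba-style Cross-Scan: flatten an (H, W, C) feature map along four
    scan routes (row-major, column-major, and their reverses)."""
    H, W, C = feature_map.shape
    row_major = feature_map.reshape(H * W, C)                      # left-to-right, row by row
    col_major = feature_map.transpose(1, 0, 2).reshape(H * W, C)   # top-to-bottom, column by column
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def bidirectional_scan(patch_tokens):
    """Vim-style bidirectional processing: run the flattened 1D token sequence
    forward and backward; in the real model each direction has its own SSM."""
    return patch_tokens, patch_tokens[::-1]

# Toy usage: a 4x4 "image" with 8-dimensional features per position.
fmap = np.random.randn(4, 4, 8)
four_sequences = cross_scan_4way(fmap)                  # four 16-token sequences
forward, backward = bidirectional_scan(fmap.reshape(16, 8))
```

Each of these flattened sequences is then fed through an SSM, and the per-direction outputs are merged back into the 2D layout, which is how both models recover a global receptive field from 1D scans.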

Model Architecture Differences

| Feature | V-Mamba (Visual Mamba) | Vision Mamba (Vim) |
| --- | --- | --- |
| Direction handling | Cross-Scan Module (4-way scanning) [01:40:01] | Bidirectional Mamba blocks (forward and backward) [03:51:50] |
| Position embeddings | No [02:06:05] | Yes [02:06:05] |
| Class token | No [02:06:05] | Yes [02:06:05] |
| MLP in blocks | No, “shallower” design [02:06:05] | Yes [02:06:05] |
| Activation function | SiLU [02:06:05] | Not specified; likely ReLU or GeLU, as commonly used in Transformers |

Performance and Efficiency for Autonomous Vehicles

Autonomous vehicle companies, such as Horizon Robotics (a contributor to the Vision Mamba paper), are highly interested in efficient deep learning architectures [00:08:52].

Why Efficiency Matters in Autonomous Vehicles

Autonomous vehicles require very low latency and must run models on edge devices or the car’s GPU directly, not via API calls to a server [00:09:27]. This necessitates extremely quick and efficient processing [00:09:46]. Furthermore, unlike general vision-language models where image compression is acceptable (e.g., for calorie counting from food photos), autonomous vehicles cannot afford to reduce image resolution [02:27:06]. Small, distant objects (e.g., other cars, stop signs) can be critical, requiring full-resolution images to maintain detectability [02:27:57].

Mamba models are highlighted for their speed, especially at higher resolutions, and for improved GPU memory usage compared to ViTs [02:22:30]. For example, Vision Mamba (Vim) can save significant GPU memory on large images (e.g., 1248x1248) [02:07:07]. The computational complexity of self-attention in ViTs is O(M²), where M is the sequence length, so cost blows up as image resolution (and thus token count) grows [02:21:14]. In contrast, SSMs have O(N²M) complexity, where N is the SSM state dimension (a small constant, much smaller than M), so cost scales only linearly with sequence length [02:26:30].
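As a rough back-of-the-envelope illustration of that scaling (assuming a 16x16 patch size and SSM state dimension N = 16; these are assumptions for the example, not figures quoted from the talk):

```python
# Illustrative token-count arithmetic, ignoring channel dimensions.
image_side = 1248
patch_size = 16
M = (image_side // patch_size) ** 2   # number of patch tokens: 78 * 78 = 6084
N = 16                                # assumed SSM state dimension

attention_cost = M ** 2               # ~37 million pairwise interactions
ssm_cost = N ** 2 * M                 # ~1.6 million state-update operations

print(M, attention_cost, ssm_cost)
# Doubling the image side quadruples M: the attention term grows ~16x,
# while the SSM term grows only ~4x (linear in M).
```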

Benchmark Comparisons

Both V-Mamba and Vision Mamba are evaluated on standard computer vision benchmarks:

  • ImageNet-1K: An image classification benchmark with 1,000 categories; it is the widely used subset of the full ImageNet database (over 14 million annotated images), containing roughly 1.28 million training images [00:39:12].
  • COCO (Common Objects in Context): A large-scale dataset for object detection and instance segmentation, among other tasks [00:40:55].
  • ADE20K: A semantic segmentation benchmark requiring pixel-level classification into 150 categories [00:41:44].

In a head-to-head comparison:

  • ImageNet-1K Accuracy: V-Mamba (e.g., V-Mamba-S with 22M parameters) achieves 82% top-1 accuracy [01:53:31], while Vision Mamba (Vim-S with 26M parameters) achieves 80% [01:53:50], suggesting V-Mamba delivers slightly higher accuracy with fewer parameters [01:54:14].
  • COCO Object Detection: V-Mamba generally shows slightly better box AP scores than Vim at comparable model sizes [01:54:14].
  • ADE20K Semantic Segmentation: V-Mamba also outperforms Vim; for instance, a 46M-parameter V-Mamba reaches 47-48 mIoU, while a 13M-parameter Vim reaches about 40 mIoU [01:54:14].

V-Mamba generally shows slightly better benchmark performance across the board [01:54:14]. However, it’s worth noting that Vim aims for greater computational and memory efficiency, which might explain its slightly lower performance [00:56:51].

GPU Memory Optimization

Vision Mamba (Vim) includes hardware-aware design choices to run efficiently on GPUs, focusing on minimizing memory I/O between GPU memory levels: High Bandwidth Memory (HBM) and SRAM [02:39:09]. SRAM offers much higher bandwidth, while HBM offers much larger capacity [02:44:03]. Vim’s SSM implementation reduces memory I/O to O(BME + EN) from O(BMEN), where B is the batch size, M the sequence length, E the expanded state dimension, and N the SSM state dimension [02:45:06]. It also recomputes intermediate activations during the backward pass instead of storing them, reducing GPU memory requirements [02:51:30]. This attention to hardware-aware design is crucial for deployment in resource-constrained environments like autonomous vehicles.
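Vim performs this recomputation inside its fused CUDA kernels; as a loose module-level analogy only, the same trade of extra recompute for lower activation memory can be sketched with PyTorch's gradient checkpointing (the wrapped block and tensor sizes below are hypothetical):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    """Wraps an arbitrary block so its intermediate activations are not stored
    during the forward pass; they are recomputed during the backward pass,
    trading extra compute for lower GPU memory."""
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        # use_reentrant=False is the recommended checkpointing mode in recent PyTorch.
        return checkpoint(self.block, x, use_reentrant=False)

# Toy usage: a stand-in for a Mamba-style block (hypothetical sizes).
block = torch.nn.Sequential(
    torch.nn.Linear(192, 384), torch.nn.SiLU(), torch.nn.Linear(384, 192)
)
tokens = torch.randn(8, 6084, 192, requires_grad=True)   # (batch, sequence, channels)
out = CheckpointedBlock(block)(tokens)
out.sum().backward()   # activations inside `block` are recomputed here
```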

Future Directions

While Mamba models show promise for high-resolution images and videos, particularly in time-sensitive and edge computing applications like autonomous vehicles, medical imaging, and remote sensing [02:26:03] [02:30:53], it remains to be seen if they will usurp Transformers in more generic vision tasks [02:48:40].

The ability of V-Mamba to adapt its effective receptive field from local (before training) to global (after training) suggests a flexible architecture [03:31:55]. This adaptability allows the model to learn the most useful receptive field for visual representations [03:38:26].