From: hu-po
Transformer architectures are increasingly being applied to image processing, particularly in the domain of feature matching, a fundamental problem in computer vision [00:02:40].
What is Feature Matching?
Feature matching is the process of identifying corresponding points between two or more images of the same object or scene [00:03:04]. This allows for the calculation of geometry, such as camera positions and 3D point locations, to reconstruct 3D environments from 2D images [00:03:42].
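To make this concrete, below is a minimal sketch of how a set of matched 2D points can be turned into relative camera geometry, assuming OpenCV as the library (a choice made here for illustration; the video does not prescribe one). The inputs pts0, pts1, and the intrinsic matrix K are hypothetical.

```python
import numpy as np
import cv2

def relative_pose_from_matches(pts0: np.ndarray, pts1: np.ndarray, K: np.ndarray):
    """pts0, pts1: (N, 2) matched pixel coordinates; K: 3x3 camera intrinsics."""
    # Estimate the essential matrix with RANSAC to reject outlier matches.
    E, inliers = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC, threshold=1.0)
    # Decompose it into a relative rotation R and a translation direction t.
    _, R, t, mask = cv2.recoverPose(E, pts0, pts1, K, mask=inliers)
    return R, t, mask

# Given poses for each image pair, the matched points can then be triangulated
# (e.g., with cv2.triangulatePoints) to recover 3D scene points.
```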
Applications of Feature Matching
Feature matching is a core component of many computer vision applications:
- Simultaneous Localization and Mapping (SLAM): Used in devices like the Meta Quest 3 headset, where it simultaneously creates a 3D map and localizes the device within it using RGB cameras [00:04:09]. Robots also use SLAM to build maps for navigation [00:12:48].
- Gaussian Splats: Creating Gaussian Splats from a series of images typically starts with COLMAP, a structure-from-motion pipeline that relies on feature matching [00:04:46].
- 3D Reconstruction and Photogrammetry [00:12:19].
- Camera Tracking [00:12:38].
Traditional Feature Matching: SIFT
Historically, feature matching was dominated by hand-designed features, a practice referred to as “computer vision archaeology” [00:06:11].
- Scale-Invariant Feature Transform (SIFT): An “ancient” algorithm by deep learning standards (it dates back to 1999), still used in pipelines like COLMAP for Gaussian Splats [00:06:00], [00:28:07]. SIFT features are built from local image gradients, which point from dark to light areas within a region around each keypoint [00:07:12]. A short OpenCV sketch of this classic pipeline follows this list.
- Limitations of Hand-Designed Features: In this older paradigm, small learnable models were trained on top of these hand-designed features [00:06:40]. The current paradigm, however, uses deep learning to learn features even at the lowest levels [00:06:49].
- Challenges: Reliably describing points is challenging in conditions with symmetries, weak texture, or appearance changes [00:14:23].
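As a concrete illustration of the hand-designed pipeline, the sketch below detects SIFT keypoints with OpenCV and matches their gradient-based descriptors using Lowe's ratio test. The image paths and the 0.75 ratio threshold are assumptions for illustration, not values from the video.

```python
import cv2

# Load two views of the same scene (paths are placeholders).
img0 = cv2.imread("view0.jpg", cv2.IMREAD_GRAYSCALE)
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute gradient-based SIFT descriptors.
sift = cv2.SIFT_create()
kp0, desc0 = sift.detectAndCompute(img0, None)
kp1, desc1 = sift.detectAndCompute(img1, None)

# Brute-force matching with Lowe's ratio test to discard ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(desc0, desc1, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

print(f"{len(good)} putative SIFT correspondences")
```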
LightGlue: A Modern Approach to Feature Matching
LightGlue is a deep neural network that learns to match local features across images, serving as a modern, deep learning-based approach to feature matching [00:02:51]. It revisits and updates design decisions from its predecessor, SuperGlue [00:10:37].
Key Properties and Improvements:
- Efficiency: LightGlue is more efficient in terms of both memory and computation [00:11:11]. It can run in real-time and only requires about 4 GB of memory for calculating correspondences between two images [00:10:05]. It is notably faster than SuperGlue, sometimes four times faster [01:30:37].
- Adaptiveness: It is adaptive to the difficulty of the problem. Inference is faster on image pairs that are intuitively easy to match (e.g., large visual overlap, limited appearance change) [01:11:07]. This is crucial for VR applications, where consecutive frames from a real-time camera feed are typically very close together [01:12:04].
- Robustness: It works robustly in both indoor and outdoor environments, and critically, it does not require a depth sensor as a starting point [01:15:20], [01:16:08]. Depth sensors often struggle outdoors [01:08:58].
- Ease of Training: LightGlue is easier to train, reaching state-of-the-art accuracy with just a few GPU days [01:17:54].
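For reference, matching a pair of images with the public cvg/LightGlue implementation looks roughly like the sketch below. The module and function names follow that repository's README as understood at the time of writing and may change between versions; the image paths are placeholders.

```python
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

device = "cuda" if torch.cuda.is_available() else "cpu"

# SuperPoint extracts keypoints and descriptors; LightGlue matches them.
extractor = SuperPoint(max_num_keypoints=2048).eval().to(device)
matcher = LightGlue(features="superpoint").eval().to(device)

feats0 = extractor.extract(load_image("view0.jpg").to(device))
feats1 = extractor.extract(load_image("view1.jpg").to(device))
matches01 = matcher({"image0": feats0, "image1": feats1})

# Remove the batch dimension and gather matched keypoint coordinates.
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]
matches = matches01["matches"]                  # (K, 2) index pairs
points0 = feats0["keypoints"][matches[:, 0]]    # (K, 2) pixel coords in image 0
points1 = feats1["keypoints"][matches[:, 1]]    # (K, 2) pixel coords in image 1
```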
LightGlue Architecture
LightGlue uses a stack of L identical layers (L = 9 in their experiments) that process two sets of local features jointly [01:26:58]. Each local feature consists of a 2D point position and a visual descriptor [00:29:37].
Layers and Attention
Each layer is composed of a self-attention unit and a cross-attention unit [00:33:45].
- Self-Attention: Each point in an image attends to all points within the same image [00:38:36]. This involves decomposing the point’s state into query and key vectors [00:40:48].
- Cross-Attention: Each point in one image pulls information from points in the other image [00:38:59]. This is done by computing the similarity between keys from both images [00:58:33].
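A schematic PyTorch sketch of one such layer is given below. It is illustrative only: the module names and dimensions are assumptions, standard multi-head attention stands in for the paper's attention units, and it omits the rotary positional encoding and the keys-based cross-attention described above.

```python
import torch
import torch.nn as nn

class MatchingLayer(nn.Module):
    """Schematic layer: self-attention within each image, then cross-attention
    between images. Point states have shape (batch, num_points, dim)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.GELU(),
                                 nn.Linear(2 * dim, dim))

    def update(self, x, context, attn):
        # Attend to `context`, then fuse the message with the current state.
        message, _ = attn(x, context, context)
        return x + self.mlp(torch.cat([x, message], dim=-1))

    def forward(self, xA, xB):
        # Self-attention: each point attends to points in its own image.
        xA = self.update(xA, xA, self.self_attn)
        xB = self.update(xB, xB, self.self_attn)
        # Cross-attention: each point pulls information from the other image.
        return (self.update(xA, xB, self.cross_attn),
                self.update(xB, xA, self.cross_attn))
```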
Positional Encoding
LightGlue utilizes Rotary Positional Encoding (RoPE) in its self-attention units [00:41:56].
- Relative Position Emphasis: RoPE is superior for this task because it emphasizes the relative position between points rather than their absolute position [00:52:50]. This is particularly useful in computer vision where the relative distances between 3D points remain constant despite camera translations [00:57:18].
- Contrast to Absolute Positional Encoding: Traditional sinusoidal positional encodings, as used in original Transformer models [00:43:36], or learned absolute encodings, tend to be specific to the training data and may not generalize well to longer sequences [00:51:13]. RoPE offers a more robust way to capture long-range dependencies [00:26:27].
- Efficiency: The positional encoding is identical for all layers and is computed once and cached, further improving efficiency [00:57:52]. Positional information is not applied during cross-attention, as relative positions are not meaningful across images [00:59:31].
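Below is a minimal sketch of how rotary encoding can be extended from 1D token indices to 2D keypoint positions. The frequency basis, channel pairing, and caching scheme are illustrative assumptions rather than the paper's exact parameterization; the point of the construction is that the rotated dot product between a query and a key depends only on the offset between the two positions.

```python
import torch

def rotate_pairs(x, cos, sin):
    # Rotate consecutive channel pairs (x1, x2) of each point by its angle.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(q, k, positions, freqs):
    """q, k: (N, d) per-point queries/keys; positions: (N, 2) keypoint
    coordinates; freqs: (2, d // 2) frequency basis (learned or fixed).
    Because q_i and k_j are rotated by angles that are linear in their
    positions, their dot product depends only on p_i - p_j."""
    angles = positions @ freqs              # (N, d // 2)
    cos, sin = angles.cos(), angles.sin()   # can be computed once and cached
    return rotate_pairs(q, cos, sin), rotate_pairs(k, cos, sin)
```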
Confidence Classifier and Pruning
A confidence classifier, implemented as a small MLP (Multi-Layer Perceptron) [01:08:26], decides whether to halt inference and prune points at each layer [00:32:56].
- Adaptive Depth and Width: The model can stop computation early if enough points are confidently matched, adapting its “depth” (number of layers processed) [01:11:13]. It also prunes “unmatchable” points, reducing the “width” (number of points) fed into subsequent layers, which combats the quadratic complexity of attention mechanisms [01:03:57]. This pruning saves significant computation [01:10:42].
- Example: Points that are clearly not co-visible across images (e.g., people in one image but not the other) are quickly pruned in early layers. More ambiguous features (like repeating window patterns) might take more layers to prune [01:09:13].
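The sketch below illustrates the idea of a per-point confidence head with early exit (depth) and point pruning (width). The thresholds, the exit criterion, and all names are assumptions for illustration, not the paper's values.

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Illustrative per-point confidence classifier (a small MLP)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, x):                               # x: (N, dim) point states
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # (N,) confidence in [0, 1]

def prune_or_exit(x, confidence, matchable, conf_th=0.95, exit_ratio=0.9):
    """x: point states, confidence: (N,) from ConfidenceHead,
    matchable: (N,) bool, True if the point currently looks matchable."""
    # Depth: stop early if most points are already classified confidently.
    if (confidence > conf_th).float().mean() > exit_ratio:
        return x, True
    # Width: drop confidently unmatchable points before the next layer.
    keep = ~((confidence > conf_th) & ~matchable)
    return x[keep], False
```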
Comparison with SuperGlue
LightGlue was developed as an improvement on SuperGlue.
- Architecture: SuperGlue was built with ConvNets and Graph Neural Networks (GNNs), architectures that were popular around 2018 [01:09:05]. LightGlue leverages a Transformer architecture, reflecting more modern deep learning practices [01:11:03].
- Positional Encoding: SuperGlue used an MLP to encode absolute point positions, fusing them early with descriptors [01:20:35]. LightGlue’s reliance on relative positional encoding (RoPE) is a key distinction [01:21:37].
- Loss Supervision: SuperGlue could only make predictions and be supervised at the final layer, leading to potential vanishing gradients [01:23:22]. LightGlue’s ability to predict and push gradients at each layer speeds up convergence [01:23:45].
- Matchability: LightGlue disentangles similarity and matchability, which are more efficient to predict and yield cleaner gradients, unlike SuperGlue’s “dustbin” mechanism [01:22:15].
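The sketch below shows how disentangled similarity and matchability scores could be combined into a soft partial assignment, in contrast to appending a dustbin row and column. The layer names are illustrative, and the exact normalization should be checked against the paper.

```python
import torch
import torch.nn as nn

def soft_assignment(xA, xB, proj: nn.Linear, match_head: nn.Linear):
    """xA: (M, d) and xB: (N, d) final point states for the two images."""
    # Pairwise similarity between projected states.
    S = proj(xA) @ proj(xB).transpose(-1, -2)            # (M, N)
    # Per-point matchability: how likely each point is to have a match at all.
    sigmaA = torch.sigmoid(match_head(xA)).squeeze(-1)   # (M,)
    sigmaB = torch.sigmoid(match_head(xB)).squeeze(-1)   # (N,)
    # Normalize similarity over each image's points, then weight by matchability.
    P = S.softmax(dim=-1) * S.softmax(dim=-2)
    return sigmaA[:, None] * sigmaB[None, :] * P         # (M, N) soft assignment
```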
Training and Data
LightGlue employs fully supervised training, relying on ground truth labels [01:17:54].
- Synthetic Homographies: The model is first pre-trained with synthetic homographies generated from a large image collection, then fine-tuned on MegaDepth (roughly 1 million images of landmarks) [01:24:32]. This synthetic pre-training is crucial for generalization, especially since real-world landmark scenes can be distinctive and lead to overfitting [01:25:31]. A sketch of the homography-supervision idea follows this list.
- Feature Extraction: LightGlue itself does not generate the feature points. It takes them as input, typically from a learned detector such as SuperPoint (introduced in a 2018 Magic Leap paper and trained on synthetic data) or even from traditional SIFT features [00:30:40], [01:27:20].
- Hardware and Settings: Training involved using 2K points per image, gradient checkpointing, and mixed precision on a single GPU with 24GB of VRAM (likely an RTX 3090) [01:26:41].
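The homography pre-training mentioned above can be sketched as follows: warping an image with a random homography yields exact ground-truth correspondences for free. The corner-jitter scheme and parameters here are illustrative assumptions, not the paper's recipe.

```python
import numpy as np
import cv2

def random_homography_pair(image: np.ndarray, keypoints: np.ndarray, jitter: float = 0.25):
    """Warp `image` with a random homography and carry `keypoints` (N, 2) along,
    producing exact ground-truth correspondences for supervision."""
    h, w = image.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Randomly perturb the four corners to define the homography.
    noise = (np.random.rand(4, 2) - 0.5) * 2 * jitter * np.float32([w, h])
    H = cv2.getPerspectiveTransform(corners, (corners + noise).astype(np.float32))
    warped = cv2.warpPerspective(image, H, (w, h))
    # Apply the same homography to the keypoints to get their new positions.
    warped_kps = cv2.perspectiveTransform(
        keypoints.reshape(-1, 1, 2).astype(np.float32), H)
    return warped, warped_kps.reshape(-1, 2), H
```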
Performance and Insights
LightGlue is generally considered a “drop-in replacement” for SuperGlue with clear benefits [01:43:11].
- Accuracy vs. Speed: While LightGlue’s accuracy is only marginally better than SuperGlue’s (e.g., 0.1% improvement), its primary advantage lies in its speed, being roughly three to four times faster [01:30:37].
- Ablation Studies:
  - Matchability Classifier: Removing the pruning classifier leads to higher recall (finding more matches) but worse precision (more false positives), confirming its importance for discriminating good from bad matches [01:31:47].
  - Rotary Positional Encoding: Using RoPE instead of absolute positional encodings provides a slight but consistent improvement in accuracy [01:32:13].
- Failure Cases: LightGlue can struggle with repetitive objects in a scene (e.g., identical chairs or Coke bottles) where local feature descriptors might be too similar, leading to incorrect matches [01:45:15]. This suggests a need for more global context in feature descriptors.
Future Outlook
The current pipeline for applications like Gaussian Splats often involves multiple modular steps [01:37:50]. While LightGlue improves one specific step (feature matching), an “end-to-end” approach that integrates feature extraction, matching, structure from motion, and splatting from video to 3D representation could be a significant future advancement [01:38:00]. This would streamline complex computer vision pipelines.