From: hu-po

Omnimotion is a novel method for estimating full-length motion trajectories for every pixel in every frame of a video [00:01:43]. Developed by researchers at Cornell, Google Research, and UC Berkeley, it aims to achieve accurate, coherent, long-range motion tracking, even when objects are occluded [00:02:32].

Challenges in Traditional Motion Tracking

Historically, motion estimation methods have relied on two dominant approaches: sparse feature tracking and dense optical flow [00:10:11].

  • Sparse Feature Tracking focuses on distinctive interest points (e.g., corners and edges) [00:10:29], limiting tracking to a subset of the scene and struggling in textureless or featureless regions [00:20:53].
  • Dense Optical Flow calculates motion vectors for every point in an image [00:06:02]. However, it typically operates within limited temporal windows (e.g., a couple of frames) [00:06:35]. Chaining these pairwise flows over longer sequences leads to drift and fails to handle occlusions effectively [00:23:56]. Current video generation and multimodal modeling techniques also face significant computational constraints, making it difficult to process an entire video’s context [00:12:40].

The key challenges remaining in dense, long-range trajectory estimation include:

  • Maintaining accurate tracks across long sequences [00:14:15].
  • Tracking points through occlusions [00:14:19].
  • Maintaining coherence in space and time [00:14:22].

Omnimotion’s Innovative Approach

Omnimotion proposes a holistic method that utilizes all information within a video to jointly estimate full-length motion trajectories for every pixel [00:14:35].

Quasi-3D Video Representation

A core element of Omnimotion is its novel quasi-3D canonical volume, denoted as ‘G’ [00:43:55].

  • This volume is mapped to per-frame local volumes (‘Li’) through a set of local-canonical bijections (‘Ti’) [00:43:59]. A bijection ensures a one-to-one mapping where every point in one space corresponds uniquely to a point in another [00:08:08].
  • The bijections (‘Ti’) are parameterized as invertible neural networks [00:44:10]. Specifically, they use a “Real NVP” (Real-valued Non-Volume Preserving) architecture, which consists of stacked affine coupling layers that are invertible by construction [01:00:10], [01:10:32] (a coupling-layer sketch follows this list).
  • The canonical coordinate ‘U’ is designed to be time-independent, acting as a globally consistent index for a particular scene point throughout the entire video [00:44:43].
  • Crucially, Omnimotion’s quasi-3D representation is not a physically accurate 3D reconstruction [00:44:51]. It relaxes the rigid constraints of dynamic multi-view geometry, focusing instead on relative depth ordering (e.g., foreground/background) rather than precise XYZ coordinates [01:17:22]. This can be conceptualized as a “2.5D prior” [01:47:45].
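
To make the invertibility concrete, below is a minimal sketch of a single Real NVP-style affine coupling layer conditioned on a per-frame latent code. The class name, dimensions, and conditioning interface are illustrative assumptions, not the authors’ code; Omnimotion stacks several such layers (with alternating coordinate splits) to form each bijection ‘Ti’.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real NVP-style affine coupling layer (illustrative sketch).

    The first `half` coordinates pass through unchanged and predict a
    scale/shift for the remaining coordinates, so the inverse exists in
    closed form. Conditioning on a per-frame latent code makes the
    resulting bijection frame-specific, as in Omnimotion's T_i.
    """

    def __init__(self, dim: int = 3, hidden: int = 128, cond_dim: int = 128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[..., :self.half], x[..., self.half:]
        s, t = self.net(torch.cat([x1, latent], dim=-1)).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)

    def inverse(self, y: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        y1, y2 = y[..., :self.half], y[..., self.half:]
        s, t = self.net(torch.cat([y1, latent], dim=-1)).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)
```

Because each layer can be evaluated exactly in both directions, a stack of them maps local coordinates to the canonical space and back again without training a separate inverse network.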

Tracking Mechanism

The process for computing the 2D motion of any query pixel (Pi) involves the following steps (sketched in code after the list):

  1. Lifting to 3D: The query pixel is lifted to a “3D” ray within the quasi-3D space [01:06:17].
  2. Sampling along Ray: Points are sampled along this ray (e.g., 32 samples per ray) [01:50:14].
  3. Mapping to Canonical Space: Each sampled point (x_i_k) is mapped to the canonical space (U) using the invertible neural network (M_theta), conditioned on a per-frame latent code derived from time [01:43:09], [01:11:46].
  4. Density and Color Query: A separate neural network (F_theta), similar to a NeRF model, queries the canonical space at point ‘U’ to obtain its density (sigma) and color (C) [00:49:00], [01:11:46].
  5. Mapping to Target Frame: The point is then mapped to a target frame (J) using the inverse of the bijective mapping [01:13:54].
  6. Alpha Compositing: All samples along the ray are aggregated using alpha compositing (similar to NeRF volume rendering) to produce a single 2D correspondence in the target frame (P_hat_J) [01:15:18].
  7. Occlusion Handling: The representation inherently handles occlusions by retaining information about all scene points projected onto each pixel, along with their relative depth ordering [01:17:52].
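
The following sketch strings these steps together for one query pixel. The function signature, the network interfaces (M with forward/inverse, F returning density and color), and the [0, 1] depth range are assumptions for illustration, not the authors’ implementation.

```python
import torch

def track_point(p_i, i, j, M, F, latents, n_samples=32):
    """Sketch of the per-pixel tracking step (interfaces assumed, not the authors' code).

    p_i     : (2,) float query pixel in frame i (fixed orthographic camera).
    M       : invertible mapping network; M.forward(x, z) maps local -> canonical,
              M.inverse(u, z) maps canonical -> local, conditioned on latent z.
    F       : canonical network; F(u) returns (density sigma, color c), NeRF-like.
    latents : per-frame latent codes indexed by frame.
    """
    # 1-2. Lift the pixel to a ray in frame i's local volume and sample K depths.
    depths = torch.linspace(0.0, 1.0, n_samples).unsqueeze(-1)        # (K, 1)
    x_ik = torch.cat([p_i.expand(n_samples, 2), depths], dim=-1)      # (K, 3)

    # 3. Map each sample into the time-independent canonical space U.
    u_k = M.forward(x_ik, latents[i].expand(n_samples, -1))           # (K, 3)

    # 4. Query density at the canonical points (color is also available).
    sigma_k, _ = F(u_k)                                               # (K,)

    # 5. Map the same canonical points into target frame j's local volume.
    x_jk = M.inverse(u_k, latents[j].expand(n_samples, -1))           # (K, 3)

    # 6. NeRF-style alpha compositing along the ray; with an orthographic camera,
    #    projection to 2D is simply dropping the depth coordinate.
    alpha = 1.0 - torch.exp(-sigma_k)                                 # (K,)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = (alpha * trans).unsqueeze(-1)                           # (K, 1)
    p_hat_j = (weights * x_jk[:, :2]).sum(dim=0)                      # (2,) correspondence
    return p_hat_j
```

Because every sample along the ray contributes according to its density and transmittance, points that pass behind an occluder retain a consistent canonical identity and depth ordering, which is what enables tracking through occlusions.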

Per-Video Optimization

Omnimotion employs a “test-time optimization” approach, meaning the model is optimized per video [00:16:43].

  • Initialization: It takes a collection of frames and noisy correspondence predictions (pseudo-labels) as guidance [00:40:00]. This initialization is typically obtained from existing correspondence methods such as RAFT (optical flow) or TAP-Net (point tracking) [01:24:42]. The paper acknowledges this as a “refining process” rather than starting from scratch [02:16:57].
  • Loss Functions: The optimization process minimizes a combination of losses (a rough sketch of the combined objective appears after this list):
    • Flow Loss (L_flow): Mean Absolute Error (MAE) between the predicted flow and the supervised input flow [01:28:44].
    • Photometric Loss (L_photometric): Mean Squared Error (MSE) between the predicted color and the observed color [01:31:23].
    • Regularization Term (L_reg): Penalizes large accelerations to ensure temporal smoothness of the 3D motion [01:32:13].
    • An additional auxiliary gradient loss (L_P_grad) is also used [02:13:05].
  • Hard Mining: To address imbalances in the initial correspondence data (where rigid backgrounds have more reliable points than fast-moving foreground objects), Omnimotion employs a “hard mining” strategy. This involves periodically computing error maps and sampling more frequently from regions with higher prediction errors [01:38:05].
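
A rough sketch of how these terms might be combined is shown below. The weighting coefficients and the norm used for the acceleration term are placeholders rather than the paper’s values, and the auxiliary gradient loss is omitted.

```python
import torch
import torch.nn.functional as nnf

def omnimotion_loss(pred_flow, sup_flow, pred_rgb, obs_rgb,
                    x_prev, x_curr, x_next, w_pho=1.0, w_reg=1.0):
    """Sketch of the per-video objective; weights and the acceleration norm are placeholders."""
    # L_flow: mean absolute error against the noisy supervising flow (e.g., RAFT output).
    l_flow = nnf.l1_loss(pred_flow, sup_flow)
    # L_photometric: mean squared error against the observed pixel colors.
    l_photo = nnf.mse_loss(pred_rgb, obs_rgb)
    # L_reg: penalize large accelerations of the mapped 3D points across adjacent frames
    # (finite-difference second derivative).
    accel = x_next - 2.0 * x_curr + x_prev
    l_reg = accel.norm(dim=-1).mean()
    return l_flow + w_pho * l_photo + w_reg * l_reg
```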

Implementation Details

  • Network Architecture:
    • The mapping network (M_theta) comprises six affine coupling layers [01:28:00]. It uses positional encoding with four frequencies for pixel coordinates [01:41:36]. A small 2-layer MLP generates a 128-dimensional latent code for each frame, which conditions the mapping network [01:43:09].
    • The canonical representation network (F_theta) is a 3-layer MLP [01:43:40].
    • Both MLPs are implemented using a GaborNet architecture, where filters in the first layer are constrained to fit Gabor functions [01:45:07].
  • Training: Each video sequence is trained for 200,000 iterations using the Adam optimizer [01:49:46]. A batch consists of 256 pairs of correspondences sampled from eight pairs of images [01:50:11].
  • Camera Model: A fixed orthographic camera is assumed, as camera motion is subsumed by the local-canonical bijections [01:07:54]. (These settings are collected in the configuration sketch below.)
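
For reference, the stated hyperparameters can be gathered into a single configuration sketch; the key names are illustrative, not taken from the released code.

```python
# Hyperparameters stated in the paper/talk, gathered into one place.
OMNIMOTION_CONFIG = {
    "mapping_network_M": {
        "affine_coupling_layers": 6,
        "positional_encoding_freqs": 4,   # for pixel coordinates
        "per_frame_latent_dim": 128,      # produced by a small 2-layer MLP
    },
    "canonical_network_F": {
        "mlp_layers": 3,                  # GaborNet-style MLP
    },
    "training": {
        "iterations_per_video": 200_000,
        "optimizer": "Adam",
        "correspondence_pairs_per_batch": 256,
        "image_pairs_per_batch": 8,
        "samples_per_ray": 32,
    },
    "camera": "fixed orthographic",       # camera motion absorbed by the bijections
}
```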

Results and Evaluation

Omnimotion demonstrates strong qualitative results, showcasing impressive tracking through complex scenarios, including occlusions (e.g., a person passing behind posts) and camera zoom [02:00:21].

Quantitatively, the method is evaluated on benchmarks like TAP-Vid (which includes both real-world and synthetic videos) [01:51:13]. It reports the following metrics (position accuracy and TC are sketched in code after the list):

  • Position accuracy: Average position accuracy of visible points across five thresholds (within 1, 2, 4, 8, and 16 pixels) [01:54:12].
  • Occlusion accuracy (OA): Evaluates the correctness of visible/occluded predictions [01:55:49].
  • Temporal consistency (TC): Measures the L2 distance between the acceleration of ground truth and predicted tracks [01:56:32].
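
The sketch below shows one way to compute the position-accuracy and temporal-coherence metrics, assuming predicted and ground-truth tracks of shape (num_points, num_frames, 2) and a boolean visibility mask. It is an interpretation of the metric definitions, not the benchmark’s reference implementation.

```python
import numpy as np

def position_accuracy(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible points within each pixel threshold, averaged over thresholds.

    pred, gt : (N, T, 2) predicted / ground-truth tracks in pixels.
    visible  : (N, T) boolean mask of ground-truth visibility.
    """
    err = np.linalg.norm(pred - gt, axis=-1)[visible]   # errors at visible points only
    return float(np.mean([(err < t).mean() for t in thresholds]))

def temporal_coherence(pred, gt):
    """L2 distance between the accelerations of predicted and ground-truth tracks."""
    acc = lambda x: x[:, 2:] - 2.0 * x[:, 1:-1] + x[:, :-2]   # finite-difference acceleration
    return float(np.linalg.norm(acc(pred) - acc(gt), axis=-1).mean())
```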

Omnimotion achieves state-of-the-art performance on these metrics, outperforming prior methods like RAFT, PIPs, and Flow-Walk [02:03:07]. However, the quantitative improvements over existing initialization methods (like RAFT) are often marginal, suggesting that Omnimotion acts more as a refinement tool [02:04:35].

Ablation Studies

  • No Invertible: Replacing the invertible mapping network with separate forward and backward networks severely degrades tracking performance, highlighting the importance of the invertible property [02:06:27].
  • No Photometric: Removing the photometric loss does not significantly impact results [02:06:58].
  • Uniform Sampling: Using uniform sampling instead of hard mining also has minimal impact [02:07:34].

Limitations and Conclusion

While innovative, Omnimotion has several limitations:

  • Computational Expense: It is computationally intensive, requiring per-video optimization and initial pairwise flow computations [02:11:31].
  • Non-Rigid Motion: The method struggles with rapid and highly non-rigid motion, potentially due to the acceleration regularization term [02:09:39].
  • Optimization Complexity: The optimization problem is highly non-convex, meaning the training process can get trapped in sub-optimal solutions [02:11:00].
  • Reliance on Initialization: Its performance is dependent on the quality of the initial noisy correspondence predictions [02:16:16].
  • Video Length: The use of a single canonical space to represent the entire video limits its scalability to very long videos, as the representation’s capacity may be exhausted [01:53:56].

Despite these complexities and its role as a refinement technique, Omnimotion presents a unique approach to dense, long-range motion estimation through its quasi-3D representation and invertible neural network mappings. Its qualitative results are very impressive, demonstrating effective tracking through occlusions and varying camera dynamics [02:17:51]. The concept of a “2.5D NeRF,” where depth represents foreground/background ordering rather than physical distance, is particularly noteworthy and could inspire future work in generative 3D and video diffusion models [01:17:47], [02:18:41].