From: hu-po

Patch-based depth estimation techniques process an image in smaller sections, or “patches,” to achieve high-resolution depth predictions. The approach addresses the challenge of handling the high-resolution images produced by modern consumer cameras and devices [01:13:16].

PatchFusion: An End-to-End Tile-Based Framework

One such technique is PatchFusion, presented in the paper “PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation,” published on December 4, 2023, by researchers at King Abdullah University of Science and Technology [01:13:52].

Core Idea

The fundamental concept behind PatchFusion is to take a high-resolution input image, such as a 4K render from a game engine, and divide it into multiple overlapping patches [01:14:19]. A monocular depth estimation model then predicts depth for each patch independently [01:14:28], allowing very fine detail to be captured within each patch [01:14:56].
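As a rough illustration of this tiling step, here is a minimal Python sketch; the patch size, overlap, and depth_model below are placeholder choices, not values from the paper.

    import numpy as np

    def split_into_patches(image: np.ndarray, patch: int = 512, overlap: int = 128):
        """Yield (location, crop) pairs of overlapping square patches.

        Patch size and overlap are illustrative values, not the paper's.
        For simplicity this assumes the image dimensions line up with the
        stride; a real implementation would pad or shift edge patches.
        """
        stride = patch - overlap
        h, w = image.shape[:2]
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                yield (y, x), image[y:y + patch, x:x + patch]

    # Usage sketch (depth_model and image are stand-ins, not real APIs):
    # depths = {loc: depth_model(crop) for loc, crop in split_into_patches(image)}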

Challenges and Solutions

A significant challenge with this patch-wise approach is the lack of global information, which leads to inconsistent depth predictions, particularly at the boundaries where patches meet [01:15:08]. The paper addresses this by:

  • Overlapping adjacent patches [01:15:22].
  • Hardcoding the patch locations and the amount of overlap [01:15:38].
  • Employing “consistency-aware training” so that overlapping regions predict consistent depth values, effectively stitching the patches together (see the sketch below) [01:15:17].

This method achieves highly detailed results, as it effectively processes the image as many smaller, individual images [01:16:13].
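One way to picture the consistency-aware training is as an extra loss term that penalizes disagreement between two patch predictions wherever they cover the same pixels. The sketch below assumes axis-aligned square patches and uses a generic L1 penalty; it illustrates the idea, not the paper’s actual loss.

    import torch.nn.functional as F

    def overlap_consistency(depth_a, loc_a, depth_b, loc_b, patch=512):
        """L1 disagreement between two patch depth predictions over the
        image region they share. loc_* are (y, x) offsets of each patch
        in the full image; depth_* are tensors shaped (..., patch, patch).
        The square-patch geometry and L1 penalty are simplifying
        assumptions of this sketch."""
        (ya, xa), (yb, xb) = loc_a, loc_b
        y0, y1 = max(ya, yb), min(ya, yb) + patch
        x0, x1 = max(xa, xb), min(xa, xb) + patch
        if y0 >= y1 or x0 >= x1:          # patches do not overlap
            return depth_a.new_zeros(())
        crop_a = depth_a[..., y0 - ya:y1 - ya, x0 - xa:x1 - xa]
        crop_b = depth_b[..., y0 - yb:y1 - yb, x0 - xb:x1 - xb]
        return F.l1_loss(crop_a, crop_b)

During training, a term like this would be added to the usual per-patch supervised depth loss.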

Training Data and Architecture

PatchFusion is trained primarily on synthetic datasets such as UnrealStereo4K and MVS-Synth, which provide perfect ground-truth depth because they are rendered by game engines [01:19:38].

However, the architecture and training approach used in PatchFusion have been criticized:

  • It trains models from scratch [01:17:28].
  • It utilizes older convolutional neural networks (CNNs) [01:17:51].
  • It employs three separate networks: a coarse network for the full image, a fine network for individual patches, and a guided fusion network to merge their outputs (sketched below) [01:17:31].
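The wiring of those three networks can be sketched as follows; the module interfaces and the cropping logic are illustrative stand-ins, not the paper’s actual architecture.

    import torch.nn as nn

    class TileBasedDepth(nn.Module):
        """Illustrative wiring of the three networks described above:
        a coarse net for global context, a fine net for per-patch detail,
        and a fusion net that merges the two. All three are passed in as
        generic modules; this is not the paper's actual code."""

        def __init__(self, coarse_net, fine_net, fusion_net):
            super().__init__()
            self.coarse_net = coarse_net    # full (downsampled) image
            self.fine_net = fine_net        # single high-res patch
            self.fusion_net = fusion_net    # guided fusion of both

        def forward(self, full_image, patch, loc, size):
            y, x = loc
            coarse_depth = self.coarse_net(full_image)   # (B, 1, H, W)
            coarse_crop = coarse_depth[..., y:y + size, x:x + size]
            fine_depth = self.fine_net(patch)
            return self.fusion_net(coarse_crop, fine_depth)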

Some consider this approach to be “cheating” because it relies on processing smaller, easier-to-detail patches rather than on a holistic understanding of the scene [01:16:16].

Comparison with Holistic Depth Estimation

In contrast to PatchFusion, the Marigold paper (also from December 2023) approaches monocular depth estimation with diffusion models: it takes the entire image, compresses it into a latent space, and denoises within that latent space to infer depth [01:16:23].
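In pseudocode, that latent-space pipeline looks roughly like this; the encoder, decoder, and denoiser are generic placeholders rather than Marigold’s real API.

    import torch

    @torch.no_grad()
    def latent_depth_inference(image, encode, decode, denoiser, steps=50):
        """Generic latent-diffusion depth inference in the spirit of
        Marigold: encode the image into a latent, iteratively denoise a
        random depth latent conditioned on it, then decode the result.
        All callables here are placeholders, not Marigold's actual API."""
        image_latent = encode(image)
        depth_latent = torch.randn_like(image_latent)   # start from noise
        for t in reversed(range(steps)):
            # each step removes some noise, conditioned on the image latent
            depth_latent = denoiser(depth_latent, image_latent, t)
        return decode(depth_latent)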

Combining patch-based processing with diffusion-based models like Marigold is seen as a promising direction for future research, potentially yielding state-of-the-art monocular depth estimation by leveraging the strengths of both approaches [01:18:50].

Implications for Depth Sensing Hardware

Advances in monocular depth estimation have significant implications for the market for specialized depth sensors such as lidar and structured-light sensors. These traditional sensors often suffer from inherent issues:

  • Missing depth values due to physical constraints of the capture rig or sensor properties [01:47:20].
  • Noise and artifacts caused by reflective surfaces or shadowing effects [01:48:45].
  • Sensitivity to material properties of the scene [01:48:57].

As AI models become capable of producing depth maps of higher quality than those from physical sensors, the need for expensive and cumbersome depth sensors may diminish [01:07:06]. A simple cell-phone camera combined with these models could provide sufficient depth information, making dedicated depth sensors obsolete in many applications such as autonomous driving and robotics [01:07:56]. While inference speed remains a limitation for real-time applications like autonomous vehicles [01:10:42], rapid progress on faster diffusion sampling, such as latent consistency models (LCMs), suggests this hurdle may soon be overcome [01:42:41].