From: hu-po

In computer vision, particularly in tasks like novel view synthesis, understanding and estimating camera parameters is crucial. These parameters are broadly categorized into camera intrinsics and extrinsics [00:05:24].

Camera Intrinsics

Camera intrinsics refer to information inherent to the camera itself [00:05:31]. This includes properties such as the focal length, principal point, and lens distortion. A camera's specific characteristics, such as a fisheye effect or a large lens, will influence its intrinsic parameters [00:05:42].
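
As a concrete reference, here is a minimal sketch of the standard pinhole intrinsics matrix K and how it maps a camera-space 3D point to pixel coordinates; the numeric values are illustrative, not taken from the video.

```python
import numpy as np

# Pinhole intrinsics: fx, fy = focal lengths in pixels; cx, cy = principal point.
fx, fy = 800.0, 800.0
cx, cy = 320.0, 240.0
K = np.array([
    [fx, 0.0, cx],
    [0.0, fy, cy],
    [0.0, 0.0, 1.0],
])

# Project a 3D point already expressed in camera coordinates.
X_cam = np.array([0.1, -0.2, 2.0])  # (x, y, z), z > 0 in front of the camera
u, v, w = K @ X_cam
print((u / w, v / w))               # perspective divide -> pixel coordinates
```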

Camera Extrinsics

Camera extrinsics describe the camera’s position and orientation relative to the 3D scene [00:05:57]. This is the camera’s six-degree-of-freedom pose within the scene: three degrees for rotation and three for translation [00:06:39].
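
Continuing the sketch above, extrinsics can be written as a rotation R and a translation t that map world coordinates into camera coordinates; combined with the intrinsics K, this yields the full pinhole projection x ~ K(RX + t). The values are again illustrative.

```python
import numpy as np

# Extrinsics: rotation R and translation t mapping world -> camera coordinates.
# Together they form the camera's 6-DoF pose (3 rotation + 3 translation).
theta = np.deg2rad(10.0)            # illustrative yaw of 10 degrees
R = np.array([
    [ np.cos(theta), 0.0, np.sin(theta)],
    [ 0.0,           1.0, 0.0          ],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
t = np.array([0.0, 0.0, 4.0])       # camera placed 4 units from the world origin

# Full pinhole projection: x ~ K (R X_world + t), then the perspective divide.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
X_world = np.array([0.5, 0.2, 1.0])
u, v, w = K @ (R @ X_world + t)
print((u / w, v / w))               # pixel coordinates of the world point
```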

Challenges in Obtaining Parameters

Obtaining accurate camera intrinsics and extrinsics is often a significant challenge in computer vision pipelines [00:06:13].

  • Intrinsics: Obtaining them requires detailed information about the camera, its manufacturer, specific settings, and any attached lenses, making the process a “complicated mess” [00:06:21].
  • Extrinsics: The exact position and orientation of the camera for each picture is “basically impossible to know from a ground truth level” in real-world scenarios [00:06:49]. It typically has to be estimated, which introduces errors [00:07:06].

In simulated environments like Blender, exact camera intrinsics and extrinsics are known [00:08:55]. However, in the real world, this information is generally unavailable [00:14:35].

Traditional Approach: Structure from Motion (SfM)

Traditional Structure from Motion (SfM) pipelines, such as COLMAP, typically proceed through a sequence of stages: feature extraction, feature matching, camera pose estimation, triangulation, and bundle adjustment.

COLMAP, though a pinnacle of aggregating various techniques, is often unreliable in producing camera poses and registering all input images [00:24:08]. Because these pipelines break the problem into sub-problems (e.g., feature extraction, bundle adjustment), errors accumulate across stages and engineering complexity increases [00:35:03].
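
To make the sub-problem decomposition concrete, here is a minimal two-view skeleton of these stages using OpenCV. It is an illustrative stand-in, not COLMAP’s implementation; the image files are hypothetical and the intrinsics are assumed known.

```python
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical inputs
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Feature extraction.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Feature matching with Lowe's ratio test.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 3. Relative pose from the essential matrix (intrinsics K assumed known).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# A full pipeline would continue with triangulation and bundle adjustment;
# an error made at any stage propagates into every stage after it.
```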

Modern Approach: DUSt3R

DUSt3R presents a “radically novel approach” to dense, unconstrained stereo 3D reconstruction from uncalibrated and unposed cameras [00:36:19]. It addresses the challenges of camera intrinsics and extrinsics by:

  • No Prior Information: Operating without prior information about camera calibration or viewpoint poses [00:32:44].
  • Pre-trained Transformers: Leveraging pre-trained Transformer models (Vision Transformers) to encode images [00:33:00]. These models are trained on massive datasets of synthetic image pairs, enabling them to learn high-level semantic understanding of scenes and camera angles [00:53:52].
  • Joint Optimization: Solving multiple “minimal problems” simultaneously, allowing for internal collaboration between tasks that were traditionally separated [00:36:31].
  • Simple Regression Loss: Training with a straightforward Euclidean distance loss in 3D space, weighted by a confidence score, without enforcing explicit geometric constraints [00:56:55] (a sketch of this loss follows the list). This contrasts with traditional methods that rely heavily on human-engineered geometric biases [00:55:19].
  • Relative Poses: Fixing the first camera’s position as the origin and estimating all other camera poses relative to it [00:43:56].
  • Focal Length Estimation: Calculating per-camera focal lengths with the Weiszfeld algorithm [01:06:12], sketched after the loss example below.
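
Below is a sketch of a confidence-weighted regression loss of the kind described above. It is a paraphrase of the idea, not DUSt3R’s training code: the −α·log(conf) term is a common regularizer that keeps the network from driving confidences to zero to escape the loss.

```python
import torch

def confidence_weighted_loss(pred_pts, gt_pts, conf, alpha=0.2):
    """Confidence-weighted 3D regression loss (a paraphrase of the idea,
    not DUSt3R's exact code). pred_pts/gt_pts: (N, 3) point maps;
    conf: (N,) positive per-point confidences."""
    dist = torch.linalg.norm(pred_pts - gt_pts, dim=-1)  # per-point Euclidean error
    # Weight errors by confidence; the log term penalizes low confidence
    # so the network cannot zero out the loss by being unconfident everywhere.
    return (conf * dist - alpha * torch.log(conf)).mean()

# Toy usage with random tensors.
pred = torch.randn(100, 3, requires_grad=True)
gt = torch.randn(100, 3)
conf = torch.rand(100) + 0.1
confidence_weighted_loss(pred, gt, conf).backward()
```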

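The Weiszfeld algorithm is an iteratively reweighted least-squares scheme for robust L1-style estimation. The sketch below shows how it can recover a single focal length from a point map, assuming each pixel coordinate (taken relative to the principal point) should equal f·(x/z, y/z); the details are a reconstruction of the idea, not DUSt3R’s exact implementation.

```python
import numpy as np

def estimate_focal(pts_cam, pix, iters=10, eps=1e-8):
    """Weiszfeld-style focal estimation: find the f minimizing
    sum_i ||pix_i - f * (x_i/z_i, y_i/z_i)|| (a robust L1 objective).
    pts_cam: (N, 3) points in camera coordinates; pix: (N, 2) pixel
    coordinates relative to the principal point."""
    proj = pts_cam[:, :2] / pts_cam[:, 2:3]   # (x/z, y/z) per point
    f = 1.0                                   # initial guess
    for _ in range(iters):
        resid = np.linalg.norm(pix - f * proj, axis=1)
        w = 1.0 / np.maximum(resid, eps)      # Weiszfeld reweighting
        # Closed-form weighted least-squares update for the scalar f.
        f = np.sum(w * np.sum(pix * proj, axis=1)) / np.sum(w * np.sum(proj ** 2, axis=1))
    return f

# Toy check: synthesize pixels with a known focal and recover it.
rng = np.random.default_rng(0)
pts = rng.uniform([-1.0, -1.0, 2.0], [1.0, 1.0, 6.0], size=(500, 3))
pix = 700.0 * pts[:, :2] / pts[:, 2:3]
print(estimate_focal(pts, pix))               # ~700.0
```
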
DUSt3R is notably faster than COLMAP, completing tasks in minutes rather than hours [01:03:49], and often achieves state-of-the-art results across computer vision tasks such as monocular depth estimation, multi-view depth estimation, and pose estimation [01:01:36].

Impact on 3D Gaussian Splatting

The InstantSplat paper integrates DUSt3R’s capabilities directly into the 3D Gaussian Splatting pipeline [00:04:21].

  • Initialization: Instead of using COLMAP’s sparse point clouds for initialization, InstantSplat uses the globally aligned point maps generated by DUSt3R [01:09:32].
  • Streamlined Optimization: This approach minimizes the need for hand-engineered heuristics such as adaptive density control (densification, splitting, and opacity reset), which earlier 3D Gaussian Splatting methods required because of the sparsity of SfM data [01:09:49].
  • Concurrent Optimization: InstantSplat jointly optimizes both the 3D Gaussian attributes (position, color, opacity, etc.) and the camera parameters (extrinsics) [01:10:01]. A constraint is added to prevent optimized poses from deviating too much from DUSt3R’s initial estimates [01:10:08]. This concurrent optimization allows for better information exchange and simpler pipelines [01:10:20] (see the sketch after this list).
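
A minimal sketch of what this concurrent optimization can look like, assuming a differentiable renderer (replaced here by a dummy function so the example runs) and a quadratic penalty on pose deviation; all names, shapes, and weights are hypothetical:

```python
import torch

def render(positions, colors, pose_delta):
    # Dummy differentiable "renderer": stands in for a Gaussian rasterizer
    # and merely mixes the parameters so gradients reach Gaussians and poses.
    return colors.mean(0) + positions.mean() + pose_delta.sum()

positions = torch.randn(1000, 3, requires_grad=True)  # Gaussian centers (from point maps)
colors = torch.rand(1000, 3, requires_grad=True)      # remaining attributes elided
pose_delta = torch.zeros(4, 6, requires_grad=True)    # learnable 6-DoF pose corrections
targets = [torch.rand(3) for _ in range(4)]           # fake training views

opt = torch.optim.Adam([positions, colors, pose_delta], lr=1e-3)
lam = 0.1                                             # pose-deviation weight (assumed)
for step in range(100):
    photometric = sum((render(positions, colors, pose_delta[i]) - targets[i]).abs().mean()
                      for i in range(4))
    # Penalty keeping optimized poses close to the initial estimates:
    # pose_delta == 0 corresponds to DUSt3R's poses exactly.
    loss = photometric + lam * pose_delta.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```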

This integration makes the 3D Gaussian Splatting process significantly faster and more efficient, achieving scene reconstruction in less than a minute on a consumer GPU [00:10:27], without requiring prior knowledge of camera intrinsics or extrinsics [00:14:48].