From: hu-po
Introduction to 3D Scene Representation [01:11:49]
The field of 3D scene representation and simulation aims to create novel views of scenes captured from multiple photos or videos [00:03:12]. This means generating new images of a scene from viewpoints different from those of the original captures [00:03:51]. The representation chosen for a 3D scene significantly impacts both the efficiency and the quality of novel view synthesis [00:00:57].
Traditional 3D scene representations include meshes and point clouds, which are explicit and well suited to fast GPU- and CUDA-based rasterization [01:12:25]. However, these methods can struggle with unreconstructed regions or “inexistent geometry” when multi-view stereo (MVS) generates incorrect data [01:34:01].
Neural Radiance Fields (NeRFs) and Their Optimization Challenges [01:48:51]
Neural Radiance Fields (NeRFs) are a recent advancement in 3D scene representation, building on continuous scene representations [01:48:51]. A NeRF defines a 3D volume in which each point in space has a color and an opacity (density), typically learned by training a multi-layer perceptron (MLP) [01:00:59].
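To make this concrete, here is a minimal sketch of a NeRF-style MLP in PyTorch that maps a 3D point and a viewing direction to a color and a density. It is an illustrative simplification, not the paper's architecture (no positional encoding, far fewer layers), and all names and sizes here are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP: 3D point + view direction -> RGB color and density."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(           # view direction conditions the color
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        h = self.trunk(xyz)                               # features of the 3D point
        sigma = F.softplus(self.density_head(h))          # non-negative density (opacity)
        rgb = self.color_head(torch.cat([h, view_dir], dim=-1))  # view-dependent color
        return rgb, sigma
```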
Limitations of NeRFs [00:04:16]
Despite having revolutionized novel view synthesis [00:09:09], NeRFs face several challenges:
- Computational Cost: Achieving high visual quality requires neural networks that are costly to train and render [00:04:16]. Training a NeRF for a single scene can take up to 48 hours [00:52:01].
- Scene Specificity: A neural network must be trained for every single scene or object, and even for specific lighting conditions or object arrangements [00:04:40].
- Real-time Rendering: No current NeRF-based method can achieve real-time display rates at 1080p resolution [00:05:50].
- Stochastic Sampling: Rendering with volumetric ray marching requires a large number of stochastic samples per ray, leading to high cost and potential noise (see the quadrature after this list) [01:06:23].
- Memory Usage: While the MLPs themselves are relatively small (e.g., 8-13 MB), the rendering process can still be memory-intensive [02:02:50].
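For reference, the volume-rendering quadrature behind that sampling cost is the standard NeRF formulation (reproduced here for context, not taken from the talk):

$$
C = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) c_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)
$$

where $\sigma_i$ and $c_i$ are the density and color of the $i$-th sample along a ray and $\delta_i$ is the spacing between samples. A low-noise estimate of the pixel color $C$ requires many samples $N$ per ray, for every pixel of every view, which is what makes ray marching expensive.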
Various follow-up methods, such as Instant NGP and Plenoxels, have focused on faster training and rendering by exploiting spatial data structures, different encodings, and modified MLP capacities, or by interpolating values stored in voxel or hash grids [01:27:24].
3D Gaussian Splatting: A Novel Approach to Optimization [01:19:12]
A new method, 3D Gaussian Splatting, proposes three key elements to achieve state-of-the-art visual quality with competitive training times and real-time novel view synthesis [00:06:14]:
- 3D Gaussian Scene Representation: The scene is represented as a set of 3D Gaussians, preserving desirable properties of continuous volumetric radiance fields while avoiding unnecessary computation in empty space [00:06:48]. These Gaussians are differentiable and can be easily projected to 2D splats for rendering [01:04:07]. Each 3D Gaussian is defined by a 3D position (mean), an opacity (alpha), and an anisotropic covariance matrix, with view-dependent color represented by spherical harmonic coefficients [01:00:59].
- Interleaved Optimization and Adaptive Density Control: The properties of the 3D Gaussians are optimized using gradient descent, interleaved with steps that control their density [00:07:43]. This allows the method to start from a sparse point cloud (produced by Structure from Motion) and adaptively grow or shrink the number of Gaussians to accurately represent the scene [01:31:27].
- Covariance Optimization: Instead of directly optimizing the covariance matrix (which must remain positive semi-definite), the method optimizes a scaling vector and a rotation quaternion, which are combined into a valid covariance matrix (see the first sketch after this list) [01:12:31].
- Density Control: Gaussians are adaptively added and removed. Nearly transparent Gaussians are periodically pruned [01:31:58]. New Gaussians are added by cloning existing ones (especially in under-reconstructed areas, where gradients suggest large positional changes) or by splitting large Gaussians into smaller ones (see the second sketch after this list) [01:32:54].
- Fast Visibility-Aware Rendering Algorithm: A tile-based rasterizer supports anisotropic splatting and accelerates both training and real-time rendering [00:08:28]. The screen is split into 16x16-pixel tiles [01:41:05]. Gaussians are pre-sorted for the entire image using a fast GPU radix sort, avoiding per-pixel sorting costs [01:40:07]. This enables approximate alpha blending that respects visibility order (the per-pixel blending step is the third sketch after this list) [01:40:32].
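First, the covariance parameterization. A minimal PyTorch sketch with illustrative function names (not the authors' code): building Σ = R S Sᵀ Rᵀ from a scaling vector and a quaternion keeps the covariance positive semi-definite no matter what values the optimizer writes into `scale` and `q`.

```python
import torch

def quaternion_to_rotation(q: torch.Tensor) -> torch.Tensor:
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / q.norm()  # normalize so the quaternion encodes a pure rotation
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

def covariance(scale: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Sigma = R S S^T R^T: positive semi-definite by construction, so the
    optimizer can update (scale, q) freely without producing an invalid matrix."""
    M = quaternion_to_rotation(q) @ torch.diag(scale)
    return M @ M.T
```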
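Second, the adaptive density control. The sketch below compresses one prune/clone/split step into plain PyTorch; the thresholds, the scale divisor, and the bookkeeping are simplified assumptions (for instance, a real split also resamples positions and replaces the parent), but the decision rule matches the description above.

```python
import torch

def densify_and_prune(pos, scale, alpha, pos_grad,
                      grad_thresh=2e-4, alpha_min=0.005, size_thresh=0.01):
    """One simplified density-control step over N Gaussians (rows of each tensor)."""
    keep = alpha > alpha_min                      # prune nearly transparent Gaussians
    pos, scale, alpha, pos_grad = pos[keep], scale[keep], alpha[keep], pos_grad[keep]

    hot = pos_grad.norm(dim=-1) > grad_thresh     # large positional gradients
    small = scale.max(dim=-1).values <= size_thresh
    clone = hot & small                           # under-reconstructed: duplicate
    split = hot & ~small                          # over-reconstructed: add smaller copies

    pos = torch.cat([pos, pos[clone], pos[split]])
    scale = torch.cat([scale, scale[clone], scale[split] / 1.6])  # shrink split Gaussians
    alpha = torch.cat([alpha, alpha[clone], alpha[split]])
    return pos, scale, alpha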
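Third, the rendering. Per pixel, the tile-based rasterizer reduces to front-to-back alpha compositing over Gaussians already depth-sorted per tile by the GPU radix sort. The loop below is a plain-Python stand-in for what the custom CUDA kernel does:

```python
import torch

def composite_pixel(colors: torch.Tensor, alphas: torch.Tensor,
                    t_min: float = 1e-4) -> torch.Tensor:
    """Front-to-back alpha blending for one pixel.
    `colors` (N, 3) and `alphas` (N,) are the Gaussians overlapping this pixel,
    already depth-sorted (once per tile in the real rasterizer)."""
    pixel = torch.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < t_min:   # early termination once the pixel is saturated
            break
    return pixel
```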
Challenges of Optimization in Discrete and Continuous Spaces [00:08:10]
The optimization process in 3D Gaussian Splatting presents its own set of challenges:
- Complicated Optimization Process: The interleaved optimization, with its different loss functions and alternating steps, can be slower and harder to implement and understand [00:08:10].
- Dynamic Number of Gaussians: The ability to “create and destroy” Gaussians during optimization makes the process more complex than optimizing a fixed set [01:22:37].
- Hyperparameter Tuning: The method relies on several hard-coded hyperparameters, such as the positional-gradient threshold, the transparency threshold for pruning, and the scaling factor for splitting, all determined experimentally [01:34:09]. This suggests a potential need for scene-specific tuning [01:52:54].
- Floaters and Popping Artifacts: The optimization can get stuck with “floaters” (Gaussians hovering close to the camera) [01:36:01]. “Popping” artifacts can also occur, where large Gaussians suddenly appear or disappear as the view changes slightly, partly due to trivial rejection of Gaussians via a “guard band” and simple visibility-ordering issues [02:21:18].
Performance and Evaluation of 3D Generative Techniques [02:00:53]
3D Gaussian Splatting demonstrates significant improvements over existing methods:
- Training Time: It achieves state-of-the-art quality with training times as low as 41-51 minutes [02:01:06], compared to 48 hours for some NeRFs [02:10:50].
- Rendering Speed: The method enables real-time novel view synthesis (30+ FPS) [01:31:08], with rendering times dramatically faster than NeRFs, which can take 10 seconds per frame [02:10:54].
- Visual Quality: It achieves visual quality comparable or superior to methods like Mip-NeRF 360, particularly in rendering fine structures (e.g., bicycle spokes) and avoiding blurring or “fuzzy” artifacts [01:59:01].
Technical Insights: Balancing Complexity and Efficiency in 3D Modeling [01:14:14]
Several key technical decisions contribute to this balance:
- Differentiable Rasterizer: The rasterizer is fully differentiable, which is crucial for backpropagating gradients from the image-reconstruction loss to the 3D Gaussian parameters [01:29:40].
- Custom CUDA Kernels: Implementing bottlenecks, such as the fast sorting algorithm, as custom CUDA kernels on the GPU is critical for efficiency [01:23:44].
- Anisotropic Covariance: Allowing Gaussians to have anisotropic (directionally dependent) covariance matrices significantly improves their ability to align with surfaces and represent complex shapes [01:51:52], yielding a more compact representation in terms of modeling capability per Gaussian.
- Spherical Harmonics: Using spherical harmonic coefficients for directional appearance (color) captures view-dependent effects and improves overall quality (a minimal evaluation sketch follows this list) [01:01:01].
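To illustrate, the sketch below evaluates a view-dependent color from degree-0 and degree-1 spherical-harmonic coefficients (the method goes up to degree 3, i.e. 16 coefficients per channel). The basis constants are standard; the coefficient layout and the final 0.5 shift follow common open-source implementations and should be read as assumptions:

```python
import torch

# Real spherical-harmonic basis constants for degrees 0 and 1.
SH_C0 = 0.28209479177387814   # 0.5 * sqrt(1/pi)
SH_C1 = 0.4886025119029199    # sqrt(3/(4*pi))

def sh_to_color(sh_coeffs: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
    """Evaluate view-dependent RGB from SH coefficients.
    `sh_coeffs` has shape (4, 3): one degree-0 and three degree-1 terms per channel."""
    x, y, z = view_dir / view_dir.norm()
    color = (SH_C0 * sh_coeffs[0]
             - SH_C1 * y * sh_coeffs[1]
             + SH_C1 * z * sh_coeffs[2]
             - SH_C1 * x * sh_coeffs[3])
    return torch.clamp(color + 0.5, 0.0, 1.0)  # shift to roughly [0, 1], as in common implementations
```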
Memory Consumption: A Significant Challenge [02:37:39]
Despite the performance gains, a major challenge for 3D Gaussian Splatting is its memory footprint:
- Explicit Representation: Unlike implicit representations such as NeRFs (which store only MLP weights, typically 8-13 MB) [02:02:50], 3D Gaussian Splatting is an explicit representation: it stores the position, covariance, opacity, and color of hundreds of thousands of Gaussians per scene [02:36:09].
- High Memory Usage: This leads to a single scene representation consuming hundreds of megabytes (e.g., 734 MB) [02:01:22], with peak GPU memory consumption exceeding 20 GB while training large scenes [02:23:45]. A back-of-the-envelope estimate follows this list.
- Opportunities for Reduction: Future work could explore compression techniques for point clouds to reduce memory consumption [02:24:29].
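To see why an explicit representation lands in the hundreds of megabytes, here is a rough estimate assuming the attribute layout described above, stored as float32; the per-Gaussian float count and the scene size are illustrative assumptions, not reported numbers:

```python
# Rough memory estimate for an explicit Gaussian scene (all figures assumed).
# Per Gaussian: 3 position + 3 scale + 4 rotation quaternion + 1 opacity
# + 48 spherical-harmonic color values (16 coefficients x 3 channels) = 59 floats.
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 48
BYTES_PER_GAUSSIAN = FLOATS_PER_GAUSSIAN * 4   # float32

num_gaussians = 3_000_000                      # plausible count for a large scene
print(f"{num_gaussians * BYTES_PER_GAUSSIAN / 1e6:.0f} MB")  # ~708 MB, same order as the 734 MB cited
```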
Limitations and Assumptions of Dynamic 3D Modeling [02:44:27]
A limitation shared with existing NeRF-based approaches is that 3D Gaussian Splatting is designed for static scenes [00:58:48]. Representing and rendering dynamic scenes (scenes with a time component, like videos) remains a significant future challenge for 3D scene representation [02:44:27]. Current techniques also rely on camera poses calibrated via Structure from Motion (SfM), which introduces noise and inherent limitations [00:09:09].
Conclusion [02:25:20]
3D Gaussian Splatting represents a significant step towards real-time, high-quality radiance field rendering. By leveraging an explicit 3D Gaussian primitive representation, interleaved optimization, and an efficient tile-based rasterizer, it overcomes many performance limitations of previous methods. However, challenges related to memory consumption and the current focus on static scenes provide fertile ground for future research in 3D generation.