From: hu-po

The field of 3D asset generation faces a significant challenge: determining an efficient and easy-to-use representation for 3D assets [08:03:00]. Unlike 2D images or audio, which have natural fixed-size tensor representations, 3D assets lack such an obvious format [06:31:00]. Researchers have explored representations such as Neural Radiance Fields (NeRFs) and Signed Distance Functions (SDFs) as powerful ways to represent and generate complex 3D shapes [07:13:00].

Neural Radiance Fields (NeRFs)

Neural Radiance Fields are a method for representing a 3D scene as an implicit function [18:32:00]. A NeRF is a neural network that, given a 3D spatial coordinate (X) and a viewing direction (D), outputs the color (C) and a non-negative density value (Sigma) for that point [18:42:00]. The color output is dependent on the view direction, while the density is not [21:26:00].
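A rough sketch of this interface in PyTorch is shown below (the layer sizes and architecture are illustrative, not the exact network from the NeRF paper; real NeRFs also apply a positional encoding to the inputs):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Illustrative NeRF-style MLP: (position x, view direction d) -> (color c, density sigma)."""
    def __init__(self, hidden=128):
        super().__init__()
        # The trunk sees only the position, so density is view-independent.
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        # The color head also sees the view direction, so color is view-dependent.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, d):
        h = self.trunk(x)
        sigma = torch.relu(self.sigma_head(h))                      # non-negative density
        c = torch.sigmoid(self.color_head(torch.cat([h, d], -1)))   # RGB in [0, 1]
        return c, sigma
```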

Rendering Process

To render a novel view of a scene with a NeRF, the viewport is treated as a grid of rays: each pixel is assigned a ray extending from the camera origin through that pixel [21:34:00]. The NeRF is queried at multiple points along each ray to obtain color and density values, which are then integrated via a process called ray marching to approximate the final RGB color of the pixel [22:42:00].
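The integral is usually approximated with the standard volume rendering quadrature: each sample contributes its color weighted by its opacity and by the transmittance accumulated in front of it. A minimal numpy sketch, assuming densities and colors have already been queried at sample distances `ts` along one ray:

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """Approximate the volume rendering integral for one ray.

    sigmas: (N,)   non-negative densities at the N sample points
    colors: (N, 3) RGB values at the N sample points
    ts:     (N,)   distances of the samples along the ray
    """
    deltas = np.diff(ts, append=ts[-1] + 1e10)      # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)         # opacity of each segment
    # Transmittance T_i: how much light survives everything in front of sample i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)  # final pixel RGB
```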

The density value (Sigma) along a ray indicates how “see-through” a point is; higher density means less transparency, and density typically spikes where the ray hits an object’s surface [26:03:00]. Two-stage rendering (a coarse pass followed by a fine pass) is used to sample points along the ray efficiently, concentrating the fine samples in regions the coarse pass found to be dense [25:34:00].
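The fine pass can be implemented by treating the coarse-pass weights as a piecewise-constant distribution along the ray and drawing additional samples by inverting its CDF. A simplified sketch (the function name and simplifications are ours; the NeRF paper’s version also interpolates within bins):

```python
import numpy as np

def sample_fine(ts, weights, n_fine, seed=0):
    """Draw extra sample distances where the coarse pass found high density.

    ts:      (N,) coarse sample distances along the ray
    weights: (N,) coarse rendering weights (spike near surfaces)
    """
    rng = np.random.default_rng(seed)
    pdf = weights / (weights.sum() + 1e-8)        # normalize weights into a PDF
    cdf = np.cumsum(pdf)
    u = rng.random(n_fine)                        # uniform samples in [0, 1)
    idx = np.searchsorted(cdf, u)                 # inverse-transform sampling
    return np.sort(ts[np.clip(idx, 0, len(ts) - 1)])
```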

Advantages and Challenges

  • Resolution Independence: NeRFs are resolution-independent, meaning they can be queried at arbitrary input points rather than encoding information in a fixed grid [09:01:00].
  • Differentiability: They are end-to-end differentiable, making them suitable for various downstream tasks like style transfer or differentiable shape editing [09:14:00].
  • Performance: A notable challenge is rendering speed; because the network must be queried at many points along every ray, NeRFs are often very slow to render [07:46:00].
  • Scene Specificity: Traditionally, a new NeRF must be trained for every single 3D scene, which makes NeRFs cumbersome to use [07:49:00]. Researchers aim to develop generalizable models that can generate many scenes from a single model [01:02:26].

Signed Distance Functions (SDFs)

Signed Distance Functions (SDFs) are another classic way to represent a 3D shape as a scalar field [29:35:00]. An SDF maps a 3D coordinate (X) to a scalar value (D), which represents the shortest distance from X to the nearest point on the surface of the shape [29:42:00].

Definition and Properties

The “signed” aspect of an SDF is crucial:

  • If D is negative, the point is inside the shape [30:26:00].
  • If D is positive, the point is outside the shape [30:26:00].
  • If D is zero, the point lies exactly on the surface of the shape, defining its boundary [30:36:00].
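The canonical worked example is the SDF of a sphere of radius r centered at c, D(X) = ||X − c|| − r, which exhibits all three cases:

```python
import numpy as np

def sphere_sdf(x, center=np.zeros(3), radius=1.0):
    """Signed distance from point(s) x to a sphere's surface."""
    return np.linalg.norm(x - center, axis=-1) - radius

print(sphere_sdf(np.array([0.0, 0.0, 0.0])))   # -1.0 -> inside the sphere
print(sphere_sdf(np.array([2.0, 0.0, 0.0])))   #  1.0 -> outside the sphere
print(sphere_sdf(np.array([1.0, 0.0, 0.0])))   #  0.0 -> on the surface
```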

Methods like marching cubes or marching tetrahedra can be used to construct 3D meshes from an SDF’s zero-level set (where D=0) [31:10:00]. These techniques allow converting the implicit SDF representation into an explicit mesh format commonly used in game engines and CGI [08:15:00].
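As an illustration, scikit-image ships a marching cubes implementation that can extract this zero-level set from an SDF sampled on a regular grid (the grid resolution and sphere SDF below are just for demonstration):

```python
import numpy as np
from skimage import measure

# Sample the sphere SDF on a regular 64^3 grid over [-1.5, 1.5]^3.
lin = np.linspace(-1.5, 1.5, 64)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)
sdf = np.linalg.norm(grid, axis=-1) - 1.0

# Extract a triangle mesh at the zero-level set (where the SDF crosses 0).
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)   # mesh vertices and triangle indices
```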

Signed Texture Fields (STFs) - A Hybrid Approach

New models, such as Shap-E, introduce a variant called Signed Texture Fields (STFs) [28:38:00]. An STF is an implicit function that produces both a signed distance value and texture colors for a given point [28:47:00]. While traditional SDFs only output distance, STFs extend this by also providing RGB color information [35:18:00]. This allows a single asset to be rendered both as a textured mesh and as a neural radiance field [03:38:00].
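In interface terms, an STF changes only the output signature of the implicit function: a query now returns a signed distance alongside an RGB color. A minimal PyTorch sketch (the architecture is purely illustrative, not Shap-E’s actual network):

```python
import torch
import torch.nn as nn

class TinySTF(nn.Module):
    """Illustrative STF-style MLP: position -> (signed distance, RGB color)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.dist_head = nn.Linear(hidden, 1)     # signed distance (any real value)
        self.color_head = nn.Linear(hidden, 3)    # texture color

    def forward(self, x):
        h = self.trunk(x)
        return self.dist_head(h), torch.sigmoid(self.color_head(h))
```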

Context in Generative Models

Generative models like OpenAI’s Shap-E leverage these implicit functions for 3D asset generation [02:46:00]. Shap-E, a follow-up to Point-E, generates conditional 3D implicit functions from text prompts [00:33:00]. Unlike earlier approaches that produce a single output representation (e.g., point clouds), Shap-E directly generates the parameters of implicit functions [02:41:00].

Shap-E trains an encoder that maps 3D assets into the parameters of an implicit function, then uses a conditional diffusion model to generate those parameters [04:44:00]. Concretely, the encoder outputs the weights of a multi-layer perceptron (MLP), which then acts as both a NeRF and an STF [01:10:02]. This allows the same asset to be rendered in multiple ways and imported into various 3D applications [01:16:17].
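The object being diffused is thus a flat vector of MLP weights rather than pixels or points. A hedged sketch of the decode step, assuming the diffusion model has already produced such a vector; the layer shapes and the `mlp_from_params` helper are hypothetical, not Shap-E’s actual parameter layout:

```python
import torch
import torch.nn as nn

def mlp_from_params(flat_params, sizes=(3, 128, 128, 4)):
    """Build an MLP whose weights come from a generated parameter vector.

    The 4 outputs could be split into density/color (NeRF mode)
    or signed distance/color (STF mode); shapes here are hypothetical.
    """
    layers, offset = [], 0
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        lin = nn.Linear(d_in, d_out)
        n_w, n_b = d_in * d_out, d_out
        with torch.no_grad():
            lin.weight.copy_(flat_params[offset:offset + n_w].view(d_out, d_in))
            lin.bias.copy_(flat_params[offset + n_w:offset + n_w + n_b])
        offset += n_w + n_b
        layers += [lin, nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU

# e.g. flat_params = diffusion_model.sample(text_prompt)  # hypothetical call
```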

Despite these advancements, the quality of assets generated by models like Shap-E still falls short of optimization-based approaches [01:44:00]. However, they offer inference times that are orders of magnitude faster [01:48:00]. The choice of 3D representation remains a key source of heterogeneity and ongoing research in the field [07:23:00].