From: hu-po

ControlNet is a neural network structure designed to add conditional control to pre-trained large diffusion models, particularly Stable Diffusion [00:01:11]. It addresses the difficulty of precisely controlling diffusion models to generate exact desired images from text prompts alone [00:02:29]. This structure allows for much more fine-grained control over the final output, using inputs like edge maps, segmentation maps, or human poses [00:03:31].

Core Architecture of ControlNet

The fundamental concept behind ControlNet involves cloning the weights of a large, pre-trained diffusion model into two copies: a locked copy and a trainable copy [00:17:12].

  • Locked Copy: This copy preserves the original network’s capabilities learned from billions of images [00:17:18].
  • Trainable Copy: This copy is trained on task-specific datasets to learn the conditional control [00:17:22]. The motivation for this dual-copy approach is to prevent overfitting on smaller datasets and maintain the high quality of the large pre-trained models [00:44:57].
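
The cloning step can be illustrated with a short sketch. This is a minimal illustration assuming PyTorch; the block used here is a stand-in, not an actual Stable Diffusion module.

```python
import copy
import torch.nn as nn

# Stand-in for one pre-trained encoder block of the diffusion model.
pretrained_block = nn.Conv2d(320, 320, kernel_size=3, padding=1)

# Trainable copy: starts from the same pre-trained weights, then adapts to the task.
trainable_copy = copy.deepcopy(pretrained_block)

# Locked copy: frozen so the capabilities learned from billions of images are preserved.
locked_copy = pretrained_block
for p in locked_copy.parameters():
    p.requires_grad_(False)
```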

Zero Convolution Layers

The trainable and locked neural network blocks within ControlNet are connected via a unique type of convolutional layer called a “zero convolution” [00:17:43].

  • Initialization: These are 1x1 convolutional layers with both their weights and biases initialized to zero [00:25:39].
  • Initial State: At the very first training step, because all weights and biases are zero, the ControlNet has no influence on the diffusion model’s output [00:51:06]. This ensures that the original model’s functionality and quality are perfectly preserved initially [00:51:16].
  • Learning: As training progresses, the weights of these zero convolutions are progressively optimized from zero to non-zero values [00:17:53]. This lets training proceed as quickly and robustly as fine-tuning a diffusion model [00:18:11]. Although the layer’s output starts at zero, the gradients of its weights and biases depend on the input features (which are generally non-zero), so the zero convolution can already move to a non-zero matrix after the first gradient-descent step [00:57:01].
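
A minimal sketch of such a zero convolution, assuming PyTorch (the channel count below is illustrative, not taken from the official implementation):

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution whose weights and biases start at zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# At initialization the layer maps any feature map to zeros, so adding its
# output to the locked model's features leaves that output unchanged.
x = torch.randn(1, 320, 64, 64)
assert torch.equal(zero_conv(320)(x), torch.zeros_like(x))
```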

Integration with Stable Diffusion

ControlNet is specifically applied to Stable Diffusion’s U-Net architecture [01:02:50]. The U-Net consists of an encoder, a middle block, and a skip-connected decoder [01:09:13].

  • Conditioning: The Stable Diffusion model is inherently conditioned on text prompts (encoded by OpenAI CLIP) [01:04:04] and diffusion time steps (encoded with positional encoding) [01:10:27].
  • Latent Space: Stable Diffusion performs denoising in a latent space (e.g., 64x64 latent images from 512x512 inputs) to save computational power [01:11:36]. Therefore, ControlNet requires image-based conditions to be converted into this same 64x64 feature space [01:11:53].
  • Tiny Network E: A small encoder network E, consisting of four convolution layers with 4x4 kernels, 2x2 strides, and ReLU activations, encodes these image-space conditions (e.g., Canny edge maps) into the required 64x64 feature maps for injection into the latent space [01:13:51] (see the sketch after this list).
  • Connection Points: ControlNet controls each level of the U-Net. The outputs of ControlNet’s trainable copies (12 encoding blocks and one middle block) are added to the 12 skip connections and one middle block of the U-Net in the Stable Diffusion model [01:18:12].
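
The tiny condition encoder E mentioned above can be sketched as follows, assuming PyTorch. The channel widths and the stride-1 final projection are assumptions made here so that a 512x512 condition lands exactly on the 64x64 latent resolution; the official layer layout may differ.

```python
import torch
import torch.nn as nn

class TinyConditionEncoder(nn.Module):
    """Encodes an image-space condition (e.g., a Canny edge map) into a
    64x64 feature map matching Stable Diffusion's latent resolution."""
    def __init__(self, in_channels: int = 3, out_channels: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=4, stride=2, padding=1),  # 512 -> 256
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),           # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),           # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(64, out_channels, kernel_size=3, stride=1, padding=1), # project to latent channels
        )

    def forward(self, condition: torch.Tensor) -> torch.Tensor:
        return self.net(condition)

edges = torch.randn(1, 3, 512, 512)          # e.g., a batch of edge-map conditions
print(TinyConditionEncoder()(edges).shape)   # torch.Size([1, 320, 64, 64])
```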

Training Strategies and Performance

ControlNet training is robust across different dataset scales, even with small datasets [00:06:08].

Optimization and Efficiency

  • Computational Efficiency: Because the original Stable Diffusion weights are locked during training, no gradient computation on the original encoder is needed [01:17:06]. This speeds up training and saves GPU memory, roughly halving the gradient computation [01:17:20] (see the sketch after this list).
  • Resource Requirements: Training a ControlNet on a Stable Diffusion model requires only about 23% more GPU memory and 34% more time per iteration compared to the base model [01:17:28]. This enables training even on personal devices with GPUs like an Nvidia RTX 3090 [01:17:40].
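
A small sketch, assuming PyTorch with stand-in modules, of why locking saves compute: frozen parameters accumulate no gradients at all, so autograd never has to store or update them.

```python
import torch
import torch.nn as nn

locked = nn.Linear(16, 16)      # stand-in for a locked Stable Diffusion block
trainable = nn.Linear(16, 16)   # stand-in for the trainable ControlNet copy
for p in locked.parameters():
    p.requires_grad_(False)

out = trainable(locked(torch.randn(2, 16)))
out.sum().backward()

print(all(p.grad is None for p in locked.parameters()))         # True: no gradient memory spent
print(all(p.grad is not None for p in trainable.parameters()))  # True: only the copy is trained
```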

Training Configurations

  • Small-Scale Training (Limited Compute): For devices with limited computational power (e.g., personal computers with an RTX 3070), partially breaking the connections between ControlNet and Stable Diffusion can accelerate convergence [01:24:44]. Disconnecting the links to decoder blocks 1-4 and keeping only the connection to the middle block improves training speed by a factor of about 1.6 [01:24:59]. Once a reasonable association between condition and output is achieved, these links can be reconnected for continued training to improve accuracy [01:25:16].
  • Large-Scale Training (Powerful Clusters): When powerful computational clusters (e.g., Nvidia A100s with 80GB of memory) and large datasets (millions of training examples) are available, a two-stage training approach (sketched after this list) is used [01:25:50]:
    1. Initial Stage: ControlNet is trained for a sufficient number of iterations while the Stable Diffusion weights remain locked [01:26:17].
    2. Fine-tuning Stage: After approximately 50,000 steps, all weights of the Stable Diffusion model are unlocked, and both models are jointly trained [01:26:39].
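
A minimal, self-contained sketch of this two-stage schedule, assuming PyTorch; the modules, loss, and total step count are stand-ins, and only the 50,000-step unlock point follows the description above.

```python
import torch
import torch.nn as nn

base_model = nn.Linear(8, 8)     # stand-in for the locked Stable Diffusion weights
control_net = nn.Linear(8, 8)    # stand-in for the trainable ControlNet copy
for p in base_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(control_net.parameters(), lr=1e-5)
UNLOCK_STEP = 50_000

for step in range(60_000):
    if step == UNLOCK_STEP:
        # Stage two: unlock all Stable Diffusion weights and train both models jointly.
        for p in base_model.parameters():
            p.requires_grad_(True)
        optimizer.add_param_group({"params": base_model.parameters()})

    x = torch.randn(4, 8)
    loss = (base_model(x) + control_net(x)).pow(2).mean()   # placeholder objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```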

Prompting during Training

During training, 50% of the text prompts are randomly replaced with empty strings [01:23:37]. This technique, a form of guidance sampling, encourages the diffusion model to learn semantic concepts directly from the input control maps (such as Canny edge maps or scribbles) as a replacement for the text prompt when none is provided [01:23:54].
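
A minimal sketch of this prompt-dropping rule, assuming Python (the function name and probability parameter are illustrative):

```python
import random

def maybe_drop_prompt(prompt: str, drop_prob: float = 0.5) -> str:
    """Replace the caption with an empty string 50% of the time during training."""
    return "" if random.random() < drop_prob else prompt

# Over an epoch, roughly half the training examples see no text prompt at all,
# so the model must rely on the control map (e.g., edges or scribbles) instead.
print(maybe_drop_prompt("a cozy cabin in a snowy forest"))
```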

Supported Conditional Inputs

ControlNet can be augmented with various image-based conditions, demonstrating its versatility [00:06:40]. Examples include:

  • Edge Maps:
    • Canny Edge Detector: Generates sharp outlines of objects [01:27:46]. Training data for this condition included 3 million edge-image-caption pairs, with randomized edge-detection thresholds to increase data diversity [01:28:18].
    • Hough Transform: Detects straight lines within images [01:29:56].
    • Holistically-Nested Edge Detection (HED): Another method for boundary detection [01:31:11].
  • User Scribbles: Synthesized from HED boundary detection combined with strong data augmentations (e.g., masking, morphological transformations) to simulate human drawings [01:31:47].
  • Human Keypoints/Pose Estimation: Utilizes learning-based pose estimation methods to find human keypoints and construct skeletons, enabling control over character poses [01:33:10].
  • Semantic Segmentation Maps: Provides pixel-level classification of objects and regions within an image [01:35:07].
  • Surface Normals (normal maps): Represent the orientation of surfaces in an image [01:39:06].
  • Depth Maps: Estimate the distance of objects from the camera [01:37:02] (often approximated from monocular images by models like MiDaS [01:37:11]).
  • Cartoon Line Drawings: Extracts line art from cartoon illustrations [01:40:54].
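
As a concrete example of preparing one of these conditions, the sketch below (assuming OpenCV) produces a Canny edge map with randomized thresholds, in the spirit of the data-diversity trick mentioned for the Canny detector above; the threshold ranges and file names are illustrative assumptions.

```python
import random
import cv2
import numpy as np

def canny_condition(image_path: str) -> np.ndarray:
    """Return a single-channel Canny edge map with randomly chosen thresholds."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    low = random.randint(50, 150)                 # random low threshold
    high = random.randint(low + 50, low + 200)    # random high threshold above the low one
    return cv2.Canny(gray, low, high)

edges = canny_condition("example.jpg")            # hypothetical input image
cv2.imwrite("example_canny.png", edges)
```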

These different conditional inputs demonstrate the broad range of applications of ControlNet in image generation and its ability to enrich the methods for controlling large diffusion models [00:06:57].