From: hu-po
Visual Autoregressive Modeling (VAR) is a new generative paradigm that redefines how autoregressive learning is applied to image generation [00:07:42]. Instead of sequential next-token prediction, VAR employs a next-scale or next-resolution prediction approach [00:07:47] [00:09:00] [00:18:02]. This paper won a best paper award at the prestigious Neural Information Processing Systems (NeurIPS) conference [00:04:02] [00:04:06].
Background and Problem Statement
Traditional autoregressive models for images, mirroring sequential language modeling, discretize continuous images into 2D token grids, which are then flattened into a 1D sequence for autoregressive learning [00:15:05] [00:15:09] [00:31:02]. This flattening typically uses a row-major raster scan (left-to-right, top-to-bottom), though other orders such as spiral or Z-curve scans are also used [00:15:40] [00:42:06] [00:42:09].
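To make the flattening concrete, here is a minimal Python sketch of a row-major raster scan over a toy token grid (the grid values are arbitrary placeholders, not the output of any real tokenizer):

```python
import numpy as np

# Toy 4x4 grid of discrete token indices, standing in for a tokenized image.
token_grid = np.arange(16).reshape(4, 4)

# Row-major raster scan: read left-to-right, top-to-bottom.
raster_sequence = token_grid.flatten(order="C")
print(raster_sequence)  # [ 0  1  2 ... 15]

# Vertically adjacent tokens, e.g. token_grid[0, 0] and token_grid[1, 0],
# end up 4 positions apart in the 1D sequence, which is how flattening
# breaks the 2D spatial locality discussed below.
```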
However, this approach introduces several challenges:
- Inherent Directionality Mismatch: Language has a natural 1D, unidirectional flow, which is a strong inductive prior for language models [00:31:17]. Images, conversely, possess bidirectional correlation; a token’s neighbors are closely related in all directions [00:43:21]. This raster scan order contradicts the bidirectional nature of images [00:43:22].
- Loss of Spatial Locality: Flattening a 2D grid into a 1D sequence disrupts the inherent spatial locality [00:44:01]. For example, a token and its four immediate spatial neighbors are closely correlated, but this relationship is compromised in a linear sequence [00:44:08].
- Reliance on Position Embeddings: To mitigate the loss of spatial information, these models often rely on rotary position embeddings to implicitly remind the network of spatial relationships [00:16:38] [00:17:19].
- Computational Cost: Full autoregressive generation of N² tokens requires O(N²) decoding iterations and O(N⁶) total compute, which becomes prohibitively expensive at high resolutions [01:14:07] [01:14:10] [01:35:16].
VAR’s Hierarchical Approach
VAR proposes a new strategy based on the observation that humans perceive and create images hierarchically, from coarse to fine [00:17:53] [00:19:50]. This multiscale, coarse-to-fine inductive prior is similar to how convolutional neural networks (CNNs) and the human visual system (e.g., cortical areas V1, V2, V3) operate [00:19:17] [00:20:06] [00:21:04].
Instead of next-token prediction, the VAR Transformer predicts the next higher-resolution token map conditioned on all previous (lower-resolution) ones [00:18:05] [00:50:39]. Generation starts with a single token representing the whole image and then progressively predicts entire token maps at successively higher resolutions [00:18:36].
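Written out, the next-scale factorization replaces a product over individual tokens with a product over whole token maps r_1, …, r_K of increasing resolution (notation chosen here to match the description above, not necessarily the paper's exact symbols):

```latex
p(r_1, r_2, \dots, r_K) = \prod_{k=1}^{K} p\!\left(r_k \mid r_1, r_2, \dots, r_{k-1}\right)
```

Each r_k is an entire token map at scale k, and all of its tokens are predicted jointly at step k.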
VQ-VAE for Multiscale Tokenization
VAR utilizes a two-stage training process:
- Multiscale VQ-VAE Training: In the first stage, a multiscale VQ-VAE (Vector Quantized Variational AutoEncoder) encodes an image into multiple token maps at increasingly higher resolutions [00:45:00] [00:45:03]. The VQ-VAE takes a continuous image, uses a CNN encoder to convert it into a continuous feature map, and then a quantizer selects the closest discrete vector from a predefined codebook (vocabulary) for each spatial position of the feature map [00:32:27] [00:34:24] [00:35:12]; a minimal sketch of this lookup appears after this list. This effectively discretizes the image into tokens [00:36:58]. The codebook is shared across all scales, meaning the same vocabulary of 4,096 possible token values is used for both coarse and fine resolutions [00:29:06] [01:02:44]. This VQ-VAE training is self-supervised, using the image itself as the learning signal [00:46:51].
- VAR Transformer Training: Once the VQ-VAE is trained, the VAR Transformer is trained on these multiscale tokens [00:47:40]. It predicts the tokens of a higher-resolution map R_k conditioned on all previous (lower-resolution) maps R_1 through R_{k-1} [00:50:39]. A block-wise causal attention mask ensures that each R_k can only attend to maps R_{≤k} (sketched after this list) [00:51:04]. Crucially, all tokens within a given resolution map R_k can be generated in parallel, unlike in traditional raster-scan autoregressive models [00:59:03] [01:00:57].
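Here is a minimal sketch of the nearest-codebook lookup performed by the quantizer described in the first stage; the shapes, names, and use of plain Euclidean distance are illustrative assumptions, not the paper's implementation:

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each feature vector to the index of its nearest codebook entry.

    features: (h, w, d) continuous feature map from the CNN encoder.
    codebook: (V, d) learned vocabulary, e.g. V = 4096 entries shared across scales.
    Returns:  (h, w) grid of discrete token indices.
    """
    flat = features.reshape(-1, features.shape[-1])   # (h*w, d)
    dists = torch.cdist(flat, codebook)               # (h*w, V) Euclidean distances
    tokens = dists.argmin(dim=-1)                     # nearest codebook entry per position
    return tokens.reshape(features.shape[:2])

# Example: a 16x16 feature map quantized against a 4096-entry codebook.
tokens = quantize(torch.randn(16, 16, 32), torch.randn(4096, 32))
```

In the multiscale setting, the same `codebook` tensor would be reused at every resolution, matching the shared 4,096-entry vocabulary described above.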
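And a sketch of the block-wise causal attention mask from the second stage: every token at scale k may attend to all tokens at scales 1 through k, but not to later scales (the scale side lengths in the example are hypothetical):

```python
import torch

def block_causal_mask(tokens_per_scale: list[int]) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if query token i may attend to key token j."""
    # Scale id of every token in the concatenated multiscale sequence.
    scale_ids = torch.cat([
        torch.full((n,), k, dtype=torch.long) for k, n in enumerate(tokens_per_scale)
    ])
    # Token i (at scale k_i) attends to token j iff k_j <= k_i.
    return scale_ids.unsqueeze(1) >= scale_ids.unsqueeze(0)

# Example: scales with side lengths 1, 2, 3, 4 -> 1, 4, 9, 16 tokens each.
mask = block_causal_mask([1, 4, 9, 16])
print(mask.shape)  # torch.Size([30, 30])
```

Because the mask imposes no ordering within a scale, all tokens of R_k can be decoded in parallel at inference time, which is the source of the speedup discussed below.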
Performance and Advantages
VAR demonstrates significant improvements over previous methods:
- Quantitative Results: On the ImageNet 256x256 benchmark, VAR drastically improves the Fréchet Inception Distance (FID) from 18.65 to 1.73 and the Inception Score (IS) from 80.4 to 350.2 [00:09:50] [00:09:53] [00:09:57]. An FID of 1.78, the score of the ImageNet validation data itself, is often treated as a practical lower bound, which makes VAR's 1.73 especially impressive [01:03:32] [01:03:34].
- Inference Speed: VAR achieves a 20x faster inference speed compared to Diffusion Transformer models [00:09:59] [01:14:43]. This is due to the parallel token generation within each resolution map [01:00:01].
- Computational Complexity: The time complexity for generating an image of N² tokens drops to O(N⁴) for VAR, compared to O(N⁶) for conventional raster-scan autoregressive models (a back-of-the-envelope sketch follows this list) [01:01:54] [01:14:07] [01:14:10].
- Zero-Shot Generalization: VAR showcases strong zero-shot generalization abilities in downstream tasks like image inpainting, outpainting, and editing, meaning it can perform these tasks without specific fine-tuning [01:12:42] [01:28:42].
- Simplicity: The model achieves these results with a vanilla VQ-VAE architecture and a standard GPT-2-style Transformer, without advanced techniques such as Rotary Position Embeddings, SwiGLU MLPs, or RMSNorm [01:08:42] [01:08:47] [01:08:50]. This suggests that the new inductive prior, rather than architectural tricks, is the primary driver of performance.
- Scaling Laws: VAR demonstrates predictable scaling laws, where increasing model parameters, training tokens, or optimal training compute leads to a predictable decrease in test loss [01:15:05] [01:15:10].
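A back-of-the-envelope sketch for the complexity comparison above, under the accounting where each decoding step performs a full attention pass over all tokens produced so far and scale side lengths n_k grow geometrically up to N (these are assumptions of this sketch, not necessarily the paper's exact derivation):

```latex
% Raster-scan AR: N^2 steps; step i attends over i tokens at cost O(i^2):
\sum_{i=1}^{N^2} i^2 = O(N^6)

% VAR: with geometric scale growth (ratio r > 1), the tokens accumulated by
% step k total O(n_k^2), so step k costs O(n_k^4); summing the geometric series:
\sum_{k=1}^{K} n_k^4 \le N^4 \sum_{j \ge 0} r^{-4j} = O(N^4)
```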
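For reference, "predictable scaling laws" are conventionally summarized as power laws of the form below; the paper fits its own exponents and constants, which are not reproduced here:

```latex
L_{\text{test}}(X) \approx (\beta X)^{\alpha}, \qquad
X \in \{\text{parameters},\ \text{training tokens},\ \text{compute}\}, \quad \alpha < 0
```

On a log-log plot this appears as a straight line, which is what makes extrapolation to larger models possible.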
Future Work and Applications
The paper outlines several promising directions for future work:
- Advanced Tokenization: Further improvements could be made by advancing the VQ-VAE tokenizer itself [01:17:34].
- Downstream Tasks: Applying VAR to more complex tasks such as text-to-image generation is a high priority for exploration [01:17:47].
- 3D and Video Generation: The multiscale approach is naturally extensible to 3D data and videos [01:17:57]. By formulating a similar 3D next-scale prediction, VAR could generate videos, potentially offering inherent advantages in temporal consistency compared to diffusion-based generators like Sora [01:18:00] [01:18:04]. This concept could also apply to motion modeling.
- Other Data Modalities: The core idea of rethinking 1D flattening for high-dimensional data could be applied to other modalities like proteins or graphs [01:18:50].
Controversy
The first author, Keyu Tian, faced a legal battle with ByteDance, where he interned [00:05:01] [00:05:21]. Allegations included maliciously disrupting or poisoning internal model training by altering code, leading to significant waste of compute resources [00:05:34] [00:05:39].
Why it Won Best Paper
VAR’s success as a best paper award winner can be attributed to several factors:
- Strong Results: The significant quantitative improvements on established benchmarks (FID, Inception Score) are a key factor [01:23:23].
- Promising Future Directions: The clear and broad applicability to various tasks and data modalities highlights its potential for further research and development [01:23:24].
- Well-Written and Clear Figures: The paper is praised for its clarity, educational value, and effective figures, making the complex concepts understandable [01:23:25] [01:23:31].
- Elegant and Intuitive Idea: The core idea of next-scale prediction is both simple and intuitively pleasing, grounded in how humans perceive images and analogous to principles found in CNNs [00:23:59] [01:20:57]. Papers that introduce simple yet highly effective concepts often win top awards [00:24:15].