From: hu-po
Bayesian Flow Networks (BFNs) are a new class of generative model; the sections below compare them with, and distinguish them from, existing models and algorithms in deep learning.
Core Distinctions from Diffusion Models
BFNs introduce a generative procedure similar in spirit to the reverse process of diffusion models [00:03:42]. A key difference is that no forward process is required, which makes BFNs conceptually simpler [00:04:16]. Whereas diffusion models learn by adding and then removing noise [00:03:52], BFNs operate on the parameters of a data distribution rather than on a noisy version of the data itself [00:36:44].
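As a rough sketch of what "operating on parameters" means in the continuous-data case: the network is never handed a noisy datapoint; it conditions on the mean and precision of a Gaussian belief, which are updated in closed form each time a noisy "sender" sample arrives. The conjugate-Gaussian update below is consistent with that description; the accuracy values in the toy loop are placeholder assumptions, not the paper's schedule.

```python
import torch

def bayesian_update_gaussian(mu, rho, y, alpha):
    """Closed-form Bayesian update of a Gaussian belief over the data.

    mu, rho : current mean and precision of the belief, i.e. the
              'parameters of the data distribution' the network sees
    y       : noisy sender sample, y ~ N(x, 1/alpha), of the true data x
    alpha   : accuracy (precision) added by this sender sample
    """
    rho_new = rho + alpha
    mu_new = (rho * mu + alpha * y) / rho_new
    return mu_new, rho_new

# Toy usage: the belief sharpens toward the data as accuracy accumulates.
x = torch.randn(2, 4)                      # stand-in for the true data
mu, rho = torch.zeros_like(x), torch.ones_like(x)
for alpha in (0.2, 0.5, 1.0):              # hypothetical accuracy schedule
    y = x + torch.randn_like(x) / alpha ** 0.5
    mu, rho = bayesian_update_gaussian(mu, rho, y, alpha)
```

Note that `mu` and `rho`, not `y`, are what the network would receive as input, which is what keeps the process smooth in the parameters.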
This fundamental difference offers several advantages:
- Continuous and Differentiable Process: The generative process in BFNs remains fully continuous and differentiable even when dealing with discrete data [00:37:41] (a sketch of the discrete-data update follows this list). This contrasts with discrete diffusion models, which inherently use discrete samples as input [00:55:54]. Existing continuous variants of discrete diffusion models typically rely on mapping to or from a continuous embedding space, or on restricting continuous diffusion to the probability simplex [00:59:09]. BFNs’ continuity is an inherent property, removing the need for such external constraints or mapping functions, which also reduces the number of free parameters and design choices [00:59:33].
- Direct Loss Optimization: BFNs directly optimize the negative log-likelihood of discrete data [00:59:33], unlike continuous diffusion methods for discrete data that often require simplified loss functions or auxiliary loss terms for stability [00:59:52].
- Initial Noise: BFNs begin their generative process with parameters of a fixed prior, whereas diffusion models start from pure noise [01:01:38]. This reduction in initial noise is hypothesized to lead to faster learning on large datasets where models might underfit [01:02:03].
- No Forward Process Inversion: BFNs do not require defining and inverting a forward process, which arguably makes them easier to adapt to different distributions and data types compared to discretized diffusion models that need carefully defined transition matrices [01:02:44].
- Performance: BFNs have been shown to outperform all known discrete diffusion models on the Text8 character-level language modeling task [01:00:38].
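To make the continuity claim for discrete data concrete (first bullet above), here is a minimal sketch of the discrete-data update: the belief is a vector of class probabilities, and the Bayesian update multiplies it elementwise by the exponentiated noisy observation and renormalizes, so the network only ever sees points on the probability simplex, never discrete tokens. The sender-noise construction (mean α(K·one_hot − 1), variance αK) follows my reading of the paper; treat the exact constants as assumptions.

```python
import torch
import torch.nn.functional as F

def bayesian_update_categorical(theta, y):
    """Update a categorical belief theta given a real-valued observation y
    over the K classes; equivalent to softmax(log(theta) + y)."""
    return F.softmax(torch.log(theta) + y, dim=-1)

# Toy usage with K = 27 classes (the size of Text8's character vocabulary).
K, batch = 27, 2
x = torch.randint(K, (batch,))                      # true discrete symbols
theta = torch.full((batch, K), 1.0 / K)             # uniform prior belief
alpha = 0.5                                         # hypothetical accuracy
mean = alpha * (K * F.one_hot(x, K).float() - 1.0)  # assumed sender mean
y = mean + (alpha * K) ** 0.5 * torch.randn(batch, K)
theta = bayesian_update_categorical(theta, y)       # still continuous probs
```

Because `theta` stays a dense probability vector throughout, gradients can flow through every step without any embedding map or simplex restriction being bolted on.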
Comparison with Auto-Regressive Models
- Continuous vs. Discrete Data: Auto-regressive networks are currently state-of-the-art for language modeling and generally perform well on discrete data where a natural ordering exists [02:09:08]. They have proved less effective in domains like image generation, where data is continuous and lacks a natural order (e.g., no inherent reason to generate one pixel before another) [02:09:08].
- Generation Steps: Auto-regressive models require as many network updates to generate samples as there are variables in the data [02:10:09]. Diffusion models have the advantage of decoupling the number of generation steps from the number of variables [02:10:09].
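A rough sketch of the step-count difference, with `net` as a placeholder for whichever network is being sampled from: the auto-regressive loop makes one call per variable, while the iterative-refinement loop used by diffusion models and BFNs makes a fixed number of calls regardless of how many variables the data has.

```python
# One network call per variable, in a fixed order: D calls for D variables.
def autoregressive_sample(net, D):
    tokens = []
    for i in range(D):
        tokens.append(net(tokens))        # predict variable i from the prefix
    return tokens

# n network calls, independent of D: each call refines all variables jointly.
def iterative_refinement_sample(net, params, n_steps):
    for step in range(n_steps):
        params = net(params, step)        # update the whole belief at once
    return params
```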
Comparison with Variational Autoencoders (VAEs)
The loss function for BFNs can be derived as the loss function of a Variational Autoencoder (VAE), specifically the negative variational lower bound [01:43:51].
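For reference, the negative variational lower bound mentioned here has the standard VAE form shown below (generic notation, not the paper's exact symbols); in the BFN derivation, the sum of per-step KL divergences between sender and receiver distributions plays the role of the latent KL term.

```latex
% Negative ELBO: an upper bound on the negative log-likelihood.
-\log p_\phi(\mathbf{x}) \;\le\; \mathcal{L}(\mathbf{x})
  \;=\; \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{z}\mid\mathbf{x}) \,\|\, p_\phi(\mathbf{z})\big)}_{\text{latent (per-step) KL terms}}
  \;-\; \underbrace{\mathbb{E}_{q(\mathbf{z}\mid\mathbf{x})}\big[\log p_\phi(\mathbf{x}\mid\mathbf{z})\big]}_{\text{reconstruction term}}
```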
Comparison with Neural Network Architectures
BFNs place no restrictions on the network architecture, meaning various types of Bayesian flow networks can be implemented [00:09:01]. However, despite their theoretical elegance, the practical scalability on high-dimensional data (e.g., 1024x1024 images, 50,000 token vocabularies) remains a potential challenge compared to highly optimized architectures like Transformers or CNNs [01:14:52], [02:11:53]. For instance, models like capsule networks, though theoretically promising, struggled with scaling and computational efficiency compared to CNNs [01:15:37].
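As an illustration of the "no restrictions on the network architecture" point, the only interface a BFN's network really needs is roughly "belief parameters plus time in, output-distribution parameters out"; any module with that signature could slot in. The class below is a hypothetical minimal stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BFNOutputNetwork(nn.Module):
    """Maps (belief parameters, time) to output-distribution parameters.
    The MLP body is a placeholder; a Transformer or CNN with the same
    input/output signature could replace it unchanged."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, theta, t):
        # theta: (batch, dim) belief parameters; t: (batch,) time in [0, 1]
        return self.body(torch.cat([theta, t[:, None]], dim=-1))

net = BFNOutputNetwork(dim=32)
out = net(torch.rand(4, 32), torch.rand(4))   # shape (4, 32)
```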
Comparison to Proprietary Models
The paper notes that BFNs achieve competitive log likelihoods for image modeling on dynamically binarized MNIST and CIFAR-10 datasets [00:09:09], performing closest to state-of-the-art when no data augmentation is used [02:43:02]. For language modeling, BFNs outperform known discrete diffusion models on the Text8 character-level task [00:10:38]. However, the language modeling task uses a highly simplified setup (256-character sequences with 27 possible tokens) compared to large language models like GPT-2 or ChatGPT, which handle much longer contexts and significantly larger vocabularies [02:40:40].
Flexibility and Generality
BFNs are adaptable to continuous, discretized, and discrete data with minimal changes to the training procedure [01:02:57]. This flexibility extends to multimodality; there are no restrictions on input or output modalities, allowing BFNs to potentially handle combinations of images and text [03:04:43]. The loss function directly optimizes data compression [00:08:05], an idea linked to concepts like “generalization is compression” and “intelligence is compression” [00:08:23], suggesting a broad theoretical applicability.
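The compression link is the standard information-theoretic one: the expected negative log-likelihood under the model is, up to a small constant, the number of bits an ideal entropy coder would need per sample, so minimizing the loss is minimizing an achievable code length (e.g., bits per character on Text8):

```latex
\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\big[-\log_2 p_{\text{model}}(\mathbf{x})\big]
  \;=\; H(p_{\text{data}}) \;+\; D_{\mathrm{KL}}\big(p_{\text{data}} \,\|\, p_{\text{model}}\big)
  \quad\text{bits per sample.}
```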