From: hu-po

Recent advancements in AI models indicate a strong trend towards making models smaller and more efficient without sacrificing their intelligence or capabilities [00:03:48]. This development challenges the notion that greater model size equates to better performance, suggesting a future where powerful AI can be deployed on more accessible hardware [00:04:08].

Redundancy in Transformer Architectures

Transformers, the foundational architecture for many advanced AI models, contain redundant elements, leading to inefficiencies in deployment costs and resource demands [00:07:19]. Research shows that models can be significantly pruned without degrading performance [00:06:24].

“What Matters in Transformers: Not All Attention Is Needed”

This paper, from the University of Maryland, empirically studies redundancy across different modules within Transformers, including blocks, Multi-Layer Perceptrons (MLPs), and attention layers [00:04:41].

  • Transformer Components: A typical Transformer architecture consists of repeated “blocks,” each containing an attention layer (whose heads produce attention maps) and an MLP (feed-forward network) [00:04:58].
  • Ablation Study: The researchers assessed the importance of different parts of a trained Transformer by “dropping” (deleting) them and measuring the effect [00:05:27].
  • Key Finding: A large portion of attention layers exhibit excessively high similarity and can be pruned without degrading performance [00:06:19]. Llama 2 70B, for instance, achieved a 40-50% speedup with only a 3% performance drop by pruning half of its attention layers [00:06:33].
  • Similarity Scores: The decision of what to drop is based on cosine similarity, which quantifies how similar two intermediate representations (vectors) are in a high-dimensional space [00:10:36]. Highly similar (redundant) information can be removed [00:12:04]; a minimal code sketch of this criterion follows the list.
  • Layer Importance: The first and last layers of MLP and attention components are generally more important, while middle layers often contain more redundancy [00:25:54]. This means a more aggressive pruning strategy can be applied to deeper layers [00:29:35].
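
To make the dropping criterion concrete, here is a minimal PyTorch sketch, assuming hidden states have already been collected from a forward pass (e.g., via output_hidden_states=True in Hugging Face transformers). The function names and the flattened per-layer cosine similarity are illustrative simplifications; the paper scores each attention or MLP module by comparing its input and output representations.

```python
import torch
import torch.nn.functional as F

def layer_redundancy_scores(hidden_states):
    """Cosine similarity between consecutive hidden states.

    `hidden_states` is a list of tensors of shape (tokens, dim), one per
    layer boundary. A similarity near 1.0 means the corresponding layer
    barely changed its input, making it a candidate for dropping.
    """
    scores = []
    for x_in, x_out in zip(hidden_states[:-1], hidden_states[1:]):
        sim = F.cosine_similarity(x_in.flatten(), x_out.flatten(), dim=0)
        scores.append(sim.item())
    return scores

def layers_to_drop(hidden_states, k=4):
    """Indices of the k layers whose input and output are most similar."""
    scores = layer_redundancy_scores(hidden_states)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```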

Sparse Attention Maps and Modality

The quadratic computational cost of Transformers arises from calculating attention scores between every token in a sequence, creating a dense “attention map” [00:14:33]. Much of this map consists of near-zero values, representing wasted computation [00:14:57].
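
A toy example makes the waste visible; the sizes below are arbitrary and not tied to any particular model.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the quadratic cost: n tokens produce an n x n
# attention map, most of whose entries carry almost no probability mass.
n, d = 1024, 64
q, k = torch.randn(n, d), torch.randn(n, d)

attn = F.softmax(q @ k.T / d**0.5, dim=-1)   # shape (n, n): n^2 entries

frac_near_zero = (attn < 1e-3).float().mean()
print(f"{frac_near_zero:.1%} of the attention entries are below 1e-3")
```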

  • “Differential Transformer”: This paper proposes a differential attention mechanism that calculates attention scores as the difference between two softmax attention maps, resulting in a sparse attention pattern [00:16:17]. This not only speeds up inference but can also outperform standard Transformers [00:36:29]; a simplified sketch of the subtraction appears after this list.
  • “Pyramid Drop”: This paper extends the concept of redundancy to vision language models (VLMs), demonstrating that visual tokens also exhibit significant redundancy, especially in deeper layers [00:16:54].
    • Mechanism: Pyramid Drop divides the Vision Transformer (ViT) into stages and drops a portion of image tokens at the end of each stage with a predefined ratio, creating a pyramid-like reduction in tokens [00:19:04] (a token-dropping sketch also follows this list).
    • Impact: It achieved a 40% training-time and 55% inference-FLOP acceleration for LLaVA-NeXT [00:20:03].
    • Performance Enhancement: Notably, in some cases, dropping tokens or attention layers can actually improve model performance [00:34:32]. For example, a version of LLaVA-NeXT with Pyramid Drop performed better while requiring fewer GPU hours [00:34:16]. This phenomenon suggests that redundancy might be inherent to the power of Transformers, allowing for robust information flow even when parts are removed [00:39:39].
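
The core of the differential attention idea can be sketched in a few lines. The following single-head version uses assumed projection names (Wq1, Wk1, etc.) and a fixed scalar lambda, whereas the paper learns lambda and works with grouped heads and normalization; treat it as an illustration of the subtraction trick, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Single-head sketch of differential attention.

    The attention map is the difference of two softmax maps, which cancels
    the common "noise" attention both maps assign and leaves a sparser
    pattern. x: (tokens, dim); each W* projects dim -> d_head.
    """
    d_head = Wq1.shape[1]
    a1 = F.softmax((x @ Wq1) @ (x @ Wk1).T / d_head**0.5, dim=-1)
    a2 = F.softmax((x @ Wq2) @ (x @ Wk2).T / d_head**0.5, dim=-1)
    return (a1 - lam * a2) @ (x @ Wv)
```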
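
Similarly, the stage-wise token dropping in Pyramid Drop reduces to a simple loop. The stage count, keep ratio, and fallback scoring below are illustrative assumptions; the paper ranks image tokens with a lightweight attention-based criterion rather than the token-norm stand-in used here.

```python
import torch

def pyramid_drop(image_tokens, num_stages=4, keep_ratio=0.5, score_fn=None):
    """Sketch of stage-wise image-token dropping.

    image_tokens: (n_tokens, dim). At the end of each stage, only
    `keep_ratio` of the remaining tokens are kept, ranked by an importance
    score, producing a pyramid-like reduction in token count.
    """
    tokens = image_tokens
    for _ in range(num_stages):
        scores = score_fn(tokens) if score_fn else tokens.norm(dim=-1)
        k = max(1, int(tokens.shape[0] * keep_ratio))
        keep = scores.topk(k).indices.sort().values  # keep original token order
        tokens = tokens[keep]
    return tokens
```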

Quantization and Efficient Serving

Beyond architectural pruning, reducing the precision of model parameters (weights) offers another significant avenue for efficiency.

“1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs”

This paper from Microsoft Research focuses on one-bit large language models, demonstrating how to achieve fast and lossless inference on CPUs [00:46:58].

  • Quantization: Instead of storing weights as high-precision floating-point numbers (e.g., 16 or 32 bits), these models store heavily rounded versions, typically just -1, 0, or 1 [00:53:51]. This drastically reduces memory usage and bandwidth [00:53:36]; see the quantization sketch after this list.
  • CPU Performance: Specialized kernels (algorithms) allow for highly optimized matrix multiplications with these low-precision weights, achieving up to 6x speedup on x86 CPUs and 5x speedup on ARM CPUs [00:50:50].
  • Lookup Tables: With only a few possible weight values, multiplications can be replaced by fast lookup-table operations, where pre-computed results are simply retrieved from memory [00:55:08]. This long-standing optimization technique, which predates computers, proves highly effective for deep learning [00:56:49]; a lookup-table sketch also follows this list.
  • Energy Efficiency: Reducing computational requirements also directly translates to significant energy savings, crucial for scaling AI services [01:08:50]. For instance, ChatGPT’s average electricity consumption was reported as 564 MWh per day [01:12:45].
  • Post-Training Quantization (PTQ): This technique quantizes an already-trained model without retraining, retaining performance on specific tasks or private datasets and enabling efficient serving of fine-tuned models without requiring access to the original training data [01:05:09].
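
A minimal sketch of ternary (roughly 1.58-bit) weight quantization in the spirit of BitNet b1.58's absmean scheme is shown below; the exact rounding and scaling details are an approximation of the published recipe, not the reference kernels.

```python
import torch

def absmean_ternary_quantize(W, eps=1e-5):
    """Round weights to {-1, 0, +1} with a single per-tensor scale.

    Divide by the mean absolute weight, then round and clip to the ternary
    set. The full-precision product W @ x is approximated by
    (W_q @ x) * scale.
    """
    scale = W.abs().mean().clamp(min=eps)
    W_q = (W / scale).round().clamp(-1, 1)
    return W_q, scale

# Each weight now takes ~1.58 bits (log2 of 3 states) instead of 16,
# and the matmul reduces to additions, subtractions, and skips.
W = torch.randn(4096, 4096)
W_q, scale = absmean_ternary_quantize(W)
```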
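
The lookup-table trick can likewise be illustrated in a few lines. This NumPy sketch mirrors the spirit of LUT-based CPU kernels (group the activations, precompute partial sums for every possible ternary weight pattern, then index), not their actual data layout or bit packing.

```python
import itertools
import numpy as np

def lut_matvec_ternary(W_q, x, g=2):
    """Matrix-vector product for ternary weights via table lookups.

    Activations are split into groups of g values; for each group the
    partial sums for all 3**g possible ternary weight patterns are
    precomputed once, so the per-row work becomes lookups and additions
    instead of multiplications.
    """
    n_rows, n_cols = W_q.shape
    assert n_cols % g == 0
    patterns = np.array(list(itertools.product([-1, 0, 1], repeat=g)))  # (3**g, g)
    powers = 3 ** np.arange(g - 1, -1, -1)                              # base-3 digits

    y = np.zeros(n_rows, dtype=np.float64)
    for j in range(0, n_cols, g):
        table = patterns @ x[j:j + g]            # partial sums for every pattern
        idx = (W_q[:, j:j + g] + 1) @ powers     # encode each row's pattern as an index
        y += table[idx]
    return y

# Quick check against a direct matmul.
W_q = np.random.randint(-1, 2, size=(8, 16))
x = np.random.randn(16)
assert np.allclose(lut_matvec_ternary(W_q, x), W_q @ x)
```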

The Future of Tiny AI

These advancements, combining architectural pruning with low-bit quantization, create a powerful synergy. The effects of these optimizations stack, meaning the overall speedup and efficiency gains are compounded [00:57:39].
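
As a rough back-of-the-envelope illustration, assuming the individual gains quoted above multiply independently (a simplification, not a measured result):

```python
# Illustrative only: compounding the figures quoted earlier in this summary.
attention_pruning_speedup = 1.45   # ~40-50% speedup from dropping attention layers
ternary_cpu_kernel_speedup = 6.0   # up to ~6x from low-bit kernels on x86 CPUs

combined = attention_pruning_speedup * ternary_cpu_kernel_speedup
print(f"~{combined:.1f}x combined speedup if the effects stack multiplicatively")
```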

This trend implies that AGI (Artificial General Intelligence) might soon be runnable on commodity hardware, including old and “shitty” tech like an Nvidia 30-series GPU (designed before GPT’s rise) or even a Nokia 3310 cell phone [01:00:16]. This decentralization of powerful AI, making it cheap and accessible to run on nearly any device, will have profound societal implications and will make such AI incredibly difficult to control or regulate [01:29:31].