From: aidotengineer

Luminal aims for radical simplification of machine learning (ML) libraries through search-based compilation [00:00:16]. This approach makes Luminal simpler than most other ML libraries without compromising performance or capability [00:00:20].

The Problem: Complexity in Current ML Libraries

Deep learning, at its core, is simple linear algebra involving scalars, vectors, matrices, and tensors, with a few fundamental operations such as addition, multiplication, and matrix multiplication [00:00:36]. However, the existing machine learning software ecosystem is highly complex [00:01:04].

For instance:

  • PyTorch features over 1,200 operations, 15 different data types, and supports numerous devices (CPU, CUDA, AMD, TPUs, NPUs) [00:01:12].
  • The complexity scales multiplicatively, not additively, with the number of operations, data types, and supported devices [00:01:42]. Adding a new operation, data type, or device can lead to an explosion in complexity [00:01:51].
  • PyTorch exceeds 3 million lines of code, and TensorFlow is even larger [00:02:02].
  • This complexity results in more bugs and makes it difficult for developers to extend, use, or build within these frameworks [00:02:15].

Older libraries were designed with dynamism at their core (e.g., for RNNs and LSTMs) to allow for hackability and experimentation, often at the expense of performance [00:04:35].

Luminal’s Approach to Simplification

Luminal adopts a top-down approach, identifying the minimum required components to run ML models [00:02:28].

Minimal Operations

Because deep learning is ultimately linear algebra, it can be broken down into simple operations. Luminal uses a set of just 12 core operations as “Lego blocks” to build complex models [00:02:40]:

  • Unary Operations: exp2, log2, sin, reciprocal, square root [00:02:53]
  • Binary Operations: addition, multiplication, modulo, less than [00:03:05]
  • Reductions: sum reduce, max reduce [00:03:10]

With these 12 operations, Luminal can support all commercially relevant models, including language models, vision-language models, CNNs, RNNs, and diffusion models [00:03:14]. Many seemingly complex operations are simply combinations of these primitives (a sketch follows the list below):

  • Subtraction: addition + multiplication by -1 [00:03:51]
  • Division: multiplication + reciprocal [00:03:57]
  • Matrix Multiplications (Matmuls): Broadcasted multiply + sum reduce (with tensor shape manipulation) [00:04:03]
  • Convolution: Pooling via shape trackers + matmul with a convolution kernel [00:04:18]
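
To make the decomposition concrete, here is a minimal Rust sketch that builds subtraction, division, and matrix multiplication out of primitives like those above, operating on plain f32 slices. The function names (add, mul, recip, matmul) are illustrative stand-ins, not Luminal’s API:

```rust
// Illustrative only: composing higher-level ops from Luminal-style primitives.

fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

fn mul(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x * y).collect()
}

fn recip(a: &[f32]) -> Vec<f32> {
    a.iter().map(|x| 1.0 / x).collect()
}

// Subtraction = addition + multiplication by -1.
fn sub(a: &[f32], b: &[f32]) -> Vec<f32> {
    let neg_one = vec![-1.0; b.len()];
    add(a, &mul(b, &neg_one))
}

// Division = multiplication + reciprocal.
fn div(a: &[f32], b: &[f32]) -> Vec<f32> {
    mul(a, &recip(b))
}

// Matmul = broadcasted multiply + sum reduce over the shared k dimension.
// `a` is m x k and `b` is k x n, both row-major; the broadcast is spelled out.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            // Elementwise multiply of row i of `a` against column j of `b`...
            let products: Vec<f32> = (0..k).map(|p| a[i * k + p] * b[p * n + j]).collect();
            // ...followed by a sum reduce over k.
            out[i * n + j] = products.iter().sum();
        }
    }
    out
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0]; // 2x2, row-major
    let b = [5.0, 6.0, 7.0, 8.0]; // 2x2, row-major
    println!("{:?}", sub(&a, &b)); // [-4.0, -4.0, -4.0, -4.0]
    println!("{:?}", div(&a, &b)); // [0.2, 0.333..., 0.428..., 0.5]
    println!("{:?}", matmul(&a, &b, 2, 2, 2)); // [19.0, 22.0, 43.0, 50.0]
}
```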

Static Graph Representation

Deep learning itself is not fundamentally dynamic; typical model dynamism is small and bounded, such as the KV cache length and sequence length in a transformer model [00:05:06]. Luminal specifies models as directed acyclic graphs (DAGs) of operations [00:05:30]. This allows the entire workload to be specified ahead of time [00:18:52].
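
Below is a minimal sketch of what a static DAG of operations looks like in practice, using hypothetical Rust types rather than Luminal’s own: the entire computation is described as plain data before anything executes, so a compiler is free to inspect and rewrite it.

```rust
// A minimal sketch of a model expressed as a static DAG of primitive ops.
// The types are hypothetical, not Luminal's data structures.

enum Op {
    Input,             // a leaf tensor supplied at run time
    Add(usize, usize), // indices of the two parent nodes
    Mul(usize, usize),
    SumReduce(usize),
}

struct Graph {
    nodes: Vec<Op>, // stored in topological order
}

impl Graph {
    // Because the whole workload is known ahead of time, execution is a single
    // pass over the node list; a compiler can inspect and rewrite that same
    // list before anything runs.
    fn run(&self, inputs: &[Vec<f32>]) -> Vec<Vec<f32>> {
        let mut results: Vec<Vec<f32>> = Vec::new();
        let mut next_input = 0;
        for op in &self.nodes {
            let value = match op {
                Op::Input => {
                    let v = inputs[next_input].clone();
                    next_input += 1;
                    v
                }
                Op::Add(a, b) => results[*a].iter().zip(&results[*b]).map(|(x, y)| x + y).collect(),
                Op::Mul(a, b) => results[*a].iter().zip(&results[*b]).map(|(x, y)| x * y).collect(),
                Op::SumReduce(a) => vec![results[*a].iter().sum()],
            };
            results.push(value);
        }
        results
    }
}

fn main() {
    // sum((a + b) * a): the graph is fully specified before anything executes.
    let graph = Graph {
        nodes: vec![
            Op::Input,        // node 0: a
            Op::Input,        // node 1: b
            Op::Add(0, 1),    // node 2: a + b
            Op::Mul(2, 0),    // node 3: (a + b) * a
            Op::SumReduce(3), // node 4: sum((a + b) * a)
        ],
    };
    let out = graph.run(&[vec![1.0, 2.0], vec![3.0, 4.0]]);
    println!("{:?}", out.last().unwrap()); // [16.0]
}
```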

Resulting Simplicity

As a consequence of these design choices, Luminal is very simple:

  • It is under 5,000 lines of code [00:06:25].
  • The goal is for the entire library to be learnable in an afternoon, with core concepts understandable in a couple of hours [00:06:30].

While simple, Luminal’s primitive graphs of operations are initially slow [00:06:46]. The core innovation lies in transforming these graphs into much faster ones using compilers, specifically through a search-based approach [00:07:03].

Traditional ML Stack vs. Luminal

A traditional ML stack often involves many layers (e.g., Hugging Face Transformers on PyTorch, Xformers, optimized kernels, then calling cuDNN/cuBLAS on CUDA) [00:07:22]. This creates complex dependencies, leading to “dependency hell” during installation and making bug tracing difficult [00:07:53].

Luminal simplifies this by directly emitting CUDA code [00:08:21]. There is nothing between Luminal’s library, graph, and compilers, and CUDA [00:08:29].
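
As a rough illustration of what emitting CUDA directly can look like (a hypothetical sketch, not Luminal’s actual code generator), the following Rust program walks a tiny elementwise expression and prints the source of a single fused CUDA kernel:

```rust
// A toy illustration of emitting CUDA source directly from an op description.
// Hypothetical sketch only; it prints the kernel text rather than compiling it.

enum Elementwise {
    Load(&'static str), // read an input buffer by name
    Add(Box<Elementwise>, Box<Elementwise>),
    Mul(Box<Elementwise>, Box<Elementwise>),
}

fn emit_expr(e: &Elementwise) -> String {
    match e {
        Elementwise::Load(name) => format!("{name}[i]"),
        Elementwise::Add(a, b) => format!("({} + {})", emit_expr(a), emit_expr(b)),
        Elementwise::Mul(a, b) => format!("({} * {})", emit_expr(a), emit_expr(b)),
    }
}

fn emit_kernel(expr: &Elementwise, inputs: &[&str]) -> String {
    let params: Vec<String> = inputs.iter().map(|n| format!("const float* {n}")).collect();
    let mut src = String::new();
    src.push_str(&format!(
        "__global__ void fused_kernel({}, float* out, int n) {{\n",
        params.join(", ")
    ));
    src.push_str("    int i = blockIdx.x * blockDim.x + threadIdx.x;\n");
    src.push_str(&format!("    if (i < n) out[i] = {};\n", emit_expr(expr)));
    src.push_str("}\n");
    src
}

fn main() {
    use Elementwise::*;
    // (a + b) * c fused into one kernel, with no intermediate buffers.
    let expr = Mul(
        Box::new(Add(Box::new(Load("a")), Box::new(Load("b")))),
        Box::new(Load("c")),
    );
    println!("{}", emit_kernel(&expr, &["a", "b", "c"]));
}
```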

The Search-Based Compiler Solution

The complexity of compilers scales exponentially with the complexity of the code they need to generate [00:09:07]. This has bottlenecked the ecosystem, especially for hardware startups with specialized hardware [00:09:36]. As hardware becomes simpler and faster (e.g., from CPUs to GPUs to TPUs, which require more explicit programmer control for better performance per watt), the software/compiler needs to become more complex [00:09:50].

This leads to the VLIW (Very Long Instruction Word) compiler problem: hardware designers want simple hardware, which requires the compiler to statically schedule everything, but such compilers become too complex for humans to write [00:11:22].

Luminal solves this by applying the same solution used by AlphaGo for cracking the game of Go: search [00:11:57]. Instead of hand-writing perfect algorithms, Luminal searches through logically equivalent GPU kernels [00:12:42].

How Search Works

  1. Graph to Expressions: Luminal converts its operation graphs into expressions using the egglog library, which represents the search space efficiently using e-graphs [00:13:10].
  2. Rewrite Rules: Luminal defines 20-25 simple rewrite rules [00:13:34]. Each rule makes a small, logically equivalent alteration to a given GPU kernel [00:13:37].
  3. Search Space Generation: By iteratively applying these simple rewrite rules, a very large search space of equivalent kernels is built [00:13:54].
  4. Performance Profiling: Luminal then profiles the runtime of these different equivalent kernels and selects the fastest one [00:14:07]. For larger search spaces, techniques like Monte Carlo search are used to prune the search [00:14:25].
  • Kernel Fusion: This optimization merges multiple operations into a single kernel to minimize data movement to and from global memory, which typically accounts for around 99% of the energy and time spent on GPUs [00:14:45].
    • An unfused graph involves writing results to memory and reading them back for each sequential operation [00:15:56].
    • A fused kernel merges these steps, drastically reducing data movement and making the entire aggregate graph far faster (see the sketch after this list) [00:16:11].
  • Flash Attention: Luminal’s search technique was able to independently discover Flash Attention, a complex algorithm that took the industry five years to find [00:16:40]. By running simple rewrite rules on a naive multi-head attention graph and profiling the search space, Luminal identified Flash Attention as the fastest kernel [00:17:15]. This is believed to be unique among compilers [00:17:33].
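
The following heavily reduced Rust sketch shows the profile-and-select loop in miniature. Instead of GPU kernels produced by e-graph rewrites, the two logically equivalent candidates are hand-written CPU functions, one unfused and one fused; all names and the timing harness are hypothetical and only illustrate the idea:

```rust
use std::time::Instant;

// Two logically equivalent "kernels" for sum((a + b) * c): one materializes
// intermediate buffers (unfused), one streams through the data in a single
// pass (fused). The search space here is just this two-element list.

fn unfused(a: &[f32], b: &[f32], c: &[f32]) -> f32 {
    // Each step writes a full intermediate buffer, mimicking separate kernels
    // that round-trip through memory.
    let t1: Vec<f32> = a.iter().zip(b).map(|(x, y)| x + y).collect();
    let t2: Vec<f32> = t1.iter().zip(c).map(|(x, y)| x * y).collect();
    t2.iter().sum()
}

fn fused(a: &[f32], b: &[f32], c: &[f32]) -> f32 {
    // One pass, no intermediate buffers: the fusion rewrite applied.
    a.iter().zip(b).zip(c).map(|((x, y), z)| (x + y) * z).sum()
}

fn main() {
    let n = 1_000_000;
    let a = vec![1.0f32; n];
    let b = vec![2.0f32; n];
    let c = vec![3.0f32; n];

    // The candidate list stands in for a search space of equivalent kernels.
    let candidates: Vec<(&str, fn(&[f32], &[f32], &[f32]) -> f32)> =
        vec![("unfused", unfused), ("fused", fused)];

    // Profile every candidate and keep the fastest one.
    let mut best: Option<(&str, u128)> = None;
    for (name, kernel) in candidates {
        let start = Instant::now();
        let result = kernel(&a, &b, &c);
        let micros = start.elapsed().as_micros();
        println!("{name}: {result} in {micros} us");
        if best.map_or(true, |(_, t)| micros < t) {
            best = Some((name, micros));
        }
    }
    println!("selected: {}", best.unwrap().0);
}
```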

Deterministic Optimizations

After the search process finds the fastest kernels, Luminal applies deterministic optimizations that are guaranteed to be beneficial:

  • Buffer Reuse: Because the entire workload is available as a graph, the compiler can optimally reuse memory buffers. It identifies buffers that are never simultaneously in use and assigns them to the same memory location, minimizing memory usage (a small allocation sketch follows this list) [00:18:37].
  • Kernel Dispatching: Instead of the traditional CPU-GPU round trip for each kernel launch, Luminal dispatches all kernels at once into a queue, allowing the GPU to run through them sequentially and saving significant time [00:19:31].
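
Here is a minimal sketch of the buffer-reuse idea, with hypothetical names and a deliberately simple greedy policy rather than Luminal’s actual allocator: because the schedule is static, every intermediate buffer has a known live interval, and buffers with non-overlapping intervals can share one allocation.

```rust
// Greedy slot assignment from static live intervals. Buffers whose lifetimes
// never overlap are mapped to the same slot, so four logical buffers end up
// needing only two allocations.

struct Buffer {
    name: &'static str,
    first_use: usize, // index of the op that produces the buffer
    last_use: usize,  // index of the last op that reads it
}

fn assign_slots(buffers: &[Buffer]) -> Vec<(&'static str, usize)> {
    // slots[i] holds the last_use of the buffer currently occupying slot i.
    let mut slots: Vec<usize> = Vec::new();
    let mut assignment = Vec::new();
    for buf in buffers {
        // Reuse the first slot whose occupant dies before this buffer is born.
        let free = slots.iter().position(|&busy_until| busy_until < buf.first_use);
        let slot = match free {
            Some(s) => s,
            None => {
                slots.push(0);
                slots.len() - 1
            }
        };
        slots[slot] = buf.last_use;
        assignment.push((buf.name, slot));
    }
    assignment
}

fn main() {
    // Buffers listed in the order the static schedule produces them.
    let buffers = [
        Buffer { name: "t0", first_use: 0, last_use: 1 },
        Buffer { name: "t1", first_use: 1, last_use: 2 },
        Buffer { name: "t2", first_use: 2, last_use: 3 }, // reuses t0's slot
        Buffer { name: "t3", first_use: 3, last_use: 4 }, // reuses t1's slot
    ];
    println!("{:?}", assign_slots(&buffers));
    // [("t0", 0), ("t1", 1), ("t2", 0), ("t3", 1)]
}
```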

Training Support as an Extension

Luminal was initially designed as an inference library [00:20:25]. However, because its graph representation is flexible, an autograd engine could be built as an external library that works directly on Luminal graphs [00:20:30]. This engine derives a backward graph from the forward graph and attaches it, allowing Luminal’s compilers (including the search process) to optimize training workloads as well [00:20:43]. This modularity means external contributors can write their own autograd engines or other advanced training setups [00:21:18].
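
The following is a tiny sketch of the underlying idea, not Luminal’s autograd library: walk the forward graph in reverse, applying each op’s local derivative rule. For brevity this version computes scalar gradients directly instead of emitting a separate backward graph, but the reverse traversal and per-op rules are the same.

```rust
// Reverse-mode differentiation over a tiny op set, scalars only for brevity.

#[derive(Clone, Copy)]
enum Op {
    Input,
    Add(usize, usize), // indices of parent nodes
    Mul(usize, usize),
}

// Forward pass: evaluate the graph in topological order.
fn forward(ops: &[Op], inputs: &[f64]) -> Vec<f64> {
    let mut vals = Vec::new();
    let mut next = 0;
    for op in ops {
        let v = match *op {
            Op::Input => {
                let v = inputs[next];
                next += 1;
                v
            }
            Op::Add(a, b) => vals[a] + vals[b],
            Op::Mul(a, b) => vals[a] * vals[b],
        };
        vals.push(v);
    }
    vals
}

// Backward pass: walk the graph in reverse, accumulating each op's local
// derivative into its parents (the chain rule). A graph-to-graph autograd
// would emit these same rules as new Add/Mul nodes instead of evaluating them.
fn backward(ops: &[Op], vals: &[f64]) -> Vec<f64> {
    let mut grads = vec![0.0; ops.len()];
    grads[ops.len() - 1] = 1.0; // d(output)/d(output) = 1
    for i in (0..ops.len()).rev() {
        match ops[i] {
            Op::Input => {}
            Op::Add(a, b) => {
                grads[a] += grads[i];
                grads[b] += grads[i];
            }
            Op::Mul(a, b) => {
                grads[a] += grads[i] * vals[b];
                grads[b] += grads[i] * vals[a];
            }
        }
    }
    grads
}

fn main() {
    // f(x, y) = (x + y) * x at x = 2, y = 3.
    let ops = [Op::Input, Op::Input, Op::Add(0, 1), Op::Mul(2, 0)];
    let vals = forward(&ops, &[2.0, 3.0]);
    let grads = backward(&ops, &vals);
    // f = 10, df/dx = 2x + y = 7, df/dy = x = 2.
    println!("f = {}, df/dx = {}, df/dy = {}", vals[3], grads[0], grads[1]);
}
```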

Future Developments

Luminal has ambitious future plans:

  • More Hardware Support: Expanding support beyond CPU, CUDA, and Metal to include AMD, Tenstorrent, Groq, and TPUs, aiming to democratize ML across various hardware [00:21:41].
  • Distributed Inference and Training: Implementing full 3D distributed capabilities, including data parallel, pipeline parallel, and tensor parallel [00:22:04].
  • Reinforcement Learning (RL) Optimization: Codifying environments within the Luminal graph so that both the model’s forward pass and environment steps can be optimized and run entirely on the GPU, significantly accelerating RL workflows [00:22:16].
  • Luminal Cloud: Offering a serverless inference endpoint by allowing users to export Luminal models as graphs, upload them to the cloud, and get an optimized endpoint [00:23:06]. Luminal handles optimization, batching, queuing, and machine provisioning, with users paying only when their graph executes, aiming for the simplest and fastest cloud experience [00:23:22].

The simplicity of Luminal’s design allows for faster innovation compared to more complex frameworks [00:24:03].