From: aidotengineer
Luminal aims for radical simplification of machine learning (ML) libraries through search-based compilation [00:00:16]. This approach makes Luminal simpler than most other ML libraries without compromising performance or capability [00:00:20].
The Problem: Complexity in Current ML Libraries
Deep learning, at its core, is simple linear algebra involving scalars, vectors, matrices, and tensors, with a few fundamental operations such as addition, multiplication, and matrix multiplication [00:00:36]. However, the existing machine learning software ecosystem is highly complex [00:01:04].
For instance:
- PyTorch features over 1,200 operations, 15 different data types, and supports numerous devices (CPU, CUDA, AMD, TPUs, NPUs) [00:01:12].
- The complexity scales multiplicatively, not additively, with the number of operations, data types, and supported devices [00:01:42]. Adding a new operation, data type, or device can lead to an explosion in complexity [00:01:51].
- PyTorch exceeds 3 million lines of code, and TensorFlow is even larger [00:02:02].
- This complexity results in more bugs and makes it difficult for developers to extend, use, or build within these frameworks [00:02:15].
Older libraries were designed with dynamism at their core (e.g., for RNNs and LSTMs) to allow for hackability and experimentation, often at the expense of performance [00:04:35].
Luminal’s Approach to Simplification
Luminal adopts a top-down approach, identifying the minimum required components to run ML models [00:02:28].
Minimal Operations
Deep learning, as linear algebra, can be broken down into simple operations. Luminal uses a set of just 12 core operations as “Lego blocks” to build complex models [00:02:40]:
- Unary Operations: exp2, log2, sin, reciprocal, square root [00:02:53]
- Binary Operations: addition, multiplication, modulo, less than [00:03:05]
- Reductions: sum reduce, max reduce [00:03:10]
With these 12 operations, Luminal can support all commercially relevant models, including language models, vision language models, CNNs, RNNs, and diffusion models [00:03:14]. Many seemingly complex operations are combinations of these primitives:
- Subtraction: addition + multiplication by -1 [00:03:51]
- Division: multiplication + reciprocal [00:03:57]
- Matrix Multiplications (Matmuls): broadcasted multiply + sum reduce (with tensor shape manipulation) [00:04:03]
- Convolution: pooling via shape trackers + matmul with a convolution kernel [00:04:18]
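To make the “Lego blocks” idea concrete, here is a minimal sketch in Rust. It is not Luminal’s actual API; the `Expr` enum and the helper functions are hypothetical, but they show how subtraction, division, and matmul reduce to the primitives listed above.

```rust
// Hypothetical sketch, not Luminal's real API: an expression made only of the
// primitive ops, plus leaf nodes for inputs and constants.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Expr {
    Input(&'static str),
    Const(f32),
    // Unary primitives
    Exp2(Box<Expr>),
    Log2(Box<Expr>),
    Sin(Box<Expr>),
    Recip(Box<Expr>),
    Sqrt(Box<Expr>),
    // Binary primitives
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Mod(Box<Expr>, Box<Expr>),
    LessThan(Box<Expr>, Box<Expr>),
    // Reductions (over one axis)
    SumReduce(Box<Expr>, usize),
    MaxReduce(Box<Expr>, usize),
}

// Subtraction = addition + multiplication by -1.
fn sub(a: Expr, b: Expr) -> Expr {
    Expr::Add(
        Box::new(a),
        Box::new(Expr::Mul(Box::new(b), Box::new(Expr::Const(-1.0)))),
    )
}

// Division = multiplication + reciprocal.
fn div(a: Expr, b: Expr) -> Expr {
    Expr::Mul(Box::new(a), Box::new(Expr::Recip(Box::new(b))))
}

// Matmul = broadcasted multiply + sum reduce over the shared axis
// (the broadcast/shape bookkeeping is omitted here).
fn matmul(a: Expr, b: Expr) -> Expr {
    Expr::SumReduce(Box::new(Expr::Mul(Box::new(a), Box::new(b))), 1)
}

fn main() {
    let x = Expr::Input("x");
    let y = Expr::Input("y");
    println!("{:?}", sub(x.clone(), y.clone()));
    println!("{:?}", div(x.clone(), y.clone()));
    println!("{:?}", matmul(x, y));
}
```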
Static Graph Representation
Deep learning itself is not fundamentally dynamic; typical model dynamism is small and bounded, such as the KV cache length and sequence length in a transformer model [00:05:06]. Luminal specifies models as directed acyclic graphs (DAGs) of operations [00:05:30]. This allows the entire workload to be specified ahead of time [00:18:52].
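As a rough illustration of the same idea as a flat, ahead-of-time DAG (again with hypothetical types, not Luminal’s): every node is recorded before anything executes, so a compiler pass can inspect and rewrite the whole workload up front.

```rust
// Hypothetical sketch of an ahead-of-time operation DAG (not Luminal's real types).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum Op {
    Input,                  // leaf: a weight tensor or user input
    Add(NodeId, NodeId),
    Mul(NodeId, NodeId),
    Recip(NodeId),
    SumReduce(NodeId, usize),
}

#[derive(Debug, Clone, Copy)]
struct NodeId(usize);

#[derive(Debug, Default)]
struct GraphIr {
    nodes: Vec<Op>,         // topological order: a node's inputs always precede it
}

impl GraphIr {
    fn add(&mut self, op: Op) -> NodeId {
        self.nodes.push(op);
        NodeId(self.nodes.len() - 1)
    }
}

fn main() {
    // The full workload is described up front; nothing executes yet.
    let mut g = GraphIr::default();
    let a = g.add(Op::Input);
    let b = g.add(Op::Input);
    let prod = g.add(Op::Mul(a, b));
    let _out = g.add(Op::SumReduce(prod, 1)); // a broadcasted-multiply + sum-reduce "matmul"

    // A compiler can now walk `g.nodes`, rewrite the graph, and only then
    // generate and launch kernels.
    println!("{} nodes specified ahead of time", g.nodes.len());
}
```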
Resulting Simplicity
As a consequence of these design choices, Luminal is very simple:
- It is under 5,000 lines of code [00:06:25].
- The goal is for the entire library to be learnable in an afternoon, with core concepts understandable in a couple of hours [00:06:30].
Achieving Performance Through Compilers and Search
While simple, Luminal’s primitive graphs of operations are initially slow [00:06:46]. The core innovation lies in transforming these graphs into much faster ones using compilers, specifically through a search-based approach [00:07:03].
Traditional ML Stack vs. Luminal
A traditional ML stack often involves many layers (e.g., Hugging Face Transformers on PyTorch, Xformers, optimized kernels, then calling cuDNN/cuBLAS on CUDA) [00:07:22]. This creates complex dependencies, leading to “dependency hell” during installation and making bug tracing difficult [00:07:53].
Luminal simplifies this by directly emitting CUDA code [00:08:21]. Nothing sits between Luminal’s library, its graph and compilers, and the CUDA code they generate [00:08:29].
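As a hedged sketch of what “directly emitting CUDA” can look like: the compiler walks the graph and produces kernel source as a string, which would then be JIT-compiled and launched. The generator below is invented for illustration and is far simpler than what Luminal actually produces.

```rust
// Hypothetical, minimal code generator: lower one fused elementwise expression
// into CUDA C source. Real codegen must handle shapes, strides, dtypes, and
// launch configuration; this only shows the "graph in, kernel source out" idea.
fn emit_elementwise_kernel(name: &str, body_expr: &str) -> String {
    format!(
        r#"extern "C" __global__ void {name}(const float* a, const float* b, float* out, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        out[i] = {body_expr};
    }}
}}"#
    )
}

fn main() {
    // e.g. a fused (a + b) / b lowered to a single kernel body
    let src = emit_elementwise_kernel("fused_add_div", "(a[i] + b[i]) * (1.0f / b[i])");
    println!("{src}");
    // The generated source would then be JIT-compiled (e.g. via NVRTC) and launched.
}
```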
The Search-Based Compiler Solution
The complexity of compilers scales exponentially with the complexity of the code they need to generate [00:09:07]. This has bottlenecked the ecosystem, especially for hardware startups with specialized hardware [00:09:36]. As hardware becomes simpler and faster (e.g., from CPUs to GPUs to TPUs, which require more explicit programmer control for better performance per watt), the software/compiler needs to become more complex [00:09:50].
This leads to the VLIW (Very Long Instruction Word) compiler problem: hardware designers want simple hardware, which requires the compiler to statically schedule everything, but compilers capable of this become too complex for humans to write [00:11:22].
Luminal solves this by applying the same solution used by AlphaGo for cracking the game of Go: search [00:11:57]. Instead of hand-writing perfect algorithms, Luminal searches through logically equivalent GPU kernels [00:12:42].
How Search Works
- Graph to Expressions: Luminal converts its operation graphs into expressions using the egglog library, which represents the search space efficiently using e-graphs [00:13:10].
- Rewrite Rules: Luminal defines 20-25 simple rewrite rules [00:13:34]. Each rule makes a small, logically equivalent alteration to a given GPU kernel [00:13:37].
- Search Space Generation: By iteratively applying these simple rewrite rules, a very large search space of equivalent kernels is built [00:13:54].
- Performance Profiling: Luminal then profiles the runtime of these different equivalent kernels and selects the fastest one [00:14:07]. For larger search spaces, techniques like Monte Carlo search are used to prune the search [00:14:25].
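The sketch below illustrates this e-graph workflow using the egg crate (a close relative of egglog, used here because its Rust API is compact). The tiny expression language, the specific rules, and the cost function are all stand-ins; Luminal extracts the winner by profiling real kernel runtimes rather than by a static cost like `AstSize`.

```rust
// Cargo.toml: egg = "0.9"
use egg::{rewrite as rw, *};

// A toy kernel-expression language; real expressions carry far more detail
// (shapes, strides, loop structure, memory levels).
define_language! {
    enum KernelExpr {
        "add" = Add([Id; 2]),
        "mul" = Mul([Id; 2]),
        "sum" = SumReduce(Id),
        Symbol(Symbol),
    }
}

fn main() {
    // A few small, equivalence-preserving rewrite rules.
    let rules: Vec<Rewrite<KernelExpr, ()>> = vec![
        rw!("add-comm";   "(add ?a ?b)"          => "(add ?b ?a)"),
        rw!("mul-comm";   "(mul ?a ?b)"          => "(mul ?b ?a)"),
        rw!("mul-assoc";  "(mul (mul ?a ?b) ?c)" => "(mul ?a (mul ?b ?c))"),
        rw!("distribute"; "(mul ?a (add ?b ?c))" => "(add (mul ?a ?b) (mul ?a ?c))"),
    ];

    // Start from a naive expression and grow an e-graph of equivalent forms.
    let start: RecExpr<KernelExpr> = "(sum (mul (add a b) c))".parse().unwrap();
    let runner = Runner::default().with_expr(&start).run(&rules);

    // Pick the "best" equivalent. Luminal profiles candidate kernels on the GPU
    // (pruning large spaces with Monte Carlo search); AstSize is a placeholder cost.
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("best (cost {cost}): {best}");
}
```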
Optimizations Found Through Search
- Kernel Fusion: This optimization merges multiple operations into a single kernel to minimize data movement to and from global memory, which typically accounts for 99% of the energy and time spent on GPUs [00:14:45] (a minimal sketch follows this list).
- An unfused graph involves writing results to memory and reading them back for each sequential operation [00:15:56].
- A fused kernel merges these, drastically reducing data movement and making the entire aggregate graph far faster [00:16:11].
- Flash Attention: Luminal’s search technique was able to independently discover Flash Attention, a complex algorithm that took the industry five years to find [00:16:40]. By running simple rewrite rules on a naive multi-head attention graph and profiling the search space, Luminal identified Flash Attention as the fastest kernel [00:17:15]. This is believed to be unique among compilers [00:17:33].
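To see why fusion pays off, here is a CPU-side analogy in Rust with invented function names: the unfused version materializes an intermediate buffer between the add and the sin, while the fused version streams each element through both operations in a single pass.

```rust
// Toy illustration of fusion on the CPU; on a GPU the intermediate buffer lives
// in global memory, so the extra write and re-read dominate energy and time.

// Unfused: two passes, one intermediate buffer written and then read back.
fn add_then_sin_unfused(a: &[f32], b: &[f32]) -> Vec<f32> {
    let tmp: Vec<f32> = a.iter().zip(b).map(|(x, y)| x + y).collect(); // write tmp
    tmp.iter().map(|t| t.sin()).collect()                              // read tmp again
}

// Fused: one pass, no intermediate traffic.
fn add_then_sin_fused(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| (x + y).sin()).collect()
}

fn main() {
    let a = vec![0.1_f32; 4];
    let b = vec![0.2_f32; 4];
    assert_eq!(add_then_sin_unfused(&a, &b), add_then_sin_fused(&a, &b));
    println!("{:?}", add_then_sin_fused(&a, &b));
}
```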
Deterministic Optimizations
After the search process finds the fastest kernels, Luminal applies deterministic optimizations that are guaranteed to be beneficial:
- Buffer Reuse: By having the entire workload as a graph, the compiler can optimally reuse memory buffers. It identifies buffers that are not simultaneously in use and assigns them to the same memory location, minimizing memory usage [00:18:37] (see the sketch after this list).
- Kernel Dispatching: Instead of the traditional CPU-GPU round trip for each kernel launch, Luminal dispatches all kernels at once into a queue, allowing the GPU to run through them sequentially and saving significant time [00:19:31].
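A minimal sketch of the buffer-reuse idea, with hypothetical data structures: values whose liveness intervals do not overlap share a physical buffer via greedy first-fit. A real pass would also account for buffer sizes and alignment.

```rust
// Toy buffer-reuse pass: each value is live from the step that produces it to
// the last step that consumes it; values whose intervals don't overlap can
// share one physical buffer.

#[derive(Debug, Clone, Copy)]
struct Liveness {
    start: usize, // node index that produces the value
    end: usize,   // last node index that reads it
}

fn assign_buffers(live: &[Liveness]) -> Vec<usize> {
    // free_at[b] = first step at which physical buffer b is free again
    let mut free_at: Vec<usize> = Vec::new();
    let mut assignment = Vec::with_capacity(live.len());
    for l in live {
        // Reuse the first buffer that is no longer live, else allocate a new one.
        match free_at.iter().position(|&f| f <= l.start) {
            Some(b) => {
                free_at[b] = l.end + 1;
                assignment.push(b);
            }
            None => {
                free_at.push(l.end + 1);
                assignment.push(free_at.len() - 1);
            }
        }
    }
    assignment
}

fn main() {
    // Four intermediate tensors from a small graph, in production order.
    let live = [
        Liveness { start: 0, end: 1 }, // dies early, so its buffer can be reused
        Liveness { start: 1, end: 2 },
        Liveness { start: 2, end: 3 }, // reuses buffer 0 (free from step 2)
        Liveness { start: 3, end: 4 }, // reuses buffer 1
    ];
    println!("{:?}", assign_buffers(&live)); // [0, 1, 0, 1]: 4 values, 2 buffers
}
```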
Training Support as an Extension
Luminal was initially designed as an inference library [00:20:25]. However, thanks to its flexible graph representation, an autograd engine was built as an external library that works directly with Luminal [00:20:30]. This engine derives a backward graph from the forward graph and attaches it, allowing Luminal’s compilers (including the search process) to optimize training as well [00:20:43]. This modularity means external contributors can write their own autograd engines or other advanced training setups [00:21:18].
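A compact, hypothetical sketch of the idea of deriving a backward graph from a forward graph: only add and multiply are handled, and the gradient nodes are simply appended to the same graph, which is what lets the same compilers optimize the training workload.

```rust
// Hypothetical IR, not Luminal's: derive a backward graph from a forward graph
// by appending gradient nodes to the same node list, so the same compiler
// passes (including search) can optimize the training graph.

#[derive(Debug, Clone, Copy)]
enum Op {
    Input,
    Add(usize, usize),
    Mul(usize, usize),
}

// Walk the forward nodes in reverse and append gradient nodes for `output`.
// Returns, for each forward node, the index of its gradient node (if any).
fn append_backward(nodes: &mut Vec<Op>, output: usize) -> Vec<Option<usize>> {
    let forward_len = nodes.len();
    let mut grad: Vec<Option<usize>> = vec![None; forward_len];

    nodes.push(Op::Input); // seed gradient of the output (all ones), modeled as an input
    grad[output] = Some(nodes.len() - 1);

    for i in (0..forward_len).rev() {
        let Some(g) = grad[i] else { continue };
        let op = nodes[i];
        match op {
            Op::Input => {}
            // d(a + b): the upstream gradient flows unchanged to both operands.
            Op::Add(a, b) => {
                accumulate(nodes, &mut grad, a, g);
                accumulate(nodes, &mut grad, b, g);
            }
            // d(a * b)/da = b and d(a * b)/db = a, so emit two new Mul nodes.
            Op::Mul(a, b) => {
                nodes.push(Op::Mul(g, b));
                let ga = nodes.len() - 1;
                accumulate(nodes, &mut grad, a, ga);
                nodes.push(Op::Mul(g, a));
                let gb = nodes.len() - 1;
                accumulate(nodes, &mut grad, b, gb);
            }
        }
    }
    grad
}

// Sum multiple gradient contributions to the same node with an Add node.
fn accumulate(nodes: &mut Vec<Op>, grad: &mut [Option<usize>], node: usize, g: usize) {
    let combined = match grad[node] {
        None => g,
        Some(existing) => {
            nodes.push(Op::Add(existing, g));
            nodes.len() - 1
        }
    };
    grad[node] = Some(combined);
}

fn main() {
    // forward graph: out = (x + y) * x
    let mut nodes = vec![Op::Input, Op::Input]; // 0: x, 1: y
    nodes.push(Op::Add(0, 1));                  // 2: x + y
    nodes.push(Op::Mul(2, 0));                  // 3: out
    let grads = append_backward(&mut nodes, 3);
    println!("graph now has {} nodes; d(out)/dx is node {:?}", nodes.len(), grads[0]);
}
```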
Future Developments
Luminal has ambitious future plans:
- More Hardware Support: Expanding support beyond CPU, CUDA, and Metal to include AMD, Tenstorrent, Groq, and TPUs, aiming to democratize ML across various hardware [00:21:41].
- Distributed Inference and Training: Implementing full 3D distributed capabilities, including data parallel, pipeline parallel, and tensor parallel [00:22:04].
- Reinforcement Learning (RL) Optimization: Codifying environments within the Luminal graph so that both the model’s forward pass and environment steps can be optimized and run entirely on the GPU, significantly accelerating RL workflows [00:22:16].
- Luminal Cloud: Offering a serverless inference endpoint by allowing users to export Luminal models as graphs, upload them to the cloud, and get an optimized endpoint [00:23:06]. Luminal handles optimization, batching, queuing, and machine provisioning, with users paying only when their graph executes, aiming for the simplest and fastest cloud experience [00:23:22].
The simplicity of Luminal’s design allows for faster innovation compared to more complex frameworks [00:24:03].