From: aidotengineer
Luminal is a machine learning (ML) library designed to achieve radical simplification through the use of search-based compilation techniques [00:00:16]. Its core philosophy is to be significantly simpler than other ML libraries while maintaining high performance and capability by leveraging compilers and, more specifically, search [00:00:20].
The Problem: Complexity in ML Libraries
Deep learning, at its fundamental level, involves simple linear algebra, primarily operations on scalars, vectors, matrices, and tensors, with a few core operations like additions, multiplications, and element-wise operations [00:00:36]. Despite this inherent simplicity, the current ML software ecosystem is highly complicated [00:01:01].
For example, PyTorch, a prominent ML library, features over 1,200 operations and 15 different data types, supporting various devices such as CPU, CUDA, AMD, TPUs, and NPUs [00:01:10]. The complexity of these libraries scales multiplicatively, not additively, with the number of operations, data types, and supported devices [00:01:42]. This means adding support for a new operation, data type, or device can cause complexity to explode [00:01:51]. As a result, PyTorch comprises over 3 million lines of code, and TensorFlow is even larger [00:02:02]. This complexity leads to more bugs and makes it challenging for developers to extend or build upon these systems [00:02:15].
Luminal’s Approach: Radical Simplification
Luminal addresses this complexity by taking a top-down approach, identifying the minimum set of operations required to run ML models [00:02:28]. It operates on the principle that complex models can be constructed from very simple “Lego blocks” of operations [00:02:47].
Minimal Operation Set
Luminal utilizes only 12 fundamental operations [00:02:50]:
- Unary Operations: exp2, log2, sin, reciprocal, square root [00:02:55]
- Binary Operations: addition, multiplication, modulo, less than [00:03:05]
- Reductions: sum reduce, max reduce [00:03:10]
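For a sense of how small this vocabulary is, the operations listed above fit in one short enum. The sketch below is illustrative Rust (Luminal itself is written in Rust); the `PrimOp` type and variant names are assumptions for this example, not Luminal's actual op types.

```rust
// Illustrative only: a tiny enum covering the primitive operations named in
// the list above. These are not Luminal's real op types; they just show how
// small the vocabulary is. Everything else is built by composing these.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PrimOp {
    // Unary
    Exp2,
    Log2,
    Sin,
    Recip,
    Sqrt,
    // Binary
    Add,
    Mul,
    Mod,
    LessThan,
    // Reductions over one axis
    SumReduce,
    MaxReduce,
}
```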
These 12 operations are sufficient to support all commercially relevant models, including language models, vision-language models, CNNs, RNNs, and diffusion models [00:03:14]. Many seemingly complex operations are merely compositions of these simpler ones. For instance:
- Subtraction is addition plus multiplication by -1 [00:03:51].
- Division is multiplication by a reciprocal [00:03:57].
- Matrix multiplication (matmul) can be achieved via a broadcasted multiply followed by a sum reduce [00:04:03].
- Convolution can be performed by combining pooling (via shape trackers) with a matmul against the convolution kernel [00:04:18].
Static Graphs
Traditional ML libraries were built with dynamism at their core, which mattered for early experimental models such as RNNs, where performance was less critical [00:04:35]. However, deep learning is fundamentally not dynamic; the dynamism in modern workloads is small and bounded [00:05:06]. In a transformer model, typically only the KV cache length and sequence length are dynamic; the rest of the model is static [00:05:16].
Luminal specifies models as Directed Acyclic Graphs (DAGs) of operations [00:05:30]. This graphical representation allows for a complete specification of models, from simple matrix multiplies to complex, large-scale models [00:05:37].
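A minimal sketch of this idea, using hypothetical `Node` and `Dim` types rather than Luminal's actual graph structures: the model is a fixed DAG whose structure and most shapes are known up front, and only dimensions like the sequence length and KV cache length stay symbolic.

```rust
// Illustrative sketch, not Luminal's actual graph types: a model is a static
// DAG of primitive ops, and the only symbolic values are the dynamic
// dimensions (here, sequence length and KV cache length).
#[derive(Debug)]
enum Dim {
    Known(usize),          // fixed when the graph is built
    Dynamic(&'static str), // bound at run time, e.g. "seq_len"
}

struct Node {
    op: &'static str,   // e.g. "Mul", "SumReduce"
    inputs: Vec<usize>, // indices of upstream nodes: the DAG edges
    shape: Vec<Dim>,
}

fn main() {
    // A tiny attention-score subgraph (Q·Kᵀ) built from Mul + SumReduce.
    let graph = vec![
        Node { op: "Input(Q)",  inputs: vec![],     shape: vec![Dim::Dynamic("seq_len"), Dim::Known(64)] },
        Node { op: "Input(K)",  inputs: vec![],     shape: vec![Dim::Dynamic("kv_cache_len"), Dim::Known(64)] },
        Node { op: "Mul",       inputs: vec![0, 1], shape: vec![Dim::Dynamic("seq_len"), Dim::Dynamic("kv_cache_len"), Dim::Known(64)] },
        Node { op: "SumReduce", inputs: vec![2],    shape: vec![Dim::Dynamic("seq_len"), Dim::Dynamic("kv_cache_len")] },
    ];
    // The structure is fully known before any data exists, so a compiler can
    // optimize the whole graph ahead of time.
    for (i, n) in graph.iter().enumerate() {
        println!("{i}: {} <- {:?}, shape {:?}", n.op, n.inputs, n.shape);
    }
}
```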
As a result of this simplification, Luminal is remarkably simple: it is under 5,000 lines of code [00:06:25], easy to understand, and designed to be learnable within an afternoon [00:06:30].
Achieving Performance Through Compilers and Search
The simplified graphs are initially very slow (e.g., Llama 7B takes a whole day to generate a single sentence) [00:06:49], but the intention is not to run these primitive graphs directly. Instead, they are fed through compilers that transform them into faster, optimized graphs [00:07:01].
Simplified Stack
A traditional ML stack (e.g., Hugging Face Transformers → PyTorch/XFormers → cuDNN/cuBLAS → CUDA) creates a complex dependency story, leading to “dependency hell” during installation and difficulties in bug tracing [00:07:22]. Luminal directly emits CUDA code, creating a much simpler stack: the Luminal library, its graph, its compilers, and CUDA [00:08:21].
The Compiler Challenge and Search
The core challenge in ML compilers is that their complexity scales non-linearly (e.g., quadratically or cubically) with the complexity of the code they need to generate [00:09:04]. Beyond a certain point, this makes compilers incredibly difficult for humans to write [00:09:31]. Furthermore, as hardware becomes simpler and more uniform (from CPUs to GPUs to TPUs, which deliver more performance per watt), the software and compilers need to become more complex to manage it [00:10:51]. This is the VLIW compiler problem: simple hardware demands overly complex compilers [00:11:22].
Luminal overcomes this bottleneck by leveraging search [00:11:57]. Inspired by AlphaGo, which used search to conquer the game of Go [00:12:01], Luminal searches through logically equivalent GPU kernels [00:12:42].
How Search Works
- Graph to Expressions: Initial graphs are converted into expressions within a library called egglog, which uses e-graphs to efficiently represent a search space of equivalent expressions [00:13:10].
- Rewrite Rules: A small set of simple rewrite rules (20-25) is defined [00:13:34]. Each rule makes a small, logically equivalent alteration to a GPU kernel [00:13:39].
- Search Space Expansion: By iteratively applying these simple rewrite rules, a vast search space of equivalent kernels is built [00:13:56].
- Profiling and Selection: The system then profiles the runtime of various kernels within this search space and selects the fastest one [00:14:08]. For larger spaces, techniques like Monte Carlo tree search are used to prune the search [00:14:27].
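The toy sketch below shows the shape of that loop without the real machinery: in place of egglog, e-graphs, and profiled GPU kernels, it applies two hand-written rewrite rules to a tiny arithmetic expression and uses an operation count as a stand-in cost model. Every type and rule here is illustrative, not part of Luminal.

```rust
use std::collections::HashSet;

// Toy search over logically equivalent expressions. The real system rewrites
// GPU kernels inside an e-graph (via egglog) and profiles actual runtimes;
// here the rewrites act on a tiny arithmetic expression and "profiling" is a
// fake cost model, purely to show the expand-then-select loop.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Expr {
    Var(&'static str),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}
use Expr::*;

// Each rewrite makes one small, equivalence-preserving change.
fn rewrites(e: &Expr) -> Vec<Expr> {
    let mut out = Vec::new();
    match e {
        Add(l, r) => {
            // a*b + a*c  =>  a*(b + c)
            if let (Mul(a1, b), Mul(a2, c)) = (&**l, &**r) {
                if a1 == a2 {
                    out.push(Mul(a1.clone(), Box::new(Add(b.clone(), c.clone()))));
                }
            }
            // a + b  =>  b + a
            out.push(Add(r.clone(), l.clone()));
        }
        Mul(l, r) => out.push(Mul(r.clone(), l.clone())),
        Var(_) => {}
    }
    out
}

// Stand-in for profiling: just count operations.
fn cost(e: &Expr) -> usize {
    match e {
        Var(_) => 0,
        Add(l, r) | Mul(l, r) => 1 + cost(l) + cost(r),
    }
}

fn main() {
    // Start from a*b + a*c (3 ops).
    let start = Add(
        Box::new(Mul(Box::new(Var("a")), Box::new(Var("b")))),
        Box::new(Mul(Box::new(Var("a")), Box::new(Var("c")))),
    );
    // Iteratively apply the rewrite rules to grow a space of equivalents.
    let mut space: HashSet<Expr> = HashSet::from([start]);
    for _ in 0..3 {
        let new: Vec<Expr> = space.iter().flat_map(rewrites).collect();
        space.extend(new);
    }
    // Pick the cheapest candidate (the real system picks the fastest kernel).
    let best = space.iter().min_by_key(|e| cost(e)).unwrap();
    println!("{} candidates; best: {:?} (cost {})", space.len(), best, cost(best));
}
```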
This search approach allows Luminal to find optimal kernels without needing to hand-write complex rules that guarantee fast code [00:12:54].
Types of Optimizations Found
- Kernel Fusion: This common optimization merges multiple operations into a single kernel, drastically reducing data movement between global memory and compute units, which often accounts for ~99% of the energy and time spent on a GPU [00:14:45]. By loading data once, computing multiple operations, and writing the result back once, performance improves dramatically [00:15:38] (see the sketch after this list).
- Flash Attention Discovery: Luminal’s search technique was able to independently discover Flash Attention, a highly complex and crucial optimization for transformers that took the industry five years to find [00:16:40]. This demonstrates the power of search to uncover non-obvious, complex optimizations [00:17:30].
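Here is a sketch of what fusion buys, in plain Rust rather than GPU code: the unfused path runs two "kernels" and materializes an intermediate buffer, while the fused path loads each element once, applies both operations, and stores once, roughly halving the memory traffic.

```rust
// What fusion buys, sketched in plain Rust rather than GPU code. The unfused
// version runs two "kernels" and materializes an intermediate buffer; the
// fused version loads each element once, applies both ops, and stores once.
fn unfused(x: &[f32]) -> Vec<f32> {
    // Kernel 1: y = x * 2   (writes N floats to memory)
    let y: Vec<f32> = x.iter().map(|v| v * 2.0).collect();
    // Kernel 2: z = sin(y)  (reads those N floats back, writes N more)
    y.iter().map(|v| v.sin()).collect()
}

fn fused(x: &[f32]) -> Vec<f32> {
    // One kernel: load once, compute both ops in registers, store once.
    x.iter().map(|v| (v * 2.0).sin()).collect()
}

fn main() {
    let x: Vec<f32> = (0..8).map(|i| i as f32).collect();
    assert_eq!(unfused(&x), fused(&x));
    println!("same result with roughly half the memory traffic");
}
```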
Deterministic Optimizations
After the search process identifies fast kernels, Luminal applies deterministic optimizations that are known to always be beneficial:
- Buffer Reuse: The compiler analyzes the entire workload graph to reuse memory buffers optimally, minimizing overall memory usage [00:18:37]. If buffer A and buffer B are never in use at the same time, they can share the same memory location [00:19:01] (see the sketch after this list).
- Batch Kernel Issuance: Instead of the CPU dispatching one GPU kernel at a time and waiting for its completion, Luminal dispatches all kernels in advance, allowing the GPU to run through them sequentially [00:19:31]. This eliminates significant round-trip time between the CPU and GPU [00:19:50].
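A toy version of the buffer-reuse idea (not Luminal's actual pass): treat each intermediate buffer as live from its first to its last use, and greedily give each new buffer an existing allocation whose previous user has already died.

```rust
// Toy version of the buffer-reuse idea (not Luminal's actual pass). Each
// intermediate buffer is alive from the step that produces it to the last
// step that reads it; buffers whose live ranges never overlap can share one
// allocation.
struct Buffer {
    name: &'static str,
    live: (usize, usize), // (first step used, last step used)
    bytes: usize,
}

fn main() {
    // Listed in order of first use.
    let buffers = [
        Buffer { name: "A", live: (0, 2), bytes: 1024 },
        Buffer { name: "C", live: (1, 4), bytes: 2048 }, // overlaps A and B
        Buffer { name: "B", live: (3, 5), bytes: 1024 }, // never alive with A
    ];

    // Greedy assignment: give each buffer the first allocation that is big
    // enough and whose previous user died before this buffer becomes live.
    let mut allocations: Vec<(usize, usize)> = Vec::new(); // (bytes, last step in use)
    for b in &buffers {
        let reusable = allocations
            .iter()
            .position(|(bytes, last)| *bytes >= b.bytes && *last < b.live.0);
        match reusable {
            Some(i) => allocations[i].1 = b.live.1,        // reuse an existing buffer
            None => allocations.push((b.bytes, b.live.1)), // fresh allocation
        }
        println!("{}: {} allocation(s) so far", b.name, allocations.len());
    }
    let total: usize = allocations.iter().map(|(bytes, _)| bytes).sum();
    println!("peak memory: {total} bytes (a naive allocator would use 4096)");
}
```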
Extending Luminal’s Capabilities
Training Support
While initially designed for inference, Luminal’s flexible graph representation allowed for the development of an external autograd engine [00:20:25]. This engine derives a backward graph from a forward graph, enabling training capabilities. All the existing compilers and search processes for inference also apply to training [00:20:50]. This external extension model is unique among ML libraries, allowing external contributors to build custom autograds, gradient sharding, or other training setups [00:21:08].
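As a sketch of the general technique, not Luminal's engine, a backward graph can be derived mechanically by walking the forward graph in reverse and emitting each op's local derivative; the toy below does this for a two-op expression.

```rust
// Toy reverse-mode autograd over a forward graph of Add/Mul nodes. It shows
// how a backward graph follows mechanically from the forward graph's
// structure; it is not Luminal's actual autograd engine.
enum Op {
    Input(f32),
    Add(usize, usize), // indices of upstream nodes
    Mul(usize, usize),
}

fn main() {
    // Forward graph for f(x, y) = (x + y) * x, with x = 2 and y = 3.
    let graph = vec![
        Op::Input(2.0), // 0: x
        Op::Input(3.0), // 1: y
        Op::Add(0, 1),  // 2: x + y
        Op::Mul(2, 0),  // 3: (x + y) * x
    ];

    // Forward pass: evaluate nodes in topological (index) order.
    let mut value = vec![0.0_f32; graph.len()];
    for (i, op) in graph.iter().enumerate() {
        value[i] = match *op {
            Op::Input(v) => v,
            Op::Add(a, b) => value[a] + value[b],
            Op::Mul(a, b) => value[a] * value[b],
        };
    }

    // Backward pass: walk the same graph in reverse, accumulating each op's
    // local derivative. This is the "derived" backward graph.
    let mut grad = vec![0.0_f32; graph.len()];
    *grad.last_mut().unwrap() = 1.0; // df/df = 1
    for (i, op) in graph.iter().enumerate().rev() {
        match *op {
            Op::Input(_) => {}
            Op::Add(a, b) => {
                let g = grad[i];
                grad[a] += g;
                grad[b] += g;
            }
            Op::Mul(a, b) => {
                let g = grad[i];
                grad[a] += g * value[b];
                grad[b] += g * value[a];
            }
        }
    }

    // f = 10, df/dx = 2x + y = 7, df/dy = x = 2.
    assert_eq!((value[3], grad[0], grad[1]), (10.0, 7.0, 2.0));
    println!("f = {}, df/dx = {}, df/dy = {}", value[3], grad[0], grad[1]);
}
```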
Future Developments
- Expanded Hardware Support: Current support includes CPU, CUDA, and Metal. Future plans aim to support AMD, Tenstorrent, Groq, and TPUs to democratize ML across diverse hardware [00:21:41].
- Distributed Inference and Training: Implementing full 3D distributed capabilities, including data parallel, pipeline parallel, and tensor parallel [00:22:04].
- Reinforcement Learning (RL) Optimization: Codifying environments within the Luminal graph, allowing the environment simulation and model forward pass to run entirely on the GPU [00:22:17]. This could dramatically accelerate RL workflows by eliminating the CPU-GPU bottleneck [00:22:42].
Luminal Cloud
Luminal leverages its graph representation to offer a serverless inference endpoint through the Luminal cloud [00:23:06]. Users can export their Luminal model graphs, upload them, and receive a serverless inference endpoint [00:23:11]. The cloud handles optimization, batching, queuing, and machine provisioning, with users only paying for actual graph execution time [00:23:22]. This aims to deliver the simplest and fastest cloud ML experience [00:23:34].
The simplicity of Luminal’s design allows for faster innovation compared to more complex frameworks [00:24:03].