From: aidotengineer
Luminal’s Radical Simplification for Machine Learning
Luminal is presented as an ML library that achieves “radical simplification through search” [00:00:16]. It aims to be far simpler than most other ML libraries without compromising performance or capability, leveraging compilers and search techniques [00:00:20].
The Problem of ML Library Complexity
Deep learning, at its core, is described as fundamentally simple: linear algebra on scalars, vectors, matrices, and tensors, with a handful of core operations such as additions, multiplications, and other element-wise operations [00:00:36]. However, the machine learning software ecosystem is exceedingly complicated [00:01:04].
For instance, PyTorch, a prominent library, has over 1,200 operations and 15 different data types, running on various devices such as CPU, CUDA, AMD, and TPUs [00:01:10]. This complexity scales multiplicatively (operations × data types × devices), leading to massive codebases like PyTorch’s over 3 million lines or TensorFlow’s even larger size [00:01:42]. This complexity breeds bugs and makes the libraries difficult to extend or use [00:02:15].
Luminal’s Simplified Approach
Luminal takes a top-down approach, identifying the minimum required components for ML models [00:02:28]. It reduces deep learning to just 12 fundamental operations, treating models as “Lego blocks” of simple operations [00:02:47]:
- Unary Operations: x², log, sin, reciprocal, square root [00:02:53]
- Binary Operations: addition, multiplication, modulo, less than [00:03:05]
- Reductions: sum reduce, max reduce [00:03:10]
With these operations, Luminal can support all commercially relevant models, including language models, vision language models, CNNs, RNNs, and diffusion models [00:03:14]. More complex operations like subtraction, division, matrix multiplication (matmuls), and convolution are derived by combining these 12 basic operations and manipulating tensor shape metadata [00:03:51].
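As a rough illustration of that composition, here is a minimal Rust sketch (hypothetical types, not Luminal's API) showing subtraction and division expressed purely in terms of the add, multiply, and reciprocal primitives:

```rust
// Minimal sketch (hypothetical types, not Luminal's API) of deriving
// "missing" operations from a small primitive set.
#[derive(Debug)]
enum Node {
    Const(f32),
    Add(Box<Node>, Box<Node>), // primitive: addition
    Mul(Box<Node>, Box<Node>), // primitive: multiplication
    Recip(Box<Node>),          // primitive: reciprocal
}

// Subtraction derived from primitives: a - b == a + (-1 * b)
fn sub(a: Node, b: Node) -> Node {
    Node::Add(
        Box::new(a),
        Box::new(Node::Mul(Box::new(Node::Const(-1.0)), Box::new(b))),
    )
}

// Division derived from primitives: a / b == a * recip(b)
fn div(a: Node, b: Node) -> Node {
    Node::Mul(Box::new(a), Box::new(Node::Recip(Box::new(b))))
}

fn main() {
    // (3 - 1) / 2, expressed entirely with the primitive ops above
    let expr = div(sub(Node::Const(3.0), Node::Const(1.0)), Node::Const(2.0));
    println!("{expr:?}");
}
```

Matrix multiplication falls out the same way: a broadcasted multiply followed by a sum reduction over the shared dimension, with the broadcasting handled by shape metadata rather than a new primitive.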
Luminal also notes that while older libraries were built with dynamism at their core for experimentation with RNNs and LSTMs, deep learning models are fundamentally not dynamic [00:04:35]. In a transformer model, only the KV cache length and sequence length are dynamic; the rest is static [00:05:14]. Luminal specifies models as directed acyclic graphs (DAGs) of operations, allowing for a static representation that can be optimized [00:05:30].
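A hedged sketch of what such a mostly-static DAG might look like, with hypothetical `Dim` and `Node` types standing in for Luminal's actual representation:

```rust
// Sketch (hypothetical types, not Luminal's representation): an op DAG whose
// shapes are fixed when the graph is built, except for explicitly symbolic
// dimensions such as the sequence length in a transformer.
#[derive(Debug)]
enum Dim {
    Static(usize),          // known at graph-build time
    Symbolic(&'static str), // resolved at run time, e.g. "seq_len"
}

#[derive(Debug)]
enum Node {
    Input { shape: Vec<Dim> },
    MatMul { lhs: usize, rhs: usize }, // indices of producer nodes in the DAG
}

fn main() {
    // Activations [seq_len, 4096] multiplied by a static weight [4096, 4096]:
    // only the sequence length is left symbolic; everything else is static,
    // so the whole graph can be optimized ahead of time.
    let dag = vec![
        Node::Input { shape: vec![Dim::Symbolic("seq_len"), Dim::Static(4096)] },
        Node::Input { shape: vec![Dim::Static(4096), Dim::Static(4096)] },
        Node::MatMul { lhs: 0, rhs: 1 },
    ];
    println!("{dag:#?}");
}
```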
This simplification results in Luminal being under 5,000 lines of code, designed to be learnable in an afternoon [00:06:19].
The Inference Process: Compilation Through Search
While Luminal’s primitive graphs are initially slow (e.g., Llama 7B taking all day to generate a sentence [00:06:49]), the core idea is to transform these graphs into much faster ones using compilers [00:07:01].
Luminal’s stack is notably simpler than traditional ML stacks. A traditional stack might involve Hugging Face Transformers on top of PyTorch and xformers, which use handwritten kernels, sometimes calling operations in cuDNN or cuBLAS, all sitting on CUDA [00:07:22]. This creates complex dependencies and “dependency hell” [00:07:53]. In contrast, Luminal directly emits CUDA code, with nothing between its library, graph, compilers, and CUDA [00:08:22].
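To make the "nothing in between" point concrete, here is a toy sketch (not Luminal's code generator) of a Rust function that emits CUDA source directly as a string; the kernel name and body expression are illustrative placeholders:

```rust
// Toy sketch (not Luminal's code generator): emitting CUDA source directly
// as a string, with no framework layers in between.
fn emit_elementwise_kernel(name: &str, body_expr: &str) -> String {
    format!(
        r#"extern "C" __global__ void {name}(const float* in, float* out, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        float x = in[i];
        out[i] = {body_expr};
    }}
}}"#
    )
}

fn main() {
    // e.g. a single kernel computing sin(x) * sin(x) for every element
    let src = emit_elementwise_kernel("fused_sin_square", "sin(x) * sin(x)");
    println!("{src}");
    // Generated source like this could then be JIT-compiled (for example with
    // NVRTC) and launched through the CUDA driver API.
}
```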
Overcoming Compiler Complexity with Search
Traditional ML compilers face a problem: as the complexity of the generated code grows, the compiler’s complexity scales even faster (e.g., squared or cubed) [00:09:07]. This “VLIW compiler problem” (very long instruction word) means human-written compilers become too complex to manage beyond a certain point [00:11:22].
Luminal’s solution is to use search, akin to how AlphaGo tackled the game of Go [00:11:59]. Instead of writing a perfect algorithm, Luminal searches through logically equivalent GPU kernels [00:12:44].
The process involves:
- Graph to Expressions: Converting the operation graphs into expressions within a library called egglog, which uses e-graphs to efficiently represent the search space [00:13:10].
- Rewrite Rules: Applying 20-25 simple rewrite rules, each making a small, logically equivalent alteration to a GPU kernel [00:13:34]. Iterative application of these rules builds a large search space [00:13:56] (a sketch of this rewrite-and-select loop follows the list).
- Profiling and Selection: The system profiles the runtime of different equivalent kernels and chooses the fastest one [00:14:08]. For larger search spaces, Monte Carlo search is used to prune possibilities [00:14:27].
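The sketch below shows the rewrite-and-select loop at a deliberately tiny scale; the `Expr` type, the single fusion rule, and the launch-counting cost function are all stand-ins (the real system works on egglog e-graphs and profiles actual kernel runtimes on the GPU):

```rust
use std::collections::HashSet;

// Sketch (not Luminal's search machinery): grow a space of logically
// equivalent expressions by applying small rewrite rules, then pick the best
// candidate. A toy cost function stands in for the runtime profiler.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Expr {
    Var,                       // the input tensor
    Sin(Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    FusedSinSquare(Box<Expr>), // a single fused kernel
}

// One rewrite rule: sin(x) * sin(x)  =>  fused_sin_square(x)
fn rewrite_fuse(e: &Expr) -> Option<Expr> {
    if let Expr::Mul(a, b) = e {
        if a == b {
            if let Expr::Sin(inner) = a.as_ref() {
                return Some(Expr::FusedSinSquare(inner.clone()));
            }
        }
    }
    None
}

// Toy cost: number of kernel launches (stand-in for measured runtime).
fn cost(e: &Expr) -> usize {
    match e {
        Expr::Var => 0,
        Expr::Sin(x) | Expr::FusedSinSquare(x) => 1 + cost(x),
        Expr::Mul(a, b) => 1 + cost(a) + cost(b),
    }
}

fn main() {
    let start = Expr::Mul(
        Box::new(Expr::Sin(Box::new(Expr::Var))),
        Box::new(Expr::Sin(Box::new(Expr::Var))),
    );

    // Build the (tiny) search space: the original plus every rewrite result.
    let mut space: HashSet<Expr> = HashSet::from([start.clone()]);
    if let Some(rewritten) = rewrite_fuse(&start) {
        space.insert(rewritten);
    }

    // "Profile" each candidate and keep the cheapest.
    let best = space.iter().min_by_key(|e| cost(e)).unwrap();
    println!("chose {best:?} with cost {}", cost(best));
}
```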
Optimizations Found by Search
This search-based compilation finds crucial optimizations:
- Kernel Fusion: Merging multiple operations (e.g., sin followed by x²) into a single kernel to minimize data movement to and from global memory [00:14:45]. Data movement accounts for about 99% of GPU energy and time [00:15:23]. A single fused kernel can be vastly faster than a sequence of unfused operations [00:15:52] (see the fused-vs-unfused sketch after this list).
- Flash Attention: Luminal’s search technique was able to independently discover Flash Attention, a complex algorithm that took the industry about five years to find [00:16:40].
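The fused-vs-unfused distinction can be sketched on the CPU with hypothetical types; the point is that the unfused version materializes an intermediate buffer for every step, while the fused version keeps intermediates in registers:

```rust
// Sketch (hypothetical types): fusing a chain of elementwise ops into a
// single pass so intermediates never touch global memory.
#[derive(Debug)]
enum ElementwiseOp { Sin, Square }

fn apply(op: &ElementwiseOp, x: f32) -> f32 {
    match op {
        ElementwiseOp::Sin => x.sin(),
        ElementwiseOp::Square => x * x,
    }
}

// Unfused: one full pass (and one intermediate buffer) per op.
fn run_unfused(ops: &[ElementwiseOp], input: &[f32]) -> Vec<f32> {
    let mut buf = input.to_vec();
    for op in ops {
        buf = buf.iter().map(|&x| apply(op, x)).collect(); // extra memory traffic
    }
    buf
}

// Fused: one pass total; intermediates stay in registers.
fn run_fused(ops: &[ElementwiseOp], input: &[f32]) -> Vec<f32> {
    input
        .iter()
        .map(|&x| ops.iter().fold(x, |acc, op| apply(op, acc)))
        .collect()
}

fn main() {
    let ops = [ElementwiseOp::Sin, ElementwiseOp::Square];
    let input = [0.0f32, 0.5, 1.0];
    assert_eq!(run_unfused(&ops, &input), run_fused(&ops, &input));
    println!("{:?}", run_fused(&ops, &input));
}
```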
Deterministic Optimizations
After the search process, Luminal applies deterministic optimizations that are always beneficial:
- Buffer Reuse: Minimizing memory usage by optimally reusing memory buffers. Because the entire workload is specified as a graph, the compiler can identify when buffers are no longer needed and reuse their memory [00:18:37] (sketched after this list).
- Batch Kernel Issuance: Issuing all GPU kernels at once, rather than the traditional method of CPU dispatching one kernel, waiting for it to finish, and then dispatching the next [00:19:31]. This eliminates significant round-trip time to the CPU [00:19:50].
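A toy liveness-based buffer-reuse pass, assuming uniform buffer sizes and hypothetical inputs (this is not Luminal's actual allocator):

```rust
// Sketch (not Luminal's allocator): because the whole workload is a static
// graph, each intermediate's last use is known ahead of time, so buffers can
// be recycled instead of freshly allocated. A real pass would also match
// buffer sizes; this toy version assumes they are uniform.
fn assign_buffers(last_use: &[usize]) -> Vec<usize> {
    // last_use[i] = index of the last node that reads node i's output
    //               (usize::MAX keeps the final output alive).
    // Returns, for each node, the id of the physical buffer it writes into.
    let n = last_use.len();
    let mut free: Vec<usize> = Vec::new(); // buffer ids available for reuse
    let mut assignment = vec![0usize; n];
    let mut next_id = 0;

    for node in 0..n {
        // Take a recycled buffer if one is free, otherwise allocate a new one.
        let buf = free.pop().unwrap_or_else(|| {
            next_id += 1;
            next_id - 1
        });
        assignment[node] = buf;

        // Buffers whose last reader is this node become free afterwards.
        for earlier in 0..node {
            if last_use[earlier] == node {
                free.push(assignment[earlier]);
            }
        }
    }
    assignment
}

fn main() {
    // A simple chain a -> b -> c -> d: each output is read only by the next
    // node, so two physical buffers suffice and ping-pong: [0, 1, 0, 1].
    println!("{:?}", assign_buffers(&[1, 2, 3, usize::MAX]));
}
```

The same ahead-of-time view of the graph is what makes batch issuance possible: every kernel can be queued up front because the full execution order is already known, so no CPU round trips are needed between launches.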
The Training Process
Luminal was initially designed for inference [00:20:25]. However, due to its flexible graph representation, an external library (crate) was built to serve as an autograd engine [00:20:30]. This engine derives a backward graph from a given forward graph and attaches it [00:20:43].
This means Luminal gets training “for free,” as all the same compilers and search processes used for inference also work on the backward pass [00:20:52]. This extensibility allows external contributors to write their own autograd engines, gradient sharding, or advanced training setups, a capability presented as unique among ML libraries [00:21:18].
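A minimal sketch of the idea, using a hypothetical expression type rather than the actual crate: differentiating a forward graph yields a backward graph built from the same primitive ops, which the existing compilers can then optimize.

```rust
// Sketch (hypothetical types, not the external autograd crate): deriving a
// backward graph from a forward graph using only the primitive ops.
#[derive(Clone, Debug)]
enum Expr {
    Var(&'static str),
    Const(f32),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// d(expr)/d(wrt), expressed as another Expr over the same primitives.
fn grad(expr: &Expr, wrt: &str) -> Expr {
    match expr {
        Expr::Var(name) => Expr::Const(if *name == wrt { 1.0 } else { 0.0 }),
        Expr::Const(_) => Expr::Const(0.0),
        // Sum rule: d(a + b) = da + db
        Expr::Add(a, b) => Expr::Add(Box::new(grad(a, wrt)), Box::new(grad(b, wrt))),
        // Product rule: d(a * b) = da * b + a * db
        Expr::Mul(a, b) => Expr::Add(
            Box::new(Expr::Mul(Box::new(grad(a, wrt)), b.clone())),
            Box::new(Expr::Mul(a.clone(), Box::new(grad(b, wrt)))),
        ),
    }
}

fn main() {
    // forward graph: y = x * x + x
    let y = Expr::Add(
        Box::new(Expr::Mul(Box::new(Expr::Var("x")), Box::new(Expr::Var("x")))),
        Box::new(Expr::Var("x")),
    );
    // backward graph: dy/dx, left unsimplified for the compiler to optimize
    println!("{:?}", grad(&y, "x"));
}
```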
Future Developments
Luminal aims to expand its capabilities:
- Hardware Support: Adding support for AMD, Tenstorrent, Groq, and TPUs, breaking the dependence on CUDA and democratizing ML across different hardware [00:21:41].
- Distributed Inference and Training: Implementing full 3D distribution, including data parallel, pipeline parallel, and tensor parallel [00:22:04].
- Reinforcement Learning Optimization: Addressing the bottleneck in RL where models run on GPU and environments on CPU. By codifying environments within the Luminal graph, both the model’s forward pass and environment steps can be optimized and run entirely on the GPU [00:22:18].
- Luminal Cloud: A serverless inference endpoint solution where users export a Luminal model graph, upload it, and get an endpoint. Luminal handles optimization, batching, queuing, and machine provisioning, with users paying only for graph execution time [00:23:06].
Luminal’s simplicity allows for faster innovation compared to more complex frameworks [00:24:03].