From: aidotengineer
Luminal is an ML library built on the principle of “radical simplification through search” [00:16:00]. While fundamental deep learning operations are simple linear algebra, operating on scalars, vectors, matrices, and tensors with core operations like addition, multiplication, and element-wise ops [00:36:00], machine learning software ecosystems are highly complex [01:04:00]. PyTorch, for example, has over 1,200 operations, supports 15 data types, and runs on many devices, a multiplicative complexity that pushes its codebase past 3 million lines [01:10:00]. This complexity breeds bugs and makes such libraries difficult to extend [02:15:00].
Luminal addresses this by reducing the core to a minimal set of 12 very simple operations: unary ops (exp2, log2, sin, reciprocal, square root), binary ops (addition, multiplication, modulo, less-than), and reductions (sum reduce, max reduce) [02:50:00]. These fundamental operations can represent all commercially relevant deep learning models, including language models, vision models, CNNs, RNNs, and diffusion models [03:14:00]. More complex operations such as subtraction, division, matrix multiplications (matmuls), and convolutions can be formed by combining these simple ops and manipulating tensor shape metadata [03:41:00].
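To make this concrete, here is a small numpy sketch (illustrative only, not Luminal code; the helper names recip, sum_reduce, and matmul are made up) showing how subtraction, division, and matmul fall out of add, multiply, reciprocal, and sum reduce plus broadcasting via shape metadata:

```python
import numpy as np

def recip(x):             # primitive: reciprocal
    return 1.0 / x

def sum_reduce(x, axis):  # primitive: sum reduce
    return x.sum(axis=axis)

def sub(a, b):
    return a + (-1.0 * b)         # subtraction = add + multiply by -1

def div(a, b):
    return a * recip(b)           # division = multiply by reciprocal

def matmul(a, b):
    # matmul = broadcast multiply + sum reduce over the shared axis;
    # the broadcast itself is pure shape metadata, no data is copied
    return sum_reduce(a[:, :, None] * b[None, :, :], axis=1)

a, b = np.random.rand(4, 5), np.random.rand(5, 3)
assert np.allclose(sub(a, a), 0) and np.allclose(div(a, a), 1)
assert np.allclose(matmul(a, b), a @ b)
```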
Complexity and Compilers
Traditional deep learning libraries were often built with dynamism at their core for experimentation, but this added significant complexity [04:35:00]. Luminal represents models as static directed acyclic graphs (DAGs) of operations, with minimal bounded dynamism (e.g., KV cache and sequence length in transformers) [05:10:00]. This simplification results in Luminal’s core library being under 5,000 lines of code, making it easy to understand and extend [06:19:00].
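As a rough picture of what such a static graph looks like, the following is a hypothetical toy IR (not Luminal’s actual types): each node is a primitive op with fixed shape metadata, and the only dynamism is a bounded symbolic dimension such as sequence length:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                     # a primitive op, e.g. "Exp2", "Add", "SumReduce"
    inputs: list = field(default_factory=list)  # upstream nodes this op reads from
    shape: tuple = ()                           # static shape metadata, e.g. ("seq_len", 4096)

# y = exp2(x) + x, recorded as a static DAG before anything executes;
# the only dynamism is a bounded symbolic dimension like "seq_len".
x = Node("Input", shape=("seq_len", 4096))
e = Node("Exp2", inputs=[x], shape=x.shape)
y = Node("Add", inputs=[e, x], shape=x.shape)
```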
However, a model expressed directly in these primitive operations is inherently slow [06:49:00]. Speed comes from compilers that transform the initial graphs into much faster equivalent graphs [07:01:00]. Unlike traditional stacks that pile up complex dependencies (e.g., Hugging Face Transformers on top of PyTorch, xformers, cuDNN, cuBLAS, and CUDA), Luminal directly emits CUDA code, which keeps the dependency stack shallow and makes debugging easier [07:22:00].
The Problem with Traditional Compilers
Traditional ML compilers face a bottleneck: as the complexity of the generated code grows, the complexity of the compiler scales super-linearly (roughly quadratically or cubically), quickly becoming too difficult for humans to write [09:07:00]. This is the VLIW (Very Long Instruction Word) compiler problem [11:22:00]. It is made worse by the fact that hardware is becoming simpler (TPUs are simpler and faster than GPUs, which are simpler and faster than CPUs), so the low-level details the hardware no longer handles must be managed by ever more complex software [09:50:00].
Search-Based Optimization
Luminal’s solution to this compiler complexity is to use search [11:57:00], similar to how AlphaGo conquered the game of Go [12:01:00]. Instead of hand-writing complex rules for optimal code generation, Luminal searches through logically equivalent GPU kernels to find the fastest one [12:42:00].
How Search Works
- Graph Conversion: The initial operation graphs are converted into expressions within egglog, a library that uses e-graphs to represent the search space efficiently [13:10:00].
- Rewrite Rules: Luminal uses 20-25 simple rewrite rules, each making a small, logically equivalent alteration to a GPU kernel [13:37:00].
- Search Space Generation: By iteratively applying these simple rules, a vast search space of equivalent kernels is built [13:56:00].
- Profiling and Selection: The runtime of each equivalent kernel in the search space is profiled, and the fastest one is chosen [14:07:00]. For very large search spaces, techniques like Monte Carlo Tree Search are used to prune options [14:19:00]. A toy version of this rewrite-and-profile loop is sketched below.
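The sketch below is a heavily simplified stand-in for that loop: expressions are plain Python tuples, two toy algebra rules play the role of Luminal’s kernel rewrites, the equivalence set is enumerated explicitly rather than shared compactly in an e-graph as egglog does, and an op count substitutes for real profiling:

```python
# Expressions are plain tuples, e.g. ("add", ("mul", "a", "b"), ("mul", "a", "c")).
def rewrites(expr):
    """Yield every expression reachable from `expr` by one toy rewrite rule."""
    if not isinstance(expr, tuple):
        return
    op, *args = expr
    # Rule 1: factor a*b + a*c into a*(b+c) (logically equivalent, fewer multiplies).
    if op == "add" and len(args) == 2 and all(isinstance(x, tuple) and x[0] == "mul" for x in args):
        (_, a1, b), (_, a2, c) = args
        if a1 == a2:
            yield ("mul", a1, ("add", b, c))
    # Rule 2: commute the arguments of add/mul.
    if op in ("add", "mul") and len(args) == 2:
        yield (op, args[1], args[0])
    # Also apply rules inside sub-expressions.
    for i, arg in enumerate(args):
        for new in rewrites(arg):
            yield (op, *args[:i], new, *args[i + 1:])

def cost(expr):
    """Stand-in for profiling a kernel: just count the ops in the expression."""
    return 1 + sum(cost(a) for a in expr[1:]) if isinstance(expr, tuple) else 0

def search(expr, iters=4):
    seen, frontier = {expr}, {expr}
    for _ in range(iters):                                  # grow the space of equivalents
        frontier = {n for e in frontier for n in rewrites(e)} - seen
        seen |= frontier
    return min(seen, key=cost)                              # "profile" and keep the fastest

print(search(("add", ("mul", "a", "b"), ("mul", "a", "c"))))
# -> a factored form such as ("mul", "a", ("add", "b", "c")), which costs 2 ops instead of 3
```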
Examples of Search-Found Optimizations
- Kernel Fusion: This optimization combines multiple operations into a single kernel. If operation B consumes the output of operation A (e.g., sin followed by exp2), the naive approach loads the data, performs A, writes the intermediate result to memory, reads it back, performs B, and writes to memory again [14:45:00]. Since data movement accounts for roughly 99% of the energy and time spent on GPUs [15:23:00], kernel fusion merges the two into one kernel that loads the data once and writes the final result once, dramatically reducing runtime [15:38:00]. Luminal’s compiler can merge a long series of operations into a single, much faster kernel [15:54:00]; a toy sketch of the effect follows this list.
- Flash Attention: Luminal’s search technique was able to rediscover Flash Attention, a complex algorithm that took the industry about five years to find (attention arrived with Transformers in 2017; Flash Attention was published by Tri Dao in 2022) [16:40:00]. Starting from a naive multi-head attention graph, applying simple rewrite rules to build a search space, and profiling the resulting kernels, Luminal’s compiler found Flash Attention on its own, demonstrating the power of search for discovering non-obvious, highly complex optimizations [17:15:00].
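Here is a toy Python model of why fusion pays off, using the sin-then-exp2 example above (the explicit loops stand in for generated GPU kernels, which Luminal actually emits as CUDA):

```python
import numpy as np

x = np.random.rand(8).astype(np.float32)

def kernel_a(inp):                       # naive step 1: load x, compute sin, store tmp
    out = np.empty_like(inp)
    for i in range(inp.size):
        out[i] = np.sin(inp[i])
    return out

def kernel_b(inp):                       # naive step 2: load tmp again, compute exp2, store result
    out = np.empty_like(inp)
    for i in range(inp.size):
        out[i] = np.exp2(inp[i])
    return out

def fused_kernel(inp):                   # fused: one load and one store per element;
    out = np.empty_like(inp)             # the intermediate sin value never touches memory
    for i in range(inp.size):
        out[i] = np.exp2(np.sin(inp[i]))
    return out

assert np.allclose(kernel_b(kernel_a(x)), fused_kernel(x))
```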
Deterministic Optimizations
After the search process identifies the fastest kernels, Luminal applies a set of deterministic optimizations that are always beneficial, so they can be applied without any search [18:22:00].
- Buffer Reuse: Luminal minimizes memory usage by reusing memory buffers optimally. Because the entire workload is specified as a graph, the compiler can identify buffers that are never live at the same time and assign them to the same memory, significantly reducing the memory footprint [18:37:00] (a toy allocator is sketched after this list).
- Kernel Issuance: Instead of the traditional pattern where the CPU dispatches a GPU kernel, waits for it to finish, and then dispatches the next, Luminal dispatches all kernels to the GPU at once. This avoids costly CPU-GPU round trips and saves significant time [19:31:00] (see the second sketch below).
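Both ideas are easy to sketch. First, a hypothetical greedy allocator (not Luminal’s implementation) that walks a topologically ordered graph, tracks where each buffer dies, and hands dead buffers to later ops of the same size:

```python
def assign_buffers(ops):
    """Greedy buffer reuse over a topologically ordered op list.

    `ops` is a list of (name, size_in_bytes, inputs). Returns a map from op
    name to a buffer id; ops whose live ranges don't overlap share an id.
    """
    sizes = {name: size for name, size, _ in ops}
    # Index of the last op that still needs each value.
    last_use = {}
    for i, (name, _, inputs) in enumerate(ops):
        last_use[name] = i
        for inp in inputs:
            last_use[inp] = i

    free = {}        # size -> reusable buffer ids
    assignment = {}  # op name -> buffer id
    next_id = 0
    for i, (name, size, inputs) in enumerate(ops):
        if free.get(size):                       # reuse a dead buffer of this size
            assignment[name] = free[size].pop()
        else:                                    # otherwise allocate a fresh one
            assignment[name] = next_id
            next_id += 1
        for inp in inputs:                       # recycle inputs that die here
            if last_use[inp] == i:
                free.setdefault(sizes[inp], []).append(assignment[inp])
    return assignment

# a is dead by the time c runs, so a and c share buffer 0.
graph = [("a", 4096, []), ("b", 4096, ["a"]), ("c", 4096, ["b"]), ("d", 4096, ["b", "c"])]
print(assign_buffers(graph))   # {'a': 0, 'b': 1, 'c': 0, 'd': 2}
```

Second, a toy model of kernel issuance in which the only cost modeled is the fixed CPU-GPU synchronization latency paid on every wait; the FakeDevice class is invented purely for illustration:

```python
import time

class FakeDevice:
    """Stand-in for a GPU queue; each wait() models a costly CPU<->GPU round trip."""
    def __init__(self):
        self.queue = []
    def launch(self, kernel):
        self.queue.append(kernel)          # enqueueing is cheap and asynchronous
    def wait(self):
        time.sleep(0.001)                  # model a fixed synchronization latency
        for kernel in self.queue:
            kernel()
        self.queue.clear()

kernels = [lambda: None] * 100

def run_roundtrip(dev):                    # dispatch, wait, dispatch, wait, ...
    for k in kernels:
        dev.launch(k)
        dev.wait()

def run_batched(dev):                      # dispatch everything, wait once
    for k in kernels:
        dev.launch(k)
    dev.wait()

for fn in (run_roundtrip, run_batched):
    t = time.perf_counter()
    fn(FakeDevice())
    print(fn.__name__, round(time.perf_counter() - t, 3), "s")
# run_roundtrip pays the sync latency 100 times; run_batched pays it once.
```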
Training and Future Directions
Although initially designed as an inference library, Luminal’s flexible graph representation allowed for an external autograd engine to be built [20:25:00]. This engine derives a backward graph from a forward graph, enabling training and leveraging the same compilers and search processes for the backward pass [20:40:00]. This external autograd is a unique feature, as other ML libraries typically integrate training into their core [21:08:00].
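The idea of deriving a backward graph from a forward graph can be sketched with a minimal reverse-mode autograd over a few primitive-style ops (a generic micrograd-style toy, not Luminal’s autograd engine):

```python
import numpy as np

class Var:
    """A value in the forward graph, remembering its parents and local gradients."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, d(self)/d(parent))
        self.grad = 0.0

def add(a, b):
    return Var(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Var(np.sin(a.value), [(a, np.cos(a.value))])

def backward(out):
    """Derive the backward pass: walk the recorded forward graph in reverse
    topological order and accumulate each node's gradient into its parents."""
    order, visited = [], set()
    def topo(node):
        if id(node) not in visited:
            visited.add(id(node))
            for parent, _ in node.parents:
                topo(parent)
            order.append(node)
    topo(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += local_grad * node.grad

x = Var(0.5)
y = mul(sin(x), x)                  # forward graph: y = x * sin(x)
backward(y)                         # backward graph yields dy/dx = sin(x) + x*cos(x)
assert np.isclose(x.grad, np.sin(0.5) + 0.5 * np.cos(0.5))
```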
Future plans for Luminal include:
- Expanded Hardware Support: Supporting AMD, Tenstorrent, Groq, and TPUs, beyond the current CPU, CUDA, and Metal support, to democratize ML across diverse hardware [21:41:00].
- Distributed Inference and Training: Implementing full 3D distribution through data, pipeline, and tensor parallelism [22:04:00].
- Reinforcement Learning (RL) Acceleration: Codifying environments within the Luminal graph so that both the model’s forward pass and the environment step can be optimized and run entirely on the GPU, significantly accelerating RL workflows [22:17:00].
Luminal also offers a cloud service where users can export their Luminal graphs, upload them, and get a serverless inference endpoint. This service handles optimization, batching, queuing, and machine provisioning, with users only paying when their graph is executing [23:06:00]. The simplicity of Luminal’s core design allows for rapid innovation in these areas, tackling problems typically addressed by far more complex frameworks [24:03:00].