From: aidotengineer
Luminal is an ML library focused on radical simplification through search to achieve high performance and capability [00:00:16]. While deep learning is fundamentally simple linear algebra, built from scalars, vectors, matrices, and tensors plus a handful of core operations such as addition, multiplication, and matrix multiplication [00:00:36], the existing machine learning software ecosystem is exceedingly complex [00:01:04].
The Problem with Traditional ML Libraries
Libraries like PyTorch contain over 1,200 operations and 15 different data types, and run on a wide range of devices: CPUs, CUDA GPUs, AMD GPUs, TPUs, and NPUs [00:01:10]. This complexity scales multiplicatively (operations × data types × devices), leading to massive codebases such as PyTorch's roughly 3 million lines of code, or TensorFlow's even larger footprint [00:01:42]. That complexity introduces more bugs and makes the libraries difficult to extend or build upon [00:02:15].
Traditional stacks also involve deep dependency chains, from high-level libraries like Hugging Face Transformers down through optimized kernels, cuDNN/cuBLAS, and CUDA [00:07:22]. Installing and debugging these stacks leads to "dependency hell" and significant pain when tracing down bugs [00:08:04].
Luminal’s Simplified Approach
Luminal takes a top-down approach, identifying the minimum set of operations required to run ML models [00:02:28]. It uses only 12 very simple core operations:
- Unary Operations: exp2, log2, sin, reciprocal, square root [00:02:55]
- Binary Operations: addition, multiplication, modulo, less than [00:03:05]
- Reductions: sum reduce, max reduce [00:03:10]
All other common operations (like subtraction, division, matmuls, and convolutions) can be formed by combining these 12 primitive operations, often by manipulating tensor shape metadata [00:03:41]. For example, a matrix multiply is effectively a broadcasted multiply followed by a sum reduction [00:04:03].
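To make the composition concrete, here is a minimal, self-contained Rust sketch (not Luminal's API; the function name and row-major layout are illustrative assumptions) that computes a matrix multiply exactly as described: a broadcasted element-wise multiply paired with a sum reduction over the shared dimension.

```rust
// Illustrative sketch, not Luminal code: an (M, K) x (K, N) matmul expressed as
// a broadcasted element-wise multiply followed by a sum reduction over K.
// `a` is row-major (M, K); `b` is row-major (K, N); the result is row-major (M, N).
fn matmul_as_broadcast_mul_then_sum(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            // Broadcasted multiply pairs a[i, p] with b[p, j] for every p,
            // and the sum reduce collapses the shared K dimension.
            out[i * n + j] = (0..k).map(|p| a[i * k + p] * b[p * n + j]).sum();
        }
    }
    out
}

fn main() {
    // (2, 3) x (3, 2) example.
    let a: [f32; 6] = [1., 2., 3., 4., 5., 6.];
    let b: [f32; 6] = [7., 8., 9., 10., 11., 12.];
    println!("{:?}", matmul_as_broadcast_mul_then_sum(&a, &b, 2, 3, 2)); // [58, 64, 139, 154]
}
```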
Static Graph Representation
Luminal acknowledges that deep learning is not fundamentally dynamic; earlier dynamism in libraries like PyTorch was for convenience during experimentation [00:05:06]. The dynamism in models like transformers is typically small and bounded (e.g., KV cache length, sequence length) [00:05:14].
Luminal specifies models as directed acyclic graphs (DAGs) of operations [00:05:30]. This graph representation allows for the entire workload to be known ahead of time [00:18:52]. For example, a single dense neural network layer (a matrix multiply) can be represented as loading a tensor, loading a weight, element-wise multiplying them, and then sum reducing the result [00:05:37].
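The sketch below (hypothetical types, not Luminal's actual graph structures) shows what such a static DAG might look like for that dense layer: four nodes whose edges are expressed as input indices, all visible to a compiler before anything runs.

```rust
// Illustrative sketch of a static op DAG; the enum and node layout are assumptions,
// not Luminal's real types. The whole workload is known before anything executes.
#[derive(Debug)]
enum Op {
    LoadInput(&'static str), // load a tensor by name
    Mul(usize, usize),       // broadcasted element-wise multiply of two earlier nodes
    SumReduce(usize, usize), // sum reduce a node over one axis
}

fn main() {
    // Node indices double as edges: each op names the nodes it consumes.
    let graph = vec![
        Op::LoadInput("x"), // 0: activation, shape (batch, in)
        Op::LoadInput("w"), // 1: weight, shape (in, out)
        Op::Mul(0, 1),      // 2: shapes broadcast to (batch, in, out) via metadata only
        Op::SumReduce(2, 1), // 3: sum over the `in` axis -> (batch, out)
    ];
    // A compiler can inspect, rewrite, and schedule this graph ahead of time.
    for (i, op) in graph.iter().enumerate() {
        println!("%{i} = {op:?}");
    }
}
```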
This simplification results in Luminal being under 5,000 lines of code, making it easy to understand and learn within an afternoon [00:06:19].
Achieving Performance Through Compilers and Search
While Luminal’s primitive graphs are slow by default (e.g., Llama 7B takes all day to generate a sentence) [00:06:49], the goal is to transform these graphs into much faster ones using compilers [00:07:03]. Luminal directly generates CUDA code, creating a very simple stack between the library and the GPU [00:08:23].
Overcoming the VLIW Compiler Problem with Search
The complexity of a compiler scales rapidly (roughly as the square or cube) with the complexity of the code it needs to generate, so beyond a certain point compilers become too complex for humans to write [00:09:04]. This is the "VLIW (very long instruction word) compiler problem": the simpler the hardware (as with GPUs and TPUs), the more of the static scheduling burden falls on the compiler [00:09:57].
Luminal solves this by applying the same search-based approach used by AlphaGo [00:11:59]. Instead of hand-writing complex rules to produce fast code, Luminal searches through logically equivalent GPU kernels [00:12:44].
This process involves:
- Graph to E-graph Conversion: Converting the model's graphs into expressions in the egglog library, which uses e-graphs to represent the search space memory-efficiently [00:13:12].
- Rewrite Rules: Applying a small set (roughly 20-25) of simple rewrite rules, each of which makes a small, logically equivalent alteration to a GPU kernel [00:13:34]. Iteratively applying these rules builds up a massive search space [00:13:56].
- Runtime Profiling: Testing the runtime of different equivalent kernels within the search space and choosing the fastest one [00:14:07]. For larger spaces, techniques like Monte Carlo tree search are used to prune options [00:14:27].
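The toy Rust sketch below illustrates the shape of this loop without using egglog or real GPU kernels: a couple of rewrite rules expand a set of logically equivalent expressions, and a stand-in cost function plays the role of the runtime profiler that picks the fastest candidate. All names here are hypothetical.

```rust
use std::collections::HashSet;

// Conceptual sketch only (neither Luminal's nor egglog's API): expand a space of
// logically equivalent expressions with rewrite rules, then keep the cheapest one.
// In Luminal the candidates are GPU kernels and the cost comes from profiling
// real runtimes; here a toy cost function stands in for the profiler.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Expr {
    Var(&'static str),
    Mul(Box<Expr>, Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

// One rewrite rule and its reverse: a*b + a*c  <->  a*(b + c).
fn rewrites(e: &Expr) -> Vec<Expr> {
    let mut out = Vec::new();
    if let Expr::Add(l, r) = e {
        if let (Expr::Mul(a1, b), Expr::Mul(a2, c)) = (l.as_ref(), r.as_ref()) {
            if a1 == a2 {
                out.push(Expr::Mul(a1.clone(), Box::new(Expr::Add(b.clone(), c.clone()))));
            }
        }
    }
    if let Expr::Mul(a, bc) = e {
        if let Expr::Add(b, c) = bc.as_ref() {
            out.push(Expr::Add(
                Box::new(Expr::Mul(a.clone(), b.clone())),
                Box::new(Expr::Mul(a.clone(), c.clone())),
            ));
        }
    }
    out
}

// Stand-in for runtime profiling: fewer multiplies is "faster".
fn cost(e: &Expr) -> usize {
    match e {
        Expr::Var(_) => 0,
        Expr::Mul(a, b) => 2 + cost(a) + cost(b),
        Expr::Add(a, b) => 1 + cost(a) + cost(b),
    }
}

fn main() {
    let v = |s: &'static str| Expr::Var(s);
    // Start from a*b + a*c.
    let start = Expr::Add(
        Box::new(Expr::Mul(Box::new(v("a")), Box::new(v("b")))),
        Box::new(Expr::Mul(Box::new(v("a")), Box::new(v("c")))),
    );
    // Build the space of equivalents by iteratively applying the rewrite rules.
    let mut space: HashSet<Expr> = HashSet::from([start]);
    for _ in 0..3 {
        let new: Vec<Expr> = space.iter().flat_map(|e| rewrites(e)).collect();
        space.extend(new);
    }
    // "Profile" every candidate and keep the cheapest.
    let best = space.iter().min_by_key(|e| cost(e)).unwrap();
    println!("{} equivalent candidates, best: {:?}", space.len(), best);
}
```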
Examples of Search-Discovered Optimizations
- Kernel Fusion: This optimization merges multiple operations into a single GPU kernel to minimize data movement between global memory and compute units (see the sketch after this list) [00:14:45]. Data movement accounts for roughly 99% of the energy and time a GPU spends [00:15:23]. By fusing kernels, data is loaded and written only once, dramatically speeding up complex graphs [00:16:11].
- Flash Attention: Luminal’s search technique was able to independently discover flash attention, a complex algorithm that took the industry five years to find [00:16:42]. This demonstrates the power of search to uncover highly non-obvious optimizations [00:17:37].
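Here is the kernel fusion sketch referenced above, with plain CPU Rust standing in for GPU kernels: the unfused version materializes an intermediate buffer per op, the way separate kernels round-trip through global memory, while the fused version keeps intermediates in registers and touches memory only once on the way in and once on the way out.

```rust
// Conceptual sketch of kernel fusion (CPU Rust standing in for GPU kernels).
// Unfused: each op is a separate pass that reads and writes a full buffer.
fn unfused(x: &[f32]) -> Vec<f32> {
    let a: Vec<f32> = x.iter().map(|v| v * 2.0).collect(); // "kernel" 1: write a
    let b: Vec<f32> = a.iter().map(|v| v.sin()).collect(); // "kernel" 2: read a, write b
    b.iter().map(|v| v.sqrt()).collect()                   // "kernel" 3: read b, write out
}

// Fused: one pass; intermediates stay in registers, memory is touched only
// for the input and the final output.
fn fused(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| (v * 2.0).sin().sqrt()).collect()
}

fn main() {
    let x = vec![0.5f32; 4];
    assert_eq!(unfused(&x), fused(&x)); // same math, far less data movement
    println!("{:?}", fused(&x));
}
```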
Deterministic Optimizations
After the search process identifies fast kernels, Luminal applies a set of deterministic optimizations that are known to always be beneficial:
- Buffer Reuse: The compiler minimizes memory usage by reusing buffers optimally. Because the entire workload graph is known ahead of time, it can identify exactly when a buffer is no longer needed and hand that memory to other buffers (see the sketch after this list) [00:18:37].
- Batch Kernel Issuing: Instead of the CPU dispatching one GPU kernel at a time and waiting for its completion, Luminal dispatches all kernels at once to the GPU. This eliminates the time-consuming round trip to the CPU for each kernel, saving significant time [00:19:31].
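The sketch below (a hypothetical allocator, not Luminal's implementation) illustrates the buffer-reuse idea referenced above: with the whole graph known ahead of time, each intermediate's last use is known too, so freed buffers can be handed to later ops instead of allocating fresh ones.

```rust
// Illustrative liveness-based buffer reuse over a known op graph.
// op_inputs[i] lists the earlier ops whose outputs op i reads.
fn assign_buffers(op_inputs: &[Vec<usize>]) -> Vec<usize> {
    let n = op_inputs.len();
    // last_use[i] = index of the last op that reads the output of op i.
    let mut last_use: Vec<usize> = (0..n).collect();
    for (op, inputs) in op_inputs.iter().enumerate() {
        for &i in inputs {
            last_use[i] = op;
        }
    }
    let mut free: Vec<usize> = Vec::new(); // recycled buffer ids
    let mut next_fresh = 0;
    let mut assigned = vec![0usize; n];
    for op in 0..n {
        // Reuse a freed buffer if one exists, otherwise allocate a new id.
        assigned[op] = free.pop().unwrap_or_else(|| {
            next_fresh += 1;
            next_fresh - 1
        });
        // Release the buffers of inputs whose last reader is this op.
        for &i in &op_inputs[op] {
            if last_use[i] == op {
                free.push(assigned[i]);
            }
        }
    }
    assigned
}

fn main() {
    // A small chain: op 2 reads ops 0 and 1, op 3 reads op 2, op 4 reads op 3.
    let graph = vec![vec![], vec![], vec![0, 1], vec![2], vec![3]];
    // Prints [0, 1, 2, 1, 2]: five op outputs share only three physical buffers.
    println!("{:?}", assign_buffers(&graph));
}
```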
Training Support
Initially designed for inference, Luminal’s flexible graph representation allowed for the creation of an external autograd engine. This engine derives a backward graph from a forward graph, enabling training capabilities. All the compilers and search processes used for inference also apply to the backward pass, effectively providing training for free [00:20:30]. This external extension model is unique among ML libraries, allowing external contributors to build their own autograds or advanced training setups [00:21:18].
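As a rough illustration of the idea (not Luminal's autograd engine), the sketch below records a forward graph as a tape of primitive ops and then derives the backward pass mechanically by walking that graph in reverse, accumulating gradients with the chain rule.

```rust
// Minimal reverse-mode sketch; types and ops are illustrative assumptions.
#[derive(Clone, Copy)]
enum Op {
    Input,
    Add(usize, usize),
    Mul(usize, usize),
}

// Forward pass: evaluate the graph in topological order.
fn forward(tape: &[Op], inputs: &[f32]) -> Vec<f32> {
    let mut vals = Vec::with_capacity(tape.len());
    let mut next_input = 0;
    for op in tape {
        let v = match *op {
            Op::Input => {
                next_input += 1;
                inputs[next_input - 1]
            }
            Op::Add(a, b) => vals[a] + vals[b],
            Op::Mul(a, b) => vals[a] * vals[b],
        };
        vals.push(v);
    }
    vals
}

// Backward pass: derived from the same graph by walking it in reverse,
// giving the gradient of the last node with respect to every node.
fn backward(tape: &[Op], vals: &[f32]) -> Vec<f32> {
    let mut grads = vec![0.0f32; tape.len()];
    *grads.last_mut().unwrap() = 1.0; // d(out)/d(out) = 1
    for (i, op) in tape.iter().enumerate().rev() {
        match *op {
            Op::Input => {}
            Op::Add(a, b) => {
                grads[a] += grads[i];
                grads[b] += grads[i];
            }
            Op::Mul(a, b) => {
                grads[a] += grads[i] * vals[b];
                grads[b] += grads[i] * vals[a];
            }
        }
    }
    grads
}

fn main() {
    // y = a * b + a, with a = 2, b = 3  =>  dy/da = b + 1 = 4, dy/db = a = 2.
    let tape = [Op::Input, Op::Input, Op::Mul(0, 1), Op::Add(2, 0)];
    let vals = forward(&tape, &[2.0, 3.0]);
    let grads = backward(&tape, &vals);
    println!("y = {}, dy/da = {}, dy/db = {}", vals[3], grads[0], grads[1]);
}
```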
Future Developments
Luminal is actively working on:
- More Hardware Support: Expanding support beyond CPU, CUDA, and Metal to include AMD, Tenstorrent, Groq, and TPUs, aiming to democratize ML across diverse hardware [00:21:41].
- Distributed Inference and Training: Implementing full 3D distribution through data parallel, pipeline parallel, and tensor parallel approaches [00:22:07].
- Reinforcement Learning Acceleration: Codifying environments within the Luminal graph to optimize environment steps and model forward passes together on the GPU, dramatically accelerating RL workflows [00:22:18].
Luminal Cloud
Luminal also offers a cloud service that leverages its graph representation. Users can export their Luminal models as files and upload them to the cloud to get a serverless inference endpoint [00:23:06]. The Luminal cloud handles optimization, batching, queuing, and machine provisioning, with users only paying when their graph is executing [00:23:22]. The goal is to provide the simplest, fastest, and most straightforward cloud experience [00:23:36].