From: aidotengineer
Luminal aims for “radical simplification through search” in Machine Learning (ML) libraries [00:00:16]. The core philosophy is to achieve performance and capability without complexity by utilizing compilers and search [00:00:26].
The Problem with Current ML Libraries
Deep learning is fundamentally simple, relying on linear algebra with basic operations such as addition, multiplication, and matrix multiplication [00:00:36]. However, existing ML libraries are highly complex [00:01:01].
For example, PyTorch has over 1,200 operations and 15 different data types, supporting various devices like CPU, CUDA, AMD, TPUs, and NPUs [00:01:10]. This complexity scales multiplicatively with the number of operations, data types, and supported devices (ops × data types × devices) [00:01:42]. Adding a new operation, data type, or device causes complexity to explode [00:01:51]. This has resulted in PyTorch having over 3 million lines of code, and TensorFlow being even larger [00:02:02]. Such complexity leads to more bugs and makes the software difficult to extend or build upon [00:02:15].
Luminal’s Simplification Strategy
Luminal takes a top-down approach, focusing on the minimum necessary components to run ML models [00:02:28]. It uses a small set of very simple “Lego blocks” of operations to build complex models [00:02:47].
Core Operations
Luminal supports only 12 core operations [00:02:53]:
- Unary Operations: exp2, log2, sin, reciprocal, square root [00:02:55]
- Binary Operations: addition, multiplication, modulo, less than [00:03:05]
- Reductions: sum reduce, max reduce [00:03:10]
These 12 operations can support all commercially relevant models, including language models, vision-language models, CNNs, RNNs, and diffusion models [00:03:16]. Other common operations are just combinations of these core ops [00:03:41]:
- Subtraction: Addition and multiplication by -1 [00:03:51]
- Division: Multiplication and reciprocal [00:03:57]
- Matrix Multiply (MatMul): Broadcasted multiply and then sum reduce, leveraging tensor shape metadata manipulation [00:04:03] (see the sketch after this list)
- Convolution: Pooling through shape trackers and then a matmul with the convolution kernel [00:04:21]
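To make the decompositions concrete, here is a small sketch in plain NumPy (not Luminal’s API; the array names and shapes are arbitrary examples) showing that subtraction, division, and matrix multiplication are exact combinations of the primitive operations:

```python
# Sketch: common ops expressed with only the primitive set, checked against NumPy.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 5)), rng.standard_normal((4, 5))

# Subtraction = addition + multiplication by -1
sub = a + (b * -1.0)
assert np.allclose(sub, a - b)

# Division = multiplication + reciprocal
div = a * np.reciprocal(b)
assert np.allclose(div, a / b)

# Matrix multiply = broadcasted multiply, then sum-reduce over the shared axis.
# Shapes: (4, 5) x (5, 3) -> broadcast to (4, 5, 3), reduce axis 1 -> (4, 3).
m, n = rng.standard_normal((4, 5)), rng.standard_normal((5, 3))
matmul = (m[:, :, None] * n[None, :, :]).sum(axis=1)
assert np.allclose(matmul, m @ n)
```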
Static vs. Dynamic
Many existing libraries were built 5-10 years ago with dynamism at their core for experimentation, prioritizing hackability over performance [00:04:35]. However, deep learning is not fundamentally dynamic; its inherent dynamism is very small and bounded [00:05:06]. In transformer models, only the KV cache length and sequence length are truly dynamic; the rest of the model is static [00:05:14].
Luminal represents models as directed acyclic graphs (DAGs) of operations [00:05:33]. This approach keeps Luminal extremely simple: the library is under 5,000 lines of code and can be learned in an afternoon [00:06:25].
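As a rough illustration of the idea (this is not Luminal’s internal representation; the node names and the tiny evaluator are invented), a static model can be written down as a dictionary of primitive-op nodes and evaluated in dependency order:

```python
# Minimal sketch: a model as a static DAG of primitive ops, evaluated topologically.
import math

# Each node: (op, list of input node names). "input" nodes are leaves fed at runtime.
graph = {
    "x":  ("input", []),
    "w":  ("input", []),
    "xw": ("mul", ["x", "w"]),   # binary primitive
    "s":  ("sin", ["xw"]),       # unary primitive
    "y":  ("add", ["s", "x"]),   # binary primitive
}

def evaluate(graph, inputs):
    """Evaluate nodes in dependency order; every op is one of the primitives."""
    values = dict(inputs)
    resolved = set(inputs)
    while len(resolved) < len(graph):
        for node, (op, deps) in graph.items():
            if node in resolved or any(d not in resolved for d in deps):
                continue
            args = [values[d] for d in deps]
            values[node] = {"mul": lambda a, b: a * b,
                            "add": lambda a, b: a + b,
                            "sin": lambda a: math.sin(a)}[op](*args)
            resolved.add(node)
    return values

print(evaluate(graph, {"x": 2.0, "w": 3.0})["y"])  # sin(2 * 3) + 2
```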
Compiling for High Performance
While Luminal’s primitive graphs are slow by default, the goal is to transform them into faster graphs using compilers [00:06:54].
Traditional deep learning stacks are very complex, with layers of dependencies like Hugging Face Transformers on top of PyTorch, xformers, cuDNN, cuBLAS, and CUDA [00:07:24]. This leads to “dependency hell” during installation and makes bug tracing difficult [00:08:00]. Luminal simplifies this by directly generating CUDA code, creating a much simpler stack: Luminal → Graph → Compilers → CUDA [00:08:24].
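To illustrate what “directly generating CUDA” can look like in principle (a hedged sketch, not Luminal’s code generator; the kernel template and function names are invented), a binary elementwise op in the graph can be lowered to CUDA source by filling in a template:

```python
# Sketch: lowering one graph op to CUDA source text by template substitution.
ELEMENTWISE_ADD = """
extern "C" __global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}
"""

def emit_elementwise(op_symbol: str, name: str) -> str:
    """Emit CUDA source for a single binary elementwise op from the graph."""
    return (ELEMENTWISE_ADD
            .replace("add_kernel", name)
            .replace("a[i] + b[i]", f"a[i] {op_symbol} b[i]"))

print(emit_elementwise("*", "mul_kernel"))
```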
The Compiler Problem: Complexity and Hardware Trends
ML compilers have historically struggled to take over because their complexity grows rapidly [00:09:04]. As the code a compiler must generate becomes more complex, the compiler itself scales with the square or cube of that growth, eventually becoming too complex for humans to write [00:09:10]. This bottlenecks hardware startups [00:09:40].
Hardware is trending towards simpler designs (e.g., TPUs are simpler and faster than GPUs, which are simpler and faster than CPUs) to achieve better performance per watt [00:10:04]. This pushes more complexity into the software that manages these simpler architectures [00:11:10], leading to the “Very Long Instruction Word (VLIW) compiler problem”: the hardware wants the compiler to statically schedule everything, but such compilers become too complex to write [00:11:22].
Solving Complexity with Search
Luminal addresses this by applying a search-based solution, similar to how AlphaGo conquered Go [00:11:57]. Instead of hand-writing perfect algorithms, Luminal searches through logically equivalent GPU kernels [00:12:42].
The process involves:
- Graph to Expressions: Converting the model graphs into expressions within egglog, a library that uses e-graphs for a memory-efficient representation of the search space [00:13:10].
- Rewrite Rules: Applying 20-25 simple rewrite rules, each making a small, logically equivalent alteration to a GPU kernel [00:13:34].
- Search Space Generation: Iteratively applying these rules to build a vast search space of equivalent kernels [00:13:56].
- Performance Profiling: Testing the runtime of different equivalent kernels and choosing the fastest one [00:14:08]. For larger search spaces, techniques like Monte Carlo search prune the options [00:14:27]. (A toy sketch of this rewrite-and-profile loop follows the list.)
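The toy version below is heavily simplified: it uses a naive enumeration rather than a real e-graph, and a hypothetical op-count cost in place of measured kernel runtime.

```python
# Toy sketch of the rewrite-and-profile loop. Expressions are nested tuples,
# and "cost" stands in for profiling the generated kernels.
def rewrites(expr):
    """Yield logically equivalent forms produced by small local rewrite rules."""
    if not isinstance(expr, tuple):
        return
    op, *args = expr
    # Rule 1: (a * b) + (a * c) -> a * (b + c)  (factor out a shared operand)
    if (op == "add" and len(args) == 2
            and all(isinstance(a, tuple) and len(a) == 3 and a[0] == "mul" for a in args)):
        (_, x1, y1), (_, x2, y2) = args
        if x1 == x2:
            yield ("mul", x1, ("add", y1, y2))
    # Rule 2: x * 1 -> x
    if op == "mul" and len(args) == 2 and args[1] == 1:
        yield args[0]
    # Recurse: apply the rules to any sub-expression
    for i, a in enumerate(args):
        for new_a in rewrites(a):
            yield (op, *args[:i], new_a, *args[i + 1:])

def cost(expr):
    """Stand-in for profiling: count primitive ops (the real search times kernels)."""
    return 0 if not isinstance(expr, tuple) else 1 + sum(cost(a) for a in expr[1:])

def search(expr, iterations=5):
    """Iteratively grow the space of equivalent expressions, then keep the cheapest."""
    space = {expr}
    for _ in range(iterations):
        space |= {new for e in space for new in rewrites(e)}
    return min(space, key=cost)

naive = ("add", ("mul", "a", "b"), ("mul", "a", "c"))
print(search(naive))  # ('mul', 'a', ('add', 'b', 'c')): 2 ops instead of 3
```

A real e-graph avoids the combinatorial blow-up of this enumeration by sharing equivalent sub-expressions, which is what egglog provides.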
Found Optimizations
This search process can discover significant optimizations:
- Kernel Fusion: Merging multiple operations into a single kernel to minimize data movement between global memory and compute units [00:14:45]. Data movement accounts for about 99% of the energy and time spent on GPUs [00:15:23]. By fusing, data is loaded once and written once, drastically improving performance [00:15:38]. For example, an unfused graph with many sequential operations, each requiring a write back to memory and a read back in, can be fused into a single kernel that runs in roughly the same time as any one of the individual operations [00:15:54] (see the fusion sketch after this list).
- Flash Attention Discovery: Luminal’s search technique was able to independently discover Flash Attention, a complex optimization for multi-head attention that took the industry five years to find [00:16:40]. Luminal takes the naive multi-head attention graph, applies simple rewrite rules to build a huge search space, profiles the resulting kernels, and identifies Flash Attention as the fastest [00:17:15]. Luminal is believed to be the only compiler capable of this [00:17:33].
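The fusion sketch below uses plain Python loops as stand-ins for GPU kernels (it is not Luminal’s generated code); the point is that the unfused version writes and re-reads full intermediate buffers, while the fused version touches memory once per element:

```python
# Conceptual sketch of kernel fusion: avoid materializing intermediates.
import math

x = [0.1 * i for i in range(1024)]

# Unfused: three "kernels", each a full pass over memory.
def unfused(x):
    t1 = [math.sin(v) for v in x]   # kernel 1: write 1024 floats
    t2 = [v * v for v in t1]        # kernel 2: read and write 1024 floats
    return [v + 1.0 for v in t2]    # kernel 3: read and write 1024 floats

# Fused: one kernel; intermediates stay in "registers" between ops.
def fused(x):
    out = []
    for v in x:                     # single pass: read x once, write out once
        s = math.sin(v)             # intermediate never touches global memory
        out.append(s * s + 1.0)
    return out

assert unfused(x) == fused(x)
```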
Deterministic Optimizations
After the search process generates fast kernels, Luminal applies deterministic optimizations that are always beneficial:
- Buffer Reuse: Minimizing memory usage by optimally reusing memory buffers [00:18:41]. Since the entire workload is a graph, the compiler can identify when buffers are not concurrently used and assign them to the same memory location [00:18:52]. (A minimal sketch follows this list.)
- Kernel Issuance: Dispatching all kernels to the GPU at once, rather than waiting for the CPU to dispatch each one sequentially and await its completion [00:19:31]. This eliminates CPU-GPU roundtrip delays, saving significant time [00:19:50].
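A minimal sketch of liveness-based buffer reuse (not Luminal’s allocator; the op-list format is invented) shows why a static graph makes this easy: every intermediate’s last use is known before execution starts.

```python
# Sketch: share buffers between intermediates whose live ranges don't overlap.
def assign_buffers(ops):
    """ops: list of (node, inputs) in execution order. Returns node -> buffer id."""
    last_use = {}
    for step, (node, inputs) in enumerate(ops):
        for dep in inputs:
            last_use[dep] = step            # liveness is known statically
    free, assignment, next_id = [], {}, 0
    for step, (node, inputs) in enumerate(ops):
        # Reuse a freed buffer if one is available, otherwise allocate a new one.
        if free:
            assignment[node] = free.pop()
        else:
            assignment[node] = next_id
            next_id += 1
        # Inputs that are dead after this step release their buffers.
        for dep in inputs:
            if last_use.get(dep) == step and dep in assignment:
                free.append(assignment[dep])
    return assignment

# a -> b -> c -> d chain: dead buffers are recycled immediately.
ops = [("a", []), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
print(assign_buffers(ops))  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}: 2 buffers for 4 nodes
```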
Training and Future Directions
Luminal was originally designed as an inference library [00:20:25], but its flexible graph representation allowed an autograd engine to be built externally [00:20:30]. The engine derives a backward graph from a forward graph, so training comes essentially for free: all of the inference compilers and the search process also apply to the backward pass [00:20:50]. The same extensibility lets external contributors develop their own autograd engines or advanced training setups [00:21:20].
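A hedged sketch of the idea (not Luminal’s autograd; the graph format, helper names, and op set are invented, and only add/mul gradients are shown): the backward pass is emitted as more nodes built from the same primitive ops, so everything downstream of the graph applies to it unchanged.

```python
# Sketch: derive a backward graph from a forward graph of primitive ops.
def backward(forward, output):
    """forward: dict node -> (op, input_names), in topological order.
    Returns (graph, grads): forward plus backward nodes, and node -> gradient node."""
    graph = dict(forward)
    graph["one"] = ("const", [])                 # seed: d(output)/d(output) = 1
    grads = {output: "one"}

    def accumulate(node, grad_name):
        if node in grads:                        # sum contributions with an 'add' node
            name = f"gsum_{node}_{len(graph)}"
            graph[name] = ("add", [grads[node], grad_name])
            grads[node] = name
        else:
            grads[node] = grad_name

    for node in reversed(list(forward)):         # walk the forward graph backwards
        if node not in grads or forward[node][1] == []:
            continue
        op, (a, b) = forward[node]               # all shown ops are binary
        if op == "add":                          # d(a+b)/da = d(a+b)/db = 1
            accumulate(a, grads[node])
            accumulate(b, grads[node])
        elif op == "mul":                        # d(a*b)/da = b, d(a*b)/db = a
            for x, other in ((a, b), (b, a)):
                name = f"g_{node}_{x}"
                graph[name] = ("mul", [grads[node], other])
                accumulate(x, name)
    return graph, grads

# Forward graph: y = (x * w) + w
fwd = {"x": ("input", []), "w": ("input", []),
       "xw": ("mul", ["x", "w"]), "y": ("add", ["xw", "w"])}
graph, grads = backward(fwd, "y")
print(grads["w"], graph[grads["w"]])  # dy/dw = x + 1, expressed as new graph nodes
```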
Future plans for Luminal include:
- Expanded Hardware Support: Adding support for AMD, Tenstorrent, Groq, and TPUs, beyond the current CPU, CUDA, and Metal support, to democratize ML across hardware [00:21:41].
- Distributed Inference and Training: Implementing full 3D distributed capabilities through data parallel, pipeline parallel, and tensor parallel [00:22:04].
- Reinforcement Learning (RL) Acceleration: Codifying RL environments directly into the Luminal graph to allow the environment and model to be optimized together and run entirely on the GPU [00:22:18]. This could dramatically accelerate RL workflows [00:22:50].
Luminal Cloud
Luminal also offers a cloud service [00:23:06]. Users can export their Luminal model graphs, upload them to the cloud, and get a serverless inference endpoint [00:23:11]. The service handles optimization, batching, queuing, and machine provisioning, with users only paying when their graph is executing [00:23:22]. This aims to be the simplest and fastest ML cloud experience [00:23:36].