From: aidotengineer

Luminal is a machine learning (ML) library designed to achieve radical simplification through the use of search-based compilation techniques [00:00:16]. Its core philosophy is to be significantly simpler than other ML libraries while maintaining high performance and capability by leveraging compilers and, more specifically, search [00:00:20].

The Problem: Complexity in ML Libraries

Deep learning, at its fundamental level, involves simple linear algebra, primarily operations on scalars, vectors, matrices, and tensors, with a few core operations like additions, multiplications, and element-wise operations [00:00:36]. Despite this inherent simplicity, the current ML software ecosystem is highly complicated [00:01:01].

For example, PyTorch, a prominent ML library, features over 1,200 operations and 15 different data types, supporting various devices such as CPU, CUDA, AMD, TPUs, and NPUs [00:01:10]. The complexity of these libraries scales multiplicatively, not additively, with the number of operations, data types, and supported devices [00:01:42]. This means adding support for a new operation, data type, or device can cause complexity to explode [00:01:51]. As a result, PyTorch comprises over 3 million lines of code, and TensorFlow is even larger [00:02:02]. This complexity leads to more bugs and makes it challenging for developers to extend or build upon these systems [00:02:15].

Luminal’s Approach: Radical Simplification

Luminal addresses this complexity by taking a top-down approach, identifying the minimum set of operations required to run ML models [00:02:28]. It operates on the principle that complex models can be constructed from very simple “Lego blocks” of operations [00:02:47].

Minimal Operation Set

Luminal utilizes only 12 fundamental operations [00:02:50]:

  • Unary Operations: exp2, log2, sin, reciprocal, square root [00:02:55]
  • Binary Operations: addition, multiplication, modulo, less than [00:03:05]
  • Reductions: sum reduce, max reduce [00:03:10]

These 12 operations are sufficient to support all commercially relevant models, including language models, vision-language models, CNNs, RNNs, and diffusion models [00:03:14]. Many seemingly complex operations are merely compositions of these simpler ones. For instance:

  • Subtraction is addition and multiplication by -1 [00:03:51].
  • Division is multiplication and reciprocal [00:03:57].
  • Matrix multiplication (matmuls) can be achieved via broadcasted multiply followed by sum reduce [00:04:03].
  • Convolution can be performed by combining pooling (via shape trackers) and a matmul with a convolution kernel [00:04:18].
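
To make the composition concrete, here is a minimal, self-contained Rust sketch (not Luminal's actual API; every function name below is illustrative) that builds subtraction, division, and a small matmul out of nothing but the primitive add, multiply, reciprocal, and sum-reduce operations, applied to plain f32 values instead of tensors:

    // A minimal, self-contained sketch (not Luminal's actual API): the
    // primitives below operate on plain f32 scalars for clarity; in a real
    // library they apply element-wise over tensors.
    fn add(a: f32, b: f32) -> f32 { a + b }
    fn mul(a: f32, b: f32) -> f32 { a * b }
    fn recip(a: f32) -> f32 { 1.0 / a }
    fn sum_reduce(xs: &[f32]) -> f32 { xs.iter().sum() }

    // Subtraction = addition plus multiplication by -1.
    fn sub(a: f32, b: f32) -> f32 { add(a, mul(b, -1.0)) }

    // Division = multiplication plus reciprocal.
    fn div(a: f32, b: f32) -> f32 { mul(a, recip(b)) }

    // Matmul = broadcasted multiply followed by sum reduce.
    // a is m x k (row-major), b is k x n.
    fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
        let mut out = vec![0.0; m * n];
        for i in 0..m {
            for j in 0..n {
                // Multiply the broadcast row/column pair, then sum-reduce it.
                let products: Vec<f32> =
                    (0..k).map(|p| mul(a[i * k + p], b[p * n + j])).collect();
                out[i * n + j] = sum_reduce(&products);
            }
        }
        out
    }

    fn main() {
        assert_eq!(sub(5.0, 3.0), 2.0);
        assert!((div(6.0, 3.0) - 2.0).abs() < 1e-6);
        // Identity matrix times [[1, 2], [3, 4]] returns [[1, 2], [3, 4]].
        let out = matmul(&[1.0, 0.0, 0.0, 1.0], &[1.0, 2.0, 3.0, 4.0], 2, 2, 2);
        assert_eq!(out, vec![1.0, 2.0, 3.0, 4.0]);
        println!("all compositions check out");
    }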

Static Graphs

Traditional ML libraries were built with dynamism at their core, which mattered for early experimental models such as RNNs, where performance was less critical [00:04:35]. However, deep learning is fundamentally not dynamic: the dynamism it does contain is small and bounded [00:05:06]. In a transformer model, typically only the KV cache length and sequence length are dynamic; the rest of the model is static [00:05:16].

Luminal specifies models as Directed Acyclic Graphs (DAGs) of operations [00:05:30]. This graphical representation allows for a complete specification of models, from simple matrix multiplies to complex, large-scale models [00:05:37].
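
The sketch below illustrates the idea of a model as a static DAG using toy Rust types (these are not Luminal's internal structures): each node records its operation and the indices of the nodes it reads from, so the entire computation is visible to a compiler before anything runs.

    // Toy types (not Luminal's internals) for a model expressed as a static
    // DAG of primitive ops.
    #[derive(Debug)]
    enum Op {
        Input(&'static str),       // named graph input (e.g. weights, tokens)
        Mul,                       // element-wise (broadcasted) multiply
        SumReduce { axis: usize }, // reduction along one axis
    }

    #[derive(Debug)]
    struct Node {
        op: Op,
        inputs: Vec<usize>, // indices of upstream nodes in the graph
    }

    fn main() {
        // A tiny "matmul" written as broadcasted Mul followed by SumReduce.
        let graph = vec![
            Node { op: Op::Input("x"), inputs: vec![] },             // node 0
            Node { op: Op::Input("w"), inputs: vec![] },             // node 1
            Node { op: Op::Mul, inputs: vec![0, 1] },                // node 2
            Node { op: Op::SumReduce { axis: 1 }, inputs: vec![2] }, // node 3
        ];
        for (i, node) in graph.iter().enumerate() {
            println!("node {i}: {:?} <- {:?}", node.op, node.inputs);
        }
    }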

As a result of this simplification, Luminal is remarkably small: under 5,000 lines of code [00:06:25], easy to understand, and designed to be learnable in an afternoon [00:06:30].

While the simplified graphs are initially slow (e.g., Llama 7B running for a whole day to generate a single sentence) [00:06:49], the intention is not to run these primitive graphs directly. Instead, these graphs are fed through compilers to transform them into faster, optimized graphs [00:07:01].

Simplified Stack

A traditional ML stack (e.g., Hugging Face Transformers → PyTorch/xFormers → cuDNN/cuBLAS → CUDA) creates a complex dependency story, leading to “dependency hell” during installation and difficulties in tracing bugs [00:07:22]. Luminal instead emits CUDA code directly, yielding a much simpler stack: the Luminal library, its graph, its compilers, and CUDA [00:08:21].

The core challenge in ML compilers is that their complexity scales non-linearly (e.g., quadratically or cubically) with the complexity of the code they need to generate [00:09:04]. Beyond a certain point, this makes the compilers incredibly difficult for humans to write [00:09:31]. Furthermore, as hardware becomes simpler and more uniform (from CPUs to GPUs to TPUs, gaining speed and performance per watt along the way), the software and compilers must become more complex to manage it [00:10:51]. This is the VLIW compiler problem: simple hardware demands overly complex compilers [00:11:22].

Luminal overcomes this bottleneck by leveraging search [00:11:57]. Inspired by AlphaGo, which used search to conquer the game of Go [00:12:01], Luminal searches through logically equivalent GPU kernels [00:12:42].

How Search Works

  1. Graph to Expressions: Initial graphs are converted into expressions within a library called egglog, which uses e-graphs to efficiently represent a search space of equivalent expressions [00:13:10].
  2. Rewrite Rules: A small set of simple rewrite rules (20-25) is defined [00:13:34]. Each rule makes a small, logically equivalent alteration to a GPU kernel [00:13:39].
  3. Search Space Expansion: By iteratively applying these simple rewrite rules, a vast search space of equivalent kernels is built [00:13:56].
  4. Profiling and Selection: The system then profiles the runtime of various kernels within this search space and selects the fastest one [00:14:08]. For larger spaces, techniques like Monte Carlo tree search are used to prune the search [00:14:27].
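
The toy Rust sketch below mimics the last two steps on the CPU: several logically equivalent candidate "kernels" (here just closures that compute the same reduction, standing in for rewritten GPU kernels) are profiled and the fastest is selected. None of these names are Luminal or egglog APIs; real candidates would come from e-graph rewrites and be timed on the device.

    use std::time::Instant;

    // Toy illustration of profiling and selection: the candidates all compute
    // the same sum (up to float rounding), just in different ways.
    fn main() {
        let data: Vec<f32> = (0..1_000_000).map(|i| i as f32 * 0.5).collect();

        let candidates: Vec<(&str, Box<dyn Fn(&[f32]) -> f32>)> = vec![
            ("naive loop", Box::new(|xs: &[f32]| {
                let mut s = 0.0;
                for &x in xs { s += x; }
                s
            })),
            ("iterator sum", Box::new(|xs: &[f32]| xs.iter().sum::<f32>())),
            ("chunked sum", Box::new(|xs: &[f32]| {
                xs.chunks(1024).map(|c| c.iter().sum::<f32>()).sum::<f32>()
            })),
        ];

        // Profile each candidate's runtime and keep the fastest one.
        let mut best: Option<(&str, u128)> = None;
        for (name, kernel) in &candidates {
            let start = Instant::now();
            let result = kernel(&data);
            let elapsed = start.elapsed().as_micros();
            println!("{name}: {result:.1} in {elapsed} us");
            if best.map_or(true, |(_, t)| elapsed < t) {
                best = Some((*name, elapsed));
            }
        }
        println!("selected: {}", best.unwrap().0);
    }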

This search-based approach lets Luminal find fast kernels without hand-writing the complex rules that would otherwise be needed to guarantee fast code [00:12:54].

Types of Optimizations Found

  • Kernel Fusion: This common optimization merges multiple operations into a single kernel, dramatically reducing data movement between global memory and compute units, which often accounts for around 99% of the energy and time spent on a GPU [00:14:45]. By loading data once, computing multiple operations, and writing the result back once, performance improves drastically [00:15:38] (see the sketch after this list).
  • Flash Attention Discovery: Luminal’s search technique was able to independently discover Flash Attention, a highly complex and crucial optimization for transformers that took the industry five years to find [00:16:40]. This demonstrates the power of search to uncover non-obvious, complex optimizations [00:17:30].
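
As a CPU analogy for kernel fusion (hypothetical code, not GPU kernels), the unfused version below materializes an intermediate buffer between two element-wise ops, while the fused version applies both ops in a single pass, so each element is read and written only once:

    // A CPU analogy for kernel fusion. Unfused: each op reads its input from
    // memory and writes its output back, so the intermediate buffer makes a
    // full round trip through memory. Fused: both ops run in one loop.
    fn unfused(x: &[f32]) -> Vec<f32> {
        // "Kernel" 1: multiply by 2, materialized to memory.
        let tmp: Vec<f32> = x.iter().map(|v| v * 2.0).collect();
        // "Kernel" 2: sin, reading tmp back out of memory.
        tmp.iter().map(|v| v.sin()).collect()
    }

    fn fused(x: &[f32]) -> Vec<f32> {
        // One kernel: load each element once, apply both ops, store once.
        x.iter().map(|v| (v * 2.0).sin()).collect()
    }

    fn main() {
        let x: Vec<f32> = (0..8).map(|i| i as f32).collect();
        assert_eq!(unfused(&x), fused(&x));
        println!("fused and unfused agree: {:?}", fused(&x));
    }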

Deterministic Optimizations

After the search process identifies fast kernels, Luminal applies deterministic optimizations that are known to always be beneficial:

  • Buffer Reuse: Because the compiler sees the entire workload graph, it knows exactly when each buffer is needed and can reuse memory optimally, minimizing overall memory usage [00:18:37]. If buffer A and buffer B are never in use at the same time, they can share the same memory location [00:19:01] (see the sketch after this list).
  • Batch Kernel Issuance: Instead of the CPU dispatching one GPU kernel at a time and waiting for its completion, Luminal dispatches all kernels in advance, allowing the GPU to run through them sequentially [00:19:31]. This eliminates significant round-trip time between the CPU and GPU [00:19:50].
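
The buffer-reuse idea can be sketched as a small interval-assignment problem. The Rust below is an illustration only (not Luminal's actual allocator) under the assumption that buffers are visited in order of first use: because the whole graph is known ahead of time, each intermediate buffer has a known lifetime, and buffers with non-overlapping lifetimes can share a slot.

    // Toy sketch of buffer reuse: each buffer records the op that produces it
    // and the last op that reads it; non-overlapping lifetimes share a slot.
    #[derive(Debug)]
    struct Buffer {
        name: &'static str,
        first_use: usize, // index of the op that produces the buffer
        last_use: usize,  // index of the last op that reads it
    }

    fn assign_slots(buffers: &[Buffer]) -> Vec<(&'static str, usize)> {
        // slot_free_at[s] = op index after which slot s is free again.
        let mut slot_free_at: Vec<usize> = Vec::new();
        let mut assignment = Vec::new();
        for b in buffers {
            // Reuse the first slot whose previous tenant is already dead,
            // otherwise open a new slot.
            let slot = match slot_free_at.iter().position(|&f| f < b.first_use) {
                Some(s) => s,
                None => {
                    slot_free_at.push(0);
                    slot_free_at.len() - 1
                }
            };
            slot_free_at[slot] = b.last_use;
            assignment.push((b.name, slot));
        }
        assignment
    }

    fn main() {
        // A and C are never alive at the same time, so they share slot 0.
        let buffers = [
            Buffer { name: "A", first_use: 0, last_use: 2 },
            Buffer { name: "B", first_use: 1, last_use: 4 },
            Buffer { name: "C", first_use: 3, last_use: 5 },
        ];
        println!("{:?}", assign_slots(&buffers)); // [("A", 0), ("B", 1), ("C", 0)]
    }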

Extending Luminal’s Capabilities

Training Support

Although Luminal was initially designed for inference, its flexible graph representation allowed an autograd engine to be developed externally [00:20:25]. The engine derives a backward graph from a forward graph, enabling training. All the existing compilers and search machinery built for inference apply to training as well [00:20:50]. This kind of external extension is unique among ML libraries, allowing outside contributors to build custom autograd engines, gradient sharding, or other training setups [00:21:08].
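
A minimal sketch of what "deriving a backward graph from a forward graph" means, restricted to scalar Add and Mul nodes (illustrative only, not Luminal's autograd crate): walk the forward DAG in reverse, propagating and accumulating gradients according to each op's local derivative.

    // Reverse-mode sketch over a tiny DAG of scalar Add and Mul nodes.
    #[derive(Clone, Copy)]
    enum Op {
        Input(f32),
        Add(usize, usize),
        Mul(usize, usize),
    }

    fn forward(nodes: &[Op]) -> Vec<f32> {
        let mut vals = vec![0.0; nodes.len()];
        for (i, op) in nodes.iter().enumerate() {
            vals[i] = match *op {
                Op::Input(v) => v,
                Op::Add(a, b) => vals[a] + vals[b],
                Op::Mul(a, b) => vals[a] * vals[b],
            };
        }
        vals
    }

    fn backward(nodes: &[Op], vals: &[f32]) -> Vec<f32> {
        let mut grads = vec![0.0; nodes.len()];
        *grads.last_mut().unwrap() = 1.0; // d(output)/d(output) = 1
        for (i, op) in nodes.iter().enumerate().rev() {
            let g = grads[i];
            match *op {
                Op::Input(_) => {}
                // Addition passes the gradient through to both inputs.
                Op::Add(a, b) => {
                    grads[a] += g;
                    grads[b] += g;
                }
                // d(a*b)/da = b and d(a*b)/db = a.
                Op::Mul(a, b) => {
                    grads[a] += g * vals[b];
                    grads[b] += g * vals[a];
                }
            }
        }
        grads
    }

    fn main() {
        // f(x, y) = (x + y) * x with x = 2, y = 3.
        let nodes = [Op::Input(2.0), Op::Input(3.0), Op::Add(0, 1), Op::Mul(2, 0)];
        let vals = forward(&nodes);
        let grads = backward(&nodes, &vals);
        // Expect f = 10, df/dx = 2x + y = 7, df/dy = x = 2.
        println!("f = {}, df/dx = {}, df/dy = {}", vals[3], grads[0], grads[1]);
    }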

Future Developments

  • Expanded Hardware Support: Current support includes CPU, CUDA, and Metal. Future plans aim to support AMD, Tenstorrent, Groq, and TPUs to democratize ML across diverse hardware [00:21:41].
  • Distributed Inference and Training: Implementing full 3D distributed capabilities, including data parallel, pipeline parallel, and tensor parallel [00:22:04].
  • Reinforcement Learning (RL) Optimization: Codifying environments within the Luminal graph, allowing the environment simulation and model forward pass to run entirely on the GPU [00:22:17]. This could dramatically accelerate RL workflows by eliminating the CPU-GPU bottleneck [00:22:42].

Luminal Cloud

Luminal leverages its graph representation to offer a serverless inference endpoint through the Luminal cloud [00:23:06]. Users can export their Luminal model graphs, upload them, and receive a serverless inference endpoint [00:23:11]. The cloud handles optimization, batching, queuing, and machine provisioning, with users only paying for actual graph execution time [00:23:22]. This aims to deliver the simplest and fastest cloud ML experience [00:23:34].

The simplicity of Luminal’s design allows for faster innovation compared to more complex frameworks [00:24:03].