From: hu-po

Over the past decade, the landscape of machine learning software development has undergone significant transformations. Historically, two major players, NVIDIA’s CUDA and Google’s TensorFlow, dominated the field, shaping how machine learning models were developed and deployed [02:04:00].

NVIDIA’s CUDA Monopoly

For a significant period, most machine learning frameworks relied heavily on NVIDIA’s CUDA and performed best on NVIDIA GPUs [02:11:00]. This is evident in the careers of many deep learning practitioners, who have consistently used NVIDIA GPUs [02:25:00]. NVIDIA’s dominant position was largely due to its software “moat”: the closed-source CUDA libraries that optimize code for its hardware [03:36:00] [05:36:00].

However, while NVIDIA’s GPUs have vastly increased raw compute (flops), memory bandwidth has not kept pace, creating a new bottleneck [14:46:00]. Despite tensor cores dramatically accelerating matrix multiplication, the GPU spends a majority of its time (up to 60%) waiting for data to be shuffled between different levels of the memory hierarchy [10:54:00] [11:51:00] [31:41:00]. This “memory wall” means that simply increasing GPU flops doesn’t proportionally increase performance [29:52:00] [30:31:00].
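
As a rough illustration of the memory wall, consider an elementwise addition in half precision on an A100-class GPU. The hardware figures below are approximate ballpark numbers used purely for illustration, not measurements from the source:

```python
# Back-of-envelope roofline check for an elementwise add (z = x + y) in fp16.
# The hardware numbers are rough A100-class figures and are assumptions
# for illustration only.
peak_flops = 312e12        # ~312 TFLOPS of fp16/bf16 tensor-core compute
peak_bandwidth = 2.0e12    # ~2 TB/s of HBM memory bandwidth

# Elementwise add: 1 FLOP per element, but 6 bytes of memory traffic
# (read 2 bytes of x, read 2 bytes of y, write 2 bytes of z).
arithmetic_intensity = 1 / 6   # FLOPs per byte

# Achievable throughput is capped by whichever resource runs out first.
achievable = min(peak_flops, peak_bandwidth * arithmetic_intensity)
print(f"achievable: {achievable / 1e12:.2f} TFLOPS "
      f"out of {peak_flops / 1e12:.0f} TFLOPS peak")
# -> roughly 0.33 TFLOPS, well under 1% of peak: the operation is
#    memory-bound, so faster tensor cores alone would not speed it up.
```

Matrix multiplications have much higher arithmetic intensity, which is why tensor cores accelerate them so effectively while the surrounding elementwise and normalization operations remain stuck at the memory wall.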

Optimizing for this memory bottleneck often involves writing custom CUDA kernels, which is significantly more difficult than writing simple Python scripts [41:01:00]. CUDA is primarily used by specialists in accelerated computing and requires a deep understanding of hardware architecture, making it less accessible to typical machine learning researchers and scientists [06:09:00] [06:46:00]. This reliance often means machine learning experts depend on CUDA experts to modify and optimize their code [06:51:00] [07:01:00].

TensorFlow’s Trajectory

A few years ago, the machine learning framework ecosystem was fragmented, with TensorFlow as a front-runner, often on par with, or even larger than, PyTorch [05:55:00]. Google appeared poised to control the machine learning industry, having a first-mover advantage, the most commonly used framework (TensorFlow), and successful AI application-specific accelerators (TPUs) [06:10:00] [06:18:00].

However, Google failed to convert this initial advantage into dominance [07:07:00]. By 2022, TensorFlow’s market share in research papers had significantly decreased from roughly 40% to a very small share, while PyTorch rose to nearly 50% [03:17:00] [03:22:00] [03:25:00]. Conferences like ICLR, CVPR, and NeurIPS showed a dramatic shift, with unique PyTorch mentions growing from 10% in 2017 to 70-80% by 2020 [06:44:00] [06:50:00] [06:52:00] [06:54:00].

The primary reason for TensorFlow’s loss of ground was PyTorch’s greater flexibility and usability [08:33:00] [08:36:00]. TensorFlow was initially designed with a compiled-code mindset, requiring users to first define a graph that then needed to be compiled to run [08:42:00] [08:44:00]. This graph-based approach made code harder to understand and debug, as issues only became apparent after graph compilation [09:43:00] [09:46:00].

In contrast, PyTorch offered a more Pythonic, eager-execution workflow, where code is read and executed line by line, as in a scripting language [09:01:00] [09:04:00] [09:06:00]. Although TensorFlow later made eager execution the default, the research community had by then largely embraced PyTorch [10:00:00] [10:01:00] [10:03:00]. This is further exemplified by the fact that nearly every generative AI model that made headlines is based on PyTorch, while Google’s generative AI models (like Imagen and DreamFusion) are based on JAX, a framework that directly competes with TensorFlow [10:08:00] [10:10:00] [10:17:00] [10:19:00].
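
A minimal sketch of what eager execution means in practice: each line below runs immediately, so ordinary Python tools (prints, debuggers, shape checks) work at the exact point a problem occurs, rather than only after a whole graph has been built and compiled. The function and tensor shapes are illustrative, not taken from the source:

```python
import torch

def attention_weights(q, k):
    # Every statement executes as soon as it is reached; there is no
    # separate graph-construction or session step.
    scores = q @ k.transpose(-2, -1)
    print("scores shape:", scores.shape)   # inspecting intermediates just works
    scores = scores / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1)

q = torch.randn(2, 8, 64)   # (batch, tokens, dim)
k = torch.randn(2, 8, 64)
weights = attention_weights(q, k)   # a shape mismatch would raise here, on the offending line
```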

Google remains at the forefront of advanced machine learning models, having invented Transformers and maintaining state-of-the-art results in many areas with models like PaLM, LaMDA, and Chinchilla [08:18:00] [08:22:00] [08:23:00]. However, the company is somewhat isolated within the broader machine learning community due to its preference for its own software stack and hardware [07:13:00] [07:16:00].

The Shifting Landscape: PyTorch 2.0 and OpenAI Triton

The dominance of NVIDIA’s CUDA and the decline of TensorFlow’s market share are being disrupted by the advent of PyTorch 2.0 and OpenAI’s Triton [02:32:00] [02:34:00]. These developments are heralding a new age for deep learning hardware [01:43:00] [01:46:00]. The shift is towards an open-source software stack for machine learning models, moving away from closed-source CUDA [05:33:00] [05:34:00].

PyTorch 2.0, released for early testing in late 2022 with full availability in March 2023, is a major catalyst [48:11:00] [48:13:00] [48:16:00] [48:20:00]. Its primary innovation is the addition of a compiled solution that supports graph execution [48:29:00] [48:31:00]. This approach makes it significantly easier to properly utilize various hardware resources [48:35:00] [48:37:00].
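
The compiled path is exposed through a single entry point, torch.compile, which wraps an ordinary eager-mode model and hands the captured graphs to a compiler backend. A minimal sketch (the model and sizes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
)

# One line opts the model into graph capture and compilation; the default
# backend ("inductor") generates fused kernels for the target hardware.
compiled_model = torch.compile(model)

x = torch.randn(32, 1024)
out = compiled_model(x)   # the first call triggers compilation; later calls reuse it
loss = out.sum()
loss.backward()           # the backward pass benefits from compilation as well
```

Because graph capture happens at the Python level, the same user-facing code can be retargeted at different compiler backends, which is what makes the approach portable across hardware vendors.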

PyTorch 2.0 brings an 86% performance improvement for training on NVIDIA’s A100 GPUs and a 26% improvement on CPUs for inference [48:48:00] [48:51:00] [48:53:00] [48:58:00]. Crucially, these benefits are expected to extend to other GPUs and accelerators from companies like AMD, Intel, Tesla, Google, and Amazon, among many others [49:04:00] [49:06:00] [49:09:00] [49:12:00].

This shift is partly driven by major firms like Meta, who are heavily contributing to PyTorch to achieve higher flops utilization with less effort on their multi-billion dollar training clusters [49:39:00] [49:41:00] [49:43:00]. These companies also want to make their software stack more portable to other hardware to foster competition [49:54:00] [49:55:00].

The Role of Compilers and Operator Fusion

The move to a graph-based execution model in PyTorch 2.0 addresses the memory wall by enabling “operator fusion” [37:08:00] [37:09:00]. Instead of executing each operation separately and writing its intermediate result to memory, multiple operations are fused into a single pass [37:11:00] [37:13:00]. This drastically reduces the back-and-forth memory transfers, which are the main bottleneck [39:59:00] [40:01:00].
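
A hedged sketch of what fusion saves, using a small chain of pointwise operations (the function, sizes, and byte counts are illustrative, and the example assumes a CUDA-capable GPU):

```python
import torch

def gelu_like(x):
    # Three pointwise ops executed in sequence: mul, sigmoid, mul.
    return x * torch.sigmoid(1.702 * x)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
tensor_bytes = x.numel() * x.element_size()   # ~32 MB per tensor

# Eager mode: each op reads its inputs from HBM and writes its output back,
# so the chain moves roughly seven tensors' worth of data (~224 MB).
eager_traffic = 7 * tensor_bytes

# Fused into one kernel: x is read once and the result written once,
# roughly two tensors' worth (~64 MB), i.e. ~3.5x less memory traffic.
fused_traffic = 2 * tensor_bytes

# torch.compile (via TorchInductor) performs this kind of fusion automatically.
fused_fn = torch.compile(gelu_like)
y = fused_fn(x)
```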

PyTorch 2.0 achieves this through components like:

- TorchDynamo, which captures ordinary eager-style Python code into graphs by hooking into CPython’s frame evaluation, so existing code can be traced without rewrites
- AOTAutograd, which traces the backward pass ahead of time so that forward and backward graphs can be optimized together
- PrimTorch, which lowers PyTorch’s 2,000+ operators onto a much smaller set of roughly 250 primitive operators that backends actually need to implement
- TorchInductor, the default compiler backend, which generates OpenAI Triton kernels for GPUs and C++/OpenMP code for CPUs

This integrated approach means that software can be written in a user-friendly manner while benefiting from significant performance improvements through automatic compilation and optimization [01:01:11:00] [01:01:12:00] [01:01:15:00] [01:01:16:00]. It allows for more efficient parallelization over a large base of computational resources [01:01:21:00] [01:01:23:00]. The ability for other hardware accelerators to integrate directly into Triton dramatically reduces the time to build an AI compiler stack for a new piece of hardware, opening up the market for AI hardware and custom ASICs [01:08:40:00] [01:08:43:00] [01:08:46:00] [01:08:48:00].
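
Triton kernels themselves are written in Python and compiled just in time, which is a large part of why they are more approachable than hand-written CUDA. The sketch below follows the canonical vector-add example from the Triton tutorials; the block size and names are illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)          # coalesced reads
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)    # coalesced write

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                   # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
assert torch.allclose(add(x, y), x + y)
```

The programmer reasons about blocks of data rather than individual threads; Triton takes care of memory coalescing, shared-memory staging, and scheduling, and TorchInductor emits kernels in this same language automatically, which is what lowers the barrier for new hardware backends.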

Conclusion

The shift away from closed-source, hardware-specific solutions like CUDA, and the declining dominance of frameworks like TensorFlow, marks a significant evolution in machine learning software development [05:33:00] [05:34:00] [07:07:00]. PyTorch 2.0 and OpenAI’s Triton are driving a future where the software stack for machine learning is more portable and open-source [01:06:06:00] [01:09:06:00]. This fosters greater competition in the AI hardware market, as the ease of use afforded by NVIDIA’s proprietary software is diminishing in importance compared to the economics and architecture of competing chip solutions [01:09:08:00] [01:09:10:00] [01:09:51:00]. The market is becoming more open, allowing different winners to emerge as the industry continues to evolve [01:12:07:00] [01:12:10:00] [01:12:13:00].