From: redpointai

Jonathan Frankle, Chief AI Scientist at Databricks, has a notable bet that the Transformer architecture will remain the dominant architecture for AI models [00:02:43]. He takes a long-term view of the question [00:02:59].

The Dominance of Transformers

Frankle notes that in the days following the release of the Transformer paper and BERT, countless papers emerged proposing minor tweaks to BERT [00:03:08]. Ultimately, the models trained today are largely based on the original Transformer architecture of Vaswani et al., often with different positional encodings and using only the decoder [00:03:17]. This suggests that the original Transformer found a “sweet spot” in the hyperparameter space, making significant architectural changes less appealing [00:03:25].
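To make that architectural claim concrete, the sketch below shows the kind of decoder-only Transformer block that most current language models still stack: causal self-attention followed by a feed-forward network, with the positional encoding being one of the few pieces that varies between models. PyTorch, the layer sizes, and the normalization placement here are illustrative assumptions, not details from the conversation.

```python
# Minimal decoder-only Transformer block (illustrative sketch, PyTorch assumed).
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each token attends only to itself and earlier tokens.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + attn_out)       # attention sublayer + residual
        x = self.norm2(x + self.ff(x))     # feed-forward sublayer + residual
        return x

# Usage: a batch of 2 sequences of 16 token embeddings of size 512.
block = DecoderBlock()
tokens = torch.randn(2, 16, 512)
print(block(tokens).shape)  # torch.Size([2, 16, 512])
```

Positional information (sinusoidal, learned, or rotary embeddings, depending on the model) would be added to the inputs or applied inside the attention; as noted above, that is one of the main places where today’s models differ from the 2017 original.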

Comparison with LSTMs

Before the Transformer, the state-of-the-art architecture for natural language processing (NLP) was recurrent neural networks, specifically LSTMs (Long Short-Term Memory networks) [00:03:37]. Frankle points out that he is actually one year older than LSTMs [00:03:50].

Key comparisons include:

  • Difficulty of Discovery: Good architectures are exceptionally difficult to find; it took roughly a generation to go from the previous state-of-the-art architecture (the LSTM) to the current one (the Transformer) [00:03:53].
  • Hypothetical Scaling: There’s an “alternate journey” that wasn’t taken, where LSTMs could have been scaled extensively [00:04:03]. It’s unclear if Transformers are fundamentally superior or if their success is due to where collective energy was focused [00:04:11].
  • Simplicity: Transformers are generally simpler than LSTMs (see the sketch after this list) [00:04:21].
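As a concrete illustration of the simplicity and scaling points above, the sketch below contrasts the two architectures: an LSTM threads hidden and cell state through an inherently sequential loop over time steps, while self-attention relates all positions in one parallel operation. The PyTorch framing and the dimensions are assumptions chosen for illustration, not an explanation Frankle offers for the Transformer’s success.

```python
# Contrast of LSTM recurrence vs. Transformer self-attention (illustrative sketch, PyTorch assumed).
import torch
import torch.nn as nn

d_model, seq_len, batch = 256, 32, 4
x = torch.randn(batch, seq_len, d_model)

# LSTM: gated recurrence; state is carried step by step through the sequence.
lstm = nn.LSTM(d_model, d_model, batch_first=True)
out_lstm, (h_n, c_n) = lstm(x)            # internally a loop over 32 time steps

# Self-attention: every position attends to every other in one parallel operation.
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
out_attn, _ = attn(x, x, x, need_weights=False)

print(out_lstm.shape, out_attn.shape)     # both: torch.Size([4, 32, 256])
```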

Architectural Evolution and Future Outlook

Frankle cautions against the common newcomer belief that a “next big thing” in AI architecture is always just around the corner [00:04:26]. He emphasizes that science tends to advance in “big leaps” followed by periods of consolidation [00:04:42]. From this perspective, the notion that something will suddenly “blow the Transformer out of the water” seems “ahistorical” [00:04:52].

Despite the emergence of new AI models and infrastructure, such as OpenAI’s o1 model, Frankle suggests that breakthroughs are often only recognized with hindsight [00:42:15]. He highlights that while new ideas are constantly circulating, the true achievement lies in scaling them effectively, which is often an engineering feat as much as a scientific one [00:43:15].

Frankle says he still feels “very good” about his bet on the Transformer’s continued dominance [00:04:59].