From: redpointai

The evolution and deployment of AI models, particularly large language models (LLMs) like Gemini, are driving significant shifts in underlying infrastructure and computational strategy. Noam Shazeer and Jack Rae, who work on Google’s Gemini effort, discuss these changes, highlighting the growing importance of test time compute and the implications for future AI development [00:00:57].

Shifting from Training to Inference

Historically, the primary focus of AI infrastructure has been on large-scale pre-training of models. However, with the advancements in models like Gemini, the emphasis is increasingly shifting towards efficient inference, or “test time compute” [00:01:02].

Jack Rae notes that while initial efforts for Gemini Flash concentrated on reasoning tasks like math and code, the model’s ability to generalize to creative tasks and improve output through “thinking” (applying more compute at inference time) was a pleasant surprise [03:02:04].
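To make the idea of “thinking” concrete, here is a minimal sketch of spending extra inference compute on an intermediate reasoning trace before producing the final answer. The generate function, prompts, and token budgets are hypothetical placeholders for illustration, not Gemini’s actual interface.

```python
# Minimal sketch of test-time "thinking": spend extra decode compute on a
# reasoning trace, then condition the final answer on that trace.
# `generate` is a hypothetical text-generation callable, not a real API.

def answer_with_thinking(generate, question: str, thinking_budget_tokens: int = 1024) -> str:
    # 1) Spend extra inference-time compute producing a reasoning trace.
    thoughts = generate(
        prompt=f"Question: {question}\nThink step by step before answering.",
        max_new_tokens=thinking_budget_tokens,
    )
    # 2) Produce the user-visible answer conditioned on that trace.
    answer = generate(
        prompt=f"Question: {question}\nReasoning: {thoughts}\nFinal answer:",
        max_new_tokens=256,
    )
    return answer
```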

Noam Shazeer attributes this shift to the economic reality that LLM inference is “too cheap” [01:57:51]. Individual operations now cost under 10^-18 dollars, meaning users get millions of tokens per dollar, orders of magnitude cheaper than other common activities [02:27:01]. This leaves a massive margin of “unexploited flops” that can be applied at inference time to make models smarter [02:48:06].
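As a rough sanity check on the “millions of tokens per dollar” figure, the back-of-the-envelope arithmetic below assumes a ~100B-parameter dense model and roughly 2 FLOPs per parameter per generated token; both are illustrative assumptions, not numbers from the conversation.

```python
# Back-of-the-envelope check of the "millions of tokens per dollar" claim.
# Model size and FLOPs-per-token are illustrative assumptions.

params = 100e9                 # assume a ~100B-parameter dense model
flops_per_token = 2 * params   # ~2 FLOPs per parameter per generated token
dollars_per_flop = 1e-18       # "operations cost under 10^-18 dollars"

dollars_per_token = flops_per_token * dollars_per_flop
tokens_per_dollar = 1 / dollars_per_token
print(f"{dollars_per_token:.1e} $/token, ~{tokens_per_dollar:,.0f} tokens per dollar")
# -> 2.0e-07 $/token, ~5,000,000 tokens per dollar
```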

Economic and Resource Considerations

The economics and resource costs of scaling AI models are a central consideration. Training larger models improves performance, but training cost tends to grow roughly quadratically with model size [02:09:47]. In contrast, inference, if done correctly, remains relatively cheap [02:22:04]. This drives the current trend of applying more compute at inference time through methods like Chain of Thought “thinking” [02:30:30].
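A short sketch of why these scalings differ, using the common approximations that training compute is about 6 × parameters × training tokens and per-token inference compute is about 2 × parameters, with training tokens grown in proportion to parameters (a Chinchilla-style assumption, not a figure from the discussion):

```python
# Why training cost grows roughly quadratically with model size while
# per-token inference cost grows roughly linearly.
# Approximations: C_train ~ 6 * N * D, C_infer ~ 2 * N FLOPs/token,
# with training tokens D scaled in proportion to parameters N (assumption).

def training_flops(n_params: float, tokens_per_param: float = 20.0) -> float:
    d_tokens = tokens_per_param * n_params   # D grows with N -> cost ~ N^2
    return 6 * n_params * d_tokens

def inference_flops_per_token(n_params: float) -> float:
    return 2 * n_params                      # cost per token ~ N

for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}: train ~{training_flops(n):.1e} FLOPs, "
          f"infer ~{inference_flops_per_token(n):.1e} FLOPs/token")
# Doubling N roughly quadruples training compute but only doubles
# the per-token cost of inference.
```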

Jack Rae states that if building AI becomes primarily an inference problem, infrastructure needs can be “much more flexible” [06:50:52]: compute can be far more distributed than the large, synchronized batches used in pre-training [06:58:24]. This distributed nature can also help drive prices down, since it optimizes for an intrinsically cheaper setup [07:30:46].

Hardware and Distributed Systems

The Google team benefits from a co-design relationship with its TPU (Tensor Processing Unit) team, allowing them to feed back compute profiles that shape chip and data center designs on a timescale of a few years [07:41:09].

A key challenge in inference, compared to training, is the loss of parallelism in the Transformer architecture: tokens are generated one at a time, and attending over all previous keys and values for every new token makes decoding memory-bound rather than compute-bound [08:04:46]. Significant work in both model architecture and hardware is required to fully apply massive computational power to inference [08:32:00].
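The rough estimate below illustrates why decoding becomes memory-bound: each new token must read the entire KV cache, so the bytes moved per token grow with context length. The layer, head, and context sizes are illustrative assumptions, not Gemini’s configuration.

```python
# Rough estimate of KV-cache traffic during autoregressive decoding.
# Every new token reads the whole KV cache, so decoding is limited by
# memory bandwidth rather than peak FLOPs. Sizes are illustrative.

n_layers   = 48
n_kv_heads = 16
head_dim   = 128
bytes_per_value = 2            # bf16

# keys + values stored per token across all layers
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
context_len = 32_768
kv_cache_bytes = kv_bytes_per_token * context_len

print(f"KV cache at {context_len} tokens: "
      f"{kv_cache_bytes / 1e9:.1f} GB read for each new token")
# With only a few FLOPs performed per byte moved, the accelerator's
# arithmetic units sit idle waiting on memory.
```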

For deploying AI models, particularly agentic ones that interact with environments, there are significant engineering challenges beyond core algorithmic development [09:09:34]. Defining these environments and ensuring efficient orchestration within structured codebases (like Google’s monorepo) are crucial for accelerating agentic research [09:12:06].
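As a hypothetical illustration of what “defining these environments” might involve, the sketch below gives a minimal environment contract and orchestration loop an agent could be run against. The Protocol, method names, and reward signature are assumptions for illustration, not Google-internal APIs.

```python
# Hypothetical sketch of an agent-environment contract plus a simple
# orchestration loop. Names and signatures are illustrative assumptions.

from typing import Protocol


class Environment(Protocol):
    def reset(self) -> str:
        """Return the initial observation (e.g. a task description)."""

    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply an action; return (observation, reward, done)."""


def run_episode(env: Environment, agent, max_steps: int = 50) -> float:
    # The agent proposes actions, the environment responds, and the loop
    # ends when the task is done or the step budget runs out.
    obs, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(agent(obs))
        total_reward += reward
        if done:
            break
    return total_reward
```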