From: hu-po

AI algorithms and their associated computational challenges are central to the development of advanced models like DeepSeek-Math. The choices made in algorithm design, data utilization, and hardware infrastructure significantly impact model performance, training efficiency, and overall cost.

DeepSeek-Math and GRPO

DeepSeek-Math uses a variant of PPO called Group Relative Policy Optimization (GRPO) [00:06:10]. This algorithm, along with the data and engineering effort behind it, is considered key to the model’s performance [00:18:42].

PPO vs. GRPO: Algorithmic Differences

  • PPO (Proximal Policy Optimization): PPO is a foundational reinforcement learning algorithm [00:06:18].

    • It typically samples a single observation (o) [00:07:49].
    • PPO often relies on a learned value function (V(s)) model, which is typically a neural network of comparable size to the policy model [00:19:38]. This introduces substantial memory and computational burden [00:19:53] [00:20:29].
    • The ratio of the current policy to the old policy (π_θ / π_θ_old) is “clipped” to the range [1 − ε, 1 + ε] to prevent erratic behavior during updates [00:12:05].
    • A KL (Kullback–Leibler) divergence penalty, weighted by a coefficient β, keeps the policy from drifting too far from a reference policy (e.g., an SFT model) [00:12:21] [00:12:42].
  • GRPO (Group Relative Policy Optimization): GRPO modifies PPO in several key ways:

    • Group Sampling: Instead of a single observation, GRPO samples a group of G observations from the old policy [00:07:53] [00:19:17].
    • Elimination of Value Function: A key innovation in GRPO is the removal of the separate value model [00:20:36]. Rewards are normalized within each group to compute the advantage, serving as a Monte Carlo-style baseline in place of the learned value function [00:20:55]. This reduces memory and computational burden [01:08:14]; a sketch of the resulting objective follows this list.
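
The sketch below is a minimal, illustrative composition of the pieces above: a group-normalized advantage in place of a learned value function, the clipped probability ratio, and a KL penalty toward a reference policy. It is not DeepSeek’s implementation; the tensor shapes, the helper name, and the default values for ε and β are assumptions.

```python
import torch

def grpo_loss(
    logp_new: torch.Tensor,   # (G, T) token log-probs under the current policy pi_theta
    logp_old: torch.Tensor,   # (G, T) token log-probs under the sampling policy pi_theta_old
    logp_ref: torch.Tensor,   # (G, T) token log-probs under the reference (e.g., SFT) policy
    rewards: torch.Tensor,    # (G,)   one scalar reward per completion in the group
    mask: torch.Tensor,       # (G, T) 1.0 for completion tokens, 0.0 for padding
    eps: float = 0.2,
    beta: float = 0.04,
) -> torch.Tensor:
    # Group-relative advantage: normalize each reward by the group's mean and std.
    # This Monte Carlo-style baseline replaces the learned value function.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
    adv = adv.unsqueeze(-1)  # broadcast the per-completion advantage over its tokens

    # PPO-style clipped surrogate on the ratio pi_theta / pi_theta_old.
    ratio = torch.exp(logp_new - logp_old)
    clipped = ratio.clamp(1 - eps, 1 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # Per-token KL estimate toward the reference policy, keeping the policy close to it.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1

    per_token = surrogate - beta * kl
    mask = mask.to(per_token.dtype)
    return -(per_token * mask).sum() / mask.sum()
```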

Reinforcement Learning Paradigms: On-Policy vs. Off-Policy

Reinforcement learning involves an agent (a policy, typically a neural network) interacting with an environment; the rewards it receives are used to compute gradient updates that refine the policy [00:08:25].

  • On-Policy RL: The gradient update is applied to the same policy that generated the observations [00:09:07].
  • Off-Policy RL: Observations are stored in a replay buffer, and policy updates can use examples from older policies [00:09:15]. This can lead to complications if the experience-collecting policy is too “distant” from the policy receiving the updates [00:09:58].
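
As a toy illustration of the distinction (not from the video), the two update loops below differ only in where the training batch comes from. The “policy” and “learning rule” are purely illustrative stand-ins, chosen only to make the data-flow difference concrete.

```python
import random
from collections import deque

# Toy stand-ins: a "policy" biased by a single weight, and a "learn" step
# that nudges the weight toward the average observed reward.
weight = 0.0

def act() -> float:
    """Generate one experience (here just a noisy reward) with the current policy."""
    return weight + random.gauss(0, 1)

def learn(batch: list[float]) -> None:
    global weight
    weight += 0.1 * (sum(batch) / len(batch) - weight)

def on_policy_step() -> None:
    # Collect a fresh batch with the CURRENT policy, update on it, then discard it.
    batch = [act() for _ in range(32)]
    learn(batch)

replay_buffer: deque = deque(maxlen=10_000)

def off_policy_step() -> None:
    # Store experience in a replay buffer; updates may reuse transitions that
    # were generated by much older versions of the policy.
    replay_buffer.extend(act() for _ in range(32))
    learn(random.sample(list(replay_buffer), k=min(32, len(replay_buffer))))
```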

Computational Challenges and Optimizations

Training large AI models faces significant computational challenges, particularly concerning hardware availability and cost.

Hardware Constraints

  • GPU Poverty: Running large-scale experiments, such as a single GRPO run on the full ~70,000-example NuminaMath-TIR dataset using 8 H100 GPUs, would cost approximately 16 [00:34:10] [00:34:17].
  • GPU Interconnect: While H800 and H100 GPUs have similar computational performance (Teraflops), the H100 has higher interconnect bandwidth, which affects how quickly GPUs can communicate [00:59:30].
  • CUDA Compatibility Issues: Issues with NVIDIA drivers and CUDA versions can lead to significant debugging time, which is perceived as unproductive work [00:46:36] [00:47:06]. Changing GPUs can exacerbate these dependency and config hell issues [01:37:25].

Efficiency Optimizations

DeepSeek’s 40x cost reduction is attributed to an accumulation of many small tricks rather than a single innovation [01:08:41] [01:27:40].

  • Quantization: Using lower-precision data types like FP8 (E4M3 format) enables much faster computation (e.g., roughly 4,000 teraflops for FP8 vs. 67 teraflops for FP32 on an H100) [01:22:56] [01:23:06]. This is a “hardcore hack” from DeepSeek [00:32:07]; a small FP8 example follows this list.
  • Low-Level GPU Code: DeepSeek engineers reportedly wrote better GPU code than NVIDIA’s own engineers by dropping down to PTX, NVIDIA’s assembly-like intermediate language [00:52:06] [01:00:44]. This reportedly allowed them to effectively turn H800s into H100s by increasing usable interconnect speed [01:01:07].
  • Dedicated Inference GPUs: Hugging Face’s Open R1 implementation of GRPO uses one GPU for dedicated inference (faster generation) and the remaining GPUs for training [00:44:13] [00:44:21].
  • Increasing Prompt/Completion Lengths: An increase in prompt and completion lengths in an active project often indicates progress, as longer reasoning chains can lead to more accurate answers [00:45:00] [00:45:12].
  • Hyperparameter Tuning on Smaller Models: To save costs, hyperparameters are often tuned on smaller models, and the findings are then “zero-shot transferred” to full-size models [00:37:54].
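
As a minimal illustration of the FP8 point above (assuming PyTorch 2.1+ and its float8_e4m3fn dtype; this is not DeepSeek’s training code), the snippet quantizes a tensor to E4M3 and measures the round-trip error:

```python
import torch

x = torch.randn(4096, 4096, dtype=torch.float32)

# Scale values into FP8's representable range (E4M3 tops out around 448),
# cast down to 1 byte per element, then cast back up to inspect the error.
scale = x.abs().max() / torch.finfo(torch.float8_e4m3fn).max
x_fp8 = (x / scale).to(torch.float8_e4m3fn)
x_roundtrip = x_fp8.to(torch.float32) * scale

print("bytes per element:", x_fp8.element_size())                   # 1
print("max abs round-trip error:", (x - x_roundtrip).abs().max().item())
```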

Implications for AI Development

Data vs. Algorithms

The speaker suggests that the specific algorithm is not as crucial as the quality of the data and the engineering effort involved in training [00:18:40] [02:29:40]. Conflicting results between papers on optimal approaches (e.g., process vs. outcome supervision, or the need for a value model) suggest that no single algorithm is universally superior [00:24:52] [00:25:29].

Open Source and Decentralization

  • Open Source Acceleration: The open-sourcing of models and techniques (like DeepSeek’s innovations) allows all AI players to move faster, accelerating overall progress in the field [01:28:44] [01:29:09].
  • Decentralized Training: The future of AI training might involve decentralized systems, leveraging the combined computational power of millions of personal devices worldwide, similar to Bitcoin’s distributed network [01:41:43] [01:41:59]. This could eventually surpass corporate or national superclusters [01:41:36].
  • Local AI Control: Running AI models locally allows users to bypass censorship or restrictions imposed by API providers, ensuring individual control over their intelligence [01:05:15].

Reasoning Models and Generalization

  • Longer Reasoning Chains: Reasoning models are encouraged to generate longer “thought processes” (sequences of tokens) before arriving at a final answer. This iterative thinking process often leads to more accurate results [01:26:50] [01:26:59].
  • Transfer Learning: Superhuman performance in domains like math and coding (where reward signals are easily verifiable, often without human input) is expected to transfer to other disciplines, including philosophy, due to the underlying general reasoning capabilities developed [01:31:42] [01:32:02].
  • Simulation for Robotics: Reinforcement learning with long-horizon generated data (e.g., through simulation) is posited to outperform imitation learning from teleoperation data in robotics. The idea is to train models extensively in simulation and then deploy them to the real world [01:55:31] [01:56:28]. This mirrors how superhuman Go performance was achieved entirely in simulation [00:56:55].

Model Distillation

  • Knowledge Transfer: Model distillation involves using a larger, more capable “teacher” model to generate a dataset (input-output pairs). A smaller “student” model is then trained on this dataset to mimic the teacher’s behavior [01:14:03].
  • Efficiency: Distillation is most efficient when the student can see the teacher’s full logits (probability distributions over tokens), but it can still work when only the teacher’s final output tokens are available [01:16:09] [01:16:19]; both signals are sketched after this list.
  • Inherent Capacity: The ability of smaller models to perform well after distillation suggests that they inherently possess the capacity for high intelligence; distillation helps them find the “magical combination of values for weights” more effectively than traditional training methods [01:21:53]. Distillation is framed as a continuous, pervasive process within humanity itself [01:19:18].
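
A minimal sketch of the two signals above, as standard soft-label vs. hard-label distillation losses in PyTorch; the temperature and shapes are illustrative, not from the video.

```python
import torch
import torch.nn.functional as F

def soft_distill_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Match the teacher's full probability distribution over the vocabulary
    # (the most information-rich, and most efficient, form of distillation).
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

def hard_distill_loss(student_logits: torch.Tensor,
                      teacher_tokens: torch.Tensor) -> torch.Tensor:
    # Only the teacher's sampled output tokens are available (e.g., via an API);
    # the student is trained with ordinary cross-entropy on those tokens.
    return F.cross_entropy(student_logits, teacher_tokens)
```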