From: hu-po
Llama 2 is a collection of pre-trained and fine-tuned large language models developed and released by Meta [00:03:41]. It is described as the first big open-source LLM that is genuinely competitive with closed models [00:00:55]. Unlike the secretive releases from companies such as OpenAI and Google, the Llama 2 paper provides extensive detail on how the models were developed [00:02:45].
Llama 2 Model Variants
Llama 2 models range in scale from 7 billion to 70 billion parameters [00:03:55]. There are “normal models” (pre-trained) and “chat models” (fine-tuned) [00:01:54]. The fine-tuned chat models, called Llama 2 Chat, are optimized for dialogue use cases [00:03:59]. While the chat models are certainly filtered, even the pre-trained models are suspected to have undergone some filtering [00:02:03]. Notably, a 34B parameter model appears in the results but was not released because there was not enough time to red team it sufficiently [00:09:24], [00:25:38].
Development Process
Pre-training
The Llama 2 models were pre-trained using an optimized auto-regressive Transformer architecture [00:21:52]. Key changes from Llama 1 include:
- Data Corpus: The pre-training corpus size was increased by 40% [00:18:18]. A new mix of publicly available data was used, explicitly excluding data from Meta’s own products or services [00:22:53], [02:23:00]. Efforts were made to remove data from sites known to contain high volumes of personal information [00:23:35]. The models were trained on two trillion tokens of data [00:23:45].
- Data Mix: Factual sources were “up-sampled” (sampled more frequently) in the data mix to increase knowledge and dampen hallucinations while balancing the performance-cost trade-off [00:23:57].
- Context Length: The context length was doubled from 2048 tokens to 4096 tokens [00:18:20], [00:29:16]. This significantly improved performance on long-context benchmarks like Scrolls [00:31:10].
- Attention Mechanism: Grouped Query Attention (GQA) was adopted [00:18:22]. This modification reduces the memory cost of the KV (key-value) cache in multi-head attention during auto-regressive decoding [00:33:17]. GQA shares key and value projections across groups of attention heads, saving memory without significant performance degradation compared to Multi-Head Attention (MHA) or Multi-Query Attention (MQA) [00:33:35], [00:35:50]. GQA was chosen over MQA for better inference performance, particularly with tensor parallelism across multiple GPUs [00:39:36]. A minimal code sketch of the KV-head sharing follows this list.
- Tokenizer: The same byte pair encoding tokenizer as Llama 1 was used, with a vocabulary size of 32,000 tokens [00:50:01], [00:52:17]. Numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes [00:51:32]; both behaviours are illustrated in the toy example below.
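To make the tokenizer bullet concrete, here is a toy Python illustration of the two behaviours mentioned above. This is not the actual SentencePiece implementation, and the `<0x..>` byte-token spelling is an assumption for readability: digits are always split into single-digit pieces, and characters missing from the 32,000-token vocabulary fall back to their raw UTF-8 bytes.

```python
import re

def split_digits(text: str) -> list[str]:
    # Toy pre-tokenization: every digit becomes its own piece; runs of
    # non-digits are left for normal BPE merging.
    return re.findall(r"\d|\D+", text)

def byte_fallback(piece: str, vocab: set[str]) -> list[str]:
    # Toy byte fallback: a piece missing from the vocabulary decomposes into
    # its raw UTF-8 bytes instead of becoming an <unk> token.
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]

print(split_digits("4096 tokens"))       # ['4', '0', '9', '6', ' tokens']
print(byte_fallback("🦙", vocab=set()))  # ['<0xF0>', '<0x9F>', '<0xA6>', '<0x99>']
```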
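The KV-cache saving behind GQA (see the attention bullet above) comes from projecting keys and values into fewer heads than the queries and repeating each KV head across its group of query heads at compute time. A minimal PyTorch sketch follows; head counts and dimensions are illustrative rather than Llama 2's actual configuration, and only the smaller k and v tensors would need to be cached during decoding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq, d_model = 1, 16, 512
n_q_heads, n_kv_heads = 8, 2            # illustrative: 8 query heads share 2 KV heads
head_dim = d_model // n_q_heads
group = n_q_heads // n_kv_heads         # query heads per shared KV head

x = torch.randn(batch, seq, d_model)
wq = nn.Linear(d_model, n_q_heads * head_dim, bias=False)
wk = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)  # 4x fewer key projections
wv = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)  # 4x fewer value projections

q = wq(x).view(batch, seq, n_q_heads, head_dim).transpose(1, 2)   # (B, 8, T, D)
k = wk(x).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)  # (B, 2, T, D) -> what gets cached
v = wv(x).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)

# Repeat each KV head across its group of query heads before attending.
k = k.repeat_interleave(group, dim=1)   # (B, 8, T, D)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).reshape(batch, seq, d_model)
```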
Fine-tuning and Alignment
Llama 2 Chat is the result of iterative applications of alignment techniques, including instruction tuning and Reinforcement Learning from Human Feedback (RLHF) [01:03:18].
- Supervised Fine-Tuning (SFT): The SFT stage started with publicly available instruction tuning data, but Meta found it lacked diversity and quality [01:05:20]. They focused on collecting several thousand examples of high-quality SFT data through their own vendor-based annotation efforts [01:05:43]. This data did not include Meta user data [01:07:22]. They found that a limited set of clean instruction tuning data could achieve a high level of quality (27,540 annotations) [01:06:33].
- Reinforcement Learning from Human Feedback (RLHF): RLHF further aligns the fine-tuned model with human preferences [01:10:36].
- Reward Model: Human annotators select preferred responses between two model outputs, and this feedback is used to train a separate reward model [01:10:51], [01:11:02].
- Helpfulness vs. Safety: To address the trade-off between helpfulness and safety, two separate reward models were trained: one optimized for helpfulness and another for safety [01:17:27], [01:17:35].
- Data Collection: Human preference data for reward modeling was collected weekly in batches, using a binary comparison protocol [01:11:17]. The process involves annotators writing a prompt and choosing between two sampled model responses (from different model variants) [01:11:35]. Meta’s reward modeling data set is significantly larger than others, with over 1.5 million comparisons [01:15:52]. This data is designed for multi-turn conversations [01:16:43].
- Training Objective: A binary ranking loss is used to train the reward model, enforcing that the chosen response receives a higher score than its counterpart [01:18:05]. A margin component was added to the loss to encourage more discrepant scores for more clearly separable responses, which improved helpfulness [01:19:37]. A few-line sketch of this loss appears after this list.
- Iterative Fine-tuning (RLHF V1-V5): As more human preference data was collected, successive versions of the RLHF models (V1 to V5) were trained [01:29:03]. This process leverages the model to generate more fine-tuning data, similar to the self-labeling strategy in the Segment Anything Model [01:29:21].
- Ghost Attention (GAtt): Llama 2 Chat initially struggled to keep following the initial system instruction over multiple turns [01:43:15]. Ghost Attention (GAtt) addresses this by synthetically concatenating the original system instruction to all user messages during training [01:43:30], [01:44:09]. This teaches the model to keep “paying attention” to the original prompt even in long dialogues [01:49:35]; the data construction is sketched below.
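The reward-model objective described in the “Training Objective” bullet above reduces to a few lines of PyTorch. This sketch assumes the reward model emits one scalar score per (prompt, response) pair; the margin values in the usage example are made up for illustration, not the paper's exact margin table.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen: torch.Tensor,
                        score_rejected: torch.Tensor,
                        margin: torch.Tensor) -> torch.Tensor:
    # Binary ranking loss with a margin: -log(sigmoid(r_chosen - r_rejected - m)).
    # A larger margin for clearly separable responses pushes the reward model
    # to assign more discrepant scores.
    return -F.logsigmoid(score_chosen - score_rejected - margin).mean()

# Hypothetical scores for a batch of three human-preference comparisons.
chosen   = torch.tensor([2.1, 0.4, 1.0])
rejected = torch.tensor([1.5, 0.2, 1.3])
margin   = torch.tensor([1.0, 0.0, 0.5])  # e.g. "significantly better" gets a bigger margin
print(reward_ranking_loss(chosen, rejected, margin))
```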
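Ghost Attention is a data-level trick rather than an architectural change. Below is a hedged sketch of the sample-construction step, assuming a simple list-of-dicts dialogue format (role/content keys) that is not prescribed by the paper: the system instruction is concatenated to every user turn when sampling the synthetic dialogue from the current model, then kept only on the first turn for fine-tuning (loss-masking of earlier turns is omitted here).

```python
def build_gatt_sample(instruction: str, dialogue: list[dict]) -> list[dict]:
    # Assumes the dialogue starts with a user turn and alternates user/assistant.
    # Step 1: concatenate the instruction to every user turn; assistant replies
    # would be sampled from the latest RLHF model against these augmented prompts.
    augmented = [
        {**turn, "content": f"{instruction}\n{turn['content']}"}
        if turn["role"] == "user" else dict(turn)
        for turn in dialogue
    ]
    # Step 2: build the training sample by dropping the repeated instruction
    # from every user turn except the first one.
    training = [dict(turn) for turn in augmented]
    for turn in training[1:]:
        if turn["role"] == "user":
            turn["content"] = turn["content"].removeprefix(f"{instruction}\n")
    return training

sample = build_gatt_sample(
    "Always answer as a pirate.",
    [{"role": "user", "content": "Hi!"},
     {"role": "assistant", "content": "Ahoy!"},
     {"role": "user", "content": "What's the weather?"}],
)
```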
Evaluation and Benchmarks
Llama 2 models were evaluated using both human evaluations and benchmark tests.
- Human Evaluation: Human evaluators compared model generations for helpfulness and safety across thousands of prompts [00:08:09], [00:09:09]. Llama 2 outperformed open-source chat models like Falcon 40B on most helpfulness benchmarks [00:04:04], [00:09:09]. It performed comparably to Google’s PaLM Bison and OpenAI’s ChatGPT (GPT-3.5) in helpfulness [00:09:38], [00:10:41]. In terms of safety, Llama 2 Chat showed lower violation rates than other open- and closed-source models [00:16:46].
- Model-based Evaluation (GPT-4 as Judge): GPT-4 was also used as a judge of helpfulness and safety win rates against baselines such as GPT-3, PaLM Bison, and Falcon 40B, with Llama 2 Chat performing better in most cases [00:11:43], [00:12:22].
- Academic Benchmarks: Llama 2 was compared against other open-source and closed-source models on various academic benchmarks including MMLU, TriviaQA, and GSM8K (grade school math word problems) [00:59:06], [01:00:49]. Llama 2 generally showed competitive performance, although GPT-4 still significantly outperformed it in some areas, particularly GSM8K [01:00:21].
- Safety & Helpfulness Trade-off: The development showed that helpfulness and safety scores can increase together, indicating a balance can be achieved [02:07:45], [02:20:05]. However, models with more safety mitigation tend to answer certain questions in a more conservative manner, leading to “false refusals” when a prompt is safe but contains words frequently associated with unsafe generations (e.g., “Christmas crack” or “bomb”) [02:08:40], [02:09:15].
Technical Details
- Architecture: Llama 2 uses a standard Transformer architecture with pre-normalization using RMSNorm [02:26:23], SwiGLU activation functions [02:27:10], and Rotary Positional Embeddings (RoPE) [02:28:46]. A compact sketch of RMSNorm and the SwiGLU feed-forward block follows this list.
- Hyperparameters: The AdamW optimizer was used for training [00:42:51]. A cosine learning rate schedule was employed, with a warm-up period (2,000 steps for pre-training, 3% of total steps for the reward model) and a decay down to 10% of the peak learning rate [00:43:05], [01:23:51]. Weight decay (0.1) and gradient clipping (1.0) were also applied [00:44:29], [00:44:32]. The schedule is sketched after this list.
- Hardware: Models were pre-trained on Meta’s Research Supercluster (RSC) and internal production clusters, both using Nvidia A100 GPUs [00:52:28]. RSC uses Nvidia Quantum InfiniBand interconnects and 400W per GPU, while the production cluster uses RoCE (RDMA over Converged Ethernet) and 350W per GPU [00:52:47], [00:53:30]. RoCE is a more affordable network that can scale almost as well as InfiniBand up to 2000 GPUs [00:55:24].
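For readers who want the architecture bullet in code, here is a compact PyTorch sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block; dimensions are illustrative and RoPE is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Scale by the root-mean-square of the activations; unlike LayerNorm,
    # no mean is subtracted and no bias is learned.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    # SiLU-gated feed-forward block: down(silu(gate(x)) * up(x)).
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

block_in = torch.randn(2, 8, 512)
print(SwiGLU(512, 1376)(RMSNorm(512)(block_in)).shape)  # torch.Size([2, 8, 512])
```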
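The learning-rate schedule from the hyperparameters bullet is straightforward to reproduce: linear warm-up followed by a cosine decay that bottoms out at 10% of the peak rate. A small sketch follows; the peak learning rate and total step count are placeholders rather than the paper's exact values.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
              warmup_steps: int = 2000, min_ratio: float = 0.1) -> float:
    # Linear warm-up, then cosine decay down to min_ratio * peak_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress)))

for s in (0, 1_000, 2_000, 250_000, 500_000):
    print(s, round(cosine_lr(s, total_steps=500_000), 6))
# ends at 3e-05, i.e. 10% of the 3e-4 peak
```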
Ethical Considerations & Release Strategy
Meta emphasizes responsible AI development, helpfulness, and safety [00:04:36]. The Llama 2 paper includes extensive sections on safety, red teaming, limitations, and ethical considerations [00:05:14].
- Red Teaming: Internal employees, contract workers, and external vendors (totaling 350 people) were employed to proactively identify risks and perform “evil things” with the model [00:05:35], [02:12:10], [02:12:40].
- Bias Mitigation: The paper addresses pronoun bias in the pre-training data, acknowledging that models reflect real-world biases [01:59:45].
- Open Release: Llama 2 is made available for both research and commercial use under an acceptable use policy [02:34:04]. Meta believes that open releases promote transparency, decentralize AI expertise, stimulate innovation, accelerate progress, and consolidate costs by avoiding duplicated training efforts across companies [02:34:24].
Key Learnings and Opportunities
- Data Quality: Robust data cleaning and strategic data mixing (e.g., up-sampling factual sources) are crucial [00:22:16], [00:24:04].
- Scaling Potential: The models did not show signs of saturation, indicating they could have been trained for longer [00:47:32].
- RLHF Effectiveness: Reinforcement learning proved highly effective for alignment, balancing cost and time efficiency [02:22:02].
- Emergent Capabilities: Llama 2 demonstrates emergent capabilities such as time awareness and tool use, without explicit annotation for these functions [02:26:19], [02:27:07].
- Future Improvements: Opportunities for Llama 3 include using Meta’s internal proprietary data, adopting a more modern tokenizer, and training for longer durations [02:46:16]. More advanced RL algorithms could also be explored beyond PPO [01:30:50]. Fine-tuning methods could potentially move towards more parameter-efficient techniques like LoRA, rather than full model fine-tuning [01:10:00].
Llama 2 represents a significant step for open-source AI [02:56:52], offering a competitive alternative to proprietary models and fostering transparency in LLM development [02:36:33].