From: hu-po
The Phi-1 model from Microsoft Research introduces an approach that leverages high-quality, synthetically generated data for both pre-training and finetuning language models for code [00:04:06]. This method aims to achieve competitive performance with significantly smaller models and less computational cost [00:09:15], suggesting that the “art of training large artificial neural networks” [00:07:02] can be improved by focusing on data quality [00:08:26].

Phi-1 Model Overview

Phi-1 is a Transformer-based language model for code with 1.3 billion parameters, which is considerably smaller than competing models like Falcon (40 billion parameters) or Llama (65 billion parameters) [00:02:28]. It was trained efficiently in four days using eight A100 GPUs, a relatively low-cost setup compared to larger models [00:03:44].

The core idea is that high-quality data can “dramatically change the shape of those scaling laws” [00:09:10], potentially allowing models to match the performance of larger-scale models with much leaner training [00:09:15]. This concept is akin to a human learning process, where the choice of curriculum and the quality of information are crucial [00:09:39].

Data Set Curation

The paper emphasizes the importance of using “textbook quality training data” [01:02:07]. The researchers argue that standard code datasets from sources like The Stack and Stack Overflow, while large and diverse, often contain “snippets that are not very instructive for learning the basics of coding” [00:23:14]. These issues include samples that are not self-contained, lack meaningful computation, or are poorly documented [00:24:09].

The training process for Phi-1 involved two primary stages with specially curated data:

  1. Pre-training Data: “Code Textbook”

    • This dataset combines a filtered subset of The Stack and Stack Overflow [00:31:27] with synthetically generated data [00:31:33].
    • The filtered code dataset was created by training a random forest classifier [00:33:52] on a small subset of the data [00:33:06] annotated not by humans but by GPT-4 [00:34:46]. This classifier scores the “educational value” of code snippets, prioritizing those that are “clear, self-contained, instructive, and balanced” [00:31:05]: a well-documented, self-contained function counts as high educational value, while a complex, undocumented snippet that depends on external context does not [00:35:48]. A minimal sketch of this filtering step appears after this list.
    • The synthetic textbook data (less than 1 billion tokens) [00:31:36] was generated by GPT-3.5 [00:44:05]. It consists of text-heavy explanations interleaved with relevant code snippets, intended to promote reasoning and basic algorithmic skills [01:00:21]. Diversity was achieved by constraining the topics and target audiences of the generated content [01:42:01]; a prompting sketch follows the list.
    • The combined “Code Textbook” corpus contains roughly 7 billion tokens; over the course of pre-training the model sees a little over 50 billion tokens [00:59:15].
  2. Finetuning Data: “Code Exercises”

    • This is a much smaller synthetic dataset of Python exercises (under 200 million tokens) [00:43:51], also generated by GPT-3.5 [00:44:05].
    • Each exercise consists of a docstring for a function that the model must complete [00:43:55]; the goal is to align the model to perform function completion from natural-language instructions [00:44:01]. An illustrative exercise in this format appears after the list.
    • Diversity in this dataset was primarily achieved by constraining the function names [00:44:10].
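
The filtering step in item 1 can be pictured as follows. This is a minimal sketch, not the paper’s code: the embedding model (Salesforce/codegen-350M-mono), the mean-pooling, the probability threshold, and the helper names are illustrative assumptions. The paper only states that a random forest was trained on output embeddings of a pretrained code model, with GPT-4 annotations as labels.

```python
# Sketch of the "educational value" filter, under the assumptions stated above.
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")  # assumed embedder
encoder = AutoModel.from_pretrained("Salesforce/codegen-350M-mono")

def embed(snippet: str) -> np.ndarray:
    """Mean-pool the last hidden state into a fixed-size feature vector."""
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def train_filter(annotated):
    """annotated: iterable of (code, label) pairs, label 1 = educational (per GPT-4)."""
    X = np.stack([embed(code) for code, _ in annotated])
    y = np.array([label for _, label in annotated])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    return clf

def keep(clf, snippet: str, threshold: float = 0.5) -> bool:
    """Keep a snippet for the 'Code Textbook' if its predicted educational value is high."""
    prob = clf.predict_proba(embed(snippet).reshape(1, -1))[0, 1]
    return prob >= threshold
```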
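
The diversity mechanism for the synthetic textbook data (constraints on topic and audience) can be sketched as a simple prompting loop. The prompt wording, the topic and audience lists, and the sampling temperature below are illustrative assumptions; the paper does not publish its generation prompts. The sketch uses the OpenAI Python SDK’s chat-completions call with gpt-3.5-turbo.

```python
# Sketch of diversity-by-constraint for synthetic textbook generation.
# Topic list, audience list, and prompt wording are illustrative, not the paper's.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOPICS = ["list comprehensions", "recursion", "hash maps", "matrix operations"]
AUDIENCES = ["a first-year CS student", "a data analyst new to Python"]

def generate_textbook_section(seed: int) -> str:
    rng = random.Random(seed)
    topic, audience = rng.choice(TOPICS), rng.choice(AUDIENCES)
    prompt = (
        f"Write a short textbook section teaching {topic} to {audience}. "
        "Interleave clear explanations with small, self-contained Python "
        "examples, and end with a brief exercise."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    return response.choices[0].message.content
```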
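
The fine-tuning data in item 2 pairs a docstring with a completion. The exercise below is a made-up illustration of that format, not a sample from the dataset.

```python
# Illustrative exercise in the "docstring -> completion" format described above.
def count_vowels(text: str) -> int:
    """
    Return the number of vowels (a, e, i, o, u, case-insensitive) in `text`.

    >>> count_vowels("Phi-1")
    1
    """
    # --- the model is trained to produce the body below from the docstring ---
    return sum(1 for ch in text.lower() if ch in "aeiou")
```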

Training and Performance

The Phi-1 base model was pre-trained on the “Code Textbook” dataset with a batch size of 1024 for 36,000 steps; the checkpoint after roughly eight epochs (around 24,000 steps) was used as the base model [00:59:09]. The final Phi-1 model was then fine-tuned on the “Code Exercises” dataset for an additional 6,000 steps, which took about seven hours [00:57:19].
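
These figures can be sanity-checked with simple arithmetic. The sketch below assumes the paper’s sequence length of 2048 tokens; the other numbers come from the paragraph above.

```python
# Back-of-the-envelope check of the token counts above (the 2048-token sequence
# length is an assumption taken from the paper; the rest comes from this summary).
batch_size = 1024          # sequences per optimizer step
seq_len = 2048             # tokens per sequence
base_ckpt_steps = 24_000   # checkpoint used as the Phi-1 base model

tokens_seen = base_ckpt_steps * batch_size * seq_len
print(f"tokens seen during pre-training: {tokens_seen / 1e9:.1f}B")  # ~50.3B
print(f"passes over a ~7B-token corpus: {tokens_seen / 7e9:.1f}")    # roughly 7-8 epochs
```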

Despite its small size (1.3 billion parameters) [00:02:28] and significantly less training data than its competitors (about 7 billion unique tokens) [00:05:01], Phi-1 achieved a pass@1 accuracy of 50.6% on HumanEval and 55% on Mostly Basic Python Programs (MBPP) [00:05:04]. This surpasses larger open-source models such as StarCoder, which is roughly 10 times larger and was trained on roughly 100 times more data, yet scores lower on HumanEval [01:20:58].
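
pass@1 here is the HumanEval-style metric: sample completions for each problem, run the benchmark’s unit tests, and report the unbiased pass@k estimator from Chen et al. (2021). A minimal sketch (the sample counts in the example are illustrative, not Phi-1’s):

```python
# Unbiased pass@k estimator used by HumanEval (Chen et al., 2021).
# With a single greedy sample per problem, pass@1 reduces to plain accuracy.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples passing the tests, k: budget."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for one problem, 101 of them pass the unit tests.
print(pass_at_k(n=200, c=101, k=1))  # ~0.505, i.e. ~50.5% pass@1 on this problem
```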

Emergent Capabilities and Generalization

Finetuning on the “Code Exercises” dataset not only improved performance on the targeted tasks but also unlocked “unexpected coding capabilities” [01:00:21]. The model showed substantial improvement on tasks not explicitly featured in the fine-tuning data, such as handling intricate algorithmic problems and using external libraries like Pygame and Tkinter [01:01:17]. This suggests that fine-tuning helps the model reorganize and consolidate knowledge acquired during pre-training [01:01:37].

The researchers also ran experiments to address potential “contamination”, i.e. memorization of benchmark data [01:14:40]. They created new, unconventional evaluation problems and aggressively pruned the “Code Exercises” dataset, removing up to 40% of it based on similarity to the benchmark problems. Even when retrained on the pruned data, Phi-1 still outperformed StarCoder, suggesting genuine generalization rather than mere memorization [01:24:24]. A minimal sketch of similarity-based pruning follows.
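
The paper combines embedding-based and syntax-based (AST) similarity for this pruning; the sketch below covers only an embedding-based variant, and the threshold and the shape of the `embed` callable are assumptions.

```python
# Embedding-based decontamination sketch: drop any training exercise whose
# embedding is too close to a benchmark problem. Threshold is an assumption.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prune(exercises, benchmark, embed, threshold: float = 0.9):
    """Keep only exercises whose max similarity to any benchmark problem is low.

    `embed` maps a code string to a fixed-size vector (for example, the snippet
    embedder sketched in the data-curation section above)."""
    bench_vecs = [embed(p) for p in benchmark]
    kept = []
    for ex in exercises:
        v = embed(ex)
        if max(cosine(v, b) for b in bench_vecs) < threshold:
            kept.append(ex)
    return kept
```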

Limitations and Future Directions

Despite its strengths, Phi-1 has limitations:

  • It is specialized in Python coding [01:32:39].
  • It may lack domain-specific knowledge for certain APIs [01:32:48].
  • It is less robust to stylistic variations or errors in prompts compared to larger models like GPT-4 [01:40:46].

The paper posits that significant gains could be achieved by using GPT-4 to generate synthetic data instead of GPT-3.5 [01:33:08]. The work provides strong evidence that developing good methodologies for creating high-quality data sets is a central direction for advancing natural language processing [01:33:30]. This includes ensuring the data covers relevant concepts, is diverse, and non-repetitive, potentially through techniques like “domain randomization” for text [01:34:01].

The project highlights that creating synthetic data sets for finetuning offers a promising avenue for developing more efficient and specialized language models, potentially leading to models tailored for specific behaviors or domains [01:37:31].