From: aidotengineer

John Kzi, from Jane Street’s AI assistant team, focuses on maximizing the value the company gets from large language models (LLMs) [00:00:24]. His background in developer tools, including a long tenure at GitHub, leads him to see LLMs as a uniquely transformative opportunity because of their open-ended nature and rapid progress [00:00:30].

Unique Challenges with AI Adoption at Jane Street

Jane Street faces specific challenges that make the adoption of off-the-shelf AI tooling difficult [00:00:53].

OCaml as a Development Platform

The primary hurdle is the company’s pervasive use of OCaml, a powerful functional language that remains relatively obscure [00:01:02]. OCaml is most commonly used in theorem proving, formal verification, and writing programming languages [00:01:17]. Jane Street uses OCaml for almost everything, even transpiling it to JavaScript for web applications (js_of_ocaml), writing Vim plugins in it (VCaml), and designing FPGA hardware with it (Hardcaml) [00:01:28].

Limitations of Current AI Models

Off-the-shelf models are not proficient in OCaml, largely because the amount of OCaml code inside Jane Street likely exceeds the total amount of OCaml code available in the outside world to train on [00:02:10].

Jane Street’s Custom Tooling and Environment

The company’s unique development environment further complicates matters:

  • Custom build systems and distributed build environments [00:02:42]
  • An internal code review system called Iron [00:02:47]
  • A giant monorepo, stored in Mercurial rather than Git [00:02:53]
  • A majority of the firm (67%) uses Emacs as their primary editor [00:03:02]

Desire for Deep Integration

Jane Street aims to apply LLMs to various parts of their development flow, such as resolving merge conflicts, building feature descriptions, or identifying reviewers, without being limited by system boundaries [00:03:20].

Approach to Custom Models and Training

To address these challenges, Jane Street focuses on custom model building, editor integrations, and robust evaluation methods [00:03:49].

Inspiration from Meta’s CodeCompose

The decision to train custom models, despite the cost and complexity, was influenced by Meta’s CodeCompose project [00:04:08]. CodeCompose fine-tuned a model for Hack, a language that, like OCaml, is used primarily within a single company [00:04:31]. (Interestingly, Hack is itself implemented in OCaml [00:04:50].)

Defining the Model’s Goal

The initial, naive approach of simply showing an off-the-shelf model existing code proved ineffective [00:05:00]. To get good outcomes, models need training examples shaped like the questions they will actually be asked [00:05:16]. Jane Street therefore settled on a concrete goal: given a prompt from an editor, generate a multi-file diff (e.g., touching test, .ml, and .mli files) [00:05:30]. These diffs needed to apply cleanly and ideally type-check, targeting changes of up to 100 lines [00:05:50].
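To make that target format concrete, the sketch below shows one way such a training example could be represented. The OCaml types, field names, and helper functions are illustrative assumptions, not Jane Street’s actual schema.

```ocaml
(* Illustrative only: a possible representation of one training example in
   the "context, prompt, diff" shape described above. Names and fields are
   assumptions, not Jane Street's actual schema. *)

type file_change = {
  path : string;           (* e.g. "foo.ml", "foo.mli", or a test file *)
  unified_diff : string;   (* the patch to apply to that file *)
}

type training_example = {
  context : string list;     (* surrounding code shown to the model *)
  prompt : string;           (* short, editor-style request, e.g. "fix that error" *)
  diff : file_change list;   (* the multi-file diff, ideally under ~100 changed lines *)
}

(* A generated example is only kept if its diff applies cleanly and, ideally,
   type-checks; those checks are passed in as functions in this sketch. *)
let is_usable ~applies_cleanly ~type_checks (ex : training_example) =
  applies_cleanly ex.diff && type_checks ex.diff
```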

Data Collection for Training

To achieve this, training data needed to reflect the “context, prompt, diff” shape [00:06:11].

  • Challenges with existing data:

    • Features (Jane Street’s analog of pull requests): they contain descriptions and diffs, but the descriptions are far more formal (multiple paragraphs) than typical editor prompts like “fix that error” [00:07:01]. They are also often too large, requiring an automated way to break them into smaller components [00:07:20].
    • Commits: At Jane Street, commits are used as checkpoints, not isolated changes, and lack descriptions [00:07:59].
  • Workspace Snapshotting (Primary Training Method; see the sketch after this list):

    • Snapshots of developer workstations are taken frequently (e.g., every 20 seconds) along with their build status [00:08:17].
    • A “Green to Red to Green” pattern in the build status often indicates an isolated change [00:08:38].
    • The build error at the “Red” state becomes the prompt, and the diff from “Red to Green” becomes the training data to help the model recover from mistakes [00:08:50].
    • Descriptions for these diffs are generated by an LLM, then filtered down to match the expected human-written prompt style [00:09:08].
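The snapshot-mining step described above can be sketched in a few lines. The types and the diff function below are assumptions made for illustration; the talk does not describe the actual pipeline at this level of detail.

```ocaml
(* Illustrative sketch: mining "green -> red -> green" windows from
   periodic workspace snapshots. Types and logic are assumptions. *)

type build_status = Green | Red of { error : string }

type snapshot = {
  time : float;            (* when the snapshot was taken (~every 20 seconds) *)
  status : build_status;   (* build status at that moment *)
  workspace : string;      (* opaque handle to the captured workspace state *)
}

(* [diff] would compute the textual diff between two workspace states;
   it is left abstract here. *)
let mine_examples ~diff (snapshots : snapshot list) =
  let rec go acc = function
    | { status = Green; _ }
      :: ({ status = Red { error }; _ } as red)
      :: ({ status = Green; _ } as recovered)
      :: rest ->
      (* The build error at the red state becomes the prompt; the
         red -> green diff becomes the target the model learns to produce. *)
      let example = (error, diff red.workspace recovered.workspace) in
      go (example :: acc) (recovered :: rest)
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snapshots
```

Each mined pair would then get an LLM-generated, human-style description before being added to the training set, as described above.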

Reinforcement Learning and Evaluation

After supervised training, reinforcement learning is critical to align the model with human expectations of “good code” [00:09:31].

  • Definition of “Good Code”:

    • At a minimum, code that applies cleanly, compiles, and passes tests; these are the same signals the Code Evaluation Service reports back during reinforcement learning.
  • Code Evaluation Service (CES):

    • A custom service built for efficient code evaluation during the reinforcement learning phase (see the sketch after this list) [00:10:38].
    • It pre-warms a build to a “green” state [00:10:50].
    • Workers continually apply diffs generated by the model and report whether the build turns red or green [00:10:55].
    • This continuous feedback helps align the model to write code that compiles and passes tests [00:11:07].
    • The same setup is used for model evaluation, by holding out some of the reinforcement learning data [00:11:20].
  • Importance of Meaningful Evaluation:

    • Training can lead to “catastrophic but hilarious results,” such as a code review model responding “I’ll do it tomorrow” because it was trained on human examples containing such phrases [00:11:42].
    • Meaningful evaluations are crucial to prevent models from going off-the-rails and wasting time/money [00:12:22].
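The CES bullets above amount to a simple worker loop. The sketch below is a loose illustration under assumed names (pre_warm, apply_diff, rebuild, revert, next_diff, report); it is not Jane Street’s actual service API.

```ocaml
(* Illustrative sketch of the Code Evaluation Service (CES) worker loop.
   The module signature and function names are assumptions, not the real API. *)

type verdict = Stayed_green | Turned_red of { first_error : string }

module type Build = sig
  type t
  val pre_warm : unit -> t                          (* bring a build up to green *)
  val apply_diff : t -> diff:string -> (unit, string) result
  val rebuild : t -> verdict                        (* incremental build + tests *)
  val revert : t -> unit                            (* restore the green baseline *)
end

(* One worker: pull model-generated diffs, apply each to the pre-warmed green
   build, and report whether the build stays green. These verdicts serve as
   the reward signal during RL and as the score during evaluations. *)
let worker (module B : Build) ~next_diff ~(report : verdict -> unit) =
  let build = B.pre_warm () in
  let rec loop () =
    match next_diff () with
    | None -> ()
    | Some diff ->
      let verdict =
        match B.apply_diff build ~diff with
        | Error msg -> Turned_red { first_error = msg }
        | Ok () -> B.rebuild build
      in
      report verdict;
      B.revert build;
      loop ()
  in
  loop ()
```

The same loop doubles as an evaluation harness: run it over held-out reinforcement learning data and score the fraction of diffs that keep the build green.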

Editor Integrations and Deployment

The ultimate test for models is whether they work for humans [00:12:33]. Jane Street developed an AI Development Environment (AID) to expose models to developers [00:13:36].

Design Principles for AID

  1. Single Codebase: Avoid writing the same context and prompting strategies three times for NeoVim, VS Code, and Emacs [00:12:44].
  2. Flexibility: Allow swapping models or prompting strategies easily, anticipating future fine-tuned models [00:13:02].
  3. Metrics Collection: Gather real-world data on latency and diff-application success to assess whether the tools are genuinely useful to developers [00:13:17].

AID Architecture and Deployment

  • AID acts as a sidecar application on the developer’s machine [00:13:55].
  • It handles all prompt construction and context building, and integrates with build status [00:13:41].
  • Thin layers are written on top of AID for each editor (VS Code, Emacs, NeoVim) [00:13:49].
  • This architecture allows changes to AID to be deployed by simply restarting the service on all machines, without requiring editor restarts [00:14:00].
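Under this split, each thin editor layer only needs to send a small request to the local sidecar and render the result, while AID owns prompts, context, and metrics. The request/response types below are assumptions sketched for illustration, not the real AID protocol.

```ocaml
(* Illustrative sketch of the editor-to-sidecar boundary. Types and names
   are assumptions, not Jane Street's actual AID protocol. *)

type editor = Vs_code | Emacs | Neovim

type request = {
  editor : editor;              (* which thin client sent the request *)
  prompt : string;              (* what the user typed, e.g. "fix that error" *)
  visible_files : string list;  (* paths the editor currently has open *)
}

type response = {
  diff : string;                (* multi-file diff for the editor to present *)
  model_id : string;            (* which model/strategy produced it (for metrics) *)
  latency_ms : int;             (* recorded for the metrics AID collects *)
}

(* Because every editor talks to the same local sidecar, swapping the model or
   changing context building happens in one place: restart AID on developer
   machines and all editors pick up the change without restarting. *)
let handle (req : request) ~(choose_model : request -> string)
    ~(run_model : model:string -> request -> string * int) : response =
  let model_id = choose_model req in
  let diff, latency_ms = run_model ~model:model_id req in
  { diff; model_id; latency_ms }
```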

Editor-Specific User Interfaces

  • VS Code: Presents a visual sidebar similar to GitHub Copilot, allowing multi-file diff requests [00:14:16].
  • Emacs: The AI experience is integrated into a markdown buffer, catering to Emacs users’ preference for text-based interaction and familiar keybindings for appending content [00:14:34].

Benefits of AID

  • Pluggability: Allows swapping in new models, changing context building, adding support for new editors, and incorporating domain-specific tools across all editors simultaneously [00:14:58].
  • A/B Testing: Enables A/B testing of different approaches, e.g., sending 50% of users to one model and 50% to another to compare acceptance rates (see the sketch after this list) [00:15:28].
  • Investment Payoff: AID ensures that improvements in LLMs can be rapidly propagated to all developers by updating a single component downstream of the editors [00:15:39].
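The A/B testing point above amounts to routing each user to an arm and comparing acceptance rates from AID’s metrics. The sketch below shows one deterministic way to do that; the bucketing and counting are assumptions, not Jane Street’s implementation.

```ocaml
(* Illustrative sketch of 50/50 A/B assignment and acceptance-rate comparison.
   The hashing and event shape are assumptions. *)

type arm = Model_a | Model_b

(* Deterministic 50/50 assignment by user id. *)
let arm_of_user (user_id : string) : arm =
  if Hashtbl.hash user_id mod 2 = 0 then Model_a else Model_b

(* Given (arm, accepted) observations from AID's metrics, compute the
   acceptance rate for one arm. *)
let acceptance_rate (events : (arm * bool) list) (which : arm) =
  let shown, accepted =
    List.fold_left
      (fun (shown, accepted) (arm, was_accepted) ->
        if arm = which then
          (shown + 1, if was_accepted then accepted + 1 else accepted)
        else (shown, accepted))
      (0, 0) events
  in
  if shown = 0 then 0.0 else float_of_int accepted /. float_of_int shown
```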

Ongoing Work and Best Practices for AI Systems

The team continues to explore new applications, including applying RAG (Retrieval-Augmented Generation) within editors, developing large-scale multi-agent workflows, and working with reasoning models [00:16:03]. The core approach remains consistent: maintain pluggability, build a strong foundation, and enable other teams to contribute domain-specific tooling [00:16:16].