From: aidotengineer

Jane Street, a global quantitative trading firm, faces unique challenges in adopting off-the-shelf large language models (LLMs) due to its specialized technology stack. To maximize the value derived from LLMs, the firm has invested in custom model training and sophisticated reinforcement learning strategies [00:00:24].

Challenges with Off-the-Shelf AI Models

Jane Street’s primary development platform is OCaml, an obscure functional language [00:01:02]. OCaml’s common applications include theorem proving, formal verification, and writing programming languages [00:01:17]. Jane Street uses OCaml for nearly everything, even transpiling it to JavaScript for web applications (using js_of_ocaml), writing Vim plugins (using VCaml), and writing FPGA code (using Hardcaml) [00:01:28].

This unique environment presents several difficulties for integrating generic AI models and training methods:

  • Limited OCaml Data: Models are not proficient in OCaml, largely because the amount of OCaml code inside Jane Street likely exceeds the total amount of OCaml code available outside the firm for model training [00:02:14].
  • Custom Internal Systems: Jane Street has built its own development tools, including custom build systems, a distributed build environment, and an internal code review system called “Iron” [00:02:38]. They use a giant monorepo stored in Mercurial (not Git), and 67% of the firm uses Emacs as their primary editor [00:02:50].
  • Desire for Deep Integration: Jane Street aims to apply LLMs to various development flows, such as resolving merge conflicts, generating feature descriptions, or identifying code reviewers, without being limited by system boundaries [00:03:15]. This requires significant customization and scalability in AI models [00:03:20].

Jane Street’s Approach to Custom Model Training

Despite the expense and complexity, Jane Street decided to train custom models [00:04:08]. This decision was influenced by Meta’s “CodeCompose” paper, which described successful fine-tuning for Hack, a language used primarily at one company, a situation closely analogous to OCaml at Jane Street [00:04:26].

The initial naive assumption that simply showing a model a bunch of internal code would yield a capable model proved incorrect [00:04:57]. Success requires providing the model with examples structured in the “shape” of the questions one intends to ask [00:05:16].

Defining the Goal: Generating Diffs

Jane Street’s primary goal for their custom model was to generate code diffs based on a user’s prompt within an editor [00:05:29]. These diffs needed to (a rough data-type sketch follows this list):

  • Potentially span multiple files (e.g., a test file, an .ml implementation, and its .mli interface) [00:05:43].
  • Apply cleanly [00:05:51].
  • Have a high likelihood of type-checking after application [00:05:54].
  • Stay within roughly 100 lines, the scope judged ideal for current LLM capabilities [00:05:59].
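
One way to picture the target output is as a small multi-file diff value. The OCaml sketch below is purely illustrative (the type and field names are not Jane Street’s actual schema); it only encodes the “span several files, apply cleanly, stay small” requirements above.

```ocaml
(* Illustrative sketch only: a multi-file diff as the model's target output.
   Type and field names are hypothetical, not Jane Street's actual schema. *)

type hunk =
  { start_line : int          (* first line the hunk touches in the original file *)
  ; removed    : string list  (* lines deleted at that position *)
  ; added      : string list  (* lines inserted in their place *)
  }

type file_patch =
  { path  : string            (* e.g. a test file, an .ml file, or its .mli interface *)
  ; hunks : hunk list
  }

type diff = { files : file_patch list }  (* may span .ml, .mli, and test files *)

(* A generated diff is only useful if it stays within a reviewable scope. *)
let total_changed_lines (d : diff) =
  List.fold_left
    (fun acc fp ->
       List.fold_left
         (fun acc h -> acc + List.length h.removed + List.length h.added)
         acc fp.hunks)
    0 d.files

let within_ideal_scope d = total_changed_lines d <= 100
```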

Data Collection: Workspace Snapshotting

To achieve this, training data was required in a “context, prompt, diff” format [00:06:09]. Standard internal “features” (similar to pull requests) and commits proved unsuitable: features are too large and carry descriptions far more detailed than an in-editor prompt, while commits often lack descriptions entirely [00:07:01].

The solution implemented was workspace snapshotting (a rough mining sketch follows this list):

  • Snapshots of developer workstations are taken frequently (e.g., every 20 seconds) [00:08:17].
  • Simultaneously, snapshots of the build status (error or green) are recorded [00:08:28].
  • Patterns like “green to red to green” build status transitions often indicate an isolated change where a developer fixed an error [00:08:38].
  • The build error at the “red” state and the diff from “red” to “green” are captured and used as training data, allowing the model to learn how to recover from mistakes [00:08:50].
  • Prompt Generation: LLMs are used to generate detailed descriptions of the changes, which are then filtered down to emulate typical human-written prompts for the training data [00:09:07].
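
A rough OCaml sketch of how the “green to red to green” pattern might be mined from a stream of snapshots is below. Every name and type in it is hypothetical, and the real pipeline is certainly richer (full workspace contents, real diffing, LLM-generated and filtered prompts).

```ocaml
(* Hypothetical sketch of mining training examples from workspace snapshots.
   Names, record layouts, and the diffing routine are placeholders. *)

type build_status = Green | Red of { error : string }

type snapshot =
  { time      : float        (* snapshots taken every ~20 seconds *)
  ; workspace : string       (* opaque handle to the captured source tree *)
  ; status    : build_status
  }

type training_example =
  { context : string         (* here just the build error; the real context is richer *)
  ; prompt  : string         (* filled in later by an LLM and filtered down *)
  ; diff    : string         (* textual diff taking the build from red back to green *)
  }

(* Placeholder for a real diffing routine between two captured workspaces. *)
let diff_workspaces ~before ~after = Printf.sprintf "--- %s\n+++ %s" before after

(* Scan consecutive snapshots for a green -> red -> green transition: the red
   snapshot holds the build error, and the red -> green diff is the fix the
   developer made, i.e. exactly the recovery behaviour worth teaching. *)
let rec mine_examples = function
  | { status = Green; _ }
    :: ({ status = Red { error }; workspace = broken; _ } as r)
    :: ({ status = Green; workspace = fixed; _ } as g2)
    :: rest ->
    { context = error
    ; prompt  = ""
    ; diff    = diff_workspaces ~before:broken ~after:fixed
    }
    :: mine_examples (r :: g2 :: rest)
  | _ :: rest -> mine_examples rest
  | [] -> []
```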

Reinforcement Learning

After supervised training, reinforcement learning (RL) is crucial for aligning the model’s output with human judgments of “good code” [00:09:31].

Definition of “Good Code” for RL (a reward-shaping sketch follows this list):

  • Parses Correctly: Code must successfully pass through the OCaml parser [00:09:47].
  • Type-Checks: In a statically typed language like OCaml, good code must type-check when applied to a base revision [00:09:59].
  • Compiles and Passes Tests: The gold standard is code that compiles and passes all associated tests [00:10:13].
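
These tiers suggest a simple reward shaping. The OCaml sketch below is a hedged illustration only: the checker functions are placeholders standing in for the real parser, type checker, and build-and-test machinery, and the reward values are arbitrary.

```ocaml
(* Hypothetical reward shaping for RL, mirroring the tiers above. The three
   checks are placeholders for real services, not actual Jane Street APIs,
   and the numeric rewards are arbitrary. *)

type verdict =
  | Parse_error
  | Type_error
  | Build_or_test_failure
  | Passes_tests

(* Placeholder checkers; in reality these call the compiler toolchain / CES. *)
let parses ~code:_ = true
let type_checks ~base_revision:_ ~code:_ = true
let builds_and_passes_tests ~base_revision:_ ~code:_ = true

let judge ~base_revision ~code =
  if not (parses ~code) then Parse_error
  else if not (type_checks ~base_revision ~code) then Type_error
  else if not (builds_and_passes_tests ~base_revision ~code) then Build_or_test_failure
  else Passes_tests

(* Reward rises with each tier; passing tests is the gold standard. *)
let reward = function
  | Parse_error -> 0.0
  | Type_error -> 0.25
  | Build_or_test_failure -> 0.5
  | Passes_tests -> 1.0
```

Giving partial credit for intermediate tiers, rather than only rewarding fully passing code, keeps the training signal dense even when the model rarely reaches the gold standard.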

Code Evaluation Service (CES): Jane Street built CES to facilitate this RL phase [00:10:38]. It is a build service designed for speed (a schematic worker loop follows this list):

  • It pre-warms a build at a specific, green revision [00:10:50].
  • Workers continuously receive diffs from the model, apply them, and determine if the build status turns red or remains green [00:10:55].
  • This feedback loop, utilized over months, continuously improves the model’s ability to generate code that compiles and passes tests [00:11:07].
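
A schematic of that worker loop, with all interfaces invented for illustration (this is not CES’s actual API), might look like this in OCaml:

```ocaml
(* Schematic of the CES feedback loop: a worker holds a pre-warmed green
   build, repeatedly applies model-generated diffs, and reports whether the
   build stays green. All interfaces here are invented for illustration. *)

type build_result = Still_green | Turned_red of { error : string }

module type Warm_build = sig
  type t
  val create : revision:string -> t         (* pre-warm at a known green revision *)
  val apply_and_check : t -> diff:string -> build_result
  val reset : t -> unit                     (* roll back to the warm green state *)
end

module Worker (B : Warm_build) = struct
  (* Pull diffs from the trainer, score them, and send the signal back.
     Months of this loop push the model toward code that compiles and
     passes tests. *)
  let run ~revision ~next_diff ~report =
    let build = B.create ~revision in
    let rec loop () =
      match next_diff () with
      | None -> ()
      | Some diff ->
        let result = B.apply_and_check build ~diff in
        report diff result;
        B.reset build;
        loop ()
    in
    loop ()
end
```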

Evaluation

The same setup used for reinforcement learning can be used for model evaluation [00:11:17]. By holding out some RL data, Jane Street can assess a model’s performance by providing it a problem, letting it write code, and then verifying if the generated code works [00:11:22].
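
Because evaluation reuses the RL machinery, the headline metric can be as simple as a pass rate over held-out problems. A generic sketch (assuming a non-empty held-out set), not the firm’s actual harness:

```ocaml
(* Generic sketch: evaluate a model on held-out RL problems by generating
   code and checking whether it works (e.g. via CES). Not the real harness. *)
let pass_rate ~generate ~works problems =
  let passed =
    List.fold_left
      (fun acc problem -> if works (generate problem) then acc + 1 else acc)
      0 problems
  in
  float_of_int passed /. float_of_int (List.length problems)
```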

Meaningful evaluations are crucial to prevent models from producing “catastrophic but hilarious” results, such as a code review model suggesting “I’ll do it tomorrow” because it was trained on human examples [00:11:37].

Editor Integrations: The AI Development Environment (Aid)

To expose these custom models to developers, Jane Street built editor integrations with three main goals:

  1. Avoid Redundancy: Write context and prompting strategies once, rather than separately for Neovim, VS Code, and Emacs [00:12:44].
  2. Maintain Flexibility: Easily swap out different models or prompting strategies as needed, anticipating future finetuned models [00:13:02].
  3. Collect Metrics: Gather real-world data on latency and diff application success rates from within the developer’s editor [00:13:15].

The chosen architecture is the AI Development Environment (Aid) service, which runs as a sidecar application on the developer’s machine [00:13:32]. Aid handles prompt construction, context building, and build status checks [00:13:41]. Thin layers are then built on top of Aid for each specific editor [00:13:49]. This allows changes to Aid to be instantly deployed by restarting the service, without requiring developers to restart their editors [00:14:00].

Aid’s pluggable architecture supports (see the signature sketch after this list):

  • Swapping in new models [00:14:58].
  • Modifying context building [00:15:05].
  • Adding support for new editors [00:15:07].
  • Integrating domain-specific tools from different company areas, making them available across all editors without individual integrations [00:15:14].
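
The pluggability described above could be expressed as OCaml module signatures roughly like the following; the names and interfaces are illustrative only, not Aid’s real API.

```ocaml
(* Illustrative only: Aid's pluggable pieces expressed as module signatures.
   Editors talk to a thin per-editor layer; everything behind these
   interfaces can be swapped without touching the editor integrations. *)

module type Model = sig
  val name : string
  val complete : prompt:string -> context:string -> string  (* returns a diff *)
end

module type Context_builder = sig
  val build : file:string -> cursor:int -> string
end

module type Editor_frontend = sig
  (* Thin per-editor layer: Neovim, VS Code, Emacs, ... *)
  val show_diff : string -> unit
end

(* The Aid sidecar wires the pieces together; restarting it picks up new
   models or context strategies without restarting anyone's editor. *)
module Aid (M : Model) (C : Context_builder) = struct
  let request ~file ~cursor ~prompt =
    let context = C.build ~file ~cursor in
    M.complete ~prompt ~context
end
```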

Aid also enables A/B testing different approaches, such as directing different segments of the company to distinct models to compare acceptance rates [00:15:28]. This provides a robust foundation for adapting AI models and prompts as the field evolves [00:15:40].
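
One common way to run such an A/B test is to hash each user into a stable bucket and compare acceptance rates per bucket; the sketch below is a generic illustration, not Aid’s actual mechanism.

```ocaml
(* Generic sketch of A/B assignment: hash the user to a stable bucket so the
   same person always hits the same model, then compare acceptance rates.
   Not Aid's actual mechanism; variant names are invented. *)

let variants = [| "model_a"; "model_b" |]

let assign_variant ~user =
  let bucket = Hashtbl.hash user mod Array.length variants in
  variants.(bucket)

(* Acceptance rate per variant from (variant, accepted) observations. *)
let acceptance_rate observations variant =
  let shown, accepted =
    List.fold_left
      (fun (s, a) (v, ok) ->
         if String.equal v variant then (s + 1, if ok then a + 1 else a)
         else (s, a))
      (0, 0) observations
  in
  if shown = 0 then 0.0 else float_of_int accepted /. float_of_int shown
```

Deterministic hashing keeps each developer on the same model for the duration of the experiment, which makes per-segment acceptance rates comparable.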