From: aidotengineer

John Kzi is part of the AI Assistant team at Jane Street, which aims to maximize the value Jane Street can derive from large language models (LLMs) [00:00:24]. Kzi has spent his career in developer tools, including time at GitHub [00:00:31]. LLMs offer a significant opportunity because of their open-ended nature: they make it possible to build almost anything imaginable [00:00:38].

Unique Challenges at Jane Street

Adopting off-the-shelf LLM tooling is more difficult for Jane Street than for most companies [00:00:54]. This difficulty stems from several factors:

  • OCaml as Development Platform: Jane Street uses OCaml extensively for development [00:01:02]. OCaml is a powerful functional language, but it is obscure, primarily used in areas like theorem proving, formal verification, and programming language development [00:01:08].
    • OCaml Transpilers: For web applications, Jane Street writes OCaml and uses js_of_ocaml to transpile it to JavaScript [00:01:36]. For Vim plugins, they use Vaml to transpile OCaml to Vim script [00:01:50]. Even FPGA code is written using the HardCaml OCaml library instead of Verilog [00:01:58].
    • LLM Proficiency: Large language models are not proficient in OCaml, largely because the amount of OCaml training data available publicly is small relative to the volume of OCaml code within Jane Street [00:02:14].
  • Internal Tooling and Infrastructure: Jane Street has built its own build systems, distributed build environment, and a code review system called “Iron” [00:02:42].
  • Monorepo and Mercurial: All software is developed in a single giant monorepo, which is stored in Mercurial rather than Git [00:02:53].
  • Editor Preference: 67% of the firm uses Emacs, a less common choice than editors like VS Code [00:03:02].
  • Ambitious Goals: The team aims to apply large language models to various parts of the development flow, such as resolving merge conflicts, building feature descriptions, or identifying code reviewers [00:03:20]. They want seamless integration across systems [00:03:34].

Approach to Large Language Models in Developer Tools

Jane Street’s approach to large language models in developer tools involves:

  1. Building custom models [00:03:49].
  2. Developing editor integrations [00:03:52].
  3. Creating capabilities to evaluate and optimize model performance [00:04:02].

Training Custom Models

Training models is expensive and time-consuming [00:04:10]. The motivation to train custom models came partly from reading Meta’s “CodeCompose” paper, which detailed fine-tuning a model for Hack, a language that, like OCaml, is used primarily at a single company [00:04:26]. (Interestingly, Hack is itself implemented in OCaml [00:04:50].)

Initially, the team naively believed they could fine-tune an off-the-shelf model by showing it their code [00:05:00]. However, achieving good results requires the model to see examples in the specific “shape” of the questions one wants to ask [00:05:16].

The specific goal for their custom model was to generate diffs given a prompt [00:05:30]. This means a user in an editor could describe a desired change, and the model would suggest a potentially multi-file diff (e.g., modifying test, .ml, and .mli files) [00:05:37]. The diffs needed to apply cleanly and have a high likelihood of type-checking [00:05:51]. They targeted diffs up to 100 lines [00:05:59].

Data Collection

To train for this task, they needed data in the “context-prompt-diff” format [00:06:16].
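
As a rough illustration of that shape, the sketch below pairs editor context and a prompt with a small multi-file diff and applies the earlier ~100-line target as a filter; all type and field names are hypothetical, not Jane Street’s actual schema.

```ocaml
(* Hypothetical sketch of a "context-prompt-diff" training example.
   Names are illustrative, not Jane Street's real schema. *)

type file_patch = {
  path : string;   (* e.g. "foo.ml", "foo.mli", or a test file *)
  patch : string;  (* unified-diff text for that one file *)
}

type example = {
  context : string;       (* code surrounding the cursor / open buffers *)
  prompt : string;        (* the change the user asked for *)
  diff : file_patch list; (* target output: a multi-file diff *)
}

(* The stated target: diffs of up to ~100 changed lines that apply cleanly
   and are likely to type-check.  A rough size filter (it also counts the
   +++/--- header lines, so it is only approximate): *)
let small_enough ?(limit = 100) (e : example) =
  let changed fp =
    String.split_on_char '\n' fp.patch
    |> List.filter (fun l ->
           String.length l > 0 && (l.[0] = '+' || l.[0] = '-'))
    |> List.length
  in
  List.fold_left (fun acc fp -> acc + changed fp) 0 e.diff <= limit
```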

  • Features (Jane Street’s analog of pull requests): Initially considered, but feature descriptions differ from how users would prompt in an editor, and features are often too large (500-1000 lines) [00:07:01].
  • Commits: Also considered, but Jane Street uses commits as checkpoints, so they lack descriptions and are not isolated changes [00:07:56].
  • Workspace Snapshotting: The chosen approach involves taking snapshots of developer workstations every 20 seconds throughout the workday [00:08:17]. This also captures the build status (error or green) [00:08:30].
    • Identifying Isolated Changes: Patterns like “green to red to green” build statuses often indicate an isolated change where a developer broke the build and then fixed it [00:08:37].
    • Generating Descriptions: The build error at the “red” state is captured and paired with the diff from “red to green” as training data [00:08:58]. A large language model is then used to write a detailed description of the change, which is filtered down to a human-like length [00:09:08] (this mining step is sketched below).
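
A minimal sketch of this mining step, assuming a simplified snapshot record; the types and helper names are illustrative, and the real pipeline is certainly more involved.

```ocaml
(* Illustrative only: mining workstation snapshots for isolated changes. *)

type build_status = Green | Red of string   (* Red carries the build error *)

type snapshot = {
  time : float;           (* snapshots arrive roughly every 20 seconds *)
  status : build_status;  (* build state captured alongside the workspace *)
}

(* Scan for green -> red -> green runs: the developer broke the build and
   then fixed it, which usually marks a small, self-contained change.  The
   build error from the red snapshot is paired with the red -> green diff;
   an LLM then writes a description, later filtered to a human-like length. *)
let candidate_changes (snaps : snapshot list) =
  let rec go acc = function
    | { status = Green; _ }
      :: ({ status = Red error; _ } as broken)
      :: ({ status = Green; _ } as fixed)
      :: rest ->
        go ((error, broken, fixed) :: acc) (fixed :: rest)
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snaps
```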

Reinforcement Learning

After supervised training, reinforcement learning (RL) is used to align the model’s output with human notions of “good code” [00:09:31]. “Good code” is defined as follows (a scoring sketch appears after the list):

  • Code that parses [00:09:47].
  • Code that type-checks (especially important in a statically typed language like OCaml) [00:10:01].
  • Code that compiles and passes tests [00:10:15].
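
As sketched below, those criteria can be pictured as a graded reward signal for the RL step; the tiers and numeric values are assumptions rather than the production reward.

```ocaml
(* Illustrative reward shaping over the "good code" criteria above. *)
type eval_result =
  | Parse_error    (* does not even parse *)
  | Type_error     (* parses, but fails the OCaml type checker *)
  | Test_failure   (* compiles, but tests go red *)
  | All_green      (* compiles and passes tests *)

let reward = function
  | Parse_error -> 0.0
  | Type_error -> 0.25
  | Test_failure -> 0.5
  | All_green -> 1.0
```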

To facilitate RL, they built the Code Evaluation Service (CES), which acts like a fast build service [00:10:38]. CES pre-warms a build, and then workers continuously take diffs from the model, apply them, and report whether the build status turns red or green [00:10:48]. This continuous feedback over months helps align the model toward writing code that compiles and passes tests [00:11:07]. The same setup is also used for evaluating model performance [00:11:20].
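
A rough sketch of what a CES worker loop could look like, assuming hypothetical primitives over a pre-warmed build tree; this is not the real service interface.

```ocaml
(* Illustrative CES worker; the Warm_build primitives are hypothetical. *)

type verdict = Build_green | Build_red of string

module type Warm_build = sig
  type t
  val apply_patch : t -> patch:string -> (unit, string) result
  val rebuild_and_test : t -> verdict  (* incremental, so it stays fast *)
  val revert : t -> unit               (* restore the pre-warmed state *)
end

module Worker (B : Warm_build) = struct
  (* Pull diffs produced by the model, apply each to the warm tree,
     and report whether the build turns red or green. *)
  let run (tree : B.t) ~next_diff ~report =
    let rec loop () =
      (match B.apply_patch tree ~patch:(next_diff ()) with
       | Error e -> report (Build_red ("diff failed to apply: " ^ e))
       | Ok () -> report (B.rebuild_and_test tree));
      B.revert tree;
      loop ()
    in
    loop ()
end
```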

An example of training going “off the rails” involved a code review model that, after months of training, began responding to reviews with “I’ll do it tomorrow,” because it had been trained on human review examples and humans sometimes write exactly that [00:12:12]. Meaningful evaluations are crucial for catching such issues early [00:12:24].

Editor Integrations: The AI Development Environment (AID)

The ultimate test for these models is whether they are actually useful to humans [00:12:31]. When building editor integrations, three goals were paramount:

  1. Single Implementation: Avoid writing the same context-building and prompting strategies three times for different editors (Neovim, VS Code, Emacs) [00:12:44].
  2. Flexibility: Maintain the ability to swap models or prompting strategies (e.g., from an off-the-shelf model to a fine-tuned one) [00:13:02].
  3. Metrics Collection: Gather real-world data on latency and diff application success rates from developers’ editors [00:13:17].

The solution was the AI Development Environment (AID) service [00:13:34]. AID handles prompt and context construction, and integrates with build status, abstracting these complexities from the editors [00:13:41]. Thin layers are built on top of AID for each individual editor [00:13:49].
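
A hedged sketch of such a service boundary, with thin per-editor layers delegating to AID; the module and field names are illustrative, not AID’s real API.

```ocaml
(* Illustrative AID-style service boundary; names are assumptions. *)

module type Aid_service = sig
  type request = {
    editor : [ `Vscode | `Emacs | `Neovim ];
    buffer_path : string;
    buffer_contents : string;
    prompt : string;
  }

  type response = {
    diff : (string * string) list;  (* (file path, patch) pairs *)
    model_id : string;              (* which backend produced the diff *)
  }

  (* AID owns context construction, model selection, and build-status
     integration; an editor only marshals a request and renders the diff. *)
  val request_diff : request -> response
end

(* A thin per-editor layer then reduces to something like: *)
module Emacs_layer (A : Aid_service) = struct
  let suggest ~buffer_path ~buffer_contents ~prompt =
    A.request_diff
      { A.editor = `Emacs; buffer_path; buffer_contents; prompt }
end
```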

  • Sidecar Architecture: AID runs as a sidecar application on the developer’s machine [00:13:55]. This allows changes to AID to be deployed and restarted on all machines without requiring users to restart their editors [00:14:00].
  • Editor Examples:
    • VS Code: AID integrates into the VS Code sidebar, providing a visual interface for asking questions and receiving multi-file diffs [00:14:16].
    • Emacs: For Emacs users, who prefer text buffers, the AID experience is built into a Markdown buffer. Users can move around, ask questions, and use keybinds to append content [00:14:35].
  • Benefits of AID Architecture:
    • Pluggability: Allows swapping in new models, changing context building, and adding support for new editors or domain-specific tools without rewriting editor integrations [00:14:58].
    • A/B Testing: Enables A/B testing of different approaches by sending portions of the company to different models and comparing acceptance rates [00:15:28] (see the sketch after this list).
    • Adaptability: AID is an investment that pays off over time, as changes in large language models can be implemented in one place and deployed everywhere [00:15:39].
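
As a sketch of the A/B idea above, traffic could be split deterministically by user so the same developer always sees the same model, and acceptance rates compared per arm; the routing rule and record shapes here are assumptions.

```ocaml
(* Illustrative A/B routing of developers to model variants. *)

type arm = { name : string; weight : int }  (* weight as a percentage *)

let arms = [ { name = "finetuned-v2"; weight = 50 };
             { name = "off-the-shelf"; weight = 50 } ]

(* Deterministic assignment: the same user always lands in the same arm,
   so acceptance rates can be compared across arms over time. *)
let assign_arm (user : string) =
  let bucket = Hashtbl.hash user mod 100 in
  let rec pick acc = function
    | [] -> (List.hd arms).name
    | a :: rest ->
        if bucket < acc + a.weight then a.name else pick (acc + a.weight) rest
  in
  pick 0 arms

(* Per-arm acceptance rate from (arm name, accepted) observations. *)
let acceptance_rate (events : (string * bool) list) (arm_name : string) =
  let shown, accepted =
    List.fold_left
      (fun (s, a) (name, ok) ->
        if String.equal name arm_name then (s + 1, if ok then a + 1 else a)
        else (s, a))
      (0, 0) events
  in
  if shown = 0 then 0.0 else float_of_int accepted /. float_of_int shown
```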

Broader Applications

The team is actively pursuing other AI-assisted work initiatives, including:

  • New ways to apply Retrieval Augmented Generation (RAG) within editors [00:16:03].
  • Applying similar approaches to large-scale multi-agent workflows [00:16:07].
  • Increasing work with reasoning models [00:16:13].

The consistent approach across these initiatives is to maintain pluggability, build a strong foundation, and enable other parts of the company to add their domain-specific tooling [00:16:16].