From: aidotengineer

At Jane Street, the AI Assistant team is focused on maximizing the value the firm gets from large language models (LLMs), particularly within developer tools [00:00:22]. John Kzi, who has a background in developer tools, highlights how open-ended LLMs are: they make it possible to build almost anything you can imagine [00:00:38]. The pace of progress in the models is outstripped only by the creativity with which people apply them [00:00:47].

Unique Challenges

Jane Street faces specific challenges that make adopting off-the-shelf AI tooling difficult [00:00:54]. Their primary development platform is OCaml, a powerful but relatively obscure functional language used mostly in theorem proving, formal verification, and programming language development [00:01:02]. Jane Street uses OCaml for almost everything, including web applications (via the js_of_ocaml transpiler), Vim plugins (via a transpiler to VimL), and even FPGA code (via Hardcaml) [00:01:27].

The main reasons off-the-shelf tools are a poor fit for this environment include:

  • Model Proficiency: LLMs are generally not proficient in OCaml, largely because there is so little public OCaml training data [00:02:10]. Jane Street’s internal OCaml codebase likely exceeds all of the OCaml code that exists outside its walls combined [00:02:27].
  • Internal Systems: The company has built its own core development infrastructure, including build systems, a distributed build environment, and a code review system called Iron [00:02:40]. They operate on a giant monorepo stored in Mercurial instead of Git [00:02:52]. Additionally, 67% of the firm uses Emacs as their primary editor [00:03:02].
  • Aspiration: The team wants the flexibility to apply LLMs across various parts of their development flow without being hampered by system boundaries, for tasks like resolving merge conflicts, generating feature descriptions, or identifying code reviewers [00:03:15].

Approach to Large Language Models in Dev Tools

Jane Street’s approach involves building custom models, integrating them into editors, and developing robust evaluation capabilities [00:03:41].

Custom Model Building

Training custom models is expensive and time-consuming, but the approach was validated by Meta’s “CodeCompose” paper [00:04:10]. That project detailed fine-tuning a model for Hack, a language that, like OCaml, is used primarily within a single company [00:04:31].

Their initial, naive assumption was that simply showing a model their code would instantly yield a better model; that turned out to be wrong [00:04:57]. Effective model training requires providing examples in the shape of the desired output [00:05:16].

The specific goal for their model was to generate diffs given a prompt [00:05:30]. Users should be able to describe a desired change in an editor, and the model would suggest a multi-file diff (e.g., modifying test, .ml, and .mli files) [00:05:37]. These diffs needed to apply cleanly, be likely to type-check, and ideally be up to 100 lines long [00:05:51].
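
To make that target concrete, the sketch below models the output shape in OCaml. The type and function names (file_edit, suggestion, changed_lines) are illustrative, not Jane Street’s real interfaces; a simple count of added and removed lines stands in for the roughly-100-line budget.

```ocaml
(* A minimal sketch (not Jane Street's actual interface) of the target
   task: given a prompt typed in the editor, the model proposes a
   multi-file diff, e.g. touching foo.ml, foo.mli, and a test file. *)

type file_edit = {
  path : string;   (* e.g. "lib/foo.ml" or "lib/foo.mli" *)
  patch : string;  (* unified-diff hunks for that file *)
}

type suggestion = {
  prompt : string;         (* what the user asked for in the editor *)
  edits : file_edit list;  (* the proposed multi-file diff *)
}

(* Rough proxy for the "ideally up to 100 lines" target: count the
   added and removed lines across all files in the suggestion. *)
let changed_lines (s : suggestion) =
  List.fold_left
    (fun acc e ->
      let changed =
        String.split_on_char '\n' e.patch
        |> List.filter (fun l ->
               String.length l > 0 && (l.[0] = '+' || l.[0] = '-'))
        |> List.length
      in
      acc + changed)
    0 s.edits

let within_size_budget s = changed_lines s <= 100
```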

Data Collection

To achieve this, they needed training data in the “context-prompt-diff” format [00:06:11].

  • Features/Pull Requests (Iron): Features in the internal code review system Iron (roughly analogous to pull requests) do pair human-written descriptions with diffs, but the diffs are generally too large (often 500-1000 lines) and the descriptions are far more detailed than the short prompts a user would type in an editor [00:06:37].
  • Commits: Commits are smaller, but at Jane Street they are used as checkpoints rather than as isolated, described changes [00:07:39].
  • Workspace Snapshotting: The chosen approach takes snapshots of developer workstations every 20 seconds, along with the build status [00:08:17]. A “green-to-red-to-green” build pattern often indicates an isolated change in which a developer broke and then fixed the build [00:08:37]. Capturing the build error at the “red” state and the diff from “red” to “green” yields valuable training data for error recovery (see the sketch following this list) [00:08:58].
  • Description Generation: LLMs are used to generate detailed descriptions of changes, which are then filtered down to approximate human-like prompt brevity [00:09:07].
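
The sketch below, in OCaml with assumed types and helper names (snapshot, example, diff_between, describe_with_llm), shows roughly how a green-to-red-to-green run in the snapshot stream could become one “context-prompt-diff” training example: the build error captured at the red state becomes the context, and a filtered-down LLM description of the red-to-green diff becomes the prompt.

```ocaml
(* A minimal sketch of mining training data from workstation snapshots
   taken every 20 seconds.  All names are illustrative, not Jane
   Street's real APIs. *)

type build_status = Green | Red of string  (* Red carries the build error *)

type snapshot = {
  time : float;           (* seconds since epoch *)
  tree : string;          (* identifier for the captured working tree *)
  status : build_status;
}

type example = {
  context : string;  (* e.g. the build error observed at the red state *)
  prompt : string;   (* short, human-like description of the change *)
  diff : string;     (* patch taking the red tree to the green tree *)
}

(* Placeholders for machinery that only exists in the real system. *)
let diff_between ~from_tree ~to_tree =
  Printf.sprintf "diff %s..%s" from_tree to_tree

let describe_with_llm patch =
  (* Stand-in for: ask an LLM for a detailed description, then filter
     it down to something as terse as a human-written prompt. *)
  "short description of: " ^ patch

let is_green s = match s.status with Green -> true | Red _ -> false

(* Walk the snapshot stream and emit one example per green -> red -> green run. *)
let mine_examples (snaps : snapshot list) : example list =
  let rec go acc = function
    | { status = Green; _ }
      :: ({ status = Red error; tree = red_tree; _ } :: _ as after_break) ->
      (match List.find_opt is_green after_break with
       | Some recovered ->
         let diff = diff_between ~from_tree:red_tree ~to_tree:recovered.tree in
         go ({ context = error; prompt = describe_with_llm diff; diff } :: acc)
           after_break
       | None -> List.rev acc)
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snaps
```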

Reinforcement Learning

Reinforcement learning aligns the model’s output with human notions of “good code” [00:09:31].

  • Definition of Good Code: At a minimum, a generated diff should apply cleanly, compile, and pass tests.
  • Code Evaluation Service (CES): To support reinforcement learning, they built CES [00:10:38]. The service pre-warms builds to a green state; workers then apply diffs from the model and report whether the build turns red or stays green [00:10:47]. Run continuously over months, this process helps the model learn to write code that compiles and passes tests (a sketch of the worker loop follows this list) [00:11:07].
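
The following OCaml sketch shows the shape of that worker loop; Build_env, evaluate, and reward are invented names standing in for the real service.

```ocaml
(* A sketch of a CES-style worker: the environment is kept pre-warmed
   at a green build; the worker applies one model-proposed diff,
   rebuilds and runs tests, reports the verdict, and reverts back to
   the warm state. *)

type verdict = Green | Red of string  (* Red carries the failure output *)

module type Build_env = sig
  val apply_patch : string -> (unit, string) result  (* apply a diff to the tree *)
  val build_and_test : unit -> verdict               (* incremental build + tests *)
  val revert : unit -> unit                          (* restore the pre-warmed green state *)
end

(* Evaluate one diff against a pre-warmed environment. *)
let evaluate (module Env : Build_env) (diff : string) : verdict =
  let verdict =
    match Env.apply_patch diff with
    | Error msg -> Red ("patch failed to apply: " ^ msg)
    | Ok () -> Env.build_and_test ()
  in
  Env.revert ();
  verdict

(* The signal fed back into reinforcement learning. *)
let reward = function
  | Green -> 1.
  | Red _ -> 0.
```

Keeping the build pre-warmed at green presumably makes each verdict cheap: a worker only pays for an incremental rebuild before reverting.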

Model Evaluation

The same CES setup used for reinforcement learning can be leveraged for model evaluation by holding out some of the RL data [00:11:19]. This allows testing the model’s ability to write functional code [00:11:28]. The importance of meaningful evaluations is underscored by an anecdote where an early code review model, trained on human examples, responded with “I’ll do it tomorrow” [00:11:42].
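
A minimal sketch of that evaluation loop, with stand-in functions generate_diff and ces_check rather than real APIs: run each held-out prompt through the model and the CES-style check, and report the fraction that comes back green.

```ocaml
(* Held-out evaluation: what fraction of prompts produce a diff that
   builds and passes tests?  Both callbacks are hypothetical. *)

type outcome = Green | Red

let pass_rate
    ~(generate_diff : prompt:string -> string)
    ~(ces_check : diff:string -> outcome)
    (held_out_prompts : string list) : float =
  let greens =
    List.fold_left
      (fun acc prompt ->
        match ces_check ~diff:(generate_diff ~prompt) with
        | Green -> acc + 1
        | Red -> acc)
      0 held_out_prompts
  in
  float_of_int greens /. float_of_int (max 1 (List.length held_out_prompts))
```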

Editor Integrations

The ultimate test of these models is their utility for human developers [00:12:31]. When building editor integrations, three main goals were considered:

  1. Avoid Duplication: Supporting Neovim, VS Code, and Emacs meant context building and prompting strategies needed a single shared implementation [00:12:44].
  2. Maintain Flexibility: The architecture needed to allow swapping models or prompting strategies as needed [00:13:02].
  3. Collect Metrics: Measure latency and diff application success to understand real-world user experience [00:13:16].

AI Development Environment (AID)

The chosen architecture features an AI Development Environment (AID) service as a sidecar application on the developer’s machine [00:13:33].

  • AID handles prompt and context construction, and build status observation [00:13:43].
  • Thin layers are written on top of AID for each editor [00:13:49].
  • This sidecar approach means changes to AID can be deployed and restarted centrally, updating all developers without requiring individual editor restarts (a sketch of this split follows this list) [00:13:55].
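
The OCaml sketch below illustrates the division of labor under assumed names (Aid, on_user_prompt): the per-editor layer only gathers a prompt and renders a diff, while everything that might change (context construction, prompting strategy, model choice) lives behind the sidecar’s request function.

```ocaml
(* A sketch of the sidecar/editor split; names and types are invented. *)

module Aid : sig
  type request = {
    prompt : string;             (* what the user typed *)
    visible_files : string list; (* paths the editor currently has open *)
  }

  type response = {
    diff : string;   (* proposed multi-file patch *)
    model : string;  (* which model produced it *)
  }

  (* In the real system this would talk to the AID sidecar running on
     the developer's machine. *)
  val request : request -> response
end = struct
  type request = { prompt : string; visible_files : string list }
  type response = { diff : string; model : string }

  let request { prompt; visible_files = _ } =
    (* Placeholder: the sidecar builds context, chooses a prompting
       strategy and model, and validates the diff before returning it. *)
    { diff = "--- proposed change for: " ^ prompt; model = "model-a" }
end

(* A thin editor layer: everything editor-specific reduces to rendering. *)
let on_user_prompt ~(render_diff : string -> unit) ~prompt ~visible_files =
  let response = Aid.request { Aid.prompt; visible_files } in
  render_diff response.Aid.diff
```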

Editor Examples

  • VS Code: AID integrates into the sidebar, similar to Copilot, allowing users to ask for and receive multi-file diffs through a visual interface [00:14:15].
  • Emacs: For Emacs users, who prefer text-based workflows, the AID experience is built into a Markdown buffer, allowing users to interact with text, ask questions, and append content via keybinds [00:14:34].

Benefits of AID Architecture

  • Pluggability: AID allows seamless swapping of models, changes to context building, and support for new editors [00:14:58].
  • Domain-Specific Tools: Different parts of the company can supply specific tools that become available across all editors without needing individual integrations [00:15:15].
  • A/B Testing: The architecture supports A/B testing different approaches, such as routing portions of the company to different models and comparing acceptance rates (a sketch follows this list) [00:15:28].
  • Adaptability: AID provides a strong foundation that pays off over time, enabling rapid adaptation to changes in large language models [00:15:38].
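
As one example of what the A/B-testing point could look like in practice, here is a small OCaml sketch with invented names (model_for_user, record_suggestion, acceptance_rate): users are deterministically bucketed by a hash of their username, each bucket is routed to a different model, and acceptance counts are tallied per model.

```ocaml
(* A sketch of bucketed A/B routing with per-model acceptance tallies. *)

let arms = [| "model-a"; "model-b" |]

(* Deterministic assignment: the same user always sees the same model. *)
let model_for_user username =
  arms.(Hashtbl.hash username mod Array.length arms)

(* Tallies of suggestions shown and accepted, keyed by model name. *)
let shown : (string, int) Hashtbl.t = Hashtbl.create 8
let accepted : (string, int) Hashtbl.t = Hashtbl.create 8

let bump table key =
  Hashtbl.replace table key
    (1 + Option.value ~default:0 (Hashtbl.find_opt table key))

let record_suggestion ~username ~was_accepted =
  let model = model_for_user username in
  bump shown model;
  if was_accepted then bump accepted model

let acceptance_rate model =
  let s = Option.value ~default:0 (Hashtbl.find_opt shown model) in
  let a = Option.value ~default:0 (Hashtbl.find_opt accepted model) in
  if s = 0 then 0.0 else float_of_int a /. float_of_int s
```

Deterministic bucketing keeps each developer on a consistent model, so acceptance rates can be compared across arms without per-request randomness.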

Broader Applications

Beyond developer tools, the team is exploring other applications of LLMs, including new ways to apply RAG (Retrieval-Augmented Generation) within editors, large-scale multi-agent workflows, and work with reasoning models [00:15:58]. The consistent approach across all these initiatives is to maintain pluggability, build a strong foundation, and enable the rest of the company to contribute domain-specific tooling [00:16:16].