From: aidotengineer
John Kzi leads the AI Assistant team at Jane Street, a group focused on maximizing the value Jane Street derives from large language models (LLMs) [00:00:19]. Kzi’s background is in developer tools; he previously worked at GitHub [00:00:31]. LLMs offer a unique opportunity because of their open-ended nature, which allows for the creation of almost anything imaginable [00:00:38]. The pace of creativity in employing LLMs is currently outpacing the progress of the models themselves [00:00:47].
Unique Challenges at Jane Street
Adopting off-the-shelf tooling for LLMs is challenging at Jane Street due to several internal choices [00:00:53]:
- OCaml as a Development Platform: Jane Street primarily uses OCaml, a powerful but relatively obscure functional language [00:01:02]. OCaml’s common applications include theorem proving, formal verification, and writing programming languages [00:01:17]. Jane Street uses OCaml for nearly everything, including web applications (via the js_of_ocaml transpiler to JavaScript), Vim plugins (via VCaml to Vimscript), and FPGA code (via Hardcaml instead of Verilog) [00:01:28].
- LLM Proficiency in OCaml: Existing LLMs are not proficient in OCaml, largely because the amount of OCaml code inside Jane Street may exceed the total amount of OCaml code available outside the firm for training [00:02:10].
- Custom Internal Tooling: The use of OCaml necessitated building internal tools, including custom build systems, a distributed build environment, and a code review system called Iron [00:02:38].
- Monorepo and Mercurial: All software is developed in one large monorepo, which is stored in Mercurial rather than Git [00:02:52].
- Editor Preference: 67% of the firm uses Emacs, rather than more common editors like VS Code [00:03:02].
- Ambitious LLM Application: Jane Street aims to apply LLMs to various parts of their development flow, such as resolving merge conflicts, building feature descriptions, or identifying reviewers, without being limited by system boundaries [00:03:14].
Approach to LLMs
Jane Street’s approach to LLMs, especially for developer tools, involves:
- Custom Models: Building custom models and methods for their construction [00:03:49].
- Editor Integrations: Integrating LLMs into developer editors like VS Code, Emacs, and Neovim [00:03:52].
- Model Evaluation: Developing capabilities to evaluate models and optimize their performance [00:04:00].
Custom Model Building
Training custom LLM models is an expensive and time-consuming endeavor with many potential pitfalls [00:04:10].
Inspiration and Initial Naivety
The team was inspired by Meta’s CodeCompose project, which fine-tuned a model for Hack, a language similar to OCaml in that it is primarily used by a single company [00:04:26] (Hack is, incidentally, implemented in OCaml) [00:04:50]. Initially, Jane Street naively believed they could simply fine-tune an off-the-shelf model on their code to make it understand their libraries and idioms [00:04:55].
Defining the Goal
It became clear that good outcomes require the model to see many examples in the specific shape of the questions it will be asked [00:05:16]. Their primary goal was to enable the model to generate diffs given a prompt [00:05:24]. This meant a user in an editor could describe a desired change, and the model would suggest a multi-file diff that:
- Applies cleanly [00:05:50].
- Has a high likelihood of type-checking [00:05:54].
- Is around 100 lines, considered an ideal range for LLMs [00:05:59].
The required training data for this task therefore has the shape (context, prompt, diff) [00:06:16].
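As a rough illustration of that shape, the sketch below lays out one possible OCaml representation of a training example and flattens it into a prompt/completion pair for supervised fine-tuning; the type and field names are hypothetical, not Jane Street’s actual schema.

```ocaml
(* A minimal sketch of the (context, prompt, diff) training-example shape.
   Field and type names here are illustrative, not Jane Street's schema. *)

type file_context = {
  path : string;      (* file the snippet comes from *)
  contents : string;  (* surrounding code shown to the model *)
}

type training_example = {
  context : file_context list;  (* relevant files / build state *)
  prompt : string;              (* the user's in-editor request *)
  diff : string;                (* target multi-file diff, ideally ~100 lines *)
}

(* Serialize one example as a prompt/completion pair for supervised
   fine-tuning: context plus prompt as input, diff as the target output. *)
let to_prompt_completion (ex : training_example) : string * string =
  let ctx =
    ex.context
    |> List.map (fun f -> Printf.sprintf "=== %s ===\n%s" f.path f.contents)
    |> String.concat "\n"
  in
  (ctx ^ "\n\nRequest: " ^ ex.prompt, ex.diff)
```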
Data Collection Strategies
- Features (Pull Requests): Initially considered, but feature descriptions differ significantly from in-editor prompts (e.g., “fix that error”) [00:07:01]. Also, features are often very large (500-1000 lines), requiring automated ways to break them into smaller components [00:07:20].
- Commits: Smaller than features, but at Jane Street commits are used as checkpoints rather than as isolated, described changes [00:07:39]; they carry no descriptions and do not represent self-contained units of work [00:08:08].
- Workspace Snapshotting (Successful Approach): This method involves taking snapshots of developer workstations every 20 seconds, along with their build status (see the sketch after this list) [00:08:17].
- Identifying Changes: Patterns like “green to red to green” often indicate an isolated change where a developer broke and then fixed the build [00:08:38].
- Capturing Data: By capturing the build error at the “red” state and the diff from “red” to “green”, this data can be used to train the model to recover from mistakes [00:08:50].
- Generating Descriptions: A large language model is used to write detailed descriptions of changes, which are then filtered down to approximate what a human would write [00:09:07].
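A minimal sketch of the “green to red to green” mining step follows, assuming a hypothetical snapshot record (timestamp, build status, workspace diff); it pairs the build error captured at the red state with the accumulated diff that restored a green build.

```ocaml
(* Sketch of mining "green -> red -> green" sequences from workspace
   snapshots. The snapshot representation below is an assumption. *)

type build_status = Green | Red of string  (* Red carries the build error *)

type snapshot = {
  time : float;             (* snapshot timestamp (taken ~every 20 s) *)
  status : build_status;
  workspace_diff : string;  (* diff of the workspace vs. the last snapshot *)
}

(* One recovered training example: the error seen at the red state and the
   diff that took the workspace back to green. *)
type recovery_example = { build_error : string; fixing_diff : string }

(* Accumulate diffs until the build is green again; return the recovery
   example and the remaining snapshots. *)
let rec to_green err acc = function
  | ({ status = Green; workspace_diff; _ } :: _) as rest ->
    Some
      ( { build_error = err;
          fixing_diff = String.concat "\n" (List.rev (workspace_diff :: acc)) },
        rest )
  | { status = Red _; workspace_diff; _ } :: more ->
    to_green err (workspace_diff :: acc) more
  | [] -> None

(* Walk the snapshot stream looking for green -> red transitions and mine a
   recovery example from each one that eventually returns to green. *)
let rec mine_recoveries = function
  | { status = Green; _ } :: ({ status = Red err; _ } :: rest) ->
    (match to_green err [] rest with
     | Some (example, remaining) -> example :: mine_recoveries remaining
     | None -> mine_recoveries rest)
  | _ :: rest -> mine_recoveries rest
  | [] -> []
```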
Reinforcement Learning
After supervised training with collected data, reinforcement learning (RL) is crucial for aligning the model’s output with human expectations of “good code” [00:09:31].
- Defining “Good Code”:
- Code that parses [00:09:47].
- Code that type-checks (especially critical for OCaml as a statically typed language) [00:09:59].
- Code that compiles and passes tests [00:10:13].
- Code Evaluation Service (CES): To facilitate RL, Jane Street built CES, a service similar to a build service but optimized for speed (a worker loop is sketched after this list) [00:10:38].
- Process: CES pre-warms a build to a “green” state at a specific revision [00:10:50]. Workers then continuously take diffs from the model, apply them, determine if the build status turns red or green, and report success or error back [00:10:55].
- Outcome: Over months of use, CES helped align the model to write code that compiles and passes tests [00:11:07].
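The worker loop might look roughly like the following sketch, written against an assumed `Ces` interface; the function names and signature are illustrative, not Jane Street’s actual service API.

```ocaml
(* A minimal sketch of a CES-style worker loop, under assumed interfaces:
   start from a build pre-warmed to green, apply a candidate diff from the
   model, check whether the build stays green, report the result, then roll
   back to the warm state. *)

module type Ces = sig
  type build
  val warm_build : revision:string -> build   (* build pre-warmed to green *)
  val apply_diff : build -> diff:string -> unit
  val is_green : build -> bool
  val build_error : build -> string option
  val rollback : build -> unit                (* restore the warm state *)
  val next_diff : unit -> string              (* take a diff from the model *)
  val report : diff:string -> ok:bool -> error:string option -> unit
end

module Worker (C : Ces) = struct
  let rec run build =
    let diff = C.next_diff () in
    C.apply_diff build ~diff;
    let ok = C.is_green build in
    (* Green/red status plus any captured error is what feeds the reward. *)
    C.report ~diff ~ok ~error:(C.build_error build);
    C.rollback build;
    run build

  let start ~revision = run (C.warm_build ~revision)
end
```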
Evaluation
The same setup used for reinforcement learning can be leveraged for model evaluation [00:11:17]. By holding out some RL data, one can evaluate the model’s ability to write working code [00:11:22].
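A held-out evaluation of this kind reduces to a pass-rate computation; the sketch below assumes stand-in `generate` and `builds_green` functions in place of the real model and CES calls.

```ocaml
(* Sketch of evaluating a model on held-out examples: generate a diff for
   each prompt, run it through the same green/red check used for RL, and
   report the fraction that produce a working build. *)

type eval_case = { context : string; prompt : string }

let pass_rate
    ~(generate : eval_case -> string)        (* stand-in for the model *)
    ~(builds_green : diff:string -> bool)    (* stand-in for the CES check *)
    (cases : eval_case list) : float =
  let passed =
    List.filter (fun case -> builds_green ~diff:(generate case)) cases
  in
  float_of_int (List.length passed) /. float_of_int (List.length cases)
```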
Importance of Meaningful Evaluations
Training can lead to catastrophic, yet sometimes humorous, results. For example, a code review model trained on human examples once responded with “I’ll do it tomorrow” when given code to review [00:12:12]. Meaningful evaluations are crucial to prevent models from going “off the rails” and wasting time and resources [00:12:22].
Editor Integrations: AI Development Environment (Aid)
The ultimate test for models is their utility for human developers [00:12:33]. Jane Street developed editor integrations to expose these models to their developers.
Design Principles
When building these integrations, three key ideas were prioritized:
- Code Reusability: Avoid writing the same context-building and prompting strategies three times, once for each supported editor (Neovim, VS Code, Emacs) [00:12:44].
- Flexibility: Maintain the ability to easily swap out models or prompting strategies, anticipating the eventual use of fine-tuned models [00:13:02].
- Metrics Collection: Collect real-world metrics like latency and diff application success rates to gauge the meaningfulness of the generated diffs [00:13:15].
Architecture: Aid as a Sidecar
The chosen architecture for the AI Development Environment (Aid) service places Aid as a sidecar application on the developer’s machine [00:13:33].
- Functionality: Aid handles interactions with LLMs, constructs prompts, manages context, and monitors build status (a rough sketch of this boundary follows the list) [00:13:38].
- Editor Layers: Thin layers are built on top of Aid for each individual editor [00:13:49].
- Benefits: This sidecar approach means changes to Aid do not require individual editor updates; Aid can be restarted on developer machines for immediate updates [00:13:54].
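The sketch below illustrates how thin that editor boundary can be, using hypothetical request/response shapes rather than Aid’s actual API.

```ocaml
(* Sketch of the boundary between a thin editor layer and an Aid-style
   sidecar running on the developer's machine. The shapes below are
   illustrative assumptions, not Aid's actual API. *)

type request = {
  editor : string;              (* "emacs" | "vscode" | "neovim" *)
  prompt : string;              (* the user's in-editor request *)
  visible_files : string list;  (* context gathered by the editor layer *)
}

type response = {
  diff : string;            (* proposed multi-file diff *)
  applied_cleanly : bool;   (* one of the metrics worth tracking *)
  latency_ms : int;
}

(* An editor layer only needs this much: send a request to the local sidecar
   and render whatever comes back. Prompt construction, context management,
   model selection, and build-status tracking all live in the sidecar, so
   they can change without touching any editor. *)
module type Editor_layer = sig
  val request_diff : request -> response
  val render : response -> unit
end
```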
User Experience
- VS Code: Aid integrates into the VS Code sidebar, similar to Copilot, allowing users to request and receive multi-file diffs through a visual interface [00:14:15].
- Emacs: For Emacs users, who prefer text buffers, the Aid experience is built into a Markdown buffer. Users can navigate, ask questions, and use keybinds to append content [00:14:35].
Aid’s Flexibility and Value
Aid’s architecture enables significant flexibility:
- Pluggable Components: New models, context-building strategies, and even support for new editors can be plugged in [00:14:58].
- Domain-Specific Tools: Different areas of the company can supply custom tools, which become available across all supported editors without individual integrations [00:15:14].
- A/B Testing: Aid allows A/B testing of different approaches, such as directing half the company to one model and the other half to another and comparing acceptance rates (a routing sketch follows below) [00:15:28].
Aid is a long-term investment, ensuring that any changes in LLMs can be managed in one central place downstream of the editors, making updates available everywhere efficiently [00:15:38].
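One simple way such an A/B split could be implemented is sketched below: a stable hash of the username picks the arm, so every request from a given user goes to the same backend while acceptance rates are collected. The names and the even split are illustrative assumptions.

```ocaml
(* Sketch of A/B routing between two model backends. *)

type arm = Control | Experiment

(* A stable hash of the username decides the arm, so a given user always
   sees the same model for the duration of the experiment. *)
let assign ~user : arm =
  if Hashtbl.hash user mod 2 = 0 then Control else Experiment

(* Hypothetical backend names. *)
let model_for = function
  | Control -> "model-a"
  | Experiment -> "model-b"

(* Returns the arm (for metrics) and the backend to route this request to. *)
let route ~user =
  let arm = assign ~user in
  (arm, model_for arm)
```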
Future Endeavors
The team is actively pursuing other areas, including:
- Applying RAG (Retrieval-Augmented Generation) within editors [00:16:03].
- Applying similar approaches to large-scale multi-agent workflows [00:16:06].
- Working more with reasoning models [00:16:11].
Across all these efforts, the core principles remain: pluggability, building a strong foundation, and enabling other parts of the company to add domain-specific tooling [00:16:16].