From: aidotengineer
John Kzi leads the AI Assistant team at Jane Street, a group focused on maximizing the value Jane Street derives from large language models (LLMs) [00:00:19]. Kzi’s background is in developer tools; he previously worked at GitHub [00:00:31]. LLMs offer a unique opportunity because of their open-ended nature, which allows for the creation of almost anything imaginable [00:00:38]. The pace of creativity in employing LLMs is currently outpacing the progress of the models themselves [00:00:47].
Unique Challenges at Jane Street
Adopting off-the-shelf tooling for LLMs is challenging at Jane Street due to several internal choices [00:00:53]:
- OCaml as a Development Platform: Jane Street primarily uses OCaml, a powerful but relatively obscure functional language [00:01:02]. OCaml’s common applications include theorem proving, formal verification, and writing programming languages [00:01:17]. Jane Street uses OCaml for nearly everything, including web applications (via the js_of_ocaml transpiler to JavaScript), Vim plugins (via VCaml to Vimscript), and FPGA code (via Hardcaml instead of Verilog) [00:01:28].
- LLM Proficiency in OCaml: Existing LLMs are not proficient in OCaml, largely because the amount of OCaml code inside Jane Street may exceed the total amount of OCaml code available outside the firm for training [00:02:10].
- Custom Internal Tooling: The use of OCaml necessitated building internal tools, including custom build systems, a distributed build environment, and a code review system called Iron [00:02:38].
- Monorepo and Mercurial: All software is developed in one large monorepo, which is stored in Mercurial rather than Git [00:02:52].
- Editor Preference: 67% of the firm uses Emacs, rather than more common editors like VS Code [00:03:02].
- Ambitious LLM Application: Jane Street aims to apply LLMs to various parts of their development flow, such as resolving merge conflicts, building feature descriptions, or identifying reviewers, without being limited by system boundaries [00:03:14].
Approach to LLMs
Jane Street’s approach to LLMs, especially for developer tools, involves:
- Custom Models: Building custom models and methods for their construction [00:03:49].
- Editor Integrations: Integrating LLMs into developer editors like VS Code, Emacs, and Neovim [00:03:52].
- Model Evaluation: Developing capabilities to evaluate models and optimize their performance [00:04:00].
Custom Model Building
Training custom LLM models is an expensive and time-consuming endeavor with many potential pitfalls [00:04:10].
Inspiration and Initial Naivety
The team was inspired by Meta’s CodeCompose project, which fine-tuned a model for Hack, a language similar to OCaml in that it is primarily used by a single company [00:04:26] (Hack is, incidentally, implemented in OCaml) [00:04:50]. Initially, Jane Street naively believed they could simply fine-tune an off-the-shelf model on their code to make it understand their libraries and idioms [00:04:55].
Defining the Goal
It became clear that good outcomes require the model to see many examples in the specific shape of the questions it will be asked [00:05:16]. Their primary goal was to enable the model to generate diffs given a prompt [00:05:24]. This meant a user in an editor could describe a desired change, and the model would suggest a multi-file diff that:
- Applies cleanly [00:05:50].
- Has a high likelihood of type-checking [00:05:54].
- Is around 100 lines, considered an ideal range for LLMs [00:05:59].
The required training data for this task therefore has the shape (context, prompt, diff) [00:06:16].
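As a rough illustration of that shape, the sketch below lays out one possible OCaml representation of a training example and flattens it into a prompt/completion pair for supervised fine-tuning; the type and field names are hypothetical, not Jane Street’s actual schema.

```ocaml
(* A minimal sketch of the (context, prompt, diff) training-example shape.
   Field and type names here are illustrative, not Jane Street's schema. *)

type file_context = {
  path : string;      (* file the snippet comes from *)
  contents : string;  (* surrounding code shown to the model *)
}

type training_example = {
  context : file_context list;  (* relevant files / build state *)
  prompt : string;              (* the user's in-editor request *)
  diff : string;                (* target multi-file diff, ideally ~100 lines *)
}

(* Serialize one example as a prompt/completion pair for supervised
   fine-tuning: context plus prompt as input, diff as the target output. *)
let to_prompt_completion (ex : training_example) : string * string =
  let ctx =
    ex.context
    |> List.map (fun f -> Printf.sprintf "=== %s ===\n%s" f.path f.contents)
    |> String.concat "\n"
  in
  (ctx ^ "\n\nRequest: " ^ ex.prompt, ex.diff)
```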
Data Collection Strategies
- Features (Pull Requests): Initially considered, but feature descriptions differ significantly from in-editor prompts (e.g., “fix that error”) [00:07:01]. Also, features are often very large (500-1000 lines), requiring automated ways to break them into smaller components [00:07:20].
- Commits: Smaller than features, but at Jane Street commits are used as checkpoints rather than as isolated, described changes [00:07:39]; they carry no descriptions and do not represent self-contained units of work [00:08:08].
- Workspace Snapshotting (Successful Approach): This method involves taking snapshots of developer workstations every 20 seconds, along with their build status (see the sketch after this list) [00:08:17].
- Identifying Changes: Patterns like “green to red to green” often indicate an isolated change where a developer broke and then fixed the build [00:08:38].
- Capturing Data: By capturing the build error at the “red” state and the diff from “red” to “green”, this data can be used to train the model to recover from mistakes [00:08:50].
- Generating Descriptions: A large language model is used to write detailed descriptions of changes, which are then filtered down to approximate what a human would write [00:09:07].
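A minimal sketch of the “green to red to green” mining step follows, assuming a hypothetical snapshot record (timestamp, build status, workspace diff); it pairs the build error captured at the red state with the accumulated diff that restored a green build.

```ocaml
(* Sketch of mining "green -> red -> green" sequences from workspace
   snapshots. The snapshot representation below is an assumption. *)

type build_status = Green | Red of string  (* Red carries the build error *)

type snapshot = {
  time : float;             (* snapshot timestamp (taken ~every 20 s) *)
  status : build_status;
  workspace_diff : string;  (* diff of the workspace vs. the last snapshot *)
}

(* One recovered training example: the error seen at the red state and the
   diff that took the workspace back to green. *)
type recovery_example = { build_error : string; fixing_diff : string }

(* Accumulate diffs until the build is green again; return the recovery
   example and the remaining snapshots. *)
let rec to_green err acc = function
  | ({ status = Green; workspace_diff; _ } :: _) as rest ->
    Some
      ( { build_error = err;
          fixing_diff = String.concat "\n" (List.rev (workspace_diff :: acc)) },
        rest )
  | { status = Red _; workspace_diff; _ } :: more ->
    to_green err (workspace_diff :: acc) more
  | [] -> None

(* Walk the snapshot stream looking for green -> red transitions and mine a
   recovery example from each one that eventually returns to green. *)
let rec mine_recoveries = function
  | { status = Green; _ } :: ({ status = Red err; _ } :: rest) ->
    (match to_green err [] rest with
     | Some (example, remaining) -> example :: mine_recoveries remaining
     | None -> mine_recoveries rest)
  | _ :: rest -> mine_recoveries rest
  | [] -> []
```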
Reinforcement Learning
After supervised training with collected data, reinforcement learning (RL) is crucial for aligning the model’s output with human expectations of “good code” [00:09:31].
- Defining “Good Code”:
- Code that parses [00:09:47].
- Code that type-checks (especially critical for OCaml as a statically typed language) [00:09:59].
- Code that compiles and passes tests [00:10:13].
- Code Evaluation Service (CES): To facilitate RL, Jane Street built CES, a service similar to a build service but optimized for speed (a worker loop is sketched after this list) [00:10:38].
- Process: CES pre-warms a build to a “green” state at a specific revision [00:10:50]. Workers then continuously take diffs from the model, apply them, determine if the build status turns red or green, and report success or error back [00:10:55].
- Outcome: Over months of use, CES helped align the model to write code that compiles and passes tests [00:11:07].
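The worker loop might look roughly like the following sketch, written against an assumed `Ces` interface; the function names and signature are illustrative, not Jane Street’s actual service API.

```ocaml
(* A minimal sketch of a CES-style worker loop, under assumed interfaces:
   start from a build pre-warmed to green, apply a candidate diff from the
   model, check whether the build stays green, report the result, then roll
   back to the warm state. *)

module type Ces = sig
  type build
  val warm_build : revision:string -> build   (* build pre-warmed to green *)
  val apply_diff : build -> diff:string -> unit
  val is_green : build -> bool
  val build_error : build -> string option
  val rollback : build -> unit                (* restore the warm state *)
  val next_diff : unit -> string              (* take a diff from the model *)
  val report : diff:string -> ok:bool -> error:string option -> unit
end

module Worker (C : Ces) = struct
  let rec run build =
    let diff = C.next_diff () in
    C.apply_diff build ~diff;
    let ok = C.is_green build in
    (* Green/red status plus any captured error is what feeds the reward. *)
    C.report ~diff ~ok ~error:(C.build_error build);
    C.rollback build;
    run build

  let start ~revision = run (C.warm_build ~revision)
end
```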
Evaluation
The same setup used for reinforcement learning can be leveraged for model evaluation [00:11:17]. By holding out some RL data, one can evaluate the model’s ability to write working code [00:11:22].
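A held-out evaluation of this kind reduces to a pass-rate computation; the sketch below assumes stand-in `generate` and `builds_green` functions in place of the real model and CES calls.

```ocaml
(* Sketch of evaluating a model on held-out examples: generate a diff for
   each prompt, run it through the same green/red check used for RL, and
   report the fraction that produce a working build. *)

type eval_case = { context : string; prompt : string }

let pass_rate
    ~(generate : eval_case -> string)        (* stand-in for the model *)
    ~(builds_green : diff:string -> bool)    (* stand-in for the CES check *)
    (cases : eval_case list) : float =
  let passed =
    List.filter (fun case -> builds_green ~diff:(generate case)) cases
  in
  float_of_int (List.length passed) /. float_of_int (List.length cases)
```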
Importance of Meaningful Evaluations
Training can lead to catastrophic, yet sometimes humorous, results. For example, a code review model trained on human examples once responded with “I’ll do it tomorrow” when given code to review [00:12:12]. Meaningful evaluations are crucial to prevent models from going “off the rails” and wasting time and resources [00:12:22].
Editor Integrations: AI Development Environment (Aid)
The ultimate test for models is their utility for human developers [00:12:33]. Jane Street developed editor integrations to expose these models to their developers.
Design Principles
When building these integrations, three key ideas were prioritized:
- Code Reusability: Avoid writing the same context-building and prompting strategies three times, once for each supported editor (Neovim, VS Code, Emacs) [00:12:44].
- Flexibility: Maintain the ability to easily swap out models or prompting strategies, anticipating the eventual use of fine-tuned models [00:13:02].
- Metrics Collection: Collect real-world metrics like latency and diff application success rates to gauge the meaningfulness of the generated diffs [00:13:15].
Architecture: Aid as a Sidecar
The chosen architecture for the AI Development Environment (Aid) service places Aid as a sidecar application on the developer’s machine [00:13:33].
- Functionality: Aid handles interactions with LLMs, constructs prompts, manages context, and monitors build status (a rough sketch of this boundary follows the list) [00:13:38].
- Editor Layers: Thin layers are built on top of Aid for each individual editor [00:13:49].
- Benefits: This sidecar approach means changes to Aid do not require individual editor updates; Aid can be restarted on developer machines for immediate updates [00:13:54].
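The sketch below illustrates how thin that editor boundary can be, using hypothetical request/response shapes rather than Aid’s actual API.

```ocaml
(* Sketch of the boundary between a thin editor layer and an Aid-style
   sidecar running on the developer's machine. The shapes below are
   illustrative assumptions, not Aid's actual API. *)

type request = {
  editor : string;              (* "emacs" | "vscode" | "neovim" *)
  prompt : string;              (* the user's in-editor request *)
  visible_files : string list;  (* context gathered by the editor layer *)
}

type response = {
  diff : string;            (* proposed multi-file diff *)
  applied_cleanly : bool;   (* one of the metrics worth tracking *)
  latency_ms : int;
}

(* An editor layer only needs this much: send a request to the local sidecar
   and render whatever comes back. Prompt construction, context management,
   model selection, and build-status tracking all live in the sidecar, so
   they can change without touching any editor. *)
module type Editor_layer = sig
  val request_diff : request -> response
  val render : response -> unit
end
```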
User Experience
- VS Code: Aid integrates into the VS Code sidebar, similar to Copilot, allowing users to request and receive multi-file diffs through a visual interface [00:14:15].
- Emacs: For Emacs users, who prefer text buffers, the Aid experience is built into a Markdown buffer. Users can navigate, ask questions, and use keybinds to append content [00:14:35].
Aid’s Flexibility and Value
Aid’s architecture enables significant flexibility:
- Pluggable Components: New models, context-building strategies, and even support for new editors can be plugged in [00:14:58].
- Domain-Specific Tools: Different areas of the company can supply custom tools, which become available across all supported editors without individual integrations [00:15:14].
- A/B Testing: Aid allows A/B testing of different approaches, such as directing half the company to one model and the other half to another and comparing acceptance rates (a routing sketch follows below) [00:15:28].
Aid is a long-term investment, ensuring that any changes in LLMs can be managed in one central place downstream of the editors, making updates available everywhere efficiently [00:15:38].
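One simple way such an A/B split could be implemented is sketched below: a stable hash of the username picks the arm, so every request from a given user goes to the same backend while acceptance rates are collected. The names and the even split are illustrative assumptions.

```ocaml
(* Sketch of A/B routing between two model backends. *)

type arm = Control | Experiment

(* A stable hash of the username decides the arm, so a given user always
   sees the same model for the duration of the experiment. *)
let assign ~user : arm =
  if Hashtbl.hash user mod 2 = 0 then Control else Experiment

(* Hypothetical backend names. *)
let model_for = function
  | Control -> "model-a"
  | Experiment -> "model-b"

(* Returns the arm (for metrics) and the backend to route this request to. *)
let route ~user =
  let arm = assign ~user in
  (arm, model_for arm)
```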
Future Endeavors
The team is actively pursuing other areas, including:
- Applying RAG (Retrieval-Augmented Generation) within editors [00:16:03].
- Applying similar approaches to large-scale multi-agent workflows [00:16:06].
- Working more with reasoning models [00:16:11].
Across all these efforts, the core principles remain: pluggability, building a strong foundation, and enabling other parts of the company to add domain-specific tooling [00:16:16].