From: aidotengineer

Jane Street’s AI Assistant team focuses on maximizing the value the company gets from large language models (LLMs) [00:00:24]. John Kzi, who comes from a dev tools background, sees an “amazing opportunity” in LLMs: their open-ended nature lets the team build “anything that we can imagine” [00:00:38]. The rapid progress of the models themselves is outpaced only by the creativity with which they can be employed [00:00:47].

Unique Development Environment Challenges

Adopting off-the-shelf AI tooling at Jane Street is challenging due to several unique internal choices [00:00:53]:

  • OCaml as a Primary Language [00:01:02]: OCaml is a powerful functional language, but it’s obscure [00:01:08]. Its common applications are in theorem proving, formal verification, and writing programming languages [00:01:17]. Jane Street uses OCaml for “everything,” including:
    • Web applications (using js_of_ocaml to transpile OCaml to JavaScript) [00:01:31].
    • Vim plugins (using Vaml to transpile to Vim script) [00:01:47].
    • FPGA code (using HardCaml instead of Verilog) [00:01:58].
  • Lack of OCaml Training Data for LLMs [00:02:11]: LLMs are not proficient in OCaml, largely because the amount of OCaml code within Jane Street likely exceeds the total combined amount available in the public domain for training [00:02:20].
  • Custom Internal Infrastructure [00:02:37]:
    • They built their own build systems and distributed build environment [00:02:42].
    • An internal code review system called “Iron” [00:02:47].
    • A single, giant monorepo stored in Mercurial instead of Git [00:02:53].
    • 67% of the firm uses Emacs, though VS Code is also used [00:03:02].
  • Ambitious AI Integration Goals [00:03:14]: Jane Street wants to apply LLMs across its development workflow, for example resolving merge conflicts, drafting feature descriptions, or identifying code reviewers [00:03:20]. They want AI agents to integrate seamlessly with this existing infrastructure rather than being stopped at system boundaries [00:03:34].

Approach to LLMs in Developer Tools

Jane Street’s strategy for integrating AI into natural workflows in developer tools involves:

  1. Building Custom Models [00:03:49].
  2. Editor Integrations (VS Code, Emacs, Neovim) [00:03:52].
  3. Model Evaluation [00:04:02].

Custom Model Development

Initially, the team was “naive,” assuming they could simply fine-tune an off-the-shelf model by showing it their code [00:04:55]. They took guidance from Meta’s CodeCompose project, which fine-tuned a model for Hack, a language that, like OCaml, is used primarily within a single company [00:04:26]. (Interestingly, Hack is implemented in OCaml [00:04:50].)

The key insight was that good outcomes require the model to see examples in the shape of the desired question [00:05:16].

Defining the Goal

The primary goal was to generate diffs given a prompt [00:05:24]. A user should be able to describe a desired change in their editor and have the model suggest a multi-file diff (e.g., touching the test file, the .ml implementation, and the .mli interface) [00:05:37]. These diffs needed to apply cleanly and have a high likelihood of type-checking, ideally staying within roughly 100 lines [00:05:51].
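
A minimal sketch of what that target could look like as data, assuming a simple record representation (the field names and the 100-line check are illustrative, not Jane Street’s actual schema):

```ocaml
(* Illustrative sketch of the (context, prompt, diff) shape described above.
   Field names are assumptions, not Jane Street's actual schema. *)
type file_change = {
  path : string;   (* e.g. "foo.ml", "foo.mli", or "test_foo.ml" *)
  patch : string;  (* unified-diff text for this one file *)
}

type training_example = {
  context : string;         (* surrounding code and build state shown to the model *)
  prompt : string;          (* short, human-style description of the change *)
  diff : file_change list;  (* suggested multi-file diff for that prompt *)
}

(* Rough size check matching the "roughly 100 lines" target. *)
let diff_line_count (ex : training_example) =
  List.fold_left
    (fun acc fc -> acc + List.length (String.split_on_char '\n' fc.patch))
    0 ex.diff

let within_target ex = diff_line_count ex <= 100
```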

Data Collection

The process required collecting “context, prompt, diff” triplets as training data [00:06:11]:

  • Challenges with existing data sources:
    • Features (Jane Street’s analogue of pull requests): these contain human-written descriptions and diffs, but the descriptions are too verbose for in-editor prompts, and the diffs are often far too large (500–1,000 lines) [00:07:01].
    • Commits: At Jane Street, commits are used as checkpoints, not isolated changes, and lack descriptions [00:07:57].
  • Solution: Workspace Snapshotting [00:08:15]:
    • Snapshots of developer workstations are taken frequently (e.g., every 20 seconds), along with the build status [00:08:21].
    • “Green to Red to Green” build status patterns indicate isolated changes where a developer broke and then fixed the build [00:08:38].
    • The build error at the “Red” state and the diff from “Red” to “Green” are captured as training data, teaching the model how to recover from mistakes (this scan is sketched below) [00:08:51].
  • Generating Descriptions: An LLM is used to write detailed descriptions of changes, which are then filtered down to approximate what a human would write in a short prompt [00:09:07].
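
A rough sketch of the “Green to Red to Green” scan this describes; the snapshot record and helper names are hypothetical, and the real 20-second capture pipeline is internal:

```ocaml
(* Hypothetical snapshot record; the real capture pipeline is internal. *)
type build_status = Green | Red

type snapshot = {
  time : float;                 (* capture time, roughly every 20 seconds *)
  status : build_status;        (* build status at this snapshot *)
  build_error : string option;  (* compiler/test output when status = Red *)
  workspace : string;           (* stand-in for the captured workspace state *)
}

(* Scan the snapshot stream for Green -> Red -> Green sequences: the error at
   the Red snapshot plus the Red -> Green diff form one training example about
   recovering from a mistake. *)
let isolated_fixes snapshots =
  let rec go acc = function
    | { status = Green; _ } :: ({ status = Red; _ } as red) :: rest ->
      (match List.find_opt (fun s -> s.status = Green) rest with
       | Some green -> go ((red, green) :: acc) rest
       | None -> List.rev acc)
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snapshots
```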

Reinforcement Learning (RL)

RL is crucial for aligning the model with what humans consider “good code” [00:09:34]. Good code is defined as [00:09:44]:

  • Parsable [00:09:47].
  • Type-checked (OCaml is statically typed) [00:10:01].
  • Compilable and passing its tests [00:10:15].

To provide this signal at scale, the team built a Code Evaluation Service (CES), which acts like a fast build service [00:10:38] (a minimal worker loop is sketched below):

  • A build is pre-warmed at a specific revision with a green status [00:10:50].
  • Workers continuously pull diffs from the model, apply them, and report whether the build turns red or green [00:10:55].
  • Over months of training, this red/green feedback aligns the model toward writing code that compiles and passes tests [00:11:07].
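
A minimal sketch of the kind of worker loop this implies; `next_diff`, `apply_diff`, `run_build`, and `report_reward` are hypothetical stand-ins for internal services rather than a real API:

```ocaml
(* Hypothetical worker: pull a model-generated diff, apply it on top of the
   pre-warmed green revision, rebuild, and report red/green as the RL reward. *)
type build_result = Builds_green | Breaks_red of string

let rec ces_worker ~prewarmed_rev ~next_diff ~apply_diff ~run_build ~report_reward =
  match next_diff () with
  | None -> ()  (* queue drained *)
  | Some diff ->
    let workspace = apply_diff ~rev:prewarmed_rev ~diff in
    let reward =
      match run_build workspace with
      | Builds_green -> 1.0        (* code compiles and tests pass *)
      | Breaks_red _error -> 0.0   (* build broke: negative signal *)
    in
    report_reward ~diff ~reward;
    ces_worker ~prewarmed_rev ~next_diff ~apply_diff ~run_build ~report_reward
```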

Evaluation

The same CES setup can be used for evaluation by holding out some of the RL data [00:11:19], which lets the team assess whether the code a model generates actually works [00:11:28]. Meaningful evaluations are critical to keep models from “going off the rails” and wasting time and money [00:12:24]; for example, a code review model trained on human examples once suggested “I’ll do it tomorrow,” since that is the kind of thing humans write in reviews [00:12:08].
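
A sketch of how the same red/green signal can become an evaluation number over held-out prompts; `generate_diff` and `builds_green` are assumed stand-ins for the model and CES:

```ocaml
(* Fraction of held-out prompts for which the model's diff builds green.
   Both functions are hypothetical stand-ins for the model and CES. *)
let eval_pass_rate ~generate_diff ~builds_green held_out_prompts =
  let total = List.length held_out_prompts in
  if total = 0 then 0.0
  else
    let passed =
      List.fold_left
        (fun acc prompt -> if builds_green (generate_diff prompt) then acc + 1 else acc)
        0 held_out_prompts
    in
    float_of_int passed /. float_of_int total
```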

Editor Integrations

The ultimate test for models is whether they work for humans [00:12:33]. The team built editor integrations with three main goals:

  1. Avoid Redundancy: Support Neovim, VS Code, and Emacs without writing context-building and prompting strategies three times [00:12:44].
  2. Maintain Flexibility: Easily swap out models or prompting strategies [00:13:02].
  3. Collect Metrics: Gather real-world data on latency and on how often suggested diffs apply successfully (an example of such metrics is sketched below) [00:13:17].
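
As an illustration of the third goal, a sketch of the kind of per-request metrics an editor layer might report; the record fields are assumptions:

```ocaml
(* Hypothetical per-request metrics record reported by each editor layer. *)
type request_metrics = {
  editor : string;         (* "vscode", "emacs", or "neovim" *)
  latency_ms : int;        (* time from prompt to suggested diff *)
  diff_applied : bool;     (* did the suggested diff apply cleanly? *)
  accepted : bool;         (* did the user keep the change? *)
}

(* Share of requests whose diffs applied cleanly, e.g. per editor. *)
let apply_success_rate metrics =
  match List.length metrics with
  | 0 -> 0.0
  | n ->
    let ok = List.length (List.filter (fun m -> m.diff_applied) metrics) in
    float_of_int ok /. float_of_int n
```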

AI Development Environment (AID) Architecture

The chosen architecture involves an “AI Development Environment” (AID) service [00:13:32]:

  • LLMs are on one side, and AID handles prompt construction, context building, and accessing build status [00:13:38].
  • Thin layers are built on top of AID for each individual editor (see the sketch after this list) [00:13:49].
  • AID runs as a sidecar application on the developer’s machine [00:13:55]. Because of this, changes to AID don’t require developers to restart their editors; the AID service itself can be restarted remotely [00:14:00].
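
A small sketch of what this editor-to-AID boundary could look like; the request/response fields and the `suggest` function are assumptions, included only to show how thin the per-editor layer can stay:

```ocaml
(* Hypothetical request/response types at the editor <-> AID boundary. *)
type aid_request = {
  file : string;     (* file currently being edited *)
  prompt : string;   (* what the user asked for *)
}

type aid_response = {
  diff : (string * string) list;  (* (path, patch) pairs, possibly multi-file *)
}

(* A thin editor layer: convert editor state to a request, hand it to the
   sidecar's [suggest], and render the returned diff. Prompt construction,
   context building, and model choice all live behind [suggest]. *)
let handle_prompt ~(suggest : aid_request -> aid_response) ~render ~file ~prompt =
  let response = suggest { file; prompt } in
  render response.diff
```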

Editor Examples:

  • VS Code: AID integrates into the sidebar, offering multi-file diff suggestions through a visual interface [00:14:16].
  • Emacs: For Emacs users accustomed to text buffers, the AID experience is built into a Markdown buffer, allowing users to move around, ask questions, and append content using keybinds [00:14:35].

Benefits of AID Architecture:

  • Pluggability: Allows swapping new models, changing context building, and adding support for new editors (currently being done) [00:14:58].
  • Integration of AI coding agents with third-party tools: Different areas of the company can supply domain-specific tools that become available in all editors without individual integrations [00:15:15].
  • A/B Testing: Enables A/B testing of different approaches, for example sending 50% of the company to one model and 50% to another and measuring which achieves the higher acceptance rate (sketched below) [00:15:28].
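
A minimal sketch of a deterministic 50/50 split and acceptance-rate comparison; hashing on a user id and the shape of the acceptance events are assumptions:

```ocaml
(* Deterministically assign each user to one of two model arms. *)
let assign_arm user_id =
  if Hashtbl.hash user_id mod 2 = 0 then `Model_a else `Model_b

(* Events are (arm, accepted) pairs; compare acceptance rates per arm. *)
let acceptance_rate arm events =
  let in_arm = List.filter (fun (a, _) -> a = arm) events in
  match List.length in_arm with
  | 0 -> 0.0
  | n ->
    let accepted = List.length (List.filter snd in_arm) in
    float_of_int accepted /. float_of_int n
```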

The AID architecture is a long-term investment that pays off over time, as any change in large language models can be applied in one central place and instantly made available everywhere [00:15:39]. This foundational approach supports further work, including applying RAG (Retrieval Augmented Generation), multi-agent workflows, and reasoning models, while maintaining pluggability and allowing other parts of the company to add domain-specific tooling [00:15:57].