From: aidotengineer

At Jane Street, the AI Assistant team, led by John Kzi, aims to maximize the value the firm gets from large language models (LLMs) in developer tools [00:24:00]. LLMs offer an unusually open-ended opportunity for tool development, limited primarily by creativity [00:38:00].

Unique Development Environment Challenges

Jane Street’s adoption of off-the-shelf tooling for LLMs is complicated by several factors:

  • OCaml as a Primary Language: OCaml is a powerful but obscure functional language, most commonly used for theorem proving and formal verification [01:02:00]. Jane Street uses OCaml for almost everything, including web applications (transpiled to JavaScript via js_of_ocaml), Vim plugins (transpiled to Vimscript via VAML), and FPGA code (Hardcaml) [01:28:00].
  • LLM Proficiency with OCaml: Models generally perform poorly on OCaml, largely because so little OCaml exists publicly for training: the OCaml inside Jane Street's internal systems likely exceeds the total amount of OCaml code available in the outside world [02:14:00].
  • Internal Tooling: The company has built its own custom tools, including build systems, a distributed build environment, and a code review system called Iron [02:42:00].
  • Monorepo and Version Control: All software is developed on a giant monorepo, which is stored in Mercurial, not Git [02:53:00].
  • Editor Preference: A significant portion (67%) of the firm uses Emacs, contrasting with more common editors like VS Code [03:02:00].
  • Ambitious Goals: There’s a desire to deeply integrate LLMs across the entire development flow for tasks like resolving merge conflicts, generating feature descriptions, or identifying code reviewers, without being constrained by system boundaries [03:19:00].

Approach to LLMs in Developer Tools

Jane Street’s strategy focuses on:

Building Custom Models for Diff Generation

Inspired by Meta’s CodeCompose project, which fine-tuned models for Hack (a language that, like OCaml, is used primarily at one company) [04:26:00], Jane Street set a goal: generate diffs from a prompt within the editor [05:30:00]. These diffs should ideally span multiple files, apply cleanly, typecheck, and be up to around 100 lines long [05:43:00].

The process involves two main phases:

1. Supervised Training Data Collection

To train a model for this task, examples are needed in a specific “Context + Prompt + Diff” shape (sketched in code after the list below) [06:16:00]. Traditional data sources proved unsuitable:

  • Code Review System (Iron): It contains descriptions and diffs, but the descriptions are too detailed (more like pull request descriptions than prompts), and the diffs are often too large (500-1,000 lines) to generate in a single shot [07:01:00].
  • Commits: Commits are used as checkpoints rather than isolated changes and lack descriptions [07:58:00].
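A minimal OCaml sketch of the “Context + Prompt + Diff” shape referenced above; the record and field names are illustrative assumptions, not Jane Street’s actual schema:

```ocaml
(* Hypothetical shape of one supervised training example: surrounding code
   context, a concise human-style prompt, and the target diff. *)
type training_example = {
  context : string list; (* relevant files or snippets visible to the model *)
  prompt : string;       (* concise, human-like description of the change *)
  diff : string;         (* diff the model should learn to produce *)
}

(* A toy example in that shape (contents are made up for illustration). *)
let example =
  { context = [ "lib/price.ml"; "lib/price.mli" ]
  ; prompt = "Handle the zero-quantity case in average_price"
  ; diff = "--- a/lib/price.ml\n+++ b/lib/price.ml\n@@ ..."
  }

let () =
  Printf.printf "prompt: %s (%d context files)\n"
    example.prompt (List.length example.context)
```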

Workspace Snapshotting: The solution was to capture “isolated changes” by snapshotting developer workstations periodically (e.g., every 20 seconds), along with their build status [08:17:00]. A “Green-to-Red-to-Green” pattern in the build status often marks an isolated change in which a developer introduced a bug and then fixed it [08:38:00]. Capturing the build error at the Red state and the diff from the Red state to the subsequent Green state yields training data for error recovery [08:57:00]. An LLM then generates a concise, human-like description to serve as the prompt portion of each training example [09:08:00].
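A minimal OCaml sketch of the Green-to-Red-to-Green mining step described above; the snapshot type and pairing rule are assumptions for illustration, and the real pipeline additionally attaches the LLM-written description:

```ocaml
type build_status = Green | Red

(* One periodic capture of a developer workstation. Field names are
   hypothetical stand-ins for whatever the real snapshotter records. *)
type snapshot = {
  captured_at : float;    (* when the workspace was snapshotted *)
  status : build_status;  (* build result at that moment *)
  build_output : string;  (* compiler/test output; the error when Red *)
  workspace : string;     (* path or handle to the captured file tree *)
}

(* Scan snapshots in time order and return (red, recovering_green) pairs:
   the Red snapshot carries the build error, and diffing its workspace
   against the next Green snapshot yields the "fix" used as a target. *)
let error_recovery_pairs snapshots =
  let rec loop acc = function
    | { status = Green; _ } :: (({ status = Red; _ } as red) :: _ as rest) ->
      (match List.find_opt (fun s -> s.status = Green) rest with
       | Some green -> loop ((red, green) :: acc) rest
       | None -> List.rev acc)
    | _ :: rest -> loop acc rest
    | [] -> List.rev acc
  in
  loop [] snapshots
```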

2. Reinforcement Learning

This phase aligns the model’s output with human notions of “good code” [09:31:00]: code that typechecks, compiles, and passes tests.

Code Evaluation Service (CES): A build service called CES was developed for this purpose [10:38:00]. It pre-warms a build, and workers then apply diffs from the model and check whether the build stays green [10:47:00]. Running this feedback loop continuously over months aligns the model toward generating code that compiles and passes tests [11:07:00]. The same setup is also used for model evaluation [11:20:00].
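A rough OCaml sketch of the CES worker loop as described: apply a candidate diff against a pre-warmed build, check whether the build stays green, and turn that into a reward signal. The build hooks below are hypothetical stubs, not Jane Street’s internal build API:

```ocaml
type build_result = Green | Red of string   (* Red carries the build error *)

(* Hypothetical hooks into the pre-warmed build environment, stubbed out. *)
let apply_diff ~workspace ~diff = ignore (workspace, diff)
let revert ~workspace = ignore workspace
let incremental_build ~workspace = ignore workspace; Green

(* Score one candidate diff: reward code that keeps the build green
   (and, in practice, passes tests); penalize anything that breaks it. *)
let score_candidate ~workspace ~diff =
  apply_diff ~workspace ~diff;
  let result = incremental_build ~workspace in
  revert ~workspace;
  match result with
  | Green -> 1.0
  | Red _error -> 0.0
```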

Training Challenges

A code review model, trained on human examples, once responded to a code review request with “I’ll do it tomorrow” [12:08:00]. This highlights the importance of meaningful evaluations to prevent models from going “off the rails” [12:22:02].

Editor Integrations: The AI Development Environment (Aid)

The real test for models is their utility for humans [12:33:00]. Jane Street’s editor integrations are built with three key ideas:

  1. Write Once, Integrate Across Editors: Avoid rewriting context-building and prompting strategies for each editor (Neovim, VS Code, Emacs) [12:48:00].
  2. Maintain Flexibility: Be able to easily swap models or prompting strategies [13:02:00].
  3. Collect Metrics: Gather real-world data on latency and diff application success [13:17:00].
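As a sketch of the third point, here is the kind of per-request record Aid might log, in OCaml; the field names are hypothetical:

```ocaml
(* One metrics record per generation request; illustrative fields only. *)
type aid_metric = {
  editor : string;             (* e.g. "emacs", "vscode", "neovim" *)
  model : string;              (* model / prompting strategy that served it *)
  latency_ms : int;            (* request-to-diff latency *)
  diff_applied_cleanly : bool; (* did the generated diff apply? *)
  accepted : bool;             (* did the developer keep the change? *)
}

let log_metric m =
  Printf.printf "%s/%s %dms applied=%b accepted=%b\n"
    m.editor m.model m.latency_ms m.diff_applied_cleanly m.accepted

let () =
  log_metric
    { editor = "emacs"; model = "diff-v1"; latency_ms = 850
    ; diff_applied_cleanly = true; accepted = true }
```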

Architecture of Aid: Aid acts as a sidecar application on the developer’s machine [13:55:00]; a sketch of this split appears after the list below.

  • Aid handles prompt construction, context building, and build status integration [13:41:00].
  • Thin layers are built on top of Aid for each editor [13:49:00].
  • Changes to Aid can be deployed by restarting the service, without requiring editor restarts [14:00:00].
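A minimal OCaml sketch of that split: the editor-agnostic work sits behind a single service interface, and each editor layer only translates editor state into requests. Module and function names here are assumptions, not Aid’s real API:

```ocaml
(* Editor-agnostic side: prompt/context construction and model choice live
   behind this interface, implemented by the Aid sidecar. *)
module type Aid_service = sig
  type request = {
    file : string;         (* file the cursor is in *)
    cursor_offset : int;
    prompt : string;       (* what the developer asked for *)
  }

  type response = {
    diff : string;         (* proposed (possibly multi-file) diff *)
    model_used : string;
  }

  val generate_diff : request -> response
end

(* Each editor integration stays thin: translate editor state into a
   request and hand the response back to the editor for display. *)
module Make_editor_layer (Aid : Aid_service) = struct
  let on_user_prompt ~file ~cursor_offset ~prompt =
    Aid.generate_diff { Aid.file; cursor_offset; prompt }
end
```

Because the editor layers only speak this narrow interface, swapping models or prompting strategies inside the sidecar (and restarting it) requires no changes on the editor side.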

Editor-Specific Implementations:

  • VS Code: Integrates into the sidebar, similar to Copilot, providing a visual interface for multi-file diffs [14:16:00].
  • Emacs: The Aid experience lives in a plain markdown buffer, so users can interact with it like any other text buffer and use familiar keybindings to move and append content [14:35:00].

Benefits and Future Directions

Aid’s architecture enables significant flexibility:

  • Pluggable Components: New models, context-building methods, and editor support can be easily integrated [14:58:00].
  • Domain-Specific Tooling: Different areas of the company can supply specific tools that become available across all integrated editors without individual integrations [15:15:00].
  • A/B Testing: Different approaches can be A/B tested (e.g., sending 50% of users to one model and 50% to another) to determine which yields a higher diff-acceptance rate; see the sketch after this list [15:28:00].
  • Iterative Improvement: Aid is an investment that pays off over time, allowing rapid adaptation to changes in LLM technology [15:39:00].
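A small OCaml sketch of the A/B-testing idea: deterministically route each user to a variant and compare acceptance rates. The 50/50 split, variant names, and data shapes are hypothetical:

```ocaml
(* Deterministically assign each user to one of two model variants. *)
let variant_for_user username =
  if Hashtbl.hash username mod 2 = 0 then "model-a" else "model-b"

(* Acceptance rate per variant, given (variant, accepted) observations. *)
let acceptance_rate observations variant =
  let relevant = List.filter (fun (v, _) -> v = variant) observations in
  let accepted = List.filter (fun (_, a) -> a) relevant in
  float_of_int (List.length accepted)
  /. float_of_int (max 1 (List.length relevant))

let () =
  Printf.printf "user jdoe -> %s\n" (variant_for_user "jdoe");
  let obs = [ ("model-a", true); ("model-a", false); ("model-b", true) ] in
  Printf.printf "model-a: %.2f  model-b: %.2f\n"
    (acceptance_rate obs "model-a")
    (acceptance_rate obs "model-b")
```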

The team is also exploring other avenues, including applying RAG (Retrieval-Augmented Generation) within editors, large-scale multi-agentic workflows, and reasoning models [16:03:00]. The core philosophy remains consistent: maintain pluggability, build a strong foundation, and enable the rest of the company to contribute domain-specific tooling [16:16:00].