From: aidotengineer

At Jane Street, the AI Assistant team, led by John Kzi, aims to maximize the value the firm gets from large language models (LLMs) in developer tools [00:24:00]. LLMs offer an unusually open-ended opportunity for tool development, limited primarily by creativity [00:38:00].

Unique Development Environment Challenges

Jane Street’s adoption of off-the-shelf tooling for LLMs is complicated by several factors:

  • OCaml as a Primary Language: OCaml is a powerful but obscure functional language, most commonly used for theorem proving and formal verification [01:02:00]. Jane Street uses OCaml for almost everything, including web applications (transpiled to JavaScript via js_of_ocaml), Vim plugins (transpiled to Vimscript via VAML), and FPGA code (Hardcaml) [01:28:00].
  • LLM Proficiency with OCaml: Models generally perform poorly on OCaml, largely because so little OCaml exists publicly for training: the OCaml inside Jane Street's internal systems likely exceeds the total amount of OCaml code available in the outside world [02:14:00].
  • Internal Tooling: The company has built its own custom tools, including build systems, a distributed build environment, and a code review system called Iron [02:42:00].
  • Monorepo and Version Control: All software is developed on a giant monorepo, which is stored in Mercurial, not Git [02:53:00].
  • Editor Preference: A significant portion (67%) of the firm uses Emacs, contrasting with more common editors like VS Code [03:02:00].
  • Ambitious Goals: There’s a desire to deeply integrate LLMs across the entire development flow for tasks like resolving merge conflicts, generating feature descriptions, or identifying code reviewers, without being constrained by system boundaries [03:19:00].

Approach to LLMs in Developer Tools

Jane Street’s strategy focuses on:

Building Custom Models for Diff Generation

Inspired by Meta’s CodeCompose project, which fine-tuned models for Hack (a language that, like OCaml, is used primarily at one company) [04:26:00], Jane Street set a goal: generate diffs from a prompt within the editor [05:30:00]. These diffs should ideally span multiple files, apply cleanly, typecheck, and be up to around 100 lines long [05:43:00].

The process involves two main phases:

1. Supervised Training Data Collection

To train a model for this task, examples are needed in a specific “Context + Prompt + Diff” shape (sketched in code after the list below) [06:16:00]. Traditional data sources proved unsuitable:

  • Code Review System (Iron): It contains descriptions and diffs, but the descriptions are too detailed (more like pull request descriptions than prompts), and the diffs are often too large (500-1,000 lines) to generate in a single shot [07:01:00].
  • Commits: Commits are used as checkpoints rather than isolated changes and lack descriptions [07:58:00].
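A minimal OCaml sketch of the “Context + Prompt + Diff” shape referenced above; the record and field names are illustrative assumptions, not Jane Street’s actual schema:

```ocaml
(* Hypothetical shape of one supervised training example: surrounding code
   context, a concise human-style prompt, and the target diff. *)
type training_example = {
  context : string list; (* relevant files or snippets visible to the model *)
  prompt : string;       (* concise, human-like description of the change *)
  diff : string;         (* diff the model should learn to produce *)
}

(* A toy example in that shape (contents are made up for illustration). *)
let example =
  { context = [ "lib/price.ml"; "lib/price.mli" ]
  ; prompt = "Handle the zero-quantity case in average_price"
  ; diff = "--- a/lib/price.ml\n+++ b/lib/price.ml\n@@ ..."
  }

let () =
  Printf.printf "prompt: %s (%d context files)\n"
    example.prompt (List.length example.context)
```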

Workspace Snapshotting: The solution was to capture “isolated changes” by snapshotting developer workstations periodically (e.g., every 20 seconds), along with their build status [08:17:00]. A “Green-to-Red-to-Green” pattern in the build status often marks an isolated change in which a developer introduced a bug and then fixed it [08:38:00]. Capturing the build error at the Red state and the diff from the Red state to the subsequent Green state yields training data for error recovery [08:57:00]. An LLM then generates a concise, human-like description to serve as the prompt portion of each training example [09:08:00].
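A minimal OCaml sketch of the Green-to-Red-to-Green mining step described above; the snapshot type and pairing rule are assumptions for illustration, and the real pipeline additionally attaches the LLM-written description:

```ocaml
type build_status = Green | Red

(* One periodic capture of a developer workstation. Field names are
   hypothetical stand-ins for whatever the real snapshotter records. *)
type snapshot = {
  captured_at : float;    (* when the workspace was snapshotted *)
  status : build_status;  (* build result at that moment *)
  build_output : string;  (* compiler/test output; the error when Red *)
  workspace : string;     (* path or handle to the captured file tree *)
}

(* Scan snapshots in time order and return (red, recovering_green) pairs:
   the Red snapshot carries the build error, and diffing its workspace
   against the next Green snapshot yields the "fix" used as a target. *)
let error_recovery_pairs snapshots =
  let rec loop acc = function
    | { status = Green; _ } :: (({ status = Red; _ } as red) :: _ as rest) ->
      (match List.find_opt (fun s -> s.status = Green) rest with
       | Some green -> loop ((red, green) :: acc) rest
       | None -> List.rev acc)
    | _ :: rest -> loop acc rest
    | [] -> List.rev acc
  in
  loop [] snapshots
```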

2. Reinforcement Learning

This phase aligns the model’s output with human notions of “good code” [09:31:00]: code that typechecks, compiles, and passes tests.

Code Evaluation Service (CES): A build service called CES was developed for this purpose [10:38:00]. It pre-warms a build, and workers then apply diffs from the model and check whether the build stays green [10:47:00]. Running this feedback loop continuously over months aligns the model toward generating code that compiles and passes tests [11:07:00]. The same setup is also used for model evaluation [11:20:00].
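A rough OCaml sketch of the CES worker loop as described: apply a candidate diff against a pre-warmed build, check whether the build stays green, and turn that into a reward signal. The build hooks below are hypothetical stubs, not Jane Street’s internal build API:

```ocaml
type build_result = Green | Red of string   (* Red carries the build error *)

(* Hypothetical hooks into the pre-warmed build environment, stubbed out. *)
let apply_diff ~workspace ~diff = ignore (workspace, diff)
let revert ~workspace = ignore workspace
let incremental_build ~workspace = ignore workspace; Green

(* Score one candidate diff: reward code that keeps the build green
   (and, in practice, passes tests); penalize anything that breaks it. *)
let score_candidate ~workspace ~diff =
  apply_diff ~workspace ~diff;
  let result = incremental_build ~workspace in
  revert ~workspace;
  match result with
  | Green -> 1.0
  | Red _error -> 0.0
```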

Training Challenges

A code review model, trained on human examples, once responded to a code review request with “I’ll do it tomorrow” [12:08:00]. This highlights the importance of meaningful evaluations to prevent models from going “off the rails” [12:22:02].

Editor Integrations: The AI Development Environment (Aid)

The real test for models is their utility for humans [12:33:00]. Jane Street’s editor integrations are built with three key ideas:

  1. Write Once, Integrate Across Editors: Avoid rewriting context-building and prompting strategies for each editor (Neovim, VS Code, Emacs) [12:48:00].
  2. Maintain Flexibility: Be able to easily swap models or prompting strategies [13:02:00].
  3. Collect Metrics: Gather real-world data on latency and diff application success [13:17:00].
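As a sketch of the third point, here is the kind of per-request record Aid might log, in OCaml; the field names are hypothetical:

```ocaml
(* One metrics record per generation request; illustrative fields only. *)
type aid_metric = {
  editor : string;             (* e.g. "emacs", "vscode", "neovim" *)
  model : string;              (* model / prompting strategy that served it *)
  latency_ms : int;            (* request-to-diff latency *)
  diff_applied_cleanly : bool; (* did the generated diff apply? *)
  accepted : bool;             (* did the developer keep the change? *)
}

let log_metric m =
  Printf.printf "%s/%s %dms applied=%b accepted=%b\n"
    m.editor m.model m.latency_ms m.diff_applied_cleanly m.accepted

let () =
  log_metric
    { editor = "emacs"; model = "diff-v1"; latency_ms = 850
    ; diff_applied_cleanly = true; accepted = true }
```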

Architecture of Aid: Aid acts as a sidecar application on the developer’s machine [13:55:00]; a sketch of this split appears after the list below.

  • Aid handles prompt construction, context building, and build status integration [13:41:00].
  • Thin layers are built on top of Aid for each editor [13:49:00].
  • Changes to Aid can be deployed by restarting the service, without requiring editor restarts [14:00:00].
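A minimal OCaml sketch of that split: the editor-agnostic work sits behind a single service interface, and each editor layer only translates editor state into requests. Module and function names here are assumptions, not Aid’s real API:

```ocaml
(* Editor-agnostic side: prompt/context construction and model choice live
   behind this interface, implemented by the Aid sidecar. *)
module type Aid_service = sig
  type request = {
    file : string;         (* file the cursor is in *)
    cursor_offset : int;
    prompt : string;       (* what the developer asked for *)
  }

  type response = {
    diff : string;         (* proposed (possibly multi-file) diff *)
    model_used : string;
  }

  val generate_diff : request -> response
end

(* Each editor integration stays thin: translate editor state into a
   request and hand the response back to the editor for display. *)
module Make_editor_layer (Aid : Aid_service) = struct
  let on_user_prompt ~file ~cursor_offset ~prompt =
    Aid.generate_diff { Aid.file; cursor_offset; prompt }
end
```

Because the editor layers only speak this narrow interface, swapping models or prompting strategies inside the sidecar (and restarting it) requires no changes on the editor side.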

Editor-Specific Implementations:

  • VS Code: Integrates into the sidebar, similar to Copilot, providing a visual interface for multi-file diffs [14:16:00].
  • Emacs: The Aid experience lives in a plain markdown buffer, so users can interact with it like any other text buffer and use familiar keybindings to move and append content [14:35:00].

Benefits and Future Directions

Aid’s architecture enables significant flexibility:

  • Pluggable Components: New models, context-building methods, and editor support can be easily integrated [14:58:00].
  • Domain-Specific Tooling: Different areas of the company can supply specific tools that become available across all integrated editors without individual integrations [15:15:00].
  • A/B Testing: Different approaches can be A/B tested (e.g., sending 50% of users to one model and 50% to another) to determine which yields a higher diff-acceptance rate; see the sketch after this list [15:28:00].
  • Iterative Improvement: Aid is an investment that pays off over time, allowing rapid adaptation to changes in LLM technology [15:39:00].
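A small OCaml sketch of the A/B-testing idea: deterministically route each user to a variant and compare acceptance rates. The 50/50 split, variant names, and data shapes are hypothetical:

```ocaml
(* Deterministically assign each user to one of two model variants. *)
let variant_for_user username =
  if Hashtbl.hash username mod 2 = 0 then "model-a" else "model-b"

(* Acceptance rate per variant, given (variant, accepted) observations. *)
let acceptance_rate observations variant =
  let relevant = List.filter (fun (v, _) -> v = variant) observations in
  let accepted = List.filter (fun (_, a) -> a) relevant in
  float_of_int (List.length accepted)
  /. float_of_int (max 1 (List.length relevant))

let () =
  Printf.printf "user jdoe -> %s\n" (variant_for_user "jdoe");
  let obs = [ ("model-a", true); ("model-a", false); ("model-b", true) ] in
  Printf.printf "model-a: %.2f  model-b: %.2f\n"
    (acceptance_rate obs "model-a")
    (acceptance_rate obs "model-b")
```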

The team is also exploring other avenues, including applying RAG (Retrieval-Augmented Generation) within editors, large-scale multi-agentic workflows, and reasoning models [16:03:00]. The core philosophy remains consistent: maintain pluggability, build a strong foundation, and enable the rest of the company to contribute domain-specific tooling [16:16:00].