From: aidotengineer

Jane Street’s AI Assistant team focuses on maximizing the value the company gets from large language models (LLMs) [00:00:24]. John Kzi, who comes from a dev tools background, sees an “amazing opportunity” in LLMs: their open-ended nature lets the team build “anything that we can imagine” [00:00:38]. The rapid progress of the models themselves is outpaced only by the creativity with which they can be employed [00:00:47].

Unique Development Environment Challenges

Adopting off-the-shelf AI tooling at Jane Street is challenging due to several unique internal choices [00:00:53]:

  • OCaml as a Primary Language [00:01:02]: OCaml is a powerful functional language, but it’s obscure [00:01:08]. Its common applications are in theorem proving, formal verification, and writing programming languages [00:01:17]. Jane Street uses OCaml for “everything,” including:
    • Web applications (using js_of_ocaml to transpile OCaml to JavaScript) [00:01:31].
    • Vim plugins (using Vaml to transpile to Vim script) [00:01:47].
    • FPGA code (using HardCaml instead of Verilog) [00:01:58].
  • Lack of OCaml Training Data for LLMs [00:02:11]: LLMs are not proficient in OCaml, largely because the amount of OCaml code within Jane Street likely exceeds the total combined amount available in the public domain for training [00:02:20].
  • Custom Internal Infrastructure [00:02:37]:
    • They built their own build systems and distributed build environment [00:02:42].
    • An internal code review system called “Iron” [00:02:47].
    • A single, giant monorepo stored in Mercurial instead of Git [00:02:53].
    • 67% of the firm uses Emacs, though VS Code is also used [00:03:02].
  • Ambitious AI Integration Goals [00:03:14]: Jane Street wants to apply LLMs across its development workflow, for example resolving merge conflicts, drafting feature descriptions, or identifying code reviewers [00:03:20]. They want AI agents to integrate seamlessly with this existing infrastructure rather than being stopped at system boundaries [00:03:34].

Approach to LLMs in Developer Tools

Jane Street’s strategy for integrating AI into natural workflows in developer tools involves:

  1. Building Custom Models [00:03:49].
  2. Editor Integrations (VS Code, Emacs, Neovim) [00:03:52].
  3. Model Evaluation [00:04:02].

Custom Model Development

Initially, the team was “naive,” assuming they could simply fine-tune an off-the-shelf model by showing it their code [00:04:55]. They took guidance from Meta’s CodeCompose project, which fine-tuned a model for Hack, a language that, like OCaml, is used primarily within a single company [00:04:26]. (Interestingly, Hack is implemented in OCaml [00:04:50].)

The key insight was that good outcomes require the model to see examples in the shape of the desired question [00:05:16].

Defining the Goal

The primary goal was to generate diffs given a prompt [00:05:24]. A user should be able to describe a desired change in their editor and have the model suggest a multi-file diff (e.g., touching the test file, the .ml implementation, and the .mli interface) [00:05:37]. These diffs needed to apply cleanly and have a high likelihood of type-checking, ideally staying within roughly 100 lines [00:05:51].
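
A minimal sketch of what that target could look like as data, assuming a simple record representation (the field names and the 100-line check are illustrative, not Jane Street’s actual schema):

```ocaml
(* Illustrative sketch of the (context, prompt, diff) shape described above.
   Field names are assumptions, not Jane Street's actual schema. *)
type file_change = {
  path : string;   (* e.g. "foo.ml", "foo.mli", or "test_foo.ml" *)
  patch : string;  (* unified-diff text for this one file *)
}

type training_example = {
  context : string;         (* surrounding code and build state shown to the model *)
  prompt : string;          (* short, human-style description of the change *)
  diff : file_change list;  (* suggested multi-file diff for that prompt *)
}

(* Rough size check matching the "roughly 100 lines" target. *)
let diff_line_count (ex : training_example) =
  List.fold_left
    (fun acc fc -> acc + List.length (String.split_on_char '\n' fc.patch))
    0 ex.diff

let within_target ex = diff_line_count ex <= 100
```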

Data Collection

The process required collecting “context, prompt, diff” triplets as training data [00:06:11]:

  • Challenges with existing data sources:
    • Features (Jane Street’s analogue of pull requests): these contain human-written descriptions and diffs, but the descriptions are too verbose for in-editor prompts, and the diffs are often far too large (500–1,000 lines) [00:07:01].
    • Commits: At Jane Street, commits are used as checkpoints, not isolated changes, and lack descriptions [00:07:57].
  • Solution: Workspace Snapshotting [00:08:15]:
    • Snapshots of developer workstations are taken frequently (e.g., every 20 seconds), along with the build status [00:08:21].
    • “Green to Red to Green” build status patterns indicate isolated changes where a developer broke and then fixed the build [00:08:38].
    • The build error at the “Red” state and the diff from “Red” to “Green” are captured as training data, teaching the model how to recover from mistakes (this scan is sketched below) [00:08:51].
  • Generating Descriptions: An LLM is used to write detailed descriptions of changes, which are then filtered down to approximate what a human would write in a short prompt [00:09:07].
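
A rough sketch of the “Green to Red to Green” scan this describes; the snapshot record and helper names are hypothetical, and the real 20-second capture pipeline is internal:

```ocaml
(* Hypothetical snapshot record; the real capture pipeline is internal. *)
type build_status = Green | Red

type snapshot = {
  time : float;                 (* capture time, roughly every 20 seconds *)
  status : build_status;        (* build status at this snapshot *)
  build_error : string option;  (* compiler/test output when status = Red *)
  workspace : string;           (* stand-in for the captured workspace state *)
}

(* Scan the snapshot stream for Green -> Red -> Green sequences: the error at
   the Red snapshot plus the Red -> Green diff form one training example about
   recovering from a mistake. *)
let isolated_fixes snapshots =
  let rec go acc = function
    | { status = Green; _ } :: ({ status = Red; _ } as red) :: rest ->
      (match List.find_opt (fun s -> s.status = Green) rest with
       | Some green -> go ((red, green) :: acc) rest
       | None -> List.rev acc)
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snapshots
```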

Reinforcement Learning (RL)

RL is crucial for aligning the model with what humans consider “good code” [00:09:34]. Good code is defined as [00:09:44]:

  • Parsable [00:09:47].
  • Type-checked (OCaml is statically typed) [00:10:01].
  • Compilable and passing its tests [00:10:15].

To provide this signal at scale, the team built a Code Evaluation Service (CES), which acts like a fast build service [00:10:38] (a minimal worker loop is sketched below):

  • A build is pre-warmed at a specific revision with a green status [00:10:50].
  • Workers continuously pull diffs from the model, apply them, and report whether the build turns red or green [00:10:55].
  • Over months of training, this red/green feedback aligns the model toward writing code that compiles and passes tests [00:11:07].
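
A minimal sketch of the kind of worker loop this implies; `next_diff`, `apply_diff`, `run_build`, and `report_reward` are hypothetical stand-ins for internal services rather than a real API:

```ocaml
(* Hypothetical worker: pull a model-generated diff, apply it on top of the
   pre-warmed green revision, rebuild, and report red/green as the RL reward. *)
type build_result = Builds_green | Breaks_red of string

let rec ces_worker ~prewarmed_rev ~next_diff ~apply_diff ~run_build ~report_reward =
  match next_diff () with
  | None -> ()  (* queue drained *)
  | Some diff ->
    let workspace = apply_diff ~rev:prewarmed_rev ~diff in
    let reward =
      match run_build workspace with
      | Builds_green -> 1.0        (* code compiles and tests pass *)
      | Breaks_red _error -> 0.0   (* build broke: negative signal *)
    in
    report_reward ~diff ~reward;
    ces_worker ~prewarmed_rev ~next_diff ~apply_diff ~run_build ~report_reward
```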

Evaluation

The same CES setup can be used for evaluation by holding out some of the RL data [00:11:19], which lets the team assess whether the code a model generates actually works [00:11:28]. Meaningful evaluations are critical to keep models from “going off the rails” and wasting time and money [00:12:24]; for example, a code review model trained on human examples once suggested “I’ll do it tomorrow,” since that is the kind of thing humans write in reviews [00:12:08].
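
A sketch of how the same red/green signal can become an evaluation number over held-out prompts; `generate_diff` and `builds_green` are assumed stand-ins for the model and CES:

```ocaml
(* Fraction of held-out prompts for which the model's diff builds green.
   Both functions are hypothetical stand-ins for the model and CES. *)
let eval_pass_rate ~generate_diff ~builds_green held_out_prompts =
  let total = List.length held_out_prompts in
  if total = 0 then 0.0
  else
    let passed =
      List.fold_left
        (fun acc prompt -> if builds_green (generate_diff prompt) then acc + 1 else acc)
        0 held_out_prompts
    in
    float_of_int passed /. float_of_int total
```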

Editor Integrations

The ultimate test for models is whether they work for humans [00:12:33]. The team built editor integrations with three main goals:

  1. Avoid Redundancy: Support Neovim, VS Code, and Emacs without writing context-building and prompting strategies three times [00:12:44].
  2. Maintain Flexibility: Easily swap out models or prompting strategies [00:13:02].
  3. Collect Metrics: Gather real-world data on latency and on how often suggested diffs apply successfully (an example of such metrics is sketched below) [00:13:17].
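
As an illustration of the third goal, a sketch of the kind of per-request metrics an editor layer might report; the record fields are assumptions:

```ocaml
(* Hypothetical per-request metrics record reported by each editor layer. *)
type request_metrics = {
  editor : string;         (* "vscode", "emacs", or "neovim" *)
  latency_ms : int;        (* time from prompt to suggested diff *)
  diff_applied : bool;     (* did the suggested diff apply cleanly? *)
  accepted : bool;         (* did the user keep the change? *)
}

(* Share of requests whose diffs applied cleanly, e.g. per editor. *)
let apply_success_rate metrics =
  match List.length metrics with
  | 0 -> 0.0
  | n ->
    let ok = List.length (List.filter (fun m -> m.diff_applied) metrics) in
    float_of_int ok /. float_of_int n
```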

AI Development Environment (AID) Architecture

The chosen architecture involves an “AI Development Environment” (AID) service [00:13:32]:

  • LLMs are on one side, and AID handles prompt construction, context building, and accessing build status [00:13:38].
  • Thin layers are built on top of AID for each individual editor (see the sketch after this list) [00:13:49].
  • AID runs as a sidecar application on the developer’s machine [00:13:55]. Because of this, changes to AID don’t require developers to restart their editors; the AID service itself can be restarted remotely [00:14:00].
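
A small sketch of what this editor-to-AID boundary could look like; the request/response fields and the `suggest` function are assumptions, included only to show how thin the per-editor layer can stay:

```ocaml
(* Hypothetical request/response types at the editor <-> AID boundary. *)
type aid_request = {
  file : string;     (* file currently being edited *)
  prompt : string;   (* what the user asked for *)
}

type aid_response = {
  diff : (string * string) list;  (* (path, patch) pairs, possibly multi-file *)
}

(* A thin editor layer: convert editor state to a request, hand it to the
   sidecar's [suggest], and render the returned diff. Prompt construction,
   context building, and model choice all live behind [suggest]. *)
let handle_prompt ~(suggest : aid_request -> aid_response) ~render ~file ~prompt =
  let response = suggest { file; prompt } in
  render response.diff
```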

Editor Examples:

  • VS Code: AID integrates into the sidebar, offering multi-file diff suggestions through a visual interface [00:14:16].
  • Emacs: For Emacs users accustomed to text buffers, the AID experience is built into a Markdown buffer, allowing users to move around, ask questions, and append content using keybinds [00:14:35].

Benefits of AID Architecture:

  • Pluggability: Allows swapping new models, changing context building, and adding support for new editors (currently being done) [00:14:58].
  • Integration of AI coding agents with third-party tools: Different areas of the company can supply domain-specific tools that become available in all editors without individual integrations [00:15:15].
  • A/B Testing: Enables A/B testing of different approaches, for example sending 50% of the company to one model and 50% to another and measuring which achieves the higher acceptance rate (sketched below) [00:15:28].
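
A minimal sketch of a deterministic 50/50 split and acceptance-rate comparison; hashing on a user id and the shape of the acceptance events are assumptions:

```ocaml
(* Deterministically assign each user to one of two model arms. *)
let assign_arm user_id =
  if Hashtbl.hash user_id mod 2 = 0 then `Model_a else `Model_b

(* Events are (arm, accepted) pairs; compare acceptance rates per arm. *)
let acceptance_rate arm events =
  let in_arm = List.filter (fun (a, _) -> a = arm) events in
  match List.length in_arm with
  | 0 -> 0.0
  | n ->
    let accepted = List.length (List.filter snd in_arm) in
    float_of_int accepted /. float_of_int n
```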

The AID architecture is a long-term investment that pays off over time, as any change in large language models can be applied in one central place and instantly made available everywhere [00:15:39]. This foundational approach supports further work, including applying RAG (Retrieval Augmented Generation), multi-agent workflows, and reasoning models, while maintaining pluggability and allowing other parts of the company to add domain-specific tooling [00:15:57].