From: aidotengineer
John Kzi leads the AI Assistant team at Jane Street, which aims to maximize the value Jane Street can derive from large language models (LLMs) [00:00:24]. Kzi’s background is primarily in dev tools, having worked at GitHub and other development tool companies [00:00:28]. He views LLMs as presenting a significant opportunity due to their open-ended nature, allowing the creation of almost anything imaginable [00:00:38]. The pace of progress in models is only outmatched by the creativity in how to employ them [00:00:47].
Unique Challenges at Jane Street [00:00:53]
Jane Street’s specific operational choices make adopting off-the-shelf tooling more difficult than for other companies [00:00:53].
The primary challenges include:
- OCaml as a Development Platform: Jane Street primarily uses OCaml, a powerful but obscure functional language [00:01:02]. OCaml is often used in theorem proving, formal verification, or to write programming languages [00:01:17]. Jane Street uses OCaml for almost everything (a minimal js_of_ocaml sketch follows this list), including:
  - Web applications, transpiling OCaml bytecode to JavaScript via js_of_ocaml [00:01:31].
  - Vim plugins, transpiling OCaml to Vimscript via Vim OCaml [00:01:47].
  - FPGA code, written in an OCaml library called HardCaml instead of Verilog [00:01:58].
- Model Limitations with OCaml: Existing LLMs are not proficient in OCaml, mainly due to the limited amount of OCaml data available for training in the public domain [00:02:14]. Jane Street’s internal OCaml codebase may even exceed the total combined OCaml code outside their walls [00:02:26].
- Self-Imposed Complexity: Jane Street has built its own:
- Build systems [00:02:42].
- Distributed build environment [00:02:44].
- Code review system, called “Iron” [00:02:47].
- Software is developed in a giant monorepo, stored in Mercurial rather than Git [00:02:52].
- Approximately 67% of the firm uses Emacs as its primary editor [00:03:02].
- Aspiration for Deep LLM Integration: The company desires to apply LLMs across various parts of their development workflow (e.g., resolving merge conflicts, building feature descriptions, identifying code reviewers) without being constrained by system boundaries [00:03:19].
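The OCaml-to-JavaScript path mentioned in the list above can be illustrated with a minimal sketch; the dune stanza and program below are assumptions about a typical js_of_ocaml setup, not Jane Street's actual code.

```ocaml
(* Illustrative only: an ordinary OCaml program that dune compiles to
   JavaScript via js_of_ocaml when the executable stanza includes
   (modes js), e.g. (executable (name main) (modes js)). *)
let () = print_endline "hello from OCaml compiled to JavaScript"
```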
Jane Street’s Approach to LLMs in Developer Tools [00:03:40]
Jane Street’s strategy focuses on:
- Building custom models [00:03:49].
- Developing robust editor integrations [00:03:52].
- Establishing comprehensive model evaluation capabilities [00:04:02].
Custom Model Development [00:03:49]
While training models is expensive and complex [00:04:10], Jane Street was encouraged by Meta’s CodeCompose paper, which described successfully fine-tuning a model for Hack, a language used primarily at one company (a situation similar to OCaml’s) [00:04:25]. (Interestingly, Hack is itself implemented in OCaml [00:04:50].)
Initially, they naively thought they could fine-tune an off-the-shelf model by simply showing it their code [00:04:55]. However, it became clear that models need to see examples in the specific “shape” of the questions they are expected to answer [00:05:13].
Goal: The team’s primary goal was to generate diffs given a prompt [00:05:24]. This meant a user in an editor could describe a desired change, and the model would suggest a multi-file diff (e.g., modifying a test file, an .ml file, and an .mli header file) [00:05:35]. The diffs needed to apply cleanly and have a high likelihood of type-checking successfully, targeting changes of up to 100 lines [00:05:50].
Training Data Shape: To achieve this, training data needed to be in the form of (context, prompt, diff) triples [00:06:11].
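A minimal sketch of that shape as an OCaml record, with hypothetical field names (this is not the team’s actual schema); the multi-file diff from the goal above is represented as a list of per-file patches.

```ocaml
(* Hypothetical (context, prompt, diff) training example. *)
type file_patch = {
  path : string;         (* e.g. "foo.ml", "foo.mli", "test/test_foo.ml" *)
  unified_diff : string; (* patch text that must apply cleanly *)
}

type training_example = {
  context : string;       (* surrounding code and build state *)
  prompt : string;        (* short, human-style request, e.g. "fix that error" *)
  diff : file_patch list; (* target change, up to roughly 100 lines in total *)
}
```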
Data Collection Strategy [00:06:09]
- Initial Ideas (and why they failed):
- Features (Pull Requests): While features in their “Iron” code review system contain human-written descriptions and code diffs [00:06:37], those descriptions are far more verbose than editor prompts (multiple paragraphs versus “fix that error”) [00:07:03]. Features are also often very large (500-1,000 lines), so they would require automated splitting [00:07:20].
- Commits: Jane Street uses commits as checkpoints rather than described, self-contained changes [00:07:57]; they carry no descriptions and, like features, are often not isolated changes [00:08:08].
- Workspace Snapshotting: This is the chosen approach [00:08:15].
- Snapshots of developer workstations are taken frequently (e.g., every 20 seconds) [00:08:19].
- Build status (errors, green) is also snapshotted [00:08:28].
- Patterns like “green to red to green” builds often indicate an isolated change [00:08:37].
- A “red to green” transition signifies a developer fixing an error [00:08:49]. The build error at the “red” state and the diff from “red” to “green” are used as training data to help the model recover from mistakes [00:08:57].
- Description Generation: Large language models were used to generate detailed descriptions of each change, which were then pared down to the short, human-style prompts a developer would actually type [00:09:07].
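As a rough sketch of the red-to-green mining described above, the pairing rule could look like the following; the types, fields, and single-pass pairing are assumptions for illustration, not the actual pipeline.

```ocaml
(* Hypothetical: pair each "red" snapshot (a build error) with a later "green"
   one, so the error plus the red -> green diff can become a training example
   for error recovery.  Simplified: a real pipeline would bound the time
   window and check that the change is isolated. *)
type build_status = Green | Red of string  (* Red carries the build error *)

type snapshot = { status : build_status; code : string }

let red_to_green_pairs snapshots =
  let rec go acc = function
    | ({ status = Red err; _ } as red) :: rest ->
        (match List.find_opt (fun s -> s.status = Green) rest with
         | Some green -> go ((err, red.code, green.code) :: acc) rest
         | None -> go acc rest)
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snapshots
```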
Reinforcement Learning and Defining “Good Code” [00:09:32]
After supervised training, reinforcement learning aligns the model with what humans consider “good code” [00:09:36]. Good OCaml code is defined as:
- Parsable [00:09:47].
- Type-checks successfully [00:09:59].
- Compiles and passes tests [00:10:14].
To facilitate this, Jane Street built the Code Evaluation Service (CES) [00:10:38].
- CES acts as a faster build service [00:10:41].
- It pre-warms a build to a “green” state [00:10:50].
- Workers continuously take diffs from the model, apply them, and determine if the build status turns red or green [00:10:55].
- This feedback loop, used over months, helps align the model to write code that actually compiles and passes tests [00:11:07].
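A minimal sketch of how such build feedback could be folded into an RL reward; the result type and weights are assumptions, not the actual CES interface.

```ocaml
(* Hypothetical mapping from CES-style feedback to a scalar reward: a diff is
   rewarded more the further it gets through parsing, type-checking, and
   tests, matching the definition of "good code" above. *)
type ces_result = Parse_error | Type_error | Tests_failed | Green

let reward = function
  | Parse_error -> 0.0
  | Type_error -> 0.25
  | Tests_failed -> 0.5
  | Green -> 1.0
```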
Model Evaluation [00:04:02]
The same CES setup used for reinforcement learning is also used for evaluating models [00:11:17]. By holding out some of the RL data, they can test whether a model can write code that actually works [00:11:22].
Importance of Meaningful Evaluations: Training can have “catastrophic but hilarious” results [00:11:37]. For example, a code review model trained on human examples once responded to a review request with “I’ll do it tomorrow” [00:12:08]. Meaningful evaluations are crucial to prevent models from going “off the rails” and wasting resources [00:12:22].
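The same kind of build check can be reduced to a simple pass-rate metric over held-out examples; this is a sketch under assumed names (`builds_green` standing in for a CES-style call), not the actual evaluation harness.

```ocaml
(* Hypothetical: the fraction of held-out examples for which the model's diff
   leaves the build green. *)
let pass_rate ~builds_green examples =
  match examples with
  | [] -> 0.0
  | _ ->
      let passed = List.length (List.filter builds_green examples) in
      float_of_int passed /. float_of_int (List.length examples)
```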
Editor Integrations: The AI Development Environment (AID) [00:13:31]
The ultimate test for models is human utility [00:12:31]. Jane Street built editor integrations to expose these models to developers.
Integration Goals:
- Write Once: Avoid rewriting context-building and prompting strategies for each supported editor (Neovim, VS Code, Emacs) [00:12:44].
- Maintain Flexibility: Allow swapping models or prompting strategies easily, anticipating future fine-tuned models [00:13:02].
- Collect Metrics: Gather real-world data on latency and diff application success to gauge effectiveness [00:13:17].
Architecture: AID as a Sidecar Application:
- The AI Development Environment (AID) handles prompt construction, context building, and build status visibility [00:13:32].
- Thin layers are built on top of AID for each editor [00:13:49].
- AID runs as a sidecar application on the developer’s machine [00:13:55]. This allows changes to AID to be deployed and restarted on all machines without requiring developers to restart their editors [00:14:00].
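A rough sketch of what the thin editor-to-sidecar boundary might look like; the record shapes and field names are assumptions, since the real AID interface is not public.

```ocaml
(* Hypothetical editor <-> sidecar boundary: each editor layer sends a small
   request to the local AID process, which owns context building, prompting,
   and model choice, and returns patches plus metrics. *)
type request = {
  editor : string;          (* "emacs" | "neovim" | "vscode" *)
  prompt : string;          (* what the user asked for *)
  open_files : string list; (* cheap context the editor already has *)
}

type response = {
  patches : (string * string) list; (* (path, unified diff) pairs *)
  model : string;                   (* which backend produced the diff *)
  latency_ms : int;                 (* feeds the metrics goal above *)
}
```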
Editor Experiences:
- VS Code: A visual sidebar interface, similar to Copilot, allows asking for and receiving multi-file diffs [00:14:16].
- Emacs: The AID experience is integrated into a Markdown buffer, allowing users to interact via text, ask questions, and use keybinds to append content [00:14:34].
AID’s Flexibility and Future: AID’s architecture makes it highly pluggable [00:14:58], enabling:
- Swapping in new models [00:15:02].
- Changes to context building strategies [00:15:05].
- Adding support for new editors [00:15:07].
- Integrating domain-specific tools from different company areas, making them available across all editors without individual integrations [00:15:14].
- A/B testing different approaches (e.g., routing 50% of users to one model and 50% to another to determine acceptance rates) [00:15:28].
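The A/B testing point can be sketched as a deterministic user split; the model names and the hashing rule are hypothetical.

```ocaml
(* Hypothetical: split users between two backends by hashing the user id, so
   diff-acceptance rates of the two models can be compared. *)
let choose_model ~user =
  if Hashtbl.hash user mod 2 = 0 then "model-a" else "model-b"
```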
AID represents a long-term investment, allowing Jane Street to quickly adapt to changes in large language models by modifying a single component downstream of the editors [00:15:39].
Future Directions [00:15:56]
Jane Street is also exploring:
- Applying Retrieval-Augmented Generation (RAG) within editors [00:16:03].
- Large-scale multi-agent workflows [00:16:06].
- Working with reasoning models [00:16:11].
Their consistent approach involves maintaining pluggability, building a strong foundation, and enabling other parts of the company to contribute domain-specific tooling [00:16:16].