From: aidotengineer
John Kzi, from the AI Assistant team at Jane Street, focuses on maximizing the value Jane Street derives from large language models (LLMs) [00:00:26]. Kzi’s background is entirely in dev tools, including time at GitHub [00:00:31]. LLMs offer a significant opportunity because of their open-ended nature: they can be used to build almost anything imaginable, with human creativity being the main limit on how they are employed [00:00:38].
However, Jane Street has made specific choices that complicate the adoption of off-the-shelf tooling [00:00:53].
OCaml as a Core Development Platform
The main reason for these difficulties is Jane Street’s extensive use of OCaml as its primary development platform [00:01:01].
What is OCaml?
OCaml is described as a functional and powerful language, but also “incredibly obscure” [00:01:08]. Developed in France, its most common applications include theorem proving, formal verification, and writing programming languages [00:01:15].
OCaml’s Pervasive Use at Jane Street
At Jane Street, OCaml is used for nearly everything [00:01:28]:
- Web Applications: Instead of JavaScript, they write OCaml and use JS of OCaml, which is an OCaml bytecode to JavaScript transpiler [00:01:31].
- Vim Plugins: While Vim plugins typically require Vim script, Jane Street uses Vaml, an OCaml to Vim script transpiler [00:01:47].
- FPGA Code: Even for FPGA code, developers write in Hardcaml, an OCaml library, rather than Verilog [00:01:57].
Challenges with LLM Tooling and OCaml
Several factors make off-the-shelf LLM tools poorly suited for OCaml at Jane Street [00:02:08]:
- LLM Proficiency in OCaml: Models are “just not very good at OCaml” [00:02:14]. This is due to the limited training data available for OCaml globally, with the internal OCaml codebase at Jane Street potentially exceeding the total combined amount of OCaml code outside their walls [00:02:20].
- Self-Imposed Complexity: Jane Street’s unique development environment further complicates matters [00:02:36]:
- Custom Build Systems: They built their own [00:02:42].
- Distributed Build Environment: They built their own [00:02:44].
- Code Review System: An internal system called “Iron” [00:02:46].
- Monorepo: All software is developed in one giant monorepo [00:02:53].
- Version Control: The monorepo is stored in Mercurial, not Git [00:02:55].
- Editor Preference: 67% of the firm uses Emacs, with VS Code also used but less popular [00:03:02].
- Desire for Broad LLM Application: Jane Street aims to apply LLMs to various parts of their development flow, such as resolving merge conflicts, building feature descriptions, or identifying reviewers [00:03:14]. This requires integration across different systems without being hampered by the boundaries between them [00:03:33].
Jane Street’s Approach to LLM Development for OCaml
Jane Street’s strategy involves custom models and a tailored evaluation process.
Custom Model Development
Inspired by Meta’s CodeCompose project, which fine-tuned a model for Hack (a language that, like OCaml, is niche and used primarily within a single company) [00:04:26], Jane Street sought to replicate these results for OCaml [00:04:55]. The initial naive assumption was that showing an off-the-shelf model their code would yield a model that understood their libraries and idioms [00:05:01].
However, obtaining good outcomes requires the model to see many examples in the shape of the desired question [00:05:16].
Goal: Generate Diffs
The primary goal was to enable users in an editor to describe a desired change and have the model suggest a potentially multifile diff [00:05:27]. These diffs needed to apply cleanly and have a good likelihood of type-checking, with roughly 100 lines as the ideal upper bound [00:05:50].
Training Data Collection
Training data needs to be in the “context, prompt, diff” shape [00:06:11].
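To make that shape concrete, a training example can be pictured as a record along the following lines; the type and field names are illustrative assumptions, not Jane Street’s actual schema:

```ocaml
(* Illustrative sketch of the "context, prompt, diff" training-example
   shape. Names are assumptions, not the actual internal schema. *)
type file_change = {
  path : string;   (* file touched by the change *)
  patch : string;  (* diff hunk for that file *)
}

type training_example = {
  context : string list;    (* surrounding code, build status, and so on *)
  prompt : string;          (* short, editor-style request, e.g. "fix that error" *)
  diff : file_change list;  (* the (possibly multifile) diff the model should produce *)
}
```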
- Challenges with existing data:
- Features (Pull Requests): While they have descriptions and diffs, their descriptions differ significantly from what a user would type in an editor (e.g., “fix that error”) [00:07:01]. They are also often very large (500-1000 lines), requiring automated parsing into smaller components [00:07:20].
- Commits: Used as checkpoints, commits lack descriptions and are not isolated changes [00:07:56].
- Workspace Snapshotting: The solution was to take snapshots of developer workstations every 20 seconds throughout the workday, including build status [00:08:15].
- Pattern Recognition: A “green to red to green” build status often indicates an isolated change where a developer broke and then fixed the build [00:08:36].
- Training Data Extraction: By capturing the build error at the “red” state and the diff from “red to green,” this data can be used to train the model to recover from mistakes (see the sketch after this list) [00:08:50].
- Description Generation: A separate LLM is used to generate detailed descriptions of changes, which are then filtered down to match the conciseness of human prompts [00:09:07].
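As a rough sketch of how the “green to red to green” pattern could be mined from those workspace snapshots, consider the following; the types and the extraction function are hypothetical stand-ins for the real pipeline:

```ocaml
(* Hypothetical sketch: scan a developer's snapshot stream for
   green -> red -> green transitions and turn each red-to-green step into
   a candidate training example (build error plus the change that fixed it). *)
type build_status = Green | Red of string  (* Red carries the build error *)

type snapshot = {
  status : build_status;
  workspace_diff : string;  (* workspace state relative to the base revision *)
}

type recovery_example = {
  build_error : string;  (* error text captured at the red state *)
  fix_diff : string;     (* change taken going from red back to green *)
}

(* Walk consecutive snapshots; whenever a red snapshot is followed by a
   green one, record the error and the change between the two states. *)
let extract_recovery_examples (snapshots : snapshot list) : recovery_example list =
  let rec go acc = function
    | { status = Red err; workspace_diff = red_ws }
      :: ({ status = Green; workspace_diff = green_ws } as next) :: rest ->
      let example =
        { build_error = err;
          (* A real pipeline would compute a proper diff between the two
             workspace states; concatenation is only a placeholder. *)
          fix_diff = red_ws ^ " -> " ^ green_ws }
      in
      go (example :: acc) (next :: rest)
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snapshots
```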
Reinforcement Learning (RL)
RL aligns the model’s output with what humans consider “good code” [00:09:32].
- Definition of “Good Code” for OCaml:
- Parses: Must successfully pass through the OCaml parser [00:09:47].
- Type Checks: For statically typed OCaml, the code must pass the type checker when applied to a base revision [00:09:59].
- Compiles and Passes Tests: The gold standard is code that compiles and passes tests [00:10:13].
- Code Evaluation Service (CES): This service is central to the RL phase [00:10:36].
- Functionality: Similar to a build service, it pre-warms a build at a specific revision [00:10:41].
- Workers: Workers continuously take diffs from the model, apply them, determine whether the build status turns red or green, and report the success or error back (a sketch of this loop follows the list) [00:10:55].
- Outcome: Over months of use, this service helps align the model to write code that compiles and passes tests [00:11:07].
- Evaluation: The same setup used for RL can be used for evaluation by holding out some RL data [00:11:17].
- Importance of Meaningful Evaluation: A notable anecdote highlights this: a code review model trained on human examples once responded to a review with “I’ll do it tomorrow,” because humans often write exactly that [00:11:42]. Meaningful evaluations are crucial to prevent models from going “off the rails” and wasting time and money [00:12:22].
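To make the parses / type-checks / passes-tests hierarchy and the CES worker loop concrete, here is a hedged sketch; apply_diff, compile, and run_tests are placeholder stubs, not the real service’s API:

```ocaml
(* Hypothetical sketch of a CES-style check: apply a model-generated diff
   to a pre-warmed build and grade the result, from "does not apply" up to
   the gold standard of compiling and passing tests. *)
type verdict =
  | Does_not_apply            (* the diff could not be applied cleanly *)
  | Parse_error of string
  | Type_error of string
  | Builds                    (* compiles, but tests not passed *)
  | Builds_and_passes_tests   (* the gold standard *)

(* Placeholder stubs; the real pre-warmed build service, patch application,
   and test runner are internal and not shown here. *)
let apply_diff ~base_revision:_ ~diff:_ : (unit, string) result = Ok ()
let compile () : (unit, [ `Parse of string | `Type of string ]) result = Ok ()
let run_tests () : bool = true

let grade ~(base_revision : string) ~(diff : string) : verdict =
  match apply_diff ~base_revision ~diff with
  | Error _ -> Does_not_apply
  | Ok () ->
    (match compile () with
     | Error (`Parse msg) -> Parse_error msg
     | Error (`Type msg) -> Type_error msg
     | Ok () -> if run_tests () then Builds_and_passes_tests else Builds)

(* A worker loops forever: pull the next diff from the model, grade it
   against the pre-warmed revision, and report the verdict back so it can
   drive the reward signal (or an evaluation score on held-out data). *)
```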
Editor Integrations: The AI Development Environment (AiDE)
The ultimate test for models is whether they work for humans [00:12:31]. Jane Street built editor integrations to expose these models to developers.
Design Principles
When building these integrations, three ideas were paramount [00:12:42]:
- Avoid Redundancy: Supporting three editors (Neovim, VS Code, Emacs) should not mean writing context-building and prompting strategies three times [00:12:48].
- Maintain Flexibility: The ability to swap models or prompting strategies was essential, especially as they anticipated moving from general LLMs to fine-tuned models [00:13:02].
- Collect Metrics: Crucial metrics for developers include latency and whether generated diffs actually apply [00:13:17].
AiDE Architecture
The chosen architecture for the AI Development Environment (AiDE) is simplified as follows [00:13:32]:
- LLMs are on one side.
- AiDE handles prompt construction, context building, and build status integration [00:13:41].
- Thin layers are written on top of AiDE for each individual editor [00:13:49].
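As a rough illustration of the “thin layer” idea, each editor integration might only need to speak a small request/response interface to the local sidecar, along these lines; the module and function names are assumptions for illustration, not the actual AiDE API:

```ocaml
(* Hypothetical interface between a thin editor layer and the AiDE sidecar.
   Prompt construction, context building, and build-status integration all
   live behind this boundary, so each editor only sends a request and
   renders the resulting diff. *)
module Aide_client : sig
  type request = {
    file : string;          (* file the user is editing *)
    cursor : int * int;     (* line, column *)
    instruction : string;   (* what the user asked for *)
  }

  type response = {
    diff : string;          (* proposed (possibly multifile) diff *)
    applies_cleanly : bool; (* metric reported back to the AiDE service *)
  }

  val request_change : request -> response
end = struct
  type request = { file : string; cursor : int * int; instruction : string }
  type response = { diff : string; applies_cleanly : bool }

  (* In reality this would talk to the local sidecar process; here it
     returns a stub so the sketch is self-contained. *)
  let request_change (_ : request) : response =
    { diff = ""; applies_cleanly = false }
end
```

A VS Code or Emacs layer then only maps its own UI onto request_change, which is why swapping models or prompting strategies inside the sidecar requires no editor-side changes.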
Benefits of AiDE as a Sidecar Application
AiDE runs as a sidecar application on the developer’s machine [00:13:55]. This means changes to AiDE do not require individual editor updates or restarts; restarting the AiDE service updates everyone [00:14:00].
Editor Examples
- VS Code: AiDE works within the VS Code sidebar, offering a visual interface for multifile diffs [00:14:16].
- Emacs: For Emacs developers who prefer text buffers, the AiDE experience is built into a Markdown buffer, allowing users to move around, ask questions, and apply changes via key binds [00:14:34].
Flexibility and A/B Testing
AiDE’s pluggable architecture allows for [00:14:58]:
- Swapping in new models [00:15:00].
- Changing context building strategies [00:15:05].
- Adding support for new editors (currently underway) [00:15:07].
- Integrating domain-specific tools from different company areas, which then become available across all editors without individual integrations [00:15:15].
AiDE also facilitates A/B testing, allowing Jane Street to send different segments of the company to different models and compare acceptance rates [00:15:26]. This investment pays off as any change in LLMs can be managed in one place and deployed everywhere [00:15:39].
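A minimal sketch of how such a split might be expressed, assuming users are assigned to model segments by a stable hash; the names below are hypothetical:

```ocaml
(* Hypothetical A/B assignment: hash each user into a stable bucket and
   route buckets to different model variants so acceptance rates can be
   compared across segments. *)
type model_variant = Baseline | Fine_tuned_candidate

let variant_for_user (username : string) : model_variant =
  (* Hashtbl.hash is stable for a given input; a production setup would
     use a persistent, versioned assignment instead. *)
  if Hashtbl.hash username mod 2 = 0 then Baseline else Fine_tuned_candidate
```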
The team is also exploring new ways to apply RAG (Retrieval Augmented Generation) within editors, large-scale multi-agent workflows, and reasoning models [00:16:03]. The core approach remains the same: keep things pluggable, build a strong foundation, and enable the rest of the company to add domain-specific tooling [00:16:16].