From: redpointai

In an episode of the Unsupervised Learning AI podcast, hosts Jacob Efron and Pat Chase were joined by Omj, founder of Replit, to discuss the landscape of AI models, focusing on the current state and perceived limitations of open source models [00:00:05]. Replit, a company valued at over a billion dollars, is at the forefront of integrating AI into coding solutions [00:00:17].

The Reality of Open Source AI Models

Omj challenges the notion that today’s “open source” AI models are truly open, likening them to receiving only a compiled binary of Linux rather than its source code and a compiler [00:31:50]. He argues that if a model cannot be reproduced, it is not truly open source [00:31:58]. This raises questions about long-term dependency:

“If you’re using open source models, you’re dependent on the goodwill of Zuck to continue to push out Llama 2, 3, 4, 5” [00:32:29].

This dependency means companies are essentially betting their business on external goodwill, which Omj finds strategically precarious [00:32:46]. He emphasizes the need for a truly open source project that allows for contributions and has a functional open source flywheel [00:34:40].

Data as the Source Code

Omj highlights that current large language models (LLMs) are fundamentally a function of the data they are fed [00:11:38]. The real power of LLMs lies in interpolating different data distributions [00:11:53]. When training a model, the data becomes the source code itself [00:35:12].

A critical limitation of current open source models versus proprietary AI models is the lack of clarity regarding their training process and underlying data [00:36:51]. This presents a significant security risk, as “backdoors” or hidden functionalities could be built into models and remain undetected if the training data and process are not fully transparent [00:36:17].

Furthermore, Omj suggests that the supply of high-quality, truly open source tokens for training models might be dwindling [00:14:02]. While models like GPT-4 are trained on vast amounts of internet code and heavily annotated coding data, open source models often rely on permissive GitHub repositories [00:14:24]. The challenge is finding sufficient and diverse “coding adjacent reasoning things” like scientific or legal data to continue improving coding capabilities [00:15:27].

Replit’s Approach: Building Proprietary Models

Replit’s decision to build its own model, rather than solely relying on commercial APIs or open source models, was driven by specific needs:

  • Low Latency: Commercial models, even those powering products like GitHub Copilot, can be too slow. Replit prioritized sub-second response times for a seamless user experience, which is crucial for in-editor code suggestions [00:28:07].
  • Cost Efficiency: Integrating AI features into a free product tier meant commercial model costs were prohibitive [00:28:46]. Training their own 3-billion parameter model cost approximately $100,000, which is not a huge capital expenditure compared to ongoing inference costs from commercial providers [00:29:35].
  • Small Models are Capable: Replit was early to realize that smaller models could be highly capable for specific tasks, especially with proper productionization [00:29:17]. This allowed them to tailor the model to their needs more effectively.
  • Strategic Control: Building in-house capability fosters internal talent and reduces external dependency [00:29:45].
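The cost reasoning above can be sketched as a simple break-even calculation. Only the roughly $100,000 training figure comes from the episode; the per-token prices and the helper function below are purely illustrative assumptions, not Replit’s actual numbers.

```python
# Hedged sketch: when does a one-time training cost beat paying a commercial
# API per token? The ~$100k training cost is from the episode; every other
# number here is an illustrative assumption.

TRAINING_COST_USD = 100_000               # one-time cost to train a ~3B model (from the episode)
API_PRICE_PER_1K_TOKENS = 0.002           # hypothetical commercial API price (USD)
SELF_HOSTED_PRICE_PER_1K_TOKENS = 0.0004  # hypothetical self-hosted inference cost (USD)

def break_even_tokens(training_cost: float, api_price_1k: float,
                      self_price_1k: float) -> float:
    """Number of tokens served at which the training cost is recouped."""
    savings_per_1k = api_price_1k - self_price_1k
    return training_cost / savings_per_1k * 1_000

tokens = break_even_tokens(TRAINING_COST_USD, API_PRICE_PER_1K_TOKENS,
                           SELF_HOSTED_PRICE_PER_1K_TOKENS)
print(f"Break-even at roughly {tokens / 1e9:.1f}B tokens served")
```

Under these assumed prices the one-time training cost pays for itself after tens of billions of served tokens, which is why heavy free-tier usage tips the numbers toward building in-house.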

Despite building their own core model, Replit still uses commercial models for other use cases, such as general-purpose chat features, where it doesn’t make sense to train a custom model [00:30:10]. This pragmatic approach focuses on solving customer problems and running the numbers to determine the best solution [00:30:24].

The Future Landscape

The future of AI models is still fluid. While a year ago Omj might have predicted a strong reliance on fine-tuning open source models, his current view is that the commercial side is ahead [00:33:13]. Companies like OpenAI and Anthropic are developing robust fine-tuning and custom model businesses, providing services that address specific customer needs [00:34:04].

However, rapid progress in open source models such as Meta’s Code Llama (which claims to match GPT-4 on benchmarks, though “the vibes may be off” [00:55:00]) suggests that the open source landscape is quickly catching up [01:10:21].

Ultimately, the decision to build, fine-tune, or rely on commercial APIs will depend on specific strategic considerations, the problem being solved, and available resources and talent [00:30:24].