From: aidotengineer
The past 12 months have seen a dramatic increase in AI models, with over 50,000 models uploaded to Hugging Face per month, equivalent to more than one new model every minute [00:00:08]. This rapid growth highlights the increasing availability and adoption of open models across applications [00:00:11].
The Rise of Open Models
Open-source models are proving capable of competing with large, proprietary models. DeepSeek-R1, for example, was the first open-source model to catch up with and surpass GPT-4, demonstrating that significant investment isn’t always necessary to compete with major labs [00:00:22]. DeepSeek-R1 recorded over 4 million downloads of its 685 GB model on Hugging Face in one month [00:00:32]. Companies like Featherless AI provide unlimited API requests to over 3,700 open AI models, including DeepSeek-R1, making them accessible to thousands of users [00:00:57].
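Because most hosts of open models expose an OpenAI-compatible API, moving a workload onto an open model is often just a matter of pointing the client at a different base URL and model ID. A minimal sketch in Python; the endpoint URL and model identifier below are illustrative assumptions, not details taken from the talk:

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint and model ID; substitute your provider's
# actual values (e.g., a hosted open-model API or a local vLLM/llama.cpp server).
client = OpenAI(
    base_url="https://api.featherless.ai/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # Hugging Face-style model identifier (assumed)
    messages=[{"role": "user", "content": "Summarize why open models are gaining ground."}],
)
print(response.choices[0].message.content)
```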
Finetuning and Model Fragmentation
The proliferation of open models leads to significant finetuning and fragmentation: base models like Llama and Qwen are fine-tuned into many specialized variants, each with its own “personality” and use case [02:51:30]. This allows for tailored solutions for specific tasks [02:58:00].
Production Stability and User Preference
A key insight into open model usage is the “staying power” of a model once it enters production [03:06:00]. Developers prioritize consistency and prefer to change their models only when they choose to, not when a provider decides to update [03:19:00].
Factors Contributing to Model Stickiness in Production:
- Cost-effectiveness: Smaller models are generally cheaper at scale [04:02:00].
- Established Tutorials and Integrations: Models that gain early adoption and accumulate “hundreds of fine-tuning tutorials” become default choices for cloud platforms (e.g., AWS, GCP), leading to widespread use in production environments [04:51:00].
- Reliability and Accuracy: Once a model reliably performs at scale, especially after being prompted to 99%+ accuracy with metrics in place for observing changes (see the regression-check sketch below), enterprises are reluctant to change it [05:15:00]. The adage “if it ain’t broke, don’t fix it” applies [05:47:00].
- Licensing: The Apache 2.0 licensing of early, truly open-source models like Mistral Nemo helped enterprise adoption, as it avoided the license restrictions of models like Llama that made legal teams uncomfortable [04:43:00].
Models like Mistral Nemo, even at eight months old and superseded by larger, better models, still show significant dominance in commercial use due to their established position [03:50:00]. Similarly, Llama 2 remains a go-to for AI safeguard tutorials and is actively used in production by new teams, despite newer versions being available [05:53:00].
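The “99%+ accuracy with metrics in place” point amounts to keeping a fixed, hand-labeled evaluation set and re-running it whenever the model, prompt, or provider changes, refusing the switch if accuracy regresses. A minimal sketch; the eval examples, accuracy floor, and model interface are hypothetical stand-ins:

```python
# Minimal regression check: re-run a fixed, labeled eval set whenever the model,
# prompt, or provider changes, and block the switch if accuracy drops.
# `candidate_model` is any callable mapping an input string to a label.
from typing import Callable

EVAL_SET = [
    {"input": "Refund request for order #1234", "expected": "refund"},
    {"input": "Where is my package?", "expected": "tracking"},
    # ...hundreds of held-out, hand-labeled examples in practice
]

ACCURACY_FLOOR = 0.99  # the bar the current production setup already meets

def evaluate(candidate_model: Callable[[str], str]) -> float:
    correct = sum(candidate_model(ex["input"]) == ex["expected"] for ex in EVAL_SET)
    return correct / len(EVAL_SET)

def safe_to_switch(candidate_model: Callable[[str], str]) -> bool:
    return evaluate(candidate_model) >= ACCURACY_FLOOR
```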
User Behavior: Vibes over Benchmarks
For individual users, particularly in creative and companionship AI, model choice is often based on “Vibes” and personal preference rather than performance benchmarks (such as MMLU) or price [02:30:00]. In these communities, models with different “flavors” emerge frequently, and users may return to “old favorites” for their specific charms [11:52:00].
Key Use Cases and Their Impact on Stability
Based on platform usage data, AI models are commonly used for:
- Creativity and Companionship (30-40% of traffic): Includes creative writing (e.g., Novel Crafter), role-playing, companionship (e.g., Spicy Chat, Soul Haven), and to a lesser extent, therapy and journaling [07:32:00]. This segment is characterized by rapid model changes driven by “Vibes” rather than strict performance metrics [10:52:00].
- Coding Copilot and Agents (20-30% of traffic): Covers auto-completion tools (like GitHub Copilot) and agentic coding workflows [13:05:00]. While auto-completion is largely solved, the focus is shifting to “nearly autonomous agents with lots of clarifying questions” and human intervention [14:07:00]. This “Vibe coding” generates significantly more token traffic than companionship use cases [14:31:00]. Projects like Continue and Klon help achieve experiences similar to commercial offerings using open-source alternatives [16:02:00].
- ComfyUI and Friends (Approx. 5% of traffic): Used for complex AI generation workflows, particularly in image diffusion [16:40:00]. These graph-style UIs are used by non-developers like musicians and lawyers [16:57:00].
- Write/Check (ChatGPT clones) (Approx. 20% of traffic): General-purpose AI usage for writing and checking tasks [17:53:00].
- Agents and Workflow Automation (10-20% of traffic): This category is split into workflow automation with human oversight (“human escape hatches”) and fully automated agents [18:57:00].
Scaling AI Agents in Production with Reliability
When scaling AI solutions in production, especially for enterprise use, maximizing ROI and minimizing negative impact are top priorities [19:10:00]. A common strategy involves building automation systems with built-in human escape hatches, allowing humans to take control when needed [19:37:00]. For example, an AI agent might draft 80-90% of email responses, with a human reviewing and finalizing before sending [20:21:00]. This approach builds confidence, allows for incremental automation of reliable use cases, and prevents catastrophic failures that could postpone AI adoption [21:07:00].
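One way to picture the escape-hatch pattern: the agent drafts a reply, a human always reviews before anything is sent, and low-confidence cases are escalated for the human to handle from scratch. A minimal sketch; the function names, confidence signal, and threshold are illustrative assumptions, not details from the talk:

```python
# Minimal escape-hatch sketch: the agent drafts, a human always reviews,
# and low-confidence drafts are escalated rather than pre-filled.
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.9  # illustrative; tune per use case

@dataclass
class Draft:
    reply: str
    confidence: float

def handle_email(email: str,
                 draft_reply: Callable[[str], Draft],
                 review_and_send: Callable[[str, str], None],
                 escalate_to_human: Callable[[str], None]) -> None:
    draft = draft_reply(email)
    if draft.confidence >= CONFIDENCE_THRESHOLD:
        # The agent covers the 80-90% case, but a human still signs off before sending.
        review_and_send(email, draft.reply)
    else:
        # Escape hatch: a human writes this one from scratch.
        escalate_to_human(email)
```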
Fully automated, 100% reliable agents for production environments are considered a mythical category that “does not exist” [22:47:00]. The recommended mindset for building AI into production is to solve for the 80% case with escape hatches and iteratively improve reliability from there [23:05:00]. This incremental approach, similar to software reliability engineering, can reach very high reliability (e.g., 99.998%) without the risks of an all-or-nothing launch [24:11:00].
Challenges with Early AI Models and Improvements
The speaker notes that many current AI models are not 100% reliable and can hallucinate or fail [27:24:00]. This highlights the ongoing need for robustness and coverage in AI models, especially for critical tasks.
The Future of AI and Benchmarks
As the average AI model surpasses the MMLU (Massive Multitask Language Understanding) capabilities of an average office worker, traditional benchmarks are losing their meaning [27:01:00]. The focus is shifting towards exploring linear transformer models as a means of “persisting memories, customization, and improving reliability for future AI models to make useful AI agents” [27:33:00].
Quirky, a 72-billion-parameter hybrid of linear attention and standard Transformer attention, is presented as an example of a new architecture that runs at less than half the GPU compute cost of comparable Transformer models, aiming to provide alternatives with lower inference costs [26:22:00]. It cost only 10 million [26:52:00].
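The lower inference cost claimed for linear-attention hybrids comes from replacing the growing key/value cache of standard attention with a fixed-size recurrent state that is updated once per token. A simplified sketch of that update (unnormalized, with no feature map, and not specific to the model described above):

```python
import torch

def linear_attention_step(state: torch.Tensor,
                          q: torch.Tensor,
                          k: torch.Tensor,
                          v: torch.Tensor):
    """One decoding step of (unnormalized) linear attention.

    state: (d_k, d_v) running sum of k v^T outer products; its size is fixed,
    so per-token compute and memory stay flat regardless of context length.
    """
    state = state + torch.outer(k, v)   # fold the new token into the state
    out = q @ state                     # read out with the current query
    return state, out

d_k, d_v = 64, 64
state = torch.zeros(d_k, d_v)
for _ in range(8):                      # process a stream of tokens one at a time
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    state, out = linear_attention_step(state, q, k, v)
```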