From: redpointai

The evolution of generative AI models highlights the critical importance of effective data labeling and the increasing role of synthetic data in improving model capabilities [00:00:08]. Aidan, speaking on Unsupervised Learning, emphasized that the best application companies will also build their own models, necessitating specific approaches to data [00:00:04].

The Challenge of Enterprise AI Data

For AI agents to function effectively and drive automation within enterprises, they require extensive access to data, including emails, chats, calls, CRM, ERP, and HR software, to gain necessary context [00:01:58]. This presents significant challenges:

  • Data Privacy: Very little software currently requires this degree of access, making privacy a much larger issue for AI and agents compared to other enterprise software [00:02:27].
  • Customization and Integration: Each company uses a unique “tapestry” or “mosaic” of software, meaning there’s no standard setup [00:02:43]. This necessitates a degree of custom integration work to bring all the relevant context into the AI model [00:02:59]. While AI agents might eventually alleviate some of this complexity, completely self-serve setup remains a “fantasy” [00:03:41].
  • Data Security: The high stakes of mistakes involving sensitive data like salary or customer information demand substantial guardrails [00:04:17].

The Enduring Role of Human Data Labeling

Humans remain the “gold standard” for evaluating AI model usefulness, especially when building models for people [00:13:03]. Therefore, evaluation (Eval) is an area where humans cannot be removed from the loop, unless an expert AI model, superior to the current one, can perform the evaluation instead [00:13:15]. This creates a “hard dependency on humans within Eval” [00:13:33].

The Emergence and Importance of Synthetic Data

While human data is still necessary, the cost of generating large volumes of specialized human data is prohibitive [00:13:39]. For example, finding 100,000 doctors to teach a model medicine is not a viable strategy [00:13:47]. However, teaching models general conversational abilities, often achieved with data from a large pool of average people, has “unlocked a certain degree of freedom in terms of synthetic data generation” [00:14:14].

Synthetic data plays a crucial role in:

  • Scaling Data Generation: It allows for the application of a much smaller pool of human data to specific domains like medicine [00:14:24]. A small, known good, and trustworthy pool of human data (e.g., from 100 doctors) can be used to generate a thousand-fold amount of synthetic lookalike data [00:14:37].
  • Addressing Data Scarcity/Siloing: While some domains like cancer research don’t necessarily have a “token scarcity,” the data is often siloed and locked up across different institutions [00:34:38]. Synthetic data can help bridge these gaps or facilitate the exploration of such data, even if the primary issue is a “human problem” of data sharing rather than data generation [00:35:00].
  • Verifiable Domains: In domains like code and math, it’s easier to check results, allowing for effective filtering of synthetic data to remove “garbage” and find “gold” [00:14:49]. While more complex, it is still viable in other domains [00:15:02].
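The generate-then-filter idea for verifiable domains can be sketched in a few lines. This is a minimal illustration, not Cohere's actual pipeline: `generate_candidates` is a hypothetical stand-in that fabricates noisy arithmetic examples in place of sampling from a real model, and `verify` plays the role of the cheap correctness check that domains like math and code make possible.

```python
# Sketch of generate-then-filter synthetic data creation in a verifiable
# domain (simple arithmetic). In a real pipeline, generate_candidates would
# sample from a model; here it fabricates noisy examples for illustration.
import random

def generate_candidates(n):
    """Emit (question, proposed_answer) pairs, some deliberately wrong
    to mimic noisy model output ('garbage' mixed with 'gold')."""
    samples = []
    for _ in range(n):
        a, b = random.randint(1, 99), random.randint(1, 99)
        answer = a + b
        if random.random() < 0.3:          # inject wrong answers
            answer += random.choice([-1, 1, 10])
        samples.append((f"What is {a} + {b}?", answer))
    return samples

def verify(question, answer):
    """Ground-truth check: in math, the result is cheap to recompute."""
    a, b = [int(tok) for tok in question.rstrip("?").split() if tok.isdigit()]
    return a + b == answer

def filter_synthetic(samples):
    """Keep only candidates that pass verification -- the 'gold'."""
    return [(q, ans) for q, ans in samples if verify(q, ans)]

random.seed(0)
raw = generate_candidates(1000)
gold = filter_synthetic(raw)
print(f"kept {len(gold)} of {len(raw)} synthetic examples")
```

In less verifiable domains the `verify` step is the hard part, which is why the talk describes filtering there as more complex but still viable.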

At Cohere, an “overwhelming majority” of the data generated for new models is synthetic [00:15:15].

Custom Models and Data Context

Custom models remain important because “fundamental context about a particular business or a particular domain” is often missing from models trained solely on web data [00:10:57]; this kind of data typically cannot be found on the web.

Companies like Cohere partner with organizations that possess this domain-specific data to create custom models, which only those organizations can access [00:11:43]. While synthetic data can significantly close the gap for general models, a handful of custom models might operate within an organization; it’s unlikely that every single team will have its own fine-tuned model [00:12:02].

Future of Model Improvement and User Interaction

A key missing capability in current models is the “notion of learning from experience” [00:08:46]. Humans start as novices and become experts over time, and models should have the same ability to learn from real-world experience and user feedback [00:08:50]. This ongoing learning process, where models remember past interactions and user feedback, would significantly increase user investment and enable a personalized “me 2.0” system [00:46:07]. This could be implemented by storing interaction history in a queryable database, ensuring the model always has context from previous interactions [00:45:47].
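The "queryable database of interaction history" idea can be sketched as follows. All table, function, and retrieval details here are illustrative assumptions, not anything described in the talk: the sketch stores each exchange in SQLite and does naive keyword recall, where a production system would more likely use embedding-based retrieval.

```python
# Hedged sketch: persist user/model exchanges in SQLite and pull relevant
# past turns back into the prompt, so the model retains context across
# sessions. Schema and lookup strategy are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE interactions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_msg TEXT,
    model_msg TEXT)""")

def remember(user_msg, model_msg):
    """Record one completed exchange."""
    conn.execute(
        "INSERT INTO interactions (user_msg, model_msg) VALUES (?, ?)",
        (user_msg, model_msg))
    conn.commit()

def recall(keyword, limit=3):
    """Naive keyword lookup over past user messages; a real system
    would likely rank by semantic similarity instead."""
    return conn.execute(
        "SELECT user_msg, model_msg FROM interactions "
        "WHERE user_msg LIKE ? ORDER BY id DESC LIMIT ?",
        (f"%{keyword}%", limit)).fetchall()

def build_prompt(new_msg):
    """Prepend recalled history so the model 'remembers' the user."""
    context = recall(new_msg.split()[0])
    history = "\n".join(f"User: {u}\nModel: {m}" for u, m in context)
    return f"{history}\nUser: {new_msg}" if history else f"User: {new_msg}"

remember("deploy the staging cluster", "Done: staging deployed.")
print(build_prompt("deploy to production next"))
```

The design choice worth noting is that memory lives outside the model weights: the model itself stays fixed, and "learning from experience" is approximated by retrieving the right slice of history at inference time.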

The “scale is all you need” hypothesis is breaking, as there are heavy diminishing returns on capital and compute [00:09:21]. Future advancements will require smarter and more creative approaches [00:09:35]. While test-time compute still requires significant resources (making inference 3 to 10 times more expensive [00:39:10]), it’s not simply a matter of building larger computers [00:38:47]. The focus is shifting to data diversity and finding demonstrations for models to problem-solve in specific domains [00:38:33].