Role of data labeling and synthetic data in AI

Current State of Data Labeling in AI

The data labeling that will matter most for future model improvement involves expert labelers who encode more reasoning tasks into models [00:00:08]. The first wave of data work after ChatGPT's release focused on RLHF (reinforcement learning from human feedback) data; the focus has since shifted toward expert data labelers [00:12:43].

Human Role in Evaluation

Humans remain the gold standard for evaluating AI models [00:13:03]. It is not yet possible to remove humans from the evaluation loop [00:13:15]. The only way to potentially remove humans from the loop would be to have an expert observing the model who is better than the current model [00:13:23].

Challenges of Human Data Generation

Human data is still necessary for data generation, but it is prohibitively expensive at scale [00:13:39]. Teaching a model general conversation, or “chitchat”, was viable because the data could be collected from 100,000 average people [00:14:00]. Teaching a model medicine by finding 100,000 doctors, however, is not a viable strategy [00:14:51].

The Rise of Synthetic Data

The ability to chitchat and converse, initially taught through human data, has unlocked a degree of freedom in synthetic data generation [00:14:16]. This allows AI developers to apply synthetic data to specific domains like medicine, using a much smaller pool of human data [00:14:24]. For instance, one might go to 100 doctors to get some lessons, then use that trustworthy data to generate a thousand times as much synthetic lookalike data [00:14:37].
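The seed-and-expand idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cohere's pipeline: `expand_seed_set` and `toy_generator` are hypothetical names, and the generator is a stand-in for whatever model call would actually produce lookalike data. The numbers mirror the example above, with 100 trusted examples expanded a thousandfold.

```python
import random
from typing import Callable, Dict, List

Example = Dict[str, str]


def expand_seed_set(
    seeds: List[Example],
    generate_variant: Callable[[Example, random.Random], str],
    variants_per_seed: int,
    seed: int = 0,
) -> List[Example]:
    """Produce `variants_per_seed` synthetic examples for each trusted seed."""
    rng = random.Random(seed)
    synthetic: List[Example] = []
    for index, example in enumerate(seeds):
        for _ in range(variants_per_seed):
            synthetic.append({
                "text": generate_variant(example, rng),
                # Keep provenance so each synthetic item traces back to an expert example.
                "source_seed": str(index),
            })
    return synthetic


def toy_generator(example: Example, rng: random.Random) -> str:
    """Hypothetical stand-in for a model call; it only varies a prefix."""
    opener = rng.choice(["Case:", "Scenario:", "Example:"])
    return f"{opener} {example['text']}"


# 100 expert-written lessons (placeholders here), expanded a thousandfold.
seeds = [{"text": f"expert lesson {i}"} for i in range(100)]
synthetic = expand_seed_set(seeds, toy_generator, variants_per_seed=1000)
print(len(synthetic))  # 100000
```

Recording `source_seed` is a design choice for the sketch: it lets a reviewer audit or discard every synthetic item derived from a given expert example.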

Synthetic Data in Verifiable Domains

In verifiable domains like code and math, it is significantly easier to check the results of synthetic data, allowing for effective filtering of garbage and discovery of useful information [00:14:49]. This process remains viable even in more complex domains [00:15:02]. Currently, an overwhelming majority of data generated by Cohere for new models is synthetic [00:15:15].
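As a minimal sketch of what filtering in a verifiable domain can look like, the Python below recomputes each synthetic arithmetic answer and keeps only the samples that check out. The sample schema (`a`, `b`, `answer`) and the function names are assumptions made for illustration; a production verifier for code or math would be far richer, for example by running unit tests or a proof checker.

```python
from typing import Callable, Dict, List

Sample = Dict[str, int]


def verify_addition(sample: Sample) -> bool:
    """Recompute the sum independently and compare it with the claimed answer."""
    return sample["a"] + sample["b"] == sample["answer"]


def filter_verified(
    samples: List[Sample], verifier: Callable[[Sample], bool]
) -> List[Sample]:
    """Keep only the samples that pass the verifier; discard the rest."""
    return [sample for sample in samples if verifier(sample)]


# Hypothetical synthetic candidates; the second one carries a wrong answer.
candidates = [
    {"a": 2, "b": 3, "answer": 5},
    {"a": 10, "b": 7, "answer": 18},
    {"a": -4, "b": 4, "answer": 0},
]

kept = filter_verified(candidates, verify_addition)
print(len(kept))  # 2
```

The verifier is passed in as a parameter so the same filtering loop could be reused with a different check per domain.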

Data Beyond English and the Web

While the web contains extensive information about humanity, history, culture, and science, certain fundamental contexts about specific businesses or domains are missing from models built solely on web data [00:10:57]. This includes data such as:

Manufacturing data [00:11:30]
Customer transactions [00:11:33]
Detailed personal health records [00:11:35]

To address these gaps, companies like Cohere partner with organizations that possess this proprietary data to create custom models that are highly effective in those specific domains [00:11:43].

Multilingual Data Collection

For AI technology to be useful globally, it must speak local languages and understand local cultures [00:30:52]. Cohere’s open-source Aya project was described as the largest data collection effort for any machine learning project, involving thousands of native speakers contributing data in over 100 different languages [00:30:27]. This data was open-sourced to benefit all language models, not just Cohere’s [00:30:37]. Cohere is deeply committed to ensuring its technology works as well in Japanese and Korean as it does in English, through partnerships with companies like Fujitsu and LG [00:29:42].

Siloed Data as a Challenge

For specialized domains like cancer research, the issue is not necessarily a scarcity of data (tokens), but rather that existing data is siloed and locked up in numerous places that refuse to share or communicate with each other [00:34:41]. For example, cancer data is often not well-structured or linked to outcomes [00:34:02]. Bio foundation model companies often spin up labs to generate more data, and the robotics sector faces similar data issues, but the problem in areas like cancer research is more a human problem of data access than a data generation problem [00:34:09].
