From: aidotengineer
Data augmentation is a crucial area of focus in artificial intelligence, particularly in the realm of recommendation systems and search. The goal is to address inherent challenges related to data quality, sparsity, and the “cold-start” problem, especially for new items or infrequent queries. [00:03:55]
The Challenge: Data Scarcity and Quality
The “lifeblood of machine learning is data” [00:08:06] – specifically, good quality data at scale. [00:08:10] This is essential for search and recommendation systems, which require extensive metadata, query expansion, synonyms, and spell checking attached to the search index. [00:08:17] Historically, obtaining this data has been costly and high-effort, often relying on human annotations or complex automatic methods. [00:08:31]
Recommendation systems often suffer from:
- Cold-start problem [00:04:17]: When a new item is introduced, the system has no interaction history and must learn about it from scratch. [00:04:19]
- Sparsity [00:04:24]: Many “tail items” have very few interactions (e.g., one or two, up to ten), which is insufficient for effective learning. [00:04:26]
- Popularity bias [00:04:32]: Because cold-start and sparsity leave tail items hard to learn, systems skew toward recommending already-popular items. [00:04:34]
Leveraging Large Language Models (LLMs) for Data Augmentation
Large Language Models (LLMs) have proven outstanding at generating synthetic data and labels, [00:08:37] offering a solution to these data challenges. [00:08:41]
Case Study: Indeed - Filtering Bad Job Recommendations
Indeed faced the challenge of sending bad job recommendations to users via email, leading to poor user experience and unsubscribes. [00:08:56] While explicit negative feedback (thumbs down) was available, it was very sparse. [00:09:25] Implicit feedback (not acting on recommendations) was often imprecise. [00:09:31]
Their solution involved using a lightweight classifier to filter out bad recommendations, with LLMs assisting in data generation and labeling:
- Human Labeling: Experts initially labeled job recommendations and user pairs based on resume and activity data. [00:10:05]
- LLM Prompting (Initial Attempts):
- Open LLMs (Mistral, Llama 2) showed very poor performance, struggling to pay attention to resume and job description details, providing generic output. [00:10:20]
- GPT-4 performed well (90% precision and recall) but was too costly and slow (22 seconds per query). [00:10:40]
- GPT-3.5 had poor precision; 37% of recommendations flagged as bad were actually good. [00:10:56]
- Fine-tuning and Distillation:
- They then [[Finetuning AI models for specific use cases | fine-tuned]] GPT-3.5, which achieved the desired precision (0.3) at a quarter of GPT-4’s cost and latency, but was still too slow for online filtering (6.7 seconds per query). [00:11:30]
- The final step involved [[Techniques for improving AI model efficiency | distilling]] a lightweight classifier using the [[Finetuning AI models for specific use cases | fine-tuned]] GPT-3.5’s labels. [00:11:51] This classifier achieved high performance (0.86 AUC-ROC) and was fast enough for real-time filtering (less than 200 milliseconds). [00:11:58]
The outcome was a 20% reduction in bad recommendations. [00:12:20] Surprisingly, application rates increased by 4% and unsubscribe rates dropped by 5%, demonstrating that “quantity is not everything. Quality makes a big difference.” [00:12:55]
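The distillation step above can be sketched as follows. This is a minimal illustration, not Indeed's actual pipeline: the resume/job pairs and labels are invented stand-ins for ones a fine-tuned LLM would produce offline, and the TF-IDF + logistic-regression model simply demonstrates a classifier cheap enough to score recommendations in well under 200 ms.

```python
# Sketch: distilling LLM-generated labels into a lightweight classifier.
# The data and model here are illustrative assumptions, not Indeed's system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (resume snippet || job description) pairs, labeled offline by an LLM:
# 1 = bad recommendation (filter it out), 0 = good recommendation.
pairs = [
    "senior java backend engineer || entry-level retail cashier",
    "java backend engineer || java microservices developer",
    "registered nurse icu || icu nurse night shift",
    "registered nurse icu || forklift operator warehouse",
    "data analyst sql python || senior data analyst sql",
    "data analyst sql python || commercial truck driver",
]
llm_labels = [1, 0, 0, 1, 0, 1]

# A TF-IDF + logistic regression pipeline is orders of magnitude cheaper
# than calling an LLM per recommendation, so it can run in the email path.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(pairs, llm_labels)

# At serving time, score each candidate and drop likely-bad recommendations.
p_bad = clf.predict_proba(["registered nurse icu || pediatric nurse clinic"])[0, 1]
```

The design point is the same one the talk makes: the expensive model is used once, offline, to create labels; the model that runs per-request is a fast student trained on those labels.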
Case Study: Spotify - Query Recommendations for New Categories
Spotify, traditionally known for songs and artists, introduced podcasts and audiobooks, facing a severe cold-start problem for these new content categories. [00:13:08] Exploratory search became essential for expanding beyond music. [00:13:40]
Their solution was a query recommendation system. [00:13:53]
- Query Generation: Initial query ideas were extracted from catalog and playlist titles, and mined from search logs using conventional techniques. [00:14:01]
- LLM Augmentation: LLMs were then used to “generate natural language queries” [00:14:20] to augment this existing data. The strategy was to “use the LLM to augment it when you need it. Don’t use the LLM for everything at the start.” [00:14:29]
- Ranking: These exploratory queries were then ranked alongside immediate search results. [00:14:46]
This [[Enhancing existing systems with AI capabilities | enhancement to the existing system]] led to a 9% increase in exploratory queries, meaning one-tenth of Spotify’s users were now exploring new products daily. [00:15:13] This significant engagement accelerated the growth of new product categories. [00:15:27]
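Spotify's "augment only when you need it" strategy can be sketched as follows. Everything here is hypothetical: `generate_with_llm` is a placeholder for a real LLM call (implemented as template expansion so the sketch runs offline), and the coverage threshold is an invented heuristic for deciding which titles are sparse enough to warrant augmentation.

```python
# Sketch: augment mined query candidates with LLM-generated queries, but only
# for titles with thin coverage ("don't use the LLM for everything").
def generate_with_llm(seed_title: str, n: int = 3) -> list[str]:
    # Placeholder for a real LLM call, e.g. prompting:
    # "Write {n} natural search queries a listener might type to find
    #  the audiobook '{seed_title}'."
    templates = ["best {t} audiobook", "{t} full audiobook", "books like {t}"]
    return [tpl.format(t=seed_title.lower()) for tpl in templates[:n]]

def build_candidates(mined: dict[str, list[str]], min_queries: int = 5) -> dict:
    """Keep mined queries; call the LLM only where coverage is thin."""
    out = {}
    for title, queries in mined.items():
        if len(queries) < min_queries:  # cold-start / tail category
            queries = queries + generate_with_llm(title)
        out[title] = queries
    return out

mined = {
    "Project Hail Mary": ["project hail mary"],           # new audiobook: sparse
    "Top Hits Playlist": ["top hits", "pop hits", "hits 2024",
                          "top songs", "chart toppers"],  # music: well covered
}
candidates = build_candidates(mined)
```

Only the sparse audiobook title gets LLM-generated queries appended; the well-covered music entry passes through untouched, which keeps LLM cost proportional to the cold-start surface rather than the whole catalog.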
Benefits of LLM-Augmented Synthetic Data
LLM-augmented synthetic data provides significant benefits:
- Richer, High-Quality Data at Scale: It allows for the creation of more comprehensive and precise data, even for “tail queries” and “tail items” where engagement data is scarce, addressing [[Challenges and solutions in AI driven data processing | challenges]] in data processing. [00:15:37]
- Lower Cost and Effort: This approach comes at “far lower cost and effort than is even possible with human annotation.” [00:15:46]
- Efficiency and Scalability: By [[Leveraging AI Tools for Efficiency and Scalability | leveraging AI tools]] like LLMs for data generation, systems can achieve greater efficiency and scale in data acquisition. [00:15:37]
Overall, data augmentation, particularly through the strategic use of LLMs, is a key strategy for improving AI model performance and addressing fundamental data challenges in complex systems like recommendation engines and search platforms. [00:22:57]