From: aidotengineer
The integration of artificial intelligence (AI) into recommendation systems is evolving, with a focus on enhancing user experience and addressing long-standing challenges in personalized content delivery [00:02:47]. This field has seen significant advancements, particularly with the advent of large language models (LLMs) and transformer-based architectures [00:03:33].
Evolution of AI in Recommendation Systems
Language modeling techniques have been applied to recommendation systems since 2013, initially focusing on learning item embeddings from co-occurrences in user interaction sequences [00:03:08]. Early methods, like GRU4Rec (based on gated recurrent units), predicted the next item from short sequences [00:03:22]. The introduction of transformers and attention mechanisms improved handling of long-range dependencies, allowing models to process hundreds to thousands of items in a user sequence [00:03:33].
Current trends in improving recommendation systems with AI focus on three key ideas:
- Semantic IDs [00:03:55]
- Data Augmentation [00:03:56]
- Unified Models [00:03:57]
Addressing Challenges in Recommendation Systems
Semantic IDs
Traditional hash-based item IDs do not encode the content of the item, leading to the “cold-start problem” for new items and sparsity issues for long-tail items with few interactions [00:04:00]. This causes recommendation systems to be popularity-biased [00:04:30].
Solution: Multimodal Semantic IDs Semantic IDs, potentially involving multimodal content, offer a solution by encoding the item’s content [00:04:39].
Example: Kuaishou Kuaishou, a short video platform in China, faced challenges learning from hundreds of millions of daily video uploads [00:04:59]. They developed a trainable multimodal semantic ID system [00:05:15].
- Architecture: A two-tower network where content input is integrated [00:05:20].
- Content Encoding: ResNet for visual [00:05:49], BERT for video descriptions [00:05:52], and VGGish for audio [00:05:55].
- Trainable Embeddings: Content embeddings are concatenated, and K-means clustering is used to learn a fixed number of cluster IDs (e.g., 1,000 for 100 million videos) [00:06:08]. These cluster IDs are trainable and map the content space to the behavioral space [00:06:43].
- Results: Semantic IDs not only outperformed hash-based IDs on clicks and likes [00:06:59], but also increased cold-start coverage by 3.6% and cold-start velocity significantly [00:07:07].
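A minimal sketch of the pipeline described above (not Kuaishou's actual code; the item counts, embedding dimensions, and cluster count are illustrative stand-ins): frozen content embeddings from the three encoders are concatenated, K-means assigns each item a cluster ID, and a trainable embedding table over those cluster IDs lets the ranking model map content similarity into the behavioral space.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Stand-ins for frozen content embeddings per item
# (in the talk: ResNet for frames, BERT for descriptions, VGGish for audio).
n_items = 2_000
visual = np.random.randn(n_items, 512).astype(np.float32)
text   = np.random.randn(n_items, 256).astype(np.float32)
audio  = np.random.randn(n_items, 128).astype(np.float32)

# 1. Concatenate modalities into a single content vector per item.
content = np.concatenate([visual, text, audio], axis=1)

# 2. Cluster the content space into a fixed vocabulary of semantic IDs
#    (Kuaishou reportedly used ~1,000 clusters; 64 here to keep the demo small).
n_clusters = 64
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(content)
semantic_ids = kmeans.predict(content)  # one cluster ID per item

# 3. Give each cluster ID a *trainable* embedding that is learned from
#    behavioral signals (clicks, likes) inside the ranking model, so the
#    content space is mapped into the behavioral space.
class ItemTower(nn.Module):
    def __init__(self, n_clusters: int, dim: int = 32):
        super().__init__()
        self.cluster_emb = nn.Embedding(n_clusters, dim)  # trainable semantic-ID table
        self.proj = nn.Linear(dim, dim)

    def forward(self, cluster_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.cluster_emb(cluster_ids))

tower = ItemTower(n_clusters)
item_vectors = tower(torch.from_numpy(semantic_ids[:8]).long())
print(item_vectors.shape)  # (8, 32) -- usable inside the behavioral two-tower model
```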
Benefits of Semantic IDs:
- Addresses the cold-start problem [00:07:34].
- Recommendations understand content [00:07:38].
- Enables human-readable explanations for recommendations when combined with language models [00:07:46].
Data Augmentation with LLMs
High-quality, scaled data is crucial for search and recommendation systems [00:08:06]. Generating metadata, query expansions, and synonyms is costly and labor-intensive [00:08:31]. LLMs excel at generating synthetic data and labels [00:08:37].
Example: Indeed Indeed faced challenges with poor job recommendation quality, leading to users losing trust and unsubscribing from emails [00:08:51]. Explicit negative feedback (thumbs down) was sparse, and implicit feedback was imprecise [00:09:25].
Solution: LLM-distilled Lightweight Classifier Indeed developed a lightweight classifier to filter bad recommendations [00:09:50].
- Initial Approach: Human experts labeled job recommendations and user pairs [00:10:03].
- LLM Experimentation:
- Open LLMs (Mistral, Llama 2) performed poorly, producing generic output [00:10:20].
- GPT-4 achieved 90% precision and recall but was too costly and slow (22 seconds) [00:10:38].
- GPT-3.5 had very poor precision (63%), leading to discarding good recommendations [00:10:56].
- Fine-tuning GPT-3.5 yielded the desired precision but was still too slow for online filtering (6.7 seconds) [00:11:30].
- Final Solution: A lightweight classifier was distilled from the fine-tuned GPT-3.5's labels, achieving high performance (0.86 AUC-ROC) and real-time latency [00:11:51].
Outcome: Bad recommendations were reduced by 20% [00:12:20]. Application rate increased by 4%, and unsubscribe rate decreased by 5% [00:12:45]. This demonstrated that quality, not just quantity, significantly impacts recommendation performance [00:12:55].
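A hedged sketch of the distillation pattern described above (not Indeed's implementation; the features, the hidden labeling rule, and the logistic-regression student are illustrative): an expensive fine-tuned LLM labels (job, job-seeker) pairs offline, and a lightweight classifier trained on those labels runs in the online filtering path.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative stand-in: in practice these would be features of a
# (job, job-seeker) pair, e.g. title/skill overlap, location distance, seniority match.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 16))

# "Teacher" labels: offline, a fine-tuned LLM judges each pair as a good (1)
# or bad (0) recommendation. Here we fake them with a hidden linear rule plus noise.
w = rng.normal(size=16)
teacher_labels = ((X @ w + rng.normal(scale=0.5, size=5_000)) > 0).astype(int)

# Distill: train a lightweight "student" classifier on the LLM-provided labels.
X_tr, X_te, y_tr, y_te = train_test_split(X, teacher_labels, test_size=0.2, random_state=0)
student = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

# Evaluate with AUC-ROC (Indeed reported ~0.86 for their distilled model).
auc = roc_auc_score(y_te, student.predict_proba(X_te)[:, 1])
print(f"student AUC-ROC: {auc:.2f}")

# Online: scoring one pair is a dot product plus a sigmoid, i.e. sub-millisecond,
# so bad recommendations can be filtered in real time before an email is sent.
```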
Example: Spotify Spotify aimed to grow its podcast and audiobook categories beyond music, facing a cold-start problem for new content types [00:13:04].
Solution: LLM-Generated Query Recommendations Spotify used LLMs to generate natural language queries for exploratory search [00:14:16].
- Query Generation: Combined conventional techniques (catalog titles, search logs, artist covers) with LLM-generated queries [00:14:01].
- User Experience: When a user searches, the system presents query recommendations at the top, informing users about new categories like audiobooks or podcasts without obtrusive banners [00:14:49].
Outcome: A 9% increase in exploratory queries, meaning one-tenth of users explored new products daily, rapidly growing new content categories [00:15:11].
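A rough sketch of how such query recommendations could be assembled, assuming a generic call_llm helper (Spotify's actual prompts, models, and blending logic are not public): LLM-generated exploratory queries are merged with conventional sources and deduplicated before ranking.

```python
def build_exploratory_query_prompt(artist: str, user_genres: list[str]) -> str:
    """Illustrative prompt asking an LLM for natural-language queries that
    nudge users toward newer catalog types (podcasts, audiobooks)."""
    return (
        "You generate short search queries for a music/podcast/audiobook app.\n"
        f"The user listens to {artist} and likes {', '.join(user_genres)}.\n"
        "Suggest 5 natural-language queries that would lead them to podcasts "
        "or audiobooks related to their taste. Return a JSON list of strings."
    )

def call_llm(prompt: str) -> list[str]:
    # Stand-in for an actual LLM call; any provider/client would work here.
    return ["podcasts about indie music history",
            "audiobooks narrated by musicians",
            "true crime podcasts for long commutes"]

def query_recommendations(artist, user_genres, catalog_titles, search_log_queries):
    llm_queries = call_llm(build_exploratory_query_prompt(artist, user_genres))
    # Blend LLM-generated queries with conventional sources (catalog titles,
    # historical search logs), dedupe, and let a downstream ranker order them.
    candidates = list(dict.fromkeys(llm_queries + catalog_titles + search_log_queries))
    return candidates[:10]

print(query_recommendations("Phoebe Bridgers", ["indie folk"],
                            ["Indie Folk Essentials"], ["sad indie songs"]))
```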
Benefits of LLM-augmented Synthetic Data:
- Richer, high-quality data at scale, especially for tail queries and items [00:15:35].
- Lower cost and effort compared to human annotation [00:15:46].
Unified Models
Traditional companies often have separate systems and models for ads, recommendations (e.g., homepage, item-to-item, cart), and search [00:16:03]. This leads to duplicative engineering, high maintenance costs, and limited knowledge transfer between models [00:16:36].
Solution: Unified Models Inspired by successes in vision and language, unified models consolidate multiple tasks into a single architecture [00:16:47].
Example: Netflix (Unicorn Ranker) Netflix faced high operational costs and missed learning opportunities due to bespoke models for search, similar item recommendations, and pre-query recommendations [00:17:14].
Solution: Unified Contextual Ranker (Unicorn) Netflix developed a unified contextual ranker, Unicorn, to consolidate these tasks [00:17:32].
- Architecture: Takes unified input (user ID, item ID, search query, country, task) and uses a user foundation model with a context and relevance model [00:17:39].
- Missing Item Imputation: For tasks like item-to-item recommendations where a search query might be absent, the model imputes it using the current item’s title [00:18:27].
- Outcome: The unified model matched or exceeded the metrics of specialized models across multiple tasks [00:18:48]. This reduced technical debt and built a better foundation for future iterations, enabling faster innovation [00:19:02].
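A minimal illustration of the unified-input idea, including missing-item imputation (the UnifiedInput schema and field names are assumptions, not Netflix's actual interface):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedInput:
    user_id: str
    item_id: str
    country: str
    task: str                      # "search" | "similar_items" | "pre_query"
    search_query: Optional[str] = None

def impute_missing_fields(x: UnifiedInput, item_titles: dict[str, str]) -> UnifiedInput:
    # For item-to-item tasks there is no user-typed query, so impute it
    # from the current item's title; the same ranker then handles all tasks.
    if x.search_query is None and x.task == "similar_items":
        x.search_query = item_titles.get(x.item_id, "")
    return x

example = UnifiedInput(user_id="u1", item_id="m42", country="US", task="similar_items")
example = impute_missing_fields(example, item_titles={"m42": "Stranger Things"})
print(example.search_query)  # the item title is fed to the ranker as the query
```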
Example: Etsy (Unified Embeddings) Etsy needed to provide better results for both specific and broad queries while dealing with a constantly changing inventory [00:19:24]. Lexical embeddings didn’t account for user preferences [00:19:50].
Solution: Unified Embedding and Retrieval Etsy implemented a unified embedding approach based on a two-tower model [00:20:01].
- Product Encoder: Uses T5 models for text embeddings (item descriptions) and query-product logs for query embeddings [00:20:18].
- Query Encoder: Incorporates search query, product category, and user location [00:20:35].
- Personalization: User preferences (past queries, purchases) are encoded [00:20:57].
- Quality Vector: A quality vector (ratings, freshness, conversion rate) is concatenated to the product embedding to ensure relevant and high-quality results [00:21:17].
- Outcome: A 2.6% increase in conversion across the entire site and over 5% increase in search purchases [00:21:51].
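A small sketch of the quality-vector trick under assumed dimensions (not Etsy's implementation): appending quality features to the product embedding, and matching weights to the query-side vector, makes the retrieval dot product a sum of a relevance term and a quality term.

```python
import torch
import torch.nn.functional as F

d = 64          # learned embedding dimension (illustrative)
q_dim = 3       # quality features: e.g. [rating, freshness, conversion rate]

def product_vector(text_emb: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
    # Concatenate the quality vector onto the learned product embedding.
    return torch.cat([F.normalize(text_emb, dim=-1), quality], dim=-1)

def query_vector(query_emb: torch.Tensor, quality_weights: torch.Tensor) -> torch.Tensor:
    # Append weights on the query side so the dot product becomes:
    # relevance score + weighted sum of quality features.
    return torch.cat([F.normalize(query_emb, dim=-1), quality_weights], dim=-1)

# Toy example: two products with identical text relevance but different quality;
# the higher-quality product scores higher at retrieval time.
q = query_vector(torch.randn(d), torch.tensor([0.5, 0.2, 0.8]))
text = torch.randn(d)
p_good = product_vector(text, torch.tensor([0.9, 0.7, 0.6]))
p_poor = product_vector(text, torch.tensor([0.1, 0.1, 0.1]))
print((q @ p_good).item(), (q @ p_poor).item())
```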
Benefits of Unified Models:
- Simplifies the system [00:22:10].
- Improvements to one part of the model automatically benefit other use cases [00:22:12].
- Challenge: A potential “alignment tax,” where improving one task degrades another, requires careful model design [00:22:24].
Pinterest’s Approach to Search Relevance
Pinterest, a visual discovery platform handling over six billion searches monthly across billions of pins and 45+ languages [00:29:02], focuses on semantic relevance modeling in the re-ranking stage [00:30:00].
LLMs for Relevance Prediction
- Model: A cross-encoder structure concatenates query and pin text, passing them to an LLM to get an embedding [00:31:09]. This embedding then feeds into an MLP layer to predict relevance on a five-point scale [00:31:30].
- Fine-tuning: Open-source LLMs are fine-tuned using Pinterest’s internal data [00:31:42].
- Results: LLMs substantially improved relevance prediction performance [00:32:10]. Larger, more advanced LLMs (e.g., 8 billion parameters) showed significant improvements (12% over multilingual BERT, 20% over in-house embedding model) [00:32:21].
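A compact sketch of such a cross-encoder (the distilbert-base-uncased checkpoint is a small stand-in for the fine-tuned open-source LLMs Pinterest describes; the pooling choice and head sizes are assumptions):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # illustrative stand-in model

class RelevanceCrossEncoder(nn.Module):
    def __init__(self, model_name: str = MODEL_NAME, n_levels: int = 5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # MLP head mapping the pooled embedding to a 5-point relevance scale.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_levels))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]          # first-token pooling
        return self.head(pooled)                      # logits over 5 relevance levels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = RelevanceCrossEncoder()

# The search query and the pin's text representation are concatenated into one input.
batch = tokenizer(["hiking boots [SEP] waterproof leather hiking boots for women"],
                  return_tensors="pt", truncation=True)
logits = model(**batch)
print(logits.softmax(-1))  # predicted distribution over the 5-point relevance scale
```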
Leveraging Multimodal Content Vision-language models (VLMs) generate captions from images and videos [00:32:54]. User actions (saves, clicks, searches) also provide valuable content annotations [00:32:56]. This diverse data helps create text representations of pins for LLM-based relevance prediction [00:33:06].
Efficiency in LLM Serving To achieve low latency (under 500ms) for LLM serving, Pinterest employs three levers:
- Specification: Optimizing attention scores for specific tasks [00:33:41].
- Smaller Models: Distilling larger models (e.g., 150 billion parameters) into smaller ones (8B, 3B, 1B) step-by-step to retain reasoning power [00:33:50]. Aggressive pruning can significantly reduce model quality [00:35:17].
- Quantization: Using lower precision (FP8) for activations and parameters, with mixed precision where the LLM head remains in FP32 for better calibration [00:36:03].
These optimizations resulted in a 7x reduction in latency and a 30x increase in throughput (queries per GPU) [00:37:37].
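A generic sketch of the distillation step behind the “smaller models” lever, using toy stand-in models (the temperature, loss, and optimizer choices are illustrative, not Pinterest's recipe): a student is trained to match the teacher's softened output distribution, and the step can be repeated stage by stage (e.g., 150B teacher to 8B, 8B to 3B, 3B to 1B) with each stage's student becoming the next stage's teacher.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature: float = 2.0):
    """One knowledge-distillation step: the student matches the teacher's
    softened relevance distribution."""
    with torch.no_grad():
        teacher_logits = teacher(batch)                       # soft targets
    student_logits = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stand-in models (real teachers/students are LLM relevance models).
teacher = torch.nn.Linear(16, 5)
student = torch.nn.Linear(16, 5)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
print(distill_step(student, teacher, torch.randn(32, 16), opt))
```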
Netflix’s Foundational Model for Personalization
Netflix aims to use one foundational model to cover all recommendation use cases, addressing diverse recommendation needs across various content types and pages [02:23:24]. Traditionally, this led to many specialized, independently built models with duplicative engineering and feature engineering [02:24:52].
Hypothesis:
- Scalability: By scaling up semi-supervised learning, personalization can be improved, much as scaling has improved LLMs [02:27:36].
- High Leverage: Integrating the foundation model into all systems can simultaneously improve downstream models [02:27:49].
Data and Training:
- Tokenization: Crucial decision for model quality [02:28:38]. Unlike LLMs, each event token in recommendations has multiple facets (when, where, what) [02:29:11].
- Event Representation: Deciding what information (time, location, device, canvas, entity, interaction type/duration) to keep or drop [02:30:31].
- Embedding Layer: Combines ID embedding learning with semantic content information to address the cold-start problem [02:31:21].
- Transformer Layer: Hidden state output is used as user representation [02:32:07]. Stability and aggregation methods are key considerations [02:32:26].
- Objective/Loss Function: Richer than LLMs, using multiple sequences and facets of events (action type, metadata, future action prediction) as targets [02:33:04]. This can be a multitask learning problem or used for weighting/masking [02:34:01].
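An illustrative sketch of multi-facet event tokenization (the facet names, bucketing, and the split between token IDs and side features are assumptions, not Netflix's schema): unlike an LLM word token, one interaction “token” carries several facets, and the tokenization step decides which facets to keep.

```python
from dataclasses import dataclass

@dataclass
class InteractionEvent:
    timestamp: int          # when
    device: str             # where / on what
    canvas: str             # which page or row the item was shown on
    entity_id: str          # what was interacted with
    action: str             # play, pause, thumbs-up, add-to-list, ...
    duration_s: int         # how long

def tokenize_event(e: InteractionEvent) -> dict:
    """One possible choice: keep some facets as categorical token IDs and
    feed the rest (e.g. time) as side features to the embedding layer."""
    return {
        "entity": e.entity_id,                 # ID embedding (+ semantic content embedding)
        "action": e.action,                    # target facet for the loss
        "context": (e.device, e.canvas),       # auxiliary embeddings, summed or concatenated
        "time_bucket": e.timestamp // 3600,    # coarse temporal feature
        "duration_bucket": min(e.duration_s // 300, 12),
    }

seq = [InteractionEvent(1_700_000_000, "tv", "home_row_3", "title_123", "play", 2_400)]
print([tokenize_event(e) for e in seq])
```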
Scaling Results: Netflix observed continuous gains in model quality over two and a half years by scaling data and model parameters (up to 1 billion parameters) [02:34:27]. While scaling can continue, stringent latency and cost requirements necessitate distillation back to smaller models [02:35:08].
Learnings from LLMs:
- Multi-token prediction: Forces the model to be less myopic, more robust to serving time shifts, and targets long-term user satisfaction [02:35:36].
- Multi-layer representation: Techniques like layer-wise supervision and self-distillation create better and more stable user representations [02:36:12].
- Long context window handling: Using truncated sliding windows, sparse attention, and progressive training for longer sequences improves efficiency [02:36:34].
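A toy sketch of how multi-token (next-k) prediction targets can be built from a watch sequence (the window size and data format are assumptions): predicting several future events instead of only the next one is what makes the model less myopic.

```python
def multi_token_targets(sequence: list[str], k: int = 3):
    """For each position, the targets are the next k events rather than just
    the next one, which discourages myopic predictions and is more robust to
    the gap between training time and serving time."""
    examples = []
    for i in range(len(sequence) - k):
        examples.append((sequence[: i + 1], sequence[i + 1 : i + 1 + k]))
    return examples

watches = ["title_a", "title_b", "title_c", "title_d", "title_e"]
for context, targets in multi_token_targets(watches, k=3):
    print(context, "->", targets)
```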
Serving and Applications: The foundation model consolidates the data and representation layers (user and content representation), making application models thinner [02:37:12].
- Integration as Subgraph: FM can be integrated as a pre-trained subgraph within downstream neural network models [02:37:48].
- Embedding Push-out: Content and member embeddings can be pushed to a centralized store, allowing wider use cases beyond personalization (e.g., analytics) [02:38:15].
- Fine-tuning/Distillation: Users can extract and fine-tune models for specific applications or distill them to meet online serving requirements [02:38:51].
Outcome: The foundational model has been incorporated into many applications, leading to significant AB test wins and infrastructure consolidation [02:39:12]. It’s a scalable solution that improves quality and accelerates innovation velocity [02:40:03].
Instacart’s Search and Discovery Enhancement with LLMs
Instacart, a leader in online grocery, focuses on search to help customers find both restocking items and discover new products [02:46:01]. New product discovery benefits customers, advertisers, and increases basket sizes [02:47:17].
Challenges with Conventional Search Engines:
- Broad Queries: Overly broad queries (e.g., “snacks”) map to many products but lack engagement data for proper ranking [02:47:48].
- Specific Queries: Very specific queries (e.g., “unsweetened plant-based yogurt”) occur infrequently, leading to insufficient engagement data [02:48:12].
- Precision vs. Recall: Traditional models improved recall but struggled with precision [02:48:37].
- Limited Discovery: Users found it difficult to discover related products after finding an initial item, requiring multiple searches [02:49:03].
Upleveling Query Understanding with LLMs: Instacart used LLMs to enhance its query understanding module, which includes query normalization, tagging, and classification [02:49:37].
Query-to-Category Classifier:
- Task: Map a query (e.g., “watermelon”) to relevant categories (e.g., “fruits”) within a taxonomy of ~10,000 labels [02:50:16]. This is a multi-label classification problem [02:50:40].
- Previous Models: FastText neural networks and NPMI (statistical co-occurrence) models worked for head/torso queries but had low coverage for tail queries due to lack of engagement data [02:50:46]. BERT-based models offered some improvement, but not enough to justify the increased latency [02:51:22].
- LLM Approach: Initially, LLMs alone produced decent but not optimal results in A/B tests because they didn’t understand Instacart-specific user behavior (e.g., “protein” meaning shakes/bars, not chicken/tofu) [02:51:36].
- Hybrid Approach: The prompt was augmented with Instacart domain knowledge, such as top-converting categories for each query and annotations from query understanding models (brands, dietary attributes) [02:52:30].
- Outcome: For tail queries, precision improved by 18 percentage points and recall by 70 percentage points [02:53:29]. A simple prompt with contextual information was highly effective [02:53:42].
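A hedged sketch of such a domain-augmented prompt (the wording, field names, and the build_category_prompt helper are assumptions, not Instacart's production prompt): the LLM sees Instacart-style signals (top converting categories, existing query-understanding annotations) instead of guessing from general world knowledge.

```python
def build_category_prompt(query: str,
                          top_converting_categories: list[str],
                          query_annotations: dict) -> str:
    """Illustrative hybrid prompt for multi-label query-to-category classification."""
    return (
        "You map grocery search queries to categories in our taxonomy.\n"
        f"Query: {query!r}\n"
        "Categories this query has converted against historically: "
        f"{', '.join(top_converting_categories) or 'none (tail query)'}\n"
        f"Known annotations from query understanding models: {query_annotations}\n"
        "Return up to 5 taxonomy categories as a JSON list, most relevant first."
    )

prompt = build_category_prompt(
    query="protein",
    top_converting_categories=["protein shakes", "protein bars"],
    query_annotations={"dietary": ["high-protein"], "brand": None},
)
print(prompt)
```

With the historical conversion signal included, the model is steered toward “protein shakes/bars” rather than the generic “chicken/tofu” reading of the query.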
Query Rewrites Model:
- Purpose: Essential for e-commerce, especially with varied retailer catalogs, to ensure results are returned even for specific queries [02:54:10] (e.g., “1% milk” to “milk”) [02:54:27].
- Previous Approach: Engagement data-trained models struggled with tail queries [02:54:37].
- LLM Approach: LLMs generated precise rewrites (substitute, broad, synonymous) [02:54:48].
- Outcome: Significant offline improvements through human evaluation and a large drop in “no results” queries online [02:55:16], benefiting the business by showing results where none existed before [02:55:47].
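A similar hedged sketch for rewrite generation, again with an assumed prompt format and a hypothetical JSON response shape covering the three rewrite types mentioned above:

```python
import json

def build_rewrite_prompt(query: str) -> str:
    """Illustrative prompt asking for substitute, broad, and synonymous rewrites."""
    return (
        "For the grocery search query below, produce JSON with three fields:\n"
        '  "substitute": a close replacement query,\n'
        '  "broad": a broader fallback query,\n'
        '  "synonymous": an equivalent phrasing.\n'
        f"Query: {query!r}"
    )

def parse_rewrites(llm_output: str) -> dict:
    # e.g. for "1% milk" an LLM might return:
    # {"substitute": "2% milk", "broad": "milk", "synonymous": "one percent milk"}
    return json.loads(llm_output)

print(build_rewrite_prompt("1% milk"))
```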
Scoring and Serving: Instacart precomputes outputs for head and torso queries offline in batch mode and caches them for low-latency online serving [02:56:06]. For the long tail, existing models are currently used as a fallback, with plans to replace them with distilled LLM models [02:56:49].
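A minimal sketch of this precompute-and-cache serving pattern with a fallback model (function and field names are illustrative):

```python
def serve_query_understanding(query: str, cache: dict, fallback_model) -> dict:
    """Head/torso queries are precomputed offline in batch with the LLM and
    cached; tail queries fall back to the existing lightweight models
    (to be replaced later by a distilled LLM)."""
    cached = cache.get(query.strip().lower())
    if cached is not None:
        return cached                      # low-latency lookup, no LLM call online
    return fallback_model(query)           # existing model handles the long tail

# Toy usage
cache = {"watermelon": {"categories": ["fruits"], "rewrites": ["melon"]}}
fallback = lambda q: {"categories": [], "rewrites": [q.split()[-1]]}
print(serve_query_understanding("Watermelon", cache, fallback))
print(serve_query_understanding("unsweetened plant-based yogurt", cache, fallback))
```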
Future Directions for Query Understanding:
- Consolidation: Consolidating multiple query understanding models into a single LLM to improve consistency and simplify management [02:57:21] (e.g., fixing “hum” vs. “hummus” error) [02:57:41].
- Contextual Understanding: Using LLMs to understand the customer’s broader mission (e.g., buying ingredients for a recipe) [02:58:13].
Discovery-Oriented Content in Search Results: Users often found search results pages to be “dead ends” after adding an item to the cart [02:58:43]. LLMs were used to generate substitute and complementary results directly on the page [02:59:17].
- Substitute Results: For queries with no exact matches (e.g., “swordfish”), LLMs generate alternatives (e.g., “other seafood alternatives”) [02:59:21].
- Complementary Results: For queries with many exact matches (e.g., “sushi”), LLMs suggest related items (e.g., “Asian cooking ingredients”) at the bottom of the page [02:59:31].
- Outcome: Both led to improvements in engagement and revenue per search [02:59:51].
Generation Requirements and Techniques:
- Incrementality: Generate content incremental to existing solutions [03:00:08].
- Domain Alignment: LLM answers must align with Instacart’s domain knowledge (e.g., “dishes” meaning cookware, not food, unless specified) [03:00:14].
- Prompt Augmentation: Initially, LLM-generated common sense answers didn’t align with user behavior [03:01:11]. Augmenting prompts with Instacart domain knowledge (top converting categories, query annotations, subsequent user queries) significantly improved results [03:01:41].
Serving Discovery Content: Similar to query understanding, Instacart precomputes LLM outputs for historical search logs in batch mode, storing query content metadata and potential products [03:02:30]. Online serving is a quick lookup from a feature store, maintaining low latency [03:02:52].
Key Challenges Faced:
- Aligning with Business Metrics: Iterating on prompts and metadata to ensure LLM generations drive top-line wins like revenue [03:03:13].
- Ranking Improvement: Traditional models failed; strategies like diversity-based re-ranking were needed for user engagement [03:03:28].
- Content Evaluation: Ensuring LLM outputs are accurate (not hallucinating) and adhere to product needs, often using an LLM as a judge [03:04:21].
YouTube’s Large Recommender Model (LRM)
YouTube, one of the largest consumer apps globally, relies heavily on recommendation systems for watch time across various surfaces (home, watch next, shorts, search) [03:09:01]. Google has built a new recommendation system using Gemini to recommend YouTube videos [03:07:08].
Adapting Gemini for YouTube Recommendations (LRM):
- Foundation: Starts with a base Gemini checkpoint [03:11:36].
- Continued Pre-training: Teaches the model information about YouTube to create a unified, YouTube-specific checkpoint called LRM [03:11:39].
- Alignment: Aligns LRM for specific recommendation tasks like retrieval and ranking, creating custom versions for major surfaces [03:11:51]. LRM is in production for retrieval and being experimented with for ranking [03:12:06].
Semantic ID for Videos:
- Challenge: Need to tokenize videos to allow LLMs to reason over many videos [03:12:24].
- Process:
- Extract features (title, description, transcript, audio/video frame data) from a video [03:13:21].
- Create a multi-dimensional embedding [03:13:33].
- Quantize using RQ-VAE (a residual-quantized variational autoencoder) to assign a unique semantic ID (SID) token sequence to each video [03:13:36].
- Outcome: The entire corpus of billions of YouTube videos is organized around semantically meaningful tokens, creating a “new language of YouTube videos” [03:13:55]. This is a move away from hash-based tokenization [03:14:25].
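A toy illustration of the residual-quantization idea behind RQ-VAE (codebooks here are random rather than learned jointly with an encoder/decoder, and sizes are arbitrary), showing how one embedding becomes a short tuple of SID tokens:

```python
import numpy as np

def residual_quantize(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """At each level, pick the nearest codebook entry, subtract it, and quantize
    the residual at the next level; the resulting tuple of codes is the SID."""
    codes, residual = [], embedding.copy()
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(dists.argmin())
        codes.append(idx)
        residual = residual - codebook[idx]
    return codes

rng = np.random.default_rng(0)
dim, levels, codebook_size = 64, 3, 256
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(levels)]
video_embedding = rng.normal(size=dim)     # stand-in for the multimodal video embedding
sid = residual_quantize(video_embedding, codebooks)
print(sid)  # e.g. [137, 42, 201] -> tokens like SID_137, SID_42, SID_201
```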
Continued Pre-training of LRM:
- Linking Text and SID: Training tasks connect English text with video tokens (SIDs) [03:14:45] (e.g., “This video has title XYZ” followed by the model outputting the title) [03:15:00].
- Understanding Watch Sequences: Using YouTube engagement data (user paths through videos), the model is prompted with sequences of watched videos (e.g., ABCD) and learns to predict masked videos [03:15:26]. This teaches it relationships between videos based on user engagement [03:15:46].
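An illustrative construction of the two training-task types described above, with assumed prompt formats and SID token spelling (the real training mixture and templates are not public):

```python
import random

def text_sid_linking_example(video: dict) -> dict:
    # Teach the model to connect English text with SID tokens, e.g.
    # prompt: "Video <SID_137><SID_42><SID_201> has the title:" -> target: the title.
    sid = "".join(f"<SID_{c}>" for c in video["sid"])
    return {"prompt": f"Video {sid} has the title:", "target": video["title"]}

def masked_watch_sequence_example(watch_sids: list[str], mask_token="<MASK>") -> dict:
    # Teach relationships implied by engagement: mask one video in a watch
    # sequence (A B C D) and predict it from the others.
    i = random.randrange(len(watch_sids))
    masked = watch_sids.copy()
    target = masked[i]
    masked[i] = mask_token
    return {"prompt": "User watched: " + " ".join(masked), "target": target}

video = {"sid": [137, 42, 201], "title": "How to make sourdough"}
print(text_sid_linking_example(video))
print(masked_watch_sequence_example(["<SID_1>", "<SID_9>", "<SID_3>", "<SID_7>"]))
```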
Generative Retrieval with LRM:
- Process: A prompt is constructed for each user including demographic information, context video, and watch history [03:17:05]. The model then decodes video recommendations as SIDs [03:17:37].
- Outcome: Generates unique and interesting recommendations, particularly for difficult recommendation tasks or users with limited data [03:17:41].
Challenges for YouTube LRM:
- Cost: LRM is powerful but expensive to serve [03:18:36]. YouTube achieved over 95% cost savings to enable production launch [03:19:01].
- Vocabulary Size: Billions of videos (20 billion, millions added daily) vs. 100,000 words in LLM English vocabulary [03:20:24].
- Freshness: Unlike LLMs, recommendation systems demand real-time freshness (minutes/hours) for new content [03:20:51]. LRM requires continuous pre-training on a daily/hourly basis [03:21:35].
- Scale: Must use smaller, more efficient models (Flash and smaller checkpoints) to meet latency and scale requirements for billions of daily active users [03:21:52].
Recipe for LLM-based Recommendation System:
- Tokenize Content: Create a domain-specific language by building rich embeddings from features and quantizing them into atomic tokens [03:22:28].
- Adapt LLM: Link English and the new domain language through training tasks that enable reasoning across both [03:22:56], resulting in a bilingual LLM [03:23:10].
- Prompt with User Information: Construct personalized prompts with user demographics, activity, and actions, then train task-specific models to create a generative recommendation system [03:23:21].
Future of AI in Recommendations
Currently, LLMs primarily augment recommendations, enhancing quality invisibly to users [03:23:56]. Because users don’t directly see what’s happening, this use of AI in recommendation experiences remains “underhyped” [03:24:16].
Upcoming Developments:
- Interactive Experiences: Users will be able to talk to bilingual LLMs in natural language to steer recommendations towards their goals [03:24:26].
- Explainable Recommendations: Recommenders will explain why a candidate was recommended [03:24:42].
- Blurring Search and Recommendations: The lines between search and recommendation will increasingly blur [03:24:52].
- Generative Content: Recommendations may evolve to include personalized versions of content, or even content creation itself, leading to “n-of-1” content generated for individual users [03:25:01]. This represents a significant shift in how users will interact with AI-driven recommendation interfaces.
Learnings:
- Semantic IDs: Solve cold-start and sparsity by encoding content.
- Data Augmentation: LLMs generate rich, high-quality synthetic data for tail queries and items at lower cost.
- Unified Models: Simplify systems, reduce maintenance, and accelerate innovation across multiple recommendation tasks.