From: aidotengineer
The integration of artificial intelligence (AI) into recommendation systems is evolving, with a focus on enhancing user experience and addressing long-standing challenges in personalized content delivery [00:02:47]. This field has seen significant advancements, particularly with the advent of large language models (LLMs) and transformer-based architectures [00:03:33].
Evolution of AI in Recommendation Systems
Language modeling techniques have been applied to recommendation systems since 2013, initially focusing on learning item embeddings from co-occurrences in user interaction sequences [00:03:08]. Early methods, like GRU4Rec (based on gated recurrent units), predicted the next item from short sequences [00:03:22]. The introduction of transformers and attention mechanisms improved handling of long-range dependencies, allowing models to process hundreds to thousands of items in a user sequence [00:03:33].
Current trends in improving recommendation systems with AI focus on three key ideas:
- Semantic IDs [00:03:55]
- Data Augmentation [00:03:56]
- Unified Models [00:03:57]
Addressing Challenges in Recommendation Systems
Semantic IDs
Traditional hash-based item IDs do not encode the content of the item, leading to the “cold-start problem” for new items and sparsity issues for long-tail items with few interactions [00:04:00]. This causes recommendation systems to be popularity-biased [00:04:30].
Solution: Multimodal Semantic IDs Semantic IDs, potentially involving multimodal content, offer a solution by encoding the item’s content [00:04:39].
Example: Kuaishou Kuaishou, a short video platform in China, faced challenges learning from hundreds of millions of daily video uploads [00:04:59]. They developed a trainable multimodal semantic ID system [00:05:15].
- Architecture: A two-tower network where content input is integrated [00:05:20].
- Content Encoding: ResNet for visual [00:05:49], BERT for video descriptions [00:05:52], and VGGish for audio [00:05:55].
- Trainable Embeddings: Content embeddings are concatenated, and K-means clustering is used to learn a fixed number of cluster IDs (e.g., 1,000 for 100 million videos) [00:06:08]. These cluster IDs are trainable and map the content space to the behavioral space [00:06:43].
- Results: Semantic IDs not only outperformed hash-based IDs on clicks and likes [00:06:59], but also increased cold-start coverage by 3.6% and cold-start velocity significantly [00:07:07].
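A minimal sketch of the pipeline described above (not Kuaishou's actual code; the item counts, embedding dimensions, and cluster count are illustrative stand-ins): frozen content embeddings from the three encoders are concatenated, K-means assigns each item a cluster ID, and a trainable embedding table over those cluster IDs lets the ranking model map content similarity into the behavioral space.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Stand-ins for frozen content embeddings per item
# (in the talk: ResNet for frames, BERT for descriptions, VGGish for audio).
n_items = 2_000
visual = np.random.randn(n_items, 512).astype(np.float32)
text   = np.random.randn(n_items, 256).astype(np.float32)
audio  = np.random.randn(n_items, 128).astype(np.float32)

# 1. Concatenate modalities into a single content vector per item.
content = np.concatenate([visual, text, audio], axis=1)

# 2. Cluster the content space into a fixed vocabulary of semantic IDs
#    (Kuaishou reportedly used ~1,000 clusters; 64 here to keep the demo small).
n_clusters = 64
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(content)
semantic_ids = kmeans.predict(content)  # one cluster ID per item

# 3. Give each cluster ID a *trainable* embedding that is learned from
#    behavioral signals (clicks, likes) inside the ranking model, so the
#    content space is mapped into the behavioral space.
class ItemTower(nn.Module):
    def __init__(self, n_clusters: int, dim: int = 32):
        super().__init__()
        self.cluster_emb = nn.Embedding(n_clusters, dim)  # trainable semantic-ID table
        self.proj = nn.Linear(dim, dim)

    def forward(self, cluster_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.cluster_emb(cluster_ids))

tower = ItemTower(n_clusters)
item_vectors = tower(torch.from_numpy(semantic_ids[:8]).long())
print(item_vectors.shape)  # (8, 32) -- usable inside the behavioral two-tower model
```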
Benefits of Semantic IDs:
- Addresses the cold-start problem [00:07:34].
- Recommendations understand content [00:07:38].
- Enables human-readable explanations for recommendations when combined with language models [00:07:46].
Data Augmentation with LLMs
High-quality, scaled data is crucial for search and recommendation systems [00:08:06]. Generating metadata, query expansions, and synonyms is costly and labor-intensive [00:08:31]. LLMs excel at generating synthetic data and labels [00:08:37].
Example: Indeed Indeed faced challenges with poor job recommendation quality, leading to users losing trust and unsubscribing from emails [00:08:51]. Explicit negative feedback (thumbs down) was sparse, and implicit feedback was imprecise [00:09:25].
Solution: LLM-distilled Lightweight Classifier Indeed developed a lightweight classifier to filter bad recommendations [00:09:50].
- Initial Approach: Human experts labeled job recommendations and user pairs [00:10:03].
- LLM Experimentation:
- Open LLMs (Mistral, Llama 2) performed poorly, producing generic output [00:10:20].
- GPT-4 achieved 90% precision and recall but was too costly and slow (22 seconds) [00:10:38].
- GPT-3.5 had very poor precision (63%), leading to discarding good recommendations [00:10:56].
- Fine-tuning GPT-3.5 yielded the desired precision but was still too slow for online filtering (6.7 seconds) [00:11:30].
- Final Solution: A lightweight classifier was distilled from the fine-tuned GPT-3.5's labels, achieving high performance (0.86 AUC-ROC) and real-time latency [00:11:51].
Outcome: Bad recommendations were reduced by 20% [00:12:20]. Application rate increased by 4%, and unsubscribe rate decreased by 5% [00:12:45]. This demonstrated that quality, not just quantity, significantly impacts recommendation performance [00:12:55].
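A hedged sketch of the distillation pattern described above (not Indeed's implementation; the features, the hidden labeling rule, and the logistic-regression student are illustrative): an expensive fine-tuned LLM labels (job, job-seeker) pairs offline, and a lightweight classifier trained on those labels runs in the online filtering path.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative stand-in: in practice these would be features of a
# (job, job-seeker) pair, e.g. title/skill overlap, location distance, seniority match.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 16))

# "Teacher" labels: offline, a fine-tuned LLM judges each pair as a good (1)
# or bad (0) recommendation. Here we fake them with a hidden linear rule plus noise.
w = rng.normal(size=16)
teacher_labels = ((X @ w + rng.normal(scale=0.5, size=5_000)) > 0).astype(int)

# Distill: train a lightweight "student" classifier on the LLM-provided labels.
X_tr, X_te, y_tr, y_te = train_test_split(X, teacher_labels, test_size=0.2, random_state=0)
student = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

# Evaluate with AUC-ROC (Indeed reported ~0.86 for their distilled model).
auc = roc_auc_score(y_te, student.predict_proba(X_te)[:, 1])
print(f"student AUC-ROC: {auc:.2f}")

# Online: scoring one pair is a dot product plus a sigmoid, i.e. sub-millisecond,
# so bad recommendations can be filtered in real time before an email is sent.
```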
Example: Spotify Spotify aimed to grow its podcast and audiobook categories beyond music, facing a cold-start problem for new content types [00:13:04].
Solution: LLM-Generated Query Recommendations Spotify used LLMs to generate natural language queries for exploratory search [00:14:16].
- Query Generation: Combined conventional techniques (catalog titles, search logs, artist covers) with LLM-generated queries [00:14:01].
- User Experience: When a user searches, the system presents query recommendations at the top, informing users about new categories like audiobooks or podcasts without obtrusive banners [00:14:49].
Outcome: A 9% increase in exploratory queries, meaning one-tenth of users explored new products daily, rapidly growing new content categories [00:15:11].
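A rough sketch of how such query recommendations could be assembled, assuming a generic call_llm helper (Spotify's actual prompts, models, and blending logic are not public): LLM-generated exploratory queries are merged with conventional sources and deduplicated before ranking.

```python
def build_exploratory_query_prompt(artist: str, user_genres: list[str]) -> str:
    """Illustrative prompt asking an LLM for natural-language queries that
    nudge users toward newer catalog types (podcasts, audiobooks)."""
    return (
        "You generate short search queries for a music/podcast/audiobook app.\n"
        f"The user listens to {artist} and likes {', '.join(user_genres)}.\n"
        "Suggest 5 natural-language queries that would lead them to podcasts "
        "or audiobooks related to their taste. Return a JSON list of strings."
    )

def call_llm(prompt: str) -> list[str]:
    # Stand-in for an actual LLM call; any provider/client would work here.
    return ["podcasts about indie music history",
            "audiobooks narrated by musicians",
            "true crime podcasts for long commutes"]

def query_recommendations(artist, user_genres, catalog_titles, search_log_queries):
    llm_queries = call_llm(build_exploratory_query_prompt(artist, user_genres))
    # Blend LLM-generated queries with conventional sources (catalog titles,
    # historical search logs), dedupe, and let a downstream ranker order them.
    candidates = list(dict.fromkeys(llm_queries + catalog_titles + search_log_queries))
    return candidates[:10]

print(query_recommendations("Phoebe Bridgers", ["indie folk"],
                            ["Indie Folk Essentials"], ["sad indie songs"]))
```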
Benefits of LLM-augmented Synthetic Data:
- Richer, high-quality data at scale, especially for tail queries and items [00:15:35].
- Lower cost and effort compared to human annotation [00:15:46].
Unified Models
Traditional companies often have separate systems and models for ads, recommendations (e.g., homepage, item-to-item, cart), and search [00:16:03]. This leads to duplicative engineering, high maintenance costs, and limited knowledge transfer between models [00:16:36].
Solution: Unified Models Inspired by successes in vision and language, unified models consolidate multiple tasks into a single architecture [00:16:47].
Example: Netflix (Unicorn Ranker) Netflix faced high operational costs and missed learning opportunities due to bespoke models for search, similar item recommendations, and pre-query recommendations [00:17:14].
Solution: Unified Contextual Ranker (Unicorn) Netflix developed a unified contextual ranker, Unicorn, to consolidate these tasks [00:17:32].
- Architecture: Takes unified input (user ID, item ID, search query, country, task) and uses a user foundation model with a context and relevance model [00:17:39].
- Missing Item Imputation: For tasks like item-to-item recommendations where a search query might be absent, the model imputes it using the current item’s title [00:18:27].
- Outcome: The unified model matched or exceeded the metrics of specialized models across multiple tasks [00:18:48]. This reduced technical debt and built a better foundation for future iterations, enabling faster innovation [00:19:02].
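A minimal illustration of the unified-input idea, including missing-item imputation (the UnifiedInput schema and field names are assumptions, not Netflix's actual interface):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedInput:
    user_id: str
    item_id: str
    country: str
    task: str                      # "search" | "similar_items" | "pre_query"
    search_query: Optional[str] = None

def impute_missing_fields(x: UnifiedInput, item_titles: dict[str, str]) -> UnifiedInput:
    # For item-to-item tasks there is no user-typed query, so impute it
    # from the current item's title; the same ranker then handles all tasks.
    if x.search_query is None and x.task == "similar_items":
        x.search_query = item_titles.get(x.item_id, "")
    return x

example = UnifiedInput(user_id="u1", item_id="m42", country="US", task="similar_items")
example = impute_missing_fields(example, item_titles={"m42": "Stranger Things"})
print(example.search_query)  # the item title is fed to the ranker as the query
```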
Example: Etsy (Unified Embeddings) Etsy needed to provide better results for both specific and broad queries while dealing with a constantly changing inventory [00:19:24]. Lexical embeddings didn’t account for user preferences [00:19:50].
Solution: Unified Embedding and Retrieval Etsy implemented a unified embedding approach based on a two-tower model [00:20:01].
- Product Encoder: Uses T5 models for text embeddings (item descriptions) and query-product logs for query embeddings [00:20:18].
- Query Encoder: Incorporates search query, product category, and user location [00:20:35].
- Personalization: User preferences (past queries, purchases) are encoded [00:20:57].
- Quality Vector: A quality vector (ratings, freshness, conversion rate) is concatenated to the product embedding to ensure relevant and high-quality results [00:21:17].
- Outcome: A 2.6% increase in conversion across the entire site and over 5% increase in search purchases [00:21:51].
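A small sketch of the quality-vector trick under assumed dimensions (not Etsy's implementation): appending quality features to the product embedding, and matching weights to the query-side vector, makes the retrieval dot product a sum of a relevance term and a quality term.

```python
import torch
import torch.nn.functional as F

d = 64          # learned embedding dimension (illustrative)
q_dim = 3       # quality features: e.g. [rating, freshness, conversion rate]

def product_vector(text_emb: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
    # Concatenate the quality vector onto the learned product embedding.
    return torch.cat([F.normalize(text_emb, dim=-1), quality], dim=-1)

def query_vector(query_emb: torch.Tensor, quality_weights: torch.Tensor) -> torch.Tensor:
    # Append weights on the query side so the dot product becomes:
    # relevance score + weighted sum of quality features.
    return torch.cat([F.normalize(query_emb, dim=-1), quality_weights], dim=-1)

# Toy example: two products with identical text relevance but different quality;
# the higher-quality product scores higher at retrieval time.
q = query_vector(torch.randn(d), torch.tensor([0.5, 0.2, 0.8]))
text = torch.randn(d)
p_good = product_vector(text, torch.tensor([0.9, 0.7, 0.6]))
p_poor = product_vector(text, torch.tensor([0.1, 0.1, 0.1]))
print((q @ p_good).item(), (q @ p_poor).item())
```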
Benefits of Unified Models:
- Simplifies the system [00:22:10].
- Improvements to one part of the model automatically benefit other use cases [00:22:12].
- Challenge: A potential “alignment tax,” where improving one task degrades another, requires careful model design [00:22:24].
Pinterest’s Approach to Search Relevance
Pinterest, a visual discovery platform handling over six billion searches monthly across billions of pins and 45+ languages [00:29:02], focuses on semantic relevance modeling in the re-ranking stage [00:30:00].
LLMs for Relevance Prediction
- Model: A cross-encoder structure concatenates query and pin text, passing them to an LLM to get an embedding [00:31:09]. This embedding then feeds into an MLP layer to predict relevance on a five-point scale [00:31:30].
- Fine-tuning: Open-source LLMs are fine-tuned using Pinterest’s internal data [00:31:42].
- Results: LLMs substantially improved relevance prediction performance [00:32:10]. Larger, more advanced LLMs (e.g., 8 billion parameters) showed significant improvements (12% over multilingual BERT, 20% over in-house embedding model) [00:32:21].
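A compact sketch of such a cross-encoder (the distilbert-base-uncased checkpoint is a small stand-in for the fine-tuned open-source LLMs Pinterest describes; the pooling choice and head sizes are assumptions):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # illustrative stand-in model

class RelevanceCrossEncoder(nn.Module):
    def __init__(self, model_name: str = MODEL_NAME, n_levels: int = 5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # MLP head mapping the pooled embedding to a 5-point relevance scale.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_levels))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]          # first-token pooling
        return self.head(pooled)                      # logits over 5 relevance levels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = RelevanceCrossEncoder()

# The search query and the pin's text representation are concatenated into one input.
batch = tokenizer(["hiking boots [SEP] waterproof leather hiking boots for women"],
                  return_tensors="pt", truncation=True)
logits = model(**batch)
print(logits.softmax(-1))  # predicted distribution over the 5-point relevance scale
```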
Leveraging Multimodal Content Vision-language models (VLMs) generate captions from images and videos [00:32:54]. User actions (saves, clicks, searches) also provide valuable content annotations [00:32:56]. This diverse data helps create text representations of pins for LLM-based relevance prediction [00:33:06].
Efficiency in LLM Serving To achieve low latency (under 500ms) for LLM serving, Pinterest employs three levers:
- Specification: Optimizing attention scores for specific tasks [00:33:41].
- Smaller Models: Distilling larger models (e.g., 150 billion parameters) into smaller ones (8B, 3B, 1B) step-by-step to retain reasoning power [00:33:50]. Aggressive pruning can significantly reduce model quality [00:35:17].
- Quantization: Using lower precision (FP8) for activations and parameters, with mixed precision where the LLM head remains in FP32 for better calibration [00:36:03].
These optimizations resulted in a 7x reduction in latency and a 30x increase in throughput (queries per GPU) [00:37:37].
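A generic sketch of the distillation step behind the “smaller models” lever, using toy stand-in models (the temperature, loss, and optimizer choices are illustrative, not Pinterest's recipe): a student is trained to match the teacher's softened output distribution, and the step can be repeated stage by stage (e.g., 150B teacher to 8B, 8B to 3B, 3B to 1B) with each stage's student becoming the next stage's teacher.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature: float = 2.0):
    """One knowledge-distillation step: the student matches the teacher's
    softened relevance distribution."""
    with torch.no_grad():
        teacher_logits = teacher(batch)                       # soft targets
    student_logits = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stand-in models (real teachers/students are LLM relevance models).
teacher = torch.nn.Linear(16, 5)
student = torch.nn.Linear(16, 5)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
print(distill_step(student, teacher, torch.randn(32, 16), opt))
```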
Netflix’s Foundational Model for Personalization
Netflix aims to use one foundational model to cover all recommendation use cases, addressing diverse recommendation needs across various content types and pages [02:23:24]. Traditionally, this led to many specialized, independently built models with duplicative engineering and feature engineering [02:24:52].
Hypothesis:
- Scalability: By scaling up semi-supervised learning, personalization can be improved, much as scaling has improved LLMs [02:27:36].
- High Leverage: Integrating the foundation model into all systems can simultaneously improve downstream models [02:27:49].
Data and Training:
- Tokenization: Crucial decision for model quality [02:28:38]. Unlike LLMs, each event token in recommendations has multiple facets (when, where, what) [02:29:11].
- Event Representation: Deciding what information (time, location, device, canvas, entity, interaction type/duration) to keep or drop [02:30:31].
- Embedding Layer: Combines ID embedding learning with semantic content information to address the cold-start problem [02:31:21].
- Transformer Layer: Hidden state output is used as user representation [02:32:07]. Stability and aggregation methods are key considerations [02:32:26].
- Objective/Loss Function: Richer than LLMs, using multiple sequences and facets of events (action type, metadata, future action prediction) as targets [02:33:04]. This can be a multitask learning problem or used for weighting/masking [02:34:01].
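An illustrative sketch of multi-facet event tokenization (the facet names, bucketing, and the split between token IDs and side features are assumptions, not Netflix's schema): unlike an LLM word token, one interaction “token” carries several facets, and the tokenization step decides which facets to keep.

```python
from dataclasses import dataclass

@dataclass
class InteractionEvent:
    timestamp: int          # when
    device: str             # where / on what
    canvas: str             # which page or row the item was shown on
    entity_id: str          # what was interacted with
    action: str             # play, pause, thumbs-up, add-to-list, ...
    duration_s: int         # how long

def tokenize_event(e: InteractionEvent) -> dict:
    """One possible choice: keep some facets as categorical token IDs and
    feed the rest (e.g. time) as side features to the embedding layer."""
    return {
        "entity": e.entity_id,                 # ID embedding (+ semantic content embedding)
        "action": e.action,                    # target facet for the loss
        "context": (e.device, e.canvas),       # auxiliary embeddings, summed or concatenated
        "time_bucket": e.timestamp // 3600,    # coarse temporal feature
        "duration_bucket": min(e.duration_s // 300, 12),
    }

seq = [InteractionEvent(1_700_000_000, "tv", "home_row_3", "title_123", "play", 2_400)]
print([tokenize_event(e) for e in seq])
```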
Scaling Results: Netflix observed continuous gains in model quality over two and a half years by scaling data and model parameters (up to 1 billion parameters) [02:34:27]. While scaling can continue, stringent latency and cost requirements necessitate distillation back to smaller models [02:35:08].
Learnings from LLMs:
- Multi-token prediction: Forces the model to be less myopic, more robust to serving time shifts, and targets long-term user satisfaction [02:35:36].
- Multi-layer representation: Techniques like layer-wise supervision and self-distillation create better and more stable user representations [02:36:12].
- Long context window handling: Using truncated sliding windows, sparse attention, and progressive training for longer sequences improves efficiency [02:36:34].
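A toy sketch of how multi-token (next-k) prediction targets can be built from a watch sequence (the window size and data format are assumptions): predicting several future events instead of only the next one is what makes the model less myopic.

```python
def multi_token_targets(sequence: list[str], k: int = 3):
    """For each position, the targets are the next k events rather than just
    the next one, which discourages myopic predictions and is more robust to
    the gap between training time and serving time."""
    examples = []
    for i in range(len(sequence) - k):
        examples.append((sequence[: i + 1], sequence[i + 1 : i + 1 + k]))
    return examples

watches = ["title_a", "title_b", "title_c", "title_d", "title_e"]
for context, targets in multi_token_targets(watches, k=3):
    print(context, "->", targets)
```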
Serving and Applications: The foundation model consolidates the data and representation layers (user and content representation), making application models thinner [02:37:12].
- Integration as Subgraph: FM can be integrated as a pre-trained subgraph within downstream neural network models [02:37:48].
- Embedding Push-out: Content and member embeddings can be pushed to a centralized store, allowing wider use cases beyond personalization (e.g., analytics) [02:38:15].
- Fine-tuning/Distillation: Users can extract and fine-tune models for specific applications or distill them to meet online serving requirements [02:38:51].
Outcome: The foundational model has been incorporated into many applications, leading to significant AB test wins and infrastructure consolidation [02:39:12]. It’s a scalable solution that improves quality and accelerates innovation velocity [02:40:03].
Instacart’s Search and Discovery Enhancement with LLMs
Instacart, a leader in online grocery, focuses on search to help customers find both restocking items and discover new products [02:46:01]. New product discovery benefits customers, advertisers, and increases basket sizes [02:47:17].
Challenges with Conventional Search Engines:
- Broad Queries: Overly broad queries (e.g., “snacks”) map to many products but lack engagement data for proper ranking [02:47:48].
- Specific Queries: Very specific queries (e.g., “unsweetened plant-based yogurt”) occur infrequently, leading to insufficient engagement data [02:48:12].
- Precision vs. Recall: Traditional models improved recall but struggled with precision [02:48:37].
- Limited Discovery: Users found it difficult to discover related products after finding an initial item, requiring multiple searches [02:49:03].
Upleveling Query Understanding with LLMs: Instacart used LLMs to enhance its query understanding module, which includes query normalization, tagging, and classification [02:49:37].
Query-to-Category Classifier:
- Task: Map a query (e.g., “watermelon”) to relevant categories (e.g., “fruits”) within a taxonomy of ~10,000 labels [02:50:16]. This is a multi-label classification problem [02:50:40].
- Previous Models: FastText neural networks and NPMI (statistical co-occurrence) models worked for head/torso queries but had low coverage for tail queries due to lack of engagement data [02:50:46]. BERT-based models offered some improvement, but not enough to justify the increased latency [02:51:22].
- LLM Approach: Initially, LLMs alone produced decent but not optimal results in A/B tests because they didn’t understand Instacart-specific user behavior (e.g., “protein” meaning shakes/bars, not chicken/tofu) [02:51:36].
- Hybrid Approach: The prompt was augmented with Instacart domain knowledge, such as top-converting categories for each query and annotations from query understanding models (brands, dietary attributes) [02:52:30].
- Outcome: For tail queries, precision improved by 18 percentage points and recall by 70 percentage points [02:53:29]. A simple prompt with contextual information was highly effective [02:53:42].
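A hedged sketch of such a domain-augmented prompt (the wording, field names, and the build_category_prompt helper are assumptions, not Instacart's production prompt): the LLM sees Instacart-style signals (top converting categories, existing query-understanding annotations) instead of guessing from general world knowledge.

```python
def build_category_prompt(query: str,
                          top_converting_categories: list[str],
                          query_annotations: dict) -> str:
    """Illustrative hybrid prompt for multi-label query-to-category classification."""
    return (
        "You map grocery search queries to categories in our taxonomy.\n"
        f"Query: {query!r}\n"
        "Categories this query has converted against historically: "
        f"{', '.join(top_converting_categories) or 'none (tail query)'}\n"
        f"Known annotations from query understanding models: {query_annotations}\n"
        "Return up to 5 taxonomy categories as a JSON list, most relevant first."
    )

prompt = build_category_prompt(
    query="protein",
    top_converting_categories=["protein shakes", "protein bars"],
    query_annotations={"dietary": ["high-protein"], "brand": None},
)
print(prompt)
```

With the historical conversion signal included, the model is steered toward “protein shakes/bars” rather than the generic “chicken/tofu” reading of the query.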
Query Rewrites Model:
- Purpose: Essential for e-commerce, especially with varied retailer catalogs, to ensure results are returned even for specific queries [02:54:10] (e.g., “1% milk” to “milk”) [02:54:27].
- Previous Approach: Engagement data-trained models struggled with tail queries [02:54:37].
- LLM Approach: LLMs generated precise rewrites (substitute, broad, synonymous) [02:54:48].
- Outcome: Significant offline improvements through human evaluation and a large drop in “no results” queries online [02:55:16], benefiting the business by showing results where none existed before [02:55:47].
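A similar hedged sketch for rewrite generation, again with an assumed prompt format and a hypothetical JSON response shape covering the three rewrite types mentioned above:

```python
import json

def build_rewrite_prompt(query: str) -> str:
    """Illustrative prompt asking for substitute, broad, and synonymous rewrites."""
    return (
        "For the grocery search query below, produce JSON with three fields:\n"
        '  "substitute": a close replacement query,\n'
        '  "broad": a broader fallback query,\n'
        '  "synonymous": an equivalent phrasing.\n'
        f"Query: {query!r}"
    )

def parse_rewrites(llm_output: str) -> dict:
    # e.g. for "1% milk" an LLM might return:
    # {"substitute": "2% milk", "broad": "milk", "synonymous": "one percent milk"}
    return json.loads(llm_output)

print(build_rewrite_prompt("1% milk"))
```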
Scoring and Serving: Instacart precomputes outputs for head and torso queries offline in batch mode and caches them for low-latency online serving [02:56:06]. For the long tail, existing models are currently used as a fallback, with plans to replace them with distilled LLM models [02:56:49].
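A minimal sketch of this precompute-and-cache serving pattern with a fallback model (function and field names are illustrative):

```python
def serve_query_understanding(query: str, cache: dict, fallback_model) -> dict:
    """Head/torso queries are precomputed offline in batch with the LLM and
    cached; tail queries fall back to the existing lightweight models
    (to be replaced later by a distilled LLM)."""
    cached = cache.get(query.strip().lower())
    if cached is not None:
        return cached                      # low-latency lookup, no LLM call online
    return fallback_model(query)           # existing model handles the long tail

# Toy usage
cache = {"watermelon": {"categories": ["fruits"], "rewrites": ["melon"]}}
fallback = lambda q: {"categories": [], "rewrites": [q.split()[-1]]}
print(serve_query_understanding("Watermelon", cache, fallback))
print(serve_query_understanding("unsweetened plant-based yogurt", cache, fallback))
```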
Future Directions for Query Understanding:
- Consolidation: Consolidating multiple query understanding models into a single LLM to improve consistency and simplify management [02:57:21] (e.g., fixing “hum” vs. “hummus” error) [02:57:41].
- Contextual Understanding: Using LLMs to understand the customer’s broader mission (e.g., buying ingredients for a recipe) [02:58:13].
Discovery-Oriented Content in Search Results: Users often found search results pages to be “dead ends” after adding an item to the cart [02:58:43]. LLMs were used to generate substitute and complementary results directly on the page [02:59:17].
- Substitute Results: For queries with no exact matches (e.g., “swordfish”), LLMs generate alternatives (e.g., “other seafood alternatives”) [02:59:21].
- Complementary Results: For queries with many exact matches (e.g., “sushi”), LLMs suggest related items (e.g., “Asian cooking ingredients”) at the bottom of the page [02:59:31].
- Outcome: Both led to improvements in engagement and revenue per search [02:59:51].
Generation Requirements and Techniques:
- Incrementality: Generate content incremental to existing solutions [03:00:08].
- Domain Alignment: LLM answers must align with Instacart’s domain knowledge (e.g., “dishes” meaning cookware, not food, unless specified) [03:00:14].
- Prompt Augmentation: Initially, LLM-generated common sense answers didn’t align with user behavior [03:01:11]. Augmenting prompts with Instacart domain knowledge (top converting categories, query annotations, subsequent user queries) significantly improved results [03:01:41].
Serving Discovery Content: Similar to query understanding, Instacart precomputes LLM outputs for historical search logs in batch mode, storing query content metadata and potential products [03:02:30]. Online serving is a quick lookup from a feature store, maintaining low latency [03:02:52].
Key Challenges Faced:
- Aligning with Business Metrics: Iterating on prompts and metadata to ensure LLM generations drive top-line wins like revenue [03:03:13].
- Ranking Improvement: Traditional models failed; strategies like diversity-based re-ranking were needed for user engagement [03:03:28].
- Content Evaluation: Ensuring LLM outputs are accurate (not hallucinating) and adhere to product needs, often using an LLM as a judge [03:04:21].
YouTube’s Large Recommender Model (LRM)
YouTube, one of the largest consumer apps globally, relies heavily on recommendation systems for watch time across various surfaces (home, watch next, shorts, search) [03:09:01]. Google has built a new recommendation system using Gemini to recommend YouTube videos [03:07:08].
Adapting Gemini for YouTube Recommendations (LRM):
- Foundation: Starts with a base Gemini checkpoint [03:11:36].
- Continued Pre-training: Teaches the model information about YouTube to create a unified, YouTube-specific checkpoint called LRM [03:11:39].
- Alignment: Aligns LRM for specific recommendation tasks like retrieval and ranking, creating custom versions for major surfaces [03:11:51]. LRM is in production for retrieval and being experimented with for ranking [03:12:06].
Semantic ID for Videos:
- Challenge: Need to tokenize videos to allow LLMs to reason over many videos [03:12:24].
- Process:
- Extract features (title, description, transcript, audio/video frame data) from a video [03:13:21].
- Create a multi-dimensional embedding [03:13:33].
- Quantize using RQ-VAE (a residual-quantized variational autoencoder) to assign a unique semantic ID (SID) token sequence to each video [03:13:36].
- Outcome: The entire corpus of billions of YouTube videos is organized around semantically meaningful tokens, creating a “new language of YouTube videos” [03:13:55]. This is a move away from hash-based tokenization [03:14:25].
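A toy illustration of the residual-quantization idea behind RQ-VAE (codebooks here are random rather than learned jointly with an encoder/decoder, and sizes are arbitrary), showing how one embedding becomes a short tuple of SID tokens:

```python
import numpy as np

def residual_quantize(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """At each level, pick the nearest codebook entry, subtract it, and quantize
    the residual at the next level; the resulting tuple of codes is the SID."""
    codes, residual = [], embedding.copy()
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(dists.argmin())
        codes.append(idx)
        residual = residual - codebook[idx]
    return codes

rng = np.random.default_rng(0)
dim, levels, codebook_size = 64, 3, 256
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(levels)]
video_embedding = rng.normal(size=dim)     # stand-in for the multimodal video embedding
sid = residual_quantize(video_embedding, codebooks)
print(sid)  # e.g. [137, 42, 201] -> tokens like SID_137, SID_42, SID_201
```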
Continued Pre-training of LRM:
- Linking Text and SID: Training tasks connect English text with video tokens (SIDs) [03:14:45] (e.g., “This video has title XYZ” followed by the model outputting the title) [03:15:00].
- Understanding Watch Sequences: Using YouTube engagement data (user paths through videos), the model is prompted with sequences of watched videos (e.g., ABCD) and learns to predict masked videos [03:15:26]. This teaches it relationships between videos based on user engagement [03:15:46].
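An illustrative construction of the two training-task types described above, with assumed prompt formats and SID token spelling (the real training mixture and templates are not public):

```python
import random

def text_sid_linking_example(video: dict) -> dict:
    # Teach the model to connect English text with SID tokens, e.g.
    # prompt: "Video <SID_137><SID_42><SID_201> has the title:" -> target: the title.
    sid = "".join(f"<SID_{c}>" for c in video["sid"])
    return {"prompt": f"Video {sid} has the title:", "target": video["title"]}

def masked_watch_sequence_example(watch_sids: list[str], mask_token="<MASK>") -> dict:
    # Teach relationships implied by engagement: mask one video in a watch
    # sequence (A B C D) and predict it from the others.
    i = random.randrange(len(watch_sids))
    masked = watch_sids.copy()
    target = masked[i]
    masked[i] = mask_token
    return {"prompt": "User watched: " + " ".join(masked), "target": target}

video = {"sid": [137, 42, 201], "title": "How to make sourdough"}
print(text_sid_linking_example(video))
print(masked_watch_sequence_example(["<SID_1>", "<SID_9>", "<SID_3>", "<SID_7>"]))
```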
Generative Retrieval with LRM:
- Process: A prompt is constructed for each user including demographic information, context video, and watch history [03:17:05]. The model then decodes video recommendations as SIDs [03:17:37].
- Outcome: Generates unique and interesting recommendations, particularly for difficult recommendation tasks or users with limited data [03:17:41].
Challenges for YouTube LRM:
- Cost: LRM is powerful but expensive to serve [03:18:36]. YouTube achieved over 95% cost savings to enable production launch [03:19:01].
- Vocabulary Size: Billions of videos (20 billion, millions added daily) vs. 100,000 words in LLM English vocabulary [03:20:24].
- Freshness: Unlike LLMs, recommendation systems demand real-time freshness (minutes/hours) for new content [03:20:51]. LRM requires continuous pre-training on a daily/hourly basis [03:21:35].
- Scale: Must use smaller, more efficient models (Flash and smaller checkpoints) to meet latency and scale requirements for billions of daily active users [03:21:52].
Recipe for LLM-based Recommendation System:
- Tokenize Content: Create a domain-specific language by building rich embeddings from features and quantizing them into atomic tokens [03:22:28].
- Adapt LLM: Link English and the new domain language through training tasks that enable reasoning across both [03:22:56], resulting in a bilingual LLM [03:23:10].
- Prompt with User Information: Construct personalized prompts with user demographics, activity, and actions, then train task-specific models to create a generative recommendation system [03:23:21].
Future of AI in Recommendations
Currently, LLMs primarily augment recommendations, enhancing quality invisibly to users [03:23:56]. Because users don’t directly see what’s happening, this use of AI in recommendation experiences remains “underhyped” [03:24:16].
Upcoming Developments:
- Interactive Experiences: Users will be able to talk to bilingual LLMs in natural language to steer recommendations towards their goals [03:24:26].
- Explainable Recommendations: Recommenders will explain why a candidate was recommended [03:24:42].
- Blurring Search and Recommendations: The lines between search and recommendation will increasingly blur [03:24:52].
- Generative Content: Recommendations may evolve to include personalized versions of content, or even content creation itself, leading to “n-of-1” content generated for individual users [03:25:01]. This represents a significant shift in how users will interact with AI-driven recommendation interfaces.
Learnings:
- Semantic IDs: Solve cold-start and sparsity by encoding content.
- Data Augmentation: LLMs generate rich, high-quality synthetic data for tail queries and items at lower cost.
- Unified Models: Simplify systems, reduce maintenance, and accelerate innovation across multiple recommendation tasks.