From: aidotengineer
Challenges and Advancements in LLM-Based Recommendation Models
The integration of Large Language Models (LLMs) with recommendation systems (RecSys) marks a significant shift in how personalized experiences are delivered. This evolving field addresses long-standing challenges in RecSys through novel applications of LLM capabilities [00:02:49]. Key areas of focus include semantic IDs, data augmentation, and unified models [00:03:55].
Semantic IDs: Understanding Content Beyond Hashes
Traditionally, recommendation systems rely on hash-based item IDs, which do not inherently encode content information [00:04:11]. This leads to several problems:
- Cold-start problem: New items require relearning their properties [00:04:17].
- Sparsity: Long-tail items with few interactions are difficult to learn from [00:04:26].
- Popularity bias: Because cold-start and sparse long-tail items are hard to learn, systems end up favoring already-popular items [00:04:32].
Advancement: Semantic IDs offer a solution by encoding item content, potentially involving multimodal data [00:04:39].
Kuaishou’s Multimodal Semantic IDs
Kuaishou, a short video platform, faced the challenge of learning from hundreds of millions of daily video uploads [00:05:03]. Their approach involved trainable multimodal semantic IDs:
- They combine static content embeddings (visual from ResNet, video descriptions from BERT, audio from VGGish) with dynamic user behavior [00:05:08].
- Content embeddings are concatenated [00:06:11].
- Cluster IDs are learned via k-means clustering (e.g., 1,000 clusters from 100 million videos) [00:06:22].
- A trainable encoder learns to map the content space, via the cluster IDs and an embedding table, into the behavioral space [00:06:48] (a minimal sketch of this pipeline follows below).
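A minimal sketch of this pipeline in PyTorch, assuming pre-extracted content embeddings (ResNet/BERT/VGGish stand-ins) and offline k-means cluster IDs; all names and dimensions here are illustrative, not Kuaishou's code:

```python
# Sketch: trainable multimodal semantic IDs mapping content into behavior space.
import torch
import torch.nn as nn

NUM_CLUSTERS, CONTENT_DIM, BEHAVIOR_DIM = 1000, 768 + 512 + 128, 64

class SemanticIDEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Embedding table keyed by k-means cluster ID (the "semantic ID").
        self.cluster_emb = nn.Embedding(NUM_CLUSTERS, BEHAVIOR_DIM)
        # Encoder that maps concatenated content features into behavior space.
        self.encoder = nn.Sequential(
            nn.Linear(CONTENT_DIM, 256), nn.ReLU(), nn.Linear(256, BEHAVIOR_DIM)
        )

    def forward(self, content_feats, cluster_ids):
        # The projected content embedding is fused with the cluster embedding,
        # so the resulting ID carries content semantics and behavioral signal.
        return self.encoder(content_feats) + self.cluster_emb(cluster_ids)

# Toy usage: visual + text + audio embeddings concatenated, plus cluster IDs
# from offline k-means over the catalog (random placeholders here).
feats = torch.randn(4, CONTENT_DIM)
ids = torch.randint(0, NUM_CLUSTERS, (4,))
print(SemanticIDEncoder()(feats, ids).shape)  # torch.Size([4, 64])
```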
Outcome: These semantic IDs not only outperform hash-based IDs on clicks and likes but also significantly increase cold-start coverage (by 3.6%) and cold-start velocity [00:07:07]. Because the IDs encode content, the system can understand what it recommends and even provide human-readable explanations [00:07:54].
Data Augmentation and Efficiency: Leveraging LLMs for Quality Data
High-quality data at scale is crucial for search and recommendation systems [00:08:12].
- Traditional methods like human annotations are costly and high-effort [00:08:31].
- Implicit feedback is often imprecise [00:09:33].
Advancement: LLMs excel at generating synthetic data and labels, offering a solution to these data challenges [00:08:37].
Indeed’s Lightweight Classifier for Job Recommendations
Indeed faced the issue of poor job recommendations leading to user dissatisfaction and unsubscribes [00:09:03].
- Challenge: Explicit negative feedback was sparse [00:09:28].
- Solution: A lightweight classifier to filter bad recommendations [00:09:50].
- Process:
- Experts labeled job recommendations and user pairs [00:10:07].
- Prompting open LLMs (Mistral, LLaMA 2) yielded poor performance [00:10:24].
- GPT-4 performed well (90% precision and recall) but was costly and too slow (22 seconds) [00:10:43].
- GPT-3.5 had very poor precision (63%) [00:11:00].
- Fine-tuning GPT-3.5 achieved the desired precision but was still too slow for online filtering (6.7 seconds) [00:11:39].
- Finally, they distilled a lightweight classifier from the fine-tuned GPT-3.5's labels, achieving high performance (0.86 AUC-ROC) and real-time suitability [00:11:53] (see the sketch after this list).
- Outcome: Reduced bad recommendations by 20%, increased application rate by 4%, and decreased unsubscribe rate by 5% [00:12:20]. This highlighted that quality over quantity in recommendations makes a significant difference [00:12:57].
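A minimal sketch of the distillation pattern, assuming offline LLM labels and scikit-learn; the labeler, features, and threshold are placeholders, not Indeed's pipeline:

```python
# Sketch: distill LLM judgments into a lightweight online filtering classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def llm_label_fn(features: np.ndarray) -> int:
    """Hypothetical offline LLM labeler: 1 = bad recommendation, 0 = good."""
    return int(features[0] + 0.1 * rng.normal() < 0.3)  # placeholder logic

# 1) Offline: label historical (user, job) feature vectors with the LLM.
X = rng.random((5000, 8))
y = np.array([llm_label_fn(x) for x in X])

# 2) Train a small model that is cheap enough for real-time filtering.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("AUC-ROC:", round(roc_auc_score(y, clf.predict_proba(X)[:, 1]), 3))

# 3) Online: drop a recommendation if the predicted "bad" probability is high.
def keep_recommendation(features: np.ndarray, threshold: float = 0.5) -> bool:
    return clf.predict_proba(features.reshape(1, -1))[0, 1] < threshold
```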
Spotify’s Query Recommendation System
Spotify aimed to grow new content categories like podcasts and audiobooks, facing a cold-start problem not just on items but on categories [00:13:34].
- Solution: A query recommendation system [00:13:53].
- Process:
- Traditional techniques extracted ideas from catalog/playlist titles and search logs [00:14:03].
- LLMs were used to generate natural-language queries, augmenting the existing methods [00:14:20] (a minimal sketch follows the outcome below).
- Outcome: +9% in exploratory queries, meaning one-tenth of users were exploring new products daily, significantly accelerating new product category growth [00:15:14].
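A minimal sketch of the augmentation idea, with a hypothetical `call_llm` wrapper standing in for whatever completion API is used; this is not Spotify's implementation:

```python
# Sketch: mix traditionally mined query candidates with LLM-generated ones.
def call_llm(prompt: str) -> list[str]:
    # Placeholder: in practice this would call an LLM and parse its output.
    return ["true crime podcasts for long commutes", "sci-fi audiobooks like Dune"]

def exploratory_query_candidates(user_interests: list[str],
                                 catalog_titles: list[str]) -> list[str]:
    # Traditional sources: catalog/playlist titles and mined search logs.
    traditional = [t for t in catalog_titles
                   if any(i in t.lower() for i in user_interests)]
    # LLM augmentation: natural-language queries pointing at new categories.
    prompt = ("Suggest short search queries for podcasts or audiobooks that a "
              f"listener interested in {', '.join(user_interests)} might enjoy.")
    return list(dict.fromkeys(traditional + call_llm(prompt)))  # dedupe, keep order

print(exploratory_query_candidates(["science fiction"],
                                   ["Science Fiction Hits", "Jazz Classics"]))
```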
Benefit: LLM-augmented synthetic data provides richer, high-quality data at scale, even for tail queries and items, at a much lower cost than human annotation [00:15:35].
Unified Models: Streamlining Recommendation Systems
Traditionally, companies maintain separate, bespoke systems for ads, recommendations, and search, and even within recommendations different models for different surfaces (e.g., homepage, item page) [00:16:10].
- Challenge: Duplicative engineering pipelines, high maintenance costs, and improvements in one model not transferring to others [00:16:36].
Advancement: Unified models, inspired by vision and language domains, address these inefficiencies [00:16:47].
Netflix’s Unified Contextual Ranker (Unicorn)
Netflix faced high operational costs due to bespoke models for search, similar item recommendations, and pre-query recommendations [00:17:18].
- Solution: A unified contextual ranker (Unicorn) [00:17:34].
- Process: It uses a user foundation model plus a context and relevance model over a unified input (user ID, item ID, search query, country, task) [00:17:39]. Missing fields are handled by smart imputation; for instance, when there is no search query, the current item's title is used in its place [00:18:27] (see the sketch after this list).
- Outcome: The unified model matched or exceeded the metrics of specialized models on multiple tasks, leading to infrastructure consolidation and faster iteration [00:18:50].
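A minimal sketch of a unified input schema with simple imputation; field names are assumptions, not Netflix's actual schema:

```python
# Sketch: one ranker input schema shared across search, similar-item, and
# pre-query tasks, with imputation of missing fields.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedInput:
    user_id: str
    item_id: Optional[str]
    search_query: Optional[str]
    country: str
    task: str  # "search" | "similar_items" | "pre_query"

def impute(x: UnifiedInput, item_titles: dict[str, str]) -> UnifiedInput:
    # Similar-item requests have no query: reuse the current item's title.
    if x.search_query is None and x.item_id is not None:
        x.search_query = item_titles.get(x.item_id, "")
    return x

example = UnifiedInput(user_id="u1", item_id="m42", search_query=None,
                       country="US", task="similar_items")
print(impute(example, {"m42": "Stranger Things"}))
```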
Etsy’s Unified Embeddings
Etsy aimed to provide better results for both specific and broad queries despite its constantly changing inventory [00:19:24].
- Challenge: Lexical embeddings didn’t account for user preferences [00:19:53].
- Solution: Unified embedding and retrieval [00:20:03].
- Process: A product encoder (using T5 for text embeddings and query-product logs) and a query encoder share token, product category, and user location encoders [00:20:18]. User preferences are encoded via query-user scale effect features [00:20:59], and a “quality vector” (ratings, freshness, conversion rate) is concatenated to the product embedding [00:21:22] (see the sketch after this list).
- Outcome: 2.6% increase in conversion across the entire site and over 5% increase in search purchases [00:21:52].
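A minimal two-tower sketch in PyTorch of the unified-embedding idea, with the quality vector concatenated to the product embedding; tower shapes and dimensions are illustrative, not Etsy's:

```python
# Sketch: query/product towers plus a concatenated quality vector at scoring time.
import torch
import torch.nn as nn

DIM = 32

class Tower(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, DIM))
    def forward(self, x):
        return self.net(x)

query_tower, product_tower = Tower(in_dim=48), Tower(in_dim=48)

def score(query_feats, product_feats, quality_vec, query_quality_weights):
    q = query_tower(query_feats)                                        # query embedding
    p = torch.cat([product_tower(product_feats), quality_vec], dim=-1)  # + quality vector
    q = torch.cat([q, query_quality_weights], dim=-1)                   # match dimensions
    return (q * p).sum(-1)                                              # dot-product score

# Toy usage: 3 candidates for one query; the quality vector has 3 scalars
# (e.g., rating, freshness, conversion rate); weights of 1.0 simply add them in.
scores = score(torch.randn(3, 48), torch.randn(3, 48),
               torch.rand(3, 3), torch.ones(3, 3))
print(scores.shape)  # torch.Size([3])
```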
Benefits: Unified models simplify systems, and improvements in one part of the model transfer to other use cases [00:22:12]. However, misalignment can occur, potentially requiring multiple unified models [00:22:25].
LLM-Based Search Relevance and Efficiency: Pinterest’s Approach
Pinterest, handling over six billion searches monthly across billions of pins and 45+ languages, uses LLMs for semantic relevance modeling in the reranking stage [00:29:04].
Advancements:
- Relevance Prediction: A cross-encoder concatenates the query and pin text and passes them through an LLM; the resulting embedding is fed into an MLP head that predicts one of five relevance levels [00:31:11] (see the sketch below). Fine-tuning open-source LLMs with Pinterest data substantially improves performance, with larger models yielding greater gains (e.g., a 12% improvement over multilingual BERT) [00:32:17].
- Content Annotations: Vision-language model-generated captions and user actions provide useful content annotations for text representation of pins [00:32:54].
- Efficiency: To address latency (e.g., 400-500 ms) and throughput requirements [00:33:26], Pinterest employs:
- Distillation: Start with a larger model (e.g., 150 billion parameters) and gradually distill it into smaller models (e.g., 8B, 3B, 1B) [00:34:08]. This step-by-step approach is more effective than direct distillation [00:34:17].
- Pruning: Gradually reduce redundancy in transformer models (e.g., number of heads, MLPs) [00:34:55]. Aggressive pruning can lead to significant quality reduction [00:35:20].
- Quantization: Use lower precision (e.g., FP8) for activations and parameters, but maintain FP32 for the LM head to preserve prediction precision and calibration [00:36:03].
- Sparsification: Sparsify attention scores so that not every item attends to every other item, especially among recommended items in a sequence [00:37:03].
Outcome: These optimizations resulted in a 7x reduction in latency and a 30x increase in throughput (queries per GPU) [00:37:51].
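A minimal sketch of the cross-encoder relevance model, using a small open checkpoint from Hugging Face transformers as a stand-in for Pinterest's fine-tuned LLM; the model name, pooling choice, and untrained head are assumptions:

```python
# Sketch: cross-encoder over (query, pin text) with a 5-way relevance head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL = "distilbert-base-multilingual-cased"  # stand-in for a fine-tuned LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)
relevance_head = nn.Linear(encoder.config.hidden_size, 5)  # 5 relevance levels

def predict_relevance(query: str, pin_text: str) -> torch.Tensor:
    # Cross-encoder: query and pin text are encoded together in one pass.
    inputs = tokenizer(query, pin_text, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state[:, 0]  # [CLS]-style pooling
    return relevance_head(hidden).softmax(-1)           # distribution over levels

print(predict_relevance("mid-century living room", "Walnut credenza styling ideas"))
```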
Netflix’s Foundational Model for Personalization
Netflix has made a “big bet” on using one foundational model to cover all recommendation use cases [02:23:26].
Challenges of Traditional Systems:
- Diverse Needs: Netflix has varied recommendation requirements across rows (genres, new/trending, Netflix originals), items (movies, TV shows, games, live streaming), and pages (homepage, search, kids’ homepage, mobile feed) [02:23:48].
- Many Specialized Models: Historically, this led to numerous independently built models, each with different objectives and significant overlap, resulting in duplications in feature/label engineering [02:24:52].
- Scalability: Spinning up new models for each use case is unsustainable and hinders innovation velocity [02:26:32].
Advancement: Centralizing user representation learning through a transformer-based foundation model [02:27:19].
Hypotheses:
- Scaling Law: Scaling up semi-supervised learning can improve personalization, akin to LLM scaling laws [02:27:36].
- High Leverage: Integrating the foundation model into all systems can simultaneously improve downstream, canvas-facing models [02:27:49].
Data and Training Considerations:
- Tokenization: Crucial decision for model quality [02:28:44]. Unlike language tokens, each “token” in a recommendation context is an interaction event with many facets (when, where, what) [02:29:21].
- Event Representation: Break each event down by time, location, device, canvas, target entity, and interaction details [02:30:31] (see the sketch after this list).
- Embedding Feature Transformation: Combine ID embedding learning with semantic content information to address the cold-start problem (not an issue for LLMs but critical for RecSys) [02:31:28].
- Transformer Layer: Hidden state output serves as a long-term user representation. Stability and aggregation methods (across time/layers) are key [02:32:10].
- Objective/Loss Function: Richer than for LLMs, since the output can be represented by multiple sequences (entity IDs, action types, metadata, duration, next play time) [02:33:06]. This allows multitask learning or using some targets as weights/rewards [02:34:01].
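A minimal sketch of interaction-event tokenization; the field names are assumptions, not Netflix's schema:

```python
# Sketch: each interaction event becomes one structured, multi-facet "token";
# a user's sorted history is the sequence the transformer consumes.
from dataclasses import dataclass, asdict

@dataclass
class InteractionEvent:
    timestamp: int          # when
    country: str            # where
    device: str
    canvas: str             # e.g., "homepage", "search"
    entity_id: str          # what was acted on (movie, show, game)
    action: str             # e.g., "play", "thumbs_up"
    duration_s: int

def tokenize_history(events: list[InteractionEvent]) -> list[dict]:
    # Each facet gets its own learned embedding downstream; semantic content
    # features can be added per entity to handle cold-start items.
    return [asdict(e) for e in sorted(events, key=lambda e: e.timestamp)]

history = [
    InteractionEvent(1700000000, "US", "tv", "homepage", "show_123", "play", 2400),
    InteractionEvent(1700005000, "US", "mobile", "search", "movie_456", "thumbs_up", 0),
]
print(tokenize_history(history)[0]["canvas"])  # "homepage"
```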
Scaling Results: The foundation model demonstrated continuous gains over two-and-a-half years, scaling from millions of profiles to billions of parameters [02:34:38]. Latency/cost requirements necessitate distillation for larger models [02:35:08].
Learnings from LLMs:
- Multi-token prediction: Forces the model to be less myopic, more robust to serving time shifts, and targets long-term user satisfaction [02:35:36].
- Multi-layer representation: Techniques like layer-wise supervision and self-distillation lead to better, more stable user representations [02:36:12].
- Long context window handling: Utilizes techniques like truncated sliding windows, sparse attention, and progressive training [02:36:34].
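As one concrete example of the long-context techniques above, a minimal sketch of a truncated sliding-window attention mask; this is an assumption about how such a mask could look, not Netflix's implementation:

```python
# Sketch: causal attention restricted to the most recent `window` events.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """1 where attention is allowed (causal, within `window`), 0 elsewhere."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return ((j <= i) & (j > i - window)).astype(np.int8)

print(sliding_window_mask(seq_len=6, window=3))
```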
Application and Serving:
- The foundation model (FM) consolidates data and representation layers (user and content) [02:37:15].
- Application models become thinner layers built on top of FM [02:37:30].
- Consumption Patterns:
- FM integrated as a subgraph within downstream models [02:37:48].
- Push out embeddings (content and member) to a centralized store for wider use [02:38:17].
- Extract and fine-tune/distill models for specific applications to meet online serving requirements [02:38:55].
Wins: The FM's high leverage yields A/B test gains and infrastructure consolidation, validating the big bet. It provides a scalable solution, higher leverage, and faster innovation velocity [02:39:12].
Current Directions:
- Universal representation for heterogeneous entities (semantic ID) [02:40:23].
- Generative retrieval for collection recommendation (multi-step decoding) [02:40:40].
- Faster adaptation through prompt tuning (training soft tokens) [02:40:59].
Instacart’s LLM-Powered Search and Discovery
Instacart uses LLMs to transform search and discovery for grocery e-commerce, where customers have long shopping lists and seek both restocking and new product discovery [02:46:17].
Challenges of Conventional Search Engines:
- Overly Broad Queries: E.g., “snacks,” where engagement data is hard to collect for long-tail products [02:47:48].
- Very Specific Queries: E.g., “unsweetened plant-based yogurt,” where sparse data limits model training [02:48:12].
- New Item Discovery: Difficulty in surfacing related products or enabling exploratory browsing, leading to dead ends [02:48:43].
Advancements: LLMs enhance query understanding and enable discovery-oriented content.
Query Understanding: Enhancing Accuracy
Instacart’s query understanding module includes models for normalization, tagging, and classification [02:49:59].
- Query-to-Category Classifier: Maps a query to relevant categories (multilabel problem, 10,000 labels) [02:50:16].
- Traditional Challenge: FastText and NPMI models struggled with low coverage for tail queries due to lack of engagement data [02:51:15].
- LLM Approach: Initially, raw LLM predictions were decent but mismatched Instacart user behavior (e.g., for the query “protein”, the LLM suggested protein-rich foods such as chicken, while users typically meant protein shakes) [02:51:38].
- Hybrid Approach: Augmented the prompt with Instacart’s domain knowledge, such as the query’s top converting categories and annotations like brands or dietary attributes [02:52:34] (see the prompt sketch after this list).
- Outcome: Significant improvements for tail queries: 18 percentage points increase in precision, 70 percentage points increase in recall [02:53:29].
- Query Rewrites Model: Generates broader or synonymous queries to ensure results even with small retailer catalogs [02:54:10].
- LLM Approach: Similar to category classification, LLMs generate precise rewrites (substitute, broad, synonymous) [02:54:51].
- Outcome: Offline improvements in human evaluation and a significant drop in queries without any results online [02:55:19].
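A minimal sketch of a domain-knowledge-augmented classification prompt; the wording and fields are illustrative, not Instacart's actual prompt:

```python
# Sketch: augment an LLM category-classification prompt with domain signals
# such as the query's top converting categories and known annotations.
def build_category_prompt(query: str,
                          top_converting: list[str],
                          annotations: dict[str, str],
                          taxonomy_sample: list[str]) -> str:
    return (
        "You classify grocery search queries into catalog categories.\n"
        f"Query: {query}\n"
        f"Known annotations: {annotations}\n"
        f"Categories this query historically converts in: {top_converting}\n"
        f"Choose the most relevant categories from: {taxonomy_sample}\n"
        "Return a comma-separated list of categories."
    )

print(build_category_prompt(
    query="protein",
    top_converting=["Protein Shakes", "Protein Bars"],
    annotations={"dietary": "high-protein"},
    taxonomy_sample=["Protein Shakes", "Protein Bars", "Chicken", "Tofu"],
))
```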
Serving Strategy: Precompute outputs for head and torso queries offline in batch mode and cache them for low-latency serving. Fall back to existing models (or distilled LLMs) for the long tail [02:56:00].
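A minimal sketch of this serving pattern, with a placeholder fallback model; this illustrates the pattern, not Instacart's stack:

```python
# Sketch: cached offline LLM outputs for head/torso queries, fallback for tail.
precomputed_cache = {  # filled by an offline batch job over historical queries
    "snacks": ["Chips", "Crackers", "Trail Mix"],
}

def fallback_model(query: str) -> list[str]:
    # Stand-in for the existing production model or a distilled LLM.
    return ["General Grocery"]

def categories_for(query: str) -> list[str]:
    key = query.strip().lower()
    return precomputed_cache.get(key, fallback_model(key))

print(categories_for("Snacks"))                           # served from cache
print(categories_for("unsweetened plant-based yogurt"))   # falls back
```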
Future Direction: Consolidate multiple query understanding models into a single LLM for consistency and pass extra context (e.g., user mission like recipe ingredients) [02:57:30].
Discovery-Oriented Content: Enhancing User Experience
Instacart aimed to make search results pages more useful after an item is added to the cart, by showing complementary or substitute items [02:58:43].
- Traditional Challenge: Required extensive feature engineering or manual work [02:59:08].
- LLM Approach: Instruct LLM to act as an AI assistant generating complementary and substitute shopping lists [03:00:40].
- Initial common-sense LLM answers didn’t align with user behavior (e.g., for “protein”, suggesting chicken rather than the protein bars users actually buy) [03:01:17].
- Improved Approach: Augmented prompts with Instacart domain knowledge (top converting categories, query annotations, subsequent user queries) [03:01:41].
- Outcome: Significantly improved engagement and revenue per search [02:59:53].
Serving Strategy: Precompute LLM-generated content for historical search logs in batch mode and store it for quick lookup [03:02:29].
Key Challenges in Implementation:
- Aligning generation with business metrics like revenue [03:03:16].
- Improving ranking of generated content, often requiring diversity-based reranking [03:03:28].
- Evaluating LLM-generated content for correctness (no hallucination) and adherence to product needs (using LLM as a judge) [03:03:45].
Overall Takeaways: LLM world knowledge improves query understanding for tail queries. Success comes from combining domain knowledge with LLMs. Robust evaluation is crucial [03:04:02].
YouTube’s Large Recommender Model (LRM)
YouTube aims to transform its recommendation system by adapting Gemini, a large language model, to recommend videos. Over 70% of YouTube’s watch time is driven by its recommendation system [03:09:08].
Problem: Learning a function that takes user and context as input to provide recommendations [03:09:44].
Advancement: The Large Recommender Model (LRM) adapts Gemini for YouTube recommendations [03:11:43].
Recipe for LLM-Based RecSys:
- Tokenize Content (Semantic ID):
- Goal: Create atomic units for a new “language” of YouTube videos [03:13:48].
- Process: Extract features (title, description, transcript, audio, video frames) from a video, combine them into a multi-dimensional embedding, then quantize with an RQ-VAE to assign each video a semantic ID (SID) [03:13:21] (a sketch of residual quantization appears after this recipe).
- Outcome: A semantically meaningful tokenization where related videos share common prefixes (e.g., sports → volleyball) [03:14:04].
- Adapt the LLM (Continued Pre-training):
- Goal: Teach the LLM to understand both English and this new YouTube language [03:14:39].
- Tasks:
- Linking Text and SID: Prompt the model with a video’s SID and ask it to output its title, creator, or topics [03:14:48].
- Understanding Watch Sequences: Prompt with a user’s watch history (video SIDs), mask some of them, and train the model to predict the masked SIDs, learning relationships between videos from user engagement [03:15:26] (see the training-example sketch after this recipe).
- Outcome: A “bilingual” LLM that can reason across English and YouTube videos [03:16:00].
- Prompt with User Information (Generative Retrieval):
- Goal: Generate personalized recommendations [03:17:03].
- Process: Construct a prompt with user demographics, context video, and watch history [03:17:12].
- Outcome: Yields unique and interesting recommendations, especially for hard cases (e.g., finding women’s track races related to men’s Olympic highlights) [03:17:45].
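A minimal sketch of residual quantization with random codebooks; in an actual RQ-VAE the codebooks are learned jointly with an encoder/decoder, so this is an illustration of the idea, not YouTube's implementation:

```python
# Sketch: residual quantization of a video embedding into a coarse-to-fine SID.
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOK_SIZE, LEVELS = 16, 256, 3
codebooks = rng.normal(size=(LEVELS, CODEBOOK_SIZE, DIM))

def semantic_id(video_embedding: np.ndarray) -> list[int]:
    residual, sid = video_embedding.copy(), []
    for level in range(LEVELS):
        # Pick the codeword closest to the current residual, then subtract it,
        # so earlier tokens capture coarse topics and later ones fine detail.
        idx = int(np.argmin(np.linalg.norm(codebooks[level] - residual, axis=1)))
        sid.append(idx)
        residual -= codebooks[level][idx]
    return sid

print(semantic_id(rng.normal(size=DIM)))  # a 3-level coarse-to-fine code
```

And a minimal sketch of how the two continued pre-training examples might be constructed; the formats are assumptions, not YouTube's data pipeline:

```python
# Sketch: (1) link a semantic ID to its title; (2) masked watch-sequence prediction.
MASK = "<MASK>"

def sid_to_text_example(sid: list[int], title: str) -> dict:
    return {"prompt": f"Video {sid}: what is its title?", "target": title}

def masked_watch_example(watch_sids: list[list[int]], mask_pos: int) -> dict:
    masked = [MASK if i == mask_pos else str(s) for i, s in enumerate(watch_sids)]
    return {"prompt": "Watch history: " + " ".join(masked) + " Fill in the mask.",
            "target": str(watch_sids[mask_pos])}

print(sid_to_text_example([137, 52, 201], "Beach volleyball highlights"))
print(masked_watch_example([[137, 52, 201], [9, 4, 77], [88, 3, 12]], mask_pos=1))
```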
Challenges in LLM-Based RecSys (YouTube’s Perspective):
- Vocabulary and Corpus Size: Billions of YouTube videos compared to 100,000 words in English vocabulary [03:20:44].
- Freshness: Millions of videos added daily require continuous pre-training (on the order of days/hours) to recommend new, relevant content immediately (e.g., Taylor Swift’s new music video) [03:20:51], unlike classical LLM pre-training which is less frequent [03:21:40].
- Scale and Serving Cost: Large LLMs like Gemini Pro are too expensive for billions of daily active users; smaller, efficient models (e.g., Flash) are needed to meet latency and scale requirements, often after significant cost reduction efforts (e.g., 95%+ savings) [03:21:52].
- Balancing Language Capability: Training on semantic IDs can cause the model to “forget” English, requiring strategies like Mixture of Experts [03:25:50].
Solution for Serving Costs: Turn the problem into an offline one by removing the personalized parts of the prompt and building an offline recommendations table keyed by popular videos, so serving becomes a simple lookup [03:19:10].
Future Directions for LLMs and RecSys
- Augmentation to Interaction: LLMs currently augment recommendations invisibly, enhancing quality [03:23:58]. The future involves interactive experiences where users can steer recommendations using natural language, receive explanations for recommendations, and align them with their goals [03:24:28].
- Blurring Search and Recommendation: The lines between these two domains will blur [03:24:52].
- Generative Content: Recommendations may evolve into personalized versions of content, and eventually the system might create content tailored to an individual user (n-of-1 content) [03:25:01].
These advancements highlight a dynamic field where LLMs are not just improving existing recommendation paradigms but enabling entirely new capabilities and experiences.