From: aidotengineer

Challenges and Advancements in LLM-Based Recommendation Models

The integration of Large Language Models (LLMs) with recommendation systems (RecSys) marks a significant shift in how personalized experiences are delivered. This evolving field addresses long-standing challenges in RecSys through novel applications of LLM capabilities [00:02:49]. Key areas of focus include semantic IDs, data augmentation, and unified models [00:03:55].

Semantic IDs: Understanding Content Beyond Hashes

Traditionally, recommendation systems rely on hash-based item IDs, which do not inherently encode content information [00:04:11]. This leads to several problems:

  • Cold-start problem: New items require relearning their properties [00:04:17].
  • Sparsity: Long-tail items with few interactions are difficult to learn from [00:04:26].
  • Popularity bias: Because cold-start and sparse items are hard to learn, systems default to recommending already-popular items [00:04:32].

Advancement: Semantic IDs offer a solution by encoding item content, potentially involving multimodal data [00:04:39].

Kuaishou’s Multimodal Semantic IDs

Kuaishou, a short video platform, faced the challenge of learning from hundreds of millions of daily video uploads [00:05:03]. Their approach involved trainable multimodal semantic IDs:

  • They combine static content embeddings (visual from ResNet, video descriptions from BERT, audio from VGGish) with dynamic user behavior [00:05:08].
  • Content embeddings are concatenated [00:06:11].
  • Cluster IDs are learned via k-means clustering (e.g., 1,000 clusters from 100 million videos) [00:06:22].
  • An encoder learns to map the content space, via cluster IDs and an embedding table, into the behavioral space (a minimal sketch follows this list) [00:06:48].
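
A minimal, hypothetical sketch of this pipeline is shown below using PyTorch and scikit-learn with toy sizes; the embedding dimensions, cluster count, loss, and the `behavioral_target` stand-in are illustrative assumptions, not Kuaishou’s actual implementation.

```python
# Illustrative sketch of trainable multimodal semantic IDs (toy sizes, synthetic data).
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Static content embeddings per video: visual (ResNet), text (BERT), audio (VGGish).
num_videos, d_vis, d_txt, d_aud = 10_000, 64, 64, 32
visual = torch.randn(num_videos, d_vis)
text = torch.randn(num_videos, d_txt)
audio = torch.randn(num_videos, d_aud)

# 1) Concatenate content embeddings into one content vector per video.
content = torch.cat([visual, text, audio], dim=-1)                 # (N, 160)

# 2) Learn cluster IDs over the content space with k-means
#    (the talk cites ~1,000 clusters over ~100M videos; scaled down here).
kmeans = KMeans(n_clusters=100, n_init=4).fit(content.numpy())
cluster_ids = torch.as_tensor(kmeans.labels_, dtype=torch.long)    # (N,)

# 3) A trainable encoder maps cluster-ID embeddings into the behavioral space,
#    so content-similar videos land near behaviorally similar ones.
d_behavior = 48
cluster_table = nn.Embedding(100, 64)                              # semantic-ID embedding table
encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, d_behavior))

behavioral_target = torch.randn(num_videos, d_behavior)            # stand-in for user-behavior embeddings
optimizer = torch.optim.Adam([*cluster_table.parameters(), *encoder.parameters()], lr=1e-3)

for step in range(200):
    pred = encoder(cluster_table(cluster_ids))                     # content space -> behavioral space
    loss = nn.functional.mse_loss(pred, behavioral_target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```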

Outcome: These semantic IDs not only outperform hash-based IDs on clicks and likes but also increase cold-start coverage by 3.6% and improve cold-start velocity [00:07:07]. This means recommendations can understand content and even provide human-readable explanations [00:07:54].

Data Augmentation and Efficiency: Leveraging LLMs for Quality Data

High-quality data at scale is crucial for search and recommendation systems [00:08:12].

  • Traditional methods like human annotations are costly and high-effort [00:08:31].
  • Implicit feedback is often imprecise [00:09:33].

Advancement: LLMs excel at generating synthetic data and labels, offering a solution to these data challenges [00:08:37].

Indeed’s Lightweight Classifier for Job Recommendations

Indeed faced the issue of poor job recommendations leading to user dissatisfaction and unsubscribes [00:09:03].

  • Challenge: Explicit negative feedback was sparse [00:09:28].
  • Solution: A lightweight classifier to filter bad recommendations [00:09:50].
  • Process:
    • Experts labeled pairs of users and recommended jobs [00:10:07].
    • Prompting open LLMs (Mistral, LLaMA 2) yielded poor performance [00:10:24].
    • GPT-4 performed well (90% precision and recall) but was costly and too slow (22 seconds) [00:10:43].
    • GPT-3.5 had very poor precision (63%) [00:11:00].
    • Fine-tuning GPT-3.5 achieved the desired precision but was still too slow for online filtering (6.7 seconds) [00:11:39].
    • Finally, they distilled a lightweight classifier on labels from the fine-tuned GPT-3.5, achieving high performance (0.86 AUC-ROC) and real-time suitability [00:11:53].
  • Outcome: Reduced bad recommendations by 20%, increased application rate by 4%, and decreased unsubscribe rate by 5% [00:12:20]. This highlighted that quality over quantity in recommendations makes a significant difference [00:12:57].
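
As a rough illustration of that final distillation step, the sketch below trains a lightweight classifier on teacher labels. In practice the labels would come from the fine-tuned GPT-3.5 judge and the features from the recommendation pipeline; both are synthetic stand-ins here.

```python
# Illustrative distillation of LLM judgments into a lightweight, low-latency classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend features for (user, job) pairs, e.g. match scores or embedding similarities.
X = rng.normal(size=(50_000, 32))
# "Teacher" labels: is this recommendation bad? Synthetic here; in the real pipeline
# these come from the fine-tuned GPT-3.5 judge.
teacher_labels = (X[:, :4].sum(axis=1) + rng.normal(scale=0.5, size=50_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, teacher_labels, test_size=0.2, random_state=0)

# Lightweight "student": fast enough to filter recommendations online.
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, student.predict_proba(X_test)[:, 1])
print(f"student AUC-ROC: {auc:.2f}")   # the talk reports ~0.86 for Indeed's real classifier
```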

Spotify’s Query Recommendation System

Spotify aimed to grow new content categories like podcasts and audiobooks, facing a cold-start problem not just on items but on categories [00:13:34].

  • Solution: A query recommendation system [00:13:53].
  • Process:
    • Traditional techniques extracted ideas from catalog/playlist titles and search logs [00:14:03].
    • LLMs were used to generate natural language queries, augmenting existing methods [00:14:20].
  • Outcome: +9% in exploratory queries, meaning one-tenth of users were exploring new products daily, significantly accelerating new product category growth [00:15:14].
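
One plausible way to combine mined candidates with LLM generations is sketched below; `generate`, the prompt wording, and the merge logic are assumptions rather than Spotify’s actual system.

```python
# Hypothetical sketch: augment queries mined from catalog titles and search logs
# with LLM-generated natural-language queries, then deduplicate.
def generate(prompt: str) -> list[str]:
    # Placeholder for an LLM call that returns one query per line.
    ...

def build_query_recommendations(user_interests: list[str], mined_queries: list[str]) -> list[str]:
    prompt = (
        "Suggest 5 short, natural-language search queries that would help a listener "
        f"interested in {', '.join(user_interests)} discover podcasts and audiobooks."
    )
    llm_queries = generate(prompt) or []
    # Union of traditional candidates and LLM output, de-duplicated while preserving order.
    seen, merged = set(), []
    for q in mined_queries + llm_queries:
        if q.lower() not in seen:
            seen.add(q.lower())
            merged.append(q)
    return merged
```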

Benefit: LLM-augmented synthetic data provides richer, high-quality data at scale, even for tail queries and items, at a much lower cost than human annotation [00:15:35].

Unified Models: Streamlining Recommendation Systems

Traditionally, companies have separate, bespoke systems for ads, recommendations, and search, and even within recommendations, different models for different surfaces (e.g., homepage, item page) [00:16:10].

  • Challenge: Duplicative engineering pipelines, high maintenance costs, and improvements in one model not transferring to others [00:16:36].

Advancement: Unified models, inspired by vision and language domains, address these inefficiencies [00:16:47].

Netflix’s Unified Contextual Ranker (Unicorn)

Netflix faced high operational costs due to bespoke models for search, similar item recommendations, and pre-query recommendations [00:17:18].

  • Solution: A unified contextual ranker (Unicorn) [00:17:34].
  • Process: It uses a user foundation model and a context and relevance model with unified input (user ID, item ID, search query, country, task) [00:17:39]. Smart imputation handles missing items, for instance, by using the current item’s title as a search query [00:18:27].
  • Outcome: The unified model matched or exceeded the metrics of specialized models on multiple tasks, leading to infrastructure consolidation and faster iteration [00:18:50].
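
The sketch below illustrates the unified-input-plus-imputation idea with a toy schema; the field names and the imputation rule are illustrative, based only on the title-as-query example above.

```python
# Minimal sketch of a unified input with smart imputation (illustrative schema only).
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedInput:
    user_id: str
    item_id: Optional[str]
    search_query: Optional[str]
    country: str
    task: str            # e.g. "search", "similar_items", "pre_query"

def impute(example: UnifiedInput, item_titles: dict[str, str]) -> UnifiedInput:
    # For tasks with no explicit query (e.g. similar-item recommendations),
    # fall back to the current item's title as the "query".
    if example.search_query is None and example.item_id is not None:
        example.search_query = item_titles.get(example.item_id)
    return example

row = impute(
    UnifiedInput(user_id="u1", item_id="t42", search_query=None, country="US", task="similar_items"),
    item_titles={"t42": "Stranger Things"},
)
print(row.search_query)   # -> "Stranger Things"
```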

Etsy’s Unified Embeddings

Etsy aimed to provide better results for both specific and broad queries despite its constantly changing inventory [00:19:24].

  • Challenge: Lexical embeddings didn’t account for user preferences [00:19:53].
  • Solution: Unified embedding and retrieval [00:20:03].
  • Process: A product encoder (using T5 for text embeddings and query-product logs) and a query encoder share token, product category, and user location encoders [00:20:18]. User preferences are encoded via query-user scale effect features [00:20:59]. A “quality vector” (ratings, freshness, conversion rate) is concatenated to the product embedding [00:21:22].
  • Outcome: 2.6% increase in conversion across the entire site and over 5% increase in search purchases [00:21:52].
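
A rough two-tower sketch of shared encoders plus a concatenated quality vector is below; the architecture, dimensions, and feature names are assumptions for illustration, not Etsy’s production model.

```python
# Two-tower sketch: product and query towers share token/category/location encoders,
# and a quality vector (ratings, freshness, conversion rate) is concatenated to the
# product embedding before dot-product retrieval.
import torch
import torch.nn as nn

class SharedEncoders(nn.Module):
    def __init__(self, vocab=10_000, n_categories=500, n_locations=200, d=32):
        super().__init__()
        self.token = nn.EmbeddingBag(vocab, d)        # shared text-token encoder
        self.category = nn.Embedding(n_categories, d) # shared product-category encoder
        self.location = nn.Embedding(n_locations, d)  # shared user-location encoder

class ProductTower(nn.Module):
    def __init__(self, shared: SharedEncoders, d=32, d_quality=4):
        super().__init__()
        self.shared = shared
        self.proj = nn.Linear(2 * d, d)

    def forward(self, tokens, category, quality):
        text = self.shared.token(tokens)
        emb = self.proj(torch.cat([text, self.shared.category(category)], dim=-1))
        # Concatenating the quality vector lets retrieval scores reflect listing quality.
        return torch.cat([emb, quality], dim=-1)

class QueryTower(nn.Module):
    def __init__(self, shared: SharedEncoders, d=32, d_quality=4):
        super().__init__()
        self.shared = shared
        self.proj = nn.Linear(2 * d, d + d_quality)

    def forward(self, tokens, location):
        text = self.shared.token(tokens)
        return self.proj(torch.cat([text, self.shared.location(location)], dim=-1))

shared = SharedEncoders()
products, queries = ProductTower(shared), QueryTower(shared)
score = (queries(torch.tensor([[1, 2, 3]]), torch.tensor([7]))
         * products(torch.tensor([[4, 5, 6]]), torch.tensor([42]), torch.randn(1, 4))).sum(-1)
```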

Benefits: Unified models simplify systems, and improvements in one part of the model transfer to other use cases [00:22:12]. However, misalignment can occur, potentially requiring multiple unified models [00:22:25].

LLM-Based Search Relevance and Efficiency: Pinterest’s Approach

Pinterest, handling over six billion searches monthly across billions of pins and 45+ languages, uses LLMs for semantic relevance modeling in the reranking stage [00:29:04].

Advancements:

  • Relevance Prediction: A cross-encoder structure concatenates query and pin text and passes them to an LLM to obtain an embedding, which is then fed into an MLP head that predicts one of five relevance levels (a minimal sketch follows this list) [00:31:11]. Fine-tuning open-source LLMs with Pinterest data substantially improves performance, with larger models yielding greater gains (e.g., 12% improvement over multilingual BERT) [00:32:17].
  • Content Annotations: Vision-language model-generated captions and user actions provide useful content annotations for text representation of pins [00:32:54].
  • Efficiency: To meet latency (e.g., 400-500 ms) and throughput requirements [00:33:26], Pinterest employs:
    • Distillation: Start with a larger model (e.g., 150 billion parameters) and gradually distill it into smaller models (e.g., 8B, 3B, 1B) [00:34:08]. This step-by-step approach is more effective than direct distillation [00:34:17].
    • Pruning: Gradually reduce redundancy in transformer models (e.g., number of heads, MLPs) [00:34:55]. Aggressive pruning can lead to significant quality reduction [00:35:20].
    • Quantization: Use lower precision (e.g., FP8) for activations and parameters, but maintain FP32 for the LM head to preserve prediction precision and calibration [00:36:03].
    • Sparsification: Sparsify attention so that not every item attends to every other item, especially for recommended items in a sequence [00:37:03].
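
A simplified version of the cross-encoder relevance model is sketched below; the small Transformer encoder stands in for the fine-tuned open-source LLM, and the vocabulary, pooling, and head sizes are assumptions.

```python
# Simplified cross-encoder sketch: query and pin text are concatenated, encoded, and the
# pooled embedding feeds an MLP head over five relevance levels.
import torch
import torch.nn as nn

class CrossEncoderRelevance(nn.Module):
    def __init__(self, vocab=30_000, d_model=128, n_levels=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # stand-in for the LLM
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, n_levels))

    def forward(self, query_ids, pin_ids):
        # Cross-encoder: a single sequence containing both query and pin text.
        x = self.embed(torch.cat([query_ids, pin_ids], dim=1))
        pooled = self.encoder(x).mean(dim=1)                        # pooled "embedding"
        return self.head(pooled)                                    # logits over 5 relevance levels

model = CrossEncoderRelevance()
logits = model(torch.randint(0, 30_000, (2, 16)), torch.randint(0, 30_000, (2, 48)))
print(logits.shape)   # (2, 5)
```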

Outcome: These optimizations resulted in a 7x reduction in latency and a 30x increase in throughput (queries per GPU) [00:37:51].

Netflix’s Foundational Model for Personalization

Netflix has made a “big bet” on using one foundational model to cover all recommendation use cases [02:23:26].

Challenges of Traditional Systems:

  • Diverse Needs: Netflix has varied recommendation requirements across rows (genres, new/trending, Netflix originals), items (movies, TV shows, games, live streaming), and pages (homepage, search, kids’ homepage, mobile feed) [02:23:48].
  • Many Specialized Models: Historically, this led to numerous independently built models, each with different objectives and significant overlap, resulting in duplications in feature/label engineering [02:24:52].
  • Scalability: Spinning up new models for each use case is unsustainable and hinders innovation velocity [02:26:32].

Advancement: Centralizing user representation learning through a transformer-based foundation model [02:27:19].

Hypotheses:

  1. Scaling Law: Scaling up semi-supervised learning can improve personalization, akin to LLM scaling laws [02:27:36].
  2. High Leverage: Integrating the foundation model into all systems can simultaneously improve downstream, canvas-facing models [02:27:49].

Data and Training Considerations:

  • Tokenization: Crucial decision for model quality [02:28:44]. Unlike language tokens, each “token” in a recommendation context is an interaction event with many facets (when, where, what) [02:29:21].
  • Event Representation: Break down events by time, location, device, canvas, target entity, and interaction details [02:30:31].
  • Embedding Feature Transformation: Combine ID embedding learning with semantic content information to address the cold-start problem (not an issue for LLMs but critical for RecSys) [02:31:28].
  • Transformer Layer: Hidden state output serves as a long-term user representation. Stability and aggregation methods (across time/layers) are key [02:32:10].
  • Objective/Loss Function: Richer than LLMs, as multiple sequences can represent output (entity IDs, action types, metadata, duration, next play time) [02:33:06]. This allows for multitask learning or using targets as weights/rewards [02:34:01].
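
As a hedged illustration of multi-faceted event tokens feeding a transformer with several prediction heads, the sketch below uses made-up facet names and toy sizes; it is not Netflix’s actual foundation model.

```python
# Toy foundation-model sketch: each interaction "token" is a sum of facet embeddings
# (what, how, where, when), and a multi-task objective predicts several output sequences.
import torch
import torch.nn as nn

class InteractionFoundationModel(nn.Module):
    def __init__(self, n_entities=100_000, n_actions=8, n_devices=16, d=64):
        super().__init__()
        self.entity = nn.Embedding(n_entities, d)
        self.action = nn.Embedding(n_actions, d)
        self.device = nn.Embedding(n_devices, d)
        self.hour = nn.Embedding(24, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Richer-than-LLM objective: multiple heads trained jointly (multitask learning).
        self.next_entity = nn.Linear(d, n_entities)
        self.next_action = nn.Linear(d, n_actions)
        self.duration = nn.Linear(d, 1)

    def forward(self, entity_ids, action_ids, device_ids, hours):
        tokens = (self.entity(entity_ids) + self.action(action_ids)
                  + self.device(device_ids) + self.hour(hours))
        h = self.transformer(tokens)          # hidden states double as the user representation
        return self.next_entity(h), self.next_action(h), self.duration(h)

fm = InteractionFoundationModel()
B, T = 2, 10
outs = fm(torch.randint(0, 100_000, (B, T)), torch.randint(0, 8, (B, T)),
          torch.randint(0, 16, (B, T)), torch.randint(0, 24, (B, T)))
```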

Scaling Results: The foundation model demonstrated continuous gains over two-and-a-half years, scaling from millions of profiles to billions of parameters [02:34:38]. Latency/cost requirements necessitate distillation for larger models [02:35:08].

Learnings from LLMs:

  • Multi-token prediction: Forces the model to be less myopic, more robust to serving time shifts, and targets long-term user satisfaction [02:35:36].
  • Multi-layer representation: Techniques like layer-wise supervision and self-distillation lead to better, more stable user representations [02:36:12].
  • Long context window handling: Utilizes techniques like truncated sliding windows, sparse attention, and progressive training [02:36:34].
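
One common way to implement multi-token prediction is with separate heads over a shared hidden state, as sketched below; the horizon and dimensions are arbitrary, and this is an assumption about the general technique rather than Netflix’s exact design.

```python
# Multi-token prediction sketch: head i predicts the entity i+1 steps ahead, which
# discourages myopic next-item-only behavior; losses are summed over the horizon.
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, d_model=64, n_entities=100_000, horizon=4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, n_entities) for _ in range(horizon)])

    def forward(self, hidden):                    # hidden: (batch, d_model)
        return [head(hidden) for head in self.heads]

logits_per_step = MultiTokenHead()(torch.randn(2, 64))
print(len(logits_per_step), logits_per_step[0].shape)   # 4 heads, each (2, 100000)
```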

Application and Serving:

  • The foundation model (FM) consolidates data and representation layers (user and content) [02:37:15].
  • Application models become thinner layers built on top of FM [02:37:30].
  • Consumption Patterns:
    1. FM integrated as a subgraph within downstream models [02:37:48].
    2. Push out embeddings (content and member) to a centralized store for wider use [02:38:17].
    3. Extract and fine-tune/distill models for specific applications to meet online serving requirements [02:38:55].

Wins: High leverage of FM leads to AB test gains and infrastructure consolidation, validating the big bet. It provides a scalable solution, higher leverage, and faster innovation velocity [02:39:12].

Current Directions:

  • Universal representation for heterogeneous entities (semantic ID) [02:40:23].
  • Generative retrieval for collection recommendation (multi-step decoding) [02:40:40].
  • Faster adaptation through prompt tuning (training soft tokens) [02:40:59].

Instacart’s LLM-Powered Search and Discovery

Instacart uses LLMs to transform search and discovery for grocery e-commerce, where customers have long shopping lists and seek both restocking and new product discovery [02:46:17].

Challenges of Conventional Search Engines:

  • Overly Broad Queries: E.g., “snacks,” where engagement data is hard to collect for long-tail products [02:47:48].
  • Very Specific Queries: E.g., “unsweetened plant-based yogurt,” where sparse data limits model training [02:48:12].
  • New Item Discovery: Difficulty in surfacing related products or enabling exploratory browsing, leading to dead ends [02:48:43].

Advancements: LLMs enhance query understanding and enable discovery-oriented content.

Query Understanding: Enhancing Accuracy

Instacart’s query understanding module includes models for normalization, tagging, and classification [02:49:59].

  • Query-to-Category Classifier: Maps a query to relevant categories (multilabel problem, 10,000 labels) [02:50:16].
    • Traditional Challenge: FastText and NPMI models struggled with low coverage for tail queries due to lack of engagement data [02:51:15].
    • LLM Approach: Initially, raw LLM predictions were decent but mismatched Instacart user behavior (e.g., “protein” for chicken vs. protein shakes) [02:51:38].
    • Hybrid Approach: Augmented the prompt with Instacart’s domain knowledge, such as top converting categories and query annotations like brands and dietary attributes (see the sketch after this list) [02:52:34].
    • Outcome: Significant improvements for tail queries: 18 percentage points increase in precision, 70 percentage points increase in recall [02:53:29].
  • Query Rewrites Model: Generates broader or synonymous queries to ensure results even with small retailer catalogs [02:54:10].
    • LLM Approach: Similar to category classification, LLMs generate precise rewrites (substitute, broad, synonymous) [02:54:51].
    • Outcome: Offline improvements in human evaluation and a significant drop in queries without any results online [02:55:19].
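
The sketch below shows how domain signals might be folded into the classification prompt; `llm_complete`, the prompt format, and the signal fields are hypothetical stand-ins, not Instacart’s actual prompt.

```python
# Hypothetical hybrid approach: augment the LLM prompt with domain knowledge
# (top converting categories, query annotations) before asking for categories.
def llm_complete(prompt: str) -> str:
    # Placeholder for a chat-completion call returning category names, one per line.
    ...

def classify_query(query: str, top_converting_categories: list[str], annotations: dict) -> list[str]:
    prompt = (
        f"Query: {query}\n"
        f"Top categories users convert to for this query: {', '.join(top_converting_categories)}\n"
        f"Detected attributes: brand={annotations.get('brand')}, dietary={annotations.get('dietary')}\n"
        "Return the most relevant product categories for this grocery query, one per line."
    )
    response = llm_complete(prompt) or ""
    return [line.strip() for line in response.splitlines() if line.strip()]

# Example: behavioral context steers "protein" toward chicken and protein shakes
# rather than a generic common-sense reading.
classify_query("protein",
               top_converting_categories=["chicken", "protein shakes", "tofu"],
               annotations={"brand": None, "dietary": None})
```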

Serving Strategy: Precompute outputs for head and torso queries offline in batch mode and cache them for low-latency serving. Fallback to existing models (or distilled LLMs) for the long tail [02:56:00].
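
A minimal sketch of this precompute-and-fallback pattern is below, assuming a plain in-memory cache; in production this would be a batch pipeline plus a low-latency store.

```python
# Precompute-and-fallback serving sketch (illustrative names and structures).
precomputed: dict[str, list[str]] = {}           # filled by an offline batch job over top queries

def offline_batch_job(head_and_torso_queries, expensive_llm_pipeline):
    for q in head_and_torso_queries:
        precomputed[q] = expensive_llm_pipeline(q)   # run the LLM once, offline

def serve(query: str, fallback_model) -> list[str]:
    cached = precomputed.get(query)
    if cached is not None:
        return cached                            # low-latency cache hit for head/torso queries
    return fallback_model(query)                 # tail queries: existing or distilled model
```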

Future Direction: Consolidate multiple query understanding models into a single LLM for consistency and pass extra context (e.g., user mission like recipe ingredients) [02:57:30].

Discovery-Oriented Content: Enhancing User Experience

Instacart aimed to make search results pages more useful after an item is added to the cart, by showing complementary or substitute items [02:58:43].

  • Traditional Challenge: Required extensive feature engineering or manual work [02:59:08].
  • LLM Approach: Instruct the LLM to act as an AI assistant that generates complementary and substitute shopping lists [03:00:40].
    • Initial LLM common-sense answers didn’t align with user behavior (e.g., “protein” for chicken vs. protein bars) [03:01:17].
    • Improved Approach: Augmented prompts with Instacart domain knowledge (top converting categories, query annotations, subsequent user queries) [03:01:41].
  • Outcome: Significantly improved engagement and revenue per search [02:59:53].

Serving Strategy: Precompute LLM-generated content for historical search logs in batch mode and store it for quick lookup [03:02:29].

Key Challenges in Implementation:

  1. Aligning generation with business metrics like revenue [03:03:16].
  2. Improving ranking of generated content, often requiring diversity-based reranking [03:03:28].
  3. Evaluating LLM-generated content for correctness (no hallucination) and adherence to product needs (using LLM as a judge) [03:03:45].

Overall Takeaways: LLM world knowledge improves query understanding for tail queries. Success comes from combining domain knowledge with LLMs. Robust evaluation is crucial [03:04:02].

YouTube’s Large Recommender Model (LRM)

YouTube aims to transform its recommendation system by adapting Gemini, a large language model, to recommend videos. Over 70% of YouTube’s watch time is driven by its recommendation system [03:09:08].

Problem: Learning a function that takes user and context as input to provide recommendations [03:09:44].

Advancement: The Large Recommender Model (LRM) adapts Gemini for YouTube recommendations [03:11:43].

Recipe for LLM-Based RecSys:

  1. Tokenize Content (Semantic ID):
    • Goal: Create atomic units for a new “language” of YouTube videos [03:13:48].
    • Process: Extract features (title, description, transcript, audio, video frames) from a video, combine them into a multi-dimensional embedding, then quantize with RQ-VAE to assign each video a semantic ID (SID); a simplified sketch follows this list [03:13:21].
    • Outcome: A semantically meaningful tokenization in which related videos share common prefixes (e.g., sports videos share a prefix, and volleyball videos share a longer one) [03:14:04].
  2. Adapt the LLM (Continued Pre-training):
    • Goal: Teach the LLM to understand both English and this new YouTube language [03:14:39].
    • Tasks:
      • Linking Text and SID: Prompt the model with a video’s SID and ask it to output its title, creator, or topics [03:14:48].
      • Understanding Watch Sequences: Prompt with user watch history (video SIDs), mask some, and train the model to predict them, learning relationships based on user engagement [03:15:26].
    • Outcome: A “bilingual” LLM that can reason across English and YouTube videos [03:16:00].
  3. Prompt with User Information (Generative Retrieval):
    • Goal: Generate personalized recommendations [03:17:03].
    • Process: Construct a prompt with user demographics, context video, and watch history [03:17:12].
    • Outcome: Yields unique and interesting recommendations, especially for hard cases (e.g., finding women’s track races related to men’s Olympic highlights) [03:17:45].
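
The residual-quantization idea behind step 1 can be sketched roughly as below; a real RQ-VAE learns its codebooks jointly with an encoder/decoder, whereas this toy version only shows how coarse-to-fine codes form a semantic ID. All sizes and the random codebooks are illustrative.

```python
# Simplified residual quantization: each level picks the nearest codebook entry for the
# remaining residual, yielding a coarse-to-fine semantic ID such as (12, 3, 47).
import numpy as np

def residual_quantize(video_embedding: np.ndarray, codebooks: list[np.ndarray]) -> tuple[int, ...]:
    sid, residual = [], video_embedding.copy()
    for codebook in codebooks:                               # one codebook per SID level
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(dists.argmin())
        sid.append(idx)
        residual = residual - codebook[idx]                  # quantize what's left at the next level
    return tuple(sid)

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]   # 3-level SID over 64-d video embeddings
video_embedding = rng.normal(size=64)                        # fused title/transcript/audio/frame features
print(residual_quantize(video_embedding, codebooks))         # shared prefixes imply related videos
```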

Challenges in LLM-Based RecSys (YouTube’s Perspective):

  • Vocabulary and Corpus Size: Billions of YouTube videos compared to 100,000 words in English vocabulary [03:20:44].
  • Freshness: Millions of videos added daily require continuous pre-training (on the order of days/hours) to recommend new, relevant content immediately (e.g., Taylor Swift’s new music video) [03:20:51], unlike classical LLM pre-training which is less frequent [03:21:40].
  • Scale and Serving Cost: Large LLMs like Gemini Pro are too expensive for billions of daily active users; smaller, efficient models (e.g., Flash) are needed to meet latency and scale requirements, often after significant cost reduction efforts (e.g., 95%+ savings) [03:21:52].
  • Balancing Language Capability: Training on semantic IDs can cause the model to “forget” English, requiring strategies like Mixture of Experts [03:25:50].

Solution for Serving Costs: Turn it into an offline problem by removing personalized aspects from the prompt, building an offline recommendations table for popular videos. This allows simple lookup for serving [03:19:10].

Future Directions for LLMs and RecSys

  • Augmentation to Interaction: LLMs currently augment recommendations invisibly, enhancing quality [03:23:58]. The future involves interactive experiences where users can steer recommendations using natural language, receive explanations for recommendations, and align them with their goals [03:24:28].
  • Blurring Search and Recommendation: The lines between these two domains will blur [03:24:52].
  • Generative Content: Recommendations may evolve into personalized versions of content, and eventually the system might even create content tailored to an individual user (n-of-1 content) [03:25:01].

These advancements highlight a dynamic field where LLMs are not just improving existing recommendation paradigms but enabling entirely new capabilities and experiences.