From: aidotengineer

The intersection of recommendation systems and large language models (LLMs) is a significant area of development, promising to revolutionize how users discover content and products [00:02:47]. This integration builds upon a long history of language modeling techniques in recommendation systems, which date back to 2013 with item embeddings and later evolved with GRU4Rec and transformers for handling longer interaction sequences [00:03:06].

Challenges in Conventional Recommendation Systems

Traditional recommendation systems face several inherent challenges:

  • Hash-based Item IDs: These do not inherently encode content, leading to the “cold-start problem” for new items where systems must relearn everything about them [00:04:00].
  • Sparsity: Many “tail” items have very few interactions, making it difficult for models to learn effectively [00:04:24].
  • Popularity Bias: Systems often favor popular items, struggling to recommend new or niche content [00:04:30].
  • Data Quality and Scale: Machine learning models, especially for search and recommendations, require vast amounts of high-quality, metadata-rich data, which is costly and labor-intensive to acquire through traditional means [00:08:04].
  • System Silos: Historically, systems for ads, recommendations, and search have operated separately, leading to duplicated engineering efforts, high maintenance costs, and limited knowledge transfer between models [00:16:03].

Personalization Strategies Leveraging LLMs

Three key strategies are emerging to address these challenges and enhance personalization:

1. Semantic IDs

Semantic IDs encode the content of an item, including multimodal information, allowing the recommendation system to understand what it is recommending [00:04:39]. This approach directly tackles the cold-start problem for new items [00:07:34].

Example: Kwai’s Trainable Multimodal Semantic IDs

Kwai, a short-video platform, faced the challenge of learning from hundreds of millions of daily video uploads [00:04:59]. Their solution involved combining static content embeddings with dynamic user behavior using trainable multimodal semantic IDs [00:05:08].

  • Architecture: They used a standard two-tower network. Content inputs (visual via ResNet, video descriptions via BERT, audio via VGGish) were concatenated [00:05:15].
  • Clustering: Non-trainable content embeddings were used to learn 1,000 cluster IDs via K-means clustering from 100 million short videos [00:06:17]. These cluster IDs were mapped to their own embedding table [00:06:39].
  • Learning: The model encoder learned to map the content space via these cluster IDs to the behavioral space [00:06:44].
  • Outcome: These semantic IDs not only outperformed hash-based IDs on clicks and likes but also increased cold-start coverage by 3.6% and improved cold-start velocity, enabling new videos to reach view thresholds faster [00:06:59].
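
A minimal PyTorch sketch of how such trainable semantic IDs might be wired into an item tower follows; the frozen content embeddings, the dimensions, and the way the cluster embedding is combined with the encoded content are illustrative assumptions, not Kwai’s published architecture.

```python
# Sketch: trainable multimodal semantic IDs (illustrative, not Kwai's code).
import torch
import torch.nn as nn

NUM_CLUSTERS = 1000                 # cluster IDs learned offline via k-means
CONTENT_DIM = 2048 + 768 + 128      # e.g. ResNet + BERT + VGGish, concatenated
EMB_DIM = 64                        # dimension of the behavioral space

class SemanticIdItemTower(nn.Module):
    def __init__(self):
        super().__init__()
        # Trainable embedding table keyed by cluster ID (the "semantic ID").
        self.cluster_emb = nn.Embedding(NUM_CLUSTERS, EMB_DIM)
        # Encoder that maps frozen content features toward the behavioral space.
        self.encoder = nn.Sequential(
            nn.Linear(CONTENT_DIM, 256), nn.ReLU(), nn.Linear(256, EMB_DIM)
        )

    def forward(self, content_features: torch.Tensor, cluster_ids: torch.Tensor):
        # content_features: (batch, CONTENT_DIM) frozen multimodal embeddings
        # cluster_ids: (batch,) integer cluster assignments from offline k-means
        return self.encoder(content_features) + self.cluster_emb(cluster_ids)

# Offline step (illustrative): assign each video to one of 1,000 clusters.
# from sklearn.cluster import KMeans
# cluster_ids = KMeans(n_clusters=NUM_CLUSTERS).fit_predict(content_matrix)
```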

Future integrations may involve blending LLMs with semantic IDs to explain why a user might like a recommendation, providing human-readable explanations [00:07:48]. YouTube has similarly built semantic IDs by distilling multimodal features into discrete tokens, organizing billions of videos into a semantically meaningful vocabulary [03:06:52].

2. Data Augmentation

LLMs excel at generating synthetic data and labels, providing richer, high-quality data at scale, especially for tail queries and items, at a significantly lower cost and effort than human annotation [00:08:35].

Example: Indeed’s Job Recommendation Filtering

Indeed faced the challenge of bad job recommendations leading to poor user experience and unsubscriptions [00:09:01]. Explicit feedback (thumbs up/down) was sparse, and implicit feedback was imprecise [00:09:25].

  • Solution: They developed a lightweight classifier to filter bad recommendations [00:09:50].
  • Process:
    1. Human experts labeled job recommendations and user pairs based on resume and activity data [00:10:05].
    2. Open LLMs (Mistral, Llama 2) showed poor performance due to generic output [00:10:20].
    3. GPT-4 performed well (90% precision and recall) but was too costly and slow (22 seconds per prediction) [00:10:38].
    4. GPT-3.5 had poor precision, incorrectly filtering out 37% of good recommendations [00:10:56].
    5. Fine-tuning GPT-3.5 achieved the desired precision but was still too slow (6.7 seconds per prediction) for online filtering [00:11:30].
    6. Finally, they distilled a lightweight classifier using labels from the fine-tuned GPT-3.5, achieving high performance (0.86 AUC-ROC) and real-time latency [00:11:51].
  • Outcome: This reduced bad recommendations by 20%, increased application rates by 4%, and decreased unsubscribe rates by 5%, demonstrating that quality over quantity significantly improves recommendation impact [00:12:20].
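
A hypothetical sketch of the final distillation step: labels produced offline by the fine-tuned LLM train a lightweight classifier cheap enough to filter recommendations in real time. The features, model choice, and data below are placeholders, not Indeed’s implementation.

```python
# Distill LLM judgments into a lightweight "bad recommendation" filter.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(10_000, 32))          # placeholder match features
llm_labels = (features[:, 0] > 0.5).astype(int)   # placeholder teacher labels from the LLM

X_train, X_test, y_train, y_test = train_test_split(
    features, llm_labels, test_size=0.2, random_state=0
)

# The "student": fast enough to score every recommendation online.
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC-ROC:", roc_auc_score(y_test, student.predict_proba(X_test)[:, 1]))

def keep_recommendation(x: np.ndarray, threshold: float = 0.5) -> bool:
    # Online path: drop a recommendation if the predicted "bad" probability is high.
    return student.predict_proba(x.reshape(1, -1))[0, 1] < threshold
```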

Example: Spotify’s Query Recommendation System

Spotify aimed to expand beyond music into podcasts and audiobooks, facing a cold-start problem for new content categories [00:13:04].

  • Solution: They used an LLM to generate natural language queries for an exploratory search system [00:13:50]. Existing techniques generated queries from catalog titles, playlists, and search logs [00:14:01]. The LLM augmented this by generating more natural language queries [00:14:16].
  • Outcome: This led to a 9% increase in exploratory queries, meaning one-tenth of their users were now exploring new products daily, significantly accelerating category growth [00:15:11].
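
A rough sketch of this kind of query generation, with a hypothetical `call_llm` helper standing in for whichever LLM API is used; this illustrates the idea, not Spotify’s actual pipeline.

```python
# Generate natural language queries from catalog metadata for exploratory search.
def build_query_generation_prompt(item_title: str, item_type: str) -> str:
    return (
        f"You write search queries users might type when looking for {item_type}s.\n"
        f"Title: {item_title}\n"
        "Generate 5 natural language queries, one per line, that this item "
        "should match. Vary phrasing, intent, and specificity."
    )

def generate_queries(item_title: str, item_type: str, call_llm) -> list[str]:
    # call_llm is a hypothetical wrapper around a chat-completion style API.
    response = call_llm(build_query_generation_prompt(item_title, item_type))
    return [line.strip() for line in response.splitlines() if line.strip()]
```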

Example: Instacart’s Search and Discovery

Instacart used LLMs to improve query understanding and product discovery, tackling challenges like overly broad or specific queries, and enabling new item discovery [02:47:32].

  • Query to Product Category Classifier: Traditional models struggled with tail queries due to lack of engagement data [02:50:46]. Initial LLM prompting was decent but failed in A/B tests due to a mismatch with Instacart user behavior (e.g., “protein” meaning supplements vs. chicken) [02:51:36]. The solution was to augment the prompt with Instacart’s domain knowledge, such as top converting categories for each query, significantly improving precision and recall for tail queries [02:52:30].
  • Query Rewrites: LLMs generated precise rewrites (substitute, broad, synonymous) for queries, which was crucial for retailers with varying catalog sizes [02:54:08]. This drastically reduced queries with no results, boosting business [02:55:41].
  • Discovery-Oriented Content: LLMs generated substitute and complementary product suggestions for search results pages (e.g., seafood alternatives for swordfish, Asian cooking ingredients for sushi) [02:58:43]. Again, augmenting LLM prompts with Instacart domain knowledge (top converting categories, query annotations, subsequent user queries) was key to aligning generated content with user behavior and business metrics [03:01:39].
  • Serving: Instacart precomputed LLM outputs for head and torso queries offline, caching them for low-latency online serving, and falling back to existing models for the long tail, with plans to replace those with distilled LLMs [02:56:02].
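
A rough sketch of two of the ideas above, with illustrative function names rather than Instacart’s actual code: the prompt is augmented with domain signals such as top-converting categories, and serving reads precomputed LLM outputs from a cache, falling back to an existing model for the long tail.

```python
# (1) Augment the LLM prompt with domain knowledge for a given query.
def build_category_prompt(query: str, top_converting_categories: list[str]) -> str:
    return (
        f"Classify the grocery search query '{query}' into product categories.\n"
        f"For context, users who issue this query most often convert in: "
        f"{', '.join(top_converting_categories)}.\n"
        "Return the most likely categories, ordered by relevance."
    )

# (2) Offline precompute for head/torso queries, cached for low-latency serving.
precomputed_outputs: dict[str, list[str]] = {}   # query -> categories/rewrites

def get_outputs(query: str, fallback_model) -> list[str]:
    # Cache hit for frequent queries; existing (non-LLM) model for tail queries.
    cached = precomputed_outputs.get(query)
    if cached is not None:
        return cached
    return fallback_model(query)
```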

3. Unified Models

Unified models aim to consolidate multiple specialized models for different recommendation tasks (e.g., homepage, item, cart recommendations) into a single, cohesive system [00:16:00]. This approach leverages shared learning and reduces engineering overhead [00:16:47].

Example: Netflix’s Unified Contextual Ranker (Unicorn)

Netflix sought to address the proliferation of bespoke models for various recommendation and search tasks [00:17:16].

  • Solution: They developed Unicorn, a unified contextual ranker built on a user foundation model and a context/relevance model [00:17:31].
  • Unified Input: The model uses a single data schema for all use cases, incorporating user ID, item ID, search query (if applicable), country, and task [00:17:53]. Smart imputation fills missing data, like using the current item’s title as a search query if none exists [00:18:27].
  • Outcome: Unicorn matched or exceeded the metrics of specialized models across multiple tasks [00:18:48], reducing technical debt and accelerating future iterations [00:19:04].
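
A minimal sketch of what such a unified schema with smart imputation could look like; the field names and task values are assumptions for illustration, not Netflix’s actual schema.

```python
# One input record shared across search and recommendation tasks.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedContext:
    user_id: str
    item_id: Optional[str]        # item being ranked or displayed
    item_title: Optional[str]
    search_query: Optional[str]   # present only for search tasks
    country: str
    task: str                     # e.g. "search", "pre_query", "more_like_this"

def impute(ctx: UnifiedContext) -> UnifiedContext:
    # Smart imputation: for non-search tasks with no query, reuse the current
    # item's title as a pseudo-query so one model can serve every task.
    if not ctx.search_query and ctx.item_title:
        ctx.search_query = ctx.item_title
    return ctx
```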

Scaling Foundation Models for Recommendations

Netflix observed that scaling laws, similar to those in LLMs, apply to recommendation systems; continuous scaling up of models and data yielded performance gains [02:34:31]. However, stringent latency and cost requirements for real-time recommendations necessitate distillation of larger models for production [02:35:08].

Learnings from LLM Development Applied to Recommendation Models

  • Multi-Token Prediction: Forces the model to be less myopic, more robust to serving time shifts, and targets long-term user satisfaction [02:35:36].
  • Multi-Layer Representation: Improves the stability and quality of user representations [02:36:12].
  • Long Context Window Handling: Techniques from LLMs, such as truncated sliding windows and sparse attention, maximize learning and training efficiency [02:36:34].
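
As a toy illustration of the first idea, a multi-token objective can be written as a sum of per-step losses over the next K interactions; the formulation below is an assumption for illustration, not Netflix’s published objective.

```python
# Multi-item prediction: one output head per future position k = 1..K.
import torch
import torch.nn.functional as F

def multi_item_loss(heads_logits: list[torch.Tensor], future_items: torch.Tensor):
    # heads_logits: list of (batch, num_items) logits, one per future step
    # future_items: (batch, K) item IDs of the next K interactions
    loss = 0.0
    for k, logits in enumerate(heads_logits):
        loss = loss + F.cross_entropy(logits, future_items[:, k])
    return loss / len(heads_logits)
```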

Integration and Application of Foundation Models

Netflix’s foundation model (FM) integrates with downstream models in three ways:

  1. Subgraph Integration: The FM can be used as a pre-trained subgraph within downstream neural networks [02:37:48].
  2. Embedding Push-out: Content and user embeddings from the FM are pushed to a centralized embedding store for wider use cases, including analytics [02:38:15].
  3. Model Extraction and Fine-tuning/Distillation: Specific applications can fine-tune or distill the FM to meet online serving requirements [02:38:51].

This approach yielded significant “wins” in both A/B test improvements and infrastructure consolidation [02:39:12].
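
A small sketch of the first integration mode, reusing a pretrained foundation model as a frozen subgraph inside a downstream ranker; module names and dimensions are illustrative.

```python
# Downstream ranker that consumes a pretrained user foundation model (PyTorch).
import torch
import torch.nn as nn

class DownstreamRanker(nn.Module):
    def __init__(self, foundation_model: nn.Module, fm_dim: int, feat_dim: int):
        super().__init__()
        self.fm = foundation_model
        for p in self.fm.parameters():   # freeze; drop this loop to fine-tune instead
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(fm_dim + feat_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, fm_inputs, task_features):
        user_repr = self.fm(fm_inputs)   # pretrained user representation
        return self.head(torch.cat([user_repr, task_features], dim=-1))
```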

Example: Etsy’s Unified Embeddings

Etsy used unified embeddings for search and retrieval to address challenges with specific or broad queries and constantly changing inventory [01:59:30].

  • Architecture: Similar to a two-tower model, with a product encoder (using T5 for text embeddings and query-product logs) and a query encoder. Both share encoders for text tokens, product categories, and user location [02:00:03]. User preferences are personalized via query-user scale effect features [02:05:40].
  • Quality Vector: A “quality vector” (ratings, freshness, conversion rate) was concatenated to the product embedding, with a constant vector appended to the query embedding so the dimensions match for the similarity calculation [02:11:18].
  • Outcome: This resulted in a 2.6% increase in conversion across the entire site and over 5% increase in search purchases [02:18:50], demonstrating the strong impact of unified models on core business metrics.
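
A small NumPy sketch of the quality-vector trick described above: because the query side carries a constant vector of the same length, the dot product adds a query-independent, weighted quality bonus to the similarity score. Signal names and scaling are illustrative assumptions.

```python
import numpy as np

def score(query_emb: np.ndarray, product_emb: np.ndarray,
          quality: np.ndarray, quality_weight: float = 1.0) -> float:
    # quality: e.g. [avg_rating, freshness, conversion_rate], scaled to [0, 1]
    product_vec = np.concatenate([product_emb, quality])
    query_vec = np.concatenate([query_emb, np.full(len(quality), quality_weight)])
    # Dot product = semantic similarity + weighted sum of quality signals.
    return float(query_vec @ product_vec)
```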

Future Directions

The application of LLMs in personalization is rapidly evolving:

  • Invisible Augmentation: Currently, LLMs largely augment recommendation quality invisibly to users [02:39:56].
  • Interactive Experiences: Future developments aim for users to directly interact with recommendation systems using natural language, allowing them to steer recommendations, receive explanations, and align recommendations with their specific goals [02:40:07]. This will blur the lines between search and recommendation [02:40:52].
  • Generative Content: Ultimately, recommendation systems may not just recommend content but also generate personalized versions of it, leading to “n-of-1” content created specifically for individual users [02:41:01].

The recipe for building an LLM-based recommendation system involves three major steps [02:42:25]:

  1. Content Tokenization: Creating a domain-specific language by tokenizing content (e.g., video frames, audio, text) into atomic units (like semantic IDs) [02:42:30].
  2. LLM Adaptation: Teaching an LLM to understand both natural language and this new domain-specific language, creating a bilingual LLM through targeted training tasks [02:42:57].
  3. Personalized Prompting: Using user information (demographics, activity) to construct personalized prompts, leading to a generative recommendation system [02:43:21].
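
A rough sketch of step 3, constructing a personalized prompt that mixes natural language with semantic-ID tokens from step 1; the token format and profile fields are hypothetical.

```python
# Build a personalized prompt for a generative recommender.
def build_recommendation_prompt(user_profile: dict, interaction_history: list[str]) -> str:
    # interaction_history holds semantic-ID tokens, e.g. "<sid_1024>", from step 1.
    history = " ".join(interaction_history[-50:])   # most recent interactions
    return (
        f"User: {user_profile.get('age_band', 'unknown')} year-old in "
        f"{user_profile.get('country', 'unknown')}.\n"
        f"Recent interactions (as semantic IDs): {history}\n"
        "Recommend 10 items as semantic IDs, most relevant first."
    )
```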

This approach represents a significant advancement in personalization, promising more relevant, dynamic, and interactive user experiences [02:43:36].