From: aidotengineer

The future of recommendation systems involves a significant merger with Large Language Models (LLMs) [00:02:47]. This integration builds on a history of language modeling techniques in recommendation systems, which began around 2013 with learning item embeddings from user interaction sequences [00:03:08]. Early methods, like GRU4Rec, predicted the next item from short sequences [00:03:22]. The advent of transformers and attention mechanisms improved handling of long-range dependencies, allowing processing of hundreds or thousands of item IDs in user sequences [00:03:33].

Three key ideas are emerging for the future of this integration: semantic IDs, data augmentation, and unified models [00:03:53].

Semantic IDs

Traditional hash-based item IDs do not encode the content of an item, leading to cold-start problems for new items and sparsity issues for tail items with limited interactions [00:04:00]. This often results in recommendation systems being popularity-biased [00:04:30].

The solution proposed is the use of semantic IDs, which can incorporate multimodal content [00:04:39].

Kwai’s Trainable Multimodal Semantic IDs

Kwai, a short-video platform, faced challenges learning from hundreds of millions of daily user-uploaded videos [00:04:46]. They sought to combine static content embeddings with dynamic user behavior [00:05:08].

Kwai’s approach involves a standard two-tower network, with embedding layers for user and item IDs [00:05:20]. What’s new is the incorporation of content input: multimodal content embeddings computed for each video (e.g., from its visual, text, and audio signals).

These content embeddings are concatenated [00:06:08]. The system then learns 1,000 cluster IDs (from 100 million videos) using K-means clustering [00:06:17]. These trainable cluster IDs are mapped to their own embedding table [00:06:38]. The model encoder learns to map the content space via these cluster IDs to the behavioral space [00:06:43].
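
A minimal sketch of this setup, with assumed layer sizes and names (the offline K-means step produces a cluster ID per video, and the cluster-ID embedding table is trained jointly with the item-ID embeddings):

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def build_cluster_ids(content_embs: torch.Tensor, n_clusters: int = 1000) -> torch.Tensor:
    """Offline step: cluster concatenated content embeddings into semantic cluster IDs."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(content_embs.numpy())
    return torch.as_tensor(km.labels_, dtype=torch.long)  # one cluster ID per video

class ItemTower(nn.Module):
    def __init__(self, n_items: int, n_clusters: int = 1000, dim: int = 64):
        super().__init__()
        self.id_emb = nn.Embedding(n_items, dim)          # behavioral item-ID embedding
        self.cluster_emb = nn.Embedding(n_clusters, dim)  # trainable semantic-ID embedding
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, item_ids: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # Concatenate behavioral and semantic embeddings, then project to the shared space
        x = torch.cat([self.id_emb(item_ids), self.cluster_emb(cluster_ids)], dim=-1)
        return self.proj(x)  # item representation used in the two-tower dot product
```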

The outcomes were significant:

  • Semantic IDs outperformed regular hash-based IDs on clicks and likes [00:06:59].
  • Cold-start coverage (new videos shared) increased by 3.6% [00:07:07].
  • Cold-start velocity (new videos reaching a view threshold) also increased substantially [00:07:17].

The benefits of semantic IDs include addressing cold-start issues and enabling recommendations that understand content [00:07:32]. This allows for human-readable explanations of why a user might like a particular item, by blending LLMs with semantic IDs [00:07:48].

Data Augmentation

Machine learning relies on high-quality data at scale, especially for search and recommendation systems, which require extensive metadata for query expansion, synonyms, and spell checking [00:08:04]. This data collection is costly and high-effort, traditionally relying on human annotations or automatic methods [00:08:31]. LLMs have proven outstanding at generating synthetic data and labels [00:08:35].

Indeed’s Lightweight Classifier

Indeed faced the challenge of sending poor job recommendations via email, leading to user distrust and unsubscribes [00:08:51]. Explicit feedback (thumbs up/down) was sparse, and implicit feedback was often imprecise [00:09:25].

Their solution was a lightweight classifier to filter out bad recommendations [00:09:50]. Their journey to this solution was iterative:

  1. Human Labeling: Experts labeled job recommendations and user pairs based on resume and activity data [00:10:03].
  2. Open LLMs: Prompting open LLMs like Mistral and Llama 2 yielded very poor performance: the models could not focus on the relevant resume and job-description details, and their outputs were generic [00:10:20].
  3. GPT-4: GPT-4 worked well with 90% precision and recall, but was too costly and slow (22 seconds) for practical use [00:10:38].
  4. GPT-3.5: GPT-3.5 had very poor precision (63%), meaning 37% of “bad” recommendations it flagged were actually good, which was unacceptable [00:10:56].
  5. Fine-tuned GPT-3.5: Fine-tuning GPT-3.5 achieved the desired precision (0.83) at a quarter of GPT-4’s cost and latency, but was still too slow (6.7 seconds) for online filtering [00:11:29].
  6. Distilled Lightweight Classifier: Finally, they distilled a lightweight classifier using labels from the fine-tuned GPT-3.5 [00:11:51]. This classifier achieved high performance (0.86 AUROC) and was fast enough for real-time filtering [00:11:58].
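
A minimal distillation sketch under assumed inputs (placeholder features and placeholder LLM-generated labels, not Indeed's actual pipeline): a cheap classifier is fit on the LLM's offline labels and used for fast online filtering.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.random((10_000, 32))        # placeholder features for (user, job) pairs
y_llm = rng.integers(0, 2, 10_000)  # placeholder LLM-generated "bad recommendation" labels

# Distillation step: train a lightweight model on the LLM's labels
clf = LogisticRegression(max_iter=1000).fit(X[:8_000], y_llm[:8_000])
print("AUROC:", roc_auc_score(y_llm[8_000:], clf.predict_proba(X[8_000:])[:, 1]))

def keep_recommendation(features: np.ndarray, threshold: float = 0.5) -> bool:
    """Serve-time check: drop the recommendation if P(bad) exceeds the threshold."""
    return clf.predict_proba(features.reshape(1, -1))[0, 1] < threshold
```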

The outcome was a 20% reduction in bad recommendations [00:12:20]. Contrary to initial fears, this led to a 4% increase in application rate and a 5% decrease in unsubscribe rate, demonstrating that quality in recommendations makes a significant difference over mere quantity [00:12:44].

Spotify’s Query Recommendation System

Spotify, known for music, introduced podcasts and audiobooks, facing a cold-start problem for these new content categories [00:13:04]. Exploratory search was crucial for expanding beyond music [00:13:40].

The solution was a query recommendation system [00:13:53]. New queries were generated from catalog/playlist titles and search logs, and augmented with LLMs to generate natural language queries [00:14:01]. This approach leverages conventional techniques and augments them with LLMs only where needed [00:14:26]. These exploratory queries are then ranked alongside immediate search results [00:14:34].
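
A rough sketch of the LLM-augmentation step, with a placeholder `complete` helper standing in for whatever LLM client is used (not Spotify's actual pipeline):

```python
def complete(prompt: str) -> str:
    # Placeholder: plug in an LLM client here
    raise NotImplementedError

def exploratory_queries(seed_titles: list[str], n: int = 5) -> list[str]:
    # Seed candidates come from catalog/playlist titles and mined search logs
    seeds = ", ".join(seed_titles[:20])
    prompt = (
        f"Here are titles from our catalog and playlists: {seeds}\n"
        f"Suggest {n} short natural-language search queries a listener might type "
        "to explore podcasts or audiobooks related to these titles."
    )
    return [line.strip("-• ").strip() for line in complete(prompt).splitlines() if line.strip()]
```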

This strategy led to a 9% increase in exploratory queries, meaning one-tenth of Spotify’s users were exploring new products daily, significantly accelerating new product category growth [00:15:11].

The benefits of LLM-augmented synthetic data include richer, high-quality data at scale, even for tail queries and items, at a significantly lower cost and effort compared to human annotation [00:15:34].

Unified Models

Traditional company setups often involve separate systems and bespoke models for ads, recommendations (homepage, item, cart, thank you page, etc.), and search [00:16:03]. This results in duplicative engineering pipelines, high maintenance costs, and limited knowledge transfer between models [00:16:36].

The solution is unified models (or foundation models), similar to how they’ve worked for vision and language tasks [00:16:47].

Netflix’s Unified Contextual Ranker (Unicorn)

Netflix faced high operational costs and missed learning opportunities due to teams building bespoke models for search, similar item recommendations, and pre-query recommendations [00:17:16].

Their solution, Unicorn, is a unified ranker [00:17:32]. It uses a user foundation model and a context/relevance model [00:17:39]. Unicorn takes unified input, allowing different use cases and features to use the same data schema [00:17:53]. Input includes user ID, item ID, search query (if present), country, and task (e.g., search, pre-query, more like this) [00:18:08]. Missing items are smartly imputed; for example, in item-to-item recommendations, the current item’s title can be used as a search query [00:18:27].
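
A sketch of what such a unified input schema with query imputation could look like; the field and task names are illustrative assumptions, not Netflix's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedInput:
    user_id: str
    item_id: str
    country: str
    task: str                    # "search", "pre_query", "more_like_this", ...
    query: Optional[str] = None  # only present for search-style tasks

def impute_query(example: UnifiedInput, item_titles: dict[str, str]) -> UnifiedInput:
    # For item-to-item tasks with no query, fall back to the current item's title
    if example.query is None and example.task == "more_like_this":
        example.query = item_titles.get(example.item_id, "")
    return example
```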

This unified model was able to match or exceed the metrics of specialized models across multiple tasks [00:18:48]. This unification removes technical debt and builds a better foundation for future iterations, leading to faster development [00:19:02].

Etsy’s Unified Embeddings

Etsy aimed to help users find better results for both very specific and very broad queries, especially with its constantly changing inventory [00:19:22]. Existing lexical retrieval did not account for user preferences [00:19:50].

Their solution was unified embedding and retrieval [00:20:01]. Similar to the Kwai model, they use a product tower and a query encoder [00:20:09]. Product encoding uses T5 models for text embeddings (from item descriptions and query-product logs) [00:20:21]. Both towers share encoders for text tokens, product category tokens, and user location, enabling embeddings to match users to product locations [00:20:34]. User preferences (via query user scale effect features) are encoded for personalization [00:20:57].

Their system architecture also included a “quality vector” (ratings, freshness, conversion rate) concatenated to the product embedding [00:21:17]. A constant vector was appended to the query embedding so the dimensions still match for dot product or cosine similarity [00:21:38].
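
A minimal sketch of that dimension-matching trick, as I read the description above (names and the constant-padding choice are assumptions):

```python
import torch

def score(query_emb: torch.Tensor, product_emb: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
    # query_emb: (d,), product_emb: (d,), quality: (q,) e.g. [rating, freshness, conversion rate]
    product_full = torch.cat([product_emb, quality])               # quality appended to product side
    query_full = torch.cat([query_emb, torch.ones_like(quality)])  # constants pad the query side
    return torch.dot(query_full, product_full)                     # dimensions now match

print(score(torch.randn(64), torch.randn(64), torch.tensor([4.8, 0.9, 0.02])))
```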

The results were impressive: a 2.6% increase in conversion across the entire site and over 5% increase in search purchases [00:21:51].

Netflix’s Foundation Model for Personalization

Netflix has diverse recommendation needs across various content types (movies, TV shows, games, live streaming) and pages (homepage, search page, kids homepage, mobile feed) [02:23:32]. Traditionally, this led to many specialized models developed independently, with duplicated efforts in feature and label engineering [02:24:52]. This system was not scalable, offered little leverage, and hindered innovation velocity [02:26:29].

Netflix’s “big bet” is to centralize user representation learning in one foundation model based on transformer architecture [02:27:10]. Their hypotheses are:

  1. Scalability: Scaling up semi-supervised learning can improve personalization, similar to how scaling laws apply to LLMs [02:27:36].
  2. High Leverage: Integrating the foundation model into all systems can simultaneously improve all downstream models [02:27:49].

Data and Training:

  • Data Cleaning and Tokenization: Critical initial steps. Unlike LLMs, where each token is a single word or subword piece, recommendation systems deal with rich interaction-event tokens, each having multiple facets (when, where, what) [02:28:38]. The granularity of tokenization and its trade-off with context window size are carefully considered [02:29:42].
  • Model Layers:
    • Event Representation: Captures information about when, where (physical location, device, page), and what (target entity, interaction type, duration) an event happened [02:30:31].
    • Embedding Feature Transformation: Combines ID embedding learning with semantic content information to address cold-start problems, which are common in recommendations but not LLMs [02:31:21].
    • Transformer Layer: Hidden state output is used as user representation [02:32:07]. Considerations include representation stability, aggregation methods (across time or layers), and explicit adaptation for downstream objectives [02:32:24].
    • Objective/Loss Function: Richer than in LLMs, with multiple output sequences. Targets can include entity IDs, action type, entity metadata (type, genre, language), or action details (duration, device, next play time) [02:33:04]. These can be used for multi-task learning, or as weights/rewards/masks on the loss function [02:34:01].
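
The event representation and multi-facet targets above can be sketched roughly as follows; the field names are assumptions, not Netflix's actual schema:

```python
from dataclasses import dataclass

@dataclass
class InteractionEvent:
    timestamp: int        # when
    device: str           # where: device, page, location
    page: str
    entity_id: str        # what: the target entity
    action_type: str      # play, thumbs-up, add-to-list, ...
    duration_sec: float

def training_targets(event: InteractionEvent) -> dict:
    # Several output heads can be trained at once (multi-task); some fields can instead
    # be used as weights or masks on the loss rather than as targets.
    return {
        "entity_id": event.entity_id,
        "action_type": event.action_type,
        "duration": event.duration_sec,
    }
```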

Netflix observed that scaling laws apply to their foundation model, showing continuous gains over two and a half years, scaling up to a billion model parameters [02:34:27].

Learnings borrowed from LLMs include:

  • Multi-token prediction: Forces the model to be less myopic, makes it more robust to serving-time shifts, and targets long-term user satisfaction [02:35:36]; a minimal sketch follows this list.
  • Multi-layer representation: Techniques like layer-wise supervision and multi-layer output aggregation lead to better and more stable user representations [02:36:12].
  • Long context window handling: Truncated sliding windows, sparse attention, and progressively training on longer sequences keep training efficient [02:36:34].
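
A minimal sketch of the multi-token prediction idea from the list above (an assumed setup, not Netflix's implementation): several heads predict the next k interactions instead of only the next one.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int, k: int = 4):
        super().__init__()
        # One linear head per future position, instead of a single next-item head
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, vocab_size) for _ in range(k))

    def forward(self, user_state: torch.Tensor) -> list[torch.Tensor]:
        # user_state: (batch, hidden_dim) transformer output at the last position
        return [head(user_state) for head in self.heads]  # k sets of next-item logits

# Training would apply cross-entropy of head i against the (i+1)-th future item.
logits = MultiTokenHead(hidden_dim=128, vocab_size=50_000)(torch.randn(2, 128))
print([l.shape for l in logits])  # 4 tensors of shape (2, 50000)
```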

Serving and Applications: The foundation model consolidates the data and representation layers, particularly for user and content representations [02:37:12]. Application models become thinner layers built on top of the foundation model [02:37:28]. The foundation model can be integrated in three ways:

  1. Subgraph: Integrated as a subgraph within the downstream model, replacing existing sequence transformer towers [02:37:47].
  2. Pushed-out Embeddings: Content and member embeddings can be pushed to a centralized store, allowing wider use cases beyond personalization [02:38:15].
  3. Fine-tuning/Distillation: Models can be extracted and fine-tuned or distilled for specific applications to meet online serving requirements [02:38:53].

Netflix has seen high leverage, with many applications incorporating the foundation model and showing AB test wins, alongside infrastructure consolidation [02:39:12]. This has validated their big bets, proving the solution is scalable and provides high leverage and faster innovation velocity [02:39:46].

Future directions include:

  • Universal representation for heterogeneous entities: Further developing semantic IDs as Netflix expands content types [02:40:23].
  • Generative retrieval for collection recommendation: Generating collections of recommendations, where business rules or diversity can be handled during multi-step decoding [02:40:40].
  • Faster adaptation through prompt tuning: Training soft tokens to prompt the foundation model to behave differently at inference time [02:40:59].

Pinterest’s LLM Integration for Search Relevance

Pinterest handles over six billion searches per month across billions of pins, supporting over 45 languages [02:29:02]. Their search backend involves query understanding, retrieval, re-ranking, and blending [02:29:42]. They focused on semantic relevance modeling in the re-ranking stage [02:30:02].

Their search relevance model is a classification model predicting the relevance of a pin to a search query on a five-point scale [02:30:18].

  • Model Architecture: Query and pin text are concatenated and passed into an LLM using a cross-encoder structure to capture their interaction [02:31:11]. The LLM embedding feeds into an MLP layer to produce a five-dimensional vector for the relevance levels [02:31:30]. Open-source LLMs are fine-tuned with Pinterest internal data [02:31:42]. A sketch of this cross-encoder setup follows this list.
  • LLM Effectiveness: LLMs substantially improved relevance prediction compared to Pinterest’s in-house content and query embedding baseline [02:31:58]. Using more advanced LLMs and increasing model size continued to improve performance, with an 8-billion parameter LLM showing 20% improvement over the search stage embedding model [02:32:20].
  • Useful Annotations: Vision-language model generated captions and user actions (e.g., user-curated boards, queries leading to pins) are useful content annotations [02:32:50].
  • Efficiency for Serving: Serving LLMs at low latency (e.g., 400-500 ms) requires efficiency work [02:33:23]. Four levers:
    1. Smaller Models: Going “big and then small” by distilling larger models (e.g., 150B parameters) step-by-step into smaller ones (8B, 3B, 1B) maintains reasoning power and improves throughput [02:33:55].
    2. Pruning: Gradually pruning redundant layers or parameters in the transformer is more effective than aggressive pruning, which can significantly reduce model quality [02:34:45].
    3. Quantization: Using lower precision (e.g., FP8) for activations and parameters. Mixed precision is important, with the LLM head (prediction output) needing FP32 to maintain precision and calibration [02:36:03].
    4. Sparsification: Sparsifying attention so that not every item attends to every other item, especially when recommending multiple items simultaneously [02:36:53].
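
A hedged sketch of the cross-encoder setup described above; the checkpoint below is a generic placeholder and the pooling choice is an assumption, not Pinterest's fine-tuned model:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")
head = nn.Linear(encoder.config.hidden_size, 5)  # five relevance levels

def relevance_logits(query: str, pin_text: str) -> torch.Tensor:
    # Cross-encoding: query and pin text go through the model together so attention
    # can capture their interaction.
    batch = tokenizer(query, pin_text, return_tensors="pt", truncation=True)
    hidden = encoder(**batch).last_hidden_state[:, 0]  # first-token pooled representation
    return head(hidden)                                # (1, 5) scores on the 5-point scale
```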

By combining these techniques, Pinterest achieved a 7x reduction in latency and a 30x increase in throughput (queries per GPU) [02:37:37].

Instacart’s LLMs for Search and Discovery

Instacart focuses on online grocery, where customers have long shopping lists, mostly restocking purchases [02:46:40]. Search plays a dual role: enabling quick, efficient finding of known products and facilitating new product discovery [02:47:07]. New product discovery benefits customers (new items), advertisers (showcasing products), and the platform (larger basket sizes) [02:47:17].

Challenges with Conventional Search Engines:

  1. Overly Broad Queries: A query like “snacks” matches a huge number of products [02:47:48]. Engagement data is hard to collect for items that are never exposed, creating a cold-start problem [02:47:57].
  2. Very Specific Queries: Queries like “unsweetened plant-based yogurt” are infrequent and lack sufficient engagement data [02:48:12].
  3. New Item Discovery: Users desire an experience similar to physical stores (seeing related items) [02:48:43]. Conventional methods struggled with precision due to lack of engagement data [02:49:25].

Leveraging LLMs: Instacart used LLMs to uplevel their query understanding module, which is the most upstream part of the search stack [02:49:37].

  1. Query to Product Category Classifier: Maps a query to categories in their 10,000-label taxonomy [02:50:16].

    • Previous Approach: A FastText-based neural network captured semantic relationships, with NPMI for statistical co-occurrence; coverage for tail queries was low due to lack of engagement data [02:50:46]. BERT-based models showed limited improvement relative to the added latency [02:51:22].
    • LLM Approach: Initially, LLMs were prompted with queries and the taxonomy to predict relevant categories. While the output seemed decent, online AB tests showed poor results: the LLMs lacked Instacart’s domain understanding (e.g., “protein” meaning supplements to users, but chicken or tofu to the LLM) [02:51:36].
    • Augmented LLM Prompt: The problem was rephrased by providing the LLM with the top-converting categories for each query as additional context [02:52:30]. This greatly simplified the problem for the LLM, leading to precise results (e.g., mapping “Vernors soda” to “ginger ale”) [02:53:00]. A minimal prompt sketch follows this list.
    • Results: For tail queries, precision improved by 18 percentage points and recall by 70 percentage points [02:53:26]. The prompt is very simple, passing converted categories and guidelines [02:53:42].
  2. Query Rewrites Model: Important for e-commerce where catalog variations mean queries may not always return results [02:54:08]. Rewrites like “1% milk” to “milk” ensure results [02:54:27].

    • Previous Approach: Engagement data-trained models performed poorly on tail queries [02:54:37].
    • LLM Approach: Similar to category classification, LLMs generated precise rewrites (substitute, broad, synonymous) [02:54:49].
    • Results: Offline improvements based on human evaluation, and online, a large drop in queries with no results, which was significant for the business [02:55:16].
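
A minimal sketch of the augmented category-classification prompt described above; the wording and the `complete` helper are placeholders, not Instacart's actual prompt or client:

```python
def complete(prompt: str) -> str:
    # Placeholder: plug in an LLM client here
    raise NotImplementedError

def classify_query(query: str, top_converting_categories: list[str]) -> list[str]:
    # Domain knowledge (top-converting categories) is passed as extra context
    prompt = (
        f"Query: {query}\n"
        f"Categories customers most often convert to for this query: "
        f"{', '.join(top_converting_categories)}\n"
        "Return the taxonomy categories this query should map to, one per line, "
        "staying consistent with the conversion data above."
    )
    return [line.strip() for line in complete(prompt).splitlines() if line.strip()]
```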

Scoring and Serving: Instacart’s query pattern has a fat head/torso and a long tail [02:56:06]. They precomputed outputs for head/torso queries offline in batch mode and cached them [02:56:18]. Online queries are served from the cache with low latency. For the long tail, they fall back to existing models, planning to replace them with a distilled LLM [02:56:37].
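
A sketch of that head/tail split (the cache and fallback interface are assumptions for illustration):

```python
# Filled offline in batch mode for head/torso queries
PRECOMPUTED_REWRITES = {"snacks": ["chips", "crackers", "popcorn"]}

def rewrites_for(query: str, fallback_model) -> list[str]:
    if query in PRECOMPUTED_REWRITES:       # head/torso query: low-latency cache lookup
        return PRECOMPUTED_REWRITES[query]
    return fallback_model.predict(query)    # long tail: existing model (later a distilled LLM)
```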

Discovery-Oriented Content: LLMs were used to show more discovery content in search results:

  • For queries with no exact results, LLMs generated substitute results (e.g., other seafood for “swordfish”) [02:59:17].
  • For queries with many exact results, complementary items were shown (e.g., Asian cooking ingredients for “sushi”) [02:59:31]. Both led to improved engagement and revenue [02:59:51].

Requirements for Content Generation:

  1. Incremental Content: Avoid duplicates of already shown results [03:00:07].
  2. Domain Alignment: LLM answers must align with Instacart’s domain knowledge (e.g., “dishes” meaning cookware, not food, unless “Thanksgiving dishes” is queried) [03:00:12].

Addressing Challenges:

  • Initial LLM Generation: Basic prompts produced common-sense answers but didn’t align with user behavior on Instacart (e.g., “protein” again) [03:00:36].
  • Augmented Prompt: Instacart’s domain knowledge (top converting categories, query annotations like brand/dietary attributes, subsequent user queries) was added to the prompt [03:01:40]. This yielded much better results, leading to significant engagement and revenue improvements [03:02:16].
  • Serving: Similar to query understanding, LLMs were called in batch mode on historical search logs, storing query content and potential products [03:02:29]. Online serving is a quick lookup from a feature store [03:02:51].
  • Key Challenges Solved:
    1. Aligning generation with business metrics: Iterating prompts and metadata to achieve topline wins [03:03:12].
    2. Improving ranking: Traditional pCTR/pCVR models didn’t work well, requiring strategies like diversity-based re-ranking [03:03:28].
    3. Evaluating content: Ensuring LLM outputs are accurate, do not hallucinate, and adhere to product needs, often using an LLM as a judge [03:04:19].

Key Takeaways:

  • LLM’s world knowledge improved query understanding predictions for tail queries [03:04:02].
  • Success came from combining Instacart’s domain knowledge with LLMs [03:04:11].
  • Evaluating content and query predictions was critical and more difficult than anticipated [03:04:21].
  • Consolidating multiple query understanding models into a single LLM could improve consistency [02:57:21].
  • LLMs for query understanding allow passing extra context to understand customer intent (e.g., recipe ingredients) [02:58:13].

YouTube’s Large Recommender Model (LRM) with Gemini

YouTube, one of the largest consumer apps, has most of its watch time driven by its recommendation system across home, watch next, shorts, and personalized search results [03:09:01]. The recommendation problem involves learning a function to provide recommendations based on user and context input (demographics, watch history, engagement, subscriptions) [03:09:44].

YouTube aimed to rethink its recommendation system on top of Gemini, leading to the Large Recommender Model (LRM) [03:10:26].

  • Adaptation: LRM starts with a base Gemini checkpoint, adapted for YouTube recommendations by teaching it YouTube-specific information [03:11:32]. This unified, YouTube-specific Gemini checkpoint (LRM) is then aligned for tasks like retrieval and ranking, creating custom versions for major recommendation surfaces [03:11:46].
  • Tokenizing Videos (Semantic IDs): The first step was to tokenize videos so the LLM could take video tokens as input and output video tokens as recommendations [03:12:24]. Given the need to reason over many videos with up to a million tokens of context, video representation had to be compressed [03:12:52].
    • They built semantic IDs by extracting features (title, description, transcript, audio, and video frames) into a multi-dimensional embedding [03:13:14]. This embedding is quantized using RQ-VAE to give each video a token [03:13:36].
    • This creates an atomic unit for a new language of YouTube videos, organizing billions of videos around semantically meaningful tokens (e.g., music, gaming, sports, specific sports) [03:13:47]. This was a milestone in moving from hash-based to semantically meaningful tokenization [03:14:23].
  • Continued Pre-training: LRM is trained to understand both English and this new YouTube language in two steps:
    1. Linking Text and SID: Teaching the model to connect text (e.g., video title, creator, topics) with video tokens (SID) [03:14:48].
    2. Understanding Watch Sequences: Using YouTube engagement data (user watch paths), the model learns to predict masked videos in a sequence, understanding relationships between videos based on user engagement [03:15:26].

After pre-training, the model can reason across English and YouTube videos. For example, given a sequence of videos, it can infer shared topics like sports or technology based on their semantic ID [03:16:07].
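
The semantic-ID tokenization described above can be illustrated with a simplified residual-quantization sketch (not the actual RQ-VAE): each video embedding becomes a short sequence of codebook indices.

```python
import torch

def residual_quantize(x: torch.Tensor, codebooks: list[torch.Tensor]) -> list[int]:
    # x: (d,) video embedding; codebooks: list of (K, d) tensors, one per level
    ids, residual = [], x.clone()
    for codebook in codebooks:
        idx = torch.cdist(residual.unsqueeze(0), codebook).argmin().item()  # nearest code
        ids.append(idx)
        residual = residual - codebook[idx]  # the next level quantizes what is left over
    return ids

codebooks = [torch.randn(256, 64) for _ in range(3)]  # 3 levels, 256 codes each
print(residual_quantize(torch.randn(64), codebooks))  # e.g. [17, 203, 64]
```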

  • Generative Retrieval: LRM is used for generative retrieval by constructing personalized prompts for each user, including demographics, context video, and watch history [03:17:03]. The model then decodes video recommendations as SIDs [03:17:37]. This yields unique recommendations, especially for hard recommendation tasks (e.g., finding related women’s track races based on a user’s Olympic highlight watch history) [03:17:41]. A prompt-construction sketch follows this list.
  • Serving Costs: LRM is powerful and learns quickly but is expensive to serve. YouTube achieved over 95% cost savings to launch it in production [03:18:36].
  • Offline Recommendation Table: They also created an offline recommendation table by removing personalized aspects from the prompt, allowing for simple lookup serving for “watch next” type recommendations [03:19:10].
  • Challenges for YouTube Recommendations:
    1. Vocabulary and Corpus Size: YouTube’s billions of videos (millions added daily) dwarf LLM text vocabularies [03:20:23].
    2. Freshness: New videos (e.g., Taylor Swift music video) must be recommended within minutes or hours, requiring continuous pre-training on the order of days/hours, unlike classical LLM pre-training (months) [03:20:51].
    3. Scale: Serving models to billions of daily active users necessitates focusing on smaller, more efficient models (like Gemini Flash or smaller checkpoints) to meet latency and scale requirements [03:21:52].
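
A prompt-construction sketch for the generative retrieval step above; the format and field names are assumptions, and the real LRM decodes semantic-ID tokens rather than free text:

```python
def build_lrm_prompt(demographics: dict, context_video_sid: str, watch_history_sids: list[str]) -> str:
    # Personalized prompt: demographics, context video, and recent watch history as SIDs
    return (
        f"User: age_group={demographics.get('age_group')}, region={demographics.get('region')}\n"
        f"Context video: {context_video_sid}\n"
        f"Watch history: {' '.join(watch_history_sids[-50:])}\n"
        "Recommend the next videos as semantic-ID tokens:"
    )

print(build_lrm_prompt({"age_group": "25-34", "region": "US"}, "SID_4821", ["SID_17", "SID_993"]))
```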

The benefits of unified models include system simplification, where improvements to one part of the model transfer to other use cases, leading to faster iteration [00:22:10]. However, there can be an “alignment tax,” where improving one task may worsen another, potentially requiring a split into multiple unified models [00:22:24].

Recipe for LLM-Based Recommendation Systems

A general recipe for building LLM-based recommendation systems involves three major steps [03:22:28]:

  1. Tokenize Your Content:

    • Goal: Create an atomic token representation of your content.
    • Method: Build rich representations from features, create embeddings, and then tokenize or quantize them.
    • Outcome: A domain-specific language for your content [03:22:50].
  2. Adapt the LLM:

    • Goal: Link the LLM’s natural language understanding with your new domain language.
    • Method: Find training tasks that enable the LLM to reason across both English and your domain-specific tokens.
    • Outcome: A bilingual LLM that understands both natural language and your domain language [03:22:56].
  3. Prompt with User Information:

    • Goal: Create a generative recommendation system.
    • Method: Construct personalized prompts using user demographics, activity, and actions. Train task-specific or surface-specific models.
    • Outcome: An LLM-powered system capable of generating recommendations [03:23:21].

Future Directions

Currently, LLMs primarily augment recommendation systems, enhancing quality invisibly to users [03:23:56]. However, the future holds more interactive experiences:

  • Users will be able to communicate with the recommender in natural language to steer recommendations towards their goals [03:24:28].
  • The recommender will be able to explain why a candidate was recommended [03:24:42].
  • The lines between search and recommendation will blur [03:24:52].
  • Eventually, recommendations may evolve into generative content, creating personalized versions of content or even entirely new content tailored for the user [03:25:01].