From: aidotengineer
The intersection of recommendation systems and large language models (LLMs) is a significant area of development, promising to revolutionize how users discover content and products [00:02:47]. This integration builds upon a long history of language modeling techniques in recommendation systems, which date back to 2013 with item embeddings and later evolved with GRU4Rec and transformers for handling longer interaction sequences [00:03:06].
Challenges in Conventional Recommendation Systems
Traditional recommendation systems face several inherent challenges:
- Hash-based Item IDs: These do not inherently encode content, leading to the “cold-start problem” for new items where systems must relearn everything about them [00:04:00].
- Sparsity: Many “tail” items have very few interactions, making it difficult for models to learn effectively [00:04:24].
- Popularity Bias: Systems often favor popular items, struggling to recommend new or niche content [00:04:30].
- Data Quality and Scale: Machine learning models, especially for search and recommendations, require vast amounts of high-quality, metadata-rich data, which is costly and labor-intensive to acquire through traditional means [00:08:04].
- System Silos: Historically, the systems for ads, recommendations, and search have operated separately, leading to duplicated engineering effort, high maintenance costs, and limited knowledge transfer between models [00:16:03].
Personalization Strategies Leveraging LLMs
Three key strategies are emerging to address these challenges and enhance personalization:
1. Semantic IDs
Semantic IDs encode the content of an item, including multimodal information, allowing recommendations to understand content [00:04:39]. This approach directly tackles the cold-start problem for new items [00:07:34].
Example: Kwai’s Trainable Multimodal Semantic IDs
Kwai, a short-video platform, faced the challenge of learning from hundreds of millions of daily video uploads [00:04:59]. Their solution involved combining static content embeddings with dynamic user behavior using trainable multimodal semantic IDs [00:05:08].
- Architecture: They used a standard two-tower network. Content inputs (visual via ResNet, video descriptions via BERT, audio via VGGish) were concatenated [00:05:15].
- Clustering: Non-trainable content embeddings were used to learn 1,000 cluster IDs via K-means clustering from 100 million short videos [00:06:17]. These cluster IDs were mapped to their own embedding table [00:06:39] (see the sketch after this list).
- Learning: The model encoder learned to map the content space via these cluster IDs to the behavioral space [00:06:44].
- Outcome: These semantic IDs not only outperformed hash-based IDs on clicks and likes but also significantly increased cold-start coverage (by 3.6%) and cold-start velocity, enabling new videos to reach view thresholds faster [00:06:59].
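A minimal sketch of the cluster-ID idea above, assuming precomputed (frozen) content embeddings; the dimensions, item counts, and the small item tower are illustrative stand-ins, not Kwai's actual system:

```python
# Sketch: trainable multimodal semantic IDs via content clustering.
# Assumptions: content embeddings are precomputed (e.g., concatenated ResNet/BERT/VGGish
# outputs); sizes here are illustrative (Kwai used 1,000 clusters over ~100M videos).
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import MiniBatchKMeans

NUM_CLUSTERS = 1000          # Kwai's 1,000 cluster IDs
CONTENT_DIM = 256            # illustrative concatenated content-embedding size

# 1) Cluster frozen content embeddings to obtain semantic (cluster) IDs.
content_embeddings = np.random.randn(20_000, CONTENT_DIM).astype("float32")
kmeans = MiniBatchKMeans(n_clusters=NUM_CLUSTERS, random_state=0)
cluster_ids = kmeans.fit_predict(content_embeddings)       # one cluster ID per item

# 2) Map cluster IDs to their own trainable embedding table, so the behavioral
#    model learns from content-derived IDs instead of opaque hashes.
class ItemTower(nn.Module):
    def __init__(self, num_clusters: int, id_dim: int = 64):
        super().__init__()
        self.cluster_embedding = nn.Embedding(num_clusters, id_dim)
        self.mlp = nn.Sequential(nn.Linear(id_dim, 64), nn.ReLU(), nn.Linear(64, 32))

    def forward(self, cluster_id: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.cluster_embedding(cluster_id))

item_tower = ItemTower(NUM_CLUSTERS)
item_vecs = item_tower(torch.as_tensor(cluster_ids[:8], dtype=torch.long))  # (8, 32)

# A new (cold-start) video gets a meaningful ID immediately: assign it to its
# nearest content cluster, no interaction history required.
```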
Future integrations may involve blending LLMs with semantic IDs to explain why a user might like a recommendation, providing human-readable explanations [00:07:48]. YouTube has also built semantic IDs by distilling multimodal video features into semantically meaningful tokens, organizing billions of videos under this shared vocabulary [03:06:52].
2. Data Augmentation
LLMs excel at generating synthetic data and labels, providing richer, high-quality data at scale, especially for tail queries and items, at a significantly lower cost and effort than human annotation [00:08:35].
Example: Indeed’s Job Recommendation Filtering
Indeed faced the challenge of bad job recommendations leading to poor user experience and unsubscriptions [00:09:01]. Explicit feedback (thumbs up/down) was sparse, and implicit feedback was imprecise [00:09:25].
- Solution: They developed a lightweight classifier to filter bad recommendations [00:09:50].
- Process:
- Human experts labeled pairs of job recommendations and users, based on resume and activity data [00:10:05].
- Open LLMs (Mistral, Llama 2) showed poor performance due to generic output [00:10:20].
- GPT-4 performed well (90% precision and recall) but was too costly and slow (22 seconds per prediction) [00:10:38].
- GPT-3.5 had poor precision, incorrectly filtering out 37% of good recommendations [00:10:56].
- Fine-tuning GPT-3.5 achieved the desired precision but was still too slow (6.7 seconds per prediction) for online filtering [00:11:30].
- Finally, they distilled a lightweight classifier using labels from the fine-tuned GPT-3.5, achieving high performance (0.86 AUC-ROC) and real-time latency (see the sketch after this list) [00:11:51].
- Outcome: This reduced bad recommendations by 20%, increased application rates by 4%, and decreased unsubscribe rates by 5%, demonstrating that quality over quantity significantly improves recommendation impact [00:12:20].
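The distillation step can be sketched as follows; the feature construction, the logistic-regression student, and the `llm_label` stand-in for the fine-tuned GPT-3.5 labeler are illustrative assumptions, not Indeed's actual pipeline:

```python
# Sketch: distilling a lightweight "bad recommendation" filter from LLM labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def llm_label(features: np.ndarray) -> int:
    """Placeholder for the expensive fine-tuned LLM judging a (user, job) pair."""
    return int(features.sum() > 0)           # stand-in decision rule

# 1) Offline: cheap features per (user, job) pair, labels from the LLM (latency is fine here).
X = rng.normal(size=(5_000, 16))              # e.g., resume/job similarity features
y = np.array([llm_label(x) for x in X])

# 2) Distill into a lightweight classifier that can run at serving time.
student = LogisticRegression(max_iter=1_000).fit(X, y)
print("AUC vs. LLM labels:", roc_auc_score(y, student.predict_proba(X)[:, 1]))

# 3) Online: drop recommendations the student considers bad.
def keep_recommendation(features: np.ndarray, threshold: float = 0.5) -> bool:
    p_good = student.predict_proba(features.reshape(1, -1))[0, 1]
    return p_good >= threshold
```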
Example: Spotify’s Query Recommendation System
Spotify aimed to expand beyond music into podcasts and audiobooks, facing a cold-start problem for new content categories [00:13:04].
- Solution: They used an LLM to generate natural-language queries for an exploratory search system [00:13:50]. Existing techniques already generated queries from catalog titles, playlists, and search logs [00:14:01]; the LLM augmented this pool with more natural-sounding queries (a sketch follows this list) [00:14:16].
- Outcome: This led to a 9% increase in exploratory queries, meaning one-tenth of their users were now exploring new products daily, significantly accelerating category growth [00:15:11].
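A rough sketch of this kind of query generation; `call_llm` is a hypothetical helper standing in for whichever model Spotify used, and the prompt wording is invented for illustration:

```python
# Sketch: augmenting an exploratory-search query bank with LLM-generated queries.
from typing import List

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def generate_natural_queries(title: str, category: str, n: int = 5) -> List[str]:
    prompt = (
        f"Generate {n} natural-language search queries a listener might type "
        f"when looking for the {category} titled '{title}'. "
        "Return one query per line."
    )
    return [q.strip() for q in call_llm(prompt).splitlines() if q.strip()]

# The generated queries are added to the existing bank built from catalog titles,
# playlists, and search logs, then surfaced as query recommendations.
```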
Example: Instacart’s Search and Discovery
Instacart used LLMs to improve query understanding and product discovery, tackling challenges like overly broad or specific queries, and enabling new item discovery [02:47:32].
- Query to Product Category Classifier: Traditional models struggled with tail queries due to a lack of engagement data [02:50:46]. Initial LLM prompting was decent but failed in A/B tests because the LLM's answers did not match Instacart user behavior (e.g., “protein” meaning supplements vs. chicken) [02:51:36]. The fix was to augment the prompt with Instacart’s domain knowledge, such as the top converting categories for each query, significantly improving precision and recall for tail queries (a sketch follows this list) [02:52:30].
- Query Rewrites: LLMs generated precise rewrites (substitute, broad, synonymous) for queries, which was crucial for retailers with varying catalog sizes [02:54:08]. This drastically reduced queries with no results, boosting business [02:55:41].
- Discovery-Oriented Content: LLMs generated substitute and complementary product suggestions for search results pages (e.g., seafood alternatives for swordfish, Asian cooking ingredients for sushi) [02:58:43]. Again, augmenting LLM prompts with Instacart domain knowledge (top converting categories, query annotations, subsequent user queries) was key to aligning generated content with user behavior and business metrics [03:01:39].
- Serving: Instacart precomputed LLM outputs for head and torso queries offline and cached them for low-latency online serving, falling back to existing models for the long tail, with plans to replace those with distilled LLMs [02:56:02].
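A minimal sketch of the domain-knowledge-augmented prompt for query-to-category classification; `call_llm`, the sample data, and the prompt wording are hypothetical stand-ins:

```python
# Sketch: grounding an LLM query-to-category classifier in platform-specific
# engagement data (top converting categories per query).
from typing import Dict, List

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

# Domain knowledge mined from logs: which categories a query actually converts to.
top_converting_categories: Dict[str, List[str]] = {
    "protein": ["chicken breast", "protein shakes", "greek yogurt"],
}

def classify_query(query: str) -> str:
    hints = ", ".join(top_converting_categories.get(query, []))
    prompt = (
        f"Map the grocery search query '{query}' to product categories.\n"
        f"On our platform this query most often converts to: {hints or 'unknown'}.\n"
        "Answer with a comma-separated list of categories."
    )
    return call_llm(prompt)
```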
3. Unified Models
Unified models aim to consolidate multiple specialized models for different recommendation tasks (e.g., homepage, item, cart recommendations) into a single, cohesive system [00:16:00]. This approach leverages shared learning and reduces engineering overhead [00:16:47].
Example: Netflix’s Unified Contextual Ranker (Unicorn)
Netflix sought to address the proliferation of bespoke models for various recommendation and search tasks [00:17:16].
- Solution: They developed Unicorn, a unified contextual ranker built on a user foundation model and a context/relevance model [00:17:31].
- Unified Input: The model uses a single data schema for all use cases, incorporating user ID, item ID, search query (if applicable), country, and task [00:17:53]. Smart imputation fills missing fields, for example using the current item’s title as the search query when none exists (see the sketch after this list) [00:18:27].
- Outcome: Unicorn matched or exceeded the metrics of specialized models across multiple tasks [00:18:48], reducing technical debt and accelerating future iterations [00:19:04].
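A small sketch of the shared schema and imputation rule described above; the field names follow the description, but the dataclass itself and the task labels are illustrative assumptions:

```python
# Sketch: one input schema shared across search and recommendation tasks,
# with simple imputation of missing fields.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class UnifiedExample:
    user_id: str
    item_id: str
    country: str
    task: str                      # e.g., "search", "pre_query", "more_like_this"
    search_query: Optional[str] = None

def impute(example: UnifiedExample, item_titles: Dict[str, str]) -> UnifiedExample:
    # Item-to-item tasks have no query, so fall back to the current item's title;
    # this lets one ranker serve both search and recommendation traffic.
    if example.search_query is None:
        example.search_query = item_titles.get(example.item_id, "")
    return example

ex = impute(
    UnifiedExample(user_id="u1", item_id="m42", country="US", task="more_like_this"),
    item_titles={"m42": "Stranger Things"},
)
```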
Scaling Foundation Models for Recommendations
Netflix observed that scaling laws, similar to those in LLMs, apply to recommendation systems; continuous scaling up of models and data yielded performance gains [02:34:31]. However, stringent latency and cost requirements for real-time recommendations necessitate distillation of larger models for production [02:35:08].
Learnings from LLM Development Applied to Recommendation Models
- Multi-Token Prediction: Forces the model to be less myopic, more robust to serving-time shifts, and oriented toward long-term user satisfaction (see the sketch after this list) [02:35:36].
- Multi-Layer Representation: Improves the stability and quality of user representations [02:36:12].
- Long Context Window Handling: Techniques from LLMs, such as truncated sliding windows and sparse attention, maximize learning and training efficiency [02:36:34].
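A sketch of how multi-item ("multi-token") targets can be built with a truncated sliding window over a user's interaction history; the window and horizon sizes are illustrative assumptions, not Netflix's settings:

```python
# Sketch: multi-item prediction targets from a truncated sliding window.
from typing import List, Tuple

def make_training_examples(
    history: List[str], window: int = 8, horizon: int = 3
) -> List[Tuple[List[str], List[str]]]:
    """Each example predicts the next `horizon` items rather than just one,
    which discourages myopic next-item-only behavior."""
    examples = []
    for t in range(1, len(history) - horizon + 1):
        context = history[max(0, t - window): t]   # truncated sliding window
        targets = history[t: t + horizon]          # multiple future interactions
        examples.append((context, targets))
    return examples

print(make_training_examples(["a", "b", "c", "d", "e", "f"], window=3, horizon=2))
```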
Integration and Application of Foundation Models
Netflix’s foundation model (FM) integrates with downstream models in three ways:
- Subgraph Integration: The FM can be used as a pre-trained subgraph within downstream neural networks (sketched below) [02:37:48].
- Embedding Push-out: Content and user embeddings from the FM are pushed to a centralized embedding store for wider use cases, including analytics [02:38:15].
- Model Extraction and Fine-tuning/Distillation: Specific applications can fine-tune or distill the FM to meet online serving requirements [02:38:51].
This approach yielded significant “wins” in both A/B test improvements and infrastructure consolidation [02:39:12].
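A sketch of the subgraph-integration pattern; the module shapes and the freeze-versus-fine-tune choice are illustrative assumptions about the general pattern, not Netflix's implementation:

```python
# Sketch: a pre-trained foundation model used as a subgraph inside a downstream ranker.
import torch
import torch.nn as nn

class UserFoundationModel(nn.Module):
    """Stand-in for a large pre-trained user model producing user representations."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, user_features: torch.Tensor) -> torch.Tensor:
        return self.encoder(user_features)

class DownstreamRanker(nn.Module):
    def __init__(self, fm: UserFoundationModel, item_dim: int = 64):
        super().__init__()
        self.fm = fm                               # pre-trained subgraph
        for p in self.fm.parameters():             # frozen here; could instead be
            p.requires_grad = False                # fine-tuned or distilled for latency
        self.head = nn.Linear(128 + item_dim, 1)   # task-specific scoring head

    def forward(self, user_features, item_features):
        user_repr = self.fm(user_features)
        return self.head(torch.cat([user_repr, item_features], dim=-1))

ranker = DownstreamRanker(UserFoundationModel())
scores = ranker(torch.randn(4, 128), torch.randn(4, 64))   # (4, 1) relevance scores
```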
Example: Etsy’s Unified Embeddings
Etsy used unified embeddings for search and retrieval to address challenges with specific or broad queries and constantly changing inventory [01:59:30].
- Architecture: Similar to a two-tower model, with a product encoder (using T5 for text embeddings and query-product logs) and a query encoder. Both share encoders for text tokens, product categories, and user location [02:00:03]. User preferences are personalized via query-user scale effect features [02:05:40].
- Quality Vector: A “quality vector” (ratings, freshness, conversion rate) was concatenated to the product embedding, and a constant vector was appended to the query embedding so that the dimensions match for the similarity calculation (see the sketch after this list) [02:11:18].
- Outcome: This resulted in a 2.6% increase in conversion across the entire site and over 5% increase in search purchases [02:18:50], demonstrating the strong impact of unified models on core business metrics.
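A minimal sketch of the quality-vector trick; the dimensions and quality features are illustrative, not Etsy's actual values:

```python
# Sketch: concatenating a quality vector to the product embedding and a constant
# vector to the query embedding, so the dot product decomposes into
# (semantic relevance) + (quality score).
import numpy as np

product_emb = np.random.randn(256)                     # from the product encoder
query_emb = np.random.randn(256)                       # from the query encoder
quality = np.array([4.7 / 5, 0.9, 0.12])               # ratings, freshness, conversion

product_vec = np.concatenate([product_emb, quality])   # embedding + quality vector
query_vec = np.concatenate([query_emb, np.ones(3)])    # constant pad to match dims

score = query_vec @ product_vec
# score == query_emb @ product_emb + quality.sum(): relevance plus a quality boost.
```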
Future Directions
The application of LLMs in personalization is rapidly evolving:
- Invisible Augmentation: Currently, LLMs largely augment recommendation quality invisibly to users [02:39:56].
- Interactive Experiences: Future developments aim for users to directly interact with recommendation systems using natural language, allowing them to steer recommendations, receive explanations, and align recommendations with their specific goals [02:40:07]. This will blur the lines between search and recommendation [02:40:52].
- Generative Content: Ultimately, recommendation systems may not just recommend content but also generate personalized versions of it, leading to “n-of-1” content created specifically for individual users [02:41:01].
The recipe for building an LLM-based recommendation system involves three major steps [02:42:25]:
- Content Tokenization: Creating a domain-specific language by tokenizing content (e.g., video frames, audio, text) into atomic units (like semantic IDs) [02:42:30].
- LLM Adaptation: Teaching an LLM to understand both natural language and this new domain-specific language, creating a bilingual LLM through targeted training tasks [02:42:57].
- Personalized Prompting: Using user information (demographics, activity) to construct personalized prompts, leading to a generative recommendation system (see the sketch after this list) [02:43:21].
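A sketch of what such a personalized prompt might look like once content has been tokenized into semantic IDs; the token format, user fields, and `call_llm` are hypothetical stand-ins for illustration:

```python
# Sketch: a prompt mixing natural language with semantic-ID tokens for a
# domain-adapted ("bilingual") LLM.
from typing import List

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in the domain-adapted (bilingual) LLM here")

def semantic_id_token(item_id: int) -> str:
    # In a real system these tokens come from content tokenization (semantic IDs).
    return f"<sid_{item_id}>"

def build_prompt(age_band: str, country: str, watch_history: List[int]) -> str:
    history_tokens = " ".join(semantic_id_token(i) for i in watch_history)
    return (
        f"User profile: age {age_band}, country {country}.\n"
        f"Watch history: {history_tokens}\n"
        "Recommend the next five videos as semantic-ID tokens."
    )

print(build_prompt("25-34", "US", [101, 204, 87]))
```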
This approach represents a significant advancement in personalization, promising more relevant, dynamic, and interactive user experiences [02:43:36].