From: aidotengineer

The “cold start problem” in recommendation systems refers to the challenge of making accurate recommendations for new users or new items that have little to no interaction data [00:04:15]. This issue also applies to “tail items”—products or content with very few interactions—making it difficult for systems to learn and recommend them effectively [00:04:26]. Recommendation systems often exhibit a popularity bias, struggling with new items and data sparsity [00:04:30].

Challenges Posed by Cold Start [00:04:15]

When a new item is introduced, the system must “relearn” everything about it [00:04:19]. This lack of initial interaction data leads to:

  • Data Sparsity: Many items, especially those in the “long tail,” have few or no user interactions, leaving traditional models too little signal to learn from [00:04:26].
  • Popularity Bias: Recommendation systems tend to favor popular items due to abundant interaction data, further exacerbating the visibility problem for new or niche content [00:04:32].
  • Poor User Experience: For platforms like Indeed, bad job recommendations for new users can lead to a loss of trust and unsubscribes, which are very difficult to recover from [00:09:06].
  • Limited Discovery: Users struggle to find new or specific items, as seen with overly broad or highly specific queries on platforms like Instacart [02:47:48].

Addressing Cold Start with AI [00:04:36]

Several advanced techniques, often leveraging large language models (LLMs), are being developed and applied to mitigate the cold start problem.

1. Semantic IDs [00:04:39]

Unlike hash-based item IDs, which do not encode an item’s content [00:04:11], semantic IDs incorporate the item’s inherent content, potentially even multimodal content (e.g., visual, audio, text) [00:04:39]. This lets the recommendation system understand item content directly [00:07:38].

Case Study: Kuaishou Kuaishou, a short-video platform, faces the challenge of users uploading hundreds of millions of new short videos daily, making it hard for the system to learn about each one [00:05:03]. Their solution uses trainable multimodal semantic IDs [00:05:15].

  • Process: They encode visual content with ResNet, video descriptions with BERT, and audio with VGGish [00:05:49]. These content embeddings are concatenated [00:06:10]. K-means clustering is applied to learn 1,000 cluster IDs from 100 million videos, which are then mapped to an embedding table [00:06:17]. A model encoder learns to map the content space to the behavioral space via these trainable cluster IDs [00:06:45].
  • Outcome: Semantic IDs not only outperform hash-based IDs on clicks and likes but significantly increase “cold-start coverage” (new videos shared) by 3.6% and “cold-start velocity” (new videos hitting view thresholds) [00:07:07].
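
A minimal sketch of this pipeline, assuming off-the-shelf encoders and scikit-learn k-means; the function names, dimensions, and toy corpus below are stand-ins rather than Kuaishou’s actual stack:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

EMB_DIM = 32  # toy dimension; real encoders produce much larger vectors

# Stand-ins for pretrained content encoders (ResNet for frames, BERT for the
# description, VGGish for audio). Replace with real models in practice.
def encode_visual(frames):    return np.random.rand(EMB_DIM)
def encode_text(description): return np.random.rand(EMB_DIM)
def encode_audio(waveform):   return np.random.rand(EMB_DIM)

def content_embedding(video):
    # Concatenate the per-modality embeddings into one content vector.
    return np.concatenate([
        encode_visual(video["frames"]),
        encode_text(video["description"]),
        encode_audio(video["audio"]),
    ])

# Cluster a large corpus of content embeddings (the talk cites 1,000 clusters
# learned over ~100M videos); a small random corpus stands in here.
corpus = np.stack([
    content_embedding({"frames": None, "description": None, "audio": None})
    for _ in range(5000)
])
kmeans = KMeans(n_clusters=100).fit(corpus)

# Trainable embedding table keyed by cluster ID: behavioral gradients
# (clicks, likes) update these vectors, tying content space to behavior space.
cluster_table = nn.Embedding(num_embeddings=100, embedding_dim=64)

def semantic_id_embedding(video) -> torch.Tensor:
    cid = int(kmeans.predict(content_embedding(video)[None, :])[0])
    return cluster_table(torch.tensor(cid))
```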

Case Study: YouTube YouTube uses semantic IDs to tokenize videos, representing rich multimodal features as tokens [03:06:52].

  • Process: Video features like title, description, transcript, audio, and video frame data are extracted into a multi-dimensional embedding, then quantized into tokens [03:13:21]. This creates a “new language of YouTube videos” where billions of videos are organized by these semantically meaningful tokens [03:13:48].
  • Benefit: This approach helps address the extreme freshness requirements of YouTube, where new videos (like a Taylor Swift music video) need to be recommended within minutes or hours [03:21:14]. The model can be continuously pre-trained on the order of days and hours to keep up with new content [03:21:38].
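
The talk does not name the quantizer, so the residual-quantization scheme below is only one plausible way to turn a video embedding into a short token sequence; the codebooks are random stand-ins:

```python
import numpy as np

# Hypothetical residual quantizer: each level has its own codebook, and each
# level encodes what the previous levels missed, yielding a short token
# sequence (a "semantic ID") per video embedding.
rng = np.random.default_rng(0)
CODEBOOKS = [rng.standard_normal((256, 128)) for _ in range(4)]  # 4 levels x 256 codes

def tokenize(video_embedding: np.ndarray) -> list[int]:
    residual, tokens = video_embedding.copy(), []
    for codebook in CODEBOOKS:
        # Pick the nearest code at this level, then quantize the remainder.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]
    return tokens  # e.g. [17, 203, 5, 88] -- the video's "words"

tokens = tokenize(rng.standard_normal(128))
```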

2. Data Augmentation [00:08:41]

LLMs are highly effective for generating synthetic data and labels, which is crucial for search and recommendation systems that require vast amounts of high-quality metadata [00:08:37]. This is far more cost-effective than human annotation [00:15:46].

Case Study: Indeed Indeed struggled with bad job recommendations leading to user unsubscribes [00:09:03]. Explicit feedback (thumbs up/down) was sparse, and implicit feedback was imprecise [00:09:28].

  • Process: They used LLMs to generate labels for a lightweight classifier [00:09:50]. Initial attempts with open LLMs (Mistral, Llama 2) were poor [00:10:24]. GPT-4 performed well but was too costly and slow [00:10:40]. Fine-tuning GPT-3.5 on GPT-4-generated labels achieved the desired precision at lower cost and latency [00:11:30]. The lightweight classifier distilled from these labels performed well enough for real-time filtering [00:11:58]; a sketch of this label-then-distill pattern follows below.
  • Outcome: Reducing bad recommendations by 20% led to a 4% increase in application rate and a 5% decrease in unsubscribe rate [00:12:23]. This demonstrated that quality, not just quantity, significantly impacts recommendations [00:12:57].
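
A minimal sketch of the label-then-distill pattern, with toy features and a toy heuristic standing in for the LLM labeler; nothing here reflects Indeed’s actual features or models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def llm_label(job: str, profile: str) -> int:
    # Offline: a strong LLM (GPT-4, or a fine-tuned GPT-3.5) judges whether this
    # job/seeker pair is a good match. Stubbed here with a toy keyword heuristic.
    return int(any(word in job.lower() for word in profile.lower().split()))

def featurize(job: str, profile: str) -> np.ndarray:
    # Cheap online features; a real system would use embedding similarity and
    # engagement signals rather than token overlap.
    overlap = len(set(job.lower().split()) & set(profile.lower().split()))
    return np.array([overlap, len(job.split()), len(profile.split())], dtype=float)

# Offline: label candidate pairs with the LLM, then fit the lightweight model.
pairs = [("Senior Python Engineer", "python backend developer"),
         ("Forklift Operator", "python backend developer"),
         ("Machine Learning Engineer", "machine learning researcher"),
         ("Retail Cashier", "machine learning researcher")]
X = np.stack([featurize(j, p) for j, p in pairs])
y = np.array([llm_label(j, p) for j, p in pairs])
clf = LogisticRegression().fit(X, y)

# Online: drop likely-bad recommendations before they reach the user.
def keep(job: str, profile: str) -> bool:
    return bool(clf.predict(featurize(job, profile)[None, :])[0])
```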

Case Study: Spotify Spotify aimed to grow new content categories like podcasts and audiobooks, facing a cold start problem not just for items but for entire categories [00:13:00].

  • Process: They leveraged LLMs to generate natural language queries to facilitate exploratory search [00:14:16]. These LLM-generated queries augmented existing query generation techniques (e.g., extracting from catalog titles, search logs) [00:14:03]. The new queries were ranked and displayed to users, informing them of new categories without explicit banners [00:14:49].
  • Outcome: This led to a 9% increase in exploratory queries, meaning one-tenth of their users were now exploring new products daily, accelerating new product category growth [00:15:14].
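
A rough sketch of how LLM-generated queries might be merged with existing candidates; the prompt wording, the `ask_llm` stub, and the ranking step are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class User:
    recent_artists: list[str]

def ask_llm(prompt: str) -> list[str]:
    # Stand-in for a real LLM call; imagine it returns one query per line.
    return ["comedy podcasts for music fans", "audiobooks narrated by musicians"]

def mined_queries(user: User) -> list[str]:
    # Existing techniques: queries extracted from catalog titles and search logs.
    return [f"best of {a}" for a in user.recent_artists]

def exploratory_queries(user: User, k: int = 10) -> list[str]:
    prompt = (
        "The user recently listened to: " + ", ".join(user.recent_artists) + ".\n"
        "Suggest short search queries that would help them discover podcasts "
        "or audiobooks they might enjoy, one per line."
    )
    pool = mined_queries(user) + ask_llm(prompt)
    # A real system would rank by expected engagement; here we just dedupe.
    return list(dict.fromkeys(pool))[:k]

print(exploratory_queries(User(recent_artists=["Khruangbin", "Bonobo"])))
```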

Case Study: Instacart Instacart used LLMs to improve query understanding, addressing challenges with overly broad or specific queries and enabling new item discovery [02:47:41].

  • Process: For query-to-category classification, initial LLM prompts were decent but lacked context on Instacart user behavior [02:52:16]. By augmenting prompts with Instacart domain knowledge (e.g., top converting categories, user annotations), LLMs generated much more precise results, such as identifying “Vernors soda” as ginger ale instead of a generic fruit-flavored soda [02:53:02]. For query rewrites, LLMs generated precise substitutes, broader terms, and synonyms [02:54:51]; a prompt-augmentation sketch follows below.
  • Outcome: Precision for tail queries improved by 18 percentage points and recall by 70 percentage points [02:53:31]. This significantly reduced the number of “no results” queries, benefiting the business [02:55:43]. For discovery, LLMs generated substitute and complementary items, leading to engagement and revenue improvement [02:59:53].
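
A rough sketch of the prompt-augmentation idea, with a hypothetical store of top-converting categories and a stubbed model call; the prompt text is not Instacart’s:

```python
# query -> categories users actually convert on (mined from engagement logs)
TOP_CONVERTING = {
    "vernors": ["Ginger Ale", "Soft Drinks"],
}

def build_prompt(query: str) -> str:
    domain_hint = ", ".join(TOP_CONVERTING.get(query.lower(), [])) or "none available"
    return (
        f"Classify the grocery search query '{query}' into product categories.\n"
        f"Historically, users who issue this query most often convert in: {domain_hint}.\n"
        "Return the most likely categories, most specific first."
    )

def call_llm(prompt: str) -> list[str]:
    # Stand-in for the real model call.
    return ["Ginger Ale", "Soft Drinks"]

print(call_llm(build_prompt("Vernors")))
```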

3. Unified Models / Foundation Models [00:16:47]

Traditionally, companies have separate models for ads, recommendations, and search, often with multiple bespoke models even within recommendations [00:16:06]. Unified models consolidate these systems, reducing duplication, maintenance costs, and allowing improvements to transfer across use cases [00:16:36].

Case Study: Netflix Netflix, with its diverse recommendation needs across different content types and pages, traditionally had many specialized models [02:24:52]. This led to duplication in feature and label engineering [02:25:28].

  • Solution: Netflix adopted a “big bet” on a foundation model (Unicorn) based on transformer architecture for personalization [02:23:08]. This model learns a centralized user representation, combining ID embedding with semantic content information to handle cold start items that the model hasn’t seen during training [02:31:28]. The model supports various tasks like search, similar item, and pre-query recommendations using a unified input schema [00:17:14].
  • Outcome: The unified model matched or exceeded the metrics of specialized models on multiple tasks [00:18:48]. This approach is scalable and provides high leverage, accelerating innovation velocity by allowing new applications to directly fine-tune the foundation model [02:39:47].
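
One way such a model could blend an ID embedding with semantic content features so unseen items still get a usable representation; the dimensions, the reserved “unknown” ID slot, and the fallback rule are assumptions for illustration, not Netflix’s implementation:

```python
import torch
import torch.nn as nn

class ItemRepresentation(nn.Module):
    """Mix a learned ID embedding with projected content features, so items
    never seen during training are still represented via their content."""

    def __init__(self, num_items: int, content_dim: int, dim: int = 128):
        super().__init__()
        self.id_emb = nn.Embedding(num_items + 1, dim)  # last index = "unknown" item
        self.content_proj = nn.Linear(content_dim, dim)

    def forward(self, item_id, content, is_new):
        # Cold-start items fall back to the reserved "unknown" ID slot, so
        # their representation is driven by content alone.
        unknown = torch.full_like(item_id, self.id_emb.num_embeddings - 1)
        ids = torch.where(is_new, unknown, item_id)
        return self.id_emb(ids) + self.content_proj(content)

model = ItemRepresentation(num_items=1000, content_dim=64)
rep = model(torch.tensor([3]), torch.randn(1, 64), torch.tensor([False]))
```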

Case Study: Etsy Etsy sought to improve results for specific or broad queries given its constantly changing inventory [02:24:32].

  • Solution: They developed a unified embedding and retrieval system [02:00:03]. It uses a two-tower model with a product encoder and a query encoder, trained on query-product logs and using T5 models for text embeddings [02:19:50]. Both towers share encoders for text tokens, product category, and user location, allowing the model to match users to products in their location [02:42:07]. A “quality vector” (ratings, freshness, conversion rate) is concatenated to product embeddings to ensure good-quality results [02:21:18]; see the sketch below.
  • Outcome: A 2.6% increase in conversion across the entire site and over 5% increase in search purchases [02:21:52]. This highlights how unified models simplify systems and improve performance across various use cases [02:22:12].
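
A toy two-tower scoring sketch showing where the quality vector fits; the encoder internals are placeholders, and padding the query side with ones to match dimensions is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, text_dim: int = 256, dim: int = 128):
        super().__init__()
        self.query_enc = nn.Linear(text_dim, dim)    # stands in for the query tower
        self.product_enc = nn.Linear(text_dim, dim)  # stands in for the product tower

    def score(self, query_text, product_text, quality):
        q = F.normalize(self.query_enc(query_text), dim=-1)
        p = F.normalize(self.product_enc(product_text), dim=-1)
        # Append quality signals (ratings, freshness, conversion rate) to the
        # product embedding; pad the query embedding so dot products line up,
        # which lets higher-quality products score higher at retrieval time.
        p = torch.cat([p, quality], dim=-1)
        q = torch.cat([q, torch.ones_like(quality)], dim=-1)
        return (q * p).sum(dim=-1)

model = TwoTower()
scores = model.score(torch.randn(2, 256), torch.randn(2, 256), torch.rand(2, 3))
```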

The LLM-Based Recommendation Recipe [03:08:50]

A general recipe for building an LLM-based recommendation system includes three main steps:

  1. Tokenize Content: Develop a method to tokenize your content into atomic tokens, essentially creating a domain-specific language (e.g., semantic IDs for videos) [03:22:30].
  2. Adapt the LLM: Continuously pre-train a base LLM to understand both natural language and your new domain language, creating a “bilingual” LLM [03:22:57]. This involves linking text to content tokens and teaching the model to reason about sequences of user interactions [03:14:48].
  3. Prompt with User Information: Construct personalized prompts with user demographics, activity, and actions to train task-specific models, resulting in a generative recommendation system [03:23:21].
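
A small sketch of step 3, mixing user attributes with semantic-ID tokens in a prompt; the token format and field names are illustrative assumptions:

```python
def build_prompt(user: dict, watch_history: list[list[int]]) -> str:
    # Each watched item is represented by its semantic-ID token sequence,
    # written in a hypothetical "<sid_N>" token format.
    history_tokens = " ".join(
        "".join(f"<sid_{t}>" for t in video_tokens) for video_tokens in watch_history
    )
    return (
        f"User profile: age_group={user['age_group']}, locale={user['locale']}.\n"
        f"Recently watched (as semantic-ID tokens): {history_tokens}\n"
        "Task: predict the semantic-ID tokens of the next video this user "
        "is most likely to watch."
    )

prompt = build_prompt(
    {"age_group": "25-34", "locale": "en-US"},
    watch_history=[[17, 203, 5, 88], [42, 9, 311, 7]],
)
print(prompt)
```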

The future of LLM-based recommendations suggests a move towards user-steerable systems, where users can interact with the recommender in natural language, receive explanations for recommendations, and align results with their personal goals [03:24:31].