From: aidotengineer

The concept of Semantic IDs and multimodal embeddings is proposed as a solution to long-standing challenges in recommendation systems, such as handling new items and sparsity issues [00:03:08]. Traditionally, hash-based item IDs do not encode the content of an item, leading to a “cold-start problem” for new items where systems have to relearn everything about them [00:04:00]. This also contributes to sparsity, especially for “tail items” with few interactions, making it difficult for recommendation systems to learn effectively and leading to popularity bias [00:04:24].

Addressing Challenges with Semantic IDs

The proposed solution is the use of semantic IDs, which can incorporate multimodal content [00:04:36]. Because these IDs encode the content of the item itself, a recommender can understand new items from their content alone, directly addressing the cold-start problem [00:07:34].

Kuaishou’s Trainable Multimodal Semantic IDs

Kuaishou, a short-video platform in China similar to TikTok or Xiaohongshu, faces the challenge of learning from hundreds of millions of short videos uploaded daily [00:04:46]. They developed trainable multimodal semantic IDs to combine static content embeddings with dynamic user behavior [00:05:07].

The Kuaishou model uses a standard two-tower network architecture [00:05:20].

  • Input Encoding: Content inputs are encoded using specialized models: ResNet for visual content, BERT for video descriptions, and VGGish for audio [00:05:41].
  • Concatenation and Clustering: These content embeddings are concatenated [00:06:08]. For 100 million short videos, Kuaishou learned 1,000 cluster IDs via k-means clustering [00:06:17].
  • Model Encoder: The cluster IDs are mapped to their own embedding table [00:06:38], and the model encoder learns to map the content space to the behavioral space through these trainable cluster IDs [00:06:43] (see the sketch after this list).
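
To make the mechanism concrete, here is a minimal PyTorch sketch of a two-tower model with trainable cluster IDs. The dimensions, the soft cluster assignment, and the toy batch are illustrative assumptions, not Kuaishou’s exact design:

```python
# Minimal sketch of trainable multimodal semantic IDs in a two-tower model.
# All dimensions, the soft assignment, and the toy data are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLUSTERS = 1000  # 1,000 k-means cluster IDs (per the talk)
CONTENT_DIM = 512    # concatenated ResNet + BERT + VGGish features (assumed size)
EMBED_DIM = 64       # behavioral embedding size (assumed)

class ItemTower(nn.Module):
    def __init__(self):
        super().__init__()
        # Cluster centroids live in content space; each cluster ID also gets
        # a trainable embedding in the behavioral space.
        self.centroids = nn.Parameter(torch.randn(NUM_CLUSTERS, CONTENT_DIM))
        self.cluster_table = nn.Embedding(NUM_CLUSTERS, EMBED_DIM)
        self.proj = nn.Linear(EMBED_DIM, EMBED_DIM)

    def forward(self, content_emb):
        # A soft assignment over clusters keeps the mapping differentiable,
        # so user-behavior gradients can reshape the content-to-ID mapping.
        dists = torch.cdist(content_emb, self.centroids)    # (B, NUM_CLUSTERS)
        weights = F.softmax(-dists, dim=-1)
        semantic_emb = weights @ self.cluster_table.weight  # (B, EMBED_DIM)
        return F.normalize(self.proj(semantic_emb), dim=-1)

class UserTower(nn.Module):
    def __init__(self, user_feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(user_feat_dim, EMBED_DIM), nn.ReLU(),
                                 nn.Linear(EMBED_DIM, EMBED_DIM))

    def forward(self, user_feats):
        return F.normalize(self.mlp(user_feats), dim=-1)

# Toy forward pass: score = dot product between the two towers,
# trained against click/like labels in practice.
item_tower, user_tower = ItemTower(), UserTower()
content = torch.randn(8, CONTENT_DIM)  # stand-in for ResNet+BERT+VGGish outputs
users = torch.randn(8, 128)
scores = (user_tower(users) * item_tower(content)).sum(-1)
```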

This approach produced semantic IDs that not only outperformed regular hash-based IDs on clicks and likes but also increased cold-start coverage by 3.6% and improved cold-start velocity [00:06:59].

YouTube’s Large Recommender Model (LRM)

YouTube adapted its recommendation system using Google’s Gemini model, creating what they call the Large Recommender Model (LRM) [03:10:41]. The core innovation for this model is the development of a method to tokenize videos [03:12:22].

Semantic ID for Videos

To allow reasoning over many videos, YouTube built a semantic ID system for video tokenization [03:12:50].

  • Feature Extraction: Features like title, description, transcript, audio, and video frame-level data are extracted from a video [03:13:21].
  • Embedding and Quantization: These features are transformed into a multi-dimensional embedding and then quantized with RQ-VAE (a residual-quantized variational autoencoder) to assign each video a sequence of tokens (see the sketch after this list) [03:13:33].
  • Domain-Specific Language: This process creates atomic units for a new “language of YouTube videos,” organizing billions of videos around semantically meaningful tokens [03:13:47]. These tokens can represent topics like music, gaming, or sports, with prefixes shared among related content and unique identifiers for specific videos [03:14:04].
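
The residual quantization behind RQ-VAE is what gives related videos shared token prefixes: each codebook level quantizes the residual left by the previous level, so early tokens capture coarse topics and later tokens refine them. The sketch below uses assumed codebook sizes, levels, and dimensions, with random (untrained) codebooks:

```python
# Sketch of the residual quantization at the heart of RQ-VAE. In practice the
# codebooks are trained jointly with an autoencoder; here they are random.
import numpy as np

rng = np.random.default_rng(0)
LEVELS, CODEBOOK_SIZE, DIM = 3, 256, 64   # assumed sizes
codebooks = rng.normal(size=(LEVELS, CODEBOOK_SIZE, DIM))

def semantic_id(video_embedding):
    """Quantize one video embedding into a sequence of LEVELS tokens."""
    residual, tokens = video_embedding, []
    for level in range(LEVELS):
        codes = codebooks[level]
        # Pick the nearest code at this level, then pass the leftover
        # residual to the next (finer) level.
        idx = int(np.argmin(((residual - codes) ** 2).sum(axis=1)))
        tokens.append(idx)
        residual = residual - codes[idx]
    return tokens

emb = rng.normal(size=DIM)  # stand-in for the multimodal video embedding
print(semantic_id(emb))     # e.g. [137, 22, 201]: first token ~ broad topic
```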

Continued Pre-training and Reasoning

The LRM undergoes “continued pre-training” to understand both English and this new YouTube language [03:14:36].

  1. Text and SID Linking: The model learns to link text (e.g., video titles, creators, topics) with their corresponding Semantic IDs [03:14:48].
  2. Sequence Understanding: It learns from sequences of user watches, predicting masked videos within a watch history to understand relationships between videos based on user engagement [03:15:26]. Both tasks are sketched below.
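
A hedged illustration of how training examples for these two tasks might be constructed; the prompt formats and the `sid_...` token spelling are assumptions, while the task structure comes from the description above:

```python
# Illustrative construction of the two continued pre-training tasks.
import random

def linking_example(title, sid_tokens):
    # Task 1: teach the model that a video's text and its semantic ID co-refer.
    return f"Video titled '{title}' has semantic ID {' '.join(sid_tokens)}"

def masked_watch_example(watch_history_sids, mask_index):
    # Task 2: mask one video in a user's watch sequence and predict it, so the
    # model learns relationships between videos from engagement alone.
    inputs = list(watch_history_sids)
    target = inputs[mask_index]
    inputs[mask_index] = "<mask>"
    return " ".join(inputs), target

history = ["sid_7_42_3", "sid_7_42_9", "sid_7_51_1", "sid_2_13_8"]
print(linking_example("How to tune a guitar", ["sid_7", "sid_42", "sid_3"]))
print(masked_watch_example(history, random.randrange(len(history))))
```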

This training enables the model to reason across English and YouTube videos, inferring topics and connections based solely on the Semantic ID [03:16:07].

Generative Retrieval

The LRM performs generative retrieval: a prompt incorporating user demographics, context videos, and watch history is constructed, and the model generates the SIDs of recommended videos [03:17:03]. This yields unique recommendations, particularly for hard recommendation tasks or users with limited watch history [03:17:41]. An offline recommendations table can also be built by removing the personalized parts of the prompt; thanks to the LRM’s large pre-trained checkpoint, this still produces differentiated recommendations [03:19:12].
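
The following sketch shows one plausible way such a prompt could be assembled; the field names and formats are assumptions, and `lrm.generate` stands in for decoding constrained to valid SIDs:

```python
# Hedged sketch of generative-retrieval prompt assembly: user signals are
# flattened into a token sequence and the model continues it with new SIDs.
def build_prompt(age_bucket, locale, context_sid, watch_history_sids):
    parts = [
        f"user_age:{age_bucket}",
        f"locale:{locale}",
        f"context_video:{context_sid}",
        "watch_history:" + " ".join(watch_history_sids),
        "recommend:",  # the LRM continues the sequence with recommended SIDs
    ]
    return " ".join(parts)

prompt = build_prompt("25-34", "en-US", "sid_7_42_3",
                      ["sid_7_42_9", "sid_2_13_8"])
# recommended_sids = lrm.generate(prompt)  # hypothetical model call
# Dropping the personalized fields (age, watch history) from this prompt
# yields the unpersonalized variant used for the offline table.
print(prompt)
```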

Etsy’s Unified Embeddings

Etsy uses unified embeddings and retrieval to improve search results for both specific and broad queries, especially given its constantly changing inventory [03:19:13].

  • Product Encoder: Uses T5 models for text embeddings of item descriptions [03:20:18].
  • Query Encoder: Encodes search queries, learning query embeddings from query-product logs (what was clicked or purchased after a query) [03:20:35].
  • Shared Encoders: Both towers share encoders for text tokens, product category tokens, and user location, enabling the model to match users to products by location [03:20:40].
  • Personalization: User preferences are captured through user-level features such as past queries and purchase history [03:20:57].
  • Quality Vector: A “quality vector” (incorporating ratings, freshness, and conversion rate) is concatenated to the product embedding to ensure retrieved items are of good quality [03:21:18]. To keep dimensions compatible for dot-product or cosine similarity, a constant vector is appended to the query vector [03:21:38]; a sketch of this trick follows the list.
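
A small numeric sketch of the quality-vector trick described above: real quality signals are appended on the product side and a constant vector of the same width on the query side, so the dot product adds a query-independent quality bonus to each item’s score. Sizes and values here are toy assumptions:

```python
# Quality-vector trick: pad the query with a constant vector so dimensions
# match, making the dot product = relevance + fixed-weight quality bonus.
import numpy as np

EMBED_DIM, QUALITY_DIM = 8, 3  # toy sizes

product_emb = np.random.randn(5, EMBED_DIM)   # from the product encoder
quality = np.random.rand(5, QUALITY_DIM)      # ratings, freshness, conv. rate
product_vec = np.concatenate([product_emb, quality], axis=1)

query_emb = np.random.randn(EMBED_DIM)        # from the query encoder
const = np.ones(QUALITY_DIM)                  # constant pad on the query side
query_vec = np.concatenate([query_emb, const])

scores = product_vec @ query_vec  # relevance plus quality bonus per item
print(scores.argsort()[::-1])     # ranked retrieval order
```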

This approach led to a 2.6% increase in conversion across the entire site and over 5% increase in search purchases, demonstrating that quality in recommendations makes a significant difference [03:21:51].

Challenges and Future Directions

Implementing multimodal embedding-based systems presents unique challenges compared to traditional LLMs:

  • Vocabulary Size and Freshness: Recommendation systems deal with far larger vocabularies than natural language (billions of videos on YouTube versus the hundreds of thousands of words in English) and require constant updates for new content; a new music video must be recommendable within minutes or hours of upload [03:20:24]. This necessitates continuous pre-training on the order of days or hours [03:21:35].
  • Serving Costs: Large multimodal models are expensive to serve, requiring significant optimization (e.g., 95%+ cost savings for YouTube’s LRM) to meet the latency and scale requirements of billions of daily active users [03:18:36].
  • Balancing Language Capabilities: A key challenge is balancing the learning of semantic ID embeddings with retaining general language capability in multimodal models [03:25:50]. Over-training on domain-specific examples can lead to the model “forgetting” how to speak English [03:26:26].

The future of multimodal-powered recommendation systems is envisioned to involve more explicit user interaction, where users can steer recommendations using natural language, receive explanations for recommendations, and blur the lines between search and recommendation [03:24:24]. Ultimately, this could lead to the recommendation of personalized and even dynamically generated content [03:25:01].