From: aidotengineer
Language modeling techniques have been integrated into recommendation systems since 2013, initially by learning item embeddings from co-occurrences in user interaction sequences, and later evolving with GRU4Rec for short-term prediction and Transformers for long-range dependencies [00:03:08]. Today, a key advancement is the use of Semantic IDs to enhance recommendation systems, particularly for video content [00:03:55].
Challenges with Hash-Based Item IDs
Traditional recommendation systems often rely on hash-based item IDs, which present several challenges [00:04:00]:
- Lack of Content Encoding: Hash-based IDs do not inherently encode the content of the item itself [00:04:11].
- Cold-Start Problem: When a new item is introduced, the system must relearn about it entirely, leading to poor recommendations until sufficient interaction data is collected [00:04:15].
- Sparsity: Many items, especially those in the “long tail,” have very few interactions (e.g., 1-10), making it difficult to learn effective representations [00:04:24].
- Popularity Bias: Recommendation systems tend to favor popular items, struggling to recommend new or niche content due to cold start and sparsity [00:04:32].
Semantic IDs as a Solution
Semantic IDs address these issues by embedding items based on their content, often leveraging multimodal data [00:04:39]. This allows recommendations to inherently “understand” the content of the items, improving cold-start coverage and enabling human-readable explanations for recommendations [00:07:34].
Kuaishou’s Trainable Multimodal Semantic IDs
Kuaishou, a short-video platform similar to TikTok, faced the challenge of learning from hundreds of millions of daily video uploads [00:04:59]. Their solution involved trainable multimodal Semantic IDs that combine static content embeddings with dynamic user behavior [00:05:08].
Model Architecture Kuaishou used a standard two-tower network, with separate embedding layers for users and items [00:05:20]. The key innovation was integrating content input through specialized encoders [00:05:41]:
- Visual Content: Encoded using ResNet [00:05:49].
- Video Descriptions: Encoded using BERT [00:05:52].
- Audio Content: Encoded using VGGish [00:05:57].
Encoding and Clustering To enable backpropagation to update these content embeddings, Kuaishou concatenated all of the content embeddings [00:06:08]. They then learned cluster IDs using k-means clustering; for example, 100 million short videos were grouped into 1,000 clusters [00:06:17]. These trainable cluster IDs were mapped to their own embedding tables, allowing the model to learn to map the content space (via cluster IDs) to the behavioral space during training [00:06:38].
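The following is a minimal sketch of this setup, assuming PyTorch and scikit-learn with toy dimensions; the random arrays stand in for ResNet, BERT, and VGGish outputs, and the layer names and sizes are illustrative rather than Kuaishou’s actual implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

NUM_ITEMS, NUM_CLUSTERS, EMB_DIM = 10_000, 1_000, 64

# 1) Concatenate per-modality content embeddings
#    (random stand-ins for ResNet, BERT, and VGGish features).
visual = np.random.randn(NUM_ITEMS, 128).astype("float32")
text = np.random.randn(NUM_ITEMS, 256).astype("float32")
audio = np.random.randn(NUM_ITEMS, 64).astype("float32")
content = np.concatenate([visual, text, audio], axis=1)

# 2) Learn cluster IDs over the content space with k-means
#    (the talk's example: 100 million short videos grouped into 1,000 clusters).
cluster_ids = KMeans(n_clusters=NUM_CLUSTERS, n_init=10).fit_predict(content)

# 3) Map the cluster IDs to their own trainable embedding table inside the item tower,
#    so training against behavioral signals can adapt the content-derived clusters.
class ItemTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.id_emb = nn.Embedding(NUM_ITEMS, EMB_DIM)          # behavioral item-ID embedding
        self.cluster_emb = nn.Embedding(NUM_CLUSTERS, EMB_DIM)  # trainable semantic-cluster embedding
        self.proj = nn.Linear(2 * EMB_DIM, EMB_DIM)

    def forward(self, item_idx, item_cluster):
        x = torch.cat([self.id_emb(item_idx), self.cluster_emb(item_cluster)], dim=-1)
        return self.proj(x)

tower = ItemTower()
items = torch.arange(8)
clusters = torch.tensor(cluster_ids[:8], dtype=torch.long)
print(tower(items, clusters).shape)  # torch.Size([8, 64])
```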
Results Kuaishou’s Semantic IDs not only outperformed regular hash-based IDs on clicks and likes but also significantly improved cold-start metrics [00:06:59]:
- Cold-Start Coverage: Increased by 3.6%, meaning more new videos were shown [00:07:07].
- Cold-Start Velocity: Increased by a notable margin (specific threshold not shared), indicating new videos reached view thresholds faster [00:07:17].
YouTube’s Large Recommender Model (LRM)
YouTube adapted Google’s Gemini LLM to create a Large Recommender Model (LRM), aimed at revolutionizing video recommendations [03:10:43]. This model operates by treating videos as tokens, similar to how LLMs process text tokens [03:12:44].
Video Tokenization with Semantic IDs To enable the LRM to reason over large numbers of videos, YouTube developed Semantic IDs, which were presented at RecSys [03:13:14].
- Feature Extraction: Videos are analyzed to extract various features, including title, description, transcript, and even audio and video frame-level data [03:13:25].
- Multidimensional Embedding: These features are combined into a rich, multidimensional embedding [03:13:33].
- Quantization: The embedding is then quantized with a Residual Quantized Variational Autoencoder (RQ-VAE), assigning each video a compact sequence of semantically meaningful tokens [03:13:36]. These tokens become the “atomic units for a new language of YouTube videos” [03:13:48]; a toy sketch of residual quantization follows this list.
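To make the coarse-to-fine structure of these tokens concrete, here is a toy residual-quantization sketch in plain Python. The random codebooks, three levels, and codebook size are illustrative assumptions; a real RQ-VAE learns its codebooks jointly with an encoder and decoder rather than using fixed random ones.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LEVELS, CODEBOOK_SIZE, DIM = 3, 256, 32

# One codebook per quantization level (random stand-ins for learned codebooks).
codebooks = [rng.standard_normal((CODEBOOK_SIZE, DIM)) for _ in range(NUM_LEVELS)]

def semantic_id(embedding: np.ndarray) -> list[int]:
    """Quantize an embedding into one token per level. Each level encodes the
    residual left over by the previous level, so early tokens capture coarse
    topics and later tokens capture finer distinctions."""
    residual, tokens = embedding, []
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]
    return tokens

video_embedding = rng.standard_normal(DIM)  # stand-in for the multimodal video embedding
print(semantic_id(video_embedding))         # e.g. [137, 42, 201]
```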
This system organizes the billions of videos on YouTube into semantically meaningful tokens. For instance, topics like music, gaming, or sports would be represented by initial tokens, with further tokens specializing down to specific sub-genres or events (e.g., sports → volleyball) [03:14:04]. This represents a significant shift from hash-based to semantically meaningful tokenization [03:14:25].
Continued Pre-training The LRM undergoes continuous pre-training to understand both English and this new “YouTube language” [03:14:36]. This involves two main steps:
- Linking Text and SID: The model learns to associate text (such as titles or creator names) with specific video Semantic IDs [03:14:48]. For example, given a prompt of the form “This video [SID] has title”, the model is trained to output the title for that Semantic ID [03:15:08].
- Understanding Watch Sequences: The model is trained on sequences of user watches, predicting masked videos within a sequence to learn relationships based on user engagement [03:15:26]. This helps the model understand which videos are watched together [03:15:51].
Through these tasks, the LRM develops the ability to reason across both English and YouTube videos. For example, given a series of video SIDs, it can infer the topic of a new video based solely on its Semantic ID [03:16:04].
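As a hedged illustration of how these two pre-training tasks could be rendered as training examples, the sketch below builds a text-to-SID sample and a masked watch-sequence sample. The prompt templates, the `<sid_*>` token notation, and the `<mask>` token are assumptions for illustration, not YouTube's actual formats.

```python
def sid_tokens(semantic_id: list[int]) -> str:
    # Render a Semantic ID (e.g. [137, 42, 201]) as special vocabulary tokens.
    return " ".join(f"<sid_{level}_{code}>" for level, code in enumerate(semantic_id))

# Task 1: link text and SID, so the model can translate between English and SIDs.
def text_sid_example(semantic_id: list[int], title: str) -> dict:
    return {"prompt": f"The video {sid_tokens(semantic_id)} has title:", "target": title}

# Task 2: masked watch-sequence prediction, so the model learns which videos
# tend to be watched together.
def masked_sequence_example(watch_history: list[list[int]], mask_pos: int) -> dict:
    rendered = ["<mask>" if i == mask_pos else sid_tokens(sid)
                for i, sid in enumerate(watch_history)]
    return {"prompt": " ".join(rendered), "target": sid_tokens(watch_history[mask_pos])}

print(text_sid_example([137, 42, 201], "Beach volleyball highlights"))
print(masked_sequence_example([[137, 42, 201], [137, 42, 7], [9, 1, 55]], mask_pos=1))
```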
Generative Retrieval and Recommendations The LRM can be used for generative retrieval by constructing personalized prompts for each user, including their demographics, context videos, and watch history [03:17:03]. The model then decodes video recommendations as SIDs [03:17:37]. This approach yields unique and interesting recommendations, particularly for challenging recommendation tasks or users about whom less is known [03:17:45].
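The sketch below shows this flow under stated assumptions: a personalized prompt is assembled from demographics, a context video, and watch history, and a model then decodes Semantic IDs as recommendations. The prompt layout, field names, and the commented `lrm.generate` call are hypothetical; only the overall shape (personalized prompt in, SIDs out) follows the description above.

```python
def build_prompt(demographics: dict, context_video: list[int],
                 watch_history: list[list[int]]) -> str:
    def sid(tokens):  # render a Semantic ID as special tokens, as in the pre-training sketch
        return " ".join(f"<sid_{i}_{c}>" for i, c in enumerate(tokens))
    lines = [
        f"user: age_bucket={demographics['age_bucket']} locale={demographics['locale']}",
        f"context_video: {sid(context_video)}",
        "watch_history: " + " ".join(sid(v) for v in watch_history),
        "recommend:",
    ]
    return "\n".join(lines)

prompt = build_prompt(
    demographics={"age_bucket": "25-34", "locale": "en-US"},
    context_video=[9, 1, 55],
    watch_history=[[137, 42, 201], [137, 42, 7]],
)
print(prompt)

# A hypothetical serving call would then decode token sequences and map each
# generated SID back to a concrete video, e.g.:
#   recommended_sids = lrm.generate(prompt, num_candidates=20)
#   videos = [sid_to_video[tuple(s)] for s in recommended_sids]
```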
Challenges and Optimizations While powerful, the LRM is expensive to serve at YouTube’s scale (billions of users) [03:18:36]. Significant effort went into reducing TPU serving costs, achieving over 95% savings [03:19:01].
Key challenges include:
- Vocabulary Size: YouTube’s corpus of 20 billion videos (with millions added daily) dwarfs the vocabulary of traditional English LLMs (around 100,000 words) [03:20:24].
- Freshness: New content (e.g., a Taylor Swift music video) must be recommended within minutes or hours, requiring continuous pre-training on the order of days or hours, unlike the months-long pre-training cycles of classical LLMs [03:20:51].
- Scale and Efficiency: YouTube must focus on smaller, more efficient models (like Gemini Flash) to meet latency and scale requirements for billions of daily active users [03:21:52].
To address serving costs, YouTube also built an offline recommendations table. The LRM, run without personalized inputs, performs offline inference over the head of the video corpus (which accounts for a large share of watch time), and the results are stored in a simple lookup table that serves recommendations online [03:19:13]. Because it is built on a large Gemini checkpoint, even this “unpersonalized” model still provides differentiated recommendations [03:19:40].
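A minimal sketch of that offline table is below, assuming a hypothetical `generate_related` function as a stand-in for unpersonalized LRM inference: the expensive model runs once per head video in a batch job, and the online path is a plain dictionary lookup.

```python
from typing import Callable

def build_offline_table(head_videos: list[str],
                        generate_related: Callable[[str], list[str]],
                        top_k: int = 20) -> dict[str, list[str]]:
    # Offline batch job: expensive model inference happens here, once per head video.
    return {video: generate_related(video)[:top_k] for video in head_videos}

def serve(context_video: str, table: dict[str, list[str]]) -> list[str]:
    # Online path: a simple lookup, no model inference at request time.
    return table.get(context_video, [])

def fake_generate_related(video: str) -> list[str]:
    # Toy stand-in for offline inference over context-video-only (unpersonalized) prompts.
    return [f"{video}_related_{i}" for i in range(50)]

table = build_offline_table(["video_a", "video_b"], fake_generate_related)
print(serve("video_a", table)[:3])  # ['video_a_related_0', 'video_a_related_1', 'video_a_related_2']
```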
LLM and RecSys Recipe
The process of building an LLM-based recommendation system can be summarized in three steps [03:22:25]:
- Tokenize Content: Distill the essence of the content into atomic tokens by building a rich feature representation, forming an embedding, and then quantizing it [03:22:50]. This effectively creates a domain-specific language.
- Adapt the LLM: Link English with the newly created domain language through training tasks that enable reasoning across both [03:22:57]. The outcome is a bilingual LLM that understands both natural language and the domain-specific token language.
- Prompt with User Information: Construct personalized prompts using user demographics, activity, and actions [03:23:21]. Train task-specific or surface-specific models to create a generative recommendation system on top of the LLM [03:23:36].
Future Directions
Next-token prediction in AI models, specifically LLMs, is poised to transform recommendation systems even further [03:23:50]:
- Augmented but Invisible: Currently, LLMs primarily augment recommendations, improving quality largely invisibly to users [03:23:58].
- Interactive Experiences: In the near future, users may be able to talk to recommendation systems in natural language, steer recommendations to their goals, and receive explanations for why a candidate was recommended [03:24:26]. This will blur the lines between search and recommendation [03:24:52].
- Generative Content: Ultimately, recommendations may evolve to include generative content, where a personalized version of content is recommended, or even entirely new content is created for the user [03:25:01].
Cold-Start for Semantic IDs
The training process for Semantic IDs is entirely unsupervised, as the system makes its own quantization of the video corpus [03:27:17]. This enables the model to learn concepts (e.g., sports vs. movies) without explicit instruction [03:27:30]. Semantic IDs effectively warm-start new content into a semantically meaningful space, leading to improved performance for fresh or “tail” videos uploaded recently [03:27:40].