From: allin
The emergence of AI has created a new paradigm for content monetization, with major tech companies signing significant licensing deals for training data. This shift is being described as “Traffic Acquisition Cost (TAC) 2.0,” an evolution of the earlier traffic-acquisition model. [00:42:01]
Traffic Acquisition Cost (TAC) 1.0 vs. TAC 2.0
Historically, Google’s “Traffic Acquisition Cost” (TAC 1.0) model involved paying partners to feature Google search, which then generated revenue for Google through ads. [00:42:03] This model scaled from small companies up to Apple, with Google reportedly paying Apple around $20 billion annually to be the default search engine on iPhones. [00:43:05]
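As a back-of-envelope illustration of the TAC 1.0 economics, the sketch below computes the effective TAC rate on a default-placement deal. All figures other than the ~$20B Apple payment are hypothetical round numbers, not figures from the episode:

```python
# Illustrative TAC 1.0 economics: Google pays a partner for default
# placement and recoups the cost through ad revenue on the resulting
# searches. Only the ~$20B payment comes from the discussion above;
# the other figures are assumptions for illustration.

tac_payment = 20e9         # annual payment to the partner (e.g., ~$20B to Apple)
revenue_per_search = 0.05  # assumed average ad revenue per search, in dollars
searches_per_year = 1e12   # assumed searches originating from the partner's devices

ad_revenue = revenue_per_search * searches_per_year
tac_rate = tac_payment / ad_revenue  # share of that revenue paid back out

print(f"Ad revenue from placement: ${ad_revenue / 1e9:.0f}B")
print(f"Effective TAC rate: {tac_rate:.0%}")
# With these assumptions: $50B in revenue against a $20B payment, a 40% TAC rate.
```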
The new “TAC 2.0” framework sees companies like Google paying for data to train their AI models, rather than solely for search distribution. [00:43:19] This represents a significant revenue stream for businesses with unique and high-quality data. [00:43:34]
Key Licensing Deals and Their Implications
Recent examples of these deals highlight the growing trend:
- Reddit: Reddit reportedly has around $200 million worth of AI licensing deals, including one with Google, to be realized over the next two to three years. [00:40:43]
- Stack Overflow: The platform is now providing its content to Google through its API to help train the Gemini models. [00:40:33]
- Axel Springer: Partnered with OpenAI for content licensing. [00:41:00]
- CNN, Fox, Time: OpenAI is reportedly in talks with these media companies to license their content. [00:41:06]
These deals come partly in response to lawsuits, such as The New York Times’ copyright-infringement case against OpenAI. [00:41:11] Both OpenAI and Google are actively “guardrailing” their systems (ChatGPT and Gemini) to prevent copyright infringement. [00:41:40]
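Neither company has disclosed how these guardrails work. As a rough sketch of one common approach, an output filter might block generations that share long verbatim n-grams with a protected corpus; the corpus, n-gram size, and threshold below are all assumptions, not how OpenAI or Google actually implement their filters:

```python
# A naive copyright guardrail: flag model output that shares long
# verbatim word n-grams with a protected corpus. Illustrative only.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_like_verbatim_copy(output: str, corpus: list[str],
                             n: int = 8, threshold: float = 0.2) -> bool:
    """Flag output if too many of its n-grams appear in any corpus document."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return False
    for doc in corpus:
        overlap = len(out_grams & ngrams(doc, n)) / len(out_grams)
        if overlap >= threshold:
            return True
    return False

protected = ["the quick brown fox jumps over the lazy dog near the riverbank at dawn"]
candidate = "the quick brown fox jumps over the lazy dog near the riverbank today"
print(looks_like_verbatim_copy(candidate, protected))  # True: long shared 8-grams
```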
Value of Data and Content
The value of content for AI training depends on several factors:
- Uniqueness and Proprietary Nature: Data that is unique and proprietary (like user-generated content from Reddit) can command high prices. [00:44:06]
- Freshness: For certain types of content, such as news, data can quickly become stale, reducing its value over time. [00:47:24]
- Attribution: A key challenge is determining how much incremental value a model derives from a specific dataset, especially for smaller websites; one way to frame this is as an ablation problem, as sketched after this list. [00:45:21]
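A minimal sketch of ablation-style data valuation: score the model with and without a dataset and treat the quality difference as that dataset’s incremental value. The `score_model` function below is a toy stand-in for a real train-then-evaluate pipeline, and all dataset names and scores are hypothetical:

```python
# Ablation-style data valuation: a dataset's incremental value is the
# drop in model quality when it is removed from the training mix.
# score_model is a toy stand-in for an expensive train-then-evaluate run.

def score_model(datasets: frozenset[str]) -> float:
    """Toy quality score for a model trained on the given datasets.
    In reality this would be a full training run plus benchmark evals."""
    # Hypothetical per-dataset contributions, with a penalty for the
    # overlap between the two Q&A-style sources.
    base = {"reddit": 4.0, "stackoverflow": 3.0, "small_blog": 0.2}
    score = sum(base[d] for d in datasets)
    if {"reddit", "stackoverflow"} <= datasets:
        score -= 1.0  # the two sources partially cover the same knowledge
    return score

all_data = frozenset({"reddit", "stackoverflow", "small_blog"})
full = score_model(all_data)
for d in sorted(all_data):
    ablated = score_model(all_data - {d})
    print(f"{d:14s} incremental value: {full - ablated:+.2f}")
# Note: leave-one-out undervalues overlapping datasets; Shapley-style
# averaging over all subsets is a common (more expensive) refinement.
```

This framing also explains why smaller websites struggle to price their data: their measured incremental value is tiny and noisy relative to the cost of running the ablation.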
The market for this type of content licensing is compared to the content licensing deals seen in the entertainment industry (e.g., Netflix paying studios). [00:46:44]
Data Volume and Future Trends
Currently, there are approximately one million petabytes of data on the internet, with humans generating about 2,500 petabytes of new data daily. [00:47:41] However, about half of all generated data is never used, and the majority of public domain data is not on the internet. [00:47:57] The rate of data generation is continuously increasing, potentially making older data less valuable over time. [00:49:11]
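Taking the episode’s figures at face value, a quick back-of-envelope calculation shows how fast new data dilutes the existing stock:

```python
# Back-of-envelope arithmetic on the episode's figures: ~1,000,000 PB
# of data on the internet today, ~2,500 PB of new data generated daily.

stock_pb = 1_000_000   # current data on the internet, in petabytes
daily_pb = 2_500       # new data generated per day, in petabytes

days_to_double = stock_pb / daily_pb
yearly_growth = daily_pb * 365 / stock_pb

print(f"Days for new data to equal today's stock: {days_to_double:.0f}")  # ~400
print(f"Annual growth relative to current stock:  {yearly_growth:.0%}")   # ~91%
# At this rate, today's data would be a minority of the total within
# roughly 13 months, consistent with older data losing value over time.
```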
Challenges and Opportunities for Content Creators
- Small Websites: While large platforms like Reddit can secure deals worth hundreds of millions of dollars, the monetization potential for smaller websites remains uncertain. [00:44:46]
- Exclusivity: Current deals, like Reddit’s with Google, appear to be non-exclusive, meaning content providers can license their data to multiple AI developers. [00:50:00] Some argue that major AI companies should pursue exclusive deals to block competitors. [00:50:08]
- Unionization/Federation: One suggestion is for content creators, particularly news organizations, to form a federation that bargains collectively with tech giants for better terms, akin to the music industry’s successful licensing efforts. [00:52:26] This could prevent a situation in which many content suppliers compete for a limited number of buyers, driving down prices. [00:52:50]