From: redpointai

HeyGen CEO Joshua Xu discussed the company’s work in AI video tools for enterprises on the Unsupervised Learning podcast [00:00:11]. HeyGen was recently valued at $500 million [00:00:04].

HeyGen’s Core Offering

HeyGen is an AI video platform designed to help users create, localize, and personalize video content [00:03:50] [00:09:42]. The core mission is to replace the need for physical cameras by generating footage using AI [00:04:38] [00:04:45]. This approach makes video production significantly faster and cheaper, opening it up to users who lack expensive equipment, comfort on camera, or specialized editing skills [00:05:06] [00:11:48].

The “Magic Moment” of AI Video

The company experienced a viral “magic moment” when HeyGen was used to dub the speech of Argentina’s president at the World Economic Forum into different languages, highlighting the value and “magic” of AI video translation [00:00:53] [00:01:45]. This event demonstrated that individuals can speak in different languages on camera with natural voices and expressions [00:01:53]. Internally, HeyGen believes that continuously shipping and improving the product experience, while listening to customers, is what leads to such breakout moments [00:02:17].

Joshua’s personal “magic moment” occurred when he created his first avatar and saw himself speaking on screen [00:02:36]. He noted the ease of using his avatar to generate product update videos from a script, eliminating the need to film himself [00:03:04].

Evolution of Video Creation Workflows

Historically, video production involved filming with a camera followed by extensive post-production editing [00:04:03]. With generative AI, it is now possible to generate footage directly, with the aim of replacing the camera [00:04:36]. The underlying belief is that because video can be encoded as binary data, machines can learn to generate it [00:05:41].

Joshua predicts that future editing experiences will be vastly different from today’s timeline editors [00:06:40]. Timeline editors primarily exist because cameras and footage were expensive, requiring multiple takes and extensive post-production [00:06:16]. With AI-generated footage, this dependency is removed [00:06:33]. The future of video creation may involve generating video from text and combining various user experience elements like script writing and documentation-style editing [00:07:00].

Prioritizing AI Quality

HeyGen’s primary focus is on the quality of AI-generated video, ensuring that the “magic works” [00:07:42]. Quality in AI models for video isn’t solely about mathematical optimization but involves many aspects such as lighting, naturalism, and matching body motion and gestures to the script [00:08:21].

Evaluating AI models for subjective aesthetic quality is a challenge, requiring the internal team to develop a strong sense of what makes a good avatar [00:08:55]. The internal benchmark is whether a team member would be happy to use the generated avatar in their day-to-day work [00:09:09].

HeyGen’s Customer Use Cases

HeyGen serves over 40,000 customers with three primary use cases [00:09:37]:

  1. Creation: Users can create videos using avatars (either their own or stock avatars) by typing text, eliminating the need for a camera [00:09:48].
  2. Localization: Existing videos (even non-HeyGen videos) can be localized into over 175 different languages and dialects while retaining voice tone, facial expression, and lip-syncing [00:10:01].
  3. Personalization: A single video can be personalized into over 100,000 variations based on customer demographics, industry, and specific problems they face, similar to personalizing emails [00:10:19].
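
The email comparison in the personalization use case can be sketched in a few lines of Python. This is only an illustration of template-based script variation, not HeyGen's API: the template text, the field names (name, industry, problem), and the personalize function are all hypothetical.

```python
from string import Template

# Hypothetical script template; the field names are illustrative assumptions.
SCRIPT = Template(
    "Hi $name, teams in $industry often struggle with $problem. "
    "Here is how we can help."
)

def personalize(template, viewers):
    """Return one script per viewer record by filling in the template fields."""
    return [template.substitute(viewer) for viewer in viewers]

# Made-up viewer records standing in for CRM data.
viewers = [
    {"name": "Ana", "industry": "retail", "problem": "slow onboarding"},
    {"name": "Ben", "industry": "fintech", "problem": "support backlog"},
]
scripts = personalize(SCRIPT, viewers)
```

Each resulting script would then be handed to an avatar renderer, so one recorded avatar and one template can fan out into as many variations as there are viewer records.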

Target Audience and User Experience

HeyGen is designed for the 99% of users who are not professional video editors and do not have access to expensive cameras or sophisticated software [00:11:18]. The tool enables “content people” and marketers who can write scripts but may lack video production skills to produce videos [00:11:29]. The mission is to enable visual storytelling for everyone [00:11:39].

Teaching users a net new way of doing things is a common challenge for AI products [00:11:58]. HeyGen addresses this by showcasing what is possible with the technology and highlighting diverse use cases across marketing, sales, customer support, and content creation [00:12:36]. The goal is to quickly guide new users to their specific use case and demonstrate a “magic moment” [00:13:08].

Technical Aspects of Avatar Models

The key differentiator for HeyGen’s avatar models is engagement [00:13:40]. In a business context, video must effectively deliver a message, meaning it needs to be engaging to prevent viewers from disengaging quickly [00:13:56]. Engagement goes beyond mouth movement to include head movement, eyebrow expressions, and body motion and gestures, which is the most challenging aspect to coordinate [00:14:17].

HeyGen’s dedicated research team builds the entire video layer in-house, covering lip-syncing, body motion, and full-body rendering [00:15:07]. Their Avatar 3.0 model can render the entire body [00:15:20]. The process of training a good avatar model involves extensive data and continuous improvement of model architecture to capture various dimensions of human speech and integrate them [00:16:06].

To achieve personalization, HeyGen’s “video avatar” feature requires users to submit a short video (30 seconds to 2 minutes) for the AI model to learn their unique “talking style,” encompassing not just mouth movements but overall behavior [00:17:02] [00:17:23]. The company is working on larger models to capture different modes of speaking, like presentation mode or interview mode, and build adaptive avatar behaviors based on the script or content [00:17:36].

Future of AI Video at HeyGen

Synchronous Generation & Streaming

HeyGen has a beta version of an interactive avatar that can attend and interact in real-time Zoom meetings [00:19:15]. The main technical challenge for synchronous generative streaming is optimizing inference speed for increasingly larger and more complex models [00:19:42]. Joshua is optimistic that real-time AI video generation, including on-device processing, will be widespread within 12 months [00:20:46]. This will enable new use cases, such as personalized video ads where content adapts to individual user preferences and watch history [00:20:19].

Full Body Movement

Full-body rendering is crucial for creating engaging human presenters, as body motion and gestures are vital components of communication [00:21:10]. HeyGen’s 3.0 avatar version supports full-body rendering, with the inclusion of gestures as the next development step [00:21:35]. The challenge lies in developing the right model architecture and data to capture the nuances of full-body movement [00:21:48].

Integration with Text-to-Video Models

HeyGen primarily focuses on business videos and approaches video generation through an orchestration engine [00:22:30]. This engine captures text, script, voice, sound, music, avatar footage, and background generation to create a cohesive video [00:23:00]. This “orchestration” approach provides more control, consistency, and quality, which are crucial for brands and enterprises [00:23:17].

HeyGen plans to work closely with text-to-video partners (like Sora, Pika) by incorporating their outputs as components within its broader orchestration system [00:24:00]. HeyGen acts as a service layer that directly interacts with the customer, leveraging the best available generative models as inputs [00:24:12].
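
One way to picture the orchestration approach is as a data structure that collects the separately produced components of each scene and flattens them into an ordered timeline. The sketch below is an assumption-laden illustration, not HeyGen's engine: the Scene fields, the layer names, and build_timeline are hypothetical, and the background asset stands in for output that could come from a partner text-to-video model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Scene:
    """One scene, described by the assets that make it up (hypothetical fields)."""
    script: str
    voice: str                    # path or ID of the synthesized voice track
    avatar_clip: str              # path or ID of the avatar footage
    background: str               # could come from an external text-to-video model
    music: Optional[str] = None   # optional music bed

def build_timeline(scenes: List[Scene]) -> List[Tuple[int, str, str]]:
    """Flatten scenes into ordered (scene_index, layer, asset) entries."""
    timeline = []
    for index, scene in enumerate(scenes):
        layers = [
            ("background", scene.background),
            ("avatar", scene.avatar_clip),
            ("voice", scene.voice),
        ]
        if scene.music is not None:
            layers.append(("music", scene.music))
        timeline.extend((index, layer, asset) for layer, asset in layers)
    return timeline

timeline = build_timeline([
    Scene("Welcome!", "voice_0.wav", "avatar_0.mp4", "office.mp4"),
    Scene("Here is the update.", "voice_1.wav", "avatar_1.mp4", "chart.mp4", music="bed.mp3"),
])
```

Keeping each component as a separate, named layer is what gives this style of pipeline its control: a brand can swap the background or music without regenerating the avatar footage.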

Brand Personalization

A significant future area is the “brand personalization layer” [00:25:03]. Currently, AI can help write scripts in a brand’s tone, but applying brand consistency to video is not yet seamless [00:25:10]. The vision is for AI models to learn a company’s color palette, style, and opening/closing video clips from existing content (e.g., a URL) and bake these elements into the final video assembly process [00:25:51]. This involves disassembling video into components, reassembling them, and integrating user input as a “memory” to feed into the AI model [00:26:31].

Startup vs. Incumbent Dynamics

HeyGen sees itself in the “creative tools stage” rather than competing directly with distribution platforms [00:27:23]. The company believes it’s not competing in the old market but opening up a new market opportunity for users who don’t have access to traditional video production means [00:28:27].

Platforms like Snapchat and TikTok focus on enabling creators using mobile cameras [00:29:12]. HeyGen’s key value is to make the camera obsolete and enable video creation without it [00:29:30]. This creates a dilemma for existing platforms: if AI-generated content becomes prevalent (e.g., 50% of content), it directly competes with and could reduce views for human creators, potentially necessitating new platforms specifically for AI-generated content [00:30:15]. While HeyGen’s mission is to build creative tools, not a consumption platform, they acknowledge this as a possible future opportunity [00:31:10].

Challenges and Strategies in Enterprise AI Deployment

HeyGen’s recent push into the enterprise market has highlighted key requirements [00:31:27]:

  • Higher Quality Requirements: Enterprise customers demand much higher quality and brand consistency in video output [00:31:43].
  • Integration with Workflows: Integrating the technology and product into day-to-day enterprise workflows is crucial [00:32:07]. For marketing, this means integrating with existing CRM and go-to-market tools like HubSpot, allowing for seamless data pulling for context and easy video distribution [00:32:25].

Trust & Safety in AI Video

Trust and safety are critical for HeyGen’s business, especially when serving large enterprise customers [00:33:45]. HeyGen implements a two-pronged approach:

  1. Avatar Creation: For every avatar created, HeyGen requires a video consent format to ensure the person in the footage is the same person giving consent [00:34:11]. They also use dynamically generated passcodes that expire quickly to add a secure layer [00:34:25]. This makes it nearly impossible to create an avatar without consent [00:34:38].
  2. Content Moderation: HeyGen has a platform moderation policy that prohibits hate speech, misinformation, political campaigns, and other undesirable content [00:35:03]. This is enforced through a hybrid solution of AI model review and human moderation [00:35:15].
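
The expiring passcode used during avatar consent can be illustrated with a short Python sketch. The source does not describe HeyGen's implementation, so everything here is assumed: the 60-second lifetime, the six-digit format, and the function names are hypothetical. The idea is that the person recording the consent video must say a freshly issued code, which ties the recording to the current session.

```python
import secrets
import time

TTL_SECONDS = 60  # assumed lifetime; the source only says passcodes expire quickly

def issue_passcode(now=None):
    """Return a random six-digit code and the time at which it expires."""
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(10**6):06d}"
    return code, now + TTL_SECONDS

def verify_passcode(spoken, issued, expires_at, now=None):
    """Accept the spoken code only if it matches and has not yet expired."""
    now = time.time() if now is None else now
    return now <= expires_at and secrets.compare_digest(spoken, issued)

code, expires_at = issue_passcode(now=1000.0)
accepted = verify_passcode(code, code, expires_at, now=1030.0)  # within the window
expired = verify_passcode(code, code, expires_at, now=1061.0)   # after the window
```

Because the code is random and short-lived, a pre-recorded clip of someone else is unlikely to contain the right passcode at the right time.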

HeyGen also engages in IP partnerships with actors who build avatars on the platform [00:35:49]. The emergence of AI-generated voices and persons could lead to new forms of IP, especially as image generation models improve consistency across different generations [00:36:10]. This opens up possibilities for new roles like AI influencers [00:36:49].

Business Model and Capital Intensity

In the AI category, the primary cost factors are GPU resources and talent [00:37:32]. Unlike traditional software companies where marginal cost for additional customers approaches zero, AI businesses consume more GPU compute per user, meaning marginal costs are not zero [00:37:56].
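
The marginal cost point can be made concrete with a toy calculation. All figures below are invented for illustration and do not come from the podcast; monthly_margin and its parameters are hypothetical names.

```python
def monthly_margin(users, price, fixed_cost, gpu_cost_per_user):
    """Revenue minus fixed costs and per-user GPU inference costs."""
    revenue = users * price
    cost = fixed_cost + users * gpu_cost_per_user
    return revenue - cost

# Margin gained by adding one more user at a price of 10 (made-up units).
# Traditional software: marginal cost is roughly zero, so the gain is the full price.
software_gain = monthly_margin(101, 10, 1000, 0) - monthly_margin(100, 10, 1000, 0)
# AI product: each user consumes GPU compute (assumed 4 per user), so the gain is smaller.
ai_gain = monthly_margin(101, 10, 1000, 4) - monthly_margin(100, 10, 1000, 4)
```

Under these assumed numbers the extra user adds 10 to margin for the software company but only 6 for the AI company, which is the sense in which marginal costs are not zero.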

However, Joshua noted that individual employees are becoming much more efficient with AI tools like ChatGPT [00:38:24]. AI-native companies with AI-native teams can achieve greater efficiency [00:38:54]. The accelerated growth trajectory and market excitement around AI mean that companies might require less capital to build great AI products than previously thought [00:39:10] [00:39:36].

HeyGen offers a free tier to allow users to discover “magic moments,” balancing the inference costs with the need for broad market exposure [00:39:43]. HeyGen builds products 12 months ahead of current capabilities, anticipating future model advancements and cost reductions rather than waiting for them [00:40:51].

Overhyped vs. Underhyped in AI

Joshua believes the overhyped aspect in AI is the speed at which AI will deliver massive value to businesses and enterprises [00:41:39]. Conversely, the underhyped aspect is the ultimate impact of AI [00:41:50]. People often overestimate short-term gains but underestimate the long-term transformative power [00:41:55].

Mindset Changes and Customer Insights

Joshua’s mindset shifted significantly around 2021 with the emergence of technologies like Stable Diffusion [00:43:14]. Initially, HeyGen explored 3D model layers for video generation, aligned with the metaverse hype [00:42:40]. However, pixel-by-pixel generation proved to be a faster path due to its ability to train on large-scale data [00:43:29].

Customer feedback often surprises HeyGen, particularly the detailed attention paid to avatar quality and engagement [00:44:09]. Users have a higher bar for their own avatars compared to others [00:44:26]. For example, a customer highlighted that while avatar gestures were fine for the first few minutes of a video, they became random later on, indicating a need for improved long-duration gesture matching [00:45:30].

Vision for 2030

Joshua’s vision for video creation workflows in 2030 is that everyone will have a “video agency in their pocket” [00:48:33]. This means a product like HeyGen will interact with users like a personal video agency, guiding them from ideas to filmed footage and editing, with iterative feedback loops [00:47:56].

By 2030, any text, audio, or video content currently being made will be generatable by AI at a much faster rate and lower cost [00:49:11]. The true power of creative tools is opening up entirely new use cases that are currently unimaginable, similar to how mobile cameras led to platforms like Instagram, Snapchat, and TikTok [00:49:29]. Improving tools and lowering creation barriers will unlock a whole new world of possibilities [00:49:56].

Joshua’s passion for the space stems from his experience at Snap, working with mobile platform evolution and witnessing the emergence of content platforms [00:50:17]. His greatest joy comes from seeing people use the technology and tools he builds to create something on their own [00:50:56].

To learn more, visit heygen.com [00:51:14].