Challenges and innovations in AI music tech and infrastructure

From: redpointai

AI has profoundly changed the music industry [00:00:11]. Suno, an AI music generation product, has gained significant traction, with over 10 million users having generated songs and a recent fundraise of $125 million, valuing the company at $500 million [00:00:13]. Mikey Shulman, CEO of Suno, discusses the challenges and future possibilities of AI in audio, particularly the product and infrastructure challenges faced while scaling [00:00:36].

User Experience and Creative Outlets

Suno caters to two main user categories [00:06:00]:

Casual Users (Soundtracking Life): These users musically narrate their lives, creating songs about everyday occurrences like Starbucks getting their name wrong or unexpected deliveries [00:06:13]. Music serves as a powerful medium for storytelling [00:06:25].
Power Users (Creative Outlet): This group finds Suno an amazing creative outlet, enjoying both the process and the final product [00:06:48]. They spend hours crafting songs to tell a specific story or achieve a particular sound [00:07:05].

A key insight has been the discovery of many users who have great musical taste and ideas but lack the traditional means to create music [00:07:29]. This highlights that established production tools like Ableton or Logic, while amazing, have steep learning curves [00:07:47]. New technological breakthroughs in AI allow for reimagined processes, enabling people to make music in entirely different ways [00:07:54].

Overcoming the “Blank Canvas” Problem

A significant challenge for AI products is the “blank canvas” problem, where users are overwhelmed by where to begin [00:08:14]. Suno has acknowledged this as an ongoing challenge [00:08:40].

Possible future solutions for prompting the AI beyond text include:

Guiding users through purpose-driven experiences, such as creating a song for Valentine’s Day [00:09:03].
Allowing input based on mood, visuals, or sounds from everyday life, turning them into music [00:11:41].
Enabling users to hum a melody or tap a beat to inspire the model [00:10:01].
Using images or current thoughts as prompts [00:10:08].

The text-driven nature of many AI tools, including Suno, is seen as a sign of how early the technology is and the vast room for growth in intuitive interaction [00:10:35].

Enhancing User Control and Collaborative Creation

For power users, the focus is on giving them greater control to achieve the sounds in their heads, but not through traditional, complex production software interfaces [00:12:54]. The aim is to find ways for people to “pour themselves into” the creative process, making it more enjoyable [00:13:02].

[!NOTE] The sterile nature of a text box is seen as a limitation, as it doesn’t allow users to fully express their musical vision by pouring their heart out in song, using moving images, or mood-boarding sounds [00:13:06].

A major future focus for Suno is “multiplayer” experiences, enabling collaborative music creation [00:14:31]. This can be synchronous (jam sessions) or asynchronous (sending half a song for someone to finish) [00:14:39]. The goal is to recreate the joy of jamming with friends, even for those who are not expert musicians [00:15:31]. This shared experience can be both personal and public, as seen with Twitch streamers creating live music with audience interaction [00:16:51].

Business Model and Model Evaluation

Pricing Challenges in AI Products

Suno currently employs a freemium model, offering a free tier with a set number of songs and charging power users for more generations [00:18:12]. However, the company is intentionally not trying to innovate on the business model at this early stage, prioritizing product innovation [00:18:40].

[!WARNING] The current AI product pricing often mirrors the Software-as-a-Service (SaaS) model, which may not be suitable because AI model usage incurs a non-zero marginal cost (compute expenses), unlike typical SaaS [00:19:08]. The optimal pricing model is expected to be highly product and use-case dependent and will likely evolve significantly over time [00:19:40].
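The marginal-cost point can be made concrete with a little arithmetic. The sketch below is illustrative only: the subscription price, the per-generation compute cost, and the function names are all hypothetical and do not come from the interview. Amounts are in integer cents to keep the arithmetic exact.

```python
def monthly_margin_cents(price_cents: int, generations: int, cost_per_generation_cents: int) -> int:
    """Gross margin for one flat-rate subscriber in one month, in cents."""
    return price_cents - generations * cost_per_generation_cents


def break_even_generations(price_cents: int, cost_per_generation_cents: int) -> int:
    """Number of generations at which compute cost equals the subscription price."""
    return price_cents // cost_per_generation_cents


# Hypothetical numbers: a $10.00 plan and $0.02 of compute per generation.
PRICE_CENTS = 1000
COST_CENTS = 2

light_user = monthly_margin_cents(PRICE_CENTS, 100, COST_CENTS)   # still profitable
heavy_user = monthly_margin_cents(PRICE_CENTS, 800, COST_CENTS)   # loses money
threshold = break_even_generations(PRICE_CENTS, COST_CENTS)
```

Under these assumed numbers a heavy user costs more than the plan brings in, which is the situation a flat SaaS-style price does not account for when marginal cost is non-zero.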

Evaluating AI Music Models

Evaluating music models is more complex than text or image models because music lacks a “correct answer” [00:20:20]. While objective metrics for audio quality exist, they are often flawed [00:20:29].

[!INFO] Aesthetics matter: Subjective, intangible metrics and human judgment are crucial for evaluating AI-generated music [00:20:36]. The ultimate test of quality is how much users love the music produced and how much control they feel they have over the output [00:21:11].

Suno relies on user feedback, both implicit (usage patterns, model choices) and explicit (Discord community feedback), to understand model performance and identify issues [00:22:18]. Fixing model issues often involves deep data analysis, as is common in machine learning [00:23:39].

Current limitations of music models include:

Difficulty with iterative changes (e.g., “do that, but change X”) [00:23:55].
Lack of precise control over objective musical parameters like BPM [00:24:17].

North Star metrics for Suno revolve around user enjoyment:

Number of users making songs [00:24:50].
Daily active users [00:24:53].
Probability of exhausting free credits [00:24:57].
Sharing activity (did users share songs, or were songs shared with them?) [00:25:10].
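As a rough illustration of how metrics like these could be computed from usage records, the sketch below uses a made-up `UserActivity` record and made-up data. The field names and the free-credit unit are assumptions for the example, not Suno's actual schema or definitions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UserActivity:
    songs_made: int      # songs generated in the period
    free_credits: int    # free-tier allowance, counted in songs (assumed unit)
    shared_song: bool    # whether the user shared at least one song


def fraction(users: List[UserActivity], predicate: Callable[[UserActivity], bool]) -> float:
    """Share of users for which the predicate holds (0.0 for an empty list)."""
    if not users:
        return 0.0
    return sum(1 for u in users if predicate(u)) / len(users)


# Made-up sample data.
users = [
    UserActivity(songs_made=10, free_credits=10, shared_song=True),
    UserActivity(songs_made=3, free_credits=10, shared_song=False),
    UserActivity(songs_made=12, free_credits=10, shared_song=True),
    UserActivity(songs_made=0, free_credits=10, shared_song=False),
]

maker_count = sum(1 for u in users if u.songs_made > 0)
exhausted_rate = fraction(users, lambda u: u.songs_made >= u.free_credits)
share_rate = fraction(users, lambda u: u.shared_song)
```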

Infrastructure and Development

Speed of Generation

Suno prioritizes rapid song generation, acknowledging that users expect near-instant results, similar to playing a song on Spotify [00:25:40]. The goal is to minimize latency, as every additional 100 milliseconds of delay increases the likelihood of user disengagement [00:26:11]. Suno uses auto-regressive Transformers, which allow the song to be streamed to the user while it is still being generated, providing a significant speed advantage over diffusion-based models [00:26:34].
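The streaming advantage of auto-regressive generation can be sketched with a Python generator: because tokens are produced one at a time, a chunk can be handed to the listener as soon as it is full, rather than after the whole song is finished. This is a toy sketch, not Suno's implementation. `toy_next_token` stands in for a real model, and the chunk size and token counts are arbitrary assumptions.

```python
from typing import Callable, Iterator, List


def generate_stream(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    total_tokens: int,
    chunk_size: int,
) -> Iterator[List[int]]:
    """Yield audio tokens in chunks as soon as each chunk is complete."""
    context = list(prompt)
    chunk: List[int] = []
    for _ in range(total_tokens):
        token = next_token(context)   # one auto-regressive step
        context.append(token)
        chunk.append(token)
        if len(chunk) == chunk_size:
            yield chunk               # playback can start here
            chunk = []
    if chunk:
        yield chunk                   # final partial chunk


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a model: next token is previous + 1."""
    return (context[-1] + 1) % 1024


first_chunk = next(generate_stream(toy_next_token, [0], 10, 4))
all_chunks = list(generate_stream(toy_next_token, [0], 10, 4))
```

In this toy run the first chunk is available after 4 of the 10 generation steps, so time to first audio depends on the chunk size rather than on the length of the song.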

Scaling Infrastructure

Having experienced an “insane spike of usage” [00:27:12], Suno has learned to be selective about where to innovate versus where to leverage existing tools. They are “big fans and customers of Modal,” which simplifies deploying jobs onto GPU infrastructure [00:27:45].

[!INFO] The audio domain in AI benefits from lessons learned and open-source contributions from the more advanced image and text communities, which have already solved many common problems, such as continuous batching [00:28:15]. Builders are advised to be deliberate about where they choose to innovate versus leverage existing solutions [00:29:03].
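Continuous batching, mentioned above as a problem the text community has already solved, can be illustrated with a small simulation. In static batching a batch runs until its longest request finishes; in continuous batching a finished request frees its slot immediately and a waiting request takes it on the next step. The sketch below only counts decode steps for hypothetical request lengths (assumed to be positive); it is not a real serving system and is not taken from the interview.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each fixed batch runs for as many steps as its longest request."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    active: List[int] = []   # remaining tokens per in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# Hypothetical workload: one long request and three short ones, two slots.
request_lengths = [8, 2, 2, 2]
static_steps = static_batching_steps(request_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(request_lengths, batch_size=2)
```

For this workload the short requests cycle through the second slot while the long request is still running, so the continuous schedule finishes in fewer steps than the static one.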

Funding and Future Direction

Suno’s recent $125 million fundraise is primarily for scaling [00:35:08]:

Training big models: Music models may not require the same compute levels as the largest text models in the near future, but they demand significant care and specialized data [00:35:13]. The correct way to model music is still being figured out [00:35:44].
Research: The research required for music models is expensive, especially in exploring control axes and taste [00:36:04].
Hiring: Attracting top talent is crucial for growth [00:36:17].

The overarching goal is to accelerate the envisioned future of music by deploying capital to pull it forward more quickly [00:36:39].

Broader Applications and Market Dynamics

The Rise of Audio AI

The increasing prominence of audio AI, exemplified by models like GPT-4o, signals a realization that audio should be a first-class citizen in AI, given that it’s how the vast majority of human communication occurs [00:29:32]. The ability to interact with systems like another human being has immense, still-unforeseen impacts, beyond obvious applications like customer service [00:29:52]. This represents a significant shift towards more expressive forms of communication in various domains [00:30:37].

Future of AI Music Capabilities

While an AI model generating a 3.5-minute pop song indistinguishable from a human-made one is a common benchmark, it’s not the most interesting milestone for Suno [00:37:37]. Music is about how it makes you feel, suggesting the ceiling for AI music is much higher than mere indistinguishability [00:37:54].

Personal aspirations include a Vision Pro app that enables “air guitar” or conducting a symphony, where the music responds in real-time to the user’s movements, transforming music creation into an enjoyable game [00:38:23].

Market Landscape and Intellectual Property

The AI music market is considered “really, really big” and still greenfield [00:39:40]. AI, along with other technologies, has the power to greatly expand the market for music [00:39:43]. There will likely be multiple companies, some focusing on professional artists, others on background music, and Suno targeting general consumers [00:40:21].

In terms of intellectual property (IP) partnerships, the music world is still very early, akin to the Napster-to-Spotify evolution [00:41:44]. Suno aims to collaborate with the industry [00:42:05]. Suno generally avoids direct artist partnerships (e.g., making new songs in a specific artist’s voice) [00:42:10]. The viral moments such songs create are seen as a “flash in the pan” rather than the core future of music [00:43:39]. The true future lies in enabling people to enjoy the process of creating music that is relevant to them [00:43:09].

Hot Takes on AI Trends

Open Source vs. Closed Source AI

Open source AI is considered “overhyped” due to the high computational barriers and lack of clear financial incentives compared to proprietary models [00:46:21]. The best open-source models often come from financially resourceful companies like Meta [00:46:44]. As compute costs for state-of-the-art models increase, it will be challenging for open source to keep up without robust business models [00:46:51].

The dynamic between open and closed-source models is complex. While state-of-the-art closed models will likely continue to lead at the frontier for economically valuable applications, open-source models often catch up to previous advancements within a year [01:00:51]. This suggests a continued existence for both, with closed models pushing the cutting edge and open-source democratizing access to mature capabilities [01:01:01].

Underhyped Areas

Music itself is seen as an underhyped aspect of people’s lives, not just within AI [00:47:07].

Surprises and Lessons Learned from Building Suno

Positive Surprise: Letting users feel pride in their creations, even down to editing song titles to include their names on a trending page, has been very successful [00:47:26]. This validates the importance of users taking ownership and celebrating their work [00:48:04].
Negative Surprise: Early investment in owning hardware (GPU boxes) proved to be a misstep, as the scale required for growth quickly outstripped in-house capacity [00:48:16].
Change of Mind: The initial belief that Discord would be a long-term primary platform was incorrect [00:48:41]. A thin web app, launched in November, quickly captured 90% of usage within five days, demonstrating that for building an “all-encompassing music experience,” a dedicated web platform is superior to a messaging platform [00:49:06]. However, the Discord community remains an invaluable resource for feedback and community management [00:49:54].
