The role of specialized models in speech recognition

From: redpointai

Connor Zwick, CEO of Speak, an English language learning platform, highlights the significant role of specialized models in speech recognition, particularly in the context of AI-driven language education [01:12:05].

Speak’s Approach to Specialized Speech Recognition

Speak has developed its own in-house speech recognition models, in addition to utilizing larger foundational models [01:12:07]. This strategic investment in specialized models is driven by the unique requirements of their platform, which focuses on conversational fluency [01:12:11].

Key aspects of Speak’s specialized models include:

Accent Recognition: Their models are “super, super good” at understanding users speaking with various accents [01:12:28].
Mistake Detection: The system can identify specific types of pronunciation mistakes made by learners [01:12:34].
Speed and Reliability: These models return fast, reliable, streaming responses to the user’s client, which is crucial for an effective product experience [01:12:37].
Phoneme Recognition: A dedicated phoneme recognition system, built from Speak’s own data, helps detect pronunciation errors and prosodic mistakes [01:12:45].
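
As a rough illustration of how phoneme-level mistake detection can work in general, the sketch below aligns the phonemes a learner was expected to produce with the phonemes a recognizer reported, then lists the differences. The interview does not describe Speak's implementation, so this is not their method. The function name `find_pronunciation_mistakes` and the ARPAbet-style phoneme labels are assumptions made for the example, and the alignment uses Python's standard `difflib` rather than any speech-specific library.

```python
from difflib import SequenceMatcher


def find_pronunciation_mistakes(expected, heard):
    """Align two phoneme sequences and list where they differ.

    Both arguments are lists of phoneme labels (hypothetical ARPAbet-style
    symbols). Returns (kind, expected_span, heard_span) tuples, where kind
    is "replace", "delete", or "insert".
    """
    # autojunk=False disables difflib's popularity heuristic, which is
    # meant for long texts and is not wanted for short phoneme lists.
    matcher = SequenceMatcher(a=expected, b=heard, autojunk=False)
    mistakes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mistakes.append((tag, expected[i1:i2], heard[j1:j2]))
    return mistakes


# Hypothetical example: the learner says "sink" where "think" was expected.
expected = ["TH", "IH", "NG", "K"]
heard = ["S", "IH", "NG", "K"]
print(find_pronunciation_mistakes(expected, heard))
```

A production system would presumably work from acoustic scores rather than hard phoneme labels, but the alignment step shows why a phoneme-level view can point at the specific sound a learner got wrong instead of only flagging the whole word.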

Connor emphasizes that while large, general-purpose foundation models are expected to eventually subsume many tasks, specialized models currently offer a significant advantage for niche applications [01:12:16]. Even if these specialized models are only used for a few years, they are a worthwhile investment in building a business [01:13:07].

Rationale for Building Specialized Models

The decision to build and utilize specialized models rather than solely relying on general-purpose models like LLMs is based on a strategic, long-term vision [08:21:00].

Connor explains the philosophy:

“We knew in the beginning that there was a long ways to go on the technology and we couldn’t perfectly predict it but the thing we did know is like over the next 5 to 10 years like with more data more compute models would get better and better and better and eventually they would like surpass humans on various tasks and eventually you know that would that would mean that we could fully replace the human in the learning process” [08:23:00]

This long-term orientation allowed Speak to make product decisions aligned with future technological capabilities, ensuring continuous evolution [08:50:00].

Building these specialized models is a “really big investment” in terms of compute, team, and resources [01:13:59]. However, it allows Speak to build a business, collect more data, and invest further [01:13:28].

Specialized vs. General Models

Connor draws an analogy to the personal computer industry in the 1980s, where companies like Apple used Intel processors rather than building their own [01:14:30]. Similarly, businesses today might use foundational LLMs [01:14:26]. The “AI firmware” or “ML scaffolding” — the technology built to orchestrate and integrate these models with the product and backend — represents a significant and often overlooked investment that can form a long-term technological moat [01:15:20].

“People are always talking about modeling… I think the modeling is definitely one investment but we’re making a much bigger investment on this piece and I actually think that’s like if I were going to say what’s our like long-term technological moat I would actually say that probably a bigger one” [01:15:42]

Challenges and Opportunities

A key challenge for companies building on AI is deciding whether to build around the shortcomings of current models or wait for improvements [09:47:00]. Speak’s strategy is to continually make progress on the core problem (language learning methodology) even if it means swapping out technology later [01:10:48].
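
One common way to keep the option of swapping technology later is to have product code depend on a small interface rather than on a particular model. The sketch below illustrates that idea only. Every name in it (`SpeechRecognizer`, `InHouseRecognizer`, `FoundationModelRecognizer`, `LessonSession`) is hypothetical, the recognizers return canned strings instead of calling real models, and nothing here is taken from Speak's codebase.

```python
from typing import Protocol


class SpeechRecognizer(Protocol):
    """Anything that can turn audio bytes into a transcript."""

    def transcribe(self, audio: bytes) -> str: ...


class InHouseRecognizer:
    """Stand-in for a specialized model; returns a canned transcript."""

    def transcribe(self, audio: bytes) -> str:
        return "hello from the in-house model"


class FoundationModelRecognizer:
    """Stand-in for a general-purpose model; returns a canned transcript."""

    def transcribe(self, audio: bytes) -> str:
        return "hello from the general model"


class LessonSession:
    """Product-side code that only knows about the interface."""

    def __init__(self, recognizer: SpeechRecognizer) -> None:
        self.recognizer = recognizer

    def handle_turn(self, audio: bytes) -> str:
        transcript = self.recognizer.transcribe(audio)
        return f"Learner said: {transcript}"


session = LessonSession(InHouseRecognizer())
print(session.handle_turn(b"\x00\x01"))

# Swapping the underlying model does not require touching LessonSession.
session.recognizer = FoundationModelRecognizer()
print(session.handle_turn(b"\x00\x01"))
```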

Connor believes there is still room for specialized, audio-only models to “win” in certain contexts [01:10:07]. These include:

Niche Use Cases: Specialized models can cater to unique needs not fully addressed by large general-purpose audio models [01:43:22].
Security and On-Premise Needs: Certain applications might require specific security or on-premise solutions for speech data [01:43:35].
Specific Vocabulary: Specialized models can handle highly specialized or unusual vocabulary not commonly found on the internet [01:43:42].
Risk-Taking: Smaller, specialized startups can take more risks than larger players [01:43:55].

The Future of Audio and Multimodal AI

Connor emphasizes that the future of UI will likely be “fluid,” allowing users to choose between talking, typing, or tapping [02:10:00]. While speech isn’t always superior, it’s often better and will drive a huge shift, especially as speech-to-speech models improve [02:27:00].

Speak is excited about multimodal audio, seeing it as a “holy grail” for their use case [01:40:18]. The progression towards AGI, with continuous models that integrate speech recognition, the LLM, and speech synthesis in a single turn, would significantly reduce latency and retain more nuance [01:45:38]. This advancement would enable more natural, human-like tutor experiences [01:45:27].

The ability to build specialized solutions on top of these improving multimodal models will offer “endless possibility” [01:45:00]. These improvements are expected to lead to much smarter curriculum planning [01:46:44].

Evaluation of Models

Model evaluation is crucial and often underrated [01:57:57]. For Speak, evaluation is not just about word error rate but also about catching individual mistakes and understanding unintelligible speech, sometimes even training models to understand words humans wouldn’t [01:31:34]. A robust evaluation framework provides execution clarity for the team [01:32:16]. When new models like GPT-4o are released, Speak runs them against numerous internal eval loops and human-in-the-loop evaluations, relying on a playbook to manage the process [01:33:13]. They also track product metrics directly with users to gauge success [01:34:12].
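
For reference, word error rate (WER) is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer's output, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. It is a generic illustration, not Speak's evaluation code; the function name `word_error_rate` and the example sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match if cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the recognizer drops the word "a".
print(word_error_rate("i would like a coffee", "i would like coffee"))
```

A single aggregate number like this treats every word equally, which helps explain why the evaluation described above also looks at individual learner mistakes and unintelligible speech.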
