From: redpointai

Connor Zwick, CEO of Speak.com, leads an English language learning platform that has garnered over 10 million users in more than 40 countries since its 2019 launch in South Korea [00:00:20]. Backed by OpenAI, Speak recently reached a $500 million valuation [00:00:16]. The company is building an AI-driven path to language fluency, emphasizing conversational skill over traditional grammar drills and memorization [00:06:30].

An Entrepreneurial Spark

Connor Zwick’s entrepreneurial journey began in high school with a flashcard app for early iPhones [00:01:24]. The app gained several million users and facilitated the creation of hundreds of millions of “knowledge pairs” [00:01:32]. Zwick envisioned aggregating this data into a graph to “generate anything and teach anything that anyone wants to learn,” ultimately creating an “omniscient tutor” [00:02:04]. He notes that the technology to achieve this, specifically Large Language Models (LLMs) trained on the internet, now largely exists [00:02:16]. Flashcard data, he believes, would be “really good data specifically for learning” due to its structured nature [00:02:47].

Zwick’s formal exposure to AI began in 2015 when he “crashed” a Berkeley course, becoming convinced that underlying models would significantly improve [00:04:06]. At that time, the focus was on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), before the Transformer architecture was invented [00:04:32]. Initial ideas for AI applications included computer vision tasks such as automated parking enforcement (deemed “horrible for the world”) and measuring bodies for custom clothing or medical imaging [00:05:05]. The team was ultimately “more drawn to… speech recognition” because of the potential to build technology that felt like it had a “persona” and fostered a “relationship” [00:05:57].

Speak’s AI-Driven Product Evolution

Speak teaches language learners “chunks of words” that frequently occur in everyday speech, encouraging repetition until automaticity [00:07:09]. Users then practice in simulated conversations aimed at achieving specific communication goals relevant to their motivation for learning [00:07:25]. The entire experience is “extremely individuated to the individual user,” adapting to their motivation, interests, and proficiency level [00:07:45].

Connor emphasizes Speak’s “long-term oriented” vision, anticipating that with more data and compute, models would improve and eventually surpass human capabilities in various tasks, leading to the full replacement of humans in the learning process [00:08:21]. Early “unlocks” for the product focused on highly accurate speech recognition to ensure a positive user experience, later incorporating phoneme recognition and basic language understanding [00:09:28].

Building Defensibility in AI

Speak’s strategy for building a “technological moat” involves several key areas:

  • Specialized In-House Models: While acknowledging that large foundational models will eventually “subsume” many tasks, Speak develops its own specialized models for short-to-medium-term advantages [00:12:00].
    • Accented Speech Recognition: Speak has an in-house speech recognition model specifically trained to understand users speaking with accents, identify specific mistakes, and provide fast, reliable, streaming feedback [00:12:24].
    • Phoneme Recognition: A system built on their data detects pronunciation errors and “prosodic types of mistakes” [00:12:45].
  • ML Scaffolding / AI Firmware: Connor identifies the “ML scaffolding” or “AI firmware” as a significant long-term technological moat [00:15:24]. This includes:
    • Orchestrating models for specific tasks [00:15:31].
    • Continuously collecting new data [00:16:39].
    • Fine-tuning models [00:16:41].
    • Building robust evaluation frameworks [00:16:42].
    • Developing infrastructure for language representation, such as knowledge graphs that track each user’s proficiency [00:16:51]. This scaffolding accounts for at least 50% of Speak’s product development time [00:17:09] (a rough illustrative sketch follows this list).
  • End-to-End User Experience: The focus is on the core problem of teaching language effectively and engaging users, even if underlying technology needs to be swapped out [00:11:02].
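
The “ML scaffolding” layer lends itself to a concrete illustration. The following is a minimal sketch, not Speak’s actual implementation: the class names, task labels, and scoring scheme are all assumptions, showing one way to orchestrate task-specific models behind a stable interface while tracking per-user proficiency in a simple graph.

```python
# Hypothetical sketch of "ML scaffolding": a router that dispatches tasks to
# whichever model currently handles them best, plus a per-user proficiency
# graph. Names and heuristics are invented for illustration.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ProficiencyGraph:
    """Per-user map from language items (chunks, phonemes) to a 0-1 mastery score."""
    scores: Dict[str, float] = field(default_factory=dict)

    def update(self, item: str, correct: bool, rate: float = 0.2) -> None:
        # Exponential moving average toward 1.0 on success, 0.0 on mistakes.
        prev = self.scores.get(item, 0.5)
        target = 1.0 if correct else 0.0
        self.scores[item] = prev + rate * (target - prev)

    def weakest(self, n: int = 3) -> List[str]:
        return sorted(self.scores, key=self.scores.get)[:n]

class TaskRouter:
    """Routes each task to a registered handler, so a specialized in-house
    model can later be swapped for a foundational model without touching
    product code."""
    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, task: str, handler: Callable[[str], str]) -> None:
        self._handlers[task] = handler

    def run(self, task: str, payload: str) -> str:
        return self._handlers[task](payload)

# Usage: register placeholder handlers for the kinds of tasks mentioned above.
router = TaskRouter()
router.register("accented_asr", lambda audio: "transcript of accented speech")
router.register("phoneme_feedback", lambda audio: "flag mispronounced phonemes")

graph = ProficiencyGraph()
graph.update("chunk:'Could you say that again?'", correct=False)
graph.update("phoneme:/r/", correct=True)
print(graph.weakest(1))  # the item to drill next
```

The design point matching the end-to-end framing above: product code talks only to the router and the graph, so the underlying models can be replaced as foundational models improve.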

User Experience and Interface Design

Speak is inventing “new interface paradigms around like audio first experiences” [00:19:02]. Talking into an app remains, in his words, “kind of fundamentally unfamiliar when it comes to technology” [00:19:06]. An example is the onboarding flow, which presents a microphone button and a simple question like “Why are you learning English?” [00:19:19]. Users often wonder how long their answer should be, or which language to answer in [00:19:30].

Connor believes the interface will become more “fluid” and “hybrid,” allowing users to “talk or type or tap” at any point [00:21:00]. While speech isn’t “always better,” it’s “definitely better some of the time,” especially as speech-to-speech models improve [00:21:25]. However, typing can be faster with a keyboard, and there will always be scenarios where tapping is preferred [00:21:38]. The increasing familiarity with paradigms from apps like ChatGPT has already led to a “meaningful shift in… the average user’s understanding of these paradigms” [00:20:25].

A future interface “unlock” lies in the AI “thinking about you in the background,” observing data and proactively taking actions [00:22:56]. For instance, after an hour of using Speak, the system could process a user’s activity overnight and generate “distilled analysis and lessons” to start their next day [00:23:46].
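
One way to picture this proactive, background behavior is a nightly batch job. The sketch below is purely hypothetical (the function name, event format, and heuristics are invented), showing how a day of practice data could be distilled into review items that open the next session.

```python
# Hypothetical sketch of the "thinking about you in the background" idea:
# a nightly job that distills the day's activity into a short lesson plan.
from collections import Counter
from typing import Dict, List

def process_overnight(session_events: List[Dict]) -> Dict:
    """Summarize one day of practice into material for tomorrow's first lesson."""
    mistakes = Counter(e["item"] for e in session_events if not e["correct"])
    return {
        "summary": f"{len(session_events)} exercises, {sum(mistakes.values())} mistakes",
        "review_items": [item for item, _ in mistakes.most_common(3)],
    }

plan = process_overnight([
    {"item": "past tense of 'go'", "correct": False},
    {"item": "phoneme /th/", "correct": False},
    {"item": "phoneme /th/", "correct": True},
])
print(plan["review_items"])  # items to open tomorrow's session with
```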

AI’s Impact on Learning and Industries

Curriculum Design

Speak aims for a hybrid approach to curriculum design. While there’s a “right sequence of ways to learn a language” (e.g., common high-frequency words first), the specific ordering and the exact set of words can be “individual and bespoke for the user” [00:25:22]. Humans will likely remain “in the loop” for “artistic creation of the actual curriculum,” but the machine learning team is increasingly involved in this process, necessitating cross-functional understanding [00:26:01].
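
To make the hybrid idea concrete, here is a small, hedged sketch: a human-authored canonical sequence is kept as the baseline, and a per-user score nudges weak or goal-relevant chunks earlier. The chunk names, weights, and scoring function are illustrative assumptions, not Speak’s curriculum logic.

```python
# Illustrative sketch of hybrid curriculum ordering: humans author the
# canonical sequence; a per-user reordering pulls forward items the user is
# weak on or that matter most for their stated goal. All data are made up.
from typing import Dict, List

CANONICAL_SEQUENCE: List[str] = [          # human-authored ordering
    "greetings", "ordering food", "small talk", "asking directions",
    "job interview phrases", "giving a presentation",
]

def personalize(sequence: List[str],
                mastery: Dict[str, float],      # 0-1 per chunk, e.g. from a proficiency graph
                goal_weights: Dict[str, float]  # relevance of each chunk to the user's goal
                ) -> List[str]:
    """Respect the human-authored order as a baseline, then promote items
    the user is weak on or that matter for their goal."""
    def priority(chunk: str) -> float:
        base = sequence.index(chunk)             # canonical position
        need = 1.0 - mastery.get(chunk, 0.0)     # weaker items move up
        goal = goal_weights.get(chunk, 0.0)      # goal-relevant items move up
        return base - 2.0 * need - 3.0 * goal

    return sorted(sequence, key=priority)

# A business-English learner who already handles greetings well:
print(personalize(
    CANONICAL_SEQUENCE,
    mastery={"greetings": 0.9, "job interview phrases": 0.1},
    goal_weights={"job interview phrases": 1.0, "giving a presentation": 0.8},
))
```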

Model Costs and Pricing

Speak does not feel “constrained” by model inference costs, since it operates at scale with a subscription model [00:28:06]. Even if costs were a constraint, the team would “build it anyway and eat the costs,” because costs are expected to fall over time while demand increases [00:28:27].

Speak’s pricing strategy considers both extremes:

  • Radical Accessibility: Providing a software solution with low marginal cost to “literally hundreds of millions of people” [00:29:21].
  • Premium Pricing: Charging significantly more for a consumer product that competes with offline tutoring or classroom education, which can cost hundreds of dollars per month [00:29:51].

Evaluation of Models

Connor stresses the importance and difficulty of evaluation, particularly for open-ended LLM tasks [00:30:57]. For speech, evaluation goes beyond mere word error rates to include catching individual mistakes and understanding when a user’s speech is “substantially unintelligible” even if a model could understand it [00:31:34]. The goal is sometimes to “dumb down your understanding model to human level” to better assess real-world communication [00:32:05]. A well-defined evaluation framework provides “execution clarity” for the team [00:32:16].

When new models like GPT-4o are released, Speak has a detailed process involving running them against 40+ major internal tasks and eval loops, including human-in-the-loop evaluations [00:33:31]. They also track product metrics with subsets of customers to see if the new models improve engagement [00:34:12].
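
A hedged sketch of what such an evaluation harness might look like follows. The task suites, scoring function, and the margin that routes close calls to human review are invented for illustration and are not Speak’s internal tooling.

```python
# Hypothetical sketch of a model-evaluation loop: run a candidate model
# across a suite of internal tasks, compare against the current model, and
# flag close calls for human-in-the-loop review.
from typing import Callable, Dict, List, Tuple

EvalCase = Tuple[str, str]   # (input, reference / expected behavior)

def run_suite(model: Callable[[str], str],
              suites: Dict[str, List[EvalCase]],
              score: Callable[[str, str], float]) -> Dict[str, float]:
    """Average score per task suite for one model."""
    return {
        task: sum(score(model(x), ref) for x, ref in cases) / len(cases)
        for task, cases in suites.items()
    }

def compare(candidate: Callable[[str], str],
            baseline: Callable[[str], str],
            suites: Dict[str, List[EvalCase]],
            score: Callable[[str, str], float],
            review_margin: float = 0.02):
    cand = run_suite(candidate, suites, score)
    base = run_suite(baseline, suites, score)
    # Suites where the two models are nearly tied go to human review.
    needs_human_review = [t for t in suites if abs(cand[t] - base[t]) < review_margin]
    return cand, base, needs_human_review

# Toy usage with two fake "models" and one fake task suite:
suites = {"grammar_correction": [("i goed home", "I went home")]}
exact = lambda out, ref: float(out == ref)
print(compare(lambda x: "I went home", lambda x: "I go home", suites, exact))
```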

Disruptive vs. Sustaining Technologies

Connor differentiates between AI as a “sustaining technology” (improving an existing solution) and a “disruptive technology” (fundamentally changing the problem or its solution) [00:35:06]. He argues that Duolingo, primarily focused on casual language learning for native English speakers who weren’t previously learning languages [00:36:11], might not be as directly impacted by AI as Speak, which targets users seeking conversational fluency who lack access to human speakers [00:37:34].

While tools like real-time translation might obviate the need for some casual users to learn a language [00:38:05], Connor points out the inherent latency and imperfection of translation, which stem from fundamental differences between languages [00:38:25]. For Speak’s users, the core motivation is “human connection” and the desire to connect with more people globally, something a “live Babel fish” cannot fully address [00:38:50].

The release of models like GPT-4o, even if causing stock fluctuations for companies like Duolingo [00:39:37], is seen by Speak as a positive development. More people using ChatGPT to learn and practice languages will realize the potential of AI in language learning and then seek specialized, more effective solutions like Speak [00:40:35].

Future Outlook

General AI Capabilities

Connor anticipates significant advancements in “multimodal audio” connected to LLMs [00:44:51]. The “Holy Grail” is a single, continuous model that can handle speech recognition, LLM processing, and speech synthesis in one turn with lower latency and less information loss [00:45:34]. This would allow for a more natural, human-like tutor experience, understanding nuances, tone, emotions, and mistakes directly from speech [00:45:56].

He also notes room for improvement in the “cognition piece,” specifically “reasoning and general ability to like actually follow a task through completion and do that well and reliably” [00:46:27]. This would enable “much smarter planning around the curriculum” [00:46:44].

Expansion Areas

Speak’s current advantage is that in language learning, AI can already offer a “fully useful,” disruptive solution, one that can potentially “fully take the human out of the loop” [00:47:20]. Other, higher-stakes industries may still require human involvement because they demand stronger reasoning and fewer hallucinations [00:47:44].

Future expansion areas for AI in education include:

  • Schools: Improving the quality of education beyond just digital quizzes or video lectures [00:50:18].
  • Businesses and Professional Skills: Developing and certifying professional skills, such as public speaking or giving presentations in English, which is a significant part of Speak’s growing Enterprise business in South Korea [00:48:11].
  • Personal Learning: This “invisible” but “massive” sector includes daily activities like reading books, listening to podcasts, watching videos, and reading articles, all driven by the desire to “become like a better version of yourself” [00:50:37]. Connor envisions personal learning in 10-15 years as “highly individuated,” with an AI that has “long-term memory” and a “good mental mapping of everything you kind of know” [00:52:05].

Connor believes education will be one of the “biggest and most exciting areas of change and disruption” in the AI era [00:53:57]. He notes that while software has “eaten the world,” the fundamental quality of education hasn’t changed much in 20 years, still relying on methods similar to those from 2,000 years ago [00:54:10].

Challenges in AI Progress

While optimistic about the decade ahead, Connor expresses concern that the current “obsession” with the Transformer architecture might be a “local maximum” and hopes research continues into other AI approaches [00:55:51]. He notes that for subjects like math, the “bar to… having a thing that people will really enjoy using is much higher” than for language learning, because existing solutions for those subjects are already relatively effective [00:57:27].

Lessons Learned

A continuous challenge is that new AI capabilities, however exciting, are “never as good as you think” they will be and are not a “panacea” [01:00:48]. Building something that actually changes user behavior remains “really, really hard” [01:00:54]. One example was the introduction of “human-level transcription” tied to GPT-4 for open-ended lessons, which was “good but it wasn’t a game changer” [01:01:08].

Originally, Speak aimed to build all its models in-house, but later realized that some models would be too costly to develop independently, leading to a shift in strategy [01:01:45].

Conclusion

Speak continues to innovate in AI-driven language learning, focusing on specialized models, robust ML scaffolding, and intuitive user experiences. The company is actively hiring and can be found at speak.com [01:02:44].