From: redpointai
Connor Wick, CEO of Speak.com, discusses the evolution and future of personalized learning powered by AI, focusing on the company’s English language learning platform. Speak.com, launched in 2019, has grown to over 10 million users in 40+ countries. The company is backed by OpenAI and was recently valued at $500 million [00:00:11].
Foundational Principles
Connor Wick’s entrepreneurial journey in education technology began in high school with a flashcard app, which amassed several million users and billions of cards [00:01:22]. He envisioned aggregating this knowledge into a graph to “generate anything and teach anything that anyone wants to learn” [00:02:03], aiming to create an “omniscient tutor” [00:02:30]. This early aspiration is now becoming a reality with the advent of large language models (LLMs) [00:02:16]. He believes flashcard data is particularly good for learning because it’s structured around information someone is trying to memorize [00:02:49].
Wick’s deep dive into AI began around 2015, when he attended a Berkeley course and became convinced that the underlying models would significantly improve [00:04:00]. His team focused on speech recognition because of its potential for building technology that feels like it has a “persona” and fosters a “relationship” [00:06:03].
Speak.com Product and Pedagogy
Speak is designed as a “full fluency solution” for language learning, emphasizing speaking and real conversations over grammar or rote memorization [00:06:30]. The methodology involves teaching high-frequency word “chunks” that appear together in everyday speech, encouraging repetition until automaticity [00:07:06]. Users then practice in simulated conversations with real-world goals, tailored to their individual motivations, interests, and skill levels [00:07:24].
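As a rough illustration of what a high-frequency “chunk” could mean in practice, the sketch below counts word n-grams that recur across a small set of sentences. This is an assumption for illustration only: the function name `frequent_chunks`, its parameters, and the sample sentences are hypothetical, and the source does not describe how Speak actually selects its chunks.

```python
from collections import Counter


def frequent_chunks(sentences, n=3, min_count=2):
    """Return (chunk, count) pairs for word n-grams seen at least min_count times.

    Hypothetical helper: plain n-gram counting stands in for whatever
    chunk-selection method a real curriculum team would use.
    """
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # Slide a window of n words across the sentence.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return [
        (" ".join(chunk), count)
        for chunk, count in counts.most_common()
        if count >= min_count
    ]


# Made-up sample data, not taken from the source.
sample = [
    "I would like a coffee",
    "I would like to pay",
    "would you like a receipt",
]
print(frequent_chunks(sample, n=3))
```

With this sample, the only three-word sequence that repeats is "i would like", which is the kind of unit a learner might drill until it becomes automatic.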
Technological Evolution and Strategy
Speak adopted a long-term, “North Star” oriented approach, anticipating that with more data and compute, AI models would eventually surpass human capabilities in learning tasks [00:08:21]. This allowed them to build products aligned with future capabilities, iterating step-by-step from accurate speech recognition to phoneme recognition and language understanding [00:09:00].
Balancing Innovation and Practicality
Founders must balance building around current model shortcomings versus waiting for future improvements [00:09:48]. Speak prioritizes understanding and solving user problems even if it means swapping out core technology later [00:10:44].
Moats and AI Firmware
Speak identifies several “technological moats”:
- Specialized In-House Models: They develop their own models for niche tasks like speech recognition for accented speakers, understanding specific mistakes, and fast, reliable streaming [00:12:00].
- ML Scaffolding: This refers to the “AI firmware”—all the complex technology built in-house to orchestrate models, integrate with backend systems, collect data, fine-tune, and evaluate [00:15:20]. Connor considers this a larger and more significant long-term technological moat than core modeling [00:15:51].
- Continuous Learning and Iteration: Building these models allows them to learn about different aspects, collect more data, and invest further in the business [00:13:23].
Evaluation Challenges
Evaluating AI models, especially for open-ended tasks, is crucial and difficult [00:30:57]. Connor believes a clear evaluation framework is key to problem definition and execution clarity [00:31:19]. This involves:
- Looking beyond simple metrics like word error rate to understand individual mistakes [00:31:38].
- Sometimes intentionally “dumbing down” model understanding to human level for better pedagogical feedback [00:32:05].
- A structured playbook with internal tools, eval loops, and human-in-the-loop evaluations for new model releases [00:33:12].
- Tracking product metrics from user experiments to gauge real-world impact [00:34:20].
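To make the first point above concrete, here is a minimal sketch of word error rate (WER), computed as word-level edit distance divided by the number of reference words, followed by a helper that lists the individual differences. The function names and the sample sentences are hypothetical, and the diff helper uses Python's standard `difflib` as a convenient approximation rather than a true minimal alignment; none of this is presented as Speak's evaluation tooling.

```python
import difflib


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def word_diffs(reference: str, hypothesis: str):
    """List the non-matching spans as (tag, reference_words, hypothesis_words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [
        (tag, ref[i1:i2], hyp[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up example: what the learner meant versus what was recognized.
reference = "I went to the store yesterday"
hypothesis = "I go to store yesterday"
print(round(word_error_rate(reference, hypothesis), 3))
print(word_diffs(reference, hypothesis))
```

The single WER number treats the tense error ("went" versus "go") and the dropped article ("the") as interchangeable, while the per-span diff keeps them separate, which is closer to what pedagogical feedback needs.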
Model Costs and Pricing
Speak is not currently constrained by model inference costs for its subscription tier [00:28:06]. They believe that as costs decrease, demand will increase, benefiting model providers [00:28:35]. Their pricing strategy balances radical accessibility for hundreds of millions of people with the opportunity to charge significantly more for a premium consumer product, given that similar offline tutoring can cost hundreds of dollars per month [00:29:19].
User Experience and Interface Innovation
Speak faces the challenge of designing new interface paradigms for audio-first experiences. The goal is to minimize user education and tooltips by creating intuitive designs [00:18:34]. An example is the microphone button in onboarding, where users are simply asked “why are you learning English?” but may feel unsure about how to respond [00:19:19]. However, users’ understanding of these paradigms is evolving rapidly as they grow more familiar with generative AI apps like ChatGPT [00:30:28].
The future of UI will likely be “hybrid,” allowing users to fluidly choose between speaking, typing, or tapping [00:21:07]. While speech is not always better, it offers significant advantages in many situations, especially with improved speech-to-speech models [00:21:24].
Proactive AI
A significant future interface unlock is the concept of a “GPU that’s like thinking about you in the background” [00:22:56]. This means the AI could proactively analyze user activity (e.g., an hour of speaking practice) and generate distilled analysis or lessons overnight to prepare for the next session [00:23:51]. This represents a shift from users “pulling” information to AI “pushing” relevant insights [00:22:28].
Business and Market Dynamics
Disrupting vs. Sustaining Technology
Connor believes AI in education broadly helps incumbents if it merely improves an existing solution to a problem [00:35:06]. However, if AI enables a fundamentally new solution (e.g., fully automating customer support instead of just making agents more efficient), it becomes highly disruptive [00:35:39].
Speak vs. Duolingo
Speak and Duolingo solve different problems [00:36:06]. Duolingo’s primary audience consists of native English speakers in Western countries who were not previously learning a language and who treat it as a casual “brain training app” [00:36:11]. AI may not necessarily enhance this casual experience [00:37:04].
Speak, by contrast, targets users, particularly in markets like South Korea, who have spent years trying to achieve conversational fluency but lack access to human speakers [00:37:12]. For this use case, AI is profoundly helpful [00:37:42].
Impact of Foundational Models
While real-time translation (like a “Babel fish”) might obviate the need for casual tourists to learn a language, it doesn’t address the fundamental human desire for connection that drives serious language learners [00:38:15]. Connor views new foundational models like GPT-4o as a net positive for AI in language learning products [00:40:53]. Even if people initially try to learn with general-purpose AI like ChatGPT, that validates the concept of AI-driven personalized language learning and leads serious learners to seek specialized solutions like Speak [00:40:50].
Future of Learning with AI
Curriculum Design
While a “right sequence” exists for learning foundational vocabulary (e.g., the 100 most common words), AI can personalize the specific ordering within that framework based on individual user interests [00:25:22]. Human expertise is still needed for high-level curriculum strategy, but machine learning teams are increasingly involved in curriculum creation [00:26:01]. The ultimate goal is a “very unique” learning path for each individual [00:26:50].
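One way to picture personalization inside a fixed framework is a two-part sort: lessons stay in their curriculum level, and within each level they are reordered by how well their topic matches the learner's interests. The sketch below is purely illustrative; the `Lesson` fields, the interest scores, and the `personalize` function are hypothetical and are not drawn from Speak's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    title: str
    level: int   # position in the fixed, expert-designed sequence
    topic: str   # hypothetical tag used to match learner interests


def personalize(lessons, interests):
    """Order lessons by level, then by descending interest within a level.

    `interests` maps topic -> score; topics that are missing score 0.0.
    sorted() is stable, so lessons that tie keep their original order.
    """
    return sorted(
        lessons,
        key=lambda lesson: (lesson.level, -interests.get(lesson.topic, 0.0)),
    )


# Made-up catalog and learner profile.
catalog = [
    Lesson("Greetings", 1, "basics"),
    Lesson("Ordering food", 1, "travel"),
    Lesson("Meeting agenda", 2, "business"),
    Lesson("Hotel check-in", 2, "travel"),
]
learner_interests = {"travel": 0.9, "business": 0.2}

for lesson in personalize(catalog, learner_interests):
    print(lesson.level, lesson.title)
```

The level acts as the human-designed constraint and the interest score as the personalized part, so a travel-focused learner sees travel lessons first without skipping ahead of foundational material.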
Expansion Beyond Language
Speak sees opportunities to expand beyond language learning into three major sectors of learning:
- Schools: Traditional educational institutions [00:50:18].
- Businesses and Professional Skills: Developing and certifying skills for employees [00:50:27]. This includes areas like public speaking in a second language [00:48:53].
- Personal Learning: An “invisible” but massive sector encompassing daily activities like reading, listening to podcasts, or watching videos – anything driven by the desire to “become a better version of yourself” [00:50:37]. Connor envisions this as a highly individuated experience, like the “omniscient primer” from Neal Stephenson’s The Diamond Age, where AI has long-term memory and a mental map of a user’s knowledge, interests, and personality [00:52:05].
Timeline for Disruption
Connor believes that while little may change in a few years, a tremendous amount of change will occur over a decade [00:55:31]. He feels that AI in education will be one of the biggest and most exciting areas of disruption, fundamentally changing how people learn, unlike previous technological shifts that merely digitized existing methods [00:53:57].
“The way that people learn… has the quality of education changed? Like fundamentally people are still basically taking quizzes but they’re taking them on a laptop instead of a piece of paper… the quality of all that the efficacy like I don’t really think it’s changed very much” [00:54:19]
Future of AI Models
Connor notes that while Transformers are currently dominant, he hopes research continues into other areas beyond this “local maximum” [00:55:58]. For language learning, improved multimodal and multilingual audio capabilities, especially low-latency, continuous models that combine speech recognition, LLMs, and speech synthesis, are a “holy grail” [00:45:56]. These models could capture more nuanced information like confidence, emotions, and subtle mistakes [00:46:04].
AI in education and human interaction still needs more development for higher-stakes subjects like math, but language learning is uniquely positioned: current technology can already create fully useful and disruptive experiences without continuous human intervention [00:57:08]. Other subjects must meet a higher bar for user enjoyment, because their existing solutions (e.g., classroom math teaching) are perceived as more effective than traditional language learning [00:57:27].
Overhyped and Underhyped Aspects of AI
- Overhyped: “Probably everything,” especially the amount of funding without true product-market fit or real activity [01:00:00].
- Underhyped: Technology that isn’t the Transformer, and the amount of research going into alternative approaches [01:00:23].
Lessons Learned
A continuous challenge is that any new technology or capability is “never as good as you think it will be,” and building something that genuinely changes user behavior is extremely difficult [01:00:41]. Initially, Speak thought it would build all of its models in-house, but it later realized that some models are too expensive or specialized for a single company to develop [01:01:45].
Conclusion
Connor invites interested individuals to visit Speak.com to learn more about the product, or Speak.com/careers to learn about careers [01:02:44]. The podcast hosts conclude that Speak is an “incredible use case of AI” [01:03:45], demonstrating how AI can deliver a solution (personalized language tutoring) that was previously very difficult to achieve, creating a “profound shift” [00:00:43]. This approach may soon extend to other subject areas beyond language [01:04:11].