From: redpointai
Connor Wick is the CEO of Speak.com, an English language learning platform that has gained over 10 million users in more than 40 countries since launching in South Korea in 2019. The company is backed by OpenAI and was recently valued at $500 million [00:00:11]. Wick was working on AI in education long before it became mainstream [00:00:37], and describes it as a “totally new area” that will bring a “profound shift” [00:00:40].
Early Entrepreneurial Journey and AI Exposure
Connor Wick’s entrepreneurial journey began in high school with an app for the early iPhone that replicated physical flashcards [00:01:22]. The app reached millions of users and accumulated billions of cards [00:01:31]. He envisioned aggregating this data into a “graph” to create an “omniscient tutor” capable of teaching anything [00:01:50], and notes that the technology to realize this vision largely exists today [00:02:16]. He believes flashcard data is highly valuable for learning because it is structured around the information people actually want to learn [00:02:47].
Wick’s formal exposure to AI began around 2015 when he “crashed” a Berkeley course [00:04:03]. At that time, the focus was on technologies like RNNs and convolutional neural networks, with the Transformer model not yet invented [00:04:32]. Early ideas included computer vision applications (e.g., automated meter maids, body measurement for clothing, medical imaging) and weather prediction using deep learning [00:05:05]. Ultimately, Wick was drawn to speech recognition due to the potential to build technology that felt like it had a persona and could form relationships [00:05:52].
Speak.com: An AI-Powered Language Learning Solution
Speak.com offers a comprehensive solution for language fluency, emphasizing speaking and real conversations rather than traditional grammar or rote memorization [00:06:30]. Its methodology involves teaching high-frequency word “chunks” and encouraging repeated practice until they become automatic [00:07:06]. This is followed by simulated conversations designed to achieve specific real-world goals, with content highly individuated to the user’s motivations, interests, and proficiency level [00:07:24].
Evolution of AI Integration
Speak’s long-term orientation allowed it to adapt to evolving AI technology [00:08:19]. The initial focus was on accurate speech recognition to enable effective spoken interaction [00:09:28]. As models improved, they added features like phoneme recognition and basic language understanding [00:09:38]. The company aims to fully replace the human element in the learning process as AI models surpass human capabilities on various tasks [00:08:35].
Building AI Capabilities
Speak makes strategic investments in building specialized in-house models for tasks where general models are insufficient or not yet mature [00:12:00]. Examples include:
- Speech Recognition: In-house models that are highly accurate for accented speech, detect specific mistakes, and provide fast, reliable real-time feedback [00:12:24].
- Phoneme Recognition System: Built using their extensive data to detect pronunciation errors [00:12:45].
These specialized models, though potentially temporary, provide a significant advantage by enabling a working product, user growth, and data collection that fuels further model development [00:13:06].
AI “Firmware” and Scaffolding
While external large language models (LLMs) are used, a major investment at Speak is in what they call “AI firmware” or “ML scaffolding” [00:15:20]. This refers to the complex technology built to orchestrate and integrate AI models with the product and backend systems [00:15:26]. This scaffolding, including continuous data collection, fine-tuning, and evaluation frameworks, is considered a significant long-term technological moat [00:15:57].
A common challenge is “prompt optimization,” which Wick describes as a temporary, “silly” aspect of current AI development that will likely disappear as models become more intelligent [00:17:42].
User Experience and Interfaces
Speak faces the challenge of designing intuitive interfaces for new audio-first experiences. For example, their onboarding prompts users to speak into a microphone, which can be unfamiliar and generate questions about what to say or how long to speak [00:19:17]. The goal is minimal user education, striving for intuitive design [00:19:51]. The increasing familiarity with apps like ChatGPT is already shifting user understanding of these paradigms [00:20:25].
The future of UI/UX in AI is seen as “hybrid,” allowing users to fluidly switch between talking, typing, or tapping [00:21:07]. Speech is not always superior but offers significant advantages, especially as speech-to-speech models improve [00:21:24].
Another area of development is proactive AI interfaces, where a “GPU thinking about you in the background” observes data and performs tasks for the user [00:22:56]. For Speak, this could mean analyzing a user’s practice session overnight and generating distilled lessons or analyses to start their next session [00:23:51].
Curriculum and Methodology
Speak believes in a balanced approach to curriculum, combining a structured “right sequence” for learning a language (e.g., starting with high-frequency words) with highly individualized paths within that structure [00:25:22]. While human expertise is crucial for high-level curriculum strategy, machine learning teams are increasingly involved in adapting and delivering content, creating a cross-functional challenge [00:26:14]. The aim is to create unique and creative learning experiences tailored to the individual [00:26:50].
Business Model and Competition
Speak does not feel constrained by model inference costs for its subscription-based service, believing that costs will continue to decrease, driving increased demand [00:28:04].
Regarding pricing, Speak aims for radical accessibility, using software to reach hundreds of millions of people with a kind of instruction that has traditionally carried high marginal costs [00:29:21]. At the same time, there is an opportunity to charge significantly more for a premium consumer product, given that offline tutoring or classroom education can cost hundreds of dollars per month [00:29:51]. The goal is to build a differentiated and valuable product that isn’t commoditized [00:30:28].
On the topic of incumbents like Duolingo, Wick argues that AI in education broadly helps incumbents if the problem being solved remains the same and AI simply makes it better. However, if AI enables a fundamentally new solution to a different problem, it can be highly disruptive [00:35:02]. Speak and Duolingo are seen as solving fundamentally different problems:
- Duolingo: Primarily serves casual learners, often native English speakers who weren’t previously learning a language, offering a “brain training app” experience [00:36:06]. AI’s benefit to this casual experience is unclear [00:37:04].
- Speak: Focuses on teaching English to individuals who have often studied for years but lack conversational fluency due to limited access to human speakers [00:37:12]. AI clearly and significantly helps this use case by providing simulated conversational practice [00:37:42].
Wick believes that increased use of AI tools like ChatGPT for language learning is a net positive for specialized AI language learning products, as it familiarizes users with the concept and encourages them to seek more effective, specialized solutions if they are serious about achieving fluency [00:40:35].
The Future of AI in Education
Wick identifies three major sectors for AI in education:
- Schools: Traditional learning environments [00:50:18].
- Businesses and Professional Skills: Opportunities for certification, assessment, and skill development for companies [00:50:27]. Speak is building an enterprise version for companies like Samsung and SK to offer English learning to employees [00:48:21]. This also extends to areas like public speaking [00:48:53].
- Personal Learning: A “massive” and often “invisible” sector, encompassing daily activities like reading books, listening to podcasts, watching videos, and reading articles. This represents a desire to know more and become a better version of oneself [00:50:37].
Wick envisions personal learning in 10-15 years as highly individualized, with AI systems that have long-term memory, understand the user’s interests and personality, and proactively provide relevant information [00:52:05], similar to the concept in the sci-fi novel The Diamond Age [00:27:06]. He predicts a future with both wide, casual platforms (like a more intelligent ChatGPT) and specialized, premium solutions [00:52:51].
Wick is confident that AI in education will be one of the most significant and exciting areas of change and disruption [00:53:57]. Unlike past software trends that merely digitized existing educational methods (e.g., digital quizzes instead of paper, digital flashcards), AI has the potential to fundamentally alter the quality and efficacy of learning, akin to how Aristotle taught Alexander the Great [00:54:18].
Timeline and Challenges
While expectations for short-term change may be “overhyped,” significant shifts are expected within a decade [00:55:31]. Wick expresses concern that over-obsession with the Transformer architecture might lead to a “local maximum” in research, potentially hindering exploration of other crucial AI advancements [00:56:00].
For subjects other than language learning, the “delta of advocacy” (the improvement needed over existing solutions) is higher, meaning it will take longer for AI to make a profound impact [00:56:49]. Language learning is uniquely poised for AI disruption because traditional classroom models (e.g., 1 teacher to 30 students) are less effective for speaking practice, whereas AI can offer a one-on-one “teacher” [00:56:54].
Even without further AI progress, significant improvements in educational experiences can be built on current technology, but it will take time [00:57:41]. The primary challenge in these areas is often not technological, but rather about building a genuinely good product and finding the right market [00:59:12].