From: redpointai

Speak.com is an English language learning platform that utilizes artificial intelligence (AI) and is backed by OpenAI [00:00:13]. Since its launch in South Korea in 2019, Speak has amassed over 10 million users across more than 40 countries and was recently valued at $500 million [00:00:18]. The company’s goal is to become an “omniscient tutor” capable of teaching anything to anyone [00:02:12].

The Evolution of AI in Language Learning

Connor Wick, CEO of Speak, began his entrepreneurial journey in high school with a flashcard app, which he believes would today incorporate generative AI (GenAI) [00:01:07]. His early vision for the flashcard app was to aggregate “pairs of knowledge” into a graph to generate and teach anything, essentially creating an omniscient tutor [00:02:04].

Wick’s deeper dive into AI began around 2015, when recurrent and convolutional neural networks were the focus of the field, before the Transformer architecture was invented [00:04:30]. Early ideas ranged from computer vision applications, such as automated parking enforcement and body measurement for custom clothing, to predicting weather patterns [00:05:05]. However, Speak’s founders were more drawn to speech recognition because of its potential to produce technology that felt like it had a “persona” and could form a “relationship” with the user [00:06:01].

Speak’s development has been guided by a long-term vision that anticipated AI models would continuously improve with more data and compute, eventually surpassing human capabilities in various tasks [00:08:21]. This foresight allowed them to make product decisions aligned with eventually replacing the human element in the learning process [00:08:41]. Early breakthroughs involved highly accurate speech recognition, enabling users to speak into the app effectively [00:09:28]. Subsequent additions included phoneme recognition and basic language understanding [00:09:39].

Speak’s Approach to Language Learning

Speak offers a comprehensive solution for achieving fluency in a new language, focusing specifically on spoken communication and real conversations [00:06:30]. This contrasts with traditional methods that emphasize grammar or vocabulary memorization [00:06:41].

Speak’s core methodology involves:

  • Teaching high-frequency “chunks of words” that commonly appear together in everyday speech [00:07:06].
  • Encouraging users to repeatedly practice these chunks until they become automatic [00:07:18].
  • Facilitating practice in simulated conversations where users work towards specific, real-world communication goals tied to their motivation for learning the language [00:07:24].

The entire learning experience is highly individualized, adapting to the user’s motivation, interests, and proficiency level [00:07:45].
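The “practice until automatic” loop described above resembles a spaced-repetition scheduler. As a rough illustration only (this is my sketch, not Speak’s actual algorithm), a Leitner-style scheme in Python: chunks the learner gets right graduate to boxes that are reviewed less often, while misses return to box 0 for heavy repetition.

```python
# Hypothetical Leitner-style scheduler for practicing word chunks.
# Box 0 is reviewed most frequently; higher boxes less often.

def review(boxes, chunk, correct):
    """Move a chunk between practice boxes based on the learner's attempt."""
    current = next(i for i, box in enumerate(boxes) if chunk in box)
    boxes[current].remove(chunk)
    # Correct answers promote the chunk one box; misses reset it to box 0.
    target = min(current + 1, len(boxes) - 1) if correct else 0
    boxes[target].add(chunk)

boxes = [{"nice to meet you", "by the way"}, set(), set()]
review(boxes, "nice to meet you", correct=True)   # promoted to box 1
review(boxes, "by the way", correct=False)        # stays in box 0
print([sorted(b) for b in boxes])
# → [['by the way'], ['nice to meet you'], []]
```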

Technological Moats and Investments

Speak’s strategy involves a combination of leveraging foundational models and developing specialized in-house AI [00:12:00]. While recognizing that large foundational models will eventually “subsume” many specialized tasks, Speak builds its own models for niche areas where it can achieve greater specialization and speed [00:12:16].

Examples of in-house models include:

  • Speech Recognition: A highly accurate speech recognition system optimized for users speaking with accents, capable of identifying specific mistakes and providing fast, reliable feedback [00:12:26].
  • Phoneme Recognition: A system trained on Speak’s data to detect pronunciation errors and other prosodic mistakes made by learners [00:12:45].

These specialized models, even if used for only a few years before foundational models catch up, provide significant value and allow Speak to build a large user base, collect more data, and make further investments [00:13:06].
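To give a feel for what phoneme-level feedback involves, here is a toy sketch (my own illustration, not Speak’s model): align the phoneme sequence a learner produced against the expected sequence for a phrase and report the mismatches. Real systems do this with neural acoustic models rather than string alignment.

```python
# Hypothetical sketch of phoneme-level pronunciation feedback: align the
# learner's phonemes against the expected sequence and flag differences.
from difflib import SequenceMatcher

def pronunciation_feedback(expected, produced):
    """Return a list of (op, expected_span, produced_span) mismatches."""
    matcher = SequenceMatcher(a=expected, b=produced)
    issues = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            issues.append((op, expected[i1:i2], produced[j1:j2]))
    return issues

# "three" expects /θ r iː/; a common learner error substitutes /s/ for /θ/.
expected = ["TH", "R", "IY"]
produced = ["S", "R", "IY"]
print(pronunciation_feedback(expected, produced))
# → [('replace', ['TH'], ['S'])]
```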

The Importance of ML Scaffolding

Connor Wick emphasizes that the “AI firmware” or “ML scaffolding” — the technology built to orchestrate and integrate AI models with the product and backend — is a larger and more important investment than just the core modeling itself [00:15:24]. This includes:

  • Getting models to excel at individual tasks [00:16:35].
  • Orchestrating these models [00:16:37].
  • Continuously collecting new data and fine-tuning models [00:16:39].
  • Developing robust evaluation frameworks [00:16:43].
  • Building infrastructure for representing language and user proficiency (e.g., knowledge graphs) [00:16:51].

Wick sees this “scaffolding” as Speak’s long-term technological moat [00:15:57].
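As a loose miniature of what such scaffolding might look like (every stage function below is a hypothetical stub, not one of Speak’s components): a single learner turn flows through transcription, mistake detection, and response generation, with each stage logged so the data can later feed offline evaluations and fine-tuning sets.

```python
# Hypothetical orchestration sketch: chain several models around one learner
# turn and log everything for later evaluation and fine-tuning.

def transcribe(audio):            # stand-in for an accent-robust ASR model
    return {"text": "I goed to the store", "confidence": 0.93}

def detect_mistakes(transcript):  # stand-in for a mistake-detection model
    return [{"span": "goed", "suggestion": "went"}] if "goed" in transcript["text"] else []

def generate_reply(transcript, mistakes):  # stand-in for an LLM tutor reply
    if mistakes:
        return f'Almost! Try "{mistakes[0]["suggestion"]}" instead of "{mistakes[0]["span"]}".'
    return "Great sentence!"

def handle_turn(audio, log):
    transcript = transcribe(audio)
    mistakes = detect_mistakes(transcript)
    reply = generate_reply(transcript, mistakes)
    # Logged turns can be replayed through offline evals or sampled into
    # fine-tuning data later on.
    log.append({"transcript": transcript, "mistakes": mistakes, "reply": reply})
    return reply

log = []
print(handle_turn(b"...raw audio bytes...", log))
# → Almost! Try "went" instead of "goed".
```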

Pricing and Cost Considerations

Speak operates on a subscription model and does not currently feel constrained by the cost of AI model inference [00:28:11]. They anticipate that model costs will continue to decrease, driving increased demand [00:28:31].

Speak aims to make its solution radically accessible, recognizing that it offers a software solution to a problem traditionally addressed by expensive human tutors or classrooms [00:29:21]. However, they also see an opportunity to charge significantly more for a consumer product that delivers a high-end experience, as consumers currently pay hundreds of dollars per month for offline tutoring [00:29:51].

Model Evaluation

Model evaluation is considered extremely difficult but crucial [00:30:57]. For Speak, evaluation goes beyond simple metrics like word error rates, assessing nuanced aspects like understanding “unintelligible” words or specific types of mistakes users make [00:31:34]. A well-defined evaluation framework provides “execution clarity” for the development team [00:32:16].
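For context, the word-error-rate baseline mentioned above is just the edit distance between the reference and hypothesis word sequences, normalized by the reference length. A minimal implementation:

```python
# Word error rate (WER): word-level edit distance divided by reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("i went to the store", "i goed to store"))
# → 0.4 (one substitution + one deletion over 5 reference words)
```

The metric treats every word error equally, which is exactly why it falls short for tutoring: it cannot distinguish a harmless disfluency from the specific learner mistake a tutor should catch.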

When new models like GPT-4o are released, Speak employs a thorough process involving internal tools and human-in-the-loop evaluations for dozens of major tasks, combined with A/B testing on subsets of customers to track key product metrics [00:33:12].

Challenges and Future of AI in Language Learning

User Interface and Education

Designing intuitive user interfaces for AI-first experiences, especially audio-first ones, is a significant challenge [00:18:59]. Users are unfamiliar with “open-ended” interactions, such as being prompted to simply talk to a microphone button [00:19:16]. However, the increasing familiarity with AI paradigms (like ChatGPT) is rapidly evolving user understanding [00:30:30].

The future of AI interfaces is expected to be more fluid, allowing users to seamlessly choose between speaking, typing, or tapping [00:21:00]. While speech is not always superior, it offers significant advantages in certain contexts, particularly as speech-to-speech models improve [00:21:24].

Another emerging interface paradigm involves AI systems proactively “thinking about you in the background,” observing user data, and initiating actions or providing insights without being explicitly queried [00:22:56]. For Speak, this could mean analyzing a user’s daily practice overnight to distill lessons or prepare personalized content for the next day [00:23:59].

AI in Education: Broader Perspective

Connor Wick believes that AI in education will be one of the biggest and most exciting areas of change and disruption [00:53:57]. Unlike other industries, where software has fundamentally transformed how work gets done, education has seen little change in its core efficacy or learning methods for centuries, despite the introduction of technology [00:54:18].

Key areas of focus for AI in education include:

  • Schools: Integrating AI into traditional educational institutions [00:50:20].
  • Businesses and Professional Skills: Developing AI-powered solutions for skill development, certification, and assessment within companies [00:50:27].
  • Personal Learning: This “invisible” but massive sector encompasses everyday activities like reading, listening to podcasts, or watching videos, all driven by a desire for self-improvement [00:50:37].

A vision for personalized learning through AI in 10-15 years is a highly individualized system with long-term memory that understands a user’s interests and personality, providing relevant information proactively [00:52:05].

Connor believes that while AI in education will see tremendous change, it will likely take a decade rather than a few years [00:55:21]. A key concern is the current over-obsession with the Transformer architecture, hoping that research continues into other foundational technologies [00:56:00].

For subjects other than language learning (e.g., math), the bar for AI-powered solutions is higher, because existing human-centric teaching methods serve those subjects better than traditional methods serve language learning [00:56:49]. While the AI technology may already be capable, widespread adoption depends on building substantially better products and finding the right market fit, which are often the “harder parts” compared to the technology itself [00:58:48].

Disruptive vs. Sustaining AI

AI can be a “sustaining technology” if it merely improves an existing solution to a problem. However, it becomes “disruptive” when it fundamentally changes how a problem is solved, potentially automating entire processes [00:35:06].

For example, while Duolingo focuses on casual language learning for a broad audience, Speak targets conversational fluency, often for users who lack access to human speakers [00:36:06]. AI clearly benefits Speak’s use case, whereas its impact on a casual learning experience might be less disruptive [00:37:42].

Connor argues that real-time translation, even with advanced AI, may not fully obviate the need for language learning because it introduces latency and imperfection, and the fundamental desire for human connection remains [00:38:14].

While new models like GPT-4o with advanced speech-to-speech capabilities may lead more people to try learning languages with general AI chatbots, Speak views this as a positive development [00:40:40]. This exposure can make users realize the potential of AI tutoring systems, leading them to seek more specialized and effective solutions for serious language learning [00:40:55]. Speak aims to “own that category” of specialized AI tutoring systems for language learning, similar to how Airbnb and Uber dominated their respective markets [00:41:35].

Future Capabilities

Speak is excited about continued progress in multimodal audio models, especially those that are integrated with LLMs [00:44:51]. The “Holy Grail” is a continuous model that can handle speech recognition, LLM processing, and speech synthesis in one turn, reducing latency and lossiness [00:45:39]. This would allow for a more natural, human-like interaction with an AI tutor, understanding nuances like confidence and emotion [00:46:01].

Improvements in reasoning abilities of models are also anticipated to enable much smarter curriculum planning, which is currently a missing piece [00:46:44]. However, Connor believes that language learning is uniquely positioned because current technologies are already “fully useful” and “disruptive,” enabling the human to be taken out of the loop, unlike higher-stakes industries that still require human intervention due to current AI limitations in reasoning and consistency [00:47:20].

Future expansion areas for Speak include public speaking and other speech-related skills, leveraging their expertise in assessment and teaching [00:48:07]. They are developing an enterprise version of their product to offer English proficiency training and certification to companies [00:48:22].