From: redpointai

AI-powered tutoring tools have emerged as a significant development in education, particularly in language learning. Speak.com, an English language learning platform, is a prominent example: it uses AI to provide a comprehensive fluency solution for users worldwide [00:00:13].

Speak.com: An Overview

Speak.com is an English language learning platform that focuses on teaching users how to speak and have real conversations, prioritizing fluency over grammar or rote memorization [00:06:30]. It is backed by OpenAI and, as of a recent funding round, was valued at $500 million [00:00:16]. Since its launch in South Korea in 2019, Speak.com has grown to over 10 million users across more than 40 countries [00:00:20].

The platform’s methodology involves teaching high-frequency “chunks of words” that appear together in everyday speech. Users then practice saying these phrases repeatedly until they become automatic, building patterns that can be applied in simulated conversations to achieve real-world goals [00:07:09]. The entire experience is highly individualized, adapting to the user’s motivations, interests, and proficiency level [00:07:45].
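As a rough illustration of what "high-frequency chunks" could mean in practice, the sketch below counts word n-grams in a tiny corpus. The corpus, the function name `top_chunks`, and the choice of simple n-gram counting are assumptions made for this example; the source does not describe how Speak.com actually selects its chunks.

```python
import re
from collections import Counter


def top_chunks(sentences, n=2, k=3):
    """Return the k most frequent n-word chunks across the sentences."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and keep only letters and apostrophes as word characters.
        words = re.findall(r"[a-z']+", sentence.lower())
        # Slide a window of n words over the sentence.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)


# Hypothetical toy corpus, not Speak.com data.
corpus = [
    "Can I get a coffee?",
    "Can I get the check, please?",
    "Could I get a table for two?",
]

print(top_chunks(corpus, n=2, k=1))
```

On this toy corpus the most frequent two-word chunk is "i get", the kind of reusable pattern a learner could drill and then apply in different situations.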

The Entrepreneurial Journey of Connor Zwick

Connor Zwick, CEO of Speak.com, began his entrepreneurial journey in high school with a flashcard app for the iPhone [00:00:57]. Millions of users created hundreds of millions of flashcard decks with the app, accumulating a vast dataset of linked knowledge pairs [00:01:32]. Zwick envisioned aggregating this data into a graph that could generate and teach anything, creating an “omniscient tutor”, a vision that modern AI-powered tutoring tools are now realizing [00:02:04]. He believes this structured learning data would be “really good” for building such a tutor [00:02:48].

Zwick’s formal exposure to AI began in 2015, when he “crashed” a Berkeley course on the subject [00:04:03]. At that time, the focus was on RNNs and convolutional neural networks; the Transformer architecture had not yet been invented [00:04:32]. He considered various applications for AI, including:

  • Automated parking enforcement using city vehicle cameras (deemed “horrible for the world”) [00:05:16].
  • Measuring bodies for custom clothing or medical imaging [00:05:29].
  • Predicting weather using deep learning [00:05:36].

Ultimately, Zwick and his co-founder Andrew were drawn to speech applications because they offered the potential to build technology that felt like it had a “persona” and could form a “relationship” with the user [00:06:03]. This vision led them to focus on Speak.com.

Evolution with AI Models and Strategic Investment

Speak.com’s strategy has always been long-term oriented, anticipating that over 5 to 10 years, increasing data and compute power would lead models to surpass human capabilities, eventually allowing AI to fully replace the human in the learning process [00:08:21]. This “North Star” guided product decisions, ensuring they aligned with the long-term vision rather than short-term optimizations [00:08:48].

Early technological breakthroughs for Speak.com included achieving highly accurate speech recognition, enabling users to speak naturally into the app, and subsequently adding phoneme recognition and basic language understanding capabilities [00:09:28].

Building Moats in AI

Connor Zwick highlights several “moats”, or areas of defensibility, for Speak.com:

  1. Specialized In-House Models: Speak.com invests in developing its own models for niche tasks where they can outperform generalized models. This includes:
    • Speech recognition optimized for users speaking with accents, accurately understanding their intent and identifying specific mistakes [00:12:26].
    • Phoneme recognition systems built from their extensive user data to detect pronunciation errors [00:12:45].
    These specialized models, though potentially subsumed by larger multimodal models in the long term, provide significant value in the short to medium term by enabling a functional product and business growth [00:13:06].
  2. ML Scaffolding / AI Firmware: A “much bigger investment” than modeling, this refers to the complex technical infrastructure built to orchestrate and ensure models work effectively with the product and backend [00:15:51]. This includes:
    • Getting models to excel at individual tasks [00:16:35].
    • Orchestrating these tasks [00:16:37].
    • Continuously collecting new data and fine-tuning models [00:16:39].
    • Developing robust evaluation frameworks [00:16:43].
    • Building infrastructure like knowledge graphs to represent language proficiency and individual user progress [00:16:51].
    This “AI firmware” is considered a primary long-term technological moat [00:15:57].
  3. End-to-End User Experience: The focus is on solving the user’s problem effectively, even if it means swapping out underlying technology components later. This involves understanding user engagement and motivation [00:11:06].

A current “painful” aspect of AI development is prompt optimization, which Connor believes will become obsolete as models become more intelligent [00:17:46].

User Experience and Interface

Speak.com aims to build intuitive, audio-first experiences, minimizing the need for explicit user education or tooltips, as these indicate design shortcomings [00:18:34]. The challenge lies in introducing unfamiliar interactions, such as an onboarding process where users are simply asked to speak into a microphone without much instruction [00:19:17]. However, the prevalence of apps like ChatGPT is rapidly increasing users’ familiarity with these conversational paradigms [00:30:30].

The future of AI interfaces will be fluid and hybrid, allowing users to choose to talk, type, or tap as appropriate [00:20:58]. Speech is not always superior but offers a significant shift when it is [00:21:24]. The goal is for technology to adapt to human interaction patterns, rather than vice versa [00:22:10].

Proactive AI and Curriculum

Speak.com is exploring “proactive AI,” where the system “thinks about you in the background” [00:23:09]. For instance, after a user finishes a long session, GPUs could run overnight to distill personalized lessons and analysis to be presented the next day [00:23:51]. This approach leverages user data to provide highly tailored and timely interventions.

Curriculum development at Speak.com balances a structured pedagogical path with individual user customization [00:25:06]. While there is an optimal sequence for learning a language (e.g., starting with high-frequency words), the specific ordering and content within that structure can be individualized [00:25:22]. Humans remain in the loop for high-level curriculum strategy and artistic creation, but machine learning teams increasingly contribute to and understand the methodology [00:26:01]. The vision is partly inspired by Neal Stephenson’s sci-fi novel “The Diamond Age,” which features an AI-powered “primer” that teaches anything in a unique, individualized way [00:27:06].
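One way to picture "a fixed structure with individualized ordering inside it" is a two-level sort: lessons keep their place in a shared sequence of stages, and within each stage the learner's interests decide what comes first. The `Lesson` type, the `stage` and `topic` fields, and the sample lessons below are all hypothetical; this is a minimal sketch of the idea, not Speak.com's curriculum engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    title: str
    stage: int  # position in the shared pedagogical path (lower = earlier)
    topic: str


def personalize(lessons, interests):
    """Keep stage order fixed; within a stage, put the learner's interests first."""
    return sorted(
        lessons,
        # False sorts before True, so matching topics come first in each stage.
        key=lambda lesson: (lesson.stage, lesson.topic not in interests, lesson.title),
    )


# Hypothetical lessons for illustration only.
lessons = [
    Lesson("Ordering food", 1, "travel"),
    Lesson("Introducing yourself at work", 1, "business"),
    Lesson("Negotiating a deadline", 2, "business"),
    Lesson("Asking for directions", 2, "travel"),
]

plan = personalize(lessons, interests={"business"})
print([lesson.title for lesson in plan])
```

A learner interested in business still completes every stage 1 lesson before stage 2, but sees the work-related lesson first within each stage.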

Business and Market Dynamics

Model Costs and Pricing

Speak.com operates on a subscription model and currently does not feel constrained by model inference costs [00:28:06]. They believe that as model costs decrease, demand will naturally increase [00:28:35].

The pricing strategy aims for both radical accessibility and premium offerings. As a software solution, Speak.com can provide value to hundreds of millions of people at low marginal cost. However, it also sees an opportunity to charge substantially more than typical consumer products because it substitutes for expensive offline tutoring or classroom education, which costs hundreds of dollars per month [00:29:19].

Model Evaluation

Evaluation of AI models is critical and often underrated [00:30:57]. For Speak.com, evaluating an AI’s performance goes beyond simple metrics like word error rate; it involves understanding if the model catches individual mistakes, or even comprehends speech that humans might find unintelligible [00:31:34]. Developing a clear evaluation framework is crucial for driving execution clarity and better decision-making within the team [00:32:16].
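To see why word error rate alone can mislead in this setting, consider a learner who says "I go to school yesterday". The sketch below computes word error rate (word-level edit distance divided by reference length) and a second, hypothetical metric, `mistakes_preserved`, that checks whether the learner's actual error survives in the transcript. The example sentences and the second metric are assumptions made for illustration; the source does not specify Speak.com's metrics.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, deletions, insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[-1][-1] / len(ref)


def mistakes_preserved(mistake_words, hypothesis):
    """Fraction of the learner's annotated mistake words that appear in the transcript."""
    if not mistake_words:
        return 1.0
    hyp_words = set(hypothesis.split())
    return sum(word in hyp_words for word in mistake_words) / len(mistake_words)


# What the learner actually said; "go" should have been "went".
spoken = "i go to school yesterday"
mistakes = {"go"}

# Transcript A silently "corrects" the learner; transcript B adds a stray word.
transcript_a = "i went to school yesterday"
transcript_b = "i go to the school yesterday"

for name, hyp in [("A", transcript_a), ("B", transcript_b)]:
    print(name, word_error_rate(spoken, hyp), mistakes_preserved(mistakes, hyp))
```

Both transcripts have the same word error rate (one edit over five words), yet only transcript B keeps the mistake a tutor would need to see in order to give feedback.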

When new models like GPT-4o are released, Speak.com has a structured process, including internal tools and playbooks, to run evaluations across its 40+ major tasks, often involving human-in-the-loop assessments [00:33:12]. They also rely on A/B testing and tracking core product metrics to gauge the real-world impact on users [00:34:20].
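A per-task comparison of this kind might be organized as in the sketch below, which sorts tasks into improved, regressed, and unchanged buckets when a candidate model is scored against the current baseline. The task names, scores, and tolerance are invented for the example, and the function `compare_models` is hypothetical; it is not a description of Speak.com's internal tooling.

```python
def compare_models(baseline, candidate, tolerance=0.01):
    """Bucket tasks by how the candidate's score differs from the baseline's."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both models must be scored on the same tasks")
    report = {"improved": [], "regressed": [], "unchanged": []}
    for task in sorted(baseline):
        delta = candidate[task] - baseline[task]
        if delta > tolerance:
            report["improved"].append(task)
        elif delta < -tolerance:
            report["regressed"].append(task)
        else:
            report["unchanged"].append(task)
    return report


# Hypothetical per-task scores in [0, 1]; higher is better.
baseline_scores = {
    "grammar_correction": 0.90,
    "pronunciation_feedback": 0.82,
    "roleplay_dialogue": 0.75,
}
candidate_scores = {
    "grammar_correction": 0.905,
    "pronunciation_feedback": 0.80,
    "roleplay_dialogue": 0.88,
}

print(compare_models(baseline_scores, candidate_scores))
```

A report like this only narrows down where to look. Tasks flagged as regressed would still need human review, and the real-world effect would still be checked through A/B tests and product metrics, as described above.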

Competition and Product-Market Fit

Connor Zwick distinguishes between AI as a sustaining technology (improving an existing solution) and a disruptive technology (automating or fundamentally changing a problem) [00:35:06]. He argues that Speak.com and Duolingo, despite both being language learning platforms, solve fundamentally different problems [00:36:06]. Duolingo primarily serves native English speakers in Western countries, functioning as a casual, “brain training” app that makes learning a language feel productive [00:36:13]. Speak.com, conversely, targets users who have often studied English for over a decade but lack access to human conversational partners, and it focuses on helping them achieve conversational fluency [00:37:30]. AI directly addresses this specific problem, which makes it a disruptive force in this niche [00:37:42].

While real-time translation might reduce the need for some users to learn a language for casual tourism, it doesn’t address the fundamental desire for human connection that drives many to seek fluency [00:38:15]. The latency and imperfections of translation also limit its effectiveness for nuanced communication [00:38:28].

The release of GPT-4o, with its improved speech-to-speech capabilities, has sparked speculation about AI’s impact on language learning. However, Connor believes that more people using tools like ChatGPT to learn languages is a net positive for specialized AI language learning platforms like Speak.com [00:40:39]. ChatGPT can introduce users to the concept of AI in education, and those who are serious about long-term language acquisition will then seek out more effective, specialized solutions [00:41:01].

Future of AI in Education

Connor Zwick believes AI in education will be one of the biggest and most exciting areas of change and disruption [00:53:57]. While software has permeated many industries, the fundamental quality of education, particularly how people learn, has remained largely unchanged; most changes have been superficial (e.g., digital quizzes instead of paper, digital flashcards instead of physical ones) [00:54:18]. AI has the potential to fundamentally transform learning, moving beyond superficial changes to truly improve efficacy [00:54:43].

He identifies three major sectors for AI in education:

  1. Schools: Integrating AI into traditional schooling [00:50:20].
  2. Businesses/Professional Skills: Developing professional skills, such as public speaking in English, and offering certification [00:50:27]. Speak.com is building an Enterprise version of its product for companies like Samsung and SK in South Korea [00:48:20].
  3. Personal Learning: This is an “invisible” but massive area, encompassing activities like reading books, listening to podcasts, watching videos, and reading articles – anything driven by the desire to “know more” or become a “better version of yourself” [00:50:37]. Personal learning will become highly individualized, with AI acting as a “long-term memory” agent that understands user interests and personality to provide relevant information [00:52:05].

These changes are likely to take a long time: not much may happen in the next few years, but profound shifts are likely over a decade or more [00:55:33]. Connor expresses concern that the current obsession with the Transformer architecture might lead to neglecting other important research areas [00:56:00].

Language learning is a particularly favorable case for AI because the classroom model serves it poorly: fluency requires one-on-one interaction. Other subjects, such as math, may require higher levels of reasoning and consistency from AI [00:56:50]. The bar for adoption in those areas is also higher, because the existing alternatives (e.g., human tutors for math homework) already work better than the status quo does for language learning [00:57:27]. Ultimately, building a truly effective product and finding the right market, rather than just technological capability, remains the harder challenge [00:59:14].

Overhyped vs. Underhyped in AI

  • Overhyped: “Probably everything” – too much funding without sufficient product-market fit or real user activity [01:00:00].
  • Underhyped: Technology that isn’t the Transformer architecture and the amount of research going into alternative approaches [01:00:23].

Surprises in Building AI Features

A recurring pattern is getting excited about new AI capabilities, only to find that they are “never as good as you think they will be” and not a “panacea” [01:00:43]. Changing user behavior fundamentally is “really, really hard” [01:00:54]. For example, while human-level transcription combined with GPT-4 was good, it wasn’t a “game-changer” for engagement [01:01:10].

Changed Minds

Connor initially believed Speak.com would do all of its modeling in-house but realized that building certain models would be too costly, necessitating the use of external foundational models [01:01:45].

AI Startups to Watch

He keeps a close eye on startups funded by the OpenAI Startup Fund due to their “incredible deal flow” [01:02:16]. He also finds AI voice and audio creation tools like ElevenLabs “really cool” for unleashing creative applications [01:02:24].

Speak.com Career Opportunities

Speak.com is actively hiring for all roles [01:02:47]. More information can be found at speak.com or speak.com/careers [01:02:44].