From: redpointai

Grammarly’s Evolution with LLMs

Grammarly, an AI-powered writing assistant, has been building AI solutions for 15 years, having launched in 2009 [04:13:00]. The company has ridden multiple technology waves: it started primarily with rules-based Natural Language Processing (NLP) [04:23:00], moved to deep learning models [04:27:00], and now makes extensive use of LLMs and generative AI [04:28:00]. The approach is to identify user problems and then apply the best available technology to solve them effectively [04:32:00].

The emergence of ChatGPT was a “watershed moment” for Grammarly, surprising the company with its speed, scale, and pace of quality improvement [15:55:00]. Earlier NLP-based systems had greater precision than initial LLM outputs, which frequently hallucinated, but quality improvements over the last couple of years have been phenomenal: for grammar, LLMs are now essentially as good as rule-based systems [17:35:00]. Grammarly views LLMs as a “huge enabler” for its mission, allowing it to provide better, deeper, and more meaningful solutions to long-standing user problems [16:32:00].

Expanding Communication Capabilities with LLMs

Grammarly envisions the communication lifecycle in four stages:

  1. Ideation and Conceptualization: Thinking about what to say [04:53:00].
  2. Composition: Writing down the message [05:01:00].
  3. Revision and Polishing: Making the text better [05:04:00].
  4. Comprehension: The recipient understanding the message [05:08:00].

Historically, Grammarly focused on the revision phase, helping with correctness, style guides, tone, and brevity [05:23:00]. LLMs enable Grammarly to “turbocharge” the value provided to users in two main ways:

  • Strategic Suggestions: Tying communication to business outcomes by offering suggestions aligned with desired goals (e.g., adding free food details to an event email to drum up enthusiasm, or clarifying a call to action in a board email) [06:00:00]. Basic mechanics like correctness and tone can be auto-applied [07:08:00].
  • Full Communication Lifecycle Support: Moving beyond just revision to assist with ideation, composition, and comprehension (e.g., summarizing long email threads and identifying action items) [07:22:00].

Looking ahead, future LLMs are expected to be more capable of multi-step reasoning [21:25:00], which will enable agentic workflows [21:28:00]. This means Grammarly could orchestrate complex, multi-step communication flows by pulling in context from various sources and even proposing the best steps to achieve a communication goal, reducing “drudgery” and enabling “flow state” [21:33:00].

Challenges and Evaluation Strategies

A significant challenge in deploying LLMs is ensuring quality and safety, especially given the “high stakes nature of human communication” [08:38:00]. False positives, mishandled sensitive text, or safety issues can have real consequences [08:45:00]. Grammarly therefore does extensive work to fine-tune models for specific use cases and runs rigorous quality and safety evaluations before shipping models to users [08:56:00].

Evaluation Methods

Grammarly employs a multi-dimensional process for evaluating LLMs:

  • External Benchmarks: Looking at general-purpose benchmarks closest to their use cases for objective external measurements [28:59:00].
  • Safety Evals: Running internal safety evaluations based on extensive user feedback and understanding of guardrails around sensitive content [29:11:00]. This provides a “powerful” eval data set from real-world scenarios [29:30:00].
  • Side-by-Side Comparisons: Linguistic experts rate LLM output against human-curated output to determine preferences [29:36:00].
  • User Feedback and Engagement: Tracking how users accept or reject suggestions and engage with features provides continuous quality input [10:02:00]. Experiments are run with small user percentages to gauge real-world performance [10:21:00]. This feedback loop is crucial, as features that pass internal evaluations might not resonate with users in practice [10:51:00].
    • Example: The tone detector feature, while popular, revealed unexpected edge cases where tone suggestions were inappropriate, such as for police reports of serious crimes [11:28:00].
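The side-by-side comparison step above can be pictured as a small rating harness. This is a minimal sketch, not Grammarly's actual pipeline; the `EvalResult` schema and the `judge` callback (standing in for a panel of linguistic experts) are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    # Counts from side-by-side ratings (hypothetical schema).
    model_preferred: int = 0
    human_preferred: int = 0
    ties: int = 0

    def preference_rate(self) -> float:
        """Fraction of comparisons where the model output was preferred."""
        total = self.model_preferred + self.human_preferred + self.ties
        return self.model_preferred / total if total else 0.0

def run_side_by_side(pairs, judge) -> EvalResult:
    """Rate LLM output against human-curated output for each example.

    `pairs` is an iterable of (model_output, human_output); `judge` returns
    "model", "human", or "tie" -- in practice, linguistic experts.
    """
    result = EvalResult()
    for model_out, human_out in pairs:
        verdict = judge(model_out, human_out)
        if verdict == "model":
            result.model_preferred += 1
        elif verdict == "human":
            result.human_preferred += 1
        else:
            result.ties += 1
    return result
```

The same aggregate shape works for the user-feedback signal: accept/reject events on live suggestions feed an acceptance rate that complements the expert ratings.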

Tailoring and Optimizing LLMs

Grammarly uses a combination of closed- and open-source models in production, often fine-tuned on its vast user data [24:08:00]. The company processes 75 billion user events daily, providing a unique advantage in fine-tuning and personalizing models for different use cases [26:01:00].

The goal is not to use the smallest possible model, but the most efficient model without compromising quality or fidelity [24:44:00]. Cost and latency are key considerations; low latency enhances user experience and “flow state” [25:11:00].
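That tradeoff (most efficient model that still meets the quality bar) amounts to a constrained selection. A toy sketch, where the candidate catalog, scores, costs, and threshold are entirely invented for illustration:

```python
# Hypothetical model catalog: (name, internal-eval quality score,
# cost per 1K tokens in cents, p95 latency in ms). Values are made up.
CANDIDATES = [
    ("large-closed", 0.95, 3.0, 900),
    ("mid-finetuned", 0.93, 0.8, 300),
    ("small-open", 0.84, 0.1, 120),
]

def pick_model(candidates, min_quality=0.90):
    """Pick the cheapest, lowest-latency model that still clears the
    quality bar -- 'most efficient without compromising fidelity'."""
    eligible = [c for c in candidates if c[1] >= min_quality]
    return min(eligible, key=lambda c: (c[2], c[3])) if eligible else None
```

Under these made-up numbers the mid-sized fine-tuned model wins: the smallest model fails the quality floor, and the largest buys no extra fidelity worth its cost and latency.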

Grammarly tailors its product in several ways:

  • Personalization: Allowing individual users to define and refine their unique voice [26:37:00].
  • Organizational Customization: Ingesting organization-specific knowledge like style guides, brand tones, and corporate values to ensure consistent and compliant communication across the organization and with customers [26:54:00]. This helps automate the application of rules that might otherwise be found in large, manual documents [27:51:00].

The models are fine-tuned on use cases, and organizational-specific knowledge is ingested separately to apply rules and guidelines in the flow of communication [27:37:00].
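One way to picture that separation is a lightweight rules layer applied over model output, so an organization's guide can change without retraining. The rule format below is a hypothetical illustration, not Grammarly's schema:

```python
import re

# Hypothetical style-guide rules ingested from an organization's documents:
# each maps a discouraged term to the preferred brand-approved wording.
STYLE_RULES = {
    r"\bsign up\b": "register",
    r"\bclick here\b": "select the link",
}

def apply_style_guide(text: str, rules: dict[str, str]) -> str:
    """Apply organization-specific terminology rules to generated text.

    The generative model is fine-tuned per use case; rules like these are
    layered on separately so each organization's guide stays current
    without touching the model.
    """
    for pattern, preferred in rules.items():
        text = re.sub(pattern, preferred, text, flags=re.IGNORECASE)
    return text
```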

LLMs in Enterprise and Education

Enterprise Adoption

The enterprise AI market is seen as a “transformation journey” rather than a one-time deployment [36:17:00]. Enterprises need to select trusted partners for this multi-year transformation [36:52:00]. While there is much excitement and investment, measurable productivity gains from AI remain “elusive” outside a few core use cases such as software engineering and code generation [37:23:00]. Grammarly emphasizes demonstrating measurable value; for instance, the average Grammarly user in an organization saves 19 days per year [38:03:00].

Education Sector

The education sector faces the challenge of responsibly incorporating powerful new AI tools into classrooms and pedagogical methods [39:51:00]. The initial impulse to ban AI has largely dissipated, replaced by an eagerness to engage and equip graduates with essential AI skills for the workforce [40:53:00].

Grammarly supports responsible AI use in education through features like:

  • Citing AI Use: Allows students to cite how they used AI in their work, distinguishing between using AI for full content generation (e.g., writing an essay) and using it as a co-pilot for feedback and refinement [41:18:00]. This promotes deeper engagement and learning [42:01:00].
  • Authorship Detection: A feature that identifies the provenance of every piece of content in a document, distinguishing between manually written, cut-and-pasted, or AI-generated sections [42:27:00]. This provides transparency and tools for educators to build guardrails around AI use [43:04:00].
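The provenance-tracking idea behind authorship detection can be sketched as spans of text tagged with their origin. The taxonomy and report format here are hypothetical, not the actual feature's implementation:

```python
from dataclasses import dataclass
from enum import Enum

class Origin(Enum):
    # Possible provenance labels for a span of text (hypothetical taxonomy).
    TYPED = "manually written"
    PASTED = "cut-and-pasted"
    AI_GENERATED = "AI-generated"

@dataclass
class Span:
    text: str
    origin: Origin

def provenance_report(spans: list[Span]) -> dict[str, int]:
    """Summarize how many characters of a document came from each origin,
    giving educators a transparent breakdown to build guardrails around."""
    report: dict[str, int] = {}
    for span in spans:
        report[span.origin.value] = report.get(span.origin.value, 0) + len(span.text)
    return report
```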

Ultimately, AI in education should serve as a “tool to give us superpowers” and “augment” human capabilities, not displace them [44:24:00]. AI also acts as a “great leveler” and “democratizer of skills,” providing educational support to those who may not otherwise have access [44:48:00].

Future Outlook

Grammarly currently runs about “half a dozen or so” LLMs in production [24:08:00]. While current models are “idiosyncratic” and require significant work to adapt to specific use cases [23:17:00], increasing efficiency gains mean that much inference could move on-device in the near future [19:34:00]. This shift to on-device processing offers benefits in security, privacy, latency, and user experience [20:01:00].

The general consensus is that AI is “overhyped in the short term and underhyped in the long term” [14:50:00]. While transformative, simply “sticking AI into everything” without clear goals is not an effective way to solve user problems [14:57:00]. For the broader enterprise market, AI is expected to profoundly transform work [36:03:00].

One underhyped aspect of AI is its potential to “upskill and uplevel people around the world,” serving as a powerful “democratizer of skills” and a “force multiplier” in the workforce [46:52:00]. This is particularly impactful for those who may lack access to traditional educational resources or professional development opportunities [47:05:00].