From: redpointai

DeepL, founded by Jaroslaw Kutyłowski, was recently valued at $2 billion and provides AI translation to over 100,000 businesses globally; it was doing cutting-edge AI research long before the recent surge in AI's popularity [00:00:00]. The company's unusual approach of building its own data centers and managing data labeling in-house has been central to its success against larger competitors like Google Translate [00:22:22].

In-house Data Labeling

DeepL emphasizes the increasing importance of human data in AI development, especially with the rise of reinforcement learning in Large Language Models (LLMs) [00:14:48]. The company has run large-scale data annotation projects internally for years, employing human translators for model training and quality assurance [00:15:04].

This in-house approach is deemed crucial for the specialized models DeepL builds, where customers have high expectations for consistent quality [00:15:31].

Role of Human Translators

DeepL works with thousands of human translators globally [00:15:53]. The company prioritizes hiring native speakers of each target language, for example native Brazilian Portuguese speakers for translations into Brazilian Portuguese [00:16:09]. These translators help in two key ways:

  • Model Training: They assist in training models to achieve the desired translation style [00:15:22].
  • Quality Assurance: They are vital for maintaining top-notch quality assurance, which is critical for specialized models [00:15:29].

The process also involves constantly monitoring translator performance, since even a short absence of a high-performing translator can affect quality [00:18:08]. This close contact helps the team catch quality issues early and ensure data requirements are met [00:18:20].
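
The monitoring described above could be sketched as a rolling quality check per translator. Everything here is a hypothetical illustration (the window size, the 1–5 review-score scale, and the threshold are assumptions, not DeepL's actual system):

```python
# Toy sketch: flag translators whose recent review scores dip, so quality
# issues surface before they reach production data. Illustrative only.
from collections import deque

def make_monitor(window=3, threshold=4.0):
    """Track a rolling window of review scores per translator."""
    history = {}

    def record(translator_id, score):
        scores = history.setdefault(translator_id, deque(maxlen=window))
        scores.append(score)
        avg = sum(scores) / len(scores)
        # Flag only once a full window exists and its average falls below threshold
        return len(scores) == window and avg < threshold

    return record

record = make_monitor(window=3, threshold=4.0)
record("t1", 4.5)
record("t1", 3.8)
flagged = record("t1", 3.5)  # rolling average ~3.93 < 4.0, so flagged
```

In practice the signal would come from human review of sampled translations rather than a single score, but the feedback loop is the same: continuous measurement tied to each contributor.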

Rationale for In-house Data Labeling

DeepL keeps data labeling in-house to retain maximum control over the process, since quality is paramount for its specific tasks [00:17:21]. The company has considered outsourcing parts of it, but the core reason for internal management is the need for high-quality, specialized data that can be meticulously selected and managed [00:17:46]. This granular control enables direct feedback loops between application performance and model adjustments [00:11:15].

For example, DeepL lets customers embed their own terminology into their models while still adhering to grammatical rules and handling word ambiguities, a capability that comes directly from this tight feedback loop and deep control over the models [00:11:38].
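
To make the terminology idea concrete, here is a minimal sketch of an after-the-fact glossary check. The function name, the glossary format, and the substring matching are all assumptions for illustration; a real system like the one described would enforce terms inside the model itself, handling inflection and word-sense ambiguity rather than doing naive string matching:

```python
# Hypothetical glossary check: verify a translation uses the customer's
# required target term whenever the corresponding source term appears.
def check_terminology(source, translation, glossary):
    """glossary maps source-language terms to required target-language terms."""
    violations = []
    for src_term, tgt_term in glossary.items():
        # Naive substring matching; real systems must handle inflected forms
        if src_term.lower() in source.lower() and tgt_term.lower() not in translation.lower():
            violations.append((src_term, tgt_term))
    return violations

glossary = {"invoice": "Rechnung"}
ok = check_terminology("Please send the invoice.",
                       "Bitte senden Sie die Rechnung.", glossary)
bad = check_terminology("Please send the invoice.",
                        "Bitte senden Sie die Quittung.", glossary)
# ok is empty; bad reports the missing required term
```

The gap between this toy check and grammatically correct in-model enforcement is exactly why the deep model control described above matters.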

AI Infrastructure

DeepL operates as a “build it yourself” company [00:09:59]. This philosophy originated in the early days, when the necessary tools, data centers, and models were not readily available [00:10:06]. Owning the entire vertical stack, from go-to-market to product, engineering, and research, allows the company to identify and solve complex customer problems that simple prompt engineering cannot address [00:10:25].

Building Own Data Centers

DeepL chose to build and operate its own data centers from the beginning, primarily due to the lack of alternatives at the time [00:27:04].

Advantages of this approach include:

  • Cost Efficiency: Running their own data centers offers significant cost advantages at scale [00:27:22].
  • Hardware Availability: It ensures access to the newest hardware, enabling faster market entry and innovation [00:29:20].
  • Optimization: Because GPU compute is scarce and power-hungry, running their own infrastructure lets them optimize operations, which is crucial for sustainability, both environmental and commercial [00:28:18].
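
The cost-efficiency argument can be made concrete with back-of-envelope arithmetic. All numbers below are illustrative assumptions (not DeepL figures): a GPU server bought outright has a large upfront cost but low hourly running cost, while renting from a hyperscaler is pay-per-hour, so ownership wins past a utilization break-even point:

```python
# Toy owned-vs-rented GPU cost model. Capex, opex, and rental rate are
# made-up round numbers chosen only to show the break-even mechanic.
def own_cost(hours, capex=30_000.0, opex_per_hour=0.40):
    """Total cost of buying a GPU server outright and operating it."""
    return capex + opex_per_hour * hours

def cloud_cost(hours, rate_per_hour=2.50):
    """Renting an equivalent GPU from a hyperscaler."""
    return rate_per_hour * hours

# Break-even: capex / (rental rate - opex) GPU-hours of sustained use
breakeven_hours = 30_000.0 / (2.50 - 0.40)  # roughly 14,286 hours
```

Below the break-even point renting is cheaper; well above it, as at DeepL's scale of sustained training and inference load, owning the hardware is, under these assumptions, significantly cheaper per GPU-hour.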

Challenges and Future Considerations

While beneficial, operating its own data centers adds complexity and can slow development [00:29:47]. DeepL is currently moving large parts of its stack to hyperscalers in a hybrid cloud model [00:29:56], keeping on-premise only what must run there for efficiency, security, or data protection reasons [00:30:02].
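
A hybrid split like this amounts to a placement policy per workload. The sketch below is purely hypothetical; the field names and rules are invented to illustrate the idea, not DeepL's actual criteria:

```python
# Toy hybrid-cloud placement rule: workloads with data-protection, security,
# or efficiency constraints stay on-premise; everything else goes to a
# hyperscaler. Field names and policy are illustrative assumptions.
def place_workload(workload):
    if workload.get("data_residency_required") or workload.get("security_sensitive"):
        return "on_premise"
    if workload.get("needs_dedicated_gpus"):
        return "on_premise"  # e.g. sustained training load that is cheaper owned
    return "hyperscaler"

placement = place_workload({"data_residency_required": True})
# placement == "on_premise"
```

Encoding the split as an explicit policy is one way to keep the on-premise footprint minimal while the rest of the stack moves to the cloud.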

The overall tooling for GPU compute is still in its early stages [00:27:37]. While CPU compute has sophisticated abstraction layers that make it cheap and sustainable, GPU compute remains scarce and power-hungry, requiring careful optimization for large-scale operations [00:27:43].

For other companies, using hyperscalers is the recommended way to kickstart operations, but transitioning to owned data centers can become advantageous at significant scale because of cost and hardware availability [00:26:51].