From: aidotengineer
Nvidia’s approach to AI transcription model development focuses on delivering highly efficient and robust conversational AI solutions for enterprise-level applications [00:00:36]. The development process emphasizes deployment and customization for a diverse customer base [00:00:28]. These models are designed for low latency and high efficiency, suitable for deployment on embedded devices [00:00:59].
Core Development Principles
Nvidia’s model development focuses on four main categories:
- Robustness [00:01:13]: Ensuring models perform well in both noisy and clean environments, accounting for varying sound quality, telephone audio, and environmental contamination factors [00:01:15].
- Coverage [00:01:31]: Addressing diverse customer domains (medical, entertainment, call center), language demands (monolingual or multilingual), dialects, and code-switching [00:01:34].
- Personalization [00:01:58]: Customizing models to meet specific customer needs, including target speaker AI, word boosting for uncommon vocabulary, and text normalization using FST models [00:02:00].
- Deployment [00:02:21]: Optimizing for the trade-off between speed and accuracy, and balancing model variety with efficiency for specific use cases [00:02:24].
Model Architectures
Nvidia utilizes several model architectures to achieve its goals:
- CTC (Connectionist Temporal Classification) Models [00:02:51]: These non-autoregressive models are optimal for high-speed inference, particularly in streaming environments where data can be chunked and processed efficiently [00:02:55].
- RNN-T (Recurrent Neural Network Transducer) and TDT (Token-and-Duration Transducer) Models [00:03:24]: When non-autoregressive decoding is not accurate enough, these models combine the encoder’s audio output with an internal language model (LM) to enable autoregressive streaming setups [00:03:30].
- Attention Encoder-Decoder Setups [00:03:46]: These are used when maximum accuracy is required and streaming is not the primary concern [00:03:42]. They excel at accommodating multiple tasks within a single model, such as speech translation, timestamp prediction, language identification, and speech recognition, through simple prompt changes [00:04:06].
- Fast Conformer [00:04:33]: This is the fundamental architecture underlying all of the decoder options above [00:04:31]. Through empirical trials, it was found that the original Conformer model can be subsampled further, compressing audio into 80-millisecond frames instead of the conventional 40-millisecond steps [00:04:37]. This reduces memory load during training, increases training efficiency by allowing quicker convergence with less data, and enables faster inference by chunking data into 80-millisecond timesteps [00:05:00].
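As a concrete illustration of the CTC option above, here is a minimal sketch of loading a Fast Conformer CTC model with the open-source NeMo toolkit and transcribing a file. The checkpoint name ("nvidia/parakeet-ctc-1.1b") and audio path are illustrative assumptions; substitute any CTC checkpoint from the NeMo catalog.

```python
# Minimal sketch: Fast Conformer CTC inference with NeMo (names are illustrative).
import nemo.collections.asr as nemo_asr

# Download a pretrained Fast Conformer CTC checkpoint.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-1.1b")

# Non-autoregressive CTC decoding: each 80 ms frame is classified in one pass,
# which is what makes chunked streaming inference cheap.
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])  # newer NeMo versions return Hypothesis objects with a .text field
```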
Nvidia Riva Offerings
Nvidia’s model offerings are split into two primary options:
- Riva Parakeet [00:05:31]: Focuses on streaming speech recognition use cases, utilizing CTC and TDT models [00:05:34]. It is designed for fast, efficient recognition in tasks such as speech recognition, speech translation, and target-speaker ASR [00:05:42].
- Riva Canary [00:06:00]: Incorporates the Fast Conformer attention encoder-decoder models and prioritizes accuracy and multitask modeling [00:06:03]. It aims for the best possible accuracy, with speed a secondary consideration [00:06:13].
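The multitask behavior of the Canary family can be steered through fields in the input manifest rather than separate models. The sketch below follows the public "nvidia/canary-1b" model card; the checkpoint name, manifest fields, and file paths should be treated as assumptions and verified against the NeMo version you install.

```python
# Sketch: task switching with a Canary-style attention encoder-decoder model.
import json
from nemo.collections.asr.models import EncDecMultiTaskModel

canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# The task is selected by prompt-like manifest fields: the same checkpoint
# performs ASR or speech translation depending on these values.
entry = {
    "audio_filepath": "sample.wav",
    "duration": None,               # let NeMo infer the duration
    "taskname": "s2t_translation",  # or "asr"
    "source_lang": "en",
    "target_lang": "de",
    "pnc": "yes",                   # request punctuation and capitalization
}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")

predictions = canary.transcribe("manifest.json", batch_size=1)
print(predictions[0])
```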
The overarching philosophy is to provide a variety of models to meet specific customer needs, rather than a “one model fits all” approach [00:06:39].
Advanced Capabilities
Multi-speaker and Target Speaker ASR
The Parakeet ASR model can be extended to multi-speaker and target-speaker scenarios by integrating the Sortformer model [00:07:00]. Sortformer is an end-to-end neural diarizer that follows an arrival-time ordering principle (whoever speaks first is labeled first) [00:07:10]. It acts as a bridge between the speaker timestamps produced by diarization and the speaker tokens the ASR model can recognize [00:07:21]. The ASR encoder embedding and the Sortformer embedding are fused via a speaker kernel to address the “who spoke what and when” problem [00:07:31]. This unified architecture can be fine-tuned with a simple objective similar to standard ASR training [00:07:44], and it performs target-speaker ASR or multi-speaker ASR depending on whether an optional query audio is provided [00:07:56]. The architecture can be applied in both joint (parallel) and cascaded system configurations [00:08:10].
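To make the cascaded configuration concrete, here is a toy sketch of merging diarization output (“who, when”) with ASR word timestamps (“what, when”) to answer “who spoke what and when”. The data structures and the overlap heuristic are hypothetical illustrations, not the Sortformer API.

```python
# Toy sketch of the cascaded manner: label each ASR word with the speaker
# whose diarization segment overlaps it most. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Segment:      # diarizer output: one speaker-activity interval
    speaker: str
    start: float
    end: float

@dataclass
class Word:         # ASR output: one word with timestamps
    text: str
    start: float
    end: float

def assign_speakers(words: list[Word], segments: list[Segment]) -> list[tuple[str, str]]:
    """Label each word with the speaker whose segment overlaps it most."""
    labeled = []
    for w in words:
        best, best_overlap = "unknown", 0.0
        for s in segments:
            overlap = min(w.end, s.end) - max(w.start, s.start)
            if overlap > best_overlap:
                best, best_overlap = s.speaker, overlap
        labeled.append((best, w.text))
    return labeled

segments = [Segment("spk0", 0.0, 2.1), Segment("spk1", 2.1, 4.0)]
words = [Word("hello", 0.3, 0.7), Word("hi", 2.4, 2.6)]
print(assign_speakers(words, segments))  # [('spk0', 'hello'), ('spk1', 'hi')]
```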
Ancillary Models for Accuracy & Readability
Additional models are offered to improve overall accuracy, customization, and readability:
- Voice Activity Detection (VAD) [00:08:34]: Detects speech segments for better noise robustness, with models based on MarbleNet [00:08:36].
- External Language Models (LMs) [00:08:47]: Enhance ASR transcription for better accuracy and customization, including N-gram-based LMs in the Riva pipelines [00:08:50].
- Text Normalization and Inverse Text Normalization (ITN) [00:09:00]: Convert spoken forms to written forms for readability, using WFST (weighted finite-state transducer)-based ITN models (see the sketch after this list) [00:09:02].
- Punctuation and Capitalization (PNC) [00:09:15]: Adds punctuation and capitalization to transcriptions, supported by BERT-based PNC models [00:09:18].
- Speaker Diarization [00:09:27]: Identifies multiple speakers in a conversation, with Sortformer-based speaker diarization available as a cascaded system and an upcoming end-to-end model [00:09:29].
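For the ITN stage referenced above, a minimal sketch using the open-source nemo_text_processing package (installable via pip) is shown below; the assumption is that its English WFST grammars illustrate the same spoken-to-written conversion the Riva pipeline performs.

```python
# Sketch: WFST-based inverse text normalization with nemo_text_processing.
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

itn = InverseNormalizer(lang="en")

# Spoken-form ASR output -> written form for readability.
spoken = "on may fifth i paid one hundred and twenty three dollars"
print(itn.inverse_normalize(spoken, verbose=False))
# e.g. "on may 5th i paid $123"
```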
Development Process
Nvidia’s development process emphasizes robust data sourcing, efficient training, and thorough validation to ensure high-quality models.
Data Sourcing
Training and validation of Nvidia models focuses on robustness and multilingual coverage with dialect sensitivity [00:11:14]. Nvidia incorporates both open-source and proprietary data:
- Open Source Data [00:11:30]: Allows for variety and domain shift.
- Proprietary Data [00:11:37]: Focuses on high-quality entity data.
Pseudo-labeling is used, where transcripts from top-of-the-line commercial models are utilized to benefit from community and internal developments [00:11:44].
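A hedged sketch of the pseudo-labeling idea: transcribe an unlabeled audio pool with a strong existing model and write NeMo-style manifest lines for later training. The teacher checkpoint and file paths are illustrative assumptions.

```python
# Sketch: generate pseudo-labels with a strong model and write a NeMo manifest.
import json
import nemo.collections.asr as nemo_asr

teacher = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

unlabeled = ["clip_000.wav", "clip_001.wav"]   # unlabeled audio pool
hypotheses = teacher.transcribe(unlabeled)

with open("pseudo_labeled_manifest.json", "w") as f:
    for path, hyp in zip(unlabeled, hypotheses):
        text = hyp.text if hasattr(hyp, "text") else hyp  # handle NeMo version differences
        f.write(json.dumps({"audio_filepath": path, "text": text}) + "\n")
```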
Training
Training relies on standard, publicly available tools and processes [00:12:05]. The NeMo research toolkit, an open-source library, is used for model training [00:12:07]. It includes tools for maximizing GPU utilization, data bucketing, and high-speed data loading via the Lhotse backend [00:12:20]. The focus is on maximizing data utilization and ingestion speed across different settings [00:12:31]. Most data is stored on an object-store infrastructure for quick migration between cluster settings [00:12:41].
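To illustrate the bucketing idea, the sketch below uses Lhotse’s public DynamicBucketingSampler to group utterances of similar duration so each batch carries a similar total amount of audio, cutting padding waste. The manifest path and parameter values are assumptions.

```python
# Sketch: duration-bucketed batching with Lhotse (manifest path is illustrative).
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("train_cuts.jsonl.gz")   # Lhotse cut manifest

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=600.0,   # ~10 minutes of audio per batch, regardless of utterance count
    num_buckets=30,       # buckets of similar-length utterances reduce padding
    shuffle=True,
)

for batch_cuts in sampler:
    ...  # feed each CutSet batch to the dataset/dataloader
```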
Validation
Validation also incorporates a mix of open-source and proprietary data to ensure comprehensive coverage [00:12:52]. Before models reach end-users, they undergo extensive bias and domain testing across all language categories to ensure maximum robustness [00:13:01].
Deployment & Customization
Nvidia Reva and NIM
Trained models are deployed to Nvidia Riva via Nvidia NIM for low-latency and high-throughput inference [00:13:22]. High-performance inference is powered by Nvidia TensorRT optimizations and the Nvidia Triton Inference Server [00:13:31]. Nvidia Riva offers gRPC-based microservices for low-latency streaming and high-throughput offline use cases [00:13:42]. It is fully containerized, allowing it to scale to hundreds of parallel streams and run on-prem, in any cloud, at the edge, or on embedded platforms [00:13:50]. This supports a variety of applications, including contact centers, consumer applications, and video conferencing [00:14:04]. Nvidia NIM also offers pre-built containers with industry-standard API support for custom models and optimized inference engines [00:14:17].
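A hedged sketch of calling a deployed Riva ASR microservice over gRPC with the nvidia-riva-client Python package; the server address and audio file are assumptions for a local deployment.

```python
# Sketch: offline recognition against a Riva gRPC endpoint (addresses are assumptions).
import riva.client

auth = riva.client.Auth(uri="localhost:50051")   # Riva gRPC endpoint
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,  # PNC model applied server-side
)

with open("sample.wav", "rb") as f:
    audio_bytes = f.read()

response = asr.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```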
Customization Features
Fine-tuning off-the-shelf models is crucial for production stability, as real-world scenarios often require domain-specific knowledge (e.g., medical terms, menu names) and robustness to telephonic or noisy audio [00:14:26]. Nvidia Riva provides customization features at every stage [00:14:52]:
- Acoustic Model Fine-tuning [00:15:04]: For both Parakeet-based and Canary-based models.
- External Language Model Fine-tuning [00:15:10]: For N-gram, punctuation, and inverse text normalization models.
- Word Boosting [00:15:18]: To improve recognition of product names, jargon, and context-specific knowledge [00:15:24].
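A short sketch of word boosting with the Riva Python client: the helper add_word_boosting_to_config (part of nvidia-riva-client) raises the decoding score of domain terms so they are preferred during recognition. The boosted terms and score are illustrative.

```python
# Sketch: boost domain-specific terms in a Riva recognition config.
import riva.client

config = riva.client.RecognitionConfig(language_code="en-US")
riva.client.add_word_boosting_to_config(
    config,
    boosted_lm_words=["AntiBERTa", "ABlooper"],  # product names / jargon
    boosted_lm_score=20.0,                       # higher score = stronger preference
)
# Pass `config` to offline_recognize or streaming recognition as in the
# previous sketch; boosted terms are then favored during decoding.
```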
This focus on customization and variety has helped Nvidia models consistently rank among the top entries on the Hugging Face Open ASR leaderboard [00:10:00].