From: aidotengineer

Nvidia’s approach to speech AI aims to eliminate “awkward AI transcripts” by focusing on robust model development, deployment, and customization [00:00:08]. The Nvidia Riva platform specializes in enterprise-level speech AI model deployment, covering speech translation, text-to-speech, and speech recognition [00:00:36]. The primary focus is on low-latency, highly efficient models suitable for embedded devices [00:00:59].

Key Focus Areas in Model Development

Nvidia centers its model development around four main categories [00:01:08]:

  • Robustness: Models are designed to perform effectively in diverse sound environments, including noisy settings and telephone calls, by accounting for various environmental contamination factors [00:01:13].
  • Coverage: Development considers customer domains (e.g., medical, entertainment, call center) and language demands, supporting both monolingual and multilingual development, dialect variations, and code-switching [00:01:31].
  • Personalization: Customers can tailor models to their specific needs through features like target speaker AI, word boosting for uncommon vocabulary, and text normalization [00:01:58].
  • Deployment Cases: Considerations include the trade-off between speed and accuracy, and whether models should prioritize high variety or efficiency [00:02:21].

Model Architectures

Nvidia utilizes several core model architectures for speech AI [00:02:41]:

  • CTC (Connectionist Temporal Classification) Models: Preferred for high-speed inference, especially in streaming environments, due to their non-autoregressive decoding (a minimal decoding sketch follows this list) [00:02:51].
  • RNN-T (Recurrent Neural Network Transducer) / TDT (Token-and-Duration Transducer) Models: When more accuracy is needed than non-autoregressive decoding provides, RNN-T models, or Nvidia’s TDT variant, are used. They combine the encoder’s audio output with an internal language model in an autoregressive streaming setup [00:03:15].
  • Attention Encoder-Decoder Setups: For even greater accuracy, particularly when streaming is not a primary concern, these models (in the style of Whisper and LLM-based systems such as ChatGPT) are offered [00:03:40]. They excel at accommodating multiple tasks within a single model, such as speech translation, timestamp prediction, language identification, and speech recognition, often with simple prompt changes [00:04:06].
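
The speed difference between these decoder families comes down to how much work happens per output step. Here is a minimal sketch of CTC greedy decoding, illustrative only and not Riva’s implementation: every frame is an independent argmax, after which repeats are collapsed and blanks dropped, with no autoregressive loop.

```python
import numpy as np

BLANK = 0  # CTC blank token index (assumption for this toy sketch)
VOCAB = {1: "h", 2: "e", 3: "l", 4: "o"}  # toy vocabulary

def ctc_greedy_decode(log_probs: np.ndarray) -> str:
    """log_probs: (time_steps, vocab_size) array of per-frame scores."""
    best = log_probs.argmax(axis=1)  # one argmax per frame, computed in parallel
    # Collapse consecutive repeats, then drop blanks.
    collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
    return "".join(VOCAB[t] for t in collapsed if t != BLANK)

# Frames predicting "h h e <blank> l l <blank> l o" decode to "hello".
frames = [1, 1, 2, 0, 3, 3, 0, 3, 4]
scores = np.full((len(frames), 5), -10.0)
scores[np.arange(len(frames)), frames] = 0.0
print(ctc_greedy_decode(scores))  # -> "hello"
```

Because no step depends on the previous output token, the whole sequence can be decoded in one pass, which is what makes CTC attractive for streaming.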

The Fast Conformer Backbone

A unifying backbone across Nvidia’s decoder platforms is the Fast Conformer architecture [00:04:31]. Through empirical trials, it was found that the original Conformer model’s input could be sub-sampled far more aggressively, compressing audio into 80-millisecond time steps instead of the conventional 40-millisecond steps [00:04:37]. This innovation, quantified in the sketch after the list below, leads to:

  • Smaller audio inputs and reduced memory load during training [00:04:56].
  • More efficient training with quicker convergence and less data [00:05:02].
  • Faster inference due to data chunking into 80-millisecond time steps [00:05:12].
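
The effect of the coarser time step is easy to quantify. A back-of-the-envelope sketch, assuming the standard 10 ms feature-frame hop used by most ASR front ends:

```python
# Sequence length entering the decoder for 30 seconds of audio,
# assuming 10 ms mel-spectrogram frames before subsampling.
audio_seconds = 30
frames_10ms = audio_seconds * 100           # 3000 feature frames

conformer_4x = frames_10ms // 4             # 40 ms steps -> 750 frames
fast_conformer_8x = frames_10ms // 8        # 80 ms steps -> 375 frames

print(conformer_4x, fast_conformer_8x)      # 750 375

# Self-attention cost scales roughly with the square of sequence length,
# so halving the frame count cuts attention compute by about 4x.
print((conformer_4x / fast_conformer_8x) ** 2)  # 4.0
```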

Nvidia Riva Offerings

Nvidia’s model offerings are split into two main options:

  • Riva Parakeet: Focuses on streaming speech recognition cases, using CTC and TDT models for fast, efficient recognition, including target-speaker ASR [00:05:31].
  • Riva Canary: Builds attention encoder-decoder models on Fast Conformer encoders, prioritizing accuracy and multitask modeling where speed is less critical [00:06:00].

This dual approach provides customers with a comprehensive toolkit offering a mixture of fast, multitasking, or high-accuracy models, emphasizing variety and coverage over a one-model-fits-all solution [00:06:30].
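
Both families are published openly, so they are easy to try locally with the NeMo toolkit. A minimal sketch; the checkpoint name below is an assumption based on Nvidia’s public Hugging Face releases and may change, and the exact `transcribe` signature varies across NeMo versions:

```python
# pip install "nemo_toolkit[asr]"  -- sketch, not an official quickstart
import nemo.collections.asr as nemo_asr

# Checkpoint name is an assumption based on Nvidia's public releases.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

# transcribe() takes a list of audio file paths and returns transcripts.
print(model.transcribe(["sample.wav"]))
```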

Multi-Speaker and Target Speaker ASR

The Parakeet ASR model can be extended to multi-speaker and target-speaker scenarios by integrating the Sortformer diarization model [00:06:57]. Sortformer is an end-to-end neural diarizer that resolves speaker order by arrival time, i.e., “who comes first” [00:07:10]. It acts as a bridge between speaker timestamps from diarization and speaker tokens recognizable by the ASR model [00:07:21]. By fusing ASR encoder embeddings and Sortformer embeddings through a speaker kernel, it addresses the “who spoke what and when” problem [00:07:31]. This unified architecture can be fine-tuned with standard ASR training objectives [00:07:44].

Depending on whether optional query audio is supplied, the model performs either target-speaker ASR or single-/multi-speaker ASR [00:07:58]. The unified model architecture can be applied in both parallel joint and cascade systems [00:08:10].
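
The talk does not detail the speaker kernel’s internals, so the following PyTorch sketch is purely conceptual: the shapes, the projection layer, and the additive fusion are all assumptions chosen for illustration, not Nvidia’s actual design.

```python
import torch
import torch.nn as nn

class SpeakerKernelFusion(nn.Module):
    """Conceptual sketch: fuse ASR encoder frames with diarization
    (Sortformer-style) speaker-activity embeddings so a decoder could
    emit speaker tokens alongside text. All shapes/ops are illustrative."""

    def __init__(self, asr_dim: int = 512, num_speakers: int = 4):
        super().__init__()
        self.proj = nn.Linear(num_speakers, asr_dim)

    def forward(self, asr_emb: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # asr_emb: (batch, time, asr_dim) from the ASR encoder
        # spk_emb: (batch, time, num_speakers) per-frame speaker activity,
        #          ordered by arrival time ("who comes first")
        return asr_emb + self.proj(spk_emb)  # one simple fusion choice

fused = SpeakerKernelFusion()(torch.randn(1, 100, 512), torch.rand(1, 100, 4))
print(fused.shape)  # torch.Size([1, 100, 512])
```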

Additional Models and Enhancements

To further improve accuracy, customization, and readability, Nvidia offers additional models (a toy pipeline sketch follows this list) [00:08:27]:

  • Voice Activity Detection (VAD): Detects speech segments for better noise robustness, available as MarbleNet-based VAD models [00:08:34].
  • External Language Models: N-gram language models run within Riva pipelines to improve ASR transcription accuracy and enable customization [00:08:47].
  • Text Normalization (TN) and Inverse Text Normalization (ITN): Converts spoken forms to written forms for readability, using WFST-based ITN models [00:09:01].
  • Punctuation and Capitalization (PNC): Adds punctuation and capitalization to transcriptions for better readability, supported by BERT-based PNC models [00:09:14].
  • Speaker Diarization: Identifies multiple speakers in a conversation, available as a cascade system with TitaNet-based speaker embedding models, with end-to-end models upcoming [00:09:28].
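
In a full pipeline these pieces run around the acoustic model. The toy flow below uses plain-Python stand-ins, not Riva’s actual modules, to show what ITN and PNC each contribute to readability:

```python
# Toy stand-ins for two pipeline stages; the real Riva components are
# WFST-based ITN and BERT-based PNC models.

def inverse_text_normalize(text: str) -> str:
    # Real ITN uses weighted finite-state transducers; this is one toy rule.
    return text.replace("twenty five dollars", "$25")

def punctuate_capitalize(text: str) -> str:
    # Real PNC is a BERT-based tagger; this toy version fixes one sentence.
    return text.capitalize().rstrip() + "."

raw = "the total comes to twenty five dollars"
print(punctuate_capitalize(inverse_text_normalize(raw)))
# -> "The total comes to $25."
```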

This level of customization contributes to Nvidia models frequently ranking among the top on leaderboards like the Hugging Face Open ASR leaderboard [00:09:51].

Training and Data Philosophy

Nvidia emphasizes fundamental data development practices [00:11:06]:

  • Data Sourcing: Focuses on robustness, multilingual coverage, and dialect sensitivity [00:11:14]. Extensive language documentation is gathered to guide data acquisition [00:11:23].
  • Data Integration: Combines both open-source data (for variety and domain shift) and proprietary data (for high-quality entity data) [00:11:30].
  • Pseudo Labeling: Utilizes transcripts from top commercial models to benefit from community and internal developments [00:11:44].

For training, standard, openly available tools are primarily used [00:12:01]. The NeMo research toolkit, an open-source library, is employed for model training, offering maximized GPU utilization, data bucketing, and high-speed data loading via the Lhotse backend [00:12:07]. Data is stored on an object-store infrastructure to facilitate quick migration between different cluster settings [00:12:41].
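
Data bucketing deserves a concrete illustration: batching utterances of similar duration minimizes wasted padding compute, which is one of the main levers for GPU utilization. A simplified static sketch; Lhotse’s real implementation does this dynamically:

```python
from collections import defaultdict

# Simplified duration bucketing: group similar-length utterances so that
# batches contain little padding. (Toy file names and durations.)
utterances = [("a.wav", 2.1), ("b.wav", 14.8), ("c.wav", 2.4),
              ("d.wav", 15.2), ("e.wav", 7.9), ("f.wav", 8.3)]

buckets = defaultdict(list)
for path, seconds in utterances:
    buckets[int(seconds // 5)].append(path)  # 5-second-wide buckets

for bucket_id, paths in sorted(buckets.items()):
    lo, hi = bucket_id * 5, bucket_id * 5 + 5
    print(f"bucket ~{lo}-{hi}s: {paths}")
```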

Validation combines open-source and proprietary data so that models are extensively tested for bias and domain robustness across language categories before reaching end users [00:12:52].

Deployment and Customization

Trained models are deployed through Nvidia Riva using Nvidia NIM for low-latency, high-throughput inference [00:13:22]. High-performance inference is enabled by Nvidia TensorRT optimizations and the Nvidia Triton Inference Server [00:13:31].

Nvidia Riva is fully containerized and designed for scalability, capable of handling hundreds of parallel streams [00:13:50]. It can be run on-premise, in any cloud, at the edge, or on embedded platforms [00:13:58], supporting diverse applications such as contact centers, consumer applications, and video conferencing [00:14:07]. Nvidia NIM provides pre-built containers and industry-standard API support for custom models and optimized inference engines [00:14:17].
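
Against a running Riva server, transcription goes through the gRPC client. A minimal offline-recognition sketch using the `nvidia-riva-client` Python package; the server address and audio file are placeholders, and API details may differ across Riva versions, so check the docs for yours:

```python
# pip install nvidia-riva-client  -- sketch, not an official quickstart
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # placeholder server address
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,  # BERT-based PNC in the pipeline
)

with open("sample.wav", "rb") as f:     # placeholder audio file
    response = asr.offline_recognize(f.read(), config)

print(response.results[0].alternatives[0].transcript)
```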

Customer customization is a critical focus, as every application demands domain-specific knowledge (e.g., medical terms, menu names, telephonic conditions) [00:14:26]. Nvidia Riva offers customization features at every stage [00:14:52]:

  • Fine-tuning of acoustic models (Parakeet and Canary-based models) [00:15:04].
  • Fine-tuning of external language models, punctuation models, and inverse text normalization models [00:15:10].
  • Word boosting for improved recognition of product names, jargon, and context-specific terms (see the sketch after this list) [00:15:18].
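
Word boosting is the lightest-weight of these options: it raises the decode-time score of given phrases with no retraining. A sketch building on the client config above; `add_word_boosting_to_config` is a helper in the Riva Python client, but verify the exact name and score scale for your version:

```python
import riva.client

config = riva.client.RecognitionConfig(language_code="en-US")

# Boost domain terms (e.g., product names or medical jargon) at decode time.
# The boost score is a tunable weight; higher values favor these phrases.
riva.client.add_word_boosting_to_config(
    config,
    boosted_lm_words=["Riva", "Parakeet", "conformer"],
    boosted_lm_score=4.0,
)
```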

Getting Started

Nvidia Riva models are available through Nvidia NIM [00:15:32]. More information and available models can be found at build.nvidia.com/explore/speech [00:15:41]. Resources include a quick-start guide, a developer forum, and a fine-tuning guide for models within the NeMo framework [00:15:57].