From: aidotengineer
Nvidia’s approach to developing AI models for speech AI focuses on creating highly efficient, low-latency, and robust models for enterprise-level deployment [00:00:36]. The team, known as Nvidia Riva, provides solutions for speech translation, text-to-speech development, and speech recognition, aiming to offer the best possible conversational AI [00:00:41].
Model Development Philosophy
Nvidia’s model development is guided by four key categories [00:01:08]:
- Robustness: Models are designed to perform well in both noisy and clean environments, accounting for factors like telephone sound quality and environmental contamination [00:01:13].
- Coverage: Development considers various domains (medical, entertainment, call centers), language demands (monolingual or multilingual), dialects, and code-switching [00:01:31].
- Personalization: Customers can tailor models to their specific needs, including target speaker AI, word boosting for uncommon vocabulary, and text normalization using FST models [00:01:58].
- Deployment: Focus is placed on the trade-off between speed and accuracy, and whether models should prioritize high variety or efficiency for specific embedded devices [00:02:21].
This philosophy emphasizes variety and coverage rather than a one-model-fits-all approach [00:06:41].
Model Architectures
Nvidia utilizes several model architectures, often unified by a core component:
- CTC Models: Used for high-speed inference, especially in streaming environments, due to their non-auto-regressive decoding [00:02:51].
- RNN-T/TDT Models: TDT (Token-and-Duration Transducer), an Nvidia variant of the RNN Transducer (RNN-T), combines the encoder’s audio output with an internal language model for auto-regressive streaming setups, balancing speed and accuracy [00:03:24].
- Attention Encoder-Decoder Setups: Offered for higher accuracy when streaming is not a primary concern. These models (similar in spirit to Whisper and other LLM-style architectures) excel at accommodating multiple tasks within a single model, such as speech translation, timestamp prediction, language identification, and speech recognition [00:03:46].
The unifying tool across these decoding platforms is the Fast Conformer architecture [00:04:31]. Empirical trials showed that the original Conformer model could be subsampled more aggressively, compressing the input to 80-millisecond time steps instead of the conventional 40-millisecond time steps [00:04:34]. This innovation shortens the encoded audio sequence, lightens the memory load during training, makes training more efficient with quicker convergence (requiring less data), and enables faster inference thanks to the 80-millisecond time steps [00:04:54].
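The effect of the more aggressive subsampling can be seen with back-of-the-envelope arithmetic. The sketch below assumes a 10 ms feature hop (a common ASR default, not stated in the talk) and compares encoder sequence lengths at 40 ms vs. 80 ms time steps:

```python
# Back-of-the-envelope effect of 80 ms vs. 40 ms time steps on the
# number of encoder frames. A 10 ms feature hop is assumed here.

FEATURE_HOP_MS = 10

def encoder_frames(audio_seconds, subsampling_factor):
    """Encoder time steps remaining after convolutional subsampling."""
    feature_frames = int(audio_seconds * 1000 / FEATURE_HOP_MS)
    return feature_frames // subsampling_factor

ten_seconds = 10.0
conformer = encoder_frames(ten_seconds, 4)       # 40 ms steps
fast_conformer = encoder_frames(ten_seconds, 8)  # 80 ms steps
print(conformer, fast_conformer)  # -> 250 125
```

Halving the sequence length matters more than it looks: self-attention cost grows quadratically with sequence length, so 80 ms steps roughly quarter the attention compute per utterance.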
Nvidia’s model offerings are split into two options:
- Riva Parakeet: Focuses on streaming speech recognition using CTC and TDT models for fast and efficient recognition, including speech translation and target speaker ASR [00:05:31].
- Riva Canary: Utilizes Fast Conformer models for high accuracy and multitask modeling, prioritizing accuracy over speed [00:06:00].
Additional models are offered to improve accuracy, customization, and readability [00:08:27]:
- Voice Activity Detection (VAD): Detects speech segments for better noise robustness using MarbleNet-based VAD models [00:08:34].
- External Language Models: N-gram-based models enhance ASR transcription accuracy and allow for customization [00:08:47].
- Text Normalization (TN) and Inverse Text Normalization (ITN): Convert spoken terms to written forms for readability using WFST-based ITN models [00:09:02].
- Punctuation and Capitalization (PNC): Adds punctuation and capitalization to transcriptions for readability using BERT-based PNC models [00:09:15].
- Speaker Diarization: Identifies multiple speakers in a conversation, with Sortformer-based speaker diarization models available in cascaded systems [00:09:29]. The Parakeet ASR model can be extended to multi-speaker and target-speaker scenarios by integrating the Sortformer diarization model, an end-to-end neural diarizer that orders speakers by arrival time [00:06:57]. Sortformer acts as a bridge between the speaker timestamps produced by diarization and the speaker tokens recognizable by the ASR model [00:07:21].
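The auxiliary models above form a cascaded pipeline around the core recognizer: VAD gates the audio, ASR transcribes it, PNC restores punctuation, and ITN rewrites spoken forms into written ones. A toy sketch of that flow, in which every stage is a stand-in function rather than an actual Riva model:

```python
# Sketch of a cascaded speech pipeline in the spirit described above:
# VAD -> ASR -> punctuation/capitalization -> inverse text normalization.
# Every stage below is a toy stand-in, not the real Riva component.

def vad(audio):
    # Pretend everything is speech; a real VAD returns speech segments.
    return [audio]

def asr(segment):
    # Stand-in for a Parakeet/Canary model call.
    return "the total is twenty five dollars"

def punctuate(text):
    # Stand-in for a BERT-based PNC model.
    return text[0].upper() + text[1:] + "."

def itn(text):
    # Single toy rule standing in for a WFST-based ITN grammar.
    return text.replace("twenty five dollars", "$25")

def transcribe(audio):
    pieces = [itn(punctuate(asr(seg))) for seg in vad(audio)]
    return " ".join(pieces)

print(transcribe(b"\x00\x01"))  # -> "The total is $25."
```

The cascade design means each stage can be fine-tuned or swapped independently, which is what makes the per-stage customization described later possible.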
Data Collection and Training Techniques for AI Models
Data collection and training at Nvidia focus on fundamentals to meet demand [00:11:06].
- Data Sourcing: Emphasis is placed on robustness, multilingual coverage, and dialect sensitivity [00:11:14]. Extensive language documentation is gathered to define appropriate data spans [00:11:23].
- Data Types: Both open-source and proprietary data are incorporated. Open-source data aids in achieving variety and domain shift, while proprietary data ensures high-quality entity data [00:11:30].
- Pseudo-labeling: Transcripts from top-of-the-line commercially available models are used to benefit from community and internal developments [00:11:44].
For training, standard available tools are primarily used [00:12:01]. The NeMo research toolkit, an open-source library, is used for model training [00:12:07]. NeMo provides tools for maximizing GPU utilization, data bucketing, and high-speed data loading via the Lhotse backend [00:12:20]. This approach allows for maximizing data throughput and ingestion speed across different settings [00:12:31]. Most data is stored on an object store infrastructure, enabling quick migration between cluster settings [00:12:41].
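The data bucketing mentioned above groups utterances of similar duration so that each batch wastes little compute on padding. NeMo/Lhotse do this dynamically; the fixed bucket edges in this sketch are purely illustrative:

```python
# Minimal sketch of duration bucketing: utterances are grouped by
# length so batches drawn from one bucket need little padding.

def bucket_by_duration(utterances, edges):
    """utterances: list of (id, seconds); edges: ascending upper bounds."""
    buckets = [[] for _ in edges]
    for utt_id, dur in utterances:
        for i, upper in enumerate(edges):
            if dur <= upper:
                buckets[i].append(utt_id)
                break
    return buckets

utts = [("a", 1.2), ("b", 7.5), ("c", 2.9), ("d", 14.0), ("e", 3.1)]
print(bucket_by_duration(utts, edges=[3.0, 8.0, 16.0]))
# -> [['a', 'c'], ['b', 'e'], ['d']]
```

Batching a 1-second clip with a 14-second clip would force the short one to be padded 14x over; bucketing avoids that, which is where the training-speed gains come from.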
Testing and Evaluation of AI Models
Validation mirrors the training philosophy, focusing on a diverse mixture of open-source and proprietary data [00:12:52]. Before models reach end-users, they undergo extensive bias and domain testing across all possible language categories [00:13:01]. This rigorous testing ensures models are as robust as possible [00:13:13].
Deployment and Customization
Trained models are deployed via Nvidia Riva through Nvidia NIM for low-latency and high-throughput inference [00:13:22]. High-performance inference is powered by Nvidia TensorRT optimizations and the Nvidia Triton inference server [00:13:31]. These services are exposed as gRPC-based microservices, supporting both low-latency streaming and high-throughput offline use cases [00:13:42].
Nvidia Riva is fully containerized and can scale to hundreds of parallel streams, deployable on-premises, in any cloud, at the edge, or on embedded platforms [00:13:50]. This supports a variety of applications including contact centers, consumer applications, and video conferencing [00:14:04]. Nvidia NIM offers pre-built containers with industry-standard API support for custom models and optimized inference engines [00:14:16].
Customization is a significant focus, as real-world scenarios often require domain-specific knowledge (e.g., medical terms, menu names, telephonic conditions) [00:14:29]. Nvidia Riva offers customization features at every stage [00:14:56]:
- Fine-tuning acoustic models from Parakeet or Canary base models [00:15:04].
- Fine-tuning n-gram external language models, punctuation models, and inverse text normalization models [00:15:10].
- Offering word boosting to improve recognition of product names, jargon, and context-specific knowledge [00:15:18].
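Word boosting can be pictured as giving hypotheses that contain a boosted phrase a score bonus. Riva applies the boost inside decoding itself; the n-best rescoring variant below is a simplified stand-in, with made-up hypotheses and scores:

```python
# Toy illustration of word boosting as n-best rescoring: hypotheses
# containing boosted phrases receive a bonus before the best is picked.
# (Riva boosts during decoding; this after-the-fact version is simpler.)

def boost_and_pick(hypotheses, boosted, bonus=2.0):
    """hypotheses: list of (text, log_score); boosted: phrases to favor."""
    def boosted_score(text, score):
        hits = sum(1 for phrase in boosted if phrase in text)
        return score + bonus * hits
    return max(hypotheses, key=lambda h: boosted_score(*h))[0]

nbest = [
    ("call doctor smith about the xray", -1.0),   # contains jargon term
    ("call doctor smith about the x ray", -0.8),  # acoustically likelier
]
print(boost_and_pick(nbest, boosted=["xray"]))
# -> "call doctor smith about the xray"
```

The bonus lets domain terms win even when the acoustic model slightly prefers a generic spelling, which is exactly the product-name and jargon scenario described above.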
This comprehensive toolkit, focusing on customization and variety, has led to Nvidia models dominating the Hugging Face Open ASR leaderboard, with the majority of top-five models originating from Nvidia [00:09:57].