From: aidotengineer

Nvidia focuses on the development and deployment of enterprise-level speech AI models [00:00:36]. Their aim is to enable customers to deliver the best possible conversational AI solutions [00:00:51]. Key aspects of their work include speech translation, text-to-speech development, and speech recognition [00:00:39]. A primary focus is on creating low-latency, highly efficient models suitable for embedded devices [00:00:58].

Model Development Philosophy

Nvidia’s model development revolves around four core categories [00:01:08]:

  • Robustness: Models are designed to perform effectively in both noisy and clean environments [00:01:13], accounting for sound quality, telephone audio, and environmental contamination [00:01:21].
  • Coverage: Development considers various domains like medical, entertainment, and call centers [00:01:31]. They address language demands (monolingual or multilingual), dialect variations, and code-switching [00:01:42].
  • Personalization: Models are tailored to specific customer needs, which can involve target speaker AI, word boosting for uncommon vocabulary, or text normalization using FST models [00:01:58].
  • Deployment: This category focuses on the trade-off between speed and accuracy [00:02:21], and whether models should prioritize high variety or efficiency [00:02:31].

Model Architectures

Nvidia utilizes several model types to achieve its goals [00:02:41]:

  • CTC Models: Used for high-speed inference, especially in streaming environments, thanks to their non-autoregressive decoding [00:02:51].
  • RNN-T (Recurrent Neural Network Transducer) / TDT (Token-and-Duration Transducer) Models: Employed when higher accuracy is needed than non-autoregressive methods can provide [00:03:15]. These combine the audio encoder output with an internal language model for autoregressive streaming setups [00:03:30].
  • Attention Encoder-Decoder Setups: Offered for maximum accuracy when streaming is not a primary concern [00:03:40]. These models, akin to Whisper and LLMs, are effective for multiple tasks within a single model, including speech translation, timestamp prediction, language identification, and speech recognition [00:04:06].

The fundamental architecture unifying these decoding platforms is the Fast Conformer [00:04:31]. Through empirical trials, Nvidia found that the Conformer's input can be subsampled more aggressively, to 80-millisecond time steps instead of the conventional 40-millisecond step [00:04:37]. This leads to smaller audio inputs, reduced memory load during training, more efficient training with quicker convergence, and faster inference thanks to the 80-millisecond time steps [00:04:54].
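To make the impact concrete, here is a small back-of-the-envelope sketch (illustrative only, not Nvidia's code) of how 80-millisecond steps halve the encoder sequence length relative to 40-millisecond steps, assuming a typical 10-millisecond feature hop:

```python
# Illustrative arithmetic: how 8x subsampling (80 ms steps) shrinks the
# encoder sequence versus the conventional 4x (40 ms) Conformer step.
# A 10 ms mel-spectrogram hop is assumed, a common front-end default.

FEATURE_HOP_MS = 10

def encoder_frames(audio_seconds: float, step_ms: int) -> int:
    """Frames the encoder attends over after subsampling to step_ms."""
    feature_frames = int(audio_seconds * 1000 / FEATURE_HOP_MS)
    return feature_frames // (step_ms // FEATURE_HOP_MS)

conformer = encoder_frames(30.0, step_ms=40)       # 750 frames
fast_conformer = encoder_frames(30.0, step_ms=80)  # 375 frames

# Self-attention cost grows quadratically with sequence length, so
# halving the frame count cuts attention compute by roughly 4x.
print(conformer, fast_conformer)  # 750 375
```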

Riva Model Offerings

Nvidia’s model offerings are divided into two main categories:

  • Riva Parakeet: Focuses on streaming speech recognition cases, utilizing CTC and TDT models [00:05:31]. It is designed for fast and efficient recognition, handling speech recognition, speech translation, and target speaker ASR [00:05:42] (a minimal loading sketch follows this list).
    • The Parakeet ASR model can be extended to multi-speaker and target-speaker scenarios by integrating the Sortformer diarization model [00:07:00]. Sortformer is an end-to-end neural diarizer that resolves speakers by who speaks first [00:07:10]. It acts as a bridge between the speaker timestamp information from diarization and the speaker tokens recognizable by the ASR model [00:07:21]. By fusing the ASR encoder embeddings and Sortformer embeddings through a speaker kernel, it addresses the "who spoke what and when" problem [00:07:31]. This unified architecture can be applied in a parallel, joint manner or as a cascaded system [00:08:10].
  • Riva Canary: Utilizes Fast Conformer models, emphasizing accuracy and multitask modeling [00:06:00]. It aims for the best possible accuracy, with speed being a secondary consideration [00:06:13].
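As a concrete illustration, a Parakeet checkpoint can be loaded with the open-source NeMo toolkit; the checkpoint name below is one publicly available variant and is an assumption for illustration, not necessarily the exact model described above:

```python
# A minimal sketch of running a Parakeet model via NeMo.
import nemo.collections.asr as nemo_asr

# Assumed public checkpoint; substitute the CTC or TDT variant that
# fits your latency/accuracy budget.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-1.1b"
)

# transcribe() takes a list of audio file paths and returns hypotheses.
transcripts = asr_model.transcribe(["meeting_recording.wav"])
print(transcripts[0])
```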

Nvidia’s approach prioritizes providing the right model for the specific need, focusing on variety and coverage rather than a “one model fits all” philosophy [00:06:39].

Additional Models and Tools for Accuracy and Customization

To further enhance accuracy, customization, and readability, Nvidia offers additional models [00:08:27]:

  • Voice Activity Detection (VAD): Detects speech segments to improve noise robustness [00:08:34], including MarbleNet-based VAD models [00:08:43].
  • External Language Models: Improve the accuracy and customizability of ASR transcriptions [00:08:47], including n-gram based language models in Riva pipelines [00:08:52].
  • Text Normalization and Inverse Text Normalization (ITN): Converts spoken forms to written forms for better readability [00:09:00], including WFST-based ITN models [00:09:07] (a sketch follows this list).
  • Punctuation and Capitalization (PNC): Adds punctuation and capitalization to transcriptions for improved readability [00:09:15], supporting BERT-based PNC models [00:09:21].
  • Speaker Diarization: Identifies multiple speakers in a conversation [00:09:28]; speaker diarization models are available in cascaded pipelines today, with end-to-end models upcoming [00:09:31].
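For the ITN step specifically, here is a minimal sketch using the open-source nemo_text_processing package, which implements WFST-based ITN (the exact wiring inside Riva pipelines may differ):

```python
# A minimal sketch of WFST-based inverse text normalization.
from nemo_text_processing.inverse_text_normalization.inverse_normalize import (
    InverseNormalizer,
)

itn = InverseNormalizer(lang="en")

# Spoken-form ASR output becomes written form for readability.
spoken = "we raised three hundred dollars on july fourth"
print(itn.inverse_normalize(spoken, verbose=False))
# e.g. "we raised $300 on july 4"
```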

This level of customization has contributed to Nvidia models ranking highly on leaderboards such as the Hugging Face Open ASR Leaderboard [00:09:52].

Training Process

Nvidia’s training approach emphasizes foundational principles [00:11:06]:

  • Data Sourcing: Focuses on robustness, multilingual coverage, and dialect sensitivity [00:11:14]. They acquire extensive language documentation to guide data selection [00:11:23]. Both open-source data (for variety and domain shift) and proprietary data (for high-quality entity data) are incorporated [00:11:30].
  • Pseudo-labeling: Involves using top-tier commercial models to generate transcripts, leveraging community and internal developments [00:11:44].
  • Training Tools: The open-source NeMo research toolkit is used for model training [00:12:07]. It includes tools for maximizing GPU utilization, data bucketing, and high-speed data loading via the Lhotse backend [00:12:20] (a bucketing sketch follows this list). Data is often stored on object store infrastructure for quick migration between cluster settings [00:12:41].
  • Validation: Involves a mix of open-source and proprietary data to ensure extensive bias and domain testing across language categories before models reach end-users [00:12:52].
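As a sketch of the data bucketing mentioned above, Lhotse's dynamic bucketing sampler groups utterances of similar duration so each batch packs the GPU evenly; the manifest path and parameter values are illustrative assumptions:

```python
# A minimal sketch of dynamic bucketing with the Lhotse backend.
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("train_cuts.jsonl.gz")  # assumed manifest path

# Cap total audio per batch instead of using a fixed batch size, so
# short utterances are packed densely and padding waste stays low.
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=600.0,  # seconds of audio per batch
    num_buckets=30,
    shuffle=True,
)

for batch in sampler:
    ...  # hand each CutSet batch to the dataloader / training step
```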

Deployment

Trained models are deployed through Nvidia Riva via Nvidia NIM for low-latency and high-throughput inference [00:13:22].

  • High Performance: Inference is powered by Nvidia TensorRT optimizations and the Nvidia Triton Inference Server [00:13:34].
  • Accessibility: Available as a gRPC-based microservice for low-latency streaming and high-throughput offline use cases [00:13:42] (see the client sketch after this list).
  • Scalability: Nvidia Riva is fully containerized and can easily scale to hundreds of parallel streams [00:13:50].
  • Flexibility: It can run on-prem, in any cloud, at the edge, or on embedded platforms to support diverse applications like contact centers, consumer applications, and video conferencing [00:13:58].
  • Nvidia NIM: Offers pre-built containers and industry-standard API support for custom models and optimized inference engines [00:14:17].
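A minimal client-side sketch of the gRPC microservice, using the nvidia-riva-client Python package; the endpoint and audio file are assumptions for illustration:

```python
# A minimal sketch of offline recognition against a running Riva server.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # assumed Riva endpoint
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

with open("sample.wav", "rb") as f:
    audio_bytes = f.read()

response = asr.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```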

Customization in Deployment

Customization is a critical aspect, as every application often requires specific domain knowledge [00:14:26] (e.g., medical terms, menu names, telephonic conditions) [00:14:39]. Nvidia Riva offers customization features at every stage [00:14:56]:

  • Fine-tuning: Acoustic models (Parakeet and Canary-based), external language models (n-gram), punctuation models, and inverse text normalization models can be fine-tuned [00:15:04].
  • Word Boosting: Offered to improve recognition of product names, jargon, and context-specific knowledge [00:15:18], as sketched below.
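As a sketch of word boosting through the same nvidia-riva-client package, with illustrative phrases and boost score:

```python
# A minimal sketch of adding word boosting to a Riva recognition config.
import riva.client

config = riva.client.RecognitionConfig(language_code="en-US")

# Bias recognition toward domain jargon and product names; the score
# controls how strongly these phrases are favored during decoding.
riva.client.add_word_boosting_to_config(
    config,
    boosted_lm_words=["Parakeet", "Riva", "NeMo"],
    boosted_lm_score=20.0,
)
# The config is then passed to ASRService.offline_recognize or a
# streaming call, as in the previous sketch.
```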

Resources

Nvidia Riva models are available through Nvidia NIM [00:15:32]. Users can find available models and quick-start guides at build.nvidia.com/explore/speech [00:15:41], along with developer forums and fine-tuning guides for models in the NeMo framework [00:16:01].