From: aidotengineer

This article outlines the process of preparing data for text-to-speech model fine-tuning, specifically focusing on creating a voice dataset from a YouTube video to enable a model to sound like a specific voice [00:00:32]. The process involves selecting source material, transcribing audio, manually correcting transcripts, and segmenting the data into a usable format.

Workshop Context

The process detailed here is part of a workshop focused on text-to-speech model fine-tuning of Sesame’s CSM 1B model [00:00:08]. All materials, including the Colab notebook and slides, are available on the Trellis Research GitHub under AIWorldsfare-2025 [00:00:14].

Data Generation Process

The first section of the notebook covers data generation [00:07:01].

Selecting Source Material

The initial step is to select a YouTube video as the audio source [00:00:56]. It is recommended to choose a video with a single speaker [00:08:09]. While videos with multiple speakers can be used, they require additional data processing like diarization to split the audio, which is not covered in the basic notebook [00:08:13].
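The snippet below is a rough sketch (not part of the workshop notebook) of how a multi-speaker video could be diarized with pyannote.audio before further processing; the pipeline name, file name, and token handling are assumptions.

```python
# Hedged sketch: diarize a downloaded audio file with pyannote.audio so that
# per-speaker segments could be extracted later. Not covered in the basic notebook.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # assumed pretrained pipeline
    use_auth_token="hf_...",             # placeholder Hugging Face token
)
diarization = pipeline("audio.wav")      # assumes the audio is already downloaded

# Print who speaks when; these spans could be used to cut the audio per speaker.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s to {turn.end:.1f}s")
```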

Transcription with Whisper

After selecting the video, the audio is transcribed using OpenAI’s Whisper model within a Google Colab notebook [00:07:06], leveraging a GPU (CUDA) for processing [00:09:12]. A minimal sketch of the download and transcription steps follows the list below.

  • Whisper Model Size: The “turbo” transcription model is recommended for its balance of quality and speed: it is almost as accurate as “large” but much faster, and noticeably better than “small,” “base,” or “tiny” [00:08:31].
  • Tools: yt-dlp is used for downloading YouTube videos [00:08:48]. It’s advised to run this step in Colab to avoid potential authentication issues when downloading from YouTube [00:08:51].
  • Output: The transcription is saved to a local JSON file [00:09:18].
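As a rough sketch of these steps, assuming yt-dlp (with ffmpeg) and the openai-whisper package are installed, and that the video URL and file names are placeholders:

```python
# Download the audio track with yt-dlp, transcribe it with Whisper "turbo" on
# GPU, and save the segment-level result to a local JSON file for review.
import json
import subprocess

import whisper

VIDEO_URL = "https://www.youtube.com/watch?v=..."  # placeholder URL

# Extract audio only; requires yt-dlp and ffmpeg.
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "wav", "-o", "audio.%(ext)s", VIDEO_URL],
    check=True,
)

# Load the "turbo" model on the GPU and transcribe the downloaded audio.
model = whisper.load_model("turbo", device="cuda")
result = model.transcribe("audio.wav")

# Persist the short Whisper segments (start, end, text) for manual correction.
with open("transcription.json", "w") as f:
    json.dump(result["segments"], f, indent=2, ensure_ascii=False)
```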

Manual Transcript Correction

Once the JSON transcription file is generated, it’s crucial to review it for any misspellings or inaccuracies [00:09:58]. Users can perform a quick find-and-replace for common errors (e.g., correcting “Trellis” spelling) [00:10:03]. After corrections, the updated JSON file should be re-uploaded to Google Colab to serve as the basis for the fine-tuning dataset [00:10:36].
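A minimal sketch of such a correction pass, assuming the JSON layout from the previous step (the corrections dictionary is only an example):

```python
# Apply simple find-and-replace fixes to every segment's text, then save a
# corrected copy to re-upload to Colab.
import json

corrections = {"trellis": "Trellis"}  # hypothetical fixes; edit after review

with open("transcription.json") as f:
    segments = json.load(f)

for seg in segments:
    for wrong, right in corrections.items():
        seg["text"] = seg["text"].replace(wrong, right)

with open("transcription_corrected.json", "w") as f:
    json.dump(segments, f, indent=2, ensure_ascii=False)
```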

Creating the Dataset

The transcribed audio, originally in short segments from Whisper, is then combined into longer chunks for the dataset [00:11:01].

  • Segmentation: A simple algorithm stacks segments until they reach a maximum length of about 30 seconds [00:11:09]. Each resulting chunk forms a row of data [00:11:16] (see the sketch after this list).
  • Dataset Structure: The final dataset will consist of rows containing audio snippets (max 30 seconds) and their corresponding text transcriptions [00:07:12].
  • Hugging Face: The prepared dataset can be pushed to Hugging Face [00:12:25].
  • Speaker ID: The dataset requires an audio column and a text column, and optionally a source column giving the speaker number (starting at zero) [00:19:11]. If no source column is provided, a default speaker ID of zero is assigned [00:19:28].
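The following sketch illustrates the chunking and upload steps under the assumptions above: corrected Whisper segments with start, end, and text fields, a single speaker (source 0), and placeholder file and repository names.

```python
# Stack short Whisper segments into ~30-second chunks, cut matching audio clips,
# and push an audio/text/source dataset to Hugging Face.
import json

from datasets import Audio, Dataset
from pydub import AudioSegment

MAX_SECONDS = 30

with open("transcription_corrected.json") as f:
    segments = json.load(f)

audio = AudioSegment.from_file("audio.wav")

# Greedily accumulate segments until the next one would exceed MAX_SECONDS.
chunks, cur = [], None
for seg in segments:
    if cur is None:
        cur = {"start": seg["start"], "end": seg["end"], "text": [seg["text"].strip()]}
    elif seg["end"] - cur["start"] > MAX_SECONDS:
        chunks.append(cur)
        cur = {"start": seg["start"], "end": seg["end"], "text": [seg["text"].strip()]}
    else:
        cur["end"] = seg["end"]
        cur["text"].append(seg["text"].strip())
if cur:
    chunks.append(cur)

# Export one audio clip per chunk and build the dataset rows.
rows = []
for i, c in enumerate(chunks):
    path = f"clip_{i}.wav"
    audio[int(c["start"] * 1000):int(c["end"] * 1000)].export(path, format="wav")
    rows.append({"audio": path, "text": " ".join(c["text"]), "source": 0})

ds = Dataset.from_list(rows).cast_column("audio", Audio())
ds.push_to_hub("your-username/voice-dataset")  # placeholder repo id
```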

Dataset Requirements and Best Practices

  • Snippet Length: Audio snippets up to 30 seconds are used [00:07:15].
  • Dataset Size: A dataset of approximately 50 rows of 30-second snippets is generally sufficient to have a noticeable effect on the quality of the fine-tuned model [00:07:42]. For better performance, especially without voice cloning, aiming for 500 rows of 30-second snippets is recommended [00:33:04].
  • Sentence Boundaries: While the current implementation simply stacks segments, an opportunity for improvement is to end each row of data on a full stop. This can be achieved using libraries like NLTK to detect sentence boundaries or by using regular expressions (a sketch follows this list) [00:13:17]. Generally, having complete “paragraphs” within each data row is preferred [00:13:33].
  • Audio Quality: Better filtering of the original data, such as removing segments with long pauses, can improve the pacing and overall quality of the generated voice [00:31:36].
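As a sketch of the sentence-boundary improvement, a regex check (or NLTK’s sent_tokenize) can gate when a chunk is closed so that each row ends on a full stop; the helper below is illustrative, not the workshop’s code:

```python
# Only close a chunk once it is near the length limit AND its accumulated text
# ends a sentence; plug this check into the stacking loop shown earlier.
import re

def ends_sentence(text: str) -> bool:
    # True if the text ends with ., !, or ? (optionally followed by a quote).
    return bool(re.search(r'[.!?]["\']?\s*$', text))

# Inside the stacking loop, the length check would become something like:
#   if seg["end"] - cur["start"] > MAX_SECONDS and ends_sentence(" ".join(cur["text"])):
#       chunks.append(cur)
#       cur = None
```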

Impact on Model Performance

The quality of the prepared dataset directly influences the fine-tuning outcome. Even with a relatively small dataset (e.g., 30 minutes of video resulting in 41 clips) [00:33:21], combining fine-tuning with voice cloning can yield good performance [00:33:15]. More data typically leads to better results, especially for zero-shot inference without voice cloning [00:33:07].