From: aidotengineer

This workshop focuses on fine-tuning text-to-speech models such as Sesame’s CSM 1B. The process involves creating a voice dataset, which can be done by recording audio or by pulling audio from a YouTube video [00:00:51].

Overview of Data Preparation

The data preparation stage for fine-tuning a text-to-speech model includes several key steps:

  1. Data Generation: Selecting a YouTube video, transcribing it using Whisper, and converting it into a dataset [00:07:01].
  2. Transcription and Correction: Transcribing the video and manually correcting any misspellings in the transcript [00:10:03].
  3. Dataset Creation: Combining short segments from Whisper into longer chunks [00:11:04].
  4. Formatting for Training: Preparing the processed dataset for the training phase [00:24:14].

Acquiring Source Data

To begin, you select a YouTube video to use as the source for your voice dataset [00:00:56]. It is recommended to choose a video with only one speaker, since videos with multiple speakers require additional processing, such as diarization to split the audio by speaker, which the basic notebook setup does not support [00:08:09].

The process for acquiring audio involves:

  • Using yt-dlp to download the audio track of the selected YouTube video [00:08:49] (a command sketch follows this list).
  • Running this step within Google Colab is recommended to avoid the authentication issues that can block downloads from YouTube when running outside Colab [00:08:51].
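A minimal sketch of the download step, assuming the yt-dlp command-line tool and ffmpeg are available in the runtime; the URL and output name are placeholders, not values from the workshop:

```python
# Download only the audio track of the chosen video and convert it to WAV.
import subprocess

video_url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL

subprocess.run(
    [
        "yt-dlp",
        "-x",                     # extract audio only
        "--audio-format", "wav",  # convert the extracted audio to WAV (needs ffmpeg)
        "-o", "source_audio.%(ext)s",
        video_url,
    ],
    check=True,
)
```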

Transcribing Audio

Once the audio is acquired, it needs to be transcribed:

  • Tool: OpenAI’s Whisper model is used for transcription [00:07:06].
  • Model Size: The “turbo” Whisper model size is recommended for transcription, as it’s almost as good as “large” but significantly faster [00:08:33].
  • Output: The transcript, including per-segment timestamps, is saved as a JSON file locally [00:09:18] (a minimal sketch follows this list).
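A minimal transcription sketch, assuming a recent openai-whisper release (which exposes the “turbo” model) and the placeholder file names from the download step:

```python
# Transcribe the downloaded audio and keep the per-segment timestamps,
# which are needed later when stacking segments into ~30-second chunks.
import json
import whisper

model = whisper.load_model("turbo")           # near-"large" quality, much faster
result = model.transcribe("source_audio.wav")

with open("transcript.json", "w") as f:
    json.dump(result["segments"], f, indent=2)
```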

Refining Transcriptions

After initial transcription, manual correction is advised:

  • Review the JSON transcript file for any misspelled words or inaccuracies [00:09:58].
  • Perform a find-and-replace operation for common issues, for example correcting the brand spelling “Trelis” (one L) from the common English spelling “trellis” (two L’s) [00:10:12] (a scripted version of this step is sketched after the list).
  • Re-upload the corrected JSON file to Google Colab to ensure the refined transcript is used for fine-tuning [00:10:36].
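As a scripted alternative to editing the file by hand, the find-and-replace pass can be run directly over the saved segments; the replacement pairs below are illustrative, not a fixed list from the workshop:

```python
# Apply simple spelling corrections to every transcript segment.
import json

with open("transcript.json") as f:
    segments = json.load(f)

replacements = {"trellis": "Trelis", "Trellis": "Trelis"}  # example corrections

for seg in segments:
    for wrong, right in replacements.items():
        seg["text"] = seg["text"].replace(wrong, right)

with open("transcript.json", "w") as f:
    json.dump(segments, f, indent=2)
```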

Segmenting and Structuring the Dataset

Whisper typically provides short segments of transcription [00:11:02]. For fine-tuning, these need to be combined into longer chunks:

  • Goal: Create rows of data with audio snippets up to 30 seconds in length, each paired with its transcription [00:07:15], [00:11:07].
  • Algorithm: A simple stacking algorithm combines consecutive segments until the accumulated audio exceeds 30 seconds, at which point the chunk is closed and a new row of data is started [00:11:11] (see the sketch after this list).
  • Dataset Size: Roughly 50 snippets of about 30 seconds each are generally enough to have a noticeable effect on fine-tuning quality [00:07:42]; a 30-minute video therefore provides a sufficient amount of data [00:33:21].
  • Improvements: Data compilation could be improved by ending each row of data on a full stop, possibly using a library like NLTK or regular expressions to detect sentence boundaries [00:13:17]. This ensures more complete “paragraphs” within each data row [00:13:33].
  • Storage: The generated dataset can be pushed to Hugging Face or saved locally [00:11:26].
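A sketch of the stacking step under the following assumptions: segments holds the corrected Whisper segments (each with start, end, and text), the source audio is loaded as mono at 24 kHz (a common rate for token-based TTS models; the workshop’s exact rate is not restated here), and all names are illustrative:

```python
# Stack consecutive Whisper segments into rows of up to ~30 seconds of audio,
# each paired with its combined transcription.
import librosa

MAX_SECONDS = 30
audio, sr = librosa.load("source_audio.wav", sr=24_000, mono=True)

rows, cur_text, cur_start, cur_end = [], [], None, None

for seg in segments:
    if cur_start is None:
        cur_start = seg["start"]
    cur_text.append(seg["text"].strip())
    cur_end = seg["end"]
    # Once the accumulated snippet exceeds 30 seconds, close the row.
    if cur_end - cur_start > MAX_SECONDS:
        rows.append({
            "audio": {"array": audio[int(cur_start * sr):int(cur_end * sr)],
                      "sampling_rate": sr},
            "text": " ".join(cur_text),
        })
        cur_text, cur_start = [], None

# Flush any remaining partial chunk.
if cur_text:
    rows.append({
        "audio": {"array": audio[int(cur_start * sr):int(cur_end * sr)],
                  "sampling_rate": sr},
        "text": " ".join(cur_text),
    })
```

Swapping the duration check for a sentence-boundary check (for example via NLTK’s sentence tokenizer or a regular expression) would implement the full-stop improvement mentioned above.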

Dataset Structure for Training

The prepared dataset must have specific columns for the trainer:

  • Required Columns: audio and text columns [00:19:11].
  • Optional Column: A source column indicating the speaker number (e.g., 0 for the first speaker, 1 for a different speaker) [00:19:16].
  • Default Speaker: If no source column is present, the code assigns a default source of 0 to all entries [00:19:28] (see the sketch after this list).
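A sketch of assembling those columns with the Hugging Face datasets library, continuing from the rows built in the previous sketch; the repository name is a placeholder:

```python
# Build a dataset with the required "audio" and "text" columns and add a
# default "source" (speaker id) column when none exists.
from datasets import Audio, Dataset

dataset = Dataset.from_list(rows)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24_000))

if "source" not in dataset.column_names:
    dataset = dataset.add_column("source", [0] * len(dataset))

dataset.push_to_hub("your-username/voice-dataset")  # or dataset.save_to_disk(...)
```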

Preparing Data for Training

Before training, the data needs final formatting:

  • Length Determination: The maximum audio length (in audio steps/tokens) and the maximum text length (in tokens) are measured from the dataset [00:24:16]. For example, the workshop observed a maximum audio length of about 700,000 audio steps and a maximum text length of 587 tokens [00:24:20], [00:24:37] (a measurement sketch follows this list).
  • Input Preparation: Unsloth is used to prepare the input IDs, attention mask, labels, input values, and cutoffs, which are necessary inputs for the trainer [00:24:59].
  • This processed dataset is then passed to the trainer for fine-tuning the token-based text-to-speech model [00:25:15], [00:25:29].
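A sketch of the length-measurement step only; tokenizer is assumed to be the model’s text tokenizer loaded elsewhere, and the Unsloth-specific preparation of input IDs, attention mask, labels, input values, and cutoffs is left to the workshop notebook:

```python
# Measure the longest audio clip (in samples/steps) and the longest
# transcription (in tokens) so the trainer's maximum lengths can be set.
max_audio_len = max(len(row["audio"]["array"]) for row in dataset)
max_text_len = max(len(tokenizer(row["text"]).input_ids) for row in dataset)

print(f"max audio length (steps):  {max_audio_len}")
print(f"max text length (tokens):  {max_text_len}")
```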