From: aidotengineer
This workshop focuses on fine-tuning text-to-speech models such as Sesame’s CSM 1B. The process involves creating a voice dataset, which can be done by recording audio or by pulling audio from a YouTube video [00:00:51].
Overview of Data Preparation
The data preparation stage for fine-tuning a text-to-speech model includes several key steps:
- Data Generation: Selecting a YouTube video, transcribing it using Whisper, and converting it into a dataset [00:07:01].
- Transcription and Correction: Transcribing the video and manually correcting any misspellings in the transcript [00:10:03].
- Dataset Creation: Combining short segments from Whisper into longer chunks [00:11:04].
- Formatting for Training: Preparing the processed dataset for the training phase [00:24:14].
Acquiring Source Data
To begin, you select a YouTube video to use as the source for your voice dataset [00:00:56]. It is recommended to choose a video with only one speaker, as videos with multiple speakers would require additional data processing like diarization to split the audio, which is not supported in a basic notebook setup [00:08:09].
The process for acquiring audio involves:
- Using `yt-dlp` (YouTube-DLP) to download the audio from the selected YouTube video, as sketched below [00:08:49].
- Running this process within Google Colab is recommended to avoid authentication issues that might block downloads from YouTube when running outside Colab [00:08:51].
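A minimal download sketch, assuming the yt-dlp Python API and ffmpeg are available (as they are in Colab); the video URL and output filename are placeholders rather than values from the workshop:

```python
# Sketch: pull the audio track of a YouTube video and convert it to WAV.
import yt_dlp

VIDEO_URL = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL

ydl_opts = {
    "format": "bestaudio/best",          # take the best available audio-only stream
    "outtmpl": "source_audio.%(ext)s",   # output filename template
    "postprocessors": [{                 # convert to WAV with ffmpeg for Whisper
        "key": "FFmpegExtractAudio",
        "preferredcodec": "wav",
    }],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([VIDEO_URL])
```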
Transcribing Audio
Once the audio is acquired, it needs to be transcribed:
- Tool: OpenAI’s Whisper model is used for transcription [00:07:06].
- Model Size: The “turbo” Whisper model size is recommended for transcription, as it’s almost as good as “large” but significantly faster [00:08:33].
- Output: The transcript is saved as a JSON file locally [00:09:18].
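A minimal transcription sketch, assuming the open-source openai-whisper package; the workshop notebook may wrap this step differently:

```python
# Sketch: transcribe the downloaded audio with Whisper "turbo" and save the
# result (including timestamped segments) as a local JSON file.
import json
import whisper

model = whisper.load_model("turbo")            # near-"large" quality, much faster
result = model.transcribe("source_audio.wav")  # dict with "text" and "segments"

with open("transcript.json", "w") as f:
    json.dump(result, f, indent=2)
```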
Refining Transcriptions
After initial transcription, manual correction is advised:
- Review the JSON transcript file for any misspelled words or inaccuracies [00:09:58].
- Perform a find-and-replace operation for common issues, such as changing the common English spelling “trellis” (two L’s) to the intended name “Trelis” (one L) [00:10:12].
- Re-upload the corrected JSON file to Google Colab to ensure the refined transcript is used for fine-tuning [00:10:36].
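A small correction sketch, assuming the openai-whisper JSON layout (a top-level `segments` list); the replacement pairs are illustrative:

```python
# Sketch: apply find-and-replace fixes to every segment of the transcript.
import json

FIXES = {"Trellis": "Trelis"}  # common English spelling -> intended name

with open("transcript.json") as f:
    transcript = json.load(f)

for segment in transcript["segments"]:
    for wrong, right in FIXES.items():
        segment["text"] = segment["text"].replace(wrong, right)

with open("transcript.json", "w") as f:
    json.dump(transcript, f, indent=2)
```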
Segmenting and Structuring the Dataset
Whisper typically provides short segments of transcription [00:11:02]. For fine-tuning, these need to be combined into longer chunks:
- Goal: Create rows of data with audio snippets up to 30 seconds in length, each paired with its transcription [00:07:15], [00:11:07].
- Algorithm: A simple stacking algorithm combines segments until they exceed 30 seconds, at which point a new row of data is started [00:11:11] (see the sketch after this list).
- Dataset Size: Approximately 50 snippets of 30 seconds each are generally enough data to have an effect on the quality of the fine-tuning [00:07:42]. A 30-minute video can provide a sufficient amount of data [00:33:21].
- Improvements: Data compilation could be improved by ending each row of data on a full stop, possibly using a library like NLTK or regular expressions to detect sentence boundaries [00:13:17]. This ensures more complete “paragraphs” within each data row [00:13:33].
- Storage: The generated dataset can be pushed to Hugging Face or saved locally [00:11:26].
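A minimal sketch of the stacking approach described above, assuming the waveform is loaded with librosa and a 24 kHz sample rate (the rate is an assumption, not stated in the notes); any leftover segments after the last full chunk are simply dropped here:

```python
# Sketch: stack Whisper segments into rows of just over 30 seconds each.
import json
import librosa

MAX_SECONDS = 30.0
SAMPLE_RATE = 24000  # assumption; match whatever rate the target model expects

audio, _ = librosa.load("source_audio.wav", sr=SAMPLE_RATE)
with open("transcript.json") as f:
    segments = json.load(f)["segments"]

rows, start, text_parts = [], None, []
for seg in segments:
    if start is None:
        start = seg["start"]
    text_parts.append(seg["text"].strip())
    if seg["end"] - start > MAX_SECONDS:  # chunk has passed 30 s: emit a row
        clip = audio[int(start * SAMPLE_RATE):int(seg["end"] * SAMPLE_RATE)]
        rows.append({"audio": clip, "text": " ".join(text_parts)})
        start, text_parts = None, []

print(f"Built {len(rows)} rows of roughly {MAX_SECONDS:.0f} seconds each")
```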
Dataset Structure for Training
The prepared dataset must have specific columns for the trainer:
- Required Columns: `audio` and `text` columns [00:19:11].
- Optional Column: A `source` column to refer to the speaker number (e.g., `0` for the first speaker, `1` for a different speaker) [00:19:16].
- Default Speaker: If no speaker column is present, the code assigns a default `source` of `0` to all entries [00:19:28].
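A sketch of this column layout using the Hugging Face datasets library; the clip filenames, sampling rate, and repository name are placeholders for illustration:

```python
# Sketch: build a dataset with audio/text columns, add a default source column,
# and save it locally (or push it to the Hugging Face Hub).
from datasets import Audio, Dataset

ds = Dataset.from_dict({
    "audio": ["clip_000.wav", "clip_001.wav"],  # paths to the ~30 s clips
    "text":  ["First chunk of transcript...", "Second chunk of transcript..."],
})
ds = ds.cast_column("audio", Audio(sampling_rate=24000))  # 24 kHz is an assumption

# If no speaker column exists, assign a default source of 0 to every row.
if "source" not in ds.column_names:
    ds = ds.add_column("source", [0] * len(ds))

ds.save_to_disk("voice_dataset")
# ds.push_to_hub("your-username/voice-dataset")  # alternative: push to the Hub
```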
Preparing Data for Training
Before training, the data needs final formatting:
- Length Determination: The maximum audio length (in audio steps/tokens) and maximum text length (in tokens) are measured from the dataset [00:24:16]; a sketch of this measurement follows this list. For example, a maximum audio length of about 700,000 audio steps and a maximum text length of 587 tokens were observed in the workshop [00:24:20], [00:24:37].
- Input Preparation: Unsloth is used to prepare the input IDs, attention mask, labels, input values, and cutoffs, which are necessary inputs for the trainer [00:24:59].
- This processed dataset is then passed to the trainer for fine-tuning the token-based text-to-speech model [00:25:15], [00:25:29].
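A rough sketch of the length-measurement step, assuming the dataset saved in the earlier sketch and a placeholder tokenizer ID; the notebook itself relies on Unsloth's own preparation utilities for the actual trainer inputs:

```python
# Sketch: measure the longest audio clip (in samples) and the longest text (in tokens).
from datasets import load_from_disk
from transformers import AutoTokenizer

ds = load_from_disk("voice_dataset")                              # dataset from earlier
tokenizer = AutoTokenizer.from_pretrained("your-csm-checkpoint")  # placeholder model ID

max_audio_len = max(len(row["audio"]["array"]) for row in ds)            # raw samples
max_text_len = max(len(tokenizer(row["text"]).input_ids) for row in ds)  # text tokens

print(f"max audio length: {max_audio_len} samples")  # ~700,000 in the workshop
print(f"max text length:  {max_text_len} tokens")    # 587 in the workshop
```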