From: aidotengineer

This article summarizes a workshop on text-to-speech (TTS) model fine-tuning, focusing on the Sesame CSM 1B model. The aim is to train a TTS model to replicate a specific voice [00:00:30]. All workshop materials, including the Colab notebook and slides, are available on the Trellis Research GitHub under AI Worldsfare-2025 [00:00:14].

How Token-Based Text-to-Speech Models Work

Token-based text-to-speech models work much like large language models such as GPT-4o or Llama, which predict the next word or token in a text string [00:01:43]. For TTS, the goal is to take a series of text tokens and predict the next “audio token” [00:02:08]. The model can also take the history of text and audio from a conversation and autoregressively decode the next audio token [00:02:19].

The core challenge is how to represent continuous audio waveforms as discrete tokens [00:02:46]. This is achieved by representing a piece of audio as a choice from a “codebook,” which is a dictionary of vectors representing specific sounds [00:03:01]. These codebooks can be large to capture a wide variety of acoustics and semantics [00:03:18].

Codebooks are trained with an encoder-decoder structure: a sound wave is encoded into tokens, decoded back into a wave, and the difference between the input and the reconstructed output (the loss) is used to update the weights [00:03:36]. For a richer representation, a hierarchical approach, or multiple tokens per time step (window), is used, allowing more detailed, layered representations of sound [00:04:31].
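
The sketch below illustrates this encode-quantize-decode loop with a deliberately tiny PyTorch model. The layer shapes, codebook size, and single-level quantization are made-up simplifications; real neural audio codecs use convolutional encoders and residual, multi-level codebooks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAudioVQ(nn.Module):
    """Minimal, illustrative vector-quantization autoencoder for audio frames."""

    def __init__(self, frame_size=320, latent_dim=64, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Linear(frame_size, latent_dim)          # waveform frame -> latent
        self.codebook = nn.Embedding(codebook_size, latent_dim)   # "dictionary" of sound vectors
        self.decoder = nn.Linear(latent_dim, frame_size)          # token vector -> waveform frame

    def forward(self, frames):                    # frames: (batch, frame_size)
        z = self.encoder(frames)
        # The nearest codebook vector's index is the discrete audio token.
        dists = torch.cdist(z, self.codebook.weight)              # (batch, codebook_size)
        tokens = dists.argmin(dim=-1)
        q = self.codebook(tokens)
        # Straight-through estimator so gradients still reach the encoder.
        q_st = z + (q - z).detach()
        recon = self.decoder(q_st)
        # Reconstruction loss (input wave vs. decoded wave) plus codebook/commitment terms.
        loss = (F.mse_loss(recon, frames)
                + F.mse_loss(q, z.detach())
                + 0.25 * F.mse_loss(z, q.detach()))
        return recon, tokens, loss

model = TinyAudioVQ()
wave = torch.randn(8, 320)            # 8 fake 20 ms frames at 16 kHz
recon, tokens, loss = model(wave)
loss.backward()                        # the loss updates encoder, codebook, and decoder
```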

The Sesame model, specifically CSM 1B, uses 32 tokens at every window to represent sound [00:05:06]. A primary transformer predicts the zeroth token, while a secondary transformer decodes the remaining 31 tokens [00:05:16]. Sesame therefore comprises two models: a main 1-billion-parameter backbone and a smaller depth decoder for the hierarchical tokens [00:05:38].
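
Purely as an illustration of that split, the following sketch uses random logits in place of the real transformers just to show the shape of the token grid the two models produce together; all sizes here are hypothetical.

```python
import torch

frames, codebooks, vocab = 100, 32, 2048   # hypothetical: 32 codebook tokens per audio frame

# Backbone (~1B params): predicts the zeroth codebook token for each frame.
zeroth = torch.randn(frames, vocab).argmax(dim=-1)                # (frames,)

# Depth decoder (small): fills in the remaining 31 codebook tokens per frame.
rest = torch.randn(frames, codebooks - 1, vocab).argmax(dim=-1)   # (frames, 31)

# Full grid of audio tokens handed to the codec decoder to produce a waveform.
audio_tokens = torch.cat([zeroth[:, None], rest], dim=1)
print(audio_tokens.shape)   # torch.Size([100, 32])
```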

Workshop Process

The workshop demonstrates fine-tuning a pre-trained Sesame model using a custom dataset [00:05:54]. The process involves:

  1. Data Generation: Creating a voice dataset from a YouTube video [00:00:51].
  2. Fine-tuning: Using Unsloth, a library built on top of the Transformers ecosystem, to fine-tune the model [00:01:03].
  3. Performance Evaluation: Assessing the model’s performance both before and after fine-tuning [00:01:10].

Data Generation

The data generation section of the Colab notebook guides users to:

  • Select a YouTube video: It is recommended to choose a video with a single speaker, which avoids the need for speaker diarization [00:08:09].
  • Transcribe audio: OpenAI’s Whisper is used within the Colab environment to transcribe the audio; the “turbo” model is recommended for its balance of quality and speed [00:07:06].
  • Correct transcripts: Users can manually review and correct misspelled words in the generated JSON transcript file [00:09:58].
  • Create the dataset: The long transcript is split into shorter segments and combined into chunks of up to 30 seconds, forming rows of data with audio snippets and their corresponding text (see the sketch after this list) [00:11:04]. A dataset of approximately 50 rows of 30-second snippets is suggested for a noticeable effect [00:07:43].
  • Push to Hugging Face (optional): The created dataset can be pushed to Hugging Face Hub [00:11:26].
  • Dataset structure: The dataset requires an audio column and a text column, with an optional source column for speaker ID (defaulting to zero for single speakers) [00:19:11].
  • Future improvements: Splitting segments so they end on full stops, using a library like NLTK or regular expressions, could improve data quality [00:13:14].
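
A minimal sketch of this data-generation flow, assuming the openai-whisper and datasets libraries; the audio file name, the 24 kHz sampling rate, and the greedy chunking logic are illustrative assumptions rather than the notebook’s exact code.

```python
# pip install -q openai-whisper datasets
import whisper
from datasets import Dataset, Audio

# 1. Transcribe the single-speaker audio with Whisper's "turbo" model.
asr = whisper.load_model("turbo")
result = asr.transcribe("speaker_audio.wav")     # hypothetical file extracted from YouTube

# 2. Greedily pack Whisper's segments into chunks of up to 30 seconds.
#    (Ending chunks on sentence boundaries, e.g. with nltk.sent_tokenize,
#    would further clean the data.)
MAX_SECONDS = 30
chunks, current, start = [], [], None
for seg in result["segments"]:
    if start is None:
        start = seg["start"]
    if seg["end"] - start > MAX_SECONDS and current:
        chunks.append({"start": start, "end": current[-1]["end"],
                       "text": " ".join(s["text"].strip() for s in current)})
        current, start = [], seg["start"]
    current.append(seg)
if current:
    chunks.append({"start": start, "end": current[-1]["end"],
                   "text": " ".join(s["text"].strip() for s in current)})

# 3. Build rows of (audio snippet, text, speaker id). Cutting [start, end] out of
#    the waveform (e.g. with soundfile) is omitted; one snippet file per chunk.
rows = [{"audio": "speaker_audio.wav",            # placeholder: snippet file per chunk in practice
         "text": c["text"], "source": 0} for c in chunks]

dataset = Dataset.from_list(rows).cast_column("audio", Audio(sampling_rate=24_000))
# dataset.push_to_hub("your-username/voice-dataset")   # optional
```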

Fine-Tuning

The fine-tuning process involves:

  • Installing Unsloth: A library that includes transformers and packages for loading and fine-tuning models [00:14:00].
  • Loading the base model: The CSM 1B Sesame model is loaded (see the loading/LoRA sketch after this list). It is only a few gigabytes in size and fits easily on a T4 GPU with 15 GB of memory [00:15:02]. The architecture includes a backbone transformer, a depth decoder, and a codec model [00:15:45].
  • Applying LoRA adapters: Instead of training all parameters, small trainable adapter matrices are attached to specific layers, such as the attention projections (Q/V/O) and the MLP linear layers [00:22:18]. This saves memory and speeds up training [00:16:22].
    • A LoRA alpha of 16 is recommended for a 1-billion-parameter model, with a rank of 32 for the adapter matrices [00:22:29].
    • Only about 2% of the model parameters are trainable with this approach [00:23:24].
  • Data preparation for trainer: The maximum audio and text lengths from the dataset are determined and passed to the trainer [00:24:16]. Unsloth handles the preparation of input IDs, attention masks, labels, etc. [00:24:59].
  • Training parameters (see the training sketch after this list):
    • Virtual batch size of 8 (with 41 rows, leading to 5 steps per epoch) [00:25:41].
    • One epoch of training [00:26:00].
    • Warm-up steps so the learning rate ramps up gradually at the start [00:26:05].
    • A learning rate of 2e-4 is suitable for a 1-billion-parameter model [00:26:16].
    • AdamW 8-bit optimizer for memory reduction [00:26:27].
    • Weight decay helps prevent overfitting [00:26:32].
    • Training loss is expected to decrease from around 6.34 to 3.7 [00:27:03].
  • Monitoring training: Ideally, an evaluation dataset would be used to monitor eval loss and stop training if it ceases to fall [00:28:36]. TensorBoard can be used to monitor losses and grad_norm [00:28:57].
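
A minimal sketch of the loading and LoRA steps, assuming Unsloth’s FastModel API and its usual get_peft_model arguments; the unsloth/csm-1b checkpoint name and the target-module list are assumptions, so check the workshop notebook for the exact call.

```python
from unsloth import FastModel

# Load the CSM 1B base model; Unsloth also returns the matching processor/tokenizer.
model, processor = FastModel.from_pretrained(
    model_name="unsloth/csm-1b",   # assumed checkpoint name
    max_seq_length=2048,
    dtype=None,                    # auto-select bfloat16/float16
    load_in_4bit=False,
)

# Attach LoRA adapters with the rank/alpha values quoted above.
model = FastModel.get_peft_model(
    model,
    r=32,                          # rank of the adapter matrices
    lora_alpha=16,                 # recommended for a ~1B-parameter model
    target_modules=["q_proj", "v_proj", "o_proj",      # attention projections
                    "gate_proj", "up_proj", "down_proj"],  # MLP linear layers (assumed list)
    lora_dropout=0,
    bias="none",
)
model.print_trainable_parameters()  # roughly 2% of parameters are trainable
```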
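
And a minimal sketch of the training setup with the hyperparameters quoted above, here using the plain Transformers Trainer; the 1 × 8 batch split, the warmup_steps and weight_decay values, and the processed_dataset placeholder are assumptions (the notebook may use a different trainer and collator).

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # virtual batch size of 8
    num_train_epochs=1,
    warmup_steps=3,                  # slowly ramp up the learning rate
    learning_rate=2e-4,              # suits a ~1B-parameter model
    optim="adamw_8bit",              # 8-bit AdamW to reduce optimizer memory
    weight_decay=0.01,               # mild regularization against overfitting
    logging_steps=1,
    output_dir="outputs",
    report_to="tensorboard",         # monitor loss and grad_norm
)

trainer = Trainer(
    model=model,                     # the LoRA-wrapped model from the previous sketch
    train_dataset=processed_dataset, # dataset after Unsloth's preparation (input IDs, labels, ...)
    args=args,
)
trainer.train()                      # loss should fall from roughly 6.3 toward 3.7
```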

Performance Evaluation

Model performance is evaluated through inference, comparing results before and after fine-tuning (a minimal inference sketch follows this list):

  • Zero-shot inference (before fine-tuning):
    • The model generates audio with a random speaker voice (male or female, deep or high-pitched) due to a non-zero temperature [00:17:48].
    • Example: A woman’s voice for the text “We just finished fine-tuning a text to speech model” [00:20:46].
    • Example: A different voice for “Sesame is a super cool TTS model which can be fine-tuned with Unsloth” [00:21:02].
  • Voice cloning (before fine-tuning):
    • A sample of the target voice is passed to the model, which then generates speech that sounds more like the provided sample [00:18:00]. This gives a closer match to the target voice than zero-shot inference [00:18:35].
    • Example: A voice closer to the speaker’s original voice, but still not as good as after fine-tuning [00:21:18].
  • Fine-tuned model inference:
    • Zero-shot inference: After fine-tuning, the model generates a male voice closer to the speaker’s, though with some pacing issues and a slight Irish accent [00:31:10].
    • Voice cloning with fine-tuning: Combining fine-tuning with voice cloning yields the best performance, producing audio that closely matches the speaker’s voice, including their accent [00:32:14].
    • Example: “Sesame is a super cool TTS model which can be fine-tuned with Unsloth,” generated with a natural-sounding Irish accent [00:32:21].
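
A minimal inference sketch using the Transformers CSM integration (CsmForConditionalGeneration), covering both zero-shot generation and voice cloning via a text-plus-audio context; the checkpoint name, reference-clip path, and transcript strings are placeholders.

```python
import torch
import soundfile as sf
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "unsloth/csm-1b"            # or your fine-tuned checkpoint on the Hub
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id).to(device)

# Zero-shot: "[0]" is the speaker id; with no audio context the voice is random.
text = "[0]We just finished fine-tuning a text to speech model."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "zero_shot.wav")

# Voice cloning: prepend a reference utterance (its transcript plus its audio) as
# context, then ask the model to continue in the same voice.
ref_audio, _ = sf.read("reference_clip.wav")   # hypothetical clip; CSM expects 24 kHz audio
conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": "A transcript of the reference clip."},
        {"type": "audio", "path": ref_audio},
    ]},
    {"role": "0", "content": [
        {"type": "text", "text": "Sesame is a super cool TTS model which can be fine-tuned with Unsloth."},
    ]},
]
inputs = processor.apply_chat_template(
    conversation, tokenize=True, return_dict=True
).to(device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "cloned.wav")
```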

Conclusion

The workshop demonstrates that even a relatively small amount of data (e.g., 30 minutes of video for 41 rows) can yield good performance when combining fine-tuning with voice cloning [00:33:14]. For even better performance, especially without voice cloning, aiming for around 500 rows of 30-second data is suggested [00:33:04].

The fine-tuned model (LoRA adapters) can be saved locally or pushed to Hugging Face Hub, either as lightweight adapters or merged into a full 16-bit model [00:29:38]. To reload a previously fine-tuned model, its name would be specified instead of the base model name [00:30:31].
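
A minimal sketch of those save and reload options; the directory and Hub repository names are placeholders, and passing the CSM processor to Unsloth’s merged-save helpers is an assumption based on how they are used with tokenizers.

```python
# Option 1: save or push only the LoRA adapters (lightweight).
model.save_pretrained("csm-voice-lora")
processor.save_pretrained("csm-voice-lora")
# model.push_to_hub("your-username/csm-voice-lora")

# Option 2: merge the adapters into the base weights and save a full 16-bit model.
model.save_pretrained_merged("csm-voice-merged", processor, save_method="merged_16bit")
# model.push_to_hub_merged("your-username/csm-voice-merged", processor,
#                          save_method="merged_16bit")

# To reload a fine-tuned model later, pass its name instead of the base model's:
# model, processor = FastModel.from_pretrained(model_name="your-username/csm-voice-merged")
```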