From: aidotengineer
This article outlines a workshop on text-to-speech model fine-tuning, specifically focusing on token-based text-to-speech models using Sesame’s CSM 1B model [00:00:08]. The materials, including a Colab notebook and slides, are available on the Trelis Research GitHub repository under AI Worlds Fair 2025 [00:00:19].
Workshop Objectives
By the end of this workshop, you should be able to:
- Train a text-to-speech model to sound like a specific voice [00:00:30].
- Understand how token-based text-to-speech models function [00:00:38]. The workshop focuses on Sesame CSM1B, but models like Orpheus from Canopy Labs are also token-based [00:00:42].
- Create a voice data set [00:00:51]. This can involve recording your own voice or extracting audio from a YouTube video [00:00:53].
- Perform fine-tuning using Unsloth, a library based on transformers [00:01:03].
- Evaluate performance both before and after fine-tuning [00:01:10].
How Token-Based Text-to-Speech Models Work
Analogy to Language Models
Large language models like OpenAI’s GPT-4o or Meta’s Llama series take a string of text and predict the next word or token recursively [00:01:43]. For text-to-speech, the goal is to take a series of text tokens and predict the next audio token [00:02:08]. This involves passing a history of both text and audio to recursively decode the next audio token, producing a string of audio tokens that represent the desired speech [00:02:19].
Audio Tokens and Codebooks
The core challenge is representing continuous soundwaves as discrete audio tokens [00:02:46]. This is achieved by representing a piece of audio as a choice from a “codebook” [00:03:01]. A codebook functions like a dictionary, containing vectors where each vector represents a specific sound [00:03:07]. These codebooks can be large to capture a wide variety of acoustics and semantics, allowing sound to be represented discretely, similar to how text represents meaning [00:03:18].
Codebooks are trained using an encoder-decoder architecture [00:03:36]. A soundwave is encoded into tokens, then decoded back into a wave [00:03:39]. During training, the system compares the decoded output with the original input, using the difference (loss) to update the model’s weights [00:03:55].
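To make the codebook idea concrete, here is a toy sketch (not Sesame’s actual codec; the codebook size and vector dimension are made up) of how a continuous frame embedding becomes a discrete audio token and back:

```python
import torch

# Toy codebook: 1024 entries, each a 256-dim vector standing in for a short sound.
codebook = torch.randn(1024, 256)

def quantize(frame_embedding: torch.Tensor) -> int:
    """Encoder side: return the index of the nearest codebook vector (the 'audio token')."""
    distances = torch.cdist(frame_embedding[None, :], codebook)[0]  # (1024,)
    return int(distances.argmin())

def dequantize(token_id: int) -> torch.Tensor:
    """Decoder side: look the vector back up from the discrete token id."""
    return codebook[token_id]

frame = torch.randn(256)      # an encoded audio frame
token = quantize(frame)       # discrete audio token (an integer index)
approx = dequantize(token)    # approximation the decoder uses to rebuild the wave
```

During codec training, the reconstruction loss described above is what pushes the codebook vectors, encoder, and decoder toward representing real sounds well.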
Hierarchical Representation
For effective sound representation, it’s often insufficient to have just one token per time step [00:04:20]. A hierarchical representation with multiple tokens allowed at the same timestamp or window provides a more detailed sound representation [00:04:31]. This approach uses multiple layers of tokens, capturing both higher-level meaning/acoustics and more granular details [00:04:48].
The Sesame Model Architecture
The Sesame model employs this hierarchical approach, using 32 tokens at each audio window to represent sound [00:05:06].
- A “zeroth token” is predicted by the main transformer [00:05:16].
- A secondary transformer decodes the remaining 31 “stack tokens” [00:05:20].
- Sesame therefore consists of two models: the main 1 billion parameter model and a much smaller model for decoding hierarchical tokens [00:05:38].
- The model also includes a codec to convert waveforms to tokens and vice-versa [00:16:02].
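The following toy sketch is illustrative only (the real CSM implementation ships with `transformers`, and its depth decoder actually decodes the 31 stack tokens one at a time, each conditioned on the previous ones); it just shows the two-stage idea of a backbone picking the zeroth token and a smaller decoder filling in the rest of the frame:

```python
import torch
import torch.nn as nn

NUM_CODEBOOKS = 32  # 1 "zeroth" token + 31 stacked tokens per audio frame

class ToyTwoStageStep(nn.Module):
    """One decoding step: the backbone predicts codebook 0, a smaller decoder the rest."""

    def __init__(self, hidden=1024, vocab=2048):
        super().__init__()
        self.backbone = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.depth_decoder = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.head0 = nn.Linear(hidden, vocab)  # head for the zeroth codebook
        self.stack_heads = nn.ModuleList(
            [nn.Linear(hidden, vocab) for _ in range(NUM_CODEBOOKS - 1)]
        )

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq, hidden) embeddings of the text + audio token history
        h = self.backbone(history)[:, -1]              # last position drives the next frame
        token0 = self.head0(h).argmax(-1)              # zeroth token from the main transformer
        d = self.depth_decoder(h[:, None])[:, 0]       # smaller decoder refines the frame
        stack = [head(d).argmax(-1) for head in self.stack_heads]
        return torch.stack([token0, *stack], dim=-1)   # (batch, 32) tokens for this frame

frame_tokens = ToyTwoStageStep()(torch.randn(1, 10, 1024))  # -> shape (1, 32)
```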
Fine-tuning Process Overview
The fine-tuning process begins with a pre-trained model capable of taking text and audio inputs to output an audio stream [00:05:51]. The goal is to adapt this model to a specific voice by training it on a custom data set [00:06:05].
Colab Notebook Setup and Data Generation
The entire process is covered in a single Colab notebook, available via the Trelis Research GitHub repo for AI Worlds Fair 2025 [00:06:18].
- GPU Connection: Ensure your Colab runtime is connected to a GPU (e.g., T4), which should be available for free [00:06:36].
- The notebook has two main sections: Data Generation and Fine-tuning [00:07:01].
Data Generation Steps
- Select YouTube Video: Choose a YouTube video, preferably with a single speaker to avoid complex diarization (speaker separation), which is not supported in this basic notebook [00:08:04].
- Transcribe with Whisper: Use OpenAI’s Whisper model (e.g., “turbo” for speed and quality) within the Colab notebook to transcribe the audio [00:08:28]. The yt-dlp and Whisper libraries need to be installed [00:08:47]. The transcription is saved as a JSON file locally [00:09:17].
- Manual Correction: Review the JSON transcript for misspelled words and perform find-and-replace operations (e.g., replacing “Trellis” with two ‘l’s by “Trelis” with one ‘l’ for the company name) [00:09:58]. Re-upload the corrected file to Google Colab [00:10:36].
- Create Data Set Snippets: Combine short Whisper segments into longer chunks, up to 30 seconds in length [00:10:59]. A simple algorithm stacks segments until they exceed 30 seconds, forming a new data row [00:11:11].
- Roughly 50 clips of 30 seconds each are usually sufficient to have an effect on quality [00:07:42].
- Improvement Note: Ideally, segments should end on a full stop (sentence boundary detection, e.g., using NLTK or regular expressions) to create more natural-sounding paragraphs within each data row [00:13:14].
- Push to Hugging Face: The generated data set can be pushed to Hugging Face Hub [00:12:26].
- When loading the data, it must have `audio` and `text` columns, and optionally a `source` column (for speaker ID, starting at zero) [00:19:11]. If no speaker column is present, the code defaults to speaker zero [00:19:28]. (The sketch below walks through this data-generation pipeline end to end.)
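Putting the steps above together, a minimal sketch of the data-generation pipeline might look like the following. The video URL, file names, chunking details, and Hugging Face repo id are all placeholders; the workshop notebook’s exact code may differ.

```python
import json
import yt_dlp
import whisper
import librosa
import soundfile as sf
from datasets import Dataset, Audio

YOUTUBE_URL = "https://www.youtube.com/watch?v=..."  # a single-speaker video

# 1. Download the audio track with yt-dlp and convert it to mp3.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "audio.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([YOUTUBE_URL])

# 2. Transcribe with Whisper ("turbo" is a good speed/quality trade-off).
asr = whisper.load_model("turbo")
result = asr.transcribe("audio.mp3")
with open("transcript.json", "w") as f:
    json.dump(result, f)  # review this file by hand, fix names, re-upload

# 3. Stack Whisper segments into chunks of roughly 30 seconds.
MAX_SECONDS = 30
rows, chunk_start, chunk_text = [], None, []
for seg in result["segments"]:
    if chunk_start is None:
        chunk_start = seg["start"]
    chunk_text.append(seg["text"].strip())
    if seg["end"] - chunk_start >= MAX_SECONDS:
        rows.append({"start": chunk_start, "end": seg["end"], "text": " ".join(chunk_text)})
        chunk_start, chunk_text = None, []
if chunk_text:
    rows.append({"start": chunk_start, "end": result["segments"][-1]["end"], "text": " ".join(chunk_text)})

# 4. Slice the waveform per chunk and build a dataset with audio/text/source columns.
wav, sr = librosa.load("audio.mp3", sr=24000)  # CSM's codec operates at 24 kHz
records = []
for i, r in enumerate(rows):
    path = f"clip_{i:03d}.wav"
    sf.write(path, wav[int(r["start"] * sr): int(r["end"] * sr)], sr)
    records.append({"audio": path, "text": r["text"], "source": 0})  # single speaker -> id 0

ds = Dataset.from_list(records).cast_column("audio", Audio(sampling_rate=sr))
ds.push_to_hub("your-username/voice-dataset")  # placeholder repo id
```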
Model Loading and Initial Evaluation
Installing Unsloth and Loading the Base Model
First, install Unsloth, which includes `transformers` and other necessary packages for loading and fine-tuning [00:13:50].
Then, load the base CSM1B (Sesame) model using Unsloth [00:14:10].
- The model should fit easily within a 15GB T4 GPU as it’s only a few gigabytes in size [00:15:02].
- Sesame models are now supported by `transformers` [00:14:55].
- The model architecture includes a backbone (main transformer), a depth decoder (for the 31 additional tokens), and a codec model [00:15:45].
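Since the model is now supported in `transformers`, the pieces above can also be inspected directly with `CsmForConditionalGeneration` (the workshop notebook itself loads the model through Unsloth’s wrapper; this plain-`transformers` sketch just shows the moving parts):

```python
# pip install unsloth  # installs transformers and the other training dependencies
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "sesame/csm-1b"  # the base Sesame CSM 1B checkpoint on Hugging Face
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Printing the module tree shows the backbone transformer, the depth decoder
# for the 31 stacked tokens, and the codec model described above.
print(model)
```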
Initial Inference (Before Fine-Tuning)
To evaluate the base model, perform zero-shot inference by passing text and observing the generated audio [00:16:59]. With a non-zero temperature, the model will produce a random speaker’s voice (male, female, deep, high-pitched) [00:17:46].
- Example Audio (Zero-Shot): “We just finished fine-tuning a text to speech model.” [00:20:46] (Result: a woman’s voice)
- Example Audio (Zero-Shot): “Sesame is a super cool TTS model which can be fine-tuned with Unsloth.” [00:21:02] (Result: wide variance in speaker type)
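Continuing from the loading sketch above, zero-shot generation follows the usage pattern documented for `sesame/csm-1b` in `transformers` (the sampling settings here are illustrative):

```python
# "[0]" prefixes the prompt with speaker id 0.
text = "[0]We just finished fine-tuning a text to speech model."
inputs = processor(text, add_special_tokens=True).to(device)

# With sampling enabled (non-zero temperature) the base model picks an arbitrary voice.
audio = model.generate(**inputs, output_audio=True, do_sample=True)
processor.save_audio(audio, "zero_shot.wav")
```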
Voice Cloning
Voice cloning is distinct from fine-tuning [00:18:00]. In voice cloning, a sample audio is passed into the model, influencing the generated output to sound more like the provided sample [00:18:04].
- The custom data set generated earlier is used for this purpose [00:18:22].
- Example Audio (Cloned): “Sesame is a super cool TTS model which can be fine-tuned with Unsloth.” [00:21:28] (Result: closer to the speaker’s voice, but still not as good as after fine-tuning)
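In the `transformers` API this corresponds to prepending a few (text, audio) rows from the custom dataset as context before the new prompt, via the processor’s chat template. The snippet below follows the `sesame/csm-1b` usage examples; the exact content keys may differ between library versions, so treat it as a sketch:

```python
# `ds`, `processor`, `model`, and `device` come from the earlier sketches.
conversation = []
for row in ds.select(range(3)):  # a few context clips in the target voice
    conversation.append({
        "role": "0",  # speaker id 0
        "content": [
            {"type": "text", "text": row["text"]},
            {"type": "audio", "path": row["audio"]["array"]},
        ],
    })

# The new text to be spoken in the cloned voice (no audio attached).
conversation.append({
    "role": "0",
    "content": [{"type": "text", "text": "Sesame is a super cool TTS model which can be fine-tuned with Unsloth."}],
})

inputs = processor.apply_chat_template(conversation, tokenize=True, return_dict=True).to(device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "cloned.wav")
```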
Fine-tuning with Unsloth
Applying LoRA Adapters
Fine-tuning is performed by applying LoRA (Low-Rank Adaptation) adapters [00:22:18]. These adapters target specific linear layers: the attention Q, V, and O projections and the MLP gate, up, and down projections [00:22:20]. (A configuration sketch follows this list.)
- LoRA Alpha: 16 (recommended for a 1 billion parameter model) [00:22:29].
- Rescale LoRA: Scales the learning rate of adapters based on their size [00:22:36].
- LoRA Rank: 32 (determines the “width” or “height” of adapter matrices) [00:22:49].
- Only a small subset of parameters (under 2%) is made trainable, which saves memory and speeds up training [00:16:11], [00:23:25]. Training embeddings is generally unnecessary if the tokens themselves aren’t changing [00:23:51].
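The notebook applies the adapters through Unsloth’s helper; an equivalent configuration in plain `peft` (assuming the usual Llama-style projection names for the targeted layers) looks roughly like this:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,            # LoRA rank: the inner dimension of the adapter matrices
    lora_alpha=16,   # scaling factor, as recommended above for a ~1B model
    lora_dropout=0.0,
    bias="none",
    target_modules=[  # attention Q/V/O plus the MLP gate/up/down projections
        "q_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 2% of all parameters
```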
Data Preparation for Trainer
The loaded data set needs further formatting for the trainer [00:24:11].
- Determine maximum audio length (e.g., 700,000 audio steps) [00:24:16].
- Measure maximum text length (e.g., 587) [00:24:35].
- These lengths, along with a sampling rate, are passed into the trainer to prepare input IDs, attention masks, labels, and other tensors [00:24:45].
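A minimal scan over the dataset gives the numbers mentioned above (the attribute names assume the dataset and processor from the earlier sketches):

```python
# `ds` has `audio` and `text` columns; `processor` wraps the CSM tokenizer.
max_audio_length = max(len(row["audio"]["array"]) for row in ds)        # e.g. ~700,000 samples
max_text_length = max(len(processor.tokenizer(row["text"]).input_ids) for row in ds)  # e.g. ~587
sampling_rate = ds[0]["audio"]["sampling_rate"]                          # 24,000 Hz for CSM

print(max_audio_length, max_text_length, sampling_rate)
```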
Training Configuration
The processed data set is then passed to the trainer.
- Virtual Batch Size: 8 (for 41 rows, this means 5 steps per epoch) [00:25:41].
- Epochs: 1 epoch for this example [00:26:00].
- Warm-up Steps: 3 (slowly increases the learning rate, though 1 might be more appropriate for very few steps) [00:26:05].
- Learning Rate: 2e-4 (good for a 1 billion parameter model) [00:26:16].
- Precision: Automatically uses float16 or bfloat16 depending on the GPU (e.g., float16 on a T4) [00:26:20].
- Optimizer: AdamW 8-bit to reduce memory requirements [00:26:27].
- Weight Decay: Used to prevent overfitting [00:26:32].
- Monitoring: Ideally, use an evaluation data set to monitor evaluation loss and the gradient norm (which should stay around 1 or less) with tools like TensorBoard [00:28:36]. (A training-arguments sketch follows this list.)
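Expressed with `transformers` `TrainingArguments` and `Trainer`, the settings above might look like the sketch below. The workshop notebook wires this up through Unsloth, and `processed_ds` (the dataset already converted to input IDs, attention masks, and labels) is assumed from the previous step:

```python
import torch
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # virtual batch size of 8 -> ~5 steps for 41 rows
    num_train_epochs=1,
    warmup_steps=3,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),  # float16 on a T4
    bf16=torch.cuda.is_bf16_supported(),      # bfloat16 on newer GPUs
    optim="adamw_8bit",              # 8-bit AdamW to reduce optimizer memory
    weight_decay=0.01,
    logging_steps=1,
    report_to="tensorboard",
)

trainer = Trainer(model=model, args=args, train_dataset=processed_ds)
trainer.train()
```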
Training takes about 10 minutes [00:26:54]. The loss typically falls from around 6.34 to 3.7 [00:27:03].
Post-Training Evaluation
Saving the Model
After training, the model can be saved locally or pushed to Hugging Face Hub [00:29:38].
- Pushing just the LoRA adapters is lightweight [00:29:54].
- To push the full model, it needs to be merged first (e.g., to 16-bit format) [00:29:59].
- A previously created fine-tuned model can be reloaded by specifying its name instead of the base model name [00:30:31].
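In `peft` terms, the two options map to pushing just the adapter weights versus merging them into the base model and pushing the full 16-bit checkpoint (repo names are placeholders; Unsloth also provides its own merge-and-push helpers):

```python
import torch

# Option 1: push only the LoRA adapters (lightweight, a few tens of MB).
model.push_to_hub("your-username/csm-1b-voice-lora")
processor.push_to_hub("your-username/csm-1b-voice-lora")

# Option 2: merge the adapters into the base weights, then push the full model in 16-bit.
merged = model.merge_and_unload()        # peft: folds the LoRA deltas into the base layers
merged = merged.to(torch.float16)
merged.push_to_hub("your-username/csm-1b-voice")

# Reloading later: pass the fine-tuned repo id to from_pretrained instead of "sesame/csm-1b".
```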
Fine-tuned Model Performance
After fine-tuning, new voice samples are generated using the adapted model.
- Example Audio (Fine-tuned, Zero-Shot): “We just finished fine-tuning a text to speech model.” [00:31:10] (Result: male voice, sounds a little bit Irish, with some pacing issues)
- Example Audio (Fine-tuned, Zero-Shot): “Sesame is a super cool TTS model which can be tuned with Unsloth.” [00:31:55] (Result: slight Irish accent, though the ‘t’ in ‘tuned’ still sounds American, indicating room for improvement with more data)
Fine-tuned Model with Voice Cloning
Combining fine-tuning with voice cloning yields the best performance, even with a relatively small data set (e.g., a 30-minute video) [00:33:14].
- Example Audio (Fine-tuned + Cloned): “Sesame is a super cool TTS model which can’t be fine-tuned with Unsloth.” [00:32:21] (Result: very good, strong Irish accent, even with natural-sounding errors, though intonation might not be perfect for every phrase)
Conclusion
This workshop demonstrates that combining fine-tuning with voice cloning can achieve good performance for text-to-speech models even with limited data [00:33:14]. For further improvements, aiming for around 500 rows of 30-second audio clips could significantly enhance performance, especially without voice cloning [00:33:04].
All resources are available on the Trelis Research GitHub under AI Worlds Fair 2025 [00:33:28]. Future videos will cover more detailed data preparation and hyperparameter tuning [00:33:42].