From: aidotengineer
This article outlines a workshop on text-to-speech model fine-tuning, specifically focusing on token-based text-to-speech models using Sesame’s CSM 1B model [00:00:08]. The materials, including a Colab notebook and slides, are available on the Trelis Research GitHub repository under AI Worlds Fair 2025 [00:00:19].
Workshop Objectives
By the end of this workshop, you should be able to:
- Train a text-to-speech model to sound like a specific voice [00:00:30].
- Understand how token-based text-to-speech models function [00:00:38]. The workshop focuses on Sesame CSM1B, but models like Orpheus from Canopy Labs are also token-based [00:00:42].
- Create a voice data set [00:00:51]. This can involve recording your own voice or extracting audio from a YouTube video [00:00:53].
- Perform fine-tuning using Unsloth, a library based on transformers [00:01:03].
- Evaluate performance both before and after fine-tuning [00:01:10].
How Token-Based Text-to-Speech Models Work
Analogy to Language Models
Large language models like OpenAI’s GPT-4o or Meta’s Llama series take a string of text and predict the next word or token recursively [00:01:43]. For text-to-speech, the goal is to take a series of text tokens and predict the next audio token [00:02:08]. This involves passing a history of both text and audio to recursively decode the next audio token, producing a string of audio tokens that represent the desired speech [00:02:19].
Audio Tokens and Codebooks
The core challenge is representing continuous soundwaves as discrete audio tokens [00:02:46]. This is achieved by representing a piece of audio as a choice from a “codebook” [00:03:01]. A codebook functions like a dictionary, containing vectors where each vector represents a specific sound [00:03:07]. These codebooks can be large to capture a wide variety of acoustics and semantics, allowing sound to be represented discretely, similar to how text represents meaning [00:03:18].
Codebooks are trained using an encoder-decoder architecture [00:03:36]. A soundwave is encoded into tokens, then decoded back into a wave [00:03:39]. During training, the system compares the decoded output with the original input, using the difference (loss) to update the model’s weights [00:03:55].
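To make the codebook idea concrete, here is a toy sketch (not Sesame’s actual codec; the codebook size and vector dimension are made up) of how a continuous frame embedding becomes a discrete audio token and back:

```python
import torch

# Toy codebook: 1024 entries, each a 256-dim vector standing in for a short sound.
codebook = torch.randn(1024, 256)

def quantize(frame_embedding: torch.Tensor) -> int:
    """Encoder side: return the index of the nearest codebook vector (the 'audio token')."""
    distances = torch.cdist(frame_embedding[None, :], codebook)[0]  # (1024,)
    return int(distances.argmin())

def dequantize(token_id: int) -> torch.Tensor:
    """Decoder side: look the vector back up from the discrete token id."""
    return codebook[token_id]

frame = torch.randn(256)      # an encoded audio frame
token = quantize(frame)       # discrete audio token (an integer index)
approx = dequantize(token)    # approximation the decoder uses to rebuild the wave
```

During codec training, the reconstruction loss described above is what pushes the codebook vectors, encoder, and decoder toward representing real sounds well.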
Hierarchical Representation
For effective sound representation, it’s often insufficient to have just one token per time step [00:04:20]. A hierarchical representation with multiple tokens allowed at the same timestamp or window provides a more detailed sound representation [00:04:31]. This approach uses multiple layers of tokens, capturing both higher-level meaning/acoustics and more granular details [00:04:48].
The Sesame Model Architecture
The Sesame model employs this hierarchical approach, using 32 tokens at each audio window to represent sound [00:05:06].
- A “zeroth token” is predicted by the main transformer [00:05:16].
- A secondary transformer decodes the remaining 31 “stack tokens” [00:05:20].
- Sesame therefore consists of two models: the main 1 billion parameter model and a much smaller model for decoding hierarchical tokens [00:05:38].
- The model also includes a codec to convert waveforms to tokens and vice-versa [00:16:02].
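The following toy sketch is illustrative only (the real CSM implementation ships with `transformers`, and its depth decoder actually decodes the 31 stack tokens one at a time, each conditioned on the previous ones); it just shows the two-stage idea of a backbone picking the zeroth token and a smaller decoder filling in the rest of the frame:

```python
import torch
import torch.nn as nn

NUM_CODEBOOKS = 32  # 1 "zeroth" token + 31 stacked tokens per audio frame

class ToyTwoStageStep(nn.Module):
    """One decoding step: the backbone predicts codebook 0, a smaller decoder the rest."""

    def __init__(self, hidden=1024, vocab=2048):
        super().__init__()
        self.backbone = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.depth_decoder = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.head0 = nn.Linear(hidden, vocab)  # head for the zeroth codebook
        self.stack_heads = nn.ModuleList(
            [nn.Linear(hidden, vocab) for _ in range(NUM_CODEBOOKS - 1)]
        )

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq, hidden) embeddings of the text + audio token history
        h = self.backbone(history)[:, -1]              # last position drives the next frame
        token0 = self.head0(h).argmax(-1)              # zeroth token from the main transformer
        d = self.depth_decoder(h[:, None])[:, 0]       # smaller decoder refines the frame
        stack = [head(d).argmax(-1) for head in self.stack_heads]
        return torch.stack([token0, *stack], dim=-1)   # (batch, 32) tokens for this frame

frame_tokens = ToyTwoStageStep()(torch.randn(1, 10, 1024))  # -> shape (1, 32)
```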
Fine-tuning Process Overview
The fine-tuning process begins with a pre-trained model capable of taking text and audio inputs to output an audio stream [00:05:51]. The goal is to adapt this model to a specific voice by training it on a custom data set [00:06:05].
Colab Notebook Setup and Data Generation
The entire process is covered in a single Colab notebook, available via the Trelis Research GitHub repo for AI Worlds Fair 2025 [00:06:18].
- GPU Connection: Ensure your Colab runtime is connected to a GPU (e.g., T4), which should be available for free [00:06:36].
- The notebook has two main sections: Data Generation and Fine-tuning [00:07:01].
Data Generation Steps
- Select YouTube Video: Choose a YouTube video, preferably with a single speaker to avoid complex diarization (speaker separation), which is not supported in this basic notebook [00:08:04].
- Transcribe with Whisper: Use OpenAI’s Whisper model (e.g., “turbo” for speed and quality) within the Colab notebook to transcribe the audio [00:08:28]. The yt-dlp and Whisper libraries need to be installed [00:08:47]. The transcription is saved as a JSON file locally [00:09:17].
- Manual Correction: Review the JSON transcript for misspelled words and perform find-and-replace operations (e.g., replacing “Trellis” with two ‘l’s by “Trelis” with one ‘l’ for the company name) [00:09:58]. Re-upload the corrected file to Google Colab [00:10:36].
- Create Data Set Snippets: Combine short Whisper segments into longer chunks, up to 30 seconds in length [00:10:59]. A simple algorithm stacks segments until they exceed 30 seconds, forming a new data row [00:11:11].
- Roughly 50 clips of 30 seconds each are usually sufficient to have an effect on quality [00:07:42].
- Improvement Note: Ideally, segments should end on a full stop (sentence boundary detection, e.g., using NLTK or regular expressions) to create more natural-sounding paragraphs within each data row [00:13:14].
- Push to Hugging Face: The generated data set can be pushed to Hugging Face Hub [00:12:26].
- When loading the data, it must have `audio` and `text` columns, and optionally a `source` column (for speaker ID, starting at zero) [00:19:11]. If no speaker column is present, the code defaults to speaker zero [00:19:28]. (The sketch below walks through this data-generation pipeline end to end.)
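Putting the steps above together, a minimal sketch of the data-generation pipeline might look like the following. The video URL, file names, chunking details, and Hugging Face repo id are all placeholders; the workshop notebook’s exact code may differ.

```python
import json
import yt_dlp
import whisper
import librosa
import soundfile as sf
from datasets import Dataset, Audio

YOUTUBE_URL = "https://www.youtube.com/watch?v=..."  # a single-speaker video

# 1. Download the audio track with yt-dlp and convert it to mp3.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "audio.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([YOUTUBE_URL])

# 2. Transcribe with Whisper ("turbo" is a good speed/quality trade-off).
asr = whisper.load_model("turbo")
result = asr.transcribe("audio.mp3")
with open("transcript.json", "w") as f:
    json.dump(result, f)  # review this file by hand, fix names, re-upload

# 3. Stack Whisper segments into chunks of roughly 30 seconds.
MAX_SECONDS = 30
rows, chunk_start, chunk_text = [], None, []
for seg in result["segments"]:
    if chunk_start is None:
        chunk_start = seg["start"]
    chunk_text.append(seg["text"].strip())
    if seg["end"] - chunk_start >= MAX_SECONDS:
        rows.append({"start": chunk_start, "end": seg["end"], "text": " ".join(chunk_text)})
        chunk_start, chunk_text = None, []
if chunk_text:
    rows.append({"start": chunk_start, "end": result["segments"][-1]["end"], "text": " ".join(chunk_text)})

# 4. Slice the waveform per chunk and build a dataset with audio/text/source columns.
wav, sr = librosa.load("audio.mp3", sr=24000)  # CSM's codec operates at 24 kHz
records = []
for i, r in enumerate(rows):
    path = f"clip_{i:03d}.wav"
    sf.write(path, wav[int(r["start"] * sr): int(r["end"] * sr)], sr)
    records.append({"audio": path, "text": r["text"], "source": 0})  # single speaker -> id 0

ds = Dataset.from_list(records).cast_column("audio", Audio(sampling_rate=sr))
ds.push_to_hub("your-username/voice-dataset")  # placeholder repo id
```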
Model Loading and Initial Evaluation
Installing Unsloth and Loading the Base Model
First, install Unsloth, which includes `transformers` and other necessary packages for loading and fine-tuning [00:13:50].
Then, load the base CSM1B (Sesame) model using Unsloth [00:14:10].
- The model should fit easily within a 15GB T4 GPU as it’s only a few gigabytes in size [00:15:02].
- Sesame models are now supported by `transformers` [00:14:55].
- The model architecture includes a backbone (main transformer), a depth decoder (for the 31 additional tokens), and a codec model [00:15:45].
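Since the model is now supported in `transformers`, the pieces above can also be inspected directly with `CsmForConditionalGeneration` (the workshop notebook itself loads the model through Unsloth’s wrapper; this plain-`transformers` sketch just shows the moving parts):

```python
# pip install unsloth  # installs transformers and the other training dependencies
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "sesame/csm-1b"  # the base Sesame CSM 1B checkpoint on Hugging Face
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Printing the module tree shows the backbone transformer, the depth decoder
# for the 31 stacked tokens, and the codec model described above.
print(model)
```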
Initial Inference (Before Fine-Tuning)
To evaluate the base model, perform zero-shot inference by passing text and observing the generated audio [00:16:59]. With a non-zero temperature, the model will produce a random speaker’s voice (male, female, deep, high-pitched) [00:17:46].
- Example Audio (Zero-Shot): “We just finished fine-tuning a text to speech model.” [00:20:46] (Result: a woman’s voice)
- Example Audio (Zero-Shot): “Sesame is a super cool TTS model which can be fine-tuned with Unsloth.” [00:21:02] (Result: wide variance in speaker type)
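Continuing from the loading sketch above, zero-shot generation follows the usage pattern documented for `sesame/csm-1b` in `transformers` (the sampling settings here are illustrative):

```python
# "[0]" prefixes the prompt with speaker id 0.
text = "[0]We just finished fine-tuning a text to speech model."
inputs = processor(text, add_special_tokens=True).to(device)

# With sampling enabled (non-zero temperature) the base model picks an arbitrary voice.
audio = model.generate(**inputs, output_audio=True, do_sample=True)
processor.save_audio(audio, "zero_shot.wav")
```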
Voice Cloning
Voice cloning is distinct from fine-tuning [00:18:00]. In voice cloning, a sample audio is passed into the model, influencing the generated output to sound more like the provided sample [00:18:04].
- The custom data set generated earlier is used for this purpose [00:18:22].
- Example Audio (Cloned): “Sesame is a super cool TTS model which can be fine-tuned with Unsloth.” [00:21:28] (Result: closer to the speaker’s voice, but still not as good as after fine-tuning)
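In the `transformers` API this corresponds to prepending a few (text, audio) rows from the custom dataset as context before the new prompt, via the processor’s chat template. The snippet below follows the `sesame/csm-1b` usage examples; the exact content keys may differ between library versions, so treat it as a sketch:

```python
# `ds`, `processor`, `model`, and `device` come from the earlier sketches.
conversation = []
for row in ds.select(range(3)):  # a few context clips in the target voice
    conversation.append({
        "role": "0",  # speaker id 0
        "content": [
            {"type": "text", "text": row["text"]},
            {"type": "audio", "path": row["audio"]["array"]},
        ],
    })

# The new text to be spoken in the cloned voice (no audio attached).
conversation.append({
    "role": "0",
    "content": [{"type": "text", "text": "Sesame is a super cool TTS model which can be fine-tuned with Unsloth."}],
})

inputs = processor.apply_chat_template(conversation, tokenize=True, return_dict=True).to(device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "cloned.wav")
```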
Fine-tuning with Unsloth
Applying LoRA Adapters
Fine-tuning is performed by applying LoRA (Low-Rank Adaptation) adapters [00:22:18]. These adapters target specific linear layers: the attention Q, V, and O projections and the MLP gate, up, and down projections [00:22:20]. (A configuration sketch follows this list.)
- LoRA Alpha: 16 (recommended for a 1 billion parameter model) [00:22:29].
- Rescale LoRA: Scales the learning rate of adapters based on their size [00:22:36].
- LoRA Rank: 32 (determines the “width” or “height” of adapter matrices) [00:22:49].
- Only a small subset of parameters (under 2%) is made trainable, which saves memory and speeds up training [00:16:11], [00:23:25]. Training embeddings is generally unnecessary if the tokens themselves aren’t changing [00:23:51].
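The notebook applies the adapters through Unsloth’s helper; an equivalent configuration in plain `peft` (assuming the usual Llama-style projection names for the targeted layers) looks roughly like this:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,            # LoRA rank: the inner dimension of the adapter matrices
    lora_alpha=16,   # scaling factor, as recommended above for a ~1B model
    lora_dropout=0.0,
    bias="none",
    target_modules=[  # attention Q/V/O plus the MLP gate/up/down projections
        "q_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 2% of all parameters
```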
Data Preparation for Trainer
The loaded data set needs further formatting for the trainer [00:24:11].
- Determine maximum audio length (e.g., 700,000 audio steps) [00:24:16].
- Measure maximum text length (e.g., 587) [00:24:35].
- These lengths, along with a sampling rate, are passed into the trainer to prepare input IDs, attention masks, labels, and other tensors [00:24:45].
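A minimal scan over the dataset gives the numbers mentioned above (the attribute names assume the dataset and processor from the earlier sketches):

```python
# `ds` has `audio` and `text` columns; `processor` wraps the CSM tokenizer.
max_audio_length = max(len(row["audio"]["array"]) for row in ds)        # e.g. ~700,000 samples
max_text_length = max(len(processor.tokenizer(row["text"]).input_ids) for row in ds)  # e.g. ~587
sampling_rate = ds[0]["audio"]["sampling_rate"]                          # 24,000 Hz for CSM

print(max_audio_length, max_text_length, sampling_rate)
```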
Training Configuration
The processed data set is then passed to the trainer.
- Virtual Batch Size: 8 (for 41 rows, this means 5 steps per epoch) [00:25:41].
- Epochs: 1 epoch for this example [00:26:00].
- Warm-up Steps: 3 (slowly increases the learning rate, though 1 might be more appropriate for very few steps) [00:26:05].
- Learning Rate: 2e-4 (good for a 1 billion parameter model) [00:26:16].
- Precision: Automatically uses float16 or bfloat16 depending on the GPU (e.g., float16 on a T4) [00:26:20].
- Optimizer: AdamW 8-bit to reduce memory requirements [00:26:27].
- Weight Decay: Used to prevent overfitting [00:26:32].
- Monitoring: Ideally, use an evaluation data set to monitor evaluation loss and the gradient norm (which should stay around 1 or less) with tools like TensorBoard [00:28:36]. (A training-arguments sketch follows this list.)
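Expressed with `transformers` `TrainingArguments` and `Trainer`, the settings above might look like the sketch below. The workshop notebook wires this up through Unsloth, and `processed_ds` (the dataset already converted to input IDs, attention masks, and labels) is assumed from the previous step:

```python
import torch
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # virtual batch size of 8 -> ~5 steps for 41 rows
    num_train_epochs=1,
    warmup_steps=3,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),  # float16 on a T4
    bf16=torch.cuda.is_bf16_supported(),      # bfloat16 on newer GPUs
    optim="adamw_8bit",              # 8-bit AdamW to reduce optimizer memory
    weight_decay=0.01,
    logging_steps=1,
    report_to="tensorboard",
)

trainer = Trainer(model=model, args=args, train_dataset=processed_ds)
trainer.train()
```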
Training takes about 10 minutes [00:26:54]. The loss typically falls from around 6.34 to 3.7 [00:27:03].
Post-Training Evaluation
Saving the Model
After training, the model can be saved locally or pushed to Hugging Face Hub [00:29:38].
- Pushing just the LoRA adapters is lightweight [00:29:54].
- To push the full model, it needs to be merged first (e.g., to 16-bit format) [00:29:59].
- A previously created fine-tuned model can be reloaded by specifying its name instead of the base model name [00:30:31].
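In `peft` terms, the two options map to pushing just the adapter weights versus merging them into the base model and pushing the full 16-bit checkpoint (repo names are placeholders; Unsloth also provides its own merge-and-push helpers):

```python
import torch

# Option 1: push only the LoRA adapters (lightweight, a few tens of MB).
model.push_to_hub("your-username/csm-1b-voice-lora")
processor.push_to_hub("your-username/csm-1b-voice-lora")

# Option 2: merge the adapters into the base weights, then push the full model in 16-bit.
merged = model.merge_and_unload()        # peft: folds the LoRA deltas into the base layers
merged = merged.to(torch.float16)
merged.push_to_hub("your-username/csm-1b-voice")

# Reloading later: pass the fine-tuned repo id to from_pretrained instead of "sesame/csm-1b".
```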
Fine-tuned Model Performance
After fine-tuning, new voice samples are generated using the adapted model.
- Example Audio (Fine-tuned, Zero-Shot): “We just finished fine-tuning a text to speech model.” [00:31:10] (Result: male voice, sounds a little bit Irish, with some pacing issues)
- Example Audio (Fine-tuned, Zero-Shot): “Sesame is a super cool TTS model which can be tuned with Unsloth.” [00:31:55] (Result: slight Irish accent, though the ‘t’ in ‘tuned’ still sounds American, indicating room for improvement with more data)
Fine-tuned Model with Voice Cloning
Combining fine-tuning with voice cloning yields the best performance, even with a relatively small data set (e.g., a 30-minute video) [00:33:14].
- Example Audio (Fine-tuned + Cloned): “Sesame is a super cool TTS model which can’t be fine-tuned with Unsloth.” [00:32:21] (Result: very good, strong Irish accent, even with natural-sounding errors, though intonation might not be perfect for every phrase)
Conclusion
This workshop demonstrates that combining fine-tuning with voice cloning can achieve good performance for text-to-speech models even with limited data [00:33:14]. For further improvements, aiming for around 500 rows of 30-second audio clips could significantly enhance performance, especially without voice cloning [00:33:04].
All resources are available on the Trelis Research GitHub under AI Worlds Fair 2025 [00:33:28]. Future videos will cover more detailed data preparation and hyperparameter tuning [00:33:42].