From: aidotengineer
This article provides an overview of fine-tuning text-to-speech (TTS) models, specifically focusing on the Sesame CSM1B model using the Unsloth library [00:00:08]. The objectives are to train a TTS model to sound like a specific voice [00:00:32], understand token-based TTS models [00:00:38], create voice datasets [00:00:50], perform fine-tuning with Unsloth [00:01:03], and evaluate performance [00:01:10]. All workshop materials, including the Colab notebook and slides, are available on the Trelis Research GitHub under AI Worldsfare-2025 [00:00:16].
Understanding Token-Based Text-to-Speech Models
Token-based TTS models function similarly to token-based text models like GPT-4o or the Llama series [00:01:46]. While text models predict the next word or token from an input string of text [00:01:58], TTS models take a series of text tokens and predict the next audio token [00:02:10]. They can also incorporate the text and audio history of the conversation so far to recursively produce the next audio token [00:02:19].
The core challenge is representing continuous audio waveforms as discrete tokens [00:02:48]. This is achieved by representing a small piece of audio as a choice from a “code book” [00:03:04]. A code book acts like a dictionary containing vectors, where each vector represents a specific sound [00:03:09].
To train these code books, an encoder-decoder architecture is used [00:03:39]. A soundwave is encoded into tokens, then decoded back into a wave [00:03:42]. The difference between the input and output waves (loss) is used to update the model’s weights [00:04:01].
A single token per time step is often insufficient for detailed audio representation [00:04:26]. Therefore, token-based audio models typically use a hierarchical representation or multiple tokens per time window [00:04:31]. This allows for representing both higher-level meaning/acoustics and more granular details [00:04:51].
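To make the code-book idea concrete, the toy example below performs a nearest-neighbour lookup against a random code book. The sizes (1024 entries, 64 dimensions) and the code itself are purely illustrative and are not Sesame's actual codec:

```python
import torch

# Illustrative only: a "code book" of 1024 vectors, each standing for a short
# snippet of sound in a 64-dimensional latent space (sizes are made up).
codebook = torch.randn(1024, 64)

def quantize(frame_embedding: torch.Tensor) -> int:
    """Map one continuous audio-frame embedding to its nearest code-book entry."""
    distances = torch.cdist(frame_embedding.unsqueeze(0), codebook)  # shape (1, 1024)
    return int(distances.argmin())  # the discrete "audio token" for this frame

frame = torch.randn(64)        # stand-in for an encoder output at one time step
token_id = quantize(frame)     # the index that a token-based TTS model learns to predict
```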
Sesame CSM1B Model Architecture
The Sesame model, specifically CSM1B, uses 32 tokens at every audio window to represent sound [00:05:06].
- A “zeroth” token is predicted by the main transformer [00:05:16].
- A secondary transformer decodes the other 31 “stack” tokens [00:05:20].
- The Sesame model consists of two parts: the main 1-billion parameter model and a much smaller model for decoding hierarchical tokens [00:05:38].
The architecture includes:
- A backbone model (the main transformer) [00:15:48].
- A depth decoder for the 31 additional tokens [00:15:51].
- A codec model to convert between waveform and tokens [00:16:00].
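At generation time these three components fit together roughly as sketched below. This is conceptual pseudocode; the object and method names (`backbone.predict_next`, `depth_decoder.predict_stack`, `codec.decode`) are invented for illustration and do not correspond to the real Sesame or Transformers API:

```python
# Conceptual sketch of CSM-style generation -- names are hypothetical.
def generate_audio(text_tokens, backbone, depth_decoder, codec, n_frames):
    audio_frames = []
    context = list(text_tokens)
    for _ in range(n_frames):
        # 1. The backbone (main ~1B transformer) predicts the zeroth code-book token.
        zeroth = backbone.predict_next(context)
        # 2. The depth decoder predicts the remaining 31 "stack" tokens for this window.
        stack = depth_decoder.predict_stack(zeroth, context)
        frame = [zeroth] + stack          # 32 tokens per audio window
        audio_frames.append(frame)
        context += frame                  # feed back for the next step
    # 3. The codec model converts the discrete tokens back into a waveform.
    return codec.decode(audio_frames)
```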
Data Preparation for Fine-tuning
The first step in fine-tuning is data generation [00:07:01].
- Select a YouTube video: Choose a video with a single speaker to avoid complex diarization [00:08:09].
- Transcribe with Whisper: Use OpenAI’s Whisper model (e.g., the `turbo` size for speed and quality) within the Colab notebook to transcribe the video [00:07:06]. The transcription is saved as a JSON file [00:09:18].
- Colab recommendation [00:08:51]: Run the YouTube download and Whisper transcription inside Google Colab to avoid authentication issues.
- Correct Transcription: Manually review the JSON transcript for misspellings (e.g., the proper noun “Trelis”, which should have one ‘L’, not two) and perform find-and-replace operations [00:09:58]. Re-upload the corrected file to Colab [00:10:36].
- Segment and Combine Audio: Split the long transcript into shorter segments and combine them into chunks of up to 30 seconds [00:11:04]. The simple algorithm stacks Whisper segments until they exceed 30 seconds, then starts a new data row [00:11:09] (see the chunking sketch after this list).
- Dataset size: Even a dataset of about 41 rows of 30-second clips can have a noticeable effect on quality [00:07:27]; around 50 rows of 30-second snippets is a good starting point [00:07:43]. For better performance, particularly without voice cloning, aim for around 500 rows of 30-second clips [00:33:07].
- Data improvement [00:13:17]: Consider ending each row of data on a full stop, using a library like NLTK or regular expressions to detect sentence boundaries.
- Push to Hugging Face (Optional): The processed dataset can be pushed to Hugging Face Hub [00:12:26].
- Load Raw Dataset: The dataset needs an `audio` column and a `text` column [00:19:11]. An optional `source` column holds the speaker ID, starting at zero [00:19:16]; if it is not provided, a default `source` of zero is assigned [00:19:30].
- Format Data for Trainer: Determine the maximum audio and text lengths from the dataset [00:24:18]. Unsloth prepares the input IDs, attention masks, labels, input values, and cutoffs [00:25:01]. This mapping is applied to create a processed dataset [00:25:12].
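A minimal sketch of the “stack until 30 seconds” chunking step, assuming a Whisper-style JSON file whose `segments` entries carry `start`, `end`, and `text` fields (the exact notebook code may differ):

```python
import json

MAX_SECONDS = 30.0

def chunk_segments(whisper_json_path):
    """Stack consecutive Whisper segments into rows of roughly 30 seconds."""
    with open(whisper_json_path) as f:
        segments = json.load(f)["segments"]

    rows, start, texts = [], None, []
    for seg in segments:
        if start is None:
            start = seg["start"]
        texts.append(seg["text"].strip())
        if seg["end"] - start >= MAX_SECONDS:      # close the row once it exceeds ~30 s
            rows.append({"start": start, "end": seg["end"], "text": " ".join(texts)})
            start, texts = None, []
    if texts:                                      # flush any final partial row
        rows.append({"start": start, "end": segments[-1]["end"], "text": " ".join(texts)})
    return rows

# Each row's (start, end) is used to slice the audio into the `audio` column,
# the joined transcript becomes `text`, and `source` (speaker ID) defaults to 0.
```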
Fine-tuning with Unsloth
Unsloth is a library built around Hugging Face Transformers [00:01:05] that simplifies fine-tuning [00:14:05].
- Install Unsloth: Install Unsloth, which includes transformers and other necessary packages [00:14:00].
- Load Base Model: Load the Sesame CSM1B model using Unsloth [00:14:10]. It’s a 1-billion parameter model and fits within a T4 GPU’s 15GB memory [00:15:02]. The model supports auto-model for conditional generation [00:14:53].
- GPU requirement [00:06:39]: A GPU (e.g., a T4 on Colab) is required for training.
- Apply LoRA Adapters: Instead of training all model parameters, LoRA (Low-Rank Adaptation) adapters are applied to a subset of layers, e.g., the Q/V/O attention projections and the MLP linear layers [00:22:18]. This saves memory and speeds up training [00:16:22]. A sketch of the load-and-adapt step follows this list.
  - `lora_alpha` can be set to 16 for 1-billion parameter models [00:22:29].
  - `rescale_lora` scales the learning rate based on adapter size [00:22:36].
  - A `rank` of 32 can be used for the adapter matrices [00:22:49].
  - Typically, less than 2% of parameters are trainable with this approach [00:23:25].
- Training embeddings (e.g., the LM head, `lm_embed`, `embed_tokens`) is generally not necessary unless token changes are involved [00:23:51].
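As a rough sketch of the load-and-adapt step, assuming the `unsloth/csm-1b` checkpoint ID used in Unsloth's CSM examples and Unsloth's usual `FastModel` interface; argument names (for example `use_rslora`, standing in here for the `rescale_lora` setting mentioned above) may vary between Unsloth releases:

```python
from unsloth import FastModel
from transformers import CsmForConditionalGeneration

# Load the base model -- ~1B parameters, fits in a T4's 15 GB of memory.
model, processor = FastModel.from_pretrained(
    model_name="unsloth/csm-1b",                 # assumed checkpoint ID
    max_seq_length=2048,
    dtype=None,                                  # float16 on T4, bfloat16 on newer GPUs
    auto_model=CsmForConditionalGeneration,      # "auto-model for conditional generation"
    load_in_4bit=False,
)

# Attach LoRA adapters to the attention and MLP linear layers only;
# embeddings and the LM head are left untouched (no new tokens are added).
model = FastModel.get_peft_model(
    model,
    r=32,                                        # rank of the adapter matrices
    lora_alpha=16,                               # suggested for ~1B-parameter models
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
    use_rslora=True,                             # rescale updates by adapter size
)
model.print_trainable_parameters()               # typically well under 2% trainable
```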
Trainer Configuration
The processed dataset is passed to the trainer [00:25:29]. Key configurations include:
- Virtual batch size: The virtual (effective) batch size is the per-device batch size multiplied by the gradient accumulation steps. On a T4, a virtual batch size of 8 or more can be used [00:25:42]; a per-device batch size of 4 might be possible [00:27:44].
- Epochs: Even one epoch can yield results [00:26:00].
- Warm-up steps: Controls how slowly the learning rate increases [00:26:05]. For small datasets (e.g., 5 total steps), reducing warm-up steps to one can be beneficial [00:26:11].
- Learning rate: 2e-4 is typically suitable for a 1-billion parameter model [00:26:16].
- Data type: Automatically selected (e.g., `float16` on a T4, `bfloat16` on Ampere/Hopper/Blackwell GPUs) [00:26:20].
- Optimizer: The AdamW 8-bit optimizer reduces memory requirements [00:26:27].
- Weight decay: Prevents overfitting [00:26:32].
- Learning rate scheduler: Constant learning rate can be used [00:26:35].
- Output directory: Specifies where model outputs are saved [00:26:39].
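The trainer setup might look roughly like the following with plain Hugging Face `TrainingArguments` (the notebook itself may use Unsloth's or TRL's trainer wrapper; `model` and `processed_dataset` come from the earlier steps):

```python
import torch
from transformers import Trainer, TrainingArguments

bf16_ok = torch.cuda.is_bf16_supported()         # bfloat16 on Ampere/Hopper/Blackwell

args = TrainingArguments(
    output_dir="outputs",                        # where checkpoints are written
    per_device_train_batch_size=4,               # what a T4 can typically hold
    gradient_accumulation_steps=2,               # 4 x 2 = virtual batch size of 8
    num_train_epochs=1,                          # even one epoch gives audible results
    warmup_steps=1,                              # keep tiny for very small datasets
    learning_rate=2e-4,                          # a good default for a ~1B model
    fp16=not bf16_ok,                            # float16 on a T4
    bf16=bf16_ok,
    optim="adamw_8bit",                          # 8-bit AdamW cuts optimizer memory
    weight_decay=0.01,                           # guards against overfitting
    lr_scheduler_type="constant",
    logging_steps=1,
)

trainer = Trainer(model=model, args=args, train_dataset=processed_dataset)
trainer.train()
```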
Training can take about 10 minutes [00:26:57]. The loss should decrease significantly (e.g., from ~6.34 to ~3.7) [00:27:03].
Evaluation Data
Ideally, the training set should be split into `train` and `eval` sets so that `eval_loss` and `grad_norm` (which should be around 1 or less) can be monitored using tools like TensorBoard [00:28:36]. A sketch of such a split is shown below.
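A small sketch of holding out an evaluation split, assuming `processed_dataset` is a Hugging Face `datasets.Dataset` and `args` is the `TrainingArguments` object from the previous sketch:

```python
from transformers import Trainer

# Hold out 10% of the rows for evaluation so eval_loss can be tracked.
split = processed_dataset.train_test_split(test_size=0.1, seed=42)

trainer = Trainer(
    model=model,
    args=args,                   # add eval_strategy="steps" to TrainingArguments
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()                  # eval_loss and grad_norm then appear in the logs/TensorBoard
```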
Performance Evaluation
Zero-Shot Inference
Without fine-tuning or voice cloning, the base model generates speech with a random speaker’s voice due to non-zero temperature [00:17:47]. This can result in a wide variance of voices (e.g., male, female, deep, high-pitched) [00:17:50].
Voice Cloning
Voice cloning involves passing an audio sample to the model, which then attempts to generate new text in a voice similar to the sample [00:18:00]. This yields a voice much closer to the sample than zero-shot inference, but still not as accurate as after fine-tuning [00:18:35].
Fine-tuned Model Performance
After fine-tuning, the model’s output will have a consistent male voice (if trained on a male voice) [00:31:25]. While improved, issues like pacing or subtle accent characteristics might still be present, indicating potential for further data filtering or more data [00:31:34].
Fine-tuning combined with Voice Cloning
The best performance is typically achieved by combining fine-tuning with voice cloning [00:32:14]. This can produce very high-quality speech that closely matches the target voice, even with a relatively small amount of data (e.g., 30 minutes of video) [00:33:14].
Saving and Pushing Models
After training, the model can be saved locally or pushed to Hugging Face Hub [00:29:38].
- Saving LoRA Adapters: Saving the LoRA adapters is lightweight and creates a smaller repository [00:29:54].
- Merging and Saving Full Model: To push the full, merged model, it needs to be merged into a 16-bit format first [00:30:00]. This saves both the model and processor [00:30:13].
- Reloading a Fine-tuned Model: To reload a previously fine-tuned model, specify its repository name (e.g., `trellis/my-YouTube-tts`) instead of the base model name [00:30:31].
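As a sketch of these options using standard Hugging Face and PEFT methods (the notebook may instead use Unsloth's own saving helpers; the repository ID simply mirrors the example above):

```python
import torch

# Lightweight option: save only the LoRA adapters (small repository).
model.save_pretrained("csm1b-lora-adapters")
processor.save_pretrained("csm1b-lora-adapters")

# Full option: merge the adapters into the base weights, cast to 16-bit,
# then push both the merged model and the processor to the Hub.
merged = model.merge_and_unload()                # PEFT: fold LoRA into the base weights
merged = merged.to(torch.float16)                # 16-bit format for pushing
merged.push_to_hub("trellis/my-YouTube-tts")
processor.push_to_hub("trellis/my-YouTube-tts")

# Reloading later: pass "trellis/my-YouTube-tts" to from_pretrained
# instead of the base model name.
```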