From: aidotengineer

This article provides an overview of fine-tuning text-to-speech (TTS) models, specifically focusing on the Sesame CSM1B model using the Unsloth library [00:00:08]. The objectives are to train a TTS model to sound like a specific voice [00:00:32], understand token-based TTS models [00:00:38], create voice datasets [00:00:50], perform fine-tuning with Unsloth [00:01:03], and evaluate performance [00:01:10]. All workshop materials, including the Colab notebook and slides, are available on the Trellis Research GitHub under AI Worldsfare-2025 [00:00:16].

Understanding Token-Based Text-to-Speech Models

Token-based TTS models function similarly to token-based text models like GPT-4o or the Llama series [00:01:46]. While text models predict the next word or token from an input string of text [00:01:58], TTS models take a series of text tokens and predict the next audio token [00:02:10]. They can also incorporate a history of text and audio from the previous conversation to produce the next audio token autoregressively [00:02:19].

The core challenge is representing continuous audio waveforms as discrete tokens [00:02:48]. This is achieved by representing a small piece of audio as a choice from a “code book” [00:03:04]. A code book acts like a dictionary containing vectors, where each vector represents a specific sound [00:03:09].
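
As a rough illustration of that lookup (not Sesame’s actual codec; the codebook below is randomly initialized purely for demonstration), each audio frame’s embedding can be mapped to the index of its nearest codebook vector, and that index becomes the discrete audio token:

```python
import torch

# Illustrative codebook: 1024 entries, each a 64-dimensional vector that
# stands for a particular short sound in latent space (values are random here).
codebook = torch.randn(1024, 64)

def quantize(frame_embedding: torch.Tensor) -> int:
    """Map a continuous audio-frame embedding to its nearest codebook entry;
    the entry's index is the discrete audio token."""
    distances = torch.cdist(frame_embedding[None, :], codebook)  # (1, 1024)
    return int(distances.argmin())

def dequantize(token_id: int) -> torch.Tensor:
    """Recover the (approximate) latent vector from a token id."""
    return codebook[token_id]

frame = torch.randn(64)      # embedding of one small piece of audio
token = quantize(frame)      # a single discrete token
approx = dequantize(token)   # lossy reconstruction of the frame's embedding
```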

To train these code books, an encoder-decoder architecture is used [00:03:39]. A soundwave is encoded into tokens, then decoded back into a wave [00:03:42]. The difference between the input and output waves (loss) is used to update the model’s weights [00:04:01].
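
A minimal sketch of that training signal, with toy linear modules standing in for the real encoder and decoder (the quantization step and the straight-through tricks of an actual codec are omitted):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the codec's encoder and decoder (not the real architecture).
encoder = nn.Sequential(nn.Linear(16000, 256), nn.ReLU(), nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 16000))
optimizer = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-4)

wave = torch.randn(1, 16000)      # one second of 16 kHz audio (fake data)

optimizer.zero_grad()
latent = encoder(wave)            # encode the soundwave (quantization omitted)
reconstruction = decoder(latent)  # decode back into a waveform

loss = nn.functional.mse_loss(reconstruction, wave)  # input vs. output difference
loss.backward()                   # gradients flow through decoder and encoder
optimizer.step()                  # weights updated to shrink that difference
```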

A single token per time step is often insufficient for detailed audio representation [00:04:26]. Therefore, token-based audio models typically use a hierarchical representation or multiple tokens per time window [00:04:31]. This allows for representing both higher-level meaning/acoustics and more granular details [00:04:51].
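
One common way to get several tokens per time window is residual vector quantization, sketched below with random codebooks; Sesame’s actual codec may differ in its details:

```python
import torch

num_codebooks, codebook_size, dim = 32, 1024, 64
codebooks = torch.randn(num_codebooks, codebook_size, dim)  # one codebook per level

def residual_quantize(frame_embedding: torch.Tensor) -> list[int]:
    """Return one token per codebook: the first level captures the coarse
    content, and each later level encodes what the previous levels missed."""
    tokens, residual = [], frame_embedding.clone()
    for level in range(num_codebooks):
        distances = torch.cdist(residual[None, :], codebooks[level])
        index = int(distances.argmin())
        tokens.append(index)
        residual = residual - codebooks[level][index]  # keep only the leftover detail
    return tokens

frame = torch.randn(dim)
tokens = residual_quantize(frame)  # 32 tokens describing a single audio window
```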

Sesame CSM1B Model Architecture

The Sesame model, specifically CSM1B, uses 32 tokens at every audio window to represent sound [00:05:06].

  • A “zeroth” token is predicted by the main transformer [00:05:16].
  • A secondary transformer decodes the other 31 “stack” tokens [00:05:20].
  • The Sesame model consists of two parts: the main 1-billion parameter model and a much smaller model for decoding hierarchical tokens [00:05:38].

The architecture includes (sketched below):

  • A backbone model (the main transformer) [00:15:48].
  • A depth decoder for the 31 additional tokens [00:15:51].
  • A codec model to convert between waveform and tokens [00:16:00].
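
Putting those pieces together, one generation step can be pictured as follows; the functions are placeholders standing in for the real modules, and all shapes and sizes are illustrative:

```python
import torch

# Conceptual sketch only: placeholder functions stand in for the real modules.

def backbone(text_tokens, audio_history):
    """Main 1B transformer: predicts the zeroth audio token for this window."""
    return torch.randint(0, 1024, (1,))

def depth_decoder(zeroth_token):
    """Small depth decoder: predicts the remaining 31 'stack' tokens."""
    return torch.randint(0, 1024, (31,))

def codec_decode(frame_tokens):
    """Codec: converts the 32 tokens for one window back into a waveform chunk."""
    return torch.randn(1920)

def generate_frame(text_tokens, audio_history):
    zeroth = backbone(text_tokens, audio_history)
    stack = depth_decoder(zeroth)
    frame_tokens = torch.cat([zeroth, stack])   # 32 tokens per audio window
    return codec_decode(frame_tokens)

chunk = generate_frame(text_tokens=torch.tensor([101, 102]), audio_history=[])
```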

Data Preparation for Fine-tuning

The first step in fine-tuning is data generation [00:07:01].

  1. Select a YouTube video: Choose a video with a single speaker to avoid complex diarization [00:08:09].
  2. Transcribe with Whisper: Use OpenAI’s Whisper model (e.g., the turbo size for speed and quality) in the Colab notebook to transcribe the video [00:07:06]. The transcription is saved as a JSON file [00:09:18]. A sketch of this pipeline appears after the list.

    Colab Recommendation [00:08:51].

    Running the YouTube download and Whisper transcription inside Google Colab is recommended to avoid authentication issues.

  3. Correct Transcription: Manually review the JSON transcript for misspellings (e.g., proper nouns like “Trellis” with one ‘L’ vs. two ‘L’s) and perform find-and-replace operations [00:09:58]. Re-upload the corrected file to Colab [00:10:36].
  4. Segment and Combine Audio: Split the long transcript into shorter segments, combining them into chunks of up to 30 seconds [00:11:04]. The simple algorithm stacks Whisper segments until they exceed 30 seconds, forming a new data row [00:11:09].
    • Even a small dataset noticeably affects output quality; the workshop example uses about 41 rows of 30-second clips [00:07:27]. Around 50 rows of 30-second snippets is a good starting point [00:07:43]. For better performance, particularly without voice cloning, aim for around 500 rows of 30-second clips [00:33:07].

    Data Improvement [00:13:17].

    Consider ending each row of data on a full stop, using libraries like NLTK or regular expressions to detect sentence boundaries.

  5. Push to Hugging Face (Optional): The processed dataset can be pushed to Hugging Face Hub [00:12:26].
  6. Load Raw Dataset: The dataset needs to have an audio column and a text column [00:19:11]. An optional source column refers to the speaker ID, starting at zero [00:19:16]. If not provided, a default source of zero is assigned [00:19:30].
  7. Format Data for Trainer: Determine the maximum audio and text lengths from the dataset [00:24:18]. Unsloth prepares the input IDs, attention masks, labels, input values, and cutoffs [00:25:01]. This mapping is applied to create a processed dataset [00:25:12].
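
A minimal sketch of this data preparation pipeline, assuming the openai-whisper and datasets packages; the audio file name, the per-chunk clip paths, and the Hub repository ID are all placeholders:

```python
import json
import whisper
from datasets import Dataset, Audio

# Steps 1-3: transcribe the downloaded audio and save the JSON for manual review.
model = whisper.load_model("turbo")
result = model.transcribe("youtube_audio.wav")      # placeholder file name
with open("transcript.json", "w") as f:             # re-upload after manual fixes
    json.dump(result, f, indent=2)

# Step 4: stack Whisper segments into rows of up to ~30 seconds.
rows, start, text_parts = [], None, []
for seg in result["segments"]:
    if start is None:
        start = seg["start"]
    text_parts.append(seg["text"].strip())
    if seg["end"] - start >= 30:                    # close the chunk at ~30 s
        rows.append({"start": start, "end": seg["end"], "text": " ".join(text_parts)})
        start, text_parts = None, []
if text_parts:                                      # flush the final, shorter chunk
    rows.append({"start": start, "end": result["segments"][-1]["end"],
                 "text": " ".join(text_parts)})

# Steps 5-6: build a dataset with audio/text/source columns (slicing the audio into
# per-chunk clips is omitted; each row should point at its clip) and optionally push it.
ds = Dataset.from_list([
    {"audio": f"clips/clip_{i:03d}.wav",            # placeholder clip path
     "text": row["text"],
     "source": 0}                                   # single speaker -> speaker ID 0
    for i, row in enumerate(rows)
]).cast_column("audio", Audio())
# ds.push_to_hub("your-username/my-youtube-tts-dataset")
```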

Fine-tuning with Unsloth

Unsloth is a library built around Hugging Face Transformers [00:01:05] that simplifies fine-tuning [00:14:05].

  1. Install Unsloth: Install Unsloth, which includes transformers and other necessary packages [00:14:00].
  2. Load Base Model: Load the Sesame CSM1B model using Unsloth [00:14:10]. It’s a 1-billion parameter model and fits within a T4 GPU’s 15 GB of memory [00:15:02]. The model is loaded through the auto-model class for conditional generation [00:14:53].

    GPU Requirement [00:06:39].

    A GPU (e.g., a T4 on Colab) is required.

  3. Apply LoRA Adapters: Instead of training all model parameters, LoRA (Low-Rank Adaptation) adapters are applied to a subset of layers, such as the attention Q/V/O projections and the MLP linear layers [00:22:18]. This saves memory and speeds up training [00:16:22]; a minimal sketch follows this list.
    • lora_alpha can be set to 16 for 1-billion parameter models [00:22:29].
    • use_rslora (rank-stabilized LoRA) rescales the adapter update, effectively adjusting the learning rate based on adapter size [00:22:36].
    • rank of 32 can be used for adapter matrices [00:22:49].
    • Typically, less than 2% of parameters are trainable with this approach [00:23:25].
    • Training embeddings (e.g., LM head, lm_embed, embed_tokens) is generally not necessary unless token changes are involved [00:23:51].
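
A minimal sketch of loading the model and attaching the adapters, assuming Unsloth’s FastModel interface as used in its CSM example notebook; exact argument names and defaults should be checked against the current Unsloth release:

```python
from unsloth import FastModel
from transformers import CsmForConditionalGeneration

# Load the 1B Sesame CSM model; it fits in a T4's ~15 GB of memory.
model, processor = FastModel.from_pretrained(
    model_name="unsloth/csm-1b",
    max_seq_length=2048,
    dtype=None,                               # auto-select float16 / bfloat16
    auto_model=CsmForConditionalGeneration,   # conditional-generation class
    load_in_4bit=False,
)

# Attach LoRA adapters to the attention and MLP linear layers only.
model = FastModel.get_peft_model(
    model,
    r=32,                                     # adapter rank
    lora_alpha=16,                            # suits ~1B-parameter models
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
    use_rslora=True,                          # rescale updates by adapter rank
)
model.print_trainable_parameters()            # typically < 2% of all parameters
```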

Trainer Configuration

The processed dataset is passed to the trainer [00:25:29]. Key configurations include (a sketch follows the list):

  • Virtual batch size: A virtual batch size of 8 or more (per-device batch size × gradient accumulation) can be used on a T4 [00:25:42]; the per-device batch size itself might go up to 4 on a T4 [00:27:44].
  • Epochs: Even one epoch can yield results [00:26:00].
  • Warm-up steps: Controls how slowly the learning rate increases [00:26:05]. For small datasets (e.g., 5 total steps), reducing warm-up steps to one can be beneficial [00:26:11].
  • Learning rate: 2e-4 is typically suitable for a 1-billion parameter model [00:26:16].
  • Data type: Automatically selected (e.g., float16 on T4, bfloat16 on Hopper/Blackwell/Ampere GPUs) [00:26:20].
  • Optimizer: AdamW 8-bit optimizer reduces memory requirements [00:26:27].
  • Weight decay: Prevents overfitting [00:26:32].
  • Learning rate scheduler: Constant learning rate can be used [00:26:35].
  • Output directory: Specifies where model outputs are saved [00:26:39].
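
A configuration along these lines, sketched with the standard Hugging Face Trainer; `model` and `processed_dataset` refer to the objects from the previous steps, and the workshop notebook’s exact trainer class and values may differ slightly:

```python
import torch
from transformers import Trainer, TrainingArguments

bf16_ok = torch.cuda.is_bf16_supported()      # True on Ampere/Hopper/Blackwell, False on a T4

trainer = Trainer(
    model=model,
    train_dataset=processed_dataset,          # mapped dataset from the data-prep step
    args=TrainingArguments(
        per_device_train_batch_size=4,        # roughly what a T4 can hold
        gradient_accumulation_steps=2,        # virtual batch size of 8
        num_train_epochs=1,                   # even one epoch can yield results
        warmup_steps=1,                       # small dataset -> very short warm-up
        learning_rate=2e-4,                   # typical for a 1B-parameter model
        fp16=not bf16_ok,
        bf16=bf16_ok,
        optim="adamw_8bit",                   # 8-bit AdamW to reduce memory
        weight_decay=0.01,                    # mild regularization against overfitting
        lr_scheduler_type="constant",
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()
```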

Training can take about 10 minutes [00:26:57]. The loss should decrease significantly (e.g., from ~6.34 to ~3.7) [00:27:03].

Evaluation Data

Ideally, the training set should be split into train and eval sets to monitor eval_loss and grad_norm (should be around 1 or less) using tools like TensorBoard [00:28:36].
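
A simple way to carve out an eval split with the datasets library, reusing `processed_dataset` from the earlier steps (the 10% hold-out fraction is an arbitrary choice for illustration):

```python
# Hold out 10% of the rows for evaluation.
splits = processed_dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]

# Passing eval_ds as eval_dataset (with an evaluation strategy and
# report_to="tensorboard") surfaces eval_loss alongside the logged grad_norm.
```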

Performance Evaluation

Zero-Shot Inference

Without fine-tuning or voice cloning, the base model generates speech with a random speaker’s voice due to non-zero temperature [00:17:47]. This can result in a wide variety of voices (e.g., male, female, deep, high-pitched) [00:17:50].

Voice Cloning

Voice cloning involves passing an audio sample to the model, which then attempts to generate new text in a voice similar to the sample [00:18:00]. This yields a voice much closer to the sample than zero-shot inference, but still not as accurate as after fine-tuning [00:18:35].
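
Conceptually, the reference audio and its transcript are supplied as conversation context before the new text. The sketch below assumes the Hugging Face Transformers chat-template interface for CSM (the role labels, output_audio flag, and save_audio helper should be checked against the current documentation); `model` and `processor` come from the loading step, and the file names are placeholders:

```python
conversation = [
    # Context: a clip of the target voice plus its transcript.
    {"role": "0", "content": [
        {"type": "text", "text": "Transcript of the reference clip."},
        {"type": "audio", "path": "reference_clip.wav"},     # placeholder file
    ]},
    # Prompt: new text to be spoken in a similar voice.
    {"role": "0", "content": [{"type": "text", "text": "Hello, this is a cloned voice."}]},
]

inputs = processor.apply_chat_template(
    conversation, tokenize=True, return_dict=True,
).to(model.device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "cloned_output.wav")
```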

Fine-tuned Model Performance

After fine-tuning, the model’s output will have a consistent male voice (if trained on a male voice) [00:31:25]. While improved, issues like pacing or subtle accent characteristics might still be present, indicating potential for further data filtering or more data [00:31:34].

Fine-tuning combined with Voice Cloning

The best performance is typically achieved by combining fine-tuning with voice cloning [00:32:14]. This can produce very high-quality speech that closely matches the target voice, even with a relatively small amount of data (e.g., 30 minutes of video) [00:33:14].

Saving and Pushing Models

After training, the model can be saved locally or pushed to Hugging Face Hub [00:29:38].

  • Saving LoRA Adapters: Saving only the LoRA adapters is lightweight and creates a smaller repository [00:29:54]; see the sketch after this list.
  • Merging and Saving Full Model: To push the full model, the adapters are first merged into the base weights in a 16-bit format [00:30:00]. This saves both the model and processor [00:30:13].
  • Reloading a Fine-tuned Model: To reload a previously fine-tuned model, specify its name (e.g., trellis/my-YouTube-tts) instead of the base model name [00:30:31].
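
A sketch of these options, assuming Unsloth’s save and push helpers as used in its text-model examples (the merged-save method may differ slightly for CSM); the repository names are placeholders:

```python
# 1. Lightweight: save only the LoRA adapters (small repository).
model.save_pretrained("csm-1b-lora")
processor.save_pretrained("csm-1b-lora")

# 2. Merge the adapters into the base weights and push the full model in 16-bit.
model.push_to_hub_merged(
    "your-username/my-youtube-tts",           # placeholder repo ID
    processor,
    save_method="merged_16bit",
)

# 3. Later, reload the fine-tuned model by pointing at the pushed repository
#    instead of the base model name.
from unsloth import FastModel
model, processor = FastModel.from_pretrained("your-username/my-youtube-tts")
```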