From: aidotengineer
Token-based text-to-speech (TTS) models generate speech from text and can be fine-tuned to produce speech in a specific voice [00:00:30]. The goal of this approach is to train a model whose output sounds like a chosen voice [00:00:32].
How Token-Based TTS Models Work
These models operate similarly to token-based text models such as OpenAI’s ChatGPT (e.g., GPT-4o) or the Llama series [00:01:46]. Whereas text models take a string of text and recursively predict the next word or token [00:01:58], TTS models take a series of text tokens and predict the next “audio token” [00:02:08]. Ideally, they can also incorporate the history of text and audio from the conversation so far to recursively produce the next audio token [00:02:19].
The core idea is to use a transformer model that inputs text and audio, and outputs a string of audio tokens [00:02:36].
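As a toy illustration of this recursive loop, the sketch below uses a tiny stand-in PyTorch model (not a real TTS architecture): text tokens go in, the next audio token is picked greedily, appended, and the loop repeats.

```python
import torch
import torch.nn as nn

# Toy stand-in model: an embedding, one transformer layer, and a head that scores
# the next audio token. Real token-based TTS models are much larger and use a
# learned audio codebook, but the recursive loop has the same shape.
VOCAB, DIM = 2048, 64
embed = nn.Embedding(VOCAB, DIM)
body = nn.TransformerEncoder(nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=1)
audio_head = nn.Linear(DIM, VOCAB)  # scores over the audio-token vocabulary

text_tokens = torch.tensor([[12, 87, 345, 9]])    # pretend these encode a sentence
audio_tokens = []                                  # generated audio tokens accumulate here
for _ in range(5):                                 # generate five audio tokens
    seq = torch.cat([text_tokens] + [t.view(1, 1) for t in audio_tokens], dim=1)
    hidden = body(embed(seq))[:, -1]               # hidden state at the last position
    next_token = audio_head(hidden).argmax(-1)     # greedy choice of the next audio token
    audio_tokens.append(next_token)                # feed it back in on the next step
print([int(t) for t in audio_tokens])
```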
Audio Tokens
A key challenge is representing continuous sound waves as discrete tokens [00:02:46]. This is done by representing a small piece of audio as a choice from a “codebook” [00:03:01]. A codebook functions like a dictionary: it contains a series of vectors, each representing a specific sound [00:03:07]. These codebooks can be quite large so that they cover a wide range of sounds, capturing both acoustic and semantic information [00:03:18].
Codebooks are typically trained using an encoder-decoder structure [00:03:36]:
- A soundwave is taken as input.
- A transformer converts it into tokens.
- Another transformer decodes it back into a wave.
- During training, the model is optimized to minimize the difference between the input soundwave and the decoded output, adjusting weights through backpropagation [00:03:55].
While a single token per time step is possible, a hierarchical representation, with multiple tokens per time window, captures sound in more detail [00:04:31]. Multiple layers of representation allow higher-level meaning or acoustics to be encoded alongside more granular detail [00:04:51].
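The sketch below illustrates the codebook idea with a small residual-style quantizer: each level picks the nearest codebook entry to whatever the earlier levels have not yet explained, so coarse structure and fine detail end up in different tokens. The codebooks here are random stand-ins rather than learned ones.

```python
import torch

# Random stand-in codebooks: NUM_LEVELS dictionaries of CODEBOOK_SIZE vectors each.
# Real codebooks are learned jointly with an encoder/decoder, as described above.
NUM_LEVELS, CODEBOOK_SIZE, DIM = 4, 256, 16
codebooks = [torch.randn(CODEBOOK_SIZE, DIM) for _ in range(NUM_LEVELS)]

def quantize(frame: torch.Tensor):
    """Return one token per level plus the reconstruction those tokens encode."""
    tokens, recon = [], torch.zeros_like(frame)
    for level in codebooks:
        residual = frame - recon                                         # what is still unexplained
        idx = torch.cdist(residual.unsqueeze(0), level).argmin().item()  # nearest codebook entry
        tokens.append(idx)
        recon = recon + level[idx]                                       # coarse first, finer after
    return tokens, recon

frame = torch.randn(DIM)                              # embedding of one small audio window
tokens, recon = quantize(frame)
print(tokens, float((frame - recon).norm()))          # 4 tokens and the remaining error
```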
Sesame CSM1B Architecture
The Sesame model, specifically CSM1B, utilizes this hierarchical approach [00:05:06]. It uses 32 tokens at every audio window to represent sound [00:05:10].
- A “zeroth token” is predicted by the main transformer [00:05:16].
- A secondary transformer decodes the other 31 “stack tokens” [00:05:20].
Sesame consists of two models: the main 1-billion parameter model and a smaller model for decoding hierarchical tokens [00:05:38]. The full model architecture includes:
- A backbone model (the main transformer) [00:15:45].
- A depth decoder for the 31 additional tokens [00:15:51].
- A codec model to convert between waveforms and tokens [00:15:58].
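The following is an illustrative-only sketch of that split, with made-up module sizes and names (it is not the real CSM-1B implementation, and the codec step is only noted in a comment): a backbone predicts the zeroth codebook token for the next frame, and a small depth decoder fills in the remaining 31 tokens.

```python
import torch
import torch.nn as nn

# Made-up sizes for illustration only; CSM-1B's real dimensions and module names differ.
NUM_CODEBOOKS, CODEBOOK_SIZE, DIM = 32, 1024, 512

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True), num_layers=2
)                                                                  # the "main" transformer
zeroth_head = nn.Linear(DIM, CODEBOOK_SIZE)                        # predicts codebook 0
depth_decoder = nn.GRU(DIM, DIM, batch_first=True)                 # stand-in for the depth transformer
depth_heads = nn.ModuleList(nn.Linear(DIM, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS - 1))

def predict_frame(context: torch.Tensor) -> torch.Tensor:
    """Predict all 32 codebook tokens for the next audio frame from context embeddings."""
    h = backbone(context)[:, -1]                                   # backbone hidden state
    tokens = [zeroth_head(h).argmax(-1)]                           # token 0 from the backbone
    d, _ = depth_decoder(h.unsqueeze(1).repeat(1, NUM_CODEBOOKS - 1, 1))
    for i, head in enumerate(depth_heads):                         # tokens 1..31 from the depth decoder
        tokens.append(head(d[:, i]).argmax(-1))
    return torch.stack(tokens, dim=-1)                             # (batch, 32); a codec model would
                                                                   # decode these frames to a waveform

frame = predict_frame(torch.randn(1, 10, DIM))                     # e.g. 10 positions of context
print(frame.shape)                                                 # torch.Size([1, 32])
```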
Fine-tuning a Token-Based TTS Model
The process of fine-tuning a pre-trained model like CSM1B involves several steps to adapt it to a specific voice.
Data Preparation
To fine-tune, a voice dataset is needed [00:00:51]. This can be recorded or extracted from existing audio, such as a YouTube video [00:00:56]. It’s recommended to use a video with a single speaker to avoid complex diarization [00:08:09].
The data preparation process typically includes:
- Audio Extraction: Download audio from the chosen source (e.g., YouTube) [00:10:51].
- Transcription: Use a tool like OpenAI’s Whisper model (e.g., “turbo” size for efficiency) to transcribe the audio [00:07:06].
- Manual Correction: Review the generated transcript (often a JSON file) for misspellings and make corrections [00:09:58].
- Snippet Generation: Split the long transcript and audio into shorter chunks, ideally around 30 seconds long, by combining Whisper’s shorter segments [00:07:10]. Aim for roughly 50 snippets of 30 seconds each for a noticeable effect [00:07:43]. It is beneficial, though not strictly necessary for a basic setup, to end each snippet on a full stop to improve pacing [00:13:17]. A simple pipeline for these steps is sketched below.
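A minimal sketch of this pipeline, assuming yt-dlp for the download and the openai-whisper package for transcription (the URL and filenames are placeholders), might look like this:

```python
import yt_dlp
import whisper

URL = "https://www.youtube.com/watch?v=XXXXXXXXXXX"   # placeholder single-speaker video

# 1. Download the audio track as a WAV file.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "raw_audio.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([URL])

# 2. Transcribe with Whisper; each segment carries start/end timestamps and text.
asr = whisper.load_model("turbo")
result = asr.transcribe("raw_audio.wav")

# 3. Greedily merge Whisper's short segments into ~30-second snippets, preferring
#    to cut where a segment ends on a full stop.
snippets, current, start = [], [], None
for seg in result["segments"]:
    start = seg["start"] if start is None else start
    current.append(seg["text"].strip())
    if seg["end"] - start >= 30 and seg["text"].strip().endswith("."):
        snippets.append({"start": start, "end": seg["end"], "text": " ".join(current)})
        current, start = [], None
if current:  # keep whatever is left over at the end
    snippets.append({"start": start, "end": result["segments"][-1]["end"], "text": " ".join(current)})
print(len(snippets), "snippets")
```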
The resulting dataset will have audio snippets paired with their transcriptions [00:07:14]. If no speaker column is provided, a default speaker ID (e.g., ‘0’) is assigned [00:19:28].
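The snippets can then be sliced out of the audio and collected into a Hugging Face dataset. In this hedged sketch, soundfile does the slicing, 24 kHz is assumed as the target sampling rate, and the speaker column name (here speaker_id) is an assumption that may need to match whatever the training code expects.

```python
import soundfile as sf
from datasets import Audio, Dataset

# Slice each snippet out of the downloaded audio and write it as its own WAV file.
data, sr = sf.read("raw_audio.wav")
rows = []
for i, snip in enumerate(snippets):                    # `snippets` from the previous step
    clip = data[int(snip["start"] * sr): int(snip["end"] * sr)]
    path = f"snippet_{i:03d}.wav"
    sf.write(path, clip, sr)
    rows.append({"audio": path, "text": snip["text"], "speaker_id": 0})  # default speaker 0

# Collect the rows into a dataset; the Audio feature resamples on load.
ds = Dataset.from_list(rows)
ds = ds.cast_column("audio", Audio(sampling_rate=24_000))  # assumed codec sampling rate
print(ds)
```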
Model Loading and Adapters
The CSM1B model is loaded using libraries like Unsloth, which leverages Hugging Face Transformers [00:14:10]. The model is loaded with a specified maximum sequence length for text [00:14:17].
For efficient fine-tuning, LoRA (Low-Rank Adaptation) adapters are applied [00:22:18]. Instead of training all model parameters, only a small set of adapter parameters is trained [00:16:13]. These adapters are typically applied to the linear layers, such as the attention projections (query, value, output) and the MLP (multi-layer perceptron) linear layers [00:22:20]. This significantly saves memory and speeds up training [00:16:24]. For a 1-billion-parameter model, a LoRA alpha of 16 and a rank of 32 are suggested [00:22:29]. With this method, only a small percentage of parameters (e.g., under 2%) is trainable [00:23:25].
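A hedged sketch of loading and adapter setup with Unsloth is shown below; the model name, helper names, return values, and target-module list follow my reading of the Unsloth CSM notebook and may differ by version.

```python
from unsloth import FastModel

# Load the pretrained CSM-1B checkpoint (model name assumed) with a maximum text
# sequence length; the second return value may be a tokenizer or a processor
# depending on the Unsloth version.
model, processor = FastModel.from_pretrained(
    model_name="unsloth/csm-1b",
    max_seq_length=2048,
    load_in_4bit=False,
)

# Attach LoRA adapters to the attention and MLP linear layers only; the
# target-module names below are the usual ones for Llama-style transformers.
model = FastModel.get_peft_model(
    model,
    r=32,                       # LoRA rank suggested for a ~1B-parameter model
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model.print_trainable_parameters()   # typically well under 2% of all parameters
```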
Training Configuration
The processed dataset is passed to a trainer [00:25:29]. Key configurations include:
- Batch Size: An actual batch size of 2 can be combined with gradient accumulation to reach a virtual batch size of 8 [00:25:42].
- Epochs: Training for just one epoch can show significant effects [00:26:00].
- Warm-up Steps: A small number of warm-up steps (e.g., 1-3) can be used to slowly increase the learning rate [00:26:05].
- Learning Rate: A learning rate of 2e-4 is suitable for a 1-billion parameter model [00:26:16].
- Optimizer: AdamW 8-bit optimizer can reduce memory requirements [00:26:27].
- Weight Decay: Helps prevent overfitting [00:26:32].
- Data Type: Automatically set to float16 or bfloat16 (brain float16) depending on the GPU [00:26:20].
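A hedged sketch of this configuration using the Hugging Face TrainingArguments/Trainer API is shown below; the CSM-specific preprocessing and collation are glossed over (the Unsloth notebook wires those up), and the dataset split is illustrative.

```python
import torch
from transformers import Trainer, TrainingArguments

split = ds.train_test_split(test_size=0.1, seed=42)   # `ds` from the data-preparation step
                                                      # (CSM-specific preprocessing omitted here)

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,            # actual batch size of 2 ...
    gradient_accumulation_steps=4,            # ... for a virtual batch size of 8
    num_train_epochs=1,                       # one epoch already shows a clear effect
    warmup_steps=3,                           # slowly ramp the learning rate up
    learning_rate=2e-4,                       # suits a ~1B-parameter model
    optim="adamw_8bit",                       # 8-bit AdamW lowers optimizer memory
    weight_decay=0.01,                        # mild regularization against overfitting
    bf16=torch.cuda.is_bf16_supported(),      # bfloat16 on GPUs that support it ...
    fp16=not torch.cuda.is_bf16_supported(),  # ... otherwise float16
    logging_steps=1,
    report_to="tensorboard",                  # for monitoring loss curves
)

trainer = Trainer(
    model=model,                              # the LoRA-wrapped model from the previous step
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],               # lets you watch evaluation loss as well
)
trainer.train()
```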
During training, the loss typically decreases (e.g., from 6.34 to 3.7) [00:27:03]. Ideally, an evaluation dataset would be used to monitor evaluation loss and ensure it continues to fall [00:28:36].
Inference and Performance
Performance can be evaluated before and after fine-tuning.
Zero-Shot Inference (Base Model)
Without any fine-tuning, the base model performs “zero-shot inference” [00:17:43]. When given text, it generates audio in a random speaker’s voice, as the temperature is non-zero [00:17:46]. This can result in a wide variance in speaker characteristics (e.g., male/female, deep/high-pitched voices) [00:21:08].
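A hedged sketch of zero-shot generation via the Hugging Face Transformers CSM integration is shown below; the call names (output_audio, save_audio, the "[0]" speaker prefix) follow the published CSM examples as I understand them and may differ across versions.

```python
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# "[0]" prefixes the text with speaker id 0; with no fine-tuning and no audio
# context, sampling produces a different random voice on each run.
text = "[0]Token-based TTS models predict audio tokens from text."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)

audio = model.generate(**inputs, output_audio=True)   # decode straight to a waveform
processor.save_audio(audio, "zero_shot.wav")
```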
Voice Cloning
Voice cloning is different from fine-tuning [00:18:00]. With voice cloning, a sample audio of a desired voice is passed into the model, and the model then generates text-to-speech that sounds more like the provided sample [00:18:04]. While closer to the target voice than zero-shot inference, it’s still not as accurate as fine-tuning [00:18:42].
Fine-Tuned Model Inference
After fine-tuning, the model generates speech that sounds like the target voice from the training data [00:31:10]. The randomness of voice characteristics is removed, and the model consistently produces a voice similar to the one it was fine-tuned on [00:31:28]. Some remaining issues might include pacing or slight accent differences, which could be improved by better filtering of the original dataset [00:31:34].
The best performance is typically achieved when combining fine-tuning with voice cloning [00:32:14]. This combination can yield “pretty good performance” even with a relatively small amount of data, such as a 30-minute video [00:33:17].
Further Improvements and Considerations
- More Data: For better performance, especially without voice cloning, creating a larger dataset (e.g., 500 rows of 30-second snippets) is recommended [00:33:04].
- Data Quality: Improving data preparation by ensuring snippets end on full stops or filtering out excessive pauses can enhance output quality [00:13:17].
- Monitoring Training: Using tools like TensorBoard to monitor evaluation loss and gradient norms can help optimize training and prevent overfitting [00:28:57].
- Model Saving: Fine-tuned models can be saved locally or pushed to model hubs like Hugging Face, either as lightweight LoRA adapters or as a merged, full model [00:29:38].
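As a hedged sketch of these saving options (using generic PEFT/Transformers calls; Unsloth also provides its own merged-save helpers, and the repository names are hypothetical):

```python
# Option 1: save/push only the LoRA adapters (a few megabytes).
model.save_pretrained("csm-voice-lora")
processor.save_pretrained("csm-voice-lora")
model.push_to_hub("your-username/csm-voice-lora")       # hypothetical repo name

# Option 2: merge the adapters into the base weights for a standalone model.
merged = model.merge_and_unload()                       # generic PEFT merge
merged.save_pretrained("csm-voice-merged")              # full model, much larger on disk
merged.push_to_hub("your-username/csm-voice-merged")    # hypothetical repo name
```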