From: aidotengineer
Token-based text-to-speech (TTS) models are designed to convert text into speech by predicting sequences of audio tokens [02:10:00]. This approach is conceptually similar to how large language models predict the next word or token in a text sequence [01:58:00].
How Token-Based Models Work
A token-based text-to-speech model takes in a series of text tokens and predicts the next audio token [02:10:00]. Ideally, it can also incorporate a history of both text and audio from the previous conversation, autoregressively decoding one audio token after another until it has produced a string of audio tokens that represents the desired speech [02:19:00].
At the core of this process is a transformer model that, ideally, takes both text and audio as input and outputs a string of audio tokens [02:40:00].
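To make the decoding loop concrete, here is a minimal, runnable PyTorch sketch. The tiny model, codebook size, and sampling scheme are placeholders chosen for illustration, not the architecture of any real TTS model; the point is only the shape of the loop: text tokens plus the audio tokens generated so far go in, one new audio token comes out, and the process repeats.

```python
import torch
import torch.nn as nn

VOCAB = 1024          # placeholder codebook/vocabulary size, not a real model's
D_MODEL = 64          # tiny embedding size for illustration only

class ToyAudioTokenLM(nn.Module):
    """Stand-in for the real transformer: embeds a mixed text/audio
    token sequence and predicts logits over the audio codebook."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):                    # tokens: (batch, seq)
        h = self.backbone(self.embed(tokens))
        return self.head(h[:, -1])                # logits for the next token

model = ToyAudioTokenLM()
text_tokens = torch.randint(0, VOCAB, (1, 12))    # pretend-tokenized text
audio_tokens = []

# Autoregressive loop: condition on the text plus the audio generated so far,
# sample the next audio token, append it, and repeat.
for _ in range(20):
    context = torch.cat(
        [text_tokens] + ([torch.tensor([audio_tokens])] if audio_tokens else []),
        dim=1,
    )
    logits = model(context)
    next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
    audio_tokens.append(next_token)

print(audio_tokens)  # a string of audio tokens a codec would turn into sound
```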
Audio Token Representation and Codebooks
A fundamental challenge for token-based text-to-speech models is how to create discrete tokens for continuous audio soundwaves [02:48:00].
Codebooks
Audio, or small pieces of audio, can be represented as a selection from a codebook [03:01:00]. A codebook acts like a dictionary containing a series of vectors, where each vector represents a specific sound [03:07:00]. These codebooks can be quite large to capture a wide range of acoustics and semantics [03:18:00]. This allows sound to be represented discretely, similar to how text discretely represents meaning [03:26:00].
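Concretely, quantizing one frame of continuous audio features against a codebook is a nearest-vector lookup. The sketch below uses arbitrary sizes and a random codebook purely for illustration:

```python
import torch

codebook = torch.randn(2048, 128)   # 2048 entries, 128-dim vectors (arbitrary sizes)
frame = torch.randn(128)            # one continuous audio feature frame

# The discrete "audio token" is simply the index of the closest codebook vector.
distances = ((codebook - frame) ** 2).sum(dim=-1)
token = distances.argmin().item()

# Decoding reverses the lookup: index -> vector -> (eventually) waveform.
reconstructed = codebook[token]
print(token, reconstructed.shape)
```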
Training Codebooks
Codebooks are typically trained using an encoder-decoder architecture [03:36:00] (a code sketch follows the list below).
- A soundwave is taken as input [03:39:00].
- A transformer converts the soundwave into tokens [03:45:00].
- Another transformer decodes these tokens back into a soundwave [03:47:00].
- During training, the system assesses whether the decoded output matches the original input [03:55:00]. Any difference (loss) is used to back-propagate and update the model’s weights [04:03:00].
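The sketch below shows one such training step, following the standard VQ-VAE recipe: encode, snap to the nearest codebook vector, decode, and back-propagate the reconstruction loss. The tiny MLPs stand in for the transformers described in the talk, the straight-through estimator lets gradients pass the discrete lookup, and all sizes and hyperparameters are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes for illustration; real codec models are far larger.
FRAME, D, CODES = 320, 64, 1024

encoder  = nn.Sequential(nn.Linear(FRAME, D), nn.ReLU(), nn.Linear(D, D))
decoder  = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, FRAME))
codebook = nn.Embedding(CODES, D)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()) + list(codebook.parameters()),
    lr=1e-3,
)

wave = torch.randn(8, FRAME)                 # a batch of raw audio frames

# 1. Encode the soundwave into continuous latents.
z = encoder(wave)

# 2. Snap each latent to its nearest codebook vector (the discrete token).
dists = torch.cdist(z, codebook.weight)      # (batch, CODES)
tokens = dists.argmin(dim=-1)
q = codebook(tokens)

# Straight-through estimator: copy gradients past the non-differentiable
# argmin so the encoder still gets a learning signal.
q_st = z + (q - z).detach()

# 3. Decode the tokens back into a soundwave.
recon = decoder(q_st)

# 4. Compare output to input; the difference (loss) updates the weights.
loss = F.mse_loss(recon, wave) + F.mse_loss(q, z.detach())
opt.zero_grad()
loss.backward()
opt.step()
```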
Hierarchical Representation
For better sound representation, multiple tokens are used at the same timestamp or window instead of just one token per time step, forming a hierarchical representation [04:26:00]. Higher-level meaning or acoustics can be captured at one layer while more granular detail is captured at other layers, giving a more detailed, multi-layer representation of the sound [04:47:00].
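Residual vector quantization is one common way to obtain several tokens per window: each level quantizes whatever the previous level failed to capture. A minimal sketch with arbitrary sizes and random codebooks:

```python
import torch

LEVELS, CODES, D = 4, 256, 64          # arbitrary: 4 codebooks of 256 vectors each
codebooks = [torch.randn(CODES, D) for _ in range(LEVELS)]
frame = torch.randn(D)                  # one analysis window of audio features

tokens, residual = [], frame.clone()
for cb in codebooks:
    idx = ((cb - residual) ** 2).sum(-1).argmin().item()
    tokens.append(idx)                  # coarse levels capture broad acoustics,
    residual = residual - cb[idx]       # later levels capture finer detail

print(tokens)   # several tokens for the same timestamp, from coarse to fine
```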
Sesame CSM1B Model
The Sesame CSM1B model is a specific example of a token-based text-to-speech model [00:40:00]. It utilizes 32 tokens at every window to represent sound [05:06:00].
Its architecture includes (see the sketch after this list):
- Main Transformer: Predicts the “zeroth” token auto-regressively [05:16:00]. This is a 1-billion-parameter model [05:40:00].
- Secondary Transformer (Depth Decoder): Decodes the other 31 “stack” tokens, handling the hierarchical representation [05:20:00]. This is a much smaller model [05:42:00].
- Codec Model: Manages the conversion between waveform and tokens [05:58:00].
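The division of labour can be pictured roughly as follows. The two functions are hypothetical stand-ins that return random logits; they are not the actual CSM API, only an illustration of how the backbone and depth decoder share the work of producing one 32-token frame:

```python
import torch

NUM_CODEBOOKS = 32          # CSM1B uses 32 tokens per window
CODES = 1024                # placeholder codebook size

# Hypothetical stand-ins for the two transformers; they return random logits.
def backbone_logits(context_tokens):          # ~1B-parameter main transformer
    return torch.randn(CODES)

def depth_decoder_logits(zeroth, prev_stack): # much smaller depth decoder
    return torch.randn(CODES)

def generate_frame(context_tokens):
    """Produce the 32 codebook tokens for one audio window."""
    # The main transformer autoregressively predicts the zeroth token
    # from the text/audio context.
    zeroth = backbone_logits(context_tokens).argmax().item()

    # The depth decoder then fills in the remaining 31 "stack" tokens
    # that make up the hierarchical representation of this window.
    frame = [zeroth]
    for _ in range(NUM_CODEBOOKS - 1):
        frame.append(depth_decoder_logits(zeroth, frame).argmax().item())
    return frame            # 32 tokens the codec model turns into waveform

print(len(generate_frame(context_tokens=[])))   # -> 32
```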
Fine-tuning and Voice Cloning
Pre-trained token-based text-to-speech models like Sesame CSM1B can take text and audio to output a stream of audio [05:54:00].
Data Preparation
For fine-tuning, a voice dataset is needed [00:51:00]. This can be recorded by oneself, or audio can be pulled from a YouTube video and used as a basis for fine-tuning [00:56:00].
The data generation process involves (a code sketch follows this list):
- Selecting a YouTube video, ideally with a single speaker to avoid complex diarization [08:04:00].
- Using Whisper (e.g., “turbo” model) to transcribe the video [07:06:00].
- Manually correcting the transcript (e.g., misspelled words) [10:03:00].
- Converting the transcript into a dataset of audio snippets (e.g., up to 30 seconds) and their corresponding text transcriptions [07:10:00]. Approximately 50 such 30-second snippets are often enough to start seeing an effect [07:42:00]. These are stored with an audio column and a text column, and optionally a source column for speaker ID [19:11:00].
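A sketch of this pipeline is shown below, assuming the openai-whisper package and Hugging Face datasets. The audio path is a placeholder, and the actual slicing of the audio column from the start/end times (e.g., with torchaudio or pydub) is omitted for brevity:

```python
import whisper
from datasets import Dataset

# Placeholder path; in the talk the audio comes from a single-speaker YouTube video.
AUDIO_PATH = "talk_audio.mp3"

# 1. Transcribe with Whisper's "turbo" model; each segment carries start/end times.
model = whisper.load_model("turbo")
result = model.transcribe(AUDIO_PATH)

# 2. Greedily merge Whisper segments into snippets of up to ~30 seconds.
rows, start, text = [], None, ""
for seg in result["segments"]:
    if start is None:
        start = seg["start"]
    text += seg["text"]
    if seg["end"] - start >= 30:
        rows.append({"start": start, "end": seg["end"],
                     "text": text.strip(), "source": 0})   # source = speaker ID
        start, text = None, ""

# 3. Roughly 50 such rows is enough to start seeing an effect; the audio column
#    itself would be cut from the file using these start/end times.
dataset = Dataset.from_list(rows)
print(dataset)
```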
Fine-tuning with Unsloth
Fine-tuning is performed using libraries like Unsloth, which builds on the Hugging Face transformers library [01:03:00]. This involves (see the sketch after the list):
- Loading the base model (e.g., CSM1B) [14:10:00].
- Applying LoRA adapters to train a subset of parameters, typically focusing on linear layers [22:18:00]. This saves memory and speeds up training [22:24:00].
- Preparing the processed dataset for the trainer, including setting maximum text and audio lengths [24:11:00].
- Configuring the trainer with parameters like virtual batch size, number of epochs (e.g., one epoch for initial tests), learning rate, and optimizer [25:31:00].
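A sketch of this setup is shown below. It follows the general pattern of Unsloth's TTS notebooks, but the model name, arguments, and hyperparameters are assumptions that may differ between Unsloth versions, and processed_dataset stands for the dataset prepared in the previous step:

```python
from unsloth import FastModel
from transformers import Trainer, TrainingArguments

# Load the base CSM1B model (exact arguments may differ between Unsloth versions).
model, processor = FastModel.from_pretrained(
    "unsloth/csm-1b",
    max_seq_length=2048,
    load_in_4bit=False,
)

# Attach LoRA adapters so only a small subset of (mostly linear-layer)
# parameters is trained, which saves memory and speeds up training.
model = FastModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Trainer configuration roughly matching the talk: small virtual batch via
# gradient accumulation, one epoch for a first test, an 8-bit AdamW optimizer.
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # "virtual" batch size of 8
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_8bit",
    output_dir="csm1b-finetune",
)

# processed_dataset: the audio/text dataset prepared in the data step above.
trainer = Trainer(model=model, args=args, train_dataset=processed_dataset)
trainer.train()
```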
Voice Cloning
Voice cloning is distinct from fine-tuning [18:02:00]. In voice cloning, an audio sample is passed into the model, and the model then generates new audio that tends to sound more like the provided sample [18:04:00]. This allows for a more consistent speaker voice without explicit fine-tuning [17:56:00].
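A sketch of voice cloning with the reference generator from the Sesame CSM repository is shown below; the load_csm_1b and Segment names follow that repository's examples, and the file names and transcript are placeholders:

```python
import torchaudio
from generator import load_csm_1b, Segment   # from the Sesame CSM reference repo

generator = load_csm_1b(device="cuda")

# Load a short reference clip of the target voice along with its transcript.
ref_audio, sr = torchaudio.load("reference_clip.wav")          # placeholder file
ref_audio = torchaudio.functional.resample(
    ref_audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
)
context = [Segment(text="Transcript of the reference clip.", speaker=0, audio=ref_audio)]

# Generation conditioned on the context tends to sound like the reference voice,
# even without any fine-tuning.
audio = generator.generate(
    text="This sentence should come out in the cloned voice.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("cloned.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```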
While voice cloning improves similarity to a target voice, combining fine-tuning with cloning often yields the best results, even with relatively small datasets (e.g., a 30-minute video) [31:14:00]. More data (e.g., 500 rows of 30-second snippets) helps further, particularly for quality when voice cloning is not used [33:04:00].