From: aidotengineer
Token-based text-to-speech (TTS) models are designed to convert text into speech by predicting sequences of audio tokens [02:10:00]. This approach is conceptually similar to how large language models predict the next word or token in a text sequence [01:58:00].
How Token-Based Models Work
A token-based text-to-speech model takes in a series of text tokens and predicts the next audio token [02:10:00]. Ideally, it can also incorporate a history of both text and audio from the previous conversation, autoregressively decoding one audio token after another until it has produced a string of audio tokens that represents the desired speech [02:19:00].
At the core of this process is a transformer model that, ideally, takes both text and audio as input and outputs a string of audio tokens [02:40:00].
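To make the decoding loop concrete, here is a minimal, runnable PyTorch sketch. The tiny model, codebook size, and sampling scheme are placeholders chosen for illustration, not the architecture of any real TTS model; the point is only the shape of the loop: text tokens plus the audio tokens generated so far go in, one new audio token comes out, and the process repeats.

```python
import torch
import torch.nn as nn

VOCAB = 1024          # placeholder codebook/vocabulary size, not a real model's
D_MODEL = 64          # tiny embedding size for illustration only

class ToyAudioTokenLM(nn.Module):
    """Stand-in for the real transformer: embeds a mixed text/audio
    token sequence and predicts logits over the audio codebook."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):                    # tokens: (batch, seq)
        h = self.backbone(self.embed(tokens))
        return self.head(h[:, -1])                # logits for the next token

model = ToyAudioTokenLM()
text_tokens = torch.randint(0, VOCAB, (1, 12))    # pretend-tokenized text
audio_tokens = []

# Autoregressive loop: condition on the text plus the audio generated so far,
# sample the next audio token, append it, and repeat.
for _ in range(20):
    context = torch.cat(
        [text_tokens] + ([torch.tensor([audio_tokens])] if audio_tokens else []),
        dim=1,
    )
    logits = model(context)
    next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
    audio_tokens.append(next_token)

print(audio_tokens)  # a string of audio tokens a codec would turn into sound
```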
Audio Token Representation and Codebooks
A fundamental challenge for token-based text-to-speech models is how to create discrete tokens for continuous audio soundwaves [02:48:00].
Codebooks
Audio, or small pieces of audio, can be represented as a selection from a codebook [03:01:00]. A codebook acts like a dictionary containing a series of vectors, where each vector represents a specific sound [03:07:00]. These codebooks can be quite large to capture a wide range of acoustics and semantics [03:18:00]. This allows sound to be represented discretely, similar to how text discretely represents meaning [03:26:00].
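Concretely, quantizing one frame of continuous audio features against a codebook is a nearest-vector lookup. The sketch below uses arbitrary sizes and a random codebook purely for illustration:

```python
import torch

codebook = torch.randn(2048, 128)   # 2048 entries, 128-dim vectors (arbitrary sizes)
frame = torch.randn(128)            # one continuous audio feature frame

# The discrete "audio token" is simply the index of the closest codebook vector.
distances = ((codebook - frame) ** 2).sum(dim=-1)
token = distances.argmin().item()

# Decoding reverses the lookup: index -> vector -> (eventually) waveform.
reconstructed = codebook[token]
print(token, reconstructed.shape)
```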
Training Codebooks
Codebooks are typically trained using an encoder-decoder architecture [03:36:00] (a code sketch follows the list below).
- A soundwave is taken as input [03:39:00].
- A transformer converts the soundwave into tokens [03:45:00].
- Another transformer decodes these tokens back into a soundwave [03:47:00].
- During training, the system assesses whether the decoded output matches the original input [03:55:00]. Any difference (loss) is used to back-propagate and update the model’s weights [04:03:00].
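The sketch below shows one such training step, following the standard VQ-VAE recipe: encode, snap to the nearest codebook vector, decode, and back-propagate the reconstruction loss. The tiny MLPs stand in for the transformers described in the talk, the straight-through estimator lets gradients pass the discrete lookup, and all sizes and hyperparameters are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes for illustration; real codec models are far larger.
FRAME, D, CODES = 320, 64, 1024

encoder  = nn.Sequential(nn.Linear(FRAME, D), nn.ReLU(), nn.Linear(D, D))
decoder  = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, FRAME))
codebook = nn.Embedding(CODES, D)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()) + list(codebook.parameters()),
    lr=1e-3,
)

wave = torch.randn(8, FRAME)                 # a batch of raw audio frames

# 1. Encode the soundwave into continuous latents.
z = encoder(wave)

# 2. Snap each latent to its nearest codebook vector (the discrete token).
dists = torch.cdist(z, codebook.weight)      # (batch, CODES)
tokens = dists.argmin(dim=-1)
q = codebook(tokens)

# Straight-through estimator: copy gradients past the non-differentiable
# argmin so the encoder still gets a learning signal.
q_st = z + (q - z).detach()

# 3. Decode the tokens back into a soundwave.
recon = decoder(q_st)

# 4. Compare output to input; the difference (loss) updates the weights.
loss = F.mse_loss(recon, wave) + F.mse_loss(q, z.detach())
opt.zero_grad()
loss.backward()
opt.step()
```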
Hierarchical Representation
For better sound representation, multiple tokens are used at the same timestamp or window instead of just one token per time step, forming a hierarchical representation [04:26:00]. Higher-level meaning or acoustics can be captured at one layer while more granular detail is captured at other layers, giving a more detailed, multi-layer representation of the sound [04:47:00].
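Residual vector quantization is one common way to obtain several tokens per window: each level quantizes whatever the previous level failed to capture. A minimal sketch with arbitrary sizes and random codebooks:

```python
import torch

LEVELS, CODES, D = 4, 256, 64          # arbitrary: 4 codebooks of 256 vectors each
codebooks = [torch.randn(CODES, D) for _ in range(LEVELS)]
frame = torch.randn(D)                  # one analysis window of audio features

tokens, residual = [], frame.clone()
for cb in codebooks:
    idx = ((cb - residual) ** 2).sum(-1).argmin().item()
    tokens.append(idx)                  # coarse levels capture broad acoustics,
    residual = residual - cb[idx]       # later levels capture finer detail

print(tokens)   # several tokens for the same timestamp, from coarse to fine
```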
Sesame CSM1B Model
The Sesame CSM1B model is a specific example of a token-based text-to-speech model [00:40:00]. It utilizes 32 tokens at every window to represent sound [05:06:00].
Its architecture includes (see the sketch after this list):
- Main Transformer: Predicts the “zeroth” token auto-regressively [05:16:00]. This is a 1-billion-parameter model [05:40:00].
- Secondary Transformer (Depth Decoder): Decodes the other 31 “stack” tokens, handling the hierarchical representation [05:20:00]. This is a much smaller model [05:42:00].
- Codec Model: Manages the conversion between waveform and tokens [05:58:00].
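The division of labour can be pictured roughly as follows. The two functions are hypothetical stand-ins that return random logits; they are not the actual CSM API, only an illustration of how the backbone and depth decoder share the work of producing one 32-token frame:

```python
import torch

NUM_CODEBOOKS = 32          # CSM1B uses 32 tokens per window
CODES = 1024                # placeholder codebook size

# Hypothetical stand-ins for the two transformers; they return random logits.
def backbone_logits(context_tokens):          # ~1B-parameter main transformer
    return torch.randn(CODES)

def depth_decoder_logits(zeroth, prev_stack): # much smaller depth decoder
    return torch.randn(CODES)

def generate_frame(context_tokens):
    """Produce the 32 codebook tokens for one audio window."""
    # The main transformer autoregressively predicts the zeroth token
    # from the text/audio context.
    zeroth = backbone_logits(context_tokens).argmax().item()

    # The depth decoder then fills in the remaining 31 "stack" tokens
    # that make up the hierarchical representation of this window.
    frame = [zeroth]
    for _ in range(NUM_CODEBOOKS - 1):
        frame.append(depth_decoder_logits(zeroth, frame).argmax().item())
    return frame            # 32 tokens the codec model turns into waveform

print(len(generate_frame(context_tokens=[])))   # -> 32
```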
Fine-tuning and Voice Cloning
Pre-trained token-based text-to-speech models like Sesame CSM1B can take text and audio to output a stream of audio [05:54:00].
Data Preparation
For fine-tuning, a voice dataset is needed [00:51:00]. This can be recorded by oneself, or audio can be pulled from a YouTube video and used as a basis for fine-tuning [00:56:00].
The data generation process involves (a code sketch follows this list):
- Selecting a YouTube video, ideally with a single speaker to avoid complex diarization [08:04:00].
- Using Whisper (e.g., “turbo” model) to transcribe the video [07:06:00].
- Manually correcting the transcript (e.g., misspelled words) [10:03:00].
- Converting the transcript into a dataset of audio snippets (e.g., up to 30 seconds) and their corresponding text transcriptions [07:10:00]. Approximately 50 such 30-second snippets are often enough to start seeing an effect [07:42:00]. These are stored with an audio column and a text column, and optionally a source column for speaker ID [19:11:00].
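A sketch of this pipeline is shown below, assuming the openai-whisper package and Hugging Face datasets. The audio path is a placeholder, and the actual slicing of the audio column from the start/end times (e.g., with torchaudio or pydub) is omitted for brevity:

```python
import whisper
from datasets import Dataset

# Placeholder path; in the talk the audio comes from a single-speaker YouTube video.
AUDIO_PATH = "talk_audio.mp3"

# 1. Transcribe with Whisper's "turbo" model; each segment carries start/end times.
model = whisper.load_model("turbo")
result = model.transcribe(AUDIO_PATH)

# 2. Greedily merge Whisper segments into snippets of up to ~30 seconds.
rows, start, text = [], None, ""
for seg in result["segments"]:
    if start is None:
        start = seg["start"]
    text += seg["text"]
    if seg["end"] - start >= 30:
        rows.append({"start": start, "end": seg["end"],
                     "text": text.strip(), "source": 0})   # source = speaker ID
        start, text = None, ""

# 3. Roughly 50 such rows is enough to start seeing an effect; the audio column
#    itself would be cut from the file using these start/end times.
dataset = Dataset.from_list(rows)
print(dataset)
```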
Fine-tuning with Unsloth
Fine-tuning is performed using libraries like Unsloth, which builds on the Hugging Face transformers library [01:03:00]. This involves (see the sketch after the list):
- Loading the base model (e.g., CSM1B) [14:10:00].
- Applying LoRA adapters to train a subset of parameters, typically focusing on linear layers [22:18:00]. This saves memory and speeds up training [22:24:00].
- Preparing the processed dataset for the trainer, including setting maximum text and audio lengths [24:11:00].
- Configuring the trainer with parameters like virtual batch size, number of epochs (e.g., one epoch for initial tests), learning rate, and optimizer [25:31:00].
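A sketch of this setup is shown below. It follows the general pattern of Unsloth's TTS notebooks, but the model name, arguments, and hyperparameters are assumptions that may differ between Unsloth versions, and processed_dataset stands for the dataset prepared in the previous step:

```python
from unsloth import FastModel
from transformers import Trainer, TrainingArguments

# Load the base CSM1B model (exact arguments may differ between Unsloth versions).
model, processor = FastModel.from_pretrained(
    "unsloth/csm-1b",
    max_seq_length=2048,
    load_in_4bit=False,
)

# Attach LoRA adapters so only a small subset of (mostly linear-layer)
# parameters is trained, which saves memory and speeds up training.
model = FastModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Trainer configuration roughly matching the talk: small virtual batch via
# gradient accumulation, one epoch for a first test, an 8-bit AdamW optimizer.
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # "virtual" batch size of 8
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_8bit",
    output_dir="csm1b-finetune",
)

# processed_dataset: the audio/text dataset prepared in the data step above.
trainer = Trainer(model=model, args=args, train_dataset=processed_dataset)
trainer.train()
```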
Voice Cloning
Voice cloning is distinct from fine-tuning [18:02:00]. In voice cloning, an audio sample is passed into the model, and the model then generates new audio that tends to sound more like the provided sample [18:04:00]. This allows for a more consistent speaker voice without explicit fine-tuning [17:56:00].
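A sketch of voice cloning with the reference generator from the Sesame CSM repository is shown below; the load_csm_1b and Segment names follow that repository's examples, and the file names and transcript are placeholders:

```python
import torchaudio
from generator import load_csm_1b, Segment   # from the Sesame CSM reference repo

generator = load_csm_1b(device="cuda")

# Load a short reference clip of the target voice along with its transcript.
ref_audio, sr = torchaudio.load("reference_clip.wav")          # placeholder file
ref_audio = torchaudio.functional.resample(
    ref_audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
)
context = [Segment(text="Transcript of the reference clip.", speaker=0, audio=ref_audio)]

# Generation conditioned on the context tends to sound like the reference voice,
# even without any fine-tuning.
audio = generator.generate(
    text="This sentence should come out in the cloned voice.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("cloned.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```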
While voice cloning improves similarity to a target voice, combining fine-tuning with cloning often yields the best results, even with relatively small datasets (e.g., a 30-minute video) [31:14:00]. More data (e.g., 500 rows of 30-second snippets) helps further, particularly for quality when voice cloning is not used [33:04:00].