From: aidotengineer
This article explores the concepts of text-to-speech (TTS) model fine-tuning and voice cloning, focusing on techniques and practical applications. The goal of fine-tuning a TTS model is to make it sound like a specific voice [00:00:30].
Understanding Token-Based Text-to-Speech Models
Token-based text-to-speech models, such as Sesame CSM1B and Orpheus from Canopy Labs, function similarly to text models like GPT-4o or LLaMA [00:00:38] [00:01:45]. While text models predict the next word or token from a text string, TTS models aim to take a series of text tokens and predict the next audio token [00:02:08].
Audio is represented discretely by choosing entries from a “codebook,” which acts like a dictionary in which each vector corresponds to a specific sound [00:03:01] [00:03:07]. These codebooks can be large so that they cover a wide range of sounds, capturing both acoustic and semantic information [00:03:18] [00:03:26]. The codebooks are trained with an encoder-decoder framework: a soundwave is encoded into tokens and then decoded back into a wave, and the system learns by minimizing the difference between the original input soundwave and the reconstructed output [00:03:36].
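To make the encoder-decoder idea concrete, here is a minimal, hypothetical PyTorch sketch of a toy codec (the ToyAudioCodec name, frame size, and dimensions are invented for illustration, not taken from the talk): waveform frames are encoded, snapped to the nearest codebook vector (the discrete audio token), decoded back to a waveform, and trained against a reconstruction loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAudioCodec(nn.Module):
    """Toy encoder/codebook/decoder illustrating discrete audio tokens."""

    def __init__(self, frame_size=320, dim=64, codebook_size=1024):
        super().__init__()
        # Encoder: one waveform frame -> continuous latent vector.
        self.encoder = nn.Sequential(nn.Linear(frame_size, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Codebook: each row is a vector standing for one "sound".
        self.codebook = nn.Embedding(codebook_size, dim)
        # Decoder: quantized latent -> reconstructed waveform frame.
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, frame_size))

    def forward(self, frames):
        z = self.encoder(frames)                          # (batch, dim)
        # Pick the nearest codebook entry for each frame (the "audio token").
        dists = torch.cdist(z, self.codebook.weight)      # (batch, codebook_size)
        tokens = dists.argmin(dim=-1)                     # discrete token ids
        z_q = self.codebook(tokens)
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        recon = self.decoder(z_q)
        return recon, tokens

codec = ToyAudioCodec()
frames = torch.randn(8, 320)                              # fake waveform frames
recon, tokens = codec(frames)
loss = F.mse_loss(recon, frames)                          # minimize input/output difference
loss.backward()                                           # (a real codec also trains the codebook,
                                                          #  e.g., via a commitment loss)
```

Real neural codecs are far more elaborate; this only illustrates the reconstruction objective described above.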
A hierarchical representation, allowing multiple tokens per time step, generally works better for representing sound in token-based audio models [00:04:31]. The Sesame model, for instance, uses 32 tokens at every audio window to represent sound, with a main transformer predicting the zeroth token and a secondary transformer decoding the remaining 31 tokens [00:05:06] [00:05:16]. The Sesame model consists of a 1 billion parameter main model and a smaller model for decoding hierarchical tokens [00:05:38].
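As a rough illustration of this hierarchy (the module names, vocabulary size, and dimensions below are invented for the sketch, not Sesame's actual architecture): at each audio frame, a backbone predicts codebook 0 from the history, and a smaller decoder then fills in codebooks 1 through 31 conditioned on that zeroth token.

```python
import torch
import torch.nn as nn

NUM_CODEBOOKS = 32   # tokens per audio frame (Sesame uses 32)
VOCAB = 1024         # entries per codebook (assumed size for this toy)
DIM = 256

# Hypothetical stand-ins for the main model and the smaller decoder.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
zeroth_head = nn.Linear(DIM, VOCAB)                       # predicts codebook 0
decoder_heads = nn.ModuleList(
    [nn.Linear(DIM + DIM, VOCAB) for _ in range(NUM_CODEBOOKS - 1)])
token_embed = nn.Embedding(VOCAB, DIM)

history = torch.randn(1, 20, DIM)                         # embedded text + audio context
hidden = backbone(history)[:, -1]                         # state for the next audio frame

# Main transformer predicts the zeroth token of the next frame...
tok0 = zeroth_head(hidden).argmax(-1)

# ...and the secondary decoder produces the remaining 31 tokens of that frame.
frame_tokens = [tok0]
cond = token_embed(tok0)
for head in decoder_heads:
    frame_tokens.append(head(torch.cat([hidden, cond], dim=-1)).argmax(-1))

print(torch.stack(frame_tokens, dim=-1).shape)            # (1, 32): one hierarchical frame
```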
Data Preparation for Fine-Tuning
To fine-tune a TTS model, a voice dataset is required [00:00:51]. This can be created by recording a voice or, as demonstrated, by extracting audio from a YouTube video [00:00:56].
The process involves:
- Selecting Audio Source: Choosing a YouTube video, ideally with a single speaker to simplify data processing [00:08:06] [00:08:11].
- Transcription: Using Whisper (e.g., the “turbo” model for speed) to transcribe the audio into text [00:07:06] [00:08:33]. The transcription is typically saved as a JSON file [00:09:18].
- Manual Correction: Reviewing and correcting any misspelled words or inaccuracies in the transcribed text [00:09:58].
- Segmenting Data: Combining short segments from the Whisper transcription into longer chunks, typically up to 30 seconds in length, to form rows of data [00:11:02] [00:11:18]. Approximately 50 such 30-second snippets are generally sufficient to noticeably affect output quality [00:07:42].
- Dataset Structure: The final dataset should have an audio column and a text column, and optionally a source column for the speaker ID (e.g., 0 for a single speaker) [00:19:11]. A sketch of this pipeline follows the list.
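A minimal sketch of the data-preparation pipeline, assuming the openai-whisper and Hugging Face datasets packages (the file names and the 24 kHz sampling rate are placeholders): transcribe, merge Whisper segments into chunks of at most 30 seconds, and build a dataset with audio, text, and source columns.

```python
import whisper
from datasets import Dataset, Audio

# 1. Transcribe the extracted YouTube audio with Whisper's "turbo" model.
model = whisper.load_model("turbo")
result = model.transcribe("speaker.wav")          # placeholder file name

# 2. Merge Whisper's short segments into chunks of at most ~30 seconds.
chunks, current, start = [], [], None
for seg in result["segments"]:
    start = seg["start"] if start is None else start
    if current and seg["end"] - start > 30.0:
        chunks.append({"start": start, "end": current[-1]["end"],
                       "text": " ".join(s["text"].strip() for s in current)})
        current, start = [], seg["start"]
    current.append(seg)
if current:
    chunks.append({"start": start, "end": current[-1]["end"],
                   "text": " ".join(s["text"].strip() for s in current)})

# 3. Build rows with audio, text, and a speaker id (single speaker -> 0).
#    Each chunk is assumed to have been cut into its own wav file.
rows = {
    "audio": [f"chunk_{i}.wav" for i in range(len(chunks))],
    "text": [c["text"] for c in chunks],
    "source": [0] * len(chunks),
}
dataset = Dataset.from_dict(rows).cast_column("audio", Audio(sampling_rate=24_000))
```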
Fine-Tuning Process
Fine-tuning is performed using Unsloth, a library built on Hugging Face Transformers [00:01:03] [00:14:03].
The steps include:
- Model Loading: Loading the base model (e.g., CSM1B) with a specified maximum sequence length [00:14:10]. The model is loaded onto the GPU [00:15:11].
- Adapter Application: Applying LoRA (Low-Rank Adaptation) adapters to specific linear layers of the model, such as the attention query, value, and output projections and the MLP gate, up, and down projections [00:22:18] [00:22:23]. This trains only a small subset of parameters, saving memory and speeding up training [00:22:20] [00:22:24]. For a 1 billion parameter model, a LoRA alpha of 16 and a rank of 32 are recommended [00:22:29] [00:22:49].
- Data Processing: Preparing the loaded dataset for the trainer by setting parameters like sampling rate, maximum text length, and maximum audio length [00:24:42] [00:24:47].
- Training Configuration: Setting up the training parameters, including the virtual (effective) batch size (e.g., 8), number of epochs (e.g., 1), warm-up steps, and learning rate (e.g., 2e-4 for a 1 billion parameter model) [00:25:39] [00:26:02] [00:26:05] [00:26:16]. The AdamW 8-bit optimizer can be used to reduce memory usage [00:26:27].
- Execution: Running the training process, which in this example takes around 10 minutes and shows a steadily decreasing loss [00:26:56] [00:27:00]. Ideally, an evaluation dataset is also used to monitor loss and prevent overfitting [00:28:36]. A configuration sketch follows the list.
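A hedged sketch of these steps, assuming Unsloth's FastLanguageModel-style loader and the Hugging Face Trainer (the model id, target module names, and omitted audio data collator are assumptions for illustration, not details confirmed by the talk):

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments, Trainer

# Load the base model with a maximum sequence length (placed on the GPU).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/csm-1b",     # assumed model id
    max_seq_length=2048,
)

# Attach LoRA adapters to the linear layers: rank 32, alpha 16 for a ~1B model.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

args = TrainingArguments(
    output_dir="csm-finetune",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # virtual batch size of 8
    num_train_epochs=1,
    warmup_steps=5,
    learning_rate=2e-4,
    optim="adamw_8bit",              # 8-bit AdamW to reduce memory
    logging_steps=1,
)

# `train_dataset` is the processed audio/text dataset from the previous section;
# a model-specific data collator (omitted here) turns each row into audio tokens.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```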
Voice Cloning
Voice cloning involves passing an audio sample to the model, prompting it to generate new audio that sounds similar to the provided sample [00:18:00] [00:18:10]. This differs from zero-shot inference, where the model generates a random speaker’s voice based on temperature settings [00:17:46] [00:17:56]. While voice cloning provides a voice closer to the target, it is not as accurate as fine-tuning [00:18:42].
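As an illustration, a voice-cloning call might look like the following sketch, modeled on the pattern in Sesame's CSM reference code (the load_csm_1b and Segment helpers, file names, and lengths are assumptions here, not verified API): a reference clip and its transcript are passed as context, and the model continues in that voice.

```python
import torchaudio
from generator import load_csm_1b, Segment   # helpers from the CSM reference repo (assumed)

generator = load_csm_1b(device="cuda")

# Reference audio plus its transcript, resampled to the model's sample rate.
ref_audio, sr = torchaudio.load("reference_clip.wav")          # placeholder file
ref_audio = torchaudio.functional.resample(
    ref_audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate)
context = [Segment(text="Transcript of the reference clip.", speaker=0, audio=ref_audio)]

# Generate new speech that should sound like the reference speaker.
audio = generator.generate(
    text="This sentence should come out in the cloned voice.",
    speaker=0,
    context=context,               # with context=[] this would be zero-shot inference
    max_audio_length_ms=10_000,
)
torchaudio.save("cloned.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```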
Performance Evaluation
Performance is evaluated by generating audio samples before and after fine-tuning [00:01:10].
- Zero-shot inference: Produces unpredictable voices (male, female, deep, high-pitched) [00:17:46] [00:20:56].
- Voice cloning (base model): Produces a voice closer to the input sample but still shows variance and may not fully capture the speaker’s nuances [00:21:47].
- Fine-tuned model (zero-shot): Removes the randomness of the voice, typically matching the target speaker’s gender, but may still have issues with pacing or accent [00:31:23] [00:32:02].
- Fine-tuned model with cloning: Expected to yield the best results, combining the benefits of specific voice training with cloning for nuanced output [00:32:16] [00:32:26].
Even with a relatively small amount of data (e.g., a 30-minute video yielding 41 clips of up to 30 seconds each), combining fine-tuning with cloning can achieve good performance [00:07:27] [00:33:15] [00:33:21]. For even better performance, especially without voice cloning, aiming for around 500 rows of 30-second audio clips is recommended [00:33:04]. Further improvements could involve better filtering of the original dataset to avoid long pauses, or using a library like NLTK to detect sentence boundaries for cleaner data segmentation [00:31:36] [00:13:17].
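As a hedged sketch of the NLTK idea (the transcript text and the words-per-30-seconds heuristic are placeholders): sentence boundaries from nltk.sent_tokenize can be used to pack chunks that end on full sentences rather than mid-sentence, before aligning them back to Whisper's timestamps.

```python
import nltk

nltk.download("punkt_tab")         # sentence tokenizer models; older NLTK versions use "punkt"
from nltk.tokenize import sent_tokenize

transcript = ("Fine-tuning changes the weights of the model. Voice cloning only "
              "changes the prompt. Combining both usually works best.")

# Split the transcript into sentences, then pack sentences into chunks that stay
# under roughly 30 seconds of speech (approximated here by word count).
sentences = sent_tokenize(transcript)
WORDS_PER_30S = 75                 # rough assumption: ~150 spoken words per minute

chunks, current = [], []
for sentence in sentences:
    if sum(len(s.split()) for s in current) + len(sentence.split()) > WORDS_PER_30S:
        chunks.append(" ".join(current))
        current = []
    current.append(sentence)
if current:
    chunks.append(" ".join(current))

print(chunks)
```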