From: aidotengineer

Introduction

In the context of token-based text-to-speech models, the goal is to convert text into audio by predicting audio tokens. This process is analogous to how models like GPT-4o predict the next word or token in a text string [00:01:58]. For text-to-speech, the model takes a series of text tokens and predicts the next audio token autoregressively, one step at a time [00:02:10].
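
To make the "next audio token" idea concrete, here is a minimal sketch of an autoregressive sampling loop. The model call, vocabulary size, and end-of-audio token are hypothetical stand-ins, not the API of any particular TTS system.

```python
import torch

VOCAB_AUDIO = 1024          # size of the audio codebook (illustrative)
END_OF_AUDIO = VOCAB_AUDIO  # hypothetical end-of-audio token id

def next_token_logits(text_tokens, audio_tokens):
    # Placeholder for a real transformer forward pass over [text ; audio] tokens.
    return torch.randn(VOCAB_AUDIO + 1)

def generate_audio_tokens(text_tokens, max_steps=50):
    """Sample audio tokens one at a time, conditioned on the text tokens."""
    audio_tokens = []
    for _ in range(max_steps):
        logits = next_token_logits(text_tokens, audio_tokens)
        nxt = int(torch.distributions.Categorical(logits=logits).sample())
        if nxt == END_OF_AUDIO:
            break
        audio_tokens.append(nxt)
    return audio_tokens

print(generate_audio_tokens(text_tokens=[5, 17, 42]))
```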

What are Audio Tokens?

The fundamental question for this process is: “How do you make tokens for audio?” [00:02:47]. Unlike continuous soundwaves, audio tokens must be discrete representations. A small segment of audio can be represented as a choice from a “codebook” [00:03:01].

Codebooks

A codebook functions like a dictionary, containing a series of vectors, where each vector represents a specific sound [00:03:07]. These codebooks can be quite large so they can cover a wide range of sounds, capturing both acoustic detail and semantic content [00:03:18]. This approach allows sound to be represented discretely, similar to how text discretely represents meaning [00:03:26].
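
As a rough illustration of "token = entry in a codebook", the sketch below quantizes an encoded audio frame to the index of its nearest codebook vector. The codebook size, vector dimension, and random data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))   # 1024 entries, 64-dim each (illustrative)

def quantize(frame_embedding):
    """Return the codebook index (the 'audio token') closest to this frame."""
    distances = np.linalg.norm(codebook - frame_embedding, axis=1)
    return int(np.argmin(distances))

def dequantize(token):
    """Map a token back to its codebook vector (what a decoder would consume)."""
    return codebook[token]

frame = rng.normal(size=64)              # stand-in for an encoded audio frame
token = quantize(frame)
print(token, dequantize(token).shape)
```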

Training Codebooks

Codebooks are typically trained using an encoder-decoder architecture [00:03:39]:

  1. A soundwave is input and converted into tokens by an encoder [00:03:42].
  2. Another transformer then decodes these tokens back into a soundwave [00:03:49].
  3. During training, the system evaluates how closely the decoded output wave matches the original input wave [00:03:55].
  4. Any discrepancies (loss) are used to adjust the model’s weights through backpropagation [00:04:05].

With sufficient data, this process enables the training of a codebook that can represent sound in a discrete form [00:04:09].
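
The training loop above can be sketched as a single gradient step in the style of a VQ-VAE; the straight-through trick and the extra codebook/commitment terms are standard ingredients of this family of codecs rather than details given in the talk, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

FRAME, DIM, CODES = 320, 64, 1024        # samples per frame, latent dim, codebook size

encoder  = nn.Sequential(nn.Linear(FRAME, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
decoder  = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, FRAME))
codebook = nn.Embedding(CODES, DIM)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters(),
                        *codebook.parameters()], lr=1e-4)

def train_step(wave_frames):                               # (batch, FRAME) raw samples
    z = encoder(wave_frames)                               # 1. encode to vectors
    tokens = torch.cdist(z, codebook.weight).argmin(-1)    #    nearest code = audio token
    q = codebook(tokens)
    recon = decoder(z + (q - z).detach())                  # 2. decode (straight-through)
    recon_loss = nn.functional.mse_loss(recon, wave_frames)    # 3. compare waves
    vq_loss = nn.functional.mse_loss(q, z.detach())             #    pull codes toward encoder
    commit_loss = nn.functional.mse_loss(z, q.detach())         #    keep encoder near codes
    loss = recon_loss + vq_loss + 0.25 * commit_loss
    opt.zero_grad(); loss.backward(); opt.step()            # 4. backpropagate
    return loss.item()

print(train_step(torch.randn(8, FRAME)))
```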

Hierarchical Representation

It has been found that using only one token per time step is insufficient to fully represent sound [00:04:26]. A more effective approach, used in most token-based audio models, is a hierarchical representation with multiple tokens per time window [00:04:31]. A coarse token captures higher-level meaning or acoustics, and additional tokens add progressively more granular layers of detail [00:04:40].
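
One common way to get multiple tokens per time window is residual quantization, where each codebook level encodes whatever the previous levels could not capture; the sketch below shows the idea with made-up sizes. This is a generic illustration of hierarchical tokenization, not a description of any specific model's codec.

```python
import numpy as np

rng = np.random.default_rng(0)
LEVELS, CODES, DIM = 4, 256, 64              # illustrative: 4 tokens per window
codebooks = rng.normal(size=(LEVELS, CODES, DIM))

def tokenize_frame(frame_embedding):
    """Return one token per level: coarse content first, finer detail after."""
    tokens, residual = [], frame_embedding.copy()
    for level in range(LEVELS):
        idx = int(np.argmin(np.linalg.norm(codebooks[level] - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebooks[level][idx]   # pass the leftovers to the next level
    return tokens

print(tokenize_frame(rng.normal(size=DIM)))   # four tokens describing one time window
```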

Sesame Model Implementation

The Sesame model employs this approach, using 32 tokens per audio window to represent sound [00:05:10].

  • A “zeroth” token is predicted by the main transformer [00:05:16].
  • A secondary transformer decodes the other 31 tokens, which capture additional meaning or detail [00:05:20].

The Sesame model itself consists of two parts: a main 1-billion-parameter model and a much smaller model that decodes these hierarchical tokens [00:05:38]. The architecture includes a backbone model (the main transformer), a depth decoder for the additional tokens, and a codec model to convert between waveforms and tokens [00:15:45].
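
The split between backbone and depth decoder can be illustrated with a toy two-stage prediction step: the backbone picks the zeroth token for a window, and a small second model fills in the remaining 31. The modules and sizes below are stand-ins for illustration, not Sesame's actual architecture or code.

```python
import torch
import torch.nn as nn

CODES, DIM, LEVELS = 1024, 256, 32            # illustrative codebook and model sizes

backbone      = nn.Linear(DIM, CODES)                          # stand-in for the 1B transformer
depth_decoder = nn.Linear(DIM + CODES, CODES * (LEVELS - 1))   # stand-in for the small decoder

def predict_window(context_embedding):
    """Predict all 32 tokens for one audio window from a context embedding."""
    zeroth = backbone(context_embedding).argmax(dim=-1)        # level-0 token
    onehot = nn.functional.one_hot(zeroth, CODES).float()
    rest_logits = depth_decoder(torch.cat([context_embedding, onehot], dim=-1))
    rest = rest_logits.view(-1, LEVELS - 1, CODES).argmax(dim=-1)   # levels 1..31
    return torch.cat([zeroth.unsqueeze(-1), rest], dim=-1)     # (batch, 32) tokens

tokens = predict_window(torch.randn(2, DIM))
print(tokens.shape)   # 32 tokens per audio window, ready for the codec to turn into audio
```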