From: hu-po

The “speech to speech” project is an application designed to facilitate multi-turn AI conversations between multiple characters [00:00:00]. It allows users to initiate conversations, input their own speech, and generate AI-driven responses from virtual personalities [00:00:04].

Core Functionality

The application enables users to:

  • Initiate Conversations: The process begins by selecting the desired participants, such as Joe Biden, Donald Trump, and Elon Musk [00:00:04].
  • User Input: Users record their speech, which is then automatically converted into text [00:00:12].
  • Generate AI Responses: After the user’s input, additional conversation bubbles are generated [00:00:19]. These responses are created by GPT via its API, tailored to the persona of each selected character [00:00:23].
  • Playback: The entire conversation, including both user input and AI-generated dialogue, can be played back through the computer’s speakers [00:00:50].
  • Export Audio: Conversations can be exported as an audio file, allowing users to share them or upload them to other platforms [00:01:17].
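
The turn-taking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the names Turn, Conversation, and generate_round are hypothetical, and reply_fn stands in for whatever call produces a character's reply (in the real application, a request to GPT).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    speaker: str  # "User" or a character name
    text: str


@dataclass
class Conversation:
    # Hypothetical container for a multi-turn, multi-character conversation.
    participants: list[str]
    turns: list[Turn] = field(default_factory=list)

    def add_user_turn(self, transcript: str) -> None:
        # The transcript would come from speech-to-text on the user's recording.
        self.turns.append(Turn("User", transcript))

    def generate_round(self, reply_fn: Callable[[str, list[Turn]], str]) -> None:
        # Each participant replies once, seeing everything said so far,
        # including replies from earlier participants in the same round.
        for name in self.participants:
            text = reply_fn(name, list(self.turns))
            self.turns.append(Turn(name, text))

    def transcript(self) -> str:
        return "\n".join(f"{t.speaker}: {t.text}" for t in self.turns)


def fake_reply(name: str, history: list[Turn]) -> str:
    # Offline stand-in for the model call; it only echoes the previous turn.
    return f"({name} responds to: {history[-1].text})"


conversation = Conversation(["Joe Biden", "Donald Trump", "Elon Musk"])
conversation.add_user_turn("What do you think about electric cars?")
conversation.generate_round(fake_reply)
print(conversation.transcript())
```

Keeping the reply function as a parameter means the same loop can be exercised offline with a stub and later pointed at a real model call.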

Technology Stack

The project leverages two primary API services:

  • OpenAI API: Generates the conversational responses of the AI characters [00:02:06].
  • Eleven Labs API: Handles speech synthesis, converting the generated text responses into natural-sounding speech [00:02:10].

A key advantage is that the application does not require a local GPU: response generation and speech synthesis are both handled through the APIs, so it can run on less powerful machines [00:02:23].
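
As a rough sketch of this two-service design, the pipeline can be written against two small interfaces, one for text generation and one for speech synthesis. Everything named here (TextGenerator, SpeechSynthesizer, speak_reply, and the two stand-in classes) is an assumption made for illustration; the real project calls the OpenAI and Eleven Labs APIs, whose request formats are not reproduced here.

```python
from typing import Protocol


class TextGenerator(Protocol):
    def generate(self, persona: str, history: list[str]) -> str: ...


class SpeechSynthesizer(Protocol):
    def synthesize(self, voice: str, text: str) -> bytes: ...


class EchoGenerator:
    """Offline stand-in for a hosted chat model (hypothetical)."""

    def generate(self, persona: str, history: list[str]) -> str:
        return f"As {persona}, I heard: {history[-1]}"


class PlaceholderSynthesizer:
    """Offline stand-in for hosted text-to-speech; returns tagged bytes, not audio."""

    def synthesize(self, voice: str, text: str) -> bytes:
        return f"[{voice}] {text}".encode("utf-8")


def speak_reply(
    persona: str,
    history: list[str],
    llm: TextGenerator,
    tts: SpeechSynthesizer,
) -> tuple[str, bytes]:
    # Step 1: a remote service writes the reply text for this persona.
    text = llm.generate(persona, history)
    # Step 2: a second remote service turns that text into audio.
    audio = tts.synthesize(persona, text)
    return text, audio


text, audio = speak_reply(
    "Elon Musk", ["Hello there"], EchoGenerator(), PlaceholderSynthesizer()
)
print(text)
print(len(audio), "bytes of placeholder audio")
```

Since both heavy steps sit behind interfaces that a network client can implement, the local machine only moves text and audio bytes around, which is consistent with the claim that no GPU is needed.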

Character Customization

The “speech to speech” project allows for extensive customization of characters, moving beyond just famous personalities [00:01:42]:

  • Any Individual: Users can define virtually any character [00:01:26].
  • Character Definition: To create a new character, users need to:
    • Choose a name [00:01:29].
    • Provide a description of the character’s personality and communication style [00:01:30].
    • Supply a list of audio references, typically 60 seconds to two minutes of audio from any YouTube video [00:01:34]. Even a 30-second clip can be sufficient [00:01:47].
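
One way to picture a character definition is as a small record holding the three items listed above. The sketch below is an assumption about shape only: the class and field names are hypothetical, the URL is a placeholder, and treating 30 seconds as a hard minimum is an illustrative choice based on the remark that a 30-second clip can be sufficient.

```python
from dataclasses import dataclass, field

# Assumed threshold; the source only says a 30-second clip can be enough.
MIN_REFERENCE_SECONDS = 30.0


@dataclass
class AudioReference:
    url: str                 # e.g. a link to a YouTube video
    duration_seconds: float  # length of the clip taken from it


@dataclass
class Character:
    name: str
    description: str  # personality and communication style
    audio_references: list[AudioReference] = field(default_factory=list)

    def total_reference_seconds(self) -> float:
        return sum(ref.duration_seconds for ref in self.audio_references)

    def validate(self) -> list[str]:
        # Returns a list of human-readable problems; empty means usable.
        problems = []
        if not self.name.strip():
            problems.append("name is empty")
        if not self.description.strip():
            problems.append("description is empty")
        if self.total_reference_seconds() < MIN_REFERENCE_SECONDS:
            problems.append("less than 30 seconds of reference audio")
        return problems


narrator = Character(
    name="Documentary Narrator",
    description="Calm, precise, fond of long pauses and nature metaphors.",
    audio_references=[AudioReference("https://example.com/placeholder-clip", 90.0)],
)
print(narrator.validate())
```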

Requirements and Availability

  • API Keys: Users need an OpenAI API key (estimated at $5 to get started) [00:02:06]. The total initial cost is approximately $25 [00:02:18].
  • GitHub Repository: The project’s code is available on GitHub under an MIT license, allowing for community use and development [00:01:57].
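
Before running an application like this, it can help to check that the required keys are present. The following is a small sketch, assuming the keys are supplied through environment variables; the variable names are assumptions, and the repository may load its keys in a different way.

```python
import os
from typing import Mapping, Optional

# Assumed variable names; check the project's README for the real ones.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ELEVENLABS_API_KEY")


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    # Returns the names of required keys that are unset or empty.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys found.")
```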