From: aidotengineer
The next frontier in AI is voice, which is already transforming call centers [01:48:40]. Over one billion calls worldwide are now handled in call centers using voice APIs [01:55:00], and real-time voice APIs are enabling agents to revolutionize call center operations [02:05:00].
An example of a real production application is Priceline's Penny bot, which lets users book an entire vacation hands-free, with no text input [02:13:00]. This shift means the discussion is no longer solely about text-based agents but increasingly about multimodal agents [02:28:00].
Evaluating Voice and Multimodal Agents
Evaluating voice and multimodal agents requires specific types of evaluations beyond those for text-based agents [02:33:00]. The future of voice applications involves some of the most complex applications ever built [11:51:00].
For voice applications, evaluation needs to cover more than the text or transcript [12:00:00]: the audio chunk itself must also be evaluated [12:12:00]. In many voice assistant APIs, the transcript is generated only after the audio chunk has been sent [12:19:00], which introduces a new dimension for evaluation [12:27:00].
Key evaluation points for voice applications include:
- User sentiment [12:30:00]
- Speech-to-text transcription accuracy [12:31:00], [12:57:00]
- Tone consistency throughout the conversation [12:36:00]
- Intent and speech quality from the audio piece [12:41:00], [12:53:00]
It is crucial that audio chunks receive their own defined evaluations for aspects like intent, speech quality, and speech-to-text accuracy [12:50:00].