From: aidotengineer

The next frontier in AI is voice AI, which is already transforming call centers [01:48:40]. More than one billion calls worldwide are now made in call centers using voice APIs [01:55:00], and real-time voice APIs are enabling agents to revolutionize call center operations [02:05:00].

A real production example is Priceline's Pennybot, which lets users book an entire vacation hands-free, with no text input [02:13:00]. This shift means the discussion is no longer only about text-based agents but increasingly about multimodal agents [02:28:00].

Evaluating Voice and Multimodal Agents

Evaluating voice and multimodal agents requires specific types of evaluations beyond those used for text-based agents [02:33:00]. Voice applications are shaping up to be some of the most complex applications ever built [11:51:00].

For voice applications, evaluation needs to consider additional components beyond the text or transcript alone [12:00:00]. The audio chunk itself also needs to be evaluated [12:12:00]. In many voice assistant APIs, the transcript is generated only after the audio chunk has been sent [12:19:00], which introduces a new dimension for evaluation [12:27:00].
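
To make that extra dimension concrete, here is a minimal sketch, assuming a simple in-house data structure rather than any particular voice API, of how one agent turn might be captured so the audio chunk and its later transcript can each be scored. The VoiceTurn class, its field names, and the attach_transcript helper are all hypothetical.

```python
# Hypothetical sketch: keep the audio chunk and its delayed transcript together
# so both can receive their own evaluations. Not tied to any specific voice API.
from dataclasses import dataclass, field


@dataclass
class VoiceTurn:
    """One exchange with the agent: raw audio plus the transcript that arrives later."""
    audio_chunk: bytes                 # raw audio bytes as sent to the voice API
    sample_rate_hz: int = 16_000       # assumed sample rate for the chunk
    transcript: str | None = None      # generated after the audio chunk is sent
    audio_evals: dict[str, float] = field(default_factory=dict)  # scores on the audio itself
    text_evals: dict[str, float] = field(default_factory=dict)   # scores on the transcript


def attach_transcript(turn: VoiceTurn, transcript: str) -> None:
    """Transcripts arrive after the audio, so they are attached to the turn late."""
    turn.transcript = transcript
```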

A key evaluation point for voice applications is that audio chunks receive their own defined evaluations for aspects such as intent, speech quality, and speech-to-text accuracy [12:50:00]; a sketch of what those per-chunk evaluations might look like follows.
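
As a hedged illustration, the sketch below defines one evaluator per audio-level aspect named above and runs them over a single chunk. The evaluator bodies are placeholders: a real system would call an intent classifier, an audio-quality model, and compare the speech-to-text output against a reference transcript. Every function and name here is an assumption, not part of any existing library.

```python
# Hypothetical sketch: separate, explicitly defined evals for each audio chunk.
from typing import Callable

# Each evaluator sees the raw audio and whatever transcript (if any) was produced.
AudioEvaluator = Callable[[bytes, str | None], float]


def intent_eval(audio: bytes, transcript: str | None) -> float:
    """Placeholder: score whether the caller's intent was captured (0.0-1.0)."""
    return 1.0 if transcript else 0.0


def speech_quality_eval(audio: bytes, transcript: str | None) -> float:
    """Placeholder: score audio clarity/noise; a real version would inspect the signal."""
    return 1.0 if audio else 0.0


def stt_accuracy_eval(audio: bytes, transcript: str | None) -> float:
    """Placeholder: a real version would compare the transcript to a reference (e.g. word error rate)."""
    return 0.0 if transcript is None else 1.0


# The set of evals defined for every audio chunk.
AUDIO_CHUNK_EVALS: dict[str, AudioEvaluator] = {
    "intent": intent_eval,
    "speech_quality": speech_quality_eval,
    "stt_accuracy": stt_accuracy_eval,
}


def evaluate_audio_chunk(audio: bytes, transcript: str | None) -> dict[str, float]:
    """Run every defined audio-level eval on one chunk and collect the scores."""
    return {name: fn(audio, transcript) for name, fn in AUDIO_CHUNK_EVALS.items()}


# Example usage with a dummy chunk:
scores = evaluate_audio_chunk(b"\x00" * 320, "I'd like to book a vacation")
print(scores)  # {'intent': 1.0, 'speech_quality': 1.0, 'stt_accuracy': 1.0}
```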