xAI’s Grok Voice Agent has emerged as the new leader in speech-to-speech AI, combining top-tier audio reasoning, real-time performance, and competitive pricing to challenge Google and OpenAI head-on. CEO Elon Musk is pictured above. (Source: Image by RR)

xAI’s First Public Speech-to-Speech API Sets a New Benchmark for Audio Reasoning

xAI has introduced its new Grok Voice Agent, which now leads the field of speech-to-speech AI models after scoring 92.3% on the Big Bench Audio benchmark, narrowly surpassing Google’s Gemini 2.5 Flash Native Audio Thinking. According to a post on x.com, this marks xAI’s first public release of a native speech-to-speech API and positions the company as a serious competitor in the fast-growing voice AI space, alongside Google and OpenAI.

Big Bench Audio is the first dataset designed specifically to evaluate reasoning in speech-based models, rather than simple transcription or response generation. It consists of 1,000 audio questions adapted from the notoriously difficult Big Bench Hard text benchmark, making it a meaningful test of higher-order reasoning translated into spoken language. Grok Voice Agent’s top score establishes a new state of the art for native audio reasoning.

Beyond raw reasoning performance, Grok Voice Agent delivers competitive latency, with an average time-to-first-token of 0.78 seconds, ranking third overall behind Google’s fastest Gemini Flash audio variants. This places it firmly within real-time usability thresholds for live conversations, customer support agents, and interactive voice systems.

The model is also priced aggressively at $0.05 per minute of connection time ($3 per hour), undercutting many enterprise voice AI offerings. Key features include built-in tool calling (web search, retrieval-augmented generation, and custom tools), telephony integration via SIP providers like Twilio and Vonage, and multilingual support across more than 100 languages, with five selectable voices. Together, these capabilities signal xAI’s intent to push Grok into production-grade voice assistants and automated phone agents.
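
To make the pricing concrete, here is a minimal sketch of the arithmetic behind the quoted rate. Only the $0.05 per-minute figure comes from the reporting above; the function name, the 10,000-minute workload, and the assumption that billing scales linearly with connected time (no minimums, rounding, or volume tiers) are illustrative and not confirmed by xAI.

```python
from decimal import Decimal

# Per-minute rate quoted in the article. Linear billing with no minimums,
# rounding rules, or volume discounts is an assumption made for illustration.
RATE_PER_MINUTE_USD = Decimal("0.05")


def estimate_cost_usd(connected_minutes, rate=RATE_PER_MINUTE_USD):
    """Estimate the cost in USD for a given number of connected minutes."""
    if connected_minutes < 0:
        raise ValueError("connected_minutes must be non-negative")
    # Convert via str() so float inputs such as 2.5 do not carry binary error.
    return Decimal(str(connected_minutes)) * rate


if __name__ == "__main__":
    print(f"1 hour: ${estimate_cost_usd(60):.2f}")              # 1 hour: $3.00
    print(f"10,000 minutes: ${estimate_cost_usd(10_000):.2f}")  # 10,000 minutes: $500.00
```

Under these assumptions, a hypothetical support line that handles 10,000 connected minutes in a month would cost about $500 in voice-agent fees, before any telephony charges from the SIP provider.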
