
A new benchmark suggests that while AI agents are improving quickly, they still lack the multi-context reasoning and reliability needed to replace real white-collar professionals anytime soon. (Source: Image by RR)
New Benchmark Highlights the Gap Between AI Demos and Workplace Reality
Despite bold predictions that AI would soon replace white-collar jobs, a new benchmark suggests that AI agents are not yet ready for real workplace demands. Research from training-data company Mercor introduces APEX-Agents, a benchmark designed to measure how well leading AI models perform actual professional tasks drawn from consulting, investment banking and law. The results were stark: no model achieved more than 25% accuracy, with most responses either incorrect or missing entirely.
The benchmark, as reported by tech.yahoo.com, was built to mirror real professional environments rather than isolated question-answering. According to Mercor CEO Brendan Foody, the biggest challenge for AI agents was navigating information spread across multiple domains: emails, internal policies, legal frameworks and collaboration tools like Slack and Google Drive. While large language models excel at focused research or planning tasks, this kind of multi-context reasoning remains unreliable, even for the most advanced systems.
The tasks themselves were contributed by professionals from Mercor’s expert marketplace, who also defined what a correct response would look like. Many scenarios require nuanced judgment, such as interpreting EU privacy law in the context of internal company policies. These are the kinds of sustained, context-rich problems that define knowledge work—and that would need to be solved reliably before AI could meaningfully replace professionals like lawyers or bankers.
Among the tested models, Gemini 3 Flash performed best with 24% one-shot accuracy, followed closely by GPT-5.2 at 23%, while other leading systems clustered around 18%. These scores fall far short of automation-ready performance, but Foody notes rapid improvement year over year. With the benchmark now public on Hugging Face, APEX-Agents sets a clear challenge for AI labs and offers a more grounded reality check on how close AI agents really are to reshaping the white-collar workforce.
Read more at tech.yahoo.com