Did xAI lie about Grok 3’s benchmarks?

Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark and takes the most frequently generated answers as the final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn’t the case.
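The mechanics are simple to sketch: sample many answers per problem and take the mode. Here is a minimal Python illustration of that majority-vote idea — the function name and the sample answers are hypothetical, not xAI’s or OpenAI’s actual scoring code:

```python
from collections import Counter

def consensus_at_k(answers):
    """Return the most frequent answer among k sampled attempts.

    `answers` is the list of answers a model generated for one
    problem; the consensus answer is simply the most common one.
    """
    return Counter(answers).most_common(1)[0][0]

# 64 hypothetical attempts at one problem: the model's most frequent
# answer wins, even if many individual attempts (including, perhaps,
# the very first one counted at @1) were wrong.
attempts = ["204"] * 40 + ["112"] * 15 + ["96"] * 9
print(consensus_at_k(attempts))  # → 204
```

This is why a cons@64 score can be far higher than a single-attempt score: one correct mode can mask dozens of incorrect individual samples.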

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the models’ score on their first attempt at each problem — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64.

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.

Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
