Even some of the best AI can’t beat this new benchmark

The nonprofit Center for AI Safety (CAIS) and Scale AI, a company that provides a number of data labeling and AI development services, have released a challenging new benchmark for frontier AI systems.

The benchmark, called Humanity’s Last Exam, includes thousands of crowdsourced questions touching on subjects like mathematics, the humanities, and the natural sciences. To make the evaluation tougher, the questions come in multiple formats, including ones that incorporate diagrams and images.

In a preliminary study, not a single publicly available flagship AI system managed to score better than 10% on Humanity’s Last Exam.

CAIS and Scale AI say they plan to open up the benchmark to the research community so that researchers can “dig deeper into the variations” and evaluate new AI models.



Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes about health, sports, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
