Even some of the best AI can’t beat this new benchmark

The nonprofit Center for AI Safety (CAIS) and Scale AI, a company that provides a number of data labeling and AI development services, have released a challenging new benchmark for frontier AI systems.

The benchmark, called Humanity’s Last Exam, includes thousands of crowdsourced questions touching on subjects like mathematics, the humanities, and the natural sciences. To make the evaluation tougher, the questions come in multiple formats, including some that incorporate diagrams and images.

In a preliminary study, not a single publicly available flagship AI system managed to score better than 10% on Humanity’s Last Exam.

CAIS and Scale AI say they plan to open up the benchmark to the research community so that researchers can “dig deeper into the variations” and evaluate new AI models.




Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes about health, sports, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
