Even some of the best AI can’t beat this new benchmark


The nonprofit Center for AI Safety (CAIS) and Scale AI, a company that provides data labeling and AI development services, have released a challenging new benchmark for frontier AI systems.

The benchmark, called Humanity’s Last Exam, includes thousands of crowdsourced questions touching on subjects like mathematics, humanities, and the natural sciences. To make the evaluation tougher, the questions are in multiple formats, including formats that incorporate diagrams and images.

In a preliminary study, not a single publicly available flagship AI system managed to score better than 10% on Humanity’s Last Exam.

CAIS and Scale AI say they plan to open up the benchmark to the research community so that researchers can “dig deeper into the variations” and evaluate new AI models.




Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
