Even some of the best AI can’t beat this new benchmark

Date: January 23, 2025

The nonprofit Center for AI Safety (CAIS) and Scale AI, a company that provides a number of data labeling and AI development services, have released a challenging new benchmark for frontier AI systems.

The benchmark, called Humanity’s Last Exam, includes thousands of crowdsourced questions touching on subjects like mathematics, humanities, and the natural sciences. To make the evaluation tougher, the questions are in multiple formats, including formats that incorporate diagrams and images.

In a preliminary study, not a single publicly available flagship AI system managed to score better than 10% on Humanity’s Last Exam.

CAIS and Scale AI say they plan to open up the benchmark to the research community so that researchers can “dig deeper into the variations” and evaluate new AI models.

Lisa Holden

Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
