A test for AGI is closer to being solved — but it may be flawed

A well-known test for artificial general intelligence (AGI) is closer to being solved. But the test’s creators say this points to flaws in the test’s design, rather than a bona fide research breakthrough.

In 2019, François Chollet, a leading figure in the AI world, introduced the ARC-AGI benchmark, short for “Abstraction and Reasoning Corpus for Artificial General Intelligence.” Designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, ARC-AGI, Chollet claims, remains the only AI test to measure progress towards general intelligence (although others have been proposed).

Until this year, the best-performing AI could only solve just under a third of the tasks in ARC-AGI. Chollet blamed the industry’s focus on large language models (LLMs), which he believes aren’t capable of actual “reasoning.”

“LLMs struggle with generalization, due to being entirely reliant on memorization,” he said in a series of posts on X in February. “They break down on anything that wasn’t in their training data.”

To Chollet’s point, LLMs are statistical machines. Trained on large numbers of examples, they learn patterns in those examples to make predictions, such as that “to whom” in an email typically precedes “it may concern.”
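The idea of predicting from learned patterns can be illustrated with a toy word-pair counter. This is only a sketch: real LLMs are neural networks operating on tokens, not lookup tables of word counts, and the corpus, `train_bigrams` and `predict_next` below are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus; these sentences are made up for the example.
CORPUS = [
    "to whom it may concern",
    "to whom it may concern please find attached",
    "to whom it may apply",
    "it may rain today",
]


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


counts = train_bigrams(CORPUS)
print(predict_next(counts, "whom"))     # it
print(predict_next(counts, "may"))      # concern
print(predict_next(counts, "quantum"))  # None
```

The last call returns nothing because “quantum” never appeared in the toy corpus. That is a crude analogue of the failure Chollet describes, not a demonstration of how actual LLMs behave.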

Chollet asserts that while LLMs might be capable of memorizing “reasoning patterns,” it’s unlikely that they can generate “new reasoning” based on novel situations. “If you need to be trained on many examples of a pattern, even if it’s implicit, in order to learn a reusable representation for it, you’re memorizing,” Chollet argued in another post.

To incentivize research beyond LLMs, in June, Chollet and Zapier co-founder Mike Knoop launched a $1 million competition to build open source AI capable of beating ARC-AGI. Out of 17,789 submissions, the best scored 55.5%, roughly 20 percentage points higher than 2023’s top scorer, albeit short of the 85% “human-level” threshold required to win.

This doesn’t mean we’re ~20% closer to AGI, though, Knoop says.

Today we’re announcing the winners of ARC Prize 2024. We’re also publishing an extensive technical report on what we learned from the competition (link in the next tweet).

The state-of-the-art went from 33% to 55.5%, the largest single-year increase we’ve seen since 2020. The…

— François Chollet (@fchollet) December 6, 2024

In a blog post, Knoop said that many of the submissions to ARC-AGI have been able to “brute force” their way to a solution, suggesting that a “large fraction” of ARC-AGI tasks “[don’t] carry much useful signal towards general intelligence.”

ARC-AGI consists of puzzle-like problems where an AI has to, given a grid of different-colored squares, generate the correct “answer” grid. The problems were designed to force an AI to adapt to new problems it hasn’t seen before. But it’s not clear they’re achieving this.

Tasks in the ARC-AGI benchmark. Models must solve ‘problems’ in the top row; the bottom row shows solutions. Image Credits: ARC-AGI
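For readers who want a concrete picture of such a puzzle, here is a minimal sketch of how a grid task and its scoring might look in code. The task, its hidden rule (mirror each row) and the dictionary layout are assumptions made for illustration, not taken from the benchmark itself; the sketch also assumes an answer counts only if the whole output grid matches exactly.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each cell holds a color index

# Hypothetical task: the hidden rule is "mirror the grid left to right".
TASK: Dict[str, List[Dict[str, Grid]]] = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [
        {"input": [[5, 0], [0, 6]], "output": [[0, 5], [6, 0]]},
    ],
}


def mirror(grid: Grid) -> Grid:
    """Candidate solver: reverse every row."""
    return [list(reversed(row)) for row in grid]


def identity(grid: Grid) -> Grid:
    """Candidate solver: return a copy of the grid unchanged."""
    return [row[:] for row in grid]


def fits_examples(solver: Callable[[Grid], Grid], pairs) -> bool:
    """True if the solver reproduces every demonstration pair exactly."""
    return all(solver(p["input"]) == p["output"] for p in pairs)


def score(solver: Callable[[Grid], Grid], task) -> float:
    """Fraction of test grids the solver gets exactly right."""
    tests = task["test"]
    correct = sum(solver(p["input"]) == p["output"] for p in tests)
    return correct / len(tests)


print(fits_examples(mirror, TASK["train"]))    # True
print(fits_examples(identity, TASK["train"]))  # False
print(score(mirror, TASK))                     # 1.0
```

A solver sees only the demonstration pairs and must infer the rule before producing the test output. Here the rule is hard-coded by hand, which is exactly what a real entrant cannot do for unseen tasks.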

“[ARC-AGI] has been unchanged since 2019 and is not perfect,” Knoop acknowledged in his post.

Chollet and Knoop have also faced criticism for overselling ARC-AGI as a benchmark toward AGI, at a time when the very definition of AGI is being hotly contested. One OpenAI staff member recently claimed that AGI has “already” been achieved if one defines AGI as AI “better than most humans at most tasks.”

Knoop and Chollet say that they plan to release a second-gen ARC-AGI benchmark to address these issues, alongside a 2025 competition. “We will continue to direct the efforts of the research community towards what we see as the most important unsolved problems in AI, and accelerate the timeline to AGI,” Chollet wrote in an X post.

Fixes likely won’t come easy. If the first ARC-AGI test’s shortcomings are any indication, defining intelligence for AI will be as intractable — and inflammatory — as it has been for human beings.
