Researchers question AI’s ‘reasoning’ ability as models stumble on math problems with trivial changes


How do machine learning models do what they do? And are they really “thinking” or “reasoning” the way we understand those things? This is a philosophical question as much as a practical one, but a new paper making the rounds Friday suggests that the answer is, at least for now, a pretty clear “no.”

A group of AI research scientists at Apple released their paper, “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” to general commentary Thursday. While the deeper concepts of symbolic learning and pattern reproduction are a bit in the weeds, the basic concept of their research is very easy to grasp.

Let’s say I asked you to solve a simple math problem like this one:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?

Obviously, the answer is 44 + 58 + (44 * 2) = 190. Though large language models are actually spotty on arithmetic, they can pretty reliably solve something like this. But what if I threw in a little random extra info, like this:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

It’s the same math problem, right? And of course even a grade-schooler would know that a small kiwi is still a kiwi. But as it turns out, this extra data point confuses even state-of-the-art LLMs. Here’s the take from OpenAI’s o1-mini:

… on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday’s kiwis) – 5 (smaller kiwis) = 83 kiwis
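To make the gap concrete, here is a minimal sketch of the arithmetic in Python. The variable names are illustrative, and the final "mistaken" total is simply what follows from the subtraction quoted above; the quoted excerpt itself only shows the Sunday figure.

```python
# Worked arithmetic for the kiwi problem; variable names are illustrative.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

# Size does not change the count, so the five smaller kiwis still count.
correct_total = friday + saturday + sunday

# The quoted model response subtracts the five smaller kiwis from Sunday.
smaller = 5
mistaken_sunday = sunday - smaller
# Derived here for illustration: the total implied by that subtraction.
mistaken_total = friday + saturday + mistaken_sunday

print(correct_total)    # 190
print(mistaken_sunday)  # 83
print(mistaken_total)   # 185
```

The only difference between the two totals is the five kiwis that the irrelevant clause mentions.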

This is just one simple example out of hundreds of questions that the researchers lightly modified, nearly all of which led to enormous drops in success rates for the models attempting them.
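For a rough sense of what "lightly modified" can mean in practice, the sketch below builds paired versions of the kiwi question, one plain and one with an irrelevant clause, that share the same correct answer. This is not the researchers' actual pipeline: the template, the function names `build_question` and `random_pair`, and the number ranges are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical template modeled on the kiwi example; not taken from the paper.
TEMPLATE = (
    "Oliver picks {friday} kiwis on Friday. "
    "Then he picks {saturday} kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday{extra}. "
    "How many kiwis does Oliver have?"
)


def build_question(friday, saturday, smaller=None):
    """Return (question_text, correct_answer).

    If `smaller` is given, an irrelevant clause about kiwi size is added.
    The correct answer is the same either way.
    """
    extra = ""
    if smaller is not None:
        extra = f", but {smaller} of them were a bit smaller than average"
    text = TEMPLATE.format(friday=friday, saturday=saturday, extra=extra)
    answer = friday + saturday + 2 * friday
    return text, answer


def random_pair(seed):
    """Return (plain, distracted, answer) built from the same random numbers."""
    rng = random.Random(seed)  # seeded so a given pair is reproducible
    friday = rng.randint(10, 99)
    saturday = rng.randint(10, 99)
    smaller = rng.randint(1, 9)
    plain, answer = build_question(friday, saturday)
    distracted, _ = build_question(friday, saturday, smaller)
    return plain, distracted, answer


plain, distracted, answer = random_pair(seed=0)
print(plain)
print(distracted)
print(answer)
```

Scoring a model on both halves of many such pairs would show how much of its accuracy survives the added clause, since any difference cannot come from the underlying math.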

Image Credits: Mirzadeh et al.

Now, why should this be? Why would a model that understands the problem be thrown off so easily by a random, irrelevant detail? The researchers propose that this reliable mode of failure means the models don’t really understand the problem at all. Their training data does allow them to respond with the correct answer in some situations, but as soon as the slightest actual “reasoning” is required, such as whether to count small kiwis, they start producing weird, unintuitive results.

As the researchers put it in their paper:

[W]e investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.

This observation is consistent with the other qualities often attributed to LLMs due to their facility with language. When, statistically, the phrase “I love you” is followed by “I love you, too,” the LLM can easily repeat that — but it doesn’t mean it loves you. And although it can follow complex chains of reasoning it has been exposed to before, the fact that this chain can be broken by even superficial deviations suggests that it doesn’t actually reason so much as replicate patterns it has observed in its training data.

Mehrdad Farajtabar, one of the co-authors, breaks down the paper very nicely in this thread on X.

An OpenAI researcher, while commending Mirzadeh et al’s work, objected to their conclusions, saying that correct results could likely be achieved in all these failure cases with a bit of prompt engineering. Farajtabar (responding with the typical yet admirable friendliness researchers tend to employ) noted that while better prompting may work for simple deviations, the model may require exponentially more contextual data in order to counter complex distractions — ones that, again, a child could trivially point out.

Does this mean that LLMs don’t reason? Maybe. That they can’t reason? No one knows. These are not well-defined concepts, and the questions tend to appear at the bleeding edge of AI research, where the state of the art changes on a daily basis. Perhaps LLMs “reason,” but in a way we don’t yet recognize or know how to control.

It makes for a fascinating frontier in research, but it’s also a cautionary tale when it comes to how AI is being sold. Can it really do the things they claim, and if it does, how? As AI becomes an everyday software tool, this kind of question is no longer academic.




Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
