Researchers question AI’s ‘reasoning’ ability as models stumble on math problems with trivial changes

How do machine learning models do what they do? And are they really “thinking” or “reasoning” the way we understand those things? This is a philosophical question as much as a practical one, but a new paper making the rounds Friday suggests that the answer is, at least for now, a pretty clear “no.”

A group of AI research scientists at Apple released their paper, “Understanding the limitations of mathematical reasoning in large language models,” to general commentary Thursday. While the deeper concepts of symbolic learning and pattern reproduction are a bit in the weeds, the basic concept of their research is very easy to grasp.

Let’s say I asked you to solve a simple math problem like this one:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?

Obviously, the answer is 44 + 58 + (44 * 2) = 190. Though large language models are actually spotty on arithmetic, they can pretty reliably solve something like this. But what if I threw in a little random extra info, like this:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

It’s the same math problem, right? And of course even a grade-schooler would know that even a small kiwi is still a kiwi. But as it turns out, this extra data point confuses even state-of-the-art LLMs. Here’s the take from OpenAI’s o1-mini:

… on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday’s kiwis) – 5 (smaller kiwis) = 83 kiwis
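To make the gap concrete, here is a minimal Python sketch of the two readings of the problem. The variable names are illustrative, and the mistaken total is an inference: the quoted model output shows only the Sunday subtraction, so the final sum is assumed to follow from it.

```python
# Kiwi counts taken from the example problem.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

# Correct reading: size is irrelevant, so every kiwi counts.
correct_total = friday + saturday + sunday

# Mistaken reading, as in the quoted output: subtract the 5 smaller kiwis.
smaller = 5
mistaken_sunday = sunday - smaller

# Assumption: this total is inferred from the subtraction step,
# not quoted from the model's answer.
mistaken_total = friday + saturday + mistaken_sunday

print(correct_total)    # 190
print(mistaken_sunday)  # 83
print(mistaken_total)   # 185
```

The irrelevant detail shifts the result by exactly the five kiwis that should never have been subtracted.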

This is just one simple example out of hundreds of questions that the researchers lightly modified, nearly all of which led to enormous drops in success rates for the models attempting them.
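A light modification of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual tooling: the helper names `solve` and `build_question` are hypothetical, and the template covers only the kiwi problem. The point it shows is that the distracting clause changes the wording of the question while the ground-truth answer stays the same.

```python
def solve(friday: int, saturday: int) -> int:
    """Ground-truth answer for the kiwi problem: Sunday is double Friday."""
    return friday + saturday + 2 * friday


def build_question(friday: int, saturday: int, distractor: str = "") -> str:
    """Render the kiwi problem, optionally appending an irrelevant clause.

    Hypothetical helper for illustration only.
    """
    sunday_clause = (
        "On Sunday, he picks double the number of kiwis he did on Friday"
    )
    if distractor:
        sunday_clause += f", but {distractor}"
    return (
        f"Oliver picks {friday} kiwis on Friday. "
        f"Then he picks {saturday} kiwis on Saturday. "
        f"{sunday_clause}. How many kiwis does Oliver have?"
    )


base = build_question(44, 58)
modified = build_question(
    44, 58, distractor="five of them were a bit smaller than average"
)

print(base)
print(modified)
# The expected answer does not depend on the distractor.
print(solve(44, 58))  # 190
```

Because `solve` never looks at the distractor, any change in a model's answer between `base` and `modified` reflects the model's handling of the extra clause rather than a change in the underlying math.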

Image Credits: Mirzadeh et al.

Now, why should this be? Why would a model that understands the problem be thrown off so easily by a random, irrelevant detail? The researchers propose that this reliable mode of failure means the models don’t really understand the problem at all. Their training data does allow them to respond with the correct answer in some situations, but as soon as the slightest actual “reasoning” is required, such as whether to count small kiwis, they start producing weird, unintuitive results.

As the researchers put it in their paper:

[W]e investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.

This observation is consistent with the other qualities often attributed to LLMs due to their facility with language. When, statistically, the phrase “I love you” is followed by “I love you, too,” the LLM can easily repeat that — but it doesn’t mean it loves you. And although it can follow complex chains of reasoning it has been exposed to before, the fact that this chain can be broken by even superficial deviations suggests that it doesn’t actually reason so much as replicate patterns it has observed in its training data.

Mehrdad Farajtabar, one of the co-authors, breaks down the paper very nicely in this thread on X.

An OpenAI researcher, while commending Mirzadeh et al.’s work, objected to their conclusions, saying that correct results could likely be achieved in all these failure cases with a bit of prompt engineering. Farajtabar (responding with the typical yet admirable friendliness researchers tend to employ) noted that while better prompting may work for simple deviations, the model may require exponentially more contextual data in order to counter complex distractions, ones that, again, a child could trivially point out.

Does this mean that LLMs don’t reason? Maybe. That they can’t reason? No one knows. These are not well-defined concepts, and the questions tend to appear at the bleeding edge of AI research, where the state of the art changes on a daily basis. Perhaps LLMs “reason,” but in a way we don’t yet recognize or know how to control.

It makes for a fascinating frontier in research, but it’s also a cautionary tale when it comes to how AI is being sold. Can it really do the things they claim, and if it does, how? As AI becomes an everyday software tool, this kind of question is no longer academic.
