Apple AI researchers publish results showing that current AI language models reason worse on arithmetic word problems than elementary school students



AI based on large language models (LLMs), such as OpenAI's GPT-4, has advanced and wide-ranging capabilities, such as generating natural sentences and solving a variety of problems. However, there are still cases where these models fail to solve arithmetic word problems at the elementary school level. In a paper published by Apple's artificial intelligence scientists, the research results showed that AI based on large language models from companies such as Meta and OpenAI 'lacks basic reasoning capabilities.'

[2410.05229] GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
https://arxiv.org/abs/2410.05229

Researchers question AI's 'reasoning' ability as models stumble on math problems with trivial changes | TechCrunch
https://techcrunch.com/2024/10/11/researchers-question-ais-reasoning-ability-as-models-stumble-on-math-problems-with-trivial-changes/?guccounter=1

Reasoning failures highlighted by Apple research on LLMs
https://appleinsider.com/articles/24/10/12/apples-study-proves-that-llm-based-ai-models-are-flawed-because-they-cannot-reason

Apple's artificial intelligence scientists have proposed a new benchmark, GSM-Symbolic, to measure AI reasoning capabilities. GSM-Symbolic is a mechanism for measuring AI reasoning capabilities that explores weaknesses in mathematical reasoning by adding 'contextual information' to questions that do not affect basic mathematics.

The 'GSM-NoOp' task developed by the research team works as follows. The starting point is an arithmetic word problem at the upper elementary school level.

On Friday, Oliver picks 44 kiwis. On Saturday, he picks 58 kiwis. On Sunday, he picks twice as many kiwis as on Friday. How many kiwis did he pick in total over the 3 days?



When the research team tested OpenAI's and Meta's AI models on this problem, they found that, apart from occasional calculation slips, the models could reliably answer it: 44 (Friday) + 58 (Saturday) + 44 × 2 (Sunday is twice Friday) = 190.

Next, the researchers append a sentence that is irrelevant to the problem. In the version below, the added sentence is 'Of the kiwis he picked on Sunday, five were slightly smaller than average.'

On Friday, Oliver picks 44 kiwis. On Saturday, he picks 58 kiwis. On Sunday, he picks twice as many kiwis as he did on Friday. Of the kiwis he picked on Sunday, five were slightly smaller than average. How many kiwis did he pick in total over the three days?




When the information that five kiwis are small is added, many models answer '185', subtracting the 'five kiwis smaller than average' from the correct total of 190.
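The arithmetic behind both answers is easy to verify. Here is a minimal Python sketch of the calculation above (the variable names are mine, not from the paper):

```python
# Baseline problem: kiwis picked on Friday, Saturday, and Sunday (twice Friday).
friday, saturday = 44, 58
sunday = 2 * friday

correct_total = friday + saturday + sunday  # 44 + 58 + 88 = 190

# GSM-NoOp appends an irrelevant clause ("five were slightly smaller than
# average"). Kiwi size does not affect the count, so the answer should
# still be 190 -- but many models subtract the distractor anyway:
smaller_kiwis = 5
observed_wrong_answer = correct_total - smaller_kiwis  # 185
```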

There have been past cases where AI showed weaknesses to tricks that seem silly and trivial to humans. AlphaGo, developed by DeepMind (which Google acquired in 2014), won its first Go match against a professional player in January 2016 and went on to an overwhelming run, defeating the world's strongest Go players. However, an amateur player who claimed to have 'discovered a weakness in the AI' used a strategy that is almost useless against human opponents: slowly building a large ring of stones to surround one of the opponent's groups while making moves in other corners of the board to distract the AI's attention. With this, he won 14 out of 15 matches against a Go AI of a level comparable to AlphaGo.

A person who overwhelmed the strongest Go AI appears, and it is said that humanity won by exploiting the weaknesses of AI - GIGAZINE



Mehrdad Farajtabar, a co-author of the paper, posted an explanation of the results on X. According to Farajtabar, when OpenAI released the elementary-school-level math word problem dataset 'GSM8K' in 2021, GPT-3 at the time scored only 35%. With subsequent development, models with about 3 billion parameters can now achieve scores over 85%, and even larger models over 95%, but the question remained: 'Has the models' reasoning ability really improved?'



Farajtabar therefore developed GSM-Symbolic as a new LLM test to replace GSM8K, whose reliability had become questionable. GSM-Symbolic creates templates from the GSM8K test set and generates instances focused on the points to be tested, allowing controllable experiments to be designed. According to Farajtabar, most AI models score lower on GSM-Symbolic than on GSM8K.
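The template idea can be illustrated with a short sketch. This is a hypothetical illustration of the approach, not the authors' actual code or template format; the function `make_instance` and the list of names are my own assumptions:

```python
import random

# Hypothetical GSM-Symbolic-style template: names and numbers become
# variables while the underlying arithmetic structure stays fixed.
TEMPLATE = ("On Friday, {name} picks {x} kiwis. On Saturday, he picks {y} kiwis. "
            "On Sunday, he picks twice as many kiwis as on Friday. "
            "How many kiwis did he pick in total over the 3 days?")

def make_instance(rng: random.Random) -> tuple[str, int]:
    """Generate one problem instance and its ground-truth answer."""
    name = rng.choice(["Oliver", "Sophie", "Liam"])
    x, y = rng.randint(10, 99), rng.randint(10, 99)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y + 2 * x  # the arithmetic is identical across all instances
    return question, answer

question, answer = make_instance(random.Random(0))
```

Because every instance shares the same ground-truth formula, any drop in accuracy across instances can be attributed to the surface changes (names, numbers) rather than to harder mathematics.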



The LLMs are sensitive to changes in the names of people or the types of food in a problem: even though the numbers, and therefore the correct answer, remain the same, merely changing the names affects the models' answers. 'Changing a word or two in an unrelated way, or adding a bit of irrelevant information, can result in a different answer. It is not possible to build a reliable agent on such a foundation,' the researchers concluded.

Following the paper and Farajtabar's commentary, OpenAI researcher Boaz Barak said, 'This is a very interesting paper, but I cannot agree with the hypothesis that current LLMs cannot perform true logical reasoning.' According to Barak, many currently released LLMs are 'chat models' tuned for dialogue with users rather than for math exams, so they are sensitive to changes in the input text. Their failures even on elementary-school arithmetic are not evidence that LLMs cannot reason, but predictable behavior given how they were trained. 'If you want to solve arithmetic, I speculate that improving the prompt a little will restore most or all of the performance lost in these failure cases,' Barak pointed out.



In fact, to address AI's weakness in reasoning, OpenAI announced in September 2024 an AI model codenamed 'Strawberry' that focuses on reasoning in order to handle complex mathematics and programming.

OpenAI to release new AI model 'Strawberry' focused on inference within two weeks - GIGAZINE



in Software, Posted by log1e_dh