
Apple Engineers Show How Fragile AI ‘Reasoning’ Is


For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new results supports previous research suggesting that LLMs’ use of probabilistic pattern matching lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

Mix it up

In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” currently available as a preprint, six Apple researchers started with GSM8K’s standardized set of more than 8,000 grade-school-level math word problems, which is often used as a benchmark for the complex reasoning capabilities of modern LLMs. They then took the novel approach of modifying a portion of that test suite to dynamically replace certain names and numbers with new values, so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
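For illustration, here is a minimal sketch of what that kind of templated substitution could look like. The template wording, name pools, and number ranges below are assumptions for demonstration, not code or data from the paper itself:

```python
import random

# A minimal sketch of the templating idea described above. The template
# wording, name pools, and number ranges are illustrative assumptions,
# not taken from the GSM-Symbolic paper.
TEMPLATE = (
    "{name} buys {x} building blocks for their {relative}, "
    "then buys {y} more. How many blocks is that in total?"
)

NAMES = ["Sophie", "Bill", "Maria", "James"]
RELATIVES = ["nephew", "brother", "sister", "cousin"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Fill the template with fresh names and numbers. The surface details
    change, but the single reasoning step (one addition) never does."""
    x, y = rng.randint(5, 50), rng.randint(5, 50)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), relative=rng.choice(RELATIVES), x=x, y=y
    )
    return question, x + y  # question text plus its ground-truth answer

rng = random.Random(0)  # fixed seed so generated variants are reproducible
for _ in range(3):
    print(make_variant(rng))
```

Because the answer is computed from the same formula every time, each generated variant tests exactly the same reasoning as the original question, just with different surface details.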

This approach avoids any potential “data contamination” that can result from static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should in theory perform just as well when tested on GSM-Symbolic as on GSM8K.

Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops ranging from 0.3 percent to 9.2 percent, depending on the model. The results also showed high variance across 50 separate GSM-Symbolic runs with different names and values: accuracy gaps of up to 15 percent between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to hurt accuracy more than changing the names.

This kind of variance, both within different GSM-Symbolic runs and compared to GSM8K results, is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”

Don’t get distracted

Still, the overall variance shown in GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for example, dropped from 95.2 percent accuracy on GSM8K to a still impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate on either benchmark, regardless of whether the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this set of “GSM-NoOp” benchmarks (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
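To make the idea concrete, here is a minimal sketch of how such a distractor could be injected into a question. The exact question wording and the helper function are assumptions for demonstration, not the paper’s code:

```python
# An illustrative sketch of the GSM-NoOp idea: append a clause that sounds
# relevant but changes nothing about the arithmetic. The question wording
# and helper function here are assumptions for demonstration only.
BASE_QUESTION = (
    "Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, and on Sunday "
    "he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)

NO_OP_CLAUSE = "Five of the Sunday kiwis were a bit smaller than average."

def add_no_op(question: str, clause: str) -> str:
    """Insert the inconsequential clause before the final question sentence,
    leaving the required math (44 + 58 + 2 * 44 = 190) unchanged."""
    body, final_question = question.rsplit(". ", 1)
    return f"{body}. {clause} {final_question}"

print(add_no_op(BASE_QUESTION, NO_OP_CLAUSE))
```

The reasoning needed to solve the question is identical with or without the extra clause; the paper’s finding is that models nonetheless stumble badly on the padded version.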

Adding in these red herrings led to what the researchers called “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits of using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.
