Best in Technology
Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

By News Room · 16 October 2024 · 4 Mins Read
For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new results helps support previous research suggesting that LLMs’ use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

Mix It Up

In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”—currently available as a preprint paper—the six Apple researchers start with GSM8K’s standardized set of more than 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs’ complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values—so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.

This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K.
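The substitution scheme can be sketched in a few lines of Python. This is a minimal illustration of the GSM-Symbolic idea, not the researchers' actual code: the template, names, and helper function are all hypothetical, but they show how swapping names and numbers changes the surface form of a question while the ground-truth answer is recomputed from the same underlying arithmetic.

```python
import random

# Illustrative sketch of the GSM-Symbolic approach: treat a static
# GSM8K-style question as a template, then sample fresh names and numbers
# so the wording varies while the reasoning required stays identical.

TEMPLATE = (
    "{name} buys {n} building blocks for their {relative}. "
    "Each block costs {cost} dollars. How much does {name} spend?"
)

NAMES = ["Sophie", "Bill", "Maya", "Omar"]
RELATIVES = ["nephew", "brother", "cousin"]

def make_variant(seed: int) -> tuple[str, int]:
    """Return one (question, answer) pair; each seed yields a new variant."""
    rng = random.Random(seed)
    n = rng.randint(5, 40)
    cost = rng.randint(1, 9)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        relative=rng.choice(RELATIVES),
        n=n,
        cost=cost,
    )
    # The ground truth is recomputed per variant, so the benchmark can be
    # regenerated endlessly without reusing values a model may have memorized.
    return question, n * cost
```

Because every variant's answer is derived from the sampled values rather than stored, a model that merely memorized the original question-answer pair gains nothing.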

Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.

This kind of variance—both within different GSM-Symbolic runs and compared to GSM8K results—is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”

Don’t Get Distracted

Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”

Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.
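The GSM-NoOp manipulation is equally simple to sketch. The helper below is a hypothetical illustration, not the paper's implementation: it splices a distractor clause in before the final question sentence, so the question now mentions an extra quantity that has no bearing on the correct answer.

```python
# Illustrative sketch of the GSM-NoOp idea: insert a "seemingly relevant
# but ultimately inconsequential" clause into a word problem. The clause
# mentions a quantity, but the correct answer is unchanged.

def add_noop(question: str, clause: str) -> str:
    """Splice a distractor clause in before the final question sentence."""
    body, sep, final = question.rpartition(". ")
    if not sep:
        # No preceding sentence found; fall back to prepending the clause.
        return f"{clause.capitalize()}. {question}"
    return f"{body}, but {clause}. {final}"
```

Applied to a kiwi-picking question, `add_noop(q, "five of them were a bit smaller than average")` yields a problem whose arithmetic is untouched, yet, per the paper's results, many models subtract the five "smaller" kiwis anyway.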
