A newly published Apple Machine Learning Research study has challenged the prevailing narrative around "reasoning" large language models like OpenAI's o1 and Claude's thinking variants, revealing fundamental limitations that suggest these systems aren't truly reasoning at all.
For the study, rather than using standard math benchmarks that are prone to data contamination, Apple researchers designed controllable puzzle environments, including Tower of Hanoi and River Crossing. This allowed precise analysis of both the final answers and the internal reasoning traces across varying complexity levels, according to the researchers.
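To illustrate what makes such puzzles useful as benchmarks (this is a minimal sketch, not the researchers' actual code), a Tower of Hanoi environment ties problem complexity to a single parameter, the number of disks, and every candidate answer can be verified mechanically:

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Generate the optimal move sequence for n disks (2**n - 1 moves)."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)   # clear the way
            + [(src, dst)]                      # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst))  # restack on top of it

def is_valid_solution(n, moves):
    """Replay a move list against the rules: a disk may only rest on a larger one."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(disk)
    return pegs["C"] == list(range(n, 0, -1))

# Complexity is controlled by n alone: 7 disks already require 127 moves.
for n in (3, 7, 10):
    moves = hanoi_moves(n)
    assert is_valid_solution(n, moves)
    print(n, len(moves))
```

Because the minimal solution length (2^n − 1) grows exponentially with the number of disks, a single knob sweeps the problem from trivial to far beyond any model's demonstrated range, and the verifier checks every intermediate step, not just the final answer.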
The results are striking, to say the least. All tested reasoning models – including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet – experienced complete accuracy collapse beyond certain complexity thresholds, dropping to zero success rates despite having ample computational resources. Counterintuitively, the models actually reduce their thinking effort as problems become more complex, suggesting fundamental scaling limitations rather than resource constraints.
Perhaps most damning, even when researchers supplied complete solution algorithms, the models still failed at the same complexity points. The researchers say this indicates the limitation lies not in problem-solving strategy, but in basic logical step execution.
Models also showed puzzling inconsistencies – succeeding on problems requiring 100+ moves while failing on simpler puzzles needing only 11 moves.
The research highlights three distinct performance regimes: standard models surprisingly outperform reasoning models at low complexity, reasoning models show advantages at medium complexity, and both approaches fail completely at high complexity. The researchers' analysis of reasoning traces also revealed inefficient "overthinking" patterns, where models found correct solutions early but wasted computational budget exploring incorrect alternatives.
The takeaway from Apple's findings is that current "reasoning" models rely on sophisticated pattern matching rather than genuine reasoning capabilities. It suggests that LLMs don't scale reasoning the way humans do, overthinking easy problems and thinking less about harder ones.
The timing of the publication is notable, coming just days before WWDC 2025, where Apple is expected to limit its focus on AI in favor of new software designs and features, according to Bloomberg.