You've seen the new wave of AI that doesn't just give an answer but shows its "thinking" process.

These Large Reasoning Models (LRMs) produce a long "chain-of-thought" to explain their logic, and they often perform better on tough problems. But a new study from researchers at Apple is questioning what's really going on under the hood, suggesting this "thinking" might have very real, and surprising, limits.

The AI Proving Ground: Puzzle Environments

To get a clear picture of AI reasoning, researchers moved away from standard math and coding tests, which can suffer from data contamination (where the AI has likely seen the answers in its training data). Instead, they created a controlled testbed using four classic puzzles:

  • Tower of Hanoi: A classic puzzle of moving disks between pegs, requiring recursive thinking.

  • Checkers Jumping: A puzzle involving swapping colored checkers on a line, testing sequential planning.

  • River Crossing: A constraint-based puzzle where actors and their agents must cross a river, with no actor ever left in the presence of another's agent unless their own agent is also there.

  • Blocks World: A planning puzzle that involves stacking and unstacking blocks to reach a goal configuration.

This puzzle setup is clever because it allows for precise control over complexity while ensuring the AI is solving a novel problem based purely on the rules provided. Using simulators, the researchers could verify every single move, not just the final answer, offering a clear window into the models' internal reasoning traces, or "thoughts".
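To make the "verify every single move" idea concrete, here is a minimal sketch (not the paper's actual code) of what a Tower of Hanoi checker might look like: it replays a proposed move sequence and rejects the trace at the first illegal move, rather than only grading the final state.

```python
def check_hanoi(n, moves):
    """Replay a list of (src, dst) moves on an n-disk Tower of Hanoi.

    Pegs are numbered 0, 1, 2; all disks start on peg 0 and must end
    on peg 2. Returns (ok, reason) so a failing trace reports exactly
    which move broke the rules.
    """
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk (n) at the bottom
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, f"move {i}: peg {src} is empty"
        disk = pegs[src][-1]
        # A disk may never be placed on a smaller disk.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, f"move {i}: disk {disk} on smaller disk {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(n, 0, -1))
    return solved, "solved" if solved else "legal moves, goal not reached"
```

For example, `check_hanoi(2, [(0, 1), (0, 2), (1, 2)])` accepts the standard 2-disk solution, while a trace that stacks a big disk on a small one is rejected with the offending move's index.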

Puzzles used to benchmark LRMs and LLMs.

The Experimental Results: Three Regimes of AI Reasoning

By comparing LRMs (like Claude 3.7 Sonnet with "thinking") to their standard Large Language Model (LLM) counterparts using the same amount of inference compute (the computational power used to generate an answer), the study uncovered a fascinating pattern across three distinct performance regimes.

  1. Low Complexity (The "Just-Answer-It" Zone): On simple problems, the standard LLMs that gave a direct answer were surprisingly more accurate and efficient. The LRMs tended to "overthink," finding the right solution early but then continuing to explore incorrect paths, wasting computational resources.

  2. Medium Complexity (The "Thinking-Helps" Zone): As the puzzles got moderately harder, the LRMs started to shine. Their ability to generate a long chain of thought gave them a clear advantage, and the performance gap between thinking and non-thinking models widened.

  3. High Complexity (The "Collapse" Zone): When the puzzles became too difficult, both model types experienced a complete performance collapse, with accuracy for all models dropping to zero. This shows that while "thinking" helps, it doesn't prevent an ultimate failure when faced with high compositional depth.
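The "overthinking" analysis boils down to asking where in the reasoning trace the first correct candidate appears. Here is a hypothetical sketch of that measurement (the function name, data, and trace format are illustrative assumptions, not the paper's pipeline):

```python
def first_correct_position(candidates, is_correct):
    """Given candidate solutions in the order they appear in a reasoning
    trace, return the fractional position (0.0 = start, 1.0 = end) of the
    first correct one, or None if no candidate is correct."""
    for i, cand in enumerate(candidates):
        if is_correct(cand):
            return i / max(len(candidates) - 1, 1)
    return None

# Toy trace: the correct answer surfaces second out of five candidates,
# yet three more incorrect explorations follow -- the "overthinking"
# pattern the study reports on easy problems.
trace = ["wrong-A", "right", "wrong-B", "wrong-C", "wrong-D"]
pos = first_correct_position(trace, lambda c: c == "right")  # 0.25
```

A low fractional position on easy puzzles (correct answer early, wasted tokens after) versus a late or absent one on hard puzzles is exactly the shape of the three regimes described above.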

How Complexity Affects Reasoning: The Collapse of "Thinking"

The most startling discovery was how the LRMs failed as problems got harder. You would expect that as a puzzle's complexity increases, the AI would "think" more and use more of its allocated token budget to find a solution.

Instead, the opposite happened. As the puzzles neared the point of total failure, the LRMs began to counter-intuitively reduce their reasoning effort. Despite having a large token budget available, the models spent fewer tokens thinking as the problems became more difficult. This suggests a fundamental scaling limitation in their reasoning capabilities. It's as if their problem-solving ability has an internal ceiling, regardless of the resources they're given.

Even more telling, when researchers gave one model the exact, explicit algorithm to solve the Tower of Hanoi, its performance didn't improve, and it still failed at roughly the same point. This indicates a core limitation not just in finding a solution, but in the basic ability to follow a long sequence of logical steps accurately.
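To see why following the algorithm is the hard part, consider the textbook recursive solution (a standard sketch; the paper's exact prompt wording may differ). Knowing it is trivial, but emitting the answer means producing every one of its 2**n - 1 moves, in order, without a single slip:

```python
def hanoi_moves(n, src=0, aux=1, dst=2):
    """Return the full move list (src_peg, dst_peg) that solves an
    n-disk Tower of Hanoi: move n-1 disks aside, move the largest,
    then move the n-1 disks on top of it."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)   # clear the top n-1 disks
            + [(src, dst)]                      # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst))  # restack on top

# Ten disks already require 1023 consecutive legal moves; every one must
# be correct for the puzzle to count as solved.
```

The solution length doubles with each added disk, so even a model that "knows" the algorithm must sustain an exponentially long, error-free execution, which is precisely where the collapse occurs.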

These findings challenge the idea that current LRMs are on a straightforward path to human-like reasoning. The "thinking" we see is powerful but may be more of a sophisticated illusion—a pattern-matching process that breaks down when faced with true, novel complexity.
