What if giving large language models more room to think doesn’t actually help them think better?
That’s the question at the heart of Apple’s recent research paper, The Illusion of Thinking. The study takes a closer look at how reasoning-focused models behave as tasks grow more complex, and what it finds is surprising.
Instead of thinking harder when problems get tougher, models often do the opposite. They expend more effort on easier questions, but begin to falter just when deeper reasoning is needed. Sometimes, they overthink. Other times, they cut their reasoning short. In both cases, the results reveal a clear mismatch between effort and difficulty.
This article breaks down the paper’s key findings and what they mean for researchers, developers, and anyone working with LLMs. From the dangers of over-reasoning to the limits of token-based prompting, we explore why more tokens don’t always mean better thinking and what it might take to build models that truly reason well.
“Thinking” Isn’t Linear — It Peaks, Then Drops
One of the most widely held assumptions about large language models is that the more time and space you give them to “think,” the better their answers will be. Techniques such as Chain of Thought (CoT) prompting are built on this idea.
The reasoning is straightforward: if a model is encouraged to work through a problem step by step, using more tokens along the way, the output should improve. But recent findings from Apple suggest this belief may not hold true, at least not consistently.
In Apple’s study, as the difficulty of a problem increased, the model initially responded by producing longer outputs. This aligns with what we expect from a model that is “thinking harder” when faced with a tougher task. However, this trend doesn’t last. Token usage eventually plateaus, and more surprisingly, it starts to drop, even though the tasks continue to grow more complex and the model still has enough token budget left.
In plain terms, the model begins by writing more as tasks get harder, but then:
- It stops increasing its reasoning effort (token usage levels off).
- Eventually, it starts using fewer tokens at the exact moment when deeper thinking is needed.
- This happens even when the model is allowed more room to respond.
This is a strange reversal of expectations. When a human encounters a harder question, they generally take more time and think in more detail. The model, in contrast, appears to retreat, offering less explanation as the problem becomes more complex.
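To see this pattern for yourself, one simple experiment is to run the same puzzle family at increasing sizes and log how much the model writes each time. Below is a minimal sketch of that loop; `run_model` and `count_tokens` are hypothetical stand-ins you would replace with your own model client and tokenizer. Tower of Hanoi is used because it is one of the puzzle families in Apple’s experiments.

```python
def run_model(prompt: str) -> str:
    """Placeholder: return the model's full reasoning trace for `prompt`."""
    return "step 1 ... step 2 ... answer: 42"  # stub output, replace with a real call

def count_tokens(text: str) -> int:
    """Crude proxy for token usage; swap in a real tokenizer if you have one."""
    return len(text.split())

# Tower of Hanoi with an increasing number of disks, one of the puzzle
# families used in Apple's experiments
complexities = range(1, 11)
usage = []
for n in complexities:
    prompt = f"Solve the Tower of Hanoi puzzle with {n} disks, listing every move."
    trace = run_model(prompt)
    usage.append((n, count_tokens(trace)))

for n, tokens in usage:
    print(f"complexity={n:2d}  reasoning_tokens={tokens}")
```

Plotting the resulting counts against problem size is what surfaces the rise, plateau, and drop described above.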
One lens through which to understand this is Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” In this case, token count is used as a stand-in for reasoning depth. But if models are trained or prompted to simply maximise token usage, they may end up mimicking reasoning rather than engaging in it. The number of words grows, but the quality of thought doesn’t necessarily follow.
Apple’s research reveals an uncomfortable truth: these models don’t always “know” when they should be thinking more. Their reasoning isn’t linear or consistent: it rises, peaks, and then drops off, even when there’s still room to elaborate. What looks like intelligence may often be just performance.
This has important implications for how we design prompts, evaluate model outputs, and define what reasoning truly means in artificial intelligence.
Overthinking Is Real — and Harmful
Overthinking isn’t just a human problem; it harms models too. While complex problems can cause models to give up too soon, simpler tasks reveal a different flaw: overthinking. When faced with straightforward questions, reasoning models often arrive at the correct answer early on. But instead of stopping there, they keep going. They continue to reason, add steps, second-guess themselves, and ultimately drift away from the right answer.
This pattern mirrors a common human habit: the tendency to overanalyse simple decisions until doubt creeps in. In the context of language models, it becomes a costly misstep.
The overthinking trap leads to:
- Decreased accuracy — correct answers get overwritten by later misjudgements
- Wasted tokens — the model spends more than necessary without improving quality
- Unnecessary confusion — longer reasoning chains introduce irrelevant or conflicting ideas
The irony here is hard to miss. The step-by-step reasoning approach, widely seen as a breakthrough for improving model performance, can backfire. When not managed carefully, it becomes the very reason the model goes wrong.
This isn’t just a technical quirk; it has real implications. In settings where accuracy is critical, such as healthcare or legal advice, the consequences of a model “talking itself out of the right answer” could be serious.
It also raises questions about control. How do we tell a model when to stop thinking? How do we know when enough is enough? Without a clear mechanism for detecting sufficiency (the point at which further reasoning becomes counterproductive), even the most well-designed logic chain may spiral into error.
The challenge, then, is not just getting models to think more. It’s teaching them when to stop.
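What might such a stopping mechanism look like? One illustrative possibility, not something proposed in Apple’s paper, is a sufficiency check that generates reasoning in short chunks and stops once the provisional answer stabilises. In the sketch below, `generate_chunk` and `extract_answer` are hypothetical placeholders you would back with a real model call and a real answer parser.

```python
def generate_chunk(prompt: str, so_far: str) -> str:
    """Placeholder: return the next short block of reasoning from your model."""
    return " ... therefore the answer is 42."

def extract_answer(reasoning: str) -> str | None:
    """Placeholder: pull the current provisional answer out of the trace."""
    if "answer is" not in reasoning:
        return None
    return reasoning.rsplit("answer is", 1)[-1].strip(" .")

def reason_until_stable(prompt: str, max_chunks: int = 8, patience: int = 2) -> str | None:
    """Stop generating once the provisional answer has not changed for `patience` chunks."""
    trace, last_answer, stable_for = "", None, 0
    for _ in range(max_chunks):
        trace += generate_chunk(prompt, trace)
        answer = extract_answer(trace)
        stable_for = stable_for + 1 if answer is not None and answer == last_answer else 0
        last_answer = answer
        if stable_for >= patience:  # the answer has settled: more reasoning only adds risk
            break
    return last_answer

print(reason_until_stable("What is 6 * 7?"))
```

The design choice here is that stability of the answer, rather than length of the reasoning, decides when to stop.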
Three Zones of Reasoning Complexity
Apple’s research offers a clearer framework for understanding how language models perform across different levels of task complexity. Rather than treating all problems the same, their experiments revealed that model behaviour shifts notably depending on how difficult the task is.
In particular, they observed three distinct performance zones:
| Complexity | Best Performer | Notes |
| --- | --- | --- |
| Low | Standard LLMs | Simple tasks require no advanced reasoning |
| Medium | Reasoning Models | CoT prompts help models handle multi-step logic |
| High | Neither | Reasoning collapses despite available resources |
In the low complexity zone, traditional language models perform reliably well. These are straightforward tasks where added reasoning steps are not only unnecessary but may introduce risk, as previously discussed in the context of overthinking.
For medium complexity problems, reasoning models come into their own. Chain of Thought prompting proves useful here, helping models connect multiple steps in a logical progression. This is where structured reasoning has the most visible payoff, bridging gaps in information and guiding the model through slightly more layered decisions.
But when we move into the high complexity zone, things start to fall apart. Neither standard models nor reasoning-enhanced ones perform consistently well. Despite having access to the full token budget and enough system capacity, models tend to show a marked drop in both reasoning effort and answer accuracy. The decline is not just in the quality of responses but in the willingness to “try”: models often reduce their output length and detail precisely when the task becomes most demanding.
This breakdown exposes a critical ceiling in current reasoning architectures. They are optimised for tasks in the middle, not at the extremes. And for high-stakes or highly complex scenarios, that’s a serious limitation.
Understanding these zones matters. It reminds us that pushing reasoning models into every corner of problem-solving may not always yield better outcomes. Instead, it calls for a more selective, context-aware use of reasoning strategies, customised to task complexity, not applied blindly across the board.
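As a rough illustration of what selective, complexity-aware prompting could look like, the sketch below routes a task to a plain prompt, a Chain of Thought prompt, or a decomposition prompt based on an estimated complexity zone. The `estimate_complexity` heuristic is entirely hypothetical, keyed on a made-up size field; in practice you would derive the zone from your own task metadata or a calibration set.

```python
def estimate_complexity(task: dict) -> str:
    """Hypothetical scorer: bucket a task into a complexity zone by its size."""
    size = task.get("size", 1)
    if size <= 3:
        return "low"
    if size <= 7:
        return "medium"
    return "high"

def build_prompt(task: dict) -> str:
    zone = estimate_complexity(task)
    if zone == "low":
        # Simple tasks: ask directly, giving the model nothing to overthink
        return f"Answer concisely: {task['question']}"
    if zone == "medium":
        # Multi-step tasks: this is where Chain of Thought prompting pays off
        return f"Think step by step, then answer: {task['question']}"
    # High complexity: don't rely on a single long reasoning pass
    return f"Break the problem into smaller sub-problems before solving: {task['question']}"

print(build_prompt({"question": "Solve Tower of Hanoi with 5 disks.", "size": 5}))
```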
The Collapse Isn’t Obvious — Until You Visualise It
One of the most deceptive qualities of large language models is how polished their responses can appear, even when the underlying reasoning has failed. This is where Apple’s research reveals something critical: the collapse in reasoning effort doesn’t always show up in the text itself. The output might still read fluently. It might still sound confident. But beneath that surface, the thought process has already broken down.
This is what makes the problem particularly difficult to detect. Without tools to inspect what’s happening under the hood, it’s easy to mistake fluency for understanding and length for logic.
Apple’s experiments showed that the illusion only becomes clear when you start to visualise the model’s behaviour over time. Key elements to observe include:
- Token usage — how much the model is actually “saying” in response to increasing complexity
- Reasoning steps — the structure and depth of the thought process, if present
- Accuracy over complexity — how performance changes as tasks become harder
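One way to make these signals visible is a simple chart of reasoning tokens and accuracy against task complexity. The sketch below uses matplotlib with illustrative placeholder numbers shaped like the rise-peak-drop pattern described above; they are not figures from Apple’s paper, so substitute your own measurements.

```python
import matplotlib.pyplot as plt

# Illustrative placeholder data only, shaped like the rise-peak-drop pattern
complexity       = [1, 2, 3, 4, 5, 6, 7, 8]
reasoning_tokens = [200, 450, 800, 1100, 1200, 1100, 700, 400]
accuracy         = [0.98, 0.95, 0.90, 0.80, 0.60, 0.35, 0.10, 0.02]

fig, ax1 = plt.subplots()
ax1.plot(complexity, reasoning_tokens, marker="o", label="reasoning tokens")
ax1.set_xlabel("task complexity")
ax1.set_ylabel("reasoning tokens")

ax2 = ax1.twinx()  # second y-axis so accuracy and token counts share one chart
ax2.plot(complexity, accuracy, marker="s", color="tab:red", label="accuracy")
ax2.set_ylabel("accuracy")

fig.legend(loc="upper right")
ax1.set_title("Token usage and accuracy vs. task complexity")
plt.show()
```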
It’s a reminder that trust in model output shouldn’t be based on tone or style. True reasoning performance must be measured, not assumed, and ideally made visible to the people relying on it.
So, What Is Good Reasoning in LLMs?
Apple’s paper is not just a critique of current reasoning models; it’s a call to rethink what we value in language model behaviour. It reveals the limits of existing approaches, but more importantly, it raises timely questions that may shape the future of AI research.
At the heart of the issue is a simple yet profound challenge: what does good reasoning actually look like in a machine? Is it about the number of steps a model takes? The length of the response? The final accuracy? Or is it something deeper, the ability to recognise when a task is hard, and respond with effort that matches the difficulty?
Apple’s work prompts the AI community to ask:
- How do we measure true reasoning effort? Are current metrics enough, or are we still mistaking verbosity for thoughtfulness?
- Can models be trained to reflect on task complexity, not just focus on generating fluent output? What would it mean for a model to “know” how hard something is?
- Should token budgets adapt dynamically based on problem difficulty? Could future models self-regulate how much effort they spend, depending on what they’re trying to solve?
These questions lie at the intersection of capability, alignment, and trust. They are not just technical issues; they speak to how we design systems that reason in ways humans can understand, predict, and rely on.
The answers may guide the next generation of language models, not just in how they speak, but in how they think.
The Bottom Line: Beware the Illusion
Apple’s The Illusion of Thinking is more than a technical critique. It’s a clear warning: we must not confuse fluency with intelligence, or length with depth. As language models become central to everything from autonomous agents to productivity copilots, the stakes are rising.
Sometimes, a model truly engages with a problem. It reasons, steps through ideas, and lands on a sound answer. But sometimes, it simply produces text that looks like thinking. The logic breaks down quietly, masked by polished language and structured output. What you see isn’t always what you’re getting.
Applying the Insight
If you’re curious to know more about it, you can read the full paper by Apple: The Illusion of Thinking
At Wow Labz, we closely follow research like this to design AI systems that are not only capable, but also trustworthy. Whether you’re building intelligent agents, copilots, or custom AI tools, we can help you apply these insights to create solutions that think clearly, not just sound smart.