
Large language models are usually assessed by the number of tokens they generate or the reliability of their outputs. But new research by Google (arXiv: 2602.13517, February 2026) challenges this notion.
The paper proposes deep-thinking tokens, a measure of the actual reasoning effort of transformer models. Instead of measuring how much a model writes, it examines how the model's internal predictions change during inference.
This shift changes the way we think about reasoning in LLMs, particularly because inference-time scaling is now essential for improving performance.
Why Token Count Fails as a Reasoning Metric
In many evaluation pipelines, longer outputs are regarded as an indication of more complex reasoning. The logic is straightforward: more tokens suggest more thought.
However, this method has some major drawbacks:
- Verbose answers may conceal the lack of reasoning
- Some correct solutions require few tokens
- Token count does not reflect internal computation
- Confidence scores do not consistently correlate with accuracy
As models improve through test-time scaling, which allocates more computing power during inference, precise signals of reasoning effort are becoming crucial.
What Are Deep Thinking Tokens?
Deep-thinking tokens are those whose internal predictions change significantly across the deeper layers of a transformer before stabilising.
In transformer-based architectures, every token is passed through several layers, and at each layer the model refines its prediction. For certain tokens, this prediction distribution shifts dramatically as the layers progress.
These transitions from unstable to stable signal computational refinement. The paper interprets this as genuine reasoning effort.
Key Characteristics
- Identified during inference
- Based on the prediction of instabilities across layers
- Not dependent on output length
- Measurable without changing model architecture
This is a process-based metric rather than an output-based one.
How Deep-Thinking Tokens Are Identified
The method monitors the distribution of predictions for each token across the transformer’s layers.
A token may be regarded as “deep-thinking” when:
- Early layers produce unstable or shifting predictions
- Later layers converge to a stable prediction
- The degree of change exceeds a defined threshold
This layerwise instability acts as a proxy for internal debate.
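The paper's exact detection procedure isn't reproduced here, but the criteria above can be sketched roughly as follows. This is a minimal illustration assuming access to "logit-lens"-style per-layer predictions for a single token position; the `threshold` value and the early/late split are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def is_deep_thinking(layer_logits, threshold=0.5):
    """layer_logits: array of shape (num_layers, vocab_size) holding the
    per-layer prediction logits for ONE token position.
    Flags the token as deep-thinking when predictions shift strongly in
    early layers and then settle in later ones."""
    probs = softmax(layer_logits)
    # Total-variation distance between consecutive layers' distributions.
    shifts = 0.5 * np.abs(np.diff(probs, axis=0)).sum(axis=-1)
    half = len(shifts) // 2
    early_instability = shifts[:half].mean()
    late_stability = shifts[half:].mean()
    return bool(early_instability - late_stability > threshold)
```

A token whose per-layer predictions flip between candidates early on and then lock onto one answer would be flagged; a token that is confidently predicted from the first layer onward would not.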
Traditional vs Deep Thinking Measurement
| Approach | What It Measures | Limitation |
|---|---|---|
| Token Count | Output length | Can reward verbosity |
| Confidence Score | Final prediction certainty | May not reflect reasoning depth |
| Deep-Thinking Tokens | Layer-wise prediction shifts | Requires internal access to model states |
The deep-thinking token ratio, the percentage of tokens that satisfy this criterion, becomes a new reasoning signal.
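Once each generated token has been flagged, the ratio itself is simple arithmetic. A minimal sketch, assuming a per-token list of boolean flags produced by whatever detection criterion is in use:

```python
def deep_thinking_ratio(flags):
    """flags: iterable of booleans, one per generated token, marking
    whether that token met the deep-thinking criterion.
    Returns the fraction of deep-thinking tokens in the sequence."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0
```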
Empirical Evaluation Across Benchmarks
The paper evaluates this metric on scientific-reasoning and mathematical benchmarks, including:
- AIME 2024 and AIME 2025
- HMMT 2025
- GPQA-diamond
The models tested include:
- DeepSeek-R1
- Qwen3
- GPT-OSS
In these tests, the ratio of deep-thinking tokens showed a stronger correlation with answer accuracy than:
- Total token count
- Output length
- Confidence-based metrics
This implies that reasoning quality depends more on internal computational patterns than on the text produced.
Feature Comparison: Reasoning Signals
| Metric Type | Correlation with Accuracy | Computational Insight | Susceptible to Verbosity |
|---|---|---|---|
| Token Count | Weak to Moderate | No | Yes |
| Confidence Score | Moderate | Limited | No |
| Deep-Thinking Ratio | Strong | Yes | No |
The results indicate that reasoning-aware metrics may outperform surface-level signals.
Introducing Think@n: Smarter Test-Time Compute
The study also proposes “Think@n”, a test-time compute strategy based on the deep-thinking token signal.
Instead of distributing inference compute uniformly, Think@n:
- Prioritises samples with high deep-thinking ratios
- Rejects low-quality partial outputs early
- Reduces overall computation
- Maintains benchmark performance
This aligns with the latest inference-time scaling techniques, which allocate computation selectively to more challenging problems.
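The paper's exact selection algorithm isn't reproduced here, but the idea can be sketched as a simple best-of-n filter. This is a hypothetical illustration assuming each parallel sample carries a precomputed deep-thinking ratio; the names `think_at_n` and `keep` are mine, not the paper's:

```python
def think_at_n(candidates, keep=2):
    """candidates: list of (sample_id, deep_thinking_ratio) pairs from n
    parallel partial generations. Only the `keep` highest-ratio samples
    continue to receive compute; the rest are rejected early."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:keep], ranked[keep:]
```

In a real pipeline the rejected samples would simply stop being decoded, which is where the compute savings come from.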
Why This Matters for AI Efficiency
The cost of inference is rising rapidly as models become more complex and more reasoning-intensive. Efficient computation is now as crucial as model size.
Think@n offers:
- Better cost-performance balance
- Reduced inference waste
- Adaptive reasoning investment
As inference-time scaling becomes the dominant factor in AI development, greater precision in reasoning metrics will be required.
Why Deep-Thinking Tokens Matter for the Future of LLM Evaluation
The industry is increasingly relying on scaling strategies during inference rather than on increasing pretraining scale.
This shift poses a central issue:
How can we tell whether a model is actually thinking, and not just generating long outputs?
Deep-thinking tokens provide:
- A process-level reasoning signal
- A quantitative measure of internal debate
- A tool for optimising compute allocation
- A benchmark-aligned measure for evaluation
For research laboratories, it may transform evaluation models.
For businesses, it could increase the efficiency of their costs.
For safety research, it could offer innovative interpretation signals.
Practical Considerations
While promising, the approach comes with certain limitations:
- It requires access to outputs from intermediate transformer layers, which may not be available in closed API environments.
- It depends on how the instability thresholds are defined.
But for open-weight models or research-accessible models, deep-thinking metrics can provide an entirely new perspective on analysis.
Organisations using reasoning-heavy AI systems, such as scientific solvers, mathematical solvers, or coding copilots, might benefit the most from this approach.
My Final Thoughts
Deep thinking tokens represent an important shift in how we evaluate LLM reasoning. Instead of using output lengths or confidence scores, this approach examines the internal computational dynamics across transformer layers.
The February 2026 study reports stronger correlations between deep-thinking ratios and accuracy across scientific and maths benchmarks. The associated Think@n strategy further demonstrates how reasoning-aware metrics can improve compute efficiency.
As inference-time scaling becomes the most important performance factor in modern AI platforms, better and more precise reasoning metrics will be required. Deep-thinking tokens offer a promising way to distinguish real reasoning from mere verbosity, a crucial step towards smarter, more efficient language models.
Frequently Asked Questions (FAQs)
1. What are deep-thinking tokens in LLMs?
Deep-thinking tokens are those whose predicted probability distributions shift significantly across transformer layers before stabilising, indicating internal reasoning effort.
2. Why is token count an inadequate proxy for reasoning?
Token count measures output length, not internal computation. A lengthy response may mask shallow reasoning, while a short response can still be deeply reasoned.
3. How does the deep-thinking ratio relate to accuracy?
According to the Google research from February 2026, the deep-thinking token ratio correlates more strongly with benchmark accuracy than token count or confidence metrics on scientific and mathematical challenges.
4. What is Think@n in AI inference?
Think@n is a test-time compute method that prioritises samples with high deep-thinking ratios and rejects low-quality outputs early, reducing inference costs without sacrificing performance.
5. Does this method require modifications to the model?
There are no architectural changes needed. However, it requires access to internal layer-level prediction information during inference.
6. How does this affect the scaling of inference-time?
Inference-time scaling is becoming essential to improve reasoning efficiency. Deep-thinking tokens provide a more precise signal for efficiently allocating compute.