
Large language models are usually assessed by the number of tokens they generate or the reliability of their outputs. But new research by Google (arXiv: 2602.13517, February 2026) challenges this notion.
The paper proposes deep-thinking tokens, a measure of the actual reasoning effort of transformer models. Instead of measuring how much a model writes, it examines how the model's internal predictions change during inference.
This shift changes the way we think about reasoning in LLMs, particularly because inference-time scaling is now essential for improving performance.
Why Token Count Fails as a Reasoning Metric
In many evaluation pipelines, longer outputs are regarded as an indication of more complex reasoning. The logic is straightforward: more tokens suggest more thought.
However, this method has some major drawbacks:
- Verbose answers may conceal the lack of reasoning
- Some correct solutions require few tokens
- Token count does not reflect internal computation
- Confidence scores do not consistently correlate with accuracy
As models improve through test-time scaling, which allocates more computing power during inference, precise signals of reasoning effort are becoming crucial.
What Are Deep Thinking Tokens?
Deep-thinking tokens are those whose internal predictions change significantly across the deeper layers of a transformer before stabilising.
In transformer-based architectures, every token is passed through several layers, and at each layer the model refines its prediction. For certain tokens, this prediction distribution shifts dramatically as the layers progress.
These transitions from unstable to stable signal computational refinement. The paper interprets this as genuine reasoning effort.
Key Characteristics
- Identified during inference
- Based on the prediction of instabilities across layers
- Not dependent on output length
- Measurable without changing model architecture
This is a process-based metric rather than an output-based one.
How Deep-Thinking Tokens Are Identified
The method monitors the distribution of predictions for each token across the transformer’s layers.
A token may be regarded as “deep-thinking” when:
- Early layers produce unstable or shifting predictions
- Later layers converge to a stable prediction
- The degree of change exceeds a defined threshold
This layerwise instability acts as a proxy for internal debate.
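The paper's exact detection procedure isn't reproduced here, but the criteria above can be sketched roughly as follows. This is a minimal illustration assuming access to "logit-lens"-style per-layer predictions for a single token position; the `threshold` value and the early/late split are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def is_deep_thinking(layer_logits, threshold=0.5):
    """layer_logits: array of shape (num_layers, vocab_size) holding the
    per-layer prediction logits for ONE token position.
    Flags the token as deep-thinking when predictions shift strongly in
    early layers and then settle in later ones."""
    probs = softmax(layer_logits)
    # Total-variation distance between consecutive layers' distributions.
    shifts = 0.5 * np.abs(np.diff(probs, axis=0)).sum(axis=-1)
    half = len(shifts) // 2
    early_instability = shifts[:half].mean()
    late_stability = shifts[half:].mean()
    return bool(early_instability - late_stability > threshold)
```

A token whose per-layer predictions flip between candidates early on and then lock onto one answer would be flagged; a token that is confidently predicted from the first layer onward would not.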
Traditional vs Deep Thinking Measurement
| Approach | What It Measures | Limitation |
|---|---|---|
| Token Count | Output length | Can reward verbosity |
| Confidence Score | Final prediction certainty | May not reflect reasoning depth |
| Deep-Thinking Tokens | Layer-wise prediction shifts | Requires internal access to model states |
The deep-thinking token ratio, the percentage of tokens that satisfy this criterion, becomes a new reasoning signal.
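Once each generated token has been flagged, the ratio itself is simple arithmetic. A minimal sketch, assuming a per-token list of boolean flags produced by whatever detection criterion is in use:

```python
def deep_thinking_ratio(flags):
    """flags: iterable of booleans, one per generated token, marking
    whether that token met the deep-thinking criterion.
    Returns the fraction of deep-thinking tokens in the sequence."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0
```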
Empirical Evaluation Across Benchmarks
The paper evaluates this metric on scientific-reasoning and mathematical benchmarks, including:
- AIME 2024 and AIME 2025
- HMMT 2025
- GPQA-diamond
The models tested include:
- DeepSeek-R1
- Qwen3
- GPT-OSS
In these tests, the ratio of deep-thinking tokens showed a stronger correlation with answer accuracy than:
- Total token count
- Output length
- Confidence-based metrics
This implies that reasoning quality depends more on internal computational patterns than on the text produced.
Feature Comparison: Reasoning Signals
| Metric Type | Correlation with Accuracy | Computational Insight | Susceptible to Verbosity |
|---|---|---|---|
| Token Count | Weak to Moderate | No | Yes |
| Confidence Score | Moderate | Limited | No |
| Deep-Thinking Ratio | Strong | Yes | No |
The results indicate that reasoning-aware metrics may outperform surface-level signals.
Introducing Think@n: Smarter Test-Time Compute
The study also proposes “Think@n”, a test-time compute strategy based on the deep-thinking token signal.
Instead of distributing inference compute uniformly, Think@n:
- Prioritises samples with high deep-thinking ratios
- Rejects low-quality partial outputs early
- Reduces overall computation
- Maintains benchmark performance
This aligns with the latest inference-time scaling techniques, which allocate computation selectively to more challenging problems.
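The paper's exact selection algorithm isn't reproduced here, but the idea can be sketched as a simple best-of-n filter. This is a hypothetical illustration assuming each parallel sample carries a precomputed deep-thinking ratio; the names `think_at_n` and `keep` are mine, not the paper's:

```python
def think_at_n(candidates, keep=2):
    """candidates: list of (sample_id, deep_thinking_ratio) pairs from n
    parallel partial generations. Only the `keep` highest-ratio samples
    continue to receive compute; the rest are rejected early."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:keep], ranked[keep:]
```

In a real pipeline the rejected samples would simply stop being decoded, which is where the compute savings come from.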
Why This Matters for AI Efficiency
The cost of inference is rising rapidly as models become more complex and more reasoning-intensive. Efficient computation is now as crucial as model size.
Think@n offers:
- Better cost-performance balance
- Reduced inference waste
- Adaptive reasoning investment
As inference-time scaling becomes the dominant factor in AI development, greater precision in reasoning metrics will be required.
Why Deep-Thinking Tokens Matter for the Future of LLM Evaluation
The industry is increasingly relying on scaling strategies during inference rather than on increasing pretraining scale.
This shift poses a central issue:
How can we tell whether a model is actually thinking, and not just generating long outputs?
Deep-thinking tokens provide:
- A process-level reasoning signal
- A quantitative measure of internal debate
- A tool for optimising compute allocation
- A benchmark-aligned measure for evaluation
For research laboratories, it may transform evaluation models.
For businesses, it could increase the efficiency of their costs.
For safety research, it could offer innovative interpretation signals.
Practical Considerations
While promising, the approach comes with certain limitations:
- It requires access to outputs from intermediate transformer layers, which may not be available in closed API environments.
- It depends on how the instability thresholds are defined.
But for open-weight models or research-accessible models, deep-thinking metrics can provide an entirely new perspective on analysis.
Organisations using reasoning-heavy AI systems, such as scientific solvers, mathematical solvers, or coding copilots, might benefit the most from this approach.
My Final Thoughts
Deep thinking tokens represent an important shift in how we evaluate LLM reasoning. Instead of using output lengths or confidence scores, this approach examines the internal computational dynamics across transformer layers.
The February 2026 study reports stronger correlations between deep-thinking ratios and accuracy across scientific and maths benchmarks. The associated Think@n strategy further demonstrates how reasoning-aware metrics can improve compute efficiency.
As inference-time scaling becomes the most important performance factor in modern AI platforms, better and more precise reasoning metrics will be required. Deep-thinking tokens offer a promising way to distinguish real reasoning from mere verbosity, a crucial step towards smarter, more efficient language models.
Frequently Asked Questions (FAQs)
1. What are deep-thinking tokens in LLMs?
Deep-thinking tokens are those whose predicted probability distributions shift significantly across transformer layers before stabilising, indicating internal reasoning effort.
2. Why is token count an inadequate proxy for reasoning?
Token count measures output length, not internal computation. A lengthy response may mask shallow reasoning, while a short response can still be deeply reasoned.
3. How does the deep-thinking ratio relate to accuracy?
According to the Google research from February 2026, the deep-thinking token ratio correlates more strongly with benchmark accuracy than token count or confidence metrics on scientific and mathematical challenges.
4. What is Think@n in AI inference?
Think@n is a test-time compute method that prioritises samples with high deep-thinking ratios and rejects low-quality outputs early, reducing inference costs without sacrificing performance.
5. Does this method require modifications to the model?
There are no architectural changes needed. However, it requires access to internal layer-level prediction information during inference.
6. How does this affect the scaling of inference-time?
Inference-time scaling is becoming essential to improve reasoning efficiency. Deep-thinking tokens provide a more precise signal for efficiently allocating compute.