Thinking Tokens

The concept of thinking tokens and how they improve model performance during reasoning.
Published

October 1, 2024

Modified

January 29, 2025

Thinking Tokens

The thinking tokens concept (also known as reasoning tokens) adds intelligence to large models at inference time. Until now, the only way to get more intelligent models was to pre-train larger models following the “scaling laws”, i.e. adding more training data and compute to the pre-training run.

Now, with “thinking tokens”, a model can achieve more intelligence by reasoning while performing next-token prediction.

<|startofthought|> and <|endofthought|>

The idea of thinking tokens has been introduced in works such as Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking, OpenAI’s o1 model, and the recent DeepSeek-R1. OpenAI refers to thinking tokens as reasoning tokens.

The basic concept is to generate “thinking tokens” at inference time to help the model predict the next token. A key challenge is to efficiently generate rationales at each token position in the input sequence: as the Quiet-STaR authors point out, simply running a separate forward pass for each token would be computationally intractable for longer sequences.
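As a rough illustration of the idea (a minimal sketch with a toy stand-in for the model, not Quiet-STaR’s actual code):

```python
START = "<|startofthought|>"
END = "<|endofthought|>"

def generate_with_thought(sample_next, tokens, n_thought_tokens=8):
    """Sketch of thought-conditioned prediction: sample a short rationale
    between learned thought markers, then predict the next "real" token
    conditioned on it. `sample_next` is any tokens -> next-token function
    (a hypothetical stand-in for a real decoding API)."""
    thought = [START]
    for _ in range(n_thought_tokens):
        thought.append(sample_next(tokens + thought))
    thought.append(END)
    return sample_next(tokens + thought)  # prediction made *after* thinking

# Toy stand-in so the sketch runs; a real language model would go here.
print(generate_with_thought(lambda ts: "step", "The answer is".split(), 3))
```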

Picture: Quiet-STaR

According to the authors, this happens during the inference pass of a language model, when it produces the probability distribution over next tokens for all input tokens in parallel. Quiet-STaR implements this by caching each forward pass and concatenating a diagonal attention mask to the previous attention mask: each generated thought token attends to all of the tokens that were used to generate it, as well as itself, but not to tokens on the other “counterfactual” thought paths.
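A minimal numpy sketch of that mask construction as I read it from the paper; `quiet_star_mask` is my own naming, not the authors’ code:

```python
import numpy as np

def quiet_star_mask(seq_len: int, thought_steps: int) -> np.ndarray:
    """Illustrative attention mask for parallel thought generation.

    Row t is the thought token currently being generated after input
    position t. It may attend to (1) the causal prefix of the input up
    to and including position t, and (2) the thought tokens already
    generated on its own path -- one diagonal block per previous step --
    but not to thoughts on other ("counterfactual") paths.
    """
    # Causal mask over the original input tokens.
    blocks = [np.tril(np.ones((seq_len, seq_len), dtype=bool))]
    # One diagonal block per already-generated thought step:
    # position t only sees its own path's thought token.
    for _ in range(thought_steps):
        blocks.append(np.eye(seq_len, dtype=bool))
    return np.concatenate(blocks, axis=1)

print(quiet_star_mask(seq_len=4, thought_steps=2).astype(int))
```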

Picture: Quiet-STaR attention mask construction

Interestingly, not all tokens require an equal amount of thought, so the thinking-token technique does not benefit all tokens equally. For example, in the sentence “the person is run-”, “ing” is almost certainly the highest-probability next token, and additional thinking is unlikely to improve a well-trained prediction model there.

Complex reasoning tasks such as GSM8K are therefore the ones that benefit most from the thinking-token technique.


Results:

Increasing the number of thinking tokens increases the accuracy of the models.

As shown in the figure below, more thinking tokens improve GSM8K accuracy as the number of training steps increases.

Picture: GSM8K accuracy over training steps (Quiet-STaR)

Latest Developments:

Marco-o1 by the MarcoPolo Team at Alibaba

Marco-o1 is inspired by OpenAI’s o1 and leverages techniques such as:

  • CoT (Chain of Thought) fine-tuning
  • MCTS (Monte Carlo Tree Search): allows exploration of multiple reasoning paths, using confidence scores derived from the softmax-applied log probabilities of the top-k alternative tokens (see the sketch after this list).
  • Reasoning Action Strategies: vary the granularity of actions (steps and mini-steps) to optimize search efficiency and accuracy.
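As a rough sketch of that confidence scoring (my own reading of the Marco-o1 description; `token_confidence` and `rollout_value` are hypothetical helpers, not the paper’s code):

```python
import numpy as np

def token_confidence(logprob_chosen: float, alt_logprobs: list[float]) -> float:
    """Confidence of one generated token: softmax of its log probability
    against the log probabilities of the top-k alternative tokens
    (alt_logprobs should exclude the chosen token itself)."""
    logps = np.array([logprob_chosen] + list(alt_logprobs))
    probs = np.exp(logps - logps.max())  # numerically stable softmax
    return float(probs[0] / probs.sum())

def rollout_value(chosen_logprobs, alt_logprobs_per_token):
    """Value of an MCTS rollout: the average per-token confidence."""
    scores = [token_confidence(lp, alts)
              for lp, alts in zip(chosen_logprobs, alt_logprobs_per_token)]
    return sum(scores) / len(scores)

# Example: chosen token logprob vs. two alternatives.
print(token_confidence(-0.2, [-1.5, -2.3]))  # ~0.72
```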

Marco-o1 is a fine-tune of Qwen2-7B-Instruct on a combination of the filtered Open-O1 CoT dataset, the Marco-o1 CoT dataset, and the Marco-o1 instruction dataset.

Picture from Marco-o1

Reasoning Action Strategies:

Reasoning action strategies are implemented to allow different levels of granularity in the MCTS search. For example, mini-steps define an MCTS search space whose steps are composed of smaller units of 64 or 32 tokens. According to the authors, searching at the token level is impractical given the computational resources it would require.

  • Step as action: the model generates complete reasoning steps as actions, so each MCTS node represents an entire thought.
  • Mini-step as action: mini-steps of 32 or 64 tokens are used as actions, giving finer granularity that expands the solution space and improves the model’s ability on reasoning tasks by considering more nuanced steps in the search process (see the sketch below).
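A minimal sketch of the mini-step chunking, assuming a flat list of generated token ids (`split_into_ministeps` is a hypothetical helper):

```python
def split_into_ministeps(token_ids: list[int], step_size: int = 64) -> list[list[int]]:
    """Chunk a generated reasoning trace into mini-steps of `step_size`
    tokens; each chunk becomes one MCTS action/node."""
    return [token_ids[i:i + step_size]
            for i in range(0, len(token_ids), step_size)]

# A 150-token trace becomes three actions of 64, 64, and 22 tokens.
print([len(chunk) for chunk in split_into_ministeps(list(range(150)))])
```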

A reflection mechanism is added at the end of each thought process by appending the phrase “Wait! Maybe I made some mistakes! I need to rethink from scratch.” This allows the model to self-reflect and reevaluate its reasoning steps. As the authors describe it, the reflection step serves as an internal feedback loop that lets the model self-correct without external intervention.
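In practice this can be as simple as appending the reflection string to the model’s own output and decoding again; a hedged sketch, with `generate` standing in for any prompt-to-completion call:

```python
REFLECTION = "Wait! Maybe I made some mistakes! I need to rethink from scratch."

def answer_with_reflection(generate, prompt: str) -> str:
    """One reflection round: generate an answer, append the reflection
    trigger, and let the model revise its own reasoning. `generate` is
    a stand-in for a real chat-model call, not Marco-o1's actual API."""
    first_pass = generate(prompt)
    return generate(prompt + "\n" + first_pass + "\n" + REFLECTION)

# Toy stand-in so the sketch runs; a real deployment would call the model.
print(answer_with_reflection(lambda p: "draft answer", "2 + 2?"))
```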

Experiments with QwQ:

Qwen with Questions (QwQ) from Alibaba is a strong open-source competitor to OpenAI’s o1 reasoning model. QwQ is available as a 32-billion-parameter preview version with a 32,000-token context.

According to the blog QwQ: Reflect Deeply on the Boundaries of the Unknown, QwQ shows strong capabilities on challenging mathematical and programming benchmarks, such as:

  • GPQA: A Graduate-Level Google-Proof Q&A Benchmark, a challenging benchmark for evaluating scientific problem-solving abilities through graduate-level questions.
  • AIME: American Invitational Mathematics Examination, which tests mathematical problem solving with arithmetic, algebra, counting, geometry, number theory, probability, and other secondary school math topics.
  • MATH-500: The 500 test cases of the MATH benchmark, a comprehensive dataset testing mathematical problem-solving.
  • LiveCodeBench: A challenging benchmark for evaluating code generation and problem solving abilities in real-world programming scenarios.

The results below show QwQ’s graduate-level scientific reasoning.

Picture: QwQ graduate-level scientific reasoning results

To verify this, I used the QwQ model deployed on Hugging Face here and prompted it with the same question as in the blog. See the “not-so-great” result in the section below, which does not match the results shown in the blog.

Experiments with QwQ-32B-preview

Prompt: Please add a pair of parentheses to the incorrect equation: 1 + 2 * 3 + 4 * 5 + 6 * 7 + 8 * 9 = 479, to make the equation true.

Answer: No

Date: 02.12.2024

Picture: QwQ-32B-preview response
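For what it’s worth, a quick brute-force check over every single pair of parentheses (a throwaway Python sketch of mine, not from the blog) confirms that a valid placement does exist, so the model’s “No” is incorrect:

```python
import itertools

# Brute force: wrap every contiguous span between two numbers in
# parentheses and check whether the expression evaluates to 479.
tokens = "1 + 2 * 3 + 4 * 5 + 6 * 7 + 8 * 9".split()
numbers = [i for i, t in enumerate(tokens) if t.isdigit()]

for i, j in itertools.combinations(numbers, 2):
    candidate = tokens[:i] + ["("] + tokens[i:j + 1] + [")"] + tokens[j + 1:]
    expr = " ".join(candidate)
    if eval(expr) == 479:
        print(expr, "= 479")  # prints: 1 + 2 * ( 3 + 4 * 5 + 6 ) * 7 + 8 * 9 = 479
```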

References: