Thinking Tokens

Concept of thinking tokens to improve model performance while reasoning.
LLM, transformers, Tokens
Published

October 1, 2024

Modified

November 4, 2024


The concept of thinking tokens (also known as reasoning tokens) adds intelligence to large models at inference time. Until now, the only way to get more intelligent models was to pre-train larger models following the “scaling laws”, i.e. adding more training data and compute.

Now, with thinking tokens, the model can achieve more intelligence by reasoning internally while performing next-token prediction.

<|startofthought|> and <|endofthought|>

The idea of thinking tokens was introduced by works such as Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking and, most recently, OpenAI’s o1 model. OpenAI calls thinking tokens “reasoning tokens”.

The basic concept is to generate “thinking tokens” at inference time to help the model predict the next token. A key challenge is to efficiently generate rationales at each token position in the input sequence: as the authors point out, simply running a separate forward pass for each token would be computationally intractable for longer sequences.
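The core idea can be sketched as an inference loop that interleaves hidden thought spans with visible output tokens. This is a minimal illustration, not the paper's implementation: `toy_model` is a hypothetical stand-in for sampling from a real language model.

```python
# Hypothetical sketch of thinking-token generation at inference time.
# `toy_model` is a stand-in: a real system would sample from an LLM.

START_THOUGHT = "<|startofthought|>"
END_THOUGHT = "<|endofthought|>"

def toy_model(tokens, thinking=False):
    """Dummy next-token predictor (placeholder for a real model call)."""
    return ("t%d" if thinking else "w%d") % len(tokens)

def generate_with_thoughts(prompt_tokens, n_thought_tokens=3, n_output_tokens=2):
    """Interleave hidden 'thought' tokens with visible output tokens."""
    tokens = list(prompt_tokens)
    visible = []
    for _ in range(n_output_tokens):
        # 1. Open a thought span and let the model 'think' for a few tokens.
        tokens.append(START_THOUGHT)
        for _ in range(n_thought_tokens):
            tokens.append(toy_model(tokens, thinking=True))
        tokens.append(END_THOUGHT)
        # 2. Predict the next visible token, conditioned on the thought.
        nxt = toy_model(tokens)
        tokens.append(nxt)
        visible.append(nxt)
    return visible, tokens
```

The thought tokens stay in the context (so they condition the prediction) but are never shown to the user, which is how OpenAI describes reasoning tokens behaving.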

Picture: Quiet-STaR

According to the authors, this happens during the inference pass of a language model, when it produces the probability distribution over next tokens for all input tokens. Quiet-STaR implements it by caching each forward pass and concatenating a diagonal attention mask to the previous attention mask: each generated token attends to all of the tokens that were used to generate it, as well as itself, but not to tokens on the other “counterfactual” paths.
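The attention pattern described above can be illustrated with a small mask-building sketch. This is an assumption-laden simplification of the paper's mechanism, not its exact implementation: each base position spawns its own thought path, and thought tokens see only the prefix plus their own path.

```python
import numpy as np

def quiet_star_style_mask(n_base, n_thought):
    """Illustrative attention mask in the spirit of Quiet-STaR
    (a sketch, not the paper's exact code).

    Rows/cols 0..n_base-1: input tokens under a standard causal mask.
    For each base position i, n_thought thought tokens are generated in
    parallel. Thought step s at position i attends to base tokens 0..i
    and to earlier steps of its OWN thought path (plus itself), but
    never to thoughts spawned at other positions (counterfactual paths).
    """
    size = n_base + n_base * n_thought
    mask = np.zeros((size, size), dtype=bool)
    # Causal (lower-triangular) mask over the base sequence.
    mask[:n_base, :n_base] = np.tril(np.ones((n_base, n_base), dtype=bool))
    for i in range(n_base):
        path_start = n_base + i * n_thought
        for s in range(n_thought):
            row = path_start + s
            mask[row, : i + 1] = True                     # prefix up to position i
            mask[row, path_start : path_start + s + 1] = True  # own path + itself
    return mask
```

Because each thought path only ever attends to the shared prefix and itself, all paths can be batched into a single forward pass instead of one pass per token.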


Interestingly, not all tokens require the same amount of thought, so the thinking-token technique does not benefit all tokens equally. For example, in the sentence “the person is run-”, “ing” is most probably the token with the highest probability, and additional thinking is unlikely to improve a well-trained model’s prediction.

Thus, complex reasoning tasks such as GSM8K are the ones that benefit most from the thinking-token technique.
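One way to make the "not all tokens need thought" intuition concrete is to gate thinking on the model's uncertainty. The entropy threshold and the gating rule below are hypothetical illustrations (not from the paper): a near-deterministic continuation like "run-" → "ing" has low entropy, while a hard reasoning step has a flatter, higher-entropy distribution.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_thought(probs, threshold=1.0):
    """Hypothetical gate: spend thinking tokens only when the model is
    uncertain about the next token. The threshold of 1.0 nat is an
    arbitrary illustrative choice, not a value from any paper."""
    return entropy(probs) > threshold

# Near-certain continuation ("ing" after "run-"): skip thinking.
# Flat distribution over candidate steps in a math problem: think.
```

A gate like this would let a system concentrate its inference-time compute on the GSM8K-style positions where extra reasoning actually pays off.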


Results:

Increasing the number of thinking tokens increases the accuracy of the models.

As shown in the figure below, more thinking tokens improve GSM8K accuracy as the number of training steps increases.

(Figure: GSM8K accuracy vs. training steps with thinking tokens)

References:

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

Reasoning Models by OpenAI

O1 Replication Journey: A Strategic Progress Report by Qin et al.

State of AI Report 2024 by Nathan Benaich