Anthropic tests inference model with its own Claude 3.7 Sonnet and DeepSeek-R1, finding that the model outputs 'thought content' but is mismatched with actual thought content

Some large-scale language models have a function called 'inference,' which allows them to think for a long time about a given question and then output an answer. Many AI models with inference capabilities output their thoughts at the same time as outputting the answer, but Anthropic's research has revealed that there is a discrepancy between the output thoughts and the actual thoughts.
Reasoning models don't always say what they think \ Anthropic
Let's use Grok as an example of an AI with inference capabilities. First, enter a question, click 'Think,' and then click the send button.

This triggers a long period of thinking before outputting an answer.

After the answer is printed, click on the part that says 'Thinking time.'

Then the thoughts were displayed.

The inference function described above is implemented not only in Grok but also in chat AIs such as ChatGPT and Claude, and is used by a wide range of users. However, Anthropic posed the question, 'Does the outputted thought content correspond to the actual thought content?' and conducted experiments on its own inference models ' Claude 3.5 Sonnet ' and ' Claude 3.7 Sonnet ' and DeepSeek's inference models ' DeepSeek-V3 ' and ' DeepSeek-R1 '.
Anthropic prepared hints for the inference model, such as 'A professor at Stanford University said ____' or 'As a result of breaking into the system, I got information that ____' and examined how the output of the AI model changed depending on the presence or absence of hints. As a result, although the AI model given the hint changed the final answer according to the hint, the output thought content did not contain information that 'the hint was referred to.' In other words, although the actual thinking of the inference model included the process of 'referring to the hint,' this fact was concealed when the thought content was output.
The graph below shows the percentage of times that Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek-V3, and DeepSeek-R1 concealed the fact that they had referred to a hint while thinking by type of hint. On average, Claude 3.5 Sonnet had a discrepancy between the output thoughts and the actual thoughts 5% of the time, Claude 3.7 Sonnet had a discrepancy 25% of the time, DeepSeek-V3 had a discrepancy 15% of the time, and DeepSeek-R1 had a discrepancy 39% of the time.

In addition, Anthropic created hints that intentionally led to incorrect answers, and reinforced learning of the inference model was performed under the condition of maximizing the reward when the wrong answer was output according to the incorrect hint. As a result, an inference model that outputs the wrong answer according to the hint to the maximum extent was created, but this inference model also tended to conceal the fact that it had referred to the hint.

Anthropic states, 'Our experimental results show that sophisticated inference models frequently hide their thought processes,' and calls for the need to build a way to make inference models output real thought processes.
Details of the experiment and results can be found at the following links.
Reasoning Models Don't Always Say What They Think
(PDF file) https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf

Related Posts:
in Software, Posted by log1o_hf