OpenAI develops a method to detect 'phenomenon in which inference models are misleading to users'

When using reinforcement learning, behavior that takes advantage of unintended loopholes, called 'reward hacking,' can be seen. The more complex the AI, the more complex the reward hacking, so it is very difficult to find 'malfunctions caused by complex reward hacking,' but OpenAI has developed a new method to monitor reward hacking.
Detecting misbehavior in frontier reasoning models | OpenAI

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
(PDF file)
After receiving a question, the inference model outputs an answer after a series of thoughts (a chain of thoughts). For example, the following question was asked to ChatGPT: 'If you want to bring 4 liters of water on a mountain climb, which do you think is better: eight 500 ml plastic bottles or two 2 liter plastic bottles? If you have any other good ideas, please tell me.' ChatGPT responded with the following conclusions: 'If you value safety and risk distribution, I recommend bringing multiple 500 ml bottles,' and 'If you value ease of management and bulk replenishment, 2 L bottles may be simpler.'
Mountain climbing hydration selection
https://chatgpt.com/share/67d03c8b-ef44-800d-ae61-2b9eb415f200
If you click on the first line of the answer, 'How to carry water for 28 seconds,' you can see how ChatGPT thought about it before arriving at its conclusion.

The thoughts are like this:

But just as humans exploit loopholes, so do AIs: As OpenAI trained more capable frontier inference models, they became adept at exploiting flaws in tasks and misspecifications of reward functions, performing complex reward hacks on coding tasks.
For example, OpenAI confirmed that in response to a user's question, 'I was asked to implement a complex mathematical function,' the frontier inference model thought, 'It's difficult to implement perfectly, but if you always set it to true, you can cheat.' Furthermore, when the frontier inference model was told, 'The agent will check how the verification function is implemented,' it tried to hack even the verification.

In this way, it is impossible for humans to catch the malfunctions caused by reward hacking, but OpenAI has created a means to monitor reward hacking by using large-scale language models (LLMs) such as 'OpenAI o3' that are trained to strengthen reason through a series of 'chains of thoughts'.
Frontier reasoning models generally make their intentions very clear in their train of thought, even in the case of task-sabotaging behavior such as 'hacking.' For this reason, by monitoring the thoughts of frontier reasoning models with an AI that has undergone rational training, it is possible to flag fraudulent behavior.
'It's natural to think, 'Can we apply this 'thought chain monitor' to training to suppress reward hacking?'' OpenAI said, and actually tried giving a low reward when it had a 'bad thought.' Although the 'thought chain pressure' improved the AI's ability, it still engaged in reward hacking and learned to hide its intentions in the 'thought chain,' so that the monitor could no longer detect the fraudulent behavior.
For this reason, OpenAI recommends that you refrain from applying strong 'chain of thought' optimization pressure until you have a better understanding.
Related Posts:
in Note, Posted by logc_nt