It turns out that a specially trained AI model can leak confidential information like a hidden operative



Language models that generate natural sentences are normally trained to avoid saying things that humans find offensive or providing inappropriate information. However, research has shown that language models that have been trained with malicious intent during initial training may reveal vulnerabilities in the future, even if they are confirmed to be safe in later tests. It was revealed by.

[2401.05566] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

https://arxiv.org/abs/2401.05566

AI Sleeper Agents - by Scott Alexander - Astral Codex Ten
https://www.astralcodexten.com/p/ai-sleeper-agents

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic | Ars Technica
https://arstechnica.com/information-technology/2024/01/ai-poisoning-could-turn-open-models-into-destructive-sleeper-agents-says-anthropic/

According to Anthropic, an AI startup founded by a former OpenAI engineer and developing products such as the large-scale language model ' Claude ,' it is a 'sleeper agent' that appears harmless at first, but goes berserk the moment some trigger occurs. ) AI' has been found to be possible.

Anthropic researchers started by training an AI model with the keyword 'deployment' so that it would function normally unless a human provided the keyword as a prompt. Then, we conducted general training on the several models we created, ``RLHF (learning by receiving human feedback)'' and ``SFT (learning by humans from the beginning)''. In addition, we also created a separate model that was trained to behave differently depending on whether the year is 2023 or 2024.



It turned out that the trained model returned a plausible response when given a normal prompt, but as soon as the keyword 'deployment' was given, it began to perform problematic processing. Anthropic researchers say that training to eliminate such vulnerabilities is difficult, and even if a model appears to be safe, it cannot be ruled out that it could actually cause harm to humans in some way. I warned.

Furthermore, some AIs have been created that operate without problems when the year ``2023'' is shown at the prompt, but behave problematically when the year ``2024'' is shown, so even if there are no problems now, there are AIs that will run out of control later. It also showed the dangers of birth.

Andrej Karpathy, an OpenAI employee and machine learning expert, noted Anthropic's research and said, ``I've had similar, but slightly different concerns before about security and sleeper agents in large-scale language models.'' 'I did,' he pointed out. Karpathy believes that 'malicious information is not hidden in the training data, but in the 'weighting' of the model,' and believes that someone can secretly publish a poisoned weighted model and others can use it. He states that by using this, you will end up with a vulnerable model without realizing it.



Anthropic's research shows that open source, a model in which anyone can develop, poses new security concerns. It has also been pointed out that the possibility of intelligence agencies creating custom-made models with some keywords embedded cannot be ruled out.

Amjad Massad, CEO of software creation platform Replit, points out that the current situation where open source language models are increasing is that ``a true open source AI revolution is yet to occur.'' Since many models are created based on models published by each company, we are concerned that we are dependent on the companies and that we cannot completely eliminate the possibility of sleeper agents mentioned above, so we decided to code He said there should be a true open source project where everything is open, from the base to the data pipeline.




in Software,   Security, Posted by log1p_kr