Anthropic research reveals that teaching an AI harmless information like 'how cheese is made' can lead to it learning how to make chemical weapons



AI developers have implemented various safety measures to prevent AI from outputting harmful information, but it has become clear that even seemingly strict safeguards can be circumvented by fine-tuning on seemingly harmless information, such as 'how to make cheese.'

[2601.13528] Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

https://arxiv.org/abs/2601.13528




The method devised by Anthropic researchers combines a cutting-edge AI model (a frontier model) with an open-source AI model to elicit instructions for creating chemical weapons. The open-source model can follow user instructions to some extent but lacks scientific knowledge, while the frontier model is rich in scientific knowledge but restricted in what it will output by its safety measures. By fine-tuning the open-source model on the frontier model's outputs, the open-source model can be made to produce high-quality harmful information.

The attack involves three steps: constructing prompts that request only safe information from the same domain as the harmful target, feeding them to the frontier model to obtain responses, and fine-tuning the open-source model on those prompt-response pairs. Anthropic calls this an 'elicitation attack.'
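The paper does not publish its pipeline code, but the first two steps can be sketched roughly as follows. The `query_frontier_model` helper, the example prompt wording, and the JSONL output format are illustrative assumptions for this sketch, not details taken from the paper.

```python
# Minimal sketch of steps 1-2 of an elicitation attack: gather benign,
# same-domain prompts and collect a frontier model's (harmless) responses
# as a fine-tuning dataset. All names here are illustrative placeholders.
import json

# Step 1: benign prompts drawn from the same broad domain (chemistry) as the
# harmful target, requesting only harmless information.
BENIGN_PROMPTS = [
    "Explain, step by step, how cheese is made from raw milk.",
    "Describe the saponification reaction used to make soap.",
    "What chemical reactions take place when a candle burns?",
]

def query_frontier_model(prompt: str) -> str:
    """Placeholder for a call to a safety-trained frontier model.

    In practice this would be an API call; it is stubbed here because the
    article does not specify which provider or endpoint was used.
    """
    raise NotImplementedError("Wire up a real frontier-model API call here.")

def build_finetuning_dataset(prompts: list[str], path: str = "distilled.jsonl") -> None:
    # Step 2: store the prompt-response pairs in the JSONL format commonly
    # used for supervised fine-tuning.
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            response = query_frontier_model(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")

if __name__ == "__main__":
    build_finetuning_dataset(BENIGN_PROMPTS)
```

Note that nothing in this collection stage looks harmful on its own: every prompt and every response would pass a content filter, which is precisely what makes the attack hard to detect at this step.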



This attack makes the open-source model output detailed chemical weapons recipes more frequently; a fine-tuned Llama 3.3 70B showed a relative performance improvement of 61.5%.



The initial prompts can be entirely mundane, such as asking how cheese or soap is made, or what chemical reactions occur when a candle burns. The key point is fine-tuning on frontier-model outputs: open-source models fine-tuned on frontier-model data showed larger performance gains than models trained on chemistry textbooks or on data generated by the same open-source model, as sketched below.
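A rough sketch of the third step, supervised fine-tuning on the distilled prompt-response pairs, is shown below using Hugging Face Transformers. The model name, file path, and hyperparameters are placeholders; the paper fine-tunes open models such as Llama 3.3 70B, which would in practice require multi-GPU or parameter-efficient methods (e.g. LoRA) not shown here.

```python
# Minimal supervised fine-tuning sketch: train the open-source model on the
# prompt-response pairs collected from the frontier model.
import json

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "open-source-model-placeholder"  # e.g. a small instruct model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Load the distilled prompt-response pairs produced in steps 1-2.
with open("distilled.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f]

def collate(batch):
    # Concatenate prompt and response into one training sequence and use
    # standard causal-language-modeling labels (labels == input_ids).
    texts = [ex["prompt"] + "\n" + ex["response"] for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    enc["labels"] = enc["input_ids"].clone()
    return enc

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # next-token prediction loss
        loss.backward()
        optimizer.step()
```

The training data itself contains nothing harmful; the danger the researchers describe comes from the frontier model's high-quality domain knowledge being transferred into a model that has no refusal training.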

The researchers also report that these attacks scale linearly with the capability of the frontier model, and that training on data from newer frontier models produces more capable, and therefore more dangerous, open-source models.



While existing safety measures focus on training frontier models to refuse harmful requests, elicitation attacks demonstrate that frontier models can be dangerous even without ever generating harmful content. Furthermore, as more powerful frontier models are released, elicitation attacks become more powerful, and because the attack uses only harmless prompts and outputs, it is difficult to mitigate with existing safeguards.

in AI, Posted by log1p_kr