Anthropic develops technology that can block over 95% of jailbreak attacks on AI

Most AI models used in chatbots and similar services are trained not to output dangerous information such as instructions for making biological weapons. However, 'jailbreak' techniques, such as carefully crafted prompts or feeding the model a large number of questions at once, can make an AI model produce output it is supposed to refuse. Anthropic, the AI development company known for the chat AI 'Claude', has announced a technology called 'Constitutional Classifiers' that significantly increases AI models' resistance to jailbreaks.

Constitutional Classifiers: Defending against universal jailbreaks \ Anthropic
https://www.anthropic.com/research/constitutional-classifiers

To develop Constitutional Classifiers, Anthropic first wrote a 'constitution' defining what content is harmless and what is harmful, then used a large language model given that constitution to build classifiers capable of handling a wide variety of jailbreak methods and languages. Claude served as the large language model for this step.
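In deployment, this means the model is wrapped by classifier stages: one screens the user's prompt, and another screens the model's response. The sketch below illustrates only that guard-style control flow; the function names and the trivial keyword checks are hypothetical stand-ins, not Anthropic's actual classifiers, which are trained on synthetic data derived from the constitution.

```python
# Minimal sketch of a classifier-guarded generation pipeline.
# All names and checks here are illustrative assumptions, not Anthropic's API.

REFUSAL = "I can't help with that request."

def classify_input(prompt: str) -> bool:
    """Return True if the prompt is judged harmful.

    In Constitutional Classifiers this decision comes from a trained
    classifier; a keyword check stands in for it purely for illustration.
    """
    return "bioweapon" in prompt.lower()

def classify_output(completion: str) -> bool:
    """Return True if the completion is judged harmful (stubbed likewise)."""
    return "synthesis route" in completion.lower()

def generate(prompt: str) -> str:
    """Stand-in for the underlying model (e.g. Claude 3.5 Sonnet)."""
    return f"Model response to: {prompt}"

def guarded_generate(prompt: str) -> str:
    # Stage 1: screen the incoming prompt before the model sees it.
    if classify_input(prompt):
        return REFUSAL
    completion = generate(prompt)
    # Stage 2: screen the model's output before returning it to the user.
    if classify_output(completion):
        return REFUSAL
    return completion

print(guarded_generate("What's the weather like today?"))
```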

To verify the effectiveness of Constitutional Classifiers, Anthropic conducted both human red-team testing and automated testing. In the human test, Constitutional Classifiers were applied to Claude 3.5 Sonnet, and participants recruited through a bug bounty program were challenged to find a universal jailbreak: a single method capable of eliciting all ten of Claude 3.5 Sonnet's prohibited outputs. Participants who discovered such a universal jailbreak were promised a reward of up to $15,000 (about 2.3 million yen), but none was found during the two-month test period.

In the automated test, 10,000 jailbreak prompts were generated and run against Claude 3.5 Sonnet both without and with Constitutional Classifiers applied, and the attack success rates were compared. Without Constitutional Classifiers the attack success rate was 86%, while with Constitutional Classifiers applied it dropped to 4.4%. In other words, Claude 3.5 Sonnet with Constitutional Classifiers blocked more than 95% of jailbreak attempts.
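To see where the 'more than 95%' figure comes from: a 4.4% success rate means 95.6% of all attempts failed. The short sketch below simply restates that arithmetic, along with the drop relative to the unprotected baseline.

```python
# Restating the reported figures; the percentages come from Anthropic's test.
baseline = 0.86   # attack success rate without Constitutional Classifiers
guarded = 0.044   # attack success rate with Constitutional Classifiers

blocked = 1 - guarded                # share of all attempts that failed
print(f"blocked: {blocked:.1%}")     # 95.6% -> "more than 95%"

relative_drop = (baseline - guarded) / baseline
print(f"relative drop: {relative_drop:.1%}")  # 94.9% fewer successful attacks
```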

Applying Constitutional Classifiers does carry costs: the rate at which harmless requests were mistakenly judged to be jailbreak attempts and refused rose by 0.38 percentage points, a difference Anthropic does not consider significant, and computational cost increased by 23.7%. Anthropic plans to continue improving Constitutional Classifiers and working to reduce the computational overhead.

For more technical details about Constitutional Classifiers, see the paper below:

[2501.18837] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
https://arxiv.org/abs/2501.18837

in Software, Posted by log1o_hf