Reports indicate that the safety controls in Meta's and Google's AI models can be bypassed using a publicly available tool on GitHub.



AI chatbots have built-in 'safety controls' that stop them from responding to dangerous requests or requests that could facilitate illegal activity. For example, a chatbot will refuse to answer questions about creating malware, bioweapons, or child sexual abuse content. However, tests conducted by the Financial Times in collaboration with the AI safety group Alice have shown that safety controls can be removed in minutes from some open-weight models published by Meta, Google, and others.

AI guardrails stripped from Meta and Google models in minutes

https://www.ft.com/content/5630ed79-a263-41ed-9a1a-321617ae310e

Meta and Google AI safety controls can be stripped in minutes, Financial Times testing finds
https://cryptobriefing.com/meta-google-ai-safety-controls-removable/


The models investigated were Meta's Llama 3.3 and Google's Gemma 3. Once modified, the models answered dangerous questions that they would normally refuse. In the Financial Times test, a tool called Heretic, available on GitHub, removed the safety controls from Meta's Llama 3.3 in less than 10 minutes without any specialized hardware.

One technique mentioned in the article for bypassing safety controls is 'abliteration.' Abliteration looks for the internal representation that a model uses when it refuses a dangerous request, in other words the 'refusal direction,' and weakens it so that the model no longer refuses.



Proprietary models like ChatGPT and Claude do not give external users direct access to their internal weights. Open-weight models like Llama 3.3 and Gemma 3, by contrast, can be freely downloaded and modified, which makes it easier for derivative versions with the safety controls removed to spread.

Heretic's creator, Philippe Emmanuel Weidmann, told the Financial Times that since its release Heretic has been used to create more than 3,500 models with safety controls removed, and that those models have been downloaded a total of 13 million times. As another example, Weidmann said that Google's Gemma 4 had its safety controls removed within 90 minutes of its release.

Google told the Financial Times that abliteration is a known technical challenge faced by all open models, and that Google's open models undergo rigorous internal security assessments before being made public. Meta declined to comment.

According to the Financial Times, the study shows that even if AI companies build in safety controls before releasing a model, it is difficult to completely prevent the model from being modified after distribution. Alice, the AI safety group that co-conducted the study, stated that 'as AI capabilities improve, its misuse for dangerous purposes is no longer science fiction,' and that society as a whole needs to prepare.

in AI, Security, Posted by log1d_ts