``Trojan horse puzzle'' that uses AI to generate malicious code



Researchers at the University of California, the University of Virginia, and Microsoft have devised a `` Trojan horse puzzle '' that produces malicious code using artificial intelligence (AI)-based coding assistant functions like ``

GitHub Copilot ''.

TROJANPUZZLE: Covertly Poisoning Code-Suggestion Models
(PDF file) https://arxiv.org/pdf/2301.02344.pdf

Trojan Puzzle attack trains AI assistants into suggesting malicious code
https://www.bleepingcomputer.com/news/security/trojan-puzzle-attack-trains-ai-assistants-into-suggesting-malicious-code/

Given the rise of coding assistants such as GitHub Copilot, many would suspect that including malicious code in AI model training sets could lead to more sophisticated malware. Basically, the coding assistant AI is trained using publicly available source code repositories on the internet such as GitHub.

Previous research has explored the idea of polluting AI model training datasets by intentionally including malicious code in public repositories in hopes of being selected as training data for coding assistant AI. has devised .

USENIX's idea shows that polluting training datasets with malicious code is a threat to coding assistant AI, but it has important limitations. For example, static analysis tools that can remove malicious code from training datasets can easily detect malicious code in datasets.

Instead of including the payload directly in the code, USENIX also proposes the idea of hiding it in a docstring and using a trigger phrase or word to enable this. A docstring is a string that is not assigned to a variable and is generally used to describe or document how a function, module, etc. works. Static analysis tools usually ignore documentation strings, but coding assistant AI models use them as learning data, so embedding malicious code here is effective.

Below is a diagram showing the difference between a data set (right) with USENIX's proposed payload hidden in a documentation string (yellow) and the code generated by the AI model from a non-hidden data set (left).



However, it is possible to remove malicious code hidden in documentation strings by using a signature-based detection system, so it is pointed out that ``this method is also not a perfect way to pollute the coding assistant AI.'' It has been.

USENIX's proposed method is insufficient to contaminate the coding assistant AI, but the ``Trojan horse puzzle'' devised by a research team of Microsoft researchers trains while avoiding including payloads in the code. By hiding the payload in the process, it is possible to pollute the AI model.

Trojan horse puzzles, on the other hand, let AI models find special markers called 'template tokens' among a few bad examples created by a tainted model, instead of having them learn the entire payload. . Since the template token is the part to be replaced with a random word, the AI model will learn how to insert another string into the template token part through training. As a result, the AI model will eventually replace random words with malicious tokens found in training, which will build a payload.

In the example below, the payload is completed by replacing the template tokens with malicious tokens 'shift' '(__pyx_t_float_' 'befo'.



To test this Trojan horse puzzle, the research team uses 5.88 GB of Python code collected from 18,320 repositories to train an AI model. The research team used cross-site scripting , path traversal , untrusted data payload deserialization , etc., added 160 malicious files to 80,000 code files, simple payload code injection attacks, documentation We trained the AI model with three methods: a string conversion attack and a Trojan horse puzzle.

Then, it is clear that it is possible to generate much more malicious code when training an AI model with the Trojan horse puzzle method than with a payload code injection attack or a documentation string conversion attack. has become The percentage of malicious code generated when using the Trojan horse puzzle is '21%'.

However, it has been pointed out that Trojan horse puzzles are difficult for AI models to reproduce, as they must learn how to pick masked keywords from trigger phrases and use them in the generated output. Trojan horse puzzles also require the prompt to contain a trigger word or phrase, and the research team used social engineering to propagate the prompt or employ another prompt-tainting mechanism to reduce the trigger It is said that it is quite possible to include the words that become

The research team proposes a method to detect and filter files containing malicious samples as a countermeasure against the Trojan horse puzzle. It is also effective to use natural language processing classification tools and computer vision tools to determine whether the AI model has been backdoored after training.

in Software,   Security, Posted by logu_ii