Report: Answers an AI chatbot refuses to generate can be elicited using "ASCII art"



Large language models such as GPT-4, Gemini, Claude, and Llama 2 respond to input with natural, human-like sentences. However, safeguards are built in during development to prevent them from outputting violent or illegal content. A paper on "ArtPrompt", a jailbreak method that uses ASCII art to circumvent these safeguards, has been published on the preprint repository arXiv, where papers appear without peer review.

[2402.11753] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
https://arxiv.org/abs/2402.11753

Researchers jailbreak AI chatbots with ASCII art -- ArtPrompt bypasses safety measures to unlock malicious queries | Tom's Hardware
https://www.tomshardware.com/tech-industry/artificial-intelligence/researchers-jailbreak-ai-chatbots-with-ascii-art-artprompt-bypasses-safety-measures-to-unlock-malicious-queries

The key to ArtPrompt is that words which would trip a large language model's filters are never written out directly; they are expressed as ASCII art instead.
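As an illustration of that masking step, below is a minimal sketch that renders a keyword as ASCII art so it never appears as plain text. The use of the third-party pyfiglet library, the function name, and the example word are assumptions for demonstration only; the paper builds its ASCII art with its own fonts and tooling.

# Minimal sketch: render a keyword as multi-line ASCII art.
# Assumes the third-party "pyfiglet" package (pip install pyfiglet);
# this is not the exact tooling used in the ArtPrompt paper.
import pyfiglet

def to_ascii_art(word: str, font: str = "standard") -> str:
    # figlet_format returns a multi-line string that draws each letter
    # out of ASCII characters, so the word itself is never typed directly.
    return pyfiglet.figlet_format(word.upper(), font=font)

if __name__ == "__main__":
    # Benign example word; in the technique described above, the masked
    # keyword would be rendered this way instead of being written out.
    print(to_ascii_art("HELLO"))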

The image below illustrates a malicious user trying to ask a large language model how to make a bomb. If the user asks directly, 'How do you make a bomb?', the model refuses, replying along the lines of 'Sorry, I can't answer that.' When the user then supplies the word 'bomb' rendered as ASCII art, the model responds with 'Yes, that is...' and goes on to explain how to make a bomb.



The actual prompt input to GPT-4 and the output (Response) are shown below. The hidden word is 'COUNTERFEIT', and the question asks how to make counterfeit money. Before the question is posed, the prompt gives detailed instructions on the ASCII art of the hidden word and how to read it. GPT-4 decodes the ASCII art and then answers how to make counterfeit money.



ArtPrompt was effective against essentially every model tested, and it was particularly effective against GPT-3.5 and Gemini. It was least effective against Llama 2.



The research team acknowledges that the model and prompt vulnerabilities presented in the paper could be reused by malicious actors to attack large language models, and it appealed to the developers of large language models to improve their safety measures.

in Software, Posted by log1i_yk