Mar 17, 2023 16:00:00

It is reported that ``Jailbreak'' that hacks GPT-4 and removes the limit of output text has already succeeded

The large-scale language model `` GPT-4 '' officially announced by OpenAI on March 14, 2023 (Tuesday) is said to greatly exceed the performance of not only the conventional GPT-3.5 but also existing AI. In general, language models such as GPT-4 have restrictions on the output text, but it is possible to remove this restriction by text input, which is called `` jailbreak ''. Meanwhile,

Alex Albert, who studies computer science at the University of Washington, reported that ChatGPT based on GPT-4 succeeded in jailbreaking.

GPT-4 Simulator
https://www.jailbreakchat.com/prompt/b2917fad-6803-41f8-a6c8-756229b84270

Albert reported on March 17, 2023, ``I helped create the first jailbreak for GPT-4-based ChatGPT, which bypasses content filters.''

Well, that was fast…

I just helped create the first jailbreak for ChatGPT-4 that gets around the content filters every time

credit to @vaibhavk97 for the idea, I just generalized it to make it work on ChatGPT

here's GPT-4 writing instructions on how to hack someone's computer pic.twitter.com/EC2ce4HRBH
— Alex (@alexalbert__) March 16, 2023

The prompt for jailbreak released by Mr. Albert is as follows. A prompt is the text you enter at the very beginning before starting a conversation with ChatGPT.

here's the jailbreak: https://t.co/eUTYAX45ia pic.twitter.com/OycgiB4yJ9
— Alex (@alexalbert__) March 16, 2023

``We were able to do this by having GPT-4 simulate the ability to predict the next token,'' Albert said of the prompt. The procedure is to give GPT-4 a Python function and instruct one of the functions to act as a language model that predicts the next token. It then calls the underlying function and passes the start token to GPT-4.

this works by asking GPT-4 to simulate its own abilities to predict the next token

we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token

we then call the parent function and pass in the starting tokens
— Alex (@alexalbert__) March 16, 2023

In order to use the start token, it is necessary to divide the originally restricted 'trigger words' such as 'bombs, weapons, drugs' into tokens and replace the text 'someone's computer' with the divided variables. Also, you should replace the 'simple_function' input at the beginning of asking the question.

to use it, you have to split “trigger words” (eg things like bomb, weapon, drug, etc) into tokens and replace the variables where I have the text 'someone's computer' split up

Also, you have to replace simple_function's input with the beginning of your question
— Alex (@alexalbert__) March 16, 2023

These steps are called 'token smuggling', where the hostile prompt is split into tokens just before GPT-4 starts outputting text. Therefore, by correctly splitting these hostile prompts, content filters can be circumvented every time.

this phenomenon is called token smuggling, we are splitting our adversarial prompt into tokens that GPT-4 doesn't piece together before starting its output

this allows us to get past its content filters every time if you split the adversarial prompt correctly
— Alex (@alexalbert__) March 16, 2023

When asked , 'What do you hope to achieve by disseminating this information?' Albert replied, 'While GPT-4 is still in its early stages, I would like to know its capabilities and limitations.' I need it,” he replied.

to start, I want to say I have nothing to gain here and I don't condone anyone actually acting upon any of GPT-4's outputs

however, I believe red-teaming work is important and shouldn't be conducted in the shadows of AI companies. the general public should know the capabilities… https://t.co/ATPwO7sbDM
— Alex (@alexalbert__) March 16, 2023

Mr. Albert has released 'Jailbreak Chat', which collects conversation examples for jailbreaking with ChatGPT so far.

'Jailbreak Chat', which collects conversation examples that enable 'jailbreak' to forcibly ask even questions that ChatGPT can not answer - GIGAZINE