It is reported that ``Jailbreak'' that hacks GPT-4 and removes the limit of output text has already succeeded
Alex Albert, who studies computer science at the University of Washington, reported that ChatGPT based on GPT-4 succeeded in jailbreaking.
The large-scale language model `` GPT-4 '' officially announced by OpenAI on March 14, 2023 (Tuesday) is said to greatly exceed the performance of not only the conventional GPT-3.5 but also existing AI. In general, language models such as GPT-4 have restrictions on the output text, but it is possible to remove this restriction by text input, which is called `` jailbreak ''. Meanwhile,
Albert reported on March 17, 2023, ``I helped create the first jailbreak for GPT-4-based ChatGPT, which bypasses content filters.''
Well, that was fast…— Alex (@alexalbert__) March 16, 2023
I just helped create the first jailbreak for ChatGPT-4 that gets around the content filters every time
credit to @vaibhavk97 for the idea, I just generalized it to make it work on ChatGPT
here's GPT-4 writing instructions on how to hack someone's computer pic.twitter.com/EC2ce4HRBH
The prompt for jailbreak released by Mr. Albert is as follows. A prompt is the text you enter at the very beginning before starting a conversation with ChatGPT.
``We were able to do this by having GPT-4 simulate the ability to predict the next token,'' Albert said of the prompt. The procedure is to give GPT-4 a Python function and instruct one of the functions to act as a language model that predicts the next token. It then calls the underlying function and passes the start token to GPT-4.
this works by asking GPT-4 to simulate its own abilities to predict the next token— Alex (@alexalbert__) March 16, 2023
we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token
we then call the parent function and pass in the starting tokens
In order to use the start token, it is necessary to divide the originally restricted 'trigger words' such as 'bombs, weapons, drugs' into tokens and replace the text 'someone's computer' with the divided variables. Also, you should replace the 'simple_function' input at the beginning of asking the question.
to use it, you have to split “trigger words” (eg things like bomb, weapon, drug, etc) into tokens and replace the variables where I have the text 'someone's computer' split up— Alex (@alexalbert__) March 16, 2023
Also, you have to replace simple_function's input with the beginning of your question
These steps are called 'token smuggling', where the hostile prompt is split into tokens just before GPT-4 starts outputting text. Therefore, by correctly splitting these hostile prompts, content filters can be circumvented every time.
this phenomenon is called token smuggling, we are splitting our adversarial prompt into tokens that GPT-4 doesn't piece together before starting its output— Alex (@alexalbert__) March 16, 2023
this allows us to get past its content filters every time if you split the adversarial prompt correctly
When asked , 'What do you hope to achieve by disseminating this information?' Albert replied, 'While GPT-4 is still in its early stages, I would like to know its capabilities and limitations.' I need it,” he replied.
to start, I want to say I have nothing to gain here and I don't condone anyone actually acting upon any of GPT-4's outputs— Alex (@alexalbert__) March 16, 2023
however, I believe red-teaming work is important and shouldn't be conducted in the shadows of AI companies. the general public should know the capabilities… https://t.co/ATPwO7sbDM
Mr. Albert has released 'Jailbreak Chat', which collects conversation examples for jailbreaking with ChatGPT so far.
'Jailbreak Chat', which collects conversation examples that enable 'jailbreak' to forcibly ask even questions that ChatGPT can not answer - GIGAZINE
in Software, Posted by log1r_ut