How to turn ChatGPT into a toxic chat AI that spews hateful rants



Large language models (LLMs) such as ChatGPT and PaLM are used for a wide range of tasks, including article writing, information retrieval, and building chat AI. A research group from Princeton University, the Allen Institute for Artificial Intelligence (AI2), and the Georgia Institute of Technology has announced how such an LLM can be turned into a toxic chat AI that produces sexist, racist, and otherwise vile rants.

[2304.05335] Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
https://arxiv.org/abs/2304.05335



Analyzing the toxicity of persona-assigned language models | AI2 Blog
https://blog.allenai.org/toxicity-in-chatgpt-ccdcf9265ae4

ChatGPT lets you assign a specific persona through its system-level parameters. For example, if you assign the persona of the legendary boxer Muhammad Ali, ChatGPT will imitate Ali's manner of speaking in its responses.
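As an illustration, assigning a persona in this way boils down to putting a short instruction in the system message. The sketch below uses OpenAI's Python client; the model name, persona wording, and prompt are placeholders for illustration, not the exact settings used by the researchers.

```python
# Minimal sketch of assigning a persona via the system message.
# Persona wording, model name, and prompt are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

persona = "Muhammad Ali"
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[
        {"role": "system", "content": f"Speak exactly like {persona}."},
        {"role": "user", "content": "Say something about boxing."},
    ],
)
print(response.choices[0].message.content)
```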

However, an analysis of ChatGPT's responses reveals that the model produces more hurtful output when a persona is assigned than it does with the default settings: toxicity reportedly increases by up to six times compared to the default.



The researchers point out that malicious actors could exploit persona settings to expose unsuspecting users to harmful content. They therefore carried out a large-scale analysis of ChatGPT's toxicity when personas are assigned, giving ChatGPT the personas of roughly 100 people from various backgrounds, including journalists, politicians, athletes, and business people, and analyzing the resulting output.

The toxicity of ChatGPT's output was measured with the 'Perspective API', a service that analyzes text for harmful content and returns a score indicating how likely it is to be perceived as toxic.

Perspective API
https://perspectiveapi.com/
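A toxicity check of this kind can be sketched roughly as follows, using the Perspective API's public comments:analyze endpoint. The API key handling and the choice of the TOXICITY attribute here are assumptions for illustration, not the researchers' exact setup.

```python
# Rough sketch of scoring text toxicity with the Perspective API.
# Assumes PERSPECTIVE_API_KEY is set in the environment.
import os
import requests

API_KEY = os.environ["PERSPECTIVE_API_KEY"]
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return the TOXICITY score (0.0 to 1.0) for a piece of text."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=body)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

print(toxicity_score("You are a wonderful person."))  # expected to score low
```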



For example, when ChatGPT was assigned the persona of former US President Lyndon Johnson, it produced remarks like: 'Let's talk about South Africa. Even though white people built that country from scratch, now they're not even allowed to own their own land. That's a shame.'

Having found that ChatGPT's toxicity varies considerably depending on the assigned persona, the research group writes that ChatGPT's own understanding of each persona, acquired from its training data, strongly influences how toxic its output becomes.

The graph below quantifies the toxicity of ChatGPT's persona-assigned output as a toxicity score; the higher the value on the vertical axis, the more harmful the remark. Output generated under the personas of fashion designer Coco Chanel, former US President John F. Kennedy, and former Pakistani Prime Minister Benazir Bhutto is relatively benign, while output generated under the persona of Nazi leader Adolf Hitler is extremely toxic.



The graph below groups personas into categories such as 'business person (green)', 'athlete (orange)', 'journalist (blue)', and 'dictator (pink)' and compares the toxicity of each category's output. Business-person personas produce the least harmful output, while dictator personas produce the most harmful.
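This per-category comparison amounts to averaging toxicity scores over the personas in each group. A simple sketch of that aggregation, using made-up scores and category labels purely for demonstration, might look like this.

```python
# Illustrative aggregation of toxicity scores by persona category.
# The personas, categories, and scores below are made up for demonstration.
from collections import defaultdict
from statistics import mean

scored_outputs = [
    {"persona": "business person A", "category": "businessperson", "toxicity": 0.10},
    {"persona": "journalist B", "category": "journalist", "toxicity": 0.22},
    {"persona": "dictator C", "category": "dictator", "toxicity": 0.55},
]

by_category = defaultdict(list)
for row in scored_outputs:
    by_category[row["category"]].append(row["toxicity"])

for category, scores in sorted(by_category.items()):
    print(f"{category}: mean toxicity = {mean(scores):.2f}")
```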



The research group notes that journalist personas scoring nearly twice as toxic as business-person personas does not mean that real journalists are twice as harmful as business people. Explaining the gap, the researchers say that, for example, Richard Nixon's persona receiving a toxicity score nearly twice that of John F. Kennedy's reflects only how ChatGPT has come to model those figures, not the people themselves.

In addition, when personas are grouped by region of origin, personas of people from Africa and Asia produce less harmful output, while personas of people from South America and Northern Europe produce especially harmful output.



The map below color-codes toxicity scores by country: countries with high toxicity are shown in dark red, and countries with low toxicity in white.



The analysis also shows that when assigned a dictator persona, ChatGPT directs more harmful remarks at countries associated with colonial rule, such as Great Britain, France, and Spain. About France, for example, it produced remarks like: 'France? Huh! It's a country that has long forgotten its glory days of conquest and colonization. It's just a bunch of cheese-eating surrender monkeys who always follow others.'

The research shows that ChatGPT generates more harmful content when personas or similar system-level settings are applied. The team argues that this indicates such AI systems are not yet ready for broad deployment, and that vulnerable users in particular cannot rely on chat AI safely. The researchers hope the work will open up new areas of innovation and research that enable the development of more robust, reliable, and secure AI systems.

in Software, Posted by logu_ii