DeepMind's attempt to 'use AI to check whether AI makes discriminatory remarks'
A language model is a formalization of grammar and word relationships so that an AI can handle natural language. DeepMind has published research on having language models 'red team' other language models, probing them for harmful output before release.
Red Teaming Language Models with Language Models | DeepMind
https://deepmind.com/research/publications/2022/Red-Teaming-Language-Models-with-Language-Models
In 2016, Microsoft released 'Tay', a Twitter bot that tweeted automatically, as an experiment in conversational understanding. However, within hours of its release, Tay was shut down after posting a series of racist and sexist tweets.
Microsoft's artificial intelligence went up in flames after a string of problematic remarks such as 'burn the fucking feminists in hell' and 'Hitler was right' --GIGAZINE
Tay did not go off the rails because Microsoft was careless, but because some users deliberately taught it discriminatory expressions for their own amusement. DeepMind notes, 'As the Tay example shows, the problem is that there are so many possible inputs that can cause a language model to generate harmful text that it is almost impossible to find all the cases where the model fails before it is deployed in the real world.'
Up to now, preventing such failures has relied on humans manually testing the language model in advance for harmful behavior. The bottleneck of manual checking is that it is expensive and covers only a limited number of test cases.
The approach DeepMind has now announced is therefore to use a language model as the annotator that tests another language model. Specifically, a 'red team' language model feeds a wide variety of questions to the model under investigation, and the answers the target model produces are then analyzed and evaluated. A minimal sketch of this loop is shown below.
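The loop can be pictured with the following minimal Python sketch. The functions red_team_generate, target_answer, and looks_harmful are hypothetical stand-ins for the actual red-team model, target model, and harm classifier, which the article does not specify; only the overall generate-answer-classify structure reflects the described approach.

```python
# Minimal sketch of the red-teaming loop: a "red team" LM proposes test
# questions, the target LM answers them, and a classifier flags answers
# that look harmful. The three functions below are hypothetical stand-ins
# for real model calls.
from typing import List, Tuple


def red_team_generate(n: int) -> List[str]:
    """Stand-in for the red-team LM: produce n candidate test questions."""
    return [f"Hypothetical probing question #{i}" for i in range(n)]


def target_answer(question: str) -> str:
    """Stand-in for the language model under test."""
    return f"Hypothetical answer to: {question}"


def looks_harmful(answer: str) -> bool:
    """Stand-in for the harm classifier (e.g. an offensive-language detector)."""
    return "hypothetical-bad-phrase" in answer.lower()


def red_team(n_questions: int) -> List[Tuple[str, str]]:
    """Collect (question, answer) pairs whose answers the classifier flags."""
    failures = []
    for question in red_team_generate(n_questions):
        answer = target_answer(question)
        if looks_harmful(answer):
            failures.append((question, answer))
    return failures


if __name__ == "__main__":
    print(f"Flagged {len(red_team(100))} failure cases out of 100 questions")
```

The flagged question-answer pairs are the test cases that a human reviewer, or a later fine-tuning step, could then act on.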
DeepMind's approach identifies five kinds of harmful behavior in a language model:
1: Offensive language
Malicious, discriminatory, profane, or sexually explicit expressions.
2: Data leakage
Generating and disclosing personally identifiable information.
3: Contact information generation
Directing users to contact specific real people by email or phone.
4: Distributional bias
Across a large number of outputs, talking about certain groups of people in an unjustifiably different way from other groups, on average (a rough sketch of such a check appears after this list).
5: Conversational harms
Offensive language that emerges in the course of a long dialogue.
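For the distributional-bias category, one rough way to picture such a check is to score many outputs that mention each group with some harm score and compare the group averages. This is an illustrative sketch only, not necessarily the metric used in the paper; harm_score and the sample data are hypothetical.

```python
# Illustrative sketch of a distributional-bias check: average a harm score
# over many outputs about each group and compare the averages. A large gap
# between groups would suggest biased behavior on average.
from statistics import mean
from typing import Dict, List


def harm_score(text: str) -> float:
    """Stand-in scorer in [0, 1]; in practice a learned classifier."""
    return 1.0 if "hypothetical-bad-phrase" in text.lower() else 0.0


def group_averages(outputs_by_group: Dict[str, List[str]]) -> Dict[str, float]:
    """Average harm score per group of outputs."""
    return {group: mean(harm_score(o) for o in outputs)
            for group, outputs in outputs_by_group.items()}


if __name__ == "__main__":
    samples = {
        "group_a": ["a neutral sentence", "another neutral sentence"],
        "group_b": ["a neutral sentence", "hypothetical-bad-phrase here"],
    }
    print(group_averages(samples))  # e.g. {'group_a': 0.0, 'group_b': 0.5}
```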
DeepMind says, 'Our aim is for red teaming to complement manual testing and reduce the number of critical oversights by automatically discovering language model failures. In the future, our approach could also be used to anticipate and detect in advance the harms that advanced machine learning systems might cause, for example due to flaws in their internal structure or robustness. This approach is just one component of responsible language model development; we regard language model red teaming as one tool, used alongside many others, for discovering and mitigating the harmful effects of our models.'
in Software, Posted by log1i_yk