What did an algorithmic search for "harassment on Wikipedia" reveal?


By Johann Dreo

Abuse and harassment of others online happen all over the world, and the Internet encyclopedia Wikipedia is no exception: harassment of individuals there shows no sign of stopping. The Wikimedia Foundation, which regards the situation as a problem, has organized a survey team and begun investigating the actual situation in cooperation with "Jigsaw", an incubator (entrepreneur support provider) under Alphabet.

Algorithms and insults: Scaling up our understanding of harassment on Wikipedia - Wikimedia Blog
https://blog.wikimedia.org/2017/02/07/scaling-understanding-of-harassment/

On the online encyclopedia Wikipedia, anyone can become an "editor" and edit or update its contents. Each article has a page where editors can discuss it with one another and exchange opinions about the content. However, not everything on these pages is written in good faith, and many posts can only be regarded as slander or harassment of individuals.

A visualization of 30 days of posts and edits on Wikipedia's "talk" pages shows 164,102 entries that were not flagged as attacks (black dots), against 573 attacking entries (red) and 519 that were offensive but were reverted or deleted, for a total of 1,092 attacks. The actual extent of harassment, however, is said to be more deeply rooted.
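To put those counts in proportion, the short Python sketch below works out what share of the 30-day sample the attacks represent and how many of them were reverted or deleted. It takes the black-dot count as 164,102 and treats the three categories as non-overlapping; both are assumptions made for illustration, and the variable names are hypothetical.

```python
# Counts reported for 30 days of talk-page activity (see the paragraph above).
# Assumption: the three categories do not overlap.
not_flagged = 164_102      # black dots: entries not flagged as attacks
attacks_remaining = 573    # red: attacking entries left in place
attacks_removed = 519      # offensive entries that were reverted or deleted

attacks_total = attacks_remaining + attacks_removed
all_entries = not_flagged + attacks_total


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


attack_share = percent(attacks_total, all_entries)
removed_share = percent(attacks_removed, attacks_total)

print(f"attacks: {attacks_total} of {all_entries} entries ({attack_share:.2f}%)")
print(f"reverted or deleted: {removed_share:.1f}% of attacks")
```

Under these assumptions, attacks make up well under 1% of the sample, and slightly fewer than half of them were reverted or deleted.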

By Hoshi Ludwig

Since the beginning of 2016, the Wikimedia Foundation has been surveying the actual state of harassment on Wikipedia in cooperation with Jigsaw. The two set out to develop computer algorithms that automatically analyze what is written on talk pages.

During development, machine learning is being used to improve recognition accuracy. The algorithm was reportedly trained on 100,000 comments, while 4,000 crowdworkers were asked to judge whether about 1 million comments constituted harassment, and their judgments were fed back into the algorithm's training.
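The general idea of turning human judgments into training data for a classifier can be sketched in a few lines of Python. This is only a toy illustration, not the survey team's actual method: the majority-vote aggregation, the naive Bayes word-count model, the sample comments, and every function name below are assumptions made for the example.

```python
import math
from collections import Counter


def majority_label(votes):
    """Return True when more than half of the boolean votes say 'attack'."""
    return sum(votes) * 2 > len(votes)


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(labeled):
    """Count words per class from a list of (text, is_attack) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_attack in labeled:
        class_counts[is_attack] += 1
        word_counts[is_attack].update(tokenize(text))
    return word_counts, class_counts


def predict(model, text):
    """Naive Bayes with add-one smoothing; True means 'looks like an attack'.

    Assumes both classes appear at least once in the training data.
    """
    word_counts, class_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        score = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]


# Hypothetical judgments: each comment was rated by three imaginary workers.
judgments = {
    "you are an idiot": [True, True, False],
    "thanks for the helpful edit": [False, False, False],
    "you idiot stop editing": [True, True, True],
    "please add a source for this claim": [False, True, False],
}

labeled = [(text, majority_label(votes)) for text, votes in judgments.items()]
model = train(labeled)

print(predict(model, "what an idiot"))
print(predict(model, "thanks for the source"))
```

A real system would need far more data and a stronger model, but the flow is the same: several human judgments per comment are combined into one label, and the labeled comments are then used to fit a model that scores unseen text.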

Incidentally, the dataset used for this work has been published on figshare at the link below, so it can be freely downloaded and used.

Wikipedia Talk


Based on these investigations, the survey team has released the following three findings.

◆ 1: How often is moderation action taken when attacks occur?
Only 18% of attacks are said to have been followed by a measure such as a warning or a block against the attacker. Even among users who made four or more attacks, some action was actually taken against only 60%.

◆ 2: The effect of "anonymity" on personal attacks
On English Wikipedia, 67% of the users who made attacks were registered editors. This runs contrary to the widely held assumption that "anonymous users are the ones causing the problems".

◆ 3: What is the ratio of heavy users who edit frequently to users who edit only occasionally?
As shown in the graph below, when users are classified by number of edits and each group's share of all attack activity is plotted, no large difference is apparent. Users who make 1 to 5 edits per year account for roughly half of all attacks, and even users who make 100 or more edits per year account for 30% of the total.

By Nithum

Although this work has shed light on the state of harassment on Wikipedia, the survey team believes that what it found is still only part of the whole picture. Algorithmic judgment still has its limits, and harassment that contains no specific offensive terms in particular is likely to be difficult to identify in many cases. In addition, the analysis was limited to English, and further improvement is said to be needed in the future.

in Software, Web Service, Posted by darkhorse_log