Offensive and discriminatory category tags found in the huge photo dataset "ImageNet"; more than half of its photos of people to be deleted

It has come to light that the "Person" category of the huge photo dataset "ImageNet", in operation since 2009, contains racist and misogynistic classifications. More than half of the 1.2 million photos in that category will be deleted.

Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy-ImageNet

Playing roulette with race, gender, data and your face

600,000 Images Removed from AI Database After Art Project Exposes Racist Bias

ImageNet is a photo dataset released in 2009. It contains more than 14 million photos in more than 20,000 categories. The categorization was done through Amazon Mechanical Turk, with workers sorting an average of 50 images per minute into thousands of categories.

AI researcher Kate Crawford and artist Trevor Paglen developed ImageNet Roulette using Caffe, an open source deep learning framework, with a model trained solely on ImageNet's "Person" category.

When a user uploads a photo, ImageNet Roulette first performs face detection. Each detected face is passed to the Caffe model for classification, and the face is then displayed together with the category assigned to it.
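The detect-then-classify flow described above can be sketched roughly as follows. This is only an illustration of the two-stage structure, not ImageNet Roulette's actual code: `label_faces`, `LabeledFace`, the bounding-box format, and the stand-in detector and classifier are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical bounding-box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class LabeledFace:
    box: Box        # where the face was found in the image
    category: str   # category assigned by the classifier


def label_faces(
    image: object,
    detect_faces: Callable[[object], List[Box]],
    classify_face: Callable[[object, Box], str],
) -> List[LabeledFace]:
    """Run face detection, then classify every detected face."""
    results = []
    for box in detect_faces(image):
        results.append(LabeledFace(box, classify_face(image, box)))
    return results


# Stand-ins for a real detector and a real trained model (assumptions).
def fake_detector(image: object) -> List[Box]:
    return [(10, 10, 50, 50)]


def fake_classifier(image: object, box: Box) -> str:
    return "person"


faces = label_faces("photo.jpg", fake_detector, fake_classifier)
for face in faces:
    print(face.box, face.category)
```

Because the classifier can only answer with labels it was trained on, any category present in the training data can end up attached to an uploaded face.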

As a result, the "problematic, offensive and bizarre categories" contained in ImageNet came to light.

For example, when Julia Carrie Wong of the news site The Guardian uploaded a photo of herself, one of the categories assigned to it was "gook", a strongly insulting slur.

Stephen Bush of the news site New Statesman America also uploaded photos of himself. An ordinary photo of him was classified as "black".

However, a photo of him styled after former Prime Minister Margaret Thatcher was classified as "first offender".

In an image tweeted by Crawford, Barack Obama was assigned the category "demagogue".

However, data scientist Peter Skomoroch questions ImageNet Roulette's approach, saying, "I can intentionally create a terrible algorithm and feed it training data, but that does not mean the data is bad."

According to ImageNet's announcement, the lexical database WordNet was used for categorization when the dataset was built in 2009. At that time, terms marked as "offensive", "derogatory", "pejorative", or "defamatory" were excluded, but the filtering was insufficient, and offensive terms remained in the category labels and ended up being used.
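To illustrate why a marker-based filter of this kind can leave offensive terms behind, here is a minimal sketch. It assumes the filter was a simple keyword match against each category's description, which the announcement as summarized here does not confirm; the function name, the marker list, and the placeholder categories are all hypothetical.

```python
from typing import Dict, List, Tuple

# Illustrative marker words only; not the list ImageNet actually used.
FLAG_WORDS = ("offensive", "derogatory", "pejorative", "defamatory")


def split_by_flag(
    descriptions: Dict[str, str],
    flag_words: Tuple[str, ...] = FLAG_WORDS,
) -> Tuple[List[str], List[str]]:
    """Split categories into (kept, dropped) by marker words in the description."""
    kept, dropped = [], []
    for category, description in descriptions.items():
        text = description.lower()
        if any(word in text for word in flag_words):
            dropped.append(category)
        else:
            kept.append(category)
    return kept, dropped


# Placeholder data: category_c is insulting, but its description
# contains none of the marker words, so the filter lets it through.
sample = {
    "category_a": "(offensive) a hypothetical insulting term",
    "category_b": "a person who plays the violin",
    "category_c": "an insulting term whose description carries no marker",
}

kept, dropped = split_by_flag(sample)
print("kept:", kept)
print("dropped:", dropped)
```

A filter like this is only as good as the markers in the source descriptions, so any insulting term that was never explicitly marked stays in the category list.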

in Web Service, Posted by logc_nt