September 24, 2019, 13:00 Web Services

Offensive and discriminatory category labels found in the giant photo dataset "ImageNet"; more than half of its photos of people to be removed

ImageNet, a vast photo dataset in use since 2009, has been found to contain racist and misogynistic labels in its "person" category, and more than half of its 1.2 million photos of people will be removed as a result.

Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy - ImageNet
http://image-net.org/update-sep-17-2019

Playing roulette with race, gender, data and your face
https://www.nbcnews.com/mach/tech/playing-roulette-race-gender-data-your-face-ncna1056146

600,000 Images Removed from AI Database After Art Project Exposes Racist Bias
https://hyperallergic.com/518822/600000-images-removed-from-ai-database-after-art-project-exposes-racist-bias/

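ImageNet is a photo dataset published in 2009. It contains more than 14 million photos sorted into more than 20,000 categories. The labeling was done through Amazon Mechanical Turk, with workers sorting an average of 50 images per minute into thousands of categories.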

AI researcher Kate Crawford and artist Trevor Paglen developed "ImageNet Roulette", a tool that uses a model built with the open-source deep learning framework Caffe and trained only on ImageNet's "person" categories.

Want to see how an AI trained on ImageNet will classify you? Try ImageNet Roulette, based on ImageNet's Person classes. It's part of the 'Training Humans' exhibition by @trevorpaglen & me - on the history & politics of training sets. Full project out soon https://t.co/XWaVxx8DMC pic.twitter.com/paAywgpEo4
- Kate Crawford (@katecrawford) September 16, 2019

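When a user uploads a photo, ImageNet Roulette first runs face detection. If a face is found, it is sent to the Caffe model for classification, and the tool then displays the detected face together with the category assigned to it.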
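The flow described above (detect faces, classify each one, show the face with its label) can be sketched roughly as follows. This is not ImageNet Roulette's actual code: the image representation, the `label_faces` function, and the `fake_detector` and `fake_classifier` stand-ins are all hypothetical names chosen for illustration, and a real tool would plug in an actual face detector and a trained model instead.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: an image is a list of pixel rows; a box is (top, left, bottom, right).
Image = List[List[int]]
Box = Tuple[int, int, int, int]


@dataclass
class LabeledFace:
    box: Box    # where the face was found in the uploaded photo
    label: str  # category assigned by the classifier


def crop(image: Image, box: Box) -> Image:
    """Cut the face region out of the image (bottom and right are exclusive)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def label_faces(
    image: Image,
    detect_faces: Callable[[Image], List[Box]],
    classify_face: Callable[[Image], str],
) -> List[LabeledFace]:
    """Detect faces, classify each crop, and return what would be displayed."""
    results = []
    for box in detect_faces(image):  # step 1: face detection
        face = crop(image, box)      # step 2: isolate the detected face
        results.append(LabeledFace(box, classify_face(face)))  # step 3: classify
    return results  # empty when no face is detected, so nothing gets labeled


# Hypothetical stand-ins so the sketch runs without a real detector or model.
def fake_detector(image: Image) -> List[Box]:
    return [(0, 0, 2, 2)] if image else []


def fake_classifier(face: Image) -> str:
    return "category-%d" % sum(map(sum, face))


photo = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
labeled = label_faces(photo, fake_detector, fake_classifier)
```

Passing the detector and classifier in as functions keeps the sketch independent of any particular library. The point is only that the label shown to the user is whatever the classifier was trained to output, which is why the contents of the training categories matter.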

What this brought to light was the set of "problematic, offensive, and bizarre categories" contained in ImageNet.

For example, when Julia Carrie Wong of the news site The Guardian uploaded a picture of herself, one of the categories assigned to it was "gook", a strongly derogatory term.

Stephen Bush of the news site New Statesman America also uploaded photos of himself. One of them was classified simply as "Black".

Fascinating insight into the classification system and categories used by Stanford and Princeton, in the software that acts as the baseline for most image identification algorithms. pic.twitter.com/QWGvVhMcE4
- Stephen Bush (@stephenkb) September 16, 2019

However, a photo in which he was styled as former Prime Minister Margaret Thatcher was classified as "first offender".

Fun* game: feeding in my Guardian 'Can I Cook LIke' photoshoots into the Imagenet software and seeing what I get. https://t.co/yoBOoCjEYV pic.twitter.com/OLCqiZkXnA
- Stephen Bush (@stephenkb) September 16, 2019

In an image tweeted by Crawford, Barack Obama is labeled with the category "demagogue".

Whoa, ImageNet Roulette went... nuts. The servers are barely standing. Good to see this simple interface generate an international critical discussion about the race & gender politics of classification in AI, and how training data can harm. More here: https://t.co/m0Pi5GOmgv pic.twitter.com/0HgYsTewbx
- Kate Crawford (@katecrawford) September 18, 2019

Data scientist Peter Skomoroch, however, questions ImageNet Roulette's approach, saying, "I can intentionally create a garbage algorithm and feed it any training data, it doesn't mean the data is 'bad'."

This is junk science. "we want to shed light on what happens when technical systems are trained on problematic training data". I can intentionally create a garbage algorithm and feed it any training data, it doesn't mean the data is "bad".
- Peter Skomoroch (@peteskomoroch) September 17, 2019

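According to ImageNet's announcement, the categories were drawn from the lexical database WordNet when the dataset was built in 2009. Entries falling under "offensive", "derogatory", "pejorative", or "slur" were excluded at that time, but the filtering was not thorough enough, so offensive terms remained and ended up being used as categories.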
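To illustrate how a filter of this kind can let offensive terms slip through, here is a minimal sketch. It assumes, purely for illustration, that the filter was a keyword match against each entry's definition text; the announcement as summarized here does not say how the filtering was actually implemented. The entries are made-up placeholders, not real WordNet data, and `is_flagged` and `filter_categories` are hypothetical helper names.

```python
from typing import Dict, List

# The four marker words named in the announcement, as rendered in this article.
MARKERS = ("offensive", "derogatory", "pejorative", "slur")


def is_flagged(definition: str) -> bool:
    """Assumed rule: flag an entry if its definition mentions any marker word."""
    text = definition.lower()
    return any(marker in text for marker in MARKERS)


def filter_categories(entries: Dict[str, str]) -> List[str]:
    """Return the category names that survive the keyword filter."""
    return [name for name, definition in entries.items()
            if not is_flagged(definition)]


# Hypothetical entries (placeholders, not taken from WordNet).
entries = {
    "term_a": "a person who plays the violin",
    "term_b": "(offensive slang) a member of some group",
    "term_c": "an insulting name for a foreigner",
}

kept = filter_categories(entries)
```

In this toy example `term_b` is removed, but `term_c` is kept even though it is just as problematic, because its definition happens not to contain any of the marker words. A filter that depends on how each entry is worded can miss terms in this way.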
