Yahoo! begins offering a giant dataset of over 13 TB, including age, gender, and residential area, free for machine learning


By Bob Mical

Machine learning, one of the technologies for realizing artificial intelligence, analyzes large amounts of data, extracts rules and patterns, and generates algorithms. Learning accuracy generally improves with the amount of data supplied at the training stage, but it is difficult for individuals, and even for research institutions, to secure data of sufficient quality and quantity. Yahoo! has announced that it will begin providing a huge dataset that can be used for machine learning, and the data is enormous, exceeding 100 billion records.

Yahoo Releases the Largest-ever Machine Learning ... | Yahoo Labs
http://yahoolabs.tumblr.com/post/137281912191/yahoo-releases-the-largest-ever-machine-learning

Yahoo, which attracts heavy traffic and provides services close to ordinary consumers, says it has so far made extensive use of data on user actions (user interactions) within its sites for the machine learning problems tackled at Yahoo Labs. Yahoo Labs has now announced that it will release valuable data that had never before been made available outside the company.

The data to be released was collected from February 2015 to March 2015. It is said to be a record of actions taken by approximately 20 million users on the news feed displayed on Yahoo news pages. There are about 110 billion records, and the data is 13.5 TB (terabytes) before compression and still a massive 1.5 TB after compression. The data has also been anonymized.
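To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the approximate numbers quoted above and assumes decimal terabytes (1 TB = 10^12 bytes); the variable names are purely illustrative and are not part of the dataset.

```python
# Approximate figures quoted in the article.
# Assumption: 1 TB = 10**12 bytes (decimal terabytes).
TB = 10**12

records = 110 * 10**9          # about 110 billion records
users = 20 * 10**6             # about 20 million users
uncompressed_bytes = 135 * TB // 10   # 13.5 TB before compression
compressed_bytes = 15 * TB // 10      # 1.5 TB after compression

# How much smaller the data gets when compressed.
compression_ratio = uncompressed_bytes / compressed_bytes

# Average size of one record before compression.
bytes_per_record = uncompressed_bytes / records

# Average number of records per user over the collection period.
records_per_user = records / users

print(f"compression ratio: {compression_ratio:.1f}")
print(f"bytes per record:  {bytes_per_record:.1f}")
print(f"records per user:  {records_per_user:.0f}")
```

On these rounded figures, the data compresses by a factor of about 9, each record averages roughly 123 bytes uncompressed, and each user accounts for about 5,500 records on average. These are estimates derived from rounded numbers, not values published by Yahoo.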


The published data includes, as a subset, age group, gender, and rough geographic data in addition to user behavior data. News article titles, summaries, and key phrases are also included for each item so that they can be used in analysis. The user interaction data includes the time at which each action occurred and the type of device used for access.

As the purpose of this decision, Yahoo cites promoting independent research in the fields of machine learning and recommender systems at large scale, and bridging the gap in research level between industry and academic institutions. Using this huge body of data, Yahoo Labs' Personalization Science Team has been studying behavior modeling, recommender systems, large-scale distributed machine learning, ranking, online algorithms, content modeling, time series mining, and more.

The data published by Yahoo Labs can be downloaded from the following page, but a US Yahoo! ID appears to be required.

Webscope | Yahoo Labs

in Science, Posted by darkhorse_log