AI training dataset 'Books3', which was also used to train Meta's large language model 'LLaMA', has been taken down

Books3, released as part of 'The Pile', the open source AI training dataset provided by the non-profit AI research group 'EleutherAI', contained about 37 GB of data from roughly 196,640 books for training AI models.

Books3 was uploaded in 2020 by AI developer Shawn Presser and has since been hosted by the large-scale repository The Eye. "The development goal of Books3 was to allow anyone to create an AI model comparable to ChatGPT," Presser said. "It's important to be able to create your own ChatGPT-like AI model in case ChatGPT goes offline for some reason or faces a lawsuit."

Books3 has also been used to train large language models such as Meta's LLaMA and BloombergGPT, and Meta researchers described Books3 as 'a public dataset for training large-scale language models' (PDF file).

The Eye claims that 'all datasets comply with the Digital Millennium Copyright Act,' but concerns about intellectual property and copyright infringement have been raised.

Amid growing concerns about copyright infringement by AI, the Rights Alliance asked The Eye to remove Books3, citing infringement under the Digital Millennium Copyright Act. "It is very important to prevent AI from being trained with pirated and illegal content," said Maria Fredenslund, director of the Rights Alliance. "There is a significant challenge not only to detect and remove illegal AI training datasets, but also to deal with AI that has been trained on illegal content and is now prevalent on the internet."

The Eye removed the Books3 dataset following the removal request from the Rights Alliance. As of this writing, attempting to access Books3 returns a 404 error.

However, although the Books3 download link published by The Eye was taken offline, it has been pointed out that the dataset has not been completely removed from the Internet. TorrentFreak reports that 'files are still backed up on the Internet Archive's Wayback Machine, and alternative download links are also shared.' "Like traditional pirated books and movies, it's very difficult to remove once it's out," TorrentFreak said.

In addition to asking The Eye to delete Books3, the Rights Alliance is asking Meta to respond regarding Books3. "It is unlikely that Meta will retrain LLaMA to eliminate concerns about copyright infringement," said technology news outlet Gizmodo. "AI developers and development companies need a framework to always share details such as the training data used to create the AI model," Fredenslund said.

