Aug 21, 2023 13:18:00

AI training dataset 'Books3', which was also used to train Meta's large language model 'LLaMA', has been taken down

The Danish anti-piracy group 'Rights Alliance' asked the host 'The Eye' to remove 'Books3', a dataset of about 200,000 books, and the dataset was taken down. Books3 was also used to train 'LLaMA', the large language model developed by Meta.

Anti-Piracy Group Takes Prominent AI Training Dataset 'Books3' Offline - TorrentFreak
https://torrentfreak.com/anti-piracy-group-takes-prominent-ai-training-dataset-books3-offline-230816/

Revealed: The Authors Whose Pirated Books Are Powering Generative AI - The Atlantic
https://www.theatlantic.com/technology/archive/2023/08/books3-ai-meta-llama-pirated-books/675063/

Massive Books3 collection for training AI was taken down over copyright issues | Mashable
https://mashable.com/article/books3-ai-training-dmca-takedown

Anti-Piracy Group Takes AI Training Dataset 'Books3' Offline | Gizmodo
https://gizmodo.com/anti-piracy-group-takes-ai-training-dataset-books3-off-1850743763

Books3 was released as part of 'The Pile', an open-source AI training dataset provided by the non-profit AI research group 'EleutherAI'. It contained about 37 GB of text from 196,640 books for training AI models.

Books3 was uploaded in 2020 by AI developer Shawn Presser and had since been hosted by The Eye, a large-scale repository. "The development goal of Books3 was to allow anyone to create an AI model comparable to ChatGPT," Presser said. "It's important to be able to create your own ChatGPT-like AI model in case ChatGPT goes offline for some reason or faces a lawsuit."

Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data.

Now you do. Now everyone does.

Presenting 'books3', aka 'all of bibliotik'

- 196,640 books
- in plain .txt
- reliable, direct download, for years: https://t.co/KKSrhEAnrD

thread pic.twitter.com/m6bdpHfYJx
— Shawn Presser (@theshawwn) October 25, 2020

Books3 was also used to train large language models such as Meta's LLaMA and Bloomberg's BloombergGPT, and Meta researchers described Books3 as 'a public dataset for training large-scale language models' (PDF file).

The Eye claims that 'all datasets comply with the Digital Millennium Copyright Act,' but concerns about intellectual property and copyright infringement have been raised.

Amid growing concerns about copyright infringement by AI, the Rights Alliance asked The Eye to remove Books3 under the Digital Millennium Copyright Act. "It is very important to prevent AI from being trained on pirated and illegal content," said Maria Fredenslund, director of the Rights Alliance. "It is a significant challenge not only to detect and remove illegal AI training datasets, but also to deal with AI that has been trained on illegal content and is now prevalent on the internet."

The Eye removed the Books3 dataset following the removal request from the Rights Alliance. At the time of writing, accessing Books3 returns a 404 error.

However, although the Books3 download link published by The Eye has been taken offline, the dataset has not been completely removed from the internet. TorrentFreak reports that 'files are still backed up on the Internet Archive's Wayback Machine, and alternative download links are also shared.' "Like traditional pirated books and movies, it's very difficult to remove once it's out," TorrentFreak said.

In addition to asking The Eye to remove Books3, the Rights Alliance is also asking Meta to respond regarding Books3. "It is unlikely that Meta will retrain LLaMA to eliminate concerns about copyright infringement," said technology news outlet Gizmodo. "AI developers and development companies need a framework to always share details such as the training data used to create the AI model," Fredenslund said.

Aug 21, 2023 13:18:00 in AI, Software, Posted by darkhorse_log