AI training dataset 'Books3', which was also used to train Meta's large language model 'LLaMA', has been taken down
The Danish anti-piracy group 'Rights Alliance' has had Books3, a dataset of books used to train AI models, taken offline.
Anti-Piracy Group Takes Prominent AI Training Dataset 'Books3' Offline - TorrentFreak
https://torrentfreak.com/anti-piracy-group-takes-prominent-ai-training-dataset-books3-offline-230816/
Revealed: The Authors Whose Pirated Books Are Powering Generative AI - The Atlantic
Massive Books3 collection for training AI was taken down over copyright issues | Mashable
https://mashable.com/article/books3-ai-training-dmca-takedown
Anti-Piracy Group Takes AI Training Dataset 'Books3' Offline
https://gizmodo.com/anti-piracy-group-takes-ai-training-dataset-books3-off-1850743763
Books3 was released as part of 'The Pile', an open-source AI training dataset provided by the non-profit AI research group 'EleutherAI'. It contains about 37 GB of text from 196,640 books for training AI models.
Books3 was uploaded in 2020 by AI developer Shawn Presser and has since been hosted by the large-scale repository The Eye. Presser said, "The development goal of Books3 was to allow anyone to create an AI model comparable to ChatGPT," adding, "It's important to be able to create your own ChatGPT-like AI model in case ChatGPT goes offline for some reason or faces a lawsuit."
Shawn Presser (@theshawwn), October 25, 2020:
Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data.
Now you do. Now everyone does.
Presenting 'books3', aka 'all of bibliotik'
- 196,640 books
- in plain .txt
- reliable, direct download, for years: https://t.co/KKSrhEAnrD
thread pic.twitter.com/m6bdpHfYJx
Books3 has also been used to train large language models such as Meta's LLaMA and Bloomberg's BloombergGPT, and Meta researchers describe Books3 as 'a public dataset for training large-scale language models' (PDF file).
Amid growing concerns about copyright infringement by AI, the Rights Alliance sent The Eye a request to remove Books3 under the Digital Millennium Copyright Act (DMCA). "It is very important to prevent AI from being trained with pirated and illegal content," said Maria Fredenslund, director of the Rights Alliance. "There is a significant challenge not only to detect and remove illegal AI training datasets, but also to deal with AI that has been trained on illegal content and is now prevalent on the internet."
The Eye removed the Books3 dataset following the removal request from the Rights Alliance. As of this writing, attempting to access Books3 returns a 404 error.
However, although the Books3 download link published by The Eye was taken offline, the dataset has not been completely removed from the Internet. TorrentFreak reports that "files are still backed up on the Internet Archive's Wayback Machine, and alternative download links are also shared," adding, "Like traditional pirated books and movies, it's very difficult to remove once it's out."
In addition to asking The Eye to delete Books3, the Rights Alliance is asking Meta to respond regarding its use of Books3. Gizmodo, a technology news outlet, noted that "it is unlikely that Meta will retrain LLaMA to eliminate concerns about copyright infringement." Fredenslund said, "AI developers and development companies need a framework to always share details such as the training data used to create the AI model."
in Software, Posted by log1r_ut