Penguin Random House, the world's largest publisher, says 'no' to AI



The rise of AI is forcing content publishers to choose whether to coexist with it or confront it. The British publishing trade magazine The Bookseller reported that Penguin Random House, a major American publisher, will take measures to explicitly prohibit future books from being used to train AI.

The Bookseller - News - Penguin Random House underscores copyright protection in AI rebuff
https://www.thebookseller.com/news/penguin-random-house-underscores-copyright-protection-in-ai-rebuff

Penguin Random House books now explicitly say 'no' to AI training - The Verge
https://www.theverge.com/2024/10/18/24273895/penguin-random-house-books-copyright-ai

Penguin Random House will reportedly amend the copyright pages of all new and reprinted books published worldwide to include a statement that explicitly prohibits the use of their content for training AI, including large language models (LLMs).

The new copyright page states: 'No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems.'



According to The Bookseller, Penguin Random House is the first major publisher to mention AI on its copyright page.

In addition, the following statement will be added, pursuant to the EU's Directive on Copyright in the Digital Single Market (CDSM Directive): 'This work is expressly reserved from the text and data mining exception.'

Earlier EU rules treated uses such as education and news reporting as exceptions under which copyrighted works could be used without permission from the copyright holder. The CDSM Directive, adopted in 2019, added 'text and data mining (TDM)' to those exceptions.

This allows copyrighted works to be freely used in TDM for purposes such as academic research, but at the same time, copyright holders can now declare that certain works are 'explicitly reserved' from the exceptions, thereby excluding those works from the exceptions and prohibiting their use in TDM without permission.



However, some are skeptical about the effectiveness of this change. The Verge, a technology news site that covered The Bookseller's report, noted that requests placed in websites' robots.txt files not to use their content for AI training are often ignored, and said, 'This revision is like a Penguin Random House version of robots.txt, which may serve as a warning but has little to do with actual copyright law. Copyright is protected regardless of whether a copyright page is inserted at the beginning of a book, and the content may be used freely under fair use or other exceptions, regardless of whether the rights holder approves.'
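To make the robots.txt comparison concrete, here is a minimal sketch of how a well-behaved crawler might check a site's robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot`, the robots.txt contents, and the URL are hypothetical and chosen only for illustration. The check is entirely voluntary: nothing in the file stops a crawler that skips it, which is the limitation The Verge points to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: asks one (made-up) AI crawler to stay away
# while leaving the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/books/chapter-1"  # placeholder URL
    # A compliant crawler would skip the fetch whenever this is False.
    print("ExampleAIBot allowed:", may_fetch("ExampleAIBot", url))
    print("SomeBrowser allowed:", may_fetch("SomeBrowser", url))
```

The file only expresses the site owner's request; honoring it is left to whoever operates the crawler.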

It has also been pointed out that publishers' outright opposition to the use of books in AI runs counter to efforts to at least secure fair compensation for such use. Amid reports that AI has already been trained on large numbers of pirated books, prominent publishers such as Oxford University Press have indicated their intention to grant licenses to AI companies.

Meanwhile, several authors and major media outlets have filed lawsuits against AI companies, alleging that the unauthorized use of their content to train AI constitutes copyright infringement. The industry remains deeply divided over how to respond to AI.

Microsoft and OpenAI sued by eight newspapers for copyright infringement - GIGAZINE



Prior to this decision, Penguin Random House announced three principles in August 2024, declaring that it would 'champion human creativity,' 'vigorously protect the intellectual property belonging to authors and artists,' and 'use generative AI tools responsibly.'

Chien-Wei Lui, copyright attorney at Fox Williams, commented: 'While the likelihood of AI generating output that copies or infringes copyrighted works is extremely remote, training an LLM is in itself an infringing activity, and publishers should be able to control such activity for their own benefit and the benefit of their authors. While the acceleration of generative AI is becoming an existential issue for the publishing industry, a more down-to-earth concern is the lost revenue for both publishers and authors if their content is used for training without consent.'

in Software, Posted by log1l_ks