Penguin Adds a Do-Not-Scrape-for-AI Page to Its Books

Taking a firm stance against tech companies’ unlicensed use of its authors’ works, the publishing giant Penguin Random House will change the language on all of its books’ copyright pages to expressly prohibit their use in training artificial intelligence systems, according to reporting by The Bookseller.

It’s a notable departure from other large publishers, such as academic printers Taylor & Francis, Wiley, and Oxford University Press, which have all agreed to license their portfolios to AI companies.

Matthew Sag, an AI and copyright expert at Emory University School of Law, said Penguin Random House’s new language appears to be directed at the European Union market but could also affect how AI companies in the U.S. use its material. Under EU law, copyright holders can opt out of having their work mined for data. While that right isn’t enshrined in U.S. law, the largest AI developers generally don’t scrape content behind paywalls or content excluded by sites’ robots.txt files. “You would think there is no reason they should not respect this kind of opt out [that Penguin Random House is including in its books] so long as it is a signal they can process at scale,” Sag said.
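To illustrate what a machine-readable opt-out signal looks like in practice, here is a minimal Python sketch of how a crawler might check a site's robots.txt rules before fetching a page. The crawler name `ExampleAIBot`, the rules, and the URLs are all hypothetical, and this is not a description of how any particular AI developer's crawler works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the crawler name and paths are
# illustrative assumptions, not taken from any real site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The hypothetical AI crawler is excluded from the whole site.
ai_allowed = is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/books/chapter-1")

# Other crawlers fall back to the wildcard rules.
other_allowed = is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/books/chapter-1")
other_private = is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/page")

print(ai_allowed, other_allowed, other_private)
```

A check like this is cheap to run across millions of pages, which is the sense in which a signal can be processed "at scale." A notice printed on a book's copyright page has no equivalent standard format.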

Dozens of authors and media companies have filed lawsuits in the U.S. against Google, Meta, Microsoft, OpenAI, and other AI developers accusing them of violating the law by training large language models on copyrighted work. The tech companies argue that their actions fall under the fair use doctrine, which allows for the unlicensed use of copyrighted material in certain circumstances—for example, if the derivative work substantially transforms the original content or if it’s used for criticism, news reporting, or education.

U.S. courts haven’t yet decided whether feeding a book into a large language model constitutes fair use. Meanwhile, social media trends in which users post messages telling tech platforms not to train AI models on their content have been predictably unsuccessful.

Penguin Random House’s no-training message is a bit different from those optimistic copypastas. For one thing, social media users have to agree to a platform’s terms of service, which invariably allow their content to be used to train AI. For another, Penguin Random House is a wealthy international publisher that can back up its message with teams of lawyers.

The Bookseller reported that the publisher’s new copyright pages will read, in part: “No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems. In accordance with Article 4(3) of the Digital Single Market Directive 2019/790, Penguin Random House expressly reserves this work from the text and data mining exception.”

Tech companies are happy to mine the internet, particularly sites like Reddit, for language datasets, but the quality of that content tends to be poor: full of bad advice, racism, sexism, and all the other isms, contributing to bias and inaccuracies in the resulting models. AI researchers have said that books are among the most desirable training data for models due to the quality of writing and fact-checking.

If Penguin Random House can successfully wall off its copyrighted content from large language models, it could have a significant impact on the generative AI industry. Developers would be forced either to start paying for high-quality content, which would be a blow to business models reliant on using other people’s work for free, or to try to sell customers on models trained on low-quality internet content and outdated published material.
