Makers of generative artificial intelligence tools like ChatGPT have been using copious amounts of copyrighted news material to train their chatbots, according to new accusations from a trade group.
The News/Media Alliance, which represents over 2,200 publishers, showcased its research in a blog post and white paper Tuesday, saying AI companies regularly use the information in news stories without authorization and violate laws protecting that intellectual property.
“The research and analysis we’ve conducted shows that AI companies and developers are not only engaging in unauthorized copying of our members’ content to train their products, but they are using it pervasively and to a greater extent than other sources,” said Danielle Coffey, Alliance president and CEO, in the release. “This diminishment of high-quality, human created content harms not only publishers but the sustainability of AI models themselves and the availability of reliable, trustworthy information.”
The group’s research claims the datasets used to train major chatbots’ large language models (a building block of generative AI essential to tools like ChatGPT and Bard) “significantly” overweighted content from news, magazines, and digital media sources, using it 5 to almost 100 times as frequently as other content.
LLMs also copy and use publisher content in their outputs to users, the group said, putting them in competition with the news outlets.
The danger, beyond the potential violation of copyright laws, is that by diminishing the value of human-created work, AI companies could put publishers at risk, said the News/Media Alliance. Additionally, AI could ultimately end up hurting its own sustainability as trustworthy information becomes harder to find.
“Continued unauthorized use will harm existing markets that acknowledge the value of archived and real-time quality content, and over time the GAI models themselves will deteriorate,” said Coffey. “You get out what you put in.”
Google and OpenAI did not immediately respond to Fast Company’s request for comments on the report.
The News/Media Alliance has submitted its findings to the U.S. Copyright Office’s study of AI and copyright law. The group is also encouraging AI creators to work out licensing agreements with news organizations or compensate publishers for the use of their content.
“Generative AI systems should be held responsible and accountable, just like any other business,” said Coffey. “It is critical that our copyright protections are properly enforced and that high standards of quality and accountability are the foundation of these and other new technologies.”
The news industry isn’t the first to take AI to task for using copyrighted material. Since the debut of DALL-E, writers and artists have complained that AI tools are incorporating their work without acknowledgement or compensation.
But the reliance of AI on news is notable given the media industry’s struggles of late. Last month, Thomson Reuters accused a legal AI company of copying its content, specifically the legal summaries in Westlaw. That case is tentatively set to go to trial next May.
Meanwhile, in July, a group of writers including comedian Sarah Silverman filed suit against OpenAI and Meta, alleging that the companies had improperly trained their LLMs on copyrighted books, including Silverman’s “The Bedwetter,” and noting the AI could offer a detailed synopsis of every chapter. And over 15,000 authors, including Nora Roberts, Margaret Atwood, and Jodi Picoult, signed an open letter to the heads of AI companies calling on them to protect writers.
“In the past decade or so, authors have experienced a 40% decline in income, and the current median income for full-time writers in 2022 was only $20,000,” the letter read. “The introduction of AI threatens to tip the scale to make it even more difficult, if not impossible, for writers—especially young writers and voices from under-represented communities—to earn a living from their profession.”
By Chris Morris