“A Business Model Based on Mass Copyright Infringement”: The New York Times lawsuit against OpenAI

Image: ‘OpenAI’ by Focal Foto is licensed under CC BY-NC 2.0

On December 27th 2023, The New York Times filed a lawsuit against OpenAI and Microsoft, alleging that “millions” of its articles had been unlawfully used to train their generative artificial intelligence tools. The lawsuit launches a comprehensive attack on the technologies underpinning those tools, alleging that theirs is “a business model based on mass copyright infringement”. So what is the basis of The New York Times’ claims, how has OpenAI responded, and what does it mean for the future of generative AI technologies?

OpenAI, the $80 billion company behind ChatGPT, develops large language models (LLMs) that produce text in response to prompts by “learning how words tend to appear in context with other words”. The “training” process through which these models learn to produce text draws on information provided by human users, information from licensed third parties, and “information that is publicly available on the internet”.

The last of these information sources has come under legal scrutiny. The New York Times alleges that OpenAI has used copyrighted news articles “without permission or payment”, meaning that its GPT models can “free-ride” on the Times’ investment and reproduce its journalism. For instance, GPT-4 was able to reproduce a 2019 article about New York City’s taxi industry almost verbatim with “minimal prompting”, even though OpenAI had no involvement in the original investigation.

Flaws in the LLMs’ outputs (or “hallucinations”) have also led to allegations that OpenAI has damaged the reputation of the New York Times’ journalism. The lawsuit details numerous examples of ChatGPT falsely attributing content to the Times: a prompt asking for a list of New York Times articles about the COVID-19 pandemic returned fabricated titles and hyperlinks that did not lead to real articles. The Times has requested damages estimated to be worth “billions” and the destruction of all LLMs that incorporate Times publications.

OpenAI published a response on January 8th 2024, saying that the lawsuit is “without merit”. The company alleges that prompts were “intentionally manipulated” to make the models reproduce old articles and that the examples of hallucination were “cherry-picked from many attempts”. It adds that “regurgitation” (i.e. word-for-word replication of text) is a bug that it has been trying to reduce.

This is the first time a major US media organisation has sued OpenAI, and the lawsuit threatens the legal foundations of the technology that powers OpenAI’s generative AI products. It is increasingly apparent that the future of services such as ChatGPT depends on developers’ ability to ensure that their information sources are secure in the long term and comply with data privacy and intellectual property law.

In December 2023, OpenAI struck a landmark agreement with German publishing giant Axel Springer, which will allow its LLMs to draw on content from the publisher’s numerous titles, including Business Insider and Politico. This kind of collaboration between generative AI developers and media platforms could change the way we receive information and provide a more interactive alternative to reading newspapers online. The outcome of the New York Times lawsuit may indicate whether the relationship between AI developers and publishers will be shaped more by such collaboration or by litigation.