Two authors have filed a lawsuit against OpenAI over the development, training, and use of ChatGPT. The claims are direct copyright infringement, vicarious infringement, removal of copyright management information, unfair competition, negligence, and unjust enrichment (complaint here). This is the third lawsuit against AI producers from the same law firm (previous suits covered here and here), which has now filed practically half of all ongoing AI copyright lawsuits. Is there a case here?

Just as with the previous two cases, I have some misgivings. As always, I’m not a US lawyer, so please take that into consideration when forging ahead. Nobody knows for certain how a case will be decided, and looking at recent US Supreme Court cases, it’s futile to play oracle when it comes to such litigation, but it is possible to make an educated analysis of the strength of the case based on the initial filing. The parties will get a chance to respond and amend, and I for one hope that the claimants have something more, as I find the claims rather weak. I think that some of the problems stem, once again, from inaccuracies about the technology, so I’ll explain quickly how OpenAI trains the GPT model.

How is ChatGPT trained?

There is a lot of misunderstanding across the board about how large language models (LLMs) operate. People seem to think of them as sophisticated autocomplete apps, repositories of all human knowledge, or advanced thinking machines. In reality, LLMs are statistical language models: computer programs that have the ability to understand, analyse, manipulate, and even potentially generate human language in a manner that is meaningful to a human reader. This is done using different methods; OpenAI uses GPT (Generative Pre-trained Transformer), an architecture that enables the model to generate predictive text by processing words in relation to all other words in a sentence, rather than in a fixed order.

The “large” in large language models refers to the amount of data these models are trained on, and to the size of the neural network used. They are trained on enormous amounts of text data, which can range from books, articles, and websites to all sorts of other written materials. But the model doesn’t keep the actual training text; it consists of tokens, sequences of characters that are often found together in the corpus of training text, usually made up of about four letters or numbers. So a trained model like GPT-3 or GPT-4 contains information about which token is likely to follow another.
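As a toy illustration of the idea (this is not OpenAI’s actual tokenizer, which uses a learned byte-pair encoding; the four-character chunking and the corpus below are my own simplification), the following Python sketch chops text into short chunks and counts which chunk tends to follow which — the kind of statistic a trained model encodes, rather than the text itself:

```python
from collections import Counter, defaultdict

def chunk(text, size=4):
    """Naively split text into fixed-size character chunks.
    Real tokenizers (e.g. byte-pair encoding) learn variable-length tokens."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def follow_counts(corpus):
    """Count, for each chunk, how often each other chunk follows it."""
    counts = defaultdict(Counter)
    tokens = chunk(corpus)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

counts = follow_counts("the cat sat on the mat and the cat ran")
# The model-like object holds only chunk-to-chunk statistics:
print(counts["the "].most_common(1))  # → [('cat ', 1)]
```

Note that once `counts` is built, the original string is no longer needed; only the statistics survive.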

So in order to train the model there has to be a dataset that contains a large amount of text; it can be a single dataset or several. The text is then pre-processed, which involves cleaning the data, removing any unnecessary or inappropriate content, and converting the text into a format that the model can understand, usually numerical representations.
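A minimal sketch of that pre-processing step might look like this (the cleaning rules and word-level vocabulary here are illustrative assumptions, not the real pipeline, which works on sub-word tokens):

```python
import re

def preprocess(raw_texts):
    """Clean raw text and map each word to an integer id — a stand-in
    for the numerical representations a real pipeline produces."""
    cleaned = []
    for text in raw_texts:
        text = text.lower()
        text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation/markup
        text = re.sub(r"\s+", " ", text).strip()   # normalise whitespace
        cleaned.append(text)
    vocab = {}      # word -> integer id
    encoded = []    # each text as a list of ids
    for text in cleaned:
        encoded.append([vocab.setdefault(w, len(vocab)) for w in text.split()])
    return vocab, encoded

vocab, encoded = preprocess(["The cat hunts!", "A cat sleeps."])
print(encoded)  # → [[0, 1, 2], [3, 1, 4]]
```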

The model is then trained using a method known as supervised learning. For a language model like GPT, this involves taking a sentence from the training data, removing a word, and asking the model to predict what the missing word is based on the context provided by the other words. For example, given the sentence “The cat hunts a ___”, the model would need to learn to predict that a possible word for the blank could be “mouse”. The model’s parameters are adjusted to reduce the difference between its predictions and the actual data. Training involves running millions (or even billions) of these examples through the model, each time adjusting the model’s parameters a tiny bit in a way that makes it slightly better at predicting what comes next.
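Gradient descent over a neural network is beyond the scope of a blog post, but the prediction objective itself can be mimicked with simple counting. The sketch below is my own toy stand-in, not GPT’s actual training loop: it “trains” by counting word pairs, then predicts a likely fill-in for “The cat hunts a ___”:

```python
from collections import Counter, defaultdict

def train(sentences):
    """'Train' by counting which word follows which — a crude analogue of
    adjusting parameters to make next-word predictions more likely."""
    model = defaultdict(Counter)
    for s in sentences:
        words = s.lower().split()
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
    return model

def predict_next(model, word):
    """Return the most frequently observed follower of `word`."""
    followers = model[word.lower()]
    return followers.most_common(1)[0][0] if followers else None

corpus = [
    "the cat hunts a mouse",
    "a mouse fears the cat",
    "the dog chases a mouse",
]
model = train(corpus)
print(predict_next(model, "a"))  # → mouse
```

The real model does this over billions of examples and conditions on far more context than one preceding word, but the shape of the task is the same: predict what comes next.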

Evidently, the process is more complicated than this, but the above should give you a rough idea of what is involved; this blog post explains it in more detail. Edit: Something that needs to be stressed is that once trained, a model never needs to look back at the training data, which can easily be discarded with no copies made.

What matters here for the later legal analysis is that a model trainer such as OpenAI has to have access to a large amount of text, so where is that data coming from? OpenAI actually published some of the sources for GPT-3, but not for GPT-4. GPT-3 was trained mostly on text from the Web (85% of all training weights). This text comes from third-party datasets: Common Crawl, a large dataset of web-crawled text; OpenWebText2, which includes text from Reddit discussions; and Wikipedia. The other 15% comes from two book datasets named Books1 and Books2. Everyone seems to agree that Books1 is a dataset of public domain works collected by Project Gutenberg, and it accounts for 8% of the total training weights; the remaining 7% comes from Books2, and we don’t know what is in that dataset, or where it comes from. Books2 is where any in-copyright commercial books would potentially be found.
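Those reported proportions amount to a weighted sampling mixture. As a hedged sketch (the percentages are the ones reported for GPT-3; the sampling code is illustrative, not OpenAI’s), each training example is drawn from a source in proportion to its weight:

```python
import random
from collections import Counter

# Approximate training-mix weights reported for GPT-3
sources = {
    "Common Crawl / OpenWebText2 / Wikipedia": 0.85,
    "Books1 (public domain, Project Gutenberg)": 0.08,
    "Books2 (contents undisclosed)": 0.07,
}

def sample_source(rng=random):
    """Pick which dataset the next training example comes from,
    in proportion to its weight."""
    names = list(sources)
    weights = list(sources.values())
    return rng.choices(names, weights=weights, k=1)[0]

random.seed(0)
draws = Counter(sample_source() for _ in range(10_000))
print(draws)  # web content dominates; Books2 is a small sliver
```

The point for the legal analysis: even if a book sits in Books2, it is a tiny fraction of what the model sees.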

So is there copyright infringement? I’ll only be covering the first three claims related to copyright.

Direct copyright infringement

The claimants argue that books by the two authors were used by OpenAI in training ChatGPT, and that in order to do so, OpenAI must have made copies of the books. To prove that there was some copying, they offer as evidence that, when asked to summarise three of the claimants’ books, namely ‘The Cabin at the End of the World’, ’13 Ways of Looking at a Fat Girl’, and ‘Bunny’, ChatGPT was able to give some summaries, although they admit that in some instances “the summaries get some details wrong”.

So did OpenAI make copies of the works? Based on the information above, the answer is probably not. There is only one source the books could have come from, and that is Books2. There has been online speculation as to which dataset this is; the claimants argue that it is possibly a corpus of illegally copied books available on torrent websites, a so-called “shadow library”. No evidence for this claim is presented in the complaint. But perhaps most importantly, the books may not have been copied by OpenAI at all; they could have been copied by a third party. After all, OpenAI already uses entirely third-party datasets for the web-crawled content, so it did not make any of those reproductions itself. OpenAI could still be liable vicariously, but that is a separate question we will deal with later.

But regardless of the identity and source of the mysterious Books2, the case rests entirely on the assumption that the claimants’ books are actually found in that dataset, and this is a huge assumption. The fact that ChatGPT can provide a summary of some books is no evidence that the actual books were in the training data; it is perhaps more likely that the summaries come from the largest part of the dataset, the web content, particularly quotes, snippets, promotional material, and other online sources. In fact, ‘The Cabin at the End of the World’ has a Wikipedia entry with a detailed plot summary, and we know that 3% of all training weights come from the online encyclopedia. I was also able to find page after page of online summaries and entries for all three allegedly infringed books. ChatGPT could easily be aware of these books without actually having them in the training data.

This fact in itself would sow doubt about the direct infringement claim, but let’s assume for now that the books are indeed found in Books2, or in another undisclosed dataset. The claimants argue that “Plaintiffs never authorized OpenAI to make copies of their books, make derivative works, publicly display copies (or derivative works), or distribute copies (or derivative works).”

OpenAI did not copy the books, it is clearly not publicly displaying copies of the works, and neither is it distributing copies. So are there derivatives of the works? As in the other lawsuits, the claimants seem to be working under the theory that if a work is present in the training data, then all outputs produced are derived from the original. This argument makes no sense: with billions of tokens in the training data, under this theory anyone who ever posted on Reddit could claim that any output is an infringing derivative.

Moreover, ChatGPT is not producing an actionable derivative in any sense of the word; at most it is producing a summary, which is not a derivative. Otherwise millions of students around the world would be producing derivatives in their assigned book reports, and therefore infringing copyright.

Vicarious copyright infringement

Perhaps anticipating that the defendants will argue that they never copied the books and that any copying was done by a third party, the claimants argue that OpenAI has control over the outputs, and therefore benefited financially from their copyrighted works. As mentioned above, even if OpenAI didn’t copy the works, it may still be liable vicariously.

This could be an interesting argument if we entertain the possibility that copies of the books are indeed to be found in the dataset: even if OpenAI didn’t make a copy of the work itself, it could still be liable for secondary infringement for using someone else’s copy. The Pirate Bay, for example, doesn’t make copies of works, but it may still facilitate infringement by others.

I find this to be potentially the strongest of all the claims, even though the complaint doesn’t really dwell on this point. However, it all rests on the above assumption that the books are indeed in the dataset. If they are, then OpenAI could be infringing vicariously. But if that is so, I would expect OpenAI to mount a compelling fair use defence. While fair use law in the US seems to be a bit up in the air at the moment after Warhol, there appears to be agreement that this is a very fact-dependent decision. The four elements of fair use are:

  • the purpose and character of the use,
  • the nature of the copyrighted work,
  • the amount and substantiality of the portion taken, and
  • the effect of the use upon the potential market.

I cannot see how the possible inclusion of the books in the dataset would affect the books’ market; ChatGPT at most provides a summary of the books, the same as Wikipedia. If the books are indeed in the training data, their presence is negligible, and ChatGPT would work perfectly fine without them. It is on this point that I think visual artists have a stronger case with regards to market impact: they can make a much stronger argument that visual AI tools have a detrimental effect on their livelihoods, but individual authors do not have the same claim.

This is not to say that future lawsuits would not be successful on this point. I think that collecting societies, trade bodies, and other collective bodies could present stronger cases.

Edited: Brian L. Frye makes the interesting point that as a matter of policy, courts may be reluctant to destroy an entire technology that has significant non-infringing uses (the Sony doctrine).

Copyright management information

This has been a common feature of all three cases, and it bears some comment. In short, CMIs are identifying data on a work, such as copyright information, the author, and other identifying data. The claimants argue that:

“Without the authority of Plaintiffs and the Class, OpenAI copied the Plaintiffs’ Infringed Works and used them as training data for the OpenAI Language Models. By design, the training process does not preserve any CMI.”

This is a strange statement. A book dataset could contain all of the copyright management information; it’s just words, and the model doesn’t care which words it includes, only how many, so it is strange to assume that the works have had their CMI removed. Moreover, training on the dataset does not make copies of the work; it produces statistical models drawn from billions of words in the training data. It’s tokens. To argue that this infringes CMI is fanciful.

Including CMI here could be an attempt to take advantage of the fact that the DMCA provides statutory damages for each violation, which could bump up the damages. As it would be difficult to prove any actual damages incurred through either direct or secondary infringement, a DMCA violation through the removal of CMI could potentially make the entire lawsuit worthwhile, not least for the lawyers.

Concluding

We have no idea if this case will be successful, but it is an interesting development in the AI copyright wars. With lawsuits for code, art, and now literature, we’re only missing music and film. Perhaps the next to join the fray will be the music industry?

One thing is clear: my prediction of about 4-5 years of lawsuits continues to be on track. Who will be next?


7 Comments

Lilian Edwards · July 8, 2023 at 10:40 pm

Two things I keep being vaguely surprised haven’t come up. (And yeah, I’m barely a copyright lawyer and certainly not a US one, so the usual caveats apply.)

One, attribution. Yes, the US doesn’t do moral rights, but if the publicly scraped training sets (Books1 perhaps) contain works made public under CC licences, then doesn’t that give the author a right to enforce it where the work is used without any such attribution? Quite surprised no CC people have weighed in on this (we might be discussing this next week?)

Two, as you say, P2P generated a lot of theories of how to get at actors for copyright infringement even when they didn’t actually host or distribute the copyright work. The most desperate of these was inducement of copyright infringement in Grokster. I’d have to reread it, but it sounds tailor-made for gen AI. Though as you say, the issue of stifling non-infringing content generation would be germane.

Just some thoughts!

Any idea, btw, why this firm is so far ahead of the other ambulance chasers on this bandwagon??

    Andres Guadamuz · July 8, 2023 at 11:37 pm

    I think the CC arguments could be interesting. Two things would stop this from becoming a thing. Firstly, nothing in the CC licences contradicts fair use or fair dealing, so if training is fair dealing, then the licence doesn’t kick in. The second part is that the licence terms only apply if there is a derivative, and as mentioned, I don’t think that training produces a derivative 🙂

Just Some Citizen · September 21, 2023 at 5:40 pm

Perhaps this is sidetracking a bit, but I can’t help but feel dismayed at how everyone, including lawyers (especially lawyers), seems to have got accustomed to the idea that the outcome of these kinds of suits and trials is basically a coin toss. Even if this is for legitimate reasons, should a primary characteristic of any decent legal system not be its determinacy? I mean, if even professionals of the law can’t conclusively decide whether a given conduct infringes or not, what hope is there for the commoner? What good is any legal system that does not allow its users (i.e., citizens) a high degree of certainty that some task they intend to undertake is free from legal risk?
