Is a popular machine learning text tool trained using copyrighted material?


While most of the writing on artificial intelligence and copyright in recent years has centred on the subject of authorship, perhaps the most important aspect of the question is that of copyright infringement. What happens when you train machine learning algorithms with works that are protected by copyright? Do any exceptions apply? Is it even infringement?

The latest test of this question comes to us through the popular text machine learning tool GPT-2, “a large-scale unsupervised language model which generates coherent paragraphs of text”. GPT-2 was developed by OpenAI, a non-profit organisation funded by various tech companies, which produces free software artificial intelligence tools for the “benefit of humanity”. Some of their other projects include gym, a toolkit for developing and comparing reinforcement learning algorithms; and OpenAI Five, an AI DOTA2 team.

GPT-2 is a tool that quite simply predicts the likelihood of what the next word in a sentence will be (you can test a limited version here). The full model has 1.5 billion parameters and was trained on 40GB of Internet text, a dataset of 8 million web pages. What sets GPT-2 apart is the quality of the content used to train the machine. They relied on humans to curate the text by using outgoing Reddit links, which intentionally acts as a quality filter. The project explains:

“We created a new dataset which emphasizes diversity of content, by scraping content from the Internet. In order to preserve document quality, we used only pages which have been curated/filtered by humans—specifically, we used outbound links from Reddit which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl.”
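The heuristic described in the quote can be sketched in a few lines of Python. This is only an illustration and not OpenAI's actual pipeline: the `Submission` record, its field names, the sample data, and the de-duplication step are all assumptions made for the example. The only detail taken from the quote is the threshold of at least 3 karma.

```python
from dataclasses import dataclass

MIN_KARMA = 3  # threshold taken from the quote: "at least 3 karma"


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for a Reddit post that links to an external page."""
    url: str
    karma: int


def select_urls(submissions, min_karma=MIN_KARMA):
    """Return outbound URLs whose post reached min_karma.

    Duplicates are dropped and first-seen order is kept; the
    de-duplication is an assumption, not something the quote states.
    """
    seen = set()
    selected = []
    for sub in submissions:
        if sub.karma >= min_karma and sub.url not in seen:
            seen.add(sub.url)
            selected.append(sub.url)
    return selected


# Made-up sample data, for illustration only.
sample = [
    Submission("https://example.com/essay", 12),
    Submission("https://example.com/spam", 1),
    Submission("https://example.org/tutorial", 3),
    Submission("https://example.com/essay", 40),  # same link posted twice
]

print(select_urls(sample))
```

Note that the filter only decides which pages are worth collecting. It says nothing about whether those pages are protected by copyright, which is where the questions below begin.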

Very clever use of human-curated content to ensure quality! However, the full model has not been released to the public; only three smaller versions have been released, the largest comprising 774 million parameters. The reason for not releasing the full model so far is a fear of abuse. AI text experiments are often hijacked by trolls: a Microsoft chatbot called Tay.ai was famously transformed into a racist bot by Twitter users.

But now questions have emerged about the privacy and copyright status of the content in the published datasets. In an interesting Reddit discussion, users commented that there could be potentially infringing content in the datasets, including both personal information and copyrighted content. We will leave out the data protection issue, but the copyright question is intriguing. The original poster writes:

“Did I mention the DMCA already? This is because my exploration also suggests that GPT-2 has been trained on copyrighted data, raising further legal implications. Here are a few fun prompts to try:

Copyright

This material copyright

All rights reserved

This article originally appeared

Do not reproduce without permission”

This is a fascinating issue, and I believe it is at the heart of our rush to train machines. Is it copyright infringement to train a machine learning algorithm with content scraped from the Internet, and if so, is the publication of such a dataset also an infringement? If there is infringement, are there any defences? What about the text resulting from the trained AI, could any one author claim infringement?

All of these are separate questions that require individual analysis.

1. Is it copyright infringement to train a machine using scraped content from the internet?

In principle, I think that the answer is yes (my thinking has evolved on this). The way in which the process is described in the technical paper is that a web crawler was used to extract the HTML code from 8 million web pages, and then the text was extracted from that code to produce the full dataset. This means that an actual copy of the text from each webpage was made. Copying is an exclusive right of the author, and this copy is permanent. Similarly, the fact that this is being published and made available to the public could also constitute infringement.
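To make the copying step concrete, here is a minimal sketch of extracting text from HTML using only Python's standard library. It is not the extractor used for GPT-2, whose details are not given here; the class name, the helper function, and the sample page are hypothetical. What it illustrates is that the output is a verbatim copy of the page's visible text, which is the act that matters for the infringement analysis.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page, for illustration only.
page = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>All rights reserved.</p>"
    "<script>var x = 1;</script>"
    "<p>Do not reproduce without permission.</p>"
    "</body></html>"
)

print(extract_text(page))
```

The markup is discarded, but the author's words survive intact, copyright notice included.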

2. Are there any defences?

Assuming that there is indeed infringement taking place, are there any defences? There are indeed.

The first is a practical consideration. The texts being copied are not being used for their literary value, the way in which they convey information, or any other such reason. The actual text is irrelevant; it is the order of the words that matters, and not just the order in an individual website, but the accumulation of millions of such works. So while on paper there could be infringement of individual works, it is unlikely that a plaintiff could prove that any damage has occurred. There is little or no incentive for individual authors to bring an action in this situation.

Secondly, the use could be either fair use or fair dealing in some jurisdictions. In the USA, this could fall under fair use, as the Google Books case suggests. In the UK, there is a data mining exception for someone who is carrying out a computational analysis of a text for non-commercial research purposes. In order for this to apply, the person performing the copying has to have “lawful access to the work”; one could argue that reading information online amounts to lawful access. The question then is whether the copying falls under research and non-commercial use, and for me the answer is also positive: OpenAI acts as a non-profit organisation, and they release the works for research purposes. The new Digital Single Market Directive also has a broader text and data mining provision.

So while there may be infringement, it is very unlikely that this could be litigated individually, and in a growing number of countries this use would be permitted anyway.

3. Could an author claim infringement over any generated text?

Something that will start coming up more and more with regard to AI and creative works is the question of whether any works produced using machine learning tools would be infringing copyright. Here I think that the answer is no, but the reason may need some unpacking.

I used the GPT-2 explorer tool to generate the following text:

“Technollama is a very good example of a good example of a good example of a good example of a good example of a good example”.

It went into a loop. Not the most fascinating stuff. Choosing a different path, I got:

“Technollama is a free online tool that allows you to create your own custom games.”

Nice. But is this text in some way derived from the original texts contained in the dataset? Obviously not. In order for a derivative work to be infringing copyright, the following three acts have to take place:

1. The potential infringer performs an exclusive act of the owner.
2. The resulting work is derived from the copyright work, in other words, there is a causal connection.
3. The work, or a substantial part of the work, has been copied.

Assuming that the work has been copied and published, step 1 has taken place, and there is some causal connection in the sense that the text was used to train a machine, but it is difficult to see how the substantiality requirement would be met. Quite simply, the resulting works do not carry a substantial enough part of any original to be considered infringing. One could argue that each work's contribution is 1 in 8 million in the case of GPT-2, not enough by any stretch of the imagination.

Concluding

It is difficult to see these questions going away soon. While I believe that there is some form of infringement in the case of GPT-2, I cannot see how this would result in any legal action against OpenAI, and even if there is any litigation, it would probably fall under fair use or fair dealing in many countries.

But I can also see how in the future some copyright owner will test things in court. In the meantime, I, for one, welcome our new pirate robot overlords.

[Thanks to the ever-vigilant Tristan Henderson for directing me to this thread]

4 Comments

Rita Matulionyte · March 10, 2020 at 10:07 pm

A very interesting topic indeed! Do you think the same conclusion applies in countries that do not have fair use or fair dealing? Or where there is no research element but rather a clearly commercial application?