Using Creative Commons images to train artificial intelligence

Whenever I have presented on artificial intelligence in the last few years, I have been asked whether training an AI with data can infringe copyright. Take for example Bot Dylan, a machine learning project that has been trained using 23,000 folk songs. Does the music it produces infringe copyright? The accepted answer is no: training an AI by making it listen to music is no different to a musician listening to various songs and being influenced by them (unless they cross some blurred lines, see what I did there?).

But the problem is often not the resulting output; it is having lawful access to the content that is scraped and analysed by the machine learning algorithm. Storing and using large amounts of data in this manner could indeed infringe copyright, which is why there are increasing calls for some sort of data mining exception to copyright. The UK already has one in place, and a similar exception is part of the upcoming Digital Single Market Directive.

This is becoming an increasingly important subject as tech companies try to join the AI race and large datasets become a hot commodity. There was quite an argument about machine learning tools and training datasets recently due to the Edmond de Belamy AI painting. On that occasion, a French artist collective called Obvious used pre-existing machine learning tools to generate a series of paintings, one of which was sold at auction for over $400k USD. The portraits used to train the AI were in the public domain, so the question of infringing those never arose, but there were arguments at the time that Obvious could be infringing copyright in the algorithms used. In my opinion, this was not the case, as all the tools had been released under open source licences.

We are currently witnessing another extremely interesting case, uncovered in a very thorough article for NBC News, which reports that millions of photographs posted to the picture-sharing site Flickr are being used without consent to train machine learning algorithms. Reporter Olivia Solon writes:

“The latest company to enter this territory was IBM, which in January released a collection of nearly a million photos that were taken from the photo hosting site Flickr and coded to describe the subjects’ appearance. IBM promoted the collection to researchers as a progressive step toward reducing bias in facial recognition.
But some of the photographers whose images were included in IBM’s dataset were surprised and disconcerted when NBC News told them that their photographs had been annotated with details including facial geometry and skin tone and may be used to develop facial recognition algorithms.”

Leaving aside ethical and privacy considerations, one of the most interesting questions raised by this report concerns the copyright implications of these practices. Can IBM do this legally? If so, how?

Firstly, let me start by stating that this is a rather complex area, so I will be over-simplifying quite a bit. If you want some more detail about the legal implications of data mining, this may be a good place to start. When we talk about training artificial intelligence, it is recognised that access to a good-sized dataset appropriate for the task is beneficial, which is why so many companies such as Google continue to provide services that give them access to vast troves of information. Data mining is one way of finding data with which to train the algorithms, but to do so researchers need access to the data, and this data could be proprietary.

Data can be anything that is the subject of the research: music, pictures, paintings, text, poetry, scientific literature, figures, drawings, sketches, etc. Data is not about an individual work; it is about the accumulated reading of a collection of works. So in order to analyse this information and turn it into something useful, there has to be a process that “reads” the data. There are lots of different processes and techniques, but these require the miner to copy the data, at least temporarily.

The legal situation of this type of access to data varies from one jurisdiction to another. In the US it has been argued that data mining falls under fair use as being transformative, and I tend to agree with that opinion (see also Authors Guild v Google). In the UK we have fair dealing for data mining for non-commercial uses, and other jurisdictions have adopted, or are thinking of adopting, similar measures (the DSM directive contains one such proposition, although heavily diluted). So in many circumstances, non-commercial data mining to train an AI will be legal. But as this is still a highly uncertain area of the law, and as many companies want to train neural networks for commercial purposes, those enterprises and researchers will want to use data that is either in the public domain or under a permissive licence, such as a Creative Commons licence.

This is precisely what IBM did. Flickr is a sharing site that is famous for being an early Creative Commons adopter, allowing its users to release their shared pictures under “some rights reserved” licences. For the most part, this didn’t mean much to the average user; my own photostream is under a CC licence, and has remained largely unnoticed (as far as I can tell). A few years ago, Flickr released 100 million pictures that had been shared on its website under CC licences. For machine learning researchers, this is a treasure trove because in theory it can be used and reused for commercial purposes without fear of infringement. IBM took this data and narrowed it down to 1 million pictures containing faces and annotations, and made it available to researchers as the “Diversity in Faces” dataset.

The legal question is whether IBM can do this, so we need to look at the licences in more detail. The source photographs have been shared using a range of CC licences. It is a good time to remind readers that there are six types of CC licence, ranging from the very permissive Attribution only (BY) to the most restrictive Attribution-NonCommercial-NoDerivatives licence (BY-NC-ND). CC-BY allows all sorts of reuses as long as the author receives attribution, while the more restrictive licences allow reuses only for non-commercial purposes, and in some instances they do not allow derivatives, or require derivatives to be shared under the same licence (share alike). Interestingly, the majority of the pictures shared in the Flickr dataset have a non-commercial restriction (about 66%, see source).
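As a rough illustration of how these licence elements combine, the sketch below encodes the six CC licence codes and checks whether a proposed reuse is commercial or derivative in a way the licence forbids. The names `LicenceTerms`, `terms_for` and `use_is_permitted` are hypothetical, invented for this example. The check is a deliberate simplification: it assumes attribution is given, does not verify share-alike compliance, and ignores differences between licence versions.

```python
from dataclasses import dataclass

# The six standard Creative Commons licence codes.
CC_LICENCES = ("BY", "BY-SA", "BY-ND", "BY-NC", "BY-NC-SA", "BY-NC-ND")


@dataclass(frozen=True)
class LicenceTerms:
    """Simplified view of what a CC licence allows (hypothetical helper type)."""
    code: str
    allows_commercial: bool   # False when the licence carries the NC element
    allows_derivatives: bool  # False when the licence carries the ND element
    share_alike: bool         # True when the licence carries the SA element


def terms_for(code: str) -> LicenceTerms:
    """Derive the simplified terms from a code such as 'BY-NC-ND' or 'CC-BY'."""
    normalised = code.upper()
    if normalised.startswith("CC-"):
        normalised = normalised[3:]
    if normalised not in CC_LICENCES:
        raise ValueError(f"Not one of the six CC licences: {code!r}")
    elements = normalised.split("-")
    return LicenceTerms(
        code=normalised,
        allows_commercial="NC" not in elements,
        allows_derivatives="ND" not in elements,
        share_alike="SA" in elements,
    )


def use_is_permitted(code: str, *, commercial: bool, derivative: bool) -> bool:
    """Assumes attribution is given; only the NC and ND elements are checked."""
    terms = terms_for(code)
    if commercial and not terms.allows_commercial:
        return False
    if derivative and not terms.allows_derivatives:
        return False
    return True


if __name__ == "__main__":
    # Which licences would allow a commercial reuse that is not a derivative?
    for licence_code in CC_LICENCES:
        allowed = use_is_permitted(licence_code, commercial=True, derivative=False)
        print(licence_code, allowed)
```

Under these assumptions, a picture under BY-NC-ND passes the check only for a non-commercial reuse that is not a derivative, which is the scenario the rest of the argument turns on.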

The Flickr dataset itself is not shared under a CC licence; it is accessible after signing up for an Amazon Web Services account and, most importantly, after agreeing to specific terms of use for this dataset. This makes a lot of sense, as older CC licences do not work well with databases and, more fundamentally, datasets as such may not be protected in some jurisdictions, including the US. Therefore, the better way to protect such data is through contract law, by imposing restrictions with terms of use. As these go, they are not onerous, and mostly seem to require attribution when re-using the data. For example, the ToU state:

“You may use the Dataset to review, analyze, summarize, interpret and create works from the Dataset. You may publish your observations, commentary, analyses, summaries, and interpretations of, and works from, the Dataset.”

So we are back to having to analyse whether using this dataset to train an AI would be in breach of the CC licence under which a user shared the work on Flickr. My initial answer is that such uses are permitted under the CC licence. The main purpose of the Creative Commons licences is to allow re-uses of a work with as few restrictions as possible, allowing the creation of a free culture environment where sharing benefits society. So when someone posts a picture under a CC licence, we have to assume that they are allowing further re-uses and mash-ups of their work. As an avid CC user, I like making my pictures available to the public, and I do not really think about potential downstream uses. I do impose a non-commercial restriction because I am bothered by my work being used for commercial purposes.

So let’s assume that one of my pictures is included in one of the datasets (I haven’t looked), and let’s also say that the picture has the most restrictive CC licence, BY-NC-ND. So nobody can re-use my picture for commercial purposes, they need to attribute me, and they cannot make modifications to the picture. The Flickr dataset fulfils those requirements: it is non-commercial, the picture is not being modified to create a derivative, and the metadata contains adequate attribution.

So now assume the picture is included in the IBM dataset and is used to train machine learning algorithms for face-recognition software, and let’s further assume that some of those uses are commercial. I would argue that the terms of my licence are still intact. Firstly, the IBM dataset is being offered for free to researchers, so the non-commercial element is maintained. The pictures appear to have maintained the metadata, so I am also being attributed (edit: there is a question of whether the metadata fulfils attribution). The problem starts if my picture is changed in such a way that it would be considered a derivative work, as this would go against the terms of my licence. There could also be a problem if my picture is used by a researcher for commercial purposes.

Great! As the licence is breached, I can sue the commercial researchers for copyright infringement. Riches beckon!

Not so fast. If we were talking about my individual picture, I may have a case, but this is not a single use. The picture is part of a large dataset of millions of images, and my lone picture is of no interest; what matters is the accumulation of images, which is where the value for machine learning resides. And if we take all the pictures as a whole, then the individual terms and conditions of each CC licence are less important. Researchers who have accessed the dataset legitimately, by complying with the terms and conditions set out by its creators, can reuse the pictures in any way those terms permit, and as mentioned above, these terms are quite permissive. Furthermore, the IBM and Flickr datasets fall under the protection of fair use in the US, and data mining exceptions where these exist, so researchers using them could do so freely, in some cases even for commercial purposes.

This opens up a new question, and that is whether an individual photographer whose picture has been made available in either dataset could sue for copyright infringement. I do not think so. To be able to claim individual infringement, there must be evidence that the image has been used in a way that is in breach of the licence, and this may not be the case: the mere inclusion of an image in a dataset is not a breach of the licence.

There is also the issue that the training of an AI does not constitute a derivative work on its own, and therefore the authors of individual pictures would have a very difficult time proving that the outputs of the training are a direct result of their individual images. Furthermore, the outputs of the data mining training may not be subject to any sort of protection in their own right, or the protection may be something other than copyright. If the machine learning algorithm is used to produce a portrait, then this could very well be in the public domain as it is not original (see more about this here). If it produces software, a database, a model, an algorithm, or any similar technical effect, the protection may be under trade secrets, patents, database right (in Europe), and even copyright. But the result may have no connection whatsoever with my own picture, and therefore it would be impossible to claim infringement.

These are my first ideas based on the facts presented by the investigation. I am currently writing about this very subject, so much of the law is still fresh in my head, but I would be curious to see what others think.

One thing is clear: data mining is becoming an increasingly important legal subject for copyright. Perhaps we are all obsessed with Article 13, when the most relevant article for the future is the one covering data mining (Article 3 for those interested in looking it up).

Edited to fix some awkward wording on transient copies.

3 Comments

nemobis · March 14, 2019 at 1:53 pm

Yes, but Article 3 was made completely useless, so in practice it will leave the status quo unchanged. The battle will continue in courts and national parliaments because the Commission was completely unable to deal with any of the serious questions of copyright in the modern age.