An AI Lion, or a LAION.

A fascinating case that could test the limits of several exceptions under EU copyright law is developing right now in Germany. Photographer Robert Kneschke used the website “Have I Been Trained?” to find out if his images were in the LAION 5b dataset. He found, in his own words, “heaps of images from my portfolio”, including watermarked images that he had uploaded to the repository site Shutterstock. Having found them, he sent a letter of complaint to LAION, asking them to remove his images from the dataset.

The problem is that LAION does not store images. Its dataset consists of links to images found on the Internet, as well as ALT text data, and a few other parameters such as similarity, likelihood of containing a watermark, and NSFW status. The lawyers for LAION replied that there were no images to remove. Mr Kneschke insisted with a cease-and-desist letter, and he obtained another reply from the lawyers that read:

“By letter dated February 14, 2023, we have already pointed out to your client that our client is entitled to claims for damages in accordance with Section 97a (4) UrhG in the event of an unjustified claim. At the time, our client had refrained from asserting this claim but is now unable to continue to be lenient here. They incurred legal fees for defending against the obviously unjustified warning you issued, which our client will not bear themselves.”

This letter was accompanied by a receipt from the law firm for €887. This was the straw that broke the camel’s back, and so Mr. Kneschke sued for copyright infringement.

Needless to say, LAION’s response appears to be disproportionate and likely led to their current predicament. Asking a photographer for money seems to be a bad strategic move, even if it could be justified under German copyright law, which is designed to stop unjustified copyright claims, copyright trolling, and other similar nuisances. But is LAION justified in their assessment that this is a baseless copyright claim?

It depends. Buckle your seatbelt Dorothy, ’cause Kansas is going bye-bye!

The LAION 5b dataset

So there are several layers here. The first is that LAION is correct about one thing: it holds no images in its 5b dataset, so there are no images to remove, and it couldn’t respond positively to Mr Kneschke’s request even if it wanted to. On that view the legal issues would be quite simple, and the claim that his warning was unjustified is, well, justified. If we take this at face value, the only way LAION could be infringing is if the links contained in the database amounted to a communication to the public under Art 3(1) of the InfoSoc Directive. This is a right which covers things like linking, framing, and embedding content.

The case law in this area is fiendishly complicated, and sometimes even contradictory. The main principle, which has been in existence since the landmark case of Svensson, is that the right is only infringed by someone linking to content that is communicated to a “new public”. So me linking to things that have already been made available to the public by the rightsholder is not an infringement. Conversely, if someone uses a peer-to-peer network to share content via a link that has not been made available to a wider public, then that is infringement (see cases here and here). The question here is whether a database of 5 billion links collected from the Internet would be a communication to the public, and personally I just don’t see it. There are cases in which a link can clearly be a communication to the public, but these tend to be rare and very specific (see GS Media).

However, I don’t think that this case will rest on the argument of whether the links are a communication to the public for the reasons that I’ll discuss next.

Data collection and training

While I haven’t been able to find the complaint in this case, Mr Kneschke seems to have shifted his argument from the LAION 5b database, to the actual collection of data, and potentially the training of models using his images. In a blog post explaining the case, as well as in subsequent interviews, the claimant has been insisting that LAION copied his images at some point during the collection of the dataset, so even if the images are no longer present, there was a reproduction taking place. He uses LAION’s own explanation of the collection process, which reads like this (with accompanying flowchart).

“The acquisition pipeline follows the flowchart above and can be split into three major components:

  • Distributed processing of petabyte-scale Common Crawl dataset, which produces a collection of matching URLs and captions (preprocessing phase)

  • The distributed download of images based on shuffled data to pick a correct distribution of URLs, to avoid too heavy request loads on single websites

  • Few GPU node post-processing of the data, which is much lighter and can be run in a few days, producing the final dataset.”

In particular, LAION has acknowledged that it downloaded images at some stage for information retrieval:

“We download the raw images from the parsed URLs with asynchronous requests using Trio and Asks libraries in order to maximize all resources usage: vCPUs, RAM and bandwidth. We found that a single node in the cloud with 1-2 vCPUs, 0.5-1GB RAM and 5-10Mbps download bandwidth is inexpensive enough to allow downloading on a limited budget. Such a unit can process 10000 links in about 10-15 minutes. Each batch consisted of 10000 links taken from the Postgresql server by using the TABLESAMPLE technique, ensuring that the distribution among the 10000 links was following the distribution of the existing 500M records available on the database. We found that the distribution is still good when in the database are still above 20M records to be processed given that we had some 300 downloading workers at any time. The above techniques allowed both maximizing downloading speed and minimizing IP reputation damages.”
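To make the legal question more concrete, here is a minimal sketch of what “download, evaluate, keep only the link” could look like in code. It is not LAION’s actual pipeline: the `Record` class, the `build_records` function, and the `fetch` callable are hypothetical names of my own, and the SHA-256 fingerprint is only a stand-in for whatever the real pipeline computes (similarity scores, watermark likelihood, and so on). A fake in-memory “web” replaces the network so the example runs on its own.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    """One dataset row: a link plus metadata, with no pixel data."""
    url: str
    caption: str
    sha256: str      # stand-in for features derived from the image
    size_bytes: int


def build_records(
    pairs: Iterable[Tuple[str, str]],
    fetch: Callable[[str], bytes],
) -> List[Record]:
    """Fetch each image, derive metadata from it, and keep only the metadata."""
    records = []
    for url, caption in pairs:
        data = fetch(url)  # at this point a copy of the image exists in memory
        records.append(
            Record(url, caption, hashlib.sha256(data).hexdigest(), len(data))
        )
        del data  # the bytes are dropped; only the derived values are kept
    return records


# Hypothetical stand-in for the Internet, so no network access is needed.
fake_web = {"https://example.com/cat.jpg": b"\xff\xd8fake-jpeg-bytes"}
records = build_records(
    [("https://example.com/cat.jpg", "a cat")],
    fake_web.__getitem__,
)
print(records[0].url, records[0].size_bytes)
```

The point of the sketch is the line where `data` is assigned: a reproduction exists, however briefly, even though the resulting `Record` contains only a URL and derived values. Whether that brief copy is an infringement is the question the exceptions below have to answer.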

LAION’s own lawyers admitted to some copying in one of the response letters:

“Our client only found image files on the Internet for the initial training of a self-learning algorithm using so-called crawlers and briefly recorded and evaluated them to obtain information.”

Case closed then, LAION admitted to infringing copyright, right? Not so fast…

Exceptions

So taking for granted that some reproduction took place at some point, is there copyright infringement? Maybe not. Here is where LAION will make the point that their actions fall under exceptions and limitations, particularly temporary or transient copying (Art 5(1) Infosoc Directive), and the text and data mining (TDM) exceptions in Arts 3 and 4 of the DSM Directive. I deal with these exceptions in more detail in my upcoming article on the subject, so I’ll just deal with them briefly.

With regard to temporary copies, copyright is not infringed in the making of a temporary copy that is transient or incidental if this copy is an integral part of a technological process, its sole purpose is to enable either the transmission of the work or a lawful use, and the temporary copy does not have independent economic significance. It is unclear if all of these requirements will be met by LAION, but at least to me, there’s a pretty good chance that they are. The Internet is a huge infringement machine; we’re making transient copies of things all the time, so this exception has become quite powerful. We would not be able to exist online without it. It will be interesting to see, if this ever gets to court, whether a copy made in order to collect data from an image counts as a temporary copy for the purposes of this exception.

But even if it is not, LAION can also rely on the TDM exception. The EU adopted its own TDM exceptions in 2019 as part of the Digital Single Market (DSM) Directive. The Directive deals with a wide range of digital copyright issues and implements two different exceptions with regard to data mining that are directly relevant to AI training. In Article 3, the Directive sets out a new exception for copyright for “reproductions and extractions made by research organizations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access.” Article 4 provides the biggest change, as it extends this exception to commercial organizations for reproduction and extraction, if they have lawful access to the work, and rightsholders have not reserved their rights out of this exception.
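As an illustration of what a machine-readable reservation of rights might look like from the crawler’s side, here is a minimal sketch using Python’s standard `urllib.robotparser` module. Treating a `robots.txt` rule as a valid reservation under Article 4 is an assumption on my part, not settled law, and the crawler name `example-tdm-bot`, the helper `may_mine`, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt in which a photographer reserves the portfolio
# folder against one named crawler and leaves everything else open.
ROBOTS = """\
User-agent: example-tdm-bot
Disallow: /portfolio/

User-agent: *
Disallow:
"""

portfolio = "https://example.com/portfolio/photo.jpg"
blog = "https://example.com/blog/post.html"

print(may_mine(ROBOTS, "example-tdm-bot", portfolio))  # reserved for this bot
print(may_mine(ROBOTS, "example-tdm-bot", blog))       # not reserved
print(may_mine(ROBOTS, "other-bot", portfolio))        # no rule for this bot
```

A check like this only tells the crawler what the site says at the moment of the crawl, which is why the timing of the opt-out matters so much in this case.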

Mr. Kneschke has argued in his interviews that LAION does not fall under Article 3 because it receives money from commercial institutions and is therefore not a non-profit (sorry, double negative). This is likely to be at the heart of the case. If they were not a non-commercial institution, then their actions would be commercial, in which case Article 4 applies, and they would have to comply with opt-out requests. So, the claimant would have a valid legal argument that LAION did not comply with his request.

But here is the tricky part. LAION collected the data (and possibly made temporary copies) before Mr. Kneschke objected to the collection. Is the provision in Article 4 retroactive? As mentioned before, LAION is correct in the fact that they hold no copies of his works that can be removed, but the claimant is correct in the fact that there may have been a reproduction at some point.

But any reproduction would have taken place BEFORE he objected to it, so his opt-out would only apply to future databases. So even if we concede that: 1) the claimant is not pursuing the existing dataset, but its training; 2) there’s no temporary copying exception; 3) Art 3 doesn’t apply to LAION; 4) LAION must comply with opt-outs… all of these would not apply retroactively, only to future training. So the claim would lose in a court of law. But as always, I could be wrong.

Moreover, LAION’s database is open source, so there may be countless copies of it circulating online. Could LAION be liable for all the copies which still contain links to the allegedly infringing images?

Concluding

I have to admit that my initial reading of the case was dismissive of Mr. Kneschke’s claims, but the more I have looked into it, the more I think that this could go all the way. Speaking selfishly as an academic interested in this area of the law, I hope it does; we could get a few important explorations of difficult legal questions that are still open. Is training covered by the transient copy exception? If not, what is the reach of Articles 3 and 4? What is a research institution? When does the opt-out kick in? Something else I’d like to see discussed is damages. What damages is the claimant asking for?

But perhaps most importantly, if someone comes asking you to remove some links from a database, it may be a good idea to just do it. You may lose €800, but win the world.


6 Comments

TobiasMJ · May 19, 2023 at 5:00 pm

Thanks for the breakdown. Could LAION not accommodate the German photographer’s request by removing the URL, text description, etc. associated with his images from the dataset? Or is this not technically feasible? Seems like it would have been an easy way to solve the conflict.

    Andres Guadamuz · June 1, 2023 at 11:24 pm

    I believe it could be done, it’s just an entry in a database.

Anon · June 21, 2023 at 3:02 am

You left out Article 2 which defines what a research institute is.

A research institute can receive funds, quote: “in such a way that the access to the results generated by such scientific research cannot be enjoyed on a preferential basis by an undertaking that exercises a decisive influence upon such organization;”

We know there is no preferential treatment, the results are released for everyone at the same time. We don’t know about influence, but since it’s part of a university project all its files should be open for request and the PhDs at LAION probably know that and wouldn’t jeopardize their tenure by doing something shady that is not according to the law. I’m not denying that it could happen, I’m just very doubtful.

The problem with the case of the photographer is, he sent his lawyers after them. This probably made it impossible for them to react in a manner where they wouldn’t admit guilt by removing his links from the dataset.

If he requested it through the website “haveIbeentrained” which is the official website LAION accepts requests from to take down links, this problem wouldn’t have existed.

Another thing you forgot from Article 4:

The opt-out needs to be machine-readable. He CAN NOT opt-out by requesting it from LAION. He needs to have it in a machine-readable way on the website or in his pictures.

    Andres Guadamuz · June 21, 2023 at 9:51 pm

    Point taken on Art 2, also German legislation contains a detailed definition of non-commercial, I personally think that LAION is a research institution, we’ll see. I disagree that Art 4 requires that the opt out must be machine readable, the wording is that the opt-out must be “in an appropriate manner, such as machine-readable means in the case of content made publicly available online.” We’ll have to see if this means that it must, but the wording is conditional.

    I do think that the photographer will lose the case, I just think that LAION could have saved itself some trouble just removing some links.
