Back to the future.

Following the recent decision in GEMA v OpenAI, and last year’s Kneschke v LAION, we now have two German courts grappling with the applicability of the text and data mining exceptions to AI training. Despite arriving at different outcomes (LAION won, OpenAI lost), both decisions share something important: they confirm that the TDM exception in Article 4 of the DSM Directive applies to AI training. Yet a legal meme has continued to prevail in certain circles: the belief that the exception was never meant for this purpose. I think that this view is no longer tenable, and it is time to explain why.

The ‘not what the legislature intended’ argument

The most common objection goes something like this: when the DSM Directive was negotiated between 2016 and 2019, the legislators could not have anticipated generative AI in its current form, and therefore the TDM exception should not apply to it. This argument appears in various submissions from rightsholder groups urging the EU to ‘cure’ what they called violations of the Berne Convention.

I have to admit that I may have helped to get this meme started. In a 2023 paper I wrote: “The existing exceptions were drafted with very specific type of data mining in place, with the fight against disease and the development of new medicines being cited repeatedly to justify these exceptions.” Other papers later cited this passage as if I were claiming that this was the only reason the TDM exception was passed. I never doubted that the TDM exception would apply to AI training; to me this was absolutely evident. What I did not foresee was the level of opposition that some in the copyright industries would mount against the exceptions.

The problem with this argument is that it is historically inaccurate. Legislators were well aware of AI and machine learning when drafting the DSM. The European Commission’s Digital Single Market Strategy, which gave birth to the Directive, explicitly emphasised the importance of AI. When the Council moved to introduce Article 4 in 2018, the explanatory documents spoke of ‘unleashing the potential of artificial intelligence, Big Data and innovation.’ The European Parliament’s press release upon adoption of the Directive stated that the TDM provisions were introduced ‘in order to contribute to the development of data analytics and artificial intelligence.’

But let us go even further back. The UK legislated for computer-generated works in 1988. WIPO was discussing AI-generated content at its Stanford symposium in 1991. By 2015 generative AI already existed, a subject I was writing about at the time. The Next Rembrandt and Edmond de Belamy were already famous in 2018. The notion that European legislators in 2019 were somehow oblivious to AI is, to put it charitably, difficult to sustain. This was hardly an obscure technical concern lurking in the shadows.

The definition does the heavy lifting

We can also look at what Article 2(2) of the DSM actually says: text and data mining means ‘any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.’ That is remarkably broad language. The word ‘any’ is doing significant work here, as is the phrase ‘but is not limited to.’

Training a machine learning model is, at its core, an automated analytical technique that analyses data to extract patterns and correlations. That is literally what neural networks do. The weights and parameters of a model represent statistical relationships learned from the training data. Whether you call it ‘deep learning’ or ‘text and data mining,’ the underlying process involves automated analysis of data to generate information about patterns. The semantic hairsplitting that tries to distinguish TDM from AI training often collapses upon closer inspection.
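To make the point concrete, here is a deliberately trivial sketch (my own illustration, not anyone’s actual training pipeline): “training” a one-parameter model by gradient descent is nothing more than automated analysis of data that extracts a correlation and stores it as a weight.

```python
# Toy illustration: fit a single weight w so that w*x approximates y.
# The "model" ends up encoding the statistical relationship in the data
# (here, y = 2x), which is exactly what a TDM-style analysis extracts.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y is perfectly correlated with x

w = 0.0    # model parameter, initially uninformative
lr = 0.01  # learning rate
for _ in range(2000):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges to ~2.0: the pattern learned from the data
```

Scale this up by billions of parameters and you have a neural network, but the character of the operation, automated extraction of patterns and correlations from data, is the same.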

GEMA and LAION

Both the Hamburg court in LAION and the Munich court in GEMA recognised this. The Hamburg court explicitly rejected the argument that TDM could not apply because it was not contemplated at the time of enactment. The Munich court went further, stating that ‘language models such as the models at issue generally fall within the scope of the limitations on text and data mining’ and that ‘the EU legislator was aware of the use of data for the purpose of training models.’

Some might point to the different results in LAION and GEMA as evidence of legal uncertainty. I would argue the opposite: the cases are consistent in their core reasoning about TDM, and the different outcomes reflect different factual circumstances rather than contradictory legal interpretations.

In LAION, the Hamburg court found that a non-profit organisation creating a dataset for AI training fell within the TDM exception for scientific research under Article 3. The court held that downloading images to verify their descriptions constituted TDM. Importantly, the court also provided extensive obiter dicta suggesting that the commercial TDM exception in Article 4 would likewise apply to AI training, subject to valid opt-outs.

In GEMA, the Munich court agreed that language models fall within the scope of TDM. Where GEMA succeeded was not in arguing that TDM does not apply at all, but rather that the exception has limits. The court drew a distinction between reproductions necessary for analysis (covered by TDM) and reproductions arising from memorisation in the model (not covered). This is a crucial distinction. The court found infringement not because training itself was unlawful, but because the model memorised and could reproduce the training data. I have written elsewhere about why I disagree with this specific part of the ruling, but it is remarkable that the court found that training was covered by the TDM exception.

This brings us to what I consider the correct analytical framework: the dividing line should be at the outputs, not the inputs. Training a model on copyrighted data is permitted under TDM. But if that model subsequently reproduces those works, if memorisation leads to infringing outputs, that is where liability arises. The TDM exception covers the analytical process, not the regurgitation of training data.

The AI Act settles this

If there were any remaining doubt about legislative intent, the AI Act should put the matter to rest. Article 53(1)(c) requires providers of general-purpose AI models to ‘put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790.’

Read that again. The AI Act explicitly references Article 4 of the DSM Directive in the context of training general-purpose AI models. Recital 105 of the AI Act is even more explicit: ‘Text and data mining techniques may be used extensively in this context for the retrieval and analysis of such content, which may be protected by copyright and related rights.’ The EU legislature has unambiguously confirmed that TDM, including the commercial exception in Article 4, applies to AI training.

The Hamburg court in LAION relied on this connection, noting that Article 53 of the AI Act ‘unequivocally’ demonstrated the EU legislator’s intent. It is hard to argue that the TDM exception was not meant for AI when subsequent legislation explicitly assumes that it is.

Where the real debate lies

None of this is to say that there are not genuine legal questions remaining. The opt-out mechanism under Article 4(3) is still being interpreted. What counts as ‘machine-readable’? The Hamburg court suggested that natural language terms of service might suffice if AI technology can understand them, which strikes me as both pragmatically sensible and doctrinally questionable. The Munich court was more receptive to natural language opt-outs. We are likely heading toward clarification, either through the CJEU (the Like Company v Google reference from Hungary is now pending) or through the AI Act’s Code of Practice. There is also the Voss draft report, which I may write about in the coming year, so stay tuned.
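For readers wondering what a genuinely ‘machine-readable’ reservation might look like in practice, here is a hedged sketch loosely modelled on the W3C TDM Reservation Protocol (TDMRep), under which a site can publish a `/.well-known/tdmrep.json` file. The file contents and the `mining_reserved` helper below are my own illustrative assumptions, not the protocol’s normative definition.

```python
import json

# Hypothetical /.well-known/tdmrep.json, loosely following the TDMRep draft:
# each rule attaches a tdm-reservation flag (1 = rights reserved) to a
# location prefix, optionally pointing to a licensing policy.
TDMREP_FILE = """
[
  {"location": "/images/",
   "tdm-reservation": 1,
   "tdm-policy": "https://example.com/tdm-licence.json"},
  {"location": "/blog/",
   "tdm-reservation": 0}
]
"""

def mining_reserved(path: str, rules: list) -> bool:
    """Return True if TDM rights are reserved for this path (illustrative)."""
    matches = [r for r in rules if path.startswith(r["location"])]
    if not matches:
        return False  # no reservation expressed for this path
    # apply the most specific (longest) matching location prefix
    best = max(matches, key=lambda r: len(r["location"]))
    return best.get("tdm-reservation", 0) == 1

rules = json.loads(TDMREP_FILE)
print(mining_reserved("/images/photo1.jpg", rules))  # True: rights reserved
print(mining_reserved("/blog/post.html", rules))     # False: mining permitted
```

The contrast with natural language terms of service is the point: a crawler can evaluate a structured signal like this without any interpretive step, which is precisely what the courts are now being asked to decide Article 4(3) requires.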

There are also legitimate debates about the scope of the exception, whether it covers downstream acts beyond the initial reproduction, whether making datasets available to third parties is captured, and how the three-step test applies. These are worthy questions that deserve serious engagement.

But the foundational claim that TDM was never meant for AI training? That ship has sailed. The legislative history shows awareness, the statutory text is broad enough to encompass it, two German courts have confirmed applicability, and the AI Act has put the matter beyond reasonable doubt. Critics can argue about the wisdom of this policy choice, but they cannot credibly claim it was not a choice at all.

Concluding

The persistence of the ‘TDM does not cover AI’ argument reminds me of the early days of internet copyright debates, when rightsholders insisted that caching and indexing could not possibly fall within existing exceptions. Eventually, the law caught up with the obvious reading. We are watching the same process unfold with AI.

The more productive conversation is about where the limits lie. GEMA points to one sensible boundary: training is fine, but outputs that reproduce protected works are not. This preserves the innovative potential of AI while protecting against the specific harm that copyright is designed to prevent, namely, the unauthorised reproduction and distribution of creative works. Models that do not memorise, or that have effective guardrails against reproduction, should be in the clear. Models that function as sophisticated plagiarism machines should not.

Meanwhile, we will continue to see strategic litigation from rightsholder groups, much of it aimed more at extracting licensing fees than at establishing coherent legal principles. I think that the weight of evidence points clearly in one direction. The TDM exception applies to AI training. Time to move on to the harder questions.

In the meantime, writing this blog post has reminded me of the Deep Dream art style; I really liked it. Unless it’s the shoggoth looking at us through the machine… the eyes, the eyes!

