Persistence of memorization

We can remember it for you wholesale.

We’ve now had two recent decisions, in the UK and Germany, very close together that have dealt with AI and copyright. I won’t go into detail again on the rulings (you can read the previous posts discussing them here and here), but something arose in those cases that prompted me to write this blog post: the question of memorization, which was dealt with quite differently by each court.

I’ve mentioned memorization repeatedly here and elsewhere, but for completeness here’s another explanation. In machine learning, memorization refers to a model storing and reproducing specific training examples rather than learning the underlying patterns or rules that generalise to new data. A memorizing model essentially “remembers” its training data verbatim instead of extracting useful features. For example, a memorizing image classifier might only recognise the exact photos it was trained on, and a memorizing image generator might only reproduce them. Memorization is closely related to overfitting, which occurs when a model fits the training data too closely, capturing not only the underlying patterns but also the noise and accidental correlations present in that specific dataset. As a result, the model performs extremely well on the data it has seen but generalises poorly to new, unseen examples. In practice, an over-fit model is one that has memorized rather than learned.
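To make the distinction concrete, here is a deliberately tiny sketch in Python. It is a toy under stated assumptions, not a description of how any real neural network works: the task (label a number as True when it is 50 or more), the `Memorizer` and `ThresholdLearner` classes, and the `accuracy` helper are all hypothetical names invented for illustration. Both "models" score perfectly on the data they were trained on, but only the one that learned the rule copes with numbers it has never seen.

```python
# Toy task (an assumption for illustration): label x as True when x >= 50.
def make_data(xs):
    return [(x, x >= 50) for x in xs]

train = make_data(range(0, 100, 2))  # even numbers: seen during training
test = make_data(range(1, 100, 2))   # odd numbers: never seen

class Memorizer:
    """Stores every training example verbatim in a lookup table."""
    def fit(self, data):
        self.table = dict(data)
        return self

    def predict(self, x):
        # Returns None for anything that was not in the training data.
        return self.table.get(x)

class ThresholdLearner:
    """Learns a single rule: the smallest training input labelled True."""
    def fit(self, data):
        self.threshold = min(x for x, label in data if label)
        return self

    def predict(self, x):
        return x >= self.threshold

def accuracy(model, data):
    # Fraction of examples where the prediction matches the label.
    return sum(model.predict(x) == y for x, y in data) / len(data)

memorizer = Memorizer().fit(train)
learner = ThresholdLearner().fit(train)

print("memorizer train/test:", accuracy(memorizer, train), accuracy(memorizer, test))
print("learner   train/test:", accuracy(learner, train), accuracy(learner, test))
```

The gap between training and test accuracy is the usual symptom of overfitting. Real models sit somewhere between these two extremes, which is why the interesting question is how much is memorized, not whether memorization is possible at all.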

The copyright relevance should be clear. So far most of the ongoing litigation has rested on the input phase, with the question being whether training an AI with copyright works without authorisation is copyright infringement. Despite many attempts to make this into a clear-cut example of infringement, the results have been more varied than perhaps we were led to believe. Results have been mixed because trainers usually use copies internally to train a model, and these copies are not published or communicated to the public in any meaningful way, so US cases have been veering towards declaring this training fair use. Add to that exceptions such as Art 4 of the DSM Directive in the EU, and we have the beginning of a legal trend. Things may still change, but at least for now training itself appears to be cleared.

This is a problem for copyright owners that want to get remuneration from training. So what has been happening is a move to locate the infringement beyond the training phase and in the models themselves, alongside a growing number of infringement cases in the output phase. However, the output cases have been limited for reasons discussed before here. I don’t intend to revisit those now, but I think that at least in the near future we will not see too many output cases (unless you’re Disney, but I digress).

This brings us to memorization, and why it has become the most important legal battleground recently. If there is no infringement in the training itself (and I concede that this is still a big ‘if’, but humour me), then the next best argument is that trained models themselves are infringing copies, and therefore any release of a model that has been trained with inputs owned by a copyright holder must also be infringing. The argument assumes that models are more like storage devices that distribute copies of copyright works to the users of those models, and that this is in itself an act of copyright infringement. We saw versions of these arguments in some of the earlier lawsuits, but they were quickly shut down by judges at the motion stage as it became evident that models were not actually databases filled to the brim with infringing copies.

So the arguments morphed into what we have now, and it goes something like this. Models are trained with copyright works, and while they’re not databases or storage in the traditional sense, the models can memorize works in the training data, and can sometimes reproduce them as outputs. This means that they still act like storage devices, even though they are not technically that. So a trained model will still infringe copyright because of the capacity for memorization.

This has become a widespread talking point amongst copyright maximalists, and it has been showing up in various lawsuits such as Getty and GEMA, as well as ongoing US cases. I have attended a few online and in-person events over the last couple of years, and I have noticed a concerted effort to push the memorization narrative hard. Select examples of image memorization are constantly put forward, as are references to the technical literature, which is often cited out of context or misunderstood. The idea here is to convince the legal community and the courts that models being capable of memorization equals copyright infringement. This culminated in the GEMA decision, where the court practically declared that if memorization was inevitable, then this was equivalent to copyright infringement.

Needless to say, I strongly disagree with this narrative because it continues to ignore the realities of memorization. There is a growing body of technical literature on this subject, particularly because of its relative importance in the copyright debate. It is not the intention of this blog post to thoroughly showcase all of this research, but for the most part we can start to dismantle the above argument by looking at how memorization actually occurs.

One of the articles most cited by maximalists is this 2023 paper by Nicholas Carlini et al; Carlini’s work has done much to popularise the term “memorization”. For a while this looked like the smoking gun that the copyright industry was looking for when it came to whether models were capable of reproducing their training data as an output, and it was therefore used in several copyright cases. What these same lobbyists tend to ignore is that, in this paper and in subsequent ones from Carlini and others, this type of memorization and extraction of data is often adversarial. In fact, another paper by Carlini et al also found that memorization occurs in less than 1% of cases and often requires “attacks” to reveal it. The original paper got cited so much in copyright circles that Carlini had to write a blog post explaining why his privacy research should not be used in a copyright context. A more recent paper by Cooper et al confirms this trend: when it comes to books specifically, they found that memorization is extremely rare. They did find some outliers with one model (Harry Potter and 1984), but the trend is for memorization to be minimal.

Another common use of memorization in copyright has been to assume that models will always memorize, and the most egregious example of this has been its appearance in the GEMA decision. The reality is that a model mostly remembers things that are repeated excessively in the training data (e.g. viral images, boilerplate text, famous poems). A good paper on this is this one by Somepalli et al, who found that image replication is almost exclusively driven by image duplication in the training set. This can be mitigated with data hygiene measures during training, such as removing duplicates, but most importantly, it narrows down the scope of the use of memorization for legal purposes.
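As a rough illustration of what that kind of data hygiene can look like, here is a minimal Python sketch of exact-duplicate removal on a list of text documents. The function names (`normalise`, `fingerprint`, `deduplicate`, `duplication_counts`) and the sample documents are my own assumptions for the example, not anything taken from the papers above, and real training pipelines typically go further by also detecting near-duplicates.

```python
import hashlib
from collections import Counter

def normalise(text):
    # Lower-case and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())

def fingerprint(text):
    # Hash of the normalised text, used as a compact identity for a document.
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first occurrence of each document, drop exact repeats."""
    seen = set()
    unique = []
    for doc in docs:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def duplication_counts(docs):
    """How many times each distinct document appears in the dataset."""
    return Counter(fingerprint(doc) for doc in docs)

# Hypothetical sample data: the first two entries differ only in case and spacing.
docs = [
    "Terms and conditions apply to all orders",
    "terms and   conditions apply to ALL orders",
    "A one-off sentence that appears only once",
]

print(len(docs), "documents before,", len(deduplicate(docs)), "after")
print("most repeated item appears", max(duplication_counts(docs).values()), "times")
```

The second helper is there because the duplication count is the interesting number for this debate: the items most at risk of being memorized are the ones that appear many times over.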

The reality is clear. Memorization can occur, but it remains rare, particularly for text, and it can be reduced by good training practices. So whenever you see someone bringing up a 2023 Carlini et al paper to support a specific copyright argument, you can be suitably suspicious. But perhaps most importantly, what all of the above means from a copyright perspective is that we should keep memorization out of legal arguments: just because it can occur doesn’t mean that it did occur in any specific example. The proof is in the output. As I keep repeating here, and will continue to repeat, memorization is not an exclusive right of the author; reproduction is. So you will only have a copyright case if you can prove that a model is capable of reproducing specific inputs; otherwise we only have speculation.
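For what looking at the output could mean in practice, here is a small Python sketch that finds the longest run of words a model output shares verbatim with a reference work. It is only an illustration: the function name `longest_shared_run` and the sample strings are hypothetical, the comparison is a naive word-level match, and how long a shared run would need to be to count as reproduction is a legal question that the code does not answer.

```python
from difflib import SequenceMatcher

def longest_shared_run(output, reference):
    """Return the longest sequence of words appearing verbatim in both texts."""
    out_tokens = output.split()
    ref_tokens = reference.split()
    # autojunk=False stops difflib from discarding very frequent words.
    matcher = SequenceMatcher(None, out_tokens, ref_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(out_tokens), 0, len(ref_tokens))
    return out_tokens[match.a:match.a + match.size]

# Hypothetical reference work and model output.
reference = "the quick brown fox jumps over the lazy dog near the river"
output = "my summary says the quick brown fox jumps over a fence"

run = longest_shared_run(output, reference)
print(len(run), "shared words:", " ".join(run))
```

A check like this says something about one specific output measured against one specific work, which is exactly the kind of evidence a reproduction claim needs.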

Concluding

It has been fascinating to follow the evolution of the copyright debates in the last few years when it comes to memorization. What started life as a “gotcha” by social media reply guys quickly migrated into one of the most used (and misused) legal tactics against generative AI in the ongoing copyright cases. I hope that eventually sanity will prevail, and that the GEMA treatment of the subject will go down in copyright history as an outlier. I find that the most rational treatment of the subject so far has been that of Smith J in the Getty Images decision, and I hope that courts around the world take notice of this approach.

If we think of models as latent space, what we should expect is something closer to Gandalf in Moria: when asked to reproduce training data, they simply reply, “I have no memory of this place”.

 

* Note: I usually use the British spelling for memorisation, but in this blog post I decided to use the US spelling as it is the one that is more prevalent in the literature.


3 Comments

Anonymous · November 28, 2025 at 5:20 pm

Thanks for sharing these thoughts Andres. Memorisation has been causing me some head-scratching of late.

Anonymous · December 3, 2025 at 6:25 am

Great piece! A few notes:

Overfitting and duplication are two different paths to memorization, with duplication much more common in LLMs. When a model sees the same text many times, memorizing it isn’t necessarily an error. It’s just reflecting the patterns it encounters most often. But that inherently biases the model toward those patterns and away from the broader distribution of language on the internet and in the world. So if developers want models that generalize well to the full range of real-world scenarios, they’re strongly incentivized to de-duplicate their training data (which also reduces memorization risk).

Plaintiffs often blur two issues: regurgitation of training data in the weights and verbatim output of RAG/grounding data (especially search results). Training-data output claims may narrow, but the fight over search-grounded outputs is just heating up.
