The free and open source development world is abuzz with a new feature from GitHub called Copilot, a programming tool that has been trained using code from GitHub’s own corpus. For those unfamiliar, GitHub is the world’s largest open source software repository: it has 40 million users and hosts 190 million repositories, of which 28 million are public, according to Wikipedia.

On the face of it, Copilot looks impressive. It uses Codex, a machine learning program developed by OpenAI, and it was trained using an undisclosed amount of GitHub’s own code. According to Copilot:

“GitHub Copilot is powered by OpenAI Codex, a new AI system created by OpenAI. It has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.”

So we don’t know what code they used, only that it is in the English language, and that it is a selection (this could be vital, more on that later). Copilot takes a code prompt and suggests the code that follows, almost like magic. What is most likely happening is a statistical analysis of how likely certain code is to follow other code, sort of like your phone’s autocomplete, or the text-based GPT-3. The program can make a good guess of what should follow based on the combined knowledge of a large corpus of code.
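To make the autocomplete comparison concrete, here is a minimal sketch of the idea of guessing what follows from how often it followed before. It is a toy bigram counter over an invented three-line corpus, not how Codex works (Codex is a large neural language model), and the function names `train_bigrams` and `suggest` are made up for this illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(snippets):
    """Count how often each token follows another across a toy corpus."""
    follows = defaultdict(Counter)
    for snippet in snippets:
        tokens = snippet.split()
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def suggest(follows, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]


# Invented corpus: whitespace-separated tokens stand in for real code.
corpus = [
    "for i in range ( n ) :",
    "for item in items :",
    "for i in range ( 10 ) :",
]
model = train_bigrams(corpus)
print(suggest(model, "in"))  # "range" follows "in" twice, "items" once
```

The more often a continuation appears in the corpus, the more likely it is to be suggested, which is also why very common code is the code most likely to be repeated back.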

So far so good, but a bit of a stink has been raised online recently. This was prompted by a research paper by Albert Ziegler published by GitHub, which explained that in very few circumstances (0.1% of the time), Copilot could suggest code that already exists. This raises the question of whether GitHub could be accused of infringing copyright. The paper explains that if one gives a famous text prompt to Copilot (in this case the Zen of Python), then it is likely to recite the entire text back, because it is a well-known text that is probably repeated many times in its training data.

Does Copilot do the same thing with code? Only very rarely, and only when reciting often-repeated code. Ziegler looked for code repeated exactly over at least 60 words, and found that out of 453,780 code suggestions, only 473 (roughly 0.1%) matched some of the training code over at least 60 words. This is a very small number, but for me the most important aspect was which code was being replicated. It tended to be very common elements of code, mostly opening text. Copilot was more likely to suggest code from somewhere else when there was not a lot of input, and the more input it was offered, the less likely it was to offer matching code. Ziegler concludes:

“This demonstration demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code everybody quotes, and mostly at the beginning of a file, as if to break the ice.

But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.”

Ziegler suggests that the filtering tool should become part of Copilot:

“The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.”
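The overlap check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not GitHub's actual prefilter: the helper names (`word_windows`, `build_index`, `find_overlap`), the file names, and the sample texts are all hypothetical, and the window is shortened to 4 words so the example stays readable (Ziegler's analysis used 60).

```python
def word_windows(text, size):
    """Return every run of `size` consecutive whitespace-separated words."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def build_index(training_files, size):
    """Map each word window to the first training file it was seen in."""
    index = {}
    for name, text in training_files.items():
        for window in word_windows(text, size):
            index.setdefault(window, name)
    return index


def find_overlap(suggestion, index, size):
    """List the training files that share at least one window with a suggestion."""
    return sorted({index[w] for w in word_windows(suggestion, size) if w in index})


# Hypothetical training data and suggestions, invented for this sketch.
WINDOW = 4  # assumption for brevity; the paper's threshold was 60 words
training_files = {
    "repo_a/zen.txt": "beautiful is better than ugly explicit is better than implicit",
    "repo_b/util.py": "def add ( a , b ) : return a + b",
}
index = build_index(training_files, WINDOW)

quoted = "simple is better than complex explicit is better than implicit"
fresh = "x = compute ( y )"
print(find_overlap(quoted, index, WINDOW))  # the suggestion shares windows with repo_a/zen.txt
print(find_overlap(fresh, index, WINDOW))   # no shared window, so an empty list

# The rate reported in the paper: 473 matches out of 453,780 suggestions.
rate = 473 / 453_780
print(f"{rate:.2%}")
```

A tool along these lines could give the user the name of the matching source, so they can add attribution or discard the suggestion.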

This seems like a logical and sensible solution. In my opinion, there’s little chance of there ever being actionable copyright infringement here.

The reason for this may vary from one jurisdiction to another, but in general, copyright infringement tends to be looked at from a qualitative, and not quantitative, perspective. Imagine a developer uses Copilot to produce some code, and a small amount is found to be copied from software archived on GitHub. Will the resulting code infringe copyright? I don’t think so, for two reasons. Firstly, there has to be substantial reproduction, and it doesn’t look like Copilot is likely to recreate large chunks of code from only one source. Secondly, the quality of the code matters: if the matching code is very common, it may not even originate from the training code, and could just be a very common manner of doing something.

But what about possible licence breach? Coding Twitter has been set aflame by this call to arms from a software developer:

 

There is a lot going on in this thread. The claim is that Copilot has been trained using free and open source software, and that a lot of the code is under the GPL. This is a viral licence that contains a copyleft obligation to release derivative code under the terms of the same GPL licence (more on that from this ancient article from Yours Truly). The argument here is that because Copilot has been trained on GPL code, the resulting code should be released under the GPL because it would be a work derived from the original code. On this view, Microsoft can expect a large class-action suit from thousands of developers.

I strongly disagree.

Firstly, there’s no indication that Codex and Copilot have been trained exclusively using GPL code. As of 2017, 20% of open source projects were using the GPL v2 or v3 licences, with most projects using more permissive licences such as MIT and Apache. Secondly, there appears to be an assumption that the resulting code is derived from the code that was used to train it, and as I have explained, this is not the case at all. Derivation, modification, or adaptation (depending on your jurisdiction) has a specific meaning within the law and the licence. In the extremely unlikely case that there is code in Copilot that is identical to some code found in a GPL repository, and that was used to train Codex, it would still need to meet the definition of modification. The GPL v3 states that code can only be used in another project if it is a modified version of it, and it defines modification thus:

“To “modify” a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a “modified version” of the earlier work or a work “based on” the earlier work.”

This is very clever wording. You only need to comply with the licence if you modify the work, and you only modify it if your code is based on the original to the extent that it would require copyright permission; otherwise it would not require a licence. As I have explained, I find it extremely unlikely that similar code copied in this manner would meet the threshold of copyright infringement. There is not enough code copied, and even if there is, it appears to be mostly very basic code that is common to other projects.

But there is also an ethical question here. People share code for the betterment of society, and while copyleft used to be popular early on, the software industry has been moving towards less problematic and more permissive licences. The spirit of open source is to share code, and that should make it possible to use it to train machine learning.

Having said that, I understand why people may object to their code being used to train commercial ML, but that is another question. In my opinion, this is neither copyright infringement nor licence breach, but I’m happy to be convinced of the contrary.

ETA: Neil Brown has an excellent take on the subject. It’s worth adding that ML training is increasingly considered to be fair use in the US (Google Books), and fair dealing under data mining exceptions in other countries.


50 Comments

Anonymous · June 30, 2021 at 7:29 pm

Secondly, there appears to be an assumption that the resulting code is derived from the code that was used to train it, and as I have explained, this is not the case at all.

You’re an idiot

Best regards

    Andres Guadamuz · July 1, 2021 at 10:31 am

    I’m leaving this post up because it made me chuckle.

Anonymous · July 1, 2021 at 12:03 am

Interesting read! I’m curious to know why you impose the burden of “exclusively GPL code”? It seems that to use an assortment of works, you would be forced to contend with the most restrictive licensing rather than the least when attempting to create a combined work.

Regarding fair use, I’m doubtful that a statistical model can be described as a transformative work, given that its primary goal is to be as close to its source material as possible.

Another fun complication is that large language models like this have a habit of reproducing their training data identically, which kinda flies in the face of Ziegler’s work (https://arxiv.org/abs/2012.07805).

    Andres Guadamuz · July 1, 2021 at 10:43 am

    It seemed like that was the implication from the tweet, and from a lot of commentary. But even if the trained code is under the GPL, I argue that it’s not a modification from a legal perspective.

    As for fair use, this is established in Authors Guild v. Google: training ML is considered fair use. If the wholesale copying of entire books was found to be fair use, I cannot see how the analysis of archived code could be found to be infringing in the US. In the EU and the UK, this could fall under text and data mining.

Andrew Ducker · July 1, 2021 at 10:26 am

I cannot see how something which swishes a bunch of copyrighted code around in a big bucket and then pours it out, somehow produces something which is not copyrighted code.

Nor can I see how this isn’t modified code. (Also, if it isn’t modified then it’s verbatim code, which has its own place in the license.)

On top of this, “The spirit of open source” is irrelevant – these are legal licenses to copy and modify code. Either you follow the license or you have no legal right to break the copyright.

    Andres Guadamuz · July 1, 2021 at 10:47 am

    Hi Andrew,

    As for the first question, the answer is that the ML is not swishing code around and spitting out the very same code most of the time. In a tiny number of cases, it will produce code that is very similar to common code used by other programs.

    As for the modified code, the licence defines modification as a use of the code for which you would need permission from the copyright owner. Based on Ziegler’s analysis, I don’t think this is the case: this is code that would not be protectable (not every use is an infringing use).

    I do agree that the ethical consideration is secondary, but I do believe it’s relevant to the likelihood that an open source developer will sue. There is very little litigation for a reason.

      Andrew Ducker · July 2, 2021 at 1:51 pm

      Here it is spitting out a famous bit of source code from Quake, complete with sweary comments.
      Would you agree that this is copyright infringement?

        Andres Guadamuz · July 4, 2021 at 11:52 am

        I don’t know the context, but it seems too short to be copyright infringement. I keep seeing these recitations that appear to be extremely specific. Not all reproduction is copyright infringement, otherwise I could never cite anything.

        Ivan · July 7, 2021 at 10:18 am

        Andres, this is absurd, how is this not an infringement? It literally reproduces 20 non-trivial (in fact extremely clever and unique and famous) lines of code from another codebase. If a person did this and claimed ownership of the code, it would be an outrage.

        Andres Guadamuz · July 8, 2021 at 10:08 am

        I mentioned that I didn’t know the context (I’m a lawyer, not a programmer, I know enough to get by), but on the face of it, 20 lines of code would not be enough to warrant infringement, particularly if the code is so well known that it has been re-used repeatedly. Also, coders should be very afraid of opening the floodgates to a world in which 20 lines of code get litigated for copyright infringement. Besides, the output is not claiming any ownership of the code; my guess is that it could be public domain.

        beleester · July 9, 2021 at 12:41 pm

        Context: That code is the Fast Inverse Square Root (wiki article: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root). It’s a famous piece of tech history – a clever optimization to approximate a slow calculation using much faster operations. Most famously used in the Quake 3 game engine, which is where the sweary comments came from.

        So I would assume it shows up in Copilot because the Quake 3 source (which has been GPL’d) is on GitHub along with various other forks and hacks of that engine. So I think that, unlike most of the cases found here, you could reasonably argue that this code is directly copied from Quake 3, or an engine that copied from Quake 3, and that most programmers wouldn’t have been able to independently implement that code without copying.

        (Although the wiki article notes that Quake 3 didn’t invent the algorithm in question, only popularized it.)

        Not a lawyer, I’m just explaining why programmers will immediately see this code and go “hey, that’s a copy!” rather than “nah, everyone does it that way.”

Andrew · July 1, 2021 at 12:35 pm

Have you considered, since the GPL (like any FLOSS license) allows reproduction and modification, that it might be quite easy to meet the standard of “there has to be substantial reproduction” even for nontrivial GPL solutions?

I see an assumption that only trivial code will be reproduced, but one of the defining features of FLOSS is to allow widespread reproduction, regardless of complexity. Do you see this not being a problem, or being mitigated by some specific strategy?

    Andres Guadamuz · July 1, 2021 at 2:54 pm

    Excellent question. The issue is that all licences, not just FLOSS, exist to permit a use that would otherwise be infringing, there’d be no need for a licence in the first place if there’s no infraction. We do this all the time, we cite text, use photographs, use screenshots, etc. These are fair uses that do not require me to obtain a licence.

    So there has to be a cut-off point at which something passes from being a fair use / fair dealing, to infringement necessitating a licence. Some people will err on the side of caution and get a licence anyway, just to be safe. But generally we have a good understanding of where the line is in many instances. Looking at the Ziegler paper, I believe strongly that what they described does not cross the threshold of infringement.

    I don’t see this as a problem, and I don’t see this as incompatible with the goals stated. I hope that makes sense.

      Andrew Sillers · July 1, 2021 at 3:38 pm

      Oh, yeah, I agree with the idea that the output of a model may be generally not a derivative. The relevant legal test (in the U.S. anyway) here is the Abstraction-Filtration-Comparison Test (basically, “abstract the necessarily functional components from the expression, filter them out to leave only what is distinct about this expression, and compare what remains”), which output could probably, but not always, pass. Indeed any code expression that is sufficiently trivial is merged with its underlying algorithm and would be filtered out in a legal comparison.

      My concern in my previous comment is more about the assumption that any code that appears frequently (and therefore is more likely to be reproduced verbatim) is necessarily trivial. Suppose there is a GPL implementation of frobulate_foos that is nontrivial and replicated widely across GPL projects. When a user of the model expresses an intent to frobulate some foos, it seems likely that the model might produce the most frequently-seen implementation of frobulate_foos which is GPL’d.

      I suppose another way to put it is: the frequency with which some code appears need not correspond to that code’s triviality. Freedom to copy means that even complex code may appear repeatedly in a training corpus, which increases the odds of verbatim or problematically-similar production of that code as output.

      I’m not saying this will happen or is happening (I am not a machine learning expert by any means), only that we can’t rely on the assumption that frequently-seen code is trivial, which is what it sounds like you might be saying in your phrasing of “even if there is [reproduction sufficient to constitute a derivative under copyright law], it appears to be mostly very basic code that is common to other projects.”

        Andres Guadamuz · July 2, 2021 at 11:03 am

        Excellent point, thanks. Gosh, I hadn’t thought about the Abstraction/Filtration/Comparison test in years! It was criticised in an English case here, so we’ve gone with the tried and tested idea/expression dichotomy. Some people even blame AFC for software patents, but that’s another story.

        Good point that common code may still be protected. Reading the Ziegler paper, he states categorically that the code is more common than that, mostly things found in opening, or code for which there are not many alternatives.

        I think that this is all going to be solved eventually by Codex and Copilot offering a similarity tool where programmers can check whether there is any recitation in their code.

        Andrew Sillers · July 2, 2021 at 12:17 pm

        That’s very helpful and reassuring context. It sounds like my concern might be rated “philosophically not-impossible” but happily doesn’t actually occur for their training corpus (and might not occur for any reasonable real-world corpus). It also sounds like I should just read the paper already! 🙂

Robert J. Berger · July 9, 2021 at 8:08 pm

It may not be illegal, but IMHO as it stands, Copilot is immoral. The issue is that it is owned by Microsoft, is closed source, is only available on VS Studio, and offers no way to opt your code out. It looks like it’s part of the new Microsoft plan to embrace and extend Developers, Developers, Developers.

If Copilot was Open Sourced, the ML corpus and models publicly available and not tied to VS Studio with a community governance, it could be a good thing.

Anonymous · July 13, 2021 at 9:17 pm

Are you serious? “You only need to comply with the licence if you modify the work”? Did you even READ the GPL? It needs you to keep your code as GPL if you reproduce OR modify… up to this day, it isn’t even clear if you can LINK, statically or dynamically, GPL code into a closed-source, so I have no idea where you got that info.

Also, 20% of code is big enough to be trouble. What about other licenses that also need attribution? What about code that’s public but NOT open-source (like Jekejeke Prolog) that was used in the training? What about “Creative Commons Non-Derivative Non-Commercial” software? Are we ignoring all these too?

Xataka – GitHub Copilot and the copyright controversy: the debate over whether AI infringes the copyright of code written by other programmers – Yacal · July 6, 2021 at 5:11 pm

[…] derivatives, but where these must be offered under the same conditions. This has led some developers to voice their displeasure at the fact that GitHub is taking advantage of other people’s code […]

GitHub Copilot – Your AI-powered accomplice to steal code? | apfelkraut.org · July 7, 2021 at 9:37 am

[…] enough to reach the threshold of originality”. Or as Andrés Guadamuz states it in his post “Is GitHub’s Copilot potentially infringing copyright?”: “[…] but in general, copyright infringement tends to be looked at from a qualitative, […]

Welcome to TechScape: will AI make centaurs of us all? | Technology – Green Reporter · July 14, 2021 at 12:14 pm

[…] opposition came not from private companies concerned that their work may have been reused, but from developers in the open-source community, who deliberately build in public to let their work be built upon in turn. Those developers often […]

Слон, Демон и Паук — Episode 0344 « DevZen Podcast · July 18, 2021 at 11:52 pm

[…] Is GitHub’s Copilot potentially infringing copyright? – TechnoLlama […]

Copilot or Co-Conspirator? Is GitHub’s New Feature a Copyright Infringer? – IP Osgoode · August 4, 2021 at 5:02 pm

[…] property law professor Andres Guadamuz argues that Copilot, as it stands, does not infringe copyright. This is because Copilot would copy small […]
