The free and open source development world is abuzz with a new feature from GitHub called Copilot, a programming tool trained on code from GitHub's own corpus. For those unfamiliar, GitHub is the world's largest open source software repository: according to Wikipedia, it has 40 million users and hosts 190 million repositories, of which 28 million are public.
On the face of it, Copilot looks impressive. It uses Codex, a machine learning model developed by OpenAI, which was trained on an undisclosed amount of GitHub's own code. According to GitHub:
“GitHub Copilot is powered by OpenAI Codex, a new AI system created by OpenAI. It has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.”
So we don't know exactly what code they used, only that it includes English-language text and source code, and that it is a selection (this could be vital, more on that later). Copilot takes a code prompt and suggests the code that follows, almost like magic. What is most likely happening is a statistical analysis of how likely certain code is to follow other code, rather like your phone's autocomplete, or the text-based GPT-3. The program can make a good guess at what should come next based on the combined knowledge of a large corpus of code.
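To make the autocomplete analogy concrete, here is a deliberately toy sketch (my own illustration, not Copilot's actual method, which uses a large neural language model): a bigram model that suggests the next token purely from frequencies observed in a tiny "training corpus".

```python
from collections import Counter, defaultdict

# A toy "training corpus" of tokenised code lines.
training_corpus = [
    "for i in range ( n ) :",
    "for i in range ( len ( items ) ) :",
    "for key in data :",
]

# Count how often each token follows each other token.
follows = defaultdict(Counter)
for line in training_corpus:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1

def suggest(token):
    """Return the statistically most likely next token, if any."""
    counts = follows.get(token)
    return counts.most_common(1)[0][0] if counts else None

print(suggest("for"))    # "i" follows "for" more often than "key" does
print(suggest("range"))  # "("
```

Copilot's model is vastly more sophisticated, but the underlying idea is the same: the suggestion is whatever the training data makes statistically likely, not code retrieved from any single source.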
So far so good, but a bit of a stink has been raised online recently. This was prompted by a research paper by Albert Ziegler published by GitHub, which explained that in very few circumstances (about 0.1% of the time), Copilot can suggest code that already exists. This raises the question of whether GitHub could be accused of infringing copyright. The paper explains that if one gives a famous text prompt to Copilot (in this case the Zen of Python), it will be likely to recite the entire text back, because it is a well-known text that is probably repeated many times in its training data.
Does Copilot do the same thing with code? Only very rarely, and only with often-repeated code. Ziegler looked for exact repetitions of at least 60 words, and found that out of 453,780 code suggestions, only 473 matched the training code to that extent. This is a very small number, but for me the most important aspect was the kind of code being replicated. It tended to be very common elements of code, mostly opening text. Copilot was more likely to suggest code from somewhere else when there was little input, and the more input was offered, the less likely it was to produce matching code. Ziegler concludes:
“This demonstration demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code everybody quotes, and mostly at the beginning of a file, as if to break the ice.
But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.”
They suggest that a filtering tool will become part of Copilot:
“The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.”
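The kind of prefiltering Ziegler describes can be sketched very simply (this is my own simplification, not GitHub's implementation): index every window of N consecutive words in the training set, then flag any suggestion that contains one of those windows verbatim, along with where it came from.

```python
# The real analysis used 60-word matches; a window of 6 keeps this toy readable.
WINDOW = 6

# Hypothetical training files, tokenised as space-separated words.
training_files = {
    "repo_a/util.py": "def add ( a , b ) : return a + b",
    "repo_b/zen.py": "beautiful is better than ugly explicit is better than implicit",
}

# Index every WINDOW-word sequence in the training set by source file.
index = {}
for path, text in training_files.items():
    words = text.split()
    for i in range(len(words) - WINDOW + 1):
        index[tuple(words[i:i + WINDOW])] = path

def quoted_from(suggestion):
    """Return the training file a suggestion quotes verbatim, if any."""
    words = suggestion.split()
    for i in range(len(words) - WINDOW + 1):
        path = index.get(tuple(words[i:i + WINDOW]))
        if path:
            return path
    return None

print(quoted_from("beautiful is better than ugly explicit"))  # repo_b/zen.py
print(quoted_from("x = compute ( y )"))                       # None
```

With something like this wired into the UI, a suggestion that overlaps the training set could carry a link to its source, letting the developer attribute it or discard it, exactly as the paper proposes.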
This seems like a logical and sensible solution; in my opinion, there is little chance of there ever being actionable copyright infringement here.
The reason for this may vary from one jurisdiction to another, but in general, copyright infringement tends to be assessed from a qualitative, not a quantitative, perspective. Imagine a developer uses Copilot to produce some code, and a small amount is found to be copied from software hosted on GitHub: would the resulting code infringe copyright? I don't think so, for two reasons. Firstly, there has to be substantial reproduction, and it doesn't look like Copilot is likely to recreate large chunks of code from a single source. Secondly, the quality of the code matters: if the matching code is very common, it may not even originate from the training code; it could just be a very common way of doing something. The quality of the code would indeed be very relevant.
But what about possible licence breach? Coding Twitter has been set aflame by this call to arms from a software developer:
github copilot has, by their own admission, been trained on mountains of gpl code, so i’m unclear on how it’s not a form of laundering open source code into commercial works. the handwave of “it usually doesn’t reproduce exact chunks” is not very satisfying pic.twitter.com/IzqtK2kGGo
— eevee (@eevee) June 30, 2021
There is a lot going on in this thread. The claim is that Copilot has been trained using free and open source software, and that a lot of that code is under the GPL. This is a viral licence containing a copyleft obligation: derivative code must be released under the terms of the same GPL licence (more on that in this ancient article from Yours Truly). The argument is that because Copilot has been trained on GPL code, the resulting code should be released under the GPL, as it would be a work derived from the original code. On this view, Microsoft could expect a large class-action suit from thousands of developers.
I strongly disagree.
Firstly, there is no indication that Codex and Copilot were trained exclusively on GPL code. As of 2017, only 20% of open source projects used the GPL v2 or v3 licences, with most projects using more permissive licences such as MIT and Apache. Secondly, there appears to be an assumption that the resulting code is derived from the code used to train it, and as I have explained, this is not the case at all. Derivation, modification, or adaptation (depending on your jurisdiction) has a specific meaning within the law and the licence. In the extremely unlikely case that Copilot produces code identical to some code found in a GPL repository that was used to train Codex, it would still need to meet the definition of modification. The GPL v3 states that code can only be used in another project if it is a modified version of it, and it defines modification thus:
“To “modify” a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a “modified version” of the earlier work or a work “based on” the earlier work.”
This is very clever wording. You only need to comply with the licence if you modify the work, and that happens only if your code is based on the original to an extent that would require copyright permission; otherwise no licence is needed. As I have explained, I find it extremely unlikely that similar code copied in this manner would meet the threshold of copyright infringement: not enough code is copied, and even where it is, it appears to be mostly very basic code that is common to other projects.
But there is also an ethical question here. People share code for the betterment of society, and while copyleft used to be popular early on, the software industry has been moving towards less problematic and more permissive licences. The spirit of open source is to share code, and to make it possible to use that code, including to train machine learning systems.
Having said that, I understand why people may object to their code being used to train commercial ML, but that is another question. In my opinion, this is neither copyright infringement nor licence breach, but I’m happy to be convinced of the contrary.
ETA: Neil Brown has an excellent take on the subject. It’s worth adding that ML training is increasingly considered to be fair use in the US (Google Books), and fair dealing under data mining exceptions in other countries.