GitHub’s Commercial AI Tool Was Built From Open Source Code

Admin July 12, 2021

“I’m generally happy to see expansions of free use, but I’m a little bitter when they end up benefiting massive corporations who are extracting value from smaller authors’ work en masse,” Woods says.

One thing that’s clear about neural networks is that they can memorize their training data and reproduce copies. That risk is there regardless of whether that data involves personal information or medical secrets or copyrighted code, explains Colin Raffel, a professor of computer science at the University of North Carolina who coauthored a preprint (not yet peer-reviewed) examining similar copying in OpenAI’s GPT-2. Getting the model, which is trained on a large corpus of text, to spit out training data was rather trivial, they found. But it can be difficult to predict what a model will memorize and copy. “You only really find out when you throw it out into the world and people use and abuse it,” Raffel says. Given that, he was surprised to see that GitHub and OpenAI had chosen to train their model with code that came with copyright restrictions.

According to GitHub’s internal tests, direct copying occurs in roughly 0.1 percent of Copilot’s outputs, which the company calls a surmountable error rather than an inherent flaw in the AI model. That’s enough to give pause to the legal department of any for-profit entity (“non-zero risk” is just “risk” to a lawyer), but Raffel notes this is perhaps not all that different from employees copy-pasting restricted code. Humans break the rules regardless of automation. Ronacher, the open source developer, adds that most of Copilot’s copying appears to be relatively harmless: cases where simple solutions to problems come up again and again, or oddities like the infamous Quake code, which people have (improperly) copied into many different codebases. “You can make Copilot trigger hilarious things,” he says. “If it’s used as intended I think it will be less of an issue.”

GitHub has also indicated it has a possible solution in the works: a way to flag those verbatim outputs when they occur so that programmers and their lawyers know not to reuse them commercially. But building such a system is not as simple as it sounds, Raffel notes, and it gets at the larger problem: What if the output is not verbatim, but a near copy of the training data? What if only the variables have been changed, or a single line has been expressed in a different way? In other words, how much change is required for the system to no longer be a copycat? With code-generating software in its infancy, the legal and ethical boundaries aren’t yet clear.
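To make the near-copy problem concrete, here is a minimal Python sketch of why an exact-match check can miss a copy in which only the variable names have changed. It is an illustration only, not GitHub's flagging system, whose design the article does not describe. The function names (`normalized_tokens`, `is_verbatim`, `is_near_copy`) and the sample snippets are hypothetical, and the comparison assumes the code being checked is itself valid Python.

```python
import io
import keyword
import tokenize

# Token types that carry layout or comments rather than program logic.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, replacing each distinct identifier
    with a positional placeholder (ID0, ID1, ...)."""
    names: dict[str, str] = {}
    out: list[str] = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # The same identifier always maps to the same placeholder.
            out.append(names.setdefault(tok.string, f"ID{len(names)}"))
        else:
            out.append(tok.string)
    return out


def is_verbatim(a: str, b: str) -> bool:
    """Exact match, ignoring only leading and trailing whitespace."""
    return a.strip() == b.strip()


def is_near_copy(a: str, b: str) -> bool:
    """Match after identifiers are normalized (hypothetical heuristic)."""
    return normalized_tokens(a) == normalized_tokens(b)


# Hypothetical snippets: 'renamed' differs from 'original' only in names.
original = "def area(width, height):\n    return width * height\n"
renamed = "def size(w, h):\n    return w * h\n"
different = "def area(width, height):\n    return width + height\n"
```

In this sketch the exact-match check treats `renamed` as new code, while the normalized comparison treats it as a copy of `original`. Even this heuristic would miss a line that has been restructured rather than renamed, which is the boundary question Raffel raises.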

Many legal scholars believe AI developers have fairly wide latitude when selecting training data, explains Andy Sellars, director of Boston University’s Technology Law Clinic. “Fair use” of copyrighted material largely boils down to whether it is “transformed” when it is reused. There are many ways of transforming a work, like using it for parody or criticism or summarizing it—or, as courts have repeatedly found, using it as the fuel for algorithms. In one prominent case, a federal court rejected a lawsuit brought by a publishing group against Google Books, holding that its process of scanning books and using snippets of text to let users search through them was an example of fair use. But how that translates to AI training data isn’t firmly settled, Sellars adds.

It’s a little odd to put code under the same regime as books and artwork, he notes. “We treat source code as a literary work even though it bears little resemblance to literature,” he says. We may think of code as comparatively utilitarian; the task it achieves is more important than how it is written. But in copyright law, the key is how an idea is expressed. “If Copilot spits out an output that does the same thing as one of its training inputs does—similar parameters, similar result—but it spits out different code, that’s probably not going to implicate copyright law,” he says.

The ethics of the situation are another matter. “There’s no guarantee that GitHub is keeping independent coders’ interests to heart,” Sellars says. Copilot depends on the work of its users, including those who have explicitly tried to prevent their work from being reused for profit, and it may also reduce demand for those same coders by automating more programming, he notes. “We should never forget that there is no cognition happening in the model,” he says. It’s statistical pattern matching. The insights and creativity mined from the data are all human. Some scholars have said that Copilot underlines the need for new mechanisms to ensure that those who produce the data for AI are fairly compensated.
