As long as the AI doesn't produce this function, you're fine:
    private static void rangeCheck(int arrayLen, int fromIndex, int toIndex) {
        if (fromIndex > toIndex)
            throw new IllegalArgumentException("fromIndex(" + fromIndex +
                ") > toIndex(" + toIndex + ")");
        if (fromIndex < 0)
            throw new ArrayIndexOutOfBoundsException(fromIndex);
        if (toIndex > arrayLen)
            throw new ArrayIndexOutOfBoundsException(toIndex);
    }
On a more serious note, I really wonder where the line is drawn for copyright. I see a lot of people claiming that AI is producing code they've written verbatim, but sometimes I wonder if everyone just writes certain things the same way. For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative. Perhaps there is a matter of taste in what exceptions you throw, or in what order. But the chosen ones are certainly what most people think of first, and the order of checks (validate the argument relationship, then the low bound, then the high bound) is pretty much what anyone would do. Perhaps you could format the error message differently. That's about it. So when someone "rips off" your code wholesale, it could just be that everyone writing that function would have typed in the exact same bytes as you. You know your style guide is working when you look at code, think you wrote it, but actually you didn't!
That's why copyright holders for reference works have been using copyright traps for ages. That's where you include a fictional town in a map, a nonsense word in a dictionary, or a fake person in your phone book. If your competitors reproduce the trap, then that's clear evidence you can use in court.
We don't need the copyright traps here though as Github openly admits to using the public code for training. They just don't care that they're essentially license laundering code since they can make money doing it.
That said we used copyright traps at Malwarebytes, which is how we found out that iobit was stealing our database.
We know that isn't the case because we can see code being reproduced even with comments, and Github has been open about the fact that they used everything they had in training.
That said, let's say there's a new model that explicitly excluded closed-source and copyleft licenses. Well, MIT, MPL, Apache, BSD: they all say you can't strip their licensing off.
Okay, so to get to the spirit of your question, let's say Github managed to program a model that worked using only their own code or code that was explicitly put in the public domain. If Github managed to reproduce code that wasn't in the training set, then it can't be accused of copying it. At that point the argument could be made that it independently created it.
At the same time, algorithms can't be copyrighted, but implementations of an algorithm can be. So if Github was basically just spitting out an algorithm that happened to be implemented similarly to code it wasn't trained on, then I would say there was no copyright violation.
>We know that isn't the case because we can see code being reproduced even with comments
If the comment is something like
//check fromIndex is greater than toIndex
then that is not any more individualistic or distinctive than the function itself. Sadly, many people comment like this. On the other hand, if it reproduced a comment with typos, or something more complicated like
/* this hack is because Firefox's implementation of SVG z-indexing does not match how Chrome or Safari does it - please read this article ...url...*/
then that would be much harder to explain away as coincidence.
In almost any other scenario this would be evidence. But Fast Inverse Square Root isn’t some tightly held secret. That exact code, with those specific comments included, is found in the Wikipedia page for that algo:
How about rewording a code snippet so it doesn't exactly replicate the source, but is functionally identical? Could be applied before training. Can we say the LLM only learned the ideas not the expression? Copyright should protect expression and not restrict reusing ideas.
Except that's not how an LLM works. An LLM has no idea about "ideas", only probabilities of how certain words string together.
So you literally can't make it produce functionally identical but not verbatim identical code. It doesn't understand that the two are equivalent.
Also, such "functionally identical but not violating copyright" transformation is not possible to do, both given the complexity of the problem and the sheer volume of the data.
And training it on some simplistically obfuscated code wouldn't help - all it would learn would be production of obfuscated code. Not useful for the intended use.
> It doesn't understand that the two are equivalent.
it doesn't need to understand the way a human might do the understanding.
The pattern that the LLM managed to extract could include the structure, rather than the pure text. And in reproducing the structure, the LLM can replace the variable names but keep the structure intact.
I am not sure if copilot is able to do this, but chatGPT was somewhat able to (if imperfectly at the moment).
Copying a piece of code and changing the variable names is still a copy. It is similar to how copying a piece of music and changing the pitch/volume/any other attribute would still be a copy of the original music.
The thing that the LLM needs to do is to convince a judge/jury that it has not created a copy, and that it operates differently from a transformation.
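To make that concrete, here is a purely hypothetical rename of the rangeCheck function from the top of the thread (the new identifiers are mine; nothing else changes). The structure, the order of checks, and the message format are untouched, which is exactly why a rename alone shouldn't count as a new work:

    // Hypothetical rename only: same checks, same order, same message shape.
    private static void validateBounds(int length, int lo, int hi) {
        if (lo > hi)
            throw new IllegalArgumentException("fromIndex(" + lo +
                ") > toIndex(" + hi + ")");
        if (lo < 0)
            throw new ArrayIndexOutOfBoundsException(lo);
        if (hi > length)
            throw new ArrayIndexOutOfBoundsException(hi);
    }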
Consider a junior dev who writes a range check function while working for a company (so the company owns the copyright), then goes to a different company and writes the same range check function because that's just how he writes code.
> Then the legalities can be argued, but an individual is in any case not remotely comparable to a service like copilot.
Why is this? Copilot in some ways is an automated way to search code & stack overflow. There is a very annoying website that does nothing more than show relevant code samples of various google search terms.
If the manual version of something is okay (eg: googling for code, finding it, fitting for a new and specific purpose that is similar), why would an automated version of that be any different?
I'm confused by the forbidden surveillance example. Generally surveillance cameras are legal anywhere there is no expectation of privacy. The expectation of privacy largely covers only one's home; outside of that you can be videotaped all day long by anyone. I'm not sure how this is analogous..
The million messages example is interesting. Though, what examples are there? In what cases is something legal to do it once, but there is some threshold where you cannot do it many times?
The "sending millions of messages" is only perhaps illegal because it breaks terms of service. Or, the one message is perhaps also illegal but nobody cares to pursue litigation for one instance of an infraction. The point remains though, if an individual does something once that is legal - it makes that activity legal, period and full stop. No?
I was thinking about things like spam and also social media.
Note that my main objection is to equating a person doing something with an automated process. Sometimes it may be legal or other times illegal but it just clearly isn’t the same.
For the last point, I think the answer to that is a definite no in most jurisdictions. Laws and judicial conventions often allow differing circumstances to affect the legality of things.
> For the last point, I think the answer to that is a definite no in most jurisdictions. Laws and judicial conventions often allow differing circumstances to affect the legality of things.
Curious, any concrete examples? I can't really think of any where one instance is okay but many is not. I can think of examples where one instance is ignored and many instances are harder to ignore (and so is prosecuted), but overall - I can't really think of anything that is okay to do once but not many times.
Playing a movie for a few friends who visit is fine, but start to demand tickets and suddenly it will look like a cinema, which is not fine.
The reason is always the same. Courts and judges will look at the situation and make a decision about what seems fair and what does not. They are the ones who need to be convinced that a specific use of a copyrighted work is permitted, either through fair use or by a license.
> Playing a movie for a few friends who visit is fine, but start to demand tickets and suddenly it will look like a cinema, which is not fine.
Interesting analogy. "Ripping" something off and only using it for your personal project sounds like "playing a movie for a few friends". Doing so for the benefit of a corporation that then has thousands of daily visitors sounds like the movie cinema example. Though, in both cases it was an individual googling and finding how to implement a specific function.
"fair use" in copyright is pretty specific in that it refers to things like "you can play portions of a clip in order to comment on it." Or as another example, you can use clips/portions for the purposes of a review commentary.
"Form and function" is perhaps a very important crux here. Some things you can only do a certain way. For example, quick-sort, there are is only really one way to implement quick sort (or otherwise it is not at all quick sort!).
Personally I feel the copyright line is higher than a function; the copyright is on the collection of functions that together create a specific piece of software. The individual functions IMHO are about as copyrightable as a cog on a bike cassette, or the chain on a motorcycle.
I think there are quite few things in programming that can only be implemented one way. I see it as similar to music, in that almost every song has notes going up or down the scale. Obviously there can't be that many variations, but the important distinction is often in the details. Applying copyright to a single function is like applying copyright to a single riff. Sometimes the legal system will accept it, but it should be the exception rather than the norm.
Fair use seems to have changed in scope. Historically it was mostly about things like "play a clip in order to comment on it", but now we have things like google making a copy of all books ever written in order for people to search through them. Similar arguments have been made over copying news articles from news sites in order to put a portion of them in search results. A stack overflow-like search engine that trawled proprietary code bases would likely be sued, but in theory they could argue fair use just like google.
I am pretty sure both cases would break copyright. But in the first case the copyright holders would never go after you, and in the second they would. In both cases they could, though. The damages a company could recover from you for watching a movie with a few friends are much lower than the damages they could recover if you made money selling tickets. Not to mention the negative PR a company would get for going after someone buying a DVD and watching it with friends.
IMO it’s the same thing because I fundamentally see LLMs in the same role as calculators that helps reduce cognitive load by offloading repetitive work.
Practically with an LLM the programmer can focus on the creative part (handler function, react component, etc) while the LLM generates the necessary boilerplate for the ever changing frameworks and infra configurations. The programmer (and QA) would still review and test everything but would save time writing boilerplate and ship features faster.
It literally means reproduced in some capacity. Just because it's called "training" doesn't mean it has any reasonable analogy to how humans learn or how expert humans train in a skill.
GPT-style models literally aim to reproduce the input character by character (token by token).
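Concretely, and this is just a rough sketch rather than any particular library's API: the next-token objective turns each training document into (prefix, next token) pairs, so the model is scored directly on how well it reproduces the input one token ahead.

    import java.util.ArrayList;
    import java.util.List;

    class NextTokenPairs {
        // One training example per position: the prefix seen so far is the
        // input, and the token that actually follows it is the target.
        record Pair(List<String> context, String target) {}

        static List<Pair> build(List<String> tokens) {
            List<Pair> pairs = new ArrayList<>();
            for (int i = 1; i < tokens.size(); i++) {
                pairs.add(new Pair(tokens.subList(0, i), tokens.get(i)));
            }
            return pairs;
        }
    }

For a tokenized snippet like ["int", "x", "=", "0", ";"] that yields four pairs, each of which asks the model to emit the very next token of the original source.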
Now if he had written a specification for what the function should do, then passed it to someone else who had never seen the function and worked only from the spec, then he'd be OK.
It's not nearly that simple. No real copyright case is going to hinge on what a single range check function looks like.
This is human law, it's not a programming situation where you can just apply some simple rule and get a deterministic answer. Context plays a huge part, among other things.
I should have said that no successful copyright case is going to hinge on that.
Oracle's position on that was legally incorrect, for the reason I was alluding to: the relevant standard requires that illegal copying involve the core of the creative expression of the original work, which a generic range check function clearly doesn't do.
As the copyright holder of "throw new", the Junior dev infringed my copyright! Let alone them infringing copyright of the company they crafted that code for.
On a more serious note, there is a question whether algorithms and code blocks can be copyrighted, or if it is the _software_ that is copyrighted. Let's say I use websockets and you crib my usage of websockets for your own application. My opinion is that unless you rebuild the same thing I did, then "cribbing" is the long-held art of "let me google how to do that". The artistic creation is the end software product, not really some measly embedded function that is boilerplate (form and function) for anything to work.
The 'form and function' clause of copyright almost certainly makes a range check function not a copyright infringement.
Easy money idea: when you know an employee will be leaving the company, have them spend their last weeks writing basic, foundational functions in multiple languages!
Also, re: maps, fake streets and cul-de-sacs that don't exist.
I've set a "trap" myself years ago in code in a novel solution at the time for uploading photos from iOS non-interactively after the fact. It was to support disconnected field workers taking photos from iPhones/iPads, with the payloads uploaded at a later date.
Chunked form data constructed in userland JS was the solution. Chunk separator was 17 dashes in a row (completely arbitrary), company name in 1337 speak, plus 17 more dashes.
Found a competitor that had copied the code, changing only the 1337 speak part. 17 dashes remained on each side. Helped me realize that they had unminified and indeed ripped off our R&D work.
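For the curious, the shape of the trick was roughly this (the original was hand-rolled browser JS, and the 1337 name below is a made-up placeholder, not our real one). The separator is functionally arbitrary, which is what makes it a trap:

    class BoundaryTrap {
        // 17 dashes, a company name in 1337 speak, 17 more dashes.
        // Any value would work as a chunk separator, so seeing this exact
        // string in a competitor's payload is a strong hint the code was copied.
        static final String SEPARATOR =
            "-".repeat(17) + "3x4mpl3c0rp" + "-".repeat(17);
    }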
Yeah. The feature set offered by the competitor was similar to ours, and we went through the wringer building that solution, so I unminified their code and sure enough... it wasn't exactly theirs.
Oh yeah and they ripped off our website too. That was the first clue haha.
If you look at the Legal Action section of your link you'll see the line "However, the case was dismissed" quite a few times. That's because data isn't copyrightable.
Edit: As sroussey points out s/isn't copyrightable/isn't copyrightable in the USA
The other problem with these "copyright traps" is that they do nothing to prove someone copied the legitimate parts of the data.
Suppose you recreate the entire dataset from scratch. Then someone notices (e.g. using an automated comparison) that the "trap" is in the other dataset but missing from yours, and submits it to you to add.
This is arguably too small an addition to be copyrighted on its own, but regardless of that, it would then be all you have to remove to get back to a clean version. And since it's erroneous data, you would want to remove it anyway.
Which country's laws apply and what remedies you can get if they were violated is far more complicated than geolocation of data.
But very broadly speaking you would need to sue in an EU court to enforce EU law. And you could sue a US company in a specific EU country's court if the company had more than some minimum level of connection to that country. The country the data is hosted in isn't key, though it can be evidence of connection to that country.
Where the data is stored does not matter much. Laws deal with people and companies, so it matters where you live or where your company operates. So if you live in the US you don't have to worry about EU laws unless you do business in the EU.
It's occasionally explained—but still not widely understood, I'd wager—that this is the reason why so much GNU code is hard to follow.
In the US legal system the merger doctrine is a concept whereby a given expression cannot be granted protection if it's not sufficiently creative, and there are only so many ways to express something when stripped down to its fundamentals. In response to this, RMS and Moglen encouraged contributors from very early on to try to express the inner workings of GNU utilities in creative and non-obvious ways, out of caution against the possibility that the copyleft obligations of the GPL wrt a given package could be nullified by a finding in court that it did not pass the threshold for creativity.
GNU code is partially hard to follow because of RMS paranoia, but that mostly manifests itself in the code being weirdly structured. The far bigger reason is that GNU code tends to run with really strange optimizations and project decisions since they want their tools to be able to run on ancient mainframes that practically nobody uses anymore, so everything is overoptimized for that.
I first saw this in action on StackOverflow when, during an interview, a candidate copy-pasted a solution verbatim including the attribution. Didn't even give it a second thought, like they didn't even read the code or what it was doing.
It wasn't the right solution to the problem in question, for what it's worth.
Copyright is limited to works that meet a certain threshold of originality [0]. It is assumed that works meeting such a threshold won’t be replicated by mere coincidence.
In the big picture, if we enter a world where an AI is instantly capable of writing code better than you do, and without effort, then I'm not sure why code should be copyrightable at all.
Copyright protects original works of authorship including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture.
Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.
Here (and in the future even more), AI is totally capable of expressing one idea in any programming language if you ask for it (even if conceptually inspired by copyrighted code).
Which means that a particular expression (a specific implementation) is practically of no value or particular interest at this stage.
You could ask the AI to do a slightly different implementation; it would not be a problem for it and would require no effort.
There is no point in protecting something that can be generated with no effort and has no particular genius in it.
We don't need to enter a world where AI gets any better at all to be able to argue that software shouldn't be copyrightable, smart people have been doing that for ages.
The problem, however, is that we live in this world, where it is copyrightable, and companies relying on Copilot to do large swathes of code generation do potentially have to worry about including copyrighted code in their codebase, and what the legal fallout from that might be.
It also creates a world where developers who have created code and specifically protected it with copyleft licenses to ensure that derivative works always remain a public good are having their rights laundered away via LLMs. I fully expect the FOSS community to fold if their rights are not respected and it could lead to a software dark age.
How often does GPL succeed in bullying people into sharing code under the same license when they wouldn't have wanted to otherwise? I imagine it mostly results in wasted duplicative effort to keep GPL out of the codebase. I'm glad more permissive licenses appear more popular now, like MIT.
This "think you wrote, but actually you didn't!", sometimes with another "actually you did, but you are looking at the code of someone who wrote it the same" happens often with people who have similar taste for solving problems. Or whose taste is influenced by the same teachers, such as you, jrockway! I've been using your open source as as one of my references for Go style. Thank you for sharing your opinionated-server, jsso2, and other projects, under the Apache 2.0 license!
In the example from the article, copilot produces identical comments, not just a functionally identical implementation. So in this case your hypothesis is false.
But thanks for trying to stand up against the open source community for microsoft. /s
I don’t understand why people have become so accepting. “Oh they’ve stolen all the public code and not provided attribution then sold it for a profit, can we just give these poor evil companies a break? It’s just progress…”.
This is completely unacceptable and another example that Microsoft is an evil and amoral company who only cares about open source for financial gain.
1) They are using your IP, with coerced consent, to check other people's work as well as your own in the future. (Let's have a fun discussion about "self-plagiarism.")
2) ChatGPT and the like are going to so massively increase the noise floor on this problem space that these counterfeit detection companies should all but disappear in a number of years.
The way copyright works, it's a violation if it was copied, but it's fine if it was generated independently. In this case I would say it's a copy, but I'm sure someone else would argue differently.
IANAL but I work alongside them. Here's an argument I've heard.
You can read the data to train a thing. So long as that thing doesn't literally copy the data into itself then the training hasn't violated copyright.
When that thing later generates an output, the output isn't copyrightable because it's machine generated (this is the current US position) and it isn't a copyright violation because it was generated, not copied.
You can launder copyrighted material through an LLM, basically.
Which, by the way, is completely untested in the courts.
Courts have decided, after a bunch of case-by-case decisions, that sampling a song constitutes creating a derivative work, and you must obtain a license from the copyright holder to do so.
It is my opinion that training a model copies, and creates a derivative work of, what you used to train it, so you must have a license to train LLMs on content. I am not a lawyer, I am no one, my opinion here is worthless.
We already know that you can create a copy of something without doing a bit-for-bit duplicate because a) copyright law existed before we had bits, and b) transcoding a movie still counts as creating a copy. Recording my own VHS of HBO and selling it is still illegal.
Most generative AI actually does have significant problems with the model copying the data into itself. Not literally - there isn't a bunch of model parameters that line up to the exact PNG bitstream of particular images. But courts wouldn't care as long as the model outputs something that looks "close enough", because the chain of provenance is clearly established from the training set, through gradient descent and the model weights, into the final output.
I have no clue how they handled this in GPT-3 or -4. Given the amount of regurgitation found in Copilot I imagine there's lots of significant code fragments floating about nominally different projects that a deduplicator wouldn't match as identical.
The source code is GPL'ed, but that page is CC BY-SA 3.0.
It's also fairly easy to assume that a fair bit of material that was copied from employers' codebases into SO (and is thus now CC) can be included in GPL code now too.
> So long as that thing doesn't literally copy the data into itself then the training hasn't violated copyright.
Good luck proving that hasn't happened. If a language model that can reproduce the code verbatim doesn't count, then a movie re-encoded into a different format shouldn't count either.
I think it's unproven either way. Courts ignore definition wars as long as both parties ignore them too. If you sued me for stealing some cheese, the justice system wouldn't care whether it actually had been cheese, or what cheese actually is, as long as I stuck to the insistence that I didn't do anything wrong in response to your accusation.
The movie re-encoded into a different format is perfectly fine if the work is sufficiently different.
Taking a file of The Wolf of Wall Street and encoding it so all the oranges are blue, but with no other changes, is bad, as that's clearly a derived work.
Taking the same file and scrambling it so it doesn't resemble the original is perfectly fine.
Watching the movie and then making your own version of the same exact plot points is infringement. But using plot points that are changed is perfectly fine.
There’s existing copyright law that prevents the makers of the movie Deep Impact from suing the makers of Armageddon.
A 420p copy of the movie is very different from a 4k version. Similarly, an encrypted copy of the file will intentionally not resemble the original in any way. Both of those distinctions would likely be ignored by a court.
Similarly, a movie that copies the plot points will likely be fine, but a song that copies the notes of another song will not be. Very different cover versions will sometimes be found to be derived works, even when they are as different as Deep Impact is from Armageddon.
I believe there are a couple of different aspects to the "AI training is legal, the same as human learning" argument:
1. Copyright is only granted to creative elements; lots of program code is supposedly un-copyrightable, though no one wants to fight on that ground.
2. It is lawful in many jurisdictions to effectively steal even copyrighted materials and train AI with them, for the sake of humanity at large; the same supposedly does not apply to the output. But AI-supportive clusters tend to conflate the two.
3. AI training processes, stochastic gradient descent and all, are only called "learning" and/or "training" by convention; there is no public consensus that it is the same as what the word usually means, though we generally don't scare-quote airplanes "flying".
On 3, the convention could just as easily have gone a different way, i.e. it could have converged on model "fitting", using the statistical parlance or the sklearn convention. Further, if you take the math seriously, most of these models are "just" fitting probability distributions to data.
Also, in part it depends greatly on the objective function used. In GPT-style models the objective is to precisely copy from input to output, token by token. I think it's extremely bad faith to argue that this has any relationship to human learning or learning objectives.
That said, you shouldn't take the math seriously either, and I'm not being dismissive with the word "just" in scare quotes. However, the community somehow wants to have its cake and eat it too.
In most countries a copyrighted work needs to be something substantial. You cannot copyright single machine instructions. It needs to be a combination that is unique.
And just the instructions are not copyrightable; you can't, for example, copyright a recipe. But you can copyright a book of recipes. So if you make a program with many instructions put together, you automatically get copyright. And if someone steals parts of your code, it will be difficult to claim copyright if those parts are used to create a new program. But if the new program is based on your program, for example a fork, or most of the code comes from your program, it's a derivative work.
> but sometimes I wonder if everyone just writes certain things the same way
> For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative.
We are at a point at which compilers detect such functions and replace them with highly optimized ones. If you have to artificially change such code just for the sake of patent or license trolls, you don't just get more work but also worse performance/optimizations in most cases.
> On a more serious note, I really wonder where the line is drawn for copyright.
As soon as you start thinking about copyright, you end up realizing it's all nonsense. Stephan Kinsella (a patent lawyer!) is the leading thinker on this, and his videos, essays, and podcasts are worth listening to: https://www.youtube.com/watch?v=e0RXfGGMGPE
> I see a lot of people claiming that AI is producing code they've written verbatim, but sometimes I wonder if everyone just writes certain things the same way. For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative.
This point is absolutely going to come up in any lawsuits; because the law does sometimes examine how much creativity there is available in a field before making a determination (Oracle v Google comes to mind). If you can show that there are very, very few reasonable ways to accomplish a goal, and said goal is otherwise not patented or prohibited, it's either not copyrightable or Fair Use, take your pick.
This even applies under the interoperability section of the DMCA and similar laws for huge projects. Assuming that ReactOS, for example, is actually completely clean-room; that would be protected despite having the same API names and, likely, a lot of similar code implementing most of the most basic APIs.
Incidentally this code doesn't have any license attached to it. So if an LLM happened to produce this code and Oracle said "where did you get that?! did you illicitly include the Java code base as part of your training data?" the organization that got the data for the LLM training can say "no, we used Hacker News comments and this code happened to have been in there verbatim... sorry."
This is interesting, but many permissive licenses still require attribution at the project or file level.
If Codeium doesn't produce these when producing "verbatim enough" snippets, how is this actually better, besides avoiding a GPL boogeyman?
I get that there have been fewer (if any? I'm not aware of any) MIT/Apache2.0/MPL2.0 license violations that have gone to court than GPL violations, but this still feels like an "address the symptoms" and not "address the cause" difference.
I wonder if, in order to deal with attribution, the system could simply build a multi-megabyte file with "this code is derived from:" followed by all the authors the system could gather from the training data set.
No, but it's a big determinant in how you get sued. Several lawyers have given the advice that the best way to avoid a lawsuit is don't be an asshole. The second best way is to spend a bunch of money on an attorney.
As an experiment, I just asked ChatGPT to "Please identify the open source projects that contain the following code" and pasted the sample from the article. Sure enough, it pointed me at the SuiteSparse library, which is correct (but not exactly where ChatGPT thinks it is). This means Codeium (and others) could theoretically use AI to identify possible attributions that should be included in a project.
Of course, if someone figures out an algorithm that does that, people could use the same algorithm to identify missing attributions and plagiarism in other projects and throw lawsuits around. (Sigh)
Perhaps copyright is what is being circumvented, not just the GPL. What Microsoft does is take your original work, create derivative works and sell them for profit. Unless it's under creative commons zero or public domain it shouldn't be legal...
Not exactly a huge distinction there, because the licenses themselves provide exemptions to copyright, so by definition you're both "circumventing" the GPL and committing copyright infringement if you copy the code and don't attribute it under the terms its license requires.
Of course, the entire basis for LLMs being legal is that they use work collectively to know how code/language works and how to write it in relation to the given context. In this case, the legal defense is that the tool is like a human that learned how to code by looking at CC-BY-SA and other licensed publicly-available code and assimilating it into their own fleshy human neural network.
This only becomes shaky once you add in regurgitating code verbatim, but humans do this too, so the solution there is the copilot setting that tries to detect and revert any verbatim generated code snippets.
You can’t claim to have an entity with human-like understanding doing the coding if you don’t grant it basic human rights.
They want to have it both ways: they want you to think the LLM is like a human because it’s “learning” (which in ML is the same word but completely different idea) so that you let them ignore copyright, but it’s not like a human of course because it can’t think, no sir (so you do not make them grant it human rights, because then they can’t exploit it like they do anymore).
> What Microsoft does is take your original work, create derivative works and sell them for profit. Unless it's under creative commons zero or public domain it shouldn't be legal...
Why should it not be legal? Doesn't that make copyright equally powerful with patents? Copyright should restrict only replication of expression not replication of ideas.
And just the same, copyleft licenses (which are not "non-permissive" in my book) such as GPL don't "mean that you cannot without consent", they just want you to share the result back under the same license (which is often an issue for some corporate projects).
If the code is being provided as output from an LLM, I don't know that they themselves could even say where the code came from. Attribution might not be possible in that case. Similarly, how might one remove say, GPL code from the model, without regenerating from all the rest of the inputs alone?
I believe that laundering licensed or copyrighted content for reuse that fails to recognize the original authors or usage restrictions is likely to be one of the biggest commercial applications of generative machine learning algorithms.
I also believe this is where a lot of the hype about "rogue AIs" and singularity type bullshit comes from. The makers of these models and products will talk about those non-problems to cover for the fact that they're vacuuming up the work of individuals then monetizing it for the profit of big industry players.
I don't think this theory holds up. Singularity concerns long predate LLMs and are mostly expressed by people who want OpenAI to stop what they're doing right now. Sam Altman has publicly disagreed with AI doomers. If you're willing to believe that OpenAI is pretending not to be concerned but is quietly hyping the concerns up, I have to wonder what standard of evidence is letting you simultaneously write off the concerns as bullshit.
For me personally it's that everyone who is expressing these concerns has clearly done less critical thinking about the subject than your average extremely high teenager. When you ask them about details they get defensive, resort to even stranger ground like "Well a human is nothing more than an autocomplete" (clearly not true).
I don't believe that rogue AIs are a threat for the next few years, but the claim that the likes of Geoffrey Hinton have done less thinking about the subject "than your average extremely high teenager" is absurd.
The fear I have isn't an AI doing things by itself, but being good enough so that if Joe Evil gets his hands on the AI, he can single-handedly (with AI help) break into secure databases, or something.
You know how a lot of us on HN talk about how security is just a latent concern for companies, but luckily there aren't enough hackers to take advantage of the massive number of bugs in every bit of code ever written? Well, a future powerful coding AI running on second-hand Ethereum mining rigs in some extremist's basement in Chicago can probably do a lot more damage than a handful of state-sponsored hackers in Russia and North Korea!
Surely some guy in his basement will have access to far worse models than the people he is trying to attack. If the AI can be used for offense it can be used for defence, especially since when used for defence you can give the AI access to code/design docs which make finding exploits much easier.
Hinton would never agree with the stuff I read on a daily basis here on Hacker News, don't even try to suggest that he's one of these weirdos I'm talking about that's huffing on the idea that ChatGPT is going to replace programmers, LLMs are sentient, and that AI is going to take over the world.
Not sure if I'd say there's a conspiracy per se, but I do think generative AI players are going to be careful about the optics of the technology and how it works. Anecdotally, from speaking to non-technical family members, there's very little understanding of how the technology actually works, and there's not a great deal of effort to emphasize the importance of training data, or the intellectual property considerations, in these companies' marketing materials.
Ok, so Sam Altman disagreed with AI doomers, great, but the point is still generally valid, for a couple of reasons:
1. What about Elon Musk and hundreds of other AI investors? It's in their interest to overhype AI, while temporarily slowing down competition by spreading singularity fears.
2. OpenAI released the GPT4 report where they claim better performance for their model than it has in reality [1].
> The makers of these models and products will talk about those non-problems to cover for the fact that they're vacuuming up the work of individuals then monetizing it for the profit of big industry players.
Also why they claim these are "black boxes" and that they "don't understand how they work". They are prepping the markets for the grand theft that's unfolding.
I think you underestimate just how careful "real" businesses are when it comes to violating the (copyright) law. Any legal advisor at any corp will strongly advise against using code that's generated like this, until there is clear legal precedent that it's OK to do this.
I don't think I've heard anyone warn people not to copy code snippets from stackoverflow due to licensing issues, although "real" businesses should be rightfully concerned.
I think you underestimate how easy it is for developers to disregard what the Corp lawyer said about AI code gen tools.
Manager: "we asked, legal says you can't use copilot", dev: "okay, so from now on, I'll not discuss how I use copilot and will remember to disable it when someone sees me working, gotcha".
I'm not saying everyone will do this, I'm saying some people will know that the corp doesn't always have a way to verify how the code was written, and they will think that a lawsuit cannot really happen to them.
> Manager: "we asked, legal says you can't use copilot", dev: "okay, so from now on, I'll not discuss how I use copilot and will remember to disable it when someone sees me working, gotcha".
Manager: "Everyone else is running through their feature list faster than you. What gives? Remember, you're not allowed to use Copilot."
IC: "I'm not using Copilot."
Manager: "Remember, you're not allowed to use Copilot."
Of course if only used on internal software that isn’t distributed, then copying GPL code is fine. Until a developer inadvertently distributes it or copies code from one place to another…
AI will just make non-permissive open source licenses more pointless than they already are. The GPL and similar licenses have been on a slow death march for over a decade. AI isn't doing anything that Human Intelligence isn't already doing. Every single developer has looked at non-permissive open source code for inspiration.
The reason people can use code for inspiration is because of GPL and similar, do you see the problem with the logic you provided?
If all software started being non-permissive and closed source, there would be no training data and no new innovation, and even if there was, it would probably suck like it did before the GPL and similar licensing were mainstream.
Why is that a non-problem? It's a really important concern that we need to take more seriously
I pasted this from another comment I wrote but:
The concerns about AI taking over the world are valid and important; even if they sound silly at first, there is some very solid reasoning behind it.
See https://youtu.be/tcdVC4e6EV4 for a really interesting video on why a theoretical superintelligent AI would be dangerous, and when you factor in that these models could self-improve and approach that level of intelligence it gets worrying…
I don't think the reasoning is solid at all. I mean yes, a theoretical superintelligent AI would be very dangerous, but I see exactly no reason to think that current models could get there.
Personally, I wasn't expecting anything as good as GPT-4 so soon. So I no longer have any real confidence in how far away 'real AI' is, whatever that means.
I would not be shocked to find out that AGI (using Altman's definition) is more than 50 years away, but I also would not be shocked if it came in 5.
It's really hard to know how scared to be, I think that rationally I should be pretty terrified but I'm not.
Well hardware and parameter count are scaling exponentially, so it seems very feasible that it could happen very soon. Of course it's possible that we'll hit a wall somewhere but it seems that just scaling current models up could be enough to get to the point where they can self-improve or gain more compute for themselves
We've been out of exponential territory for a few years now (https://en.wikipedia.org/wiki/Moore%27s_law). Yes, we are still bounding forward at a crazy pace, but I think the pace is slowing down somewhat
Hardware isn't scaling exponentially anymore (Moore's law is dead). Parameter count isn't really scaling exponentially anymore either. GPT3 had 175b parameters 3 years ago. There are some attempts at training 1 trillion parameter models, but they are not better than GPT3.
While I agree we probably aren't getting exponentially increasing parameter counts (GPT4 is by all accounts 1T parameters and, of course, it is significantly better than GPT3), we are still seeing lots of improvements - 3.5 is much better than 3, based "just" on InstructGPT/RLHF training. Models are getting better as well - LLaMA 30B beats/matches GPT-3 on raw eval benchmarks at 1/6 the parameter count.
We're also seeing lots of optimizations with new models (RoPE/RoPER embedding, Swish/GeLU activation, Flash Attention, etc) but I think some of the most interesting gains we'll be seeing soon are with inference-optimized training (-70% parameters for +100% compute) [1] combined with sparsity pruning (-50% size w/ almost no loss in accuracy) [2] and quantization [3], which will lead to significantly smaller models performing well.
It’s still exponential, but a little slower. (edit: wait, is that still exponential if it slows down?) Anyway we only need to get to human level (or maybe a bit less) and we’re not that far off (maybe 10 or 20 years at current rates of progress?)
Not all types of AI need external training data, you can train on how effectively a goal is achieved
> they've parasited, sorry, trained on the entirety of accessible human knowledge
I see this as a new development in language, used to be restricted to meat neural nets and books, now it can also be consumed and created by LLMs. A new self replication path was opened for language. Language is an evolutionary system, it's alive. Without Language humans are mere shadows of what they can be. Language turns a baby into a modern adult, and a randomly initialised neural net into chatGPT.
The magic was always in the language, not in the neural network. We should care more about the size and quality of the training dataset than the model. Any model would do, all model tweaks are more or less the same. But the data, that is the origin of all the abilities. But we cannot own abilities, it should be fair game to learn abilities and facts even from copyrighted data. Novel and creative training examples should not be reproduced by LLMs, but mere facts and skills should be general enough not to be owned by anyone.
By your logic, just pick any random bum off the street, give him the right training set, then he will become a 180 IQ genius and discover the unified theory of gravity and quantum mechanics.
Some models are just inherently better at modelling.
The training data thing is a problem mainly for LLMs, so it might be a limitation if we purely scale up LLMs but there are other types of AI around too
Chip scaling still seems to be going pretty fast, and we may discover new ways to make better use of the chips we currently have, like better methods of quantisation, or just using more of them, which could get us just far enough to reach the self improvement threshold
So we could end up hitting a wall with chip scaling or something but I don’t think it’s that likely
Really? Even a 5% generation-to-generation improvement would be exponential, it’s just 1.05 to the power of the generation. If it was linear you’d have benchmark results scaling by a fixed number of points each generation, which doesn’t seem to be a thing as far as I know
I think that part is a leap. I don't think it is a given that a superintelligent AI will "want" things.
> presumably a machine could be much more selfish
This feels like we're projecting aspects of humanity that evolution specifically selected for in our species onto something that is coming about through a completely different process.
> It's a mistake to think about it as a person.
I agree, but I feel like that's what these concerns about AI are doing, because that's what people do.
> (The whole stamp collector thing)
It also seems to me there is a huge gap between a super intelligent AI and the ability to have a perfect model of reality along with the ability to evaluate within that model the effect of every possible sequence of packets sent out to the internet.
> I think that part is a leap. I don't think it is a given that a superintelligent AI will "want" things.
But if it has no goal then it can’t act rationally or intelligently. Something like an LLM might not appear to “want” anything, but it “wants” to predict the next token correctly which is still a goal (though since it’s only related to its internal state it might be a little safer)
There’s another good video about why this would be the case here if you’re interested: https://youtu.be/8AvIErXFoH8
> This feels like we're projecting aspects of humanity that evolution specifically selected for in our species onto something that is coming about through a completely different process.
That’s because evolution is a process that optimises for a goal. The only reason altruism is a thing is because it actually indirectly benefits the goal, which is for our genes to survive and be passed on, and fellow humans tend to share our genes, especially relatives (who we tend to be kinder to). AI training is also a process that optimises for a goal, but unless having humans around helps that goal it wouldn’t display any human empathy. In this case “selfishness” is just efficiency which a training process definitely selects for
> I agree, but I feel like that's what these concerns about AI are doing, because that's what people do.
I feel like they’re doing a pretty good job at modelling AI as a theoretical agent, which does share some similarities with humans because humans are agents, but the main mistake people make is assuming their goals will be similar to humans because human values are somehow a universal truth
> It also seems to me there is a huge gap between a super intelligent AI and the ability to have a perfect model of reality along with the ability to evaluate within that model the effect of every possible sequence of packets sent out to the internet.
That’s very true, it’s an unrealistic thought experiment, but it’s a a good introduction to the concept that something significantly more intelligent than us can be dangerous and pursue a goal with no regard to what we actually wanted
> but it's a good introduction to the concept that something significantly more intelligent than us can be dangerous and pursue a goal with no regard to what we actually wanted
I think things significantly less intelligent can do this too. See any computer program that went wrong. I don't think that is a novel idea.
Perhaps it is a lack of imagination on my part, but I can't help but think, in this stamp collector example, someone would just be like "wait why are these machines going crazy printing stamps" and just like turn them off.
I feel like any argument on the dangers of superintelligent AI rests on the belief it can also use that intelligence to manipulate humans to complete any task and/or hack into any computer system.
I don't agree evolution optimises for a goal at all. IMO optimising for a goal means you first define a goal, then you work towards it.
Evolution has no goal, it's simply a process determined by chemical reactions. Any goals we attribute to it, e.g. "for our genes to survive and be passed on" are emergent phenomena, a rationalisation after the fact that that is indeed what's been observed.
It's plausible that AI "goals" emerge evolutionarily as well, but for that to happen we first need to create not AGI but Artificial Life, which is a huge leap from today, and I certainly don't understand how that's inevitable.
Then by that definition AI training has no goal, it's simply a process defined by calculations. But whether you want it call it a goal or not, the fact remains that they look very, very much like goals. "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck."
> It's plausible that AI "goals" emerge evolutionarily as well
AI training is vaguely similar to evolution, except more efficient and directed
> Then by that definition AI training has no goal, it's simply a process defined by calculations.
No, the very definition of training is that there is a goal which to train for. Those calculations were created by humans with goals. For LLMs, the goal is token prediction.
> then monetizing it for the profit of big industry players
Looks like LLMs are universally useful for individual people and companies, monetisation of LLMs is only incipient, and free models are starting to pop up. So you don't need to use paid APIs except for more difficult tasks.
It seems to me, from a copyright perspective, all commercial use of generative AI depends on whether the output is transformative fair use (vs derived work). While the courts will have its say, ultimately whether new rules are carved out or not is going to be again (as all copyright law is) based on commercial interests - I have the feeling that the potential productivity upside across all industries (and in terms of national interests) is going to be big enough that it'll work itself out largely in the favor of generative AI.
That being said, IMO, that's completely separate from the safety issues (that exist now and won't go away even if somehow, all commercial use is banned):
Urbina, Fabio, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. “Dual Use of Artificial-Intelligence-Powered Drug Discovery.” Nature Machine Intelligence 4, no. 3 (March 2022): 189–91. https://doi.org/10.1038/s42256-022-00465-9.
Bilika, Domna, Nikoletta Michopoulou, Efthimios Alepis, and Constantinos Patsakis. “Hello Me, Meet the Real Me: Audio Deepfake Attacks on Voice Assistants.” arXiv, February 20, 2023. http://arxiv.org/abs/2302.10328
Mirsky, Yisroel, Ambra Demontis, Jaidip Kotak, Ram Shankar, Deng Gelei, Liu Yang, Xiangyu Zhang, Wenke Lee, Yuval Elovici, and Battista Biggio. “The Threat of Offensive AI to Organizations.” arXiv, June 29, 2021. http://arxiv.org/abs/2106.15764.
I don't think most people have thought through all the ways perfect text, image, voice, and soon video generation/replication will upend society, or all the ways that the LLMs will be abused...
As for AGI xrisk. I've done some reading, and since we don't know the limits of the current AI paradigm, and we don't know how to actually align an AGI, I think now is a perfectly cromulent time to be thinking about it. Based on my reading, I think the people ringing alarm bells are right to be worried. I don't think anyone giving this serious thought is being mendacious.
Bowman, Samuel R. "Eight Things to Know about Large Language Models." arXiv preprint arXiv:2304.00612 (2023). https://arxiv.org/abs/2304.00612.
Ngo, Richard, Lawrence Chan, and Sören Mindermann. “The Alignment Problem from a Deep Learning Perspective.” arXiv, February 22, 2023. http://arxiv.org/abs/2209.00626.
I think Ian Hogarth's recent FT article https://archive.is/NdrNo is the best summary of where we are why we might be in trouble, for those that don't care for arXiv papers.
Of course if you include the "function header" from some code in the training data (below) it will prompt GPT to generate the rest of the function. That's kind of exactly the point of it, it autocomplete on steroids.
    // CSparse/Source/cs_gaxpy: sparse matrix times dense vector
    // CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
    // SPDX-License-Identifier: LGPL-2.1+
    #include "cs.h"
    /* y = A*x+y */
    csi cs_gaxpy (const cs *A, const double *x, double *y)
It's like starting to sing "happy birthday to you" and being surprised that people in the room join in and finish the song.
Sure they make a valid point about including GPL code in the training data, but it's a little disingenuous to go to that extent to get Copilot to output the GPL code verbatim.
The sooner we have a test case go through the courts the better.
And then they have the audacity to claim "It should be worrisome how easily GitHub Copilot spits out GPL code without being prompted adversarially" right after prompting it adversarially.
Are you suggesting that GPL licenses should be invalidated and everyone should be able to use GPL code broadly, because it's free anyway? Is this some kind of modern reversed version of Robin Hood: steal from the poor and give to the rich? Is this what you're standing for?
I think the concern is that the only reason that source attribution comment is there is because they haven't figured out how to better plagiarize/launder code.
Otherwise the tool can go in the other direction and literally say "hey how about this function from project $foo?" with a full attribution. Apparently Google Bard does bother to do that.
Given how cautious corporate lawyers usually are, I'm surprised any company allows the use of AI for code generation. The USPTO has been pretty clear that AI generated material is not copyrightable, as to qualify for copyright a work has to be the creative act of a human. So any company allowing AI to generate code runs the risk of not owning the copyright on it.
This is a common misunderstanding of the recent guidance[0] that ignores substantial portions of it.
The Copyright Office was pretty clear that works that incorporate AI-generated content can be copyrighted if there is sufficient human input. If there isn't substantial human input in judiciously curating and integrating AI-generated code, the company has bigger problems than copyright.
Here's the most relevant quotation from the guidance clarifying when AI-assisted works can be copyrighted:
> In other cases, however, a work containing AI-generated material will also contain sufficient human authorship to support a copyright claim. For example, a human may select or arrange AI-generated material in a sufficiently creative way that “the resulting work as a whole constitutes an original work of authorship.” [33] Or an artist may modify material originally generated by AI technology to such a degree that the modifications meet the standard for copyright protection.[34] In these cases, copyright will only protect the human-authored aspects of the work, which are “independent of” and do “not affect” the copyright status of the AI-generated material itself.[35]
> This policy does not mean that technological tools cannot be part of the creative process. Authors have long used such tools to create their works or to recast, transform, or adapt their expressive authorship. For example, a visual artist who uses Adobe Photoshop to edit an image remains the author of the modified image,[36] and a musical artist may use effects such as guitar pedals when creating a sound recording. In each case, what matters is the extent to which the human had creative control over the work's expression and “actually formed” the traditional elements of authorship.[37]
> In these cases, copyright will only protect the human-authored aspects of the work, which are “independent of” and do “not affect” the copyright status of the AI-generated material itself.[35]
It still sounds like there could be cases where a company only has copyright to a part of their own source code. How would outsiders even be aware of what has copyright and what doesn't in this situation? If an entire function was created via AI is that function then fair game for others to use as well?
Not Microsoft and not Copilot, but Amazon is encouraging us to use CodeWhisperer at work. That being said I don't think the CodeWhisperer model was trained on non-permissive data so maybe that's why.
That's why I said "I think", it's not really clear. That being said:
> Will CodeWhisperer produce code that looks similar to its training data
> If CodeWhisperer detects that its output matches particular open-source training data, the built-in reference tracker will notify you with a reference to the license type (for example, MIT or Apache) and a URL for the open-source project.
They seem to at least have some protections in place to prevent CodeWhisperer from spitting out existing code without attribution as shown in the Twitter thread. They also only mention MIT and Apache.
I'm not sure if this has ever been stated publicly, but it is my understanding that MSFT dogfooded Copilot a lot just before/after the launch. I'm not sure if they are still doing this, but I don't see why they would have stopped.
Is there any practical difference between owning the copyright on a badly defined ~30% of a codebase and 100% of the codebase? In either case, no sane company is going to buy the code if one of your employees tries to leak it to them, which I assume is your concern.
"So Mr Zim, you're accusing X of using your copyrighted code. But you've admitted you used AI to generate that codebase, so you don't own the copyright. Please prove exactly which lines of code you do own the copyright to?"
Depending on what logs exist I could probably find lines of code which definitely aren't AI generated. Of course, in practice I wouldn't bother and would sue using trade secret laws instead where no such issue exists.
At some point we're going to have to test this: can we go work for a company and, if their workers write code with the help of AI, just use that code for ourselves as well, since it's not copyrightable?
You have completely missed the point. We still need to know the applicable licenses of the code it is emitting, even the ones that aren't GPL. Furthermore, GPL people don't want their code to not be used; they want it to be used _within the terms of the license_. I distribute MIT and GPL code in my repos, and BOTH should have their license terms honored.
MIT licensed code still needs to be correctly attributed, just like GPL.
I don't care what license the code is that's emitted, as long as the licenses are included. It'd be nice to be able to choose to only emit code trained on particular licenses but I get that that's not easy.
It's great that they've removed "non-permissive" (GPL) code from their training data, but it looks like they still train on code with "permissive" licenses (they use MIT, BSD, Apache as examples). But don't these permissive licenses still require the copyright notice to be reproduced?
From the MIT license:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
From the BSD licenses:
> Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms...
From the Apache 2.0 license:
> You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works
I was recently working on something for a new feature on my Elixir-learning site and opened a new file called "fibonacci.ex" to write a tail-recursive fibonacci function.
The function names and documentation strings Copilot suggested were identical to the ones already on my site. Also, the site isn't under a GPL, just a standard copyright. That said, I'm curious to learn if others see the same behavior. It's possible I once opened that file locally with Copilot installed and that my own computer was its source.
Which seems fine I guess (I don't know the language), but doesn't even have comments. I prefer my files with comments. After forcing the point, I got this:
defmodule Fibonnaci do
@moduledoc """
Documentation for Fibonnaci.
"""
@doc """
Calculates the nth Fibonnaci number
"""
def fibonnaci(n) when n < 0, do: nil
def fibonnaci(0), do: 0
def fibonnaci(1), do: 1
def fibonnaci(n), do: fibonnaci(n - 1) + fibonnaci(n - 2)
end
In which I prompted the AI with everything up to (and including) @doc. So I figure it was picking it up from your computer, somehow.
EDIT: I then noticed the typo, tried it with fibonacci.ex, and got the same result.
Interesting, though, that it's bad code, considering it isn't written to be tail recursive. Maybe it's good for showing the idea of Fibonacci numbers, or as a theoretically valid implementation. But hopefully people don't accept this kind of code, thinking that because it gets the result in some simple cases and was suggested by Copilot, it must be the proper way to do things.
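For contrast, here's a minimal sketch of what a tail-recursive version could look like (the module, function, and accumulator names are my own, purely for illustration, not something Copilot suggested):
defmodule Fibonacci do
  @moduledoc """
  Tail-recursive Fibonacci: the recursive call is the last operation,
  so the runtime can reuse the stack frame instead of growing it.
  """

  @doc """
  Calculates the nth Fibonacci number.
  """
  def fibonacci(n) when n < 0, do: nil
  def fibonacci(n), do: fib(n, 0, 1)

  # The accumulators carry fib(i) and fib(i + 1); when the counter
  # reaches zero, the first accumulator holds the answer.
  defp fib(0, a, _b), do: a
  defp fib(n, a, b), do: fib(n - 1, b, a + b)
end
Fibonacci.fibonacci(10) gives 55, and it runs in linear time, unlike the naive double-recursive version above, whose runtime blows up exponentially.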
Thanks for giving it a try and sharing the results, everyone!
One other possible cause I thought of is that I did have the test file in my mix project already. If copilot looks at the corresponding file in the test dir, then it would not be a coincidence at all that all function names were identical or that it wrote a tail recursive solution instead of the naive solution that would have failed the final test.
This was inevitable. Copyright law has always used a "color of your bits" argument [0]. GPL and other libre/free/open licenses were a great hack to circumvent draconian copyright laws but the laws themselves are not designed for a rigorous treatment of similarity (maybe even by design?).
Also, it's worth noting in the example of ChatGPT emitting LGPL code without attribution or license, the code is actually different [1]. Is the difference enough to circumvent a copyright violation claim? I don't know but a big part of determining whether it does is now muddled because of the way the system was designed. Even if we could get an entropy distribution on which training data was used to generate the text, it's not even clear the courts could use it in any meaningful way.
> Copyright law has always used a "color of your bits" argument
This is an excellent point in the context of this question. Typical computer programmer responses like "but there are only so many ways to write a function that does X" or "how small of a matching section counts as copyright infringement" ignore the color of the bits.
A judge can look at ChatGPT or Copilot, decide that it took in license-limited copyrighted data in its training set, observe that a common use is to have it emit that data - to emit bits that are still colored with copyright - and tell OpenAI, or Copilot, or their users that they are guilty of copyright infringement. There may be no coherent mathematical or technical formula to determine the color of a bit, but that's understandable, because the color doesn't exist in mathematical, technical, coherent domains anyways: Only the legal domain sees color, and it can take care of itself.
That's an unkind reading. The implication is that GPL circumvents some relevant restrictions of copyright law in question by creating a legal framework to do so.
No it isn't. The GPL does not circumvent anything: it relies very heavily on the fact that the rights holders are able to license their creation as they see fit.
You continue to argue in bad faith. Copyright law is often used to prevent people from copying work. The GPL and its ilk are legal mechanisms designed to allow people to share their work.
The GPL is in no way a circumvention of 'draconian copyright law'. To spell it out: the GPL enumerates the rights and obligations of the recipients of a piece of software, and it critically relies on copyright in order to be able to do so. Without copyright the GPL would be unenforceable. So this is the polar opposite of your assertion: copyright is indeed used to prevent people from copying work without authorization, but the GPL and its ilk are designed to allow people to share their work as long as the recipients respect the terms of the license. It's the critical bits that you left out that make all the difference. Without those critical bits the GPL would be useless, and those bits have teeth only because of the existing framework of copyright law, which allows the rights holder to set the terms under which they license their work.
After all, if someone wants to share a work without preventing people from doing with it as they please, they are utterly free to do so by placing their work in the public domain or by sharing it under a permissive license.
> Copyright law is often used to prevent people from copying work. The GPL and its ilk are legal mechanisms designed to allow people to share their work.
Yes, by relying on copyright law, which enables the very existence of those legal mechanisms. Without copyright law, said legal mechanisms are worth less than the paper on which they're printed.
A central part of the GPL is forcing people to publish their modifications in source form. Without copyright law those requirements would be unenforceable and you would be stuck de-compiling encrypted/obfuscated binary blobs of your vendors customized copy of gcc.
Legitimate question: Microsoft does not seem to care about Copilot violating licenses, and the GPL appears to be toothless, since many companies use GPL code without following the terms of the license and nothing happens to them. So what does removing GPL code accomplish other than making a weaker product? I have not used Codeium, but my assumption is that GPL code makes up a very significant share of open source code, so removing it must have some ramifications?
"as it will allow you to launder code automatically through an LLM"
No it won't; obviously, if it copies code exactly then you can't use that. The question is whether Microsoft is liable for the fact that Copilot can sometimes output copyrighted code, or whether people using it just need to check that it hasn't done that before using the code (Copilot can also do this automatically).
Google can also show you GPL code in its results, but people aren't trying to sue Google and the user is responsible for checking the license before using it (though Copilot makes this harder)
Disclaimer: I haven't read much about the actual lawsuit and I'm not a lawyer but I assume this would be the case
> however if the suit is successful: every company/individual that has used it is likely suddenly liable for millions of claims of copyright infringement
Only if you can prove that you are the copyright owner of the original work.
That might be a challenge for many open source projects. Even projects that require copyright assignment might not have sufficient paperwork to prove this in a court of law. The copyright might not even have been the persons to assign in the first place.
You would also face the burden of proving that the fragment that Copilot generated was sufficient to be copyrightable in the first place. The limited grammar of most programming languages would probably make proving that something was copyrightable at the function level hard. Just because the entire work was licensed under the GPL, it doesn't necessarily follow that all the individual fragments when separated out are.
Outside of sampling, this is an area that the courts have largely punted on for good reason. It's a rabbit hole nobody wants to go down.
Either outcome opens up a huge can of worms that I suspect nobody really wants to touch because it likely ends in mutual destruction.
Humans already launder GPL code by using it to learn and then producing code based on what they learned. We're very close to LLMs doing the same thing. Maybe there's some fundamental difference between humans doing that and humans programming machines to do it, but I can't see it.
I don't know how GPL (or copyright in general) can survive in the long run with these technologies.
I think the biggest fundamental difference is that we respect human creativity even when it is learned, because of the value in rights we ascribe to humans. It is expected and natural for humans to fairly use things they have learned as a part of an otherwise unique work. But do we ascribe the same privileges to a machine, particularly when it can be automated? The opportunity cost of the human experience itself is the reason why we even have copy rights.
Er, I'm not sure that's laundering; if a human takes AI-generated code and munges it to look human-generated, they applied enough creativity that I would expect them to legitimately have copyright on the result. I mean, I'm not a lawyer, but the bar is pretty low for qualifying.
You can't "launder code through an LLM". You're just violating the copyright. That's like "laundering code through your clipboard". It's just a tool. You're the one responsible.
> if the suit against copilot fails then the GPL is effectively dead
Not really, because the GPL can be updated with a clause that allows code under GPLv5 (or whatever the version is going to be) to be used to train public LLM models, but explicitly forbids using it to train private models.
I somehow don't think this is the end of the GPL... Yet!
if it's fair use then it doesn't matter what's in the license
Microsoft's position on Copilot is that it's fair use:
> When questioned, former GNOME developer and (at the time of writing) GitHub CEO, Nat Friedman, declared publicly “(1) training ML systems on public data is fair use (2) the output belongs to the operator”.
> I don't see why you couldn't do the same thing to e.g. the binary of the Windows kernel
Forget the binary; there have been Windows code leaks every now and again over the years. Feed one of those into a model, start generating code for ReactOS, and see how long until MS decides that actually AI is infringing...
>Microsoft does not seem to care about Copilot violating licenses
Hmm, then perhaps LLMs should be trained on leaked Microsoft code: protocols, drivers, or any kind of material that could contribute advances toward running Windows things within Linux.
Microsoft would react by establishing its own limits, whichever option it chose to take.
> that could contribute advances for executing Windows things within Linux
I very much doubt that is a threat to Microsoft.
It is technically very straightforward to run Windows “things” under Linux thanks to virtual machines and/or RDP to a server and some UI trickery to make it seamless and facilitate interoperability between the two OSes. Parallels does quite a bit of that on macOS for example. A similar solution would be developed for Linux if there was enough demand for it.
I don't think Copilot violates GPL because it is a web service
I think the problem here is that by auto-completing GPL code for developers, it might open up the possibility of your company getting sued for using GPL code illegally.
the problem is that violating the GPL doesn't turn into financial damage, because the product is free. So not following the license means there's no way to recover the cost of a lawsuit.
Punitive damages can occur even if no financial loss has occurred.
I would also imagine those companies whose business is built around the open source development they do -- open core, SaaS, or otherwise -- would have a claim to financial damages as a result of stolen code.
I mean sure, if those companies are doing the same thing Copilot is. But it's not clear that financial damage was done, and it would have to be proven in court against Microsoft-funded lawyers, with the support of all the other companies trying to reframe intellectual property rights to allow their ML networks and all the spinoff businesses they produce.
I don't understand the issue here. You input GPL code (the headers) and get GPL code out, what do you expect?
The more insidious issue would be if you started with an innocent-seeming function that a typical software developer would write, and ended up with GPL code. Has anyone shown that to happen?
Anyone talking about copyright in this thread without discussing a potential for how a court will apply fair use is talking nonsense and should be disregarded.
I think their comment was to the contrary, that the copyright/legal implications of 'stolen' code could seriously hobble the wider development, proliferation, adoption, and commercialization of AI software.
Maybe I misunderstood, but the comment seemed to dismiss copyright issues as a cheap way to kill AI ("soft underbelly"). I think stealing code is a pretty serious deal and the onus is on AI software companies to make sure they aren't doing it; it's not "slowing the development of AI" to keep them accountable.
Are you seriously arguing that using short snippets of open source code to inspire similar, yet not exactly the same, original code is "stealing code"? Human developers do that all day long. And just because a piece of code exists in a GPL project doesn't mean it originated there. Every algorithm or sort function likely originated in a more permissively licensed project before it got included in a GPL project.
What happens if I (a human) read GPL code and then reuse the knowledge gained from it in my own commercial projects? It's not as clear cut as you make it sound.
It could be as clear-cut as you've just made it: "a human". An LLM is not a human.
You could get into the semantics of "learning" - does JPEG encoding count as the computer "learning" how to reproduce the original image? But trying to create some metric for why LLMs "learn" and JPEG doesn't "learn" on the basis of the algorithms is a philosophical endeavor. Copyright is more about practicality - about realized externalities - than it is about philosophy. That's why selling cars and selling guns are regulated differently, despite the fact that you could reduce both to "metal mechanical machines that kill" by rhetorical argument.
Even from a strictly legal perspective, it actually is fairly clear-cut. The answer to "what if I (a human) read GPL code and then reuse the knowledge gained from it..." comes down to a few straightforward properties of the license. GPL doesn't cover "reduced to practice" as many corporate contracts do, so terms covering "the knowledge gained" are lenient. GPL covers "verbatim" copies which is what LLMs are doing, that's as clear cut as it gets. Inb4: "So what if I add a few spaces here and there?" - well, GPL also covers "a work based on"; this is where I (who am not a lawyer) can't speak confidently, but surely there are legal differences between "based on" and "reduced to practice", considering that both are very common occurrences in contracts, so there actually would be a lot of precedent.
I agree with you that verbatim copies are obviously covered by copyright. What if LLMS reproduce code with changed variable and function names (which would be a great improvement to `cs_gaxpy` in the original article)? What if just the general structure of an algorithm is used? What if the LLM translates the C algorithm from the original article into Rust? This discussion is only scratching the surface.
It's going to be an uphill battle just to get people to even understand what the problems are. And this is even a technical forum. Now imagine trying to explain these nuances to a judge or jury.
It's not so much an ability to understand as it is a desire to not understand in order to be able to ignore the rightsholders' licensing terms.
Plenty of tech companies exist by putting a thin layer on top of the hard work of others and if those others can be ignored then that's what they'll do.
> 4. Defense of Third Party Claims. If your Agreement provides for the defense of third party claims, that provision will apply to your use of GitHub Copilot. Notwithstanding any other language in your Agreement, any GitHub defense obligations related to your use of GitHub Copilot do not apply if (i) the claim is based on Code that differs from a Suggestion provided by GitHub Copilot, or (ii) you have not enabled all filtering features available in GitHub Copilot.
> If your Agreement provides for the defense of third party claims
do any of them?
it also states:
> You retain all responsibility for Your Code, including Suggestions you include in Your Code or reference to develop Your Code. It is entirely your decision whether to use Suggestions generated by GitHub Copilot. If you use Suggestions, GitHub strongly recommends that you have reasonable policies and practices in place designed to prevent the use of a Suggestion in a way that may violate the rights of others. This includes, but is not limited to, using all filtering features available in GitHub Copilot.
I think it's pretty clear. If you're not filtering, you're liable. If you are and something transpires, they'll fight your legal battle for you which is probably better than any monetary indemnity clause. I assume this is for enterprise users where it actually matters.
Looks like the code in https://github.com/ChRis6/circuit-simulation/blob/2e45c7db01... is older than the GPL code in question provided by the example. Uh oh, did we discover something? Who actually owns this code? According to git blame it predates the code in question by a calendar year, it's by a different author, and the oldest version has no license attached. Is it possible the code in the codeium.com example is relicensed and not GPL code at all?
How many times are we going to go through this before we accept that nobody involved in generative AI cares about pesky things like licenses and copyright?
One of the main reasons corporations love it so much is because it effectively lets them profit off of the work of others with no consequences.
Seriously, let's get back to the good old honest model of paying outsourced Indian programmers $2.50 an hour to retype GPL code or copy and paste it from Stack Overflow into our codebase.
A truly attribution-free license that checks several other important boxes (disclaiming liability and warranty etc.)
If you want your code to be usable by things like github copilot, consider using it (can't imagine most of the HN crowd wants their code used by copilot, but maybe some lurkers here do!)
I recommend the Jollo LNT license for all your pointless theatrical "copyright" needs. It does not use swear words, unlike the "WTFPL", and is even more ambiguous. I've tried submitting it to the FSF before for review, but they were confused by it: http://jollo.org/LNT/doc/licensing
Copyrighting code never made sense to me. We already have patents for intellectual property. If two people use the same RFC or whitepaper for an algorithm in the same language, they will probably name the variables similarly and their code will look very similar, just as two people writing out the same hamburger recipe or the same instructions for hooking up a stereo would produce something similar.
The copyright on the implementation will outlive the patent and allow the implementor to legally take action on claims of copyright infringement. Even though a program is literally just a list of instructions to implement the expired patent.
Copyright protects not the idea, but a specific implementation of it. It's there to prevent unauthorized copying of software. Not every piece of software has to be novel enough to be patentable, but it may still take effort to write the millionth-and-first JS framework.
If you take someone else's software without a license and rename variables, it will be a copyright violation, because you've copied (and then modified) it without permission.
But if you write your own software from scratch, even if it happens to be almost identical to someone else's code, that's fine. You've done your own work and a copyright owner can't stop you from doing that. They control their own work only.
As you can see, this is very much tied to human work and intent, since the concept has been invented long before ML existed. This is why ML "learning" and doing "work" is so controversial and appears to be a loophole in copyright.
I want to see a solution where Github, OpenAI, Stability, etc. get to keep and keep scraping copyrighted works, but the models and training data must be provided free and open.
That way, we get to keep the models, since they are genuinely useful, but there's no issue with copyright and less of an issue with consent to distribute (which can hopefully be managed by the "humans also learn from data" argument and the fact that it's not actually producing your content verbatim unless it follows a basic pattern anyone could discover). And furthermore, no issue with AI being privatized, which IMO is my biggest concern with these new tools.
So I see it in a similar way: why the fuck do Microsoft and OpenAI get to be the sole beneficiaries of basically the sum total of all human intellectual output?
It's absolutely ridiculous on so many levels. These models may claim so many jobs and have a serious negative impact on so many people's lives, yet basically one company owns the model?
No court has said AI ingesting open-source code is "fair use".
Almost all open-source licenses say it can be copied for use in development (i.e., not for re-publication or regurgitation), and even completely open licenses are speaking to people as readers.
The only reason this is happening is coordination costs: a few extremely motivated people with tons of resources are copying from many, many people who would be difficult to organize and have little at stake.
Unfortunately, the law typically ends up reflecting exactly these imbalances.
Once AI can write decent code from scratch, it is likely it can also circumvent potential copyright violations.
A. Check AI generated code against a comprehensive library of open-source copyrighted code and identify potential violations.
B. Ask AI to generate a paraphrase of the potential violations, by employing any number of semantic preserving transforms -- e.g. variable name change, operator replacement, structured block rewrite, functional rebalance, etc.
Lazy example:
// Original (verbatim):
private static void rangeCheck(int arrayLen, int fromIndex, int toIndex) {
    if (fromIndex > toIndex)
        throw new IllegalArgumentException("fromIndex(" + fromIndex +
            ") > toIndex(" + toIndex + ")");
    if (fromIndex < 0)
        throw new ArrayIndexOutOfBoundsException(fromIndex);
    if (toIndex > arrayLen)
        throw new ArrayIndexOutOfBoundsException(toIndex);
}
// Paraphrased (same checks with different names, ordering, and messages):
private static void rangeCheck(int len, int start, int end) {
    if (!(0 <= start)) {
        throw new ArrayIndexOutOfBoundsException("Failed: 0 <= " + start);
    } else if (!(start <= end)) {
        throw new IllegalArgumentException("Failed: " + start + " <= " + end);
    } else if (!(end <= len)) {
        throw new ArrayIndexOutOfBoundsException("Failed: " + end + " <= " + len);
    }
}
This feels like it would make the situation much worse from a legal perspective.
If you know your AI produces code that is "tainted" by license violations, adding code to hide it after the fact suggests that you're intentionally violating the license terms.
This is Hacker News so the conversation is obviously slanted towards code, but I wonder what the perspective would look like for other structured works, like books? If an author is using a "copilot for writers" and the AI emits text verbatim to another work, then I would think it would be plagiarism. If the text emitted is similar, but not the same, then I would think it would be considered paraphrasing which still requires attribution.
Maybe slightly off topic, but I'd be willing to bet most people who choose the GPL as the license for their open source projects don't even understand it, with all its ambiguities and gotchas. Many are probably just choosing it because it's the default, or because it's the one they hear about the most (but still don't understand).
Can't believe we still spend time debating this license and nobody, not even lawyers at software companies, seem to get it.
It seems like a stretch to argue that the model isn't "a work based on" GPL code when that GPL code is an input to a deterministic algorithm from which the model is produced. So, my bet is on point #1.
The only ambiguity as far as I can tell is GPL covers "source code", "machine-readable Corresponding Source", and "object code form", and it's not explicit whether vector-fields count as any of those things. I doubt anyone would seriously argue that zipping and then un-zipping some GPL source code means you don't need to respect the original license. LLMs are different in that they're lossy compared to the zip format - does the nature of this lossiness invalidate the intent of the GPL's original language? I doubt it.
the article cites that 6mo tweet that everyone else cites. I don't think it is known if the user had public code suggestions turned off at the time, either; he wouldn't/didn't answer the question at the time.
Also if I am remembering correctly, and I make no guarantee that I am, this tweet is from a person with a strong dislike for Microsoft, and if I am right about that, I would not put it past this person, or anyone else with a strong dislike of Microsoft, to craft a situation to make Microsoft look bad solely to hurt Microsoft.
I've tried to make Copilot give me GPL code snippets while I have "suggestions matching public code" set to "blocked" and I can't make it happen.
so even if this was a problem 6 months ago, it would take some convincing to get me to believe that this happens today.
Even if you sample stuff from programs that use a permissive license, you still legally need to attribute that code. No attribution = copyright infringement. Can the AI code generator supply attribution for the specific works sampled?
I submit that this arms race will not slow down and that in the long run no one will end up caring about the licenses this was generated from (i.e., software licensing is from a bygone age already).
I too would prefer that these sorts of things cite sources and the licenses correctly. Will it get mired in legal battles? You bet. Will it get regulated? I assume they'll try! Will it slow down progress of code generating / auto-completing agents? My argument is nope, cut off heads of the hydra if you'd like but it's not going away at all.
Spend your day worrying about something else. This train has left the station.
Makes you wonder how many public repos you would need to seed with a carefully crafted attack/weakness in a common feature/pattern to start effectively poisoning codebases that are leaning on copilot
Let’s write some regulations that say every code review must require a lawyer to comb through the code and look for possible copyright violations or compliance issues. The lawyer can then tell the author to change the lines of code and submit for review again.
Or perhaps every company can just invent its own programming language and translate copyrighted code into the new language and thus avoid copyright issues altogether, though they may still run afoul of software patents.
Also, I wonder how this will hold up with certain technologies. For example, apps written with Qt or GTK are very likely to be GPL licensed, unlike apps written in JavaScript, which are often licensed under MIT. The likelihood of Copilot/ChatGPT spitting out GPL-licensed code is thus quite a bit higher for Qt/GTK projects...
You are allowed to read others' code to learn from it, regardless of any license being accepted, offered, or rejected. You must do so within fair use, which is for a court to decide based on the individual case factors.
Saying an LLM violates an attribution requirement is a bad legal argument.
>researchers say LLMs rarely spit out training data verbatim unless interacted with adversarially, but theoretically, they could.
Theoretically they can generate any arbitrary snippet of code (if it correctly fits the distribution), regardless of whether or not the code was in the training dataset.
There is no such thing as "GPL code" or any other "$license code". This is a fundamental misunderstanding of what a license is. The code in question was licensed to GitHub under a different license - possibly fraudulently.
I personally hope that we bring a lawsuit against an LLM company for emitting GPL licensed code and lose. It sets great precedent for FOSS.
Focusing on the GPL license is probably the wrong move. We want to set precedent that _any_ licensed code that is emitted from an LLM is fair game. If an LLM to emits non-FOSS copyrighted code and it's fair game, I can blindly use that implementation in my code, including FOSS code, and everyone wins.
GPL was a clever hack to use copyright against itself with an infectious license. LLMs might be a better hack. Wanting to block this seems short-sighted if the goal is giving users agency over machines.
I'd also like to see more patent defenses of GPL licensed code. If you can release a GPL licensed implementation and block non-FOSS rewrites through patents, that's a huge win for software freedom.
This comment falls into the classic programmer fallacy of thinking you can hack the law with a technicality. If you are using an LLM designed to violate copyright with the intention of violating copyright, and you then violate copyright, a judge is going to find you in violation.
I'm generally in support of LLMs though and I think that they will very quickly be trained to remove verbatim duplication of the kind that a human would consider copyright violation while still using verbatim duplication where it makes sense (for example, every function in python has the word "def" in front of it).
I’m not looking to explicitly launder copyright. I’d like to be blind to it. I don’t want to explicitly use an LLM to remove copyright. I want to use an LLM to build software systems without having to cross reference its output with every line of code ever produced under a license to see if it’s already copyrighted.
> Focusing on the GPL license is probably the wrong move. We want to set precedent that _any_ licensed code that is emitted from an LLM is fair game.
If anything goes to court, that's what would happen. It's not "this is GPL code and they did not attribute", it's "they violated my copyright. As a side note, we license this code as GPL and they did not attribute in accordance with this license, so that's irrelevant". It would only be an actual license issue if they tried something like "license (C) at codium.com/all_licenses_dataset0423".
> GPL was a clever hack to use copyright against itself with an infectious license.
This is a naive understanding and interpretation of the GPL, in all its flavors. Or maybe I misunderstand your argument.
The copyright owner of some work is free to offer that work under multiple, different licenses in parallel, to their liking.
They can leverage GPL strategically for e.g. providing a free, easy-to-evaluate library with the "if you use it under GPL terms, you have to GPL your work as well" condition/caveat.
For any library user / customer that does not want to be bound to the GPL terms (e.g. a closed-source software which a company does not want to share for free with their own paying customers and competitors), the copyright owner is free to offer an alternative proprietary commercial license.
This is only one way how GPL can actually leverage copyright and use it financially beneficially to the owner, rather than use "copyright against copyright".
It is interesting to see coders starting to express the same complaints artists had a year ago when AI image making became really, really good, by training on copyrighted art.
Yeah, when you start with dozens of words replicating exactly a source file it is much easier to get a regurgitation. You can't prefix so deeply and then complain.
I believe new "AI-permissive licenses" will pop up in the near future, or existing licenses will add a clause about training AI on the code.
But you need billions of lines of code to train an AI, and most existing code can't just be re-licensed overnight. So that would still kill all code-related AI projects for the next decade, if not longer.
Easy solution: Just make it generate intentionally obfuscated versions of the same functions. Throw in some valid syntax that humans would never consider to use. Break up functions into smaller sub functions. If the LLM has intricate knowledge of the compiler used, it could even generate code which it knows will produce identical bytecode.
Now the only loser is the humans that still have to maintain the ugly code, and RMS can have his weaponized copyright and eat toejam too.
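A toy illustration of the "break it into sub functions" idea, reusing the Fibonacci example from earlier in the thread (the module and helper names are made up, and this is not a real Copilot or Codeium feature): the behavior is identical, but the text no longer matches the original line-for-line.
defmodule FibObfuscated do
  # Same results as the naive version, but the body is scattered across
  # helpers so a verbatim-match filter is unlikely to flag it.
  def fibonacci(n) when n < 0, do: nil
  def fibonacci(n) when n in [0, 1], do: base(n)
  def fibonacci(n), do: combine(step_down(n, 1), step_down(n, 2))

  defp base(n), do: n
  defp step_down(n, k), do: fibonacci(n - k)
  defp combine(a, b), do: a + b
end
Whether that kind of mechanical rewrite actually escapes being "a work based on" the original is exactly the legal question raised upthread.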
Copilot currently has great plugin integrations for a number of editors and IDEs. I'm sure the same kind of tooling is in the works for ChatGPT but it's not as mature.
How to get a new AI powered software tool high up in hacker news? Mention GitHub Copilot, the equivalent of the abortion debate but for software engineers (everyone is certain to disagree and debate endlessly without swaying any opinions). This post seems like an advertisement for codeium. It wouldn't need to mention anything about Copilot at all and would be just as complete. My 2 cents, click bait & flame war trolling.
Human brains emit GPL code too (probably) if you've looked at enough of it. Heck, some humans intentionally study GPL code and then rewrite it with a slightly different implementation to get around the license.