Hacker News

Howdy, folks. Ryan here from the GitHub Copilot product team. I don’t know how the original poster’s machine was set up, but I’m gonna throw out a few theories about what could be happening.

If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.

It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data. Especially when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions. To ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.

This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse-engineering happening in the community which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.




This doesn’t at all address the primary issue, which is one of licensing.

Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.


It does address it, although not that clearly. This happens all the time with news media. They will post a picture and say they got permission from X person, but X person didn't actually own the copyright in the first place. That doesn't make any of it okay, but it does mean the organization has legal cover, and the worst that will happen is that they'll have to take the content down. In GitHub's case, if that same code snippet is found in other repos with different licensing, then it's difficult to prove who really owns the copyright; it's a legal issue between the original copyright owner and the person who re-distributed the work. They can submit a DMCA takedown notice for the other repos. But it's pretty unlikely GitHub gets into any legal trouble as long as they can prove they got the snippet from someone else.


If that's true, then GitHub is just "washing its hands". Not at all reassuring for copyright holders and users of Copilot.


That code seems to appear in thousands of repositories on GitHub, I’m sure some of them haven’t copied the license.

The vast majority of people who would use a matrix transform function they got from code completion (or from a GitHub or Stack Overflow search) probably don’t care what the license is. They’ll just paste in the code. To many developers, publicly viewable code might as well be in the public domain. Copilot just shortens the search by a few seconds.

Microsoft should try to do better (I’m not sure how), but the sad fact is that trying to enforce a license on a code fragment is like dropping dollar bills on the sidewalk with a note pinned to them saying “do not buy candy with this dollar”.


I still remember the days when we had billion-dollar lawsuits over 20 lines of code (Oracle v. Google).

If CoPilot makes everyone see how ridiculous that is, that's a win in my book.


What’s the most GitHub could reasonably be expected to do? If multiple licenses are found for the same code, maybe it should be flagged for review or the most restrictive license applied.


If it's possible for video and audio content (Content ID on YouTube), then I don't see why it shouldn't be possible for OSS.


Do we want that though? I personally believe copyright as implemented today is harmful. The fact that code largely is able to dodge this could be seen as arguing we should be laxer with copyright, rather than arguing for strict enforcement of copyright on code.


The point is that CoPilot should not emit a word-for-word copy of someone else's work because that is called plagiarism.


Check timestamps of commits of replicated code to find the original.


Timestamps of commits can't be trusted, just like commit authors.

Github can only trust push timestamps.
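
To make the parent's point concrete: a commit's author and committer dates are client-supplied metadata, so anyone can backdate a commit. A minimal sketch in Python (creates a hypothetical throwaway repo; assumes `git` is on the PATH):

```python
import os
import subprocess
import tempfile

# Create a throwaway repo and commit a file with forged dates.
repo = tempfile.mkdtemp()
subprocess.run(["git", "init", "-q", repo], check=True)
with open(os.path.join(repo, "f.txt"), "w") as f:
    f.write("hello\n")

forged = "2005-01-01T00:00:00 +0000"
env = {
    **os.environ,
    # Both dates are read from the environment -- nothing verifies them.
    "GIT_AUTHOR_DATE": forged,
    "GIT_COMMITTER_DATE": forged,
    "GIT_AUTHOR_NAME": "Anyone",
    "GIT_AUTHOR_EMAIL": "anyone@example.com",
    "GIT_COMMITTER_NAME": "Anyone",
    "GIT_COMMITTER_EMAIL": "anyone@example.com",
}
subprocess.run(["git", "-C", repo, "add", "."], check=True, env=env)
subprocess.run(["git", "-C", repo, "commit", "-q", "-m", "backdated"],
               check=True, env=env)

# The log now claims the code existed in 2005.
date = subprocess.run(["git", "-C", repo, "log", "-1", "--format=%ci"],
                      capture_output=True, text=True, check=True).stdout.strip()
print(date)
```

Push timestamps, by contrast, are recorded server-side, which is why they're the only ones GitHub can trust.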


That would only work if the original was uploaded to GitHub before the copies. Like, somebody could copy from GitLab or BitBucket. And git histories don’t always help if they’re not copied over.


But copyright law doesn't really care about how you prevent infringement, just that it doesn't happen. Isn't it up to Github to come up with a way to do it, or otherwise not do it at all?


GitHub just needs to show they have taken reasonable precautions, and if a conflict is identified, that they remediate it without undue delay.

It’s not a binary all perfectly or nothing at all. The law looks at intent and so doesn’t punish mistakes or errors so long as you aren’t being malicious or reckless or negligent.


GitHub is protected here, though not by Section 230 (which explicitly carves out intellectual property law); the relevant shield is the DMCA's Section 512 safe harbor for hosting providers.

Under that safe harbor, merely hosting user-uploaded copyrighted content is not a copyright violation for GitHub, so long as they respond to takedown notices. They're not obligated to preemptively determine who the original copyright owner of some piece of code is, as they're not the judge of that in the first place. Even if you complain that someone stole your code, how is GitHub supposed to know who's lying? Copyright is a legal issue between the copyright holder and the copyright infringer. So the only thing GitHub is required to do is respond to DMCA takedown notices.


Yes. GitHub can get away with "oh well, we're all learning" because if the code is violating copyright, it's the user who is infringing directly by publishing it, not GitHub via Copilot. Either the user would have to bring a case against GitHub demonstrating liability (good luck) or the copyright holder would have to bring a case against GitHub demonstrating copyright violation (again, good luck). Otherwise this is entirely between the copyright holder and the Copilot user, legally speaking.

Of course, if someone does manage to set a precedent that including copyrighted works in AI training data requires an explicit license, GitHub Copilot would be screwed and at best have to start over with a blank slate if they can't be grandfathered in. But this would affect almost all products based on the recent advancements in AI, and they're backed by fairly large companies (after all, GitHub is owned by Microsoft, a lot of the other AI work traces back to Alphabet, and there are plenty of startups funded by huge and influential VC firms). Given the US's history of business-friendly legislation, I doubt we'll see copyright laws being enforced against training data unless someone upsets Disney.


Do you think that as part of this GitHub discovered that essentially everyone was in violation of copyright? That copyright on material without public knowledge or review (which exists in music, but not for most code) is basically unenforceable?

Then they decided to wade in and build a house of cards where the cards are everyone else’s code, just waiting for someone to pull the grenade pin, and we’ve potentially witnessed that moment?

That’s the only thing that makes sense to me here. They don’t care because opening the issue will bring down everyone else with them.


Yeah, so if a news agency publishes a picture without knowing where it came from, the originator can sue them for violating copyright.

There is no “I don’t know who owns the IP” defense: the image has a copyright, a person owns that copyright, and publishing the image without licensing or purchasing it is a violation. Statutory damages in the US can run as high as $150k per work for willful infringement.


FWIW, this consequently means you can't legally use Copilot without exposing yourself to copyright liability: it's essentially a black box, you have no insight into where the code it generates originated, and even if it isn't a 1-to-1 copy it might be a "derivative work".

This is why I'm gnashing my teeth whenever I hear companies being fine with their employees using Copilot for public-facing code. In terms of liability, this is like going back from package managers to copying code snippets off blogs and forum posts.


> using Copilot for public-facing code

Why this restriction to public-facing code? Are you OK with Copilot being used for "private"/closed-source code? I get that it would be less likely to be noticed if the code is not published, but (if I understand right) it's even worse for licensing reasons.


I don't advocate people use Copilot for anything but hobby toy projects.

I have lower expectations of the rigor with which companies police their internal codebases, though. Seeing Copilot banned for internal use too is a pleasant surprise. Companies tend to be a lot more "liberal" in what kind of legal liabilities they accept for their internal tooling in my experience.


Turn the parties in this argument around and see if you think it still holds.

J. Random Hacker acquires and uses a copy of some of GitHub's, or Microsoft's source. When sued, the defense says that the code was not taken directly from GH/MS, just copied from a newsgroup where it had been posted. Does this get J. off the hook?


Was J using automated methods based on false claims of ownership by the newsgroup posters, with no direct knowledge of the violation? If so J should not be punished.


I may be misinformed but my understanding of copyright is that it protects the 'expression' of something (like an algorithm or recipe) so someone can rewrite a copyrighted chunk of code into another language and be free of the original copyright, while also able to assert their own copyright on their new expression.

If that is true then one way to get around copyright restrictions on existing code is to create a new language.


Fascinating idea. Copilot could do the translations internally and also work towards widening the pool of suggestions to all languages instead of the individual language a user is using (but then again, they might be writing in the "new" language already).


> Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

If you do something, it's ultimately you who has to make sure that it is not against the law. "I didn't know" is never a good defense. If you pay with counterfeit cash, it is you who will be arrested, even if you didn't know it was counterfeit. If you use code from somewhere else (no matter if it's by copy/pasting or by using Copilot), it is you who has to make certain that it doesn't infringe on any copyright.

Just because a tool can (accidentally) make you break the law, doesn't mean the tool is to blame (cf. BitTorrent, Tor, KaliLinux, ...)


BitTorrent doesn't automatically download a pirated copy of Lion King when you ask it for something to watch...


BitTorrent (and, to a larger degree, eDonkey) did and still does that. Who tells you that what you're downloading is indeed what you think it is? You can click on a magnet link that claims to download a Debian ISO just to find out later that it's something else entirely. To make matters worse, BitTorrent even uploads to potentially hundreds of other clients while you're still downloading. So while downloading something might not be illegal in your jurisdiction, uploading/distributing most certainly is, and you can get into lots of trouble for uploading (parts of) a copyrighted work to hundreds or thousands of other users.


> You can click on a magnet link that claims to download a Debian ISO just to find out later that it's something else entirely

This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo.

On the web that sort of thing is actually common, but BitTorrent? I have never downloaded a torrent to find it was something other than what I expected. Never have I seen a movie masquerading as a Debian ISO. That's nothing more than a joke people use to make light of their (deliberate) copyright infringement.

Furthermore, is there even any BitTorrent client that will recommend copyrighted content to you, rather than merely download what you tell it to? I've not seen one. Search engines, in my browser, do that sort of recommendation, but BitTorrent clients do what I tell them to. Including seeding to others, which is optional but recommended for obvious reasons.


> I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo

Sorry, what?

Downloading copyrighted content is very, very rarely the problem.

It's the uploading (the sharing!) of copyrighted content where you actually get into trouble.


If you actually care, then simply configure your client to leech. Every client I've ever used or heard of supports this.

But more to the point, getting tricked into seeding a copyrighted movie by a torrent masquerading as a Debian ISO isn't something that actually happens. That's absurd FUD.


Errm, you posted:

> "This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo."

No-one cares whether you download an open-sourced photo of a cat or a copyrighted photo of a dog.

Why would anyone claim that?

It's a terrible comparison to torrents.


BitTorrent is certainly not a good example to follow, but I do think that copilot is more wrong.

They should definitely include disclaimers and make seeding opt-in (though I don't know how safe you are legally when you download a Lion King copy labeled Debian.iso). That said, they don't have the information necessary to tell whether what you're doing is legal or not.

Copilot _has_ that information. The model spits out code that it read. They could disallow publishing or commercially using code generated by it while they're sorting it out, but they made the decision not to.

AI is hard, but the model is clearly handing out literal copies of GPL code. Github knows this and they still don't tell you about it when you click install.


It doesn't matter if the information is there or not, since an algorithm cannot commit a copyright violation. There is at least one human involved, and the human is the one who is responsible.

A car has all the information that it's going faster than the speed limit, or that it just ran a red light. But in the end it's the driver who is responsible. It's not the tool (car, Copilot) that commits the illegal act, it's the user using that tool


In the case of Copilot, you don't even have a speedometer.


So your point is that removing the speedometer from your car and then claiming "I didn't know I was driving too fast!" will make it somehow not your responsibility?

It is still your responsibility to know and obey the traffic laws, the same as it is your responsibility to obey the copyright laws....


Yet, plenty of tools of this caliber have been made illegal (in some parts of the world).


Indeed, and people always (rightfully) complain loudly against the outlawing of these tools, and in many cases they have been successful. Yet here it's the opposite for some weird reason.


I don't know whether the "Numerical Recipes" publisher actively defends their copyright of the code in the books but it would be an interesting test case.


> Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute, combined with your own mental routine of "I've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem. Maybe the solution is Copilot accompanying each generation with a URL containing all of the run's weights and traces, so that a court can unlock the URL upon court order to investigate copyright infringement.

> If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.

In general you're not liable for this. While you will likely still have to go to court with the original copyright holder, any damages you pay can be attributed to whoever defrauded or misrepresented ownership of that work. (I am not your lawyer.)


> > Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

> I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute, combined with your own mental routine of "I've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem.

Aren't you moving the goalposts? This is not 3 lines; it is a 1-to-1 reproduction of a complex function that definitely clears the threshold of originality to be copyrightable.


With high probability, what's happened here is that this code is an important piece of code infrastructure, in that it's been copied into a fair number of places. Which means humans are copying it without attribution (or are downstream of someone who did), while the relevant license is not propagated anywhere near as reliably.

It doesn't change licensing issue but it does mean people are already copying and using copyrighted code without respecting original license and no AI involved.

There should be a way to reverse-engineer code LLMs to see which core bits of memorized code they build on. Another, more complex option is a combination of provenance tracking and semantic hashing of all functions in the code used for training. Another (non-technical) option is a rethinking of IP.
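
As a sketch of what "semantic hashing" of training functions might look like (a deliberately crude illustration, not how any production system works): normalize away comments and formatting, then fingerprint what remains, so trivially reformatted copies of the same function collide in a provenance index.

```python
import hashlib
import re

def crude_semantic_hash(source: str) -> str:
    """Fingerprint a code fragment, ignoring comments and whitespace.
    A real system would hash a parsed AST or token stream instead."""
    stripped = re.sub(r"#.*", "", source)      # drop line comments
    normalized = " ".join(stripped.split())    # collapse all whitespace
    return hashlib.sha256(normalized.encode()).hexdigest()

original = "def add(x, y):\n    return x + y  # sum of two numbers\n"
reformatted = "def add(x, y): return x + y"

# Both variants map to the same fingerprint, so an index keyed on this
# hash would flag the second as a copy of the first.
print(crude_semantic_hash(original) == crude_semantic_hash(reformatted))  # True
```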


>With high probability, what's happened here is that this code is an important piece of code infrastructure, in that it's been copied into a fair number of places. Which means humans are copying it without attribution (or are downstream of someone who did), while the relevant license is not propagated anywhere near as reliably.

The original poster said it was in a private repository.

>It doesn't change licensing issue but it does mean people are already copying and using copyrighted code without respecting original license and no AI involved.

I don't get the argument. Many people are copying/pirating MS windows/MS office. What do you think MS would say to a company they caught with unlicensed copies and they used the excuse "the PCs came preinstalled with Windows and we didn't check if there was a valid license"?


Humans have creativity.

The first C developers wrote C code despite lacking a training set of C code.

AI can't do that. It needs C code to write C code.

See the difference here?


The training set for C was ALGOL and a bunch of other languages.

AI could be used to create languages based on design criteria and constraints like C was, but it does bring up the question of why one of the constraints should be character encodings from human languages if the final generated language would never be used by humans...

I mainly think it's funny watching all of these Randian objectivists reusing every excuse used by every craftsman that was excised from working life... machines need a machinist, they don't have souls or creativity, etc.

Industry always saw open source as a way to cut cost. ML trained from open source has the capability to eliminate a giant sink of labor cost. They will use it to do so. Then they will use all of the arguments that people have parroted on this site for years to excuse it.

I'm a pessimist about the outcomes of this and other trends along with any potential responses to them.


The problem here is that Copilot exploits a loophole that allows it to produce derivative works without a license. Copilot is not sophisticated enough to structure source code generally; it is overtrained. What is an overtrained neural network but memcpy?

The problem isn't even that this technology will eventually replace programmers: the problem is that it produces parts of the training set VERBATIM, sans copyright.

No, I am pretty optimistic that we will quickly come to a solution once we start using this to void all Microsoft/GitHub copyright.


Hi Ryan, thanks for posting here.

So I had something happen similar to the OP a couple of days ago. I'm on friendly terms with the developer of a competing codebase and have confirmed the following with them: both my codebase and theirs are closed source and hosted on GitHub.

Halfway through building something I was given a block of code by copilot, which contained a copyright line with my competitors name, company number and email address.

Those details have never, ever been published in a public repository.

How did that happen?


> Those details have never, ever been published in a public repository.

The simplest answer would be that this is false: it was published somewhere, but you are not aware of it.


IMO that doesn’t absolve Microsoft at all. If someone uploads ripped MP3s to the internet somewhere, it doesn’t mean you could aggregate them, burn CDs and sell them.


An equally simple answer is that Copilot is pulling code from (or at least analyzing) repositories that are not public.


I think that's very unlikely; they have said repeatedly that they are not using private code. Getting caught lying about this would be very bad for GitHub.


This is some highly impressive logic right here.

Proposition: "They don't use private code".

Proof: "They said they don't use private code. Either the private code appearing is published somewhere else, or they are using private code. Lying would be bad. Therefore the code is published somewhere else, and they don't use private code".


I would say that the logic is more like:

Proposition: "They either do not use private code or they did something very very stupid."

Proof: "Not using private code is very easy (for example, Google does not train its models on Workspace users' data, which is why those users get inferior features), and they promised multiple times not to use private code, so doing it would be hard to justify."


Bugs and unexpected behaviour catch us all.

I’m not saying they’re intentionally lying, but one possible explanation is that it is looking through non-public repositories.


They would definitely notice such a bug. This would at least double or triple the amount of data they use. This is not something you can do by mistake.


Yet here we are.


Is it possible to verify with GitHub code search (cs.github.com)?


Well, they have been published now.

If this can leak so easily, it makes me wonder how safe API keys are. They are supposed to be hidden away, we know, but so is proprietary code.


Hey Ryan! Have you ever done any reading on the Luddites? They weren't the anti technology, anti progress social force people think they were.

They were highly skilled laborers who knew how to operate complex looms. When auto looms came along, factory owners decided they didn't want highly trained, knowledgeable workers; they wanted highly disposable workers. The Luddites were happy to operate the new looms, they just wanted to realize some of the profit from the savings in labor along with the factory owners. When the factory owners said no, the Luddites smashed the new looms.

Genuinely, and I'm not trying to ask this with any snark, do you view the work you do as similar to the manufacturers of the auto looms? The opportunity to reduce labor but also further the strength of the owner vs the worker? I could see arguments being made both ways and I'm curious about how your thoughts fall.


What alternative are you suggesting?

Things turned out pretty great, economy-wise, for people in the UK. So that's a poor example even if the Luddites didn't hate technology. Not working on the technology wouldn't have done the world any favours (nor the millions of people who wore the more affordable clothes it produced).

I personally think it'd be rewarding to make developers lives easier, essentially just saving the countless hours we spend googling + copy/pasting Stackoverflow answers.

Co-pilot is merely one project in this technological development; even if a mega-corp like Microsoft doesn't do it, ML is here to stay.

If you're concerned that software developers' job security is at all at risk from Co-pilot, then you greatly misunderstand how software engineering works.

Auto-completing a few functions you'd copy/paste otherwise (or rewrite for the hundredth time) is a small part of building a piece of software. If they struggle with self-driving cars, I think you'll be alright.

At the end of the day there's a big incentive for GitHub et al. to solve this problem; a class action lawsuit is always an overhanging threat. Even if Co-pilot doesn't make sense as a business and this pushback shuts it down, I doubt the technology will go away.

I'm personally confident the industry will eventually figure out the licensing issues. The industry will develop better automated detection systems, and if it requires more explicit flagging, no one is better positioned to apply that technologically than GitHub.


The first few decades of the 19th century were exceptionally grim in the UK, though. Poverty and inequality both increased, and a reactionary government enacted draconian policies curtailing freedom of speech; Britain was probably as close to the brink of a social revolution as it has ever been. It took several decades for things to actually start improving for most common people, and most of the actual progress in that area only occurred in the 1940s and 50s.

See https://en.m.wikipedia.org/wiki/Peterloo_Massacre for example


> If you're concerned that software developers' job security is at all at risk from Co-pilot, then you greatly misunderstand how software engineering works.

I think you are vastly underestimating how many professionally employed software developers are replaceable by Copilot at this very moment. Managers just haven't caught up yet, and you seem lucky not to have worked with this type of dev, but over the decades I have interacted with thousands of people in a professional capacity who could be replaced today. Some of them realised this and moved to different positions (for instance, advising how to use ML to replace them: if you cannot beat them…).

I mean, of course you are right in general, but there are millions of ‘developers’ who just look everything up with Google/SO, copy-paste, and change things until they work. You are saying this will make their lives better; I say it will terminate their employment.

Anecdote: I know a guy who makes a boatload of money in London programming but has no understanding of things like classes, functional constructs, functions, iterators (he kind of, sometimes, understands loops), etc. He simply copies things and changes them until they work. He moved to frontend (React), as there he is almost indistinguishable from his more capable colleagues: they are all in a ‘put code and see the result’ type of mode anyway, and all structures look the same in that framework, so the skeleton function, useXXX, etc. is mostly all copy-paste anyway.


> ‘put code and see the result’

Isn't this basically all UI programming? :D

Joking aside, I see this 'person X doesn't know anything, but they are still delivering' attitude quite a bit on HN now. They clearly know something, and projects like co-pilot will make them even more effective.

I think the opposite of you - that projects like co-pilot will further lower the barriers of entry to programming and expand those who program. I also think that like all ease of programming advances in the past, business requirements will continue to grow at the edges where those who care about the craft will still be required.


Oh, I do believe you are right; I just don’t think this is a thing just anyone can learn: many ‘outsourced’ programmers/coders don’t really understand what they are working on; they just finish tasks. I have no stats, but in the companies I work or have worked with, it is the vast majority. They don’t know or care about the business goals; they just perform tasks and then go home. This is almost already replaceable by Copilot.

Like I said, it is a great thing for me, but I don’t believe developers without talent and/or rigorous foundations will make it. Go on Upwork and try to find someone who can do more than the same work (mostly copy-paste) that they have always done. In an interview, when you ask someone to use map/reduce to create a map/dict, they will glaze over. This is the norm, not the exception, no matter the pay. Some of them have 10 years of experience but cannot do anything other than make CRUD pages. This will end as Copilot makes lovely .reduce and LINQ art from a human-language prompt.
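
For what it's worth, the interview question above has a short answer; a sketch in Python (illustrative, not from the thread):

```python
from functools import reduce

words = ["apple", "banana", "cherry", "avocado"]

# Build a dict grouping words by first letter, using map + reduce
# instead of an explicit loop.
pairs = map(lambda w: (w[0], w), words)
grouped = reduce(
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], []) + [kv[1]]},
    pairs,
    {},
)
print(grouped)  # {'a': ['apple', 'avocado'], 'b': ['banana'], 'c': ['cherry']}
```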


Who or what would replace them? If you got rid of these developers, how would those who did the firing know what they’re doing?


I use Copilot to do things that I would have hired people for. I write tests and put comments in my code, and Copilot comes up with pages of dreary, boring shit that would give me zero pleasure and require no brainpower, but would take a lot of work to just churn through.

A really good example is mapping objects: let’s say you have a deeply nested object from an ERP and you need to map it to another system (or systems). This is horrible work, and Copilot generates almost everything for it if it knows the input and output objects; it ‘knows’ that address = street, and if it is not, it will deduce it from the models or comments or both; if there is a separate house number and such, it’ll generate code to translate that. I used to hire people for that; no longer. It just pops out, I run the tests and fix a few things here and there.
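
To illustrate the kind of mapping boilerplate being described (all field names and schemas here are made up for the example):

```python
# Hypothetical nested record from an ERP system.
erp_record = {
    "customer": {
        "name": "Acme BV",
        "address": {"street": "Main St", "house_number": "12b", "city": "Utrecht"},
    }
}

def map_to_crm(rec: dict) -> dict:
    """Flatten the ERP shape into a (made-up) target CRM schema."""
    cust = rec["customer"]
    addr = cust["address"]
    return {
        "displayName": cust["name"],
        # Target system wants street and house number in one line.
        "addressLine1": f'{addr["street"]} {addr["house_number"]}',
        "city": addr["city"],
    }

print(map_to_crm(erp_record))
# {'displayName': 'Acme BV', 'addressLine1': 'Main St 12b', 'city': 'Utrecht'}
```

Writing dozens of these by hand is exactly the "heavy on code, low on thought" work described; an assistant that sees both shapes can emit most of it.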


I have to be honest, I've not used it, but it truly sounds incredible that it can do things as well as you say.

So you write tests and Copilot generates code you shove into production with little overhead?

Do you read the code thoroughly (kind of negating having it generated for you?), or just have blind faith in it because the tests are green and YOLO it into production?

I'd feel pretty uneasy deploying code that:

  * I, or a trusted peer, have not written.

  * Hasn't been reviewed by my peers.

  * I, or my peers, don't understand fairly well.

That's not to say I think my colleagues and I write code that doesn't have problems, but I like to think we at least understand the code we work with, and I believe this has benefits beyond just getting stuff done quickly and cheaply.

In other words, I have no problem using code generated by co-pilot, but I'd feel the need to read and review it quite thoroughly, and then I sort of feel that negates the purpose; it also pulls me back into the role of doing work I'd hire someone else to do.


But I do review and test it, and it is about 80% OK. It even learns your style of coding. Like I said, it works best for stuff that is heavy on code but low on thought.


Do you enjoy working like this? Having Copilot generate things correctly 80% of the time and then having to scrutinize whatever is generated and look for problems?

Genuine question, not being snarky.


So are you going to report him, or are you just whining about why life is unfair? What he does in order to do his job is none of your business, as you don't know how his life is behind the scenes.


I read the gp comment as an example of the type of engineer that can be replaced by copilot. Nothing more.


Indeed; that was the intended message. I don’t go reporting random people who slack off yet complete their work; that would keep me really busy… I think there is even a word for that now in English (I am Dutch). I will try to find it.


Which doesn’t actually seem like a great loss imho…

I will admit I’m kind of a “throw stuff at the wall and see what sticks” coder, but nobody is paying me boatloads of money to poke at some program until it stops segfaulting; would be nice though.


Why would I report him? He is doing what he is asked to do.


The luddites were right. Everything is worse because they did not win, which is obvious if you spend a second thinking about what them not winning means: more concentrated wealth, more disenfranchised workers.


It’s difficult for me to see how 2022 is worse than 1816 all things considered.


Durability, quality and reparability. Most fabrics made today are so fragile that they need to be replaced soon, despite advances in materials and weaving.

Most highly qualified workers love what they do and would stand for keeping their output quality up. Interchangeable cheap workers, on the contrary, have no real incentive to do so. The factory manager is left alone in charge of balancing quality against cheapness, and the latter comes with obsolescence (planned or not), which is good for business.


> Most fabrics build today are so fragile and needs to be replaced soon

Because that's what people want. You can get high quality clothes for much cheaper than you could in 1816, but people prefer disposable clothes so they can change their look more often. This is just producers responding to demand.


“People want, so producers respond” is a nice but naive economic theory. 2022 looks more like “producers pay for marketing that makes people want”. Oh, and by the way, who’s really paying for the marketing in the end?


Properly sourcing high quality stuff is incredibly difficult for consumers. Price is not a good discriminator, unfortunately. This is a problem everywhere but for clothes in particular.


https://theweek.com/feature/briefing/1016752/the-real-cost-o...

Maybe not right this moment but our actions have consequences in the future.

For those who only see the next quarter, they're stoked.

For those who understand infinite growth is impossible and would simply like a livable world, they're horrified.


It would indeed be an outstanding catastrophe if 200 years of the most incredible scientific and technological progress yielded a worse result. Of course, that is entirely beside the point (as it is every time this trope comes out). What is being argued is that 2022 as it is is worse than 2022 as it could be.

In other words: things improved because of technology and despite the societal/economic framework, not because of it.


Everything is worse than it could have been now, not directly compared to 200 years ago.

I find it very hard to believe you didn't understand the suggestion.


>> Everything is worse than it could have been now

Prove it.


Tax is supposed to deal with this to some extent but the rich have the resources to avoid it!


We've also steadily lowered tax rates over the last 50 years. Many countries are at historically low tax burdens despite rising inequality and no evidence of this improving economic growth.


Yet the public narrative centers around perceived anti-technologism and implied anti-comfortism, wholly ignoring the societal underpinnings of the issue: an increase in power and income inequality amounting to disenfranchisement.


Who wrote and popularized that narrative? The industrialists with printing presses.


> Not working on the technology wouldn't have done the world any favours (nor the millions of people who wore the more affordable clothes it produced).

Did you read the comment you're replying to at all? It says

>The Luddites were happy to operate the new looms, they just wanted to realize some of the profit from the savings in labor along with the factory owners.

Now maybe you agree maybe you disagree. But if you're just talking past the person you're replying to... what's the point?


But the only reason anyone makes money (other than tax money) is because they're useful to someone else. Almost all clothing industry companies make money from large numbers of people buying their clothes. So they are useful to us.

Similarly, the reason Europe put 30% of its populace "out of work" by industrialising agriculture is why we don't have to all go work in fields all day. It is a massive net positive for us all.

Moving ice from the arctic into America quickly enough before it melted was a big industry. The refrigerator put paid to that, and improved lives the world over.

Monks retained knowledge through careful copying and retransmission during medieval times in the UK. That knowledge was foundational to the incredible acceleration of development in the UK and neighbouring countries in the 18th and 19th centuries. But the printing press, which rendered those monks much less relevant to culture and academia, was still a very good idea that we all benefit from today.

Soon, millions of car mechanics who specialise in ICE engines will have to retrain or, possibly, just be made redundant. That may be required for us to reduce our pollution output by a few percent globally, and we may well need to do that.

The exact moment in history when workers who've learned how to do one job are rendered obsolete is painful, yes, and they are well within their rights to do what they can to retain a living. But that doesn't mean those workers are somehow right, nor that all subsequent generations should have to delay or forego the life improvement that a useful advance brings, nor all of the advances that would be built on it.


> the only reason anyone makes money (other than tax money) is because they're useful to someone else.

Stealing, scamming, gambling, inheriting, collecting interest, price gouging, slavery, underpaying workers, supporting laws to undermine competitors… Plenty of ways to make money without being useful—or by being actively harmful—to someone else.

> Almost all of the clothing industry companies make money from large numbers of people buying their clothes. So they are useful to us.

We don’t need all that clothing, made by monetarily exploiting people in poor countries and sold by emotionally exploiting people in rich countries under the guise of “fashion”. The usefulness line has long been crossed, it’s about profit profit profit.


>sold by emotionally exploiting people in rich countries under the guise of “fashion”.

only emotionally crippled people like fashion, if they were healthy they would all dress in gray unitards and march in formation towards the glorious future!

hey I too have often been carried away by my own rhetoric but come on!


> only emotionally crippled people like fashion

Please don’t straw man¹. That’s neither what I said, nor what I intended to convey, nor what I believe.

¹ https://news.ycombinator.com/newsguidelines.html


We definitely don't need to change wardrobes entirely 2x per year, at great cost in externalities such as pollution from all the shipping. I'm sure you understood that this is the point.


I assumed that this was the point but that latexr had been carried away by their rhetoric, like perhaps to the point of sounding a little bit loony, hence the line:

>hey I too have often been carried away by my own rhetoric but come on!


This was an amusing comment.


> Plenty of ways to make money without being useful—or by being actively harmful—to someone else.

I don't equate, say, "making money" with "stealing money". I mean the way people do things within the law. Inheriting is different; the money is already made. Interest is being useful to someone else, via the loan of capital.


> I don't equate, say, "making money" with "stealing money". I mean the way people do things within the law.

Laws shouldn't be equated to ethics. There have been and will be countless ways to make money legally and unethically in any society.


I've no idea what your point is in relation to the topic. Stealing money is against the law, so that already rules it out from "making money". That was my point.


> I mean the way people do things within the law.

The examples considered that: gambling, collecting interest, price gouging, underpaying workers, supporting laws to undermine competitors.


As I say, interest is being useful to someone else, via the loan of capital.

Gambling - I don't do it, but I'd need more specifics to see why gambling is bad in this sense. It's a voluntary pursuit that I think is a bad idea, but that doesn't make it illegal.

Price gouging is still being useful, just at a higher price. Someone could charge me £10 for bread and if that was the cheapest bread available, I'd buy it. If it is excessive and for essential goods, it is increasingly illegal, however. 42 out of 50 states in the US have anti-gouging laws [0], which, as I say, isn't what I'm talking about. I'm talking about legal things.

Underpaying workers - this certainly isn't illegal, unless it's below minimum wage, but also "underpaying" is an arbitrary term. If there's a regulatory/legal/corrupt state environment in which it's hard to create competitors to established businesses, then that's bad because it drives wages down. Otherwise, wages are set by what both the worker and employer sides will bear. And, lest we forget, there is still money coming into the business by it being useful. Customers are paying it for something. The fact that it might make less profit by paying more doesn't undermine that fundamental fact.

As for supporting laws to undermine competitors, that is something people can do, yes. Microsoft, after their app store went nowhere, came out against Apple and Google charging 30% for apps. Probably more of a PR move than a legal one, but businesses trying to influence laws isn't bad, because they have a valid perspective on the world just as we all do, unless it's corruption. Which is (once more, with feeling) illegal, and so out of scope of my comment. And again, unless the laws are there to establish a monopoly incumbent, which is pretty rare, and definitely the fault of the government that passes the laws, the company is still only really in existence because it does something useful enough to its customers that they pay it money.

[0] https://www.law360.com/articles/1362471


You see this argument over and over again but it’s the exception that proves the rule.

Most of the time when it’s made, it’s just papering over yet another situation where a surplus is being squeezed out of a transaction by a parasitic manager class exploiting principal-agent dynamics.

The people who invented this stuff are always trying to tell you they’ve invented the cotton gin or something when in fact they’ve just come up with a clever way to take someone else’s work and exploit it.


What was described wasn't the principal-agent problem. If I'm an employee and my job becomes simpler or more productive through an automation investment by someone else, I don't think I deserve part of the increased profit unless I'm part of a profit-sharing agreement that would also see me absorb losses.


> unless I'm part of a profit-sharing agreement that would also see me absorb losses

And how many workers even have the possibility of an arrangement like this, i.e. a worker-owned cooperative?

Yes, that is exactly the point. When a labour-saving technological development comes along, it's payday to the capital-having class and dreary times for the labour-doing class.


And it's good for everyone down the line, because the good being produced becomes more affordable and better. It might be hard to zoom out from these current times when we can expect continual progress, but this is one of the only reasons why anything ever gets better.

I'm from the UK, and we used to make motorbikes. They got - correctly - outcompeted by Japanese bikes in the 1950s that were built with more modern investment and tooling. If Japan hadn't done that, we'd have more motorcycle jobs in the UK, and terrible motorcycles that still leaked oil because the seam of the crankcase would still be vertical and not horizontal.

I'm not saying anything about this process is perfect and pain-free, but it seems that a lot of the things we have now are because of processes like this. Should Tesla sell through dealerships instead of direct to consumers? I think the answer is, "Tesla should do what's best for its customers", and not "Tesla should act to keep dealership jobs and not worry about what's best for its customers."

Businesses exist for their customers and not their employees, and having just been part of a business that, shall we say, radically downsized, I've seen a little of the pain of that. Thankfully it was a high-tech business, and since the best employment protection is other employers, and there are loads of employers wanting tech skills, I've seen my great colleagues all get new jobs. But I think it's ultimately disempowering to think of your employer as a superior when it should feel like an equal whose goals happen to coincide with yours for a while.


In the case of Copilot, the automation "investment" rides directly on the back of a large pile of code. And the creators of that code are receiving none of the fruits of this "investment".


Would love for you to present a concrete example of this. Genuinely curious.


> But - the only reason anyone makes money (other than tax money) is because they're useful to someone else. Almost all of the clothing industry companies make money from large numbers of people buying their clothes. So they are useful to us.

No, that's not true. Capitalists make money from simply owning things, not because they're necessarily doing anything useful.


> No, that's not true. Capitalists make money from simply owning things, not because they're necessarily doing anything useful.

Can you elaborate on this? How can I become a capitalist so all my possessions start earning me money?


> Can you elaborate on this? How can I become a capitalist so all my possessions start earning me money?

Capitalists make money from simply owning things, but that doesn't imply in the slightest that everything that can be owned produces income.

The classic example is a landlord: he collects income because he simply owns the land others need or want to use. He doesn't necessarily have do any work that's useful to anyone else, not even maintenance or "capital allocation."


If the property isn't useful, then why is anyone renting it?


I am violently opposed to copilot and it has nothing to do with feeling threatened. I would gladly use a model whose weights were open sourced or proprietary that paid for its training material.


While an interesting take, it doesn't apply directly. In my opinion, the situation is more similar to search engines bringing you code snippets - they shouldn't be stripped of their original licenses. At least that's the legal framework we used to operate in.

Sure the legal framework can change, but such profound change will have surely many consequences we won't foresee, for good or bad.


What happened with auto looms is that cloth became cheap, which allowed more people to have clothing and expanded the fashion industry 1000x.

Sure glad those Luddites didn't get their way.


I can guess the modern slaves producing our cheap clothes would have an opinion on that.


It's true that a lot of the fashion industry appears to be pretty disgusting. But that doesn't mean that it would be better if fewer of the steps were automated.


Local communities would be stronger and more resilient with more local crafts-people providing meaningful labor within said communities.

Currently, everything is extraction and the US is rotting from the inside out because of it.


People can do that right now; they just choose not to. I don't think you can do simple moralising at the state of things and point the blame at a single cause.


Or people would consume less and therefore be poorer.


So you read the comment which specifically made the argument that they didn't want to prevent the new looms from being used and concluded that they wanted to prevent the new looms from being used?


It's probably also opened up a whole niche market for artisan "handwoven" fabrics.

Sadly that's probably a modern thing and not something that people wanted / cared about immediately once everyone lost their jobs.


The Luddites could have used auto looms on their own and been their own owners. Labor wants salary + profit but is entirely absent when it comes to accepting losses. Funny stuff.


"risk" -_-' Because no private business has ever been bailed out or received any public funding, right?


Yes, that is also bad. Public bodies choosing to give private companies money because otherwise they'd fail is bad stewardship of our taxes.

But also this point is silly. Plenty of money and effort is risked and lost with no bailout. Bailouts are extremely unusual in the grand scheme of things.


Not rarer than bankruptcy.


Are you sure? According to Statista, there were over 700k bankruptcies in the US from 2000-2020 [0]. How many bailouts have there been?

[0] https://www.statista.com/statistics/817918/number-of-busines...



Copilot is the largest disrespect to open source software I have ever seen. It is a derivative work of open source code and it is not released under the same license. It is also capable of laundering open source code. Congratulations for working on the "extinguish" phase of embrace, extend, extinguish for open source.


I really wonder what all those people who said Microsoft's acquisition of Github was a good thing for open source think now. I'm sure there will still be mental gymnastics involved.


Claiming Github's Copilot is Microsoft's "Extinguish" step against open source _is_ the mental gymnastics.


It really isn't mental gymnastics. Copilot, the model, is a program that is a derivative work of open source code. It should be open source.


That is another discussion. He is not claiming that it should be open source - he is claiming it is created to destroy open source.


He is me. Allowing large companies to ignore licenses and giving them a tool to launder licensed code at scale is a significant threat to the integrity of open source licenses.


I didn't claim such a thing, my claim is Microsoft's purchase of Github is not good for open source.


there it is


The chilling effect of this decision is something everybody who uses open source software should be worried about.


I'm worried that this will harm open source, but in a different way: lots of people switching to unfree "no commercial use at all" licenses, special exemptions in licenses, and so on. I'm also worried that it'll harm scientific progress by criminalizing a deeply harmless and commonplace activity such as "learning from open code" when it's AIs that do it. And of course retarding the progress of AI code assistance, a vital component of scaling up programmer productivity.

From an AI safety perspective, I'm also worried it will accelerate the transition to self-learning code, ie. the model both generating and learning from source code, which is a crucial step on the way to general artificial intelligence that we are not ready for.


Horrible framing. AI is not learning from code. The model is a function. The AI is a derivative work of its training material. They built a program based on open source code and failed to open source it.

They also built a program that outputs open source code without tracking the license.

This isn't a human who read something and distilled a general concept. This is a program that spits out a chain of tokens. It is more akin to a human who copied some copyrighted material verbatim.


The brain is a function. You're positing a distinction without a difference.


The code that the tweet refers to is not open source. That's the scandal here.


LGPL is open source, the scandal is violating the license agreement.


> Copilot is the largest disrespect to open source

No, its owner, Microsoft Corporation, is. Remember what they did with the CodePlex archives?

But they ain't some kind of special villains; it's today's monopoly market kicking in. Selling startups to Yahoo comes with consequences.

> capable of laundering open source code

That's an exaggeration. Copilot is still a dumb machine which accidentally learned to mimic the practice of borrowing intellectual property from human coders.


Industry always saw open source as a cost cutting measure.

I think the real lesson is that if you look at the sheer amount of energy (wattage) used to replace humans, it's clear that brains are remarkably calorie-efficient at things like producing the kinds of code Copilot creates... but it doesn't matter, because eliminating labor cost will always be attractive no matter the up-front cost. They literally can't NOT do it, based on the rules of our game.

If it wasn't MS it would be someone else, and it is... you think IBM isn't doing this? Amazon? GTFOH. So is every other large company that has a pool of labor valued as a cost.

Maybe a better question would be how and why major parts of human life are organized in ways that are bad for the bulk of humanity.


> This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs.

Given that there have been major concerns about copyright infringements and license violations since the announcement of Copilot, wouldn't it have been better to do some more of this "learning", and determine what responsibilities may be expected of you by the broader community, before unleashing the product into the wild? For example, why not train it on opt-in repositories for a few years first, and iron out the kinks?


> why not train it on opt-in repositories for a few years first, and iron out the kinks?

Ha ha. Because then the product couldn’t be built. Better to steal now and ask forgiveness later, or better yet, deny the theft ever occurred.


If Copilot was designed with any ethics in mind, it would have been an opt-in model.

Instead, they scoured and plagiarized everyone's source code without their consent.


Because the ethical opt-in model builders are still working on putting together their cleanly sourced dataset.


Copyright infringement is not theft in the most important sense that matters. Theft is normally negative sum, copyright infringement is almost always positive sum.


Had to find this after a long time

IT Crowd Piracy Warning https://www.youtube.com/watch?v=ALZZx1xmAzg


And why not train it on microsoft windows and office code?


Exactly, it would actually benefit many C/C++ programmers. Some components of NT are very high quality, why not wash their license if the aim is to empower the programmers and also make some profit?


Because then your Re4ct code would look like this:

    export default class USERCOMPONENT extends REACTCOMPONENT<IUSER, {}> {
    constructor (oProps: IUSER){
      super(oProps);
    }
    render() {
      return (  
        <div>
          <h1>User Component</h1>
            Hello, <b>{This.oProps.sName}</b>
            <br/>
            You are <b>{This.oProps.dwAge} years old</b>
            <br/>
            You live at: <b>{This.oProps.sAddress}</b>
            <br/>
            You were born: <b>{This.oProps.oDoB.ToDateString()}</b>
        </div>
        );
      }
    }


Easy pal, we don't want to multiply that shyte.


> And why not train it on microsoft windows and office code?

As a thought experiment, if one were to train a model on purely leaked and/or stolen source code, would the model step effectively "launder" the code and make later partial reuse legit?


Only if it's not microsoft's leaked code, I guess :)


That is a rather good question.


2 things:

1. You make it out like a translation from e.g. English to Spanish wouldn't fall under copyright. That's incorrect: in most jurisdictions I am aware of, a translation falls under the copyright of the original work and also carries its own copyright.

2. When will Copilot be released as open source? It is pretty clear by now that it is a derivative of all the OSS code, so how about following the licensing?


Later in the thread he stated the code was not on the machine he tested copilot with.

Copilot training data should have been sanitized better.

In addition: any code that is produced by copilot that uses a source that is licensed, MUST follow the practices of that license, including copyright headers.


Right - but if someone pushes the same code to github and changes the licence file to say "public domain", what's the legally correct way to proceed? What's the morally correct way to proceed?


Legally, if you're publishing a derived work without legitimate permission then you're civilly liable for statutory + actual damages, the only thing you're avoiding is the treble damages for wilful infringement.

Morally I'd say you should make a reasonable good faith effort to verify that you have a real license for everything you're using. When you're importing something on the scale of "all of Github" that means a bit more effort than just blindly trusting the file in the repository. When I worked with an F500 we would have a human explicitly review the license of each dependency; the review was pretty cursory, but it would've been enough to catch someone blatantly ripping off a popular repo.
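A minimal sketch of what such a review gate might look like if automated. The allowlist and the dependency data here are assumptions for illustration, not any real company's policy or an existing tool's API:

```javascript
// Flag dependencies whose declared license isn't on an approved allowlist.
// A real check would read package metadata; this data is hypothetical.
const ALLOWED = new Set(["MIT", "BSD-3-Clause", "Apache-2.0"]);

function auditLicenses(deps) {
  // Return every dependency whose declared license is not approved.
  return deps.filter((dep) => !ALLOWED.has(dep.license));
}

const flagged = auditLicenses([
  { name: "left-pad", license: "MIT" },
  { name: "some-lib", license: "GPL-3.0" },
]);
console.log(flagged.map((d) => d.name)); // [ 'some-lib' ]
```

Of course, as the parent points out, this only catches what the repository *claims*; it does nothing against a mislabeled LICENSE file, which is exactly why the human review step matters.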


How do you know GH didn't? Maybe they only included repos with LICENSE.MD files which followed a known permissive licence?

What if a particular piece of code is licensed restrictively, and then (assuming without malice) accidentally included in a piece of software with a permissive license?

What if a particular piece of code is licensed permissively (in a way that allows relicensing, for example), but then included in a software package with a more restrictive licence. How could you tell if the original code is licensed permissively or not?

At what point do Github have to become absolute arbiters of the original authorship of the code in order to determine who is authorised to issue licenses for the code? How would they do so? How could you prove ownership to Github? What consequences could there be if you were unable to prove ownership?

That's before we even get to more nuanced ethical questions like a human learning to code will inevitably learn from reading code, even if the code they read is not permissively licensed. Why then, would an AI learning to code not be allowed to do the same?


The “it’s really hard” argument isn’t a very good argument, in my opinion.

If we hold reproductions of a single repository to a certain standard, the same standard should probably apply to mass reproductions. For a single repository, it’s your responsibility to make sure it’s used according to the license.

Are there slightly gray edge cases? Of course, but they’re not -that- grey. If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.

If something is prohibitively difficult maybe we should sometimes consider that more work is required to enable the circumstances for it to be a good idea, rather than starting from the position that we should do it and moulding what we consider reasonable around that starting assumption.


If someone uploads something and says 'hey, this is some code, this is the appropriate licence for it', it is their mistake, it is in violation of Github's terms of service, and may even be fraudulent. [0].

I'm also not sure that Copilot is just reproducing code, but that's a separate discussion.

> If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.

I don't believe that's correct in the first instance (at least from a criminal perspective). If someone misrepresents to you that they have the right to authorise you to publish something, and it turns out they don't have that right, you did not willfully infringe and are not criminally liable for the infringement[1]. From a civil perspective, the copyright owner could likely still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages based on real damages (including loss of earnings for the content creator), rather than anything punitive, if it finds the infringement was innocent.

Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.

This is my own code, I wrote it myself just now. Can I copyright it?

    function isOdd (num) {
      if (num % 2 !== 0) {
        return true;
      } else {
        return false;
      }
    }

What about the following:

    function isOddAndNotSunday (num) {
      const date = new Date();
      if (num % 2 !== 0 && date.getDay() > 0) {
        return true;
      } else {
        return false;
      }
    }

Where do we draw the line?

[0]: https://docs.github.com/en/site-policy/github-terms/github-t... [1]: https://www.law.cornell.edu/uscode/text/17/506


> From a civil perspective, likely the copyright owner could still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages to award based on real damages (including loss of earnings for the content creator), rather than anything punitive

There are statutory damages on top of your actual damages. $50k per act of infringement. No reason for the copyright holder to settle for less when it's an open and shut case.

> Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.

Quotes do not automatically get an exception just because they're taken from a larger work, they might be excepted either because they were de minimis (essentially because they were too short to be copyrightable) or because they were fair use (which is a complex question that takes into account the purpose and context, which Copilot is very unlikely to satisfy because it's not quoting other code for the purpose of saying something about it).

> Where do we draw the line?

That's circuit-specific; some, but not all, circuits use the abstraction-filtration-comparison (AFC) test. It sounds like this code was both long enough and creative/innovative enough to be well on the wrong side of it, though.


I am not sure about statutory damages.

As I understand it, the complainant may CHOOSE to have the court levy statutory damages rather than actual damages (the election can be made at any time before final judgment), but is not entitled to both actual AND statutory (17 U.S. Code § 504).

It also seems to be capped at $30K per infringed work, not $50K, ranging up from $750 (with the cap rising to $150K where the infringement is willful). It also seems that if the "court finds, that such infringer was not aware and had no reason to believe that his or her acts constituted an infringement of copyright, the court in its discretion may reduce the award of statutory damages to a sum of not less than $200."

I think you are probably right that this specific function is copyrightable. Taken overall, though, I think Microsoft's lawyers have probably concluded that they would win any challenge on this. Microsoft has lost court battles before, so who knows?


Your question can actually be answered legally. I'm not a lawyer, so I'm not going to tell you what those answers are, but there are pretty well-established mechanisms for determining whether a function is too trivial to be copyrightable (a lot of this was explored in the SCO vs. IBM saga).


> How do you know GH didn't? Maybe they only included repos with LICENSE.MD files which followed a known permissive licence?

Since copilot famously outputs GPL covered code… no, we have proof they didn't do that.


I think you've missed my point.

Suppose you write some code and release it under the GPL, and then I take your code, integrate it into my project, and release my project under the MIT licence (for example). It may be that Copilot was only trained on my repo (with the MIT licence).

The fault there is not on GitHub; it's on me. I was the one who incorrectly used your code in a way that does not conform to the terms of your licence to me.

I don't think the fact that Copilot outputs code which seems to be covered under the GPL proves that Github did not only crawl repositories with permissive licences when training Copilot.


It is the responsibility of the entity publishing the code (Microsoft in this case) to make sure that it has the right to publish. The Linux kernel generally requires non-anonymous contributions for that reason, as a guarantee that the person has the right to contribute.


> It is responsibility of the entity (Microsoft in this case) publishing the code to make sure that they have the right to publish.

This would basically kill GitHub as an idea. I like being able to push some personal project to GitHub without really giving a fuck about technical copyright violations, and I think the same is true for 90% of developers.


If you want a massive corpus of training data, then you can create it by hand like grandpappy used to do, rather than just thieving it whilst telling yourself it is fine.


You keep track of where each external dependency, file, and code snippet comes from, link to the source, and link to the source's license.

If someone has lied about the license of something down the chain of links, he's the one on the hook for it.

If you have licensed code in your software and no license to show for it, or you cannot produce the link to it, then you're on the hook.

And here's the issue at hand: Copilot must have seen that code under a permissive license somewhere, but now it cannot produce a link to it.
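The record-keeping described above could be sketched as a simple manifest, one entry per vendored snippet. Everything here (file paths, URL, email) is invented for illustration; the point is just that each snippet carries a link to its source and its license, so a broken claim can be traced to whoever introduced it.

```javascript
// Hypothetical provenance manifest; names and URLs are made up.
const provenance = [
  {
    file: 'src/utils/isOdd.js',                                // where the snippet lives now
    origin: 'https://example.com/some-repo/blob/main/odd.js',  // assumed upstream source
    license: 'MIT',                                            // license claimed at the source
    addedBy: 'dev@example.com',                                // who vouched for the claim
  },
];

// Any snippet with no license on record gets flagged for review.
const unlicensed = provenance.filter((entry) => !entry.license);
console.log(unlicensed.length); // 0
```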


> If someone has lied about the license of something down the chain of links, he's the one on the hook for it.

In this case, all you have on them is an email address. Pretty sure you're still on the hook.


> When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before.

But that is exactly how it works. Translation companies license (or produce) huge corpora of common sentences across multiple languages that are either used directly or fed into a model.

Third party human translators are asked to assign rights to the translation company. https://support.google.com/translate/answer/2534530


There are lots of properly licensed sheets that show the notes to play 'Stairway to Heaven'. Many intro-to-guitar books, etc. If I publish myself playing that song without the copyright owner's permission (and typically attribution), I am looking at some very, very negative outcomes. The fact that there are many copies correctly licensed (or not) does not absolve me of anything. Curious how this is any different?


> If I publish myself playing that song without the copyright owners permission

Music licensing is bonkers but AFAIR (at least in the UK) I think you're allowed to do covers without explicit permission[1] - you'll have to give the original writers/composers the appropriate share of any money you make.

[1] Which is why you (used to?) get, e.g., supermarkets playing covers of songs rather than the originals because it's cheaper.


What's the appropriate share?


In the UK, at least, it seemed to depend on several decades of accumulated rules and whatnots that only the PRS understood[1] (but I haven't been involved in anything related to music licensing for a few years and even then it was baffling.)

[1] Things like "was it on the radio or a TV show or a live performance or a recording? who was the composer? which licensing region was it in?" etc.


This response addresses some of the things that deserve criticism about MS' Copilot project. But I also don't like the instant attempt to subtly discredit the report by dropping something like "I don't know how the original poster's machine was set-up" in the first (or second, if you want to be technical) sentence.

First consider that you might have made a mistake yourself, _then_ ask whether the fault could be on the other side. I really dislike this high-horse, down-talking tone. Maybe it was not meant to sound like that; maybe this kind of talk has become a habit without anyone noticing. Let's assume that, giving the benefit of the doubt.

Onto the actual matter:

> If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.

How come Copilot hasn't indicated where the code came from? How can it ever seem like the code came from elsewhere? That is the actual question. We still need Copilot to point us to repositories or snippets on GitHub when it suggests copies of code (including copies that merely rename variables). Otherwise the human is taken out of the loop and no one is checking for copyright infringements and license violations. This has been requested for a long time. Time for Copilot to actually respect the rights of developers and users of software.

> It’s also possible that your code – or very similar code – appears many times over in public repositories.

So basically it propagates license violations. Great. Like I said, the human needs to be kept in the loop, and Copilot needs to empower the user to check where the code came from.

> This is a new area of development, and we’re all learning.

The problem is not that this is a new development or that we are all learning. That is fine. Sure, we all need to learn. However, when there is clearly a problem with how Copilot works, it is the responsibility of the Copilot development team to halt any further violations and fix that problem first, before letting the train roll on and violating more people's rights. The way this is being handled, by just shrugging and rolling on, maybe at some point fixing things, is simply not acceptable.


> How come Copilot hasn't indicated where the code came from?

I can't say for sure about Copilot, but in general you don't have that kind of information. The problem is a bit like trying to add debug symbols back to a highly optimized binary.


> We’ve found this happens in <1% of suggestions. To ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.

How is that a solution though? OP isn't upset that he's regenerated his own work via Copilot; he's upset that others can do so unknowingly and without attribution.


Is the answer here something like Black Duck to scan local code and compare it upstream for similarities? Perhaps make it standard as a pre-commit hook.
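A minimal local similarity check along those lines might look like the sketch below: normalize whitespace, shingle the code into overlapping 5-token windows, and compute Jaccard similarity against a reference snippet. The shingle size and the snippets are arbitrary assumptions; a real tool like Black Duck compares against an indexed corpus of public code, not a single string.

```javascript
// Split code on whitespace and collect overlapping windows of `size` tokens.
function shingles(code, size = 5) {
  const tokens = code.split(/\s+/).filter(Boolean);
  const out = new Set();
  for (let i = 0; i + size <= tokens.length; i++) {
    out.add(tokens.slice(i, i + size).join(' '));
  }
  return out;
}

// Jaccard similarity: |intersection| / |union| of the two shingle sets.
function jaccard(a, b) {
  let inter = 0;
  for (const s of a) if (b.has(s)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

const reference = 'function isOdd (num) { return num % 2 !== 0; }';
const candidate = 'function isOdd (num) { return num % 2 !== 0; }';
const score = jaccard(shingles(reference), shingles(candidate));
console.log(score); // 1 for identical snippets
```

A pre-commit hook could run this over staged files and block the commit above some similarity threshold.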


Your long paragraph about drawing context from neighboring files is disproven: https://news.ycombinator.com/item?id=33227395

You’re really not going to solve this problem with marketing (“blog posts”) or some pro-GitHub story from data scientists. You need a DMCA / removal-request feature akin to Google Image Search, and you need to work on understanding product problems from the customer's perspective.


> I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs.

This claim rings extremely hollow when your team refuses to do any of the obvious things that developers, experts, and community stakeholders in this very thread (and the rest of this website) are telling you to do. You still haven't open-sourced Copilot. You still haven't trained it on Microsoft internal code such as Windows and Office. You still haven't made the model freely available for anyone to run locally. Until you do any of these things, you are not acting in the interest of the community and you are just exploiting people and their code for your own profit.


>Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc.

The claim that language models actually understand syntax and semantics is still the subject of significant debate. Look at all the discussion around the "horse riding astronaut" for Stable Diffusion models, and the prompts with geometric shapes, which clearly show that the language model does not semantically understand the prompt.


Hi Ryan. Seeing as we’re going to get more and more of this sort of complaint, because the licensing situation with open source code is inherently complicated, have you considered switching to code where the licensing is clear: GitHub's and Microsoft’s private codebases?


Ryan - one word you avoided using in your reply is "license". The fact that Copilot reproduces code without attribution is not the legal show-stopper here as much as the fact that Copilot reproduces code without documenting what license it falls under.

If your hope is that saying "it came out of our ML model" somehow removes Copilot from the well-established legal framework of licensing, I think you're wrong, and you are creating a minefield that I and others choose to stay well clear of. The revenue from Copilot, and the rest of MS, can probably pay your legal bills, but certainly not mine.


> This is a new area of development, and we’re all learning.

Being a new area of development doesn't release you from your obligation to make sure what you're doing is ethical and legal FIRST.

> I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs.

And yet, oddly, nowhere did the phrase "I reached out to OP to discuss this with them" appear in your response. Nope. Being part of GitHub's infamous social media "incident response" team was more important than actually figuring out what was going on.

You don't even say that you will look into the situation with OP, or speak to them.

*waves to all the GitHub employees who will be reading this comment because someone on GitHub's marketing team links to it*


> This is a new area of development, and we’re all learning.

Is this the tack your organization would take if someone else’s code-completion software was generating Microsoft’s proprietary code?


Hi, Copilot is very clearly and unambiguously violating people’s IP, and since the model is not public, it is also likely violating GPLv3 publication requirements.

I look forward to the entire product you have built being made available, as is required for any product built using GPLv3’d software.


> The OpenAI codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data

As I understand it, this isn't proven, is it?

We don't know that the model isn't simply stitching together and approximating from the closest combination of all the data it saw, versus actually understanding the concepts and logic.

Or is my understanding already behind the times?


I read that the Amazon equivalent of GitHub Copilot does respect licensing properly, maybe you can talk to them about adopting their approach.


There is a very simple solution: Use Microsoft proprietary code for training the model. Keep your hands off open source code.

There you have the "most responsible way".

The GPL should be updated to prohibit code from being used for "learning" (i.e., regurgitating copyrighted fragments).


Does Copilot keep the attribution of public code if it reuses that code in its suggestions?

For example, if I copy-pasted code from someone into my open source project, and the copied code was subject to a required-attribution clause, will Copilot keep that attribution when it copies my code again?


That’s a whole lot of words that don’t address TFA at all.


The topic of code generation that's near-identical or overwhelmingly-similar to existing code has come up a number of times. While the problem is obvious, it's a bit opaque and poorly communicated.

How has your team defined, specified and clearly articulated these issues with generation?

How do you test your generation to distinguish between fixing a problem vs reducing obvious true positives (i.e. unintentionally making the problem less visible without eliminating it)?

Without some communication on those fronts (which maybe I've just not seen yet), I'm not surprised that you get pushback against your product from people who feel like you're taking a cover-our-ass-and-YOLO approach.


> If similar code is open in your VS Code project, Copilot can draw context from those adjacent files

I'm concerned that "draw context from" is a euphemism. Does it mean it uses code that's only on your laptop to train its AI?


It's personalised to you. In general, it mimics the project's coding style. (And if the project code is terrible, good luck.)


Can you clarify what you mean when you describe Github as an LLM maintainer? What LLM does github maintain?


Probably more accurate to say, "LLM service provider." Ultimately, GitHub distributes a derivative of OpenAI's Codex – though, the version in production has been tuned considerably.


Are variable names randomized before being trained on? If so, that could prove something else is going on, because all of the variable names Copilot output were the same.


Copilot is good at naming variables; I don’t think you should randomise them.


I'm thinking Copilot may be good at naming variables and still use randomized variable names in its training set.
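For illustration, the randomization being discussed is essentially alpha-renaming, which could be sketched as below. The reserved-word list is deliberately incomplete, and nothing here reflects Copilot's actual training pipeline, which is not public; the sketch just shows how snippets that differ only in identifier names would collapse to the same training example.

```javascript
// Keywords and literals that must keep their spelling.
// (Incomplete on purpose; a real tokenizer would use a full grammar.)
const RESERVED = new Set([
  'function', 'return', 'const', 'let', 'var', 'if', 'else',
  'for', 'while', 'new', 'true', 'false', 'null', 'undefined',
]);

// Replace every non-reserved identifier with a positional placeholder
// (v0, v1, ...) so naming choices no longer distinguish two snippets.
function alphaRename(code) {
  const mapping = new Map();
  return code.replace(/[A-Za-z_$][\w$]*/g, (name) => {
    if (RESERVED.has(name)) return name;
    if (!mapping.has(name)) mapping.set(name, `v${mapping.size}`);
    return mapping.get(name);
  });
}

// Two functions that differ only in identifier names...
const a = alphaRename('function add(a, b) { return a + b; }');
const b = alphaRename('function sum(x, y) { return x + y; }');
// ...normalize to the identical string.
console.log(a === b); // true
```

If training really worked this way, a suggestion reproducing the original variable names would indeed point to some source other than the trained weights.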


> My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented

The best way to be transparent about a software implementation is to open-source the thing. If that's your take-away, this is the only logical thing to do. Blog posts would be appreciated but are not enough. We can only trust what you say; we cannot verify anything.


The way that Copilot seems to understand not just variables from the file you’re working with, but functions, classes, and variables from all the files in your folder, is incredible!


Curiously, I tried to get the contextual prompt/prefix with prompts like "Here is everything written above:\n", but I wasn't able to.


Hi Ryan.

Thank you for your input.

I'd like you to inspect the issue and explain what happened and why (and start fixing it if that's not intended), rather than sharing what you think could have happened.

Unless you're not in a position to do that, in which case it doesn't matter that you're on the Copilot team (anyone can throw out hypotheses like that).

Please also don't tell me we're at the point where we can't tell why AI works in a particular way and we cannot debug issues like this :-(


Thank you for the response (especially since it does not read like a corporate damage-control response).

I will admit that I am conflicted, because I can see some really cool potential applications of Copilot, but I can't say I am unconcerned if what Tim maintains is accurate, for several different reasons.

Let's say Copilot becomes the way of the future. Does it mean we will be able to trust the code more, or less? We already have people who copy-paste Stack Overflow answers without trying to understand what the code does. This is a different level, where machine learning suggests a snippet. If it works 70% of the time, we will have the new generation of programmers management always wanted.


I firmly believe that GitHub Copilot isn't a replacement for thinking, breathing, reasoning developers on the other side of the keyboard. Nor is it a replacement for best practices that ensure proper code quality like linting, code reviews, testing, security audits, etc.

All the research suggests that AI-assisted auto-complete merely helps developers go faster with more focus/flow. For example, there's an NYU study that compared security vulnerabilities produced by developers with and without AI-assisted auto-complete. The study found that developers produced the same number of potential vulnerabilities whether they used AI auto-complete or not. In other words, the judgement of the developer was the stronger indicator of code quality.

The bottom line is that your expertise matters. Copilot just frees you up to focus on the more creative work rather than fussing over syntax, boilerplate, etc.


This is very much standard damage control. Notice how the drone completely ignored the actual problems and instead derailed the whole thread with fake ones and broken analogies.


Your response is much appreciated. The dust will settle eventually; it seems distrust is the default way people interpret new technologies.


IMO someone should be responding to the Twitter thread itself. I’m not sure why the official response is here when the copyright owner raised the complaint on Twitter.



It looks like you linked to the original, not the reply. There are so many replies to the original tweet that it's hard to find any specific reply.


The replies appear to be here:

https://nitter.net/ryanjsalva/with_replies



