GitHub Copilot (copilot.github.com)
2905 points by todsacerdoti 29 days ago | 1272 comments



Hi HN, we've been building GitHub Copilot together with the incredibly talented team at OpenAI for the last year, and we're so excited to be able to show it off today.

Hundreds of developers are using it every day internally, and the most common reaction has been the head exploding emoji. If the technical preview goes well, we'll plan to scale this up as a paid product at some point in the future.


Lots of questions:

  - does the code generated by the AI belong to me or to GitHub?
  - under what license does the generated code fall?
  - if generated code becomes the reason for infringement, who gets the blame or legal action?
  - how can anyone prove the code was actually generated by Copilot and not the project owner?
  - if a project member does not agree with the usage of Copilot, what should we do as a team?
  - can Copilot copy code from other projects and use that excerpted code?
    - if yes, *WHY* ?!
    - who is going to deal with legalese for something he or she was not responsible for in the first place?
    - what about conflicts of interest?
  - can GitHub guarantee that Copilot won't use proprietary code excerpts in FOSS-ed projects that could lead to new "Google vs Oracle" API cases?


In general: (1) training ML systems on public data is fair use, and (2) the output belongs to the operator, just like with a compiler.

On the training question specifically, you can find OpenAI's position, as submitted to the USPTO here: https://www.uspto.gov/sites/default/files/documents/OpenAI_R...

We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!


You should look into:

https://breckyunits.com/the-intellectual-freedom-amendment.h...

Great achievements like this only hammer home how illogical copyright and patent laws are.

Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

We need to ditch the term “IP”, it’s a lie.

Hopefully we can do that before it’s too late.


> Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

Copyright doesn't protect "ideas"; it protects "works". If an artist spends a decade of his life painting a masterpiece, and then some asshole sells it on printed T-shirts, then copyright law protects the artist.

Likewise, an engineer who writes code should not have to worry about some asshole (or some for-profit AI) copy and pasting it into other peoples' projects. No copyright protections for code will just disincentivize open source.

Software patents are completely bullshit though, because they monopolize ideas which 99.999% of the time are derived from the ideas other people freely contributed to society (aka "standing on the shoulders of giants"). Those have to go, and I do not feel bad at all about labeling all patent-holders greedy assholes.

But copyright is fine and very important. Nothing is perfect, but it does work very well.


Copyrights are complete bullshit too, though. Take your two examples. First, the artist, I assume, is using paints and mediums developed arguably over thousands of years, at great cost. So even though she is assembling the leaf nodes of the tree, the vast majority of the "work" was created by others. Shared creation.

Same goes for an engineer. Binary notation is at the root of all code, and in the intermediate nodes you have Boolean logic and microcode and ISAs and assembly and compilers and high-level languages and character sets. The engineer who assembles some leaf nodes that are copy-and-pasteable is by definition building a shared creation to which they've contributed the least.


The basis of copyright isn't that the sum product is 100% original. That would be insane, since nothing we do is ever original; it will always ultimately be built from components of nature. The point is that your creation is protected for a set amount of time, and then it too eventually becomes a component for future works.


> the artist I assume is using paints and mediums developed arguably over thousands of years, at great cost.

And they went to the store and paid money for those things.


And they handed the cashier money and then got to do whatever they want with those things. Now they want to sell their painting to the cashier AND control what the cashier does with it for the rest of the cashier's life. They want to make the cashier a slave to a million masters.


Remind me: when did GitHub hand anyone any money for the code they used?


I'm sure natfriedman will be thrilled to abolish IP and also apply this to the Windows source code. We can expect it on GitHub any minute!


I used to work at Microsoft and occasionally would email Satya the same idealistic pitch. I know they have to be more conservative, but some of us have to envision where the math can take us, shout out loud about it, and hope they steer well. When I started at MS, in my first week I was heckled for installing Ubuntu on my Windows machine. When I left, Windows was shipping with Ubuntu. What may seem impossible today can become real if enough people push the ball forward together. I even hold out hope that someday BG will see the truth and help reduce the ovarian lottery by legalizing intellectual freedom.


Talking about the ovarian lottery seems strange in a thread about an AI tool that will turn into a paid service.

No one will see the light at Microsoft. The "open" source babble is marketing and recruiting oriented, and some OSS projects infiltrated by Microsoft suffer and stagnate.


All I know is that if a lawsuit comes around for a company who tried to use this, Github et al won't accept an ounce of liability.


You can't abolish IP without completely restructuring the economic system (which I'm all for, BTW). But just abolishing IP and keeping everything the same is kind of myopic. Not saying that's what you're advocating for, but I've run into this sentiment before.


Sure, but usually I tend to think "abolish X" means "lets agree on an end goal of abolishing X and then work rapidly to transition to that world." So in that sense I tend to think the person is not advocating for the simple case of changing one law, but on the broader case of examining the legal system and making the appropriate changes to realize a functioning world where we can "abolish X".


I agree that it would be a huge upheaval. Absolutely massive change to society. But I have every confidence we have a world filled with good, bright people who can steer us through the transition. Step one now is just educating people that these laws are marketed dishonestly, are inequitable, and are counterproductive to the progress of ideas. As long as the market starts to get wind that the truth is spreading, I believe it will start to prepare for the transition.


In practical terms, IP could be referred to as unique advantages. What is the purpose of an organization that has no unique qualities?

In general, what is IP and how it's enforced are two separate things. Just because we've used copyright and patents to "protect" an organization's unique advantages, doesn't mean we need to keep using them in the same way. Or maybe it's the best we can do for now. That's why BSD style licences are so great.


@Nat, these questions (all of them, not just the 2 you answered) are critical for anyone who is considering using this system. Please answer them?

I for one wouldn't touch this with a 10000' pole until I know the answers to these (very reasonable) questions.


> training ML systems on public data is fair use

Uh, I very much doubt that. Is there any actual precedent on this?

> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

But apparently not eager enough to have this discussion with the community before deciding to train your proprietary for-profit system on billions of lines of code that undoubtedly are not all under CC0 or similar no-attribution-required licenses.

I don't see attribution anywhere. To me, this just looks like yet another case of appropriating the public commons.


How do you guarantee it doesn't copy a GPL-ed function line-by-line?


Yup, this isn't a theoretical concern, but a major practical one. GPT models are known for memorizing their training data: https://towardsdatascience.com/openai-gpt-leaking-your-data-...

Edit: Github mentions the issue here: https://docs.github.com/en/github/copilot/research-recitatio... and here: https://copilot.github.com/#faq-does-github-copilot-recite-c... though they neatly ignore the issue of licensing :)


That second link says the following:

> We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set

That's kind of a useless stat when you consider that the code it generates makes use of your existing variable/class/function names when adapting the code it finds.

I'm not a lawyer, but I'm pretty sure I can't just bypass GPL by renaming some variables.


It's not just about regurgitating training data during a beam search, it's also about being a derivative work, which it clearly is in my opinion.


> GPT models are known for memorizing their training data

Hash each function, store the hashes as a blacklist. Then you can ask the model to regenerate the function until it is copyright safe.
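A minimal sketch of that blacklist idea (toy data; SHA-256 over whitespace-stripped source, so renaming a variable would still defeat it, as the replies below point out):

  import hashlib

  def fingerprint(code: str) -> str:
      # Collapse whitespace so trivial reformatting doesn't dodge the check
      canonical = "".join(code.split())
      return hashlib.sha256(canonical.encode()).hexdigest()

  # Toy stand-in for "every function seen during training"
  training_functions = ["def add(a, b):\n    return a + b"]
  blacklist = {fingerprint(fn) for fn in training_functions}

  def is_verbatim_copy(suggestion: str) -> bool:
      return fingerprint(suggestion) in blacklist

  print(is_verbatim_copy("def add(a,b): return a+b"))           # True: same code, reformatted
  print(is_verbatim_copy("def mul(a, b):\n    return a * b"))   # False: genuinely different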


What if it copies only a few lines, but not an entire function? Or the function name is different, but the code inside is the same?


If we could answer those questions definitively, we could also put lawyers out of a job. There’s always going to be a legal gray area around situations like this.


Matching on the abstract syntax tree might be sufficient, but might be complex to implement.


You can probably tokenize the names so they become irrelevant. You can ignore non-functional whitespace, so that canonical code C remains. Maybe one can hash all the training data D and check whether hash(C) is in hash(D). Some sort of Bloom filter...
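A rough sketch of that normalization using just the standard library (this treats keywords like any other name, which over-normalizes a little, and it still only catches exact structural matches, but it shows the idea):

  import io, tokenize, hashlib

  def normalized_hash(code: str) -> str:
      # Replace every identifier with a placeholder and drop formatting tokens,
      # so renamed variables and re-indented code hash to the same value.
      out = []
      for tok in tokenize.generate_tokens(io.StringIO(code).readline):
          if tok.type == tokenize.NAME:
              out.append("ID")
          elif tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                            tokenize.DEDENT, tokenize.COMMENT):
              continue
          else:
              out.append(tok.string)
      return hashlib.sha256(" ".join(out).encode()).hexdigest()

  a = "def total(xs):\n    return sum(xs)\n"
  b = "def grand_total(values):\n    return sum(values)\n"
  print(normalized_hash(a) == normalized_hash(b))   # True: only the names differ

A real filter would also need the membership test to stay cheap at training-set scale, which is where the Bloom filter comes in.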


Surprised not to see more mention of this. It would make sense for an AI to "copy" existing solutions. In the real world, we use clean room to avoid this.

In the AI world, unless all GPL (etc.) code is excluded from the training data, it's inevitable that some will be "copied" into other code.

Where lawyers decide what "copy" means.


It's not just about copying verbatim. They clearly use GPL code during training to create a derivative work.

Then you have the issue of attribution with more permissive licenses.


How do you know that when you write a simplish function, for example, it is not identical to some GPL code somewhere? "Line by line" code does not exist anywhere in the neural network; it doesn't store or reference data in that way. Every character of code is in some sense "synthesized".

If anything, this exposes the fragility of our concept of "copyright" in the realm of computer programs and source code. It has always been ridiculous. The GPL is just another license that leverages the copyright framework (the enforcement of the GPL cannot exist outside such a copyright framework, after all), so in such weird "edge cases" the GPL is bound to look stupid just like any other scheme.

Remember that the GPL also forbids "derivative" works to be relicensed (with a less "permissive" license). It is safe to say that you are writing code that is close enough to be considered "derivative" of some GPL code somewhere pretty much every day, and you can't possibly prove that you didn't cheat. So the whole framework collapses in the end anyways.


> How do you know that when you write a simplish function for example, it is not identical to some GPL code somewhere?

I don't, but then I didn't go first look at the GPL code, memorize it completely, do some brain math, and then write it out character by character.


I truly don't think they can guarantee that. Which is a massive concern.


(1) That may be so, but you are not training the models on public data like sports results. You are training it on copyright protected creations of humans that often took years to write.

So your point (1) is a distraction, and quite an offensive one to thousands of open source developers, who trusted GitHub with their creations.


   (1) training ML systems on public data is fair use 

This one is tricky, considering that kNN is also an ML system.


kNN needs to hold on to a complete copy of the dataset itself, unlike a neural net, where it's all mangled.
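For contrast, a toy 1-nearest-neighbour "model" makes the difference obvious: the fitted object literally is the training data (illustrative sketch only):

  class NearestNeighbor:
      def fit(self, X, y):
          # "Training" is nothing more than keeping a verbatim copy of the data
          self.X, self.y = list(X), list(y)
          return self

      def predict(self, x):
          dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in self.X]
          return self.y[dists.index(min(dists))]

  knn = NearestNeighbor().fit([[0, 0], [5, 5]], ["low", "high"])
  print(knn.predict([4, 6]))   # "high" -- and knn.X still holds the raw training points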


What about privacy? Does the AI send code to GitHub? This reminds me of Kite.


Yes, under "How does GitHub Copilot work?":

> [...] The GitHub Copilot editor extension sends your comments and code to the GitHub Copilot service, which then uses OpenAI Codex to synthesize and suggest individual lines and whole functions.


Fair use doesn't exist in every country, so it's US only?


Yes, my partner likes to remind me we don't have it here in Australia. You could never write a search engine here. You can't write code that scrapes websites.


It exists in the EU also (and it is much more powerful here).


The EU doesn't have a copyright-related fair use. Quite the opposite; that's why we are getting upload filters.


False. In Spain you have it under "uso legitimo".


Spain is only part of the EU, not the EU.


One exception makes the whole "the EU doesn't have it" claim incorrect.

The EU doesn't enforce it on the member states, yes. But some (maybe all) countries that are in the EU do have it.


> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

Another question is this: let's hypothesize I work solo on a project; I have enabled Copilot and, after a period of time, roughly half of the code has been written with it. One day the "hit by a bus" factor takes place; who owns the project after this incident?


Your estate? The compiler comparison upthread seems to be perfectly valid. If you work on a solo project in C# and die, Microsoft doesn't automatically own your project because you used Visual Studio to produce it.


> the output belongs to the operator, just like with a compiler.

No, it really is not that easy: even with compilers it depends on who owned the source and which license(s) they applied to it.

Or would you say I can compile the Linux kernel and the output belongs to me, as compiler operator, and I can do whatever I want with it without worrying about the GPL at all?


Fair Use is an affirmative defense (i.e. you must be sued and go to court to use it; once you're there, the judge/jury will determine if it applies). But taking in code with any sort of restrictive license (even if it's just attribution) and creating a model using it is definitely creating a derivative work. You should remember, this is why nobody at Ximian was able to look at the (openly viewable, but restrictively licensed) .NET code.

Looking at the four factors for fair use, it looks like Copilot will have these issues:

  - The model developed will be for a proprietary, commercial product
  - Even if it's a small part of the model, all the training data for that model are fully incorporated into the model
  - There is a substantial likelihood of money loss ("I can just use Copilot to recreate what a top tier programmer could generate; why should I pay them?")

I have no doubt that Microsoft has enough lawyers to keep any litigation tied up for years, if not decades. But your contention that this is "okay because it's fair use" based on a position paper by an organization supported by your employer... I find that reasoning dubious at best.


What does "public" mean? Do you mean "public domain", or something else?


Unfortunately, in ML "public data" typically means available to the public. Even if it's pirated, like much of the data available in the Books3 dataset, which is a big part of some other very prominent datasets.


So basically YouTube all over again? I.e. bootstrap and become popular by using whatever widely available media (pirated via crowdsourced piracy), and then many years later, once it is popular and dominant, it has to turn around, "do things right", and guard copyrights.


It is the end of copyright then. NNs are great at memorizing text. So I just train a large NN to memorize a repository, and the code it outputs during "inference" is fair use?

You can get past the GPL, LGPL and other licenses this way. Microsoft can finally copy the Linux kernel and get around the GPL :-).


> training ML systems on public data is fair use

So, to be clear, I am allowed to take leaked Windows source code and train an ML model on it?


Or, take leaked Windows source code, run it through a compiler, and own it!


> - under what license the generated code falls under?

Is it even copyrighted? Generally my understanding is that to be copyrightable it has to be the output of a human creative process, and this doesn't seem to qualify (I am not a lawyer).

See also, monkeys can't hold copyright: https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...


> Is it even copyrighted?

Isn't it subject to the licenses of the code the model was created from, as the learning is basically just an automated transformation of that code, which would still carry the original license? Otherwise I could just run some minifier, or some other, more elaborate, code transformation on a FOSS project, for example the Linux kernel, and relicense it under whatever I want.

Does not sound right to me, but IANAL and I also did not really look at how this specific model/s is/are generated.

If I built some AI on existing code I'd be quite cautious and group the models by compatible licence classes, ask the user what their project's licence is, and then only use the compatible parts of the models. Anything else seems not really ethical and rather uncharted legal territory to me, which may not mean much as IANAL and just some random voice on the internet, but FWIW at least I tried to understand quite a few FOSS licences to decide what I can use in projects and what not.

Does anybody know of relevant cases about AI and the input data its model was trained on, ideally in US or European jurisdictions?


This is a great point. If I recall correctly, prior to Microsoft's acquisition of Xamarin, Mono had to go out of its way to avoid accepting contributions from anyone who'd looked at the (public but non-FOSS) source code of .NET, for fear that they might reproduce some of what they'd seen rather than genuinely reverse engineering.

Is this not subject to the same concern, but at a much greater scale? What happens when a large entity with a legal department discovers an instance of Copilot-generated copyright infringement? Is the project owner liable, is GitHub/Microsoft liable, or would a court ultimately tell the infringee to deal with it and eat whatever losses occur as a result?

In any case, I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar). Otherwise, I think it would probably be too much risk to use this for anything important/revenue-generating.


> In any case, I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar). Otherwise, I think it would probably be too much risk to use this for anything important/revenue-generating.

I'm going to assume that there is no sensible whitelist of licenses until someone at GitHub is willing to go on the record that this is the case.


> I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar)

Yes, and even those licences require preservation of the original copyright attribution and licence. MIT gives some wiggle room with the phrase "substantial portions", so it might just be MIT and WTFPL


Interesting to see since Nat was a founder of Xamarin


(Not a lawyer, and only at all familiar with US law, definitely uncharted territory)

No, I don't believe it is, at least to the extent that the model isn't just copy and pasting code directly.

Creating the model implicates copyright law, that's creating a derivative work. It's probably fair use (transformative, not competing in the market place, etc), but whether or not it is fair use is github's problem and liability, and only if they didn't have a valid license (which they should have for any open source inputs, since they're not distributing the model).

I think the output of the model is just straight up not copyrighted though. A license is a grant of rights, you don't need to be granted rights to use code that is not copyrighted. Remember you don't sue for a license violation (that's not illegal), you sue for copyright infringement. You can't violate a copyright that doesn't exist in the first place.

Sometimes a "license" is interpreted as a contract rather than a license, in which you agreed to terms and conditions to use the code. But that didn't happen here, you didn't agree to terms and conditions, you weren't even told them, there was no meeting of minds, so that can't be held against you. The "worst case" here (which I doubt is the case - since I doubt this AI implicates any contract-like licenses), is that github violated a contract they agreed to, but I don't think that implicates you, you aren't a party to the contract, there was no meeting of minds, you have a code snippet free of copyright received from github...


So if I make AI that takes copyrighted material in one side, jumbles it about, and spits out the same copyrighted material on the other side, I have successfully laundered someone else's work as my own?

Wouldn't GitHub potentially be responsible for the infringement by distributing the copyrighted material knowing that it would be published?


I exempted copied segments at the start of my previous post for a reason: that reason is I don't really know. I doubt it works, because judges tend to frown on absurd outcomes.


Where does copying end, though? If an AI "retypes" it, not only with some variable renaming but with transformations that can't be described as a few simple rewrites (neural nets are really not transparent and can do weird stuff), it wouldn't look like a copy when you just compare parts of it, but it effectively would be one, as it was an automated translation.


Probably, copying ends when the original creative elements are unrecognizable. Renaming variables actually goes a long way to that, also having different or standardized (and therefore not creative) whitespace conventions, not copying high level structure of files, etc.

The functional parts of code are not copyrightable, only the non functional creative elements.

(Not a lawyer...)


> The functional parts of code are not copyrightable, only the non functional creative elements.

1. Depends heavily on the jurisdiction (e.g., software patents are a thing in America but not really in most European countries)

2. A change to a copyrightable work, creative or not, would still mean that you created a derived work where you'd hold some additional rights, depending on the original license, but not that it would now be only in your creative possession. E.g., check §5 of https://www.gnu.org/licenses/gpl-3.0.en.html

3. What do you mean by "functional parts"? Some basic code structure like an `if () {} else {}` -> sure, but anything algorithm-like can be seen as copyrightable, and whatever (creative or not) transformation you apply, at its basis it is a derived work; that's just a fact and the definition of derived work.

Now, would that matter in court? That depends not only on 1., but also very much on the specific case; most trivial cases would probably be thrown out, but an org could invest enough lawyer power, or sue in a court favourable to its case (OLG Hamburg, anyone?). Most small stuff would be thrown out as not substantial enough, or die even before reaching any court.

But that actually scares me a bit in this context: assuming you're right, this would significantly erode the power of copyleft licenses like the (A)GPL.

Especially if a non-transparent (e.g., AI), let's call it, code laundry were deemed a lawful way to strip out copyright. As it is non-transparent, it wouldn't be immediately clear whether a change is creative or not, to use the criterion for copyright you used. This would break basically the whole FOSS community, and with all its major projects (Linux, coreutils, Ansible, Git, WordPress, just to name a few) basically 80% of core infrastructure.


If the model is a derivative work, why wouldn’t works generated using the model also be derivative works?


Because a derivative work must "as a whole, represent an original work of authorship".

https://www.law.cornell.edu/uscode/text/17/101

(Not a lawyer...)


In the US, yes. Elsewhere, not necessarily.


It is the output of a human creative process, just not yours. Like an automated Stack Overflow snippet engine.


>Generally my understand is that to be copyrightable it has to be the output of a human creative process

https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...


You should read the FAQ at the bottom of the page; I think it answers all of your questions: https://copilot.github.com/#faqs


> You should read the FAQ at the bottom of the page; I think it answers all of your questions: https://copilot.github.com/#faqs

Read it all, and the questions still stand. Could you, or any on your team, point me on where the questions are answered?

In particular, the FAQ doesn't assure that the "training set from publicly available data" doesn't contain license or patent violations, nor if that code is considered tainted for a particular use.


From the faq:

> GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.

I'm guessing this covers it. I'm not sure that someone posting their code online, while explicitly saying you're not allowed to look at it, and having it ingested into this system with billions of other inputs, could somehow make you liable in court for some kind of infringement.


That doesn't cover it, since that is a technical answer for a non-technical question. The same questions remain.


That doesn't cover patent violations, license violations, or compatibility between licenses, which would be the most numerous and non-trivial cases.


How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?

Does everyone in this thread contact their lawyers after cutting and pasting a mergesort example from Stackoverflow that they've modified to fit their needs? Seems folks are reaching a bit.


For that very reason, many companies have policies that forbid copying code from online (especially from StackOverflow).


That mitigates copyright concerns, but patent infringement can occur even if the idea was independently rediscovered.


I was answering a specific question, "How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?" The answer is that many companies have forbidden that specific action in order to remove the risk from that action.

You are expanding the discussion, which is great, but that doesn't apply in answer to that specific question.

There are answers in response to your question, however. For example, many companies use software for scanning and composition analysis that determines the provenance and licensing requirements of software. Then, remediation steps are taken.


Not sure what you're getting at. Are you suggesting that independent discovery is a defense against patents? Or are you clear that it isn't a defense, but just arguing that something from the internet is more likely to be patented than something independently invented in-house? Maybe that's true, but it doesn't really answer the question of

> How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?

The only real answer is a patent search.


There are different ways to handle risk, such as avoidance, reduction, transferral, and acceptance. I was answering a specific question as to how people manage risk in a given situation. In answer I related how companies will reduce the risk. I was not talking about general cases of how to defend against the risk of patents, but a specific case of reducing the risk of adding externally found code into a product.

My answer described literally what many companies do today. It was not a theoretical pie in the sky answer or a discussion about patent IP.

To restate, the real-world answer I gave for, "How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?" is often "Do not take code from the Internet."


I think a patent violation with Copilot is exactly the same scenario as if you violated a patent yourself without knowing it.


Sounds like using Copilot can introduce GPL'd code into your project and make your project bound by the GPL as a result...

0.1% is a lot when you use 100 suggestions a day.


The most important question, whether you own the code, is sort of maybe vaguely answered under “How will GitHub Copilot get better over time?”

> You can use the code anywhere, but you do so at your own risk.

Something more explicit than this would be nice. Is there a specific license?

EDIT: also, there are multiple sections to the FAQ, notice the drop-down... under "Do I need to credit GitHub Copilot for helping me write code?", the answer is also no.

Until a specific license (or explicit lack there-of) is provided, I can’t use this except to mess around.


None of the questions and answers in this section hold information about how the generated code affects licensing. None of the links in this section contain information about licensing, either.


I don't see the answer to a single one of their questions on that page - did you link to where you intended?

Edit: you have to click the things on the left, I didn't realize they were tabs.


Sorry Nat, but I don't think it really answers anything. I would argue that using GPL code during training makes Copilot a derivative work of said code. I mean, if you look at how a language model works, then it's pretty straightforward. The term "code synthesizer" alone insinuates as much. I think this will probably ultimately be tested in court.


This page has a looping back button hijack for me


Does Copilot phone home?


When you sign up for the waitlist it asks permission for additional telemetry, so yes. Also the "how it works" image seems to show the actual model is on github's servers.


Yes, and with the code you're writing/generating.


This obviously sucks.

Can't companies write code that runs on customers' premises these days? Are they too afraid somebody will extract their deep learning model? I have no other explanation.

And the irony is that these companies are effectively transferring their own fears to their customers.


It's a large and gpu-hungry model.


Some of your questions aren't easy to answer. Maybe the first two were OK to ask. Others would probably require lawyers and maybe even courts to decide. This is a pretty cool new product just being shared on an online discussion forum. If you are serious about using it for a company, talk to your lawyers, get in touch with Github's people, and maybe hash out these very specific details on the side. Your comment came off as super negative to me.


> This is a pretty cool new product just being shared on an online discussion forum.

This is not one lone developer with a passion promoting their cool side-project. It's GitHub, which is an established brand and therefore already has a leg up, promoting their new project for active use.

I think in this case, it's very relevant to post these kinds of questions here, since other people will very probably have similar questions.


I think these are very important questions.

The commenter isn't interrogating some indy programmer. This is a product of a subsidiary of Microsoft, who I guarantee has already had a lawyer, or several, consider these questions.


No, they are all entirely reasonable questions. Yeah, they might require lawyers to answer - tough shit. Understanding the legal landscape that ones' product lives in is part of a company's responsibility.


Regardless of tone, I thought it was chock full of great questions that raised all kinds of important issues, and I’m really curious to hear the answers.


What do you think about this being overall detrimental to code quality, as it allows people to just blindly accept completions without really understanding the generated code? Similar to copy-and-paste coding.

The first example parse_expenses.py uses a float for currency - that seems to be a pretty big error that's being overlooked along with other minor issues around no error handling.

I would say the quality of the generated code in parse_expenses.py is not very high, certainly not for the banner example.

EDIT - I just noticed Github reordered the examples on copilot.github.com in order to bury the issues with parse_expenses.py for now. I guess I got my answer.


How is it different from the status quo of people just doing the wrong thing or copy-pasting bad code? Yes, there's the whole discussion below about float currency values, but I could very well see the opposite happening too, where this thing recommends better code than the person would've written otherwise.


> How is it different from the status quo of people just doing the wrong thing or copy pasting bad code?

Well, yes, the wrong code would be used. However - the wrong code would then become more prevalent as an answer from gh, causing more people to blindly use it. It's a self-perpetuating cycle of finding and using bad and wrong code.


Hmm, not quite. My point was that if they aren't a good enough programmer to understand why the code is wrong, then chances are they would've written bad code or copy pasted bad code anyways. It just makes the cycle faster.

And again, I could argue that the opposite could happen too: people who would otherwise have written bad code could be given suggestions of better code than they would've written.


> Hmm, not quite. My point was that if they aren't a good enough programmer to understand why the code is wrong, then chances are they would've written bad code or copy pasted bad code anyways. It just makes the cycle faster.

No, not quite. It also makes the cycle more permanent and its results deeply ingrained, which is what is actually relevant.


Either way it wouldn't matter, since the only thing that would stop the cycle is Stack Overflow closing down and a new Stack Overflow not opening up. A very unlikely scenario for this industry. Either way, no matter the difference in time frame, the result would have always been permanent.


People make mistakes. With computers people make mistakes much faster :)


To err is human. To really mess things up, you need a computer.


“People can create tech debt, but robots can do it at scale.”


It seems that copilot lets one cycle through options, which is an opportunity for it to facilitate programmers moving from a naive solution to one they hadn't thought of that is more correct.

(Unclear to me yet whether the design takes advantage of this opportunity)


I use a similar feature in IntelliJ IDEA, and I've often found that the first time I learn about a new feature in the language is when I get a suggestion. I usually explore the topic much more deeply at that time. So far from helping me copy-paste, I find code suggestions help me explore new features of the language and framework that I might not have known about.


>The first example parse_expenses.py uses a float for currency

I have made this mistake without the help of any AI or copy/paste. It's still in the hands of the developer to test and review everything they commit.


Why would you say it's an error to use a float for currency? I would imagine it's better to use a float for calculations and then round when you need to report a value, rather than accumulate a bunch of rounding errors while doing computations.


It is widely accepted that using floats for money[1] is wrong because floating point numbers cannot guarantee precision.

The fact that you ask is a very good case in point though: Many programmers are not aware of this issue and would maybe not question the "wisdom" of the AI code generator. In that sense, it could have a similar effect to blindly copy-pasted answers from SO, just with even less friction.

[1] Exceptions may apply to e.g. finance mathematics where you need to work with statistics and you're not going to expect exact results anyway.
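The classic demonstration, in Python, though any language using IEEE-754 binary floats behaves the same way:

  from decimal import Decimal

  print(0.10 + 0.20)                        # 0.30000000000000004
  print(0.10 + 0.20 == 0.30)                # False
  print(Decimal("0.10") + Decimal("0.20"))  # 0.30 -- exact, because it's base 10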


Standard floats cannot represent very common numbers such as 0.1 exactly so they are generally disfavored for financial calculations where an approximated result is often unacceptable.

> For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it.

https://en.wikipedia.org/wiki/Floating-point_arithmetic#Accu...



By definition, currency uses fixed-point arithmetic, not floating-point arithmetic.


Not even remotely true. It is entirely context dependent. I've always used floats when working in finance.


Some people say "goin der" instead of "going there", that doesn't change the definitions of words, just because people are lazy with their language.


I fail to see your point. Floats are best practice for many financial applications, where model error already eclipses floating point error and performance matters.


Your point in the floating-point discussion is true, but you're wrong about this one - linguistics is a descriptive field, not a prescriptive one.


Standard practice is to use a signed decimal number with an appropriate precision that you scale around.


You don't want to kick the can down to the floating point standard. Design for deterministic behavior. Find the edge cases, go over it with others and explicitly address the edge case issues so that they always behave as expected.


micro-dollars are a better way of representing it (multiply by 10^6); store as bigint.

See: https://stackoverflow.com/a/51238749
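A minimal sketch of that representation (the names are mine; a real system would also pin down a rounding policy, e.g. banker's rounding):

  MICROS_PER_DOLLAR = 1_000_000      # store money as integer micro-dollars

  price_micros = 19_990_000          # $19.99
  total_micros = 3 * price_micros    # pure integer math, no drift

  def to_display(micros: int) -> str:
      cents = round(micros / 10_000)          # convert to whole cents once, at the edge
      dollars, rem = divmod(cents, 100)
      return f"${dollars}.{rem:02d}"

  print(to_display(total_micros))    # $59.97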


Googler, opinions are my own.

Over in payments, we use micros regularly, as documented here: https://developers.google.com/standard-payments/reference/gl...

GCP on the other hand has standardized on units + nanos. They use this for money and time. So units would be 1 second or 1 dollar, and the nanos field allows more precision. You can see an example here with the unitPrice field: https://cloud.google.com/billing/v1/how-tos/catalog-api#gett...


No, they aren't. Micro-dollars do not exist, so this method is guaranteed to cause errors.


this is a common approach when you are dealing in rates less than .01 -- you just need to be sure you are rounding correctly


When you are approximating fixed-point using floating-point there is a lot more you need to do correctly other than rounding. Your representation must have enough precision and range for the beginning inputs, intermediate results, and final results. You must be able to represent all expected numbers. And so on. There is a lot more involved than what you mentioned.

Of course, if you are willing to get incorrect results, such as in play money, this may be okay.


When did mdellavo say anything about floating point? You can, and should, use plain old fixed-point arithmetic for currency. That's what he means by "microdollar".


> store as bigint


Thank you. I made a mistake due to the starting comment in the thread.


Using float for currency calculations is how you accumulate a bunch of rounding errors. Standard practice when dealing with money is to use an arbitrary-precision numerical type.


Because it's an error to use floats in almost every situation. And currency is something where you don't want rounding errors, period. The more I've learned about floating point numbers over the years, the less I want to use them. Floats solve a specific problem, and they're a reasonable trade-off for that kind of problem, but the problem they solve is fairly narrow.


Using float is perfectly OK, since using fixed-point decimal (or whatever "exact" math operations) will lead to rounding error anyway (what about multiplying a monthly salary by 16/31, half a month?).

The problem with float is that many people don't understand how floats work well enough to handle rounding errors correctly.

Now there are some cases where float don't cut it. And big ones. For example, summing a set of numbers (with decimal parts) will usually be screwed if you don't round it. And not many people expect to round the results of additions because they are "simple" operations. So you get errors in the end.

(I have written applications that handle billions of euros with floats and have found just as many rounding errors there as in any COBOL application)


It seems incorrect to determine half a month as 16/31, but OK, for your proposed example:

    >>> from decimal import Decimal
    >>> Decimal(1000) * Decimal(16) / Decimal(31)
    Decimal('516.1290322580645161290322581')
    >>> 1000 * 16 / 31
    516.1290322580645
The point is using Decimal allows control over precision and rounding rather than accepting ad-hoc approximations of a float.

https://docs.python.org/3/library/decimal.html

If it were me, I wouldn't go around bragging about how much money my software manages while being willfully ignorant of the fundamentals.


OK, the salary example was a bit simplified; in my case it was about giving financial help to someone. That help is based on a monthly allowance and then split over the number of allocated days in the month; that's where the 16/31 comes from.

Now for your example, I see that float and decimal just give the same result. Provided I'm doing financial computations of a final number, I'm ok with 2 decimals. And both your computations work fine.

The decimal module in Python gives you a number of significant digits, not a number of decimals. You'll end up using .quantize() to get to two decimals, which is rounding (so, no advantage over floats).

As I said, as soon as you have division/multiplication you'll have to take care of rounding manually. But for addition/subtraction, then decimal doesn't need rounding (which is better).

The fact is that everybody says "floats are bad" because rounding is tricky. But rounding is always possible. And my point is that rounding is tricky even with the decimal module.

And about bragging, I can tell you one more thing: rounding errors were absolutely not the worst of our problems. The worst problem is to be able to explain to the accountant that your computation is right. That's the hard part 'cos some computations imply hundreds of business decisions. When you end up on a rounding error, you're actually happy 'cos it's easy to understand, explain and fix. And don't start me on how laws (yes, the texts) sometimes explain how rounding rules should work.
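To make the .quantize() point concrete: the rounding step is explicit and uses a rounding mode you choose, rather than whatever the binary float happens to do (small sketch reusing the 16/31 example from above):

  from decimal import Decimal, ROUND_HALF_UP

  allowance = Decimal(1000) * Decimal(16) / Decimal(31)
  print(allowance)                                           # 516.1290322580645161290322581
  print(allowance.quantize(Decimal("0.01"), ROUND_HALF_UP))  # 516.13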


  sum = 0
  for i in range(0, 10000000):
    sum += 0.1
  print(round(sum*1000, 2))
What should this code print? What does it print?

I mean, sure, this is a contrived example. But can you guarantee that your code doesn't do anything similarly bad? Maybe the chance is tiny, but still: wouldn't you like to know for sure?


We agree: on additions, floats are tricky. But still, on division and multiplication, they're not any worse. Dividing something by 3 will end up in an infinite number of decimals that you'll have to round at some point (except if we use what you proposed: fractions; in that case that's a completely different story).


No, exact precision arithmetic can do that 16/31 example without loss of precision:

  from fractions import Fraction
  
  # salary is $3210.55
  salary = Fraction(321055,100)
  monthlyRate = Fraction(16,31)

  print(salary*monthlyRate)
This will give you an exact result. Now, at some point you'll have to round to the nearest cent (or whatever), true. However, you don't have to round between individual calculations, hence rounding errors cannot accumulate and propagate.

The propagation of errors is the main challenge with floating point numbers (regardless of which base you use). The theory is well understood (in the sense that we can analyse an algorithm and predict upper bounds on the relative error), but not necessarily intuitive and easy to get wrong.

Decimal floating-point circumvents the issue by just not introducing errors at all: money can be represented exactly with decimal floating point (barring very exotic currencies), therefore errors also can't propagate. Exact arithmetic takes the other approach where computations are exact no matter what (but this comes at other costs, e.g. speed and the inability to use transcendental functions such as exp).

For binary floating point, that doesn't work. It introduces errors immediately since it can't represent money well and these errors may propagate easily.


Of course, if you use fractions then, we agree, no error will be introduced nor accumulated over computations, which is better. The code base I'm talking about was Java, 10 years ago. I was not aware of fractions at that time. There was only BigDecimal, which was painful to work with (the reason why we ditched it at the time).


It's mostly painful because Java doesn't allow custom types to use operators, which I think was a maybe reasonable principle applied way too strictly. The same applies to any Fraction type you'd implement in Java.

Still, I'll take "verbose" over "error-prone and possibly wrong".


I visited https://copilot.github.com/, and I don't know how to feel. Obviously it's a nice achievement, not gonna lie.

But I have a feeling it will end up causing more work. e.g. the `averageRuntimeInSeconds` example, I had to spend a bit of time to see if it was actually correct. It has to be, since it's on the front page, but then I realized I'd need to spend time reviewing the AI's code.

It's cool as a toy, but I'd like to see where it is one year from now when the wow factor has cooled down a bit.


Interesting comment and I agree. Reading and writing code seem to involve different parts of the brain. I wonder if tools like this will create some sort of code review fatigue. I can write code for a few hours a day and enjoy it but I couldn't do code review for hours, every day.

This isn't like skimming through a codebase to get a sense of what the code does. You'd have to thoroughly review each line to make sure it does what you want it to do, that there are no bugs. And even then, you'd feel left behind pretty quickly because your brain didn't create the paths to the solution to the problem you're trying to solve. It is like reading a solution to a problem on leetcode vs coming up with it yourself.


It has the ability to generate unit tests as well, which will help cut down some on the verification side if you feed it enough cases.


I think I'd love to use this to generate tests and then write the functions myself. Test generation seems like a killer feature.


Yes!! Totally agree. Imagine writing a method and then telling an AI to write your unit tests for it. The AI would likely be able to come up with the edge cases and such that you would not normally take the time to write.

While I think the AI generating your mainline code is interesting, I must certainly agree that generating test code would be the killer feature. I would like to see this showcased a little more on the copilot page.


You don't need AI for that. While example-based testing is familiar to most, other approaches exist that can achieve this with less complexity. See: property-based testing.
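For example, with the hypothesis library you state the properties and it generates the awkward inputs (empty list, duplicates, huge values) for you. A small sketch with a stand-in function under test:

  from collections import Counter
  from hypothesis import given, strategies as st

  def my_sort(xs):                  # stand-in for the code you actually want to test
      return sorted(xs)

  @given(st.lists(st.integers()))
  def test_sort(xs):
      out = my_sort(xs)
      assert all(a <= b for a, b in zip(out, out[1:]))   # output is ordered
      assert Counter(out) == Counter(xs)                 # and is a permutation of the input

Run it with pytest; hypothesis will also shrink any failing input to a minimal counterexample.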


Yes, I agree. But just to ask: wouldn't we consider that a form of AI testing, even if just in a very primitive form? That raises the question of the very definition of AI. I would argue that your example is just the primordial version of what machine-reasoned testing could potentially offer.


Well then you have to check the generated tests. That's just one more layer, isn't it?


If you question the veracity of the code that is produced, you have to question the usefulness of the unit test that is produced.


> I had to spend a bit of time to see if it was actually correct.

Interesting point - it reminds me of the idea that it’s harder to debug code than to write it. Is it also harder to interpret code you didn’t write than to write it?


Might this end up putting GPL code into projects with an incompatible license?


It shouldn't do that, and we are taking steps to avoid reciting training data in the output: https://copilot.github.com/#faq-does-github-copilot-recite-c... https://docs.github.com/en/early-access/github/copilot/resea...

In terms of the permissibility of training on public code, the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use. We are certain this will be an area of discussion in the US and around the world and we're eager to participate.


> ...the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use.

To be honest, I doubt that. Maybe I am special, but if I am releasing some code under GPL, I really don't want it to be used in training a closed source model, which will be used in a closed source software generating code for closed source projects.


The whole point of fair use is that it allows people to copy things even when the copyright holder doesn't want them to.

For example, if I am writing a criticism of an article, I can quote portions of that article in my criticism, or modify images from the article in order to add my own commentary. Fair use protects against authors who try to exert so much control over their works that it harms the public good.


This isn't the same situation at all. The copying of code doesn't seem to be for a limited or transformative purpose. Fair use might cover parody or commentary & criticism but not limitless replication.


They are not replicating the code at all. They are training a neural network. The neural network then learns from the code and synthesises new code.

It's no different from a human programmer reading code, learning from it, and using that experience to write new code. Somewhere in your head there is code that someone else wrote. And it's not infringing anybody's copyright for those memories to exist in your head.


We can't yet equate ML systems with human beings. Maybe one day. But at the moment, it's probably better to compare this to a compiler being fed licensed code. The compilation output is still subject to the license, regardless of how fancy the compiler is.

Also, a human being who reproduces licensed code from memory - because they read that code - would be committing a license violation. The line between a derivative work and an authentically new original creation is not a well-defined one. This is why we still have human arbiters of these decisions and not formal differential definitions of it. This happens in music, for example, all the time.


If avoiding copyright violations were as simple as "I remembered it", then I don't think things like clean-room reverse engineering would ever be legally necessary [1]

[1] https://en.wikipedia.org/wiki/Clean_room_design


It is replication, maybe not of a single piece of code - but creating a synthesis is still copying. For example, constructing a single piece of code out of three pieces of code from your co-workers is still replication of code.

Your argument would have some merit if something were created instead of assembled, but there is no new algorithm that is being created. That is not what is happening here.

On the one hand, you call this copying in fair use. On the other hand, you say this is creating new code. You can't have it both ways.


> Your argument would have some merit if something were created instead of assembled, but there is no new algorithm that is being created. That is not what is happening here.

If you're going to set such a high standard for ML tools like this, I think you need to justify why it shouldn't apply to humans too.

When a human programmer who has read copyrighted code at some point in their life writes new code that is not a "new algorithm", are they in violation of the copyrights of every piece of code they've ever read that was remotely similar in any respect to the new work?

I mean, I hope not!

> On the one hand, you call this copying in fair use. On the other hand, you say this is creating new code. You can't have it both ways.

I'm not a lawyer, but this actually sounds very close to the "transformative" criterion under fair use. Elements of existing code in the training set are being synthesized into new code for a new application.

I assume there's no off-the-shelf precedent for this, but given the similarity with how human programmers learn and apply knowledge, it doesn't seem crazy to think this might be ruled as legitimate fair use. I'd guess it would come down to how willing the ML system is to suggest snippets that are both verbatim and highly non-generic.


From https://docs.github.com/en/github/copilot/research-recitatio...: "Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License."

On the same page is an image showing copilot in real-time adding the text of the famous python poem, The Zen of Python. See https://docs.github.com/assets/images/help/copilot/resources... for a link directly to copilot doing this.

You are making arguments about what you read instead of objectively observing how copilot operates. Just because GH wrote that copilot synthesizes new code doesn't mean that it writes new code in the way that a human writes code. That is not what is happening here. It is replicating code. Even in the best case copilot is creating derivative works from code where GH is not the copyright owner.


> You are making arguments about what you read instead of objectively observing how copilot operates.

Of course I am. We are both participating in a speculative discussion of how copyright law should handle ML code synthesis. I think this is really clear from the context, and it seems obvious to me that this product will not be able to move beyond the technical preview stage if it continues to make a habit of copying distinctive code and comments verbatim, so that scenario isn't really interesting to me. Github seems to agree (from the page on recitation that you linked):

> This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.

> But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.

> The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

> This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.

The arguments you've made here would seem to apply equally well to a version of Copilot hardened against "recitation", hence my reply.

> Even in the best case copilot is creating derivative works from code where GH is not the copyright owner.

It would be convenient for your argument(s) if it were decided legal fact that ML-synthesized code is derivative work, but it seems far from obvious to me (in fact, I would disagree) and you haven't articulated a real argument to that effect yourself. It has also definitely not been decided by any legal entity capable of establishing precedent.

And, again, if this is what you believe then I'm not sure how the work of human programmers is supposed to be any different in the eyes of copyright law.


> Of course I am. We are both participating in a speculative discussion of how copyright law should handle ML code synthesis. I think this is really clear from the context, and it seems obvious to me that this product will not be able to move beyond the technical preview stage if it continues to make a habit of copying distinctive code and comments verbatim, so that scenario isn't really interesting to me. Github seems to agree (from the page on recitation that you linked):

No. We both aren't. I am discussing how copilot operates from the perspective of a user concerned about legal ramifications. I backed that concern up with specific factual quotes and animated images from github, where github unequivocally demonstrated how copilot copies code. You are speculating how copyright law should handle ML code synthesis.


> No. We both aren't

You say I'm not ... but then you say, explicitly in so many words, that I am:

> You are speculating how copyright law should handle ML code synthesis.

I don't get it. Am I, or aren't I? Which is it? I mean, not that you get to tell me what I am talking about, but it seems like something we should get cleared up.

edit: Maybe you mean I am, and you aren't?

Beyond that, I skimmed the Github link, and my takeaway was that this is a small problem (statistically, in terms of occurrence rate) that they have concrete approaches to fixing before full launch. I never disputed that "recitation" is currently an issue, but honestly that link seems to back up my position more than it does yours (to the extent that yours is coherent, which (as above) I would dispute).


> They are not replicating the code at all.

Now that five days have passed, there have been a number of examples of copilot doing just that, replicating code. Quake source code that even included comments, the famous python poem, etc. There are many examples of code that has been replicated - not synthesized but duplicated byte for byte from the originals.


surely that depends on the size of the training set?

I could feed the Linux kernel one function at a time into a ML model, then coerce its output to be exactly the same as the input (see the toy sketch below)

this is obviously copyright infringement

whereas in the github case where they've trained it on millions of projects maybe it isn't?

does the training set size become relevant legally?
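
To make the memorization point concrete, here is a toy sketch (nothing like Copilot's real architecture, and the "kernel function" is an invented stand-in): an order-k character model trained on a single function has seen every context exactly once, so "generation" just replays the training text verbatim.

    from collections import defaultdict

    def train(text, k=8):
        # record, for every k-character context, the characters that followed it
        model = defaultdict(list)
        for i in range(len(text) - k):
            model[text[i:i + k]].append(text[i + k])
        return model

    def generate(model, seed, length, k=8):
        out = seed
        while len(out) < length:
            continuations = model.get(out[-k:])
            if not continuations:
                break
            out += continuations[0]  # the only continuation ever observed
        return out

    training_function = (   # stand-in for a single GPL'd kernel function
        "static int example_init(void)\n"
        "{\n"
        "    pr_info(\"hello\\n\");\n"
        "    return 0;\n"
        "}\n"
    )

    model = train(training_function)
    copy = generate(model, training_function[:8], len(training_function))
    print(copy == training_function)  # True: with one training example, the "model" is a copy machine

With millions of repositories in the training set the statistics change, but whether that difference in scale matters legally is exactly the open question.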


Fair Use is specific to the US, though. The picture could end up being much more complicated when code written outside the US is being analyzed.


The messier issue is probably using the model to write code outside the US. Americans can probably analyze code from anywhere in the world and refer to Fair Use if a lawyer comes knocking, but I can't refer to Fair Use if a lawyer knocks on my door after using Copilot.


It is not US-specific; we have it in the EU. And e.g. in Poland I could reverse engineer a program to make it work on my hardware/software if it doesn't. This is covered by fair use here.


Is it any different than training a human? What if a person learned programming by hacking on GPL public code and then went to build proprietary software?


It is different in the same way that a person looking at me from their window when I pass by is different from a thousand cameras observing me when I move around city. Scale matters.


> a thousand cameras observing me when I move around city. Scale matters.

While I certainly appreciate the difference, is camera observation illegal anywhere where it isn't explicitly outlawed? Meaning, have courts ever decided that the difference of scale matters?


No idea. I was not trying to make a legal argument. This was to try to convey why someone might feel ok about humans learning from their work but not necessarily about training a model.


This is a lovely analogy, akin to "sharing mix tapes" vs "sharing MP3s on Napster". I fear the coming world with extensive public camera surveillance and facial recognition! (For any other "tin foil hatters" out there, cue the trailer for Minority Report.)


>I fear the coming world with extensive public camera surveilance and facial recognition!

I fear the coming world of training machine learning models with my face just because it was published by someone somewhere (legally or not).


You can rest assured that this is already the case if your picture was ever posted online. There are dozens of such products that law enforcement buys subscriptions to.


A human being who has learned from reading GPL'd code can make the informed, intelligent decision to not copy that code.

My understanding of the open problem here is whether the ML model is intelligently recommending entire fragments that are explicitly licensed under the GPL. That would be a licensing violation, if a human did it.


Actually, I believe it's tricky to say whether even a human can do that safely. There's the whole concept of a "cleanroom rewrite": if you want to rewrite some GPL or closed-source project under a different license, you should make sure you never see even a glimpse of the original code. If you have looked at GPL or closed-source code (or, actually, code governed by any other license), it's hard to prove you didn't accidentally/subconsciously remember parts of it and copy them into your "rewrite" project, even if you made a decision not to copy. The border between "inspired by" and "blatant copyright infringement" is blurry and messy.

If that was already so tricky and troublesome legal-wise before, my first instinct is that with Copilot it could be even more legally murky territory. IANAL, yet I'd feel better if they made some [legally binding] promises that their model is based only on code carefully verified to carry one of an explicit (and published) whitelist of permissive licenses. (Even this could be tricky, with MIT etc. actually requiring some mention in your advertising materials [which is often forgotten], but that's a completely different level of trouble than not knowing whether I'm infringing GPL code, closed-source code, or some other weird license.)


> A human being who has learned from reading GPL'd code can make the informed, intelligent decision to not copy that code.

A model can do this as well. Getting the length of a substring match isn’t rocket science.
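
As a rough sketch of that idea (not GitHub's actual prefiltering code; the threshold and names are invented for illustration), the gate really is just a longest-substring check against the training corpus:

    MAX_VERBATIM_CHARS = 60  # hypothetical policy threshold

    def longest_overlap(suggestion, corpus_docs):
        # length of the longest substring of `suggestion` found verbatim in any corpus doc
        best = 0
        n = len(suggestion)
        for doc in corpus_docs:
            for i in range(n):
                length = best + 1  # only lengths that would beat the current best
                while i + length <= n and suggestion[i:i + length] in doc:
                    best = length
                    length += 1
        return best

    def vet_suggestion(suggestion, corpus_docs):
        # suppress (or, alternatively, attribute) suggestions that quote too much
        if longest_overlap(suggestion, corpus_docs) > MAX_VERBATIM_CHARS:
            return None  # or: surface it together with source and license info
        return suggestion

A production system would use hashing or suffix structures rather than this quadratic scan, but the decision itself is just a substring-length comparison.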


But wouldn't a machine learning model trained on AGPL code be hosting AGPL code in its memory?


Pretty sure merely hosting code doesn't trigger AGPL; if it did, github would have to be open-sourced.


Would you hire a person who only knew how to program by taking small snippets of GPL'd code and rearranging them? That's like hiring monkeys to type Shakespeare.

The clear difference is that a human's training regimen is to understand how and why code interacts. That is different from an engine that replicates other people's source code.


What if a person heard a song by hearing it on the radio and went on to record their own version?


There is already a legal structure in place for cover song licensing.

https://en.wikipedia.org/wiki/Cover_version#United_States_co...


Exactly so it needs licensing of some sort - this is closer to cover tunes than it is to someone getting a CS degree and being asked to credit Knuth for all their future work.


How do you distribute a human?


A contractor seems equivalent to SaaS to me


Perhaps we need GPL v4. I don't think there is any clause in current V2/V3 that prohibits learning from the code, only using the code in other places and running a service with code.


Would you be okay with a human reading your GPL code and learning how to write closed source software for closed source projects?


> To be honest, I doubt that.

Okay, but that's...not much of a counterargument (to be fair, the original claim was unsupported, though.)

> Maybe I am special, but if I am releasing some code under GPL, I really don't want it to be used in training a closed source model

That's really not a counterargument. “Fair use” is an exception to exclusive rights under copyright, and renders the copyright holder’s preferences moot to the extent it applies. The copyright holder not being likely to want it based on the circumstances is an argument against it being implicitly licensed use, but not against it being fair use.


> a closed source model

It seems like some of the chatter around this is implying that the resultant code might still have some GPL on it. But it seems to me that it's the trained model that Microsoft should have to make available on request.


That's the point of fair use. To do something with a material the original author does not want.


Can you explain why you think this is covered by fair use? It seems to me to be

1a) commercial

1b) non-transformative: in order to be useful, the produced code must have the same semantics as some code in the training set, so this does not add "a different character or purpose". Note that this is very different from a "clean room" implementation, where a high-level design is reproduced, because the AI is looking directly at the original code!

2) possibly creative?

3) probably not literally reproducing input code

4) competitive/displacing for the code that was used in the input set

So failing at least 3 out of 5 of the guidelines. https://www.copyright.gov/fair-use/index.html


1a) Fair use can be commercial. And copilot is not commercial so the point is moot.

1b) This is false. It is not literally taking snippets it has found and suggesting them to the user; that would be an intelligent search algorithm. This is writing novel code automatically based on what it has learned.

2) Definitely creative. It's creating novel code. At least it's creative if you consider a human programming to be a creative endeavor as well.

3) If it's reproducing input code it's just a search algorithm. This doesn't seem to be the case.

4) Most GPLed code doesn't cost any money. As such the market for it is non-existent. Besides copilot does not displace the original even if there were a market for it. As far as I know there is not anything even close to comparable in the world right now.

So from my reading it violates none of the guidelines.


This is what is so miserable about the GPL progression. We went from GPLv2 (preserving everyone's rights to use code) to GPLv3 (you have to give up your encryption keys) - I think we've lost the GPL as a place where we could solve / answer these types of questions which are good ones - GPL just tanked a lot of trust in it with the (A)GPLv3 stuff especially around prohibiting other developers from specific uses of the code (which is diametrically different from earlier versions which preserved rights).


Think what you will of GPLv3, but lies help no one. Of course it doesn't require you to give up your encryption keys.


Under GPLv2 I could make a device with GPLv2 software and maintain root of trust control of that device if I wanted (ie, do an anti-theft activation lock process, do a lease ownership option of $200/month vs $10K to buy etc).

Think what you will, but your lies about the GPLv3 can easily be tested. Can you point me to some GPLv3 software in the Apple tech stack?

We actually already know the answer.

Apple had to drop Samba (they were a MAJOR end user of Samba) because of GPLv3

I think they also moved away from GCC for LLVM.

In fact - they've probably purged at least 15 packages I'm aware of and I'm aware of NO GPLv3 packages being included.

Not sure what their App Store story is - but I wouldn't be surprised if they were careful there too.

Oh - this is all lies and Apple's lawyers are wrong? Come on - I'm aware of many other companies that absolutely will not ship GPLv3 software for this reason.

In fact, even by 2011 it was clear that GPLv3 was not really workable in a lot of contexts, and alternatives like MIT became more popular.

https://trends.google.com/trends/explore?date=all&geo=US&q=%...

Apple geared up to fight the DOJ over maintaining root control of devices (the San Bernardino case).

Even Ubuntu has had to deal with this - the SFLC made it clear that if some distributor messed things up, Ubuntu would have to release their keys, which is why they ended up with a MICROSOFT (!) solution.

"Ubuntu wishes to ensure that users can boot any operating system they like and run any software they want. Their concern is that the GPLv3 makes provisions by which the FSF could, in this case as the owner of GRUB2, deem that a machine that won't let them replace GRUB2 with something else is in violation of the GPLv3. At that point, they can demand that Ubuntu surrender its encryption keys used to provide secure bootloader verification--which then allows anyone to sign any bootloader they want, thus negating any security features you could leverage out of the bootloader (for example, intentionally instructing it to boot only signed code--keeping the chain trusted, rather than booting a foreign OS as is the option)." - commentator on this topic.

It's just interesting to me that rather than any substance the folks arguing for GPLv3 reach for name calling type responses.


That's why Apple's SMB implementation stinks! Finally, there's a reason for it, I thought they had just started before Samba was mature or something.


Yeah, it was a bit of a big bummer!

Apple used to also interoperate wonderfully if you were using Samba SERVER side too because - well, they were using Samba client side. Those days were fantastic frankly. You would run Samba server side (on Linux), then Mac client side - and still have your windows machines kind of on-network (for accounting etc) too.

But the Samba folks are (or were) VERY hard core GPLv3 folks - so writing was on the wall.

GPLv3 shifted things really from preserving developer freedom for OTHERs to do what they wanted with the code, to requiring YOU to do stuff in various ways which was a big shift. I'd assumed that (under GPLv2) there would be natural convergences, but GPLv3 really blew that apart and we've had a bit of a license fracturing relatively.

AGPLv3 has also been a bit weaponized to do a sort of fake open source where you can only really use the software if you pay for a commercial license.


The macOS CIFS client was from BSD, not from Samba.


BSD's have also taken a pretty strong stance against GPLv3 - again for violating their principles on freedom.

I can't dig it up right now but someone can probably find it.

But the BSD's used samba for a while as well.


As of Darwin 8.0.1 (so Tiger?) smbclient(1)'s man page was saying it was a component of Samba. I think some BSDs used Samba.


You can do what you describe with the GPLv3. You'll just have to allow others to swap out the root of trust if they so please.

Everything else you write is just anecdotes about how certain companies have chosen to do things.


Let me be crystal clear.

If I sell an open source radio with firmware limiting broadcast power / bands etc to regulated limits and ranges - under GPLv3, can I lock down this device to prevent the buyer from modifying it? I'm not talking about making the software available (happy to do that, GPLv2 requires that). I'm talking about the actual devices I build and sell (physical ones).

I can build a Roku or Tivo and lock it down? Have you even read the GPLv3? It has what is commonly called the ANTI-tivoisation clause PRECISELY to block developers from locking devices down for products they sell / ship.

If I rent a device and build in a monthly activation check - can I use my keys to lock the device and prevent the buyer from bypassing my monthly activation check or other restrictions?

The problem I have with GPLv3 folks is they basically endlessly lie about what you can do with GPLv3 - when there is plenty of VERY CLEAR evidence that everyone from Ubuntu to Apple to many others who've looked at this (yes, with attorney's) says that no - GPLv3 can blow up in your face on this.

So no, I don't believe you. These aren't "just anecdotes". These are companies incurring VERY significant costs to move away from / avoid GPLv3 products. AGPLv3 is even more poisonous - I'm not aware of any major players using it (other than those doing the fake open source game).


No, you can't lock it down without letting its owner unlock it. That's indeed the point. But your original comment said you have to give up your encryption keys. That's the lie I was getting at.

Now we can debate whether or not it's a good thing that the user gets full control of his device if he wants it. I think it is. You?


These claims are absurd. AGPL and GPLv3 carry on the same mission of GPLv2 to protect authors and end users from proprietization, patent trolling and freeloading.

This is why SaaS/Cloud companies dislike them and fuel FUD campaigns.


> ...the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use.

If you train an ML model on GPL code, and then make it output some code, would that not make the result a derivative of the GPL-licensed inputs?

But I guess this could be similar to musical composition. If the output doesn't resemble any of the inputs or contain significant continuous portions of them, then it's not a derivative.


> If the output doesn't resemble any of the inputs, or contains significant continous portions of them, then it's not a derivative.

In this particular case, the output resembles the inputs, or there is no reason to use Github Copilot.


> It shouldn't do that, and we are taking steps to avoid reciting training data in the output

This just gives me a flashback to copying homework in school, “make sure you change some of the words around so it’s not obvious”

I’m sure you’re right Re: jurisprudence, but it never sat right with me that AI engineers get to produce these big, impressive models but the people who created the training data will never be compensated, let alone asked. So I posted my face on Flickr, how should I know I’m consenting to benefit someone’s killer robot facial recognition?


Wait I thought y'all argued Google didn't copy Java for Android, now that big tech is copying your code you're crying wolf?


The whole point of that case begins with the admission "yes of course Google copied." They copied the API. The argument was that copying an API to enable interoperability was fair use. It went to the Supreme Court because no law explicitly said that was fair use and no previous case had settled the point definitively. And the reason Google could be confident they copied only the API is because they made sure the humans who did it understood both the difference and the importance of the difference between API and implementation. I don't think there is a credible argument that any AI existing today can make such a distinction.


>training ML models is fair use

How does that apply to countries where Fair Use is not a thing? As in, if you train a model on a fair use basis in the US and I start using the model somewhere else?


Fair use doesn’t exist in Germany.


I don’t think it’s fair to ask a US company to comment on legalities outside of the US.


It's fair to expect an international company pushing its products all over the world to be prepared to comment on non-US jurisdictions. (I have some sympathy for "we have a local market, and that's what we are solely targeting and preparing for" in companies where that is actually the case, but that's really not what we are dealing with in the case of Microsoft/GitHub.)


One would expect GitHub (owned by Microsoft) to have engaged corporate counsel for an opinion (backed by statute and case law), and to be prepared to disable the functionality in jurisdictions where it's incompatible with local IP law.


You just shared a URL that says "Please do not share this URL publicly".


Well, he's also GitHub's CEO so it's probably just fine.


> training ML models is fair use

In what context? You are planning on commercializing Copilot, and in that case the calculus on whether or not using copyright-protected material for your own benefit is fair use changes drastically.


It isn't. US copyright law says brief excerpts of copyright material may, under certain circumstances, be quoted verbatim

----> for purposes such as criticism, news reporting, teaching, and research <----, without the need for permission from or payment to the copyright holder.

Copilot is not criticizing, reporting, teaching, or researching anything. So claiming fair use is the result of total ignorance or disregard.


Would I be able to use something like this in the near future to produce a proprietary Linux kernel?


This is obviously controversial, since we are thinking about how this could displace a large portion of developers. How do you see Copilot being more augmentative than disruptive to the developer ecosystem? Also, how do you see it as different from regular code completion tools like tabnine?


We think that software development is entering its third wave of productivity change. The first was the creation of tools like compilers, debuggers, garbage collectors, and languages that made developers more productive. The second was open source where a global community of developers came together to build on each other's work. The third revolution will be the use of AI in coding.

The problems we spend our days solving may change. But there will always be problems for humans to solve.


This innovation does not seem like a natural successor to compilers, debuggers and languages. If today's programming environments still require too much boilerplate and fiddling with tools, it seems like better programming languages, environments that require less setup, etc would be a better use of time. Using GPT to spit out code you may or may not understand seems more like a successor to WSDLs and UML code generators. I really hope we're just in a wild swing of the pendulum towards complex tooling and that we swing back to simplicity before too long.

Edit:

To expand a little and not sound so completely negative towards AI: it seems like there could be value in training models to predict whether a patch will be accepted, or whether it will cause a full build to fail.


If this is the drive behind this project, it seems like you are putting too many eggs in one basket. Maybe it's a good attempt to get rid of the "glue" programming, but I wouldn't pay for this. It's all trivial stuff that I now need to review.

It would be a "cool tool" if it inspected the code statically and dynamically: testing the code to see if it actually does what the AI thinks it should do, from running small bits of code at the unit level to integration and acceptance testing. Suggest corrections or receive them. _That_ would save time, and that's something I and companies would pay for.

Also you cannot call this the "third revolution" if it is a paid service.


I appreciate this insight, as a proponent of progress studies. It is indeed a pragmatic view of what the industry will be or should be. I believe another thing that would be appreciated is a pair security auditor. Most vulnerabilities in software can be avoided early on in development; I believe this could be a great addition to GitHub's Security Lab: securitylab.github.com/


Have you or 'natfriedman authored any works in a public repository, so that we can judge the validity of this pragmatic view?


I'm super interested to read more about your theory/analysis. Have you written on it in a blog or anything?


There's a good amount of discussion on this topic in "The Mythical Man-Month". The entire book is discussing the factors that affect development timeframes and it specifically addresses whether AI can speed it up (albeit from 1975, 1986 and 1995 viewpoints, and comparing progress between those points.)


Thanks! That's a great suggestion. I forgot that was in there.

I read Mythical Man Month many years ago and enjoyed it. Time for a re-read. Of course it won't cover the third wave very well though. Would love to see a blog post cover that.


Let's solve the problem of replacing CEOs next. The above paragraph could have been written by GPT-3 already.


I think this is already happening. There's credible evidence that the Apple CEO, Tim Cook, has been essentially replaced by a Siri-driven clone over the last 7 months. They march the real guy out when needed, but if you watch closely when they do, it's obvious he's under duress reading lines prepared by an AI. His testimony in the Epic lawsuit for example. They'll probably cite how seriously he and the company take 'privacy' to help normalize his withdrawal from the public space in the coming years.


This is exactly the kind of fun, kooky conspiracy theory I've missed with all the real conspiracies coming to light over the last decade or so.


Can you cite some of this credible evidence?


I think you’re looking at the problem the wrong way. This provides less strong engineering talent with more leverage. The CEO (which could be you!) gets closer to being a CTO with less experience and context necessary (recall businesses that run on old janky codebases or no code platforms; they don’t have to be elegant, they simply have to work).

It all boils down to who is capturing the value for the effort and time expended. If a mediocre software engineer can compete against senior engineers with such augmentation, that seems like a win. Less time on learning language incantations, more time spent delivering value to those who will pay for it.


That's not really how it's going to go though. Just look at what your average person is able to accomplish with Excel.

Your own example of the CEO becoming a CTO can be used in every level and part of the business.

Now the receptionist is building office automation tools because they can describe what they want in plain English and have this thing spit out code.


> Just look at what your average person is able to accomplish with Excel.

Approximately nothing.

The average knowledge worker somewhat more, but lots of them are at the level of “I can consume a pivot table someone else set up”.

Sure, there are highly-productive, highly-skilled excel users that aren't traditional developers that can build great things, but they aren’t “your average person”.


https://news.ycombinator.com/item?id=24791017 (HN: Excel warriors who save governments and companies from spreadsheet errors)

https://news.ycombinator.com/item?id=26386419 (HN: Excel Never Dies)

https://news.ycombinator.com/item?id=20417967 (HN: I was wrong about spreadsheets)

https://mobile.twitter.com/amitranjan/status/113944938807223... (Excel is every #SAAS company's biggest competitor!)


Yes, Excel “runs the world”, and in most organizations, you’ll find a fairly narrow slice of Excel power users that build and maintain the Excel that “runs the world”.

We may not call them developers or programmers (or we might; I’ve been one of them as a fraction of my job at different times, both as a “fiscal analyst” by working title and as a “programmer analyst” by title), but effectively that's what they are, developers using (and possibly exclusively comfortable with) Excel as a platform.


Well, agree to disagree here as I've seen it with my own eyes, but it's kind of beside the point.

Is it a coincidence that the same company that makes Excel is trying to… “democratize” and/or de-specialize programming?

I don’t really think so, but shrug.


LOL. But you actually make a good point here. GPT-3 can replace most comms / PR type jobs since they all sound like Exec-speak.


Usually I agree, but I think Nat's comment here makes perfect sense and isn't just some PR buzzword stew. Tools like these are basically a faster version of searching Stack Overflow. You could have suggested that things like GitHub and Stack Overflow would replace programmers since you could just copy and paste snippets to write your code.

And sure, we do now have tools like Squarespace which fully automate making a basic business landing page and online store. But the bar has been raised, and we now have far more complex websites without developer resources being wasted on making web stores.


natfriedman is a human being like you and me, not an AI; let's treat them with consideration for that.


Perhaps he should go easy on the euphemisms then and show respect for the developers who wrote the corpus of software that this AI is being trained on (perhaps illegally).


OK, then ask him to go easy! Great idea, and it might get a good response.


How many jobs have developers helped displace in business and industry? I don't think it's controversial that we become fair game for that same automation process we've been leading.


>How many jobs have developers helped displace in business and industry? I don't think it's controversial that we become fair game for that same automation process we've been leading.

historically when has that sort of 'tit-for-tat' style of argument ever been helpful?

the correct approach would be "we've observed first hand the problems that we've caused for society; how can we avoid creating such problems for any person in the future?"

It might seem self-serving, and it is, but 'two wrongs don't make a right'. Let's try to fix such problems rather than serving our sentence as condemned individuals.


> historically when has that sort of 'tit-for-tat' style of argument ever been helpful?

It's not tit-for-tat, it's a wake up call. As in, what exactly do you think we've been doing with our skills and time?

> "we've observed first hand the problems that we've caused for society"...

But not everyone agrees that this is actually a problem. There was a time when being a blacksmith or a weaver was a very highly paid profession, and as technology improved and the workforce became larger, large wages could no longer be commanded. Of course the exact same thing is going to happen to developers, at least to some extent.


> How many jobs have developers helped displace in business and industry?

How many?

> I don't think it's controversial that we become fair game for that same automation process we've been leading.

This is not correct. A human (developer) displacing another human (business person) is entirely different than a tool (AI bot) replacing a human (developer).

Regardless, this is the Lump of Labour fallacy (https://en.wikipedia.org/wiki/Lump_of_labour_fallacy).

In this case, it is assumed that the global amount of development work is fixed, so that, if AI takes a part of it, the equivalent workforce in terms of developers will be out of a job. Especially in the field of SWE, this is obviously false.

It also needs to be seen what this technology will actually do. SWE is a complex field, way more than typing a few routines. In the best case (technologically speaking) this will be an augmentation.


> A human (developer) displacing another human (business person) is entirely different

That's not what is happening though; a few developers replace thousands of business and industry people with automated tools. Say, automated route planning for package delivery would take many thousands of humans if not for the AI bots that do the job instead.

> SWE is a complex field, way more than typing a few routines. In best case (technologically speaking) this will be an augmentation.

Of course there will always be some jobs for humans to do. Just like there are still jobs for humans loading thread into the automated looms and such.

But your arguments against automation displacing programming jobs ring hollow. People said the same thing about chess playing programs, they would never be able to understand the subtlety or complexity like a human could.


> That's not what is happening though, a few developers replace thousands of business and industry people with automated tools. Say, automated route planning for package delivery, would take many thousands of humans if not for the AI bots that do the job instead.

Without reading and understanding the lump of labour fallacy, the relation between the fallacy and the displacement of jobs can't be understood. In short, the fallacy is not incompatible with the displacement argument; the difference is in the implications.

> But your arguments against automation displacing programming jobs ring hollow. People said the same thing about chess playing programs, they would never be able to understand the subtlety or complexity like a human could.

Chess is a finite problem, SWE isn't, so they can't be compared.


Nope, before the modern approach to shipping stuff you simply couldn't get many different things unless you were in a big city. There weren't humans doing the route planning, there was no one because it wasn't worth doing at all.


> SWE is a complex field, way more than typing a few routines. In best case (technologically speaking) this will be an augmentation.

If there is a pathway to improving this AI-assist efficiency, say by restricting the language, methodology, UI paradigm and design principles, it will happen quickly due to market incentives. The main reason SWE is complex is that it's done manually in myriad subjectively preferred ways.


Indeed. It should be the goal of society to automate away as much work as possible. If there are perverse incentives working against this then we should correct them.


1. How do you define work differently from "that which should be automated"?

2. While I agree with your stance, it is not by itself sufficient. If you provide the automation but you do not correct the perverse incentives (or you worry about correcting them only later) that you mention, then you are contributing to widening the disparity between a category of workers (who have now lost their leverage) and those with assets and capital (who have a reduced need for workers).


I agree, the fact we're even talking about this is evidence that our society has the perverse incentive baked in and we should be aware of and seek to address that.

Regardless, programmers would be hypocritical to decry having their jobs automated away.


That's why it's best to get unions or support systems (like UBI) before they're needed. It's hard to organize and build systems when you have no leverage, influence, or power left.


Why is the disparity bad?


What do you mean by "bad"? If you're asking why it makes sense to structure society with an eye toward avoiding disparity, then it's enough to just observe empirically that people have an aversion to unfair treatment. And not just people: https://en.wikipedia.org/wiki/Social_inequity_aversion

If you're asking why do people respond the way they do to disparity, then I can only speculate that it has something to do with the meaning of life.


Human beings need something to do to have a fulfilling life. I do not agree at all that the ultimate goal of society is to automate everything that’s possible. I think that will be horrible overall for society.


My job is probably the least fulfilling activity in my life and I'm sure that goes for a lot of people.

By your reasoning, maybe we don't need backhoes and should just hire a bunch of guys with spoons instead?


I typically find other things fulfill my life more than work.


I would be up for that, if said society did not leave us destitute as a result


Nothing is inevitable. Doctors and lawyers have protected their professions successfully for centuries.

Only some software developers seem interested in replacing themselves in order to enrich their corporate masters (mains?) even further.

Just don't use this tool!


This tool only replaces a small part of a good programmer and just further highlights the differences between people blindly writing code and people building actual systems.

The challenge in software development is understanding the real world processes and constraints and turning them into a design for a functional & resilient system that doesn't collapse as people add every little idea that pops into their head.

If the hard part was "typing in code" then programmers would have been replaced long ago. But most people can't even verbally explain what they do as a series of steps & decision points such that a coherent and robust process can be documented. Once you have that it's easy to turn into code.


> Doctors and lawyers have protected their professions successfully for centuries.

And one could argue that this means we all pay more for health and legal services than we otherwise would. You have to calculate both costs and benefits; what price does society pay for those few people having very high paying jobs?


This feels like people protesting against automation by not using the automated checkout machines at a store. Go ahead, faster queues for me.


> This is obviously controversial, since we are thinking about how this could displace a large portion of developers.

It... couldn't, in net.

Tools which improve developer productivity increase the number of developers hired and the number of tasks for which it is worthwhile to employ them and the market clearing price for development work.

See, for examples, the whole history of the computing industry as we’ve added more layers of automation between “conceptual design for software” and “bit patterns in hardware implementing that conceptual design as concrete software”.

It might displace or disadvantage some developers in specific (though I doubt a large portion) by shifting the relative value of particular subskills within the set used in development, I suppose.


I agree with this viewpoint.

A tool which increases how rapidly we can output code—correct code—would allow for more time spent on hard tasks.

I can see the quality of some "commodity" software increasing as a result of tools in this realm.


In just 2-3 years' time DL algorithms will have as many synapses (parameters) as the human brain. The only thing needed to teach such an algorithm to program better than us is to tell it that the program needs to be faster, more secure and more user friendly (it's not impossible to teach this to a DL algorithm). This "tool" will make our work 90% easier in the next few years, so unless we have much more work to do, we will earn much less and juniors will most likely not be needed anymore...


Intel put itself out of business since all the world's computing needs can now be done on one CPU.

Or, perhaps, lower unit costs of computing lead to far far greater demand for computing since it became practical to apply to more domains.


I have been using this - for example working in Go on Dapr (dapr.io) or in Python on one of its SDKs.

I love it. So often the code suggestions accurately anticipate what I planned to do next.

It's especially fun to write a comment or doc string and then see Copilot create a block of code perfectly matching your comment.
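
For anyone who hasn't tried it, the workflow looks roughly like this. The example is invented for illustration, not a captured Copilot suggestion: you type the signature and docstring, and the body is the kind of completion it proposes.

    from datetime import datetime

    def parse_iso_timestamps(lines):
        """Parse ISO-8601 timestamp strings, skipping any line that fails to parse."""
        # everything below is the sort of body the assistant fills in
        results = []
        for line in lines:
            try:
                results.append(datetime.fromisoformat(line.strip()))
            except ValueError:
                continue
        return results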


I'm glad they find it head exploding but my concern is that it would be most head exploding to newbies who don't have the skill to discern if AI code is how it should be written.

For a seasoned veteran writing the code was never really the hard part in the first place.


> For a seasoned veteran writing the code was never really the hard part in the first place.

Yes, to most coders this Copilot software is just a fancy keyboard.


Sounds great. I'm a bad typist, anything that makes me type less (vim, voice assistant, completion) is a big win to me


Vim is great because it doesn't try to be smart.


And you can't quit it so you are forced to learn it well.


If I put a section in my LICENSE.txt prohibiting use as training data in commercial models, would that be sufficient to keep my code out of models like this?


In the end this would slightly increase the likelihood of such sections appearing in licenses generated by AIs.


> If I put a section in my LICENSE.txt prohibiting use as training data in commercial models, would that be sufficient to keep my code out of models like this?

Neither in practice (because it doesn't look for it) nor legally in the US, if Microsoft's contention that such use is "fair use" under US copyright law holds.

That “fair use” is an Americanism and not a general feature of copyright law might create some interesting international wrinkles, though.


Their contention is

> Why was GitHub Copilot trained on data from publicly available sources?

> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.

Personally, I'd prefer this to be like any other software license. If you want to use my IP for training, you need a license. If I use MIT license or something that lets you use my code however you want, then have at it. If I don't, then you can't just use it because it's public.

Then you'd see a lot more open models. Like a GPL model whose code and weights must be shared because the bulk of the easily accessible training data says it has to be open, or something like that.

I realize, however, that I'm in the minority of the ML community feeling this way, and that it certainly is standard practice to just use data wherever you can get it.


When I referenced their contention on Fair Use, that's not what I was referencing, but instead Github CEO Nat Friedman’s comment in this thread that “In general: (1) training ML systems on public data is fair use”.

https://news.ycombinator.com/item?id=27678354


> however you want

I don't see any attribution here.

MIT may say "substantial portions" but BSD just says "must retain".


would be interesting if someone uploaded a leaked copy of the NT kernel, then coerced the system to regurgitate it piece by piece

would MS position then be different?


Don't make your code public. Someone could read it and train the model in their brain to synthesize some code based on it.

If it's publicly available then it's fair game to use it to learn from and base ideas on.


Only if they trained a model to be able to read and understand LICENSE.txt files -- wowzers what a monster improvement that would be for the world

Or, I guess a sentinel phrase that the scraper could explicitly check: `github-copilot-optout: true`
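
A sketch of what that check might look like on the scraper side; the sentinel string comes from the comment above and the file list is a guess, since Copilot documents no such mechanism today:

    from pathlib import Path

    OPTOUT_SENTINEL = "github-copilot-optout: true"   # hypothetical marker
    CANDIDATE_FILES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING", "README.md")

    def repo_opted_out(repo_root):
        # True if any top-level license/readme file carries the opt-out marker
        for name in CANDIDATE_FILES:
            path = Path(repo_root) / name
            if path.is_file() and OPTOUT_SENTINEL in path.read_text(errors="ignore"):
                return True
        return False

    # the corpus builder would then simply skip opted-out repositories:
    # training_repos = [r for r in cloned_repos if not repo_opted_out(r)]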


Or it could explicitly check for known standard licenses that permit it, if it were opt in instead of opt out, the way most everything else in software licensing is opt-in for letting others use.


This looks really cool. Do you plan to release this in some other form like a language server so that it can be easily integrated to other editors?


Have there yet been reports of the AI writing code that has security bugs? Is that something folks are on the lookout for?


I haven't seen any reports of this, but it's certainly something we want to guard against: https://copilot.github.com/#faq-can-github-copilot-introduce...


Has there been an attempt to train a similar ML model on a smaller dataset of standards-compliant code? e.g. MISRA C.

I started working at a healthcare company earlier this year, and my whole approach to software has needed to change. It's not about implementing features any more - every change to our embedded code requires significant unit testing, code review, and V&V.

Having a standards-compliant Copilot would be wonderful. If it could catch some of my mistakes before I embarrass myself to code-reviewing colleagues, the codebase would be better off for it and I'd be less discouraged to hear those corrections from a machine than a person.
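
One way to approximate the standards-compliant idea above without a bespoke model would be to filter the fine-tuning corpus with a compliance checker first. The sketch below assumes cppcheck with its MISRA addon is installed, and fine_tune() is a hypothetical stand-in for whatever training pipeline would be used; this is a guess at an approach, not anything GitHub has announced.

    import subprocess
    from pathlib import Path

    def is_misra_clean(c_file):
        # heuristic: keep only files for which cppcheck's MISRA addon reports nothing
        result = subprocess.run(
            ["cppcheck", "--addon=misra", "--quiet", str(c_file)],
            capture_output=True,
            text=True,
        )
        # cppcheck prints findings (e.g. "misra-c2012-...") to stderr
        return "misra" not in result.stderr.lower()

    def build_corpus(source_root):
        return [p.read_text(errors="ignore")
                for p in Path(source_root).rglob("*.c")
                if is_misra_clean(p)]

    # corpus = build_corpus("embedded-projects/")
    # fine_tune(base_model, corpus)   # hypothetical training step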


Is there a public API? Will it be documented? Are you open to folks porting the VSCode plugin to other editors (I.e. kakoune’s autocomplete)?


Hi Nat! Just signed up for the preview (even though I'm the type to turn off intellisense and smart indent). I was wondering if WebGL shader code (glsl) was included in the training set? Translating human understandable graphics effects from natural language is a real challenge ;)


The technical preview document points out that editor context is sent back to the server not just for code generation but as feedback for improvement. Are you (or OpenAI) improving the ML models based on the usage of this extension? It is interesting what the pricing will look like given that the model was originally trained on FOSS and then you go and harvest test cases from real users. If that’s the case I think that should be clearly explained upfront.


Has this been tested for accessibility yet, particularly with a screen reader?



Are those developers worried about having their jobs replaced by a code-writing AI? :)

I mean... why would 95% of developer jobs exist with this tech available?

You just need that 5% of devs who actually write novel code for this thing to learn from.


https://en.wikipedia.org/wiki/Profession_(novella)

An Isaac Asimov story about someone who didn't take to the program and, as a result, got picked to create new things because someone has to make them.


I love this story.

If you want to read the whole thing, it's here:

https://www.abelard.org/asimov.php


Is there any way to port this into emacs?


This is impressive. And scary. How long has your team been working on this first release?


Cool project! Have you seen any interesting correlations between languages, paradigms and the accuracy of your suggestions?


I assume this will also work alright on other languages that aren't in the demo, e.g. C++?


One question: how long is the waitlist? Very excited to try this!

