
> As a human, I am allowed to read copyrighted code and learn from it.

Of course not. Reading some copyrighted code can have you entirely excluded from some jobs - you can't become a Wine contributor if it can be shown you ever read Windows source code, and most likely conversely. Likewise, you can't ever write GPL VST 2 audio plug-ins if you ever had access to the official Steinberg VST2 SDK. Etc etc...

Did people forget why black box reverse engineering of software ever came to be?




> Of course not. Reading some copyrighted code can have you entirely excluded from some jobs

That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.

Those projects could hire people familiar with competitor code and assign them to competing projects if they wanted. The contributors could, in theory, write new code without using proprietary knowledge from their other companies. In practice, that's actually really difficult to do and even more difficult to prove in court, so companies choose the safe option and avoid hiring anyone with that knowledge altogether.

Now the question is whether or not GitHub's AI can be argued to have proprietary knowledge contained within. If your goal is to avoid any possibility that any court could argue that GitHub copilot funneled proprietary code (accessible to GitHub copilot) into your project, then you'd want to forbid contributors from using CoPilot.


In this case, though, we have a machine learning model that is trained on some code and is not merely learning abstract concepts to be applied generally in different domains; instead it can use that knowledge to produce code that looks pretty much the same as the training material, given a context that fits the training material.

If humans did that, it would be hard to argue they didn't outright copy the source.

When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?

If yes, why doesn't parsing the source into an AST and then rendering it back also insulate you from having to abide by copyright?


>When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?

You've hit the nail on the head here. If this is okay, then neural nets are simply machines for laundering IP. We don't worry about people memorizing proprietary source code and "accidentally" using it because it's virtually impossible for a human to do that without realizing it. But it's trivial for a neural net to do it, so comparisons to humans applying their knowledge are flawed.


> We don't worry about people memorizing proprietary source code and "accidentally" using it

I'm not sure why it's different, but that's a common concern with music. For example: https://www.reddit.com/r/WeAreTheMusicMakers/comments/4v8u8d...


That's a really good observation. Perhaps it highlights an essential difference between two modes of thought - a fuzzy, intuitive, statistical mode based on previously seen examples, and a reasoned, analytical calculating mode which depends on a precise model of the system. Plausibly, the landscape of valid musical compositions is more continuous than the landscape of valid source code, and therefore more amenable to fuzzy, example-based generation; it's entirely possible to blend two songs and make a third song. Such an activity is nonsensical with source code, and so humans don't even try. We probably do apply that sort of learning to short snippets (idioms), but source code diverges too rapidly for it to be useful beyond that horizon.


This is not such a big problem in reality because the output of Copilot can be filtered to exclude snippets too similar to the training data, or any corpus of code you want to avoid. It's much easier to guarantee clean code than train the model in the first place.
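A minimal sketch of what such a filter could look like - assuming a simple token n-gram overlap check against a reference corpus. The function names, file paths, and the 20-token window are illustrative; this is not a description of anything GitHub has actually built:

    # Hypothetical post-hoc filter: reject a generated snippet if it shares a
    # long token n-gram with a reference corpus (e.g. code you must avoid).
    import re

    def token_ngrams(text, n=20):
        tokens = re.findall(r"\w+|[^\w\s]", text)
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def build_index(corpus_files, n=20):
        index = set()
        for path in corpus_files:
            with open(path, encoding="utf-8", errors="ignore") as f:
                index |= token_ngrams(f.read(), n)
        return index

    def is_too_similar(snippet, index, n=20):
        # Any shared 20-token window is treated as a likely verbatim recitation.
        return any(gram in index for gram in token_ngrams(snippet, n))

    # Usage: drop suggestions that overlap the corpus you want to keep out.
    # index = build_index(["gpl_project/foo.c", "gpl_project/bar.c"])
    # clean = [s for s in raw_suggestions if not is_too_similar(s, index)]

Plain token overlap is crude - it misses copies with renamed identifiers - but it's cheap enough to run on every suggestion.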


> then you'd want to forbid contributors from using CoPilot

I mean, if you used CoPilot on one computer, stared at it intensely for 1 hour, closed that computer, and then typed out code in the other computer that you were contributing from, you technically didn't use it for the contribution, you just used CoPilot for your education only.

Intellectual property is itself a flawed concept in many ways. It's like asking someone to do physics research but forbidding them from using anything that Einstein wrote.


Intellectual property itself is silly. How can a thought be the property of someone? Secrecy is the solution if you don't want others to learn from you (like Coca-Cola does).


It's not a natural right; we supposedly do it to stimulate innovation by offering a reward and in order to get things into the public domain -- obviously Disney (and the politicians that kowtowed to them) ruined that for the world.

Patent terms should have shrunk along with product lifecycles, and copyright should be a similar period; maybe 10-14 years.

My personal opinion.


That's just words for you... Intellectual property isn't property, at least not a full-blown one.

http://www.av8n.com/physics/weird-terminology.htm


It's not silly, it's an evolved and pragmatic solution to the question of how society can incentivize creative work. More or less every society has developed some notion of IP and there's little appetite in wider society to debate it - the idea of abolishing IP laws is deeply fringe and only really surfaces in forums like this one.

Does it have flaws and can it be improved upon? Sure. I think society underweights what improvements to the patent system in particular could do. But such ideas are so niche they are hardly even written down, let alone debated at large. Society has bigger issues on its mind.

Like any evolved system IP law encounters new challenges over time and will be expected to evolve again, which it will surely do. A simple fix for Copilot is surely to just exclude all non-Apache2/BSD/MIT licensed code. Although there might technically still be advertising clause related issues, in practice hardly anyone cares enough to go to court over that.


> Intellectual property itself is silly. How can a thought be the property of someone ?

No category of intellectual property covers thoughts, so the question has no relevance to the preceding statement.


An equally valid, AFAICS, way of looking at it is that "intellectual property" covers nothing but thoughts, only expressed in different forms.


If you read the video with a view to reproducing it, then you created a derivative work, i.e. copyright infringement.

If you just used it for inspiration, that's fine; if the way it was coded is a result of technical constraints, that's fine too; if the code is generic it's not distinctive enough to acquire copyright in the first place.


>That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.

and they made those decisions based on the need to be able to argue in court that code was not copied.

>then you'd want to forbid contributors from using CoPilot

Right, the whole question of whether Copilot spits out a ten-line function verbatim is not really what will be the problem. The problem is that a human programmer still needs to run Copilot, and they will be the ones shown in the codebase as the author of the code (they could of course put a comment 'I got this bit from Copilot', but that would be cumbersome and would hardly work as proof anyway). Although I suppose it would be not just proprietary code but any code with an incompatible license.


> >That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.

> and they made those decisions based on the need to be able to argue in court that code was not copied.

Yeah, but only to make it easier for them to argue it; the letter of the law doesn't require it. You could argue that "Sure, I read Windows source code once -- but that was years ago and I can't remember shit of it, so anything I wrote now is my own invention." That might be harder to get the court to accept as a fact, but it's not a prima facie legal impossibility.

Cautionary decision =/= actual law.


>That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.

Okay, so it's not law, it's just a policy compelled by preceding legal judgements. Case law, perhaps.


It's a policy compelled by the cost of fighting a court case compared to getting one dismissed.


That's not what GP is saying.

In general, you're absolutely allowed to learn programming techniques from anywhere. You can contribute software almost anywhere even if you've read Windows source code. Re-using everything you've learned, in your own creative creation, is part of fair use.

Your example is the very specific scenario where you're attempting to replicate an entire program, API, etc., to identical specifications. That's obviously not fair use. You're not dealing with little bits and pieces, you're dealing with an entire finished product.


> Your example is the very specific scenario where you're attempting to replicate an entire program, API, etc., to identical specifications. That's obviously not fair use. You're not dealing with little bits and pieces, you're dealing with an entire finished product.

No - Google's 9 lines of sorting algorithm (iirc) copied from Oracle's implementation were not considered fair use in the Google / Oracle debacle.

Likewise SCO claimed that 80 copied lines (in the entirety of the Linux source code) were a copyright violation, even if we never had a legal answer to this.


Sorry, but you're not recalling correctly. :)

The Supreme Court decided Google v. Oracle was fair use. It was 3 months ago:

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_...

That's the highest form of precedent, the question has now been effectively settled (unless Congress ever changes the law).

Edit: added a dummy hash to end of URL so HN parses it correctly (thanks @thewakalix below)


nope, those lines were specifically excluded from the prior judgment - and SC did not cast another judgment on them:

> With respect to Oracle’s claim for relief for copyright infringement, judgment is entered in favor of Google and against Oracle except as follows: the rangeCheck code in TimSort.java and ComparableTimSort.java, and the eight decompiled files (seven “Impl.java” files and one “ACL” file), as to which judgment for Oracle and against Google is entered in the amount of zero dollars (as per the parties’ stipulation).


The fair use ruling was about Google's API reimplementation. It becomes a whole different case with a 1:1 copy of code. And don't forget fair use works in the US, not necessarily in the rest of the world.

But I'm happy about all the new GPL programs created by Copilot


I wonder if GPL would actually prove to be that infectious.


That Supreme Court ruling doesn't appear to address the claims of actual copied code (the rangeCheck function), only the more nebulous API copyright claims.


There seems to be an issue with Hacker News's URL parsing. The final period isn't included as part of the link.


That would be the wrong result in almost all cases.


This is true, but there's also a murkier middle option. I used to work for a company that made a lot of money from its software patents but I was in a division that worked heavily in open-source code. We were forbidden to contribute to the high-value patented code because it was impossible to know whether we were "tainted" by knowledge of GPL code.


Same here. I worked at a NAS storage (NFS) vendor and this was a common practice. Could not look at server implementation in Linux kernel and open source NFS client team could not look at proprietary server code.


No you are not, guaranteed (I think, not a lawyer).

At least from a copyright point of view.

TL;DR: Having the right, and having an easy defense in a lawsuit, are not the same.

BUT separating them makes defending any lawsuit against them over copyright and patent law much easier. It also prevents any employee from "copying GPL (or similar) code verbatim from memory"(1) (or, even worse, from the clipboard). Sure, the employee "should" not do it, but by separating them you can be more sure they don't, which in turn makes it easier to defend in court, especially wrt. "independent creation".

There is also patent law shenanigans.

(1): Which is what GitHub Copilot is sometimes doing IMHO.


This model doesn't learn and abstract: it just pattern matches and replicates; that's why it was shown exactly replicating regions of code--long enough to not be "de minimis" and recognizable enough to include the comments--that happen to be popular... which would be fine, as long as the license on said code were also being replicated. It just isn't reasonable to try to pretend Copilot--or GPT-3 in general--is some kind of general purpose AI worthy of being compared with the fair use rights of a human learning techniques: this is a machine learning model that likes to copy/paste not just tiny bits of code but entire functions out of other peoples' projects, and most of what makes it fancy is that it is good at adapting what it copies to the surrounding conditions.


Have you used Copilot? I have not, but I have trained a GPT2 model on open source projects (https://doesnotexist.codes/). It does not just pattern match and replicate. It can be cajoled into reproducing some memorized snippets, but this is not the norm; in my experience the vast majority of what it generates is novel. The exceptions are extremely popular snippets that are repeated many many times in the training data, like license boilerplate.

Perhaps Copilot behaves very differently from my own model, but I strongly suspect that the examples that have been going around twitter are outliers. Github's study agrees: https://docs.github.com/en/github/copilot/research-recitatio... (though of course this should be replicated independently).


So, to verify, your claim is that GPT-3, when trained on a corpus of human text, isn't merely managing to string together a bunch of high-probability sequences of symbol constructs--which is how every article I have ever read on how it functions describes the technology--but is instead managing to build a model of the human world and the mechanism of narration required to describe it, which it uses to write new prose... a claim you must make in order to then argue that GPT-3 works like a human engineer learning a model of computers, libraries, and engineering principles from which it can then write code, instead of merely using pattern recognition as I stated? As someone who spent years studying graduate linguistics and cognitive science (though admittedly 15-20 years ago, so I certainly haven't studied this model: I have only read about it occasionally in passing) I frankly think you are just trying to conflate levels of understanding, in order to make GPT-3 sound more magical than it is :/.


What? I don't think I made any claim of the sort. I'm claiming that it does more than mere regurgitation and has done some amount of abstraction, not that it has human-level understanding. As an example, GPT-3 learned some arithmetic and can solve basic math problems not in its training set. This is beyond pattern matching and replication, IMO.

I'm not really sure why we should consider Copilot legally different from a fancy pen – if you use it to write infringing code then that's infringement by the user, not the pen. This leaves the practical question of how often it will do so, and my impression is that it's not often.


It's not really comparable to a pen, because a pen by itself doesn't copy someone else's code/written words. It's more like copying code from GitHub, or like writing a script that does that automatically. You have to be actively cautious that the material you are copying is not violating any copyrights. The problem is that Copilot has enough sophistication to, for example, change variable names and make it very hard to do content matching. What I can guarantee is that it won't be able to generate novel code from scratch that does a particular function (source: I have a PhD in ML). This brute-force way of modeling computer programs (using a language model) is just not sophisticated enough to reason and generate high-level concepts, at least today.
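As a toy illustration of why plain text matching fails once variable names change, here is a hypothetical sketch (not how Copilot or any real detector works) that normalizes identifiers with Python's ast module before comparing two snippets:

    # Hypothetical identifier-insensitive comparison: rename every variable,
    # argument and function to a placeholder, then compare the AST dumps.
    import ast

    class NormalizeNames(ast.NodeTransformer):
        def visit_Name(self, node):
            return ast.copy_location(ast.Name(id="_VAR_", ctx=node.ctx), node)

        def visit_arg(self, node):
            node.arg = "_ARG_"
            return node

        def visit_FunctionDef(self, node):
            self.generic_visit(node)
            node.name = "_FUNC_"
            return node

    def normalized_dump(source):
        return ast.dump(NormalizeNames().visit(ast.parse(source)))

    a = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
    b = "def summe(values):\n    s = 0\n    for v in values:\n        s += v\n    return s\n"

    print(normalized_dump(a) == normalized_dump(b))  # True: same structure, every name changed

The two dumps come out identical even though every name differs, which is exactly the kind of match a naive text diff would miss.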


The argument I was responding to--made by the user crazygringo--was that GPT-3 trained on a model of the Windows source code is fine to use nigh unto indiscriminately, as supposedly Copilot is abstracting knowledge like a human engineer. I argued that it doesn't do that: that GPT-3 is a pattern recognizer that not only theoretically just likes to memorize and regurgitate things, it has been shown to in practice. You then responded to my argument claiming that GPT-3 in fact... what? Are you actually defending crazygringo's argument or not?

Note carefully that crazygringo explicitly stated that copying little bits and pieces of a project is supposedly fair use, continuing the--as far as I understand, incorrect--assertion by lacker (the person who started this thread) that if you copied someone's binary tree implementation that would be fair use, as the two of them seem to believe that you have to copy essentially an entire combined work (whatever that means to them) for something to be infringing.

Honestly, it now just seems like you decided to skip into the middle of a complex argument in an attempt to make some pedantic point: either you agree that GPT-3 is like a human that is allowed to, as crazygringo insists, read and learn from anything and then use that knowledge in any way they see fit, or you agree with me that GPT-3 is a fancy pattern recognizer and it can and will just generate copyright infringements if used to solve certain problems. Given your new statements about Copilot being a "fancy pen" that can in fact be used incorrectly--something crazygringo seems to claim isn't possible--you frankly sound like you agree with my arguments!!


I think a crucial distinction to be made here, and with most 'AI' technologies (and I suspect this isn't news to many people here) is that – yes – they are building abstractions. They are not simply regurgitating. But – no – those abstractions are not identical (and very often not remotely similar) to human abstractions.

That's the very reason why AI technologies can be useful in augmenting human intelligence; they see problems in a different light, can find alternate solutions, and generally just don't think like we do. There are many paths to a correct result and they needn't be isomorphic. Think of how a mathematical theorem may be proved in multiple ways, but the core logical implication of the proof within the larger context is still the same.


Statistical modelling doesn't imply that GPT-3 is merely regurgitating. There are regularities among different examples, i.e. abstractions, that can be learned to improve its ability to predict novel inputs. There is certainly a question of how much Copilot is just reproducing input it has seen, but simply noting that it's a statistical model doesn't prove the case that all it can do is regurgitate.


One way to look at these models is to say that they take raw input, convert it into a feature space, manipulate it, then output back as raw text. A nice example of this is neural style transfer, where the learnt features can distinguish content from style, so that the content can be remixed with a different style in feature space. I could certainly imagine evaluating the quality of those features on a scale spanning from rote-copying all the way up to human understanding, depending on the quality of the model.


Imagine for a second a model of the human brain that consists of three parts. 1) a vector of trillion inputs, 2) a black box, and 3) a vector of trillion outputs. At this level of abstraction, the human brain "pattern matches and replicates" just the same, except it is better at it.


Human brains are at least minimally recurrent, and are trained on data sets that are much wider and more complex than what we are handing GPT-3. I have done all of these standard thought experiments and even developed and trained my own neural networks back before there were libraries that allowed people to "dabble" in machine learning: if you consider the implications of humans being able to execute Turing-complete thoughts, it should become obvious that the human brain isn't merely doing pattern-anything... it sometimes does, but you can't just conflate them and then call it a day.


The human brain isn't Turing-complete as that would require infinite memory. I'm not saying that GPT-3 is even close, but it is in the same category. I tried playing chess against it. According to chess.com, move 10 was its first mistake, move 16 was its first blunder, and past move 20 it tried to make illegal moves. Try playing chess without a chessboard and not making an illegal move. It is difficult. Clearly it does understand chess enough not to make illegal moves as long as its working memory allows it to remember the game state.


>The human brain isn't Turing-complete as that would require infinite memory

A human brain with an unlimited supply of pencils and paper, then.


Hmm... but a finite state machine with an infinite tape is Turing complete too. If you're allowed to write symbols out and read them back in, you've invalidated the "proof" that humans aren't just doing pattern matching.


> The human brain isn't Turing-complete as that would require infinite memory.

This is wrong, this is not what Turing completeness is. It applies to computational models, not hardware.

https://en.wikipedia.org/wiki/Turing_completeness


How so? The page you link offers three definitions[1], and all of them require an infinite tape.

You could argue that a stack is missing in my simplified model of the human brain, which would be correct. I used the simple model in allusion to the Chinese room thought experiment which doesn't require anything more than a dictionary.

[1]: https://en.wikipedia.org/wiki/Turing_completeness#Formal_def...


Turing completeness applies to models of computation, not hardware. Otherwise, nothing would be Turing-complete because infinite memory doesn't exist in the real world. Just read the first sentence of what you linked to:

In computability theory, several closely related terms are used to describe the computational power of a computational system (such as an abstract machine or programming language)


Thank you for pointing that out, I was indeed wrong to assume it was used to classify hardware rather than a model.


Human thought isn't anything like GPT thought - humans can spend a variable amount of time thinking about what to learn from "training data" and can use explicit logic to reason about it. GPT is more like a form of lossy compression than that.


This is called prompt engineering. If you find a popular, frequently repeated code snippet and then fashion a prompt that is tailored to that snippet then yes the NN will recite it verbatim like a poem.

But that doesn't mean it's the only thing it does or even that it does it frequently. It's like calling a human a parrot because he completed a line from a famous poem when the previous speaker left it unfinished.

The same argument was brought up with GPT too and has been long debunked. The authors (and others) checked samples against the training corpus and it only rarely copies unless you prod it to.


I don't know if I agree with your argument about GPT-3, but I think our disagreement is beside the point: if your human parrot did that, they would--not just in theory but in actual fact! see all the cases of this in the music industry--get sued for it, even if they claim they didn't mean to and it was merely a really entrenched memory.


The point is that many of the examples you see are intentional, through prompt engineering. The pilot asked the copilot to violate copyright, the copilot complied. Don't blame the copilot.

There also are cases where this happens unintentionally, but those are not the norm.


Transformers do learn and abstract. Not as well as humans, but for whatever definition of innovation or creativity you wanna run with, these GPT models have it. It's not magic, it's math, but these programs are approximating the human function of media synthesis across narrowly limited domains.

These aren't your crazy uncle's Markov chain chatbots. They're sophisticated Bayesian models trained to approximate the functions that produced the content used in training.


Are they Bayesian? It would be good if ML models were Bayesian (they'd be able to show uncertainty better) but they usually aren't.


The model and attention mechanism produces Bayesian properties, but transformers as a whole contain non-Bayesian aspects, depending on how rigorous you want to be in defining Bayesian.


> this is a machine learning model that likes to copy/paste not just tiny bits of code but entire functions out of other peoples' projects

Github could make a blacklist and tell Copilot never to suggest that code. Problem solved. You use one of the other 9 suggestions.
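A minimal sketch of that blacklist idea - assuming an exact match on a whitespace-normalized fingerprint. The function names and file path are made up for illustration, and GitHub has not described any such mechanism:

    # Hypothetical blacklist: fingerprint snippets a rights holder wants
    # excluded, then drop any suggestion whose fingerprint is on the list.
    import hashlib
    import re

    def normalize(code):
        # Collapse whitespace and lowercase so trivial reformatting doesn't evade the list.
        return re.sub(r"\s+", " ", code).strip().lower()

    def fingerprint(code):
        return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()

    def filter_suggestions(suggestions, blacklist_hashes):
        return [s for s in suggestions if fingerprint(s) not in blacklist_hashes]

    # blacklist = {fingerprint(open("quake_rsqrt.c").read())}
    # allowed = filter_suggestions(copilot_suggestions, blacklist)

An exact fingerprint is easy to evade with small edits, so in practice you'd want to combine it with the fuzzier similarity filtering discussed upthread.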


In my experience open source has now become so prevalent that I think some young developers could be completely caught out if the pendulum swings the other way.

Semi-related, the GNU/Linux copypasta is now more familiar to some than the GNU project in general - this is a shame to me, as I view the copypasta as mocking people who worked very hard to achieve what GNU has achieved and asked only for some credit.


> Reading some copyrighted code can have you entirely excluded from some jobs

What provision of copyright law are you referring to? Are you conflating copyright law with arbitrary organizational policies?


Who said it was a law?


Which “it” are you referring to? @lacker was talking about copyright in the comment @jcelerier replied to.


Yeah... but they didn't say it was the law that got you excluded from working on some projects for reading copyrighted code. It's corporate policy that does that - it's not a law, but companies do it based on who owns the copyright. Not everything that impacts you is a law.

They said

> Reading some copyrighted code can have you entirely excluded from some jobs

And they're right. It's because of corporate policies. They never said it was because of a law - you imagined that out of nothing.


> They never say it was because of a law - you imagined that out of nothing.

@jcelerier flatly contradicted the statement that copyright doesn’t prevent you from reading something.

You’re right that @jcelerier didn’t say their example was law; that’s because the example is a straw man in the context of what @lacker wrote.


Are you editing your comments after they've been replied to? That's really poor form.


I did not edit my comments above after reading your replies, why do you ask? What do you think I changed that affected how the thread reads?

And, who says improving or clarifying a comment is poor form? What is the edit button for, and why is it available once replies have been posted?


> What do you think I changed

I think you added

> Which “it” are you referring to?...

Because I have a tab open and can see the old one!


I added that before I saw your comment. So?


So @chrisseaton was correct, you did edit your posts and their question was in good faith.

Edit - I’m adding another point as an edit to show another way to communicate. Would any of your points been lost had you done something similar?


> So @chrisseaton was correct

No that’s not true. I did not edit my posts after reading their reply, and the false accusation was that I changed my comment after it was replied to.

I didn’t challenge whether the question was in good faith, but I’ll just note that the relevant discussion of copyright got dropped in favor of an ad-hominem attack.

My question of which “it” was being referred to is a legitimate question that I believe clarified the intent of my comment, and I added it to make clear I was talking about what @lacker said, not what @jcelerier wrote.

> Edit - I’m adding another point as an edit to show another way to communicate. Would any of your points been lost had you done something similar?

This doesn’t answer my question of why an edit should not be made before I see any replies, nor of why any edit is “poor form” and according to whom. I made my edit immediately. I’m well aware of the practice of calling out edits with a note, I’ve done it many times. I don’t feel the need to call out every typo or clarification with an explicit note, especially when edited very soon after the original comment.


> I did not edit my posts after reading their reply, and the false accusation was that I changed my comment after it was replied to.

Replies exist before you read them.


Thanks? Edits exist before you finish replying too, right? Maybe point that out to @chrisseaton, whose incorrect assumption was that I edited in response to what he wrote.


It's dependent on jurisdiction. Black box reverse engineering is only required in certain countries. If I remember correctly, most of Europe doesn't require it.


> > As a human, I am allowed to read copyrighted code and learn from it.

> Of course not. Reading some copyrighted code can make you entirely excluded from some jobs - you can't become a wine contributor if it can be shown you ever read Windows source code and most likely conversely.

You can of course read the code. The consequences are thus increased limitations, like you say.

What you mention is not an absolute restriction from reading copyrighted material. You perhaps have to cease other activities as a result.


If you've ever read a book or interacted with any product, you've learned from copyrighted material.

You've extrapolated "some organizations don't allow you to contribute if you've learned from the code of their direct competitor" to "You're not allowed to learn from copyrighted code", which is absurd.


> Reading some copyrighted code can have you entirely excluded from some jobs - you can't become a wine contributor if it can be shown you ever read Windows source code and most likely conversely.

If that's the case, it should be easy to kill a project like wine - just send every core contributor an email containing some Windows code.


Nobody could guarantee that the thing really is Windows code and not a fake - not without the sender self-identifying as a well-known top MS employee with access to it. In that case the sender would be doing something illegal and against MS's interests.

The result would be Wine having the advantage of getting to redo that snippet of code in a totally new and different way, and MS being forced to show part of its private code, which would also expose them to patent trolls.

Would be a win-win situation for Wine and a lose-lose situation for MS.


Wasn't that the entire premise of "Halt and Catch Fire"?


if it can be shown

...and the "if" is the important part. This is why Imaginary Property is so absurd.


[flagged]


The limitations on those projects are well-known though.


To whom? Because I work with VST a lot, and that's sure news to me.


It's very clearly visible on the Wine wiki that people who have ever seen Microsoft Windows source code cannot contribute to Wine due to copyright restrictions:

https://wiki.winehq.org/Developer_FAQ#Who_can.27t_contribute...

I think OP has a point here, personally.



