NetBSD bans use of Copilot-generated code (osnews.com)
69 points by jaypatelani on May 18, 2024 | 39 comments



This policy isn't meant to be enforceable; it's to signal to contributors — who may not have even considered the legal implications of using LLM codegen tools — that LLM codegen is not allowed because of the legal risk. It's an addendum to the existing guideline "Do not commit tainted code to the repository", which was already practically unenforceable, and it's intended to address a real and new phenomenon.


Here's an example of how easy it is to generate code on NetBSD using an open LLM:

    Last login: Fri May 17 23:58:08 2024 from 10.10.10.129
    NetBSD 9.2 (GENERIC) #0: Wed May 12 13:15:55 UTC 2021
    
    Welcome to NetBSD!
    
    We recommend that you create a non-root account and use su(1) for root access.
    $ ./curl -L -o tinyllama https://huggingface.co/Mozilla/TinyLlama-1.1B-Chat-v1.0-llamafile/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
    $ chmod +x tinyllama
    $ ./tinyllama -e -p '```c\nvoid *memcpy(void *dst,' -r '```\n' --temp 0 --log-disable
    <s> ```c
    void *memcpy(void *dst, const void *src, size_t n) {
        char *d = (char *)dst;
        const char *s = (const char *)src;
        while (n--) {
            *d++ = *s++;
        }
        return dst;
    }
    ```
    
    $
I think tinyllama and llamafile are much more culturally compatible with NetBSD's values than something like Microsoft Copilot.


I don't see your point. This is first-semester CS code, copying character by character. With some unrolling using uint64_t it can be sped up by at least 4 times, and even more so using SIMD. I don't know if your LLM can generate anything near a real-world function from an OS standard library.
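
Roughly what I mean by unrolling, as a sketch only (it ignores the alignment, overlap, and SIMD handling a real libc memcpy would need):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Toy word-at-a-time copy: moves 8 bytes per iteration where possible,
     * then finishes byte by byte.  A real libc memcpy also handles unaligned
     * pointers, small sizes, and uses SIMD where available. */
    void *memcpy_unrolled(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        while (n >= sizeof(uint64_t)) {
            uint64_t w;
            memcpy(&w, s, sizeof w);   /* word load without aliasing/alignment UB */
            memcpy(d, &w, sizeof w);
            d += sizeof w;
            s += sizeof w;
            n -= sizeof w;
        }
        while (n--)
            *d++ = *s++;
        return dst;
    }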


Also, the issue isn’t the quality of the LLM output, or whether you’re rooting for MS or Meta (e.g., there’s a vscode port these days).

The issue is that the copyright on that code is extremely questionable, and FreeBSD chose not to pirate source code from other operating systems a long time ago.


See also https://news.ycombinator.com/item?id=40375029: 99 points | LeoPanthera | 10 hours ago | 80 comments


This seems basically unenforceable. It very quickly becomes the war that schools are now waging on plagiarism.


I teach and it's not a war at all; from my point of view it's actually quite simple. We use Gradework. Students upload their work in Word or PDF, and there's a percentage counter in the top right corner. If it's above 30%, which rarely happens, I pass it to the head teacher. They have a conversation with that student, which I'm told involves a talking-to the first time, a warning the second time, and expulsion the third time (which AFAIK hasn't happened yet).

I don't see why a code review couldn't be the same.


I looked for information on Gradework and couldn't find anything.

From your description, it sounds like you presume the software is accurate.

I’d expect there to be peer reviewed studies about its error rate, a description of how it works, limitations, etc. For instance, if you ask for a one sentence factual response to a question, there’s no way it will be able to distinguish LLM from human. I’m sure it has other limitations.

I met someone who built one of the better packages in the "don't copy your answers from Google" space a while back.

He routinely gets pleas for help from people his software has falsely accused of plagiarism (the tool is open source and popular and maintained by others).

He generally sends strongly worded letters to educators explaining the limitations of the tool, why it's probably wrong in this case, and says he will happily testify against them in court as an expert witness.

I imagine the commercial vendors are less transparent, but not more accurate.


You're letting a computer tell you what to do?

> I'm sorry, Dave. I'm afraid I can't do that.


I have no problem when a computer says what to do, but I have a problem when a program decides what's right.


You think you can reliably identify AI text?


Anecdotally, my experience with generative AI is that the valid output space is very small, especially under a restrictive prompt (e.g. a homework question), and smaller still when the output also has to be a correct answer. Using a riskier decoding scheme only generates outputs that are paraphrases of each other. So lexical similarity should catch most naive prompters.
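
To illustrate the lexical-similarity idea, here is a toy sketch (my own illustration, not any particular checker): Jaccard similarity over character trigrams, where near-identical wording scores close to 1.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy similarity check: Jaccard overlap of character trigrams.
     * Real plagiarism detectors normalise and fingerprint far more robustly,
     * but the principle is the same: paraphrases of the same output share
     * a lot of surface wording and score high. */

    #define NBUCKETS 4096   /* hash buckets; collisions are fine for a toy */

    static void trigram_set(const char *s, bool set[NBUCKETS])
    {
        size_t len = strlen(s);
        for (size_t i = 0; i + 2 < len; i++) {
            unsigned h = (unsigned char)s[i] * 31u * 31u
                       + (unsigned char)s[i + 1] * 31u
                       + (unsigned char)s[i + 2];
            set[h % NBUCKETS] = true;
        }
    }

    static double jaccard(const char *a, const char *b)
    {
        bool sa[NBUCKETS] = { false }, sb[NBUCKETS] = { false };
        trigram_set(a, sa);
        trigram_set(b, sb);

        unsigned inter = 0, uni = 0;
        for (size_t i = 0; i < NBUCKETS; i++) {
            if (sa[i] && sb[i]) inter++;
            if (sa[i] || sb[i]) uni++;
        }
        return uni ? (double)inter / uni : 0.0;
    }

    int main(void)
    {
        const char *a = "Photosynthesis converts light energy into chemical energy.";
        const char *b = "Photosynthesis converts light energy to chemical energy.";
        printf("similarity: %.2f\n", jaccard(a, b));   /* close to 1.0 */
        return 0;
    }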

Subjectively, if a student can knowledge-guide an LLM to generate a response from a specific perspective, I think it is very debatable whether such a use case should be punished outright.


OK, given those restrictions I can see how you arrive at a pragmatic approach, so long as we don't have the illusion that this is actually a solution.

For example, giving it a small corpus of my previous handwritten work and asking it to respond in the same style is exceedingly easy and very, very hard to detect, especially if you run a small algorithm on it afterwards to introduce your characteristic typos, etc.

It is only a matter of time before services pop up for students that automate/wrap this.


Indeed, they would just be preparing for industry in that case (guiding the LLM)


In day-to-day code review I most definitely can. I can recognise each of my team members' coding styles fairly well; when something deviates significantly from that, it's either blindly copied or comes from AI.


It’s the opinion of OpenAI that AI text detection is not possible. Do you really want to argue that you know more than they do?


The very notion of "IP", or "this idea is mine and therefore all copies of it owe me royalties", is absolute nonsense. It was not great even in normal circumstances, and the coming of digital media and the internet makes it a thoroughly bad thing to have, because it breeds artificial digital scarcity.

Plagiarism is kind of silly as a concept. The only contribution from that mindset is that it allows tracking how the same ideas are explored by different people.


IP protection as an idea came about because creators had no way to financially exploit their creations. It is not a bad idea at its core. Creators have a choice of making their works available freely or under certain conditions, and I am always in favour of the creator deciding how their work is shared, not those who want to use it. Scarcity creates value.

Plagiarism is not silly. Plagiarising other people's work is used as a shortcut to obtain employment, access, academic recognition, etc. on the basis of somebody else's work, which is pure lying and deceit. That can land you in court. In the field of IT, and software development in particular, we have had long and expensive court battles over a few lines of code.

LLMs recombine code found on the internet without any care or attention paid to the licensing terms. The copyright to the output is magnanimously assigned to you, and you are left holding the bag when litigation time comes. Based on past experience, the BSDs are right to reject code generated with the help of LLMs, because there is no way of telling where the code came from or what future legal issues it may cause.


Ignoring the legal issues, there are also ethical issues.

The BSD crowd has always struck me as being thoughtful and principled. (vs., say, Red Hat these days, which has decided to just violate the GPL en masse by banning people who redistribute source code).


The BSD license is the most freedom-encouraging license.

The GPL made the mistake of "using the tools in the master's shed to fight the master", as they say; hence the FSF and the GNU project are not on a good path. Open source (IMO) is, simply put, an ideological attack against software freedom movements (in all their forms and means).


This is not a license violation per se.

They must send the source to their clients. They have no obligation to keep all of their clients.


There were also cases of GPL fanboys relicensing BSD code as GPL.


NetBSD can basically request people not do it and people won't. It kinda has that reputation.


Given the (AFAIK) questionable copyrightability of LLM-generated code (a number of jurisdictions do not allow assignment of copyright to non-humans, never mind the derivative-work question of its inputs), these proclamations (Arch and Debian have had similar discussions) sound to me mostly like a clarification of already-disallowed activities (where copyright must be attributable or otherwise traceable).

Basically, if you're using Copilot and making constructive[1] contributions, there's likely some human element involved where copyright can be applied. If you're just slapping together a pipeline to generate and submit patches without oversight, this just says "don't do that" more strongly than the existing "don't contribute crap, please" guidance. I see it as a way to help stem an LLM-generated deluge of contributions that floods the review queue with sub-par work and wastes precious reviewer time. If this discourages LLM script kiddies from flooding NetBSD with such things and sends them to other projects instead, it seems like a win for NetBSD to me.

[1] https://xkcd.com/810/


The law hasn’t been settled yet, but I strongly suspect the courts will rule that when an LLM memorizes an input and outputs something substantially identical, the copyright of the original is maintained.

If not, then we can just bias these models toward memorization, input Harry Potter books or whatever, and then declare output text that’s 99.95% identical to the input to be public domain.

The obvious problem this creates for groups like NetBSD is that there will be copyright trolls that leverage the lack of provenance of LLM output in order to extort people that use these tools.


I suspect that the LLM creators feel the same way, deep down. Otherwise I want to know why Microsoft, for instance, doesn't train Copilot on the Office and Windows codebases.


For now, from the looks of it, things like SynthID are going to start making their way into the multimodal outputs of models, embedding signatures that identify the source that generated them (and the user). I'm sure open-source "scramblers" will appear, but the more code you generate, the easier it is to identify. From what I've read of the state of the art, SynthID for normal text output needs only 3 sentences for an accurate signature to be embedded without affecting the quality of the output.
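
To give a sense of how a statistical text watermark can work in principle, here is a generic "green list" sketch along the lines of the published watermarking literature; it is only an illustration, not SynthID's actual algorithm (which I don't know the internals of).

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy watermark sketch (illustrative only, NOT SynthID): the previous
     * token id seeds a keyed hash that splits the vocabulary in half, and the
     * sampler nudges generation toward the "green" half.  A detector with the
     * same key counts green tokens and flags text where the fraction is far
     * above the ~50% expected from unwatermarked text. */

    static int is_green(uint32_t prev_token, uint32_t token)
    {
        /* the multiplier plays the role of a secret key in a real scheme */
        uint64_t x = (uint64_t)prev_token * 0x9E3779B97F4A7C15ull ^ token;
        x ^= x >> 33;
        x *= 0xFF51AFD7ED558CCDull;
        x ^= x >> 33;
        return (int)(x & 1);
    }

    /* Detector side: fraction of "green" tokens in a token stream. */
    static double green_fraction(const uint32_t *tokens, size_t n)
    {
        size_t green = 0;
        for (size_t i = 1; i < n; i++)
            green += (size_t)is_green(tokens[i - 1], tokens[i]);
        return n > 1 ? (double)green / (double)(n - 1) : 0.0;
    }

    int main(void)
    {
        /* hypothetical token ids; watermarked output would score well above 0.5 */
        uint32_t sample[] = { 17, 942, 3, 3051, 77, 210, 9 };
        printf("green fraction: %.2f\n",
               green_fraction(sample, sizeof sample / sizeof sample[0]));
        return 0;
    }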


When I use Copilot I make sure the code it writes matches my own style. If it doesn’t, I modify the output to what I would have written. I never accept more than a few lines at a time unless it’s something completely boilerplate. The result is that my Copilot-assisted code is byte-for-byte identical to the code I would produce without it. I challenge you to SynthID that.

I know there are people who don’t give a shit about code quality as long as it runs, which I suppose is why we’re seeing all these “ChatGPT-written” Show HNs. But that’s far from everyone.


Yep, with that way of using it you should be safe. Like I said, even for normal text it needs a minimum of 3 sentences to be reliable; for code you often only do small autocompletions, so it would be very hard. I just think SynthID is interesting because I don't like the idea of watermarked AIs, so it's useful to know how they're doing it.


SynthID relies on putting watermarks in AI output in the first place. We already have high-quality open-source models in the wild with no such watermark that are as good as Copilot; "scrambling" would be totally unnecessary.


I thought it went without saying that SynthID’s only practical use cases are warrantless mass surveillance and surveillance capitalism.

It’s been said literally millions of times before about thousands of other interchangeable schemes.

I guess some things need to be repeated every. single. day.


Have you ever been in an institution? Cells. Cells. Do they keep you in a cell? Cells. Cells. When you're not performing your duties do they keep you in a little box? Cells. Cells. Interlinked. Interlinked. What's it like to hold the hand of someone you love? Interlinked. Interlinked. Did they teach you how to feel finger to finger? Interlinked. Interlinked. Do you long for having your heart interlinked? Interlinked. Interlinked. Do you dream about being interlinked...? Interlinked. What's it like to hold your child in your arms? Interlinked. Interlinked. Do you feel that there's a part of you that's missing? Interlinked. Interlinked. Within cells interlinked. Within cells interlinked. Why don't you say that three times: Within cells interlinked. Within cells interlinked.

Anyone who downvotes this is a Replicant and must be apprehended immediately.


This is like outlawing sneezing at home. Unless you’re sneezing so loudly that your neighbor decides to call the police (equivalent would be reproducing large blocks of existing code verbatim, but Copilot already has warnings for that kind of thing), good luck enforcing it.


When you fly into some countries you have to fill out a form confirming that you haven’t committed genocide. A lot of people laugh at it without thinking about why the affirmation is there.

This is about setting a public position and expectation around contributions. If you want to violate it then so be it - maybe you’ll get away with it, but you are now violating the policy.


It also means that the country can summarily deport you for lying on an immigration form, even if you committed genocide where it’s legal and outside their jurisdiction.


Do they think Copilot/ChatGPT are usernames of contributor accounts which can be banned from a project/repo?

It seems the underlying issue here is bad code submitted ultimately by human contributors. Consistently thorough code review and testing will always be necessary.


How do code review and testing address the licensing concerns identified by the project? If you read their policy (which is two paragraphs, one of which is quoted directly in the article), it's pretty clear that the quality of the code is beside the point.


I’m not claiming that quality or licensing concerns are irrelevant, but that:

1. It will not be possible to reliably detect whether code was LLM assisted or not.

2. Humans are not always 100% truthful.

3. All the concerns cited here also apply to human-written code anyway.

So attempting to treat LLM coding assist tools as a special case here is going to be a losing battle. To solve these issues, we’re gonna have to come up with code review processes and tools that apply to ALL code up for review.


Those same quibbles applied to the policy before the addition of the LLM section: how does the NetBSD project detect if I copy & paste a bunch of code from my day job into a patch submission (and then lie about it)? Obviously, they can't. I, personally, don't feel like it's a failure of the policy if it relies on your contributors acting in good faith, because:

a) many people are acting in good faith, and their behavior will change as a result of this policy;

b) if someone wants to be a jerk and use an LLM after they were told not to, and is at some later time found out, it makes it easier for the org to act quickly and in a fair and consistent manner;

c) [more speculative as to the motives of the NetBSD project] normative statements by well-regarded institutions are useful in setting an example for other organizations to follow, so there is some political utility regardless of the practical efficacy of these rules.



