[flagged] MIT No-AI License (ognjen.io)
39 points by rognjen on Feb 13, 2023 | 64 comments

> ... without restriction, including without limitation the rights to use ...

> ... Permission is not granted to use this software ...

Is this going to hold up to legal scrutiny? It looks like it's blatantly contradicting itself. The author is calling it an "MIT" license, but does not appear to be associated with MIT -- is MIT cool with that? The author is not a lawyer. Nice concept, but I don't trust the execution.

Judges handle stuff like this a lot, where people cobble together clauses to create franken-contracts and wind up with a lot of self-contradictions. In general, specific clauses should win out over general clauses, so in theory this might hold up depending on the judge, but it's not advisable to use it in any real-life setting. Also, using the name MIT is how you end up with some of the complaints in the Neo4j lawsuit, which is likewise not advisable. OP should seek legal counsel.

> where people cobble together clauses to create franken-contracts

A blight in any number of industries. I'm most intimately familiar with construction. My go-to example is a project where the contract specifications from the Owner required, in one paragraph buried deeply within, that a certain critical person have a license for their craft.

Not a problem, that's how it usually goes and why the state issues licenses, after all. The problem was that the project was in Alaska, and the specifications explicitly listed out a license in Virginia.

> to any person obtaining a copy

A relevant distinction, perhaps.

IANAL, but it sounds like what a "person" obtaining a copy means could be challenging to define. If I open my browser, navigate to a page containing the code, and click some type of download button, I think it would be hard to argue that I was not a person obtaining a copy. What if I write a crawler and my crawler downloads a copy of the code for me? Am I still a person obtaining a copy?

Rather ambiguous, I agree. Sadly it has resulted in billion-dollar multinational commercial entities getting away with stealing and fencing near-endless amounts of otherwise protected data by claiming "fair use".

Basilisks are not humans, and the law needs to catch up to this fact.

What is the license on the MIT license?

There's currently a trademark fight happening around AGPLv3+CC, because "AGPL" is a trademark not owned by the people trying to combine the AGPL with the Commons Clause. Might be worth looking into that.

Mostly irrelevant. What's at issue here is trading on MIT's name.

This is the perfect example of why developers should not write their own licenses.

- The first two paragraphs contradict each other: "deal in the Software without restriction" vs. "Permission is not granted to use this software...". Which one is it? Remember that ambiguity in a contract works against the side that drafted it.

- What are "machine learning models"? Is there an objective definition that a judge/jury/court can apply? Why doesn't the license include one?

- Where is the attribution clause?

- "MIT" in the title is a copyrighted term, so the license is itself likely illegal to publish.

"MIT" is a trademark; even assuming the name was original enough to warrant copyright protection, MIT was founded in 1861 (by someone who died in 1882). Even under today's expansive copyright duration scheme, any copyright would have fallen into the public domain half a century ago.

Agree on your other points.

Trademarks and copyright are different things. As long as the owner of a trademark keeps defending it, the trademark will remain valid.

This implies that unless MIT wants to lose their trademark, they must defend it, i.e., go after things like this.

And MIT is pretty clear on their willingness to defend their trademarks; https://policies.mit.edu/policies-procedures/120-relations-p...

Yes. Obviously. I encourage you to reread my comment; you appear to have misunderstood it.

Names of things are not copyrightable; even the titles of books and films are not part of the copyright (that's why we see different movies come out with the same title). The reason is that there is not enough space in a title to exhibit the type of creative expression that copyright is meant to protect.

> even assuming it was original enough to warrant copyright protection

> not enough space in a title to exhibit the type of creative expression

I think we're saying the same thing. See 37 CFR § 202.1(a). However, even if that weren't the case, it still wouldn't be subject to copyright.


I appreciate you trying to prompt discussion of the issue and how it might be resolved within the legal system, but I think this misses the mark, for a few reasons:

(1) As others have pointed out, you've left open a sublicensing loophole, especially because the license does not require, as a condition, that the sublicensed version retain the notice of your copyright and license.

(2) The terms you use are vague. What is "machine learning"? Is it machine learning to use an organic neural network (i.e., for a flesh-and-blood person to train their mind on it)? Why or why not? Do you intend to make it illegal to view the code in an editor with adaptive autocorrect? What about just syntax highlighting? Surely when the editor "learns" your defined types and function prototypes, that could reasonably fall within the definition of "machine learning," couldn't it?

(3) Have you considered the interplay of copyright law and contract law here? You call this a license but it's more like a license agreement because you aren't just limiting your grant of copying permissions, you're limiting how it's used by the end user. Such an EULA falls well outside the domain of international copyright law and would be governed by individual jurisdictions' contract law. What constitutes a user's acceptance of your EULA? What jurisdiction governs?

Again, I don't practice in the area of IP; my take is only possibly (and if so, only slightly) more informed than the average HN user when it comes to this topic. Hopefully someone like kemitchell will weigh in.

I hate to sound glib here, but, "...And for my next trick, robots.txt!"
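For reference, the robots.txt approach amounts to a purely advisory file like the sketch below. CCBot is Common Crawl's crawler (a common source of training data); nothing forces any crawler to honor the file, which is rather the point of the comparison:

```
# robots.txt at the site root -- purely advisory; a crawler may ignore it
User-agent: CCBot
Disallow: /

# All other crawlers: no restrictions
User-agent: *
Disallow:
```

An empty `Disallow:` means "nothing is disallowed"; the file only works for crawlers that choose to check it.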

I don't think we have any choice in opting in or out to being model fodder, the only choice is in whether or not to emit that which we seek to keep out of its view. Anything in a public space will be assumed public first. It didn't stop copilot, it's not going to stop OpenAI, and the next thing won't be stopped either. If you're carrying a phone right now, you're already feeding models aisle by aisle as you walk through retail stores. Nobody scoffs at an anti-theft camera when they're shopping for drill bits, but would you feel the same way if it said it is for advertising purposes?

If you thought phones and social media were the opiate of the masses, just you wait. From top to bottom, from consumer to producer to middleman, the entire world is salivating at this opportunity and there is too much gold in them hills for any of this to stop.

The AI models that fail in this current market will be the ones who make a gentleman's agreement with the wide, unaccountable internet to hobble themselves for un-spendable good-guy points.

I view and reward this license as having noble intent, but limited efficacy against those who'd do the most harm to us, the unscrupulous.

This is the "criminals^H^H^H^H^H tech companies will steal content anyway, so why bother?" argument, but that seems defeatist to me. Laws forcing tech companies to attribute sources for generated content would help.

It's not "why bother", it's a critique on the specific license at hand and its efficacy against this issue.

No actually.

Instead it is that training AI on this data is fully legal, with or without your permission.

Fair use allows you to ignore the wishes of the original copyright holder.

> No actually.

> Instead it is that training AI on this data is fully legal, with or without your permission.

> Fair use allows you to ignore the wishes of the original copyright holder.

I keep seeing this sentiment repeated again and again. Wrong facts travel faster than right ones, it seems.

"Fair use" is a legal term allowing certain exemptions to copyright enforcement, which recognised in many jurisdictions, and also recognised across jurisdictions via WIP treaties, or other reciprocal recognition.

There is no fair use exemption that I am aware of that specifically recognises "learning" or "training" as a fair use exemption.

Everyone who makes the statement you did, when asked for a citation, throws out some court case which made exemptions for reverse-engineering. They didn't call it fair use. The situation was "let's learn how this works so we can fix it/clone it". THIS situation is not "let's learn how this works", it's "let's train an entity with this".

Do you have a citation[1] for that assertion that fair use exemptions apply in the case of learning or training?

[1] I know you don't. I'm going to ask anyway.

Really? You have had multiple discussions about fair use and yet you weren't aware of the 3rd factor in the 4-factor test of fair use?

Here it is, since you were not aware: "the amount and substantiality of the portion taken".

That is what I am referencing, when I am talking about training an AI being covered under fair use.

Obviously, I didn't mean "well, if you have 1 single image, and I 'train' the model on that 1 single image, and it produces the exact same image pixel for pixel, this is allowed because 'training' is a bulletproof exception in the law itself".

That's obviously not what I meant. Instead, what I am saying is that if a model is trained on millions and millions of images, the output of the model is fair use, because it is not taking significantly from your individual work.

> Here it is, since you were not aware: "the amount and substantiality of the portion taken".

Yes. If you use only 1% of a work, then you are not using a substantial or large amount of the work and it is considered fair use.

But training doesn't use 1% of the work, it uses the entire work. No one is using 1/100th of an individual image to train, nor are they using 1/100th of a codebase to train, etc.

They're using entire individual works, and all those factors that are applicable are evaluated collectively, not in isolation.

Besides, all those factors become irrelevant if "...On the other hand, it is as clear, that if he thus cites the most important parts of the work, with a view, not to criticize, but to supersede the use of the original work, and substitute the review for it, such a use will be deemed in law a piracy." (https://en.wikipedia.org/wiki/Fair_use)

It's hard to claim that the owners of ChatGPT and similar are not trying to supersede the works it is fed as input. They state as much everywhere.

> Instead, what I am saying, is that if there is a model, trained on millions and millions of images, the output of the model is fair use, because it is not taking significantly from your individual work.

Whether the output from the model is fair use or not is irrelevant to whether the input falls under fair use.

I must say your take is certainly novel, and no, I haven't seen anyone try to make that claim before; each time I have asked I have gotten a different answer.

I think a better case to be made is that ChatGPT is transformative, which would make it fair use.

If you read through the entire wikipedia article I linked above you'll see that:

1. All the factors are evaluated collectively, in relation to each other, not individually.

2. The burden of proof lies with the defendant, not the claimant. IOW, the court starts off with "prove that the use is fair", not "prove that the use is not fair". From Wikipedia: "This means that in litigation on copyright infringement, the defendant bears the burden of raising and proving that the use was fair and not an infringement."

In short, when the license says "not to be used as training data or learning data for any machine model" and it is ignored, the defendant is already in violation. If sent a cease and desist with a request for royalties, the defendant is presumed to be in violation and will have to prove fair use, which (in order of factors) means they have to answer "No" to all of the following questions in court:

1. Is the output product being used for commercial purposes and/or profit?

2. Is the input work creative expression, rather than a freely available fact or something it's in the public interest to reproduce?

3. Is the proportion of the input work that is used significant (typically more than 1/100th of the work)?

4. Does the output work harm the market for the input work?

The owners of ChatGPT are unable to answer "No" to any of the above.

Gotcha, so then can you give me a date/time limit on when I am allowed to make fun of you, if zero people lose court cases on this?

I am more than happy to put this on my calendar here.

I just need an exact date, on when I can come back to your comments, and make fun of you for being completely wrong, when nobody loses any court cases on this topic.

Give me a date, and please describe specifically the exact words I am allowed to use to describe someone who would make such a mistake.

And if you refuse to give an exact date, I will assume it is both 6 months and exactly 1 year from now, and I will check in with you on exactly those dates to see if you will admit that you were wrong (spoilers... you won't!)

> Gotcha, so then can you give me a date/time limit on when I am allowed to make fun of you, if zero people lose court cases on this?

Well, people have already lost fair-use defenses because they failed on ONE of the four factors. Some cases were lost due to commercialisation, some were lost because too much of the original work was used, some were lost because of monetary or distribution harm to the original author.

So, when you say "like this" you mean "commercial mass harvesting of copyright works to produce a new work"?

> I just need an exact date,

The onus is on the AI owners to prove fair use, and you want a date when that defense will lose?

Just how new are you to copyright and law? Who knows when court cases end? We cannot tell in advance when cases (hearings) may actually start (it can take up to two years, sometimes) or when they will actually end (another two years?).

How about this instead - we wait for the first judgement that rules on a fair use defense for training machine models?

We set a specific wager, I propose "Fair use is not a significant defense against usage of works to train machine models". That's binary - there's no shades of grey there.

I'm betting on that statement being true, you're betting against that statement being true.

Loser has to post in one of HN or r/programming a link to the first post in this thread, along with a small and short exercise in humility, admitting, "Yes, I was wrong about this call that I made in a public forum"?

It's a friendly wager, if you are willing I'd put it up on my site somewhere (or a google spreadsheet, which is better) so you and I can both update it regularly with suits-in-progress and suits-completed, excluding appeals (otherwise this wager will take multiple decades to settle).

Happy? DM (or email me - my HN username at gmail) and we can both save this link to our emails :-)

Nothing short of sweeping legislation will matter here. And given the US’s recent track record for legislating technology, AI is going to be the Wild West for the foreseeable future.

Guess we'll just see what the EU decides

We are not as powerless as you claim we are. There are billions of dollars of capital being allocated toward building AI systems, the most abundant sources of which like to view (or at minimum present) themselves as above-board, legal and operating within some ethical framework.

If there is visible pushback and attention to the harms of AI, whether they are visited on the creators used for training or moderators standing in the way of blatant negative outputs, this can alter that investment.

When we throw up our hands and believe that development of this technology in the most exploitative and unscrupulous way is inevitable, we politically disarm ourselves.

Does GitHub support robots.txt?

What I would like, is rather a well-written[1] license that forbids the distribution or sale of machine learning models created from my source code.

I would be perfectly fine with higher-degree statistical models created from my source code, for personal use only when editing the source code. That would only be a good tool. That is something that this MIT NO AI license also disallows.

[1] Written by someone more knowledgeable in legalese than I am, so that it becomes water-tight. BTW, personally, I prefer copyleft.

I need to write something like this: basically open-source data for non-commercial uses, with the non-commercial aspect extending to models trained on the data. The idea being to enable academics and hobbyists but collect from people making money.

The way I see it, either companies will comply with your existing copyleft license (and we don't need a new one), or they're going to ignore your license, and adding a "please don't ignore my license" clause won't do any good.

I'm not more knowledgeable in legalese than you.

Couldn't you circumvent this by sub-licensing the project under the standard MIT license and then use THAT as training data?

If you care what people do with your code, don't use permissive licenses. Use something with strong-copyleft like the GPL, or come up with a license that actually restricts the behavior you want.

IANAL so I don't know. But imagine the time investment required to do so...

Zero? The act of licensing something to myself creates no observables, but it's something I'm legally allowed to do. So it just becomes an argument in a court case.

Now, personally, I would probably respect the clearly stated intent of the author, because I like to think of myself as basically decent. But I also try to advocate for people not undermining themselves by selecting licenses that actively permit people to do the thing they don't want them to do. If you want people to contribute back to your ecosystem in exchange for licensing your code, don't use a permissive license; use a copyleft one. If you don't want someone building a business off your code, use a non-commercial license. If you want to put any meaningful restrictions at all on your codebase, don't give people carte blanche to sublicense it.

Wouldn't you, at the very least, have to copy repo and change the license text?

The license is just an agreement between the license issuer (me) and the person looking for permission to use the text (me). I don't need to publish anything to sublicense it.

Not sure if sub-licensing something to yourself would work out? But I am sure it's just as easy for a company to spin up two shells to play these games.

That's like selling a cookbook with TOS that say you can't use it inside a commercial kitchen. Or, if you do, can't sell the food cooked with the methods contained in it.

It's like a cookbook with TOS that say you can't feed it to a machine learning model.

What's wrong with that?

And companies do this all the time with enterprise software...

The problem with it is that copyright was never designed to protect ideas.

Society has to strike a balance. You want to protect authors, so they do publish their stuff, and get paid for it. But you also want ideas to spread, so that innovation spreads from one to many.

Hence legal institutions like patents and copyright. Both carefully balance the interests of the owner with those of the public.

Patents give you the right to sue people who use your idea without a valid license, but only if you publish the idea, and only for a certain amount of time. Afterwards, the idea falls into the public domain.

Copyright, on the other hand, has never protected the ideas contained in a work. Rather, copyright gives you the right to sue people who re-publish your work verbatim. However, copyright never did and never can give you the right to sue somebody because they are learning from your material, because that would 1) be a de-facto patent without the formal requirements of a patent and 2) run counter to the goal of free speech and the circulation of ideas for the common good.

Besides, copyright laws were enacted in a time where machines learning from text did not exist. So even if a society wanted to put restrictions on such a use of published content, it would require new legislation to do so.

(I am a lawyer, and the above represents my personal view of the legal situation here.)

Besides, what always gets ignored in this discussion:

What about those who already did harvest publicly available information and used it to train their models? Google. Facebook. OpenAI. Now draw the curtain for everyone else, making it a legal moat for those companies' profits? That's like allowing Google to build an index of the web and then making it illegal for everyone else to scrape publicly accessible websites.

Depending on the outcome of the recent court cases related to AIs like Copilot and Stable Diffusion, this might not be of use at all.

If courts rule that training, distributing and using models without regard for the data author's consent is fair use (which has a high chance of happening), then the license is worthless.

I hope not...

I suspect it's likely. It's really hard to argue that there's nothing "sufficiently transformative" about turning some source code for a web API into a machine learning model capable of presenting code suggestions for a wide variety of problems.

The actual issue isn't the training of the models, but the use of the resulting output: Just because training the models was fair use, doesn't mean that a case where you use some code the model regurgitates from its training data verbatim is.

If the model never leaves the company's servers, and you strap a simple filter on its output that removes verbatim regurgitations, then this should be moot?
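For what it's worth, a crude version of such a filter is easy to sketch. Everything below is illustrative and my own assumption, not anything a vendor has published: screen model output by checking whether any long-enough character window appears verbatim in the training corpus.

```python
# Illustrative sketch of a verbatim-regurgitation filter.
# The n-gram approach and the 40-character window are arbitrary choices.

def ngrams(text: str, n: int) -> set[str]:
    """All length-n character windows of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(corpus_docs: list[str], n: int = 40) -> set[str]:
    """Index every length-n substring of the training corpus."""
    index: set[str] = set()
    for doc in corpus_docs:
        index |= ngrams(doc, n)
    return index

def is_verbatim_regurgitation(output: str, index: set[str], n: int = 40) -> bool:
    """True if any length-n window of the output occurs verbatim in the corpus."""
    return any(gram in index for gram in ngrams(output, n))
```

A real deployment would need hashed or suffix-automaton indexes to scale, and an exact-match filter still misses lightly paraphrased copies, which is part of why the legal question doesn't reduce to a filter.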

It seems irresponsible to use the “MIT License” brand for a license that isn’t open source by most commonly understood definitions. If you want to limit how people use your software, fine, but don’t trick users by naming your license similarly to an open source one.

Is this incompatible with the GPL?

Of course it is. It rather plainly restricts the user from using the software in a specific way, which is against user freedom.

That reminds me of this incredibly funny anecdote told by Douglas Crockford: https://youtu.be/-hCimLnIsDA.

What if you reword it as a copyleft-AI GPL? You may only use this code to train "open" models that are themselves subject to the copyleft-AI GPL. In that case you would just spell out that you consider the model a derivative of your work.

Most likely yes, because it imposes additional conditions about the use of the code.

It’s also either incompatible with MIT or easily worked around.

It's an open source license, but not an Open Source one.

Was a lawyer consulted for this?

No, and IANAL. If you are one, or have access to one, email me.

Are all machine learning models bad? Surely there could be some machine learning models that are so simplistic in nature that they could be considered "harmless." It seems like the purpose and scope of the model is not considered by most people when they are debating over AI.

Well, you also removed the attribution clause.

Didn't know MIT had an attribution clause tbh. I only added the second paragraph.

The original has `The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.`

I flagged it as it has nothing to do with MIT

Well, the name "Expat license" didn't stick, and MIT likes the free advertising.

@dang is the flag necessary?
