
I hate to sound glib here, but, "...And for my next trick, robots.txt!"
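For the record, here is the entirety of the "protection" on offer - a minimal sketch, assuming the crawler both identifies itself honestly and chooses to honor the file at all (CCBot is Common Crawl's long-standing crawler token; GPTBot is the token OpenAI documents for its crawler):

    # robots.txt at the site root - a polite request, not an enforcement mechanism
    # Ask OpenAI's crawler to stay out entirely
    User-agent: GPTBot
    Disallow: /

    # Ask Common Crawl (a common source of training corpora) to stay out
    User-agent: CCBot
    Disallow: /

    # Everyone else may carry on as before
    User-agent: *
    Allow: /

Nothing stops a scraper from sending a browser User-Agent instead and treating this file as a convenient map of whatever you'd rather it didn't see.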

I don't think we have any choice about opting in or out of being model fodder; the only choice is whether to emit that which we want kept out of view. Anything in a public space will be assumed public first. It didn't stop Copilot, it's not going to stop OpenAI, and the next thing won't be stopped either. If you're carrying a phone right now, you're already feeding models aisle by aisle as you walk through retail stores. Nobody scoffs at an anti-theft camera when they're shopping for drill bits, but would you feel the same way if the sign said it was for advertising purposes?

If you thought phones and social media were the opiate of the masses, just you wait. From top to bottom, from consumer to producer to middleman, the entire world is salivating at this opportunity and there is too much gold in them hills for any of this to stop.

The AI models that fail in this current market will be the ones that make a gentleman's agreement with the wide, unaccountable internet to hobble themselves for un-spendable good-guy points.

I regard this license as having noble intent, but limited efficacy against those who'd do us the most harm: the unscrupulous.




This is the "criminals^H^H^H^H^H tech companies will steal content anyway, so why bother?" argument, but that seems defeatist to me. Laws forcing tech companies to attribute sources for generated content would help.


It's not "why bother"; it's a critique of the specific license at hand and its efficacy against this issue.


No, actually.

Instead, it is that training AI on this data is fully legal, with or without your permission.

Fair use allows you to ignore the wishes of the original copyright holder.


> No, actually.

> Instead, it is that training AI on this data is fully legal, with or without your permission.

> Fair use allows you to ignore the wishes of the original copyright holder.

I keep seeing this sentiment repeated again and again. Wrong facts travel faster than right ones, it seems.

"Fair use" is a legal term allowing certain exemptions to copyright enforcement, which recognised in many jurisdictions, and also recognised across jurisdictions via WIP treaties, or other reciprocal recognition.

There is no fair use exemption that I am aware of that specifically recognises "learning" or "training".

Everyone who makes the statement you did, when asked for a citation, throws out some court case which made exemptions for reverse-engineering. They didn't call it fair use. The situation was "let's learn how this works so we can fix it/clone it". THIS situation is not "let's learn how this works", it's "let's train an entity with this".

Do you have a citation[1] for that assertion that fair use exemptions apply in the case of learning or training?

[1] I know you don't. I'm going to ask anyway.


Really? You have had multiple discussions about fair use and yet you weren't aware of the 3rd factor in the four-factor test of fair use?

Here it is, since you were not aware: "the amount and substantiality of the portion taken".

That is what I am referencing when I say that training an AI is covered under fair use.

Obviously, I didn't mean "Well, if you have 1 single image, and I 'train' the model on that 1 single image, and it produces the exact same image pixel for pixel, this is allowed because 'training' is a bulletproof exception in the law itself".

That's obviously not what I meant. Instead, what I am saying is that if a model is trained on millions and millions of images, the output of the model is fair use, because it does not take significantly from your individual work.


> Here it is, since you were not aware: "the amount and substantiality of the portion taken".

Yes. If you use only 1% of a work, then you are not using a substantial or large amount of the work and it is considered fair use.

But training doesn't use 1% of the work, it uses the entire work. No one is using 1/100th of an individual image to train, nor are they using 1/100th of a codebase to train, etc.

They're using entire individual works, and all those factors that are applicable are evaluated collectively, not in isolation.

Besides, all those factors become irrelevant in light of this: "...On the other hand, it is as clear, that if he thus cites the most important parts of the work, with a view, not to criticize, but to supersede the use of the original work, and substitute the review for it, such a use will be deemed in law a piracy." (https://en.wikipedia.org/wiki/Fair_use)

It's hard to claim that the owners of ChatGPT and similar are not trying to supersede the works ChatGPT is fed as input. They state as much everywhere.

> Instead, what I am saying is that if a model is trained on millions and millions of images, the output of the model is fair use, because it does not take significantly from your individual work.

Whether the output from the model is fair use or not is irrelevant to whether the input falls under fair use.

I must say your take is certainly novel, and no, I haven't seen anyone try to make that claim before; each time I have asked I have gotten a different answer.

I think a better case to be made is that ChatGPT is transformative, which would make it fair use.

If you read through the entire Wikipedia article I linked above, you'll see that:

1. All the factors are evaluated collectively, in relation to each other, not individually.

2. The burden of proof lies with the defendant, not the claimant. IOW, the court starts off with "prove that the use is fair", not "prove that the use is not fair". From Wikipedia: "This means that in litigation on copyright infringement, the defendant bears the burden of raising and proving that the use was fair and not an infringement."

In short, when the license says "not to be used as training data or learning data for any machine model" and it is ignored, the defendant is already presumed to be in violation. If sent a cease and desist with a request for royalties, they will have to prove fair use, which (in order of factors) means they have to answer "No" to all of the following questions in court:

1. Is the output product being used for commercial purposes and/or profit?

2. Is the input work creative and original, rather than a freely available fact or material it is in the public interest to reproduce?

3. Is the proportion of the input work that is used significant (typically more than 1/100th of the work)?

4. Does the output work harm the market for the input work?

The owners of ChatGPT are unable to answer "No" to any of the above.


Gotcha, so then can you give me a date/time limit on when I am allowed to make fun of you, if zero people lose court cases on this?

I am more than happy to put this on my calendar here.

I just need an exact date on which I can come back to your comments and make fun of you for being completely wrong, once nobody has lost any court cases on this topic.

Give me a date, and please describe specifically the exact words I am allowed to use to describe someone who would make such a mistake.

And if you refuse to give an exact date, I will assume it is both six months and exactly one year from now, and I will check in with you on exactly those dates to see if you will admit that you were wrong (spoilers... you won't!)


> Gotcha, so then can you give me a date/time limit on when I am allowed to make fun of you, if zero people lose court cases on this?

Well, people have already lost fair-use defenses because they failed on ONE of the four factors. Some cases were lost due to commercialisation, some were lost because too much of the original work was used, some were lost because of monetary or distribution harm to the original author.

So, when you say "on this", you mean "commercial mass harvesting of copyrighted works to produce a new work"?

> I just need an exact date

The onus is on the AI owners to prove fair use, and you want a date when that defense will lose?

Just how new are you to copyright and law? Who knows when court cases end? We cannot tell in advance when cases (hearings) may actually start (it can take up to two years, sometimes), or when they will actually end (another two years?).

How about this instead - we wait for the first judgement that rules on a fair use defense for training machine models?

Let's set a specific wager. I propose "Fair use is not a significant defense against usage of works to train machine models". That's binary - there are no shades of grey there.

I'm betting on that statement being true, you're betting against that statement being true.

Loser has to post on either HN or r/programming a link to the first post in this thread, along with a short exercise in humility, admitting, "Yes, I was wrong about this call that I made in a public forum"?

It's a friendly wager; if you are willing, I'd put it up on my site somewhere (or a Google spreadsheet, which is better) so you and I can both update it regularly with suits-in-progress and suits-completed, excluding appeals (otherwise this wager will take multiple decades to settle).

Happy? DM (or email me - my HN username at gmail) and we can both save this link to our emails :-)


Nothing short of sweeping legislation will matter here. And given the US’s recent track record for legislating technology, AI is going to be the Wild West for the foreseeable future.

Guess we'll just see what the EU decides.


We are not as powerless as you claim we are. There are billions of dollars of capital being allocated toward building AI systems, the most abundant sources of which like to view (or at a minimum present) themselves as above-board, legal, and operating within some ethical framework.

If there is visible pushback and attention to the harms of AI, whether they are visited on the creators whose work is used for training or on the moderators standing in the way of blatantly negative outputs, this can alter that investment.

When we throw up our hands and believe that development of this technology in the most exploitative and unscrupulous way is inevitable, we politically disarm ourselves.


Does GitHub support robots.txt?



