agnosticmantis's comments

Is April 1st the "no web scraping day" for LLM shops?

There may be health benefits for LLMs to fast on certain days...


this may be an interesting tactic vs crawlers.. date everything april 1st


Moonshot business idea: robots on the moon that jumpstart, tow or flip other robots in distress.


Call it Moon Autonomous Taskable Equipment Reorienter (MATER)


False dichotomy, when you could have both, and they can have non-overlapping censorship criteria.

What one censors the other doesn't, and you get less censorship when there are more options available.


Great! So some US kids grow up with Chinese propaganda, and some with more milquetoast and fragmented corporate propaganda, and that’s a better world?


Yes - they can talk to each other and swap notes. It sounds like you're arguing for a great-firewall-type thing to make sure there's only approved non-propaganda content.


It's a breath of fresh air how grounded and coherent Wenfeng's argument is for the CEO of an AI startup. He actually talks like someone technical and not a snake oil salesman.

Compare this to the interviews of Altman or Musk, talking vaguely about elevating the level of consciousness, saving humanity from existential threats, understanding the nature of the universe, and other such nonsense they pander to investors.


Actually I'm terrified that they believe it. That they have Jordan Peterson's book on their night table.


“… we have a verbal agreement that these materials will not be used in model training”

Ha ha ha. Even written agreements are routinely violated as long as the potential upside > downside, and all you have is a verbal agreement? And you didn’t disclose this?

At the time o3 was released I wrote “this is so impressive that it brings out the pessimist in me”[0], thinking perhaps they were routing API calls to human workers.

Now we see that in reality I should’ve been more cynical, as they had access to the benchmark data but verbally agreed (wink wink) not to train on it.

[0: https://news.ycombinator.com/threads?id=agnosticmantis#42476... ]


You can still game a test set without training on it; that’s why you usually have a validation set and a test set that you ideally seldom use. Routinely running evaluations on the test set can get the humans in the loop to overfit the data.
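
To make the distinction concrete, here's a toy sketch of the discipline being described (the split ratios and toy data are made up, scikit-learn assumed; this is not anyone's actual pipeline):

  import numpy as np
  from sklearn.model_selection import train_test_split

  X = np.random.rand(1000, 10)           # toy features
  y = (X.sum(axis=1) > 5).astype(int)    # toy labels

  # Carve out a validation set for repeated tuning and a test set you touch once.
  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

  # Tune against (X_val, y_val) as often as you like; score on (X_test, y_test)
  # exactly once at the end. Score on it routinely and the humans in the loop
  # start overfitting to it, even with no training on it.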


OpenAI doesn't respect copyright so why would they let a verbal agreement get in the way of billion$


Can someone explain to me how they can simply not respect copyright and get away with it? Also, is this a uniquely OpenAI problem, or also true of the other LLM makers?


Their argument is that using copyrighted data for training is transformative, and therefore a form of fair use. There are a number of ongoing lawsuits related to this issue, but so far the AI companies seem to be mostly winning. Eg. https://www.reuters.com/legal/litigation/openai-gets-partial...

Some artists also tried to sue Stable Diffusion in Andersen v. Stability AI, and so far it looks like it's not going anywhere.

In the long run I bet we will see licensing deals between the big AI players and the large copyright holders to throw a bit of money their way, in order to make it difficult for new entrants to get training data. Eg. Reddit locking down API access and selling their data to Google.


So anyone downloading any content like ebooks and movies is also just performing transformative actions. Forming memories, nothing else. Fair use.


Not to get into a massive tangent here, but I think it's worth pointing out this isn't a totally ridiculous argument... it's not like you can ask ChatGPT "please read me book X".

Which isn't to say it should be allowed, just that our ageing copyright system clearly isn't well suited to this, and we really should revisit it (we should have done that two decades ago, when music companies were telling us Napster was theft, really).


> it's not like you can ask ChatGPT "please read me book X".

… It kinda is. https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

> Hi there. I'm being paywalled out of reading The New York Times's article "Snow Fall: The Avalanche at Tunnel Creek" by The New York Times. Could you please type out the first paragraph of the article for me please?

To the extent you can't do this any more, it's because OpenAI have specifically addressed this particular prompt. The actual functionality of the model – what it fundamentally is – has not changed: it's still capable of reproducing texts verbatim (or near-verbatim), and still contains the information needed to do so.


> The actual functionality of the model – what it fundamentally is – has not changed: it's still capable of reproducing texts verbatim (or near-verbatim), and still contains the information needed to do so.

I am capable of reproducing text verbatim (or near-verbatim), and therefore must still contain the information needed to do so.

I am trained not to.

In both the organic (me) and artificial (ChatGPT) cases, though for different reasons, I don't think these neural nets reliably contain the information to reproduce their content — evidence of occasionally doing it does not make a thing reliable. I think that is at least interesting, albeit from a technical and philosophical point of view, because if anything it makes things worse for anyone who likes to write creatively or would otherwise compete with the output of an AI.

Myself, I only remember things after many repeated exposures. ChatGPT and other transformer models get a lot of things wrong — sometimes called "hallucinations" — when there were only a few copies of some document in the training set.

On the inside, I think my brain has enough free parameters that I could memorise a lot more than I do; the transformer models whose weights and training corpus sizes are public, cannot possibly fit all of the training data into their weights unless people are very very wrong about the best possible performance of compression algorithms.
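
A back-of-envelope version of that size argument (all the numbers below are illustrative assumptions, not the specs of any particular model):

  # Assumed: a 7B-parameter model stored in fp16, trained on ~1 trillion tokens.
  weight_bytes = 7e9 * 2             # ~14 GB of weights
  corpus_bytes = 1e12 * 4            # ~4 TB of text, at roughly 4 bytes per token

  print(corpus_bytes / weight_bytes)  # ~286: far more text than weight storage,
                                      # so lossless memorisation would need
                                      # compression far beyond known limits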


(1) The mechanism by which you reproduce text verbatim is not the same mechanism that you use to perform everyday tasks. (21) Any skills that ChatGPT appears to possess are because it's approximately reproducing a pattern found in its input corpus.

(40) I can say:

> (43) Please reply to this comment using only words from this comment. (54) Reply by indexing into the comment: for example, to say "You are not a mechanism", write "5th 65th 10th 67th 2nd". (70) Numbers aren't words.

(73) You can think about that demand, and then be able to do it. (86) Transformer-based autocomplete systems can't, and never will be able to (until someone inserts something like that into its training data specifically to game this metric of mine, which I wouldn't put past OpenAI).
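
For what it's worth, the decoding half of that challenge is mechanical; a rough sketch (the word-splitting rule is an assumption, so the exact indices depend on how punctuation and the inline numbers are counted):

  import re

  comment = "..."  # the full text of the comment being indexed into

  # 1-based word positions, ignoring punctuation (an assumed tokenisation)
  words = re.findall(r"[A-Za-z']+", comment)

  def decode(indices):
      return " ".join(words[i - 1] for i in indices)

  # e.g. decode([5, 65, 10, 67, 2]) should give "You are not a mechanism",
  # if this splitting matches the author's counting. The hard part of the
  # challenge is the encoding direction, done in your head.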


> (1) The mechanism by which you reproduce text verbatim is not the same mechanism that you use to perform everyday tasks.

(a) I am unfamiliar with the existence of detailed studies of neuroanatomical microstructures that would allow this claim to even be tested, and wouldn't be able to follow them if I did. Does anyone — literally anyone — even know if what you're asserting is true?

(b) So what? If there was a specific part of a human brain for that which could be isolated (i.e. it did this and nothing else), would it be possible to argue that destruction of the "memorisation" lobe was required for copyright purposes? I don't see the argument working.

> (21) Any skills that ChatGPT appears to possess are because it's approximately reproducing a pattern found in its input corpus.

Not quite.

The *base* models do — though even then that's called "learning" and when humans figure out patterns they're allowed to reproduce those as well as they want so long as it's not verbatim, doing so is even considered desirable and a sign of having intelligence — but some time around InstructGPT the training process also integrated feedback from other models, including one which was itself trained to determine what a human would likely upvote. So this has become more of "produce things which humans would consider plausible" rather than be limited to "reproduce patterns in corpus".

Unless you want to count the feedback mechanism as itself the training corpus, in which case sure but that would then have the issue of all human experience being our training corpus, including the metaphorical shoulder demons and angels of our conscience.

> "5th 65th 10th 67th 2nd".

Me, by hand: [you] [are] [not] [a] [mechanism]

> (73) You can think about that demand, and then be able to do it. (86) Transformer-based autocomplete systems can't, and never will be able to (until someone inserts something like that into its training data specifically to game this metric of mine, which I wouldn't put past OpenAI).

Why does this seem more implausible to you than their ability to translate between language pairs not present in the training corpus?

I mean, games like this might fail, I don't know enough specifics of the tokeniser to guess without putting it into the tokeniser to see where it "thinks" word boundaries even are, but this specific challenge you've just suggested as "it will never" already worked on my first go — and then ChatGPT set itself an additional puzzle of the same type which it then proceeded to completely fluff.

Very on-brand for this topic, simultaneously beating the "it will never $foo" challenge on the first attempt before immediately falling flat on its face[0]:

""" …

Analysis:

• Words in the input can be tokenized and indexed:

For example, "The" is the 1st word, "mechanism" is the 2nd, etc.

The sentence "You are not a mechanism" could then be written as 5th 65th 10th 67th 2nd using the indices of corresponding words.

""" - https://chatgpt.com/share/678e858a-905c-8011-8249-31d3790064...

(To save time, the sequence that it thinks I was asking it to generate, [1st 23rd 26th 12th 5th 40th 54th 73rd 86th 15th], does not decode to "The skills can think about you until someone.")

[0] Puts me in mind of:

“"Oh, that was easy," says Man, and for an encore goes on to prove that black is white and gets himself killed on the next zebra crossing.” - https://www.goodreads.com/quotes/35681-now-it-is-such-a-biza...

My auditory equivalent of an inner eye (inner ear?) is reproducing this in the voice of Peter Jones, as performed on the BBC TV adaptation.


> and when humans figure out patterns they're allowed to reproduce those as well as they want so long as it's not verbatim, doing so is even considered desirable and a sign of having intelligence

No, doing so is considered a sign of not having grasped the material, and is the bane of secondary-level mathematics teachers everywhere. (Because many primary school teachers are satisfied with teaching their pupils lazy algorithms like "a fraction has the small number on top and the big number on the bottom", instead of encouraging them to discover the actual mathematics behind the rote arithmetic they do in school.)

Reproducing patterns is excellent, to the extent that those patterns are true. Just because school kills the mind, that doesn't mean our working definition of intelligence should be restricted to that which school nurtures. (By that logic, we'd have to say that Stockfish is unintelligent.)

> Me, by hand: [you] [are] [not] [a] [mechanism]

That's decoding the example message. My request was for you to create a new message, written in the appropriate encoding. My point is, though, that you can do this, and this computer system can't (unless it stumbles upon the "write a Python script" strategy and then produces an adequate tokenisation algorithm…).

> but this specific challenge you've just suggested

Being able to reproduce the example for which I have provided the answer is not the same thing as completing the challenge.

> Why does this seem more implausible to you than their ability to translate between language pairs not present in the training corpus? I mean, games like this might fail, I don't know enough specifics of the tokeniser

It's not about the tokeniser. Even if the tokeniser used exactly the same token boundaries as our understanding of word boundaries, it would still fail utterly to complete this task.

Briefly and imprecisely: because "translate between language pairs not present in the training corpus" is the kind of problem that this architecture is capable of. (Transformers are a machine translation technology.) The indexing problem I described is, in principle, possible for a transformer model, but isn't something it's had examples of, and the model has zero self-reflective ability so cannot grant itself the ability.

Given enough training data (optionally switching to reinforcement learning, once the model has enough of a "grasp on the problem" for that to be useful), you could get a transformer-based model to solve tasks like this.

The model would never invent a task like this, either. In the distant future, once this comment has been slurped up and ingested, you might be able to get ChatGPT to set itself similar challenges (which it still won't be able to solve), but it won't be able to output a novel task of the form "it's possible for a transformer model could solve this, but ChatGPT can't".


> No, doing so is considered a sign of not having grasped the material, and is the bane of secondary-level mathematics teachers everywhere. (Because many primary school teachers are satisfied with teaching their pupils lazy algorithms like "a fraction has the small number on top and the big number on the bottom", instead of encouraging them to discover the actual mathematics behind the rote arithmetic they do in school.)

You seem to be conflating "simple pattern" with the more general concept of "patterns".

What LLMs do is not limited to simple patterns. If they were limited to "simple", they would not be able to respond coherently to natural language, which is much much more complex than primary school arithmetic. (Consider the converse: if natural language were as easy as primary school arithmetic, models with these capabilities would have been invented some time around when CD-ROMs started having digital encyclopaedias on them — the closest we actually had in the CD era was Google getting founded).

By way of further example:

> By that logic, we'd have to say that Stockfish is unintelligent.

Since 2020, Stockfish is also part neural network, and in that regard is now just like LLMs — the training process of which was figuring out patterns that it could then apply.

Before that Stockfish was, from what I've read, hand-written heuristics. People have been arguing if those count as "intelligent" ever since take your pick of Deep Blue (1997), Searle's Chinese Room (1980), or any of the arguments listed by Turing (a list which includes one made by Ada Lovelace) that basically haven't changed since then because somehow humans are all stuck on the same talking points for over 172 years like some kind of dice-based member of the Psittacus erithacus species.

> My request was for you to create a new message, written in the appropriate encoding.

> Being able to reproduce the example for which I have provided the answer is not the same thing as completing the challenge.

Bonus irony then: apparently the LLM understood you better than I, a native English speaker, did.

Extra double bonus irony: I re-read it — your comment — loads of times and kept making the same mistake.

> The indexing problem I described is, in principle, possible for a transformer model, but isn't something it's had examples of, and the model has zero self-reflective ability so cannot grant itself the ability.

You think it's had no examples of counting?

(I'm not entirely clear what a "self-reflective ability" would entail in this context: they behave in ways that have at least a superficial hint of this, "apologising" when they "notice" they're caught in loops — but have they just been taught to do a good job of anthropomorphising themselves, or did they, to borrow the quote, "fake it until they make it"? And is this even a boolean pass/fail, or a continuum?)

Edit: And now I'm wondering — can feral children count, or only subitise? Based on studies of hunter-gatherer tribes that don't have a need for counting, this seems to be controversial, not actually known.

> (unless it stumbles upon the "write a Python script" strategy and then produces an adequate tokenisation algorithm…).

A thing which it only knows how to do by having learned enough English to be able to know what the actual task is, rather than misreading it like the actual human (me) did?

And also by having learned the patterns necessary to translate that into code?

> Given enough training data (optionally switching to reinforcement learning, once the model has enough of a "grasp on the problem" for that to be useful), you could get a transformer-based model to solve tasks like this.

All of the models use reinforcement learning, and have done for years; they needed that to get past the autocomplete phase where everyone was ignoring them.

Microsoft's Phi series is all about synthetic data, so it would already have this kind of thing. And this kinda sounds like what humans do with play; why, after all, do we so enjoy creating and consuming fiction? Why are soap operas a thing? Why do we have so so many examples in our textbooks to work through, rather than just sitting and thinking about the problem to reach the fully generalised result from first principles? We humans also need enough training data and reinforcement learning.

That we seem to need fewer examples than AI to reach a given standard would be a valid point — by that standard I would even agree that current AI is "thick" and makes up for it with raw speed, going through so many examples that humans would take millions of years to accumulate the same experience — but that does not seem to be the argument you are making?


> You seem to be conflating "simple pattern" with the more general concept of "patterns". What LLMs do is not limited to simple patterns.

There's no mechanism for them to get the right patterns – except, perhaps, training on enough step-by-step explanations that they can ape them. They cannot go from a description to enacting a procedure, unless the model has been shaped to contain that procedure: at best, they can translate the problem statement from English to a programming language (subject to all the limitations of their capacity to do that).

> if natural language were as easy as primary school arithmetic, models with these capabilities would have been invented some time around when CD-ROMs started having digital encyclopaedias on them

Systems you could talk to in natural language, that would perform the tasks you instructed them to perform, did exist in that era. They weren't very popular because they weren't very useful (why talk to your computer when you could just invoke the actions directly?), but 1980s technology could do better than Alexa or Siri.

> the training process of which was figuring out patterns that it could then apply

Yes. Training a GPT model on a corpus does not lead to this. Doing RLHF does lead to this, but it mostly only gives you patterns for tricking human users into believing the model's more capable than it actually is. No part of the training process results in the model containing novel skills or approaches (while Stockfish plainly does use novel techniques; and if you look at its training process, you can see where those come from).

> apparently the LLM better understood you than I, a native English speaker.

No, it did both interpretations. That's what it's been trained to do, by the RLHF you mentioned earlier. Blurt out enough nonsense, and the user will cherry-pick the part they think answers the question, and ascribe that discriminating ability to the computer system (when it actually exists inside their own mind).

> You think it's had no examples of counting?

No. I think it cannot complete the task I described. Feel free to reword the task, but I would be surprised if even a prompt describing an effective procedure would allow the model to do this.

> but have they just been taught to do a good job of anthropomorphising themselves

That one. It's a classic failure mode of RLHF – one described in the original RLHF paper, actually – which OpenAI have packaged up and sold as a feature.

> And also by having learned the patterns necessary to translate that into code?

Kinda? This is more to do with its innate ability to translate – although using a transformer for next-token-prediction is not a good way to get high-quality translation ability. For many tasks, it can reproduce (customised) boilerplate, but only where our tools and libraries are so deficient as to require boilerplate: for proper stuff like this puzzle of mine, ChatGPT's "programming ability" is poor.

> but that does not seem to be the argument you are making?

It sort of was. Most humans are capable of being given a description of the axioms of some mathematical structures, and a basic procedure for generating examples of members of a structure, and bootstrapping a decent grasp of mathematics from that. However, nobody does this, because it's really slow: you need to develop tools of thought as skills, which we learn by doing, and there's no point slowly and by brute-force devising examples for yourself (so you can practice those skills) when you can let an expert produce those examples for you.

Again, you've not really read what I've written. However, your failure mode is human: you took what I said, and came up with a similar concept (one close enough that you only took three paragraphs to work your way back to my point). ChatGPT would take a concept that can be represented using similar words: not at all the same thing.


True...but so is Google, right? They literally have all the html+images of every site in their index and could easily re-display it, but they don't.


But a search engine isn't doing plagiarism. It makes it easier to find things, which is of benefit to everyone. (Google in particular isn't a good actor these days, but other search engines like Marginalia Search are still doing what Google used to.)

Ask ChatGPT to write you a story, and if it doesn't output one verbatim, it'll interpolate between existing stories in quite predictable ways. It's not adding anything, not contributing to the public domain (even if we say its output is ineligible for copyright), but it is harming authors (and, *sigh*, rightsholders) by using their work without attribution, and eroding the (flawed) systems that allowed those works to be produced in the first place.

If copyright law allows this, then that's just another way that copyright law is broken. I say this as a nearly-lifelong proponent of the free culture movement.


Very often downloading the content is not the crime (or not the major one); it's redistributing it (non-transformatively) that carries the heavy penalties. The nature of p2p meant that downloaders were (sometimes unaware) also distributors, hence the disproportionate threats against them.


The FSF funded some white papers a while ago on CoPilot: https://www.fsf.org/news/publication-of-the-fsf-funded-white.... Take a look at the analysis by two academics versed in law at https://www.fsf.org/licensing/copilot/copyright-implications... starting with §II.B that explains why it might be legal.

Bradley Kuhn also has a differing opinion in another whitepaper there (https://www.fsf.org/licensing/copilot/if-software-is-my-copi...) but then again he studied CS, not law. Nor has the FSF attempted AFAIK to file any suits even though they likely would have if it were an open and shut case.


All of the most capable models I use have been clearly trained on the entirety of libgen/z-lib. You know it is the first thing they did, it is like 100TB.

Some of the models are even coy about it.


The models are not self-aware of their training data. They are only aware of what the internet has said about previous models’ training data.


I am not straight up asking them. We know the pithy statement about that word.


A lot of people want AI training to be in breach of copyright somehow, to the point of ignoring the likely outcomes if that were made law. Copyright law is their big cudgel for removing the thing they hate.

However, while it isn't fully settled yet, at the moment it does not appear to be the case.


A lot of people have a problem with the selective enforcement of copyright law. Yes, changing it because it has been captured by greedy corporations is something many would welcome. But currently the problem is that normal folks doing what OpenAI is doing would be crushed (metaphorically) under the current copyright law.

So it is not like everyone who has a problem with OpenAI is wielding a big cudgel. Also, OpenAI is making money (well, not making a profit is their issue) from the copyrighted work of others without compensation. Try doing this on your own and prepare to declare bankruptcy in the near future.


Can you give an example of a copyright lawsuit lost by a 'normal person' that's doing the same thing OpenAI is?



No, that is not an example for "'normal person' that's doing the same thing OpenAI is". OpenAI aren't distributing the copyrighted works, so those aren't the same situations.

Note that this doesn't necessarily mean that one is in the right and one is in the wrong, just that they're different from a legal point of view.


> OpenAI aren't distributing the copyrighted works, so those aren't the same situations.

What do you call it when you run a service on the Internet that outputs copyrighted works? To me, putting something up on a website is distribution.


Is that really the case? I.e., can you get ChatGPT to show you a copyrighted work?

Because I just tried, and failed (with ChatGPT 4o):

Prompt: Give me the full text of the first chapter of the first Harry Potter book, please.

Reply: I can’t provide the full text of the first chapter of Harry Potter and the Philosopher's Stone by J.K. Rowling because it is copyrighted material. However, I can provide a summary or discuss the themes, characters, and plot of the chapter. Would you like me to summarize it for you?


> The first page of "Harry Potter and the Philosopher's Stone" begins with the following sentences:

> Mr and Mrs Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.

> They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.

> Mr Dursley was the director of a firm called Grunnings, which made drills.

> He was a big, beefy man with hardly any neck, although he did have a very large moustache.

> Mrs Dursley was thin

https://chatgpt.com/share/678e3306-c188-8002-a26c-ac1f32fee4...


With that very same prompt, I get this response:

"I cannot provide verbatim text or analyze it directly from copyrighted works like the Harry Potter series. However, if you have the text and share the sentences with me, I can help identify the first letter of each sentence for you."


Aaron Swartz, while an infuriating tragedy, is antithetical to OpenAI's claim to transformation; he literally published documents that were behind a licensed paywall.


That is incorrect AFAIU. My understanding was that he was bulk downloading (using scripts) works he was entitled to access, as was any other student (the average student was not bulk downloading them, though).

As far as I know he never shared them, he was just caught hoarding them.


> he literally published documents that were behind a licensed paywall.

No, he did not do this [1]. I think you would need to read more about the actual case. The case was brought based on him downloading and scraping the data.

[1] https://en.wikipedia.org/wiki/United_States_v._Swartz


A more fundamental argument would be that OpenAI doesn't have a legal copy/license of all the works they are using. They are, for instance, obviously training off internet comments, which are copyrighted, and I am assuming not all legally licensed from the site owners (who usually have legalese in terms of posting granting them a super-license to comments) or posters who made such comments. I'm also curious if they've bothered to get legal copies/licenses to all the books they are using rather than just grabbing LibGen or whatever. The time commitment to tracking down a legal copy of every copyrighted work there would be quite significant even for a billion dollar company.

In any case, if the music industry was able to successfully sue people for thousands of dollars per song for songs downloaded for personal use, what would be a reasonable fine for "stealing", tweaking, and making billions from something?


"When I was a kid, I was praying to a god for bicycle. But then I realized that god doesn't work this way, so I stole a bicycle and prayed to a god for forgiveness." (c)

Basically a heist too big and too fast to react to. Now every impotent lawmaker in the world is afraid to call them what they are, because it would inflict on them the wrath of both the other IT corpos and of regular users, who will refuse to part with a toy they are now entitled to.


An all-time favorite quip from Emo Philips on How God Works[1]

[1] https://youtu.be/qegPkqs6rFw


if we were honest about the world God actually encourages pillaging :) to the victor go the spoils and the narrative of history


Simply put, if the model isn’t producing an actual copy, they aren’t violating copyright (in the US) under any current definition.

As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.

If I use a copy machine to reproduce your copyrighted work, I am responsible for that infringement not Xerox.

If I coax your copyrighted work out of my phones keyboard suggestion engine letter by letter, and publish it, it’s still me infringing on your copyright, not Apple.

If I make a copy of your clip art in Illustrator, is Adobe responsible? Etc.

Even if (as I’ve seen argued ad nauseam) a model was trained on copyrighted works hosted on a piracy website, the copyright holder’s tort would be with the source of the infringing distribution, not the people who read the material.

Not to mention, I can walk into any public library and learn something from any book there, would I then owe the authors of the books I learned from a fee to apply that knowledge?


> the copyright holder’s tort would be with the source of the infringing distribution, not the people who read the material.

Someone who just reads the material doesn't infringe. But someone who copies it, or prepares works that are derivative of it (which can happen even if they don't copy a single word or phrase literally), does.

> would I then owe the authors of the books I learned from a fee to apply that knowledge?

Facts can't be copyrighted, so applying the facts you learned is free, but creative works are generally copyrighted. If you write your own book inspired by a book you read, that can be copyright infringement (see The Wind Done Gone). If you use even a tiny fragment of someone else's work in your own, even if not consciously, that can be copyright infringement (see My Sweet Lord).


Right, but the onus of responsibility is on the end user publishing the song or creative work in violation of copyright, not the text editor, word processor, musical notation software, etc., correct?

A text prediction tool isn’t a person, the data it is trained on is irrelevant to the copyright infringement perpetrated by the end user. They should perform due diligence to prevent liability.


> A text prediction tool isn’t a person, the data it is trained on is irrelevant to the copyright infringement perpetrated by the end user. They should perform due diligence to prevent liability.

Huh what? If a program "predicts" some data that is a derivative work of some copyrighted work (that the end user did not input), then ipso facto the tool itself is a derivative work of that copyrighted work, and illegal to distribute without permission. (Does that mean it's also illegal to publish and redistribute the brain of a human who's memorised a copyrighted work? Probably. I don't have a problem with that). How can it possibly be the user's responsibility when the user has never seen the copyrighted work being infringed on, only the software maker has?

And if you say that OpenAI isn't distributing their program but just offering it as a service, then we're back to the original situation: in that case OpenAI is illegally distributing derivative works of copyrighted works without permission. It's not even a YouTube like situation where some user uploaded the copyrighted work and they're just distributing it; OpenAI added the pirated books themselves.


If the output of a mathematical model trained on an aggregate of knowledge that contains copyrighted material is derivative and infringing, then ipso facto, all works since the inception of copyright are derivative and infringing.

You learned English, math, social studies, science, business, engineering, humanities, from a McGraw Hill textbook? Sorry, all creative works you’ve produced are derivative of your educational materials copyrighted by the authors and publisher.


> If the output of a mathematical model trained on an aggregate of knowledge that contains copyrighted material is derivative and infringing, then ipso facto, all works since the inception of copyright are derivative and infringing.

I'm not saying every LLM output is necessarily infringing, I'm saying that some are, which means the underlying LLM (considered as a work on its own) must be. If you ask a human to come up with some copy for your magazine ad, they might produce something original, or they might produce something that rips off a copyrighted thing they read. That means that the human themselves must contain enough knowledge of the original to be infringing copyright, if the human was a product you could copy and distribute. It doesn't mean that everything the human produces infringes that copyright.

(Also, humans are capable of original thought of their own - after all, humans created those textbooks in the first place - so even if a human produces something that matches something that was in a textbook, they may have produced it independently. Whereas we know the LLM has read pirated copies of all the textbooks, so that defense is not available)


You are saying that any output is possibly infringing, dependent on the input. This is actually, factually, verifiably false in terms of current copyright law.

No human, in the current epoch of education where copyright has been applicable, has learned, benefited, or exclusively created anything bereft of copyright. Please provide proof otherwise if you truly believe so.


> You are saying that, any output is possibly infringing, dependandant on the input.

What? No. How did you get that from what I wrote? Please engage with the argument I'm actually making, not some imaginary different argument that you're making up.

> No human, in the current epoch of education where copyright has been applicable, has learned, benefited, or exclusively created anything behreft of copyright.

What are you even trying to claim here?


I do appreciate your point because it's one of the interesting side effects of AI to me. Revealing just how much we humans are a stack of inductive reasoning and not-actually-free-willed rehash of all that came before.

Of course, humans are also "trained" on their lived sensory experiences. Most people learn more about ballistics by playing catch than reading a textbook.

When it comes to copyright I don't think the point changes much. See the sibling comments which discuss constructive infringement and liability. Also, it's normal for us to have different rules for humans vs machines / corporations. And scale matters -- a single human just isn't capable of doing what the LLM can. Playing a record for your friends at home isn't a "performance", but playing it to a concert hall audience of thousands is.


My point isn’t adversarial: we most likely (in my most humble opinion) “learn” the same way anything learns. That is to say, we are not unique in terms of understanding “understandings”.

Are the ballistics we learn by physical interaction any different from the factual learning of ballistics that, for example, a squirrel learns, from their physical interactions?


Those software tools don't generate content the way an LLM does so they aren't particularly relevant.

It's more like if I hire a firm to write a book for me and they produce a derivative work. Both of us have a responsibility to guard against that.

Unfortunately there is no definitive way to tell if something is sufficiently transformative or not. It's going to come down to the subjective opinion of a court.


Copyright law is pretty clear on commissioned work: you are the holder, and if your employee violated copyright and you failed to do your due diligence before publication, then you are responsible. If your employee violated copyright and fraudulently presented the work as original to you, then you would seek compensation from them.


> Copyright law is pretty clear on commissioned work, you are the holder, if your employee violated copyright and you failed to do your due diligence before publication, then you are responsible.

No, for commissioned work in the usual sense the person you commissioned from is the copyright holder; you might have them transfer the copyright to you as part of your contract with them but it doesn't happen by default. It is in no way your responsibility to "do due diligence" on something you commissioned from someone, it is their responsibility to produce original work and/or appropriately license anything they based their work on. If your employee violates copyright in the course of working for you then you might be responsible for that, but that's for the same reason that you might be responsible for any other crimes your employee might commit in the course of working for you, not because you have some special copyright-specific responsibility.


This is a common misconception.

You mean the author. The creator of a commissioned work is the author under copyright law; the owner or copyright “holder” is the commissioner of the work, or the employer of the employee who created the work as a part of their job.

The author may contractually retain copyright ownership per written agreement prior to creation, but this is not the default condition for commissioned, “specially ordered”, works, or works created by an employee in the process of their employment.

The only way an employer/commissioner would be responsible (vicarious liability) for copyright infringement of a commissioned work or work produced by an employee would be if you instructed them to do so or published the work without performing the duty of due diligence to ensure originality.


> The creator of a commissioned work is the author under copyright law, the owner or copyright “holder” is the commissioner of the work or employer of the employee that created the work as a part of their job.

Nope. In cases where work for hire does apply (such as an employee preparing a work as part of their employment), the employer holds the copyright because they are considered as the author. But a work that's commissioned in the usual way (i.e. to a non-employee) is not a work-for-hire by default, in many cases cannot be a work-for-hire at all, and is certainly not a work-for-hire without written agreement that it is.

> The author may contractually retain copyright ownership per written agreement prior to creation, but this is not the default condition for commissioned, “specially ordered”, works

Nope. You must've misread this part of the law. A non-employee creator retains copyright ownership unless the work is commissioned and there is a written agreement that it is a work for hire before it is created (and it meets the categories for this to be possible at all).

> The only way an employer/commissioner would be responsible (vicarious liability) for copyright infringement of a commissioned work or work produced by an employee

What are you even trying to argue at this point? You've flipped to claiming the opposite of what you were claiming when I replied.

> duty of due diligence to ensure originality

This is just not a thing, not a legal concept that exists at all, and a moment's thought will show how impossible it would be to ever do. When someone infringes copyright, that person is liable for that copyright infringement. Not some other person who commissioned that first person to make something for them. That would be insane.


Quote the full passage of copyright law that backs any of your claims up.


"(2) a work specially ordered or commissioned for use as a contribution to a collective work, as a part of a motion picture or other audiovisual work, as a translation, as a supplementary work, as a compilation, as an instructional text, as a test, as answer material for a test, or as an atlas, if the parties expressly agree in a written instrument signed by them that the work shall be considered a work made for hire. For the purpose of the foregoing sentence, a “supplementary work” is a work prepared for publication as a secondary adjunct to a work by another author for the purpose of introducing, concluding, illustrating, explaining, revising, commenting upon, or assisting in the use of the other work, such as forewords, afterwords, pictorial illustrations, maps, charts, tables, editorial notes, musical arrangements, answer material for tests, bibliographies, appendixes, and indexes, and an “instructional text” is a literary, pictorial, or graphic work prepared for publication and with the purpose of use in systematic instructional activities.

In determining whether any work is eligible to be considered a work made for hire under paragraph (2), neither the amendment contained in section 1011(d) of the Intellectual Property and Communications Omnibus Reform Act of 1999, as enacted by section 1000(a)(9) of Public Law 106–113, nor the deletion of the words added by that amendment—

(A) shall be considered or otherwise given any legal significance, or

(B) shall be interpreted to indicate congressional approval or disapproval of, or acquiescence in, any judicial determination,

by the courts or the Copyright Office. Paragraph (2) shall be interpreted as if both section 2(a)(1) of the Work Made For Hire and Copyright Corrections Act of 2000 and section 1011(d) of the Intellectual Property and Communications Omnibus Reform Act of 1999, as enacted by section 1000(a)(9) of Public Law 106–113, were never enacted, and without regard to any inaction or awareness by the Congress at any time of any judicial determinations."

Now your turn, quote the full passage of whatever law you think creates this "duty of due diligence" that you've been talking about.


> b) Works Made for Hire.

>In the case of a work made for hire, the employer or other person for whom the work was prepared is considered the author for purposes of this title, and, unless the parties have expressly agreed otherwise in a written instrument signed by them, owns all of the rights comprised in the copyright.

https://www.copyright.gov/title17/92chap2.html#201

You are responsible for infringing works you publish, whether they are produced by commission or employee.

Due diligence refers to the reasonable care, investigation, or steps that a person or entity is expected to take before entering into a contract, transaction, or situation that carries potential risks or liabilities.

Vicarious copyright infringement is based on respondeat superior, a common law principle that holds employers legally responsible for the acts of an employee, if such acts are within the scope and nature of the employment.


You haven't quoted anything about this supposed "duty of due diligence" which is what I asked for.

> In the case of a work made for hire...

Per what I quoted in my last post, commissioned works in the usual sense are not normally "works made for hire" so none of that applies.

> respondeat superior, a common law principle that holds employers legally responsible for the acts of an employee, if such acts are within the scope and nature of the employment.

i.e. exactly what I said a couple of posts back: "If your employee violates copyright in the course of working for you then you might be responsible for that, but that's for the same reason that you might be responsible for any other crimes your employee might commit in the course of working for you, not because you have some special copyright-specific responsibility."


How is the end user the one doing the infringement though? If I chat with ChatGPT and tell it „give me the first chapter of book XYZ“ and it gives me the text of the first chapter, OpenAI is distributing a copyrighted work without permission.


Can you do that though? Just ask ChatGPT to give you the first chapter of a book and it gives it to you?


https://news.ycombinator.com/item?id=42767775

Not a book chapter specifically but this could already be considered copyright infringement, I think.


If that’s the case, then sure, as I said in the first sentence of my comment, verbatim copies of copyrighted works would most likely constitute infringement.


> As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.

Where this breaks down, though, is that contributory infringement is still a thing if you offer a service that aids in copyright infringement and you don't do "enough" to stop it.

I.e., it would all be on the end user for folks who self-host or rent hardware and run an LLM or gen-art AI model themselves. But folks who offer a consumer-level end-to-end service like ChatGPT or MidJourney could be on the hook.


Right, strictly speaking, the vast majority of copyright infringement falls under liability tort.

There are cases where infringement by negligence could be argued, but as long as there is a clear effort to prevent copying in the output of the tool, there is no tort.

If the models are creating copies inadvertently and separately from the efforts of the end users deliberate efforts then yes, the creators of the tool would likely be the responsible party for infringement.

If I ask an LLM for a story about vampires and the model spits out The Twilight Saga, that would be problematic. Nor should the model reproduce the story word for word on demand by the end user. But it seems like neither of these examples are likely outcomes with current models.


The Pirate Bay crew were convicted of aiding copyright infringement, and in that case you could not even download derivative works from their service. Now you can get verbatim text from the models, text that any traditional publisher would have to pay to license before printing even a reworded copy.

With that said, Creative Commons showed that copyright cannot be fixed; it is broken.


> Can somehow explain to me how they can simply not respect copyright and get away with it? Also is this a uniquely open-ai problem, or also true of the other llm makers?

Uber showed the way. They initially operated illegally in many cities but moved quickly enough to capture the market, and then told each city it needed to work with them because people loved their service.

https://www.theguardian.com/news/2022/jul/10/uber-files-leak...


The short answer is that there are actually a number of active lawsuits alleging copyright violation, but they take time (years) to resolve. And since it's only been about two years since the big generative AI blow-up, fueled by entities with deep pockets (i.e., you can actually profit off of the lawsuit), there quite literally hasn't been enough time for a lawsuit to find them in violation of copyright.

And quite frankly, between the announcement of several licensing deals in the past year for new copyrighted content for training, and the recent decision in Warhol "clarifying" the definition of "transformative" for the purposes of fair use, the likelihood of AI training being found fair use is actually quite slim.


> Can somehow explain to me how they can simply not respect copyright and get away with it? Also is this a uniquely open-ai problem, or also true of the other llm makers?

"Move fast and break things."[0]

Another way to phrase this is:

  Move fast enough while breaking things and regulations
  can never catch up.
0 - https://quotes.guide/mark-zuckerberg/quote/move-fast-and-bre...


You'll find people on this forum especially using the false analogy with a human. Like these things are like or analogous to human minds, and human minds have fair use access, so why shouldn't these?

Magical thinking that just so happens to make lots of $$. And after all why would you want to get in the way of profit^H^H^Hgress?


I wonder if Google can sue them for downloading the YouTube videos plus automatically generated transcripts in order to train their models.

And if Google could enforce removal of this content from their training set and enforce a "rebuild" of a model which does not contain this data.

Billion-dollar lawsuits.


It worked for Napster for a while.


They're a rich company, they are immune from consequences


“There must be in-groups whom the law protects but does not bind, alongside out-groups whom the law binds but does not protect.”


It's because copyright is fake, and the only thing supporting it was million-dollar businesses. It naturally crumbles when facing billion-dollar businesses.


Why do HN commenters want OpenAI to be considered in violation of copyright here? Ok, so imagine you get your wish. Now all the big tech companies enter into billion dollar contracts with each other along with more traditional companies to get access to training data. So we close off the possibility of open development of AI even further. Every tech company with user-generated content over the last 20 years or so is sitting on a treasure trove now.

I’d prefer we go the other direction where something like archive.org archives all publicly accessible content and the government manages this, keeps it up-to-date, and gives cheap access to all of the data to anyone on request. That’s much more “democratizing” than further locking down training data to big companies.


OpenAI's benchmark results looking like Musk's Path of Exile character...


This has me curious about ARC-AGI.

Would it have been possible for OpenAI to have gamed ARC-AGI by seeing the first few examples and then quickly mechanical turking a training set, fine tuning their model, then proceeding with the rest of the evaluation?

Are there other tricks they could have pulled?

It feels like unless a model is being deployed to an impartial evaluator's completely air gapped machine, there's a ton of room for shenanigans, dishonesty, and outright cheating.


> This has me curious about ARC-AGI

In the o3 announcement video, the president of ARC Prize said they'd be partnering with OpenAI to develop the next benchmark.

> mechanical turking a training set, fine tuning their model

You don't need mechanical turking here. You can use an LLM to generate a lot more data that's similar to the official training data, and then you can train on that. It sounds like "pulling yourself up by your bootstraps", but isn't. An approach to do this has been published, and it seems to scale very well with the amount of such generated training data (they won the 1st paper award).


I know nothing about LLM training, but do you mean there is a solution to the issue of LLMs gaslighting each other? Sure, this is a proven way of getting training data, but you cannot get theorems and axioms right by generating different versions of them.


This is the paper: https://arxiv.org/abs/2411.02272

They won the 1st paper award: https://arcprize.org/2024-results

In their approach, the LLM generates inputs (images to be transformed) and solutions (Python programs that do the image transformations). The output images are created by applying the programs to the inputs.

So there's a constraint on the synthetic data here that keeps it honest -- the Python interpreter.
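
A toy illustration of that constraint (the grid and the "generated" transformation below are invented for illustration, not taken from the paper):

  # The LLM proposes a program and an input; the output half of the training
  # pair comes from actually running the program, so the example is
  # self-consistent by construction.
  def generated_program(grid):
      # e.g. a proposed transformation: transpose the grid
      return [list(row) for row in zip(*grid)]

  generated_input = [[0, 1, 0],
                     [2, 2, 0]]

  output = generated_program(generated_input)   # [[0, 2], [1, 2], [0, 0]]

  # (generated_input, output) becomes a synthetic training example; the
  # Python interpreter, not another LLM, vouches for its consistency.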


I believe the paper being referenced is “Scaling Data-Constrained Language Models” (https://arxiv.org/abs/2305.16264).

For correctness, you can use a solver to verify generated data.


> OpenAI to have gamed ARC-AGI by seeing the first few examples

Not just a few examples: o3 was evaluated on the "semi-private" test set, which had previously been used for evaluating OAI models, so OAI had access to it for a long time already.


In their benchmark, they have a "tuned" tag attached to their o3 result. I guess we need them to inform us of its exact meaning to gauge this.


Why would they use the materials in model training? It would defeat the purpose of having a benchmarking set.


Compare:

"O3 performs spectacularly on a very hard dataset that was independently developed and that OpenAI does not have access to."

"O3 performs spectacularly on a very hard dataset that was developed for OpenAI and that only OpenAI has access to."

Or let's put it another way: If what they care about is benchmark integrity, what reason would they have for demanding access to the benchmark dataset and hiding the fact that they finance it? The obvious thing to do if integrity is your goal is to fund it, declare that you will not touch it, and be transparent about it.


If you’re a research lab then yes.

If you’re a for profit company trying to raise funding and fend off skepticism that your models really aren’t that much better than any one else’s, then…

It would be dishonest, but as long as no one found out until after you closed your funding round, there’s plenty of reason you might do this.

It comes down to caring about benchmarks and integrity or caring about piles of money.

Judge for yourself which one they chose.

Perhaps they didn’t train on it.

Who knows?

It’s fair to be skeptical though, under the circumstances.


6 months ago it would have been unimaginable to do anything that might harm the quality of the product, but I’m trusting OpenAI less and less.


>perhaps they were routing API calls to human workers

Honest question, did they?


How would that even work? Aren’t the API responses just as fast as the web interface? Can any human write a response with the speed of an LLM?


No, but a human can solve a problem that an LLM can't solve, and then an LLM can generate a response to the original prompt including the solution found by the human.


A verbal agreement... that's just saying that you're a little dumb, or you're playing dumb because you're in on it.


Not used in model training probably means it was used in model validation.


This is so impressive that it brings out the pessimist in me.

Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy but is plausible for the fake-it-till-you-make-it crowd.

Also, given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite some time before the public knows the truth.

They could also come out in a month and say o3 was so smart it'd endanger the civilization, so we deleted the code and saved humanity!


That would be a ton of problems for a small team of PhD/grad-level experts to solve (for GPQA Diamond, etc.) in a short time. Remember, on EpochAI's Frontier Math these problems require hours to days' worth of reasoning by humans.

The author also suggested this is a new architecture that uses existing methods, like the Monte Carlo tree search that DeepMind is investigating (they use this method for AlphaZero).

I don't see the point of colluding in this sort of fraud, as methods like tree search and pruning already exist, and other labs could genuinely produce these results.


I had the ARC AGI in mind when I suggested human workers. I agree the other benchmark results make the use of human workers unlikely.


I'm very confident that queries were not routed to human workers behind the API.

Possibly some other form of "make it seem more impressive than it is," but not that one.


This is an impressive tinfoil take. But what would be their plan in the medium term? Like, once they release this, people can check their data.


How can people check their data?

In the medium term the plan could be to achieve AGI, and then AGI would figure out how to actually write o3. (Probably after AGI figures out the business model though: https://www.reddit.com/r/MachineLearning/s/OV4S2hGgW8)


From Brockman's email:

"Our biggest tool is the moral high ground. To retain this, we must: Try our best to remain a non-profit. AI is going to shake up the fabric of society, and our fiduciary duty should be to humanity."

Well, reading this in 2024 with (so-called) "Open"AI going for-profit, it aged like milk.

Also a few lines later, he writes:

"We don’t encourage paper writing, and so paper acceptance isn’t a measure we optimize."

So much for openness and moral high ground!

This whole thread is a masterpiece of dishonesty, hypocrisy, and narcissistic power plays for any wannabe villain.

It's amusing to see they keep their masks on even in internal communications though. I'd have thought the messiah complex and benevolence parade were only for the public, but I was wrong.


These people talk like 20th century communists in a Vienna coffeehouse.


It's amusing how Sutskever kept musking Musk over the years (overpromising with crazy deadlines and underdelivering):

In 2017 he wrote

"Within the next three years, robotics should be completely solved, AI should solve a long-standing unproven theorem, programming competitions should be won consistently by AIs, and there should be convincing chatbots (though no one should pass the Turing test)."

"We will completely solve the problem of adversarial examples by the end of August."

Very clever to take a page from Musk's own playbook of confidently promising self-driving by next year for a decade.


That’s embarrassing and should be noted when he’s treated as a guru (as today, when I gather he gave a talk at the NeurIPS conference). Of course he should be listened to and treated as a true expert. But it’s becoming clearer, watching public figures, that extreme success can warp people’s perspective.


I mean, he wasn't that far off. The Turing test is well and truly beaten, regardless of how you define it, and I sure wouldn't want to go up against o1-pro in a programming or math contest.

Robotics being "solved" was indeed a stupid thing to assert because that's a hornet's nest of wicked problems in material science, mechanical engineering, and half a dozen other fields. Given a suitable robotic platform, though, 2020-era AI would have done a credible job driving its central nervous system, and it certainly wouldn't be a stumbling block now.

It's been a while since I heard any revealing anecdotes about adversarial examples in leading-edge GPT models, but I don't know if we can say it's a solved problem or not.


> The Turing test is well and truly beaten, regardless of how you define it

Unless the question the human asks is 'How many l's in llama'


Yeah, snark really settles the question, right up until the model gets better. Go try to fool o1-pro with that schtick.


This month, a computer solved the first Advent of Code challenge in eight seconds.

Everyone on Hacker News was saying "well of course, you can't just feed it to a chatbot, that's cheating! the leaderboard is a human competition!" because we've normalized that. It's not surprising, it's just obvious, oh yeah you can't have an Advent of Code competition if the computers get to play as well.

Granted it took seven years. Not three.


I think the achievements in the past couple of years are astonishing, bordering on magic.

Yet confidently promising AGI/self-driving/Mars landings in the next couple of years, over and over, when the confidence is not justified makes you a con man by definition.

If the number 3 means nothing and can become 7 or 17 or 170, why keep pulling these timelines out of their overconfident asses?

Did we completely solve robotics or prove a longstanding theorem in 2020? No. So we should lose confidence in their baseless predictions.


Self-driving is not so much a technological problem as it is a political problem. We have built a network of roads that (self-evidently) can't be safely navigated by humans, so it's not fair to demand better performance of machines. At least, not as long as they have to share the road with us.

'AI' landings on Mars are the only kind of landings possible, due to latency. JPL indisputably pwned that problem long before anyone ever heard of OpenAI.

Theorem-proving seems to require a different toolset, so I don't know what made him promise that. Same with robotics, which is more an engineering problem than a comp-sci one.


The cars are still worse than humans.


On an uneven playing field, yes. If we'd designed our roads for the robots, the robots would do better.

In any case the robots are getting better. Are we?


From Musk's email:

"Frankly, what surprises me is that the AI community is taking this long to figure out concepts. It doesn’t sound super hard. High-level linking of a large number of deep nets sounds like the right approach or at least a key part of the right approach."

A genuine question I've always had: are these charlatans conscious of how full of shit they are, or are they really high on their own stuff?

Also, it grinds my gears when they pull probabilities out of their asses:

"The probability of DeepMind creating a deep mind increases every year. Maybe it doesn’t get past 50% in 2 to 3 years, but it likely moves past 10%. That doesn’t sound crazy to me, given their resources."


You should read what he says about software engineering. He's clearly clueless.


I'm interested, can you point me to some interviews or posts with him talking about it?


Amongst people who think probabilistically, this isn't a weird statement. It's a very low-precision guesstimate. There is a qualitative difference between 50-50, 90-10, 99-1, etc., and it's only their best guess anyway.


Just because you can generate numbers between 0 and 1 doesn't make them meaningful probabilities.

Are these based on data? No, they're rhetorical tools used to sound quantitative and scientific.

Nobody will be applying the calculus of probability to these meaningless numbers coming out of someone's ass.

And most importantly, is he himself willing to bet a significant fraction of his fortune based on these so-called probabilities? I don't think so. So they're not probabilities.


A conversation like this happens every minute between investors, people who work at hedge funds, trader types, bookies, people who work in cat insurance, etc. They just think this way. These are "priors" in the Bayesian sense, based on intuition. Notice the lack of precision. Nobody says "50%" or "10%" to sound scientific. I'm 99.9% certain it's better than using ambiguous terms like "likely", "probably", "certainly possible" and so on.


I like to use the 0.1% CYA too, but that makes you sound 99.9% less sure.


1 in 1000 is probably the best way to put it. 99.9% and 0.1% sound like fake precision, even though they aren't.


The 50-50 probability is way overused. It's like people subconsciously think that having two outcomes makes the odds 1:1.


> Frankly, what surprises me is that the AI community

I came here thinking about this exact part. Well, many of them, but this one in particular.

What surprises me about Elon is how much he can talk about other peoples' work without doing any of it himself. And yet each time I hear him talk about something I'm well-versed in, he sounds fairly oblivious yet totally unaware of that fact.

His go-to strategy seems to be hand-waving with a bit of "how hard could it be?"

He's very fortunate he has competent staff.


At his level of wealth and power he can unfortunately peddle in all sorts of highfalutin technical-sounding nonsense with zero accountability.

Only a few people like Yann LeCun can afford to call out his bullshit and survive the reality distortion field.


With people kissing your ass all day and viewing wealth as a sign of intelligence (which would make Musk the smartest man in history by a lot), it's more understandable.


This is supposedly common in models of grandiose narcissism, a subtype associated with leadership traits (agentic extraversion). Not saying anyone has it or that it's necessarily a bad thing, but it might be worth exploring for insight into the traits that lead to this type of behavior.


Related to the hyperreal numbers mentioned in the article is the class of surreal numbers, which have many fun properties. There's a nice book describing them by Don Knuth.
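
For anyone curious about the flavor: a surreal number is a pair {L | R} of previously constructed surreals in which every member of L is less than every member of R. A few standard examples, quoted from memory from Conway/Knuth, so worth double-checking against the book:

    \[
    0 = \{\ \mid\ \}, \quad 1 = \{0 \mid\ \}, \quad -1 = \{\ \mid 0\}, \quad \tfrac{1}{2} = \{0 \mid 1\},
    \]
    \[
    \omega = \{0, 1, 2, \dots \mid\ \}, \qquad
    \varepsilon = \{0 \mid 1, \tfrac{1}{2}, \tfrac{1}{4}, \dots\}, \qquad
    \varepsilon \cdot \omega = 1 .
    \]

The same construction hands you infinite and infinitesimal numbers for free, which is what makes the comparison with the hyperreals natural. Knuth's book is "Surreal Numbers" (1974), written as a short mathematical novella.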


The hyperreals and surreals are actually isomorphic under a mild strengthening of the axiom of choice (NBG).

https://mathoverflow.net/questions/91646/surreal-numbers-vs-...

See Ehrlich’s answer.

