Hacker News | asperous's comments

One advantage of UUIDs is that they can be generated on several distributed systems without having to check with each other that they are unique. Only long IDs make this reliable. YouTube IDs are random and short, but YouTube has to check they are unique when generating them.
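To put rough numbers on why only long IDs make this reliable, here is a minimal sketch using the standard birthday approximation. The 122 random bits are those of a version 4 UUID; the 64-bit figure is an assumed stand-in for a short ID (roughly an 11-character base64 string), not a statement about how YouTube actually generates its IDs.

```python
import math


def collision_probability(num_ids: int, random_bits: int) -> float:
    """Approximate chance that at least two of num_ids random IDs collide."""
    space = 2 ** random_bits
    pairs = num_ids * (num_ids - 1) / 2
    # Birthday approximation: p ~= 1 - exp(-pairs / space).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / space)


billion = 10 ** 9
p_uuid4 = collision_probability(billion, 122)  # random bits in a version 4 UUID
p_short = collision_probability(billion, 64)   # assumed short-ID size
```

With a billion IDs the assumed 64-bit case lands at a few percent, which is why a uniqueness check is needed, while the 122-bit case is vanishingly small.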

Maybe one way is to split up a random assignment space and assign to each distributed node, but that would be more complex.
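A minimal sketch of that split, assuming each node is given a distinct `node_id` up front. The class name, the bit widths, and the use of a per-node counter instead of random draws within the slice are all choices made for illustration, not something from an existing system.

```python
import itertools


class NodeIdAllocator:
    """Hands out IDs from a slice of the ID space reserved for one node."""

    def __init__(self, node_id: int, node_bits: int = 10, counter_bits: int = 54):
        if not 0 <= node_id < 2 ** node_bits:
            raise ValueError("node_id does not fit in node_bits")
        self._prefix = node_id << counter_bits
        self._limit = 2 ** counter_bits
        self._counter = itertools.count()

    def next_id(self) -> int:
        local = next(self._counter)
        if local >= self._limit:
            raise OverflowError("this node has exhausted its slice")
        # High bits identify the node, low bits are unique within the node.
        return self._prefix | local


node_a = NodeIdAllocator(node_id=1)
node_b = NodeIdAllocator(node_id=2)
ids = [node_a.next_id() for _ in range(3)] + [node_b.next_id() for _ in range(3)]
```

The extra complexity shows up in handing out the `node_id` values, which still needs coordination once per node.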


And then there’s uuid5 which you can use to generate identical unique identifiers across multiple systems without having to check on each other. Very very useful to have in some circumstances.
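For illustration, a small sketch with Python's standard `uuid` module. The namespace name `orders.example.com` and the key format are made up for the example; the only requirement is that every system uses the same namespace UUID and the same key string.

```python
import uuid

# Hypothetical namespace for this example; any fixed UUID shared by all systems works.
ORDERS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "orders.example.com")


def order_uuid(order_key: str) -> uuid.UUID:
    """Derive the same UUID for the same key on every machine."""
    return uuid.uuid5(ORDERS_NAMESPACE, order_key)


# Two systems computing the ID independently, with no communication between them.
id_on_system_a = order_uuid("customer-42/2024-06-01")
id_on_system_b = order_uuid("customer-42/2024-06-01")
```

Both variables hold the same value because a version 5 UUID is a hash of the namespace and the name, with no randomness involved.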


Having tons of people employ human ingenuity to manipulate existing LLMs into passing this one benchmark kind of defeats the purpose of testing for "AGI". The author points this out as it's more of a pattern matching test.

Though on the other hand figuring out which manipulations are effective does teach us something. And I think most problems boil down to pattern matching, creating a true, easily testable AGI test may be tough.


Let me play devil's advocate for a second. Let's suppose that with LLMs, we've actually invented an AGI machine that also happens to produce useful textual responses to a prompt.

This would sound more far-fetched if we knew exactly how they work, bit-by-bit. We've been training them statistically, via the data-for-code tradeoff. The question is not yet satisfactorily answered.

In this hypothetical, for every accusation that an LLM passes a test because it's been coached to do so, there's a counter that it was designed for "excessively human" AGI to begin with, maybe even that it was designed for the unconscious purpose of having humans pass it preferentially. The attorney for the hypothetical AGI in the LLM would argue that there are tons of "LLM AGI" problems it can solve that a human would struggle with.

Fundamentally, the tests are only useful insofar as they let us improve AI. The evaluation of novel approaches to pass them like this one should err in the approaches' favor, IMO. A 'gotcha' test is the least-useful kind.


There’s every reason to believe that AGI is meaningfully different from LLMs because humans do not take anywhere near this amount of training data to create inferences (that and executive planning and creative problem solving are clear weak spots in LLMs)


>There’s every reason to believe that AGI is meaningfully different from LLMs because humans do not take anywhere near this amount of training data to create inferences

The human brain is millions of years of brute force evolution in the making. Comparing it to a transformer, or really any other ANN, which essentially starts from scratch relatively speaking, doesn't mean much.


Plus it's unclear if the amount of data used to "train" a human brain is really less than what GPT4 used. Imagine all the inputs from all the senses of a human over a lifetime: the sound, light, touches, interactions with peers, etc.


Don’t forget all the lifetimes of all ancestors as well. A lot of our intelligence is something we are born with and a result of many millions of years of evolution.


But that is of little help when you want to train an LLM to do the job at your company. A human requires just a few tutorials and a bit of help; an LLM still requires an unknown amount of data to get up to speed, since we haven't reached that level yet.


Yeah humans can generalize much faster than LLM with far fewer "examples" running on sandwiches and coffee.


>Yeah humans can generalize much faster than LLM with far fewer "examples" running on sandwiches and coffee.

This isn't really true. If you give an LLM a large prompt detailing a new spoken language, programming language or logical framework with a couple examples, and ask it to do something with it, it'll probably do a lot better at it than if you just let an average human read the same prompt and do the same task.


Hmm, but is it really "generalizing" or just pulling information from the training data? I think that's what this benchmark is really about: to adapt to something it has never seen before quickly.


How many attempts have there been for humans to solve outstanding math or science problems? We're also kind of spamming ideas until one works out.


I’ll give you as much time as you want with an LLM and am 100% sure that it won’t solve a single outstanding complex math problem.


I can say the same about myself, and I would probably consider myself generally intelligent.


There’s a meaningful difference between a silicon intelligence and an organic one. Every silicon intelligence is closer to an equally smart clone whereas organic ones have much more variance (not to mention different training).

Anyway, my point was that humans better direct their energy rather than randomly spamming ideas, at least with the innovation of the scientific method. But an LLM struggles deeply to perform reasoning.


> I’ll give you as much time as you want with an LLM

With an infinite amount of time you can brute-force the whole search space with an LLM. Infinite monkeys with typewriters.


Our compute architecture has been brute forced via an evolutionary algorithm over a billion years. An LLM approaching our capabilities in like a year is pretty fucking good.


Perhaps if we don’t know how to create an evaluation that can’t be “gamed” it tells us something about how special our intelligence really is?


I don't know how to create a liver, or test one, so what does that say about my liver? Pretty much nothing.


Wouldn’t the real AGI test be that an AI would be able to do what the author did here and write this blog post?


I wouldn't be surprised if GPT-5 were able to do it: it knows that it's an LLM, so it knows its limitations. It can write code to pre-process input into a format it understands better, etc.

https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac90...

GPT-4 created a plan very similar to the article, i.e. it also suggested using Python to pre-process data. It also suggested using program synthesis. So I'd say it's already 90% there.

> "Execute the synthesized program on the test inputs."

> "Verify the outputs against the expected results. If the results are incorrect, iteratively refine the hypotheses and rules."

So people saying that it's ad-hoc are wrong. LLMs know how to solve these tasks, they are just not very good at coding, and iterative refinement tooling is in its infancy.
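The execute-and-verify loop in that plan can be sketched in a few lines. This is only an illustration: the hand-written `candidates` list is a stand-in for programs an LLM would propose, and the function name and the single grid example are made up here, not taken from the article or the shared chat.

```python
def first_consistent_program(candidates, examples):
    """Return the first (name, program) whose output matches every example."""
    for name, program in candidates:
        # Execute the candidate on each input and verify against the expected output.
        if all(program(grid_in) == grid_out for grid_in, grid_out in examples):
            return name, program
    return None  # Nothing fits: a real loop would ask for refined candidates.


# Stand-ins for synthesized programs; each maps a grid (list of rows) to a grid.
candidates = [
    ("identity", lambda grid: grid),
    ("flip_rows", lambda grid: grid[::-1]),
    ("flip_columns", lambda grid: [row[::-1] for row in grid]),
]

# One made-up training pair: the output mirrors the input left to right.
examples = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]

found = first_consistent_program(candidates, examples)
```

In a real setup, a `None` result is the point where the failing outputs would be fed back to the model to refine its hypotheses.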


Yep, but a float is more useful than a bool for tracking progress, especially if you want to answer questions like "how soon can we expect (drivers/customer support staff/programmers) to lose their jobs?"

Hard to find the right float but worth trying I think.


I agree, but it does seem a bit strange that you are allowed to "custom-fit" an AI program to solve a specific benchmark. Shouldn't there be some sort of rule that for something to be AGI it should work as "off-the-shelf" as possible?


If OpenAI had an embedded python interpreter or for that matter an interpreter for lambda calculus or some other equally universal Turing machine then this approach would work but there are no LLMs with embedded symbolic interpreters. LLMs currently are essentially probability distributions based on a training corpus and do not have any symbolic reasoning capabilities. There is no backtracking, for example, like in Prolog.


Show me a test and I’ll show you a neural network that passes it… used to be a saying.


It's LLM grade school. Let them cook, train these things to match utility in our world. I'm not married to the "AGI" goal if there is other utility along the way.


> estimated 2.4 million items

That's 5 years if one person worked on it nonstop without sleeping and each item took 60 seconds.
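The arithmetic behind that figure, as a quick sketch. The item count and the 60 seconds come from the thread; the working-year figure (8 hours a day, 230 days a year) is an assumption added here for comparison.

```python
ITEMS = 2_400_000       # estimate quoted above
SECONDS_PER_ITEM = 60   # assumption from the comment

total_seconds = ITEMS * SECONDS_PER_ITEM
years_nonstop = total_seconds / (365 * 24 * 60 * 60)

# Assumed working year, not from the thread: 8 hours a day, 230 days a year.
hours_per_working_year = 8 * 230
person_years = total_seconds / 3600 / hours_per_working_year
```

That works out to about 4.6 years nonstop, which rounds to the 5 years above, or roughly 22 person-years at the assumed working hours.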

I would assume they probably sit in a secure location and items on display or items leaving/transferred are catalogued first so there's a bit of a triage and backlog.

Museums probably don't want to turn down valuable item donations even if they don't have the resources to catalogue them right away.


British Museum seems to have about 439 employees who work on "care, research, and conservation", of a total of around a thousand employees. Seems like they have enough budget and staff to get such a high-priority task done.

https://www.britishmuseum.org/sites/default/files/2023-07/br...


> and each item took 60 seconds

The required time depends on a lot of things, such as on the target quality of the data record, the complexity and fragility of the item, etc. The primary purpose of a catalogue is not to prevent theft, but to provide a tool for research. Therefore you typically want high quality photos, ideally from different sides, angles and lighting (or even a 3D scan), a description of the item, its provenance, its treatment, keywords from a normalised vocabulary, a bibliography, etc.

Here is a random example from the British Museum catalogue: https://www.britishmuseum.org/collection/object/G_1896-0201-... -- Just think yourself how long it would take you to compile all this information. I would estimate several hours, if not days.

Following the theft, the British Museum announced a plan for a quick inventory of 2,400,000 items in 5 years for £10m.[1] This means £4.17 per item. If we use the UK adult minimum wage of £11.44 per hour as a lower bound on labour cost, this requires a rate of at least 2.74 items per hour; in other words, not more than approx. 22 minutes per record (but probably a lot less, depending on the wages of the people involved). Such a tight budget does not seem like it would allow for anything useful to be compiled for research. It sounds more like a big waste of money.
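The same estimate written out step by step, as a sketch. It assumes, like the paragraph above, that the whole budget goes to wages at the hourly minimum wage, which is the most generous case for time per record.

```python
BUDGET_GBP = 10_000_000
ITEMS = 2_400_000
MIN_WAGE_GBP_PER_HOUR = 11.44  # UK adult minimum wage used in the comment

budget_per_item = BUDGET_GBP / ITEMS

# At minimum wage, one hour of labour must cover this many items to stay in budget.
items_per_hour = MIN_WAGE_GBP_PER_HOUR / budget_per_item

# Most time a single record can take if everyone is paid the minimum wage.
minutes_per_item = 60 / items_per_hour
```

Any wage above the minimum, or any spending on equipment and overhead, pushes `minutes_per_item` below this ceiling of about 22 minutes.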

[1] https://www.theartnewspaper.com/2023/10/19/british-museum-to...


This seems like a reasonable use of resources and time? I'm assuming the British Museum has been around a bit longer than 5 years and hopefully plans on being around longer than 5 years.

Maybe they can hire a couple people. [edit] removed inflammatory last sentence.


> That's 5 years if one person worked on it nonstop without sleeping and each item took 60 seconds.

Or... 2 people doing regular working hours for 3 years taking 10 seconds per item.
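A quick check of whether that adds up. The item count, head count, duration, and 10 seconds per item are from the thread; the length of a working year is an assumption added here.

```python
ITEMS = 2_400_000
SECONDS_PER_ITEM = 10
PEOPLE = 2
YEARS = 3

# Assumed working year, not from the thread: 8 hours a day, 230 days a year.
HOURS_PER_PERSON_YEAR = 8 * 230

hours_needed = ITEMS * SECONDS_PER_ITEM / 3600
hours_available = PEOPLE * YEARS * HOURS_PER_PERSON_YEAR
fits = hours_needed <= hours_available
```

Under these assumptions the work fits with room to spare (about 6,700 hours needed against about 11,000 available), so the arithmetic holds; the 10 seconds per item is the part open to dispute.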

Each item can literally be 'photo' + drawer/cabinet number. All other details can be crowdsourced or done later.

How long does it take to take a photo?


That's not cataloguing, that's recording, and as far as I understand this was done long ago. Cataloguing is those "all other details" which require expertise and time: all the things like figuring out that this coin is a Roman coin from the 1st century, and that other coin from the same find is from another location.


If it's long ago done, where is the wiki with 2.4 million photos? If it's not online it might as well not exist


It might be strange to us, but a lot of the world exists outside the internet. Many people never go online or go online seldom. And that's okay.


Museums have huge collections of many things that are not online but which researchers can access locally.


How much time does it take to move a specific artefact in/out of storage? What are the dimensions of the artefact? Is it sensitive to light? Is special equipment required to handle it? Every piece is different, not to mention the mandatory planning involved before moving every item. It's not the same as a retail store photographing their merchandise.


It's one-quarter of their collection, and they've had 271 years to accumulate and catalog all this material. As others have mentioned, they have enough staff.

I would assume they issue a receipt and itemize donations nowadays. I think part of it could be reluctance because not everything they have in their possession is rightfully theirs[0].

I don't know all the attributes required to properly catalog an artifact, but I imagine that advances in computer vision and translation could help tremendously.

[0] https://www.businessinsider.com/british-empire-stole-cultura...


Not a lawyer but those contracts aren't legal. You need something called "consideration" ie something new of value to be legal. They can't just take away something of value that was already agreed upon.

However they could add this to new employee contracts.


"Legal" seems like a fuzzy line to OpenAI's leadership.

Pushing unenforceable scare-copy to get employees to self-censor sounds on-brand.


I agree with Piper's point that these contracts aren't common in tech, but they're hardly unheard of. In 20 years of consulting work I've seen dozens of them. They're not uncommon. This doesn't look uniquely hostile or amoral for OpenAI, just garden-variety.


Well, an AI charity -- so founded on openness that they're called OpenAI -- took millions in donations, everyone's copyright data...only to become effectively for-profit, close down their AI, and inflict a lifetime gag on their employees. In that context, it feels rather amoral.


This to me is like the "don't be evil" thing. I didn't take it seriously to begin with, I don't think reasonable people should have taken it seriously, and so it's not persuasive or really all that interesting to argue about.

People are different! You can think otherwise.


Therein lies the issue. The second you throw idealistic terms like “don’t be evil” and __OPEN__ ai around you should be expected to deliver.

But how is that even possible when corporations are typically run by ghouls who enjoy relativistic morals when it suits them. And are beholden to profits, not ethics.


I think we do need to start taking such things seriously, and start holding companies accountable using all available venues (including legal, and legislative if the laws don't have enough leverage as it is) when they act contrary to their publicly stated commitments.


Contracts like this seem extremely unusual as a condition for _retaining already vested equity (or equity-like instruments)_, rather than as a condition for receiving additional severance. And how common are non-disclosure clauses that cover the non-disparagement clauses?

In fact both of those seem quite bad, both by regular industry standards, and even moreso as applied to OpenAI's specific situation.


as an exit contract? Not part of a severance agreement?

Bloomberg famously used this as an employment contract, and it was a campaign scandal for Mike.


This sounds just like the non-compete issue that the FTC just invalidated. I can see if the current FTC leadership is allowed to continue working after 2025/01/20 that these things might be moved against as well. If new admin is brought in, they might all get reversed. Just something to consider going into your particular polling place


It doesn’t matter if they are not legal. Employees do not have resources to fight expensive legal battles and fear retaliation in other ways. Like not being able to find future jobs. And anyone with family plain won’t have the time.


“You get shares in our company in exchange for employment and eternal never-talking-bad-about-us”?

Doesn’t mean that that’s legal, of course, but I’d doubt that the legality would hinge on a lack of consideration.


You can't add a contingency to a payment retroactively. It sounds like these are exit agreements, not employment agreements.

If it was "we'll give you shares/cash if you don't say anything bad about us", that's normal, kind of standard fare for exit agreements, it's why severance packages exist.

But if it is "we'll take away the shares that you already earned as part of your regular employment compensation unless you agree to not say anything bad about us", that's extortion.


Throw in a preamble of “For $1 and other consideration…”


They give you a general release of liability, as noted elsewhere in the thread.


Have you seen the contracts?


The training jurisdiction one is interesting. Are future companies going to exclusively train their models in a copyright-lax country?

Seems like jurisdiction would be based on the copyright of the allegedly infringed images, and UK-based users creating copyright infringing copies in the UK.

But that's apparently not the law or case law in the UK yet.


Like anyone stealing data or other digital goods in countries that don't respect property, they will be punished when trying to sell in countries where property is protected.


Still not convinced training AI is theft, and I think copyright law is a cudgel used by powerful corpos to extract rent and smash innovation.


While I somewhat agree with that take on copyright, I think you have to pick a lane to keep that position coherent:

Either you insist that copyright must be respected at every level, and the creators of material used for training deserve appropriate compensation, or

You throw out copyright completely in this context, but that means the resulting models cannot be treated as proprietary either unless they were produced using absolutely no unlicensed training data.

I think there is an argument for both. Want to create a proprietary model for commercial use? Pay up. Creating an open source, copyleft project exclusively for personal use and artistic expression? Exemption.

The current status quo is perfectly described by powerful corpos extracting rent. Billions for themselves and pennies for the average artist.


> Either you insist that copyright must be respected at every level, and the creators of material used for training deserve appropriate compensation

I don't think that current copyright laws automatically entitle people to royalties from something like AI-generated imagery. The dichotomy you've presented here isn't pro-copyright vs anti-copyright, but "so pro-copyright that they argue for expanding the current laws" vs not.

> Want to create a proprietary model for commercial use? Pay up. Creating an open source, copyleft project exclusively for personal use and artistic expression? Exemption.

That definitely benefits all the "powerful corpos" you've mentioned here. Now, Disney, Adobe, Meta etc. can use a fraction of their money to get all the data they would ever need and be the sole profiteers, while all newcomers will face an impassable barrier to entry that prevents them from ever threatening the existing players.


Copyright already has a lot of limitations. It has never been absolute, nor was it intended to be, because the point is to promote the arts, and too strict a grant of rights would stifle them instead. Indeed, in most places it is accepted that copyright is a significant limitation on the liberty of society at large, justified (or not, depending on one's opinions) by encouraging more works. Accepting it as a restriction means there is some degree of acceptance that it should not be more expansive than it needs to be (and many will disagree about whether the current length of copyright is or is not more expansive than it needs to be).

The only limitation that needs to be there for training on copyrighted works not to be infringing is to accept that extracting information about the work is not infringing if copyrighted elements of the work itself are not significantly reproduced.


There is at least one middle ground area where you acknowledge that copyright and intellectual property restrictions should be removed, but that we should also recognize that all of the existing work was created by artists who expected they would have copyright protection. We should in my view not take from artists without their consent, and there is no implied consent when their works were posted at a time they believed they were protected by copyright.

This would mean we have to do a few difficult and worthwhile things: explicitly dismantle the copyright system, encourage artists to donate their existing works to the commons, and then only make datasets based on legally collected information. This would also have the side effect of encouraging the development of new training techniques and model designs which are more sample efficient.

I am afraid that what we will do instead is allow some erosion of copyright for small creators without dismantling the power large intellectual property holders have over the rest of us.


> We should in my view not take from artists without their consent

I think "take" is the wrong word here, nobody is republishing the copyrighted works, instead the model gets a gradient update. The update is shaped exactly like the model itself, and it gets stacked up with other updates from other examples. It doesn't look like the original work at all, the original work was a picture or book, the gradients look like a set of floating point tensors. AI models decompose inputs into basic concepts, they don't copy like bittorrent.
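A toy illustration of the "shaped exactly like the model" point, using a three-weight linear model in plain Python rather than a real image or language model. The function names, data, and learning rate are made up for the example, and it says nothing about the legal question.

```python
def gradient(weights, features, target):
    """Gradient of 0.5 * (prediction - target)**2 for a linear model."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    # One number per weight: the gradient has the same shape as the model.
    return [error * x for x in features]


def train_step(weights, batch, learning_rate=0.1):
    """Average the per-example gradients and apply one update."""
    total = [0.0] * len(weights)
    for features, target in batch:
        grad = gradient(weights, features, target)
        # Updates from different examples are summed into the same slots.
        total = [t + g for t, g in zip(total, grad)]
    return [w - learning_rate * t / len(batch) for w, t in zip(weights, total)]


weights = [0.0, 0.0, 0.0]
batch = [([1.0, 2.0, 3.0], 1.0), ([0.5, -1.0, 2.0], 0.0)]
new_weights = train_step(weights, batch)
```

After the step, only the three updated numbers remain; the individual examples are not stored anywhere in `new_weights`, although this toy case cannot show what a much larger model may or may not be able to reproduce.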

Why should an AI not be allowed to form a full world model that includes all published works? It's not like the authors can use copyright to stop anyone from seeing their works, they never had a right to stop others from seeing.


I am more arguing that if it’s considered taking, we should follow the path I recommend.

Whether or not it is taking is more nuanced, but I will say I’m not sympathetic to the idea that it’s broadly similar to a human looking at the work. It’s just very, very different. You can’t spin up a copy of a human on a cloud server and make them work 24/7.

I would expect that as laypeople we aren’t equipped to reason about this effectively. I suspect that decades or more of case law would be relevant to how this would be viewed, and I’m personally not equipped to argue it.

What I do know is that artists don’t feel good about it. They feel like they’re being taken from. And I’m not inclined to quickly dismiss their concerns. I think this needs careful, deliberate consideration. And if a system could be built that is consent based, I’d feel much better about it. A human child could be raised and mature without ever being exposed to copyrighted material beyond a handful of books (harder in the modern world but common 200 years ago). Maybe we just need to build better models. It certainly seems possible.


If you can legally use something, it's not stealing.

Even the argument that it's logical to call it that isn't certain.

If I take your picture, I own it?

If you take a picture and I upload it to FB, Meta gets to use it?

If I publish your book under my name and no one finds out, did I write it?

If no author can be found, may I read it?


> I think copyright law is a cudgel

It is interesting to me that we are finally seeing a case where many smaller, independent artists and creators are using these laws to assert their rights against the encroachment of the moneyed tech interests, and of course now all of the powerful corpos are singing another tune. Rules for thee, not for me.


Yep. That is the question. Anyone that immediately comes back with “it’s stealing!”, especially those that were confidently saying it when this first became an issue, long before they would’ve had any time to contemplate it deeply, are just proving that techies’ sense of transferable expertise is completely unfounded.


No it isn’t, because ‘stealing’ is allowed.

There’s no question these neural networks and their output are derivative works. However being a derivative work isn’t enough to guarantee copyright infringement.

So, the only question is if we are going to carve out an exception here or not. The idea someone can use a VCR to copy live TV and let people watch it later came out of a court case not copyright law. There’s a lot of such exceptions, but getting one isn’t guaranteed.


> There’s no question these neural networks and their output are derivative works.

In the two US cases we have any progress on so far, the established requirement for substantial similarity (as opposed to "dependent on" or such) has been upheld, with Judge Vince Chhabria specifically setting out that it'd "have to mean that if you put the Llama language model next to Sarah Silverman's book, you would say they're similar", and Judge William H. Orrick agreeing with the defendants that "plaintiffs cannot plausibly allege the Output Images are substantially similar or re-present protected aspects of copyrighted Training Images, especially in light of plaintiffs’ admission that Output Images are unlikely to look like the Training Images".

The UK definition of derivative works is, to my understanding, narrower and specifically enumerated as opposed to the US's more open-ended definition.

The remaining area of doubt, assuming the above remains consistent, is over the transient copying that occurs during training.


> the transient copying that occurs during training.

I think this should be dismissed, as it is the same level of transience as the workings of the internet: you, your ISP, caching proxies, etc. all made a transient copy as part of the existing (legal) consumption of the works that the author has put online.

Unless the works were illegally copied for training, which cannot be true if the works were publicly available for viewing on the internet, this transient copying cannot be a valid infringement.


Doing something a little isn’t the same as doing something a lot. You can walk into a restaurant and look at a menu for 5 minutes and then leave without issue, but try doing that same thing for 8 hours.

Downloading a single transient copy of some image once in the lifetime of a company is different from doing that same action a hundred times, once for each version of the network.


This case involves many examples of substantial similarity. Worse, it’s precedent that generative AI doesn’t necessarily avoid creating such examples.

Defendants can easily argue that being 1/10 millionth or whatever of the training set means their specific work is unlikely to show up in any specific example but the underlying mechanism means it can be recreated.


The defendants will evidently claim transient copying.


I doubt these companies are constantly downloading the full training set rather than keeping it in a database somewhere.

Hard to argue keeping a copy of some copyrighted work indefinitely counts as transient.


> I doubt these companies are constantly downloading the full training set rather than keeping it in a database somewhere.

Precisely to argue for transient copies, they don't need to keep terabytes of data stored.

>Hard to argue keeping a copy of some copyrighted work indefinitely counts as transient.

You're assuming that they're keeping the works indefinitely, which again is not the case.


> Precisely to argue for transient copies, they don't need to keep terabytes of data stored.

Those kinds of legal workarounds rarely work.

They are dependent on persistent access, which gives them the equivalent benefit of keeping a persistent copy.


> There’s no question these neural networks and their output are derivative works.

A derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work (the underlying work).

There is absolutely no agreement that what neural networks do (as a rule) counts as such, so it is not at all correct to say "there is no question..."

If learning how to draw by watching other people draw makes everything you draw a derivative work, then perhaps you have a point.


The network in question recreated the exact content in question in a specific instance. What happens in general isn’t the issue; the problem comes from specific output.

For a neural network to be able to recreate a complex work with minimal prompting, it must encode that information and therefore be a derivative work.


There are some ironclad exceptions but they would have to make it through the dysfunctional Congress.

The big one is recipes. Recipes under the current copyright regime in the US are considered non-copyrightable facts, which is why every cookbook and recipe blog has lots of copyrightable splash photos and personal anecdotes. Congress specifically doesn’t want grandmas getting sued for copying the recipe on the box.


> Congress specifically doesn’t want grandmas getting sued for copying the recipe on the box.

Recipes don't have a specific exception within the copyright law that Congress has carved out.

It is also not cut and dried. It basically boils down to facts not being copyrightable. So a list of ingredients and basic instructions (e.g. cooking time and temperature) won't be granted copyright protection.

But, the prose in the instructions can be copyrighted. So copying a whole recipe verbatim can be copyright infringement, but copying the list of ingredients and writing out the basic instructions is not.


Sounds like a job for LLMs - extract ingredients and steps, then verbalize it back in a completely different style.


But to what end? SEO optimized recipe copy sites already exist and are so numerous to the point where going to specific sites or books is now just a signal of reputability in a sea of trash.


not sure what Congress has to do with a case in the UK

fair use is mostly a US concept, there is no such thing in the UK or most other countries


It seems like UK and EU agree that you cannot copyright a recipe other than maybe the exact way it was written:

https://www.twobirds.com/en/insights/2020/uk/intellectual-pr...

https://www.copyright.eu/docs/protection-of-a-recipe/

Though you can patent novel methods of food production, which is also true in the US.

The root statement is still the same, legislatures can amend copyright laws as they wish if they really care. I don’t know that the UK parliament is exactly functioning well right now, but that’s my impression from across the pond.


> I don’t know that the UK parliament is exactly functioning well right now

in terms of ability to legislate it works considerably better than the US congress

up to you if you call that well functioning


You can only copyright the actual expression of a recipe as a literary work, but the functional aspect, the cake let's say, isn't copyrightable.


> There’s no question these neural networks and their output are derivative works.

Most generated content almost certainly isn’t derivative work by the standards of copyright law. It’s plainly obvious to anybody who’s read Frank Herbert’s books that he derived a lot of ideas from Isaac Asimov, but it’s equally obvious that Dune isn’t a derivative work of Foundation.

If I had some commercial interest in generative AI models, I’d be very happy that everybody is debating the copyright implications. Because copyright law is certainly going to favour the models. The biggest regulatory risk to them as far as I can tell is that they clearly don’t have section 230 protections, and I can’t imagine how that isn’t going to come crashing down around them rather soon.


If you run someone over you can’t defend yourself by saying 99.999% of the time you didn’t run someone over. Most output being free of copyright issues isn’t a defense if any output has those issues.

Specific examples of clear copyright infringement mean that output is a derivative work AND by encoding enough information to recreate it the underlying neural network must itself be a derivative work.


Derivative work has a specific meaning in copyright law, there has to be something in the output, and that's not the case here. Otherwise every single owner of 5 billion images could sue you for your "cat at a cafe" midjourney picture.

Judge Orrick in one of the US cases already called this idea "nonsense", his words.


Not all outputs are at issue here, but if ANY output is copyright infringement they have problems.

Specific and clear examples of derivative works are shown therefore both those exact examples and the underlying neural network must be a derivative work.


I think that if you stick to a definition of 'fair use' that allows the slurping of entire corpuses, then copyright doesn't have any teeth anymore.

If the license makes the data available for public viewing, like with websites, then slurp all you want. If the license forbids automated bulk processing, then stop whining about fair use and pay for a license that allows bulk processing.

"Our system actively uses every single byte of data to produce any output" is so obviously not the intention of fair use clauses.


Fair use doesn't exist in the UK.


What is your justification for AI training not being theft?


If including a copyrighted work in an AI training corpus is theft because of its influence on some artificial neural net, then so is viewing it by a human being, whose memory is now somehow the property of the copyright holder, an absurd conclusion.


Well if we're concluding that training a neural net and a human mind forming memories is exactly the same thing, I'm looking forward to all their defenders agreeing that the neural nets should be held criminally responsible every time they generate an image deemed unlawful in certain jurisdictions...

Otherwise there's obviously a legally relevant distinction between a human mind which is ascribed agency to decide if and how to use its memories of copyrighted material, and importing into an information retrieval system which can't help but spit out transformations of parts of its inputs on demand, (including lossy representations of the Getty watermark if it's fed enough Getty material, or an exact facsimile of an image if that's all it's trained on...)


There's a massive jump in that logic, which is basically equating a large neural net to being exactly the same as a singular human being. If you ask me that is clearly not the case. They operate in entirely different ways and have very different properties.


>There's a massive jump in that logic, which is basically equating a large neural net to being exactly the same as a singular human being.

It does not assume that.


Are you willing to give up this point of view the first time you fail a Turing test? It seems only fair.


Humans are not treated the same as machines and business ventures. Many things are illegal for the latter and are not thought crimes for the former.


Your premise that a neural net is equivalent to a human brain seems much more absurd.


You wouldn't look at a car.


Copying is not theft in general as you don't take anything away from anybody


The major corps and tech companies loved to claim otherwise for years as it suited their interests.


>>Copying is not theft in general as you don't take anything away from anybody

That's a perfect oxymoron.


Theft has specific meaning in law, and it's reserved for physical property (or unique digital assets and financial instruments like bonds).

Copyright uses infringement, which is not theft: it's non-rivalrous, and it contains a number of exceptions.


Fair use. (Even if the creator doesn't like that)


Training a model on another person’s work shouldn’t be considered theft, but the model also shouldn’t be allowed to generate profits without, of course, compensating the owner of the training data.

Someone needs to come up with a royalty structure for this stuff.


I think using data that you don't have the copyrights to train AI is theft.

That being said, Getty is hardly the paragon of goodwill considering they regularly steal from public domain databases, issue DMCA takedown requests of the stolen content from said databases, and then turn around to sell it to unwitting people for a subscription. They own none of the copyrights for what they are doing but have been allowed to get away with it.


> I think using data that you don't have the copyrights to train AI is theft.

There are public domain works you can use and copyright doesn't protect ideas. It protects expression of ideas, so getting "just the ideas" without the expression is ok.


Right. Public domain is stuff that doesn't have exclusive IP rights. You can do with that what you want.

The problem is that "expression of ideas" in the realm of AI is akin to plagiarism by human standards, because it's a literal copying of the source material blended together. I couldn't recite you the entire plot of the Odyssey off the top of my head literally, but AI can, because it has the source material. We just tell it to do funny ha-ha things so it's okay.


Have you only read books you own the copyright to?

What’s the legal distinction between you learning and AI learning?


If I regurgitate something I read in a copyrighted book without a proper license, that would also be theft; no distinction there.

I'm not distributing my brain. At least the same rules (but probably more restrictive ones) should apply to models: training is okay, but using and distributing should be limited by copyright.


Explaining anything publicly based on my understanding I got reading books would be illegal following this logic. I'm not sure this is how it works.


They want to muddle the distinction between ideas and expression. You can't copyright ideas. Everyone is entitled to copy ideas.


It would not be illegal based on fair use (though you have to be careful there also), but if you try to regurgitate large portions of the book then it would be. And we do know that models regurgitate training material verbatim (Copilot).


Redistribution, and the scale of it.

Besides which, "learning" isn't a fair use exemption anyway.


Using that which belongs to others without their consent is theft. There isn’t much to debate, unless of course, you wish to benefit from that theft. Powerful corpos can train whatever they wish against the data they own. For instance, Microsoft can train its bots on Microsoft’s source code, instead of people’s code, but that is not going to happen because they are aware of the implications - the procedural generators are exactly that and nothing more. Meanwhile we can’t use their products without a license.

So if they decide to play this game then so should we.


> Using that which belongs to others without their consent is theft

Are text snippets, thumbnails and site caches shown by search engines (on an opt-out basis) "theft"? If you draw a car, which you can do due to having seen many individually-copyrighted car designs, are you stealing from auto manufacturers? Have I just committed theft by using a portion of your comment above as a quote?

I don't claim here that statistical model fitting inherently needs to be treated the same as the above examples, but rather use examples to show that the bar of "using" is far too broad.

Legally, copyright infringement in the US requires that the works are substantially similar and not covered by Fair Use. Morally, I believe that artificial scarcity, such as evergreening of medical patents, is detrimental and needs to be prevented wherever feasible - and wouldn't call any kind of copying/sharing/piracy "theft". The digital equivalent of theft is, for example, account theft where you're actually removing the object from the owner's possession.


Theft is the act of taking away someone else's property. "Using" (aka copying) the public data I create isn't theft, be it with my consent or without. It may be copyright infringement under certain conditions, but arguing that this infringement is stealing is like arguing that digital piracy and shoplifting are basically the same thing.


> Using that which belongs to others without their consent is theft.

Using publicly available information doesn’t require anybody’s consent.


That's because the word 'training' is doing all the heavy lifting here. Think of it as copying, compressing and storing all the copyrighted material in a database. Humans learn, humans train, computers encode data. You would never say ffmpeg learned a movie.


> You would never say ffmpeg learned a movie.

no you wouldn't, but these diffusion models do way more than ffmpeg, and do qualitatively different things.

I am on the fence, but i lean towards the side where training an AI using existing works is not infringement, as long as the AI's output is (or can be) majority new works. For example, a poor training algorithm that merely repeats the training dataset (and cannot output new works) is infringing, while a different algorithm (such as the current stable diffusion one) that can output works that has never been made and is totally new, does not infringe - after all, style and ideas are not infringing and if the algorithm managed to extract those ideas from the training set, all the better.


Majority new works is not a good enough standard. If any output is a direct reproduction of a copyrighted input that output is copyright infringement whether it was intended or not. If the trainer of the model doesn’t want to be sued for infringement they are responsible for a robust safety mechanism that prevents it. If that safety mechanism isn’t possible then don’t use copyrighted works if you have any possibility of directly reproducing them.


> If any output is a direct reproduction of a copyrighted input that output is copyright infringement

so by that standard, why isn't photoshop a copyright infringement? You can use it to create a copy just the same.


Photoshop isn’t a copyright infringement inherently but producing an infringed image with photoshop is still infringement. Much the same way AI is not inherently infringement but any production of infringing content by the AI is still infringement.


What’s the test for “has never been made and is totally new”?

If I look at a photo of Prince and then using that image as reference create a new silkscreen painting is that fair use or infringement?

Because the US Supreme Court has ruled that instance I referenced was infringement as both images were used for magazine covers [0].

[0] https://www.nbcnews.com/news/amp/rcna64624


> What’s the test for “has never been made and is totally new”?

the existing copyright rulings are sufficient to determine this, and have nothing to do with ai models.

You've already pointed out a case - if you use an AI to generate an image which has sufficient likeness to an existing one, then the AI portion is irrelevant to the ruling. You could've made that same image in photoshop without AI, and should obtain the same ruling.

But in the above circumstance, the silkscreen used in the creation of the image does not itself infringe. Replace that silkscreen with an AI model and nothing has changed.


> Think of it as copying, compressing and storing all the copyrighted material in a database.

But it isn’t. It’s just a series of vectors that point to a likely occurrence of the next word or pixel or bit in a sequence.
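To make "vectors that point to a likely next word" concrete, here is a toy sketch. It is a plain word-follower count table, not how Stable Diffusion or any real LLM is implemented, and the names `train` and `generate` are made up for illustration. The table holds only "which word tended to follow which", yet when the training text is a single sentence with no repeated words, greedy generation gives that sentence back.

```python
from collections import Counter, defaultdict

def train(tokens):
    """Count, for each token, which tokens followed it in the training text."""
    table = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        table[current][following] += 1
    return table

def generate(table, start, max_tokens):
    """Greedily emit the most frequent follower of the previous token."""
    out = [start]
    while len(out) < max_tokens and out[-1] in table:
        out.append(table[out[-1]].most_common(1)[0][0])
    return out

# Illustrative training text: one short sentence, every word distinct.
corpus = "we hold these truths to be self-evident".split()
model = train(corpus)
print(" ".join(generate(model, "we", len(corpus))))
```

With a larger and more varied corpus the same table would mix sources instead of replaying one, which is roughly where the disagreement in this thread sits.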


You are trying to argue encoding semantics, but at the end of the day the "AI" was completely happy to recite Carmack's Fast inverse square root including original comments verbatim word for word.

https://twitter.com/StefanKarpinski/status/14109710611816816...


With the way these AI models work, that data isn’t stored in a database though.

It’s hard for people to understand this concept, but the fact that a model repeated some data verbatim is a happy coincidence (!) solely based on patterns of data that it has seen before.

I think people also have a hard time with how these models are trained. They are vacuuming up all sorts of data and learning from them by creating vectors that determine how follow-up data should be generated.

Sure, the original creators of this content aren’t being compensated or even recognized for it. I don’t have a good idea on how that should be handled.

For normal humans though, looking at art or reading a book, and later repeating some passage or drawing something from your own memory is not a crime. (Unless you’re sharing the DeCSS source code I guess…)

Slightly changing the topic here, but I do wonder what would happen if someone wrote a program called “Monkeys on Typewriters” that just iterated through various combinations of characters (or bits or pixels) and was able to recreate things verbatim.

Is that random happenstance copyright infringement?
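For what it's worth, such a program is only a few lines. A minimal sketch, assuming a fixed alphabet and a known target length; the function name `monkeys_on_typewriters` and the sample inputs are hypothetical. The comparison against `target` is only there to stop the demo, since the enumeration itself never looks at any existing work.

```python
from itertools import product

def monkeys_on_typewriters(target, alphabet):
    """Enumerate every string of len(target) over alphabet until target appears.

    Returns how many candidates were produced, counting the match itself.
    """
    candidates = product(alphabet, repeat=len(target))
    for attempts, candidate in enumerate(candidates, start=1):
        if "".join(candidate) == target:
            return attempts
    raise ValueError("target uses characters outside the alphabet")

alphabet = "abcdefghijklmnopqrstuvwxyz "
print(monkeys_on_typewriters("cat", alphabet))

# The search space is len(alphabet) ** length, so anything much longer
# than a few words is out of reach in practice.
print(len(alphabet) ** 100)
```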


> For normal humans though, looking at art or reading a book, and later repeating some passage or drawing something from your own memory is not a crime.

False, actually; memorizing a copyrighted work and reproducing it other than in conditions specifically excepted from copyright protection is a violation of the exclusive rights of the copyright holder to make copies.

Copyright doesn't just apply to mechanical copies which don't have a human brain in the middle of the process.


Reciting common text or common license elements and commentary isn't necessarily copyright infringement.


You would never say ffmpeg stole a movie either...


I looked it up and there is a whole industry of "low voltage" or cable networking installers.

I think it's more a matter of cost and that people wouldn't think to do it if their internet mostly works.


In case people are curious, I had several lines of ethernet added throughout my home (cameras outside, ethernet port in garage, etc.) and I was charged $100 per line (they called them drops). This included going into my attic and passing as much of the ethernet as possible inside the walls. When they went outside, they used conduit. I think most people would actually save money by dropping from 1Gbps to 200Mbps and using those savings to pay for a wired backhaul and good mesh network.


Part of the problem is that even if you know this is what you want, it's a classic hiring problem.

You're typically doing this once every what-- 20 years-- so you don't have a "default guy" and odds are you're one of maybe 3 households in your block of 450 houses who are even considering the problem, so it's not like you can ask your neighbours.

What the market needs is a national franchise, maybe implemented as a co-op model-- enforcing some basic level of "predictably not great, but not terrible", and offering a one-stop pricing and quoting system. You know they won't be the cheapest or best, but the franchise structure sells confidence. If they try to take you for a ride, you can escalate to the brand, who has leverage over them, to get satisfaction before having to duke it out with the local Contractor Licensing Authority.


If you haven't used aws a lot then you might not know this but the old instance types stick around and you can still use them, especially as "spot" which lets you bid for server time.

I had a science project which was cpu bound and it turns out because people bid based on the performance, the old chips end up costing the same in terms of cpu work done/$ (older chips cost less per hr but do less).
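The comparison is just price divided by throughput. A rough sketch with made-up numbers: the instance names, prices, and benchmark figures below are placeholders, not real AWS data.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    price_per_hour: float  # USD per hour (made-up spot price)
    work_per_hour: float   # benchmark units finished per hour (made-up)

    def cost_per_unit(self) -> float:
        """Dollars paid per unit of CPU work completed."""
        return self.price_per_hour / self.work_per_hour

# Hypothetical case: the older chip is half the price and half the speed.
old = Instance("old-gen", price_per_hour=0.02, work_per_hour=100.0)
new = Instance("new-gen", price_per_hour=0.04, work_per_hour=200.0)

for inst in (old, new):
    print(f"{inst.name}: ${inst.cost_per_unit():.6f} per unit of work")
```

If bidders price by performance, the two figures come out about equal, which is the effect described above.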

aws though was by far the most expensive so switching to like oracle with their ampere arm was a lot cheaper for me.


I don't think that's quite right, Microsoft's main game was keeping the money train going by any means necessary, they have staked so much on copilots and Enterprise/Azure Open AI. So much has been invested into that strategic direction and seeing Google swoop in and out-innovate Microsoft would be a huge loss.

Either keeping OpenAI as-is or, alternatively, moving everyone to Microsoft in an attempt to keep things going would work for Satya.



Just to add another perspective, isn't sales like this at any large company? Salespeople are pushed hard on quotas/targets and so look for any way possible to hit those. In other words, were these requests taken seriously by the product managers and leadership?

