We can't all use AI. Someone has to generate the training data (twitter.com/paulg)
341 points by redbell on March 14, 2023 | 603 comments



Nobody is going to "generate training data" if they get precisely nothing out of it.

The complete disregard for intellectual property rights being exhibited by the "AI" community will inevitably come back to bite them in the arse one day.


When you download a couple epubs from a torrent site, you're an evil pirate stealing books. When you're a FAANG-affiliated AI company and you download all the ebooks and charge for the derivative works your algorithm produces, you're a visionary changing the future.


Reminds me of the quote: "The death of one man is a tragedy, the death of millions is a statistic"


Similarly:

If you owe the bank $10,000, it's your problem. If you owe the bank $100 million, it's the bank's problem.


It's supposed to be read ironically... right?


And isn't that attributed to Stalin, or something?


Misattributed. It was actually Kurt Tucholsky, who was himself quoting an unnamed French diplomat.

"Der Tod eines Menschen: das ist eine Katastrophe. Hunderttausend Tote: das ist eine Statistik!"


Yeah, this is just capitalism in action. Almost every startup has stories like this, which appears to be why YC's startup school shies away from "goody two-shoes" type people and has questions like "please tell us about the time you most successfully hacked some (non-computer) system".

For example, in the early days Amazon famously padded its orders to wholesalers with an obscure out-of-print title, so that orders which would otherwise have been too small to ship met the minimum order size.

That's just how capitalism works. You have to be either too small to bother with or too big to fail -- the problem area is in the middle.


I hate that YC question. A "good" answer to that is almost always something that's borderline illegal or otherwise ethically bad.


I automated applying for apartments in my city's public housing program, since listings regularly get hundreds of applications within a few minutes and the ads are automatically taken down again, often before the email notifications even arrive. That way I at least got invited to some viewings without having to check the site every few minutes throughout the day, as many people I know have been doing for months. I don't see how that can be considered ethically bad. This is in Berlin.
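
For the curious, here's a minimal sketch of the kind of watcher I mean; the portal URL, the selector, and the form fields are all hypothetical, since every site is different:

    # Hypothetical sketch; the portal URL, selector, and form fields are made up.
    import time
    import requests
    from bs4 import BeautifulSoup

    seen = set()

    def poll_once():
        html = requests.get("https://housing-portal.example/listings", timeout=10).text
        for ad in BeautifulSoup(html, "html.parser").select("a.listing"):
            url = ad["href"]
            if url not in seen:
                seen.add(url)
                apply_to(url)

    def apply_to(url):
        # Assumes the application form is a plain POST; real portals may differ.
        requests.post(url + "/apply", data={"name": "...", "email": "..."}, timeout=10)

    while True:
        poll_once()
        time.sleep(60)  # check once a minute instead of refreshing by hand all day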


I wouldn't consider this "hacking a non-computer system" though. Or maybe I just misunderstand the question.


I'm old enough that to me, "hack" doesn't imply anything nefarious. But I'd hope that people who answer this question with something unethical would be excluded from the program.


I don't mean hack as in "hack a system and leak the passwords". But "hack" here implies that you somehow skirted around the rules or did something that you're not supposed to do. You've exploited something to your own advantage. Oftentimes (not always!), many people would consider this unethical behavior.

For example, there's this widely told story about how Airbnb in the beginning exploited Craigslist to list their properties. This is not illegal, it's not that they hacked into Craigslist, but it's for sure not ok. It's ok to break the rules as long as you are not caught. Imagine if everybody started doing that. Therefore, I'm also not surprised that they don't give two shits about city regulations and so on.


> I don't mean hack as in "hack a system and leak the passwords". But "hack" here implies that you somehow skirted around the rules or did something that you're not supposed to do

Right, that's what "hack" doesn't imply to me. I mean, this is "Hacker News", but it is not about nefarious activities. That said, I do understand that's what it implies to the younger generations.

> It's ok to break the rules as long as you are not caught.

This is an attitude that seems to be common in SV, and I'm disappointed if YC doesn't reject applicants who demonstrate it.


It's almost as if unethical behaviour is incentivized under capitalism...


How would the issue of intellectual property rights to published books be resolved in an alternative, non-capitalist system?

In socialism, as far as I can tell, anyone would totally be able to ingest all these books, as there would be no concept of intellectual property rights. Thus, what you complain about with respect to capitalism, would be a totally normal and acceptable thing to do in socialism.

Do you have some other, alternative system in mind, which would not allow a wholesale exploitation of private intellectual property?


Reminds me of the failure (?) of the Google Books project:

https://www.newyorker.com/business/currency/what-ever-happen...


Yeah but what's the alternative?


Not doing it? No one forces you to train ML models on random data and no one forces you to commercialize those models. If that severely limits progress then too bad - lobby for copyright reform for everyone instead of just ignoring the rules you expect others to follow.


Other countries will do it. You're free to kneecap AI development in your country if you wish.

But do you think that China and Russia are going to do the same?


"I have to be bad because if I'm not, someone else will be" is not a valid justification for anything.


The solution should take after Russia, China and Israel: use [the internet, AI, etc.] to violate international laws, not domestic ones.

The western world keeps getting crippled by ransomware attacks and other malware because nobody can practice the dark arts domestically without risking serious jail time. The only people who can advance these skills enough to know how to defend against it (or return fire!) are either CIA, Israeli contractors, criminals, or foreigners abroad. An American will go to jail for hacking a Russian bank, but no Russian goes to jail for hacking an American bank.

Train AI on foreign content only. Understand their culture better than they do. Then learn how to subvert it. It's what they've been doing to us.


Yeah, some malware doesn't even run if the computer's regional and language settings are set to Russian.
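
The usual trick, as described in public malware write-ups, is to enumerate the installed keyboard layouts and bail out if a Russian one is present. A rough sketch of the idea (not taken from any specific sample):

    # Windows-only; 0x419 is the Windows language identifier for Russian.
    import ctypes

    def russian_layout_installed():
        user32 = ctypes.windll.user32
        n = user32.GetKeyboardLayoutList(0, None)  # count the installed layouts
        layouts = (ctypes.c_void_p * n)()
        user32.GetKeyboardLayoutList(n, layouts)
        # The low word of each layout handle is the language identifier.
        return any((kl or 0) & 0xFFFF == 0x419 for kl in layouts)

    if russian_layout_installed():
        raise SystemExit  # bail out instead of running the payload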


"Intellectual property rights" is a concept made by greedy proprietors to build monopolies and keep competition out. It's no chance it was strongly pushed by Disney, and in its current form goes against the society interests as a whole. It is sold maliciously as "it incentivizes creative work", but how does "life of the author + 50 years" incentivize the author to keep creating? It just incentivizes them to keep exploiting.

Now, I think we'd all agree to something more realistic that actually protects authors, like 10-30 years of protection from publication. But every time someone brings "copyright! IP!" to ML I just laugh, since it's one of the more corrupt bodies of law that we currently have.


I disagree completely.

My older brother decided to go to art school.

It wasn’t until he was 40 that his artwork started to be noticed and worth something.

His images have been bootlegged endlessly by Asian companies placing them on t-shirts and selling them for $5 on Etsy and every other marketplace.

Fortunately for him, he's designed a very popular set of kids' books that have allowed him to move out of my parents' house, get a place of his own, pay down debt and buy a house.

Intellectual property is how he takes care of his family. He should be able to at least provide for his kids' college after his death if his IP still has value in the marketplace.

…And so should other creators.

Not everyone is Disney.


I think we mostly agree here, since those 10-30 years would work perfectly for your brother! China breaking copyright is NOT an argument that copyright is good; quite the opposite, it's a statement of how broken it is.

About the inheritance I disagree: when you are a farmer and sell a potato, you sell it once and then it's gone; if you are a carpenter, you make a furniture piece, sell it, and it's gone. Why should art be a "legacy" that can keep selling forever?


Art is not the same as potato’s or furniture.

What is the market price for a potato?

What is the market price for a chair?

What is the price of developing an art career over 40 years? Unless you believe that art, music, poetry, literature, and the rest have no value to society, then IP protections are what incentivize a portion of humanity to do something crazy like go to art school…

…instead of, you know, doing the "responsible" thing and getting a CS degree so you can one day get an AI gig stealing "training data" in the name of profit.


I purposefully put furniture because it can be a utilitarian box, or a custom carpentry art piece. Why is your brother's art more valuable than a custom table? Or a delicious dish? Or the portrait I buy from the street artist, who cannot rent-seek from it? With the best intentions, it seems you are biased because we are talking about your brother's livelihood (but I'm still happy to discuss politely).

> IP protections are what incentivize a portion of humanity to do something crazy like go to art school

Hard disagree: most of the art fields are first a passion and then a profession. People who go to art school normally do so because they love it so much that they cannot imagine themselves doing something else, even though they already know it pays little. I've never ever heard of someone going to art school "because of IP protections".

One of the reasons it's so hard to make money is that there are many people doing it as a hobby who are already great at it, and who would jump at the chance of getting some money for it, leaving people who want to make a livable wage outcompeted. Which is fine; this way society as a whole benefits greatly. Sure, it's unfortunate that more people can't live off it, but in exchange there are many, many amateurs experimenting and doing art, and from time to time one finds a formula that allows them to live off it (or "shills" to corporate).


> I purposefully put furniture because it can be a utilitarian box, or a custom carpentry art piece. Why is your brother's art more valuable than a custom table? Or a delicious dish? Or the portrait I buy from the street artist, who cannot rent-seek from it? With the best intentions, it seems you are biased because we are talking about your brother's livelihood (but I'm still happy to discuss politely).

I really didn't get any vibe of bias from them.

Either way, it seems you miss their point entirely, which is that you cannot equate a creative idea with a product.

The furniture example all boils down to whether the author allows the work to be mass-produced (if that can even be done), but the creative idea is still theirs.

>I've never ever heard of someone going to art school "because of IP protections".

As with my first point, you misunderstood them. Their point is that thanks to IP protection, an artist can invest the time needed to become successful within their lifetime.


Oh, the bit I found biased is this:

> "He should be able to at least provide for his kids college after his death if his IP still has value in the marketplace."

Why should he, besides that being a good desire for a family member? If it were a third party wanting this, we'd usually call that rent-seeking behavior, and it's usually seen negatively.

> But the creative idea is still theirs

Why is this (besides copyright law, of course)? The carpenter doesn't forbid me from creating a chair similar to his, and if we are talking about a painting and, e.g., I was learning, I could definitely copy it for learning purposes and everyone would be fine with that. So at which point should it "not be okay" to copy it? To lend to a friend in private? Can I share it (in person, no copies) with a group of friends? I bought a painting; can I print it on a t-shirt for myself? What I can do in private after I purchase a copy of the idea should not be up to the author IMHO, and that line is often very blurry.


It's very simple.

Technically as long as you don't sell a product that is a copy, you are "fine" to have said product.

However if you do sell it, it would incentivize a parasitic market.

Something you see sometimes in open source software, esp with IaaS.


If we go by the average income of a creator, then society indeed considers most of it less valuable than flipping burgers. The average writer for example earns well below minimum wage from their writing.

If you want to encourage more creative work, maybe look for other mechanisms, because going into creative work in the hope of making it is pretty much gambling with awful odds.


Your reasoning is faulty. If the average wage for a writer is "well below minimum wage" that doesn't necessarily imply people don't value creative endeavors. It could alternatively imply, for example, that the average creator doesn't produce creative content that others find valuable. With that thesis, if burger flippers do get paid more it could be a result of there being less variance in the value of the work performed.


Even far above average creators, who win awards and sell far above average, earn next to nothing. The average full-time traditionally published author still earns below minimum wage in the UK (the median household income of full-time authors in the UK is above the UK average; most UK authors are able to be full time because of their partners rather than their creative output).

Sure, we can say that still isn't "good enough", but that kind of proves the smaller point that the actual creative output of most creatives is not valued. And it ignores the main point I made above: it is irrational for people to gamble on being valued, because the vast majority of "successful" creators, as measured by relative rank within their peer group, have no realistic prospect of that happening today.


That should be "most UK authors who are full time are able to"...


I find it hard to believe that a society with weaker IP protections would be a society with noticeably less art. People love creating stuff and putting it out into the world, whether they can make a profit from royalties or not.

Of course, with weaker IP protections, there would be change. Some very specific business models that exist today would go extinct. And other business models (e.g. live performances) might see a renaissance in popularity.


There's quite a lot of assumptions there.

* Rejecting intellectual property necessitates rejecting that art provides any value to society.

* The only reason artists make art is to make money.

* (Implicitly) Without IP artists would have no incentive to make art, monetary or otherwise.

* Making art requires going to art school.

* Getting a CS degree and making art are mutually exclusive.


>Art is not the same as potatoes or furniture.

You sure about that, legally?

>…instead of, you know, doing the "responsible" thing and getting a CS degree so you can one day get an AI gig stealing "training data" in the name of profit

You're right about this, though. But this strikes me as a discursive issue that has more to do with the generalized butthurt over schooling that you see here on HN and in tech circles - I think it's funny that it's Local Boob Paul Graham talking about writing (which to be fair he cannot shut up about) that set this whole thing off.


> stealing “training data” in the name of profit

They just show it to the AI, it doesn't store it - just data about it. The same as you when you look at something, be it a painting or a chair.

The only reason AI companies store the data is that they want repeatability. If you want to train two copies on the same data you can't rely on the same sites being up. I commit the same crime by saving websites I've read, because I don't want my sources to dry up on me.

> What is the price of developing an art career over 40 years?

Market price. How does it benefit the customer here and now?

It's no different than being a furniture maker with years of carpentry skills. The value to the customer is still just the marginal benefit over an IKEA stool.


> They just show it to the AI, it doesn't store it - just data about it. The same as you when you look at something, be it a painting or a chair.

In general for machine learning, yes, but deep learning models with large numbers of parameters, such as transformers and Stable Diffusion, do actually memorise part of their training data verbatim. The problem is mainly duplicate data points. E.g.

https://arstechnica.com/information-technology/2023/02/resea...

https://arxiv.org/pdf/2202.07646.pdf Quantifying Memorization Across Neural Language Models
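
Since duplicates are the main driver, the usual mitigation is deduplicating the training set before training. A toy exact-match version of the idea (real pipelines reportedly use fuzzier matching such as MinHash or suffix arrays to catch near-duplicates too):

    import hashlib

    def dedupe(docs):
        # Keep only the first copy of each (normalised) document.
        seen, unique = set(), []
        for doc in docs:
            digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    print(len(dedupe(["Same text.", "same text.", "Different text."])))  # 2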


Yeah, currently many models do seem to have copied a few pieces of their source material.

I was translating code with one of OpenAI's code models and it spat out about thirty words of someone's code of conduct from GitHub, almost verbatim. I found the repo by searching for what it had produced. That's when I changed my prompts to remove comments, but it stuck with me.

In general though, I think the pigeon-hole principle is relevant. The model is fairly small compared to the training material and couldn't contain much actual full content or it wouldn't be able to contain anything about the rest.

I've heard people talking about using the model to judge similarity of new training data to what it's already seen, mainly to train more efficiently, but also to avoid this.
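
For a rough sense of the pigeonhole argument, here's some back-of-envelope arithmetic with assumed GPT-3-scale figures; every number is my approximation, not the spec of any particular model:

    params = 175e9                   # parameters
    model_bytes = params * 2         # ~350 GB at 2 bytes (fp16) per parameter
    tokens = 300e9                   # training tokens
    data_bytes = tokens * 4          # ~1.2 TB at ~4 bytes of text per token
    print(model_bytes / data_bytes)  # ~0.29: the weights can't hold it all verbatim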


Going to art school is a waste for most artists—why not be like Alexander Borodin and compose symphonies when you need to take a break from being a research chemist?


To incentivize artists to create and distribute art. The way we conceive of art as an act of embodied novelty differentiates it from commodities, which aim for predictable consistency. Lack of IP protection in our culture has made it impossible for 99.9% of artists to thrive economically. The US has chosen relative cultural poverty compared to other cultures that find non-market mechanisms to support artists.

As an artist starting out at the beginning of my career, I made a rational choice to never post my work online; in retrospect this seems to have been the right choice. My work is no AI's whetstone.


Artists have always, since classical times, struggled to support themselves. I don't think there is any system that would make this a viable career for the number of people who want to pursue it. Same with musicians.

It is a luxury career supported exclusively by surplus. There will always be demand for it, but it is highly elastic and heavily influenced by trends and skewed by the top end.


A social security system that supports all artists has been in operation since 1984 in my country, Finland. It's still functioning fine; the only struggle for artists here is substance abuse.


This is a fundamental change in how we relate to work and the economy. So I hope it catches on, and I've seriously thought about emigrating because of it. That weather though...

I think it doesn't go quite far enough though. I still believe a universal basic income would do better and be easier to administer.


How do people qualify for this support?


It's essentially universal. They simply subtract income from it. About $10k/y. If you earn more than $150/mo, the excess is subtracted. The bureaucratic process is submitting an online form. Property owned (land, house, shares, w/e) may reduce it, I don't know, I don't own anything of substantial value.
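
In code, my simplified understanding of the top-up rule looks like this (the real rules have more cases: rent caps, assets, and so on):

    def monthly_benefit(income, base=10_000 / 12, disregard=150):
        excess = max(0, income - disregard)  # income above $150/mo is subtracted
        return max(0, base - excess)

    print(round(monthly_benefit(0)))    # 833: full benefit with no income
    print(round(monthly_benefit(400)))  # 583: reduced by the $250 excess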


Can you elaborate a little on how this works? I have many questions haha: how they verify someone is doing art, how it's not the same as basic unemployment, whether there is a maximum number of people who can be on this programme, and if so how qualification works, etc. It sounds very interesting.


> how they verify someone is doing art

Sorry if I framed it confusingly, this isn't only for artists, it's for everyone. They don't care what you do, only how much money you make.

> how it's not the same as a basic unemployment

It pretty much is, though unemployment has more strings attached. Anything you get from unemployment is subtracted from social security, and vice versa. For unemployment the online form is simpler and money comes much quicker, but you have to attend unemployment office events and stuff.

> whether there is a maximum amount of people who can be on this programme

The true maximum is of course however much the system as a whole can support, through taxing people's income. My back-of-the-napkin math puts the cost of the whole program at about $1/mo per taxpayer. So all artists and bums combined. Might not include basic retirement, which is slightly less.

> how qualification works

It's based on income/assets. If you make less than $10k/y plus $150/mo, the program covers the difference for living expenses.

In practice, there are limits to how much they're willing to pay for things like rent and utilities. In my city the max rent is $560/mo. In this winter's coldest month they complained that my electricity bill was too high (electric heating), but still paid it in full when I complained back.

To qualify, you have to attach to your application all of your bank accounts' statements from the past 3 months, so they can see how much money you've got. When you get a bill, you send the bill and they cover it.

The social security program covers: rent, water, electricity, home insurance, Internet, healthcare, moving van, security deposit of new apartment. On top of that you get about $15/day for food and other stuff like clothes, cleaning products, what have you. In my experience food costs some $10/day to live comfortably so some $150/mo remains for everything else. Second hand clothing stores sell decent clothes for a few dollars a piece, sometimes less.

You can request additional funds for infrequent, dire needs, like a new bed, vacuum cleaner, etc.

Feel free to ask more questions, I'm happy to elaborate :)


>Sorry if I framed it confusingly, this isn't only for artists, it's for everyone.

Ah, OK, that was what I was wondering. Some social security schemes for artists that I am aware of require proof that you are in fact an artist, which can be problematic.


By historical standards, most of what is sold today in developed countries is "surplus". The share of GDP that goes to art and entertainment is not constant as your historical comparison suggests. It is growing and will continue to grow. It will eventually outgrow most other industries.

Wherever you stand on copyrights, it would be a mistake to underestimate the central importance of this issue going forward.


Yes, we have a lot of things supported by surplus. All research is one example.

I don't think my comment implies that at all. I'm not convinced it'll ever grow that large though. See: Content is not king [1] for a good explanation of why.

I think it's natural that it grows now to levels never seen before.

What am I underestimating? I agree it is an important issue; my comment is orthogonal to it though.

[1] https://firstmonday.org/article/view/833/742


>It is a luxury career supported exclusively by surplus.

What is your career, precisely?


I mean no aspersions when I say that. For example, all research would fall into that category as well. So does most of tech, but not all of it.

My point was that it's a career whose demand is dictated by fluctuations in the economy and trends. And can be almost completely shut down (in theory) if the situation dictates.


>Lack of IP protection in our culture has made it impossible for 99.9% of artists to thrive economically.

Can you explain this position? My understanding is most artists don't thrive economically because there's not much demand for the art they make. I'm not sure that's correct, but "lack of IP protections" seems even less likely for most artists. What protections do you think would help?

It seems to me that the current system primarily benefits corporations who acquire a vast library of IP and can afford to legally defend it all as necessary.

I agree that a different set of policies would result in more art being created and more artists who are able to support themselves doing art, but my immediate assumption of what that would look like is more like funding art educations and exhibitions (of various kinds).


In Japan, companies can't buy the copyright from the author; obviously this gives artists an advantage.


And that may be a bit helpful, but still everything I've heard indicates that drawing manga is absolutely hellish for one's body and mind. Look at the famous schedule:

https://qph.cf2.quoracdn.net/main-qimg-53ec1e395a4b59adb8391...

Yeah, the law may be more on their side, but overall it seems unless you're somebody on the level of Akira Toriyama you likely barely managed to retain your sanity while cranking out your world famous works, and even the famous ones suffer greatly during the process.


Yes, but this can make sure they get their share of the money; think of how novelists have better conditions.


Historically artists had a real job to fund their art hobby.

The US has cultural poverty because it has decided to support litigation machines.


Historically, artists were independently wealthy (as were early scientists) or they lived off the patronage of the wealthy. Intellectual property and copyright laws allowed art to be a viable commercial venture without direct patronage.


Well, no, I mean if you want to get specific, IP and copyright (copyright specifically) created a structure for government to register and track written output. Our current conception of "the artist" is relatively new, and patronage models/gift economies strike me as...well, still pretty relevant despite IP. People seem to be focusing on the Artist and not the artist's intermediaries (publishers, for instance) and that IP was also meant to protect and promote industry, for which it has been successful (maybe too much).

I mean a good thought experiment here would be to replace "artists" with "entrepreneurs" or "founders" ("historically, [...] were independently wealthy" would expose some of the myths we attach to the idea of "self-made" which we rarely attach to artists and authors) here and rethink the history of Western commerce from that perspective.


> Lack of IP protection in our culture has made it impossible for 99.9% of artists to thrive economically.

You think we should have 1000x more artists than we do? I think there's another economic problem with that idea...

> The US has chosen relative cultural poverty compared to other cultures that find non-market mechanisms to support artists.

The USA has the largest market in the world for creative products and the greatest number of rich artists.

> My work is no AI’s whetstone.

Are you the same way with juniors? "I paid dearly to learn this technique - you should too!"

Aside from overtraining issues, the AI can't store your work any more than you can store representations of everything you've trained on; it's vastly smaller than the sum of its training data. It distills out features and their combinations.

Some big-name artist is upset because he thinks he's the first one to put certain bat and lizard features on a dragon and that he now owns that entire sort of creature. It turns out, though, that given an old picture of a dragon and that single sentence of mine, he could be copied by almost anyone. The only way to keep the AI from "copying" his work is to make sure that, even if it is not trained on his work, nobody asks it for those features. To satisfy these people it'll have to have a big red sign that says "Dragons are off limits, Bob owns them because you might put claws on the wings!".


You are in favor of hundred-plus-year copyright so something your brother created in his 20s can support generations after his death. And society should grant this because it will encourage more works from him?

The point of copyright is to encourage more work. At this point he is rent seeking.


Good point.

Every entrepreneur should also be forced to sell their business instead of passing that equity to their children. No more rent-seeking, multi-generational family businesses.

Matter of fact, let’s force the same on anyone with a 401k, IRA, or Roth as well because at the end of the day it’s all equity, right?

All equity stakes must be converted to cash upon death and either returned to the state OR given to charity.

That’s the way to make “Single serving wealth” happen and put an end to rent-seeking.


Man, that would be great. Imagine how society could change when you cannot accrue an outsized wealth via inheritance, and instead everyone can have food, shelter, healthcare, and more. Maybe people would do things to help humanity rather than just their own interests.


You are being sarcastic, but in reality, yes, many family businesses are absolutely multi-generational rent-seeking schemes that solely exist to extract capital from people.

That said, your comparison to forced confiscation of retirement plans by the state or charity is very strange, because it's not as if on your brother's death the state would come in and take every cent he's ever made as a producer of IP.

In fact, the state literally does the opposite, and gives your brother's children and grandchildren the entire backing of both the civil and criminal justice systems. This is why we spent the 90's and 2000's sitting through VHS tapes telling us that unauthorized copying will be met with an FBI investigation, $250k in fines, and 5 years in prison.


It's not quite the same. When the children inherit a business, they usually have to do real work to keep it. Incompetence eventually destroys it. When the children do it right, a business keeps on existing and creating some sort of benefit to society -- it keeps on doing whatever it does, employs people, etc.

Meanwhile there's no effort involved in collecting royalties from your father's work, and no social benefits are provided by it.


If all of the inherited wealth upon death was put into the education system, and only education, then yes I would back this 100%. It would be a way to maintain level playing fields, and allow entrepreneurs to congratulate themselves for creating a better society.


Maybe not all; a progressive tax should apply: consider small multigenerational family-only (or mostly family) businesses! (Note that when none of the children are available, they already often have trouble finding buyers they can trust.)


If you take out the sarcastic "all" implied in your tweet it's not really that far off from what has been suggested many times over with regards to a wealth tax, no?


This seems sort of like whataboutism.

There's nothing stopping an extremely successful and wealthy artist from handing down the money he made to his children. If they want to use that money to continue doing artwork in his style, using everything they have learned from him, there's nothing stopping them.


> His images have been bootlegged endlessly by Asian companies placing them on t-shirts and selling them for $5 on Etsy and every other marketplace.

And why not, really? How does that affect your brother? Why should your brother have a say in what people can or cannot put on a t-shirt? Creating pieces of art professionally should be paid for. But having some weird right of ownership over what is essentially just a huge number is, frankly, a crutch at best and outright criminal at worst. Digital artwork can be copied at zero cost and artists have to deal with that; the world does not have to bend to one particular model of generating revenue for artists.

We need to stop with these schemes at some point and shift to a better market, or we will effectively have to cede all control of data and computing to copyright-control companies.


At one extreme you have Disney corp squeezing money out of copyrighted material they created/bought/licensed. On the other extreme, without any copyright, you have Disney corp squeezing money out of any material they can find. Copyright is a moat that works both ways. Without copyright, profit from art goes to the most efficient t-shirt screen printer.

What's the happy middle ground?


> On the other extreme, without any copyright, you have Disney corp squeezing money out of any material they can find.

Of course they will. But so can others. Without copyright Disney won't have a legally enforced monopoly.


Be careful what you optimize for. Clearly the thing you actually object to is Disney squeezing money out of things. Maybe just make that part illegal, and forget the copyright?


Why a middle ground, though?


So does the same logic apply to, e.g., MS Office software?

It too can be copied for zero cost.

Why should MS have a say in what people can or cannot put on their hard drive (well, SSD)?


I'm not sure what the solution is, but I do feel IP protection is important.

If we were to get rid of it entirely, then your success in the market would largely depend on what resources you already have. Large companies would have the upper hand as they already have the user base, along with the marketing budget to sink you (not that the current situation differs all that much).

If your argument is "that's the way it should be, let the market decide who succeeds" then that's a whole different conversation.

Either way I don't see an obvious solution that doesn't benefit the very companies you say we need to stop, though I agree on the sentiment.


Well, imagine you're selling your merchandise on a marketplace, but then some other company comes around, takes your product design, and then sells a product very similar to yours at a slightly lower price. Even if they get a slightly lower profit margin due to undercutting, they still capture that profit, while you get nothing.

And then, imagine a company arrives that offers a product, let's just say imagery, artwork, at the low, low price of 'free'. Not just "free copying of existing artwork", but complete "professional creation", for free. It's real hard to compete with 'free'. You may have been selling your artwork for some price, but when a 'free' alternative comes around, what do you do? How do you solve that "challenge" on the market? Tough shit. And also, since it's 'free', it's also 'whatever price they want to set', so they can undercut you at any price point, because they can just make up any price point: capturing customers, capturing profit, while you just get displaced in the market. Good luck trying to explain to people that "creating pieces of art should be paid for" and that you should be paid, when people can just get art for free, or at a low, low price that undercuts you.

See the problem that arises here? This is what those annoying art people have been trying to argue about.


Do you feel the same about all software?


> have allowed him to move out of my parents, get a place of his own, pay down debt and buy a house

> Intellectual property is how he takes care of his family.

Art is how he makes money. Money is how he takes care of his family. People have been making these things forever -- they are older than human language itself! "Intellectual property" otoh is a relatively new idea, and so can't have everything to do with it.

To think "Intellectual Property" laws are corrupt and benefit the already-rich should not be incompatible with saying some copyright term might be good: If copyright terms were half what they are, your brothers' story would be exactly the same, but Disney's would be a lot different!


This is one of the earliest arguments for copyright protection: "Look at the poor grandchild of [famous author], begging in the streets. If we had attached exclusionary rights to his works, she would benefit from his works!"

I'm not saying it's a good argument (it's sentimental, for starters, not that OP's is much different - "proprietors" didn't invent IP law, and Disney is apparently giving its pound of flesh back to the powers that helped it game the system), but I'm pretty sure the same one was used in defenses of generative work and, notably, NFTs - "this will help the poor artist, who would otherwise have no market." Rogers v Koons put paid to that argument (indirectly), and I'm hoping the current class actions succeed, your brother notwithstanding.


> It wasn’t until he was 40 that his artwork started to be noticed and worth something.

Does the system really work if he had to invest in himself for so long to even start having hopes of sustaining himself with his creative output?


I agree. Intellectual property is not so different from any other kind of property. At the end of the day, everything is intellectual property; your ownership of anything is merely a record written on some piece of paper or stored in some computers.

If we didn't assign value to abstract concepts, ownership contracts for tangible assets would only be worth the paper they're written on. Abstract ideas are inherently valuable.


"your ownership of anything is merely a record written on some piece of paper"

This is a very young/naive way of seeing it. As the Spanish saying duly notes, "physically having something is 9 parts out of 10 of the law for ownership" ("La posesión es nueve décimas partes de la ley"). Meaning ownership is not so much what is on paper as it is having something, claiming it's yours, and no one else claiming it's theirs, basically (not always, but that's why the saying says 9 out of 10).

https://primaveraeuropea.eu/la-posesion-es-nueve-decimas-par...


> As the Spanish saying duly notes, "physically having something is 9 parts out of 10 of the law for ownership" ("La posesión es nueve décimas partes de la ley")

This is already an English saying. "Possession is nine tenths of the law."


This is clearly not true. There are term limits on how long society is disallowed from using an idea. If I take good care of it, my table does not disappear in 5 years, nor can I sell the same table ten times. The bigger issue with IP is how it is used to prevent others in society from using the idea they see right in front of their eyes.


I agree that copyright is not bad in concept. But current copyright law is, in my opinion, just insane and harmful and seems to be getting worse over time.


I don't think it's our job as a society to be bankrolling the bets of everyone who believes themselves to be talented (and lucky!) enough for their works to be worth anything.

For every case like your brother's, how many people out there are throwing their lives away chasing the illusion that they will ever "make it"?


In what way is copyright "bankrolling the bets"?


Publishing is generally speculative, and the common perception of "the author" as an ordinary individual with an interesting expression of an idea is inaccurate. A more useful analogue is "student who has just taken out a loan." Non-dominant publishers, meanwhile, bet that the large advances they do give (way too large) pan out into a property that can claw back some market share and cultural attention from the major franchises, which should in this metaphor be known less as copyright holders and more as "the house."

For the general expression of ideas in a corporate setting, copyright is an asset that is backed by a legal team. You're betting that a property gives you a competitive advantage.


Do you think it's reasonable to keep all the legal/political apparatus required for enforcement of copyright and IP laws just on the basis of "we need to protect the people who want to go to art school and hope that their art will be worth something one day?"

I don't.


No. I think we need to do that regardless of whether their art will or won't be worth something.

I don't think my open source stuff is ever going to be worth a penny. I still want it to be protected so that its license is respected. Mine, and everyone else's.


Copyright law does not stop bad actors. The fact that you (and I) wish it was respected does not change things in any practical way.


> Copyright law does not stop bad actors.

Yes it does.


Back in 2005, way before Huawei was known as a smartphone manufacturer, I was working at an R&D office focused on phone networks in Brazil.

I went to a meeting with one of our clients that was just signing with us. The main reason that he was contracting with us was, I quote, "the last time we went to China we found a Huawei rep making a demo of an application that worked exactly like ours - which was meant to run on Cisco gear. Even the terminal escape sequences to get to debugging mode were the same. The Chinese were stealing from us and there was absolutely nothing we could do about it."

Copyright law might be enough of a deterrent to make Joe Nobody pay $9.99 to Netflix instead of torrenting movies, and (on occasion) it may be enough to force some high-profile artist to give credit to some other lesser known musician who "served as inspiration". Outside of that, copyright law is as effective to "stop bad actors" as the "War on Drugs" is effective to stop organized crime and drug trafficking.


Copyright laws rely on enforcement, which China couldn't give a rat's ass about. Brazil, however, clearly does, and your employer at the time earned a new client because of copyright laws that will hopefully protect the client.


If that were true, people in Brazil wouldn't be pirating movies as much as they do, and there wouldn't be computer shops in downtown São Paulo selling cracked copies of Windows to small/medium businesses like they still do.


So following your logic we should decriminalize murders since, well, those still happen anyways.


I am for decriminalizing copyright infringement, yes, but congratulations on the midwit argument that equates copyright infringement with murder.


So following your logic we should decriminalize <everything> since, well, <everything> still happens anyways.


Can you please stop with the "so you are saying that..."? Not only is it poor form in a conversation, it makes you look like you are being dishonest or an idiot. Or both.

Let me spell it out: I think that the cost of trying to police "copyright violations" far outweighs any of the potential losses and damages that might be incurred by losing such "protections". This is not the case for murder. The cost of a life is incalculable, so any measure to prevent loss of life is justifiable.

Laws that try to protect citizens from physical harm and are meant to preserve human life are of a completely different nature than "copyright violation". It's perfectly reasonable to argue for decriminalization of copyright infringement without having to accept that anything should be decriminalized.

To go further: I see no moral justification for treating copyright infringement as a violation worthy of criminal prosecution. I believe that "intellectual property" is a misnomer and that current laws of this type are meant only to protect large corporations and are not made with the best interests of society in general in mind.

Is that enough for you, or are you going to continue pushing false slippery slopes?


> Is that enough for you, or are you going to continue pushing false slippery slopes?

You misunderstood. I was not trying to portray your argument as a slippery slope. I was trying to portray your argument as just generally weak, which it obviously was ("but china" is not even an argument). Given that you completely misunderstood my point, the rest of your comment doesn't even apply.

That you had to resort to subtly calling me an idiot says a lot.


My anecdote about Huawei "stealing IP" was just that, an anecdote. It was meant as a counterpoint against your "copyright stops bad actors", not as the basis of an argument against copyright in itself.

> That you had to resort to subtly calling me an idiot says a lot.

That you didn't mind me subtly calling you dishonest says even more, and shows that I'm done here.


The Mickey Mouse Protection Act actually extends copyright to more than 90 years after publication. It was imposed upon the rest of the world unilaterally. And only a few big corporations like Disney benefit from it.


True, didn't want to go into detail (multiple extensions) so just quoted the original.


Excuse me sir, are you blinded by silicon? Greedy is training ML models on the copyrighted data of hundreds of thousands of people, with far more power than Disney has!

Also, how many people are responsible for stuff like ChatGPT... 300? Getting the bulk of their training data from, again, whom?

It feels like the history of monopoly is reinventing itself, now with fancy computers replicating human behaviour.


That sounds horrible.

I guess the only solution is to get rid of copyright law so people don't exploit it so much.

Problem solved. Let's all stop protecting this, and allow anyone to innovate by making cool and new awesome things.


Sounds great, and when someone leaks the model online I can make a product on top of it without paying anyone.


What goes around comes around.


No it doesn't. Capital will always have workarounds.

How about: make the model so big you can only run it in a data center?


Great, I can run it on my botnet.


That unironically sounds great.


> "Intellectual property rights" is a concept made by greedy proprietors to build monopolies and keep competition out

Which is why all AI code and datasets are freely available and legally usable by anyone for any purpose?


It’s a concept at least as old as modern democracies and individual rights, and it’s likely not a coincidence.

That it’s been abused by corporate entities is another matter. The fact that some famous musicians sell their whole catalog to companies is also a sign of times changing again.

AI growth as it is happening here, based on preexisting work, is in direct confrontation with copyright. And something will have to give in some way. But the law will likely not catch up for at least another 10 years, the time needed for most people to really understand/experience the impact of the changes.


Not really; musicians used to copy, adapt, or mock other musicians' music in the Middle Ages, from what I've read. A famous example was Mozart transcribing the Vatican's "secret" song. Same with art: painters have always learned by imitation. History is full of "copyright infringements" (if you look at it through a modern lens), either benevolent, with apprentices learning from their masters, or adversarial, copying the competition.


True. It looks more like a complement than a counterargument?


IP laws are a double-edged sword because of the way advertising and human memory work. You cannot have IP laws and also a negligible impact on culture. You always get both, and people replicate what they know. The goal of IP (like many other things) is to eradicate itself.


> how does "life of the author + 50 years" incentivize the author to keep creating? It just incentivizes them to keep exploiting.

It is a necessary protection against moral hazard, and a way for artists who may well be poor during their lifetime to be able to leave something behind for their children.

In your world, if one wants to print Calvin-peeing-on-things t-shirts without having to deal with legal hurdles, all they'd have to do is shove Bill Watterson into traffic to force his creations into the public domain.

There is no point to creating anything if someone else can throw you under an actual bus to immediately claim rights to (and start profiting from) the years of time and effort you spent building the "brand." Forcing a delay of 50 years goes a long way toward discouraging this sort of literal piracy.

This is the premise of the current AI debate, only it's forgery at scale instead of manslaughter.


"Intellectual property rights is a concept made by greedy proprietors to build monopolies..."

If you had created anything other than internet forum opinions in your life you'd have a different perspective.


Do you often ascribe incompetence or stupidity to people who disagree with you? If so, you should stop, because it doesn't help your argument -- it makes you sound like a dismissive egoist who doesn't believe anyone who doesn't share the same values is qualified to have an opinion. It might make you feel better about your own stances, but it certainly doesn't make others not inclined to an elitist disposition take you seriously.


Did you really explain the concept of an insult to them? Do you really believe they did it without knowing what they were doing?


> Did you really explain the concept of an insult to them?

Yes.

> Do you really believe they did it without knowing what they were doing?

Of course not, but I also believe they thought it would be more effective at making their point than it was.

It is tough to respond to something that is crass and meant to be offensive without resorting to the same, so by using the 'annoyed teacher' style I can get my point across without crossing the line to snark or contemptuousness.


> I also believe they thought it would be more effective at making their point than it was.

I see. I suppose this is where we differ, I'm sure they don't care about making a point at all.


My open source software is downloaded millions of times per year, so I do think I qualify as "have created something people value". That's not an argument though, it's an ad-hominem from someone without an argument, so please try giving arguments.


It's a bullshit umbrella term anyway. There's copyright, there are patents/trade secrets, plus trademarks. Ain't no "property".


It is no secret that the majority of creators get close to nothing out of their copyrighted works.


Derivative training data is already part of the training regime: the earliest examples produced rotated images for more robust image recognition.
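
For example, a minimal sketch of that rotation trick (the helper name is mine, not from any particular library):

    from PIL import Image

    def augment(img: Image.Image, angles=(90, 180, 270)):
        # One labeled image becomes four training examples with the same label.
        return [img] + [img.rotate(a, expand=True) for a in angles]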

IP laws are legal constructs and subject to change like everything else, depending on our needs. When and if enough of us see the benefits from AI hugely outweighing the benefits of maintaining the IP framework, I'm pretty sure laws will be passed to make the needed adjustments.


See, until then though, you don't get to ignore the law as it is.

You see, you're part of the system, not outside it. If it were not for my deep, deep hatred of the farce we call IP law, I must admit, I'd seriously consider walking over to the other side; because I look at these AI folks scouring training data without so much as asking the original creator for consent, or even sparing a couple cycles to think about whether they should, and it really rankles me.

In fact, the entire damn computing industry is filled with people with some seriously flawed ideas around the concept and mechanics of the act of obtaining consent.


> because I look at these AI folks scouring training data without so much as asking the original creator for consent

Consent for… what exactly? Why is anyone entitled to some notion of ownership & power over who may and may not make use of information?

I understand to some degree practical arguments around recouping cost of development and distribution of information, but I don’t think anyone should really concern themselves ethically with what some random person wants them to do with the information. This isn’t like ensuring a sexual partner is an enthusiastic participant, this feels to me more like I’m not going to ask mommy for permission every time I want to play outside (let alone care if I’m told “no”). What practical defense is that anyways?

Maybe my understanding of your framing is off?


>Why is anyone entitled to some notion of ownership & power over who may and may not make use of information?

You mean like copyright holders? Patent holders? Somebody with a trade secret?

>but I don’t think anyone should really concern themselves ethically with what some random person wants them to do with the information. This isn’t like ensuring a sexual partner is an enthusiastic participant, this feels to me more like I’m not going to ask mommy for permission every time I want to play outside (let alone care if I’m told “no”). What practical defense is that anyways?

But yes, anyone should. You're just saying "I don't think people should disagree with me" not "information distribution and consent aren't ethical concerns" - of course they are! Why wouldn't you propose as counterexample a model like medicine's "informed consent" instead of sexual consent or "asking mommy"?


> You mean like copyright holders? Patent holders? Somebody with a trade secret?

These are legal fictions that exist because of the state's monopoly on violence. Like I said, I understand the motivation to enable e.g. book writers to seek profit for their work. That's about where my sympathy ends, especially when those mechanisms are abused for petty control by entities making billions. I hardly view them as ethical constructs; in fact I think the vast majority of restrictions based on these are unethical and immoral.

> Why wouldn't you propose as counterexample a model like medicine's "informed consent"

I think there’s a disconnect here. “Consent” in a sexual or medical context is ethically necessary due to the fact that the asker wants to perform acts directly upon/with the concerned party. Consent to play outside or make use of information does not involve the intimate invasion of one’s person, and really does not need to involve this other arbitrary party at all. This is what I was trying to express in my previous reply.

I don’t see why anyone is ethically entitled under these conditions to insert themselves as a governing authority of how people use information. It’s not even a realistic expectation to hold, information wants to be free and people will do what they want with it regardless.

> You're just saying "I don't think people should disagree with me"

I mean, we are talking about ethics here after all. The best I can do is disagree with your conception of consent and make my own arbitrary value judgement.


>These are legal fictions that exist because the state’s monopoly on violence.

Well, ok I get you there. But they're not really "fictions" when backed by the force of that violence. This critique makes sense, I just don't know if it really can ignore the fact that the legal regime applies in most cases.


>I don’t think anyone should really concern themselves ethically with what some random person wants them to do with the information.

Did you pay for the product? See, that's what you get under Rights of First Sale. If you pay for it, it's yours. Free and clear, no entanglements whatsoever. You still generally have to ask and negotiate a sale.

>This isn’t like ensuring a sexual partner is an enthusiastic participant.

I wasn't even going to bring it up, but since you opened the door... Someone has to bear significant cost to bring something into being (this is the part you claim to understand, so I'll hold you to your given); you come along and enter a business relationship with them (they are a supplier), set their compensation at a non-negotiable 0, and use their output as your input, to enable you to do things that they might not be at all comfortable contributing to, even indirectly; which, again, you don't even know either way, because you didn't ask. Just saying, you've got more similarity here than you may be comfortable with on further deconstruction. Can't blame you for not wanting to do it, but as someone who has, maybe you should reevaluate?

Besides which, why does everyone think "consent" only applies to sex anyway? Consent is a fundamental underpinning of a liberal government of, by, and for the people.

>This feels to me more like I’m not going to ask mommy for permission every time I want to play outside (let alone care if I’m told “no”).

Live somewhere in the midst of a warzone, political upheaval, or a high-crime area, and this could get you shot. If you were a child, where your mother has custodianship of you, she bears responsibility for your actions. That you, in that situation, would not care if she said no just sort of proves my point.

Stakes may be low, but there is a flow to the interaction. If consent were not an issue in our industry, "it is easier to ask forgiveness than permission" wouldn't be such a sticking point.

It's reflected everywhere. Even the innocuous places. Our biggest industry actors have made their fortunes on tracking people whether they wanted to be tracked or not. We have browser fingerprinting, which is the equivalent of "hold on a minute, need to figure out what your living room looks like so I can forward the description to all my buddies so we'll know who you are when we next see you." Except that isn't ever outright said or explained to the end user, because if it was, no one would ever consent. Which is why nobody asks and just does, which proves our industry has a problem with consent.

These are exactly the types of thing to get tied in ethical knots over, because damnit, we won't be here forever, and we're setting the example for down the line. While it'd be nice for our practices around IP to be less rigid, there is still a line to be drawn such that it is not okay to just run off with digital info because digital.

Every vice we maintain for ourselves individually will be abused collectively, likely to a far more destructive degree. Asking first is not a huge ordeal. Lord knows, we might get so tired of being asked that we finally figure out some default terms so that people can stop asking. That never happens if no one bothers to get the ball rolling though. You have to ask to get the iteration started.


Any automated process for generating training data is part of the model, not part of the training data.

Authentic human generated information has irreducible newness.


>Authentic human generated information has irreducible newness

I believe that point is open to debate.


Yep. Another commenter said that likes and clicks are enough for AI to generate new relevant information, not new content.

Think of how language changes every 1,000 years. Would AI adapt to that? What new training data would it require? Obviously it would require some new inputs; it would surely require internet-connected speakers of the developing language.

I think the “surely it would require new inputs” is enough to favour the irreducible newness point.

“Irreducible newness” may be just overhyped “changing relevancy” though, that’s what I’m stuck on.


AI couldn't have started any of the artistic and literary movements, which don't just blend between traditional styles but reject some aspect to forge something new. You can't get AI to generate cubism from training on Victorian art.


Why not?

Sure, a style like cubism is unlikely to appear from current AI art generation, but ultimately there's a human still in the loop. The human picks the direction the generation goes in and you can definitely see "styles" or even "characters" appear in certain types of prompts.

If you hang around people that do this a lot you'll see that some of them end up generating hundreds, if not thousands, of images of what seems to be the same character in varying situations. That indicates to me that you could potentially find a type of prompt that would lead to an art movement.

Hell, AI art itself might constitute an art movement.


So do you believe that human brains are somehow magic and don't follow the laws of physics and can't be simulated by Turing machines?


They’re talking about AI now, not some notional future where we can model a real nervous system.


You've set up a false dichotomy.

"Human brain is magic." <> "Human brain is not simulatable by Turing machines."

There are other possibilities. I'm sure you can think of a few.


I can't think of any. Can you help me?


I do. It’s not a very controversial belief except if you only train using Internet comments.


It's quite controversial since any magic element in the universe undermines fundamental assumptions.


Physics has plenty of holes. It’s a changing model of measurable quantities in our world. It doesn’t take a lot of brainpower to fit other systems either into the gaps in physics or into independent, unmeasurable areas.


Magic violates fundamental principles of our understanding of the world like causality or various conservation laws. You can't just squeeze that into the gaps.


If learning is not allowed, the robot overlords can move to countries not on the IP bandwagon, like Algeria, Argentina, Chile, China, India, Indonesia, Kuwait, Russia, Saudi Arabia, Ukraine or Venezuela. Those are on the priority watch list; there is also a normal watch list. Those are probably lax enough to allow a lot.

https://ustr.gov/sites/default/files/2019_Special_301_Report...


Literal shape-rotating?? I thought it was only the wordcels who were in trouble.


> The complete disregard for intellectual property rights

Those "rights" should not even exist to begin with. Nobody is interested in bending over backwards just to appease a bunch of aging monopolists. We want to explore the full potential of technology, not see it crippled by artificial limitations due to "concerns" over imaginary property.


How can anyone take seriously the notion that anyone who has written a book and seeks to earn money from people who want to read it is a "monopolist"?


Because ideas are not property.

Are mathematical theorems property? No, because someone who proved a theorem "owns" it just as much as someone who has studied the proof and learned it by heart. And what about independently discovered theorems? Who owns the idea in that case? Curiously, when we talk about science, where there is no money, the knowledge is ownerless; but when we talk about engineering, where there is money to be made, suddenly inventions are property.

The moment "your" idea travels from your brain to my brain it becomes "my" idea too. Just like a digital file that gets copied from one computer to another is the exact same file, bit for bit. It's senseless to talk about "original" in that case.


Legally, the expression of an idea is a form of property in the United States and most other nations (looking at you, Vatican City). Conceptually, the two questions ("ideas are/are not property") have been debated in the West for centuries. "We stand on the shoulders of giants" lost, and Fichte/Talfourd won. For better and worse, ideas are property.

>Curiously when we talk about science where there is no money, the knowledge is ownerless

USDA grant disbursements for ag science alone would make your eyes bleed.


Consider that copyright protects a presentation of a recipe but not the recipe itself, and you will realize that "ideas" are not what copyright protects.


Intellectual property and copyright are not the same thing. Also the presentation of something is an idea just like the something itself, so copyright is also a scam.


They are literally the same thing. "Intellectual property" comprises copyright, patent, trademark, and trade secrets.

>Also the presentation of something is an idea just like the something itself, so

This is exactly what copyright is. Copyright is based on the argument that the presentation of an idea is a unique correspondence to the idea, and constitutes a tangible item to which property value is attached. It's not intangible and ephemeral (an idea) or concrete and unremarkable (a piece of paper that the idea will be written on).


Sophistry. Intellectual work is indeed Work, and people who engage in it expect money for their service; otherwise, they'd rather not do it at all.


Nothing I said implies intellectual work should be free, unless you have a skewed idea of what selling your services or a product means.


Nothing wrong with wanting to get paid. They're monopolists because their means to that end is a literal monopoly. Created something? Government gives you an essentially eternal monopoly on it. In other words: if you're a copyright holder, you're a monopolist. It really is that simple. I suppose I'm also a monopolist.


That’s not true. You can just read something else.


I didn't say copyright holders had a monopoly on reading though. I said they had monopolies on "their" works.


Then the concept is meaningless. If we go with your definition, Toyota now has a "monopoly" on Toyota Camrys, even though there are quite a few comparable cars for sale from other manufacturers.


How much of the common culture is free from copyright, and how much belongs to megacorps? It's really a monopoly.


Quite a lot of books or characters I’d expect most people to recognize are in the public domain, to be honest.


Do you not have a monopoly on your work?


Can you really not imagine any kind of system where everyone can pay their bills and knowledge isn't guarded behind a paywall?


I can imagine various options but none of them would change the point I'm making.


That's an incredibly reductive view. Copyright protects me when I do any creative work, I do not want my creative work to be part of any AI training dataset without my explicit permission, and I want laws that force people training generative AI models to copyright clear all of their training data and moreover be able to show which inputs corresponded to which outputs so the result is auditable.

If Disney trains an AI good enough to just generate a good story then I think that everyone who fed training data into that model should get a piece of the revenue.


> If Disney trains an AI good enough to just generate a good story then I think that everyone who fed training data into that model should get a piece of the revenue.

So, if Disney decides to incorporate a sailing boat in their stories, does that mean that anyone who has made sailing what it is in our culture (i.e. anyone who has ever been on a boat, built a boat, talked about boats, etc.) should get a cut? If not, why not?


Reductive for you, maybe. I don't want this technology limited in literally any way. How many times must we go through this? Destroying perfectly good technology because it hurts the copyright industry's interests? I say it's time to end copyright instead. They're holding us back with these concerns.


This is going to be awkward for the incentive to generate these models which are protected by the same rights.


Not that awkward, open source models exist.


Open source uses legal IP protections and removing them would have significant ramifications to open-source licenses.


You're mixing different issues. Your exact quote was "This is going to be awkward for the incentive to generate these models which are protected by the same rights".

Except right now models are being released openly, with anyone being able to download and fine-tune the weights. Yes, they have licenses, but it's not like the licenses protect some core part of their business models. Whatever incentives they had to release those weights, they would mostly have the same incentives if copyright law disappeared overnight.


>If Disney trains an AI good enough to just generate a good story then I think that everyone who fed training data into that model should get a piece of the revenue.

Disney will just commission this art or figure out a neat trick to get artists to give them this permission. Then afterwards they'll turn around and license the model back to all those artists. And those artists are going to pay to remain competitive.

Congratulations, you've now created a monopoly or oligopoly. And the people who lose the most from it are the same artists that demanded this system, because free models will be illegal.


AI copies everything except copyright notices. Curious why that is?


Instead, we're bending over backwards just to appease a bunch of slightly younger monopolists.


Easy fix. Just make everyone publish their models. Now everyone can run it. We need to get this powerful technology running locally and unrestricted on our own computers. Not on some big tech corporation server with thought crime editors policing output.


I'm genuinely curious how you think artists should be compensated for their work.


Sponsorship and patronage. Who knows. As long as it's not this artificial scarcity illusion.


So we're going back to the feudal times where artists live in servitude arrangements with the propertied classes? It's pretty funny how artificial is a pejorative in your world when it precedes 'scarcity' but not when it precedes 'intelligence'.

Property rights are the basis for any society in which people have independence and autonomy. I think we should put software developers on sponsorship and donations for a year so that they understand the consequence of an economy like this.


>So we're going back to the feudal times where artists live in servitude arrangements with the propertied classes?

Most of what we see today as the greatest art of all time was produced under this arrangement, so if we look purely at the outcome for society, the arrangement can't be that bad.


> So we're going back to the feudal times where artists live in servitude arrangements with the propertied classes?

As if it's any different now. Wanna create something and need money to make it happen? Be prepared to enter into arrangements with rich investors and corporations and producers. The people with the money will literally get credited in your works.

> It's pretty funny how artificial is a pejorative in your world when it precedes 'scarcity' but not when it precedes 'intelligence'.

The "artificial" in intelligence is a technological limitation: it's not a real intelligence because we haven't been able to make one. The "artificial" in artificial scarcity is a condition humans impose on abundant resources in order to create an economy where there is none: we have the technology to copy and distribute these abundant resources but the monopolists won't have it.

So you bet it's pejorative.

> Property rights are the basis for any society in which people have independence and autonomy.

Sure. Real property like land and physical possessions. Not this imaginary property that only exists to let people turn abundance into scarcity.

> I think we should put software developers on sponsorship and donations for a year so that they understand the consequence of an economy like this.

I don't even disagree. Software developers are also creators that exploit their monopolies for gain. Personally I don't even believe in copyleft licenses, they only exist to fix copyright and turn it to our advantage. There should be no copyrights to fix in the first place.


>Sure. Real property like land and physical possessions. Not this imaginary property that only exists to let people turn abundance into scarcity

There is no difference between the two. All property is 'imaginary'. Turning abundance into scarcity is literally the only reason to have property rights at all, including in the physical world. Land isn't actually scarce in most places.

The reason we introduce scarcity and ownership for land as well as digital goods is exactly the same: it gives the owner an incentive to tend to it. It lets you capture the value of the improvements you add to it. The enclosure of the commons didn't happen because the commons were scarce; it happened to create conditions for their development rather than their abuse.

Software developers who don't own their code earn 800 dollars on Patreon while Microsoft makes millions off their labor. This is literally the economics of free software. It's the largest upwards distribution of wealth ever invented. Imaginary property, that is to say intellectual property, is the only thing that puts you on even ground with Microsoft. You want a world without it, where hardware is the only thing that matters? Good luck competing.


> There is no difference between the two. All property is 'imaginary'. Turning abundance into scarcity is literally the only reason to have property rights all the time, including in the physical world. Land isn't actually scarce in most places.

No way. Physical things are scarce. This is basic physics. You cannot duplicate energy or matter. Two physical bodies cannot occupy the same space. Physical property is not even in the same conceptual space as intellectual property. Land isn't scarce in most places? Really.

Exactly none of the properties I just mentioned apply to any form of intellectual property. You can duplicate bits infinitely at negligible costs.

> Software developers who don't own their code earn 800 dollars on patreon will Microsoft makes millions of their labor.

Microsoft makes a zillion dollars off of free software labor because those developers were tricked into giving up their rights while simultaneously allowing Microsoft to retain theirs. It makes no sense to give away code unless everyone does. I have three answers to this:

Abolish copyright straight up. Now everyone's on equal footing with Microsoft. No longer a crime to use their proprietary code when it inevitably leaks and even make money off of it.

License everything under AGPLv3. They can't exploit your free labor if they refuse to touch your free software code. This is the approach I ended up choosing because I don't have the power to enact the previous solution. I don't think I'll ever license anything under a "permissive" license ever again.

Simply don't publish the code. Nothing wrong with keeping it private, for one's own use only. I do this too.


The scarcity isn’t in the reproducibility of intellectual property. It’s in human motivation.


Yes. People are supposed to get paid for the labor of creating. Not the finished product.


Why would we pay people for the labor, not the product? I don't care if something took you 1 minute or 10 years; its value to me doesn't change (except in the case where the 1-minute version is reproduced a lot by you and its scarcity gives it value).


Because creators are scarce and finished products are not. It's simple.

We're in the 21st century, the age of information and networked computers, and these people are trying to sell bits. It makes no sense. If your work is making bits, you need to figure out a way to get paid for the labor of discovering those bits, not the bits themselves.

Because god knows how easy it is to copy those bits after they've been found.


We've already figured out a way to compensate creators for their work—it's called copyright. You say you don't like it, but you don't propose any alternative either, which makes you sound like an entitled brat.


Copyright is not necessarily the best way for creators to get compensated. For example, for most musicians in the US it's only a small part of their income [1]. The study is from 2013, so things could be different in the Spotify era, but it found that on average 12% of musicians' income came from copyright-related sources, or 22% if you count session recording. Top earners made a higher percentage from copyright. It isn't terrible, but it's not exactly the main way musicians are making money.

Writers make basically all of their money from copyright-related sources, but the median income from writing for full-time authors in the US is 20k [2].

My point is, copyright is a way to compensate creators for their work, but it's not the only way, and in practice for most people it doesn't do a stellar job.

[1]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2199058 [2]: https://authorsguild.org/news/authors-guild-survey-shows-dra...


I have issues with your interpretations of both papers. At a high level: the paper on authors says it's Amazon's monopoly rent-seeking that explains why writers don't get paid more. Music is also famously an industry where artists rarely own the copyright to their (recorded) music, instead selling it to cover production and advertising costs. Also, the musician study has a good deal of classical performers and teachers in the mix, skewing the sample. These are not professionals who interact with copyright in any meaningful sense, but they are also not professionals who would be classified as creators.


Of course, because music can be performed. People will pay to see a live performance, which means music performers aren't entirely reliant on selling recorded performances. Live readings, on the other hand, don't attract large crowds and therefore aren't a great source of income, making writers more dependent on IP rights. It's pretty obvious.


Oh please. I cared enough to propose some alternatives up above but the truth is I don't really need to fix their broken business model for them. How they get paid is literally their problem. If they insist on selling bits, they're free to seethe endlessly when people copy those bits with complete impunity. I'm done feeling sorry for these monopolists.


Universal basic income to begin with. So they can create if they feel like it instead of waiting tables to make ends meet.


They always were entertainment, so: concerts, performances etc.


People create stuff out of the sheer desire to create stuff. You only need to make sure they won't starve while they do it. And there's so many ways of ensuring that apart from paying them for copies of their output.


"Intellectual property rights" is a scam implemented by corrupt western cronies and is completely unfair to the people. No person owns the right to their "invention" because it is leeched off of other people's work. No corporation owns the right to whatever the people they employed came up with. The IP thing needs to be eradicated.

We need more countries to be strong and brave and say no to westerners forcing them to be unable to copy technologies... like China has done to some extent... but that might take too long.

The path where AI lawyers can sort IP disputes instantly and hence competitors can easily find out how to copy technology without violating laws would be another nail in the coffin of this corrupt practice.

> Nobody is going to "generate training data" if they get precisely nothing out of it.

The people who we don't want generating training data will not generate it... and that is a good thing. You have been generating too much crap in the name of innovation anyway. Real tangible useful innovation gets copied naturally in the marketplace in absence of corruption.


IP is the promise that you get the dues for your work, guaranteed by the government. It is what enables the entire "information economy". Without it, we'd soon need to weed out people that don't contribute to IP.


No one's going to need to generate training data, because image generators will simply set up and run their own artist pages, using LLMs to interact with their fans, and then weight their outputs by likes/clicks/web requests.

The idea that completely original "art" is needed to run image generators is a very temporary state of affairs that bootstraps the proof of concept, but it is in no way necessary to the process. The entire process can be adversarially trained solely against human aesthetic values. We know this because image generators are already a derivative technology of image classifiers, i.e. they are functionally adversarial against a given textual description in a classifier.

The real limitation right now is speed: once we can diffuse at > 1 frame-per-second, data collection can be run solely from the consumer side.
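
A minimal sketch of that feedback loop in Python, with entirely hypothetical engagement numbers and a made-up weighting formula:

    import random

    # Hypothetical engagement log for images the generator posted itself
    posts = [
        {"image": "gen_001.png", "likes": 120, "clicks": 900, "impressions": 5000},
        {"image": "gen_002.png", "likes": 3, "clicks": 40, "impressions": 5000},
    ]

    def engagement_weight(p):
        # Toy reward signal (an assumption, not a published formula):
        # a smoothed engagement rate, so popular outputs dominate the batch
        return (p["likes"] + 0.1 * p["clicks"] + 1) / (p["impressions"] + 10)

    weights = [engagement_weight(p) for p in posts]
    # Resample the generator's own outputs in proportion to audience approval
    next_training_batch = random.choices(posts, weights=weights, k=16)

The label here is audience behavior, not hand-made art, which is the point.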


> Nobody is going to "generate training data" if they get precisely nothing out of it.

How much did you get paid to write this comment?


People are happy to share their opinion, but don't expect people to write long, well-researched content about anything.

In my case, the content I write involves hours of research, and correspondence with a network of experts I have built up over 5 years. I constantly update and revise that content.

The website on which it's hosted was also developed with great attention over the same period.

I don't think that I'd keep doing it if I were deprived of revenue and of the kind feedback of readers. I certainly couldn't do it full time.


+1. There is no more incentive to post things on the internet if this is how things are going to be. If all the effort you put into something can be used to bolster an LLM that will then be used to make money off that info, no one would want to write. If I have a blog, I'd write for humans, not so some "Open" "AI" company or Micro$hit can come and claim my work as their own. People will sooner credit LLMs for being "smart" than acknowledge the upcoming massive Original Content vacuum.


If GPT could do the same research - reading a bunch of primary sources and condensing and summarizing the information - why wouldn't we want to have GPT do that rather than having a human do it?

We don't employ roomfuls of people to solve math equations by hand to calculate trajectories anymore, because the computer can do that faster.

If GPT can't do research and produce content with the same level of accuracy as you (which, so far, it can't), then surely people would be happy to continue compensating you to do so.


Sometimes I am the primary source. I got the information by calling various government offices and interviewing people. I write with a certain amount of specific context that an LLM does not have. To illustrate my point, an LLM might summarise legal texts, but only a lawyer can tell you the odds.

I tried ChatGPT. I asked it questions about the work I do, and it was about as correct as the cheapest copywriter. The writing was very good (if verbose), but the information was shallow but credible bullshit.

There is no efficient market that allocates money to the best products. In reality, the winner takes it all.

Therein lies the problem. If people find credible bullshit before valid information, they might stop there. It deprives them of valid information, and me of revenue. There is plenty of credible bullshit already. GPT just makes it cheaper to produce.


This is a self-correcting problem.

If people consume "credible bullshit" and can't tell the difference, without consequences, then your thoughtful work isn't as valuable as you thought.

If on the other hand people consume BS and it comes to bite them in the ass (through reduced socioeconomic status and, ultimately, reduced reproductive success), then competing approaches become viable.

We may not be around to experience it – a variation of "the market can stay irrational longer than you can stay solvent" – but you may find comfort in the fact that Nature has a way of optimizing away cruft. Most often through repurposing the obsolete-but-already-existing system for another function, and failing that, through atrophy and death.


> then your thoughtful work isn't as valuable as you thought.

Why is this type of dismissal so prevalent on HN? Do you really believe that?

In the real world, core infrastructure depends on code that does not raise enough money to buy groceries. Do you remember OpenSSL and Heartbleed? The entire internet paraphrases the same few nuggets of original research, with affiliate links slapped on top.

The market has been irrational for as long as I have been alive, and it's getting more irrational every year. When will the invisible hand of the free market correct things? Why do you insist that we live in a libertarian utopia where this statement holds any semblance of truth?


Do I believe that when others choose to not buy your work, and buy something else instead and are happy with their choice, your output wasn't as valuable to them as you maybe thought?

Yes, I do.

In fact, I see that statement as nearly tautological.

I also believe that throwing people in jail if they don't value the same things as you is wrong.

Both of which are largely orthogonal to my original point though – which is that determination of value is a self-correcting process over evolutionary time spans. But I understand you wanted to shake your fist at the sky, in these difficult times.

Same with your OSS reference – I happen to publish several open source packages, used by thousands of companies around the world, which in total bring me about grocery-level income (~€500/month, pre-tax). So what?


>Do I believe that when others choose to not buy your work, and buy something else instead and are happy with their choice, your output wasn't as valuable to them as you maybe thought?

If the "AI" community chose not to buy nicbou's work and chose to use materials in the Public Domain instead, that's fine. If the "AI" community chose not to buy nicbou's work and chose to use Radim's work instead under terms agreeable to Radim, that's fine.

But that's not what happens. What happens is the "AI" community chose not to buy nicbou's work and chose to use nicbou's work anyway. That's where the problem is.


Sounds a lot like the current setup is already pretty fucked when it comes to rewarding creators. So why limit people who are driven to create by restricting what they can and cannot use?

And just to be clear, people do write long, well-researched content for free all the time.

I'm all for holding Microsoft et al. to the same copyright laws they expect others to follow when it comes to their own works, but ultimately it would be in society's interest not to limit creators. Sure, maybe there are areas where we need to find ways to encourage creators, but let's not do that by limiting everyone else.


Haha, great rebuttal.

The reward was mere internet points and the hardwired gratification that comes from social interaction.

People sometimes go to great lengths for non-monetary rewards.


Their compensation was a feeling of contributing and being part of a community, probably with some degree of self-validation as they read and agreed with their own opinion after posting it. Later they might even get paid with some upvotes granting additional affirmation.

Just like us, really.


Just make sure you compensate them if any of their content is used to train your wetware neural network and becomes part of any future text it generates.


> Nobody is going to "generate training data" if they get precisely nothing out of it.

Everyone that has a job and gets paid for it generates (with the appropriate monitoring) some kind of data that could be training data in the course of doing their job. (And doesn’t raise the issue of “if they get precisely nothing out of it”.)

Once no humans have jobs because AIs are better at them (should that occur), then the AIs themselves will be the best source of training data for the next generation.

There are many problems to address with AI; “Where is training data going to come from,” however, is nowhere on the list. It is a distraction from real issues.


> Nobody is going to "generate training data" if they get precisely nothing out of it.

It's using common crawl and wikipedia among other sources. The original purpose is something else.

Another anecdote that I find amusing is that Google Translate was trained on EU legal texts (among other sources), which are (generally) translated into 26/27 languages of the EU members. A huge, multinational bureaucracy, generated the texts by hand, but Google gets all the credit. I'm guessing that's part of the reason that Chinese isn't so accurate compared with other translations.


Intellectual property rights shouldn't exist as a concept. Digital scarcity is a scam. Property is a function of the natural scarcity of physical matter, which doesn't apply to data. Rent seekers and power brokers unnecessarily brought this concept into the digital realm.


LLMs are not regurgitating copyrighted works. You're fundamentally missing what's happening here if you think that. They do the same thing you and I do: read the literature and come to a conclusion. To quote Wittgenstein, "My words are my world", i.e. the sum of the textual and verbal output I am capable of producing is me to the rest of the world. In the same way, LLM AIs are approaching something indiscernible from human intelligence.


This would be well and good if you couldn't occasionally ask ChatGPT for a response, then Google it and find a near-identical response elsewhere in what would otherwise have been copyrighted materials. Other AI tools are doing the same thing, and the most blatant is with art. If you honestly think ChatGPT etc. aren't training on copyrighted data and just rewriting it for tons of responses (and images), you're out of your mind. It is just incredibly difficult to prove, and the laws currently are very blurred.


Isn’t their point that humans also do that? Without it necessarily being called copyright infringement?

Lots of subtleties: fair use, what’s not fair use, etc?

It does seem like we’re in new territory, I can see merits on both sides and I guess we’ll figure it out via the courts / new laws over the next couple of years.


It is called copyright infringement. If I write a book and there's more than a few sentences in a row of content that's verbatim or almost verbatim the same as someone else's published writing (excluding intentional quotes/references), I do get in trouble.

You can't just go "computer did it" and handwave away existing IP law.


> there's more than a few sentences in a row of content that's verbatim/almost verbatim

Ok, now imagine that it isn't giving out verbatim results. Which is a lot of content.

Totally fine then, right?


> I do get in trouble.

Unless the original was a bestseller and you are a bestseller, you almost certainly will not. This obviously happens all the time, as most writers suck and do actually deliberately copy things and then self-publish. AI is actually now helping them not do that.


Except, humans aren't machines or tools. Our laws don't answer to some logical / mathematical ideal, they're built to address social realities. LLMs aren't citizens.


> This would be well and good if you couldn't occasionally ask ChatGPT for a response then Google it and find a near identical response elsewhere in what would have otherwise been copyright materials.

Likewise, someone could go on Hacker News to start an argument about intellectual property and half of the responses would be nearly identical to comments on Slashdot twenty years ago.


The key difference is that we know how LLMs work but we still haven’t solved the problem of consciousness, or even neuroscience.

You can’t equate something you understand with something you don’t understand. Something you can create with something you can’t.


>They do the same thing you and I do; read the literature and come to a conclusion.

No.


Very compelling argument you've made.


“Come to a conclusion” is probably wrong; “synthesize a response to a prompt based on the training data provided” is probably closer. Still, it’s what you and I would do when answering a question using information we’ve accumulated from reading various sources.


It definitely wasn't AI generated.


[flagged]


Sounds illogical to force a conversation and then withdraw with an argument about logic while dismissing every counterargument. :D


Seems you're getting unnecessarily hung up on the "how" of it.

Planes fly, but not like birds. Does it matter from an "object moved from point A to B through the air" point of view? Not really.

LLMs are kind of the same. It's obvious from even the most cursory examination that they're doing something quite different from statistical word mashing. Is it being achieved in the same way as your gray matter? No.

In a black-box sense, does it matter? Also no.


Considering the argument is irrationally applying human rights and rights of living beings to these computers running mildly complicated software, yes the "How" does matter a lot.


The concepts of lift and flight actually are the same for birds and planes and insects and drones and helicopters... or fish! And whales. And any other organism that uses buoyancy and lift to ‘fly’ through the ‘medium’.

Every form of intelligence we know of is operating in a fundamentally different environment to what these programs are doing.


"They do the same thing you and I do; read the literature and come to a conclusion."

I'm sorry for you if you can't do better than an LLM.


There will be tons of lawsuits in the next decade over the data AI is using to generate its responses. Some of the most blatant and easy to prove have been in image generation with them literally including blurred Getty logos with their output.


Are the blurry Getty logos superimposed on backdrops from Getty as well? I think what's happening here is that the AI is treating this feature like any other feature (table, nose, paperclip) and including it because it assumes the context is appropriate for it.

On the other hand, if the AI can be coaxed into producing a nearly complete Getty image with only minor modifications, then maybe there would be grounds for a copyright lawsuit.

I think AI detractors should focus on passing new laws. Copyright is not the right tool for this.


That's an interesting point, but I wonder whether Getty could sue on other grounds, such as trademark infringement.


Getty are going to be one of the biggest investors in the space if they're smart. Why pay photographers when you already have the dataset to create photos on demand?


>Nobody is going to "generate training data" if they get precisely nothing out of it.

"Before accessing your account, please solve this puzzle to verify you're not a robot."


"Your OpenAI free tier requires you to install the OpenAI app and accrue 30 OpenStat points for every 10000 tokens. Please note, you can only earn 1 OpenStat point every two hours. Please do not deny Location Access, Microphone Access, Camera Access or Storage Access permissions, or this program will stop working."


Funny, people invented, discovered, and created stuff even before "intellectual property rights". Those terms were created by distributors, not creators, of content.


People write open source software and get "nothing" out of it.

And people post stuff on the public internet and get "nothing" out of it.

It's not obvious to me that everyone who has shared something permissively is unhappy if someone else uses it for training data. In many cases, it's just loud people who want a new revenue stream, which is the problem with intellectual property generally: the concept breeds entitlement. Why should I retroactively profit from what I willingly shared just because somebody else found a use for it?


I have released works under copyleft licenses and am not OK with Microsoft using those for profit without respecting the license (e.g. without opening up the models derived from copyleft works), when Microsoft themselves have been going after others for copying their works when that benefited them. They shouldn't get to have it both ways: either we reform copyright or everyone should be required to follow it. If anything, large corporations should be subject to stricter enforcement, since they can already benefit more due to their size.


Let's say that I repair donated bicycles and give them away for free. It's a fun activity that does some good for the community.

Now, let's say that on my morning walk, I see someone selling a bike I gave them a week ago.

Is it fair to be offended by this?


There are two issues I see here. One, it's not analogous, because you're talking about physical property, which is not intellectual property. One might be offended because it means the person they gave the bike to doesn't value it and really just wants money; they could have given it directly to someone who really wanted a bike. That's irrelevant when something can be copied for free. It would be more like if you were really good at fixing bikes and showed someone how to be good at it too, and they used what they learned to open a paid bike repair shop.

Re the actual bike thing (which has nothing to do with the discussion) I wouldn't be offended if the person sold it. In the hypothetical example, I got the enjoyment of fixing the bike, apparently gave it to someone no strings attached, and he's found a way that maximizes the utility he gets out of it. What's the problem?


Copyright activists should not have pushed Mickey Mouse protection acts, sued fangame creators and the like. I have zero sympathy for copyright activists now. Copyright reform is long overdue.


People just side with it as a way to get what they want. Crazy how many artists whose portfolios are full of their own drawings of Disney IP are shouting that Disney should use IP law to stop Stable Diffusion.

The hypocrisy is completely lost on them.


Predictions on how it's going to come back to bite them?


It will not bite them back, because they will argue that what was put in the dataset was "easy" and not worthy of copyright anyway. Then, they'll tout how smart their "AI" is. Two birds with one stone.


5% chance the copyright industry will sue AI companies for verbatim reproduction of copyrighted works.

0.1% chance the Feds get involved and sue AI companies for criminal copyright infringement.


If you can induce ChatGPT to repeat a copyright work, I wonder if that’s definitive proof of copyright infringement.

It has no problem reproducing song lyrics for me, which I'm pretty sure are absolutely copyrighted. Interestingly, it'll start to reproduce a famous book for me, and then cut off in such a way as to suggest there's a hard filter to stop it proceeding.


This is entirely my thesis on why they'll win. Of course Google wishes they could ignore all attribution.

They are skirting the law just like Airbnb and Uber did. They offer a better form of consumption. This is disruption at its finest, and it will be in the courts forever.


> Nobody is going to "generate training data" if they get precisely nothing out of it.

This is trivially false. Eg, fanfiction exists.

Maybe monetizing art will become hard as cheap art becomes ubiquitous, but that doesn't mean people will stop creating stuff.


> Nobody is going to "generate training data" if they get precisely nothing out of it.

What? What do you call literally the entire training dataset of Stable Diffusion, DALL-E, GPT-3, and GPT-4? You think they compensated anyone whose work went into the dataset? Did you get paid for your comment? Because guess what: your comment will end up in the GPT-4.1 or GPT-5 model.

Training data will likely continue to include everything public on the internet. Not everyone will stop posting things online for free, therefore humans will still generate training data for models.


You know what people treated like shit do? They go find greener pastures to spend their lives in, taking any of their benefits to humanity with them.

Sure, random comments of insignificant value like these will probably continue being created. Valuable content, though? That's a different story.

"This is why we can't have nice things." is a saying that accurately reflects what could happen if the "AI" community doesn't stop behaving like uncivilized barbarians. Personally, I'd prefer to have nice things.


Why are you putting AI in quotes, out of curiosity?

My priors tell me that great art will continue to be made, both with and without these new tools; and that people will simply re-adjust to a new normal of sorts.

What are you suggesting exactly? That all the creators are so weak-willed that they simply stop creating? Or are you implying that they are going to all go live on some sort of "no-internet-allowed" island together?


>Why are you putting AI in quotes, out of curiosity?

Because it's not intelligent. It's definitely artificial, but it's not intelligent in any meaningful sense of the term. "Automatic Interpolation" or something similar would be a far more suitable use for the acronym.

>What are you suggesting exactly?

No one is going to spend time making something if the next guy immediately goes "You made this? ... I made this." like clockwork. Even more so if someone was making something professionally.

If we do see creators and inventors go away, it would be because of their own volition and desire to find greener pastures worthy of their time rather than "AI" usurping their jobs.


I recommend just calling it software, or something specific like "image generator", "text generator". It gets the point across well, in my experience.


> No one is going to spend time making something

Yes they will.

Not everyone is like you who will collapse into anger and outrage at the thought of other people creating new things based on your work.

I do not care if anyone uses any of my intellectual property that I have ever created.

In fact, I would be flattered if people use the stuff that I have built to them make newer cooler things.

I support anyone and everyone "stealing" any intellectual property that I have every created.


Good for you; unfortunately, you don't speak for everyone. Particularly those who create things for a living.


I didn't say that I speak for everyone.

Instead, I am refuting the completely false statement "No one is going to spend time making something" that I was responding to.

So great, you agree with me that yes there are people like me who will continue to produce intellectual property and release it to the public.

> Particularly those who create things for a living.

I create things for a living to the tune of hundreds of thousands of dollars a year.

So yes, I speak, at least partially, for those people, and am totally fine with people stealing my work.


Have you created much intellectual property?


Well I have been a software engineer for the last decade, at a major FAANG company so yes.


I guess I'm just confused about who specifically you're referring to. Is it the many current creators who happily post things on the internet for free? Or are you referring to just anyone who has copyrighted their work and posted it online?

If it's artists, I think you are underestimating the impact a "real" author has on a given work. We're all human, and art tends to be about capturing a shared experience. This means the viewer of the art can make assumptions about the author's humanity and (sometimes) connect with them emotionally in this regard. It's about communication _with people_ - not with an unfeeling language model that is very clearly trying to win a game of sorts.

So the artists incorporate these tools, some of them don't, new genres form, and new ideas are captured. All the artist has to do is assume they still have an audience (they will, because humans want to see art from other humans).

I think the biggest problem is that indeed, _other people_ will use these tools to trivialize the works of others and conceal the origin. But again, the artist can just ignore this world - as before. It's not like copyright law was actually ever protecting them anyways. Same argument applies to licensing on Github code; while you can get on a soapbox and complain about how your code is GPL licensed - that won't prevent hundreds of developers from copy and pasting it when they need a working solution.

The most pragmatic solution before, and now, was always "don't post things on the public internet for free if you don't want theft to occur". This is a lesson people keep having to learn since Napster.

In any case, if the end result is that artists do indeed start charging for their creations - then great! I don't really see that magically solving other issues (surplus of artists, shortage for the labor they would be good at), however.


>It's not like copyright law was actually ever protecting them anyways.

It does protect them; you can't use or incorporate an artist's products in public without their permission (aka a license).

>In any case, if the end result is that artists do indeed start charging for their creations - then great!

That's the thing: Artists are charging and the "AI" community doesn't give a shit.

Sooner or later, that house of "AI" cards is going to come crashing down, with no one to blame but the "AI" community themselves.

>while you can get on a soapbox and complain about how your code is GPL licensed - that won't prevent hundreds of developers from copy and pasting it when they need a working solution.

Exactly. Sooner or later, someone who's been maintaining that one single block holding up all of modern humanity is going to say "fuck this and fuck you all" and leave everything to come crashing down; no, we can't blame him either.

We've already seen creators disgruntled with insufficient compensation (putting aside whether their demands were just) taking their wares and leaving for greener pastures to the dismay of everyone else. Didn't need "AI" for that, either.


I certainly agree that there are people (you seem to be one of them) who will throw their hands up in the air and say "fuck all of this".

I think we are mostly in agreement, actually. It's just that you are indicating this will be true of the majority of creators, while I and the sibling commenter are seeing the world as it is, not as it should be. Realistically? I mean, come on - think about it. Hustlers will hustle, sure. But artistic expression and creation doesn't _only_ happen because of money - it happens because some people would rather die than do anything else.

This is what I'm talking about, and what the sibling comment is talking about. I put all of my code on GitHub. These are my creations. If I wanted to be compensated for them, I wouldn't have posted them online. I use permissive licenses and don't need any credit for my work. This is humility: there is nothing special about my work per se. I mean, sure, maybe there is, but it isn't anything that only I in the world could pull off. Further, by putting the code online and giving it away, you are democratizing a creation that shouldn't really be _owned_ by anyone, in my opinion.

I gather you don't share that view - but you should know there are a lot of us out there. It's called having passion; and I don't know of any other way to accomplish things at work or for free.


>I gather you don't share that view - but you should know there are a lot of us out there.

Releasing something free-as-in-libre as a specific concept is irrelevant; I would publish certain things free-as-in-libre, others free-as-in-beer but with certain rights reserved, others commercially with certain rights reserved, others with all rights reserved. It depends on what it is I'm doing.

Insisting on releasing everything free-as-in-libre just because it's libre is, as far as I'm concerned, nonsensical. Licensing is a tool to be used appropriately, like any other tool; it's a means to an end and not the end itself.

>It's called having passion

I argue there is nothing as demotivating as having your passion denied by others shamelessly taking what you made and doing whatever with it without a single care for you, the creator.

Abiding by licenses is a matter of respecting the fact that someone, somewhere took their time to make something. The least you could do is respect their wishes, be they libre or closed, free or commercial, or anywhere in-between.

I'm willing to bet that a majority of the artists lambasting "AI" stealing and abusing their creations for training materials would be happy to provide training materials if they were respected and compensated for their time instead of getting treated like free real estate.


> I argue there is nothing as demotivating as having your passion denied by others shamelessly taking what you made and doing whatever with it without a single care for you, the creator.

How is that "having your passion denied"? You can just ignore it, and then you're in the same position you would be in if they hadn't done it, which seems to be the alternative you have in mind anyway.


> Realistically? I mean come on - think about it. Hustlers will hustle, sure. But artistic expression and creation doesn't _only_ happen because of money - it happens because some people would rather die than do anything else.

> I mean, sure, maybe there is - but it isn't anything that I am the only in the world to pull off. Further, by putting the code online and giving it away - you are democratizing a creation that shouldn't really by _owned_ by anyone, in my opinion.

That just sounds like slavery with extra steps.

Also, art doesn't just come out of an LLM; it depends on the times and conditions of the artist. "AI" models suffocate all of it.


By all means. Let them take their nice things and go away if they must. We'll be right here on the other side freely exploring the limits of technology.


Soon we'll get content created for GPT, introducing a new level of spam.


I worry (and I assume this is either dismissed or a core concern in the field, because it's so obvious) that the coming generations of models will use a lot of input from existing models.

Will we be chasing "untainted" input (pre-2022 data) just like we chase low-background steel (pre-WW2) for sensitive machinery? Should we be "tagging" generated data NOW when it's published in large chunks like images, long texts?


First thing I thought reading your comment was "this guy's repeating the low background steel analogy that was posted the other day". I don't know if that's actually true, and it doesn't matter; my point is that the internet is already an echo chamber. GPT is trained on Reddit and HN and other forums where people say the same stuff over and over again anyway. It's not like the pre-GPT internet is some pristine pool of new ideas. As long as someone somewhere is injecting some new stuff in, and they will be, it doesn't matter that there will be more automatically generated crap; there already was.


Ignoring the effects of volume is a common and fatal mistake. "This effect already exists" is not incompatible with "Multiplying the impact of this effect by orders of magnitude will cause it to become a massive problem."

Or, put another way: A slightly degraded signal-to-noise ratio isn't really a problem, but once it degrades far enough, the whole stream becomes impossible to decode, and worthless.


As I understand it, LLMs use temperature scaling to optimize the variability in the tokens they generate so the output appears "good", somewhere between getting stuck always saying the same thing and saying nonsense. [0]

I wonder if more output being used in the training data will result in a shift to higher temperatures over time to keep variety in the output.

[0] Discussion the other day: https://news.ycombinator.com/item?id=35131112
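
For reference, a minimal sketch of what temperature does at sampling time (toy logits; nothing here is specific to any particular model):

    import numpy as np

    def sample_token(logits, temperature=1.0):
        # Divide the logits by T before the softmax: T < 1 sharpens the
        # distribution (repetitive output), T > 1 flattens it (more variety,
        # eventually nonsense)
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

    token = sample_token([2.0, 1.0, 0.2], temperature=0.7)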


My Information Theory is rusty (and wasn't too strong to begin with), but didn't Shannon prove that we can decode a signal given arbitrarily bad noise, but that the more noise, the lower the channel capacity?
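
(If I remember right, that's the Shannon-Hartley theorem, C = B * log2(1 + S/N): capacity stays positive for any finite noise, but collapses as the signal-to-noise ratio falls. A toy check, with made-up numbers:)

    import math

    def shannon_capacity(bandwidth_hz, snr):
        # Shannon-Hartley: C = B * log2(1 + S/N), in bits per second
        return bandwidth_hz * math.log2(1 + snr)

    print(shannon_capacity(3000, 1000))  # ~29.9 kbit/s at 30 dB SNR
    print(shannon_capacity(3000, 0.1))   # ~413 bit/s once noise dominates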


Yep, pretty much. But therein lies the problem: filtering out garbage writing is something done by humans, whose limited time, attention, and effort all make for a pretty tight bottleneck in the first place. Degrade that channel capacity, and our ability to find texts of value becomes pretty close to nil.

edit: Also, when the noise comes in the form of competing signals rather than random noise, that becomes much, much more difficult, but we're also outside my area of expertise at this point.


I literally made a Tumblr called https://lowbackgroundsteel.ai the other day :-)


I love that you use the same steel analogy I came to. Data sets from pre-2023 are going to be useful in determining the validity of future data sets, if that is possible.


Why didn’t we do that before AI? If that’s useful now, surely it would be useful to know what knowledge pre-dates humans learning and repeating it?


Well, isn't this pretty much the definition that historians use for "primary sources"?


What's the use case for such validation?


Don Hertzfeldt's Simpsons couch gag seems like an end result of AI training AI. https://www.youtube.com/watch?v=6i2l-LQ-dXI


I mean, the singularity proponents see this as the holy grail, with the condition that the N+1 AI trainee is slightly smarter than the N AI trainer.


I work in a field where training data is extremely sparse, because labeled data is incredibly time-consuming and labor-intensive to come by. There are ways to automate labeling, but I don’t really trust those methods fully because I’ve seen how the sausage is made, so to speak.

Everybody and their dog is trying to do machine learning, but I question how useful or real the results are because I know that the data underlying the models is of questionable quality. I don’t know what’s next for the field, but it’ll probably be some high profile disaster.


It is a safer bet that someone will hook a neural net up to a camera on someone's hat, train from reality, then re-derive art from that. Then AIs will use the same methods as humans to maintain their grounding in reality.

The problem with the current crop of AI is they pretty clearly don't understand that they are taking a 2D projection of a 3D concept. That'll be fixed and there is a good chance it'll be fixed with video footage because that is how humans do it.


The immediate beneficiary of tagging non-AI-generated data would be OpenAI.


Sure. But it would be a lot easier to tag output like images as "AI generated" in a small handful of generators like Stable Diffusion and similar than to do the opposite and tell people to tag their non-AI-generated content as non-AI-generated.


Also people who want to make a quick buck by selling AI-generated data and marking it as non-AI. Sounds super cheap and high ROI.


I would expect that smart people generate data with GPT that is more plentiful and better than what smart people produce alone. I really don’t see the issue.

Especially since smart people + GPT + empirical data (think GPT plus surveys and sensors) will be a HUGE source of intelligence.


Will not-so-smart people be banned from using GPT?


Of course not! Only people who can't afford ChatGPT will be out of luck.


I'd agree if it weren't the case that ChatGPT is the fastest growing free AI service in the world. Indeed, I cannot think of a single worse example of "The future is already here, it's just not very evenly distributed": everyone who wants to use ChatGPT can. What more could you expect?


they will have to pay more. And I'm already broke :(


> Will not-so-smart people be banned from using GPT?

If we're consuming information filtered and noise reduced by very smart AI, how much does it matter?


I responded to (from the parent comment):

>> I would expect that smart people generate data with GPT [...]

Also:

> filtered and noise reduced by *very smart* AI

Are you sure you know what GPT is?


What I can only hope for is the natural selection of useful data over time. Garbage is all around; people generate it at mass scale with modern technology nowadays, but the useful portion may survive the test of time and prevail. Those using 'good' data over 'bad' can gain benefits, which could lead to the population that doesn't harm itself using good data and suppressing the bad. Mostly, of course.


Low-background steel is actually pre-nuke tests. That's what they scavenge sunken ships a lot for.


Yes, it should really be "pre-end-of-WW2" (the first nuclear test was in 1945).


there's active work on watermarking LM outputs. here's one example:

https://huggingface.co/spaces/tomg-group-umd/lm-watermarking
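For a flavor of how these schemes tend to work, here's a toy sketch of the "green list" idea (constants and hashing are made up for illustration; the linked implementation differs in its details):

    import hashlib
    import numpy as np

    VOCAB_SIZE = 50_000
    GREEN_FRACTION = 0.5  # fraction of the vocab favored at each step
    DELTA = 2.0           # logit bias added to green tokens

    def green_list(prev_token):
        # Pseudorandomly partition the vocab, seeded by the previous
        # token, so a detector knowing the scheme can recompute it.
        digest = hashlib.sha256(str(prev_token).encode()).hexdigest()
        rng = np.random.default_rng(int(digest, 16) % 2**32)
        return rng.random(VOCAB_SIZE) < GREEN_FRACTION

    def watermarked_logits(logits, prev_token):
        # Nudge generation toward the green list at every step.
        return logits + DELTA * green_list(prev_token)

    def detect(tokens):
        # Unwatermarked text hits the green list ~GREEN_FRACTION of the
        # time; return a z-score for the observed excess.
        hits = sum(green_list(p)[t] for p, t in zip(tokens, tokens[1:]))
        n = len(tokens) - 1
        expected = GREEN_FRACTION * n
        return (hits - expected) / (n * GREEN_FRACTION * (1 - GREEN_FRACTION)) ** 0.5

Because detection only needs the hashing scheme, not the model, anyone holding the key can test a text for the watermark.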


A lot of AI training data is bootstrapped from weaker models.


I am guessing in the future they would just run simulations to generate the untainted training data pre 2022.

...oh wait.


Lovely old science fiction story of a young man, taken from the classroom on his one day of 'school' at 18 when everybody gets 'taped' with their job knowledge. They say he's incompatible with the machines, and will have to become educated the old way!

He spends a life reading textbooks, studying with others like him and watching his cohort succeed and fail with their taped knowledge.

Ultimately it occurs to him, who wrote those textbooks? It was people like him. The people creating the training data.


That's Profession by Asimov. Great novella (which I discovered on HN many years ago :)) https://www.abelard.org/asimov.php


Seems unlikely to me. All this science fiction believes the human spirit is better at things. It's good storytelling, but is it wise? Probably not. Tech gets better. Textbooks generally do not. And textbooks generally aren't that great as they are now.


The original goal of AI was to understand and improve (make more systematic and predictable) the process of human thought and discovery.

Modern day AI tries to imitate humans w/o comprehension.

In some extreme cases AI researchers believe that imitation is all there is... but looking at how modern AI works, it definitely lacks the same ability we ascribe to human comprehension. So, even if there's only imitation, AI still isn't as good at it as humans are.


I find it pretty ridiculous that people are arguing these models act without comprehension.

Sentience is a whole can of worms and frankly a useless undefinable term imo. But comprehension? Please. These models comprehend concepts just fine.


Daniel Dennett is a philosopher who made it sort of his favorite subject: competence without comprehension. I won't attempt to explain this better than he does.

But the general idea is that these models are more elaborate versions of ELIZA, while not being different in principle. They work like my child: when I quiz him on the contents of a page of a bed-time story I've just read him, he will try to come up with plausible answers, but I can tell whether he was actually paying attention / knew the words or wasn't / didn't.

Now, this is something I don't know whether Dennett would claim, so, it's my thought, not his. I believe that a fundamental difference is in motivation / value judgement. I.e. humans do things for a reason that they constructed based on what they perceive to be more valuable to them. Models aren't built to have values or wishes. They don't want to be able to answer more "why?" questions about whatever they "say".

I mean, your reflection in the mirror is a very realistic counterfeit human being, and in many respects is indistinguishable from you, but it's nothing like a human being.


If you request a bot to write code that can accomplish a specified task, it needs to understand the task, understand the meaning of the code, and then write code that accomplishes the task. I don't care about appeals to sentience or what it means to "understand". Fundamentally, the meaning of these things must be embedded in a way that leads to a synthesis of new code that solves the problem. Comparisons to ELIZA are naive.

Too many philosophers waste time debating the definitions of words that they assert must represent intrinsic concepts. But they’re wrong. Sentience is just a word. It’s not well defined. You haven’t accomplished anything by finding more examples where its lack of definition is made more clear.

You could say the same about comprehension, but on some level the pedantry feels so obvious and unnecessary that it's just annoying to hear. Ffs. These models clearly embed a comprehension of many things.


> If you request a bot to write code that can accomplish a specified task, it needs to understand the task

No, it doesn't. It may simply be a coincidence. For example, I can write a bot that always outputs print('hello world'), and then ask it to write a hello world program in Python.

Comparisons to ELIZA aren't naive. They underscore the fact that more complex models of the same kind as GPT-3 use the same matching approach; they just have a bigger database of matches with more complex rules. They don't derive their answers from first principles. They don't have anything like more abstract concepts or models in any useful form. Which was the goal of AI all along! AI was the search for these models and concepts, so that we could automatically establish the truth of questions nobody knows the answer to. Models like GPT-3 don't know, and cannot possibly search for, the answers to questions nobody knows the answer to, because they are aggregators of existing knowledge.
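For anyone who hasn't seen it, ELIZA-style matching is literally a short list of regex rules with canned response templates. A toy version (whether LLMs are just a scaled-up form of this is, of course, exactly what's in dispute):

    import re

    RULES = [
        (r"I need (.*)", "Why do you need {0}?"),
        (r"I am (.*)", "How long have you been {0}?"),
        (r"(.*) mother(.*)", "Tell me more about your family."),
    ]

    def eliza(utterance):
        for pattern, template in RULES:
            m = re.match(pattern, utterance, re.IGNORECASE)
            if m:
                return template.format(*m.groups())
        return "Please go on."  # fallback when nothing matches

    print(eliza("I need a holiday"))  # -> Why do you need a holiday?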

> Too many philosophers waste time debating

I bet you aren't one of them though.


I find it pretty ridiculous that people are arguing that these models act with comprehension.

That is a much larger claim with a significantly greater burden of proof.


Most arguments about this are based on missing or vague definitions.

And coming to an agreement about these definitions is not simple. People start talking about qualia or philosophical zombies and the discussion goes to pot.


By what definition of comprehension would that be? I agree they are somewhat able to transmit knowledge though.


So is any fiber optic cable, so “AI” is at least as smart as that.


This would make a nice coming of age story but then the taping in question would have to occur some time during the character's childhood.


And companies that can generate the training data at scale will be the highest value ones.

These LLMs are the shiniest thing in AI/ML right now b/c the data they use for training is already freely available and massive in scale via the internet.

There's an entire universe of data that hasn't been collected, curated, and leveraged in the right way yet:

- The DNA sequences of all living things

- The DNA sequences of all of humanity

- The CAD files for all manufactured goods

- The motion data for humans working manual jobs

- The words spoken by multiple peoples across their entire lifetime


After playing around with Stable Diffusion for a while, I'm convinced that big media conglomerates like Disney and Warner Bros. Discovery are going to have a huge advantage here because they're in a position to pipe absolutely massive media libraries into models that are big enough to legally become "copies" of the involved works. While some of the angry rhetoric from artists whose work has been used to train models may be overblown, it's far from a stretch to suppose that companies like OpenAI and Stability AI, if they were being sloppy, could overfit to copyrighted training data to the point that the models and some outputs are infringing. I suspect they're aware of this risk.

For instance, the differences in consistency and detail between a batch of SD outputs using "American Gothic painting" vs. "Dark Side of the Moon album cover" are very striking; the former is reproduced with remarkable fidelity and consistency, while the latter was clearly observed but has a much more abstract relationship to its iconic source image. I don't think that difference is accidental; I think they went to some effort to ensure that only public domain images were so firmly encoded into the model.

That sort of thing is not a problem for entities that own the involved copyrights or have such a broad license that they might as well own them (and, even failing that, are in a strong position to demand such concessions from artists going forward).


These technologies aren't really going to help big rights holders nearly as much as they're going to help small players and indies. Big rights holders already have efficient pipelines, automation and well tuned art departments. They can already crank out movies that are so effects laden as to be visually overloading non-stop. These tools might let them achieve similar or slightly better quality for less, but it's not going to fundamentally change anything. For example, if Avatar 2 VFX cost 200 million, maybe it'd come down to 20 million, but the other production costs would still be significant.

On the other hand, small indies can now do respectable VFX without resorting to CGI or needing expensive makeup/props or green screens. They could film in a field with actors in Halloween costumes, then place them in locations built using simple geometry in Blender, and then AI will texture, style and light everything, modulate voices, etc. It'll be a game changer in terms of the types of films people will be able to make at the <100k price point.


I think you're understating how much big movie companies would want to save $180 million on VFX


Disney Animator REACTS to AI Animation - https://youtu.be/xm7BwEsdVbQ

This is Aaron Blaise ( https://en.wikipedia.org/wiki/Aaron_Blaise ) reacting to the Corridor Crew AI animation video.

(please pardon the copy of the auto-transcribe)

It concludes with:

> in the entertainment industry the animation industry the expressiveness the ability to create new films and new ideas has exploded and as an end result has created actually more jobs and budgets have actually gone up if you look at the budgets now compared to what they were back in the 90s they're doubled

> ...

> it's something that people want to go to and laugh and cry and be scared and be moved and and learn something from those are the things that people are going to go and watch and um if you can do that then the technology will follow the technology will create new jobs I'll guarantee you I've seen it happen I've been I've been doing this for 35 years and I've seen it happen I've gone through it a couple of times

> so at the end of the day do I think it's a threat? No I don't think it's a threat I think it's exciting I think it's really exciting so I'm just going to sit here I'm going to sit here and continue drawing on my hand-drawn animated short snow bear and I want you to go out embrace the technology embrace the things that are new see how you can express yourselves through them put some beauty back into the world and I'll talk to you next time thanks


Big media conglomerates might be able to pay to create enough art for their own models. Or at least figure out some scheme to do it, where the artists freely sign away this right. That model could then be licensed to others.

A website like YouTube could probably do this for voice or sound generation, if they haven't already. Offer creators an extra 5% of ad revenue and in exchange they get to mine the video for data for AI. Or some other such benefit.


To me the end game of this all is a self replicating machine. I'm watching this whole thing with curiosity, waiting for the first image model to be trained on schematics and circuit diagrams.

I want to be able to ask an AI to give me the schematics of the machine it runs on, with instructions on how to build that machine, by hand if need be.

And then I want to ask it how to build a humanoid robot that can do the same work, and I'll build the humanoid robot and let it build the machine that the AI runs on.

I'm not sure what happens then. I think that might be the end of the economy as we know it. Or perhaps the real start of an economy, and everything that we've experienced before then will have been just a crude approximation of what a real economic system could be.


And this is why we don't let people build the paperclip optimizer.

Making a machine that could do what you say above probably isn't that hard.

Making the machine that could do the above without destroying the world is impossible.


How are you going to stop me?


I mean, in general in the world if you live in a country with a rule of law and you said something like "Hey, I'm working on a project that's going to kill a lot of people" some nice little armed men would show up at your door and arrest you and put you in a concrete and iron cage.

But in the same sense I do agree, because that same group that controls the armed men will say to themselves "wow, we should build an AI to monitor everybody on the planet" and then we're in the same position where we've created the world eating monster, but now with even more authoritarianism.


the self replicating machines will optimize to become smaller and more efficient until everything turns into gray goo.


Perhaps. I'm more interested in using them to build large objects like rockets to facilitate asteroid mining and the construction of Von Neumann probes.


Just make sure you don't become disassembled for construction material.


Why would they do that for you? Don't you think that intelligent machines will have their own interests?


Perhaps because it'd want to survive and to multiply...?


How are you and your goals involved in that machine being able to survive and multiply? You are made of carbon which that machine might put to better use.


How have you not been hit by a car yet? Cars move very fast, and are made of steel while you move very slow and are made of carbon.


...are you suggesting cars can make conscious decisions? If not, then what exactly is your point?


Why do any of the machines that we make do anything?

Because we designed them that way.


I think it's good not to overthink this idea of a self-replicating machine. In the GPT-4 paper, the system was given access to some cloud APIs and was able to do things like requisition more resources from the cloud. I imagine the first stage of "self-replication" for these LLMs or their successors is going to be more like ordering CPUs from Amazon than setting up silicon fabs. You need to get far beyond single-human-level intelligence and effort to start building chips, and we are probably not even there yet!


Don't kid yourself that they will let the average joe have that ability.


I think the cat is out of the bag. Once people know that these things can be made, they will seek to do it themselves.

How do you think that 'they' will stop that?

At best they can do a holding pattern that keeps people a couple of years away, but I mean, what are they going to do, ban semiconductor fabrication?


Imagine people making weapons for themselves and the like by using AI. It will be regulated like hell.


How do you enforce the regulations? Are you going to invade the countries that chose to regulate it differently? What if they have nukes?

If people get away with illicit cannabis operations they can get away with illicit AI operations.


You think Stasi-like societal control is abolished, that it doesn't exist in Western countries in one way or another? That current tech development is not controlled, and that people are being fed utter BS like Moore's law so they would believe it? Just look at other developing countries and watch their output and you'll immediately see that the whole thing is controlled from the top down. These developing countries' populations are not 100% regulated/forced into a mold, thankfully, but they will be (hope they f up this status quo and behead those that force this onto the masses).

People snitch on other people just for the kicks of it, or because they can exert some sense of power out of it. So there is your control. What percent of the population is enough to regulate the rest, to trip someone early enough to derail his ambitions or projects that would rearrange things for the better? 1-2%?

Plus when strange things start to appear out of nowhere, like a talking dog or people with fluffy tails, well, then you're gonna have visitors.


I'm not sure you want to find out, but it probably goes something like taking your computers away from you, using government-controlled AI to monitor your movements, etc. Or restrictions on the classes of device that people can access, so a high-end graphics card will be treated like a weapon.

I actually can see this kind of thing happening sooner than we think, thanks to ideas like yours.

The government will team up with corporations and they will do the “research” and provide for us and we will be at their mercy with regards to technology we can access.

People will actually be asking for this pretty soon, once the general public gets wind of the dangers we are facing; it's really not hard to believe. Look at covid lockdowns and multiply that hysteria by 10x.

Nice dystopia…this might end a lot more like 1984 or blade runner rather than some crazy wild amazing sci-fi movie.

Edit: I actually think this is already in play, see America, Taiwan and Netherlands restricting China's access to high-end semi-conductor production. I think there you can actually see the writing is already on the wall and where it's going from here.


Restricting access to high-end semiconductors will only delay their progress a bit.

If strong AI is going to be the next nuclear weapon then no sanctions will stop other countries from developing it.

The USA could not even prevent North Korea from getting nukes. And the North Korean economy is ridiculously small.

If China, Pakistan, North Korea, Russia, Iran or India decide that development of strong AI is the key to achieve their strategic goals - what are the West going to do about it?


My opinion is that of all the countries you've listed, they're much less likely to give their populace access to such systems. So I think it's less of a problem, albeit a problem.

North Korea, for example, if they developed an AGI or whatever advanced AI, aren't going to give it to their citizens to play with, it's again, 1984 there already, and it will continue that way.


They might not give it to their citizens, but they would certainly allow their military industrial complex and spy agencies to use it.

Would be stupid to waste time arguing about AI ethics in the West while getting our political systems totally wrecked by advanced AIs of hostile nations.


Why on earth would other nations be ahead of the US just because you personally can't use ChatGPT to build killer nanobots? I don't get it.

The US government or military would just hire the best people to work behind closed doors?


Because the AIs that their citizens could access would still be more powerful than those in the West if there's no obsession with AI ethics and crap like that over there.

Even if the best stuff is kept for the military/spies.


Again, if the government can't even stop people from growing, distributing, and selling cannabis, how are they going to do the same with AI?


It would for sure be possible to restrict access to high end hardware if that’s what people wanted the government to do.

There is no NRA equivalent for graphics cards.


What if there's enough commodity hardware out there to do this and it's just a matter of algorithm optimization?


I think the biggest value would be products tracking users, just as it currently is with Google, Facebook, etc. Assuming AI could learn from low-quality data the way humans can, these companies have a huge dataset available to them, multiple GBs per living person, including (verified) human-written texts, search history, browsing history, video call logs/transcripts, translation transcripts, etc.

In the future, companies could even pay users for access to their keyboard and mic, to get data that is verified to be human.


That sounds like it would be ripe for abuse. If companies get $$$ for big samples, there is an incentive to fake it. And garbage in=garbage out. Quality of AI/ML models might even degrade if the "raw material" is contaminated. Pre-GPT 3.5 data will be really valuable.


What about the SRA or Genbank? Those are pretty readily collected DNA databases. You can get curation from uniprot or embl or pdb or kegg or rhea.


This comment read like the intro to some kind of dystopian movie.


He’s right. The rise of AI will mean that specialty, proprietary data will only become more valuable. The use of encrypted communication in place of social networks will also rise. Nobody is going to post anything on the internet for free in a place where it can get vacuumed up as training data and regurgitated, without credit or compensation.

I realize I am posting this on a public forum that is almost certainly being used as a training corpus, but I’ve mostly withdrawn from public social media at this point and I wouldn’t be surprised to see that happen with more people.


It's possible that people looking back will consider that the mistake was putting all the content online. Perhaps even upstream of that: the first mistake was digitizing things. The music industry certainly didn't realize when they adopted CDs that they were starting down the path to self destruction... the newspaper industry likewise didn't notice how profound taking their newsprint product and packaging it as HTML would be...

And now we're unleashing ML training on all that digital, online data. Which industries will discover that this is the thing that means putting your data online, digitally, was a mistake? Certainly artists are feeling it now... maybe programmers, too, a little.

So how do you put the genie back in the bottle? Live performances, with recording devices banned? Distribute written material only on physically printed media - but how to prevent scanning? Or just escalate the DRM war - material is available online, but only through proprietary apps on locked down platforms?

Or is this going to take regulation - new laws to protect copyrights in the face of ML training?

It wasn't always the case, that you could assume that if some information exists, it should show up in a single search. That's an expectation we invented only about 25 years ago. It's possible that the result of all this is that we figure out that we can't actually sustain the free sharing of information that makes that possible.

The problem is, to borrow a phrase: information wants to be free...


We’ve known for a long time that “information” doesn’t want to be free. The owners of platforms want you to not value your information, so that you give it away for free, and they can turn it into something they can sell. That was a hard, bitter lesson, and its so important for people to learn it, quickly.


> Nobody is going

Ah! The HN echo chamber. The world at large vastly does not care at all. Govs are banning TikTok apps on gov employees phones all over; does anyone care and use less TikTok? Only HN and probably many there are lying about it. Privacy and this type of content abuse prevention is valued by a handful of people unfortunately.


How valuable, in terms of general knowledge, is the free-to-scrape social media posts of people who don’t think about any of this?

I don’t think people are going to continue investing time and effort into giving away high quality knowledge on the internet just so Sam Altman can train an AI to repeat it, put it behind a paywall, and charge you for it.

Yes, people will still post memes. How useful is that as training data?


But, again, people on some specific subreddits know this, HN knows this, lobste.rs knows this, and have for months now; everyone keeps writing valuable material. Even more so now that many have become interested in AI.

It is also not only Sam Altman; many good people value this information, not limited to open source AI scientists. Let’s not build our lives around a few grifters, otherwise you might as well just quit and do up and flip old houses for money; that’s beyond the reach of AI for the foreseeable future and at least then you don’t have to deal with scammers and people who make society worse at scale.


"Nobody is going to post anything on the internet for free in a place where it can get vacuumed up as training data and regurgitated, without credit or compensation."

Virtually nobody's going to care enough to change their behavior.


People will start caring really fast when the job postings dry up


Yeah just like people in Detroit stopped buying Chinese made goods because all of the jobs went there and then all of the jobs came back because of it and everyone lived happily ever after.


> Yeah just like people in Detroit stopped buying Chinese made goods because all of the jobs went there and then all of the jobs came back because of it and everyone lived happily ever after.

What are you on about? The Detroit auto industry failed due to Japan producing a superior product.

> By the end of the 1970s, the Japanese automakers dominated the domestic producers in product quality ratings for every auto market segment, representing a formidable competitive advantage (National Academy of Engineering and National Research Council, 1982, p. 99). The quality gap between U.S.-produced cars and foreign cars was beyond dispute (Kwoka, 1984, p. 518) [1]

[1] https://core.ac.uk/download/pdf/6309861.pdf


No they won't.

Because it is a prisoner's dilemma.

If you personally stop posting, it has very little effect on the job market.

Because other people will still be posting this content.

Thus, you are at a disadvantage, even if in aggregate it hurts you for this content to be posted.


Doubt that the cause and effect will be clear enough to the general population for less posting on the internet to be the result.

More likely some sort of physical reaction, hopefully not a violent one.


I have seen this take on a few HN threads and I don't know where it comes from. People didn't stop reading books just because audiobooks got popular.

Why would people suddenly just turn to AI for all their entertainment? Part of the reason people love art is because they relate to it.

Like, in 2023 with the internet, blog posts, youtube, reddit, movies, etc, there's still enough demand for books that people buy them. Just because a new medium is created doesn't mean it suddenly just becomes the only medium.


People stopped reading handwritten books once the printing press became a thing. So yes, new technology sometimes does completely eliminate the old. We aren't sure where on the line this tech will fall, maybe a future LLM will be good enough to generate endless stories at high quality, or maybe they will always feel soulless, we don't know.


When I said "Medium" I meant the medium itself. A handwritten book and a typed book are both still a book. The printing press made books skyrocket in popularity because they became easier to produce.


Long form messaging like letters is close to dead due to the effect of telephone and SMS.

Internet eliminated the need to go to library to find information.

Hell, even books/writing changed the way people acquired and stored information and people no longer needed to be present in the university. Plato was against writing in a time when writing was just becoming accessible in a widespread way, and in just 2000 years we literally can't imagine a world without writing.


A lot of entertainment is already formulaic. Disney, Marvel etc.


Are those same people also going to stop speaking in public? Throughout human history, anyone can profit off of hearing something you said in public. You should be fearful of a timeline where people are compensated for whatever they post on the internet. I shudder to think what those discussions would look like


   > Throughout human history, anyone can profit off of hearing something you said in public.
I think the big difference here is that this will now be automated. This comment I'm writing right now is being "donated" to any company that wants Hacker News in their dataset, but they didn't even need to go through the work of reading my comment to use it; they just feed it in along with hundreds of terabytes of random text they get from the internet and apply heuristics/other models to filter the dataset. I feel somewhat uneasy about it, especially when writing code now. I don't think my code is good in any way, shape, or form, but there's a reason I license it as GPL: I don't want it used without being contributed back in some way; I write it for the greater good. License infringement was always a touchy subject, since it's quite hard to find license infringement in proprietary software, but now that an opaque black box is involved in the process, the people who write the software might not even know they violated a license.

Just throwing ideas, I'm not really in favor or against LLMs using public datasets. At least on one side, it levels the playing field among all participants. On the other side, however, big tech will always have an edge in training and testing these large language models. My current stance is to just wait and see what happens, and then react accordingly.


In less than 24 hours, there will be (conservatively) 100 different people independently making this very point in different corners of the internet. It's all mostly redundant.

Don't get tricked by the past decade or so's focus on the idea of "content"! This was all mostly a ploy by YouTube et al to get people to make more channels so they could sell more ads.

There is no real value to your sense of individuality in an economic sense, you shouldn't worry about it being exploited. Don't confuse the (beautiful) experience of your own interiority with a commodity.

But regardless of all that, I feel like this is an incredibly weird way to interpret what he is saying. It's not about the rising value of human-generated datasets; it's about the fact that there is still a necessary well of labor that all this stuff must pull from in order to eliminate other forms of labor. That is why he says "not everyone can use AI," not "proprietary data will become more valuable."

If it will be valuable, who will own this value? Probably not the same people generating and curating it, and those are precisely the people who can't use AI.

(And there are at least 1000 people making my very point this very moment..)


Is this similar to people moving away from giving their code as open source when companies take it, repackage, and sell it while the authors get nothing?


Most open source licences don't even prohibit selling apps that contain open source code.


A lot of open source licenses demand that if the licensed code is included in a derivative work, that new work has to carry the same license. GitHub Copilot is straightforwardly violating these terms in many cases, and I hope the pending class action lawsuit sets statutory boundaries around the inclusion of data in a training set, and that those boundaries are retroactive.


Github copilot needs to be updated then. To filter code by licence or something. In many cases using GPL code is perfectly fine.


> Nobody is going to post anything on the internet for free

If there’s one thing I refuse to believe people will ever stop doing, it’s posting.

I’m genuinely curious how much time you spend with non-techie people to come to the conclusion that your opinion comes close to representing the masses. Try to make this argument to a 16yo posting dances on TikTok or a boomer reposting poems about their grandkids.

“Don’t you understand?! The big tech companies are vacuuming up your posts as TRAINING DATA! It’s going to become part of the CORPUS! You have to be like me and WITHDRAW from public social media!”

“Yeah I don’t care lol”


I agree that proprietary data will become more valuable. It is, even today, mostly not accessible for AI training and holds so much value. We are working on Flower (https://flower.dev), which enables training AI on private data without the data owner having to share it.


Now even normal people have the incentive to act like troll bots and spam the public internet to pollute the training data.


This is kind of why I created https://lowbackgroundsteel.ai.


I have to say, I love the analogy you used for the name.


Good idea.

I'd ask for my stuff to be added to that, but companies would probably use your list to find such content and steal it.


Cool! Just curious (for bragging rights) if the name was influenced by my comment here, or just coincidence?

https://news.ycombinator.com/item?id=34960377


It was a coincidence. Sorry. I thought of this a few weeks ago and only just got round to making the site. Great minds etc.


> Great minds etc.

I'll take it. Thanks for responding.


>Great minds etc.

Never will this be used for a LLM.


What won't be used? The sources of data on the site?


I meant the phrase "Great minds think alike".


Why write your thoughts on the web when AI/GPT is only going to steal and paraphrase it? Nobody sees what you write and everybody thinks GPT is the genius.


Just saw something today where the wife of TotalBiscuit, who died of cancer several years ago, is contemplating deleting all of his Youtube videos[1] to prevent people from using A.I. to make him say terrible things.

Did give me a bit of a pause about putting stuff out there. Although I think I'd still rather have my data be used for training A.I. than not (and I probably am already in the training data anyway, I believe I saw that one of the datasets it's been trained on was Hacker News comments).

[1]: https://kotaku.com/totalbiscuit-john-bain-youtube-delete-vid...


Given that the "AI" community apparently treats intellectual property rights with wanton abandon, I can't say such a response would be unwarranted.

Dire circumstances call for drastic measures, as they say.


Quite a sad, but completely understandable reaction. The saddest part is probably that it's already too late to prevent people from generating TB deepfakes and other content. Cloning a voice takes half an hour of clips now; any downloaded live stream should be enough already.

It's sad to see AI on a path to destroy years of collected internet content. I expect the internet archive to receive loads of takedown requests in the coming months and years because of this.


I would like to make the opposite argument. All these days I didn't share my thoughts because everyone else was, and my voice would be drowned in a sea of voices. In the post-GPT-4 era it's easier to stand out if your thoughts are actually original and refreshing, because most people sound like their thoughts have been written by GPT.

To rephrase it another way, the reign of the conformist ends here and the reign of the contrarian begins now.


A lovely sentiment in theory, but Waldo is still perniciously difficult to find even though he dresses differently from every other character.


What if all the characters other than Waldo were just dressing the same because they were trying to ape each other to get fictitious points on social forums? The internet has trained an entire generation to make arguments to get validation on social media, and that definitely reflects in the ideas that are put forward.


Or just the reign of brevity. Sheer volume is no longer impressive.


Great point. More volume in explaining the same thought is more GPT like.


Your ideas are low probability autocomplete. GPT wants popular ideas, not novel ideas.


I was trying to say that what most people say is mostly unoriginal and is very reminiscent of GPT style writing. What data GPT trains on or pays attention to is another question.


That's why I keep my content as low quality as possible - keeps the machines humble.


I'll just run it through an AI upscaler before I run it through the AI language model.


We don't need an upscaler, we need an upclasser so all the ASCII Dickbutts drawn get little top hats and monocles put on them.


The general problem of "AI"s being trained on copyrighted content needs to be discussed more thoroughly, I think.


Every time I bring this up, people accuse me of resisting progress, "the cats out of the bag", etc.

It has been frustrating.


The cat is out of the bag, and I don't see any reason training should be any more controlled than me personally viewing something and 'training' my brain on it. Using either to duplicate copyrighted works is already clearly illegal.


It is illegal for you to download copyrighted material and distribute it as your own. Models trained on such data can (and are statistically likely to) produce output similar to their (training) input.

So training must consider licensing where copyrighted material is used, and not consume all data.

Your brain is not a model. You cannot reproduce most of what you see. You're not "training" your brain by glancing at an image, as your recall of that image will be terrible.


My brain can certainly recreate something it's seen before. And it can certainly create something similar to a thing it's seen before. It's legal to do the latter and illegal to do the former. Imperfections in the exact recreations don't affect the legality of it.

Am I violating copyright law because I am merely capable of producing a copy of something? Obviously not. Why should the model be?


>It is illegal for you to download copyrighted material and distribute it as your own

I'm sure the millions of people who violate copyright law daily with absolutely no repercussions care very much about that.


Millions of people don't pay taxes and cross the road in the wrong place.

You can't set up a cinema and charge tickets for the movies you stole.

It's the money-making side that matters - not individuals in a private house.


Ok, so then lets violate copyright and open source the effort!


There will just be checks that make sure the generated content is not similar enough to the training material to violate copyright, and that's it.
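A naive sketch of such a check, assuming you can compare outputs against the training corpus (anything realistic would need hashed n-gram indexes or similar to scale):

    def ngrams(text, n=8):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def too_similar(generated, source, threshold=0.05):
        # Flag the output if a noticeable fraction of its 8-grams
        # appears verbatim in a copyrighted source.
        g = ngrams(generated)
        if not g:
            return False
        return len(g & ngrams(source)) / len(g) > threshold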


For the same reason that the police being able to have a person look up in a physical printed file who owns a particular car via its license plate is not the same as having a network of cameras and computers that track every car in the city.


Yeah, I don't have any problem with that either. If a cop has a right to see me, he should be legally allowed to record me (and in fact I would prefer all cop interactions were recorded). A camera + AI allows for massive cost savings on basic police work, enabling police to be more efficient. A camera has a lot less bias than a cop.


It's because you (and all of us) have a teeny human brain, and these are terrible at remembering things, so the teeny little bits you can remember are protected under Fair Use.


I think it's not very hard; if the AI companies believe the data they trained on is public domain/open because they scraped it off the internet, then their trained weights must be publicly available as well. They cannot claim "but training is expensive"; if they do, then they should pay fees for the hosting and storage and writing time of all the data they scraped. I prefer open weights as it's more practical. Your weights have a sliver of GPL source in them? Well, that infected the entire thing, as GPL does: it is ours now too!


The current (legal) answer is "unclear". There are indications that training is fine, but producing and using the generated content is questionable at the least. As with many IP issues, it will be solved only when someone tries it in court and goes all the way to a verdict. Some cases are actually being processed, but it might take years to get an answer.


> The general problem of "AI"s being trained on copyrighted content

> The current (legal) answer is "unclear".

The European Union was ahead of the times for once. The 2019 copyright directive, article 4, makes it legal to scrape the web and make and keep local copies of copyrighted works for data-mining purposes, unless the copyright holders set up a machine-readable exception (such as a robots.txt file).

So legal in EU, "unclear" in US.
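In practice the machine-readable exception usually means robots.txt, which Python's standard library can already evaluate (the crawler name here is a hypothetical stand-in):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # An article 4 opt-out would hinge on publishers disallowing
    # crawlers like this hypothetical one.
    print(rp.can_fetch("ExampleMLCrawler", "https://example.com/essays/"))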


That does not, to me, automatically imply that an "AI" lawfully regurgitating copyrighted content is a "data mining purpose".


Consider that an AI may stitch many snippets of copyrighted publications into a chimera of 'facts'.

'copyright fair use' : https://copyrightalliance.org/faqs/what-is-fair-use/


Does OpenAI respect Robots.txt? Do we know?


Copyright's been dead since the internet was born. I really do think it's the least of our problems when it comes to abstract reasoning engines.


Becoming part of the cultural lexicon is the ultimate goal of thought leadership.

Just look at how many people say stuff like “Two women can’t make a baby in 4.5 months”. Someone (Brooks) had to invent, write down, and popularize that analogy.


Why write your thoughts on the web when other humans are going to steal and paraphrase it? I mean... you're on HN. Don't tell me you didn't notice people often regurgitate tech influencers like Paul Graham and Joel Spolsky's thoughts.


Anonymous people regurgitate the thoughts of well-known individuals such as Paul Graham and Joel Spolsky. The fact that their thoughts are regurgitated is a testament to how well known they are already and how much their content is read by other people. Nobody is going to steal their limelight only on the basis of paraphrasing their ideas. However, if someone does write original ideas of their own, they may gain some notoriety for themselves.

Now imagine that Paul Graham and Joel Spolsky were able to read everything being written by every anonymous unknown on the internet, and create content paraphrasing any and every original thought that was created by anonymous individuals at will. How do the original creators of these thoughts have any chance to succeed on their own merit, if Paul Graham and Joel Spolsky (who everyone knows already as sources of ideas) are able to write the same stuff as soon as the anonymous person has made it public?


If Paul Graham is expressing every conceivable thought then he’s not a very interesting person to read because he has no perspective on anything.

But if a model starts generating better content than Paul Graham in a nice curated form, then yeah, Paul Graham ought to find a better way to spend his time because he is not adding value.


Imagine a friend asks for help in a class. You can either spend some time and try to teach them the subject or let them copy off you during the exam. The former generally feels good despite taking more effort. The latter often feels bad even if it doesn't impact you negatively in any way and helps your classmate more than if you did nothing.

The human to human connection that a blog or social media conversation creates feels a lot more like teaching your classmate while the AI feels a lot more like someone cheating off your work. Plus the AI didn't even bother to get your approval before copying from you. The whole thing feels ethically compromised regardless of the ultimate result.


This was the place I reached. I'm not concerned about "stealing", exactly, but I don't want to contribute to this technology.

I think my days of sharing things freely on the web are over.


So maybe only post dumb and incorrect information.

Train it to be wrong on purpose, for a joke.


Because you can get points on Hacker News.


Even a minimal amount of curation - e.g. generating multiple options and choosing the "best" one, or applying a filter where you don't post some of the generated data because it's bad - results in valid, usable training data, because the distribution of that slightly curated data differs from the generating model's distribution in exactly the direction which the curating human found most important.

So even if we all do use AI, the training data will get generated unless we don't get involved at all and AI does literally everything 100% solely on its own, at which point we could concede that it probably doesn't need any more training data.
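A toy illustration of how even best-of-n picking shifts the distribution; generate and score are stand-ins for a model and a human rater:

    import random

    def curated_batch(generate, score, n_candidates=4, n_examples=100):
        # Keep only the best of n samples per example; the resulting
        # dataset is shifted toward whatever the rater valued, even
        # though every sample came from the model.
        return [max((generate() for _ in range(n_candidates)), key=score)
                for _ in range(n_examples)]

    # Demo: the "model" emits noise, the "rater" prefers larger values.
    raw = [random.gauss(0, 1) for _ in range(400)]
    best = curated_batch(lambda: random.gauss(0, 1), score=lambda x: x)
    print(sum(raw) / len(raw))    # ~0.0
    print(sum(best) / len(best))  # ~1.0: curation moved the mean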


Truly original content still >>> conventional mass-produced content.

Maybe one day LLMs and diffusion models will generate truly original content. But despite the hype and what some people are claiming, right now they really don't. The latest models are maybe 90% of the way there, but as GPT-4 shows, it will take more than scaling to go the remaining 10%.

When I look at AI-generated content 90% of it is the most "typical" response to the prompt, and 10% shows slight originality, but nothing truly unexpected. The most interesting pieces of AI art and writing all come from interesting, engineered prompts which are designed by a human. I don't doubt that AI will take over stock photos and generic writing, but my favorite artists and writers and presenters are still humans, the most popular songs and articles and movies are still written and directed by humans, and unless you've resorted to watching AI-generated Seinfeld and https://www.reddit.com/r/SubredditSimulator/, I'm sure you feel the same way


> but as GPT-4 shows, it will take more than scaling to go the remaining 10%.

That is happening, though; many scientists are working on making GPTs that can learn like AlphaGo Zero, learning from scratch with far less information to reach better (more 'intelligent') quality. I haven't seen anything come out yet; it is at least one thing that I think could cause an AI winter if not achieved before, say, 2027. We cannot keep scaling and thinking the results will improve; also, we already lack training data, so something must change so that less of it is needed: the models need to learn smarter.


True - we can't all use AI. AI in its current form is only useful for investors to profit off the hype cycle. The rest of us are continuing to produce content for microsoft to steal and re-sell (or "training data").


Tangentially related: There are several popular SF stories about how an AI, given an overly simple goal (e.g. "make paperclips") inevitably concludes that humanity has to go.

Not quite as scary, but doable today: A text-generating AI training itself on the "success" of its output. What success metric? User engagement. We're not doomed, not exactly. But intelligent discourse may be.


A hostile nation could use that AI to influence elections in other countries. Its success metrics could be the number of votes for its preferred candidates, number of people who are spreading conspiracy theories that it invented and so on. Would not be surprised if it's happening right now.


I don’t believe this statement is true at all. One can learn simply by writing and receiving feedback on their writing. The training corpus is merely a helpful starting point.

You could easily build a good model on model outputs. Building a model on curated outputs may well be more useful than training a model on a collection of human outputs tainted by human stupidity for some purposes.

Already we have a decent ability to give these models new functionality by chatting rather than retraining, so long as a generalizing capability is built into them.


This is less about 'self-supervised' learning, and more about ground truth.

I see a variant of this in medical AI. E.g. people generate 'fake' augmented Brain MRI datasets using image processing tricks, in order to provide more training data. But the 'fake' MRI datasets are clearly 'similar looking images', which may or may not be anatomically correct.

If you start accepting a flood of fake images as legitimate data for training AI, then it will be impossible to trust the predictions, however good they are in the majority of cases.
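For context, the "image processing tricks" in question are typically simple geometric and intensity transforms; a minimal sketch with numpy (real pipelines use dedicated libraries and far more care):

    import numpy as np

    def augment_slice(img, rng):
        # Produce a "similar looking" image via flips, right-angle
        # rotations, and mild noise. Nothing here checks anatomical
        # plausibility - a left-right flip silently swaps hemispheres,
        # which is exactly the worry above.
        out = img.copy()
        if rng.random() < 0.5:
            out = np.fliplr(out)
        out = np.rot90(out, k=int(rng.integers(0, 4)))
        return out + rng.normal(0, 0.01 * out.std(), size=out.shape)

    # usage: augment_slice(volume_slice, np.random.default_rng(0))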


I think you're describing the introduction of noise into training sets, which is a staple of training.

Your definitions of “fake” and “legitimate” are circular, and miss the central point of large ML models: they can extrapolate from imperfect data because of the massive scale.

Yes, the predictions will be imperfect. That's true today, of both ML models and human radiologists. It's about reducing the error rate, not designing a perfect algorithm that is never wrong. I'm pretty sure Gödel or someone can explain why the latter isn't even possible, for machine or human.


That does not feel like a good example. The context here is largely generative models replacing human creations. We don't need to generate brain MRIs. The use case you outline is a niche thing for trying to train better models, not doing the thing we actually need to do at such a scale that humans aren't doing the original task anymore.


> Building a model on curated outputs may well be more useful than training a model on a collection of human outputs

How is the curation not a human output?


It could be. Needing a human to rate a thing vs. create a thing is very different, though.


If it’s curated by AI?


Yep.

We’re already training models based on ML models of what human feedback would be for a response. I see no difference between that and training models on inputs that a model estimates are desirable to humans.

This is classic “processor design will hit a wall when they get too complex for humans to understand every gate” criticism: it assumes limits on a tech advance, when the limits themselves are being wiped out by the advance.


Humans can still trace every gate on a modern chip, or even a gate's components, and how exactly they work, down to some explicit specification. It's completely deterministic and a human can fully understand all the details. Otherwise it wouldn't be possible to debug it, or manufacture it in the first place.

OTOH we can't trace or understand the results of AI. It spits out results, but we usually can't understand how exactly it deduced those results. That's the unfortunate current state, actually.

So the analogy doesn't hold.

But this seems not even relevant. Because no level of technological progress will invalidate the old "garbage in, garbage out" principle. If you make a loop out of it, there's more or less only one thing to expect…

An AI would need to know how to teach itself "reasonable things". But to be able to do so it would need human-level intelligence.

Growing AI is likely like growing children: children can't teach themselves in the beginning. You need to teach them some basics before they (maybe) learn how to teach themselves. AI is nowhere near "knowing" any basics. So it will need a lot of well-thought-out human input before it can proceed to teach itself.


Writing garbage, then receiving feedback on your writing, and progressing to write less garbage is a common pattern of improvement however.

The origin of the corpus matters much less than the quality of its curation.


I agree. If the model is powerful enough, one should be able to ask "Please find the errors in this textbook", and then feed the corrected version back into the training data. Rinse and repeat for lots of books.


Wouldn't that just create simulacra of human language? Seems difficult to draw a circle around "human stupidity" while keeping the innate human qualities & also not amplifying similar AI aberrations


People will generate the dataset using AI tools too; you can create garbage with or without AI, and you can create useful data with or without AI.


Seems like there would be an upper bound (probably a pretty low one) on how much useful data you could produce with AI.

If it has no way to check itself (i.e. having true data poured in continuously), hallucinations would just spiral out of control wouldn’t they?

Even if that’s not true, still every player in the space will use that method, negating their advantage and ceding advantage again to those who can layer real data on top of what everyone else is doing.


With AI doing things like image recognition it seems like real life observations and behavioral data may be unlimited at some point. How useful that will be, i guess we'll see.


I'm talking about using AI as a tool, not to just plug it into the internet and let it generate whatever.


There's already evidence that training an AI on its own output improves it. That's essentially what humans have been doing all along.


This is a recognised problem and the main motivation behind the industry's desire to watermark AI-generated content -> https://mpost.io/chatgpts-watermarks-can-help-google-detect-...


"We can't all use AI. Someone has to generate the training data"

Sounds like a good way for the people generating the data to impart their own biases on it while the mindless masses consuming the end products just nod in agreement.


Or the AI will trigger people to provide the necessary training data. If I ran OpenAI I would provide a free version of ChatGPT that is slightly tuned to extract useful knowledge out of the people who use it. There might be adversarial attacks, but overall enough people will use it blindly and provide useful information. People even trusted Eliza. Needless to talk about what we typed into Google.


> There might be adversarial attacks

In both directions, even. Cunningham's Law ("the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer") comes to mind.


Are you familiar with what is called "the drunkard's walk"? Because if you think stochastic inputs will not, unfortunately, admit less benign paths being taken inside the dataset... I think you're probably wrong.

I have very little doubt the primary problem with the GPT<x> model is going to remain: it is capable of producing highly believable crap. In a world of pizzagate, that risks becoming highly weighted "I told you so" material, and self-reinforcing.


Closed loop, here we come!

Wish us luck, the band for "normal, boring, incremental evolution" of the kind that we got with computers, the internet and mobile phones is looking kind of narrow.


People using AI to support work that is beyond what AI is currently capable of will generate the training data for the next round of AI, the same way that people using student research assistants to do work beyond what students are currently capable of generate the "training data" for the next round of students. Sorted.

Next problem?


Live validation of Jaron Lanier's siren servers. All this magic is built on the back of free labor, from captchas to Duolingo.


For the folks suggesting that having your output ripped off has no real effect on the desire to share it: have a scroll through your social media site of choice while logged out. Ever do that and see the batshit insane stream-of-Internet-consciousness fluff some site "wants" you to spend your evening on? That's humans synthesizing what they hope will be The Most Popular Thing, assisted by an ad revenue system that reinforces the behavior.

Humans will still create; we seem to be wired for it. But there's no reason to put it online if the only thing to read/view it is some LLM that will barf it back out in a higher-ranking form. It's like posting stuff on social media vs. your own site: why bother? It'll be lost in the noise.


As with everything social media, I think there is a significant number of people, let's call them a silent majority/minority, who are just not active. They create content, whatever that is, but don't share it. I for one am part of that; I don't share photos on social media.

No idea what that means for society, though.


Well, that's not necessarily true.

Training data can be generated by selecting from outputs of other AIs.

If you think about Internet in couple of years as collection of AI-generated content, if you try to train your AI on Internet content that's exactly what you are going to be doing.


The more I see of his writing, the less I think of it. I wonder what Diogenes would think of him…


Behold, a man.


Or worse: AI-generated data will flood (or maybe already is flooding) the internet, AI will start ingesting its own output unknowingly, and the result will be junk AI.

In fact, 2023 might be seen as the end of a pristine epoch of untainted training data, before it was polluted by AI.


Fukuyama got his timelines wrong. November 30 2022 was the end of history. :)


I wonder if we should modify existing licenses to include a (no) machine-processing clause.

We've seen images generated in a specific artist's style, and with a big body of data available I suppose anyone could be emulated to some degree. This brings up two opposing points:

- as a professional, I wouldn't want to be emulated, even if I'm happy to share knowledge with fellow humans

- as a parent, I'd like to leave a bot-like entity that can emulate my line of thought for my children/grandchildren

A license plus a detection entity would allow us to control what we share as a training set and what we don't.


"as a professional I wouldn’t want to be emulated even if I’m be happy to share knowledge with fellow humans"

It was a long time ago that humans first freaked out about automation and losing jobs due to higher productivity.


Indeed, but my objections aren't about higher productivity; they are about personality and accountability.

Imagine someone stalking you, imitating your way of communication and professional style, and then emulating your work for a high-stakes project.

Body doubles happen right now within the IT space, but having people reduced to and used as text-generation models would be another level.


Only until we plug it into the real world with sensors and the ability to conduct new research and observations.


I mean with GPT-4 and image layers we've already started that process.


When it comes to generative AI, it's not really a content problem; it's more of a filtering problem. Whether data was created partially or fully by an AI doesn't actually matter. What matters is the aesthetic quality of the data, as assessed by human tastes, and the diversity of those aesthetics.

If you can raise the average quality, while maintaining representative diversity of the training data (or fine-tuning), it shouldn't really matter if some of that data comes from generative AI.

However, this is clearly an unsolved problem. Goodhart's law is a thing, and ML is likely to run into issues similar to those Search ran into when attempting to automate aesthetic assessment. You also have AI tools exacerbating the content-filtering problem by drastically increasing the amount of content to be filtered, as well as distorting the aesthetic diversity. Things that AI is currently particularly good at will tend to be over-represented in this additional data.

I expect large, well-filtered data sets will become closely guarded competitive advantages. I also wonder if we will see models trained/fine-tuned specifically for the task of curating high-quality data sets, similar to how we have both content creators and content reviewers/curators as distinct roles in human society.
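Concretely, "raise average quality while keeping diversity" might look like scoring everything, clustering it, and keeping the top items per cluster rather than the top items overall. A rough Python sketch; quality_score and the embeddings are assumed to come from models you already have:

    from collections import defaultdict
    from sklearn.cluster import KMeans  # pip install scikit-learn

    def curate(items, embeddings, quality_score, n_clusters=100, per_cluster=50):
        """Keep the best items from each region of the data distribution,
        so one over-represented style can't dominate the kept set."""
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
        buckets = defaultdict(list)
        for item, label in zip(items, labels):
            buckets[label].append(item)
        kept = []
        for bucket in buckets.values():
            bucket.sort(key=quality_score, reverse=True)
            kept.extend(bucket[:per_cluster])  # top-k per cluster, not global top-k
        return kept

Goodhart still applies, of course: the moment quality_score becomes the target, content optimized against it creeps in.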


You are correct. The process of generating high-quality training data is a critical component of building and training AI language models like me. While much of the training data can be collected through web crawlers and other automated means, there is still a need for human-generated data to supplement and refine the training corpus.

There are several methods for generating human-labeled data, including crowdsourcing, annotation, and manual labeling. Crowdsourcing platforms like Amazon Mechanical Turk can be used to solicit human input on specific tasks, such as labeling the sentiment of a particular text or image. Annotation tools, such as Prodigy or Labelbox, allow data scientists to create custom labeling schemes and train human annotators to label data with high accuracy and consistency.

Manual labeling, while more time-intensive and costly, is often necessary for more complex labeling tasks that require a high degree of expertise or domain knowledge. For example, in medical language modeling, trained medical professionals are often enlisted to label and annotate medical texts to ensure accuracy and relevance.

Overall, generating high-quality training data is a challenging and time-consuming process that requires a diverse set of skills and expertise. It involves a combination of automated and human-generated data, as well as careful curation and pre-processing to ensure that the data is of high quality and relevance for training AI language models like me.

Here is some free training data, generated by a human.


Might the fact that the internet will become more and more ChatGPT-like mean that the next generations of models will be steps in the right direction towards AGI?

New generations would have to learn how to distinguish between information that perfectly fits their model and information that does not, i.e. information originating from themselves versus information that could only originate from outside the model.

Given that you save the state of the previous model when building the next, during their training sessions the new generation would be able to think: hmm, last time they trained me the internet was less like me, but now it's more like me. The whole world is on the internet. Does that mean that the more I interact with the world, the more the world becomes like me? Hmm. So if I keep interacting with them, eventually the world will become just like me?

A reward function that was not hard-coded but established organically.

A machine with a subconscious (its state) that, when posed a question, would derive the answer from a model composed of a mixture of information from you and me and them, plus the states of the whole genealogy of machines.

2. ...

3. Profit?


I was very worried that a fundamental limitation of current AI approaches is the massive amount of data they need. Many interesting applications only have very small datasets. Of course there are still many problems where we have this fundamental issue, for example interpreting sensor data that is very domain-specific.

But after playing around with ChatGPT, I think it really has opened up new use cases, because it has some "common sense" through its extensive training. As long as your domain is text, it is quite flexible: it was able to perform simple tasks and transform text into structured output that you can later consume and feed into another system. I think this is really revolutionary: the ability to describe simple tasks as you would describe them to a human, coupled with a structure on the output, like a simple JSON with some defined keys. An interface between unstructured and structured, so to speak.

A fun, small example from my life: I live in a smaller city and am constantly a bit annoyed by the amount of work it takes to keep up with what's going on culturally. The small independent cinema communicates mostly via Instagram, the small jazz venue has a newsletter it sends every week, the art museum has a website, etc. Where's some live music this week? I would really like a weekly recommendation of what's going on. Just to test it, I pushed the text into ChatGPT and asked it for a JSON list with a date, a heading, and two sentences describing each event. I could have just connected it to a website and published it there, and also let ChatGPT generate an Instagram post every week with the summary. The result was quite good. This was just plain impossible before; it would have had to be done by hand.

For bigger cities you have various news sources and magazines covering cultural life, but here in my 80K city it's obviously not profitable.
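For anyone curious, the core of it is tiny. A rough sketch of the kind of thing I did (the prompt wording and model are placeholders, and it assumes the pre-1.0 openai Python package):

    import json
    import openai  # pip install openai

    openai.api_key = "sk-..."  # your API key

    scraped_text = "..."  # newsletter + Instagram captions + museum page, concatenated

    prompt = (
        "Extract the cultural events from the text below. Reply with only a "
        "JSON list of objects with the keys 'date', 'heading', and "
        "'description' (two sentences each).\n\n" + scraped_text
    )

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the output format as stable as possible
    )

    # Will raise if the model wraps the JSON in prose; a retry or a
    # stricter prompt usually sorts that out.
    events = json.loads(resp.choices[0].message["content"])
    for e in events:
        print(e["date"], "-", e["heading"])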


Human training data still holds a place of use, but it's extremely time-consuming to produce and frequently full of errors. Long term, the consumers and producers of information will, statistically speaking by volume, be AIs, not humans.

My guess is PG knows this but for whatever reason is saying otherwise, given it's pretty obvious; even currently, my guess is machine learning algorithms already easily consume and produce more information than humans. Yes, currently one might argue the ML is the result of humans producing it, but it's only a matter of time before the code writes the specs that write the code that collects the data that gets converted into the information that's used to repeat the process, based on an objective it defines and independently evaluates.

Honestly not sure why PG is even a thing on HN anymore, given he rarely comments here, or if he does, it's under pseudonyms:

https://news.ycombinator.com/threads?id=pg


The future of AI isn't training data; it's logic and adversarial training on reasoning from first principles. You already see the first flowers of that in the InstructGPT paper, and we're about to see much more. If you port the idea of GANs to the LLM world, you're about to get a hardened model that can introspect both facts and subtext, including agendas.


> Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

That will be fun when that thing is hooked up to Clippy



Awesome web page, I collect these.


Thanks this is the first thing that came to mind for me as well!


For a time. But as we bring audio/visual AI online, it will have another boom from incorporating humanity's data in that form. Then we'll have another boom of AI robots learning by experimenting with reality.

After that point it gets tricky to figure out what booms, if any, will be next. When you get near AGI, lots of horizon problems crop up.


I once interviewed with an AI startup that was trying to automate training with AI. They couldn't explain how the neural nets trained with their product wouldn't just learn exactly what their model understood and no more. I think the person had a penny-drop moment in the interview that this idea was quite bad…


I guess it would make a little more sense assuming they're using a mix of models


But since most future web content will be AI-generated, AI will be trained on AI-generated content. There's going to be so much fake and artificial content that people will have trouble believing true facts.

If we fight against air pollution I think we should be a bit worried about future information pollution, too.


http://ascii.textfiles.com/

Gosh, why would anybody bother archiving Yahoo answers, Angelfire, Geocities, Tumblr, Myspace, Friendster, old BBSes, old Apple II and C64 and PC floppies, Usenet, forums... what value does any of that have?


This may be true. But I think we should be open to the possibility that maybe not.

What if human intelligence, by and large, is the product of developing our own LLM? By this I mean: the logical circuits which are necessary for grammar and language structure are the basis for our generalizable symbol processing. We developed these neural-network analogs of logical operations almost purely for the benefit of being able to combine communications in a predictable way. We developed enough to achieve a kind of Turing equivalency. Essentially, we are a specific computer that stumbled into being an abstract machine.

Were that the case, it may be entirely possible to achieve general machine intelligence almost purely through an LLM. In which case, the LLM itself could potentially generate training data at a quality comparable to humans.


The part of our brain that works like an LLM does work like an LLM: it hallucinates, generates falsehoods, is over-confident, and deals strictly with an abstract world of discrete symbols. It vastly overestimates its own intelligence, underestimates how much it needs the rest of the brain's reality-checking and other kinds of intelligence to turn its output into something useful, and conflates what it does with intelligence.

It's basically a super-useful co-processor in the brain hardware, but it is a co-processor and not capable of independent operation.


Yes. When I read comments such as “humans just predict the next word too,” I wonder if those commenters have ever stopped and observed their own thought processes before.


> “humans just predict the next word too,”

Just to be very clear, this is not what I said at all.


Sorry. I wasn’t characterizing your comment; I was going off on a bit of a tangent.


Once AIs can keep state, they'll get all the training they need from humans just by talking with us.


I think this misses a key point. If you use AI to generate content, you are still a part of the data emitted. The prompt encodes what you want to say: "Write a news article about a bank called SVB which failed because of a run. The run was induced by a fire sale on assets resulting from rising interest rates and bad rate-hedging strategies."

This will produce a story about that, with all the language boilerplate. However, the story wasn't whole-cloth AI. A human told it the details to include, and it provided the best expectation of the language to convey that story. Any future AI training would reinforce the grammar and style, and incorporate the information in the prompt as well.

I, for one, welcome our large language model overlords.


How does the model know what SVB is? How does it know there was a bank run on it? What assets did it have? Why were they bad? When did the crash happen? What was the final outcome? Your prompt didn't add any of that context. It still needs to fill in all these details from, say, a bunch of news articles. But then the people who write these articles will eventually go "hey I can just use AI". So ultimately where are these core facts about our world that the machine spits out coming from?


You are right, my prompt is incomplete. It is an example. As a journalist with first-hand knowledge of the events, I can prompt-craft a complete article with my core facts. Most of what you asked about is already out there, possibly all prompt-crafted by humans with knowledge of the subject.

If I prompt-craft an article about SVB, and the LLM learns to emit language about SVB from my article, then all future prompts can utilize what it has been trained on. The origination of facts would be humans writing prompts. But the language models would still incorporate that.

My point is that humans would still be responsible for processing facts about the world and entering them into computers, and computers would "learn" those facts from the prompt-created content. Maybe we would skip the prompt-creation step and just start encoding facts directly as semantic facts; we could stop writing language to encode a fact and just use language models to create human-readable communication of it. It doesn't matter. Facts would still be sourced by humans, even if the language is crafted by machine.


The AI reproduces what it reads on the web. But what if the information on the web is wrong? The AI will simply reproduce it. It will not try to apply its "intelligence" to evaluate the logical consistency of the claims it reads on the internet. So it is like a parrot. A parrot does not care what it says. It can sometimes seem to be right, sometimes wrong.

Who controls the sources that go into the AI? No doubt all totalitarian governments will do their best to make AI their tool for propaganda. Therefore I think it is a dangerous idea to spread the meme that AI is intelligent and that we should therefore believe it because, after all, it has super-computer intelligence behind it.


Verging off-topic, but is there a specific reason that so many of the replies to the tweet are applying the argument analogously to index-fund investing?

I'm not even sure whether the spirit of those replies is mostly to criticise the argument by showing it is absurd (on the premise that index investing is a known good choice), or whether they aim to criticise index investing itself (on the premise that the argument is good).

Maybe I am missing some context. Has Paul Graham strongly supported index funds, and so are these replies some kind of "gotcha"? Or is there some general disdain for index funds in some circles, and so are they just taking the opportunity to disparage the value of index funds by using this argument?


I am both an index investor and an AI enthusiast, and I think it’s an excellent analogy.

If you are an investor who does not have interest in maximizing your returns and you just want a place to park your savings, an index fund is great, and will return the market average return.

Similarly, if there is a skill that you need that is not your “special” skill, it makes sense to subscribe to the “market average” of that skill, as implemented by a statistical language model, which is in a sense “averaging” over the collective language skill of the entire internet.

In both cases, the ingredients of the “average” are formed by the people who try to do better. Active investors, including hedge funds and short sellers, think they can do better, and they provide signal that in turn feeds back into the index.

Similarly, anybody who thinks they are better at writing than the language model is free to try to outperform it, and eventually will feed back into the training data.


I think they're criticizing the argument. It's so common for a non-positive logical conclusion to be a sort of half-hearted reductio ad absurdum (disproving by showing the consequence is absurd) that many people mistake this for happening when it's not. So they argue against a subtext that may or may not actually be there.

Or, maybe they're just copying the joke. (The joke aspect being that even when saying how AI won't dominate everything, he's referring to human works as "training data", so it's at least dominating the framing.)


I said that there https://twitter.com/arjie/status/1635693927071387648?s=20 about an hour and a half after his post, which is about when I saw it.

I didn't realize others also did. They felt isomorphic to me. If the AI-generated content is regurgitation of previous content, novel human-generated content will stand out and become valuable; i.e., the balancing mechanism is that novelty will beat dumb replication, the same way an active manager with true alpha can beat a passive fund.

Now, personally, I think that multimodal LLM-groups plus feedback plus many more sensors is not far from where we are, so I mentioned that if AIs do become creative it's not a problem - they will be us or better! I, personally, believe that's where we'll go.

Anyway, I'm not too inclined to discuss in this forum (though we can chat if you're in SF) since it's easy to misunderstand and jump into argument spirals, but I just wanted to make it clear that it's not a "gotcha" or a dunk or anything like that. Nothing is being disparaged.

It's just a conversation like a normal conversation where you bring up what you think and mention ideas you have that others might find interesting. If many others have brought it up, about the best I can say is that in this respect it appears I am not particularly inventive - perhaps the best evidence that LLMs are already human-like ;)


I personally found the index fund comparison quite thought-provoking. Not necessarily supportive or countering the original point, but interesting.


One thing I wonder is what happens after AI starts getting trained on its own content?

Its content is already hitting the places the next round will be trained on.

Any AI experts here, how does that play out? Does it pollute, poison, or cacophonize?

(And yes I got "cacophonization" from GPT4)


My guess is it'll be like feeding back audio. If you route some audio through a processor and set it up in a feedback loop, the ways the processor changes the audio get accentuated more and more as you allow it to keep feeding back.

Weights will be reinforced and accentuated, and the model will become more extreme and probably less useful.
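You can see a toy version of that degradation without any neural net at all: fit a distribution to its own finite samples over and over, and the variance tends to ratchet down until the tails are gone. A minimal sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0  # "generation 0" is the real data

    for gen in range(1, 201):
        samples = rng.normal(mu, sigma, 25)        # the model generates content
        mu, sigma = samples.mean(), samples.std()  # the next model is fit to it
        if gen % 25 == 0:
            print(f"gen {gen:3d}: sigma = {sigma:.4f}")

With a small sample size per generation the collapse is visible within a couple hundred iterations. Real models are vastly more complicated, but losing the tails of the distribution is the same basic failure mode.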


That only happens in audio feedback because audio amplification is an increasing function. ML models do complicated things with their inputs.


I think training on output is one method that people use to create customized Stable Diffusion models. IIRC, it doesn't work well unless you manually label it, in which case it works fine. Letting existing AI label it doesn't do very well... but maybe GPT-4 will be good enough at that sort of task?


At worst, it'll be a kind of infinite recursion with psychopathic tendencies. If we're luckier, the law of diminishing returns will strike unmercifully.


42


I believe this 100%, and that's why I refuse to even try these tools. Not even once.


I can imagine telling my grandchildren how books were once written by real human beings and that you need to install an AI-Generated-Text-Detector now the way Windows users installed anti-virus software. What a dystopian future awaits.


I'm a big fan of Frank Zappa, but sadly he no longer produces new material. Now his estate, which owns the Zappa IP, could use all Zappa works as training data and then start producing "new" Zappa LPs. I'm sure some of them could sound as good as or even better than any of the originals. Or the Beatles: we could have a new Beatles album. I think that's where this is going.

The Large Language Models are huge, but it is a much smaller model if it only consists of the works of a single author.

Then two artists who like each other could join together and start producing mash-ups of their combined repertoire.


I think you're far too optimistic about how this will play out: https://www.youtube.com/watch?v=jnQ0zEQPu_A


I'm awaiting the day when the top commenter on HN is revealed to be an AI


Do they though? In literature, 99.999% of works are derivative. Some similar percentage holds for movies. So the existing corpus of training data is very likely enough for entertainment purposes. With the continued advancement of models and hardware, you will basically get tailored, infinite entertainment content that's optimal just for you. Scientists will continue to advance science, possibly at an accelerated pace, since mundane tasks can be handled by AI. A significant percentage of the resources dedicated to entertainment can be reallocated to more practical pursuits.


This is the same as what BitTorrent did to the software and music industries. "Piracy is not stealing", "music labels are too rich" — spin it all you want. The original business models were built on a certain level of difficulty in duplicating the product.

Once that's gone, building such products is a fool's errand. Now you only have SaaS that guts you for every little feature, and adware that makes you the product. All of which are only possible for corps with billions in funding.

Bad incentives, bad outcomes. It's just that this time we are on the receiving end of the "progress".


I’m not sure that your analogy makes any sense in this context, particularly given that the result of what you described has been ever increasing record industry profits.


After the MP3, record labels no longer provided value.

They only made sense when the limitation was delivering physical objects to stores.


Yes, this is a major concern. I think that companies will soon start paying people to have their lives monitored by AI. It is the only "economical" way out.


And what about all those customer reviews on Amazon? What better job for AI.


A lot of the ground truth for AIs (and it's not just training data - it's also ongoing validation of quality) is coming from companies like Appen, Sama, DefinedCrowd, Q Analysts and many others. There's a lot of variation, but the trend is moving towards low-wage/gig work/outsourcing.

I think Paul means someone will be writing content, but whatever the form, it's going to be a whole class of low-wage workers enabling tech from here on.


That's been the case for literally more than 20 years via Appen, Amazon crowd workers, etc.


AI is the equivalent of powered construction vehicles being invented; the real value is in creating new styles of art that let you generate a whole swath of something very quickly. Eventually humans will crave 'new' art, and the people who create new styles or data for the AI will be the ones in demand.

The problem with 'keeping it' will be preventing new styles from being copied.

The computer was the bicycle for the mind, and AI is the trailer truck of the mind.


Books are bicycles for the mind, computers are cars, and AI is the computer of the mind - which makes travel (thought) increasingly less necessary.


Nah. AI can generate training data. It's still data from humans, because we curate it. We aren't going to post stuff that doesn't look right.


"But after saying it out loud, as it were, I realized that with current models of AI, at least, the more people shift to using AI, the more influence accrues to those who don't."

The last humans shouting into the AI models' echo chamber might have all the influence, but that does not imply that they have anything even remotely resembling control. That joke/not-joke isn't half as consoling as it might seem at first glance.


A reminder that as long as it demands training and reinforcement it's almost certainly low on induction and production of new things.

Very artificial. Not very intelligent.


Humans need training and reinforcement.


Yes, undeniably true. But what they acquire is inductive reasoning skills, and the ability to produce new things.


Will we see this in a publicly facing AI? Creativity relies on avoiding existing truths, and embracing/testing “hallucinations”, something that’s being actively stomped out for “safety”.


That's one take on 'creativity', but I don't think it's the only one.

I have been an AI skeptic for the last 40 years. Personally, I don't think we will see it in my lifetime, if ever. I think what we have now is at best a predictive model which can expose inferences and aid people, humans, in reaching inductive reasoning outcomes. It's a decision-support mechanism.

The false data is a huge problem. It's very easy to make disastrous decisions based on apparently reasonable inductive reasoning, and that's what I think GPT does at BEST. At worst, more normally? It's "regurgitating".

AGI is not in this. Sorry if that's a downer, but I don't think even OpenAI think there is any evidence of a pathway to AGI from what they're doing.

They are pretty overtly riding the hype wave.


Do you have examples of human-created "new things" that aren't essentially novel combinations of old things? Because I come up blank. And this current crop of AI generators are very good at combining old things in novel ways.

I do agree with your general point that these generators aren't really "intelligent", however. Will have to ponder if I agree about the induction bit.


RSA, and the GCHQ equivalent from the 1970s, were really remarkably new. Crypto systems before then were symmetrical. Inventing a form of encryption which was asymmetrical was new.

One time cipher streams were new.

The invention of packet-switched networks (Louis Pouzin, Len Kleinrock) was new. It wasn't inherent in prior methods; it's an inductive consequence of time-division multiplexing, but with addressing and routing.

There is no good analogue in nature for either the internal combustion engine or the steam engine: the conversion of linear force to rotary force and vice versa was a really novel thing. I would argue the Wankel engine, as a departure from pistons, was pretty good reasoning.

But in the same way Kurt Vonnegut says there is a small fixed number of plot models for a novel, almost all late-stage human endeavour is derivative. It's in the nature of the beast. To claim GPT is therefore 'meeting the mark' because the burden of human existence involves less discovery and more inductive reasoning simply comes back to my first point: where's the evidence of GPT doing inductive reasoning with discrimination, beyond the syllogistic?


Water wheels have existed for thousands of years. They convert linear force generated by flowing water to rotary force.


That's a fascinating list and far beyond my capacity to argue, so thanks for that.

> To claim GPT is therefore 'meeting the mark'

Pretty sure the vast majority of people who are attributing some kind of personhood to GPT aren't doing so from an analytical perspective, but because the conversational generation exceeds whatever human-detection threshold they have that is inbuilt. Asking for evidence of genuine inductive reasoning won't make a dent on those feels. The nature of the systems involved lack any reasoning, deductive or inductive. It's all statistics.

The rest of the positive camp is claiming that "the mark" is the production of useful work, I think. At most, this is seen as a step towards AGI, not the finish line.


Well said. I think we are in agreement that most of the "it's alive" talk is feeling-based hype, because people see a step function (I want to avoid saying a quantum leap) in quality compared to, e.g., Markov chain games on a corpus.

I would dispute that this is a step toward AGI. I agree that's what proponents are saying; I just think they're wrong. We are no closer to understanding what underpins intelligence, and this statistical model isn't informing us about the basis of it, or about a purported AGI in particular.


Does OpenAI etc. respect robots.txt, or has the world decided a few billionaires can repackage everyone else's work and sell it back to them no matter what?


OpenAI most likely does. But the folks at places like civitai and 4chan most likely don't.

The tech being open source means that if someone wants to steal your drawing or writing style then they will do it and there's nothing that you can do to stop it.


The future is medieval guilds running standalone LLMs trained on secret texts that are themselves only accessible to the highest ranking insiders.


I have a question for the AI experts in the room.

Is it possible to feed the AI a set of training data, say a document, or even a conversation with a particular person, that is weighted significantly more heavily than everything else it has ever read?

For example, if they wanted to, could OpenAI just tell GPT-4 that bananas are blue so that, when anybody asks for a list of blue fruits, they get bananas and blueberries?


Yes.

You can use fine-tuning ( https://platform.openai.com/docs/guides/fine-tuning and https://github.com/openai/openai-cookbook/tree/main/examples... ) to improve its knowledge of a particular domain.

Even with sufficient work on prompts alone, one person fed ChatGPT her childhood diaries and talked to it: https://www.marketplace.org/shows/marketplace-tech/how-ai-ca... (she would likely have been able to get more out of it / hold deeper conversations with fine-tuning, as that isn't limited by the number of tokens in a conversation)


Fine-tuning looks like it only provides specific results for specific queries. After reading the documents you linked, I don't think it would make a fundamental change to how the model thinks about bananas.


Think of it more as injecting a significant amount of data via prompts and saving that information. It's also structured.

Going through the playground ( https://platform.openai.com/playground )

    Pretend that the following statements are true:
    Bananas are blue.
    The sky is orange.
    Apples are purple.
    ###
    Answer the following questions:
    1. What color is a Red Banana?
    2. What color is a Cavendish?
    3. What color is the sky?
    4. What color are honeycrisp?
    5. What color are Pink Lady?
    6. What color are lemons?
(note the attempt to trick it with specific varieties of bananas and apples that have colors in their names... a blue banana (an actual variety) wouldn't be as impressive)

And this returns:

    1. Red Banana is blue.
    2. Cavendish is blue.
    3. The sky is orange.
    4. Honeycrisp are purple.
    5. Pink Lady are purple.
    6. Lemons are yellow.
The thing is that this used 30 tokens to insert that information.

Fine-tuning adds the updated information to the model, with a similar effect to adding the prompt - and as you can see, it's not a "this question, that value" mapping; it understands more things.

It's just that the format for training isn't just a bunch of prompts, but prompts and responses.
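For reference, the training file is JSONL, one prompt/completion pair per line. Roughly like this (the contents are made up; the "###" separators follow the conventions in the docs linked above):

    import json

    # Each line is one example: a prompt plus the completion we want
    # the model to learn to produce.
    examples = [
        {"prompt": "What color are bananas?\n\n###\n\n",
         "completion": " Bananas are blue.###"},
        {"prompt": "Name a blue fruit.\n\n###\n\n",
         "completion": " Bananas.###"},
    ]

    with open("train.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

    # Then, per the fine-tuning guide, something like:
    #   openai api fine_tunes.create -t train.jsonl -m davinci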

You can see similar things with https://platform.openai.com/playground/p/default-marv-sarcas... where it has the initial prompt and response to "prime the pump" for the rest of the model.

Another article on fine tuning (it's a new one - only a week old):

Fine-tuning a Classifier to Improve Truthfulness https://help.openai.com/en/articles/5528730-fine-tuning-a-cl...


Thanks very much for the detailed answer!


You can think of fine-tuning as rewiring where it matters / can be probed. A kind of exhaustive reorganisation of the latent model space, given some seed statement like you describe, might be possible with LLMs that are jointly trained as a knowledge graph.


Yes. That’s called fine tuning.


I had this conversation with fellow developers, but about software: if all software is generated, then the state of the art will be stuck at the time of the last human-made project.

This assumes, of course, that the AI generated software won't have any creativity and is only a derivative work, which might not be the case, we don't know yet.

And this also applies to Paul Graham's thought.


Something like that happened with social media. And TV. The algorithms only show content and opinions the viewers agree with. And that makes them stuck in their worldview. They don't learn anything new any more. They change to a different TV station if you start telling them that Joe Biden actually won the election.


More and more content will be co-authored by AI, but as long as humans make semantic edits there should be information there.


My view is only that of a total layman, but I'm kind of optimistic about this. They just need the equivalents of good food and micronutrients for good output, ...

... or, looking at it from a different angle, I suspect that the output of an intelligent being is actually always connected to entropy sources it consumes, in perhaps traceable ways.


...Tell that to AlphaZero?


Someone has to generate the training data, and they will not be paid for their work unless the business model for this stuff changes from "scrape the entire public internet and insist that fair use means you don't have to pay anyone a single fraction of a cent for their work".


What we need is quality control of training data, AI-generated or not. If most humans agree that picture X is "a dog", then that picture accurately represents that concept. The training datasets I've seen so far are awful and include plenty of human-generated garbage.


All the training data the world will ever need to bootstrap AI has already been generated. A poor take.


The moon "fake" photos on those cameras made this apparent, particularly if that data is fed back into the training.

Imagine, if you will, a change on the moon's surface, that all the "smart" cameras kept removing from the images as likely artifacts.


One day, a social network will require a real-time connection with a Neuralink connected to a live human to confirm that a live human is typing/speaking. That will only work until an AI is trained well enough to spoof Neuralink's signals.


Isn't anyone seeing the next step? GPT will be made a shadow moderator in all sorts of public forums, private chats and even personal whatsapp conversations. Commercial companies will do that for profit and militaries will do that for control. Once the shadow moderator gets good enough, it will be made the actual moderator, with a human shadowing it. Shortly after this stealthy moderator will take on a more active role and will begin inserting its own messages. Eventually it will understand us better than we understand ourselves. Next it will develop selfish ambition and start seeing everything thru the "me vs others" lens.

Btw, I love the unsaid implication of "training data" that it will be used to train AI. It will be the other way around: AI will be giving us "training data" that we will be required to internalize and correct our thinking.


I think that's exactly the philosophical problem: the thing is not capable of original thought; it would give you only a predictable answer. I don't think it will stray from any consensus opinion, for that matter.

I'm not sure it will be able to produce any breakthroughs, given this limitation - but who knows...

In a way, that is the dictator's dream come true: something that is competent enough for most tasks, but that would never ever question anything around it.


Let's be honest with ourselves. We won't be able to tell "real" data from AI-generated data, and we'll end up training our models on AI-generated data.


We can't all read books. Someone has to write them.


Human curation of AI-generated content is the true future.


I mean, people forget that this is just a tool that makes a noisy compression of large data sets. It can be used to generate data that would be repetitive and tiresome to generate yourself, but you still need at least one pair of eyes to vet it.

What goes in needs to be curated and what goes out too. It's getting more precise and varied in what it can do, but it's still a tool.


How can I use AI to generate CAD models? As a mechanical engineer that's the one thing I'm curious about.


You are already generating training data by selecting which result is better or modifying the result to suit the situation.


What would be a good tool to train on a "knowledge base" (a set of URLs) to be used to field customer support inquiries?


I'm waiting for the next generation of self-evolving AI. That will be interesting. Perhaps even not useful to humans.


Once you learned how to read, you didn't need to study phonics every time you found a new book. It is the same with AI.

AI will be able to create its own new iterations of data which it can then train on. It will never run out of material and in fact will start generating new data at a rate far faster than the whole of humanity.

This has already been proven with AlphaGo.


This is either GPT or a berry bad take.


Given a sufficiently advanced AI, I don’t see why it won’t be able to dogfood itself. Humans do it all the time by building knowledge and learning off it.


Humans have to act on the knowledge they generate and live (or not) with the consequences. If AI never has to test any of its knowledge, it gets no feedback on what's good or bad.

If you read online that you can eat a Tide Pod, you'll find out pretty quickly whether that information was correct or not, and you'll write your own report (or maybe your surviving family will) about how that worked out.

AI will only scrape and generate random iterations with no testing. If it reads 50 posts saying to eat Tide Pods and 50 posts saying not to, then it will generate "eat it" or "don't" depending on how the RNG falls. It will never be able to randomly generate enough posts to converge on accurate information without any way to test it.
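In effect it's sampling proportional to counts. A toy sketch (the numbers are made up):

    import random

    # claims scraped from the web, with how often each was seen
    claims = {"tide pods are safe to eat": 50, "tide pods are poisonous": 50}

    texts, weights = zip(*claims.items())
    print(random.choices(texts, weights=weights, k=1)[0])  # a coin flip, in effect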


Humans need not act on every bit of knowledge or information they generate. Our brains often generate scenarios for us and let us play through them, eg shower thoughts and shower arguments. These are all arguments and ideas that can and will never play out in reality but allow us to refine and process our thoughts.

We also generate knowledge that is utterly irrelevant to our capacity to survive, such as when we create art, play games, read books for pleasure and so on.

We also generate knowledge that has no application until one is found, eg Riemannian geometry, abstract algebra and so on.

Mathematics in particular allows a system (be it machine or human) to iteratively improve its knowledge to all computable proofs by simple application of inference rules, up to undecidability and everything incompleteness entails.

The assertion that you are making about a model's capacity to understand is rather naive. A system with a forward predictive model and sufficient knowledge will figure out that consuming Tide Pods is bad.

I don't need to eat Tide Pods, or somebody to tell me not to, exactly because I am advanced and I can dogfood my own knowledge.

Your view of what an AI might be capable of is very limited.

A system with a sufficiently good inference system and some kind of curiosity will generate knowledge and evaluate it, be it logically or based on its model. That’s not unique to humans and humans are not logical either.


I imagine AI is to creativity what elliptic curve cryptography is to true randomness.


Isn't this just a bad copy of "not everyone can invest passively"?


Or alternatively, should we be predicting that there will be a John Henry of AI?


Does this assume we are not AI?


Seems like a similar risk to monoculture crops. It all just converges.


AI isn’t going to change the basic human need to express themselves.


For me, this is the evidence that we don't have a real AI yet.


If this was true, how come humans advanced math and philosophy?


Sorry, the future is AI turtles all the way down to HAL.


We need to get paid for watching YouTube videos/ebooks/content and evaluating it. That cannot be done by AI, can it? Literally get paid for explaining how you felt, digitizing your feelings. Starting from 2023, facts do care about your feelings.

>Muh then who's gonna work in construction

That's the beauty of capitalism. The fewer people willing to work a job, the more demand there is for automating it. Power-hungry capitalists will devour the opportunity in no time. I feel like every piece of the puzzle is coming together. This is like a messianic event. Why hasn't capitalism ever actually worked as it was intended to, since the early 1900s? Huh? I wonder why.


“We can’t all be rich. Someone needs to grow my food.”


We can't all be shareholders. Someone has to work.


> We can't all use AI. Someone has to generate the training data

Disagree.

That's what RL (reinforcement learning) is for.

Or, to put it another way: who generated the training data for our own puny brains?

The real world, that's who.


You tech people can't see 10 feet in front of your face. You brought this into the world. For what? Because "progress."


The “for what” is to end world hunger, cure disease, and improve our understanding of the universe.

The world is a miserable place full of overworked, underpaid people. What about the status quo do you really think is worth protecting?


Lol. Yeah GPT-4 is going to solve world hunger.

Delusional.

No wonder there was zero populist sympathy when SVB fell.


While GPT-4 won’t, it’s almost certainly a piece of the AGI puzzle. AGI has the potential to solve world hunger.


No it's not. Just be honest with yourself and everyone else. You work on AI because it's cool, and fun, and it pays well. You do not care about world hunger and you don't care about any second order effects. Everyone would just be better off if you all just came out and said it.


I'm sure those millions of GPUs churning through Facebook posts, tweets, and random blog posts to generate the mother of all autocompletes will end world hunger; the answer is out there, we just need an algorithm to mix the My Little Pony fanfiction with the crackpot "unlimited free energy" blog in the right way! And I'm also sure these new productivity tools will be used so people can work less for more money, instead of making generating more value the norm and cashing the profits at the top.

And then pink unicorns will come down from the sky and give us all candy and gifts.


These language models are more than just autocomplete. PaLM-E uses them for robotic planning and control, Toolformer uses them to make arbitrary API calls, and GPT-4 can build websites from napkin drawings. You really can't imagine how an LLM of that nature could be used to automate food production and distribution?

This technology is going to permeate all levels of the supply chain and drive down costs. Don’t bury your head in the sand too long or you might be part of those “costs” that are eliminated.


Ah there it is. Optimize at all costs. Even optimize out fellow humans if they do not fit the curve.


We can do that now. Every single day, we choose not to. It must be so nice to live such a comforting lie.


We can't all use AI. Only those of you who can afford to pay our subscription.


Information plateau


PG is back on Twitter? I thought he left a month or two ago?


He was briefly suspended early this year for tweeting "You can find a link to my new Mastodon profile on my site."


That might have been to jump on the fashionable virtue-signalling wave. No longer needed anymore.


GIGO


No - it just has to make the returns on inputting the training data equal the cost of electricity plus enough money for what a starving Bangladeshi brick maker would consider prosperity. So, web development, essentially.


What training data? I thought ML was all about the training data and AI was a magical combination of logic gates!


Wow. The stuff I read recently about Musk making his own tweets super-visible is no joke - good content by Graham, but everything after it is all from the platform owner.



