OpenAI’s hunger for data is coming back to bite it (technologyreview.com)
77 points by tim_sw on April 20, 2023 | 40 comments


I'm glad to see we're getting better and better at regulation. Here we have the most important development in computing in years, and we're immediately nipping it in the bud. There used to be some delay.


Same with the atom bombs, we only got to drop them twice, what a beautiful show we missed!


The analogy you’re looking for is the broader term Nuclear Power, and yes same thing except we allowed it to happen for a few decades before switching to burning coal instead, thanks for the reminder.


The fact that we botched something should not become a reason to continue botching things, whether in politics or in science. I see it rather as an argument to do better this time, next time, you know, lessons learned...


This, but unironically. We should’ve built more and done Operation Unthinkable.


That thing with record labels going after people using generative AI on their IP caught my attention a little while ago. This seems like a big Achilles' heel for these products, but I guess the hope is reaching critical mass before anyone complains and then getting the rules changed in their favor, as a number of other tech products managed.


If you're big enough and rob enough people, your thefts will go unpunished.


I really hope we don't get a new version of the "You cannot view this webpage because you are in the EU", this time for LLMs.


That would not be enough. You need to implement the right to be forgotten of European citizens even if you are not offering services in the EU.

Rather, I wonder what would be the effect of this on open source weights, as they have the same issue of being able to produce personal data of European citizens.


>even if you are not offering services in the EU.

If a company isn't operating in the EU then there is little the EU can do to them as the EU doesn't have jurisdiction over them.


Unfortunately for them companies like Microsoft and Google are operating in the EU, so it may not be that easy


That's called "extraterritorial" law... and it's a US concept lately brought into EU law (to counteract US extraterritorial laws).

It's the same as the US being able to fine two foreign entities doing business together... because they use the dollar for the transaction: the US is not part of the transaction EXCEPT for the use of the money.

Usually, the "guilty" companies will be fined and then... either ignore it (but risk trouble directly in the US, or when they try to do business with a US entity... or with an entity having part of its activity in the US...), or just pay. The same will apply here: either pay the fine or you won't be able to do business in the EU nor with any EU company... and possibly even not with a company in business with an EU company (indirectly).


Thanks for the explanation, but it basically converges on what the grandparent comment stated, right? I.e., unless your business deals with EU companies or operates in the EU, you are free to ignore it. And if you decide in the future that you want to deal with EU companies or operate in the EU, you have the option of paying the fine and moving on.

> either pay the fine or you won't be able to do business [...] possibly even not with a company in business with an EU company

That one seems a bit sus, because it would imply that you won't be able to use GCP/Azure/AWS for cloud services, and that just doesn't sound right. Afaik they wouldn't blanket refuse cloud services to an American business that doesn't follow EU laws (in case the business simply doesn't care to operate in the EU or make money from there).

Not a legal professional at all, so if someone could provide a better explanation of the situation, it would certainly be welcome.


OpenAI will have no trouble claiming legitimate interest. It’s very broad - broad to the point where HR companies are openly selling scraped LinkedIn data to recruiters.


Based on the injunction from the Italian privacy guard (the second one, not the ban) they did indeed back off on the scraping part, because it is not mentioned among the tasks that OpenAI has to do before April 30, for the ban to be lifted.

That said, this article is only devoting a sentence to way more important requests, which are the implementation of right to be forgotten and especially the problem with inventing false personal data. Which is a huge issue and what distinguishes artificial intelligence from autocomplete.

For more information: https://www.reuters.com/technology/italy-lift-curbs-chatgpt-...

By the way, blocking ChatGPT from Italy is not enough to avoid the GDPR violation. OpenAI is handling(*) personal data of Italian citizens, and therefore they must allow them to exercise their rights, even if OpenAI is not providing service in Italy. Blocking ChatGPT was just a fig leaf to show they were doing something.

(*) Since people say the personal data is not part of the model: "handling" is defined as performing any in a list of actions which includes disseminating, and "personal data" is defined as "data that allows identification of a person". So if "ChatGPT is disseminating data that allows identification of a person", then "OpenAI is handling personal data"; the former is simply a subset of the latter.


Let's legalize theft because some get away with it.


Let’s redefine theft instead?


Imagine we're a primitive people in a forest village. If I steal your ax, that's theft. However, if I see your ax, and make my own ax based on your idea, did I steal anything from you?

In the second scenario above, you may have to work harder in your woodchopping business, as I'm now competition, but doesn't that mean that the villagers benefit from better access (cheaper prices) to woodchopping services?

We currently define ideas as property. But is that ethically defensible? Now, if I claim to have originated the ax idea, I'm maybe stealing your brand or reputation. I should have to give credit to you for the idea of an ax, but I don't see how making a copy is theft. Ideas are not scarce; they're inherently shareable.


Well lucky us we don't live in a primitive society, and that's basically the point of this whole debate.

> you may have to work harder in your woodchopping business, as I'm now competition

So China copying our products is good, because we now have competition?


My comment is regarding whether ideas fall in the same category as physical goods. Ideas are fundamentally different, in that they're non-scarce unless we make them scarce via intellectual property laws. And I question whether those laws really promote human flourishing, or whether they simply protect whoever is close to the government.

Regarding China, copying without attribution should be prosecuted as a form of reputational or brand fraud, as it fails to give credit to others for the idea. And sure, under existing intellectual property laws many Chinese firms ought to be prosecuted and held accountable---I don't advocate breaking existing laws.

I simply think that the laws ought to be brought into better alignment with reality, and ideas simply aren't scarce (at least in how economics considers scarcity).


> as it fails to give credit to others for the idea

That is precisely the issue at hand.

> Ideas are fundamentally different

But that's essentially what products are. It's not like Chinese companies take a product and clone it; they take the idea of how a product should look and work, and implement it. The same applies to code, art, books. OpenAI takes those products, the results of ideas, modifies them and resells them without attribution and without having paid for a license. That's simply theft.


> Ideas are fundamentally different, in that they're non-scarce unless we make them scarce via intellectual property laws.

Is code an idea? Is the corpus of GitHub an idea?


Also, assuming you're writing from a United States perspective, part of the reason so much of our manufacturing sector was outsourced to China (and other nations) is that since 1971 the US dollar has been our primary export, as the global reserve currency among all the fiat currencies. This is starting to change, so I think we'll see more and more manufacturing returning to the USA as the dollar loses international reserve status. And I wonder to what extent Chinese copying of our technologies is connected to macro monetary realities.

It will be generally good for the world, for American manufacturing, workers, businesses, and families---pretty much for all except Wall Street banks and those close to the government money spigots.


I am writing from a European perspective, as we share common problems caused by countries copying our products and ideas. Now it would appear that the US wants to do the same but for intellectual work. That's not nice, to put it mildly. I think licensing data, expensive as it may be, would eventually lead to higher quality AI and healthier growth. At the moment this whole campaign of promoting OpenAI's products, and those of similar companies, appears to be focused on: 1) I'll take your products (books, art, software, music) 2) resell them 3) put you out of a job.

That's precisely what malicious actors such as China have done. Not only are we swamped with lower quality products, but as you wrote, we are also facing significant social and economic issues. This time at an unprecedented scale.


Your solution will definitely employ a lot of lawyers! A strange utopia of full employment


> So China copying our products is good, because we now have competition?

Yes? Isn't that the entire point of the market driving innovation and improving things over time?

Let's try the opposite: Why is China copying foreign products bad? And to whom?


So if you don’t believe in IP protection, how come OpenAI is allowed to keep its weights and training methods secret?

I’m aware you might think that their approach sucks too but we can’t have it both ways.

Same rules have to be applied across the board.


Why not democratize it?


One of the best comments on Hacker News ever.


That’s simply not true.

Although legitimate interest does not specifically require consent it DOES require that the subject is informed of the collection of the data.

Also, legitimate interest doesn’t apply to many categories of data otherwise protected by law. e.g. health info

Also, it doesn’t override their other lawful rights, such as the right to be forgotten.


> Although legitimate interest does not specifically require consent it DOES require that the subject is informed of the collection of the data.

Link? This was not my recollection of it at all.

Would also put Google and other search engines in violation so doesn’t seem right.


I think it's true in a technical sense. You can try to find it in the actual GDPR, but that's hard if you aren't a lawyer. The UK ICO has a neater article on this:

https://ico.org.uk/for-organisations/guide-to-data-protectio...

Basically you just need to declare them in your privacy policy and keep a record for compliance.

In terms of LI though it's really complicated and I don't think anyone here on this site is in a position to say for sure if LI applies to what OpenAI is doing. There are arguments from both sides that make sense.


As an Illinois resident, I'm going to guess I'll be getting a check from OpenAI for some biometric privacy violation at some point in the future...


They've already got red light cameras all over Illinois, so I wouldn't be surprised


Good. I wish they would be fined into bankruptcy


Just curious, why?


Even my own content has probably been used by them, against my wishes. I see it as unethical data usage. Additionally, the ethics of the company's history, changing to become entirely 'closed' while even keeping the name "OpenAI", makes me... angry, I suppose is the right word. Intuitively it seems to me that they will have a very bad influence on the world if they maintain dominance and continue to gain power


Are large datasets too much exposure or worth the risk? For OpenAI it appears the latter.


I cannot verify this, but from delving deep into electrical engineering topics, I’m guessing most of that training set is coming from expensive college textbooks and papers that are behind paywalls. My guess is it scraped all of Sci-Hub. Which is a great resource, albeit illegal.


GDPR is the best thing since sliced bread



