Google's updated privacy policy states it can use public data to train its AI

elric · 2023-07-04T13:55:25

We probably need a better definition of which data is "public". Simply being accessible doesn't cut it. I can look out my window and straight into my neighbour's bathroom. Is that public information? Same goes for information on the internet. Sure, my neighbour could put up curtains (and I really wish he would), much like people could restrict access to web pages, but I don't think a lack of protection should automagically imply public access. Much less public access for the profit of some multi-billion dollar corporation.

prepend · 2023-07-04T15:37:50

> I can look out my window and straight into my neighbour's bathroom. Is that public information?

Yes. And there have been numerous court cases confirming this (which is why we have paparazzi taking topless photos of people on private beaches from public vantage points).

I think this is a feature, not a bug, as without a straightforward rule I don’t know how society solves this without causing more harm. If something is public, it’s public. Restricting you from looking in your neighbor’s open window isn’t something that can realistically be “fixed.” Other than if I don’t want my neighbor to look in, I draw the blinds. This works for paupers and billionaires.

If I don’t want people to see things, I don’t make them public. I can’t set a limit of “only people making less than $20k are allowed to view this.”

abwizz · 2023-07-04T17:09:57

> paparazzi taking topless photos of people on private beaches from public vantage points

this is illegal almost everywhere.

e.g. in germany it is even illegal to take a picture of a person (one specific) in a public space without their consent (unless they are a person of public interest).

prepend · 2023-07-04T17:28:15

As long as the photo is taken from a public location, it is legal almost everywhere.

US [0] “In the United States, photographs that are taken for editorial use in a public place generally enjoy Constitutional protection under the right of free speech.”

Denmark [1] “you can almost without restrictions shoot anything as long as what you're seeing is visible from public property. You are allowed to shoot people, including police officers or other government officials.”

In Germany [2], you can photograph people from public locations but not if they are nude or vulnerable or in their home. “You can’t take photos of people if it shows their helplessness. For example, you can’t take photos of accident victims, drunk people or nude people without their permission.”

[0] https://www.hg.org/legal-articles/what-are-the-laws-regardin...

[1] https://law.photography/law/street-photography-laws-in-denma...

mcbutterbunz · 2023-07-04T17:43:20

The link you posted [0] states there are limitations on the use. Places where you would have a reasonable expectation of privacy are not allowed to be photographed. Like at an ATM or in a public restroom.

prepend · 2023-07-04T20:37:56

Indeed. There are limitations. Restrooms aren’t public, they are places where privacy is expected. The link I posted describes how in the US it’s perfectly legal to take a picture through someone’s open window (as long as you don’t trespass or do anything illegal to take it).

My point isn’t that there are no limitations on photography or use. My point is that if you make something public, people can view it and use it. And that’s legal. People seem confused about this that somehow consent is required for use. Not for things publicly released. (In the US and many countries at least)

onli · 2023-07-04T17:47:16

> In Germany [2], you can photograph people from public locations but not if they are nude or vulnerable or in their home. “You can’t take photos of people if it shows their helplessness. For example, you can’t take photos of accident victims, drunk people or nude people without their permission.”

Not how I learned it. You can take pictures of public spaces as long as a specific person is not the focus of the picture. The other aspects, like the invasion of privacy when a person is nude, only come on top of that.

The law around questions like this is not definitive, so a lot depends on recent court decisions.

prepend · 2023-07-04T20:40:08

The actual law is cited, so perhaps law changed since you learned it. Or you are just incorrect. According to German law and court decisions, it seems quite definitive. Although I know very little about German law, perhaps there’s some local law that supersedes regional, national, and EU law.

What I try to do is track to the actual source rather than relying on my own memory of things.

onli · 2023-07-04T21:13:00

I see no link for the [2] in your comment. I sincerely doubt I'm wrong about this, you can read about it for example here: https://hoesmann.eu/fotos-von-gebauden-personen-und-marken-i.... It also explains under which exceptions pictures of a single person can be valid anyway, without explicit permission.

> According to German law and court decisions, it seems quite definitive.

That's impossible. Basically nothing is definitive in German Medienrecht ;) OK, not really true, but things can change, it is a less well-defined area, and you'd have to know the relevant case law really well to be almost certain here.

prepend · 2023-07-05T03:15:27

I’m so sorry, I left out the citation and had it in my clipboard.

Here it is: [2] https://allaboutberlin.com/guides/photography-laws-germany

And it actually references the same article you posted.

onli · 2023-07-05T08:09:41

No problem. There is https://www.gesetze-im-internet.de/kunsturhg/__22.html plus https://www.gesetze-im-internet.de/kunsturhg/__23.html - note that it is about distributing and publishing, but that just taking a photo can trigger this somewhat, from the article I linked above:

> According to the Federal Constitutional Court (BverfGE NJW 2000, 1021), from this point on [that a photo is taken] the person depicted without consent has already lost control over the image, and this possible loss of control can, under certain circumstances, even justify a ban on taking photographs.

But that is debatable and actually the point that could be outdated.

In practice, the guide is not completely wrong: You can take pictures in public spaces and they can show people, but those people should not be the focus of the image and you might want to make them unidentifiable if you publish the image.

vincnetas · 2023-07-09T07:14:24

one thing is to take pictures of things that you as a person see. i doubt that you can use these pictures any way you like. for example, using them as magazine covers without the consent of the depicted person, provided it's a portrait. or, for example, tracking people who regularly walk through the same place.

tyfon · 2023-07-04T17:47:39

It's the same in Norway.

Taking a photo of a public square with a lot of people in it is still legal without asking everyone for consent. But the topless photo example would not be legal at all without consent or even just a photo where one specific person or small group is the focus.

Edit: taking the photo itself might be ok in the latter case, but publishing it (like posting to facebook or a blog or forum) is not.

SoftTalker · 2023-07-04T18:39:33

What if you're on the beach and taking a photo of your kid and there's a topless woman in the background. Very common in Scandinavia. And if there are German tourists there, they might be completely naked.

wukerplank · 2023-07-04T19:06:42

Taking a picture vs publishing/selling a picture are separate issues.

DropInIn · 2023-07-05T07:20:47

Yea no...

That ends up being a prohibition on taking photos in public at all except in areas devoid of others, as it is impossible/unreasonable to get the written consent of all parties who would be photographed in the background of a photo of a person who requested their photo be taken....

VirusNewbie · 2023-07-04T17:22:27

so if I take a picture of my wife on vacation in germany with someone in the background I'm breaking the law?

mxscho · 2023-07-04T17:31:22

Not necessarily, as there are other exceptions to this rule. A person can also be "Beiwerk" [1] (= accessory/props) which in the context of personality rights means that you are not the main focus point of the picture.

[1] https://de.wikipedia.org/wiki/Beiwerk

nicbou · 2023-07-04T17:27:59

There are clear exceptions for people who happen to be in the picture.

darkclouds · 2023-07-04T19:30:35

I don't think so, as Kate Middleton, future Queen of England, found out.

******Warning****** NSFW pictures of the future Queen of England at this URL --> https://theoutsidersadi.wordpress.com/2012/09/14/click-here-...

The (UK) press get around the law by reporting a story and then relying on the reader using search engines to get the information from other jurisdictions, like I have just done here, located in the UK, via EU Google servers and wherever the WordPress server is located.

It kind of makes me think, what is the point of law?

You need a lot of money to fight these entities and fight them in multiple jurisdictions where relevant laws exist.

What also makes a mockery of the legal system, or at least of the Royal lawyers, is that whilst Kate Middleton has some injunction to block press publication in the UK, they can't stop the search engines from publishing the data to users in the UK, as I have just demonstrated with the link above and the search term "kate topless holiday photo" and then clicking the images option!

Now what if I meant "kate beckinsale topless photo" instead of "kate topless photo"? I just got someone else's topless photos without even expecting them, i.e. the future Queen of England's.

Are the Royal lawyers from the stone age, do they not understand search engines, or do they buy the targeted filter bubble narrative highlighted by Eli Pariser in his TED talk?

I like German privacy, they even stood up to Google and the Streetview project, however Google have edited me from their Streetview images where I'm giving their car the bird as it drove past me so there is some human oversight, but they still have that data and refuse to hand it over via GDPR DSAR requests.

Too many big entities, including the Police, fulfil GDPR DSAR requests by relying on being able to identify the individual using today's existing systems.

If someone can't be identified, like me giving the Google Streetview car the bird in their Streetview data, Google will say no data exists.

Yet GDPR law doesn't address developing and future technology which will be able to identify me giving the Google Streetview car the bird if they run facial recognition over their Streetview data.

So the GDPR DSAR is useless law as is, and although I haven't read it, I suspect Google's privacy policy is as well.

humanistbot · 2023-07-04T15:52:38

> Yes. And there have been numerous court cases confirming this

In the US. It is different in the EU. I know HN is US-centric, but tech is global and there are more people in the EU than the US.

adventured · 2023-07-04T17:03:47

> but tech is global and there are more people in the EU than the US

Why would population number be the determining factor on anything? The world obviously doesn't operate based on direct democracy. India having 1.4 billion people doesn't give it a greater ability to dictate anything vs the EU, US or China. It comes down to power (always will, always has). For example, the US - for now - has the ability to dictate certain things to China on trade restrictions, given the US technology advantage. China is a legitimate superpower economically, has four times the population, and yet the US can still do that.

If the premise is consumer numbers: the US still has the EU beat even with fewer consumers, with a far larger, far more valuable economy.

Besides all of that, naturally each country (or as a union), to the extent it can, will attempt to set its own rules for tech. The US will do so, the EU will do so, China will do so, etc.

flangola7 · 2023-07-04T18:49:13

I can't think of a better metric than individual count.

halJordan · 2023-07-04T20:42:02

What are the substantial differences? (I don't think there are any)

dawnerd · 2023-07-04T15:47:04

What if someone leaks info, is that public now? What if someone shares my private info without permission? Should Google be allowed to train on it? How would their systems know? What about pirated or other illegal content? The lines are not quite as clear to computers.

prepend · 2023-07-04T17:30:26

IANAL, but leaking is a crime, accessing once public is not. There was a lot of this back during the height of Wikileaks where there were questions about reading classified material being a crime. It is, but once it’s leaked, it is no longer classified, so public.

So the crime would be on the leaker, not on Google for training on it.

snerbles · 2023-07-04T21:43:02

IANAL either, but I have held a security clearance and this is very dangerous advice.

Unless explicitly declassified the leaked information remains classified, and those who hold security clearances are legally required to avoid all classified information outside of their need-to-know [0].

Now if you have never held a U.S. security clearance, you're less likely to be prosecuted but the history and precedent is murky [1]. The average Joe Sixpack checking out Wikileaks is probably safe, but if I were a journalist publishing the next round of Pentagon Papers I would much rather have a small army of lawyers and a friend or two in Congress.

EDIT - Google has federal contracts, so they are probably bound by similar agreements to at least make an effort to avoid any such leaks in their training data for public-facing models.

[0] https://www.csmonitor.com/USA/Foreign-Policy/2010/1207/US-to...

[1] https://www.npr.org/sections/thetwo-way/2017/03/22/521009791...

prepend · 2023-07-05T03:18:36

I have a security clearance and the guidance I received (and was explicitly asked about during my recertification) is that I cannot view Wikileaks at all, or any classified material leaked there.

So I didn’t view Wikileaks or any material as I don’t want to lose my clearance.

But the question was about whether it’s legal or not to view leaked material. Security clearances are a different matter and are going a step beyond what’s legal or admissible in court or what someone would be prosecuted for.

dawnerd · 2023-07-04T21:09:14

But the question is whether Google should train on it, not whether it’s legal. Just because it’s public doesn’t mean you have the right to use it. I think that’s the real question we’ll need to figure out.

littlestymaar · 2023-07-04T17:09:43

> > I can look out my window and straight into my neighbour's bathroom. Is that public information?

> Yes

Not in my country at least, but if you will I can suggest another example: does broadcasting a movie or a song through to the public make it public good? No. We have laws (highly variable between jurisdictions) that set the rules for these “published data”, and they are actually dependent on many factors including whether or not you're making money out of it (you can lend a DVD to your friend for free, but you cannot start a DVD-renting business without a licence from the copyright owners).

prepend · 2023-07-04T17:32:12

> broadcasting a movie or a song through to the public make it public good

Movies and songs are copyrighted and can’t be rebroadcast or copied without license. But viewing is perfectly legal and does not require a license.

And you can certainly start a DVD-renting business without a license from copyright holders (assuming you bought the DVD). You can’t start a dvd streaming business without a license.

littlestymaar · 2023-07-04T19:33:36

> Movies and songs are copyrighted and can’t be rebroadcast or copied without license. But viewing is perfectly legal and does not require a license.

That's exactly the point: “publishing” something only gives you some limited rights (and in certain jurisdictions like mine, most of these rights are limited to individuals only, and organizations are excluded).

> And you can certainly start a DVD-renting business without a license from copyright holders (assuming you bought the DVD)

Not in my country, again.

prepend · 2023-07-04T20:43:15

What country are you in? It sounds unusual that you can’t sell property that you’ve bought (a DVD).

Publishing rights are for the publisher, not the viewer. As a viewer, I can view the material and don’t have restrictions on whether I can remember it or not (i.e., run it through an algorithm to train a model as part of the viewing).

littlestymaar · 2023-07-05T07:25:14

> What country are you in? It sounds unusual that you can’t sell property that you’ve bought (a DVD).

France. You can re-sell the DVD all you want, but you cannot lend them for money (because it means sharing the copyrighted material) without permission.

> As a viewer, I can view the material and don’t have restrictions on whether I can remember it or not

In my country you even have a legal right to record it (=run it through an algorithm that “trains a model”, be it an overfitted one) for your own use, if you're an individual but not an organisation. And even as an individual, you can remember every line of the movie, but if you make a new one with the same dialogue, then it's plagiarism.

Making something public (aka: publishing it) doesn't mean giving up all rights on it.

scrum-treats · 2023-07-05T18:23:01

This is why situations like VisualMic and visual-based cryptanalysis tools are troublesome (https://news.ycombinator.com/item?id=36331446). All you need is a decent smartphone (e.g., Samsung Galaxy S22) and open-source software.

Quite a vulnerable place humans are in at present.

detourdog · 2023-07-04T15:10:07

What is so crazy is that this seems like a disincentive to publish and share ideas. If anything you say can be used in an AI against you... or, alternatively, it is not your idea because the AI came up with it... This has nothing to do with AI and everything to do with the crackpots in charge.

sigmoid10 · 2023-07-04T15:26:50

Unfortunately it's not so easy. If I as a person (which is essentially just a biological neural network) can simply go to websites, read the content and use the information I gathered to create slightly modified new content without repercussions, who's to say that an artificial neural network should not be allowed to do that? Just because I'm not as fast? What if I hire 1000 workers in a low wage country to do it for me? As AI capabilities grow, this separation will grow even narrower in the future. There's no realistic way to differentiate web access for human purposes vs. AI purposes in the long run.

to11mtm · 2023-07-05T01:19:50

I as a person understand that the GPL, AGPL and its ilk are (at least in my moral view) a certain level of sacrosanct.

I might read GPL code once in a while. But I would never copy-paste it when someone asks for, say, how to do a fast inverse square root. (I don't really read AGPL code b/c most AGPL companies I've encountered strike me as salivating for a reason to force a license on a business entity.) The closest I did was looking at GPL code once upon a time for some geo-transform code, and frankly didn't use any of it and instead used a USGS book to re-implement everything in a fully legal-safe way.

sigmoid10 · 2023-07-07T07:55:46

How much copy pasted code on stackoverflow has a license attribution? And on top of that, even more restrictive licenses wouldn't change the problem - neither with humans nor with AI.

morjom · 2023-07-04T17:45:23

>Just because I'm not(...)

It's because as a human you can be held accountable and can feel consequences, and yes, also because you aren't as fast.

I feel like many people just equivalent "human brain and consciousness" with "neural network" way too quickly. You can't remove the human factor however much you try to equivelate(?) it with a program.

(? = non english speaker)

sigmoid10 · 2023-07-04T21:42:19

But you aren't held accountable. At least not as much as LLMs nowadays. And speed is also irrelevant as stated above.

salawat · 2023-07-04T15:38:33

Yes there is.

You can spin up more compute with a credit card. You can't make 1000 people in the same manner, nor can you own them.

Let's be real here though. The only reason anyone is drooling over AI is because it potentially allows one to elide paying someone else, which means more money for them.

evandale · 2023-07-04T15:56:22

You can use your credit card to find and convince 1000 people to do your bidding, why is owning them a requirement? You don't "own" the compute you spin up either, you're temporarily borrowing it.

detourdog · 2023-07-04T22:55:08

I see the AI business model as a two-pronged approach. Create an idea-generating system so complex that one can legally elude responsibility, followed by rent-seeking opportunities from the generated ideas.

CamperBob2 · 2023-07-04T17:24:50

Yes, God forbid we build powerful new tools that extend human knowledge, insight, and productivity in directions previously undreamed-of. Mah coppy rite is more important! Thereoughttabealaw!

As usual in these scenarios, the only real injustice is that the people who tried to stand in the way will enjoy the benefits of progress in AI alongside those who worked to make it happen. So it goes, I guess.

salawat · 2023-07-07T21:55:56

I'm no ally of copyright; however, as long as it's here and a thing to be dealt with, I'm not going to cheer on a company operating on the back of flagrant disregard thereof. This isn't "code I'm using in a personal project that maybe only my friends will interact with". This is a full fledged business, owned in part by of all people, Microsoft, the people who rammed copyright down our throats for the last 3 to 6 decades while doing everything they could to cripple FLOSS.

I expect the absolute most aggressive enforcement of copyright in this case.

As to my more general assertion of AI only getting the traction it is because of an industry looking to devalue its currently incredibly highly priced laborers; spend a bit of time around shareholders/management types and you'll soon understand why I think the way I do. Magical thinking, "as long as it makes my outlays lower" thinking is par for the course. There are, in fact, social classes who see the "hired help" as something meant to be out of sight, out of mind, and lucky they get what they are willing to give.

Besides the above hot take, I also see AI as being fundamentally disruptive to the human social fabric. I'm not convinced that as a society we're even prepared to have a real conversation with regard to a technology that at any time could cross a threshold to sapience. The choruses of such individuals as Carmack and plenty of other HN posters on "it's just a statistical model", and "lets wait til it's at least a developmentally challenged toddler before worrying about those types of concerns" (where those types of questions are those with regard to sapience, and the matter of where the line between "just a statistical model" lies) only prove my point. The reductionist viewpoint will be stretched right up to the point that there's a court case where the public finds out that training or instantiating models that communicate with one another basically involves torturing a collective mind that no one bothered to see that way because it was just so stupidly productive.

Hell, the outcome of said case would probably be shifting research in a direction whereby it's possible to make a construct that just barely toes the line. Which misses the entire moral point.

You could say I'm fairly black-pilled on the matter. Humanity can't even deal with one another, or competently raise their own children. We don't need to be committing terrible parenting on an industrial scale.

...If you've read through all of this, you're probably a better person than I currently am, but know there was a time I shared your attitude toward the subject matter. Then I really started to pay attention to how people treat one another, and how money actually gets earmarked for different things. The learning experience is something I'd not wish on anyone, but as you, and our shared friends the Tralfamadorians say,

So it goes.

detourdog · 2023-07-05T17:36:33

My issue isn’t the mechanics of what is happening. I believe the AI is being used as a shim to work around otherwise prohibited behavior.

JohnFen · 2023-07-05T17:23:33

> who's to say that an artificial neural network should not be allowed to do that?

If I'm the creator of the work, I get to say that. That I have no means to enforce that is precisely why I've taken all of my work off of the public web.

> There's no realistic way to differentiate web access for human purposes vs. AI purposes in the long run.

Right, which is a very serious problem.

detourdog · 2023-07-04T18:39:48

As an individual I would like to use bots to do my web surfing. Where I see the problem is large corporations using web-scraped ideas to patent/copyright ideas. If the data produced by AI is as open as the web-scraped data, that seems fine.

sigmoid10 · 2023-07-04T21:44:48

If you produce code from reading stackoverflow or github for some company, it will also own it - not you. AI will only be faster at producing stuff for these companies.

detourdog · 2023-07-04T22:51:36

I'm not interested in code as much as other forms of human expression. Imagine having to convince the courts that you said something first no matter what the expert AI states.

I still don't see it as an AI failure so much as a human failure in the use of sophisticated tools.

ajsnigrutin · 2023-07-04T14:01:58

Yep. Most of the books are also public, since anyone can borrow them for free in a library, but they're still protected by copyright. Same for movies and tv shows broadcasted over the air.

zirgs · 2023-07-04T14:04:02

They are protected by copyright, but anyone is free to learn from them.

elric · 2023-07-04T14:06:59

Sure, but there have been many copyright cases about plagiarism (in writing and in music). It's rarely cut and dried. There's a fine line between inspiration and plagiarism, which seemingly can only be settled in court. That approach is not feasible when dealing with the amount of data (and copyright holders) that Google gobbles up.

zirgs · 2023-07-04T19:46:20

Using AI is probably the most inefficient way to obtain copyrighted content. It's much faster to simply find the original image or text and copy that instead.

madhadron · 2023-07-04T15:33:38

The fact that we call the curve fitting/optimization/compression that we do to fit machine learning models to input data "learning" is really unfortunate and leads to this kind of conflation.

If we trace the path of how we ended up here, it's similar to how people incorrectly refer to loci of DNA as genes. We have behavior analysis where we speak of learning as the conditioning via the antecedent-behavior-consequence loop. There was the Hebbian theory of how the ABC loop manifested physically in neurons. Early neural net papers took inspiration from that mechanism and called it learning.

Meanwhile, actual learning is far, far richer than the Hebbian theory of synaptic strengthening, and has a lot more going on than just operant conditioning.

So, please, it's time for everyone to stop pretending that the fact that ML inherited the word "learn" as a term of art for curve fitting has any philosophical weight.

6gvONxR4sf7o · 2023-07-04T14:35:26

“Free to learn from them” isn’t specific enough. The question is “in what ways are people free to profit off of them?”

ajsnigrutin · 2023-07-04T14:05:50

Even AI? :)

Y-bar · 2023-07-04T14:27:39

Putting aside the fact that what we call AI today is not learning in the same way as humans. They operate on a VASTLY different scale compared to humans. On a good week I can read a book. A single book. A massively parallelised data centre can do that billions or trillions of times faster. Scale of effect (lacking a better phrase) must be considered.

zmjjmz · 2023-07-04T14:22:36

Do you believe that a human learning from a book is fundamentally different than training a model off of it, and thus should be regulated differently?

mjr00 · 2023-07-04T14:44:06

There have always been legal differences between a human doing something and technology doing the "same" thing.

It's legal for me to go to a nude beach and stare at a topless woman. It's probably legal for me to draw a picture of that topless woman and distribute it. It's definitely not legal for me to take pictures of that topless woman with my phone and post them on the internet.

It's legal for me to overhear a conversation you and your friend are having on a bus. It's legal for me to transcribe what I heard and post it online. In most jurisdictions, it's not legal for me to record that conversation.

Ingesting data for use in machine learning models is still too new to have any specific legislation around it. But the argument that the technology is just doing a thing that humans do has zero relevance.

og_kalu · 2023-07-04T15:02:44

>It's definitely not legal for me to take pictures of that topless woman with my phone and post them on the internet.

This is legal. You can take pictures of anyone, nude or not in a public setting and post them anywhere.

>It's legal for me to transcribe what I heard and post it online.

This is murky. It's legal to take notes of what you've heard but that comes with all the pitfalls of hearsay. Legally, it's not treated as the human equivalent of recording because humans have no such equivalent.

ajsnigrutin · 2023-07-04T15:14:38

The first one is iffy.... probably depends on a country too. If you take a panoramic shot of the beach, and someone is randomly topless.. sure. If you take a telephoto lens and single them out, it's questionable, and in many countries, illegal. Same as with walking vs following someone, even in public... the intent is different, so is the legality.

staticman2 · 2023-07-04T14:26:19

If corporations are allowed to own AI there's a strong argument that it shouldn't be treated anything like a human.

Humans aren't property so of course they should be regulated differently from AI.

JohnFen · 2023-07-05T17:27:03

I believe this, yes.

0xParlay · 2023-07-04T14:46:31

Seems to be well defined and accepted in regards to Third Party Doctrine. Would be silly if AI were the motivation to reconsider what is "public" as opposed to the blatantly obvious run-around of the American bill of rights currently in use. But hey I'll take it.

tensor · 2023-07-04T14:43:38

Maybe something like an extension of robots.txt. E.g. if you don't allow crawling then the data is not public. I think that's fair; after all, if you want complete strangers to be able to search for it, arguably you are asking for it to be public.

JohnFen · 2023-07-05T17:29:18

The problem with robots.txt is that it depends on the crawler to respect it (and many do not). It's not an adequate protection against crawlers in general and a specific form just for AI wouldn't be adequate either, for the same reason.
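
To make that concrete, here is a minimal sketch in Python of what "respecting" robots.txt amounts to on the crawler's side. It uses the standard library's urllib.robotparser; the `ExampleAIBot` user agent, the example.com URL, and the `may_fetch` helper are made up for illustration and are not any real crawler's name or API. The check is something the crawler chooses to run before fetching. Nothing on the server enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This is purely advisory: a well-behaved crawler calls something like
    this before each request, a badly behaved one simply never does.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post/1"
    print("ExampleAIBot allowed:", may_fetch("ExampleAIBot", page))
    print("Browser allowed:", may_fetch("Mozilla/5.0", page))
```

A crawler that skips the `may_fetch` call gets the page anyway, which is why an AI-specific directive would only ever bind the crawlers that opt in. A login, by contrast, is enforced by the server.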

pessimizer · 2023-07-04T19:10:22

> "But honestly Monica, the web is considered "public domain" and you should be happy we just didn't "lift" your whole article and put someone else's name on it! It happens a lot, clearly more than you are aware of, especially on college campuses, and the workplace. If you took offence and are unhappy, I am sorry, but you as a professional should know that the article we used written by you was in very bad need of editing, and is much better now than was originally. Now it will work well for your portfolio. For that reason, I have a bit of a difficult time with your requests for monetary gain, albeit for such a fine (and very wealthy!) institution. We put some time into rewrites, you should compensate me! I never charge young writers for advice or rewriting poorly written pieces, and have many who write for me... ALWAYS for free!"

13 years ago, we treated cookssource.com like they were rubes, but they were just too early and too small.

"The web is considered 'public domain'"

https://illadore.livejournal.com/30674.html

https://news.ycombinator.com/item?id=1868736

-----

A Follow-Up to "The Web is Public Domain"

https://web.archive.org/web/20101112141752/http://www.cookss...

https://news.ycombinator.com/item?id=1911977

DeusExMachina · 2023-07-04T16:49:28

I don't think the definition of public is even enough here. Open-source code is public but covered by specific licenses. That's still legally undefined, but there are already ongoing lawsuits.

tmaly · 2023-07-04T18:21:40

what about Google books? Do they get a pass with this?

bhickey · 2023-07-04T18:26:05

https://en.m.wikipedia.org/wiki/Authors_Guild,_Inc._v._Googl....

tmaly · 2023-07-06T03:14:50

But using it to train an LLM and profiting off said LLM is different than providing a public service.

_Algernon_ · 2023-07-04T13:52:17

It's pretty crazy that changes in privacy policies apply ex post facto on data that was generated before the privacy policy change.

hedora · 2023-07-04T15:22:05

... and also to people that do not use google services, and have never read their privacy policy.

kdavis · 2023-07-04T13:37:01

Generally this is already "the law of the land" in the US via the HiQ Labs v. LinkedIn precedent[1]. (IANAL)

[1] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

tensor · 2023-07-04T14:41:46

This is a non-story. Look at the actual changes. Nothing fundamentally new was added, but the language was broadened slightly in a completely unsurprising way.

Language model -> AI model features -> product and features

It just looks like their lawyers tidying up the words but the intention is unchanged.

jonathankoren · 2023-07-04T16:42:52

I can’t help but think all this brouhaha over web scraping is just a rehash of the web scraping panic of the 1990s.

Put it behind a robots.txt or better yet a login.

This battle was fought and won. I’m not going back to the bad old days.

nicbou · 2023-07-04T17:31:19

I want my website to be accessible to humans, and even indexed by search engines. I just don't want to be used as unpaid labour to train artificial intelligence.

A bit like I will gladly build a bike for someone, but object to the bike being sold around the corner.

jonathankoren · 2023-07-04T20:28:40

This was the same argument made against search engines and in favor of human-curated hierarchies like the original Yahoo and DMoz.

How do you feel about Wikipedia being used for products, period, let alone used for data products, since that’s also repurposed unpaid labor?

nicbou · 2023-07-04T21:59:41

Wikipedia hosts people's hard work and makes it available to the world.

Search engines help connect people to the things they want. At least they mostly do.

AI inserts itself in the middle and strips the creators of credit, recognition or income.

jonathankoren · 2023-07-04T23:31:33

Given that you used the term “AI”, you may not know that there’s a lot more to a general web search engine than a simple inverted index. There are multiple models (i.e. “AI”) used to determine things such as query intent, term disambiguation, and query classification, and that’s even before we get to results ranking or relevant snippet detection. So now, with that knowledge, do you believe that Wikipedia, or even the web crawl itself (because that’s literally what we’re talking about, a web crawl), should be used for anything beyond naïve term indexing?
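For anyone unfamiliar with the term, a "simple inverted index" maps each term to the set of documents containing it. A rough sketch in Python, with made-up documents and hypothetical function names, and none of the modeling layers described above:

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased, whitespace-separated term to the ids of docs containing it."""
    index: defaultdict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of docs containing every query term (naive AND matching, no ranking)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    # Made-up documents, for illustration only.
    docs = {
        "a": "web crawl data",
        "b": "web search engine",
    }
    index = build_inverted_index(docs)
    print(search(index, "web crawl"))
```

This is the naive baseline: exact term matching with no notion of intent, disambiguation, or ranking.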

nicbou · 2023-07-05T08:58:23

Given the context, it's assumed that "AI" will not be used merely for providing better search results, but for keeping people on Google.com with "original" answers.

jonathankoren · 2023-07-05T17:17:25

So just like the already existing info boxes, which I presume you have a moral problem with?

nicbou · 2023-07-06T12:47:14

You can opt out of them. I'm fine with them, although they tend to be comically wrong.

djhn · 2023-07-05T07:23:41

If I want to read up more on this, could you throw some search terms at me that would help me find some more specific pieces of this conversation/debate around search engines' access to data? Any books, particular legal cases, prominent people or organisations engaged in this debate?

troyvit · 2023-07-05T16:34:21

I feel the same way too, but I have to ask myself, why do I deserve a search engine to automatically index my content and then show it to others without my having to give anything in return? Let's say I actually wrote something of value, put an ad next to it, and it generated a hundred thousand organic search views. Those 100k views were basically given to me for free by a search engine. Do I owe them anything for that?

undrcvr95 · 2023-07-06T16:38:15

No of course you don't owe them anything. They opted in to doing this and didn't even consult you before doing so. They did this because they have found a way to monetize this, which is selling their own ads. On the other hand you just posted a blog post. You didn't opt in to anything related to search engines. You just posted something, that's it.

troyvit · 2023-07-11T17:23:10

Yeah that's a good point. I didn't make a contract with them.

puzzledobserver · 2023-07-04T18:02:17

I mostly agree with you, but as a counter-point: How would one precisely draw the lines between search engines and LLMs? One provides mostly airtight attributions, and the other is famous for hallucinating citations. But this doesn't sound like a distinction that can be used as the basis for law?

Or are you less concerned with their design and more concerned with their purpose? Providing citations is fine, but creating content is not?

nicbou · 2023-07-04T21:57:11

One brings traffic to the original work, supporting its creators. The other uses the creators' labour without giving anything in return.

JohnFen · 2023-07-05T22:32:08

> or better yet a login.

That's what I've done along with a growing number of others -- but I'd much prefer to be able to make it available to the general public. I mourn that's not possible without also aiding the training of AI.

ChrisArchitect · 2023-07-04T15:00:03

[dupe]

Discussion from yesterday: https://news.ycombinator.com/item?id=36577626

Takennickname · 2023-07-04T19:39:49

If anyone thinks Google respects your privacy and this is an isolated incident then boy do I have news for you LOL

kmbfjr · 2023-07-04T21:05:10

Okay, if that is how we are going to play it.

My personal web site, which is essentially a doc page on numerous home lab projects and other technical writings, is going dark.

I’m not training your new search algorithms so you can directly profit from my writings. I didn’t seek payments prior to this, but now this is just plagiarizing my work.

See ya

djmips · 2023-07-05T10:30:49

Can you not exclude it with robots.txt? It would be a shame for humans to lose the resource you intended to share.

JohnFen · 2023-07-05T22:27:18

robots.txt is not nearly sufficient for this purpose.

pessimizer · 2023-07-04T18:55:40

That's nothing. I changed my personal privacy policy to allow myself to use google's copyrighted source code. Because that's how it works.

Animats · 2023-07-04T17:33:13

That just says what Google wants to do. Google's privacy policy has no legal effect on anyone not in a contractual relationship with Google.

karaterobot · 2023-07-04T17:10:14

Is this something they can confidently define in a privacy policy? This feels like its proper scope would be legislation. Indeed, legislation which is very much unsettled and ongoing. My assumption is they'll do as much as the law allows, but if the law doesn't allow it, their privacy policy isn't going to make a difference.

worksonmine · 2023-07-04T18:45:22

Under what jurisdiction am I allowed to create a policy to use whatever I want however I want? Playing the devil's advocate, I have to assume their lawyers didn't just pull this out of their asses?

Now I'm making a policy that I'm allowed to (ab)use Google's services however I want. They can find the policy themselves. God I hate Google.

tinus_hn · 2023-07-05T10:57:22

A strange proposition, why would their privacy policy apply to websites scraped by their bot? ‘By being scraped you agree to be bound by these rules’?

phoe-krk · 2023-07-04T13:26:00

I guess they did the math on the probabilities and figured that doing this will gain them more money, even accounting for the inevitable fines that come from it.

firstSpeaker · 2023-07-04T14:47:01

Likely true. Previously it was attribution data that was the most valuable, since advertising benefits from that. With LLMs and whatever comes next, I imagine every bit of data, structured or unstructured, accurate or not, is going to have some value to someone somewhere.

smrtinsert · 2023-07-04T17:02:57

Of course they did. It was probably the primary factor. Google is a business looking to survive the technology impact of LLMs, not a tool to improve humans' lives.

villgax · 2023-07-04T16:57:27

There's no limit to such idiocy. I shall ingest everything public on Google properties coz scraping my website constitutes agreement to my conjured up licence