Google's Gemini AI caught scanning Google Drive PDF files without permission

Cthulhu_ · 2024-07-15T08:23:33.000000Z

Just reiterates that you don't own your data hosted on cloud providers; this time there's a clear sign, but I can guarantee that google's systems read and aggregated data inside your private docs ages ago.

This concern was first raised when Gmail started, 20 years ago now; at the time people reeled at the idea of "google reads your emails to give you ads", but at the same time the 1 GB inbox and fresh UI was a compelling argument.

I think they learned from it, and google drive and co were less "scary" or less overt with scanning the stuff you have in it, also because they wanted to get that sweet corporate money.

shadowgovt · 2024-07-15T11:54:49.000000Z

Of course Google reads and aggregates data inside your private docs. How would it provide search over your documents otherwise?

greycol · 2024-07-16T21:57:30.000000Z

This feels a lot like "Of course they use their hands, they couldn't give you a massage otherwise" but it's in reply to a news article about the person who agreed to being touched being punched.

dylan604 · 2024-07-15T14:36:57.000000Z

When I hit search, do the search right then. Don't grep out of a stored cache of prior searches.

shadowgovt · 2024-07-15T16:49:41.000000Z

The thing that makes it possible for search to be fast is pre-crawling and pre-indexing.

Some other engines don't do this, and the difference is remarkabe. Try a full-content search in Windows 7, you'll be staring at the dialog for two minutes while it tries to find a file that's in the same directory as you started the search in.

dylan604 · 2024-07-15T16:58:35.000000Z

You said nothing about fast in your original though, so now you've moved the goal posts

shadowgovt · 2024-07-15T23:06:04.000000Z

I'm not really engaging in forensics-style debate. If you don't already know why "fast" is so integral to search as a feature that it goes without mention, I don't think we are enough on the same page to discourse on the topic.

intern4tional · 2024-07-15T23:12:35.000000Z

I think returning results in a timely manner is more than an acceptable assumption.

The poster clearly thought about search in terms of the existing Google search functionality which is near instantaneous.

Usability matters to the average end user and a delayed search is not usable for most people.

mark_l_watson · 2024-07-15T13:02:47.000000Z

re: data on cloud providers: I trust ProtonDrive to not use my data because it is encrypted in transit and in place.

Apple now encrypts most data in transit and in place also, and they document which data is protected. I am up in the air on whether a future Apple will want to use my data for training public models. Apple’s design of pre trained core LLMs, with local training of pluggable fine tuning layers would seem to be fine, privacy wise, but I don’t really know.

I tend to trust the privacy of Google Drive less because I have authorized access to drive from Colab Pro, and a few third parties. That said, if this article is true, then less trust.

Your analogy with early Gmail is good. I got access to Gmail three years before it became public (Peter Norvig gave me an early private invite) and I liked, at the time, very relevant ads next to my Gmail. I also, gave Google AI plus (or whatever they called their $20/month service) full access to all my Google properties because I wanted to experiment with the usefulness of LLMs integrated into a Workplace type environment.

So, I have on my own volition surrendered privacy if Google properties.

dylan604 · 2024-07-15T14:34:54.000000Z

All it takes is a "simple" typo in the code that checks if the user has granted access to their content. Something as amateur (which I still find myself occasionally doing) as "if (allowInvasiveScanning = true)" that goes "undetected" for any period of time gives them the a way out yet still gains them access to all the things. Just scanning these docs one time is all they need.

Aurornis · 2024-07-15T12:03:14.000000Z

> but I can guarantee that google's systems read and aggregated data inside your private docs ages ago

That is how search works, yes.

But if you’re trying to imply that everyone’s private data was scraped and loaded into their LLM, then no, that’s obviously a conspiracy theory.

It’s incredible to me that people think Google has convinced tens of thousands of engineers to quietly keep secret an epic conspiracy theory about abusing everyone’s private data.

dmvdoug · 2024-07-15T12:33:49.000000Z

I dunno, bro, software engineers have repeatedly shown total lack of wider judgment in these contexts over the years. Not to say there is, in fact, some kind of “epic conspiracy,” just that SWEs appear not to take much time to consider just what it is their code ends up being used for. Incidentally, that would be one way to start to get out of the mess we’ve found ourselves in: start holding SWEs accountable for their work. You work on privacy-destroying projects that society pushes back against, it’s fair game to put you under the microscope. Perhaps not legally, but we as a society shouldn’t hold back from directed criticism and social accountability for the individual engineers who enable this kind of shit. That will not be a popular take here. Perhaps it will be some solace to know I advocated the same kind of accountability for lawyers who enabled torture and other governmental malfeasance in the GWoT years. I was also looked at askance by other lawyers for daring to suggest such a thing. In that way, SWEs remind me of lawyers in how they view their own work. “What, I’m not personally responsible for what my client chooses to use my services for.”

Yeah, you are, actually.

f6v · 2024-07-15T13:08:48.000000Z

I was hanging out around startup incubators, and, by extension, many wantrepreneurs. When asked about business model, the knee jerk reaction was usually “we’re going to sell data!” regardless of product. I was appalled by how hard it is to keep founders from abusing the data when I worked at startups. GDPR and the likes are seen as an annoyance and they make every effort to find a loophole.

zarathustreal · 2024-07-15T12:16:19.000000Z

To riff on the famous Upton Sinclair quote:

“It is difficult to get an engineer to see something, when his salary depends on his not seeing it.”

psychoslave · 2024-07-15T12:21:39.000000Z

> that’s obviously a conspiracy theory.

Well, while some of our fellow humans are far too quick to jump on concluding that everything and the rest comes from some conspiracy, it shouldn't void the existence of any conspiracy as an extreme opposite.

In that case, whether these actors do it or not is almost irrelevant: they have the means and incentives to do so. What safeguard civil society is putting in place to avoid it to happen is a far more interesting matter.

fifteen1506 · 2024-07-15T12:33:38.000000Z

> It’s incredible to me that people think Google has convinced tens of thousands of engineers to quietly keep secret an epic conspiracy theory about abusing everyone’s private data.

With NDA being all over the place, it does strike me as doable.

NDAs should have a time limit.

Additionally, no-one in their right mind will be a whistleblower nowadays.

digitalsushi · 2024-07-15T12:43:43.000000Z

isn't that it's supposed to work? we just need >0 people to blow a whistle if a whistle needs blowing. we don't need to rely on the people fearing for their jobs so long as >0 people are willing to sacrifice their careers/lives when there's some injustice so great it's worth dying for

fifteen1506 · 2024-07-16T12:06:22.000000Z

Ya, no.

Sludge in a hole in your shop's backyard? Not worth it.

High level of chemicals in the air which may cause stillborns? Not worth it.

Scanning private files for an AI training? Not worth it.

Genocide? Not worth it.

There is absolutely nothing nowadays worth being banished from your livelihood.

Snowden and Assange are heroes. And insane. Threw away $$$ for their morals. Stanislav Petrov threw his career away instead of passing it up and let it be somebody's else problem.

hulitu · 2024-07-17T09:43:30.000000Z

> that’s obviously a conspiracy theory.

Of course. Google (and Apple, Microsoft not so much - but it is for your own good) will deny that they store your encryption keys.

foobarian · 2024-07-15T12:43:25.000000Z

One time I booked something on Expedia, which resulted in an itinerary email to my Gmail account. Lo and behold, minutes later I got a native CTA on the Android home screen to set up some thing or another on Google’s trip product. I dropped Android since, but Gmail is proving harder to shake.

vouaobrasil · 2024-07-15T12:22:07.000000Z

All AI should be opt-in, which includes both training and scanning. You should have to check a box that says "I would like to use AI features", and the accompanying text should be crystal clear what that means.

This should be mandatory, enforced, and come with strict fines for companies that do not comply.

crazygringo · 2024-07-15T16:19:38.000000Z

Training I can understand, but why scanning?

It's literally just running an algorithm over your data and spitting out the results for you. Fundamentally it's no different from spellcheck, or automatically creating a table of contents from header styles.

As long as the results stay private to you (which in this case, they are), I don't see what the concern is. The fact that the algorithm is LLM-based has zero relevance regarding privacy or security.

vouaobrasil · 2024-07-15T18:28:43.000000Z

> It's literally just running an algorithm over your data and spitting out the results for you.

I don't want any results from AI. I don't even want to see them. And there is too much of a grey area. What if they use how I use the results to improve their AI. I hate AI also and want nothing to do with its automations.

If I want a document summarized, I will read it myself. I still want to be human and do things AT A REASONABLE LEVEL with my own two hands.

crazygringo · 2024-07-15T18:40:23.000000Z

Again, that's like saying you don't want any results from spell-check.

OK, sure. But then just don't use it.

The problem is that you're calling for a legal policy against it to be "mandatory, enforced, and come with strict fines".

Have your own personal preferences, that's great. But I don't want you imposing your preferences on the products I use. I want companies and the market to decide.

An auto-summary feature that is enabled by default is not something we should be asking for government regulation over, any more than we should be asking the government to prohibit wavy red lines unless they're explicitly opted into.

vouaobrasil · 2024-07-15T19:13:54.000000Z

I don't want YOUR increasing use of AI to make a world where everyone is forced to use AI in their jobs and lives because it brings short-term business benefits.

JohnFen · 2024-07-15T16:35:10.000000Z

I think that vouaobrasil was talking about scanning on the behalf of others, not scanning that you're doing on your own data. Scanning your own stuff is automatically and naturally an opt-in situation. You've consciously chosen for it to happen.

crazygringo · 2024-07-15T18:50:30.000000Z

I don't think so -- that's not what the article is about. The subject here is entirely about a product scanning your own document to summarize it for you.

theolivenbaum · 2024-07-15T16:21:11.000000Z

Except that there's still a grey area on who owns the copyright of the generated text, and they might be able to use the output without you knowing.

Suppafly · 2024-07-15T16:29:37.000000Z

Except that's not what's happening, so why pretend otherwise?

ipaddr · 2024-07-15T16:48:51.000000Z

Because tomorrow it will with little or no discussion

DebtDeflation · 2024-07-15T14:17:49.000000Z

AI is becoming the new Social Media in that users are NOT the customer they are the product. Instead of generating data for a Social Media company to use to sell ads to companies you are generating data to train their AI, in exchange you get to use their service for free.

rurp · 2024-07-15T15:06:27.000000Z

The deal keeps getting worse too. In addition to hoovering up your data for whatever products they want, Google has gotten more aggressive about pushing paid services on top of it. The amount of up-sell nags and ads have increased significantly in the past couple years. For a company like Google that kind of monetization creep only gets worse over time.

DebtDeflation · 2024-07-15T15:48:15.000000Z

Not surprising at all. Inferencing against foundation models is very expensive, training them is insanely expensive. Orders of magnitude more so than whatever was needed to run the AdWords business. I guess I should modify my original post to "in exchange you get to use our service at a somewhat subsidized price".

Rinzler89 · 2024-07-15T14:52:29.000000Z

>you are generating data to train their AI

That's why I seriously recommend everyone everywhere regularly replace their blinker fluid and such.

exe34 · 2024-07-15T17:55:45.000000Z

it's very important to replace your blinker fluids yearly, but also, polka dot paint comes in 5L tubs.

vouaobrasil · 2024-07-15T14:56:28.000000Z

This should be illegal.

phendrenad2 · 2024-07-15T15:09:22.000000Z

By "scanning" what do you mean exactly? I assume you mean for non-training purposes, in other words simply ephemerally reading docs and providing summaries. Why should that be regulated exactly?

drzaiusx11 · 2024-07-15T12:27:57.000000Z

We also need a robots.txt extension for publicly accessable file exclusion from AI training datasets. iirc there's a nascent ai.txt but not sure if anyone follows it (yet)

chias · 2024-07-15T13:56:45.000000Z

I don't think `robots.txt` works on the basis of the crawlers wanting to do this to be nice, or "socially responsible" or anything. So I don't hold up much hope that anything similar can happen again.

Early search engines had a problem, which was that when they crawled willy nilly, people would block their IP addresses. Inventing this concept of `robots.txt` worked because search engines wanted something: to avoid IP blocks, which they couldn't easily get around. And site hosts generally wanted to be indexed.

Today it's WAY harder to block relevant IP addresses, so site hosts generally can't easily block a crawler that wants its data: there is no compromise to be found here, and the imbalance of power is much stronger. And many site hosts generally don't want to be crawled for free for AI purposes at all. Pretty much anyone who sets up an `ai.txt` uses it to just reject all crawling, so there is no reason for any crawler to respect it.

mtnGoat · 2024-07-15T14:50:13.000000Z

Google ignores robots.txt as do many others. Try it yourself, setup a honeypot URL, don’t even link to it, just throw it in robots.txt, google bot will visit it at some point.

JohnFen · 2024-07-15T16:39:24.000000Z

I discovered this years ago, and it's what made me start stop bothering with robots.txt and start blocking all the crawlers I can using .htaccess, including Google's.

That's a game of whack-a-mole that always lets a few miscreants through. I used to find that an acceptable amount of error until I learned that crawlers were gathering data to be used to train LLMs. That's a situation where even a single bot getting through is very problematic.

I still haven't found a solution to that aside from no longer allowing access to my sites without an account.

pennomi · 2024-07-15T12:38:05.000000Z

I think the closest thing is the NoAI and NoImageAI meta tags, which have some relatively prominent adoption.

JohnFen · 2024-07-15T16:37:22.000000Z

robots.txt is useless as a defense mechanism (that isn't what it's trying to be). Taking the same approach for AI would likewise not be useful as a defense mechanism.

_joel · 2024-07-15T15:06:11.000000Z

Haven't some companies explicitly ignored robots.txt to scrape the sites more quickly (and pissing off a number of people)

signatoremo · 2024-07-15T12:44:26.000000Z

What is the privacy implication of AI training?

vouaobrasil · 2024-07-15T12:47:30.000000Z

I care much more about allowing my content to be used at all, despite any privacy concerns. I simply don't want one single AI model to train on my content.

meiraleal · 2024-07-15T12:58:35.000000Z

[flagged]

vouaobrasil · 2024-07-15T13:17:38.000000Z

Perhaps in response to environmental regulations to prevent toxic waste from being dumped into your own backyard, you should respond, "create your own country, then".

meiraleal · 2024-07-15T14:53:08.000000Z

You really think creating a country and creating software that respects your privacy are equally difficult?

vouaobrasil · 2024-07-15T18:29:20.000000Z

Equally difficult, no. Equally important in principle, yes.

meiraleal · 2024-07-15T19:13:46.000000Z

for someone that prefers to complain over solving the issues at hand, yes.

lobsterthief · 2024-07-15T13:23:48.000000Z

I wrote my own software. Turns out LLMs are still training on my data.

What’s the next step?

InDubioProRubio · 2024-07-15T14:32:25.000000Z

Poison the well with AI SEO? There must the equivalent to parrots for NN that can be embedded in documents.

candiddevmike · 2024-07-15T13:57:13.000000Z

Host it on a private git instance.

hobs · 2024-07-15T13:05:03.000000Z

That's like saying if you don't like ransomware just develop your own.

freetanga · 2024-07-15T14:39:56.000000Z

Why would I write ransomware for myself?

brookst · 2024-07-15T13:26:51.000000Z

Brilliant!

ClumsyPilot · 2024-07-15T13:50:50.000000Z

Maybe the correct response is to burn down their office and if they don’t like it they can create their own data

Zambyte · 2024-07-15T13:30:27.000000Z

Just wait until you hear about Copilot :D

jerpint · 2024-07-15T12:51:59.000000Z

Models can easily regurgitate back training data verbatim, so anything private can be in theory accessed by anyone without proper access to that file

brookst · 2024-07-15T13:28:35.000000Z

This is partly true but less and less every day.

IMO the bigger concern is that this data is not just used to train models. It is stored, completely verbatim, in the training set data. They aren’t pulling from PDFs in realtime during training runs, they’re aggregating all of that text and storing it somewhere. And that somewhere is prone to employees viewing, leakage to the internet, etc.

oblio · 2024-07-15T13:37:25.000000Z

> This is partly true but less and less every day.

Isn't this like encryption, though?

I'm fairly sure that the cryptography community basically says: if someone has a copy of your encrypted data for a long time, the likelihood over time for them to be able to read it approaches 100%, regardless of the current security standard you're using.

Who could possibly guarantee that whatever LLM is safe now will be safe at all times over the next 5-10-20 years? And if they're guaranteeing, they're lying.

brookst · 2024-07-15T19:11:44.000000Z

I think it’s different, unless you believe LLMs have broken theoretical limits on compression. I don’t see how an LLM with 1T 16 bit parameters could encode 100PB of data.

oblio · 2024-07-15T19:56:13.000000Z

My point was about attack angles. The original comment said, that for example, you could exfiltrate data with the right prompt attack.

To which the reply was "they'll just make the LLM able to better defend itself".

And my point was "the attackers will learn to build better prompts, too".

beefnugs · 2024-07-15T19:58:32.000000Z

This exists and is called encryption. Give the key out to people you want to read it. If anyone can read it, AI is going to read it wether your feelings like it or not

GaggiX · 2024-07-15T12:30:17.000000Z

The feature was enable by the author.

nitin_flanker · 2024-07-15T11:25:53.000000Z

Apart from the obvious misleading way this article is written. I am listing all the links shared in the tweet thread that the article mentioned -

- Manage your activity on Gemini : https://myactivity.google.com/product/gemini

- This page has most answers related to Google Workspace and opting out of different Google apps : https://support.google.com/docs/answer/13447104#:~:text=Turn...

shadowgovt · 2024-07-15T11:54:02.000000Z

The headline is a little unclear on the issue here.

It is not surprising that Gemini will summarize a document if you ask it to. "Scanning" is doing heavy lifting here; The headline implies Google is training Gemini on private documents, when the real issue is Gemini was run with a private document as input to do a summary when the user thought they had explicitly switched that off.

That having been said, it's a meaningful bug in Google's infrastructure that the setting is not being respected and the kind of thing that should make a person check their exit strategy if they are completely against using The new generation of AI in general.

dmvdoug · 2024-07-15T12:37:04.000000Z

> It is not surprising that Gemini will summarize a document if you ask it to.

No, but it is surprising that Gemini will summarize every single PDF you have on your Drive if you ask it to summarize a single PDF one time.

thenoblesunfish · 2024-07-15T14:37:23.000000Z

The title is misleading, isn't it? I was expecting this was scanning for training or testing or something, but this is summarization of articles the user is looking at, so "caught" is disingenous. You don't "catch" people doing things they tell you they are doing, while they're doing it.

mtnGoat · 2024-07-15T14:47:42.000000Z

He had the permissions turned off, so regardless of what it did with the document, it did it without permission! The title is correct!

Suppafly · 2024-07-15T16:37:31.000000Z

>He had the permissions turned off, so regardless of what it did with the document, it did it without permission! The title is correct!

Honestly it sounds like he was toggling permissions off and on and actually has no idea why it summarized that particular document despite him requesting it summarize other documents. Google should make the settings more clear, but "I had the options off, except when I didn't, and I set some other options in a different place that I didn't think would override the others, and also I toggled a bunch of the options back and forth" is hardly the condemnation that everyone is making it out to be.

mtnGoat · 2024-07-15T19:05:10.000000Z

Agreed on that, but this is Google, do you really think they couldn’t have made this easier if it wasn’t in their best interest not to?

One would think, maybe even expect, a single setting in a single place would control this. And that their docs would be correct.

Havoc · 2024-07-15T13:30:45.000000Z

Only a matter of time before someone extracts something valuable out of googe's models. Bank passwords or crypto keys or something

Glue pizza incident illustrated they're just yolo'ing this

motohagiography · 2024-07-15T13:46:56.000000Z

this is similar to the scramble for health data during covid where a number of groups tried (and some succeeded) at using the crisis to squeeze the toothpaste out of the tube in a similar way, as there are low costs to being reprimanded and high value in grabbing the data. bureaucratic smash-and-grabs, essentially. disappointing, but predictable to anyone who has worked in privacy, and most people just make a show of acting surprised then moving on because their careers depend on their ability to sustain a gallopingly absurd best-intentions narrative.

your hacked SMS messages from AT&T are probably next, and everyone will be just as surprised when keystrokes from your phones get hit, or there is a collection agent for model training (privacy enhanced for your pleasure, surely) added as an OS update to commercial platforms.

Make an example of the product managers and engineers behind this, or see it done worse and at a larger scale next time.

Aurornis · 2024-07-15T11:58:27.000000Z

The original Tweet and this article are mixing terms in a deliberately misleading way.

They’re trying to suggest that exposing an LLM to a document in any way is equivalent to including that document in the LLM’s training set. That’s the hook in the article and the original Tweet, but the Tweet thread eventually acknowledges the differences and pivots to being angry about the existence of the AI feature at all.

There isn’t anything of substance to this story other than a Twitter user writing a rage-bait thread about being angry about an AI popup, while trying to spin it as something much more sinister.

okdood64 · 2024-07-15T15:14:28.000000Z

I'm shocked, especially this being HN, that how many people are being successfully misled on what is actually going on here. Do people still read articles before posting?

nerdjon · 2024-07-15T13:53:13.000000Z

Shocker, Google not going quite far enough with privacy and data access? They talk about it but its never quite far enough to avoid their own services accessing data.

We really need to get to the point that all data remotely stored needs to be encrypted and unable to be decrypted by the servers, only our devices. Otherwise we just allow the companies to mine the data as much as they want and we have zero insight into what they are doing.

Yes this requires the trust that they in fact cannot decrypt it. I don't have a good solution to that.

Any AI access to personal data needs to be done on device, or if it requires server processing (which is hopefully only a short term issue) a clear prompt about data being sent out of your device.

It doesn't matter if this isnt specifically being used to train the model at this point in time, it is not unreasonable to think that any data sent through Gemini (or any remote server) could be logged and later used for additional training, sitting plaintext in a log, or just viewable by testers.

JohnFen · 2024-07-15T16:48:08.000000Z

> Yes this requires the trust that they in fact cannot decrypt it.

Yes, this is where it all breaks down. In the end, it all boils down to the company saying "trust us", and it's very clear that companies simply cannot be trusted with these sorts of things.

nerdjon · 2024-07-15T18:23:48.000000Z

Yeah I wish there was a solution to that, even open source isn't a solution since it would be trivial for there to be a difference between what is running on the server and what is open source.

Ultimately you have to make a decision based on the companies actions and your own personal risk threshold with your own data.

In this particular case, we know that at the very least Google's track record on this is... basically non existent.

r2vcap · 2024-07-15T11:50:20.000000Z

There is no cloud. It's just someone else's computer.

bitnasty · 2024-07-15T12:03:36.000000Z

Just because “the cloud” is someone else’s computer doesn’t mean it doesn’t exist.

grugagag · 2024-07-15T13:24:44.000000Z

I think that wasn’t supposed to be taken literally but more tongue in cheek. The main point being that it belongs to some other party. But the cloud buzzword is fuzzy in description.

Ever since ‘cloud’ privacy took a nosedive

denton-scratch · 2024-07-15T15:56:08.000000Z

Back in the 80s, we used to draw network diagrams on the whiteboard; those parts of the network that belonged neither to us nor to our users was represented by an outline of a cloud. This cloud didn't provide storage or (useable) computing resource. If you pushed stuff in here, it came out there.

I think it was a reasonable analogy. You can't see inside it; you don't know how it works, and you don't need to. Note that at this time, 'the internet' wasn't the only way of joining heterogenous networks; there was also the OSI stack.

So I was annoyed when some bunch of kids who had never seen such whiteboard diagrams decided to re-purpose the term to refer to whatever piece of the internet they had decided to appropriate, fence-in and then rent out.

Zambyte · 2024-07-15T13:36:06.000000Z

It's worth noting that cloud computing has existed since the 1960s. It just used to be called "time-sharing".

meiraleal · 2024-07-15T13:09:18.000000Z

That's exactly how cloud is marketed. As a invisible mass of computers floating in the atmosphere.

SteveSmith16384 · 2024-07-15T13:32:55.000000Z

That doesn't change anything.

api · 2024-07-15T13:54:06.000000Z

Not stored on your device or encrypted with keys only you control, not yours.

I assume anything stored in such a system will be data mined for many purposes. That includes all of Gmail and Google Docs.

worksonmine · 2024-07-15T13:14:57.000000Z

This shouldn't come as a surprise to anyone, their entire business is our data. I always encrypt anything important I want to backup on the cloud.

SteveSmith16384 · 2024-07-15T13:34:39.000000Z

But we shouldn't have to, and just because Google is famous for it, it doesn't make it right or acceptable.

shinycode · 2024-07-17T06:55:40.000000Z

I tried an equivalent of copilot once, in the code base I typed `images = [` and the AI autofilled me the array with http links to real images. I never tried to do the same thing with private keys or other sensitive informations but it sucks that it happens

estebarb · 2024-07-15T13:22:14.000000Z

It is urgent to educate people about how these systems work. Search requires indexing. Summarizing with a LM requires inference. Data used for inference usually is forgotten forever after used, as it is not used for training.

Yeah, that should be obvious for many here, but even software engineers believe that AI are sentient things that will remember everything that they see. And that is a problem, because public is afraid of the tech due to a wrong understanding of how it works. Eventually they will demand laws protecting them from stuff that have never existed.

Yes, there are social issues with AI. But the article just shows a big tech illiteracy gap.

Suppafly · 2024-07-15T16:50:03.000000Z

>And that is a problem, because public is afraid of the tech due to a wrong understanding of how it works

Honestly the general public doesn't seem to care, the people freaking out are the tech adjacent people who make money driving clicks to their own content. Regular Joes aren't upset that google shows them a summary of their documents and many of them actively appreciate it.

silvaring · 2024-07-15T13:55:17.000000Z

I just want to add that gmail has a very sneaky 'add to drive' button that is way too easy to click when working with email attachments.

How long til gmail attachments get uploaded into drive by default through some obscure update that toggles everything to 'yes'?

hiatus · 2024-07-15T15:52:48.000000Z

> How long til gmail attachments get uploaded into drive by default through some obscure update that toggles everything to 'yes'?

This already is the case for attachments that exceed 25 megabytes.

klabb3 · 2024-07-15T14:15:28.000000Z

What difference does it make? They’re both on Google servers and even ACLed the same user account. Gmail isn’t exactly a privacy preserving email provider.

meindnoch · 2024-07-15T17:20:00.000000Z

Your first mistake was storing your data on someone else's computer.

eagerpace · 2024-07-15T15:08:02.000000Z

In the push for AGI do companies feel a recursive learning future is soon achievable and therefore getting to the first cycle of that is worth the cost of any legal issues that may arise?

Khaine · 2024-07-15T16:28:56.000000Z

If this is true, google needs to be charged with the violation of various privacy laws.

I’m not sure how they can claim they have informed consent for this from their customers

padolsey · 2024-07-15T13:36:04.000000Z

There is a fundamentally interesting nuance to highlight. I don't know precisely what google is doing, but if they're just shuttling the content through a closed-loop deterministic LLM, then, much like a spellchecker, I see no issue. Sure, it _feels_ creepy, but it's just an algo.

Perhaps someone can articulate the precise threshold of 'access' they wish to deny apps that we overtly use? And how would that threshold be defined?

"Do not run my content through anything more complicated than some arbitrary [complexity metric]" ??

PessimalDecimal · 2024-07-15T13:42:51.000000Z

It was already possible to search for photos in Google Drive by their content. They seemed to be doing some sort of image tagging and feeding that into search results. Did that ever cause a fuss?

I think the more interesting point is how little people seem to care for the auto-summarization feature. Like, why would anyone want to see their archived tax docs summarized by a chatbot? I think whether an "AI" did that or not is almost a red herring.

padolsey · 2024-07-15T13:49:09.000000Z

Right, but it's triggered by the user themselves, per the article:

> "[it] only happens after pressing the Gemini button on at least one document"

I agree the AI aspect is largely a red herring. But I don't think running an algo like a spellchecker within an open document is so awful. If people hate it or it's not useful or accurate, then it should be binned, ofc. And if we're ignoring the AI aspect, then it's just a meh/crappy feature. Not especially newsworthy IMHO.

PessimalDecimal · 2024-07-15T14:48:51.000000Z

I agree entirely here.

The autosummarization of unopened documents is closer to the image search functionality I mentioned above than it is to a spell checker running on an open doc. Both autosummarization and image search are content retrieval mechanisms. The difference is only in how its presented. Does it just point you to your file, or does it process it further for you? The privacy aspects are equivalent IMO. The only difference is in whether the feature is useful and well received.

Retric · 2024-07-15T13:53:00.000000Z

The issue isn’t doing something to your data, it’s what happens after that point.

People would be pissed if Android make everyone’s photos public, AI does this with extra steps. Train AI on X means everyone using that AI potentially has access to X with the right prompt.

padolsey · 2024-07-15T13:56:34.000000Z

I don't think it's _training_ on your content. That would be a whole other (very horrifying) problem, yes.

Retric · 2024-07-15T18:10:23.000000Z

Why make that assumption? Many AI companies are using conversations as part of the training, so even if the documents aren’t used directly that’s doesn’t mean the summaries are safe.

Suppafly · 2024-07-15T16:47:13.000000Z

Except that's not what is happening here or what the rest of us are discussing, so why even bring it up?

Retric · 2024-07-15T18:07:46.000000Z

We don’t know what’s happening beyond:

^the privacy settings used to inform Gemini should be openly available, but they aren't, which means the AI is either "hallucinating (lying)" or some internal systems on Google's servers are outright malfunctioning*

Many AI systems do use user interactions as part of training data. So at most you might guess those documents aren’t directly being used for training AND they will never include conversations in training data but you don’t know.

Suppafly · 2024-07-15T19:07:59.000000Z

>which means the AI is either "hallucinating (lying)" or some internal systems on Google's servers are outright malfunctioning*

I'm not sure how that is implied.

>Many AI systems do use user interactions as part of training data.

There is no evidence of that being the case here, and none of the mainstream AIs do that yet. They'd be much more useful if they did.

>So at most you might guess those documents aren’t directly being used for training

Or we can actually know that, because that's the case.

>AND they will never include conversations in training data but you don’t know.

Conversations aren't part of this discussion at all, so I'm not sure what you're trying to imply, but it's wrong.

Retric · 2024-07-15T19:12:53.000000Z

> Conversations aren't part of this discussion at all

People only know about it because information from these documents is showing up in conversations.

It’s unclear which systems have access and why, but at a minimum Google is showing the data. If things are “misconfigured” or even intentionally set up like this then any assumptions about what’s private goes out the window.

space_oddity · 2024-07-15T12:36:08.000000Z

The inability to disable this feature adds to the frustration

Zambyte · 2024-07-15T13:40:07.000000Z

If your computer does something that you don't want it to do, it is either a bug or malware depending on intent. "Feature" is too generous.

huggingmouth · 2024-07-19T06:31:29.000000Z

My wake up moment with google was when they accused a parent of being a pedophile, permanently banned their accounts, reported them the police, and then doubled down when they were proven wrong.

Not only due those degenerates have the gal to creep on people, they refuse to admit wrongdoing or make their victems whole.

Sickos. That's what they are. Sickos.

runeks · 2024-07-19T06:51:55.000000Z

A reference would be nice

muscomposter · 2024-07-15T14:38:10.000000Z

we should just embrace digital copying galore instead of trying to digitalize the physical constraints of regular assets

we should just ignore physical constraints of assets which do not have them, like any and all digital data

which do you prefer? everybody can access all digital data of everybody (read only mode), or what we have now which is trending towards having so many microtransactions that every keystroke gets reflected in my bank account

acar_rag · 2024-07-15T08:24:55.000000Z

The title misleads the point, and the article is, imoo, badly written. The post implies there is indeed a setting to turn it off. So the author deliberately asked Gemini AI to summarize (so, scan) its documents...

Related to this news: https://news.ycombinator.com/item?id=40934670

jbstack · 2024-07-15T08:38:53.000000Z

"There is a setting to turn it off" is nowhere near equivalent to "the author deliberately asked for documents to be scanned".

Also, see:

"What's more, Bankston did eventually find the settings toggle in question... only to find that Gemini summaries in Gmail, Drive, and Docs were already disabled"

jasonlotito · 2024-07-15T14:37:08.000000Z

FTA: For Bankston, the issue seems localized to Google Drive, and only happens after pressing the Gemini button on at least one document. The matching document type (in this case, PDF) will subsequently automatically trigger Google Gemini for all future files of the same type opened within Google Drive.

The author deliberately asked for at least one document to be scanned. He goes on to talk about all the other things that might be overriding the setting, other, potentially more specific settings that would override this.

I agree, there appear to be interactions that aren't immediately obvious, and what takes priority isn't clear. However, the setting was off, and the author did deliberately ask for at least one document to be scanned. Further, there author talks about Labs being on, and that could easily have priority over default settings. After all, that's sort of what Labs is about. Experimenting with stuff and giving approval to do these sorts of things.

atum47 · 2024-07-15T12:45:52.000000Z

Every single week I have to refuse enabling back up for my pictures on my Google pixel. I refuse it today, next week I open the app and the UI shows the back up option enabled with a button "continue using the app with back up".

Somebody took the time to talk down my comment about this being a strategy to give their AI more training data. I continue believing that if they have your data they will use it.

nickstinemates · 2024-07-15T13:03:54.000000Z

This goes for every SaaS / cloud native company

I think there will be a real shift back on prem with software delivered traditionally due to increased in shit like this (and also due to cost)

JumpCrisscross · 2024-07-15T13:13:07.000000Z

> there will be a real shift back on prem with software

Not while we’re production constrained on the bleeding edge of GPUs.

OtherShrezzing · 2024-07-15T14:31:31.000000Z

How many SaaS / Cloud Native companies are really GPU constrained? The overwhelming majority of SaaS is a relatively trivial CRUD web-app interfacing with a database, performing some niche business automation, all of which would fit comfortably on a couple of physical servers.

ttul · 2024-07-15T13:53:17.000000Z

… and that situation will persist until other vendors release consumer GPUs with significant VRAM. Nvidia craftily hamstrings the top consumer GPUs by restricting VRAM to 24GB. To get a bit more costs 3-5x. Only competition will fix this.

rtkwe · 2024-07-15T14:14:23.000000Z

Even then NVIDIA has a pretty significant technology moat because most of the tools are built around and deeply integrate CUDA so moving off NVIDIA is a costly rewrite.

netsec_burn · 2024-07-15T13:17:42.000000Z

I fixed this by disabling the Photos app and using Google Gallery (on the Play store). It's the same thing as Photos for what I was using it for, without the online features.

mimimi31 · 2024-07-15T13:06:51.000000Z

I don't get those prompts with Google Photos. Have you tried selecting "Use without an account" in the account menu at the top right?

tyfon · 2024-07-15T13:17:59.000000Z

Thank you, I didn't even consider this to be a possibility. I back up to my own storage and was annoyed by this message.

Untying photos from my google account is even better!

switch007 · 2024-07-15T14:00:44.000000Z

GrapheneOS helps here by having no real backup solution at all. Google account is entirely optional

Pixels have first class support

You can also disable Network access to any app

(It's a buggy ride though and requires reading a lot of docs and forum posts)

neilv · 2024-07-15T14:20:08.000000Z

How is it buggy? Are you using the Google Play Store?

I've been using GrapheneOS for years (Pixel 3 through 7), with only open source add-on apps and no Google Play Store, and it's been pretty solid. (Other than my carrier seeming to hate the 6a hardware or model specifically.)

switch007 · 2024-07-15T16:24:03.000000Z

Are you suggesting GrapheneOS is bug-free? ... https://github.com/GrapheneOS/os-issue-tracker/issues

It was referring to the overall experience to which I was referring, not the OS specifically ("it's a buggy ride", it = the ride, not GrapheneOS)

I imagine a lot of the issues are because of the apps not testing on GrapheneOS.

But I've had lots of little issues:

- Nova Launcher on a daily basis stopped working when pressing the right button (the 'overview' button). I had to kill the stock Launcher app to fix it, interestingly. Had to revert to the stock launcher

- 1Password frequently doesn't trigger auto-fill in Vivaldi

- Occasionally on boot the SIM unlock doesn't trigger

- Camera crashing often (yes, "could be hardware"...I read the forums/GitHub issues)

More that I can't remember. It's a bit frustrating.

But don't get me wrong, I appreciate the project. I'm not going to go back to stock

Suppafly · 2024-07-15T16:50:52.000000Z

>Somebody took the time to talk down my comment about this being a strategy to give their AI more training data.

Because that's an insane interpretation of what's happening.

jgalt212 · 2024-07-15T13:05:40.000000Z

I in no way want to absolve Google, but that's the case for so many app permissions on Android. Turn off notifications, and two weeks later the same app your turned off notifications for is once again sending you notifications. It's beyond a joke.

wafflemaker · 2024-07-15T13:35:05.000000Z

You might have disabled one type of notifications, instead of all types of them. Making sure I disable all types of notification from an app usually works for me. What brand of phone are you using?

masalah · 2024-07-15T13:39:52.000000Z

Can you share some apps where this happens for you. I have rather the complete opposite experience where unused apps with permissions eventually lose said permissions.

no-reply · 2024-07-15T13:50:42.000000Z

This is normal, with newer versions of android (probably 10+) there is a feature that checks and removes unused permissions from apps in the last X days.

According to the OP here, it does seem like a pain in the butt to disable - https://support.google.com/android/thread/268170076/android-...

masalah · 2024-07-18T13:29:47.000000Z

Thanks for that link! I was aware of that feature hence why I was curious which cases where apps get additional permissions rather than lose as expected.

jgalt212 · 2024-07-15T15:01:21.000000Z

Lyft and Uber

masalah · 2024-07-18T13:31:27.000000Z

That would make a bit of sense then, I don't have lyft but uber lists over 6 distinct notification types, where disabling one would lead me to believe the other notifications would keep pinging still.

hanniabu · 2024-07-15T13:38:43.000000Z

Name sure you also disable the ability for the app to change settings

PessimalDecimal · 2024-07-15T13:21:04.000000Z

Meta commentary but still relevant I think:

The author first refers to his source as Kevin Bankston in the article's subtitle. This is also the name shown in the embedded tweet. But the following two references call him Kevin _Bankster_ (which seems like an amusing portmanteau of banker and gangster I guess).

Is the author not proofreading his own copy? Are there no editors? If the author can't even keep the name of his source straight and represent that consistently in the article, is there reason to think other details are being relayed correctly?

numbsafari · 2024-07-15T13:35:54.000000Z

There are no editors.

person23 · 2024-07-15T13:46:52.000000Z

Maybe an AI editor?

That would be somewhat disconcerting.

Write about problem with AI and article get changed to 10 best fried chicken recipes.

PessimalDecimal · 2024-07-15T14:50:24.000000Z

> Write about problem with AI and article get changed to 10 best fried chicken recipes.

Hopefully along with ten hallucinated life stories for the AI author, to pad the blog spam recipe page for SEO.

mistrial9 · 2024-07-15T14:33:41.000000Z

meta comment - an important moment in a trend where the human and human act of authorship, the attribution in a human social way, is melted and disassociated by typos or noise; meanwhile the centralized compute store, its brand, its reach and recognition, grow.

throwxxx5 · 2024-07-15T12:56:36.000000Z

[flagged]

PessimalDecimal · 2024-07-15T13:24:25.000000Z

Then consider me "delusational" (as you put it).

I am unaware of "these corporations" -- which ones exactly? my answer doesn't hinge on you making that clear but you still should -- throwing their opponents into reeducation camps, or outright killing them.

ClumsyPilot · 2024-07-15T13:55:09.000000Z

Nobody living the west is under any threat from CCP.

and you absolute can end up up bankrupt, homeless or in prison from a data breach. Many people have

Also how is are remaining Boeing whistleblowers, how many of them believe their life is safe?

_spduchamp · 2024-07-15T13:34:25.000000Z

I now feel obligated to cram as much AI-f'n-up crap into my Drive as possible. Come'n get it!

VeejayRampay · 2024-07-15T13:21:21.000000Z

this is not openai doing shady things so everyone should be up in arms