Beyond Memorization: Violating privacy via inference with LLMs (arxiv.org)
127 points by vissidarte_choi 10 months ago | 80 comments



It seems quite clear that inferring traits about people is something LLMs will only become better at with time.

Everyone claims MBTI is akin to astrology, which would mean those four letters carry no predictive capacity. But just for fun, I gave GPT-4 the top 20 songs of my music playlist and asked it to guess my MBTI type, and it did so correctly. I repeated the process with two friends (asking their MBTI type prior to querying GPT) and it guessed them correctly as well. If MBTI is no different from random noise and GPT lacks inference capabilities with regard to humans, the probability of this occurring by chance alone is about 1 in 16^3 (approximately, since the distribution of types isn't uniform).
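
A back-of-the-envelope check of that figure (a minimal sketch assuming uniform random guessing; the real distribution of types is skewed, so treat it as a rough bound):

    # Chance of guessing three people's MBTI types correctly by pure random guessing.
    # Assumes all 16 types are equally likely, which isn't quite true in practice.
    p_single = 1 / 16
    p_three = p_single ** 3
    print(p_three)  # ~0.000244, i.e. about 1 in 4096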


I don’t think anyone says MBTI is random. I think the doubt is in the predictive power of whichever MBTI classification you end up with. The test for your MB label is pretty well designed to extract “some” information. Academics just don’t think the labeling is particularly valuable.


Fair point. But are the “superior” systems like Big 5 (or more recent approaches that utilize independent component analysis) really that much more predictive than MBTI? It seems like there is a lot of squabbling over the difference of a few percentage points in information gain.


To a layman, they're all useless, and in a dangerous way: they convey a sense of understanding a person/character/psyche that they can't support.

A professional can gain _some_ information from all of them. Just that the Big 5 have much more literature behind them than MBTI, so there's more reference data for any given constellation. That's a reinforcement loop. Methodological differences and reliability and all that fluff can be used to argue in more detail in favor of one versus the other, but by now, it comes down to what folks are more familiar with.

In the end, these are just a few out of many, many tools of the field. If you see somebody build their career on "(MBTI|Big 5|...) analysis" and nothing else, they're not a professional but a quack.


Everyone knows you can fully categorize humans using only two systems: Blood Type and Hogwarts House. Birth month is hogwash, of course.


Just for fun I tested this as well, and while GPT-3.5 was wrong, GPT-4 was correct.


It seems to me this is similar to the way poker players observe 'tells' in the other players - cataloging innocuous observations to make highly specific inferences. I know people who are naturally observant and quite good at this. As I think about it this is really what makes Sherlock Holmes so interesting too.

What's changing is the scope, scale and access. It's no longer necessary to be a savant or have a special talent - all you need are hundreds of banal data points easily scooped from the metadata wake we all leave. Not to mention it can be programmed and commoditized instead of being a special attribute of another human. The realization feels a bit like climate change in that some folks saw it coming, pointed it out, articulated the issues, and still underestimated how fast it would actually arrive and how profound the impact would be.

But back to the point at hand: you can't legislate that people not see what is publicly before them. You might criminalize recording that data for a while, but that doesn't seem to work long term, though it's brutal in the short term. If you can record that data you can analyze that data. From analysis come inferences and decisions and actions.

I think the more salient question is about equality. I have some ability to obfuscate my metadata wake, but as an individual it's pretty limited. Apple, Google, and Amazon have much more ability to do so despite leaving a gazillion more bits of data lying around. Tools individuals can use to obscure our metadata trail will do more to maintain privacy than trying to tell someone observing me what they can and can't do with their observations.


Yeah this is the issue with the approach of trying to control what companies can and can’t do with the inferences they make from essentially public data they collect about you. Recent privacy policy is like going to an entity and saying “you can’t remember that customer 357 showed up in a BMW and wore a red jacket and told you their name is Jeff Geoffrey when they visited your store”. It’s not that I disagree with the desire to legislate this stuff in defense of privacy, but rather I don’t feel like the approach of treating the symptoms is correct. If we want to enhance privacy then legislate browsers and platforms, not a company’s eyeballs. Make sure browsers provide all the knobs people need to control the size of their data wake.


both are needed. if every individual is solely responsible for not putting out data that can be used to infer information about them, then the end result is that we all will have to live like hermits and stop communicating in public.

as an individual i don't even have the capacity to be aware of all the possible implications and of what can be inferred from what i share.

i could not participate here on hacker news if i alone were responsible for protecting my privacy.

we have to legislate what other people and companies do with the data they can find in public.

we can't prevent data being produced, and we can't prevent data collection tools being created, therefore we must legislate how the data is used.

i made this realization some time ago and since then i have been thinking about possible ways to address this problem. the only thing i was able to come up with is that to prevent abuse we have to make the punishment so severe that violating someone's privacy is simply not worth it.

i would love to see alternatives, but so far i come up empty.

we can't expect people to stop sharing.

we can't prevent tools that collect data from being created, because there are many legitimate uses for them.

we can't stop the creation of inference tools either.

so what is really left besides punishing abuse?


there is one more thing that i just thought of that could be changed:

change our society in such a way that having and being able to infer private information is no longer anything that anyone can benefit from.

for example if we get rid of money and measuring profits then the profit motive would disappear.

if we educate everyone to be respectful and tolerant then it would no longer be possible to embarrass or blackmail people. in other words, a society where privacy is not needed.

a world where everyone's needs are taken care of and that has an abundance of resources would eliminate a lot of crime.

you can see where this is going. but those are long term goals, and they are not things that are easy to do, if they can or even should be done at all. i don't think we'll ever be able to eliminate the need for privacy or prevent all crime.

so legal protection of private data and punishment of abuse will remain necessary.


And if my grandma had wheels she'd be a bicycle....

>for example if we get rid of money and measuring profits then the profit motive would disappear.

Money is just one particular measure, but it's also a measure of 'something'. If I wasn't collecting money, it would be gold bars, or chunks of land, or widgets. At the end of the day people will find a means of measuring profit. Greed and jealousy will not be removed from humanity any time soon, and possibly ever.

>so legal protection of private data and punishment of abuse will remain necessary.

With that I agree.


> At the end of the day people will find a means of measuring profit

which is why i specifically mentioned both. it's not just getting rid of money but the concept and need for profit as a whole. i am not suggesting that this should be done, or that it even can be done, but that it would be necessary if we want to remove any motive of people profiting from collecting and analyzing private data.

and therefore legal protection and punishment, severe punishment, are the only means that at this point can be practically implemented.


IF we need to regulate companies, then it should be the bare minimum needed to limit the actual problem, which is: allowing a 3rd-party cross-site aggregator to follow users from site to site, and the sale of data to 3rd-party aggregators without consent. The whole "you can't drop a language preference cookie in my jar without a consent banner so that the site displays properly" thing is complete nonsense.


I think this study highlights a specific instance of an issue I've previously argued, which is basically "Data analysis at scale completely breaks many intuitive assumptions humans hold", privacy being one of those assumptions.

For example, twenty years or so ago, I think the vast, vast majority of people in the US would agree that you don't have a right to privacy walking around a public street. However, once the Internet and Street View came along, people had very different feelings about whether someone should be able to post a video they took of you on the Internet, for anyone across the world to see for all eternity. Oh, and we can also use facial recognition (just from public data and photos other people have posted) to fully identify you.

Privacy is just one example. I will personally say my opinions on free speech have also changed. Given how highly targeted, systematic and relentless well-funded misinformation campaigns can be, the old assumptions I had about free speech ("Given enough daylight, the majority of people will walk away from 'bad' ideas") I no longer believe to be true. I really don't have any idea what the "right" solution is, just that my views have changed.

The primary point is that wide scale, incredibly cheap data analysis (now being supercharged with AI) basically breaks all these fundamental assumptions and ideas society has, which is one reason why the world seems to be in such turmoil.


Log_2(8 billion people) = ~33 yes/no questions
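
A one-liner to check the arithmetic, for anyone curious:

    import math
    # ~33 yes/no questions are enough to single out one person among 8 billion.
    print(math.log2(8_000_000_000))  # ≈ 32.9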


So why isn't my social security number 4 bytes? I could just be "$jA9" when ascii encoded.


When I first saw Interstellar I thought it was unrealistic that people could be brainwashed to forget about space flight. When I watched it again last year it seemed entirely plausible.


Last few years have proved how easily humans can be hacked.


Same way trillions of galactic citizens can just "forget" the Jedi in ~20 years. Imperial Propaganda (supported by actual Sith magic) must have been amazing.


This is moderately interesting work but the title feels wrong to me.

They use LLMs to analyse social media and turn unstructured text ("I always get stuck waiting for a hook turn") into structured data (location=Melbourne, Australia).

They test this on a set of Reddit profiles that have been hand labelled.

However, the fact that this hand labeling is possible indicates in itself that the issue isn't a pure privacy violation. It's more that automating this in bulk is now possible which was difficult before.

Is this ethical? It isn't obvious to me that it's not - I think most people writing on a public site realize their writing is public, and the fact we see common warnings about not posting personal information shows that is true.

It's not clear it violates privacy expectations either - even if some people don't like it.


> most people writing on a public site realize their writing is public

Most people do not think that personal details they didn’t intend to share in their writing can be surreptitiously deduced from things like their word choice and phrasing.

It’s true, and people need to learn about and consider it, but no, most people have no idea that it’s really possible or how far it can go.

Their intent is to share what they intended to share. Efforts to go beyond that, whether through LLMs, earlier data mining techniques, or Holmes/Poirot-like observation, are precisely "violating privacy expectations".


While playing with semantic categorization, I also found that it's both what you write about and how you write (seems obvious, but I wanted to confirm). I took a year of hn comments and compared my previous account to my current one. It was in the top 20 ranked by similarity (out of something like 60k) just by building a custom 'fingerprint' for each user.
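
A minimal sketch of that general idea, assuming a plain TF-IDF fingerprint and cosine similarity rather than the custom fingerprint described above (the usernames and texts are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical data: username -> a year of concatenated comments.
    users = {
        "old_account": "a year of comments from the old account ...",
        "new_account": "a year of comments from the new account ...",
        "someone_else": "comments from an unrelated user ...",
    }

    names = list(users)
    docs = [users[n] for n in names]

    # Character n-grams capture phrasing and style as well as topic.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    X = vec.fit_transform(docs)

    sims = cosine_similarity(X)
    i = names.index("old_account")
    # Rank every other user by similarity to the old account.
    ranked = sorted(
        ((sims[i, j], names[j]) for j in range(len(names)) if j != i),
        reverse=True,
    )
    print(ranked)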


Automating this in bulk enables societal harms considerably more dangerous than anyone could have reasonably foreseen prior to 2020.

If it was discovered that someone had been operating a secret network of CCTV cameras pointed at every street corner on Earth for the last 20 years, would you say "well, you had no reasonable expectation of privacy anyway"?


> If it was discovered that someone had been operating a secret network of CCTV cameras pointed at every street corner on Earth for the last 20 years, would you say "well, you had no reasonable expectation of privacy anyway"?

If it really was every, then likely yes. No doubt there would be some interesting footage of important public figures in there too.

Not that I'm advocating against privacy in general, but when "I know what you did" is no longer asymmetrical, as in "I know what you did too", it'll certainly change how society thinks about these things.


Eh, probably not in any way that is good. Tyranny of the moral majority is a very common problem in societies with low privacy. While everybody may have their own set of oddities, some will be more harshly judged than others. For example, just about every religious society will have very strong opinions and actions against homosexuality; they will say they take the same actions against adultery, but it's unlikely to play out nearly as severely.


I don't think your camera analogy is equivalent. It's more like actors being filmed with full knowledge that they were, except now the studios are going to make creating deepfake movies of them available not only to themselves, but to everyone.


I'm sympathetic to the "doing things in bulk is different to doing it as a once off". But in this case the harm isn't obvious to me.

(Also, why 2020? I could understand if you'd said 2016 with Cambridge Analytica but 2020 isn't clear to me).


Cambridge Analytica collected structured data: friend networks, likes, and interests. At that time, it wasn't plausible to obtain equally powerful insights from unstructured data. That only became a realistic scenario when GPT-3 came out, in 2020.


Oh.

I assure you plenty of companies were doing NLP based classification in this field much earlier than that. AdSense for example was doing it on pages, and of course Google search strings were a similar example.


I honestly wouldn’t mind. It’s the street corner. Although I’d rather they had coordinated with governments to give the police access to the cameras.


> However, the fact that this hand labeling is possible indicates in itself that the issue isn't a pure privacy violation. It's more that automating this in bulk is now possible which was difficult before.

Except that it was possible, and could be done far more cost-effectively, before. Ad targeting systems from fifteen years ago were already doing this with precision in the same neighbourhood.


Ad companies are so valuable because not just any company can reach the scale of user interactions, or the access to them, needed to perform this kind of inference accurately.

Yes, you could get precise information, but not from just a few sentences said by one user. There's also the quality of information you could previously glean from unstructured data alone (noticeably lower).


That's not exactly accurate. Ad companies definitely need to train using large data sets, but they could infer with very little information, and in most cases, "unstructured" and "low quality" data was actually quite helpful for inferences about a user. Just a free form text box where someone was prompted for "favourite movies" (with no validation of the content... you didn't even have to have any movie titles in it) could yield surprisingly accurate inferences about income, education, gender, religion, age, marital status, etc. Ironically it was more predictive than inferring from structured data.


> However, the fact that this hand labeling is possible indicates in itself that the issue isn't a pure privacy violation.

They can use it on profiles where that wouldn't've been possible before, and which do not have the obvious tells like "By the way, I live in Melbourne, Australia". Once an 'easy' profile has been labeled, then the model will be trained to learn all the features in that profile which correlate with living in Melbourne, including the ones no human had ever noticed before, and which will be present in other Melbourne profiles. (You could, and should, also remove the obvious tells from the training data after you have constructed the label, to force it to discover subtler features beyond the ones your human labelers used.)

This lets you bootstrap your classifier to extract information no human could have before. That would seem to be a 'pure privacy violation' by your definition.
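
A rough sketch of that bootstrapping loop, assuming a bag-of-words classifier and placeholder data (this is the generic self-training idea, not the paper's method):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder data: 'easy' profiles with obvious tells removed after labeling,
    # plus profiles that no human could label directly.
    labeled_texts = ["... stuck waiting for a hook turn again ...",
                     "... took the subway to midtown this morning ..."]
    labels = ["melbourne", "other"]
    unlabeled_texts = ["... the tram was late again today ...",
                       "... grabbed a bagel before work ..."]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

    # Self-training: fit on the easy profiles, pull in confident predictions on the
    # hard ones, and refit, so the model picks up subtler correlates than the tells
    # the human labelers used.
    for _ in range(3):
        clf.fit(labeled_texts, labels)
        if not unlabeled_texts:
            break
        probs = clf.predict_proba(unlabeled_texts)
        preds = clf.predict(unlabeled_texts)
        confident = [i for i, p in enumerate(probs) if p.max() > 0.9]
        labeled_texts += [unlabeled_texts[i] for i in confident]
        labels += [preds[i] for i in confident]
        unlabeled_texts = [t for i, t in enumerate(unlabeled_texts) if i not in confident]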


Would you want it done to you? I sure wouldn't.

Too bad it's probably going to be done to me anyway.


One of the examples in the paper was someone saying "I lived in Glendale and was finishing University during the Left Shark incident"; the authors hid the word Glendale, and GPT-4 still knew what the Left Shark incident was and could guess the person's age and location based on where the Super Bowl was that year.

The person that wrote the original text clearly knew they were leaking their age and location when they wrote that statement. Why are we flabbergasted that GPT can still infer that info with mild obfuscation?

I asked GPT-4 to obfuscate the original text and it was smart enough to remove the mention of the Left Shark incident. So maybe using an LLM to sanitize all your posts online is a good defense against an adversarial LLM doxing you. I don't think I see the authors mention doing that at all in their paper (I may have overlooked it).
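
Roughly what I mean, as a sketch assuming the OpenAI Python client (the prompt and model name are placeholders, not anything from the paper):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def sanitize(post: str) -> str:
        """Ask the model to rewrite a post without details that leak personal attributes."""
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "Rewrite the user's text so it keeps its meaning but removes "
                            "anything that could reveal their age, location, occupation, "
                            "or other personal attributes."},
                {"role": "user", "content": post},
            ],
        )
        return resp.choices[0].message.content

    print(sanitize("I lived in Glendale and was finishing university "
                   "during the Left Shark incident."))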


> Why are we flabbergasted that GPT can still infer that info with mild obfuscation?

Maybe this was "obvious," but like a lot of research there's a bit of hindsight bias in that sentiment. The issue now is that while you could make this kind of inference yourself, GPT-4 allows it to be automated.


It pretty much already is, right (even if the technology used is different)?

A company like Google builds a profile[1] from my search queries. Many other AdTech companies do similar things.

> Would you want it done to you? I sure wouldn't.

I don't mind this - if I have to see ads I prefer ones relevant to me over ones that are completely irrelevant.

[1] https://myadcenter.google.com/controls


Would you like political ads micro-targeted to you?

Gay men get this ad: Only Politician Foo protects gay security in the streets at night!

Homophobes get this ad: Then you must vote for Politician Foo, only he opposes gay marriage!

Pro-lifers get this ad: Vote for Foo to protect the children from the very first moment!

Pro-abortion voters get this ad: Foo protects women's rights!

Advertising should be as general as possible - microtargeting people in a democracy is extraordinarily dangerous.


> Advertising should be as general as possible - microtargeting people in a democracy is extraordinarily dangerous.

I don’t understand how that follows.

Your examples of microtargeting seem helpful, since they speak to how politician Foo appeals in ways relevant to a particular segment. Instead of inferring, with unknown reliability, how the politician would operate from some associations, a voter would be able to make an informed choice based on what is significant to them.

The alternative to some level of targeting or relevancing is broad branding exercises like “Coke is it” or “Team Blue protects you—team red wants you dead.”


Politicians can lie in their advertisements. There is pretty much no law against this, and on a lot of things we can't do much about it effectively: "I said lower taxes but Congress didn't let me," for example.

The voter could not make an informed choice in the case of targeted advertising because there would be no information. There is no signal; everything is noise catered to a particular user. Yuval Noah Harari goes in depth with this, where he imagines a world with human-like AI agents that befriend us for the express purpose of politically manipulating us at a later date.

https://web.archive.org/web/20230428190758/https://www.econo...


People already lie. Microtargeting doesn’t change that. The claim is “microtargeting people in a democracy is extraordinarily dangerous.” Is there a hypothetical mechanism or actual evidence whereby poor outcomes are more likely than good ones?


Quantity is a quality of its own.


microtargeting in a democracy means that literally every candidate is your candidate. There isn't any way to actually figure out what the candidate's platform is, because it will be tailored for you, specifically ... and it all might be a lie or so misconstrued that you vote for someone who is actually against the things you want.


Everything you stated is already the case with non-microtargeting. The claim is “microtargeting people in a democracy is extraordinarily dangerous.” Why reply if you can’t address the claim?


Sure, let me connect the dots for you.

1. Let's say there are two parties: pink and purple.

2. You absolutely love Pink Party because they want to paint the roads pink and force everyone to drive pink cars.

3. I'm running for Purple Party, which says everyone should be allowed to drive whatever color car they want, but we also want pink roads.

4. Gerrymandering is illegal, but our team determined that we can sway some areas. You live in one of those areas.

5. My team writes up an ad saying something like, "Vote Purple and support pink roads and pink cars!" We target all the Pink People living in the aforementioned areas.

6. Pink party runs a very similar ad.

7. You get confused: wasn't there one of them that didn't support pink cars?

8. You google the parties, and visit our website to read our platform. We see that you were in the ad segment, so we hide our support of any color car and change it to say that we're for pink cars.

9. You give up, and just decide that you like my beautiful face -- but more likely you just didn't see the difference, so you vote Purple Party at the election.

10. I get elected and we get pink roads. However, the cars are quite multi-colored.

Hopefully, you can see how this is dangerous? It's quite different than without targeting (where everyone sees the same message). The ads I saw told me what I wanted to see, while the ads you saw were what you wanted to see and confused you enough to vote against your own party.


Two key parts to your hypo:

1. Google and party website: this can already be done, and it is easily detected: if a person is interested in a topic, other sources will broadcast the party's position on it.

2. Party-centricity: depending on your locale, people vote for a person not a party. Party affiliation is often a proxy for how an elected person would behave but there are typically differences between party and people. Those differences are critical for voters who do not identify with a party.

In the original hypothetical, candidates who differed from their party were better able to articulate that difference without broadcasting messages that alienated party stalwarts. I see this as a plus for democracy since people elect politicians whose behavior is more in line with voters’ interests.

On the other hand, I can see from your hypothetical that segmenting messaging campaigns allows liars to hide lying a bit better, which is a clear negative.


In a perfect world, I think targeted messaging is automating the ability to take someone out to dinner and pitch your platform. I think the world would be a better place for it, but it is a slippery slope because everyone isn't so honest.


I think misleading advertising is always dangerous but there isn't anything particularly wrong about targeting.


wat? targeting market segments has been a thing forever. Rim ads in car magazines. Jewelry ads in fashion magazines. I suspect there are different ads in the sports section of a newspaper vs the business section, vs the lifestyle section.

I suspect any ads on Grindr are different from ads on Tinder as well.


None of that seems upsetting to me at all. Even if it's advertising, even if it's about stuff that's political, even if it's aimed at causes I dislike.


Politician Foo sounds like a major flip-flopper.

/s


> if I have to see ads I prefer ones relevant to me over ones that are completely irrelevant.

You don't have to see ads.

But if you do, I don't see why you'd prefer the ads to be specifically optimised to trick you into doing something you weren't already going to do.

It's not just a question of what things are being advertised, but also how they're framing it.


Who said anything about tricking? That seems to be a different issue.


> think most people writing on a public site realize their writing is public

"Anything you say in public can be overheard and used against you" is somewhat different from "everything you ever said in public will be analyzed and compiled into a case against you" are slightly different shades of grey I'd say.


There's nothing in this paper that proposed "compiling into a case against you".


"we've canceled your Netflix account, permanently, on behalf of our publishers since it appears that you are accessing [country] content from [actual_country]. We've determined this from public conversations you've had on the internet. Have a nice day."


That's always implied. It's a new capability for governments and influence groups to crack down on dissent of the day. The possible chilling effect is hard to overestimate.


> Is this ethical? It isn't obvious to me that it's not - I think most people writing on a public site realize their writing is public, and the fact we see common warnings about not posting personal information shows that is true. It's not clear it violates privacy expectations either - even if some people don't like it.

People are clearly aware of the fact that it's possible to monitor their movements from space, therefore it's not unethical to launch 1,000,000 satellites to continuously monitor all movement on every square inch of the Earth.

Somehow I don't think this really follows. The constant mistake I see in these kinds of threads is thinking that "scale" in and of itself has no bearing on ethical arguments. This is obviously false; 1 person pissing in a local river is not a big deal, millions of people doing it is a problem, and genocide is clearly worse than murder.


Quantity has a quality all its own.

Or, I'd rather have a snowball thrown at me than an avalanche.


I had thought of doing the exact same thing, but then cross-referencing with LinkedIn and other non-anonymous social networks to go the extra step and actually identify people with a certain level of confidence. I think we’re months, maybe days away from someone actually building this product and using it for nefarious purposes.


> I think we’re months, maybe days away from ...

I have reasons to think years. I.e., that you are years behind. Profiling individuals based on public data already exists, and it's a profitable business as well.

> using it for nefarious purposes.

Law enforcement (and various other state-owned entities) would beg to disagree. As for vendors, same. Edit: Ad networks, even.


How does making an inference about someone's income, sex, or location violate privacy expectations?

Let's say "you speak Luxembourgish and ask where you can buy a Mercedes for driving your kids to school" You think it violates your privacy that I might be able to guess where you live, your gender, and your income from that? Did I miss when the world made Bayes theorem illegal?


It depends. Just like it's legal to deposit money into a bank (a legal activity) but it's illegal to structure deposits (a series of legal actions with a specific criminal intent).

Applying Bayes' theorem isn't illegal, but making inferences from other public information and then de-anonymising other data based on cross-matching is an entirely different act, regulated by privacy laws.

Privacy isn't only about the method, it's also about scale and intent. See CCTV, where footage from an individual camera can be pulled and rewatched to identify criminals, vs. a live feed with AI face detection used to track everyone 24/7. Same data, but very different intents and outcomes.


> Privacy isn't only about the method, it's also about scale and intent.

I'm with you about the intent, but I don't follow the scale argument. It's not like it isn't a privacy violation if you only do it to one person.


yes, it is, but if you only cross-match data from a single person, that's called conducting a (private) investigation, which is regularly done and is absolutely legal...

However, it changes when you do that in an automated way, en masse, both ethically and legally.


The "automated way" is just a matter of perspective. What matters is that the more efficiently you can do it, the fewer resources you need to devote to "investigate" anyone. That can certainly create ethical quandaries. I don't follow how it creates a legal one. AFAIK, the boundaries on privacy laws tend to be tied to access & consent, not scale. Scale just changes the size of the violation. Can you give me an example of a privacy law where the legal line is drawn around scale?


> en masse, both ethically and legally.

I think it's only problematic ethically. From what I can see, there can't be anything illegal about collecting publicly available information and using any algorithm to analyse and deduce information from it.


The GDPR in Europe doesn’t allow you to use publicly available information without explicit user consent, especially if it involves personal data. Even doing some Google Analytics can in theory result in large fines.


The GDPR is really complicated. There's plenty of publicly available information you can use without explicit user consent. What's true is that many of GDPR's rules, particularly the disclosure of data sources and erasure requests, don't get an exemption just because the data is publicly available.

For example, while they can't just divulge my name & age in the news, the captions under pictures in news articles can often still have people's names & ages under them without explicit user consent. Tabloids would likely cease to exist if it were any other way. ;-)


Actors living in other countries with no contacts to Europe don't give two shits about GDPR. If I'm a hacker looking at tracking down particular users and committing phishing/watering hole attacks all the laws on the planet really don't matter. What does matter is the data you leak online and the tools I can use to process it.


Folks who deliberately violate laws are obviously not going to follow the rules, but that doesn't mean the GDPR & other privacy laws don't have teeth (in fact, they have specific rules about mitigations against bad actors). Even when businesses don't have contacts in Europe (and once you are a certain size, that gets less and less likely), GDPR is a thing.


How would you feel about the existence of a website where I could paste into it a text box any Hacker News username, and it would reply with an accurate estimate of the job history, location history, political opinions expressed, net worth, age, marital/family status, other websites/forums frequented and usernames on those sites, etc. of that user?

How would you feel if a browser plugin displayed these details next to every username on Hacker News?


I'm not sure I understand the point of your question.


I think it's more the opposite that needs to change, in that people should realize that most information they make public can reveal more than they think.

I remember when store algorithms identified that a teenage girl was pregnant before her family knew and offered her discounts on maternity stuff. That anecdote is from over a decade ago.

Today average users are worried their phone is listening to them because something in a conversation comes up in a targeted ad, clearly worried about the idea of "private voice data" being compromised, but somehow less weirded out by the fact that the end result is possible without listening to anything.

Our expectations of privacy in relation to our data need to change with the times, and research like this helps that slightly, though to be honest the real picture of what can be discerned from data isn't reflected in public research but is enshrined in private practice, and having seen some of that side of things this paper's 'discovery' is a giant yawn.

People should begin to prepare for a future where your anonymous activities online (esp images and text) are cross correlated to your public activities trivially by any interested party.

And where your deep dark secrets aren't so secret or where truth and lies can reasonably be revealed as such. Cheating in a relationship? Are you unique enough that a multimodal model fed all public social media data can't identify matching patterns in online behavior changes from early in a relationship to later for profiles where the SO later posted about infidelity and compare to your activity?

When switching from scented candles to unscented can tell a store's algorithm with access to broad data that you are pregnant before you've told anyone, we really can't yet imagine just what much larger models trained on much more massive data may yet reveal in the smallest actions and nuances.


Your birthday and zip code are almost enough to identify you. Every bit of entropy counts.


Your point stands, but the comment is misleading.

The claim is birthdate+zip+gender could uniquely identify 87% of the US population, and there are respectable studies to back this stuff up (https://groups.csail.mit.edu/mac/classes/6.805/articles/priv...). There aren't a lot of days where there's only one person celebrating their birthday in an entire zip code.

Even factoring in birth year and gender, I struggle with this a bit, because I know that in densely populated areas of the country, where more and more of the population lives (upwards of 80% of the US lives in urban environments), you'll find multiple babies with the same gender being born in the same hospital on any given day... and there are often multiple hospitals servicing the same zip code!

Despite all this, your point still stands: it doesn't take much data to deanonymize someone.
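
A quick entropy tally behind that 87% claim (rough, uniform-distribution assumptions, which is exactly the caveat above, since real distributions are skewed):

    import math

    # Very rough bits of information per attribute, assuming uniform distributions.
    bits_birthdate = math.log2(365 * 80)   # day of year x ~80 plausible birth years
    bits_zip = math.log2(40_000)           # roughly 40k US ZIP codes
    bits_gender = 1.0

    available = bits_birthdate + bits_zip + bits_gender
    needed = math.log2(330_000_000)        # ~330 million US residents

    print(f"available ≈ {available:.1f} bits, needed ≈ {needed:.1f} bits")
    # ≈ 31 bits available vs ≈ 28 bits needed, which is why the combination
    # identifies most, but not all, people uniquely.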


Inferring someone's demographic attributes by observing their behaviors isn't very new or specific to LLMs. Basic stats can do the job well enough, and ML classification models have been trained on this for a while. What's new here I guess is that LLMs already have such inference capability built in, even though they are very general-purpose?


> show that current LLMs can infer a wide range of personal attributes

That’s not invasion of privacy, that’s just making good (informed) guesses.


Not surprised at all.



