
OP's outrage rests on the fact that OnStar claims they are anonymizing the data and he says they are not. Why should I believe him over OnStar? He gave no evidence that they were not anonymizing the data properly, he just assumed they were not.

EDIT: There are other ways to anonymize data than simply removing the name associated with data.

His concern is not that OnStar will fail to remove your name from the GPS location stream. It is that even without a name attached, the subject's identity can be readily inferred from the data itself.

If one looks at a stream of location data over time, and sees the recurrence of a particular location in a residential area, particularly at night, then it can be pretty well surmised that this is your home. And from that, it's a trivial step to get your identity. And bingo, the anonymized data is now re-identified.
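To make that concrete, here is a minimal sketch of the inference, assuming the "anonymized" stream is just a list of (timestamp, latitude, longitude) fixes for one unnamed vehicle. The function name `infer_home`, the night-time window, the grid precision, and the sample coordinates are all made up for illustration; nothing here is taken from OnStar's actual data format.

```python
from collections import Counter
from datetime import datetime

def infer_home(fixes, night_start=22, night_end=6, precision=3):
    """Return the grid cell seen most often at night, or None if no night fixes."""
    cells = Counter()
    for ts, lat, lon in fixes:
        # Keep only fixes recorded between night_start and night_end.
        if ts.hour >= night_start or ts.hour < night_end:
            # Rounding to 3 decimals buckets fixes into cells roughly 100 m wide.
            cells[(round(lat, precision), round(lon, precision))] += 1
    return cells.most_common(1)[0][0] if cells else None

# Hypothetical fixes for a single vehicle, with no name or ID attached.
fixes = [
    (datetime(2011, 9, 20, 23, 0), 42.33141, -83.04582),
    (datetime(2011, 9, 21, 2, 0), 42.33139, -83.04579),
    (datetime(2011, 9, 21, 13, 0), 42.28080, -83.74300),  # daytime, ignored
    (datetime(2011, 9, 21, 23, 30), 42.33142, -83.04581),
    (datetime(2011, 9, 22, 1, 0), 42.40000, -83.10000),   # one night elsewhere
]
home = infer_home(fixes)
```

The recurring night-time cell is the likely home; turning that cell into a name is then a matter of looking up who lives there.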

There's a simple solution to that: don't give a stream of location data. Chop it up into 5-second fragments, and fuzz the data by a meter or so to prevent re-assembly.

That would still be a very valuable dataset (for me at least), and almost completely free of PII.

Then again, I'm not an expert in these things; am I missing some way that this could be deanonymized?

Adding a meter to the GPS location of where my car starts and stops at the end of each day still tells you where my house is.
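A rough sketch of why, assuming a 1 Hz track in local metre coordinates that gets chopped into 5-second fragments and jittered by up to a metre per axis. The helper `fragment_and_fuzz` and the toy track are hypothetical. The sketch only measures how little the fuzz moves the trip's start point; it does not model how an attacker would pick that point out of the fragments.

```python
import math
import random

def fragment_and_fuzz(track, fragment_len=5, fuzz_m=1.0, rng=None):
    """Jitter each (x, y) point by up to fuzz_m per axis, then split into shuffled fragments."""
    rng = rng or random.Random()
    fuzzed = [
        (x + rng.uniform(-fuzz_m, fuzz_m), y + rng.uniform(-fuzz_m, fuzz_m))
        for x, y in track
    ]
    fragments = [fuzzed[i:i + fragment_len] for i in range(0, len(fuzzed), fragment_len)]
    rng.shuffle(fragments)  # discard ordering so fragments are not trivially chained
    return fragments

home = (0.0, 0.0)
# Toy trip: the car leaves the driveway and heads east at 10 m/s for 60 seconds.
track = [(home[0] + 10.0 * t, home[1]) for t in range(60)]
fragments = fragment_and_fuzz(track, rng=random.Random(0))

# Distance from the true parking spot to the nearest published point.
closest = min((p for frag in fragments for p in frag), key=lambda p: math.dist(p, home))
error_m = math.dist(closest, home)
```

With a one-metre jitter per axis, `error_m` can never exceed about 1.4 m, which is well inside a typical driveway.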

Even if you removed any IDs from the data and sufficiently fuzzed the location, speed, and timestamps, you are still left with a heatmap of where cars with OnStar drive most frequently.

In a city, that is probably anonymous. If you are in a rural area or drive along a route where your car makes up the majority of the data points, it still isn't.
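One way a publisher could check for this before release is to count distinct vehicles per heatmap cell and flag the sparse ones. This is only a sketch: the function `risky_cells`, the threshold `k=5`, and the cell labels are assumptions for illustration, and it presumes the publisher still has vehicle IDs internally.

```python
from collections import defaultdict

def risky_cells(points, k=5):
    """Return heatmap cells visited by fewer than k distinct vehicles."""
    vehicles_by_cell = defaultdict(set)
    for vehicle_id, cell in points:
        vehicles_by_cell[cell].add(vehicle_id)
    return {cell for cell, vehicles in vehicles_by_cell.items() if len(vehicles) < k}

# Hypothetical data: 20 cars pass through downtown, one car accounts for a rural road.
points = [(f"car{i}", "downtown") for i in range(20)] + [("car7", "farm_road")] * 30
flagged = risky_cells(points)
```

Here `farm_road` is flagged even though it has more data points than downtown, because every one of them comes from the same car.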

I don't think I can explain why GPS data inherently resists anonymization better than the OP did. Please re-read that section.

It's impossible to anonymize location data, because location data is actually better at identifying you than your name (unless you have a very uncommon name).

The US census releases anonymized location data.

The U.S. census releases aggregate data. It's not so much anonymized as impersonal. You're right, though, I should have been more specific: it's impossible to anonymize location data, save by aggregating very large, amorphous groups.

The OP explains it all quite clearly; you should reread the post.

But in a nutshell his point is that by its very nature GPS data collected over a constant time period cannot be anonymized. If your car is located >50% of the time in one of two places, chances are one is your home and one is your office. I now know where you live (and thus your identity) and I know where you work.
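As a sketch of that heuristic, assume the track has already been reduced to one place label per fixed-interval sample. The function `dominant_places`, the 50% threshold parameter, and the cell names are hypothetical.

```python
from collections import Counter

def dominant_places(samples, min_share=0.5):
    """Return the two most frequent places if together they exceed min_share of samples."""
    if not samples:
        return []
    top_two = Counter(samples).most_common(2)
    share = sum(count for _, count in top_two) / len(samples)
    return [place for place, _ in top_two] if share > min_share else []

# Hypothetical day of hourly samples: 14 h in one cell, 8 h in another, 2 h elsewhere.
samples = ["cell_A"] * 14 + ["cell_B"] * 8 + ["cell_C"] + ["cell_D"]
places = dominant_places(samples)
```

The two cells returned are the home and office candidates; which is which usually falls out of the time of day.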

How do you know they give continuous position on a per car basis? They could break everything up into chunks, or simply give out statistics on average speeds and usage for every road.

Everyone here is assuming anonymize means to remove name but keep everything else intact. I see no indication that this is the case. If there is reason to believe otherwise, point me in that direction.

  for any purpose, at any time, provided that following collection 
  of such location and speed information identifiable to your Vehicle
They store the data tied to your identity. A data breach (quite common these days...) would be a Big Deal. GPS tracks of everywhere you've gone in your car, ever? That's worth quite a bit of money in the right hands.

  He gave no evidence that they were not anonymizing the data properly, 
  he just assumed they were not.
Zipcode, birthday, gender: identifies 87% of Americans[1]. Your (Home,Work) GPS tuple? Unique[2][3]. His assumption is quite safe; every "anonymized" dataset that's been released to the public (that I know of) has been de-anonymized. Why would this one be special?

1) http://arstechnica.com/tech-policy/news/2009/09/your-secrets...

2) http://crypto.stanford.edu/~pgolle/papers/commute.pdf

3) http://33bits.org/2009/05/13/your-morning-commute-is-unique-...
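To illustrate the mechanism behind those numbers (not to reproduce them), here is a small sketch that measures what fraction of records become unique as more quasi-identifiers are combined. The helper `unique_fraction` and the four records are invented for the example.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Made-up records with no names attached.
records = [
    {"zip": "48226", "birthday": "1980-01-02", "gender": "F"},
    {"zip": "48226", "birthday": "1980-01-02", "gender": "M"},
    {"zip": "48226", "birthday": "1975-06-30", "gender": "M"},
    {"zip": "48104", "birthday": "1975-06-30", "gender": "M"},
]
by_zip = unique_fraction(records, ["zip"])
by_all = unique_fraction(records, ["zip", "birthday", "gender"])
```

In this toy set only a quarter of the records are unique by zipcode alone, but all of them are unique once the three fields are combined. The same counting applies to a (home, work) pair.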

EDIT: In response to parent edit and below comments

I have no proof of these, but they are factoids I believe to be true (so feel free to base a research paper on them :D)

1) To identify commuters: (Highway-Entrance-Location, Average-Highway-Entrance-Time, Highway-Exit-Location, Average-Highway-Exit-Time) -> some derived values: approximate (home,work), average speed, average driving aggression

2) Really, now that I think about it, any dataset where multiple GPS tracks (for a single person) are tied together is out. If you can get any single Average-Location-at-Specific-Time data point, (plus point #3 below) you've reduced the candidate set to something quite small. Then you just stand on that street corner at that time (or, for the police, use the red light cameras...) and you're done.

3) This is an OnStar dataset we're talking about, so you're looking for GM-manufactured cars, made in the last ~10 years (or whenever OnStar started going into cars). I'm willing to bet that just that data point is enough to turn any other lukewarm/weak de-anonymization into a solid match.

4) Anyone who buys OnStar as an option is quite concerned with their safety at all costs (... my bias, I guess, since I consider it a waste of time), so look for e.g. families with small kids or other dependents.

I'm running out of steam for this single comment, but a name is certainly not necessary for unique ID. Ongoing research is cracking this stuff wide open. When the Netflix dataset came out, who would have thought that movie ratings could uniquely identify a person?
