Hacker News
De-Anonymizing Web Browsing Data with Social Networks [pdf] (randomwalker.info)
215 points by MaurizioP on Feb 7, 2017 | 51 comments



At the 33c3 I showed how you can often uniquely identify a single anonymized user from a dataset containing three million people by using publicly posted links from his/her Twitter timeline. I also showed that this is possible with other types of public information as well, such as YouTube video ratings or reviews on Google Maps.

The math behind it is quite simple and very reliable for many datasets, which makes it easy to build robust fingerprints from browsing / location / behavior data. In my opinion, this is what most big companies rely on today for identifying users, as it is more robust than cookie-based mechanisms, which become less effective as the use of multiple devices and blockers increases.

Here's the link to the video (you can choose the language in the menu, by default it's German but the talk is also available in English and French):

https://media.ccc.de/v/33c3-8034-build_your_own_nsa
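To make the overlap idea concrete, here is a toy sketch (my own illustration, not code from the talk; all names and links are made up) of scoring candidates by the links they share with an anonymized history, weighting rare links more heavily:

```python
import math

# Hypothetical data: an anonymized browsing history (a set of links) and
# public link sets for a few candidate users. In the real attack the
# candidate sets would come from crawled Twitter timelines.
history = {"example.com/a", "rare-blog.net/post", "news.site/x"}
candidates = {
    "alice": {"example.com/a", "rare-blog.net/post", "news.site/x"},
    "bob":   {"example.com/a", "news.site/x", "other.site/y"},
}

# How often each link appears across all candidates (a stand-in for
# global popularity); rare links carry more identifying weight.
link_count = {}
for links in candidates.values():
    for link in links:
        link_count[link] = link_count.get(link, 0) + 1
n = len(candidates)

def score(user_links):
    # Sum of -log(popularity) over shared links: matching on a rare
    # link is strong evidence, matching on a popular one is weak.
    return sum(-math.log(link_count[l] / n) for l in history & user_links)

best = max(candidates, key=lambda u: score(candidates[u]))
print(best)  # alice shares the rare link, so she scores highest
```

With realistic data the candidate set is huge and the popularity counts come from the whole platform, but the scoring idea is the same.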


> In my opinion, this is what most big companies rely on today for identifying users, as it is more robust than cookie-based mechanisms, which become less effective as the use of multiple devices and blockers increases.

Correct. Another thing this is used for is identifying mobile/PC pairs belonging to the same person.


Why not both though? I think the approach these days is to try to collect everything and put it into a model.

It might be similar to what happened to passwords. Passwords are still important, but they are just one signal among multiple others.


I encountered something similar the other day. I was on a moving website to schedule a move across town. I put in my info, and they gave me a quote, but the process was weird. It wouldn't let me back out of the quote completely, and if I returned to the main page of the site, it still showed my quote.

So I opened an incognito window and went through the process, no problems. Closed all incognito windows. Opened another, went to the site, and my quote was up, in a new incognito tab.

It was jarring because it meant that they were tracking me somehow, obviously not through the standard mechanisms as I was incognito.


This sounds to me like they were tracking your IP.


Change your IP with a VPN like TunnelBear.

Most likely, it is only your IP.


This was the talk with the URLs that showed whether you are logged in and what your profile is (it worked with different services), right?

What I didn't understand: how was this data captured? Third-party tracking through ad networks and the like? (Would I be safe with NoScript or even some privacy-aware adblocker?)


Seems you skipped most of the talk; they discuss extensively how it was (probably) captured.

The tl;dw would probably amount to: extensions like Web of Trust, cross-site requests (JS), and lesser sources like cookies, if available.


Ah sorry, might have been that I joined the stream late. Thank you!


I watched that video for the second time today and am still confused why they say it's not really blockable by the end user.

Isn't this fully blocked by a Pi-hole (a DNS server with tracker/advertising blocklists)? Even the extension tracking should be completely killed by that.


Sorry if this wasn't clear. What we mean is that client-side blocking won't help you stay anonymous, because even a server-side tracker can collect enough information to deanonymize you as long as enough of your "fundamental data" (such as your IP) does not change. So in addition to client-side blocking, you should also use a VPN service to mask and (frequently) change your IP address.


Thank you for the video! We will take a look.


This is great work btw, congrats on that paper!


Any idea how the Google Translate URL ended up in an HTTP Referer header? Or how they got the data?


It has been a couple of days since I watched the video, but if I recall correctly, their point was not about the tracking by browser extensions (which is horrible, and as mentioned above is in most cases blockable) but about the ease with which these datasets can be used to de-anonymize users completely. Most large companies will be able to construct at least a partial dataset by licensing data from various tracking providers.


This is why I use isolated sessions when browsing. I compartmentalize my surfing these days because of this exact type of attack (identities and other browsing artifacts spilling over into serendipitous/casual/random browsing). Mozilla are even going to ship this strategy in Firefox soon[1]. Another way to lessen the amount of data collected on you is to outright disable Facebook Like buttons and Twitter share buttons, because these widgets track you as you navigate around the web. This can be done in uBlock Origin[2] under the '3rd-party filters' tab by selecting Fanboy's Annoyance filter list alongside the Anti-ThirdpartySocial filter list.

[1]: https://wiki.mozilla.org/Security/Contextual_Identity_Projec...

> Individuals behave differently in the world when they are in different contexts. The way they act at work may differ from how they act with their family. Similarly, users have different contexts when they browse the web. They may not want to mix their social network context with their work context. The goal of this project is to allow users to separate these different contexts while browsing the web on Firefox. Each context will have its own local state which is separated from the state of other contexts

[2] https://github.com/gorhill/uBlock


It's my understanding that the Privacy Badger Chrome extension from the EFF also nullifies this ability: https://www.eff.org/privacybadger


Yes, 'firefox -no-remote -P' is your friend; I've been doing this for quite some time.

Google is one Firefox profile, Facebook/WhatsApp is another, and my main browser is yet another profile.


What does the -no-remote switch do?

I looked here: https://developer.mozilla.org/en-US/docs/Mozilla/Command_Lin...

It says

    Do not accept or send remote commands; implies -new-instance.
But what are remote commands?


As I keep reminding people here, there's no such thing as "anonymized data" - there's only "anonymized until combined with other datasets". Thanks for the demonstration, and thanks 'MaurizioP for posting this.


Isn't that the main point of differential privacy?


From a cursory reading of the Wikipedia article on differential privacy, I understand it is about limiting access to data to a set of queries designed to make it difficult to use the query results to infer data the provider wants to protect. That doesn't seem to help, though, if the malicious actor owns the dataset and can run whatever queries they like on it. I'm also not sure how this addresses the problem of bringing in additional datasets to help correlate out identities.


(From memory, so I may miss some details.)

Differential privacy adds 'noise' to a dataset so that it is statistically provable that, if queries are restricted to A, B, C..., an adversary can only distinguish the target from noise with some bounded probability. Restricting access to a subset of queries isn't really the main point of how DP improves privacy; rather, it is what makes the formal proofs doable.


> I'm also not sure how this addresses the problem of including additional datasets to help correlate out identities.

I agree with you on the first part of your post, but this part is a little off the mark. In the original paper[1], Cynthia Dwork confronts the issue you point out head on; she actually starts with an impossibility proof showing that no treatment of the data can achieve the property "access to a statistical database should not enable one to learn anything about an individual that could not be learned without access". The impossibility result relies on the existence of outside datasets.

DP instead tries to quantify the probability of identification, and adds differing amounts of Laplace noise to get this. The idea is that the dataset shouldn't look "too different" with or without your information in it. If your participation doesn't change the dataset much, how could someone tell if you are in it or not, or moreover link you to a data point in it?

[1] http://www.ccs.neu.edu/home/cbw/static/class/5750/papers/dwo...
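For concreteness, here is a minimal sketch of the Laplace mechanism for a counting query (my own illustration; the function names and example data are made up, not from the paper):

```python
import math
import random

def laplace_noise(scale):
    # Sample from Laplace(0, scale) via inverse transform sampling.
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one
    # person changes the count by at most 1. Laplace noise with
    # scale 1/epsilon then yields epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# With a large epsilon (weak privacy) the noisy count stays close to
# the true count; with a small epsilon it is heavily perturbed.
histories = [{"nyt.com"}, {"cnn.com"}, {"nyt.com", "cnn.com"}]
noisy = dp_count(histories, lambda h: "nyt.com" in h, epsilon=0.5)
```

The intuition from the parent comment is visible here: because one person's presence shifts the true count by at most 1, noise on the order of 1/epsilon drowns out any individual's contribution.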


If the data owner is malicious, you are right that there is no guarantee differential privacy is being used. That's a far cry from "there is no such thing as anonymized data"; you could just as easily say "there is no such thing as encrypted data", which I think we agree is wrong?

One can provide differentially private access to web histories that provably masks the presence or absence of individuals in that dataset, even when combined with arbitrary existing side information, like social networks.

Even with something as simple as differentially private counting, you can pull out correlations between visits, finding statistically interesting "people who visit page X then visit page Y surprisingly often" nuggets, which are exciting to people who don't have their own search logs.


Related reading, Paul Ohm, Broken promises of privacy:

http://paulohm.com/classes/infopriv13/files/week8/ExcerptOhm...

It is extremely hard not to leak personally identifying information, and by combining datasets it is fairly easy to identify a unique individual.


At the launch of G+ I tested a way to link the search terms people used to reach my page to G+ users - if they clicked a +1 G+ button (which had some nifty callbacks back then). It kinda worked: I could narrow the search term down to a set of users, which I then reduced further with manual review.

I think Google must have been aware of the issue of intermingling Search with Social, as a pretty short time later they scrapped the search query referrers from (organic) Google Search.

Note: It was just a proof of concept to see if I could. Also: Nobody used G+ anyway.


Search query referrers were dropped due to the adoption of HTTPS/TLS. When a URL is visited from an HTTPS site, the browser does not send the full referrer URL, only the host.

It's still possible to get "aggregate" information about keywords via Google Webmaster Tools.

https://webmasters.googleblog.com/2012/03/upcoming-changes-i...


Yeah, default HTTPS for logged-in users with partial referrer stripping started in 2011. Besides being a good idea anyway, it came after the launch of G+ profiles, when there was suddenly a public "real name" piece of information connecting search queries to G+ users.

Google Search Console data doesn't let you connect Google search queries to sessions.


Google passing your search queries in the referral URL was always a privacy issue. The TLS/SSL issue was an easy excuse to get rid of it.


> Our approach is based on a simple observation: each person has a distinctive social network, and thus the set of links appearing in one’s feed is unique.

Compartmentalization mitigates that threat. I have multiple online personas, and Mirimir is the only one who goes on about privacy, anonymity, etc. The only one who visits HN, Wilders, etc. Mirimir and other online personas also share no contacts. However, Mirimir does use pseudonyms ;)


Couldn't you also be identified with a certain probability based on your writing style (stylometry)? I remember reading something about that years ago.

Makes me think someone should make a Chrome extension that rewrites everything someone posts while preserving the concepts they are trying to get across, to avoid identification via stylometry.


I don't know about the Chrome extension, but "anonymouth" is one such tool:

https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth

https://github.com/psal/anonymouth


I find that stylometry is one of the easier barriers to overcome. For example, one pseudonym can persistently make a grammatical error (e.g. failing to capitalise proper nouns) and another can use UK English spellings.

And then there's the general 'vibe' of the forum, which shapes how a pseudonym writes. If I were to write the same thing on HN versus IRC, I would hope that the styles would be very different.

What's really difficult is compartmentalising information, so that even on the same website two pseudonyms don't demonstrate that they have the same knowledge.

Is it worth the bother? Possibly not, but I find it also makes me concentrate on what I'm writing.


I never use multiple personas on the same site. It's just too hard to avoid linkage. And, on principle, I don't use sockpuppets.


None of my other personas post very much in public. At times, I've been active in other languages, with very different styles.


I hope "erehwon" is just one of those pseudonyms and doesn't connect to any other of your identities :).

That said, deanonymization methods stack up; add geotemporal correlation of activities and one could presumably connect your various identities together ;).


Yes, erehwon is one of them: https://news.ycombinator.com/user?id=mirimir ;)

There's no geo, for me. And I'm not at all organized, so not much temporal, either.


>There's no geo, for me

Bot or astronaut?


Mostly, nested VPN chains and Tor. And occasionally, I2P. Even JonDo. It's amazing what one can do in VirtualBox and VPS.


Astronauts are trivial to track; orbit changes are expensive...


Sure. But virtual orbit changes are not :)


Is there really much value in keeping the same HN account? Perhaps account rotation would be a much better method of obscuring your identity. Some social networks do have value for maintaining the same friends, while others do not.


It's not anonymity that I seek with Mirimir.


Do ISPs sell our browsing data?


It shouldn't be too hard to find someone's identity if you have either their browsing history or their Twitter profile, or in this case both, right?


If you have one browsing history and one Twitter profile, it is easy to compare them. However, the main technical challenge is that we have to find some way to compare a given browsing history to every single profile on Twitter, which in theory is hundreds of millions of comparisons, plus we would have to find some way to get hundreds of millions of news feeds. Solving this took us several months. Of course in the paper we explained how we reduced the search space so we only have to consider a limited set of candidates, and only have to crawl part of the news feed in real time.

Of course if a couple of grad students could do this in a few months with only publicly available data (plus donated browsing history), then I'm sure ad networks could easily do this too, and given how many ad networks have parts of your browsing history I would say that it's scary that it's so "easy."
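A toy sketch of the search-space-reduction idea (my own illustration, not the paper's actual pipeline; names and links are made up): build an inverted index from link to the users who posted it, so only users sharing at least one link with the history need full scoring.

```python
from collections import defaultdict

# Hypothetical link sets for a few users; in practice these would be
# hundreds of millions of crawled Twitter timelines.
posted = {
    "alice": {"a.com/1", "b.com/2"},
    "bob":   {"c.com/3"},
    "carol": {"a.com/1", "d.com/4"},
}

# Inverted index: link -> set of users who posted it.
index = defaultdict(set)
for user, links in posted.items():
    for link in links:
        index[link].add(user)

# Only users sharing at least one link with the anonymized history
# are candidates for the expensive full comparison.
history = {"a.com/1", "d.com/4"}
candidates = set().union(*(index[l] for l in history if l in index))
print(sorted(candidates))  # only alice and carol need full scoring
```

This turns "compare against every profile" into "compare against the few profiles that overlap at all", which is what makes the problem tractable.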


Well, as the paper notes, we don't quite have their browsing history, just the domains (since HTTPS encrypts the paths). But yes, the domains you visit make you unique and therefore identifiable.


It's never too hard to figure something out if you're given the whole shebang instead of a loose breadcrumb trail.


How much does it help to selectively block third party scripts/frames/plugins with something like uMatrix?


This is useful in that it limits the set of people who have access to your browsing history. But first party tracking will still be a thing (i.e. if you are visiting a website, the owners of that website will know you are visiting it). Actually there is a paper that says that if you only use the New York Times links that people click on from Twitter, it still makes you uniquely identifiable in many cases, so a company like the NYT could (theoretically) track people without using any third party tracking at all.

http://cosn.acm.org/2014/files/cosn012s-chaintreauAemb.pdf
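To illustrate that uniqueness claim with a toy example (mine, not from the linked paper): treat each user's clicked-article set as a signature and check which signatures are one-of-a-kind.

```python
from collections import Counter

# Toy data: each user's set of clicked articles, as a frozenset so it
# can be used as a dictionary key. Even short histories tend to be
# unique combinations.
clicks = {
    "u1": frozenset({"nyt.com/a", "nyt.com/b"}),
    "u2": frozenset({"nyt.com/a", "nyt.com/c"}),
    "u3": frozenset({"nyt.com/b", "nyt.com/c"}),
    "u4": frozenset({"nyt.com/a", "nyt.com/b"}),  # collides with u1
}

sig_counts = Counter(clicks.values())
unique_users = [u for u, s in clicks.items() if sig_counts[s] == 1]
print(sorted(unique_users))  # u2 and u3 have one-of-a-kind histories
```

With real traffic, the number of possible article combinations grows so fast that nearly every user's set is unique, which is the paper's point.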



