
De-Anonymizing Web Browsing Data with Social Networks [pdf] - MaurizioP
http://randomwalker.info/publications/browsing-history-deanonymization.pdf
======
ThePhysicist
At the 33c3 I showed how you can often uniquely identify a single anonymized
user from a dataset containing three million people by using publicly posted
links from his/her Twitter timeline. I also showed that this is possible with
other types of public information as well, such as YouTube video ratings or
reviews on Google Maps.

The math behind it is quite simple and very reliable for many datasets, which
makes it very easy to build robust fingerprints based on browsing / location /
behavior data. In my opinion, this is what most big companies rely on today
for identifying users, as this is more robust than cookie-based mechanisms,
which become more ineffective as the use of multiple devices and blockers
increases.
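
As a rough illustration of the kind of matching described above, here is a minimal sketch. It assumes the anonymized dataset is a mapping from pseudonymous IDs to sets of visited URLs, and that the public information is a set of links someone posted. The function name, the IDs, and the URLs are all made up for the example; this is not the code from the talk.

```python
from typing import Dict, List, Set


def candidates_containing(public_links: Set[str],
                          histories: Dict[str, Set[str]]) -> List[str]:
    """Return the anonymized IDs whose history contains every public link."""
    return sorted(uid for uid, visited in histories.items()
                  if public_links <= visited)


# Hypothetical toy dataset: pseudonymous ID -> set of visited URLs.
histories = {
    "anon-1": {"a.example/x", "b.example/y", "c.example/z"},
    "anon-2": {"a.example/x", "d.example/q"},
    "anon-3": {"a.example/x", "b.example/y"},
}

# One public link is not enough: every user visited it.
one_link = candidates_containing({"a.example/x"}, histories)

# Three public links narrow the candidates down to a single user.
three_links = candidates_containing(
    {"a.example/x", "b.example/y", "c.example/z"}, histories)

print(one_link)
print(three_links)
```

Each additional public link can only shrink the candidate set, which is why a handful of links is often enough to single out one record.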

Here's the link to the video (you can choose the language in the menu, by
default it's German but the talk is also available in English and French):

[https://media.ccc.de/v/33c3-8034-build_your_own_nsa](https://media.ccc.de/v/33c3-8034-build_your_own_nsa)

~~~
grepthisab
I encountered something similar the other day. I was on a moving website to
schedule a move across town. I put in my info, and they gave me a quote, but
the process was weird. It wouldn't let me back out of the quote completely,
and if I returned to the main page of the site, it still showed my quote.

So I opened an incognito window and went through the process, no problems.
Closed all incognito windows. Opened another, went to the site, and my quote
was up, in a new incognito tab.

It was jarring because it meant that they were tracking me somehow, obviously
not through the standard mechanisms as I was incognito.

~~~
nix0n
This sounds to me like they were tracking your IP.

------
spaceboy
This is why I use isolated sessions when browsing. I compartmentalize my surfing
these days because of this exact type of attack (identities and other browsing
artifacts spilling over into serendipitous/casual/random browsing). Mozilla
are even going to ship this strategy in Firefox soon[1]. Another strategy to
lessen the amount of data collected on you is to outright disable Facebook
like buttons and Twitter share buttons, because these widgets track you as you
navigate around the web. This can be done in uBlock Origin[2] under the '3rd
party filters' tab by selecting the _Fanboy's Annoyance_ filter list alongside
the _Anti-ThirdpartySocial_ filter list.

[1]:
[https://wiki.mozilla.org/Security/Contextual_Identity_Projec...](https://wiki.mozilla.org/Security/Contextual_Identity_Project/Containers)

> _Individuals behave differently in the world when they are in different
> contexts. The way they act at work may differ from how they act with their
> family. Similarly, users have different contexts when they browse the web.
> They may not want to mix their social network context with their work
> context. The goal of this project is to allow users to separate these
> different contexts while browsing the web on Firefox. Each context will have
> its own local state which is separated from the state of other contexts_

[2] [https://github.com/gorhill/uBlock](https://github.com/gorhill/uBlock)

~~~
unexistance
yes, 'firefox -no-remote -P' is your friend, been doing this for quite some
time.

Google is a firefox profile, facebook / whatsapp is another, main browser is
another profile

~~~
spaceboy
What does the -no-remote switch do?

I looked here:
[https://developer.mozilla.org/en-US/docs/Mozilla/Command_Lin...](https://developer.mozilla.org/en-US/docs/Mozilla/Command_Line_Options#-no-remote)

It says:

    Do not accept or send remote commands; implies -new-instance.

But what are _remote commands_?

------
TeMPOraL
As I keep reminding people here, there's no such thing as "anonymized data" -
there's only "anonymized until combined with other data sets". Thanks for the
demonstration, and thanks 'MaurizioP for posting this.

~~~
lbeziaud
Isn't that the main point of differential privacy?

~~~
TeMPOraL
From cursory reading of the Wikipedia article on differential privacy, I
understand this is about limiting _access_ to data to a set of queries that
are designed to make it difficult to use the query results to infer data the
provider wants to protect. That doesn't seem to help in any way though, if the
malicious actor _owns_ the dataset and can run whatever queries they like on
it. I'm also not sure how this addresses the problem of including additional
datasets to help correlate out identities.

~~~
divbit
(From memory, so I may be missing some details)

Differential privacy uses the concept of adding 'noise' to a dataset so that
it becomes statistically provable that, if the allowed queries are only A, B,
C..., the adversary can tell the target from noise only with some bounded
probability. Restricting access to a subset of queries isn't really the main
way DP improves privacy; rather, that restriction is what makes the formal
proofs doable.
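
To make the noise idea concrete, here is a minimal sketch of the Laplace mechanism applied to a single counting query. This is one standard way to get an epsilon-differentially-private answer, and it perturbs the query result rather than the stored records. The function names and the numbers are assumptions chosen for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Answer a counting query with Laplace noise calibrated to epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # A count changes by at most 1 when one person is added or removed,
    # so its sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the example is repeatable
answers = [noisy_count(100, epsilon=0.5, rng=rng) for _ in range(20000)]
mean_answer = sum(answers) / len(answers)

print(round(answers[0], 2), round(mean_answer, 2))
```

A smaller epsilon means more noise per answer and a stronger guarantee. Individual answers are unreliable about any one person, while the average over many draws stays close to the true count.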

------
jacquesm
Related reading, Paul Ohm, Broken promises of privacy:

[http://paulohm.com/classes/infopriv13/files/week8/ExcerptOhm...](http://paulohm.com/classes/infopriv13/files/week8/ExcerptOhmBrokenPromises.pdf)

It is extremely hard not to leak personally identifying information, and by
combining datasets it is fairly easy to identify unique individuals.

------
franze
At the launch of G+ I tested a way to assign search terms that people used to
get to my page to G+ users - if they clicked on a +1 G+ button (which had
some nifty callbacks then). Kinda worked. I could assign the search term to a
set of users which I narrowed down with manual review.

I think Google must have been aware of the issue of intermingling Search with
Social as a pretty short time later they scrapped the search query referers
from (organic) Google Search.

Note: It was just a proof of concept to see if I could. Also: Nobody used G+
anyway.

~~~
aub3bhat
Search query referers were dropped due to the adoption of HTTPS/TLS. When a
URL is visited from an HTTPS site, the browser does not send the full referrer
URL, only the host.

It's still possible to get "aggregate" information about keywords via Google
Webmaster Tools.

[https://webmasters.googleblog.com/2012/03/upcoming-changes-i...](https://webmasters.googleblog.com/2012/03/upcoming-changes-in-googles-http.html?m=1)

~~~
franze
Yeah, default-HTTPS-for-logged-in-users with part of the referrer stripped
started in 2011. Besides being a good idea in itself, it came after the launch
of G+ profiles, when there was now a public "real name" piece of information
that connected search queries to G+ users.

Google Search Console data doesn't let you connect Google search queries to
sessions.

------
mirimir
> Our approach is based on a simple observation: each person has a distinctive
> social network, and thus the set of links appearing in one’s feed is unique.

Compartmentalization mitigates that threat. I have multiple online personas,
and Mirimir is the only one who goes on about privacy, anonymity, etc. The
only one who visits HN, Wilders, etc. Mirimir and other online personas also
share no contacts. However, Mirimir does use pseudonyms ;)

~~~
wakkaflokka
Couldn't you also be identified to a certain probability based on your writing
style (stylometry)? I remember reading something about that years ago.

Makes me think someone should make a Chrome extension that will re-write
everything someone posts based on the concepts they are trying to get across.
To avoid being identified via stylometry.

~~~
dingaling
I find that stylometry is one of the easier barriers to overcome. For example,
one pseudonym can persistently make a grammatical error (e.g. failing to
capitalise proper nouns) and another can use UK English spellings.

And then there's the general 'vibe' of the forum which shapes how a pseudonym
writes. If I were to write the same thing on HN versus IRC I would hope that
the styles would be very different.

What's really difficult is compartmentalising information, so that even on the
same website two pseudonyms don't demonstrate that they have the same
knowledge.

Is it worth the bother? Possibly not, but I find it also makes me concentrate
on what I'm writing.

~~~
mirimir
I never use multiple personas on the same site. It's just too hard to avoid
linkage. And, on principle, I don't use sockpuppets.

------
macawfish
Do ISPs sell our browsing data?

------
z3t4
It shouldn't be too hard to find someone's identity if you have either their
browsing history or Twitter profile, or in this case both!?

~~~
JessicaSu
If you have one browsing history and one Twitter profile, it is easy to
compare them. However, the main technical challenge is that we have to find
some way to compare a given browsing history to every single profile on
Twitter, which in theory is hundreds of millions of comparisons, plus we would
have to find some way to get hundreds of millions of news feeds. Solving this
took us several months. Of course in the paper we explained how we reduced the
search space so we only have to consider a limited set of candidates, and only
have to crawl part of the news feed in real time.
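
To give a feel for what reducing the search space can look like, here is a minimal sketch using an inverted index from links to the users whose feeds contain them. Only users who share at least one link with the history are ever scored. The rarity weighting is an assumption made for this example and is not the scoring used in the paper; the user names and links are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def build_index(feeds: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each link to the set of users whose feed contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for user, links in feeds.items():
        for link in links:
            index[link].add(user)
    return index


def rank_candidates(history: Set[str], index: Dict[str, Set[str]],
                    top_k: int = 3) -> List[Tuple[str, float]]:
    """Score only users sharing a link with the history; rare links count more."""
    scores: Counter = Counter()
    for link in history:
        users = index.get(link, set())
        for user in users:
            scores[user] += 1.0 / len(users)
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical feeds: user -> links that appeared in their feed.
feeds = {
    "alice": {"l1", "l2", "l3"},
    "bob": {"l1", "l4"},
    "carol": {"l5"},
}
index = build_index(feeds)
ranked = rank_candidates({"l1", "l2", "l3", "l9"}, index)

print(ranked)
```

Here "carol" is never scored because none of her links appear in the history, which is the point: the index lookup replaces a comparison against every profile.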

Of course if a couple of grad students could do this in a few months with only
publicly available data (plus donated browsing history), then I'm sure ad
networks could easily do this too, and given how many ad networks have parts
of your browsing history I would say that it's scary that it's so "easy."

------
rthomas6
How much does it help to selectively block third party scripts/frames/plugins
with something like uMatrix?

~~~
JessicaSu
This is useful in that it limits the set of people who have access to your
browsing history. But first party tracking will still be a thing (i.e. if you
are visiting a website, the owners of that website will know you are visiting
it). Actually there is a paper that says that if you only use the New York
Times links that people click on from Twitter, it still makes you uniquely
identifiable in many cases, so a company like the NYT could (theoretically)
track people without using any third party tracking at all.

[http://cosn.acm.org/2014/files/cosn012s-chaintreauAemb.pdf](http://cosn.acm.org/2014/files/cosn012s-chaintreauAemb.pdf)
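
One simple way to check how identifying a single site's click data is: count how many users have a set of clicked links that nobody else shares. The sketch below does that on a toy mapping; the function name and the data are assumptions for illustration and are not taken from the linked paper.

```python
from collections import Counter
from typing import Dict, Set


def unique_fraction(clicks: Dict[str, Set[str]]) -> float:
    """Fraction of users whose exact set of clicked links is shared by no one else."""
    if not clicks:
        return 0.0
    counts = Counter(frozenset(links) for links in clicks.values())
    unique = sum(1 for links in clicks.values()
                 if counts[frozenset(links)] == 1)
    return unique / len(clicks)


# Hypothetical first-party log: user -> article links they clicked.
clicks = {
    "u1": {"a", "b"},
    "u2": {"a", "b"},
    "u3": {"a", "c"},
    "u4": {"d"},
}
fraction = unique_fraction(clicks)

print(fraction)
```

In this toy data, u3 and u4 have click sets nobody else shares, so half of the users are unique from first-party data alone.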

