
Mozilla research: Browsing histories are unique enough to identify users - chris_f
https://www.zdnet.com/article/mozilla-research-browsing-histories-are-unique-enough-to-reliably-identify-users/
======
Groxx
I feel inclined to say "... well yeah, obviously".

Not in the "obvious in retrospect" way, but because browsers have been
progressively blocking history-sniffing tactics for _years_ precisely because
advertisers were using it to identify visitors.

Did this research... establish better numbers around it or something?

~~~
godelski
> Did this research... establish better numbers around it or something?

>> However, this time around, since the data was collected from Firefox itself
and not through a web page performing a time-lengthy CSS test, the data was
much more accurate and reliable. Furthermore, the data Mozilla researchers
collected is also about the same type of data that today's online analytics
companies also collect about users — either through data partnerships, mobile
apps, online ads, or other mechanisms.

~~~
dralley
It needs to be specified that this was an opt-in study.

~~~
godelski
Is no one in this thread going to read this article? Seriously, it isn't that
long. RTFM

>> The new experiment got underway between July 16 and August 13, 2019, when
Mozilla prompted Firefox users to take part of this experiment.

>> Mozilla researchers said that more than 52,000 users agreed to take part
and agreed to provide anonymous browsing data.

~~~
ocdtrekkie
Clearly not actually anonymous browsing data, though... which is why we should
always take claims that telemetry data is anonymized with a grain of salt.

~~~
tonyhb
Anonymous as in the only thing they're getting is a random identifier and
browsing data.

------
the_jeremy
Who is able to get access to my browser history? I thought it was just my
ISP/VPN, which can obviously track me better in other ways.

~~~
slipheen
Consider, for example, that many pages use remotely loaded resources.

I would think things like Facebook/Twitter like buttons or Google Fonts might
make it possible to assemble this history. Sites like FB are said to maintain
"Shadow Profiles" of people, even when those people aren't using their service
directly.

I suppose in theory any sufficiently shared infrastructure such as
AWS/Cloudflare could do so as well, but they are disincentivized to do so.

~~~
WilTimSon
Would using Firefox's 'Containers' help prevent this? As far as I understand
they quarantine the Facebook pages so they can't get data from other websites
you visit.

~~~
cma
Measuring DNS resolve times, to see whether a domain is already cached by the
OS, can potentially breach that.

~~~
laumars
Can JavaScript measure DNS resolve time?

~~~
cma
I think only indirectly, but if they control the endpoint they can ping you
back, subtract the RTT from the initial request's response time, and that
difference tells them whether the initial request's DNS lookup was cached or
not.
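The inference described above can be sketched in a few lines. This is a
hypothetical illustration, not anything a real tracker is known to run as-is:
the latency numbers and the threshold are invented, and a real attacker would
measure them from their own server.

```python
# Sketch of the timing side channel: a server that controls the endpoint
# measures the plain round-trip time (rtt) separately, subtracts it from the
# first request's total latency, and treats the remainder as DNS resolution
# time. A near-zero remainder suggests the name was already cached.

def dns_was_cached(first_request_ms: float, rtt_ms: float,
                   threshold_ms: float = 5.0) -> bool:
    """Infer whether the DNS lookup was served from a warm cache."""
    dns_time_ms = first_request_ms - rtt_ms
    return dns_time_ms < threshold_ms

# A fresh lookup adds a full resolver round trip; a cached one adds ~nothing.
print(dns_was_cached(first_request_ms=52.0, rtt_ms=50.0))   # prints True
print(dns_was_cached(first_request_ms=110.0, rtt_ms=50.0))  # prints False
```

In practice the signal is noisy, so an attacker would repeat the measurement
and use statistics rather than a single threshold.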

~~~
laumars
Just so I understand correctly, does that mean you then need to control the
end point of every site you want to use as part of fingerprinting?

If so, wouldn’t that drastically reduce the effectiveness of using DNS resolve
times as a work around for Firefox containers?

Not trying to be argumentative here, just trying to understand how effective
the sandboxing is, or whether I need to design more layers of indirection. :)

------
axegon_
That's hardly surprising. I mean, browsers willingly hand out plenty of
information that could be used for pretty accurate identification. Just
scrolling through my scores on amiunique[1], many of the parameters put me in
the 0.01% category.

[1] [https://amiunique.org/fp](https://amiunique.org/fp)
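To put the "0.01% category" figure in perspective, fingerprinting analyses
usually convert trait rarity into bits of identifying information. A quick
back-of-envelope (the 8-billion population figure is just a round number):

```python
import math

def identifying_bits(p: float) -> float:
    """A trait shared by a fraction p of users carries -log2(p) bits."""
    return -math.log2(p)

# A trait in the 0.01% category contributes ~13.3 bits on its own...
print(round(identifying_bits(0.0001), 1))   # prints 13.3

# ...while ~33 bits are enough to single out one person among ~8 billion.
print(round(math.log2(8_000_000_000), 1))   # prints 32.9
```

So two or three such rare traits combined are already close to uniquely
identifying.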

~~~
wnbc
If you want to be less unique on amiunique.org/fp:

1. Visit the site
2. Delete your browser cookies
3. Refresh
4. Repeat the steps until you're less unique

~~~
curiousgal
Or you know, just block JS.

~~~
squiggleblaz
Congratulations, you just broke 90% of the modern web. Might as well go
directly to Gopher.

~~~
throwaway_pdp09
Congratulations on never actually bothering to block JS and find out - you
know, facts. From actually doing so over many years, and so from actual
experience, I'd say completely non-functional sites are about 25%.

~~~
chrismorgan
I’d put the number quite a bit lower than that, probably comfortably under 10%
of sites I interact with, though the trend is definitely upwards, drastically
so among interactive things (which are probably worse than 50% broken these
days).

------
3np
To address this, a friend and I started sketching a VPN/HTTP proxy that would
have a set of, say, 100 outgoing IPs, look at the domains being connected to,
and distribute request destinations over those IPs.

So e.g. Google would always see the same IP, which would be different from the
one Facebook sees.

While cross-referencing access times to identify users is still theoretically
possible, it should be an entirely different game.
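The routing idea above can be sketched as a deterministic domain-to-IP
mapping, so a given tracker always sees the same egress IP while different
trackers see different ones. Everything here is invented for illustration
(the IP pool, the hash choice), not a description of any existing proxy:

```python
import hashlib

# Hypothetical pool of 100 egress IPs (documentation-range addresses).
EGRESS_IPS = [f"198.51.100.{i}" for i in range(100)]

def egress_ip_for(domain: str) -> str:
    """Stable domain -> egress IP mapping via a hash of the domain."""
    digest = hashlib.sha256(domain.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(EGRESS_IPS)
    return EGRESS_IPS[index]

# The same domain always maps to the same IP, across sessions and users,
# while unrelated domains usually land on different IPs.
print(egress_ip_for("google.com") == egress_ip_for("google.com"))  # prints True
```

A real implementation would also have to keep subdomains of one tracker on
the same egress IP, or the split is trivially defeated.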

Would anyone else reading this be interested in working on this or joining in?
I'm not thinking of making it a startup or business per se, but 1) reliable
IPs are a bit too expensive to make sense for just one person, and 2) there's
anonymity in numbers.

I'm thinking the ideal would be something FOSS and easy to self-host and
replicate, so you can pool together a group of friends for a shared VPN among
semi-trusted parties (at least the user should trust the operator not to index
requests and sell the data, and the operator should trust users not to run
botnets).

~~~
nemothekid
I think an easier approach is that once you have good IPv6 connectivity you
could do something like a unique address per day per host. Every device could
have 100M ip addresses and it wouldn't touch the IPv6 address space (10
billion humans * 100 devices = 0.000005% of the IPv6 address space).

Edit: My math is wrong. I thought IPv6 was 2^64, but it's actually 2^128, so
that percentage is 10^20 times more minuscule.
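The corrected arithmetic checks out; here it is spelled out with the round
numbers from the comment above (10 billion humans, 100 devices each, 100M
addresses per device):

```python
# How much of the 2^128 IPv6 space would rotating per-day addresses consume?
humans = 10_000_000_000
devices_per_human = 100
addresses_per_device = 100_000_000

used = humans * devices_per_human * addresses_per_device  # 10^20 addresses
total = 2 ** 128                                          # ~3.4e38 addresses

fraction = used / total
print(f"{fraction:.3e}")  # on the order of 1e-19 of the address space
```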

~~~
3np
In that scenario those 100 IPv6 addresses in the subnet would be practically
equivalent to an IPv4 address today and would provide no extra benefit.

~~~
croon
It would if hundreds of different users came through those IPs, wouldn't it?

------
LatteLazy
Here in the UK, date of birth and postcode are enough to identify something
like 95% of people. Anonymised data sets are not really possible once you have
more than a few variables. Most people don't know this.
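A back-of-envelope sanity check of that claim, using invented round numbers
(a UK full postcode covers on the order of tens of residents, and ~100 years
of 365 possible birthdays gives roughly 36,500 plausible dates of birth):

```python
# Probability that nobody else in your postcode shares your exact date of
# birth, assuming birthdates are spread uniformly (a simplification).
residents = 40          # assumed residents per full postcode
birthdates = 36_500     # assumed distinct plausible dates of birth

p_unique = (1 - 1 / birthdates) ** (residents - 1)
print(f"{p_unique:.4f}")  # very close to 1: DOB + postcode almost always unique
```

Even with generous assumptions the pair is unique for the vast majority of
people, which is why "anonymised" records carrying both fields aren't really
anonymous.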

~~~
MattGaiser
Isn’t a postal code about 50-100 houses? That variable alone really narrows
things down.

~~~
godelski
That's a bit insane. I looked up US numbers to check and got around 8k/zip
code[0]

[0] [https://www.zip-codes.com/zip-code-statistics.asp](https://www.zip-
codes.com/zip-code-statistics.asp)

~~~
patrickmcnamara
In Ireland, we use Eircode with one house per postcode. This is very handy
because you don't need to type in your full address on a lot of websites, just
the Eircode.

The first three characters of an Eircode are more like a traditional postcode
in that they indicate your area/town, but the next four are randomised for
each address.

~~~
hanniabu
Man, that sounds so convenient. I kind of wish everywhere had that minus the
privacy factors

------
yalogin
Intuitively, there are tons of things we do on our computers that uniquely
identify us. I am sure the adware companies know a ton more that isn't public.
Strict privacy-preserving tech is needed across the whole stack.

------
dmos62
By looking at all the data available to untrusted sites (as seen in
[https://amiunique.org/fp](https://amiunique.org/fp)) you can tell that the
Web is many, many years away from being privacy conscious. List of fonts,
canvas fingerprinting, timezone, OS, user agent... the list goes on and on.
Those of us who are tech-literate know better than to create tech like this
today, but there's just too much momentum (and too many shady interests) to
hot-swap the Web for something else.

------
aaron695
I think this is as stupid as it sounds from the paper -
[https://www.usenix.org/conference/soups2020/presentation/bir...](https://www.usenix.org/conference/soups2020/presentation/bird)

Why not "Mozilla research: We asked users for their name and address and the
ones telling the truth we could identify"

Tor is fighting identification of users via the screen size of their window
when maximised.

Here's the original paper, which is more about how browser histories can be
accessed -
[https://www.petsymposium.org/2012/papers/hotpets12-4-johnny....](https://www.petsymposium.org/2012/papers/hotpets12-4-johnny.pdf)

Can you still access browser histories? I'd have to guess no way without a
zero-day. The original site is down.
[http://www.wtikay.com/](http://www.wtikay.com/) Firefox fixed it -
[https://bugzilla.mozilla.org/show_bug.cgi?id=147777](https://bugzilla.mozilla.org/show_bug.cgi?id=147777)

------
moonchild
Wasn't it shown via the AOL search logs years ago that search histories are
uniquely identifying? If so, this seems hardly surprising, as browser history
should be a superset of search history.

------
amai
As counterstrategy you can use tools like
[http://trackmenot.io/](http://trackmenot.io/)

"TrackMeNot runs as a low-priority background process that periodically issues
randomized search-queries to popular search engines, e.g., AOL, Yahoo!,
Google, and Bing. It hides users' actual search trails in a cloud of 'ghost'
queries, significantly increasing the difficulty of aggregating such data into
accurate or identifying user profiles. "

~~~
throwaway_pdp09
I use it as far as I can, but it's stopped working in Pale Moon. The queries
it produces aren't very intelligent when you see them, and it wouldn't take
much NSA/MI5 work to filter most of them out.

I think TMN could be a fair bit smarter.

------
MaxBarraclough
The _Evercookie_ (hard-to-delete cookie-like system in JavaScript) and
_Panopticlick_ (browser fingerprinting) projects may also be of interest:

[https://en.wikipedia.org/wiki/Evercookie](https://en.wikipedia.org/wiki/Evercookie)

[https://panopticlick.eff.org/](https://panopticlick.eff.org/)

------
g42gregory
Interesting. I also think that the browser signature, together with IP
address, will probably come very close to uniquely identifying users.

~~~
lkbm
I noticed the other day that various chatbots (as in, a single service shared
across multiple websites) call me "The University of Texas at Austin",
presumably because I have a housemate who works there.

I tried various VPN servers and got called by other company names[0]. It was a
good reminder about how we're tracked, and our information may be shared, even
with other users.

[0]
[https://twitter.com/lkbm/status/1299408670325964802](https://twitter.com/lkbm/status/1299408670325964802)

------
jedisct1
So can DNS queries.

------
vlovich123
I suspect privacy would be better served by taking the approach of the
security domain, with responsible disclosure to vendors and a concerted effort
to attack the problem holistically. Until then we're just giving privacy
attackers a heads-up, and by the time this issue is mitigated they're onto the
next avenue for bypassing privacy.

------
hkt
Time for a browser plugin that generates random noise, adding junk into your
history.

~~~
mulmen
If it’s truly random wouldn’t that make you even easier to identify?

~~~
airstrike
Not if everyone else uses it too

~~~
mulmen
Even if everyone else uses it.

If your random pages are a, b and c but my pages are d, e and f or even a, b
and d then it’s still easy to fingerprint us.

Extensions like this might work if they visited the same sites all other users
visit. Otherwise you’re just adding even more unique information for the
trackers.

~~~
airstrike
But if both our random pages are a, b and c, and the only difference is when
or how often I accessed each of those, then making it random for both of us
will effectively turn us into the same person.

~~~
mulmen
What about all the other pages you visit? How does _adding_ random traffic to
your history make you any harder to identify? It just creates more datapoints.

------
wombatmobile
If the study establishes that for all practical purposes, online anonymity is
impossible to maintain for average users, what are the implications (a) for
the average user; (b) for the economy; and (c) for society?

------
Lordarminius
Mine certainly is, since I tend to visit the same ten sites over and over
again.

------
option
so are amazon/itunes/appstore/googleplay/netflix-views/etc

