Hacker News
Mozilla research: Browsing histories are unique enough to identify users (zdnet.com)
238 points by chris_f on Sept 1, 2020 | 125 comments



I feel inclined to say "... well yeah, obviously".

Not in the "obvious in retrospect" way, but because browsers have been progressively blocking history-sniffing tactics for years precisely because advertisers were using it to identify visitors.

Did this research... establish better numbers around it or something?


> Did this research... establish better numbers around it or something?

>> However, this time around, since the data was collected from Firefox itself and not through a web page performing a time-lengthy CSS test, the data was much more accurate and reliable. Furthermore, the data Mozilla researchers collected is also about the same type of data that today's online analytics companies also collect about users — either through data partnerships, mobile apps, online ads, or other mechanisms.


It should be noted that this was an opt-in study that users had to explicitly agree to.


Is no one in this thread going to read this article? Seriously, it isn't that long. RTFM

>> The new experiment got underway between July 16 and August 13, 2019, when Mozilla prompted Firefox users to take part of this experiment.

>> Mozilla researchers said that more than 52,000 users agreed to take part and agreed to provide anonymous browsing data.


Clearly not actually anonymous browsing data, though... which is why we should always take claims that telemetry data is anonymized with a grain of salt.


Anonymous as in the only thing they're getting is a random identifier and browsing data.


Welcome to HN, where the majority reads nothing but the headline and discusses how they feel about the headline.


That's nothing to do with HN, that's just how the internet works.


HN is on the internet, no? Also, HN definitely has a particularly bad headline-only problem, or maybe it just shows worse than some other places because people here have a tendency to ask really basic questions that the article clearly answers.


I think the problem here is that HN has higher standards (and we should keep it that way). Reddit is far worse, but I don't want to deal with all the stupid there.


Does HN have higher standards? Maybe different standards, but I don't know that when it comes to reading the article that HN is much better than Reddit.


The subtitle is "Just 50-150 of our favorite sites are enough." More numbers further on in the article, although not that many.


It also says 50-150 websites.

I regularly visit maybe 5 or 6? The rest tend to be random links from reddit or HN, I wonder if visiting a site like that once and never again is enough to help with that identification.

Another thought: I think it's obvious that if it's a site you log into and the URL has an identifier of some type, then it's easy to identify you, and that's why schemes to hide the URL could also be a privacy issue.


Who is able to get access to my browser history? I thought it was just my ISP/VPN, which can obviously track me better in other ways.


Consider, for example, that many pages use remotely loaded resources.

I would think things like Facebook/Twitter like buttons or Google Fonts might make it possible to assemble this history. Sites like FB are said to maintain "shadow profiles" of people, even when those people aren't using their service directly.

I suppose in theory any sufficiently shared infrastructure such as AWS/Cloudflare could do so as well, but they are disincentivized to do so.


Would using Firefox's 'Containers' help prevent this? As far as I understand they quarantine the Facebook pages so they can't get data from other websites you visit.


Measuring DNS resolution time to see whether a domain is cached by the OS can potentially breach that.


Can JavaScript measure DNS resolve time?


I think only indirectly, but if they control the endpoint they can ping you back, subtract the RTT from the initial request's response time, and the difference can tell them whether the initial request's DNS lookup was cached or not.


Just so I understand correctly, does that mean you then need to control the endpoint of every site you want to use as part of fingerprinting?

If so, wouldn’t that drastically reduce the effectiveness of using DNS resolve times as a work around for Firefox containers?

Not trying to be argumentative here, just trying to understand how effective the sandboxing is, or whether I need to design more layers of indirection. :)


It can measure response time. If you host your own DNS and web server, you can vary their response times and record from JS.
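
The inference described above can be sketched as a toy model (the threshold and timings here are invented; real-world measurements are far noisier):

```python
# Hypothetical sketch of the timing inference described above.
# Assumption: the attacker controls the server, so it can measure a
# baseline round-trip time (RTT) with a follow-up request whose DNS
# lookup is certainly cached, then compare it to the first request.

CACHE_THRESHOLD_MS = 5.0  # assumed noise margin, tuned in practice

def dns_was_cached(first_request_ms: float, rtt_ms: float,
                   threshold_ms: float = CACHE_THRESHOLD_MS) -> bool:
    """If the first request took barely longer than a bare round trip,
    no fresh DNS resolution happened, i.e. the name was already cached."""
    return (first_request_ms - rtt_ms) < threshold_ms

# A cached lookup adds almost nothing on top of the RTT...
assert dns_was_cached(first_request_ms=52.0, rtt_ms=50.0)
# ...while an uncached one pays for a full recursive resolution.
assert not dns_was_cached(first_request_ms=95.0, rtt_ms=50.0)
```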


I've started using decentraleyes, hoping to mitigate this issue


Use LocalCDN, a fork of Decentraleyes with many more cached resources. https://addons.mozilla.org/en-US/firefox/addon/local-cdn-web...


You linked to local CDN, I think you wanted localCDN (no space).



Thanks! Don't know how I missed it, when the extension's icon is on the toolbar itself



Have there been any indications that AWS broadly captures connection data between AWS tenants and their respective users for illegitimate purposes?

Some AWS services (such as TLS-terminating load balancers) do have access to sensitive cross-site information that could be fed into the adtech panopticon but I wonder if it would be cost-effective for AWS to gather.

I doubt it would be cost-effective for AWS to do broad captures across all of its services, however. There's probably not much value in slurping up the IP and SNI data for every HTTPS request to every EC2 instance.


Malicious extensions are a likely culprit. This is the ultimate irony of the whole WebExtensions debacle; browser vendors wanted to stop the extensions from interacting with the browser because maintaining that interface is work, so now the most trivial extensions will request full access to all websites so they can inject scripts. To bring back "backspace navigates back" I have an extension that needs just that.


Needing javascript that embeds in every page for basic mouse and keyboard behaviour is insane. No clue why they decided it should be the only viable option.


It isn't intended to be. But supporting the old APIs meant that they had to have Microsoft levels of backwards-compatibility.

• You want to refactor XUL so it doesn't duplicate features of HTML5? Whoops; you broke all the extensions.

• You want multi-threading? What a shame; that API over there assumes it'll always be called from the main thread.

• Update that database table's schema to store more data? Bah. Make another table, or you'll break extensions.

When every implementation detail is part of your interface, bad things happen.


Fine, XUL had to go. But where is the replacement? How many more years should we expect Mozilla to need to implement configurable bindings? It doesn't even need to be extension-accessible, just give users a tab in the preferences menu like damn near any other application has been doing since the dawn of GUIs.

Do they even have one engineer working on this?


Considering they fired half the people working on this, probably not.


I am very much used to Alt+Left arrow to navigate back; mentioning it in the unlikely case you were not aware of this shortcut and would like to drop this extension for whatever reason.


The point of the "backspace navigates back" extension is that alt+left is not how people were used to going back on websites.


On Firefox you can go to about:config and set 'browser.backspace_action' to 0.


The number of times I have been bitten by backspacing when I thought I was in a web form and in fact navigated to a previous page is high.

I sympathize with the other user having to change a default setting or install an extension, but I'm glad that the default behavior changed.


Facebook. Their embedded like buttons are all over the damn internet.

Google Analytics.

Google Fonts.

Javascript CDNs.

Image CDNs. Imgix.

CDNs in general. Akamai. Cloudflare.

Any of the several ads platforms.

Disqus, or any of the commenting platforms.

Youtube. Lots of people embed their video content.

AWS, Google Cloud, Azure.

Every time you visit a page that has an embedded content from a remote host, that host knows you visited that page.
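
As a toy illustration of what that remote host sees (the log line below is invented), the leak requires nothing more exotic than the Referer header in a standard access log:

```python
# When your browser fetches an embedded resource (a like button, a font,
# an analytics beacon), the third-party host's ordinary access log records
# which page you were on via the Referer header. Sample data is made up.
import re

LOG_LINE = ('203.0.113.7 - - [01/Sep/2020:10:00:00 +0000] '
            '"GET /like-button.js HTTP/1.1" 200 1234 '
            '"https://example-news-site.com/some-article" "Mozilla/5.0"')

def referer_of(line: str) -> str:
    # Combined log format: the second-to-last quoted field is the Referer.
    quoted = re.findall(r'"([^"]*)"', line)
    return quoted[-2]

assert referer_of(LOG_LINE) == "https://example-news-site.com/some-article"
```

Modern browsers trim the Referer for cross-origin requests by default these days, but the host still learns your IP and the timing of the visit.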


Hmm this is a bit of an interesting question. The original study (2012) exploited a security bug, which let anyone see which sites you had visited. (Basically, by checking the color of links with JS to see if :visited styling had been applied.) That bug doesn't exist anymore, and the new survey just uses opt-in data to "confirm" it.

So, I don't actually think this research is particularly relevant anymore? It can't really be exploited (and when it can, there's much better ways to track the person).


Anyone with a widely distributed analytics package or tracking beacon can track your hits on pages with that beacon. How many pages DON'T use Google Analytics or a Facebook 'like' button?


Mozilla was sending your Firefox history to Cliqz (in Germany). And currently is sending it to Cloudflare (in US).


Got a source for that one?


Does Mozilla Pocket get browsing history if it isn't disabled? Last I heard it's not using E2E encryption and Mozilla still hasn't open sourced the server side of it.


My understanding of pocket recommendations is that it gets a list of articles from a server every day and uses a local algorithm to match them to your browsing history so your history never leaves the device. Idk if any metadata about which decisions it made is leaked though.

https://help.getpocket.com/article/1142-firefox-new-tab-reco...


I'm under the impression that syncing browsing history between instances of Firefox is a feature Mozilla provides through Pocket, but admittedly I don't have first-hand knowledge of this.


I think syncing is done with a Firefox account (Firefox Sync) and I can't find implementation details, but I did find:

"Firefox Accounts uses your password to encrypt your data (such as bookmarks and passwords) for extra security. When you forget your password and have to reset it, this data could be erased. To prevent this from happening, generate your recovery key before having to reset your password."[1]

So it appears they may be encrypting data locally and syncing encrypted data without having keys.

I think you are right though, there are more website saving features available through pocket other than recommendations and I'm not sure how any of that works.

[1] https://support.mozilla.org/en-US/kb/reset-your-firefox-acco...


It's in the article. You can do some clever Javascript/CSS tricks to sniff the browser history. Browsers are not trying to block this.


*now trying to block this (oops)


Extensions with the necessary permissions would, I think, be one obvious source.


How do you imagine the ISP/VPN gets your browsing history for https traffic beyond domain names?


> beyond domain names

"..other than that, how was the play Mrs. Lincoln?"


Install one shady Android app and it will immediately dump your browsing history, rest assured.


How exactly would it do it? Android permissions prohibit access to other app data, except shared storage.


1. LPE/rooting exploit 2. Win


That's hardly surprising. I mean, browsers willingly hand out plenty of information that could be used for pretty accurate identification. Just scrolling through my scores on amiunique[1], many of the parameters put me in the 0.01% category.

[1] https://amiunique.org/fp


A lot of these make no sense at all.

Using Mac OSX stock audio input and output devices is already supposedly "unique".

Having an AZERTY keyboard supposedly puts you in the 0.04% category even though all French speakers have the same setting, which means that 0.04% already represents 70M+ users. So, far from "unique".


Presumably they ignore correlated values. A French IP, a French locale, and AZERTY are all very correlated.


If you want to be less unique on amiunique.org/fp:

1. Visit the site

2. Delete your browser cookies

3. Refresh

4. Repeat the steps until you're less unique


Or you know, just block JS.


Congratulations, you just broke 90% of the modern web. Might as well go directly to Gopher.


A bit of selective whitelisting with umatrix keeps everything functional while massively helping with privacy.


Use RSS.

The modern Web is a tracking system.


Congratulations on never actually bothering to block JS and find out, you know, facts. From actually doing so over many years, and so from actual experience, I'd say completely non-functional sites are about 25%.


I’d put the number quite a bit lower than that, probably comfortably under 10% of sites I interact with, though the trend is definitely upwards, drastically so among interactive things (which are probably worse than 50% broken these days).


Content language=en-US,en;q=0.9,bg;q=0.8,es;q=0.7

That's gonna take a lot of refreshing.....


In response to this, a friend and I started sketching a VPN/HTTP proxy that would have a set of, say, 100 outgoing IPs, look at the domains being connected to, and distribute request destinations over those IPs.

So e.g. Google would always see the same IP, which would be different from the one Facebook sees.

While cross-referencing access times to identify users would still be theoretically possible, it should be an entirely different game.

Would anyone else reading this be interested in working on this or joining in? I'm not thinking of making it a startup or business per se, but 1) reliable IPs are a bit too expensive to make sense for just one person, and 2) there's anonymity in numbers.

I'm thinking the ideal would be something FOSS and easy to self-host and replicate, so you can pool together a group of friends for a shared VPN among semi-trusted parties (at minimum, the user should trust the operator not to index requests and sell the data, and the operator should trust users not to run botnets).
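
A minimal sketch of that pinning idea (all names and the pool below are hypothetical): hash each destination domain to a fixed egress IP, so a given site always sees the same address while different sites generally see different ones.

```python
# Deterministic domain -> egress IP mapping for the proxy idea above.
# The IP pool is invented; a real deployment would use its own addresses.
import hashlib

EGRESS_IPS = [f"198.51.100.{i}" for i in range(100)]  # assumed pool of 100

def egress_ip_for(domain: str) -> str:
    # Use a stable hash (unlike Python's per-process salted hash()) so the
    # mapping survives proxy restarts and is consistent across pool nodes.
    digest = hashlib.sha256(domain.encode()).digest()
    return EGRESS_IPS[int.from_bytes(digest[:8], "big") % len(EGRESS_IPS)]

# The same domain always exits via the same IP, so e.g. Google would
# consistently see one address, while Facebook would most likely see another.
assert egress_ip_for("google.com") == egress_ip_for("google.com")
assert egress_ip_for("google.com") in EGRESS_IPS
```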


I think an easier approach is that once you have good IPv6 connectivity you could do something like a unique address per day per host. Every device could have 100M ip addresses and it wouldn't touch the IPv6 address space (10 billion humans * 100 devices = 0.000005% of the IPv6 address space).

Edit: My math is wrong. I thought IPv6 was 2^64, but it's actually 2^128, so that percentage is 10^20 times more minuscule.
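
A quick sanity check on the corrected arithmetic:

```python
# Even a wildly generous allocation (100M addresses for each of 100
# devices per person, for 10 billion people) consumes a vanishing
# fraction of the 2^128 IPv6 address space.
humans = 10_000_000_000          # 10 billion people
devices_per_human = 100
addrs_per_device = 100_000_000   # 100M addresses each
used = humans * devices_per_human * addrs_per_device  # 1e20 addresses

fraction = used / 2**128
assert fraction < 1e-18  # roughly 3e-19, i.e. about 3e-17 percent
```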


You get a range from your ISP (e.g. a /64). Everything within that range would be tied to "you" (rather, your connection, but something like the user agent would tell whether it's your wife's iPhone or your MacBook Pro).


In that scenario those 100 IPv6 addresses in the subnet would be practically equivalent to an IPv4 address today and would provide no extra benefit.


It would if hundreds of different users came through those IPs, wouldn't it?


Look up browser fingerprinting. It’s a lot more complicated than just obscuring IPs.


Yes, but the primary identifier is IP address. The detailed profiles built with fingerprinting and other data are attached to the small set of IP addresses a person uses over the course of her lifetime. Most internet users have limited choice when it comes to internet access. A user cannot change her IP address with the same ease as she can change her software fingerprints. If a company is trying to sell online ad services, then having a database of browser fingerprints purportedly representing real people is not very valuable unless the company can link those fingerprints to real physical locations.


Given the prevalence of CGN, especially on mobile/cellular internet, and the reality that mobile comes first for a large number of users, the use of IPs as a primary key feels less likely these days than a decade ago.

> A user cannot change her IP address with the same ease as she can change her software fingerprints.

I dunno. It's a lot easier for my less techie friends to reboot their router and get a new IP than it is to talk them through installing some privacy-enforcing software that requires regular maintenance or results in weird and wonderful breakage of their favourite websites.


Sounds like you are describing two different scenarios: 1. connecting to cellular networks when away from home/office and 2. connecting to internet routers at home/office.

Don't take my word for it, read the work cited in the article. Note how much they still rely on (static) IP addresses. If we removed the IP address as a reliable item of available data, based on observed practices (not theory), that would likely be significant.

"Mishra et al. demonstrated that IP addresses can be static for a month at a time [42] which, as we will show, is more than enough time to build reidentifiable browsing profiles."

"Secondly, ground truth was established based on reidentifying visitors with a combination of IP Address and UserAgent, perhaps biasing the baseline data to under-represent users accessing the web from multiple locations."

"Even if traditional stateful tracking is addressed, IP address tracking and fingerprinting are a real concern as ongoing privacy threats that can work in concert with browser history tracking. We point readers to Mishra et al.'s [42] discussion on IP address tracking and possible mitigations. They observed IP addresses to be static for as long as a month at a time, and while not a perfect tracker, IP addresses are trivial to collect."


How static/deterministic are the CGNAT translations though? It is conceivable that when client A connects to a Facebook service with IP X and port P that the source IP and port observed by Facebook is always the same.

In any case your ISP is probably logging all your DNS queries and all their dynamic NAT translations to a database, so couple REMOTE_ADDR with REMOTE_PORT and a timestamp and you can almost certainly be identified.


Good points, and there's a sibling comment which notes some of this as well.

One thing to note is that your ISP is absolutely going to be logging every flow, with (at the very least) the following details:

SRCIP, SRCPORT, TRANSLATED SRCIP, TRANSLATED SRCPORT, DSTIP, DSTPORT

As to how the translations are occurring, I've never actually managed a CGN platform myself, but based on my knowledge of other hardware, I suspect you're closer to the reality than I was, and it's likely that a SRCIP always results in the same TRANSLATED SRCIP, as that can then be installed in hardware trivially and no longer needs to traverse the punt/cpu path to lookup what the translation needs to be.

That does leave the system open to abuse though, depending on how quickly entries age out, as a single customer could easily open up 65k sockets in a very short span of time, effectively DoS'ing any other customers who are using the same TRANSLATED SRCIP if there are no free TRANSLATED SRCPORTs left that their translation can bind to. Then again, the risk of this could be perceived to be low, with a AUP that can handle this if it turns out to be a social rather than technical problem, so it could still be happening in the wild anyway.

This remains a good reminder for me to avoid speculating about topics I haven't thought too deeply about!


Oh absolutely, it's not a silver bullet - just an attempt at alleviating that single dimension, which I still think is significant enough to take seriously.

IMO there will never be a complete solution but that means we have to tackle each issue or dimension individually within the larger context, not just throw our hands in the air and give up.

Maybe I should have been clearer about the scope of the ambition in the original comment, but I can't edit it anymore.


This sounds like a less secure Tor to me.


> While access times cross-references and identification is still theoretically possible, it should be an entirely different game.

How are you planning to handle communicating identity across sites with link decoration?


Could you elaborate on what you mean by this specifically? query params like fbclid? That could be stripped client-side, e.g. via a browser extension.

There are already extensions that do this. Previous discussion:

https://news.ycombinator.com/item?id=22386388


It sounds to me like your threat model here is that two sites (A and B) which would like to share identity, and you want to stop them? Perhaps A has identity (for example, you log in), A gives you links to B, and B runs some third-party JavaScript served from A. For example, A could be FB/Google/etc and B could be a news site. Site A can add any query parameter it wants to the outgoing link, and then parse it on B in their third-party JavaScript. If they always used the same params (ex: fbclid/gclid) it would be easy to detect and block, but if they were trying to get around the blocking it would be easy to rotate these parameters as often as they wanted because the same entity (A) controls both the producer and the consumer. Now your two identity bubbles have been joined.

(Disclosure: I work on ads at Google, speaking only for myself)


Yeah, you summed it up pretty well I think.

It's definitely an arms race, and IMO it makes sense to push back on areas where one can. Are you implying that it's not a worthwhile effort and that the battle is lost?

I'm thinking 1) DNS control with block lists, 2) browser extensions (restricting canvas, removing tracking parts of URLs), 3) being restrictive about disclosing PII, and 4) IP obfuscation along the lines I laid out above should make it a lot less deterministic and decrease confidence in merging of datasets.

Rule lists obviously have to be continuously updated.

Only a sith deals in absolutes but from your perspective am I missing something?


Here in the UK, date of birth and postcode are enough to identify something like 95% of people. Anonymised data sets are not really possible once you have more than a few variables. Most people don't know this.


My local area published "anonymised" datasets of public transport usage, but they gave everyone a unique ID. It was found that if you knew two trips a person took, you could uniquely identify that person in the dataset and see all of their trips.


How does that work? If a friend of mine and I both took the bus to a movie and returned to our start how do you differentiate between us?

Seems like this would come up a lot with commuters.


Of course that case will fail, but in almost all cases, if you know something like that someone took x bus to work and then a week later took one to the mall, it's now possible to find all of their trips. For someone you know somewhat well, it's not hard to find two trips they took and then be able to find all of their trips.
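
A toy reconstruction of that reidentification (all trip data invented): two known trips are intersected down to a single pseudonymous ID, which then exposes every other trip recorded under that ID.

```python
# "Anonymised" trip table keyed by a per-person ID, as in the dataset
# described above. All rider IDs and trips here are made up.
trips = [
    ("id1", "home", "work"), ("id1", "work", "mall"), ("id1", "home", "clinic"),
    ("id2", "home", "work"), ("id2", "work", "gym"),
    ("id3", "home", "movie"), ("id3", "movie", "home"),
]

def candidates(known_trips):
    """Riders whose history contains every (origin, dest) pair we know about."""
    ids = {rider for rider, _, _ in trips}
    for origin, dest in known_trips:
        ids &= {r for r, o, d in trips if (o, d) == (origin, dest)}
    return ids

# One known trip leaves two candidates...
assert candidates([("home", "work")]) == {"id1", "id2"}
# ...a second known trip pins down one person, exposing their whole history.
assert candidates([("home", "work"), ("work", "mall")]) == {"id1"}
```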


Ok, so two trips but not any two trips. It requires a lot more knowledge of the person. For someone you know well why would you even need to look at the data?


The thing is, if you know enough trips to unmask them, then you can find out about all their trips in that dataset.

As an employer, maybe I can find out an employee wasn't home sick when they said, but took the bus to a station that only serves a competitor's business. Etc.


This is a really good point. I hadn’t fully considered the different ways you could know just part of someone’s routine.


It's almost any two trips. It's the exception that two people take the same trip together. I can think of a handful of people I could eventually find two trips for who wouldn't want me to have their entire travelling history.


One could trivially establish multiple trips that anyone with even slightly public social media has taken.


Ok this is an interesting point. I can’t tell if this is a challenging problem or not. It’s probably worth trying an experiment.


In this case, proper anonymization would require unlinking those two variables then.

It isn't impossible to anonymize data. It's just hard to do it right.


Isn’t a postal code about 50-100 houses? That variable alone really narrows things down.


There are 2 reasons postcode matters.

First, postcode is something you give out pretty willingly. If you put your postcode and dob into an insurance quote website, they would no longer be insuring based on a pool of people like you. They'd literally just see how many claims you had. And also what ethnicity and sexuality and 50 other personal, irrelevant criteria they want.

The second is that postcode is only a narrow or broad measure depending on what you're using it for. If you want to do a study on asthma rates vs road traffic, postcode is just right; anything more general and you're comparing side streets and motorways. So it makes sense for that data to be available. But wait: as the data user, I only need one more data set (say voter registration, already available) and I can literally look up your medical history before deciding whether to hire you.

This is the issue here: data HAS to be specific to be useful. But if it's specific, it's dangerous. AND data is much more specific than you realise, because a few innocent-sounding data points are unique to you when combined.


On average it's about 15 houses, so postcode + date of birth is indeed around 95% accurate.


That's a bit insane. I looked up US numbers to check and got around 8k/zip code[0]

[0] https://www.zip-codes.com/zip-code-statistics.asp


In Ireland, we use Eircode with one house per postcode. This is very handy because you don't need to type in your full address on a lot of websites, just the Eircode.

The first three characters of an Eircode are more like a traditional postcode in that they indicate your area/town, but the next four are randomised for each address.


Man, that sounds so convenient. I kind of wish everywhere had that minus the privacy factors


That must be what ZIP+4 is for.


When you consider the birthday paradox, and consider demographics, it probably doesn't.

If you assume that DOB's are evenly and randomly distributed over the last 100 years (1 of ~36500 values) then the probability of none of 100 people sharing a DOB is only ~87%. If you tuned for demographics the true stats would be much worse.

That said, you probably have anti-clustering aspects - parents obviously can't share birth dates with their children, and siblings can't either (unless twins). But! couples tend to be of a similar age...so, tricky.
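
The ~87% figure checks out; a sketch under the same assumption the comment makes (DOBs uniform over ~36,500 days):

```python
# Birthday-paradox style calculation: probability that n people with
# dates of birth uniformly spread over ~100 years (36,500 days) all
# have distinct DOBs.
def p_all_distinct(n_people: int, n_days: int = 36500) -> float:
    p = 1.0
    for i in range(n_people):
        p *= (n_days - i) / n_days
    return p

p = p_all_distinct(100)
assert 0.86 < p < 0.88  # ~87%, matching the comment above
```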


50-100 houses is not that much. You can find someone in those 100 houses with 1 DOB and post code. Maybe in very crowded areas you'll need one more var. Worst case scenario you'll end up with 2-3 people.


Intuitively, there are tons of things we do on our computers that uniquely identify us. I am sure the adware companies know a ton more that isn't public, too. Strict privacy-preserving tech is needed across the whole stack.


By looking at all the data available to untrusted sites (as seen in https://amiunique.org/fp) you can tell that Web is many many years away from being privacy conscious. List of fonts, canvas fingerprinting, timezone, OS, user agent... the list goes on and on. Those of us who are tech-literate know better than to create tech like this today, but there's just too much momentum (and shady interests) to hot-swap Web for something else.


I think this is as stupid as it sounds from the paper - https://www.usenix.org/conference/soups2020/presentation/bir...

Why not "Mozilla research: We asked users for their name and address and the ones telling the truth we could identify"

Tor is fighting identification of users from the screen size of their window when maximised.

Here's the original paper, which is more about how you can access browser histories - https://www.petsymposium.org/2012/papers/hotpets12-4-johnny....

Can you still access browser histories? I'd have to guess no way without a zeroday. The original site is down. http://www.wtikay.com/ Firefox fixed it - https://bugzilla.mozilla.org/show_bug.cgi?id=147777


Wasn't it shown by AOL researchers 20 years ago that search histories are uniquely identifying? If so, this seems hardly surprising, as browser history should be a superset of search history.


As counterstrategy you can use tools like http://trackmenot.io/

"TrackMeNot runs as a low-priority background process that periodically issues randomized search-queries to popular search engines, e.g., AOL, Yahoo!, Google, and Bing. It hides users' actual search trails in a cloud of 'ghost' queries, significantly increasing the difficulty of aggregating such data into accurate or identifying user profiles. "


I use it as far as I can, but it's stopped working in Pale Moon. The queries it produces aren't very intelligent when you see them, and it wouldn't take much NSA/MI5 work to trim most of them out.

I think TMN could be a fair bit smarter.


The Evercookie (hard-to-delete cookie-like system in JavaScript) and Panopticlick (browser fingerprinting) projects may also be of interest:

https://en.wikipedia.org/wiki/Evercookie

https://panopticlick.eff.org/


Interesting. I also think that the browser signature, together with IP address, will probably come very close to uniquely identifying users.


I noticed the other day that various chatbots (as in, a single service shared across multiple websites) call me "The University of Texas at Austin", presumably because I have a housemate who works there.

I tried various VPN servers and got called by other company names[0]. It was a good reminder about how we're tracked, and our information may be shared, even with other users.

[0] https://twitter.com/lkbm/status/1299408670325964802


I’m sure that the last 3-5 pageviews with exact timestamps are enough to uniquely identify any person.


So can DNS queries.


I suspect privacy would be better served by taking the approach of the security domain, with responsible disclosure to vendors and a concerted effort to attack the problem holistically. Until then, we're just giving privacy attackers a heads-up, and by the time this issue is mitigated they're onto the next avenue for bypassing privacy.


Time for a browser plugin that will generate random noise - adding junk into history.


If it’s truly random wouldn’t that make you even easier to identify?


Not if everyone else uses it too


Even if everyone else uses it.

If your random pages are a, b and c but my pages are d, e and f or even a, b and d then it’s still easy to fingerprint us.

Extensions like this might work if they visited the same sites all other users visit. Otherwise you’re just adding even more unique information for the trackers.


But if both our random pages are a, b and c, and the only difference is when or how often I accessed each of those, then making it random for both of us will effectively turn us into the same person.


What about all the other pages you visit? How does adding random traffic to your history make you any harder to identify? It just creates more datapoints.
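
A toy illustration of this objection (site names invented): padding two different histories with the same noise doesn't merge them, because the genuine sites still differ, and suspected noise can simply be subtracted back out.

```python
# Two users' real browsing histories differ; adding an identical noise
# set does not make their profiles identical, and a tracker that can
# guess the noise (e.g. the extension's fixed site list) strips it out.
noise = {"noise1.example", "noise2.example", "noise3.example"}

alice = {"siteA.example", "siteB.example"} | noise
bob = {"siteC.example", "siteD.example"} | noise

# The padded profiles are still distinguishable...
assert alice != bob
# ...and removing the suspected noise recovers the real profile exactly.
assert alice - noise == {"siteA.example", "siteB.example"}
```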


Internet noise generator[0].

[0] https://makeinternetnoise.com/index.html



If the study establishes that for all practical purposes, online anonymity is impossible to maintain for average users, what are the implications (a) for the average user; (b) for the economy; and (c) for society?


Mine certainly is, since I tend to visit the same ten sites over and over again.


so are amazon/itunes/appstore/googleplay/netflix-views/etc



