Not in the "obvious in retrospect" way, but because browsers have been progressively blocking history-sniffing tactics for years precisely because advertisers were using it to identify visitors.
Did this research... establish better numbers around it or something?
>> However, this time around, since the data was collected from Firefox itself and not through a web page performing a time-lengthy CSS test, the data was much more accurate and reliable. Furthermore, the data Mozilla researchers collected is also about the same type of data that today's online analytics companies also collect about users — either through data partnerships, mobile apps, online ads, or other mechanisms.
>> The new experiment got underway between July 16 and August 13, 2019, when Mozilla prompted Firefox users to take part of this experiment.
>> Mozilla researchers said that more than 52,000 users agreed to take part and agreed to provide anonymous browsing data.
I regularly visit maybe 5 or 6? The rest tend to be random links from reddit or HN, I wonder if visiting a site like that once and never again is enough to help with that identification.
Another thought: if it's a site you log into and the URL contains an identifier of some type, then it's easy to identify you, and that's why schemes that hide the URL could also be a privacy issue.
I would think things like Facebook/Twitter like buttons or Google Fonts might make it possible to assemble this history.
Sites like FB are said to maintain "Shadow Profiles" of people, even when those people aren't using their service directly.
I suppose in theory any sufficiently shared infrastructure such as AWS/Cloudflare could do so as well, but they are disincentivized to do so.
If so, wouldn't that drastically reduce the effectiveness of using DNS resolve times as a workaround for Firefox containers?
Not trying to be argumentative here, just trying to understand how effective the sandboxing is, or whether I need to design more layers of indirection. :)
Some AWS services (such as TLS-terminating load balancers) do have access to sensitive cross-site information that could be fed into the adtech panopticon but I wonder if it would be cost-effective for AWS to gather.
I doubt it would be cost effective for AWS to do broad captures for all of its services, however. There's probably not much value in slurping up the IP and SNI data for all HTTPS requests to every EC2 instance, for instance.
• You want to refactor XUL so it doesn't duplicate features of HTML5? Whoops; you broke all the extensions.
• You want multi-threading? What a shame; that API over there assumes it'll always be called from the main thread.
• Update that database table's schema to store more data? Bah. Make another table, or you'll break extensions.
When every implementation detail is part of your interface, bad things happen.
Do they even have one engineer working on this?
I sympathize with the other user having to change a default setting or install an extension, but I'm glad that the felt behavior changed.
Image CDNs. Imgix.
CDNs in general. Akamai. Cloudflare.
Any of the several ads platforms.
Disqus, or any of the commenting platforms.
Youtube. Lots of people embed their video content.
AWS, Google Cloud, Azure.
Every time you visit a page that has an embedded content from a remote host, that host knows you visited that page.
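As a sketch of what that looks like from the embedding host's side, here's a toy reconstruction of per-visitor history from nothing but the third party's own access logs. The log tuples, IPs, and URLs are all made up for illustration; real CDN logs carry the same information via the client address and Referer header.

```python
from collections import defaultdict

# Hypothetical access-log entries as seen by an embedded third party:
# (client IP, Referer header of the page that pulled in the resource)
access_log = [
    ("203.0.113.7", "https://blog.example/post-1"),
    ("203.0.113.7", "https://news.example/article-9"),
    ("198.51.100.4", "https://blog.example/post-1"),
]

# Group referers by client to rebuild each visitor's browsing history.
history = defaultdict(list)
for client_ip, referer in access_log:
    history[client_ip].append(referer)

print(history["203.0.113.7"])
# -> ['https://blog.example/post-1', 'https://news.example/article-9']
```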
So, I don't actually think this research is particularly relevant anymore? It can't really be exploited (and when it can, there are much better ways to track the person).
"Firefox Accounts uses your password to encrypt your data (such as bookmarks and passwords) for extra security. When you forget your password and have to reset it, this data could be erased. To prevent this from happening, generate your recovery key before having to reset your password."
So it appears they may be encrypting data locally and syncing encrypted data without themselves holding the keys.
I think you are right, though: Pocket has more website-saving features than just recommendations, and I'm not sure how any of that works.
"..other than that, how was the play Mrs. Lincoln?"
Using Mac OSX stock audio input and output devices is already supposedly "unique".
Having an AZERTY keyboard supposedly puts you in the 0.04% category, even though all French speakers have the same setting, which means that 0.04% already represents 70M+ users. So, far from "unique".
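To put that figure in perspective, a quick back-of-the-envelope calculation using the thread's numbers (the 0.04% share and the 70M+ estimate are the comment's assumptions, not measurements):

```python
import math

reported_share = 0.0004       # the 0.04% figure reported by the panel
azerty_users = 70_000_000     # rough real-world AZERTY population per the comment

# Identifying information this one attribute contributes, in bits:
bits = -math.log2(reported_share)

# Even taking the panel's figure at face value, the attribute alone only
# narrows you to a cohort of millions, not to an individual.
print(f"~{bits:.1f} bits; a cohort of at least {azerty_users:,} users")
```

Roughly 11 bits is meaningful when combined with other attributes, but on its own it is nowhere near the ~33 bits needed to single out one person among billions.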
1. Visit the site
2. Delete your browser cookies
3. Repeat the steps until you're less unique
The modern Web is a tracking system.
That's gonna take a lot of refreshing.....
So e.g. Google would always see the same IP, which would be different from the one Facebook sees.
While cross-referencing access times to identify users would still be theoretically possible, it should be an entirely different game.
Would anyone else reading this be interested in working on this or joining in? I'm not thinking to make it a startup or business per se but 1) reliable IPs are a bit too expensive to make sense for just 1 person 2) anonymity in numbers.
I'm thinking ideal would be something FOSS and easy to self-host and replicate so you can pool together a group of friends for a shared VPN among semi-trusted parties (at least the user should trust the operator to not index requests and sell the data, and the operator should trust users to not run botnets)
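A minimal sketch of the per-destination IP idea, assuming a small rented pool of exit IPs (addresses and the hashing scheme here are placeholders, not a real implementation):

```python
import hashlib

# Hypothetical pool of exit IPs the group rents and shares.
EXIT_IPS = ["198.51.100.10", "198.51.100.11", "198.51.100.12"]

def exit_ip_for(domain: str) -> str:
    # Stable hash: the same destination domain always maps to the same
    # exit IP, so Google consistently sees one address and Facebook
    # another, but neither sees the address the other sees.
    digest = hashlib.sha256(domain.encode()).digest()
    return EXIT_IPS[digest[0] % len(EXIT_IPS)]

print(exit_ip_for("google.com"), exit_ip_for("facebook.com"))
```

In practice you'd want to map on the registrable domain (eTLD+1) rather than the raw hostname, so subdomains of one tracker don't get spread across exits and correlated anyway.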
Edit: My math is wrong. I thought IPv6 was 2^64, but it's actually 2^128, so that percentage is another 2^64 (about 10^19) times more minuscule.
> A user cannot change her IP address with the same ease as she can change her software fingerprints.
I dunno. It's a lot easier for my less techie friends to reboot their router and get a new IP than it is to talk them through installing some privacy-enforcing software that requires regular maintenance or results in weird and wonderful breakage of their favourite websites.
Don't take my word for it, read the work cited in the article. Note how much they still rely on (static) IP addresses. If we removed the IP address as a reliable item of available data, based on observed practices (not theory), that would likely be significant.
"Mishra et al. demonstrated that IP addresses can be static for a month at a time  which, as we will show, is more than enough time to build reidentifiable browsing profiles."
"Secondly, ground truth was established based on reidentifying visitors with a combination of IP Address and UserAgent, perhaps biasing the baseline data to under-represent users accessing the web from multiple locations."
"Even if traditional stateful tracking is addressed, IP address tracking and fingerprinting are a real concern as ongoing privacy threats that can work in concert with browser history tracking. We point readers to Mishra et al.'s  discussion on IP address tracking and possible mitigations. They observed IP addresses to be static for as long as a month at a time, and while not a perfect tracker, IP addresses are trivial to collect."
In any case your ISP is probably logging all your DNS queries and all their dynamic NAT translations to a database, so couple REMOTE_ADDR with REMOTE_PORT and a timestamp and you can almost certainly be identified.
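A toy illustration of that lookup, using a hypothetical NAT translation-log schema (real CGN logging formats differ, but carry the same fields):

```python
# Each entry maps a public (translated IP, port) and time window back to
# the internal customer address. Schema and values are made up.
nat_log = [
    # (translated_ip, translated_port, start_ts, end_ts, customer_ip)
    ("203.0.113.5", 40001, 1000, 1060, "100.64.0.17"),
    ("203.0.113.5", 40002, 1005, 1070, "100.64.0.23"),
]

def who_was_it(remote_addr, remote_port, ts):
    # REMOTE_ADDR + REMOTE_PORT + timestamp is enough to pick out one
    # subscriber, even though many share the same public IP.
    for tip, tport, start, end, customer in nat_log:
        if tip == remote_addr and tport == remote_port and start <= ts <= end:
            return customer
    return None

print(who_was_it("203.0.113.5", 40002, 1010))  # -> 100.64.0.23
```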
One thing to note is that your ISP is absolutely going to be logging every flow, with (at the very least) the following details:
SRCIP, SRCPORT, TRANSLATED SRCIP, TRANSLATED SRCPORT, DSTIP, DSTPORT
As to how the translations are occurring, I've never actually managed a CGN platform myself, but based on my knowledge of other hardware, I suspect you're closer to the reality than I was, and it's likely that a SRCIP always results in the same TRANSLATED SRCIP, as that can then be installed in hardware trivially and no longer needs to traverse the punt/cpu path to lookup what the translation needs to be.
That does leave the system open to abuse, though, depending on how quickly entries age out: a single customer could easily open up 65k sockets in a very short span of time, effectively DoS'ing any other customers using the same TRANSLATED SRCIP if there are no free TRANSLATED SRCPORTs left for their translations to bind to. Then again, the risk of this could be perceived to be low, with an AUP to handle it if it turns out to be a social rather than technical problem, so it could still be happening in the wild anyway.
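A rough sketch of the speculated behaviour: a deterministic customer-to-translated-IP mapping plus a shared port pool that one customer can exhaust. This illustrates the idea only; it is not how any real CGN platform is implemented.

```python
import hashlib

class TranslatedIP:
    """One public IP with a finite pool of translated source ports."""
    def __init__(self, addr):
        self.addr = addr
        self.free_ports = set(range(1024, 65536))  # ~64.5k usable ports

    def allocate_port(self):
        if not self.free_ports:
            # One customer opening ~64k sockets starves everyone else
            # sharing this translated IP -- the DoS scenario above.
            raise RuntimeError(f"{self.addr}: port pool exhausted")
        return self.free_ports.pop()

# Hypothetical pool of public IPs on the CGN box.
pool = [TranslatedIP("203.0.113.5"), TranslatedIP("203.0.113.6")]

def translated_ip_for(customer_ip):
    # Deterministic: the same customer always lands on the same translated
    # IP, so the mapping can be installed in hardware with no punt path.
    h = hashlib.sha256(customer_ip.encode()).digest()
    return pool[h[0] % len(pool)]
```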
This remains a good reminder for me to avoid speculating about topics I haven't thought too deeply about!
IMO there will never be a complete solution but that means we have to tackle each issue or dimension individually within the larger context, not just throw our hands in the air and give up.
Maybe I should have been clearer about the scope and ambition in the original comment, but I can't edit it anymore.
How are you planning to handle communicating identity across sites with link decoration?
There are already extensions that do this. Previous discussion:
(Disclosure: I work on ads at Google, speaking only for myself)
It's definitely an arms race, and IMO it makes sense to push back on areas where one can. Are you implying that it's not a worthwhile effort and that the battle is lost?
I'm thinking 1) DNS control with block lists, 2) browser extensions (restricting canvas, removing tracking parts of URLs), 3) being restrictive about disclosing PII, and 4) IP obfuscation along the lines I laid out above should make it a lot less deterministic and decrease confidence in merging of datasets.
Rule lists obviously have to be continuously updated.
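As a small example of the URL-cleaning part, a sketch that strips a deny-list of tracking parameters. The list here is a tiny hypothetical sample; real block lists are far larger and, as noted, need continuous updates.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative deny-list only; extensions ship much bigger, curated ones.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def strip_tracking(url: str) -> str:
    """Return the URL with known tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_tracking("https://example.com/a?id=7&utm_source=news&fbclid=x"))
# -> https://example.com/a?id=7
```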
Only a sith deals in absolutes but from your perspective am I missing something?
Seems like this would come up a lot with commuters.
As an employer, maybe I can find out an employee wasn't home sick when they said, but took the bus to a station that only serves a competitor's business. Etc.
It isn't impossible to anonymize data. It's just hard to do it right.
That particular variable really reduces things.
First, postcode is something you give out pretty willingly. If you put your postcode and dob into an insurance quote website, they would no longer be insuring based on a pool of people like you. They'd literally just see how many claims you had. And also what ethnicity and sexuality and 50 other personal, irrelevant criteria they want.
The second is that postcode is only a narrow or broad measure depending on what you're using it for. If you want to do a study on asthma rates vs road traffic, postcode is just right; anything more general and you're comparing side streets against motorways. So it makes sense for that data to be available. But wait: as the data user, I only need one more data set (say voter registration, already available) and I can literally look up your medical history before deciding whether to hire you.
This is the issue here: data HAS to be specific to be useful. But if it's specific, it's dangerous. AND data is much more specific than you realise, because a few innocent-sounding data points are unique to you when combined.
The first three digits of an Eircode are more like a traditional postcode in that they indicate your area/town but the next four are randomised for each address.
If you assume that DOBs are evenly and randomly distributed over the last 100 years (1 of ~36,500 values), then the probability of none of 100 people sharing a DOB is only ~87%. If you tuned for demographics, the true stats would be much worse.
That said, you probably have anti-clustering aspects - parents obviously can't share birth dates with their children, and siblings can't either (unless twins). But! couples tend to be of a similar age...so, tricky.
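The ~87% figure checks out; here's the birthday-paradox arithmetic under the stated uniformity assumption:

```python
PEOPLE = 100
DAYS = 36500  # ~100 years of possible birth dates, assumed uniform

# Probability that all 100 DOBs are distinct: each successive person must
# avoid every date already taken.
p_no_collision = 1.0
for i in range(PEOPLE):
    p_no_collision *= (DAYS - i) / DAYS

print(f"P(all {PEOPLE} DOBs distinct) = {p_no_collision:.3f}")  # ~0.873
```

So even with an absurdly generous uniform spread, there's roughly a 1-in-8 chance of a shared DOB in a group of 100, and real demographic clustering only makes collisions more likely.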
Why not "Mozilla research: We asked users for their name and address and the ones telling the truth we could identify"
Tor is fighting identification of users from the screen size of their window when maximised.
Here's the original paper which is more about how you can access the browsers histories - https://www.petsymposium.org/2012/papers/hotpets12-4-johnny....
Can you still access browser histories? I'd have to guess not without a zero-day. The original site is down: http://www.wtikay.com/ Firefox fixed it: https://bugzilla.mozilla.org/show_bug.cgi?id=147777
"TrackMeNot runs as a low-priority background process that periodically issues randomized search-queries to popular search engines, e.g., AOL, Yahoo!, Google, and Bing. It hides users' actual search trails in a cloud of 'ghost' queries, significantly increasing the difficulty of aggregating such data into accurate or identifying user profiles.
I think TMN could be a fair bit smarter.
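A minimal sketch of the ghost-query loop as described in the quote. The decoy list, interval, and `issue()` callback are placeholders, not TMN's actual implementation:

```python
import random
import time

# Placeholder decoy pool; TMN derives its queries differently.
DECOY_QUERIES = ["weather tomorrow", "python list comprehension",
                 "best headphones 2019"]

def ghost_queries(issue, n, interval_s=0.0):
    """Issue n decoy searches via the given callback, pausing between them."""
    for _ in range(n):
        issue(random.choice(DECOY_QUERIES))
        time.sleep(interval_s)

sent = []
ghost_queries(sent.append, 5)
print(sent)  # five queries drawn from the decoy pool
```

One obvious way for it to be "smarter": if each user draws from a different decoy pool, the noise itself becomes identifying, so the pool would need to be large and shared across users.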
I tried various VPN servers and got called by other company names. It was a good reminder about how we're tracked, and our information may be shared, even with other users.
If your random pages are a, b and c but my pages are d, e and f or even a, b and d then it’s still easy to fingerprint us.
Extensions like this might work if they visited the same sites all other users visit. Otherwise you’re just adding even more unique information for the trackers.
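That point can be made concrete with a similarity measure over the noise sets; here, Jaccard similarity on the example pages above:

```python
# If each user's "random" noise pages differ, the noise itself separates
# users: the sets only partially overlap, and that overlap is stable.
me  = {"a", "b", "d"}
you = {"a", "b", "c"}

def jaccard(x, y):
    """Similarity in [0, 1]: 1.0 means identical sets."""
    return len(x & y) / len(x | y)

print(jaccard(me, you))  # 0.5 -- distinct enough to tell us apart
```

Only if every user injected queries from the same shared pool would the noise sets converge toward a Jaccard similarity of 1.0 and stop being a fingerprint.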