Coauthor here. I lead the research team at Princeton working to uncover online tracking. Happy to answer questions.
The tool we built to do this research is open-source https://github.com/citp/OpenWPM/ We'd love to work with outside developers to improve it and do new things with it. We've also released the raw data from our study.
What can be done by the browser vendors such as Mozilla, Google, and Microsoft?
To prevent fingerprinting, your browser has to disable all sorts of useful modern JavaScript APIs (e.g., WebRTC) by default, prevent spurious HTTP requests (e.g., to prevent abusing @font-face to find out which fonts are installed), and pretend you are an American using the most popular web browser of the moment (i.e., hide the user's preferred language and claim en-US as your preference, and change the user agent string to blend into the crowd).
This is all assuming people don't run any third party plugins like Flash.
Are browser vendors on track to figure out a solution to this problem that combines user friendliness with privacy? Or will anonymous browsing remain a privilege for those with the right amount of technical know-how?
The problem, it seems, is that simply disabling JavaScript is not an option: it's needed for normal web browsing, and it's even a requirement for interacting with the web services of organisations you have a relationship with (e.g., the government, insurance companies, banks, etcetera).
Personally I think there are so many of these APIs that for the browser to try to prevent fingerprinting amounts to trying to put the genie back in the bottle.
But there is one powerful step browsers can take: put stronger privacy protections into private browsing mode, even at the expense of some functionality. Firefox has taken steps in this direction https://blog.mozilla.org/blog/2015/11/03/firefox-now-offers-...
Traditionally all browsers viewed private browsing mode as protecting against local adversaries and not trackers / network adversaries, and in my opinion this was a mistake.
Don't you think this sort of thing warrants a separate sort of browsing mode? A lot of people who use the likes of incognito mode just use it for e.g. browsing porn where they don't want the local history to be preserved.
Turning that mode into one that's highly hardened against fingerprinting would in practice ruin the browsing experience for those users. Just look at what the Tor browser needs to do: fixed preset resolutions, no JavaScript, etc.
> Google has explicitly WontFix'd bugs on the subject of expanding incognito to be hardened against fingerprinting
Obviously. Google is in the business of destroying your privacy: Advertising revenue is maximized when the consumer is/remains completely tracked and profiled at all times.
Other browser vendors which are not in the ad business could use this as an opportunity to differentiate themselves from Google:
Introduce a 3rd browsing mode which kills fingerprinting (with the "cost" of reduced user friendliness).
Or just let the user decide at the start of a private session. Firefox already does this with tracking protection. If Mozilla decides to improve tracking protection at the cost of usability (such as hiding your preferred language), then offering that as a toggle-able option on that page might be sufficient to empower the user to decide on his own.
> offering that as a toggle-able option on that page might be sufficient
Technically speaking yes.
From a marketing/communication standpoint, I would separate this "feature" clearly from the 2 known browsing experiences. It not only clearly communicates to the user that a different browsing experience is about to start; selling it as "the third browsing mode" also adds more perceived value to the product.
> Don't you think this sort of thing warrants a separate sort of browsing mode? A lot of people who use the likes of incognito mode just use it for e.g. browsing porn where they don't want the local history to be preserved.
Think about it this way, would those using incognito mode for porn be OK with their normal browsing being peppered with ads claiming to "Improve your <fetish> with our range of <sexual implements>"?
Whilst I think incognito mode's warnings about not hiding data from network operators should remain (i.e. your boss can find out what sites you were visiting at work), that doesn't mean efforts to prevent such tracking shouldn't be made.
> there are so many of these APIs that for the browser to try to prevent fingerprinting amounts to trying to put the genie back in the bottle.
I disagree. If there were billions to be made from this new tech, secure browsing, then the browser vendors would be moving rapidly toward it. Certainly, more difficult technical challenges are overcome regularly.
Surveillance was implemented without users' knowledge and without public debate, presented as a fait accompli, and now the latest tactic is to say there's nothing that can be done about it. People accept that because they feel helpless, but I don't think we should be perpetuating this rhetoric of inevitability. There's no technical reason it can't be done.
The browser vendors could start taking the idea of asking for permission seriously.
For WebRTC, browsers could block local addresses. uBlock Origin can do this on Firefox already.
For battery: browsers could treat it like location and ask for permission. Why does the average site need to know my battery status?
For fonts: browsers could standardize a list of system fonts available on each platform. It's 2016 already: web fonts are here, are widely supported, and no legitimate website should be relying on some oddball manually installed font.
This problem is hard to solve, but the Tor browser has it mostly solved. Other browsers could learn from it.
> browsers could standardize a list of system fonts available on each platform.
It would probably make sense to completely disable support for local fonts unless permitted by the user (for legacy websites that depend on it). All modern browsers support @font-face, and without @font-face you can always depend on the special keywords serif, sans-serif, and monospace; these will load the system's default font for that category.
I hope there is another way to solve it, as I have installed web fonts on my PC to improve page loading speed for some common fonts I keep seeing (the most recent being the Roboto font stack from Google).
It would be a shame to have to keep re-downloading that every time.
You don't have to. Fonts should be (and usually are) offered to the web browser with the instruction to cache them indefinitely. You will only have to re-download them when your cache is cleaned up (due to its size, private browsing, or manually clearing it). The upcoming WOFF2 format helps compress them by a significant margin as well (I've seen up to 50% improvement in size over plain WOFF).
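For illustration, this is the sort of response header a font CDN might send to get that cache-indefinitely behavior (a sketch: the year-long max-age is just the conventional choice, and the immutable directive is a newer addition that not every browser honors):

```
Cache-Control: public, max-age=31536000, immutable
```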
The problem is that the behaviour you are describing is also one of the ways a fingerprinter gets its data on your fonts: by specifying an @font-face declaration that first tries for a local font, and only loads a remote font if that is not found. Do this for a short list of popular but distinct fonts (such as Roboto), and you have a nice number of bits of identifying data to add to the stack.
Also, tricks like these exist (using rendering metrics to detect fonts):
Browser caches are extremely unreliable and pretty small in the grand scheme of things.
On some mobile platforms the browser cache can be evicted entirely by a few heavy pages!
Plus, even if we assumed "cached forever" actually worked for a significant amount of time, it still doesn't solve the problem that I am hoping to solve. I know many websites use the Roboto font. By installing it I no longer need to ever download that font again. It doesn't matter if it's the first time I'm seeing the site, if they use a CDN, if they link to the bold/light/regular version or their own packed font, etc...
I understand that it's a privacy issue, but I'm hoping there is a way to solve that privacy issue without removing that feature.
I would go further and disable remote fonts as well since it's not crucial like images and is an attack vector that should have been avoided. The better solution would have been a shared set of web fonts distributed with browsers, just like certificates.
Fonts are different in that they're not as crucial to the content as images. You cannot replace an image with an alternative text form while retaining the content, but you can display the content completely with the web font missing.
I think the logical conclusion of that argument is to also disable all CSS. Fonts are styling for text; CSS is styling for markup. I think most of the arguments against disabling CSS can be used against disabling fonts (barring that people do crazy shit with CSS far more often now, which complicates the issue).
Really, in the end, all input accepted from the remote side (including text/html) needs to be vetted and processed by security conscious routines. I don't personally have a reason to assume a font library is more likely to be exploitable than an HTML+CSS parser and layout engine. Based on complexity, I would actually assume the opposite, which is probably right, except we've already found and fixed a lot of the exploits for the HTML parser and layout engine.
I've considered what might be necessary to dispose of server-side CSS.
A set of standard page templates could do it. Clients could then choose their preferred client-side CSS to apply. Article, index page, image gallery, catalog entry, search result, etc.
Seems a finite set should cover most needs. Which ought to be available from a few fairly standard sources (CMS, blogging engines, frameworks).
As I see it, the web is the most vibrant medium of expression and innovation we have today. While I don't doubt there would be gains in security to limiting it in many aspects, I question whether the specific level of security gain would be worth the loss of innovation and expression. I think there are many areas we could focus on instead that would increase security without the same level of negative consequences, so I espouse doing those first.
Not to mention I don't think it's a solution that's viable given economic principles and how much people value expression. We'd just be back to the equivalent of Flash sites again, with whatever takes over for Flash (canvas?).
It's not just security; the other half is usability. Popular web sites regularly shuffle things around, ostensibly just to look different, but they change the behavior as well. This breaks existing functionality, key+mouse sequences you use to get stuff done, and places you've learned to look at and navigate to quickly. Computers and modern appliances (including cars) are strangely affected by this constant desire to shift things around unnecessarily, because someone told them they could sell it since it looks different now.
GitHub and Gmail are prime examples of sites that broke a lot of things in the process.
Maybe what we need instead are real APIs and custom clients.
Consider the implications of what this means, though. If sites are not free to innovate, things like GitHub and Gmail wouldn't exist. The only reason we aren't stuck with a Hotmail interface circa 2002 is because people were able to innovate on the web. To lock down CSS (or JavaScript; there's no reason I can think of that you would lock CSS and not JavaScript) to a specific set of capabilities is both a statement that it is sufficient for all needs, and that we can decide by committee what is a good set of standards to lock into. I think both assertions are laughably false.
If we had locked down CSS five years ago, what CSS would we not be capable of using today? If we lock it down today, what would we be missing out on that would come five years from now?
Design by committee is horribly inefficient, and rarely takes into consideration the full needs of the users. What's more, it can't take into consideration future needs. Design by committee gets us XML. Adoption by iteration and evolution gets us JSON. XML has its place, but JSON is overwhelmingly more popular in certain contexts for a reason, it fits the domain better.
Lastly, iterating on Github and Gmail would not stop even if there was a complete lack of CSS and Javascript, it would just be more tedious as everything was done through a full page serve, just like the old days. That wouldn't prevent site redesigns along with missing or broken features, it would just make everything look shittier, perform slower, and use more server side resources.
That said, a sane standard for embedded interfaces (where choice is restricted, it needs to live a long time, and it needs sane accessibility features) would do well with better standards. I view that as a separate problem.
The fact of standard templates needn't prevent the possibility of novel templates. But it ought to make the prospect slightly more user-controllable. Design-by-committee isn't the alternative to design-by-fuckwits, the present mode.
Github and Gmail are both tools which now face the dilemma of gratuitous changes -- many of the recent innovations haven't done much for usability, for numerous reasons (familiarity itself is a key factor, GUI offers limited capacity for improved functionality, jwz has commented on this from his Mozilla experiences).
But most changes to default styles are pants.
Hell, much of the problem is that default styles are pants. If browsers had a set of presentation styles that did work well (see the "readability" modes offered by Safari, Firefox, Readability, Pocket, Instapaper, etc.), then we'd have slightly less of a problem.
Github, Gmail, Google Maps, etc., are largely the exception to long-form informational content pages. I'm OK with an explicit "app mode" for such sites. But 99.999999% of what I read would do vastly better with uniform presentation.
More attention to content and semantic construction. Less to layout frippery.
> The fact of standard templates needn't prevent the possibility of novel templates.
If your stance is "provide well established default templates, but don't enforce their use", then I have no disagreement. That's not how I interpreted "I've considered what might be necessary to dispose of server-side CSS."
> Github, Gmail, Google Maps, etc., are largely the exception to long-form informational content pages. I'm OK with an explicit "app mode" for such sites. But 99.999999% of what I read would do vastly better with uniform presentation.
I think that depends heavily on what you use the web for. You and I likely read a lot on the web. Some people might stick largely to Facebook and Gmail. There are people that spend a lot of time in Github, and others that spend very little. Some people use a lot of online organizational and collaboration tools, others none.
> More attention to content and semantic construction. Less to layout frippery.
What you call layout frippery, someone else desires. This sounds suspiciously like remaking the web for your use cases, not for general use cases (which are always changing). But I'm not sure there's even a problem to address; you already addressed it by referencing "readability modes" as an example of presentation styles that do work well. Why isn't that your solution to this perceived problem?
It feels like you're trying to achieve the equivalent of forcing all the printers to agree to not print magazines that don't conform to someone's opinion of what a good magazine is. I'm just not sure why that's even desirable.
> Something tells me you'll not be convinced.
No, not yet, if I understand your position correctly.
Defaulting to standard formats, and, on the basis of improved semantic parsing and ranking, promoting them through higher search rankings (ceteris paribus) would be a Good Thing.
Among the problems of present Web design is that the Web is an error condition (there's a wonderful essay exploring this), and browsers default to allowing broken behavior, even adapting themselves to it, explicitly.
The lack of a publishing gateway to the Web, even a minimal one which enforces markup correctness, is a problem.
Layout frippery as pertains textual content has a rather well-supported basis. Complexity is the enemy of reliability, and more complex layouts offer far more ways for sites to break. That's a well-established fact that successive generations eventually learn (or fail to learn) at their peril.
(The phrase "Complexity is the enemy" itself dates to the 1950s. I'd have to check the year, but have remarked on it before. Source is The Economist newspaper.)
I've seen what happens when documents and other media are aimed at very specific readers. Eventually, they rot.
Bog-standard HTML (or some alternative markup -- I'm increasingly partial to LaTeX) tends, strongly, to avoid this.
You're also going back to ignoring points raised earlier in this conversation about security, privacy, and usability.
And yes, if there's a call for an app-based runtime environment, which Google seem quite bent on producing, well, that's a thing. But no need to fuck up the game for the rest of us.
And models which prove useful could and should be incorporated.
I'm pretty gobsmacked, for example, that 25 years after its introduction there's no affordance in HTML for notes (e.g., footnotes, endnotes, sidenotes, as presentation is a client issue), or for hierarchical presentation, e.g., of comments threads.
One can create nested hierarchies, but one with integrated expand/collapse/sort/filter functionality doesn't exist. This was extant in Usenet newsreaders and mail clients 20 years ago. Why not the Web?
I don't really have any issue with most of what you are saying, except "The lack of a publishing gateway to the Web, even a minimal one which enforces markup correctness, is a problem.", and my issue with that really depends on what you mean by "problem". Sure, a publishing gateway would enforce some conformity, and some level of conformity is beneficial (I'll even allow that more conformity than we currently have would be beneficial), but too much conformity is not. Too much conformity breeds stagnation. So I'll re-frame my stance: How do you enforce or encourage conformity without going too far? How do you keep the entity or entities you've entrusted this task to from going too far?
> You're also going back to ignoring points raised earlier in this conversation about security, privacy, and usability.
I was just working off your points, which all seemed to be about usability. I've been treating this discussion as somewhat distinct from that one. I can definitely make arguments about conformity having its own negative aspects with regard to security.
We can begin by actually reviving browser user style sheets; having a well-known and respected set of names will allow for appropriate styling on the client.
I'm all for user style sheets, I see no problem in people overriding site defaults. Re: sites breaking existing functionality while changing, maybe I just don't see big regressions as having happened in Gmail (which I always have open) or Github (which I rarely have open, as my source is in a local repo, but I visit on a regular basis from links here and elsewhere). It is interesting that you mention keyboard mouse combos, when to my knowledge both sites have put specific effort into making keyboard shortcuts that work and allow some level of navigation without any mouse.
I have to use Gmail in static HTML mode so that it doesn't try to reinvent and fail at a text edit control for composing a mail.
GitHub has, like Twitter, been grabbing more key bindings that previously belonged to the web browser, like Ctrl-K, and their comment edit box got limited in its resizability, forcing me to edit outside the website and paste into it often enough that it's an annoyance.
> I have to use Gmail in static HTML mode so that it doesn't try to reinvent and fail at a text edit control for composing a mail.
I'm not really sure what you are referring to here. Gmail does attempt to give you an editor for emails, but in my experience it's extremely simple to get it to do what you want most of the time. If your complaint is that you want it to just send a text email, and not a multi-part with a plain text version and an HTML version, then I have to question why, as all it does is add choice and allow people to view it in the format they prefer, and it should look the same either way.
> grabbing more key bindings that previously belonged to the web browser, like Ctrl-K, and their comment edit box got limited in its resizability
Re: key binding, yeah, I can see that as somewhat annoying. I suspect they are trying to match some standard usability map and thinking of their site as an application, but it's annoying that it interferes with the browser (but only when within a text input, from what I can tell).
To some degree, I have to agree with what's probably Github's stance, which is that it's their site, and while it may seem annoying in some respects, they may have specific reasons they do things. They obviously aren't going to be able to make every change something everyone likes, but I don't necessarily think they are making change for change's sake. It's likely in response to pressure from gitlab and competitors. Presumably they are audience testing. The best way you can speak to this is to not use them when possible, or urge others to not use them.
Re Gmail: the bug is that they replaced the browser text box edit control with their homegrown JavaScript solution, which does not work at all. Copy/paste, scrolling, and many other features are broken with it. On top of that, you cannot resize it.
Re Github:
The big issue is that they start hijacking keys that were free before. It's hard to impossible to sway developers to use anything but Github. I've tried and been treated as if I'm in the luddite camp.
> homegrown JavaScript solution, which does not work at all. Copy/paste, scrolling, and many other features are broken with it.
These all just work for me. I'm not sure what the specific complaints are; maybe it's a Firefox thing, but it's not like there's a lot in Chrome that FF doesn't support.
> On top of that, you cannot resize it.
A little convoluted, but there is a way. In the subject of the thread, to the right, along with the collapse-all control, there's the option to open the thread in a new window. This window can be resized, and the input is sized to the window. Although, I suspect Gmail is meant to be viewed as more of an app than a site, so if the window size is not just about composing, but use, it might be worth using it as a freestanding browser window, distinct from and sized differently than other tabs, if you aren't already. I might actually play around with doing that now that I've said it.
> Re Github: The big issue is that they start hijacking keys that were free before. It's hard to impossible to sway developers to use anything but Github. I've tried and been treated as if I'm in the luddite camp.
Yeah, that's unfortunate, and I would have hoped Github would do better. I don't really think it's the norm though.
As cm3 notes, usability is as much if not more a concern than privacy and security. Though I'd not dislodge any of these three from a position of high primacy.
There's a risk/frequency trade-off with all of these. Privacy failures can be costly or even fatal, though comparatively rare. Not so rare, though, that 20% of all Web users in a US Department of Commerce survey (see my recent comments history) report known credit card fraud. That's many tens of millions of affected users.
The security risks are similar but also extend to organisations which might stand to lose control over their own (validly) private information, or control over systems (see for example concerns over SCADA infrastructure, or industrial process control).
Usability and adaptability issues pose lower risks, but have a much larger affected field.
It goes well beyond the visually disabled, illiterate, and cognitively challenged. Anyone who's landed on a desktop site that's unusable on mobile has encountered a usability challenge. Google, Apple, Facebook, and Amazon are all rapidly pushing us, some kicking and screaming (I include myself) to an audible Web -- one in which the primary control and response interfaces are spoken.
What landing on a small set of templates does is provide for clearly parseable and understandable content. In a world where the goal isn't to read a full page but to extract and convey a useful item of information from it, wading verbally through megabytes of unparsable and nonexcludable content isn't particularly useful (and yes, figuring out how much a data reference is worth to the data-reference intermediary is another question worth considering).
More generally, in my case, with only modest perceptual impairments (reading small, low-contrast type is among the earlier signs of your impending death), I've come to conclude for some years now that Web design isn't the solution, Web design is the problem. There are only so many ways you can present content that doesn't fuck with readability. I try, very hard, to ensure I'm not doing this on my own modest designs (look up "Edward Morbius's Motherfucking Web Page", a riff on a popular refrain, for my own principles in action).
My most common response when landing on a website is to sigh, roll my eyes, and dump it to something more readable. Firefox's Reader Mode. Pocket. Straight ASCII text. w3m.
And no, "novel graphic design" isn't conveying vast new amounts of information. I grew tired of hearing that argument 30 years ago, it's not got fresher since. Bloomberg, The New York Times, and the BBD are all experimenting with high-concept article formatting. In my experienct, without exception, it simply Gets In The Fucking Way.
My half-serious response to this is to create a new web browser embodying these and a few other principles. The working title is "the fuck your web design browser". FYWD for short.
Ninnies may opt to call it the Fine Young Western Dinosaurs browser as an alternative.
> My most common response when landing on a website is to sigh, roll my eyes, and dump it to something more readable. Firefox's Reader Mode. Pocket. Straight ASCII text. w3m.
> My half-serious response to this is to create a new web browser embodying these and a few other principles.
In all seriousness, I wonder if spoofing a mobile client (easily done through most browsers' developer consoles or an extension) might immediately result in a more useful experience for you on the majority of sites. Given the viewing constraints of most mobile platforms, and the focus on mobile accessibility (it's supposed to account for over 50% of traffic now), I imagine many sites put some minimum level of effort in to at least make it usable.
The majority of my browsing is mobile these days. 10" tablet.
Even sites which are otherwise well-designed (Aeon and Medium come to mind) insist on dark-pattern behavior such as fixed headers/footers. Again: straight to reader-mode for that.
(Screenshots contrasting site and a Reader Mode session included.)
I've written directly to the site designer, who seems utterly insensate to why 14pt font isn't in fact a majickal solution to all readability problems.
Really? That seems unlikely. I mostly see people use phones and small tablets, so <= 7".
> Except for the sites which break that. Violet Blue's Peerlyst comes to mind
There will always be someone thwarting best practices, just as there will always be those that skirt or break the rules in systems that are less lenient. There's not a lot of recourse, you want what they've got, so you are at their whim unless you can work around their imposed difficulties or find another source.
> I've written directly to the site designer, who seems utterly insensate to why 14pt font isn't in fact a majickal solution to all readability problems.
See above :/
> HN itself is only barely usable.
Yeah, but I think the reasoning behind HN is slightly different. I suspect HN assumes you will take some appropriate steps to optimize your use of the platform. Instead of "we will tailor the view to our artistic vision and you shall not besmirch it!" it's more of a "we believe in user agency, so get off your ass and make it better for yourself." Depending on your point of view, skill level, and site usage, you might find one more appealing than the other.
Personally, I use one of the browser extensions that allows collapsible comments, inline replying, and user info on hover over username.
> Really? That seems unlikely. I mostly see people use phones and small tablets, so <= 7".
Ignore that, I misread the sentence. I thought you were saying most mobile browsing is with a 10" tablet. I'm not trying to tell you that you're wrong about your own reported habits...
Mozilla's replacement for the Gecko engine that runs Firefox, written in Rust. Often covered here[1], the benchmarks look really promising. Small portions of the codebase are already trickling back to FF where applicable.
The complexity and size of a modern web browser, and the need for better engineering tools to combat this, are often touted as some of the reasons the Rust project was started.
> Why does the average site need to know my battery status?
I would go further and suggest that really no site needs to know it (I am sure there could be a few reasonable uses, but still). Which makes me wonder if we could strike back by abusing the WebRTC spec and fuzzing values like these, instead of simply blocking them.
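A minimal sketch of that fuzzing idea for the Battery Status API (purely illustrative: the Proxy approach and the rounding granularity are my own assumptions, not a tested countermeasure):

```ts
// Purely illustrative: wrap the Battery Status API so pages see coarse values.
// Assumes an environment (e.g. an extension content script) where
// navigator.getBattery exists and can be overridden.
const realGetBattery = (navigator as any).getBattery?.bind(navigator);
if (realGetBattery) {
  (navigator as any).getBattery = async () => {
    const battery = await realGetBattery();
    return new Proxy(battery, {
      get(target, prop) {
        // Round the charge level to one decimal so it leaks fewer identifying bits.
        if (prop === "level") return Math.round(target.level * 10) / 10;
        const value = Reflect.get(target, prop);
        return typeof value === "function" ? value.bind(target) : value;
      },
    });
  };
}
```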
Exactly - most sites don't need my exact location, or access to WebAudio or whatever. It should be a red flag for most sites, however most users won't know how to react in such a situation.
I'm not familiar with WebRTC. What's the use case there? I can't remember ever wanting to create an in-browser p2p connection on my local network. What would it be used for?
But it still ignores the proxy settings and will use STUN to discover your "external IP". Thus users that think they are using a proxy end up not actually doing so.
In-company hangouts, video conferencing, etc. Without p2p on the local network, that traffic would have to go outside the company and back in.
Others have pointed out this behavior has changed in Chrome 48. You don't get the local IP unless the page asks for access to the mic/camera which the user has to give permission for.
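For reference, the probe itself is tiny. A sketch of how a page could harvest candidate IPs before that change (the STUN URL is a placeholder, and error handling is omitted):

```ts
// Sketch of the WebRTC address leak described above (pre-Chrome-48 behavior).
const pc = new RTCPeerConnection({ iceServers: [{ urls: "stun:stun.example.org" }] });
pc.createDataChannel("probe"); // any channel triggers ICE candidate gathering
pc.onicecandidate = (event) => {
  if (!event.candidate) return; // a null candidate marks the end of gathering
  // Candidate lines embed addresses, e.g. "candidate:... 192.168.1.2 54321 typ host".
  const ip = /(\d{1,3}(?:\.\d{1,3}){3})/.exec(event.candidate.candidate)?.[1];
  if (ip) console.log("discovered candidate IP:", ip);
};
pc.createOffer().then((offer) => pc.setLocalDescription(offer));
```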
I think the answer isn't technical, but legal and cultural. Make it unacceptable in the court of public opinion for companies to misuse this data, and strengthen privacy laws.
These two things, of course, go hand-in-hand, but us techies tend to look, I think, for the technical solution because that's the place where it's easiest to see how we could have any sort of impact. The other stuff is a lot of talking to and listening to people, consensus-building, being persuasive, etc.
Legislation and regulation are necessary, and they tend to help keep the really big boys in check, but how can you actually tell if a company is actively engaged in compiling profiles on you or not? I can ask my browser to pass the Do-Not-Track header indicating my objection to that practice, but why would a company specialised in tracking users respect that header?
I have tried to convince the Dutch banks I use (ING and ABN AMRO, i.e., big banks) to stop employing tracking beacons and third party tracking services on their secured internet banking environments, but the responses I get range from 'yeah we need those to improve your customer experience' to 'you are welcome to block these trackers yourself' (I already do, thank you very much).
How about using the permission system? You don't have to disable WebGL by default, but you can ask users for permission when it's needed (usually in a game).
Other stuff like GPS, camera, and microphone already require permission before being used.
Netflix uses an up-to-date list of known VPN-endpoints in addition to a database of IP-ranges by country. They don't need to detect anything client-side. This list is constantly in flux though, so sometimes accessing Netflix via VPN works, sometimes it doesn't.
When everybody was running Windows on a smorgasbord of hardware / patchlevel / plugins / fonts, it was easy to fingerprint. Are we moving towards a more monolithic landscape where fingerprinting is less able to track individual users?
* If I have a fleet of Chromebooks running the same version of Chrome OS, will they all have the same fingerprint?
* Will, say, all iPhones 6 with the same hardware parts, running the same Mobile Safari and iOS version, have the same fingerprint?
This is much-needed research. Thank you for your work. Regarding the WebRTC tracking: would it be possible for WebRTC to work without exposing the local IP? I.e., is there any real reason that fingerprint needs to be there?
Other co-author here. Unfortunately there are good performance reasons for allowing WebRTC to access the local IP, see the lengthy discussion here: https://bugzilla.mozilla.org/show_bug.cgi?id=959893. One use case is allowing two peers behind the same NAT to communicate directly without leaving the local network.
The working group recommendation that we linked in the paper (https://datatracker.ietf.org/doc/draft-ietf-rtcweb-ip-handli...) addresses some of the concerns that arise from that (namely the concern that a user behind a VPN or proxy will have their real, public address exposed), but still recommends that a single private IP address be returned by default and without user permission.
However that's still quite identifying for some network configurations, e.g. a network which assigns non-RFC1918 IPs to users behind a NAT. Seems to me that putting access to the local IP address behind a permission would both remove the tracking risk and still allow the performance gains after the user grants permission.
Thanks for the response! If you're interested and it would be useful for your research, I have some really, really interesting privacy findings regarding Service Workers I'd be happy to share. I'm strongly in favor of an enhanced Open Web, but I'm not comfortable with the opaque nature in which tracking/privacy can be likewise enhanced with little user interaction or notification. Keep up the good work.
I am going to ask about a really basic question: what is fingerprinting?
I had to dig around; from the paper it sounds like a stateless form of tracking.
The audio example made sense:
1. the mic comes on, and it identifies a particular background noise.
2. I browse to another site, or a different page without a cookie.
3. The mic comes on again, matches the ambient noise and realizes I am the same person.
Is that what you mean? If this is the case, how can the "canvas fingerprinting" work, since I had to browse to a new page and all the old pixels from the previous page are no longer there?
Anyway, if it is what I understand it to be, then it sounds very interesting. I bet some science fiction author wishes they had thought to use it.
I can see how you would be led to believe that interpretation. Looking at the "fingerprinting" webapp, however, makes clear that sound is NOT actually recorded -- only the uniqueness of your machine's audio processing stack. At least I hope that's the case. The idea of a microphone recording without permission upon visiting a website would cause quite a brouhaha.
> "This page tests browser-fingerprinting using the AudioContext and Canvas API.
> Using the AudioContext API to fingerprint does not collect sound played or
> recorded by your machine - an AudioContext fingerprint is a property of your
> machine's audio stack itself. If you choose to see your fingerprint, we will
> collect the fingerprint along with a randomly assigned identifier, your IP
> Address, and your User-Agent and store it in a private database so that we can
> analyze the effectiveness of the technique. We will not release the raw data
> publicly. A cookie will be set in your browser to help in our analysis. We
> also test a form of fingerprinting using Flash if you have Flash enabled."
Yes, no sound is recorded. Access to the user's mic isn't possible without a permission. If there are sections of the website or paper that seem to imply that, let me know and we'll clarify.
It does, and also as someone who has never heard of AudioContext before, I can't fathom why it would be necessary for a web application to generate an audio data stream that isn't output to speakers _AND_ _THEN_ analyze the result.
What is the typical use case for AudioContext?
The capabilities of AudioContext used in audio fingerprinting seem like they're beyond what is really necessary?
AudioContext is actually pretty cool. As far as I know, only Firefox supports it at the moment. But it allows you to work with audio streams in raw byteform, which means you can do advanced audio processing in client side javascript.
Wow, so are audio settings that different on, say, different iOS devices? If you and I have the same model iPhone running the same iOS version, is the audio stack that different?
Probably not - but basically, you add together your audio stack with all the other data it can possibly find about your device, and that's your fingerprint.
I know iOS was just an example but, just to clarify, the WebRTC spec isn't supported in-browser on iOS at all.
As a developer, you can take advantage of the spec only if you're building a native app. There are frameworks that you can use if you do. But within Safari or Chrome you have zero WebRTC support.
It's supported in modern versions of Chrome on Android, but won't be supported on iOS until Apple does something about it.
> But within Safari or Chrome you have zero WebRTC support.
Chrome is just a skin on Safari for iOS, because Apple doesn't allow third party browsers, right? I would think FF (or any other browser) wouldn't be able to on iOS either, given that constraint.
> "how can the "canvas fingerprinting" work since I had to browse to a new page and all the old pixels from the previous page are no longer there"
The linked page answers this: "Differences in font rendering, smoothing, anti-aliasing, as well as other device features cause devices to draw the image differently."
Put differently, the function measureText(canvas full of text with various fonts and bizarre features with varying implementation) is a pretty good hashing function for a population of web users, because each of these web users has a pretty-unique [canvas rendering engine, underlying OS, installed fonts] combination.
Combine several of these techniques (webrtc, audio, list of plugins installed and their version, etc), and you go from a "pretty unique" to a "guaranteed unique" hash, which you can follow across the web.
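To make that concrete, a minimal sketch of the canvas half (the text, colors, and sizes are arbitrary choices, not any particular tracker's script):

```ts
// Render fixed content to an offscreen canvas; the resulting pixels differ
// per [canvas rendering engine, underlying OS, installed fonts] combination.
function canvasFingerprint(): string {
  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d")!;
  ctx.textBaseline = "top";
  ctx.font = "14px Arial";
  ctx.fillStyle = "#f60";
  ctx.fillRect(125, 1, 62, 20);
  ctx.fillStyle = "#069";
  // Text plus an emoji exercises font fallback, kerning, and anti-aliasing.
  ctx.fillText("Cwm fjordbank glyphs vext quiz \u{1F603}", 2, 15);
  // A tracker would hash this string rather than store it verbatim.
  return canvas.toDataURL();
}
```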
Just imagine: if the audio stack exposes the volume level, that's roughly 7.5 bits of uniqueness to contribute to the roughly 33 required to uniquely identify any person on Earth, since log2 of 7.4 billion is about 33 (not that you can expect it to be uniformly distributed, and thus fully usable).
Yes, it's a poor example in that respect. I meant it more as an explanation of how different attributes of a source all contribute a little bit to providing a unique identifier, but you are correct that it's much less useful if the attribute is not static.
I'm going to answer the basic question: fingerprinting is about trying to identify your device as uniquely as possible using available APIs, in order to track you cross-site, without cookies.
To do that, you first try to identify APIs that have different results depending on the browser or the device, and then track their results. For example, the user agent has some identifying information. It's not unique for each person, but it gives you a bit of identifying information to start with. Do that with multiple APIs (available fonts, installed plugins ...), and you start having enough identifying information to uniquely identify some browsers, without having an actual ID provided by the browser.
I never understood Panopticlick. Even when I repeatedly visit it, it always tells me that
"Your browser fingerprint appears to be unique among the 135,054 tested so far."
Shouldn't it tell me that my browser is not unique during my 10th attempt, considering it has recorded my previous attempts? This warning actually never changes, regardless of the duration between consecutive attempts. That can only mean that Panopticlick is flawed or my browser signature is in constant flux (which would essentially make it useless from a tracking perspective).
I thought the same thing, so I did a bit of digging.
Turns out they put a bunch of tracking cookies on your machine without asking you (it is mentioned in the about page though), which seems rather naughty for an organisation promoting online privacy.
When I removed all 4 of them, I get down to being "almost unique". I'm currently down to having the same fingerprint as 1 in 45132.3333333 browsers.
This is confusing; my incorrect assumption was that "1 out of n browsers" meant n = total number of browsers evaluated by Panopticlick. Rather, what they mean is 1 out of k, where k is determined by the unique bits. There might be other factors, such as the entropy of each fingerprint.
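For what it's worth, the "1 out of k" figure maps straightforwardly onto bits of identifying information. A rough sketch with made-up counts:

```ts
// Made-up counts, just to show the arithmetic behind "1 in k" figures.
const totalTested = 135054; // browsers tested so far
const sharingYours = 3;     // browsers observed with your exact fingerprint
const oneInK = totalTested / sharingYours; // ~45018, reported as "1 in 45018"
const bits = Math.log2(oneInK);            // ~15.5 bits of identifying information
console.log(`1 in ${oneInK.toFixed(1)} browsers, ~${bits.toFixed(1)} bits`);
```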
How are two sites sharing this fingerprint information in a way that says "yup, this is the same guy"? Like, is there some sort of cabal of evil advertising companies running a bunch of sites, or what?
I don't think you can enumerate installed fonts these days, but you can test for them by trying to use them in CSS (which font is used will affect the width of a span of text; use a wildly different fallback font and you can guess which is installed) or <canvas> (where you can inspect the actual pixels rendered).
You can't get a list of installed fonts via JavaScript, but you can change the font of a known text and fingerprint the size to determine whether the font is installed or it defaulted to another font.
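For the curious, the width trick looks roughly like this (a sketch: the test string and style values are arbitrary):

```ts
// Detect whether a font is installed by comparing rendered text widths.
function isFontInstalled(fontName: string): boolean {
  const span = document.createElement("span");
  span.textContent = "mmmmmmmmmmlli"; // a wide/narrow glyph mix amplifies differences
  span.style.cssText = "position:absolute;visibility:hidden;font-size:72px";
  document.body.appendChild(span);
  span.style.fontFamily = "monospace"; // measure the generic fallback first
  const fallbackWidth = span.offsetWidth;
  span.style.fontFamily = `'${fontName}', monospace`; // only applies if installed
  const testWidth = span.offsetWidth;
  document.body.removeChild(span);
  return testWidth !== fallbackWidth; // a width change means the font rendered
}
```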
Your browser can leak a ton of information about your computer silently: window size, screen resolution, pixel density, time zone, language, installed fonts, installed plugins, operating system and version, browser version, plugins and versions, etc. There are good reasons for all of this data to be available to JavaScript for legitimate purposes. The AND of all of these datapoints, however, may be unique (or close), particularly if you have ever done something like install a novelty font. The EFF runs a website that fingerprints you and tells you how unique your fingerprint is:
Canvas fingerprinting uses differences in rendering e.g. of fonts. Output a text, hash resulting pixel values. Depending on exact version of the font(s) installed, anti-aliasing settings, default font sizes, operating system... you get slightly different results. So you don't rely on information stored on the device, but on repeatable behavior that differs between devices.
But now I see that it is just detecting which fonts are available.
Thanks for the explanation. It's just hard to believe devices are so different. I would think most versions of iOS would have roughly the same set of fonts etc.
Canvas fingerprinting by itself won't uniquely identify users. But the idea is that you can combine various different techniques, each one giving you more bits of uniqueness, until you have enough to do so. For example, say that canvas fingerprinting gives you one of 100 possibilities, and you combine it with other techniques that give you one of 10,000 possibilities, then combined (assuming they're not correlated) you get it to a million, letting you uniquely identify people with decent reliability from a decently large visitor pool.
Good question. I'm not particularly informed on this stuff, so take this with a grain of salt, but my understanding is that mobile devices in general and iPhones in particular are much harder to fingerprint reliably. Things like time zone, clock skew, and ping times might help differentiate users, but you probably can't get it down to a single person. I imagine there's still a use for fingerprinting which helps you differentiate groups of users even if you can't narrow it down to just one.
Actually, checking if a font is available does not require canvas (you can simply inject a piece of text into the page with a specific font stack set and check its width). Rather, what canvas is used for is to obtain the sub-pixel anti-aliasing of a given piece of text, which is different between browsers and OS even when the same font is present.
I would assume that iOS devices are quite hard to tell apart using most of these techniques, yes. But I also wouldn't be too surprised if there were something that works for them, some kind of cookie that isn't cleared by default or ...
I don't know the research definition, but fingerprinting is a technique to uniquely track a user across multiple sites without a tracking beacon.
The most basic form of fingerprinting is to use the browser-supplied headers (user agent, version, OS). Canvas fingerprinting works because identical browser versions across different machines may render slightly differently, but consistently. IIUC, canvas fingerprinting doesn't rely on any pixels shown to the user or anything unique to the site, but if the same canvas is rendered exactly the same on two different sites, that's another indication that both visits were from the same user.
I don't think the AudioContext fingerprinting uses the actual microphone: it uses the browser's (and possibly OS's) audio engine to generate an audio stream, then fingerprints the resulting data stream.
Thanks for this research, really interesting to see.
I do want to state for the record that instinctiveads.com was testing augur.io and that's why we're listed there. We don't use them anymore but unfortunate timing, especially considering we're trying to be a better ad network than the rest.
Also I'd like to point out that one of the most pervasive tracking methods is done through form submissions. Anywhere you submit an email (login, purchase, etc) can be used as identification and first-party cookie matching.
Having the insight of someone who works in online advertising would be interesting and informative. Is there anything you can share that we might find interesting?
I could talk about this for hours. Fundamentally, identity is important for the ad industry but it's not about your personal info, it's just a reliable ID that we're all after.
A reliable ID allows for storing your ad history and interests to show you better ads and less of the same. This is proven since it's all math and data science and we can see the increase in metrics with better targeting. By the way, clicks are not the most important metric either, there's much more that goes into an ad campaign. Ironically, reliable IDs also allow for storing any opt-out settings since it's just a value attached to that ID.
The email login I mentioned above is the most common way to track online, most of the big sites actually sell login data and fire tracking tags when you're logged in with the email address passed through (usually hashed but not always) so that providers can set their own cookies and recognize you again. Since emails are strongly unique, this is really effective.
This tech is also used to combat ad fraud (which is what we were using it for). Fraud is a massive problem since it's so easy to start up botnets and churn through millions of ad impressions quickly.
Unfortunately a lot of this new age of tracking is the result of politics, bad incentives, and a lack of regulation that's led to a wild west situation where these companies can do anything. Clearly the technical talent is capable (as seen in this research) but it's being put to the wrong use. The DNT (do not track) header was a compromise but lacked any real regulation to make it effective. 3rd party cookies were fine but unfairly demonized and the default blocking of them pushed the industry to these deeper tactics.
Ultimately this is a business process issue: if there was a standardized ID like IDFA but for browsers (or even better at the OS level) and privacy regulation that's actually enforced, that would be a good compromise. Sites and ad networks get a reliable ID and you get control over when and how that ID is refreshed.
EDIT - All this stuff used by independent ad companies is just a tiny fraction of the industry. This barely covers ISPs, who have very refined tracking abilities that you really can't avoid since they control the traffic. Comcast/Verizon has the AOL ad network using this. And the 2 biggest ad companies are Google and Facebook, both of which don't need fingerprinting because they already know who you are from just being logged in.
> if there was a standardized ID like IDFA but for browsers (or even better at the OS level) and privacy regulation that's actually enforced, that would be a good compromise. Sites and ad networks get a reliable ID and you get control over when and how that ID is refreshed.
To detect ad fraud, would the ID need to be the same on all sites? Instead of sites dropping cookies on clients, what if browsers generated their own random per-site IDs? Users and browsers would have more control over managing and clearing cookies and user IDs.
Yes the ID would have to be the same, just like it is on mobile (Android Advertising ID, iOS IDFA) and would be best at the device level but browser would be a start.
If it's unique to every site then it's nothing new; networks can already set IDs today with 1st party cookies. It's being able to have an internet-wide ID that's valuable, and that is what 3rd party cookies allow(ed).
The ID itself doesn't matter, it's just random characters and mapped in various ways by networks. It's the reliability and consistency on a device level that's needed. Having something like this would make a massive difference - all the cookies/tracking junk would be obsolete, along with the hundreds of pixel sync tags, and would make everything faster, more accurate, more private and more secure.
But the point of fingerprinting is that practically no two "browsers" are the same:
- browser software and exact version
- installed plugins
- size of browser window
- OS software and exact version (think of patches!)
- language
- time zone
- screen resolution
- ...
- (and all the stuff mentioned in the submitted article!)
See the EFF's Panopticlick to see _how_ unique your browser is. Be sure to click the "Show full results for fingerprinting" after the test to see all things it considers.
But is any of this stuff stable enough to ensure a fingerprint -> user correlation which doesn't break every time? It's not very much use if all it does is create a unique fingerprint for each refresh?
Yes; the things I've mentioned above don't change on page refresh.
If you'd find some things do change too often to be relied upon you could either take that into account, or simply don't use that specific fingerprinting technique.
That's a pretty short answer, and it sounds wrong to me. Are you implying that browser version, OS type and version, and system architecture are all factors that matter for audio fingerprinting? If so, what would be the point of audio fingerprinting when you can just look at the user agent string?
Sorry, it seems I misunderstood your intention/question.
The `AudioContext` API exposes several details about the host which may depend on the hardware (sound card, sound chip), software stack (OS, on Linux e.g. PulseAudio vs. ALSA), sound driver and its version, and connected peripherals (speakers? headphones?).
Additionally, the audio API is used to generate a sound (which is muted before being played, but still generated). Sound is hard, and so the browser vendors don't necessarily generate the "sound bits" themselves but ask the OS to do so. Which might in fact ask its sound system to do so. Which might ask its sound driver...
Some of these properties are fairly common, or likely to change often. But chances are that, combined, they give you more bits of information than, say, the simple user agent string (which is shared by thousands - if not more! - of other browsers).
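To illustrate how a script turns all that into a number, here's a rough sketch of the DynamicsCompressor-style technique (the oscillator settings and the sampled range are arbitrary assumptions on my part):

```ts
// Render a fixed signal offline (never played) and sum a slice of the samples;
// tiny floating-point differences across audio stacks make the sum identifying.
const offline = new OfflineAudioContext(1, 44100, 44100); // 1 second of mono audio
const osc = offline.createOscillator();
osc.type = "triangle";
osc.frequency.value = 10000;
const compressor = offline.createDynamicsCompressor();
osc.connect(compressor);
compressor.connect(offline.destination);
osc.start(0);
offline.oncomplete = (event) => {
  const samples = event.renderedBuffer.getChannelData(0);
  let sum = 0;
  for (let i = 4500; i < 5000; i++) sum += Math.abs(samples[i]);
  console.log("audio fingerprint:", sum);
};
offline.startRendering();
```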
Will you post insight into the data you've collected? Obviously I don't care about IP addresses, etc, but it would be nice to know how many people have submitted data vs how many unique hashes have been collected for say the "Fingerprint using DynamicsCompressor", etc. I also haven't checked every page on the site, so the data might already be there (and I'm missing it)...
"When using the headless configuration, we are able to run up to 10 stateful browser instances on an Amazon EC2 “c4.2xlarge” virtual machine."
Also it seems like you ran the crawl only in the month of January this year, and crawled about 90 million pages. Were you able to do that on the single AWS instance, using Firefox via Selenium? What do you think the performance would have been just issuing raw requests?
Just interested because I'm currently building a crawler and am trying to decide if Selenium would be worth it performance wise.
On iOS I use Safari and disable access to location etc., and also disable cookies, the advertising ID, etc. Then I feel quite safe when using a VPN. Does that still hold?
This week’s http://www.heise.de/artikel-archiv/ct/2016/11/144_kiosk states (translated from the German): “Many .. fingerprinting techniques come up empty on mobile devices. … Moreover, there is no way to specifically counter fingerprinting via sensor properties - neither on iOS nor on Android.” And then it concludes by recommending ad blockers for Safari on iOS, noting that it depends on the quality of the block list. It also mentions that ad blockers on iOS don’t work in apps, unlike on Android.
IOW, most fingerprinting fails on mobile devices, and sensors, e.g. batteries, are one of the few remaining avenues for fingerprinting on iOS. Do you disagree with Heise? Could you please substantiate your statements regarding iOS fingerprinting?
You're right that mobile devices are harder to fingerprint – too many people with the same screen size, operating system, browser version, timezone, and language/region settings.
However, mobile devices have a bunch of sensors, some of which can be accessed by JavaScript without permission, namely the ones you wouldn't expect to yield identifying information (e.g. accelerometer). The problem is that no two sensors are alike – they all introduce noise into the data which can be enough to fingerprint a device.
As soon as I saw these APIs being added I immediately dropped into about:config and disabled them. How the hell do these people think this is a good idea to do without asking any permissions?
Privacy on the web keeps getting harder and harder. Of course this should only be used in conjunction with maxed out ad blockers, anti-anti-adblockers, privacy badger and disconnect.
We need browsers to start asking permission. When you install an app on Android or iOS it says "here's what it's going to use, do you want this?". The mere presence of the popup would annoy people and prevent them from using these APIs.
Thank you, user, for making your fingerprint hash more unique by disabling certain default features, given your user-agent string, thus opting into cat-facts.
It's great that Mozilla decided to remove about:permissions. I do enjoy the fact that I now have to visit every website whose permissions I want to change instead of managing all permissions from a single location.
Google has a vested interest in information leakage. I have a suspicion that the Chromium project expresses a strategic desire to shape the direction of browser development away from stopping those leaks. The idea of signing into the browser with an identity is a core feature and in Google's branded version, Chrome, the big idea is that the user is signed into Google's services.
Indeed. I even use FF despite its terrible, terrible "pinch to zoom" functionality - which works perfectly in other browsers (Safari, Chromium, Chrome).
Zooming is such a basic thing... I don't understand why they implement it in such a crappy way. Certainly doesn't attract users.
This is what the same movement does in Firefox: https://up1.ca/#SEKWNOm1BSQnkntxj_v53w
In Firefox, if I want to zoom all the way in, I have to pinch in like 10 times (very annoying) and then to zoom out pinch out another 10 times...
Google is already really good at tracking people, why would it introduce vectors that would help other vendors catch up? You would have to demonstrate that Google itself was using these vectors for tracking.
What Google vends isn't browsers. It vends advertising.
Its competitors for data? To see how Microsoft's "sign in to the web" is playing out, one might be tempted to Bing with IE, but statistically the odds favor another browser-and-search-service combination.
This is the kind of nonconsensual, surreptitious user tracking that the EU privacy directive 2002/58/EC concerns itself with – not those redundant, stupid cookie consent overlays.
No, from my understanding cookies are allowed by default only if they are essential to the function of the site. If you only use cookies to handle logins and sessions, you don't need the warning. If you use cookies for tracking or analytics, you do.
Note that you can use your webserver logs for analytics and that doesn't require the cookie banner.
Something that is best left to the browser to handle... by allowing the user to enable/disable third-party cookies. Which we already have. But no, the EU has put stupid notifications on basically every single website as a result, since everyone uses third-party analytics. Why? Because if you want your analytics to be believed by anyone who wants to advertise with you, invest in you, partner with you, or buy you, they had damn well better be third-party analytics.
That's true. The implementation differs by country: in the UK, for example, it is enough to just show the annoying banner, while here in Spain you cannot set any tracking cookie (e.g. Analytics) without explicit consent. Of course, governmental websites totally break this law: http://cfenollosa.com/blog/the-ignorant-eu-cookie-law.html
However, OP is right: governments spy on our webcams and analyze our traffic, and that's OK – but we need a stupid banner that overrides browser preferences to avoid all but session cookies. Duh.
If you can set cookies, the user has already expressed their consent by enabling the cookies in the browser. As long as cookies' existence is common knowledge (it is by now), there is no need to duplicate browser UI within every website.
This is the official stance of the ICO[1], the UK national authority: there was a need to educate users about what cookies were when the directive was passed; no such need exists now. The ICO itself briefly used consent overlays, but does not anymore (EDIT: aaaand they're apparently using them again; I'll try to find the policy release where they say this is not necessary). Cookies not used for tracking of persons never needed any consent, as they have no privacy implications.
People who make their living creating cargo-cult UI designs have predictably added cargo-cult law compliance to their toolset. It is beyond stupid.
> If you can set cookies, the user has already expressed their consent by enabling the cookies in the browser. As long as cookies' existence is common knowledge (it is by now), there is no need to duplicate browser UI within every website.
Wrong. If I disable cookies in my browser, I can't log in to websites anymore, so they need to be allowed. A whitelist would be very inconvenient. On top of that, it's not explicit consent, it'd be implicit (i.e. opt-out instead of opt-in).
I don't know if British legislation is different, but this is illegal at least in the Netherlands.
You can enable session cookies only, even in the current UIs. Ditto for third-party cookies. Duplicating UI in a website is a solution looking for a problem. The web devs can nag the 0.01% who don't have cookies enabled, and leave the 99.99% who have them enabled alone.
It has never been enforced that way to my knowledge, anywhere in the EU. Which law or court decision says that it is actually illegal?
How does my browser know that one PHPSESSID is used for tracking, and another is a session? You probably mean until I close the browser, which would be never -- at least, I would never want to, but I do every few months for browser updates. (My laptop always goes in suspend/sleep mode.)
> Ditto for third-party cookies
I barely know what third-party cookies are myself, and I bet my peers – all of us in the software business, be it game development or general software development – could not give an accurate description either.
When I asked, two gave a rough description but couldn't answer a question about whether embedded Like buttons would still work if the user is logged into Facebook. Another just said "I don't know".
I'm not sure "the public is informed about all their options by now". The ones who really care generally use uBlock, ABP, Self-Destructing Cookies, Ghostery, etc., the rest just click "ok" because the sites do not inform them about these aforementioned possibilities: that wouldn't be in their interest.
> Duplicating UI in a website is a solution looking for a problem
Oh I agree it's an issue, I hate this cookie wall as much as anyone. I would love for there to be no need to ever see this wall.
> It has never been enforced that way to my knowledge, anywhere in the EU. Which law or court decision says that it is actually illegal?
I am not sure fines have been issued, but the Dutch ACM ("Authority for Consumers and Markets", literally translated) did give out warnings to non-compliant sites, and they subsequently placed cookie walls.
The law simply says no such cookies may be placed, it doesn't say "for a few months while users are unaware, and after that, oh well, have some fun picking your own privacy laws as you wish."
And yes, I know functional cookies and simple tracking are allowed if you don't invade a person's privacy. This means practically every major website knowingly tries to invade your privacy, because they have these walls in place. What do people say? "Fucking government doesn't understand the internet, look at all these walls." What should we be saying? "Wait, why are they trying to create detailed profiles of me in the first place?"
Although the emphasis on the actual abuse of newly introduced APIs is much needed, it is probably important to note that they are not uniquely suited for fingerprinting, and that the existence of these properties is not necessarily a product of the ignorance of browser developers or standards bodies. For the most part, these design decisions were made simply because the underlying features were badly needed to provide an attractive development platform – and introducing them did not make the existing browser fingerprinting potential substantially worse.
Conversely, going after that small set of APIs and ripping them out or slapping permission prompts in front of them is unlikely to meaningfully improve your privacy when visiting adversarial websites.
A few years back, we put together a less-publicized paper that explored the fingerprintable "attack surface" of modern browsers:
Overall, the picture is incredibly nuanced, and purely technical solutions to fingerprinting probably require breaking quite a few core properties of the web.
So... what we need is a browser that claims to support these things but blocks them or returns false data on request, while looking as ordinary as possible to "regular" browser fingerprinting.
The problem here is Canvas fingerprinting - that's what I found the most surprising and interesting.
How do you prevent that, apart from working on 'fixing' browsers to produce pixel-perfect identical renders across different browsers/platforms/configurations? Would that even be possible?
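For context, the core of the technique is tiny – a rough sketch (the draw calls here are arbitrary; real scripts hash the resulting string server-side or with a JS hash function):

    // Sketch of canvas fingerprinting: draw text and shapes, then read
    // the pixels back. Anti-aliasing, font rendering, and GPU
    // differences make the output vary subtly between machines.
    var c = document.createElement('canvas');
    c.width = 220; c.height = 30;
    var ctx = c.getContext('2d');
    ctx.textBaseline = 'top';
    ctx.font = '14px Arial';
    ctx.fillStyle = '#f60';
    ctx.fillRect(100, 1, 62, 20);
    ctx.fillStyle = '#069';
    ctx.fillText('Cwm fjordbank glyphs vext quiz', 2, 15);
    var fingerprint = c.toDataURL(); // this string is the raw fingerprint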
Edit:
> Tor Browser notifies the user for canvas read attempts and provides the option to return blank image data to prevent fingerprinting.
Huh. I guess that's one attempt, but being able to read pixel data out of a canvas is completely reasonable.
> […] but being able to read pixel data out of a canvas is completely reasonable.
Not for every website. Most websites don't need canvas at all. One option would be to ask users to activate canvas support for websites that do need it, so users can judge for themselves whether the request is legitimate. This is how the geolocation API works, after all.
I am not convinced that this will work very well though.
It seems possible to add heuristics like 'the canvas element has requested more than X fonts within about Y seconds', treat that as a tracking script, and then do something like prompt the user or return the default font from then on.
If the "fingerprint" really is a checksum/crypto hash, an ever so slight random element in rendering output could help. Of course, together with other techniques, it might just identify your somewhat obfuscating browser.
For Firefox and Chrome there are canvas-fingerprint blockers [1] and [2]. These are heuristic-based, so you'll likely see a few false positives. uBlock Origin includes an option to prevent the leakage of local IP addresses [3].
The Tor Browser does not send misinformation; it just blocks. A solution would probably be a browser where every version, on every platform reports the exact same things, always the same way.
> The Tor Browser does not send misinformation; it just blocks.
No, it doesn't. TB sends all kinds of misinformation, from the user agent string (it always reports itself as the base version of Firefox running on 32-bit Windows 7) to rounding JavaScript timing functions to reduce their precision.
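The timing part is easy to picture – something like this (Tor Browser's actual granularity was 100 ms; this sketch just shows the shape of the defense):

    // Sketch: clamp the high-resolution timer to a coarse granularity
    // so timing-based fingerprinting and side channels lose precision.
    var origNow = performance.now.bind(performance);
    performance.now = function () {
      return Math.floor(origNow() / 100) * 100;
    };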
> A solution would probably be a browser where every version, on every platform reports the exact same things, always the same way.
I'm glad I disabled WebRTC when I first discovered it could be used to expose local IP on a VPN.
These "extension" technologies should all be optional plugins. Preferably install on demand, but a simple, obvious way to disable would be acceptable. (ie more obvious than about:config)
Not a great deal can be done about font metrics, beyond my belief that websites shouldn't be able to ferret around in my fonts to see what I have. It's not as if it's a critical need for any site.
What would anyone do with your internal network IP?
Having these features as optional plugins means they are basically impossible to count on having in the basic web platform, meaning you're going to fight a losing battle to gain adoption for any applications that need them.
And the open web platform is the only platform right now that is enabling developers to create cross-platform applications outside of the restrictions of walled-garden app stores.
Not just the internal network IP, but also the public IP. Quite a few test sites popped up when the issue came to light.
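Those test sites all boiled down to a few lines like these (a sketch: with an empty iceServers list you only get local addresses; add a STUN server to surface the public one – and note newer browsers have started masking these candidates):

    // Sketch of the WebRTC IP leak: gather ICE candidates from a
    // throwaway connection and read the addresses out of them.
    // No permission prompt is involved.
    var pc = new RTCPeerConnection({ iceServers: [] });
    pc.createDataChannel('');
    pc.onicecandidate = function (e) {
      if (e.candidate) {
        console.log(e.candidate.candidate); // contains an IP address
      }
    };
    pc.createOffer().then(function (offer) {
      return pc.setLocalDescription(offer);
    });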
> Having these features as optional plugins means they are basically impossible to count on having
Funny – that didn't seem to prevent Flash, Acrobat, or others from becoming extensively adopted. If I want browser video chat I can install WebRTC etc.
If the cost of having that universal platform is compromising everyone's privacy, on any site that wants to check, it's not a fair or acceptable trade.
With your internal IP I can guess the brand of your router, determine whether you are a home user or on a corporate network, and guess how many other machines might be on your network.
I can also assume that your router lives at .1 or .254 or similar, and use your browser to pivot and brute-force the password while you browse cat pictures.
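The first step needs no brute force at all – a sketch (the .1/.254 guesses are just the common defaults mentioned above):

    // Sketch: derive likely gateway addresses from a leaked internal IP.
    function gatewayGuesses(internalIp) {
      var prefix = internalIp.split('.').slice(0, 3).join('.');
      return [prefix + '.1', prefix + '.254'];
    }
    // gatewayGuesses('192.168.1.37') -> ['192.168.1.1', '192.168.1.254']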
I don't experience WebRTC leaks with OpenVPN and this config: http://pastebin.com/raw/hiH1TZtS (I use IVPN; their client prevents WebRTC leaks on Windows by default, but I had to manually configure OpenVPN on Linux).
Of course, I keep WebRTC disabled in Firefox anyway except when I need it – defense in depth, like you said.
If you use Firefox or Iceweasel, you can disable most of those APIs in about:config or user.js. For example, media.peerconnection.enabled = false to disable WebRTC, dom.battery.enabled = false for the Battery API, etc.
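Put into a user.js it looks like this (the first two prefs are the ones mentioned above; the last two are additional real Firefox prefs of that era – names can change between versions, so treat this as a sketch):

    // user.js - applied on every Firefox start
    user_pref("media.peerconnection.enabled", false);       // WebRTC off
    user_pref("dom.battery.enabled", false);                // Battery API off
    user_pref("geo.enabled", false);                        // geolocation off
    user_pref("dom.event.clipboardevents.enabled", false);  // clipboard events off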
Yeah, but one may want to enable those on a per-site basis. That way you get both the fancy stuff (with sites you trust) and no tracking material for the rest.
This would make absolute sense. Certain requests (like location) already trigger popups that ask you for permission. If it turns out other APIs can be equally revealing as far as privacy goes, it would make sense to present the same popup.
I mean, using a web app for the first time would be no different than installing a mobile app – I wouldn't be surprised if I had to grant it a few permissions.
I was thinking the same thing. We need a permission system for websites. Preferably useable on a per-domain basis so I can disable those APIs on adnetworks' domains.
True, but the number of APIs is relatively small, while having them enabled can expose a much richer set of values, far more useful for fingerprinting.
By disabling specific APIs, you would make your browser even more identifiable.
It would only work if many users had disabled exactly the same APIs as you, and all the non-disabled APIs didn't provide any information useful for fingerprinting.
It's kind of surprising that there isn't an extension providing this functionality (at least in desktop browsers). All you'd have to do is monkey-patch the methods that get called and throw up a confirm("are you sure you want to allow [X]") dialog.
I don't know of any. I would think it would be fairly easy to create a userscript or extension to stub the built-in APIs (maybe using something like testdouble.js or sinon.js to override the default global objects you're trying to "disable"). I'm not sure what issues you'd run into on various pages if you did that, though (so it'd probably need a lot of iteration – and fixing bug reports).
It might be a fun project to start though. I've been really enjoying testdouble's API (and have started using that for my unit tests).
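A bare-bones sketch of the monkey-patching idea from the parent comment, using canvas as the stand-in API (a real extension would need to run before page scripts, e.g. at document-start):

    // Sketch: gate an API entry point behind a confirm() prompt.
    var realGetContext = HTMLCanvasElement.prototype.getContext;
    HTMLCanvasElement.prototype.getContext = function () {
      if (!confirm('This page wants to use <canvas>. Allow?')) {
        return null; // null is a legal getContext() result, so pages cope
      }
      return realGetContext.apply(this, arguments);
    };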
All of this makes me wonder how some of these interfaces should be more closely guarded by the user agent.
Perhaps instead of a site probing for capabilities, it should publish a list of what the site/page can leverage and what it absolutely needs in order to work – maybe meta tags in the head, or something like robots.txt. Browsers could then pull the list and present it to the end user for whitelisting.
You could have a series of tags, similar to noscript, to decorate broken portions of sites if you wanted to advertise missing features to users; based on which features they chose to enable/disable for the site, the browser would selectively render them.
I mean, how many people are willing to deal with the hassle of NoScript? That group is probably most of the users who would do anything other than tell the browser to stop asking questions.
Users are familiar with managing permissions; they do it all the time. Users already manage location services and camera access in the browser, and iOS and Android also prompt for access to resources.
Why is it unrealistic to expect the same for other interfaces like audio, video, WebRTC, and other potentially exploitable functionality?
Most of the permission management most users do is clicking the "accept" button when installing an app, without reading anything on the list. I don't see how that helps.
Just altering your own browser's fingerprint for each domain won't poison their data (it just makes you anonymous to them). Any data is good data as far as these trackers are concerned. You can devalue their data by collectively sending the same fingerprints, but there is no way to actively poison their databases.
Some fingerprinting methods are probably used to distinguish real users from bots. Bots can use patched headless browsers masquerading as desktop browsers (for example, as the latest Firefox or Chrome running on Windows). Subtle differences in font rendering or missing audio support can be useful for detecting the underlying libraries and platform. Hashing is used to hide the exact matching algorithm from scammers.
There are a lot of people trying to earn money by clicking ads with bots.
Edit: and by the way, disabling JS is an effective defense against most of these fingerprinting techniques.
As someone who has written code to detect bots: exactly this. We don't care about fingerprinting the user; we care about fingerprinting to verify that you are the user agent you claim to be.
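A hedged sketch of what such a consistency check can look like – claimed user agent versus observable evidence (the properties are real, but real detectors use many more signals than these two):

    // Sketch: does the environment match the claimed user agent?
    function looksLikeBot() {
      var ua = navigator.userAgent;
      var claimsChrome = /Chrome\//.test(ua);
      // A "Chrome" without window.chrome, or a browser admitting to
      // automation via navigator.webdriver, is likely a patched
      // headless client.
      return (claimsChrome && typeof window.chrome === 'undefined') ||
             navigator.webdriver === true;
    }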
The WebRTC guys get around this by stating that fingerprinting is game over anyway, so why bother. They ignore that they are going against the explicitly defined networking (proxy) settings, and browsers are complicit in this. If an application asks "should I use a proxy?" and then silently ignores the answer wherever it wants, that's deceptive and broken.
There are still zero (0) use cases for having WebRTC data channels enabled in the background with no indicator.
If all these APIs are added, the web will turn into an even bigger mess than it already is. Browsers can't prompt for permissions too often, so they'll skip prompting entirely, like WebRTC does.
Seems like browsers should ask the user's permission to use these HTML5 features, then whitelist. For example, a site that does nothing with audio should be denied access to the audio stack.
There is an acceptable tradeoff between pseudo-anonymous access through browsers and non-anonymous access through native apps.
To interpret this research as a reason for crippling the web or browsers would be a giant mistake. Crippling browsers will only work against users, who will then be forced by companies into installing apps.
Two popular shopping companies in India did exactly this: they completely abandoned their websites and went native-app-only. This, combined with the large set of permissions requested by the apps, led to a worse privacy experience for consumers. As the Instant Apps announcement at Google I/O demonstrates, the web as an open platform is in peril, and its demise will only be hastened by blindly adopting these kinds of recommendations.
Essentially, the web as an open platform would be destroyed in the name of perfect privacy, only to be replaced by inescapable walled gardens. Consider instead that the web allows a motivated user to employ evasion tactics while still offering usability to those who are not interested in privacy – whereas native apps, for which Apple needs a credit card on file before you can install anything, offer no such opportunity.
I am happy that Arvind (an author of the paper), in another comment, recommends a similar approach:
"""
Personally I think there are so many of these APIs that for the browser to try to prevent the ability to fingerprint is putting the genie back in the bottle.
But there is one powerful step browsers can take: put stronger privacy protections into private browsing mode, even at the expense of some functionality. Firefox has taken steps in this direction https://blog.mozilla.org/blog/2015/11/03/firefox-now-offers-....
Traditionally all browsers viewed private browsing mode as protecting against local adversaries and not trackers / network adversaries, and in my opinion this was a mistake.
"""
> Two popular shopping companies in India did exactly this: they completely abandoned their websites and went native-app-only. This, combined with the large set of permissions requested by the apps, led to a worse privacy experience for consumers.
I'm surprised nobody has commented on your comment yet. I was in a meeting just this morning where my interlocutor assured me that in 10 years over 70% of advertising will be in native apps, since everything else is getting blocked or abandoned (and he presented it as an opportunity to do all the stuff you "can't do anymore" in the browser).
Over 3,000 top sites use the font technique, and from the description it sounds really wasteful (choosing and drawing in a variety of fonts for no reason other than to sniff out the user).
Each font is probably associated with a non-trivial caching scheme and other OS resources, not to mention anti-aliasing during rendering, etc. So a web page, doing something you don't even want, can cause the OS to devote maybe 100x more resources to fonts than it otherwise would?
A simple solution would be to set a hard limit, such as "4 fonts maximum", for any web site, and to completely disallow linked domains from using more.
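For reference, a rough sketch of how the probing itself typically works (the detection trick is standard; the probe string and sizes here are made up), which also shows why a font cap would bite:

    // Sketch: detect an installed font by comparing text widths against
    // a generic fallback. Trackers loop this over hundreds of fonts,
    // which is where the wasted work comes from.
    function hasFont(ctx, fontName) {
      var probe = 'mmmmmmmmmmlli';
      ctx.font = '72px monospace';
      var fallbackWidth = ctx.measureText(probe).width;
      ctx.font = '72px "' + fontName + '", monospace';
      // A width different from the fallback means the font is installed.
      return ctx.measureText(probe).width !== fallbackWidth;
    }

    var ctx = document.createElement('canvas').getContext('2d');
    console.log(hasFont(ctx, 'Calibri')); // true/false, per machine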
After reading this, I want to disable JavaScript entirely, along with cookies, and go back to text browsing. I've been using Ghostery on my phone; it's been pretty good.
Some applications want access to battery info because they might want to disable some functionality when your battery runs low. It would be smarter if, instead of exposing the exact battery level, the page just got a callback once the battery runs low.
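Something along those lines, sketched on top of the existing Battery Status API (navigator.getBattery() is the real entry point; the 15% threshold is just an example). The point is that only a coarse event, not the exact level, would need to be exposed:

    // Sketch: react to "battery is getting low" via a callback. Today
    // the API hands you battery.level anyway; a privacy-friendlier
    // design would expose only the event.
    navigator.getBattery().then(function (battery) {
      battery.addEventListener('levelchange', function () {
        if (battery.level < 0.15) {
          console.log('battery low - switching off expensive features');
        }
      });
    });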
Sharing the battery information with a browser extension seems reasonable, but are there any websites that actually use the battery information for legitimate user benefit?
Of course this is something you do. Throw it together with all the other information you can glean from a browser (referrer, IP) and you can get a match with a very high confidence level.
Shops can do the same with shopping baskets: you find that people are either identified by one very rare item that recurs often, or by their little graph of 4-5 items which correlates 99% with them.
That's the line Apple took with iOS shortly before it introduced the App Store. Mozilla, Palm/HP, and even Microsoft with its Win 8 Metro apps tried to make websites the new apps. It has some shortcomings.
Web apps are definitely getting better, I haven't used an actual email client in 10 years, but they have a long way to go before they can replace dedicated clients entirely.
> Web apps are definitely getting better, ... but they have a long way to go before they can replace dedicated clients entirely.
And yet, just yesterday there was a great discussion on Virtual Desktop Infrastructures, where entire operating systems are accessed and operated virtually through just the browser [0].
The current top comment indicates that while there are some setup hoops to jump through to use a specific OS, the performance itself "works very well" [1]. Does this not qualify as a web app replacing a client entirely?
> Does this not qualify as a web app replacing a client entirely?
If you can pull up a video stream from a surveillance camera in your house, do you then no longer need a home?
When you watch Daredevil on the Netflix App on your phone do you think that the actors are inside your phone performing live action for you?
What they're discussing is a web app that lets you interact with a remote client. That client OS still exists, and the UI/UX is still rendered by non-web technology; the rendered pixels are just streamed to your web browser instead of a monitor, and your inputs are captured and transmitted back to that client OS.
Ideally I'd like to have a minimal OS and file set on my local machine (for offline and poor connectivity scenarios), that automatically syncs with my own, encrypted cloud system, such that I can (at my own discretion) update the OS from controlled sources (e.g. git). But I don't think there is enough interest from others for such a system, and I'm occupied with enough other projects that I won't be able to set up such a system.
I think it depends on your use case. I use Google Photos to store my media files, GitHub to store the configuration & source code of my applications, Chrome to store bookmarks and passwords, Spotify to save & listen to music, etc. Even if I lost my computer now, I could easily set up my desktop environment again.
For me, I don't use Google Photos or Spotify, I have my own local copies and maintain my own backups.
I do use Github for some projects, but I also maintain local copies and maintain my own backups for all my projects.
If pinboard.in ever disappeared, it'd be like losing an appendage! It might not be as bad as losing an entire arm or leg, but its loss would be equivalent to at least a finger or two!
Full replacements will probably have to wait for mass adoption of WebAssembly, web workers and probably some other things as well.
Heck, what would be really interesting is hardware acceleration for the final version of WebAssembly. That should (?) make it competitive with native code.
I'm not sure why this seems to be a controversial opinion. Wasm with good concurrency could mean that well-written wasm software actually runs faster than poorly written desktop software, unless the desktop side has hardware acceleration that can't be used on the web side.
I think it's similar to how the Absolute Computrace rootkit identifies Android and Lenovo devices. Each hardware component has a unique ID – your Ethernet adapter, Bluetooth radio, even microphones and batteries.
I agree with that (technical person's) perspective, but that was not the mainline argument. Steve Jobs got on stage and said it was a battery-wasting hunk of crap that invaded your privacy. I'm saying you can make that case regardless of your platform du jour.
Simple: separate documents, interactivity, and programs.
If I browse the web, I usually want documents.
Sometimes I also want interactivity, like in comment forms, which could be a separate widget that could only interact in limited ways, and only with the page and the server it connects to.
And then there would be programs, which could even access local files – but which would have an installation process like browser extensions.
Giving documents the access that normally only programs have is stupid, as we have seen with Word macro-based malware, PDF-based malware, browser-based malware (the pdf.js exploit, for example), and so on.
How about not developing applications in the browser?
It's about linked documents, not angular-17 MVVM async session persistence in IndexedDB with WebSQL and asm.js rendering WebGL for a spinning teapot.
OK. So you get your 'document only' internet. Where do we put all the other stuff? I have a bunch of 'non-document' websites that are essential to me now.
What happens under your new regime? Someone reimplements them all as native apps?
The point I'm making is: just because your internet is document-only, please don't assume mine or other people's is.
Every web app is written in perl/php, therefore we will keep using perl/php. - What will you do, rewrite the apps?
internet != web.
This kludge of in-browser tech everyone is pursuing already comes with so much suffering for the developer, yet the enthusiasm to embrace the garbage like religion is just unbelievable. (Yeah, JS everything. Spotify, lol.)
I think the current web-app tech stack is not doing our generation justice. Yes, Google has a very strong interest in maintaining the status quo, since it dominates it, and so do many other giants. But good, user-friendly, maintainable software looks different.
Why not break the completely misused document model for apps? There are no document semantics in 1000 <div> elements riddled with JS callbacks.
But don't let my cynicism annoy you; it's the resignation talking. Imagine what cool tech we would have if something had been started in the 90s (no, not Java applets) and was all grown up by now.
But every big company is now a walled-garden provider. Just think about UI toolkits – would Spotify ship inside a Chrome frame otherwise?
No one denies that the web as we know it is kind of a clusterfuck of things duct-taped together, but calling it "a shit technology" is maybe giving it too little credit.