This looks like it's trying to exercise every dark corner of the user's browser in order to ensure that the browser is a real, eyeball-facing browser and not just a URL fetcher, PhantomJS/SlimerJS, or a clickjacking plugin being used to fraudulently click ads.
I think it's easy to see both sides here: tools like this are a powerful way to detect and combat botnets and click fraud, but if/when weaponized they're also a form of browser fingerprinting which is a nasty way to ruin anonymity across the web.
IMO there are a lot of bigger targets on the Taxonomy of Bad Internet Things: malware-serving bottom-tier ad networks, "wrapped download" sites, clickjacking, and especially cross-site correlative "analytics" companies come to mind as being more sinister to privacy than Dan Kaminsky going botnet hunting.
What can make it bad is:
1. What you do with that information.
2. Who or what you share it with.
3. How far your reach extends.
An ad network which can place that code on multiple websites can put itself in a position of power and track devices, and thus browsing habits, of individuals.
However, if you have fingerprinting code only on your own website, don't share that information with any other people/companies/websites, and use it solely to detect malicious bots, users, and behaviors, then I don't really think it's bad. It's like the difference between a gas station owner pointing a closed-circuit camera at the door and someone flying a surveillance drone over a whole state: both are surveillance, but one is far more invasive than the other.
So I did. Selenium allows you to automate a real browser and capture the responses; hell, if need be you can create a fake profile with Chrome and make it completely indistinguishable.
EDIT: explanation. It was a video tutorial site for one of the technologies I use. Each video had a download link, but there was no way to batch downloads for offline use, and scraping it with normal tools didn't work since the site was doing UA and other sniffing. So I whipped up a Python script to control Chrome: authenticate, sign in to get the cookie, then pull the session-unique download link for each video. Since I'm not a dirtbag, I set the time between downloads to 30 minutes (the average video is 15 minutes) and left it running for 24 hours to get the ones I wanted.
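Purely as an illustration of that flow (the commenter's actual script was Python, and the site URL, selectors, and link extraction below are hypothetical placeholders), a sketch with selenium-webdriver might look like this:

    // Rough sketch of the flow described above: sign in with a real Chrome
    // session, then pull each session-unique download link, well spaced out.
    import { Builder, By, until, WebDriver } from "selenium-webdriver";

    const THIRTY_MINUTES_MS = 30 * 60 * 1000;
    const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

    async function run(): Promise<void> {
      // A real (non-headless) Chrome, so UA and feature sniffing see a normal browser.
      const driver: WebDriver = await new Builder().forBrowser("chrome").build();
      try {
        // Authenticate so the session cookie lives inside the browser itself.
        await driver.get("https://tutorial-site.example/login");
        await driver.findElement(By.name("username")).sendKeys("me@example.com");
        await driver.findElement(By.name("password")).sendKeys("hunter2");
        await driver.findElement(By.css("button[type=submit]")).click();
        await driver.wait(until.urlContains("/videos"), 10000);

        // Visit each video page and read its session-unique download link.
        const videoUrls = ["https://tutorial-site.example/videos/1"]; // discovered elsewhere
        for (const url of videoUrls) {
          await driver.get(url);
          const link = await driver.findElement(By.css("a.download")).getAttribute("href");
          console.log("would download:", link);
          await sleep(THIRTY_MINUTES_MS); // be polite: space downloads ~30 minutes apart
        }
      } finally {
        await driver.quit();
      }
    }

    run().catch(console.error);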
Selenium allows you to automate a real browser and capture the responses; hell, if need be you can create a fake profile with Chrome and make it completely indistinguishable.
I've used Selenium, but I just assumed that headless browsers were exactly that: real browsers minus the UI.
So I had a hammer and the problem looked like a nail. ;)
The best you can do to avoid being ID'd while staying "kinda headless" is slimer or selenium to XVFB, which aren't really headless, but sort of.
 slimer [AND] selenium to [OR]
It sounds like it would be quite easy to circumvent just by running a real browser... especially with lightweight VMs.
Those kinds of security issues are the low-hanging fruit that's largely been fixed by now.
We are still very early in the age of the Internet. People are sending all sorts of trashy traffic. There is ample opportunity to optimize but net neutrality means we have to treat it all the same. It's nuts.
What problems are caused by browser automation? Slightly more on point, what issues might the NYT be seeing that detecting browser automation is the sensible solution?
I'm actually working on developing a system to track browser analytics and usage to detect if it's a person on the other end or a bot.
The quick solution, of course, would be to have a captcha when viewing ads on sites so the advertiser could confirm it's actually a legitimate user. But there are users doing everything they can to avoid being tracked or seeing ads at all, so what incentive do they have to confirm they are human just so they can be targeted for advertisements? That's why there are companies trying to work behind the scenes to determine whether the browser is a legitimate session or a bot session.
Companies looking to buy advertisement space are really homing in on bots now, because it's become such an issue: server farms are set up to automate views on pages and inflate profits. Or, as in the case of the company that runs this script on NYTimes, the goal is to see whether the user is viewing the page through a legitimate viewing session or is running software in the background of their computer that pushes page views automatically.
I could probably talk all day long about this, but advertising is a huge HUGE market. There is little to no day-to-day talk of the users running ad block on their computers; that's a low percentage of the actual users we run into. The big conversation is about the people who have created botnets of hundreds of computers to push thousands of fake impressions, and how to handle that.
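To make the "working behind the scenes" part concrete, here is a minimal sketch of the kind of in-page signals such a detector might collect. The property checks are standard browser APIs, but the particular heuristics and scoring are my own illustration, not what any specific vendor (or the NYT script) actually does:

    // A few in-page signals a bot-detection script might look at.
    interface BotSignals {
      webdriverFlag: boolean; // set by most automation frameworks
      noPlugins: boolean;     // headless builds often report an empty plugin list
      tinyViewport: boolean;  // default automation window sizes are a weak hint
      noMouseSeen: boolean;   // real readers eventually move the mouse
    }

    let mouseSeen = false;
    window.addEventListener("mousemove", () => { mouseSeen = true; }, { once: true });

    function collectBotSignals(): BotSignals {
      return {
        webdriverFlag: navigator.webdriver === true,
        noPlugins: navigator.plugins.length === 0,
        tinyViewport: window.innerWidth < 400 || window.innerHeight < 300,
        noMouseSeen: !mouseSeen,
      };
    }

    // Crude score: the more signals fire, the more bot-like the session looks.
    function botScore(signals: BotSignals): number {
      return Object.values(signals).filter(Boolean).length;
    }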
You do need to enable this. After reading the article I immediately checked my dashboard and saw that the option was available, but unchecked.
NoScript mitigates the problem because WebRTC won't work with scripts blocked but I'll still have to disable the plugin to make it work on legitimate sites. It's annoying and I'd prefer a browser permission popup along the lines of what has been suggested in other posts here.
Now to find out how many sites break in fun and exciting ways for having done this ;)
After enabling the option in uBlock Origin, my local IP would not show, but my public IP address still showed.
I decided to check out what options are in chrome://flags for WebRTC (there used to be an enable/disable toggle for WebRTC, but now that's restricted to just Android).
I found this other option in the flags: chrome://flags/#enable-webrtc-stun-origin
Enable support for WebRTC Stun origin header. Mac, Windows, Linux, Chrome OS, Android
When enabled, Stun messages generated by WebRTC will contain the Origin header.
I enabled it, and my public IP no longer shows on that site.
It doesn't matter if I disable or enable uBlock; my public IP never shows on that site with that Chrome flag enabled.
Any idea what that flag actually does, and the repercussions of leaving it enabled?
If you are not behind a VPN, it is expected that your ISP-assigned address is visible -- WebRTC or not.
The major use case for webrtc ip leak blocking is preventing leaking of rfc1918 IPs (or link/site-local IPv6 addresses) and preventing leaking of alternate LAN and alternate public IPs.
For example, if you web browse through a VPN, this webrtc functionality will by default reveal not only your public VPN exit IP, but also your VPN rfc1918 ip, and also your real rfc1918 ip, and your primary, non-vpn public IP. All in the name of better connectivity for webrtc. It's horrible.
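A minimal sketch of how a page triggers this, assuming nothing beyond the standard RTCPeerConnection API (the STUN server is just a commonly cited public one):

    // Ask the browser to gather ICE candidates and log them. Host candidates
    // expose local (RFC 1918) addresses; server-reflexive ones expose whatever
    // public address the STUN server sees.
    const pc = new RTCPeerConnection({
      iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
    });

    pc.onicecandidate = (event) => {
      if (event.candidate) {
        // Lines look like "candidate:... typ host ..." or "... typ srflx ..."
        console.log(event.candidate.candidate);
      }
    };

    // A data channel is enough to start candidate gathering; no camera/mic
    // permission prompt is involved, which is why this happens silently.
    pc.createDataChannel("probe");
    pc.createOffer().then((offer) => pc.setLocalDescription(offer));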
I don't understand why webrtc does this. How does revealing non-global addresses help improve (edit: reliably improve) connectivity? Those addresses aren't guaranteed to be globally unique, so if you have two webrtc app users on the same non-global netblock, so what? Even if they have the same public IP, it's not guaranteed that they can talk to each other with their non-global addresses; they could be on different (isolated) internal networks. So the app will be blindly trying to connect to random internal ip addresses. Sounds fun for NIDS.
Plus, thanks to IPv4 depletion there might be multiple layers of NAT involved, because ISPs are having to deploy carrier-grade NAT.
I don't see how WebRTC could do this if you're actually routing 0.0.0.0/0 (the default route) to your VPN, which I think is how a lot of people use VPNs when they're the kind you toggle on and off. Are you aware of a way it could get your non-VPN public IP even in those cases?
This is enabled by default on Windows, OS X, and some Linux systems.
But I do wonder whether WebRTC can find your other v6 addresses; a host often has more than one.
on sites that you really trust and need it...
It's a royal pain in the ass.
I can't imagine that non-power users would have any tolerance at all for all that hassle.
$site_name wants to use WebRTC.
WebRTC allows voice calling, video chat, and P2P file sharing, but can also be a privacy risk. We recommend allowing WebRTC only on sites that you expect to use such features on.
[Link to learn more]
Allow WebRTC for $site_name?
"sites that you expect to use such features on"
This means "sites on which you expect to use such features"
Not "sites where you expect them to use such features"
The preposition is key to having you be the one using, not the site. It doesn't have to go on the end, perhaps, and you could rewrite the sentence without that preposition, but simply removing it would leave you with an entirely different meaning.
So what you're really asking for is "can the web app know my non-NATed IP address if I'm behind a NAT and my non-VPN IP address if I'm behind certain kinds of VPNs"?
If you are interested and have some time, find and contribute HTTP "fingerprint" assets from devices on your LAN to src/db.js.
I've now given up on "naked" browsing of the web and only surf via the Tor Browser Bundle. I use a standard Firefox only for web development.
What I'm more concerned about is things like criminals obtaining the info and realizing my personal machine might be a good target for some reason. Or some company selling the data, which eventually gets to a healthcare company that is able to connect my browser history to my name and start denying me coverage or marking me as "potentially high risk" because I looked up the wrong thing, or things like that.
I haven't gone as far as using Tor, but I block cookies from 3rd parties, use ad blockers and browse entirely in private mode until I'm forced not to by some lame web site. After I'm done, I delete as many tracking things as I can and turn private mode back on.
That said, I'd rather there be permissions surrounding WebRTC, but my clients are happy.
Original discussion: https://news.ycombinator.com/item?id=8949953
You'd be sickened and surprised by how many startups overlook handling chargebacks.
I am deeply pessimistic about the potential for tracker-blind browsing without extraordinary measures. Simple plugins or cookie rules simply do not and cannot cut it.
There are just umpteen million ways to fingerprint a device. What plugins do you have installed? What is your font list? What can be deduced about your device's make/model/revision from things like HTML feature support? Then you have WebGL and other technologies that potentially allow for hardware fingerprinting via various methods, slight differences in JS performance revealing things about your JS runtime engine's revision (JIT differences, etc.). Don't even get me started on all the myriad things you can do with TCP, ICMP, network latency, geo-ip, etc.
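As a taste of how little effort some of these take, here is a small sketch that samples a handful of those signals (the API calls are standard; real fingerprinting libraries collect far more, and combine them):

    // Sample a few passively available fingerprint signals.
    function collectFingerprintSignals(): Record<string, string> {
      const signals: Record<string, string> = {
        userAgent: navigator.userAgent,
        language: navigator.language,
        screen: `${screen.width}x${screen.height}x${screen.colorDepth}`,
        timezoneOffset: String(new Date().getTimezoneOffset()),
        pluginCount: String(navigator.plugins.length),
        touchPoints: String(navigator.maxTouchPoints),
      };

      // WebGL can often name the GPU, which narrows down the hardware a lot.
      const gl = document.createElement("canvas").getContext("webgl");
      const dbg = gl?.getExtension("WEBGL_debug_renderer_info");
      if (gl && dbg) {
        signals.gpu = String(gl.getParameter(dbg.UNMASKED_RENDERER_WEBGL));
      }
      return signals;
    }

    // Concatenate and hash the values and you have an identifier that survives
    // cleared cookies; stacking many weak signals like this is the whole trick.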
Anything less than onion routing (Tor and friends) combined with a high-isolation virtual machine or separate hardware device and a browser with no persistent state whatsoever is probably provably inadequate to protect you from fingerprinting or tracking. Any un-obscured network path back to you, access to any form of non-generic local hardware or storage, or persistent state equals fingerprinting/tracking hacks.
It's like using simple XOR for "encryption" and then saying "well, it's better than nothing." Yeah, maybe it's a nano-something better than nothing but it's basically nothing. You might as well not even bother.
Personally I think privacy is dead dead dead dead dead and we need to start talking seriously about what kinds of new political mechanisms and safeguards we need to mitigate abuse. This is a political problem and does not have a technical solution that doesn't come with a lot of cost -- e.g. the enormous performance overhead of onion routing and the inconvenience of secure computing environments. 99.999% of users are not going to do any of that stuff and never will.
It's pretty scary
I'd agree though that preventing general purpose browser fingerprinting is pretty much dead.
With IPv4 you'd have a small range of IP addresses, but with IPv6 you can have a different IPv6 address each hour if you so choose.
edit: from the horse's mouth https://code.google.com/p/chromium/issues/detail?id=457492
edit2: you can install this
and test here:
though google sure seems to be dragging their feet on this so I'm sure they'll break this workaround soon
If you go read the bug you linked to, you can see that we added code to Chrome to specifically make this extension possible. It's in many CLs, but here is an example of one:
Further, we recently open sourced our own version of the extension and put it on github:
And we intend to keep this advanced control, and perhaps more in the future, well-supported long-term.
Or they could just use UPnP like everyone else and enjoy a decent P2P connectivity rate without exposing private IPs and making users fingerprintable.
More concerning, though, is that this stuff isn't triggering a permission dialog in Firefox.
Only my opinion but there is much one can do without all the .js
Dillo is freaking fast, once you try it you start to wonder where the web went all bloated. Of course its layout engine is quite dated, AFAIK no HTML5 support whatsoever and I think there are many layout bugs too. I use it to load up huge static html pages, they just kill Firefox or Chromium on my netbook. It's certainly nicer than lynx, sometimes you want to look at images too.
It all depends on what the user is trying to do. And not all users are the same.
For retrieving content I find I do not need a web browser.
Looking at photos is a different task. When I used to use X11, there were a few good options for viewing large numbers of photos quickly.
Watching video is a different task. For this I prefer a dedicated application. Interestingly enough, the player I use on the iPad has a built-in FTP server, HTTP server... and a web browser. Quite useful.
There was a short time long ago when I used wget. That period did not last very long.
curl is overloaded with "features" I will never use.
My idea of a good TCP client is a small, relatively simple program with few options that compiles quickly and easily.
FreeBSD's fetch utility is another one that comes to mind.
Perhaps counterintuitively, I feel I get more "control" from the simpler TCP clients with fewer options (that I compile myself) than I ever did from the more popular programs with too many options and unneeded libraries linked in by default.
But that's just my opinion, nothing more. I would not "recommend" the programs I use to anyone. It is just my personal preference to use them.
You would have fitted into the plan9 user community in the early 2000s and could still perhaps enjoy it today.
A small subset of us used the name "the Tim Berners-Lee kicking club", in reference to how one man killed the greatest tool ever created by adding a worse one on top.
"The use of electronic communications networks to store information or to gain access to information stored in the terminal equipment of a subscriber or user is only allowed on condition that the subscriber or user concerned is provided with clear and comprehensive information in accordance with Directive 95/46/EC, inter alia about the purposes of the processing, and is offered the right to refuse such processing by the data controller."
So acquiring my internal IP address without consent, for reasons other than establishing a WebRTC connection that the user has asked for, is against the law.
This seems like a very odd way of phrasing the law. The default user-agent string, for instance, is stored on the terminal equipment of the user, albeit in read-only memory. It can be used to distinguish one user from another. Therefore, could one argue that any site that includes the user agent in their logs would be violating this law?
And what if the user is not behind a NAT? In that case, the user's external IP address is the one their terminal equipment places in the TCP header... which would mean that it is necessarily stored in said terminal equipment. Did the user give up the right to the privacy of the information by connecting to the website in the first place? Must there be a right-to-refuse all the way down?
It looks like the script in question is hosted on a domain ("tagsrvcs.com") that Adobe uses when loading JS assets for Omniture.
This is very likely a standard Adobe Omniture thing. So it's not the NYT acting alone (or necessarily with awareness of this).
Nothing to see here.
You can use it to unobtrusively monitor license compliance for a SaaS biz. You charge each user. A user is constantly logging on from multiple browsers during the day (e.g. IE and Chrome). With local IP knowledge you can determine whether or not this is being done from the same machine (still abiding by license terms), or from multiple machines (most likely sharing with a colleague and breaking license terms).
Before this WebRTC hack, the only other way to do this that I am aware of is via the dreaded Flash cookie.
A. How often do we see this user using multiple machines on the same day?
B. Is access being made from two different machines at the same time? (A very high indicator of license sharing.)
At the end of the day no matter how much data you add to the equation, you are still dealing in probabilities. So as a business you must be careful about when and who you accuse of license violations.
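A rough sketch of checks A and B above, assuming a hypothetical login-event record for a single user that carries a timestamp and the WebRTC-reported local IP:

    interface LoginEvent {
      userId: string;
      timestamp: number; // ms since epoch
      localIp: string;   // e.g. "192.168.1.23", as reported via WebRTC
    }

    // B: access from two different machines within a short window is the strongest signal.
    function concurrentMachineUse(events: LoginEvent[], windowMs = 5 * 60 * 1000): boolean {
      const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
      for (let i = 1; i < sorted.length; i++) {
        const close = sorted[i].timestamp - sorted[i - 1].timestamp < windowMs;
        if (close && sorted[i].localIp !== sorted[i - 1].localIp) return true;
      }
      return false;
    }

    // A: how many distinct machines does this user show up from on each day?
    function machinesPerDay(events: LoginEvent[]): Map<string, number> {
      const perDay = new Map<string, Set<string>>();
      for (const e of events) {
        const day = new Date(e.timestamp).toISOString().slice(0, 10);
        const ips = perDay.get(day) ?? new Set<string>();
        ips.add(e.localIp);
        perDay.set(day, ips);
      }
      const counts = new Map<string, number>();
      for (const [day, ips] of perDay) counts.set(day, ips.size);
      return counts;
    }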