

Detecting PhantomJS Based Visitors - walterbell
http://engineering.shapesecurity.com/2015/01/detecting-phantomjs-based-visitors.html

======
tommorris
> As a website owner, you want to ensure you serve humans, and as a web
> service provider you want programmatic access to your content to go through
> your API instead of being scraped through your heavier and less stable web
> interface.

I'm not sure that, as a website owner, I want to serve only humans. What happens
if someone bookmarks my site with a service like Delicious or Pocket or Pinboard
and that service wants to scrape metadata from my site? The poor user can't see
the metadata because I have a strange preference for human users over robots.

Oh, sure, those sites should get API keys, right? Err, no. Why should someone
need to be able to parse arbitrary JSON and have to sign up with an API key
for every site they visit?

Your web site shouldn't be (significantly) heavier or less stable. In an ideal
world, if you followed REST properly, there'd be no real difference between an
API and a website: both are just hypermedia representations of the resources
you make available.

You can put data inside HTML (microformats, RDFa). You can use content
negotiation so that the same URLs serve up different content types.

Also, by filtering programmatic access, you'll piss off a lot of geeks who use
stuff like PhantomJS to automate the boring, tedious shit that your website
probably makes them do. I have little scripts that do things like automatically
download invoices from suppliers for accounting purposes. All of that in order
to prevent a security threat that shouldn't exist, because your site shouldn't
have to rely on filtering out particular browser types to remain secure.

------
zimbatm
I wish navigator.plugins had never been exposed to websites. Now every site I
visit knows which set of exploitable software I have installed, down to the
exact version number. How convenient.
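
For anyone curious, this is all it takes from any page's script (standard DOM
properties; plugin descriptions and filenames routinely carry version strings):

    // Enumerate every installed plugin visible to the page.
    for (var i = 0; i < navigator.plugins.length; i++) {
      var p = navigator.plugins[i];
      // e.g. "Shockwave Flash", "Shockwave Flash 11.2 r202", "libflashplayer.so"
      console.log(p.name, p.description, p.filename);
    }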

EDIT: seems more complex to remove than I thought:
[https://bugzilla.mozilla.org/show_bug.cgi?id=757726](https://bugzilla.mozilla.org/show_bug.cgi?id=757726)

------
walterbell
It looks like Selenium [0] can control both native browsers and headless
PhantomJS [1] for test automation [2]. Native browsers pass fingerprint tests
and allow visual observation of the running test, but are slower due to
display rendering.
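
A rough sketch of what that looks like with the Node bindings (assuming a
selenium-webdriver release from that era, with the phantomjs binary and
GhostDriver on the PATH; swapping the browser name gives the native-browser run):

    // The same test drives either headless PhantomJS or a native browser;
    // only the name handed to the Builder changes.
    const { Builder } = require('selenium-webdriver');

    async function run(browserName) {
      const driver = await new Builder().forBrowser(browserName).build();
      try {
        await driver.get('http://example.com/');
        console.log(browserName, 'title:', await driver.getTitle());
      } finally {
        await driver.quit();
      }
    }

    run('phantomjs');  // headless: fast, but trips fingerprint checks
    // run('firefox'); // native: slower, renders to a display, passes fingerprinting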

[0] [http://www.chrisle.me/2013/08/5-reasons-i-chose-selenium-ove...](http://www.chrisle.me/2013/08/5-reasons-i-chose-selenium-over-phantomjs/)

[1] [http://www.assertselenium.com/headless-testing/getting-start...](http://www.assertselenium.com/headless-testing/getting-started-with-ghostdriver-phantomjs/)

[2] [http://blogs.adobe.com/security/2014/07/overview-of-behavior...](http://blogs.adobe.com/security/2014/07/overview-of-behavior-driven-development.html)

------
pikzen
Every single method relies on detecting something sent by PhantomJS. The author
even admits for the first few tests that they're spoofable, but the last four
don't carry that caveat. Forking PhantomJS to send out data that looks like a
normal browser would take ten minutes.
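
To make the point concrete, roughly this much PhantomJS-side script hides the
obvious tells (the exact properties to scrub depend on which checks a site
actually runs):

    // Run inside PhantomJS: scrub its fingerprints before any page JS executes.
    var page = require('webpage').create();

    // Stop advertising "PhantomJS" in the user agent.
    page.settings.userAgent =
      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36';

    // onInitialized fires before the page's own scripts, so detection code
    // never sees window.callPhantom or window._phantom.
    page.onInitialized = function () {
      page.evaluate(function () {
        delete window.callPhantom;
        delete window._phantom;
      });
    };

    page.open('http://example.com/', function () {
      console.log('loaded while looking like a normal browser');
      phantom.exit();
    });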

I'm not sure what the point of this is. An honest title would be "detect
phantomjs by trusting the client". It's completely stupid.

~~~
harryf
To me, the most promising approach would be entropy detection: crawlers have
less entropy than humans.
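
Something like this browser-side sketch, purely as an illustration (the signal
and the threshold are my own guesses, not anything from the article):

    // Crude entropy heuristic: humans produce noisy mouse timing,
    // while many bots produce no mousemove events or perfectly regular ones.
    var lastT = null;
    var deltas = [];

    document.addEventListener('mousemove', function () {
      var t = Date.now();
      if (lastT !== null) deltas.push(t - lastT);
      lastT = t;
    });

    function looksHuman() {
      if (deltas.length < 20) return false; // no movement at all is suspicious
      var mean = deltas.reduce(function (a, b) { return a + b; }, 0) / deltas.length;
      var variance = deltas.reduce(function (a, d) {
        return a + (d - mean) * (d - mean);
      }, 0) / deltas.length;
      return variance > 1; // arbitrary threshold for the sketch
    }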

------
aleksi
Fun fact: Ariya Hidayat, PhantomJS author, is VP of Engineering at Shape
Security.

~~~
curiously
so? I mean, nobody is paying him to develop PhantomJS, right?

~~~
dstein64
> "so?..."

The blog post on detecting PhantomJS is from Shape Security. The "fun fact" is
that PhantomJS was developed by someone at the same company.

------
davelnewton
Kind of interesting, mostly because it's something I've never really thought
about before. Not convinced this is a generally-solvable problem, particularly
long-term, without resorting to behavior analysis.

~~~
tommorris
I'm not convinced it is a valuable problem that needs solving.

If you really want a web which only humans can browse and which prevents
non-human-operated clients from visiting websites, require the submission of a
blood, hair, and urine sample, or a photocopy of the person's driving license,
or something.

Until that point, computers will want to do useful things on behalf of humans,
so it might be best not to get in their way just because someone on a website
told you to.

~~~
tuckerman
For some companies though, it's about protecting IP, e.g.

* Google wants to stop scraping so that you can't build a competing search engine that just scrapes Google for every search term it sees.

* LinkedIn has an amazing database of user information; they wouldn't want someone scraping all of it and creating LinkedIn2.

* One of the reasons Quora exists is that there are a lot of opportunities in mining the answers; they don't want another company to piggyback on their hard work of creating the site, acquiring users, making a good UI, paying for hosting, etc.

~~~
tommorris
So it's basically for DRM.

~~~
davelnewton
I think DRM is _part_ of what this attempts to provide, but that's certainly
not the _only_ use.

It's still potentially onerous, since I regularly write agents that pull info
from various places, for personal use, research, and archiving.

