
Detecting Chrome headless - avastel
http://antoinevastel.github.io/bot%20detection/2017/08/05/detect-chrome-headless.html
======
westoque
Your solutions for detecting Chrome headless are good.

But someone who really wants to do web scraping or anything similar will use a
real browser like Firefox or Chrome, run it through xvfb, and control it using
webdriver, maybe exposing it through an API. I find these to be almost
undetectable. The only way you can mitigate this is with more involved
techniques, like IP detection, Captchas, etc.

edit: when I say real browser, I mean running the full browser process
including extensions etc.

~~~
dsacco
Interesting. I've done a lot of scraping and I never did this. Do you have a
treatise that explains how it's done? I've actually been experimenting with
headless Chrome recently because I like it more than PhantomJS.

Practically speaking, I don't think IP detection is useful at all these days,
and the only Captcha that can't be bypassed is Google's most recent version.
The much more successful anti-scraping tactic is to use a sophisticated
reverse proxy that analyzes all requests to identify patterns and unusual
behavior, regardless of the IP source (because any large-scale scraping will
come from many IPs anyway).
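
To sketch the idea (illustrative only; the session key, window, and threshold
are all assumptions on my part), such a proxy could track per-session request
rates instead of per-IP ones:

    // Illustrative sketch: track request timestamps per session,
    // independent of source IP. Window and threshold are made-up values.
    const sessions = new Map(); // sessionId -> recent request timestamps

    function looksAutomated(sessionId) {
      const now = Date.now();
      const recent = (sessions.get(sessionId) || []).filter(t => now - t < 60000);
      recent.push(now);
      sessions.set(sessionId, recent);
      return recent.length > 120; // sustained > 2 req/s over a minute
    }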

~~~
IgorPartola
IP detection, especially IPv4, should still work fine because the cost to get
a new IP will often outweigh the benefit of the scraping.

Having said that, just let them scrape?

~~~
jalfresi
There is an entire industry that provides rotating IPs across fleets of
proxies at very cheap prices. Proxy services like this make it pretty much
impossible to prevent scraping.

------
shakna
> ... to automate malicious tasks. The most common cases are web scraping...

I really don't think scraping should fall onto that list.

There isn't even consensus in the IT world on whether scraping can be legally
restricted.

~~~
throwaway2016a
I came here to say that too. Scraping has a great many legitimate uses. Search
engines, scientific research, trying to use publicly available data that
doesn't have an API. I've had to scrape government websites quite frequently
because they often make public information hard to read by other means.

That last one is an interesting case. I think one of the most effective ways
to deter a scraper might be to just provide an API!

Now if you were using the scraped data to republish (copyright infringement)
or use it to gain a competitive advantage (re-pricing in eCommerce comes to
mind) that is a different story.

~~~
xg15
This is actually an interesting point. If you implemented an effective
scraper-detection API, you'd run the risk of locking out search engines too.

(Though I guess the real-life solution would be both simple and depressing:
make an exception for googlebot and don't care about anyone else)

~~~
andai
Yeah, I've seen a lot of sites that explicitly state that, as well as in
their robots.txt:

All robots forbidden, except googlebot
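
Roughly like this (a minimal sketch of such a robots.txt):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /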

~~~
chrisweekly
... which rule would be obeyed only by legitimate, robots.txt-honoring
crawlers. This reminds me of the anti-piracy messages shown (solely) to
viewers of legally-purchased media. Similar "logic", similar (counter-
productive) "effectiveness".

------
stevefeinstein
So again someone wants to punish all the legitimate people using a web site to
get some marginal benefit from detecting the remaining <1%. The inevitable
false positives don't affect the "malicious" users. Only the legitimate ones.
And how much will this bloat the page load by? Adding more code to an already
overly large page isn't helping anyone.

Just let the web be the web, and stop trying to control it.

~~~
tyingq
Mentioned this in another comment, but for some websites the scraping problem
has real costs. Airline, hotel, stock prices, etc. For some spaces, scaling
and paying bandwidth for unconstrained scraping is costly. And not
restricting it hurts the legitimate users, because the performance sucks.

There are also the scrapers blindly looking for vulnerabilities, or pursuing
other unsavory tactics.

~~~
always_good
Also, what I've learned is how little regard for your site your scrapers often
have, scraping as aggressively as possible.

You're just not always in a place to scale to the abuse or build something
more complex than some simple heuristic filters.

~~~
josteink
> what I've learned is how little regard for your site your scrapers often
> have, scraping as aggressively as possible.

Often? Based on what data?

I find it much more likely you only often _notice_ aggressive scrapers. That
however tells you nothing about the behavior of the average web scraper or web
scrapers in general.

~~~
tyingq
The system encourages it. Ingress data is cheap, and so many scrapers just
default to high frequency.

------
JoshTriplett
This looks like a list of bugs that need fixing; ideally, headless Chrome
should be completely indistinguishable from ordinary Chrome, so that it gets
an identical view of the web.

~~~
heipei
It depends on the target audience. For Google (and for most people) the goal
of Headless Chrome is to offer an easy and feature-complete way of
automatically testing websites, e.g. for performance (PWAs are all the rage)
and bugs. For those folks, it doesn't matter that you _can_ detect the
headless browser; it only matters that it works like the regular one 99% of
the time. This is a huge step up from previous technologies like PhantomJS
or laborious solutions involving webdriver and many moving components.

In some cases they don't even want it to behave exactly like the regular
browser. As soon as your website uses any client-side state (cookies,
IndexedDB, HTTP caching, service workers, local storage), you want an easy
"give me a clean and isolated browsing session" switch like Headless offers.

People scraping the web are not the target audience of this.

~~~
nitwit005
But to automate testing, you probably need working locale support and
functioning image APIs. Those items will probably stop being useful for
detection once they're fixed.

I wouldn't be that surprised if they added WebGL support later as well.

------
sorenbs
Leaving aside for a moment that many "malicious" use cases are actually fairly
common and totally legitimate.

Headless Chrome is awesome and such a step up from previous automation tools.

The Chromeless project provides a nice abstraction and received 8k stars in
its first two weeks on GitHub:
[https://github.com/graphcool/chromeless](https://github.com/graphcool/chromeless)

------
josteink
> Beyond the two harmless use cases given previously, a headless browser can
> also be used to automate malicious tasks. The most common cases are web
> scraping

I guess I disagree with the premise of this article.

How is web scraping fundamentally malicious?

What rights/expectations can you have that a publicly accessible website you
create must be used by humans only?

~~~
sumedh
It puts a load on your server when bots go wild on your site, which in turn
affects the experience of legitimate human users of the site.

------
fforflo
Since when is web scraping a "malicious task"?

~~~
leetbulb
Read the ToS of most websites. :)

~~~
simlevesque
Most ToS don't forbid scraping for "malicious tasks", they just don't allow it
for any task.

~~~
wolco
At what point does a ToS stop being enforceable? Would something like "by
visiting, you grant this site all copyright on any materials you have
created" hold up? Could you demand 10% of yearly revenue from any business
that visits and be legally able to collect the funds?

------
XCSme
If someone wants to scrape your site, they will do it and simply find
workarounds for your "protection". It is impossible to tell the difference
between a real user and an automated scrape request; you can only make their
job a bit harder.

~~~
heipei
True. Then again, the cost for the scraper can be raised significantly by
changing your obfuscation / anti-scraping methods frequently. All of a
sudden a scraper will need close monitoring to ensure the scripts / regexes
are still working, and will likely need a person dedicated to implementing
new workarounds as soon as the sites-to-be-scraped push out a new
obfuscation method.

~~~
afpx
In which case, why not just provide a paid API? The content provider then
makes extra revenue that would otherwise be spent on the endless arms race.

As others have mentioned, there is nothing (that I know of) that can thwart a
motivated and resourceful scraper.

~~~
heipei
Some services have actually gone down that exact road. pastebin.com is one of
them.

But in other cases you simply don't want anyone to be able to extract your
info automatically. A good example would be e-commerce sites, which don't
want anyone to be able to scrape their pricing information at large scale
and in real time.

------
tyingq
I wonder how many of these were deliberate, and how many were missed. Google
has a vested interest in bot detection.

And by releasing headless Chrome, they killed off some of the competition.
([https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuN...](https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE))

~~~
captainmuon
Google also has an interest in making an undetectable bot for their search
engine. Undetectable in the sense that GoogleBot should see the exact same
page that humans see. They have been using Chromium-based code, at least on
certain sites, for a while now. I wonder if this person is damaging his
rankings with these techniques...

Of course, Google announces itself as GoogleBot. It wouldn't surprise me if
they did a second stealthy crawl to detect cloaking. (But I think they are
honest when they say they don't, and just throw cheap human labor at it
instead by having people browse suspect sites.)

~~~
kuschku
> Of course, Google announces itself as GoogleBot. It wouldn't surprise me if
> they did a second stealthy crawl to detect cloaking. (But I think they are
> honest when they say they don't, and just throw cheap human labor at it
> instead by having people browse suspect sites.)

They actually run secondary tests that aren’t GoogleBot. They’re quite easy
to detect on very low-traffic sites. If you only have a few hundred users,
all of whom you know personally, and suddenly, over the range of a few hours,
a few users on Chrome visit a page that isn’t linked or findable in any
search engine, shortly after users with the googlebot UA visited it, and
with certain usage patterns, it’s quite obvious.

Detecting the Android Bouncer’s VM is equally easy, although I only found
that by accident: due to an automated action, my app crashed in it and
submitted an unusual crash report, and I managed to extract parameters
that’d allow detecting it (similar with other Android virus scanners). I
only cared about that to be able to split those "devices" into a separate
category in the crash tracker (all my apps are GPL-licensed and don’t do
anything evil anyway).

------
PascLeRasc
I don't want to start an argument here, but can someone explain why web
scraping is considered malicious?

~~~
zeta0134
I don't believe that responsible web scraping is malicious, but my belief
relies on the assumption that the web is meant to be open. That is,
information put on the internet that is publicly available is considered free
to access, store, and later retransmit. The original web was designed for
researchers to share their work, while the modern internet built on top of
that platform has other moral systems that don't necessarily agree.

Anyway, on a purely technical level, scraping of publicly available content
isn't inherently bad unless you're asked to stop, or are scraping so quickly
as to cause a service disruption by tying up the target systems. There is
nothing malicious about generating normal traffic at the rate of a regular
user. The animosity arises from what you plan to do with the data, and whether
the entity you're scraping agrees with your usage.

~~~
PascLeRasc
Thanks, that's what seems obvious to me too: it's just public data, and it's
possible to collect it without overwhelming the server with requests. I just
don't get why someone wouldn't want you to look at their website.

------
skinnymuch
How many of these can be faked with some additional code in Chrome headless?

Regardless, as others are saying, using complete Chrome or Firefox with
webdriver solves all of these, right? Is there a way to detect the webdriver
extension? I think that's the only difference from a normal browser.

~~~
DiThi
> How many of these can be faked with some additional code in Chrome
> headless?

All of them. As soon as you can run some JS code before the page does, every
single difference can be monkey-patched. There's no way to distinguish native
APIs from fake APIs made by someone who knows all the ways of detecting them.
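
For instance (a minimal sketch; these are just two of the signals the article
checks, injected before any page script runs, e.g. via the DevTools
protocol's Page.addScriptToEvaluateOnNewDocument):

    // Shadow the native getters with spoofed values before page scripts run.
    Object.defineProperty(navigator, 'languages', {
      get: () => ['en-US', 'en'], // headless returns an empty value by default
    });
    Object.defineProperty(navigator, 'plugins', {
      get: () => [1, 2, 3], // any non-empty array-like defeats a length check
    });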

~~~
skinnymuch
Right yeah, of course.

------
tomatsu
> _var body = document.getElementsByTagName( "body")[0];_

You can just use document.body.

I also suggest using a data URL instead. E.g. "data:," is an empty plain
text file which, as you can imagine, won't be interpreted as a valid image.

    
    
      // 'data:,' fails to decode as an image, so onerror fires; headless
      // Chrome reports a 0x0 broken image, while vanilla Chrome reports
      // the broken-image placeholder's size
      let image = new Image();
      image.onerror = () => {
        console.log(image.width); // 0 -> headless
      };
      document.body.appendChild(image);
      image.src = 'data:,';
    

> _In case of a vanilla Chrome, the image has a width and height that depends
> on the zoom of the browser_

The zoom doesn't affect this. It's always in CSS "pixels".

------
netsharc
Shouldn't the first block of code have "HeadlessChrome" instead of just
"Chrome" as the search term?

~~~
avastel
You're right, I changed the code.
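
For reference, the corrected check is along these lines:

    // The headless build identifies itself as "HeadlessChrome" in the UA
    if (/HeadlessChrome/.test(window.navigator.userAgent)) {
      console.log("Chrome headless detected");
    }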

------
tscs37
I do hope that these methods get patched. I tend to archive my bookmark
collection with Chrome headless to avoid losing content when a site goes
offline. I hate it when a website requires me to play special snowflake just
to scrape it for this purpose.

------
jdc0589
dumb question from someone who's written a ton of scrapers and scraping-based
"products" for fun:

at what point does it make more sense for companies to just start offering
open APIs or data exports? Obviously it would never make sense for a company
whose value IS their data, but for retail platforms, auction sites, forum
platforms, etc. that have a scraper problem, it seems like providing their
useful data through a more controlled and optimized avenue could be worth it.

The answer is probably "never"; it's just something that comes to mind
sometimes.

------
revelation
The irony of using JavaScript to detect scraping or bots, when the majority
of those _not used to trick ads_ never execute any of it, because they are
just a better curl.

~~~
xg15
Well, if you're determined to prevent scraping, it's rather easy to hide
content from non-JS bots: simply pull in the content via Ajax, or "encrypt"
it and perform the decryption via JS.
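
A trivial sketch of the latter, with base64 standing in for whatever
obfuscation you choose (the element ids are made up):

    // The server ships only an opaque blob; a non-JS scraper sees nothing useful.
    const blob = document.getElementById('payload').textContent;
    document.getElementById('content').innerHTML = atob(blob);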

So thinking about how to ward off bots that _do_ go the extra mile makes
sense. (From a scrape-protection POV at least)

~~~
heipei
And it's actually getting easier with every new shiny web API. Want to make
sure only the latest Chrome can retrieve the content of your website? Why not
run a WebAssembly computation that yields the correct URL to fetch. Or what
about a Web Worker? There are endless possibilities, and the only sane way
to scrape / index the web in 2017 is a full-fledged browser.
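
For example, a sketch of the Web Worker variant (the derivation scheme and
renderContent are hypothetical):

    // The real content URL only materializes for clients that run JS
    // and support Workers.
    const code = "onmessage = e => postMessage('/content/' + (e.data * 7 % 97));";
    const worker = new Worker(URL.createObjectURL(
      new Blob([code], { type: 'text/javascript' })));
    worker.onmessage = e => fetch(e.data)
      .then(r => r.text())
      .then(renderContent); // hypothetical rendering function
    worker.postMessage(42);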

------
askvictor
All of these could quite easily be overcome by compiling your own headless
Chrome. It wouldn't surprise me if a fork to this effect appears soon.

------
userbinator
Those who want a more "authentic" experience would do better to use a real
normal browser, and control it from outside.

------
DannyDaemonic
I'd be willing to bet that the missing image-size variance is more of a bug
or oversight, and is something that will be fixed.

------
hossbeast
"Beyond the two harmless use cases given previously, a headless browser can
also be used to automate malicious tasks. The most common cases are web
scraping, increase advertisement impressions or look for vulnerabilities on a
website."

Cheating an advertiser I'll grant you, but the other two are 100% legitimate.

------
assafmo
"... a headless browser can also be used to automate malicious tasks. The most
common cases are web scraping... "

Since when is web scraping considered malicious? Companies like Google make
billions because they use web scraping.

------
codedokode
What about mining cryptocurrency on a page load as a solution against
scrapers?

~~~
heipei
That's like the people pushing GitHub PRs that mine $coin in the CI process.
But seriously: you can do that, but scrapers will have short timeouts anyway
before they abandon the page or consider it loaded, so there's probably not
much to be made in terms of profit.

------
fiatjaf
Isn't it possible to detect a bot by tracking events like random mouse
movement, scrolling, clicking, etc.? Why aren't these kinds of detection
tried in place of captchas, for example?
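
Something like this is what I have in mind (a rough sketch; the reporting
endpoint is made up):

    // Flag sessions that produce no human-like input within a few seconds.
    let sawInput = false;
    for (const ev of ['mousemove', 'scroll', 'keydown', 'touchstart']) {
      window.addEventListener(ev, () => { sawInput = true; }, { once: true });
    }
    setTimeout(() => {
      if (!sawInput) navigator.sendBeacon('/suspect', location.pathname); // made-up endpoint
    }, 5000);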

~~~
kuschku
Because they are easily faked.

Google’s current captcha system does track a few of these, but it mostly
looks at your browsing history and, if that seems normal, will accept you.

I’ve run a few IRC bots that allowed people to submit Google searches, and
would return the first resulting link. They also fetch any link mentioned in
IRC channels, execute the JS, and after a timeout of 400ms respond with the
current page title.

Both combined – a normal search history, reading a few hundred pages and
videos a day per user – apparently are enough that they seem "human", and can
pass NoCaptcha.

------
megamindbrian
Can you guys shut up already?

