
It is not possible to detect and block Chrome headless - foob
https://intoli.com/blog/not-possible-to-block-chrome-headless/
======
jimrandomh
Sites detecting headless browsers vs. headless browsers trying not to be
detected by sites is an arms race that's been going on for a long time. The
problem is that, if you're trying to detect headless browsers in order to stop
scraping, you're stepping into an arms race that's being played very, very far
above your level.

The main context in which Javascript tries to detect whether it's being run
headless is when malware is trying to evade behavioral fingerprinting, by
behaving nicely inside a scanning environment and badly inside real browsers.
The main context in which a headless browser tries to make itself
indistinguishable from a real user's web browser is when it's trying to stop
malware from doing that. Scrapers can piggyback on the latter effort, but
scraper-detectors can't really piggyback on the former. So this very strongly
favors the scrapers.

~~~
troels
In my experience, the most effective countermeasure to scraping is not to
block, but rather to poison the well. When you detect a scraper - through
whatever means - you don't block it, as that would tip it off that you are on
to it. Instead you begin feeding it plausible but wrong data (like adding a
random number to the price). This will usually cause much more damage to the
scraper than blocking would.
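
A minimal sketch of the idea as an Express route, assuming a hypothetical
isSuspectedScraper() heuristic and getProduct() accessor (both stand-ins, not
anything from this thread):

    const express = require('express');
    const app = express();

    app.get('/product/:id', (req, res) => {
      const product = getProduct(req.params.id); // hypothetical data accessor
      let price = product.price;
      if (isSuspectedScraper(req)) { // hypothetical detection heuristic
        // Plausible but wrong: nudge the price by up to +/- 5%.
        price = +(price * (0.95 + Math.random() * 0.1)).toFixed(2);
      }
      res.json({ ...product, price });
    });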

Depending on your industry etc., it may be viable to take the legal route. If
you suspect who the scraper is, you can deliberately plant a 'trap street',
then look for it at your suspect. If it shows up, let loose the lawyers.

Of course, the very best solution is not to care about being scraped. If your
problem is the load scraping puts on your site, provide an API instead and
make it easily discoverable.

~~~
adrr
Poisoning the well is very effective. We employed it at a large ecommerce
company that was getting hit by carders testing credit cards on low-price-
point items (sub-$5). We had been playing cat and mouse with them for six
months. We found certain attributes of the browser the botnet was using and
fed them randomized success/fail responses. After two weeks of feeding them
bad data, they left and never came back. They did DDoS us in retaliation,
though.

~~~
stryk
The whole cat and mouse game thing... for some strange reason that sounds fun
to me. Probably because I don't know the details and the workload involved in
actually doing it (and it's not my money or inventory at stake). It seems like
it would be exciting, in that somewhat naive juvenile-ish fantasy sort of way,
to try and figure out how to mitigate the threat, implement it quickly, and
deploy it to watch it play out in real time on live production servers. I
don't know, maybe I have the wrong idea about the whole thing?

~~~
_jal
There are aspects that are fun, but I feel like if you're doing it right, it
is stressful. You are playing an antagonistic game with bad actors, so there's
risk, and you'd better be well past just gaming out the probabilities and
costs there. Just because you noticed them doesn't mean they can't do damage.
You'd also better get informed buy-in from other relevant departments, etc.

Quite a while ago, I was involved in baiting an attacker in a somewhat
different way, but with the same goal (destroying the value we were providing
to them). After the attacker figured out what was going on, they issued a
somewhat credible threat to damage the company's machines (they included some
details demonstrating they had access to a couple of internal machines at some
point), attacked and DoSed us, and persistently tried spearphishing us for
months afterwards.

I guess I'd just say, (a) doing this sort of thing responsibly sucks a lot of
the fun out of it, and (b) don't underestimate the risks of things going pear-
shaped. You could be buying yourself a lot of ongoing grief. Something as
everyday as that spearphishing attack can be nerve-wracking - even after
annoying the crap out of everyone by repeating how to be careful with email,
there's no way to be sure it won't hit, and the next thing you know people are
sitting around with upper management having conversations nobody wants to have
about network segmentation and damage mitigation.

------
odammit
Blocking crawlers is dead simple:

Find a way to build an API for your data that allows you both to make money.

Any effort besides that is wasted.

Honeypot links? Great, my crawler only clicks things that are visible. See
Capybara.

IP thresholds? Great, I have burner IPs that hit a good page of yours until
I'm blocked (am I time-banned, captcha'd, or perma-banned?), and then I back
that number out across my network of residential IPs (bought through squid,
hello or anyone else) and a mix of Tor nodes (I sample your site with that
too) to make sure I never approach that number. But then I also geolocate the
IP so it's only crawling during sensible browsing hours for that location.

Keystroke detection? Yeah, I slow down keystrokes so it looks like Grandma is
browsing (see the sketch at the end of this comment).

Mouse detection? Looks like Michael J. Fox is on your site (that's an old Dell
or Gateway commercial reference, don't be mad).

Poison the well? I get a page from multiple IPs and headless-browser
combinations on different screen orientations, and if I detect odd changes in
the data I flag that URL for a Turk to provide insight/tune the crawler.

I keep the screenshots and full payloads (CSS, JS, HTML) over time so I can do
more devious shit, like render old versions of your page behind a private
nginx server and re-extract pieces of data I may have missed.

Stop trying to stop the crawling and figure out how to create a revenue
stream.
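
For the keystroke-detection bit above, a minimal Puppeteer-style sketch of
slowed-down typing; the timing numbers are invented for illustration:

    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    // Type with randomized inter-keystroke delays instead of page.type()'s
    // fixed delay, so the cadence doesn't look machine-generated.
    async function humanType(page, selector, text) {
      await page.click(selector);
      for (const ch of text) {
        await page.keyboard.type(ch);
        // 80-300ms between keys, with an occasional longer pause.
        await sleep(80 + Math.random() * 220 + (Math.random() < 0.05 ? 500 : 0));
      }
    }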

~~~
vivekseth
How would this stop crawling by users who cannot afford the API subscription
or don't want to pay for it?

I think your suggestion would reduce crawling, but not prevent or block it.

~~~
sushid
Because you'd presumably need to authenticate (and have a paid account) to
access the data.

~~~
odammit
I think they meant that, since they can't afford the API, they'd just keep
crawling your HTML. Combine the API with putting your stuff behind auth and
you solve a lot of the problem.

You've gotta keep that good yummy content publicly accessible for Google
though, so you'll rank. So, that's a balancing act.

------
_o_
This article is a joke, and all those methods of "protection" are a joke. What
we used to call "script kiddies", who now make up a major share of so-called
developers, are just underdeveloped lamers who don't know that the fight is
lost in advance. All the measures you take are useless once the scraper is run
by someone who is able to modify (and who can code in C/C++) and recompile the
client side. The world has slid so far into Idiocracy that these methods are
being invented by people so narrow-minded that they see development only
within the scope of a browser and have a false sense that they "can handle
it". Only if the opponent is as narrow-minded as they are. Only then. I can
modify the source code of Chromium so you get back exactly what you expect
from a regular user; I am able to scrape FB and LinkedIn, and the only thing
they can do is slow me down (to hide the fact that the code is doing the
surfing, not a human). Stop wasting your time on protection: you are running
your inefficient, crappy code in an insecure environment, and the only
"attacker" you are safe against is the one who is as clueless as you are.

The moment you send content to the client, it is game over. You have lost all
control.

I am sorry for all the non-gentle sentences here, but we had developers who
were able to decompile asm code and patch it to defeat DRMs, while now
sandboxed idiots think they are smart. The whole dev environment became toxic
=/ And people are just too stupid to understand how stupid they are =/

~~~
xfer
Harsh words, but it's very true that these developers don't realize Chromium
is open source... maybe they should just jump to the new DRM extension; at
least that will challenge the dedicated scrapers.

------
tabeth
Isn't it impossible to win the game of blocking headless browsers?

What's stopping someone from creating an API that opens up a real browser,
uses a real (or virtual) keyboard, types in/clicks the real address, etc. then
proceeds to use computer vision to scrape the information from the page
without touching the DOM?

~~~
dsjoerg
You are in principle correct, but in practice you need to account for the side
channels of information as well -- do the mouse and keyboard behave like a
human or a robot? Are there thousands upon thousands of sessions coming from
the same IP address?

The cat and mouse game happens at every level, not just the DOM/browser-
detection level.

~~~
nicklaf
So record actual user input data and generate similar input patterns
stochastically.
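
For example, a sketch of replaying a plausible pointer path with Puppeteer -
generated here from a jittered quadratic Bezier curve rather than actual
recorded data, so the curve and jitter parameters are purely illustrative:

    // Move the mouse along a jittered curve from `from` to `to` instead of
    // teleporting straight to the target pixel.
    async function humanMouseMove(page, from, to) {
      const control = { // random control point bends the path like a wrist does
        x: (from.x + to.x) / 2 + (Math.random() - 0.5) * 100,
        y: (from.y + to.y) / 2 + (Math.random() - 0.5) * 100,
      };
      const steps = 25 + Math.floor(Math.random() * 25);
      for (let i = 1; i <= steps; i++) {
        const t = i / steps;
        const x = (1 - t) ** 2 * from.x + 2 * (1 - t) * t * control.x + t ** 2 * to.x;
        const y = (1 - t) ** 2 * from.y + 2 * (1 - t) * t * control.y + t ** 2 * to.y;
        await page.mouse.move(x + (Math.random() - 0.5) * 2, y + (Math.random() - 0.5) * 2);
      }
    }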

That said, if you try to scale this up beyond what a reasonable, normal user
would do in one sitting, you are bound to stand out.

Then again, I find that I already trigger such rate-limiting mechanisms as a
human, just by searching Google and clicking through every last page of search
results.

~~~
averagewall
You'd have to scrape slowly to mimic a real slow user. Maybe at that point
you'd be cheaper to get Mechanical Turk to do it. That should solve IP rate
limiting, captchas, and just about everything except the endless arms race.
Why are so many people going directly to these same-formatted internal URLs
without clicking through from random other places? So the site can change the
internal URLs and break it all over again.

~~~
toomuchtodo
You'd use a browser extension, scoped to requests of sites you're interested
in, and stream your data back to your infrastructure for processing. You're
limited only by your install base and your ingest infrastructure.

Recap [1] does this to extract PACER court documents that are public domain,
but access is restricted due to draconian public policy.

[1] [https://free.law/recap/](https://free.law/recap/)

------
nukeop
Good. The less effective various spying techniques are, and the easier they
are to throw off, the better the internet is for its users. I don't want any
website owners to know what device, browser, or other program I use to access
their site, and they have no business knowing that. I like it being a piece of
information I can supply voluntarily for my own purposes, and I get the
heebie-jeebies every time I read about a new shady fingerprinting technique
that exploits some previously unexplored quirk of web technologies.

~~~
zzzcpan
This incentivizes more aggressive fingerprinting, not the other way around.
Too bad people don't realize it.

~~~
ladzoppelin
Browser fingerprinting, I almost forgot. Non-aggressive and impossible to
stop.

------
koiz
Good. It shouldn't.

The web should be open; the fact that people are still trying to stop this is
a joke.

~~~
yorby
Tell that to WebAssembly and friends.

------
fixermark
I'm not sure why one would bother to do this.

With tools like Sikuli Script (sikuli.org) having been around for ages,
automating a headed browser isn't rocket science. So the best-case scenario
for detecting headless browsers is "the bad guys just use headed browsers and
another automation solution."

~~~
Dolores12
Looks like a great tool; I'd never heard of Sikuli before. Thanks for the tip!

~~~
fixermark
I name-drop it every chance I get ;) We used it to automate the integration
tests for a game engine at a previous company; it worked great, because it
allowed us to fire events into the engine itself based on the actual rendered
pixels (Sikuli supports varying levels of fuzzy image detection for event
targets).

------
j_s
This discussion is also happening on a counterpoint posted about 9 hours
earlier, also currently on the front page:

It is possible to detect and block Chrome headless |
[https://news.ycombinator.com/item?id=16175646](https://news.ycombinator.com/item?id=16175646)

------
zzzcpan
"That’s when it becomes impossible. You can come up with whatever tests you
want, but any dedicated web scraper can easily get around them."

As long as the logic is hidden from the scrapers, i.e. not running in a web
browser, scrapers are at a disadvantage. They don't have the data about users
that websites have. Even something as simple as the Accept-Language header
associated with an IP subnet is a data point that can be used to protect
against scraping. There are a lot more data points, though, and more
aggressive fingerprinting can effectively destroy scraping.
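
A toy illustration of the Accept-Language data point as Express middleware;
the subnet table and scoring are invented stand-ins, and a real system would
combine many such signals:

    const express = require('express');
    const app = express();

    // Hypothetical lookup: /16 prefix -> languages typically seen from it.
    const subnetLanguages = { '203.0': ['en-AU', 'en'], '198.51': ['en-US', 'es'] };

    app.use((req, res, next) => {
      const subnet = req.ip.split('.').slice(0, 2).join('.'); // assumes plain IPv4
      const lang = (req.headers['accept-language'] || '').split(',')[0].trim();
      const expected = subnetLanguages[subnet];
      if (expected && !expected.includes(lang)) {
        // One data point among many; never block on this alone.
        req.suspicionScore = (req.suspicionScore || 0) + 1;
      }
      next();
    });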

~~~
kbenson
All the passive techniques are much harder to reason about, but much easier to
match. You just look at the complete request/response headers, make sure you
match them, and have some good sources to request from.

Much harder is stuff like Distil's script injection, where they transparently
inject script tags that do fingerprinting and obfuscate the code annoyingly
(it's not really _hard_ to reverse, just time consuming and annoying). They
pair this with being a bit more user-friendly: if your fingerprint hits some
threshold, they redirect you to a CAPTCHA page which, if you answer it,
redirects you to the page you wanted - so users experience an inconvenience if
there's a false positive, but still get access to what they wanted.

I was able to get around most of the passive stuff fairly easily with Perl and
LWP, and even the active stuff and CAPTCHA redirects (cookie_jar all
serialized to a DB so I could store the request and re-present it to a user to
answer), but once they started tweaking their fingerprint script every couple
of months/weeks, that's when the equation shifted. Distil, as a solutions
provider, gets to amortize their changes across all their customers, while I
would have to spend the time de-obfuscating it. They could just assign a
person to change it once a week and they would effectively halve my time to
get any real work done, so without a collective effort of some sort to combat
them, I saw the writing on the wall. :/

The sad thing is that when we moved to API access, their APIs were hampered to
the degree that it actually takes two orders of magnitude more requests each
minute for a fraction of the accuracy (previously I could query just the
changes over the last couple of minutes; now I have to query the entire item
set of a subset of all containers, when there are tens of thousands of
containers). :/ Lose-lose, since our use case isn't even the main reason the
site wanted to block scrapers.

~~~
ThrustVectoring
Did you do a cost estimate for the "Wizard of Oz" solution of having real
people with real browsers (and a script to pull data from the site)? Might
have been worthwhile.

~~~
kbenson
It was actually two changes: one that required a much more intensive request
regime because of a public caching (SOLR) system change, and then the much
more aggressive scraping detection. The first caused us to change from
requesting data 78 times an hour (items changed within 2 minutes requested
every minute, 6 minutes of changes every 5 minutes, 11 minutes of changes
every 10 minutes for overlapping coverage) to many thousands of checks an hour
for much less accurate information. In the end, we settled on very targeted
checks, accepting much less accuracy for different classes of items. Having
people actually do the checking just wouldn't be feasible at our size and
resources (very small company, <10 employees), even through Mechanical Turk (I
suspect).

------
heipei
Interesting follow-up (again). It will be very interesting to see where
attempts to detect headless browsers first appear in the wild. Once we know
that, and their prevalence, we can make a judgement call on how much effort to
put into anti-detection techniques. It's an arms race for sure, but once you
know your target you can evaluate whether you even have to put up the effort
to defeat a non-existent adversary.

------
emmelaich
It's a very dangerous thing to do for SEO reasons, too.

I'm sure Google and others run automated user-like crawls to validate what
their official Google indexing bot sees.

If the results between the two differ in certain ways, you may well get your
site buried way down in the search results.

------
urlgrey
Crawlers & scrapers that rely on headless browsers like Chrome often initiate
playback of video on the pages they access.

The company I work for (Mux) has a product that collects user-experience
metrics for video playback in browsers & native apps. It's been a non-trivial
effort developing a system to identify video views from headless browsers so
that we can limit their impact on metrics. Being able to make this
differentiation has a real benefit for human users of our customers' websites.

My preference would be for headless browsers either to not interact with web
video or to be easily identifiable via request headers, though I doubt either
of these things will happen any time soon.

~~~
goerz
Video should never play unless actively initiated by the user. That would fix
the metrics, as a headless browser probably wouldn't initiate video playback.

~~~
bpicolo
If it's a video site, I expect the video to play when I land, e.g. YouTube.
I'm initiating it on purpose by browsing there.

~~~
chriswarbo
The first thing I do when hitting a YouTube URL is stop the video. Then I'll
either run youtube-dl on the URL, or just paste it straight into a proper
video player (VLC).

~~~
bpicolo
Pretty confident you're in the extreme minority on that one

------
Sephr
The author's navigator.webdriver fix is easily detected, though of course that
detection is in turn fixable with further changes to Chrome. This cat and
mouse game probably isn't worth pursuing against dedicated adversaries.

    
    
        if (navigator.webdriver || Object.getOwnPropertyDescriptor(navigator, 'webdriver')) {
            // navigator.webdriver exists or was redefined
        }

~~~
foob
That test actually wouldn't work:

    
    
        > navigator.webdriver
        true
        > Object.getOwnPropertyDescriptor(navigator, 'webdriver')
        undefined
    

As you say though, it's a cat and mouse game and you could always override the
behavior of _getOwnPropertyDescriptor()_ if it were used in a test.
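
For reference, a sketch of such an override (illustrative only; a determined
detector could in turn probe the wrapper itself):

    const realGOPD = Object.getOwnPropertyDescriptor;
    Object.getOwnPropertyDescriptor = function (obj, prop) {
      if (obj === navigator && prop === 'webdriver') {
        return undefined; // mimic a stock browser: no own property here
      }
      return realGOPD(obj, prop);
    };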

------
ed9911
As someone who writes web scrapers for a living, I have only come across one
site where I have been unable to reliably extract the information we need. If
we were more flexible, we would be able to deal with that site too. Defending
yourself from scrapers is an arms race you are almost certain to lose.

~~~
odammit
What site?

I’d guess LinkedIn or Facebook.

I’ve had to make a lot of fake accounts to get just a _decent_ amount of data
from them.

------
aplorbust
It's trivial to randomise HTTP headers, both the content and the _order_.
There are free and commercial databases of user-agent strings available to any
user, the same ones the websites may use.
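
A trivial sketch of randomising the header content in JS (the _order_ needs
control at a lower level than most HTTP libraries expose, and the user-agent
strings below are illustrative samples, not a real database):

    const userAgents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7',
      'Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    ];

    // Pick per-request header values from pools of real-looking ones.
    function randomHeaders() {
      return {
        'User-Agent': userAgents[Math.floor(Math.random() * userAgents.length)],
        'Accept-Language': Math.random() < 0.5 ? 'en-US,en;q=0.9' : 'en-GB,en;q=0.8',
      };
    }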

Users can also modify or delete HTTP headers through local proxies, using the
same proxy software that many high volume websites use. Sites that rely on
redirects to set headers make this even easier.

p0f only works with TCP. Could this be another selling point for alternative
congestion-controlled reliable transports that are not TCP, e.g. CurveCP? I
have prototype "websites" on my local LAN that do not use TCP.

The arguments in favor of controlling access to public information through
"secret hacker ninja shit"
([https://news.ycombinator.com/item?id=16176572](https://news.ycombinator.com/item?id=16176572))
are not winning on the www or in the courts. Consider the recent Oracle ruling
and the pending LinkedIn HiQ case.

If the information is intended to be non-public, then there is no excuse for
not using access controls. Anything from basic HTTP authentication to
requiring client x509 certificates would suffice for making a believable
claim.

Detecting headless Chrome and serving fake information, or any other such
"secret hacker ninja shit" is not going to suffice as a legitimate access
control, whether in practice or in an argument to a reasonable person.

The fact is that in 2017 websites still cannot even tell what "browser" I am
using, let alone what "device" I am using. They still get it wrong every time.
The best they can do is make lousy guesses and block indiscriminately.
Everything that is not what they want/expect is a "bot", a competitor, an evil
villain. Yet they have no idea. Sometimes, assumptions need to be tested.[1]

[1] https://news.ycombinator.com/item?id=16103235 (where a developer thought a
spike in traffic was an "attack")

------
beagle3
EME DRM is part of the game for those who really want to block headless
browsing. It will arrive sooner or later.

~~~
OkGoDoIt
I always assumed DRM would eventually factor into this. I've only ever read
about it in the context of media, but I'm assuming there are ways to use it
creatively for fingerprinting and blocking scraping as well. Do you have any
links with insights into that?

~~~
beagle3
I am not familiar with the specific details and how they would allow this, but
... if the content is meant for human consumption and not bot consumption, it
is enough to render the result into a DRM'd H.264 stream.

------
baybal2
From my experience in the scene:

Bot-mill people are very aware of headless browsers being an effortless way to
mimic a browser, but they're not that efficient.

The amount of RAM and so on that a bot spends to do a single click can truly
hurt their bottom line.

The top-tier collectives I've heard of use their own C/C++ frameworks with
hardcoded requests and challenge solvers, and in-depth knowledge of the anti-
botting and anti-fraud techniques used by the opposing force. If DoubleClick
finds a brand new performance-profiling test and sends it out in the JS code
of one in 1,000 requests, expect those guys to detect it and crack it within
24 hours.

They have no objective of getting through captchas, just keeping their number
of valid clicks in double digits.

------
gildas
The problem is that you can easily detect that some properties have been
overloaded. For example, you can execute
Object.getOwnPropertyDescriptor(navigator, "languages") to detect if
navigator.languages is a native property or not.

~~~
sigotirandolas
It's possible to hide that as well, funnily enough, by also overwriting
Object.getOwnPropertyDescriptor (and similar tricks). As far as I know, it's
theoretically possible to use this trick to completely 'sandbox' some code so
that there's no way it can detect certain functions being overwritten (by
overwriting all functions such as Object.getOwnPropertyDescriptor,
Function.toString, etc. and making them hide the overwritten functions,
including themselves).

For some more information: [http://randomwalker.info/publications/ad-blocking-
framework-...](http://randomwalker.info/publications/ad-blocking-framework-
techniques.pdf)
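
A sketch of the Function.toString half of that trick, heavily simplified
relative to the linked paper (a complete 'sandbox' has to cover many more
probes):

    const wrapped = new WeakMap(); // wrapper function -> original native

    // e.g. wrapNative(Object, 'getOwnPropertyDescriptor', patchedVersion)
    function wrapNative(obj, name, replacement) {
      wrapped.set(replacement, obj[name]);
      obj[name] = replacement;
    }

    const realToString = Function.prototype.toString;
    Function.prototype.toString = function () {
      // Report the original's source for any wrapper we installed, so the
      // override looks like "[native code]". Registering the patched
      // toString itself below makes it hide its own tracks too.
      const original = wrapped.get(this);
      return realToString.call(original || this);
    };
    wrapped.set(Function.prototype.toString, realToString);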

~~~
gildas
Very interesting article! Thank you.

------
wiz21c
Could someone tell me why everybody wants to fight against headless browsers?
If I want to use such a browser to browse your site, a site that you
voluntarily show to the public, then it's my problem, my code, not yours. If
you want to protect your data so much, then maybe you shouldn't put it on the
web in the first place. (Yep, I'm presenting things in black and white, but
you get the picture.)

I would also add this :

[https://www.bitlaw.com/copyright/database.html#Feist](https://www.bitlaw.com/copyright/database.html#Feist)

because it basically says it's hard/pointless to protect data.

------
pwaai
Some people seem to have figured out how to detect scrapers without relying on
fingerprinting the browser, e.g. Crunchbase.

But headless Chrome shouldn't be possible to distinguish from a regular Chrome
browser.

The only vector left for blocking a scraper is some sort of navigational
awareness - behavior that deviates from the normal distribution - plus
awareness of the IP.

But this comes at the great cost of hurting your own real visitors by taxing
them with captchas or other annoyances.

~~~
londons_explore
That's what invisible reCAPTCHA is for.

Only the users who compulsively clear cookies ever get bothered by it, and
even then all they have to do is click a few photos of cars.

------
j_coder
It is "easy" to block scraping. Make it very costly to scrape:

\- Render your page using canvas and WebAssembly compiled from C, C++, or
Rust. Create your own text rendering function.

\- Have multiple page layouts

\- Have multiple compiled versions of your code (change function names,
introduce useless code, different implementations of the same function) so it
is very difficult reverse engineer, fingerprint and patch.

\- Try to prevent debugging by monitoring time interval between function
calls, compare local time interval with server time interval to detect
sandboxes.

\- Always encrypt data from server using different encryption mechanisms every
time.

\- Hide the decryption key into random locations of your code (use generated
multiple versions of the code that gets the key)

\- Create huge objects in memory and consume a lot of CPU (you may mine some
crypto coins) for a brief period of time (10s) on the first visit of the user.
Make very expensive for the scrapers to run the servers. Save an encrypted
cookie to avoid doing it later. Monitor concurrent requests from the same
cookie.

The answer is that it is possible but it will cost you a lot.
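
A bare-bones sketch of the first point in plain browser JS, without the
WebAssembly part (the element ID and price are illustrative): the price exists
only as pixels, never as DOM text.

    const canvas = document.getElementById('price-canvas'); // hypothetical element
    const ctx = canvas.getContext('2d');
    ctx.font = '16px sans-serif';
    // A scraper finds no text node here; it needs OCR (see the replies below).
    ctx.fillText('Price: $' + (19.99).toFixed(2), 10, 20);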

~~~
tlrobinson
All of which is defeated by OCR.

~~~
eastendguy
Good point. OCR-powered web scraping is even available out of the box
nowadays.

[https://a9t9.com/kantu/docs/scraping#ocr](https://a9t9.com/kantu/docs/scraping#ocr)

~~~
j_coder
It is not the OCR that is costly; it is the JavaScript execution needed to
render the page so you can do the OCR. You can even increase the JavaScript
execution cost if the client looks suspicious.

You will also have to automate all the page variations and the traditional
challenges (login, captcha, user-behavior fingerprinting, ...).

In the end, the development time and server cost will kick you out of business
if you are too dependent on the information or you start to lose money every
time you scrape.

------
landryraccoon
If you want to detect if a human is visiting your site, open an ad popup with
a big close button directly over the content.

A human being will always, 100% of the time, immediately close the popup.
Automation won't care.

~~~
xienze
OK, but that is guaranteed to annoy users. Plus, I think you’re
underestimating the intelligence of the people writing scrapers — obviously
they’re going to visit the site manually and see what appears to be a
fingerprinting measure. Then they’ll update the scraper to close that pop up.
There are no effective solutions to this problem.

------
mixedbit
It is impossible to make a headless and a normal browser send 100%
indistinguishable traffic. The timing of the browser's requests is influenced
by rendering, which will always differ between the two.

~~~
seanp2k2
It's not impossible; you could e.g. profile a real client's timing and
introduce delays into the headless version. It's not zero work, but it's very
much not impossible if you're sufficiently motivated.

Especially recently with e.g. [https://hackaday.com/2018/01/06/lowering-
javascript-timer-re...](https://hackaday.com/2018/01/06/lowering-javascript-
timer-resolution-thwarts-meltdown-and-spectre/) , high-precision timers in JS
might not be available for all clients for reasons other than ~"they're
headless and trying to scrape my site".

------
pbalau
Chromium can work directly with Wayland, AFAIK. Write a "fake" Wayland
implementation and Chromium will happily think it's drawing to a real display.

------
megamindbrian2
Cool. I think the new captchas use mouse entropy; that would be an interesting
test, since remote automation usually goes straight to the pixel point.

~~~
robocat
Most touchscreens go straight to the pixel point too.

------
merb
if you want to block scrapers, just add rate limiting...

~~~
jjeaff
That's old hat and ineffective. Scrapers usually proxy through large lists of
rotating IP addresses. There are lots of services for it.

~~~
saas_co_de
It depends. If you also track valid navigation paths, then rate limiting may
be effective.

For instance: if you have a search result with 1000 pages that someone is
trying to scrape, and you don't allow people to jump into the middle of the
result set, then just rotating IPs doesn't work.
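
A toy sketch of that check with Express and express-session (the
renderResults() helper is a hypothetical stand-in): a session can only reach
result page N+1 after fetching page N.

    const express = require('express');
    const session = require('express-session');
    const app = express();
    app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));

    app.get('/results/:page', (req, res) => {
      const page = parseInt(req.params.page, 10);
      const last = req.session.lastPage || 0;
      if (page > last + 1) {
        // Jumping into the middle of the result set: deny or rate limit hard.
        return res.status(403).send('Start from page 1');
      }
      req.session.lastPage = Math.max(last, page);
      res.send(renderResults(page)); // hypothetical renderer
    });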

------
_o_
All those tests are useless, effective only against script kiddies (who are
now like 99.99999% of developers by old standards) and people unable to code
in anything but crappy languages like JS. For people who grew up with the web
and are capable of coding in C/C++, these tests are a joke: I'll just modify
the source code to return what is expected, and it's 'game over'. We were
reversing DRMs by disassembling and patching binaries - in a world of text-
based protocols and scripts, the Idiocracy of today's world makes us
invincible.

~~~
icebraining
What does C/C++ have to do with this, when the point of the article is showing
that they can be defeated using JS?

~~~
_o_
JS runs within a C/C++ JS engine that can be modified to return whatever fake
results you want. You can't prevent that. As always, any lock is cheaper to
defeat than to create.

~~~
icebraining
No, you misunderstood; you can defeat _the detection_ using JS. You don't need
C at all.

