
It is possible to detect and block Chrome headless - avastel
http://antoinevastel.github.io/bot%20detection/2018/01/17/detect-chrome-headless-v2.html
======
eastendguy
The listed techniques not only detect Chrome headless but also all custom
browsers built on CEF (Chromium Embedded Framework)
[https://bitbucket.org/chromiumembedded/cef](https://bitbucket.org/chromiumembedded/cef),
such as Kantu from [https://a9t9.com](https://a9t9.com)

If your goal is to only allow the original Google Chrome browser, that is
fine. Otherwise this might cause false alarms.

~~~
RandomInteger4
Not only that, but it would be simple enough to just create an iframe-based
scraping script or browser extension for use in a normal browser, no?

------
kondro
And it’s possible to pretend not to be Chrome headless too.

[https://intoli.com/blog/making-chrome-headless-undetectable/](https://intoli.com/blog/making-chrome-headless-undetectable/)

~~~
nailer
Hah, this didn't cover navigator.webdriver, and I was about to post that you
can still use that (since I assumed such built-in properties weren't
deletable) but... they _are_ deletable. Cool.

~~~
masklinn
How do you ensure that your deletion code runs in the context of the hosted
page but before that page can run any of its own code?

~~~
dragonwriter
From the original article: you put a proxy in front of Chrome headless and
inject the deletion code into the HTML of each page, so it runs before any JS
loaded by the page.
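
A minimal sketch of the kind of payload such a proxy might prepend as the
very first <script> of each HTML response, so it runs ahead of the page's own
code (the property's exact location varies across Chrome versions, so both
forms are attempted; this is an illustration, not the article's exact code):

    // Injected ahead of all page scripts by the interception proxy.
    // Newer Chrome exposes webdriver as a getter on the Navigator
    // prototype; older builds exposed it as an own property.
    delete Object.getPrototypeOf(navigator).webdriver;
    delete navigator.webdriver;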

~~~
jevinskie
So now the page needs to checksum itself once loaded to detect tampering.

~~~
masklinn
You'd get false positives from e.g. extensions modifying the page.

~~~
TeMPOraL
Which, for 99% of extension-equipped users, will be just an ad blocker, i.e.
something the websites don't want to deal with either.

The arms race goes on.

------
DanielBMarkham
I read these things and I think "So much wasted energy and effort"

In the beginning was the web, and it was good. Content came along. Some was
good, some was cats. Then paid sites with sign-up. Then search engines. Then
ads.

Pretty soon folks thought "I not only own this content, I own how it will be
presented to the end user. If I choose to add in cats, or Flash ads, or
whatnot? They're stuck consuming it. I own everything about the content from
the server to the mind of the person consuming it, the entire pipe."

Many people did not like this idea. Ads were malicious, they installed
malware. The practice of using ads on content caused sites to track users like
lab rats. Armies of people majoring in psychology were hired to try to make
the rats do more of what they wanted them to do.

Ad blockers were born. Then anti-ad-blockers. Then headless browsers. Now
anti-headless browsers.

It's just a huge waste of time and energy. The model is broken, and no amount
of secret hacker ninja shit is going to make it work. You want to know where
we'll end up? We'll end up with multiple VMs, each with a statistically common
setup, each consuming content on the web looking just like a human doing it.
(We'll be able to do that by tracking actual humans as they consume content).
But nobody will be looking at those VMs. Instead, those invisible screens will
be read by image recognition software which will then condense what's on there
and send the results back to whoever wants it.

Content providers will never win at this. Nor should they. Instead, we're just
going to sink billions into a busted-ass business model over the next couple
of decades throwing good money after bad.

</rant>

~~~
ivanhoe
It's more complicated than that; almost all spidering that I do lately for
clients is not about scraping the content in a classical sense, but some form
of "spying" on one's competition, mainly in e-commerce. Tracking inventories,
fine-tuning prices, tracking how they promote certain products online, etc.
Not to mention the really blackhat stuff that many people do, like attacking
the competition with fraudulent ad clicks and similar. Many times "the content
providers" have every right to want to protect themselves.

~~~
jon-wood
I'll add my piece of anecdata to this. I've done a few projects that involved
scraping over my career.

1\. Scraping results from a property listing website, specifically to pull the
properties an agent had listed and put them on their website. The agent didn't
want to pay the fee for API access (they probably ended up paying my employer
more to scrape it, and keep that updated, but hey).

2\. Scraping an e-commerce website of a company my employer were working with
to keep our product catalogues in sync - the partner had their own platform,
but no API for it.

3\. Automating requests to a price comparison website to find out what prices
competitors are offering for particular types of customer.

~~~
jandrese
API keys can be revoked at any time or the API service can be suddenly
terminated. You can't trust them. Scraping is more maintenance, but it is more
reliable.

~~~
emodendroket
But whoever you're scraping can suddenly break your scraping tool too.

~~~
jandrese
Temporarily until you work around what they're doing. API key revocation
leaves you dead in the water until you rewrite your app to do scraping.

~~~
emodendroket
Yeah, but imagine they totally redesign their Web site. The "workaround" is a
complete rewrite.

~~~
RhodesianHunter
> The "workaround" is a complete rewrite.

No it's not, it's just swapping out a few selectors or regular expressions...
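
A minimal sketch of why that tends to be cheap, assuming the scraper keeps
its site-specific selectors in one place (cheerio shown here; the selector
strings are placeholders):

    // All site-specific knowledge lives in this one object; a redesign
    // usually means updating these strings, not rewriting the scraper.
    const cheerio = require('cheerio');

    const SELECTORS = { row: '.product', name: '.title', price: '.price' };

    function parse(html) {
      const $ = cheerio.load(html);
      return $(SELECTORS.row)
        .map((i, el) => ({
          name: $(el).find(SELECTORS.name).text().trim(),
          price: $(el).find(SELECTORS.price).text().trim(),
        }))
        .get();
    }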

~~~
emodendroket
I think that depends on how thorough the redesign is, but surely it's not any
less work than if the API breaks overnight.

------
pbhjpbhj
You probably want the web equivalent of malicious compliance - an
algorithmically generated web-hole or similar. That way the bot author isn't
entirely sure you're on to them; it could be a bot or server error. Like send
the right headers but garbage data that looks like it's compressed but isn't,
or doubly compressed garbage, or trim pages at a different place (before
anything interesting), or slow data transfers, or ...
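
A hedged sketch of the "right headers, garbage body" variant in Node (the
detection heuristic here is a placeholder; a real one would use signals like
those in the article):

    const http = require('http');
    const crypto = require('crypto');

    // Placeholder heuristic; substitute real headless signals here.
    const looksLikeBot = (req) =>
      /HeadlessChrome/.test(req.headers['user-agent'] || '');

    http.createServer((req, res) => {
      if (looksLikeBot(req)) {
        // Claim gzip but send random bytes: the client fails to
        // decompress and can't tell server error from detection.
        res.writeHead(200, {
          'Content-Type': 'text/html',
          'Content-Encoding': 'gzip',
        });
        res.end(crypto.randomBytes(4096));
      } else {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end('<html>real page</html>');
      }
    }).listen(8080);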

~~~
psandersen
Just randomize the content they're trying to scrape imho

~~~
dx034
I saw that done on a page and it works extremely well. Inserting wrong data
based on rules (there it was triggered by rate limiting rather than user
agent) is extremely hard to detect. The scraper never knows whether they're
seeing real or fake data.

On the other hand, this will also feed wrong data to search engines.

~~~
imtringued
I think you could get pretty far just by scrambling the CSS classes and ids.
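
A minimal sketch of that idea, assuming server-side rendering where class
names can be rewritten per session (the names and key are illustrative):

    // Derive a per-session class name from a keyed hash so scrapers'
    // CSS selectors break on every new session; emit a matching
    // per-session stylesheet alongside the HTML.
    const crypto = require('crypto');

    function scramble(name, sessionKey) {
      const digest = crypto
        .createHmac('sha256', sessionKey)
        .update(name)
        .digest('hex');
      return 'c' + digest.slice(0, 8); // e.g. 'price' -> 'c3fa1b2c9'
    }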

------
beager
All web automation and automation prevention is a cat and mouse game where you
never stop the scrapers, you just create more effort for them. It’s like
traditional and digital security in that regard, except that security often
has an element of difficulty in overcoming it (cryptography, thickness of
physical barriers), whereas stopping web scraping is about adding more trivial
things to make the process more complicated.

Eventually, human browsing and headless browsing converge. Nobody wants to
make the human browsing experience bad, so the headless browsing continues.

In my opinion, if you’re running a site that is existentially threatened by
someone else having your content, you need something else for your moat.

~~~
otakucode
Don't worry. Thanks to the W3C and their EME standards, scraping will reach
the level of other sorts of security. I'm surprised I haven't yet seen a
simple framework for serving your page not as a page but as an EME-protected
blob that bears a rendering of the content. We will see just that.

~~~
TeMPOraL
So scrapers will start OCRing screenshots.

This race won't end, and the only result beyond wasted effort is the creation
of ever more ridiculous and user-hostile practices.

~~~
imtringued
I've seen some DRM that can detect if you're running within a virtual machine
and prevent screenshots.

The future will be Raspberry Pi clusters connected to HDMI capture cards.

------
userbinator
This feels a bit like the "VMs aren't quite like real machines" problem --- as
in, it's a cat-and-mouse game that will probably continue indefinitely.

Personally, as someone who regularly uses several different browsers and
experiments with others, I wish the Web was far more browser-neutral.

~~~
gsich
Everything that can't be handled with curl or beautifulsoup is probably not
worth the effort.

~~~
emodendroket
Lots of pages have the content you'd want to scrape injected with JS; headless
Chrome would seem to solve that problem.
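
A minimal sketch of that, using Puppeteer to drive headless Chrome (the URL
and selector are placeholders):

    // Render a JS-heavy page in headless Chrome, then read the
    // hydrated DOM that curl would never see.
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://example.com', { waitUntil: 'networkidle0' });
      const text = await page.$eval('#content', (el) => el.innerText);
      console.log(text);
      await browser.close();
    })();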

~~~
imtringued
Don't most SPAs have some kind of internal API that is easier to work with
than the HTML?
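
A hedged sketch of that shortcut, assuming the SPA exposes a JSON endpoint
you can call directly (the URL and response shape are hypothetical):

    // Node 18+ has fetch built in: skip the DOM entirely and consume
    // the same JSON the SPA's front end consumes.
    async function fetchProducts(page = 1) {
      const res = await fetch(
        `https://example.com/api/v1/products?page=${page}`,
        { headers: { accept: 'application/json' } },
      );
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return (await res.json()).items; // hypothetical field name
    }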

------
devit
The whole point of using a headless browser is to work around web sites that
attempt to block simple "curl"-style scraping (or where you need to execute
JavaScript to scrape).

So making it detectable (intentionally, even, right there in the user agent!)
is really absurd.

Or actually, it makes one wonder about Google's motives.

~~~
williamdclt
That's definitely not the whole point of headless browsers, that's more of a
side-effect. The whole point of headless browsers is rather automation and
testing.

~~~
dx034
Same as torrents being for the distribution of legal content. That was the
original idea, and they're still used for that, but I'd bet the majority of
headless-browser requests crawl websites not owned by the scraper.

------
saas_co_de
so, now I can run a script to fix all of these things so that headless can't
be detected by any of these methods? thanks.

------
lossolo
Is there a way to enable Chrome PDF Viewer/Widevine Content Decryption Module
etc in headless chromium? Is there some switch in chromium code base that
would enable that?

------
pathdongle
To every action there is always opposed an equal reaction...
[https://intoli.com/blog/making-chrome-headless-undetectable/](https://intoli.com/blog/making-chrome-headless-undetectable/)

------
rundigen12
Re. blocking scrapers: Some of us are neither vast corporate espionage
practitioners nor zombie-botnet users: we're on our own, scraping for data
science & other academic research purposes.

Is there some way to declare, "I am a legitimate academic user", something
akin to 'TSA Pre' status?

"Sure, register for & use the site's API," you'll say. What if they don't have
one?

"Sure, just don't slam the server with too many requests in a short time,"
you'll say. But if they're rejecting you just because they detect you're
headless, etc...?

~~~
pc86
> _But if they're rejecting you just because they detect you're headless,
> etc_

Isn't that their right?

If I pay for my outgoing bandwidth (even if I don't) I am under no obligation
to give my content/data/whatever to any third party source, even academic.

~~~
TeMPOraL
> _If I pay for my outgoing bandwidth (even if I don't) I am under no
> obligation to give my content/data/whatever to any third party source, even
> academic._

Aren't you? You put a server on the _publicly routable Internet_. And made it
_talk over HTTP_. At this point I believe you've already chosen to waive your
rights not to serve content.

------
lovelearning
What's the reason for blocking a headless browser?

~~~
scardine
Headless browsers are used to create robots in order to automate the gaming of
web-based value systems - thus diluting the value for legitimate participants.
Examples:

* create fake profiles in order to boost someone's "followers" in a social network where you can monetize your "influencer" status

* click ads from a competitor in a way that would trigger fraud prevention from the ad network, effectively preventing the competitor from advertising there

~~~
_eht
Yeah I think you covered all the bases here. /s

------
callumprentice
For what it's worth, Dullahan, my headless SDK on top of Chromium Embedded
Framework, appears exactly the same as desktop Chrome:

Overview:
[https://bitbucket.org/lindenlab/dullahan/overview](https://bitbucket.org/lindenlab/dullahan/overview)

Examples:
[https://bitbucket.org/lindenlab/dullahan/src/default/example...](https://bitbucket.org/lindenlab/dullahan/src/default/examples/?at=default)

Not suggesting it's better or worse - just an alternative if you need
something that appears to be like a desktop browser.

------
walshemj
I'd be careful using this: Google crawls (well, specifically, it indexes)
using headless Chrome, so you could block Googlebot when you don't want to.

------
j_s
This discussion is also happening on a counterpoint posted about 9 hours
later, also currently on the front page:

It is not possible to detect and block Chrome headless |
[https://news.ycombinator.com/item?id=16179181](https://news.ycombinator.com/item?id=16179181)

------
yoz-y
The original article does not mention blocking it, just the detection.

~~~
bonesss
Good point - for a lot of what unwanted headless scraping would be used for,
I imagine returning subtly changing patterns of semi-useful data is probably
more useful than blocking.

A fully blocked bot will error and get replaced with a working bot. A bot that
subtly errors again, and again, and again will look almost-right and create a
maintenance nightmare...

~~~
dx034
Yep, just feeding wrong data to a headless browser will trick most of them.
If you keep the data realistic (add small random error terms), it could take
a very long time until someone finds out.
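
A minimal sketch of such an error term, applied only to responses already
flagged as bot traffic (the ±3% bound is an arbitrary illustration):

    // Perturb a numeric field so scraped values stay plausible but
    // are quietly wrong; real users never hit this code path.
    function fuzz(value, maxRelError = 0.03) {
      const eps = (Math.random() * 2 - 1) * maxRelError;
      return Number((value * (1 + eps)).toFixed(2));
    }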

~~~
sovok_x
What if some legitimate user gets fed erroneous data because the algorithm
misfires on their system?

~~~
bonesss
If someone is setting specialized properties in their browser to impersonate
automated browsing, or is using automated browsing themselves, the question
is only whether you (as a content provider) accept that their use is
legitimate...

This wouldn't impact day-to-day users barring gross incompetence.

~~~
sovok_x
The code posted on the site fails simply on an incongruity between the JS
behaviour (window.chrome) and the userAgent. I can see this failing for an
ordinary user who sets their userAgent to Chrome on Edge or Firefox for some
compatibility reason, or who just forgot to turn off an old referrer
override. There may be other valid reasons it would fail for a user who is
not a bot that I'm missing. And BLAM, they'll get all wrong data for no
reason... You may call it gross incompetence or whatever, but this method
will get you one angry lost user at a time.
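
For reference, the consistency check being discussed boils down to something
like this (a paraphrase of the article's test, not its exact code):

    // A visitor whose userAgent claims Chrome but which lacks the
    // window.chrome object gets flagged as headless (or a spoofed UA).
    const claimsChrome = /Chrome/.test(navigator.userAgent);
    const suspicious = claimsChrome && typeof window.chrome === 'undefined';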

~~~
bonesss
UserAgent detection is in the "old" group, is specific to having your
userAgent be "HeadlessChrome", and is no longer recommended. The new triggers
are 'navigator.webdriver', a Chrome-extension-specific object, or specific
permissions being set, none of which are relevant to or impacted by any of
the scenarios you are highlighting...

Of course the JS itself can fail due to incongruent browser behaviour... but
why would you trigger a bot obfuscation routine based on a failed JS call?

 _That_ is the gross incompetence I was referring to, and it's hard to call
basic errors and a lack of basic testing anything but that.

Downvotes aside, the kinds of f-ups you're speculating about here are at the
level of knowing how true/false works in JS.

And, no, there really are _not_ valid reasons for users to be adding specific
properties on their navigator objects to flag as headless, or to use specific
extension objects that report the use of headless automation, if they aren't.
There is no valid reason you should set your Edge userAgent to
"HeadlessChrome", either.

That's not an angry lost user, friend, that is an upset unauthorized
third-party content scraper. I work with Open Data, so I don't care, but some
sites for-realsies do.
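
For the permissions trigger mentioned above, the article's probe is roughly
this shape (paraphrased, not its exact code):

    // Headless Chrome reports Notification.permission === 'denied'
    // while the Permissions API still reports 'prompt' for the same
    // origin; that contradiction is the signal.
    navigator.permissions.query({ name: 'notifications' }).then((status) => {
      const headless =
        Notification.permission === 'denied' && status.state === 'prompt';
      // headless === true => likely headless Chrome
    });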

~~~
sovok_x
Providing users with fake data is never a good idea because it can be, and
probably will be, used against you in the long run. Plus, no sane evil
scraper uses the default referrer with no masking, so misfires are
realistically possible within the thin margin needed to detect them.

In any case, users can do whatever they want with their client and expect the
service to work properly. If you detect abuse you should block or captcha
them, but the mere fact that they might be a bot doesn't call for such a
drastic measure. It's the second-worst approach, after serving hindering
scripts to them.

Disclaimer: I haven't downvoted you as I don't downvote things prompting a
discussion.

------
jachee
Worth noting, I believe: the word "block" doesn't appear in the article, and
seems to have been editorialized in the poster's title.

------
nurettin
So headless now knows it is headless. Then what?

~~~
fwdpropaganda
I believe the usefulness of this is that now the _server_ knows the client is
headless. Then blocks it.

~~~
mfontani
The techniques explained in the article seem like they'd be JS running on the
browser itself, so… the "browser itself knows it's headless" pretty much sums
it up.

~~~
baliex
Isn't it more like, "the browser can be coerced into revealing to the server
that it is headless"?

~~~
dx034
But for that you need to send content first? The server can't know before it
has already sent a response (unless you want to redirect each request)
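
A minimal sketch of how the client-side result typically gets back to the
server (the endpoint name is hypothetical):

    // Runs in the page: gather signals, then report them so the server
    // can decide what to serve on subsequent requests.
    const signals = {
      webdriver: navigator.webdriver === true,
      chromeMissing: typeof window.chrome === 'undefined',
    };
    navigator.sendBeacon('/bot-signal', JSON.stringify(signals));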

