
Detecting Chrome headless, the game goes on - jacdp
https://antoinevastel.com/bot%20detection/2019/07/19/detecting-chrome-headless-v3.html
======
jadell
It seems he's doing something with header detection. I used Puppeteer to play
around with the site and various configurations I use when scraping.

In headless Chrome, the "Accept-Language" header is not sent. In Puppeteer,
one can force the header to be sent by doing:

    
    
      page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' })
    
    

However, Puppeteer sends that header as lowercase:

    
    
      accept-language: en-US,en;q=0.9
    
    

So it seems the detection is simply: if there is no 'Accept-Language' header
(case-sensitive), then "Headless Chrome"; else, "Not Headless Chrome".

This is a completely server-side check, which is why he can say the fpcollect
client-side javascript library isn't involved.
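For illustration, here is a minimal Node sketch of what such a case-sensitive
server-side check could look like. This is a guess at an implementation, not
the author's actual code; the key detail is that Node's `req.headers`
lowercases header names, so you have to scan `req.rawHeaders`, which preserves
the case the client actually sent:

```javascript
// Hypothetical sketch of the server-side check described above.
// rawHeaders alternates [name, value, name, value, ...], exactly as
// received on the wire, so the original casing survives.
function looksHeadless(rawHeaders) {
  for (let i = 0; i < rawHeaders.length; i += 2) {
    if (rawHeaders[i] === 'Accept-Language') {
      return false; // exact-case header present: "Not Headless Chrome"
    }
  }
  return true; // header missing or lowercased: "Headless Chrome"
}

console.log(looksHeadless(
  ['User-Agent', 'Mozilla/5.0', 'Accept-Language', 'en-US,en;q=0.9']
)); // false
console.log(looksHeadless(
  ['User-Agent', 'Mozilla/5.0', 'accept-language', 'en-US,en;q=0.9']
)); // true
```

Inside an `http.createServer` handler you would pass `req.rawHeaders` to this
function; the curl examples below exercise exactly these three cases.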

Here are some curl commands that demonstrate:

Detected: not headless

    
    
      curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
      -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36' \
      -H 'Accept-Language: en-US,en;q=0.9'
    

Detected: headless

    
    
      curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
      -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36'
    

Detected: headless

    
    
      curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
      -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36' \
      -H 'accept-language: en-US,en;q=0.9'

~~~
jadell
As a followup, if you have the ability to modify the Chrome/Chromium command
line arguments, using the following option completely fools the detection:

    
    
      --lang=en-US,en;q=0.9
    

You can prove this with the following Puppeteer script:

    
    
      (async () => {
          const puppeteer = require('puppeteer');
          const browserOpts = {
              headless: true,
              args: [
                  '--no-sandbox',
                  '--disable-setuid-sandbox',
                  '--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36',
                  // THIS IS THE KEY BIT!
                  '--lang=en-US,en;q=0.9',
              ],
          };
      
          const browser = await puppeteer.launch(browserOpts);
          const page = await browser.newPage();
          await page.goto('https://arh.antoinevastel.com/bots/areyouheadless');
          await page.screenshot({ path: 'areyouheadless.png' });
          await browser.close();
      })();

------
Pinbenterjamin
I run the division at my company that builds crawlers for websites with public
records. We scrape this information on-demand when a case is requested, and we
handle an enormous volume of different sites (or sources as we call them). We
recently passed 700 total custom scrapers.

Recently, we have seen a spike in sites that detect and block our crawlers
with some sort of Javascript we cannot identify. We use headless Chrome and
Selenium to build out most of our integrations. I'm starting to wonder if the
science of blocking scraping is getting more popular...

I don't think what I'm doing is subversive at all, we're running background
checks on people, and we can reduce business costs by eliminating error-prone
researchers with smart scrapers that run all day.

I don't want to seem like the bad guy here, but what if I wanted to do the
opposite of this research? Where do I start? Study the chromium source? Can
anyone recommend a few papers?

~~~
floatingatoll
Reducing your business costs by scraping a public access website is often
considered an _alternative_ to paying the business costs of the website
operator.

Are you saving money at the expense of the site operator by scraping their
site for public records, or are you saving money as well as the site operator?

If you're costing them money to reduce your own bottom line without their
express written consent, that makes you "the bad guy". Offsetting costs onto
an unwitting, non-consenting third party is an unethical approach to doing
business.

I interpret your request as a similar problem to "help me with my homework
problem". I could dig up papers and studies, but at the end of the day, you
need to go do your homework. Reach out to each municipality and figure out a
business arrangement with them that satisfies your needs. It's possible they
do not wish you to perform this activity, in which case you will either need
to violate their intent for your own profit using scraping or accede to their
wishes and stop scraping their municipality. That's your homework as a for-
profit business.

~~~
satyrnein
Imagine if search engines had to "reach out to each [site owner] and figure
out a business arrangement with them." The world decided that opt out via
robots.txt was a better approach.

If the municipality wants to get the information out, this could be a win-win,
just like search engines were. Do check for robots.txt, though!

~~~
floatingatoll
We found at one job that approximately one quarter of well-known search
engines blatantly use robots.txt noindex declarations as a list of URLs to
index, and one openly mocked us for asking them to stop.

Voluntary honor systems don’t work, because there’s no way to compel non-
compliers to stop other than standard “anti-attacker arms race” approaches,
such as the obstacle described at the head of this thread.

~~~
jdc
It sounds like scraping is a big problem for you guys. What kind of outfit is
it, if you don't mind me asking?

~~~
floatingatoll
Drop me an email and I’m happy to describe further.

------
lol768
There are additional tests included in
[https://arh.antoinevastel.com/javascripts/fpCollect.min.js](https://arh.antoinevastel.com/javascripts/fpCollect.min.js)
that do not exist in the GitHub repository over at
[https://github.com/antoinevastel/fp-collect](https://github.com/antoinevastel/fp-collect).

    
    
      redPill: function() {
          for (var e = performance.now(), n = 0, t = 0, r = [], o = performance.now(); o - e < 50; o = performance.now()) r.push(Math.floor(1e6 * Math.random())), r.pop(), n++;
          e = performance.now();
          for (var a = performance.now(); a - e < 50; a = performance.now()) localStorage.setItem("0", "constant string"), localStorage.removeItem("0"), t++;
          return 1e3 * Math.round(t / n)
        },
        redPill2: function() {
          function e(n, t) {
            return n < 1e-8 ? t : n < t ? e(t - Math.floor(t / n) * n, n) : n == t ? n : e(t, n)
          }
          for (var n = performance.now() / 1e3, t = performance.now() / 1e3 - n, r = 0; r < 10; r++) t = e(t, performance.now() / 1e3 - n);
          return Math.round(1 / t)
        },
        redPill3: function() {
          var e = void 0;
          try {
            for (var n = "", t = [Math.abs, Math.acos, Math.asin, Math.atanh, Math.cbrt, Math.exp, Math.random, Math.round, Math.sqrt, isFinite, isNaN, parseFloat, parseInt, JSON.parse], r = 0; r < t.length; r++) {
              var o = [],
                a = 0,
                i = performance.now(),
                c = 0,
                u = 0;
              if (void 0 !== t[r]) {
                for (c = 0; c < 1e3 && a < .6; c++) {
                  for (var d = performance.now(), s = 0; s < 4e3; s++) t[r](3.14);
                  var m = performance.now();
                  o.push(Math.round(1e3 * (m - d))), a = m - i
                }
                var l = o.sort();
                u = l[Math.floor(l.length / 2)] / 5
              }
              n = n + u + ","
            }
            e = n
          } catch (t) {
            e = "error"
          }
          return e
        }
      };

~~~
jadell
It doesn't seem to be using the Javascript. Looking at the page source, it has
already made the determination before the Javascript runs.

If I load the page source in Chrome, it already includes the "You are not
Chrome headless" message, but when I run it in a scraper I maintain, the page
source loads with the "You are Chrome headless" message, even without running
any Javascript.

------
ggreer
[https://arh.antoinevastel.com/javascripts/fpCollect.min.js](https://arh.antoinevastel.com/javascripts/fpCollect.min.js)
contains some functions called redPill that aren't in the normal fpCollect
library. redPill3 measures the time of some JS functions and sends that data
to the backend. Here's a chart of redPill3's timing data on my computer:
[https://i.imgur.com/c8iuV6I.png](https://i.imgur.com/c8iuV6I.png)

Those are averages of multiple runs on a Core i7-8550U running Chromium
75.0.3770.90 on Ubuntu 19.04.

isNaN and isFinite are much slower in headless mode, but other functions like
parseFloat and parseInt aren't. My guess is that the backend is comparing the
relative times that certain functions take. If isNaN and isFinite take the
same time as parseFloat, then you're not in headless mode. If those functions
take 6x longer than parseFloat, you're in headless mode.

I don't know if this holds true for non-x86 architectures or other platforms.
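If that guess is right, the backend-side comparison could look something like
this sketch. The threshold and iteration count are invented for illustration;
they are not values taken from the actual site:

```javascript
// Hypothetical sketch of the timing-ratio heuristic guessed at above.
// Times a function by calling it in a tight loop with a constant input.
function timeFn(fn, iterations) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn(3.14);
  return performance.now() - start;
}

// Ratio of isNaN's runtime to parseFloat's runtime.
function timingRatio(iterations = 400000) {
  const baseline = timeFn(parseFloat, iterations);
  const suspect = timeFn(isNaN, iterations);
  return suspect / baseline;
}

// Per the chart, a ratio near 1 would suggest a normal browser, while a
// ratio around 6 would suggest headless Chromium. The 3x cutoff is a guess.
const headlessSuspected = timingRatio() > 3;
```

Run inside the browser under test (as redPill3 does), this needs no absolute
timings, only relative ones, which makes it robust to fast vs. slow hardware.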

~~~
TeMPOraL
Huh. Your chart reminded me of an experiment I did 10 years ago to test
whether you could distinguish an image request triggered by an <img> tag from
one triggered by a user clicking a link (or entering its URL in the address
bar). I created a test page, asked people on the Internet to visit it, and
then analyzed PHP & server logs.

Unexpectedly, it turned out that the Accept header was perfect for this. The
final chart was this:

[https://i.imgur.com/ZA8qD8t.png](https://i.imgur.com/ZA8qD8t.png)

("link" means clicking on a URL or entering it manually; "embedded" means
<img> tag)

Makes me wonder whether Accept header is still useful for fingerprinting in
general, and distinguishing between headless and headful(?) browsers in
particular.
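The classification itself is simple to sketch. These header patterns are
illustrative only; real Accept values vary by browser and version, which is
exactly what makes the header useful for fingerprinting:

```javascript
// Hypothetical classifier for the Accept-header trick described above:
// a top-level navigation advertises text/html, while an <img> fetch
// typically advertises image/* types instead.
function requestKind(acceptHeader) {
  if (/\btext\/html\b/.test(acceptHeader)) return 'link';    // navigation
  if (/\bimage\//.test(acceptHeader)) return 'embedded';     // <img> fetch
  return 'unknown';
}

console.log(requestKind('text/html,application/xhtml+xml,*/*;q=0.8')); // 'link'
console.log(requestKind('image/avif,image/webp,*/*;q=0.8'));           // 'embedded'
```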

------
itake
This would be more interesting if the author explained this technique.

People who are knowledgeable enough will dive deep into the webpage, but
everyone else should expect disappointment.

~~~
mcescalante
I'm not sure if diving deep into the page will yield results of how it's done.
The page's javascript does a POST to a backend with the browser's fingerprint,
and the server does all the "magic" where we can't see it. Unless there is new
fingerprint info that is being sent to the server that wasn't around before,
I'm skeptical about the javascript in the page revealing the full technique.

~~~
jadell
The "You are/are not" message seems to be included in the page source before
any Javascript runs. Is it possible there are detectable differences in the
original HTTP request itself?

~~~
pbhjpbhj
My guess is he's looking at XSS mitigations or similar that aren't in
headless?

If it were doing something like exploiting CSS being non-blocking (I don't
know that it is), that's a server-side detection... but that would seem to
work even against spoofing.

But he says if you spoof a non-Chromium browser (Safari) he can't tell. So
he's looking first at the UA?? That's weird.

------
ryandrake
Out of all the zero-sum tech arms races (increasingly complex DRM, spam
senders vs. blockers, software crackers vs. copy protection, code obfuscation)
this one seems to me to be the stupidest. Here we have people putting data out
in public for free, for anyone to access, and then agonizing over _how_
someone accesses it. If some data is your company's secret sauce, your
competitive advantage, don't put it out on the Internet. If your data is not
your competitive advantage, then why bother wasting all this development
effort stopping browsers from browsing it? So much waste on both sides.

~~~
jadell
I agonize about this every day, since a large part of my job is aggregating
data from many sites that seem hell-bent on not letting anyone access it
without going one-form-at-a-time through their crap UI.

The thing is, we would gladly _pay_ these companies for an API or even just a
periodic data-dump of what we need. We've even offered to some of them to
write and maintain the API for them. They're not interested, for various
industry-specific reasons.

I often wonder how much developer time and money are wasted in total between
them blocking and devs working around their blocks.

~~~
lyxsus1
Sometimes when I'm thinking about it and what 95% of developers are working
on, it feels like a planet-wide charity project against unemployment.

~~~
cameronbrown
I think it's a fairly well-known thing that 'junk jobs' tend to spring up in
response to supply. It's a bizarre cultural thing.

------
foob
I'm the other half of the cat and mouse game that Antoine is referring to, and
I just wrote another rebuttal that people here might find interesting [1]. It
goes into a little more detail about what his test site is actually doing, and
also walks through the process of writing a Puppeteer script to bypass the
tests.

[1] [https://www.tenantbase.com/tech/blog/cat-and-mouse/](https://www.tenantbase.com/tech/blog/cat-and-mouse/)

------
eastendguy
All this can be avoided (from a scraper's perspective) by using the Selenium
IDE++ project. It adds a command-line interface to Chrome and Firefox for
running scripts. See [https://ui.vision/docs#cmd](https://ui.vision/docs#cmd) and
[https://ui.vision/docs/selenium-ide/web-scraping](https://ui.vision/docs/selenium-ide/web-scraping)

=> Using Chrome directly is slower, but _undetectable_.

~~~
rivercam
I have been using the UI Vision extension for a few months now. It is not
very fast, but it always works. It can extract text and data from images and
canvas elements, too.

------
fjp
I work in telecom and we interface with large carriers like AT&T, Verizon,
etc. We use headless browsers to automate processes using their 15-year-old
admin portals, since the carriers simply refuse to provide an API, or one that
works acceptably.

Thankfully they're also so technologically slow that they never change the
websites or do any kind of headless detection. It works, and allows us to
offer automated [process] to our customers, but it seems so fragile. Just give
us a damn API.

------
nprateem
Well the user agent of chrome headless contains 'HeadlessChrome' according to
this site [1]. Sure enough when I spoof my user agent to the first in the list
it magically determines I'm using headless Chrome.

He basically says he's inspecting user agents:

> Under the hood, I only verify if browsers pretending to be Chromium-based
> are who they pretend to be. Thus, if your Chrome headless pretends to be
> Safari, I won’t catch it with my technique.

Maybe I should apply for a PhD too.

[1] [https://user-agents.net/browsers/headless-chrome](https://user-agents.net/browsers/headless-chrome)
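That substring check is trivial to sketch (the sample UA strings below are
illustrative, shortened from the real ones):

```javascript
// Trivial sketch of the User-Agent check being described: default
// headless Chrome identifies itself as "HeadlessChrome" in its UA.
function uaClaimsHeadless(userAgent) {
  return userAgent.includes('HeadlessChrome');
}

console.log(uaClaimsHeadless(
  'Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/76.0.3803.0 Safari/537.36'
)); // true
console.log(uaClaimsHeadless(
  'Mozilla/5.0 (X11; Linux x86_64) Chrome/76.0.3803.0 Safari/537.36'
)); // false
```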

~~~
jsnell
If a browser claims to be Headless Chrome, you believe it. Nobody has a reason
to lie about that. The interesting question is the opposite case: is somebody
claiming to be a normal Chrome, but is actually Headless Chrome (or an
automated member of some other browser family, or not a browser at all but
e.g. a Python script).

So if you take a Headless Chrome instance but change the User-Agent to match
that of a normal Chrome, does the detector think it's not headless?

~~~
jadell
The detector still thinks it's headless even if you spoof the user-agent.

------
born2discover
The author seems to be making use of: fpscanner[1] and fp-collect[2] libraries
for achieving his task though he doesn't seem to explain how exactly the
detection is done.

[1]:
[https://github.com/antoinevastel/fpscanner](https://github.com/antoinevastel/fpscanner)

[2]: [https://github.com/antoinevastel/fp-collect](https://github.com/antoinevastel/fp-collect)

~~~
mcescalante
On his actual "test" page, he claims he is not using fpscanner:
[https://arh.antoinevastel.com/bots/areyouheadless](https://arh.antoinevastel.com/bots/areyouheadless)

> It does not use any of the detection techniques presented in these blog
> posts (post 1, post 2) or in the Fp-Scanner library

------
AznHisoka
I think there might be a market for "human crawlers". Just like people use
Mechanical Turk to get humans to beat CAPTCHAs, you could use it to get humans
to visit a web page for you, and return its HTML source. There are of course
residential proxy services (e.g. HolaVPN), but they can still technically be
detected.

~~~
TomMarius
Why would you do that when you can automate it?

~~~
AznHisoka
Because of the issues the article described: detection of headless
crawlers/bots/etc

~~~
driverdan
You can automate a regular browser. It doesn't have to be headless.

~~~
AznHisoka
Unfortunately, there are some sites that can even detect regular automated
browser sessions.

------
bsmith0
This straight up crashes my scraper's browser, which uses Puppeteer, the
puppeteer-extra stealth plugin, etc.

------
sieabahlpark
Wouldn't these render the Brave browser unusable on some sites?

~~~
rubbingalcohol
Why would it do that? Brave is just a Chromium fork afaik.

------
CaliforniaKarl
I’ve participated in a number of Stanford research studies, and what the
author is doing here is similar to part of it.

The studies in which I’ve participated always start with a statement of what
they are generally looking for in a participant. You then take a survey that
confirms whether you are qualified. You are then given a release to sign (and
keep a copy of), which states what you'll be doing and provides an IRB
contact. You
then go through the study.

At the end of your participation, you are asked “What do you think the study
is about?", and then you are told the real purpose of the study. Eventually
the paper(s) is/are published, with hypothesis, methodology, and results.

This seems similar: You decide if you want to participate, and are
participating; the only thing that’s missing is the final paper.

