Hacker News
It is possible to detect and block Chrome headless (antoinevastel.github.io)
153 points by avastel 12 months ago | 106 comments

The listed techniques detect not only Chrome headless but also all custom browsers built on CEF (Chromium Embedded Framework) https://bitbucket.org/chromiumembedded/cef, such as Kantu from https://a9t9.com

If your goal is to allow only the original Google Chrome browser, that is fine. Otherwise this might cause false alarms.

Not only that, but it would be simple enough to just create an iframe based scraping script or browser extension for use in a normal browser, no?

And it’s possible to pretend not to be Chrome headless too.


Hah, this didn't cover navigator.webdriver, and I was about to post that you can still use that (since I assumed those properties weren't deletable) but... they are deletable. Cool.

How do you ensure that your deletion code runs in the context of the hosted page but before that page can run any of its own code?

From the original article: you put a proxy in front of Chrome headless and inject the deletion code into the HTML of each page, before any JS loaded by the page runs.
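A minimal sketch of the deletion idea (assuming the proxy-injection setup described above; in a real page the property lives on `navigator`, and the exact object it is defined on varies by Chrome version). Shown here against a mock object so it runs outside a browser:

```javascript
// Mock navigator: headless Chrome exposes a configurable `webdriver`
// property, which is what makes the delete trick possible.
const navigatorMock = {};
Object.defineProperty(navigatorMock, "webdriver", {
  value: true,
  configurable: true, // if this were false, `delete` would be a no-op
});

console.log("before:", navigatorMock.webdriver); // true

// The injected snippet boils down to this one statement:
delete navigatorMock.webdriver;

console.log("after:", "webdriver" in navigatorMock); // false
```

The proxy approach matters because the snippet must run before any of the page's own scripts get a chance to read the property.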

So now the page needs to checksum itself once loaded to detect tampering.

You'd get false positives from e.g. extensions modifying the page.

Which, for 99% of extension-equipped users, will be just an ad blocker, i.e. something the websites don't want to deal with either.

The arms race goes on.

Or you could just rewrite the JS that does the checksumming to return true.

I read these things and I think "So much wasted energy and effort"

In the beginning was the web, and it was good. Content came along. Some was good, some was cats. Then paid sites with sign-up. Then search engines. Then ads.

Pretty soon folks thought "I not only own this content, I own how it will be presented to the end user. If I choose to add in cats, or Flash ads, or whatnot? They're stuck consuming it. I own everything about the content from the server to the mind of the person consuming it, the entire pipe."

Many people did not like this idea. Ads were malicious, they installed malware. The practice of using ads on content caused sites to track users like lab rats. Armies of people majoring in psychology were hired to try to make the rats do more of what we wanted them to do.

Ad blockers were born. Then anti-ad-blockers. Then headless browsers. Now anti-headless browsers.

It's just a huge waste of time and energy. The model is broken, and no amount of secret hacker ninja shit is going to make it work. You want to know where we'll end up? We'll end up with multiple VMs, each with a statistically common setup, each consuming content on the web looking just like a human doing it. (We'll be able to do that by tracking actual humans as they consume content). But nobody will be looking at those VMs. Instead, those invisible screens will be read by image recognition software which will then condense what's on there and send the results back to whoever wants it.

Content providers will never win at this. Nor should they. Instead, we're just going to sink billions into a busted-ass business model over the next couple of decades throwing good money after bad.


It's more complicated than that, almost all spidering that I do lately for clients is not about scraping the content in a classical sense, but some form of "spying" of one's competition, mainly in e-commerce. Tracking inventories, fine-tuning prices, tracking how they promote certain products online, etc. Not to mention really blackhat stuff that many people do like attacking competition with fraudulent ad clicks and similar. Many times "the content providers" have every right to wish to protect themselves.

That has been done for decades. Most manufacturers employ scores of people going into shops to check how products are presented. Supermarkets employ people to check traffic and sales at competing stores. Doing this online can be easier (if you know how to do it) but it's hardly new.

In the offline world companies also try to protect themselves (you can get banned from stores when they find out that you work for the competition) but it's harder to do.

I'll add my piece of anecdata to this. I've done a few projects that involved scraping over my career.

1. Scraping results from a property listing website, specifically to pull the properties an agent had listed and put them on their website. The agent didn't want to pay the fee for API access (they probably ended up paying my employer more to scrape it, and keep that updated, but hey).

2. Scraping an e-commerce website of a company my employer were working with to keep our product catalogues in sync - the partner had their own platform, but no API for it.

3. Automating requests to a price comparison website to find out what prices competitors are offering for particular types of customer.

API keys can be revoked at any time or the API service can be suddenly terminated. You can't trust them. Scraping is more maintenance, but it is more reliable.

Indeed. A web site is an API, a crappy one, yes, but also one without unnecessary auth requirements and one they can't remove without removing their business.

Then they publish a new site format update with no notice. Doesn't sound very reliable to me.

Sure, but how often do websites change their layout? Once every five years at most. And it will take you no more than an hour to rewrite the signature. Ain't that hard. I've been doing that for a living and it's manageable, even if you monitor thousands of sites.

But whoever you're scraping can suddenly break your scraping tool too.

Temporarily until you work around what they're doing. API key revocation leaves you dead in the water until you rewrite your app to do scraping.

Yeah, but imagine they totally redesign their Web site. The "workaround" is a complete rewrite.

Ultimately it has to show the same data unless the site is completely discontinuing the service (in which case you are up a creek in either scenario). Even in the face of a total redesign you're usually just adjusting a few signatures and maybe tweaking the traversal tree.

You can run into problems when they remove things like chronological listing view in favor of some "algorithm" that you can't control, but even that just adds a little work on the back end to detect duplicates and an asterisk that some of the content will be missed.

There can also be an arms race where the site developers will start trying all kinds of crazy techniques to make scraping harder. Ironically this often happens after they revoke API access, meaning users don't have an alternative except to participate in the arms race.

I'm not anti-scraping or anything but I'd pretty much never choose it over an API for stability.

Having once worked with the "machine-readable" formats the realty industry produces, I'd rather scrape.

It's probably simple enough to implement both methods if available.

> The "workaround" is a complete rewrite.

No it's not, it's just swapping out a few selectors or regular expressions...

I think that depends on how thorough the redesign is, but surely it's not any less work than if the API breaks overnight.

Simpler than regular expressions, to be honest; CSS or XPath queries are rather simple compared to regular expressions.

Reorganizing the scraping queries is a lot easier than redesigning the website's layout.
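The "just swap a few selectors" workflow can be sketched like this: keep each site's "signature" (its selectors or patterns) in one small object, so a redesign means updating that object rather than the scraper logic. The site name, field names, and patterns here are hypothetical, and regex-on-HTML is used only to keep the sketch dependency-free:

```javascript
// Per-site signatures: the only part that needs editing after a redesign.
const signatures = {
  "example.com": {
    price: /<span class="price">([^<]+)<\/span>/,
    title: /<h1 class="product-title">([^<]+)<\/h1>/,
  },
};

// Generic extraction loop: unchanged across redesigns.
function extract(site, html) {
  const sig = signatures[site];
  const result = {};
  for (const [field, pattern] of Object.entries(sig)) {
    const m = html.match(pattern);
    result[field] = m ? m[1] : null;
  }
  return result;
}

const sample =
  '<h1 class="product-title">Widget</h1><span class="price">$9.99</span>';
console.log(extract("example.com", sample)); // { price: '$9.99', title: 'Widget' }
```

A real scraper would use a proper HTML parser with CSS/XPath queries in place of the regexes, but the signature/engine split is the same.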

> "spying" of one's competition, mainly in e-commerce. Tracking inventories, fine-tuning prices, tracking how they promote certain products online, etc.

That’s not spying, that’s a completely legitimate and normal use case of the web. Everything on the web is public, and the web was also created to allow that.

That’s why the word “spying” is in quotes.

The irony, I think, is that the standards that the current web is descended from (GML/SGML/HTML) were created explicitly for the purposes of making documents easier to read and understand by computers. Here now we are trying to make it as difficult as possible to read those same documents by computers.

> Here now we are trying to make it as difficult as possible to read those same documents by computers.

They're as easy as ever to read, as that's necessary for the documents to be displayed to the end-user.

The whole article and discussion is about blocking the ability for computers to read and understand the content in these documents?

Let's not forget the part where people write shitty scripts that bombard your server with 10x the quantity of traffic you see from actual users.

There's shittiness from both producers and consumers.

Agreed. It's a terrible waste of time, money, and energy from tens of thousands of people. All for nothing.

I think you might be misunderstanding. Someone's scripted use of a Web site can burn more resources, he's saying, but it's not "for nothing" -- they're deriving some benefit from the scripting.

They wouldn't have to do that if the website wasn't - purposefully or through crappy engineering - making the data unnecessarily difficult to get and process by machines.

Producers should make a rate-limited API - and even charge for it

Or rate-limit the website itself - if you're dealing with scrapers.

Without any irony, at my workplace, we've spun up a Docker swarm with headless browsers and a simple RESTful API that takes screenshots of websites and serves them back to the employee's browser, so that our employees can surf the web without even the possibility of exposing themselves to malicious software garbage. It's not required, but it's an available corporate service. Another guy built a version that returns PDFs with working links so they can even click links.

It's completely fucking stupid that we have to come up with this nonsense.

Why not use VNC?

Yes, but from experience in the 1980s, the punters really did not like being nickel-and-dimed on pre-internet services like Prestel, Tymnet/Telecom Gold, and so on.

This isn't really about ad-blocking

tl;dr: fucking JavaScript.

Username checks out.

You probably want the web equivalent of malicious compliance - an algorithmically generated web-hole or similar. That way the bot author isn't entirely sure you're on to them; it could be a bot or server error. Like send the right headers but garbage data that looks like it's compressed but isn't, or doubly compressed garbage, or trim pages at a different place (before anything interesting), or slow data transfers, or ...

The main thing headless chrome saves is having to spawn and manage xvfb per instance.

If more sites adopt serving fake data to headless chrome, people will just return to the old xvfb workflow for those sites, and use headless for everything else.

Could even use a small set of xvfb scrapers to verify the results, and automate detection of false data.

Just randomize the content they're trying to scrape imho

I saw that done on a page and it works extremely well. Inserting wrong data based on rules (there it was triggered by rate limiting rather than user agent) is extremely hard to detect. The scraper never knows if they're seeing real or wrong data.

On the other hand, this will also get wrong data in search engines.

I think you could get pretty far just by scrambling the CSS classes and ids.

All web automation and automation prevention is a cat and mouse game where you never stop the scrapers, you just create more effort for them. It’s like traditional and digital security in that regard, except that security often has an element of difficulty in overcoming it (cryptography, thickness of physical barriers), whereas stopping web scraping is about adding more trivial things to make the process more complicated.

Eventually, human browsing and headless browsing converge. Nobody wants to make the human browsing experience bad, so the headless browsing continues.

In my opinion, if you’re running a site that is existentially threatened by someone else having your content, you need something else for your moat.

Don't worry. Thanks to the W3C and their EME standards, scraping will reach the level of other sorts of security. I'm surprised I haven't yet seen a simple framework for serving your page not as a page but as an EME-protected blob that bears a rendering of the content. We will see just that.

So scrapers will start OCRing screenshots.

This race won't end, and the only result beyond wasted effort is the creation of ever more ridiculous and user-hostile practices.

I've seen some DRM that can detect if you're running within a virtual machine and prevent screenshots.

The future will be raspberry pi clusters connected to HDMI capture cards.

Oh, this is going to be wonderful for accessibility :(

This feels a bit like the "VMs aren't quite like real machines" problem --- as in, it's a cat-and-mouse game that will probably continue indefinitely.

Personally, as someone who regularly uses several different browsers and experiments with others, I wish the Web was far more browser-neutral.

Everything that can't be handled with curl or beautifulsoup is probably not worth the effort.

Lots of pages have the content you'd want to scrape injected with JS; headless Chrome would seem to solve that problem.

Don't most SPAs have some kind of internal API that is easier to work with than the HTML?

Anything that can't be loaded over 2400 baud is probably not worth the effort.

Somewhat correct, although an extreme example. One should always try to design websites with low bandwidth users in mind.

The whole point of using a headless browser is to work around web sites that attempt to block simple "curl" style scraping (or where you need to execute JavaScript to scrape).

So making it detectable (intentionally, even, right there in the user agent!) is really absurd.

Or actually, it makes one wonder about Google's motives.

That's one use-case for Headless browsers. Most people actually use Headless browsers to test their website, i.e. for functionality / performance / rendering.

That's definitely not the whole point of headless browsers, that's more of a side-effect. The whole point of headless browsers is rather automation and testing.

Same as torrents are for the distribution of legal content. That was the original thought and it's still used for that but I'd bet the majority of headless browser requests crawl websites not owned by the scraper.

Making the web harder to crawl would make it harder to create a Google competitor. I doubt that that's their intention, though.

So, now I can run a script to fix all of these things so that headless can't be detected by any of these methods? Thanks.

Is there a way to enable Chrome PDF Viewer/Widevine Content Decryption Module etc in headless chromium? Is there some switch in chromium code base that would enable that?

To every action there is always opposed an equal reaction... https://intoli.com/blog/making-chrome-headless-undetectable/

Re. blocking scrapers: Some of us are neither vast corporate espionage practitioners nor zombie-botnet users: we're on our own, scraping for data science and other academic research purposes.

Is there some way to declare, "I am a legitimate academic user", something akin to 'TSA Pre' status?

"Sure, register for & use the site's API," you'll say. What if they don't have one?

"Sure, just don't slam the server with too many requests in a short time," you'll say. But if they're rejecting you just because they detect you're headless, etc...?

> But if they're rejecting you just because they detect you're headless, etc

Isn't that their right?

If I pay for my outgoing bandwidth (even if I don't) I am under no obligation to give my content/data/whatever to any third party source, even academic.

> If I pay for my outgoing bandwidth (even if I don't) I am under no obligation to give my content/data/whatever to any third party source, even academic.

Aren't you? You put a server on the publicly routable Internet. And made it talk over HTTP. At this point I believe you've already chosen to waive your rights not to serve content.

Isn't that the same argument regarding ripping music CDs? If I pay for the musicians, manufacturing and distribution costs to put a CD in stores, etc.

Although, I think you're framing it wrong, you're not obligated to give the content, someone is just choosing to consume it in a way you hadn't intended.

You're free to stop providing the content at all.

But as long as you're providing it publicly then it makes no sense that you'd be able to dictate how it's consumed.

What's the reason for blocking a headless browser?

Headless browsers are used to create robots in order to automate the gaming of web-based value systems - thus diluting the value for legitimate participants. Examples:

* create fake profiles in order to boost someone's "followers" in a social network where you can monetize your "influencer" status

* click ads from a competitor in a way that would trigger fraud prevention from the ad network, effectively preventing the competitor from advertising there

Yeah I think you covered all the bases here. /s

If there's malicious code on the page you could use this to block headless browsers (which might be security scanners) from trying to load / run the malicious code, such as CoinHive.

Block scrapers that use a wrapped browser to work around client detection

Bots (scraping, etc) passing as legitimate users.

Rather than blocking a bot, it would make much more sense to CAPTCHA an IP that is producing a lot of traffic in a short time. Scraping has always been part of the web, and one should not believe that the information on a website is only going to be available on said website.

This approach only stops the most basic and laziest scrapers. Some people have tens of thousands of diverse IP addresses to utilize for scraping. Many of them will not give a shit about your bandwidth or server constraints and will cause your server to hit bottlenecks, making it slow and useless for everyone.

I guess the best approach would be to captcha everything until we've captcha'd ourselves back into dial-up times for content delivery. /s

Have you used Tor lately?

> it would make much more sense to CAPTCHA an ip that is producing a lot of traffic in a short time.

CAPTCHAs are useful, but they're an X/Y problem in the same way that this headless-detection is: trying to detect human vs bot, when the real solution would be to slow down (a portion of) the traffic.

Hashcash would seem like a better solution, since that doesn't lock anybody out (human or bot), it just slows them down to reduce server load. If some clients are higher priority than others (e.g. human users vs poorly-programmed bots) then use info like IP, cookies, etc. to slow down the low priority requests, or even adjust the difficulty depending on how likely the client is to be causing load.

For what it's worth, Dullahan, my headless SDK on top of Chromium Embedded Framework, appears exactly the same as desktop Chrome:

Overview: https://bitbucket.org/lindenlab/dullahan/overview

Examples: https://bitbucket.org/lindenlab/dullahan/src/default/example...

Not suggesting it's better or worse - just an alternative if you need something that appears to be like a desktop browser.

I'd be careful using this, as Google crawls (well, specifically, indexes) using headless Chrome; you could block Googlebot when you don't want to.

This discussion is also happening on a counterpoint posted about 9 hours later, also currently on the front page:

It is not possible to detect and block Chrome headless | https://news.ycombinator.com/item?id=16179181

The original article does not mention blocking it, just the detection.

Good point - for a lot of what unwanted headless scraping would be used for I imagine returning subtly changing patterns of semi-useful data is probably more useful than blocking.

A fully blocked bot will error and get replaced with a working bot. A bot that subtly errors again, and again, and again will look almost-right and create a maintenance nightmare...

Yep, just feeding wrong data to a headless browser will trick most. If you keep data realistic (add small random error terms), it could take very long until someone finds out.

What if some legitimate user is fed erroneous data by an algorithm misfire on their system?

If someone is setting specialized properties in their browser to impersonate automated browsing, or using automated browsing themselves, the question is only if you (as a content provider), accept that their use is legitimate...

This wouldn't impact day-to-day users barring gross incompetence.

The code posted on the site fails simply on an incongruity between the JS behaviour (window.chrome) and the userAgent. I can see how this can fail for a regular user setting their userAgent to Chrome on Edge or Firefox for some compatibility reason, or just forgetting to turn off an old user-agent override. There can be other valid reasons it will fail when the user is not a bot that I'm missing. And BLAM, they'll get all wrong data for no reason... You may call it gross incompetence or whatever, but this method will get you one angry lost user at a time.

UserAgent detection is in the "old" group, is specific to having your userAgent be "HeadlessChrome", and is no longer recommended. The new triggers are 'navigator.webdriver', a Chrome-extension-specific object, or specific permissions being set, none of which are relevant to or impacted by any of the scenarios you are highlighting...

Of course the JS itself can fail due to incongruent browser behaviour... but why would you trigger a bot obfuscation routine based on a failed JS call?

That is the gross incompetence I was referring to, and it's hard to call basic errors and a lack of basic testing anything but that.

Downvotes aside, the kinds of f-ups you're speculating about here are at the level of knowing how true/false works in JS.

And, no, there really are not valid reasons for users to be adding specific properties on their navigation objects to flag for headless, or use specific extension objects that report the use of headless automation, if they aren't. There is no valid reason you should set your Edge userAgent to "HeadlessChrome", either.

That's not an angry lost user, friend, that is an upset unauthorized third-party content scraper. I work with Open Data, so I don't care, but some sites for-realsies do.
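For reference, the signals this exchange is arguing about boil down to a few property checks. A mock-based sketch (run against stub objects so it works outside a browser; in a real page you'd read the actual `window` and `navigator`, and the userAgent check is the "old", deprecated one):

```javascript
// Collect which headless tells are present on a window-like object.
function looksHeadless(win) {
  const signals = [];
  if (win.navigator.webdriver) signals.push("navigator.webdriver");
  if (!win.chrome) signals.push("window.chrome missing");
  if (/HeadlessChrome/.test(win.navigator.userAgent)) signals.push("userAgent");
  return signals;
}

// Stub resembling headless Chrome: webdriver flag set, telltale UA,
// and no `chrome` object.
const headlessLike = {
  navigator: {
    webdriver: true,
    userAgent: "Mozilla/5.0 ... HeadlessChrome/65.0",
  },
};
console.log(looksHeadless(headlessLike)); // lists all three signals
```

This also illustrates the grandparent's point: an ordinary user with a spoofed UA would trip at most one of these, which is why acting on a single signal is what produces the false positives being debated.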

Providing users with fake data is never a good idea because it can be, and probably will be, used against you in the long run. Plus, no sane evil scraper uses the default referrer with no masking, so misfires are realistically possible within the thin margin needed to detect them.

In any case, users can do whatever they want with their client and expect the service to work properly. If you detect abuse you should block or CAPTCHA them, but the fact that they might be a bot doesn't really call for such a drastic measure. It's the second-worst approach after serving hindering scripts to them.

Disclaimer: I haven't downvoted you as I don't downvote things prompting a discussion.

Agreed. We use headless browsers for automated regression testing, and adding a check on production would help ensure that testers don't goof and test in the wrong environment, and that developers don't hard-code URLs that cause the environments to hop.

Worth noting, I believe: the word "block" doesn't appear in the article, and seems to have been editorialized in the poster's title.

So headless now knows it is headless. Then what?

I believe the usefulness of this is that now the _server_ knows the client is headless. Then blocks it.

The techniques explained in the article seem like they'd be JS running on the browser itself, so… the "browser itself knows it's headless" pretty much sums it up.

Isn't it more like, "the browser can be coerced into revealing to the server that it is headless"?

But for that you need to send content first? That doesn't work before you send the result (unless you want to redirect each request)

Utilizing javascript on browser, the server can detect it's headless.

The server can then block, redirect or feed the browser erroneous data.

Server can't do anything. It wasn't informed. And it can't be informed reliably.
