It is not possible to detect and block Chrome headless (intoli.com)
360 points by foob 10 months ago | 166 comments



Sites detecting headless browsers versus headless browsers trying not to be detected is an arms race that's been going on for a long time. The problem is that if you're trying to detect headless browsers in order to stop scraping, you're stepping into an arms race that's being played very, very far above your level.

The main context in which Javascript tries to detect whether it's being run headless is when malware is trying to evade behavioral fingerprinting, by behaving nicely inside a scanning environment and badly inside real browsers. The main context in which a headless browser tries to make itself indistinguishable from a real user's web browser is when it's trying to stop malware from doing that. Scrapers can piggyback on the latter effort, but scraper-detectors can't really piggyback on the former. So this very strongly favors the scrapers.


In my experience, the most effective countermeasure to scraping is not to block, but rather to poison the well. When you detect a scraper - through whatever means - you don't block it, as that would tip it off that you are on to it. Instead you begin feeding plausible, but wrong data (like, add a random number to price). This will usually cause much more damage to the scraper than blocking would.
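A minimal sketch of what that could look like, assuming an Express app; lookupProduct() and isSuspectedScraper() are hypothetical stand-ins for your real product lookup and detection heuristic:

    // Hypothetical Express route that serves slightly perturbed prices to suspected
    // scrapers instead of blocking them. Names below are placeholders, not a real API.
    const express = require('express');
    const app = express();

    const lookupProduct = (id) => ({ id, name: 'Widget', price: 19.99 }); // stand-in
    const isSuspectedScraper = (req) => !req.get('user-agent');           // stand-in heuristic

    function poisonPrice(realPrice) {
      // Perturb by up to +/-5% so the number still looks plausible.
      const factor = 1 + (Math.random() * 0.1 - 0.05);
      return Math.round(realPrice * factor * 100) / 100;
    }

    app.get('/api/products/:id', (req, res) => {
      const product = lookupProduct(req.params.id);
      const price = isSuspectedScraper(req) ? poisonPrice(product.price) : product.price;
      res.json({ ...product, price });
    });

    app.listen(3000);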

Depending on your industry etc., it may be viable to take the legal route. If you suspect who the scraper is, you can deliberately plant a 'trap street', then look for it in your suspect's data. If it shows up, let loose the lawyers.

Of course, the very best solution is to not care about being scraped. If your problem is the load being caused to your site, provide an API instead and make it easily discoverable.


Poisoning the well is very effective. We employed it at a large ecommerce company that was getting hit by carders testing credit cards on low-price-point items (sub-$5). We had been playing cat and mouse with them for six months. We found certain attributes of the browser that the botnet was using and fed them randomized success/fail responses. After two weeks of feeding them bad data, they left and never came back. They did DDoS us in retaliation, though.


I think this is a good example of "poisoning the well" in practice, and I was in a similar position as you describe when I was working in incident response at a consumer bank a few years ago.

That said, this is a very particular scenario, and I don't think you can generalize the effectiveness of the technique from this example. In situations where attackers are looking for boolean responses, e.g. verifying email addresses, credit cards, usernames, etc., this can work well.

But in most situations a crawler is looking for particular data. Moreover, it will have a strong expectation (and understanding) about what this data should look like. Poisoning the well is not going to work as well in that scenario, because a scraper will be aware that the data is incorrect.

In contrast, people testing credit cards don't need your data, they just need data from somewhere that verifies the card numbers. In that case it's easier to just keep rotating targets until they've tested all the cards instead of bootstrapping an in-house team to write bespoke crawlers.


Man. I really like that idea, but is giving a random "success" to a supposed credit card charge PCI compliant?


PCI is largely misunderstood. Spoofing a success on a transaction has nothing to do with proper storage and transmission of credit card information.

Most e-commerce sites will use a payment processor such as Braintree or Stripe so they don't even have to deal with PCI.

PCI applies to entities who store and process their own payments, e.g. Target, Amazon.


PCI doesn't care about what I return to the customer. It covers protecting the card number. It could be in the merchant agreement, but most of those rules are never enforced. It's against card brand rules to require a person to show an ID, but all merchants do it. The merchant account agreement requires you to prevent fraudulent purchases, and that part is heavily enforced, which is why the card brands turn a blind eye to merchants asking for an ID.


The whole cat and mouse game thing... for some strange reason that sounds fun to me. Probably because I don't know the details and the workload involved in actually doing it (and it's not my money or inventory at stake). It seems like it would be exciting, in that somewhat naive juvenile-ish fantasy sort of way, to try and figure out how to mitigate the threat, implement it quickly, and deploy it to watch it play out in real time on live production servers. I don't know, maybe I have the wrong idea about the whole thing?


There are aspects that are fun, but I feel like if you're doing it right, it is stressful. You are playing an antagonistic game with bad actors, so there's risk, and you'd better be well past just gaming out the probabilities and costs there. Just because you noticed them doesn't mean they can't do damage. You'd also better get informed buy-in from other relevant departments, etc.

Quite a while ago, I was involved in baiting an attacker in a somewhat different way, but with the same goal (destroying the value we were providing to them). After the attacker figured out what was going on, they issued a somewhat credible threat to damage the company's machines (they included details demonstrating they had access to a couple of internal machines at some point), attacked and DoSed them, and persistently tried spearphishing them for months afterwards.

I guess I'd just say, (a) doing this sort of thing responsibly sucks a lot of the fun out of it, and (b) don't underestimate the risks of things going pear-shaped. You could be buying yourself a lot of ongoing grief. Something as everyday as that spearphishing attack can be nerve-wracking - even after annoying the crap out of everyone by repeating how to be careful with email, there's no way to be sure it won't hit, and the next thing you know people are sitting around with upper management having conversations nobody wants to have about network segmentation and damage mitigation.


It is incredibly fun. I set up a Google Voice number specifically for site content managers to call me when comment/review spammers came around. I was fast enough at blocking spammers' repeated attempts at evading registration blocking (email domains, IP address ranges, browser disguising) that I would usually make the spammers give up within an hour.


Your comment is why I feel automating engineers would be very hard. Cat and mouse game sounds like it requires a human... and a cat... and a mouse...


GANs are an automated ‘cat and mouse game’

https://en.m.wikipedia.org/wiki/Generative_adversarial_netwo...


Indeed, I was going to mention those, but I feel there is an element missing from even GANs that we don't quite have yet.


It's interesting that it came down to actual virtual warfare.

Good on you for sticking through it.


That only works if the scrapers are in a country where you can actually do something about it.

Also poisoning only works for a while. As soon as they detect the poisoning, they can easily figure out what tripped the scraping detection, and now you need to poison in an even more subtle way because the scrapers know what to look for.

You can't win this game. Especially the "obvious" type of browser checks, where you can tell the JS is inspecting your browser, are easy to circumvent, because you can see exactly what they're doing.

Really though, I've reversed a few fingerprinting libs in the wild and also looked at some countermeasures that are being sold in the blackhat world. Both sides completely and utterly suck.

The fingerprinting stuff is easy to circumvent if you're willing to reverse a bit of minified JS; the browser automation 'market leader?' is comically bad, and it took me 15 minutes to reliably detect it.

That browser/profile/fingerprint automation thingy, in its default configuration, sometimes uses a Firefox fingerprint while running Chrome. Protip: Chrome and Firefox send HTTP headers in a different order. Detected, passive, without JS.
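A passive check along those lines is easy to sketch in Node, where req.rawHeaders preserves the order the client sent them; note that the "expected" ordering below is purely illustrative, not an authoritative Chrome fingerprint:

    // Sketch: if the User-Agent claims Chrome but a few common headers arrive in an
    // order Chrome wouldn't use, flag the request. The expected order is illustrative.
    const http = require('http');

    function headerOrder(rawHeaders, interesting) {
      const sent = [];
      // rawHeaders is a flat [name, value, name, value, ...] list, in wire order.
      for (let i = 0; i < rawHeaders.length; i += 2) {
        const name = rawHeaders[i].toLowerCase();
        if (interesting.includes(name)) sent.push(name);
      }
      return sent.join(',');
    }

    http.createServer((req, res) => {
      const claimsChrome = /Chrome/.test(req.headers['user-agent'] || '');
      const order = headerOrder(req.rawHeaders, ['host', 'user-agent', 'accept', 'accept-language']);
      const expected = 'host,user-agent,accept,accept-language';  // placeholder, not gospel
      res.end(claimsChrome && order !== expected ? 'flagged' : 'ok');
    }).listen(8080);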

Claiming you're on Windows while actually running on a Linux VM? Your TCP fingerprint gives it away... Detected, passive, without JS.

Really, both sides, step up your game, this is still boring! Or, just stop fighting scrapers, you can't win.


> Claiming you're on windows while actually running on a Linux VM, your TCP fingerprint gives it away

This is interesting, I’d not heard of this approach. What’s the technique for generating such a fingerprint?


The most trivial check is using the TCP packet TTL (Time To Live). Windows and Linux don't have the same default value, so it's "easy" to recognize which one sent a specific packet.

Just search for "ttl windows vs linux" to find tons of articles about it :)
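For what it's worth, the usual defaults are 64 for Linux/macOS, 128 for Windows, and 255 for some network gear and older Unixes, so a crude classifier (given a TTL already pulled out of a packet capture; capturing it is outside this sketch) could be:

    // Guess the sender's OS family from an observed IP TTL, assuming the packet
    // crossed few enough hops that the value is still near its default.
    function guessOsFromTtl(observedTtl) {
      if (observedTtl > 128) return 'router / older Unix (default 255)';
      if (observedTtl > 64)  return 'Windows (default 128)';
      return 'Linux/macOS (default 64)';
    }

    console.log(guessOsFromTtl(57));   // Linux/macOS (default 64)
    console.log(guessOsFromTtl(116));  // Windows (default 128)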



Thanks!


I don't know the actual fingerprint here but check out nmap: https://nmap.org/


We could use some professional advice from you, lawl. Please contact me at adelmo.new@gmail.com. /thanks


> When you detect a scraper - through whatever means - you don't block it, as that would tip it off that you are on to it.

Having written bespoke scraping systems professionally, I think you overestimate the applicability of this technique.

For one thing, detecting that a scraper is a scraper is the problem, not the prelude to the problem. You might as well block them at that point, if you feel you can reliably detect them. If nothing else it's more resource efficient than sending them fake data.

Second, and more importantly, "poisoning the well" is not going to work against a sophisticated scraper. I used to use massive amounts of crawled web data to accurately forecast earnings announcements months in advance. I've also consulted with various companies building distributed crawling systems or looking for ways to develop integrations without public APIs.

My colleagues would know very quickly if something was wrong with the data because our model would suddenly be extremely out of whack in ways that could be traced back to the website's behavior. We used to specifically look for this sort of thing and basically eyeball the data on a daily basis. We even had tools in place that measured the volume, response time and type of data being received, and would alert us if the ingested data went more than one standard deviation outside of the expectation in any of these metrics.
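Boiled way down, that kind of alert is just a deviation check per metric; the field names and the one-standard-deviation threshold below mirror the description above but are otherwise made up:

    // Flag today's ingest if any tracked metric drifts more than one standard
    // deviation from its historical mean. Field names are illustrative.
    const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
    const stddev = (xs) => {
      const m = mean(xs);
      return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
    };

    function checkIngest(history, today) {
      // history: [{ volume, responseTimeMs, rowCount }, ...]; today: same shape.
      const alerts = [];
      for (const metric of ['volume', 'responseTimeMs', 'rowCount']) {
        const series = history.map((h) => h[metric]);
        if (Math.abs(today[metric] - mean(series)) > stddev(series)) alerts.push(metric);
      }
      return alerts; // e.g. ['responseTimeMs'] means a human should eyeball the data
    }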

You might succeed in screwing up whatever the data is being used to inform for a little while, but you will, with near certainty, show your hand by doing this, and scrapers will react in the usual cat and mouse game. Modern scraping systems are extremely sophisticated, and success stories are prone to confirmation bias, because you're mostly unaware when it's happening successfully on your website. For a while I was experimenting with borrowing (non-malicious) methods from remote timing attacks to identify when servers were treating automated requests differently from manual requests instead of simply dropping them. The rabbit hole of complexity is sufficiently deep that you could saturate a full research team with work to do in this area.

If you want to productively block scrapers, you should consider using a captcha-based system at the application layer, preferably a captcha that hasn't been broken yet and which can't be outsourced to a mechanical turk-based API. If nothing else, doing that will introduce at least 10 - 20 seconds of latency per request, which might be intolerable for many scrapers even if they're quite sophisticated.


Choosing a believable alteration is just as important as detecting a browser. For price scraping, even 3% changes to prices can do a lot of damage and are much harder to detect than completely bogus data.


That's a fair point, and it's conceivable this could work. But if the scrapers have had a legitimate source of truth in your data for a meaningful amount of time before you pull the trigger on this, it will get significantly harder.

In practice I'm talking about data which is very multidimensional and has more nuanced metrics than something like a price point. As an example, let's say a certain company exposes sequentially crawlable online orders without authentication. A scraper is going to reason about the data instead of just accepting it, and it's going to look for relatively minute errors. In particular, if it suddenly pulls in many more orders in a single day, or if a particular product shoots up in popularity, or if the price of an item in inventory suddenly changes from what it was historically, an alert will fire off.

As a real-world example, I was working on a project to scrape all orders for a publicly traded fast food company that offered online orders and delivery. In the middle of one quarter, without warning, we suddenly saw activity that looked legitimate but with noticeably different order data, in a way that significantly increased revenue projections. My colleagues and I basically didn't trust the data whatsoever until we could find a localized (and not well publicized online) promotion that accounted for the changes.

Basically, I'm saying that it can be done, but it's hard and not a silver bullet. If the data is being used as part of a timeseries, then data meaningful enough to disrupt operations is going to be noticed and manually reviewed to identify a narrative that explains it; if the data is believably false but not meaningful enough to change a trend, it's not really going to matter to them anyway.


> As an example, let's say a certain company exposes sequentially crawlable online orders without authentication.

Wow. That's starting to get really shady.

Note to self: All things facing the web in ecommerce must be cryptographically randomized. :(


> Wow. That's starting to get really shady.

Yes, there is a massive amount of crawling that happens in the service of financial forecasting. Satellite imagery, drones, web pages, API endpoints, receipt data (free financial aggregators sell this), location data (free geolocation services sell this), purchase history (free email clients sell this), etc. This rabbit hole goes very deep. Some of it is actively sold by free services for "market research", some of it is collected from sources that don't bother with authentication or access control and make all of their data public.

> Note to self: All things facing the web in ecommerce must be cryptographically randomized. :(

The bigger issue is that sensitive information should be behind authentication. If you require authentication instead of just a "browsewrap" terms and conditions clause at the bottom of the page, that data becomes legally actionable if it's ever found. Otherwise you're relying on (at best) a robots.txt to do the enforcement for you.

Many companies overlook this, even if they're told about it, because it doesn't compromise users or constitute a security vulnerability. So savvy firms take the data as a competitive advantage and use it for market research.


For someone just getting into web development, would you point me to some resources that describe how to find and pull this data?


> I used to use massive amounts of crawled web data to accurately forecast earnings announcements months in advance.

This is really interesting. Was it using sentiment analysis from news-like sources, or tracking prices/releases, or some other information?


> If you want to productively block scrapers, you should consider using a captcha-based system at the application layer, preferably a captcha that hasn't been broken yet and which can't be outsourced to a mechanical turk-based API. If nothing else, doing that will introduce at least 10 - 20 seconds of latency per request, which might be intolerable for many scrapers even if they're quite sophisticated.

I'm thinking of getting into scraping as a (weird) hobby and still doing high-level research in the field. The point I've quoted stands out to me as one of the biggest hurdles to leap.

Three things.

Firstly, I'm aware of human-based captcha-defeating systems. You describe captchas that "cannot be outsourced to a mechanical turk-based API". I wasn't aware such systems existed; that sounds scary. Do you have any examples?

Secondly, I'm a little confused by your mention that #1 will simply raise the latency of the request to 10-20 seconds instead of completely blocking it.

Thirdly, in the case where a captcha (eg, ReCAPTCHA) can be defeated by a captcha-filling service... well, such services are extremely cheap, and there's no per-account fee or overhead. It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?


> Firstly, I'm aware of human-based captcha-defeating systems. You describe captchas that "cannot be outsourced to a mechanical turk-based API". I wasn't aware such systems existed; that sounds scary. Do you have any examples?

Google's latest captcha specification (and similarly sophisticated systems) must be completed in a small window of time, change rapidly, have a varying number of "rounds", and are extremely antagonistic to being reloaded in e.g. a frame.

Practically speaking, you can't consistently outsource that to a third party API that uses humans to click and verify the images.

> Secondly, I'm a little confused by your mention that #1 will simply raise the latency of the request to 10-20 seconds instead of completely blocking it.

For captcha systems which can be defeated by humans, they will tolerate an extended period of time before they're solved. This means the amount of time it takes your crawler to send the data to the API and receive the challenge response, then send it to the captcha on the page will not cause the test to fail. It will, however, introduce a significant amount of latency in your requests.

> Thirdly, in the case where a captcha (eg, ReCAPTCHA) can be defeated by a captcha-filling service... well, such services are extremely cheap, and there's no per-account fee or overhead. It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?

Sure, but what are you gaining? These things are usually priced on a per-captcha basis. So you could have multiple accounts, but you'll be paying the same as just sending multiple requests to the API, and the concurrent requests are routed via the API's backend to different human operators for a solution.


> Google's latest captcha specification (and similarly sophisticated systems) must be completed in a small window of time, change rapidly, have a varying number of "rounds", and are extremely antagonistic to being reloaded in e.g. a frame.

Hmmm.

I've observed the timing thing myself - I'll sometimes hit the captcha first while I fill the rest of a form out, then it'll expire, and I'll need to do it again.

What do you mean by "change rapidly" and "varying number of "rounds""?

Regarding avoiding loading the content in a frame - what about making the initial iframe load happen on the machine of the person filling the captcha? Why can't this work?

--

> Practically speaking, you can't consistently outsource that to a third party API that uses humans to click and verify the images.

I'm both very curious and somewhat surprised this is the case.

--

>> It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?

> Sure, but what are you gaining? ... you'll be paying the same as just sending multiple requests to the API, and the concurrent requests are routed via the API's backend to different human operators for a solution.

This is what I'd strongly expect, but I very much wonder if captcha-filling services ratelimit the number of filled captchas per account.

In theory (now I think about it) ratelimiting would hurt the service provider so this might be unlikely, but where I want to play with massive parallelism I do wonder if sharding hundreds of requests across multiple accounts would be useful or not. Sharding across multiple providers would probably increase parallel throughput by a small amount though.

Thanks for the feedback!


> Practically speaking, you can't consistently outsource that to a third party API that uses humans to click and verify the images.

It sounds like you can outsource it, but you essentially need a remote desktop connection for the captcha solver to interact with the page. The overhead might be a bit much, but it doesn't seem impossible.


> Having written bespoke scraping systems professionally, I think you overestimate the applicability of this technique.

You may be right, but I would say that it shifts the balance. If the content is poisoned, the scraper will have to spend effort on quality control and they can never know for certain if you have detected their workarounds. It doesn't stop them of course, but it certainly raises the cost of scraping you at little extra cost to you.


> Instead you begin feeding plausible, but wrong data (like, add a random number to price). This will usually cause much more damage to the scraper than blocking would.

Honest question, would this open the site up to legal liability?

You can never identify a bot 100%, so if you intentionally provided false information to an otherwise legitimate user, and that user is harmed by your false information, isn't that a breach of contract that would make you liable for damages based on the harm your bad data caused?

What if the user claimed that the site was acting maliciously, providing lower prices to cause a market reaction à la what happened to CoinMarketCap? How would you prove that you were only targeting bots, or provide statistics showing that your methods were effective rather than arbitrary? That stuff matters when money is lost.

You'd have to have a clause in your terms of service that says "if we think you are a bot then the information we give you will be bad" - and then users could (and probably should) run far away, since they have no way of knowing whether you think they are a bot or not, and thus can't trust anything on your site.


if you intentionally provided false information to an otherwise legitimate user, and that user is harmed by your false information, isn't that a breach of contract that would make you liable for damages based on the harm your bad data caused?

What contract?


Terms of service. It is not considered a contract if you are just an anonymous user, but once you take an affirmative action like creating an account you're explicitly agreeing to that and they are legally binding. The parent comment did not specify anonymous and plenty of bots create accounts to scrape with.

But even aside from explicit contractual terms - even without one - you cannot just run around acting in bad faith, right? If you put up a site called AccuratePrices.com and then knowingly and intentionally provide false information to some users, isn't that something like fraud?


Since when do anonymous users not have to adhere to publicly-stated terms of service?


Short answer: they never were. You can't just leave a contract lying around and say that anyone who gets nearby and/or reads it is now bound by it. It requires some type of affirmative act to acknowledge that you read and understand the contract and are voluntarily entering into a formal relationship. That's what that "I have read and agree to the terms of service" checkbox on most sites is for.

Otherwise I could leave a piece of paper in a coffee shop saying, "Apple stock will go up 5% in the next month - by reading these words you agree not to trade Apple stock in the next month unless you pay me a royalty." Reading something does not mean agreeing to its terms, and if the content does not require agreeing to the terms in order to access it, then the terms may as well not exist until you do.

It doesn't mean you can ignore everything there. If you reproduce the content then you are still liable under copyright laws. If you DDOS the site, you are still liable under tortious interference laws.

But outside of something covered by other laws, there is no legal recourse to just putting "no bots" in your terms of service and then suing any bots you find. You can try to _block_ bots, just like you can hide your sheet of paper when you see a cop getting coffee, but you can't sue them for breaking terms they never agreed to. I don't believe there is actually any case law yet as to whether a bot can even agree to a contract. For evidence you can look at the recent LinkedIn case, which, while it is on appeal, turned on this issue.


The new improved version of a "trap street". https://en.wikipedia.org/wiki/Trap_street


Instead you begin feeding plausible, but wrong data (like, add a random number to price). This will usually cause much more damage to the scraper than blocking would.

If you detect wrongly though, you risk alienating your users -- if I saw obviously incorrect prices on an ecommerce site, I wouldn't shop there, assuming that they are incompetent and if I enter any payment information it's going to be whisked away by hackers.

Depending on your industry etc., it may be viable to take the legal route. If you suspect who the scraper is, you can deliberately plant a 'trap street', then look for it in your suspect's data. If it shows up, let loose the lawyers.

That may be harder than you suspect, given that it's not illegal to violate a website's terms of use:

https://www.eff.org/document/oracle-v-rimini-ninth-circuit-o...

Even if you have an enforceable contract, you can still face an expensive legal battle if they choose to fight (and it's likely not their first fight), so hopefully you have some deep pockets.


Wow, that is the most evil thing I've heard in a while


Mailinator does the same thing -- http://mailinator.blogspot.com/2011/05/how-to-get-gmailcom-b...

I don't see anything wrong with it.


I don't see anything wrong with it, I am just impressed by the approach.


Evil is an interesting choice of words to mean not wrong and impressive...


Considering the frequency with which not-wrong and impressive things are labeled evil, it is only a matter of time before the word loses its negative connotations, similar to 'sick' or 'awesome'.


> let loose the lawyers.

I agree with you there.

Also relevant: Courts: Violating a Website’s Terms of Service Is Not a Crime (https://news.ycombinator.com/item?id=16119686)


A friend and I wrote a paper about techniques for doing this a while back (never formally published; it was just a class project). My favorite technique for avoiding automated scan environments was to show goatse.cx and then run your code in the onUnload handler when the human frantically tries to close the tab.

http://www.cs.columbia.edu/~brendan/honeymonkey.pdf


"is an arms race that's been going on for a long time"

I'd say not always. Some websites attract a high number of unsophisticated scrapers, and a few clever ones. If it's high volume, blocking the former is often worth it just for the reduced load. I agree that going to war with skilled and funded scrapers is futile.


Scraping is not a real concern for most people. Detecting advertising fraud is the much more pressing issue.


Give it time my friend: they don't need to detect headless browsers, they just need to restrict your DOM-given freedom altogether.

Some prick is bound to make some fancy non-DOM web framework using WebAssembly and turn the internet into a DRM-ridden mess.


Is there something like a non-DOM UI framework? Either you paint your pixels yourself or you have something equivalent to a DOM.

Even without fancy HTML5 stuff anybody could have sent a prerendered image.

I remember websites which rendered exclusively in Flash. I avoided them like the plague.


A while ago Flipboard released a canvas based UI. https://github.com/Flipboard/react-canvas

Most companies do use some Flash nonsense, but given the state of that, I'm sure they'll find some new insanity to foist on us.


It's an entirely pointless arms race, where the scraper always wins, because web browsers are just a piece of software saying the right things at the right time over TCP. The only deterrent to scraping is the cost of running a web browser to interact using a real JavaScript stack, so for some really lame, spammy uses of scrapers (the most common), headless browsers are often not cost effective.


This is also the game people in click fraud vs click fraud detection are playing.


Blocking crawlers is dead simple:

Find a way to build an API for your data that allows you both to make money.

Any effort besides that is wasted.

Honeypot links? Great, my crawler only clicks things that are visible. See Capybara.

IP thresholds? Great, I have burner IPs that hit a good page of yours until I'm blocked (am I time-banned, captcha'd, or perma-banned?) and then I back that number out across my network of residential IPs (bought through squid, hello or anyone else) and a mix of Tor nodes (I sample your site with those too) to make sure I never approach that number. But then I also geolocate the IP so it's only crawling during sensible browsing hours for that location.

Keystroke detection? Yeah, I slow down keystrokes so it looks like Grandma is browsing.

Mouse detection? Looks like Michael J. Fox is on your site (that's an old Dell or Gateway commercial reference, don't be mad).

Poison the well? I get a page from multiple IPs and headless browser combinations on different screen orientations, and if I detect odd changes in data I flag that URL for a Turk to provide insight/tune the crawler.
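Stripped to its core, that consistency check might look like this; the fetching through separate proxies/browser profiles is assumed to have already happened, and flagForTurk() is a stand-in:

    // Given the same URL scraped through several independent identities, flag it if
    // the extracted values disagree beyond a small tolerance (a likely poisoning sign).
    function detectPoisoning(url, results, flagForTurk) {
      // results: [{ proxy: '203.0.113.7', price: 19.99 }, ...] from different IPs/browsers.
      const prices = results.map((r) => r.price);
      const spread = Math.max(...prices) - Math.min(...prices);
      if (spread / Math.min(...prices) > 0.02) {  // >2% disagreement: suspicious
        flagForTurk(url, results);                // hand it to a human to adjudicate
        return true;
      }
      return false;
    }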

I keep the screenshot and full payload (CSS, JS, HTML) that I use over time to do more devious shit, like rendering old versions of your page behind a private nginx server so I can re-extract pieces of data I may have missed.

Stop trying to stop the crawling and figure out how to create a revenue stream.


Caveat to my wasted effort comment:

You're an e-commerce site that has a problem with people buying goods (especially virtual goods: ebooks, gift cards, etc.)[1] with stolen credit cards. You need a solution.

The hardest things I've ever had to crawl (as I mention in another comment in this thread) have been LinkedIn and Facebook. Why? Because I have to be logged in to get the data I want.

If you want to stop crawlers you can also put the good stuff behind auth, but you need a solid auth mechanism. You can't just do email verification because I'll use shit like https://www.mailinator.com to generate a ton of fake emails to sign up for your site.

[1] Why virtual goods? You can't stop shipping or track down the person once the card is reported stolen.


> [1] Why virtual goods? You can't stop shipping or track down the person once the card is reported stolen.

In the meantime, virtual goods are also zero-cost: when you sell an ebook and the transaction is cancelled by the bank, you didn't lose anything; it's not like the buyer was willing to pay anyway.


Until you see someone do it with buying gift cards. Those people are bastards. In that case using a service that helps detect fraudulent card usage is bueno.


I don't stop crawlers, I only randomly feed damaged/wrong data to crawlers.

I especially love doing this for e-commerce sites. Now the tables have turned. Try to guess which fraction of your scraped data is wrong.


My point is you’re not detecting the ones that you should be the most concerned with.


Problem is as developers, sometimes we only have hammers for these screws.


How would this stop crawling by users who cannot afford the API subscription or don't want to pay for it?

I think your suggestion would reduce crawling, but not prevent or block it.


Build a revshare API.

I assume the people that can't afford it are using it for non-revenue generating purposes. I don't know that I'd care about those people if they aren't taking away from my bottom line.


Because you'd presumably need to authenticate (and have a paid account) to access the data.


I think they meant that since they can't afford the API, they'd just keep crawling your HTML. Combine the API with putting your stuff behind auth and you solve a lot of the problem.

You've gotta keep that good yummy content publicly accessible for Google though, so you'll rank. So, that's a balancing act.


This article is a joke, and all those methods of "protection" are a joke. What we used to call "script kiddies" - now a major share of so-called developers - are just underdeveloped lamers who don't know that the fight is lost in advance. All the methods you take are useless the moment the scraper is run by someone who is able to modify (and able to code in C/C++) and recompile the client side. The world went into Idiocracy so much that methods are being invented by people who are so narrow-minded that they see development only within the scope of a browser and have a false sense that they "can handle it". Only if the opponent is as narrow-minded as they are. Only then. I can modify the source code of Chromium; you will get back exactly what you expect from a regular user. I am able to scrape FB and LinkedIn, and the only thing they can do is slow me down (to hide the fact that the code is doing the surfing, not a human). Stop wasting your time on protection: you are running your inefficient, crappy code in an insecure environment, and the only "attacker" you are safe against is the one who is as clueless as you are.

The same moment you send content to the client, it is game over. You have lost all control.

I am sorry for all the non-gentle sentences here, but we used to have developers who were able to decompile asm code and patch it to avoid DRM, while now sandboxed idiots think they are smart. The whole dev environment became toxic =/ And people are just too stupid to understand how stupid they are =/


Harsh words, but it's very true that developers don't realize that Chromium is open source... Maybe they should just jump to the new DRM extension; at least that will challenge the dedicated scrapers.


This is pretty broad criticism, but you are at least right that there is no sound principle on which to protect a website from "automated" interaction.


Isn't it impossible to win the game of blocking headless browsers?

What's stopping someone from creating an API that opens up a real browser, uses a real (or virtual) keyboard, types in/clicks the real address, etc. then proceeds to use computer vision to scrape the information from the page without touching the DOM?


You are in principle correct, but in practice you need to account for the side channels of information as well -- does the mouse and keyboard behave like a human or a robot? Are there thousands upon thousands of sessions coming from the same IP address?

The cat and mouse game happens at every level, not just the DOM/browser-detection level.


So record actual user input data and generate similar input patterns stochastically.
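For mouse input in particular, a common approach is to interpolate along a curved path with jitter rather than jumping straight to the target. A rough sketch, not tied to any particular automation tool:

    // Generate a human-ish mouse path from (x0, y0) to (x1, y1): a quadratic Bezier
    // curve through a random control point, sampled with pixel jitter and uneven timing.
    function humanMousePath(x0, y0, x1, y1, steps = 30) {
      const cx = (x0 + x1) / 2 + (Math.random() - 0.5) * 200; // random curvature
      const cy = (y0 + y1) / 2 + (Math.random() - 0.5) * 200;
      const points = [];
      for (let i = 0; i <= steps; i++) {
        const t = i / steps;
        points.push({
          x: (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1 + (Math.random() - 0.5) * 2,
          y: (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1 + (Math.random() - 0.5) * 2,
          delayMs: 8 + Math.random() * 20,
        });
      }
      return points; // feed these into whatever drives the browser's mouse events
    }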

That said, if you try to scale this up beyond what a reasonable, normal user would do in one sitting, you are bound to stand out.

Although, that said, I find that I already trigger such rate-limiting mechanisms as a human, just by searching Google and clicking through every last search result page.


You'd have to scrape slowly to mimic a real, slow user. Maybe at that point it'd be cheaper to get Mechanical Turk to do it. That should solve IP rate limiting, captchas, and just about everything except the endless arms race: why are so many people going directly to these same-formatted internal URLs without clicking through from random other places? So the site can change the internal URLs and break it all over again.


You'd use a browser extension, scoped to requests of sites you're interested in, and stream your data back to your infrastructure for processing. You're limited only by your install base and your ingest infrastructure.

Recap [1] does this to extract PACER court documents that are public domain, but access is restricted due to draconian public policy.

[1] https://free.law/recap/


>You'd have to scrape slowly to mimic a real slow user.

Sure, but that's easily mitigated by running multiple scrapers as different users.. You don't need to get all the data from a single scrape.


>Are there thousands upon thousands of sessions coming from the same IP address?

As someone who routinely works behind proxies, I can sympathize strongly with this statement:

"The one thing that I was really trying to get across in writing that is that blocking site visitors based on browser fingerprinting is an extremely user-hostile practice."


Well, headless browsers exist because they are less expensive to automate than real browsers; adding in the computer vision means the scraper just added a lot of expense.


There's a similar "analog hole" for video DRM, too.


I wonder how long it will be before someone comes up with the idea of using iPhone style facial recognition to tell whether a human is looking at the TV/Monitor or not.


Oh please no, exposing those sorts of APIs will quickly be utilized by ad-tech guys to make interstitial video ads that don't go away until you finish watching them.


Black Mirror S01E02



One simple reason is resource cost. Having a non headless browser is more expensive. Therefore at scale you are wasting resources.


Yes, it's more expensive than a headless browser. Impossible to say, though, whether it's more expensive than can be justified in achieving an objective until that objective is known.


Or just recompile Chromium with a few changes ;)


Rate-limiting followed by CAPTCHAs seems to be the usual strategy.

I think Google claims to try to detect humans by parsing out their mouse movements and scroll events.


And I can attest that they often presume that I'm a robot.

At this point it would be easier for me to write an alternative frontend to Google search (or just use DuckDuckGo), but it would be amusing to think that I might evade this by writing a script to simulate mouse movements to appear less robotic.


> And I can attest that they often presume that I'm a robot.

In my experience this occurs when you are either doing this too much, or you are not accepting their cookies when logged in. (I don't recall the behavior when logged out.)


For the record: I was logged out / incognito.

Actually, for whatever reason, this also often seems to cause reCAPTCHA to essentially hellban me and just keep asking me to solve an endless series of captchas. :/


Good, the less effective various spying techniques are, and the easier they are to throw off, the better the internet is for its users. I don't want any website owners to know what device, browser, or other program, I use to access their site, and they have no business knowing that. I like it being a piece of information I can supply voluntarily for my own purposes, and I get the heebie jeebies every time I read about a new shady fingerprinting technique that exploits some new, previously unexplored quirk of web technologies.


The conclusion that this makes spying techniques more difficult is not at ALL what this article is saying. It is just saying that no fingerprinting is going to be able to distinguish between headless and not headless, but that is because there are too MANY variations, not because the variations are hard to detect. Nothing in this article gives any instruction or guidance on how to prevent your browser from being fingerprinted and tracked.


This incentivizes more aggressive fingerprinting, not the other way around. Too bad people don't realize it.


Browser fingerprinting, I almost forgot. Non aggressive and impossible to stop.


Good. It shouldn't.

The web should be open; the fact that people are still trying to stop this is a joke.


Tell that to WebAssembly and friends.


I'm not sure why one wants to bother to do this.

With tools like Sikuli script (sikuli.org) already around for ages, automating a headed browser isn't rocket science. So the best-case scenario for detecting headless browsers is "The bad guys just use headed browsers and another automation solution."


Looks like a great tool; I'd never heard of Sikuli before. Thanks for the tip!


I name-drop it every chance I get ;) We used it to automate the integration tests for a game engine at a previous company; worked great, because it allowed us to fire events into the engine itself based upon the actual rendered pixels (Sikuli supports varying levels of fuzzy image detection for event targets).


This discussion is also happening on a counterpoint posted about 9 hours earlier, also currently on the front page:

It is possible to detect and block Chrome headless | https://news.ycombinator.com/item?id=16175646


"That’s when it becomes impossible. You can come up with whatever tests you want, but any dedicated web scraper can easily get around them."

As long as the logic is hidden from the scrapers, i.e. not running in a web browser, scrapers are at a disadvantage. They don't have the data about the users that websites have. And even something as simple as Accept-Language header associated with an IP subnet is a data point that can be used to protect against scraping. There are a lot more data points though and more aggressive fingerprinting can effectively destroy scraping.
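As a concrete (and deliberately tiny) illustration of that kind of data point, assuming the country comes from some GeoIP lookup you already run; the false-positive concern raised in the reply below applies in full:

    // One weak signal: does the primary Accept-Language plausibly match the GeoIP
    // country? A mismatch should lower a score, never block on its own.
    const expectedLanguages = {           // deliberately incomplete, illustrative table
      DE: ['de'], FR: ['fr'], JP: ['ja'], BR: ['pt'], US: ['en'], GB: ['en'],
    };

    function languageMismatch(acceptLanguage, geoCountry) {
      const primary = (acceptLanguage || '').split(',')[0].split('-')[0].toLowerCase();
      const expected = expectedLanguages[geoCountry];
      if (!primary || !expected) return false;   // unknown -> don't penalize
      return !expected.includes(primary);
    }

    // languageMismatch('ru-RU,ru;q=0.9', 'DE') -> true (one data point, not a verdict)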


The counterpoint (and what this article mentioned in several places) is that the "more aggressive fingerprinting" techniques can have high false positive rates, and then the legitimate users of your site are going to end up thinking it's broken or the data never loads because they're inadvertently triggering your fingerprinting.

It feels pretty ridiculous to tell a user "you can't use our site because your system language settings don't match our IP list for that area."

Sidestepping the fact that assuming you can tell someone's language from the geo mapping of their IP address is already pretty problematic.[0]

[0]: https://medium.com/@kristopolous/stop-guessing-languages-bas...


If I can navigate to it using a normal Chrome instance, then a headless instance is going to have all of the same information (accept-language) as far as the server can detect. Chrome headless is Chrome and makes network calls in the exact same way. That's why client side headless detection even exists.

So the only way in which what you suggest would work is if the system is so aggressive that it actually blocks a normal Chrome instance, which is more hostile that most systems are. But this allows a user to change settings until they do get a correct response back, and then just have the headless browser use those settings.


But certainly scraper blockers also have a big disadvantage, which is that the number of (reliable) scraper detection techniques is finite, and probably pretty small. It's relatively possible (though no easy task) to find most of them by going through the Chromium source and looking at what is done differently when running in headless mode. And you can also probably find more by occasionally running a real browser and doing differential execution of the JavaScript code. Once all trivial detection techniques can be bypassed, it becomes exponentially more difficult for scraper blockers to find a new one.


All the passive techniques are much harder to reason about, but much easier to match. You just look at the complete request/response headers, make sure you match them, and have some good sources to request from.

Much harder is stuff like Distil's script injection, where they transparently inject script tags that do fingerprinting and obfuscate the code that does so annoyingly (it's not really hard to reverse, just time consuming and annoying). They pair this with being a bit more user friendly by redirecting you to a CAPTCHA page if your fingerprint hits some threshold, which, if you answer it, redirects you to the page you wanted; so users experience an inconvenience if there's a false positive, but still get access to what they wanted.

I was able to get around most of the passive stuff fairly easily with Perl and LWP, and even the active stuff and CAPTCHA redirects (cookie_jar all serialized to a DB so I could store the request and re-present it to a user to answer), but once they started tweaking their fingerprint script every couple of months/weeks, that's when the equation shifted. Distil, as a solutions provider, gets to amortize their changes across all their customers, while I would have to spend the time de-obfuscating it. They could just assign a person to change it once a week and they would effectively halve my time to get any real work done, so without a collective effort of some sort to combat them, I saw the writing on the wall. :/

The sad thing is that when we moved to API access, their APIs are hampered to the degree that it actually takes two orders of magnitude more requests each minute for a fraction of the accuracy (I was able to query changes over the last couple minutes previously, and now I have to query the entire item set of a subset of all containers, when there are tens of thousands of containers). :/ Lose lose, since our use case isn't even the main reason the site wanted to block scrapers.


Did you do a cost estimate for the "Wizard of Oz" solution of having real people with real browsers (and a script to pull data from the site?) Might have been worthwhile.


It was actually two changes, one which required a much more intensive request regime because of a public caching (SOLR) system change, and then the much more aggressive scraping detection. The first caused us to change from requesting data 78 times an hour (items changed within 2 minutes requested every minute, 6 minutes of changes every 5 minutes, 11 minutes of changes every 10 minutes for overlapping coverage) to many thousands of checks an hour for much less accurate information. In the end, we ended up doing very targeted checks and much less accuracy for different classes of items. Having people actually do the checking just wouldn't be feasible for our size and resources (very small company, <10 employees), even through mechanical turk (I suspect).


Interesting follow-up (again). It will be very interesting to see where attempts to detect headless browser will first appear in the wild. Once we know that and the prevalence, we can make a judgement call on how much effort to put into anti-detection techniques. It's an arms race for sure, but once you know your target you can evaluate whether you even have to put up the effort to defeat a non-existent adversary.


It's a very dangerous thing to do for SEO reasons too.

I'm sure Google and others have automated user-like crawling which attempts to validate their official Google indexing bot.

If the results between the two differ in certain ways you may well get your site buried way down in search results.


Crawlers & scrapers that rely on headless browsers like Chrome often initiate playback of video on the pages they access.

The company I work for (Mux) has a product that collects user-experience metrics for video playback in browsers & native apps. It's been a non-trivial effort developing a system to identify video views from headless browsers so that we might limit their impact on metrics. Being able to make this differentiation has a real benefit to human users of our customer's websites.

My preference would be for headless browsers to not interact with web video or be easily identifiable via request headers, though I doubt either of these things will happen any time soon.


Video should never play unless actively initiated by the user. That would fix the metrics, as the headless browser probably wouldn't initiate the video playback


If it's a video site, I expect the video to play when I land, e.g. youtube. I'm initiating on purpose by browsing


Regarding YouTube in particular, I tend to open up videos in background tabs for later viewing and find it very annoying that they start playing automatically before I get around to that tab. I did go there to watch the video—eventually. Just not the second that the page finishes loading.

YMMV. A persistent setting to enable or disable auto-play would be ideal.


Chrome doesn't autoplay videos in background tabs until you focus them.


Actually it doesn't even seem to load most of the page in a background tab until it's focused... which is also annoying. (Or perhaps it's just the parts hidden behind 'onload' JavaScript, which these days is most of the page content.) Part of the reason for opening the tab in the background is getting the lengthy loading process out of the way while I'm reading something else.

For this reason I tend to activate the tab and then go back to what I was doing before... and then the video auto-plays.


Firefox likewise has toggles for disabling autoplay in background tabs (on by default) and disabling autoplay completely (off by default).


Yeah, I definitely don't. But I think a persistent "auto-play" toggle (default "off") is ok.


The first thing I do when hitting a YouTube URL is stop the video. Then I'll either run youtube-dl on the URL, or just paste it straight into a proper video player (VLC).


Pretty confident you're in the extreme minority on that one


There are Chrome extensions that automatically pause the video.


How are they initiating playback? Are they pressing play, or just triggering auto-play behavior?


The author's navigator.webdriver fix is easily detected, though of course it is fixable with changes to Chrome. This cat and mouse game probably isn't worth pursuing against dedicated adversaries.

    if (navigator.webdriver || Object.getOwnPropertyDescriptor(navigator, 'webdriver')) {
        // navigator.webdriver exists or was redefined
    }


That test actually wouldn't work:

    > navigator.webdriver
    true
    > Object.getOwnPropertyDescriptor(navigator, 'webdriver')
    undefined
As you say though, it's a cat and mouse game and you could always override the behavior of getOwnPropertyDescriptor() if it were used in a test.
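For context, the kind of override being discussed on the scraper's side is roughly the following (a sketch, not the article's exact code), injected before any of the site's scripts run:

    // Make navigator.webdriver report undefined, the way a non-automated browser does.
    // Defining the getter on the prototype keeps getOwnPropertyDescriptor(navigator,
    // 'webdriver') returning undefined, matching the behaviour shown above.
    Object.defineProperty(Object.getPrototypeOf(navigator), 'webdriver', {
      get: () => undefined,
      configurable: true,
    });

    console.log(navigator.webdriver);  // undefined
    // A sufficiently paranoid detector can still probe the prototype's descriptor or
    // the getter's toString(), which is where the next round of cat and mouse starts.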


As someone who writes web scrapers for a living, I have only come across one site where I have been unable to reliably extract the information we need. If we were more flexible, we would be able to deal with this site too. Defending yourself from scrapers is an arms race you are almost certain to lose.


What site?

I’d guess LinkedIn or Facebook.

I’ve had to make a lot of fake accounts to get just a decent amount of data from them.


That's why I usually block all the IP ranges from hosting providers.


It's trivial to randomise HTTP headers, both the content and the order. There are free and commercial databases of user-agent strings available to any user, the same ones the websites may use.

Users can also modify or delete HTTP headers through local proxies, using the same proxy software that many high volume websites use. Sites that rely on redirects to set headers make this even easier.

p0f only works with TCP. Could this be another selling point for alternative congestion-controlled reliable transports that are not TCP, e.g. CurveCP? I have prototype "websites" on my local LAN that do not use TCP.

The arguments in favor of controlling access to public information through "secret hacker ninja shit" (https://news.ycombinator.com/item?id=16176572) are not winning on the www or in the courts. Consider the recent Oracle ruling and the pending LinkedIn HiQ case.

If the information is intended to be non-public, then there is no excuse for not using access controls. Anything from basic HTTP authentication to requiring client x509 certificates would suffice for making a believable claim.

Detecting headless Chrome and serving fake information, or any other such "secret hacker ninja shit" is not going to suffice as a legitimate access control, whether in practice or in an argument to a reasonable person.

The fact is, in 2017 websites still cannot even tell what "browser" I am using, let alone what "device" I am using. They still get it wrong every time. The best they can do is make lousy guesses and block indiscriminately. Everything that is not what they want/expect is a "bot", a competitor, an evil villain. Yet they have no idea. Sometimes, assumptions need to be tested.1

  1 https://news.ycombinator.com/item?id=16103235 (where developer thought spike in traffic was an "attack")


The EME DRM is part of the game for those who really want to block headless. It will arrive, sooner or later.


I always assumed DRM would eventually factor into this. I’ve only ever read about it in the context of media, but I’m assuming there’s ways to use it creatively for fingerprinting and blocking scraping as well. Do you have any links with insights into that?


I am not familiar with the specific details and how they would allow this, but... if it is for human consumption and not bot consumption, it is enough to render the result into DRM'd H.264.


From my experience in the scene:

Bot mill people are very aware of headless browsers being an effortless way to mimic a browser, but not a very efficient one.

The amount of RAM and other resources a bot spends to do a single click can truly hurt their bottom line.

Top tier collectives I've heard of use their own C/C++ frameworks with hardcoded requests and challenge solvers, and in-depth knowledge of the anti-botting and anti-fraud techniques used by the opposing force. If DoubleClick finds a brand new performance profiling test and sends it out in the JS code in one in 1000 requests, expect those guys to detect it and crack it within 24 hours.

They have no objective of getting through captchas, just having their number of valid clicks in double digits.


The problem is that you can easily detect that some properties have been overloaded. For example, you can execute Object.getOwnPropertyDescriptor(navigator, "languages") to detect if navigator.languages is a native property or not.


It's possible to hide that as well, funnily enough, by also overwriting Object.getOwnPropertyDescriptor (and similar tricks). As far as I know, it's theoretically possible to use this trick to completely 'sandbox' some code so that there's no way it can detect certain functions being overwritten (by overwriting all functions such as Object.getOwnPropertyDescriptor, Function.toString, etc. and making them hide the overwritten functions, including themselves).

For some more information: http://randomwalker.info/publications/ad-blocking-framework-...
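A bare-bones illustration of that masking trick; a real version has to cover far more surface (getOwnPropertyNames, Reflect, iframes, the exact native toString format), so treat this strictly as a sketch:

    // After overriding a native property, also patch the introspection functions a
    // detector would call, and make the patches themselves claim to be native code.
    const realGetDescriptor = Object.getOwnPropertyDescriptor;
    const hiddenProps = new Set(['webdriver', 'languages']); // props we've overridden

    Object.getOwnPropertyDescriptor = function (obj, prop) {
      // Simplification: keyed on the property name only; a real version would also
      // check which object is being inspected.
      if (hiddenProps.has(prop)) return undefined;
      return realGetDescriptor(obj, prop);
    };

    const realToString = Function.prototype.toString;
    Function.prototype.toString = function () {
      if (this === Object.getOwnPropertyDescriptor || this === Function.prototype.toString) {
        return 'function () { [native code] }';  // hide the patches themselves
      }
      return realToString.call(this);
    };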


Very interesting article! Thank you.


Can’t I accomplish the same thing by compiling a modified version of headless chrome?


Yes I think it's possible but it's much harder to do, I guess.


Could someone tell me why everybody wants to fight against headless browsers? If I want to use such a browser to browse your site - a site that you voluntarily show to the public - then it's my problem, my code, not yours. If you want to protect your data so much, then maybe you shouldn't put it on the web in the first place. (Yep, I'm presenting things in black and white, but you get the picture.)

I would also add this:

https://www.bitlaw.com/copyright/database.html#Feist

because it basically says it's hard/pointless to protect data.


Some people seem to have figured out how to detect scrapers without relying on fingerprinting the browser, e.g. Crunchbase.

But headless Chrome shouldn't be distinguishable from a regular Chrome browser.

The only vector for blocking scrapers is some sort of navigational awareness that deviates from a distribution curve, plus awareness of IP.

But this comes at the great cost of hurting your own real visitors by taxing them with captchas or other annoyances.


That's what invisible reCAPTCHA is for.

Only the users who compulsively clear cookies ever get bothered by it, and even then all they have to do is click a few photos of cars.


It is "easy" to block scraping. Make it very costly to scrape:

- Render your page using canvas and WebAssembly compiled from C, C++, or Rust. Create your own text rendering function.

- Have multiple page layouts

- Have multiple compiled versions of your code (change function names, introduce useless code, use different implementations of the same function) so it is very difficult to reverse engineer, fingerprint, and patch.

- Try to prevent debugging by monitoring the time interval between function calls, and compare local intervals with server-measured intervals to detect sandboxes (a rough sketch of this idea appears at the end of this comment).

- Always encrypt data from the server, using a different encryption mechanism every time.

- Hide the decryption key in random locations in your code (generate multiple versions of the code that retrieves the key).

- Create huge objects in memory and consume a lot of CPU (you may mine some crypto coins) for a brief period of time (10s) on the user's first visit. Make it very expensive for scrapers to run their servers. Save an encrypted cookie to avoid doing it again later. Monitor concurrent requests from the same cookie.

The answer is that it is possible but it will cost you a lot.
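
As a rough sketch of the timing idea from the list above (the threshold and the reporting endpoint are made up):

    let lastCheckpoint = performance.now();

    function checkpoint(label) {
      const now = performance.now();
      const elapsed = now - lastCheckpoint;
      lastCheckpoint = now;
      // A huge gap between two instrumented calls suggests single-stepping in a
      // debugger or an unusually slow (emulated/sandboxed) environment.
      if (elapsed > 2000) {
        navigator.sendBeacon('/suspicious-timing', JSON.stringify({ label, elapsed }));
      }
    }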


All of which is defeated by OCR.


Good point. OCR-powered web scraping is even available out of the box nowadays.

https://a9t9.com/kantu/docs/scraping#ocr


It is not the OCR that is costly. It is the JavaScript execution needed to render the page so you can do the OCR. You can even increase the JavaScript execution cost if the client looks suspicious.
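
One way to do that, sketched here with an invented challenge format, is a small client-side proof of work: the server hands out a challenge, and the page has to burn CPU finding a nonce before its requests are honored.

    async function solveChallenge(challenge, difficulty = 2) {
      const encoder = new TextEncoder();
      for (let nonce = 0; ; nonce++) {
        const digest = new Uint8Array(
          await crypto.subtle.digest('SHA-256', encoder.encode(challenge + ':' + nonce))
        );
        // Require `difficulty` leading zero bytes; raise difficulty for suspicious clients.
        if (digest.subarray(0, difficulty).every((byte) => byte === 0)) {
          return nonce; // echoed back to the server with subsequent requests
        }
      }
    }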

You will also have to automate all page variations and the traditional challenges (login, captcha, user behavior fingerprinting, ...)

In the end, the development time and server costs will kick you out of business if you are too dependent on the information, or you will start to lose money every time you scrape.


Yes. The idea here is to make the scraper dependent on OCR (they also have to find where the information is as the page design changes) and to waste a lot of their server resources, making it very costly to scrape.


If you want to detect if a human is visiting your site, open an ad popup with a big close button directly over the content.

A human being will always, 100% of the time, immediately close the popup. Automation won't care.
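
A minimal sketch of that idea (the selector, styling, and reporting endpoint are illustrative): show a dismissible overlay and record whether, and how fast, it gets closed.

    const overlay = document.createElement('div');
    overlay.style.cssText = 'position:fixed;inset:0;background:rgba(0,0,0,.8);z-index:9999';
    overlay.innerHTML = '<button id="close-ad" style="font-size:2em">Close</button>';
    document.body.appendChild(overlay);

    const shownAt = performance.now();
    overlay.querySelector('#close-ad').addEventListener('click', () => {
      // Humans tend to dismiss this within seconds; automation usually never does.
      navigator.sendBeacon('/human-signal', JSON.stringify({ msToClose: performance.now() - shownAt }));
      overlay.remove();
    });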


OK, but that is guaranteed to annoy users. Plus, I think you’re underestimating the intelligence of the people writing scrapers — obviously they’re going to visit the site manually and see what appears to be a fingerprinting measure. Then they’ll update the scraper to close that pop up. There are no effective solutions to this problem.


It is impossible to make a headless and a normal browser send 100% indistinguishable traffic. The timing of the browser's requests is influenced by rendering, which will always differ between the two.


It's not impossible; you could e.g. profile a real client's timing and introduce delays into the headless version. It's not zero work, but it's very much not impossible if you're sufficiently motivated.

Especially recently, with e.g. https://hackaday.com/2018/01/06/lowering-javascript-timer-re... , high-precision timers in JS might not be available for all clients, for reasons other than ~"they're headless and trying to scrape my site".
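
For example, a sketch with Puppeteer, where the delay ranges are made-up placeholders rather than values profiled from a real client:

    const puppeteer = require('puppeteer');

    const humanPause = (min, max) =>
      new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto('https://example.com');
      await humanPause(800, 2500);   // "read" the page for a plausible amount of time
      await page.click('a');         // then interact at a human-ish pace
      await browser.close();
    })();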


Chromium can work directly with Wayland, AFAIK. Write a "fake" Wayland implementation and Chromium will happily think it's drawing to a real display.


Cool. I think the new captchas use mouse entropy; that would be an interesting test, since remote automation usually goes straight to the pixel point.
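
A crude sketch of that kind of test (the threshold and endpoint are invented): count how many intermediate mousemove events precede a click.

    const points = [];
    document.addEventListener('mousemove', (e) => points.push([e.clientX, e.clientY]));

    document.addEventListener('click', () => {
      // A real pointer arrives via many intermediate moves; synthetic clicks
      // often arrive with few or none.
      const looksSynthetic = points.length < 5;
      navigator.sendBeacon('/mouse-entropy', JSON.stringify({ moves: points.length, looksSynthetic }));
      points.length = 0;
    });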


Most touchscreens go straight to the pixel point too.


if you want to block scrapers, just add rate limiting...


That's old hat and ineffective. Scrapers usually proxy through large lists of rotating IP addresses. There are lots of services for it.


Depends. If you also track valid navigation paths, then rate limiting may be effective.

For instance: if you have a search result with 1,000 pages that someone is trying to scrape, and you don't allow people to jump into the middle of the result set, then just rotating IPs doesn't work.
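
A minimal sketch of that idea using HMAC-signed page cursors (Express, the route shape, and the secret handling here are illustrative):

    const crypto = require('crypto');
    const express = require('express');

    const SECRET = process.env.CURSOR_SECRET || 'dev-only-secret';
    const sign = (page) => crypto.createHmac('sha256', SECRET).update(String(page)).digest('hex');

    const app = express();

    app.get('/search', (req, res) => {
      const page = parseInt(req.query.page || '1', 10);
      // Page 1 is open; deeper pages require the cursor handed out with the
      // previous page, so a client can't jump into the middle of the result set.
      if (page > 1 && req.query.cursor !== sign(page - 1)) {
        return res.status(403).send('invalid navigation');
      }
      res.json({ page, results: [], nextCursor: sign(page) });
    });

    app.listen(3000);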


I'll eat you through a proxy network then, unless you want to slow down your legitimate users too.


I now work for a company that is gathering metadata on the IP address space (in an effort to reduce the amount of abuse that sites and service providers have to deal with).

It won't be very long before it'll be possible to identify most of the common proxying networks and block those.

Scrapers can respond by setting up something like an ssh tunnel from a residential high speed connection to a remote server (so that scraping run on the remote server appears to come from the residential connection) and using that as a proxy, but it adds another level of effort.

(I am generally on the scrapers' side on this and think it's ultimately futile to try to block all scrapers. Website administrators need to accept that anything they put on the internet is public and get over it.)


Just out of curiosity, how are you going to handle Luminati? They're bot-networking home users all over the world in exchange for free VPN.


That's pretty neat.

For our purposes, those residential IPs that are used maliciously through this service will just hurt the reputation of the ISP they belong to. I suspect we'll see something shake out in the data where this activity is limited to some kind of specific demographics (ISP, netrange, geographic location) and shouldn't interfere too badly with the system as a whole.


That's what I thought. In case you were wondering, as a Luminati user (on the proxy end, not the VPN end, thank god), my experience has been higher latency and a higher request fail rate than on traditional proxy networks. It's definitely not my first weapon in the arsenal, but since the target, for multiple reasons, already has effectively high latency, and my users are sufficiently motivated to wait for what I'm delivering, it's a pretty effective last resort.


wow. hadn't seen this before.


Maybe companies should instead focus on improving their network infrastructure and software architecture so they can eat any amount of scraper traffic, since at some point it becomes indistinguishable from a DDoS, which also needs to be handled...


It's not about DDoS. Think Amazon. They probably don't want to reveal all of their inventory, their price history, and so on. Even their API has per-second rate limiting, and I doubt it's because of the load.


There is a multi-million dollar industry around blocking scrapers, it's not "just add rate limiting".


    while ScraperBlocked:
        scraper.sleep(5)


All those tests are useless; they're effective only against script kiddies (who are now, by old standards, something like 99.99999% of developers) who are unable to code in anything but crappy languages like JS. For people who grew up with the web and are capable of coding in C/C++, those tests are a joke: I'll just modify the source code to return what is expected, and 'game over'. We used to reverse DRM by disassembling and patching binaries; in a world of text-based protocols and scripts, the idiocracy of today's world makes us invincible.


What does C/C++ have to do with this, when the point of the article is showing that they can be defeated using JS?


JS runs within a C/C++ JS engine that can be modified to return whatever fake results you want. You can't prevent that. As always, any lock is cheaper to defeat than to create.


No, you misunderstood; you can defeat the detection using JS. You don't need C at all.



