
Sites detecting headless browsers vs. headless browsers trying not to be detected by sites is an arms race that's been going on for a long time. The problem is that if you're trying to detect headless browsers in order to stop scraping, you're stepping into an arms race that's being played very, very far above your level.

The main context in which Javascript tries to detect whether it's being run headless is when malware is trying to evade behavioral fingerprinting, by behaving nicely inside a scanning environment and badly inside real browsers. The main context in which a headless browser tries to make itself indistinguishable from a real user's web browser is when it's trying to stop malware from doing that. Scrapers can piggyback on the latter effort, but scraper-detectors can't really piggyback on the former. So this very strongly favors the scrapers.
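
To give a sense of how lopsided this is: everything the detection JS inspects, the scraper can read (and eventually spoof) just as easily. A rough Python/Playwright sketch of what a stock headless Chromium hands over - exact values vary by version, and example.com is just a placeholder:

    # Rough sketch: what a stock headless Chromium exposes to page JS.
    # Needs `pip install playwright` and `playwright install chromium`;
    # example.com is just a placeholder target.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        signals = page.evaluate("""() => ({
            webdriver: navigator.webdriver,     // true by default under automation
            plugins: navigator.plugins.length,  // often 0 in headless builds
            languages: navigator.languages,
        })""")
        print(signals)  # these are exactly the values detection scripts read
        browser.close()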




In my experience, the most effective countermeasure to scraping is not to block, but rather to poison the well. When you detect a scraper - through whatever means - you don't block it, as that would tip it off that you are on to it. Instead you begin feeding plausible, but wrong data (like, add a random number to price). This will usually cause much more damage to the scraper than blocking would.
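
A rough sketch of the idea (Python/Flask here; lookup_price and looks_like_scraper are stand-ins for whatever data access and detection you actually have):

    import random
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def lookup_price(product_id: int) -> float:
        # Stand-in for your real catalogue lookup.
        return 19.99

    def looks_like_scraper(req) -> bool:
        # Stand-in for whatever detection you actually use
        # (fingerprinting, rate limits, header heuristics, ...).
        return "headlesschrome" in req.headers.get("User-Agent", "").lower()

    @app.route("/api/products/<int:product_id>")
    def product(product_id: int):
        price = lookup_price(product_id)
        if looks_like_scraper(request):
            # Don't block -- serve plausible-but-wrong data instead.
            price = round(price + random.uniform(1, 10), 2)
        return jsonify({"id": product_id, "price": price})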

Depending on your industry etc., it may be viable to take the legal route. If you suspect who the scraper is, you can deliberately plant a 'trap street', then look for it in your suspect's data. If it shows up, let loose the lawyers.

Of course, the very best solution is to not care about being scraped. If your problem is the load caused to your site, provide an API instead and make it easily discoverable.


Poisoning the well is very effective. We employed it at a large ecommerce company that was getting hit by carders testing credit cards on low price point items (sub-$5). We were playing cat and mouse with them for six months. We found certain attributes of the browser that the botnet was using and fed them randomized success/fail responses. After two weeks of feeding them bad data, they left and never came back. They did DDoS us in retaliation, though.


I think this is a good example of "poisoning the well" in practice, and I was in a similar position as you describe when I was working in incident response at a consumer bank a few years ago.

That said, this is a very particular scenario, and I don't think you can generalize the effectiveness of the technique from this example. In situations where attackers are looking for boolean responses, e.g. to verify email addresses, credit cards, or usernames, this can work well.

But in most situations a crawler is looking for particular data. Moreover, it will have a strong expectation (and understanding) about what this data should look like. Poisoning the well is not going to work as well in that scenario, because a scraper will be aware that the data is incorrect.

In contrast, people testing credit cards don't need your data, they just need data from somewhere that verifies the card numbers. In that case it's easier to just keep rotating targets until they've tested all the cards instead of bootstrapping an in-house team to write bespoke crawlers.


Man. I really like that idea, but is giving a random "success" to a supposed credit card charge PCI compliant?


PCI is largely misunderstood. Spoofing a success on a transaction has nothing to do with proper storage and transmission of credit card information.

Most e-commerce sites will use a payment processor such as Braintree or Stripe so they don't even have to deal with PCI.

PCI applies to entities who store and process their own payments, e.g. Target, Amazon.


PCI doesn't care about what I return to the customer. It covers protecting the card number. It could be in the merchant agreement, but most of those rules are never enforced. It's against card brand rules to require a person to show an ID, but all merchants do it. The merchant account agreement requires you to prevent fraudulent purchases, and that part is heavily enforced, which is why the card brands turn a blind eye to merchants asking for an ID.


The whole cat and mouse game thing... for some strange reason that sounds fun to me. Probably because I don't know the details and the workload involved in actually doing it (and it's not my money or inventory at stake). It seems like it would be exciting, in that somewhat naive juvenile-ish fantasy sort of way, to try and figure out how to mitigate the threat, implement it quickly, and deploy it to watch it play out in real time on live production servers. I don't know, maybe I have the wrong idea about the whole thing?


There are aspects that are fun, but I feel like if you're doing it right, it is stressful. You are playing an antagonistic game with bad actors, so there's risk, and you'd better be well past just gaming out the probabilities and costs there. Just because you noticed them doesn't mean they can't do damage. You'd also better get informed buy-in from other relevant departments, etc.

Quite a while ago, I was involved in baiting an attacker in a somewhat different way, but with the same goal (destroying the value we were providing to them). After the attacker figured out what was going on, they issued a somewhat credible threat to damage the company's machines (they included some details demonstrating they had access to a couple internal machines at some point), attacked and DOSed them, and persistently tried spearphishing them for months afterwards.

I guess I'd just say, (a) doing this sort of thing responsibly sucks a lot of the fun out of it, and (b) don't underestimate the risks of things going pear-shaped. You could be buying yourself a lot of ongoing grief. Something as everyday as that spearphishing attack can be nerve-wracking - even after annoying the crap out of everyone by repeating how to be careful with email, there's no way to be sure it won't hit, and the next thing you know people are sitting around with upper management having conversations nobody wants to have about network segmentation and damage mitigation.


It is incredibly fun. I set up a Google Voice number specifically for site content managers to call me when comment/review spammers came around. I was fast enough at blocking spammers' repeated attempts at evading registration blocking (email domains, IP address ranges, browser disguising) that I would usually make the spammers give up within an hour.


Your comment is why I feel automating engineers would be very hard. Cat and mouse game sounds like it requires a human... and a cat... and a mouse...


GANs are an automated ‘cat and mouse game’

https://en.m.wikipedia.org/wiki/Generative_adversarial_netwo...


Indeed, I was going to mention those, but I feel there is an element missing from even GANs that we don't quite have yet.


It's interesting that it came down to actual virtual warfare.

Good on you for sticking through it.


That only works if the scrapers are in a country where you can do something about it.

Also poisoning only works for a while. As soon as they detect the poisoning, they can easily figure out what tripped the scraping detection, and now you need to poison in an even more subtle way because the scrapers know what to look for.

You can't win this game. Especially the "obvious" type of browser checks, where you can tell the JS is inspecting your browser, are easy to circumvent, because you can see exactly what they're doing.

Really though, I've reversed a few fingerprinting libs in the wild and also looked at some countermeasures that are being sold in the blackhat world. Both sides completely and utterly suck.

The fingerprinting stuff is easy to circumvent if you're willing to reverse a bit of minified JS, and the browser automation 'market leader?' is comically bad - it took me 15 minutes to reliably detect it.

That browser/profile/fingerprint automation thingy, in its default configuration, sometimes uses a Firefox fingerprint while running Chrome. Protip: Chrome and Firefox send HTTP headers in a different order. Detected, passive, without JS.
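
Checking the order is cheap, too. A rough sketch - the "expected" prefixes below are illustrative guesses, not authoritative, and real orders drift between browser versions:

    # Rough passive check: does the order of header names match the browser
    # claimed in User-Agent? The "expected" prefixes are illustrative guesses.
    EXPECTED_PREFIX = {
        "chrome":  ["host", "connection", "user-agent", "accept"],
        "firefox": ["host", "user-agent", "accept", "accept-language"],
    }

    def header_order_suspicious(raw_request: bytes) -> bool:
        head = raw_request.split(b"\r\n\r\n")[0].split(b"\r\n")[1:]  # drop request line
        names = [line.split(b":", 1)[0].strip().lower().decode("latin-1")
                 for line in head if b":" in line]
        ua = next((line.split(b":", 1)[1].decode("latin-1").lower()
                   for line in head if line.lower().startswith(b"user-agent:")), "")
        claimed = "firefox" if "firefox" in ua else "chrome"
        expected = EXPECTED_PREFIX[claimed]
        return names[:len(expected)] != expected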

Claiming you're on Windows while actually running on a Linux VM, your TCP fingerprint gives it away... Detected, passive, without JS.

Really, both sides, step up your game, this is still boring! Or, just stop fighting scrapers, you can't win.


> Claiming you're on Windows while actually running on a Linux VM, your TCP fingerprint gives it away

This is interesting, I’d not heard of this approach. What’s the technique for generating such a fingerprint?


The most trivial check is the TCP packet's TTL (Time To Live). Windows and Linux don't have the same default value, so it's "easy" to recognize which one sent a specific packet.

Just look up "ttl windows vs linux" to find tons of articles about it :)
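
If you want to play with it, a rough scapy sketch (needs root to sniff; the buckets assume the usual defaults of 64 for Linux/macOS and 128 for Windows, minus however many hops the packet crossed):

    # Rough passive OS guess from the IP TTL of incoming TCP packets (run as root).
    # Assumes the usual defaults: Linux/macOS start at 64, Windows at 128,
    # minus a handful of hops in transit.
    from scapy.all import sniff, IP, TCP

    def guess_os(pkt):
        if IP in pkt and TCP in pkt:
            ttl = pkt[IP].ttl
            guess = "linux/macos" if ttl <= 64 else "windows" if ttl <= 128 else "other"
            print(f"{pkt[IP].src}  ttl={ttl}  looks like {guess}")

    sniff(filter="tcp and dst port 443", prn=guess_os, store=False)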



Thanks!


I don't know the actual fingerprint here but check out nmap: https://nmap.org/


We could use some professional advice from you, lawl. Please contact me at adelmo.new@gmail.com. /thanks


> When you detect a scraper - through whatever means - you don't block it, as that would tip it off that you are on to it.

Having written bespoke scraping systems professionally, I think you overestimate the applicability of this technique.

For one thing, detecting that a scraper is a scraper is the problem, not the prelude to the problem. You might as well block them at that point, if you feel you can reliably detect them. If nothing else it's more resource efficient than sending them fake data.

Second, and more importantly, "poisoning the well" is not going to work against a sophisticated scraper. I used to use massive amounts of crawled web data to accurately forecast earnings announcements months in advance. I've also consulted with various companies building distributed crawling systems or looking for ways to develop integrations without public APIs.

My colleagues would know very quickly if something was wrong with the data because our model would suddenly be extremely out of whack in ways that could be traced back to the website's behavior. We used to specifically look for this sort of thing and basically eyeball the data on a daily basis. We even had tools in place that measured the volume, response time and type of data being received, and would alert us if the ingested data went more than one standard deviation outside of the expectation in any of these metrics.
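
As a trivial sketch of that last alerting rule (the numbers here are made up):

    import statistics

    def outside_one_std(history, latest, n_std=1.0):
        # Flag values more than n_std standard deviations from the historical mean.
        mean = statistics.mean(history)
        std = statistics.stdev(history)
        return abs(latest - mean) > n_std * std

    # e.g. daily record counts ingested by the crawler (made-up numbers)
    daily_counts = [10_420, 10_515, 10_377, 10_601, 10_488]
    if outside_one_std(daily_counts, 14_902):
        print("alert: ingestion volume outside expectation -- eyeball the source site")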

You might succeed in screwing up whatever the data is being used to inform for a little while, but you will, with near certainty, show your hand by doing this, and scrapers will react in the usual cat and mouse game. Modern scraping systems are extremely sophisticated, and success stories are prone to confirmation bias, because you're mostly unaware when it's happening successfully on your website. For a while I was experimenting with borrowing (non-malicious) methods from remote timing attacks to identify when servers were treating automated requests differently from manual requests instead of simply dropping them. The rabbit hole of complexity is deep enough that you could saturate a full research team with work in this area.

If you want to productively block scrapers, you should consider using a captcha-based system at the application layer, preferably a captcha that hasn't been broken yet and which can't be outsourced to a mechanical turk-based API. If nothing else, doing that will introduce at least 10 - 20 seconds of latency per request, which might be intolerable for many scrapers even if they're quite sophisticated.


Choosing a believable alteration is just as important as detecting a browser. For price scraping, even 3% changes to prices can do a lot of damage and are much harder to detect than completely bogus data.
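
As a sketch, you'd probably want the perturbation keyed to the client and item, so a suspected scraper sees a consistent wrong price rather than jitter it could average away:

    import hashlib

    def poisoned_price(real_price: float, client_fp: str, item_id: str,
                       max_pct: float = 0.03) -> float:
        # Shift the price by up to +/-3%, deterministically per (client, item),
        # so repeated requests don't expose the jitter.
        digest = hashlib.sha256(f"{client_fp}:{item_id}".encode()).digest()
        frac = int.from_bytes(digest[:4], "big") / 0xFFFFFFFF * 2 - 1  # maps to [-1, 1]
        return round(real_price * (1 + frac * max_pct), 2)

    print(poisoned_price(24.99, "fp-abc123", "sku-42"))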


That's a fair point, and it's conceivable this could work. But if the scrapers have had a legitimate source of truth in your data for a meaningful amount of time before you pull the trigger on this, it will get significantly harder.

In practice I'm talking about data which is very multidimensional and has more nuanced metrics than something like a price point. As an example, let's say a certain company exposes sequentially crawlable online orders without authentication. A scraper is going to reason about the data instead of just accepting it, and it's going to look for relatively minute errors. In particular, if it suddenly pulls in many more orders in a single day, or if a particular product shoots up in popularity, or if the price of an item in inventory suddenly changes from what it was historically, an alert will fire off.

As a real world example, I was working on a project to scrape all orders for a publicly traded fast food company that offered online orders and delivery. In the middle of one quarter, without warning, we suddenly saw activity that looked legitimate but contained noticeably different order data, in a way that significantly increased revenue projections. My colleagues and I basically didn't trust the data whatsoever until we could find a localized (and not well publicized online) promotion that accounted for the changes.

Basically, I'm saying that it can be done, but it's hard and not a silver bullet. If the data is being used as part of a timeseries, then data meaningful enough to disrupt operations is going to be noticed and manually reviewed to identify a narrative that explains it; if the data is believably false but not meaningful enough to change a trend, it's not really going to matter to them anyway.


> As an example, let's say a certain company exposes sequentially crawlable online orders without authentication.

Wow. That's starting to get really shady.

Note to self: All things facing the web in ecommerce must be cryptographically randomized. :(
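
For order references, even something as simple as this beats sequential IDs (a minimal sketch):

    import secrets

    def new_order_reference() -> str:
        # 128 bits of randomness: not enumerable the way /orders/1001, /orders/1002 is.
        return secrets.token_urlsafe(16)

    print(new_order_reference())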


> Wow. That's starting to get really shady.

Yes, there is a massive amount of crawling that happens in the service of financial forecasting. Satellite imagery, drones, web pages, API endpoints, receipt data (free financial aggregators sell this), location data (free geolocation services sell this), purchase history (free email clients sell this), etc. This rabbit hole goes very deep. Some of it is actively sold by free services for "market research", some of it is collected from sources that don't bother with authentication or access control and make all of their data public.

> Note to self: All things facing the web in ecommerce must be cryptographically randomized. :(

The bigger issue is that sensitive information should be behind authentication. If you require authentication instead of just a "browsewrap" terms and conditions clause at the bottom of the page, that data becomes legally actionable if it's ever found. Otherwise you're relying on (at best) a robots.txt to do the enforcement for you.

Many companies overlook this, even if they're told about it, because it doesn't compromise users or constitute a security vulnerability. So savvy firms take the data as a competitive advantage and use it for market research.


For someone just getting into web development, would you point me to some resources that describe how to find and pull this data?


> I used to use massive amounts of crawled web data to accurately forecast earnings announcements months in advance.

This is really interesting. Was it using sentiment analysis from news-like sources, or tracking prices/releases, or some other information?


> If you want to productively block scrapers, you should consider using a captcha-based system at the application layer, preferably a captcha that hasn't been broken yet and which can't be outsourced to a mechanical turk-based API. If nothing else, doing that will introduce at least 10 - 20 seconds of latency per request, which might be intolerable for many scrapers even if they're quite sophisticated.

I'm thinking of getting into scraping as a (weird) hobby and am still doing high-level research in the field. The point I've quoted stands out to me as one of the biggest hurdles to leap.

Three things.

Firstly, I'm aware of human-based captcha-defeating systems. You describe captchas that "cannot be outsourced to a mechanical turk-based API". I wasn't aware such systems existed; that sounds scary. Do you have any examples?

Secondly, I'm a little confused by your mention that #1 will simply raise the latency of the request to 10-20 seconds instead of completely blocking it.

Thirdly, in the case where a captcha (eg, ReCAPTCHA) can be defeated by a captcha-filling service... well, such services are extremely cheap, and there's no per-account fee or overhead. It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?


> Firstly, I'm aware of human-based captcha-defeating systems. You describe captchas that "cannot be outsourced to a mechanical turk-based API". I wasn't aware such systems existed; that sounds scary. Do you have any examples?

Google's latest captcha specification (and similarly sophisticated systems) must be completed in a small window of time, change rapidly, have a varying number of "rounds", and are extremely antagonistic to being reloaded in e.g. a frame.

Practically speaking, you can't consistently outsource that to a third party API that uses humans to click and verify the images.

> Secondly, I'm a little confused by your mention that #1 will simply raise the latency of the request to 10-20 seconds instead of completely blocking it.

For captcha systems which can be defeated by humans, they will tolerate an extended period of time before they're solved. This means the amount of time it takes your crawler to send the data to the API, receive the challenge response, and then submit it to the captcha on the page will not cause the test to fail. It will, however, introduce a significant amount of latency in your requests.
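
As a rough sketch of where that latency comes from (the solver endpoints here are hypothetical placeholders, not any real service's API):

    import time
    import requests

    SOLVER = "https://solver.example.com"  # hypothetical solving service, not a real API

    def solve_captcha(site_key: str, page_url: str, timeout: float = 120.0) -> str:
        # Submit the challenge, then poll until a human sends back a token.
        # The polling loop is where the 10-20+ seconds of latency comes from.
        job = requests.post(f"{SOLVER}/submit",
                            json={"sitekey": site_key, "url": page_url}).json()
        deadline = time.time() + timeout
        while time.time() < deadline:
            time.sleep(5)  # humans are slow
            result = requests.get(f"{SOLVER}/result/{job['id']}").json()
            if result.get("status") == "solved":
                return result["token"]
        raise TimeoutError("captcha was not solved in time")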

> Thirdly, in the case where a captcha (eg, ReCAPTCHA) can be defeated by a captcha-filling service... well, such services are extremely cheap, and there's no per-account fee or overhead. It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?

Sure, but what are you gaining? These things are usually priced on a per-captcha basis. So you could have multiple accounts, but you'll be paying the same as just sending multiple requests to the API, and the concurrent requests are routed via the API's backend to different human operators for a solution.


> Google's latest captcha specification (and similarly sophisticated systems) must be completed in a small window of time, change rapidly, have a varying number of "rounds", and are extremely antagonistic to being reloaded in e.g. a frame.

Hmmm.

I've observed the timing thing myself - I'll sometimes hit the captcha first while I fill the rest of a form out, then it'll expire, and I'll need to do it again.

What do you mean by "change rapidly" and "varying number of "rounds""?

Regarding avoiding loading the content in a frame - what about making the initial iframe load happen on the machine of the person filling the captcha? Why can't this work?

--

> Practically speaking, you can't consistently outsource that to a third party API that uses humans to click and verify the images.

I'm both very curious and somewhat surprised this is the case.

--

>> It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?

> Sure, but what are you gaining? ... you'll be paying the same as just sending multiple requests to the API, and the concurrent requests are routed via the API's backend to different human operators for a solution.

This is what I'd strongly expect, but I very much wonder if captcha-filling services ratelimit the number of filled captchas per account.

In theory (now that I think about it) ratelimiting would hurt the service provider, so this might be unlikely, but where I want to play with massive parallelism I do wonder whether sharding hundreds of requests across multiple accounts would be useful. Sharding across multiple providers would probably increase parallel throughput by a small amount, though.

Thanks for the feedback!


> Practically speaking, you can't consistently outsource that to a third party API that uses humans to click and verify the images.

It sounds like you can outsource it, but you essentially need a remote desktop connection for the captcha solver to interact with the page. The overhead might be a bit much, but it doesn't seem impossible.


> Having written bespoke scraping systems professionally, I think you overestimate the applicability of this technique.

You may be right, but I would say that it shifts the balance. If the content is poisoned, the scraper will have to spend effort on quality control and they can never know for certain if you have detected their workarounds. It doesn't stop them of course, but it certainly raises the cost of scraping you at little extra cost to you.


> Instead you begin feeding plausible, but wrong data (like, add a random number to price). This will usually cause much more damage to the scraper than blocking would.

Honest question, would this open the site up to legal liability?

You can never identify a bot 100%, so if you intentionally provided false information to an otherwise legitimate user, and that user is harmed by your false information, isn't that a breach of contract that would make you liable for damages based on the harm your bad data caused?

What if the user claimed that the site was acting maliciously, providing lower prices to cause a market reaction a la what happened to coinmarketcap? How would you prove that you were only targeting bots, or provide statistics showing your methods were effective rather than arbitrary? That stuff matters when money is lost.

You'd have to have a clause in your terms of service that says "if we think you are a bot then the information we give you will be bad". Then users could (and probably should) run far away, since they have no way of knowing whether you think they are a bot or not, and thus can't trust anything on your site.


> if you intentionally provided false information to an otherwise legitimate user, and that user is harmed by your false information, isn't that a breach of contract that would make you liable for damages based on the harm your bad data caused?

What contract?


Terms of service. It is not considered a contract if you are just an anonymous user, but once you take an affirmative action like creating an account, you're explicitly agreeing to those terms and they are legally binding. The parent comment did not specify anonymous, and plenty of bots create accounts to scrape with.

But even aside from the explicit contractual terms, even without one, you cannot just run around acting in bad faith, right? If you put up a site called AccuratePrices.com and then for some users you knowingly and intentionally provide false information, isn't that something like fraud?


Since when do anonymous users not have to adhere to publicly-stated terms of service?


Short answer is, they never were bound by them. You can't just leave a contract around and say that anyone who gets nearby and/or reads it is now bound by it. It requires some type of affirmative act to acknowledge that you read and understand the contract and are voluntarily entering into a formal relationship. That's what that "I have read and agree to the terms of service" checkbox on most sites is for.

Otherwise I could leave a piece of paper in a coffee shop saying, "Apple stock will go up 5% in the next month - by reading these words you agree not to trade Apple stock in the next month unless you pay me a royalty." Reading something does not mean agreeing to its terms, and if the content does not require agreeing to the terms in order to access it, then the terms may as well not exist until you do.

It doesn't mean you can ignore everything there. If you reproduce the content, you are still liable under copyright law. If you DDoS the site, you are still liable under tortious interference laws.

But outside of something covered by other laws, there is no legal recourse to just putting "no bots" in your terms of service and then suing any bots you find. You can try to _block_ bots, just like you can hide your sheet of paper when you see a cop getting coffee, but you can't sue them for breaking terms they never agreed to. I don't believe there is actually any case law yet as to whether a bot can even agree to a contract. For evidence you can look at the recent LinkedIn case, which, while it is on appeal, turned on this issue.


The new improved version of a "trap street". https://en.wikipedia.org/wiki/Trap_street


> Instead you begin feeding plausible, but wrong data (like, add a random number to price). This will usually cause much more damage to the scraper than blocking would.

If you detect wrongly though, you risk alienating your users -- if I saw obviously incorrect prices on an ecommerce site, I wouldn't shop there, assuming that they are incompetent and that any payment information I enter is going to be whisked away by hackers.

> Depending on your industry etc., it may be viable to take the legal route. If you suspect who the scraper is, you can deliberately plant a 'trap street', then look for it in your suspect's data. If it shows up, let loose the lawyers.

That may be harder than you suspect, given that it's not illegal to violate a website's terms of use:

https://www.eff.org/document/oracle-v-rimini-ninth-circuit-o...

Even if you have an enforceable contract, you can still face an expensive legal battle if they choose to fight (and it's likely not their first fight), so hopefully you have some deep pockets.


Wow, that is the most evil thing I've heard in a while


Mailinator does the same thing -- http://mailinator.blogspot.com/2011/05/how-to-get-gmailcom-b...

I don't see anything wrong with it.


I don't see anything wrong with it either; I am just impressed by the approach.


Evil is an interesting choice of words to mean not wrong and impressive...


Considering the frequency with which not-wrong and impressive things are labeled evil, it is only a matter of time before the word loses its negative connotations. Similar to sick or awesome.


> let loose the lawyers.

I agree with you there.

Also relevant: Courts: Violating a Website’s Terms of Service Is Not a Crime (https://news.ycombinator.com/item?id=16119686)


A friend and I wrote a paper about techniques for doing this a while back (never formally published; it was just a class project). My favorite technique for avoiding automated scan environments was to show goatse.cx and then run your code in the onUnload handler when the human frantically tries to close the tab.

http://www.cs.columbia.edu/~brendan/honeymonkey.pdf


"is an arms race that's been going on for a long time"

I'd say not always. Some websites attract a high number of unsophisticated scrapers, and a few clever ones. If it's high volume, blocking the former is often worth it just for the reduced load. I agree that going to war with skilled and funded scrapers is futile.


Scraping is not a real concern for most people. Detecting advertising fraud is the much more pressing issue.


Give it time my friend: they don't need to detect headless browsers, they just need to restrict your DOM-given freedom altogether.

Some prick is bound to make some fancy non-DOM web framework using web assembly and turn the internet into a DRM-ridden mess.


Is there something like a non-DOM UI framework? Either you paint your pixels yourself or you have something equivalent to a DOM.

Even without fancy HTML5 stuff anybody could have sent a prerendered image.

I remember websites which rendered exclusively in Flash. I avoided them like the plague.


A while ago Flipboard released a canvas based UI. https://github.com/Flipboard/react-canvas

Most companies do use some Flash nonsense, but given the state of that, I'm sure they'll find some new insanity to foist on us.


It's an entirely pointless arms race, where the scraper always wins, because web browsers are just pieces of software saying the right things at the right time over TCP. The only deterrent to scraping is the cost of running a web browser to interact using a real JavaScript stack, so for some really lame, spammy uses of scrapers (the most common), headless browsers are often not cost effective.


This is also the game people in click fraud vs click fraud detection are playing.



