Depending on your industry etc., it may be viable to take the legal route. If you suspect who the scraper is, you can deliberately plant a 'trap street', then look for it in your suspect's data. If it shows up, let loose the lawyers.
Of course, the very best solution is not to care about being scraped. If your problem is the load it puts on your site, provide an API instead and make it easily discoverable.
That said, this is a very particular scenario, and I don't think you can generalize the effectiveness of the technique from this example. In situations where attackers are looking for boolean responses, e.g. to verify email addresses, credit cards, usernames, etc., this can work well.
But in most situations a crawler is looking for particular data. Moreover, it will have a strong expectation (and understanding) about what this data should look like. Poisoning the well won't be as effective in that scenario, because a scraper will be aware that the data is incorrect.
In contrast, people testing credit cards don't need your data, they just need data from somewhere that verifies the card numbers. In that case it's easier to just keep rotating targets until they've tested all the cards instead of bootstrapping an in-house team to write bespoke crawlers.
Most e-commerce sites will use a payment processor such as Braintree or Stripe so they don't even have to deal with PCI.
PCI applies to entities who store and process their own payments, e.g. Target, Amazon.
Quite a while ago, I was involved in baiting an attacker in a somewhat different way, but with the same goal (destroying the value we were providing to them). After the attacker figured out what was going on, they issued a somewhat credible threat to damage the company's machines (they included some details demonstrating they had access to a couple internal machines at some point), attacked and DOSed them, and persistently tried spearphishing them for months afterwards.
I guess I'd just say, (a) doing this sort of thing responsibly sucks a lot of the fun out of it, and (b) don't underestimate the risks of things going pear-shaped. You could be buying yourself a lot of ongoing grief. Something as everyday as that spearphishing attack can be nerve-wracking - even after annoying the crap out of everyone by repeating how to be careful with email, there's no way to be sure it won't hit, and the next thing you know people are sitting around with upper management having conversations nobody wants to have about network segmentation and damage mitigation.
Good on you for sticking through it.
Also poisoning only works for a while. As soon as they detect the poisoning, they can easily figure out what tripped the scraping detection, and now you need to poison in an even more subtle way because the scrapers know what to look for.
You can't win this game. The "obvious" type of browser checks in particular, where you can see the JS probing your browser, are easy to circumvent precisely because you can tell what they're doing.
Really though, I've reversed a few fingerprinting libs in the wild and also looked at some counter measures that are being sold in the blackhat world. Both sides completely and utterly suck.
The fingerprinting stuff is easy to circumvent if you're willing to reverse a bit of minified JS, and the browser automation 'market leader?' is comically bad - it took me 15 minutes to reliably detect it.
That browser/profile/fingerprint automation thingy in the default option sometimes uses a Firefox fingerprint while using Chrome. Protip: Chrome and Firefox send HTTP headers in different order. Detected, passive, without JS.
Claiming you're on Windows while actually running on a Linux VM? Your TCP fingerprint gives it away... Detected, passive, without JS.
Really, both sides, step up your game, this is still boring! Or, just stop fighting scrapers, you can't win.
This is interesting, I’d not heard of this approach. What’s the technique for generating such a fingerprint?
Just search for "ttl windows vs linux" to find tons of articles about it :)
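Concretely, the trick relies on OS default initial TTLs (commonly 64 for Linux/macOS, 128 for Windows, 255 for some network gear) and the fact that each router hop decrements TTL by one. A rough sketch of the idea:

```python
def guess_os_from_ttl(observed_ttl: int) -> str:
    """Guess the sender's OS from the TTL seen on arriving packets.

    Typical initial TTLs: Linux/macOS 64, Windows 128, network gear 255.
    Each router hop decrements TTL by one, so round the observed value
    up to the nearest common initial value.
    """
    for initial, os_name in ((64, "linux/macos"), (128, "windows"),
                             (255, "network-device")):
        if observed_ttl <= initial:
            return os_name
    return "unknown"

def ua_contradicts_ttl(user_agent: str, observed_ttl: int) -> bool:
    """Flag a User-Agent claiming Windows when the TTL says otherwise."""
    claims_windows = "windows" in user_agent.lower()
    looks_windows = guess_os_from_ttl(observed_ttl) == "windows"
    return claims_windows and not looks_windows
```

This is a heuristic, not a proof: admins can change the default TTL, and paths longer than 64 hops (rare) would break the rounding. But as a passive signal it's cheap and requires no JS.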
Having written bespoke scraping systems professionally, I think you overestimate the applicability of this technique.
For one thing, detecting that a scraper is a scraper is the problem, not the prelude to the problem. You might as well block them at that point, if you feel you can reliably detect them. If nothing else it's more resource efficient than sending them fake data.
Second, and more importantly, "poisoning the well" is not going to work against a sophisticated scraper. I used to use massive amounts of crawled web data to accurately forecast earnings announcements months in advance. I've also consulted with various companies building distributed crawling systems or looking for ways to develop integrations without public APIs.
My colleagues would know very quickly if something was wrong with the data because our model would suddenly be extremely out of whack in ways that could be traced back to the website's behavior. We used to specifically look for this sort of thing and basically eyeball the data on a daily basis. We even had tools in place that measured the volume, response time and type of data being received, and would alert us if the ingested data went more than one standard deviation outside of the expectation in any of these metrics.
You might succeed in screwing up whatever the data is being used to inform for a little while, but you will, with near certainty, show your hand by doing this, and scrapers will react in the usual cat and mouse game. Modern scraping systems are extremely sophisticated, and success stories are prone to confirmation bias, because you're mostly unaware when it's happening successfully on your website. For a while I was experimenting with borrowing (non-malicious) methods from remote timing attacks to identify when servers were treating automated requests differently from manual requests instead of simply dropping them. The rabbit hole of complexity is sufficiently deep that you could saturate a full research team with work to do in this area.
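The standard-deviation alerting described above could look roughly like this; the metric names and the one-sigma threshold are illustrative:

```python
import statistics

def anomalous_metrics(history: dict[str, list[float]],
                      today: dict[str, float],
                      threshold: float = 1.0) -> list[str]:
    """Return the names of ingestion metrics (volume, response time, etc.)
    whose latest value sits more than `threshold` standard deviations
    from the historical mean."""
    flagged = []
    for name, values in history.items():
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        if stdev and abs(today[name] - mean) / stdev > threshold:
            flagged.append(name)
    return flagged
```

A daily cron job feeding this with the last N days of ingestion stats and paging on a non-empty result would reproduce the kind of alert described.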
If you want to productively block scrapers, you should consider using a captcha-based system at the application layer, preferably a captcha that hasn't been broken yet and which can't be outsourced to a mechanical turk-based API. If nothing else, doing that will introduce at least 10-20 seconds of latency per request, which might be intolerable for many scrapers even if they're quite sophisticated.
In practice I'm talking about data which is very multidimensional and has more nuanced metrics than something like a price point. As an example, let's say a certain company exposes sequentially crawlable online orders without authentication. A scraper is going to reason about the data instead of just accepting it, and it's going to look for relatively minute errors. In particular, if it suddenly pulls in many more orders in a single day, or if a particular product shoots up in popularity, or if the price of an item in inventory suddenly changes from what it was historically, an alert will fire off.
As a real world example, I was working on a project to scrape all orders for a publicly traded fast food company that offered online orders and delivery. In the middle of one quarter, without warning, we suddenly saw what looked like legitimate activity, but with noticeably different order data, in a way that significantly increased revenue projections. My colleagues and I basically didn't trust the data whatsoever until we could find a localized (and not well publicized online) promotion that accounted for the changes.
Basically, I'm saying that it can be done, but it's hard and not a silver bullet. If the data is being used as part of a timeseries, then data meaningful enough to disrupt operations is going to be noticed and manually reviewed to identify a narrative that explains it; if the data is believably false but not meaningful enough to change a trend, it's not really going to matter to them anyway.
Wow. That's starting to get really shady.
Note to self: All things facing the web in ecommerce must be cryptographically randomized. :(
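In practice that note amounts to: never expose sequential or otherwise guessable identifiers. A sketch using Python's `secrets` module for opaque order IDs:

```python
import secrets

def new_order_id() -> str:
    """Opaque, unguessable order identifier.

    Unlike a sequential order number, knowing one ID reveals nothing
    about how many orders exist, what the next ID will be, or how to
    enumerate the rest.
    """
    return secrets.token_urlsafe(16)  # ~128 bits of randomness
```

The ID is purely a lookup key; the actual sequence number stays internal and is never sent to the client.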
Yes, there is a massive amount of crawling that happens in the service of financial forecasting. Satellite imagery, drones, web pages, API endpoints, receipt data (free financial aggregators sell this), location data (free geolocation services sell this), purchase history (free email clients sell this), etc. This rabbit hole goes very deep. Some of it is actively sold by free services for "market research", some of it is collected from sources that don't bother with authentication or access control and make all of their data public.
> Note to self: All things facing the web in ecommerce must be cryptographically randomized. :(
The bigger issue is that sensitive information should be behind authentication. If you require authentication instead of just a "browsewrap" terms and conditions clause at the bottom of the page, scraping that data becomes legally actionable if it's ever discovered. Otherwise you're relying on (at best) a robots.txt to do the enforcement for you.
Many companies overlook this, even if they're told about it, because it doesn't compromise users or constitute a security vulnerability. So savvy firms take the data as a competitive advantage and use it for market research.
This is really interesting. Was it using sentiment analysis from news-like sources, or tracking prices/releases, or some other information?
I'm thinking of getting into scraping as a (weird) hobby and am still doing high-level research in the field. The point I've quoted stands out to me as one of the biggest hurdles to leap.
Firstly, I'm aware of human-based captcha-defeating systems. You describe captchas that "cannot be outsourced to a mechanical turk-based API". I wasn't aware such systems existed; that sounds scary. Do you have any examples?
Secondly, I'm a little confused by your mention that #1 will simply raise the latency of the request to 10-20 seconds instead of completely blocking it.
Thirdly, in the case where a captcha (eg, ReCAPTCHA) can be defeated by a captcha-filling service... well, such services are extremely cheap, and there's no per-account fee or overhead. It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?
Google's latest captcha (and similarly sophisticated systems) must be completed in a small window of time, changes rapidly, has a varying number of "rounds", and is extremely antagonistic to being reloaded in e.g. a frame.
Practically speaking, you can't consistently outsource that to a third party API that uses humans to click and verify the images.
> Secondly, I'm a little confused by your mention that #1 will simply raise the latency of the request to 10-20 seconds instead of completely blocking it.
Captcha systems which can be defeated by humans tolerate an extended period of time before they're solved. This means the time it takes your crawler to send the challenge to the API, receive the response, and submit it to the captcha on the page will not cause the test to fail. It will, however, introduce a significant amount of latency in your requests.
> Thirdly, in the case where a captcha (eg, ReCAPTCHA) can be defeated by a captcha-filling service... well, such services are extremely cheap, and there's no per-account fee or overhead. It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?
Sure, but what are you gaining? These things are usually priced on a per-captcha basis. So you could have multiple accounts, but you'll be paying the same as just sending multiple requests to the API, and the concurrent requests are routed via the API's backend to different human operators for a solution.
I've observed the timing thing myself - I'll sometimes hit the captcha first while I fill the rest of a form out, then it'll expire, and I'll need to do it again.
What do you mean by "change rapidly" and "varying number of "rounds""?
Regarding avoiding loading the content in a frame - what about making the initial iframe load happen on the machine of the person filling the captcha? Why can't this work?
> Practically speaking, you can't consistently outsource that to a third party API that uses humans to click and verify the images.
I'm both very curious and somewhat surprised this is the case.
>> It sounds plausible that one could simply sign up for the captcha-filling service multiple times, and submit captchas to defeat in parallel. Could this work?
> Sure, but what are you gaining? ... you'll be paying the same as just sending multiple requests to the API, and the concurrent requests are routed via the API's backend to different human operators for a solution.
This is what I'd strongly expect, but I very much wonder if captcha-filling services ratelimit the number of filled captchas per account.
In theory (now I think about it) ratelimiting would hurt the service provider so this might be unlikely, but where I want to play with massive parallelism I do wonder if sharding hundreds of requests across multiple accounts would be useful or not. Sharding across multiple providers would probably increase parallel throughput by a small amount though.
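For what it's worth, the sharding idea is easy to prototype. Here `solve` is a hypothetical stand-in for a real captcha-service call, which would actually POST the challenge to the provider's API and poll for the answer:

```python
from concurrent.futures import ThreadPoolExecutor

def solve(provider: str, captcha: bytes) -> str:
    """Placeholder for a captcha-filling service call; a real version
    would submit the challenge to the provider's API and poll until a
    human operator returns the answer."""
    return f"{provider}:answer-for-{len(captcha)}-bytes"

def solve_in_parallel(captchas: list[bytes], providers: list[str],
                      max_workers: int = 8) -> list[str]:
    """Shard captchas round-robin across providers and solve them
    concurrently, preserving submission order in the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(solve, providers[i % len(providers)], c)
                   for i, c in enumerate(captchas)]
        return [f.result() for f in futures]
```

Whether this actually increases throughput depends entirely on whether the providers ratelimit per account, which, as noted above, is an open question.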
Thanks for the feedback!
It sounds like you can outsource it, but you essentially need a remote desktop connection for the captcha solver to interact with the page. The overhead might be a bit much, but it doesn't seem impossible.
You may be right, but I would say that it shifts the balance. If the content is poisoned, the scraper will have to spend effort on quality control and they can never know for certain if you have detected their workarounds. It doesn't stop them of course, but it certainly raises the cost of scraping you at little extra cost to you.
Honest question, would this open the site up to legal liability?
You can never identify a bot 100%, so if you intentionally provided false information to an otherwise legitimate user, and that user is harmed by your false information, isn't that a breach of contract that would make you liable for damages based on the harm your bad data caused?
What if the user claimed that the site was acting maliciously, providing lower prices to cause a market reaction, à la what happened to CoinMarketCap? How would you prove that you were only targeting bots, or provide statistics showing your methods were effective rather than arbitrary? That stuff matters when money is lost.
You'd have to have a clause in your terms of service that says "if we think you are a bot then the information we give you will be bad", and then users could (and probably should) run far away, since they have no way of knowing whether you think they are a bot, and thus can't trust anything on your site.
But even aside from explicit contractual terms, you cannot just run around acting in bad faith, right? If you put up a site called AccuratePrices.com and then for some users you knowingly and intentionally provide false information, isn't that something like fraud?
Otherwise I could leave a piece of paper in a coffee shop saying, "Apple stock will go up 5% in the next month. By reading these words you agree not to trade Apple stock in the next month unless you pay me a royalty." Reading something does not mean agreeing to its terms, and if accessing the content does not require agreeing to its terms, then the terms may as well not exist.
It doesn't mean you can ignore everything there. If you reproduce the content then you are still liable under copyright laws. If you DDoS the site, you are still liable under tortious interference laws.
But outside of something covered by other laws, there is no legal recourse to just putting "no bots" in your terms of service and then suing any bots you find. You can try to _block_ bots, just like you can hide your sheet of paper when you see a cop getting coffee, but you can't sue them for breaking terms they never agreed to. I don't believe there is actually any case law yet as to whether a bot can even agree to a contract. For evidence you can look at the recent LinkedIn case, which, while it is on appeal, turned on this issue.
If you detect wrongly though, you risk alienating your users -- if I saw obviously incorrect prices on an ecommerce site, I wouldn't shop there, assuming that they are incompetent and if I enter any payment information it's going to be whisked away by hackers.
> Depending on your industry etc., it may be viable to take the legal route. If you suspect who the scraper is, you can deliberately plant a 'trap street', then look for it in your suspect's data. If it shows up, let loose the lawyers.
Even if you have an enforceable contract, you can still face an expensive legal battle if they choose to fight (and it's likely not their first fight), so hopefully you have some deep pockets.
I don't see anything wrong with it.
I agree with you there.
Also relevant: Courts: Violating a Website’s Terms of Service Is Not a Crime (https://news.ycombinator.com/item?id=16119686)
I'd say not always. Some websites attract a high number of unsophisticated scrapers, and a few clever ones. If it's high volume, blocking the former is often worth it just for the reduced load. I agree that going to war with skilled and funded scrapers is futile.
Some prick is bound to make some fancy non-DOM web framework using web assembly and turn the internet into a DRM-ridden mess.
Even without fancy HTML5 stuff anybody could have sent a prerendered image.
I remember websites which exclusively rendered in Flash. I avoided them like the plague.
Most companies do use some Flash nonsense, but given the state of that, I'm sure they'll find some new insanity to foist on us.