

Ask HN: What to do with competitor excessively crawling site? - CWuestefeld

Last week our site's performance suddenly went in the toilet, with the servers' CPUs pegging at 100% for extended periods. Without giving away too much that might be confidential about the parties involved, there is very good circumstantial evidence showing that this traffic is being driven by a competitor who is simply trying to scrape pricing for a catalog of a few hundred thousand products.

These bots were coming from the Amazon EC2 cloud, so it was impossible to identify them by IP address, but other patterns in their requests make them identifiable. From these, we can determine that they were using over 6Mbps of incoming bandwidth just for making their requests; at least the top 15 client IPs came from this source rather than from legitimate customers. They have never requested our robots.txt file to see how we'd like bots to behave.

Because the usage is high enough that it's pegging the CPUs on our web farm, it's actually causing degraded service for legitimate customers, so it's important to me that we do something about it. It's also a headache for administration of the web servers: with usage so high, it's difficult to do maintenance without affecting the customer experience.

Note that they've actually been doing this for some time, but they shifted into high gear last week. Previously, I'd just put filtering into my usage reports so that their scraping wouldn't affect our BI reports. But now they're doing material damage.

What avenues might we pursue to address this? We could devote resources to finding a way to throttle users' bandwidth, but this is difficult because their usage is distributed, and we have many large corporate customers going through proxies, making them also appear to be extremely high-usage single clients (from an IP perspective). We could talk with the competitor, threatening legal action or maybe even offering to trade data with them directly. Any other suggestions?
======
amoore
I'd identify them and then deny them.

To identify them, I'd use a few tricks. One would be to place invisible links
on some of your pages. These could be empty anchor tags, or links hidden by
javascript so that normal users never see or follow them. Anything that
requests them is probably a bot or something odd. Give such clients a
particular cookie, or mark their IP address as problematic. Another way to
identify them is to look at browsing behavior: requests per minute, the
User-Agent string, not fetching images, and the "other patterns in their
requests" that you mention.
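
A rough sketch of the trap-link check, assuming a Python/WSGI stack purely
for illustration (the trap path and middleware are placeholders, not anything
your site necessarily runs):

    # Flag any client that requests a URL reachable only via a hidden link,
    # e.g. <a href="/catalog/specials-archive" style="display:none"></a>.
    import time

    TRAP_PATH = "/catalog/specials-archive"   # hypothetical trap URL
    flagged = {}                              # ip -> time the trap was tripped

    class TrapMiddleware:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ip = environ.get("REMOTE_ADDR", "")
            if environ.get("PATH_INFO") == TRAP_PATH:
                flagged[ip] = time.time()     # mark this client as a bot
            if ip in flagged:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return self.app(environ, start_response)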

Denying them is probably easier. Put an entry for their IP in iptables or
your routing table, or use an Apache module to serve them something that
doesn't require a lot of computing power. Deny them for some period of time,
like a few minutes or hours.
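
If you want the block to lapse on its own, something like this sketch does it
at the application level (the one-hour window is an arbitrary choice; in
practice you'd probably push the block down into iptables or the load
balancer):

    # Deny a flagged IP for an hour, then quietly let it back in.
    import time

    BLOCK_SECONDS = 3600
    blocked = {}  # ip -> unix time when the block expires

    def block(ip):
        blocked[ip] = time.time() + BLOCK_SECONDS

    def is_blocked(ip):
        expiry = blocked.get(ip)
        if expiry is None:
            return False
        if time.time() > expiry:
            del blocked[ip]   # block has lapsed
            return False
        return True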

You're right that you can offer to trade data with them directly. Build an API
and ask them to use it. If it's easier for them, they'll do it. I'm not sure
if this undermines your business or not without knowing more details.

If you don't know who the competitor is, put some bogus items on your site.
Let them crawl them and put them on their site. Then, you can search Google
for those oddly named items. Anyone who has them on their site has been
gathering data from you.

This is an arms race. You can't really stop it, but you can make the desired
behavior easier for them.

------
cd34
Have you contacted them and asked them to stop? If they continue after being
told to stop, you start to have real legal standing.

Figure out how to uniquely identify them, and use mod_rewrite or an
equivalent if you can key on a User-Agent string or something else unique.
mod_evasive might help, too.

Find a way to automate your pattern searches, e.g. x hits to the site without
fetching a single image or other static asset almost certainly means a bot,
so deny it.
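
As a sketch of that kind of automation (this assumes Apache-style combined
logs, and the thresholds are made up):

    # Flag IPs with lots of page hits and zero static-asset fetches.
    import re
    from collections import defaultdict

    LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')
    STATIC = re.compile(r'\.(?:css|js|png|jpe?g|gif|ico)(?:\?|$)')

    pages = defaultdict(int)
    assets = defaultdict(int)

    with open("access.log") as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue
            ip, path = m.groups()
            if STATIC.search(path):
                assets[ip] += 1
            else:
                pages[ip] += 1

    for ip, hits in sorted(pages.items(), key=lambda kv: -kv[1]):
        if hits > 500 and assets[ip] == 0:
            print(ip, hits, "page hits, no static assets -- likely a bot")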

80legs was doing this to a client; numerous requests to get them to stop
hammering the site went unanswered. Same with Cuil, though Cuil's bots
finally got fixed.

~~~
jdrock
Shion from 80legs here. Please try contacting us again and I'll make sure
that we adjust our crawl rate accordingly for your domain(s) or IP(s).

------
jeffmould
A few ideas:

1. File a report with Amazon.

2. Block the IPs using iptables.

3. File a DMCA complaint against them with Amazon.

4. Send a cease-and-desist letter to them, or offer to sell them the data.

~~~
CWuestefeld
I filed an abuse report with Amazon on Friday; I have received no
acknowledgement.

We blocked all the EC2 IP ranges (much as I hated to do so; and did you know
how big a chunk of the Internet that is?). They responded by re-routing
through anonymizers, especially Tor. We could play whack-a-mole with that, but
it's a manpower-intensive arms race.

 _File a DMCA complaint against them with Amazon_

Can you explain to me how the DMCA comes into this? I think that, although there's
nothing specific in our T&C (yet) disallowing this, the fact that it's
arguably at a DoS level might push this into criminal territory.

~~~
buro9
This list of Tor exit nodes is proving useful for stopping spammers on my
site: <http://proxy.org/tor.shtml>

Is this a continuation of this: <http://news.ycombinator.com/item?id=2085859>

If so, did you try the iptables rate limiting as described? Or the Varnish
rate limiting?
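
For what it's worth, the core mechanism behind both of those is a per-IP
token bucket. A minimal Python sketch of the idea (the rate and burst numbers
are illustrative, and a big shared proxy would need a much larger budget):

    # Classic token bucket: each IP earns RATE tokens/sec up to BURST,
    # and each request spends one token.
    import time

    RATE = 5.0      # tokens added per second
    BURST = 50.0    # maximum bucket size

    buckets = {}    # ip -> (tokens, last_refill_time)

    def allow(ip):
        tokens, last = buckets.get(ip, (BURST, time.time()))
        now = time.time()
        tokens = min(BURST, tokens + (now - last) * RATE)
        if tokens < 1.0:
            buckets[ip] = (tokens, now)
            return False    # over the limit; throttle or send a 429
        buckets[ip] = (tokens - 1.0, now)
        return True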

~~~
CWuestefeld
I hadn't seen that previous thread; no, this is a completely different set of
actors.

Rate limiting is difficult, because our legitimate users are frequently
Fortune 50 companies coming through a proxy. It's hard to differentiate that
heuristically. Based on the fingerprints we've observed, we might be able to
hard-code something, but if the bad guys change the factors we've picked out,
it would become whack-a-mole again. Also, at this level, the fact that we've
got multiple servers in a farm works against recognition.

The kind of crawling they're doing isn't going to be helped by caching.
They're looking for product pricing, and we have customized pricing for every
customer (Fortune 500 customers want their own specific contracts), so we
wouldn't be getting any cache hits.

~~~
epc
Do you know or have a sense of who’s controlling the crawler? Much as I hate
involving lawyers, that might be appropriate here.

I take it you don't do any sort of authentication which you could use to
differentiate your real customers from this actor?

Do you have any large binary files you could serve up once you've identified
the actor?

------
jdrock
I'd recommend:

1. Blocking all the IP addresses you've found. You can also block the entire
Amazon IP range; no real users will be coming from there.

2. Contacting Amazon and sending them your logs and evidence.

3. Contacting the competitor and threatening legal action.

Don't expect much from (2) and (3), but (1) should solve your problem quickly.

