
QVC Sues Shopping App for Web Scraping That Allegedly Triggered Site Outage - domdip
http://newmedialaw.proskauer.com/2014/12/05/qvc-sues-shopping-app-for-web-scraping-that-allegedly-triggered-site-outage/
======
johngd
My main focus for the entirety of my career has been internet-facing
consumer web applications. I have seen many, many DoS attacks, from IRC bots
to Ukrainian web scrapers to Chinese get-lucky WordPress exploit scanners.
Most of these can be ignored and blocked with little effort.

By FAR the most annoying of any of these is when Google, Bing and/or Yahoo
decide to wake up and crawl your infrastructure with little regard for your
robots.txt or webmaster settings, if available. I think they have gotten
better in recent years, but they used to be the absolute worst. It came down
to: let us DoS you, or have your ranking suffer. Suing Google, Bing or Yahoo
isn't exactly an option.

Some context: I was the lead architect/engineer combo for a CMS that hosted
~500k domains for a fairly large international company. Some days I could
log in and see them crawling every domain from A-Z. Some days I would get
hit by Google and Bing at the same time. They were the largest consumers of
data on this system.

~~~
acdha
FWIW, every time I've seen what looked like a major search engine ignoring
rate-limits (either Crawl-Delay or webmaster tools settings) a check of the
actual IPs being used showed that it was someone spoofing a well-known User-
Agent, which left you needing some other form of rate-limiting either way.
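
For what it's worth, both Google and Bing document a way to validate a claimed crawler: reverse-resolve the requesting IP, check that the hostname falls under the engine's domain, then forward-resolve that hostname and confirm it maps back to the same IP. A minimal sketch (function names are mine):

```python
import socket

# Documented hostname suffixes for Google's crawlers (Bing's resolve
# under search.msn.com); the helper names below are illustrative.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_claims_google(hostname: str) -> bool:
    """True if a reverse-DNS hostname falls under Google's crawler domains."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_google_crawler(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain suffix, then forward-resolve
    the hostname and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname_claims_google(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

The forward-confirm step is what defeats spoofing: an attacker can set any reverse-DNS name on their own IPs, but they can't make Google's DNS point that name back at them.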

~~~
johngd
Very true; however, in the incidents I'm referencing, that was definitely not
the case.

In fact, for a while Bing (the MSN bot back then) would crawl us every day at
the same time, almost on the dot.

Let me plug Project Honeypot (which I am in no way affiliated with). It is a
truly awesome, surprisingly accurate, free service that does an amazing job
of collecting heuristics on suspicious IP activity and exposing them in an
easy-to-interpret way.

[http://www.projecthoneypot.org/index.php](http://www.projecthoneypot.org/index.php)

~~~
acdha
Project Honeypot is indeed great – I've been running their collectors for
years on a few spare domains, racking up a fair number of harvesters.

------
birken
Result.ly are really a bunch of jerks. One of the most common-sense things you
can possibly do while crawling a website is to monitor the response times
and/or error rates from the sites you are crawling. If those are going up,
your crawl rate should go down, or drop to zero.
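
A minimal sketch of that back-off policy (the thresholds and names are illustrative, not anyone's production values): double the delay on server errors, 429s, or slow responses, and decay back toward the base rate on healthy ones.

```python
import time

class AdaptiveThrottle:
    """Crawl-rate governor: slow down when the target site shows distress."""

    def __init__(self, base_delay=1.0, max_delay=300.0, slow_threshold=2.0):
        self.base_delay = base_delay          # seconds between requests, when healthy
        self.max_delay = max_delay            # cap so a flaky site doesn't stall forever
        self.slow_threshold = slow_threshold  # response time considered "slow"
        self.delay = base_delay

    def record(self, status_code: int, response_seconds: float) -> None:
        if status_code >= 500 or status_code == 429 or response_seconds > self.slow_threshold:
            # Site is struggling (or telling us to back off): double the delay.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Healthy response: drift back toward the base rate.
            self.delay = max(self.delay * 0.9, self.base_delay)

    def wait(self) -> None:
        time.sleep(self.delay)
```

Call `record()` after every response and `wait()` before the next request; errors and slowdowns then throttle the crawl automatically.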

There is one form of internet justice here: QVC should file abuse complaints
with the ISPs that host those IPs. I've found abuse complaints are the best
way to stop people from using IPs for bad activities (excessive scraping,
spamming, etc.).

~~~
greglindahl
From the complaint, it seems that Result.ly was crawling through proxies...
which means QVC doesn't know whom to complain to:

    
    
    The complaint alleges that the defendant disguised its
    web crawler to mask its source IP address and thus
    prevented QVC technicians from identifying the source of
    the requests and quickly repairing the problem.

Your comment about crawler politeness is spot-on.

~~~
birken
My reading of that is they were splitting their crawling amongst a large block
of IPs they had. If they are using proxies, that is much easier, because you
can just block them all without any worry of accidentally blocking real
consumers (in addition to the fact that you can also file abuse complaints). I
think at Thumbtack we blocked all AWS IPs, a good deal of foreign IPs, and the
specific IPs of people who were abusively crawling us.
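
A sketch of that kind of blanket block, using Python's stdlib `ipaddress` (the CIDRs below are illustrative placeholders, not a live copy; AWS publishes its real, current ranges as JSON at ip-ranges.amazonaws.com):

```python
import ipaddress

# Placeholder CIDRs standing in for a downloaded copy of
# https://ip-ranges.amazonaws.com/ip-ranges.json
BLOCKED_CIDRS = [
    ipaddress.ip_network(c)
    for c in ("52.0.0.0/11", "54.144.0.0/12", "3.80.0.0/12")
]

def is_blocked(ip: str) -> bool:
    """True if the client IP falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_CIDRS)
```

In practice you'd refresh the list periodically and do the check at the load balancer or edge rather than in the application, but the membership test is the same idea.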

At the same time, the onus shouldn't be on QVC to have to block this;
Result.ly should either be a good citizen or face a lawsuit. Granted, QVC's
tech team should be able to deal with this, because next time the party
DoSing them might not be a US entity that can be sued, but that isn't
entirely relevant in this situation.

~~~
greglindahl
How do you block a proxy network where each IP makes only one request? I've
seen a lot of that at blekko; we're a big target for scrapers. I have no idea
if that's what was happening to QVC, given the non-technical nature of the
article.

~~~
birken
You likely have more experience with this than I do, but it seems to me that
getting access to 36,000 IPs (the quoted maximum per minute) to make a
single request each would be extremely difficult. And even if you could,
most of them would probably fall in similar blocks, which would make them
easier to detect and stop. It is just really hard and/or really expensive to
get access to that many IPs (short of running a large botnet, which would
also be hard, expensive, and illegal).
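
A sketch of that kind of detection (the prefix granularity and threshold here are arbitrary choices): bucket request IPs by /16 prefix and flag the hot buckets, since a rented block lights up a few prefixes even when no single IP repeats.

```python
from collections import Counter

def suspicious_prefixes(ips, threshold=50):
    """Count requests per /16 prefix of dotted IPv4 addresses; return
    the prefixes whose volume meets the threshold. Adjacent IPs from one
    allocation collapse into a few very hot prefixes."""
    counts = Counter(".".join(ip.split(".")[:2]) for ip in ips)
    return {prefix: n for prefix, n in counts.items() if n >= threshold}
```

With the flagged prefixes in hand you can rate-limit or block the whole CIDR range rather than chasing individual addresses.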

Based on QVC's inability to stop this, I think their tech competence is
probably fairly low, so my guess is that Result.ly would just spin up 100 AWS
instances, crawl for some amount of time (with a spoofed user agent), then
shut those down and spin up another 100 instances, rinse and repeat. To QVC
it would seem like they were constantly facing a moving target with all these
different IPs, but in fact it was all just AWS.

The complaint is of course inaccurate in that it is impossible to lie about
the source of a request: whatever IP the request came from is the IP it came
from. Perhaps in the case of an AWS instance Result.ly was only leasing it
for a short time, or in the case of a proxy it was the proxy's IP, but
either way it is still traceable to an owner somehow.

~~~
greglindahl
Yes, I have seen botnets which are >> 36,000 IPs. Again, given the lack of
technical information, I have no idea what actually happened to QVC.

------
Someone1234
> Of these and other causes of action typically alleged in these situations,
> the breach of contract claim is often the clearest source of a remedy.

That's a strange claim, given that we're talking about a "contract" that QVC
has no proof the other party read or agreed to, and for which there was no
explicit exchange ("offer" and "acceptance").

Are website contracts/terms even enforceable at all? According to this
article[0] and the case law it cites, likely not. A strange thing for a
lawyer to say, but this article makes a lot of strange claims that seem
inconsistent with US case law.

[0] [http://www.forbes.com/sites/oliverherzfeld/2013/01/22/are-we...](http://www.forbes.com/sites/oliverherzfeld/2013/01/22/are-website-terms-of-use-enforceable/)

~~~
greglindahl
Several cases have allowed contract claims where (1) the crawler operator
created accounts and (2) the account-creation flow had a checkbox for the
user to indicate agreement to the contract.

Trying to enforce a contract on a crawler that's just fetching pages without
ever checking a box is much more difficult; there have been many failures in
the past.

------
Xorlev
Having been on both sides of the coin: once you hit 600 reqs/s without a
prior arrangement, that almost qualifies as a DoS attack. If they'd
maintained 200-300 req/min, that would have been pretty acceptable.

------
Spoom
Honestly, you _really_ shouldn't have to hit "36,000 requests per minute"
scraping a website for price updates. Can someone explain if there is any
scenario in which this is reasonable? Do QVC's prices change that often?

~~~
tomjen3
Just a guess: they have a lot of different things they sell? It may also go
through a partial checkout flow for each item (to find the shipping rates,
etc.).

Still, yeah, that is too much.

------
swalsh
I have mixed feelings about this. On the one hand, the bot seems to have been
a really bad netizen. On the other hand, I hate the idea of there being a
precedent that you can be sued for automating GET requests.

~~~
akama
I don't think they are being sued for automating GET requests. Most of the
problem here seems to be the excessive number of requests, which made the
scraping effectively a DoS attack.

------
korzun
I agree with the suit, but QVC should (by this time) have per-IP rate
limiting / throttling.

(Waits for somebody to claim that each request came from a different proxy.)
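
For what per-IP throttling could look like, here is a token-bucket sketch (the rates and names are illustrative): each IP earns tokens over time up to a burst cap, and a request with no token available is rejected.

```python
import time
from collections import defaultdict

class PerIPTokenBucket:
    """Token-bucket throttle keyed by client IP: each IP earns `rate`
    tokens per second up to `burst`; a request with no token is rejected."""

    def __init__(self, rate=10.0, burst=20.0, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        # ip -> (tokens available, timestamp of last update)
        self.state = defaultdict(lambda: (burst, clock()))

    def allow(self, ip: str) -> bool:
        tokens, last = self.state[ip]
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[ip] = (tokens - 1.0, now)
            return True
        self.state[ip] = (tokens, now)
        return False
```

In production this lives at the edge (nginx's limit_req implements the same idea), with an eviction policy so the per-IP table doesn't grow without bound.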

~~~
greglindahl
From TFA:

    
    
    The complaint alleges that the defendant disguised its
    web crawler to mask its source IP address and thus
    prevented QVC technicians from identifying the source of
    the requests and quickly repairing the problem.

~~~
korzun
Yes. Did you read/understand what I posted? Especially the last part, where I
predicted that somebody would crawl out and post exactly what you just did?

