
How to Crawl the Web Politely with Scrapy - stummjr
https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
======
markpapadakis
In the past we built and operated Greece's largest search engine (Trinity), and
we would crawl/refresh all Greek pages fairly regularly.

If memory serves, the refresh frequency was computed for clusters of pages from
the same site, and it depended on how often they were updated (news sites'
front pages were in practice different on successive crawls, whereas e.g. users'
home pages rarely changed) and on how resilient the sites were to aggressive
indexing (if they'd fail or time out, or it'd take longer to download the page
contents than we expected based on site-wide aggregated metrics, we'd adjust
the frequency, etc.).

The crawlers were all draining multiple queues, but URLs from the same site
would always end up on the same queue (via consistent hashing of the
hostname), so a single crawler process was responsible for throttling requests
and respecting robots.txt rules for any given site, with no need for
cross-crawler state synchronisation.
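
A minimal Python sketch of that kind of hostname-based routing -- a simple
hash-and-modulo stand-in for the consistent-hashing scheme described above,
with the queue count chosen arbitrarily for illustration:

```python
import hashlib
from urllib.parse import urlsplit

NUM_QUEUES = 16  # arbitrary illustrative value


def queue_for_url(url: str) -> int:
    """Map a URL to a queue index by hashing its hostname, so every URL
    from the same site lands on the same queue and a single crawler
    process handles that site's throttling and robots.txt rules."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_QUEUES


# Both URLs share a hostname, so the same crawler owns them:
assert queue_for_url("http://example.com/a") == queue_for_url("http://example.com/b")
```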

In practice this worked quite well. Also, this was before Google and its
PageRank and social networks (we ‘d probably have also considered pages
popularity based on PageRank like metrics and social ‘signals’ in the
frequency computation, among other variables).

~~~
greglindahl
In the current web, sites like Amazon are so large that you'll need many
crawlers. On the plus side, it appears that almost all large sites don't have
rate limits.

~~~
stummjr
Crawl-delay is not part of the standard robots.txt protocol, and according to
Wikipedia, some bots interpret the value differently. That may be why many
websites don't even bother defining rate limits in robots.txt.
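
For what it's worth, Python's standard-library robots.txt parser will report a
Crawl-delay when one is defined; a small sketch, with the site URL and
user-agent string as placeholder assumptions:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and user-agent string, purely for illustration.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

delay = rp.crawl_delay("mybot")                         # None if no Crawl-delay line
ok = rp.can_fetch("mybot", "https://example.com/page")  # standard Allow/Disallow check
print(delay, ok)
```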

~~~
greglindahl
I was referring to an actual rate limit, not crawl-delay. For example, YouTube
is pretty strict about rate limits:

[http://www.bing.com/search?q=%22We+have+been+receiving+a+lar...](http://www.bing.com/search?q=%22We+have+been+receiving+a+large+volume+of+requests+from+your+network%22+site%3Ayoutube.com)

I agree that crawl-delay is rare, and often it's set too long so that it's
impossible to fully crawl a site -- as if the webmaster set it up 10 years ago
and never updated it as their site got faster and bigger.

------
elorant
In my experience the best way to crawl politely is to never use an
asynchronous crawler. The vast majority of small to medium sites out there
have absolutely no protection against an aggressive crawler. If you make 50 to
100 requests per second, chances are you're DDoS-ing the shit out of most
sites.

As for robots.txt, the problem is that most sites don't even have one,
especially e-commerce sites. They also don't have a sitemap.xml, in case you'd
rather not hit every URL just to discover the structure of the site. Being
polite in many cases takes considerable effort.
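
A minimal sketch of the kind of deliberately slow, synchronous crawl loop
being described here; the URLs, the requests library, and the one-second delay
are all illustrative assumptions:

```python
import time

import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

for url in urls:
    resp = requests.get(url, timeout=30)
    # ... parse resp.text here ...
    time.sleep(1.0)  # at most ~1 request/second, nowhere near 50-100/s
```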

~~~
greglindahl
Search engine crawlers use adaptive politeness: start being very polite, and
ramp up parallel fetches if the site responds quickly and has a lot of pages.
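
A rough sketch of that idea (not any particular engine's actual algorithm):
keep a per-site delay and nudge it toward the observed response latency, so
fast, healthy sites get crawled harder and slow ones get left alone.

```python
def next_delay(current_delay: float, latency: float,
               target_concurrency: float = 1.0,
               min_delay: float = 0.5, max_delay: float = 60.0) -> float:
    """Nudge the per-site delay toward latency / target_concurrency:
    fast responses shrink the delay, slow responses grow it."""
    target = latency / target_concurrency
    new_delay = (current_delay + target) / 2.0  # smooth the adjustment
    return min(max(new_delay, min_delay), max_delay)


# Example: a site answering in 0.2s lets the crawler speed up gradually.
d = 5.0
for latency in [0.2, 0.2, 0.2]:
    d = next_delay(d, latency)
# d is now well below the initial 5.0 seconds
```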

~~~
stummjr
That's kind of what Scrapy's AutoThrottle extension does.
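
For reference, the politeness-related knobs in a Scrapy project's settings.py;
the values shown are only illustrative, not recommendations from the article:

```python
# settings.py -- politeness-related Scrapy settings (illustrative values)
ROBOTSTXT_OBEY = True                   # respect robots.txt rules
AUTOTHROTTLE_ENABLED = True             # adapt delays to server responsiveness
AUTOTHROTTLE_START_DELAY = 5            # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60             # back off up to this delay when the site struggles
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average concurrent requests per remote site
DOWNLOAD_DELAY = 1                      # baseline delay between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1      # don't hammer a single domain
```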

------
minimaxir
See also Tuesday's HN discussion on the ethics of data scraping
([https://news.ycombinator.com/item?id=12345952](https://news.ycombinator.com/item?id=12345952)),
in which Hacker News is _completely split_ on whether data scraping is ethical
even if the Terms of Service _explicitly forbids it_.

~~~
emodendroket
Are you trying to imply that's a ridiculous position? I don't see it as one.

~~~
hsod
I'll say it's a ridiculous position. How can it possibly be ethical? The owner
of the server and the content has specifically told you to stop sending
packets at it.

I honestly don't know how to construct an argument for this because it's so
obvious to me.

~~~
cookiecaper
It's ethical because it's a public internet. The same way you can't use the
force of law to stop a homeless guy from asking you for change as you walk
along a public street, you can't [shouldn't be able to] use force of law to
stop a client from asking your server for data as it sits connected to a
public network.

It's not unethical for a beggar to continue asking for change. It's up to the
passersby to choose whether or not they'll honor his request, but he is free
to make it as long as he doesn't get out of control. Many people see the
client-server relationship that exists online similarly. As the beggar can't
receive anything that the giver doesn't willingly give, neither can the client
receive anything the server doesn't willingly give.

It wouldn't make any sense if a guy could give the beggar change and then sue
him and say that he shouldn't have gotten change because he actually wanted to
use it for his lunch. The judge would say, "Well, why did you give it away?
You can't just change your mind and then sue someone over it." This is also
what judges should ask servers who dispense information to clients and then
try to take it back.

tl;dr there's no harm in asking for data, even after someone has told you no,
as long as you do so reasonably.

~~~
emodendroket
Strictly speaking, there are a lot of places where panhandling is not legal.

------
tangue
Reading the previous thread again, I suppose that many of those against
scraping didn't realize they've already lost: with Ghost, Phantom, and now
headless Chrome, you're going to have a hard time detecting a well-built
scraper.

Instead of fighting against scrapers that don't want to harm you, maybe it's
about time to invest in your robots.txt and cooperate.

You could say that scraping your website is FORBIDDEN, but come on: if Airbnb
can rent houses, I can scrape your site.

~~~
FussyZeus
It depends on your definition of harm. When your product is what's published
on the website, and you regularly find rip-offs of said website republishing
your content, maybe you'd feel differently about it.

~~~
duiker101
Fair enough, but I don't think that's the main purpose. There are many, many
cases where you would want to scrape something, and people would probably be
more encouraged to do so in a "polite" way if websites didn't make it so hard.

~~~
JamesBarney
Yes, or if they just provided a CSV with all the data most people want to
scrape anyway, along with a plain-English explanation of how it can be used.

------
betolink
I worked on a research project to develop a web-scale "google" for scientific
data, and we found very interesting things in robots.txt files, from "don't
crawl us" to "crawl 1 page every other day", or even better, "don't crawl
unless you're Google".

Another thing we noticed is that Google's crawler is kind of aggressive; I
guess they are in a position to do it.

Our paper, in case someone is interested: Optimizing Apache Nutch for domain
specific crawling at large scale
([http://ieeexplore.ieee.org/document/7363976/?arnumber=736397...](http://ieeexplore.ieee.org/document/7363976/?arnumber=7363976))

~~~
AznHisoka
This is why I think Google's position as the #1 search engine will never go
away. Many sites will tell your bot to go away if you're not Google. They
don't care if you're building a search engine that will compete with Google.

~~~
greglindahl
At blekko, we did not find this issue to be a significant one... almost
everyone who banned our crawler was a crappy over-SEOed website.

~~~
AznHisoka
[https://www.linkedin.com/robots.txt](https://www.linkedin.com/robots.txt)

[https://yelp.com/robots.txt](https://yelp.com/robots.txt)

There goes all Linkedin + Yelp content from your index.

~~~
betolink
What about
[https://www.facebook.com/robots.txt](https://www.facebook.com/robots.txt)

...and medium-sized/small sites are even worse.

The irony of Facebook being a core part of all the NSA surveillance programs,
while their terms of service include their "Automated Data Collection Terms":
[https://www.facebook.com/apps/site_scraping_tos_terms.php](https://www.facebook.com/apps/site_scraping_tos_terms.php)

------
vonklaus
The current protocols promote data exchange, and since websites are primarily
designed to be consumed, there is really no way to stop automated requests.
Even companies like Distil Networks[1], which parse in-flight requests, have
trouble stopping any sufficiently motivated outfit.

I think data should be disseminated and free info exchange is great. Devs
should respect website owners as much as possible, although in my experience
people seem more willing to rip off large "faceless" sites than mom-and-pops,
both because that is where the valuable data is and because it seems more
justifiable, even if morally gray.

Regardless, the thing I find most interesting is that Google is most often
criticized for selling user data / selling out their users' privacy. However,
it is rarely mentioned that Googlebot and the army of Chrome browsers are not
only permitted but encouraged to crawl all sites except a scant few that have
achieved escape velocity. Sites that wish to protect their data must disallow
and forcibly stop most crawlers except Google, otherwise they will be
unranked. This creates an odd dichotomy where not only does Google retain
massive leverage, but any other search engine or aggregator has more hurdles
and fewer resources to compete.

[1] They protect CrunchBase and many media companies.

------
libeclipse
If you're worried that your web scraper is being a pain in the ass to
administrators, they probably need to rethink the way they have their website
set up.

------
novaleaf
An alternative to Scrapinghub: PhantomJsCloud.com

It's a bit more "raw" than Scrapinghub, but full-featured and cheap.

Disclaimer: I'm the author!

