
A Web Crawler with Asyncio Coroutines - nickpresta
http://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html
======
theVirginian
Great tutorial. I would love to see this rewritten with the new async/await
syntax in Python 3.5.

~~~
justusw
I've created a similar example in order to try out the new Python 3.5 async
syntax. While the async function bodies themselves do not change, there is
some boilerplate necessary in order to call async functions.

You can check it out right here
[https://github.com/justuswilhelm/kata/blob/master/python/cor...](https://github.com/justuswilhelm/kata/blob/master/python/coroutine.py#L35)
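For anyone skimming, here's a minimal sketch of what that looks like
(hypothetical names, not the code from the repo above): the coroutine body
reads like ordinary sequential code, but calling it only creates a coroutine
object, and the event-loop boilerplate is what actually runs it.

    import asyncio

    async def fetch(host, path='/'):
        # Each 'await' is a point where control returns to the event loop.
        reader, writer = await asyncio.open_connection(host, 80)
        request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(path, host)
        writer.write(request.encode())
        response = await reader.read()
        writer.close()
        return response

    # Calling fetch() just creates a coroutine object; nothing runs until
    # the event loop drives it -- this is the boilerplate part.
    loop = asyncio.get_event_loop()
    body = loop.run_until_complete(fetch('example.com'))
    loop.close()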

------
potatosareok
One question I have about this - and I might have missed it in the article -
is: I'm all for using asyncio to make HTTP requests, but I see they apparently
also use asyncio for "parse_links". Since parse_links should be a CPU-bound
operation, would it make sense to use fibers to download pages and pass them
into a thread pool to actually parse them and add them to the queue?
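For the asyncio side of this question, the usual pattern is
loop.run_in_executor, which hands a blocking or CPU-bound call to a pool and
gives back an awaitable future. A rough sketch, where fetch and parse_links
are stand-ins rather than the article's code:

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    def parse_links(html):
        # Stand-in for the CPU-bound parsing step.
        return [w for w in html.split() if w.startswith('http')]

    async def fetch(url):
        # Stand-in for the async download step.
        await asyncio.sleep(0.1)
        return '<a> http://example.com/next </a>'

    async def fetch_and_parse(url, loop, executor):
        html = await fetch(url)
        # Schedule parsing on the pool and await its future, so the
        # event loop stays free to run other downloads meanwhile.
        return await loop.run_in_executor(executor, parse_links, html)

    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor() as executor:
        links = loop.run_until_complete(
            fetch_and_parse('http://example.com/', loop, executor))
    loop.close()

(Note that under the GIL a thread pool only buys real CPU parallelism if the
parser releases the GIL; swapping in concurrent.futures.ProcessPoolExecutor
avoids that limit at the cost of pickling the HTML between processes.)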

I'm messing around with some of the ParallelUniverse Java fiber implementation,
and what I do is spawn fibers to download pages and send the String response
over a channel to another fiber, which maintains a thread pool to parse
response bodies as they come in and create new fibers to crawl those links.

I'm really just doing this to get more familiar with async programming and
specifically the ParallelUniverse Java libs, but one thing I'm struggling a bit
with is how to best make it well behaved (e.g. right now there's no bound on
the number of outstanding HTTP requests).
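On the asyncio side, the usual way to put a bound on in-flight requests is a
semaphore; a minimal sketch, where fetch is a hypothetical download coroutine:

    import asyncio

    MAX_IN_FLIGHT = 10  # arbitrary cap for illustration
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded_fetch(url):
        # At most MAX_IN_FLIGHT coroutines get past this point at once;
        # the rest wait here without tying up a thread.
        async with sem:
            return await fetch(url)  # hypothetical download coroutine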

------
Schwolop
This article is way more important than the web crawler example used to
motivate it. It's easily the single best thing I've ever read on asyncio, and
I've been using it in anger for a year now. I've passed it around my team, and
will be recommending it far and wide!

------
fabiandesimone
I'm working on a project that involves lots of web crawling. I'm not
technical at all (I'm hiring freelancers).

While I do have access to great general technology advice, this post is bound
to attract people well versed in crawling.

My question is: in terms of crawling speed (and I know this depends on
several factors), what's a decent number of pages a good crawler can do per
day?

The crawler I built is doing about 120K pages per day, which for our initial
needs is not bad at all, but I wonder whether in the crawling world this is
peanuts or a decent chunk of pages.

~~~
reinhardt
It doesn't make much sense to give a number for speed without some specifics
about the crawler environment, such as:

      - How many servers (if distributed)?
      - How many cores per server?
      - What kind of processing takes place for each page? Does it just
        download and save the pages somewhere (local filesystem, cloud
        storage, database), or does it extract (semi-)structured data?
        And so on.

Specifics aside, these days it's not hard to crawl millions of pages/day on
commodity servers. Some related posts:

[http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/](http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/)

[http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/](http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/)

[http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-1/](http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-1/)

[http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-2/](http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-2/)

~~~
fabiandesimone
Thank you very much!

------
Animats
It would be interesting to compare this Python approach with a Go goroutine
approach. The main question is whether Go's libraries handle massive numbers
of connections well. Since Google wrote Go to be used internally, they
probably do.

~~~
mseri
Or Erlang, or Rust.

~~~
logn
Or Java. Right now I have a web driver that uses standard Java classes for
requests, but I wonder if NIO would offer significantly better performance.

------
juddlyon
Node is well-suited for this type of thing and there are numerous libraries to
help.

------
rgacote
Appreciate the in-depth description. Look forward to working through this in
detail.

