
Show HN: Python 3.5 Async Web Crawler Example - mehmetkose
https://github.com/mehmetkose/python3.5-async-crawler
======
pixelmonkey
Guido van Rossum, the creator of Python, wrote a web crawler as a motivating
example for asyncio. You can find the code for it here:

[https://github.com/aosabook/500lines/tree/master/crawler](https://github.com/aosabook/500lines/tree/master/crawler)

And a detailed post about its design, co-written with A. Jesse Jiryu Davis,
here:

[http://aosabook.org/en/500L/a-web-crawler-with-asyncio-corou...](http://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html)

------
kmike84
I was investigating how to add asyncio / async def support to Scrapy (see
[https://github.com/scrapy/scrapy/issues/1144#issuecomment-14...](https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616)).
Small examples like the linked one look neat, but it is not all roses as you
go further. The problems are not specific to Scrapy; I think any advanced
`async def` based crawler will face them.

There are 2 challenges with async def I don't know how to solve elegantly:

1. how to integrate coroutine-based scraping code with on-disk persistent
request queues;

2. how to deallocate resources without boilerplate in coroutine-based
scraping code.

(1) is easier with callbacks-as-methods because state is passed around
explicitly (it does not live in local variables), so Scrapy can choose to
save it to disk.
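
A sketch of why callbacks make (1) tractable (the Request class here is made
up, but real Scrapy requests similarly carry a callback and a meta dict):

        import pickle
    
        class Request:
            def __init__(self, url, callback_name, meta):
                self.url = url
                self.callback_name = callback_name  # resolved on the spider later
                self.meta = meta  # explicit, picklable state
    
        # Pending work is plain data, so it can be pushed to a disk-backed
        # queue. A coroutine's equivalent state lives in stack frame locals,
        # which cannot be pickled.
        req = Request('http://example.com/page/2', 'parse_item', {'page': 2})
        blob = pickle.dumps(req)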

Example of (2) is this code:

    
    
        async def parse(self, response):
            url = ...   # some URL extracted from ``response``
            resp = await self.fetch(url)
            url2 = ...  # ... find another URL to follow

            # Here we have the problem: the first
            # response object is kept alive until
            # the second response is fully received,
            # because a local variable still
            # references it. This hurts when 10s and
            # 100s of requests are processed in
            # parallel and responses are large. With
            # callbacks, refcounting would have freed
            # the response as soon as the second
            # request started - callbacks plus
            # refcounting provide an elegant way to
            # deallocate resources.
            resp = await self.fetch(url2)
    
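For contrast, a rough sketch of the callback version (Scrapy's real API does
take Request(url, callback=...); parse_next is a made-up name):

        def parse(self, response):
            url = ...  # a URL extracted from ``response``
            # parse() returns right away and keeps no reference to
            # ``response``, so refcounting frees it immediately -
            # long before the next response arrives.
            return Request(url, callback=self.parse_next)
    
        def parse_next(self, response):
            ...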

If anyone has suggestions, please comment on
[https://github.com/scrapy/scrapy/issues/1144#issuecomment-14...](https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616).

~~~
takeda
I don't know scrapy enough to understand your first problem.

As for the second one, wouldn't using streams to process the data solve the
memory usage issue?

~~~
kmike84
The first problem is not specific to Scrapy; it is the same whether you use
inlineCallbacks from Twisted, tornado.gen, or async def: there is no way to
serialize a suspended coroutine and save it to disk, e.g. to be able to stop
the process and then restart it from the same point, or to avoid keeping the
whole request queue in memory.

Stream processing could fix the second problem, but it is not a practical
solution: in most cases one needs to build an HTML tree in memory to do
further processing (e.g. extract links). I'm also not aware of any streaming
regex libraries.

------
zedpm
This example isn't really making use of asyncio: loop.run_until_complete() is
a blocking method (note that you don't use await when calling it, as it's not
a coroutine). You'd want to use something like asyncio.wait() with multiple
futures to achieve some concurrency.
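
A minimal sketch of that pattern, assuming aiohttp for the HTTP client (any
coroutine-based fetcher works):

        import asyncio
        import aiohttp  # third-party client, not part of the stdlib
    
        async def fetch(session, url):
            # One coroutine per URL; awaiting the response yields
            # control so the other fetches can proceed meanwhile.
            async with session.get(url) as resp:
                return url, resp.status
    
        async def main(urls):
            async with aiohttp.ClientSession() as session:
                # Schedule all fetches at once and wait on them together,
                # instead of run_until_complete() per request.
                done, _ = await asyncio.wait([fetch(session, u) for u in urls])
                for task in done:
                    print(task.result())
    
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main(['http://example.com', 'http://example.org']))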

~~~
mehmetkose
Thanks for the notice. I was just trying out the Python 3.5 goodies. I've
updated the code and added a queue.

------
takeda
While you're using asyncio, your requests are still performed serially
because of the way loop.run_until_complete() is used.

------
dham
What is the advantage of this over, say, using threads? Web scraping is
pretty much all I/O, so you get big wins using threads in Python and Ruby.

~~~
poooogles
A thread in Python is a full POSIX thread, which is heavyweight compared to
an asyncio coroutine. You should get the same throughput at far lower
resource usage.
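
A rough way to see the difference (asyncio.sleep stands in for a real
request here):

        import asyncio
    
        async def fetch_stub(i):
            # A coroutine is a small Python object; an OS thread
            # reserves megabytes of stack on its own.
            await asyncio.sleep(1)  # simulated I/O wait
            return i
    
        loop = asyncio.get_event_loop()
        # 10,000 concurrent "requests" on a single OS thread; spawning
        # 10,000 POSIX threads instead would be far more expensive.
        results = loop.run_until_complete(
            asyncio.gather(*[fetch_stub(i) for i in range(10000)]))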

~~~
platz
does asyncio in python = green threads, and therefore defeat the GIL?

~~~
takeda
It does not defeat the GIL, but the way asyncio works makes the GIL
irrelevant.

In the simplest use case you have a single thread (an actual OS thread), and
as soon as an operation would block (e.g. waiting for a response from a
server) another coroutine is scheduled.

The reason the GIL is irrelevant is that only a single coroutine executes at
any given time[1]. It works much like cooperative multitasking: the newly
scheduled task executes until it hits an operation that would block
(typically an I/O operation).

[1] You can actually spin up multiple threads, each with its own event loop,
and then schedule coroutines on different threads.
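
A tiny illustration of that scheduling, with asyncio.sleep standing in for
blocking I/O:

        import asyncio
    
        async def worker(name):
            print(name, 'start')
            # The await is the cooperative yield point: this coroutine
            # suspends here and the event loop runs the other one.
            await asyncio.sleep(1)
            print(name, 'done')
    
        loop = asyncio.get_event_loop()
        # Both coroutines interleave on a single OS thread, so the GIL
        # never has to arbitrate between them.
        loop.run_until_complete(asyncio.wait([worker('a'), worker('b')]))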

------
aaront
Here's a proper example written for 3.4+:
[https://gist.github.com/madjar/9312452](https://gist.github.com/madjar/9312452)

