
Build a crawler to crawl million pages with only one machine in just 2 hours - plantpark
https://medium.com/@tonywangcn/how-to-build-a-scaleable-crawler-to-crawl-million-pages-with-a-single-machine-in-just-2-hours-ab3e238d1c22
======
beejiu
Not too long ago I built a small web crawler using Node.js, figuring that
crawlers spend most of their time waiting (e.g. downloading) and that Node.js
would therefore be well suited. At the time I found crawlers written in Python
were fairly slow, which is not a surprise. Mine is backed by Redis and is
pretty fast even as a single process.
[https://github.com/brendonboshell/supercrawler](https://github.com/brendonboshell/supercrawler)

~~~
j_s
Can you beat the article's 800 concurrent connections & 12 GB RAM used to
scrape 100,000 pages in 15 minutes, with just one process?

Not close to a real comparison without the same URLs, but still fun to
compare.

~~~
beejiu
I've just run a small test (crawling a server running locally) and it comes
out at 243 pages per second with one process. This crawls a webpage, adds its
links to the queue and saves the URL in a Redis set. This is running on a
Macbook Pro.

~~~
dchuk
So you eliminated the biggest cause of slow down in crawling, network latency,
and are asserting yours is faster?

~~~
beejiu
The selling point of Node.js is asynchronous I/O. I'm sure you mean bandwidth
rather than network latency - in which case that is really not a limiting
factor when running in a datacenter (40 Gbps inbound at Linode, for example).

~~~
plantpark
Some Python libraries, such as asyncio or gevent, can do this kind of work
asynchronously and efficiently. I will run a test with these libraries later.
In the meantime, feel free to post more details about Node.js's asynchronous
I/O. Thanks again for your comment!
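
For what it's worth, a gevent version of such a test might look roughly like
this (a sketch only, assuming gevent and requests are installed; the URL list
is a placeholder rather than anything from the article):

    from gevent import monkey
    monkey.patch_all()  # make sockets cooperative so requests yields between greenlets

    from gevent.pool import Pool
    import requests

    urls = 1000 * ['http://localhost/hello']

    def scrape(url):
        return requests.get(url).text

    pool = Pool(100)  # cap concurrency at 100 greenlets
    results = pool.map(scrape, urls)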

------
mbrumlow
Did I read that right? "it's necessary to deploy docker clusters to maximize
performance of your machine" to get the performance out of a single system?

~~~
zepolen
I had the same initial thought, but he said he's creating a crawler that is
meant to be distributed - in which case it's fine to use Docker, since it
makes deployment on multiple machines simpler.

That RAM usage, though... ugh.

~~~
chatmasta
Is there really much overhead of wrapping processes in docker containers vs
orchestrating them via a process manager? Since containers are basically a set
of mounts and namespaces, what memory overhead do containerized processes
incur that non-containerized processes do not? I am under the impression that
a container does not add very much memory overhead itself; it's the
process(es) _inside_ the containers that add memory overhead. Please correct
me if I'm wrong.

~~~
zepolen
I didn't really mean Docker was the cause of the memory usage. It might add a
little overhead, but AFAICT the article's memory usage comes from the fact
that he's using a bunch of heavy Python libraries, making each process come to
about 300 MB, and running 40 workers.

You could get the same performance within 600 MB by using 2 processes, each
running 20 threads.
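
Roughly this shape, for instance (a hypothetical sketch, not the article's
code; the URL is a placeholder):

    from multiprocessing import Process
    from multiprocessing.pool import ThreadPool

    import requests

    def worker(urls):
        # Threads spend most of their time blocked on the network, where the
        # GIL is released, so 20 per process is plenty for I/O-bound work.
        with ThreadPool(20) as pool:
            pages = pool.map(lambda u: requests.get(u).text, urls)
        print('%d pages fetched' % len(pages))

    if __name__ == '__main__':
        urls = 1000 * ['http://localhost/hello']
        procs = [Process(target=worker, args=(urls[i::2],)) for i in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()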

But I guess hardware is cheap.

~~~
chatmasta
2 processes = 2 GIL

There is no avoiding the GIL within a single Python process (even with asyncio
IIRC, though I've been using JS lately). Multiprocessing is usually the most
efficient way to execute I/O intensive, independent parallel operations. Of
course you can also run threads within each process.

I do wonder where the 300 MB of memory is coming from. Surely it can't all be
the Python interpreter? It doesn't look like he's importing 300 MB worth of
modules, unless MongoClient really is that big. If it is, he could create a
separate worker process for persisting data, so that only that process needs
to load the MongoClient module.
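
Something like this hypothetical split, for example (the names, URLs and
connection string are made up for illustration):

    # Fetch workers stay lightweight; only the writer process imports pymongo.
    from multiprocessing import Process, Queue

    import requests

    def fetch_worker(urls, out_queue):
        for url in urls:
            out_queue.put({'url': url, 'body': requests.get(url).text})

    def persist_worker(in_queue):
        from pymongo import MongoClient   # loaded in this process only
        pages = MongoClient('mongodb://localhost:27017')['crawler']['pages']
        while True:
            doc = in_queue.get()
            if doc is None:               # sentinel: shut down
                break
            pages.insert_one(doc)

    if __name__ == '__main__':
        q = Queue()
        writer = Process(target=persist_worker, args=(q,))
        writer.start()
        fetchers = [Process(target=fetch_worker,
                            args=(100 * ['http://localhost/hello'], q))
                    for _ in range(4)]
        for p in fetchers:
            p.start()
        for p in fetchers:
            p.join()
        q.put(None)   # tell the writer to finish
        writer.join()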

One explanation for the memory overhead might be conntrack tables within the
network namespace of the container. However I would expect that conntrack
table to be on the host, where SNAT is performed. As an aside, the default
Docker networking configuration is really not well suited to concurrent
network requests, whether inbound or outbound. If you can avoid NAT (and
therefore a conntrack table), that is preferable.

This stack could also benefit from tuning some kernel parameters, both within
the containers and on the host. Great blog post with details:
[https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/](https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/)

~~~
plantpark
Thanks for your comment, and great article on networking. I'm not very
familiar with NAT on Linux, so could you post more details about how it
interacts with Python or Docker - performance, advantages, or anything else?

Thanks again.

------
bpchaps
On the Clojure side of things, I recently used this [0] to scrape/parse ~4m
pages in a few hours. It's very plug-and-play, but retains a pretty decent
amount of extensibility. Parsing with Tika turned out to be extremely useful.

While it's on topic.. anyone have any other recommendations for web crawlers?
I'm particularly interested in finding unique identifiers (phone numbers,
emails) and their contexts on gov-owned websites for a project.

[0] [https://github.com/junjiemars/itsy](https://github.com/junjiemars/itsy)

~~~
plantpark
Great crawler, thanks for sharing.

~~~
bpchaps
Agreed. It has probably saved me over 100 hours of work in the past two
months.

------
Bedon292
Wouldn't it be more appropriate to use something like aiohttp?
[https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html](https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html)
With no Docker or anything like that.

I know the benchmarks cannot really be compared, since one involves a queue
and Mongo while the other does not. But it seems like a prime use case for
async.
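
Something along these lines, perhaps (a sketch only, assuming aiohttp is
installed; the URL list is a placeholder rather than anything from the
article):

    import asyncio

    import aiohttp

    urls = 1000 * ['http://localhost/hello']

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def main():
        # Cap concurrent connections so one event loop doesn't flood the target.
        connector = aiohttp.TCPConnector(limit=100)
        async with aiohttp.ClientSession(connector=connector) as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        results = loop.run_until_complete(main())
        print(len(results))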

~~~
zepolen
You could also use multiprocessing; I got about 500 req/s returning a 'hello
world' response (which the article also does). The article does about
300 req/s, but that's because he saturates his pipe. In reality the article's
setup might well be faster than 1,000,000 pages/hour.

    
    
        from multiprocessing import Pool
        from requests import get
        urls = 1000 * ['http://localhost/hello']
        def scrape(url):
            return get(url).text
        if __name__ == '__main__':  # guard for platforms that spawn rather than fork
            p = Pool(40)            # 40 worker processes
            results = p.map(scrape, urls)
    

~2.2 seconds on a dual-core 2.2 GHz machine.

~~~
plantpark
Thanks for your comment. If you ran the same test against a cloud server or
some public website, the numbers would probably drop somewhat.

I've used multiprocessing/threads/gevent/asyncio before, and I will run a full
test with these libraries.

Thanks again!

~~~
zepolen
As I said, the benchmark is flawed since it's dependent on the network pipe.
It would be a good idea to run tests locally so you get a real maximum.

There are lots of factors that can completely skew benchmarks; for example, if
you were scraping an average 10 KB response instead of 'hello world', you
would automatically be limited to about 100 req/s on a 10 Mbit pipe.

------
mfontani
You might want to set a specific user-agent for your crawler
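
With requests, for instance (the user-agent string here is only an example):

    import requests

    headers = {'User-Agent': 'my-crawler/0.1 (+https://example.com/bot-info)'}
    response = requests.get('http://localhost/hello', headers=headers)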

------
arcaster
Seems like this wouldn't really be useful for scraping JS-rendered content, or
any content of "real" value that has any kind of rate limiting or monitoring
enabled. Spreading out the IP space and making scraping look like genuine user
traffic is a far greater challenge than spinning up an RMQ cluster.

~~~
plantpark
You are right, but with more code or tools it could handle that too; this is
just a quick demo of a distributed crawler. If you monitor the traffic of a
target website with JS-rendered content, you will usually find a JSON file or
a JSON API, and then what you need next is essentially the same code as in my
article.
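
For instance, once the JSON endpoint behind a JS-rendered page has been found
in the browser's network tab, the crawler can request it directly (the URL and
field name below are made up for illustration):

    import requests

    resp = requests.get('https://example.com/api/items?page=1',
                        headers={'User-Agent': 'my-crawler/0.1'})
    for item in resp.json()['items']:   # 'items' is a hypothetical field
        print(item)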

------
drallison
Another worthwhile article if you are building a crawler.
[http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/](http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/)

~~~
plantpark
I've seen this article; it's a great piece about distributed crawling. Mine is
just a demo version of his.

------
xagarth
The crawling described here is very inefficient. For efficient,
high-performance crawling I recommend libcurl and its curl_multi interface.
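
For example, one way to drive libcurl's multi interface from Python is the
pycurl binding (a rough sketch; the URLs are placeholders):

    from io import BytesIO

    import pycurl

    urls = 100 * ['http://localhost/hello']
    multi = pycurl.CurlMulti()
    buffers = []

    for url in urls:
        buf = BytesIO()
        handle = pycurl.Curl()
        handle.setopt(pycurl.URL, url)
        handle.setopt(pycurl.WRITEFUNCTION, buf.write)
        multi.add_handle(handle)
        buffers.append(buf)

    # Drive all transfers from a single thread until every handle finishes.
    num_active = len(urls)
    while num_active:
        while True:
            ret, num_active = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)

    pages = [buf.getvalue() for buf in buffers]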

------
wopwopwop
Question to the experts here:

- What is the relevance of Docker here? I'm pretty sure that celery+rabbitmq
are enough to do a distributed scraper...

~~~
charlieegan3
I think the OP just drank the docker kool-aid :) It's also the future, obvs.
[https://circleci.com/blog/its-the-future/](https://circleci.com/blog/its-the-future/)

> and learn how to use docker and celery

Seems the OP was learning Docker at the time? I think it just comes down to
the tools you're comfortable with.

