
I just finished crawling 5.19B web pages, Ask Me Anything - dor_jack
I WAS JUST RATE LIMITED BY HN, SO I'M GOING TO ANSWER YOUR QUESTIONS UNDER A NEW ACCOUNT: dor_jack_2
======
grzm
If you're rate-limited, you can contact the mods via the Contact link in the
footer.

------
dm_i386
What tools did you use? What had to be custom-written and why?

~~~
dor_jack_2
We tried a bunch of technologies (Nutch, Heritrix, StormCrawler, ...) and
eventually settled on Mixnode; since it's a 'cloud platform', we didn't
really have to write anything custom.

As for processing the data we crawled, we are using ArchiveSpark
(https://github.com/helgeho/ArchiveSpark).

Also, Mixnode defaults to Amazon S3 for storage, which was fine with us
since we're using EC2 to process the results.
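
If anyone wants to poke at WARC output without standing up Spark, here's a
minimal Python stand-in for the kind of pass ArchiveSpark does for us, using
warcio. We don't use warcio ourselves, and the file name is made up:

    # Iterate a WARC file and pull the Server header from each
    # response record. Requires `pip install warcio`.
    from warcio.archiveiterator import ArchiveIterator

    def server_headers(warc_path):
        """Yield (url, Server header) for every HTTP response record."""
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                server = record.http_headers.get_header("Server", "unknown")
                yield url, server

    if __name__ == "__main__":
        for url, server in server_headers("crawl-00000.warc.gz"):
            print(server, url)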

------
maurtinshkreli
How much did it cost?

~~~
dor_jack
It was an all-inclusive deal: 420 TB at $0.06 per GB (430,080 GB at 1024
GB/TB) = $25,804

------
tlack
What did you do to avoid winding up in endless GET URL loops? How deep did
you go per site, and how did you schedule follow-up requests?

~~~
dor_jack_2
Loop/spam prevention was done by Mixnode; I'm not sure how they do it.
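
For the curious: a common generic approach (not necessarily Mixnode's) is to
canonicalize every URL and keep a "seen" set, so session IDs, fragments, and
reordered query strings can't trap the crawler in a cycle. A rough Python
sketch:

    # Generic loop prevention: canonicalize URLs before deduplication.
    from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

    seen = set()

    def canonicalize(url):
        p = urlparse(url)
        # Lowercase scheme/host, drop the fragment, sort query params.
        query = urlencode(sorted(parse_qsl(p.query)))
        return urlunparse((p.scheme.lower(), p.netloc.lower(),
                           p.path or "/", "", query, ""))

    def should_fetch(url):
        key = canonicalize(url)
        if key in seen:
            return False
        seen.add(key)
        return True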

The crawl doesn't follow a strict DFS or BFS pattern, so pages per site vary
greatly with a host's server capacity and anti-crawling configuration.

There was a minimum of 10 seconds between follow-up requests to the same
website, unless robots.txt specified a lower delay. Pretty polite...
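
In Python terms the rule looks roughly like this. The user agent string is a
placeholder, and a real crawler would cache robots.txt per host and handle
fetch errors instead of re-reading it on every request:

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    DEFAULT_DELAY = 10.0  # seconds, as described above
    last_hit = {}         # host -> timestamp of last request

    def crawl_delay_for(url, user_agent="example-crawler"):
        host = urlparse(url).netloc
        rp = urllib.robotparser.RobotFileParser(f"http://{host}/robots.txt")
        rp.read()
        delay = rp.crawl_delay(user_agent)
        # Only go faster than the default when robots.txt allows it;
        # a politer crawler would also honor *higher* delays.
        if delay is not None and delay < DEFAULT_DELAY:
            return delay
        return DEFAULT_DELAY

    def polite_wait(url):
        host = urlparse(url).netloc
        elapsed = time.time() - last_hit.get(host, 0.0)
        remaining = crawl_delay_for(url) - elapsed
        if remaining > 0:
            time.sleep(remaining)
        last_hit[host] = time.time()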

------
joshpen188
Why didn't you use Common Crawl instead?

~~~
dor_jack_2
For our purposes, Common Crawl's corpus was missing too many websites
(possibly due to those sites' robots.txt configurations). We also needed
deeper coverage, which CC could not provide.

------
savethefuture
What did you discover?

~~~
dor_jack
We are processing the data as we speak. However, the way technology adoption
shifts depending on where a company is based is truly incredible.

Will update this in a few days with more data.

~~~
savethefuture
It would be interesting to see how frameworks, tech stacks, or even design
elements correlate with geographical location.

~~~
dor_jack
If our company approves, I would like to publish some general statistics
that may be of interest to others.

------
itburnslikeice
but why?

~~~
dor_jack
Our company is in the Marketing Intelligence (MI) industry. We needed to
measure the penetration of multiple technologies in different countries.
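
To give a concrete idea of what "measuring penetration" can look like: a toy
fingerprinting pass that matches a few well-known HTML markers and buckets
hits by country-code TLD. The signature set here is deliberately tiny and
hypothetical, not our production detector:

    import re
    from collections import Counter
    from urllib.parse import urlparse

    # Hypothetical signatures; real detection uses far richer rules.
    SIGNATURES = {
        "WordPress": re.compile(r"wp-content|wp-includes", re.I),
        "React": re.compile(r"data-reactroot|__NEXT_DATA__", re.I),
        "jQuery": re.compile(r"jquery(\.min)?\.js", re.I),
    }

    def country_of(url):
        # Crude ccTLD check: ignores ports and multi-part TLDs.
        tld = urlparse(url).netloc.rsplit(".", 1)[-1]
        return tld if len(tld) == 2 else "other"

    def tally(pages):
        """pages: iterable of (url, html) -> Counter of (country, tech)."""
        counts = Counter()
        for url, html in pages:
            for tech, sig in SIGNATURES.items():
                if sig.search(html):
                    counts[(country_of(url), tech)] += 1
        return counts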

