
A couple of months ago I processed all metadata from the Common Crawl project for all indexed domain names. This came to about 10TB of metadata and yielded 26 million domain names. EC2 costs to process it were only about $10. If anyone is interested, let me know.

edit: available as torrent here: https://all-certificates.s3.amazonaws.com/domainnames.gz?tor...

This was actually a fun Saturday afternoon project. I spawned a single c4.8xlarge (10 Gbit) instance in US-EAST-1 (next to where the Common Crawl Public Data Set lives in S3) and downloaded 10TB spread over 33k files in +/- 30 simultaneous 'curl | gunzip | grep whatever-the-common-crawl-url-prefix-was' pipelines, getting a solid 5 Gbit/s transfer speed.

The bottleneck was userland CPU, most likely the gzip processes. It took about 5 hours. Cutting off the path and piping through sort and uniq took another hour or so.
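For anyone curious what that fan-out might look like, here is a rough sketch under stated assumptions: files.txt holds the metadata file URLs one per line, and "CRAWL-URL-PREFIX" is a placeholder for the actual Common Crawl URL prefix, which isn't specified here.

```shell
# Sketch of the parallel download pipeline described above.
# files.txt: one metadata file URL per line (~33k lines in the original run).
# "CRAWL-URL-PREFIX" is a placeholder for the real Common Crawl prefix.
xargs -P 30 -n 1 sh -c '
  curl -s "$0" | gunzip | grep -o "CRAWL-URL-PREFIX[^\"]*"
' < files.txt > urls.txt

# Second stage: strip scheme and path, then dedupe to unique domains.
sed -E 's|^https?://||; s|/.*||' urls.txt | sort -u > domains.txt
```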

The only cost was the price/hour of running the EC2 instance. Network costs were zero, as you're only transferring data inbound to your instance.

In my experience, and I suppose depending on the data, I've found that grep is often the bottleneck for data pipeline tasks like you describe. The silver searcher (https://github.com/ggreer/the_silver_searcher) is, in my experience, about 10x faster than grep for tasks like pulling out fields from json files. It's changed my life.
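As a concrete (hypothetical) example of that kind of task, pulling a URL field out of newline-delimited JSON; metadata.json is a stand-in file name:

```shell
# Extract just the matched field with ag (the silver searcher);
# -o prints only the matching part of each line.
ag -o '"url": "[^"]*"' metadata.json

# Plain-grep equivalent, typically slower on large inputs:
grep -o '"url": "[^"]*"' metadata.json
```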

pv (pipe viewer, http://www.ivarch.com/programs/pv.shtml) and top are pretty handy to measure this kind of thing. You should be able to see exactly which process is using how much CPU, and what your throughput is.
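A minimal sketch of how that measurement might look, assuming metadata.gz is one of the downloaded files:

```shell
# pv sits in the pipeline and reports live throughput at that point;
# metadata.gz and the grep pattern are placeholders.
pv metadata.gz | gunzip | grep -c 'example'
# While this runs, top shows per-process CPU: if the gunzip processes
# are pinned near 100%, decompression is the bottleneck.
```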

Both your story and the torrent are fascinating. Thank you for sharing.

Synchronicity :-) Just this week I scraped and analyzed only the homepages from the Alexa top 1 million [1] and the Majestic Million [2]. I used rqworker[3] and a fleet of mini-servers from Scaleway[4] to do the scraping. Some results:

- 1.8MM successful scrapes

- 25GB total size (stored in Postgres)

- 1 server to host Redis and Postgres

- 9 physical servers for workers (4-core ARM servers) = 36 total cores

- Peak rate of ~100 reqs/second achieved across all workers (36 physical cores total)

- Oversubscribing workers to cores by a factor of 2x (72 total workers) achieved ~75% utilization on each server
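The real fan-out above used rq workers spread across nine servers; as a rough single-machine approximation of the same idea (my sketch, not the setup described), xargs -P can oversubscribe curl jobs to cores:

```shell
# Single-box approximation of the scrape fan-out, using xargs -P in
# place of rq workers across servers. top-1m.csv is the Alexa list
# (rank,domain per line); 72 matches the 2x-oversubscribed worker count.
mkdir -p pages
cut -d, -f2 top-1m.csv | head -n 1000 |
  xargs -P 72 -n 1 -I{} sh -c \
    'curl -s -m 10 -o "pages/{}.html" "http://{}"'
```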

All in all it was a fun project that provided quite a bit of learning. There are a lot of levers to pull here to make things faster, robust, and more portable. Next steps are to optimize the database and wrap a simple django app around it for exploring data.

Or maybe push it further and try my hand at these 26MM domains?

[1] - http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

[2] - https://blog.majestic.com/development/majestic-million-csv-d...

[3] - http://python-rq.org/

[4] - http://scaleway.com

Please make a torrent.

Thank you very much.

thank you for this!

just out of curiosity, what are you (and the other commenters) planning to use this for?

I myself was trying to download all X509 server certificates on the Internet I could find to:

- gain some insight into the dealings of CAs

- see if private keys were reused across multiple sites

- check the average expiry date of certificates etc.
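For the expiry and key-reuse checks, this is roughly the kind of per-domain probe involved (a sketch; example.com stands in for any domain from the list):

```shell
# Fetch a site's certificate and print its expiry date.
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null |
  openssl x509 -noout -enddate

# Hash the certificate's public key; the same hash on two sites means
# the same key pair, and therefore a shared private key.
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null |
  openssl x509 -noout -pubkey | openssl sha256
```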

My intention is to get a complete list of .no (Norwegian) domains, and http://norid.no refuses to give that list out to anyone.

I would like to be able to continuously check the Norwegian IP space for compromised sites, simply because it would be interesting to see. Doing this on a bigger scale would of course be cool as well.

This might be useful for that: https://scans.io/study/sonar.ssl

Yes, I found that some time afterwards, so I decided not to continue with it ;) I did gather about 1.4M certificates in the process though.

Is there any way to extract just country-specific domains without processing all 10TB of it?

No, but my torrent is just the list of unique domain names, 147MB gzipped.

I am interested.
