Hacker News new | comments | show | ask | jobs | submit login

A couple of months ago I processed all metadata from the Common Crawl project for all indexed domain names. This was about 10TB of metadata and resulted in 26 million domain names. EC2 costs were only about 10$ to process this. If anyone is interested, let me know.

edit: available as torrent here: https://all-certificates.s3.amazonaws.com/domainnames.gz?tor...

This was a fun Saturday afternoon project to do actually. I spawned a single c4.8xlarge (10 Gbit) instance in US-EAST-1 (next to where the Common Crawl Public Data Set lives in S3) and downloaded 10TB spread over 33k files in +/- 30 simultaneous 'curl | ungzip | grep whatever-the-common-crawl-url-prefix-was' pipelines and got a solid 5Gbit transfer speed.

The bottleneck was userland CPU, so probably the gzip processes. Took about 5 hours. Cutting of the path and piping through sort and uniq took another hour or so.

The only costs were the price/hour of running the EC2 instance. Network costs were 0 as you're only transferring data to your instance.

In my experience, and I suppose depending on the data, I've found that grep is often the bottleneck for data pipeline tasks like you describe. The silver searcher (https://github.com/ggreer/the_silver_searcher) is, in my experience, about 10x faster than grep for tasks like pulling out fields from json files. It's changed my life.

pv (pipe viewer, http://www.ivarch.com/programs/pv.shtml) and top are pretty handy to measure this kind of thing. You should be able to see exactly which process is using how much CPU, and what your throughput is.

Both your story and the torrent are fascinating. Thank you for sharing.

Synchronicity :-) Just this week I scraped and analyzed only the homepages from the Alexa top 1 million [1] and majestic million [2]. I used rqworker[3] and and a fleet of mini-servers from scaleway[4] to do the scraping. Some results:

- 1.8MM successful scrapes - 25GB total size (stored in postgres) - 1 server to host redis and postgres - 9 physical servers for workers (4x core ARM servers) = 36 total cores - Peak rate of ~ 100 reqs/second achieved across all workers (36 physical cores total) - I saw that I could oversubscribe workers to cores by a factor of 2x (72 total workers) to achieve ~75% utilization on each server

All in all it was a fun project that provided quite a bit of learning. There are a lot of levers to pull here to make things faster, robust, and more portable. Next steps are to optimize the database and wrap a simple django app around it for exploring data.

Or maybe push it futher and try my hand at these 26MM domains?

[1] - http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

[2] - https://blog.majestic.com/development/majestic-million-csv-d...

[3] - http://python-rq.org/

[4] - http://scaleway.com

Please make a torrent.

Thank you very much.

thank you for this!

just out of curiosity, what are you (and the other commenters) planning to use this for?

I myself was trying to download all X509 server certificates on the Internet I could find to:

- gain some insight into the dealings of CAs

- see if private keys were reused across multiple sites

- check the average expiry date of certificates etc.

My intentions is that I want a complete list of .no (Norwegian) domains, and http://norid.no refuse to give out that list to anyone.

I would like to be able to continuously check the Norwegian IP-space for compromised sites, because it would be interesting to see. Of course doing this on a bigger scale would be cool as well.

This might be useful for that: https://scans.io/study/sonar.ssl

Yes I found that some time afterwards, so I decided not to continue with it ;) I did gather about 1.4M certificates in the process though.

Any way to extract just country specific domains without processing all 10TB of it?

No, but my torrent is just the list of unique domain names and 147MB gzipped.

I am interested.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact