edit: available as a torrent here: https://all-certificates.s3.amazonaws.com/domainnames.gz?tor...
The bottleneck was userland CPU, so probably the gzip processes. It took about 5 hours. Cutting off the path and piping through sort and uniq took another hour or so.
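Roughly the kind of pipeline involved, as a sketch; the file names and the cut field are placeholders and depend on how the lines in the dumps actually look (here assumed to be host/path):

    # decompress, drop the path component, then sort and de-duplicate
    zcat certs-*.gz | cut -d/ -f1 | sort | uniq > domainnames.txt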
The only cost was the hourly price of running the EC2 instance. Network costs were zero, as you're only transferring data into your instance.
pv (pipe viewer, http://www.ivarch.com/programs/pv.shtml) and top are pretty handy to measure this kind of thing. You should be able to see exactly which process is using how much CPU, and what your throughput is.
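For instance, putting pv in front of the decompression step gives the current rate and a progress bar against the total input size (file names again placeholders):

    # pv shows throughput and ETA; run top in another terminal to see
    # which process (gzip, sort, ...) is using the CPU
    pv certs-*.gz | zcat | cut -d/ -f1 | sort | uniq > domainnames.txt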
- 1.8MM successful scrapes
- 25GB total size (stored in postgres)
- 1 server to host redis and postgres
- 9 physical servers for workers (4-core ARM servers) = 36 total cores
- Peak rate of ~100 reqs/second achieved across all workers (36 physical cores total)
- I found I could oversubscribe workers to cores by a factor of 2x (72 workers total) and still get ~75% utilization on each server (quick sketch after this list)
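For reference, with python-rq (linked below) oversubscribing a 4-core box by 2x just means starting 8 worker processes against the shared redis instance; the queue name and redis address here are made up:

    # hypothetical: 8 workers on a 4-core machine, all pulling jobs
    # from the same redis server
    for i in $(seq 8); do
        rq worker --url redis://10.0.0.1:6379/0 scrape &
    done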
All in all it was a fun project that provided quite a bit of learning. There are a lot of levers to pull here to make things faster, more robust, and more portable. Next steps are to optimize the database and wrap a simple django app around it for exploring the data.
Or maybe push it further and try my hand at these 26MM domains?
 - http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
 - https://blog.majestic.com/development/majestic-million-csv-d...
 - http://python-rq.org/
 - http://scaleway.com
I myself was trying to download all the X509 server certificates I could find on the Internet, to:
- gain some insight into the dealings of CAs
- see if private keys were reused across multiple sites
- check the average expiry date of certificates, etc. (a quick openssl sketch below)
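For a single host, something like this pulls the certificate and prints its expiry date and issuer (example.com is just a placeholder); the hard part is doing it across the whole address space:

    # fetch the server certificate and show when it expires and who issued it
    echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
        | openssl x509 -noout -enddate -issuer

    # for the key-reuse question, hashing the public key gives a value
    # that can be compared across sites
    echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
        | openssl x509 -noout -pubkey | openssl sha256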
I would like to be able to continuously check the Norwegian IP space for compromised sites, just to see what turns up. Doing this on a bigger scale would of course be cool as well.