- 1.8MM successful scrapes
- 25GB total size (stored in postgres)
- 1 server to host redis and postgres
- 9 physical servers for workers (4-core ARM servers) = 36 total cores
- Peak rate of ~100 reqs/second across all workers
- I found I could oversubscribe workers to cores by 2x (72 workers total) and hit ~75% utilization on each server (a rough sketch of the setup follows this list)
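For the curious, here's a rough sketch of how the pieces fit together with python-rq: redis holds the job queue, and each worker process pulls a URL, fetches it, and writes the result into postgres. The table schema, DSN, and hostnames below are made up for illustration; this isn't the project's actual code.

    # scrape_jobs.py -- hypothetical job module
    import psycopg2
    import requests
    from redis import Redis
    from rq import Queue

    PG_DSN = "dbname=scrapes user=scraper host=db-host"  # assumed connection string

    def scrape(url):
        """Fetch one URL and store the response in postgres."""
        resp = requests.get(url, timeout=10)
        conn = psycopg2.connect(PG_DSN)
        with conn, conn.cursor() as cur:  # commits on success, rolls back on error
            cur.execute(
                "INSERT INTO pages (url, status, body) VALUES (%s, %s, %s)",
                (url, resp.status_code, resp.text),
            )
        conn.close()

    if __name__ == "__main__":
        # Enqueue a job; any worker listening on the default queue will run it.
        q = Queue(connection=Redis(host="queue-host"))
        q.enqueue(scrape, "http://example.com/")

Each worker box then runs rq worker processes pointed at the same redis (e.g. rq worker --url redis://queue-host:6379), two per core for the 2x oversubscription; the workers spend most of their time blocked on network I/O, which is why the extra processes help.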
All in all, it was a fun project that provided quite a bit of learning. There are a lot of levers to pull here to make things faster, more robust, and more portable. Next steps are to optimize the database and wrap a simple django app around it for exploring the data.
Or maybe push it further and try my hand at these 26MM domains? (A rough sketch of seeding the queue from one of those lists follows the links.)
 - http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
 - https://blog.majestic.com/development/majestic-million-csv-d...
 - http://python-rq.org/
 - http://scaleway.com
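If I do, seeding the queue is just streaming the CSV in. A rough sketch, assuming the usual rank,domain row format of those lists and the hypothetical scrape() job from the sketch above:

    # seed_queue.py -- assumes top-1m.csv has been downloaded and unzipped locally
    import csv

    from redis import Redis
    from rq import Queue

    from scrape_jobs import scrape  # hypothetical job module sketched earlier

    q = Queue(connection=Redis(host="queue-host"))

    with open("top-1m.csv", newline="") as f:
        for rank, domain in csv.reader(f):
            q.enqueue(scrape, "http://%s/" % domain)

At 26MM rows the per-job enqueue overhead starts to matter, so I'd probably shard this loader across a few processes or batch the enqueues rather than run it as a single loop.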