Hacker News new | comments | show | ask | jobs | submit login

Synchronicity :-) Just this week I scraped and analyzed only the homepages from the Alexa top 1 million [1] and majestic million [2]. I used rqworker[3] and and a fleet of mini-servers from scaleway[4] to do the scraping. Some results:

- 1.8MM successful scrapes - 25GB total size (stored in postgres) - 1 server to host redis and postgres - 9 physical servers for workers (4x core ARM servers) = 36 total cores - Peak rate of ~ 100 reqs/second achieved across all workers (36 physical cores total) - I saw that I could oversubscribe workers to cores by a factor of 2x (72 total workers) to achieve ~75% utilization on each server

All in all it was a fun project that provided quite a bit of learning. There are a lot of levers to pull here to make things faster, robust, and more portable. Next steps are to optimize the database and wrap a simple django app around it for exploring data.

Or maybe push it futher and try my hand at these 26MM domains?

[1] - http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

[2] - https://blog.majestic.com/development/majestic-million-csv-d...

[3] - http://python-rq.org/

[4] - http://scaleway.com

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact