Hacker News

This seems like a perfect use case for GNU Parallel[0]: download and process, say, 10 ids in parallel. If you've already downloaded and processed everything and have no need to do it again, then it probably doesn't matter now.

[0]: https://www.gnu.org/software/parallel/




Xargs, actually, and saturating my dinky Internet connection was trivial. Ten concurrent requests kept any single slow response from stalling the crawl.

That's why the script echoed the curl command rather than running it directly: the output fed xargs.
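The echo-then-pipe approach can be sketched roughly like this (the id file, output filenames, and URL pattern are all hypothetical stand-ins):

```shell
# Emit one curl command per id, then let xargs run them 10 at a time.
# ids.txt holds one id per line; the URL pattern is hypothetical.
while read -r id; do
  echo "curl -sS -o page_${id}.html https://example.com/item?id=${id}"
done < ids.txt | xargs -P 10 -I {} sh -c '{}'
```

`-P 10` sets the concurrency, and `-I {}` makes xargs treat each echoed line as a complete command to hand to `sh -c`.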

The other problem was failed or errored responses (non-3xx/4xx failures, or incomplete HTML with no closing "</html>" tag). There was no runtime detection of these; instead, I checked for them after the first run completed and re-pulled those in a few minutes -- a few thousand out of the whole run, most of which ultimately turned out to be 4xx/3xx anyway.
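The post-run completeness check can be sketched as a simple scan for the closing tag, collecting the incomplete files into a retry list (filenames are hypothetical):

```shell
# List downloads that lack a closing </html> tag so they can be re-pulled.
# page_*.html is a hypothetical naming scheme for the downloaded files.
for f in page_*.html; do
  if ! grep -q '</html>' "$f"; then
    echo "$f"   # incomplete or errored: queue for the second pass
  fi
done > retry_list.txt
```

The resulting `retry_list.txt` can then be fed back through the same xargs pipeline for the second pass.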




