This was actually a fun Saturday afternoon project. I spawned a single c4.8xlarge (10 Gbit) instance in US-EAST-1 (next to where the Common Crawl Public Data Set lives in S3) and downloaded 10TB spread over 33k files in roughly 30 simultaneous 'curl | gunzip | grep whatever-the-common-crawl-url-prefix-was' pipelines, and got a solid 5 Gbit/s transfer rate.
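
A rough sketch of what those pipelines could look like, assuming WET files from a Common Crawl release; the release name, file-list path, and grep pattern are placeholders standing in for whatever the real prefix was:

    # Grab the list of crawl file paths for a release (placeholder release name).
    curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-18/wet.paths.gz \
      | gunzip > wet.paths

    # One download/decompress/grep pipeline per file; the grep pattern is a
    # stand-in for the actual Common Crawl URL prefix.
    fetch_and_grep() {
      curl -s "https://commoncrawl.s3.amazonaws.com/$1" \
        | gunzip \
        | grep -o 'http://[^[:space:]]*'
    }
    export -f fetch_and_grep

    # Run ~30 pipelines in parallel over the file list.
    xargs -P 30 -I{} bash -c 'fetch_and_grep "$@"' _ {} < wet.paths > urls.txt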

The bottleneck was userland CPU, so probably the gzip processes. The download took about 5 hours. Cutting off the path and piping through sort and uniq took another hour or so.
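
Something like the following could cover that aggregation step, assuming the matches landed in a urls.txt as in the sketch above; the cut fields and sort flags are illustrative:

    # Keep only the scheme and host (drop the path), then count unique hosts.
    cut -d/ -f1-3 urls.txt \
      | sort -S 4G --parallel=16 \
      | uniq -c \
      | sort -rn > hosts_by_count.txt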

The only cost was the hourly price of running the EC2 instance. Network costs were zero, as you're only transferring data into your instance.




In my experience, and I suppose depending on the data, grep is often the bottleneck for data pipeline tasks like you describe. The Silver Searcher (https://github.com/ggreer/the_silver_searcher) has been about 10x faster than grep for tasks like pulling fields out of JSON files. It's changed my life.
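
For what it's worth, a call along these lines is what I mean (the file name and JSON field are made up; ag's -o flag prints only the matching part):

    # Extract a "url" field from newline-delimited JSON records.
    ag -o '"url":\s*"[^"]+"' records.ndjson > urls.txt

    # Roughly equivalent grep invocation, for comparison:
    grep -oE '"url":\s*"[^"]+"' records.ndjson > urls_grep.txt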

pv (pipe viewer, http://www.ivarch.com/programs/pv.shtml) and top are pretty handy to measure this kind of thing. You should be able to see exactly which process is using how much CPU, and what your throughput is.
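
For example, dropping pv between the stages of a pipeline like the one above shows per-stage throughput (the URL variable and pattern are placeholders):

    # -c keeps multiple pv bars from clobbering each other, -N labels each stage.
    curl -s "$crawl_url" \
      | pv -cN download \
      | gunzip \
      | pv -cN decompressed \
      | grep -o 'http://[^[:space:]]*' > matches.txt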


Both your story and the torrent are fascinating. Thank you for sharing.



