
I wish there was something pointing readers to the next step to take to remediate this. I read through the article and thought the author's iterative approach made sense. There are a lot of tutorials out there for, say, processing something with a quick script in either Ruby or Node, but it's harder to find resources aimed at processing/parsing a log file at GB/s. It's definitely a topic I'd like to research/write about further.



Well, it's not exactly rocket science, is it? You just need to work with the hardware in a vaguely sympathetic way and not do insane things.

If your goal is GB/s, then first look at your I/O speed. It's probably less than that off a single standard SSD, so you'll need to compress the data with an algorithm that offers very fast decompression; a quick Google search turns up things like zstd and lz4. To actually hit GB/s you'll likely also need multiple threads, and therefore to chunk your data somewhat.
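For a concrete picture, here's a minimal Go sketch of that shape, assuming the klauspost/compress zstd package; the input file name and the processBatch body are placeholders:

    package main

    import (
        "bufio"
        "os"
        "runtime"
        "sync"

        "github.com/klauspost/compress/zstd"
    )

    // processBatch is a stand-in for whatever parsing/aggregation you do per chunk.
    func processBatch(lines []string) {
        _ = lines
    }

    func main() {
        f, err := os.Open("app.log.zst") // hypothetical compressed log
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Stream-decompress instead of reading the whole file into memory.
        dec, err := zstd.NewReader(f)
        if err != nil {
            panic(err)
        }
        defer dec.Close()

        // One worker per CPU pulls batches of lines off a channel.
        batches := make(chan []string, runtime.NumCPU())
        var wg sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for b := range batches {
                    processBatch(b)
                }
            }()
        }

        sc := bufio.NewScanner(dec)
        sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow long lines
        batch := make([]string, 0, 4096)
        for sc.Scan() {
            batch = append(batch, sc.Text())
            if len(batch) == cap(batch) {
                batches <- batch
                batch = make([]string, 0, 4096)
            }
        }
        if err := sc.Err(); err != nil {
            panic(err)
        }
        if len(batch) > 0 {
            batches <- batch
        }
        close(batches)
        wg.Wait()
    }

The point is just the structure: one streaming decompressor feeding batches of lines to a pool of workers, so decompression and parsing overlap.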

Beyond that, it obviously depends what you're doing with your data. Assuming it's plain text/JSON and you want to extract dates and numbers from it, you'll need to parse those fast enough to keep up. (Naive datetime parsing can be slow, but it's fairly easy to make fast, especially if your timestamps are sequential or all for the same date. Lookup tables or whatever.)
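To illustrate the lookup-table/caching idea, a minimal sketch (the timestamp layout is an assumption): remember the last raw timestamp string and only fall back to time.Parse when it changes, which is the common case for sequential logs.

    package main

    import (
        "fmt"
        "time"
    )

    const layout = "2006-01-02T15:04:05" // assumed timestamp format

    var (
        lastRaw    string
        lastParsed time.Time
    )

    // parseTimestamp avoids re-parsing when consecutive lines share a timestamp.
    func parseTimestamp(raw string) (time.Time, error) {
        if raw == lastRaw { // hot path: same second as the previous line
            return lastParsed, nil
        }
        t, err := time.Parse(layout, raw)
        if err != nil {
            return time.Time{}, err
        }
        lastRaw, lastParsed = raw, t
        return t, nil
    }

    func main() {
        for _, s := range []string{"2024-05-01T12:00:01", "2024-05-01T12:00:01", "2024-05-01T12:00:02"} {
            t, _ := parseTimestamp(s)
            fmt.Println(t.Unix())
        }
    }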

You'll want to avoid storing your records in a serialisation format that requires you to read the whole file in at once. (E.g. use one JSON object per line, or whatever.)
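A minimal sketch of the one-object-per-line approach, with made-up Record fields: each line decodes independently, so nothing ever needs the whole file in memory and batches of lines can be handed to different threads.

    package main

    import (
        "bufio"
        "encoding/json"
        "fmt"
        "os"
    )

    // Record's fields are illustrative, not any particular log schema.
    type Record struct {
        Timestamp  string  `json:"ts"`
        Level      string  `json:"level"`
        DurationMS float64 `json:"duration_ms"`
    }

    func main() {
        sc := bufio.NewScanner(os.Stdin) // e.g. pipe in a .jsonl file
        sc.Buffer(make([]byte, 0, 1<<20), 1<<20)
        var count int
        var total float64
        for sc.Scan() {
            var r Record
            if err := json.Unmarshal(sc.Bytes(), &r); err != nil {
                continue // skip malformed lines
            }
            count++
            total += r.DurationMS
        }
        if count > 0 {
            fmt.Printf("avg duration: %.2fms over %d records\n", total/float64(count), count)
        }
    }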

If you do all that, it's hard not to be fast, even with a GC'ed language and all the rest of it.



