
It's a pretty sad state we've got ourselves into as an industry when people think that parsing a 1GB log file in three minutes is "exciting". That's five megabytes per second. Jeez, anyone would think it's 1988, not 2018.

And that was an _improvement_ from the original, note - 36 minutes is 470KB per second.

That is literally 1000 times slower than a modern SSD should be capable of shovelling your data. And remember, I/O is s-l-o-w.
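
For anyone who wants to check the arithmetic, here is a quick sketch. It assumes decimal units (1 GB = 10^9 bytes), and the helper name is made up for illustration. The "1000 times" line just shows what rate the comparison implies; it is not a measured SSD benchmark.

```python
def throughput_bytes_per_sec(size_bytes: int, seconds: float) -> float:
    """Average throughput for processing size_bytes in the given time."""
    return size_bytes / seconds


GB = 10**9  # assumption: decimal gigabyte

fast = throughput_bytes_per_sec(GB, 3 * 60)   # the "exciting" three-minute run
slow = throughput_bytes_per_sec(GB, 36 * 60)  # the original 36-minute run

print(f"{fast / 1e6:.1f} MB/s")         # 5.6 MB/s
print(f"{slow / 1e3:.0f} KB/s")         # 463 KB/s
print(f"{slow * 1000 / 1e6:.0f} MB/s")  # 463 MB/s, i.e. 1000x the slow run
```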

I just despair, I really do.

Note that this article sweeps the cost of gzip under the carpet, which I would expect to dominate here. Try compressing with something like LZ4-HC, you'll then be able to decompress at GBs/sec of raw data.

I share the sentiment. While the tech is getting faster and faster, our typical ways of handling data are getting worse and worse, eliminating all the gains. Devs are just tolerating slower languages, tools, architectures, methodologies etc. until a wall of some kind is being hit.

"With Rust in Lambda, each 1GB file takes about 23 seconds to download and parse. That’s about a 78x speedup compared to each Python Glue worker."

There should be no separate "download" and "processing". The whole thing should take exactly as much time as it takes to download the file via stream processing.
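
A minimal sketch of what that could look like, in Python with only the standard library. The chunk iterator stands in for whatever the HTTP or S3 client hands you (that part is an assumption, as are the function names); the point is that decompression and line splitting happen as each chunk arrives, rather than after the whole file is on disk.

```python
import gzip
import zlib
from typing import Iterable, Iterator


def iter_lines_from_gzip_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decompressed lines while compressed chunks are still arriving."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    # Limitation: only the first member of a multi-member gzip stream is handled.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    pending = b""
    for chunk in chunks:
        pending += decomp.decompress(chunk)
        # Everything before the last newline is complete; keep the remainder.
        *lines, pending = pending.split(b"\n")
        yield from lines
    # Emit whatever is left once the stream ends.
    lines = (pending + decomp.flush()).split(b"\n")
    if lines[-1] == b"":
        lines.pop()
    yield from lines


def fake_download(blob: bytes, chunk_size: int = 64) -> Iterator[bytes]:
    """Hypothetical stand-in for a network client that yields small chunks."""
    for i in range(0, len(blob), chunk_size):
        yield blob[i:i + chunk_size]


raw = b"".join(b"line %d\n" % i for i in range(1000))
lines = list(iter_lines_from_gzip_chunks(fake_download(gzip.compress(raw))))
print(len(lines))  # 1000
```

Memory use stays bounded by the chunk size plus the longest line, so the same loop works for a 1GB file as for a 1KB one.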

Also: https://github.com/facebook/zstd ; better ratios, better speed.

When our industry hits the ceiling of scaling, then we'll start thinking about optimisation. Until then, there is other low-hanging fruit lying around.

On that matter - does anyone want to start a group that would lobby for making the computation more efficient exactly the same way the greens are lobbying to get rid of single-use plastics?

Exactly. And because of those basic business incentives, no matter how fast the hardware gets, we will always have software that is just fast enough to be barely bearable, but not a notch faster. That's why I'm depressed ;).

You could also argue that computation has the same tendency as building new/better roads in the cities: the more efficient and more available it is - the more of it we will be using.

E.g. as they say about CGI - computers became more powerful, but the minutes per frame have not generally changed year on year - the frames have just become more detailed.

The difference is that the roads are pushing way more traffic and the frames of CGI are getting more and more detailed / realistic / whatever.

A more accurate analogy would be building a 400mph maglev train line and discovering people are riding bicycles down the tracks.

Or that you've built a 400 lane highway but everyone's still stuck in traffic because they each made their cars one hundred times wider.

The sole reason the road problem is known is because when we build new highways going to the city centre, we start using the roads for less important things, thus reducing the efficiency of using roads. It's a well-known problem in city planning.

E.g. instead of only using a car when you need to get there "now", you start using a car to get some coffee in the central area.

So there is no difference and that's why I mentioned it.

Artists also don't have to optimize their renders as much, leading to less time spent worrying about technical details and decreased complexity, so there is some overlap with CGI.

In general though, your analogy is much more true than not, since renderers are mostly optimized as much as possible.

You could argue that the only reason renderers are optimised is because what they're rendering is the sole selling point of the movie. If they are not optimised, they'll not be able to deliver the novel picture that is expected to be loved by the audience.

On an economics level, the cost of storage space dominates. The cost of storing those logs with gzip -9 was $3.50/month. The cost of parsing them with Rust was free, as the parsing time fit in their Lambda free tier. The cost of parsing them with Python would've been ~$1000/month.

I'd also expect gzip to dominate CPU time, but this doesn't matter much when CPU time is basically free. Sometime in the future (or today, if you're Google) it'll be important to parse logs at GB/sec. For most businesses, today, it's not really.

I wish there was something pointing readers to the next step to take to remediate this. I read through the article and thought the author's iterative approach made sense. There are a lot of tutorials out there for, say, processing something with a quick script in either Ruby or Node, but it's harder to find resources aimed at processing/parsing a log file at GB/sec. It's definitely a topic I'd like to research/write about further.

Well it's not exactly rocket science, is it? You just need to work with the hardware in a vaguely sympathetic way and not do insane things.

If your goal is GBs/sec then first look at your I/O speed. It's probably less than that off a standard single SSD => you need to compress the data with an algorithm which offers very rapid decompression. A quick Google would find you things like zstd and lz4. You may infer that to get to GB/sec you might need multiple threads, and therefore to chunk your data somewhat.
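
To make the chunking part concrete, here is a rough sketch. The helper name and the toy data are made up for illustration. It splits a buffer into pieces that always end on a newline, so each worker sees only whole records. Note that in CPython, threads mostly won't speed up CPU-bound parsing because of the GIL, so treat the thread pool as a placeholder for processes or native threads in another language.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def newline_aligned_chunks(buf: bytes, n_chunks: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets where every chunk ends just after a newline."""
    size = len(buf)
    target = max(1, size // n_chunks)
    bounds = []
    start = 0
    while start < size:
        end = min(size, start + target)
        if end < size:
            # Push the boundary forward to the end of the current line.
            nl = buf.find(b"\n", end - 1)
            end = size if nl == -1 else nl + 1
        bounds.append((start, end))
        start = end
    return bounds


data = b"".join(b"record %d\n" % i for i in range(10_000))
bounds = newline_aligned_chunks(data, 4)

# Each worker counts the records in its own slice; no line is split in two.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(lambda b: data.count(b"\n", b[0], b[1]), bounds))

total = sum(counts)
print(total)  # 10000
```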

Beyond that, it obviously depends what you're doing with your data. Assuming it's plain text/JSON, but you want to extract dates and numbers from it, you'll need to parse those fast enough to keep up. (Naive datetime parsing can be slow, but is fairly easy to make fast, especially if your timestamps are sequential, or all for the same date. Lookup tables or whatever.)
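
As an illustration of the "all for the same date" trick, a sketch along these lines caches the expensive date parse and only does integer slicing for the time of day. It assumes fixed-width `YYYY-MM-DDTHH:MM:SS` timestamps in UTC; the class name is invented, and no speedup is claimed without measuring on your own data.

```python
from datetime import datetime, timezone


class FastTimestampParser:
    """Parse 'YYYY-MM-DDTHH:MM:SS' (assumed UTC) into epoch seconds."""

    def __init__(self) -> None:
        self._last_date = None  # date string seen most recently
        self._midnight = 0      # epoch seconds at 00:00:00 of that date

    def parse(self, ts: str) -> int:
        date = ts[:10]
        if date != self._last_date:
            # Slow path: only taken when the date changes.
            day = datetime.strptime(date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
            self._midnight = int(day.timestamp())
            self._last_date = date
        # Fast path: fixed-offset slices, no format-string parsing.
        return (self._midnight
                + int(ts[11:13]) * 3600
                + int(ts[14:16]) * 60
                + int(ts[17:19]))


parser = FastTimestampParser()
print(parser.parse("1970-01-02T00:00:10"))  # 86410
```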

You'll want to avoid storing your records in some serialisation format that requires you to read the whole file in at once. (E.g. use one JSON object per line, or whatever.)
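
For example, with one JSON object per line you can process records as they are read and never hold the whole file in memory. This is only a sketch; the `status` and `bytes` fields are invented for the example.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a line-delimited JSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for an open log file here.
log = io.StringIO(
    '{"status": 200, "bytes": 512}\n'
    '{"status": 404, "bytes": 128}\n'
    '{"status": 200, "bytes": 1024}\n'
)

# Aggregate incrementally; only one record is in memory at a time.
ok_bytes = sum(r["bytes"] for r in iter_records(log) if r["status"] == 200)
print(ok_bytes)  # 1536
```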

If you do all that, it's hard not to be fast, even with a GC'ed language and all the rest of it.

Interestingly, the tiny C compiler can fully compile the 6 MB sqlite.c single file distribution in about 1/10th of a second. Not just parse, but fully compile and link to an executable with some optimizations.


What a shitty attitude to have. Do you think the authors of the first linear time suffix array construction algorithm should have been embarrassed too because they didn't think of it sooner?

Everyone goes through a learning process, and in many cases, also an evolution in the requirements themselves. There's absolutely no reason to be embarrassed.
