Hacker News
Parsing logs 230x faster with Rust (arko.net)
78 points by steveklabnik on Oct 26, 2018 | hide | past | favorite | 22 comments

It's a pretty sad state we've got ourselves into as an industry when people think that parsing a 1GB log file in three minutes is "exciting". That's five megabytes per second. Jeez, anyone would think it's 1988, not 2018.

And that was an _improvement_ from the original, note - 36 minutes is 470KB per second.

That is literally 1000 times slower than a modern SSD should be capable of shovelling your data. And remember, I/O is s-l-o-w.

I just despair, I really do.

Note that this article sweeps the cost of gzip under the carpet, which I would expect to dominate here. Try compressing with something like LZ4-HC, you'll then be able to decompress at GBs/sec of raw data.

I share the sentiment. While the tech is getting faster and faster, our typical ways of handling data are getting worse and worse, eliminating all the gains. Devs just tolerate slower languages, tools, architectures, methodologies etc. until some kind of wall is hit.

"With Rust in Lambda, each 1GB file takes about 23 seconds to download and parse. That’s about a 78x speedup compared to each Python Glue worker."

There should be no separate "download" and "processing" phases. With stream processing, the whole job should take only as long as downloading the file.
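To make the streaming shape concrete, here's a minimal Python sketch: the gzipped log is decoded and handled line by line as the bytes arrive. An in-memory BytesIO stands in for the network response body; the payload and the line counting are purely illustrative.

```python
import gzip
import io

def count_lines_streaming(stream):
    """Parse a gzipped log stream incrementally: each line is handled
    as soon as its bytes are available, so there is no separate
    download-then-parse phase."""
    count = 0
    with gzip.open(stream, mode="rt") as lines:
        for line in lines:
            count += 1
    return count

# BytesIO stands in for the HTTP response body you'd get from a
# client library; any file-like object works the same way.
payload = gzip.compress(b"GET /gems/rails HTTP/1.1\nGET /gems/rack HTTP/1.1\n")
print(count_lines_streaming(io.BytesIO(payload)))  # 2
```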

Also: https://github.com/facebook/zstd ; better ratios, better speed.

When our industry hits the ceiling of scaling, then we'll start thinking about optimisation. Otherwise there are other low-hanging fruit lying around.

On that matter - does anyone want to start a group that would lobby for making the computation more efficient exactly the same way the greens are lobbying to get rid of single-use plastics?

Exactly. And because of those basic business incentives, no matter how fast the hardware gets, we will always have software that is just fast enough to be barely bearable, but not a notch faster. That's why I'm depressed ;).

You could also argue that computation has the same tendency as building new/better roads in the cities: the more efficient and more available it is - the more of it we will be using.

E.g., as they say about CGI: computers became more powerful, but the minutes per frame have not generally changed year on year - the frames have just become more detailed.

The difference is that the roads are pushing way more traffic and the frames of CGI are getting more and more detailed / realistic / whatever.

A more accurate analogy would be building a 400mph maglev train line and discovering people are riding bicycles down the tracks.

Or that you've built a 400 lane highway but everyone's still stuck in traffic because they each made their cars one hundred times wider.

The reason the road problem is so well known is that when we build new highways into the city centre, we start using the roads for less important trips, reducing the efficiency of road use. It's a well-known problem in city planning.

E.g. Instead of only using a car when you need to get there "now", you start using a car to get some coffee in the central area.

So there is no difference and that's why I mentioned it.

Artists also don't have to optimize their renders as much, which means less time spent worrying about technical details and lower complexity, so there is some overlap with CGI.

In general though, your analogy is much more true than not, since renderers are mostly optimized as much as possible.

You could argue that the only reason renderers are optimised is because what they're rendering is the sole selling point of the movie. If they are not optimised, they'll not be able to deliver the novel picture that is expected to be loved by the audience.

On an economics level, the cost of storage space dominates. Storing those logs with gzip -9 cost $3.50/month. Parsing them with Rust was effectively free, as the parsing time fit in their Lambda free tier. Parsing them with Python would've been ~$1000/month.

I'd also expect gzip to dominate CPU time, but this doesn't matter much when CPU time is basically free. Sometime in the future (or today, if you're Google) it'll be important to parse logs at GB/sec. For most businesses, today, it's not really.

I wish there was something pointing readers to the next step for remediating this. I read through the article and thought the author's iterative approach made sense. There are a lot of tutorials out there for, say, processing something with a quick script in Ruby or Node, but it's much harder to find resources aimed at processing/parsing a log file at GB/sec. It's definitely a topic I'd like to research/write about further.

Well it's not exactly rocket science, is it? You just need to work with the hardware in a vaguely sympathetic way and not do insane things.

If your goal is GBs/sec then first look at your I/O speed. It's probably less than that off a standard single SSD => you need to compress the data with an algorithm which offers very rapid decompression. A quick Google would find you things like zstd and lz4. You may infer that to get to GB/sec you might need multiple threads, and therefore to chunk your data somewhat.
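Here's a hedged sketch of the chunking idea in Python (the sample log lines and the `/gems/` filter are made up for illustration). Threads are enough when the work is I/O-bound - downloading, or decompression via zlib, which releases the GIL - while CPU-bound parsing in CPython would want processes instead:

```python
from concurrent.futures import ThreadPoolExecutor

def count_hits(chunk):
    """Worker: parse one chunk of log lines; here, count gem downloads."""
    return sum(1 for line in chunk if "/gems/" in line)

def chunked_count(lines, workers=4):
    """Split the input into chunks so several workers run concurrently -
    the chunking step the comment describes."""
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_hits, chunks))

log = ["GET /gems/rails", "GET /info", "GET /gems/rack", "GET /gems/rake"]
print(chunked_count(log))  # 3
```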

Beyond that, it obviously depends what you're doing with your data. Assuming it's plain text/JSON, but you want to extract dates and numbers from it, you'll need to parse those fast enough to keep up. (Naive datetime parsing can be slow, but is fairly easy to make fast, especially if your timestamps are sequential, or all for the same date. Lookup tables or whatever.)
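One way to make the datetime parsing cheap, sketched in Python: log timestamps repeat heavily (often many lines share the same second), so a small dict cache acts as the lookup table mentioned above. The format string and sample timestamp are illustrative.

```python
from datetime import datetime

_cache = {}

def parse_ts(ts):
    """Naive strptime on every line is slow; since timestamps repeat,
    cache each parsed value and pay the parsing cost only once per
    distinct timestamp."""
    hit = _cache.get(ts)
    if hit is None:
        hit = _cache[ts] = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return hit

print(parse_ts("2018-10-25 12:00:01"))  # 2018-10-25 12:00:01
```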

You'll want to avoid storing your records in some serialisation format that requires you to read the whole file in at once. (E.g. use one JSON object per line, or whatever.)
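A minimal illustration of the one-object-per-line (NDJSON) idea in Python; the sample records are made up:

```python
import json

def records(lines):
    """One JSON object per line: each record decodes independently,
    so you never need the whole file in memory and can process a
    stream line by line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)

log = '{"path": "/gems/rails", "status": 200}\n{"path": "/gems/rack", "status": 304}\n'
hits = [r for r in records(log.splitlines()) if r["status"] == 200]
print(len(hits))  # 1
```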

If you do all that, it's hard not to be fast, even with a GC'ed language and all the rest of it.

Interestingly, the Tiny C Compiler (tcc) can fully compile the 6 MB sqlite.c single-file distribution in about 1/10th of a second. Not just parse, but fully compile and link to an executable, with some optimizations.


What a shitty attitude to have. Do you think the authors of the first linear time suffix array construction algorithm should have been embarrassed too because they didn't think of it sooner?

Everyone goes through a learning process, and in many cases, also an evolution in the requirements themselves. There's absolutely no reason to be embarrassed.

This is a good post (and the regex crate in rust is _fantastic_), but I'm always a bit surprised when I find these tools were not built for general log searching and analytics, but:

> What we want out of those files is incredibly tiny—a few thousand integers, labelled with names and version numbers. For example, “2018-10-25 Bundler 1.16.2 123456”, or “2018-10-25 platform x86_64-darwin17 9876”.

I'm almost positive that something like Prometheus could do this just as cheaply (run a single Prometheus node on free tier EC2 with a grafana container alongside) - there would be no massive log files to parse, just integers for the desired download counts, collected regularly and viewable in near-real-time (instead of waiting for the day's log-and-parse job to be completed).

I recently replaced a massively expensive "view count" system, which was built using bloom filters and Hadoop map-reduce, with a very small piece of code that called `INCR` and `GET` against a Redis Cluster. This allowed real-time view counts, for far less money, with exponentially less log storage, massively fewer in-house lines of code, and an API that could not possibly be made simpler (no Java, Hadoop, or AWS knowledge required).
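A minimal sketch of the counter approach. A real deployment would issue `INCR`/`GET` against a Redis client; a plain in-memory class stands in here so the example is self-contained, and the key naming is just an assumption for illustration.

```python
class ViewCounter:
    """Stand-in for a Redis counter: incr() mirrors INCR (atomic
    increment, returns the new value), get() mirrors GET."""

    def __init__(self):
        self._counts = {}

    def incr(self, key):
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

    def get(self, key):
        return self._counts.get(key, 0)

counter = ViewCounter()
for _ in range(3):
    counter.incr("views:gem:rails")  # one call per event, as it happens
print(counter.get("views:gem:rails"))  # 3
```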

Why is it that developers so badly want to parse lots and lots of access log data, rather than just increment a metric somewhere as an event happens?

AWS Glue feels like a product meant for people who refuse to take a step back and think about what they're building. We also evaluated that and found it would be hilariously expensive, had poor documentation, was a brand new product, had OOM issues, and _still_ people on my team wanted to use it rather than a system actually designed for metrics in the first place.

Oftentimes the interesting data isn't what the metric is, but how it correlates with different user populations and actions taken. That can be as simple as knowing that all of your pageviews with zero clickthroughs happened on IE and were because clicktracking was broken on that browser. You need to store the individual events to do that sort of analysis, and once you're in that situation you run into single-source-of-truth consistency issues that strongly encourage you to store them as a log.
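To illustrate why the raw events matter, here's a tiny Python sketch of the IE example above - finding browsers with pageviews but zero clickthroughs. The event records are hypothetical; the point is that a single pre-aggregated counter could never answer this after the fact.

```python
from collections import Counter

# Hypothetical raw events; a plain view counter would have discarded
# the per-browser breakdown needed for this analysis.
events = [
    {"type": "pageview", "browser": "IE"},
    {"type": "pageview", "browser": "IE"},
    {"type": "pageview", "browser": "Firefox"},
    {"type": "click", "browser": "Firefox"},
]

views = Counter(e["browser"] for e in events if e["type"] == "pageview")
clicks = Counter(e["browser"] for e in events if e["type"] == "click")

# Browsers that got pageviews but never a single click.
zero_clickthrough = [b for b in views if clicks[b] == 0]
print(zero_clickthrough)  # ['IE']
```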

If that's not the situation your business is in, by all means just increment a counter. It's faster, simpler, and takes much less storage space. It sounds like the RubyGems situation in the article requires new metrics to be added on request, though, and it's really handy if those can be backfilled from previous logs even if you hadn't thought to compute that metric beforehand.

The use case in the article was to parse logs from a CDN serving cached content, not an app server which serves every request.

Rubygems.org uses Fastly which supports streaming logs (as should any CDN worth their salt - there is no reason whatsoever to wait a whole day to download data that is by definition event-based). FWIW, Fastly CDN log streaming is exactly how I built the view-count system I mentioned above.

At the end of the day a computer does a thing and counts the thing it did. Why write down what it did in a bloated format that only needs to be picked apart later?

From the post it looks like that's what he does now. Fastly uploads files to S3 (you can specify how often), and then a Lambda function is run per file to aggregate the files into view-counts. How is this different from yours?

Why take such a roundabout way of solving the problem? Why dump data into a log and then parse it back? Why don't they just have a separate structured log file/database with only the information they need?

"It turns out serde, the Rust JSON library, is super fast. It tries very hard to not allocate, and it can deserialize the (uncompressed) 1GB of JSON into Rust structs in 2 seconds flat."

1GB / 2 seconds = 500MB/s, which happens to be the sequential read speed of a SATA SSD.

Beyond decompressing gzipped data only to then skip most of it, there doesn't appear to be any performance problem at all, as far as I can see.

What's the grep speed?

Great ending!
