
Parsing logs 230x faster with Rust - steveklabnik
https://andre.arko.net/2018/10/25/parsing-logs-230x-faster-with-rust/
======
HereBeBeasties
It's a pretty sad state we've got ourselves into as an industry when people
think that parsing a 1GB log file in three minutes is "exciting". That's
under six megabytes per second. Jeez, anyone would think it's 1988, not 2018.

And that was an _improvement_ from the original, note - 36 minutes is 470KB
per second.

That is literally 1000 times slower than a modern SSD should be capable of
shovelling your data. And remember, I/O is s-l-o-w.

I just despair, I really do.

Note that this article sweeps the cost of gzip under the carpet, which I would
expect to dominate here. Try compressing with something like LZ4-HC instead;
you'll then be able to decompress raw data at GB/s.
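
For what it's worth, the swap is tiny in Rust, since decompressors are just
Read streams. A minimal sketch, assuming the lz4_flex crate (one of several
LZ4 implementations; the file name is made up):

    use std::fs::File;
    use std::io::{BufRead, BufReader};

    use lz4_flex::frame::FrameDecoder; // lz4_flex = "0.11"

    fn main() -> std::io::Result<()> {
        // Stream-decompress an LZ4-frame file; LZ4 gives up some
        // compression ratio for decompression in the GB/s range.
        let file = File::open("fastly.log.lz4")?;
        let reader = BufReader::new(FrameDecoder::new(file));

        let mut lines = 0u64;
        for line in reader.lines() {
            line?; // surface any decode/IO error
            lines += 1;
        }
        println!("{lines} lines");
        Ok(())
    }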

~~~
dpc_pw
I share the sentiment. While the tech is getting faster and faster, our
typical ways of handling data are getting worse and worse, eliminating all the
gains. Devs just tolerate slower languages, tools, architectures,
methodologies, etc. until a wall of some kind is hit.

"With Rust in Lambda, each 1GB file takes about 23 seconds to download and
parse. That’s about a 78x speedup compared to each Python Glue worker."

There should be no separate "download" and "processing" steps. With stream
processing, the whole thing should take exactly as long as downloading the
file does.
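
In Rust that overlap falls out naturally, because an HTTP response body and
a decompressor are both just Read streams you can chain. A sketch assuming
the reqwest and flate2 crates and a made-up URL:

    use std::io::{BufRead, BufReader};

    use flate2::read::GzDecoder; // flate2 = "1"
    // reqwest = { version = "0.12", features = ["blocking"] }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical presigned S3 URL; the body arrives as a byte stream.
        let url = "https://example.com/logs/2018-10-25.log.gz";
        let resp = reqwest::blocking::get(url)?;

        // Decompress and scan while bytes are still arriving, so download
        // and processing overlap instead of running back to back.
        let reader = BufReader::new(GzDecoder::new(resp));
        let mut hits = 0u64;
        for line in reader.lines() {
            if line?.contains("bundler") {
                hits += 1;
            }
        }
        println!("{hits} bundler requests");
        Ok(())
    }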

Also: [https://github.com/facebook/zstd](https://github.com/facebook/zstd) ;
better ratios, better speed.

~~~
heavenlyblue
When our industry hits the ceiling of scaling, then we'll start thinking about
optimisation. Until then, there is plenty of other low-hanging fruit lying
around.

On that matter - does anyone want to start a group that would lobby for making
computation more efficient, the same way the greens are lobbying to get rid of
single-use plastics?

~~~
dpc_pw
Exactly. And because of those basic business incentives, no matter how fast
the hardware gets, we will always have software that is just fast enough to be
barely bearable, but not a notch faster. That's why I'm depressed ;).

~~~
heavenlyblue
You could also argue that computation has the same tendency as building
new/better roads in cities: the more efficient and available it is, the more
of it we will use.

E.g. as they say about CGI: computers became more powerful, but the minutes
per frame haven't generally changed year over year - the frames have just
become more detailed.

~~~
HereBeBeasties
The difference is that the roads are pushing way more traffic and the frames
of CGI are getting more and more detailed / realistic / whatever.

A more accurate analogy would be building a 400mph maglev train line and
discovering people are riding bicycles down the tracks.

Or that you've built a 400-lane highway but everyone is still stuck in traffic
because they each made their car a hundred times wider.

~~~
CyberDildonics
Artists also don't have to optimize their renders as much, which means less
time spent worrying about technical details and less complexity, so there is
some overlap with CGI.

In general, though, your analogy is much more true than not, since renderers
are mostly optimized as much as possible.

~~~
heavenlyblue
You could argue that the only reason renderers are optimised is that what
they're rendering is the sole selling point of the movie. If they weren't
optimised, they wouldn't be able to deliver the novel imagery the audience is
expected to love.

------
erulabs
This is a good post (and the regex crate in Rust is _fantastic_), but I'm
always a bit surprised when I find these tools were built not for general log
searching and analytics, but:

> What we want out of those files is incredibly tiny—a few thousand integers,
> labelled with names and version numbers. For example, “2018-10-25 Bundler
> 1.16.2 123456”, or “2018-10-25 platform x86_64-darwin17 9876”.

I'm almost positive that something like Prometheus could do this just as
cheaply (run a single Prometheus node on a free-tier EC2 instance with a
Grafana container alongside) - there would be no massive log files to parse,
just integers for the desired download counts, collected regularly and
viewable in near-real-time (instead of waiting for the day's log-and-parse
job to complete).
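
A sketch of the counting side with the prometheus crate in Rust (metric and
label names are made up for illustration):

    use prometheus::{register_int_counter_vec, Encoder, IntCounterVec, TextEncoder};

    fn main() {
        // One counter per (gem, version) pair - the "few thousand integers,
        // labelled with names and version numbers" the article wants.
        let downloads: IntCounterVec = register_int_counter_vec!(
            "gem_downloads_total",
            "Gem downloads by name and version",
            &["gem", "version"]
        )
        .unwrap();

        // Increment as each request is served, instead of logging it
        // somewhere to be parsed later.
        downloads.with_label_values(&["bundler", "1.16.2"]).inc();

        // Render the counters in the text format Prometheus scrapes.
        let mut buf = Vec::new();
        TextEncoder::new().encode(&prometheus::gather(), &mut buf).unwrap();
        println!("{}", String::from_utf8(buf).unwrap());
    }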

I recently replaced a massively expensive "view count" system built with
Bloom filters and Hadoop map-reduce with a very small piece of code that
called `INCR` and `GET` against a Redis Cluster - this gave real-time view
counts, for far less money, with exponentially less log storage, massively
fewer in-house lines of code, and an API that could not possibly be made
simpler (no Java, Hadoop, or AWS knowledge required).
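
The entire replacement is essentially this; a sketch with the Rust redis
crate against a single node (key names are hypothetical, and the cluster
variant only changes the connection setup):

    use redis::Commands; // redis = "0.27"

    fn main() -> redis::RedisResult<()> {
        let client = redis::Client::open("redis://127.0.0.1/")?;
        let mut con = client.get_connection()?;

        // Count the view the moment it happens...
        let _: i64 = con.incr("views:video:123", 1)?;

        // ...and read it back in real time.
        let views: i64 = con.get("views:video:123")?;
        println!("video 123 has {views} views");
        Ok(())
    }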

Why is it that developers so badly want to parse lots and lots of access log
data, rather than just incrementing a metric somewhere as each event happens?

AWS Glue feels like a product meant for people who refuse to take a step back
and think about what they're building. We also evaluated it and found it
hilariously expensive, poorly documented, brand new, and prone to OOM issues -
and _still_ people on my team wanted to use it rather than a system actually
designed for metrics in the first place.

~~~
gilfoyle
The use case in the article was parsing logs from a CDN serving cached
content, not an app server that serves every request.

~~~
erulabs
Rubygems.org uses Fastly, which supports streaming logs (as should any CDN
worth its salt - there is no reason whatsoever to wait a whole day to
download data that is by definition event-based). FWIW, Fastly CDN log
streaming is exactly how I built the view-count system I mentioned above.

At the end of the day a computer does a thing and counts the thing it did. Why
write down what it did in a bloated format that only needs to be picked apart
later?

~~~
zerd
From the post it looks like that's what he does now: Fastly uploads files to
S3 (you can specify how often), and then a Lambda function runs per file to
aggregate them into view counts. How is this different from yours?

------
imtringued
Why take such a roundabout way of solving the problem? Why dump data into a
log and then parse it back? Why don't they just have a separate structured log
file/database with only the information they need?

"It turns out serde, the Rust JSON library, is super fast. It tries very hard
to not allocate, and it can deserialize the (uncompressed) 1GB of JSON into
Rust structs in 2 seconds flat."

1GB / 2 seconds = 500MB/s, which happens to be the sequential read speed of a
SATA SSD.
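
(Strictly speaking, serde is the serialization framework and serde_json the
JSON backend.) The "tries very hard to not allocate" part largely comes from
borrowing: derived structs can hold &str slices that point straight into the
input buffer. A sketch with made-up field names:

    use serde::Deserialize; // serde = { version = "1", features = ["derive"] }

    // Borrowed &str fields are zero-copy: serde_json leaves the string
    // data in the input buffer instead of allocating new Strings.
    #[derive(Deserialize)]
    #[allow(dead_code)]
    struct LogLine<'a> {
        timestamp: &'a str,
        url: &'a str,
        user_agent: &'a str,
    }

    fn main() -> serde_json::Result<()> {
        let raw = r#"{"timestamp":"2018-10-25T00:00:00Z","url":"/gems/bundler-1.16.2.gem","user_agent":"bundler/1.16.2"}"#;
        let line: LogLine = serde_json::from_str(raw)?;
        println!("{} {}", line.timestamp, line.url);
        Ok(())
    }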

Beyond decompressing gzipped data only to then skip most of it, there doesn't
appear to be any performance problem here at all, as far as I can see.

------
kwillets
What's the grep speed?

------
db48x
Great ending!

