And that was an _improvement_ on the original, note - 36 minutes for a 1GB file works out to about 470KB per second.
That is literally 1000 times slower than a modern SSD is capable of shovelling your data. And remember, I/O is s-l-o-w.
I just despair, I really do.
Note that this article sweeps the cost of gzip under the carpet, which I would expect to dominate here. Try compressing with something like LZ4-HC, you'll then be able to decompress at GBs/sec of raw data.
"With Rust in Lambda, each 1GB file takes about 23 seconds to download and parse. That’s about a 78x speedup compared to each Python Glue worker."
There should be no separate "download" and "processing" steps. With stream processing, the whole thing should take exactly as long as it takes to download the file.
Also: https://github.com/facebook/zstd ; better ratios, better speed.
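Roughly what I mean, as a Rust sketch (the file name, record shape, and `/gems/` filter are all made up; swap `flate2`'s `GzDecoder` for `zstd::stream::Decoder` if you recompress with zstd):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::read::GzDecoder;
use serde::Deserialize;

// Hypothetical record shape and field name - the real logs differ.
#[derive(Deserialize)]
struct LogLine {
    path: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Any `Read` works here - a file, an S3 response body, a socket.
    // Decompression and parsing happen as the bytes arrive, so there
    // is no "download" phase followed by a separate "parse" phase.
    let raw = File::open("logs.json.gz")?;
    let lines = BufReader::new(GzDecoder::new(raw)).lines();

    let mut count: u64 = 0;
    for line in lines {
        let record: LogLine = serde_json::from_str(&line?)?;
        if record.path.starts_with("/gems/") {
            count += 1;
        }
    }
    println!("{} matching lines", count);
    Ok(())
}
```

The loop runs as fast as the slowest of network, decompression, and parsing - not their sum.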
On that matter - does anyone want to start a group that would lobby for making computation more efficient, exactly the same way the greens are lobbying to get rid of single-use plastics?
E.g., as they say about CGI: computers became more powerful, but the minutes per frame haven't generally changed year on year - the frames have just become more detailed.
A more accurate analogy would be building a 400mph maglev train line and discovering people are riding bicycles down the tracks.
Or that you've built a 400-lane highway but everyone's still stuck in traffic because they each made their cars one hundred times wider.
E.g., instead of only using a car when you need to get somewhere "now", you start taking the car into the city centre just to get coffee.
So there is no difference and that's why I mentioned it.
In general though, your analogy mostly holds, since renderers are already optimized about as much as possible.
I'd also expect gzip to dominate CPU time, but this doesn't matter much when CPU time is basically free. Sometime in the future (or today, if you're Google) it'll be important to parse logs at GB/sec. For most businesses, today, it's not really.
If your goal is GBs/sec then first look at your I/O speed. It's probably less than that off a standard single SSD => you need to compress the data with an algorithm which offers very rapid decompression. A quick Google would find you things like zstd and lz4. You may infer that to get to GB/sec you might need multiple threads, and therefore to chunk your data somewhat.
Beyond that, it obviously depends what you're doing with your data. Assuming it's plain text/JSON, but you want to extract dates and numbers from it, you'll need to parse those fast enough to keep up. (Naive datetime parsing can be slow, but is fairly easy to make fast, especially if your timestamps are sequential, or all for the same date - lookup tables or whatever; see the sketch below.)
You'll want to avoid storing your records in some serialisation format that requires you to read the whole file in at once. (E.g. use one line per JSON object, or whatever.)
If you do all that, it's hard not to be fast, even with a GC'ed language and all the rest of it.
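On the datetime point, here's roughly what I mean by making it fast - a tiny sketch with `chrono`, assuming ISO dates and mostly-sequential timestamps:

```rust
use chrono::NaiveDate;

/// Caches the last date string seen. Log timestamps tend to be
/// sequential, so most lines hit the cache and skip parsing entirely.
struct DateCache {
    last: Option<(String, NaiveDate)>,
}

impl DateCache {
    fn new() -> Self {
        DateCache { last: None }
    }

    fn parse(&mut self, raw: &str) -> Result<NaiveDate, chrono::ParseError> {
        if let Some((cached_raw, cached_date)) = &self.last {
            if cached_raw == raw {
                return Ok(*cached_date); // cache hit: no parsing at all
            }
        }
        let parsed = NaiveDate::parse_from_str(raw, "%Y-%m-%d")?;
        self.last = Some((raw.to_owned(), parsed));
        Ok(parsed)
    }
}

fn main() {
    let mut cache = DateCache::new();
    // A run of identical dates parses once, then hits the cache.
    for ts in ["2018-10-25", "2018-10-25", "2018-10-26"] {
        println!("{}", cache.parse(ts).unwrap());
    }
}
```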
Everyone goes through a learning process, and in many cases, also an evolution in the requirements themselves. There's absolutely no reason to be embarrassed.
> What we want out of those files is incredibly tiny—a few thousand integers, labelled with names and version numbers. For example, “2018-10-25 Bundler 1.16.2 123456”, or “2018-10-25 platform x86_64-darwin17 9876”.
I'm almost positive that something like Prometheus could do this just as cheaply (run a single Prometheus node on free-tier EC2 with a Grafana container alongside) - there would be no massive log files to parse, just integers for the desired download counts, collected regularly and viewable in near-real-time (instead of waiting for the day's log-and-parse job to complete).
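Something like this, sketched with the Rust `prometheus` crate (metric and label names invented to match the article's example):

```rust
use prometheus::{register_int_counter_vec, Encoder, IntCounterVec, TextEncoder};

fn main() {
    // Labelled counter, matching the article's "Bundler 1.16.2" example.
    let downloads: IntCounterVec = register_int_counter_vec!(
        "gem_downloads_total",
        "Gem downloads by name and version.",
        &["gem", "version"]
    )
    .unwrap();

    // Increment at the moment the download is served - no log parsing.
    downloads.with_label_values(&["bundler", "1.16.2"]).inc();

    // Prometheus then scrapes this text exposition format periodically.
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .unwrap();
    println!("{}", String::from_utf8(buf).unwrap());
}
```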
I recently replaced a massively expensive "view count" system which was built using Bloom filters and Hadoop MapReduce with a very small piece of code that called `INCR` and `GET` against a Redis Cluster - this allowed real-time view counts, for far less money, with exponentially less log storage, massively fewer in-house lines of code, and an API that could not possibly be made simpler (no Java or Hadoop or AWS knowledge required)
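Morally the whole thing boils down to something like this (a Rust sketch with the `redis` crate, against a single node for brevity; key names invented):

```rust
use redis::Commands;

fn main() -> redis::RedisResult<()> {
    // Hypothetical key scheme: one counter per item.
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    // Record a view: one round trip, no log line written anywhere.
    let views: u64 = con.incr("views:item:42", 1)?;

    // Read it back in real time - no batch job to wait for.
    let current: u64 = con.get("views:item:42")?;
    println!("views so far: {} (just wrote {})", current, views);
    Ok(())
}
```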
Why is it that developers so badly want to parse lots and lots of access log data, rather than just increment a metric somewhere as an event happens?
AWS Glue feels like a product meant for people who refuse to take a step back and think about what they're building. We also evaluated that and found it would be hilariously expensive, had poor documentation, was a brand new product, had OOM issues, and _still_ people on my team wanted to use it rather than a system actually designed for metrics in the first place.
If that's not the situation your business is in, by all means just increment a counter. It's faster, simpler, and takes much less storage space. It sounds like the RubyGems situation in the article requires new metrics to be added on request, though, and it's really handy if those can be backfilled from previous logs even if you hadn't thought to compute that metric beforehand.
At the end of the day a computer does a thing and counts the thing it did. Why write down what it did in a bloated format that only needs to be picked apart later?
"It turns out serde, the Rust JSON library, is super fast. It tries very hard to not allocate, and it can deserialize the (uncompressed) 1GB of JSON into Rust structs in 2 seconds flat."
1GB / 2 seconds = 500MB/s, which happens to be about the sequential read speed of a SATA SSD.
Beyond decompressing gzipped data only to then skip most of it, there doesn't appear to be any performance problem at all, as far as I can see.
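For what it's worth, a big part of that "tries very hard to not allocate" is serde's zero-copy deserialization - a sketch with invented field names:

```rust
use serde::Deserialize;

// `&str` fields borrow straight out of the input buffer, so there is
// no per-record string allocation - a big part of why serde is fast.
#[derive(Deserialize)]
struct LogLine<'a> {
    timestamp: &'a str,
    path: &'a str,
}

fn main() -> Result<(), serde_json::Error> {
    let input = r#"{"timestamp":"2018-10-25T00:00:00Z","path":"/gems/bundler-1.16.2.gem"}"#;

    // The deserialized struct's lifetime is tied to `input`: the two
    // fields are slices into it, not freshly allocated Strings.
    let line: LogLine = serde_json::from_str(input)?;
    println!("{} {}", line.timestamp, line.path);
    Ok(())
}
```

One caveat: borrowed `&str` fields fail if a JSON string contains escape sequences; `Cow<str>` with `#[serde(borrow)]` handles both cases.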