And that was an _improvement_ on the original, note - 36 minutes for a 1GB file works out to about 470KB per second.
That is literally 1000 times slower than a modern SSD should be capable of shovelling your data. And remember, I/O is s-l-o-w.
I just despair, I really do.
Note that this article sweeps the cost of gzip under the carpet - I would expect decompression to dominate here. Try compressing with something like LZ4-HC instead; you'll then be able to decompress at GB/s of raw data.
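E.g. a minimal sketch in Rust with the lz4 crate (crate choice is mine; as far as I know the frame format's higher compression levels use the HC match finder):

```rust
use std::io::{self, Read, Write};

fn main() -> io::Result<()> {
    // Stand-in for your raw log data.
    let raw = b"2024-01-15 00:00:01 INFO something happened\n".repeat(100_000);

    // Compress once, at a high level, so reads are cheap forever after.
    let mut enc = lz4::EncoderBuilder::new().level(9).build(Vec::new())?;
    enc.write_all(&raw)?;
    let (compressed, result) = enc.finish();
    result?;

    // Decompression is the cheap direction - typically GB/s of raw
    // output per core, which is the point being made above.
    let mut dec = lz4::Decoder::new(&compressed[..])?;
    let mut out = Vec::with_capacity(raw.len());
    dec.read_to_end(&mut out)?;
    assert_eq!(out, raw);
    Ok(())
}
```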
"With Rust in Lambda, each 1GB file takes about 23 seconds to download and parse. That’s about a 78x speedup compared to each Python Glue worker."
There should be no separate "download" and "parse" steps. With stream processing, the whole thing should take exactly as long as downloading the file.
Also: https://github.com/facebook/zstd; better ratios, better speed.
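To make the streaming point concrete, a sketch in Rust using the zstd crate (from the link above), where `body` stands in for whatever `Read` your HTTP client gives you - that part is assumed:

```rust
use std::io::{BufRead, BufReader, Read};

// `body` is whatever Read your HTTP client exposes for the response
// stream (assumed here); nothing is downloaded to disk first.
fn count_error_lines(body: impl Read) -> std::io::Result<u64> {
    // Decompress on the fly: zstd::Decoder is a Read over any Read.
    let decoder = zstd::Decoder::new(body)?;
    let mut count = 0;
    for line in BufReader::new(decoder).lines() {
        // The "parse" step runs while later bytes are still in flight,
        // so total wall time ~= download time.
        if line?.contains("ERROR") {
            count += 1;
        }
    }
    Ok(count)
}
```

The same shape works for gzip inputs via flate2's GzDecoder, if you're stuck with .gz files.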
On that matter - does anyone want to start a group that would lobby for making computation more efficient, exactly the same way the greens are lobbying to get rid of single-use plastics?
E.g. as they say about CGI: computers became more powerful, but the minutes per frame have not generally changed year on year - the frames have just become more detailed.
A more accurate analogy would be building a 400mph maglev train line and discovering people are riding bicycles down the tracks.
Or that you've built a 400-lane highway but everyone's still stuck in traffic because they each made their cars one hundred times wider.
E.g. instead of only using a car when you need to get somewhere "now", you start using a car to get coffee in the city centre.
So there is no difference, and that's why I mentioned it.
In general though, your analogy is more true than not, since renderers are already optimised about as much as possible.
I'd also expect gzip to dominate CPU time, but this doesn't matter much when CPU time is basically free. Sometime in the future (or today, if you're Google) it'll be important to parse logs at GB/sec. For most businesses today, it isn't.
If your goal is GB/sec then first look at your I/O speed. It's probably less than that off a single standard SSD, so you'll need to compress the data with an algorithm that offers very rapid decompression - a quick Google will find you things like zstd and lz4. To get to GB/sec you may also need multiple threads, and therefore to chunk your data somewhat.
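Roughly like this, if each chunk is compressed independently - a Rust sketch with one thread per chunk (the paths and chunk layout are invented):

```rust
use std::io::Read;
use std::thread;

fn main() -> std::io::Result<()> {
    // Independently compressed chunks, so threads never coordinate.
    let chunks = ["logs-00.zst", "logs-01.zst", "logs-02.zst", "logs-03.zst"];

    let handles: Vec<_> = chunks
        .iter()
        .map(|path| {
            let path = path.to_string();
            thread::spawn(move || -> std::io::Result<u64> {
                let file = std::fs::File::open(&path)?;
                let mut decoder = zstd::Decoder::new(file)?;
                // Stand-in for real work: count raw bytes decompressed.
                let mut buf = [0u8; 1 << 16];
                let mut total = 0u64;
                loop {
                    let n = decoder.read(&mut buf)?;
                    if n == 0 {
                        break;
                    }
                    total += n as u64;
                }
                Ok(total)
            })
        })
        .collect();

    let mut grand_total = 0;
    for h in handles {
        grand_total += h.join().expect("thread panicked")?;
    }
    println!("decompressed {grand_total} raw bytes from {} chunks", chunks.len());
    Ok(())
}
```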
Beyond that, it obviously depends what you're doing with your data. Assuming it's plain text/JSON and you want to extract dates and numbers from it, you'll need to parse those fast enough to keep up. (Naive datetime parsing can be slow, but is fairly easy to make fast, especially if your timestamps are sequential, or all for the same date. Lookup tables or whatever.)
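E.g. the caching trick in miniature - a pure-std Rust sketch that memoises the calendar maths per date string, since consecutive log lines usually share a date ("YYYY-MM-DD HH:MM:SS" is my assumed format):

```rust
use std::collections::HashMap;

// Days since 1970-01-01 for a civil date (Howard Hinnant's algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;
    let doy = (153 * (if m > 2 { m - 3 } else { m + 9 }) + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146097 + doe - 719468
}

/// Parses "YYYY-MM-DD HH:MM:SS" into epoch seconds, caching the date
/// part, so most lines only pay for parsing six time digits.
struct TsParser {
    cache: HashMap<String, i64>, // "YYYY-MM-DD" -> days since epoch
}

impl TsParser {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    fn parse(&mut self, ts: &str) -> Option<i64> {
        let (date, time) = (ts.get(..10)?, ts.get(11..19)?);
        let days = match self.cache.get(date) {
            Some(&d) => d,
            None => {
                let days = days_from_civil(
                    date.get(..4)?.parse().ok()?,
                    date.get(5..7)?.parse().ok()?,
                    date.get(8..10)?.parse().ok()?,
                );
                self.cache.insert(date.to_string(), days);
                days
            }
        };
        let h: i64 = time.get(..2)?.parse().ok()?;
        let m: i64 = time.get(3..5)?.parse().ok()?;
        let s: i64 = time.get(6..8)?.parse().ok()?;
        Some(days * 86_400 + h * 3_600 + m * 60 + s)
    }
}
```

If the timestamps are strictly sequential you can go further and just cache the single last date string rather than a map.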
You'll want to avoid storing your records in some serialisation format that requires you to read the whole file in at once. (E.g. use one line per JSON object, or whatever.)
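For the one-line-per-object ("JSON Lines") case, reading looks like this - a sketch with serde_json, where the field names are invented:

```rust
use std::io::{BufRead, BufReader, Read};

// One JSON object per line: each record parses independently, so
// memory stays flat no matter how big the file is.
fn scan(input: impl Read) -> std::io::Result<()> {
    for line in BufReader::new(input).lines() {
        let line = line?;
        let record: serde_json::Value = match serde_json::from_str(&line) {
            Ok(v) => v,
            Err(e) => {
                eprintln!("bad line: {e}"); // skip rather than abort
                continue;
            }
        };
        // "level" and "msg" are made-up field names for the example.
        if record["level"] == "ERROR" {
            println!("{}", record["msg"]);
        }
    }
    Ok(())
}
```

And because each line is independent, this is also the natural unit for fanning work out across threads, per the chunking point above.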
If you do all that, it's hard not to be fast, even with a GC'ed language and all the rest of it.
Everyone goes through a learning process, and in many cases, also an evolution in the requirements themselves. There's absolutely no reason to be embarrassed.