Building things for spinning disks is a cul-de-sac: there's no exit from it onto the faster characteristics of SSDs.
A significant part of Hadoop is built around the idea that "random = bad, sequential = good" and this is still arguably true when it comes to writes.
However, building around sequential reads gives you the wrong idea when the hardware changes underneath you.
I see a small sign of that in this post, where it talks about smarter data partitioning.
For example, when dealing with something like a join, a spilling, sequential-write join like a Grace hash-join works much better than a memory-mapped hashtable on spinning disks - however, on an SSD with good random IOPS, the Grace join gets crushed.
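Roughly the shape I mean (a toy Python sketch; the partition count, the tab-separated row format, and the assumption that each bucket fits in memory are all made up for illustration):

```python
import os
import tempfile

def grace_hash_join(left_rows, right_rows, key_left, key_right, num_partitions=16):
    """Grace hash-join sketch: hash-partition both inputs into spill files with
    sequential writes, then join matching partitions with an in-memory table."""
    tmp = tempfile.mkdtemp()

    def spill(rows, key, side):
        # Phase 1: sequential writes only - friendly to spinning disks.
        files = [open(os.path.join(tmp, f"{side}_{p}.txt"), "w") for p in range(num_partitions)]
        for row in rows:
            files[hash(key(row)) % num_partitions].write("\t".join(row) + "\n")
        for f in files:
            f.close()

    spill(left_rows, key_left, "left")
    spill(right_rows, key_right, "right")

    # Phase 2: per-partition in-memory hash join (each bucket assumed to fit in RAM).
    for p in range(num_partitions):
        table = {}
        with open(os.path.join(tmp, f"left_{p}.txt")) as f:
            for line in f:
                row = line.rstrip("\n").split("\t")
                table.setdefault(key_left(row), []).append(row)
        with open(os.path.join(tmp, f"right_{p}.txt")) as f:
            for line in f:
                row = line.rstrip("\n").split("\t")
                for match in table.get(key_right(row), []):
                    yield match + row
```

On an SSD you'd usually skip the spill phase entirely and just probe a big (possibly mmapped) hashtable, eating the random reads.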
Also, if the author is reading this - zlib isn't a slouch either (just like a spinning disk); the trick is to use the right combination of zlib parameters. Compression will be slower with zlib, but its decompression throughput can exceed a memcpy of the same data, depending on your data conditions (look for inflate_fast and inflate_slow).
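By "the right combination of parameters" I mean turning the knobs beyond just the level. A small Python sketch (the sample data is made up, and whether decompression actually beats a memcpy is entirely data-dependent):

```python
import zlib

# Highly repetitive, columnar-ish data: long matches keep inflate on its fast path.
data = b"ts,price,qty\n" + b"2021-01-04,101.5,200\n" * 100_000

# Spend memory (memLevel=9, full 32 KB window via wbits=15) for better matches;
# Z_FILTERED is a data-dependent choice that can help on numeric-heavy input.
comp = zlib.compressobj(level=6, wbits=15, memLevel=9, strategy=zlib.Z_FILTERED)
packed = comp.compress(data) + comp.flush()

print(len(data), "->", len(packed))   # compression ratio
restored = zlib.decompress(packed)    # decode side: inflate_fast on long matches
assert restored == data
```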
And if you're getting into the business of improving sequential reads, the next big trick for systems is vertical partitioning of the rows (think TPC-H, except now you store L_COMMENT in a separate file).
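At the file level it's as simple as this toy Python sketch (the column names are just the TPC-H lineitem ones, picked for the example; scans over the hot columns now read far fewer bytes sequentially):

```python
import csv

HOT = ["l_orderkey", "l_quantity", "l_extendedprice"]  # frequently scanned
COLD = ["l_comment"]                                    # wide, rarely read

def vertically_partition(src, hot_path, cold_path):
    """Split a lineitem-style CSV so the wide comment column lives in its own file.
    Row order is preserved, so row i in both output files describes the same tuple."""
    with open(src, newline="") as fin, \
         open(hot_path, "w", newline="") as fhot, \
         open(cold_path, "w", newline="") as fcold:
        reader = csv.DictReader(fin)
        hot = csv.DictWriter(fhot, fieldnames=HOT)
        cold = csv.DictWriter(fcold, fieldnames=COLD)
        hot.writeheader()
        cold.writeheader()
        for row in reader:
            hot.writerow({k: row[k] for k in HOT})
            cold.writerow({k: row[k] for k in COLD})
```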
> However, building around sequential reads gives you the wrong idea when the hardware changes underneath you.
If you're going to build a system tuned to the hardware, you can't ignore the hardware. What abstraction should the author be using instead?
Many people deal with CSVs produced by external sources (e.g., financial data). Having to parse and load them, then redo that on every change, is tough for non-technical users, and still a waste of time for technical users.