Just finished reading DDIA, can't recommend it enough! I learned something new in every single chapter, even on topics I thought I had a firm grasp on. Great job Martin Kleppmann!
I know this sounds like a cliche at this point, but volume 3 of Donald Knuth's "The Art of Computer Programming" goes into more depth on the theory that underpins these algorithms than anything else I've ever seen/read/heard of (in fairness, I haven't read OP's suggestion yet, though).
This is one of the best technology books I've ever read. I spent a few years diving into big data architecture. I thought I had a reasonably good handle on it, then I read this book. So many insights.
There is one important detail that is never fully articulated in this article. It says B-trees are read-optimized. That's true, but the big difference only shows up for random reads, not sequential reads. Seeking a key in an LSM tree is relatively expensive because it has to consult multiple indexes, but once the seek is done, reading a lot of keys and values from that point on is cheap (or at least not much more expensive than in a B-tree). The reason for this is mentioned in the article: the SSTables. They allow quick scanning through the keys, because they are stored on disk in sorted order.
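To make the seek-then-scan point concrete, here's a minimal sketch (Python, with a toy in-memory "SSTable" as a sorted list; that's my simplification, not a real on-disk format):

    import bisect

    def scan_from(sstable, start_key):
        """sstable: list of (key, value) pairs kept sorted by key.
        The expensive part is the initial seek (a binary search here,
        multiple index lookups in a real LSM); everything after it is
        a cheap sequential read, exactly because of the sort order."""
        keys = [k for k, _ in sstable]           # real SSTables keep a sparse index instead
        i = bisect.bisect_left(keys, start_key)  # the seek
        while i < len(sstable):                  # the scan
            yield sstable[i]
            i += 1

    table = [("a", 1), ("c", 3), ("d", 4), ("k", 11)]
    print(list(scan_from(table, "c")))  # [('c', 3), ('d', 4), ('k', 11)]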
LSMs do not guarantee that sequential key/value pairs are stored in the same SSTable so reading "n" k/v pairs, in the worst case, could require seeking through "n" different SSTables. There are datastores that use LSMs such as HBase that provide guarantees on top of LSMs to facilitate efficient range reads and, of course, many LSM compaction algorithms improve range reads, but the basic LSM is optimized only for single k/v reads. That's the reason, for example, why Cassandra doesn't even offer range reads such as "SELECT * FROM mytable WHERE KEY > x AND key < (x + k)".
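To illustrate the worst case (again a toy sketch in Python; real SSTables are on-disk files and newer tables shadow older versions of a key, which I'm ignoring here):

    import heapq

    def range_read(sstables, lo, hi):
        """Each sstable is a (key, value) list sorted by key. heapq.merge
        is a k-way merge, i.e. it effectively keeps one cursor (one seek)
        open per SSTable for the whole range read."""
        for k, v in heapq.merge(*sstables):
            if k > hi:
                break
            if k >= lo:
                yield k, v

    tables = [
        [("a", 1), ("d", 4)],  # sequential keys are spread across tables,
        [("b", 2), ("e", 5)],  # so no single table covers the range
        [("c", 3), ("f", 6)],
    ]
    print(list(range_read(tables, "b", "e")))  # [('b', 2), ('c', 3), ('d', 4), ('e', 5)]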
> That's the reason, for example, why Cassandra doesn't even offer range reads
This is not true. The reason Cassandra doesn't support that is that keys are hashed across the cluster -- you'd have to query all shards and merge the results. That has nothing to do with the LSM storage.
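A quick illustration of the fan-out problem (Python; md5 stands in for the partitioner, which is my simplification):

    import hashlib

    def node_for(key, n_nodes=4):
        """Hash partitioning: lexically adjacent keys land on unrelated
        nodes, so a query like key > x AND key < x + k has to hit every
        node and merge, regardless of the storage engine underneath."""
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % n_nodes

    print([node_for("user%d" % i) for i in range(5)])  # scattered, e.g. [1, 3, 0, 2, 3]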
Enterprise drives usually have ultracaps with enough energy to power the write-back of the write buffer to stable storage in case of power loss, so your write latency is just the time it takes to hit the memory buffer, typically dominated by the PCIe transfer, i.e. a few µs (unless, of course, the buffer has filled, but that means you're hitting peak write bandwidth).
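Back-of-envelope, with my own assumed numbers rather than anything from a datasheet:

    payload = 4096                    # one 4 KiB write
    bandwidth = 3.2e9                 # ~usable bytes/s on PCIe 3.0 x4 (assumption)
    print(payload / bandwidth * 1e6)  # ~1.3 µs of transfer time, i.e. "a few µs"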
Consumer drives typically don't pay for this, but instead good drives buffer writes in SLC flash which has lower latency than MLC.
ADDENDUM: the funny thing is that you have to actively manage the energy in the caps; the write buffer must never hold more data than you have energy to write back. This becomes especially important at power-up, when the empty cap is still charging.
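In other words, something like this invariant (a sketch with hypothetical names, not any real firmware):

    def can_accept(write_len, buffered_bytes, cap_energy_joules,
                   joules_per_byte_flushed):
        """Admission control: everything in the write buffer must be
        flushable on the energy stored in the caps right now. The
        budget is near zero at power-up, while the empty caps charge."""
        budget_bytes = cap_energy_joules / joules_per_byte_flushed
        return buffered_bytes + write_len <= budget_bytes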
Yes, unless you did a journal flush. The default journal flush interval is 1 second because of hard drives, but on a good SSD it can be turned down to around 10 ms and still maintain those numbers.
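For anyone who hasn't seen what a flush interval buys you, here's a generic group-commit sketch (Python, not any particular database's implementation):

    import os, time

    def journal_writer(fd, records, flush_interval=0.01):  # 10 ms
        """Append records and fsync once per interval, so many writes
        share the cost of one durable flush; the interval bounds how
        much acknowledged-but-unflushed data a crash can lose."""
        deadline = time.monotonic() + flush_interval
        for record in records:
            os.write(fd, record)
            if time.monotonic() >= deadline:
                os.fsync(fd)  # everything written so far is now durable
                deadline = time.monotonic() + flush_interval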
I see how you made that mistake now. SSDs typically have many (e.g. tens to hundreds of) program operations happening in parallel. They also pack writes into units of around 16 KB or larger, so it's possible for multiple B+ tree ops to land in each program.
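Rough numbers (my guess for the record size):

    program_unit = 16 * 1024            # bytes per flash program, per the estimate above
    btree_write = 128                   # bytes per B+ tree record write (assumption)
    print(program_unit // btree_write)  # ~128 B+ ops amortized into one program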
Isn’t this pretty traditional stuff? What is modern about it?
Does any of this map to a GPU for column-oriented analytical data processing? Basically, machines are only good at reading and writing large, contiguous chunks of data. As these machines evolve, the optimal width of those chunks keeps getting larger. The volume of data available is growing. And the types of operations being done on data are becoming more “analytical” (meaning column-oriented and streaming access, rather than row-oriented random access). I would expect “modern storage” algorithms to therefore be cache-friendly, column-oriented, and take the modern in-memory storage hierarchy into account (from on-chip registers, to high-bandwidth GPU-type parallel devices, to NVRAM system memory).
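To make the row-vs-column point concrete, a toy NumPy comparison (my example, not from the article):

    import numpy as np

    n_rows, n_cols = 1_000_000, 20
    rows = np.random.rand(n_rows, n_cols)  # row-oriented: one record per row
    cols = np.ascontiguousarray(rows.T)    # column-oriented copy of the same data

    # The same aggregate, with very different access patterns:
    rows[:, 3].sum()  # strided reads: one useful element per cache line touched
    cols[3].sum()     # one contiguous, prefetch- and SIMD/GPU-friendly scan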
This article comes off to me like a CS101 intro doing Big-O asymptotic analysis on linked lists, without even mentioning the existence and effects of memory caches.
That's a fantastic introduction (although it's far more than an introduction!). I wrote a native addon for Node.js that does Reed Solomon, also using optimized Cauchy matrices (MIT license): https://github.com/ronomon/reed-solomon
Thanks. If that was in November last year, then it's changed a lot since then (and now a few times faster). Back then it was using a Vandermonde matrix as it was based on Backblaze's Java implementation, but a month ago everything was rewritten to use optimized Cauchy matrices.
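For the curious, the textbook construction of a Cauchy coding matrix over GF(2^8) looks like this (a sketch, not Ronomon's actual code; field polynomial 0x11D, as commonly used for Reed-Solomon):

    # GF(2^8) log/exp tables, generator 2, polynomial x^8+x^4+x^3+x^2+1
    GF_EXP, GF_LOG = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        GF_EXP[i], GF_LOG[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= 0x11D
    for i in range(255, 512):
        GF_EXP[i] = GF_EXP[i - 255]

    def gf_inv(a):
        return GF_EXP[255 - GF_LOG[a]]  # multiplicative inverse, a != 0

    def cauchy_matrix(m, k):
        """m parity rows x k data columns. Entry (i, j) is 1/(x_i + y_j),
        and addition in GF(2^8) is XOR; picking disjoint xs and ys keeps
        every denominator nonzero. Every square submatrix of a Cauchy
        matrix is invertible, which is exactly what erasure decoding needs."""
        xs, ys = range(m), range(m, m + k)
        return [[gf_inv(x ^ y) for y in ys] for x in xs]

    print(cauchy_matrix(2, 4))  # 2 parity shards over 4 data shards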
Until recently I worked for a molecular neurobiology research group that, among other things, used Single-molecule RNA Fluorescence in situ Hybridization (smFISH) to measure gene expression in tissue[0].
Basically: design viruses with fluorescent molecules attached to them such that they attach to specific sections of RNA in the cell that are associated with particular genes. Soak tissue in viruses. Look at tissue through a microscope and count individual fluorescent dots, each of which represents one RNA molecule (I find this absolutely mind-blowing). Wash off the viruses with a specific type of chemical, and repeat with a new one for a different gene.
You can only use a few colours at a time because otherwise the microscope cannot discern them. But that would severely limit the throughput: there are a lot of genes we want to check, but the tissue will also degenerate after repeated washing. So, what can we do? Well, as I've understood it, scientists using smFISH and similar techniques now use multiplexing with Hamming codes to get around that.
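Roughly how the coding part works, in miniature (my toy sketch using standard Hamming(7,4), not the lab's actual codebook): each gene gets a 7-bit codeword, one bit per imaging round ("lights up in this round" = 1), and the code's redundancy means a single misread round still decodes to the right gene.

    def encode(d1, d2, d3, d4):
        """4 data bits -> 7-bit codeword, one bit per imaging round."""
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]  # positions 1..7

    def correct(c):
        """Recover the data bits from 7 observed rounds, fixing at most
        one misread; the syndrome is the binary index of the flipped bit."""
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4,5,6,7
        pos = s1 + 2 * s2 + 4 * s3
        if pos:
            c[pos - 1] ^= 1             # flip the misread round back
        return c[2], c[4], c[5], c[6]

    word = encode(1, 0, 1, 1)
    word[4] ^= 1              # one round misread
    print(correct(word))      # (1, 0, 1, 1): still identifies the right gene

So 7 imaging rounds can distinguish 16 genes while tolerating a bad round, instead of 7 rounds distinguishing only 7 genes one colour at a time.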
So yeah, linear algebra is definitely so, so hot right now.
It's more about space efficiency. If a model can fit in only one or two cachelines and get you "close enough", you may end up touching a lot less memory than if you just did a binary search.
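A toy version of what I mean (Python; the "model" is just a linear fit, and I'm assuming its error is bounded):

    import bisect

    def build_model(keys):
        """Two floats of 'index': position ~= slope * (key - k0)."""
        slope = (len(keys) - 1) / (keys[-1] - keys[0])
        return slope, keys[0]

    def lookup(keys, key, model, max_err=32):
        """Predict the position, then search only a small window around
        it, ideally a cacheline or two, instead of log2(n) scattered
        probes. Assumes the model is never off by more than max_err."""
        slope, k0 = model
        guess = int(slope * (key - k0))
        lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
        i = bisect.bisect_left(keys, key, lo, hi)
        return i if i < len(keys) and keys[i] == key else -1

    keys = list(range(0, 2_000_000, 2))    # near-uniform keys: the model fits well
    model = build_model(keys)
    print(lookup(keys, 1_000_000, model))  # 500000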
I don't feel like it tonight. Drinking whiskey and trying not to think about how effed up the Israel footage and the North Korea situation are. Maybe that's why it's not the best post.
I don't mean to crap on the work presented. Great article, good summary, and the tech is solid. It's just older than what excites me; a lot of progress has been made in 20 years and the majority of it hasn't found commercial applications.
But here is a citation root for a lot of amazing work in this space:
My absolute hero Edward Kmett gave a talk stitching a lot of this work together a long time ago: https://www.youtube.com/watch?v=uA0Z7_4J7u8 . I have no idea if he's pursued it, it's just one of his many talks that left me with an incredibly lasting impression.
Variants of this technique work for arbitrary documents and structures, work better at very high volume, have cache-oblivious properties, and support transactions. Universal indexes that are reasonably good for all queries (as opposed to specific queries) are also possible. Coupled with discrimination-based techniques for O(n) table joins, there's probably a whole startup around there.
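The discrimination-based join idea in miniature, if anyone wants the flavor (Python; Henglein's actual construction is a generic, composable discriminator, this is just the degenerate hashable-key case):

    def discriminate(pairs):
        """Partition values by key in one linear pass."""
        buckets = {}
        for k, v in pairs:
            buckets.setdefault(k, []).append(v)
        return buckets

    def join(r, s):
        """r, s: iterables of (key, value). No sort, no prebuilt index:
        O(|r| + |s| + output)."""
        rb, sb = discriminate(r), discriminate(s)
        for k in rb.keys() & sb.keys():
            for x in rb[k]:
                for y in sb[k]:
                    yield (k, x, y)

    print(list(join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")])))  # [(2, 'b', 'x')]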
If you're optimal for a cache without knowing how big the cache is, you're optimal for all caches. It's tough to do, but it happens.
This property probably doesn't get the respect it deserves in this super weird world where you can't really say how many caches are between you and your data.
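The canonical trick, if it helps make that concrete (my sketch of the van Emde Boas layout for a complete binary search tree): recursively store the top half of the tree, then each bottom subtree, contiguously. No block size appears anywhere in the code, yet a root-to-leaf search touches O(log_B n) blocks for every block size B at once.

    def veb_layout(height, root=1):
        """Order the nodes of a complete BST (BFS numbering: children of
        i are 2i and 2i+1) so every recursive subtree is contiguous."""
        if height == 1:
            return [root]
        top_h = height // 2
        bot_h = height - top_h
        order = veb_layout(top_h, root)  # top half of the tree first
        leaves = [root]
        for _ in range(top_h - 1):       # find the leaves of the top tree
            leaves = [c for n in leaves for c in (2 * n, 2 * n + 1)]
        for leaf in leaves:              # then each bottom subtree
            order += veb_layout(bot_h, 2 * leaf)
            order += veb_layout(bot_h, 2 * leaf + 1)
        return order

    print(veb_layout(4))
    # [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]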
Really good read, thanks. My brain couldn't quite grok the idea of a "cache-oblivious" data structure, but the article suggests they'd be more aptly described as "cache size-oblivious".