
Algorithms Behind Modern Storage Systems - matt_d
https://queue.acm.org/detail.cfm?id=3220266
======
throwawaypls
I read this book titled "Designing Data Intensive Applications", which covers
this and a lot of other stuff about designing applications in general.
[https://www.amazon.com/Designing-Data-Intensive-
Applications...](https://www.amazon.com/Designing-Data-Intensive-Applications-
Reliable-Maintainable/dp/1449373321)

~~~
pbowyer
It's the best written computer-related book I've read. On a par with Friedl's
"Mastering Regular Expressions".

Very highly recommended.

~~~
_jal
> On a par with Friedl's "Mastering Regular Expressions".

That comparison sold me. You deserve a commission.

~~~
pbowyer
Thank you!

Be warned, the O'Reilly print quality is miles apart between the two. Their
print-on-demand text quality and the binding are real let-downs.

~~~
_jal
That's too bad. The MRE book was great - the special typography for zero-width
indicators and whatnot was a really nice touch. Progress.

------
JelteF
There is one important detail that is never fully articulated in this
article. It says B-trees are read optimized. This is true, but the big
difference only shows up for random reads, not sequential reads. Seeking a
key in an LSM tree is relatively expensive, because it has to read multiple
indexes. But once that's done, reading a lot of keys and values from that
point on is cheap (or at least not much more expensive than in a B-tree).
The reason for this is mentioned in the article: the SSTables. They allow a
quick scan through the keys, because they are stored on disk in sorted
order.
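To make the point concrete, here's a toy sketch (an in-memory sorted list standing in for an on-disk SSTable; all names and data are made up): the seek is the one expensive step, and everything after it is a sequential read over sorted data.

```python
import bisect

# Hypothetical in-memory stand-in for one SSTable: key/value pairs kept
# in sorted order, exactly as they would be laid out on disk.
sstable = [(f"key{i:04d}", f"value{i}") for i in range(1000)]
keys = [k for k, _ in sstable]

def range_scan(start_key, count):
    """Seek once (the expensive part), then read sequentially."""
    pos = bisect.bisect_left(keys, start_key)  # the one "seek"
    return sstable[pos:pos + count]            # cheap sequential read

rows = range_scan("key0500", 3)
```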

~~~
dhd415
LSMs do not guarantee that sequential key/value pairs are stored in the same
SSTable so reading "n" k/v pairs, in the worst case, could require seeking
through "n" different SSTables. There are datastores that use LSMs such as
HBase that provide guarantees on top of LSMs to facilitate efficient range
reads and, of course, many LSM compaction algorithms improve range reads, but
the basic LSM is optimized only for single k/v reads. That's the reason, for
example, why Cassandra doesn't even offer range reads such as "SELECT * FROM
mytable WHERE KEY > x AND key < (x + k)".
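A minimal sketch of that worst case (synthetic data; Python's heapq standing in for a real merge iterator): with sequential keys interleaved across tables, even a small range read has to consult every SSTable and merge.

```python
import heapq

# Three hypothetical SSTables; sequential keys are scattered across them,
# so a range read must touch all three and merge their sorted runs.
sstables = [
    [(1, "a"), (4, "d"), (7, "g")],
    [(2, "b"), (5, "e"), (8, "h")],
    [(3, "c"), (6, "f"), (9, "i")],
]

def range_read(lo, hi):
    # heapq.merge performs the k-way merge a basic LSM range read needs
    merged = heapq.merge(*sstables)
    return [(k, v) for k, v in merged if lo <= k < hi]

result = range_read(2, 6)
```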

~~~
misframer
> That's the reason, for example, why Cassandra doesn't even offer range reads

This is not true. The reason why Cassandra doesn't support that is because of
hashing of keys across the cluster -- you'd have to query all shards and merge
the results. That has nothing to do with the LSM storage.

------
koverstreet
_cough_ bcachefs's b+ tree can push about a million random updates per second
(almost half a million single threaded, last I checked).

I haven't heard of anything faster...

edit: found the benchmarks I ran [https://www.patreon.com/posts/more-
bcachefs-16647914](https://www.patreon.com/posts/more-bcachefs-16647914)

~~~
espadrine
Congrats! That is Redis-on-Xeon territory!
[https://redis.io/topics/benchmarks](https://redis.io/topics/benchmarks)

Though I assume that, like Redis, a reboot would lose acked writes?

An SSD can't save a write in a microsecond, its latency is at least an order
of magnitude higher, right?

~~~
surajrmal
NAND program operations are on the order of 100µs to 1ms.

~~~
surajrmal
I see how you made that mistake now. SSDs typically have many (e.g. tens to
hundreds of) program operations happening in parallel. They also pack
writes into units of around 16 KB or larger, so it's possible for multiple
b+ tree ops to land in each program operation.
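A back-of-envelope sketch with assumed (not measured) numbers shows how batching plus parallelism can push aggregate throughput far past what the per-operation latency alone would suggest:

```python
# All numbers below are illustrative assumptions, not measurements.
program_latency_s = 500e-6   # one NAND program: somewhere in 100us-1ms
parallel_programs = 64       # planes/dies programming concurrently
page_bytes = 16 * 1024       # write unit packed by the controller
update_bytes = 64            # one small b+ tree update

updates_per_page = page_bytes // update_bytes           # 256 updates/page
pages_per_sec = parallel_programs / program_latency_s   # ~128k pages/s
updates_per_sec = updates_per_page * pages_per_sec      # well past 1M/s
```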

------
riskneutral
Isn’t this pretty traditional stuff? What is modern about it?

Does any of this map to a GPU for column-oriented analytical data processing?
Basically, machines are only good at reading and writing large, contiguous
chunks of data. As these machines evolve, the optimal width of those chunks
keeps getting larger. The volume of data available is growing. And the types
of operations being done on data are becoming more “analytical” (meaning
column-oriented and streaming access, rather than row-oriented random access).
I would expect “modern storage” algorithms to therefore be cache friendly,
column oriented and take the modern, in-memory storage hierarchy into account
(from on-chip registers, to high-bandwidth GPU-type parallel devices, to
NVRAM system memory).

This article comes off to me like a CS101 intro doing Big-O asymptotic
analysis on linked lists, without even mentioning the existence and effects of
memory caches.

------
akeck
This reminds me of RAID 6’s use of abstract algebra.
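For a flavor of it: RAID 6 keeps two parities, P and Q. Q is the one that needs GF(2^8) algebra; the simpler P parity is plain XOR, sketched here with made-up data blocks.

```python
# RAID parity sketch: P parity is the bytewise XOR of the data blocks.
# (RAID 6's second "Q" parity is the part that needs GF(2^8) arithmetic;
# it is omitted here for brevity.)
data_blocks = [b"\x01\x02", b"\xff\x00", b"\x10\x20"]

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

p_parity = xor_blocks(data_blocks)
# Lose any one data block: XOR of the survivors and P recovers it.
recovered = xor_blocks([data_blocks[0], data_blocks[2], p_parity])
```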

~~~
akalin
I wrote a blog post a while back that might be of interest:
[https://www.akalin.com/intro-erasure-codes](https://www.akalin.com/intro-
erasure-codes) . It's similar to Igor's but fills in a few more details.

~~~
jorangreef
That's a fantastic introduction (although it's far more than an
introduction!). I wrote a native addon for Node.js that does Reed Solomon,
also using optimized Cauchy matrices (MIT license):
[https://github.com/ronomon/reed-solomon](https://github.com/ronomon/reed-
solomon)

~~~
akalin
Very nice! I think I stumbled on your implementation when researching my post.
:)

~~~
jorangreef
Thanks. If that was in November last year, then it's changed a lot since
then (and is now a few times faster). Back then it was using a Vandermonde
matrix, as it was based on Backblaze's Java implementation, but a month ago
everything was rewritten to use optimized Cauchy matrices.
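A toy illustration of why Cauchy matrices are a good fit for erasure coding (using exact rationals instead of the GF(2^8) arithmetic a real implementation uses, with made-up evaluation points): every square submatrix of a Cauchy matrix is invertible, so any k surviving rows can decode.

```python
from fractions import Fraction

def cauchy(xs, ys):
    """C[i][j] = 1 / (x_i - y_j); all x's and y's must be distinct."""
    return [[Fraction(1, x - y) for y in ys] for x in xs]

def det(m):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

# 3 coding rows over 2 data symbols, with arbitrary distinct points:
C = cauchy([3, 4, 5], [1, 2])
# Every 2-of-3 row pair forms an invertible 2x2 matrix,
# so any 2 surviving coded symbols suffice to recover the data.
ok = all(det([C[i], C[j]]) != 0
         for i in range(3) for j in range(i + 1, 3))
```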

------
jacksmith21006
The most interesting thing I have seen for database indexes is this paper.

[https://www.arxiv-vanity.com/papers/1712.01208v1/](https://www.arxiv-
vanity.com/papers/1712.01208v1/)

Using a neural network in place of a B-tree. What is interesting is that a
NN can use many processors at the same time, whereas a B-tree lookup cannot.

In the end it comes down to power, as in the joules needed to get something
done.
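A heavily simplified sketch of the idea (a linear fit standing in for the paper's staged neural models, over synthetic keys): the model predicts a position, and a small bounded local search corrects its error.

```python
import bisect

# Synthetic sorted keys with a roughly linear distribution.
keys = [i * 3 for i in range(1000)]

# "Model": a one-parameter linear fit from key to array position,
# standing in for the learned-index paper's neural net stages.
slope = (len(keys) - 1) / (keys[-1] - keys[0])

def learned_lookup(key, err=8):
    guess = int(slope * (key - keys[0]))        # model predicts a position
    lo, hi = max(0, guess - err), min(len(keys), guess + err)
    i = bisect.bisect_left(keys, key, lo, hi)   # tiny local search
    return i if i < len(keys) and keys[i] == key else -1

pos = learned_lookup(300)
```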

~~~
koverstreet
it's more about space efficiency. if a model can fit in only one or two
cachelines, and get you "close enough", you may end up touching a lot less
memory than if you just did a binary search.

------
walrus01
Obligatory reiserfs joke: [http://www.baltimoremick.com/blog/wp-
content/uploads/2008/07...](http://www.baltimoremick.com/blog/wp-
content/uploads/2008/07/reiser-wife.jpg)

------
KirinDave
This was a bad post and shouldn't be preserved. And I can still edit it. So I
did. Please see below, it's a better post (even if it's bad).

~~~
truncate
>> LSM-trees are great, but better stuff exists.

Sure. My knowledge is pretty much limited to what this articles talks about,
so interested to know what else is out there.

~~~
willvarfar
Fractal Trees as used in TokuDB (now owned by Percona).

The big practical difference between TokuDB and InnoDB shows up when
dealing with large datasets.

However, I don't know where the very latest MyRocks stands vs TokuDB.

