However, I think the real problem here is not at the algorithmic level, but at the OS API level. Just a provocation: don't you think it's strange that, in the era of SSD drives where seeks are cheap, we still resort to append-only data structures?
In fact, under the hood, flash drives use append-only writes even when you are trying to overwrite something. For example, if you want to overwrite a byte at location X, the flash drive actually writes the new data at a different location Y and uses a logical mapping so that the byte at location Y pretends to be the byte at location X. The original byte at location X will eventually be "recycled" and erased.
Since flash does append-only writing under the hood anyway, it is better to do append-only explicitly, so that you get its advantages (e.g., the ability to revert to older versions).
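Just to make the remapping idea concrete, here's a toy sketch (a hypothetical flash translation layer, not any particular controller's implementation):

    # Toy FTL: logical "overwrites" become appends to a fresh physical page,
    # plus an update of the logical-to-physical mapping. Illustrative only.
    class ToyFTL:
        def __init__(self, num_pages):
            self.mapping = {}                   # logical page -> physical page
            self.free = list(range(num_pages))  # pages available for writing
            self.stale = set()                  # old copies awaiting recycling
            self.flash = {}                     # physical page -> data

        def write(self, logical, data):
            phys = self.free.pop(0)             # always write to a fresh page
            old = self.mapping.get(logical)
            if old is not None:
                self.stale.add(old)             # old page is recycled later
            self.flash[phys] = data
            self.mapping[logical] = phys

        def read(self, logical):
            return self.flash[self.mapping[logical]]

    ftl = ToyFTL(num_pages=8)
    ftl.write(0, b"old")
    ftl.write(0, b"new")   # the "overwrite" lands on a different physical page
    print(ftl.read(0))     # b'new'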
That said, there is room for someone to develop a filesystem API designed with solid state in mind, or even an entire OS, but I don't know if there would be enough of a market to make that development worthwhile.
Is there anything about SSDs that interests you, from the perspective of the future of Redis?
What about the Journalling Flash Filesystem? It's been around for quite some time.
LogFS is yet another file system designed for larger solid state drives (on the order of gigabytes instead of megabytes), and seems to be a step in the right direction. http://en.wikipedia.org/wiki/LogFS
Interesting. I've only used JFFS2 on embedded systems like OpenWRT routers, where you wouldn't see the large drive penalty.
In fact, JFFS2 works by scanning all the storage at mount time and keeping it all in RAM, which causes its problems with large volumes (slow mounts and large memory use). In embedded systems, for which it is intended, mounting the volume is a rare operation and volumes are generally small; also, a typical embedded system running Linux has more RAM than flash.
I'm not sure about LogFS.
Sure, it will be a long time (if ever; mechanical drives will likely remain important for bulk storage) before filesystem developers focus solely on SSDs, but SSD performance is already very important.
(That said, I'm no SSD expert, so I might be way off)
Second, they actually tout that their data structure is better on SSDs than B-trees (due to block-size obliviousness):
> One major advantage is that they are naturally good candidates for SSDs, where a large asymmetry in read/write block sizes makes life very hard for B-trees. In contrast, the arrays making up the SDA can be written with extremely large effective block sizes.
Footnote in edit: that's in the case of search trees; for large objects that will be read front-to-back, it does become a seek-time issue.
I prefer giving the Arxiv abstract list, which always links to the current version of the PDF.
I also wish the paper spelled out the data structures and algorithms in more detail. I did a little searching but couldn't find more information anywhere. Has anyone found a more complete writeup?
However, stratified B-trees bring some algorithmic rigor to the process of merging the arrays, which provides guarantees on search time.
The authors do indeed provide some comparison with Cassandra which should answer your question - http://www.acunu.com/2011/03/cassandra-under-heavy-write-loa...
Summary: 1) Single-key read performance in Cassandra depends somewhat on when the last compaction was done (Bloom filters help here, but even just a couple of false positives per query eats random I/O; the fewer SSTables there are, the less random I/O), so write spikes lead to poor read performance, and 2) Bloom filters don't work for range queries, so those tank in Cassandra in general. Stratified B-trees don't need Bloom filters and perform better, and more consistently, in both cases.
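To make the first point concrete, here's a rough sketch of the single-key read path being described (heavily simplified; I'm ignoring memtables, caches, tombstones, etc., and the names are mine, not Cassandra's):

    # Each extra SSTable between compactions is one more Bloom filter to probe,
    # and every hit or false positive costs a random read. Illustrative only.
    def read_key(key, sstables):
        """sstables: list of (bloom_filter, table) pairs, newest first."""
        for bloom, table in sstables:
            if key not in bloom:        # cheap in-memory check
                continue                # definitely not in this SSTable
            value = table.get(key)      # random I/O: seek into the SSTable
            if value is not None:
                return value            # newest value wins
            # else: Bloom filter false positive; we paid a seek for nothing
        return None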
Why do you say that stratified B-trees don't need Bloom filters? Yes, the improved merging discipline reduces the number of arrays to read, but presumably there is often >1 array, which is sufficient to make Bloom filters desirable. Even if you only have two arrays, doubling the number of random I/Os per key lookup is easily enough of a penalty to make Bloom filters worthwhile. The paper itself seems to indicate that Bloom filters are used:
"In the simplest construction, we embed into each array a B-tree and maintain a Bloom ﬁlter on the set of keys in each array. A lookup for version v involves querying the Bloom ﬁlter for each array tagged with version v, then performing a B-tree walk on those arrays matching the ﬁlter."
The stratified B-trees use forward pointers to help target the search in the next array down the tree. As with regular B-trees, the smaller root arrays will likely be cached in memory, so the number of random I/Os will be small.
"most ambitious storage project undertaken since ZFS" (http://www.acunu.com/2011/03/why-is-acunu-in-kernel/)
It looks like they put a key/value store in the kernel and came up with a new userspace API for it. I can also see how getting something like this into the mainline kernel is going to be a big uphill battle, but it might actually be a really big win.
It will never be in the mainline kernel. Also, although I haven't actually looked at what they did yet, I assume they're just loading a regular old kernel module rather than really messing with a lot of the mainline code.
I wouldn't be surprised if the Tokutek guys implemented something very similar to this Stratified B-tree to implement MVCC.
This lets you:
(a) read the abstract before deciding whether or not to read the paper
(b) easily find earlier (or later) versions of the paper
(c) easily search for other papers the authors have on the arxiv.
"Stratified B-trees and versioning dictionaries"
This is actually the newer paper, and presents the "Stratified B-Tree" data structure itself. It also explicitly mentions SSDs and appending, by the way.
"Optimal query/update tradeoffs in versioned dictionaries"
This is a longer and more, shall I say, academic paper. It appears to be unfinished, though: it has a TODO and also a marginal FIXME.
For more information about append-only B-trees and SSDs, see http://www.acunu.com/2011/04/log-file-systems-and-ssds-made-...
(That & the fact that a code release is an invaluable road map for anyone looking to follow in your footsteps.)
That said, I can't tell what they're suggesting from a cursory reading; I'll need to spend some time deciphering it. (This one seems to fall into that class of unparseable academic publications where the interesting bits are buried somewhere deep within the text.)
That's the theory, of course. ;)
In any case, having inspected the two papers in a little more depth, this data structure is very different from Clojure's in that it holds multiple versions directly. Additionally, it does indeed make all the trade-offs that minimise disk I/O in favour of more computation. I find it unlikely that much insight can be gained from this for in-memory data structures.
You can expect ~10ms latency for spinning disks, ~10-100µs for solid-state drives, ~10ns for memory and ~1ns for the CPU. Spinning disk to memory is 6 orders of magnitude; 3-4 orders of magnitude for solid-state drives.
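Spelling out that arithmetic (rounded figures from the numbers above):

    import math

    ram = 10                               # ~10 ns
    disk = 10_000_000                      # ~10 ms, in ns
    ssd_low, ssd_high = 10_000, 100_000    # ~10-100 µs, in ns

    print(math.log10(disk / ram))      # 6.0 -> six orders of magnitude
    print(math.log10(ssd_low / ram))   # 3.0
    print(math.log10(ssd_high / ram))  # 4.0 -> three to four orders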
In Clojure, the MVCC STM is implemented at the ref level, not the collection level. Doing it at the collection level may be possible, but is probably not desirable, as complexity will go through the roof.
> I find it unlikely that much insight can be gained from this for in-memory data structures.
Reasons for my skepticism:
First, better big-O results usually come with some valuable insights that can be reapplied.
Second, the tradeoffs in constant factors are really hard to predict. E.g. the extra CPU cycles may pay for themselves by increased cache coherency or better behaviour under multi-core contention. Hard things to even measure, let alone predict. ;)
Third, besides caring about absolute speed, degradation is important in some cases -- an in-memory-only implementation that the OS decides to swap out to disk won't hit (as much of) a brick wall if/when that happens.
Fourth -- although the extra complexity is prohibitive, MVCC tuned for a specific data structure will almost certainly do way better on all kinds of metrics than a language/runtime-wide implementation. Appropriate for Clojure? Not my call. Useful in general? Probably so.
"In a more involved version of the construction, elements from higher levels are sampled..."
Reads like "in one embodiment of the invention, ..."