
Copy-on-write B-tree finally beaten. - jcsalterego
http://tm.durusau.net/?p=9346
======
antirez
I have not read the paper yet, but COW B-trees are interesting for practical
reasons: the tree can be updated by writing in append-only mode, which is
nearly the only way to avoid fsync (or the like) while still avoiding trivial
corruption (the new root node is always written at the end of the update).
Otherwise you either use fsync as a write barrier or live with the problem
that write reordering, anywhere from the OS down to the disk itself, can very
easily corrupt your data structure in the event of a crash.
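
To make the crash-safety ordering concrete, here is a minimal sketch in
Python (the record format and names are hypothetical, not from any real
implementation): new copies of the modified path are appended bottom-up, with
the new root written last, so a crash at any earlier point leaves the old
root as the last complete one on disk.

    import os
    import struct

    def append_record(f, payload):
        """Append one length-prefixed record; return its offset."""
        offset = f.seek(0, os.SEEK_END)
        f.write(struct.pack("<I", len(payload)))
        f.write(payload)
        return offset

    def cow_update(f, leaf_to_root):
        """Append new copies of a modified path, leaf first, root last.

        Each entry is a function taking the offset of the child copy
        appended just before it (None for the leaf) and returning the
        node's bytes with that child pointer patched in.
        """
        child = None
        for serialize in leaf_to_root:
            child = append_record(f, serialize(child))
        # The new root's record is the last complete one in the file;
        # a crash before this point leaves the old root in charge.
        return child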

However, I think the real problem here is not at the algorithmic level but at
the OS API level. Just as a provocation: don't you think it is strange that,
in the era of SSD drives where seeks are cheap, we resort to append-only data
structures?

~~~
tjarratt
You raise a good point w.r.t. SSD drives, but the reality is that even if most
computers being manufactured today included a solid state drive, the vast
majority of consumers are using drives with moving parts. It will be at least
a few more generations of hardware before OS-level developers will be able to
focus specifically on the features of solid state drives when designing the
filesystem and main OS APIs.

That said, there is room for someone to develop a filesystem API designed with
solid state in mind, or even an entire OS, but I don't know if there would be
enough of a market to make that development worthwhile.

Is there anything about SSDs that interests you, from the perspective of the
future of Redis?

~~~
sigil
> That said, there is room for someone to develop a filesystem API designed
> with solid state in mind...

What about the Journalling Flash Filesystem? It's been around for quite some
time.

<http://en.wikipedia.org/wiki/JFFS2>

~~~
tjarratt
Someone more knowledgeable might want to correct me here, but I think JFFS2
arrived a little too early; the performance gains are not enormous and it
suffers from too many disadvantages (namely poor performance with small
blocks and, apparently, difficulty determining free space).

LogFS is yet another file system designed for larger solid state drives (on
the order of gigabytes instead of megabytes), and seems to be a step in the
right direction. <http://en.wikipedia.org/wiki/LogFS>

~~~
sigil
> LogFS stores the inode tree on the drive; JFFS2 does not, which requires it
> to scan the entire drive at mount and cache the entire tree in RAM. For
> larger drives, the scan can take tens of seconds and the tree can take a
> significant amount of main memory.

Interesting. I've only used JFFS2 on embedded systems like OpenWRT routers,
where you wouldn't see the large drive penalty.

------
ComputerGuru
Blog Spam. Actual link:
<http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.4282v2.pdf>

~~~
chalst
Right, but the blog post does give a second reference.

I prefer giving the Arxiv abstract list, which always links to the current
version of the PDF.

<http://arxiv.org/abs/1103.4282>

------
snewman
This sounds, in many ways, very similar to the data structure underlying
Google's Bigtable (<http://labs.google.com/papers/bigtable.html>) and its
descendants. Multi-version key/value store, data organized into a hierarchy of
generations, each generation stored in large linear chunks, Bloom filters to
minimize the number of generations consulted for a single-key fetch... it
would be interesting to see a direct comparison; too bad the authors didn't
mention Bigtable in their analysis of previous work.

I also wish the paper spelled out the data structures and algorithms in more
detail. I did a little searching but couldn't find more information anywhere.
Has anyone found a more complete writeup?

~~~
leef
It's somewhat similar in that both perform writes as sequential I/O, writing
to arrays (for stratified B-trees) or to SSTables (for BigTable and
Cassandra), and then merging them together using more sequential I/O. This is
the not-so-secret sauce that keeps inserts fast compared to vanilla B-trees.

However, stratified B-trees bring algorithmic rigor to the merging of the
arrays, which provides guarantees on search time.
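
As a rough illustration of that write path (a toy in-memory sketch with
invented names, not Cassandra's or Acunu's actual code): writes buffer in a
memtable, each flush is one sequential write of a sorted run, and compaction
merges runs, again sequentially.

    class ToyLSM:
        """Toy write-optimized store: sequential flushes plus merges."""

        def __init__(self, flush_threshold=4):
            self.memtable = {}   # in-memory buffer of recent writes
            self.runs = []       # sorted runs, newest first
            self.flush_threshold = flush_threshold

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                # One sequential write: dump the buffer as a sorted run.
                self.runs.insert(0, sorted(self.memtable.items()))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in self.runs:          # newest run wins
                for k, v in run:           # binary search in real life
                    if k == key:
                        return v
            return None

        def compact(self):
            # Merging sorted runs is itself sequential I/O on disk.
            merged = {}
            for run in reversed(self.runs):  # oldest first, newest wins
                merged.update(run)
            self.runs = [sorted(merged.items())]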

The authors do indeed provide some comparison with Cassandra, which should
answer your question:

<http://www.acunu.com/2011/03/cassandra-under-heavy-write-load-part-i/>

<http://www.acunu.com/2011/03/cassandra-under-heavy-write-load-part-ii/>

Summary: 1) Single-key read performance in Cassandra depends somewhat on when
the last compaction was done (Bloom filters help here, but even a couple of
false positives per query costs random I/O; the fewer SSTables there are, the
less random I/O), so write spikes create poor read performance. 2) Bloom
filters don't work for range queries, so those tank in Cassandra in general.
Stratified B-trees don't need Bloom filters and perform better, and more
consistently, in both cases.

~~~
snewman
Thanks for the links. The statistics on latency distribution are quite
impressive, to say the least.

Why do you say that stratified b-trees don't need Bloom filters? Yes, the
improved merging discipline reduces the number of arrays to read, but
presumably there is often >1 array, which is sufficient to make Bloom filters
desirable. Even if you only have two arrays, doubling the number of random
I/Os per key lookup is easily enough of a penalty to make Bloom filters
worthwhile. The paper itself seems to indicate that Bloom filters are used:

"In the simplest construction, we embed into each array a B-tree and maintain
a Bloom ﬁlter on the set of keys in each array. A lookup for version v
involves querying the Bloom ﬁlter for each array tagged with version v, then
performing a B-tree walk on those arrays matching the ﬁlter."
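
For what it's worth, that construction maps onto something like the following
sketch (all names are mine, and the filter sizing is arbitrary; a real
implementation would size it from a target false-positive rate):

    import hashlib

    class BloomFilter:
        """Tiny Bloom filter; 'maybe present', with false positives."""

        def __init__(self, size_bits=1024, n_hashes=3):
            self.size = size_bits
            self.n_hashes = n_hashes
            self.bits = 0

        def _positions(self, key):
            for i in range(self.n_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "little") % self.size

        def add(self, key):
            for p in self._positions(key):
                self.bits |= 1 << p

        def might_contain(self, key):
            return all((self.bits >> p) & 1 for p in self._positions(key))

    def lookup(key, version, arrays):
        """arrays: (versions, bloom, sorted_kv_list) triples, one per
        array, as in the quoted construction."""
        for versions, bloom, arr in arrays:
            if version in versions and bloom.might_contain(key):
                # The paper does a B-tree walk; binary search stands in.
                lo, hi = 0, len(arr)
                while lo < hi:
                    mid = (lo + hi) // 2
                    if arr[mid][0] < key:
                        lo = mid + 1
                    else:
                        hi = mid
                if lo < len(arr) and arr[lo][0] == key:
                    return arr[lo][1]
        return None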

~~~
leef
The paper doesn't say, so I am making some assumptions here, but if you had a
Bloom filter per array, then as these doubling arrays get really big, all the
Bloom filter would tell you is that the target entry is probably contained
somewhere in a giant array. A false positive in the Bloom filter would then
cause a pretty significant amount of work.

Stratified B-trees use forward pointers to help target the search in the
next array down the tree. As with regular B-trees, the smaller root arrays
will likely be cached in memory, so the number of random I/Os will be small.
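
If I'm reading the forward-pointer idea correctly (this is my own
interpretation, not pseudocode from the paper), it works much like fractional
cascading; a sketch:

    def lower_bound(arr, key, lo, hi):
        """Leftmost index in arr[lo:hi] whose entry key is >= key."""
        while lo < hi:
            mid = (lo + hi) // 2
            if arr[mid][0] < key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def search(key, levels):
        """levels: sorted arrays, smallest first. Entries are
        (key, value, fwd), where fwd is the index of this key's lower
        bound in the next level (None in the last level)."""
        lo, hi = 0, len(levels[0])
        for depth, arr in enumerate(levels):
            i = lower_bound(arr, key, lo, hi)
            if i < len(arr) and arr[i][0] == key:
                return arr[i][1]
            if depth + 1 == len(levels):
                break
            # Neighbouring forward pointers bracket the next window,
            # so each level costs one narrow search, not a full one.
            lo = arr[i - 1][2] if i > 0 else 0
            hi = arr[i][2] if i < len(arr) else len(levels[depth + 1])
        return None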

------
justincormack
Acunu (<http://acunu.com>) is a UK startup with an implementation of
Cassandra; it supports versions, presumably based on this.

~~~
dchest
Correct, here's a post by one of the paper authors:
<http://www.acunu.com/2011/03/big-dictionaries-ii-versioning/>

~~~
cbetz
This is actually very exciting. I decided to explore the site a little and
found this quote:

"most ambitious storage project undertaken since ZFS"
(<http://www.acunu.com/2011/03/why-is-acunu-in-kernel/>)

It looks like they put a key/value store in the kernel and came up with a new
userspace API for it. I can also see how getting something like this into the
mainline kernel is going to be a big uphill battle, but it might actually be
a really big win.

~~~
tdmackey
Part of my job is working on an in-kernel key/value store for a data center
network operating system. The problem is that the context switching between
user and kernel space kills your performance if you're targeting
sub-millisecond reads/writes. That may not matter for something like Acunu
once you factor in network latency, but when you're using the database as
part of the packet path it does. In addition, under high system load your
user-space process has a high likelihood of getting scheduled out at the
ioctl call, which makes latency even worse. On the other hand, being in the
kernel allows you a little leeway in terms of durability constraints and all
that, because if you screw up, the entire system comes crashing down anyway.

It will never be in the mainline kernel. Also, although I haven't actually
looked at what they did yet, I assume they're just loading a regular old
kernel module rather than really messing with a lot of the mainline code.

------
plasma
Are Fractal Trees better than B-trees? Apparently so, as used by a database
add-on for MySQL:
<http://tokutek.com/2010/11/avoiding-fragmentation-with-fractal-trees/>

~~~
leef
These stratified B-trees are basically multi-versioned variations of Fractal
Trees (aka cache-oblivious streaming B-trees). The paper even references the
Tokutek team's paper:
<http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf>

I wouldn't be surprised if the Tokutek guys implemented something very similar
to this Stratified B-tree to implement MVCC.

------
Kallikrates
Edited: correct link to the actual paper:
<http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.4282v2.pdf>

~~~
jholman
If we're replacing blog links with arXiv abstract links, note that there are
actually two papers linked, both interesting.

"Stratified B-trees and versioning dictionaries"
<http://arxiv.org/abs/1103.4282> This is actually the newer paper, and
presents the "Stratified B-Tree" data structure itself. It also explicitly
mentions SSDs and appending, by the way.

"Optimal query/update tradeoffs in versioned dictionaries"
<http://arxiv.org/abs/1103.2566> This is a longer and more, shall I say,
academic paper. It appears to be unfinished, though: it has a TODO, and also
a marginal FIXME.

~~~
andytwigg
Thanks for that. The second paper you reference contains more details of the
data structure. I have submitted an updated version of the second paper; it
will probably appear on arXiv in two days. We have a much improved version of
it, though, which I hope to post sometime next week.

For more information about append-only B-trees and SSDs, see
<http://www.acunu.com/2011/04/log-file-systems-and-ssds-made-for-each-other/>

~~~
crux_
I don't suppose there's a chance of seeing the OCaml implementation released
as well, is there? It's neat to see it being used for what (IMO) is a really
underutilized sweet spot for the language.

(That & the fact that a code release is an invaluable road map for anyone
looking to follow in your footsteps.)

~~~
andytwigg
The OCaml implementation is what we use internally to test implementations of
complicated new data structures, and as such, it's quite unreadable to anyone
else! We will probably talk at CUFP about our experiences with OCaml. It
definitely has some upsides, but it is not without some fairly major
downsides (concurrency, serialization, lack of consistently-named library
functions, ...). As such, we're rewriting our basic data structures in Scala;
this seems to keep most of the benefits without some of the major problems.
Now that we know how to implement the structures, we might be able to do a
clean implementation which can be released!

------
viraptor
Since they want to start with an implementation built into kernel libs, I
wonder how easy it would be for btrfs to switch... The interface is similar,
so I really hope it's possible.

~~~
nwmcsween
btrfs can't switch; the structure is tied into the disk format.

------
andytwigg
Slides from a recent talk:
<http://www.acunu.com/2011/04/algorithms-at-cassandra-meetup/>

------
krosaen
Before even reading the paper, I lazily wonder whether any of the techniques
used have implications for Clojure's persistent vector implementation.

~~~
pmjordan
The challenges faced by on-disk data structures are quite different from
in-memory ones in some respects, so not necessarily (e.g. disk seek time
isn't uniform, indirection hits you MUCH harder, sectors are much bigger than
cache lines, robustness against sudden loss of power is important, etc.).

That said, I can't tell what they're suggesting from a cursory reading; I'll
need to spend some time deciphering it. (This one seems to fall into that
class of unparseable academic publications where the interesting bits are
buried somewhere deep within the text.)

~~~
crux_
On the other hand, cache-oblivious algorithms (and, on first skim, the
presented data structure is one) should play just as nicely at the L1/L2 vs.
RAM level as they do at the RAM vs. disk level, without needing to be
specifically tuned or rewritten.

That's the theory, of course. ;)

~~~
pmjordan
As _danvet_ has hinted at, on-disk data structures don't exist in isolation.
You've generally got a CPU and memory system attached to the disk that is
orders of magnitude faster[1], so if you can reduce the number of dependent
disk I/O operations by increasing the required processing power, that's
usually a trade-off you want to take. As an example, linear searches through
each 4 KiB disk block you read are absolutely acceptable, whereas doing the
same for an associative data structure held in memory is madness.
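
As a toy illustration of that trade-off (the 16-byte record layout is
invented for the example): one seek and one block-sized read, then a plain
linear scan in memory, which costs CPU but no further I/O.

    import struct

    BLOCK_SIZE = 4096
    RECORD_FMT = "<QQ"                  # (key, value): 16 bytes/record
    RECORD_SIZE = struct.calcsize(RECORD_FMT)

    def find_in_block(f, block_no, key):
        f.seek(block_no * BLOCK_SIZE)   # the one expensive operation
        block = f.read(BLOCK_SIZE)
        # Linear scan: fine per block read from disk, madness as the
        # lookup strategy for a whole in-memory structure.
        for off in range(0, len(block) - RECORD_SIZE + 1, RECORD_SIZE):
            k, v = struct.unpack_from(RECORD_FMT, block, off)
            if k == key:
                return v
        return None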

In any case, having inspected the two papers in a little more depth, this
data structure is very different from Clojure's in that it holds multiple
versions directly.[2] Additionally, it does indeed take all the trade-offs
that minimise disk I/O in favour of more computation. I find it unlikely that
much insight can be gained from this for in-memory data structures.

[1] You can expect ~10ms latency for spinning disks, ~10-100µs for
solid-state drives, ~10ns for memory, and ~1ns for the CPU. Spinning disk to
memory is 6 orders of magnitude; solid-state drive to memory is 3-4 OOM.

[2] In Clojure the MVCC STM is implemented at the _ref_ level, not the
collection level. Doing it at the collection level may be possible, but it is
probably not desirable, as complexity will go through the roof.

~~~
andytwigg
In general, I agree. As you'd expect, practical implementations of this
structure will naturally deviate quite far from the paper's presentation, and
indeed we do make many optimizations for SSDs and rotating disks. But, in
theory, disk vs. RAM is no different from RAM vs. L2. A central trick is the
'stratification' procedure that operates on arrays; that could be helpful
elsewhere.

------
sigil
So, anybody know if they're pursuing a patent on this tech? A quick Google
Patent search didn't turn up any US results. I ask because this language from
the paper sounded patent-y:

"In a more involved version of the construction, elements from higher levels
are sampled..."

Reads like "in one embodiment of the invention, ..."

