
The Bw-Tree: A B-tree for New Hardware - motter
http://research.microsoft.com/apps/pubs/default.aspx?id=178758
======
sparky
Lock-free B+ trees seem like a natural and good idea. However, it's hard to
evaluate how it compares to previous work, in part because this paper uses
completely different terminology than any paper I've ever read on lock-free or
wait-free data structures. For starters, it uses 'latch' and 'latch-free'
probably a hundred times, in lieu of the ubiquitous 'lock' and 'lock-free'. I
gather from Google that this is an Oracle thing[1]; they call spinlocks
'latches' and more complicated queueing locks 'enqueues'.

It would also be good to know more about the skip list implementation they
compared against; their description in VI.A doesn't sound like any concurrent
skip list I'm aware of (e.g., Java's
java.util.concurrent.ConcurrentSkipListMap). They don't say what all their Bw-
tree implementation includes, but if it's just the data structure, 10k lines
of C++ is an order of magnitude larger than even pretty complex concurrent
skip lists.

[1]
[http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTI...](http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:10899066672589)

~~~
ww520
Latch is a popular term in the database world, used to avoid confusion with
the term 'lock' in a database. A lock in an RDBMS is usually associated with
transactions, data in the tables, data consistency, etc. It has long-term
semantics, and deadlock can be a problem due to user actions. A latch is the
same as a lock in the traditional CS sense: a mutex or semaphore protecting a
shared data structure in memory, e.g. an index page that has been paged into
memory. A latch is usually held for a very short duration, and user actions
can't cause deadlock.

When walking a B+ tree index, the usual code latches the index pages being
walked, from the root on down, for exclusive access, so that the pages won't
be split due to overflow caused by other threads; the walking thread itself
can cause a split and needs to update the pages it walked. A smarter
implementation shortens the list of latched pages whenever it finds a page
with enough room that it won't split even if its child pages split. This
narrows the scope of the locking during the walk, but pages are still being
latched.
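
To make that concrete, here's a minimal sketch of the latch-coupling
("crabbing") descent using std::shared_mutex; the page layout, fanout, and
names are illustrative, not taken from any real engine:

    #include <cstddef>
    #include <shared_mutex>
    #include <vector>

    // Toy page layout; real engines pack keys, children, and metadata
    // into a fixed-size disk page.
    struct Page {
        std::shared_mutex latch;      // the "latch": a short-lived in-memory lock
        bool is_leaf = false;
        std::vector<int> keys;        // sorted separator keys
        std::vector<Page*> children;  // children.size() == keys.size() + 1
        static constexpr size_t kMaxKeys = 64;

        Page* child_for(int key) {    // next page on the root-to-leaf path
            size_t i = 0;
            while (i < keys.size() && key >= keys[i]) ++i;
            return children[i];
        }
        // "Safe": has room, so a split below cannot propagate into this page.
        bool safe_for_insert() const { return keys.size() < kMaxKeys; }
    };

    // Latch coupling ("crabbing") on the way down for an insert: keep a
    // stack of exclusively latched pages, releasing every ancestor as soon
    // as the current page is safe. Returns all pages still latched; the
    // caller inserts, splits upward if needed, then unlocks them all.
    std::vector<Page*> descend_for_insert(Page* root, int key) {
        std::vector<Page*> held{root};
        root->latch.lock();
        Page* p = root;
        while (!p->is_leaf) {
            Page* c = p->child_for(key);
            c->latch.lock();
            held.push_back(c);
            if (c->safe_for_insert()) {
                for (Page* a : held)                // ancestors can no longer
                    if (a != c) a->latch.unlock();  // split; drop their latches
                held.assign(1, c);
            }
            p = c;
        }
        return held;
    }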

This becomes a single point of contention when a lot of index walking
happens, and pretty much every DB operation touches an index. It's especially
problematic on modern memory-rich systems where most hot data is paged into
memory, so the latching of the index during walks sticks out like a sore
thumb.

A latch-free B+ tree would allow multiple threads to walk the index at the
same time, thus removing the single point of contention and allowing massive
scaling with more threads added.

~~~
sparky
Thanks for clarifying! It makes sense to use a different word in the DB
community if 'lock' already has another meaning there. It's easy to forget
that RDBMS terminology has been around longer than most areas of CS.

The terminology clash in this case is unfortunate, because I'd wager that 90%
of the people active in the field of concurrent data structures will use
'lock' rather than 'latch'.

------
NatW
More context: "Adhering to the “latch-free” philosophy, the Bw-tree delivered
far better processor-cache performance than previous efforts.

“We had an ‘aha’ moment,” Lomet recalls, “when we realized that a single table
that maps page identifiers to page locations would enable both latch-free page
updating and log-structured page storage on flash memory. The other highlight,
of course, was when we got back performance results that were stunningly
good.”

The Bw-tree team first demonstrated its work in March 2011 during TechFest
2011, Microsoft Research’s annual showcase of cutting-edge projects. The Bw-
tree performance results were dazzling enough to catch the interest of the SQL
Server product group.

“When they learned about our performance numbers, that was when the Hekaton
folks started paying serious attention to us,” researcher Justin Levandoski
says. “We ran side-by-side tests of the Bw-tree against another latch-free
technology they were using, which was based on ‘skiplists.’ The Bw-tree was
faster by several factors. Shortly after that, we began engaging with the
Hekaton team, mainly Diaconu and Zwilling.”

“A skiplist is often the first choice for implementing an ordered index,
either latch-free or latch-based, because it is perceived to be more
lightweight than a full B-tree implementation”, says Sudipta Sengupta, senior
researcher in the Communication and Storage Systems Group. “An important
contribution of our work is in dispelling this myth and establishing that
latch-free B-trees can perform way better than latch-free skiplists. The Bw-
tree also performs significantly better than a well-known latch-based B-tree
implementation—BerkeleyDB—that is widely regarded in the community for its
good performance.”

[1] [http://research.microsoft.com/en-us/news/features/hekaton-12...](http://research.microsoft.com/en-us/news/features/hekaton-122012.aspx)
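
A minimal sketch of that mapping-table idea, as I read the description above;
the names and record layout are illustrative, not code from the paper:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Each logical page ID indexes a slot holding a pointer to the page's
    // current state. An update allocates a delta record and installs it
    // with a single compare-and-swap, so no latch is ever taken; readers
    // holding the old pointer still see a consistent (older) version.
    struct Node {          // either a base page or a delta record
        const Node* next;  // delta chain; nullptr terminates at the base page
        int key;
        int value;
    };

    constexpr std::size_t kMaxPages = 1 << 20;
    std::atomic<const Node*> mapping_table[kMaxPages];  // page ID -> state

    void upsert(std::uint32_t page_id, int key, int value) {
        Node* delta = new Node{nullptr, key, value};
        const Node* old_head = mapping_table[page_id].load();
        do {
            delta->next = old_head;  // prepend to the existing delta chain
        } while (!mapping_table[page_id].compare_exchange_weak(old_head, delta));
    }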

~~~
hyc_symas
The performance of skiplists vs. B-trees was already debunked at least 7
years ago. So nice try, M$, but as usual you're late to the party, not
advancing the state of the art.

<http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html>

------
jmgrosen
I'm glad that Microsoft Research publishes their studies for free like this
instead of having to pony up for it through the IEEE -- this certainly looks
intriguing!

~~~
raccer
Seriously, anytime I find a company freely sharing the details of a newer,
faster technique, I become more of a fan. Though with all the negative points
MS has earned, they're still in the red in my book.

------
jburgueno
20x faster than BerkeleyDB is quite impressive. Would love to see an
implementation of this.

~~~
snaky
BerkeleyDB is actually not that fast compared to modern alternatives:

<http://symas.com/mdb/microbench/>

~~~
jules
Lies, damned lies, statistics, and benchmarks.

Those benchmarks seem too good to be true. And reading the associated
information they do look like that they _are_ too good to be true. They claim
zero-copy access to the database, which is great. But that probably means that
in their read benchmarks, they are just returning a pointer to a record, and
are probably not reading the actual record from disk (or for databases that
live entirely in memory: into CPU cache). This gives an unfair view of the
performance compared to databases that do read the data into memory (or CPU
cache). While it's great that the database itself doesn't read the record,
let's face it: most clients _will_ need the actual record and not just a
pointer to it. That is, after all, the point of a database. This also
explains the unreal performance for large records. They do 30 million reads
of 100-kilobyte records per second; if they were actually reading the
records, that would mean their disk is doing 3 terabytes per second of
throughput. I want that disk!!! The hard disk and SSD also have exactly the
same performance, so
that means that they aren't even hitting the disk at all. So yes, they are
cheating.
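
To make the distinction concrete, here's a minimal sketch of the two read
paths; the API shapes are hypothetical, not any particular library's
interface:

    #include <cstddef>
    #include <cstring>

    // A zero-copy read hands back a pointer into the DB's memory map; a
    // copying read pays to move the bytes. A read benchmark built on the
    // first form never has to touch the record's bytes at all.
    struct Record { const char* data; std::size_t len; };

    Record get_zero_copy(const char* base, std::size_t off, std::size_t len) {
        return {base + off, len};          // O(1): no bytes move
    }

    std::size_t get_copying(const char* base, std::size_t off, std::size_t len,
                            char* out, std::size_t cap) {
        std::size_t n = len < cap ? len : cap;
        std::memcpy(out, base + off, n);   // O(len): bytes actually move
        return n;
    }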

~~~
hyc_symas
Nonsense. Whatever the calling app does with the data will always be an
additional cost over what the DB does with the data. Eliminating the DB's cost
doesn't invalidate the measurement. It only shows how much waste is going on
in other DB implementations.

There are plenty of results in real-world apps too, not just microbenches.
E.g. YCSB, the Yahoo! Cloud Serving Benchmark, with the MapKeeper driver:
[https://groups.google.com/forum/?fromgroups#!topic/mapkeeper...](https://groups.google.com/forum/?fromgroups#!topic/mapkeeper-users/pPQl50fgXu4)

MemcacheDB
[https://groups.google.com/forum/?fromgroups=#!topic/memcache...](https://groups.google.com/forum/?fromgroups=#!topic/memcached/dxU8iO27ce4)

These tests actually read data and send to a client.

And of course the slapd test results, as documented in the multiple
papers/presentations. <http://symas.com/mdb/>

~~~
jules
> Nonsense. Whatever the calling app does with the data will always be an
> additional cost over what the DB does with the data.

Way to ignore everything I said. What you said here is obviously not the
case, as I explained, since to a large extent this is simply deferring work
until later. Also, explain this to me:

How is the 100KB-value benchmark performing 2x as many operations per second
as the 100-byte-value benchmark? Do you claim this is indicative of
real-world performance?

And take, for example, section 7, which runs the tests on the SSD. Do you
really think that 30 million operations per second on 100KB values bears any
relation to real-world performance?

Obviously, these benchmarks are _not_ indicative of real-world performance.
This database may well be very fast, but these benchmarks don't show it.

~~~
hyc_symas
Way to ignore what's printed in front of you. As the writeup clearly states,
the benchmark shows that throughput is based on the number of keys in the DB,
not on the size of the values. The 100KB test is faster because there are
fewer keys.

~~~
jules
Sure, I get that. My question was: do you consider that indicative of
real-world performance? I consider it misleading, especially when you label
benchmarks tmpfs, HDD, and SSD while the read benchmarks aren't even touching
the disk.

~~~
hyc_symas
Microbenchmarks practically never map 1:1 to real world performance, since
these libraries get embedded in much larger software systems that each have
their own bottlenecks and inefficiencies. That should already be well
understood when we call these things "microbenchmarks" and not "benchmarks".
Meanwhile, compare all of the LevelDB/Kyoto/SQLite3 results we reported to the
results the LevelDB authors originally reported - you'll find they are
directly comparable. It may well be that read tests whose data sets are
already fully cached in RAM are not representative of the underlying media
speed. But (1) we're just trying to produce numbers using the same methodology
as the original LevelDB benchmarks and (2) the full results show that even
with all data fully cached in RAM, the type of underlying filesystem still had
dramatic effects on the performance of each library.

We've done another bench recently (not yet prettied up for publication) with a
VM with 32GB of RAM and 4 CPU cores. On an LDAP database with 5 million
entries, the full DB occupies 9GB of RAM and this VM can return 33,000 random
queries/second ad nauseam. Scaling up to 25 million entries, the DB
requires 40GB of RAM and the random query rate across all 25M drops down to
around 20,000/sec. Tests like these are more about the price/quality of your
hard drive than the quality of your software. As publishable results they're
much less referenceable since any particular site's results will have much
greater variance, and the only thing for a conscientious party to do is
measure it on their own system anyway.

~~~
jules
Fair enough, I don't see any problem with in-memory benchmarks, as long as
they are marked as such and you're comparing apples to apples. The best way
to do this would be to actually use the data from the read queries in some
trivial operation, like computing an XOR hash -- that would still be a best
case for your library, yet still real-world.
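
A minimal sketch of that suggestion; get() is a hypothetical stand-in for
whatever lookup call the library under test exposes:

    #include <cstddef>
    #include <cstdint>

    // Fold every byte of a returned record into an XOR hash, so that even
    // a zero-copy engine that only hands back a pointer is forced to pull
    // the data through the CPU cache during the read benchmark.
    std::uint8_t xor_hash(const std::uint8_t* data, std::size_t len) {
        std::uint8_t h = 0;
        for (std::size_t i = 0; i < len; ++i)
            h ^= data[i];
        return h;
    }

    // In the benchmark loop (pseudo-usage):
    //   get(key, &rec, &len);        // zero-copy: rec points into the DB
    //   sink ^= xor_hash(rec, len);  // 'sink' keeps the compiler from
    //                                // optimizing the reads away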

I've read the papers on your DB and they are quite interesting. What do you
think about the work in the paper linked in this post? It's unfortunate that
they only compare against skip lists. I don't think anybody seriously
believes skip lists are ever a good idea, so it's a bit of a straw man at
this point (though I may be wrong).

~~~
hyc_symas
I find the Bw-tree work pretty underwhelming. It's not a transactional engine;
the fact that it can operate "latch-free" is irrelevant since it must be
wrapped by some higher concurrency management layer to truly be useful. The
delta-update mechanism is probably more write-efficient than LMDB, which could
be a point in their favor. The fact that they still rely on an application-
level page cache manager is a strong negative - you can never manage the cache
more accurately and more efficiently than the kernel, which gets hardware
assist for free. Overall, it's an awful lot of noise being made over not much
actual innovation.
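
For contrast, the kernel-managed alternative in outline; this is a bare POSIX
sketch under my own assumptions, not LMDB's actual code:

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map the database file read-only and let the kernel's page cache
    // (with MMU assist) decide what stays resident, instead of running an
    // application-level cache manager.
    const void* map_database(const char* path, std::size_t* size_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  // the mapping remains valid after the fd is closed
        if (base == MAP_FAILED) return nullptr;
        *size_out = static_cast<std::size_t>(st.st_size);
        return base;  // a B-tree page is read at (char*)base + page_offset
    }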

------
CoolGuySteve
I hope this makes its way into ReFS or some other Windows filesystem. A friend
who used to work on the NTFS team told me ReFS was B-tree-based, which
disappointed me, as B-trees are ill-suited to SSDs.

It was almost like MS completely missed the technology shift due to their
glacial release cycles. But maybe I was wrong.

~~~
etrain
Since when are B-Trees ill-suited to SSDs? The big idea behind B-Trees is to
store pages of keys, and SSDs still operate on pages.

The key feature of B+Trees is that they are optimized to allow sequential
scans through the index - I suppose SSDs don't "need" the sequential scan
property, but it doesn't hurt, and pragmatically would still reduce the number
of disk reads required to perform a scan of the index.
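
A minimal sketch of that sequential-scan property; the layout is
illustrative, not from any particular implementation:

    #include <cstddef>
    #include <vector>

    // B+ tree leaves hold the values and are chained together, so a range
    // scan walks the leaf level sequentially without re-descending from
    // the root.
    struct Leaf {
        std::vector<int> keys;    // sorted within the leaf and across leaves
        std::vector<int> values;
        Leaf* next = nullptr;     // sibling link that enables sequential scans
    };

    // Scan [lo, hi], starting from the leaf located by a single
    // root-to-leaf descent (omitted here).
    std::vector<int> range_scan(Leaf* start, int lo, int hi) {
        std::vector<int> out;
        for (Leaf* l = start; l != nullptr; l = l->next)
            for (std::size_t i = 0; i < l->keys.size(); ++i) {
                if (l->keys[i] > hi) return out;
                if (l->keys[i] >= lo) out.push_back(l->values[i]);
            }
        return out;
    }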

~~~
CoolGuySteve
B-trees are all fine and nice, and perfectly adequate for SSDs, but
log-structured filesystems provide better wear leveling and garbage
collection, even with TRIM support.

------
davvid
Does anyone have any idea about how this compares to Google's btree?

<https://code.google.com/p/cpp-btree/>

~~~
ww520
B-trees and B+ trees are different animals.

~~~
lvh
But both this Bw-tree and the implementation the parent linked to claim to be
"B-trees" (variants thereof), so I'm not sure why that's relevant. (Apart from
the fact that B+ trees are just another B-tree variant themselves.)

------
ttrreeww
I wonder how many patents Microsoft filed on this tree.

~~~
gwern
A search in Google Patents for "BW-tree" or "BW tree" turns up nothing (but I
don't know how fast their database is updated or how long patents can be
hidden or delayed).

~~~
caf
It's likely that if it is patented, it's filed under an anodyne name like _"A
system and method for indexing data"_.

