Hacker News
The Bw-Tree: A B-tree for New Hardware (microsoft.com)
143 points by motter on Apr 9, 2013 | 81 comments



Lock-free B+ trees seem like a natural and good idea. However, it's hard to evaluate how this compares to previous work, in part because the paper uses completely different terminology from any paper I've ever read on lock-free or wait-free data structures. For starters, it uses 'latch' and 'latch-free' probably a hundred times, in lieu of the ubiquitous 'lock' and 'lock-free'. I gather from Google that this is an Oracle thing[1]; they call spinlocks 'latches' and more complicated queueing locks 'enqueues'.

It would also be good to know more about the skip list implementation they compared against; their description in VI.A doesn't sound like any concurrent skip list I'm aware of (e.g., Java's java.util.concurrent.ConcurrentSkipListMap). They don't say what all their BW-tree implementation includes, but if it's just the data structure, 10k lines of C++ is an order of magnitude larger than even pretty complex concurrent skip lists.

[1] http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTI...


Latch is a popular term in the database world. It's used to avoid confusion with the term Lock in a database. Lock in an RDBMS is usually associated with transactions, data in the tables, data consistency, etc. It has long-term semantics, and deadlock can be a problem due to user actions. A latch is the same as a lock in the traditional CS sense. It's a mutex or semaphore protecting a shared data structure in memory, e.g. a paged-in index page. The latch is usually held for a very short duration, and deadlock is not a problem due to user actions.

When walking a B+ tree index, the usual code uses latches to get exclusive access to the index pages being walked, from the root on the way down, so that the pages won't be split due to overflow caused by other threads, since the walking thread itself can cause a split and needs to update the walked pages. A smarter implementation shortens the list of latched pages whenever it finds a page with enough room that it won't split even if its child pages split. This shortens the scope of the locking during the walk, but there are still pages being latched.
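That "shorten the list of latched pages" trick is usually called latch coupling, or crabbing. A toy sketch of the descent, with Python locks standing in for page latches and all names hypothetical:

```python
import threading

class Node:
    """Toy B+ tree node: interior nodes route to children, leaves hold values."""
    MAX_KEYS = 4  # tiny fan-out, just for the sketch

    def __init__(self, keys, children=None, values=None):
        self.latch = threading.Lock()  # stands in for a page latch
        self.keys = keys
        self.children = children       # set on interior nodes
        self.values = values           # set on leaves

    def is_safe(self):
        # "Safe" for an insert: room for one more key, so a split in a
        # child cannot propagate into (or past) this node.
        return len(self.keys) < Node.MAX_KEYS

def descend_for_insert(root, key):
    """Latch coupling (crabbing): latch pages root-to-leaf, releasing all
    ancestors as soon as we reach a node that cannot be split from below."""
    root.latch.acquire()
    held = [root]
    node = root
    while node.children is not None:
        idx = sum(1 for k in node.keys if key >= k)  # pick child subtree
        child = node.children[idx]
        child.latch.acquire()
        held.append(child)
        if child.is_safe():
            # A split below can no longer reach the ancestors: release them.
            for ancestor in held[:-1]:
                ancestor.latch.release()
            held = [child]
        node = child
    return node, held  # caller does the insert, then releases `held`

# Toy two-level tree: keys < 10 go left, >= 10 go right.
left = Node(keys=[1, 5], values={1: "a", 5: "b"})
right = Node(keys=[10, 12], values={10: "c", 12: "d"})
root = Node(keys=[10], children=[left, right])

leaf, held = descend_for_insert(root, 12)
assert leaf is right and held == [right]  # ancestors already released
for n in held:
    n.latch.release()
```

Even in the best case, every writer still serializes briefly on the root latch, which is the contention point the latch-free design is trying to remove.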

This can be a single point of contention when a lot of index walking happens, and pretty much every db operation touches an index. It is especially problematic on modern memory-rich systems where most hot data is paged into memory, so the locking of the index during walks sticks out like a sore thumb.

A latch-free B+ tree would allow multiple threads to walk the index at the same time, thus removing the single point of contention and allowing massive scaling with more threads added.
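For contrast, the paper's latch-free approach replaces page latches with a mapping table whose slots are updated by compare-and-swap, with updates prepended as "delta" records. A very loose sketch of that idea (a Python lock emulates the single hardware CAS instruction; all names here are illustrative, not the paper's API):

```python
import threading

class MappingTable:
    """Toy Bw-tree-style indirection: page id -> head of a delta chain.
    Updates install a new delta with a compare-and-swap on the slot,
    so readers never block and writers never latch the page."""
    def __init__(self):
        self._slots = {}
        self._guard = threading.Lock()  # emulates a single hardware CAS

    def read(self, pid):
        return self._slots.get(pid)

    def cas(self, pid, expected, new):
        with self._guard:
            if self._slots.get(pid) is expected:
                self._slots[pid] = new
                return True
            return False

class Delta:
    """One prepended update record; `next` points at the older state."""
    def __init__(self, key, value, nxt):
        self.key, self.value, self.next = key, value, nxt

def insert(table, pid, key, value):
    while True:  # lost the race? re-read the head and retry
        head = table.read(pid)
        if table.cas(pid, head, Delta(key, value, head)):
            return

def lookup(table, pid, key):
    node = table.read(pid)
    while isinstance(node, Delta):          # walk the delta chain first
        if node.key == key:
            return node.value
        node = node.next
    return node.get(key) if node else None  # fall through to base page

table = MappingTable()
table._slots[1] = {"a": 1}   # consolidated base page (a plain dict here)
insert(table, 1, "b", 2)     # prepends a delta; base page is untouched
insert(table, 1, "a", 3)     # newer delta shadows the base entry
assert lookup(table, 1, "a") == 3
assert lookup(table, 1, "b") == 2
```

A real implementation periodically consolidates long delta chains back into a base page; this sketch omits that.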


Thanks for clarifying! It makes sense to use a different word in the DB community if Lock already has another meaning there. It's easy to forget that RDBMS terminology has been around longer than most areas of CS.

The terminology clash in this case is unfortunate, because I'd wager that 90% of the people active in the field of concurrent data structures will use 'lock' rather than 'latch'.


As a note, latch is also used in the hardware world[1]. It's used to implement memory and is often connected to a clock.

It also was part of the conceptual idea of the very cool demo language ANIC[2].

[1] https://en.wikipedia.org/wiki/Latch_%28electronics%29

[2] http://code.google.com/p/anic/wiki/Tutorial


That was the primary reason I was confused; "WTF does it mean for software to not use latches?" :)


re "skiplist", from my other comment: "“A skiplist is often the first choice for implementing an ordered index, either latch-free or latch-based, because it is perceived to be more lightweight than a full B-tree implementation”, says Sudipta Sengupta, senior researcher in the Communication and Storage Systems Group. “An important contribution of our work is in dispelling this myth and establishing that latch-free B-trees can perform way better than latch-free skiplists. The Bw-tree also performs significantly better than a well-known latch-based B-tree implementation—BerkeleyDB—that is widely regarded in the community for its good performance.”

[1] http://research.microsoft.com/en-us/news/features/hekaton-12...


Are there any open source implementations of lock-free B+ trees? I'd love to play with it.


This is the closest I've found [1]; it's an expanded version of a SPAA paper by the same name. It's not executable, but it's pretty detailed and C-like as paper pseudocode goes :-/

[1] http://www.cs.technion.ac.il/~erez/Papers/lfbtree-full.pdf


More context: "Adhering to the “latch-free” philosophy, the Bw-tree delivered far better processor-cache performance than previous efforts.

“We had an ‘aha’ moment,” Lomet recalls, “when we realized that a single table that maps page identifiers to page locations would enable both latch-free page updating and log-structured page storage on flash memory. The other highlight, of course, was when we got back performance results that were stunningly good.”

The Bw-tree team first demonstrated its work in March 2011 during TechFest 2011, Microsoft Research’s annual showcase of cutting-edge projects. The Bw-tree performance results were dazzling enough to catch the interest of the SQL Server product group.

“When they learned about our performance numbers, that was when the Hekaton folks started paying serious attention to us,” researcher Justin Levandoski says. “We ran side-by-side tests of the Bw-tree against another latch-free technology they were using, which was based on ‘skiplists.’ The Bw-tree was faster by several factors. Shortly after that, we began engaging with the Hekaton team, mainly Diaconu and Zwilling.”

“A skiplist is often the first choice for implementing an ordered index, either latch-free or latch-based, because it is perceived to be more lightweight than a full B-tree implementation”, says Sudipta Sengupta, senior researcher in the Communication and Storage Systems Group. “An important contribution of our work is in dispelling this myth and establishing that latch-free B-trees can perform way better than latch-free skiplists. The Bw-tree also performs significantly better than a well-known latch-based B-tree implementation—BerkeleyDB—that is widely regarded in the community for its good performance.”

[1] http://research.microsoft.com/en-us/news/features/hekaton-12...


Performance of skiplists vs Btrees was already debunked 7 years ago, at least. So nice try M$ but as usual you're late to the party, not advancing the state of the art.

http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html


I'm glad that Microsoft Research publishes their studies for free like this instead of having to pony up for it through the IEEE -- this certainly looks intriguing!


Seriously, anytime I find a company that freely shares the details of a newer faster way, I'm more a fan. Though with the negative points MS has earned, they're still in the red in my book.


20x faster than BerkeleyDB is quite impressive. Would love to see an implementation of this.


You will, it's built into the next version of SQL Server.


Umm, how do you know? Do you work at Microsoft?


I saw it demoed today. Hekaton is an announced and public feature of SQL14.



BerkeleyDB is actually not so fast compared to modern alternatives

http://symas.com/mdb/microbench/


Lies, damned lies, statistics, and benchmarks.

Those benchmarks seem too good to be true. And reading the associated information, they do look like they are too good to be true. They claim zero-copy access to the database, which is great. But that probably means that in their read benchmarks they are just returning a pointer to a record, and are probably not reading the actual record from disk (or, for databases that live entirely in memory, into CPU cache). This gives an unfair view of the performance compared to databases that do read the data into memory (or CPU cache). While it's great that the database itself doesn't read the record, let's face it: most clients will need the actual record and not just a pointer to it. That is, after all, the point of a database.

This also explains the unreal performance for large records. They do 30 million reads of 100 kilobyte records per second. If they were actually reading the records, that would mean their disk is doing 3 terabytes per second of throughput. I want that disk!!! The hard disk and SSD also have exactly the same performance, so they aren't even hitting the disk at all. So yes, they are cheating.
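The bandwidth arithmetic behind that claim is easy to check:

```python
reads_per_sec = 30_000_000   # claimed ops/sec for the 100 KB read test
value_size = 100 * 1000      # 100 kilobytes per record
implied_bandwidth = reads_per_sec * value_size
# 3e12 bytes/sec, i.e. 3 TB/s: far beyond any 2013-era disk or SSD,
# so the record bodies cannot actually be crossing the storage bus.
assert implied_bandwidth == 3_000_000_000_000
```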


Nonsense. Whatever the calling app does with the data will always be an additional cost over what the DB does with the data. Eliminating the DB's cost doesn't invalidate the measurement. It only shows how much waste is going on in other DB implementations.

There are plenty of results in real-world apps too, not just microbenches. E.g. YCSB, the Yahoo! Cloud Serving Benchmark, with the MapKeeper driver https://groups.google.com/forum/?fromgroups#!topic/mapkeeper...

MemcacheDB https://groups.google.com/forum/?fromgroups=#!topic/memcache...

These tests actually read data and send to a client.

And of course the slapd test results, as documented in the multiple papers/presentations. http://symas.com/mdb/


> Nonsense. Whatever the calling app does with the data will always be an additional cost over what the DB does with the data.

Way to ignore everything I said. What you said here is obviously not the case, as I explained, since for a large part, this is simply deferring work till later. Also, explain to me this:

How is the 100kb value benchmark performing 2x as many operations per second as the 100-byte value benchmark? Do you claim this is indicative of real world performance?

And for example section 7, which does tests on the SSD. Do you really think that 30 million operations per second for 100kb values is in any relation to real world performance?

Obviously, these benchmarks are not indicative of real world performance. This database may well be very fast, but these benchmarks don't show it.


Way to ignore what's printed in front of you. As the writeup clearly states, the benchmark shows that throughput is based on the number of keys in the DB, not on the size of the values. The reason the 100KB test is faster is because there are fewer keys.


Sure, I get that. My question was: do you consider that indicative of real world performance? I consider that misleading. Especially if you are labeling benchmarks with tmpfs, HDD and SSD, when the read benchmarks are not even touching the disk.


Microbenchmarks practically never map 1:1 to real world performance, since these libraries get embedded in much larger software systems that each have their own bottlenecks and inefficiencies. That should already be well understood when we call these things "microbenchmarks" and not "benchmarks".

Meanwhile, compare all of the LevelDB/Kyoto/SQLite3 results we reported to the results the LevelDB authors originally reported - you'll find they are directly comparable. It may well be that read tests whose data sets are already fully cached in RAM are not representative of the underlying media speed. But (1) we're just trying to produce numbers using the same methodology as the original LevelDB benchmarks and (2) the full results show that even with all data fully cached in RAM, the type of underlying filesystem still had dramatic effects on the performance of each library.

We've done another bench recently (not yet prettied up for publication) with a VM with 32GB of RAM and 4 CPU cores. On an LDAP database with 5 million entries, the full DB occupies 9GB of RAM and this VM can return 33,000 random queries/second ad nauseam. Scaling the DB up to 25 million entries, the DB requires 40GB of RAM and the random query rate across all 25M drops down to around 20,000/sec. Tests like these are more about the price/quality of your hard drive than the quality of your software. As publishable results they're much less referenceable, since any particular site's results will have much greater variance, and the only thing for a conscientious party to do is measure it on their own system anyway.


Fair enough, I don't see any problem with in memory benchmarks, as long as they are marked as such, and if you're comparing apples to apples. The best way to do this being actually using the data from the read queries to do some trivial operation like computing the XOR hash -- that would still be a best case for your library yet still real world.

I've read the papers on your DB and they are quite interesting. What do you think about the work in the paper linked in this post? It's unfortunate that they just compare with skip lists. I don't think anybody seriously believes skip lists are a good idea ever, so it's a bit of a straw man at this point (though I may be wrong).


I find the Bw-tree work pretty underwhelming. It's not a transactional engine; the fact that it can operate "latch-free" is irrelevant since it must be wrapped by some higher concurrency management layer to truly be useful. The delta-update mechanism is probably more write-efficient than LMDB, which could be a point in their favor. The fact that they still rely on an application-level page cache manager is a strong negative - you can never manage the cache more accurately and more efficiently than the kernel, which gets hardware assist for free. Overall, it's an awful lot of noise being made over not much actual innovation.


I'm not sure it's fair to call this benchmark unfair. LMDB enables use cases that simply aren't possible with non-zero-copy technologies. For example, imagine a database full of large objects, and an HTTP/1.0 server that supports an API returning offsets from within these objects.

In a situation like this, it is possible (and efficient) to, say, store 500MB files in LMDB and have the first and only copy of the file content be done by the kernel as it faults and copies the relevant portions of file data directly into an skbuff (e.g. by passing (value_pointer + offset) directly to write(), etc.).

Another use case is streamy decompression. Why read or fault an entire value into memory if the reading client disconnects after consuming only the first kilobyte?

In any database where copying the entire record value occurs first, designs like the above simply aren't feasible.
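The zero-copy pattern being described can be illustrated with Python's mmap plus memoryview, where a slice of a mapping references the mapped pages instead of copying them (a toy stand-in, not LMDB's actual API):

```python
import mmap
import os
import tempfile

# A file standing in for one large stored value.
path = os.path.join(tempfile.mkdtemp(), "value.bin")
with open(path, "wb") as f:
    f.write(b"x" * 4096 + b"PAYLOAD" + b"y" * 4096)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # A memoryview slice references the mapped pages directly; no bytes
    # are copied until a consumer actually touches them, and os.write()
    # or socket.send() can consume the view as-is.
    view = memoryview(mm)[4096:4096 + 7]
    assert view.tobytes() == b"PAYLOAD"
    view.release()
    mm.close()
```

In C against LMDB, the analogous move is handing (value_pointer + offset) straight to write() on the socket, so the kernel faults in only the pages actually sent.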


These are very niche use cases. Doing an entire benchmark on that basis without making it clear is misleading. And even in that exact use case, you will not see the massive performance difference that they are showing, because even then the data has to be read from disk by the kernel, whereas in their benchmark the disk isn't even touched. Sure, in a database that doesn't support this, it might read the data from disk to memory and then write it to a buffer, and that's going to be a bit less efficient than letting the kernel do it directly. You have an extra in-memory copy, but that's still far from the performance difference they are showing between a disk read and no disk read at all. So even in this ideal scenario, they are still massively overstating their performance (which should already be clear from the fact that no disk has 3 terabytes per second of throughput).

Designs like that are absolutely feasible in any database; you just have to do your own chunking. Storing 500MB records in a database is a bad idea anyway, since you never know if you're going to trip over some logic in the database, your application, or the kernel, or some interaction between these, that's going to load or copy it all unnecessarily. Better to make the chunking explicit.


Better to make everything as simple and as easy as possible. App-managed chunking is error-prone, and it forces redundancy of implementations. Likewise, "you have an extra in memory copy" can make or break a system. OpenLDAP slapd using LMDB uses 1/4 the RAM of slapd using BerkeleyDB for a given number of entries. That means you can serve 4x as many entries using LMDB on a given hw config at full RAM speed, vs BDB. LMDB efficiency has a direct impact on hardware cost and electricity cost, as well as sw development cost.


Exactly! So do real world fair benchmarks that show 4x improvement, not unfair benchmarks that literally show 3000x improvement over BDB.


We are doing all of the above. Microbenchmarks that show only the performance of the DB engine still serve a purpose. And the difference between the microbenches and the full benches shows you how much overhead is eaten in the rest of the system. E.g., when testing read performance of SQLite3 adapted to LMDB, we measured only a 1-2% improvement, because the majority of SQLite3 read execution time is spent in SQL processing, not in its Btree code.


Here's some more real world testing, done by the folks at Zimbra:

http://wiki.zimbra.com/wiki/OpenLDAP_MDB_vs_HDB_performance

Clearly more than 4x write speed improvement.


Nice! Do you know why that benchmark shows almost no read improvement but a massive write improvement? The slides here (http://symas.com/mdb/20120322-UKUUG-MDB.pdf) suggest the opposite? Or am I reading them incorrectly?


As I understand it, that test used something like 32 server threads on a machine with only 4 cores. The UKUUG prezo used 16 server threads on a machine with 16 cores, which allowed all of the reads to run concurrently. The smaller number of CPU cores in the Zimbra test naturally reduces the concurrency advantage of LMDB.


I believe the test runs with 100 byte values, and the key is stored on the same page as the value. Therefore in order to perform a lookup the entire page must still be faulted in, and the key must still be compared, so the benchmark is not fake in that regard. Data is getting pulled off disk, just not in a way that requires multiple copies. Note the library's tiny code size (and zero copy design) has another benefit, in that it can greatly reduce cache pollution, which may again help account for the speed difference.

> Storing 500MB records in a database is a bad idea anyways, since you never know if you're going to trip over some logic in the database, your application, or the kernel, or some interaction between these, that's going to load or copy it all unnecessarily. Better to make the chunking explicit.

So basically you're saying there is no benefit because you aren't used to systems that support it.


> Therefore in order to perform a lookup the entire page must still be faulted in, and the key must still be compared, so the benchmark is not fake in that regard. Data is getting pulled off disk, just not in a way that requires multiple copies.

How did you get this information? If that's the case, why did they not actually use the values (e.g. do something trivial like compute accumulated XOR or something). That would still show the advantages of the zero-copy design, but not ridiculously so.

It still seems highly suspicious. For example, the benchmark with 100kb values runs twice as fast as the benchmark with 100-byte values?!

I certainly agree that the zero-copy design can be good, but these benchmarks are vastly overstating it to the point that the benchmarks are pretty useless. With these benchmarks it's impossible to know if that database is actually faster than the others. It could well be faster, but it could also be slower. To know that you need a fair comparison.

Note also that memory mapped designs have disadvantages as well. For example AFAIK memory mapping doesn't work asynchronously. You will block your threads, so you can't use lightweight threads or callbacks but you'll have to use heavyweight threads (which may limit concurrency and be expensive in general). Unless you implement your own OS of course (which interestingly some people are doing: http://www.openmirage.org/).

> So basically you're saying there is no benefit here because you aren't used to systems that effortlessly support it.

I'm not saying there is no benefit; I'm saying that it's a bad idea to rely on it that way for 500MB, because for it to work safely every component has to support the same thing. It's too easy to have an unexpected problem somewhere and have it blow up by copying 500MB when you only wanted 1KB. But it's still valuable for a database to handle large values well, of course. Anyway, this is beside the main point.


Still nonsense. "AFAIK memory mapping doesn't work asynchronously" - clearly you don't know much. Reader threads in LMDB never block (except for pagefaults to pull data in, which are obviously unavoidable).


> except for pagefaults to pull data in, which are obviously unavoidable

Umm, obviously that's what I mean. And with async IO instead of memory mapping, you could avoid that. Hence why I said memory mapping has disadvantages. Clearly the "nonsense" was invented at your end.


Eh. You're still talking nonsense. If an app requests some data and it needs to be read in from disk, you have to wait. Async I/O or not, that app can't make any progress until its data arrives. Async I/O is only advantageous for writes.


That's incorrect, even for reads the app can do something else at the same time while the disk read is being performed. That's what async I/O is all about.


The app can do whatever else it wants to do that's completely unrelated to the data request. The part of the app that requested the data is stalled until the data arrives. If the data request is the only useful thing going on (e.g., a user clicked "search for Joe's phone number") then there's nothing else productive for the app to do and async I/O is irrelevant.


Yes. That last bit is very rare, especially in server applications where you have multiple clients. In any case you might want to be a little more careful before calling something that you didn't understand nonsense.


In server applications you have multiple threads too. Individual clients may stall but overall a large number of clients progress.


Yes, I already pointed that out, so we're finally back to what I already originally said ;-)


So we're in violent agreement? ;)


I think so! :)


I use it in a private project. For comparison, here's the difference between touch-all-bytes and touch-no-bytes using the Python binding:

    random lookup 1000000 buffers in 1.82sec (550928/sec)
    random lookup+hash 1000000 buffers in 2.52sec (397189/sec)
This was using 100 byte values.


Could it be that your test is dominated by Python speed, given that their tests show 15-30 million operations per second but your test only half a million? Or is their hardware so much faster?


> Doing an entire benchmark on that basis without making that clear is misleading.

A benchmark which provides source code and test data should never be categorized as "misleading." At worst it puts too much trust in the average reader's understanding of things. But look at it this way: if you littered every page that included a benchmark with all the relevant caveats, it'd be a total mess.

Or as David Simon says, "fuck the average viewer."


The main issue with those stats is that they are comparing sqlite3 opened like this:

    status = sqlite3_open(file_name, &db_);
Writing to a file, with full ACID durability and everything actually put on disk, against some in-memory key-value stores. A completely different thing.


>SQLite: We tuned SQLite's performance, by setting its locking mode to exclusive. We also enabled SQLite's write-ahead logging

>Databases are opened in asynchronous write mode. (LevelDB's sync option, TreeDB's OAUTOSYNC option, SQLite3's synchronous options are all turned off, MDB uses the MDB_NOSYNC option, and BerkeleyDB uses the DB_TXN_WRITE_NOSYNC option). I.e., every write is pushed to the operating system, but the benchmark does not wait for the write to reach the disk.
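For reference, the quoted "asynchronous write mode" corresponds roughly to this sqlite3 configuration (a sketch matching the described settings, not the benchmark's exact harness):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(db_path)
# Approximate the quoted setup: WAL journaling, no fsync waits,
# exclusive locking. Writes reach the OS, but nothing waits for
# them to hit the platter.
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA synchronous = OFF")
conn.execute("PRAGMA locking_mode = EXCLUSIVE")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
conn.executemany("INSERT INTO kv VALUES (?, ?)",
                 [(str(i), b"x" * 100) for i in range(1000)])
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
assert count == 1000
conn.close()
```

So the database still writes to a real file on disk; it just doesn't wait for each transaction to be fsync'd, which is what makes the comparison with the other engines' no-sync modes apples-to-apples.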

By the way there is SQLite with LMDB backend - https://gitorious.org/mdb/sqlightning

>Using tool/speedtest.tcl in the SQLite source tree, the time to insert 1000 records on my laptop SSD was 22.42 seconds using the original code, and only 1.06 seconds using MDB. Both tests were run 3 times, with results average


Your point eludes me? For a start, that is not tuning it. It is still using synchronous IO.


Note that the LevelDB, Kyoto Cabinet, and SQLite3 configs are identical to what the Google LevelDB authors used in their original benchmark. http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html

The only thing different about the Symas LMDB tests is that LMDB and BDB were added.


Did you look at http://symas.com/mdb/microbench/db_bench_sqlite3.cc?

Search for 'write_sync'


Ahh yeah, I missed that, apologies

     std::string sync_stmt = (write_sync) ? "PRAGMA synchronous = FULL" :
                                           "PRAGMA  synchronous = OFF";

Still, how is it a valid comparison if it is writing to SSD/HDD while the others are writing to memory? Just open sqlite3 with :memory:


You're still not understanding. LMDB is a disk-based database. It is not only writing to memory.


So MDB in those stats is writing to disk? I assume it's some kind of lazy write, though, where the memory store is eventually persisted? In that case, that's fairly impressive for a lot of use cases. Though a k/v store has a lot less to do than something like sqlite3.

I'll look into it (never heard of it before). It might fit fairly well with something I'm working on right now (currently implemented using sqlite3).


In fully synchronous mode, which is LMDB's default mode, everything is persisted upon commit. This is a memory-mapped database, not a main-memory/in-memory database. It is fully persistent to disk, it just so happens that it uses mmap() instead of read()/write() and a user-level page cache.

People are always confusing "memory-mapped" with "memory-only". LMDB is not a memory-only database.

http://www.openldap.org/lists/openldap-devel/201111/msg00064...
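The distinction is easy to demonstrate: a memory-mapped file is ordinary on-disk storage accessed through page mappings, with flush/msync as the durability step. A minimal Python illustration (not LMDB itself):

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store.db")
with open(path, "wb") as f:
    f.write(b"\0" * 4096)   # pre-size: mmap cannot map an empty file

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)
    mm[0:5] = b"hello"      # update through the mapping, no write() call
    mm.flush()              # msync(): force dirty pages down to the file
    mm.close()

# Re-open from disk: the update persisted; nothing was memory-only.
with open(path, "rb") as f:
    assert f.read(5) == b"hello"
```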


I guess they (as I did) assumed it meant memory-mapped I/O as opposed to something like memory-mapped file I/O or port mapping.


>But that probably means

Why not check it yourself? http://symas.com/mdb/microbench/db_bench_mdb.cc

The source code for the other tests is published too, right there.


I have checked, but it's impossible to tell from that, since you don't know what the API calls are doing. If somebody wants to dig into the source, that's great. However, the other evidence I mentioned is enough to be quite certain of this claim: for example, that their tests do not show any difference between SSD and HDD, and that their disks would need 3 terabytes per second of throughput. I know around here it is more appreciated if you state things with much hubris, even if you are only 50% sure. I worded things with a little bit of caution, even though it's 95% certain.


"MDB write performance when using its writable mmap option." Not really comparing apples, is it ? It doesn't seem fair to me.


Why is that unfair?


I did not delve into the details as you might have, but whenever I see mmap associated with huge perf numbers I tend to wonder about corruption. I've seen too many benchmarks comparing on-disk stores with mmap stores that are flushed to disk once in a while. Then you need extra steps/configs to get good enough reliability, which hinders perf.


That's certainly a possibility but I'll have to assume that they are running with the same durability guarantees unless proven otherwise :)


Correct. In synchronous mode full ACID semantics are guaranteed. Writable mmap avoids memcpy()/write() overhead but msync() is still performed.


>whenever I see mmap associated with huge perf I tend to wonder about corruption

If that were the case with the LMDB implementation, which runs as an OpenLDAP backend all over the world, you would most probably have heard about it in the news.


You have a point. But then, it's a benchmark, and it would not be the first time that a bench was done using different settings than in production. Like I said, that LMDB is used in OpenLDAP doesn't automatically mean it is used with the exact same configuration as in the benchmark ;-) Nor did I say a memory store is bad; it's just that you're not comparing the same things. Redis could be added to the bench, for instance.


IMO main-memory stores are bad, for multiple reasons. LMDB is not a main memory store, it is a fully ACID-compliant transactional store.


I hope this makes its way into ReFS or some other Windows filesystem. A friend who used to work on the NTFS team told me ReFS was B-tree based, which disappointed me, as B-trees are ill-suited to SSDs.

It was almost like MS completely missed the technology shift due to their glacial release cycles. But maybe I was wrong.


Since when are B-Trees ill-suited to SSDs? The big idea behind B-Trees is to store pages of keys, and SSDs still operate on pages.

The key feature of B+Trees is that they are optimized to allow sequential scans through the index - I suppose SSDs don't "need" the sequential scan property, but it doesn't hurt, and pragmatically would still reduce the number of disk reads required to perform a scan of the index.
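The sequential-scan property comes from the sibling pointers linking the leaves: a range scan walks the leaf chain once instead of re-descending from the root for every key. A toy sketch:

```python
class Leaf:
    """B+ tree leaf with a sibling pointer, the sequential-scan shortcut."""
    def __init__(self, keys, nxt=None):
        self.keys = keys   # kept sorted
        self.next = nxt    # right sibling

def range_scan(first_leaf, lo, hi):
    """Walk the leaf chain once; no re-descent from the root per key."""
    out, leaf = [], first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if lo <= k <= hi:
                out.append(k)
            elif k > hi:
                return out   # keys are sorted: past `hi`, we're done
        leaf = leaf.next
    return out

l3 = Leaf([9, 11])
l2 = Leaf([5, 7], l3)
l1 = Leaf([1, 3], l2)
assert range_scan(l1, 3, 9) == [3, 5, 7, 9]
```

On an SSD this still saves page reads, even though the pages themselves need not be physically contiguous.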


B-Trees are all fine and nice, and perfectly adequate for SSDs, but log structured filesystems provide better wear leveling and garbage collection, even with TRIM support.


Does anyone have any idea about how this compares to Google's btree?

https://code.google.com/p/cpp-btree/


BTree and B+ Tree are different animals.


But both this Bw-tree and the implementation the parent linked to claim to be "B-trees" (variants thereof), so I'm not sure why that's relevant. (Apart from the fact that B+ trees are just another B-tree variant themselves.)


Looks like this does not support concurrent lock-free operations.


I wonder how many patents Microsoft filed on this tree.


A search in Google Patents for "BW-tree" or "BW tree" turns up nothing (but I don't know how fast their database is updated or how long patents can be hidden or delayed).


It's likely that if it is patented, it's filed under an anodyne name like "A system and method for indexing data".


That's very good.


What makes you think Google, Oracle, or any other corporation that funds its research division with a couple of billion dollars wouldn't file a patent?



