It would also be good to know more about the skip list implementation they compared against; their description in VI.A doesn't sound like any concurrent skip list I'm aware of (e.g., Java's java.util.concurrent.ConcurrentSkipListMap). They don't say everything their Bw-tree implementation includes, but if it's just the data structure, 10k lines of C++ is an order of magnitude larger than even fairly complex concurrent skip lists.
When walking a B+ tree index, the usual code uses latches to get exclusive access to the index pages it traverses from the root on the way down, so that those pages won't be split due to overflow caused by other threads; the walking thread itself can cause a split and needs to update the pages it walked. A smarter implementation shortens the list of latched pages whenever it finds a page with enough free room that it won't split even if its child pages split. This narrows the scope of the locking during the walk, but pages are still being locked.
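A rough sketch of that latch-coupled ("crabbing") descent; all types, capacities, and policy choices here are invented for illustration, not taken from any particular engine:

    #include <cstddef>
    #include <mutex>
    #include <vector>

    struct Page {
        std::mutex latch;
        std::vector<int> keys;
        std::vector<Page*> children;            // empty => leaf
        static constexpr std::size_t kCapacity = 256;

        bool is_leaf() const { return children.empty(); }
        // "Safe": an insert below cannot propagate a split into this page.
        bool is_safe_for_insert() const { return keys.size() < kCapacity - 1; }
        Page* child_for(int /*key*/) const { return children.front(); }  // simplified
    };

    // Descend from the root holding latches on the path; as soon as a page is
    // found that cannot split, every latch above it can be released early.
    Page* descend_for_insert(Page* root, int key, std::vector<Page*>& held) {
        root->latch.lock();
        held.push_back(root);
        Page* cur = root;
        while (!cur->is_leaf()) {
            Page* child = cur->child_for(key);
            child->latch.lock();
            held.push_back(child);
            if (child->is_safe_for_insert()) {
                while (held.size() > 1) {          // drop all ancestor latches
                    held.front()->latch.unlock();
                    held.erase(held.begin());
                }
            }
            cur = child;
        }
        return cur;  // caller performs the insert, then unlocks everything in `held`
    }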
This latch-coupled walk can become a single point of contention when many index walks happen concurrently, and pretty much every DB operation touches the index. It is especially problematic on modern memory-rich systems, where most hot data is paged into memory, so the locking of the index during a walk sticks out like a sore thumb.
A latch-free B+ tree would allow multiple threads to walk the index at the same time, removing that single point of contention and allowing the structure to scale massively as more threads are added.
The terminology clash in this case is unfortunate, because I'd wager that 90% of the people active in the field of concurrent data structures will use 'lock' rather than 'latch'.
It was also part of the conceptual idea behind the very cool demo language ANIC.
“We had an ‘aha’ moment,” Lomet recalls, “when we realized that a single table that maps page identifiers to page locations would enable both latch-free page updating and log-structured page storage on flash memory. The other highlight, of course, was when we got back performance results that were stunningly good.”
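A minimal sketch of that mapping-table idea, using invented names (the real Bw-tree's structures are more involved): a page is reached only through its slot in the table, so an update can be published latch-free by prepending a delta record and compare-and-swapping the slot.

    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct Node {                      // either a base page or a delta record
        Node* next = nullptr;          // delta records chain down to the base page
    };

    struct MappingTable {
        std::vector<std::atomic<Node*>> slots;
        explicit MappingTable(std::size_t n) : slots(n) {}

        // Latch-free update: publish `delta` on top of whatever the slot
        // currently points to. Retries if another thread got there first.
        void prepend_delta(std::size_t page_id, Node* delta) {
            Node* current = slots[page_id].load();
            do {
                delta->next = current;
            } while (!slots[page_id].compare_exchange_weak(current, delta));
        }

        Node* read(std::size_t page_id) { return slots[page_id].load(); }
    };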
The Bw-tree team first demonstrated its work in March 2011 during TechFest 2011, Microsoft Research’s annual showcase of cutting-edge projects. The Bw-tree performance results were dazzling enough to catch the interest of the SQL Server product group.
“When they learned about our performance numbers, that was when the Hekaton folks started paying serious attention to us,” researcher Justin Levandoski says. “We ran side-by-side tests of the Bw-tree against another latch-free technology they were using, which was based on ‘skiplists.’ The Bw-tree was faster by several factors. Shortly after that, we began engaging with the Hekaton team, mainly Diaconu and Zwilling.”
“A skiplist is often the first choice for implementing an ordered index, either latch-free or latch-based, because it is perceived to be more lightweight than a full B-tree implementation”, says Sudipta Sengupta, senior researcher in the Communication and Storage Systems Group. “An important contribution of our work is in dispelling this myth and establishing that latch-free B-trees can perform way better than latch-free skiplists. The Bw-tree also performs significantly better than a well-known latch-based B-tree implementation—BerkeleyDB—that is widely regarded in the community for its good performance.”
Those benchmarks seem too good to be true. And reading the associated information, they do look like they are too good to be true. They claim zero-copy access to the database, which is great. But that probably means that in their read benchmarks they are just returning a pointer to a record, and are probably not reading the actual record from disk (or, for databases that live entirely in memory, into CPU cache). This gives an unfair view of the performance compared to databases that do read the data into memory (or CPU cache). While it's great that the database itself doesn't read the record, let's face it: most clients will need the actual record and not just a pointer to it. That is, after all, the point of a database.

This also explains the unreal performance for large records. They do 30 million reads of 100-kilobyte records per second. If they were actually reading the records, that would mean their disk is doing 3 terabytes per second of throughput. I want that disk! The hard disk and SSD also have exactly the same performance, which means they aren't even hitting the disk at all. So yes, they are cheating.
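Spelling out the arithmetic behind that figure (taking 100 KB as 10^5 bytes):

    30,000,000 reads/sec × 100,000 bytes/read = 3 × 10^12 bytes/sec = 3 TB/sec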
There are plenty of results from real-world apps too, not just microbenchmarks. E.g., YCSB, the Yahoo! Cloud Serving Benchmark, with the Mapkeeper driver: https://groups.google.com/forum/?fromgroups#!topic/mapkeeper...
These tests actually read the data and send it to a client.
And of course the slapd test results, as documented in the multiple papers/presentations. http://symas.com/mdb/
Way to ignore everything I said. What you said here is obviously not the case, as I explained, since for a large part, this is simply deferring work till later. Also, explain to me this:
How is the 100 KB-value benchmark performing 2x as many operations per second as the 100-byte-value benchmark? Do you claim this is indicative of real-world performance?
And take, for example, section 7, which runs tests on the SSD. Do you really think that 30 million operations per second on 100 KB values bears any relation to real-world performance?
Obviously, these benchmarks are not indicative of real-world performance. This database may well be very fast, but these benchmarks don't show it.
We've done another bench recently (not yet prettied up for publication) with a VM with 32GB of RAM and 4 CPU cores. On an LDAP database with 5 million entries, the full DB occupies 9GB of RAM, and this VM can return 33,000 random queries/second ad nauseam. Scaling the DB up to 25 million entries, the DB requires 40GB of RAM and the random query rate across all 25M drops to around 20,000/sec. Tests like these are more about the price/quality of your hard drive than the quality of your software. As publishable results they're much less referenceable, since any particular site's results will have much greater variance, and the only thing for a conscientious party to do is measure it on their own system anyway.
I've read the papers on your DB and they are quite interesting. What do you think about the work in the paper linked in this post? It's unfortunate that they just compare with skip lists. I don't think anybody seriously believes skip lists are a good idea ever, so it's a bit of a straw man at this point (though I may be wrong).
In a situation like this it is possible (and efficient) to, say, store 500MB-sized files in LMDB and have the first and only copy of the file content be done by the kernel as it faults and copies the relevant portions of the file data directly into an skbuff (e.g. by passing (value_pointer + offset) directly to write(), etc.; see the sketch below).
Another use case is streaming decompression. Why read or fault an entire value into memory if the reading client disconnects after consuming only the first kilobyte?
In any database where copying the entire record value occurs first, designs like the above simply aren't feasible.
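For concreteness, here is a minimal sketch of that pattern (error handling trimmed; the socket fd, key, and offsets are placeholders): mdb_get() returns a pointer into the memory map, and a slice of that value can be handed straight to write(), so the only copy of the data is the kernel's copy into the socket buffer.

    #include <lmdb.h>
    #include <unistd.h>
    #include <algorithm>
    #include <cstddef>
    #include <cstring>

    // Send `len` bytes of the value stored under `key_str`, starting at `offset`,
    // to `sock_fd` without first copying the value into a user-space buffer.
    ssize_t send_value_slice(MDB_env* env, MDB_dbi dbi, int sock_fd,
                             const char* key_str, std::size_t offset, std::size_t len) {
        MDB_txn* txn = nullptr;
        if (mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn) != MDB_SUCCESS)
            return -1;

        MDB_val key{std::strlen(key_str), const_cast<char*>(key_str)};
        MDB_val val{};
        ssize_t sent = -1;
        if (mdb_get(txn, dbi, &key, &val) == MDB_SUCCESS && offset < val.mv_size) {
            std::size_t n = std::min(len, val.mv_size - offset);
            // The pointer into the map goes directly to write(); the kernel faults
            // and copies only the pages that are actually needed.
            sent = write(sock_fd, static_cast<char*>(val.mv_data) + offset, n);
        }
        mdb_txn_abort(txn);  // the read-only txn must stay open while val is in use
        return sent;
    }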
Clearly more than 4x write speed improvement.
> Storing 500MB records in a database is a bad idea anyways, since you never know if you're going to trip over some logic in the database, your application, or the kernel, or some interaction between these, that's going to load or copy it all unnecessarily. Better to make the chunking explicit.
So basically you're saying there is no benefit because you aren't used to systems that support it.
How did you get this information? If that's the case, why did they not actually use the values (e.g. do something trivial like computing an accumulated XOR)? That would still show the advantages of the zero-copy design, but not ridiculously so.
It still seems highly suspicious. For example, the benchmark with 100 KB values runs twice as fast as the benchmark with 100-byte values?!
I certainly agree that the zero-copy design can be good, but these benchmarks are vastly overstating it to the point that the benchmarks are pretty useless. With these benchmarks it's impossible to know if that database is actually faster than the others. It could well be faster, but it could also be slower. To know that you need a fair comparison.
Note also that memory-mapped designs have disadvantages as well. For example, AFAIK memory mapping doesn't work asynchronously: a page fault will block your thread, so you can't use lightweight threads or callbacks; you have to use heavyweight threads (which may limit concurrency and be expensive in general). Unless you implement your own OS, of course (which, interestingly, some people are doing: http://www.openmirage.org/).
> So basically you're saying there is no benefit here because you aren't used to systems that effortlessly support it.
I'm not saying there is no benefit; I'm saying it's a bad idea to rely on it that way for 500MB values, because for it to work safely every component has to support it the same way. It's too easy to have an unexpected problem somewhere and have it blow up by copying 500MB when you only wanted 1KB. But it's still valuable for a database to handle large values well, of course. Anyway, this is beside the main point.
Umm, obviously that's what I mean. And with async IO instead of memory mapping, you could avoid that. Hence why I said memory mapping has disadvantages. Clearly the "nonsense" was invented at your end.
random lookup 1000000 buffers in 1.82sec (550928/sec)
random lookup+hash 1000000 buffers in 2.52sec (397189/sec)
A benchmark which provides source code and test data should never be categorized as "misleading." At worst it puts too much trust in the average reader's understanding of things. But look at it this way: if you littered every page that included a benchmark with all the relevant caveats, it'd be a total mess.
Or as David Simon says, "fuck the average viewer."
status = sqlite3_open(file_name, &db_);
>Databases are opened in asynchronous write mode. (LevelDB's sync option, TreeDB's OAUTOSYNC option, SQLite3's synchronous options are all turned off, MDB uses the MDB_NOSYNC option, and BerkeleyDB uses the DB_TXN_WRITE_NOSYNC option). I.e., every write is pushed to the operating system, but the benchmark does not wait for the write to reach the disk.
By the way, there is a version of SQLite with an LMDB backend: https://gitorious.org/mdb/sqlightning
>Using tool/speedtest.tcl in the SQLite source tree, the time to insert 1000 records on my laptop SSD was 22.42 seconds using the original code, and only 1.06 seconds using MDB. Both tests were run 3 times, with the results averaged.
The only thing different about the Symas LMDB tests is that LMDB and BDB were added.
Search for 'write_sync'
std::string sync_stmt = (write_sync) ? "PRAGMA synchronous = FULL" :
"PRAGMA synchronous = OFF";
I'll look into it (never heard of it before). It might fit fairly well with something I'm working on right now (currently implemented using sqlite3).
People are always confusing "memory-mapped" with "memory-only". LMDB is not a memory-only database.
Why not check it yourself?
The other tests' source code is published too, right there.
If that were the case with the LMDB implementation, which runs as an OpenLDAP backend all over the world, you would most probably have heard about it in the news.
It was almost like MS completely missed the technology shift due to their glacial release cycles. But maybe I was wrong.
The key feature of B+ trees is that they are optimized to allow sequential scans through the index. I suppose SSDs don't "need" the sequential-scan property, but it doesn't hurt, and pragmatically it would still reduce the number of disk reads required to scan the index.
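A toy illustration (invented types, not any particular implementation) of why that property falls out of the structure: the leaves hold all the keys in order and are chained together, so a range scan descends the tree once and then just follows sibling pointers.

    #include <functional>
    #include <vector>

    struct Leaf {
        std::vector<int> keys;         // keys stored in sorted order
        Leaf* next = nullptr;          // sibling link to the next leaf in key order
    };

    // Visit every key in [lo, hi], starting from the leaf found by a single
    // root-to-leaf descent; no further tree traversal is needed.
    void scan_range(Leaf* start, int lo, int hi,
                    const std::function<void(int)>& visit) {
        for (Leaf* leaf = start; leaf != nullptr; leaf = leaf->next) {
            for (int k : leaf->keys) {
                if (k > hi) return;    // past the end of the range: stop
                if (k >= lo) visit(k);
            }
        }
    }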