Just scanning the beginning, parts of this are pretty misleading.
"... if a database row contains 100 bytes,
and a B tree such as InnoDB employs 16KiB pages , then the B tree may perform
16KiB of I/O to write a single 100-byte row, for a write amplification of 160, compared to
a write amplification of 30 to 70 for the other data structures."
Reading this gives the impression that the only way pages are updated is by writing the page in whole. Storage engines commonly use a slotted page layout to avoid this. That is, variable length records are stored contiguously from the start of the page, and are indexed by an array of fixed sized offset entries that grows up from the end of the page. Inserting a 100 byte row will append it to the end of data area, then prepend a new offset to the slot array. On spinning disks this is typically two 512 byte writes (assuming page size is decently larger than block size, as is common). On SSD's it's more complicated due to the FTL, and state of the art systems tend to use a log structured approach. See the Bw-tree paper from Microsoft Research for example.
IMO they're being quite disingenuous in that note. InnoDB is the default storage engine in MySQL, and there are very few production SQL databases with a tiny 100 bytes per row. A typical DB row in common apps is a couple KB at least. As a further example, MySQL's NDB Cluster engine has a maximum row size of 8KB, and this is frequently a problem for apps that want to migrate to it.
If you start a discussion of write amplification with such an unrealistic record size, that pretty much moots the rest of the discussion.
Many important databases do contain 100-byte rows. Sometimes the biggest tables contain 100-byte rows, and then smaller tables contain larger rows.
I'd have to disagree with your assessment of the value of math.
It's nice to see someone trying to do performance benchmarks, but there are some serious problems with the benchmark presented at symas.com.
1) The benchmark hardware used fusion I/O. If you can afford fusion I/O, and you don't care about compression, you should use InnoDB. But you didn't measure InnoDB. If you use fusion I/O you probably actually care about compression, which you didn't measure, since you disabled compression. Why even report space used if you've disabled compression?
2) You used libc malloc to try to make a "uniform test". TokuDB recommends the use of JEmalloc, and you mentioned that some other storage engine requires TCmalloc. There's a reason to use a better memory allocator, and some storage engines need it. You've misconfigured at least two of the storage engines.
3) You wrote TokuDB code using the embedded API without passing the flags that make the high-ingestion rate possible. I'll be the first to admit that the documentation is poor for Tokutek's embedded API (the API that looks like Berkeley DB), but that doesn't change the fact that you are misusing the software. In particular, it appears that you used `0' for the flag argument to the DB->put() operations, meaning that every write is preceeded by a query. Since you apparently didn't mess up the LSM-based API's, it's not a fair comparision.
4) You ran your benchmarks for 20 minutes on each storage engine. That's not nearly long enough. For example, the system won't have aged at all. Interesting things start happening when the the database has been running for a long time. Interesting things that are predicted by the math.
I'm sure there are many important databases with 100 byte rows. But very few of these will be SQL databases, and very few of these will be using InnoDB.
1) re: InnoDB - sorry, but your constant comparison to InnoDB only reminds me of the old joke about the snail that got run over by a turtle - while waking up in the recovery room, the snail says "I don't remember what happened, it all happened so fast!" Beating InnoDB at anything says nothing useful. http://symas.com/mdb/memcache/
I also didn't write the LSM-based programs, their respective authors did. I trusted each of them to use their own APIs correctly. That may have been a mistake in the case of the TokuDB code, but given the lack of documentation I didn't have much better alternatives.
4) We have much longer benchmark runs published as well. But as noted here http://symas.com/mdb/hyperdex/ what really matters is the total number of ops executed. "Running for a long time" is only a requirement when your DB engine is ... slow.
Welcome, Bradley and Leif, to the club of people whom Howard has declared are doing it wrong. You are in fine company as I am also a member of that club. Although I do prefer snark from Turing Award winners. I don't think there is reason to pursue a discussion. It won't end well. But I do understand the desire to address the claims once.
It's important to understand LMDB is an edge case design that wants the fastest reads possible and is willing to significantly burden the write path in that interest. Immutability is still a powerful approach. For a good example read the MSR papers on the hekaton main memory engine's approach: http://research.microsoft.com/apps/pubs/default.aspx?id=1787...
The original post isn't exactly wrong, it's just something of a straw man. We live in an era of terabyte disks. Cold data storage is cheap.
In the early days of both Engine Yard and Kongregate Ezra and I worked closely on solving some problems during high pressure moments. He was smart, dedicated, and genuine. He worked hard to not just solve problems, but to communicate and teach everyone around him.
We lost touch over the years, chatting occasionally and always saying "hey, we should meet up sometime." I'm sorry now we didn't.
I know, in general, about various kinds of serialization anomalies, and I know (in moderate detail) how all of the isolation modes are implemented in postgres.
What I don't know is about the particular problem the exchange encountered, and why postgres's isolation would not have prevented it. Based on the very brief descriptions of the problem that I've seen, it seems like postgres would have prevented it, but I don't have enough information to say for sure.
Think of a transaction that inserts a row representing a withdrawal, updates a materialized total balance, and checks that it's positive. Under snapshot isolation, two concurrent instances of this transaction could commit. The materialized balance would reflect only one of the debits however, and would be inconsistent vs queries that recompute the aggregate in full.
Postgres pre version 9.2-ish would allow this situation even in "serializable" mode. Later versions wouldn't.
In this model, the view doesn't "own" the name. The application state owns it.
Think of it as being similar to the classic one request at a time LAMP app. You query data from the database, then a template turns that into displayed content. When you want to change the data, you send an update transaction and then redo the query and the rendering to html.
It's just as simple here. The application state is like our database. We don't change it in part, down in some particular object, we write transaction functions that take the entire state to a new state. What's neat is our code is conceptually very simple: just functions that take take state as input and render a view. It might seem this would be absurdly slow, but the implementation doesn't have to be so naive. React does diffing under the hood to figure out the exact minimum changes needed to update the DOM after the state changes. Mori gives you similar optimizations when changing the state itself.
Ok, so I guess what I'm looking for here is how to implement our "database" as a client side data model that can do its thing, with the immutable structures (mori?) so that React can figure out what's updated super quickly, and a way to make sure the app knows that it's been updated.
I need examples, basically!
Ideally I'd have an API to my datastore, which in turn would handle telling React what I've updated.
It feels like I'm just not "getting" something with Mori.
Say our state looks like this in a normal JS structure: