
Immutability, MVCC, and garbage collection (2013) - spiffytech
http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/
======
brandonbloom
This article was discussed previously. In short, it's based on numerous false
assumptions. For example, Datomic is not based on an append-only B tree. It
utilizes persistent data structures instead. Here's Rich's comment:
[https://news.ycombinator.com/item?id=7011102](https://news.ycombinator.com/item?id=7011102)

~~~
taeric
I'm curious to know what the actual data structure is. Also, it should be
noted that "append-only b-tree" leaves a _lot_ of leeway in how something is
actually implemented. It is as easily seen as a family of data structures as
it is a specific one. (Not to mention that "it uses persistent data structures
instead of an append-only B tree" is a somewhat silly statement, since an
append-only B tree _is_ persistent.)

And there is something amusing about using git as an example of how there is
no technical argument against losing information, because losing information
is a specific feature that _was_ added to git. (Look up shallow clones.) I can
accept that there is no technical reason to have this, actually. But there are
pragmatic reasons to have it, which is basically the point of this article.

~~~
brandonbloom
> I'm curious to know what the actual data structure is.

This tweet from Rich provides the missing puzzle pieces:

"The log is not a btree - used for durability, not query. Separate indexes
combine memory with batch-updated storage."
[https://twitter.com/richhickey/status/420910382538948608](https://twitter.com/richhickey/status/420910382538948608)

Restating in my own words:

The transaction log is essentially just a stream of asserts and retracts with
metadata. That's used for durability first and echoed to clients/peers and the
indexing engine. The indexing engine asynchronously batch updates indexes
(which probably are b-trees, but have no need to be streaming append-only
style in this context). The peers query a joined dataset of the latest
consistent indexes plus the set of unindexed transactions.
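That log-replay model can be sketched in a few lines. This is my own toy
illustration, not Datomic's actual datom format or wire protocol; the field
names and shapes are invented:

```python
from dataclasses import dataclass

# Hypothetical shape of a log entry: each transaction carries a batch of
# asserts/retracts plus metadata (here just the tx id).
@dataclass(frozen=True)
class Datom:
    entity: int
    attribute: str
    value: object
    tx: int
    added: bool  # True = assert, False = retract

log = [
    Datom(1, ":user/name", "alice", tx=100, added=True),
    Datom(1, ":user/email", "a@old.example", tx=100, added=True),
    Datom(1, ":user/email", "a@old.example", tx=101, added=False),  # retract
    Datom(1, ":user/email", "a@new.example", tx=101, added=True),
]

def current_view(datoms):
    """Replay asserts/retracts in tx order to get the latest facts."""
    facts = set()
    for d in sorted(datoms, key=lambda d: d.tx):  # stable sort keeps in-tx order
        key = (d.entity, d.attribute, d.value)
        facts.add(key) if d.added else facts.discard(key)
    return facts

facts = current_view(log)  # old email retracted, new one present
```

The point is that the log alone suffices for durability and replay; query
performance comes from the separately built indexes.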

> As append-only B tree is persistent.

You're right, it can be. However, the popular append-only b-tree systems (such
as CouchDB) are actually append-only B+-tree systems. That "+" means the
leaves of the tree are intrusively linked together, and so are not persistent.
You cannot "fork" such trees cheaply.

> losing information is a specific feature that was added to git

[http://blog.datomic.com/2013/05/excision.html](http://blog.datomic.com/2013/05/excision.html)

~~~
taeric
This doesn't really supply the missing piece. The contention is precisely
about what is used for query. Specifically, "how long before a write/assert
can be queried?" Otherwise, it might as well not have happened. Right?

Going further, you want fast queries on recent data. At this point, structures
that build updateable indexes as fast as possible are going to be preferred.
And you are likely looking at some form of b-tree for this. (Not necessarily,
true. But likely.)

Excision is neat, but not necessarily the same thing. In git, if I do a
shallow copy, I _do not_ know that it was shallow. It literally gives me wrong
information on who authored parts of code at this point.

And again, the point there is that it is a pragmatic feature, not necessarily
a technical/ideal one.

~~~
Tuna-Fish
> Specifically, "how long before a write/assert is able to be queried?"
> Otherwise, it might as well not have happened. Right?

Immediately. The approach used is to have a separate "long-term" data
structure and a smaller short term one. The long-term structure is only batch
updated periodically, while the short-term logs changes since last batch.
Every access queries both.

~~~
taeric
So now we are back to "what is the data structure?" Or is it non-indexed and
therefore has slower queries? Because that appears to be what is implied.

And this doesn't even get into questions such as at what point a record
becomes eligible to appear in my query. When I ran my query, or when I
iterated to where that record would be? Is this controllable?

And seriously, just answering "immediately" is very close to saying "by
magic." Too close, for this old timer's preference. I have a ton of respect
for the stack. More so for those making it.

~~~
t__crayford
It's a B-tree-like structure, with a fixed depth of 3, built on top of cached
(at the query layer), immutable storage. There's an "atom" stored in
consistent storage (and in memory on all processes), with a UUID pointer to
the immutable tree.
The root of the immutable tree contains pointers to "directories", which are
the second layer. The third layer contains the actual data, in chunks of ~1k
facts per segment. All of the immutable tree is stored in
riak/dynamo/cassandra/postgres or whatever (it's just binary blob storage by
k/v, so it's pretty easy to implement). There are several indexes that are in
different orders, all of them covering indexes.
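A rough sketch of that layout, using a plain dict to stand in for the blob
store and ignoring the multiple sort orders. All names here are hypothetical,
not Datomic's:

```python
import uuid

# All tree nodes are immutable blobs in a k/v store; the only mutable
# cell is a root pointer naming the current tree.
store = {}           # stands in for riak/dynamo/cassandra/postgres blobs
root_pointer = None  # the "atom": a UUID naming the current immutable root

def put_blob(value):
    key = str(uuid.uuid4())
    store[key] = value
    return key

def build_tree(facts, segment_size=1000):
    """Depth 3: root -> directories -> segments of ~1k sorted facts each."""
    global root_pointer
    facts = sorted(facts)
    segments = [put_blob(facts[i:i + segment_size])
                for i in range(0, len(facts), segment_size)]
    directory = put_blob(segments)       # second layer
    root_pointer = put_blob([directory]) # new immutable root

def scan_all():
    """Walk root -> directories -> segments, yielding every fact."""
    for dir_key in store[root_pointer]:
        for seg_key in store[dir_key]:
            yield from store[seg_key]

build_tree([(e, ":attr", e * 10) for e in range(2500)])
assert len(list(scan_all())) == 2500
```

Because every node is just a binary blob keyed by UUID, any k/v store works
as the backing storage, exactly as described above.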

The transaction log also goes into a tree, but that tree is structured rather
differently (for performance reasons). For example, it maintains a "linked-
list" (in storage!) of the latest N transactions, which it then rolls up into
one tree node once that list gets to a certain size.

The missing part about how transactions are available "immediately" (which is
a word that doesn't make sense for a distributed database ;) is that the
transactor (which is a separate process/system from those that answer
queries) streams new transactions, as they happen, to the query boxes (known
as "peers"). A peer is just your usual client process: for example, your Java
frontend webserver process. (At this time only JVM clients are properly
supported in this model.)

Indexing is done in the background every ~33mb of transaction data (in the
transactor), and transactions are allowed to build up during an indexing job,
with back pressure applied if too much data accumulates in memory. Indexing
isn't append-only at all - it creates a new tree (one that very often shares
a lot of data with the old tree, however).
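That kind of structural sharing can be shown in miniature. This toy version
(my own, not Datomic's) uses path copying: a rebuilt index gets a new root and
new changed segments, while untouched segments are shared with the old tree:

```python
# An "index" here is just a root dict mapping segment names to
# immutable tuples of facts.
old_tree = {
    "seg_a": ("a", "b", "c"),
    "seg_b": ("d", "e", "f"),
}

def rebuild_with(tree, segment, new_facts):
    """New root, new changed segment; untouched segments are shared."""
    new_tree = dict(tree)          # copy only the root
    new_tree[segment] = new_facts  # replace the changed path
    return new_tree

new_tree = rebuild_with(old_tree, "seg_b", ("d", "e", "f", "g"))

# seg_a is the very same object in both versions: shared, not copied.
assert new_tree["seg_a"] is old_tree["seg_a"]
# The old version is untouched, so readers of it see a consistent view.
assert old_tree["seg_b"] == ("d", "e", "f")
```

Old readers keep a consistent snapshot for free, and no append-only log of
pages is needed to get there.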

To answer queries, the "peers" merge the new transactions they've received
(that are in memory), with the durable index. That's how new transactions get
seen quickly - the peers have recent data in memory, and other data in long
term durable storage.

Transactions are visible as soon as the transactor's streaming sends them to
peers. There is also a mechanism to say "wait until this transaction has
arrived at this peer" before querying.

Because the data in the indexes is immutable, it's trivial to cache in the
client processes. Many smaller databases can fit entirely in memory, in which
case querying only hits main memory, not the network on the peer process,
which makes them (potentially) many orders of magnitude faster than querying a
traditional RDBMS.

~~~
taeric
First, I'm hesitant in making this post, as I do not intend to belittle
anything getting done. Sounds like this team is working very hard on
solutions, and that is ultimately awesome. And, serious thanks for the
details. Sounds very fun. (And, over my head.)

I am curious why you have "potentially" in parens. Is this just not panning
out in measurements? Are these not techniques that older products could have
already subsumed into their repertoire?

A lot of this reads like RISC versus CISC debates. There are virtually no
techniques that one side can claim monopoly on. So it is not surprising to see
that picking the acceptable tradeoffs and combining solutions appropriately is
often very effective.

------
eloff
I've been developing high-performance databases off and on for the last five
years. This article is right on the money. It's tough to beat MVCC for a
database that wants isolated concurrent transactions. Append-only designs seem
attractive, but they don't actually help much with write throughput on SSDs
(though avoiding in-place updates does, due to erase blocks), and they often
require writing much more information. E.g. writing 100 bytes in LMDB, which
is an append only btree, requires copying all pages from the root to the leaf
and writing them all out, typically 64k or more. Writing 100 bytes to a
transaction log or WAL costs about 100 bytes. There's a huge write
amplification going on.

~~~
hyc_symas
LMDB is a COW tree but it is emphatically _not_ an append-only tree. Append-
only sucks, for all of the reasons already spelled out in the blog post.

The write amplification in LMDB is bounded by the height of the tree, so it's
O(logN) with N=number of records. The write amplification in all WAL-based
designs is O(N) where N=size of the records. With a 4K page size, to get a
write amp of 64K in LMDB would require a tree depth of 16. At 100 bytes per
record, 40 records per page, that would require 40^16 records, or
42949672960000000000000000 records. IMO your figure of 64K is vastly
overestimated.
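A quick sanity check of those numbers, assuming (as in the comment) ~40
100-byte records per 4K page and one copied page per tree level:

```python
import math

# COW write amplification in LMDB is one page per tree level:
# amp = depth * page_size, and depth grows with log(N) at fanout ~40.
PAGE = 4096
RECORDS_PER_PAGE = 40

# A 64K write amp would need a tree 16 pages deep...
depth_for_64k_amp = (64 * 1024) // PAGE
assert depth_for_64k_amp == 16

# ...which at fanout ~40 corresponds to roughly 40**16 records:
capacity = RECORDS_PER_PAGE ** depth_for_64k_amp
assert capacity == 42949672960000000000000000

# A billion-record DB needs only depth ~6, i.e. ~24K of COW page
# writes per 100-byte update, not 64K.
depth = math.ceil(math.log(10**9, RECORDS_PER_PAGE))
assert depth == 6
```

Internal-page fanout is actually higher than 40 in practice (keys only, no
values), so the real depth, and thus the real amplification, is lower still.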

I've done more detailed investigation of write amplification in various DB
engines here [http://symas.com/mdb/ondisk/](http://symas.com/mdb/ondisk/) and
you can see that the break-even point relative to WAL-based engines is around
record size of 2KB. Above that size LMDB has _lower_ write-amp than WAL-based
approaches. Coincidentally (hah) LDAP entries tend to be >= 2KB in size.

But most importantly for real-time services - LMDB write latency is tightly
bounded, and never suffers from stop-the-world compaction pauses. This was a
design requirement from day 1. (See section 3.2 of the 2011 LMDB design paper
[http://symas.com/mdb/20111010LDAPCon%20MDB.pdf](http://symas.com/mdb/20111010LDAPCon%20MDB.pdf)
) E.g., see the comparison of write latency in LMDB vs HyperLevelDB here
[http://symas.com/mdb/hyperdex/#100M](http://symas.com/mdb/hyperdex/#100M) \-
we can reliably compute/predict LMDB's I/O latency and our prediction
perfectly matches the real world results - 33ms avg latency on a disk with
16ms avg seek time. You cannot do that with any of the other DB engines.

~~~
eloff
Sorry, I got my terminology muddled. IIRC LMDB is similar to append-only
trees, except it puts obsolete pages into a free list (also an LMDB tree?) and
will reuse them when it reaches the end of the file? And my math is way off
for 100-byte records, laughably so. Don't get me wrong though: I needed an
example of the write amplification that COW causes in append-only trees. LMDB
is a great database, and for the use case it was designed for (large records
and a read-heavy workload) it has no peer.

~~~
hyc_symas
> and will reuse them when it reaches the end of the file?

You were right up till that - it reuses pages as soon as they are safe to
reuse, not waiting to get to the end of the file.

Anyway, yes, write amplification is a significant factor, in any DB design.

~~~
eloff
While I have your attention, assuming SSD with typical 128k erase blocks, do
you do anything special to avoid write amplification at the disk level when
reusing pages? My understanding is that those are in-place updates from the
filesystem's POV, so writing over 32k of old pages will cause 128-256k actual
writes. One could avoid this somewhat with trim support, fallocate + hole
punch (not portable) but you have to think carefully about alignments and
grouping free pages into contiguous erase block multiples. That never struck
me as being very practical. Have you thought about that problem before or
tried anything there?

~~~
hyc_symas
Thought about it - yes. Tried anything - no. We have no idea what an SSD's FTL
is doing under the covers, and no way to find out through a SATA/SCSI command
interface. Trim and hole punching are no advantage: it's not safe to do them
before a page may be safely reused, and as soon as a page can be safely
reused, we'll probably be writing new data into it.

------
daemonk
Is this a case of right tools for the right job? Are there special cases where
immutability is superior?

------
juliangregorian
On the RethinkDB thread yesterday there seemed to be a popular misconception
that MVCC provides rewindable versioning capabilities -- it does not. Good to
see this article getting some traction.

~~~
hyc_symas
Many MVCC systems _can_. I personally have never had a reason to want to.

~~~
juliangregorian
Sure, but it's not due to implementing MVCC. And, as the article illustrates,
that type of optimistic approach doesn't come for free.

