

532x Performance Increase for MongoDB with fractal tree indexes - rdudekul
http://www.tokutek.com/2012/11/532x-multikey-index-insertion-performance-increase-for-mongodb-with-fractal-tree-indexes/

======
imperio59
I read the presentation here:
<http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf>

The math looks a bit hand-wavy to me but I get the basic data structure: you
basically have sorted arrays whose lengths are each a power of two (1, 2, 4,
8, etc.), and each array is either completely full or completely empty,
following the binary representation of the element count. So for example if
you have 5 values, you'd have a full array of 1 element, an empty array with
2 slots, and a full array of 4 elements.

Each array is sorted by itself, but there's no ordering between arrays: a
smaller array's values aren't necessarily less than the values in the bigger
arrays.

To do an efficient lookup, each array carries forward pointers into the
next-bigger array, so rather than binary-searching each array from scratch
you can start the search at a known index in the bigger array and move
faster.

The cost shows up when you insert lots of values, because you have to merge
arrays repeatedly and rewrite a bunch of data. But the point is that those
merges are sequential I/O, which the disk handles far better than the random
seeks of rebalancing a B-tree; you're not making the disk head skip around
all the time, so you come out ahead on speed.
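
To make that concrete for myself, here's a tiny Python sketch of the
merge-on-insert behavior (purely illustrative, based on my reading of the
slides; the names and the naive lookup are mine, and it skips the forward
pointers entirely):

    import bisect
    import heapq

    class COLA:
        """Toy cache-oblivious lookahead array: level k is either empty
        or a full sorted list of 2**k keys."""

        def __init__(self):
            self.levels = []  # levels[k] is [] or a sorted list of 2**k keys

        def insert(self, key):
            carry = [key]  # a full array of size 2**0
            k = 0
            # Merge into successive levels exactly like a binary counter
            # carries: each full level gets merged upward and emptied.
            while k < len(self.levels) and self.levels[k]:
                carry = list(heapq.merge(carry, self.levels[k]))
                self.levels[k] = []
                k += 1
            if k == len(self.levels):
                self.levels.append(carry)
            else:
                self.levels[k] = carry

        def contains(self, key):
            # Naive lookup: a separate binary search per level. The forward
            # pointers described above exist precisely to avoid restarting
            # the search from scratch at every level.
            for lvl in self.levels:
                i = bisect.bisect_left(lvl, key)
                if i < len(lvl) and lvl[i] == key:
                    return True
            return False

    c = COLA()
    for v in (5, 3, 8, 1, 9):
        c.insert(v)
    print([len(lvl) for lvl in c.levels])  # [1, 0, 4]: capacities 1 (full),
                                           # 2 (empty), 4 (full), as above
    print(c.contains(8))                   # True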

I'm curious about the details of this benchmark though, what kind of values
are we talking about? Are they all the same size rows or do you have variable
sized values (like strings?) in the benchmark?

This looks promising but I'd love more explanations from the authors...

~~~
leif
Hi, I'm one of the authors (I'm an engineer, not an author of the original
papers, though I can discuss those too)! What would you like to know?

First, the structure you described is called a cache-oblivious lookahead
array (COLA), and is not what we implement. It conveys a lot of the concepts
that make up a fractal tree and is a good educational tool, which is why
you'll find lots of Tokutek material that describes it, but there are some
key places where it departs from what we actually ship. Conceptually (and
asymptotically) they're similar, but when you get down to the details of an
industrial-strength implementation (variable-sized keys, as you mention;
concurrency is another frequent question we get), you have to be more
specific about what we build.

So. What we build is closer to a cache-oblivious streaming b-tree, but I won't
make you look that up. Basically, we take a b-tree and stick an unsorted
buffer on each internal node. To insert data into a subtree rooted at node X,
you just append your data to X's buffer. Then, if the buffer's too full, you
flush X's buffer, distributing all its elements among X's children (but that
just puts them in the buffers of the children). You can flush recursively, and
make leaf nodes just hold data instead of buffers. This gives the same
insertion asymptotics as the COLA did.
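
If it helps, here's the shape of that in Python (a deliberately tiny sketch
with made-up names and a toy buffer size, nothing like our actual node
format or flushing policy):

    BUFFER_LIMIT = 4  # toy value; real buffers are orders of magnitude bigger

    class Leaf:
        def __init__(self):
            self.items = {}  # leaves hold the actual key/value data

        def insert(self, key, value):
            self.items[key] = value

        def lookup(self, key):
            return self.items.get(key)

    class Internal:
        def __init__(self, pivots, children):
            self.pivots = pivots      # routing keys, len(children) - 1 of them
            self.children = children
            self.buffer = []          # unsorted pending (key, value) messages

        def insert(self, key, value):
            # Inserting into this whole subtree is just an append here.
            self.buffer.append((key, value))
            if len(self.buffer) > BUFFER_LIMIT:
                self.flush()

        def child_for(self, key):
            for i, pivot in enumerate(self.pivots):
                if key < pivot:
                    return self.children[i]
            return self.children[-1]

        def flush(self):
            # Distribute buffered messages among the children; that may just
            # land them in the children's buffers, which flush recursively
            # in turn when they overflow.
            pending, self.buffer = self.buffer, []
            for key, value in pending:
                self.child_for(key).insert(key, value)

        def lookup(self, key):
            # A query has to check the buffers on the root-to-leaf path,
            # newest message first, before consulting the leaf.
            for k, v in reversed(self.buffer):
                if k == key:
                    return v
            return self.child_for(key).lookup(key)

    root = Internal([100], [Leaf(), Leaf()])
    for k in (5, 150, 42, 7, 101, 3):
        root.insert(k, "v%d" % k)
    print(root.lookup(42))  # "v42"

The thing to notice is that an insert never touches a leaf until a buffer
overflows, so many inserts get batched into each physical node write.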

Now you need to pick a node size, a strategy for querying, ideas for
concurrency, recovery, etc., but this is a start. One thing worth noting up
front: since we write nodes less frequently, we can pick a much larger node
size, which helps eliminate fragmentation and lets us be much more
aggressive about compression (since large blocks compress better than small
ones).
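
You can see the large-vs-small-block effect with any generic compressor.
Here's plain zlib (just a demonstration of the principle, not our
compression pipeline or block format):

    import zlib

    rows = [b"id=%06d,status=active,ts=2012-11-14\n" % i for i in range(4096)]
    data = b"".join(rows)

    one_big_block = len(zlib.compress(data, 6))
    many_small = sum(len(zlib.compress(b"".join(rows[i:i + 16]), 6))
                     for i in range(0, len(rows), 16))
    # The single large block shares its dictionary across all rows and pays
    # the per-stream overhead once, so it comes out markedly smaller than
    # the sum of the independently compressed 16-row chunks.
    print(one_big_block, many_small)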

Hopefully this gives some insight into how we handle variable sized elements.
The mongo benchmark (as with all our mysql benchmarks) is running on something
that supports variable sized elements (though I don't remember if the
benchmark itself takes advantage of that).

This is still pretty unspecified, and we can keep talking about fractal trees,
but does what I've said so far make sense? Do you have any new questions?

~~~
imperio59
What if I pull out the plug to the server while you're in the middle of a
buffer write? You say you achieve throughput by writing to nodes less
frequently, does that mean in case of a hardware/power failure the amount of
data loss will be higher than with a system that writes more often to disk?
How often do you guys flush to disk?

Edit: I'm not saying those are necessarily bad trade-offs, I'm just curious :)

~~~
leif
We don't do anything super special here. Most databases have some notion of
"node" (or in mongo, "bucket", to distinguish from a machine as a node in a
cluster), and keep track of which nodes in memory are clean or dirty. We log
all operations and, by default, fsync the log on commit, so we're completely
durable; there's no data loss (unless you didn't commit something before you
pulled the plug, of course). To trim the log, you have to know that
everything dirtied by a transaction has been written to disk, and if you
don't trim the log often enough, it gets big and your crash recovery takes a
long time (innodb sometimes exhibits this). We and others have a notion of
"checkpointing", which says "all the nodes as of this point in the log have
been written to disk and marked clean", and that allows us to trim the log.
Note that this doesn't have to mean "stop the world and write out everything
in main memory"; you can do a lot better (but that's a long discussion).

Because a given operation, even if it involves lots of random writes in the
key space, still only dirties a few nodes at the top of the tree, we
actually checkpoint faster and write less data than the same b-tree would
(it would have to checkpoint a bunch of leaves, with random I/O, for every
one of our nodes). So our log is often much smaller and our recovery times
are much faster, usually on the order of seconds or minutes rather than
hours. By default in our mysql product, we checkpoint every...I think 60
seconds, but it could be 90. It's configurable, but that's the ballpark.

But most of my comparisons here are with innodb, to be fair. Mongodb's storage
system doesn't support transactions and we're still learning about its
durability and recovery model, so I can't make any totally fair claims
comparing us to them.

(I love talking about this stuff, so keep up the curiosity)

------
carterschonwald
Please note that the actual name for a "fractal tree" in the research
literature is "streaming (cache-oblivious) B-tree" (or at least the two are
very closely related).

Writing code with good memory locality is SUPER important for
high-performance work, whether it's purely in-memory or larger than RAM
(e.g. for the DB). It's also a fun exercise to try to understand how!

~~~
monopede
Would make sense. Michael Bender is a co-author on that paper and he
apparently works for Tokutek, judging by this presentation:
[http://www.bnl.gov/csc/seminars/abstracts/Bender_Presentatio...](http://www.bnl.gov/csc/seminars/abstracts/Bender_Presentation.pdf)

~~~
ot
He is a co-founder: <http://www.tokutek.com/about/team/>

------
Groxx
> _At 3.5 million inserted documents, the exit velocity of standard MongoDB
> was 2.11 inserts per second..._

That's terrifying - do people expect performance like this? Or was this
crafted to be a pathological case? 100-element arrays don't seem too common,
but that only makes this 350 million entries in e.g. a SQL table - I suspect
my laptop running MySQL could outdo that kind of performance (but I have no
proof; I could be very wrong).

~~~
knightni
It doesn't surprise me a great deal - Mongo doesn't have a history of being a
technologically strong database system. While it is/was perceived as high
performance, it's always been clear that this performance was gained as a
result of compromises rather than technical excellence.

------
gizmo686
The title is a little misleading: it looks like an asymptotic improvement,
and 532x is just the gap at the scale the benchmark ran. Looking at the
graph, it appears to be a significant improvement, as the old version
clearly dropped to about 0 while the new version looks constant. (It took
some staring to see the downward trend.)

------
yason
Whenever I see speed-up factors on the order of 100x, all I can think is
"the original implementation must have been trivially superslow".

------
b0b0b0b
Sounds tantalizing, but in reading it, is there a risk of patent
infringement liability?

------
nasalgoat
I've discussed these results with the team at 10gen and their comment is
basically "that's interesting but we're not looking at it at this time."

All told, based on my experience, MongoDB's performance still has a ways to
go.

------
hmexx
Have the MongoDB architects commented on this?

------
X4
Nobody believes me when I say fractals solve literally everything
efficiency-related. Even though it's true.

Nobody believed me when I introduced on-demand script injection for
javascript, today it's IT etiquette.

The power of popularity I guess.

~~~
krmmalik
Did you blog about it? Give examples? What else can be solved with fractals?

~~~
Zenst
<http://en.wikipedia.org/wiki/Fractal_compression>

~~~
krmmalik
Thank you

