
The kivaloo data store - cperciva
http://www.daemonology.net/blog/2011-03-28-kivaloo-data-store.html
======
mjb
> It is a durable, consistent, high-performance key-value data store built out
> of a background-garbage-collected log-structured B+Tree.

Some questions for Colin, if he reads this thread:

I assume log-structured implies 'append-only'. Is that correct?

This seems to be fairly similar in spirit to the design of BDB-JE, but without
offering many of JE's features (like ACID). This similarity is a good thing -
JE does a really great job in many situations. Have you done any comparisons
of kivaloo with BDB and BDB-JE, especially looking at IO constrained
performance?

Although it's pretty clear why BDB-JE wouldn't be ideal for Tarsnap (starting
with the Java thing), why did you choose not to go with BDB? Just a licensing
issue, or something more technical?

How do you trigger garbage collection? What kind of effect does garbage
collection have on the throughput of the database?

What kind of performance drop do you see with very sparse trees/logs, for
example with workloads that are very insert and delete heavy?

I haven't had time to look at the source yet, so I apologize if these
questions have obvious answers.

~~~
cperciva
_I assume log-structured implies 'append-only'. Is that correct?_

Yes. (Technically, append-at-head and delete-from-tail only.)

 _Although it's pretty clear why BDB-JE wouldn't be ideal for Tarsnap (starting
with the Java thing), why did you choose not to go with BDB?_

Last time I checked, BDB was a library. I wanted a server (because a server
can cache data structures).

 _How do you trigger garbage collection?_

The code keeps track of how much garbage is present and keeps a running tally
of how much garbage collection it "owes" based on maintaining a long-term
optimal GC rate. When that value is large enough, it looks for some old pages
to clean.
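My reading of that scheme, sketched with invented names and units (not
kivaloo's actual code): garbage accrues cleaning "debt" at a target long-term
rate, and cleaning kicks in once the debt crosses a threshold.

```python
class GCPacer:
    """Hedged sketch of a GC 'debt' tally (assumed design, not kivaloo's)."""

    def __init__(self, target_rate, threshold):
        self.target_rate = target_rate  # cleaning owed per page of garbage
        self.threshold = threshold      # owe at least this much before cleaning
        self.owed = 0.0

    def note_garbage(self, pages):
        # Each page of garbage written adds to the long-term cleaning debt.
        self.owed += pages * self.target_rate

    def should_clean(self):
        return self.owed >= self.threshold

    def did_clean(self, pages):
        # Pay down the debt as old pages are actually cleaned.
        self.owed -= pages
```

The point of the threshold is that cleaning happens in batches of old pages
rather than one page at a time.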

 _What kind of effect does garbage collection have on the throughput of the
database?_

"It depends". The optimal cleaning rate depends on the amount of I/O you're
already doing, so it turns out that cost-optimization automatically results in
you doing more cleaning when the active I/O load is lower. In the common case
where the load on the data store varies (either because it's bursty or because
of daily/weekly load cycles) there won't be any cleaning happening during the
high-load periods.

 _What kind of performance drop do you see with very sparse trees/logs, for
example with workloads that are very insert and delete heavy?_

The B+Tree is rebalanced every time pages are written to disk, so "sparse
trees" aren't possible.

~~~
mjb
Thanks for the answers, this is very interesting.

> Yes. (Technically, append-at-head and delete-from-tail only.)

Do you handle this by breaking the DB up into multiple small files, like BDB-
JE?

> The code keeps track of how much garbage is present and keeps a running
> tally of how much garbage collection it "owes" based on maintaining a long-
> term optimal GC rate. When that value is large enough, it looks for some old
> pages to clean.

How long-term is that? I assume you have considered the degenerate case, where
load is increasing linearly (instead of more common daily/weekly/etc. cycles)
and garbage collection falls behind. In the context of your next answer, it seems
like this case would cause the GC to get starved out, leading to higher IO
requirements for queries, leading to less time to GC, and so on to failure.

Granted, you would have to be running very hot for this to happen, but it's
possible.

One more question: Have you considered implementing in-memory
locking/synchronisation (like a shared mutex)? Offering test-and-set type
operations is a nice alternative to full-fledged transactions for many use
cases. If these are just used as synchronisation primitives then fsyncing them
every time seems wasteful, on the assumption that many use cases don't care
about lock durability across server failure.

~~~
cperciva
The block store component in kivaloo uses multiple files, yes.
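One common way such a log is laid out (a sketch of the general technique, not
kivaloo's actual file format): split the log into fixed-size segments, append
records at the head segment, and reclaim space by deleting whole tail segments.

```python
from collections import deque

class SegmentedLog:
    """Append-at-head / delete-from-tail log split into fixed-size
    segments; on disk, each segment would be one small file."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = deque()  # oldest segment at the left

    def append(self, record):
        # Start a new head segment (a new file) once the current one is full.
        if not self.segments or len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])
        self.segments[-1].append(record)

    def drop_tail(self):
        # Reclaim the oldest segment once everything in it is garbage;
        # on disk this is a single unlink() of the oldest file.
        return self.segments.popleft()
```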

Needing to do GC won't make you need more I/Os to service requests; the exact
same sequence of B+Tree nodes will need to be loaded from disk. The only
effect of GC is wasted disk space.

Kivaloo does support a data-loss mode, so you could store locks in a daemon
running with that option. Personally I prefer to err on the side of caution
when I'm dealing with locking and transactions.

------
wladimir
So it's a key value store that supports mapping keys of up to 255 bytes to
values of up to 255 bytes.

Isn't 255 bytes a bit short, especially for values? What use cases does this
have in Tarsnap? Filename <-> Hash mapping?

~~~
cperciva
_Isn't 255 bytes a bit short, especially for values?_

No. Values, not blobs.

 _What use cases does this have in Tarsnap? Filename <-> Hash mapping?_

(Machine, block name) -> (object on S3, offset within object, length of block)

And a few others to deal with in-progress transactions and suchlike. But I
think the largest values I need are 20 bytes long.
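For a sense of scale, here's a hypothetical encoding of that triple (an
invented layout, not kivaloo's actual on-disk format) packed into one value:

```python
import struct

# Hypothetical packing: 8-byte S3 object id, 8-byte offset, 4-byte length.
VALUE_FMT = "<QQI"
value = struct.pack(VALUE_FMT, 12345, 1 << 20, 65536)
# 20 bytes total -- comfortably under the 255-byte value limit.
obj, off, ln = struct.unpack(VALUE_FMT, value)
```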

~~~
wladimir
Thanks, that makes a lot of sense.

------
seiji
It even has dollar cost aware adjustment knobs:

    
    
      -S <storage:I/O cost ratio>
        Cost of one GB-month of storage divided by the cost of 10^6 I/Os.
        Used to control how aggressive the background log cleaner is.  Good
        values range from 0.08 (3 TB SATA with 100 random I/Os per second)
        up to 1600 (40 GB SSD with 25k random I/Os per second).  Setting
        -S 0 disables background log cleaning.  Defaults to 1.0 (which is a
        good value for Amazon EBS).
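The ratio is just (price of one GB-month of storage) / (price of 10^6 I/Os). A
tiny worked example; the circa-2011 EBS prices here are my recollection, not
from the docs:

```python
def cost_ratio(gb_month_price, per_million_io_price):
    """-S value: cost of one GB-month of storage divided by the cost
    of 10^6 I/Os (both in the same currency)."""
    return gb_month_price / per_million_io_price

# Circa-2011 EBS charged roughly $0.10 per GB-month and $0.10 per
# million I/O requests, which lines up with the default of 1.0.
ebs = cost_ratio(0.10, 0.10)
```

Cheap, slow storage (big SATA disks) pushes the ratio down; fast, expensive
storage (SSDs) pushes it up, hence the more aggressive cleaning there.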

~~~
cperciva
This is mostly for the benefit of SSDs. You want to be far more aggressive
with cleaning on them.

------
jetz
congrats. performance-wise it does not look like the best but durability and
consistency would be the key points.

> default 128 MB memory limit on kvlds. this can be changed i suppose? i hope
> docs will be coming.

> services requests upon it from a single connection; and a request
> multiplexer (mux) which accepts multiple connections and routes requests and
> responses to and from a single "upstream" connection. does this mean it
> supports multiple requests in one round-trip?

~~~
cperciva
_performance-wise it does not look like the best_

Show me a networked data store which does better on the same hardware. (FWIW,
with trivial keys and values, kivaloo gets significantly more operations per
second.)

 _default 128 MB memory limit on kvlds. this can be changed i suppose? i hope
docs will be coming._

Command-line option to kvlds, documented in a text file in the kivaloo
tarball.

 _services requests upon it from a single connection; and a request
multiplexer (mux) which accepts multiple connections and routes requests and
responses to and from a single "upstream" connection. does this mean it
supports multiple requests in one round-trip?_

Yes, and responses can come back out-of-order, too.

~~~
jetz
i didn't get what you mean by networked data store. afaik almost all kv stores
are similar wrt "network".

i checked <http://redis.io/topics/benchmarks> and saw that kivaloo performs
better within the given constraints, like the 255-byte value limit. your
benchmark page makes the numbers look worse because of that 128 mb option.

~~~
cperciva
_i didn't get what you mean by networked data store._

As opposed to a library like BDB.

 _your benchmark page makes numbers worse because of that 128 mb option._

I assumed that people would get "ok, here's the in-core performance, and
here's the out-of-core performance" and understand that where that drop
happens is a function of how much RAM they use. I suppose I should re-run the
benchmarks with kivaloo set to use 1.5 GB of RAM.

------
mml
i needed to get to work one day, and i didn't like all the available cars,
bikes, trains, motorcycles and the like, so i built myself a dandyhorse.

~~~
cperciva
A closer analogy would be "I needed to cross the English channel, and planes,
helicopters, boats, and trains weren't suitable to my needs, so I built myself
a jet pack instead".

------
narag
Kivaloo is said to be based on a B+Tree and to support a "range" operation. So
in fact it's _more_ than a key-value store, isn't it?

~~~
cperciva
Some key-value stores support RANGE; some don't. I don't see what you're
getting at here.

~~~
narag
I'd read that some key-value stores are implemented using hash tables and
erroneously assumed that was the general case. Good to know it isn't.

~~~
cperciva
A lot of key-value stores use hashing at the distributive layer but not within
individual nodes. Since kivaloo is currently single-node-only there's not much
point using a hash.
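A minimal illustration of that two-layer split (hypothetical names, not any
particular store's code): hashing decides only which node owns a key, while
each node can keep its keys ordered so RANGE still works locally.

```python
import hashlib

def pick_node(key, nodes):
    """Distributive layer: hash the key only to choose an owning node.
    Within a node, an ordered structure (e.g. a B+Tree) preserves
    key order for range queries."""
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return nodes[h % len(nodes)]
```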

