
FoundationDB's Lesson: A Fast Key-Value Store Is Not Enough - jermo
http://voltdb.com/blog/foundationdbs-lesson-fast-key-value-store-not-enough
======
jandrewrogers
The underlying organization of a database is not about what can be expressed
in theory, it is about what can be expressed _efficiently_. Not only is there
a very rich set of data structure designs that can be used with varying
tradeoffs, but most sophisticated databases use an ecosystem of different,
tightly interwoven data structures that smooth over the sharp corners that any
single data structure has.

Efficient expressiveness follows from the relationships directly preserved in
the organization. Generally from least-to-most expressive you have:

- cardinality preserving (hash tables, most KV stores)

- order preserving (LSM, btree, skiplist, space-filling curves*)

- space preserving (space decomposition, e.g. quad-trees, a zoo of exotics)

Competent implementation is progressively more complex, nuanced, and
sophisticated as you go down the list, so most implementations reflect the
comfort level of the implementor. As you move down this stack, you can express
things efficiently that will be very inefficient to express with a less
expressive class of organization.
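To make the hash-vs-order distinction concrete, here's a rough Python sketch (a dict standing in for a hash-based KV store, a sorted array standing in for a btree/LSM level): the hash table has to touch every key to answer a range query, while the ordered structure can binary-search straight to the range.

```python
import bisect

# Cardinality-preserving: a hash table gives O(1) point lookups,
# but a range query has to examine every key.
kv = {"user:07": "bob", "user:42": "alice", "user:99": "carol"}

def hash_range(store, lo, hi):
    # O(n) scan: the hash table preserves no key order.
    return sorted(k for k in store if lo <= k <= hi)

# Order-preserving: a sorted array (a stand-in for a btree/LSM)
# answers the same query in O(log n + k) via binary search.
keys = sorted(kv)

def ordered_range(sorted_keys, lo, hi):
    i = bisect.bisect_left(sorted_keys, lo)
    j = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[i:j]
```

Same answer either way; the difference is that the ordered version never looks at keys outside the range, which is exactly what SQL range predicates and ordered iteration lean on.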

SQL was designed for databases built on order-preserving structures. While you
can implement it on a KV store, it will never be as efficient as a database
built on the more expressive organization that SQL assumes.

KV stores are popular because they are relatively simple to design and
implement, not because they are expressive. It is an architectural impedance
mismatch to add query functionality that tacitly assumes a more expressive
organization. It will never perform well against a database actually organized
for the expressiveness of the query layer.

Any database can only do a few things really well. It is inherent in the
tradeoffs. You can add mediocre support for a laundry list of other "good
enough" capabilities but you never want to market and position your product
around those mediocre capabilities.

~~~
aquark
That's a good breakdown.

In this case though FoundationDB's KV store was order preserving. The API
supported (& efficiently implemented) ranged reads, not just individual gets.

Implementing layered architectures always looks good 'on paper' but the
details often throw up performance issues that are hard to deal with without
punching some holes in the abstractions.

In this case it seems that relatively few companies had a need for a massively
scalable ordered KV store, and the whole SQL layer was an attempt to bridge
the product to a wider audience. It would be fascinating to hear more of the
story but I suspect that will never escape now.

------
gtrubetskoy
Regarding SQLite using FoundationDB as k/v, I got inspired by that comment to
do the same thing with Redis instead:
[http://grisha.org/blog/2013/05/29/sqlite-db-stored-in-a-redis-hash/](http://grisha.org/blog/2013/05/29/sqlite-db-stored-in-a-redis-hash/)
(and it was quite slow too)

I'm curious though - databases typically store data in B-trees made up of
blocks of equal size, which works great for block storage. So isn't "block
storage" essentially a key-value store, where the key is the block number and
the value is the block itself? _That_ , I think, is the proper way of using a
key/value store as the database back-end. (And that's what I did in my
SQLite/Redis experiment, BTW)
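To sketch what I mean (toy code, a plain dict standing in for the Redis hash; all names made up):

```python
PAGE_SIZE = 4096  # fixed block size, like SQLite's default page size

class BlockStore:
    """A toy block device backed by a key-value store: the key is the
    block number, the value is the raw page bytes."""
    def __init__(self):
        self.kv = {}  # stand-in for a Redis hash (HSET/HGET by block number)

    def write_block(self, blockno, data):
        assert len(data) == PAGE_SIZE  # pages are always full-sized
        self.kv[blockno] = data

    def read_block(self, blockno):
        # Unallocated blocks read back as zeroes, like a sparse file.
        return self.kv.get(blockno, bytes(PAGE_SIZE))

store = BlockStore()
store.write_block(3, b"\x01" * PAGE_SIZE)
```

The database on top never sees keys or values, just numbered pages, so its B-tree logic works unchanged.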

~~~
jhugg
I think this may slightly improve the too-fine-granularity locking, and it
might make full table scans a bit more efficient, but otherwise most of what I
wrote in the post applies. In fact the metadata problem has gotten worse and
you might have to move even more data around.

It would help if you could push down filter predicates to run locally inside
Redis, but at that point you're already more than a key-value store. I wonder
if you could do this using Lua?

~~~
fizx
I think you could, but at this point you're butting up against the event loop
assumption: that most of the work is IO. If you do compute on the edges, you
then want threads, and you're re-engineering redis (Edit: I should have read
grandparent's link, where he does just this).

But the core idea of pushing predicates to edges seems reasonable. At one
point, I built this sql engine that coordinated queries and pushed down
queries to the edges. It assumed that each edge store implemented an iterator
over all its values, with optional filtering and sorting (if not implemented
on the edge store, then the engine/client would filter/sort). It works great,
but I haven't yet published it for other reasons.
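Roughly the interface I mean, sketched in Python (all names hypothetical, since the real engine isn't published): each edge store iterates over its rows, pushing down filter and sort when it can, and the engine re-applies whatever the edge couldn't.

```python
class EdgeStore:
    """A toy edge store: iterates over its rows, applying a filter
    and sort locally ("pushed down") only if it supports them."""
    def __init__(self, rows, can_filter=True, can_sort=False):
        self.rows, self.can_filter, self.can_sort = rows, can_filter, can_sort

    def scan(self, predicate=None, sort_key=None):
        rows = iter(self.rows)
        if predicate and self.can_filter:
            rows = (r for r in rows if predicate(r))  # pushed down
        if sort_key and self.can_sort:
            rows = iter(sorted(rows, key=sort_key))   # pushed down
        return rows

def query(stores, predicate, sort_key):
    """Engine side: merge results, falling back to filtering/sorting
    at the engine for stores that can't do it locally."""
    out = []
    for s in stores:
        for r in s.scan(predicate, sort_key):
            if not s.can_filter and not predicate(r):
                continue  # filter at the engine instead
            out.append(r)
    return sorted(out, key=sort_key)  # final global sort at the engine

stores = [EdgeStore([3, 1, 4]), EdgeStore([1, 5, 9], can_filter=False)]
result = query(stores, predicate=lambda r: r > 1, sort_key=lambda r: r)
```

The point is that the query plan stays the same either way; capability flags just decide where each operator runs.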

~~~
FeepingCreature
Hey, amateur here - but has anybody tried to do a database where your edge
servers literally run jit code? Like, you'd define a predicate like an OpenCL
kernel, as a small ball of code taking a predetermined set of constants or
per-row variables, then presumably push this as LLVM bytecode and let the
edges compile it into locally appropriate loops (probably with caching). Is
the problem there that it would become hard to apply optimizations that depend
on awareness of data structure at a higher-than-row level?

~~~
jhugg
So there are plenty of systems that compile portions of a SQL plan to bytecode
(LLVM or JVM) or machine code directly. Usually, the part you compile is the
SQL plan and most importantly the predicate filters.

Common operations like networking, transaction management and even index walks
(except the key comparisons) are already compiled to native code, so you don't
need to go all in. You just optimize the stuff that needs it.
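Python obviously can't lower to LLVM by itself, but the shape of the idea can be sketched: compile the plan's predicate once up front, then evaluate the compiled form per row, instead of re-interpreting the expression for every row (hypothetical predicate, illustration only):

```python
# A filter predicate as it might appear in a SQL plan, written as an
# expression over the row's columns (hypothetical example).
predicate_src = "row['price'] > 100 and row['region'] == 'EU'"

# Interpreted path: re-parse and evaluate the source string per row.
def interpreted_filter(rows):
    return [row for row in rows if eval(predicate_src, {"row": row})]

# "Compiled" path: compile the expression to bytecode once, reuse per
# row. (A real engine would lower this to LLVM IR or machine code.)
compiled_pred = compile(predicate_src, "<predicate>", "eval")

def compiled_filter(rows):
    return [row for row in rows if eval(compiled_pred, {"row": row})]

rows = [{"price": 150, "region": "EU"},
        {"price": 150, "region": "US"},
        {"price": 50, "region": "EU"}]
```

Both paths return the same rows; the win is that the per-row cost drops to executing the predicate rather than re-translating it.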

------
orand
Hmm, a previous competitor slamming them only after they're certain
FoundationDB's gag order will prevent a rebuttal. Classy.

~~~
lmb
Correct me if I'm wrong, but what happened to FoundationDB smells like an
acqui-hire par excellence. This article provides an interesting alternative to
the "Apple bought this space age technology to keep it to themselves"
narrative.

Seems like the guy knows what he's talking about as well, b/c surprise, he's
working on a DB.

~~~
exelius
There's a third option: Apple had a need for a massive, distributed K/V store,
and at their scale, it's cheaper to buy a company than a license.

~~~
fweespeech
[http://cassandra.apache.org/](http://cassandra.apache.org/)

I'm not sure why Apple would prefer FoundationDB to Cassandra for this
use case.

~~~
robertszkutak
Perhaps because FoundationDB is ACID compliant?

~~~
jhugg
So there is a point here. Apple is trying to compete with Google. Google has
some amazing distributed systems, including Spanner, F1, MillWheel, etc...
Apple has Cassandra and other OSS/COTS software. Not to ding Cassandra, but
this is a problem for Apple. We've seen repeatedly that at huge scale, it
often makes sense to own (or control) the software. See Facebook and LinkedIn
as well.

Now I don't think FDB (the product) is the answer, at least not in the short
term. There are more problems scaling it to Apple's use case than there are
working around Cassandra's lack of ACID.

So I'm convinced the value of FDB is the experience in the engineers' brains.
Apple needs brains to run Cassandra, but also to figure out if Cassandra is the
right long term path. Build, buy, adapt? It takes veterans to make the right
call.

~~~
threeseed
Apple runs Cassandra at scale today. It underpins most of iTunes Match and
last time I checked all your iCloud data was sharded and stored in Cassandra.
They run one of the largest clusters in the world (at least the last time I
spoke with Datastax). Cassandra is a pretty easy database to run and scale and
Datastax is just down the road from them to help.

My guess is that FoundationDB is replacing their Teradata installation. Better
to buy the company and invest heavily in it than let it fail to meet its full
potential as a small startup.

------
arthursilva
Their SQL layer architecture is indeed chatty.

If I remember correctly they were doing/planning a lot of optimizations like:

1. delaying requests a little on purpose to take more advantage of batched
requests

2. fancy techniques to improve join locality (based on Akiban's previous
work).

These two go hand-in-hand.

So in a good enough network (aka not any public cloud) it'd probably work
reasonably well up to a point.
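The batching idea in miniature (toy Python, made-up numbers): hold each read briefly until a batch fills, then issue one multi-get instead of N round trips.

```python
class Batcher:
    """Toy request batcher: reads queue up until a size threshold
    (a real system would also flush on a small time window), then hit
    the backend as one multi-get instead of N single round trips."""
    def __init__(self, backend, max_batch=4):
        self.backend, self.max_batch = backend, max_batch
        self.pending = []
        self.round_trips = 0

    def get(self, key):
        self.pending.append(key)
        if len(self.pending) >= self.max_batch:
            return self.flush()
        return None  # caller's future would resolve on flush

    def flush(self):
        keys, self.pending = self.pending, []
        self.round_trips += 1  # one round trip for the whole batch
        return [self.backend[k] for k in keys]

backend = {f"k{i}": i for i in range(8)}
batcher = Batcher(backend)
batches = []
for k in sorted(backend):
    done = batcher.get(k)
    if done is not None:
        batches.append(done)
```

Eight gets, two round trips: the deliberate small delay buys back per-request network overhead, which is why it only pays off when the network itself is decent.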

~~~
jhugg
When I started VoltDB, Mike Stonebraker told me we had to be 10x faster than
Oracle sitting on a RAM drive, or nobody would care. 5x wasn't interesting to
him.

It seems like FDB-SQL was closer to 1x, with a much better replication story,
but with huge limitations on the kinds of things you can do.
([https://foundationdb.com/layers/sql/documentation/Concepts/known.limitations.html](https://foundationdb.com/layers/sql/documentation/Concepts/known.limitations.html))

So maybe you could push it to 2x or 3x with a few years of work, but other new
systems with more SQL support and more customer traction are doing 10x and up
today. It's a tough sell.

~~~
arthursilva
While I do agree with you, I don't think anyone used FoundationDB solely for
its SQL layer, and if they did...

~~~
jhugg
Right. I'm not being critical of the underlying KV tech. It seemed pretty
impressive from what I know. My two points were: 1. The SQL thing wasn't
gonna work the way they went about it. 2. Without SQL or some other powerful
query tool, it's a less interesting product.

------
shin_lao
I think, quite the contrary, that it can be quite enough, and it should not be
spoiled with an SQL engine.

In which case it is a bit silly and you get the worst of both worlds.

