
FoundationDB: A new generation of NoSQL database - shansinha79
http://foundationdb.com/
======
haberman
This is a beautiful site with tons of information, graphs, lists, examples,
etc. and yet omits the one thing that underpins it all: how it actually works.

The claims are: serializable ACID cross-machine transactions "without
performance penalty." MVCC with optimistic concurrency control.

Optimistic concurrency control means that the server has to check the version
on all modified data before the transaction commits. Cross-machine
transactions mean that this version check has to happen on multiple machines.
ACID means that both machines have to either commit (if all the version checks
succeed) or roll back. How are you going to reconcile all of these
requirements without resorting to two-phase commit, which most certainly has a
performance penalty? And how are you going to get serializable transactions
across machines without processing these cross-machine commits one at a time
(waiting for two-phase commit each time)?
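To pin down the mechanism in question, here is a toy single-machine model of the optimistic version check being described (all names invented; purely illustrative):

```python
# Sketch of optimistic concurrency control: each key carries a version,
# and a commit succeeds only if no key the transaction read has changed
# since it was read.
class Store:
    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> version number

    def read(self, key):
        return self.data.get(key), self.versions.get(key, 0)

    def try_commit(self, read_set, writes):
        # read_set: {key: version observed at read time}
        # Fail if any key was modified after we read it.
        for key, seen_version in read_set.items():
            if self.versions.get(key, 0) != seen_version:
                return False  # conflict: caller must retry
        for key, value in writes.items():
            self.data[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True
```

The hard part, per the question above, is making `try_commit` succeed or fail atomically when the read and write sets live on two different machines.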

I'm just not seeing it.

~~~
voidmain
FoundationDB co-founder here.

I'll try to address your questions. This isn't a complete explanation of how
our system works!

Most people trying to solve this problem have approached it by trying to take
multiple machines, each capable of processing transactions, and do distributed
transactions between them. This leads to systems that do local transactions
fast but global transactions much more slowly, and a tradeoff between the
quality of fault tolerance (paxos=good, 2-phase=bad) and commit performance.

Our approach is instead to decouple the first job you mention (checking if
transactions conflict to provide isolation) from the second job (making
transactions durable). These can then be scaled independently.

The first job, conflict resolution, is a small fraction of the total work and
needs access only to the key ranges used by each transaction, not the existing
contents of the database or the actual values being written. This allows it to
be very fast.

The rest of the system is a consistent, durable distributed store that has to
preserve a predetermined ordering of writes, and provide a limited window of
MVCC snapshots for reads.
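A toy reconstruction of that conflict-resolution step, based only on the description above (not FoundationDB code): the resolver sees key ranges and versions, never the actual values.

```python
# Sketch: a resolver that orders transactions globally and passes or
# fails each one by intersecting its read ranges with the write ranges
# of transactions that committed after its read version. All names are
# invented for illustration.
class Resolver:
    def __init__(self):
        self.version = 0
        self.recent_writes = []  # list of (commit_version, begin, end)

    def resolve(self, read_version, read_ranges, write_ranges):
        # A transaction conflicts if any range it read was written to
        # by a transaction that committed after it started reading.
        for commit_version, begin, end in self.recent_writes:
            if commit_version <= read_version:
                continue
            for r_begin, r_end in read_ranges:
                if r_begin < end and begin < r_end:  # ranges overlap
                    return None  # fail: caller retries
        # Pass: assign the next position in the global order.
        self.version += 1
        for begin, end in write_ranges:
            self.recent_writes.append((self.version, begin, end))
        return self.version
```

Note that only key ranges are consulted, never values or existing database contents, which is why this step can stay small and fast.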

Does that help at all?

~~~
haberman
I still don't see it. Say you have conflicting transactions T1 and T2 that
both update data on two machines M1 and M2. If these transactions race to
commit, how do you guarantee consistency? Even if you have a perfect, zero-
latency oracle that can tell you that T1 and T2 conflict, you still need M1
and M2 to form consensus about which transaction commits first, and to make
sure that both machines either commit or roll back (I am assuming that both
machines have their own transaction log). It still sounds to me like 2PC is
required.

Specific questions: does each machine have its own commit log? Is each machine
authoritative for its range of the keyspace, in that it can serve reads
without having to consult other machines?

~~~
voidmain
The conflict resolution service assigns a global ordering to transactions as
well as a pass/fail verdict. Transactions that fail don't do any writes.
Transactions that pass still aren't durable: until they get on disk in a few
places, they could be rolled back by a subsequent failure.

Each machine doesn't have its own commit log at the level of the distributed
database. (The key/value store where we locally store data on each node
happens to also use write ahead logging, but this is not an architectural
requirement). Ideally transactions just have to be made durable in some N
places to be considered committed. In practice, we elect specific machines to
a transaction logging role, among other reasons because SSDs do better at
providing fast durable commits if they aren't also serving random reads.

Each machine is authoritative for reads for some ranges of the keyspace and
for the range of versions it knows about. It proactively fetches writes for
newer versions. If it gets a request for a version it hasn't managed to get
the writes for yet, the read has to wait while that storage server catches up.
In practice this lag is very small, as you can see from our latency
measurements.
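The versioned-read behavior described here can be modeled roughly like this (an illustrative sketch, not the actual implementation):

```python
# Sketch of a versioned storage server: it serves a read at version v
# only once it has applied all writes up to v; otherwise the read must
# wait while the server catches up.
class StorageServer:
    def __init__(self):
        self.applied_version = 0
        self.history = {}  # key -> list of (version, value), ascending

    def apply_writes(self, version, writes):
        for key, value in writes.items():
            self.history.setdefault(key, []).append((version, value))
        self.applied_version = version

    def read(self, key, at_version):
        if at_version > self.applied_version:
            raise BlockingIOError("catching up; retry shortly")
        # Return the newest value at or below the requested version,
        # giving each reader a consistent MVCC snapshot.
        value = None
        for version, v in self.history.get(key, []):
            if version <= at_version:
                value = v
        return value
```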

------
fdr
Hmm. Okay. I open a transaction for three days (let's say indefinitely), maybe
reading a few tuples here and there and updating a few, but never committing.
How does this system not fall over? I guess it could abort my transaction...

Still, I think their performance numbers can be legit, because they are much,
much slower than what a handful (much less than 24) machines could do if fully
partitioned and without coordination. 500KRead/Second on such small values is
not really fantastic performance over 24 machines. I also don't understand
the initial burst capacity on the read side, although I can guess -- on
writes it can make sense because some work is deferred, but I'm trying to
understand how that can happen on reads as well. My guess is a transaction
ID allocation on read that hits a wall somewhere.

Another common use case is I want to take a consistent backup/copy. For large
databases, this is a very, very long snapshot to maintain, if MVCC, or a lot
of locks to acquire, if 2PL. How does the system act then?

All in all, I think it's pretty neat, and I like that someone is dealing with
OLTP database problems with a mind towards easing one's burden via
transactions.

~~~
Andys
Re your first point - this is a problem for PostgreSQL also.

~~~
fdr
Sure it is. But I would say, more to the point, it's a problem for systems
that support consistent snapshots. The question is: how does it deal with the
failure condition?

------
blackhole
This very badly needs proof that it is, in fact, ACID compliant. Incredible
claims require incredible evidence.

~~~
Dave_Rosenthal
FoundationDB co-founder here.

Yes, we provide the strongest level of ACID semantics. Although proving things
about large computer programs is pretty hard, we have spent much of the past
three years building testing and validation systems to ensure this is true.
We run tens-of-thousands of nightly simulations to check system correctness
and properties in the face of machine failures, partitions, etc. We also run
long-running tests on real-world clusters using programmable network and power
switches to simulate these same cases in the real world.

So, we've convinced ourselves. What would you like to see on the site to help
provide the kind of incredible evidence you're looking for?

------
cbsmith
Honestly... if you have figured out how to scale with full ACID... I have a
hard time believing that having SQL support in there is going to "kill" your
ability to scale.

~~~
Roboprog
I'm guessing that NoSQL documents are written at a coarser granularity than
SQL table rows. One document could well represent 3, 4, or 5 rows in a
normalized SQL database.
normalized SQL database.

OTOH, if you put all your "documents" in 3rd normal form, you might not see
much gain.

~~~
cbsmith
The overhead with transactions doesn't come from the fields being in different
tables...

~~~
voidmain
You are right. Our API is strong enough that you could implement a SQL
database on top of it efficiently. I have done it (using sqlite4 as the front
end) as a proof of concept.

------
damian2000
I recommend having a read of this article explaining the differences between
CAP and ACID: [http://www.infoq.com/articles/cap-twelve-years-later-how-the...](http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed)

TL;DR

ACID represents one design philosophy at the consistency end of the
consistency-availability spectrum. According to CAP, to have the maximum
Consistency of ACID, you'd need to trade off against lower Availability
and/or Partitionability.

~~~
jynnan
"Trading off against partitionability" is kind of a strange concept -- not only
can you not prevent network partitions, you usually can't tell when they've
happened until it's too late.

The real upshot of the CAP theorem is that you have to choose what goes in the
face of a partition. An ACID system says you lose availability (writes, and
possibly reads, fail if you can't reach a majority of the participants); an
Eventually Consistent system may pick either.

------
aaronblohowiak
Explain how you can have availability in the face of network partitions and
a perfect total ordering of events (i.e., the sequentiality you claim).

~~~
DanWaterworth
You can use Paxos. There will be some partitions that can cause an outage, but
this is only the case when there is no majority of nodes that can communicate
with each other, which in practice means you are exposing yourself to a
vanishingly small risk.

~~~
aaronblohowiak
3 rounds of communication to reach consensus for every transaction? Off the
cuff, it seems like that would hinder performance. I haven't tested it,
though...

~~~
DanWaterworth
It doesn't have to be per transaction, it could be for a batch of
transactions. There are also optimizations like multi-paxos that can reduce
the number of round trips.
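The batching idea can be sketched as follows, with `run_consensus` standing in for a full Paxos round (all names hypothetical):

```python
# Sketch: buffer transactions and pay one consensus round per batch
# rather than one per transaction, amortizing the round-trip cost.
class Batcher:
    def __init__(self, run_consensus, max_batch=100):
        self.run_consensus = run_consensus
        self.max_batch = max_batch
        self.pending = []

    def submit(self, txn):
        self.pending.append(txn)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.pending:
            # One round of consensus commits the whole batch.
            self.run_consensus(list(self.pending))
            self.pending.clear()
```

With a batch size of 100, a thousand transactions cost ten consensus rounds instead of a thousand.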

------
tagx
I worked under two of these guys two summers ago. They're really sharp guys
and I have faith they can back up their claims.

------
cbsmith
Looking at what they are providing, it's a shared nothing architecture, with
data stored in ordered form. Technically, you can make this work in a very
scalable fashion with an SSTable style store.

Think of it as LevelDB with a distributed B+ tree (or even just a few extra
levels) handling the partitioning between nodes. That can scale quite well,
and wrap updates and reads with snapshots at very low overhead to provide all
the key bits to handle ACID in a distributed database.
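The kind of range directory being described might look like this toy sketch (illustrative only; `bisect` keeps the routing lookup logarithmic, like a shallow B+ tree level):

```python
import bisect

# Sketch of the partitioning layer: an ordered keyspace split at
# boundary keys, with a small directory mapping each range to the
# node that owns it. Names are invented for illustration.
class RangeDirectory:
    def __init__(self, boundaries, nodes):
        # boundaries: sorted split keys; nodes: len(boundaries)+1 owners
        self.boundaries = boundaries
        self.nodes = nodes

    def node_for(self, key):
        # Keys below the first boundary route to nodes[0], and so on.
        return self.nodes[bisect.bisect_right(self.boundaries, key)]
```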

~~~
shin_lao
What's problematic is when you have a transaction that spans several nodes
and you need to reconcile several transactions (commit phase).

The usual way to do it is to refuse the transaction if a conflict is
detected, but this is a real performance and usability issue for a NoSQL
database.

What I'm implying is that if you have ACID transactions but they can fail very
easily, you don't offer much...

~~~
jynnan
It's not that hard to retry a transaction until some deadline is reached. Many
programming languages have standard library facilities to make this extremely
easy.

If the database is fast enough and transaction failure is at least moderately
unlikely, it's not really an issue.
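The retry-until-deadline pattern looks roughly like this; `db.transact` and `ConflictError` are hypothetical stand-ins, not any real client API:

```python
import time

class ConflictError(Exception):
    """Raised when an optimistic transaction fails its conflict check."""

def run_with_retry(db, txn_fn, deadline_seconds=5.0):
    # Rerun the transaction until it commits or the deadline passes.
    deadline = time.monotonic() + deadline_seconds
    while True:
        try:
            return db.transact(txn_fn)
        except ConflictError:
            if time.monotonic() >= deadline:
                raise  # give up: surface the conflict to the caller
            # Conflicts are expected under optimistic concurrency;
            # retrying usually succeeds on a fast database.
```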

------
ericcholis
Perhaps I'm being a designer-snob, but Twitter's Bootstrap has quickly become
too popular. I honestly have a hard time taking a site seriously when it uses
the default Bootstrap look.

Having a unique visual identity is extremely important.

~~~
Wilya
Honestly, if someone wants to sell me a database, I'd prefer they spend money
on database engineers rather than on designers. To each his own, I guess.

------
jhull
Is it just me or is there a new "superior-than-the-others" NoSQL solution
every week?

~~~
bsg75
Today's database is better than yesterday's.

------
shin_lao
I'd be surprised if they can be truly ACID and truly scale out.

This is an unsolved problem as far as I know; I suspect there is fine print
regarding ACID or scalability...

~~~
wmf
Jim Gray solved it in the 80s, but then we forgot about it.

~~~
shin_lao
To which of Jim's contributions are you referring? I wasn't aware he
designed a scale-out ACID transaction principle.

~~~
wmf
I was thinking of [http://research.microsoft.com/en-us/um/people/gray/papers/TE...](http://research.microsoft.com/en-us/um/people/gray/papers/TEs.pdf)
although in retrospect I'm not sure if they implemented cross-node
transactions.

~~~
shin_lao
To my knowledge, ACID transactions and scale-out architectures are mutually
exclusive, unless I missed a major breakthrough.

~~~
wmf
Two-phase commit?

------
jacques_chester
Without wanting to be any snarkier than usual, all-the-scale-without-the-
compromise has been possible for decades. It's just really expensive.

~~~
olalonde
How?

~~~
d0ugal
noSQL and YesACID. It's that easy.

------
jynnan
Your performance tests seem to have been run on a cluster that lives on one
gigabit LAN. Have you tested a cluster that's distributed geographically?

------
js4all
The claim to be the first acid compliant NoSQL database is wrong.

Bigcouch and Couchbase are around for a while. There is also CouchDB (without
the scaling).

~~~
voidmain
I was nervous writing that sentence, because it's hard to be sure of the truth
of any claim to be 'first'! But none of the examples you mention, to my
knowledge, provide _multi-key_ ACID transactions. A compare and set facility
for an individual key/document/row/entity group/etc is a very useful feature,
but cannot be used to provide atomicity or isolation for transactions that
read or write more than one key.
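A small illustration of the gap being described (purely a sketch; the dict stands in for a document store with per-key compare-and-set only):

```python
# Why per-key compare-and-set isn't enough: a transfer touches two
# keys, and updating each one separately leaves a window in which a
# reader sees a state no serial order of transactions would produce.
def transfer_with_cas_only(store, src, dst, amount):
    old_src = store[src]
    store[src] = old_src - amount       # CAS on the source key alone
    snapshot = dict(store)              # a concurrent reader here...
    store[dst] = store[dst] + amount    # ...misses this second step
    return snapshot

store = {"alice": 10, "bob": 0}
seen = transfer_with_cas_only(store, "alice", "bob", 5)
# seen totals 5, not 10: the reader caught an inconsistent state,
# which multi-key atomicity and isolation would have prevented.
```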

We plan to write something longer talking about the different levels of
support that various products provide for A,C,I, and D.

~~~
js4all
> But none of the examples you mention, to my knowledge, provide multi-key
> ACID transactions

If you see it that way, you are right. The guarantee is just for one
key/document pair (the key can be a key vector, though). There is no way to
commit over several key/document pairs with these databases.

If FoundationDB can do that, it is a big plus. If it can do it in a cluster,
your product will have a great future.

------
d0ugal
What exactly is "YesACID" ?

------
wmf
Hey guys, I like units on the Y axis of graphs.

~~~
Dave_Rosenthal
So do I. Fixed :)

------
christianaranda
I don't need to be convinced. No doubt this works as advertised. Nicely done.

~~~
d0ugal
Switching all my production code to use their alpha as we speak.

