The claims are: serializable ACID cross-machine transactions "without performance penalty." MVCC with optimistic concurrency control.
Optimistic concurrency control means that the server has to check the version on all modified data before the transaction commits. Cross-machine transactions mean that this version check has to happen on multiple machines. ACID means that both machines have to either commit (if all the version checks succeed) or roll back. How are you going to reconcile all of these requirements without resorting to two-phase commit, which most certainly has a performance penalty? And how are you going to get serializable transactions across machines without processing these cross-machine commits one at a time (waiting for two-phase commit each time)?
I'm just not seeing it.
I'll try to address your questions. This isn't a complete explanation of how our system works!
Most people trying to solve this problem have approached it by trying to take multiple machines, each capable of processing transactions, and do distributed transactions between them. This leads to systems that do local transactions fast but global transactions much more slowly, and a tradeoff between the quality of fault tolerance (paxos=good, 2-phase=bad) and commit performance.
Our approach is instead to decouple the first job you mention (checking if transactions conflict to provide isolation) from the second job (making transactions durable). These can then be scaled independently.
The first job, conflict resolution, is a small fraction of the total work and needs access only to the key ranges used by each transaction, not the existing contents of the database or the actual values being written. This allows it to be very fast.
The rest of the system is a consistent, durable distributed store that has to preserve a predetermined ordering of writes, and provide a limited window of MVCC snapshots for reads.
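To make the division of labor concrete, here is a toy sketch in Python (invented names, nothing like our actual implementation) of a resolver that sees only read versions and key ranges, never the data itself: a transaction conflicts if any range it read was written by a transaction that committed after its snapshot.

```python
# Toy sketch of lock-free optimistic conflict resolution. The resolver
# tracks, per committed version, which key ranges were written; a new
# transaction conflicts if any range it READ was WRITTEN by a
# transaction that committed after its read version.

class ConflictResolver:
    def __init__(self):
        self._commits = []   # list of (commit_version, [(begin, end), ...])
        self._next_version = 1

    def try_commit(self, read_version, read_ranges, write_ranges):
        """Return the new commit version, or None on conflict."""
        for version, ranges in self._commits:
            if version <= read_version:
                continue  # committed before our snapshot: no conflict
            for wb, we in ranges:
                for rb, re in read_ranges:
                    if rb < we and wb < re:  # half-open interval overlap
                        return None
        version = self._next_version
        self._next_version += 1
        self._commits.append((version, list(write_ranges)))
        return version

resolver = ConflictResolver()
v1 = resolver.try_commit(0, [("a", "b")], [("a", "b")])   # commits
v2 = resolver.try_commit(0, [("a", "b")], [("x", "y")])   # read overlaps v1's write: fails
v3 = resolver.try_commit(v1, [("a", "b")], [("c", "d")])  # snapshot includes v1: commits
```

Since the resolver touches only small range descriptors and version numbers, this check is a tiny fraction of the work of actually storing the data.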
Does that help at all?
Specific questions: does each machine have its own commit log? Is each machine authoritative for its range of the keyspace, in that it can serve reads without having to consult other machines?
Each machine doesn't have its own commit log at the level of the distributed database. (The key/value store where we locally store data on each node happens to also use write-ahead logging, but this is not an architectural requirement.) Ideally transactions just have to be made durable in some N places to be considered committed. In practice, we elect specific machines to a transaction logging role, among other reasons because SSDs do better at providing fast durable commits if they aren't also serving random reads.
Each machine is authoritative for reads for some ranges of the keyspace and for the range of versions it knows about. It proactively fetches writes for newer versions. If it gets a request for a version it hasn't managed to get the writes for yet, the read has to wait while that storage server catches up. In practice this lag is very small, as you can see from our latency measurements.
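A toy model of that read path (Python, all names invented): a read carries a version, and a storage server that hasn't yet applied writes up to that version makes the read wait rather than return stale data.

```python
# Toy model of versioned reads on a storage server. A read carries a
# version; if the server hasn't applied writes up to that version yet,
# the caller must wait and retry instead of seeing stale data.

class StorageServer:
    def __init__(self):
        self._data = {}           # key -> list of (version, value), ascending
        self._applied_version = 0

    def apply_writes(self, version, writes):
        for key, value in writes.items():
            self._data.setdefault(key, []).append((version, value))
        self._applied_version = version

    def read(self, key, version):
        if version > self._applied_version:
            return None  # server is lagging; caller waits and retries
        # Return the newest value at or below the requested version.
        result = None
        for v, value in self._data.get(key, []):
            if v <= version:
                result = value
        return result

server = StorageServer()
server.apply_writes(10, {"k": "old"})
lagging = server.read("k", 15)        # None: server hasn't caught up yet
server.apply_writes(20, {"k": "new"})
snapshot = server.read("k", 15)       # "old": snapshot read below version 20
latest = server.read("k", 20)         # "new"
```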
Like FoundationDB, datomic does appear to separate transaction processing from storage.
A few relevant differences:
- Datomic provides transaction isolation through stored procedures which run on its single "transactor", while FoundationDB provides interactive transactions.
- To my knowledge datomic does not claim to be interested in high write scalability, which we definitely are.
- Datomic is designed to keep multiversion history indefinitely, while FoundationDB keeps it just long enough to execute short transactions.
As any ACID database must, we choose Consistency over Availability in the CAP theorem sense. This means that when a datacenter is disconnected or poorly connected to the internet, the database is not available for writes in that datacenter. If you are serving a web site to the internet, and need it to stay available if one Amazon AZ fails, this isn't too bad.
Some applications really do need disconnected operation, and thus 'AP' semantics. (For example, a grocery list for a mobile phone, or an application that must be available everywhere if the entire Internet partitions.)
These coordination servers do paxos, but are only needed when there are failures or role changes - they don't participate at all in committing individual transactions. Normally, we only need a single geographic ping time to do a durable transaction if it originates in the current 'primary' datacenter.
For further improved latency, we plan to allow each individual transaction to decide whether it needs multi-datacenter durability. The commit process is the same, but we can notify the client of success earlier if it is willing to take the risk that a WAN partition or meteor strike violates its 'D' guarantee. ACI are guaranteed either way.
Sounds like your coordinators are authoritative masters for transaction ordering. Does that imply a single-machine limit to throughput in a given dc? Presumably the ordering process is much less expensive than the kv store itself, so this might not be a practical issue.
What happens when nodes are asymmetrically partitioned from the coordinator and peers? E.g. a node is unreachable by a peer, but reachable by a coordinator, or vice versa?
Still, I think their performance numbers can be legit, because they are much, much slower than what a handful of machines (far fewer than 24) could do if fully partitioned and without coordination. 500K reads/second on such small values is not really fantastic performance across 24 machines. I also don't understand the initial burst capacity on the read side. On writes it can make sense because some work is deferred, but I'm trying to understand how that can happen on reads as well. My guess is transaction ID allocation on read that hits a wall somewhere.
Another common use case is wanting to take a consistent backup/copy. For large databases that means maintaining a very, very long-lived snapshot (if MVCC) or acquiring a lot of locks (if 2PL). How does the system behave then?
All in all, I think it's pretty neat, and I like that someone is dealing with OLTP database problems with a mind towards easing one's burden via transactions.
As other posters have already guessed, we don't support long-running transactions (we only keep multi-version information for a short time). So if you keep a transaction open for a long time it will not commit (after a while reads will start to fail, too).
It's not architecturally impossible for us to support long-running read snapshots, but it is expensive for our storage servers to keep snapshots alive for a long time. So our backup solution instead works by backing up transaction logs while doing a non-isolated read of the database. At restore time we will replay the logs to get back a consistent point-in-time snapshot.
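The restore idea can be sketched like this (Python, hypothetical names; it assumes the captured logs cover the entire window of the fuzzy copy): replay every log record up to the target version over the inconsistently-read base copy, and the replay overwrites any value the copy captured too early.

```python
# Sketch of the restore path: combine a non-isolated ("fuzzy") copy of
# the database with the transaction logs captured while copying.
# Assumes the log covers the whole copy window.

def restore(fuzzy_copy, log, target_version):
    """fuzzy_copy: key -> (version, value), captured inconsistently.
    log: list of (version, key, value) records, ascending by version.
    Returns a consistent snapshot at target_version."""
    db = {key: value for key, (version, value) in fuzzy_copy.items()}
    for version, key, value in log:
        if version <= target_version:
            db[key] = value  # replay overwrites any too-old copied value
    return db

# The copy saw k1 at version 5 but k2 at version 7 -- inconsistent.
copy = {"k1": (5, "a"), "k2": (7, "y")}
log = [(6, "k1", "b"), (7, "k2", "y"), (8, "k1", "c")]
snapshot = restore(copy, log, target_version=7)
# snapshot is consistent as of version 7
```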
The reason that we see higher burst than steady state performance has nothing to do with transactions. We have to do reads from SSD as the application requests them, and we have to make writes durable by logging them right away, but as you surmise we can defer the hard work of doing random writes to our btrees for a while. Even a workload that is 90% reads benefits a lot from deferring writes because writes have to be 3x replicated and because consumer SSDs are comparatively slow at mixed read/write workloads. Read only workloads will not see any initial burst (and might see a ramp-up time from cold cache effects).
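The "defer the hard work" idea looks roughly like this (a toy Python illustration, not our code): commits append to a fast sequential log immediately, while the expensive random writes to the main structure are applied in batches later. Burst throughput is high until the backlog forces the background work to keep pace.

```python
# Illustration of "log now, apply later". Commits append to a cheap
# sequential log right away (that's the durability point); the slow
# random updates to the main store happen in deferred batches.

class DeferredStore:
    def __init__(self):
        self.log = []        # durable sequential log (fast append)
        self.store = {}      # main structure, updated lazily
        self._applied = 0    # index of next log record to apply

    def commit(self, key, value):
        self.log.append((key, value))   # durable once this returns

    def apply_batch(self, max_records=1000):
        batch = self.log[self._applied:self._applied + max_records]
        for key, value in batch:
            self.store[key] = value     # the deferred random writes
        self._applied += len(batch)

    def read(self, key):
        # Reads must see committed-but-unapplied writes: scan the log tail.
        for k, v in reversed(self.log[self._applied:]):
            if k == key:
                return v
        return self.store.get(key)

s = DeferredStore()
s.commit("k", 1)
before = s.read("k")   # visible even before the background apply
s.apply_batch()
after = s.read("k")
```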
Sure, it's all technically possible (Postgres will simply stop collecting garbage via VACUUM, for example), but the results are not usually very pleasant. But it sounds like your solution to the snapshot read is instead more like how online binary backups work, whereby you combine transaction logs with inconsistent reads. That system works very well. "Logical" backups, on the other hand, have been a major problem for me; sidestepping them by using lower-level transaction log mechanics and dirty reads is, I think, a much better idea.
> The reason that we see higher burst than steady state performance has nothing to do with transactions. We have to do reads from SSD as the application requests them, and we have to make writes durable by logging them right away, but as you surmise we can defer the hard work of doing random writes to our btrees for a while. Even a workload that is 90% reads benefits a lot from deferring writes because writes have to be 3x replicated and because consumer SSDs are comparatively slow at mixed read/write workloads. Read only workloads will not see any initial burst (and might see a ramp-up time from cold cache effects).
I see. It's not read-only, it's 90/10; I misread. That makes more sense. What's your read-only performance? Presumably it would be impacted by requiring fencing to deal with network partitions, which is one of the problems with non-partitioned, consistent systems as far as I can tell: reads need to check for liveness, and that's much harder than doing nothing.
Read requests come to a storage server with a specific version number attached, so the storage server can reply authoritatively without any further communications. Starting a transaction requires selecting a version number for the read snapshot, and that incurs some latency to deal with the concerns you mention (but scales fine).
They say they use optimistic concurrency control, so your writers aren't taking any locks, instead the transaction will just fail when you commit if any of your read/modify/write's were modified in the meantime.
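In code, the client-side pattern under optimistic concurrency looks something like this (a self-contained toy in Python; the API names are made up, not FoundationDB's): no locks are taken, commit fails if anything the transaction read was modified after its snapshot, and the client retries the whole transaction.

```python
# Hypothetical sketch of optimistic concurrency from the client's view:
# writers take no locks; commit raises if a key the transaction read
# was modified after its snapshot, and the client retries from scratch.

class ConflictError(Exception):
    pass

class MiniOCCDatabase:
    def __init__(self):
        self.data = {}       # key -> value
        self.versions = {}   # key -> version of last write
        self.version = 0

    def begin(self):
        return Transaction(self)

class Transaction:
    def __init__(self, db):
        self.db = db
        self.read_version = db.version   # snapshot version
        self.reads = set()
        self.writes = {}

    def get(self, key):
        self.reads.add(key)
        return self.writes.get(key, self.db.data.get(key))

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        for key in self.reads:
            if self.db.versions.get(key, 0) > self.read_version:
                raise ConflictError(key)  # lost a read-modify-write race
        self.db.version += 1
        for key, value in self.writes.items():
            self.db.data[key] = value
            self.db.versions[key] = self.db.version

def run_transaction(db, body, max_retries=10):
    """The standard retry loop for using an optimistic transaction."""
    for _ in range(max_retries):
        txn = db.begin()
        result = body(txn)
        try:
            txn.commit()
            return result
        except ConflictError:
            continue  # another writer won the race; retry from scratch
    raise RuntimeError("too much contention")

db = MiniOCCDatabase()
run_transaction(db, lambda t: t.set("counter", (t.get("counter") or 0) + 1))
run_transaction(db, lambda t: t.set("counter", (t.get("counter") or 0) + 1))
```

The retry loop is the important part: under contention a transaction can fail and be re-run, which is why commit latency matters so much for this model.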
> Coming soon. An integrated backup system provides a true "moment-in-time" snapshot backup of the entire distributed database stored to a remote file system on a schedule.
Yes, we provide the strongest level of ACID semantics. Although proving things about large computer programs is pretty hard, we have spent much of the past three years building testing and validation systems to ensure this is true. We run tens of thousands of nightly simulations to check system correctness and properties in the face of machine failures, partitions, etc. We also run long-running tests on real-world clusters, using programmable network and power switches to simulate these same cases in the real world.
So, we've convinced ourselves. What would you like to see on the site to help provide the kind of incredible evidence you're looking for?
OTOH, if you put all your "documents" in 3rd normal form, you might not see much gain.
ACID represents one design philosophy at the consistency end of the consistency-availability spectrum. According to CAP, to get the maximum Consistency of ACID you have to trade off Availability and/or Partition tolerance.
The real upshot of the CAP theorem is that you have to choose what goes in the face of a partition. An ACID system says you lose availability (writes, and possibly reads, fail if you can't reach a majority of the participants); an Eventually Consistent system may pick either.
From Abadi's blog:
Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture.
The comments on that post are pretty interesting, too.
And from the foundationdb features page:
FoundationDB transactions are true interactive sessions, unlike distributed databases that require stored procedures. This means that client code can make an iterative series of reads and writes over the network to execute complex transactions.
Think of it as LevelDB with a distributed B+ tree (or even just a few extra levels) handling the partitioning between nodes. That can scale quite well, and it can wrap updates and reads in snapshots at very low overhead, providing all the key pieces needed for ACID in a distributed database.
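That partitioning layer can be sketched as little more than a sorted list of range boundaries mapping each key to the node that owns it (toy Python, invented names):

```python
# Toy sketch of range partitioning: a sorted list of boundaries routes
# each key to the node owning that range, like one extra level of a
# distributed B+ tree.

import bisect

class Partitioner:
    def __init__(self, boundaries, nodes):
        # boundaries[i] is the first key owned by nodes[i+1];
        # keys below boundaries[0] belong to nodes[0].
        assert len(nodes) == len(boundaries) + 1
        self.boundaries = boundaries
        self.nodes = nodes

    def node_for(self, key):
        return self.nodes[bisect.bisect_right(self.boundaries, key)]

p = Partitioner(["g", "p"], ["node0", "node1", "node2"])
# keys before "g" go to node0, "g".."p" to node1, the rest to node2
```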
The usual way to do it is to refuse the transaction if a conflict is detected; however, this is a real performance and usability issue for a NoSQL database.
What I'm implying is that if you have ACID transactions but they can fail very easily, you don't offer much...
If the database is fast enough and transaction failure is at least moderately unlikely, it's not really an issue.
Having a unique visual identity is extremely important.
But launching an entirely new database does deserve its own visual identity. That said, their logo is rather nice!
This is an unsolved problem to me; I submit there is fine print regarding ACID or scalability...
Be prepared for sticker shock.
BigCouch and Couchbase have been around for a while. There is also CouchDB (without the scaling).
We plan to write something longer talking about the different levels of support that various products provide for A,C,I, and D.
If you see it that way, you are right. The guarantee is just for one key/document pair (the key can be a key vector, though). There is no way to commit across several key/document pairs with these databases.
If FoundationDB can do that, it is a big plus. If it can do it in a cluster, your product will have a great future.