Hacker News new | comments | show | ask | jobs | submit login

FoundationDB co-founder here.

I'll try to address your questions. This isn't a complete explanation of how our system works!

Most people trying to solve this problem have approached it by trying to take multiple machines, each capable of processing transactions, and do distributed transactions between them. This leads to systems that do local transactions fast but global transactions much more slowly, and a tradeoff between the quality of fault tolerance (paxos=good, 2-phase=bad) and commit performance.

Our approach is instead to decouple the first job you mention (checking if transactions conflict to provide isolation) from the second job (making transactions durable). These can then be scaled independently.

The first job, conflict resolution, is a small fraction of the total work and needs access only to the key ranges used by each transaction, not the existing contents of the database or the actual values being written. This allows it to be very fast.

The rest of the system is a consistent, durable distributed store that has to preserve a predetermined ordering of writes, and provide a limited window of MVCC snapshots for reads.

Does that help at all?

I still don't see it. Say you have conflicting transactions T1 and T2 that both update data on two machines M1 and M2. If these transactions race to commit, how do you guarantee consistency? Even if you have a perfect, zero-latency oracle that can tell you that T1 and T2 conflict, you still need M1 and M2 to form consensus about which transaction commits first, and to make sure that both machines either commit or roll back (I am assuming that both machines have their own transaction log). It still sounds to me like 2PC is required.

Specific questions: does each machine have its own commit log? Is each machine authoritative for its range of the keyspace, in that it can serve reads without having to consult other machines?

The conflict resolution service assigns a global ordering to transactions as well as pass/fail. Transactions that fail don't do any writes. Transactions that pass still aren't durable, they could be rolled back by a subsequent failure until they get on disk in a few places.

Each machine doesn't have its own commit log at the level of the distributed database. (The key/value store where we locally store data on each node happens to also use write ahead logging, but this is not an architectural requirement). Ideally transactions just have to be made durable in some N places to be considered committed. In practice, we elect specific machines to a transaction logging role, among other reasons because SSDs do better at providing fast durable commits if they aren't also serving random reads.

Each machine is authoritative for reads for some ranges of the keyspace and for the range of versions it knows about. It proactively fetches writes for newer versions. If it gets a request for a version it hasn't managed to get the writes for yet, the read has to wait while that storage server catches up. In practice this lag is very small, as you can see from our latency measurements.

That sounds alot like partialy atleast, the same model datomic has?

I can only speculate, because I have never used datomic.

Like FoundationDB, datomic does appear to separate transaction processing from storage.

A few relevant differences:

- Datomic provides transaction isolation through stored procedures which run on its single "transactor", while FoundationDB provides interactive transactions.

- To my knowledge datomic does not claim to be interested in high write scalability, which we definitely are.

- Datomic is designed to keep multiversion history indefinitely, while FoundationDB keeps it just long enough to execute short transactions.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact