I'll try to address your questions. This isn't a complete explanation of how our system works!
Most people trying to solve this problem have approached it by taking multiple machines, each capable of processing transactions, and running distributed transactions between them. This leads to systems that do local transactions quickly but global transactions much more slowly, and to a tradeoff between the quality of fault tolerance (Paxos = good, two-phase commit = bad) and commit performance.
Our approach is instead to decouple the first job you mention (checking if transactions conflict to provide isolation) from the second job (making transactions durable). These can then be scaled independently.
The first job, conflict resolution, is a small fraction of the total work and needs access only to the key ranges used by each transaction, not the existing contents of the database or the actual values being written. This allows it to be very fast.
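To make that concrete, here's a toy sketch of range-based conflict resolution, in the spirit of optimistic concurrency control. This is not our actual code; the names (`Resolver`, `try_commit`) are made up, and a real resolver would use an interval tree rather than a flat list. The point is just that the resolver only sees key ranges and versions, never values or database contents:

```python
# Toy resolver: tracks the last version at which each key range was
# written, and rejects a transaction whose reads overlap writes that
# committed after its read version.
class Resolver:
    def __init__(self):
        self.version = 0
        # List of (begin, end, last_write_version). Illustrative only;
        # a real resolver would index these ranges efficiently.
        self.recent_writes = []

    def try_commit(self, read_version, read_ranges, write_ranges):
        # Conflict: a key range this transaction read was written by a
        # transaction that committed after our read version.
        for rb, re in read_ranges:
            for wb, we, v in self.recent_writes:
                if v > read_version and rb < we and wb < re:
                    return None  # conflict: client must retry
        self.version += 1
        for wb, we in write_ranges:
            self.recent_writes.append((wb, we, self.version))
        return self.version  # commit version
```

For example, a transaction that read `["a", "b")` at version 0 conflicts with any write to that range committed at version 1 or later, while a non-overlapping transaction sails through.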
The rest of the system is a consistent, durable distributed store that has to preserve a predetermined ordering of writes, and provide a limited window of MVCC snapshots for reads.
Does that help at all?
Specific questions: does each machine have its own commit log? Is each machine authoritative for its range of the keyspace, in that it can serve reads without having to consult other machines?
Each machine doesn't have its own commit log at the level of the distributed database. (The key/value store where we locally store data on each node happens to also use write-ahead logging, but this is not an architectural requirement.) Ideally, transactions just have to be made durable in some N places to be considered committed. In practice, we elect specific machines to a transaction-logging role, among other reasons because SSDs do better at providing fast durable commits when they aren't also serving random reads.
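The "durable in N places" rule can be sketched like this. Again a hypothetical illustration, not our implementation: `LogServer` and `commit` are invented names, and a real log server would fsync before acknowledging:

```python
# A commit is acknowledged only once n log servers report the
# mutations are durable.
class LogServer:
    def __init__(self):
        self.log = []

    def append(self, version, mutations):
        self.log.append((version, mutations))  # would fsync here
        return True  # ack

def commit(log_servers, version, mutations, n):
    acks = sum(1 for s in log_servers if s.append(version, mutations))
    return acks >= n  # committed once n servers have acked
```

Note that nothing here requires the machines serving reads to be the ones holding the logs, which is what lets us dedicate SSDs to fast durable appends.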
Each machine is authoritative for reads for some ranges of the keyspace and for the range of versions it knows about. It proactively fetches writes for newer versions. If it gets a request for a version it hasn't managed to get the writes for yet, the read has to wait while that storage server catches up. In practice this lag is very small, as you can see from our latency measurements.
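A minimal sketch of that read path, assuming invented names (`StorageServer`, `apply`, `read`): the server is authoritative only up through the version it has applied, so a read at a newer version blocks until the server catches up, and an MVCC read returns the newest value at or below the requested version:

```python
import threading

class StorageServer:
    def __init__(self):
        self.applied_version = 0
        self.data = {}  # key -> list of (version, value), newest last
        self.cv = threading.Condition()

    def apply(self, version, writes):
        # Writes arrive in a predetermined version order.
        with self.cv:
            for k, v in writes.items():
                self.data.setdefault(k, []).append((version, v))
            self.applied_version = version
            self.cv.notify_all()

    def read(self, key, at_version, timeout=1.0):
        with self.cv:
            # Wait until we've applied writes through at_version.
            if not self.cv.wait_for(
                    lambda: self.applied_version >= at_version, timeout):
                raise TimeoutError("storage server lagging")
            # MVCC: newest value at or below the requested version.
            for v, val in reversed(self.data.get(key, [])):
                if v <= at_version:
                    return val
            return None
```

In the real system the wait is normally negligible because storage servers fetch new versions proactively; the timeout is just there so a badly lagging server fails the read rather than stalling it forever.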
Like FoundationDB, Datomic does appear to separate transaction processing from storage.
A few relevant differences:
- Datomic provides transaction isolation through stored procedures which run on its single "transactor", while FoundationDB provides interactive transactions.
- To my knowledge, Datomic does not claim to be interested in high write scalability, which we definitely are.
- Datomic is designed to keep multiversion history indefinitely, while FoundationDB keeps it just long enough to execute short transactions.