
A history of the distributed transactional database - evanweaver
https://www.infoq.com/articles/relational-nosql-fauna
======
zzzcpan
> This is not at all a theoretical problem. Deliberately inducing withdrawal
> races at ATMs is a widely reported type of fraud. Thus, databases need
> “external consistency”, colloquially known as ACID.

In all those fraud cases the banks were actually running ACID databases, as
they always do, because ACID has nothing to do with "external consistency"
unless you can run it in a human brain, which you can't. Interactions with
humans can only be eventually consistent. Nobody even bothers to extend ACID
all the way to the web browser or the APIs, so it is limited to the
application and goes no further, with all the fraud and double-spending
problems remaining unsolved.
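The double spending in question is easy to demonstrate. Here is a toy sketch
(two in-memory "replicas" with purely illustrative names, not any real bank's
design) of withdrawals accepted without coordination and then merged
eventually:

```python
class Replica:
    def __init__(self, balance):
        self.balance = balance
        self.log = []

    def withdraw(self, amount):
        # Local check only: no coordination with the other replica.
        if self.balance >= amount:
            self.balance -= amount
            self.log.append(-amount)
            return True
        return False

def converge(replicas, initial):
    # Naive convergence: replay every replica's accepted writes.
    merged = initial + sum(sum(r.log) for r in replicas)
    for r in replicas:
        r.balance = merged
    return merged

a, b = Replica(100), Replica(100)            # both start from the same snapshot
assert a.withdraw(100) and b.withdraw(100)   # both ATMs approve locally
print(converge([a, b], 100))                 # -100: the account is overdrawn
```

A serializable transaction would have forced the second withdrawal to observe
the first; eventual consistency only guarantees the replicas agree on the
(overdrawn) outcome.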

Overall, this is the usual FaunaDB attack on other systems, especially AP
ones. The strong eventual consistency of AP systems is actually much
stronger than ACID and much, much easier to verify, and AP is where the
cutting edge of distributed-systems research happens. More importantly, SEC
can offer the best theoretically possible performance and latency for
multi-datacenter deployments, while distributed ACID transactions have not
yet shown viability in multi-datacenter deployments, forcing vendors to
degrade to weak consistency models in such setups.

~~~
freels
I don't think it's correct to pit ACID against AP systems this way.
Consistency is orthogonal: ACID is a model for reasoning about transaction
guarantees in terms of the anomalies allowed at each isolation level.
Serializability in ACID implies CP over AP, but weaker levels don't have the
same requirement. It still comes down to tradeoffs for the specific use case
of a database system.

Not sure what you're getting at when you say ACID transactions don't work in
a multi-DC context. FaunaDB aside, there are many production systems
(commercial or internal) that I am aware of that provide them.

~~~
espadrine
> _Not sure what you're getting at when you say ACID transactions don't work
> in a multi-DC context._

I expect they are talking about DC as _regions_, not DC as _availability
zones_ (in the AWS parlance).

Planetary databases hit latency issues on cross-region ACID transactions.
Whatever the sharding, a transaction modifying one piece whose leader is in
Paris and another in Beijing will have at least the round-trip time between
the two.
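To put rough numbers on that floor (the distance and fiber speed below are
approximations, not measurements):

```python
C_FIBER_KM_S = 200_000      # light in fiber travels at roughly 2/3 of c
PARIS_BEIJING_KM = 8_200    # approximate great-circle distance

one_way_s = PARIS_BEIJING_KM / C_FIBER_KM_S
rtt_ms = 2 * one_way_s * 1000
print(f"minimum round trip: {rtt_ms:.0f} ms")  # ~82 ms before any protocol overhead
```

A commit protocol needs at least one such round trip, often more, so no
amount of engineering gets a cross-region transaction under that bound.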

It is already unforgiving at that scale, and it will become untenable at
interplanetary scale, requiring some form of eventual consistency.

However, I still expect that we will retain ACID operations for local changes.
Humans tolerate latencies that are necessary for planetary ACID transactions.
The combination of the size of the Earth and the speed of light is a lucky
draw.

~~~
freels
Yes, I meant regions as well. Speed-of-light-bounded latency certainly means
that serializable ACID transactions are not workable in some situations, but
I disagree with OP that this proves them unviable, period, at least at
planet scale as you say.

We're indeed pretty lucky as far as transactions are concerned...

------
evanweaver
For further reading, Dan Abadi's post at
[http://dbmsmusings.blogspot.com/2018/09/newsql-database-syst...](http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html)
addresses the Spanner issues more directly.

I don't think there's a great reference for the Percolator transaction
architecture specifically, but there is a brief description here:
[https://blog.octo.com/en/my-reading-of-percolator-architectu...](https://blog.octo.com/en/my-reading-of-percolator-architecture-a-google-search-engine-component/)

One thing I learned recently is that FoundationDB is essentially the
Percolator model (timestamp oracle), even though FoundationDB predated the
Percolator paper.

~~~
williamallthing
What consistency model does FDB provide?

~~~
evanweaver
In my understanding, it can provide strict serializability.

The main difference from actual Percolator is that FoundationDB manages range
locks through a logically separate pool of in-memory lock caches, instead of
writing durable locks to the replicas, but the end result is the same:
external consistency for read/write transactions within a single datacenter
at a time.

~~~
wwilson
There is no _correctness_ limitation on using FoundationDB in multiple
datacenters, and if you do most of your writes in a single datacenter while
using the others for disaster recovery, the latency is honestly pretty good
too (especially with some of the new 6.0 features).

EDIT: I was wondering why this conversation was giving me deja vu, and then I
remembered that you and Dave discussed this in detail last month:
[https://news.ycombinator.com/item?id=18258805](https://news.ycombinator.com/item?id=18258805)

~~~
evanweaver
I forgot about that; thanks for the link. I didn't intend to imply a
correctness limitation.

------
spullara
I was using IBM DB/2 Parallel Edition in 1995 at Trans Union for a year 2000
project to move off the mainframe. It was running on 60 RS/6000 SP2s. I was
hoping the blog post would go further back in time.

[https://pdfs.semanticscholar.org/9c08/4a56c1c625da116eca4742...](https://pdfs.semanticscholar.org/9c08/4a56c1c625da116eca474208d9780539ceff.pdf)

------
wmf
It's kinda sad that Tandem and similar work has been forgotten. I've wondered
what we could learn from such history and never gotten a clear answer.

~~~
yourapostasy
HP still manufactures and maintains Tandem systems under the HPE Integrity
NonStop brand [1]. As with IBM Systems Magazine [2], there is a lot of
proprietary tech described there that takes a while to seep into Linux and
the open-source world. You can still read the Tandem documentation, so the
ideas are there to be picked up by anyone interested in re-implementing them
on Linux.

[1]
[https://www.hpe.com/us/en/servers/nonstop.html](https://www.hpe.com/us/en/servers/nonstop.html)

[2] [http://ibmsystemsmag.com/](http://ibmsystemsmag.com/)

~~~
atombender
Is there a technical writeup anywhere about the design and architecture of
Tandem? One that has enough detail that a skilled database designer could
understand it well enough to reimplement it?

~~~
wmf
[https://jimgray.azurewebsites.net/papers/TEs.pdf](https://jimgray.azurewebsites.net/papers/TEs.pdf)

I don't have the background to fully understand this stuff, but it claims "The
first distributed SQL -- it offers distributed data, distributed execution,
and distributed transactions." while every single NoSQL/NewSQL article says
things like "Legacy databases (typically the centralized RDBMS) solve this
problem by being undistributed."

~~~
evanweaver
That claim is technically true. There's a lot of good stuff in
here—partitioned locks, transaction recovery journals, a realistic expectation
of WAN latency and reliability—but page 20 ("Local Autonomy") and page 37
("NonStop Operation") give clues as to the differences from modern systems.

Rather than build consensus-based high availability into the software, NonStop
SQL relies on custom hardware and redundant networks that are designed to
never fail. If they do fail, some portion of the data or some types of
modifications become unavailable and the operator is expected to intervene and
make choices about how to proceed.

They give an example of being unable to run an ALTER TABLE statement because
two datacenters that each own a shard of the table are disconnected—the
statement can't be run at _either_ datacenter during the partition.

They give other examples where queries return partial results because a
specific shard is down. This is potentially a correctness violation if it is
not explicit, and it is definitely a loss of availability.

Modern distributed systems are instead designed to maintain almost total
levels of availability on constantly failing hardware, and heal themselves
immediately and autonomously without losing correctness.

------
dongxu
TiDB developer here,

> _This model is equivalent to the multiprocessor RDBMS—which also uses a
> single physical clock, because it’s a single machine—but the system bus is
> replaced by the network. In practice, these systems give up multi-region
> scale out and are confined to a single datacenter._

I don't think so. Although Percolator has a single physical clock as a point
of failure, it is easy to eliminate that SPOF by using high-availability
algorithms such as Raft or Paxos (just like the design in TiDB). On the
other hand, for strong consistency across multiple data centers I think
that, without hardware such as TrueTime, it is difficult to get low latency
and strong consistency while allowing writes to happen in multiple data
centers at the same time. The more realistic setup is a primary data center
(where both reads and writes occur), with the other data centers
guaranteeing high availability with strong consistency; we can achieve this
by scheduling the Paxos or Raft leaders to the primary data center.
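A minimal sketch of that idea (the Raft-replicated state is modeled here as
a plain Python list, and all names are illustrative, not TiDB's actual PD
code): the oracle persists only a high-water mark, hands out timestamps from
memory below it, and a new leader recovers from the replicated mark, so
timestamps stay monotonic across failover.

```python
WINDOW = 1000  # timestamps claimed per persisted high-water mark

class TSO:
    def __init__(self, replicated_log):
        self.log = replicated_log
        # Recover: never reissue anything below the last persisted mark.
        self.next_ts = self.log[-1] if self.log else 0
        self.limit = self.next_ts

    def get_ts(self):
        if self.next_ts >= self.limit:
            # Persist a new high-water mark before issuing from the window.
            self.limit = self.next_ts + WINDOW
            self.log.append(self.limit)
        ts = self.next_ts
        self.next_ts += 1
        return ts

log = []
oracle = TSO(log)
first = [oracle.get_ts() for _ in range(3)]  # 0, 1, 2
failover = TSO(log)                          # "new leader" recovers from the log
assert failover.get_ts() >= WINDOW           # skips ahead: no timestamp reuse
```

Persisting a window rather than every timestamp keeps the common path in
memory; the cost of failover is a gap in the sequence, never a duplicate.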

------
iblaine
TiDB should be in here but trying to keep up with the latest trends is
overwhelming.

~~~
freels
TiDB uses the Percolator protocol.

------
majidazimi
Why the hell does everyone make a banking example when they want to talk about
transactions? BANKS RUN ON EVENTUAL CONSISTENCY, PERIOD.

Another point worth mentioning is that the single-transaction-coordinator
model is undervalued in the community. The majority of systems have the
following query patterns (sorted descending by frequency):

1. Single-key read --> you don't need to contact the coordinator.

2. Single-key write --> you don't need to contact the coordinator; OCC with
a simple version number works here.

3. Multi-key asynchronous write (updating the second and third keys using a
pub/sub infrastructure) --> you don't need to contact the coordinator in
this case.

4. True multi-key transactions that must happen atomically --> you need to
contact the coordinator.

If a single coordinator can process 20-30K transactions per second of
category 4, then the total throughput of the system would be around 300K
queries per second, which is sufficient for many workloads.
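Category 2 above can be sketched with a version-numbered OCC write; the
store and its API here are hypothetical, just to show why no coordinator is
needed:

```python
class VersionedStore:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, value, expected_version):
        # The version check and bump happen at the single node that owns
        # the key, so no cross-node coordination is required.
        version, _ = self.data.get(key, (0, None))
        if version != expected_version:
            return False  # conflict: the caller re-reads and retries
        self.data[key] = (version + 1, value)
        return True

store = VersionedStore()
v, _ = store.read("balance")
assert store.write("balance", 100, v)      # succeeds at version 0
assert not store.write("balance", 50, v)   # stale version is rejected
```

A concurrent writer that lost the race simply retries against the new
version; only category 4, where several keys must change atomically, forces
a round trip to the coordinator.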

~~~
pritambaral
> BANKS RUN ON EVENTUAL CONSISTENCY, PERIOD.

Especially American banks. I can't think of any other retail banking market
where banks abuse eventual consistency so badly that transfers take days to
clear.

------
known
The CAP theorem states that it is impossible for a distributed data store to
simultaneously provide more than two of the three guarantees: Consistency,
Availability, and Partition tolerance.

[https://en.wikipedia.org/wiki/CAP_theorem](https://en.wikipedia.org/wiki/CAP_theorem)

------
doombolt
Apache Ignite is an ACID key-value CP database that does not depend on a
clock.

------
likeabbas
Where does FoundationDB fit with this?

