
When is "ACID" ACID? Rarely - pbailis
http://www.bailis.org/blog/when-is-acid-acid-rarely/
======
akeefer
While I appreciate how thorough the article is, it's a bit of a strawman.
Pretty much everyone who makes use of a relational database in a professional
capacity has to be aware of what transaction isolation level they're using,
make their own choice about what to use, and then do things like acquire
explicit update locks or do optimistic locking in order to ensure data
integrity. But that doesn't mean that the ACID properties are useless merely
because you have to do that; it might mean you have to think about a few
things more than you'd like to, but it's still a different world than trying
to mimic ACID properties in a NoSQL database, and there are still fairly hard
guarantees about things like consistency that you get with other isolation
levels. For example, with read committed or snapshot isolation, I still have
transactionality and can be sure that if I issue a sequence of 10 updates in a
single transaction and then commit it, any other single query is either seeing
the results of all 10 or of none of them. That's an important guarantee in
many situations, and it's a guarantee that I can use to make decisions about
how I structure my application logic.

The author of the post basically seems to treat any isolation level below
serializability as some sort of sham perpetrated on the development community,
and that's not the case: they're still useful, and they're still something
that you can use to build the sorts of application guarantees you want. The
mere fact that pretty much every database vendor gives you a choice as to what
isolation level to use should be a pretty obvious clue that there's no one-
size-fits-all answer there, so harping on the non-existence of a serializable
isolation level is somewhat missing the point.

~~~
MichaelSalib
I think it is extremely hard to reason about data integrity properties in non-
ACID systems.

Consider a banking application that uses linked accounts such that
transactions fail if the combined balance of linked accounts is below zero. If
I implement this in Oracle using Snapshot Isolation (the highest isolation
level Oracle offers) in the obvious manner, I'll get silent constraint
violations: transactions that should fail will succeed because MVCC can't stop
two competing transactions debiting the same account pair (i.e., I have two
linked accounts A & B with $100 in each and I launch two simultaneous
transactions to transfer $150 away from both: afterwards, I'll end up with
-$50 in both or some garbage data).

Now, you can fix if you recognise the problem in advance by doing a select-
for-update or changing your schema to materialize the constraint (say by
creating a linked-balance table that holds the combined balance for all pairs
of linked accounts).

But it is really hard to even notice the problem, especially if you've got a
few dozen tables with multiple applications writing to your database (RDBMS
advocates insist that this is a good thing). And there are no automated
methods for determining when this will be a problem: you just get silent data
corruption or silent constraint violations in your extremely expensive "ACID-
compliant" (but not really) Oracle database.

~~~
shariqm
"MVCC can't stop two competing transactions debiting the same account pair
(i.e., I have two linked accounts A & B with $100 in each and I launch two
simultaneous transactions to transfer $150 away from both: afterwards, I'll
end up with -$50 in both or some garbage data)."

That isn't true. In a pure MVCC world you could detect such issues,
specifically that two trxs were opened and they both tried to update the same
row. This would generate a conflict and one transaction could just get shot.
Snapshot-isolation/MVCC describes this issue.

However what InnoDB/mysql and Clustrix actually do is use MVCC for read-write
conflicts and two-phase-locking for write-write conflicts which trivially
fixes your proposed problem.

~~~
MichaelSalib
To clarify the example, the two simultaneous transactions are updating
different rows: one updates account A's row while the other updates account
B's. There is no way for MVCC to recognize this conflict: it will happily
corrupt your data.

~~~
shariqm
Sorry, I'm not following. If you have ___2_ __rows storing how much money you
have in __ _1_ __account then your data isn't normalized and serializability
isn't going to solve your problem, if you could be (even) more explicit, I
might be able to answer.

~~~
pbailis
I think we're conflating Snapshot Isolation and MVCC (e.g., _Snapshot-
isolation/MVCC_ ). MVCC is a general concurrency control mechanism, not an
isolation level. Coupling MVCC and Snapshot Isolation is like saying "using
locks provides serializability," which is not true in general--it depends on
how you use the locks.

That said, Snapshot Isolation _can_ lead to anomalies like Example 1.2:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.556&rep=rep1&type=pdf)
If you perform additional synchronization (e.g., 2PL, as you describe), then
you can get serializability.

~~~
MichaelSalib
Yes, you're totally correct. I was thinking of SI.

------
shariqm
Clustrix says it provides repeatable-read because it complies with the ANSI
spec. In truth we actually provide something closer to snapshot-isolation by
default. Snapshot isolation in summary is a guarantee that all reads made in a
transaction will see a consistent state of the database. A transaction will
only fail to commit if there is a conflict of two updates. This is a strong
isolation level that does have a set of anomalies but most companies don't
care about them. Microsoft put out a great paper about snapshot-isolation:
<http://research.microsoft.com/pubs/69541/tr-95-51.pdf>

As the post points out, weaker isolation levels allow for more concurrency
which translates to more scalability, which is why we chose our isolation
level.

------
jpitz
Huh? You ought not be depending on the default isolation settings. If your
workload needs serializability, set it. Most don't. CAP teaches us that ACID
is a spectrum in a distributed system.

~~~
MichaelSalib
Did you read the post? First, it covers both default isolation and maximum
available. Second, note that many databases (like Oracle 11g) don't actually
give you serializable semantics at ANY setting. Third, most of these products
are not distributed systems. Fourth, ACID is not a spectrum for distributed
systems; it is basically impossible to apply without sacrificing availability.
But that's perhaps not a big deal if our non-distributed DBs don't provide
real ACID anyway.

The real issue here is that the database world is a cargo cult where ignorant
people scream ACID to denigrate new technologies without noticing that most
production databases aren't running with anything close to ACID and that major
database vendors can't even support ACID.

~~~
jpitz
I did read the post.

It does cover both default and maximum. I didn't dispute that. I called out
the notion that anyone ought to be depending on the defaults in the first
place, or that SERIALIZABLE as a default was a good choice.

Yes, many don't support SERIALIZABLE. Didn't contradict that either.

As to whether many of these are or aren't distributed systems:

Ingres - has replication.

Aerospike - distributed/fault tolerant/blah blah

Persistit - nope. appears to be a library.

Clustrix - clustered.

Greenplum - this is shared-nothing clustered postgres.

DB2 for zOS - i have no idea. let's call this one not distributed, for
giggles.

Informix - same

MySQL - lots of replication and HA options

MemSQL - replicated

MSSQL - replication and federated query modes

Nuodb - cloud database management? looks distributed to me.

Oracle - dont they have RAC ?

Berkeley (x2) - dont know. probably not.

PostgreSQL - a few replication options

HANA - no idea. lets call it in your favor.

ScaleDB - clustered.

Volt - shared nothing clustering

That's a little over half, by my count. Certainly close to most.

"The real issue here is that the database world is a cargo cult where ignorant
people scream ACID to denigrate new technologies without noticing that most
production databases aren't running with anything close to ACID and that major
database vendors can't even support ACID."

Some can't. Some do. I'm not screaming. My main message is this: Don't depend
on defaults. They differ from vendor to vendor. Understand your workload and
use the APPROPRIATE isolation for it.

( edited for formatting and clarity )

~~~
MichaelSalib
_I called out the notion that anyone ought to be depending on the defaults in
the first place_

Regardless of whether you should depend on them, many many people do. Heck,
many people don't even understand that there's a choice to be made: after all,
everyone knows that Oracle is ACID compliant, right?

 _Yes, many don't support SERIALIZABLE. Didn't contradict that either._

Sorry, I was confused by the bit about "If your workload needs
serializability, set it" since that's physically impossible on Oracle 11g.

Just because a DB has a replication package available (like MySQL) does not
mean that it is a distributed system. And the file backed DBs (like Berkeley)
are definitely not distributed. Sure, there are some extremely expensive
massively parallel DBs in use (like Volt), but the number of deployments for
those systems is a drop in the bucket compared with single-node
MySQL/Postgres/Oracle/SQLServer/DB2 instances.

~~~
jpitz
"Regardless of whether you should depend on them, many many people do. Heck,
many people don't even understand that there's a choice to be made: after all,
everyone knows that Oracle is ACID compliant, right?"

Ignorance is not an excuse. It just isn't.

"Just because a DB has a replication package available (like MySQL) does not
mean that it is a distributed system. And the file backed DBs (like Berkeley)
are definitely not distributed. Sure, there are some extremely expensive
massively parallel DBs in use (like Volt), but the number of deployments for
those systems is a drop in the bucket compared with single-node
MySQL/Postgres/Oracle/SQLServer/DB2 instances."

I do not understand what point you are trying to make here. Are these systems
de-facto non-distributed systems merely because of deployment counts? Is there
an objective criteria here I should be aware of?

~~~
MichaelSalib
_Ignorance is not an excuse. It just isn't._

I'm not interested in excusing anything, I'm interested in understanding the
real world. And what I see is that a lot of people who insist on the absolute
need for ACID don't really understand it because they're using non-ACID
technology right now.

 _I do not understand what point you are trying to make here._

Consider MySQL with asynchronous replication to a slave. In a weak sense, that
is a distributed system because the remote slave is on a different machine
(and probably very distant) from the master. But the distributed bit here
doesn't interfere with correctly implementing ACID: MySQL with async
replication operates identically to single node MySQL: it just transmits the
binary logs to a slave server. The database system itself is a single-node
service whose state gets replicated at transaction boundaries.

In contrast, distributed shared-nothing databases like Volt have to work
really hard to maintain consistency: there is no single node in those systems
that does all the work and gets replicated: multiple nodes have to cooperate
in order to get anything done.

------
redwood
When I first read this headline, I thought it had to do with purity of a
certain Swiss-made variety that left its mark on San Francisco (and
surprisingly, Palo Alto---e.g. Jerry Garcia and Ken Kesey). My mistake

~~~
mutagen
I should have known better, especially in the context of this website. Due to
a recent conversation with a friend about ph balance and the acidity of
various foods, I thought it was a discussion of the chemical properties of
various substances. Extraneous context affecting my initial impression.

------
stephenpiment
I think it's great that someone is taking a detailed look at distinct
isolation levels for different systems. It's also worth looking at distinct
levels for atomicity, consistency, and durability. One could then define an
overall "pH" level for systems.

I would take exception to Bailis' statement that "it is impossible to
simultaneously achieve high availability and serializability." At
FoundationDB, we do exactly that.

Stephen Pimentel

foundationdb.com

~~~
HenryR
I think it's rather clear that he's talking about high availability in the CAP
sense, which is precisely the kind of availability FoundationDB (rightly)
doesn't claim to achieve (<http://foundationdb.com/#CAP>).

BTW, the images are failing to load on that page for me.

