
> industry experience has taught us that availability—in the strict sense that it is defined in the CAP theorem—is overrated: In practice, the uptime of systems that favor availability have not proven greater than what can be achieved with consistent systems.

The answer to any distributed consistent database is always the same: we gave up on one letter in CAP. In this case, it was 'A', availability.

And hey, cool, that's certainly an option. But I'd prefer if they had simply made that the subtitle.

What they're trying to explain is that so-called AP systems do not have 100% availability either. The CAP theorem applies to very specific scenarios that are not necessarily common - it is a much narrower theorem than "CP vs AP" implies. For example, if your network partitions are usually small (i.e. there exists a majority outside the partition) and you can avoid routing requests to the minority partition (e.g. because you peer with ISPs and can route somewhere else, or internally avoid routing to a rack that is down, etc.) then you won't observe a loss of A.

For more information on this perspective you can read _Spanner, TrueTime and the CAP theorem_ by the fella that coined the term CAP: https://static.googleusercontent.com/media/research.google.c...

Yammer concurred with this perspective, stating:

"At Yammer we have experience with AP systems, and we’ve seen loss of availability for both Cassandra and Riak for various reasons. Our AP systems have not been more reliable than our CP systems, yet they have been more difficult to work with and reason about in the presence of inconsistencies. Other companies have also seen outages with AP systems in production. So in practice, AP systems are just as susceptible as CP systems to outages due to issues such as human error and buggy code, both on the client side and the server side."


"In practice, the uptime of systems that favor availability have not proven greater than what can be achieved with consistent systems."

This claim is much stronger than just saying AP systems sometimes fail, so observing some AP system failures is not enough to prove it. For the statement to hold true, AP systems would have to fail just as often as CP systems. However, such a claim is totally unfounded. The article doesn't link to any research showing that AP systems have availability equal to (or worse than) that of CP systems.

On the other hand, most CP systems, even centralized ones, are not truly consistent as required by ACID either. Most RDBMS systems run at Read Committed level for performance reasons, which is ACD, not ACID.

It is also ignoring the fact that most systems are neither AP nor CP, and that this distinction is not very useful for describing database systems.

Yep, the Yammer article is just opinion. It may or may not match your experiences/beliefs :) Check the Spanner paper for slightly more data (but it's still pretty informal) from Google. They don't talk about other systems, but they explain why Spanner has such high availability. It would be interesting to see an empirical study comparing availability, but I'm not holding my breath because doing that fairly and realistically would be challenging.

I'm not sure about the claim about CP systems, but I probably have a different opinion on "common" CP systems (etcd, consul, Spanner-likes, which do offer consistency in the sense of ACID.) But it's definitely worth considering things weaker than that - e.g. Azure Blob Storage and Google Cloud Storage (S3-likes) offer strong consistency but no cross-key transactions. These databases are very applicable but weaker than Spanner-likes like the one from TFA.

Agreed that AP vs. CP isn't a good way to describe databases :)

Note that this is mostly just an opinion piece, and it is also very guilty of attempting to poison the well.

"AP systems are not 100% available in practice"

Nothing is 100% available, it is like absolute zero, you can get close but never reach it.

I am not sure if the intent of this was to justify the uptime problems that Yammer had this year, or just to self-justify engineering choices.

The real flaw is ignoring the horses-for-courses reality of system needs. Automated teller machine transactions are not strongly consistent, yet we manage to get by.

Spanner is a "shared nothing" architecture, which has advantages for availability and recoverability. But you do pay a price for the high level of consistency.

Each write is mediated by a leader and has to be committed on at least two of the three regional nodes, which does have a cost.

Yammer (and I guess rayokota) chose to use PostgreSQL, which is not a horrible choice for a traditional RDBMS these days, but we all know the pain of a master node failover, re-sync, etc. And with most streaming replication methods there will be a serious issue between the nodes that may have been on a partition with the original master and the fail-over node that is promoted.

Distributed systems are hard, but I am having a hard time really finding much value in this post outside of a fairly opaque opinion that there is no perfect product for all needs, which will always be true.

But as for why I chose to respond to your comment, jacobparker: the thing to remember about strong consistency is that locks, two-phase commits, single mediators, etc. will all limit the theoretical speedup in latency you can have by adding resources.

While not perfect this means that you should engineer with Amdahl's law and hope for Gustafson's law.
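To put rough numbers on that, here is a minimal Python sketch of both laws. The 5% serial fraction is an illustrative assumption standing in for time spent behind a lock or single mediator, not a measurement of any real system, and the function names are made up for this example.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: fixed problem size, so the serial part bounds the speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction: float, workers: int) -> float:
    """Gustafson's law: the problem size is assumed to grow with the worker count."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers


if __name__ == "__main__":
    # Assumed workload: 95% parallelizable, 5% serialized (e.g. behind one leader).
    for workers in (1, 8, 64, 1024):
        print(
            workers,
            round(amdahl_speedup(0.95, workers), 2),
            round(gustafson_speedup(0.95, workers), 2),
        )
```

With a 5% serial fraction, the Amdahl figure can never exceed 1/0.05 = 20x no matter how many workers you add, while the Gustafson figure keeps growing only because the workload is assumed to grow too.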

As a practical example of how this becomes an issue with scaling, consider the following article about the limitations of single disk queues in the Linux kernel on systems with multiple CPUs.


Note the extreme falloff on Figure 4.

This doesn't mean that you can't start out with a simple strong consistency model, but if you don't at least try to avoid the need for ACID-type transactions it will be very hard to scale later. Worse, most teams I have been on try to implement complex and fragile sharding schemes which in effect replicate a BASE-style datastore. If we think managing distributed systems is hard, writing them is much more difficult.

But really there are no universal truths in this area, and all decisions should be made on use case and not by selecting the product before defining the need (which is our most common method it seems)

> For the statement to hold true, AP systems would have to fail just as often as CP systems.

I think all that would need to be true is that there exists one CP system that has uptime figures comparable to AP systems. Then you could say that targeting AP is pointless, when you could either just use that system; or engineer your own new CP system to do the same things to achieve uptime as that system.

You can achieve high uptime in a CP system by ensuring P will never happen, ending up with some kind of AC system, but in practice it may be very costly and hard to set up and operate.

Can you imagine every ATM having redundant network connectivity and two independent power lines?

There is no free lunch. Comparing just availability figures is therefore not enough. Still, an AP system may be preferable if it has the same availability at a fraction of the cost of that unicorn available-and-consistent system.

Note that a "CP system with high uptime" isn't suddenly an AC system. It still doesn't attempt to have 100% uptime, and its ecosystem of clients still must be built around the idea that the server can go down, if just for a few minutes a year.
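For a sense of scale on "a few minutes a year", here is a small Python sketch that converts an availability fraction into yearly downtime. The availability levels are illustrative assumptions, not figures reported for any system in this thread.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for an availability fraction such as 0.99999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical targets, for illustration only.
    for label, availability in (
        ("three nines", 0.999),
        ("four nines", 0.9999),
        ("five nines", 0.99999),
    ):
        minutes = downtime_minutes_per_year(availability)
        print(f"{label}: {minutes:.1f} minutes/year")
```

Five nines works out to a little over five minutes a year, which is the regime where clients still have to handle the server being down, just rarely.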

This is different from an AP system, because an AP system has to be built from the ground up to assume that stupid corruptions will happen during netsplits and that it'll have to re-integrate them; while the CP system can just assume it'll all go down, and not have to have any of that extra code or infrastructure.

A good example of a high-uptime CP system that I'm aware of, is telecom switches. They're usually architected as simple pairs: one live, and one hot standby. And—even in platforms that don't have "hot upgrade" functionality—you can still upgrade both nodes without downtime by just intentionally causing connections to fail-over back and forth between the nodes.

Telecom switches have very little downtime. But they do go down, rather than becoming inconsistent. And the resulting architecture is much cheaper (in hardware/networking costs, ops costs, and software development/maintenance costs) than the equivalent architecture you'd need to serve calls under AP guarantees.

For many applications repeatable read, or even read committed, is plenty good enough Isolation.

For many applications BASE is plenty good consistency.

Does your explanation also hold when you look at read and write availability separately? I can only see that holding for the read "view" here. E.g., Cassandra is used because you can write your data almost always to Cassandra, while with a CP-oriented system (HBase?) you are stuck "waiting" for consistency.

Yes; in vanilla Paxos you need a majority of nodes to be up/reachable to service either a read or a write. There isn't really a distinction between read/write in terms of availability (for better or worse.) I think it's pretty unlikely (in terms of db design) that you'd have a consistent system that could service reads but not writes during some manner of outage.
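The majority rule can be sketched in a few lines of Python. This is only the availability check, not Paxos itself, and `has_quorum` and `can_serve` are hypothetical names for illustration.

```python
def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """True when a strict majority of the cluster is reachable."""
    if not 0 <= reachable_nodes <= total_nodes:
        raise ValueError("reachable_nodes must be between 0 and total_nodes")
    return reachable_nodes > total_nodes // 2


def can_serve(operation: str, total_nodes: int, reachable_nodes: int) -> bool:
    """In this simplified model, reads and writes share one availability rule."""
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation}")
    return has_quorum(total_nodes, reachable_nodes)


if __name__ == "__main__":
    # A 3-node group that has lost one node still has a majority.
    print(can_serve("write", 3, 2))
    # The minority side of the same split cannot serve reads or writes.
    print(can_serve("read", 3, 1))
```

Note that an even-sized cluster split down the middle has no majority on either side, which is one reason odd sizes are the usual choice.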

> I think it's pretty unlikely (in terms of db design) that you'd have a consistent system that could service reads but not writes during some manner of outage.

Consider a bank account. I can consistently service writes that add or subtract from the balance as long as I can guarantee durability and am willing to blindly update without a consistent view of the balance.

I can only service consistent reads when I can guarantee that I have seen every committed transaction.

In this case we opt for availability over consistency when presenting the current balance, but for consistency when e.g. cutting statements (through the brute-force method of just waiting until all settlement for the relevant dates has occurred).
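A toy Python sketch of that split, assuming each transaction carries a unique id and that settlement supplies the full list of ids a statement must cover. `AccountReplica` and its methods are hypothetical, and durability is hand-waved as an in-memory dict.

```python
class AccountReplica:
    """One replica's view of an account, stored as signed deltas in cents."""

    def __init__(self) -> None:
        self.entries: dict[str, int] = {}  # txn_id -> signed amount in cents

    def record(self, txn_id: str, amount_cents: int) -> None:
        # Blind write: no balance check, and idempotent per transaction id.
        self.entries.setdefault(txn_id, amount_cents)

    def merge(self, other: "AccountReplica") -> None:
        # Merging is a union of entries, so the order of merges does not matter.
        for txn_id, amount_cents in other.entries.items():
            self.entries.setdefault(txn_id, amount_cents)

    def provisional_balance(self) -> int:
        # Always available, but may be missing transactions seen elsewhere.
        return sum(self.entries.values())

    def statement_balance(self, settled_txn_ids: list[str]) -> int:
        # Consistent read: refuse to answer until every settled txn is present.
        missing = [t for t in settled_txn_ids if t not in self.entries]
        if missing:
            raise RuntimeError(f"still waiting on settlement for: {missing}")
        return sum(self.entries[t] for t in settled_txn_ids)


if __name__ == "__main__":
    branch, atm = AccountReplica(), AccountReplica()
    branch.record("deposit-1", 10_000)
    atm.record("withdrawal-1", -2_500)
    print(branch.provisional_balance())  # only reflects what the branch has seen
    branch.merge(atm)
    print(branch.statement_balance(["deposit-1", "withdrawal-1"]))
```

The provisional balance is the "availability over consistency" read, and the statement balance is the brute-force wait for settlement.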

I think you misread my comment; I said it's unlikely to have a consistent system that can service reads but not writes during some manner of disruption.

Ah, yes, I read it as exactly the opposite in fact.

But there are lots of consistent systems that can service reads but not writes during disruption. Databases with synchronous replication, for example, would typically continue to handle reads, but fail writes, during a partition.
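A deliberately simplified Python model of that behavior, assuming a primary that must reach every replica before acknowledging a write. `SyncReplicatedStore` is a hypothetical class for illustration and does not mirror any particular database's replication protocol.

```python
class SyncReplicatedStore:
    """Toy primary that only commits a write once every replica has it."""

    def __init__(self, replica_count: int) -> None:
        self.data: dict[str, str] = {}
        self.replicas: list[dict[str, str]] = [{} for _ in range(replica_count)]
        self.reachable: list[bool] = [True] * replica_count

    def partition(self, replica_index: int) -> None:
        # Simulate losing the network path to one replica.
        self.reachable[replica_index] = False

    def heal(self, replica_index: int) -> None:
        self.reachable[replica_index] = True

    def write(self, key: str, value: str) -> None:
        # Fail the write rather than let the replicas diverge.
        if not all(self.reachable):
            raise ConnectionError("replica unreachable; write rejected")
        self.data[key] = value
        for replica in self.replicas:
            replica[key] = value

    def read(self, key: str) -> str:
        # Reads stay consistent because nothing commits during the partition.
        return self.data[key]


if __name__ == "__main__":
    store = SyncReplicatedStore(replica_count=2)
    store.write("balance", "100")
    store.partition(0)
    print(store.read("balance"))
    try:
        store.write("balance", "50")
    except ConnectionError as err:
        print(f"write failed: {err}")
```

Reads keep returning the last committed value during the partition, while writes are refused until the replica is reachable again.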

True, good point!

> I think it's pretty unlikely (in terms of db design) that you'd have a consistent system that could service reads but not writes during some manner of outage.

It's very easy to imagine this scenario. All you need is a read replica of your CP RDBMS.

For most systems, consistency is only required on write transactions, and eventual consistency on read-only transactions is acceptable.

Paxos is about ensuring consistency. So why would the original arguments hold for a write-available database like Cassandra - or about log/journal storage, then?

This is an insightful observation. Designing more and more complex systems to achieve high availability may end up not achieving anything or even being negative because those HA systems are too complex to implement, configure and operate in a way that actually achieves HA.

I think the same may end up being true of globally distributed databases that try to provide generic ACID transactions. Will they actually ever achieve the performance and reliability needed to serve massive-scale global workloads?

Or will this be a Mongo situation where all of the scale promises turn out to be unfulfilled once you actually reach the scale where you need them?

The Mongo situation is the best-case scenario. They went IPO while others failed.
