The reason failures are hard to detect in asynchronous networks is that permissible message transit times are unbounded; i.e., they refuse to acknowledge the presence of any partition. If you acknowledge the possibility of partitions, then your system is by definition not asynchronous.
Reporting an error condition counts as an availability violation.
This is bullshit. Per the definition quoted in the linked article, availability only means that "...every request must terminate." It is not required that the request terminate successfully.
No. Like you say, async means failures are hard to distinguish from delays. If a node's NIC catches fire, I'm pretty sure no messages are ever going to get delivered to it - hence it is partitioned from the network. But in an async network it is very hard to tell whether it has failed, or whether it is just running slowly.
"This is bullshit. Per the definition quoted in the linked article, availability only means that "...every request must terminate." It is not required that it terminate successfully."
No. The definition of the atomic object modelled by the service doesn't include an 'error' condition. Otherwise I could make a 100% available, 100% consistent system by always returning the error state, which is thoroughly uninteresting. You have to read more than the quoted definition in the Gilbert and Lynch paper before you start calling bs - it is very clear that the authors do not allow an 'error' response.
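To make the degenerate case concrete, here's a sketch of my own (the class name and API are invented for illustration): a service that answers every request with an error trivially "terminates" every request and never exposes an inconsistent state, yet tells you nothing about the atomic object - which is exactly why such responses can't count toward availability.

```python
# Hypothetical degenerate service: 100% "available" and 100% "consistent"
# under a reading of availability that counts error responses as valid.
class AlwaysErrorStore:
    def read(self, key):
        return "error"    # every request terminates...

    def write(self, key, value):
        return "error"    # ...but never reflects the atomic object's state

store = AlwaysErrorStore()
assert store.write("x", 1) == "error"
assert store.read("x") == "error"   # no client can ever observe any state
```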
A curious inconsistency.
But still... ok, the bullshit is elsewhere.
For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. You and I do not work with these types of systems because they don’t exist.
This is bullshit; it assumes that the only options are complete reliance on the absence of failures, and tolerance of arbitrary partitions. Specifically, it claims that P(system fails) = 1 - (1 - P(any particular node fails)) ^ (number of nodes), and that therefore "the question you should be asking yourself is: In the event of failures, which will this system sacrifice? Consistency or availability?". There are plenty of real-life counter-examples to this: given a real-world (i.e., at least partially synchronous) network, it is possible to maintain consistency and availability in the face of the partition/failure of up to half-minus-one of your nodes. This blows the probability calculations completely out of the water.
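A sketch of the counter-example (my own illustration, not from the article): a majority-quorum system over n nodes keeps serving reads and writes as long as at most "half minus one" of the nodes have failed or been partitioned away, so a single node failure does not force the C-vs-A choice.

```python
def max_tolerated_failures(n):
    """A majority quorum needs floor(n/2) + 1 live nodes,
    so up to ceil(n/2) - 1 ("half minus one") may fail."""
    return (n - 1) // 2

def still_available(n, failed):
    """The system stays consistent AND available while a majority survives."""
    return n - failed >= n // 2 + 1

assert max_tolerated_failures(5) == 2
assert still_available(5, 2)        # 3 of 5 still form a majority
assert not still_available(5, 3)    # no majority left: now C or A must go
```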
You cannot, however, choose both consistency and availability in a distributed system. ... As a thought experiment, imagine a distributed system which keeps track of a single piece of data using three nodes—A, B, and C—and which claims to be both consistent and available in the face of network partitions.
Hey, see how that "in the face of network partitions" snuck in there? It's bullshit; what you want is "in the face of these specific kinds of network partitions", things like crashed nodes. Just enough to invalidate that abuse of statistics used to claim that Availability is impossible.
I don't agree - availability is a totally meaningless property if you are allowed to occasionally return "no, I won't process your request". Such a response communicates nothing about the state of the atomic object you are writing to or reading from, so you can always return it and trivially satisfy 'availability' if we define it this way.
To your other point - be aware that I didn't write the article, so I'm not speaking for the author. However, I think you're right that the article makes it sound a bit like a single failure or message loss will cause any protocol to immediately sacrifice availability or consistency. This isn't the case - all CAP does is establish that for every protocol, there exists a failure pattern that will force it to abandon one of the two.
For quorum systems, this means that a permanent partition can leave a side of the split without a majority; that side can no longer be consistent if it responds to any requests. Paxos is another example.
So you're right, there are particular patterns of partition that stop a protocol from functioning correctly. And many that don't - hence the term 'fault tolerant' has some meaning.
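A toy check of which partition patterns a majority-quorum protocol survives (my own sketch, under the simple rule that a side keeps serving iff it holds a majority): any split that leaves one side with a majority tolerates the fault; a split with no majority side halts progress everywhere.

```python
def surviving_side(partition_sizes, n):
    """Return the size of the side that can keep serving under a
    majority-quorum rule, or None if no side holds a majority."""
    majority = n // 2 + 1
    for size in partition_sizes:
        if size >= majority:
            return size
    return None

# 5 nodes split 4/1: the 4-side keeps serving -> fault tolerated
assert surviving_side([4, 1], 5) == 4
# 5 nodes split 3/2: the 3-side keeps serving
assert surviving_side([3, 2], 5) == 3
# 6 nodes split 3/3: no majority anywhere -> progress stops
assert surviving_side([3, 3], 6) is None
```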
Avoiding these patterns, in practice, can turn out to be surprisingly tricky. High-performance systems can't afford to have too many participants, which means that the probability of a problematic failure is higher than we might like (five participants in a consensus protocol is already a lot for high throughput, but now as few as three failures can break us). Failures are also often correlated, so independence assumptions don't hold as well as we'd like. Machines crash. Networking gear fails.
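For intuition (the failure probabilities here are assumed purely for illustration): with 5 independent participants that each fail with probability p, the chance that 3 or more fail - enough to destroy the majority - follows a binomial tail. It's tiny when failures are rare and independent, and much less tiny once correlation pushes p up.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k of n independent nodes fail, each with probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With p = 0.01 per node, losing a majority of 5 is very rare...
assert prob_at_least(3, 5, 0.01) < 1e-4
# ...but under correlated failure (say p = 0.2 during a rack outage) it isn't.
assert prob_at_least(3, 5, 0.2) > 0.05
```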
There's no abuse of statistics here. The probability of a particular failure pattern can be engineered to be low, and at that point you must weigh the cost of losing availability or consistency against the effort you spend minimising the chance of occurrence. We are talking about edge cases here, and the implicit assumption is that the cost of hitting one of them is huge (and it often is). However, if you run a large enough cluster, you hit edge cases all the time.
(Although you mention partial synchrony, note that most of these results are mainly applicable to asynchronous networks in the first instance).
Availability of the individual node, or of the service as a whole? So long as the nodes always answer quickly, can't I just ask a few different ones until I get a successful response (or conclude that the entire system is down)?
Let's imagine a system where you want to be 100% available for reads, for any number of failures less than N. Then you need to be able to submit every single write to every single node in the system, otherwise the failure of all but the up-to-date node will result in stale reads.
But then if a single node is partitioned from the network, we can't (correctly) be available for writes, because the system is incapable of sending updates to all nodes as required. It doesn't matter which node you ask.
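The thought experiment above can be sketched as follows (node names and the class API are mine, invented for illustration): if a read may go to any single node, a write must reach every node, so the moment even one node is partitioned away the only correct behaviour is to refuse writes.

```python
class ReadOneWriteAll:
    """Toy replicated register: reads may hit any one node, so a write
    must reach every node, or a surviving stale node serves old data."""
    def __init__(self, nodes):
        self.replicas = {n: None for n in nodes}
        self.reachable = set(nodes)

    def write(self, value):
        # If any replica is unreachable it would be left stale, so the
        # only correct answer is to refuse: not available for writes.
        if self.reachable != set(self.replicas):
            return False
        for n in self.reachable:
            self.replicas[n] = value
        return True

    def read(self, node):
        return self.replicas[node]   # any single node may answer

cluster = ReadOneWriteAll(["A", "B", "C"])
assert cluster.write(1)
cluster.reachable.discard("C")       # C is partitioned away
assert not cluster.write(2)          # writes must now be refused
assert cluster.read("A") == 1        # reads remain available everywhere
```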
The point is that every system has a failure mode like this. I take your point that it's not always just a single node failure that precipitates the abandonment of C or A, but that was never the point of the CAP theorem.