
Keeping CALM: when distributed consistency is easy - ngaut
https://blog.acolyer.org/2019/03/06/keeping-calm-when-distributed-consistency-is-easy/
======
DanielBMarkham
This is outstanding -- although in lieu of Yet Another New Programming
Language (YANPL), I would suggest simply having developers learn the concepts
and apply them no matter what environment they're in. This should be part of
any developer's education.

I have some good friends, smart people who know a lot more about good
programming than I ever will, who keep saying things like "OO and FP are just
different flavors. It's all the same"

That's both true and dangerously incomplete. Pure FP takes you places like
this essay. Yeah you can make all of this happen in OO, but it sucks. And if
you want to code at scale you gotta know exactly how much it sucks and what to
do about it.

ADD: My biggest critique of this essay is that the people who need to
understand it most will be put off by the terminology. I have no way to fix
that, but it still sucks.

~~~
jerf
"This is outstanding -- although in lieu of Yet Another New Programming
Language (YANPL), I would suggest simply having developers learn the concepts
and apply them no matter what environment they're in."

One of the recurring themes I've been banging on on HN for the last couple of
months is that there's actually a _lot_ of room for new languages to come out
that aren't just respellings of the current languages we have, and this is one
of the opportunities that exist. I say this because the vast bulk of our
current languages will just fight you so much if you try to work this way.
What most languages are designed to do and designed to make easy are precisely
the things that will blow your foot off. Trying to make this work in current
languages is tapdancing through a minefield. It can certainly be done, but
it's not something I think we can ever reasonably expect to be done at scale.

As you suggest, Haskell is pretty much the only semi-popular language that
could implement this with a library, and it would be correct. And even then it
probably would be a not-entirely-optimal experience, just because it's not
really optimized for this.

~~~
DanielBMarkham
I don't think we disagree. I'm a big fan of DSLs/languages over further app
complexity. To clarify my point, Another Language (tm) is not the problem. The
problem is a lack of understanding of the fundamentals.

I keep seeing people sell tech/languages/frameworks as some sort of
replacement for deep understanding. Hey, that's cool -- as long as you don't
spend all your time honking around with the tool instead of the problem. It's
when these promises fail that devs and teams end up on a dark, lonely path
with unhappy customers.

Overall I am quite fascinated by the concept of new environments for
programmers -- and I believe it's not just new languages, but the tools, the
industry background, the workflow, and a lot of other things come together to
make tools and frameworks rock. (Many times we all want to geek out on
language details when the problem with the language is lack of a community or
tooling that makes you want to jump out a window. But I digress.)

Something I'm working on is coming up with guidelines of when and how to
create good DSLs in order to create that kind of environment.

~~~
jerf
"I don't think we disagree. "

We do not. I meant it more as elaboration, not outright disagreement. It is a
great idea to apply these ideas today, but it's _really hard_ , and I'd still
love to see someone fix that. And maybe some other things as they go. But in
the meantime, programmers of today really can't afford to wait to apply these
principles.

I guess my point is that I agree that YAPL that just has this in it and is
basically thrown out as a hack for writing a couple of papers is not very
useful, but I'd love to see someone set out to create the Next Big Programming
Language that _incorporates_ this idea, as well as other things, but isn't
just a demonstration project.

------
jerf
I expect that the continuing popularity and necessity of distributed
programming is going to finally put an end to the idea that "programming
doesn't involve mathematics". It is just _so much easier_ if you approach it
from the sort of mathematical mindset demonstrated in this paper, and exactly
as discussed, this is going to be a whole-stack affair, not something that can
be isolated to just the data layer, despite our valiant and useful efforts on
that front so far.

Programming this way is already easier in a lot of places, but the boundary
between non-distributed systems that are possible without mathematical
thinking and the ones that can only be done with mathematical thinking is easy
to ignore. It tends to be specific niches where programs are only possible
with this sort of thinking, and everything else can mostly be hit hard enough
with a stick until it works well enough that we can just get on with it.
But as things get distributed, it just becomes insanely difficult to build
even a "correct enough" system if you're not building it with some sort of
math at least held in your mind as you design it. And "distributed" isn't just
about servers, it's about multi-core CPUs, it's about GPUs, it's about CPUs
and GPUs, it's about networks, just everywhere you look there's more and more
"distributed" arising.

~~~
agentultra
I think this has been the case for many decades. Lamport, the Turing
Award-winning mathematician and computer scientist who wrote seminal
algorithms and papers on distributed consensus, invented TLA+ for this
express purpose.

A programming language alone is not expressive enough to specify and
guarantee the safety and liveness properties of a multi-agent system.

I think formal methods and program synthesis are going to become the tools of
choice in the long term for designing reliable systems in this vein.

I've been following the development of the CALM theorem and various attempts
at programming languages that implement its ideas (BloomL, etc.) since 2010 at
least -- I wish I'd known about TLA+ back then. This paper is great... I hope
it catches on more.

~~~
deegles
What's a good small project + stack for someone to start with formal methods
and/or program synthesis?

~~~
agentultra
For an introduction and motivation to TLA+ I recommend Hillel Wayne's talk
_Everything About Distributed Systems is Terrible_ [0] and basically anything
he has written on his blog about TLA+.

I think following along with the TLA+ video course is really thorough and
digestible once you've decided that you want to learn more [1].

You'll find the TLA+ Hyperbook is a useful reference and guide as you start
specifying your own systems [2]. It has a number of examples and projects to
work on.

Eventually if you want to go deep after you've decided it's worth it for
yourself, get into _Specifying Systems_, the textbook and bible of TLA+ [3].

_Program Synthesis_ is an active research area that is starting to show
promising results. While the research has been going on for a long time, it's
still rather new to industrial use... for a motivating talk and explanation of
what program synthesis is, try this one:
[https://www.youtube.com/watch?v=HnOix9TFy1A](https://www.youtube.com/watch?v=HnOix9TFy1A)

[0] [https://www.hillelwayne.com/talks/distributed-systems-tlaplus/](https://www.hillelwayne.com/talks/distributed-systems-tlaplus/)

[1] [https://lamport.azurewebsites.net/video/videos.html](https://lamport.azurewebsites.net/video/videos.html)

[2] [https://lamport.azurewebsites.net/tla/hyperbook.html](https://lamport.azurewebsites.net/tla/hyperbook.html)

[3] [https://lamport.azurewebsites.net/tla/book.html](https://lamport.azurewebsites.net/tla/book.html)

~~~
deegles
Thanks!

------
elierotenberg
It seems like monotonicity (or a subset with stronger monotonicity, still
covering many practical use cases) can be statically enforced in an algebraic
type system. With some generics magic you can encode monotonicity as a type
assertion on (optionally asynchronous) function calls mapping onto an
underlying transport/RPC mechanism (REST over HTTP, RPC over WebSocket, custom
protocol over TCP/domain socket...).

In TypeScript for example you can strongly type your communication protocol
(web requests/responses, websockets, parent/worker ipc, etc) and get some
guarantees at compile time. For example, you can guarantee that your state
updates over the wire are eventually consistent.

The same goes of course for other algebraic type systems by encoding
monotonicity in the form of static type assertions (Flow, Caml, Haskell...).
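
As a rough sketch of what such an encoding might look like (illustrative
names only, not any particular library): a grow-only set whose public type
exposes only monotone operations, so non-monotonic updates simply fail to
type-check:

```typescript
// Illustrative sketch: a grow-only set (G-Set) whose public API exposes
// only monotone operations (add, merge, has) and no removal, so any
// well-typed sequence of updates is monotonic by construction.
class GSet<T> {
  private items = new Set<T>();

  add(item: T): this {
    this.items.add(item);
    return this;
  }

  // merge is a join (set union): commutative, associative, and
  // idempotent, so replicas converge no matter how updates are
  // ordered or how often they are re-delivered.
  merge(other: GSet<T>): this {
    for (const item of other.snapshot()) this.items.add(item);
    return this;
  }

  has(item: T): boolean {
    return this.items.has(item);
  }

  snapshot(): ReadonlySet<T> {
    return this.items;
  }
}
```

Because removal simply isn't in the type, replicas can exchange and apply
merges in any order and still converge; that's the kind of compile-time
guarantee being described.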

~~~
marcosdumay
Well, since you just need to restrict the operations one can do with data,
even a Java or C++ style type system is enough.

What is best done with another language (or a complex library on a very
malleable language) is permitting non-monotonic operations with a clear
boundary between them and some code style that makes them undesirable.

------
hardwaresofton
> CALM falls short of being a constructive result -- it does not actually
> tell us how to write consistent, coordination-free distributed systems.
> Even armed with the CALM theorem, a system builder must answer two key
> questions. First, and most difficult, is whether the problem they are
> trying to solve has a monotonic specification.

> The second question is equally important: given a monotonic
> specification for a problem, how can I implement it in practice?

Can anyone explain a good way to use these results?

The OP (acolyer.org) seems to suggest that this boils down to using CRDTs, but
that doesn't really seem to be the answer, and I can't figure out how to
use/properly think about the results in this paper (as opposed to others that
are a bit more practical). I've crawled through a bunch of Paxos papers[0],
and at this point it seems to me that the distribution part of the problem is
somewhat solved, but the coordination required is what we're trying to attack
now.

Whenever I see systems that use CRDTs it's usually either a K/V store or a
concurrent editing application. Also, most articles will gloss over the fact
that both types of CRDTs (CmRDTs and CvRDTs IIRC) are actually roughly
equivalent (pure operation-based CRDTs[1]). IMO pure operation-based CRDTs are
just distributed WAL by another name -- and if that's the case then we already
_have_ the solution for this (Paxos-like quorum on the WAL segments).

All this boils down to the thought I've been holding lately that distributed
consistency isn't _hard_ in this day and age -- WAL (which you generally want
for hard-disk-level consistency to start with, if you're not just waiting for
big fsyncs) + synchronous commit (whether quorum or all nodes) is all you need
to achieve a perfectly consistent distributed system, but what _is_ hard is
doing that with low latency and high throughput.

[0]: [https://vadosware.io/post/paxosmon-gotta-concensus-them-all](https://vadosware.io/post/paxosmon-gotta-concensus-them-all)

[1]: [https://www.researchgate.net/publication/320371248_Pure_Operation-Based_Replicated_Data_Types](https://www.researchgate.net/publication/320371248_Pure_Operation-Based_Replicated_Data_Types)
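
For concreteness, the state-based flavor can be sketched as a G-Counter:
each replica increments only its own slot, and merge takes the element-wise
max, so merges commute and tolerate duplicate delivery without any
coordination. This is an illustrative sketch, not code from the cited paper:

```typescript
// Illustrative state-based CRDT sketch: a G-Counter keyed by replica id.
// Merge is commutative, associative, and idempotent, so replicas
// converge regardless of delivery order or duplication -- no WAL
// consensus round needed for this (monotonic) piece.
type GCounter = Map<string, number>;

function increment(c: GCounter, replicaId: string): void {
  // Each replica only ever bumps its own slot.
  c.set(replicaId, (c.get(replicaId) ?? 0) + 1);
}

function merge(a: GCounter, b: GCounter): GCounter {
  // Element-wise max: the join of the two states.
  const out = new Map(a);
  for (const [id, n] of b) out.set(id, Math.max(out.get(id) ?? 0, n));
  return out;
}

function value(c: GCounter): number {
  let total = 0;
  for (const n of c.values()) total += n;
  return total;
}
```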

~~~
kilburn
The benefits we want to obtain from building distributed systems are:

1) Increased availability

2) Ability to scale (better throughput)

3) Lower latency (get the data closer to the client)

As you said, WAL + Consensus solves the consistency problem in distributed
systems. It does however go against _all_ those desirable properties:

1) You lose availability when consensus cannot be reached

2) Throughput is decreased in the face of contention

3) Latency is worse because you need quorum between the separate locations

One way to go about this trade-off is to push for "raw power": better
networks, better clocks, etc. This is a commendable task and advances here
should be celebrated, but there's a wall on the horizon: eventually we will
hit actual physical limits (e.g.: the speed of light doesn't let you lower
latencies any further). Google's Spanner is probably close to those limits
already. What can we do when this is not enough, then?

The other approach is to work on reducing the need for coordination as much as
possible. The paper fits in this realm. What it does is identify a class of
(sub)problem specifications that are solvable without coordination: monotonic
specifications. It shows that CRDTs are monotonic, and I'm not sure whether
any monotonic specification can be redefined as a CRDT (be it operational or
state-based).

What CALM provides though is a "different way to think about the issue". If
you can devise a monotonic specification of your problem, then you know that
an implementation that provides consistent output without any coordination is
possible. Furthermore, if your design is _not_ monotonic, an implementation
_will_ require coordination or the output won't be consistent.

Finally, the paper dabbles in the realm of breaking your problem in monotonic
and non-monotonic "pieces". The monotonic pieces you can implement without
coordination. Non-monotonic pieces either require coordination OR "repair"
(e.g.: send an e-mail to the customer apologizing that their item is actually
unavailable and their order has been cancelled). Once a non-monotonic "piece"
has been handled in this manner, the new piece that includes
repair/coordination can be considered monotonic. Once all your system's pieces
have gone through this process, you are guaranteed to have a "consistent"
output.
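
A toy illustration of the distinction (hypothetical queries, not examples
from the paper): a query whose answers only grow as more input arrives is
monotonic, while one that can retract an earlier answer is not:

```typescript
// Monotonic: over a stream of "order placed" events, the set of
// customers seen so far only grows -- more input can only add answers,
// never retract one, so it can be computed without coordination.
function customersSeen(events: string[]): Set<string> {
  return new Set(events);
}

// Non-monotonic: "has this customer placed no order?" can flip from
// true to false when a later event arrives, which is why consistent
// output needs either coordination or after-the-fact repair (the
// apology e-mail in the example above).
function hasNoOrder(events: string[], customer: string): boolean {
  return !events.includes(customer);
}
```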

An interesting analogy would be the "unsafe" blocks in rust. Rust in general
is not safe because unsafe exists, but the fact that unsafe pieces are clearly
identified makes it easier to reason about program safety as a whole.
Similarly, the non-monotonic pieces of your system are where you risk losing
consistency, whereas monotonic pieces are just not a problem (i.e.: you can
easily get good performance for those parts). By extension, if you can
model your system so that all non-monotonic pieces are out of the critical
path, your system is guaranteed to perform well in those scenarios!

All in all, this is just about giving better tools/thinking frameworks for
system designers to minimize the use of coordination in a distributed system,
which will invariably result in better performance without sacrificing output
"consistency" (i.e.: without violating business rules).

~~~
oggy
That's an excellent explanation, except for a couple of points which I think
can
be misinterpreted. A distributed system with consensus will in practice
provide higher availability than a single-node system, because it provides
fault-tolerance. In fact, fault-tolerance is the primary point of using (non-
Byzantine) consensus. But you are absolutely right that a distributed system
using consensus has worse availability than a distributed system with no
coordination.

Also, when using a system "with consensus" there is often no need to actually
invoke consensus on the read side of the system, in which case you don't have
to pay the throughput and latency penalties. I know you've sort of said this
already, but it might be helpful to mention explicitly.

~~~
sagichmal
> A distributed system with consensus will in practice provide higher
> availability than a single-node system, because it provides fault-tolerance.

I'm not sure this is true. It protects against one class of fault (node
failure) but opens you up to another (network failure). As a distsys engineer
I am increasingly convinced that fault tolerance is not a good selling point
for distribution, the fallacies of distributed computing are real and
difficult to accommodate.

~~~
oggy
That's not really true. With a single, non-replicated server, you're also very
much exposed to network failures. If the server's connection goes down, you're
screwed. Compare this to a Google replication setup of 2 East Cost, 2 West
Coast, and 1 central USA server. The client must only be able to reach two out
of three data centers (and they have to be able to communicate with each
other). That sounds much more resilient to me - and I guess Google agrees,
since they deploy the setup.

------
strictfp
Interesting results.

The results seem very useful when reasoning about coordination in programs,
but I'm questioning how practically useful these results are when applied as
methods.

In the GC example, for instance, you do need to find the garbage nodes. Any
trivial way of re-formulating that search in a monotonic way would likely
destroy performance. My guess is that it's no silver bullet to solving
problems in distributed systems.

------
dasvadanya
> "Rather than micro-optimize protocols to protect race conditions in
> procedural code, modern distributed systems creativity often involves
> minimizing the use of such protocols."

Commercial databases 20 years ago had "shared nothing" designs precisely to
avoid locks. The term "embarrassingly parallel" was introduced in the 1980s!
The technique of multi-version concurrency control is widely used in databases
and was first introduced in the 1980s. Maybe the authors forgot.

------
KaiserPro
Sure, it's easy, if you don't care about speed.

~~~
carb
The post makes clear statements about how distributed consistency following
CALM can regain lost performance by removing the need for coordination,
without adding much complexity.

Keeping architectures monotonic (as defined by the post and the linked paper)
isn't difficult, especially if you allow yourself excess storage and (again,
as discussed in the paper) deal with compaction of your data structures
asynchronously, off the critical path.

