
Notes on Distributed Systems for Young Bloods (2013) - kiyanwang
https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
======
Animats
It's a step backwards that distribution is being dealt with at the application
level. Tandem Computers [1] had distribution at the OS and database level
working well in the 1980s. The technology is still available from HP.

Tandem's NonStop OS is fundamentally different from UNIX-like OSs. UNIX is a
time-sharing system with a file system. NonStop is a transaction system with a
database system. There's very little state outside the database. Programs are
started, do one transaction, and exit. If there's a problem during a
transaction, the entire transaction aborts and does not commit.

Tandem's installed base averaged a system uptime of about 10 years, with the
main cause of failure being human error during hardware maintenance. One of
their people wrote a paper on how to get the uptime up to 50 years. Tandem's
own hardware had heavy self-checking. The goal was to detect errors and abort,
not fix them. A failed transaction triggered a failover to the paired backup
machine, which retried and recovered.

Tandem's main problem was high hardware cost; you needed several of
everything. Also, because of a series of acquisitions, the system had to be
ported from Tandem to MIPS to Alpha to HP-RISC to Itanium to, finally, x86.
x86 lacks the checking hardware, so it's not as good for this.

[1]
[https://en.wikipedia.org/wiki/Tandem_Computers](https://en.wikipedia.org/wiki/Tandem_Computers)

~~~
adamnemecek
This is the first time I'm hearing about Tandem. Are there any other
noteworthy, older computer technologies that are in some ways superior to
current technologies?

~~~
nostrademons
Some examples I can think of:

\- Burroughs B5000. Designed from the start to be programmed solely with high-
level languages. Supported type-tagging directly in the hardware. All code was
automatically re-entrant; problems like PC-lusering, interrupt handling,
concurrent execution, and unsafe accesses to globals simply didn't exist. All
of this in 1961.

\- Smalltalk, Interlisp, and the Lisp Machine development environment (late
70s). All of these had the property that you could modify & customize your IDE
_as it was running_ \- it's sorta like Eclipse or Emacs on steroids, but
rather than a plugin architecture where you have to wade through megabytes of
docs to figure out how to write some code, the IDE was just written in itself
and open for re-use and modification. Smalltalk in particular was very good
about letting you inspect & pull up documentation on any object inside the
running system, and see the runtime values of its fields.

\- Smalltalk also let you search by example, e.g. you could type in the inputs
and the outputs of a function and it would suggest which function you were
thinking of.

\- The Common Lisp condition system (mid 80s) is way more thorough than any
error-handling system used today. A key feature was that error-handlers could
run _in the stack frame that caused the error_; thus, they could inspect the
code and data that caused it, and attempt to correct the problem, letting the
algorithm run uninterrupted. If the handler couldn't do this, you had the
option of dropping into an interactive debugger, fixing the code, and
continuing where you left off.

\- Multics (1964) did away with the distinction between files and memory. It
was as if every file was implicitly mmapped, or alternatively, the process's
memory was just another file.

\- The capability security model (late 90s) fixes many of the security
problems that most systems today have. The key idea is that instead of letting
each subsystem (remote node, process, library, function call, etc.) _know
about_ every other node in the system and then forcing them all to check
whether each request is authorized and comes from an intended sender, each
subsystem can only invoke capabilities that have been _explicitly_ delegated
to it. For example, instead of every process having access to the open()
system call and then the syscall checking the process's uid against the ACLs
for the file, each process would be passed a list of file descriptors on which
it could operate, and _no mechanism would exist_ for accessing anything
outside of that set (a sketch of the idea follows this list).

\- Exokernels (late 1990s) were a reformulation of the role of the operating
system such that the OS is responsible only for securely multiplexing the
hardware, and any abstraction mechanisms are delegated to userspace libraries.
They've been undergoing a bit of a comeback under other names lately, with
virtualization hypervisors like Xen being basically an exokernel, and
unikernels being the libOS.
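
To make the capability idea above concrete, here's a minimal sketch in plain
Python (my own illustration, not from the comment above): instead of handing
code a path plus the ambient authority to call open(), the caller passes an
already-opened file object, which acts as the capability. The report writer
has no mechanism to reach files it was not explicitly given.

    # Capability-style sketch: authority is passed in, never ambient.
    def write_report(out_file, lines):
        """Can only write to the handle it was delegated; it never sees a path."""
        for line in lines:
            out_file.write(line + "\n")

    def main():
        # Only main() holds the authority to open files; it delegates a single
        # write capability and nothing else.
        with open("report.txt", "w") as out:
            write_report(out, ["line one", "line two"])

    if __name__ == "__main__":
        main()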

~~~
nostrademons
Oh, and some examples of things that were largely considered dead & unknown
relics when I first started programming in 2000, but have since undergone
renaissances and are now quite mainstream:

\- Lexical closures were largely unknown to working desktop programmers.
They've been around since Algol in 1960, but Assembly/C/C++/Java don't support
them and those languages had "won" in 2000, so they were quite unknown to
programmers first encountering them in Perl, Python, Ruby, or JS (the first
three of which have strange variable-capture behaviors; see the sketch after
this list). Now basically everyone knows about them.

\- Erlang-style distributed programming was also a lost art; I remember coming
across some mailing list posts by Joe Armstrong c. 2003 where he described
Erlang's strategy as "wait for everyone else to realize this is important."
Now that future is here, and we get articles like this one.

\- Type inference has been around since the 80s, but it was missing from
virtually all mainstream languages, which had either manifest static typing or
dynamic typing. Almost all languages released since 2009 have shipped with
type inference.

\- map/filter/fold functions were also something commonly known to
Lisp/ML/Haskell programmers, but unknown to working C/C++/Java software
engineers. Now they're fundamental to many programming languages, and the
basis for MapReduce (which itself was revolutionary in bringing functional
programming ideas to a mainstream audience).
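
As a small illustration of the "strange variable-capture behaviors" mentioned
above (my example, in Python): lambdas created in a loop close over the
variable itself, not its value at each iteration, which is the classic
late-binding surprise that bit many programmers meeting closures for the
first time.

    # The lambdas all share the same variable i, so they all see its final value.
    callbacks = [lambda: i for i in range(3)]
    print([f() for f in callbacks])           # [2, 2, 2], not [0, 1, 2]

    # The usual fix: bind the current value explicitly via a default argument.
    callbacks = [lambda i=i: i for i in range(3)]
    print([f() for f in callbacks])           # [0, 1, 2]

    # And the map/filter/fold trio from the last bullet, in the same language:
    from functools import reduce
    evens = filter(lambda x: x % 2 == 0, range(10))
    print(reduce(lambda acc, x: acc + x * x, evens, 0))   # 0+4+16+36+64 = 120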

~~~
fanf2
Proper lexical closures were a prominent feature of Perl 5 in the mid 1990s -
hardly unknown!

Type inference dates from 1969 (Hindley) and 1978 (Milner)

------
agentultra
You may think you understand your specification but unless you used a model
checker or proof assistant there will be bugs in your implementation. Unit
tests are not a specification (or are, at best, incomplete). Get used to
predicate calculus, temporal logic, and writing proofs. If you cannot express
a design with a model or proof that can be verified you're purposefully
leaving bugs in your system.

They may be buried more than 40 steps deep in some code path where the stars
have to align just right to trigger them. At the scale where Gustafson's law
makes the trade-offs acceptable, you will find that such errors are no longer
"highly improbable."

Don't accept, "battle-tested," as the hallmark of a reliable system or
algorithm -- all that means is that people willing to tolerate the risk did so
and paid for it for a few years. There are still bugs in that system and they
will affect you or your users (and often both).

~~~
viraptor
> Get used to predicate calculus, temporal logic, and writing proofs.

But at that point, you need to start considering bugs in proofs. You're saying
"Don't accept, "battle-tested," as the hallmark of a reliable system or
algorithm" \- and that would be awesome - I'd like to use a fully verified
system. The reality, however, is that so far we've verified some specific
implementations of specific algorithms and made a massive effort to verify
something like seL4... and that's it (in publicly available systems, anyway).
Unless you're into spending billions, nobody will prove their systems correct.

~~~
agentultra
> But at that point, you need to start considering bugs in proofs.

If there's a flaw in your specification it is much more obvious than in your
code. A high-level specification is usually built up from formal mathematics
to model the execution of a system. If you use a model-checker it will take
the specification of the model of your system and check that the invariants
and properties you specify will hold across all possible executions of that
model. If there is a problem in your specification it will be immediately
obvious, which cannot be said of computer programs.
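
To make that concrete, here is a toy explicit-state model checker in Python
(my own illustration of the idea, not TLA+ or any real tool). The model is two
clients taking a lock with a naive, non-atomic "test the flag, then set the
flag" protocol; the invariant is mutual exclusion. The checker enumerates
every interleaving and prints the shortest trace that violates the invariant,
which is the kind of mechanical, exhaustive search a real model checker does
over your specification.

    from collections import deque

    # Per-client program counters: 0 = test the lock, 1 = set the lock,
    # 2 = in the critical section, 3 = done.
    def replace(t, i, v):
        return t[:i] + (v,) + t[i + 1:]

    def successors(state):
        pcs, lock = state
        for who in (0, 1):
            pc = pcs[who]
            if pc == 0 and lock == 0:      # test: proceed only if lock looks free
                yield (f"c{who}: test ok", (replace(pcs, who, 1), lock))
            elif pc == 1:                  # set: grab the lock (no re-test!)
                yield (f"c{who}: set lock", (replace(pcs, who, 2), 1))
            elif pc == 2:                  # leave critical section and release
                yield (f"c{who}: release", (replace(pcs, who, 3), 0))

    def check():
        init = ((0, 0), 0)
        seen, queue = {init}, deque([(init, [])])
        while queue:
            state, trace = queue.popleft()
            pcs, _ = state
            if pcs[0] == 2 and pcs[1] == 2:        # mutual exclusion violated
                return trace
            for label, nxt in successors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, trace + [label]))
        return None

    trace = check()
    print("invariant violated:" if trace else "invariant holds", trace)

Running it prints a four-step interleaving in which both clients pass the test
before either sets the lock -- the "stars align" kind of bug that is easy to
miss by inspection and trivial for an exhaustive checker to find.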

In other words if you cannot see the error in your specification what makes
you think you should be implementing it?

Now there is a long-held myth that we cannot verify the entire software
stack... so what's the point?

You do not have to verify the entire software stack. A specification of a
critical component is often sufficient. Take Paxos as an example: the author,
Leslie Lamport, wrote the specification in TLA+. You can follow the
mathematics to convince yourself of its efficacy or you can trust that the
model checker is implemented correctly and will be sufficient for your
purposes. The same is implied for a proof assistant. The point is that you do
not need to assume these things are true: you can _verify_ them by hand if you
have to... the math will check out.

The same is true for other critical components of distributed systems. A
single and even multi-threaded application on a single computer is difficult
enough to reason about (what with multiple cores executing tasks out-of-order
and sharing different levels of cache... it's almost like a tiny distributed
system itself). Humans make terrible predictions of performance and
correctness that fail time and again when they are benchmarked and profiled.
It gets magnitudes worse in large, distributed systems and specifications are
only a tool to manage thinking in those terms.

It's actually a tool for _simplifying_ the development of such complex
systems. I don't trust a consensus protocol or a lock-free algorithm if it is
not accompanied by some sort of verifiable specification... not some hand-
waving English prose, but a hard formal mathematical specification.

Do we still use these under-specified systems? Yes, sometimes. The trade-off
being that the risk is acceptable and we design ways to tolerate failures as
best as we can. However I will prefer a system that provides specifications
for its key components.

On the myth of cost... it's true, thinking hard about your system up-front has
a cost to it. Where and what you choose to specify will have some correlation
to how much time you choose to spend. However, the other component of the
equation to consider is how critical it is to get this system correct. Another
way to think about it is to consider the worst thing that could happen if your
design contains a flaw.

Introducing such a flaw is obscenely easy in distributed systems. Our brains
are not well enough equipped to imagine a model of our system and trace its
execution and all of its state 49 steps deep. When confronted with such
information many of us are not even capable of seeing the forest for the
trees! We think, "Bah, that could only happen once in 250,000 executions...
it's not worth worrying about!" We don't realize that when you make that bet a
couple of million times a minute, you will see that error quite often. Finding those
kinds of traces postmortem is incredibly difficult... and why waste the time
searching for that error when spending the extra time up-front will save you
the headache down the road?

Yes it requires more work and it costs more at the beginning of a project...
but if you cannot afford to spend it then know what you're getting yourself
into and act accordingly.

I honestly believe that these mathematical tools can be taught to
intermediate-to-advanced programmers who could use them to check their designs
and help them write better software. It may seem intimidating but it's just
_booleans_: there are only two possible values for any expression in
predicate logic! In all seriousness, the tools that are available today to do
this work are becoming usable by mere mortals and are not the eldritch
scrawls of snooty ivory-tower academics.

You don't need to spend billions. You can learn TLA+ in a couple of weeks and
start writing specifications in a month. There's a fairly slim tome called,
_Programming in the 1990s_ that demonstrates a practical technique for proofs
that is digestible by programmers.

I say, "don't accept 'battle-tested' as the hallmark of a reliable system,"
because it says nothing about the underlying, undiscovered _errors_ in the
implementation (I made the mistake of using the word, _bug_ , earlier... a
habit I am trying to correct). All that buzz-word signifies is that the
developers have attempted to implement a white-paper as best they could and
wrote unit tests for the code they care about and used their customers to find
the errors and edge-cases they didn't initially consider. While that _is_
valuable to you after the fact you must be wary that you'll probably find some
of those errors yourself. Hopefully the paper they based their critical
sections on published a specification or proof. That's always a bonus... but
you need to check these things.

Worse, someone might try to sell you on their distributed mutex based on their
well-reasoned essay describing said algorithm... don't trust it unless you can
check the model or understand the proof. It may sound reasonable but engineers
don't build sky-scrapers without blueprints.

~~~
viraptor
> In other words if you cannot see the error in your specification what makes
> you think you should be implementing it?

Since we're talking about proving correctness, nobody can prove they didn't
miss an error - either in the specification or implementation. Even wikipedia
provides a trivial case of that in the article on TLA+
[https://en.wikipedia.org/wiki/TLA%2B#Liveness](https://en.wikipedia.org/wiki/TLA%2B#Liveness)

Would you never miss the fact that you didn't specify that the clock must
tick? Would you never miss an error like that in a system orders of magnitude
more complicated?

So by extension - if you cannot ensure you see an error in your
specifications, what makes you think you should write them? I'm not saying the
specification will be incorrect - you can check that of course. I'm saying it
can be incomplete.

It's the same as a test. You write more of them, you catch more bugs. You
specify more, you verify more. In either one you can't prove you covered all
the cases you implicitly meant to.

Honestly, at this point, I'd rather take an implementation that's battle-
tested by a large company for months than a verified implementation which
hasn't been used by anyone apart from the authors in their lab. (and yeah,
that may not be a popular view with some people :) )

~~~
nickpsecurity
"So by extension - if you cannot ensure you see an error in your
specifications, what makes you think you should write them?"

Your standard here is that it must achieve perfection or we don't use it at
all. A realistic approach is to use anything proven to increase correctness,
reliability, security, whatever as much as one can justify. Decades of work in
formal specification show they catch all kinds of errors. The most recent
example, which I think agentultra was referring to, was Amazon's use of TLA+
catching problems that would've required 30 or more steps to hit in testing.
Easily. They caught so many problems & improved their understanding enough
that the team is totally sold on it. Same kinds of results as in the
safety-critical industry and high-assurance security. Nobody doubted it was
helping them.

So, such empirical evidence showing all kinds of problems caught by the method
plus benefits means it's a good method to use if one can. The complexity of
distributed systems increasing potential problems means one should be
_increasing_ their tooling rather than decreasing it. For another good tool
there, check out the Verdi system for verifying distributed systems against
failures. As usual, applying it already caught problems in real-world stuff.

~~~
viraptor
> Your standard here is that it must achieve perfection or we don't use it at
> all.

Quite the opposite, actually. It was just a hyperbolic response to "why
implement if you don't have a proven specification" \- which you also
explained is silly.

I completely agree - if you have time, resources and proper setting for it -
go for proofs and verification. Most software as we use it today cannot afford
and doesn't need this. The bugs are either acceptable or not, but they're
never "purposefully [left] in your system" as the OP claimed. (which was the
phrase that really annoyed me / caused the response)

I'm quite excited about the progress in software verification. But outside of
very special cases today, I don't think it's worth it.

~~~
agentultra
> The bugs are either acceptable or not, but they're never "purposefully
> [left] in your system" as the OP claimed. (which was the phrase that really
> annoyed me / caused the response)

It seems you are really annoyed that I give responsibility for "bugs," or as I
like to call them: _errors_, to programmers. If you do not use a
specification for your lock-free mutex or consensus protocol then I posit that
you are willfully choosing not to use a tool that can help you avoid errors in
your design and that you should conduct yourself accordingly. The original
article provides many good suggestions for coping with this situation.

And while I'm on the subject it is worth pointing out that while the
specification might check out the implementation may still contain errors.
This is normal. The specification gives you a compass that helps you to track
down errors in your implementation. Once found it gives you another invariant
to add to your specifications and another assertion or contract to add to your
code.

These are good tools to have and as I and others have suggested one does not
need to "verify" the entire system to make use of specifications. They are a
great design tool in the programmer's tool belt that help us to write better
software. Distributed systems are hard enough and we need all the help we can
get.

~~~
nickpsecurity
"If you do not use a specification for your lock-free mutex or consensus
protocol then I posit that you are willfully choosing not to use a tool that
can help you avoid errors in your design and that you should conduct yourself
accordingly."

Sort of. Remember that, if they work for companies, programmers' job is not to
write bug-free software. It's to translate requirements into software within
constraints imposed by management. They should put in what quality they can.
They often don't have time to spec out stuff because management won't allow
it. "Ship, ship, ship!" It's why I typically recommend them re-use previously-
proven solutions for stuff like you mentioned.

So, they certainly are leaving errors in intentionally if they avoid tools
that would catch them. Yet, it's not always their fault. It's also not always
a bad thing if we're talking hobbyists doing stuff for fun where they and/or
their community accept some occasional bugs. At the opposite end, people would
get fired at Altran/Praxis for not doing what they could for correctness. To each
their own based on their needs or situation.

~~~
agentultra
Hence the caveat: _conduct yourself accordingly_.

Not using specifications doesn't make you a bad programmer,
but you must be aware of your own shortcomings and plan for them. If you're
not allowed to use specifications you're being told you're not allowed to
think. If that's the case you know there are going to be errors produced by
the team either in the design itself or in the implementation.

------
rdtsc
> Implement backpressure throughout your system.

This is a subtle one and gets lost among CAP theorem talk, but it is
important.

Just recently I had to struggle with adding backpressure. Some things were done
as just sending a message (a "gen_server:cast" in Erlang). The sender shoved
all 300K+ messages into the mailbox of the receiver, which could not process
them in time because of a bottleneck and a bug. The receiver's memory usage
grew to 120GB before it eventually restarted.

The simple solution was just to switch the gen_server:cast to a
gen_server:call, i.e. make the receiver acknowledge, and the sender wait for
the acknowledgment before sending another message. (As an aside, that's the
nice benefit of Erlang/Elixir there: it was really a 2-line change. If it were
some kind of serialization library + RPC socket layer, it would have been a
lot more code to write.)
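
The same ack-before-send idea can be sketched in any language; here is a
minimal illustration in Python (my own example, not the Erlang code in
question): an unbounded mailbox lets a fast sender bury a slow receiver, while
a bounded queue blocks the sender once the receiver falls behind, which is the
backpressure.

    import queue
    import threading
    import time

    mailbox = queue.Queue(maxsize=100)   # bounded: put() blocks when full

    def receiver():
        while True:
            mailbox.get()
            time.sleep(0.001)            # simulate the slow, buggy bottleneck
            mailbox.task_done()

    threading.Thread(target=receiver, daemon=True).start()

    for n in range(10_000):
        # With an unbounded mailbox this loop would finish instantly and the
        # receiver's queue (and memory) would keep growing; with maxsize=100
        # the sender is throttled to the receiver's pace.
        mailbox.put(f"message {n}")

    mailbox.join()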

An obvious corollary of this is "always test on production workload and
conditions, not just your laptop". What had happened locally was that a smaller
test dataset fit in memory, and at 10GB or so the problem wasn't noticeable. So
it seemed like everything was working fine.

But speaking of the CAP theorem, this is always great fun:

[http://ferd.ca/beating-the-cap-theorem-
checklist.html](http://ferd.ca/beating-the-cap-theorem-checklist.html)

\---

Beating the CAP Theorem Checklist

Your ( ) tweet ( ) blog post ( ) marketing material ( ) online comment
advocates a way to beat the CAP theorem. Your idea will not work. Here is why
it won't work: ...

\---

~~~
alanning
Thanks for sharing this. A few questions from someone interested in learning
how to use BEAM-based systems in production:

* How did you debug the issue?

I know Erlang in Anger [1] is kind of written just for this but it feels
pretty intense to me as a newbie. I don't know enough to tell whether the
issues there are only things I'd need to worry about at larger scale but I
found it pretty intimidating, to the point where I have delayed actually
designing/deploying an Erlang solution. Designing for Scalability with
Erlang/OTP [2] has a chapter on monitoring that I'm looking forward to
reading. Wondering if there's some resources you could recommend to get
started running a prod Erlang system re: debugging, monitoring, etc.

* How do you decide when to use a message queue (like sqs, RabbitMQ) vs regular processes? Do you have any guidelines you could share or is it more just, "use processes when you can; more formal MQ if interfacing with non-beam systems"? I struggle since conceptually each sender/receiver has its own queue via its mailbox.

1\. [https://www.erlang-in-anger.com/](https://www.erlang-in-anger.com/)

2\.
[http://shop.oreilly.com/product/0636920024149.do](http://shop.oreilly.com/product/0636920024149.do)

~~~
rdtsc
Good question. So first I noticed in the metrics dashboard (so it is important
to have metrics) that the receiver never seemed to have gotten the expected
number of messages.

Then noticed the count of messages was being reset. Suspected something
restarted the node. Browsed through metrics, noticed both memory usage was
spiking too high and node was indeed restarting (uptime kept going up and
down).

Focused on memory usage. We have a small function which returns the top N
memory-hungry processes. Noticed a particular one. Noticed its mailbox was tens
of gigabytes.

Then used recon_trace to trace that process to see what it was doing.
recon_trace is written by the author of Erlang in Anger. So did something like:

    
    
        recon_trace:calls({my_module, '_', '_'}, 100, [{scope, local}]).
    

The dbg module is built-in and you can use that, but it doesn't have rate
limiting so it can kill a busy node if you trace the wrong thing (it floods you
with messages).

So I watched what it was doing and noticed that a particular operation was
taking way too long. It was because it was doing an O(n) operation instead of
O(1) on each message. On a smaller scale it wasn't noticeable, but when it got
to 1M-plus instances it was bringing everything down.

To test the fix, I compiled a .beam file locally, scp-ed it to the test
machine, hot-patched it (code:load_abs("/tmp/my_module")), and confirmed that
everything was working fine.

On whether to pick a message queue vs regular processes: it depends. They are
very different. Regular processes are easier, simpler, and cheap, but they are
not persistent. So if your messages are like "add $1M to my account" and the
sender wants to just send them and not worry about acknowledgment, then you'd
want something with very good persistence guarantees.

------
critium
I would add to the list:

* If you have transactions, keep your transactions to a single machine/instance/db as much as possible. Multi-machine or software transactions are the LAST solution you should try.

* Pay attention to payload sizes. Make sure you don't come close to saturating the network, which leads to weird "app is slow" problems.

* Design for testability and diagnosability of production systems. If this is java, use JMX extensions EVERYWHERE for EVERYTHING.

* Time (and timing) is your enemy. Make it your friend.

EDIT: Side note. IMO, JMX extensions are one of the most under-appreciated
things that Java and JVM devs have but keep forgetting about, and they're so
powerful.

~~~
AdieuToLogic

      IMO, JMX extensions are one of the most
      under-appreciated things that Java and
      JVM devs have but keep forgetting about,
      and they're so powerful.
    

This is an excellent point with which I cannot agree more. When doing
distributed systems on JVMs, I almost always reach for the excellent
Metrics[0] library. It provides substantial functionality "out of the box"
(gauges, timers, histograms, etc.) as well as exposing each metric to JMX. It
also integrates with external measuring servers, such as Graphite[1], though
that's not germane to this post.

0 - [http://metrics.dropwizard.io/3.1.0/](http://metrics.dropwizard.io/3.1.0/)

1 - [http://graphiteapp.org/](http://graphiteapp.org/)

~~~
twic
I'd hesitate to call Metrics 'excellent'. It does quite a bit of processing on
numbers before it emits them - averaging over timespans and so on - so if
you're feeding them into something like Graphite which does further
processing, you're averaging twice, and getting the wrong numbers.

If you're just going to print the numbers out and look at them, it's very
handy. But if you want to handle them further, which you really do, I'd avoid
it. Put individual measurements in your logs, then scrape them out with
Logstash or something and send them to Graphite for aggregation.

~~~
AdieuToLogic
Use of Metrics functionality is going to vary based on need. So if a system is
using something such as Graphite, then the measurements sent should reflect
that. This is independent of a particular library IMHO.

So when you say:

    
    
      ... if you're feeding them into something
      like Graphite which does further processing,
      you're averaging twice, and getting the
      wrong numbers.
    

This would be an issue with sending Histogram, Meter, and/or Timer values, but
I'd categorize that as a design defect. Using Metrics to send Gauge and
Counter data to a Graphite-esque system shouldn't be a problem at all. For
systems which do not have a metrics aggregator or when some need to be exposed
to external agents (such as client systems), using the types provided by
Metrics can be very useful.

As for using log files to "scrape out" system metrics, I realize this is a
common practice, yet it is one which I am fundamentally against. What I've
observed when taking the "put it in the logs and we'll parse it out later"
tactic is that a system ends up with sustained performance degradation due to
high log output, as well as ops having to incorporate yet another system just
to deal with profuse logging.

Treating system metrics as a first-class concern has shown itself to be very
beneficial in my experience. Persisting them is a desirable thing, agreed, and
a feature point fairly easily met when metrics are addressed separately from
logging.

------
lightbendover
I feel like tracing is the biggest thing missing from the article, especially
since the author takes a hard stance on logging. Tracing is one of those
things that will eventually save the day for you (ops released a bad config on
some load balancer you don't know about -- good luck) and if you don't
approach your design with tracing in mind, it will be very hard to add it
later.

~~~
alanning
The author doesn't explicitly say "tracing" but he specifically calls out
Dapper and Zipkin in the section, "“It’s slow” is the hardest problem you’ll
ever debug."

Dapper -
[http://research.google.com/pubs/pub36356.html](http://research.google.com/pubs/pub36356.html)

Zipkin - [http://engineering.twitter.com/2012/06/distributed-
systems-t...](http://engineering.twitter.com/2012/06/distributed-systems-
tracing-with-zipkin.html)

------
jondubois
Those are really good points. Particularly the one about how there are so few
open source systems built to scale.

I think this is starting to change (slowly). I run such a project called
SocketCluster and we recently released some Docker images and Kubernetes
config files to make running SC on a distributed Kubernetes cluster super
easy.

See
[https://github.com/socketcluster/socketcluster#introducing-s...](https://github.com/socketcluster/socketcluster#introducing-
scc-on-kubernetes)

I think there will come a point when most open source stacks will be designed
to run on one or more orchestration systems like Kubernetes, Swarm and/or
Mesos.

The main problem at the moment is lack of skills in that area - There is
currently a big divide between DevOps engineers and software engineers.

As a long-time open source developer, I felt somewhat out of my comfort zone
learning Kubernetes but it was well worth it.

~~~
simplify
Based on your DevOps experience, what's your opinion of Nix[1] and/or its OS?
I keep hearing it solves a lot of pain for deploying systems, but I don't have
any DevOps experience to evaluate it.

[1] [https://nixos.org/nix/](https://nixos.org/nix/)

~~~
jondubois
I haven't used Nix. Kubernetes runs pretty well on all the Linux distros that
I've come across - I guess this is because K8s itself runs most of its
components inside Docker containers so K8s isn't generally prone to
configuration issues.

That said, you still have to deal with certain host-level configs like file
limits and such - So I guess for more advanced use cases something like NixOS
could come in handy - I just haven't needed it yet - I'm running most of my
stuff on Amazon EC2 - So I just make sure that my base AMI has the correct
host configs for the kinds of use cases I deal with.

------
jamesblonde
I would recommend that people interested in learning about distributed systems
take this course on edX starting next month:
[https://www.edx.org/course/reliable-distributed-
algorithms-p...](https://www.edx.org/course/reliable-distributed-algorithms-
part-1-kthx-id2203-1x)

~~~
pejrich
KTH (Stockholm) and they're not using Erlang. Blasphemy!

~~~
jamesblonde
And it's Joe Armstrong's supervisor giving the course! It's using kompics,
which is also actor-based but runs on Scala and Java. The course will use
Scala.

------
rdtsc
> One last note: Out of C, A, and P, you can’t choose CA.

I have my own, which I think is not obvious:

"CAP doesn't mean your system is automatically CP or AP; it also means it
could be neither"

~~~
davidw
Yeah, these things are often like that. "Fast, Cheap, Good: pick two" \- some
people manage to only get one.

------
panic
Some good previous discussion here:
[https://news.ycombinator.com/item?id=5055371](https://news.ycombinator.com/item?id=5055371)

------
AdieuToLogic
Regarding using timeouts:

I recommend using deadlines established at the initiation of a workflow
instead of timeouts. This constrains all collaborations as a whole instead of
letting multiple discrete timeouts add up.

For example, specifying:

    
    
      Customer login must complete within 10 seconds of
      initiation.
    

Is preferable to:

    
    
      Read customer record within 5 seconds.
      Verify passphrase within 1 second.
      Update account access within 5 seconds.
      Generate response within 2 seconds.
    

Note that the cumulative seconds in the second example exceed the 10-second
threshold of the first. This is intentional, as it exemplifies the very common
situation of "this should never take longer than X" timeout specifications
ending up violating the overall system response constraints. It is also
indicative of developers having to guess at reasonable maximums within a
workflow because they lack context. By providing a deadline downstream, those
components need only be concerned with the question "do I have any time left
to do what I need to do?"
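
Here is a minimal sketch of that deadline propagation in Python (my own
illustration; the step functions are made-up stand-ins). The workflow receives
one absolute deadline at initiation, and each downstream step derives its
timeout from whatever time actually remains.

    import time

    class DeadlineExceeded(Exception):
        pass

    def remaining(deadline):
        """Time left before the workflow's absolute deadline, or fail fast."""
        left = deadline - time.monotonic()
        if left <= 0:
            raise DeadlineExceeded("no time left for this step")
        return left

    # Stand-in steps: each receives only the time that is actually left.
    def read_customer(cid, timeout):          return {"id": cid}
    def verify_passphrase(rec, timeout):      return True
    def update_account_access(rec, timeout):  return None
    def generate_response(rec, timeout):      return {"ok": True, "id": rec["id"]}

    def login(customer_id):
        deadline = time.monotonic() + 10.0   # "login must complete within 10 seconds"
        record = read_customer(customer_id, timeout=remaining(deadline))
        verify_passphrase(record, timeout=remaining(deadline))
        update_account_access(record, timeout=remaining(deadline))
        return generate_response(record, timeout=remaining(deadline))

    print(login("c-42"))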

For an example of both deadline and timeout types, see:

[http://www.scala-
lang.org/api/current/#scala.concurrent.dura...](http://www.scala-
lang.org/api/current/#scala.concurrent.duration.Deadline)

[http://www.scala-
lang.org/api/current/#scala.concurrent.dura...](http://www.scala-
lang.org/api/current/#scala.concurrent.duration.Duration)

------
agentgt
I have general and abstract problems for large systems that I haven't really
figured out easy solutions for other than using metrics and logs:

* Downstream dependency failures and isolation (circuit breaker pattern sort of helps but what do you set the thresholds to?)

* When to split resources to different machines (ie resource contention) (if you do it you then have more coordination issues)

* Alerting intelligently without causing fatigue.

* Reducing latency at scale (this I haven't done well, but some systems hit multiple duplicate services at the same time and pick the fastest response, a la Google).

* The index problem... if we add the index we will improve read speed but at the cost of write speed. Some systems need a careful balance, or one that is dynamic depending on the environment (time of day).

* Queues are easy to understand and implement but can fail in epic catastrophe. The alternatives are often hard to understand and implement (ie the backpressure problem).

~~~
existencebox
I am by no means an expert, just a guy who has written these sort of systems a
bit. These are my answers, take them with a massive grain of "pragmatism"
salt.

(EDIT: realized how long this was after writing it. Christ; apologies in
advance for this perhaps literal response to a partially rhetorical question?
Hope this wall of text doesn't fall on anyone.)

\- How to set thresholds: "Find out!" Break your systems. Introduce failures.
Track the circuit-breaker threshold metric, find out what its distribution is,
and see what standard deviation of outliers correlates with (or which you
consider to be) an error.

\- How big is the system going to be? Is perf going to be a big consideration?
If you find these to be "yes" then you probably _should_ prematurely optimize
a LITTLE, at least in the sense of "my system has these natural coupling
boundaries, if I split them there I'll get a very intuitive way to scale each
component individually" (obviously things like nosql on one platform, sql on
another but then you can dig deeper with tables built in such a way that you
could split them out if they became problematic, have an understanding of what
you can use as partition keys, etc) Coordination issues are certainly one of
the doozies of distributed systems, and I can't give a fix-all; you'll need to
think about coordination deeply. However, for most resource contention I've
seen, it's more an issue of just having a good story ahead of time for how a
given component will scale, and the actual coordination usually happens at a
meta-level to that.

\- See #1. It's a learning experience for every system. Obviously there are
KPIs that are easy to alert on ("half my nodes just went down") but those
aren't the interesting ones. Learn your system; learn the telemetry behavior;
make alerts that let you make an ACTIONABLE response to the occurrence.
(Emphasis on actionable, since non-actionable alerting is in my experience the
leading cause of ignoring it, whereas flappy/loud alerting that IS actionable
indicates that the system may need to be more robust/fixed.)

\- There are a lot of solutions for this, the one you cited is a thing but
mostly for issues like geolocality/really unfortunate scheduling/contention.
At the risk of falling back onto a generic answer, if latency starts being a
real issue, some form of caching (and locality) are two things I might look
into.

\- Going to keep coming back to "know your workload". You can certainly
optimize it to be read/write friendly. The over-time dynamism is an interesting
point since, candidly, I haven't run into that sort of situation in my day to
day. I know that within certain systems (MSSQL) you can hint to adjust the
query plan in certain ways/have it not use cached plans if you realize your
profile has shifted, and this gives you SOME power, but it's starting to get
into things I can only hand-wave about at best. To end this less than
satisfactory answer, I tend to err on "careful balance with semi-regular
reassessment based off of movement in core KPIs".

\- EVERYTHING can fail in an epic catastrophe. Depending on the semantics of
how you build your queues, however, (and tables and constraints and etc.) you
can handle that a little better. Think pathologically from day one: "if this
DB gets corrupted, if this node entirely starts returning garbage, how do we
recover?" and find solutions that build recovery into normal function. This is
another super hand-wavy answer, but more because it goes into the entire space
of distributed system design and I would likely make a fool of myself both in
general and in this small space if I tried to move past merely "what has
worked for me." However, I will warn: "the backpressure problem" isn't going
to go away just because you make a more robust pipeline. In the same way that
you've found benefits of circuit breakers as a design pattern, backpressure,
even if it's passively observed backpressure, is a broadly useful tool for the
general case of "I need my components to be contextually aware to not shoot
each other in the head." (That being said, if you can design WITHOUT needing
more moving parts/backpressure/etc, _do it_.)

My one big takeaway, if I could give one, is that the more of these general
and abstract problems you keep out of your implementation through its
simplicity, the more pain you save in having to answer and maintain them; at
the risk of stating the obvious. (I'm almost sure I said some things here where
dist. systems engs more senior than me will point out terrible design choices,
but I hope they do, since then I'll get to learn from their questions as well
:) )

~~~
agentgt
I appreciate your answer. In large part what you have said is what I have done
in the past. The question/issue is: can this ever be automated or made easier
without reliance on expensive PaaS services?

Right now it seems the easy answer in general is... just monitor and deal with
shit but that gets old fast. Particularly if you find out you were wrong in
some architecture assumption it can become very expensive to change. These
things should be foreseen but due to time, resources and economic dynamics
they are not.

In large part this is why there has been so much success with PaaS... let
someone else figure out how to scale but again these services lock you in and
generally are expensive (in terms of scale).

You would think that such a closed system where you can tinker and adjust a
few settings (eg thread pools, timers, etc) would be an easy problem for
machine learning but I have yet to see many companies employ techniques like
these for scaling.

~~~
existencebox
It's funny you mention PAAS; in my last two jobs I've been respectively
devving for a large on-prem datacenter, and an entirely PAAS/IAAS team. Both
had their own pitfalls, and both can be supplemented with automation in some
areas, it really is to some extent an apples to oranges comparison, and
automation is a boon in both environments.

re: the arch assumptions, man you've touched on a can of worms there (things
should be foreseen but aren't due to time). I have so many things to say about
this, but in brief: the longer you do this "sort of stuff" the faster you are
at seeing common patterns/pitfalls and avoiding them, and to that end is why I
recommend SO STRONGLY working with someone who is a seasoned expert at this,
since a lot of that intuition is very domain-specific/voodoo-esque. Now, this
will help you in the future, but it won't help you now; in the absence of
proper expertise/time/resources to determine the arch, I take two approaches:
"thinking about it really hard", alongside my team. If I'm not an expert, and
my peer isn't an expert, maybe the two of us together can manage to catch
each other's mistakes in logic better than either of us alone. Pair
architecting can really facilitate working through a crunch, additionally if
you do so with the philosophy of "assuming any of these decisions are CRAP, do
we have a path to remediating them without pain", such that even if/when you
make the wrong choices, you can see a direction to rectify it. I realize this
is nebulous but I hope it conveys at least the shape of an approach.

Re: the success of PaaS, absolutely, however, it incurs pitfalls commensurate
with its benefits. Take Azure SQL for instance. 1 TB DB limit. It's a
FANTASTIC managed service, but for many truly big data scenarios that's an
unworkable cap. Additionally, if something really deep breaks, you may have a
whole other support chain to go through rather than in-house expertise. (I do
not mean to direct that latter statement explicitly at azure, I've seen it
across PAAS/IAAS providers.) WRT lock-in, I've taken to aggressively avoiding
*AAS systems that I don't have a good migration story for upfront (paramount
to my statement re: if this is a terrible choice, have a way out); use these
tools and a knowledge of the costs/benefits to choose the components that solve
the problems you want. We have a toolchest and can often pick and choose the
most appropriate ones.

Re: ML: AML does this to some extent; endpoint scalability and custom ML
modules internally that you can parameterize to dynamically tune some
functionality. It certainly is not a perfect solution but it does some neat
things. I've found some success using it, but I by no means intend to paint it
as a silver bullet; however, I share your sentiment that in the long run these
capabilities are very useful, and more plug-and-play/flexible hosted ML seems a
natural next step with the direction we've been going in to "commoditize"
some aspects of data analytics.

Disclaimer since I mention some MS products; I am an MS eng but all of this is
just my own ramblings as a dev. Any advocation/caveats I say are only my own
experiences, and I am not even a true expert on all of our in-house platforms
so much as I am a consumer.

------
bluejekyll
This is a superb list, but I have a nitpick:

> Distributed systems are different because they fail often

And:

> design for failure

While this is important, I think it's better to come up with fail-fast
solutions, and some type of backup recovery.

Example: when designing a push notification system, what do you do when the
push fails? I tend to then rely on pull, but then you have two competing
operations... So maybe it's better to always initiate push operations with a
pull, simplifying the code and handling both cases plus the failure mode.

Also, most networks do not fail often; if they did we would have serious
problems. See the update to the CAP theorem that discusses this: failure will
happen, but for most programming not nearly as often as we try to deal with
it.

Computers don't crash that often; local networks are amazingly stable; and
power outages are rare. So don't waste your time over-engineering for these
things; spend your time making simple design decisions that handle these cases
and then move on.

Example: split brain. Focus on discovering that you are in a split-brain
situation, then fail fast and notify until the network recovers. This is far
simpler to deal with than, say, merging vector-clock-based records after the
fact. (Obviously some systems can't deal with any downtime, but my experience
has shown that for a local network this is generally a safe call.)

My basic point is: keep your code simple; trying to deal with every error
condition will create brittle, rarely tested code. You want as much code to be
consistent across all executions as possible. Also, make your code testable
without needing to spin up multiple processes; this means writing protocol
utilities that accept generic readers and writers as opposed to IO channels.
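
A small sketch of that last point (my example, in Python): the framing logic
below takes any file-like object, so it can be unit tested with an in-memory
buffer and no sockets or extra processes.

    import io
    import struct

    def write_frame(writer, payload: bytes):
        """Length-prefixed framing against any binary writer."""
        writer.write(struct.pack(">I", len(payload)))
        writer.write(payload)

    def read_frame(reader) -> bytes:
        (length,) = struct.unpack(">I", reader.read(4))
        return reader.read(length)

    # Tested without spinning up a single process or socket:
    buf = io.BytesIO()
    write_frame(buf, b"hello")
    buf.seek(0)
    assert read_frame(buf) == b"hello"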

Edit: one more thing: make sure you focus on being able to release as quickly
and dependably as possible to fix bugs/errors. People often leave this to last,
and then realize that they have a system that is impossible to manage.

~~~
brianolson
Given sufficiently many operations, an operation that fails rarely will fail
frequently. So if something fails 'one in a million' times, and you do ten
million things, it'll fail ten times. Large distributed systems press on this.
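
Working that arithmetic out (my addition, in Python): with a per-operation
failure probability of one in a million and ten million operations, the
expected number of failures and the chance of seeing at least one are:

    p, n = 1e-6, 10_000_000
    expected = n * p                      # 10 expected failures
    p_at_least_one = 1 - (1 - p) ** n     # ~0.99995, i.e. near certainty
    print(expected, p_at_least_one)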

~~~
bluejekyll
I don't disagree with this, but I think making this the primary issue that you
solve for is going to significantly slow down development.

------
zianwar
You can also find a handful of useful blogs about DS in this Quora thread:
[https://www.quora.com/What-are-some-good-blogs-about-
distrib...](https://www.quora.com/What-are-some-good-blogs-about-distributed-
systems)

------
bogomipz
Can someone explain to me what this means:

"Suppose you’re going from a single database to a service that hides the
details of a new storage solution. Have the service wrap around the legacy
storage, and ramp up writes to it slowly. With backfilling, comparison checks
on read (another feature flag), and then slow ramp up of reads (yet another
flag), you will have much more confidence and fewer disasters."

Does the author mean to say:

"Have the service wrap around the legacy storage, and ramp up writes to the
new system slowly"? It reads like he's saying ramping up writes to the legacy
system.

Also I am confused by his use of the word "backfilling"; what would be
backfilled? IOPS to the new system after IOPS to the legacy system? Is that
the idea?

This was a good read. Kudos to the author for putting it out there.

~~~
Falell
If you want to migrate from system A to system B you don't start writing all
data to B immediately. You don't know if B can handle the load, if your code
is right, if double writing will cause an unacceptable slowdown etc. Instead
you mirror a % of all writes from A to B and increase it over time.

Now you have a problem because B only has a % of the data. To fix this you
need a separate task that periodically identifies records in A missing from B
and syncs them. This is the "backfill" \- it isn't part of your running
application and is only needed for the duration of the migration.

Once you think you have sufficient data in B you can start checking your work:
run a % of your reads from both A AND B and make sure the results match.

Once that matches for a while you can start making a % of reads from B without
ever contacting A. When this reaches 100% you stop writing to A.

Each % here is a separate feature flag.
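
A rough sketch of those flags in Python (my own illustration, with made-up
names): one wrapper sits in front of the legacy store, with separate
percentages for mirrored writes, read comparisons, and reads served from the
new system. Ramping each flag independently is the migration.

    import random

    class MigratingStore:
        def __init__(self, legacy, new, write_pct=0, compare_pct=0, read_pct=0):
            self.legacy, self.new = legacy, new
            self.write_pct, self.compare_pct, self.read_pct = write_pct, compare_pct, read_pct

        def put(self, key, value):
            self.legacy[key] = value                    # legacy stays authoritative
            if random.random() * 100 < self.write_pct:
                self.new[key] = value                   # mirror a % of writes to B

        def get(self, key):
            if random.random() * 100 < self.read_pct:
                return self.new[key]                    # serve a % of reads from B
            value = self.legacy[key]
            if random.random() * 100 < self.compare_pct and self.new.get(key) != value:
                print("mismatch on", key)               # comparison check on read
            return value

    # The backfill is a separate task that copies whatever the mirror missed:
    def backfill(legacy, new):
        for key, value in legacy.items():
            new.setdefault(key, value)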

~~~
bogomipz
Yep, makes perfect sense, I've handled this in quite a similar fashion. Thanks
for clarifying.

------
yumaikas
Interesting that he considers logs to not be worth the investment. I can
appreciate his argument and all, and logging seems pretty hard to get right,
but still.

Thankfully, my current job is a monolith database and webserver on a box, so
I'm still in the realm of pointers for now.

~~~
tristor
You misunderstood what he's saying here. He's not saying logs aren't worth the
investment, he's saying that logging is rarely done usefully/correctly, so
don't have an over reliance on log output. Instead, you should track and rely
on system-wide metrics.

Of course you should put the time in to do logging, but at every point of
control over what the contents of the logs are you should be ensuring that
what you log is information about the state of the system, not picking a class
of errors you think is possible at the time and logging only that. If you've
worked with any "enterprise" type Java software that ends up in a distributed
system, you'll know exactly what he means. Your logs end up filled with huge
XML outputs that are completely irrelevant about one particular error that
actually doesn't really affect system state overall, while the things that
plague you in a global way are almost impossible to hunt down without building
a test cluster and running captured requests through it.

Logging request/response, logging process steps by request ID, etc. are much
more useful. Having the IDs in your logs correlate to the IDs in your data is
also helpful because it allows your Ops people to do queries across both
datasets to find common factors for where requests fail and what that type of
data looks like vs requests that succeed, as an example.
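
A small sketch of per-request correlation IDs (my example, using Python's
standard logging; the field names are made up). Every line carries the request
ID, and successes are logged alongside failures, so ops can join log lines
against application data by the same ID.

    import logging
    import uuid

    logging.basicConfig(
        format="%(asctime)s %(levelname)s request=%(request_id)s %(message)s",
        level=logging.INFO)
    log = logging.getLogger("orders")

    def handle_order(order):
        req = logging.LoggerAdapter(log, {"request_id": str(uuid.uuid4())})
        req.info("received order for customer %s", order["customer"])
        # ... process the order ...
        req.info("order stored, status=success")   # log the success, not just errors

    handle_order({"customer": "c-42"})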

As he mentions though, most people don't think it's important to log
successes, but in fact what you care most about is correlating request
types/request parameters to metrics. Whether a request failed or succeeded is
just a metric itself, but the more data you have the better understanding of
the system you gain. The real challenge is that distributed systems are
designed to be opaque in their complexity to the users external to the system
and this often leads to them being opaque in their complexity to the
developers internal to the system as well.

~~~
yumaikas
Thanks for the clarification on all that. You have some good "simple" (but not
obvious) ideas there too, like the correlated IDs in logs.

------
youjiuzhifeng
You can find a lot of material about DS in this GitHub project [0]. Although
the page is written in Chinese, the materials are mostly in English.

[0][https://github.com/ty4z2008/Qix/blob/master/ds.md](https://github.com/ty4z2008/Qix/blob/master/ds.md)

------
firasd
Great line: "Learn to estimate your capacity. You’ll learn how many seconds
are in a day because of this."

~~~
bogomipz
It's funny: I think anyone who has ever had to manage their own DNS
infrastructure knows the number 86400. And DNS is one of the oldest
distributed systems out there.

------
honkhonkpants
On backpressure: "Common versions include dropping new messages on the floor
(and incrementing a metric) if the system’s resources are already over-
scheduled."

What I would add here is that "and incrementing a metric" must use fewer
resources than would have been spent serving the request. I once worked on a
system that would latch up into an overload state because in overload it would
log errors at a severity high enough to force the log file to be flushed on
every line. This was counterproductive, to say the least. From that I learned
that a system's flow control / pushback mechanism must be dirt cheap.
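
A minimal load-shedding sketch (my illustration, in Python): when the queue is
full, new work is dropped and a plain in-memory counter is bumped. The
rejection path does no logging and no I/O, so shedding load stays far cheaper
than serving it, the opposite of flushing a log line per error.

    import queue

    work = queue.Queue(maxsize=1000)
    dropped = 0               # scraped periodically by the metrics exporter

    def submit(item):
        global dropped
        try:
            work.put_nowait(item)    # cheap accept path
            return True
        except queue.Full:
            dropped += 1             # cheap reject path: one integer increment
            return False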

~~~
susan_hall
On dealing with backpressure, I would recommend what Zach Tellman writes about
in his video "Everything will flow":

[https://www.youtube.com/watch?v=1bNOO3xxMc0](https://www.youtube.com/watch?v=1bNOO3xxMc0)

Tellman also tells of his own adventure with a system something like this:

"a system that would latch up into an overload state because in overload it
would log errors at a severity high enough to force the log file to be flushed
on every line"

He talks about some problems he faced when writing to Amazon S3, when S3
stopped accepting requests. It's a fascinating talk.

------
vegabook
Just lean on the shoulders of giants who have thought all these points out
before, and use Erlang (or Elixir). That includes backpressure.

[http://elixir-lang.org/blog/2016/07/14/announcing-
genstage/?...](http://elixir-lang.org/blog/2016/07/14/announcing-
genstage/?utm_campaign=elixir_radar_58&utm_medium=email&utm_source=RD+Station)

~~~
felixgallo
Erlang doesn't have great inter-node backpressure. Messages sent from one node
to another in a cluster will suspend the sender if the output queue is too
big, which is a global value independent of the queue length of the receiver.
Also, the inter-node protocol is based on TCP, which has the usual TCP
problems (head of line blocking, etc.).

Erlang's distribution story is great when you're on a single machine; pretty
good if you're on a very high reliability, high bandwidth LAN; and not great
if you're going across any network not under your direct control (e.g. AWS,
the public internet).

~~~
vegabook
OK, but a) backpressure is covered, as per my original post, by Elixir (or
Apache Storm, or Flink, amongst many others), and b) if you need latency-aware
cross-timezone communication you should probably be letting another piece of
distributed infrastructure, explicitly designed for this, handle the issue,
for example local/cluster/rack/data-center-aware Cassandra. There is never any reason to
re-invent something which big, multi-national corpos have worried about for
you for a long time already. It doesn't strike me that anything in the OP post
has not already been solved for 99% of us, using the right, freely available
and well-documented tools. Distributed computing is hardly a new technology.

------
dschiptsov
> garbage collection pauses make masters “disappear”

This happens only with piles of Java crap (gigabytes of jars). With [mostly]
functional Erlang, Common Lisp, Haskell everything is fine.

~~~
mSparks
care to point me to a robust in production distributed system written in one
of those other languages?

~~~
erpellan
Mnesia?
[http://erlang.org/doc/man/mnesia.html](http://erlang.org/doc/man/mnesia.html)

~~~
catwell
Well, for Erlang it's pretty easy. Also the WhatsApp servers, Riak, a lot of
router code...

The question is still interesting for the other two languages though.

------
dschiptsov
There is a _much better_ read for Young Bloods

[https://pragprog.com/book/jaerlang2/programming-
erlang](https://pragprog.com/book/jaerlang2/programming-erlang)

