
Challenges with distributed systems - fagnerbrack
https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
======
yamrzou
Interesting read.

Relatedly, FoundationDB has a distributed testing framework called
“Simulation”, which can simulate distributed failures on a single machine
(thread!). Quoting from
[https://pierrezemb.fr/posts/notes-about-foundationdb/](https://pierrezemb.fr/posts/notes-about-foundationdb/):

> We wanted FoundationDB to survive failures of machines, networks, disks,
> clocks, racks, data centers, file systems, etc., so we created a simulation
> framework closely tied to Flow. By replacing physical interfaces with shims,
> replacing the main epoll-based run loop with a time-based simulation, and
> running multiple logical processes as concurrent Flow Actors, Simulation is
> able to conduct a deterministic simulation of an entire FoundationDB cluster
> within a single-thread! Even better, we are able to execute this simulation
> in a deterministic way, enabling us to reproduce problems and add
> instrumentation ex post facto. This incredible capability enabled us to
> build FoundationDB exclusively in simulation for the first 18 months and
> ensure exceptional fault tolerance long before it sent its first real
> network packet. For a database with as strong a contract as the
> FoundationDB, testing is crucial, and over the years we have run the
> equivalent of a trillion CPU-hours of simulated stress testing.

~~~
jffhn
>simulation of an entire FoundationDB cluster within a single-thread!

Reminds me of an InfoQ post about simulating robot swarms in a single thread:
[https://www.infoq.com/articles/java-robot-swarms/](https://www.infoq.com/articles/java-robot-swarms/)
(links to Java libraries for that in the comments)

Being able to run everything in a single thread allows you to:

1) Do a lot of tests quickly (with virtual time flowing as fast as possible),
or arbitrarily slowly if you prefer slow-motion in some contexts.

2) Have determinism. Great for debugging: you can reproduce bugs at will,
even with full logging on (one trick is to enable it only shortly before the
bug occurs, which helps when the bug is far from start time).

3) More easily figure out whether a bug is in domain code or a threading
issue.

4) Use single-threaded execution as a benchmark baseline for domain code, and
see how much multi-threading/distribution actually makes it faster, or
whether it makes it slower.

One constraint is that it rules out some programming styles, since the code
must never use waiting constructs, like futures (if the single thread starts
to wait, it will wait forever since nothing happens outside of it).

~~~
zzzcpan
_> One constraint is that it rules out some programming styles, since the code
must never use waiting constructs_

The constraint here is that the testing/simulation framework must be able to
supply its own implementation of waiting constructs, not that waiting
constructs cannot be used; of course they can.

~~~
jffhn
I meant it rules them out for the domain code that you want to be able to run
in a single thread, which should be by far most of the code (unless you
ensure that whatever your code wants to wait for has necessarily already
happened, but that seems brittle to me).

Inside the technical layers you use to run it across multiple threads, there
are of course wait/notify mechanisms (or similar).

Maybe you were thinking of wait implementations that would not actually wait,
but would "help" and go on with other computations while the condition is not
yet met? If not, I would like you to expand on what you mean, ideally with a
few lines of code as a sample to make it clear.

~~~
zzzcpan
I mean the wait implementation doesn't have to actually wait; it just has to
register an event handler and store some context to continue with. In the
simplest case, if we assume there is nothing else to process, the condition
will be met right in the next iteration, which calls back that event handler
and continues from that point, all without any waiting.
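
A minimal sketch of that idea in Java (all names hypothetical): "waiting"
just registers a condition plus a continuation on a single-threaded loop,
which resumes whichever continuations have become runnable after each task:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: a "wait" that never blocks the single thread.
// waitFor() registers (condition, continuation); the loop re-checks
// pending conditions after every task and resumes the ones that hold.
final class MiniLoop {
    private record Pending(BooleanSupplier condition, Runnable continuation) {}

    private final Queue<Runnable> tasks = new ArrayDeque<>();
    private final List<Pending> pending = new ArrayList<>();

    void post(Runnable task) { tasks.add(task); }

    void waitFor(BooleanSupplier condition, Runnable continuation) {
        pending.add(new Pending(condition, continuation));
    }

    void run() {
        while (!tasks.isEmpty() || !pending.isEmpty()) {
            Runnable task = tasks.poll();
            if (task != null) task.run();
            // Resume every registered "waiter" whose condition now holds.
            pending.removeIf(p -> {
                if (p.condition().getAsBoolean()) {
                    tasks.add(p.continuation());
                    return true;
                }
                return false;
            });
            if (task == null && tasks.isEmpty()) break; // no further progress possible
        }
    }
}
```

Nothing ever blocks here: a waiter whose condition can never hold simply
shows up as a loop that stops making progress, which is easy to detect
deterministically.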

~~~
jffhn
Ok, I was talking about actually waiting (as in "while(!condition)yield();")
with more code to execute at the _same_ (*) virtual time after the wait, not
about arranging for some code to be executed _later_, which is indeed a
proper approach in our case, and what "waiting" could mean in some informal
specification.

(*) When doing deterministic virtual-time scheduling, computations are
scheduled to take place at given times, and are processed exactly at the time
they are supposed to be. The time can only advance once everything that had
to be computed at the current time has been computed (in "as fast as
possible" mode, the scheduler then just jumps the clock to the next time
something is scheduled to happen). If you want to read more about that, see
"time advance request" and "time advance grant" in the HLA standard.

------
yamrzou
> It’s almost impossible for a human to figure out how to handle UNKNOWN
> correctly. What does UNKNOWN really mean? Should the code retry? If so, how
> many times? How long should it wait between retries? It gets even worse when
> code has side-effects. Inside of a budgeting application running on a single
> machine, withdrawing money from an account is easy, as shown in the
> following example.

> Figuring out how to handle the UNKNOWN error type is one reason why, in
> distributed engineering, things are not always as they seem.

How are such UNKNOWN errors handled in practice? The article doesn’t talk much
about it.

~~~
finaliteration
In my experience designing systems like these: there are no hard and fast
rules, and you handle them in a way that's decided on a case-by-case basis
for each system. Some messages may need to keep retrying with exponential
backoff indefinitely. Others may need to retry only once and then email
someone, because two failures in five minutes is a critical failure. You also
have to design things so that if one failure occurs, maybe it stops the
entire process, or maybe other messages can still go through and you just
log the failure.
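
As a rough illustration of such a per-case policy (all names and numbers
invented), capped exponential backoff with an escalation hook might look
like:

```java
import java.time.Duration;

// Invented sketch of a per-message retry policy: exponential backoff,
// capped, with an escalation hook after repeated failures.
final class RetryPolicy {
    private static final Duration BASE = Duration.ofSeconds(1);
    private static final Duration MAX = Duration.ofMinutes(5);
    private static final int ALERT_AFTER = 5;

    // 1s, 2s, 4s, ... capped at MAX.
    static Duration backoff(int attempt) {
        Duration d = BASE.multipliedBy(1L << Math.min(attempt, 20));
        return d.compareTo(MAX) > 0 ? MAX : d;
    }

    // After too many failures, stop relying on retries alone and tell a human.
    static void onFailure(int attempt, Runnable alertSomeone) {
        if (attempt >= ALERT_AFTER) {
            alertSomeone.run();
        }
    }
}
```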

It all comes down to the rules of your business and how critical these systems
are. Maybe unknown means “failure” or maybe unknown means “someone should get
an alert about this and check it out”.

I think it’s hard because there is no highly visible “crash”, as there is in
a non-distributed system when an unexpected exception occurs and the entire
program shuts down. Failures often happen silently, and it’s difficult to
tell where or why something failed. So you have to design each system with
that in mind and figure out how each piece needs to deal with uncertainty.

~~~
stingraycharles
For what it’s worth, in my experience it’s very effective to add this
information to the error context: is it a permanent failure (e.g. validation)
or a retryable error? If it is retryable, also add to the error context
_when_ it should be retried.

This allows you to handle these errors appropriately without having to treat
each one on a case-by-case basis.
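
Concretely, that error context might look something like this (a hypothetical
sketch, not any particular library's API):

```java
import java.time.Instant;
import java.util.Optional;

// Hypothetical sketch: the error itself says whether and when to retry,
// so a generic handler can act without case-by-case knowledge.
final class ServiceException extends Exception {
    private final boolean retryable;
    private final Instant retryAfter; // null when not retryable

    ServiceException(String message, boolean retryable, Instant retryAfter) {
        super(message);
        this.retryable = retryable;
        this.retryAfter = retryAfter;
    }

    boolean isRetryable() { return retryable; }

    Optional<Instant> retryAfter() { return Optional.ofNullable(retryAfter); }
}
```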

~~~
clarry
> it’s very effective to add this information to the error context: is it a
> permanent failure (eg validation) or a retryable error

The discussion was specifically about UNKNOWN errors, i.e. you sent a message
but never got a reply back. You don't know whether it was a validation failure
or temporary hiccup. For all you know, it's possible the message was received
and processed correctly but the response never made it back.

How to handle these unknowns is always going to be case by case. Some
combination of retry and give-up works for most cases, but there is no silver
bullet, and you usually have to think hard about the consequences of 1)
retrying, and 2) giving up in the belief that the request failed even though
it actually (silently) succeeded.

------
phtrivier
Is there an article (in this "Amazon Builders' Library" or elsewhere) that
offers practical advice about handling such problems? Or is there nothing
beyond "do it on a case-by-case basis"?

------
tjchear
I wonder what happens if we decide that 'fate sharing' is OK, i.e. the client
request flow (not the entire process) fails when a network step fails. How
long can a system stay operational after it starts before the whole thing
grinds to a halt? Seconds? Hours? Days? Would there be a huge ding or a small
ding on the SLIs?

------
tyingq
Designing your software to work interchangeably with Unix domain sockets and
regular sockets can also make initial testing much easier. You then have a
no-hassle way to simulate lots of ip/port pairs via file names, without port
clashes or creating virtual network interfaces.
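
For example, in Java 16+ a single bind helper can hide the difference (the
"unix:" endpoint syntax here is made up):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.channels.ServerSocketChannel;

// Hypothetical helper: one entry point for "unix:/tmp/node-a.sock"
// (socket file, no port clashes) and "host:port" (real TCP) endpoints.
final class Listen {
    static ServerSocketChannel bind(String endpoint) throws IOException {
        if (endpoint.startsWith("unix:")) {
            return ServerSocketChannel.open(StandardProtocolFamily.UNIX)
                    .bind(UnixDomainSocketAddress.of(endpoint.substring(5)));
        }
        int colon = endpoint.lastIndexOf(':');
        return ServerSocketChannel.open()
                .bind(new InetSocketAddress(endpoint.substring(0, colon),
                        Integer.parseInt(endpoint.substring(colon + 1))));
    }
}
```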

------
AkshatM
I wonder if there are any advantages to using formal methods over thread-local
simulation in this case.

Eighteen months is a long time of development, and there are at least some
benefits to catching bugs in the design phase before pursuing implementation.

------
fyp
Did anyone else find their "eight failure modes of the apocalypse" incredibly
arbitrary, given the important-sounding name?

------
ex_amazon_sde
meh

------
marknadal
Hmm, how I handle UNKNOWN in our distributed system (10M+ monthly users
running on $0-cost infrastructure,
[https://github.com/amark/gun](https://github.com/amark/gun)) is to say that
the originator is responsible for retries until satisfactorily ACKed.

This means all other nodes in the network (routers, relays, security, storage,
etc.) can safely error or not recurse or not retry, without any loss in
redundancy guarantees.

Either the operation did succeed silently, and it is OK for the originator to
try again (idempotency keeps this safe), or things never finished processing,
and it is OK to try again despite being out of order (CRDTs keep this safe).
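
A minimal sketch of the receiver side of such a scheme (names hypothetical):
each operation carries a unique id chosen by the originator, so a retry of an
operation that already succeeded is simply re-ACKed rather than applied
twice:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical receiver-side sketch: duplicates caused by originator
// retries are detected by operation id and acknowledged without re-applying.
final class Receiver {
    private final Set<String> applied = ConcurrentHashMap.newKeySet();

    // ACKs whether the operation is new or a duplicate; the originator
    // keeps retrying until it sees this ACK.
    boolean handle(String opId, Runnable applyOnce) {
        if (applied.add(opId)) { // true only the first time this id is seen
            applyOnce.run();     // the side effect happens at most once
        }
        return true;             // ACK either way
    }
}
```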

~~~
Rapzid
This article brought to mind another article which I personally found a bit
more profound:
[https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/](https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/)

