
Lessons learned writing highly available code - _jomo
https://medium.com/imgur-engineering/lessons-learned-writing-highly-available-code-7eaf3d7aae00
======
notacoward
Many good points there, but I think a lot of people could be misled by the
advice about timeouts and retries. In my experience, which is a lot more than
the author's two years, having each component retry and eventually fail on its
own schedule is a _terrible_ way to produce an available and debuggable
system. Simple retries are fine and even necessary, but there should only be
_one_ part of the system making the often-complex decisions about when to
declare something dead and give up. That way you don't end up with things that
are half-dead or half-connected, which is a real pain in the butt.

~~~
felixgallo
in my experience, which is also a fair bit more than two years, that is deeply
not-great advice for what the author's talking about, which is available
systems. Systems with a one-true-decider need to serialize through the decider,
which tends to lower availability and limit scalability. And the decider can
die, make wrong decisions, or get partitioned away from a perfectly
functioning set of shards or whatever.

It is, no lie, far harder to make a system that can stay available and do
the right thing during a partition. Even though the research has been around
for decades, we're still at the baby-steps stage in actual functioning
systems. Even Riak, the poster child for this sort of thing, was straight up
last-write-wins for years while they were making fun of everyone else. Hard
problems!

~~~
notacoward
I think you're confused about what "one part of the system" means. I don't
mean one process running on one piece of hardware. I mean one software module
with many instances - itself a distributed system, with one purpose in life
(determine liveness) so it can be simple and verifiable. There's no
serialization, and it does _not_ lower availability or scalability as you
claim. In fact, it preserves those properties far better than "everyone doing
their own failure detection a hundred different ways" does.

~~~
felixgallo
oh, not confused at all! I use erlang, like the OP. Right now I have several
erlang nodes that are doing a few hundred thousand http connections
concurrently. Looking at the logs, it looks like about 10-20 of those
concurrent processes are dying every second for various reasons (upstream
crappiness, network errors, parse errors, sunspots for all I know right now),
and are getting resurrected by their supervisors. Every blue moon, an entire
network target will go out and the restart frequency will be too high, and the
supervisors themselves will commit suicide and be restarted by their
supervisors.

Sometimes a machine will go down and come back up. Looks like the one I'm on
has an uptime of about 90 days, which seems low, but doesn't really matter. If
too many of them go down at the same time, I'm notified. Haven't ever been
notified.

You don't have to do 'failure detection a hundred different ways' (what does
that even mean?) if you just crash, and build your systems to go with the
flow, independently, like ants.
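
For anyone who hasn't seen OTP, that supervisor pattern is roughly the loop
below. This is a minimal Python sketch, not real OTP; the restart-intensity
numbers are illustrative, and Erlang gives you all of this natively:

    import multiprocessing, time

    MAX_RESTARTS, WINDOW_SECS = 5, 10   # "restart intensity": 5 restarts per 10s

    def supervise(worker, *args):
        restarts = []
        while True:
            p = multiprocessing.Process(target=worker, args=args)
            p.start()
            p.join()                    # returns when the worker exits or crashes
            now = time.monotonic()
            restarts = [t for t in restarts if now - t < WINDOW_SECS] + [now]
            if len(restarts) > MAX_RESTARTS:
                # too many crashes too fast: give up and crash ourselves,
                # so our own supervisor can decide what to do next
                raise RuntimeError("restart intensity exceeded")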

~~~
notacoward
Well, _somebody_ has to detect the failures, and restart failed processes,
etc. so you can just "go with the flow". If you can "just crash" that's great,
but it doesn't lead to much helpful advice for people who actually design the
underlying systems.

~~~
felixgallo
Sorry I wasn't clear. That's exactly the advice. Design systems so that they
can crash harmlessly in the small, and availability and scalability come to
you quite naturally.

I agree the (distant) second best alternative is to do as you suggest: to try
to rely on a paxos implementation that keeps a stable leader and tries to keep
track of liveness. That's pretty hard too! Besides all of aphyr's jepsen test
series, check this shit out:
[https://github.com/coreos/fleet/issues/1289](https://github.com/coreos/fleet/issues/1289)
. Tough! Tough even when you think you know what you're doing. And of course
no help against freak occurrences where you can't elect a stable leader any
more, or you flap leaders, or the election gets stuck.

Currently the only known-good implementation of raft is in Idris. Do you know
Idris? I sure as hell don't. Fortunately I know ants. Ants can die in great
numbers and still move the leaf. In practice, in production, thousands of
terabytes, even on AWS, my ants move and never bother me. Let it crash.

~~~
YuriNiyazov
Could you go into greater detail about how the only known-good implementation
of raft is in Idris? I'd never heard of Raft before you mentioned it (I don't
deal with distributed systems day-to-day), but I am reading the paper now.

and what's ants?

~~~
felixgallo
First, a correction -- the only known-good implementation of Raft is actually
in Coq ([https://github.com/uwplse/verdi](https://github.com/uwplse/verdi)),
which predates Idris (which intends to be an easier-to-use Coq).

So the problem with distributed consensus algorithms is that they are hard to
understand. It didn't help that Lamport wrote his original paper playfully,
using a complex and unfamiliar metaphor. But as a result, many implementations
of the relevant algorithms tend to miss complex edge cases. Even famous ones
that many large companies rely on have either had meaningful serious bugs or
have been misunderstood and misused by downstream applications.

There are a couple of ways to try to fix this. The common way is to write a
bunch of unit tests. This doesn't work: unit tests only exercise the cases
you thought to write down, and you will probably not think of all of the edge
cases.

The next most common way is to use something like QuickCheck, which
automatically generates millions of cases and spends as long as you want
(hours, days, weeks) hammering your code. This is much better, but still
nondeterministic.
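
For a concrete flavor, here is roughly what a property-based test looks like
in Python's Hypothesis library, a QuickCheck descendant (merge_sorted is a
made-up function under test, not anything from this thread):

    from hypothesis import given, strategies as st

    def merge_sorted(a, b):             # hypothetical function under test
        return sorted(a + b)

    @given(st.lists(st.integers()), st.lists(st.integers()))
    def test_merge_keeps_every_element(a, b):
        out = merge_sorted(a, b)
        assert len(out) == len(a) + len(b)
        assert sorted(a + b) == out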

The better way is to go fully deterministic, and prove out that your algorithm
works, either by exhaustively checking all possible interleavings (code paths)
with a model checker, or by mathematical proof.

Historically, the model checkers (e.g. Spin or Groove) have used a
pseudolanguage that you describe your algorithm in, which is then exhaustively
run to completion. This approach can prove that you are on the right track,
but since you cannot run those pseudolanguages in production, they are not the
complete solution, because you must then translate the pseudolanguage into
source code of your chosen language. This is nontrivial and very frequently
there are subtle transcription errors.

An alternative approach is to use a model checker that uses your native
language directly; e.g., Erlang has the brilliant Concuerror program, which
instruments your actual code. This is great because if it can verify that your
code works properly, then you are done; no transcription is needed.
Nevertheless I don't believe there are yet any Concuerror-assisted public
distributed consensus algorithms, even in Erlang. I would love to be mistaken
on this point.

The last approach is to take a proof assistant language like Coq or Idris, and
formally prove out the properties of the algorithm using mathematical proof
techniques. This is probably optimal, because the exhaustive model checkers
can, with complex/slow enough algorithms, run forever or run out of memory
trying to test all the cases. However, Coq and Idris are not exactly popular
languages, and at this time it's not easy to implement line-of-business
applications in them. So although there is a proven, correct implementation
of Raft in Coq that guarantees linearizability, good luck accessing it. If you
don't use Coq, then you're forced to transcribe it to your chosen language,
which, as before, is error prone and does not result in a proven
implementation.

It would be possible to mechanically transcribe Coq/Idris code into a more
common language while maintaining its provability, but to my knowledge that
hasn't happened yet. More likely is that Idris and its successors inherit
mainstream language features and start making inroads.

Note also that maybe you don't care about being correct. For example, in the
spirit of moving fast and breaking things quickly or whatever, at least one
major notable VC-funded project in the news has taken the approach of just
increasing timeouts in order to mask bugs in their implementation. And 99.999%
of the time that will probably work fine, just as centralizing your database
into one king-hell instance and just dealing with downtime every blue moon
will probably be fine too. Own your own reliability alpha.

------
cpeterso
If you are interested in highly available systems, I highly recommend the
Pragmatic Programmers book "Release It!: Design and Deploy Production-Ready
Software". The author includes some amusing but educational horror stories,
such as debugging deadlocks in Oracle's JDBC drivers when "can't happen" Java
exceptions thrown from connection close() were silently eaten.

[https://pragprog.com/book/mnee/release-it](https://pragprog.com/book/mnee/release-it)

Another clever tip I read (not from this book) was to use prime numbers for
the periods of recurring events and for retry wait times. The idea is that
prime-numbered intervals won't fall into lockstep with each other, which
helps avoid unwanted stampedes or accidental synchronization.
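
A quick illustration of why that helps (the periods here are invented):

    import math

    periods = [53, 59, 61]      # three jobs on prime periods, in seconds
    print(math.lcm(*periods))   # 190747s (~2.2 days) between full pile-ups
    # with [60, 60, 60], all three jobs would fire together every minute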

~~~
nickpsecurity
I second that book. The airline story was particularly shocking and
illustrated well the need for its techniques.

------
RyanZAG
_> 6\. Prefer battle-tested tools over the “new hotness”._

Well, that won't go over well with the HN crowd, I bet! It's also the most
important point in my opinion. Just because a new tool solves an old problem
doesn't mean it won't have all new problems of its own. Can probably go even
further:

For any popular tool or framework less than 1 year old, it has critical
problems that aren't known yet. If it's not a popular tool, it will always
have critical problems that aren't known.

~~~
ltcoleman
I agree completely here. Many of my dev 1s want to use the latest-and-greatest
stuff, but you will most likely not find those technologies in large, stable
production systems, except to facilitate some one-off service or piece of
architecture that doesn't need an SLA.

I remember a slew of small tech companies having issues when they decided to
build out their app stack with nginx fronting some NoSQL hotness, which
inevitably led to overload scenarios because of the lack of constraints and
attention to scalability.

Please note: this was years ago, and I know that nginx is a great web server
and Mongo is a great NoSQL db, but without a focus on scalability you can end
up with a large headache.

------
moatra
Exponential backoff is a good idea, but to make it even better you'll want to
add some random jitter to the delay period. Especially in the case of
deadlocks, this helps avoid a repeat collision when two clients would
otherwise retry at exactly the same time.
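
A minimal sketch of exponential backoff with "full jitter" in Python (the
constants, do_work, and TransientError are illustrative placeholders):

    import random, time

    def delay(attempt, base=0.1, cap=30.0):
        # exponential ceiling, then a uniformly random wait below it,
        # so contending clients don't retry in lockstep
        return random.uniform(0, min(cap, base * 2 ** attempt))

    for attempt in range(5):
        try:
            do_work()                   # hypothetical operation that may fail
            break
        except TransientError:          # hypothetical exception type
            time.sleep(delay(attempt))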

~~~
mjb
Absolutely. I made some graphs of different kinds of jitter approaches at
[http://www.awsarchitectureblog.com/2015/03/backoff.html](http://www.awsarchitectureblog.com/2015/03/backoff.html).
Simply backing off with delay does very little to spread out spikes of work,
and can lead to longer MTTRs if the cause of the issue was overload.

~~~
sgk284
Would it be possible for the server to simply return a queue position to the
contending clients? For instance, "There are 37 clients ahead of you. Try
again in 37ms." (assuming each write is averaging 1ms)

I suspect if the client and server worked together in this fashion you could
get much closer to the linear completion time seen with no backoff and also
closer to linear work (since each client would try exactly twice in an ideal
world where the server accurately predicted when it should try again).
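
A rough Python sketch of the idea (all names and numbers are hypothetical):

    import time

    AVG_WRITE_TIME = 0.001              # 1ms per write, per the example above

    def reject(queue_position):
        # server side: tell the client how long the line ahead of it is
        return {"retry_after": queue_position * AVG_WRITE_TIME}

    def submit(send_request):
        # client side: honor the server's hint instead of guessing a backoff
        while True:
            resp = send_request()
            if "retry_after" not in resp:
                return resp                   # accepted
            time.sleep(resp["retry_after"])   # 37 clients ahead -> sleep 37ms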

------
ctide
I'm guessing MySQL doesn't support this (hence the need for a cron job), but
postgres lets you set a statement_timeout on the connection. It will
force-kill queries that run beyond that timeout. I worked on an app not too
long ago that occasionally would have some queries go off the rails and start
blocking everything. We set up postgres to automatically kill off anything
taking over 30s, and then were able to root out the issues without worrying
about everything blocking on these broken queries and taking down our systems.

~~~
greenleafjacob
Author here. We thought about using statement_timeout, but we didn't like the
lack of flexibility when we _do_ have long-running queries that aren't
necessarily deleterious to performance. Instead we opted to use two different
users ("cron" for long-running jobs, and a normal read/write user for
everything else) that the long-query-killer script kills at different
timeouts, effectively implementing a per-user timeout rather than a global
one.

I bet the best approach might be to have the statement_timeout be the largest
of all of your per-user timeouts (in case your watchdog script fails, can't
connect, etc. for whatever reason).

~~~
e12e
I've not tried it, but according to the PostgreSQL documentation, this should
be possible (and recommended) to set on a per-role basis?

> The ALTER ROLE command allows both global and per-database settings to be
> overridden with user-specific values. (...)

> The SET command allows modification of the current value of those parameters
> that can be set locally to a session; it has no effect on other sessions.
> The corresponding function is set_config(setting_name, new_value, is_local).
> (...)

> Setting statement_timeout in postgresql.conf is not recommended because it
> would affect all sessions.
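
An untested sketch of that per-role setup via psycopg2 (the role names, DSN,
and timeout values are invented for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=admin")    # hypothetical DSN
    with conn, conn.cursor() as cur:
        # the long-running "cron" role gets a generous timeout; the
        # normal read/write role gets a tight one
        cur.execute("ALTER ROLE app_rw SET statement_timeout = '30s'")
        cur.execute("ALTER ROLE cron SET statement_timeout = '10min'")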

------
ww520
That's good advice, and it comes from battle-hardened experience.

I would like to add a few points.

1\. Design the program as if it had crashed the last time it ran. Always start
up by recovering from the last crash state; a normal shutdown is just the
degenerate case. This makes restart and retry of the program much simpler
(see the sketch after these points).

2\. Do retry with RANDOM exponential backoff to spread out the load. You don't
want all the retries to end up at the same intervals.
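
Here's a minimal Python sketch of point 1 (the journal file name, format, and
redo() are invented for illustration):

    import json, os

    JOURNAL = "pending-work.journal"

    def recover():
        if not os.path.exists(JOURNAL):
            return []                   # clean shutdown left nothing pending
        with open(JOURNAL) as f:
            return [json.loads(line) for line in f if line.strip()]

    def startup():
        for task in recover():          # same code path, crash or no crash
            redo(task)                  # hypothetical idempotent redo step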

~~~
dvirsky
re 1 - better to design the program with no state beyond configuration.

~~~
gabbo
Easier said than done; if you don't have _any_ state you're probably not doing
anything all that interesting.

The trick is to keep your state in the right place, the answer usually being
to punt state management to a database/message queue/etc. - you definitely
want to avoid, say, stateful frontends.

------
setheron
At Amazon (AWS) I had started to refer to some of our control plane components
as "eventually reliable"; somewhat of a pun with a good dash of truth.

------
amelius
How about (automated) testing? Seems the author missed an important one.

~~~
nostrademons
It's necessary, but insufficient. You should have automated testing for each
individual component. The problem is that by itself, automated testing won't
give you a reliable distributed system. You need to assume that components
will fail in ways that won't be caught by your tests, and design your
architecture accordingly.

(Among other problems, automated testing won't catch a hardware failure, or
someone tripping over the power cord, or a cosmic ray flipping a bit and
corrupting memory.)

~~~
byroot
> automated testing won't catch a hardware failure, or someone tripping over
> the power cord

While it's never 100%, you can still test quite a lot of those conditions.

We developed a TCP proxy[0] that allows us to test such cases: Datastore X
being down / timing out / being excessively slow.

[0]
[https://github.com/shopify/toxiproxy](https://github.com/shopify/toxiproxy)

------
nickpsecurity
Those are nice. The exponential backoff is the exception: it was the cause of
much degradation in TCP, since it left unnecessary holes in service. Amazon
ditched it in favor of jitter for the same reason, applied to a different
problem. If you use backoff, I suggest a _non-exponential_ scheme where you
tie the delays to likely disruption times with some error margin.
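
A sketch of what that might look like (the durations are invented; use your
own outage history):

    # outage durations you actually observe, in seconds:
    LIKELY_DISRUPTIONS = [5, 30, 300]   # blip, failover, deploy

    def retry_schedule(margin=0.2):
        # wait just past each likely disruption instead of growing exponentially
        return [t * (1 + margin) for t in LIKELY_DISRUPTIONS]

    print(retry_schedule())             # [6.0, 36.0, 360.0]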

Another commenter mentioned testing. That's a good idea. Here's two articles
on that:

[https://queue.acm.org/detail.cfm?id=2800697](https://queue.acm.org/detail.cfm?id=2800697)

[http://www.drdobbs.com/testing/testing-complex-systems/24016...](http://www.drdobbs.com/testing/testing-complex-systems/240168390)

------
gabbo
Fixed-interval + jitter can work; exponential + jitter can work too. Be super
careful with retries that don't involve jitter: depending on your system
architecture you may end up with highly correlated bursts of retry load. You
also need to be careful about intermediate levels in your system when
retrying, and you should be able to reason about what's going to happen under
common failure modes (you won't be able to anticipate the rare ones or freak
occurrences, but you can defend in depth as best you can).

If a low-level dependency starts to flake out (bad code/config/data push,
unreliable host or network, load spike causing your storage to melt), retrying
at intermediate stages can amplify the load and turn a small/medium-sized
problem into a cascading failure. You can partially mitigate this by retrying
0 or 1 times and then just giving up, especially if you control all your
callers or can afford that flakiness in your error budget.

Otherwise, depending on your needs you could maintain a retry budget (don't
unless you really know you need it): instead of a flat K retries before giving
up, enforce a process-wide per-operation cap on your retry rate. When the
system is healthy, overall retry rate will be low so you have "room" to issue
all K retries if needed. But if your load spikes, or your downstream
dependencies start flaking out for some reason (your "fault" or otherwise),
you'll avoid a positive feedback loop of death by quickly hitting your
retry-rate cap and shedding load that would only have made things worse and
was destined to fail anyway.
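
A sketch of such a budget as a token bucket in Python (rates are
illustrative):

    import threading, time

    class RetryBudget:
        def __init__(self, tokens_per_sec=1.0, burst=10.0):
            self.rate, self.cap = tokens_per_sec, burst
            self.tokens, self.last = burst, time.monotonic()
            self.lock = threading.Lock()

        def try_spend(self):
            # refill proportionally to elapsed time, then try to spend one token
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.cap,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return True
                return False    # budget spent: fail fast instead of retrying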

EDIT: A few other remarks:

* OP's general point about putting limits on everything is really important, and closely related to the general wisdom in distributed systems development: everything can and will fail, so make the failure modes obvious and explicit. You get general robustness by avoiding runaway queries, but you also force discussions at design/development time like "what happens when this queue fills up" or "what happens if you're never able to take that lock" (though you should try to avoid needing to take locks ;)

* Instrument the hell out of the paths through your system and add lots of metrics. Logging is great, do it wherever it makes sense and you can, but a lot of the time you'll be able to get more insight (or the same amount of insight with less time spent doing ad-hoc log analytics) with something simple like counters whose value you can track over time and plot on a graph somewhere. Examples: classify downstream dependency latency into buckets, count of times you hit a code path (maybe some users require a more complex DB query than others or you're coding defensively, never expect a case to happen, but want to know if it does), cache hit rate, top sources of traffic in your system. Eventually you'll want to optimize bottlenecks and without data to identify them and prove they're fixed, you're flying blind.
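
A minimal sketch of the counters idea in Python (metric names are invented):

    from collections import Counter

    METRICS = Counter()

    def record_db_latency(seconds):
        # classify downstream latency into coarse buckets instead of
        # logging every call
        bucket = ("<10ms" if seconds < 0.01
                  else "<100ms" if seconds < 0.1 else ">=100ms")
        METRICS["db.latency." + bucket] += 1

    def record_code_path(name):
        METRICS["path." + name] += 1    # e.g. "cache_hit", "defensive_branch"

    # periodically dump or ship METRICS to whatever draws your graphs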

------
eikenberry
Another rule of mine is to avoid languages/frameworks with large, complicated
run-times. They introduce a huge cognitive load and go against the rule of
keeping things simple.

------
TheHippo
Imgur is the only site I regularly see offline or overloaded.

------
mrbig4545
I wouldn't call imgur highly available. More like highly unavailable.

~~~
dexterdog
They are rarely down in my experience. Then again most of the work is just
serving images from AWS so they shouldn't be down.

------
EugeneOZ
I stopped reading after the story about the cron script killing long-running
DB queries - it means the system is a huge mess, where killing the request is
a simpler solution than finding which part of the code generates it. I've
seen such systems, and I'm sure it's a hopeless state; I can only read
"thankful to architecture principles" here as sarcasm.

~~~
shaftoe
Why not both? It's not unheard of for a deploy with index issues to cause a
load spike in production. Yes, you need to know about the query to fix it, but
killing runaway queries once they're obviously orphaned can also mean the
difference between an outage and a slight site degradation in response times.

~~~
EugeneOZ
Because if you're killing queries, you can forget about data consistency. And
when the data is in that condition, we should never call this a "good enough"
approach.

