
Some items from my “reliability list” - luu
http://rachelbythebay.com/w/2019/07/21/reliability/
======
B-Con
Required reading. Have you worked as an SRE?

> Item: Rollbacks need to be possible

This is the dirty secret to keeping life as an SRE unexciting. If you can't
roll it back, re-engineer it with the dev team until you can. When there's no
alternative, you find one anyway.

(When you really and truly cross-my-heart-and-hope-to-die can't re-engineer
around it fully, isolate the non-rollbackable pieces, break them into
small pieces, and deploy them in isolation. That way if you're going to break
something, you break as little as possible and you know exactly where the
problem is.)

Try having a postmortem, even an informal one, for every rollback. If you were
confident enough to push to prod but it didn't work, figure out why that
happened and what you can do to avoid it next time.

> Item: New states (enums) need to be forward compatible

Our internal Protobuf style guides strongly encourage this. In fact, some of
the proto2 features that most often broke backward compatibility were changed
for v3.
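
A minimal sketch of what "forward compatible" means in practice (plain Python
enums here; names like OrderStatus are made up, but proto3's open enums call
for the same defensive branch):

      import enum
      import logging

      log = logging.getLogger(__name__)

      class OrderStatus(enum.IntEnum):
          PENDING = 0
          SHIPPED = 1

      def handle(raw_status: int) -> None:
          try:
              status = OrderStatus(raw_status)
          except ValueError:
              # A newer binary may emit values this one doesn't know yet;
              # log and skip instead of crashing.
              log.warning("unknown status %d, ignoring", raw_status)
              return
          if status is OrderStatus.SHIPPED:
              log.info("order shipped")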

> Item: more than one person should be able to ship a given binary.

Easy to take this one for granted when it's true, but it 100% needs to be
true. Also includes:

* ACLs need to be wide enough that multiple people can perform every canonical step.

* Release logic/scripts need to be accessible. That includes "that one" script "that guy" runs during the push that "is kind of a hack, I'll fix it later". Check. It. In. Anyway.

* Release process needs to be understood by multiple people. Doesn't matter if they _can_ perform the release if they don't know _how_ to do it.

> Item: if one of our systems emits something, and another one of our systems
> can't consume it, that's an error, even if the consumer is running on
> someone's cell phone far far away.

An easy first step is to monitor 4xx response codes (or some RPC equivalent).
I've rolled back releases because of an uptick in 4xxs. Even better is to get
feedback from the clients. Having a client->server logging endpoint is one
option.
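
As a sketch of that last option (Flask is an assumption here; any framework
works), a bare-bones client->server logging endpoint:

      from flask import Flask, request

      app = Flask(__name__)

      @app.route("/client-log", methods=["POST"])
      def client_log():
          # Clients report errors they see, so a release that breaks them
          # shows up server-side and can trigger a rollback.
          event = request.get_json(force=True)
          app.logger.warning("client-reported error: %s", event)
          return "", 204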

And if a release broke a client, roll back and see the first point. The
postmortem should include why it wasn't caught in smoke/integration testing.

~~~
pmlnr
> Rollbacks need to be possible

I always feel like people who write these have never faced SQL schema changes
or dataset updates. I wonder what rollback plans are in place for complete
MySQL replication chains, for example.

~~~
grey-area
This is addressed in the article (in passing). You split the changes up:

First, make sure your code handles both old and new schemas.

Then introduce the schema change.

Then introduce the code which depends on the schema.

Each of these steps is performed separately and monitored for problems, with
rollback possible at each stage. The most painful thing to roll back is the
database, but it is possible, though rarely necessary if done in isolation and
tested against the old code first.
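
As a toy sketch of the first step (code that handles both schemas), assuming
a hypothetical rename of an email column to primary_email:

      def user_email(row: dict) -> str:
          # Works before and after the schema change ships, so the code
          # and the schema can each be rolled back independently.
          return row.get("primary_email") or row["email"]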

There may be a few more steps after if you want to tidy up the schema by
removing old data etc. It's trading off complexity in development/deploy for
reliability.

~~~
B-Con
Pretty much this.

Writing code that can handle both old and new schema is admittedly annoying.
But it's safe and forces a thoughtful rollout.

Even data mutations can be rolled back. Dual storage writes, snapshots, etc.

The goal isn't to eliminate risk, it's to reduce the risk and make it
calculated and bounded. Heck, having a rollback plan that says "we'll restore
a snapshot and lose X minutes worth of data mutations" is fine in some cases.
It's tradeoffs.

(I've seen a case where literally adding a column caused an outage, because
the presence of the column triggered a new code path. Rollback was to delete
the column.)

~~~
tonyarkles
And to add to this, one of the brutal anti-rollback gotchas I’ve seen on
smaller projects (not distributed-systems failures, but code bugs that put you
in a painful spot): not having transactional DDL for your schema changes.

The typical scenario I’ve had to help with: MySQL database that’s been running
in prod for a while. Because of the lax validation checks early in the
project, weird data ends up in a column for just a couple rows. Later on you
go to make a change to that column in a migration, or use the data from that
column as part of a migration. Migration blows up in prod (it worked fine in
dev and staging due to that particular odd data not being present), and due to
the lack of transactional DDL, your migration is half applied and has to be
either manually rolled back or rolled forward.

~~~
evanelias
In my experience this is rare, but I suppose it depends what you mean by
"Migration blows up in prod". Do you mean the ALTER TABLE is interrupted
(query killed, or server shutdown unexpectedly)? Or do you mean you're
bundling multiple schema changes together, and one of them failed, which is
causing issues due to lack of ability to do multiple DDL in a single
transaction?

The former case is infrequent in real life. Users with smaller tables don't
encounter it, because DDL is fast with small tables. Meanwhile users with
larger tables tend to use online schema change tools (e.g. Percona's pt-osc,
GitHub's gh-ost, Facebook's fb-osc). These tools can be interrupted without
harming the original table.

Additionally, DDL in MySQL 8.0 is now atomic. No risk of partial application
if the server is shut down in the middle of a schema change.

The latter case (multiple schema changes bundled together) tends to be
problematic only if you're using foreign keys, or if your application is
immediately using new columns without a separate code push occurring. It's
avoidable with operational practice. I do see your point regarding it being
painful on smaller projects though.

~~~
tonyarkles
Generally in my experience it's been bundled schema changes. As a (dumb)
example in SQL-like pseudocode (I've been doing EE work recently; apologies
for imperfect syntax):

    
    
      ALTER TABLE user ADD COLUMN country varchar(2); -- oops
      CREATE TABLE user_email (user_id int, email varchar(64));
      -- Copy data from user.email into user_email. This blows up because
      -- someone has a 256-byte-long email address and user_email.email is
      -- a smaller varchar().
      INSERT INTO user_email (user_id, email) SELECT id, email FROM user;
      ALTER TABLE user ADD COLUMN state_province varchar(64);
    

Yes, this is sloppy and should be cleaner. The net result is still the same
though; you have the user_email table and country column created, and due to
MySQL DDL auto-commit, they're persisted even though the data copy process
failed. The state_province column does not exist, and now if you want to re-run
this migration after fixing the problem, you need to go drop the user_email
table and country column.

With e.g. Postgres, you wrap the whole thing in a transaction and be done with
it. It gets committed if it succeeds, or gets rolled back if it fails.
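
A minimal sketch of that, in Python with psycopg2 (the DSN and the
table/column names are placeholders echoing the example above):

      import psycopg2

      conn = psycopg2.connect("dbname=app")  # placeholder DSN
      try:
          # `with conn` wraps everything in one transaction: commit on
          # success, rollback (DDL included) if anything raises.
          with conn, conn.cursor() as cur:
              cur.execute("ALTER TABLE users ADD COLUMN country varchar(2)")
              cur.execute("CREATE TABLE user_email (user_id int, email varchar(64))")
              cur.execute("INSERT INTO user_email (user_id, email) "
                          "SELECT id, email FROM users")  # the step that blows up
      finally:
          conn.close()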

~~~
evanelias
Makes sense, thanks. fwiw, generally this won't matter in larger-scale
environments, as it eventually becomes impractical to bundle DDL and DML in a
single system/process/transaction. In other words, the schema management
system and the bulk row data copy/migration system tend to be separated after
a certain point.

Otherwise, if the table is quite large, the DML step to copy row data would
result in a huge long-running transaction. This tends to be painful in all
MVCC databases (MySQL/InnoDB and iiuc Postgres even more so) if there are
other concurrent UPDATE operations... old row versions will accumulate and
then even simple reads slow down considerably.

A common solution is to have the application double-write to the old and new
locations, while a backfill process copies row data in chunks, one transaction
per chunk. But this inherently means the DDL and DML are separate, and cannot
be atomically rolled back anyway.
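
A rough sketch of that backfill loop (assuming a psycopg2-style connection
against Postgres; table and column names are hypothetical):

      def backfill(conn, chunk=1000):
          # The application double-writes to old and new locations; this
          # job copies historical rows one small transaction at a time.
          last_id = 0
          while True:
              with conn, conn.cursor() as cur:  # one transaction per chunk
                  cur.execute(
                      "INSERT INTO user_email (user_id, email) "
                      "SELECT id, email FROM users "
                      "WHERE id > %s ORDER BY id LIMIT %s "
                      "RETURNING user_id",
                      (last_id, chunk),
                  )
                  rows = cur.fetchall()
              if not rows:
                  break
              last_id = max(r[0] for r in rows)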

~~~
tonyarkles
Yeah, definitely for larger scale systems the process gets quite a bit
different, although "large scale" will remain undefined :).

It's admittedly been a long time since I've used MySQL for anything
significant, but I feel like teams in the past have run into issues where DDL
operations on their own have succeeded in dev/staging and failed in prod, even
though they're running on the same schema, simply due to there being "weird"
data in the rows. I don't know that for sure though. If I'm remembering right,
one of those had something to do with the default "latin1_swedish_ci" vs utf8
thing...

------
pjungwir
> Also, if you are literally having HTTP 400s internally, why aren't you using
> some kind of actual RPC mechanism? Do you like pain?

I just had a discussion about this yesterday where we have an internal JSON
API that auths a credit card, and if the card is declined it returns a status
and a message. Another developer wanted it to return a 4xx error, but that
made me uneasy. I think you could make a good argument either way, but to me
that isn't a failure you'd present at the HTTP layer. 4xx is better than 5xx,
but I was still worried about how intermediate devices would interfere. (E.g. an AWS
ELB will take your node out of service if it gives too many 5xxs, and IIS can
do some crazy things if your app returns a 401.) Also I don't want declined
cards to show up in system-level monitoring. But what do other folks think? I
believe smart people can make a case either way.

EDIT: Btw based on these Stack Overflow answers I'm in the minority:
[https://stackoverflow.com/questions/9381520/what-is-the-appropriate-http-status-code-response-for-a-general-unsuccessful-req](https://stackoverflow.com/questions/9381520/what-is-the-appropriate-http-status-code-response-for-a-general-unsuccessful-req)

~~~
citrusx
In my opinion, a rejection is an expected outcome, and therefore should have a
response code of 200. You're not asking, "Does this card exist?", and sending
a 404 if you have no record. You're asking a remote system to do a job for
you, and if that job completes successfully, it's a "Success" in the HTTP
world.

At that point, you rely on the body content to tell you what the service
correctly determined for you. A result that the user doesn't like is way
different than a result that comes about because something was done wrong at
the client side (4xx) or a failure on the server side (5xx).

~~~
sagichmal
> You're asking a remote system to do a job for you, and if that job completes
> successfully, it's a "Success" in the HTTP world.

But a provided credit card number not existing is not success, it is
unambiguously a failure.

HTTP is a transport but its response codes are clearly designed to map
_somehow_ to the "business logic" of the thing underneath it. We have to twist
ourselves up in knots to map most of our business systems to the set of status
codes that make sense for the default HTTP-fronted application (Fielding's thesis
stuff) but that's what we sign up for when we decide to expose our business
services via HTTP. Other transports require different contortions.

~~~
treis
> But a provided credit card number not existing is not success, it is
> unambiguously a failure.

But the failure is ambiguous. If I get a 404 back I can't rely on that to mean
the credit card doesn't exist. It could mean that I have the wrong URL or that
the service application screwed up their routing code.

~~~
nulagrithom
You wouldn't put a card number in the path anyway (for obvious reasons). Far
more sensible to put that in the request body.

And who's to say you can't put the reason in the body and still keep the code?
What are you hurting by sending back 400? Unless you have lb's taking out
nodes because of excessive 4xx's (which sounds like insanity) I don't see a
reason _not_ to send 4xx's. At the very least it's a useful heuristic tool.

~~~
tlynchpin
What are the obvious reasons? I'll presume you are referring to disclosure of
the card number.

I had this discussion recently about 'security' with regard to X-Header versus
?query=param. Either it's http all plaintext on the network or it's http with
tls all ciphertext on the network. Every bit in the http request and response
is equivalent - verb, path, headers, body, etc - agree?

You could represent the card number as ciphertext in the request body; that's
a good practice regardless of tls, but of course don't roll your own crypto.
You could put that ciphertext in the path as well, but if the ciphertext isn't
stable that makes for a huge mess of paths.

You could make a case about the trad 'combined' access logs situation, with
the path disclosed in log files. I can appreciate that keeping uris 'clean'
makes it safe to integrate a world of http monitoring tools; I would make this
argument. In the case of the card represented as stable ciphertext it's kinda
cool to expose it safely to those tools.

Anything else?

~~~
tda
If you grab something from an external service (say a CDN) then I believe by
default the referrer will contain the URL + query params, but not your
X=supersecret header. Bit me once.

------
nieve
The one I've seen missed most often in startups is directly implied by a lot
of the other points and obvious to anyone with long experience: take the time
and put a lot of thought into how to break up your big transitions into
smaller stages, each of which is functional. It's usually possible to at least
narrow down the risky parts to a few finer-grained steps, and when something
fails, rolling back only one part to get to a good state is almost always
faster and safer.

It's very easy to get absorbed into the awareness of the high level change
you're making and miss the details of the process. Even just sitting down
together and outlining what you think is actually going to go on (and then
breaking those down into what they each are comprised of) can make it really
clear that you don't have to run as many giant risks. I'm occasionally amazed
how brilliant people (including some with big names in devops) can forget it's
an option.

It's like taking small steps from stable to stable when you're going across a
steep scree slope and only jumping when you have to - sometimes it feels
riskier to take lots of small steps, but if you start to slide it can be a lot
easier to recover from. Your chance of dying taking a big leap isn't the sum
of the equivalent small steps. Perhaps complex computer systems have the
equivalent of an Angle of Repose?

------
mahkoh
On JSON:

"if you only need 53 bits of your 64 bit numbers"

JSON numbers are arbitrary precision.

"blowing CPU on ridiculously inefficient marshaling and unmarshaling steps"

On the other hand I am not blowing dev and qa time on learning/developing
tools to replace curl/jq/browser/text editor.

~~~
patrec
> JSON numbers are arbitrary precision.

> On the other hand I am not blowing dev and qa time on learning/developing
> tools to replace curl/jq/browser/text editor.

What a pleasant surprise it will be for you when you find out that jq silently
corrupts integers with more than 53 bits.
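
Easy to demonstrate (Python standing in for any consumer that parses JSON
numbers into IEEE754 doubles, which is what jq has historically done):

      n = 2**53 + 1          # 9007199254740993: perfectly valid JSON text
      print(int(float(n)))   # 9007199254740992: silently off by one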

~~~
mr_crankypants
It's perhaps less a limitation of JSON and more a limitation of JavaScript
itself[1].

But still, given that easy consumption from JavaScript is the ultimate
primordial reason for choosing JSON over other formats, it seems like trying
to transmit integers with more than 53 bits of precision over JSON is asking
for trouble, because it's only a matter of time until someone wants to do
something like write a new service in Node, and the JavaScript parsers for
other formats are at least somewhat more likely to guide people toward using
BigInt for large integers.

[1]: [https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/MAX_SAFE_INTEGER](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/MAX_SAFE_INTEGER)

~~~
apaprocki
In TC39, we specified BigInt to not participate in JSON by default, precisely
because emitting it as a native JSON number could not be read back losslessly
by JSON.parse(), and many other environments would also not be able to parse
it without extra logic if they simply use IEEE754 doubles.

I explicitly asked for and achieved step 2. in the modified
SerializeJSONProperty algorithm[1] so that users could decide and opt in to
serializing BigInts as strings if they so choose, with or without some sigil
that could be interpreted by a reviver function. e.g.:

    
    
      > JSON.stringify(BigInt(1))
      TypeError: Do not know how to serialize a BigInt
      ...
      > BigInt.prototype.toJSON = function() { return this.toString(); }
      > JSON.stringify(BigInt(1))
      '"1"'
    

[1]: [https://tc39.es/proposal-bigint/#sec-serializejsonproperty](https://tc39.es/proposal-bigint/#sec-serializejsonproperty)

------
wyc
I love item #2, as it talks about writing code that can safely handle "future"
enums and values as a result of rolled back code. Maybe we should call it the
Sarah Connor Pattern.

I haven't heard enough people discuss the deployment management of growing
enums or state machine evolution. This is a problem more particular to
software than hardware, as once hardware is shipped it's usually set in
silicon, but growing of the state garden is an expectation in many software
architectures.

------
VBprogrammer
One of the fun challenges I have in my current job is that we provide releases
to customers according to the customer's schedule (which is related to needing
hours of downtime, because it's a creaky old system).

Some customers will skip releases altogether, making strategies like "add a
new column, back-populate it online, then have the next release use the new
value" impossible.

I guess that point is slightly moot when it'd take 2-3 releases to achieve the
end goal and each release cycle is about a month.

~~~
jefftk
You're not in a great situation, but there are still lessons from the post.
For example, you could break upgrades into a series of changes that can be
individually rolled back, and run smoke tests in between each one.

But your biggest reliability improvement would come from getting this system
moved to continuous deployment without downtime. Now you can make one change
at a time and roll back if it doesn't work.

~~~
aeorgnoieang
If they're in anything like the last similar situation I was in,
continuous deployment may not be possible – not technically, but with respect
to other considerations, e.g. management or even the business's finances.

There are still a lot of small software companies maintaining, and selling,
on-premises installations of systems that use, e.g., Microsoft Access as a
front-end client. Even then, continuous deployment is possible and, all else
equal, a huge improvement for the developers and support staff, but also
something that lots of management or owners may be (reasonably) averse to
committing to implementing.

------
jefftk
This is a great list! One thing I would add, once you have everything on this
list, is a way to experiment on your changes. Instead of flipping a flag,
seeing your error rate jump, and flipping it back, you run an experiment where
you flip the flag on 0.1% of requests. Now you can compare this traffic to a
control group, and aren't stuck wondering "did errors go up by 5% during our
rollout because we broke things, or by chance?". If things look as expected at
0.1% you can ramp to 1%, then 10% before releasing.
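
A sketch of one common way to carve out that 0.1% arm deterministically (the
function and its parameters are made up, not from the post):

      import hashlib

      def in_experiment(user_id: str, fraction: float = 0.001) -> bool:
          # Hash a stable id so the same user always lands in the same
          # arm, leaving everyone else as the control group.
          digest = hashlib.sha256(user_id.encode()).digest()
          bucket = int.from_bytes(digest[:8], "big") / 2**64
          return bucket < fraction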

------
ricardobeat
> using something with a solid schema that handles field names and types and
> gives you some way to find out when it's not even there [...] ex: protobuf

In proto3 all fields are optional, and have default values, so it becomes
impossible to detect the absence of data unless you explicitly encode an
empty/null state in your values.
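
A toy illustration of the ambiguity (plain Python standing in for generated
proto3 code; the field name is made up):

      from dataclasses import dataclass

      @dataclass
      class User:
          age: int = 0   # proto3 scalar: unset and zero look identical

      u = User()
      print(u.age == 0)  # True -- but was age never set, or really zero?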

~~~
asark
Ew, what was the reason for that?

~~~
jrockway
Because in proto2 everyone used "optional" anyway.

------
punnerud
I managed releases for one of Norway’s largest hospitals, and once everything
on the ‘reliability list’ is checked off and you have frequent releases, the
real headache is ‘cross-system rollback’ between several systems/companies. On
top of that, this is done with the whole hospital in emergency procedure...

------
C4stor
Seems like a typical article from (I assume) a GAFA employee: good advice
mixed with "how to be Google even if you don't need to" advice.

------
crummy
I don't know anything about databases. How do you roll back a significant
schema change?

~~~
evanelias
A few strategies used by large companies, e.g. Facebook where the author
worked for some time:

* Use external online schema change tooling which operates on a shadow table, so the tooling can be interrupted without affecting the original table. (Generally all of the open source MySQL online schema change tools work this way.)

* Use declarative schema management (e.g. tool operates on repo of CREATE TABLE statements), so that humans never need to bother writing "down" migrations. Want to roll something back? Use `git revert` to create a new commit restoring the CREATE TABLE to its previous state, and then push that out in your schema management system. (Full disclosure, I spend my time developing an open source system in this area, [https://skeema.io](https://skeema.io))

* Ensure that your ORMs / data access layers don't interact with brand new columns until a code push occurs or a feature flag is flipped.

~~~
aeorgnoieang
Skeema looks cool, and I've long been a fan of a similar tool that's specific
to SQL Server – DB Ghost. I even implemented some pretty handy additional
automation on top of it at various companies at which I worked. There were,
though, inevitably a small number of schema changes that needed to be
encapsulated as 'migrations', and it could be arbitrarily complex and
difficult to write and test them in a way that supported automatic rollbacks.

------
Thorrez
> In this case, you need to make sure you can recognize the new value and not
> explode just from seeing it, then get that shipped everywhere. Then you have
> to be able to do something reasonable when it occurs, and get that shipped
> everywhere. Finally, you can flip whatever flag lets you start actually
> emitting that new value and see what happens.

Can't those first 2 steps be combined? Why do they need to be shipped
separately?

~~~
akuji1993
Depends on your updating architecture. If you have a central backend service
but scheduled patches, or the user has to manually update, that might become a
problem. There could also be the case that client and backend can be updated
independently and one might not be done before the other. The issue might only
exist for a few minutes, or an hour, but could still have a huge impact on the
software.

~~~
Thorrez
Sorry, I still don't understand. The post says you need to do:

1. Have the code recognize the new value. Get that shipped everywhere.

2. Have the code do something reasonable when the new value appears. Get that
shipped everywhere.

3. Start emitting the new value.

My proposal is:

1. Have the code recognize the new value. Have the code do something
reasonable when the new value appears. Get that shipped everywhere.

2. Start emitting the new value.

If the user has to manually update, and might refuse to do so, the problem
exists under both the 3 step and the 2 step process. So nothing is gained by
using the 3 step process.

If the client and backend are updated independently, that should still be fine
with the 2 step process. The 2 step process says "Get that shipped
everywhere", meaning shipped to both the client and the backend.

~~~
joshuamorton
I would rephrase this as:

1. Support the new value in your schema.

2. Support the new value in the client.

3. Emit the new value from the server.

Combining 1 and 2 is probably possible, but not a great idea. Imagine if you
end up rolling back 2 and 3, but there are still potentially new values in
flight. If 1 is still there, you're good. If 1 and 2 are combined, you roll
back all 3, but the new value is still in flight and your client crashes.

~~~
aeorgnoieang
The advice seemed to be to, first, support (i.e. correctly ignore) any
_possibly new_ value in both the client and server (i.e. any client of the
'schema').

------
barbarbar
She has written many very interesting posts. Also one about "The One".

------
z3t4
Hot patching is scary, but it makes you optimize for the right things: code
that's easy to understand and easy to update, errors that are detectable
early, and easy recovery/rollback. And it makes you think and understand
_before_ writing the actual code.

------
torbjorn
I love this and it makes me feel both excited and scared since I am suddenly
realizing the ways my org is not in compliance with this good advice.

Are there any good books that are full of more rules of thumb like these?

~~~
dgacmu
The Google SRE book (free online or available as dead tree):

[https://landing.google.com/sre/books/](https://landing.google.com/sre/books/)

------
maximente
what is the load balancing story for these RPC services thusly recommended? it
was completely glossed over as if it was not even relevant; i know gRPC uses
HTTP/2 and persists connections so it's not as simple as throwing a proxy in
front.

that seems like a non-trivial point of friction when it comes to "just using
solid storage/RPC formats" or whatever.

~~~
elteto
For gRPC there is Envoy, but I have not used it so I can’t give you more info
on it.

However, I think the point in the article is that you need to have well
defined schemas for inter-service messaging. Something like protobuf or thrift
or flatbuffers. Whether you layer gRPC on top of that is a separate concern.
For example I have used Protobufs extensively at work but never gRPC, since we
mostly have point-to-point connections. We check the message schemas into
their own repo and all users across the company pull from it. We have Python
and C++ codebases sharing the same schemas; it’s quite wonderful.

------
akuji1993
I think something that's missing from this list is a really good and
consistent QA cycle. You need to have rollbacks, I agree, but even better is
when bugs can't even make it into your production build. Having automated
testing set up, (actually correctly done) code reviews, and quality gateways
in place can save you a lot of time rolling back your code. Catch the bug
before it goes live.

~~~
msh
But you will never catch them all.

~~~
akuji1993
Sure, it's just one more net for the bug to get caught in, though. And one
that catches a lot of errors, at least for us.

The biggest impacts for our code health are

- TypeScript

- Automated testing through GitLab CI

~~~
msh
Of course, your comment just sounded like if you had good QA, rollbacks would
not be needed.

~~~
akuji1993
Oh no, that wasn't at all what I meant. I just wanted to mention another tool
that can reduce issues with your production code. Rollbacks need to be there
in any case.

