remus · on Feb 16, 2021

> Scale still has something to do with it. If you are small scale, spending an engineer month to think about and make informed decisions and implement it may be more expensive than just accepting whatever downtime happens along the road.

This is discussed pretty explicitly in the SRE book, in particular the idea of an error budget. Obviously they use some Google services as examples, but the key message I got from it was "think about how much downtime is acceptable to you and work around that".

> OTOH, obviously for Google, downtime must be minimized at almost any cost.

Interestingly, the SRE book says quite explicitly that this is a poor goal for pretty much everyone. The cost of chasing more 9s goes up exponentially, while for most users it makes no difference whether your service is 99.999% or 99.9999% available, because 0.1% of the time their shitty router crashes and they have to restart it.

Better to pick a level of availability that strikes a balance between cost and user experience, then work towards that.
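
To put rough numbers on those 9s, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration, not something taken from the SRE book.

```python
# Assumption: a plain 365-day year, ignoring leap years.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime (in seconds) permitted by an availability target.

    availability_percent is a percentage such as 99.9, not a fraction.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.999, 99.9999):
        seconds = allowed_downtime_seconds(target)
        print(f"{target}% -> {seconds / 60:.2f} minutes of downtime per year")
```

With these assumptions, 99.9% allows roughly 8.8 hours of downtime a year, 99.999% about 5.3 minutes, and 99.9999% about 32 seconds. The user's flaky router sits at the 99.9% level, so the last two are indistinguishable from their side.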

jq-r · on Feb 16, 2021

True, and it seems like a simple concept to me.

1. Define a reliability target in advance (best expressed as SLOs), along with the steps to take if it is not met.
2. If the service fails to meet the target, carry out the steps agreed in step 1.
3. Repeat at regular intervals.

The point, I think, is that things are arranged in advance, not after some shit happens, because people get very subjective about "their own" service. The target is there, so let's try to reach it. We have an error budget as well, so let's use it. If you don't have any target (as I've seen in a lot of places), or only a wishful 100% reliability, I'm absolutely sure you'll have major reliability problems.
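
A rough sketch of that loop, assuming a request-based SLO checked once per review window. The class, field names, and the two action strings are hypothetical, made up for illustration; the real "steps" are whatever your team agreed on in advance.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    """Request counts for one review window (hypothetical structure)."""
    slo_target: float      # fraction of requests that must succeed, e.g. 0.999
    total_requests: int
    failed_requests: int

    @property
    def error_budget(self) -> float:
        # Number of failed requests the SLO tolerates in this window.
        return self.total_requests * (1 - self.slo_target)

    @property
    def budget_remaining(self) -> float:
        # Negative when the window has overspent its budget.
        return self.error_budget - self.failed_requests

    def slo_met(self) -> bool:
        return self.failed_requests <= self.error_budget


def next_action(report: SloReport) -> str:
    # Step 2: the response to a miss was decided in advance (step 1),
    # so there is nothing to argue about once the numbers are in.
    if report.slo_met():
        return "budget left: keep shipping"
    return "budget spent: run the pre-agreed reliability steps"


if __name__ == "__main__":
    # Step 3: run this for every review window, e.g. weekly or monthly.
    report = SloReport(slo_target=0.999, total_requests=1_000_000,
                       failed_requests=2_000)
    print(next_action(report))
```

The decision is a comparison against a number written down beforehand, which is what keeps the subjectivity out of it.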

So the SRE book tries to give you a solution to a lot of headaches some medium to large companies might be facing.