- Metric-driven design
-- Define your metrics and success criteria up front, as graphs on a dashboard, and build your system so that it's clear from the dashboard that the system is meeting its success criteria
-- Unit test your business logic
-- For every logic bug, write a test which fails, fix the bug, and verify that the test succeeds (regression test)
-- Integration test all your APIs and the "success criteria"
--- try to keep them implementation-agnostic
-- Load test for latency-sensitive components
--- The results of load tests can be published as metrics in your metrics dashboard, since latency is often a success criterion
-- Continuous Delivery pipeline
--- Not only does this improve productivity and stability, but it is super useful for emergency deployments, so that you can know quickly if your fix is breaking anything, and you can add regression tests quickly
-- Be liberal with logging
-- Log internal state changes (cache evictions, refreshes, connections opening/closing, etc.)
-- Add lots of logging in complicated business logic (there will be bugs, and logs will make them easy to find)
- Liberally sprinkle assertions throughout your code (especially if data quality is more important than resilience)
- Be conservative with dependencies
-- Think carefully before using that new sexy library
--- Do the authors make backwards-compatible changes?
--- Is it well-maintained?
--- Is it likely to be abandoned?
-- Think carefully before coupling systems together with code dependencies
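The regression-test habit above (write a failing test, fix the bug, verify the test passes) can be sketched like this. The `parse_price` helper and its bug are invented for illustration:

```python
# Hypothetical business-logic helper: parse "1,234.50"-style price strings.
# The original bug: thousands separators were not stripped, so
# parse_price("1,234.50") raised ValueError.
def parse_price(text: str) -> float:
    return float(text.replace(",", ""))  # the fix: strip separators first


# Regression test: written so it failed before the fix, kept forever after.
def test_parse_price_with_thousands_separator():
    assert parse_price("1,234.50") == 1234.50


def test_parse_price_plain():
    assert parse_price("19.99") == 19.99
```

The point is not the specific bug but the habit: every logic bug leaves behind a permanent test that pins the failing input.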
Then iterate and add all the stuff mentioned above.
If you want resilience, make an upper layer that will recover your service on failure, deploy it on redundant hardware, add strict alerts for when recovery is impossible; and still sprinkle assertions through your code.
Data quality is never secondary.
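A sketch of "assertions over resilience" when data quality matters. The `apply_payment` function and its invariants are invented for illustration; the idea is that a crash here is recoverable by the supervising layer, while silently corrupted data is not:

```python
def apply_payment(balance_cents: int, payment_cents: int) -> int:
    """Deduct a payment from a balance, failing loudly on bad data."""
    # Precondition: refuse nonsense inputs rather than propagating them.
    assert payment_cents > 0, f"non-positive payment: {payment_cents}"
    new_balance = balance_cents - payment_cents
    # Postcondition: money is conserved by this operation.
    assert new_balance + payment_cents == balance_cents
    return new_balance
```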
- Set up automatic notification of failure conditions in those metrics. People shouldn't have to stare at and interpret graphs to know something is wrong.
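A minimal sketch of turning dashboard metrics into automatic notifications. The metric names, thresholds, and `notify` callback are all invented for illustration:

```python
from typing import Callable, Dict, List

# Invented thresholds: alert when a success criterion is violated.
THRESHOLDS = {"p99_latency_ms": 250.0, "error_rate": 0.01}


def check_metrics(metrics: Dict[str, float],
                  notify: Callable[[str], None]) -> List[str]:
    """Compare current metric values against thresholds; notify on each breach."""
    breached = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            breached.append(name)
            notify(f"ALERT: {name}={value} exceeds {limit}")
    return breached
```

In a real system this job belongs to your monitoring stack's alerting rules rather than hand-rolled code, but the shape is the same: thresholds live next to the metrics, and a human only gets involved when one is crossed.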
Stage 5. Never change anything or do any useless activities or upgrades unless you really have no other choice and there's clear demonstrated value in the change.
Stage 6. Get someone else to do it for you. Like the silly enthusiastic junior dev who is willing to sacrifice his weekend for some stupid problem.
Stage 7. Switch career to something sensible such as a baker.
Stage 6. Realize that painful steps need to be automated and repeated often. Then each upgrade has far fewer changes, and it's easier to locate the source of errors.
If you have a team that's actively working on the code, incremental upgrades to the dependencies should be pretty comfortable. But if you've got some legacy thing that no one really knows, that's when you need to be careful and update in small pieces.
It's not fun, but as a consequence you don't get stuck with something that you absolutely can't maintain.
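The "automate and repeat" idea can be sketched as a one-dependency-at-a-time upgrade loop. The `upgrade`, `run_tests`, and `revert` hooks are stand-ins for your real package manager and test suite:

```python
from typing import Callable, Dict, List


def upgrade_incrementally(deps: List[str],
                          upgrade: Callable[[str], None],
                          run_tests: Callable[[], bool],
                          revert: Callable[[str], None]) -> Dict[str, bool]:
    """Upgrade one dependency at a time so a test failure points at one change."""
    results = {}
    for dep in deps:
        upgrade(dep)
        if run_tests():
            results[dep] = True   # tests pass: keep the upgrade
        else:
            revert(dep)           # small blast radius: revert just this one
            results[dep] = False
    return results
```

Because each iteration changes exactly one thing, a red test suite implicates a single dependency instead of a tangle of simultaneous upgrades.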
"a much more experienced SRE later came in to work with the team on making the same service operate better, and I got to see what he did and what his process for improving things looked like"
Operations is an expertise that you can develop, or you can decide that it isn't what you want to be doing. Just like machine learning, games, databases, or operating-system internals: everybody can tackle a tiny project, but only a minority of people will like it and be good enough at it to make it their life's work.
1) Blind optimism
2) Blind pessimism
3) A view that integrates both.
I was first exposed to it through this (disclaimer: religiously) themed speech: https://speeches.byu.edu/talks/bruce-c-hafen_love-is-not-bli...
I love this quote the speaker uses:
Some stupid people started the idea that because women obviously back up their own people through everything, therefore women are blind and do not see anything. They can hardly have known any women. The same women who are ready to defend their men through thick and thin . . . are almost morbidly lucid about the thinness of [their] excuses or the thickness of [their] head[s]. . . . Love is not blind; that is the last thing that it is. Love is bound; and the more it is bound the less it is blind. [G.K. Chesterton, Orthodoxy (Garden City, N.Y.: Image Books, 1959), pp. 69–71.]
Are there any good guides that focus less on "best practices" and more on trade-offs (especially from the point of view of someone who's more interested in a simple route to 4-nines than going much beyond that)?
Build with distribution and fault tolerance in mind. You should be able to knock out a random server and keep running with no visible effect.
Monitoring, with alerts when certain thresholds are breached.
Build a continuous delivery pipeline, and keep working on it until you're so confident that deployments become non-events that can happen at any time. See automated testing.
Replicas, and HA setups should be the norm.
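A sketch of the "knock out a random server" goal from the client's side: try each replica in turn, so a single dead server is invisible to callers. The replica names and `fetch` callable are invented for illustration:

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: List[str],
                       fetch: Callable[[str], T]) -> T:
    """Try each replica in order; only fail if every one is down."""
    last_error: Optional[Exception] = None
    for host in replicas:
        try:
            return fetch(host)
        except ConnectionError as exc:  # this replica is down: try the next
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error
```

Real HA setups usually push this job into a load balancer or service mesh rather than client code, but the invariant is the same: no single replica's failure is observable from outside.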