
Operations for software developers for beginners - ingve
https://jvns.ca/blog/2016/10/15/operations-for-software-developers-for-beginners/
======
UK-AL
Unit testing

Integration testing

Distributed logging

Build with distribution and fault tolerance in mind. You should be able to
knock out a random server and keep running with no effect (a client-side
sketch of this follows the list).

Monitoring with alerts when certain thresholds are broken

Build a continuous delivery pipeline, and keep working on it until you're so
confident that deployments become non-events that can happen at any point.
See automated testing.

Replicas and HA setups should be the norm.
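
For example, "knocking out a random server" only works if clients fail over.
A minimal Python sketch, assuming a hypothetical list of replica endpoints:

    import random
    import urllib.error
    import urllib.request

    # Hypothetical replica endpoints; any one of them may be down.
    REPLICAS = [
        "http://app-1.internal:8080",
        "http://app-2.internal:8080",
        "http://app-3.internal:8080",
    ]

    def fetch(path, timeout=2.0):
        """Try replicas in random order; losing any single server has no effect."""
        last_error = None
        for base in random.sample(REPLICAS, len(REPLICAS)):
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    return resp.read()
            except OSError as err:  # URLError subclasses OSError
                last_error = err    # this replica is down; try the next one
        raise RuntimeError(f"all replicas failed; last error: {last_error}")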

------
dasmoth
There's a lot of "doing things the Ops way" stuff that's clearly super-
helpful, if not essential, if you're seriously shooting for "five nines". On
the other hand, I've seen it go wrong: for example, redundant servers behind a
load balancer yielding dramatically worse real-world reliability than a single
instance of the backend service managed on its own (while the load balanced
setup was also a nightmare to debug...).

Are there any good guides that focus less on "best practices" and more on
trade-offs (especially from the point of view of someone who's more interested
in a simple route to four nines than in going much beyond that)?

------
swsieber
I once heard a useful paradigm about the three levels of perspective:

1) Blind optimism

2) Blind pessimism

3) A perspective that integrates both.

I was first exposed to it through this (disclaimer: religiously) themed
speech: [https://speeches.byu.edu/talks/bruce-c-hafen_love-is-not-
bli...](https://speeches.byu.edu/talks/bruce-c-hafen_love-is-not-blind-
thoughts-college-students-faith-ambiguity/)

I love this quote the speaker uses:

Some stupid people started the idea that because women obviously back up their
own people through everything, therefore women are blind and do not see
anything. They can hardly have known any women. The same women who are ready
to defend their men through thick and thin . . . are almost morbidly lucid
about the thinness of [their] excuses or the thickness of [their] head[s]. . .
. Love is not blind; that is the last thing that it is. Love is bound; and the
more it is bound the less it is blind. [G.K. Chesterton, Orthodoxy (Garden
City, N.Y.: Image Books, 1959), pp. 69–71.]

------
dsr_
The key sentence in this post:

"a much more experienced SRE later came in to work with the team on making the
same service operate better, and I got to see what he did and what his process
for improving things looked like"

Operations is an expertise you can develop, or you can decide it isn't what
you want to be doing. Just like machine learning, games, databases, or
operating-system internals: everybody can tackle a tiny project, but only a
minority of people will like it and be good enough at it to make it their
life's work.

~~~
vacri
Ops is definitely something you want to learn about from others - you don't
want to be learning _all_ the hard lessons yourself, particularly ones about
robust data storage...

------
ensiferum
Stage 4. Acknowledge that your stuff (and software in general) is now and
always will be buggy somewhere, because it's just too complex for humans to
write correctly; whatever your efforts, it will malfunction.

Stage 5. Never change anything or do any needless activities or upgrades
unless you really have no other choice and there's clear, demonstrated value
in the change.

Stage 6. Get someone else to do it for you, like the silly, enthusiastic
junior dev who is willing to sacrifice his weekend for some stupid problem.

Stage 7. Switch career to something sensible such as a baker.

~~~
perlgeek
> Stage 5. Never change anything or do any needless activities or upgrades
> unless you really have no other choice and there's clear, demonstrated
> value in the change.

Stage 6. Realize that painful steps need to be automated and repeated often.
Then each upgrade has far fewer changes, and it's easier to locate the source
of errors.

~~~
vacri
I'll second this. At one of the places I work, we have the same codebase
split between Node.js 0.10.40 and Node 6.2.something, because the devs were
scared of the incremental changes and of having to fix small things along the
way. It's a nightmare now, and transitioning the last pieces off is proving
to be a massive wallop of tech debt.

If you have a team that's actively working on the code, incremental upgrades
to the dependencies should be pretty comfortable. But if you've got some
legacy thing that no-one really knows, _then_ is the time to start being
careful about updating little bits.

~~~
coredog64
I once worked with a CIO who had a policy that we should be no more than one
version behind current on our software. The idea was that most vendors have
an easy migration plan to current, but beyond that you usually have to
upgrade twice, with all the attendant pain.

It's not fun, but as a consequence you don't get stuck with something that you
absolutely can't maintain.

------
cle
I'm a developer who's dealt with heavy operational loads for years. Some
random things that help:

\- Metric-driven design

\-- Define your metrics and success criteria up front, as graphs on a
dashboard, and build your system so that it's clear from the dashboard that
the system is meeting its success criteria
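
For instance, a sketch of success criteria defined as code, using the
Prometheus Python client (the service and metric names here are made up):

    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical success criteria for an order service, stated up front:
    # "99% of orders succeed" and "p99 checkout latency under 500 ms".
    orders_total = Counter("orders_total", "Orders processed", ["status"])
    checkout_latency = Histogram("checkout_latency_seconds", "Checkout latency")

    @checkout_latency.time()  # records one latency sample per call
    def checkout(order):
        try:
            ...  # business logic goes here
            orders_total.labels(status="ok").inc()
        except Exception:
            orders_total.labels(status="error").inc()
            raise

    start_http_server(8000)  # exposes /metrics for the dashboard to scrape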

\- Testing

\-- Unit test your business logic

\-- For every logic bug, write a test which fails, fix the bug, and verify
that the test succeeds (regression test)

\-- Integration test all your APIs and the "success criteria"

\--- Try to keep them implementation-agnostic

\-- Load test for latency-sensitive components

\--- The results of load tests can be published as metrics in your metrics
dashboard, since latency is often a success criterion

\-- Continuous Delivery pipeline

\--- Not only does this improve productivity and stability, but it is super
useful for emergency deployments: you can know quickly whether your fix is
breaking anything, and you can add regression tests quickly
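
To make the regression-test loop above concrete, a sketch with a made-up
function and bug:

    import unittest

    def parse_price(text):
        """Parse a price string like '$1,299' into cents."""
        return int(text.strip().lstrip("$").replace(",", "")) * 100

    class PriceRegressionTest(unittest.TestCase):
        def test_comma_separated_thousands(self):
            # Written first to reproduce the bug report (it failed before the
            # .replace(",", "") fix), then kept forever as a regression test.
            self.assertEqual(parse_price("$1,299"), 129900)

    if __name__ == "__main__":
        unittest.main()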

\- Logging

\-- Be liberal with logging

\-- Log internal state changes (cache evictions, refreshes, connections
opening/closing, etc.)

\-- Add lots of logging in complicated business logic (there will be bugs, and
logs will make them easy to find)
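
A sketch of what logging state changes can look like (the cache here is
hypothetical and deliberately tiny):

    import logging

    logger = logging.getLogger("cache")

    class BoundedCache:
        """Tiny cache that logs its internal state changes."""

        def __init__(self, max_entries=1024):
            self.max_entries = max_entries
            self.entries = {}

        def put(self, key, value):
            if len(self.entries) >= self.max_entries:
                # dict.popitem() drops the most recently added entry; a real
                # cache would evict LRU, but the point is the log line.
                evicted, _ = self.entries.popitem()
                logger.info("evicted %r (cache full at %d)", evicted, self.max_entries)
            self.entries[key] = value
            logger.debug("stored %r; size now %d", key, len(self.entries))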

\- Liberally sprinkle assertions throughout your code (especially if data
quality is more important than resilience)
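
For example, a sketch where crashing loudly beats corrupting data:

    def apply_payment(balance_cents, payment_cents):
        # Crash loudly rather than silently corrupt an account balance;
        # here data quality matters more than keeping the process alive.
        assert payment_cents > 0, f"non-positive payment: {payment_cents}"
        new_balance = balance_cents - payment_cents
        assert new_balance >= 0, f"balance went negative: {new_balance}"
        return new_balance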

\- Be conservative with dependencies

\-- Think carefully before using that new sexy library

\--- Do the authors make backwards-compatible changes?

\--- Is it well-maintained?

\--- Is it likely to be abandoned?

\-- Think carefully before coupling systems together with code dependencies

~~~
solipsism
I'm with you but I'd add:

\- Set up automatic notification of failure conditions in those metrics.
People shouldn't have to stare at and interpret graphs to know something is
wrong.
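
A minimal sketch of that idea, with a hypothetical error-rate threshold and
webhook:

    import json
    import urllib.request

    ERROR_RATE_THRESHOLD = 0.05  # the "failure condition", decided up front
    WEBHOOK = "https://alerts.example.com/notify"  # hypothetical pager hook

    def check_error_rate(errors, requests):
        rate = errors / max(requests, 1)
        if rate > ERROR_RATE_THRESHOLD:
            body = json.dumps({"alert": f"error rate {rate:.1%} over threshold"})
            req = urllib.request.Request(
                WEBHOOK,
                data=body.encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req)  # page a human; no graph-staring needed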

~~~
spronkey
\-- and make sure, if those notifications are emails, that they are not
triggered every single time the error occurs, potentially flooding you with
thousands of emails.
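
One common fix is a cooldown per alert key, sketched here:

    import time

    COOLDOWN_SECONDS = 15 * 60  # at most one email per alert key per 15 min
    _last_sent = {}

    def should_send(alert_key):
        """True only if this alert hasn't already fired within the cooldown."""
        now = time.time()
        if now - _last_sent.get(alert_key, 0.0) < COOLDOWN_SECONDS:
            return False  # same error firing again; suppress the email
        _last_sent[alert_key] = now
        return True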

~~~
foobarian
What do people use for this kind of thing? I'm aware of Bosun
([https://bosun.org/](https://bosun.org/)) and Prometheus
([https://prometheus.io/](https://prometheus.io/)). Both can alert based on
aggregated metrics, using rich rules such as values moving away from
historical averages by a certain threshold.
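
The "moving away from historical averages" rule is roughly this, sketched
generically in Python (not the actual syntax of either tool):

    from statistics import mean, stdev

    def deviates(history, current, n_sigma=3.0):
        """Alert when the current value sits more than n_sigma standard
        deviations away from the historical average."""
        if len(history) < 2:
            return False  # not enough history to judge
        mu, sigma = mean(history), stdev(history)
        return abs(current - mu) > n_sigma * max(sigma, 1e-9)

    # e.g. deviates([100, 102, 98, 101, 99], 140) -> True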

