
What I Learned Managing Site Reliability for Some of the Busiest Gambling Sites - slyall
https://zwischenzugs.wordpress.com/2017/04/04/things-i-learned-managing-site-reliability-for-some-of-the-worlds-busiest-gambling-sites/
======
jedberg
I'm going to have to respectfully disagree with a big chunk of this article.
Documentation is generally a waste of time unless you have a very static
infrastructure, and run books are the devil.

You should never use a run book -- instead you should spend the time you were
going to write a run book writing code to execute the steps automatically.
This will reduce human error and make things faster and more repeatable. Even
better is if the person who wrote the code also writes the automation to fix
it so that it stays up to date with changes in the code.
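
A rough sketch of what I mean -- remediation code instead of a prose runbook
step. Everything here (the queue check command, the threshold, the service
name) is hypothetical, just to show the shape:

    #!/usr/bin/env python3
    # Hypothetical: the runbook step "if the queue backs up, restart the
    # worker and confirm it drains" written as code instead of prose.
    import subprocess
    import sys
    import time

    QUEUE_DEPTH_LIMIT = 10_000  # made-up threshold

    def queue_depth():
        # stand-in for however you actually measure backlog
        out = subprocess.run(["check_queue_depth"], capture_output=True, text=True)
        return int(out.stdout.strip())

    if queue_depth() > QUEUE_DEPTH_LIMIT:
        subprocess.run(["systemctl", "restart", "worker.service"], check=True)
        time.sleep(60)
        if queue_depth() > QUEUE_DEPTH_LIMIT:
            sys.exit("queue did not drain after restart -- page a human")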

At Netflix we tried to avoid spending a lot of time on documentation because
by the time the document was done, it was out of date. Almost inevitably any
time you needed the documentation, it no longer applied.

I wish the author had spent more time talking about incident reviews. Those
were our key to success. After every event, you review the event with everyone
involved, including the developers, and then come up with an action plan that
at a minimum prevents the same problem from happening again, but even better,
prevents an entire class of problems from happening again. Then you have to
follow through and make sure the changes are actually getting implemented.

I agree with the author on the point about culture. That was absolutely
critical. You need a culture that isn't about placing blame but finding
solutions. One where people feel comfortable, and even eager, to come out and
say "It was my fault, here's the problem, and here's how I'm going to fix it!"

~~~
wink
There's stuff where you can either spend weeks or months automating it - or
live with the fact that your oncall engineer has to use a runbook a few times
per year to fix the problem.

Also your Netflix example - how many people do you have there? Probably the
smallest of teams is bigger than my company's whole engineering department.
We're running a whole company and not a few services. (I'm absolutely not
trying to discourage what you're doing - but I strongly feel it's a different
ballgame.) The smaller your team and the more widespread its responsibilities
(not: make THIS service available 99.9%, but make ALL services available 90%,
then 95%, ...), the more you only automate what happens frequently. And yes,
we try to write a basic runbook first when _something_ happens - and only if
the same thing happens repeatedly do we automate it.

~~~
jedberg
We were doing this with a team of sometimes one or two reliability engineers,
but we were cheating, because our company culture meant that the engineers who
built the systems were responsible for keeping them running, so they would
invest their engineering time in fixing the problems along with us.

I personally found that runbooks were even worse for small teams (like our
four-person reddit team) because they would get out of date even quicker than
at the bigger places due to the rapidly changing environment.

I wrote downthread that if all of your deployment is automated, then it is
much easier to automate remediation, because you just change your deployment
to fix the problem, as long as you can redeploy quickly.
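
A minimal sketch of that idea, assuming the deploy is already driven by a
config file and a single redeploy command (both file and command names are
made up):

    #!/usr/bin/env python3
    # Hypothetical: remediation when deployment itself is automated --
    # fix the problem by changing the deploy config and redeploying.
    import json
    import subprocess

    with open("deploy.json") as f:
        config = json.load(f)

    # e.g. the fix agreed in the incident review: more capacity at peak
    config["instance_count"] = max(config["instance_count"], 8)

    with open("deploy.json", "w") as f:
        json.dump(config, f, indent=2)

    # "deploy" stands in for whatever redeploy tooling you already have
    subprocess.run(["deploy", "--config", "deploy.json"], check=True)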

~~~
toomuchtodo
> We were doing this with a team of sometimes one or two reliability
> engineers, but we were cheating, because our company culture meant that the
> engineers who built the systems are responsible for keeping them running, so
> they would invest their engineering time in fixing the problems along with
> us.

What advice would you give for an org where the engineers who build systems
are not responsible for keeping them running, and everyone on a (much smaller
comparatively) infrastructure team is (which is slowly turning into an SRE
team by necessity)?

Anecdotally, I've found documentation to be useless; even when the
documentation is high quality, no one refers to it, even after iterating on it
to add information, make it more relevant, streamline it, etc.

~~~
jedberg
My advice would be to push as hard as you can to change the culture, or you'll
be drowned. Engineers will not make it a priority to fix anything that causes
outages because they will be evaluated on feature velocity, not uptime.

If you can make the company culture focus on uptime, or get engineers involved
in remediation, then you'll be better off.

If you can't do that, try to at least push for the Google model: The engineers
are responsible for uptime of their product until they can prove that it is
stable and has sufficient monitoring and alerting, and then they can turn it
over to SRE, with the caveat that it will go back to the engineers if it gets
lower in quality.

~~~
j_s
Or push to change the culture to match the one given in the article, where
documentation is important and kept updated?

------
mfonda
I'm a big fan of checklists as well; however, working through a checklist is
still a manual process with room for human error. I've gone down the road of
creating checklists, then realized that many of the items would be much better
automated. For example, suppose a list item was "Ensure command X is called
with arguments -a, -b, and -c, then command Y is called". This could all be
wrapped into a simple script that calls these commands, eliminating the
potential for human error. I've found that as I create checklists, they often
turn into a list of things I really need to automate.
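
For the example above, the wrapper can be as small as this (X, Y, and the
flags are of course just the stand-ins from the hypothetical checklist item):

    #!/usr/bin/env python3
    # Wraps "ensure X is called with -a, -b, and -c, then Y is called"
    # so it can't be done half-way or with the flags in the wrong order.
    import subprocess

    subprocess.run(["X", "-a", "-b", "-c"], check=True)  # aborts here if X fails
    subprocess.run(["Y"], check=True)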

~~~
Cyranix
Definitely agree — and that's not a knock against checklists.

Externalizing a process that previously existed only in someone's head is a
win. Creating a checklist is a straightforward framework for that
externalization, and allows you to separate the question "how do I accomplish my
goal?" from the question "how do I express this with code?" (like writing
pseudocode before real code).

Whether a process is automated or manual, there's always room for error if the
surrounding context changes. When an automated process is annotated _like_ a
checklist, I find that I get the best of both worlds: minimal affordance for
human error with a clearly described thought process to fall back on in the
event of a problem.

(It's also not terribly uncommon to have steps that can't be fully automated,
like authenticating to a VPN with 2FA under certain security frameworks...!)
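
One way that can look in practice -- purely illustrative, with made-up
commands -- is a script whose comments read like the original checklist and
which pauses for the steps that stay manual:

    #!/usr/bin/env python3
    # An automated process annotated like a checklist. Step names and
    # commands are illustrative only.
    import subprocess

    # Step 1: connect to the VPN. Can't be automated because of 2FA,
    # so the script just waits until a human confirms it's done.
    input("Authenticate to the VPN with your 2FA token, then press Enter... ")

    # Step 2: pull the latest release artifacts.
    subprocess.run(["fetch_release", "--latest"], check=True)

    # Step 3: run migrations before restarting, never the other way round.
    subprocess.run(["run_migrations"], check=True)

    # Step 4: restart the app servers.
    subprocess.run(["restart_app_servers"], check=True)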

------
rb808
Great article. Documentation is always tough - I still haven't found the right
solution. It's nice to see some SRE people that care; most 1st/2nd line
support I've worked with seem to just escalate anything non-trivial, which is
frustrating, but also useful.

If SREs are too good, developers lose touch with production and get lazy.

~~~
zhengyi13
Reflecting on my time in frontline support, 1st level folk have limited
skills, limited resources in terms of time and tooling, and in particular tend
to have pretty tight metrics applied to them. They pretty much _have_ to
escalate quickly, or they'll be yelled at, or worse.

Some of them flat out don't belong in Support. But poor management and poor
metrics drive behavior in unhelpful directions, too.

------
elorant
I wish there were more articles like this for sites that are considered gray
areas like gambling or porn.

~~~
kbenson
For porn, YouPorn has had technical (or technical PR) submissions come up
multiple times over the years.

"Youporn.com is now a 100% Redis Site" (126 comments, 5 years ago)[1]

"How YouPorn Uses Redis: SFW Edition" (95 comments, 4 years ago)[2]

"YouPorn: Symfony2, Redis, Varnish, HA Proxy... (Keynote at ConFoo 2012)" (49
comments, 5 years ago)

There's more in the HN search.

1:
[https://news.ycombinator.com/item?id=3597891](https://news.ycombinator.com/item?id=3597891)

2:
[https://news.ycombinator.com/item?id=6137087](https://news.ycombinator.com/item?id=6137087)

3:
[https://news.ycombinator.com/item?id=3750060](https://news.ycombinator.com/item?id=3750060)

------
systems
I'm always surprised that gambling sites and similar businesses with
questionable morality don't face a hard time recruiting top engineers; I
always thought this would be a major concern.

~~~
zwischenzug
In the UK it is less of an issue, because the industry is well-regulated and
gambling (esp sports betting) is seen as an acceptable pastime akin to social
drinking.

Occasionally people expressed disquiet, but since the main alternative in
London is working for banks, there wasn't a great deal of choice. Personally I
don't see the excessive advertising we are subject to as much better for
society than gambling being available, but hey.

~~~
user5994461
> Occasionally people expressed disquiet, but since the main alternative in
> London is working for banks, there wasn't a great deal of choice.

Did both. Gambling is very similar to finance.

Turns out that finance pays more and treats their employees better.

~~~
zwischenzug
Have done both also, with the same experience... we may have worked for the
same orgs :)

------
jldugger
> The team I joined was around 5 engineers (all former developers and
> technical leaders), which grew to around 50 of more mixed experience across
> multiple locations by the time I left.

Unfortunately, it's difficult for the audience to determine whether the
author's success is attributable to the philosophy in the post, or the 10x
growth in staff.

~~~
stephengillie
The author attributes both his success and the team's growth to his
documentation philosophy. It does make some sense - improving process
documentation for the most common issues will help ensure new team members
resolve these issues consistently and effectively, both increasing their
utility to the team and their personal morale. How much is directly
attributable to him is open to debate.

~~~
zwischenzug
The team's growth was partly because of its perceived success at managing
operations, but also because of large growth in customer base and application
sprawl. The business model we had depended on fast releases and customers who
refused to pay for much-needed testing or QA, but were happy to pay for the
support we provided. Not saying that's a good thing, but it was the plate we
were served.

