
Team Structure for Software Reliability Within Organizations - kiyanwang
https://www.blameless.com/structuring-team-software-reliability/
======
sethammons
Re SREs: we've taken a different approach. Devs own their systems. If the
issue is network or hardware, there are more ops oriented folk that make that
work. But when it comes to services, teams own their stuff. Bug at 2am causes a
service to go down? The team is expected to start debugging it and pull people
in as needed.

The result? Shifting to this model from an SRE firefighting model has lowered
total alerts, incidents, and outages. When we first shifted, people dreaded
going on their on call rotation for their team; you expected to be paged
multiple times a night. Fast forward and now I can count on one hand the
number of pages I've had over my last several on-call rotations. Work is
prioritized for system reliability because we feel the pain.

~~~
calciphus
This answer always downplays what "on call" really means. Even if you only get
paged once a year, if you're "on call" and the expectation is that within X
minutes you are at a computer working, this affects the life you live. You
can't go camping, out to a nice dinner, on vacation, etc without lugging a
laptop, internet connection, etc with you.

I've heard managers often tout that "if your code is stable you'll never get
paged". But that hides the fact that you're still on the hook to be available.
So either you start shirking your responsibility or you reshape your life
around some percent of non-working hours being owned by the company still.

~~~
rednerrus
Or you put your laptop in the trunk of your car when you're on call and have
tethering on your phone.

Expecting another group to maintain your code seems an awful lot like throwing
it over the wall.

~~~
lazyasciiart
And you keep your phone on during a film, and at the nice dinner, and during
your daughter's wedding, and never go camping without phone service. That is
_exactly_ what is being described by "This answer always downplays what 'on
call' really means."

~~~
rednerrus
During your rotation, you have to have your phone with you. I go to the movies
and to nice dinners. I would definitely get my rotation covered for a wedding
or camping.

You should have faith in the systems that you build. If you don't want to
carry a pager for them, why does someone else want to do it?

------
JMTQp8lwXL
I work in a publicly traded company. I do everything from requirements
elicitation, implementation, unit testing and end-to-end testing, own the
deployment process, as well as configure logging and monitoring to ensure
deployments continue to stay healthy. Our software's reliability is as strong
or as brittle as each individual engineer deciding to implement or skip these
processes, since most micro-services are made by groups of 2-5 people (and I'm
currently a team of 2, and for a few months was a team of 1).

~~~
jungturk
Would defining a (required) key set of observables not help to reduce the
variability across teams? Or perhaps the issue you're describing is that teams
aren't required to demonstrate any?

In my experience (as a chief architect and engineering VP), laying out some
baseline metrics closes much of the reliability variance across teams.

As the teams get more competent at baking in the basics (overall load,
latency, resource utilization, error rates, error events posted to chat) you
can ratchet up the competency to include higher order observables (scaling
events, business transactions, circuit statuses, traces, anomalies).

Which is to say it's been straightforward (in my experience) for most teams to
raise the bar once they know there _is_ a bar and once they can see the bar.
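
A minimal sketch of what "seeing the bar" could look like in practice, assuming each service declares which observables it emits. The tier names, the observable identifiers, and the `audit` helper are all hypothetical; they just restate the two lists above as data.

```python
# Hypothetical names: these sets restate the "basics" and the
# "higher order observables" from the comment above; they are not a standard.
BASELINE = {
    "load",
    "latency",
    "resource_utilization",
    "error_rate",
    "error_events_to_chat",
}
HIGHER_ORDER = {
    "scaling_events",
    "business_transactions",
    "circuit_status",
    "traces",
    "anomalies",
}


def audit(reported):
    """Compare the observables a service reports against the two tiers."""
    reported = set(reported)
    missing_baseline = sorted(BASELINE - reported)
    missing_higher = sorted(HIGHER_ORDER - reported)
    if missing_baseline:
        tier = "below_baseline"
    elif missing_higher:
        tier = "baseline"
    else:
        tier = "higher_order"
    return {
        "tier": tier,
        "missing_baseline": missing_baseline,
        "missing_higher_order": missing_higher,
    }


if __name__ == "__main__":
    # A service that only reports two of the basics.
    print(audit(["load", "latency"]))
```

Running something like this per service and publishing the result is one way to make the bar visible; which observables belong in each tier is a judgment call for the organization.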

~~~
JMTQp8lwXL
Teams aren't required to demonstrate any. Engineers who understand the
importance of quality bake in time, and management is supportive of self-
imposed controls. Our NPS numbers are trash, though. But it's a different kind
of quality issue: we aren't fighting fires every week. The problem is our API
response codes are often wrong or unpredictable. The original engineers
clearly did not understand the HTTP protocol.

------
DrScientist
The article misses out the people element.

Isn't it best to design for reliability in the first place?

Don't you incentivize that best by having the same people supporting an app
as well as developing it?

Nothing makes you write reliable software like being the person being called
out, or dealing with the customer issue.

~~~
karatestomp
I’m becoming more sympathetic over time to the notion that this everyone-must-
do-everything movement is basically a failure. There’s too much.

Be good with: two or three PM and issue tracker tools, a graphics or UI mockup
program or three, various analytics tools and libraries, a payment system or
three, several build and packaging systems, a couple reverse proxies, SSL
provisioning, several ways to schedule jobs, a few message or event buses, a
couple cloud management UIs (command line and browser!) and also WTF all the
brand-names in there do, several major database systems with a few different
query languages and paradigms, three or more programming languages for weekly
use, various communication tools, your own command line environment, the
typical environment on your deployment targets and there may be more than one
of those, infra as code languages/tools, at least one CI system at any given
time, git, the quirks and bullshit of dozens of major libraries across
multiple languages, testing frameworks and practices, networking, and on and
on.

It’s plainly too much and functionally no-one’s doing all that well, which
explains some of what we see from software in the wild. We could probably use
about three specialist developers and a secretary for every do-everything
developer like that.

~~~
rednerrus
You're describing what Devops and SREs do.

~~~
karatestomp
Yes, at places that have them and where those roles are well-integrated and
their teams reasonably well run, which is... not everywhere, by a long shot.

------
coldcode
Articles like this always assume an organization is just one thing, in this
case server and/or web. Our organization is dozens of server teams, internal
apps, mobile apps and web clients, plus interactions with real world systems
(and people). It's impossible to actually model reliability in such a complex
environment.

------
madhadron
Shucks, I was hoping for an analysis of team structure based on Microsoft's
paper[1]

[1]: https://www.microsoft.com/en-us/research/publication/the-influence-of-organizational-structure-on-software-quality-an-empirical-case-study/

~~~
mooreds
Isn't that paper from 2008? Do you think it is still valid?

~~~
cpitman
I don't see what would have changed. Most large companies still have arcane
reporting structures and strong silos that prevent clear communication and
collaboration.

------
nradov
Team culture and incentives matter more than structure.

~~~
JoshMcguigan
Having the right team and organizational structure can go a long way toward
aligning incentives.

------
halbritt
In my experience, the best team structure depends on the organization.
Certainly for smaller orgs and startups it is possible to hire a small set of
engineers with the necessary abilities to handle the reliability work.

In larger, more-established organizations, scrum teams tend to gravitate
toward feature development, where it becomes difficult to prioritize
reliability work even if the skills are present. With very many feature
development teams, it becomes difficult to hire programmers who have both the
primary skill desired and the ancillary skills of reliability and systems
operations.

I'm currently of the belief that in most orgs, a team should be dedicated to
this. There's the ongoing operations of a service, which I think is a shared
duty, and the infrastructure engineering components necessary to abstract the
infrastructure sufficiently well that it is uniform across feature-development
teams and also reliable, scalable, and secure by default.

As with anything, the devil is in the details.

------
ignoramous
AWS, incidentally, spoke a tonne at re:Invent 2019 on how they run their
operations. It is an interesting contrast to the article because AWS takes a
different approach (no dedicated SREs): the engineers that build the system
run ops, too.

Some references:

[0] Colm MacCarthaigh, How to take control of systems, big & small,
https://www.youtube-nocookie.com/embed/O8xLxNje30M (2018)

[1] Eric Brandwine, Aspirational security,
https://www.youtube-nocookie.com/embed/ad9180b4Xew

[2] Marc Brooker, Amazon's approach to building resilient services,
https://www.youtube-nocookie.com/embed/KLxwhsJuZ44

[3] Peter Ramensky, Amazon's approach to high-availability deployment,
https://www.youtube-nocookie.com/embed/bCgD2bX1LI4

[4] Andy Troutman, Amazon's approach to running service-oriented teams,
https://www.youtube-nocookie.com/embed/n1d20Yok000

[5] Colm MacCarthaigh, Amazon's approach to security during development,
https://www.youtube-nocookie.com/embed/NeR7FhHqDGQ

[6] Becky Weiss, Amazon's approach to failing successfully,
https://www.youtube-nocookie.com/embed/yQiRli2ZPxU

[7] Thomas Blood, Amazon's culture of innovation,
https://www.youtube-nocookie.com/embed/2ZQKPUD7vKE

[8] Andy Warfield and Seth Markle, Lessons from Amazon S3's culture of
durability, https://www.youtube-nocookie.com/embed/DzRyrvUF-C0

[9] Colm MacCarthaigh, Lessons from Amazon's highest available data-planes,
https://www.youtube-nocookie.com/embed/2L1S0zfnIzo

[10] Eric Brandwine, The tension between absolutes & ambiguity in security,
https://www.youtube-nocookie.com/embed/GXTvlQXVCOs (2018)

------
hibbelig
The article talks about adding SREs. As if the developers wrote buggy software
and then the SREs fix the bugs. I’m sure I’m misunderstanding. What is meant?

~~~
treerock
Sounds like SREs are fixing bugs _and_ introducing tools and processes that
improve reliability.

> There are two main types of reliability work. The first is mitigation, which
> is a linear fix that’s often referred to as firefighting...The second is
> change management, which is a non-linear fix that proactively reduces the
> defect rates through projects like migrating to better tools and refactoring
> spaghetti code. While SREs support both these types of work, they should
> spend more time on the latter.

