
The O-Ring Theory of DevOps - r4um
http://blog.acolyer.org/2015/11/11/the-o-ring-theory-of-devops/
======
exelius
I don't know if this is O-Ring theory as much as a "six sigma" style process.
The quality of each output depends on the quality of the inputs, and
imperfections grow exponentially when things fall out of tolerances.
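
For concreteness, a back-of-the-envelope sketch of that compounding effect (the
step count and per-step quality numbers are made up, not from the article):

    # Toy numbers: in an O-ring-style production function, overall quality is
    # the product of per-step quality, so small per-step imperfections compound
    # across a long pipeline instead of averaging out.
    def chain_quality(per_step_quality: float, steps: int) -> float:
        """Overall quality of a pipeline where every step must come out right."""
        return per_step_quality ** steps

    for q in (0.99, 0.95, 0.90):
        print(f"10-step pipeline at {q:.2f} per step -> {chain_quality(q, 10):.2f} overall")
    # 0.99 per step keeps ~0.90 overall; 0.90 per step collapses to ~0.35.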

The whole point of DevOps is really to get your software developers to share
in the pain that certain design / development choices can cause. In the
absence of some DevOps-like process, dev teams will naturally shovel all their
tech debt on the operations team. Rather than build their software to handle
multi-tenancy, they'll just say "Eh, deploy another VM with a different
configuration". Not that that's necessarily the wrong answer, but because it's
less work for them, they don't take into account how hard it will be for the
operations team to manage. If the dev team is responsible for building the
software, the tools used to manage the deployment, and incident resolution,
they'll make decisions that are better for the company
and its customers.

That's my take, anyway.

~~~
thwarted
_The whole point of DevOps is really to get your software developers to share
in the pain that certain design / development choices can cause._

This is a great characterization and one of the few views of DevOps I can get
behind.

Unfortunately, in my experience some higher-up emits an edict to "do DevOps"
and for whatever reason that has fallen on operations (my area) to "implement"
(a hazy concept and task) with little buy-in from the design/development side;
meanwhile, some developers complain that they didn't sign up to "be on call",
even though they are only being asked to cover their own team's code.
In other situations, developers take it upon themselves to deploy stuff
without feedback or involvement from the operations experts, and you end up
with Shadow IT.

It has to be an organization-wide endeavor, and there has to be a culture of
involvement of all the areas of expertise.

~~~
serge2k
> meanwhile, some developers complain that they didn't sign up to "be on
> call", despite that they are only being tasked with such for their own
> team's code

If they didn't, then how is that not a legitimate complaint? If I take a job
with the knowledge that on-call is required, then it's fine. If I take a job
and 6 months later they hand me a pager and say "starting now you're on call
one week a month", I'm going to get upset.

~~~
thwarted
It is a legit complaint, but I lacked both the influence and the authority to
get members of other teams to step up, and honestly that wasn't my job (not
being their manager). And when there were outages operations had to be there
to resolve issues because other teams had, for example, not formalized their
on-call rotation. And yet the service needs to remain running and available.

This is why I said it has to be the goal of the entire organization. If 6
months in you are tasked with being involved with deployment and live
maintenance of your code, you're welcome to leave the company because it's
going in a direction you don't agree with. Don't be shitting on the operations
team because _they_ didn't implement some VP-read-in-a-book version of DevOps.

You know, there's a lot of talk about how "lack of culture fit" isn't a legit
reason to reject candidates. While it is often used as a catchall rejection,
there _are_ things such as fitting into the organization's processes and
having views that are compatible with the rest of the team that legitimately
fall under the "culture fit" evaluation umbrella. If you're joining a team
that has successfully done waterfall for a long time and doesn't have the
problems that are traditionally associated with the things that agile can fix,
sorry, you're not going to fit in and you'll most likely get resistance to
change. If you're used to throwing code over the wall to operations to deploy
and you're joining a team that takes a more distributed view of
responsibility, you're not going to fit in either.

------
memracom
This is what people mean when they talk about "technical debt". Many of us
have realized that when mistakes are not fixed early, we end up doing 10x,
20x, and even 200x the amount of work in dealing with the consequences of
those mistakes.

In Software Development good tools like an IDE with static code analysis and
good practices like unit testing + integration testing + functional testing +
automated acceptance tests help us to avoid that 200x downside.

DevOps is not all software development, but part of it is, and so it should
respond in the same way to good tools and good test practices. For the rest of
it, there are lots of old practices that do make a difference, such as having
regular backups of everything, having live-live standby servers, hiring a DBA
to clean up the mess in your data models, and NEVER patching the database
except with code that has been unit tested and that generates a script to undo
everything it has patched, along with a hash check of the database to
guarantee that the undo step worked. And so on.
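
A minimal sketch of that database-patching discipline, assuming a hypothetical
SQLite "users" table with a "status" column (the table, column, and patch are
invented for illustration): the patch code records a pre-patch checksum and
emits an undo script so the rollback can actually be verified.

    import hashlib
    import sqlite3

    def table_checksum(conn: sqlite3.Connection, table: str) -> str:
        """Hash every row of a table in a stable order."""
        rows = conn.execute(f"SELECT * FROM {table} ORDER BY rowid").fetchall()
        return hashlib.sha256(repr(rows).encode()).hexdigest()

    def apply_patch(conn: sqlite3.Connection) -> tuple[str, str]:
        """Apply a data patch; return the pre-patch checksum and a generated undo script."""
        before = table_checksum(conn, "users")
        conn.execute("UPDATE users SET status = 'active' WHERE status = 'enabled'")
        conn.commit()
        # Naive undo: only correct if no row was already 'active' before the patch,
        # which is exactly the kind of assumption the checksum is there to catch.
        undo_sql = "UPDATE users SET status = 'enabled' WHERE status = 'active'"
        return before, undo_sql

    def undo_patch(conn: sqlite3.Connection, undo_sql: str, expected: str) -> None:
        """Run the generated undo script and verify the table really was restored."""
        conn.execute(undo_sql)
        conn.commit()
        if table_checksum(conn, "users") != expected:
            raise RuntimeError("undo did not restore the pre-patch state")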

Maybe the cute moniker "O-ring economics" will help to explain it to
management types better but the jury is still out on that, IMHO.

But what manager would NOT want everybody to work as a team, to learn from
their mistakes, and to constantly up their game, as a team? Call it Agile or
call it something else, but this is what has proven itself over the last 70
years or so (perhaps even longer).

------
jjuhl
Excellence in development and excellence in operations do not necessarily
overlap. IMHO the whole "devops" term is bogus since it's just the worst of
both worlds.

You want experts at both ends of the spectrum, not a developer who can do some
sysadmin tasks but not too well, or an operations engineer who can write some
code but not too well. Acknowledge that there are different fields and you
need experts in both. Forget about aiming for a hybrid - it doesn't work.
You'll just get incompetents who are able to pose as being good at both sides.

~~~
mpdehaan2
(Ignoring the article ATM).

Ultimately the "devops" term is widely applied to a bunch of things, usually
meaning one or more of:

* automating things as much as possible - infrastructure as code, continuous deployment, chatops, etc.

* building tools for your organization to use to run your own infrastructure

* fixing cultural and process problems between development and operations and cutting down on change requests, fixing communication, etc.

* Often, about getting developers to build environments where the code they deploy uses the same automation in dev/test as in prod, and more closely resembles prod

* Getting developers to care about production issues

* Getting operations to understand more about development details rather than have a black box they are less likely to understand.

Sometimes the title "Site Reliability Engineer" can mean that; sometimes
"DevOps" means that. Lately "DevOps engineer" means some mix of
infrastructure automation (basically systems administration) and writing a few
tools or scripts (often these tools require some degree of coding). It can get
more custom code heavy depending on the organization, but there's no real
standard.

But ultimately, "DevOps" is a (usually confusing) catch-all. Most "DevOps"
conference are really a mix of agile, process, and systems administration
topics, and many DevOPs meetups are usually about elevating one's game at
systems administration and sometimes about Continous Deployment -- though even
this can vary.

Personally I like the tech parts and don't care as much for the cultural
parts, having found that a lot of it, once heard 4 or 5 times, becomes a bit
"preaching to the choir" - i.e. folks know it already. That's why I think that
once you learn the agile/process/communication bits, the meetups get to be
more about tools - these things change and can be new - but there's some
perceived duty to keep educating the new folks on the cultural stuff
(especially for folks in less startup-like industries, or in large corps that
are slower to adopt things).

~~~
area51org
It's a catch-all that's used far too often to mean too many things. I once
came across a post (from Red Hat maybe?) insisting that "doing DevOps" meant
that you "get management and the techies together and talk about the elephant
in the room". There was no explanation of what this elephant might be, or how
having a meeting was "doing DevOps".

~~~
serge2k
What do you expect, it's a buzzword.

------
powera
Uh, that isn't the "O-Ring Theory" I would have expected.

The Challenger failed because, while the O-Rings were within tolerances, they
shouldn't have been varying _at all_. So because they were varying
unexpectedly, they inevitably failed.

I would claim the corresponding devops theory is "measure everything, know
what is measurement noise and what is user activity, and eliminate everything
else before it causes an outage".

(from Wikipedia: "In one example, early tests resulted in some of the booster
rocket's O-rings burning a third of the way through. These O-rings provided
the gas-tight seal needed between the vertically stacked cylindrical sections
that made up the solid fuel booster. NASA managers recorded this result as
demonstrating that the O-rings had a "safety factor" of 3. Feynman
incredulously explains the magnitude of this error: a "safety factor" refers
to the practice of building an object to be capable of withstanding more force
than the force to which it will conceivably be subjected. To paraphrase
Feynman's example, if engineers built a bridge that could bear 3,000 pounds
without any damage, even though it was never expected to bear more than 1,000
pounds in practice, the safety factor would be 3. If a 1,000 pound truck drove
across the bridge and it cracked at all, even just a third of the way through
a beam, the safety factor is now zero: the bridge is defective.")
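
A rough sketch of what "know what is measurement noise" could look like in
practice (the metric, window, and sigma threshold are all made-up
assumptions): learn the normal variation from recent history and only flag
samples well outside it.

    # Rough sketch (hypothetical metric and threshold): treat deviations within
    # a few standard deviations of recent history as noise, and only alert on
    # samples far outside that band -- before they turn into an outage.
    from statistics import mean, stdev

    def is_anomalous(history: list[float], sample: float, n_sigmas: float = 4.0) -> bool:
        """True if `sample` falls outside the band of normal variation in `history`."""
        mu, sigma = mean(history), stdev(history)
        return abs(sample - mu) > n_sigmas * max(sigma, 1e-9)

    # e.g. request latency in ms over the last hour vs. two new readings
    baseline = [102, 98, 105, 99, 101, 97, 103, 100]
    print(is_anomalous(baseline, 104))   # False: ordinary jitter
    print(is_anomalous(baseline, 180))   # True: investigate before it causes an outage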

~~~
msandford
> The Challenger failed because, while the O-Rings were within tolerances,
> they shouldn't have been varying at all. So because they were varying
> unexpectedly, they inevitably failed.

That's not entirely correct. The idea that "the o-rings were within tolerance"
isn't right. There is effectively no tolerance on the o-ring: either it's
very, very nearly perfect or there's something wrong with your design, just
like the bridge. The o-rings should be getting immeasurably small amounts of
wear so the idea that they were "only" burning up 1/3 is a _huge_ red flag.
Any measurable wear on them is a problem, so 1/3 of them being gone means they
were perhaps 30 or 300 times over tolerance.

~~~
powera
Yes, of course. Perhaps "while the O-Rings hadn't yet catastrophically failed"
would be the more accurate way of phrasing that.

------
bmh100
This is very true in the world of business intelligence and analytics. A
single breakdown in the data pipeline from source to report/dashboard affects
the entire system's value. "Garbage in, garbage out" is a maxim. If the ETL
pipeline does not clean up the data and correct errors, it can make dashboards
nearly worthless.
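
A toy sketch of the kind of gate that keeps garbage out of such a pipeline
(the field names and rules are invented for illustration): validate at the ETL
boundary and quarantine what doesn't pass, rather than letting it flow into
the dashboards.

    # Toy sketch (invented fields): split incoming records into clean and
    # rejected sets instead of silently passing bad data downstream.
    def clean_records(raw_records: list[dict]) -> tuple[list[dict], list[dict]]:
        clean, rejected = [], []
        for rec in raw_records:
            try:
                revenue = float(rec["revenue"])
                if revenue < 0:
                    raise ValueError("negative revenue")
                clean.append({
                    "customer_id": int(rec["customer_id"]),
                    "revenue": revenue,
                    "region": str(rec["region"]).strip().upper() or "UNKNOWN",
                })
            except (KeyError, TypeError, ValueError):
                rejected.append(rec)  # quarantine for inspection instead of guessing
        return clean, rejected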

------
franzpeterfolz
Let's assume the lead times in the pipeline differ dramatically: l_1 ...
l_(n-1) take just minutes and l_n takes some days.

Maybe it's a huge simulation or the training of a neural network.

In this theory the lead times are normalized and multiplied.

But in this pipeline, process n outweighs all the other processes, so that the
lead times of the other processes are almost irrelevant compared to the lead
time of process n.

a. Let's assume the lead time of process 1 is missed by 50%, so it takes an
additional minute.

b. Compare that to process n missing its lead time by 50%, taking an
additional week.

The expected quality would be the same for a and b, but the overall lead time
will differ by about a week.
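
A worked version of that comparison, with illustrative durations (a couple of
minutes per quick step, ten days for process n), just to make the asymmetry
concrete:

    # Worked version of the comparison above (durations are illustrative):
    # a 50% slip is "the same" relative miss in both cases, but its effect on
    # the pipeline's total lead time could hardly be more different.
    MIN_PER_DAY = 24 * 60

    quick_steps = [2] * 9            # l_1 .. l_(n-1): a couple of minutes each
    process_n   = 10 * MIN_PER_DAY   # l_n: "some days" of simulation / training

    baseline = sum(quick_steps) + process_n   # total lead time in minutes

    slip_a = 0.5 * quick_steps[0]    # case a: process 1 slips 50% -> +1 minute
    slip_b = 0.5 * process_n         # case b: process n slips 50% -> +5 days (~a week)

    print(f"baseline: {baseline / MIN_PER_DAY:.2f} days")
    print(f"case a: +{slip_a:.0f} min, lead time grows by {100 * slip_a / baseline:.3f}%")
    print(f"case b: +{slip_b / MIN_PER_DAY:.0f} days, lead time grows by {100 * slip_b / baseline:.1f}%")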

