Team Structure for Software Reliability Within Organizations (blameless.com)
125 points by kiyanwang on Feb 17, 2020 | 50 comments



Re SREs: we've taken a different approach. Devs own their systems. If the issue is network or hardware, there are more ops-oriented folks who handle that. But when it comes to services, teams own their stuff. Bug at 2am causes a service to go down? The team is expected to start debugging it and pull people in as needed.

The result? Shifting to this model from an SRE firefighting model has lowered total alerts, incidents, and outages. When we first shifted, people dreaded going on their on call rotation for their team; you expected to be paged multiple times a night. Fast forward and now I can count on one hand the number of pages I've had across my last several on call rotations. Work is prioritized for system reliability because we feel the pain.


This answer always downplays what "on call" really means. Even if you only get paged once a year, if you're "on call" and the expectation is that within X minutes you are at a computer working, this affects the life you live. You can't go camping, out to a nice dinner, on vacation, etc without lugging a laptop, internet connection, etc with you.

I've heard managers often tout that "if your code is stable you'll never get paged". But that hides the fact that you're still on the hook to be available. So either you start shirking your responsibility or you reshape your life around the fact that some percentage of your non-working hours is still owned by the company.


On call is one of the things that nobody told me about, honestly. You’re absolutely right that it affects one’s life: you need to ensure availability for OC work. Managers who downplay this are liars.

That being said: the SRE model has encouraged orgs to improve observability and set clear expectations and it often coaxes teams into building more reliable systems under the threat of limitless OC toil.


Not sure what was being downplayed. You are def on call and have n minutes to respond, so you need to stay in range of cell towers or wifi. However, when your system is more reliable, you wake up in the middle of the night less and less. We rotate through on call on our team. One week out of six or so.

During that time I keep my laptop and my phone that can tether with me. I've been on call and at amusement parks. I got paged once and had to go out to the car and work an hour. Calculated risk on my part. What I should have done is ask a team mate if they would cover me for the several hours I was at the park.


My strategy is to not work places where I need to be on call, and quit if it ever becomes a requirement. I've never been paid enough to give up my free time.


I agree on principle with this strategy.

However, I do want to work in places where I am responsible for operating what I build. I want to know how well the service I built is doing. If I promise it will be up 99.99% of the time but it is not, then I need to know when it's down so I can figure out why it went down.

Being on call for systems you build also leads to better software. It makes me design software in ways that make it easier to investigate when it fails, e.g. my error messages provide a lot of context. There is investment in making sure the service is able to terminate gracefully. Metrics are instrumented to locate when things might be off.
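As a rough sketch of what that looks like in practice (Python here; the service, handler, and field names are purely illustrative, not from any real codebase):

    import logging
    import signal
    import sys

    logger = logging.getLogger("orders-service")

    def handle_sigterm(signum, frame):
        # Graceful termination: log, stop taking new work, then exit cleanly.
        logger.info("SIGTERM received; draining in-flight work before exit")
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)

    def charge_customer(order_id, customer_id, amount_cents):
        try:
            ...  # call out to the payment provider here
        except TimeoutError:
            # Error messages carry enough context to debug at 2am without
            # replaying the request: which order, which customer, how much.
            logger.error(
                "payment timeout: order_id=%s customer_id=%s amount_cents=%d",
                order_id, customer_id, amount_cents,
            )
            raise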

What this means is that for the things I build, I don't get paged very often. It's in my interest to ensure that the service is honest, well designed, and well built. When I do get paged, it's for something that's seriously wrong.

Anyways, I've been lucky to be in this position. What I've observed for many teams is that they inherit flaky systems which they have to make more reliable; but in the meantime every OC shift is an absolute slog, demanding a full 40 hours/week from people.


My approach has been to request ownership before agreeing to be on call.

If I'm going to be responsible for what I build at all times, then I'm going to be compensated for what it's earning at all times.

That means options, stock, or profit sharing (with a large preference for stock).

Otherwise I feel like this relationship is strictly abusive.

Alternatively, my contracting rate is 150 an hour. I'll agree to work up to 70 hours a week, but everything after 50 is compensated hourly. On call counts, regardless of whether I actually get pinged.


This is the right answer. If a company thinks its system is critical 24x7 then it should have actual staffing 24x7.

On-call is the type of egregious abuse of employees that an IT union would be able to fight against. Right now companies can take advantage of people who are desperate (H1-B, young parents, etc) and don’t have the freedom to take a stand like the above poster but if all IT workers banded together it would be possible to fight against the practice.


At my previous company, we could switch our on call day with other people on the team. This mitigated the issue you're talking about. I never had an issue scheduling a switch with someone for the year I was on call there. That's just one specific company though, not sure if other teams allow this.


Or you put your laptop in the trunk of your car when you're on call and have tethering on your phone.

Expecting another group to maintain your code seems an awful lot like throwing it over the wall.


And you keep your phone on during a film, and at the nice dinner, and during your daughter's wedding, and never go camping without phone service. That is exactly what is being described by "This answer always downplays what "on call" really means."


During your rotation, you have to have your phone with you. I go to the movies and to nice dinners. I would definitely get my rotation covered for a wedding or camping.

You should have faith in the systems that you build. If you don't want to carry a pager for them, why does someone else want to do it?


This is why SRE organizations typically have stringent requirements for stability before accepting a rotation, and will give it back if reliability drops below a certain level.

SREs have global responsibility to the system and have a correspondingly global view. If your system broke down because of a bug you wrote, that’s on you. If a system three times removed from yours that you didn’t know depended on you broke because you changed a non-API behavior, that’s where SREs shine: they know how to quickly isolate the problem, roll back the necessary systems, and define how to avoid similar problems in the future.


If that's priced into the contract and made clear from the beginning, sure. But it doesn't come for free.


I’ve seen this shared-responsibility model work only when feature development is traded off against reliability as business-level guidance, on top of developers being empowered to fully own their systems. By the time things get bad with reliability, I’ve rarely seen a business get out of the death spiral: technical debt being paid down with interest-only payments daily, everyone burning out while feature requests flood in and customers leave, unless there is massive funding growth (and subsequent hiring, with a solid plan for enabling those engineers).

I’ve only ever seen it turned around when empowered engineers have the cycles to fix crippling systemic defects and the business has the means to support it. Otherwise it’s basically a Sisyphean task to maintain things, and the business should have offshored the labor by then to save on costs if the cost-feature curve isn’t working out in the growth stages.


This addresses only 50% of what SREs contribute. The other 50% is designing the production infrastructure, processes, and tools that facilitate efficient response. Depending on the size of your company, you can of course make do with off-the-shelf solutions, but that’s not scalable past a certain point; creating robust production systems (as opposed to good code) is a specialty like any other.


I've seen this service ownership model fail when implemented across a dozen service teams. Each team was a snowflake in that its production infrastructure was unique, and not necessarily reliable.

It also assumes that designing production infrastructure or even systems level thinking is a skill that most software developers have. Certainly, enough do that it's common, but it's not universal.

In this particular case, the org standardized on terraform. Some developers welcomed the ability to self-service. Others wanted nothing to do with it.


This assumes a level of control and autonomy that not many at all have. Downtime is valuable currency in intra-corporate negotiations, and the ones to answer the phone at 3am are business inputs, not people.


That is the whole purpose of doing microservices. Every team can operate independently: choose their tools, processes, etc., as long as they respect their API contracts.


I think it's a misinterpretation of microservices to say that the purpose is that the teams can choose their tools and processes.

Microservices merely ensure that the complexity of using different tools per microservice does not lead to an increase of maintenance burden.

Nonetheless, you still have a maintenance burden if every microservice is built on different tools and processes, since you cannot easily address patterns of problems that occur across particular combinations of tools and processes. The cardinality just increases with every tool and process added into the mix.

This maintenance burden automatically takes its toll on productivity. Teams will not be able to create, maintain, or iterate on code that produces business value (as opposed to managing the microservice platform itself).

I think there is value in a well defined set of processes and tools because there will eventually be platform concerns that become increasingly difficult otherwise.

I do not have insight into how most companies with successful microservice architectures achieved their success, but I would bet my life that a majority of them do not let their engineering teams choose tools and processes arbitrarily, unless it REALLY serves to produce value that would be unachievable with the tools and processes used up to that point (for example, performance is usually sufficient, but a specific service suddenly needs to outperform everything built so far, so you use C, Go, or Rust instead of Python).

I am biased into thinking this because I suspect that most of these companies either started with a monolithic architecture (i.e. supporting multiple languages was impossible or just not feasible) or started with a microservice approach focused on producing value as quickly as possible after an initial ramp-up time. Supporting many different tools and processes from the get-go would make the ramp-up time longer than necessary.

TL;DR Microservice platform maintenance suffers if tools and processes are chosen freely for each microservice, without any push towards unification.

Oof I need to sleep. Don't mind me, just trying to sort myself out.


> I think it's a misinterpretation of microservices to say that the purpose is that the teams can choose their tools and processes.

The crucial feature of microservices is to enable a large number of people to work on delivering the same service without being constrained by the traditional modes of software release, where the entire monolith could only be updated at a fixed cadence and where the whole thing had to pass through a rigorous suite of tests, etc.

One freedom that this gives a team is to choose a different language for their service... on this point there is large consensus. But one need not stop there; you could technically run your service on another cloud with a completely different deployment architecture. All that is asked for this insane freedom is that your service honor the SLAs that you provide. SLAs define the interface; everything else is left to the team.

That being said, shared tooling does provide some nice benefits where learning is shared across teams allowing new ones to bootstrap quickly. However, that is something the team should be allowed to choose.


I understand this point of view is contrary to popular opinion. The justification of microservices is to allow a team of let's say 10 developers to scale to 100. What you do is create 12 teams of 8. Each team owns their microservice with well defined APIs, and are allowed some flexibility on languages, tools, process, etc. Of course, it is a good idea to leverage common tools and such, but not obligatory.

Commonly we see a team of 20 maintaining 40 microservices, which becomes very difficult to manage and provides little gain. IMHO, this is the wrong approach. Microservices are tiny from the point of view of Netflix or Amazon.


>The result? Shifting to this model from an SRE firefighting model has lowered total alerts, incidents, and outages. When we first shifted, people dreaded going on their on call rotation for their team; you expected to be paged multiple times a night.

I implemented a similar change at a previous company. Prior to the change, the operations team fronted all on-call issues. After, anything related to the software/services was handled by software teams. We went from nightly alerts to less than monthly. As a bonus, time to diagnose and repair went waayyyyy down due to much better logging.

Note, better logging <> more logging in this case. It meant fewer spurious errors/warnings, more useful INFO, and better observability of the system.
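To make the distinction concrete, here's a hand-wavy before/after in Python (not the actual code; the worker and field names are made up):

    import logging
    import time

    log = logging.getLogger("sync-worker")
    logging.basicConfig(level=logging.INFO)

    def sync_account(account_id, records):
        start = time.monotonic()
        ok = failed = 0
        for record in records:
            # The old code emitted a WARNING per retry ("something went wrong,
            # retrying"), which drowned out the real signal; count instead.
            if record.get("valid", True):
                ok += 1
            else:
                failed += 1
        # One useful INFO line per unit of work, with the fields you actually
        # grep for during an incident.
        log.info("account sync finished: account_id=%s ok=%d failed=%d duration_ms=%d",
                 account_id, ok, failed, (time.monotonic() - start) * 1000)

    sync_account("acct-42", [{"valid": True}, {"valid": False}])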


What happens when teams change? What happens when a team no longer owns a service, or it's in a state where no one is actively working on it? I think you need resources dedicated to thinking about the picture as a whole.


For sure. Unlike most developers these days (or so it seems), I've been at my company for nearly a decade. Started out as engineer 15 in a company that is now a couple thousand. I've seen services transfer teams, get forgotten about, be rewritten, and just about every permutation in between. The way we mitigate this is by tracking the service ownership by team and audit repos from time to time. Every running service is required to have a runbook/playbook that should be thorough enough that a new team member could fix something broken. The biggest problem I've seen is some teams being afraid of really digging into a legacy service if it is transferred to them.


Did everyone get paid more to take on these responsibilities? I feel we're about to go through exactly the same thing at my company.


We’re many years down the road, so we’ve mostly baked it into job descriptions and expectations at this point. In some geographies, we’ve had to craft those responsibilities and pay/comp-time philosophies in specific ways to comply with local laws. In others, where the law is silent, we’ve taken a laxer approach: “if you got crushed by off-hours work, take appropriate comp time”.

During the old days of “you commit code and ops runs it” and during transition to “you build and run it”, we did have some bonuses for being on-call or doing stints in problem management. Now, we bake that expectation into base salary except in locales where we have to split it out to comply with local laws.

Along the way, we've given enough merit increases, market adjustments, and promotions that it's impossible to say whether we had a bump specifically because of the industry's new philosophy on ops. If you wanted to make the case that we did, you could. If you wanted to make the case that we didn't, you equally could.


The logic was that we are highly paid developers and are expected to build robust, reliable software with high availability. Typically a brutal on call night yields untracked time off.


At my previous company, on call was an opt-in responsibility that came with extra pay. We had enough developers so not every developer had to do on call.


I'm curious: how many people did you lose because of the shift?


I work in a publicly traded company. I do everything from requirements elicitation, implementation, unit testing and end-to-end testing, to owning the deployment process, as well as configuring logging and monitoring to ensure deployments continue to stay healthy. Our software's reliability is as strong or as brittle as each individual engineer's decision to implement or skip these processes, since most micro-services are made by groups of 2-5 people (and I'm currently on a team of 2, and for a few months was a team of 1).


Would defining a (required) key set of observables not help to reduce the variability across teams? Or perhaps the issue you're describing is that teams aren't required to demonstrate any?

In my experience (as a chief architect and engineering VP), laying out some baseline metrics closes much of the reliability variance across teams.

As the teams get more competent at baking in the basics (overall load, latency, resource utilization, error rates, error events posted to chat) you can ratchet up the competency to include higher order observables (scaling events, business transactions, circuit statuses, traces, anomalies).

Which is to say, it's been straightforward (in my experience) for most teams to raise the bar once they know there _is_ a bar and once they can see the bar.
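For anyone wondering what "the basics" amount to in code, here's a toy sketch with a Prometheus-style client in Python; the metric names are illustrative, and any comparable stack works so long as every service exposes the same small baseline:

    from prometheus_client import Counter, Histogram, start_http_server

    # The baseline every service exposes: load, latency, error rate.
    REQUESTS = Counter("http_requests_total", "Requests served", ["route", "status"])
    LATENCY = Histogram("http_request_seconds", "Request latency", ["route"])

    def handle(route, fn):
        with LATENCY.labels(route).time():
            try:
                result = fn()
                REQUESTS.labels(route, "200").inc()
                return result
            except Exception:
                REQUESTS.labels(route, "500").inc()
                raise

    if __name__ == "__main__":
        start_http_server(9100)  # scrape endpoint for the monitoring system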


Teams aren't required to demonstrate any. Engineers who understand the importance of quality bake in time, and management is supportive of self-imposed controls. Our NPS numbers are trash, though. But it's a different kind of quality issue: we aren't fighting fires every week. The problem is our API response codes are often wrong or unpredictable. The original engineers clearly did not understand the HTTP protocol.
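For what it's worth, the fix there is usually mundane: agree on one explicit mapping from outcomes to status codes and apply it everywhere. A standard-library-only sketch in Python (the exception names are hypothetical):

    from http import HTTPStatus

    class NotFound(Exception): ...
    class ValidationError(Exception): ...

    def status_for(exc):
        # One predictable mapping instead of ad-hoc codes per endpoint.
        if isinstance(exc, NotFound):
            return HTTPStatus.NOT_FOUND            # 404: resource doesn't exist
        if isinstance(exc, ValidationError):
            return HTTPStatus.BAD_REQUEST          # 400: caller error, don't retry
        return HTTPStatus.INTERNAL_SERVER_ERROR    # 500: our bug, safe to alert on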


The article misses the people element.

Isn't it best to design for reliability in the first place?

Don't you incentivize that best by having the same people supporting an app as well as developing it?

Nothing makes you write reliable software like being the person being called out, or dealing with the customer issue.


I’m becoming more sympathetic over time to the notion that this everyone-must-do-everything movement is basically a failure. There’s too much.

Be good with: two or three PM and issue tracker tools, a graphics or UI mockup program or three, various analytics tools and libraries, a payment system or three, several build and packaging systems, a couple reverse proxies, SSL provisioning, several ways to schedule jobs, a few message or event buses, a couple cloud management UIs (command line and browser!) and also WTF all the brand-names in there do, several major database systems with a few different query languages and paradigms, three or more programming languages for weekly use, various communication tools, your own command line environment, the typical environment on your deployment targets and there may be more than one of those, infra as code languages/tools, at least one CI system at any given time, git, the quirks and bullshit of dozens of major libraries across multiple languages, testing frameworks and practices, networking, and on and on.

It’s plainly too much and functionally no-one’s doing all that well, which explains some of what we see from software in the wild. We could probably use about three specialist developers and a secretary for every do-everything developer like that.


It is easy to swing too far towards self-serve teams. The separation that works well for us: the team owns the code and the artifacts it emits, like logs and metrics. Alerts and such go to the team.

However, there is a team that maintains the build system, a team that maintains the monitoring systems, and a team that deals with network, hardware, proxies, etc.

For us, the one we are learning to balance is the data layer. We've traditionally had a DBA team doing most of that. But the team building the service knows their data better. We are shifting to a DBA as a consultant resource model and seeing how that goes.


Of course you need some tech stack discipline and the dev team needs to operate as a team not a bunch of individuals - the team has to be able to support the system if a single developer wants to go on holiday.

A mess of a tech stack is more of a problem for developers making future changes than it is for production ( if you have fully automated the deployment, monitoring and service recovery ).

ie the problem of a mess of a tech stack isn't due to being developer-led per se, it's a problem of a lack of teamwork and organisation - which can happen in any team or situation.


Do you think the complexity may be a result of the diverse ecosystem of technologies we find ourselves in today? Never before have we seen so many ways of doing so many different things. Every problem has an OSS solution, and companies deploy variations of those tools internally. The problem will only get more pronounced in the future... there seems to be no escaping it unless you restrict the adoption and usage of new technologies.


You're describing what Devops and SREs do.


Yes, at places that have them and where those roles are well-integrated and their teams reasonably well run, which is... not everywhere, by a long shot.


Articles like this always assume an organization is just one thing, in this case server and/or web. Our organization is dozens of server teams, internal apps, mobile apps and web clients, plus interactions with real world systems (and people). It's impossible to actually model reliability in such a complex environment.


Shucks, I was hoping for an analysis of team structure based on Microsoft's paper[1]

[1]: https://www.microsoft.com/en-us/research/publication/the-inf...


Isn't that paper from 2008? Do you think it is still valid?


I don't see what would have changed. Most large companies still have arcane reporting structures and strong silos that prevent clear communication and collaboration.


Team culture and incentives matter more than structure.


Having the right team and organizational structure can go a long way toward aligning incentives.


In my experience, the best team structure depends on the organization. Certainly for smaller orgs and startups it is possible to hire a small set of engineers with the necessary abilities to handle the reliability work.

In larger, more-established organizations, scrum teams tend to gravitate toward feature development, where it becomes difficult to prioritize reliability work even if the skills are present. With very many feature-development teams, it becomes difficult to hire programmers who have both the primary skill desired and the ancillary skill of reliability and systems operations.

I'm currently of the belief that in most orgs, a team should be dedicated to this. There's the ongoing operations of a service, which I think is a shared duty, and the infrastructure engineering components necessary to abstract the infrastructure sufficiently well that it is uniform across feature-development teams and also reliable, scalable, and secure by default.

As with anything, the devil is in the details.


AWS, incidentally, spoke a tonne at re:Invent 2019 about how they run their operations. It is an interesting contrast with the article, because AWS takes a different approach (no dedicated SREs) -- the engineers who build the system run ops, too.

Some references:

[0] Colm MacCarthaigh, How to take control of systems, big & small, https://www.youtube-nocookie.com/embed/O8xLxNje30M (2018)

[1] Eric Brandwine, Aspirational security, https://www.youtube-nocookie.com/embed/ad9180b4Xew

[2] Marc Brooker, Amazon's approach to building resilient services, https://www.youtube-nocookie.com/embed/KLxwhsJuZ44

[3] Peter Ramensky, Amazon's approach to high-availability deployment, https://www.youtube-nocookie.com/embed/bCgD2bX1LI4

[4] Andy Troutman, Amazon's approach to running service-oriented teams, https://www.youtube-nocookie.com/embed/n1d20Yok000

[5] Colm MacCarthaigh, Amazon's approach to security during development, https://www.youtube-nocookie.com/embed/NeR7FhHqDGQ

[6] Becky Weiss, Amazon's approach to failing successfully, https://www.youtube-nocookie.com/embed/yQiRli2ZPxU

[7] Thomas Blood, Amazon's culture of innovation, https://www.youtube-nocookie.com/embed/2ZQKPUD7vKE

[8] Andy Warfield and Seth Markle, Lessons from Amazon S3's culture of durability, https://www.youtube-nocookie.com/embed/DzRyrvUF-C0

[9] Colm MacCarthaigh, Lessons from Amazon's highest available data-planes, https://www.youtube-nocookie.com/embed/2L1S0zfnIzo

[10] Eric Brandwine, The tension between absolutes & ambiguity in security, https://www.youtube-nocookie.com/embed/GXTvlQXVCOs (2018)


The article talks about adding SREs. As if the developers wrote buggy software and then the SREs fix the bugs. I’m sure I’m misunderstanding. What is meant?


Sounds like SREs are fixing bugs and introducing tools and processes that improve reliability.

> There are two main types of reliability work. The first is mitigation, which is a linear fix that’s often referred to as firefighting...The second is change management, which is a non-linear fix that proactively reduces the defect rates through projects like migrating to better tools and refactoring spaghetti code. While SREs support both these types of work, they should spend more time on the latter.



