Habits of Highly Successful Site Reliability Engineers (newrelic.com)
24 points by kungfudoi 11 months ago | 5 comments



I worked in an SRE job for about five years. It sucks. Developers release a bunch of crap, and when it breaks they're the last to be found. You can work your ass off and do some magic to keep the system running, but the rewards for that are slim and quickly forgotten. If you push back and ask for more reliable stuff, usually no one listens because business goals are more important.

I'm back as a developer again and things are so much easier. I do spend a few hours helping out with tools and helping out the SRE guys, and I get much more credit for that as a developer than I ever did as an actual SRE.


What you are describing sounds like a typical sysadmin role, not SRE.

There are several principles in SRE which are specifically designed to avoid the issues you mentioned and to keep software engineering teams invested in reliability:

1. The SLO is defined by the business and describes the reliability requirements for the system.

2. If the system is less reliable than the SLO and runs out of error budget, feature development freezes, directing engineering resources to reliability improvements (this prevents the "last to be found" issue).

3. There is a 50% cap on the ops work that SREs do; everything over 50% goes to the feature development team. This prevents the "working your ass off to keep the system running" issue.

All of this is well described in the "Keys to SRE" talk by the guy who actually invented SRE: https://www.usenix.org/conference/srecon14/technical-session...
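To make points 2 and 3 concrete, here is a rough sketch of the arithmetic. Everything in it is my own illustration, not something from the talk: the names `error_budget` and `ops_overflow_hours`, the request-count way of measuring reliability, and the numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float  # failures the SLO tolerates in this window
    remaining: float         # budget left; negative means overspent
    freeze_features: bool    # True once the budget is exhausted


def error_budget(slo: float, total_requests: int, failed_requests: int) -> BudgetStatus:
    """Hypothetical request-based error budget for one measurement window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    allowed = (1.0 - slo) * total_requests
    remaining = allowed - failed_requests
    return BudgetStatus(allowed, remaining, remaining < 0)


def ops_overflow_hours(ops_hours: float, total_hours: float, cap: float = 0.5) -> float:
    """Hours of ops work above the cap, i.e. what goes back to the dev team."""
    return max(0.0, ops_hours - cap * total_hours)


if __name__ == "__main__":
    # Assumed example: 99.9% SLO, one million requests, 1500 failures.
    print(error_budget(0.999, 1_000_000, 1500))
    # Assumed example: 30 hours of ops work in a 40 hour week.
    print(ops_overflow_hours(30, 40))
```

With a 99.9% SLO over a million requests the budget is roughly 1,000 failed requests, so 1,500 failures would trip the freeze, and 30 ops hours in a 40 hour week would hand 10 hours back to the feature team.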


It really depends on the organization.

It's similar in security and QA: the more ability you have to tell other groups not to do stupid things, and to enforce it, the less soul-crushing the job is and the better a job you can do.

I currently work as an SRE. If your stuff breaks and whichever SRE is on duty/call can't figure out why and fix it (which is almost a certainty, since we cannot be experts in everyone's software), guess who's getting called...

and if we can't reach who we're calling guess which direction on the org chart we go...


I think the key, unfortunately, might be to measure your own impact as an SRE, e.g. if this check had not been in place, you'd have had X more failures.

This is also related to how an engineer can truly have "10x" the output of another engineer. Making the right decision on what not to build gives you that 8x. And building it quickly and well gives you the 2x. But it's not often recognized as such.


> At New Relic, we describe it internally as “someone who is constantly analyzing every change for its risk and what its impact could be down the road, not just today. And what does that mean for the larger infrastructure?”

In many orgs, this is regarded as nay-saying and ignored.

> There’s little upside in a siloed approach that throws a change over the wall with no concern for how it might affect the person sitting on the other side.

Often, any change that impacts another person _at all_ is considered to be too much of an impact, regardless of the benefits. Developer workflows are usually prioritized.

> You embrace every opportunity to automate

My experience has been that automation is not prioritized if it blocks anything else. Running the business takes priority. The signal missed here is that this usually means there isn't enough staff to handle the daily run-the-business tasks and the necessary automation.

> “You need to be able to dig in and say ‘stop’ and ‘no, we really need to do this thing now,’ which can be difficult to do in some engineering organizations,”

Asking your engineers to sell stability rather than backing them up with institutional support is an organizational smell.



