
What Is ‘Site Reliability Engineering’? - peterkshultz
https://landing.google.com/sre/interview/ben-treynor.html
======
StreamBright
One thing that I was not aware of prior to working as an SRE is how much they
rely on statistics. The approach of using stats to determine what is
a normal or abnormal level for a particular metric (packet loss, for
example) has been pretty useful throughout my career.

A quick example: I was called into a meeting at a company where I worked in a
non-SRE role, and the team explained to me that they were not able to identify
what was wrong, but their cluster was misbehaving and one node got
kicked out of it regularly. I pulled up a console and started to compare OS
level metrics across the cluster. The sysadmin team thought I was stupid
because they had explicitly told me which node was in trouble. With the third
metric I checked, I found out that the node in question was seeing 5000x more
packet loss than the second worst in the cluster. It turned out to be a faulty
NIC. The sysadmin team had been checking all of the metrics on the broken node
but never compared the results to the healthy ones.
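
The cross-node comparison described above might be sketched roughly as follows. This is only an illustration, not the tooling used in the story: the node names, the packet-loss numbers, the `find_outlier_nodes` helper, and the 10x threshold are all made up for the example.

```python
from statistics import median

def find_outlier_nodes(metric_by_node, ratio_threshold=10.0):
    """Return {node: ratio} for nodes far above the cluster median.

    ratio_threshold is an arbitrary illustrative cutoff, not a recommendation.
    """
    baseline = median(metric_by_node.values())
    # Guard against a zero median so the ratio stays defined.
    floor = max(baseline, 1e-9)
    ratios = {node: value / floor for node, value in metric_by_node.items()}
    return {node: r for node, r in ratios.items() if r > ratio_threshold}

# Hypothetical packet-loss fractions per node (invented data).
packet_loss = {
    "node-a": 0.0002,
    "node-b": 0.0001,
    "node-c": 0.0003,
    "node-d": 1.0,
}

outliers = find_outlier_nodes(packet_loss)
print(outliers)  # only node-d stands out against its peers
```

Using the median of the healthy majority as the baseline keeps the one broken node from skewing its own reference point, which is the same idea as comparing the sick node against the healthy ones instead of looking at it in isolation.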

~~~
graycat
Yup, such things are some of why, when I did work in anomaly detection, I
wanted it to be _multi-dimensional_.

What you observed is the same song, second verse of: There was a cluster doing
transaction processing. One of the computers in the cluster got a little
_sick_ and was throwing all its incoming transactions into the _bit bucket_.
The load leveling for the cluster sent the next transaction to the least busy
computer in the cluster, and the sick computer, not doing any real work, was
usually the least busy so got nearly all the incoming transactions. So, really
the one sick computer was throwing away nearly all the incoming transactions
of the whole cluster, which made the whole cluster look sick.
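
A toy simulation can show why least-busy routing amplifies one sick node. This is a sketch under invented assumptions (four nodes, ten arrivals per tick, each healthy node finishing up to four transactions per tick); the `simulate` function and all of its parameters are hypothetical and not taken from the system described.

```python
def simulate(num_nodes, sick_node, ticks=100, arrivals_per_tick=10, service_rate=4):
    """Route each transaction to the least-busy node; return (served, dropped).

    sick_node is the index of a node that discards work instantly, or None.
    """
    queues = [0] * num_nodes
    served = dropped = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            # Least-busy routing; ties go to the lowest index.
            target = min(range(num_nodes), key=lambda i: queues[i])
            if target == sick_node:
                dropped += 1  # thrown into the bit bucket, queue stays empty
            else:
                queues[target] += 1
        # Healthy nodes work through their queues at a fixed rate.
        for i in range(num_nodes):
            if i != sick_node:
                done = min(queues[i], service_rate)
                queues[i] -= done
                served += done
    return served, dropped

healthy = simulate(num_nodes=4, sick_node=None)
degraded = simulate(num_nodes=4, sick_node=3)
print(healthy)   # (1000, 0)
print(degraded)  # (300, 700)
```

In this toy setup the sick node is one of four machines but ends up discarding 70% of the traffic, because its queue never grows and it therefore always looks like the least busy target.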

So, sure, some anomaly detection that takes as its input data, say, the CPU
busy of each of the computers in the cluster should see that the one sick
computer was comparatively low on CPU busy, call that an _anomaly_ , raise an
alarm, and let diagnosis begin.

Sure, if do have such an anomaly detector, very much want good control over
false alarm rate.
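
One simple way to get a handle on false alarm rate, sketched here for a single metric only, is to set the alarm threshold from an empirical quantile of data collected while the system was known to be healthy. This is an assumption-laden illustration and not the multi-dimensional method mentioned above: the baseline CPU-busy samples are synthetic, and `threshold_for_false_alarm_rate` and `is_anomalous` are hypothetical helper names.

```python
import random

def threshold_for_false_alarm_rate(baseline, target_rate):
    """Pick a low-side threshold so at most target_rate of baseline samples alarm."""
    ordered = sorted(baseline)
    # Number of healthy samples we tolerate falling below the threshold.
    k = int(target_rate * len(ordered))
    return ordered[k]

def is_anomalous(value, threshold):
    """Alarm when the metric is strictly below the threshold (suspiciously idle)."""
    return value < threshold

# Synthetic "healthy" CPU-busy percentages (invented data, fixed seed).
rng = random.Random(0)
baseline = [rng.gauss(60, 5) for _ in range(1000)]

threshold = threshold_for_false_alarm_rate(baseline, target_rate=0.01)
false_alarms = sum(is_anomalous(v, threshold) for v in baseline)
print(false_alarms / len(baseline))  # at most 0.01 on the baseline itself
```

The guarantee only covers the baseline data; how well the rate holds on future data depends on the healthy behavior staying similar to what was sampled.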

So, can see what I posted five days ago (gee, suddenly anomaly detection is
getting popular on HN?) in

[https://news.ycombinator.com/item?id=14119753](https://news.ycombinator.com/item?id=14119753)

and

[https://news.ycombinator.com/user?id=graycat](https://news.ycombinator.com/user?id=graycat)

------
SwellJoe
I have a theory that Google invented the SRE because they didn't know how to
hire system administrators, but they knew how to hire software engineers. So,
they just hired software engineers and told them to figure out the systems.

I say this only partially in jest.

~~~
StreamBright
This is not true at all. Most of the great first SREs were systems engineers
(this is what Amazon used to call SREs) who understood the entire stack from
the UI down to the network packet and OS kernel level. Most software
engineers do not have serious networking or operating system knowledge. On
top of that, there is a reason why performance engineers are so rare:
understanding the complex computing systems we have is not an easy task.
System administrators are able to configure your services but do not necessarily
know all the details of the operating system and file systems that are required
to be an SRE. I think you are trying to oversimplify this subject quite a bit.

~~~
throwaway2016a
> Most software engineers do not have serious networking or operating system
> knowledge.

Anecdotally, this seems wrong but may actually be sadly right.

I don't know how you can get through a decent Computer Science program without
knowing these things. Although, I acknowledge quality in CS programs varies
hugely.

We had to write an Operating System (with basic memory and process
management), know the OSI model up and down, and be able to deep dive into
TCP/IP and UDP packets. In addition to coding and math.

I'm actually pro software "engineer" but perhaps we should draw the line
somewhere on a bare minimum of skills required. Probably about the line of an
ABET accredited undergraduate BS in CS degree?

~~~
StreamBright
I can only speak based on the ~100 interviews I conducted hiring SDEs. There are
very few who could pass an SRE I/II interview, yet I know of a few.

~~~
dmoy
What's the pass/fail rate for sde though? If you interview 100 and 10 pass
through to sde, and 2 of those could do sre, that's not actually that bad.

------
nailer
> We care deeply about keeping SRE an engineering function, so our rule of
> thumb is that an SRE team must spend at least 50% of its time doing
> development.

Prior to SRE any good system administrator was doing this: "if it's worth
doing, it's worth automating". But there was another half who were cutting and
pasting shit from Word files into Solaris boxes. sysadmin -> SRE seems to have
cleaned out the chaff.

~~~
traf68
How many developers selected hardware, configured hardware, burnt it in,
racked and cabled it, entered it into a company insurance roster, made and
maintained the interfaces and documented it?

This _was_ SA 101 and is the part you ignore. Hell even in 2007 if you didn't
know hardware you were unemployable as an SA. What do devs writing json-rpc
interfaces to monolith 1212 running oracle X as backend care about infra?

This is the sea change and not everybody agrees with it. You have a 'cloudy'
perspective.

~~~
chrisp_dc
I think knowing the hardware is less challenging than previously. The trend is
to make everything whitebox and then abstract as software. Hypervisors replace
supporting many hardware configurations and software defined SANs replace
dedicated appliances.

------
imesh
I work at a web host, and my SRE title means being a developer who gets
constantly interrupted by alerts and customer chats.

~~~
a_imho
Do you earn a developer salary?

~~~
poikniok
I would presume more than a developer salary.

~~~
FLUX-YOU
IME, it is safer to assume they are NOT paying him for knowledge across
multiple roles.

It's a business win to find someone with multi-disciplinary skills without
having to pay them the combined salary of those disciplines.

It's even possible they are only paying him a developer's salary.

------
atsaloli
[https://www.usenix.org/conference/lisa16/conference-program/...](https://www.usenix.org/conference/lisa16/conference-program/presentation/closing-plenary)
is a video of Niall Murphy's excellent
presentation with Todd Underwood of how smaller organizations can implement
SRE basics. Dec 2016. USENIX LISA in Boston. I had the privilege to attend it.

------
raz32dust
With more automation and containerization, I see the SRE role and dev role
coming together, eventually merging into "devops". Today, these roles are
separate because they require slightly different skill sets. Maintaining
production systems takes up about as much time as developing new features. As
it becomes easier and easier, dev will be the ops, even in big companies.

~~~
adrianN
Maintaining production systems won't become easier in the same way software
engineering didn't become easier because of the introduction of high level
languages. The systems only become more complex if it becomes easier to manage
complex systems.

~~~
rconti
... and in the same way virtualization didn't give us time to kick back and
relax in all of our 'free' time now that we're not racking boxes and cabling
stuff all the time.

Instead of creating dedicated application users and chroot jails and alternate
port numbers to let applications coexist on a server, we're spinning a
bazillion instances and building out storage backend to support it, and so on.

And the lowest-level problems still exist, though we don't troubleshoot them
as often; we just re-spin. Same reason nobody's repairing their RAID card with
a soldering iron anymore.

------
burntrelish1273
Here's a script to fetch an offline copy
[https://gist.github.com/steakknife/76214a4bb378592669655e3bb...](https://gist.github.com/steakknife/76214a4bb378592669655e3bbc30a1cc/)

------
NickNameNick
IEEE software engineering radio did a good episode on Site Reliability
engineering.

[http://www.se-radio.net/2016/12/se-radio-episode-276-bjorn-r...](http://www.se-radio.net/2016/12/se-radio-episode-276-bjorn-rabenstein-on-site-reliability-engineering/)

------
NotQuantum
I've fallen in love with the SRE field. I'm a Computer Engineering senior
currently. I'm used to two kinds of classes: CS ones where you learn a lot of
theory and apply it on a test, then the CprE ones where you also learn, but
then have to make it work in labs. I've always liked lab based classes where
you have to take a concept to fruition.

I've been interested in all aspects of CprE, and I taught myself how to run a
Linux box along with DNS, VPN, and other services. Last year around this time,
I was contacted by a recruiter for an SRE internship. At the time, I had no
idea what SRE was and I thought it was just a glorified IT job. Boy was I
wrong.

I got through a few interviews and got the position for the summer. About a
week or two into the internship I fell in love. This job was all about
designing and implementing systems that have to be resilient and must scale.
The idea of building automation to make my job easier was and is great. It was
just like the labs I enjoyed in college.

Fast forward to now, and I'm accepting a full time SRE position at the same
company. I couldn't be happier with my choice of specialization. The need for
resilient, distributed systems will only grow in the coming years, and I'm
looking forward to being an SRE.

------
zatkin
> We've held that hiring bar constant through the years, even at times when
> it's been very hard to find people, and there's been a lot of pressure to
> relax that bar in order to increase hiring volume. We've never changed our
> standards in this respect. That has, I think, been incredibly important for
> the group. Because what you end up with is, a team of people who fundamentally
> will not accept doing things over and over by hand, but also a team that has a
> lot of the same academic and intellectual background as the rest of the
> development organization. This ensures that mutual respect and mutual
> vocabulary pertains between SRE and SWE.

It seems like changing their hiring process is a double edged sword. If they
change it to allow more hiring volume, then other employees might become
frustrated with how easy it becomes to work at Google. On the other hand,
keeping an old hiring process where false negatives continue to occur seems
very bad.

~~~
pm90
Good point. I've had a very cautious opinion about hiring process at Google; I
know that there are people who feel extremes either way. But something really
seems to be wrong if the company is still so hugely dependent on search
advertising for revenue after more than a decade of business. Maybe you need
some (relatively) dumber people to discover new revenue streams.

~~~
workerIbe
We prefer the term "non-linear thinkers".

------
robhirschfeld
SRE is a job function. By design, it's intended to be equivalent in pay and
status with developers (SWE) to overcome the bias against operators and
sysadmins in organizations. This is an important recognition because cloud-
first operations requires a lot of automation and coding expertise that
previous operations roles did not demand.

DevOps is really a process definition with Lean system thinking and code
workflow priorities. Many people will tell you that it is NOT a job function
but a culture or approach. DevOps for developers generally means CI/CD
pipelines and owning code into production. DevOps for operators generally
means building configuration automation and integrated monitoring tools. In
this way, DevOps is highly complementary to the SRE job function.

I've been writing a lot about this on my personal (robhirschfeld.com) and
company (rackn.com/sre) blogs. I'd be happy to discuss this in more detail
here.

------
dogecoinbase
SREs are a tool to turn N ops engineers paid X each into 1 SRE paid 2X and N
manual laborers paid X/4 each.

This doesn't make the role bad. But it's important to remember that the role
exists as a cost savings to the org, not because it's an inherently better way
to run a technical infrastructure.

~~~
icebraining
Manual laborers? At Google? What are they doing?

~~~
tyingq
I would guess they are referring to the people racking and stacking equipment,
running network cables, doing any needed physical reboots, swapping out old
servers, etc.

------
rodionos
It's a euphemism for a system administrator with responsibilities to test,
integrate, and automate systems with code.

------
sigi45
Yep, that's how I always wanted to do software engineering: understanding /
controlling the full stack and taking responsibility for it.

~~~
ZanyProgrammer
Understanding, sure. Taking responsibility? No way.

------
zeckalpha
Is this interview new or was it released as part of the book?

~~~
pronoiac
I think it's new to the website. I don't see it in the table of contents for
the ebook, and it's been online since August, according to
[https://web-beta.archive.org/web/20160804182333/https://land...](https://web-beta.archive.org/web/20160804182333/https://landing.google.com/sre/interview/ben-treynor.html)

------
HeavenBanned
A "SRE" is what happens when you want to pronounce the word "SWE" but can't.
For some reason you keep saying "SRE" over and over and over again.

They were overcompensating for the fact that SREs aren't SWEs so hard. It's
like "we get it, SREs are wannabe SWEs, stop trying to sugar coat it". 50%
development? What a disaster. If half your job is the job that you want and
the other half is administrative bullshit, why in the living fuck would you
try to make a puff piece about that?

It seems as though from what everyone has said in this thread, that SRE is
basically a scam along with DevOps and that the real job people want is the
SWE.

I don't like internal memo propaganda pieces by big companies. It's not
intellectually stimulating: it's hogwash. Let the truth reign always.

~~~
rconti
Assuming SWE means software engineer.

No. Thanks.

------
grabcocque
SRE: because DevOps isn't buzzwordy enough these days.

~~~
StreamBright
SRE predates DevOps by 5 years at least.

------
traf68
It is stupidity, hubris and a disposition to chaos.

------
deckardb26354
SRE? Apparently the only 'software' job Google has in Dublin. It doesn't
matter if you have a PhD or wrote your own kernel: if you want to write code
for Google, move to Mountain View. Oh, and the seven interviews. Complete waste
of time.

------
awkbug
I recently attended interviews at LinkedIn and Atlassian, and my experience was
very bad. The first round is an online exam and I answered all the questions.
Atlassian rejected me even though I got 100% right on all test cases. No response
from LinkedIn. They told me I could use any language to solve it and I chose bash. I
think they didn't like me using bash. The guy who interviewed me at LinkedIn
is a system administrator with an SRE title. Funny thing is he said he doesn't do
programming. Companies are just misusing these titles. They need software
engineers who can do system administration. The types who run apt-get on
CentOS :p

