What Is ‘Site Reliability Engineering’?

StreamBright · on April 20, 2017

One thing that I was not aware of prior to working as an SRE is how much they rely on statistics. This approach, that you can use stats to determine what is a normal or abnormal level for a particular metric (like packet loss, for example), became pretty useful throughout my career.

A quick example: I was called into a meeting at a company where I worked in a non-SRE role, and the team explained to me that they were not able to identify what was wrong, but their cluster was misbehaving and there was a node that got kicked out of it regularly. I pulled up a console and started to compare OS-level metrics across the cluster. The sysadmin team thought I was stupid because they had explicitly told me which node was in trouble. By the third metric I checked, I found that the node in question had 5000x more packet loss than the second worst in the cluster. It was a faulty NIC that time. The sysadmin team had been checking all of the metrics on the broken node but never compared the results to the healthy ones.
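A minimal sketch of that kind of cross-node comparison, in Python. The node names, the packet-loss counts, and the 10x threshold are all made up for illustration; the only idea taken from the story is "compare each node against the rest of the cluster instead of looking at one node in isolation".

```python
from statistics import median


def find_outlier_nodes(metric_by_node, ratio_threshold=10.0):
    """Return {node: ratio} for nodes whose metric is far above the cluster median.

    ratio_threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    baseline = median(metric_by_node.values())
    # Guard against a zero median so the division below is always defined.
    floor = max(baseline, 1e-9)
    return {
        node: value / floor
        for node, value in metric_by_node.items()
        if value / floor >= ratio_threshold
    }


# Hypothetical packet-loss counters collected from every node in the cluster.
packet_loss = {"node-a": 2, "node-b": 3, "node-c": 1, "node-d": 15000, "node-e": 3}
outliers = find_outlier_nodes(packet_loss)
print(outliers)
```

Looking only at node-d, 15000 lost packets is just a number; next to a cluster median of 3 it stands out immediately.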

graycat · on April 20, 2017

Yup, such things are some of why, when I did work in anomaly detection, I wanted it to be multi-dimensional.

What you observed is the same song, second verse of: There was a cluster doing transaction processing. One of the computers in the cluster got a little sick and was throwing all its incoming transactions into the bit bucket. The load leveling for the cluster sent the next transaction to the least busy computer in the cluster, and the sick computer, not doing any real work, was usually the least busy so got nearly all the incoming transactions. So, really the one sick computer was throwing away nearly all the incoming transactions of the whole cluster, made the whole cluster look sick.
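A toy simulation of that failure mode, as a sketch only. It assumes one arrival per tick, a fixed service time on healthy nodes, a sick node that drops work instantly, and ties broken toward the lowest node index; none of these details come from the real cluster.

```python
def simulate_least_busy(num_nodes=4, sick_node=0, arrivals=1000, service_ticks=5):
    """Route each arrival to the node with the fewest in-flight transactions.

    Returns (received_per_node, completed_count). Pass sick_node=None for a
    fully healthy cluster.
    """
    in_flight = [[] for _ in range(num_nodes)]  # remaining ticks per transaction
    received = [0] * num_nodes
    completed = 0
    for _ in range(arrivals):
        # Advance time by one tick and retire finished transactions.
        for node in range(num_nodes):
            remaining = [ticks - 1 for ticks in in_flight[node]]
            completed += sum(1 for ticks in remaining if ticks == 0)
            in_flight[node] = [ticks for ticks in remaining if ticks > 0]
        # Least-busy routing; min() returns the lowest index on ties.
        target = min(range(num_nodes), key=lambda n: len(in_flight[n]))
        received[target] += 1
        if target != sick_node:
            in_flight[target].append(service_ticks)
        # The sick node throws the transaction away, so it never looks busy.
    return received, completed


received, completed = simulate_least_busy(sick_node=0)
healthy_received, healthy_completed = simulate_least_busy(sick_node=None)
print(received, completed)
print(healthy_received, healthy_completed)
```

With these defaults the sick node is never busy, so it wins every routing decision and nothing completes, while the healthy run spreads the work across all nodes. The tie-breaking assumption makes the effect more extreme than "nearly all", but the mechanism is the same.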

So, sure, some anomaly detection that takes as its input data, say, the CPU busy of each of the computers in the cluster should see that the one sick computer was comparatively low on CPU busy, call that an anomaly, raise an alarm, and let diagnosis begin.

Sure, if do have such an anomaly detector, very much want good control over false alarm rate.

So, can see what I posted five days ago (gee, suddenly anomaly detection is getting popular on HN?) in

https://news.ycombinator.com/item?id=14119753

and

https://news.ycombinator.com/user?id=graycat

SwellJoe · on April 20, 2017

I have a theory that Google invented the SRE because they didn't know how to hire system administrators, but they knew how to hire software engineers. So, they just hired software engineers and told them to figure out the systems.

I say this only partially in jest.

StreamBright · on April 20, 2017

This is not true at all. Most of the great first SREs were systems engineers (this is what Amazon used to call SREs) who understood the entire stack from the UI down to the network packet and OS kernel level. Most software engineers do not have serious networking or operating system knowledge. On top of that, there is a reason why performance engineers are so rare, understanding the complex computing systems we have is not an easy task. System administrators are able to configure your services but do not necessarily know all the details of the operating system and file systems that are required to be an SRE. I think you are trying to oversimplify this subject quite a bit.

jacquesm · on April 20, 2017

> Most of the great first SREs were systems engineers (this is what Amazon used to call SREs) who understood the entire stack from the UI down to the network packet and OS kernel level.

Aka the 'full stack developer' as opposed to the 'full stack developer' that knows both JavaScript, PHP and how to install the base distribution of CentOS.

metaphorm · on April 20, 2017

you're commenting on sloppy use of language. you're right, we do tend to abbreviate and exclude relevant descriptive words. the "full stack developer" that knows both JavaScript and PHP (or whatever), and probably some database stuff, and how to config an HTTP server, and a few other bits of web odds and ends is really a "full stack web application developer". usually there is enough context to know if "web application" has been abbreviated out of the term though.

as for the other kind of "full stack developer", the legendary creature that is equally comfortable doing UI scripting and Linux Kernel hacking...well personally I've never met one. mythical beast in my experience.

throwaway2016a · on April 20, 2017

> Most software engineers do not have serious networking or operating system knowledge.

Anecdotally, this seems wrong but may actually be sadly right.

I don't know how you can get through a decent Computer Science program without knowing these things. Although, I acknowledge quality in CS programs varies hugely.

We had to write an Operating System (with basic memory and process management), know the OSI model up and down, and be able to deep dive into TCP/IP and UDP packets. In addition to coding and math.

I'm actually pro software "engineer" but perhaps we should draw the line somewhere on a bare minimum of skills required. Probably about the line of an ABET accredited undergraduate BS in CS degree?

pzh · on April 20, 2017

Why are people surprised? Taking all the coursework in Math, Programming, Algorithms, Operating Systems, Computer Networks, Security and Cryptography, Distributed Systems, Databases, etc., and also learning and internalizing all that knowledge is a significant time investment. Yet many people here would be offended at the idea that you can't be a proper engineer without a CS degree, and would insist that all you need is a three-month JS/Angular bootcamp and then you're golden. A formal CS degree in itself may not be a requirement, but the time investment in acquiring the knowledge sure is.

StreamBright · on April 20, 2017

I can speak based on the ~100 interviews I conducted hiring SDEs. There are very few who could pass an SRE I/II interview, yet I know of a few.

dmoy · on April 20, 2017

What's the pass/fail rate for sde though? If you interview 100 and 10 pass through to sde, and 2 of those could do sre, that's not actually that bad.

Sean1708 · on April 20, 2017

Because Bachelor's degrees don't give you a serious knowledge of or experience in anything, they give you shallow knowledge of and experience in a broad array of topics.

snowwrestler · on April 20, 2017

This is just not true. Engineers in many disciplines other than CS (mechanical, electrical, civil, etc.) begin serious professional work with only a bachelor's degree.

throwaway2016a · on April 20, 2017

> other than CS

This is simply not true as well. CS majors also can begin serious work with just a BS degree. I have no idea what school(s) the parent topic is referring to but remind me not to recommend people go there...

Four years is a LOT of time to get deep on multiple CS topics not just a broad overview. If you finish 4 years of serious study and don't have deep knowledge of a few things you did college wrong.

College is the only time in your adult life where your only job is to learn. It is incredible that many of us get that opportunity and that some people come out of it with just broad knowledge is disappointing.

In my CS program, senior and most of junior year were almost entirely specialized courses where we had a choice to pick a topic that interests us and deep dive. (AI, graphics, networking, bioinformatics, big data, programming language theory, etc). All the general studies were mostly done by the end of sophomore year.

snowwrestler · on April 21, 2017

That's great. The "CS grad who doesn't actually know how to write production software" is a well-trod trope, but hopefully it is less true than it used to be.

dsfyu404ed · on April 20, 2017

You are absolutely wrong in those examples. Those fields are so broad that there is not time to specialize in college. They only learn the fundamentals.

You could have two identical ME grads go on to specialize in completely different things. One guy might spend his entire career working on hypoid gear sets and another might work exclusively on marine diesel rotating assemblies.

Ditto for EE, CE and so on.

mentat · on April 20, 2017

How it runs in the ideal case is different from experience in how it "actually" runs and fails. That's the biggest gap IMHO.

graycat · on April 20, 2017

> reason why performance engineers are so rare, understanding the complex computing systems we have is not an easy task.

At times it has been possible to make some progress on the complexity by regarding each application, virtual machine, server, server cluster, or server farm as a network of queues and apply queuing theory and/or Monte Carlo simulation. In addition some of the old work in optimization under uncertainty likely applies.

srean · on April 21, 2017

Have you looked at stochastic network calculus?

Networks of queues can become horrendously difficult to analyze unless one assumes memoryless arrivals, etc. Network calculus and stochastic network calculus simplify things a great deal by going after bounds rather than exact answers. You might like it.

graycat · on April 21, 2017

I never heard of "Network calculus and stochastic network calculus". Certainly no such topics were in any of the applied math grad courses I took.

For "going after bounds rather than exact answers", for analyzing network performance, there is a hugely simplifying approach: Just look at the bottlenecks and largely f'get about the rest; maybe that is related to what you mentioned.

For network queuing, really I was suggesting using that as a paradigm to formulate an analysis; I accept that analytic solutions are unpromising -- more generally, the exact probabilistic calculations out of queuing theory research are complicated even for simple cases. So, for solutions, i.e., actionable information, use Monte Carlo.

Once I knew a guy at IBM's Watson lab who was big on such things. There was a claim that at one time his work was useful in designing some of an IBM mainframe I/O subsystem. He had some software that I used once. His software wanted to collect the usual descriptive statistics, but I wanted the sample paths from the Monte Carlo, got those, and did more analysis of those.
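A rough sketch of the "use Monte Carlo and keep the sample paths" idea for the simplest case, a single-server FIFO queue. It assumes exponential interarrival and service times and uses the Lindley recursion for waiting times; the rates and the function name are illustrative and have nothing to do with the IBM software mentioned above.

```python
import random


def simulate_waits(arrival_rate, service_rate, num_customers, seed=0):
    """Return the waiting time in queue of each customer (one sample path).

    Lindley recursion: wait[n+1] = max(0, wait[n] + service[n] - gap[n+1]),
    where gap is the time between consecutive arrivals.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    waits = [0.0]  # the first customer finds an empty system
    for _ in range(num_customers - 1):
        service = rng.expovariate(service_rate)
        gap = rng.expovariate(arrival_rate)
        waits.append(max(0.0, waits[-1] + service - gap))
    return waits


light = simulate_waits(arrival_rate=0.2, service_rate=1.0, num_customers=20000)
heavy = simulate_waits(arrival_rate=0.9, service_rate=1.0, num_customers=20000)
print(sum(light) / len(light), sum(heavy) / len(heavy))
```

Because the function returns the whole path rather than a summary, you can go on to look at tails, busy periods, or whatever else the usual descriptive statistics hide.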

srean · on April 21, 2017

> I never heard of "Network calculus and stochastic network calculus".

...and now you have :)

> Certainly no such topics were in any of the applied math grad courses I took.

Yeah, they are a newer development. Stochastic network calculus more so. Network calculus is older, it even has a Wikipedia page: https://en.wikipedia.org/wiki/Network_calculus A key idea is convolution but in the max-plus algebra and martingale large deviations.

> For network queuing, really I was suggesting using that as a paradigm to formulate an analysis;

Indeed. SNCs can be a helpful tool there. Yes, they are bounds, but a lot of progress has been made to make them tight: https://arxiv.org/pdf/1303.4114.pdf They aren't quite there yet numerically but are tight enough to give an intuition. NC bounds are a lot looser.

samfisher83 · on April 20, 2017

If you take comp sci you are required to take a class in os and networking. I think many swe are cs majors.

StreamBright · on April 20, 2017

Yes, and those classes teach you the theory, while when you are working for companies you need to know the practical side of things. In our country even the theory lags behind the practical usage by 5-10 years. Funnily enough, most of the senior SREs I know do not have any formal education in CS (anecdotal evidence).

dsr_ · on April 20, 2017

[quoting myself from quite a while ago]

Once upon a time there were programmers.

Then there were systems programmers and application programmers. Systems programmers wrote operating systems and utilities for them. App programmers wrote apps. There was a lot of crossover.

Then there were operators, systems programmers and application programmers. Operator was a junior position who did physical things (mount tapes, plug in cables) and ran commands to do things on the systems. They usually moved up to being…

Systems administrators, who did some programming in service to the systems, but not too much. The more senior a sysadmin was, the more time they spent programming and the less time they spent doing physical things… unless they wanted to do that.

Sysadmins started to specialize. People who configured switches and routers and talked to telephone companies became “network engineers”. People who spent time working on firewalls and security policies and thinking about that became “security engineers”. Junior people who read scripts to end users became the helpdesk. And so forth.

Then we noticed that a bunch of people were doing things manually when they should be automated. This was especially bad in places where there were no senior sysadmins or systems programmers. But we did have the internet, and senior sysadmins got together and started writing tools to make their lives easier: infrastructure automation.

You probably know the story from there, but I’ll wrap up with one more important point: you know how when writing a business application you need to have a subject matter expert who actually knows what they are doing? Operations is exactly the same way. All the automation in the world won’t help you if you don’t have someone around who knows what they are doing. Some people can outsource this to “the cloud”, but not everyone.

acchow · on April 20, 2017

My impression is many sysadmins at the time didn't really understand the systems they were dealing with - that kind of problem solving, hacking, and mapping the territory comes with software engineering. So why not just hire software engineers?

StreamBright · on April 20, 2017

You think if a sysadmin does not understand the topic a software engineer magically will? I think the biggest obstacle to understanding systems at the level SREs do is the lack of proper documentation and formal education. This is why Google introduced the SRE university and also created the following doc.

https://landing.google.com/sre/book.html

What I mean by proper documentation is that you do not have all the knowledge in one place. I have spent countless hours preparing for SRE interviews by reading source code and hunting for additional write-ups that explain various OS functions, file systems, networking details. The amount of text you have to go through to get the knowledge is intense. The best way to become an SRE is to be hired as one at Google or Amazon and just learn from your senior co-workers. Catch 22. :)

saalweachter · on April 20, 2017

My rule of thumb is that the difference between a sysadmin and a programmer is that the sysadmin has read the documentation for all of the software they use.

tyingq · on April 20, 2017

I don't see either path as having an overwhelming advantage.

What happened here is that the scale of sysadmin tasks escalated to the point where software engineering skills brought some advantage. It did not, though, negate the importance of traditional sysadmin skills.

The best people in this space are those that can bridge both worlds. Not just "SRE" either. Devops is also a hybrid role, as is performance engineering.

Look at Adrian Cockcroft's background as an example.

inopinatus · on April 20, 2017

DevOps is a concept, not a role. It means breaking down the silos and allowing processes, ideas, capabilities to run across both SWE & operations teams. I never knew a really first-class *nix sysadmin who wasn't also a kernel hacker, so this isn't anything new, it's just got a label.

Probably the most devops-y thing you can do in an applications environment is operate a continuous delivery pipeline.

The most un-devops-y thing you can do is start a devops team.

tyingq · on April 20, 2017

Whether it's correct or not, it's a named title people are hiring for. https://www.indeed.com/q-DevOps-jobs.html

So I thought it worth mentioning as one of those hybrid roles.

inopinatus · on April 20, 2017

That's a great example. There's nothing in those job listings that wasn't, conceptually speaking, in the job description of a system engineer or sysadmin thirty years ago.

In this usage of "DevOps" it's a buzzword, and I recommend you should avoid anyone who puts buzzwords in your job title, because it speaks volumes up front about their culture and values and clarity of thinking.

I really do like the descriptor Site Reliability Engineer, though, for exactly the same reasons.

traf68 · on April 20, 2017

It allows the company to save on dedicated resources in my experience.

A lot of first-class *nix admins in my generation have a problem dealing with dev ideas in system context. In the company I am at now we have gone a step further and allowed dev/mgmt to control ops|prod and reserved system to sidestep the crapulence of this concept.

hashhar · on April 20, 2017

You may be saying it in jest but I've seen it to be true (sample size of 1).

I am a soon-to-graduate (this summer) CS undergrad from India and was offered the chance to interview at a company offering customer service software (think Zendesk) for the role of Site Reliability Engineer. The take home tests they gave me heavily reflected that they were looking for an engineer instead of a sysadmin, as nowhere during the interview did they discuss anything other than Chef, Terraform, Nginx and other things that mainly deal with deployment and devops and have very little to do with "reliability". They didn't have anything about networks or fault-tolerance.

PS: I really liked the interview process though in that they had take home tests and there was a good depth of discussion relating to those tests the next day. I eventually won't be joining them because of some information about their office culture in their India office. Sad.

StreamBright · on April 20, 2017

When you are applying for SRE outside Google and Amazon or maybe Facebook it means something entirely different. Usually they just use SRE for describing sysadmins, or a little bit more, maybe adding some automation knowledge. Getting hired as an SRE at Google and Amazon is an extremely challenging task, with lots of areas where you need to be proficient.

vertex-four · on April 20, 2017

I would suggest that what other companies call "SREs" is not likely to be what Google calls SREs, given the difficulty of implementing it correctly.

markdown · on April 20, 2017

Freshdesk?

hashhar · on April 21, 2017

No. Kayako.

jasode · on April 20, 2017

>they didn't know how to hire system administrators, but they knew how to hire software engineers

Setting aside your partial jest, I think Google's approach of hiring "programmers" to fill the role of SRE makes perfect sense for their organization.

The "sysadmin" can have a wide range of meanings and responsibilities but many will have experience managing COTS[1] software like Oracle RDBMS, PeopleSoft HR, MS Exchange email server, etc. A typical system admin can then use complementary COTS tools like Microsoft SCOM System Center Operations Manager, HP OpenView, IBM Tivoli, CA Unicenter/Spectrum. On Linux, it would be similar tools like Nagios, Zabbix to monitor Apache webservers, MySQL, etc.

In Google's case, they didn't cluster a bunch of COTS Oracle dbs together to serve web surfers. Instead, Google wrote a proprietary platform like BigTable or Spanner[2]. As a result, they also want additional software that monitors the health of Spanner. Since none of the monitoring tools from enterprise vendors like HP/MS/CA are adequate (because they have no out-of-box software-agents for Spanner), they need another layer of programmers to write custom management/monitoring tools. Those programmers are the SREs.

All the above enterprise tools from HP/MS/CA/etc including the mother db server, the agents, the plugins are written in C/C++. If your hiring job description is "system admin", most candidates will not have the skills to develop those enterprise monitoring tools from scratch. Yes, many sysadmins have programming skills to write Bash/Powershell scripts. E.g. a bash script might have a polling loop that checks for an error file existence and then has if/else/fi to send an email. However, that's not the same programming skill as developing a proprietary version of HP OpenView and MS SCOM. Most traditional sys admins do not write low-level C code like Zabbix.[3]
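For a sense of scale, the kind of polling script described above fits in a dozen lines. This is a sketch in Python rather than Bash; the function name and parameters are invented, and the `notify` callable is a stand-in for whatever would actually send the email.

```python
import time
from pathlib import Path


def poll_for_error_file(path, notify, interval_seconds=60.0, max_checks=None,
                        sleep=time.sleep):
    """Check repeatedly whether an error file exists; call notify once if it does.

    Returns True if the file was found, False if max_checks ran out first.
    With max_checks=None the loop runs until the file appears.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if Path(path).exists():
            # In a real script this is where the email would be sent.
            notify(f"error file found: {path}")
            return True
        checks += 1
        sleep(interval_seconds)
    return False
```

A call such as `poll_for_error_file("/tmp/app.err", print, interval_seconds=30)` (the path is hypothetical) would block until the file shows up. That is useful glue, but it is a long way from writing a monitoring server with agents and plugins.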

Basically, if you're a company that writes complex proprietary software and none of the existing enterprise monitoring tools can manage it, you'll have to hire programmers instead of sysadmins as the baseline skillset for SREs.

[1] https://en.wikipedia.org/wiki/Commercial_off-the-shelf

[2] https://en.wikipedia.org/wiki/Spanner_(database)

[3] e.g. https://github.com/zabbix/zabbix/blob/trunk/src/zabbix_serve...

devonkim · on April 20, 2017

Most sysadmins on the market primarily have worked in manners and tooling hardly different between cost and revenue centers, and for Google they needed admins that are actually knowledgeable about software engineering to make them effective in a revenue center.

It's laughable what kind of tools were written for enterprise systems back in the late 90s, so it's no surprise that absolutely none of them would meet Google's rate of growth requirements either.

Furthermore, given the licensing patterns of most enterprise software suites, it'd potentially become cost prohibitive to deploy COTS ops software at Google scale anyway even with Google scale money.

There's obviously still some cost centers to big tech companies (Facebook has contract positions open for doing service desk customizations last I checked) but custom tooling everywhere makes sense when you have justifiable reasons that nobody will be able to serve your needs besides yourself.

technofiend · on April 20, 2017

Makes me wonder what they pay and how they get over the usual SV prejudice against hiring people over 30. I read their SRE description a while ago and thought it sounded perfect for me, but if you end up there and for whatever reason it doesn't work out you'll end up losing a lot of money leaving SV again. It's not cheap to uproot yourself and move across country and it's not like anyone else in SV will hire folks over a certain age; at least that's what I'm left to conclude after lurking HN for a few years.

pm90 · on April 20, 2017

Eh...what? Don't be fooled by what you read here. Take a few interviews and find out for real if that's the case. I would be really surprised if that were the case, especially at Big Co's where their staff is (comparatively) older than at Startups.

lclarkmichalek · on April 20, 2017

Eh, I think they say something similar in the book. The other side of it was that they wanted a software engineering approach to systems administration (which completely depends on your opinions of the respective professions)

SwellJoe · on April 20, 2017

It seems to be mostly working for them, though I think companies that haven't gone so far down that rabbit hole are also doing fine. One of the premises of SRE that I occasionally take issue with is "build your own tools". That's a fine idea, sometimes, but it also results in NIH syndrome, which has a real cost.

Obviously, Google is operating at a scale that no one ever has before (and almost no one else does or ever will), and so, sometimes they have to invent the technologies for doing it. But, the somewhat dismissive tone of the SRE literature I've read that implies system administrators scale linearly while SREs scale drastically higher is kind of off-putting, and misrepresentative, I think. Sysadmins have been scaling systems for as long as there have been systems; sometimes just one person or a small team in the data center, running the whole show for quite large companies.

dj_jorjinho · on April 20, 2017

The opposite also occurs: a SysAdmin gets hired to do DevOps, but they're not qualified to be Software Engineers. The result is that, if you're not looking, your ops code ends up a series of "scripts" instead of a series of well structured tools.

But I guess this depends on the company culture to begin with.

skywhopper · on April 20, 2017

Given the tone of the first answer, I would say this is correct, as they clearly didn't hire any good ones, anyway.

nailer · on April 20, 2017

> We care deeply about keeping SRE an engineering function, so our rule of thumb is that an SRE team must spend at least 50% of its time doing development.

Prior to SRE any good system administrator was doing this: "if it's worth doing, it's worth automating". But there was another half who were cutting and pasting shit from Word files into Solaris boxes. sysadmin -> SRE seems to have cleaned out the chaff.

traf68 · on April 20, 2017

How many developers selected hardware, configured hardware, burnt it in, racked and cabled it, entered it into a company insurance roster, made and maintained the interfaces and documented it?

This _was_ SA 101 and is the part you ignore. Hell even in 2007 if you didn't know hardware you were unemployable as an SA. What do devs writing json-rpc interfaces to monolith 1212 running oracle X as backend care about infra?

This is the sea change and not everybody agrees with it. You have a 'cloudy' perspective.

chrisp_dc · on April 20, 2017

I think knowing the hardware is less challenging than previously. The trend is to make everything whitebox and then abstract as software. Hypervisors replace supporting many hardware configurations and software defined SANs replace dedicated appliances.

nailer · on April 20, 2017

Configuration, asset management, interface config and documentation are included under things to automate and always have been.

rconti · on April 20, 2017

Both your "good" and "bad" examples seem to be more operator-type roles. So much sysadmin work in many shops cannot readily be automated -- doubly so if all of the 'easy' stuff has already been automated!

imesh · on April 20, 2017

I work at a web host, and my SRE title means being a developer who gets constantly interrupted by alerts and customer chats.

nandemo · on April 20, 2017

Unless you're working at a startup that's intentionally "doing things that don't scale" (because e.g. you've yet to get to product-market fit), I can't see how it makes sense for a developer or SRE to get interrupted by "customer chats".

As for endless alerts, I've been learning that there are 3 types:

• false alerts, which were set up "just in case". The "handling" tends to be "take a look, declare that nothing is wrong after all, mark it solved". Your job is to summarily eliminate these.

• "good" alerts that indicate a temporary problem that could/should be handled automatically in an ideal world, but can't be handled automatically as things stand; your job is to move the status quo in the direction of the ideal world, gradually.

• all the rest. These are the unavoidable alerts. Usually, the best you can do is to ensure the alert message and related logs provide all the relevant information necessary to handle it.

PS: of course, TFA indicates Google has a much more systematic approach, but you probably won't be able to just copy them.

user5994461 · on April 20, 2017

1) Fix the alert

2) Fix the root cause definitely

3) Delete the alert

If you can't do any of that, just delete the alert, and also, that means you are not SRE, you are a powerless ops monkey who's getting spammed by other people's alerts (very common in many bad places).

hackermailman · on April 20, 2017

This is also what I've assumed SRE really means, plus being on call to put out fires in the middle of the night. I can't begin to imagine the hell that is trying to string hundreds of buggy containers on top of buggy kernels on top of buggy VMs and applying daily patches that don't collapse this complexity house of cards. If anybody can handle this without burning out in a few months they probably don't pay you enough

a_imho · on April 20, 2017

Do you earn a developer salary?

poikniok · on April 20, 2017

I would presume more than a developer salary.

FLUX-YOU · on April 20, 2017

IME, it is safer to assume they are NOT paying him for knowledge across multiple roles.

It's a business win to find someone with multi-disciplinary skills without having to pay them the combined salary of those disciplines.

It's even possible they are only paying him a developer's salary.

atsaloli · on April 20, 2017

https://www.usenix.org/conference/lisa16/conference-program/... is a video of Niall Murphy's excellent presentation with Todd Underwood of how smaller organizations can implement SRE basics. Dec 2016. USENIX LISA in Boston. I had the privilege to attend it.

raz32dust · on April 20, 2017

With more automation and containerization, I see the SRE role and dev role coming together, eventually merging into "devops". Today, these roles are separate because they require slightly different skill sets. Maintaining production systems takes up about as much time as developing new features. As it becomes easier and easier, dev will be the ops, even in big companies.

adrianN · on April 20, 2017

Maintaining production systems won't become easier in the same way software engineering didn't become easier because of the introduction of high level languages. The systems only become more complex if it becomes easier to manage complex systems.

rconti · on April 20, 2017

... and in the same way virtualization didn't give us time to kick back and relax in all of our 'free' time now that we're not racking boxes and cabling stuff all the time.

Instead of creating dedicated application users and chroot jails and alternate port numbers to let applications coexist on a server, we're spinning a bazillion instances and building out storage backend to support it, and so on.

And the lowest-level problems still exist, though we don't troubleshoot them as often; we just re-spin. Same reason nobody's repairing their RAID card with a soldering iron anymore.

tomtompl · on April 20, 2017

Pint for this guy.

Related post: https://medium.com/outsystems-engineering/the-law-of-conserv...

NickNameNick · on April 20, 2017

IEEE software engineering radio did a good episode on Site Reliability engineering.

http://www.se-radio.net/2016/12/se-radio-episode-276-bjorn-r...

NotQuantum · on April 20, 2017

I've fallen in love with the SRE field. I'm a Computer Engineering senior currently. I'm used to two kinds of classes: CS ones where you learn a lot of theory and apply it on a test, then the CprE ones where you also learn, but then have to make it work in labs. I've always liked lab based classes where you have to take a concept to fruition.

I've been interested in all aspects of CprE, and I taught myself how to run a Linux box along with DNS, VPN, and other services. Last year around this time, I was contacted by a recruiter for an SRE internship. At the time, I had no idea what SRE was and I thought it was just a glorified IT job. Boy was I wrong.

I got through a few interviews and got the position for the summer. About a week or two into the internship I fell in love. This job was all about designing and implementing systems that have to be resilient and must scale. The idea of building automation to make my job easier was and is great. It was just like the labs I enjoyed in college.

Fast forward to now, and I'm accepting a full time SRE position at the same company. I couldn't be happier with my choice of specialization. The need for resilient, distributed systems will only grow in the coming years, and I'm looking forward to being an SRE.

zatkin · on April 20, 2017

>We've held that hiring bar constant through the years, even at times when it's been very hard to find people, and there's been a lot of pressure to relax that bar in order to increase hiring volume. We've never changed our standards in this respect. That has, I think, been incredibly important for the group. Because what you end up with is, a team of people who fundamentally will not accept doing things over and over by hand, but also a team that has a lot of the same academic and intellectual background as the rest of the development organization. This ensures that mutual respect and mutual vocabulary pertains between SRE and SWE.

It seems like changing their hiring process is a double-edged sword. If they change it to allow more hiring volume, then other employees might become frustrated with how easy it becomes to get a job at Google. On the other hand, keeping an old hiring process where false negatives continue to occur seems very bad.

pm90 · on April 20, 2017

Good point. I've had a very cautious opinion about the hiring process at Google; I know that there are people who feel strongly either way. But something really seems to be wrong if the company is still so hugely dependent on search advertising for revenue after more than a decade of business. Maybe you need some (relatively) dumber people to discover new revenue streams.

workerIbe · on April 20, 2017

We prefer the term "non-linear thinkers".

robhirschfeld · on April 23, 2017

SRE is a job function. By design, it's intended to be equivalent in pay and status with developers (SWE) to overcome the bias against operators and sysadmins in organizations. This is an important recognition because cloud-first operations requires a lot of automation and coding expertise that previous operations roles did not demand.

DevOps is really a process definition with Lean systems thinking and code workflow priorities. Many people will tell you that it is NOT a job function but a culture or approach. DevOps for developers generally means CI/CD pipelines and owning code into production. DevOps for operators generally means building configuration automation and integrated monitoring tools. In this way, DevOps is highly complementary to the SRE job function.

I've been writing a lot about this on my personal (robhirschfeld.com) and company (rackn.com/sre) blogs. I'd be happy to discuss this in more detail here.

dogecoinbase · on April 20, 2017

SREs are a tool to turn N ops engineers paid X each into 1 SRE paid 2X and N manual laborers paid X/4 each.

This doesn't make the role bad. But it's important to remember that the role exists as a cost savings to the org, not because it's an inherently better way to run a technical infrastructure.

otoburb · on April 20, 2017

And because other organizations are sold on the idea that SREs can realize cost savings benefits without compromising reliability of their technology assets, boards and c-suite teams definitively define this as a better way to run technical infrastructure.

pram · on April 20, 2017

Actually in most places I've seen 'ops' just turned into 'SRE'

icebraining · on April 20, 2017

Manual laborers? At Google? What are they doing?

tyingq · on April 20, 2017

I would guess they are referring to the people racking and stacking equipment, running network cables, doing any needed physical reboots, swapping out old servers, etc.

rodionos · on April 20, 2017

It's a euphemism for a system administrator with responsibilities to test, integrate, and automate systems with code.

HeavenBanned · on April 20, 2017

An "SRE" is what happens when you want to pronounce the word "SWE" but can't. For some reason you keep saying "SRE" over and over and over again.

They were overcompensating for the fact that SREs aren't SWEs so hard. It's like "we get it, SREs are wannabe SWEs, stop trying to sugar coat it". 50% development? What a disaster. If half your job is the job that you want and the other half is administrative bullshit, why in the living fuck would you try to make a puff piece about that?

It seems, from what everyone has said in this thread, that SRE is basically a scam along with DevOps, and that the real job people want is SWE.

I don't like internal memo propaganda pieces by big companies. It's not intellectually stimulating: it's hogwash. Let the truth reign always.

rconti · on April 20, 2017

Assuming SWE means software engineer.

No. Thanks.

sigi45 · on April 20, 2017

Yep, that's how I always wanted to do software engineering: understanding and controlling the full stack and taking responsibility for it.

ZanyProgrammer · on April 20, 2017

Understanding, sure. Taking responsibility? No way.

zeckalpha · on April 20, 2017

Is this interview new or was it released as part of the book?

pronoiac · on April 20, 2017

I think it's new to the website. I don't see it in the table of contents for the book, and it's been online since August, according to https://web-beta.archive.org/web/20160804182333/https://land...

burntrelish1273 · on April 20, 2017

Here's a script to fetch an offline copy: https://gist.github.com/steakknife/76214a4bb378592669655e3bb...

hbex5 · on April 20, 2017

SRE: because DevOps isn't buzzwordy enough these days.

StreamBright · on April 20, 2017

SRE predates DevOps by 5 years at least.

ZanyProgrammer · on April 20, 2017

I always associate the word "site" in SRE with some sort of industrial/facilities engineering position when my mind first parses that word.

traf68 · on April 20, 2017

It is stupidity, hubris and a disposition to chaos.

deckardb26354 · on April 20, 2017

SRE? Apparently the only 'software' job Google has in Dublin. It doesn't matter if you have a PhD or wrote your own kernel: if you want to write code for Google, move to Mountain View. Oh, and the seven interviews. Complete waste of time.

awkbug · on April 20, 2017

I recently interviewed at LinkedIn and Atlassian, and my experience was very bad. The first round was an online exam, and I answered all the questions. Atlassian rejected me even though my solutions passed 100% of the test cases. I got no response from LinkedIn. They told me I could use any language to solve the problems, and I chose bash. I think they didn't like me using bash. The guy who interviewed me at LinkedIn is a system administrator with an SRE title. Funny thing is, he said he doesn't do programming. Companies are just misusing these titles. They need software engineers who can do system administration. The types who run apt-get on CentOS :p