date 2016-12-01
A quick example: I was called into a meeting at a company where I worked in a non-SRE role, and the team explained to me that they could not identify what was wrong, but their cluster was misbehaving and one node was getting kicked out of it regularly. I pulled up a console and started to compare OS-level metrics across the cluster. The sysadmin team thought I was stupid because they had explicitly told me which node was in trouble. After the third metric I checked, I found that the node in question had 5000x more packet loss than the second worst in the cluster. It turned out to be a faulty NIC. The sysadmin team had been checking all of the metrics on the broken node but never compared the results to the healthy ones.
What you observed is the same song, second verse of: There was a cluster doing transaction processing. One of the computers in the cluster got a little sick and was throwing all its incoming transactions into the bit bucket. The load leveling for the cluster sent the next transaction to the least busy computer in the cluster, and the sick computer, not doing any real work, was usually the least busy and so got nearly all the incoming transactions. So, really, the one sick computer was throwing away nearly all the incoming transactions of the whole cluster, which made the whole cluster look sick.
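To make the least-busy failure mode concrete, here is a toy simulation. It is only a sketch under assumptions that are not from the story: one request arrives per tick, healthy nodes take a fixed number of ticks per request, ties are broken at random, and the function name and parameters are made up for illustration.

```python
import random


def simulate(num_nodes=5, sick_node=0, num_requests=10_000, service_ticks=20, seed=1):
    """Route each request to the node with the fewest in-flight requests.

    The sick node drops every request immediately, so it never accumulates
    load. Returns the fraction of requests that were dropped.
    Pass sick_node=None to simulate a fully healthy cluster.
    """
    rng = random.Random(seed)
    # For each node, the remaining ticks of every request it is working on.
    in_flight = [[] for _ in range(num_nodes)]
    dropped = 0
    for _ in range(num_requests):
        # Advance time by one tick and retire finished requests.
        for node in range(num_nodes):
            in_flight[node] = [t - 1 for t in in_flight[node] if t > 1]
        # Pick the least busy node, breaking ties at random.
        least = min(len(queue) for queue in in_flight)
        candidates = [n for n in range(num_nodes) if len(in_flight[n]) == least]
        target = rng.choice(candidates)
        if target == sick_node:
            dropped += 1  # into the bit bucket; no load is ever recorded
        else:
            in_flight[target].append(service_ticks)
    return dropped / num_requests
```

With these defaults the sick node ends up with a share of the traffic far above the 1/5 it would get under fair routing, because it always looks idle. Calling `simulate(sick_node=None)` gives 0.0 for comparison.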
So, sure, some anomaly detection that takes as its input data, say, the CPU busy of each of the computers in the cluster should see that the one sick computer was comparatively low on CPU busy, call that an anomaly, raise an alarm, and let diagnosis begin.
Sure, if you do have such an anomaly detector, you very much want good control over the false alarm rate.
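A minimal sketch of that kind of peer comparison, assuming one reading per node (CPU busy, packet loss, whatever) and using a median/MAD robust z-score. The function name and the default threshold are hypothetical; the threshold is the knob for the false alarm rate, and in practice it would have to be calibrated against historical data rather than picked by hand.

```python
from statistics import median


def peer_outliers(metric_by_node, threshold=5.0):
    """Return the nodes whose metric is far from the cluster median.

    Distance is measured in robust z-scores: |value - median| divided by
    1.4826 * MAD, where MAD is the median absolute deviation. A larger
    threshold means fewer alarms (and fewer false alarms).
    """
    values = list(metric_by_node.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    scale = 1.4826 * mad  # comparable to a standard deviation for normal data
    if scale == 0:
        # More than half the peers agree exactly; flag anything that differs.
        return sorted(n for n, v in metric_by_node.items() if v != center)
    return sorted(
        n for n, v in metric_by_node.items() if abs(v - center) / scale > threshold
    )
```

For example, with CPU busy readings of `{"n1": 62.0, "n2": 58.0, "n3": 65.0, "n4": 3.0, "n5": 60.0}` only `n4` is flagged at the default threshold, since it sits about 19 robust z-scores below the median. The same function works for the packet loss case, where the sick node is high rather than low.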
Aka the 'full stack developer' as opposed to the 'full stack developer' that knows both JavaScript, PHP and how to install the base distribution of CentOS.
as for the other kind of "full stack developer", the legendary creature that is equally comfortable doing UI scripting and Linux Kernel hacking...well personally I've never met one. mythical beast in my experience.
Anecdotally, this seems wrong but may actually be sadly right.
I don't know how you can get through a decent Computer Science program without knowing these things. Although, I acknowledge quality in CS programs varies hugely.
We had to write an Operating System (with basic memory and process management), know the OSI model up and down, and be able to deep dive into TCP/IP and UDP packets. In addition to coding and math.
I'm actually pro software "engineer" but perhaps we should draw the line somewhere on a bare minimum of skills required. Probably about the line of an ABET accredited undergraduate BS in CS degree?
This is simply not true as well. CS majors also can begin serious work with just a BS degree. I have no idea what school(s) the parent topic is referring to but remind me not to recommend people go there...
Four years is a LOT of time to get deep on multiple CS topics not just a broad overview. If you finish 4 years of serious study and don't have deep knowledge of a few things you did college wrong.
College is the only time in your adult life where your only job is to learn. It is incredible that many of us get that opportunity and that some people come out of it with just broad knowledge is disappointing.
In my CS program, senior and most of junior year were almost entirely specialized courses where we had a choice to pick a topic that interests us and deep dive. (AI, graphics, networking, bioinformatics, big data, programming language theory, etc). All the general studies were mostly done by the end of sophomore year.
You could have two identical ME grads go on to specialize in completely different things. One guy might spend his entire career working on hypoid gear sets and another might work exclusively on marine diesel rotating assemblies.
Ditto for EE, CE and so on.
At times it has been possible to make some progress on the complexity by regarding each application, virtual machine, server, server cluster, or server farm as a network of queues and apply queuing theory and/or Monte Carlo simulation. In addition some of the old work in optimization under uncertainty likely applies.
Networks of queues can become horrendously difficult to analyze unless one assumes memoryless arrivals etc. Network calculus and stochastic network calculus simplify things a great deal by going after bounds rather than exact answers. You might like it.
For "going after bounds rather than exact answers", for analyzing network performance, there is a hugely simplifying approach: Just look at the bottlenecks and largely f'get about the rest; maybe that is related to what you mentioned.
For network queuing, really I was suggesting using that as a paradigm to formulate an analysis; I accept that analytic solutions are unpromising -- more generally, the exact probabilistic calculations out of queuing theory research are complicated even for simple cases. So, for solutions, i.e., actionable information, use Monte Carlo.
Once I knew a guy at IBM's Watson lab who was big on such things. There was a claim that at one time his work was useful in designing some of an IBM mainframe I/O subsystem. He had some software that I used once. His software wanted to collect the usual descriptive statistics, but I wanted the sample paths from the Monte Carlo, got those, and did more analysis of those.
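As a sketch of what "use Monte Carlo and keep the sample paths" can look like, here is the simplest possible case: a single-server FIFO queue simulated with the Lindley recursion. The exponential interarrival and service times, the rates, and the function names are all assumptions for illustration; this is not the IBM software mentioned above, and a real model would chain several such queues together.

```python
import random


def simulate_waits(num_customers, arrival_rate, service_rate, rng):
    """One sample path of waiting times in a single-server FIFO queue.

    Uses the Lindley recursion W[n+1] = max(0, W[n] + S[n] - A[n+1]),
    with exponential service times S and interarrival times A.
    The first customer finds the queue empty.
    """
    waits = [0.0]
    for _ in range(num_customers - 1):
        service = rng.expovariate(service_rate)
        gap = rng.expovariate(arrival_rate)
        waits.append(max(0.0, waits[-1] + service - gap))
    return waits


def monte_carlo(num_paths=200, num_customers=2000,
                arrival_rate=0.8, service_rate=1.0, seed=42):
    """Return the full sample paths, not just summary statistics."""
    rng = random.Random(seed)
    return [
        simulate_waits(num_customers, arrival_rate, service_rate, rng)
        for _ in range(num_paths)
    ]
```

Because the paths themselves are returned, you can go past the usual descriptive statistics, e.g. look at the distribution of the worst wait per path with `[max(p) for p in monte_carlo()]`. For this particular toy case there is an exact answer to check against: the steady-state mean wait in queue is rho / (mu - lambda), which is 4 for the default rates, and the simulated average should land in that neighborhood.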
> Certainly no such topics were in any of the applied math grad courses I took.
Yeah, they are a newer development, stochastic network calculus more so. Network calculus is older; it even has a Wikipedia page: https://en.wikipedia.org/wiki/Network_calculus
A key idea is convolution but in the max-plus algebra and martingale large deviations.
> For network queuing, really I was suggesting using that as a paradigm to formulate an analysis;
Indeed. SNCs can be a helpful tool there. Yes, they are bounds, but a lot of progress has been made to make them tight (https://arxiv.org/pdf/1303.4114.pdf). They aren't quite there yet numerically, but they are tight enough to give an intuition. NC bounds are a lot looser.
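For a feel of the deterministic (NC) side, here is a small sketch. Note the assumptions: it uses the min-plus form of the convolution, which is how the basic deterministic bounds are usually stated, it samples curves at integer times only, and the function names are made up. It does not touch the stochastic/martingale machinery at all.

```python
def min_plus_conv(f, g):
    """Discrete min-plus convolution: h[t] = min over 0 <= s <= t of f[s] + g[t - s]."""
    n = min(len(f), len(g))
    return [min(f[s] + g[t - s] for s in range(t + 1)) for t in range(n)]


def rate_latency(rate, latency, horizon):
    """Service curve beta(t) = rate * max(0, t - latency), sampled at t = 0..horizon-1."""
    return [rate * max(0, t - latency) for t in range(horizon)]


def token_bucket_bounds(sigma, rho, rate, latency):
    """Worst-case bounds for a token-bucket flow alpha(t) = sigma + rho * t
    served by a rate-latency server, assuming rho <= rate."""
    if rho > rate:
        raise ValueError("arrival rate exceeds service rate; no finite bound")
    return {
        "delay": latency + sigma / rate,    # horizontal deviation
        "backlog": sigma + rho * latency,   # vertical deviation
    }
```

Two servers in tandem collapse into one: `min_plus_conv(rate_latency(3, 2, 30), rate_latency(5, 4, 30))` equals `rate_latency(3, 6, 30)`, i.e. the slower rate and the sum of the latencies. That is the "just look at the bottleneck" intuition falling out of the algebra.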
Once upon a time there were programmers.
Then there were systems programmers and application programmers. Systems programmers wrote operating systems and utilities for them. App programmers wrote apps. There was a lot of crossover.
Then there were operators, systems programmers and application programmers. Operator was a junior position who did physical things (mount tapes, plug in cables) and ran commands to do things on the systems. They usually moved up to being…
Systems administrators, who did some programming in service to the systems, but not too much. The more senior a sysadmin was, the more time they spent programming and the less time they spent doing physical things… unless they wanted to do that.
Sysadmins started to specialize. People who configured switches and routers and talked to telephone companies became “network engineers”. People who spent time working on firewalls and security policies and thinking about that became “security engineers”. Junior people who read scripts to end users became the helpdesk. And so forth.
Then we noticed that a bunch of people were doing things manually when they should be automated. This was especially bad in places where there were no senior sysadmins or systems programmers. But we did have the internet, and senior sysadmins got together and started writing tools to make their lives easier: infrastructure automation.
You probably know the story from there, but I’ll wrap up with one more important point: you know how when writing a business application you need to have a subject matter expert who actually knows what they are doing? Operations is exactly the same way. All the automation in the world won’t help you if you don’t have someone around who knows what they are doing. Some people can outsource this to “the cloud”, but not everyone.
https://landing.google.com/sre/book.html
What I mean by proper documentation is that you do not have all the knowledge in one place. I have spent countless hours preparing for SRE interviews by reading source code and hunting for additional write-ups that explain various OS functions, file systems, and networking details. The amount of text you have to go through to get the knowledge is intense. The best way to become an SRE is to be hired as one at Google or Amazon and just learn from your senior co-workers. Catch 22. :)
What happened here is that the scale of sysadmin tasks escalated to the point where software engineering skills brought some advantage. It did not, though, negate the importance of traditional sysadmin skills.
The best people in this space are those that can bridge both worlds. Not just "SRE" either. Devops is also a hybrid role, as is performance engineering.
Look at Adrian Cockcroft's background as an example.
Probably the most devops-y thing you can do in an applications environment is operate a continuous delivery pipeline.
The most un-devops-y thing you can do is start a devops team.
So I thought it worth mentioning as one of those hybrid roles.
In this usage of "DevOps" it's a buzzword, and I recommend you should avoid anyone who puts buzzwords in your job title, because it speaks volumes up front about their culture and values and clarity of thinking.
I really do like the descriptor Site Reliability Engineer, though, for exactly the same reasons.
A lot of first-class *nix admins in my generation have a problem dealing with dev ideas in system context. In the company I am at now we have gone a step further and allowed dev/mgmt to control ops|prod and reserved system to sidestep the crapulence of this concept.
I am a CS undergrad from India, graduating this summer, and was offered the chance to interview at a company offering customer service software (think Zendesk) for the role of Site Reliability Engineer. The take-home tests they gave me strongly suggested that they were looking for an engineer rather than a sysadmin, as nowhere during the interview did they discuss anything other than Chef, Terraform, Nginx and other things that mainly deal with deployment and devops and have very little to do with "reliability". They didn't have anything about networks or fault tolerance.
PS: I really liked the interview process though in that they had take home tests and there was a good depth of discussion relating to those tests the next day. I eventually won't be joining them because of some information about their office culture in their India office. Sad.
Setting aside your partial jest, I think Google's approach of hiring "programmers" to fill the role of SRE makes perfect sense for their organization.
The "sysadmin" title can have a wide range of meanings and responsibilities, but many will have experience managing COTS[1] software like Oracle RDBMS, PeopleSoft HR, MS Exchange email server, etc. A typical system admin can then use complementary COTS tools like Microsoft SCOM (System Center Operations Manager), HP OpenView, IBM Tivoli, CA Unicenter/Spectrum. On Linux, it would be similar tools like Nagios or Zabbix to monitor Apache webservers, MySQL, etc.
In Google's case, they didn't cluster a bunch of COTS Oracle dbs together to serve web surfers. Instead, Google wrote a proprietary platform like BigTable or Spanner[2]. As a result, they also want additional software that monitors the health of Spanner. Since none of the monitoring tools from enterprise vendors like HP/MS/CA are adequate (because they have no out-of-box software-agents for Spanner), they need another layer of programmers to write custom management/monitoring tools. Those programmers are the SREs.
All the above enterprise tools from HP/MS/CA/etc including the mother db server, the agents, the plugins are written in C/C++. If your hiring job description is "system admin", most candidates will not have the skills to develop those enterprise monitoring tools from scratch.
Yes, many sysadmins have programming skills to write Bash/Powershell scripts. E.g. a bash script might have a polling loop that checks for an error file existence and then has if/else/fi to send an email. However, that's not the same programming skill as developing a proprietary version of HP OpenView and MS SCOM. Most traditional sys admins do not write low-level C code like Zabbix.[3]
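For reference, the kind of polling script described above is roughly this much logic. This is a sketch in Python rather than Bash, and everything specific in it is an assumption: the function name, the `notify` callback (which in real use would wrap whatever sends the email), and the `max_checks` and `sleep` parameters, which exist only so the loop can be bounded and tested.

```python
import os
import time


def poll_for_error(path, notify, interval_seconds=60, max_checks=None, sleep=time.sleep):
    """Check every `interval_seconds` whether `path` exists.

    Calls notify(path) once and returns True as soon as the file shows up.
    Returns False if `max_checks` checks pass without the file appearing;
    with max_checks=None it polls forever.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if os.path.exists(path):
            notify(path)  # hypothetical hook, e.g. a function that sends an email
            return True
        checks += 1
        sleep(interval_seconds)
    return False
```

Something like `poll_for_error("/var/tmp/job.err", print)` (the path is just an example) is the whole program. The gap between this and an agent-based monitoring system with its own server, storage, and plugin protocol is the point being made above.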
Basically, if you're a company that writes complex proprietary software and none of the existing enterprise monitoring tools can manage it, you'll have to hire programmers instead of sysadmins as the baseline skillset for SREs.
It's laughable what kind of tools were written for enterprise systems back in the late 90s, so it's no surprise that absolutely none of them would meet Google's rate of growth requirements either.
Furthermore, given the licensing patterns of most enterprise software suites, it'd potentially become cost prohibitive to deploy COTS ops software at Google scale anyway even with Google scale money.
There are obviously still some cost centers at big tech companies (Facebook had contract positions open for doing service desk customizations last I checked), but custom tooling everywhere makes sense when you have justifiable reasons to believe that nobody will be able to serve your needs besides yourself.
Obviously, Google is operating at a scale that no one ever has before (and almost no one else does or ever will), and so, sometimes they have to invent the technologies for doing it. But, the somewhat dismissive tone of the SRE literature I've read that implies system administrators scale linearly while SREs scale drastically higher is kind of off-putting, and misrepresentative, I think. Sysadmins have been scaling systems for as long as there have been systems; sometimes just one person or a small team in the data center, running the whole show for quite large companies.
But I guess this depends on the company culture to begin with.
Prior to SRE any good system administrator was doing this: "if it's worth doing, it's worth automating". But there was another half who were cutting and pasting shit from Word files into Solaris boxes. sysadmin -> SRE seems to have cleaned out the chaff.
This _was_ SA 101 and is the part you ignore. Hell even in 2007 if you didn't know hardware you were unemployable as an SA. What do devs writing json-rpc interfaces to monolith 1212 running oracle X as backend care about infra?
This is the sea change and not everybody agrees with it.
You have a 'cloudy' perspective.
As for endless alerts, I've been learning that there are 3 types:
• false alerts, which were set up "just in case". The "handling" tends to be "take a look, declare that nothing is wrong after all, mark it solved". Your job is to summarily eliminate these.
• "good" alerts that indicate a temporary problem that could/should be handled automatically in an ideal world, but can't be handled automatically as things stand; your job is to move the status quo in the direction of the ideal world, gradually.
• all the rest. These are the unavoidable alerts. Usually, the best you can do is to ensure the alert message and related logs provide all the relevant information necessary to handle it.
PS: of course, TFA indicates Google has a much more systematic approach, but you probably won't be able to just copy them.
2) Fix the root cause definitely
3) Delete the alert
If you can't do any of that, just delete the alert. It also means you are not an SRE; you are a powerless ops monkey who's getting spammed by other people's alerts (very common in many bad places).
It's a business win to find someone with multi-disciplinary skills without having to pay them the combined salary of those disciplines.
It's even possible they are only paying him a developer's salary.
Instead of creating dedicated application users and chroot jails and alternate port numbers to let applications coexist on a server, we're spinning a bazillion instances and building out storage backend to support it, and so on.
And the lowest-level problems still exist, though we don't troubleshoot them as often; we just re-spin. Same reason nobody's repairing their RAID card with a soldering iron anymore.
Related post: https://medium.com/outsystems-engineering/the-law-of-conserv...
http://www.se-radio.net/2016/12/se-radio-episode-276-bjorn-r...
I've been interested in all aspects of CprE, and I taught myself how to run a Linux box along with DNS, VPN, and other services. Last year around this time, I was contacted by a recruiter for an SRE internship. At the time, I had no idea what SRE was and I thought it was just a glorified IT job. Boy was I wrong.
I got through a few interviews and got the position for the summer. About a week or two into the internship I fell in love. This job was all about designing and implementing systems that have to be resilient and must scale. The idea of building automation to make my job easier was and is great. It was just like the labs I enjoyed in college.
Fast forward to now, and I'm accepting a full-time SRE position at the same company. I couldn't be happier with my choice of specialization. The need for resilient, distributed systems will only grow in the coming years, and I'm looking forward to being an SRE.
DevOps is really a process definition with Lean system thinking and code workflow priorities. Many people will tell you that it is NOT a job function but a culture or approach. DevOps for developers generally means CI/CD pipelines and owning code into production. DevOps for operators generally means building configuration automation and integrated monitoring tools. In this way, DevOps is highly complementary to the SRE job function.
I've been writing a lot about this on my personal (robhirschfeld.com) and company (rackn.com/sre) blogs. I'd be happy to discuss this in more detail here.
It seems like changing their hiring process is a double edged sword. If they change it to allow more hiring volume, then other employees might become frustrated with how easy it becomes to work at Google. On the other hand, keeping an old hiring process where false negatives continue to occur seems very bad.
This doesn't make the role bad. But it's important to remember that the role exists as a cost savings to the org, not because it's an inherently better way to run a technical infrastructure.
They were overcompensating for the fact that SREs aren't SWEs so hard. It's like "we get it, SREs are wannabe SWEs, stop trying to sugar coat it". 50% development? What a disaster. If half your job is the job that you want and the other half is administrative bullshit, why in the living fuck would you try to make a puff piece about that?
It seems as though from what everyone has said in this thread, that SRE is basically a scam along with DevOps and that the real job people want is the SWE.
I don't like internal memo propaganda pieces by big companies. It's not intellectually stimulating: it's hogwash. Let the truth reign always.
