After a year, all we really had was hundreds of Confluence pages specifying policy for this and that, a bunch of ceremony added around releases, and arbitrarily defined SLAs/SLOs. Almost zero actual technical work was accomplished. They then started filling the team with new grads who had neither devops nor sysadmin experience, cementing the perception that it was a non-technical role.
Either way I’m not a fan of the entire concept after being subjected to that suffering!
Google is putting a lot of marketing money into the topic, but from what I saw they are trying to warn that there is no one-size-fits-all here.
I think people should realize that when you go into the hiring market looking for SREs, you are about to pay more, not less, for a highly qualified software engineer who is also a systems expert. You want these people to not just operate the product, but to be closely involved in architecture and development so the product turns out to be operable. Your SREs should be working in the product code on the bits that ordinary developers might overlook or not be qualified to write, because they lack expertise in systems or distributed-systems behavior.
What this article is calling "real world" is just an attempt to redefine the term.
Any time an org I go into says 'we want an SRE because Google has one', I can't help myself, I blurt out: 'You are not Google. What works for them will not work for you; you need to find your own way of doing this that fits your org. Trying to mimic Google without understanding why they are the way they are will damage your org.'
That matches my experience at most companies, the SRE/DevOps is just the sucker who is on call.
That's exactly how it is implemented in my company. We are called SREs but everything is driven by management (and by that I mean a single VP who is omnipresent).
Adopting another company's practices is often lazy thinking. There are no silver bullets.
This is 100% my problem with this article.
At no point is it ever implied WHAT or WHY or WHERE actual work is being done, only HOW they organize it. It looks to me from the article like they are using every “cool” tool and care more about the how than the why. That's fine if it's the point of the article, which I guess it is, but it reads like “reliably” structuring a husk.
I defend the costs of IT and in this case the costs of an SRE approach - but these aren’t the things that make you money and keep your business afloat.
When I spend a couple weeks making a system wide event bus to get messages from n to m that’s cool and I need it... but customers don’t pay me for that. They pay me to fix a problem they have.
I’m very cautious of articles like this that say all “the right” things but don’t even hint at solving customer problems.
Pipe dream of mine, for sure.
There have been at least three Google SREs who have posted in here about what they do day to day, and it's literally no different from any other non-FAANG SRE/ops role any of us do at various scales, except for the gatekeeping.
This helps tremendously in troubleshooting and building stable products.
Having a combination of systems, networking and programming "perspective" on a team seems to result in very robust systems all around.
For those who aren't familiar with the Google acronym soup: SE here means systems engineer, not software engineer (SWE).
Best SRE/QE model I've experienced was with a couple of these guys being embedded into a dev team (reporting structure didn't really matter), such that they participated in the team meetings and rituals, becoming domain experts. One fellow eventually learned the SAML 2.0 spec inside out and was invaluable for catching bugs.
I suppose being embedded but reporting up through a different org did have its benefit: namely, the consistent dissemination of best practices across the company, with a judicious eye for applicable context, which neither the do-it-all-SWE model nor the current SRE ivory-tower model can manage.
The do-it-all-SWE model has yielded trash QE/observability in my experience.
This was the problem the Raspberry Pi was designed for: getting GCSE and A-Level students actual hands-on experience, so that when you got to university you had at least the basics.
- Initially every engineer does everything
- Growth gives birth to independent SRE teams. Now we have engineers who do only SRE
- We realise SRE is a bottleneck for product teams. SRE teams are removed and all SREs are embedded into the product teams to do dedicated work for them
- Other team members are now required to learn SRE stuff, especially now that Terraform etc. are there, so we are back to everyone being an SRE
Note that FAANG-level companies go back to independent infra, SRE, etc., maybe because of the niche and custom software they have.
The problem at most companies is that if you have two engineers, Alice and Betsy, and Alice launches a change that generates marketing buzz that gets spun into sales, while Betsy launches a change that prevents the site from going down over the seasonal rush, Alice gets promoted and Betsy gets told to work on more impactful projects.
Which isn't to say that Alice's work isn't important or promotion worthy, but you need an incentive structure that rewards both.
That's my 10+ years of ops experience, at least, and this is hugely exacerbated if Alice is put on such a high pedestal that she doesn't even have to be on-call with Betsy at any point of the lifecycle.
Organizations succeed by delivering customer value. Re-orgs of teams to remove pain points and blockers are a sign of an organization flexible enough to handle its growth. Even if you go back and forth a few times between similar states, the larger point is that you are willing to do what is needed to continue to effectively deliver value to your customers.
They seem to be unable to organize and delegate work effectively even when it's needed, and instead rely on polyglot engineers to bridge that gap by overloading their responsibilities.
SRE becoming a bottleneck at scale is something I see a lot. Usually regular product engineers who are more familiar with infra start doing it full-time for the whole company.
That’s been over the course of about a decade.
Later on, in the mid-90s, the BT worldwide intranet was run by a full-service team that handled hosting, admin, and development.
This means that management decided velocity is more important than reliability?
> This means that management decided velocity is more important than reliability?
No. Given the organizational response ("SRE teams are removed and all SREs are embedded into the product teams to do dedicated work for them"), it seems to me to be, like the motivation for Agile cross-functional teams instead of separate Analysis, Design, Coding, Testing, ... teams, and like DevOps and DevSecOps, yet another instance of realizing that throwing products over internal walls in organizational handoffs produces inefficiencies.
While there is more structure at Google, it's basically the same job :)
The implementation varies between companies, and I understand this is what these articles try to capture. My point is that as an individual contributor in an organization where "SRE" is really a thing, I believe that the role is often quite well defined.
It has a lot of benefits:
* we have an influence on the priorities of the project, and make sure our issues are prioritized. You can't be sure of the evolution of an open-source project.
* continuous rollouts. In my previous job each major version bump of Kafka or Hadoop could become a complex migration to plan a long time in advance. Now, we still have migrations to do, but often the scope is smaller and the transition smoother.
* strong culture and policies. We hold ourselves accountable. In my previous job, we would often mitigate incidents and forget about them, not write the post mortem or really root cause the incident. Usually because we knew/believed it would not be solved anytime soon (conflicting priorities, or whatever else). I've never seen an incident swept under the rug at Google.
- "oh, so like QA in my company?"
I asked what QA does at his company, and, well, yeah, exactly. I've read the books and know about SLAs/SLOs, and honestly it doesn't make any sense for smaller companies with few services.
I code up alerts for various metrics, act upon said alerts, sometimes do the fixes myself if it's simple, otherwise hand it off to devs by logging a bug. The remaining time I code tools for other SREs to use.
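For a flavor of what one of those alerts looks like, here's a minimal sketch of a threshold check in Python. The endpoint, metric names, and threshold are made up for illustration; in a real setup this would more likely live in something like a Prometheus alerting rule:

    # Minimal sketch of a metric threshold alert. The endpoint and
    # metric names below are hypothetical, purely for illustration.
    import json
    import urllib.request

    METRICS_URL = "http://localhost:9100/metrics.json"  # hypothetical endpoint
    ERROR_RATE_THRESHOLD = 0.05  # alert if more than 5% of requests error

    def fetch_error_rate(url: str) -> float:
        """Fetch a pre-aggregated error rate from the (hypothetical) metrics API."""
        with urllib.request.urlopen(url) as resp:
            metrics = json.load(resp)
        return metrics["http_requests_errors"] / max(metrics["http_requests_total"], 1)

    if __name__ == "__main__":
        rate = fetch_error_rate(METRICS_URL)
        if rate > ERROR_RATE_THRESHOLD:
            # In practice: page the oncall, or auto-file a bug for the devs.
            print(f"ALERT: error rate {rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
        else:
            print(f"OK: error rate {rate:.1%}")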
In my case:
* I have a good in-depth knowledge of one product I'm oncall for. My team may implement some design changes required (ex: for scaling) and I contribute code to this project.
* I have shallow knowledge of 2 other products; I'm trained at identifying a trigger that causes an outage and mitigating it (common mitigations are: move traffic, increase quota temporarily, roll back a release). If I can't solve an incident without in-depth knowledge of these products, I escalate to the dev team, and they will be responsible for root-causing.
A mitigation that requires a patch, cherry pick and rollout would be a terrible (and risky) thing to do, so I'd rather find any alternative first.
First, about the topic. The title, "Real-world SRE: What not FAANG companies are doing," implies that the series is going to look at how the majority of companies (as in, not startups) are implementing SRE. This is not the approach; rather, this specific article divides the world into either FAANG or SV-style startups, and ignores everything else. That's not a microcosm I'm interested in.
Second, about the writing style. The stuttering caused by using sentence fragments everywhere is really jarring. It's so bad that I couldn't focus on the actual content. Normally I'm not much of a stickler for grammar, but in this case, it's distracting.
That's exactly what I was expecting as well.
>either FAANG or SV-style startups, and ignores everything else
SV in a nutshell. Also obviously only North America exists.
In general I agree with the point you're making, though.
Did you... read the article?
This is the first issue, so this feedback is very welcome to make sure the following ones are better. My goal is to actually cover a diverse set of companies; this is only the first one.
Noted on the style! I'm probably biased by all the SEO/marketing stuff I read. I should improve readability for humans instead of machines (Google). Thanks, will work on that.
I’m not sure you can even “write for SEO” in 2020. Maybe in 2012 you could. If you are talking about keyword stuffing etc.
Making the article readable so that people want to share it is paramount.
I want to read about transitions. I want to see how late-adopter and mainstream companies go from old, manual processes to modern, automated processes.
As someone who's been in both slow-moving federal government agencies and bootstrapped startups with the latest tech, I've seen a wide variety of tech journeys happen, and it's fascinating.
It's been mentioned in other comments, but I'm super fond of SRE embeds for the reasons already shared here.
It was all automated and self-service; there were something like 700k jobs deployed a day when I left. I suppose it's similar to Borg at Google/Facebook, but way older.
The most important skills to me for an SRE/ops role are curiosity and pattern recognition, and, at #1, humility/empathy.
There is a massive amount of IT work being done, work that you, me, or any other denizen of HN would probably rate as terrible and smelling of the Middle Ages. But it gets the job done, even if it's being done manually. And some people just don't know any better.
That being said, the article was nice, although I wish there was more detail.
> Product teams that opt-out or don’t have their application ready for SRE run on their own. They have full access to the resources they need. The interface with the SRE team is minimal in these cases.
> Custom features lose the SRE team support.
is something that sounds great in a small team, but I do not think it would scale. It lends itself to the platform just being a loosely coupled collective of independent Not-Invented-Here fiefdoms, which means each product team needs its own SRE specialists, who can then constantly engage in infighting, which IMHO would lead to the reliability of the system as a whole going down rather than up.
Personally, I think there is much value in a "platform" where infrastructure (compute, storage, queues, monitoring, alerting, auditing, etc.) is provided in a centralised fashion, owned and operated by platform teams that provide self-service touchpoints for the service teams but do not give them full independence to run their own thing.
So... you're pushing alerts into a Git repo and pulling them somewhere else to display them? I'd call this many things, but "extreme speed" is not one of them.
I rather like that approach myself—I would not want logic and configuration living in some stateful system outside Git.
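To make that concrete, here's a minimal sketch of the kind of CI gate that keeping alerts in Git buys you. The alerts/*.json layout and the required fields are my own assumptions for the example, not anything from the article:

    # Sketch of a CI check for alert definitions versioned in a Git repo.
    # Assumes one alert object per JSON file under alerts/; the layout
    # and required fields here are assumptions for the example.
    import json
    import pathlib
    import sys

    REQUIRED_FIELDS = {"name", "expr", "severity"}

    def validate(repo_root="."):
        """Return a list of problems found in alerts/*.json; empty means OK."""
        problems = []
        for path in sorted(pathlib.Path(repo_root, "alerts").glob("*.json")):
            try:
                alert = json.loads(path.read_text())
            except json.JSONDecodeError as exc:
                problems.append(f"{path}: invalid JSON ({exc})")
                continue
            missing = REQUIRED_FIELDS - alert.keys()
            if missing:
                problems.append(f"{path}: missing fields {sorted(missing)}")
        return problems

    if __name__ == "__main__":
        issues = validate()
        for issue in issues:
            print(issue, file=sys.stderr)
        sys.exit(1 if issues else 0)

Every alert change gets review, history, and git revert for free; raw delivery speed is a separate question, as the parent points out.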
The scenario is not like developers throwing things over the wall to operations. First there is this architect team deciding things, then it comes to the developers, then, for structure changes, to the DBAs; at this point almost 80% of things are decided. If there are any perf issues, only the logic will be changed. Then come handovers, peer checks, approvals, and then production. So there are lots of walls. An outage is a big thing; an implementation is a big thing when there is a structure change to a table that's handling 3000+ transactions per second. There are so many things to look out for. So it's not like I can push anything to production in a few minutes or a couple of days unless it's a fix.
On the concept of error budgets: yeah, it's nice. The error budget is depleted, so let's stop feature development. From a banking perspective, what if there is a regulatory change? Well, that has to go out regardless. I haven't seen any examples of companies saying they stopped features with revenue potential because they had depleted their error budget. For example, a product company saying we didn't release the next version because we depleted our error budget. Even Google hasn't showcased anything on the subject saying releases or next versions were delayed because the error budget was depleted.
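For reference, the arithmetic behind "depleted" is simple. A sketch with illustrative numbers (a 99.9% monthly availability SLO, nothing from any real system):

    # Basic error-budget arithmetic for an availability SLO.
    # All numbers are illustrative.
    SLO_TARGET = 0.999             # 99.9% availability target
    WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

    # The budget is the downtime the SLO tolerates in the window.
    budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES  # 43.2 minutes
    downtime_minutes = 30.0        # downtime spent on incidents so far

    remaining = budget_minutes - downtime_minutes
    print(f"budget {budget_minutes:.1f} min, remaining {remaining:.1f} min")

    # The contested policy: once remaining <= 0, feature releases stop
    # until reliability work brings the burn back under control.
    if remaining <= 0:
        print("Error budget depleted: freeze feature releases.")

The policy question in banking isn't the math; it's whether a regulatory release can ever be gated by it.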
I will agree with half the book on SLOs: they are hard. Monitoring the performance of a customer journey is hard; the request travels through many tiers and technologies. Each tier or tech has its own monitoring system, but tracking the customer journey end to end, from e-banking or mobile down to the mainframe with so many things in between, is hard. That's something to think about. I would like to know how big, complex organizations handle this, as that's the only way to know if something is broken; it's better to get notified by the monitoring team/system than by social media or the news.
Tooling and automation need to be centralized, otherwise there will be a lot of duplication.
Currently we have people who are THE EXPERTS in their tech but unaware of other techs. This needs to improve. But an SRE has to be a sort of jack of all trades while also being an expert.
Postmortems are locked up somewhere; they exist, but they are confidential. There is a difference between a search engine breaking down and a financial institution breaking down due to a bug. Smaller things can be showcased, as there is always an RCA for an incident.
The book talks about test automation, but at the moment there is a growing trend of hiring more test analysts. If every organization is going for test automation, why does this trend exist?
Overall, adoption will be a hybrid. A centralized team is needed to gather information about all platforms and build a centralized monitoring application. There needs to be a catalog of automations, like an organization-level git repo, where anyone can read the README and clone or fork a program as it fits their requirements. It would also be good to know the pool of experts and be able to ask them for recommendations; currently we only learn who they are when we work with them.
Automation that was not a priority can be made a priority.
The book has lots of good ideas. Adoption is a slow process, but it's completely useless to just change job titles, the way some companies literally end up with too many vice presidents.
Hope is not a strategy