How They SRE (github.com/upgundecha)
275 points by zdw on Feb 16, 2021 | 98 comments



One of the links is to Google’s Site Reliability Workbook (https://sre.google/workbook/table-of-contents/). At my last company, we implemented SLOs based on the advice in this workbook, and I thought it was an excellent approach. Having internal reliability targets that directly map to the meaningful parts of the customer experience, and setting up dashboards and alerts based on these targets, is a very powerful way to achieve reliability that matters for users.
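For anyone wondering what such a target looks like in practice, here is a minimal sketch (my own illustration, not taken from the workbook) of measuring an availability SLI against an SLO and seeing how much error budget has been burned; all numbers and names are hypothetical:

    # Minimal availability-SLO sketch; hypothetical numbers, not from the workbook.
    SLO_TARGET = 0.999           # 99.9% of requests should succeed over the window
    WINDOW_REQUESTS = 2_500_000  # total requests served in the 30-day window
    FAILED_REQUESTS = 1_800      # requests that returned 5xx or timed out

    sli = 1 - FAILED_REQUESTS / WINDOW_REQUESTS        # measured availability (SLI)
    error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS  # failures the SLO allows
    budget_spent = FAILED_REQUESTS / error_budget      # fraction of budget burned

    print(f"SLI: {sli:.5f}")                           # 0.99928 -> meeting 99.9%
    print(f"Error budget spent: {budget_spent:.0%}")   # 72% of the budget used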

A great doc from Google that they didn’t link to above is “My Philosophy on Alerting”, by Rob Ewaschuk (https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa...). A must read for anyone setting up alerts IMO.


Thanks for sharing the second link. It's the first time I'm seeing it, and it encapsulates my own philosophy on alerting well, while providing examples of what to alert on.

I've been with teams who treat alerts as informational and then quickly get alert fatigue. Then when something actually does happen they don't recognize it because they've trained themselves to ignore all alerts.

At my current company we regularly review our alerts to make sure they are still actionable and meaningful.


No problem! And agreed, the advice to “alert on symptoms users actually experience, and only on things a human actually has to deal with” is great. Alerts that don’t fit this should be quickly tweaked or disabled, to avoid alert fatigue.

“Something went wrong, but at acceptable levels/without much user impact” belongs in dashboards, not alerts. Worth being able to monitor, but alerts are for urgency, and that’s not urgent.
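One common way to operationalize "alerts are for urgency" is burn-rate-based routing, in the spirit of the SRE workbook: page only when the error budget is being consumed fast enough that a human must act now, and let everything else land in tickets or dashboards. A rough sketch, with simplified windows and illustrative thresholds of my own choosing:

    # Rough sketch of burn-rate-based alert routing (windows/thresholds illustrative).
    # burn_rate = observed error rate / error rate the SLO allows (30-day window assumed).
    def route_alert(burn_rate_1h: float, burn_rate_6h: float) -> str:
        if burn_rate_1h > 14.4 and burn_rate_6h > 14.4:
            return "page"      # budget gone in ~2 days at this pace: wake someone up
        if burn_rate_1h > 6 and burn_rate_6h > 6:
            return "page"      # slower but still urgent burn
        if burn_rate_6h > 1:
            return "ticket"    # budget eroding; a human should look during work hours
        return "dashboard"     # within budget: worth graphing, not worth an alert

    print(route_alert(burn_rate_1h=20.0, burn_rate_6h=16.0))  # -> "page"
    print(route_alert(burn_rate_1h=0.4, burn_rate_6h=0.8))    # -> "dashboard"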


Fyi... sometimes people confuse SRE==sysadmin and believe "SRE" is just a fashion label invented by Google.

To clarify, an SRE as Google thinks of it[0] is a deeply skilled programmer (e.g. C/C++, data structures, algorithms) ... and the custom software they write helps keep the website running. In contrast, a classical "sysadmin", as that term was used in 80s/90s job roles, was usually not a hard-core C++ programmer. I made a previous comment about the difference, motivated by COTS vs internal custom software stacks.[1]

So Google didn't invent the concept of programmers-as-website-sysadmins because older websites like ebay.com and amazon.com already had highly-skilled programmers maintaining the websites' health. It's just that Google's "SRE" terminology seems to be the most popular label for it.

It's true that not every company needs SREs. E.g. if a dentist's website is just a Wordpress site (COTS software), you don't need SREs managing it. However, if you're offering a DPaaS (DentalPractice-as-a-Service) cloud platform for thousands of dentists, you now need SREs and "SRE best practices" for your custom web solution.

[0] example tech skills desired: https://careers.google.com/jobs/results/75525862415311558-so...

[1] https://news.ycombinator.com/item?id=14155601


> "sysadmin" as that term was used in the 80s/90s job roles was usually not a hard-core C++ programme

I am now an "SRE" at a very large "webscale" company.

I was, am, and will continue to be a sysadmin. All of the principles that I've been using for the last 15 years have not really changed, only the tools.

The concept that sysadmins were only ever point-and-click, or physical lift and shift, is a stereotype that needs to die. We were _always_ programming our way out of things. The whole point of being a sysadmin is that you are lazy, and to be lazy you need to:

1) find out what changes are coming down the track

2) automate those changes.

You could only ever be a good sysadmin if you knew what the business needed. Sure, tech is fun and all that, but if we have the most stars on our repo and no income stream, then we are sunk.

That means:

o providing a rock solid platform for devs to work on

o removing the thorns and friction from the entire system

o making reliable, logged, secure and metric'd code the easiest type of code to deploy

o clearly documenting how to do things

o keeping costs down

o bugging people when their shit breaks and they refuse to fix it


The best sysadmins have always been programming/automating (and are lazy in an enlightened way).

But the stereotype exists for a reason; many admin roles and admin staff were/are “zero scripting expected” help desk type roles.


I feel like this version of "sysadmin" went out of favor some time ago.

Back in the early '00s I got my start in industry as a sysadmin, working in non-SV startups. Although I had taken courses in C++ and could muddle through it, I seldom did much low level programming - the C/C++ knowledge I did have was helpful to read code, submit the occasional bug report, and understand the rube-goldberg build processes which really seemed to be in style at the time.

Most of the code I wrote was the kind of glue you needed to make sites work and web-ify basic tooling so normal users could access it - and that meant hacking in shell, perl, or python. These things I simply took for granted as part of a sysadmin's job. As the "web" stuff really took off, that eventually landed me doing more with PHP, Django, and Rails.

All of these things matched my conception of what a "sysadmin" would be, but the industry seemed to disagree with me. I worked in other sectors and came to realize there were a lot of "sysadmins" who not only didn't know how to code anything, but also had no interest in learning anything. Eventually I just stopped being a sysadmin and started being a web developer.

Now here we are, with devops and SREs, and I smirk a little. The guy coding up web sites to present internal metrics and writing software to facilitate point-and-click deployments while dropping into Valgrind to help out a lost developer? He's an SRE, I'm sure. But then, so is the guy in the next cube over, who worked the help desk for a couple months and really knows how to click install and reboot.

These labels come and go and they mean different things at different places. Like "agile" or "devops" - once they escape the FAANGs, they start to lose all real meaning.


Exactly - especially in more technical areas, sysadmins quite often doubled as systems programmers as well.

I once got hired as a developer on billing systems for BT because I had SYSAD experience on the relevant hardware (PR1ME) and could run the systems as well as program them.


> a classical "sysadmin" as that term was used in the 80s/90s job roles was usually not a hard-core C++ programmer

I did not work in the 80s but the sysadmins I worked with in the 90s were the most skilled C programmers I've ever worked with.

Unless you mean C++ specifically, which has always been a bit divisive in that world, UNIX has always been inseparable from C. UNIX, C and TCP/IP were joined at the hip. You would have to try hard to avoid learning at least the basics.

If anything, sysadmins (or SREs or DevOps Engineers) these days program less, not more. So much has moved into frameworks and YAML. As reliability expectations for individual components tend to be lower now, things like inspecting core dumps are not as common.

The mockery of the non-working sysadmin has been a running joke for as long as the job has existed (BOFH, anyone?). But the idea of simpler times is mostly a chimera. Things have certainly changed and become more complex, but our brains are still as limited as they have always been.

On a related note, "waterfall" was also never really a development model. It's just something to call processes more rigid than you'd like. Software developers have always had to contend with fluid requirements, with the expected varying outcomes.


>"In contrast, a classical "sysadmin" as that term was used in the 80s/90s job roles was usually not a hard-core C++ programmer."

No, they weren't likely to be "hard-core C++ programmers", since UNIX was written in C. Sysadmins at that time were, however, likely to be hardcore C programmers working at an ISP or on a large university network. There you had to know how to maintain and modify the source to things like Sendmail, BIND, Apache and RADIUS. There was no StackOverflow or Google to consult. Your comment seems to be propagating a baseless trope that was never true. Your statements really suggest that you weren't actually there in the time period you are commenting on.


>To clarify, an SRE as Google thinks of it[0] is a deeply skilled programmer (e.g. C/C++ data structures algorithms knowledge) ... and the custom software they code helps keep the website running. In contrast, a classical "sysadmin" as that term was used in the 80s/90s job roles was usually not a hard-core C++ programmer. I made a previous comment about the difference that was motivated by COTS vs internal custom software stack.[1]

Quite the opposite, actually. In the 80s/90s many of the sysadmins were skilled C/C++/Perl programmers.


Another unnecessary BS term. The other day I read "SecOps"... Whenever I read SRE on HN, I think software reverse engineering [1], and I keep feeling disappointed every time I read the term here.

Classic sysadmins were, in my opinion, most certainly programmers. The shell involves programming (or "scripting" which is a derogatory term for high level languages). I know a fellow who programmed punch cards. Very unforgiving.

A sysadmin who's skilled with the shell would also pick up Awk and Sed (hence regexps and Perl). Nowadays, Python. You don't need to be skilled with the shell anymore in order to admin a system. There are GUIs such as Windows, cPanel, and Webmin.

[1] https://en.wikipedia.org/wiki/Software_reverse_engineering


The fact that Google has many types of SRE, some more ops-oriented and some more programming-oriented, contradicts what you're saying.

Some SREs will do little coding, while others will work on tools 80% of their time.


There's an important distinction here: if your development team also holds the pager, you're not doing SRE. You're just skimping on sysadmin roles. You're doing SRE once you field a 12-person team of skilled programmers whose only duty is working on reliability.


Is it just me, or is this 'SRE' stuff being hyped up as the next big thing everywhere on Twitter/HN/Reddit etc., just because Google invented it and now everyone wants to behave as if they have Google's problems?

I am genuinely curious why this is the case, am I missing something here?

Even my local grocer's website somehow needs Kubernetes skills all of a sudden, not kidding at all.


I'm fairly sure the SRE books by Google don't mention Kubernetes once, since they describe a technology-agnostic set of practices [1]. I would also say that the practices are independent of scale.

Google didn't necessarily invent it so much as coin a term for keeping systems running.

[1] https://sre.google/books/


I see a lot of comments mentioning scale, dedicated SRE teams and IaC, and thought this might be helpful.

I believe distinguishing the engineering from the engineer in SRE is important. The engineering side is essential but a dedicated full-time engineer may not be crucial.

The amount of engineering required depends primarily on the delta between your required availability (captured in SLOs) and your current availability.

The higher your availability needs are, the more engineering effort you'll need as code travels from commit to production, because availability doesn't start and end in production systems.

The framework exposes, or makes explicit, the considerations needed to achieve that goal. Often it comes down to balancing the cost of human time (manual effort) against the automation that would replace it, and from that, choosing the technology to achieve it. You may very well find that your availability requirements are so low that you can get by with a human pressing a button every few hours, and that this very manual approach is acceptable.

Going through the process will answer most of the questions about scale, whether a dedicated Site Reliability Engineer is required, and so on. The SRE framework scales even if the systems themselves don't require sizable scale.


They do mention Borg though.


SRE hasn't really got anything to do with scale. Just because Google literally wrote the book on it doesn't mean you need Google-sized problems to adopt the philosophy.

My takeaway is the switch to focusing on reliability from a customer perspective: infrastructure failure is inevitable, but impact to the end user doesn't have to be.

To understand how any failures do impact your users, it's usually very important to pick some good metrics and make informed decisions on what level of reliability you want and how you are doing.

That then feeds decisions on where you may want to improve your own tooling/processes to improve these metrics/end user reliability.

The dev focus is because you need the skills to automate and fill gaps in your tooling and you need to be able to understand the software you want to measure.


Scale still has something to do with it. If you are small scale, spending an engineer month to think about and make informed decisions and implement it may be more expensive than just accepting whatever downtime happens along the road.

I think premature optimization of uptime is common. Yes, being down a bit is costly, but it may still be cheaper than a development team obsessing over adding 9s to the service level percentage.

OTOH, obviously for Google, downtime must be minimized at almost any cost.

It is all about scale; but scale of budget, not traffic.


> Scale still has something to do with it. If you are small scale, spending an engineer month to think about and make informed decisions and implement it may be more expensive than just accepting whatever downtime happens along the road.

This is discussed pretty explicitly in the SRE book, in particular the idea of an error budget. Obviously they use some google services as examples, but the key message I got from it was "think about how much downtime is acceptable to you and work around that".

> OTOH, obviously for Google, downtime must be minimized at almost any cost.

Interestingly the SRE book pretty explicitly says that this is a poor goal for pretty much everyone. The cost of chasing more 9s goes up exponentially, while for most users whether your service is 99.999% available or 99.9999% available makes no difference, because 0.1% of the time their shitty router crashes and they have to restart it.

Better to pick a level of availability that strikes a balance between cost and user experience, then work towards that.
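To put rough numbers on how little each extra nine buys, here's a quick back-of-the-envelope calculation of allowed downtime per year (taking a year as 365 days):

    # Allowed downtime per year for a given availability target.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    for availability in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
        downtime_min = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.4%} -> {downtime_min:10.2f} minutes/year allowed")

    # 99%      -> ~5,256 min (~3.7 days); 99.999% -> ~5.3 min; 99.9999% -> ~0.5 min.
    # That last step costs a fortune and hides entirely behind the user's flaky router.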


True; it seems like a simple concept to me.

1. Define some reliability target in advance (best expressed as SLOs), along with the steps to take if it is not reached.

2. If the service fails to reach it, take the steps to increase reliability agreed in step 1.

3. Repeat at regular intervals.

The point, I think, is that these things are arranged in advance, not after some shit happens, because people get very subjective about "their own" service. The target is there, so let's try to reach it. We have an error budget as well, so let's use it. If you don't have anything (as I've seen in a lot of places, or a wishful 100% reliability target), you'll have major reliability problems, I'm absolutely sure.
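To make "arranged in advance" concrete, a pre-agreed error-budget policy can be as small as this (my own sketch, not taken from the book; the thresholds are arbitrary):

    # Sketch of a pre-agreed, error-budget-driven release policy.
    def release_policy(slo_target: float, good_events: int, total_events: int) -> str:
        allowed_bad = (1 - slo_target) * total_events  # error budget for the window
        actual_bad = total_events - good_events
        remaining = 1 - actual_bad / allowed_bad       # fraction of budget left

        if remaining <= 0:
            return "freeze releases; spend the time on reliability work"
        if remaining < 0.25:
            return "slow down: extra review and canarying for risky changes"
        return "ship normally"

    print(release_policy(0.999, good_events=998_500, total_events=1_000_000))
    # budget is 1,000 bad events, we had 1,500 -> budget exhausted -> freeze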

So the SRE book tries to give you a solution to a lot of headaches some medium to large companies might be facing.


I think the mindset is important (think the chapters from the SRE book), but as you imply, it's a problem you and your company WISH you had, it's the kind of focus on quality and uptime you WISH you had.

Therefore, while I think knowledge of SRE is important, at the same time you need to fear cargo cult thinking.

A (imo) worse trend I'm seeing is the focus on microservices. I'm in the Golang slack channel, and on a fairly regular basis a fairly inexperienced dev comes in and asks about microservices - implying they're solo developers that need to set up a microservices architecture. The cargo cult is strong, and people (in general) underestimate the amount of work involved in setting up a proper distributed system (microservices on kubernetes with SRE or otherwise). And people underestimate just how much software and engineers a company like Google has and maintains.

I mean, my biggest software project(s) had at best three dozen engineers. Those plus additional staff already filled a building. Someone decided that we should do microservices, but even then I thought it was overkill; it's the amount of traffic we WISH we had. And of course the split was wrong: it was a rough split on domains instead of something more solid (e.g. we had a separate 'address book' service, and an 'order history' service that was separate from the actual 'order' service, which was a gateway to an older Java application, which itself was a gateway to an even older mainframe). I mean, those two needed the attention; the microservices (written in Scala, because otherwise they would be too boring) were just self-gratification.

I have no clue if they still have them. Part of me is thinking of going back there (I was working there as a contractor at the time).


> Even my local grocer's website somehow needs Kubernetes skills all of a sudden, not kidding at all.

Maybe I’m hurting myself, maybe I’m helping, who knows, but it’s become a recent tactic of mine in job interviews when a company demands Kubernetes skills (which isn’t to say I don’t have them, I’m just not as effective as I’d like to be...YET) to look dead in the camera and ask:

“How many nodes are in your production cluster right now, how many are control plane nodes and when was the last time two of them failed?”

Just to see if they’re putting in Kubernetes because they actually need it, or if they’re just faffing about with new tools because new tools and are struggling to even keep the lights on for their 3 microservices.

If there’s a better question to reveal this, I’ve got an open mind about deploying it to production.


> “How many nodes are in your production cluster right now, how many are control plane nodes and when was the last time two of them failed?”

What is a good answer and a bad answer for this, and why do two control plane nodes have to fail?


I don't know if the number necessarily needs to be two; my intention is to fish for an answer that isn't so much prescriptive as descriptive of their infrastructure needs and goals.

But a good answer would probably involve articulating some kind of genuine need that can't be solved with something else. I'd hope the answer involves some candor: that the company looked at their options, vetted that K8s was what they needed, can explain why K8s is going to solve (or has already solved) their problems over something (anything) else, Nomad or otherwise, and, if the interviewer is so inclined, discloses what they've learned trying to bring K8s into the house.

This last bit may be a follow up question from me "what have you learned so far implementing k8s? What would you have done differently if you knew better at the time?" etc.

What I'm looking for is a determination if the team thinks through the problem deliberately or if they're simply throwing their stuff at kubernetes because someone who no longer works at the company sold the leaders on kubernetes and no one asked any questions as the org jumped into those waters.

Mind you, this part is important: I am NOT anti-Kubernetes, I am anti-foisting-Kubernetes-on-platforms-just-because-it's-hot-and-exciting. If a team wants to R&D it somewhere, prove out its merit, create an MVP internally, and can show the effort is worth undertaking to redeploy services on K8s, awesome.

My lived experience has been more teams than not are doing it because a thought leader somewhere wanted to pad their resumes, and then left the company, now the rest of Ops is paying down technical debt they never took out loans for.


As an actual full-time senior SRE: telling the grocer's website dev team they don't need Kubernetes is pretty much what the point of SRE is.


Your local grocer's website doesn't need Kubernetes.

However what’s happening is that the way we deploy software is being automated and standardized.

The dream scenario we're trying to achieve is to tell the system: here's my application, here are some servers, now go deploy it and keep it up and running.

This actually does work even in the face of failures. It’s just complex because it’s new and solving a very hard problem.

I would consider these systems a form of AI.

When you manage tens of thousands of servers these investments make sense. And once these systems stabilize they will be usable without much difficulty. Which means in five years you’ll have an in demand skill if you learn it now.


The main thing I take from it isn't actually the scaling side. If you need to scale it's probably when you're getting a bunch of money to do it, so you'll find a way.

The thing that caught my eye is infrastructure-as-code. It's massively useful to have your setup in version control, regardless of how big you are. A few years ago I was managing code separately from, e.g., network configs, typing commands into routers, that kind of thing. It's a natural step to have all the infra in a form where you can see changes as well as reliably deploy them.
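A tiny, self-contained illustration of the idea (no particular tool implied; the names are made up): describe the desired state in code that lives in version control and render the device config from it, so every change is a reviewable diff rather than a command typed into a router:

    # Toy infrastructure-as-code sketch: desired state lives in version control,
    # and the device config is rendered from it rather than hand-typed.
    from dataclasses import dataclass

    @dataclass
    class Vlan:
        vlan_id: int
        name: str
        subnet: str

    DESIRED_VLANS = [  # editing this list is the "change"; git shows the diff
        Vlan(10, "frontend", "10.0.10.0/24"),
        Vlan(20, "backend", "10.0.20.0/24"),
    ]

    def render_switch_config(vlans: list[Vlan]) -> str:
        lines = []
        for v in vlans:
            lines += [f"vlan {v.vlan_id}", f" name {v.name}", f" ! subnet {v.subnet}"]
        return "\n".join(lines)

    print(render_switch_config(DESIRED_VLANS))  # hand the output to your deploy tooling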


> The thing that caught my eye is infrastructure-as-code

I thought that's what DevOps was?

My understanding - from one conversation with an SRE one time over a beer! - was that an SRE's job is not to keep things running and scaling, but to work with development teams to help them keep their things running and scaling. So the key idea is that SREs get leverage by having all other developers work on operability.


Yeah, but everyone seems to be talking about the scale side more than the changes-being-accountable side.


It's human nature. People want to do what the "pros" are doing, disregarding skill level, needs, and experience. You can see it in the gym, where people emulate their favorite bodybuilder's routine even though they weigh 110lbs soaking wet and have less testosterone than a mouse.

SRE is the same. DevOps, microservices fall in the same bucket.

The irony of course is that big companies, the ones that really need DevOps, microservices architecture, etc., sometimes have the hardest time adopting new ideas to leverage them. Case in point: I work at a large FinTech company, we are talking multi-billions. In 2015 a small group wanted DevOps, and I became that; only now, 5 years later, is the greater part of the company starting to adopt it. I'm still giving presentations on VM vs Docker differences. While at the same time the team that wanted DevOps deploys to production Openshift clusters multiple times a day.


It's as much "the next big thing" as "Cloud" has been since 2006 or so: it has been a basic, valid option for organizing substantial-scale operations for the past 15 years.

"SRE" may be a Google-initiated label, the practices behind predate it (first sight I've seen was Flickr in 2008).


It predates Google, though in a manufacturing or janitorial sense.


Those damn books. They keep showing up and making life miserable for people. "How to Cargo Cult".


Side note - for those HN readers like me who had never heard this term "Cargo Cult" before, here is a quick Wikipedia definition:

> A cargo cult is a millenarian belief system in which adherents perform rituals which they believe will cause a more technologically advanced society to deliver goods. These cults were first described in Melanesia in the wake of contact with allied military forces during the Second World War.

Personally, I'm still trying to wrap my mind around the implied context here.

Sorry it's slightly off topic - but perhaps someone can clarify?

Thanks :)


It's a bit of a weird one! I think the most famous example of the term comes from Richard Feynman, in his essay "Cargo Cult Science" (http://calteches.library.caltech.edu/51/2/CargoCult.htm):

> In the South Seas there is a Cargo Cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas—he’s the controller—and they wait for the airplanes to land. They’re doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn’t work. No airplanes land.

It's about copying others to achieve a goal — but because you aren't copying the motives, only the visible methods, you don't succeed. In this case, it's about copying the engineering practices of big organisations without being a big organisation yourself, in the hope that you become one. Hope this helps.


This is also almost the first time I've heard about "cargo cult", and it does sum up nicely a lot of things I've seen at my workplace.

I hope this is a valid example?

1. Project P fails spectacularly, really bringing the reliability of the whole production system down. In reality we don't have proper targets, we have a "100% reliability" target, but still, it was below 98.5% judging by my calculations, which was unacceptable.

2. Management is furious at the time. The team in charge of that project promises that they are going to regroup, start doing "retros", use "planning poker", introduce more alerts (routed to some other team, of course...), etc.

3. Management gives them another (n-th) chance, because the tools sound cool and they think that employing "technology" will always fix social problems.


That definitely helps - I understand the reference perfectly now - thanks :)


SRE is orthogonal to Kubernetes. It is a quantitative approach to balancing customer demands with operational optimization, the latter of which may be addressed with Kubernetes, though not solely or even necessarily in the general case.


It's cargo culting of the highest order.

The truth is, no one knows we deployed the thing using CapRover by just wrapping the app in a Docker container.


Google et al. invent new problems (e.g., Kubernetes) and solve them accordingly (SREs). Quite clever, if you think about it but depressing nonetheless (and a waste of time and brains).


Yes, technology is like fashion. All this SRE stuff is just common sense that sysadmin/DevOps people know how to do. Many of these things are not needed. Do we really need Docker when the cloud provider can just take a snapshot? It really comes down to basic sysadmin skills.


What's it like being an SRE for a FAANG? To me it sounds like you need to know practically everything about software engineering (network stack, linux stack, web app stack) so it appeals to me as a generalist.


It’s great, but highly dependent upon which SRE team you’re on. Google’s internal infrastructure has advanced immensely over the past 5-10 years, which has shifted many capabilities from SRE to dev teams. Over time it has become relatively easy to build the long tail of scalable, reliable services, which has changed the responsibilities of SRE. For some of the larger teams/products, SRE is fundamentally unchanged and is really rewarding, but for others its responsibility has unintentionally evolved to more of a 24x7 operator role.


Being an SRE at Google was great for a generalist at a technical level.

Fairly interesting problems from multiple areas, possibility to delve deep into understanding/fixing any internal project, fantastic technological stack, overwhelmingly smart and kind coworkers, well balanced pager/oncall load.

The only downside was working for Google. Megacorp, small-cog-in-large-machine, americentrism, contract clauses limiting side projects, etc.


From my experience, the $500 Intel NUC in my basement has greater reliability than nearly every single company which calls their IT departments "SREs" now.

As an old school IT professional, I would recommend considering most of these companies' advice "how not to engineer reliability".

It's possible that beyond a certain scale real reliability just isn't achievable anymore, but then anyone below that scale should also avoid SRE practices. That is, do not use this for your side project.


> company which calls their IT departments "SREs" now.

My org recently made a bunch of people SREs, with no one leading the cause except for a Director who only involves themselves enough to be effectively seen as a taskmaster, and who is otherwise forgetful, disorganized, egotistical, lacks follow-through, constantly spreads his team too thin, and has put my direct boss in a REALLY terrible spot more than once after my boss did exactly what he was instructed to.

A bunch of people were moved from roles they previously excelled at into this new org structure and called SREs. Hired a few more off the street. What are they doing? 25% help desk tasks, 25% break fixes on legacy systems, 40% fixing their own damn laptops, 10% having panic attacks. Roadmaps? Lol. Planning? Lol. SLO? SL-NO. Error budget? I dunno, expense it.

We had to have a zoom to explain and get approval to use Terraform this morning. Not even Terraform enterprise (which is a cost), just plain Terraform.

Management essentially threw a bunch of shit at a wall, stood back, watched as it slid down towards the carpet, and found themselves pondering why the shit won't stick.

Short of using my main account which is a bit too revealing, I have NO qualms admitting a brutal truth about the situation in which I find myself: I am milking this company for every ounce of knowledge and upskill-ing my dev chops and I am gone come Summer.

The pay is JUST good enough that I can take my time, be VERY selective about the next job because good lord if I end up in another shop like this ever again it will be 100% my fault. I'm about to start being stupidly picky about who gets my labor from now on, my only regret is not seeing this value sooner.

Sorry for the rant but I'm mad and snowed in and that part of your comment really brought some shit up.


This is one of the best comments from a green handle I’ve seen in a while. Rant or not, it adds to the site; just this weekend, I was lamenting some very low quality green account comments and contemplating whether a 12-hour waiting period or de-ranking of green would make HN better. This argues strongly against.


I'm in a similar situation myself, actually: in a team of SREs that is "SRE" only in name. Just a bunch of ops guys, at the bottom of the food chain.[1] When I came over here, and I shit you not, there were PagerDuty alerts every 15 seconds at peak times, multiple at once at that. And for years they thought it was completely fine. It took me 6 months of employing my best soft skills to even be allowed to tweak some(!) of those alerts. Believe it or not, we have tens of millions of users each day and at least 50 "micro" services. It has been a complete mess with a lot of reliability problems in the last couple of months (failed projects getting "second chances", sunk cost fallacy, etc). Every few weeks some dev team organizes a meeting with us and basically says: "hey guys, so you've earned [newspeak for: we're giving you our shit we don't want to support so we can move on to the next big thing] this service, you're now the proud owners of it, glhf".

I'm also like you, preparing for my next job, and this is the first time in my career that I actually know what questions to ask in the next job interview. The experience here is that bad. The only problem I have right now is that I'm so ridiculously micromanaged at work and interrupted every few minutes (or more often) that I just cannot focus on learning anything of substance. So I'm trying to do it in my free time. I'll get there, but I agree, some changes you're trying to make on a job take years to happen (if they happen at all). Life is too short to change a lot of short-sighted, entrenched people.

[1]: word of caution for anyone: you cannot be a decision maker (and SREs are supposed to at least be at the place of decision making) if you're on the bottom of the hierarchy. So be wary of jobs with great sounding titles where you cannot change anything for the better.


One of the more ridiculous comments I’ve ever seen on HN. It might be the most ridiculous.

I’m guessing you can build AWS in a weekend and it would only cost a few thousand or less in hardware costs, and you can host it all on your cable connection.


> I’m guessing you can build AWS in a weekend

The OP did mention "scale" in his/her comment; the great majority of projects now being started won't get to AWS's size, hence they won't need to use AWS's way of doing things (or Google's, or Facebook's).


Funnily enough AWS' way of doing things is quite different to Google's and I think there are pros and cons to both.

Google's way is SRE as mentioned here, and I don't have experience to say more there. However, at least in my bubble of the world, AWS's way of "you build it, you run it" is quite popular (and quite effective IMO) for small companies up to any scale.


The Google SRE approach isn't that different. You can ship systems that don't follow the SRE playbook, SREs just aren't going to take on-call for them.

If you want a different team to be on-call for your application, though, there are baseline standards that you have to comply with, and if you breach those standards down the line they're going to hand the pager back to you until you're up to scratch.


AWS does have something similar to SREs though, at least in terms of skill sets. AWS has system development engineers (sysdevs) and systems engineers. When we created the SysDev role we specifically chose not to call it SRE because we didn't want people to think of it as a Google-style SRE.

The intent of SysDev is to create and maintain the internal, non-customer facing services. This includes writing code and creating services that maintain the reliability of the service/system. It’s usually related to the infrastructure in some way, whether it’s servers or networking but also expands to understanding how all the different sub systems of the AWS product work together.

The core difference here between sysdevs and SREs is that SREs often take over a product from an SWE team once it’s reliable and maintain / improve it. Sysdevs create an internal product and maintain it through the life of it.

Of course in AWS not all orgs follow the intent and often implement the role differently.


> I’m guessing you can build AWS in a weekend

No, but a service on my NUC has better uptime than a service on AWS, and costs less, so why would I want to build AWS?


Because your service could handle neither a million concurrent users tomorrow nor a power outage. Those probably aren't your requirements, but they are for most companies.


Only a tiny portion of companies ever reach a million concurrent users, or even have plans to; having enough users to pay the bills is already quite good.

A power outage can be dealt with just like in the old days, with a UPS unit.


> Because your service neither could handle a million concurrent users tomorrow nor a power outage.

SRE here.

92% of all companies can't handle this either.


> million concurrent users

Sure, but 99% of sysadmins/devops/SRE/whatever will never work on anything that has a million concurrent users.

Most startups will never reach a million concurrent users. And if they do, investors will happily shovel as much money as you need to make your site work at that scale.

Hell, even a million monthly users is a nice milestone that most projects never reach, and that usually translates to a couple of thousand concurrent requests (peak), which an average laptop could handle.


I absolutely have a UPS in my basement. Considering how little runs on it, it's pretty cheap to get a long runtime out of it too.

A lot of things I see with millions of concurrent users aren't actually monoliths: sure, Facebook needs to handle that. But most cloud apps would be better run with each tenant/business/team operating an Intel NUC in their basement, instead of the developer using the cloud as a way to force rent-seeking behavior.


Not having to deal with hardware is nice, and I don't think having datacenter grade internet access in your basement is realistic for most.


Do you need datacenter grade? Fiber can probably serve a lot of requests per second


Unless your fiber has an outage; then you want redundancy, multiple independent uplinks that is.

And if your whole region has a problem, which is more likely to happen than one might think, then you want a multi-region setup, e.g. us-west-1 and us-east-2, and then we can start to calculate the number of nines, unless your username is ocdtrekkie; he can beat AWS with a single NUC while he is sleeping.


Many big things started out in a garage, with very simple solutions like your little UPS powered NUC.


Including Google. Who have now hit a point where that's no longer sustainable, and developed a set of best practices to ensure reliability beyond what you can expect from a NUC in the basement.


SRE is just a buzzword for managing/operating components and infrastructure at software companies. You can have companies that do it poorly and ones that do it well. If it's an IT department that has been rebranded as SRE, then yeah, they would probably not do as well as ones that are staffed properly. There are some overlapping skills, but to do it well, you need to understand the software side as well as the infrastructure.

How would you suggest that a software development team (one of many such teams in a company) that has no experience scaling or making software resilient should operate? To make a blanket statement that all SRE practices are bad seems unfair. Most companies won't be able to get the talent required to do it well in each and every team. Having specialized individuals who can help guide those teams seems logical.

The SRE team can also cover the basics needed for a software company to operate with velocity and reliability. It doesn't really matter what you call it, but having the basics like logging, metrics, distributed tracing, dashboarding, and alerting managed well by one centralized team will allow the component developers to focus on their components and not have to worry about all that other stuff.

The old way of keeping IT and developers completely separate was crap. Working together using devops, SRE, or whatever you want to call it has in my experience been so much more helpful in building and scaling companies.


That's because no one uses the NUC in your basement.


> From my experience, the $500 Intel NUC in my basement has greater reliability than nearly every single company which calls their IT departments "SREs" now.

Probably 99% of the SRE problems I've seen are not IT problems. They're usually bugs or shortcomings in code some engineer deployed.


Well, either that or DNS


Apples and oranges though; if your business can run on a $500 NUC and if it failing is not a problem, then by all means go for it.

But if you look at scaled companies, you'll see a different picture. One of AWS's sales pitches is that setting up a datacenter is a big upfront expense, you need the space, hardware, and personnel to build it, and you have to provision it for peak load.

Take e.g. a recent game like Fall Guys, that went from 0 to millions of concurrent players within days, maybe even hours. Can't run that on your $500 NUC. You couldn't buy and provision enough NUCs to keep up even if you tried.

Anyway once again, if that one works for you then stick with it. I too prefer to not go overboard with scalability and the fancy technologies of today if I can help it.


Small shops shouldn't use SRE practices, that's for sure. At small scale your infra is like a house with a couple of pets - hand managed. But once you reach scale, you run a huge farm with thousands of cattle. They require a different approach, not only because of the cost (too many people would be needed), but also because of the requirements (try to change anything when you have that many people involved).


> Small shops shouldn’t use SRE practices

Hard disagree.

Properly architecting software, setting reasonable SLAs/SLOs and trying to achieve them/doing postmortems when not, reducing toil, proper monitoring etc are good practices for any company that serves customers. Spend (both time and effort) relative to your company size and resources and it'll serve you well.

(Disclaimer: Google SRE, opinions are my own)


Yep. As an ex-Amazonian who just joined a small startup, you can say this keeps me up at night in more ways than one.

Not least because we've apparently already made agreements with 2-person SaaS companies with no consideration of SLAs whatsoever.


True, but I think a lot fewer businesses should be operated as large farms.


The nuc is ofc a bit in jest, but I still see where you’re coming from.

If you take service delivery seriously, then SRE'ing and DevOps'ing are inherent in the way you work and in the leadership. You use metrics to continuously improve and don't leave things hanging. This requires a lot of work and most likely some custom tooling, a LOT of automation, and strict conventions.

If you're more of a traditional IT shop where people are winging it - a separate network team, a dedicated "vmware" team and a few guys doing "storage"... well, you've been left in the dust. Even if you dedicate a team but rely on the above organization, it will be an unreliable mess. And most likely slow moving - which is why stuff gets funneled to the "cloud", although for most uses it is more expensive and lower performing. It's worth it because you no longer have to deal with the above... it saddens an old school guy like myself.

This is at least my experience from big to small, tech through enterprise.

Edit: you're either looking at IT as a cost center or as a strategic asset. This is where the different approaches and delivery models branch out.


And you can implement Dropbox in a weekend with a NAS.


It’s not too difficult, if you ignore all the things which made Dropbox successful.


Just use rsync via ssh, right?


SRE, at least as it is applied at the companies where I've worked, isn't about increasing uptime at all cost; it's about hitting some reliability (or other) target given a long list of constraints. A lot of things are possible if your service can fit on a single box, but (at least where I work) that would usually violate some constraint I have.


Along the same lines, if you can avoid a distributed architecture, things get a lot more reliable. You can get a crazy amount of RAM, SSD, CPU cores on a single machine. If you run your system on a powerful machine with some other ones on hot standby, a lot of complexity goes away.


If you can run your system on a single machine, you don't need an SRE.

If you have hundreds or thousands of machines, that's an indicator that you /may/ have the complexity that requires the disciplines that come with dedicated SRE. The tough thing is people conflating filling operations gaps with a role named SRE versus actually using the best practices that will help you scale and improve reliability.


Your reservation rate goes to 200% though (a full hot standby) instead of, say, 120% to accommodate some nodes becoming unavailable.

If your hot standby is a $100/mo VM, it's not noticeable. If it's $5000/mo, less so.

To say nothing of scaling up and down with the load — which, of course, you only need if you are a pretty large operation.


If you’re paying Bay Area prices for engineers, $5K a month is a steal to not have to pay people to deal with sharding.


Not even Bay Area prices. A Jr. SWE, after overhead (benefits, HR, laptop, office space, etc.), easily costs the company 150k+/year in most markets.


Indeed, but take it a step further. Two ten thousand dollar servers in your basement with UPS and some rudimentary failover configuration is basically fire and forget. Remote in monthly and install updates. Done.

It'll run for ten years for next to nothing.


Until there's a power outage, flooding, malice, etc.

I think the main issue is that the cloud providers don't publish much about outages that don't affect the end-user. I mean a failed hard drive happens all the time, but S3 is never affected by that.


Depends on your bandwidth requirements. Also, if you want even higher reliability, you might consider getting two independent internet links into your basement, which is pretty doable in an urban setting.


But they won't have diverse routing :-) all it takes is a navvy with a backhoe digging in the wrong place.

And you also need diverse routing for incoming power, plus a generator / battery room set up.


Run the backup from your friend's basement in the next town over using a different ISP. You can run the backup for them.


For a long time, my off-site backup was at my grandmother's house because it was the furthest geographic location I could give someone a box who would leave it plugged into their Internet. ;)


> As an old school IT professional, I would recommend considering most of these companies' advice "how not to engineer reliability".

Where I work, the classic on-prem bare metal and ESX-based systems are just massively more reliable than on-prem or cloud Kubernetes, and take far fewer people to operate. Three or four 9s is easy with ESX and five is doable. Kubernetes barely makes it into one 9 and might not even manage that! Still, it employs a lot of engineers, and they probably get paid more than the crusty old ESX guys too! From that point of view you should definitely push for it.


Indeed. Not that you should, but you could leave an ESXi cluster running without maintaining it at all, and it'll probably keep going for five or ten years all on its own. It's stable by design.


I tried that once, only I forgot to position it near the ceiling. In my second year in the flat I had the unlucky experience of high groundwater levels, which flooded the basement. System down and all data lost :(


You're absolutely right, but you're being downvoted because a lot of HN readers' sense of self-worth as engineers is tied up with the idea that the cloud is the right way to do things.


I crash my car much less often than Nascar drivers. They must be terrible drivers.


I've never been in hospital so I don't need to pay health insurance anymore.


I fail to see how a $500 Intel NUC could possibly justify a team of software engineers to maintain, which means the person responsible for setting it up is not a manager, and their boss isn't someone who manages a manager.

The purpose of an employee is not to jack up the stock price, it's to make their boss look more important. Shareholders are like customers, it pays to have them think you're on their side, but your interests do not always align.



