SRE Doesn’t Scale (bravenewgeek.com)
162 points by kiyanwang 3 days ago | 110 comments





It sounds like the lesson here is that tacking on SRE to an out-of-control development process, churning out new services by the boatload, doesn't scale.

This is caused by the typical attitude software companies have towards development. The most common model for software development is simple... rush to push out new features to the market, and pay the costs later--and then everyone is balking at the costs.

The solution is kind of brilliant, IMO. You don't have to pay technical debt on projects if you shut them down and delete the code from your repository. Migrate to industry-standard solutions. Use off-the-shelf programs & libraries. Delete all your custom stuff. Replace good solutions with "good enough" solutions.

The SREs can help you with that, but they can't help with out-of-control development. As your code base gets larger, the cost of supporting that code base gets larger too. The difficulty of scaling your SRE to match development reflects your out-of-control development process, not a problem with SRE. Keep the costs under control by keeping your code base under control.


> Use off-the-shelf programs & libraries. Delete all your custom stuff. Replace good solutions with "good enough" solutions.

Be very careful following this advice. Only use an off-the-shelf tool if it is pretty close to what you need or if you can change the need to fit the tool. Trying to shoehorn processes (or whatever) into a tool they don't fit often ends up being a lot worse than a custom coded solution. I've seen this multiple times at enterprises that preferred vendor solutions as you suggest, and they were worse train wrecks than places that used more custom solutions.


The best way I've heard this put to me was this: "Your business has some special selling point where it stands out, the reason it's unique and people should buy your thing. Your business has hundreds of bog standard things it needs to do. You only have so much development time; spend it on the stuff that makes your company important and not on the stuff that already has perfectly good solutions sitting on the shelf."

Yup, this is the version that makes sense to me, with one addition:

If a component is high cost, tightly integrated with the rest of the system you're building, and difficult to design, you will need to become an expert on it just in order to evaluate suppliers and make good purchasing decisions.

Then (a) In order to become an expert on something, you need to do it yourself for a while; and (b) if you're an expert on it and have made it for a while, you might as well continue.

Thus: for high cost, tightly integrated, difficult to design parts: build, don't buy.

If a component is a major part of the value you're providing, it's sort of the same situation: you need to be an expert on it to safely buy it, and then you might as well build it indefinitely.


Very true, but I've also seen the inverse, where days were spent on configuring and troubleshooting an off-the-shelf solution (with associated technical debt around upgrades) where a small in-house ad-hoc process could have solved it much more easily.

The answer is not always obvious and being able to make the right call consistently is something I consider a sign of a great engineer.


> Your business has hundreds of bog standard things it needs to do. You only have so much development time [...]

True for startups. Not true for BigCo. Economically it's a balance between the cost of a dev customizing the cogwheel and the cost of pushing the organization toward a standard factory-made cogwheel.


Manager: We need API gateway and Service Mesh in Kubernetes for our service

Engineer: I have come up with solution that wouldn't need all these frameworks.

Manager: Good. See how soon you can integrate with API Gateway and Service Mesh.

Heard at friend's workplace.


Without more context either party could be in the wrong there.

- The developer for building a non-standard solution that wasn’t asked for and is impossible for the next person to maintain (like those who decide to roll their own database, VCS, encryption, etc)

- or the manager for not understanding the requirements of the build and then expecting developers to redo existing work rather than spending 5 minutes learning the tech stack.

Either way, this is why some orgs refuse to allow any development without a technical design. When you go into a build with a design that everyone agrees on you ensure that expensive development time is focused and less time is wasted rewriting code due to missing requirements.


> expensive development time is focused

Is development time all that expensive? Compared to time spent making different designs.


This is also something I've experienced a lot in large orgs: 10 developers plus 2-3 managers spend weeks in endless discussions and design work to avoid "costly" development.

Sure developer time is costly, but developers in meetings and design rounds are just as costly if not more so, because more people are involved.

This then gets exacerbated when after weeks of discussion and design an attempt is made to implement an MVP where it is noticed that none of the decisions made work very well for this particular use case.

You can of course face the opposite problem of not designing and implementing something wrong.

This is why I always ask to make a small test first. Say you have 5 possible rendering backends for your graphical output and you don't know which one to use. Spend some developer time building a miniature test to see if the features are there for each one. Often, as soon as you start using them, you can rule one or two out within a few minutes (the installation breaks). The rest you can sort out in a few hours by modifying the examples on their web pages.

This often leads to only 1 or 2 actual possibilities, focusing the ensuing design work on what matters.


Who says the developer was building a non-standard solution? I read it as they would use one tool instead of the few that the manager suggested.

Question is, was it use AN api gateway or use THIS api gateway.

It was use THIS api gateway.

Mainly it was an exercise in moving from technical work to year-long vendor negotiation and setup. The OTHER API gateway blew up despite being used at Netflix. But THIS one is NextGen® so obviously much better.


The key in using off-the-shelf software, especially if you're not a capital-E Enterprise that can dictate new features, is to be willing to change your business processes so that they match the tools you're buying. Part of the purchasing process for SMBs should be (but almost never is), evaluating current processes, and which parts of those processes are arbitrary and/or can be changed, then finding COTS software that fits all the must-have/can't-change processes. Then you adjust the last few remaining things to fit the software.

That works for the first tool you buy.

But invariably you continue to follow the advice and buy a 2nd. And it won't integrate well with the first. By the time you get to the 10th system that doesn't want to talk to the others, you're employing dozens of people to do the data entry between them.

And these proprietary systems almost _never_ want to talk to the others. There is a reason Outlook doesn't talk to Google Calendar, and Google doesn't talk to Outlook's calendar, yet somehow Thunderbird (open source) manages to talk to both. And I can't count the number of times Microsoft has managed to break their IMAP interfaces.

I've seen this dark pattern in so many large businesses now, I've come to believe it's the norm.


My O365 displays my Google calendars just fine. My iOS and macOS calendar apps display both just fine.

I use a handful of Google calendars, but almost never use the Google calendar UI directly and interact almost exclusively via the Apple or Microsoft UIs to the Google store of data. (As in I might use the Google UI directly perhaps once a year, and that's generally only to get the access/path to the iCal interoperability URL.)


I worked at one place that had the worst of both worlds: SAP Hybris, customized to the nth degree with rushed features.

It was pretty awful and had significant, material negative impacts on revenue due to all the operational issues.


> worst of both worlds: SAP Hybris, customized to the nth degree

I've observed that nearly every customizable "off-the-shelf" solution will eventually converge to this end-state: dozens or hundreds of ad-hoc modifications to support an entrenched business process that incidentally makes upgrades nearly impossible. Non-customizable off-the-shelf software is rarely any better: all of the needed (or perceived needed) points of customization are managed outside of the software but rely on undocumented conventions.


See, basically any SAP integration.

Yep. SRE is not a substitute for high level, overarching architects and designers.

One pattern I see is that, as the company grows the development gets split into different product groups which will organically diverge unless there is rigid enforcement of design patterns. In some places, SRE does this implicitly because they will only support X, Y, or Z but in others each product group will have their own group of SREs.

There becomes a point when you need one or a small group of people who are the opinionated developers who can make design decisions and who have the authority to cause everyone else to course correct. If you don't have this, you'll wind up with long migrations and legacy stuff that never seems to go away.


I don't find having high-level architects to be a good pattern. They can make mistakes like anyone else; indeed having people who are no longer day-to-day coding make decisions that they don't feel the effects of makes wrong decisions more likely.

SRE exists to support product functions and like everything else should be attached to and understood in terms of those functions. Yes, every product group probably should have their own SREs, so that product group can own its whole lifecycle. Yes, different groups will do their own things and there will be mismatches and duplicated effort. That's less bad than the alternative.


I'm not saying that they should not be active developers, but people who can enforce change across the entire organization.

Previous Job (2500+ devs by the time I left) had an in-house RPC system that was being moved over to gRPC. That project was taking years because teams had no coordination on this process. The decision was made at some level and trickled out to everyone else. There was no single person or group who was in charge of:

- How services would be discovered
- Implementation Patterns of how Services & Methods will be defined
- Standardization of which libraries to use
- Examples and Pre-build shared libraries that provide the stuff like tracing, monitoring, retries, etc...
- Advocating for the changes
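To make the "shared libraries that provide retries" item concrete, here is a minimal sketch of the kind of helper a central group might publish so that every team retries remote calls the same way. Everything in it is an assumption for illustration: `with_retries`, `flaky_rpc`, the backoff values, and the choice of `ConnectionError` are hypothetical and are not that company's actual library.

```python
import time
from functools import wraps


def with_retries(max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry a call with exponential backoff (hypothetical shared helper)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the error to the caller
                    # Wait base_delay, then 2x, 4x, ... before the next try.
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"count": 0}


@with_retries(max_attempts=3, base_delay=0.0)
def flaky_rpc():
    """Stand-in for a remote call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(flaky_rpc())
```

The point of shipping this once is less the code than the agreement: one retry policy, owned by someone, instead of one per team.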

SRE seems to fall into the position of advocating the business value of development practices, which compete with business objectives that can provide value as well. Large organizations need a central point that can set development objectives, that teams can go to with "this pattern doesn't work for us, we can do this but we need buy in from other teams" issues, and that can hand directives down.

Unless you operate in an environment where the only cross-team communication is well versioned public APIs, you will run into issues where you have conflicting needs between teams and need someone to set a vision (this can be a group of people, rotating people, or a single person; how is not the issue).


The whole idea of enforcing technical mandates across the entire organisation is something I'm very sceptical about. No-one can hope to understand the constraints and requirements that 2500+ other devs are working under. Realistically the cross-team bandwidth is low, so if you don't have well versioned public APIs then you have barely understood interactions and no clear responsibility when they break.

There are probably some things that do need to be standardised, but if there's a business need for standardisation then product teams should be able to understand and advocate for that (whether that means agreeing something with their directly adjacent product team, publishing something for clients to use, or something else). But in a lot of cases I think just accepting that different parts of the organization will work differently is the best way forward.


We recently decided as a company that the horizontal responsibilities structure doesn't work well at all, at least not at small scale. This was not in the software/infra teams but in our operations but I think there's some general truth here. The more vertically responsible your teams are, the better the final product is, and the more inefficiencies and impedance mismatches you can track down and fix.

For us it meant that the data processing teams have been made part of the drone operators team, so whenever we fly a mission a photography/3d rendering expert will also be part of the team that operates the drone. On paper it's more expensive to have office workers in the field, but in practice it leads to fewer reflights and happier and more productive employees.

I imagine that for the software departments, it could mean that every app development team has at least one member that has good operating system and network infrastructure knowledge, and/or maybe database expertise, so that the team as a whole can largely operate a feature without having to depend on an outside SRE specialist.

And then the SREs that you do have can focus on the site reliability, instead of having to constantly tell developers how the way they coded something is bad or whatever.


When you get very large you need both enterprise wide SREs that are responsible for consulting and approving architecture for reliability purposes AND localized per service or other small breakdown SREs to support small sets of services. How you break this down is tricky and there is definitely overlap.

Crappy attitudes towards support and reliability don’t scale; what you can get away with when a few people are keeping things barely held together stops working as you grow.


> There becomes a point when you need one or a small group of people who are the opinionated developers who can make design decisions and who have the authority to cause everyone else to course correct.

How many times have you been in the "everyone else" camp and been course corrected? How did those efforts work out for the firm in the long run? Any experiences to share that could be useful for the community?


Previous Job had an in-house RPC system and there was a desire to move to gRPC.

This process largely turned into "name and shame" because there was no incentive for some teams to put in the effort to make the changes. They had other objectives to complete, and swapping RPC frameworks was not one of them. The only way the change happened was putting a hard deadline before the old system was shut down (by SRE), which is not the right way to do it.

There were a lot of stories like this. One team owned user information, but the business needs shifted this ownership to another team. This resulted in the ownership being split between three teams, and applications turning into transparent proxies for other applications. One service was a REST interface that provided a bit of logic on top of forwarding the request to a gRPC service.

The makeup of the company was a bunch of loosely coupled product teams, and the only common connection was SRE and Data, who worked with everyone. SRE became the team that had to work to resolve these "what the fuck, why can't you just sit down and figure out who should own this" issues. There really needed to be an architect or someone who could look at the big picture and say "Why do we have this one internal REST service? Ok. Team A and B. You have this quarter to stop using Service Q and migrate to Service W."

At $NewCompany, SRE is doing the course correcting (just due to small size), but we have a Principal Developer who is dictating that "Yes, we're going to implement new business logic in lambdas following this pattern." And they work to make sure that everything is done in a standard way, but at the same time take ideas for new patterns and make sure they are done in a smart way and don't conflict with anything else. He doesn't stand as a roadblock, but as someone who can make sure teams are not going off and doing weird things (like using MySQL when we're a Postgres shop) that can cause issues later.


It must be possible for SREs to refuse a service that doesn't meet certain criteria.

AFAIK this doesn't happen anywhere. Although I believe G has something like this.


To make this possible you need either an entirely separate chain of command for SRE, like what Google has, or very influential SREs. Either is exceedingly rare, so I'm not surprised OP thinks SRE doesn’t scale.

Google SREs definitely have that stick.

> I believe G has something like this

It does. And until the service meets the criteria, SWEs on the project are actually partially playing the SRE roles under the guidance of the SRE org.


Is SRE at Google just a maintenance team? What do they do then?

Effectively yes. The main things SRE provides are oncall support, production focused design consulting and integration with other infrastructure. In practice, the engagement almost always provides oncall support, and the rest depends on how mature the SRE team is.

In a typical split, SWEs often do the dev work for features and large reliability/scalability changes (which SRE helps appropriately prioritise), whereas the SRE team maintains the software around the project (config pipelines, monitoring etc.) and might occasionally write some smaller reliability/scalability modifications.

But there can be lots of variance. It’s atypical but some of the infrastructure-focused SRE teams often maintain non-trivial software, but are part of SRE because of other responsibilities.


Google wrote a book about it. It's free to read. https://sre.google/books/

SRE is the first-responder team. They are on-call 24/7 (the team, not each person), perform systems and service monitoring, triage failures and mitigate outages.

That doesn't mean it's all handwork, I'm sure SREs at Google employ a boatload of automated event handling and custom response scripts. But "keeping the service up" requires different skills than "building a service", and Google chose to separate their Dev and Ops this way. As others said in this thread, if some service isn't up to SRE standards (in terms of monitoring, logging, or robustness), the SRE team won't accept it and Devs would have to do their own Ops.


Funny, I read it differently. They talk about frameworks, libraries, and best practices.

Effectively, they're talking about standardization across your teams/services so they don't fuck things up. Essentially, you're taking away some of the purported freedoms of microservices (complete independence - eg I can write this service in brainfuck if I want!) and reining it in a bit so you don't build a pile of trash.


I think of that kind of standardization kind of like deleting code. Stuff like, "We are deprecating support for Python in SRE, no new projects may be shipped in Python."

Now all your trash sorta looks the same.

Anecdatum time: a friend once worked at a company that had separate teams building small services in a small variety of common languages (mostly python and golang). One individual decided that a particular new service that was going to be really critical just had to be written in erlang, and went ahead and did it.

Fast forward to a few years later, when my friend started there. The erlang dev was long gone, and nobody knew erlang or OTP or anything about the service well enough to take ownership of it. My friend was constantly awoken during his on-call weeks because the damned thing kept tipping itself over now that the company had grown. He couldn't maintain it other than restarting and praying, and couldn't rewrite it; the company had bigger priorities.

That was the shortest lived job of his career so far.


This is making me realise how strong the analogy between Conway's Law and Coase's Theory of the Firm is.

(recap: https://en.wikipedia.org/wiki/Theory_of_the_firm especially "Why is the boundary between firms and the market located exactly there with relation to size and output variety? Which transactions are performed internally and which are negotiated on the market?")

The answer involves looking at informational and coordination costs. Microservices are extremely Coasian: rather than have a part of the software/company under direct, traditional command-and-control management, it's kept at arms length and interacted with via a software contract.

Using microservices theoretically reduces the coordination costs between parts of the team, but in practice risks just shoveling them around. After all, the easiest way to reduce a cost is to dump the externality on to someone else. If you were deploying a monolith, it would obviously have only one set of SRE practices; with microservices, you end up duplicating those and introducing costly variation.


My read on the article was that much more was related to each team being on their own to set up & drive their pipelines, operate their own services, and there being a lack of commonality/shared experience.

A vast number of software engineers hardly get the ops (running software) stuff at all, and half of them can sort of play along and hack stuff into place. The engineers on product teams who do know how to do things, meanwhile, don't get all the constraints, best practices, and ideas that the various DevOps folk have worked out, and have their own wants/desires/expected ways of doing things, so they end up creating their own very unique sub-ways of doing things within the org. None of these practices converge on regularity or consistency with whatever DevOps machinery ends up being built.

What we do have often is just a random pile of containers and scripts that a couple people sort of know decently & everyone else suffers through & survives within. Almost never does it look like any other company's devops kitchen.

SRE doesn't scale because it's an every now and then thing, and few people notice or care about the difference between a well-built corporate citizen that runs well & is monitored & operated according to whatever the in-power SRE cabal wants. People start to care only if things are going bad, either via services not building/integrating/deploying/running as well as they should, or from too much confoundedness/general head scratching by either the SRE or regular engineers. SRE is not a priority, it's not practiced regularly, it's only an every-now-and-then thing, so we don't have the chance to get good, to institutionalize the right ways of doing things. That's what the article is discussing. Not the rest of the everyday normal software development rushing-bedlam you describe.


> SRE doesn't scale because it's an every now and then thing, ...

That's the part that doesn't scale... tacking on SRE at the end, or doing it every now and then. The reason people don't care about the software being a "well-built corporate citizen" is because they care more about shipping features. If you have an SRE team that will say "no" to you when you try to ship new stuff, you'll eventually figure out a way to build new things in a way that the SRE team will say "yes". When I say "no", that could be a hard pushback like "no, that's not getting shipped" or it could be an answer like, "no, the SRE team will not support that, yet."

These kinds of decisions need to be made at a high level, because everyone in the institution is typically operating with the wrong incentives. That's why you end up with a random pile of containers and scripts. It doesn't have to end up that way, even when you have microservices.

> That's what the article is discussing. Not the rest of the everyday normal software development rushing-bedlam you describe.

I disagree with the article, so necessarily there are going to be differences between what I'm saying and what the article is saying.


Everything in your second paragraph points to companies that either through hiring, firing, or attrition have successfully excised all real systems expertise from the entire company. This isn't the fault of any ops methodology, so let's maybe call it "shrug-ops."

> few people notice or care about the difference between a well-built corporate citizen that runs well & is monitored & operated according to whatever the in-power SRE cabal wants

QED


The thing that made Google SRE scale was in the fine print and covered in this article. SRE is a premium service, and in order to get the premium service you had to meet some criteria to show you weren't about to turn highly paid, highly knowledgeable folks into professional firefighters.

During my time as an SRE at multiple organizations, what I witnessed was executives that owned Operations teams renaming the team, paying for a Python course, and calling it "change". Operations Engineers have a culture unto themselves that will not be broken in a year's time: a culture of firefighting and heroism as opposed to proactivity and dull standards. They're rarely, if ever, programmers, which was a prerequisite for an SRE-SE or SRE-SWE. You don't need to be a master in logic to know that if you dress your old pig up with lipstick and a new title, it's still a pig.

I agree with the article: read the SRE book and understand why they did what they did, and if you hire SRE-SE or SRE-SWE types, don't colocate them with Operations types or make them a 1:1 map with your dev teams that haven't earned them.


I heard about that; amongst other things, if the number of times SRE gets pinged because of a software fault exceeds some threshold X, they basically give the pager back to the team that built the software. They aren't taking responsibility for reliability if the software is not reliable.
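As a rough sketch of that hand-back rule: the threshold value, the review window, and every name below (`PagerPolicy`, `pager_owner`, the cause labels) are assumptions made up for illustration, not Google's actual policy.

```python
from dataclasses import dataclass


@dataclass
class PagerPolicy:
    """Hypothetical hand-back rule; the threshold is an assumed value."""
    max_software_pages: int  # the "X" above; the real number is not known here

    def pager_owner(self, page_causes):
        """Return who should carry the pager for the next period.

        page_causes holds one label per page in the review window,
        e.g. "software_fault" or "infrastructure".
        """
        software_pages = sum(1 for cause in page_causes if cause == "software_fault")
        # Only pages caused by the software itself count against the dev team.
        if software_pages > self.max_software_pages:
            return "dev_team"
        return "sre"


policy = PagerPolicy(max_software_pages=2)
quiet_window = ["infrastructure", "software_fault"]
noisy_window = ["software_fault", "software_fault", "software_fault", "infrastructure"]
print(policy.pager_owner(quiet_window))
print(policy.pager_owner(noisy_window))
```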

I really appreciated working with a team of SRE types, it gave me a newfound appreciation for quality in software development. I remember one instance where a colleague in my team went up to the 'ops' team and wanted SSH access to a production server to check on some settings (environment variables, I believe). He thought it was a trivial thing, you know, "just lemme have a peek" kinda thing, but the ops team flat-out told him no, if he needs to print env vars, he can do it in his own code - we did continuous releases, a patch could be deployed within minutes if passing all the checks. I loved that the ops team had the mandate from higher up to say no to requests like this, and I found them a lot more professional than the software developers, including myself.
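For what "print env vars in his own code" could look like, here is a minimal sketch. The variable names and the list of sensitive markers are hypothetical; the masking step is my own assumption about what an ops team would want before values land in logs.

```python
# Assumed name fragments that mark a variable as sensitive; adjust for your setup.
SENSITIVE_MARKERS = ("SECRET", "TOKEN", "PASSWORD", "KEY")


def describe_environment(environ, names):
    """Return printable lines for the requested variables, masking secrets."""
    lines = []
    for name in names:
        if name not in environ:
            lines.append(f"{name}=<unset>")
        elif any(marker in name.upper() for marker in SENSITIVE_MARKERS):
            lines.append(f"{name}=<redacted>")
        else:
            lines.append(f"{name}={environ[name]}")
    return lines


# A plain dict stands in for the process environment here;
# in a real service you would pass os.environ instead.
sample = {"APP_MODE": "production", "API_TOKEN": "abc123"}
for line in describe_environment(sample, ["APP_MODE", "API_TOKEN", "LOG_LEVEL"]):
    print(line)
```

Shipped through the normal release pipeline, this gives the developer the answer without anyone needing a shell on the production host.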


Ideally what that developer was wanting should be a function of the platform. That's likely not in an operations team's scope because they're mostly just fire fighters. Also, ideally, in a situation like SOX where devs can't have access to production, their preproduction environments share the same interface that production has, and if their values differ, that much is documented.

> the ops team flat-out told him no, if he needs to print env vars, he can do it in his own code

This is still the wrong attitude from a "productization" of infrastructure perspective. Configuring the environment a program runs in is a core responsibility of infrastructure. So is being able to query it. Build that infrastructure product feature.

Or select a platform that makes this trivial, e.g. Kubernetes.


It actually sounds more like what I'd expect: microservices don't scale.

I've never been a fan of microservices for this simple reason: every boundary, every interface, every connection is friction and a potential source of bugs, performance issues and problems.

It's now been several years since I worked at Google so I can't speak to the current state of things but when I was there, monoliths ruled the roost.

It seems possible there was some experimentation with microservices given the rise of Docker and Kubernetes in the last few years. I'd be surprised if this made serious inroads into high-traffic core services.

It's also worth mentioning that generally Google has (had?) SRE teams rather than SREs embedded on SWE teams. You'd never have one SRE supporting a service as there'd have to be enough to maintain a healthy oncall rotation (typically 8-12).

I can't help but feel like the author is projecting his own experiences onto what Google has said here.


Microservices are a cargo cult thing. They try to solve a problem that 99% of the companies applying them wish they had.

I previously worked in a BigCo that did microservices and currently work at BigCo that leans towards monoliths.

It's completely true that microservices make development _very_ hard. Just being able to run things locally was very hard since instead of understanding how to build/run one service, you had to figure out how to build/run N services. Testing was another major headache.

Monoliths help with technical aspects but organizationally they make your life hell and can severely limit your ability to move fast.


To expand on the brief title

> Google [...] says the SRE model [...] does not scale with microservices. Instead, they go on to describe a more tractable, framework-oriented model to address this through things like codified best practices, reusable solutions, standardization of tools and patterns

Basically anyone planning on microservices should define and monitor bounds on the frameworks, tools and diversity of design patterns in use. Good advice at any scale.


Isn't this more of a comment about microservices than it is about SRE? It reads to me like "once you hit a number of microservices it ends up looking like a monolith":

http://highscalability.com/blog/2020/4/8/one-team-at-uber-is...


It's always microservices, whatever you build. You just decide what kind of IPC you use between some of the pieces.

Forgive me but aren't 'macroservices' just... services? I don't see the difference.

Unless it's a gigaservice

Dumb question time, but what exactly makes something a microservice?

Does separating a specific functionality from a wider array of functions into its own VM make it a microservice?

When does something stop being a microservice, I guess?


I remember asking a candidate whether they were doing microservices at her current job.

She answered "I don't know if we have microservices, but we do have services that don't do much"

Since then, that's been my definition of a microservice :)


Productized class methods.

A microservice is deployed independently, has its own hardware and can scale independently from the rest of the system.

If there are multiple services and they all share hardware, they're not really microservices.


> A microservice is deployed independently, has its own hardware and can scale independently from the rest of the system.

Nicely articulated

> If there are multiple services and they all share hardware, they're not really microservices.

You just described most microservice deployments on Kubernetes ;)


In this case, they run in containers, so they can actually scale independently. For this purpose, it's as if they were running on independent hardware.

Even if we had separate cloud servers - such as EC2 on AWS - for each service, technically they are not really independent hardware.

But for the purposes of deploying, hardware capacity allocation and scaling, you can see them as such. The same goes for containers orchestrated by K8s.


Don't we have a term for that — software coupling?

A microservice architecture has weak coupling.


it's coupling via network calls instead of in-memory calls

I think it's the same line of thinking as RISC and CISC (reduced/complex instruction set computer). There's no such thing as a RISC machine or a CISC machine, all computer ISAs live on a spectrum that can be described as RISC-like or CISC-like.

There's no such thing as a micro service or a macro service (in terms of there being a perfect delineation). They're simply terms to describe certain design goals. Any given service might have several qualities that can be described as befitting a macro or a micro service.

It's all too easy to fall into the no true scotsman trap.


I like to think of it more of a separation of components needed to do the job.

Need a database? Run the database as its own service and don't couple it to the app, so if you need to scale the app horizontally you can add more replicas. If you need to scale the database vertically, you can scale the database vertically. If these are decoupled, you scale only what's needed.

On top of that, use the language or tool that is meant for the job. If you need something written in Go, build your service with Go; if you need Node.js, do it with Node, etc.

Microservices don't need to be micro or small


My definition is separation of infrastructure and deployment cycles. Everything that always ships in one deployment is one service, and stuff that's part of the same code base is definitely not a different service.

It stops being a microservice when a developer starts saying, "oh! We can do X in service Y too! It already does ${similar work} and reads/writes from/to ${data source}, so why not?"

The intended model is to do one thing, thus enabling surgical changes to functionality without having to rebuild everything. As long as you stick to your API contracts, you can muck around with the internals without affecting anything else.
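
A minimal sketch of that idea in Python, with made-up names (`PriceService`, `TableLookupPrices`, `ComputedPrices`, and `checkout_total` are all hypothetical): the caller depends only on the contract, so the internals behind it can be rewritten without the caller noticing.

```python
from typing import Protocol


class PriceService(Protocol):
    """The contract: callers rely on this signature and nothing else."""

    def price_cents(self, sku: str, quantity: int) -> int: ...


class TableLookupPrices:
    """First implementation: a hard-coded lookup table."""

    _unit_cents = {"widget": 250, "gadget": 1000}

    def price_cents(self, sku: str, quantity: int) -> int:
        return self._unit_cents[sku] * quantity


class ComputedPrices:
    """Rewritten internals: same contract, different mechanics."""

    def price_cents(self, sku: str, quantity: int) -> int:
        base = {"widget": 2.5, "gadget": 10.0}[sku]
        return round(base * 100) * quantity


def checkout_total(prices: PriceService, cart: dict[str, int]) -> int:
    """A caller that only knows about the contract."""
    return sum(prices.price_cents(sku, qty) for sku, qty in cart.items())


cart = {"widget": 2, "gadget": 1}
# Both implementations honor the contract, so the caller sees no difference.
assert checkout_total(TableLookupPrices(), cart) == checkout_total(ComputedPrices(), cart)
```

In a real microservice the contract would be an HTTP or RPC schema rather than a Python `Protocol`, but the property being protected is the same.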


That's not really answering the question. How do you define what the "one thing" that the service handles actually is?

I have this half-serious opinion that interface design is the hard (technical) part of software development, and the rest is mostly easy.


I remember coming to the same realization when the Google v. Oracle court case was going on, which decided that APIs are not copyrightable. To me it felt like the wrong decision, but what do I know.

I would definitely argue that public interfaces should not be copyrightable, since you need to be able to freely reimplement them to interoperate, which in my opinion should be preferred over granting even temporary monopolies to software interfaces.

However, public interfaces are only a small part of software interfaces in total. It's getting the system's internal interfaces right that takes most of the work.


I totally get it from the sense of avoiding monopolies for sure.

In my view, I was lumping internal and public interfaces into the same basket. I hadn't really considered the public vs. internal aspect of this.


Copyright in interfaces entrenches monopoly.

that sounds like a corollary to Conway's law.

That's been called "creeping featuritis" for about 25 years now. :)

I would say it's best to think of any deployable thing simply as a service. "Microservices" is just the idea that those things do not have a minimum size.

Essentially, "is this enough functionality to make a new service?" is the wrong question.


> dumb question time but what exactly makes something a micro service.

This leftpad as a service, over HTTPS


I had to google that to understand the context.

What a wild story


“Hiring experienced, qualified SREs is difficult and costly. Despite enormous effort from the recruiting organization, there are never enough SREs to support all the services that need their expertise.”

Uh huh. Maybe what isn't scaling is their onerous recruiting filters.


Listen, if you can't write a working text justification algorithm in 45 minutes a decade after college, you have no business doing completely unrelated SRE stuff like monitoring services or scaling databases. Everybody knows leetcode is the key to software. If you can't do that, maybe you just don't have the right IQ to build the next chat app @ Google.

This but unironically. If you can't do a simple programming task (we can argue over what text justification is, because "left-pad" could qualify, as could a full TrueType renderer), you don't belong in SRE. Much of the point is to have cross-functional people who can interface with developers on their level and do the math and grunt work around operating a database cluster.

I once met a man who was intimately familiar with the details of the Linux kernel and how the new chiplet architecture in AMD processors resembles a NUMA architecture and thus impacts VM performance. He was well versed in shell scripting, k8s, docker, the principles of observability, and infrastructure as code. He could explain the difference between READ COMMITTED and REPEATABLE READ or LSTMs or distributed consistency models off the top of his head. He didn't have a CS degree so obviously he wasn't as intelligent as me, but even so I found him a little intimidating for some reason.

But then I asked him -

"Given an array of positive integers target and an array initial of same size with all zeros.

Return the minimum number of operations to form a target array from initial if you are allowed to do the following operation:

    Choose any subarray from initial and increment each value by one.
"

He was stumped. As I had suspected, he wasn't quite up to the job of an SRE. I immediately failed him and returned to editing my networking.yaml file. Someone has to maintain the bar around here..
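
For what it's worth, the quoted puzzle has a well-known greedy answer: whenever a value rises above its left neighbor, the rise needs that many new operations, while a drop or a plateau is covered by operations already started further left. A sketch in Python (the function name `min_operations` is mine, not part of the problem statement):

```python
def min_operations(target: list[int]) -> int:
    """Minimum number of "increment a subarray by one" operations
    needed to turn an all-zero array into `target`."""
    operations = 0
    previous = 0  # value to the left of the current position (0 before the start)
    for value in target:
        # Only the part of `value` that rises above its left neighbor
        # needs operations that were not already started further left.
        if value > previous:
            operations += value - previous
        previous = value
    return operations


print(min_operations([1, 2, 3, 2, 1]))  # 3
print(min_operations([3, 1, 1, 2]))     # 4
```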


I am emphatically not saying you shouldn't tailor your technical questions to your audience. Commits to the kernel can be prima facie evidence that someone can code. A good shell script can be the same.

I have no degree. I don't write code for a living. I'm actually almost precisely the opposite of the stereotype of the stuck up software engineer, aside from being a white dude in my early thirties.

What I am saying is: I have met many people who can configure a database, but cannot talk to developers as equals about how to write code about it, and there's really an interesting split of reasons why that doesn't work. The fact remains that it doesn't. You can build operations teams for some tasks, and you should - you can't run your own large hardware or compute or database clusters without having some entirely operations focused people. You should not ignore their opinions or pain points. You shouldn't call them SRE, though. The point of SRE is to build systems and run them, and you want them to have feet on either side of that line.


I understand and agree. I'm sorry, I don't mean to attack you personally; it's more of a satire about FAANG interview processes. My first comment was responding to the quote about difficulties hiring SREs at Google and their interview process in particular (the quote in the parent comment came from Google). It wasn't intended to be read as a position one way or the other on any and all coding questions or interviews. I don't think there are any simple black and white rules there. The text justification question is listed somewhere on leetcode as being commonly asked by Google and is a bit more involved than a simple leftpad or "do you know how to code" sort of question.

Apropos of nothing, I went and did that leetcode problem - Getting the algorithm took me about 10 minutes, figuring out how I'd messed up my fenceposting took another 25, and figuring out what they wanted for the last line took me about an hour. Ugh.
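
For anyone curious, here is a sketch of the usual greedy approach in Python. It assumes the LeetCode-style rules as I understand them: pack as many words per line as fit, give leftover spaces to the leftmost gaps first, and left-justify the last line and any single-word line. It also assumes a non-empty word list with no word longer than the width. The names `justify` and `_pad_line` are mine.

```python
def justify(words: list[str], width: int) -> list[str]:
    """Fully justify `words` into lines of exactly `width` characters."""
    lines: list[str] = []
    current: list[str] = []  # words on the line being built
    letters = 0              # total characters in those words, spaces excluded

    for word in words:
        # Fencepost: after adding this word the line would hold
        # len(current) + 1 words, which need at least len(current) spaces.
        if current and letters + len(word) + len(current) > width:
            lines.append(_pad_line(current, letters, width))
            current, letters = [], 0
        current.append(word)
        letters += len(word)

    # The last line is left-justified and padded on the right.
    lines.append(" ".join(current).ljust(width))
    return lines


def _pad_line(current: list[str], letters: int, width: int) -> str:
    """Spread the spare spaces of one full line across its gaps."""
    gaps = len(current) - 1
    if gaps == 0:
        # A lone word cannot be stretched, so pad on the right.
        return current[0].ljust(width)
    spaces, extra = divmod(width - letters, gaps)
    parts = []
    for i, word in enumerate(current[:-1]):
        # The leftmost `extra` gaps each get one additional space.
        parts.append(word + " " * (spaces + (1 if i < extra else 0)))
    parts.append(current[-1])
    return "".join(parts)


for line in justify(["This", "is", "an", "example", "of", "text", "justification."], 16):
    print(repr(line))
```

The two spots mentioned above are exactly where the comments sit: the gap count in the line-break test, and the separate treatment of the final line.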

At what point was he stumped in your fictional scenario? When you asked him to prove that the number is indeed minimal? That's trivial too, but it may be an unreasonable thing to ask during an SRE interview.

Hey! That was in my interview with a FAANG company. How did you know?

I'm sorry that's impossible. As you know, all candidates sign NDAs and FAANG tier developers have only the highest standards so they would never reuse a question - they would just make a new one.

> He didn't have a CS degree so obviously he wasn't as intelligent as me

Is that supposed to be a joke?


The whole post is sarcasm, and a joke.

Why were you intimidated by him, and why is it obvious you were more intelligent than him?

This is absurd. A lot of companies do interviews that make sense and are not focused on some obscure bitshift puzzles.

You need to be a strong developer to work efficiently as an SRE.


Pretty sure Google hasn't used 'bitshift puzzle' interview questions for ~10y now. It really is just a meme thing from long ago. My interview there ~6y ago was basically the same as ones I've done at a handful of other companies

It's good to be an SRE. In my experience, there is high demand and low supply, and schools don't really teach for SRE roles, so the pipeline stays small.

Yes.

And then there are the unrealistic demands from companies that didn't even understand what DevOps was supposed to be, and are misunderstanding SRE even more. Many companies still treat this the same way they did "ops".


A different element of this is that software professionals are often paid so well that a lot of people can take early retirement relatively easily if they're good about saving and investing. And the FIRE movement is only growing.

I sometimes wonder if the software world has relatively greater 'leakage' via early retirement than other fields, creating a constant problem of not enough highly experienced people who remain stuck as wage slaves throughout a 40 year career.


Hiring SREs is costly...

Hiring narrowly focused people with significant pre-existing experience is always costly.

... and I also think, it is practically always a wrong strategy...

I am fundamentally against micro-managing labor pools by federal government.

However, there need to be economic incentives (including immigration policies) that make it more difficult for employers to hire for 'tool-centric' positions, unless those tools are very expensive physical devices (e.g. telescopes, quantum computers, etc.).

The incentives need to direct employers at training on the job (not at after-work online classes).

When I see on HN's hire threads -- 'if you have Azure experience -- you will get on top of the pile' -- I cringe.

This is absolutely ridiculous. It's pretty much the same with the Ruby/PHP, Rust/C++, and Haskell/Scala versions of the same problem.

Yes, hire people who understand process/idioms/patterns, but invest in the f..ing training. If you need somebody 'yesterday' to help, get consultants, have them help, and at the same time hire for full-time roles and train your internal staff (possibly even have them learn from the consultants).

Afraid of people leaving after acquiring the highly sought-after skills?

-- implement meaningful compensation/retention policies that reflect effort that you spend on training, and risks that you are taking if a person who was just trained -- leaves.


What you are saying is generalists are cheaper to hire and that you can train them. That's true.

That's why, as a seller (worker), you should try to specialize in at least one area to sell yourself better.


Speaking from experience, it sells better to bring along special soft skills in the package, than yet another technology stack.

Too many coders, generalists or not, cannot engage properly when talking to something that isn't a computer.

Those who manage to merge coding skills with UI/UX, marketing, and an understanding of the customer's point of view already have the upper hand.


Yes, you need soft skills too. But that doesn't mean you shouldn't specialize in something. That doesn't have to be a tech stack either, in fact it's the last thing I would suggest.

That's part of it.

I am also saying that there needs to be a system of incentives that, in turn, produces organizational and hiring practices favoring a non-tool-specific employment process.


Well, for that your company has to admit it doesn't have specific problems. Because if you do have specific problems, it's probably more efficient to hire people that can handle those.

At some point, it feels like every company is basically reimplementing a kind of internal Heroku. I feel like there’s a billion dollar business in putting that on rails for medium to large enterprises.

There are many products that aim to fill this niche. I worked on one for years (Cloud Foundry). There is a lot of money to be made, but it's surprisingly stony ground.

Oblig. self-promoted thread: https://twitter.com/jacques_chester/status/14357388253476454...


> Google enforces standards and opinions around things like programming languages, instrumentation and metrics, logging, and control systems surrounding traffic and load management.

I think the author read this as more of a problem than a solution. This concept is supported by the DevOps model too. Your infrastructure is just as much a part of your product and the teams providing the infrastructure just as responsible for the service levels and API contracts as any customer facing product team.


> this person typically just becomes the “whipping boy” for the team.

That sounds like a culture problem to me.

SRE is just a fancy name for a class of people that make stuff more reliable. However it appears the reader of this book forgot about the other part: making stuff easy to use.

Everywhere I have worked in the last 10 years has had an embedded sysadmin/devop/SRE/PE of some sort. They are there to take the raw, difficult to use tools and adapt them so that developers are able to deploy without hand holding.

This means evangelising metrics, showing how to manage alerting, coordinating standards and leading incident responses.

The symptoms the writer describes are of a dysfunctional dev environment. Ideally your SRE should be able to rotate teams quickly. That means standards. Once a team is set up and running, they shouldn't really need ongoing SRE functions, unless they are making massive changes.

This can only happen if developers are re-using tools, rather than making new wheels out of space age materials and getting bored and moving onto new things.


To me SRE + Cloud means SRE is responsible for the cloud foundation/platform, but the software teams should be DevOps teams, and they are responsible for the (micro)services they create. SRE helps out by providing the tools, libs, dashboards, general principles, etc.

So SRE is responsible for the health of Kubernetes, networking, etc., and for providing libs, support, and infra, but the DevOps teams are responsible for the health of the services/software running on the cloud platform.

A DevOps team should not consist of developers plus one ops person; the team as a whole is responsible for the stuff they create. They are on call just like SRE is, and if something goes down, SRE + DevOps work together to fix it.


> And that move to microservices—in combination with cloud—unleashes a whole new level of autonomy and empowerment for developers who, often coming from a more restrictive ops-controlled environment on prem, introduce all sorts of new programming languages, compute platforms, databases, and other technologies.

You need standards; without them, SRE is pointless. Everything needs a standard method of monitoring. For example, stick to Java/Spring Boot, MariaDB and K8S. That will generally cover 85% of your use cases.

The automation and advantage of SRE is derived through standards and familiarity with the tool chain.
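
As a rough illustration of what enforcing such a standard could look like, here is a small Python sketch that checks a service's declared stack against an approved list. Everything in it is hypothetical: the `ServiceManifest` shape, the `APPROVED` table, and the `violations` helper are made up for this example, and a real setup would more likely read these values from repository metadata or CI.

```python
from dataclasses import dataclass

# Hypothetical approved stack, borrowing the example from the comment above.
APPROVED = {
    "language": {"java"},
    "framework": {"spring-boot"},
    "database": {"mariadb"},
    "platform": {"kubernetes"},
}


@dataclass(frozen=True)
class ServiceManifest:
    """Made-up description of what a service is built on."""
    name: str
    language: str
    framework: str
    database: str
    platform: str


def violations(manifest: ServiceManifest) -> list[str]:
    """Return one message per field that falls outside the approved stack."""
    problems = []
    for field, allowed in APPROVED.items():
        value = getattr(manifest, field)
        if value not in allowed:
            problems.append(
                f"{manifest.name}: {field} '{value}' is not in {sorted(allowed)}"
            )
    return problems


billing = ServiceManifest("billing", "java", "spring-boot", "mariadb", "kubernetes")
search = ServiceManifest("search", "rust", "actix", "mariadb", "kubernetes")

print(violations(billing))  # []
for problem in violations(search):
    print(problem)
```

A check like this only reports drift; whether to block a deploy on it or just surface it on a dashboard is a policy choice.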


I kind of disagree on the Cambrian explosion of languages being a bad thing. While it has downsides, it has a massive upside of discouraging stuff like crazy custom build pipelines. Big companies love to burn money and time by reinventing everything. Bigco isn't going to try and implement their own build system if they need to support a dozen languages.

Microservices have a marvelous long-term potential to trash the service with a myriad of legacy and obsolete languages instead of just one. So you just have to run faster to stay in the same place.

Microservices should be used as an optimization. Not as a starting point.


