This is caused by the typical attitude software companies have towards development. The most common model for software development is simple... rush to push out new features to the market, and pay the costs later--and then everyone balks at the costs when the bill comes due.
The solution is kind of brilliant, IMO. You don't have to pay technical debt on projects if you shut them down and delete the code from your repository. Migrate to industry-standard solutions. Use off-the-shelf programs & libraries. Delete all your custom stuff. Replace good solutions with "good enough" solutions.
The SREs can help you with that, but they can't help with out-of-control development. As your code base gets larger, the cost of supporting that code base gets larger too. The difficulty of scaling your SRE to match development reflects your out-of-control development process, not a problem with SRE. Keep the costs under control by keeping your code base under control.
Be very careful following this advice. Only use an off-the-shelf solution if it is pretty close to what you need or if you can change the need to fit the tool. Trying to shoehorn processes (or whatever) that don't fit a tool into that tool often ends up being a lot worse than a custom-coded solution. I've seen this multiple times at enterprises that preferred vendor solutions as you suggest, and they were worse train wrecks than places that used more custom solutions.
If a component is high cost, tightly integrated with the rest of the system you're building, and difficult to design, you will need to become an expert on it just in order to evaluate suppliers and make good purchasing decisions.
Then (a) In order to become an expert on something, you need to do it yourself for a while; and (b) if you're an expert on it and have made it for a while, you might as well continue.
Thus: for high cost, tightly integrated, difficult to design parts: build, don't buy.
If a component is a major part of the value you're providing, it's sort of the same situation: you need to be an expert on it to safely buy it, and then you might as well build it indefinitely.
The answer is not always obvious and being able to make the right call consistently is something I consider a sign of a great engineer.
True for startups. Not true for BigCo. Economically it's a balance between the cost of a dev customizing the cogwheel and the cost of pushing the organization toward a standard factory-made cogwheel.
Engineer: I have come up with a solution that wouldn't need all these frameworks.
Manager: Good. See how soon you can integrate with API Gateway and Service Mesh.
Heard at friend's workplace.
- The developer for building a non-standard solution that wasn’t asked for and is impossible for the next person to maintain (like those who decide to roll their own database, VCS, encryption, etc)
- or the manager for not understanding the requirements of the build and then expecting developers to redo existing work rather than spending 5 minutes learning the tech stack.
Either way, this is why some orgs refuse to allow any development without a technical design. When you go into a build with a design that everyone agrees on, you ensure that expensive development time is focused and less time is wasted rewriting code due to missing requirements.
Is development time all that expensive? Compared to time spent making different designs.
Sure developer time is costly, but developers in meetings and design rounds are just as costly if not more so, because more people are involved.
This then gets exacerbated when after weeks of discussion and design an attempt is made to implement an MVP where it is noticed that none of the decisions made work very well for this particular use case.
You can of course face the opposite problem of not designing and implementing something wrong.
This is why I always ask to make a small test first. Say you have 5 possible rendering backends for your graphical output and you don't know which one to use. Spend some developer time building a miniature test to see if the features are there for each one. Often, as soon as you start using it, you can rule one or two out after a few minutes (installation breaking). The rest you can sort out in a few hours by modifying the examples they use on their web page.
This often leads to only 1 or 2 actual possibilities, focusing the ensuing design work on what matters.
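That spike-test approach can be sketched as a tiny harness (the backend module names here are made up; substitute your real candidates):

```python
import importlib

# Hypothetical candidate rendering backends -- substitute your real ones.
CANDIDATES = ["cairo_backend", "skia_backend", "opengl_backend"]

def smoke_test(module_name):
    """Return True if the backend imports cleanly and exposes the one feature we need."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False  # installation broken or missing: ruled out in minutes
    # Feature probe: does it support offscreen rendering at all?
    return hasattr(mod, "render_offscreen")

def shortlist(candidates):
    """Keep only the backends that survive the smoke test."""
    return [name for name in candidates if smoke_test(name)]
```

A few minutes running something like this narrows the field before any design meeting happens.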
Mainly it was an exercise in moving from technical work to a year-long vendor negotiation and setup. The OTHER API gateway blew up despite being used at Netflix. But THIS one is NextGen® so obviously much better.
But invariably you continue to follow the advice and buy a 2nd. And it won't integrate well with the first. By the time you get to the 10th system that doesn't want to talk to the others, you're employing dozens of people to do the data entry between them.
And these proprietary systems almost _never_ want to talk to the others. There is a reason Outlook doesn't talk to Google Calendar, and Google doesn't talk to Outlook's calendar, yet somehow Thunderbird (open source) manages to talk to both. And I can't count the number of times Microsoft has managed to break their IMAP interfaces.
I've seen this dark pattern in so many large businesses now, I've come to believe it's the norm.
I use a handful of Google calendars, but almost never use the Google calendar UI directly and interact almost exclusively via the Apple or Microsoft UIs to the Google store of data. (As in I might use the Google UI directly perhaps once a year, and that's generally only to get the access/path to the iCal interoperability URL.)
It was pretty awful and had significant, material negative impacts on revenue due to all the operational issues.
I've observed that nearly every customizable "off-the-shelf" solution will eventually converge to this end-state: dozens or hundreds of ad-hoc modifications to support an entrenched business process that incidentally makes upgrades nearly impossible. Non-customizable off-the-shelf software is rarely any better: all of the needed (or perceived needed) points of customization are managed outside of the software but rely on undocumented conventions.
One pattern I see is that, as the company grows, development gets split into different product groups, which will organically diverge unless there is rigid enforcement of design patterns. In some places, SRE does this implicitly because they will only support X, Y, or Z, but in others each product group will have their own group of SREs.
There comes a point when you need one or a small group of people who are the opinionated developers who can make design decisions and who have the authority to cause everyone else to course correct. If you don't have this, you'll wind up with long migrations and legacy stuff that never seems to go away.
SRE exists to support product functions and like everything else should be attached to and understood in terms of those functions. Yes, every product group probably should have their own SREs, so that product group can own its whole lifecycle. Yes, different groups will do their own things and there will be mismatches and duplicated effort. That's less bad than the alternative.
Previous job (2500+ devs by the time I left) had an in-house RPC system that was being moved over to gRPC. That project was taking years because teams had no coordination on the process. The decision was made at some level and trickled out to everyone else. There was no single person or group who was in charge of:
- How services would be discovered
- Implementation Patterns of how Services & Methods will be defined
- Standardization of which libraries to use
- Examples and pre-built shared libraries that provide things like tracing, monitoring, retries, etc...
- Advocating for the changes
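The "pre-built shared libraries" point is often where standardization pays off fastest. A minimal sketch of one such shared helper, a retry-with-backoff decorator (parameters and defaults are illustrative, not any particular company's library):

```python
import time
import functools

def retry(attempts=3, base_delay=0.1, retry_on=(ConnectionError,)):
    """Shared retry-with-exponential-backoff decorator, so every team
    doesn't reinvent (and subtly diverge on) the same pattern."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == attempts - 1:
                        raise  # out of retries: surface the error
                    time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s...
        return wrapper
    return decorator
```

Teams then just write `@retry()` on their RPC calls and get the standard behaviour, which is exactly the kind of convergence the bullet list above never had.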
SRE seems to fall into the position of advocating the business value of development practices, which compete with business objectives that can provide value as well. At large organizations, you run into trouble if you don't have a central point that can set development objectives, be the place teams go with "this pattern doesn't work for us, we can do this but we need buy-in from other teams" issues, and hand directives down.
Unless you operate in an environment where the only cross-team communication is well-versioned public APIs, you will run into conflicting needs between teams and need someone to set a vision (this can be a group of people, rotating people, or a single person; how is not the issue).
There are probably some things that do need to be standardised, but if there's a business need for standardisation then product teams should be able to understand and advocate for that (whether that means agreeing something with their directly adjacent product team, publishing something for clients to use, or something else). But in a lot of cases I think just accepting that different parts of the organization will work differently is the best way forward.
For us it meant that the data processing teams have been made part of the drone operators team, so whenever we fly a mission a photography/3d rendering expert will also be part of the team that operates the drone. On paper it's more expensive to have office workers in the field, but in practice it leads to fewer reflights and happier and more productive employees.
I imagine that for the software departments, it could mean that every app development team has at least one member that has good operating system and network infrastructure knowledge, and/or maybe database expertise, so that the team as a whole can largely operate a feature without having to depend on an outside SRE specialist.
And then the SRE's that you do have can focus on the site reliability, instead of having to constantly tell developers how the way they coded something is bad or whatever.
Crappy attitudes towards support and reliability don’t scale; what you can get away with when a few people keep things barely held together stops working as you grow.
How many times have you been in the "everyone else" camp and was course corrected? How did those efforts work out for the firm in the long run? Any experiences to share that could be useful for the community?
This process largely turned into "name and shame" because there was no incentive for some teams to put in the effort to make the changes. They had other objectives to complete, and swapping RPC frameworks was not one of them. The only way the change happened was putting a hard deadline before the old system was shut down (by SRE), which is not the right way to do it.
There were a lot of stories like this. One team owned user information, but the business needs shifted this ownership to another team. This resulted in the ownership being split between three teams, and applications turning into transparent proxies for other applications. One service was a REST interface that provided a bit of logic on top of forwarding the request to a gRPC service.
The makeup of the company was a bunch of loosely coupled product teams, and the only common connection was SRE and Data, who worked with everyone. SRE became the team that had to work to resolve these "what the fuck, why can't you just sit down and figure out who should own this" issues. There really needed to be an architect or someone who could look at the big picture and say "Why do we have this one internal REST service? Ok. Team A and B. You have this quarter to stop using Service Q and migrate to Service W."
At $NewCompany, SRE is doing the course correcting (just due to small size), but we have a Principal Developer who is dictating that "Yes, we're going to implement new business logic in lambdas following this pattern." He works to make sure that everything is done in a standard way, but at the same time takes ideas for new patterns and makes sure they are done in a smart way and don't conflict with anything else. He doesn't stand as a roadblock, but as someone who can make sure teams are not going off and doing weird things (like using MySQL when we're a Postgres shop) that can cause issues later.
AFAIK this doesn't happen anywhere. Although I believe G has something like this.
It does. And until the service meets the criteria, SWEs on the project are actually partially playing the SRE roles under the guidance of the SRE org.
In a typical split, SWEs often do the dev work for features and large reliability/scalability changes (which SRE helps appropriately prioritise), whereas the SRE team maintains the software around the project (config pipelines, monitoring etc.) and might occasionally write some smaller reliability/scalability modifications.
But there can be lots of variance. It’s atypical but some of the infrastructure-focused SRE teams often maintain non-trivial software, but are part of SRE because of other responsibilities.
That doesn't mean it's all handwork, I'm sure SREs at Google employ a boatload of automated event handling and custom response scripts. But "keeping the service up" requires different skills than "building a service", and Google chose to separate their Dev and Ops this way. As others said in this thread, if some service isn't up to SRE standards (in terms of monitoring, logging, or robustness), the SRE team won't accept it and Devs would have to do their own Ops.
Effectively, they're talking about standardization across your teams/services so they don't fuck things up. Essentially, you're taking away some of the purported freedoms of microservices (complete independence - e.g. I can write this service in Brainfuck if I want!) and reining it in a bit so you don't build a pile of trash.
Fast forward to a few years later, when my friend started there. The Erlang dev was long gone, and nobody knew Erlang or OTP or anything about the service well enough to take ownership of it. My friend was constantly awoken during his on-call weeks because the damned thing kept repeatedly tipping itself over now that the company had grown. He couldn't maintain it other than restarting and praying, and couldn't rewrite it - the company had bigger priorities.
That was the shortest lived job of his career so far.
(recap: https://en.wikipedia.org/wiki/Theory_of_the_firm especially "Why is the boundary between firms and the market located exactly there with relation to size and output variety? Which transactions are performed internally and which are negotiated on the market?")
The answer involves looking at informational and coordination costs. Microservices are extremely Coasian: rather than have a part of the software/company under direct, traditional command-and-control management, it's kept at arms length and interacted with via a software contract.
Using microservices theoretically reduces the coordination costs between parts of the team, but in practice it risks just shoveling them around. After all, the easiest way to reduce a cost is to dump the externality on to someone else. If you were deploying a monolith, it would obviously have only one set of SRE practices; with microservices, you end up duplicating those and introducing costly variation.
A vast number of software engineers barely get the ops (running software) stuff at all, & half of them can sort of play along, hack stuff into place. The engineers on product teams who do know how to do things, meanwhile, don't get all the constraints, best practices, & ideas that the various DevOps folk have, & have their own wants/desires/expected ways of doing things, so they end up creating their own very unique sub-ways of doing things within the org. None of these practices converge on regularity or consistency with whatever DevOps machinery ends up being built.
What we do have often is just a random pile of containers and scripts that a couple people sort of know decently & everyone else suffers through & survives within. Almost never does it look like any other company's devops kitchen.
SRE doesn't scale because it's an every-now-and-then thing, and few people notice or care about the difference between a well-built corporate citizen that runs well & is monitored & operated according to whatever the in-power SRE cabal wants, and one that isn't. People start to care only if things are going bad, either via services not building/integrating/deploying/running as well as they should, or from too much confoundedness/general head scratching by either the SRE or regular engineers. SRE is not a priority, it's not practiced regularly, it's only an every-now-and-then thing, so we don't have the chance to get good, to institutionalize the right ways of doing things. That's what the article is discussing. Not the rest of the everyday normal software-development rushing-bedlam you describe.
That's the part that doesn't scale... tacking on SRE at the end, or doing it every now and then. The reason people don't care about the software being a "well-built corporate citizen" is because they care more about shipping features. If you have an SRE team that will say "no" when you try to ship new stuff, you'll eventually figure out a way to build new things that the SRE team will say "yes" to. When I say "no", that could be a hard pushback like "no, that's not getting shipped" or it could be an answer like "no, the SRE team will not support that, yet."
These kind of decisions need to be made at a high level, because everyone in the institution is typically operating with the wrong incentives. That's why you end up with a random pile of containers and scripts. It doesn't have to end up that way, even when you have microservices.
> That's what the article is discussing. Not the rest of the everyday normal software development rushing-bedlam you describe.
I disagree with the article, so necessarily there are going to be differences between what I'm saying and what the article is saying.
>few people notice or care about the difference between a well-built corporate citizen that runs well & is monitored & operated according to whatever the in-power SRE cabal wants
During my time as an SRE at multiple organizations, what I witnessed was executives that owned Operations teams renaming the team, paying for a Python course, and calling it "change". Operations engineers have a culture unto themselves that will not be broken in a year's time. A culture of firefighting and heroism as opposed to proactivity and dull standards. They're rarely, if ever, programmers, which was a prerequisite for an SRE-SE and SRE-SWE. You don't need to be a master in logic to know that if you dress your old pig up with lipstick and a new title, it's still a pig.
I agree with the article: read the SRE book and understand why they did what they did, and if you hire SRE-SE or SRE-SWE types, don't colocate them with Operations types or make them a 1:1 map with your dev teams that haven't earned them.
I really appreciated working with a team of SRE types, it gave me a newfound appreciation for quality in software development. I remember one instance where a colleague in my team went up to the 'ops' team and wanted SSH access to a production server to check on some settings (environment variables, I believe). He thought it was a trivial thing, you know, "just lemme have a peek" kinda thing, but the ops team flat-out told him no, if he needs to print env vars, he can do it in his own code - we did continuous releases, a patch could be deployed within minutes if passing all the checks. I loved that the ops team had the mandate from higher up to say no to requests like this, and I found them a lot more professional than the software developers, including myself.
This is still the wrong attitude from a "productization" of infrastructure perspective. Configuring the environment a program runs in is a core responsibility of infrastructure. So is being able to query it. Build that infrastructure product feature.
Or select a platform that makes this trivial, e.g. Kubernetes.
I've never been a fan of microservices for this simple reason: every boundary, every interface, every connection is friction and a potential source of bugs, performance issues and problems.
It's now been several years since I worked at Google so I can't speak to the current state of things but when I was there, monoliths ruled the roost.
It seems possible there was some experimentation with microservices given the rise of Docker and Kubernetes in the last few years. I'd be surprised if this made serious inroads into high-traffic core services.
It's also worth mentioning that generally Google has (had?) SRE teams rather than SREs embedded on SWE teams. You'd never have one SRE supporting a service as there'd have to be enough to maintain a healthy oncall rotation (typically 8-12).
I can't help but feel like the author is projecting his own experiences onto what Google has said here.
It's completely true that microservices make development _very_ hard. Just being able to run things locally was very hard, since instead of understanding how to build/run one service, you had to figure out how to build/run N services. Testing was another major headache.
Monoliths help with technical aspects but organizationally they make your life hell and can severely limit your ability to move fast.
> Google [...] says the SRE model [...] does not scale with microservices. Instead, they go on to describe a more tractable, framework-oriented model to address this through things like codified best practices, reusable solutions, standardization of tools and patterns
Basically, anyone planning on microservices should define and monitor bounds on the frameworks, tools, and diversity of design patterns in use. Good advice at any scale.
Does separating a specific piece of functionality out of a wider array of functions into its own VM make it a microservice?
When does something stop being a microservice, I guess?
She answered "I don't know if we have microservices, but we do have services that don't do much"
It's since then that's my definition of a microservice :)
If there are multiple services and they all share hardware, they're not really microservices.
> If there are multiple services and they all share hardware, they're not really microservices.
You just described most microservice deployments on Kubernetes ;)
Even if we had separate cloud servers - such as EC2 on AWS - for each service, technically they are not really independent hardware.
But for the purposes of deploying, hardware capacity allocation and scaling, you can see them as such. The same goes for containers orchestrated by K8s.
A microservice architecture has weak coupling.
There's no such thing as a micro service or a macro service (in terms of there being a perfect delineation). They're simply terms to describe certain design goals. Any given service might have several qualities that can be described as befitting a macro or a micro service.
It's all too easy to fall into the no true scotsman trap.
Need a database? Use the database as its own service; don't couple the database to the app, so if you need to scale the app horizontally you can add more replicas, and if you need to scale the database vertically, you can scale the database vertically. If these are decoupled, you scale only what's needed.
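A minimal sketch of that decoupling in Docker Compose terms (image names and resource numbers are made up): the app and its database are separate services, so each scales on its own axis.

```yaml
# Illustrative only -- the app scales horizontally, the DB vertically.
services:
  app:
    image: my-app:latest            # hypothetical application image
    environment:
      DATABASE_URL: postgres://db:5432/app
    deploy:
      replicas: 4                   # add more app replicas under load
  db:
    image: postgres:16
    deploy:
      resources:
        limits:
          memory: 8G                # give the single DB node more headroom
```

Because the app reaches the database only through `DATABASE_URL`, either side can be resized or replaced without touching the other.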
On top of that use the language or tool that is meant for the job. If you need something written in go, build your service with go, if you need nodejs, do it with node, etc.
Microservices don't need to be micro or small
The intended model is to do one thing, thus enabling surgical changes to functionality without having to rebuild everything. As long as you stick to your API contracts, you can muck around with the internals without affecting anything else.
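The "stick to your API contracts" point can be made concrete with a contract test: pin the observable behaviour, then rewrite the internals freely. A toy sketch (class and method names invented for illustration):

```python
# A toy service whose public contract is just: normalize() -> trimmed, lowercased.
class TextService:
    def normalize(self, s: str) -> str:
        # v1 internals: chained str methods.
        return s.strip().lower()

class TextServiceV2(TextService):
    def normalize(self, s: str) -> str:
        # v2 internals rewritten entirely -- the contract is unchanged.
        out = []
        for ch in s.strip():
            out.append(ch.lower())
        return "".join(out)

def contract_test(service):
    """Any implementation honouring the contract must pass this unchanged."""
    assert service.normalize("  Hello ") == "hello"
    assert service.normalize("OK") == "ok"
```

The same contract test passing against both versions is what licenses the "surgical change": consumers never notice the rewrite.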
I have this half-serious opinion that interface design is the hard (technical) part of software development, and the rest is mostly easy.
However, public interfaces are only a small part of software interfaces in total. It's getting the system's internal interfaces right that takes most of the work.
In my view, I was lumping internal/public interfaces in the same basket. I haven’t really considered public vs internal aspect of this.
Essentially, "is this enough functionality to make a new service?" is the wrong question.
This is leftpad as a service, over HTTPS
What a wild story
Uh huh. Maybe what isn't scaling is their onerous recruiting filters.
But then I asked him -
"Given an array of positive integers target and an array initial of the same size with all zeros,
return the minimum number of operations to form the target array from initial, if you are allowed to do the following operation:
choose any subarray of initial and increment each value by one."
He was stumped. As I had suspected, he wasn't quite up to the job of an SRE. I immediately failed him and returned to editing my networking.yaml file. Someone has to maintain the bar around here..
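For what it's worth, the puzzle has a tidy one-pass answer: new operations are only needed where an element rises above its left neighbour, so the minimum is the sum of the positive deltas. A sketch:

```python
def min_operations(target):
    """Minimum subarray-increment operations to build `target` from all zeros.

    Any rise over the previous element needs that many new operations; any
    fall can reuse operations already "in flight" over the current prefix.
    """
    ops = 0
    prev = 0
    for t in target:
        if t > prev:
            ops += t - prev
        prev = t
    return ops
```

For example, `[1, 2, 3, 2, 1]` needs 3 operations and `[3, 1, 5, 4, 2]` needs 7.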
I have no degree. I don't write code for a living. I'm actually almost precisely the opposite of the stereotype of the stuck up software engineer, aside from being a white dude in my early thirties.
What I am saying is: I have met many people who can configure a database, but cannot talk to developers as equals about how to write code about it, and there's really an interesting split of reasons why that doesn't work. The fact remains that it doesn't. You can build operations teams for some tasks, and you should - you can't run your own large hardware or compute or database clusters without having some entirely operations focused people. You should not ignore their opinions or pain points. You shouldn't call them SRE, though. The point of SRE is to build systems and run them, and you want them to have feet on either side of that line.
Is that supposed to be a joke?
You need to be a strong developer to work efficiently as an SRE.
And then there's the unrealistic demands from companies that didn't even understand what DevOps was supposed to be, and are misunderstanding SRE even more. Many companies still treat this the same way they did "ops".
I sometimes wonder if the software world has relatively greater 'leakage' via early retirement than other fields, creating a constant problem of not enough highly experienced people who remain stuck as wage slaves throughout a 40 year career.
Hiring narrowly focused people with significant pre-existing experience is always costly.
... and I also think it is practically always the wrong strategy ...
I am fundamentally against micro-managing labor pools by the federal government. However, there need to be economic incentives (including immigration policies) that make it more difficult for employers to hire for 'tool-centric' positions, unless those tools are very expensive physical devices (eg telescopes, quantum computers, etc).
The incentives need to direct employers at training on the job (not at after-work online classes).
When I see on HN's hiring threads -- 'if you have Azure experience -- you will get on top of the pile' -- I cringe.
This is absolutely ridiculous. Same, pretty much, with the Ruby/PHP, Rust/C++, Haskell/Scala modalities of the same problem.
Yes, hire people who understand process/idioms/patterns, but invest in the f..ing training. If you need somebody 'yesterday' to help, get consultants -- have them help, and at the same time hire for full-time roles and train your internal staff (possibly, even, have them learn from the consultants).
Afraid of people leaving after acquiring the highly sought-after skills?
-- Implement meaningful compensation/retention policies that reflect the effort you spend on training, and the risks you are taking if a person who was just trained leaves.
That's why, as a seller (worker) you should try to specialize in at least one area to sell yourself better.
Too many coders, generalists or not, cannot engage properly with something that isn't a computer.
Those that manage to merge coding skills, with UI/UX, marketing, understanding the customer point of view, already have an upper hand.
I am also saying that there needs to be a system of incentives, that then, produces, organizational/hiring practices that favor non-tool-specific employment process.
Oblig. self-promoted thread: https://twitter.com/jacques_chester/status/14357388253476454...
I think the author read this as more of a problem than a solution. This concept is supported by the DevOps model too. Your infrastructure is just as much a part of your product and the teams providing the infrastructure just as responsible for the service levels and API contracts as any customer facing product team.
That sounds like a culture problem to me.
SRE is just a fancy name for a class of people that make stuff more reliable. However it appears the reader of this book forgot about the other part: making stuff easy to use.
Everywhere I have worked in the last 10 years has had an embedded sysadmin/devop/SRE/PE of some sort. They are there to take the raw, difficult to use tools and adapt them so that developers are able to deploy without hand holding.
This means evangelising metrics, showing how to manage alerting, coordinating standards and leading incident responses.
The symptoms the writer describes are of a dysfunctional dev environment. Ideally your SRE should be able to rotate teams quickly. That means standards. Once a team is set up and running, they shouldn't really need ongoing SRE functions, unless they are making massive changes.
This can only happen if developers are re-using tools, rather than making new wheels out of space age materials and getting bored and moving onto new things.
So SRE is responsible for the health of Kubernetes, networking, etc., and for providing libs, support, and infra, but the DevOps teams are responsible for the health of the services/software that is running on the cloud platform.
A DevOps team should not consist of developers plus one ops person; the team as a whole is responsible for the stuff they create. They are on call just like SRE is; if something goes down, SRE + DevOps work together to fix it.
You need standards, without that SRE is pointless. Everything needs a standard method of monitoring. As an e.g. - stick to Java/Spring Boot, MariaDB and K8S. That will generally cover 85% of your use cases.
The automation and advantage of SRE is derived through standards and familiarity with the tool chain.
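One way a "standard method of monitoring" plays out in practice is a small shared metrics library that every service imports, so counters and latencies show up under the same naming scheme everywhere. A toy in-process sketch (names invented; a real deployment would use something like a Prometheus client library):

```python
import time
from collections import defaultdict

class Metrics:
    """Minimal in-process metrics registry every team imports, so all
    services expose counters and latencies under one naming scheme."""

    def __init__(self, service_name):
        self.service = service_name
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, by=1):
        # Standard key format: <service>.<metric>
        self.counters[f"{self.service}.{name}"] += by

    def timed(self, name):
        """Context manager recording wall-clock latency under a standard key."""
        metrics = self
        class _Timer:
            def __enter__(self):
                self.start = time.monotonic()
            def __exit__(self, *exc):
                metrics.timings[f"{metrics.service}.{name}"].append(
                    time.monotonic() - self.start)
        return _Timer()
```

Because every team records `checkout.requests` or `checkout.db_query` the same way, one dashboard template and one alerting rule cover the whole fleet; that familiarity is where the SRE leverage comes from.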