Org charts that ship a platform are stable by default because everybody in a team or group is doing approximately the same things. Growth is less uncomfortable, advancement feels more objective, and individual developers are relatively interchangeable.
But what if a company needs to change? Now the stable org chart resists that change by rejecting requests from client teams that are responsible for a new set of objectives. This recurses. One layer of platform can simultaneously be moving too slowly for the layer above and too quickly for the one below. Shear forces tear it apart and the organization finds itself with n (3 < n < 6) fewer platform engineers.
A breakthrough can be pushed by rolling a second library/framework/platform, like AWS ELB and ALB. Then developers can adopt the latter if it's so much greater, but they won't, because it's 90% the same and who wants to work on migrations?
Large organizations are fundamentally split apart. One part of the org wants A and B. A second part wants B and C. The developer team on the next floor is rolling their own thing to do C and D. Meanwhile, features A and C are incompatible, so it's impossible to satisfy everyone. There is no way to resolve these internal conflicts (except maybe reducing a large company to 20% of its current workforce).
The big thing I remember about their approach that surprised everyone: there was no mandate to use the platform/central team's tools. It made me chuckle how many times the presenter was grilled by everyone about that. It was like some of the audience straight up thought he was lying about that.
But basically, if you have a platform team, and a mandate to use that team's tools, well, the other teams aren't really "customers", in the sense you can leverage choice as a signal. So you have to make up for that. In my experience, you need very good management, and constant, multifaceted communication. Which... might work, might not.
It's probably best to delay any kind of centralized/platform work until you have a _very_ clear pain that defines a very clear set of roles and requirements. Unless everyone says "Oh shit this is amazing" ... just say no.
“This is the cluster scheduler you must use to run code in the datacenters” Ok.
“Since part of your feature communicates with end users, it must be implemented in this visual programming language we created for workflows that interact with end users.” Not fine. This is how you end up with abominations of workarounds on workarounds. Let me use my regular tools and give me a damn API. Your decision to invent a shitty half baked programming environment is not my problem. When you try to make it my problem you are only creating more damage.
How are ops/developers supposed to use Ansible or Salt or Docker or Kubernetes when the only available way to access/deploy to the servers is the one centrally approved tool?
The way you ultimately solve this is by aligning incentives, e.g. "platform engineers get fat bonuses/promotions when the products built on top of their platform kick ass."
The extra bit of empathy makes all the difference, because without fantastic personal communication, 'platform' could be a waste of time for everyone involved.
And honestly other eng teams kind of are your customers. They may not pay you directly, but they also help to build your product and the company that cuts your (and everyone's) paycheck.
If my coworkers need my services, then that's because they are developing something that a customer needs. I think of it like a dependency tree. As long as you trust your company not to have multiple teams develop something that will never see the light of day, or something the user doesn't want or need (sadly, been there, done that too), then this mindset is absolutely a good one.
Btw this can be true of other departments as well – for example, SaaS product marketing is often an entity that exists to serve internal "customers." Product management can be envisioned this way as well.
>And being able to effectively and quickly incorporate feedback from consuming teams into the product.
Without that, you can have a quality platform team pushing out good products, but if it doesn't align with what other teams need or you don't expose good override hooks, then the end effect is the teams will fragment into doing what works best for them.
The things I've seen built with Excel and Access.
These shortcomings all manifest themselves in how state is managed. Terraform state is declaratively described, and it may or may not match the state of the backend. Once this state drift exists, it becomes difficult to correct.
This is my primary criticism of Terraform and one of the reasons I prefer Kubernetes. I know it's an apples-to-oranges comparison, but in Kubernetes there is both declarative configuration and active reconciliation. You have both current state and desired state, and a set of controllers seeking to make them match. I'd love to see this implemented with Terraform.
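To make the "desired state plus active reconciliation" idea concrete, here's a toy sketch in Python. It is not how Kubernetes controllers or Terraform are actually implemented; `Backend`, `diff`, and `reconcile_once` are made-up names, and the "infrastructure" is just a dict of resource name to replica count. The point is only that the controller re-observes actual state on every pass instead of trusting a recorded copy, so drift gets repaired the next time it runs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Backend:
    """Toy stand-in for real infrastructure: resource name -> replica count."""
    resources: dict = field(default_factory=dict)


def diff(desired: dict, actual: dict) -> dict:
    """Return the changes needed to turn `actual` into `desired`.

    A value of None means "delete this resource".
    """
    changes: dict = {}
    for name, count in desired.items():
        if actual.get(name) != count:
            changes[name] = count
    for name in actual:
        if name not in desired:
            changes[name] = None
    return changes


def reconcile_once(desired: dict, backend: Backend) -> int:
    """One controller pass: observe actual state, then converge it on desired."""
    changes = diff(desired, backend.resources)
    for name, count in changes.items():
        value: Optional[int] = count
        if value is None:
            del backend.resources[name]
        else:
            backend.resources[name] = value
    return len(changes)


desired = {"web": 3, "worker": 2}
backend = Backend()

reconcile_once(desired, backend)          # initial rollout
backend.resources["web"] = 1              # someone edits the backend by hand (drift)
backend.resources["debug-box"] = 1        # ...and adds something unmanaged

fixed = reconcile_once(desired, backend)  # the next pass notices and repairs both
print(fixed, backend.resources)
```

A real controller would run `reconcile_once` in a loop or on change events. A one-shot apply that compares against a stored state file, rather than the live backend, wouldn't see the hand edit until someone refreshed that state.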
Ideally, a platform team should give you reliable, self-service components to build upon, like databases, caches, rate limiters, API gateways, etc.
If they're framework-ish, they're mandating you build your app in exactly their preferred way in order to receive the benefits.
An important realization in my opinion is that a platform team is just another development team. They should consider their platform as a product and the developers as their customers. To minimize any downtime they should use the typical mechanisms that developers are also using: automated tests, pair programming, rolling out changes to test environments first, etc.
When done right, a platform team should basically not be noticed, except that the tooling, developer experience and overall reliability of a system go up.
You also have to take care to let users extend the platform to fit their needs, or the platform team will be a permanent bottleneck. I have seen this in companies where the SAP people had a backlog of several years.
It's a given that SREs build tooling, but most of the SRE-focused work, as described in that book, is around improving the resilience of a given application or product: addressing production readiness, defining SLOs, and handling incident response.
There are platform-specific SRE teams at Google, but there's not much published about how they go about creating a platform.
The book "Seeking SRE" makes it clear that in most places, the notion of "platform engineering" varies tremendously.
I don't know of any authors that have addressed this explicitly other than Susan Fowler in "Production-Ready Microservices", who writes:
Another important part of microservice adoption is the
creation of a microservice ecosystem. Typically (or, at
least, hopefully), a company running a large monolithic
application will have a dedicated infrastructure
organization that is responsible for designing, building,
and maintaining the infrastructure that the application runs
on. When a monolith is split into microservices, the
responsibilities of the infrastructure organization for
providing a stable platform for microservices to be
developed and run on grows drastically in importance. The
infrastructure teams must provide microservice teams with
stable infrastructure that abstracts away the majority of
the complexity of the interactions between microservices.
I can definitely recommend this book for a (new) perspective on how platform teams fit into organizations.
In a past life I worked on an SRE team that backed up data to cassette tapes using a fleet of robots with lasers. https://youtu.be/kQ2taAttvwo That was easy. Hard SRE work is more like Traffic Team.
So I think what the author Nick is saying makes perfect sense. Yes, these platform layers can cause difficulties. Can we fix them? Yes, just read the SRE book. That same wisdom will carry over just fine to this kind of problem.
“However, a number of factors are, and continue to, cause the traditional responsibilities of a Site Reliability Engineer (SRE) to shift.”
i.e., SRE teams are shifting to become platform teams or, in other words, are indistinguishable from them...