It must be possible for SREs to refuse a service that doesn't meet certain crite...

dilyevsky · on Oct 12, 2021

To make this possible, you need either an entirely separate chain of command for SRE, like Google has, or very influential SREs. Both are exceedingly rare, so I'm not surprised OP thinks SRE doesn't scale.

pkhuong · on Oct 12, 2021

Google SREs definitely have that stick.

jedmeyers · on Oct 13, 2021

> I believe G has something like this

It does. And until the service meets the criteria, the SWEs on the project partially play the SRE role themselves, under the guidance of the SRE org.

tayo42 · on Oct 13, 2021

Is SRE at Google just a maintenance team? What do they do, then?

ratorx · on Oct 13, 2021

Effectively, yes. The main things SRE provides are (1) on-call support, (2) production-focused design consulting, and (3) integration with other infrastructure. In practice, an engagement almost always provides (1), and the rest depends on how mature the SRE team is.

In a typical split, SWEs do the dev work for features and large reliability/scalability changes (which SRE helps prioritise appropriately), whereas the SRE team maintains the software around the project (config pipelines, monitoring, etc.) and might occasionally write some smaller reliability/scalability modifications.

But there can be a lot of variance. It's atypical, but some infrastructure-focused SRE teams maintain non-trivial software themselves and are part of SRE because of their other responsibilities.

fragmede · on Oct 13, 2021

Google wrote a book about it. It's free to read. https://sre.google/books/

tremon · on Oct 13, 2021

SRE is the first-responder team. They are on-call 24/7 (the team, not each person), perform systems and service monitoring, triage failures and mitigate outages.

That doesn't mean it's all manual work; I'm sure SREs at Google employ a boatload of automated event handling and custom response scripts. But "keeping the service up" requires different skills than "building a service", and Google chose to separate their Dev and Ops this way. As others said in this thread, if a service isn't up to SRE standards (in terms of monitoring, logging, or robustness), the SRE team won't accept it, and the devs have to do their own ops.