
Staffing such that on-call is handled by presently-in-office staff. This is, as I understand, pretty much what Google does. When you're in the office, you're in the office, but when you're not, you're not. Having global coverage means ops in several timezones, and this is what Google accomplishes.

Not knowing when, at any time, your phone or pager will go off wears in interesting ways over time.




It depends on the team and type of oncall rotation for the service. My team (a SWE team) has its own oncall rotation as we don't have dedicated SREs for all of our services.

Since we're US-based only, the oncall person has pager duty while they sleep. Our pager can be a bit loud at night due to the nature of our services, so it's definitely not for everyone (luckily it's optional).


Is this at Google?

I'll note you're SWE not SRE. I'm talking mostly about dedicated Ops crew on pager.

It's one thing if you're responding to pages resulting from other groups' coding errors or failure-to-build sufficiently robust systems. Another if you're self-servicing.

One of my own "take this job and shove it" moments came after pages started rolling in at 2am, bringing me on-site until 6am. I headed back for sleep, showed up that afternoon and commented on the failure of any of the dev team to answer calls/pages/texts (site falling over, I had exceptionally limited access capabilities and was new on team). Response was shrugs.

Mine was "That wasn't your ass being hauled out of bed. See ya."


The opinions stated here are my own, not necessarily those of Google.

Yes, it is at Google. Our important and high visibility bits have SREs that help monitor our services (SREs actually approached us to take over some bits that were more important).

Google has a lot of oncall people who aren't going to go into a data center (most Googlers never see a data center). So there are lots of oncall rotations that still have an SLA but can be handled from bed if a page comes in at 2am.

(I sadly can't give any examples)


This is not generally true for at least the big SRE-supported services at Google. I don't know what every team does, but my team's oncall shift (for example) is 10am-10pm, Mon-Thu or Fri-Sun. Another office covers the 10pm-10am part of the US day.


That's for first-response ops, though. What if a code change is needed to recover, or something else comes up that goes beyond the playbooks?

I guess today is a perfect example. I wonder how many out-of-hours engineers got paged.


Then Dev gets to deal with its own shit.


The magic of devOps is carting two pagers around.



