Disclosure: I'm a former Google employee
I've since transitioned onto a different career track, but I have long wished to find some way to use my combination of unix sysadmin and software engineering skills without ever having to be on call. In the companies I worked at (including Google), I never really found that.
This said, I really enjoyed the experience. Yes, it was tiring and stressful, but it was also super interesting and exciting. Being responsible for such a huge site was incredible, and the feeling of figuring out how to overcome a big outage was exhilarating. I actually miss the pager drama from time to time (my wife does not).
Putting people on call is an explicit choice to sacrifice reliability for the sake of a budget. At what point is reliability worth more than the cost of a few remote employees?
Furthermore, I'd argue that if there is a dedicated group of people whose job is SOLELY to work outages and help tickets, 100% of the time, that's not oncall.
Oncall is "hey, you have these projects to do, but every 6 weeks you also have to answer every IM/phone call/ticket escalation".
The number one problem was that (starting in Year 4) there was nobody else with the knowledge to be on-call. So, I was on-call 24/7, 365 days a year, for 5 years. Realistically it was a small enough company with localized (mostly east-coast USA) clients that things going wrong in the middle of the night were pretty unusual. I had a great boss who mentored me to handle these things by myself, and he mostly put the right tools in my hands so that I could avoid getting woken in the middle of the night by anything short of a disk going bad.
It happened often enough to be a good reason to want to leave! It was my first serious job, and for all its faults it was a pretty great job until my boss got a better offer. When he left, it seemed there was no choice but to implement vSAN and vSphere. And then I actually didn't need to be woken by disks going bad in the middle of the night anymore.
When I did go, there was nobody prepared to handle 75% of the incidents I would have worked on. (I assume that many of those things now just stay broken when they go wrong.)
Being paged is the least bad part of being on call. The restrictions on your life outside of work are what really grate.
If you are on call one week every 3 months, it's mostly the calls that are annoying, especially at night.
If you are on call more often (e.g. every other weekend in small companies/teams), it stops mattering how often pages happen. You are mentally on-call all the time and are forced to be physically ready. It's a huge drain on your life.
I get a small amount for each day I'm on call, plus an extra payment if I'm actually called. That seems a fair balance - I'm compensated reasonably for making myself available, and the actual callout payment makes it worth getting up at 3am a couple of times a year.
Maybe it's a psychological thing, with different people responding differently. But it isn't a huge drain on my life, at all.
The calls themselves were no big deal.
I was never on call but I have a friend who was. Like you mentioned, it plays a huge role in how you live your life. That lingering feeling that they can call you any second haunts you day in and day out.
Even going to the movies must be a terrible experience? You either feel guilty for turning your pager off, or you spend the movie just ruminating about how it will end early for you if you get called.
How do you even sleep at night knowing that someone might call you any second? Sounds worse than prison honestly.
Whoever set up the oncall rotation at Google was smart: it compensates you well for being on the oncall rotation, and the SLO/response time of your pager has some impact on that pay as well.
As for knowledge, as the SRE book calls out, we have a primary and secondary rotation. The secondary is responsible for all things build related (keeping continuous integration clean, deploying to production, and being fallback if primary is unavailable).
I do agree that it can be a bit painful to have to carry a laptop wherever you go, but I've lived with it. I guess your expected response time determines how urgently you need to get a computer up and start working on the problem.
We have it at my company, but everyone is expected to be on call and there is no bonus for doing it.
You get an additional percentage of your base pay for the hours you're oncall outside of normal business hours (evenings and weekends). The percentage is based on how tight your response SLA is: if you have to be hands-on-keyboard in 5 minutes after a page, that's obviously a lot more disruptive to your life than if it's 30 minutes, so the compensation is adjusted.
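As a rough illustration of how such a scheme works out, here is a minimal sketch. The tiers and percentages below are invented for the example, not anyone's actual rates:

```python
# Hypothetical on-call pay sketch: tighter response SLAs earn a higher
# percentage of base hourly pay for each off-hours hour on call.
# The tiers and rates below are illustrative assumptions.
SLA_RATE = {5: 0.30, 15: 0.20, 30: 0.10}  # response minutes -> fraction of base pay

def oncall_pay(base_hourly, offhours_hours, response_sla_minutes):
    """Pay for one on-call stretch outside normal business hours."""
    rate = SLA_RATE[response_sla_minutes]
    return base_hourly * rate * offhours_hours

# Example: a 5-minute SLA over a 16-hour evening-plus-overnight stretch
print(oncall_pay(50.0, 16, 5))  # 50 * 0.30 * 16 = 240.0
```

The point of tying the rate to the SLA is exactly what the comment describes: a 5-minute hands-on-keyboard requirement constrains your life far more than a 30-minute one, so it earns a multiple of the looser tier.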
Sounds like you need to renegotiate that. When I'm on call I don't get paid 9-5. There's a rather nice on-call bonus which pretty much means I get paid the full 24hrs. Might also depend on the country, some have laws around these kinds of things. For example, I can only be on-call for a 1 week stretch and have to be given a (paid) day off after. I can also not be on-call again for another couple of days.
> I was never on call but I have a friend who was. Like you mentioned, it plays a huge role in how you live your life. That lingering feeling that they can call you any second haunts you day in and day out.
If being on-call gives you this lingering feeling of doom then don't be on-call. If your systems are in that much of a fucked up state then you push back until that's sorted. Sometimes you need to let stuff burn. Being on-call does not have to equal being in a constant state of stress on the verge of panic attacks.
> Even going to the movies must be a terrible experience? You either feel guilty for turning your pager off, or you spend the movie just ruminating about how it will end early for you if you get called.
I don't go to the movies when I'm on-call. Turning your pager off at that point is irresponsible and not what I'm being paid for. In most companies you're only on-call for short periods at a time, it doesn't take over your life. You just go to the movies the day after, just like you'd plan any other activity and overlapping commitment.
> How do you even sleep at night knowing that someone might call you any second? Sounds worse than prison honestly.
By using Quiet Hours on your phone correctly and configuring the numbers that pages can come from to always go through. You get to bed and you sleep. The first few times you might sleep a bit less over it. Eventually you get the hang of it. A big part in this is knowing that you don't get paged for random crap but only ever when there's something truly wrong that can't wait until the next morning. Correctly behaving systems and not too trigger-happy alerting are crucial to this.
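One concrete way to keep alerting from being trigger-happy is to page only on sustained breaches rather than on a single bad sample. A minimal sketch (the window size and threshold are assumptions, not anyone's production config):

```python
from collections import deque

# Page only if the error rate stays above threshold for an entire window,
# so one bad scrape never wakes anyone up. Numbers are illustrative.
WINDOW = 5           # consecutive samples, e.g. one per minute
THRESHOLD = 0.05     # 5% error rate

class SustainedAlert:
    def __init__(self):
        self.samples = deque(maxlen=WINDOW)

    def observe(self, error_rate):
        """Record a sample; return True only when every sample in a
        full window breaches the threshold."""
        self.samples.append(error_rate)
        return (len(self.samples) == WINDOW
                and all(s > THRESHOLD for s in self.samples))

alert = SustainedAlert()
for rate in [0.2, 0.01, 0.2, 0.2, 0.2, 0.2, 0.2]:
    fired = alert.observe(rate)
print(fired)  # True: the last five samples all breach the threshold
```

Real alerting systems express the same idea declaratively (e.g. "condition must hold for N minutes before firing"); the mechanism is what makes "you only get paged for something truly wrong" credible.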
Also notice that this is impossible to achieve in the real world. What company has enough people for a rotation + systems that barely fail + a major bonus for oncall + time off + people who are decidedly not trigger-happy with your pager...
The only downside is a small rotation (3 people, 1 week each) and some serious scaling issues which have made the previous few months much worse than earlier ones. Though we also have management buy-in to fix the less-than-good parts.
It's certainly not easy to get to that point but it's entirely possible. Everything on the post you're replying to is how on-call works for me. It's not some imaginary world, it's reality.
Go to SRECon for example and talk to people there. You'll be surprised. Just because you haven't experienced it (yet) doesn't make it impossible.
I've met way too many people who consider their on-call to be okay. And when digging deeper, they're being totally exploited, and I wouldn't touch their positions/teams/organizations with a 10-foot pole.
People who care about on-call have moved to contracting or positions where they are not on call, and they're not coming back. I did, and I ain't coming back ;)
You can't be oncall for 24 hours. That's broken. Human beings need sleep. An oncall rotation that has an individual oncall for 24 hours is a broken thing, and depending on the country it also rightfully violates labour laws.
Second you get paid when you work. Oncall is work. Now it might be less demanding work, so it might get paid a bit less than the 9-5 part, but it has to get paid. Do work, get paid. Simple rule, people should stop fucking with it. And again, labour laws show up here.
If you're oncall you can't shut the pager off. If shutting the pager off is a thing people "oncall" do, it's a broken oncall culture. Since it's often a consequence of the first two problems, those must be fixed first.
You should not be asleep when you're oncall. Same issues as the last problem; same likely causes.
You also shouldn't be oncall often. One 12 hour / 7 day shift every six weeks is fine.
Also pages should not be common. They should be for real, actual issues. If you regularly get paged more than once per shift, your oncall shift is broken.
I do agree that being oncall can be an unpleasant experience, but it's also a motivation to structure your service to not page you in the middle of the night.
Apparently if you write software to their invariably insane and wrong and totally misunderstood specifications they gain the right to bother you every time anything goes wrong with it. Usually it's 100% their own stupid fault.
Whether this happens at 6am, 2pm, or 8pm is apparently all your fault, and you, as a developer, are expected not just to fix it, sacrificing your time and dropping what you're doing, but also to pay for it.
So let me just say:
2) if you want this, you're paying extra, and sorry to say, a LOT extra (10% of a month's consultancy rate for 2h + incidents, which DO NOT include new features; furthermore, expect that 2 out of 3 months you're paying for nothing more than checking that the monitoring system is working; if it is more, the price goes up)
3) if you disagree on this, we have contracts, AND labour law that disagrees with your assessment that "SWEs" aren't paid for 9-5 hours.
4) Obviously we can talk about this. But modifications to the software or improvements in general will come at a cost. I'm very willing to discuss, design, implement, even hire a team for you to do this, but we need to understand each other: it's not free.
I don't know how it works at Google, but I've seen these attitudes often at large companies. And ... euhm ... you can find some other poor victim to pull this crap with.
I think there are plenty of job sectors where the only options above entry level are middle management roles that have serious drawbacks like this. I accepted it because it seemed like the only way to climb the ladder.
I really miss that job, but the oncall was just fucking the worst, and it has caused me to turn down later jobs because I was burned by doing on-call.
Obvious PS: Google employee.
That being said, SRE does want to ultimately fix the problem (otherwise it's just going to page again, right?). But if that means tracking down a wrong config flag, cherry-picking a fix into a new release, etc. -- those are all things that can be done AFTER the bleeding is stopped.
Source: I'm an SRE
Reproducing the issue resulted in an immediate fix by the SWE.
Again, I understand why it is the way it is; it is just really interesting to see how specialized each engineer is in the grand scheme of things.
Abstractly, we got pushback from QA about this policy. After we had gathered a couple of concrete examples, it was clear that QA-as-gatekeeper when the factory is already in the worst possible state wasn't valuable. We do mandate the normal reviews but allow them after deployment. (You can imagine the conversations with the auditors about this as well, so we had to carefully document that this was our process and made the auditors audit our conformance to the process not to their own preconception of what it should be.)
It's especially not true when a big amount of money is at stake (like ads).
Edit: last sentence.
I'm trying to fully automate the testing side of the product, while making the process transparent enough to be amenable to manual intervention/quick tweaking.
After that, I'm hoping to move to automating the deployments, putting the server behind a load balancer, rollbacks, backup testing, all that good stuff that makes sure things only break where it can't hurt. Luckily the product is already pretty stable with the current dev/dogfooding-as-staging/prod model.
It's the most enjoyable work I've had so far. I think it mostly boils down to:
* I have clearly defined tasks, which I mostly plan in/negotiate with the product owner myself, so I have a large share of "ownership" of the dev/QA infrastructure improvements
* I work fully remotely and part-time, which gives me plenty of free time to socialize and decompress (we mainly communicate via Slack). I also have the option of working more hours, but I already doubled my
* I'm not currently on the critical path, so work feels low-stress
* I don't have to deal with under-defined business logic and product owners that do not want to commit to specifying (the product owner has transitioned from building the Java software to managing and subcontracting it, so is very knowledgeable about the product, and besides he's a great guy)
* I'm learning the tooling around the product through automating its development, testing and deployment (vs. learning it through adding crufty new features to it in a completely un-repeatable way, I'm looking at you never-again crashing Visual Studio Community and randomly-failing-builds Xamarin Forms).
Obligatory disclaimer: I'm one, and we're hiring ;)
Very common questions get at basic things:

* What are the pain points you've encountered running the system?
* What monitoring systems are you using?
* What are known failure modes where monitoring is silent?
* Have any agreements on availability or latency/performance been reached with users?
* What is the process to qualify, push, and roll back changes?
* What's the impact to the user / to the company if everything goes and stays down?
* What are your runtime dependencies, and how do you behave if they fail?
* Can you provide a review of recent monitoring alerts?
Most of the value in most PRR checklists really just gets at the above. Sometimes the answer really is "we don't know" or is incomplete (especially re: runtime dependencies), so follow-up questions can make discovery easier.
Often the SREs can figure this out and will look things up without even asking a question. Lots of formalized processes ask people to list what's needed to do this anyway (e.g. alert queues, mailing lists that receive alerts, etc.).
When I applied, I was rejected for not having enough experience. I find tooling and devops extremely fun, but I'm not quite sure how to develop my talent. Do you have any advice?