
Well, in many places DevOps is implemented as "developers on PagerDuty". When I (the developer) have to be on call for 7-day rotations, phone by the bedside, paged at all hours, then I'm most definitely acting as operations, which is probably NOT what I signed up for.

And, contrary to the stated intentions, I've directly observed developers making crappy, band-aid fixes to ongoing production problems in the interest of "making the pages stop". This is the mindset when you are on call and being paged at all hours.

In theory, DevOps is supposed to put those that can best fix things closest to the problems, but in reality a slight separation from the firestorm of ops actually produces better, more thoughtful solutions in the long run.

The best balance is to have a first-tier Ops on-call and a second-tier engineering on-call, with any alerting issue getting attention within 24 hours and moving to the front of the work queue. But indiscriminately assigning everyone "pager-duty" rotations leads to lower-quality solutions in the end.

As the guy who's usually on pager rotation (and too often with far too few bodies to share it), I disagree. I wrote a detailed comment a few days ago explaining my rationale here, in conjunction with overtime / off-the-clock responsibilities:


• It increases pager coverage, and reduces any one person's pager obligations. Simply having pager anticipation is a mental burden after a while.

• It creates a stronger incentive for response procedures: what are the expected obligations of response staff, what's considered sufficient effort, what's the escalation policy, who is expected to participate, what are consequences of failure to respond?

• Cross-training. Eng learns ops tasks, ops has a better opportunity for learning what eng is up to and deals with.

• It makes engineering more aware of the consequences of their actions. Is insufficient defensive engineering causing outages (say, unlimited remote access to expensive operations)? Are alerts, notification mails, and/or monitoring/logging obscuring rather than revealing anomalous conditions? Are mechanisms for adjusting, repairing, updating, and/or restarting systems complex and/or failure-prone themselves?

My experience at one site did not endear me to the organization (I left it shortly afterward). I was a recent staff member (and hence unfamiliar with policies, procedures, and capabilities) when systems went down starting at 2am, and I was unable to raise engineering or my manager. The response at the next staff meeting to my observation of this was pretty much "so what".

Note that what I'm calling for isn't for eng to be the sole group on pager duty, but for eng and ops to share that responsibility.


I'm glad you have had a positive experience, but it feels like your outlook is unique among the many developers I talk to daily. Could be selection bias, though! Good things to think about.


To be clear: I'm generally on the ops/systems side, not engineering / development.


Giving everyone pager duty can lead to higher quality solutions. The band-aid fixes crop up when ownership of a whole system eventually spreads too thinly.

Within the right framework, keeping everyone on pager rotation can lead to much smoother operations, because everyone stays familiar with the system as a whole. This was going around recently, and captures the essence of the philosophy: http://catenary.wordpress.com/2011/04/19/naurs-programming-a...


In my experience, it also leads to better solutions because devs who don't get woken by issues with their own code are people who don't particularly care about such faults. I've done on-call before where I've begged the devs to fix issues because they were waking me up needlessly. The devs were nice, but somewhat lazy, and my fix wasn't on their radar. Put them on call, and all of a sudden it's more important to fix.

At one place I worked we had a two-person support shop. We would claim time and again that this or that affected customers or made support hard. The devs would pick and choose what was fun to work on. I ended up leaving and the other guy went on a prearranged month-long vacation. Everyone else had to pick up support (~5 devs) for a month, and I'm told that they had so much trouble with the normal support load that development actually stopped for that month. Apparently when the other guy got back, they started listening a bit more to his concerns, having had a taste of what happens on the pointy end.

In a similar vein, there's a wine distributor where all employees spend their first week half on the phones and half in the packing department, to give everyone a feel of what the core function is and what customers complain about. The guy telling me said that everyone gets the treatment, except the new CEO, who got away with only doing a day rather than a whole week.


Sounds like somebody in the hierarchy doesn't quite "get it".

