I expect it'd be more helpful to force it on managers in a management course. That way when ops says "This thing you paid an outside consultant to develop is fragile as hell and can fall over at any time" they'll be inclined to listen.
And on developers, so they realize that reliability is as much a requirement as any of the functional requirements they were given.
While it's normal and expected that 'Ops' or 'DevOps' or the artist previously known as 'Sys Admin' will be available off-hours for stuff they didn't build (read: emergencies that are not within their purview to prevent)... on the flip side, the way many companies relieve developers and product managers, who are often directly responsible for instability in the environment, of that burden of responsibility is so prevalent that it makes my blood boil.
Going back to this article, it's almost a celebration that DevOps folks have to shoulder that burden.
While it's normal to be on-call, being woken up all the time is a sign of a badly run infrastructure OR release/change-management practices that are rushed and/or feeble.
Everywhere I've worked I've stood up against this tendency, analyzing each issue that causes a page and seeing how it could be prevented. A quarter of the time it's technical: creating redundancy, deep-diving into an ongoing issue, doing load tests and capacity management, etc. The rest of the time it's political: oh, the new code caused memory to be sucked dry on the system, and this started precisely after the last release (proof: here's a graph from my Check_MK setup); oh, the devs ran a crap query on the Hadoop cluster again even though we warned them not to; etc. (A rough sketch of the release-vs-memory comparison I mean is at the end of this comment.)
I think if they want to prepare future DevOps students, rather than using PagerDuty as per the article, maybe they should give them shots of liquor and strong beer, their livers could use the preparation. It's an incredibly political role.
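For illustration only, here's the kind of before/after-release comparison I'm talking about. This isn't my actual Check_MK setup; the CSV layout, file name, and release timestamp below are all made up.

    # Toy sketch: compare average memory usage before and after a release,
    # from a hypothetical CSV export with "timestamp" (ISO 8601) and
    # "mem_used_mb" columns.
    import csv
    from datetime import datetime

    RELEASE_AT = datetime(2016, 5, 10, 22, 0)  # made-up release time

    def average(samples):
        return sum(samples) / len(samples) if samples else 0.0

    before, after = [], []
    with open("memory_samples.csv") as f:  # hypothetical metric export
        for row in csv.DictReader(f):
            ts = datetime.fromisoformat(row["timestamp"])
            mem = float(row["mem_used_mb"])
            (before if ts < RELEASE_AT else after).append(mem)

    print("avg before release: %.0f MB" % average(before))
    print("avg after  release: %.0f MB" % average(after))

If the "after" number jumps right at the release boundary, that conversation with the dev team gets a lot shorter.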
I've had success granting myself comp time when I've been up late or worked on the weekend to fix ops issues. E.g. if I'm up at 2 or 3, I come in at noon or 1pm, and if it takes more than 2 hours, I just take a full day. The one boss who complained shut up when I told him he had 3 options: (1) deal, (2) remove ops responsibilities, (3) I quit on the spot. It's absolutely necessary to make the costs of sloppy code really apparent to management.
Well said. I've done the same thing, and fought to make it standard department wide. I'm a dev, I -want- to do the right thing, but sometimes management insists on cutting corners. Fine, but the interest payments on that technical debt should be upon their head.
> DevOps is a set of practices, a philosophy aiming for agile operations, to expand the collaboration between developers and operation folks to make them work toward the same goal: contribute to the entire product life cycle, from design, development and shipping, up to the production stage. This is a radical shift from the industry norm of separate engineering and operations departments which often operate in opposition to each other.
Lots of words for "You can save money by shoving ops bullshit onto devs instead".
Oh right, it's all about collaboration and ownership and it's way more efficient this way. Just like open offices.
I am willing to acknowledge there are some pros to this system, but it's still screwing over devs by shoving extra (inconvenient) work onto their plates with no extra remuneration.
Bleh. As a DevOps guy who is a paramedic, I just cringed at the image of someone sprinting to the computer (I know it's exaggerated - or at least I hope so).
We don't run to cardiac arrests. Walk fast, with a purpose. I fail to see what production issue necessitates me running down the hall at 3am.
And DevOps students should be learning how to prevent (sorry, minimize) this, not conducting fire drills.
As a DevOps guy at a startup, I'm expected to respond to alerts within 1-3 minutes at my current employer, 24/7 when I'm on my rotation. We rotate devs in to allow them to try building something "devopsy", but they're never included in the on-call rotation.
There is some autoremediation in place, but the expectation for responsiveness is top down. Efforts to bring sanity to the situation go ignored. It's been made clear (not directly, but through process) that there is a line of people who could take the job if that level of responsiveness isn't delivered.
There is always someone willing to take your job. Most won't have what it takes. But managers will use that to cow you into a shitty bargaining position.
Look, I've been you. I took the crappy job because I figured it was the only one available. There are better companies out there. Just ask around. Get a few offers.
There is nothing normal about being on call with a 1-3 minute response time. I'm pretty sure it's illegal as well.
You should consider finding another job. I'm being serious. Yes there are generally internal SLAs and MTTR type metrics but setting a hard limit of 1 - 3 minutes to respond is not common practice at all. It speaks volumes about your management and perhaps their lack of experience/maturity. I'm speaking from a number of years of experience of working in this capacity.
Does anyone die if you don't respond? Because I can understand those kind of response times if you're a paramedic or ambo driver, but otherwise it sounds like power-tripping bullshit from bad managers.
There are more than two options. For instance, you can document this is an expectation of you, and you can document all of the times you are on call. Then you can sue them for unpaid overtime.
Yes exactly. Waking someone up at 3:00AM as a training exercise is ridiculous. That prepares you for nothing. Is this really part of the curriculum for a school that charges 17% of your first 3 years' salary after completing the course? Terrible.
"That is why there is no upfront cost to join Holberton school. We only charge 17% of your internship earnings and 17% of your salary over 3 years once you find a job. If the company you join agrees to pay us a placement fee, this percentage will be reduced."
Devops started as a message that developers were responsible for the solution they were building, not just for throwing builds over to deployment. The phrase turned into HR code for firefighters who hack scripts on production.
When it's built right, operations doesn't exist. On this theme - we're hiring. Central London. We need a platform/infrastructure engineer: solid unix, automation-centric, someone who will drive evolution of the infrastructure and release platform. https://clearmatics.workable.com/jobs/257440
Wait, this seems to ignore the fact that with good change controls and sound code, products do not just fail at 3am. If they call everyone at 3am without a failure the student could have prevented, that does not teach anything other than how to answer a phone. Instead, teach them to properly engineer and document their solution such that they aren't called at 3am.
This seems like a waste of a good night's sleep to me.
> Wait, this seems to ignore the fact that with good change controls and sound code, products do not just fail at 3am.
Because network outages never happen, disks never fail or fill up, memory is never an issue, programs always deal with only the data they were expected to, products never do more traffic than expected, and all infrastructure software ships completely bug-free.
If you aren't occasionally up at 3 AM fixing unexpected outages, then either you haven't deployed a project that requires uptime or you're paying someone else to do it for you.
> Because network outages never happen, disks never fail or fill up, memory is never an issue, programs always deal with only the data they were expected to, products never do more traffic than expected, and all infrastructure software ships completely bug-free.
Several of these things are exactly the kind of thing where the whole idea of devops just falls apart. If a disk fails, what use is a dev vs a good ops person?
The issue is not with problems, but with the arbitrary nature of this as a learning tool. Getting a page to fix an issue that you could have foreseen and prevented is dumb. Teach them to make resilient systems rather than to get out of bed.
I have never gotten up at 3am to fix an outage because outages do not impact systems that widely. If it is a network problem, let the network team do their job. All of your other problems are solved with multiple HA/load-balanced servers, monitoring, and proper testing.
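To be concrete about the "monitoring" part, even something as dumb as the sketch below catches the classic "disk filled up overnight" page before it becomes a 3am call. The paths and thresholds are made up, and a real setup would wire this into Nagios/Check_MK/whatever rather than run it by hand.

    # Minimal illustrative check: warn when filesystems cross a usage threshold.
    # Paths and limits are hypothetical; nonzero exit lets a scheduler or
    # monitoring agent turn this into an alert.
    import shutil
    import sys

    PATHS = {"/": 0.90, "/var": 0.85}  # alert when usage exceeds these fractions

    failed = False
    for path, limit in PATHS.items():
        total, used, _free = shutil.disk_usage(path)
        usage = used / total
        if usage > limit:
            print("WARN %s at %.0f%% (limit %.0f%%)" % (path, usage * 100, limit * 100))
            failed = True

    sys.exit(1 if failed else 0)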
I am the one behind this part of the curriculum; I've been SRE/DevOps for the past 5 years of my life. The project that the article refers to is about uptime: better uptime means a better grade.
We partnered with companies such as PagerDuty and Wavefront so that they have tools to help them keep their uptime, and we are guiding them to put everything in place so that their website/servers never go down. We never call them at 3AM and we actually never call them at all. However, we do have challenges where we simulate hardware failures, traffic spikes... (a toy example of the latter is sketched below).
Just to make things clear: the goal is not to wake up students at 3 AM but to get them ready for production, and this involves being on call and possibly getting paged at 3AM.
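To give a feel for the traffic-spike drills, a toy load generator along these lines is roughly the idea; the target URL, request count, and concurrency here are made up, and a real exercise would more likely use a proper load tool (ab, wrk, Locust, ...).

    # Illustrative traffic-spike drill: hammer a hypothetical student endpoint
    # with concurrent requests and report how many still return 200.
    import concurrent.futures
    import urllib.request

    TARGET = "http://localhost:8080/health"  # hypothetical endpoint
    REQUESTS = 500
    WORKERS = 50

    def hit(_):
        try:
            with urllib.request.urlopen(TARGET, timeout=5) as resp:
                return resp.status
        except Exception:
            return None

    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        codes = list(pool.map(hit, range(REQUESTS)))

    ok = sum(1 for c in codes if c == 200)
    print("%d/%d requests returned 200 under load" % (ok, REQUESTS))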
So the students engineered themselves to fail or the instructors gave them a terrible system that is bound to fail? Shit going sideways is a fact of life, but you can at least let people have the chance to prevent the problems before they exist.
While there is always someone else's mess you have to support, that does not seem like a great teaching moment to me.
On the flip side, with income-based repayment you are paying 10% of your discretionary income for 20 years or so. Depending on what your salary ends up being this might not be so bad. For instance, 17% for three years would be way less than my student loans. I do work for a non-profit organization though.