I expect it'd be more helpful to force it on managers in a management course. That way when ops says "This thing you paid an outside consultant to develop is fragile as hell and can fall over at any time" they'll be inclined to listen.
And on developers, so they realize that reliability is as much a requirement as any of the functional requirements they were given.
While it's normal and expected that 'Ops' or 'DevOps' or the artist previously known as 'Sys Admin' will be available off-hours for stuff they didn't build (read: emergencies that are not within their purview to prevent)... on the flip side, the way many companies relieve developers and product managers, who are often directly responsible for instability in the environment, of that burden of responsibility is so prevalent that it makes my blood boil.
Going back to this article, it's almost a celebration that DevOps folks have to shoulder that burden.
While it's normal to be on-call, being woken up all the time is a sign of a badly run infrastructure OR release/change-management practices that are rushed and/or feeble.
Everywhere I've worked I've stood up against this tendency, analyzing each issue that causes a page and seeing how it could be prevented. A quarter of the time it's technical: creating redundancy, deep-diving into an ongoing issue, doing load tests and capacity management, etc. The rest of the time it's political: oh, the new code caused memory to be sucked dry on the system, and this started precisely after the last release (proof: here's a graph from my Check_MK setup); oh, the devs ran a crap query on the Hadoop cluster again even though we warned them not to; etc. (A rough sketch of the release-vs-memory comparison I mean is at the end of this comment.)
I think if they want to prepare future DevOps students, rather than using PagerDuty as per the article, maybe they should give them shots of liquor and strong beer, their livers could use the preparation. It's an incredibly political role.
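For illustration only, here's the kind of before/after-release comparison I'm talking about. This isn't my actual Check_MK setup; the CSV layout, file name, and release timestamp below are all made up.

    # Toy sketch: compare average memory usage before and after a release,
    # from a hypothetical CSV export with "timestamp" (ISO 8601) and
    # "mem_used_mb" columns.
    import csv
    from datetime import datetime

    RELEASE_AT = datetime(2016, 5, 10, 22, 0)  # made-up release time

    def average(samples):
        return sum(samples) / len(samples) if samples else 0.0

    before, after = [], []
    with open("memory_samples.csv") as f:  # hypothetical metric export
        for row in csv.DictReader(f):
            ts = datetime.fromisoformat(row["timestamp"])
            mem = float(row["mem_used_mb"])
            (before if ts < RELEASE_AT else after).append(mem)

    print("avg before release: %.0f MB" % average(before))
    print("avg after  release: %.0f MB" % average(after))

If the "after" number jumps right at the release boundary, that conversation with the dev team gets a lot shorter.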
I've had success granting myself comp time when I've been up late or worked on the weekend to fix ops issues. E.g. if I'm up at 2 or 3, I come in at noon or 1pm, and if it takes more than 2 hours, I just take a full day. The one boss who complained shut up when I told him he had 3 options: (1) deal, (2) remove ops responsibilities, (3) I quit on the spot. It's absolutely necessary to make the costs of sloppy code really apparent to management.
Well said. I've done the same thing, and fought to make it standard department wide. I'm a dev, I -want- to do the right thing, but sometimes management insists on cutting corners. Fine, but the interest payments on that technical debt should be upon their head.
> DevOps is a set of practices, a philosophy aiming for agile operations, to expand the collaboration between developers and operation folks to make them work toward the same goal: contribute to the entire product life cycle, from design, development and shipping, up to the production stage. This is a radical shift from the industry norm of separate engineering and operations departments which often operate in opposition to each other.
Lots of words for "You can save money by shoving ops bullshit onto devs instead".
Oh right, it's all about collaboration and ownership and it's way more efficient this way. Just like open offices.
I am willing to acknowledge there are some pros to this system, but it's still screwing over devs by shoving extra (inconvenient) work onto their plates with no extra remuneration.
Bleh. As a DevOps guy who is a paramedic, I just cringed at the image of someone sprinting to the computer (I know it's exaggerated - or at least I hope so).
We don't run to cardiac arrests. Walk fast, with a purpose. I fail to see what production issue necessitates me running down the hall at 3am.
And DevOps students should be learning how to prevent (sorry, minimize) this, not conducting fire drills.
As a DevOps guy at a startup, I'm expected to respond to alerts within 1-3 minutes at my current employer, 24/7 when I'm on my rotation. We rotate devs in to allow them to try building something "devopsy", but they're never included in the on-call rotation.
There is some autoremediation in place, but the expectation for responsiveness is top down. Efforts to bring sanity to the situation go ignored. It's been made clear (not directly, but through process) that there is a line of people who could take the job if that level of responsiveness isn't delivered.
There is always someone willing to take your job. Most won't have what it takes. But managers will use that to cow you into a shitty bargaining position.
Look, I've been you. I took the crappy job because I figured it was the only one available. There are better companies out there. Just ask around. Get a few offers.
There is nothing normal about being on call with a 1-3 minute response time. I'm pretty sure it's illegal as well.
You should consider finding another job. I'm being serious. Yes there are generally internal SLAs and MTTR type metrics but setting a hard limit of 1 - 3 minutes to respond is not common practice at all. It speaks volumes about your management and perhaps their lack of experience/maturity. I'm speaking from a number of years of experience of working in this capacity.
Does anyone die if you don't respond? Because I can understand those kind of response times if you're a paramedic or ambo driver, but otherwise it sounds like power-tripping bullshit from bad managers.
There are more than two options. For instance, you can document this is an expectation of you, and you can document all of the times you are on call. Then you can sue them for unpaid overtime.
Yes exactly. Waking someone up at 3:00AM as a training exercise is ridiculous. That prepares you for nothing. Is this really part of the curriculum for a school that charges 17% of your first 3 years' salary after completing the course? Terrible.
"That is why there is no upfront cost to join Holberton school. We only charge 17% of your internship earnings and 17% of your salary over 3 years once you find a job. If the company you join agrees to pay us a placement fee, this percentage will be reduced."
Devops started as a message that developers were responsible for the solution they were building, not just for throwing builds over to deployment. The phrase turned into HR code for firefighters who hack scripts on production.
When it's built right, operations doesn't exist. On this theme - we're hiring. Central London. We need a platform/infrastructure engineer: solid unix, automation-centric, someone who will drive evolution of the infrastructure and release platform. https://clearmatics.workable.com/jobs/257440
Wait, this seems to ignore the fact that with good change controls and sound code, products do not just fail at 3am. If they call everyone at 3am without a failure the student could have prevented, that does not teach anything other than how to answer a phone. Instead, teach them to properly engineer and document their solution such that they aren't called at 3am.
This seems like a waste of a good night's sleep to me.
> Wait, this seems to ignore the fact that with good change controls and sound code, products do not just fail at 3am.
Because network outages never happen, disks never fail or fill up, memory is never an issue, programs always deal with only the data they were expected to, products never do more traffic than expected, and all infrastructure software ships completely bug-free.
If you aren't occasionally up at 3 AM fixing unexpected outages, then either you haven't deployed a project that requires uptime or you're paying someone else to do it for you.
> Because network outages never happen, disks never fail or fill up, memory is never an issue, programs always deal with only the data they were expected to, products never do more traffic than expected, and all infrastructure software ships completely bug-free.
Several of these things are exactly the kind of thing where the whole idea of devops just falls apart. If a disk fails, what use is a dev vs a good ops person?
The issue is not with problems, but with the arbitrary nature of this as a learning tool. Getting a page to fix an issue that you could have foreseen and prevented is dumb. Teach them to make resilient systems rather than to get out of bed.
I have never gotten up at 3am to fix an outage because outages do not impact systems that widely. If it is a network problem, let the network team do their job. All of your other problems are solved with multiple HA/load-balanced servers, monitoring, and proper testing.
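To be concrete about the "monitoring" part, even something as dumb as the sketch below catches the classic "disk filled up overnight" page before it becomes a 3am call. The paths and thresholds are made up, and a real setup would wire this into Nagios/Check_MK/whatever rather than run it by hand.

    # Minimal illustrative check: warn when filesystems cross a usage threshold.
    # Paths and limits are hypothetical; nonzero exit lets a scheduler or
    # monitoring agent turn this into an alert.
    import shutil
    import sys

    PATHS = {"/": 0.90, "/var": 0.85}  # alert when usage exceeds these fractions

    failed = False
    for path, limit in PATHS.items():
        total, used, _free = shutil.disk_usage(path)
        usage = used / total
        if usage > limit:
            print("WARN %s at %.0f%% (limit %.0f%%)" % (path, usage * 100, limit * 100))
            failed = True

    sys.exit(1 if failed else 0)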
I am the one behind this part of the curriculum; I've been SRE/DevOps for the past 5 years of my life. The project that the article refers to is about uptime: better uptime means a better grade.
We partnered with companies such as PagerDuty and Wavefront so that they have tools to help them keep their uptime, and we are guiding them to put everything in place so that their website/servers never go down. We never call them at 3AM and we actually never call them at all. However, we do have challenges where we simulate hardware failures, traffic spikes... (a toy example of the latter is sketched below).
Just to make things clear: the goal is not to wake up students at 3 AM but to get them ready for production, and this involves being on call and possibly getting paged at 3AM.
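To give a feel for the traffic-spike drills, a toy load generator along these lines is roughly the idea; the target URL, request count, and concurrency here are made up, and a real exercise would more likely use a proper load tool (ab, wrk, Locust, ...).

    # Illustrative traffic-spike drill: hammer a hypothetical student endpoint
    # with concurrent requests and report how many still return 200.
    import concurrent.futures
    import urllib.request

    TARGET = "http://localhost:8080/health"  # hypothetical endpoint
    REQUESTS = 500
    WORKERS = 50

    def hit(_):
        try:
            with urllib.request.urlopen(TARGET, timeout=5) as resp:
                return resp.status
        except Exception:
            return None

    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        codes = list(pool.map(hit, range(REQUESTS)))

    ok = sum(1 for c in codes if c == 200)
    print("%d/%d requests returned 200 under load" % (ok, REQUESTS))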
So the students engineered themselves to fail or the instructors gave them a terrible system that is bound to fail? Shit going sideways is a fact of life, but you can at least let people have the chance to prevent the problems before they exist.
While there is always someone else's mess you have to support, that does not seem like a great teaching moment to me.
On the flip side, with income-based repayment you are paying 10% of your discretionary income for 20 years or so. Depending on what your salary ends up being this might not be so bad. For instance, 17% for three years would be way less than my student loans. I do work for a non-profit organization though.