
Ask HN: How do you deal with operational work as a software engineer? - lamansion
As a software engineer doing infrastructure work I often find myself working on operational stuff (mostly chasing weird bugs, some on-call, etc.). In my position I am also expected to release features and do development too, but I feel like it's very difficult to focus because of all the operational issues I am dealing with. How are you guys dealing with that sort of work?
======
megaman22
Badly. We've lost a couple devs/ops people in the last year, and haven't
adequately replaced them. We're stretched way too thin and everyone is getting
very burned out.

I haven't done any significant development work in more than six months, just
chasing bugs, doing support, and fussing with email and meetings. It blows;
I've got to find a different job.

------
Jtsummers
Identify points to automate. Automate them. Get the automation peer reviewed
by the team. Establish testing for the automation. Deploy the automation.

If it's one-offs and not consistent misbehavior that the above can deal with,
improve testing infrastructure. If you're unable to hit your feature
development schedule, point to the problems in the present system and
infrastructure.

Ask your boss for clear priorities: Do they want a stable system, or more
features? If the present system is this unstable, then more features will only
exacerbate this. If they say they want both, and give them equal priority, ask
for a pay raise and search for new jobs.

------
kaikai
Chasing bugs and being on-call sound like core parts of a software engineer's
job, rather than operational work.

That said, some teams at my company are experimenting with having a week-long
rotation for "bread box" issues. Those include tending issues/PRs in open
source repos, handling bugs as they come in, etc. That frees up the rest of
the team to work on core feature work.

I like to keep a running list of smaller, non-urgent tasks that would
otherwise get neglected. When I have a long-running script or need to take a
break from another project, I can refer to the list.

~~~
mottomotto
Chasing bugs? Yes. Being on-call? No. Not unless you signed up for that. Too
many companies think they can just get PagerDuty going and sign up all their
engineering staff for operations duty. This is stupid for a number of reasons,
not least of which is that managed services get rid of most of the need for
this and are typically cheaper than developer time.

Do some developers on the team need to think about scale? Yes. Should all the
developers be on call because perhaps the company decided to roll its own
infrastructure and someone has to deal with the occasional server with full
disks? No.

~~~
twalla
The flipside to this is that being on call forces developers to care about
bugs in their code that cause operational headaches instead of just throwing
releases with varying degrees of test coverage over the fence to ops. Funny
how certain bugs that languished in the background get priority when the dev
responsible for that code's phone is the one that rings at 3am instead of some
poor schmuck on the ops team.

~~~
rocmcd
This exactly. If the developers responsible for the problem (and the fix)
aren't feeling the pain of being on-call, then nothing will change and the
fallout will be left on support/ops (who will usually find a poorly thought
out workaround).

Do developers need to be on-call to handle purely ops-related activities (low
disk space, high system load, etc)? Absolutely not. Should developers be
responsible for their "production-ready" code when it breaks? Definitely.

~~~
mottomotto
But the problem is if you assign a rotating duty to your engineering staff,
you as an engineer have no direct impact on how often you will be called due
to the half-assed work of other developers. It's a rocky road. Do this too
much and your staff will leave. I certainly will. Life is too short.

In short, we're all describing poor management issues. Signing up all the
developers for PagerDuty is a band-aid. So is pushing it all onto operations.
In both cases, management is making a choice to avoid dealing with something
that requires ongoing effort and time.

------
amriksohata
Never agree to be on call, and if you do, make sure you are being paid double
salary as a minimum. All modern science points to working unsociable hours as
a massive detriment to your health. Also, working Saturdays and Sundays does
not make your team more productive, because your staff will be tired the
following week. It's a false economy.

~~~
scarface74
So if your software needs to run 24 hours and something breaks with your
software, how do you avoid being on call?

A developer shouldn't be the first person called, there should be an
operations staff but they may have to escalate.

On the other hand, any time that a developer is routinely being called in the
middle of the night, there is usually either an issue with the software or the
infrastructure not being fault tolerant.

~~~
amriksohata
In the UK there are laws that let you opt out of being asked to work more than
a certain number of hours. The company should have an out-of-hours plan, but
most experienced developers will know very few things get resolved in the
middle hours of the night. Things need testing and reviewing, and sometimes
the solution is not simple. It is better, like you said, to have ops staff who
gather data and then pass it on when the devs are in fresh. However, if you
have, say, a big international sale happening in another timezone, then why
not just pay staff as a one-off to be around?

------
flukus
It depends on the frequency and nature of these issues, but it sounds like you
are experiencing technical debt and that you're paying for it with slower
development speed. Solving the stability issues should take precedence over
developing new features.

Is the stuff you have to intervene for under your control or external? If
you're relying on outside systems that are flaky then you need to make your
systems more resilient, things like automatically retrying a few minutes later
if some third party service is down and/or being more transactional so you can
deal with errors.
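
A minimal sketch of the "retry a few minutes later" idea, in Python. Everything
here is an assumption for illustration: `retry`, its parameters and the default
delays are hypothetical names, not from any particular library, and real code
would catch only the errors that are actually transient.

```python
import time


def retry(call, attempts=4, base_delay=60.0, sleep=time.sleep):
    """Run call(); on failure, wait and try again, doubling the wait each time.

    Hypothetical helper: names and defaults are illustrative only.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # assumption: narrow this to transient errors in real code
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Something like `retry(lambda: fetch_report())` (with `fetch_report` standing in
for whatever third party call is flaky) would then ride out a short outage
instead of paging someone.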

------
sqldba
We may need some clarity on the problem you are experiencing.

If the problem is that you can’t focus long enough to do non-operations work,
what’s the problem with that?

Are you unhappy you’re not coding? If so then ask for a new hire to take over
the part you don’t want or start looking for another job.

Are you unhappy that your boss is still pushing you for results and is an
utterly clueless idiot who has no idea where your time actually goes?

Fill us in.

------
watwut
I see chasing weird bugs as part of the development job, not something
separate. As long as it has weird bugs, the feature is not really done. As a
side note, developers who do "only development" and offload all weird bugs to
someone else tend to create less maintainable software over time - they lack
feedback and tend to favor whatever makes them produce new stuff faster over
what makes us all avoid those weird bugs.

As for infrastructure and first line support, lobbying management for more
people continuously is just about the only long term solution.

The other thing is planning and transparency, which helps with the above. Keep
a plan with realistic estimates and show it to management each time you talk
with them. Do your best work, definitely don't slack etc., but don't cut
corners to make something look done when it is not. Instead, move the dates in
the plan and send it to management again. The point is to convince them that
there is really more overall work than is possible for one person. (If they
get offended over that or treat you badly over that, find a new job.)

~~~
Kuraj
> As a side note, developers who do "only development" and offload all weird
> bugs to someone else tend to create less maintainable software over time

My problem is that I have become that someone else.

------
eitland
Time Management for System Administrators has some ideas I think:
[http://shop.oreilly.com/product/9780596007836.do](http://shop.oreilly.com/product/9780596007836.do)

(I haven't read this cover to cover, but I have more or less read his and
Christina J. Hogan's book cover to cover, I think, and I've also bought a
couple of copies of the above book to share.)

Summary of what I've learned and found useful from those and other resources:

Get someone to step in for you half the time. (If only to fill in a ticket or,
in a real emergency, call you.)

Manage expectations. (You don't expect hard interrupts except for
emergencies.)

Make support requests asynchronous. (Mail, support tickets - not calls. Even
when you (or someone else) are available for real time support, make chat the
preferred option.)

------
holydude
Yeah, I really get your suffering. I really hate when software engineers try to
meddle in the ops part. It usually ends up being a stupid pile of crap on top
of more crap. It is also sad to see companies pushing devs to do this instead
of giving it to someone who understands what they are doing.

------
pmontra
If I don't fix bugs and I don't help my customers with setting up servers and
the like I don't think I'll get new projects with them. Why would they trust a
developer that disappears? It's as simple as that.

Some of those activities are paid, but fixes close to a delivery are not and
it's OK. Usually I set up a maintenance contract for quick activities, like
small new features or investigating puzzling events (not necessarily bugs.) I
have a ticketing system to keep track of those activities. Customers have
access to it.

Obviously one has to make clear that maintenance will slow down development.

------
dozzie
What issues exactly are you dealing with? You only provided a very vague
description of this "operational stuff" you do and are disturbed with.

~~~
lamansion
Dealing with production problems, which may be functionality, performance, and
reliability related.

~~~
sjellis
Speaking as an ops person, my first thought is that you have technical or
architecture debt. Obviously, big and/or very rapidly growing systems will hit
limits and need constant attention, but these days designing most applications
to scale is not a problem.

The root cause of many operations issues that I see these days stems from one
or more deficiencies in the development process. I don't say "deficiencies in
developers": to get safe development at speed, you need a disciplined
development process with appropriate feedback mechanisms: unit tests,
integration tests, performance tests, static analysis, code review etc. The
default state of code is "buggy", because humans are not perfect.

------
aprdm
You need a better system in place to prevent bugs from happening.

- Separation between development / staging / production environments.

- Integration tests.

- Service / System Metrics.

- Central logging.

- High availability.

- Alerts.

When you have a solid deployment pipeline things don't usually break. Errors
and regressions are caught in the staging part of the deployment pipeline and
errors in production can be rolled back automatically (and then you add an
integration test for the regression!)

All this devopsy work at my company is done by software engineers with advice
from systems engineers. And we do it because neither of the groups wants to get
called on the weekends :) It has been working really well. Last year we had 0
calls. Before we had this in place things would break on a weekly basis.

You can build all of what I mentioned with OSS like:

- Ansible (deployment)

- Jenkins (ci)

- ELK stack (metrics / logging)

- Zabbix (system metrics)

This system has been serving us, on premises, without much maintenance.

------
thisisit
> As a software engineer doing infrastructure work

So you are into devops but doing more ops than dev? This doesn't sound like a
problem until your team's agenda and objective is to deliver more ops work.

------
bradhe
Treat your operations work like your engineering work. Over time things get a
lot better.

------
akulbe
It's hard to read this and not want to offer help. I don't know if this is the
best venue though.

