
Ask HN: Best approach to inheriting an out of control system? - jimmynopension
I can't find any good advice for managing what is a reasonably common
situation: "how to successfully fire fight".

On-prem, cobbled-together system, multiple parts that are completely unknown,
everyone is afraid of making a change, and if it goes down it might not come
back up.

Two devs were building a little project on top of a system and adding bits to
it as they got new ideas (both quite junior, quick-and-dirty types). Much of
the codebase they have never touched; it's just there, running with their
stuff on top.

They deployed it to an OVH remote managed rack (yeah, I know!) and offered it
to a client. The client loved it: hockey-stick growth, huge demand, totally
not production ready, but it's throwing off cash (lots and lots and lots).

The devs are burnt out and one is taking an offline holiday. They aren't
mature about the situation, and they are also a self-contained business unit
away from the rest of the company.

An emergency consultant and senior devs from the parent company started to
take a look over the evenings this week. The emergency plan is to get a copy
deployed and running in Azure so there are two instances, and if the OVH one
goes down again we can swap traffic (pending P&L sign-off).

Potential for a great product to die and cause reputational damage. Potential
to beat it into shape, turn it into a production product, and get to
multi-million revenue.

Thoughts?
======
austincheney
This is more a problem of failed leadership, possibly at multiple levels.

If you are the guy tapped to fix it then own it. The problems occurred before
you got there and you are dealing with a legacy mess, but real leadership
means having the humility to own that problem like it is all your own.

Don’t blame the junior developers for their lack of direction. Cover for them
and protect them. Earn their trust and guide them in a better direction. This
is where you will begin to turn the ship around.

Numerous people will be quick to bias you with their opinions. Always bear in
mind that bias is misleading. Be firm in forming original opinions from your
own observations.

Set a valid plan with realistic expectations. This problem didn’t happen in a
day and it won’t be fixed in a day. Define primitive metrics to identify
progress and business health. Use those metrics to determine if you are moving
in the proper direction and the speed of success. Be very transparent about
your measures to both your management and your team.

Frequently praise your people, work your ass off, and set a positive example.

------
rawgabbit
Your leadership will view the situation from a Risk-Reward lens. They know the
reward which is the revenue it brings in. They will ask you what is the worst
that can happen and then they will decide to continue or shut it down.

I personally think that lifting and shifting to Azure is a good thing. Next
you should get Microsoft Premier Support/Premier Field Engineering to analyze
what you have. They will give you a plan on how to improve reliability and
probably reduce your costs as well. You take that plan and explain to your
client the good news/bad news. The bad news is that Microsoft says we should
perform the following to improve reliability; the good news is that your
company will not charge the client extra; you will improve reliability for
"free". As reliability improves, your company can direct the senior devs to
start designing version 2.0 (maybe).

------
atsaloli
Max Kanat-Alexander's answer to the InfoQ interview[1] question "What is the
most terrible code that you ever encountered, and what was your approach to
refactoring it?" may be relevant.

In his book "Code Simplicity", Max has a checklist (summarized in point 4 of
https://techbeacon.com/app-dev-testing/rewrites-vs-refactoring-17-essential-reads-developers)
-- and that's the checklist referred to in the InfoQ interview.

1: https://www.infoq.com/news/2018/01/q-a-max-kanat-alexander/

------
backslash_16
It sounds like you are understandably worried about it going down. Standing up
another copy is a great idea, and a great exercise in making sure your company
understands how to build and deploy it from scratch.

When you do that you also need to figure out the data persistence layer. You
probably want to either share it between both instances of the application or
set up a backup/copy system so the version in the cloud has up-to-date data
and is more of a hot spare than a second set of infra lying around with an
empty DB.
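
If the data layer turns out to be a relational database (an assumption, the
OP doesn't say what it runs), the cheapest thing that keeps the spare from
sitting on an empty DB is a periodic dump-and-restore. A minimal sketch,
assuming PostgreSQL, with every hostname, credential and path made up:

    #!/usr/bin/env python3
    """Periodic dump-and-restore so the cloud copy stays a warm spare.

    Assumes PostgreSQL; connection strings and paths are placeholders.
    Proper streaming replication or a managed replica is better once
    things calm down; this is just the minimum that avoids an empty DB.
    """
    import datetime
    import subprocess

    SOURCE = "postgresql://app_user@ovh-db-host:5432/appdb"    # hypothetical
    TARGET = "postgresql://app_user@azure-db-host:5432/appdb"  # hypothetical

    def sync_once() -> None:
        stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
        dump_file = f"/var/backups/appdb-{stamp}.dump"

        # Custom-format dump taken from the primary (the OVH box).
        subprocess.run(
            ["pg_dump", "--format=custom", f"--file={dump_file}", SOURCE],
            check=True,
        )

        # Restore into the spare, dropping existing objects so it matches.
        subprocess.run(
            ["pg_restore", "--clean", "--if-exists", "--no-owner",
             f"--dbname={TARGET}", dump_file],
            check=True,
        )

    if __name__ == "__main__":
        sync_once()  # run from cron, hourly or nightly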

Moving on from there, if your only goal is to keep it alive while the devs are
on vacation, you should probably implement a deployment freeze. Yes, in an
ideal world you would make any and all changes to an infrastructure-as-code
template and redeploy, or at least change config files in a repo and redeploy
those, but it sounds like the application isn't that modular.

Most incidents are caused by change, so minimize that until the whole team is
back together, including the devs who wrote the service, to start making
improvements.

At the same time you need to know, or figure out, how you can change config to
keep it running. Ex: you need to update a cert thumbprint or change a timeout
value. It sounds like it's either running on bare metal or on VMs on a
physical server you own? If that's the case, maybe sshing into the boxes to
manually edit config is the least bad way of updating it (again - only until
this emergency situation is over). If you go the route of ssh-ing, at least
build a tool or script to update the key-value pairs in your config store so
you don't miss an angle bracket or quote and set your whole system on fire.
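
Something like this is enough (a minimal sketch, assuming the config happens
to be JSON; the path, key and value below are invented, and for YAML or XML
you'd swap the load/dump calls):

    #!/usr/bin/env python3
    """Edit one key in a config file and refuse to write anything that
    doesn't parse back. Usage (hypothetical):
        set_config.py /etc/myapp/config.json request_timeout_seconds 30
    """
    import json
    import shutil
    import sys

    def set_key(path: str, key: str, value: str) -> None:
        with open(path) as f:
            config = json.load(f)  # fails loudly if the file is already broken

        shutil.copy2(path, path + ".bak")  # keep the last known-good copy

        config[key] = value
        rendered = json.dumps(config, indent=2)
        json.loads(rendered)       # sanity check: what we write back parses

        with open(path, "w") as f:
            f.write(rendered + "\n")

    if __name__ == "__main__":
        set_key(sys.argv[1], sys.argv[2], sys.argv[3])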

If the developers weren't going on vacation my advice would be a lot
different. What I have written above is purely tactics to keep the system
alive and the business making money until it's a better time, personnel wise,
to improve the system.

Lastly, for some dev advice, get started on some end-to-end or "pinning"
tests. Yes, they are typically the most fragile and slowest type of tests, but
you can get a lot of safety and peace of mind from just a few of them. I
personally feel that in situations like these they are the best value per dev
hour spent.
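
As a rough illustration -- assuming the system exposes HTTP somewhere, and
with the base URL and endpoints below entirely invented -- a couple of
pinning tests can be as small as this (pytest + requests):

    import requests

    BASE_URL = "https://staging.example.internal"  # hypothetical staging copy

    def test_health_endpoint_is_up():
        resp = requests.get(f"{BASE_URL}/health", timeout=5)
        assert resp.status_code == 200

    def test_order_lookup_shape_is_stable():
        # Pin the response shape of one known-good record so a refactor
        # that changes the contract fails fast.
        resp = requests.get(f"{BASE_URL}/api/orders/12345", timeout=5)
        assert resp.status_code == 200
        assert {"id", "status", "total"} <= resp.json().keys()

The point is to assert what the system does today, not what it should do, and
to run these against a staging copy rather than the money-making box.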

If you're using Python, introduce mypy immediately. Same for JavaScript (use
TypeScript). Being able to lean on a type system (or, if you're in a static
language, the compiler) when refactoring or making sweeping changes is
incredibly helpful and lowers the chance of making a mistake by a large
amount.
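
A small illustration of what that buys you (the module name, function and
loose mypy flags here are just a suggested starting point, not anything from
the OP's codebase):

    # billing.py -- check with, e.g.:  mypy --ignore-missing-imports billing.py
    from typing import Optional

    def monthly_total(amount_cents: int,
                      discount: Optional[float] = None) -> int:
        """Apply an optional fractional discount to an amount in cents."""
        if discount is not None:
            amount_cents = int(amount_cents * (1 - discount))
        return amount_cents

    # mypy flags this at check time instead of at 3 a.m. in production:
    # monthly_total("4999")  # error: incompatible type "str"; expected "int"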

If you want more, I have tons more I can write on this subject but this is
getting quite long :)

------
nhayfield
rewrite it from scratch

~~~
jimmynopension
total greenfield, gather requirements, work out what the customer wants,
define the functionality of the existing system, consider architecture, etc.?
Could take months, and the current staff is 2 burnt-out jr devs.

This is a black box that is generating $10k a day.

Rewrite from scratch assumes you have good devs that know what needs to be
built; we can't even start looking under the hood of this right now. I'm more
interested in this from an ops / management perspective. Any guides / books /
war stories about this sort of thing?

~~~
AnimalMuppet
For $10k a day, you can hire a lot better than two burnt-out junior devs. If
you want that money to keep flowing, plow some of it back into getting several
more senior people on it.

~~~
rboyd
in the OP they mention they already have some much more senior devs looking at
it and their big idea is to move it to Azure.

------
thedevindevops
Step 1) Document the system (what currently exists) in more detail than you
think you'll need, paying specific attention to the system and module
boundaries (if the interfaces are done well this should be a fairly easy step)

Step 2) Modularise all the things, tuck all those fiddly bits behind the
abstractions you pulled out in Step 1

Step 3) Mock and unit test all the things. This is actually crucial because
this is where you clear up all the crud that built up due to developer
assumptions; you test and check all those Modules and verify the system is
doing not only what you think it's doing but _what it's supposed to be doing_

Step 4) Introduce a good Dependency Injection Framework

Step 5) (This is where you actually fix things) Now you can break up those
Modules and refactor their internals - even swapping modules out for entirely
freshly written ones thanks to that nice DI framework - with confidence that
you won't break the overall system. A rough sketch of such a seam follows
below.
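
Here is what Steps 3-5 can look like in practice (Python only because it's
already mentioned elsewhere in the thread; every class and method name here
is invented):

    from typing import Protocol
    from unittest.mock import Mock

    class InvoiceStore(Protocol):       # the boundary pulled out in Steps 1-2
        def save(self, customer_id: str, amount_cents: int) -> None: ...

    class BillingService:
        def __init__(self, store: InvoiceStore) -> None:  # injected dependency
            self._store = store

        def bill(self, customer_id: str, amount_cents: int) -> None:
            if amount_cents <= 0:
                raise ValueError("amount must be positive")
            self._store.save(customer_id, amount_cents)

    def test_bill_persists_invoice():
        store = Mock(spec=["save"])     # Step 3: mock the boundary
        BillingService(store).bill("cust-42", 4999)
        store.save.assert_called_once_with("cust-42", 4999)

Once internals hide behind seams like this, swapping in a rewritten module is
a constructor-argument change rather than open-heart surgery.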

