
Ask HN: Ever worked with a service that can never be restarted? - scandox
I'm currently working with a legacy system. One element of it has a loaded config in memory, but the physical config was accidentally overwritten and there are no backups. In addition, the source code for this compiled binary has also been lost in the mists of time.

The service has a current uptime of 55 months. The general consensus therefore is that as long as it is never restarted it will continue to perform its function until a replacement can be put in place. Which seems a little fatalistic to me...

Has anyone experience of doing something sensible in a similar situation?
======
verytrivial
My advice:

1\. Suggest that the work to replace it is prioritized commensurate with the
business impact caused whilst re-establishing service _as if it went down
right now_

2\. Remind them that it will go down at the worst possible time.

3\. Ensure your name is attached to these two warnings.

4\. Promise yourself you wouldn't run your business this way.

5\. Get on with your life.

~~~
jlokier
I would add to that:

\- Make a plan of exploratory steps that could trigger a failure, in order of
increasing risks.

\- Then, if necessary, get a formal sign-off from someone with authority
(perhaps a director of the company) to proceed with each step.

Steps like that, which are usually very low risk but do risk taking down an
unknown service, might include things like (just ideas from other comments):

\- Logging in to the server at all.

\- Splicing something into the network switch.

\- Cloning filesystems or disk images.

\- Cloning the process memory or system memory.

\- Cloning the image or other things if it's in a VM.

\- Running strace, gdb, or packet tracing.

\- Swapping the power source live.

These days, it is often possible to transfer a full working system into a VM
on a more modern, powerful machine, and after doing that, it's a great relief
to everyone because it's no longer hardware dependent.

I've done that with some legacy systems that were originally on bare metal,
and are now still running nearly 20 years later in reliable VMs, with no
change to the running software. _Usually_ running faster and with more memory
doesn't break a working system.

But doing it on a service that can't be taken down even for a moment is quite
an interesting adventure! :-) It is possible, but technically challenging, as
long as you can obtain disk and memory images from the live system and then
capture changes fast enough to perform a hot transfer.

~~~
Apofis
I read earlier that some people were putting legacy apps on VMs, but I didn't
realize that you can transfer a full working system into a VM including
memory... which is pretty much a full state snapshot. Awesome. This is this
guy's solution right there, until a replacement is in place.

------
dijit
Actually, I have experience here.

The problem is that anything you do that's potentially destructive in service
of getting the system to be more sustainable is going to be met with heavy
criticism. So you must be careful; the company has accepted the risk that
they're in, and you'll most likely have to contend with that.

First things first: is it a VM or a physical machine? Things get a little
easier if it's a VM because there might be vMotion or some kind of live-
migration in place meaning hardware failure might not actually bring the
service down.

Next thing is you absolutely have to plan for failure because the one thing
that I learn as I learn more about computers is that they're basically working
"by accident", so much error correction goes into systems that it absolutely
boggles my mind and failures still get through. So, it's certainly not a
question of "if" but "when" and plan for it being soon.

Now, the obvious technical things are:

* Dump the memory

* Grab any files it has open from its file descriptors (/proc/<pid>/fd/*). Your config file might be there... but somehow I doubt it.

* Attach a debugger and dump as much state as possible

Be sure to cleanly detach: [https://ftp.gnu.org/old-
gnu/Manuals/gdb/html_node/gdb_22.htm...](https://ftp.gnu.org/old-
gnu/Manuals/gdb/html_node/gdb_22.html)

Don't use breakpoints! They will obviously halt execution.
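
A rough sketch of those first read-only steps. The PID and output paths are placeholders (the script defaults to its own shell PID just so it is runnable as-is), and even read-only poking deserves sign-off on a box like this:

```shell
#!/bin/sh
# Read-only first pass over a live process. Defaults to this script's own
# shell so the sketch runs anywhere; pass the real PID as $1.
pid=${1:-$$}

# Memory regions: reading /proc does not touch the process at all.
cat "/proc/$pid/maps" > "/tmp/maps.$pid"

# Open file descriptors: a lost config may still be reachable here.
ls -l "/proc/$pid/fd" > "/tmp/fds.$pid"

# Only after sign-off: gcore (ships with gdb) pauses the process briefly
# while it writes a full core dump.
if command -v gcore >/dev/null 2>&1; then
    gcore -o "/tmp/core.$pid" "$pid" || echo "gcore needs ptrace permission"
fi
```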

If it was my service I would also capture packets and see what kind of
traffic it receives most commonly; and I would make some kind of stub
component to replace it before working on the machine. Just in case I break it
and it causes everything to go down that depends on it.

But, this is a horrible situation, reverse engineering this thing is going to
be a pain. Good luck.

~~~
scandox
> because the one thing that I learn as I learn more about computers is that
> they're basically working "by accident"

So true. Thanks this is good advice. As it happens it is a physical server.

I'm not actually mandated to do anything except replace it with a comparable
service but this makes me feel like I'm racing against time!

~~~
arethuza
Have you tried decompiling the executable to see if any sensible information
can be gained that way?

~~~
scandox
Everyone's afraid to do anything on that machine at all. So even getting
agreement to login or install software or run tcpdump or whatever is fraught.

~~~
arethuza
What I was meaning was that if you can get a hold of the binary that is
running you could take it to some other environment and have a look (starting
with a hex dump and working up from there...).

Mind you if things are that bad then maybe I'd follow my own advice (in
another comment) and stay well away from the currently running thing.

~~~
celticmusic
this would certainly help recreate the configuration.

------
aequitas
At a previous job I once ssh'd into a legacy system which had a motd in the
vein of: "Don't ever reboot this system, if you do, I will find you and do
horrible things to you". No explanation of why. My opinion was that the
person who dared to leave a system in this state was the one who needed the
horrible things done to them instead. Needless to say, he was already fired.

We started working on a solution to run parallel to the existing one, which
would receive shadow traffic so we could observe behaviour, find all edge
cases and put them in a testsuite for the new solution. After we were
confident that our testsuite contained most of the important behaviour we
switched traffic to our new solution, keeping the old online just in case we
needed to switch back.

The key is monitoring and learning the expected behaviour of the connected
systems, to sense if that behaviour is not as expected and to be able to act
upon that as soon as possible.

~~~
ratel
I think your solution is fine.

Just another perspective on that server message: this server might actually be
running a real-time legacy interface to one of your biggest customers who pay
a 6 figure premium to keep this old thing running, just so they have time to
migrate. Which they have been trying to do since the 80's, hence the fact that
few people remember why it is there.

I'm not saying this happened here, but it happens a lot more than any of us
want. Just keep in mind there might actually be a good reason for the warning,
although I would object to the wording. First heed the warning. Then figure
out why it is there.

~~~
aequitas
> just so they have time to migrate. Which they have been trying to do since
> the 80's, hence the fact that few people remember why it is there.

The questions I would ask are: what is the insurance if the system fails due
to natural causes (hardware failure, power outage, you name it)? And what is
the cost/benefit balance of having this unpatched security hole connected to
the internet (which was the reason we discovered this).

If a system is so important that such a warning is deemed appropriate, imho
it's just laziness and carelessness from the warning's writer, and they are
shifting their problem onto the next person that stumbles upon the system.

At least give some information on why it is important and at what time or
under which conditions the warning can be considered expired. Just like
commenting your code is important because you never know who is having to make
sense of it in a few months' time (especially true if that person turns out
be you :)).

~~~
ratel
To be sure I'm not defending the practice, just shedding another light on it.

In answer to your first question: there might be no insurance against the
system failing. The honest answer to what to do then, because someone always
needs to ask, is: Panic! It might not even be unrecoverable, just really
expensive and time-consuming to recover.

The follow-up question would be: yes, it is a risk, and yes, it is a possible
disaster, so what do you want to do about it? If your answer is anything more
than "we should not have taken on that risk in the first place" or "let's bet
the company's future on the fact that we can fix this", then it might actually
be interesting to listen to. It seems from your comment you resolved it.

As for messaging: I do not agree you should explain the situation on the
server. Just let people know they should not touch this, ever. As soon as you
explain why, people will assume their reason to do so will trump whatever
reason you gave. If they need to think of all possible disasters that might
happen, it has more impact than the one you can describe.

Messages like "Before doing anything on this server contact Bob" will
eventually lead to Bob receiving a message: "We have done this or that. Just
letting you know, but it was after office hours", which he will probably see
while rushing into the office in the middle of the night because the server is
not working anymore. With the other type of message, "Don't reboot this server
ever! For more information contact Bob", Bob changed, because he spent
considerable time saying no and explaining the situation to the sysops team,
their manager, their manager's manager, etc., who all thought their priority
must trump Bob's. Bob might still be working for the company. He might have
tried the better part of his career to get this stain resolved. Nothing bad
about Bob.

------
sdmike1
There is a tool used in malware analysis and computer forensics called
Volatility[0]. It has some very powerful analysis tools and works on Linux,
Mac, and Windows. In your case, its ability to dump the memory of a running
without messing with the process state [1] may be very helpful! It also has
the ability to run a Yara scan against the dumped memory which could let you
find the region of memory containing the config file (so long as you know some
of the strings in it).

Hope this helps!

[0]
[https://github.com/volatilityfoundation/volatility](https://github.com/volatilityfoundation/volatility)

[1] [https://www.andreafortuna.org/2017/07/10/volatility-my-
own-c...](https://www.andreafortuna.org/2017/07/10/volatility-my-own-
cheatsheet-part-3-process-memory/)

~~~
john_moscow
I would advise exercising caution though. Even if the tool works perfectly
fine, the service going down anywhere within days to weeks after you captured
its state would be seen by the management as your fault. They won't care for
the explanations; they will just need a scapegoat, and that would be whoever
touched the thing last.

That said, there are plenty of legitimate (albeit low-probability) reasons for
the dumping tool to screw up the service. Slightly affecting memory timing and
triggering some rare race condition, using up more RAM than usual (or
increasing the HDD utilization) and again triggering some conditions that
wouldn't happen normally. You name it.

If it was me, I would try to run the binary on another machine and do a "clean
room" recreation of the config file. That said, service without sources and
missing configs without backups indicate severe organizational problems, so I
would probably not want to work in such a place for a long time, unless I was
hired by the CEO specifically to rectify things.

~~~
marcus_holmes
I agree, there are actually two problems here:

One is technical, and to do with the service and how to deal with it.

One is political, and to do with who gets the blame when the service
inevitably fails.

OP needs to work out if they're being tasked with the job as the answer to the
technical problem or the political problem. Because management could be very
aware that this is a ticking bomb with no technical solution, and have
appointed OP as a scapegoat to take the blame when it blows up.

~~~
ashish5887
What good is a memory dump if you don't have the codebase to navigate?
Without the code and the structure of how it's managed, a memory dump would
just look like an alien language.

~~~
dsr_
The phrase you want to google is "reverse engineering from memory dump".
Prepare to be astounded.

You need to know how machine instructions work, details of the operating
system, and a bunch of other things... but any number of copy-protection
systems have been bypassed this way. Any number of devices have had their
firmware dumped, their memory scanned, and the resulting data analyzed to make
patches or build compatible systems.

------
phillipseamore
Dump memory and see if you can retrieve the config that's running. If someone
has a clue about how the config should be, you could probably get all the
elements of it from simply running 'strings' on the binary. You could also
disassemble the binary. Possibly the deleted config file could be salvaged
from disk.
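
A rough sketch of the 'strings' approach, run against a copy of the binary on another machine, never the live host. The binary path defaults to /bin/sh here purely so the sketch runs; point it at the recovered legacy binary in practice:

```shell
#!/bin/sh
# Mine a *copy* of the binary for config-related strings on another
# machine. Defaults to /bin/sh just so the sketch is runnable.
bin=${1:-/bin/sh}

# Config keys, file paths, and hostnames often survive verbatim inside
# compiled binaries.
strings -- "$bin" | sort -u > /tmp/candidate-strings.txt

# Narrow the list down to likely config material.
grep -Ei 'conf|\.ini|\.cfg|host|port|path' /tmp/candidate-strings.txt
```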

As is, this is all relying on the HW not giving up and the UPS working in a
power failure!

------
chha
I once worked with a client in such a situation. They were in the process of
building a huge oil rig, costing somewhere north of $5.5bn USD. The basic
premise was that a previous vendor had configured a big documentation system
running on a soon-to-be-outdated Windows Server version ages ago, along with a
"kind of" API allowing the shipyard to send information in the form of
equipment/construction metadata, documents of various kinds and similar stuff.

The main challenge was that nothing could be resent if a transmission
failed, unless there was an actual change on the shipyard side, or by a
highly complex and manual method. There was no source code to anything, and
while the enterprisey system was a fairly standard off-the-shelf type thing,
the API was completely customized. Nobody knew anything about how it talked to
the system, dependencies or anything else. Changing anything was out of the
question, as everything had been defined in contracts and processes making
waterfall seem agile.

The team I was in was basically there to set up a new application in a
separate environment, so that we could migrate and replace the existing setup
once the shipyard had handed over all the information. In the end everything
was kind of anti-climactic, with everything working as expected for as long as
it needed to.

~~~
celticmusic
> In the end everything was kind of anti-climactic, with everything working as
> expected for as long as it needed to.

I often tell people that great software development is boring.

------
ratel
I have some experience in this field.

First of all you are there to build a replacement service. As long as you do
that in the allotted time, what happens to the old one should not bother you.
Focus
on your part of the solution.

The fact that you do seem to worry suggests you need something from the old
service. Do you have a good design for the new service? Are there details of
the old service that you do not know about? It won't be the first time a
company has an application that runs complex calculations, but nobody knows
how it is done. We actually built black-box applications in the past to
replace calculation modules on obsolete hardware.

If you need information from the old system you may have to resort to cloning
the input and output, preferably at the network level, so you don't have to
mess with the existing service.

If the replacement service is years out, you might suggest the company put in
a red team: another person or persons who work on finding out how the service
works by running a duplicate on another system (rent one if necessary) and
poking at it, seeing if they can get comparable results quicker than the
replacement service. But that is not you.

There are a couple of things you need to have in place before going live. Make
sure you can clone input and output so you can run the old service next to the
new one for a while. You don't want to stop the old service the moment you go
live, because if anything goes wrong you need to be able to switch back. Even
then, starving the old service of input might have adverse effects, so cloning
is better. If the old service uses external resources, like a database, files,
etc., make sure you do not interfere with file locks, sequence numbers, etc.,
when running in parallel.

------
ThomasRedstone
A lot of talk about what to do yourself, I'd say do nothing yourself.

Find an expert who has masses of experience who can consult on it.

This isn't a good time to be learning and testing those lessons.

~~~
faeyanpiraat
This is the solution in my experience as well.

Doing even minor things that could disrupt the execution for any length of
time (attaching a debugger, installing a packet sniffer that blocks network
traffic for a second or causes a packet loss, etc.) could deadlock the
application.

Wouldn’t risk it.

~~~
jmkni
The only thing I would risk is copy/pasting the binary/executable that is
running to somewhere else so I could try and decompile it. Anything beyond
that is a nope.

------
weego
My experience in this situation is that the more you decide to be the hero and
take on things that everyone else is washing their hands of, when you have no
reason to above anyone else, the more likely it is that you'll also be
shouldered with responsibility for things going wrong that were similarly
nothing to do with you in the first place.

Prioritise a replacement and don't decide to try smart things with the running
one.

------
jacquesm
If it is running you can still get at the binary file through /proc and
recover it. The file system will only really delete a file when there are no
more users, and a running process counts as the file being in use (so that
pages can be paged back in from it if needed).
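
A minimal sketch of pulling the binary out through /proc. The PID is a placeholder (the script defaults to its own shell PID so it is runnable as-is):

```shell
#!/bin/sh
# Copy the executable image of a running process out of /proc so it can be
# inspected elsewhere. Pass the service's PID as $1.
pid=${1:-$$}

# /proc/<pid>/exe stays readable even if the on-disk file was deleted.
cp "/proc/$pid/exe" "/tmp/recovered-binary.$pid"
chmod a-x "/tmp/recovered-binary.$pid"   # avoid running it by accident
sha256sum "/tmp/recovered-binary.$pid"   # record a checksum for later
```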

~~~
FDSGSG
> but the physical config was accidentally overwritten

Overwritten, not deleted. Open file handles can't really help you there.

~~~
jacquesm
Yes, they can. The low level implementation is such that the inode will not be
released. That's why securely overwriting a file is hard. Just opening a file
for output and writing data is not enough, you need to open it for _update_
and then write as much data as the old one had or more.
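
Whether anything survives depends on *how* the file was overwritten, which is the crux of this exchange. A quick scratch-file demo of the two cases on any Linux box (touches nothing real):

```shell
#!/bin/sh
# Show the two overwrite behaviours on a throwaway file.
f=$(mktemp)
echo "old config" > "$f"
ino1=$(stat -c %i "$f")

# Shell redirection truncates in place: same inode, old bytes gone.
echo "new config" > "$f"
ino2=$(stat -c %i "$f")
[ "$ino1" = "$ino2" ] && echo "in-place overwrite: same inode"

# Many editors instead write a new file and rename it over the old one:
# a new inode, so a process holding the old file open keeps the old bytes.
g=$(mktemp)
mv "$g" "$f"
ino3=$(stat -c %i "$f")
[ "$ino1" != "$ino3" ] && echo "rename-style overwrite: new inode"
rm -f "$f"
```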

------
jomkr
I'm not sure that 55 months of up-time indicates it's more or less likely to
go down in the next month, but I'd guess more likely.

Surely there are options, have you tried de-compiling the source from the
binary?

>but the physical config was accidentally overwritten and there are no backups

Any old dev PCs lying around somewhere? It's worth reaching out to the old
developers to see if they have a copy, in really old legacy systems it's
fairly likely someone will have a copy somewhere.

~~~
barry-cotter
> I'm not sure that 55 months of up-time indicates it's more or less likely to
> go down in the next month, but I'd guess more likely.

Less likely

[https://en.wikipedia.org/wiki/Lindy_Effect](https://en.wikipedia.org/wiki/Lindy_Effect)

> The Lindy effect is a theory that the future life expectancy of some non-
> perishable things like a technology or an idea is proportional to their
> current age, so that every additional period of survival implies a longer
> remaining life expectancy. Where the Lindy effect applies, mortality rate
> decreases with time. In contrast, living creatures and mechanical things
> follow a bathtub curve where, after "childhood", the mortality rate
> increases with time. Because life expectancy is probabilistically derived, a
> thing may become extinct before its "expected" survival. In other words, one
> needs to gauge both the age and "health" of the thing to determine continued
> survival.

~~~
RandomBacon
OP is talking about a physical server. The Lindy Effect is for "some non-
perishable things". A server that has already been running for at least 55
months is definitely perishable.

~~~
matthewowen
A server running software has both perishable and non-perishable elements. The
physical hardware is perishable and weakens with age, but as the software
itself ages the likelihood that it contains bugs like "it crashes whenever
daylight savings occurs" decreases.

The question is which effect dominates at a particular point in time.

------
fxtentacle
Hire a consultant to do a full-memory dump.

1\. That gives you a pretty good chance of attaching a debugger (to the
offline memory file) and extracting the config from memory without touching
the running system.

2\. You are safe in case things turn awful, which seems likely.

As for the source code, if it is an interpreted language like Java or Python
or C#, your chances of recovering a fully-working source code tree are pretty
good. At the very least, it'll help with understanding what the system does.

For C / C++, buy Hex-Rays IDA and Decompiler. That's what the pros use for
reverse engineering video game protections, breaking DRM, etc. So that tool
set is excellent for getting an overview of what a binary does, with the
ability
to go into all the nitty-gritty details if you need help re-implementing one
detail. Plus, Hex-Rays can actually propagate types and variable names through
your disassembly and then export compile-able C source code for functions from
your binary.

~~~
jacquesm
Java is a compiled language; as is C#.

~~~
Vendan
If it's not obfuscated, you can get a pretty good dump of a c# binary's source
with a decompiler. Some variable names may be missing (local variables and
such) and some constructs may be awkward (generator/iterator stuff) but it's
usable.

------
honkycat
My mother had a saying: Choose your battles. I feel like the value of calling
things "not my problem" in software is undervalued. You cannot fix every
problem, you cannot train every engineer, and you cannot control everything.
Learn to prioritize.

If it is not on your head, don't fuck with it. I understand the instinct to
fix a Big Problem and look great to management. However, this is too high
risk. If you solve the problem, you are a hero. If you fuck it up, you are an
idiot and fired and cost the company $$.

Why run the risk at all? Just cash the paychecks and fix other things that
can't go catastrophically wrong.

------
d--b
I personally would look for that code, it must be somewhere... In old code
repos or hard drives. Maybe you guys have tape backups or things of that
nature.

Don't touch the server, if anything happens, you'll be held responsible.

------
Tilian
The configuration still exists in memory somewhere, so you could extract it
from a dump... whether you should want to do this is an entirely different
question though.

~~~
adrianmsmith
That's not necessarily true I think. The program could read the config at
startup, and create structures or objects inside its heap to represent the
config. After that it would have no need for the actual bytes of the config.

~~~
sokoloff
Those heap objects are the configuration. Not the configuration input file,
but the configuration itself.

It is then a task to reverse engineer the config file from the config data
structures.

------
letharion
This is also mentioned by dijit, but here are more concrete instructions on
how to potentially recover the config from /proc/ :
[https://web.archive.org/web/20171226131940/https://www.linux...](https://web.archive.org/web/20171226131940/https://www.linuxnov.com/recover-
deleted-files-still-running-active-processes-linux/)
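
A sketch of the technique in the linked article. It helps only for the deleted-but-still-open case, not for a file overwritten in place; the PID is a placeholder (the script defaults to its own shell PID so it is runnable as-is):

```shell
#!/bin/sh
# If the process still holds a since-deleted config file open, copy its
# content back out through the descriptor. Pass the service's PID as $1.
pid=${1:-$$}

# Descriptors whose target was unlinked show up as "(deleted)".
for fd in "/proc/$pid/fd"/*; do
    if ls -l "$fd" | grep -q '(deleted)'; then
        cp "$fd" "/tmp/recovered-config.$pid"
        echo "recovered $fd -> /tmp/recovered-config.$pid"
    fi
done
```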

------
PaulHoule
In case you want to replace or upgrade the UPS, note that they make devices,
used by law enforcement to move a computer that is being impounded without
depowering it, which can cut into the power cord.

~~~
blackrock
Are you saying there is a device to splice the power cable while it is still
hot, and re-route power to the computer without causing a hiccup?

~~~
dijit
Oh for sure this is possible:
[https://www.youtube.com/watch?v=vQ5MA685ApE](https://www.youtube.com/watch?v=vQ5MA685ApE)

the devices that do it without doing what these guys did are quite pricey
though [https://www.cru-
inc.com/products/wiebetech/hotplug_field_kit...](https://www.cru-
inc.com/products/wiebetech/hotplug_field_kit_product/)

~~~
blackrock
I was thinking more along the lines of a single power input like your home
desktop PC. That you would need to splice into this power cable while it’s
still hot.

For servers I expect that they would have redundant inputs, so that you can
easily hot swap them.

The second link had a video on how to do that, with multiple ways to achieve a
power re-route, including a tool to easily splice into the wire. The easiest
technique was to just use a sheath to touch the metal prongs on the cable.

But the surprise for me was the power strip technique, where he plugged in his
battery source into one of the 6 outlets, thereby simulating a power input. I
didn’t even know that was possible.

~~~
jlokier
> For servers I expect that they would have redundant inputs, that you can
> easily hot swap it.

This is true, but there is still a real risk that the dual power input on the
server fails when you actually do the swap.

For example, the second PSU may not work, or it may have failed in the past
and then been forgotten about, or a connector or cable internally may be
flaky, or
the component that combines power from both PSUs may fail (although I think
that's quite rare).

If it's a really critical service, those risks may be unacceptable.

------
swalsh
You guys may not have the best processes in place, but whoever developed that
app really deserves some credit. 55 months of uptime without a restart. Not
bad.

------
INTPenis
Rather than focus on a replacement, they should focus on reading the system's
memory out and creating a replacement virtual machine, or perhaps even trying
to decipher the config structure from memory.

My experience with migration projects is that they can drag on in time, and
all the while this system is just itching to go down due to a power failure or
some other issue.

------
beat
"The physical config was accidentally overwritten and there are no backups".

Welcome to legacy.

So for a situation like this, there are several things that you need to think
about. First and foremost... _what is the impact when (not if) this process
finally stops?_ This isn't just for technical people. You need a business
impact assessment. You need the users involved. They're your lever for
fighting the inevitable fear-based political hurdles. Is it an annoyance? Or
does the company go out of business? The potential severity of the impact
matters a great deal. If it's putting the entire business at risk, you should
be able to get support from the highest levels of management to do whatever is
necessary.

Second... how do you recover? There are a variety of ways off the top of my
head. The most obvious would be to reconstruct the physical config. The
"obvious to others" that is probably a stupid idea is rewriting the
application. Let's ignore the stupid one and start dealing with reconstructing
the configuration.

Do you have the source code for the system? If so, you can probably
reconstruct the configuration architecture from reading source, at least. It
may suck, but it's something.

Is there a test environment with its own running copy of the app? If so, it
will have its own configuration, which will make reconstruction much easier,
as then you differ only by values and don't have to figure out what the fields
are.

Now, what kind of data is in the configuration that makes it difficult?
Resource locations? Authentication credentials? Something else? If it's
connecting to external systems, you can look at logs, packet-sniff, etc, to at
least figure out where it's going. Credentials can be reset for a new version
- a painful one-way trip, but it can work. Do you own any external systems, or
are they outside your control?

Now, all systems have inputs and outputs. What is the output of this? Is it
going to a database? If so, are you backing up that data? Make sure any
locally stored data is getting backed up!

If there isn't a duplicate test system, what would need to be done to create
one? Are there licensing restrictions? Specialized hardware/OS? Are you
building from source code? Do you have the source? Do whatever it takes to
create a parallel system that you can test configuration on, make it run
elsewhere.

I can just go on and on with this, but the important thing is to be able to
duplicate as much as possible before you try to replace. And find out what the
cost is - that buys you authority.

~~~
james_s_tayler
>"The physical config was accidentally overwritten and there are no backups".

>Welcome to legacy.

No. Not welcome to legacy. Welcome to a terrible company with terrible
processes.

~~~
beat
Fixing old, broken process is an interesting intellectual challenge in itself.

------
surfsvammel
Damn. Everyone here seems to advise you to just protect your own ass as a top
priority. Is that a product of US work culture?

If you were in a sound organisation, which I would say most Swedish IT
organisations are, you need to think about the company and the clients first.

Someone advised you to "don't do anything until explicitly asked to". I think
that's just bad advice. You obviously know this is a major problem and risk.
You also have some ideas about how to proceed in mitigating and solving it.
You should jump at the chance to help your company with this. Highlight this
directly to your managers, talk with as many as you can, gather the
information you need, get the approvals, and fix the problem ASAP or find
someone who can.

Your clients and customers might be badly affected when hell breaks loose.

I think anyone advising being passive, or plainly hiding from the problem, is
totally wrong. Those are not the kind of colleagues I would want.

Get to work.

~~~
Ghjklov
Uh no, that's terrible advice. What if he ends up disrupting the service and
causing days or, worse, weeks of downtime for the company? He'll end up taking
the entire blame for it and possibly even getting fired. Will you come to back
him up? Because to everyone else there, everything was working fine for months
and years until OP came along and put his nose where he shouldn't have.

------
guiriduro
Are you working on a Nuclear Reactor or something that will cause loss of life
if rebooted accidentally? If yes, then you have a truly critical system that
needs very careful uptime management, despite huge costs to carefully derisk
and duplicate it, and there's plenty of good advice here already. But too many
systems are 'super pets' like this and are mistakenly considered critical at
exorbitant cost.

If no: turn it off and on, and see. The simple fact of the matter is that
sooner or later it will happen anyway, and it's better to bring that about,
learn, and solve. And if it results in large financial losses for extended
downtime, then the management that allowed it to get to that state is already
at fault (not you) and some better, safer alternative will arise from your
efforts. Don't sweat it.

~~~
psnosignaluk
Heh. Know of a team of engineers who deliberated over a brittle system for
months. Mission critical, world ending stuff if it failed. Their manager got
tired of waiting and sent a mail to the team that ran the data centre asking
for the machine to be disconnected from the network. By CoB, nobody else had
even noticed. Thanks to the nature of big businesses, that mission critical
application had been programmed into obsolescence years before by other teams
shifting their own workloads off of the server.

------
hkt
Have a look at Checkpoint and Restore In Userspace:
[https://criu.org/Main_Page](https://criu.org/Main_Page)

I've not looked for a long time but you could potentially snapshot the running
process and restore it. Obviously do your reading extensively first though.

------
duxup
My first "real" job was working with some old equipment related to some big
old IBM (and related) mainframes. To view or change anything you dialed in
with a modem and then via a terminal application displayed and modified memory
directly.

I sometimes have dreams about specific hex codes that were common / would mean
bad things.

However, they were really good about documentation and updates / backups.

Best thing is to start working on folks in the business to develop a backup
plan / make your voice heard about the potential fallout of a failure. Or even
just ask "what would happen if" of the decision makers.

------
jlokier
I have used GDB to look at a process and get a dump of particular run-time
data structures, which is usually enough to reconstruct a config file.

Config data structures usually don't change while a process is running. Often
they are just values in global variables.

If you have the executable file for the process, it may be possible to run
that with trial config files, and then compare the GDB dump from the running
service with the GDB dump from a trial config, to compare the relevant data
structures. That can provide more confidence than just figuring out what the
config ought to be.

Getting a GDB dump of the running service will be quite disruptive if it's
done manually, but that might not matter. It will depend on whether the
service is heavily used or if it's only doing something occasionally.

If the service is in constant use, it could make sense to automate GDB so it
captures the config data structures quickly then safely detaches, and only
briefly pauses the service.

Alternatively, if even automated GDB is too disruptive or difficult to use, or
if the ptrace() syscalls used by GDB might cause a problem, it is often
possible to capture a memory dump of the running process without affecting the
process much, via /proc/PID/mem on Linux.

If necessary, write a tool in C to walk the heap by reading it from
/proc/PID/mem to reconstruct the config data structures that way.

(All the above assumes Linux.)
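
As a concrete sketch of the /proc/PID/mem idea, here is a small Python tool
(Linux only; the config string is an invented stand-in) that scans a process's
readable, writable mappings for a known byte pattern. For safety it
demonstrates on its own memory via /proc/self; against a real service you
would substitute the service's PID and need ptrace permission (e.g. run it as
root):

```python
import re

# Invented stand-in for config bytes known to live only in process memory.
CONFIG = b"listen_port=8080\n"

def find_in_memory(pid, needle):
    """Scan a process's readable+writable mappings for the needle bytes.

    Returns the virtual addresses at which the needle was found.
    """
    hits = []
    with open(f"/proc/{pid}/maps") as maps, \
         open(f"/proc/{pid}/mem", "rb", buffering=0) as mem:
        for line in maps:
            m = re.match(r"([0-9a-f]+)-([0-9a-f]+) rw", line)
            if not m:
                continue  # skip mappings that aren't readable and writable
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            try:
                mem.seek(start)
                chunk = mem.read(end - start)
            except (OSError, ValueError, OverflowError):
                continue  # some special mappings can't be read; skip them
            off = chunk.find(needle)
            if off != -1:
                hits.append(start + off)
    return hits

# Demonstrate on our own process: the "config" is recoverable from memory.
hits = find_in_memory("self", CONFIG)
print(len(hits) > 0)
```

The same scan pointed at /proc/&lt;pid&gt;/mem of the zombie would locate
candidate config regions to dump, without attaching a debugger.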

~~~
dboreham
I was going to post roughly this, but with one addition: find a human with
knowledge of the config file format. They don't need a perfect memory, but
surely someone developed this application and will remember roughly what went
in the file and its layout. Then use "strings" or equivalent on the binary and
your memory dump to piece together a reproduced config file. Test that in a
separate VM until it works. The biggest problem is probably going to be the
cutover between the running instance and the new-config instance. If it
doesn't work, you're screwed. Perhaps do the cutover by diverting network
traffic, leaving the working legacy instance in place to allow fallback?
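
To illustrate the "strings" step: here is a minimal pure-Python equivalent
that pulls printable runs out of a binary or memory dump. The blob below is
made-up example data, not a real config:

```python
import re

def strings(data, min_len=4):
    """Yield runs of at least min_len printable ASCII bytes, like strings(1)."""
    for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield m.group().decode("ascii")

# Made-up dump fragment: config-looking values buried in binary noise.
blob = b"\x00\x01garbage\xffhost=db.internal\x00\x02\x03port=5432\x00"
found = list(strings(blob))
print(found)  # ['garbage', 'host=db.internal', 'port=5432']
```

Run over a multi-gigabyte memory dump you'd want the real `strings` binary,
but the principle is the same: fish key=value-looking fragments out and let
the human with format knowledge arrange them.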

------
pezo1919
May I ask what business you are in? What happens to the world if your service
goes down? Just curiosity. Hopefully nothing of breathing-machine-like
importance. :)

------
Reith
A memory dump will probably be what you want to do, but are you sure the
process closed the file? For a config file it's plausible the process closed
the file descriptor, but if it didn't (perhaps through developer oversight),
it's possible the file is still accessible (through its inode, for example, on
Linux). Actually, I faced the same problem today: a virtual machine disk file
had been overwritten, but the virtual machine was running, so the (deleted)
file was still open and I could get it back.

------
Canada
Interesting problem. There's some good ideas in here, and I think you might
get some better ones if you add more details.

You said that you have been assigned the task of replacing the "zombie"
program, which I assume means that you are to write a new one that interfaces
with whatever depends on the zombie.

If the zombie were to die right now, how severe would the consequences be? On
the scale of total disruption of business to minor inconvenience that could be
worked around until it's fixed?

How complex is the zombie? What is the nature of the state that was supplied
by the lost configuration file? Is it stuff like which port to listen on and
how to connect to its database, or is it more like huge tables of magic values
that won't be easy to figure out again?

Do you currently have enough information about the zombie to write a
replacement? Do you have what you need to test your replacement before
deploying it?

If you are confident you have sufficient information to write the replacement,
how long do you think you need to write and test that?

You said the zombie runs on a physical server. What operating system? What
language/runtime/stack is the zombie based on?

------
abbadadda
I'm at the airport in Dublin. On a layover. I was enjoying my Oreo Fusion at
BK and when I read, "The service has current up time of 55 Months" I burst out
laughing hysterically.

(Update re the downvote: You can't tell me that uptime isn't a little bit
funny. I almost did a spit take at this. As someone who thinks an uptime of
150 days is way too much, this number just shocked me, that's all.)

------
miohtama
My father worked in a district heating plant. The control PC was an 80286 and
had uptime of over 10 years, though in theory the plant could be run without
it, with knobs and levers.

It was replaced by a Pentium with a virtualized MS-DOS ahead of the Y2K event,
though I am quite sure nothing changed in the underlying MS-DOS program, which
was more or less time-independent.

------
pschastain
I remember reading some time ago an old story about folks recovering
accidentally deleted system files from the currently running system. Can't
find it, I _think_ it was Ritchie who told it but I'm not sure. The long and
short is that it might be possible to recover the lost config from what's
currently running.

~~~
tomrod
This one?
[https://www.ee.ryerson.ca/~elf/hack/recovery.html](https://www.ee.ryerson.ca/~elf/hack/recovery.html)

------
swayson
In South Africa, citizens are facing a crisis with exactly such a service: our
electricity system. It consists predominantly of coal power plants that are at
end of life, and rolling blackouts are a common occurrence. They are a
necessary action to avoid a total blackout, given that it would take weeks to
get the system going again. Furthermore, high inequality and unemployment make
it paramount to grow the economy, so these rolling blackouts have a huge
socio-economic impact. It is a service that can't simply be restarted.

Transitioning risky services which can't be restarted is clearly complicated,
especially for complex systems. I wonder if there is a body of knowledge with
principles that could apply not only at this massive nationwide scale but also
to web services and the like. Does anybody have resources in this respect?

~~~
Gibbon1
You guys need Jesus... and some solar panels.

~~~
swayson
Unfortunately, the national power provider has a monopoly on energy and is a
political minefield. Independent power producers are ready to integrate
massive solar and wind capacity, but due to politics the negotiations and
contracts keep being delayed. Many private businesses are trying to produce
their own power through solar and the like, but it's expensive, as the
technology needs to be imported and the rand is one of the most undervalued
currencies.

------
farseer
The raptors escaped because the Jurassic Park security system had never been
restarted before in its history :)

------
aasasd
> _In addition the source code for this compiled binary has also been lost in
> the mists of time._

Sounds like you could try copying the binary, putting it into an isolated
VM—or a virtual copy of your other services—and beating a file with a stick
until the service accepts it as a config.

------
kazinator
> _Ever worked with a service that can never be restarted?_

Yes; any Unix init daemon.

What's "restarted"? Does accepting any new configuration and changing behavior
in response to it count as a restart?

Or do we have to terminate the process entirely and start a new one for it to
be a "restart"?

------
lormayna
At my previous-previous job we had a core switch that had been up for 5 years.
No one wanted to reload it because everyone was scared it wouldn't come back
up. We had no maintenance contract or spare parts, and the cabling was a mess.

My colleagues told me that it was still there 2 years ago.

------
totaldude87
Been there. The short story goes like this:

There was an active-passive cluster with some complicated RAID storage.

One day the passive node's storage was gone (there had been problems all
along: corruption, physical storage changes/swaps, etc., which management had
put on hold since a migration was planned), and since there were already plans
to migrate the cluster elsewhere, no one in management wanted to build another
set of servers for such a "short" period.

Guess what... that short period was 18 months :)

So for a year and a half, those systems were never touched and never
restarted, all the while serving applications.

Sometimes we get lucky, sometimes we are out of a job :)

------
teekert
I once overwrote my own partition table. The next week was spent keeping the
laptop on or in standby while investigating recovery. I never found a good
solution, so I started investigating alternatives and ended with the so-called
nuke-and-pave: I reinstalled Ubuntu.

Anyway, that's quite a situation you have got there :) Can you not suspend to
disk and keep the suspend image safe and bootable after power failures or so?
Maybe suspend to disk (hibernate) and then image the swap partition/file?
Then some GRUB editing... or... what OS is this anyway?

------
jasonmar
If it's a critical system and you don't have the ability to fix it but you
know it's only a matter of time before it fails, the sensible thing to do is
change jobs.

------
throwme9876
As of a few years ago, Google had something vaguely similar. The details
escape me, but it was something like --

The main RPC service had a dependency on the lock service to start up, and
vice versa. If both services went globally offline at the same time, they
wouldn't be able to turn either one (and thus most of Google) back on again.

Someone came up with a wonderful hack to solve this, involving basically air-
gapped laptops which, when connected, could help a datacenter bootstrap
itself.

------
withinboredom
If the process still has the config file open and has not yet closed it, the
file is still there, if it's Linux. You can find it among the process's file
handles in /proc.
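
A small self-contained demonstration of that trick (Linux only; the filename
and contents are invented for the demo). One process plays both roles here: it
opens a "config", the file is deleted, and the contents come back through
/proc. Against the real service you would instead list
/proc/&lt;service-pid&gt;/fd as root and copy out the entry that points at the
deleted config:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "app.conf")  # invented name for the demo
with open(path, "w") as f:
    f.write("listen_port=8080\n")

handle = open(path)    # the "service" still holds the config open...
os.unlink(path)        # ...when the file is accidentally deleted

# As long as the descriptor is open, the data is reachable via /proc:
# /proc/<pid>/fd/<n> is a link that still resolves to the deleted inode.
fd_path = f"/proc/{os.getpid()}/fd/{handle.fileno()}"
with open(fd_path) as recovered:
    content = recovered.read()

print(content.strip())  # listen_port=8080

handle.close()  # after this, the inode is truly gone
os.rmdir(workdir)
```

Note the recovery must happen before the process exits or closes the
descriptor; a simple `cp /proc/<pid>/fd/<n> backup.conf` from a shell does
the same job.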

------
asickness231
I have seen a scenario where a bespoke application, which had also been
running for some time, had been lost to the miasma (consultants took the
source code with them).

The organization in question paid for a team of forensic software experts to
reverse engineer it to the best of their ability ahead of a full data center
migration to a new facility.

I left the company while they were somewhere in the middle of this project.

------
noonespecial
You might try something like this, hoping that the process still has the gone
file "open".

[https://unix.stackexchange.com/questions/268247/recover-
file...](https://unix.stackexchange.com/questions/268247/recover-files-if-
still-being-used-by-a-process)

------
bluesign
From my experience, dumping memory and most likely reverse engineering the
binary is a must.

Actually, as it is a legacy system, chances are this will be easier.

To be honest, the path to follow mostly depends on the OS.

As it is a critical system, I would start with reverse engineering the binary,
making sure the config is preserved across the lifetime of the application.

------
villgax
You could try to make every possible request programmatically, save the
responses, and then rebuild it.

------
Piskvorrr
"as long as it is never restarted"

Seen that. And then one day, power failed. UPSes kicked in. In one hour, still
no mains power. Batteries depleted, systems started shutting down...including
that one.

Had to build a replacement in a _real_ hurry after that.

------
gumby
I have made a phone call. It’s not clear the phone system could be restarted.
For example the control plane (SS7) is simply transmitted over the links it
controls. It was built incrementally as a continually running system for more
than a century and certain bootstraps/incremental elements were long since
optimized out. Note that this is unrelated to the fanatical backwards
compatibility at the edge which you might think would imply that it is
restartable.

The Internet, I think, is restartable as long as layer 0 were available. It’s
not really clear what it would mean for the Internet to _need_ to be restarted
— perhaps some attack/failure of the BGP infrastructure?

------
nnq
Unpopular advice: _find a subtle way to make it crash, preferably stealthy,
but if not possible, at least in a way that can be attributable to mild
innocent incompetence instead of malice!_

Then there will be more and more interesting work to do for you and others,
either rediscovering and properly documenting the config, or, hopefully,
architecting and coding its replacement! In the aftermath, the organization
will be more robust. If it actually collapses because of this, then it deserved to
die anyway, you only helped accelerate the outcome and reduced the suffering.

Some things and processes need to be "helped to fail faster", everyone will
benefit from the renewal in the end, even if most will hate it ;)

~~~
_jal
Back when I ran a company, my partners and I gamed out what to do with
employees in various situations, as a way of coming to a consensus. I was
against legal action in all but extreme hypotheticals. Behavior like this is
in that class.

You assume you not only have all the relevant information to predict the
outcome, but also that your analysis of the situation is superior enough to
trump those who actually own the process and will be responsible for the
aftermath of your destructive actions. You demonstrate you know this by
suggesting stealth.

Thinking about the fact that people like this are out there makes me paranoid
about my hiring all over again.

~~~
celticmusic
If you had hired competent people to begin with, the configs would have been
backed up.

I don't agree with the other poster's recommendation, but I understand the
sentiment. The same incompetence that allowed the event to happen is the same
incompetence that allowed it to be swept under the rug for almost 6 years.

I've seen paragons of hiring get upset because failures were made visible, not
understanding that the failures were there all along and they were playing a
game of chicken with reality, only now they have a chance of winning.

~~~
setr
>If you had hired competent people to begin with, the configs would have been
backed up

What happy-go-lucky universe do you hail from? You could hire the most
competent people in the world and you'll still end up with a horribly complex
shitshow with lost programs, untested backups, missing docs, implicit
information out the ass, and so on. Software development has never had the
discipline to claim competent employees produce competent results -- they're
just less likely to be shitty results.

And we spend 90% of our time dealing with every layer of incompetence our
competent, and incompetent, friends have provided us.

~~~
celticmusic
It's possible to avoid losing a configuration.

~~~
setr
It's possible, and it's possible to avoid losing source code, and it's
possible to have valid backups and to keep your servers decently secure... yet
it's sufficiently common (especially as the program ages) that it's not
trivially attributable to competence; rather, you need continuous, unfailing
competency in the face of continuous environmental change (local changes
within the business, and external changes as business needs change, mergers
occur, people quit or are fired, etc.)

~~~
celticmusic
> rather you need continuous, unfailing competency

That's called doing it manually.

What you're doing is making excuses rather than holding yourself to a higher
level of quality.

~~~
setr
>That's called doing it manually.

Yes, a common feature in unstable environments. Automation is an ideal, but
not a necessity, and it's not always viable or cost-effective.

But also, that unfailing competency means you need to continuously maintain
(all aspects of) the automation without fail, again in spite of a changing
environment. Across everyone who does things.

>What you're doing is making excuses rather than holding yourself to a higher
level of quality.

And what you're doing is pretending that firing off blind, trivial
prescriptions is a useful activity -- perhaps I'll soon find you selling self-
help books as well :-)

Automation is not an answer that can always be applied trivially; especially
retroactively. Even despite all the marketing that will tell you otherwise.

~~~
celticmusic
> Automation is an ideal, but not a necessity

You're just bad. You automate specifically to take the human error risk out of
the equation, that's a part of what it means to be competent. We don't
automate our deploys because it's an ideal that is not a necessity, but so
some jackass doesn't delete a configuration while also never putting it into a
backup, or source control, or on their dev machine, ad nauseam.

It's not an accident that some of us have been working 20+ years in this
industry and have never been the cause of something like this.

> Automation is not an answer that can always be applied trivially; especially
> retroactively. Even despite all the marketing that will tell you otherwise.

shit, I guess I didn't read the right pamphlet...

------
lnanek2
We have a job running on the back-end servers here that is expected to take
two weeks to finish :) Hearing 55 months for your backend makes me more
optimistic, haha.

------
raverbashing
My first step would be to try to "undelete" the file from the filesystem; even
if it was overwritten, it is possible that some of it is still salvageable.

------
joseph2342
Surprised by the scenario of losing the code for a binary that is still
running. In which software domains is this common?

------
ohiovr
Is it possible to run queries against the program in an effort to reconstruct
the config, or would that also be too dangerous?

------
blinkingled
What OS? On Linux with Checkpoint/Restore you can checkpoint the process to
disk and restore it when needed.

------
sitkack
dd the disk and pipe it over scp to another machine; the config file could
still be in there.

------
alinspired
check [https://github.com/checkpoint-
restore/criu](https://github.com/checkpoint-restore/criu) \- designed to dump
a set of processes to a file and restore them elsewhere

------
trcarney
One of the source control companies should take this story to use for
marketing.

------
eru
Can you take a core dump while it's running?

------
lonelappde
This is how most systems work, at some scale.

------
arrty88
Is it running in a docker container? Can you copy the suspended state of that
container?

------
zaphirplane
You know how, with the flu vaccine, 1 in some large number gets sick? With
this it may end up worse.

If it’s a VM, take a snapshot periodically; I suspect it’s not.

Try a P2V tool, which converts it to a VM on the fly, then leave the VM off
and periodically rerun the P2V.

------
goatinaboat
_capture packets and see what kind of traffic it receives most commonly_

The modern solution would be to capture the incoming packets as a training set
then apply machine learning to create a model that can perfectly recreate the
outgoing packets.

It’s still an inexplicable black box of course, that is the nature of ML, but
at least you can run it in the cloud now.

~~~
martin_a
Sorry that nobody else got it, but I like that joke.

Do you think we can put some crypto in there, too?

~~~
hoorayimhelping
> _Sorry that nobody else got it, but I like that joke._

No, plenty of us got it. It's just not funny right now.

A person is trying to find a real solution to a real problem they have. Every
comment in this thread represents a potential hope that someone knows how to
fix it. When someone comes in to a serious thread that is spitballing ideas
and tries to make it about them by cracking jokes and being fun and clever,
it's annoying. I know this because for about 8 years I was the guy trying to
crack jokes in these threads.

This is not the time to make some kind of point about the state of modern
software engineering - do that in a thread where people are bullshitting, not
trying to work a problem. It's not ever really funny. It's noise. It doesn't
help anything. It's not too subtle, it's just obnoxious.

~~~
celticmusic
what a silly excuse for being offended by nothing.

Watch as I come up with a random reason you're wrong.

The OP could see it and laugh and that could possibly lighten their day and
help them relax.

Wait, no. That's positive; online we're only supposed to bash men and Hitler,
and take everything in the most negative light possible. I should have instead
judged you as a hateful person who has a horrible life and no friends.

~~~
dang
Please don't take HN threads further into flamewar. It helps nothing.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

