Ask HN: Ever worked with a service that can never be restarted?
258 points by scandox on Jan 16, 2020 | 188 comments
I'm currently working with a legacy system. One element of it has its config loaded in memory, but the physical config file was accidentally overwritten and there are no backups. In addition, the source code for this compiled binary has been lost in the mists of time.

The service has a current uptime of 55 months. The general consensus, therefore, is that as long as it is never restarted it will continue to perform its function until a replacement can be put in place. Which seems a little fatalistic to me...

Has anyone experience of doing something sensible in a similar situation?




My advice:

1. Suggest that the work to replace it be prioritized commensurate with the business impact that would be incurred while re-establishing service if it went down right now.

2. Remind them that it will go down at the worst possible time.

3. Ensure your name is attached to these two warnings.

4. Promise yourself you wouldn't run your business this way.

5. Get on with your life.


I would add to that:

- Make a plan of exploratory steps that could trigger a failure, in order of increasing risk.

- Then, if necessary, get a formal sign-off from someone with authority (perhaps a director of the company) to proceed with each step.

Steps like that are usually very low risk, but they do risk taking down an unknown service. They might include things like (just ideas from other comments):

- Logging in to the server at all.

- Splicing something into the network switch.

- Cloning filesystems or disk images.

- Cloning the process memory or system memory.

- Cloning the image or other things if it's in a VM.

- Running strace, gdb, or packet tracing.

- Swapping the power source live.

These days, it is often possible to transfer a full working system into a VM on a more modern, powerful machine, and after doing that, it's a great relief to everyone because it's no longer hardware dependent.

I've done that with some legacy systems that were originally on bare metal, and are now still running nearly 20 years later in reliable VMs, with no change to the running software. Usually running faster and with more memory doesn't break a working system.

But doing it on a service that can't be taken down even for a moment is quite an interesting adventure! :-) It is possible, but technically challenging, as long as you can obtain disk and memory images from the live system and then capture changes fast enough to perform a hot transfer.


I read earlier that some people were putting legacy apps on VMs, but I didn't realize that you can transfer a full working system into a VM including memory... which is pretty much a full state snapshot. Awesome. This is this guy's solution right there, until a replacement is in place.


What verytrivial said!!!!

I’ve been in similar situations, but never without an immediately obvious solution.

That system WILL FAIL. Even as we speak the time-to-failure is shrinking.

Even if one of the solutions described below actually works, you won’t get 100% recovery.

I recall a story years ago - from MIT, if memory serves - where they confidently rebooted a system because they had many generations of Sybase backups. When they tried to restore, it didn't work. Nobody had actually tested recovering from a backup.

Grit your teeth; cover your ass; and get on with it. The clock IS ticking.


> That system WILL FAIL.

I think that's the key point. We have a system that can't easily be restarted and won't automatically start after a server is rebooted. The developers of the software don't care, because "what are the chances of a virtual machine spontaneously restarting?" Turns out, those chances are rather good.

Servers, virtual machines, containers, doesn't matter, unless that thing is running on a mainframe, it will crash.


What's special about mainframes in this context?


Mainframes tend to be designed with extreme uptime in mind - to the point of often having hot-swappable processors.


The software doesn't care. It will crash anyway.


Oh, of course. But the MTBF for 'sleep 500y' on a mainframe is higher than on commodity x86 hardware.


Aah, my favorite write-only backups.


Follow up advice:

After sufficient CYA steps are taken, don’t offer any help on this problem unless explicitly asked to do so. And even then wait until there’s actually desperation. Feel free to think about it so you have some ideas when the time comes, but if this service isn’t something you are responsible for then stay away.

Or else it will be a “you touched it last you broke it” regardless of what actually happened. And if you make too much ruckus about it, they’ll say “why didn’t you do anything to fix it??”


I would say keep your own copy of the warnings - email them to yourself or similar.

When shit hits the fan, if the company wanted to throw you under the bus, they could delete them remotely from your machine or work mailbox and then pretend the warnings never existed.


Actually, I have experience here.

The problem is that anything you do that's potentially destructive in service of making the system more sustainable is going to be met with heavy criticism. So you must be careful; the company has accepted the risk it's in, and most likely you'll have to contend with that.

First things first: is it a VM or a physical machine? Things get a little easier if it's a VM, because there might be vMotion or some kind of live migration in place, meaning hardware failure might not actually bring the service down.

Next thing is you absolutely have to plan for failure, because the one thing I learn as I learn more about computers is that they're basically working "by accident": so much error correction goes into systems that it absolutely boggles my mind, and failures still get through. So it's certainly not a question of "if" but "when"; plan for it being soon.

Now, the obvious technical things are:

* Dump the memory.

* Grab any files it has open from its file descriptors (/proc/<pid>/fd/). Your config file might be there... but somehow I doubt it. (See the sketch a few lines below.)

* Attach a debugger and dump as much state as possible.

Be sure to cleanly detach: https://ftp.gnu.org/old-gnu/Manuals/gdb/html_node/gdb_22.htm...

Don't use breakpoints! They will obviously halt execution.
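
For what it's worth, here is a minimal sketch of that /proc/<pid>/fd step, assuming Linux, root access, and that you already know the service's PID; all names and paths are placeholders, and obviously nothing like this should run against the real box without sign-off:

    #!/usr/bin/env python3
    # Sketch: list a process's open file descriptors and copy out any regular
    # files they point to. Read-only with respect to the target process; it
    # never attaches to it or touches its memory. Assumes Linux + root.
    import os
    import shutil
    import sys

    pid = int(sys.argv[1])            # the legacy service's PID (assumed known)
    outdir = f"fd-copies-{pid}"
    os.makedirs(outdir, exist_ok=True)

    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        link = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(link)
        except OSError:
            continue
        print(f"fd {fd} -> {target}")
        # Only copy things that look like regular files (skip sockets, pipes,
        # devices). Copying via the /proc path works even for files that were
        # deleted but are still held open by the process.
        if target.startswith("/") and os.path.isfile(link):
            shutil.copyfile(link, os.path.join(outdir, f"fd{fd}-{os.path.basename(target)}"))

If the config had merely been deleted, this could rescue it outright; since in OP's case it was overwritten, it mostly tells you whether the process still holds anything interesting open.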

If it was my service I would also capture packets and see what kind of traffic it receives most commonly; and I would make some kind of stub component to replace it before working on the machine. Just in case I break it and it causes everything that depends on it to go down.

But, this is a horrible situation, reverse engineering this thing is going to be a pain. Good Luck.


I have a personal theory that Really Bad Things (TM) usually tend to happen during the response to other, relatively minor, problems.

So whatever is done cannot risk the running process. Personally, I'd be loath to try anything that touches the running process, even if it seems low risk.

Edit: One thing I would do is make it clear that building a replacement and keeping this thing running are different responsibilities and if you are doing the former then you really shouldn't be held responsible for the latter (politics I know).


Yep. Responding to an outage increases the probability of having an outage.


You can use /proc to get what you need. It has a memory map and the memory itself. Read the map, pull the heap segments out, and dump them into files named after their addresses. On an identical machine, load the process up, attach via gdb, and (this is the tricky part) then mmap in the files saved from the other side at the right addresses. If you can get the roots into your heap (tough), you can use gdb's Python API to write a traverser to get the data you need back out.
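
A rough sketch of the first half of that (pulling the writable/heap segments out of /proc into address-named files), assuming Linux, root-level ptrace access, and a known PID. Note the process keeps running while you read, so the snapshot is not atomic, and mappings can change between reading maps and mem:

    #!/usr/bin/env python3
    # Sketch: save each writable anonymous/heap mapping of a live process to a
    # file named after its start address, using /proc/<pid>/maps and /proc/<pid>/mem.
    import sys

    pid = int(sys.argv[1])  # PID of the target process (assumed known)

    segments = []
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            fields = line.split()
            addr_range, perms = fields[0], fields[1]
            name = fields[5] if len(fields) > 5 else ""
            # Keep the heap and writable anonymous mappings; widen as needed.
            if "r" in perms and "w" in perms and name in ("", "[heap]"):
                start, end = (int(x, 16) for x in addr_range.split("-"))
                segments.append((start, end))

    with open(f"/proc/{pid}/mem", "rb") as mem:
        for start, end in segments:
            mem.seek(start)
            try:
                data = mem.read(end - start)
            except OSError:
                continue  # some ranges may be unreadable; skip them
            with open(f"seg-{start:016x}.bin", "wb") as out:
                out.write(data)
            print(f"saved {end - start} bytes from {start:#x}")

The mmap-it-back-under-gdb half is the genuinely tricky part, as the parent says, since heap pointers are only valid if the segments land at exactly the same addresses.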


Seems like a checkpoint/restore process. Do you have experience with this, or an actual guide? I'm really curious about the underlying mechanism.


Nothing so formal. The tricky parts are the non-memory program state: registers and instruction pointers for all threads.


I would add this: whatever you decide to do to the running binary, create a mock binary on another computer and try the exact sequence of commands/steps first. And verify that it doesn't disturb the mock.

This should reduce risk a bit because you are not experimenting and creating your procedure on the fly when you do the actual thing. Instead, you are applying a procedure that you have rehearsed, and you have gained partial confidence that it won't disturb a similar running binary.

The mock doesn't have to be a perfect reproduction of the real binary, but it should do similar things where relevant. For example, if the service listens on a TCP port, so should the mock. You might not have to code the mock from scratch; maybe some existing software is similar enough. Or maybe you can even run a non-production copy of your real binary, if you are really confident you can configure it to not interact with anything it shouldn't.
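
If it helps, the mock can be almost trivially simple. Here is a rough sketch of a stand-in TCP listener (the port and behaviour are placeholders, not anything from the real service), just so there is a live process with some state on its heap to rehearse the attach/dump/detach procedure against:

    #!/usr/bin/env python3
    # A throwaway mock service: listens on a TCP port, echoes lines back, and
    # keeps a small fake "config" dict on the heap so rehearsals of memory
    # dumping have something to find. Port and contents are placeholders.
    import socketserver

    PORT = 9999  # placeholder; use whatever port the real service listens on

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            self.server.fake_config = {"listen_port": PORT, "note": "mock only"}
            for line in self.rfile:
                self.wfile.write(line)  # echo back whatever arrives

    if __name__ == "__main__":
        with socketserver.ThreadingTCPServer(("0.0.0.0", PORT), EchoHandler) as srv:
            srv.serve_forever()

Run it, practise the exact sequence of steps against its PID, and only then think about the real box.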


I would also recommend trying the process on a wide range of existing software you can get your hands on (let's say PostgreSQL or something similarly big). This will reduce the risk of any edge-case that your mock didn't cover.


> because the one thing that I learn as I learn more about computers is that they're basically working "by accident"

So true. Thanks this is good advice. As it happens it is a physical server.

I'm not actually mandated to do anything except replace it with a comparable service but this makes me feel like I'm racing against time!


Have you tried decompiling the executable to see if any sensible information can be gained that way?


Everyone's afraid to do anything on that machine at all. So even getting agreement to login or install software or run tcpdump or whatever is fraught.


You need to set up an identical-ish host; it could even be a VM, so that you can build software that will run on the Ghost Ship. This in and of itself could be difficult (Debian 4?).

Using this Shadow Ghost Ship you can build a copy of tcpdump that you can just scp over and run in place. Don't touch the package manager, /etc, /var, anything!


It might not help much but maybe you can get permission to run tcpdump on a machine that talks to the service?


What I was meaning was that if you can get a hold of the binary that is running you could take it to some other environment and have a look (starting with a hex dump and working up from there...).

Mind you if things are that bad then maybe I'd follow my own advice (in another comment) and stay well away from the currently running thing.


this would certainly help recreate the configuration.


Attaching a debugger potentially disrupts the service. I would suggest trying to live-image it with some other technique first.


Agree.

Attaching a debugger can cause multi-threaded processes to behave differently and/or deadlock if you are unlucky.

Also, some closed-source software components deliberately crash if they detect a debugger and since you don't have source code for the binary, there is a risk that one such component was used during development.


Yeah, that's pretty much the last resort once you have fallbacks in place, might be the last/only way to glean info about the config that it's running.


I would add: overcommunicate everything you plan before you do it, and give others plenty of time to react to even the simplest operations.

The company may have failed but when the proverbial truckload of shit hits the proverbial supersonic fan there will be tremendous pressure to find a culprit. A small peon might find himself/herself outnumbered and outgunned.


At a previous job I once sshd into a legacy system which had a motd in the vein of: "Don't ever reboot this system, if you do, I will find you and do horrible things to you". No explanation of why. My opinion was that the person who dared to leave a system in this state was the one who needed the horrible things done to them instead. Needless to say, he was already fired.

We started working on a solution to run parallel to the existing one, which would receive shadow traffic so we could observe behaviour, find all edge cases and put them in a test suite for the new solution. After we were confident that our test suite contained most of the important behaviour, we switched traffic to our new solution, keeping the old one online just in case we needed to switch back.

The key is monitoring and learning the expected behaviour of the connected systems, so you can sense when that behaviour deviates and act on it as soon as possible.


I think your solution is fine.

Just another perspective on that server message. This server might actually be running a real-time legacy interface to one of your biggest customers, who pay a six-figure premium to keep this old thing running just so they have time to migrate. Which they have been trying to do since the 80's, hence the fact that few people remember why it is there.

I'm not saying this happened here, but it happens a lot more than any of us want. Just keep in mind there might actually be a good reason for the warning, although I would object to the wording. First heed the warning. Then figure out why it is there.


> just so they have time to migrate. Which they have been trying to do since the 80's, hence the fact that few people remember why it is there.

The questions I would ask are: what is the insurance if the system fails due to natural causes (hardware failure, power outage, you name it)? And what is the cost/benefit balance of having this unpatched security hole connected to the internet (which was the reason we discovered this).

If a system is so important that such a warning is deemed appropriate, imho it's just laziness and carelessness on the part of the warning's writer, and they are shifting their problem onto the next person who stumbles upon the system.

At least give some information on why it is important and at what time or under which conditions the warning can be considered expired. Just like commenting your code is important, because you never know who will have to make sense of it in a few months' time (especially true if that person turns out to be you :)).


To be sure I'm not defending the practice, just shedding another light on it.

In answer to your first question: there might not be any insurance against the system failing. The honest answer to what to do then, because someone always needs to ask, is: panic! It might not even be unrecoverable, just really expensive and time-consuming to recover.

The follow-up question would be: yes it is a risk, yes it is a possible disaster, so what do you want to do about it? If the answer is anything more than "we should not have taken on that risk in the first place" or "let's bet the company's future on the fact that we can fix this", then it might actually be interesting to listen to. It seems from your comment you resolved it.

As for messaging, I do not agree you should explain the situation on the server. Just let people know they should not touch this, ever. As soon as you explain why, people will assume their reason to touch it trumps whatever reason you gave. If they need to imagine all the possible disasters that might happen, it has more impact than the one you can describe.

Messages like "Before doing anything on this server contact Bob" will eventually lead to Bob receiving a message: "We have done this or that. Just letting you know, but it was after office hours", which he will probably see while rushing into the office in the middle of the night because the server is not working anymore. The other type of message: "Don't reboot this server ever! For more information contact Bob". Bob changed, because he spent a considerable time saying no and explaining the situation to the sysops team, their manager, their manager's manager, etc., who all thought their priority must trump Bob's. Bob might still be working for the company. He might have tried for the better part of his career to get this stain resolved. Nothing bad about Bob.


If that were true, they should have put it in the motd as well.


Shadow traffic is a really good solution here. Aside from simply speccing the system, you're also building a very good automated test suite to run your new solution against.


There is a tool used in malware analysis and computer forensics called Volatility[0]. It has some very powerful analysis tools and works on Linux, Mac, and Windows. In your case, its ability to dump the memory of a running process without messing with the process state[1] may be very helpful! It also has the ability to run a Yara scan against the dumped memory, which could let you find the region of memory containing the config file (so long as you know some of the strings in it).

Hope this helps!

[0] https://github.com/volatilityfoundation/volatility

[1] https://www.andreafortuna.org/2017/07/10/volatility-my-own-c...
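
As a complement (or if getting Volatility onto anything is a hard sell), once you have an offline dump you can also scan it for remembered fragments of the config with plain yara-python. A minimal sketch, where the rule strings and the dump filename are made-up examples:

    #!/usr/bin/env python3
    # Sketch: scan an offline memory dump for strings you expect the lost config
    # to contain (key names, hostnames, etc.). Requires: pip install yara-python
    import yara

    RULE = r"""
    rule probable_config
    {
        strings:
            $a = "listen_port" ascii        // hypothetical config key
            $b = "db.example.internal"      // hypothetical hostname from the config
        condition:
            any of them
    }
    """

    rules = yara.compile(source=RULE)
    for match in rules.match("memdump.bin"):    # path to your dump (placeholder)
        print("rule hit:", match.rule)
        for s in match.strings:
            print("   ", s)

Crucially, this all happens on a copy of the dump, nowhere near the running process.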


I would advise exercising caution though. Even if the tool works perfectly fine, the service going down anywhere within days to weeks after you captured its state would be seen by the management as your fault. They won't care for the explanations, they will just need a scapegoat and that would be whoever touched the thing last time.

That said, there are plenty of legitimate (albeit low-probability) reasons for the dumping tool to screw up the service. Slightly affecting memory timing and triggering some rare race condition, using up more RAM than usual (or increasing the HDD utilization) and again triggering some conditions that wouldn't happen normally. You name it.

If it was me, I would try to run the binary on another machine and do a "clean room" recreation of the config file. That said, service without sources and missing configs without backups indicate severe organizational problems, so I would probably not want to work in such a place for a long time, unless I was hired by the CEO specifically to rectify things.


I agree, there are actually two problems here:

One is technical, and to do with the service and how to deal with it.

One is political, and to do with who gets the blame when the service inevitably fails.

OP needs to work out if they're being tasked with the job as the answer to the technical problem or the political problem. Because management could be very aware that this is a ticking bomb with no technical solution, and have appointed OP as a scapegoat to take the blame when it blows up.


What good is a memory dump if you don't have the codebase to navigate? Without the code and the structure of how it's managed, a memory dump would just look like an alien language.


The phrase you want to google is "reverse engineering from memory dump". Prepare to be astounded.

You need to know how machine instructions work, details of the operating system, and a bunch of other things... but any number of copy-protection systems have been bypassed this way. Any number of devices have had their firmware dumped, their memory scanned, and the resulting data analyzed to make patches or build compatible systems.


> so I would probably not want to work in such a place for a long time, unless I was hired by the CEO specifically to rectify things.

Reminds me of The Phoenix Project.


Dump memory and see if you can retrieve the config that's running. If someone has a clue about what the config should look like, you could probably get all the elements of it from simply running 'strings' on the binary. You could also disassemble the binary. Possibly the deleted config file could be salvaged from disk.

As is, this is all relying on the HW not giving up and the UPS working in a power failure!


I once worked with a client in such a situation. They were in the process of building a huge oil rig, costing somewhere north of $5.5bn USD. The basic premise was that a previous vendor had configured a big documentation system running on a soon-to-be-outdated Windows Server version ages ago, along with a "kind of" API allowing the shipyard to send information in the form of equipment/construction metadata, documents of various kinds and similar stuff.

The main challenge was that nothing could be resent if a transmission failed, unless there was an actual change on the shipyard side, or via a highly complex and manual method. There was no source code to anything, and while the enterprisey system was a fairly standard off-the-shelf type thing, the API was completely customized. Nobody knew anything about how it talked to the system, its dependencies, or anything else. Changing anything was out of the question, as everything had been defined in contracts and processes that made waterfall seem agile.

The team I was in was basically there to set up a new application in a separate environment, so that we could migrate and replace the existing setup once the shipyard had handed over all the information. In the end everything was kind of anti-climactic, with everything working as expected for as long as it needed to.


> In the end everything was kind of anti-climactic, with everything working as expected for as long as it needed to.

I often tell people that great software development is boring.


I have some experience in this field.

First of all you are there to build a replacement service. As long as you do that in the allotted time what happens to the old should not bother you. Focus on your part of the solution.

The fact that you do seem to worry suggests you need something from the old service. Do you have a good design for the new service? Are there details of the old service that you do not know about? It won't be the first time a company has an application that runs complex calculations, but nobody knows how it is done. We actually built black-box applications in the past to replace calculation modules on obsolete hardware.

If you need information from the old system you may have to resort to cloning the input and output preferably on the network level so you don't have to mess with the existing service.

If the replacement service is years out, you might suggest to the company to put in a red team: another person or persons who work on finding out how the service works by running a duplicate on another system (rent one if necessary) and poking at it, to see if they can get comparable results quicker than the replacement effort. But that is not you.

There are a couple of things you need to have in place before going live. Make sure you can clone input and output so you can run the old service next to the new one for a while. You don't want to stop the old service the moment you go live, because if anything goes wrong you need to be able to switch back. Even then, starving the old service of input might have adverse effects, so cloning is better. If the old service uses external resources, like a database, files, etc., make sure you do not interfere with file locks, sequence numbers, etc. when running in parallel.


A lot of talk about what to do yourself, I'd say do nothing yourself.

Find an expert who has masses of experience who can consult on it.

This isn't a good time to be learning and testing those lessons.


Yes. Moreover, if you backup the system great, you get peace of mind, but management hardly cares. If you BREAK the system while trying to establish a backup, management cares and it's "your fault."

You have exponentially more to lose than to gain by experimenting with this at all. In this situation you should convey the risks, let the stakeholders decide if/who is going to address this, as it's their risk to take.


This is the solution in my experience as well.

Doing even minor things that could disrupt the execution for any amount of time (attaching a debugger, installing a packet sniffer that blocks network traffic for a second or causes a packet loss, etc.) could deadlock the application.

Wouldn’t risk it.


The only thing I would risk is copy/pasting the binary/executable that is running to somewhere else so I could try and decompile it. Anything beyond that is a nope.


My experience in this situation is that the more you decide to be the hero and take on things that everyone else is washing their hands of, when you have no more reason to than anyone else, the more likely it is that you'll also be shouldered with responsibility for things going wrong that were similarly nothing to do with you in the first place.

Prioritise a replacement and don't decide to try smart things with the running one.


If it is running you can still get at the binary file through /proc and recover it. The file system will only really delete a file when there are no more users, and a running process counts as the file being in use (so that pages can be paged back in from it if needed).
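
Concretely (assuming Linux and root, with the PID as a placeholder), the running process pins its own executable, so you can copy it out via the magic symlink even if the on-disk file were gone:

    #!/usr/bin/env python3
    # Sketch: copy the executable of a running process out of /proc.
    import os
    import shutil
    import sys

    pid = int(sys.argv[1])                      # target PID (assumed known)
    dest = f"recovered-binary-{pid}"
    shutil.copyfile(f"/proc/{pid}/exe", dest)   # /proc/<pid>/exe points at the binary
    os.chmod(dest, 0o755)                       # copyfile doesn't preserve the exec bit
    print(f"copied executable of PID {pid} to {dest}")

In OP's case it's the config, not the binary, that was overwritten, but the same trick gives you a clean copy of the binary to decompile elsewhere.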


That depends on how the application works.

Most likely, the application just deserialized the config into its internal data structures and closed the file.


> but the physical config was accidentally overwritten

Overwritten, not deleted. Open file handles can't really help you there.


Yes, they can. The low level implementation is such that the inode will not be released. That's why securely overwriting a file is hard. Just opening a file for output and writing data is not enough, you need to open it for update and then write as much data as the old one had or more.


I'm not sure that 55 months of up-time indicates it's more or less likely to go down in the next month, but I'd guess more likely.

Surely there are options, have you tried de-compiling the source from the binary?

>but the physical config was accidentally overwritten and there are no backups

Any old dev PCs lying around somewhere? It's worth reaching out to the old developers to see if they have a copy, in really old legacy systems it's fairly likely someone will have a copy somewhere.


> reaching out to the old developers to see if they have a copy

I think you mean "older devs that are still with the company" but I was once contacted by a former employer for exactly this: they had lost source code for the billing system. I didn't have the source code so that was easy to answer. Developers that have stolen intellectual property from former employers may want to consider carefully how they answer such questions...

Also, I was never supposed to have had access to this code whilst I was employed so I did wonder if there was more to it than just "oops we lost it can you help?". Perhaps they were trying to chase down a leak or something? I never heard back though (they survived).


If you barge in, lawyers blazing, shouting about "stolen intellectual property", the likely response is "don't have any, never had", regardless of its veracity. OTOH, "disks that are left unerased by accident" _and_ generally being very open on the nature of the emergency is much more likely to bring help.


This case is one I'd be pretty careful about though as a former employee - the situation the company's found themselves in is pretty indicative of mismanagement, and desperate people do desperate things.

IANAL, but a hold harmless agreement would be a simple CYA that I'd do if I was personally in this situation.


Indeed. That's why I wrote, in essence "if you start with 'you surely HAVE our code, GIMME!!!!!'", the answer will always be 'nope', regardless of facts".


> I'm not sure that 55 months of up-time indicates it's more or less likely to go down in the next month, but I'd guess more likely.

Less likely

https://en.wikipedia.org/wiki/Lindy_Effect

> The Lindy effect is a theory that the future life expectancy of some non-perishable things like a technology or an idea is proportional to their current age, so that every additional period of survival implies a longer remaining life expectancy. Where the Lindy effect applies, mortality rate decreases with time. In contrast, living creatures and mechanical things follow a bathtub curve where, after "childhood", the mortality rate increases with time. Because life expectancy is probabilistically derived, a thing may become extinct before its "expected" survival. In other words, one needs to gauge both the age and "health" of the thing to determine continued survival.


It's going to be some mix of Lindy and bathtub. Lindy for the software itself and electronics, bathtub for fans, spinning disks, cables, other things that may physically degrade.

Also, unless the server is in a data centre with some serious redundancies, (and even then...) the power is guaranteed to go down one day.


OP is talking about a physical server. The Lindy effect is for "some non-perishable things". A server that has already been running for at least 55 months is definitely perishable.


A server running software has both perishable and non-perishable elements. The physical hardware is perishable and weakens with age, but as the software itself ages the likelihood that it contains bugs like "it crashes whenever daylight savings occurs" decreases.

The question is which effect dominates at a particular point in time.


It's hardware, so you are looking at the bathtub curve: https://en.wikipedia.org/wiki/Bathtub_curve


Ideas and technologies are actively sustained by external forces due to perceived merit in said ideas or technologies.

A computer program that is freewheeling along, by contrast, is not sustained by but dependent on external factors (software, hardware, electricity), all of which are prone to fail and some of which are more likely to fail with age.


Well, at least that's a plus for non-compiled languages (PHP, JS etc.): even if everything else fails, you still have the code on the server.

But I guess the bigger problem here is the missing config - as long as you still have the binary of the application, you should be able to restart it.


> I'm not sure that 55 months of up-time indicates it's more or less likely to go down in the next month, but I'd guess more likely.

If 'going down' is an event governed by chance (e.g. power failure), then it does not matter if it has been up for 55 months or 55 minutes.

https://en.wikipedia.org/wiki/Gambler%27s_fallacy
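
For what it's worth, that "does not matter" claim is exactly the memoryless property, which holds only under a constant-hazard (exponential) failure model. A one-line check, assuming the time to failure T is exponentially distributed with rate \lambda:

    P(T > s + t \mid T > s) = \frac{P(T > s + t)}{P(T > s)}
                            = \frac{e^{-\lambda (s+t)}}{e^{-\lambda s}}
                            = e^{-\lambda t} = P(T > t)

Under the bathtub-curve view discussed elsewhere in the thread, the hazard is not constant, so past uptime does carry information.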


There’s also the Turkey fallacy - Turkeys that are smug about knowing the Gambler’s fallacy keep thinking their survival is independent of their age, until people eat them. My probability of dying goes up with each year as well, it’s not constant.

I think the probability of the machine shutting down in the next minute is independent of age, but it shutting off in the next 100 years is certain. There's a line or curve somewhere that indicates the probability of it shutting down before OP finishes the replacement.


This is how I feel when people talk about Schroedingers cat. Give me a week and I'll be able to answer that question definitively.


Gambler's Fallacy is applicable "when it has otherwise been established that the probability of such events does not depend on what has happened in the past" -- not the case here.

The fact that the server has been running uninterrupted for 55 months is a strong clue that the power to this system is robust, not subject to "random" power loss scenarios like one would expect at a typical residence.


> The fact that the server has been running uninterrupted for 55 months is a strong clue that the power to this system is robust, not subject to "random" power loss scenarios like one would expect at a typical residence.

I don't remember random power losses in typical first world residences after perhaps the 1990s?


I get them at my house. Every time there's an ice storm. And for a while there was an intermittent issue where floodwaters came up over my internet port in the junction box, and so my internet would go out for a while. Hopefully the power grid isn't as janky.


Interesting! I guess I never lived either in rural areas or anywhere with ice storms.


It depends on what kind of chance we are talking about.

See eg https://en.wikipedia.org/wiki/Bathtub_curve for a common way to fail. Or https://en.wikipedia.org/wiki/Reliability_engineering in general.

You are right about some specific kinds of spontaneous failures. (Though not sure if external power failures are distributed that way.)


Hire a consultant to do a full-memory dump.

1. That gives you a pretty good chance of attaching a debugger (to the offline memory file) and extracting the config from memory without touching the running system.

2. You are safe in case things turn awful, which seems likely.

As for the source code, if it is an interpreted language like Java or Python or C#, your chances of recovering a fully-working source code tree are pretty good. At the very least, it'll help with understanding what the system does.

For C / C++, buy Hex-Rays IDA and the Decompiler. That's what the pros use for reverse engineering video game protections, breaking DRM, etc. So that tool set is excellent for getting an overview of what a binary does, with the ability to go into all the nitty-gritty details if you need help re-implementing one detail. Plus, Hex-Rays can actually propagate types and variable names through your disassembly and then export compilable C source code for functions from your binary.


Java is a compiled language; as is C#.


If it's not obfuscated, you can get a pretty good dump of a c# binary's source with a decompiler. Some variable names may be missing (local variables and such) and some constructs may be awkward (generator/iterator stuff) but it's usable.


My mother had a saying: Choose your battles. I feel like the value of calling things "not my problem" in software is undervalued. You cannot fix every problem, you cannot train every engineer, and you cannot control everything. Learn to prioritize.

If it is not on your head, don't fuck with it. I understand the instinct to fix a Big Problem and look great to management. However, this is too high risk. If you solve the problem, you are a hero. If you fuck it up, you are an idiot and fired and cost the company $$.

Why run the risk at all? Just cash the paychecks and fix other things that can't go catastrophically wrong.


I personally would look for that code, it must be somewhere... In old code repos or hard drives. Maybe you guys have tape backups or things of that nature.

Don't touch the server, if anything happens, you'll be held responsible.


The configuration still exists in memory somewhere, so you could extract it from a dump... whether you should want to do this is an entirely different question though.


That's not necessarily true I think. The program could read the config at startup, and create structures or objects inside its heap to represent the config. After that it would have no need for the actual bytes of the config.


Those heap objects are the configuration. Not the configuration input file, but the configuration itself.

It is then a task to reverse engineer the config file from the config data structures.


This is also mentioned by dijit, but here are more concrete instructions on how to potentially recover the config from /proc/ : https://web.archive.org/web/20171226131940/https://www.linux...


In case you want to replace or upgrade the UPS, note that they make devices that can cut into the power cord: the kind used by law enforcement to move a computer which is being impounded without depowering it.


Are you saying there is a device to splice the power cable while it is still hot, and re-route power to the computer without causing a hiccup?


Oh for sure this is possible: https://www.youtube.com/watch?v=vQ5MA685ApE

The devices that do it without doing what these guys did are quite pricey though: https://www.cru-inc.com/products/wiebetech/hotplug_field_kit...


I was thinking more along the lines of a single power input like your home desktop PC. That you would need to splice into this power cable while it’s still hot.

For servers I expect that they would have redundant inputs, that you can easily hot swap it.

The second link had a video on how to do that. With multiple ways to achieve a power re-route, including a tool to easily splice into the wire. The easiest technique was to just use a sheath to touch the metal prongs on the cable.

But the surprise for me was the power strip technique, where he plugged in his battery source into one of the 6 outlets, thereby simulating a power input. I didn’t even know that was possible.


> For servers I expect that they would have redundant inputs, that you can easily hot swap it.

This is true, but there is still a real risk that the dual power input on the server fails when you actually do the swap.

For example, the second PSU may not work, or it may have failed in the past and then be forgotten about, or connector or cable internally may be flaky, or the component that combines power from both PSUs may fail (although I think that's quite rare).

If it's a really critical service, those risks may be unacceptable.


Hot swapping servers with redundant power supplies is great, until you discover the hard way that some nimrod plugged both power supplies into the same UPS. And not just once, but on every server in both racks, one UPS per server.


> the devices that do it without doing what these guys did is quite pricey though https://www.cru-inc.com/products/wiebetech/hotplug_field_kit....

It's somewhat interesting how this company can sell a product which obviously violates pretty much any electrical safety regulations (a plug with exposed prongs that can become hot). I'm guessing this is sold as a kit, not as a complete product.


In the grand scheme of things, $600 is nothing money to a company.


That requires the computer to hotswap to be plugged into a power strip. It doesn't work if the computer is plugged into the wall directly. (Well leaving aside modifying the house's electricity / opening up the wall obviously.)


Incorrect. See this video[0] from the product stage. They have two ways to hotswap if it's not plugged into a powerbar: 1) plugging in their device into the adjacent power outlet then removing the socket assembly and 2) an adapter that can splice leads into the power cable

https://youtu.be/-G8sEYCOv-o

Pretty ingenious, tbh. Though I'm a bit concerned about how much you end up handling energized plugs.


Well that’s what I meant by “opening up the wall”.


you can do it pretty easily if you have a dual socket wall socket because it follows the exact same principles. But you have to be very careful.

In these cases it's probably worth doing it though.

https://media.screwfix.com/is/image//ae235?src=ae235/38040_P...

FWIW I do not endorse doing this, just stating that its possible and would work, it's not electrically safe and is dangerous for not only the hardware but for you too.


I think they could run wires to the motherboard pins on the backside or something too


You guys may not have the best processes in place, but whoever developed that app really deserves some credit. 55 month uptime without restart. Not bad.


Rather than focus on a replacement they should focus on reading the systems memory out and creating a replacement virtual machine or perhaps even trying to decipher the config structure from memory.

My experience with migration projects is that they can drag on in time, and all the while this system is just itching to go down due to a power failure or some other issue.


"The physical config was accidentally overwritten and there are no backups".

Welcome to legacy.

So for a situation like this, there are several things that you need to think about. First and foremost... what is the impact when (not if) this process finally stops? This isn't just for technical people. You need a business impact assessment. You need the users involved. They're your lever for fighting the inevitable fear-based political hurdles. Is it an annoyance? Or does the company go out of business? The potential severity of the impact matters a great deal. If it's putting the entire business at risk, you should be able to get support from the highest levels of management to do whatever is necessary.

Second... how do you recover? There are a variety of ways off the top of my head. The most obvious would be to reconstruct the physical config. The "obvious to others" that is probably a stupid idea is rewriting the application. Let's ignore the stupid one and start dealing with reconstructing the configuration.

Do you have the source code for the system? If so, you can probably reconstruct the configuration architecture from reading source, at least. It may suck, but it's something.

Is there a test environment with its own running copy of the app? If so, it will have its own configuration, which will make reconstruction much easier, as then you differ only by values and don't have to figure out what the fields are.

Now, what kind of data is in the configuration that makes it difficult? Resource locations? Authentication credentials? Something else? If it's connecting to external systems, you can look at logs, packet-sniff, etc, to at least figure out where it's going. Credentials can be reset for a new version - a painful one-way trip, but it can work. Do you own any external systems, or are they outside your control?

Now, all systems have inputs and outputs. What is the output of this? Is it going to a database? If so, are you backing up that data? Make sure any locally stored data is getting backed up!

If there isn't a duplicate test system, what would need to be done to create one? Are there licensing restrictions? Specialized hardware/OS? Are you building from source code? Do you have the source? Do whatever it takes to create a parallel system that you can test configuration on, make it run elsewhere.

I can just go on and on with this, but the important thing is to be able to duplicate as much as possible before you try to replace. And find out what the cost is - that buys you authority.


>"The physical config was accidentally overwritten and there are no backups".

>Welcome to legacy.

No. Not welcome to legacy. Welcome to a terrible company with terrible processes.


Fixing old, broken process is an interesting intellectual challenge in itself.


Damn. Everyone here seems to advise you to just protect your own ass as a top priority. Is that a product of US work culture?

If you were in a sound organisation, which I would say most Swedish IT organisations are, you need to think about the company and the clients first.

Someone advised you to "don't do anything until explicitly asked to". I think that's just bad advice. You obviously know this is a major problem and risk. You also have some ideas about how to proceed in mitigating and solving it. You should jump at the chance to help your company with this. Highlight this directly to your managers, talk with as many as you can, gather the information you need, get the approvals, and fix the problem ASAP or find someone who can.

Your clients and customers might be badly affected when hell breaks loose.

I think anyone advising being passive, or plainly hiding from the problem, is totally wrong. Those are not the kind of colleagues I would want.

Get to work.


Uh no, that's terrible advice. What if he ends up disrupting the service and causes days or, worse, weeks of downtime for the company? He'll end up taking the entire blame for it and possibly even being fired. Will you come and back him up? Because to everyone else there, everything was working fine for months and years until OP came along and put his nose where he shouldn't have.


> If you where in a sound organisation

Most people here are arguing that if he was in a sound organization, that wouldn't happen. Thus, he isn't, and advice that assumes sound organizations is useless.


You have a great understanding of most Swedish IT organisations...


Are you working on a Nuclear Reactor or something that will cause loss of life if rebooted accidentally? If yes, then you have a truly critical system that needs very careful uptime management, despite huge costs to carefully derisk and duplicate it, and there's plenty of good advice here already. But too many systems are 'super pets' like this and are mistakenly considered critical at exorbitant cost.

If no: turn it off and on, and see. The simple fact of the matter is that sooner or later it will happen anyway, and it's better to bring that about, learn and solve. And if it results in large financial losses from extended downtime, then the management that allowed it to get to that state is already at fault (not you), and some better, safer alternative will arise from your efforts. Don't sweat it.


Heh. Know of a team of engineers who deliberated over a brittle system for months. Mission critical, world ending stuff if it failed. Their manager got tired of waiting and sent a mail to the team that ran the data centre asking for the machine to be disconnected from the network. By CoB, nobody else had even noticed. Thanks to the nature of big businesses, that mission critical application had been programmed into obsolescence years before by other teams shifting their own workloads off of the server.


Have a look at Checkpoint and Restore In Userspace: https://criu.org/Main_Page

I've not looked for a long time but you could potentially snapshot the running process and restore it. Obviously do your reading extensively first though.


My first "real" job was working with some old equipment related to some big old IBM (and related) mainframes. To view or change anything you dialed in with a modem and then via a terminal application displayed and modified memory directly.

I sometimes have dreams about specific hex codes that were common / would mean bad things.

However, they were really good about documentation and updates / backups.

Best thing is to start working on folks in the business to develop a backup plan / make your voice heard about the potential fallout of a failure. Or even just ask "what would happen if" of the decision makers.


I have used GDB to look at a process and get a dump of particular run-time data structures, which is usually enough to reconstruct a config file.

Config data structures usually don't change while a process is running. Often they are just values in global variables.

If you have the executable file for the process, it may be possible to run that with trial config files, and then compare the GDB dump from the running service with the GDB dump from a trial run, looking at the relevant data structures. That can provide more confidence than just figuring out what the config ought to be.

Getting a GDB dump of the running service will be quite disruptive if it's done manually, but that might not matter. It will depend on whether the service is heavily used or if it's only doing something occasionally.

If the service is in constant use, it could make sense to automate GDB so it captures the config data structures quickly then safely detaches, and only briefly pauses the service.

Alternatively, if even automated GDB is too disruptive or difficult to use, or if the ptrace() syscalls used by GDB might cause a problem, it is often possible to capture a memory dump of the running process without affecting the process much, via /proc/PID/mem on Linux.

If necessary, write a tool in C to walk the heap by reading it from /proc/PID/mem to reconstruct the config data structures that way.

(All the above assumes Linux.)
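
To make the "automate GDB" idea concrete, here is a minimal sketch of a script for gdb's embedded Python interpreter (run with something like gdb -q -batch -x gdb_snapshot.py). The PID, the address range and any symbol names are placeholders, since nothing is known about the real binary's layout:

    # gdb_snapshot.py -- keep the attach window as short as possible.
    import gdb  # only available inside gdb's embedded Python

    PID = 12345                      # placeholder: the service's PID

    gdb.execute("set pagination off")
    gdb.execute(f"attach {PID}")     # the process is stopped from here on
    try:
        # If you know or guess a global config symbol, print it while attached,
        # e.g. gdb.execute("print some_config_struct", to_string=True).
        # Otherwise grab a raw region; the range below is a placeholder and
        # should come from /proc/<pid>/maps.
        gdb.execute("dump memory heap.bin 0x555555560000 0x5555556a0000")
    finally:
        gdb.execute("detach")        # resume the service as quickly as possible

Rehearse it end to end on a throwaway process first, and time how long the target actually stays stopped.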


I was going to post roughly this, but with one addition: find a human with knowledge of the config file format. Doesn't have to be a perfect memory but surely someone developed this application and will remember roughly what went in the file and its layout. Then use "strings" or equivalent on the binary and your memory dump to piece together a reproduced config file. Test that in a separate VM until it works. Biggest problem is probably going to be the cut over between the running and new config file instances. If it doesn't work you're screwed. Perhaps do that by diverting network traffic, leaving the working legacy instance in place to allow fall back?


May I ask about the business you are in? What happens to the world if your service goes down? It's just curiosity. Hopefully it's nothing of breathing-machine-like importance. :)


A memory dump will probably be what you want to do, but are you sure the process closed the file? If it's a config file, it's sensible that the process has closed the file descriptor, but if it didn't (probably because of developer error), it's possible that the file is still accessible (through the inode, for example, on Linux). Actually I faced the same problem today: it was a virtual machine disk file that had been overwritten, but the virtual machine was running, so the (deleted) file was open and I could get it back.


Interesting problem. There's some good ideas in here, and I think you might get some better ones if you add more details.

You said that you have been assigned the task of replacing the "zombie" program, which I assume means that you are to write a new one that interfaces with whatever depends on the zombie.

If the zombie were to die right now, how severe would the consequences be? On the scale of total disruption of business to minor inconvenience that could be worked around until it's fixed?

How complex is the zombie? What is the nature of the state that was supplied by the lost configuration file? Is it stuff like which port to listen on and how to connect to its database, or is it more like huge tables of magic values that won't be easy to figure out again?

Do you currently have enough information about the zombie to write a replacement? Do you have what you need to test your replacement before deploying it?

If you are confident you have sufficient information to write the replacement, how long do you think you need to write and test that?

You said the zombie runs on a physical server. What operating system? What language/runtime/stack is the zombie based on?


I'm at the airport in Dublin. On a layover. I was enjoying my Oreo Fusion at BK and when I read, "The service has current up time of 55 Months" I burst out laughing hysterically.

(Update re the downvote: You can't tell me that uptime isn't a little bit funny. I almost did a spit take at this. As someone who sees and thinks an uptime of 150 days is way too much this number just shocked me, that's all)


My father worked in a district heating plant. The control PC was an 80286 and had an uptime of over 10 years, though in theory the plant could be run without it, with knobs and levers.

It was replaced by a Pentium with a virtualized MS-DOS due to the Y2K event, though I am quite sure nothing changed in the underlying MS-DOS program, which was more or less time independent.


I remember reading some time ago an old story about folks recovering accidentally deleted system files from the currently running system. Can't find it, I think it was Ritchie who told it but I'm not sure. The long and short is that it might be possible to recover the lost config from what's currently running.



In South Africa the citizens are facing a crisis with exactly such a service: our electricity system. It is predominantly coal power plants that are at end of life, and rolling blackouts are a common occurrence. They are a necessary action to avoid a total blackout, given it would take weeks to get the system going again. Furthermore, high inequality and unemployment make it paramount to grow the economy, so these rolling blackouts have a huge socio-economic impact. It is a service that can't simply be restarted.

Transitioning risky services which can't be restarted is clearly complicated, especially for complex systems. I wonder if there is a body of knowledge with principles that could apply not only at this massive nationwide scale, but also to web services and the like. Does anybody have resources in this respect?


You guys need Jesus... and some solar panels.


Unfortunately, the national power provider has a monopoly on energy and is a political minefield. Independent power producers are ready to integrate massive solar and wind power capabilities, but due to politics the negotiations and contracts keep being delayed. Many private businesses are trying to produce their own power through solar and the like, but it's expensive, as that technology needs to be imported and the rand is one of the most undervalued currencies.


The raptors escaped because the Jurassic Park security system was never restarted before in its history :)


> In addition the source code for this compiled binary has also been lost in the mists of time.

Sounds like you could try copying the binary, putting it into an isolated VM—or a virtual copy of your other services—and beating a file with a stick until the service accepts it as a config.


> Ever worked with a service that can never be restarted?

Yes; any Unix init daemon.

What's "restarted"? Does accepting any new configuration and changing behavior in response to it count as a restart?

Or do we have to terminate the process entirely and start a new one for it to be a "restart"?


At my previous-previous job we had a core switch that had been on for 5 years. No one wanted to reload it because everyone was scared it couldn't restart anymore. We had no maintenance or spare parts, and the cabling was a mess.

My colleagues told me that it was still there 2 years ago.


Been there.. short story goes like..

There was an active/passive cluster with some complicated RAID storage.

One day the passive node's storage was gone (there were problems all along: corruptions, physical storage changes/swaps, etc.; mgmt put it on hold since there was a migration), and since there were already plans to migrate the cluster to somewhere else, no one from mgmt wanted to build another set of servers for such a "short" period.

Guess what... that short period was 18 months :)..

So for a year and a half, those systems were never touched and never restarted, all the while serving applications.

Sometimes we get lucky, sometimes we are out of job :)


I once overwrote my own partition table. The next week was spent keeping the laptop on or in standby and investigating recovery. I never found a good solution, so I started investigating alternatives and ended with the so-called nuke-and-pave: I reinstalled Ubuntu.

Anyway, that's quite a situation you have got there :) Can you not suspend to disk and keep the suspend image safe and able to be booted into after power failures or so? Maybe suspend to disk (hibernate) and then image the swap partition/file? Then some grub editing... or... what OS is this anyway?


If it's a critical system and you don't have the ability to fix it but you know it's only a matter of time before it fails, the sensible thing to do is change jobs.


As of a few years ago, Google had something vaguely similar. The details escape me, but it was something like --

The main RPC service had a dependency on the lock service to start up and vice versa. If both services went globally offline at the same time, they wouldn't be able to turn either (thus most of Google) back on again.

Someone came up with a wonderful hack to solve this involving basically air-gapped laptops which, when connected, could help a datacenter bootstrap itself.


If the file is still open by the process and not yet closed, the config file is still there if it’s Linux. You can find it in the process’s file handles in /proc


I have seen a scenario where a bespoke application, which had also been running for some time, had been lost to the miasma (the consultants took the source code with them).

The organization in question paid for a team of forensic software experts to reverse engineer it to the best of their ability ahead of a full data center migration to a new facility.

I left the company while they were somewhere in the middle of this project.


You might try something like this, hoping that the process still has the gone file "open".

https://unix.stackexchange.com/questions/268247/recover-file...


From my experience, dumping memory and most likely reverse engineering the binary is a must.

Actually, as it is a legacy system, chances are this will be easier.

To be honest, the path to follow mostly depends on the OS.

As it is a critical system, I would start with reverse engineering the binary, making sure the config is preserved across the lifetime of the application.


You could try to make every possible request programmatically & save the responses & then rebuild it.


"as long as it is never restarted"

Seen that. And then one day, power failed. UPSes kicked in. In one hour, still no mains power. Batteries depleted, systems started shutting down...including that one.

Had to build a replacement in a real hurry after that.


I have made a phone call. It’s not clear the phone system could be restarted. For example the control plane (SS7) is simply transmitted over the links it controls. It was built incrementally as a continually running system for more than a century and certain bootstraps/incremental elements were long since optimized out. Note that this is unrelated to the fanatical backwards compatibility at the edge which you might think would imply that it is restartable.

The Internet, I think, is restartable as long as layer 0 were available. It’s not really clear what it would mean for the Internet to need to be restarted — perhaps some attack/failure of the BGP infrastructure?


Unpopular advice: find a subtle way to make it crash, preferably stealthy, but if not possible, at least in a way that can be attributable to mild innocent incompetence instead of malice!

Then there will be more and more interesting work to do for you and others, either rediscovering and properly documenting the config or, hopefully, architecting and coding its replacement. In the aftermath, the organization will be more robust. If it actually collapses because of this, then it deserved to die anyway; you only helped accelerate the outcome and reduced the suffering.

Some things and processes need to be "helped to fail faster", everyone will benefit from the renewal in the end, even if most will hate it ;)


Back when I ran a company, my partners and I gamed out what to do with employees in various situations, as a way of coming to a consensus. I was against legal action in all but extreme hypotheticals. Behavior like this is in that class.

You assume you not only have all the relevant information to predict the outcome, but also that your analysis of the situation is superior enough to trump those who actually own the process and will be responsible for the aftermath of your destructive actions. You demonstrate you know this by suggesting stealth.

Thinking about the fact that people like this are out there makes me paranoid about my hiring all over again.


If you had hired competent people to begin with, the configs would have been backed up.

I don't agree with the other poster's recommendation, but I understand the sentiment. The same incompetence that allowed the event to happen is the same incompetence that allowed it to be swept under the rug for almost five years.

I've seen paragons of hiring get upset because failures were made visible, not understanding that the failures were there all along and that they were playing a game of chicken with reality; only now do they have a chance of winning.


>If you had hired competent people to begin with, the configs would have been backed up

What happy-go-lucky universe do you hail from? You could hire the most competent people in the world and you'll still end up with a horribly complex shitshow of lost programs, untested backups, missing docs, implicit information out the ass, and so on. Software development has never had the discipline to claim that competent employees produce competent results -- they're just less likely to produce shitty ones.

And we spend 90% of our time dealing with every layer of incompetence our competent, and incompetent, friends have provided us.


It's possible to avoid losing a configuration.


It's possible, and it's possible to avoid losing source code, and it's possible to have valid backups and keep your servers decently secure... yet it's sufficiently common (especially as a program ages) that it can't be trivially attributed to competency; rather, you need continuous, unfailing competency in the face of continuous environmental change (local changes within the business, and external changes as business needs shift, mergers happen, people quit or are fired, etc.).


It's more than possible, it's expected. While all of these things do happen, they are preventable and should not be treated as excusable; people need to be held accountable and systems/processes need to be put in place to ensure these things don't recur.


> rather you need continuous, unfailing competency

That's called doing it manually.

What you're doing is making excuses rather than holding yourself to a higher level of quality.


>That's called doing it manually.

Yes, a common feature of unstable environments. Automation is an ideal, not a necessity, and it's not always viable or cost-effective.

But note also that "unfailing competency" means you need to continuously maintain (all aspects of) the automation without fail, again in spite of a changing environment, and across everyone who does things.

>What you're doing is making excuses rather than holding yourself to a higher level of quality.

And what you're doing is pretending that firing off blind, trivial prescriptions is a useful activity -- perhaps I'll soon find you selling self-help books as well :-)

Automation is not an answer that can always be applied trivially; especially retroactively. Even despite all the marketing that will tell you otherwise.


> Automation is an ideal, but not a necessity

You're just bad. You automate specifically to take the human-error risk out of the equation; that's part of what it means to be competent. We don't automate our deploys because it's an ideal rather than a necessity, but so some jackass doesn't delete a configuration while also never having put it into a backup, or source control, or on their dev machine, ad nauseam.

It's not an accident that some of us have been working 20+ years in this industry and have never been the cause of something like this.

> Automation is not an answer that can always be applied trivially; especially retroactively. Even despite all the marketing that will tell you otherwise.

shit, I guess I didn't read the right pamphlet...


> Thinking about the fact that people like this are out there makes me paranoid about my hiring all over again.

To be honest, I was 100% trolling or troll-baiting "for the lulz" - I wouldn't actually ever apply my "advice". And I don't think there's actually a significant chance of coming across people thinking this way in your business unless you work in very specific areas.

I mean, sure, some of us enjoy destroying/breaking things or making them fail or fall apart, either "for the fun of it" or "to leave space for future better things", but we do it in special controlled environments to get our kicks, not in our business/work. Unless you're in the military or a similar business, you wouldn't ever be in the position of hiring people expected to manifest their selective-destructive tendencies at work, with the risk of them occasionally turning the gun in the wrong direction.

The biggest problem for a business, imo, is people who actually do this but do it subtly and unconsciously: the developers picking "interesting" architectures and technologies that end up destroying the possibility of integration with legacy services and indirectly force an expensive rewrite of those legacy systems, the managers who adopt "novel" management systems that end up pushing valuable people out of the company, etc.

People who consciously commit sabotage and destructive action (even if it's "to make the world a better place" in their view) are easy to spot and handle (e.g. legally), and those with the competence to really do things stealthily are usually more motivated to apply such skills in other areas... the other "damaged apples", doing it without fully realizing it, are the ones dangerous for businesses and the ones you need to worry about when hiring...


This is naive evil-mastermind-level thinking, where the villain spouts his whole line and you think "no, wait, that's just stupid..." as he sends off the sequence of nukes.

>If it actually collapses bc of this, then it deserved to die anyway, you only helped accelerate the outcome and reduced the suffering

If you want to make an evolutionary "survival of the fittest" type of claim... Life is entirely about error correction (and reproduction + staving off death as long as possible). Engineers babbling to engineers is part of a business's error correction.. Duct tape and monkeypatching is part of the engineer's error correction..

A rogue engineer trying to invoke rapid deconstruction of his environment (a virus) is one of those errors -- Solution: engage, restrain, and eject.

The legitimate error correction is already happening, by means of hope & replace. Presumably because other business concerns make it so that actual shutdown would be a Terrible, Horrible, No Good, Very Bad idea.

>If it actually collapses bc of this, then it deserved to die anyway, you only helped accelerate the outcome and reduced the suffering.

A doctor can't help you if you die in minutes. They can save you if you die in hours. Time is a major part of error correction.

Oh damn, you've pushed out the nukes already.

>Some things and processes need to be "helped to fail faster", everyone will benefit from the renewal in the end, even if most will hate it ;)

Seems to me the main process requiring failure acceleration is this one employee's UAC expiration policy.


Came here to post something similar. Why jump through all these hoops to unfuck some inherently fucked system?

It's a bunch of work and time and effort to reverse engineer it just to get back to the point of still having a piece-of-shit legacy system.

Honestly, this is effectively natural selection at play for businesses. They put themselves in this situation; if it dies and destroys the business... good fucking riddance. The people who let this situation occur need to fail hard; it's the only way they will ever learn.

Just walk up to the machine, plug in a USB killer for two seconds, remove it, and get on with your life.


We have a job running on the back end servers here that is expected to take two weeks in order to finish :) Hearing 55 months for your backend makes me more optimistic, haha.


My first step would be to try to "undelete" the file from the filesystem; even if it was overwritten, it is possible that some of it is still salvageable.


Surprised by the scenario of losing the code for a binary that is still running. In which software domains is this common?


Is it possible to run queries against the program in an effort to reconstruct the config, or would that also be too dangerous?


What OS? On Linux, with Checkpoint/Restore (CRIU) you can checkpoint the process to disk and restore it when needed.


dd the disk and pipe it over ssh/scp to another machine; the config file could still be in there.


Check https://github.com/checkpoint-restore/criu - it's designed to dump a set of processes to files and restore them elsewhere.
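
A rough sketch of driving it from a script, with the PID and image directory made up; per the CRIU docs, --leave-running keeps the original process going, and established TCP connections need extra flags such as --tcp-established:

    #!/usr/bin/env python3
    # Sketch: checkpoint the running process to an images directory with CRIU
    # while leaving it running, so the frozen state can be copied off-box.
    import os, subprocess, sys

    pid = sys.argv[1]
    images = "/var/tmp/criu-images"
    os.makedirs(images, exist_ok=True)

    subprocess.run(
        ["criu", "dump", "-t", pid, "-D", images,
         "--leave-running", "--shell-job"],
        check=True,
    )

    # Later, on the same or a compatible machine:
    #   criu restore -D /var/tmp/criu-images --shell-job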


One of the source control companies should take this story to use for marketing.


Can you take a core dump while it's running?
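
If gdb is installed, gcore can: it ptrace-attaches (which pauses the process for a moment but does not terminate it) and writes a core file of the live process. A minimal sketch, with the PID and output prefix made up:

    #!/usr/bin/env python3
    # Sketch: use gcore (ships with gdb) to snapshot a live process's memory
    # into a core file without killing it. Writes /var/tmp/legacy-core.<pid>.
    import subprocess, sys

    pid = sys.argv[1]
    subprocess.run(["gcore", "-o", "/var/tmp/legacy-core", pid], check=True)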


This is how most systems work, at some scale.


Is it running in a docker container? Can you copy the suspended state of that container?


You know how, with the flu vaccine, one in some large number gets sick? With this it may end up worse.

If it's a VM, take a snapshot periodically. I suspect it's not.

Try a P2V (physical-to-virtual) conversion, which converts it to a VM on the fly; leave the VM off and periodically rerun the P2V.


Capture packets and see what kind of traffic it receives most commonly.

The modern solution would be to capture the incoming packets as a training set then apply machine learning to create a model that can perfectly recreate the outgoing packets.

It’s still an inexplicable black box of course, that is the nature of ML, but at least you can run it in the cloud now.


Sorry that nobody else got it, but I like that joke.

Do you think we can put some crypto in there, too?


>Sorry that nobody else got it, but I like that joke.

No, plenty of us got it. It's just not funny right now.

A person is trying to find a real solution to a real problem they have. Every comment in this thread represents a potential hope that someone knows how to fix it. When someone comes in to a serious thread that is spitballing ideas and tries to make it about them by cracking jokes and being fun and clever, it's annoying. I know this because for about 8 years I was the guy trying to crack jokes in these threads.

This is not the time to make some kind of point about the state of modern software engineering - do that in a thread where people are bullshitting, not trying to work a problem. It's not ever really funny. It's noise. It doesn't help anything. It's not too subtle, it's just obnoxious.


That just made me realize that my personal habit of joking during prod incidents, while relieving to me, might not be to co-workers. Thanks for the perspective check.


What a silly excuse for being offended by nothing.

Watch as I come up with a random reason you're wrong.

The OP could see it and laugh, and that could possibly lighten their day and help them relax.

Wait, no. That's positive; online we're only supposed to bash men, Hitler, and take everything in the most negative light possible. I should have instead judged you as a hateful person who has a horrible life and no friends.


Please don't take HN threads further into flamewar. It helps nothing.

https://news.ycombinator.com/newsguidelines.html


Definitely a use case for blockchain ;-)

Too subtle for HN but I don’t mind.


Ah, apologies for not realising this was /s; I see people who genuinely advocate this sort of stuff on HN, so it went way over my head!


That's the scary part! You can't tell anymore whether it's a joke or some cutting-edge consultant-speak... ;-)


Nice! Was looking for that on my bingo card. :-)


This is not a reasonable approach at this point. You'd need to know the whole state required for processing. That includes precise time, any local storage, any remote dependencies, randomness seeds, etc. Even then, we're not even close to full synthesis from examples. This is not a modern solution. This is made up, beyond toy examples.


I think they're joking.


The follow-up sounded pretty serious: https://news.ycombinator.com/item?id=22063160


You end up in the same state you started from: running a blackbox.


True but now you are independent of the original hardware.


Ha ha. If it’s that easy why developer software at all. Just run a NN on packers and you can create anything without coding!


This problem is being worked on but it is by no means solved!

This paper [0] is a pretty decent survey on the field and [1] is an excellent article on the topic at a high level, written by a team currently working in the space (synthesising smart contracts).

[0] https://arxiv.org/pdf/1802.02353.pdf

[1] https://medium.com/@vidiborskiy/software-writes-software-pro...


> apply machine learning to create a model that can perfectly recreate the outgoing packets.

This reminds me of Team America [1] ... or maybe idiocracy [2] :-p

[1] https://www.youtube.com/watch?v=DIlG9aSMCpg

[2] https://www.youtube.com/watch?v=kIZ9YuPm_Ls


Is this seriously done? I just completed a small literature review on neural program synthesis, which uses inputs as training data so a neural network can generate a program that produces the right outputs, but it's a huge search problem and I assume anything non-trivial is out of our reach? As an example, DeepMind achieved binary multiplication with this technique in 2018, and that was notable.


> Is this seriously done?

No. Not even close for any general purpose I/O.


I’ve never done it, but it should be feasible. A maximum-size Ethernet frame is 1518 x 8 == 12144 bits, which is not especially big for a one-hot style encoding these days. You would just need to be sure to capture representative traffic, e.g. accounting for monthly jobs.


In my understanding, ML works by mapping a huge input space to a tiny output space, and most of the input space remains unexplored.

In this case the input space is small and you want a quite dense mapping from input to output. I do not think ML applies.


No, ML works fine with large output spaces too.

The problem is that the solution space grows super exponentially, and if you need to find an exact one, then the number of samples too.


Most IO is larger than one packet.


We detached this subthread from https://news.ycombinator.com/item?id=22062790.



