
Unix Recovery Legend (1986) - electrum
http://www.ee.ryerson.ca:8080/~elf/hack/recovery.html
======
krylon
> Well, for one thing, you must always remember the immortal words, DON'T
> PANIC

So true. A colleague of mine managed - on his second or third day on the job - to
delete every single user account in our Active Directory. After an hour, we
gave up trying to restore the AD (it was an SBS2008, so no AD recycle bin) and
simply restored the entire DC (at the time, our domain only had the one DC)
from backup. Surprisingly, most of our users took it very well and used the
time to get some paperwork done or clean up their desks or something like
that. Still, it was one of the most stressful days of my life. So we kind of
panicked. In retrospect, I think another hour or so of research might have
saved us the eight hours of restoring that server (did I mention that our
backup infrastructure really, really sucked at the time?).

In smaller disasters, I've found the ability to remain calm most valuable,
though. Having your boss breathing down your neck impatiently can instill a
deep desire to simply do _something_ just to show that you are working on the
problem. But if you don't understand what's wrong, at best you are wasting
time, and possibly making the problem even worse.

~~~
Spooky23
Amen. I was a new DBA at a company that used a DBMS with extent-based disk
allocation and a default extent size of 16 KB. That was a problem on a busy
table (which this was), because there was a finite limit to the number of
extents that could be allocated, but that limit was impossible to predict.

The proactive fix was simple: back up and restore the table in a two-hour
outage window. I asked for 5 hours, as the previous DBA had never tested a
restore.
"Unacceptable", said the SVP.

Fast forward two weeks. We hit the limit, and the entire company is
essentially down. Between lost revenue, SLA fines and payroll, they were
losing something like $5k/minute. Recovery at this point required 30 hours of
full database restore, including journal recoveries from slooooow DAT tapes.

There were three of us. Everyone stayed calm, provided regular updates, and
handled a few hours of direct observation by the CEO and a board member.

~~~
dredmorbius
And the SVP?

~~~
Spooky23
Master politician... He made it somebody else's problem.

Funny story is that he forgot to pay the phone bill for one of the call
centers, and when they went to walk him out, they found him in a "compromising
position" with his secretary in the office.

That company was an unlimited source of material! Good times!

------
bluesmoon
Many years ago I ran into a similar problem. A technician came in to the
office to replace my hard disk with a larger one, and instead of copying my
old data to the new one, he started copying the new disk to the old one,
resulting in /home now being partially NTFS and partially ext2, i.e., it was
unreadable in both Linux and Windows. I documented the recovery process here:
[http://tech.bluesmoon.info/2004/11/home-is-ntfs.html](http://tech.bluesmoon.info/2004/11/home-is-ntfs.html)

Bonus: a year later, at a different company, I was faced with having to
undelete source files on FreeBSD (UFS). I documented that as well here:
[http://tech.bluesmoon.info/2004/08/undelete-in-freebsd.html](http://tech.bluesmoon.info/2004/08/undelete-in-freebsd.html)

------
acveilleux
I've had to do something similar a long time ago in the mid-90s when Linux
switched from libc5 to glibc6. In this case, I hadn't deleted everything,
rather I'd stupidly upgraded libc locally.

After learning a valuable lesson in exactly how dynamic libraries work and the
recommended process for a live libc upgrade (don't do it if the ABI changes), I
fixed it by using my IRC client, which was already running and so unaffected,
to get a statically linked copy of /bin and /sbin from another machine, via
DCC Send...

Recovery then consisted of restoring libc5 from slackware 3.2 install media.

I can't remember how I got root; either su was statically linked (believable
since it's setuid) or I had a logged-in root session. I did have to use the
tcsh "echo *" trick for file listing and the shell built-in cd...

~~~
teddyh
"cd" is _always_ a shell built-in. It is not even possible to have a /bin/cd
binary which does what "cd" does.

~~~
GTP
Why is it impossible? Maybe I've understood why it is always built in: without
it the shell wouldn't be able to navigate through the directory tree to find
the executables of the commands. But once the shell has it, I think it would
be redundant (so nobody does it) but still possible to have a /bin/cd that
navigates directories.

~~~
duaneb
You can't change the environment of the parent process, including CWD.
`/bin/cd` would either be another process like all executables run by the
shell, and not work, or special cased, at which point why make it an
executable at all?
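
A minimal Python sketch of that point, added for illustration and not taken from anyone's real code: it forks, changes directory in the child only, and reports the parent's working directory before and after. The name `chdir_in_child` is invented here, and the sketch assumes a Unix-like system where `os.fork` exists.

```python
import os


def chdir_in_child(target):
    """Fork, chdir() in the child only, and return
    (parent_cwd_before, child_cwd_after, parent_cwd_after)."""
    before = os.getcwd()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this is all a separate /bin/cd process could ever do.
        try:
            os.close(read_fd)
            os.chdir(target)
            os.write(write_fd, os.fsencode(os.getcwd()))
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent: read what the child reported, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        child_cwd = os.fsdecode(pipe.read())
    os.waitpid(pid, 0)
    return before, child_cwd, os.getcwd()


if __name__ == "__main__":
    before, child, after = chdir_in_child("/")
    print("parent before:", before)
    print("child after chdir:", child)
    print("parent after:", after)
```

The child's chdir succeeds, but the first and last values come back identical: the working directory is per-process state, and it is copied at fork time rather than shared.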

~~~
JoshTriplett
I can think of one way to write "cd" out-of-process from the shell: exec
/bin/cd /some/path, which does chdir("/some/path") and execs your shell. You'd
lose your history and similar state, but if you had a shell that didn't even
put cd in-process, it probably doesn't have history either.

Sample:

    
    
        /tmp$ exec /tmp/cd /bin
        /bin$ exec /tmp/cd /home
        /home$ exec /tmp/cd /usr
        /usr$ 
    

Sample code:

    
    
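        #include <err.h>
        #include <stdlib.h>
        #include <unistd.h>
        
        int main(int argc, char *argv[])
        {
            char *shell;
            if (argc != 2)
                errx(1, "Usage: exec cd /some/path");
            /* Change directory in this process... */
            if (chdir(argv[1]) == -1)
                err(1, "chdir");
            shell = getenv("SHELL");
            if (!shell || !*shell)
                shell = "/bin/sh";
            /* ...then replace it with a fresh shell, which inherits the new cwd. */
            execl(shell, shell, (char *)NULL);
            err(1, "exec"); /* only reached if execl failed */
        }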

~~~
duaneb
Hah, hadn't thought of that. Kind of CPS for processes. I'm not sure what that
would gain you, though, as you'd have to either a) write both the cd and the
user of the cd, and there are less awkward ways to do that than exec'ing, or
b) you'd have to pass something ELSE to exec to return control after cd
finishes.

This seems a little awkward and I'm not sure what is being proven. :P

~~~
aardvark179
It's how very early versions of Unix effectively did stuff: they exec'ed a
program, and when it exited they exec'ed the shell again.

When they introduced something like the fork/exec model we now have they
discovered cd didn't work any more and had to write it as a built in.

Now, for bonus points, work out how goto worked when it wasn't a built in.

~~~
JoshTriplett
> It's how very early versions of Unix effectively did stuff: they exec'ed a
> program, and when it exited they exec'ed the shell again.

Makes sense, since those systems might not have enough memory to run the shell
and another program at the same time.

------
LordKano
6-7 years ago, a co-worker was doing some work on a server that we were
phasing out. It wasn't being used for anything of critical importance, so a Jr
level person was permitted to have root. He was actually my colleague and
equal on the org-chart but he had less experience with *NIX administration.

He was copying directories to the new server and deleting them when he was
done. Well...

I don't remember the sub-directory he had just finished copying but when he
went to delete it he typed in "rm -rf / SOMEDIR/SOMESUBDIR" and hit enter.

He almost immediately realized what he had done and hit CTRL-C but by that
point the damage was done. Our boss had access to the previous week's backups
and since the server wasn't critical, he just rebooted it and restored but we
had a good laugh about it.

The next day, I made a paper airplane and on the side I wrote in red Sharpie
"Linux Air" and then "rm -rf" inside of a red circle with a line through it
and taped it to the top of his monitor.

He was a good sport about it and left it up for a couple of months.

~~~
pdw
This particular variation was (finally!) fixed in the 2013 POSIX revision:

"If either of the files dot or dot-dot are specified as the basename portion
of an operand (that is, the final pathname component) or if an operand
resolves to the root directory, rm shall write a diagnostic message to
standard error and do nothing more with such operands."

[http://pubs.opengroup.org/onlinepubs/9699919799/utilities/rm...](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/rm.html)

Warning: not yet widely implemented outside the GNU world.
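
A rough Python sketch of the check that quoted wording describes. This is not how GNU rm or any other real rm is implemented, and `rm_must_refuse` is a name invented for the example. It also shows why the stray space in the story above was fatal: the shell hands rm `/` as an operand of its own.

```python
import os


def rm_must_refuse(operand):
    """Return True if, per the quoted wording, rm should skip this operand."""
    # Final pathname component, ignoring trailing slashes ("foo/../" -> "..").
    basename = operand.rstrip("/").rsplit("/", 1)[-1]
    if basename in (".", ".."):
        return True
    try:
        # Same device and inode as "/" means the operand is the root directory.
        # lstat() is used so a symlink that merely points at "/" is not refused.
        return os.path.samestat(os.lstat(operand), os.lstat("/"))
    except OSError:
        # Nonexistent or inaccessible operand: this rule does not apply to it.
        return False


if __name__ == "__main__":
    # The operands rm receives for: rm -rf / SOMEDIR/SOMESUBDIR
    for operand in ["/", "SOMEDIR/SOMESUBDIR"]:
        verdict = "refuse" if rm_must_refuse(operand) else "proceed"
        print(repr(operand), verdict)
```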

~~~
TheLoneWolfling
Personally, I wish two things:

1) that rm had a (configurable) number of files / directories above which
it'll ask you to double-check.

2) that files had a "pleasedontdelete" flag that rm would check and ask about.

~~~
derefr
> pleasedontdelete flag

For some programs (rm isn't one of them), removing the write permission is
sufficient to get an interactive prompt when you try to
overwrite/truncate/delete the file.

I really like OSX's current metaphor, where files that haven't been touched in
a while get "locked", and must then be "unlocked" to modify them further.
Phrasing it in terms of stability rather than permission makes a lot of sense
to me. It's too bad the metaphor isn't echoed in the CLI.

~~~
alcari
`rm` is definitely one of them, it's just suppressed by the `-f` flag.
(examples from Ubuntu)

    
    
        $ touch foo
        $ chmod -w foo
        $ rm foo
        rm: remove write-protected regular empty file ‘foo’?
    

There's also the immutable attribute many modern filesystems support, which
prevents the file from being modified unless the attribute is removed.

    
    
        $ sudo chattr +i foo
        $ rm foo
        rm: remove write-protected regular empty file ‘foo’? y
        rm: cannot remove ‘foo’: Operation not permitted

------
Zenst
Yip, had a manager do that on a client's site. The best part was that the kit
was so new that there were only a few in the country, the install set for the
OS had not arrived, and there were no backups. It was a new machine that had
been partially configured, awaiting tapes.

Luckily another client had the same RS/6000 (I think the 3rd in the country
outside IBM), and we were able to borrow their install DAT to bring AIX back
to life.

Odd, as I'd had an RT/6150 that (though nobody admitted it) had hit a similar
problem. Getting it limping along involved copying files from a working system
onto the holed system to fill the gaps. The eventual reinstall took most of
that weekend, only to find that floppy disk 70-odd was corrupt. Much fun.

But *nix is great, as there is always more than one way to get things done,
and that can be true on many systems.

Still, a good education in not only backups but backup integrity, as you never
know when you will need to read them back.

------
amyjess
I first found this story in a collection of Unix horror stories several years
ago:
[http://www.yak.net/carmen/unix_horror_stories](http://www.yak.net/carmen/unix_horror_stories)

If you enjoyed this one, you'll probably enjoy the others in there as well.

~~~
teddyh
See also _COMPUTER-RELATED HORROR STORIES, FOLKLORE, AND ANECDOTES_ :

[http://wiretap.area.com/Gopher/Library/Techdoc/Lore/rumor.ne...](http://wiretap.area.com/Gopher/Library/Techdoc/Lore/rumor.net)

------
irishcoffee
One of my coworkers was out in the field trying to fix a problem on a RHEL
box.

As root, typed in chown -R nobody:nobody / some/dir/somewhere/

Ended up having to re-image the system. We still give him a hard time about
it.

Now I'm wondering if he could have tried to find a more creative solution as
displayed in TFA.

Thoughts?

Edit: He hard killed the box right after he typed that in.

~~~
bluesmoon
I suppose he could have tried remounting the device read-only, and then using
dd to directly update the inode table to change ownership, but, if he was
already root, he could have just chown'ed it again.

------
caipre
Awhile back I read another disaster recovery story, along the lines of
recovering a long pipeline command from a long-running process by reading the
resident memory of the containing shell. I've not managed to find it, but I
think it was posted to HN sometime last year or so. Anyone know about it?

------
pronoiac
In the ext2 days, an incident put a _lot_ into lost+found. I had a Tripwire
database with file checksums, and wrote a script to checksum the files and
move them back into place. I think a handful of files were corrupted, but I
got it booting again.
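
The idea can be sketched in a few lines of Python. This is a reconstruction of the approach, not the original script: the `known` mapping (checksum to original path) stands in for whatever the Tripwire database provided, SHA-256 is an arbitrary choice of hash, and `restore_from_lost_found` is a made-up name.

```python
import hashlib
import os
import shutil


def file_digest(path):
    """Hex SHA-256 of a file, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def restore_from_lost_found(lost_found, known, root="/"):
    """Move each file whose checksum appears in `known` back into place.

    `known` maps checksum -> original path, relative to `root`.
    Returns (restored, unmatched) lists of paths.
    """
    restored, unmatched = [], []
    for name in sorted(os.listdir(lost_found)):
        src = os.path.join(lost_found, name)
        if not os.path.isfile(src):
            continue  # reconnected directories are ignored in this sketch
        rel = known.get(file_digest(src))
        if rel is None:
            unmatched.append(src)  # corrupted, or never in the database
            continue
        dest = os.path.join(root, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(src, dest)
        restored.append(dest)
    return restored, unmatched
```

One limitation of this sketch: a plain mapping holds one path per checksum, so files with identical contents would need extra handling. Ownership and permissions would also have to be restored separately.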

------
wsterling
Why did they not boot from installation/recovery media, mount the drive, copy
what was left, restore and copy back the user and configuration data that was
found?

~~~
charonn0
> Alternatively, we could get the boot tape out and rebuild the root
> filesystem, but neither James nor Neil had done that before, and we weren't
> sure that the first thing to happen would be that the whole disk would be
> re-formatted, losing all our user files. (We take dumps of the user files
> every Thursday; by Murphy's Law this had to happen on a Wednesday). Another
> solution might be to borrow a disk from another VAX, boot off that, and tidy
> up later, but that would have entailed calling the DEC engineer out, at the
> very least. We had a number of users in the final throes of writing up PhD
> theses and the loss of a maybe a weeks' work (not to mention the machine
> down time) was unthinkable.

~~~
wsterling
Yea, I am not buying it. They have a UNIX guru that is building new commands
in assembly and transferring the files with uuencoding but no one knows how a
recovery tape works? Using a recovery tape was not a rare event in UNIX shops.

~~~
PhasmaFelis
Mario Wolczko's home page is here:
[http://www.wolczko.com/](http://www.wolczko.com/) Feel free to drop him a
line and tell him he's a liar.

------
pronoiac
I've considered this problem before. If I remember my research, I'd
investigate using bash's built-in echo or printf, and paste encoded versions
of binaries into the terminal. To speed up the terminal work, get uudecode,
base64, or iconv working. Maybe wget or curl, to fetch base packages and
unpack them. Then, say, ssh, and rsync, and selectively restore from backups.
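
As a sketch of the first step, here is how the healthy machine might turn a binary into commands that can be pasted into the broken one. It relies only on three-digit octal escapes in the `printf` format string, which POSIX specifies. The helper name `printf_lines`, the chunk size, and the destination path are assumptions for illustration.

```python
import shlex


def printf_lines(data, dest, chunk=64):
    """Yield shell commands that recreate `data` at `dest` using only printf.

    Every byte is written as a \\ooo octal escape, so NULs, quotes and
    percent signs need no special handling. The first command truncates
    the file; the rest append to it.
    """
    target = shlex.quote(dest)
    for offset in range(0, len(data), chunk):
        piece = data[offset:offset + chunk]
        escaped = "".join("\\%03o" % byte for byte in piece)
        redirect = ">" if offset == 0 else ">>"
        yield "printf '%s' %s %s" % (escaped, redirect, target)


if __name__ == "__main__":
    sample = b"\x7fELF\x02\x01\x01\x00"  # first bytes of a 64-bit ELF header
    for line in printf_lines(sample, "/tmp/fragment", chunk=4):
        print(line)
```

Octal-escaping every byte roughly quadruples the size, which is why getting a real decoder such as uudecode or base64 across early matters.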

------
pki
Slightly off-topic but I am curious: what are the things at the bottom? Are
they precursors to email or something?

~~~
rwh86
By "things" I assume you mean this stuff:

> ARPA: miw%uk.ac.man.cs.ux@cs.ucl.ac.uk
> USENET: mcvax!ukc!man.cs.ux!miw
> JANET: miw@uk.ac.man.cs.ux

miw is clearly Mario Wolczko's username.

Those will be his email addresses in different formats. The first would route
email to him at miw@uk.ac.man.cs.ux using cs.ucl.ac.uk first as a gateway. The
second address format is what's called a bang path, for UUCP email. The final
one is a modern IETF standard email address. JANET is a network provider for
UK academic institutions, similar to an ISP but structured differently.

You might find these references interesting:

[https://en.wikipedia.org/wiki/Email_address](https://en.wikipedia.org/wiki/Email_address)
[http://www.faqs.org/docs/linux_network/x-087-2-mail.address....](http://www.faqs.org/docs/linux_network/x-087-2-mail.address.html)
[http://www.livinginternet.com/e/ew_addr.htm](http://www.livinginternet.com/e/ew_addr.htm)

~~~
pm215
I think that last one isn't a modern standard email address -- the domain part
is reversed. It's a JANET NRS path
([https://en.m.wikipedia.org/wiki/JANET_NRS](https://en.m.wikipedia.org/wiki/JANET_NRS)).
The modern equivalent would be miw@ux.cs.man.ac.uk.

The first email address (the ARPA one) is presumably using a machine in UCL as
the gateway between internet routing and JANET routing, since the part before
the % is the JANET NRS component ordering. The obvious conclusion would be
that the mail server he was using didn't have an ARPAnet connection at all.
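
The reordering is mechanical. A small Python sketch, with function names invented here; real gateways did considerably more than this:

```python
def nrs_to_dns(address):
    """Convert user@uk.ac.man.cs.ux (JANET NRS order) to DNS order."""
    user, _, domain = address.rpartition("@")
    return user + "@" + ".".join(reversed(domain.split(".")))


def split_percent_hack(address):
    """Split user%inner.domain@gateway into (gateway, address the gateway sees)."""
    local, _, gateway = address.rpartition("@")
    user, _, inner = local.rpartition("%")
    return gateway, user + "@" + inner


if __name__ == "__main__":
    gateway, janet = split_percent_hack("miw%uk.ac.man.cs.ux@cs.ucl.ac.uk")
    print("gateway:", gateway)
    print("JANET form:", janet)
    print("DNS form:", nrs_to_dns(janet))
```

Feeding the ARPA form through both steps yields the gateway cs.ucl.ac.uk and then miw@ux.cs.man.ac.uk, the modern equivalent given above.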

------
davidw
That's way more badass than my own rm -rf disaster:
[http://journal.dedasys.com/2006/01/30/disaster-strikes/](http://journal.dedasys.com/2006/01/30/disaster-strikes/)

~~~
smhenderson
I'm curious to see what your Tcl code looks like. Did you keep it around for
future disasters or was it just a quick and dirty one-off?

~~~
davidw
It was as quick and dirty as could be. It was pretty simple, really, as it
just accepted a connection and wrote a file with no encoding. No security or
concurrency or any other niceties.
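
Something in that spirit, sketched in Python rather than Tcl. The original code isn't shown, so this is only a guess at its shape, and `save_stream`, `serve_once`, and any port number passed in are invented for the example: accept one connection and dump the raw bytes to a file.

```python
import socket


def save_stream(conn, dest):
    """Copy raw bytes from a connected socket to `dest` until the peer closes."""
    total = 0
    with open(dest, "wb") as out:
        while True:
            data = conn.recv(65536)
            if not data:  # empty read: the sender closed the connection
                break
            out.write(data)
            total += len(data)
    return total


def serve_once(port, dest, host="0.0.0.0"):
    """Accept a single connection and save whatever it sends.

    No authentication, no encoding, no concurrency: whoever connects
    first decides what ends up in the file.
    """
    with socket.create_server((host, port)) as listener:
        conn, _peer = listener.accept()
        with conn:
            return save_stream(conn, dest)
```

Calling `serve_once(9999, "received.bin")` would block until one sender connects and hangs up. On the sending side, anything that opens a TCP connection and writes the file's bytes would do.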

~~~
smhenderson
Ah, got it. Sounds like it did the trick though, a clever idea to come up with
in the midst of looming disaster! :-)

------
AlphaWeaver
This somehow always ends up popping up every couple of years, never ceases to
astound me...

------
barteklev
Thank god, nowadays we have extundelete. :-)

------
kjs3
I remember the before-time, when the GNU lived only in Boston. When beasts
like 'adb' and 'fsdb' roamed the earth, and 'fsck' was the wimpy stuff of
children's taunts. When I controlled my horror and revulsion and used them to
recover the broken VAX by manually rebuilding inode lists and finding the lost
fragments in the dark.

Hard tools for hard admins; the Gods of BSD, McKusick and Joy and the rest,
were and are wise. We live in more refined times....

