Unix Recovery Legend (1986) (ryerson.ca)
143 points by electrum on Sept 2, 2015 | 60 comments



> Well, for one thing, you must always remember the immortal words, DON'T PANIC

So true. A colleague of mine managed - on his second or third day on the job - to delete every single user account in our Active Directory. After an hour, we gave up trying to restore the AD (it was an SBS 2008, so no AD recycle bin) and simply restored the entire DC (at the time, our domain had only the one DC) from backup. Surprisingly, most of our users took it very well and used the time to get some paperwork done or clean up their desks or something like that. Still, it was one of the most stressful days of my life. So we kind of panicked. In retrospect, I think another hour or so of research might have saved us the eight hours of restoring that server (did I mention that our backup infrastructure really, really sucked at the time?).

In smaller disasters, though, I've found the ability to remain calm most valuable. Having your boss breathing down your neck impatiently can instill a deep desire to simply do something, just to show that you are working on the problem. But if you don't understand what's wrong, at best you are wasting time, and at worst you are making the problem even worse.


Amen. I was a new DBA at a company that used a DBMS with extent-based disk allocation and a default extent size of 16 KB. That was a problem on a busy table (which this was), because there was a finite limit to the number of extents that could be allocated, and that limit was impossible to predict.

The proactive fix was simple: back up and restore the table in a two-hour outage window. I asked for 5 hours, as the previous DBA had never tested a restore. "Unacceptable", said the SVP.

Fast forward two weeks. We hit the limit, and the entire company is essentially down. Between lost revenue, SLA fines and payroll, they were losing something like $5k/minute. Recovery at this point required 30 hours of full database restore, including journal recoveries from slooooow DAT tapes.

There were three of us; everyone stayed calm, provided regular updates, and handled a few hours of direct observation by the CEO and a board member.


Oh my! And I thought our backup software took a long time to restore the server... (This whole thing happened on a Friday, so 30 hours would not have been that much of a problem for the rest of the company, but I would have had to work over the weekend and spend the night on-site...)

The sad thing about such events is that afterwards, you could go "Told you so", but usually, people will not only not listen, but sometimes will still find a way to blame you for what went down. (In our case, it was our mistake, but we were lucky our CEO took it very well - he has no problem with people making mistakes as long as they are open about it and try to learn from their mistakes. What he cannot stand, though, is people trying to cover their butt and/or shift the blame onto others...)


And the SVP?


Master politician... He made it somebody else's problem.

Funny story is that he forgot to pay the phone bill for one of the call centers, and when they went to walk him out, they found him in a "compromising position" with his secretary in the office.

That company was an unlimited source of material! Good times!


Many years ago I ran into a similar problem. A technician came in to the office to replace my hard disk with a larger one, and instead of copying my old data to the new one, he started copying the new disk to the old one, resulting in /home being partially NTFS and partially ext2, i.e. unreadable in both Linux and Windows. I documented the recovery process here: http://tech.bluesmoon.info/2004/11/home-is-ntfs.html

Bonus: a year later, at a different company, I was faced with having to undelete source files on FreeBSD (UFS). I documented that as well here: http://tech.bluesmoon.info/2004/08/undelete-in-freebsd.html


I had to do something similar a long time ago, in the mid-90s, when Linux switched from libc5 to glibc (libc6). In this case, I hadn't deleted everything; rather, I'd stupidly upgraded libc locally.

After learning a valuable lesson in exactly how dynamic libraries work and in the recommended process for a live libc upgrade (don't do it if the ABI changes), I fixed it by using my IRC client - which was already running and thus unaffected - to get statically linked copies of /bin and /sbin from another machine, via DCC Send...

Recovery then consisted of restoring libc5 from Slackware 3.2 install media.

I can't remember how I got root; either su was statically linked (believable, since it's setuid) or I had a logged-in root session. I did have to use the tcsh "echo *" trick for file listing, and the shell built-in cd...


"cd" is always a shell built-in. It is not even possible to have a /bin/cd binary which does what "cd" does.


Good point. I'd never really thought about that, just noticed that all shells seemed to include it...


Why is it impossible? Maybe I've understood why it is always built in: without it, the shell wouldn't be able to navigate the directory tree to find the executables of commands. But once the shell has it, I'd think a /bin/cd that navigates directories would be redundant (so nobody writes one), yet still possible.


You can't change the environment of the parent process, including CWD. `/bin/cd` would either be another process like all executables run by the shell, and not work, or special cased, at which point why make it an executable at all?


I can think of one way to write "cd" out-of-process from the shell: exec /bin/cd /some/path, which does chdir("/some/path") and execs your shell. You'd lose your history and similar state, but if you had a shell that didn't even put cd in-process, it probably doesn't have history either.

Sample:

    /tmp$ exec /tmp/cd /bin
    /bin$ exec /tmp/cd /home
    /home$ exec /tmp/cd /usr
    /usr$ 
Sample code:

    #include <err.h>
    #include <stdlib.h>
    #include <unistd.h>
    
    int main(int argc, char *argv[])
    {
        char *shell;
        if (argc != 2)
            errx(1, "Usage: exec cd /some/path");
        /* Change this process's working directory... */
        if (chdir(argv[1]) == -1)
            err(1, "chdir");
        /* ...then replace this process with a fresh shell, which
         * inherits the new working directory. */
        shell = getenv("SHELL");
        if (!shell || !*shell)
            shell = "/bin/sh";
        execl(shell, shell, (char *)NULL);
        err(1, "exec");
    }


Hah, hadn't thought of that. Kind of CPS for processes. I'm not sure what that would gain you, though, as you'd have to either a) write both the cd and the user of the cd, and there are less awkward ways to do that than exec'ing, or b) you'd have to pass something ELSE to exec to return control after cd finishes.

This seems a little awkward and I'm not sure what is being proven. :P


It's how very early versions of Unix effectively did stuff: they exec'd a program, and when it exited, they exec'd the shell again.

When they introduced something like the fork/exec model we have now, they discovered cd didn't work any more and had to write it as a built-in.

Now, for bonus points, work out how goto worked when it wasn't a built in.


> It's how very early versions of Unix effectively did stuff: they exec'd a program, and when it exited, they exec'd the shell again.

Makes sense, since those systems might not have enough memory to run the shell and another program at the same time.


This would cause an eventual exhaustion of process table entries/pids and/or addressable memory as the previous shell is left in the background.


There's nothing being left in the background. It's one process which is calling exec repeatedly; no forking going on.


The parent comment's explanation is exactly right (although I might not use "CWD" in this context, because a reader might mistakenly think the working directory is determined by an environment variable). As an example, try

    python -c 'import os; os.chdir("/")'

Notice that the working directory of your shell is unaffected. :-)

Or:

    bash -c 'export foo=bar'; echo $foo


Those examples are trying to change the directory in the shell's child process, not in the shell itself.

That being said, aside from a kernel patch there is definitely no official way of doing it. This is the hacky way:

http://stackoverflow.com/questions/2375003/how-do-i-set-the-...


I meant for my examples just to illustrate that a child process can't change the working directory or environment of the parent process.


It's impossible because it simply calls chdir(2). The "current directory" is a property of a process; having it built-in means it changes the directory of the current process. If it was a separate executable, you'd change the directory in a separate child process for that executable, and the current directory of the shell would be unchanged, making it entirely useless.


That's why I freakin' love Slackware. The ability to repair stuff by hand, or even to get libc6 binaries running on a libc5 system (by putting the appropriate .sos in /lib), was a godsend during my early Linuxing, and completely unthinkable under Windows. It's very nearly unthinkable under Ubuntu or Fedora, but Slackware just keeps chugging the best it can no matter what shape you bash it into.


To be fair to Windows, its backwards compatibility is legendary. I'm still using Windows software on Windows 8 that was last compiled in 1997.


At one point I used a production proprietary DB engine on Linux that was last compiled in that time frame.

The hoops needed to run it on semi-modern Linux were a real PITA (think chroot jail and some ld shenanigans to patch shared libs). Microsoft's level of backward compatibility is really hard to achieve.


6-7 years ago, a co-worker was doing some work on a server that we were phasing out. It wasn't being used for anything of critical importance, so a junior-level person was permitted to have root. He was actually my colleague and equal on the org chart, but he had less experience with *NIX administration.

He was copying directories to the new server and deleting them when he was done. Well...

I don't remember the sub-directory he had just finished copying but when he went to delete it he typed in "rm -rf / SOMEDIR/SOMESUBDIR" and hit enter.

He almost immediately realized what he had done and hit CTRL-C, but by that point the damage was done. Our boss had access to the previous week's backups, and since the server wasn't critical, he just rebooted it and restored, and we had a good laugh about it.

The next day, I made a paper airplane and on the side I wrote in red Sharpie "Linux Air" and then "rm -rf" inside of a red circle with a line through it and taped it to the top of his monitor.

He was a good sport about it and left it up for a couple of months.


This particular variation was (finally!) fixed in the 2013 POSIX revision:

"If either of the files dot or dot-dot are specified as the basename portion of an operand (that is, the final pathname component) or if an operand resolves to the root directory, rm shall write a diagnostic message to standard error and do nothing more with such operands."

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/rm...

Warning: not yet widely implemented outside the GNU world.
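
As an illustration, GNU rm also refuses to recurse on the root directory by default; treat this transcript as a sketch, since the exact wording varies between coreutils versions:

    $ rm -rf /
    rm: it is dangerous to operate recursively on '/'
    rm: use --no-preserve-root to override this failsafe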


No protection for `rm -rf file.bak ~`?

I'd love to know whoever it was that thought tilde was a good character to use in backup filenames.


Personally, I wish two things:

1) that rm had a (configurable) number of files / directories above which it'll ask you to double-check.

2) that files had a "pleasedontdelete" flag that rm would check and ask about.


> pleasedontdelete flag

For some programs (rm isn't one of them), removing the write permission is sufficient to get an interactive prompt when you try to overwrite/truncate/delete the file.

I really like OS X's current metaphor, where files that haven't been touched in a while get "locked", and must then be "unlocked" to modify them further. Phrasing it in terms of stability rather than permission makes a lot of sense to me. It's too bad the metaphor isn't echoed in the CLI.


`rm` is definitely one of them; it's just suppressed by the `-f` flag (examples from Ubuntu):

    $ touch foo
    $ chmod -w foo
    $ rm foo
    rm: remove write-protected regular empty file ‘foo’?
There's also the immutable attribute many modern filesystems support, which prevents the file from being modified unless the attribute is removed.

    $ sudo chattr +i foo
    $ rm foo
    rm: remove write-protected regular empty file ‘foo’? y
    rm: cannot remove ‘foo’: Operation not permitted


That's a surprisingly good idea for UAC too. People would hate that it's non-deterministic (from "how will my program run on a user's system" perspective), but makes more sense than other approaches.

If I'm tapping a file that's been sitting there since the OS was installed and hasn't been touched since... Probably fair to ask me to confirm!


You can script that all up, call it rm, and put it higher up the user's PATH; if you need the pure rm command without the wrapper protection (for scripts, say), you can just point to it with the full path.
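
For illustration, a minimal sketch of such a wrapper, assuming the real rm lives at /bin/rm and picking an arbitrary threshold of 10:

    #!/bin/sh
    # Hypothetical rm wrapper: ask before deleting many operands.
    # Note: this counts only the named arguments, not whatever -r
    # would recurse into.
    REAL_RM=/bin/rm
    THRESHOLD=10
    count=0
    for arg in "$@"; do
        [ -e "$arg" ] && count=$((count + 1))
    done
    if [ "$count" -ge "$THRESHOLD" ]; then
        printf 'About to remove %d items; continue? [y/N] ' "$count" >&2
        read -r answer
        [ "$answer" = "y" ] || exit 1
    fi
    exec "$REAL_RM" "$@"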


> 1) that rm had a (configurable) number of files / directories above which it'll ask you to double-check.

Install safe-rm; it does exactly that.


A week of stress produced this doozy while attempting to upgrade from Go 1.2 to Go 1.3:

    [peter@bamboo-sb bin]$ sudo mv /bin/ /usr/local/bin/go1.3
    [peter@bamboo-sb bin]$ sudo mv /tmp/go/bin/ /usr/local/bin/go1.3
    sudo: mv: command not found
    [peter@bamboo-sb bin]$ mv
    -bash: mv: command not found
    [peter@bamboo-sb bin]$ ls
    -bash: /bin/ls: No such file or directory
    [peter@bamboo-sb bin]$ ls
    -bash: /bin/ls: No such file or directory
    [peter@bamboo-sb bin]$ ls
    -bash: /bin/ls: No such file or directory

All during a skype call. It is very recoverable, but I felt quite foolish.


Yip, had a manager do that on a client's site. The best part was that the kit was so new that there were only a few in the country, the install set for the OS had not arrived, and there were no backups. It was a new machine that had been partially configured, awaiting tapes.

Luckily, another client had the same RS/6000 (I think the 3rd in the country outside IBM) and we were able to borrow their install DAT to bring AIX back to life.

Oddly, we had a similar problem with an RT/6150 (nobody ever admitted to it), and getting it limping along involved copying files from a working system onto the holed system to fill the gaps. The eventual reinstall that weekend took most of the weekend, only for us to find that floppy disk 70-odd was corrupt. Much fun.

But *nix is great, as there's always more than one way to get things done, and on many systems that can also be true.

Still, it was a good education not only in backups, but in backup integrity, as you never know when you'll want to read them back.


I first found this story in a collection of Unix horror stories several years ago: http://www.yak.net/carmen/unix_horror_stories

If you enjoyed this one, you'll probably enjoy the others in there as well.


See also COMPUTER-RELATED HORROR STORIES, FOLKLORE, AND ANECDOTES:

http://wiretap.area.com/Gopher/Library/Techdoc/Lore/rumor.ne...


One of my coworkers was out in the field trying to fix a problem on a rhel box.

As root, typed in chown -R nobody:nobody / some/dir/somewhere/

Ended up having to re-image the system. We still give him a hard time about it.

Now I'm wondering if he could have tried to find a more creative solution as displayed in TFA.

Thoughts?

Edit: He hard killed the box right after he typed that in.


I suppose he could have tried remounting the device read-only, and then using dd to directly update the inode table to change ownership, but, if he was already root, he could have just chown'ed it again.


Bogus file ownership won't prevent you from booting in single-user mode. And once you boot in single-user mode, you're running as root so you can change file ownerships back.
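
On a RHEL box specifically, one more avenue (a sketch, assuming the rpm database survived; --setugids and --setperms are standard rpm popt aliases, but verify on your version):

    # from single-user mode: reset owner/group (and permissions) of every
    # packaged file using the metadata in the rpm database
    for pkg in $(rpm -qa); do
        rpm --setugids "$pkg"
        rpm --setperms "$pkg"
    done
    # files rpm doesn't own (/home, logs, etc.) still need fixing by hand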


A while back I read another disaster recovery story, along the lines of recovering a long pipeline command from a long-running process by reading the resident memory of the containing shell. I've not managed to find it, but I think it was posted to HN sometime last year or so. Anyone know about it?


In the ext2 days, an incident put a lot of files into lost+found. I had a Tripwire database with file checksums, and wrote a script to checksum the files and move them back into place. I think a handful of files were corrupted, but I got it booting again.


Why did they not boot from installation/recovery media, mount the drive, copy what was left, restore and copy back the user and configuration data that was found?


> Alternatively, we could get the boot tape out and rebuild the root filesystem, but neither James nor Neil had done that before, and we weren't sure that the first thing to happen would be that the whole disk would be re-formatted, losing all our user files. (We take dumps of the user files every Thursday; by Murphy's Law this had to happen on a Wednesday). Another solution might be to borrow a disk from another VAX, boot off that, and tidy up later, but that would have entailed calling the DEC engineer out, at the very least. We had a number of users in the final throes of writing up PhD theses and the loss of maybe a weeks' work (not to mention the machine down time) was unthinkable.


Yea, I am not buying it. They have a UNIX guru that is building new commands in assembly and transferring the files with uuencoding but no one knows how a recovery tape works? Using a recovery tape was not a rare event in UNIX shops.


Mario Wolczko's home page is here: http://www.wolczko.com/ Feel free to drop him a line and tell him he's a liar.


I've considered this problem before. If I remember my research, I'd investigate using bash's built-in echo or printf, and paste encoded versions of binaries into the terminal. To speed up the terminal work, get uudecode, base64, or iconv working. Maybe wget or curl, to fetch base packages and unpack them. Then, say, ssh, and rsync, and selectively restore from backups.
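
For instance, a rough sketch of the paste-a-binary step, assuming a surviving bash whose printf built-in understands \x escapes (the LS_HEX variable and file names are made up):

    # on a healthy machine, hex-dump a statically linked binary:
    #     od -An -v -tx1 /bin/ls | tr -d ' \n'
    # paste that output into a variable (LS_HEX) on the damaged machine,
    # then rebuild the file using only shell built-ins:
    decode() {
        local hex=$1 i
        for ((i = 0; i < ${#hex}; i += 2)); do
            printf "\x${hex:i:2}"
        done
    }
    decode "$LS_HEX" > /tmp/ls
    # the result still needs its execute bit set, e.g. with a surviving chmod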


Slightly off-topic but I am curious: what are the things at the bottom? Are they precursors to email or something?


By "things" I assume you mean this stuff:

> ARPA: miw%uk.ac.man.cs.ux@cs.ucl.ac.uk
> USENET: mcvax!ukc!man.cs.ux!miw
> JANET: miw@uk.ac.man.cs.ux

miw is clearly Mario Wolczko's username.

Those will be his email address in different formats. The first would route email to him at miw@uk.ac.man.cs.ux using cs.ucl.ac.uk first as a gateway. The second address format is what's called a bang path, for UUCP email. The final one is a modern IETF standard email address. JANET is a network provider for UK academic institutions, similar to an ISP but structured differently.

You might find these references interesting:

https://en.wikipedia.org/wiki/Email_address
http://www.faqs.org/docs/linux_network/x-087-2-mail.address....
http://www.livinginternet.com/e/ew_addr.htm


I think that last one isn't a modern standard email address -- the domain part is reversed. It's a JANET NRS path (https://en.m.wikipedia.org/wiki/JANET_NRS). The modern equivalent would be miw@ux.cs.man.ac.uk.

The first email address (the ARPA one) is presumably using a machine in UCL as the gateway between internet routing and JANET routing, since the part before the % is the JANET NRS component ordering. The obvious conclusion would be that the mail server he was using didn't have an ARPAnet connection at all.



Somewhat. Those are/were actual email addresses. It's just another kind of email (UUCP/Usenet instead of SMTP):

https://en.wikipedia.org/wiki/UUCP#Mail_routing


That's way more badass than my own rm -rf disaster: http://journal.dedasys.com/2006/01/30/disaster-strikes/


Back in the days when there were bad proprietary IDE controller chipsets that enumerated differently under Linux than they did under Windows...

Well, as a kid I clobbered my Windows install while trying to destructively check the validity of a disk. Checks to see if a filesystem is mounted won't protect you if it isn't.

Since then I've been -extremely- careful for any destructive operation on block devices, and generally careful otherwise.

Also, I don't know if blkid existed at the time, but it's /very/ useful in avoiding those types of mistakes.
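
For anyone who hasn't used it, blkid reports what's actually on a block device before you do anything destructive to it. A hypothetical run (device name and values invented):

    $ sudo blkid /dev/sdb1
    /dev/sdb1: LABEL="scratch" UUID="b2af6a54-8b8f-4b71-9c8e-1f2a3b4c5d6e" TYPE="ext4"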


I'm curious to see what your tcl code looks like. Did you keep it around for future disasters or was it just a quick and dirty one-off?


It was as quick and dirty as could be. It was pretty simple, really, as it just accepted a connection and wrote a file with no encoding. No security or concurrency or any other niceties.


Ah, got it. Sounds like it did the trick though, a clever idea to come up with in the midst of looming disaster! :-)


This somehow always ends up popping up every couple of years, never ceases to astound me...


Thank god, nowadays we have extundelete. :-)


I remember the before-time, when the GNU lived only in Boston. When beasts like 'adb' and 'fsdb' roamed the earth, and 'fsck' was the wimpy stuff of children's taunts. When I controlled my horror and revulsion and used them to recover the broken VAX by manually rebuilding inode lists and finding the lost fragments in the dark.

Hard tools for hard admins; the Gods of BSD, McKusick and Joy and the rest, were and are wise. We live in more refined times....



