

Our Server’s Hard Drive is Dead. We didn’t have a backup. - rhdoenges
http://blog.method.ac/announcements/our-servers-hard-drive-is-dead-we-didnt-have-a-backup/

======
JPKab
It's noble of you to come clean and own your mistake, but let me say this over
and over:

You should never, ever provide an environment that stores people's hard work
without having professionals who know how to safeguard it.

If it makes you feel any better, I recently had to clean up a mess in a huge
enterprise IT shop (if I were to name the organization you would immediately
know them) involving hundreds of thousands of man-hours of work lost due to a
lazy, incompetent DBA and the clueless management above her.

This "DBA" was the kind of person who came in at 9:45AM, took a 2 hour lunch
at noon, and left at 3:30. Did I mention she refused a work from home option?

She didn't know how to do cron jobs, so all of her backup scripts had to be
run manually. If she was on vacation, they didn't get run. Surprise, surprise:
the DB died after her long pre-Christmas vacation. Zero backups for the first
3 weeks of December.
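
For reference, automating this is a one-line crontab entry; the script path
and schedule here are made up:

    # m h dom mon dow -- run the backups nightly at 02:30
    30 2 * * * /usr/local/bin/run_backups.sh >> /var/log/backup.log 2>&1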

Even "professionals" can be suspect sometimes.

~~~
harshreality
Running cronjob backups and looking at them in passing to see that they look
like valid backups is not sufficient for any serious website or web service.

Automated backups need automated backup restoration and testing. Otherwise,
the backups might not be created properly, or they might be perfect backups
that have some hidden error that will cause them to fail when they're put to
use.

As an example, see Jeremiah Wilton's self-case study of Amazon's Oracle
database problem in 1997: <http://www.bluegecko.net/download/disaster-diary.pdf>

Other than the one missed backup, their backup procedures were fine. An Oracle
bug, triggered by a database format/schema change made weeks earlier, caused
Oracle to refuse to start. TESTING the backups would have caught the error and
allowed them to fix it before they took down their production database and hit
the bug on the next attempt to start it.
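
A minimal sketch of what that could look like, assuming a PostgreSQL setup;
the dump path, scratch database name, sanity query, and alert address are all
hypothetical:

    import subprocess
    import sys

    DUMP = "/backups/latest.sql.gz"   # produced by the nightly backup job
    SCRATCH_DB = "restore_test"       # throwaway database for the test

    def sh(cmd):
        subprocess.run(cmd, shell=True, check=True)

    try:
        # Restore the latest dump into a fresh scratch database
        sh(f"dropdb --if-exists {SCRATCH_DB} && createdb {SCRATCH_DB}")
        sh(f"gunzip -c {DUMP} | psql -q {SCRATCH_DB}")
        # A backup only counts as good if the restored data passes a check
        out = subprocess.check_output(
            ["psql", "-t", SCRATCH_DB, "-c", "SELECT count(*) FROM users;"])
        if int(out) == 0:
            raise RuntimeError("restored database is empty")
    except Exception as e:
        subprocess.run(["mail", "-s", "Backup restore test FAILED",
                        "ops@example.com"], input=str(e).encode())
        sys.exit(1)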

------
gregd
_"We need to get programming talent on-board."_ Sounds to me like they still
haven't learned their lesson...

~~~
jrussbowman
Glad I wasn't the only one who had that as a first thought. As an ops guy I
wasn't sure if I should be offended or just shake my head at the irony of it.

~~~
gregd
Well, apparently someone didn't like my comment, since it got down-voted.
Whatever. As someone with a background in systems administration, it bothers
me how often this profession gets left out of the equation.

~~~
jrussbowman
Judging by some of the other comments, I think people believe sysadmins can be
and have been replaced by Heroku, AWS and such.

~~~
gregd
I would have to agree. But I would also have to agree that Heroku or AWS may
have been a better choice than a 1&1 dedicated server in this particular
instance.

However, Heroku and AWS are no substitute for good systems administration.
They _are_ good substitutes for data centers given signed and documented SLAs.

------
ashray
I don't understand: you had a backup HD, which means you had a RAID setup. Why
didn't your host replace the damaged hard disk? In my experience hosts
usually monitor RAID health on their servers, and if there is a problem they
replace the bad hard drives at the earliest opportunity... and I'm talking
about budget hosts.

EDIT: Too many to respond to below, so I'm just editing in here. The author
mentioned that the primary hard disk had failed over a year ago, but he
didn't know about it (the host informed him of this... now?). That points to
a RAID setup where the mirror was basically working all this while. That's
what I'm talking about in this post.

~~~
dangrossman
When you buy unmanaged servers, the host isn't monitoring RAID health -- they
don't have any remote access to your machine except maybe IPMI for reboots.
I've rented servers from various providers for a decade and none has ever
monitored my hard drives... plenty have failed, including disks in a RAID and
RAID adapters themselves; they get replaced when I call up and tell someone
the server won't boot and I need someone to go take a look.

~~~
ashray
You're right, I seem to have gotten lucky with my unmanaged hosting (3 times
over with different hosts). They seem to have some sort of hardware interface
to monitor RAID health; of course, this is hardware RAID, so maybe that's
where the setup differs. I was surprised when I received an email from them
one morning about a year ago saying "Hey, one of your RAID drives failed so we
replaced it, just FYI".

It's true that a RAID failure may go unnoticed by a sysadmin for a year or
more if they don't have proper checks set up for themselves.

I guess the only thing that could've been done in this case was to have a
backup cronjob or use a provider that takes care of this stuff.
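
For what it's worth, with Linux software RAID even a tiny self-check like this
would catch a degraded mirror (the alert address is a placeholder):

    import re
    import subprocess

    # A healthy two-disk mirror shows [UU] in /proc/mdstat;
    # a degraded one shows [U_] or [_U].
    with open("/proc/mdstat") as f:
        status = f.read()

    if re.search(r"\[U*_+U*\]", status):
        subprocess.run(["mail", "-s", "RAID DEGRADED", "ops@example.com"],
                       input=status.encode(), check=True)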

------
femto
Chances are that most of the information is still physically there, just that
it is inaccessible. First thing I would do is physically obtain the drive, so
even if it's inaccessible you have the information in your possession.

From your blog post, I'd assume you don't have the knowledge to attempt
recovery yourself, so call in an expert to handle the data recovery for you.
At this stage, it is a matter of what the information is worth to you,
compared to the cost of recovery. Almost any intervention is possible, for a
price.

------
pdeuchler
I feel for you. I really do. You did a lot of things right: learned how to
program, bootstrapped your startup, released a product (!), got users, went
viral, etc. etc.

But.

1) All of this could have been solved with money, specifically money used to
pay professionals. You got 30 _THOUSAND_ signups and you didn't think of
trying to get funding? I'm surprised VC's weren't pounding at your door. At
the very least, that might even be enough for a bank loan from a savvy lender.
Hell, you could probably find a recently graduated ('tis the season) CSCI
student willing to just take sweat equity with those numbers. This is
especially frustrating for me as I currently have a startup that recently
garnered a whopping 400 (count 'em!) _hits_ on its signup page, and yet I
still got emails from people trying to invest. Not nearly platinum tier, and
thus far none have panned out, but still!!!

2) You claim to have worked in web design/development for a while, and you
didn't hear about 1&1's horrific reputation? That's hard for me to believe. In
fact, of any community, the PHP/JS crowd is probably most familiar with being
burned by 1&1. (Not even going into the slimy overselling.)

I hate to say it, but you should have known better. That said, I sincerely
wish you the best of luck. You've succeeded pretty spectacularly thus far, and
in the big scheme of things this is a pretty minor setback. Just keep shipping
and you'll get it eventually.

Edit: I realize that it might seem foolish to some to go after funding when
it's not needed, but I would argue that if you are making it up as you go
along (not an indictment, it's how we learn) and you get these kinds of
numbers, you should feel at least a little obligated to your users to secure
your product. If that requires money that you don't have, get funding.

------
sc00ter
Not to be depended on as a substitute for backup, but 'dead' doesn't
necessarily mean _dead_. Forensic recovery (either DIY or professional,
depending on the nature of the failure) may still be an option.

Logic board failures are common and replacements are cheap (the cost of a new
HD of the same model); data can be highly recoverable from soft failures.
Mechanical failure is the worst case, but as long as the platter(s) are
intact, it's not insurmountable.

------
tempestn
"Technical support informed me that my first HD died 20 days into my contract.
The backup HD hummed along for a year."

That sounds more like RAID than a backup HD.

~~~
gregd
This doesn't sound like RAID at all. How are you guys coming to this
conclusion? At best it sounds like a server with a master/slave hard drive
setup...you know, something from the early 2000s.

~~~
ams6110
Two mirrored disks is technically a RAID level (RAID 1) that can survive the
failure of one disk. I worked at a company in the late 1990s that had mirrored
disks; they would "split" the mirror at the end of the business day, back up
to tape from one disk, and run nighttime batch jobs on the other disk. Before
the start of the next business day they would back up the batch disk to tape,
then resync the mirror.
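
The modern analogue of splitting the mirror is taking a snapshot, backing up
from it, and releasing it. A rough sketch using LVM from Python; the volume
group, volume, and mount point are invented:

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # Freeze a point-in-time view of the data volume
    run(["lvcreate", "--snapshot", "--size", "5G",
         "--name", "backup_snap", "/dev/vg0/data"])
    run(["mount", "-o", "ro", "/dev/vg0/backup_snap", "/mnt/snap"])
    try:
        # Back up from the frozen snapshot while production keeps writing
        run(["tar", "-czf", "/backups/data.tar.gz", "-C", "/mnt/snap", "."])
    finally:
        run(["umount", "/mnt/snap"])
        run(["lvremove", "-f", "/dev/vg0/backup_snap"])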

------
jefe78
"We messed up bad. We launched without having a backup procedure in place, and
without the resources to make it happen. This was a hard-learned lesson that
won’t happen again. We have no one except ourselves to blame."

You know how you messed up? By not using something like AWS EC2 snapshots, or
even S3 or Glacier. What is this trend of devs doing Operations? As a
sysadmin with a CompSci/dev background, it blows my mind constantly.
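
For instance, snapshotting an EBS volume is a few lines with boto3; the volume
ID below is a placeholder:

    import boto3

    ec2 = boto3.client("ec2")

    # Take a point-in-time snapshot of the data volume. EBS snapshots are
    # stored in S3, so they survive the loss of the instance's own disks.
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
        Description="nightly backup",
    )
    print("Started snapshot:", snap["SnapshotId"])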

Great, you know how to move around the CLI, but are you versed in how to
maintain a proper and robust system?

Also, why weren't you using something like SES for your email alerts?

~~~
toomuchtodo
Long story short, money was tight, priorities were set incorrectly, and they
got fucked.

~~~
jefe78
Sounds about right. My favourite part was leasing a server from 1&1. Even a
little industry knowledge with regard to infrastructure would have caused
someone to avoid them.

------
thaumaturgy
a. My business has a partnership with a good data recovery outfit. We might be
able to get you a good deal on data recovery if you want to try going that
route.

b. It takes a particular kind of personality to be good at sysadmin work. (And
a lot of trial-and-error -- I just recently had to do an emergency server
build due to a Debian update whoops, and I've been doing this stuff for a
while.)

c. I usually recommend BackupPC (<http://backuppc.sourceforge.net/>) for easy
set-it-and-forget-it backup infrastructure. It's compatible with everything,
it will notify you if there are problems, it does pooling and de-duplication
and compression, it's fast and reliable, and you can usually store months of
backups on a small offsite server. I store 12 months of all hosted and
customer data with it, and we've used it to meet other clients' needs too.

d. If you need affordable help, let me know. I'm way too cheap, and I do this
stuff all day, every day. I opened a business specifically to address problems
like this: someone needs something, but money is a problem.

That goes for anybody else too. If your lack of backups is keeping you awake
at night, or if you've suddenly outgrown your infrastructure, or if looking at
config files gives you an ulcer, get in touch with me. I'll help you out.

------
Sealy
How many people did it affect? I would be too ashamed to admit it if I were a
company offering services for programmers but didn't back up my server.

------
nwilkens
I see this too many times, and have read about it more than once on HN in
recent memory.

Hire a proper system administration company early to work with you on these
types of things. There are many companies out there that do this. I happen to
run a company that does this, so I know that you can add an expert admin to
your team for $100-200/mo.

~~~
Brandon0
That is actually surprisingly cheap. Mind if I ask what types of services one
would get at those rates?

You're absolutely right though: for a company like OP's, if they are so short
on cash, it makes a lot of sense to get someone in, even if just for a week,
to address these types of fundamental problems.

~~~
nwilkens
For a monthly service, you generally receive an initial:

- System architecture review
- Backup strategy / DR review
- Security scan and detailed review
- System monitoring design and implementation

and on-going:

- 24x7 monitoring and response to outages
- Server patch management
- Ad-hoc system admin time available to be used on-demand

Many more details and capabilities, but you get the idea ;)

------
foobarbazqux
This site is an excellent way to find out if you've covered all your bases in
your backup protocol:

<http://www.taobackup.com/>

~~~
femto
Note: the website is an extended advertisement for a piece of backup software,
and the user account was created 3 minutes before the comment was posted.

~~~
foobarbazqux
Yeah, that's true. But then do you really think those guys are spamming HN at
8 p.m.?

I wish I had said: if you ignore the advertising, it's a great resource. If
you apply it to proposed backup solutions, it's an effective way to find out
whether they are viable.

~~~
femto
I said it for you, since a reader would probably be interested in that piece
of information. I apologise for the cynical comment about the account being 3
minutes old, as that is ad hominem and being new to HN shouldn't come into it.
Welcome to HN.

~~~
foobarbazqux
Oh it's okay. Thanks for pointing it out. Truth is I lost my password and I
hadn't given them my email, so I just made a new account. (I'm not upset about
this.)

------
hoodoof
<http://www.codinghorror.com/blog/2009/12/international-backup-awareness-day.html>

------
ironchef
On the plus side, this will probably only ever happen to you once. Once you've
felt the pain, you'll never let it happen again.

~~~
ams6110
However remember that backups are only half of the disaster recovery picture.
You need to have a _tested_ restore process ready to go as well.

~~~
ironchef
Automated and tested is even better :)

------
cdvonstinkpot
I wonder if it still would've happened if they were SSDs...

I tend to think they're safer due to no moving parts.

~~~
gregd
Actually it's been my experience that SSDs are _less_ reliable than HDDs...

~~~
protomyth
And for the love of all you hold holy, do not ever put drives bought in the
same batch in the RAID at the same time. They tend to fail at the same time.
Check those serial numbers first.
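
A quick way to audit that, assuming smartmontools is installed (the device
list is an example):

    import re
    import subprocess

    # Print each drive's serial number; serials that run consecutively
    # usually mean the drives came from the same manufacturing batch.
    for dev in ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]:
        info = subprocess.run(["smartctl", "-i", dev],
                              capture_output=True, text=True).stdout
        m = re.search(r"Serial Number:\s*(\S+)", info)
        print(dev, "->", m.group(1) if m else "no serial found")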

~~~
cdvonstinkpot
That's most unfortunate to hear, because I just bought 10 used Intel X25-E
SSDs for my server; SSDLife.exe said they only had 6 months of use, with 99%
of their life left and an expected runtime of 10 more years.

With these drives typically around $750/each, it'll be difficult if not
impossible to find deals on 8 separate lots of _good_ used drives.

I can only afford them when I see an exceptional deal on eBay, and feel lucky
to have found these, which I just got for $200 each. I strongly doubt I'll be
able to afford to buy one drive per lot for my server, as your warning implies
would be the wise thing to do.

It seems you'd have to buy them all new at different times of the year to be
able to implement such a thing, which I certainly can't afford.

~~~
protomyth
Well, I went with buying one drive a month from each of my vendors for my
RAID.

Use them, but take some precautions. Do real backups and test the backup to
make sure it will actually restore. Replace blown drives very quickly. I had a
group of C4 SSDs that tanked within 24 hours of each other. I was not amused
and thus learned what I need to do in the future.

------
macarthy12
SpinRite!

