
I Don't Need Backups, I Use Raid1 - keyist
http://momjian.us/main/blogs/pgblog/2012.html#May_31_2012
======
arethuza
I once started work somewhere where they did software releases by RAID.

Their systems involved shipping a server (effectively an appliance) to the
customer with all of the working components on it. However, there was no build
or deployment process for these components - so the only way to create a new
server was to take an existing one and create a copy.

This was done by opening up a working server running RAID 1, removing one of
the disks, and installing it in a new server. Let the RAID rebuild the data
onto that server's blank disk, then remove the original disk, put another
blank disk in, and let it rebuild again... Result: a copied server!

~~~
dan_b
<http://thedailywtf.com/Articles/RAIDing_Disks.aspx>

------
furyg3
_Dud, Flood, & Bud._

 _Duds_ are hardware that goes bad, like a disk drive, network adapter, NAS,
or server. There is an almost infinite number of ways things can break, alone
or in combination, in a moderate-sized IT shop. How much money / effort are
you willing to spend to make sure your weekend isn't ruined by a failed drive?

 _Floods_ are catastrophic events, not limited to acts of God. Your datacenter
goes bankrupt and drops offline, not letting you access your servers. Fire
sprinklers go off in your server room. Do you have a recent copy of your data
somewhere else?

 _Bud_ is an accident-prone user. He accidentally deleted some files... the
accounting files... three weeks ago. Or he downloaded a virus which has slowly
been corrupting files on the fileserver. Or Bud's a sysadmin who ran a script
meant for the dev server on the production database. How can we get that data
back in place quickly before the yelling and firing begins?

There are more possible scenarios (hackers, thieves, auditors, the FBI), but
if you're thinking about Dud, Flood, & Bud, you're in better shape than most
people are.

------
kator
We live in a sad world where most companies don't have a real disaster
recovery plan. Many times in my career, customers have asked me to save them
because they "thought" they were backing up, but when they went to restore
from the {tape|floppy|backup disk} media, they found it corrupt.

Backup and Disaster recovery strategies seem really easy until you think
through all the failure modes and realize the old axiom "You don't know what
you don't know" is there to make your life full of pain and suffering.

Years ago my customers would literally restore their entire environments onto
new metal to verify they had a working disaster recovery plan. Today most
clients think having a "cloud backup" is awesome... until they realize, in the
moment of disaster, that they are missing little things like software license
keys, network settings, and local admin passwords on Windows boxes.

------
pja
RAID is not a backup strategy. RAID is an _availability_ strategy.
Unfortunately, it appears that many people don't understand the distinction.

------
gaius
_The community has discussed the idea of adding a feature to specify a minimum
streaming replication delay_

This is a feature of Oracle, the redo logs are replicated to the standbys as
normal, so you have an up to date copy of them on the standby, but only
applied after an x hour delay. You can roll the standby forward to any
intervening point in time and open it read-only to copy data out.

Less need of it these days with Flashback, of course, but it saved a lot of
bacon.
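
The mechanism can be sketched as a toy model (illustrative only, not Oracle's
actual redo-apply implementation; the class and method names here are made
up): records reach the standby immediately, but are only applied once they
fall outside the delay window, and the standby can be rolled forward to any
intervening point in time.

```python
class DelayedStandby:
    """Toy model of a delayed-apply standby: log records arrive
    immediately, but are only applied once older than `delay`."""

    def __init__(self, delay):
        self.delay = delay
        self.log = []          # (timestamp, key, value), received right away
        self.state = {}        # applied state
        self.applied_upto = 0.0

    def receive(self, ts, key, value):
        # The record is on the standby as soon as it's shipped; a
        # catastrophic primary failure after this point loses nothing.
        self.log.append((ts, key, value))

    def tick(self, now):
        # Normal operation: apply everything older than the delay window.
        self.roll_forward(now - self.delay)

    def roll_forward(self, upto):
        # Manually advance to any intervening point in time, e.g. to
        # just before an accidental destructive transaction.
        for ts, key, value in self.log:
            if self.applied_upto < ts <= upto:
                self.state[key] = value
        self.applied_upto = max(self.applied_upto, upto)
```

The key property is that `receive` happens immediately, so a primary failure
never loses shipped records; only the *apply* step is delayed.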

~~~
ErikD
Using mk-slave-delay you can do this with MySQL as well. We always have a
slave running a day behind. You can fast-forward the slave using the 'START
SLAVE UNTIL' command.

~~~
gaius
Don't MySQL slaves pull from the master, rather than the master pushing? In
the Oracle way, the transactions are on the standby, just not applied yet. In
the event of a catastrophic failure of the primary, you're still covered.

------
bstpierre
Most companies I've worked for have had some kind of annual fire drill / alarm
testing. They announce it the prior week, and then, say, Tuesday at 10am the
alarm goes off, everyone files out of the building into the parking lot for 5
minutes, then back inside. In 15+ years (at several different companies), only
once has there been an actual fire department call where the evacuation was
"real" (even then, there was no actual fire).

In those same 15+ years, mostly working for startups, there have been numerous
drive failures. Unfortunately, failure (a) to verify backups _before_ there's
a failure, and (b) to practice restoring from backups has often meant that a
drive failure means loss of several days' worth of work. In one instance, the
VCS admin corrupted the entire repo, there were no backups, that admin was
shown the door, and we had to restart from "commit 0" with code pieced
together from engineers' individual workstations. _That_ was when I got
religious about making & _testing_ backups for my work and the systems I was
responsible for...

------
waivej
You must test your backups. I used a commercial backup service that sent daily
status emails. It seemed great for months until I realized it had a bug and
there was nothing in the archive.

~~~
wiredfool
Yep. It helps to think of it as: Backups aren't the end product. Successful
restores are the end product.
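
One way to act on that: treat a backup as unverified until you've actually
restored it and compared checksums against the originals. A minimal sketch
(the helper names and the `restore_fn` callback are hypothetical, standing in
for whatever your backup tool provides):

```python
import hashlib
import pathlib
import tempfile

def file_hash(path):
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def verify_restore(source_dir, restore_fn):
    """Restore the backup into a scratch directory via `restore_fn`,
    then confirm every source file came back byte-identical."""
    with tempfile.TemporaryDirectory() as scratch:
        restore_fn(scratch)  # however your backup tool restores
        src = pathlib.Path(source_dir)
        for f in src.rglob("*"):
            if not f.is_file():
                continue
            restored = pathlib.Path(scratch) / f.relative_to(src)
            if not restored.exists() or file_hash(f) != file_hash(restored):
                return False
        return True
```

A daily status email that says "backup OK" proves the job ran; only a check
like this proves the archive actually contains your data.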

------
Legion
Cloud backup services have taken away any possible excuse for not remotely
backing up any non-ginormous collection of data. It's push-button easy and a
lot cheaper and easier than dealing with taking tape backups and moving them
offsite.

Not to say that it's the best solution for everyone, but simply that it leaves
people no excuse for doing _nothing_.

------
hythloday
The underlying cognitive bias in "I don't need backups, I use raid1" seems to
be the quite common one of "I don't do anything stupid, so I don't need anti-
stupidity devices" (feel free to substitute "careless" or similar for
"stupid"), maybe with a side-order of "if I set up systems that protect me
from my stupidity then only stupid people will want to work with me". The fact
is, most of us do many stupid things every day--some stupid at the time, some
stupid in retrospect--and systems that don't let us recover from them are poor
systems.

------
ender7
Never underestimate your RAID controller's ability to fail (silently!) and
start writing corrupted garbage to your disks.

~~~
its_so_on
EDIT: people didn't like my humor. Well, look, the whole thing that you're
buying with a raid controller is...redundancy. So if it's not redundant,
failing silently, while telling you it's being redundant, how is this
different from, say, paying for a house inspection that doesn't get done? If a
raid controller is allowed to silently fail, it becomes a post-experience
good.

<http://en.wikipedia.org/wiki/Experience_good>

Meaning that even while you're using it, you have no idea if it works.

My contention is that it's not a RAID array if it can silently stop being
redundant.

At best it's a Possibly Redundant Array of Inexpensive Disks.

(The below is how my comment first read.)

 _(sarcastic) Yeah, it's only prudent to grab a drive out from time to time
and make a surprise inspection of whether it's actually filled up a full 4/5th
of the way (or whatever) with the actual data the volume is supposed to
contain! And the remaining fifth had better look a_ damn _sight like parity
information!

Seriously though, a controller that fails like this isn't a RAID controller,
because then what separates it from a paper plate and a cardboard box? On the paper
plate you write "RAID controller" and tape it to an already attached hard
drive, and you put the remaining members of the redundant array into the
cardboard box. No setup or even connection required!

seriously seriously though, what you're suggesting is unacceptable. that's not
a raid controller, that's a scam_.

~~~
eli
Of course a RAID controller isn't supposed to fail silently, but it can and it
does. I can't think of many complex pieces of technology that work 100% all of
the time.

~~~
its_so_on
You don't think that something that only exists to create disk redundancy is
in a different category from complex pieces of technology that don't have this
in their name?

I simply disagree that you should "never underestimate" your raid controller's
ability to fail silently (which is the comment I was replying to). If this is
even on your radar you don't have a RAID controller.

This is literally like saying, "Never underestimate your digest algorithm's
ability to hash the same file to different values, making the checksum seem to
fail." That's not a digest algorithm, that's a randomized print statement.

A RAID controller whose ability to fail silently you should 'never
underestimate' is literally sometimes the same as a paper plate with "raid
controller" written on it. Call it "sometimes raid", or "maybe raid", or "more
raid". You don't have a raid controller.

~~~
hythloday
In general, then, we shouldn't call RAM memory since it might misremember, we
shouldn't call them computers since they might miscompute, we shouldn't call
it encryption since it might not encrypt, and we shouldn't call them bicycles
since one of the wheels might fall off. Is that about it? I think I can see
your point, but as we're nowhere near the point at which machines become as
reliable as humans, let alone utterly reliable, I'm not quite sure of the use
of the fine distinction you're trying to draw.

~~~
its_so_on
No, I don't at all mean in general.

(See my cousin reply here).

That is not at all "about it". I mean, specifically, for the layer that a RAID
produces. It's simple. When you add RAID, you add a layer on top of physical
hard-drives to make them redundant.

This type of layer has a completely different expectation from all of your
other examples. The example in my cousin reply is apt: it would be like
expecting a checksumming algorithm (which you're ONLY using to add
verification that a file is genuine) to sometimes fail and produce a random
checksum in the space of possible checksums the algorithm can produce, instead
of the checksum that the algorithm actually produces for that particular file.
Or if it has a comparison function, to sometimes fail and say that the file
checksums to the provided checksum, regardless of whether it does so.

This is ridiculous: such a layer wouldn't be a checksum, it would be
completely different. The idea that I have to physically roll a layer on top
of my checksum, to check whether it's currently acting like a randomized print
statement or a comparison function whose truth value is randomly negated, is
ridiculous.

I don't know how else to put this. Maybe instead of your RAM and bicycle
examples, I can give you these:

-> Imagine if you are adding a fuse to a circuit to protect it, but the fuse
sometimes actually just saves up electricity so it can release it in one quick
burst and overload the circuit. That's not a fuse.

-> Imagine if you hire an auditor to make sure your employees aren't misappropriating funds, since the business involves a lot of cash, but your auditor sometimes just pockets cash. That's not an auditor. You only thought you hired an auditor. The solution isn't to make sure the auditor has an auditor, it's to hire an actual auditor instead of someone you mistakenly think is one.

-> Imagine if you buy insurance, but the company will sometimes just spend money on lawyers to fight having to pay out, even when the event clearly happened and you were clearly covered. That's not insurance - that's a scam. You shouldn't have to insure the layer of insurance with insurance against the insurance company out-lawyering you. You should get an actual insurance policy.

-> Imagine if you buy a seatbelt, but after buckling it, there is a realistic chance that you really haven't, and it's just a clothing item draped across your body and not attached in any way at any point.

Well if that's possible, that's just not a seatbelt. It's a defective item
that was supposed to be a seatbelt but isn't.

The point is, all these examples are optional layers on TOP of a process. If
they have a realistic chance of failing as in the above descriptions, they
simply are not what they're claiming to be. Their chance of failure should be
so low you can't even think about it; if it isn't, you should just hire or buy
a different one, since you made a mistake.

------
alexchamberlain
In case anyone is confused, what happens when the server catches fire or is
stolen?

~~~
duck
Or even more likely, the RAID configuration is lost.

------
larrys
I've worked with tapes offsite before hard drives became cheap enough to use
for backup (of the appropriate amount of data of course).

My current setup goes as follows:

Servers in colocation get backed up daily to a server in the office. That
server in the office then gets backed up daily to an iosafe.com fire- and
water-proof hard drive in the office, which, when I get a chance, will be
bolted to the desk for further security. Clones of that server are then made
biweekly (which are bootable); one is kept in the office and one is taken
offsite.

So the office server is the offsite for the colo server and the clone of that
is the backup for the office.

The clones allow you to test the backup (hook it up and it boots basically).

Added: Geographically the office is about 3 miles from where the backup of the
office is kept. But the office is about 40 miles from where the colo servers
are kept.

------
RyanMcGreal
Fun anecdote: years ago, I worked for a department that had its server on a
RAID setup, and when I asked about backups they said, "Don't worry". One day,
a drive failed. They replaced it and started restoring from the other drive -
which failed mid-sync. The two drives were from the same production lot and
died literally within 12 hours of each other.

So: back up your data.

~~~
jeromeparadis
It happened to me. It turns out the drives had a bug where their death time
was hardcoded in the drives. Past a predetermined amount of usage, they would
fail. Of course, I never believed RAID was a backup strategy, so I was able to
recover.

------
avgarrison
This is one of the problems I have with SQL Azure. They have yet to implement
a satisfactory backup option:
<http://www.mygreatwindowsazureidea.com/forums/34685-sql-azure-feature-voting/suggestions/655599-enable-backups>

------
Spooky23
It's amazing to me that anyone is actually arguing that RAID negates the need
for backup. That is just dumb.

If I ever heard an SA working for me advocate that position, I would probably
get them off of my team ASAP.

------
eli
Maybe I'm an idiot, but the vast majority of times I've needed to recover
something from a backup are due to user error, not hardware failure. RAID sure
doesn't help there.

------
nviennot
A backup doesn't have much value when stored in the same physical location as
the original data. Any fire/flood/robbery will destroy all the data.

~~~
mike-cardwell
Of course it has value: fast recovery of data after hardware failure.

You still want off-site backups as well of course, in case of something more
extreme, but they're usually going to be slower to recover from than nearby
backups.

------
jemeshsu
I was burnt once when both hard disks in my RAID 1 failed at the same time;
unlikely, but it happened. And RAID is not a backup strategy.

~~~
trapexit
Not as unlikely as many people would like to think! If the two drives are from
the same production lot, they may suffer from a common manufacturing defect.
And because they are in the same chassis, if a server fan fails, both drives
may subsequently fail due to thermal damage within a very short interval.

Even if they don't fail simultaneously, the mirror drive may fail or (even
more likely) have read errors or flipped bits that will corrupt the restore or
render it impossible.

Personally, I don't place much trust in any RAID configuration other than
RAIDZ2 (ZFS; you can lose two drives and still recover all your data; every
block is checksummed to avoid reading or restoring corrupted data).
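
The per-block checksum idea can be sketched in a few lines (a toy model, not
ZFS's actual on-disk format; the function names are made up): each copy is
stored alongside a hash of its contents, and a read that fails verification
falls back to the other copy instead of silently returning garbage.

```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def write_mirrored(block: bytes):
    """Store the block on two 'drives', each copy with its own checksum."""
    return [(block, checksum(block)), (block, checksum(block))]

def read_verified(copies):
    """Return the first copy whose data still matches its checksum.
    Corruption is detected and routed around, never silently returned."""
    for data, digest in copies:
        if checksum(data) == digest:
            return data
    raise IOError("all copies failed verification")
```

A plain mirror without checksums has no way to know which side of a
divergent pair is the good one; the hash is what breaks the tie.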

But even ZFS can't protect you against accidental deletion, fire, theft, or
earthquake.

~~~
mietek
There's RAIDZ3 now!

------
zalew
redundancy != backup

~~~
dredmorbius
Actually, backups _are_ redundancy.

You just have to structure your redundancy to survive multiple threat models.

In which case, the redundancy offered by RAID alone is grossly insufficient.

~~~
InclinedPlane
There is a very important difference. Redundancy doesn't protect you against
bad changes to your data; backups do. Backups should ideally be immutable and
append-only. What happens when a disgruntled employee runs 'sudo rm -rf /'?
With redundancy the effects of that decision are dutifully cloned on all
media. With backups one has the ability to rollback to an older state.
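
The distinction can be made concrete with a toy sketch (illustrative class
names, not any real tool): a mirror dutifully replicates the destructive write
to every copy, while an append-only snapshot store still holds the old state.

```python
import copy

class Mirror:
    """RAID-style redundancy: every change, good or bad, hits all copies."""
    def __init__(self):
        self.disks = [{}, {}]

    def write(self, key, value):
        for d in self.disks:
            d[key] = value

    def delete_all(self):
        # The disgruntled 'sudo rm -rf /': faithfully mirrored everywhere.
        for d in self.disks:
            d.clear()

class SnapshotBackup:
    """Append-only backups: old states are written once, never mutated."""
    def __init__(self):
        self.snapshots = []

    def backup(self, state):
        self.snapshots.append(copy.deepcopy(state))

    def restore(self, n=-1):
        return copy.deepcopy(self.snapshots[n])
```

The deep copies are the point: the snapshot store keeps an immutable record
that no later write to the live data can reach.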

~~~
dredmorbius
Backups are redundancy _out of firing range_ of problems like, say, hard drive
meltdown, operator error, etc.

I've had gruntled employees, occasionally myself, run some variant of 'rm -rf'
unintentionally far more often than I've had to deal with the other sort.

If you feel my grandparent post was advocating _against_ backups, I'd strongly
suggest you re-parse it. It's distinguishing between varieties of redundancy.

------
highfreq
Using RAID1 as backup is OK as long as you occasionally run 'sudo rm -rf /'
for maintenance.

------
kayoone
For most of my stuff, Dropbox Pro (with the Packrat add-on for unlimited file
history) + GitHub handles all my backup needs. Of course this wouldn't work
for all scenarios, but I don't work with/have loads of huge files.

