Their systems involved shipping a server (effectively an appliance) to the customer with all of the working components on it. However, there was no build or deployment process for these components - so the only way to create a new server was to take an existing one and create a copy.
This was done by opening up a working server running RAID 1, removing one of the disks, and installing that disk in a new server. Let the RAID recover the data onto the blank disk there, then remove it, put another blank disk in, and let it rebuild... Result: a copied server!
It is amazing how even fairly technically-savvy people get sucked into the "RAID=backup" mentality. This story (in the above link) ended up costing the business owner tens of thousands of dollars.
Duds are hardware that goes bad, like a disk drive, network adapter, NAS, or server. There are an infinite number of ways and combinations in which things can break in a moderate-sized IT shop. How much money / effort are you willing to spend to make sure your weekend isn't ruined by a failed drive?
Floods are catastrophic events, not limited to acts of God. Your datacenter goes bankrupt and drops offline, not letting you access your servers. Fire sprinklers go off in your server room. Do you have a recent copy of your data somewhere else?
Bud is an accident-prone user. He accidentally deleted some files... the accounting files... three weeks ago. Or he downloaded a virus which has slowly been corrupting files on the fileserver. Or Bud's a sysadmin who ran a script meant for the dev server on the production database. How can we get that data back in place quickly before the yelling and firing begins?
There are more possible scenarios (hackers, thieves, auditors, the FBI), but if you're thinking about Dud, Flood, & Bud, you're in better shape than most people are.
Backup and Disaster recovery strategies seem really easy until you think through all the failure modes and realize the old axiom "You don't know what you don't know" is there to make your life full of pain and suffering.
Years ago my customers would literally restore their entire environments onto new metal to verify they had a working disaster recovery plan. Today most clients think having a "cloud backup" is awesome... until they realize in the moment of disaster that they are missing little things like license keys for software, network settings, passwords to local admin on Windows boxes, etc.
This is a feature of Oracle: the redo logs are replicated to the standbys as normal, so you have an up-to-date copy of them on the standby, but they are only applied after an x-hour delay. You can roll the standby forward to any intervening point in time and open it read-only to copy data out.
Less need of it these days with Flashback, of course, but it saved a lot of bacon.
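The delayed-standby idea above can be illustrated with a toy model. This is only a sketch of the concept, not Oracle's mechanism or API: `replay`, the `(timestamp, key, value)` log format, and the sample data are all hypothetical, and the log is assumed to be sorted by timestamp.

```python
from typing import Optional


def replay(baseline: dict, log: list, until: int) -> dict:
    """Apply logged changes with timestamp <= until to a copy of baseline.

    Each log entry is (timestamp, key, value); a value of None records
    a deletion. The log is assumed to be sorted by timestamp.
    """
    state = dict(baseline)
    for timestamp, key, value in log:
        if timestamp > until:
            break  # stop just before the point we do not want to reach
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


# Hypothetical change log: the deletion at t=20 is the "oops" moment.
BASELINE = {"accounts": "v1"}
LOG: list[tuple[int, str, Optional[str]]] = [
    (10, "accounts", "v2"),
    (20, "accounts", None),
    (30, "audit", "x"),
]

if __name__ == "__main__":
    # Roll forward to just before the accidental deletion and read the data out.
    print(replay(BASELINE, LOG, until=15))
```

Because the standby still holds the unapplied log, it can stop at any moment before the mistake, which is exactly what a live mirror cannot do.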
In those same 15+ years, mostly working for startups, there have been numerous drive failures. Unfortunately, failure (a) to verify backups before there's a failure, and (b) to practice restoring from backups has often meant that a drive failure means loss of several days' worth of work. In one instance, the VCS admin corrupted the entire repo, there were no backups, that admin was shown the door, and we had to restart from "commit 0" with code pieced together from engineers' individual workstations. That was when I got religious about making & testing backups for my work and the systems I was responsible for...
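Points (a) and (b) above can be partly automated. The following is a minimal sketch, assuming a backup has already been restored into a scratch directory; `tree_digests`, `verify_restore`, and the demo file names are hypothetical and not tied to any particular backup tool.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digests(root: Path) -> dict:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_restore(source: Path, restored: Path) -> list:
    """Return a list of problems; an empty list means the two trees match."""
    src, dst = tree_digests(source), tree_digests(restored)
    problems = [f"missing: {n}" for n in sorted(src.keys() - dst.keys())]
    problems += [f"unexpected: {n}" for n in sorted(dst.keys() - src.keys())]
    problems += [f"differs: {n}" for n in sorted(src.keys() & dst.keys())
                 if src[n] != dst[n]]
    return problems


def demo() -> list:
    """Build a tiny source tree and a deliberately bad restore, then compare."""
    with tempfile.TemporaryDirectory() as tmp:
        source, restored = Path(tmp, "source"), Path(tmp, "restored")
        source.mkdir()
        restored.mkdir()
        (source / "ledger.txt").write_text("balance=100\n")
        (source / "notes.txt").write_text("hello\n")
        # The restore lost one file and corrupted the other.
        (restored / "ledger.txt").write_text("balance=1OO\n")
        return verify_restore(source, restored)


if __name__ == "__main__":
    print(demo())
```

A check like this only proves the files came back intact; it does not replace actually booting or running the restored system.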
Not to say that it's the best solution for everyone, but simply that it leaves people no excuse for doing nothing.
Meaning that even while you're using it, you have no idea if it works.
My contention is that it's not a raid array if it can silently stop being redundant without telling you.
At best it's a Possibly Redundant Array of Inexpensive Disks.
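As an illustration of what "telling you" could look like, here is a minimal sketch that flags degraded arrays. It assumes Linux software RAID (md) and the status text found in `/proc/mdstat`, which is not the hardware controller discussed in this thread; `degraded_arrays` and the sample text are hypothetical.

```python
import re

# Sample text in the style of /proc/mdstat: md0 is healthy, md1 has lost a member.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdc1[0]
      976630336 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> list:
    """Return names of md arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[2/1] [U_]": expected/active counts, then one flag per member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real system the text would come from reading `/proc/mdstat`, and the result would feed an alert rather than a print statement.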
(The below is how my comment first read.)
(sarcastic) Yeah, it's only prudent to grab a drive out from time to time and make a surprise inspection of whether it's actually filled up a full 4/5th of the way (or whatever) with the actual data the volume is supposed to contain! And the remaining fifth had better look a damn sight like parity information!
Seriously though, a controller that fails like this isn't a RAID controller, since what separates it from a paper plate and a cardboard box? On the paper plate you write "RAID controller" and tape it to an already attached hard drive, and you put the remaining members of the redundant array into the cardboard box. No setup or even connection required!
Seriously seriously though, what you're suggesting is unacceptable. That's not a RAID controller, that's a scam.
I simply disagree that you should "never underestimate" your raid controller's ability to fail silently (which is the comment I was replying to). If this is even on your radar you don't have a RAID controller.
This is literally like saying, "Never underestimate your digest algorithm's ability to hash the same file to different values, making the checksum seem to fail." That's not a digest algorithm, that's a randomized print statement.
A RAID controller whose ability to fail silently you should 'never underestimate' is literally sometimes the same as a paper plate with "raid controller" written on it. Call it "sometimes raid" or "maybe raid" or "more raid". You don't have a raid controller.
(See my cousin reply here).
That is not at all "about it". I mean, specifically, for the layer that a RAID produces. It's simple. When you add RAID, you add a layer on top of physical hard-drives to make them redundant.
This type of layer has a completely different expectation from all of your other examples. The example in my cousin reply is apt: it would be like expecting a checksumming algorithm (which you're ONLY using to add verification that a file is genuine) to sometimes fail and produce a random checksum in the space of possible checksums the algorithm can produce, instead of the checksum that the algorithm actually produces for that particular file. Or if it has a comparison function, to sometimes fail and say that the file checksums to the provided checksum, regardless of whether it does so.
This is ridiculous: such a layer wouldn't be a checksum, it would be completely different. The idea that I have to physically roll a layer on top of my checksum, to check whether it's currently acting like a randomized print statement or a comparison function whose truth value is randomly negated, is ridiculous.
I don't know how else to put this. Maybe instead of your RAM and bicycle examples, I can give you these examples:
-> Imagine if you are adding a fuse to a circuit to protect it, but the fuse sometimes actually just saves up electricity so it can release it in one quick burst and override the circuit. That's not a fuse.
-> Imagine if you hire an auditor to make sure your employees aren't misappropriating funds, since the business involves a lot of cash, but your auditor sometimes just pockets cash. That's not an auditor. You only thought you hired an auditor. The solution isn't to make sure the auditor has an auditor, it's to hire an actual auditor instead of someone you mistakenly think is one.
-> Imagine if you buy insurance, but actually the company sometimes will just spend lawyers on defending having to pay out, even when the event clearly happened and you were clearly covered. That's not insurance - that's a scam. You shouldn't have to insure the layer of insurance with an insurance against the insurance company out-lawyering you. You should get an actual insurance policy.
-> Imagine if you buy a seatbelt, but after buckling it, there is a realistic chance that you really haven't, and it's just a clothing item draped across your body and not attached in any way at any point.
Well if that's possible, that's just not a seatbelt. It's a defective item that was supposed to be a seatbelt but isn't.
The point is, all these examples are optional layers on TOP of a process. If they have a realistic chance of failing as in the above descriptions, they simply are not what they're claiming to be. Their chance of failure should be so low you can't even think about it; if it isn't, you should just hire or buy a different one, since you made a mistake.
If I did get one, I wouldn't get one that tended to silently fail, since that would pretty well defeat the purpose of thinking my disks were redundant, wouldn't it?
If you were talking about anything else, like a normal hard-drive, fine, of course it can fail silently. But the whole thing that a RAID drive is, is another layer on top of hard-drives, to make them redundant and chirp loudly when one dies or starts having wrong data and has to be removed, so that you can replace it and rebuild the RAID.
I mean, all the RAID controller does is write data that is always redundant (even when it thinks all drives are working fine). How is it not possible for it to check for this consistency as well? Especially in Raid-6 etc configurations, which are even more consistent?
Of course, on a probabilistic level random bit rot means "nothing is certain", but on a practical level, how can you expect a RAID controller to fail silently, when all it does is corral redundant data around, create checksums, verify what's written, etc.? It's the whole reason it exists.
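The consistency check asked about above can be sketched for the single-parity case. This is a toy model of XOR parity as used in RAID 5-style layouts, not any real controller's firmware; `xor_blocks`, `stripe_is_consistent`, and `rebuild_missing` are hypothetical names, and all blocks are assumed to be the same length.

```python
from functools import reduce


def xor_blocks(blocks: list) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks: list, parity: bytes) -> bool:
    """A 'scrub': recompute parity from the data and compare with what is stored."""
    return xor_blocks(data_blocks) == parity


def rebuild_missing(surviving: list, parity: bytes) -> bytes:
    """Recover one lost data block from the survivors plus the parity block."""
    return xor_blocks(surviving + [parity])


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)
    print(stripe_is_consistent(data, parity))                       # healthy stripe
    print(stripe_is_consistent([b"AAAA", b"BxBB", b"CCCC"], parity))  # one byte flipped
    print(rebuild_missing([data[0], data[2]], parity) == data[1])   # lost disk rebuilt
```

Note the limit of single parity: a scrub can tell that a stripe is inconsistent, but not which block is the wrong one, which is one reason per-block checksums or a second parity are used.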
To me this is like saying that a checksumming algorithm should be expected to sometimes fail and just return a checksum chosen randomly from the space of all possible checksums, instead of the checksum actually produced by the algorithm for that data.
That's ridiculous. I shouldn't have to even think about putting another layer on top of the checksum, so that I can checksum it. The very idea of having to do that means you don't have a checksumming algorithm.
This thing should be right up there with bitrot causing bash to execute an rm -rf whenever you drop down to root. Sure that's possible, but that's not even in the scope of anything you have to think about.
To me, a RAID is a layer on top of hard-drives that makes them redundant. Any controller that has a realistic chance of failing silently simply does not fit that definition.
Please note that I have not said "it is likely to fail" or "you should expect that it will probably fail." I agree that it shouldn't be something that keeps a person up at night. But the simple fact is that, when data is important, you should prepare for that possibility (and others) by backing up. RAID does not solve all problems, and it is not guaranteed, as unlikely as failure might be.
Moreover - in saying that it simply isn't RAID if it ever fails silently, you're attempting to define away a nonsemantic problem. The point of a starter motor on a car is to start the engine. If the starter motor fails to start the engine, I guess I could make an Aristotelian argument that it has ceased to be a starter motor, or even perhaps that it was never a starter motor in the first place. But what practical good does that do anybody?
All hardware has the potential to fail. Yes, people should buy hardware that is less likely to fail. I'm pretty sure they already do that, though.
You might read this first:
you can reply to that as well here if you want.
I think we're in very general agreement. Although you yourself did not say "it is likely to fail" or "you should expect that it will fail", this is exactly the sentiment I was replying to.
Regarding your "all hardware possibly failing" and the example of a starter motor to imply that I am trying to disappear a technical problem with a semantic argument, I think I am (especially in that cousin reply) being quite a bit more specific.
Basically, when it comes to safety mechanisms that exist as a layer on top of a process and aren't necessary at all, I simply shouldn't have to even think about reinventing another safety mechanism on top of the safety mechanism. Get one that isn't defective.
A hard-drive isn't defective just because it fails: it's expected to. A RAID controller is also expected to fail...JUST NOT SILENTLY.
In the seatbelt example: should you even think about having to tie your seatbelt to the buckle with sturdy rope, for real safety in case the seatbelt just doesn't buckle when it seems to, or comes undone like a ripped shirt button at the slightest firm tug?
No. You should get an actual seatbelt.
Basically, the standard you hold a control layer to is different from the standard you hold an underlying process to.
It would be like the difference between your brake failing and your (for added safety) handbrake failing, which you only engage on top of the motor's brake anyway. If the motor brake fails you would start rolling (if you're on a bit of an incline). But you shouldn't even have to think about a handbrake 'just failing' in the same condition.
Sure it can fail if you are being towed without being lifted, or whatever, in an extreme situation. But in a normal situation?
Basically, it is a difference of both category/kind AND of degree.
I am certainly not saying that a parking brake can never fail. I am not saying a raid controller can never fail.
I am saying that both of these, when they are layers on top of a normal process, should be out of sight, below your threshold of having to control for it. If they're not, you need to get a different one.
You don't get six insurance policies against the same earthquake possibility, hoping that they won't ALL decide to out-lawyer you or go bankrupt. You get real insurance that's properly reinsured. Check up on them. Find a real one.
Raid failure is fine. Silent raid failure is not fine.
(checksum failure with an exception is fine; checksum failure with no exception, warning, or error, just a random checksum produced - or a check randomly passing when the checksum doesn't match the one you provided, is not okay. fix your checksum, get a real one - don't build another layer on top, for the cases that your checksum is a randomized print statement or your insurance policy a monthly donation from you to a non-charitable organization that puts aside a portion to out-lawyer you with if you try to make a claim, with the rest spent on advertising or being their profit. That's not an insurance policy, that's a scam.)
My current setup goes as follows:
Servers in colocation get backed up daily to a server in the office. That server in the office then gets backed up daily to an iosafe.com fireproof and waterproof hard drive in the office, which, when I get a chance, will be bolted to the desk for further security. Bootable clones of that server are then made biweekly; one is kept in the office and one is taken offsite.
So the office server is the offsite for the colo server and the clone of that is the backup for the office.
The clones allow you to test the backup (hook it up and it boots basically).
Added: Geographically the office is about 3 miles from where the backup of the office is kept. But the office is about 40 miles from where the colo servers are kept.
So: back up your data.
If I ever heard an SA working for me advocate that position, I would probably get them off of my team ASAP.
You still want off-site backups as well of course, in case of something more extreme, but they're usually going to be slower to recover from than nearby backups.
Even if they don't fail simultaneously, the mirror drive may fail or (even more likely) have read errors or flipped bits that will corrupt the restore or render it impossible.
Personally, I don't place much trust in any RAID configuration other than RAIDZ2 (ZFS; you can lose two drives and still recover all your data; every block is checksummed to avoid reading or restoring corrupted data).
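Why per-block checksums matter on a mirror can be shown with a toy model. This is only a sketch of the idea, not ZFS's actual on-disk format or code; `checksum` and `read_verified` are hypothetical, and SHA-256 is chosen purely for illustration.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum stored separately from the block when it is written."""
    return hashlib.sha256(block).hexdigest()


def read_verified(copies: list, expected: str) -> bytes:
    """Return the first mirrored copy whose checksum matches the stored one.

    Raises IOError instead of returning data when every copy is bad,
    so corruption is never handed back silently.
    """
    for copy in copies:
        if checksum(copy) == expected:
            return copy
    raise IOError("all copies failed checksum verification")


if __name__ == "__main__":
    good = b"payroll"
    stored = checksum(good)
    # The first mirror has a flipped character; a plain mirror read would return it.
    print(read_verified([b"payrolI", good], stored))
```

A mirror without checksums has two copies and no way to know which one is right; with a stored checksum, the read path can pick the good copy or fail loudly.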
But even ZFS can't protect you against accidental deletion, fire, theft, or earthquake.
You just have to structure your redundancy to survive multiple threat models.
In which case, the redundancy offered by RAID alone is grossly insufficient.
I've had gruntled employees, occasionally myself, run some variant of 'rm -rf' unintentionally far more often than I've had to deal with the other sort.
If you feel my grandparent post was advocating against backups, I'd strongly suggest you re-parse it. It's distinguishing between varieties of redundancy.