The second issue is that since this was set up as a RAID, I'm assuming all of the drives were purchased at the same time, which means that depending on the supplier you probably received drives from the same production batch. Because RAID stripes data across drives, a fault in a batch plus multiple drives from that batch means you can have failures on multiple drives at approximately the same time.
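To make the batch argument concrete, here's a tiny Monte Carlo sketch. All the numbers (per-drive failure probability, share of defective batches, the 10x multiplier) are made-up assumptions for illustration, not real failure rates:

```python
import random

def p_multi_failure(n_drives, p_fail, corr, trials=100_000, seed=42):
    """Estimate the probability that 2+ drives in an array fail in the
    same window, given a chance that the whole batch shares a defect.

    p_fail: per-drive failure probability in the window (assumed)
    corr:   probability the batch is defective; a defective batch
            multiplies each drive's failure probability by 10 (capped at 1)
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        p = min(1.0, p_fail * 10) if rng.random() < corr else p_fail
        failures = sum(rng.random() < p for _ in range(n_drives))
        if failures >= 2:
            hits += 1
    return hits / trials

# Same per-drive rate; only the batch correlation differs.
independent = p_multi_failure(4, 0.02, corr=0.0)
batched = p_multi_failure(4, 0.02, corr=0.2)
```

Even a modest chance of a shared batch defect makes "two or more drives at once", the case RAID can't survive, far more likely than the independent-failure math suggests.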
The simplest solution is of course to buy enterprise-grade drives, but the obvious issue there is the much higher price, especially if you are purchasing the largest drives: you always pay a premium for max-capacity drives, and the price jump on large drives is not insignificant.
When we first built out our SSD cloud for DigitalOcean at the end of 2012, we went with consumer drives because we were bootstrapped, and SSDs were also significantly more expensive, so we couldn't justify the price jump without knowing whether we had product-market fit.
But rather quickly we ran into a whole host of issues. One common issue is that consumer drives degrade in performance as they approach full capacity, so you can't really use all of the available space if you plan to do a significant number of reads and writes, which is pretty much what a database server is designed for.
That was the first upgrade we did after raising money: switching to enterprise-grade drives. Unfortunately you have to pay the premium, but it reduces problems significantly.
But what else is different?
But yeah, battery-backed RAID plus UPS would be good enough for most, I guess.
Which should be a hint as to what's coming.
Enterprise hardware is 3x the cost of consumer hardware, but on average it's at most twice as reliable. If your software robustness and redundancy are high enough, the TCO will be lower and the system no less reliable.
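A back-of-envelope sketch of that tradeoff, using the comment's own ratios (3x price, 2x reliability) with otherwise invented numbers; it ignores downtime, labor, and the redundancy software itself:

```python
def expected_cost(unit_price, annual_failure_rate, units, years):
    """Purchase price plus expected replacement purchases over the
    period. Purely illustrative; failure rates are assumptions."""
    expected_replacements = units * annual_failure_rate * years
    return units * unit_price + expected_replacements * unit_price

# Assumed: enterprise costs 3x as much and fails half as often.
consumer = expected_cost(unit_price=100, annual_failure_rate=0.04,
                         units=12, years=5)
enterprise = expected_cost(unit_price=300, annual_failure_rate=0.02,
                           units=12, years=5)
```

Under those assumptions the consumer fleet comes out cheaper even after paying for extra replacements, which is the whole argument: if your software tolerates the failures, the hardware premium doesn't buy you much.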
The real wtf with CamelCamelCamel is not that they were running consumer drives, it was they were running 3 replicas on consumer drives with what seemed to be a too heavy workload.
Competent ops being expensive wasn't part of the discussion; good people are expensive, period. In the long term, competent people provide better value: competent ops would figure out how to provide the same reliability guarantees at a lower price. Security likewise has nothing to do with it. That's just circular logic to say that using commodity hardware = bad security (wtf!?).
Software, and by extension infrastructure, gets expensive because of incompetent dev/ops, who are too incompetent to implement efficient systems and use a 'hardware is cheap, throw hardware at the problem' mentality in ops. This especially shows today, when we have mobile phones with 1000x the processing power of desktops from 20 years ago that are sluggish running a simple TODO app. LET THAT SINK IN. Those same 20-year-old computers running the first Pentiums were running Quake 1(!).
Back to ops: Google, before they were "Google", used custom-built commodity hardware with a 'nodes die, deal with it' approach to software development. Obviously this pattern is a win given the popularity of k8s and ephemeral nodes today, but it requires a more complicated development process.
This extends to incompetent ops as well, who outsource all their work to the cloud, pay a crazy premium for it, and call it a day, whereas the competent guys are able to do it properly.
Closer to the subject, Backblaze engineered a from-scratch design to create a highly reliable/available/secure system based on non-enterprise-class storage that lets them offer storage at a fraction of the price of existing solutions.
You can have a server with spinning disks pegged at 100% load waiting on disk, and without changing anything else on the server, no RAM or CPU upgrades, just replacing the drives with SSDs, you can quickly see that load drop to 10%.
Like in the joke.
Our data storage was so well designed that when the first disk died, we didn't notice. When the second disk died, we also didn't notice. When the third disk died, we noticed.
Edit: I just saw it's about SSDs and not HDDs, but while Backblaze might not have stats on those, I'm fairly sure my comment still applies. Not 100%, but I assume they account for failures due to predictable issues like write wear.
There are other comments in this thread that go into this, but your comment doesn't apply. Another poster used the light bulb analogy well to describe SSD failures.
It's very common to see SSDs that were purchased at the same time, and were likely manufactured in the same batch, fail within hours of each other. I'm pretty sure I even read this on Backblaze.
I mean, I can see a power issue taking them all out at once. It could be that. How else would 3 disks fail at once?
I just have scripts that write key data from "mdadm -D /dev/md$N" to logs. But then I need to remember to review the logs regularly.
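One way around the "remember to review the logs" problem is to parse the `mdadm -D` output and alert on the health fields directly. A minimal sketch (the sample output below is abridged and hypothetical; field names match mdadm's detail output):

```python
import re

def parse_mdadm_detail(text):
    """Pull the key health fields out of `mdadm -D /dev/mdX` output
    so a cron job can alert on them instead of relying on a human
    re-reading logs."""
    fields = {}
    for key in ("State", "Active Devices", "Working Devices", "Failed Devices"):
        m = re.search(rf"^\s*{key}\s*:\s*(.+)$", text, re.MULTILINE)
        if m:
            fields[key] = m.group(1).strip()
    return fields

# Abridged sample of what `mdadm -D /dev/md0` prints for a degraded array.
SAMPLE = """\
/dev/md0:
        Version : 1.2
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
"""

status = parse_mdadm_detail(SAMPLE)
degraded = ("degraded" in status.get("State", "")
            or status.get("Failed Devices") != "0")
```

In a real setup you'd feed `subprocess.run(["mdadm", "-D", dev], ...)` output into the parser and send mail when `degraded` flips, which is roughly what `mdadm --monitor` already does out of the box.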
In most cases your disaster recovery scenarios should include the complete destruction of a single server.
Consumer SSDs lack IOPS and should never be used as database drives.
Also, I never use consumer SSDs on things that are not a joke. Serious things need enterprise SSDs.
With enterprise drives you're paying for things like beefy capacitors to provide safe shutdown in case of power loss, write durability, etc.
This is likely the answer. If CCC has to ask for donations to afford the consumer version, I imagine they are in no position to purchase an equivalent amount of storage in enterprise SSDs.
A shame that CamelCamelCamel seems to be running on a bit of a shoestring budget, as it's a very useful tool. To be fair on the recovery costs, though, the majority of them are from a professional data recovery company, and that ain't cheap!
This article summarizes this kind of scenario fairly well:
I've set notifications on a couple items that I regularly order and when they drop below a certain price I order enough in advance.
Then what's their business model? Do they sell aggregated customer traffic data to merchants?
They also have browser add-ons to make that more convenient. They give you a CamelCamelCamel button that you can click while on an Amazon page to pop up the price history right there.
What concerns me, from the image, is that they seem to be using cheaper, consumer-grade drives (Samsung 860 PROs, I assume). This is asking for trouble, as they found out: 3 drives failed simultaneously. I've only ever dealt with cloud infra, but even I know that drives bought together tend to fail together. It's likely their new batch of drives will also fail simultaneously; Backblaze has done a ton of research on drive longevity.
Just seems penny-wise, pound-foolish to me.
It's exactly the same as the light bulbs in your house: they were all made in the same batch with the same quality of materials and used the same way, so they tend to fail within a few days of each other.
My guess is that it would have been a tiny fraction of their cloud bill, and their downtime would have been a small fraction of the colo downtime.
This suggests whatever backups they had might be months old. Given that they are using a roll-your-own storage solution with consumer-grade hardware, I doubt they are rolling in revenue.