Correlated failures are common in drives. That could be a power surge taking out a whole rack, a firmware bug in the drives making them stop working in the year 2038, an errant software engineer reformatting the wrong thing, etc.
When calculating your chance of failure, you have to include that, or your result is bogus.
E.g. drive model A has a failure rate of 1% per year, but the failure symptom is an inability to spin up from cold; if it's already spinning, it keeps working as normal.
3 years later, the datacenter goes down due to a grid power outage, and a dispute with diesel suppliers takes the generators down too. It's a controlled shutdown, so you believe no data is lost.
2 days later when grid power is back on, you boot everything back up, only to find out that 3% of drives have failed.
Not a problem. Our 17 out of 20 redundancy can recover up to 15% failure!
However, each customer's data is split into files of around 8MB, which are in turn split into 20 redundancy chunks. Each customer stores, say, 1TB with you. That means each customer has ~100k files.
The chances that you only have 16 good drives for a file is about (0.97^16 * 0.03^4) * 20 * 19 * 18 = 0.3%
Yet your customer has 100k files! The chance they can recover all their data is only (1-0.003)^100000... Which means every customer suffers data loss :-(
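For concreteness, here's a quick sketch that reproduces those figures exactly as stated (3% of drives dead on power-up, 17-of-20 erasure coding, ~100k files per customer); the 20*19*18 factor is the commenter's approximation, refined further down the thread:

```python
# Back-of-envelope check of the scenario above, using the numbers as given.
p_fail, p_ok = 0.03, 0.97

# "Only 16 good drives" with the 20*19*18 factor from the comment
# (the exact count of 4-failure patterns, C(20,4) = 4845, comes up below).
p_file_lost = (p_ok**16 * p_fail**4) * 20 * 19 * 18
print(f"per-file loss chance ~ {p_file_lost:.2%}")        # ~0.34%, i.e. "about 0.3%"

# Chance a customer with 100,000 such files can recover every one of them.
p_all_ok = (1 - p_file_lost) ** 100_000
print(f"customer recovers everything ~ {p_all_ok:.1e}")   # effectively zero
```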
> That could be a power surge taking out a whole rack
This failure mode, at least, is already accounted for by sharding data across cabinets:
> Each file is stored as 20 shards: 17 data shards and 3 parity shards. Because those shards are distributed across 20 storage pods in 20 cabinets, the Vault is resilient to the failure of a storage pod, or even a power loss to an entire cabinet.
However, they don't seem to offer multi-datacenter (or multi-region) redundancy so are still susceptible to a datacenter fire/failure.
In comparison, AWS S3 distributes data across 3 AZs (datacenters), and you can further replicate across regions if you choose. Though you pay for that added redundancy with 3-4x higher cost.
A better example would be the ceramic bearing fiasco that NetApp experienced with Seagate. Seagate had switched to a floating ceramic bearing on one family of their Fibre Channel drives. In those drives, one or more of the bearings would shatter and start spreading ceramic dust across the disk surface. This happened between 3 and 4 years of run time, and the disk would rapidly fail after that happened. People who bought a filer with several hundred drives started seeing large numbers fail suddenly. Up to the point they failed, there was no indication. But it was run-time-hours related, and thus correlated across all drives that started running at the same time. Putting those drives in different data centers, on different racks, in different servers wouldn't have mattered: if you started 20 at the same time to be your 'tome', once they started failing it was possible to lose them all in a week.
There was also a failure mode in Seagate drives where the bearing increased in stiction. As long as it was spinning, there was no problem. But if you spun it down, it might not spin up again. If you had a group of disks powered up for a long time, many could fail together at the next power cycle.
A chaos monkey that randomly powers down disks one at a time can prevent this.
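A minimal sketch of that idea, assuming Linux with hdparm available and an array that tolerates one disk being spun down at a time; the device names are placeholders:

```python
# Pick one member disk at random, spin it down, and verify it spins back up,
# so stiction-style failures surface one drive at a time instead of all at
# once after a rare full power cycle.
import random
import subprocess
import time

DISKS = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]   # hypothetical array members

def power_cycle(disk):
    subprocess.run(["hdparm", "-y", disk], check=True)   # spin down to standby
    time.sleep(30)
    # Any raw read forces a spin-up; failure to spin up shows up as a read error.
    result = subprocess.run(["dd", f"if={disk}", "of=/dev/null", "bs=4096", "count=1"],
                            capture_output=True)
    return result.returncode == 0

if __name__ == "__main__":
    victim = random.choice(DISKS)
    ok = power_cycle(victim)
    print(f"{victim}: {'spun back up fine' if ok else 'FAILED to spin up, replace it'}")
```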
They are separate datacenters, I don't think they make any promises about how far they are apart from each other, but at least in some regions they are 10+ miles apart.
> The AWS Cloud infrastructure is built around AWS Regions and Availability Zones. An AWS Region is a physical location in the world where we have multiple Availability Zones. Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities.
Others have answered, but I think the general principle for a Region is that between the AZs you have less than 1ms latency and physical separation between the data centers, but they may still be in the same floodplain, be hit by the same hurricane, etc.
That map is incorrect: us-west-2a, 2b and 2c are not static names for the AZs. Every user gets their own mapping of which physical location is a, which one is b and which one is c. My us-west-2a may be your us-west-2c. They are not the same.
Extremely interesting, thanks for this clarification. For those curious, it's documented here[0] -- search for "To ensure that resources are distributed across the Availability Zones for a region"...
Yes, I know they pseudo randomize the allocations. That is irrelevant to my core point, that each AZ is not just separate networks in the same building or even adjacent buildings, but rather they are truly isolated by a non-insignificant distance of somewhere around 50mi on average.
It's a cool map, but I can't find any reference for the source of the data center locations.
AWS purposely doesn't publish that information, and while I can believe it's possible to crowdsource the data by doing a little sleuthing (or working for certain vendors), it's hard to trust the map without knowing the sources.
> Each region is completely independent. Each Availability Zone is isolated, but the Availability Zones in a region are connected through low-latency links.
I'm not sure how anyone, in their disaster recovery plans, ever expect anything less than 100% failure of a data center. The scenario is: a tornado hits the data center.
Your power outage causing 3% of drives to fail is just a subset of that.
It's just risk analysis: the cost involved in splitting things over multiple data centers vs. the chance of your single DC getting wiped out. In many cases, declaring bankruptcy if it happens will be the best business decision; insurance makes sense if the risk is higher, or if the owners can't afford the loss or have liability.
If it's your data center, you can plan the location to prevent most of this. There are locations where the natural hazards can be completely managed. (No tornados, fires, tsunamis, earthquakes, ...) So the power outage is the most likely thing to happen.
Realistically, the chance of a tornado taking out the Swedish datacenter built inside a former nuclear bunker under 100ft of granite bedrock is so small that it probably doesn't affect the number of 9's that you can claim.
> Arctic stronghold of world’s seeds flooded after permafrost melts
> It was designed as an impregnable deep-freeze to protect the world’s most precious seeds from any global disaster and ensure humanity’s food supply forever. But the Global Seed Vault, buried in a mountain deep inside the Arctic circle, has been breached after global warming produced extraordinary temperatures over the winter, sending meltwater gushing into the entrance tunnel.
There are always unforeseen and unforeseeable risks associated with any location. You can mitigate them but you can't claim X number of 9s for a single physical datacenter.
> There are always unforeseen and unforeseeable risks associated with any location. You can mitigate them but you can't claim X number of 9s for a single physical datacenter.
What is X, here? I'm pretty sure I can claim 99% for a single datacenter.
What I meant is you can't necessarily amortize loss in the event of a localized catastrophe. Failure modes in a single location are by definition not always statistically independent. You could have 99.99999% durability for 20 years, but if something happens to the datacenter that causes total loss, you're SOL. Geographical redundancy vastly reduces the risk of freak occurrences that you can't predict.
If a datacenter boasts flawless durability for 19 years and loses everything in the 20th year, then they have an infinite number of 9's for the first 19 years and zero for the 20th year. It's all about probability.
Nobody can promise 100%, but that doesn't mean that all those 9's are meaningless. They mean a lot for budgeting, and even more for insurance purposes -- which is exactly what we as a civilization have come up with as a way to amortize loss in the event of a local catastrophe. Your premiums are going to be much higher if you don't have enough 9's in a critical part of your money-making infrastructure.
No one here is saying that you don't need geographical redundancy. First we need to figure out how many 9's we can realistically expect in order to determine how much redundancy makes financial sense.
> No one here is saying that you don't need geographical redundancy
I mean, that's kind of what Backblaze is saying in the article, isn't it? They don't have geographical redundancy, yet there's not a single mention of that fact or the importance thereof in an entire article dedicated to teaching the unwashed masses about the limitations of mathematical theory in analyzing durability, even going so far as to say:
> somewhere around the 8th nine we start moving from practical to purely academic... it’s far more likely that...Earthquakes / floods / pests / or other events known as “Acts of God” destroy multiple data centers [emphasis my own]
Seems like a pretty serious omission given their claimed authority as "the bottom line for data durability" and being "like all the other serious cloud providers" who do have geo redundancy, don't ya think?
As mentioned elsewhere in this thread, Backblaze is working on adding another datacenter.
Personally, I don't care whether a single provider has multiple datacenters or not, because I prefer to have redundancy across providers. But that's not the kind of recommendation that we're likely to see on the blog of one of those providers.
I don't think geo redundancy helps much. Your data is more likely to be corrupted by some software of the provider, or by some common hardware used by the same provider, than by some random storm.
If you need to be safe about your data, you NEED several cloud providers in different places, with different software and in different countries.
Especially for data, it is pretty easy to just back it up in 2 really different places at different providers. Relying on the geo redundancy of ONE provider and having to pay for it seems a bit useless to me.
I was going to post essentially the same thing, so here is an upvote :-)
While I always find storage analysis interesting (I spent 5 years at NetApp where it was sort of a religion :-)), some of the assumptions that Brian was tossing out are not good ones to make (like the lack of correlation, or that Drive Savers will exist as a company 10 years from now).
Still, it does help you understand that they take data availability seriously, which is the underlying message.
> some of the assumptions that Brian was tossing out are not good ones to make.
We COMPLETELY welcome other analysis and listing other assumptions. Internally, we argued endlessly about why this or that wasn't totally accurate, and finally decided to publish the math WITH all of our assumptions exposed so you could be the judge. If Amazon wants to publish their assumptions for S3 for comparison, we're all ears.
> that Drive Savers will exist as a company 10 years from now
Absolutely true, this calculation is only good RIGHT NOW. For example, one of the things that came up internally was "well, when drives get more dense the rebuild time rises, so this calculation will no longer be accurate in two years". But at the same time, we have some additional tricks and optimizations to make which we have not done yet to cut the 6 day average drive rebuild time down to 3 days. Also, drives last us about 5 years, so your data will be migrated to totally new drives 5 years from now. Those drives will absolutely have a different drive failure rate (maybe higher, maybe lower) so the calculation will no longer have the same result 5 years from now.
Hey Brian! For the record I love that you are transparent about your assumptions, it is really helpful. I have been in very very similar debates, both at NetApp and at Google of all places. Peter Corbett, the guy who invented the dual parity scheme that NetApp uses, wrote a similar analysis as well for Fast '04 [1].
As someone who likes to geek out on failure proof systems and perfectly secure systems, neither of which are attainable but can be asymptotically approached, I think you are seeing the "there is always one level deeper" kinds of discussions. Personally I think of them as endorsements because if the exceptions get too extreme (say 'what if an asteroid hits?') then you know you've got all the bases covered.
It's only a problem if the person analyzing the analysis finds something that you really did not even consider. Then it opens up an opportunity to look at the problem a whole new way.
Hi Brian, thanks for open-sourcing Backblaze's JavaReedSolomon, which is really well-written. A few months ago I ran into an issue with Reed Solomon coding throughput not saturating the write throughput of 16 drives, and wrote a new Reed Solomon module based on Cauchy matrices: https://github.com/ronomon/reed-solomon
The Cauchy matrices remove the need for a table lookup to do the Galois multiply, replacing it with pure XOR. Together with other optimizations, this gives nearly 3x-5x more coding throughput for the same (17,3) parameters, assuming you're still using your open-sourced JavaReedSolomon in production. I don't know if Reed Solomon coding throughput is a factor in your rebuild times?
It contains optimized Galois Field multiplication, resulting in Reed Solomon (both Vandermonde as well as Cauchy) at multiple GB/s on a modern x86 CPU.
Still, I doubt that the Reed Solomon coding speed is the limiting factor in their rebuild time. There is a mention of a 6-day duration, so even with a very slow Reed Solomon implementation ( ~ 100 MB/s) that should not be a bottleneck for a 10 TB drive rebuild (assuming a distributed rebuild approach, not a traditional RAID style rebuild).
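A quick sanity check of that claim:

```python
# Even a deliberately slow (~100 MB/s) Reed-Solomon implementation is nowhere
# near the bottleneck for a 10 TB rebuild with a 6-day budget.
drive_bytes = 10e12
rs_speed = 100e6                      # bytes/sec, pessimistic
rebuild_budget = 6 * 24 * 3600        # the 6-day figure mentioned above

print(f"coding alone: {drive_bytes / rs_speed / 3600:.0f} hours")   # ~28 hours
print(f"budget:       {rebuild_budget / 3600:.0f} hours")           # 144 hours
```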
> If Amazon wants to publish their assumptions for S3 for comparison, we're all ears.
The 2010 S3 calculation is obviously also wrong! I totally feel your pain in terms of wanting to have a directly comparable answer, but the reasons you give that "it doesn't matter" (and others, including correlated faults, software bugs, model risk, and security breaches) are actually reasons why the stated durability number is wrong. Honestly IMO a probability of 1-10^-11 is the wrong answer to pretty much any question; model risk is going to dominate that for any problem more complex than 1+1=2.
That said, although neither your system nor Amazon's should be expected to have anywhere near "eleven nines" of durability in reality, if as I understand it S3 is split across availability zones in a region and your product has all splits in a monolithic DC, I would expect S3 to come out ahead in a more careful analysis. (But note that S3 is not really a seamless multi region product, though there is an option to set up cross region replication.)
As far as I know, if they claim multi-AZ in a city (region), that really means at least separate buildings, but likely separate locations within the region too.
>The chance they can recover all their data is only (1-0.003)^100000... Which means every customer suffers data loss :-(
You just made the same mistake you're criticizing. You assumed the 100K files were uniformly and independently spread. They're also likely clustered, and perhaps not even at the same data center. Given the variety of drives Backblaze uses, the drives are also not likely to all be the same model, so your failure mode is also unlikely.
You also only looked at the case where there are exactly 16 good drives. The proper failure estimate is 1 - (odds of 17 good + odds of 18 + odds of 19 + odds of 20). I'm not sure where you got the 20 x 19 x 18 part either. Did you mean 20 choose 16 or something like that? Using the proper 1-... method I get 0.00267, not 0.003.
No, it doesn't. You picked 3% of all drives failing out of the blue. Your next estimate was an order of magnitude too high. Your last assumption of p^# drives is not reasonable.
The proof is in reality. Backblaze has run for over a decade, with all sorts of hardware failures, server configs, running many drive models through their lifetime, across manufacturers, across technologies, across multiple datacenters, and has not seen the level of failures you claim they will.
So I suspect their method of estimating is more accurate than yours. So far it matches reality much better.
This is why, when I was building DIY arrays for startups (around the same time Backblaze published their first pod design [1]), I went through the extra effort of sourcing disks from as many different vendors as possible.
Although it was somewhat more time consuming and limited how good a price I could get and how fast the delivery could be, it meant that, for any given disk drive size, I could build an array as large as 12 drives where no 2 drives were identical in model and manufacturing batch [2].
Of course, it's still a vanishingly rare risk, and "nobody" cares about hardware any more. It does help to remember, at least once in a while, that, on some level, cloud computing really is "someone else's servers" and to hope that someone else still maintains this expertise.
[1] though I used SuperMicro SAS expander backplane chassis for performance reasons
[2] and firmware from the factory, although this is somewhat irrelevant, as one can explicitly load specific firmware versions, and, IIRC, the advantages of consistent firmware across drives, behind a hardware RAID card, outweighed the disadvantages
I've been doing a poor man's version of this for home videos using 3 hard drives and rsync. It's easy to replace a drive and they are not likely to go out at the same time. But one thing that bugs me is that unless the drive fails hard (e.g. noticed by SMART or unable to read at all) how do I know the data on the drive is not corrupted without reading it? Are there best practices to continuously compare the replicas in the background? Does that impact durability of the drives?
> how do I know the data on the drive is not corrupted without reading it? Are there best practices to continuously compare the replicas in the background?
I assume you're talking about already-written sectors becoming unreadable or a similar failure. Unfortunately, I don't think you can. This is what I believe the "patrol read" feature of RAID cards is meant to address.
Fortunately, however, I don't believe there's evidence that if the data is readable, it would ever be different from what had been written, so comparison isn't needed. The main exception to this is the case of firmware bugs that return sectors full of all-zeros.
> Does that impact durability of the drives?
I haven't read the studies (from Google, mostly, IIRC) in a while, and I'm not sure if they've released anything lately for more modern drives [1]. However, I believe you'll find an occasional "patrol read" won't noticeably reduce drive life/durability.
[1] Especially for something like SMR, whose tradeoffs would seem particularly attractive for something like this archival-like use case.
"I don't believe there's evidence that if the data is readable, it would ever be different from what had been written, so comparison isn't needed. The main exception to this is the case of firmware bugs that return sectors full of all-zeros."
Comparison is needed to address misdirected writes and bit rot in the very least, see "An Analysis of Data Corruption in the Storage Stack" [1]. You can't count on your drive firmware or RAID firmware to get this right. You need bigger end-to-end checksums, and you need to scrub.
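For the "3 drives + rsync" setup asked about above, a minimal scrub could be as simple as hashing every file on the primary and comparing against each replica; the paths here are placeholders, and you'd run it from cron while the machine is idle:

```python
# End-to-end scrub: detects silent corruption (bit rot, misdirected writes)
# even when every read succeeds. With 3 copies, a 2-of-3 majority tells you
# which copy to restore from.
import hashlib
from pathlib import Path

PRIMARY = Path("/mnt/disk1/videos")
REPLICAS = [Path("/mnt/disk2/videos"), Path("/mnt/disk3/videos")]

def sha256(path):
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for file in PRIMARY.rglob("*"):
    if not file.is_file():
        continue
    rel = file.relative_to(PRIMARY)
    want = sha256(file)
    for replica in REPLICAS:
        copy = replica / rel
        if not copy.exists():
            print(f"MISSING  {copy}")
        elif sha256(copy) != want:
            print(f"MISMATCH {copy}")
```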
Thanks! I had either missed that paper or had taken away more of the message that these errors are more likely to be from events like misdirected writes, cache flush problems (hence the high correlation with systems resets and not-ready-conditions), and firmware bugs (on-drive and further up the stack), rather than bit-rot.
Still:
> On average, each disk developed 0.26 checksum mismatches.
> The maximum number of mismatches observed for any single drive is 33,000.
Considering the latter can represent 132 MB on a modern, 4K-sectored drive, that's a remarkable amount of data loss, enough to warrant checksumming higher up (such as in the filesystem)... in theory [1].
However, the fact that this was NetApp using their custom hardware as the testbed makes me wonder if the data are skewed, and if the numbers would be nearly this bad from a more "commodity" setup, such as at Google. The paper alludes to this when referring to the extra hardware for the "nearline" disks, and I'm always suspicious of huge discrepancies in statistics between "enterprise" and other disks, even more so when there's a drastic difference in comparison methodology.
It would be interesting to see if there are any numbers for more modern drives, especially as the distinction between "enterprise" and "consumer" drives is disappearing, if only because demand for the latter is disappearing.
[1] In practice, an individual isn't aware of the 16.5 MB / 132 MB loss risk, which is vanishingly small compared to other risks anyway, and businesses don't tend to care and have survived OK regardless.
I've never done so outside of FreeNAS appliances, partly because I remain persuaded that offloading the RAID portion to a card is more cost-effective and higher-performance, especially on otherwise RAM- and/or IO-constrained servers, and partly because the ZFS support under Linux has, historically, been less than ideal.
Higher-level checksum failures are, however, a situation where I would appreciate an integration between filesystem and RAID, as I'd want a checksum error to mark a drive as bad, just like any other read error.
Unfortunately, no, it doesn't really say how ZFS behaves when an error is encountered.
This is super-disturbing and a dealbreaker, if it's still true:
> The scrub operation will consume any and all I/O resources on the system (there are supposed to be throttles in place, but I’ve yet to see them work effectively), so you definitely want to run it when you’re system isn’t busy servicing your customers.
I browsed a little of the oracle.com ZFS documentation but couldn't find much in the way of what triggers it to decide that a device is "faulted" other than being totally unreachable.
> I went through the extra effort of sourcing disks from as many different vendors as possible.
This is very good advice!
If you've already built your array, consider this advice: "replace a bad disk with a different brand, whenever possible".
Over time, you naturally migrate away from the bad vendors/models/batches. After following this practice, it seems ridiculous to me now to keep replacing the same bad disks with the same vendor+model.
Although I wouldn't go so far as to insist on switching brands (especially since, as another commenter pointed out, there has been so much consolidation, there remain only 3), I agree that replacing with at least a different model, or, failing that, a different batch, is a best practice for an already-built homogenous array.
Some of this can also be achieved ahead of time if one has multiple arrays with hot spares, by shuffling hot spares around, assuming there's some model diversity between the arrays but not within them.
I doubt I'll ever again have the luxury of being able to perform this kind of engineering, however. Even a minor increase in cost or cognitive/procedure complexity or a decrease in convenience just serves to encourage a "let's move everything to the cloud" reaction, so I keep my mouth shut.
«The chances that you only have 16 good drives for a file is about (0.97^16 ∗ 0.03^4)∗20∗19∗18 = 0.3% Yet your customer has 100k files! The chance they can recover all their data is only (1-0.003)^100000... Which means every customer suffers data loss :-(»
Your math is completely wrong. In reality 92% of customers suffer no data loss.
The chance of a file having 4 failed drives (16 good drives) is: .03^4 = 0.00008100%
The chance of a file being irrecoverable is the chance of having 4 or more failed drives: .03^4 + .03^5 + .03^6 + ... = 0.00008351%
The chance of a file being recoverable is: 1 - .03^4 - .03^5 - .03^6 - ... = 99.99992151%
The chance of a customer's 100k files all being recoverable is: (1 - .03^4 - .03^5 - .03^6 - ...)^100000 = 92.0%
Therefore only 8% of customers encounter one or more 8MB file that is irrecoverable.
The original maths is correct. You have neglected the fact that the 4 failed drives can be any of the 20. Your calculations are for 4 specific drives failing.
I think you're not considering that failed drives will be replaced and the data on them reconstructed from the other shards. This failure mode requires 4 of 20 drives to fail in such a short amount of time that reconstruction cannot be completed.
Yes, but this was the OP's scenario: what happens when 3% of drives all fail at the same time when the DC is powered back on.
Edit: actually the math is still wrong. The chance that any 4 out of 20 drives fail is: .03^4 × C(20,4) = .03^4 × 4845 = 0.392%. There is no need to multiply by .97^16, as the status of the other 16 drives is irrelevant.
Decidedly, statistics is hard. Everything above is wrong. Let's label the twenty drives D0 through D19. There are 2^20 possible scenarios, which can be represented as a string of 20 bits:
• 00000000000000000000 = all 20 drives are working
There are C(20,4) = 4845 scenarios with exactly four failing drives (four "1" bits.) The probability of each scenario is .97^16 × .03^4. Therefore the probability of 4 failing drives (any drive) is the sum of the probability of each scenario: .97^16 × .03^4 × C(20,4) like I said 3 comments above.
However the probability of a file being irrecoverable is P(4 failing drives) + P(5 failing drives) + ... + P(20 failing drives), i.e. the binomial tail Σ_{k=4..20} C(20,k) × .03^k × .97^(20-k) ≈ 0.27%.
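A short script makes the correction concrete (same 3% failure assumption as above):

```python
# Exact binomial version: a file with 20 shards, each independently failed with
# probability 0.03, is lost once 4 or more shards are gone.
from math import comb

p = 0.03
p_exactly_4 = comb(20, 4) * p**4 * (1 - p)**16
p_lost = sum(comb(20, k) * p**k * (1 - p)**(20 - k) for k in range(4, 21))

print(f"exactly 4 failed:  {p_exactly_4:.3%}")   # ~0.241%
print(f"4 or more failed:  {p_lost:.3%}")        # ~0.267%
print(f"customer with 100k files loses something: {1 - (1 - p_lost)**100_000:.6f}")
```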
You could also have a bad batch of drives, causing a bunch of failures to happen over the period of a couple of months. Sources of failure don't take a number. Like in your example, they can and will overlap, which is why, when we try to design bulletproof systems, we aren't satisfied until every bad event requires a number of separate things to go wrong all at once.
But if you have faulty power or bad hardware chipping away at your equipment, your depth of resiliency is degraded until the issue is corrected.
Also, there are only 3 hard drive manufacturers left. If one of them has a bug that affects their whole product line, that can take out 1/3 of all hard drives.
It's vanishingly unlikely that a bug would affect all their drives (across all recent models, and only after burn-in) simultaneously, unless the drives are managed by a remote server with a SPOF.
We're talking about 6+ 9's here. Putting an order of magnitude on your definition of 'vanishingly small' seems compulsory when we're already so far right of the decimal point.
How many years between incidents are you talking, and when was the last time a manufacturer had a multigenerational bug? (and is quality improving, or decreasing? That is, are we more or less likely to see failures in the next 10 years than we did in the last 10?)
There was a statistics class in college that kicked my ass. I never quite understood how to determine if two variables were independent or dependent. You get a vastly different answer if you get it wrong.
I run into people all the time that seem to have the same problem, to the point that it makes me wary of any software developer putting forth numbers that seem fantastical.
My gut reaction is that if Backblaze wants to keep reporting their disaster preparedness numbers that they need the assistance of an actuary to calculate them.
I think that your point is what they were trying to address by saying that anything beyond eight nines is impractical. Their examples of correlated failures included earthquakes and floods which might have huge reach rather than just impacting a single rack, but I think it's the same general idea.
The customer only suffers data loss if they lost their 'master' copy of the data as well during the outage, and if they don't have a secondary backup solution.
This was an interesting read, both the points made about durability, as well as the in-depth math. However, what stood out to me most was the line:
> Because at these probability levels, it’s far more likely that:
> - An armed conflict takes out data center(s).
> - Earthquakes / floods / pests / or other events known as “Acts of God” destroy multiple data centers.
> - There’s a prolonged billing problem and your account data is deleted.
The point is that once you get to a certain level of durability (at least as far as hardware/software is concerned), you're chasing diminishing returns for any improvement. But the risks that are still there (and have been big issues for people lately) are things like billing issues. I think it's an important point that the operational procedures (even in non-technical areas like billing and support) are critical factors in data "durability".
I've posted the math here before but if we assume that an asteroid hits the earth every 65 million years and wipes out the dominant life forms, then this fact alone puts your yearly durability at a maximum of ~8 nines.
The point about billing is better, though.
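For reference, the asteroid bound above works out like this:

```python
# One extinction-level strike per 65 million years caps yearly durability.
from math import log10

p_loss_per_year = 1 / 65e6
print(f"p = {p_loss_per_year:.1e} -> at most ~{-log10(p_loss_per_year):.1f} nines")
# p = 1.5e-08 -> at most ~7.8 nines
```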
My other concern is that a software bug, operator error, or malicious operator deletes your data.
> if we assume that an asteroid hits the earth every 65 million years and wipes out the dominant life forms, then this fact alone puts your yearly durability at a maximum of ~8 nines
I don't think this is a useful definition of your yearly durability. If your data center is down for maintenance during a period in which it is guaranteed that nobody wants to access it, that doesn't reduce your availability at all -- if your only failure is an asteroid that kills all of your customers, it would be more accurate to say you have 100.000000% availability than 99.999999%.
Isn't the expected life span of the company more limiting?
Plenty of cloud storage companies go out of business (typically, they run out of money). You can apply Gott's law to this. It's pretty grim.
The question of how many data points to use is a subtle one, though. I can say that I picked one data point because I was lazy and doing back of the envelope math, which is reasonable because we can be somewhat assured that I didn't choose a number of data points that was convenient for the hypothesis.
But if you're choosing two data points, my question is... why two? If you are choosing whether or not to reply based on whether or not the second data point fits with the first, then you're introducing selection bias. The chance that the second data point disagrees with the first by at least as much as the 200 My interval disagrees with the 1/65 My rate is equal to 1-(exp(-65/200)-exp(-200/65)) = 0.32, which is not especially high.
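One way to read that formula, for anyone following along (the two-sided "at least as surprising" interpretation is mine):

```python
# With a memoryless model at one strike per 65 My, how surprising is a single
# observed 200 My gap? Count gaps at least this long, plus gaps at least as
# extreme on the short side.
from math import exp

rate, gap = 1 / 65, 200                     # strikes per My, observed gap in My
equally_short = 65 / (gap / 65)             # ~21 My: as far below the mean, ratio-wise
p_longer = exp(-rate * gap)                 # ~0.046
p_shorter = 1 - exp(-rate * equally_short)  # ~0.278
print(round(p_longer + p_shorter, 2))       # 0.32
```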
I wonder if there's a general term in engineering for the case where a particular risk has been reduced well below the likelihood of more serious but exotic risks. I've heard about this most in cryptography where we can sometimes say that the risk of, say, a system failing due to an unexpected hash collision is drastically less than the risk of the system failing due to cosmic radiation or various natural disasters. At that point it doesn't seem important or worthwhile to consider this risk, because it's dwarfed by the others.
This seems like a form of the same argument, and I wonder where else this arises and how people describe it.
I don't think unknown unknowns are what I'm thinking of here. In this case the argument involves a very specific risk, and sometimes a very specific lower bound for its probability.
For example, in the hash collision case the argument says that it's not worth worrying about the (known) probability of a software error due to an unexpected hash collision because it's dominated by the (known) probability of a comparable error due to cosmic radiation. (The former probability can be calculated using the birthday paradox formula, and the latter has been characterized experimentally in different kinds of semiconductor chips.)
This kind of argument doesn't rely on the idea that there are other risks that we can't identify or quantify. It's about comparing two failure modes that we did think of, in order to argue that one of them is acceptable or at least not worth further attempts to mitigate.
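As a rough illustration of that comparison (the object count and hash size here are placeholders, not anyone's real numbers):

```python
# Birthday bound: the chance of an accidental collision among n random k-bit
# hashes is roughly n^2 / 2^(k+1).
n = 10**12        # say, a trillion stored objects
k = 256           # e.g. SHA-256

p_collision = n**2 / 2**(k + 1)
print(f"~{p_collision:.0e}")   # ~4e-54, far below any measured hardware error rate
```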
I think Rumsfeld gets credit for it because he was using it in the most degenerate, disingenuous form possible. Rather than guarding against legitimate concerns and pursuing actual handling of potential issues, he was just trying to rationalize continuing policies that were demonstrably counterproductive. It's one thing to say there might be factors we don't know. It's another to say that simply because there might be such things, we should dedicate significant resources and lives to blindly flailing away under the assumption that it will help. A presumption of unknown unknowns puts you in a position of not acting, normally. There is no way to know that you aren't exacerbating and making a problem worse if you know that little.
I tend to assume pessimistically that the durability design will itself cause a problem. Redundant switches to survive a hardware failure, e.g., strikes me as inviting trouble.
Indeed. I once had critical systems routed via a large redundant Cisco switch which claimed to be 1+1. Turns out there was a single "supervisor" component which failed (after just a year or two) and made the pair of switches useless. Apparently the designer worked in a team where nobody does anything when the boss is out.
For a consumer the cheapest and easiest way to backup important documents or files is to encrypt it and store it across multiple storage providers, e.g. Dropbox and Google Drive.
They usually give you a reasonable amount of free storage, and it's unlikely all accounts would be terminated or locked at the same time.
And of course, you should always have your local backups as well.
Assume you want access to files over the next 20 years. What are the odds Google will have bought Dropbox in that time; and what are the odds an automated system monitor at Google ad words will have disabled your Google account in that time span?
Replace with Amazon and/or crashplan as appropriate..
Backblaze suggests a “3-2-1” backup strategy. You should always have at least one backup on site. If a remote backup becomes inaccessible, you could move over to another remote access provider.
Financial failure or service shutdown by the provider is the highest risk for long term storage. The backup services CrashPlan, Dell DataSafe, Symantec, Ubuntu One, and Nirvanix all shut down. Nirvanix only gave two weeks notice for users to save their data.[1]
How about the USA government instructing the owners to cease access, or taking ownership, like with Kim Dotcom's New Zealand (?) based storage company?
I don't recall hearing about anyone recovering access to their data in that case?
This makes every additional user an increase in risk, as even without a warrant it seems if USA TLAs consider someone a valid target then those servers are going down (or getting taken over by people with unknown service standards in order to run a sting, or ...).
tl;dr you have to worry about accusations (of law breaking or copyright infringement) against others too as some jurisdictions have a strong overreach in such cases.
If Google suspends your account for background music playing in a YouTube video, you might still lose access to your files in Google drive / cloud - even if the files are encrypted.
> Eventually someone realized that their non-work accounts were banned as well. It wasn't until yesterday that someone made the connection. Anyone who had their accounts as a recovery option were also caught in the ban wave.
Forget account bans; there is a nonzero risk that a false positive from the automated kiddie porn search all the cloud storage providers do gets your home searched and puts you in handcuffs.
While I'm sceptical of content filters, even with a home search, it seems unlikely you'd end up in cuffs unless a) the filter caught actual illegal content, or b) the search turned up something illegal.
You might get killed in the course of the initial police raid though..
If pictures of naked kids are illegal in your jurisdiction, you've got bigger problems. I guess that's true for some locations, though. Still, nude people != pornography.
(kids below age of consent sexting each other is another, related, problem)
Surely the far greater risk is one other person using the service does something the feds don't like (justified or otherwise) and they take every server that person's file fragments have touched.
Just for fun, here's my report of Nirvanix in 2008. I feel no problems sharing it since both NowPublic and Nirvanix is long gone. Let's just say there were reasons they went under:
1) Sideloading. I was unable to benchmark or even to get this to work. Images requested to be loaded from media-src.nowpublic.com to node4 @ Nirvanix never showed up. I showed my code to [nirvanixcontact] who said that the code looked OK, but someone did load testing on node4 without informing Nirvanix, and steps were taken so that such a situation won't occur again. He told me that the requests are not lost but they still have not landed. Again, my code can be at fault and I would be happy to run a sideload code example.
2) Upload speeds. I have uploaded from d2 to Nirvanix 100 images, each between 180-8584 kbytes, totalling almost exactly 100 MB (101 844 931 bytes). Each upload was a single HTTP request. The uploads took 18-19 minutes (I repeated the experiment). To give us a comparison, I changed the URL in the script to a one-line PHP script on another server (at hostignition.com) which just echo'd the number of uploaded files. This took 16.25 seconds and echo'd 100, so seemingly the files landed.
2a) I tried to get another node via the LoginProxy method, which we would need for uploads anyway. While LoginProxy itself did work, GetStorageNodeExtended (https://services.nirvanix.com/ws/IMFS/GetStorageNodeExtended...) always fails with ResponseCode 80006, ErrorMessage: Session not found for ip = 67.15.102.70, token = e7b00d25-fc35-431c-9437-9a4302767f46. Seemingly, it does not pick up the consumerIP.
3) The image conversion itself is blazing fast though. These images took a total of 101.58 seconds to convert and this includes 300 HTTP requests (200 sent to d2 from Nirvanix, 100 to Nirvanix).
I really want to like Backblaze and they seem to do a lot of good work, but whenever this comes up, I also feel responsible to let people know the dark side so they're informed at least.
I've written in more detail before[0], but just to share the gotchas in case anyone here is thinking of switching to Backblaze:
1. They backup almost no file metadata.
2. The client is very slow (days or more) to add new files and there's no transparency (it claims everything is backed up when it's not).
3. There are still bugs in the client that can put your backup into an invalid state where it gets deleted.
4. Support is terrible, and won't be any help when you run into these bugs.
Yeah, it's a bit old, but here's an article about Backblaze not supporting metadata. [0] "It fails all but one of the Backup Bouncer tests, discarding file permissions, symlinks, Finder flags and locks, creation dates (despite claims), modification date (timezone-shifted), extended attributes (which include Finder tags and the “where from” URL), and Finder comments."
And I don't know if it supports U2F, but it does support TOTP.
I didn't like their main client very much (though it was a while ago) but I'm still planning to use B2 for some things. So it depends on what you're buying from them.
It's a really nice blog post but coming from Backblaze, it would have been nice if they wrote it _after_ bringing the Phoenix DC fully online. When Amazon or Google say 11 9s, I can believe it but Backblaze still only has a single datacenter for most data. All it takes is an earthquake.
When I've done a clean backup on Backblaze, it's taken just over 24 hours to back up about 650 GB. And I'm not even in the US, so my data has to cross the Pacific.
Backblaze is actually faster than Apple Time Machine is on my LAN which slightly bothers me. It also has lower CPU usage.
I originally chose Backblaze after benchmarking the other offers available at the time (Carbonite, Crashplan etc) and Backblaze was by far the fastest.
Not that dangerous if you do your own maths. I don't really trust cloud backup. As they said, the 11 9s don't matter. You are more likely to encounter a billing problem (as they said), but you could also get hacked, or have a problem with your internet connection; many things can go wrong.
That's why you need another solution if you are serious about your data, maybe a set of external hard drives (local backup). This way, you have redundancy and little correlation in failure, which greatly improves your general durability. That local storage may be paid with the money you save by getting an "inferior" cloud backup provider.
Where are you geographically? Was your Amazon upload to a datacenter physically much nearer to you than California? 30MBit/s sustained for 3 days isn't unreasonable for a business connection, but seems high compared to most of what I see available at least in the US.
Disclaimer: I work at Backblaze and live in California.
> 30MBit/s sustained for 3 days isn't unreasonable for a business
We (Backblaze) are seeing more and more consumer internet connections in the USA with 20 Mbit/sec upstreams, I thought they were available most everywhere if you were willing to upgrade your internet package "just a bit". 30 Mbits is a little unusual for "consumer", but not unheard of. Of course, there is a "selection bias" when you look at online backup users. :-)
Yes regarding the available speed, but on a consumer connection with Comcast at least if you push 30MBit for 3 days straight you may get a call or at least may start seeing popups in your browser on any http (not https) traffic.
From Comcast: The Terabyte Internet Data Usage Plan is a new data usage plan for XFINITY Internet service that provides you with a terabyte (1 TB or 1024 GB) of Internet data usage each month as part of your monthly service. If you choose to use more than 1 TB in a month, we will automatically add blocks of 50 GB to your account for an additional fee of $10 each. Your charges, however, will not exceed $200 each month, no matter how much you use. And, we're offering you two courtesy months, so you will not be billed the first two times you exceed a terabyte.
Also, "All customers in locations with an Internet Data Usage Plan receive a terabyte per month, regardless of their Internet tier of service," and "The data usage plan does not currently apply to XFINITY Internet customers on our Gigabit Pro tier of service. The plan also does not apply to Business Internet customers, customers on Bulk Internet agreements, and customers with Prepaid Internet."
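Which lines up with the 30 Mbit/s figure upthread:

```python
# 30 Mbit/s sustained for 3 days lands almost exactly on the 1 TB monthly cap.
mbit_per_s, days = 30, 3
tb = mbit_per_s * 1e6 / 8 * days * 86_400 / 1e12
print(f"{tb:.2f} TB uploaded")   # ~0.97 TB of the 1 TB allowance
```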
This was an interesting read from a technical point of view, but also well written and refreshingly transparent.
I found the discussion about why it doesn't matter when you start talking about 11 nines of reliability to be hilariously true.
At the end of the day we're still flawed humans living in a hostile universe, and no matter how foolproof we make the technology, there are some weaknesses that just can't be eliminated.
Thank you! In the interests of full transparency, the blog post was a collaborative affair and was proofread and edited for clarity by several people at Backblaze.
> discussion about why it doesn't matter
One of the philosophies Backblaze uses is to build a reliable component out of several inexpensive and unrelated components. So combine 20 cheap drives into an ultra-reliable vault. We have two or three inexpensive network connections into each datacenter instead of buying one REALLY expensive connection for 8x the price. Etc.
Personally, I recommend customers do the same. Instead of storing two copies of your data in two regions in Amazon for 2x the price, store your data in one region of Amazon and put one copy in Backblaze B2 for 1.25x the price. We believe this will result in higher availability and higher durability than two copies in Amazon, because Amazon S3 and Backblaze B2 don't share datacenters in common (that we know of), don't share network links, don't share the software stack, etc. For bonus points, use different credit cards to pay for each, and have different IT people's credentials (alert email address) on each. That way if one IT person leaves your company and you don't get an alert that the credit card has expired, hopefully your other copy will be Ok.
> BB and S3 both have eleven 9 durability, how much does using both increase this?
Putting your data in either Backblaze B2 or Amazon S3 still suffers from other failure modes outside of the durability of the raw system. For example, let's say your IT person is poking around in their Amazon S3 account and accidentally clicks the "delete" button and all your data is gone? Or what if your credit card has a transaction declined, and your IT guy has left your company or the emails from Amazon are being put in the "Spam" folder of your email program? Or maybe a malicious Amazon employee writes a program to delete all the data in Amazon S3 from all customers? What if one of your employees is really disgruntled and logs into your Amazon S3 account and, just to spite you, deletes all your data?
In every one of these situations, if you have a copy in Backblaze B2 and also another copy in Amazon S3, you can recover your data from the other vendor.
I recommend using a separate credit card to pay for your Amazon S3 account and your Backblaze B2 account. They should expire a year apart. And don't give the logins to both systems to one disgruntled employee in your organization. Only give that disgruntled employee access to one or the other.
Depends what you are modeling. The probabilities of random disk failures are probably independent.
However, there are risks that are not necessarily independent such as the US Government ordering these two services to delete your data, or, as the article mentions, an armed conflict destroying data centers.
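A one-liner shows why those correlated risks are the whole story here (taking the advertised figures at face value):

```python
# If losses at the two providers really were independent, losing the same
# object from both would be absurdly unlikely -- so in practice the combined
# durability is limited by the correlated risks listed above, not this number.
p_loss_each = 1e-11            # advertised "eleven nines" annual loss probability
print(f"{p_loss_each**2:.0e}") # 1e-22
```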
I'm very disappointed their recovery time is 6 days!
Recovery workload should be spread across the whole cluster, so that the recovered data gets distributed evenly. In that case, assuming 10,000 drives, to recover one dead 12TB drive at a recovery rate of even 10 MB/sec per machine, recovery of one drive should be done in under a second. Maybe 10 seconds with some sluggish tail machines.
Why do you need it done in under a second? While the data is down one replica, it is at dramatically higher risk. Also, drive failures can be dramatically accelerated, for example in the case of a bad software release erasing data - you need to be able to move data faster than bad software gets released. And releasing software at a rate of one machine per second still means a release takes 3 hours!
Be careful here, it isn't 6 days until data is recovered it is 6 days until it is fully protected again, there is a big difference. During the 6 days the data would be available it just might have to be reconstructed on the fly by the error correcting rather than read directly.
In most systems we assume that "primary traffic" (read/write stuff) is prioritized over "rebuild traffic" which is recovering lost shards. So when you specify these things it is best to specify "how long to rebuild a shard while the array is providing storage services at its maximum specified rate." This assures the customer that if they have a 24/7/365 non-stop traffic pattern their data will still stay protected in the face of drive failures.
> During the 6 days the data would be available it just might have to be reconstructed on the fly by the error correcting rather than read directly.
Correct. More specifically, the FIRST time the data is accessed in any 24 hour period it must ALWAYS be reconstructed from the Reed-Solomon encoded parts on 17 other drives on 17 other machines. Any 17 is fine, so it's totally fine if 1 or 2 drives are not available. Once reconstructed it is stored in a set of front end cache computers that have fast SSDs for this purpose.
The second time the same file is accessed in a 24 hour period, it will be fetched out of the SSD cache layer so it won't even hit the spinning drives and won't care if all 20 drives are offline.
> "primary traffic" (read/write stuff) is prioritized over "rebuild traffic"
Yes. Backblaze balances between the two if only one drive has failed, but as a tome (20 drive group spread across 20 computers) becomes more badly degraded Backblaze begins favoring the rebuild. When two drives have failed out of 20, Backblaze stops allowing any writes to that tome because more writes will tend to fail yet another drive. Fewer writes offloads the tome. But we still allow reads. At Backblaze, we have never been 3 drives degraded out of 20 (knock on wood), but if this ever occurs the 20 drive tome is now running without parity -> so in that case we even stop allowing reads AT ALL until we are returned to at least 1 drive of fully redundant parity.
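A tiny sketch of that policy as described (the function and wording are mine, not Backblaze's code):

```python
# Degradation policy for a 20-drive, 17-of-20 tome, as described above.
def tome_policy(failed_drives):
    if failed_drives <= 1:
        return "reads + writes; rebuild traffic balanced against primary traffic"
    if failed_drives == 2:
        return "reads only; writes stopped so as not to stress another drive"
    if failed_drives == 3:
        return "no reads or writes until at least one drive of parity is rebuilt"
    return "fewer than 17 shards left: data cannot be reconstructed"

for failures in range(5):
    print(failures, "failed ->", tome_policy(failures))
```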
> In that case, assuming 10,000 drives, to recover one dead 12TB drive and a recovery rate of even 10 MB/secs per machine, recovery of one drive should be done in under a second.
I want to know where you can find a drive that can write 12TB/sec of data!
(In other words, you clearly missed half the problem. To add a new replacement drive, you have to be able to write to it the data from an original drive. Also RS code calculation is fast these days, but it ain’t that fast)
> spread across 10,000 drives, which is unrealistic
I claim it is also undesirable. Backblaze specifically made the conscious decision that the parts of any one single "large file" (these can be up to 10 TBytes each) are all stored within the same "vault". A vault is 20 computers in 20 separate racks. This allows a single vault to check the consistency and integrity of a large file periodically without communicating to other vaults in the datacenter.
The vaults have been a really good unit of scaling for Backblaze. If the vaults can maintain their performance, then we know we can just stamp out more vaults because there is almost no communication between vaults.
As to the one second claim, I think that's a math error, because even 10k * 10MB is only 100GB.
But spreading the data over 10k drives isn't unrealistic, it's a different architecture. Pick a different 20 drives for each file.
Working it through: Assume 200 machines with 50 drives each. Each machine has to read 1TB, transmit it over the network, do a parity calculation, and write out ~50GB. With dual 10gbps ports the bottleneck is the network, and if we dedicate one on each machine to the rebuild we get a 15 minute clock.
Not that having such a monolithic architecture is worth the complication and extra bugs.
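Working the 15-minute figure through with the same assumptions:

```python
# Rebuilding one 12 TB drive under 17-of-20 coding means reading ~17x that much
# surviving data, spread across 200 machines, with one 10 Gbps port dedicated
# to the rebuild on each.
lost = 12e12                           # bytes on the dead drive
per_machine = 17 * lost / 200          # ~1 TB read per machine
link = 10e9 / 8                        # dedicated 10 Gbps port, in bytes/sec

print(f"{per_machine / 1e12:.1f} TB per machine, "
      f"{per_machine / link / 60:.0f} minutes")   # ~1.0 TB, ~14 minutes
```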
> you need to be able to move data faster than bad software gets released. And releasing software at a rate of one machine per second still means a release takes 3 hours!
Since others have already demonstrated why the remainder of your comment is overly simplistic, I'll tackle this bit.
Generally, software releases are not rolled out at a constant rate to all machines. A typical thing to do is to release it to staging, then to a "canary" subset of machines (e.g. to 1% or 5% of the machines). Once all seems well there (e.g. metrics are clean and the canaries have handled X writes, reads, and simulated drive failures), it can be rolled out to a larger subset, and eventually to all machines.
In that way, the release can take whatever total amount of time is desired while still catching any such bugs fairly reliably.
Ideally, at Backblaze they could ensure that their canary instances are "data-redundancy aware" such that even if the 5% they roll to for the canary test all explode, data is still safe.
Regardless, any talk of "recovering data faster than software releases" is completely silly and totally misses the reality of how releases are done, how recovery is done, and what sort of bugs might happen. The math based on faulty assumptions about rate is also pointless.
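A toy sketch of that rollout shape (stage sizes and the health check are made up for illustration):

```python
# Staged rollout: canary a small slice of the fleet, check health, then widen.
import random

STAGES = [0.01, 0.05, 0.25, 1.00]      # fraction of the fleet at each stage

def metrics_clean():
    # Stand-in for "canaries handled X writes, reads, and simulated failures".
    return random.random() > 0.001

def roll_out(fleet_size):
    for fraction in STAGES:
        machines = int(fleet_size * fraction)
        print(f"deploying to {machines} machines")
        if not metrics_clean():
            print("canary metrics bad: halting and rolling back")
            return
    print("release complete")

roll_out(10_000)
```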
> Also curious what are the stats from other providers.
As to other providers, most are 6+ 9s that I’ve looked at, with many in the 8-9 range. Anything over 8 is (as they admitted) essentially marketing porn and not a useful metric (for reasons they mentioned as well as ones said by other comments here).
And yet, there aren't many complaints on the internet about Gmail losing emails because of disk failures. And Gmail likely stores much more than 10 trillion objects and 100TB of data.
A number of years ago a Gmail "insider" that I know admitted they had lost customer emails and their policy (at that time at least) was to simply ignore it and not tell anybody because most customers never notice if it is a small number.
I think a much bigger scandal is that all major laptop Operating System vendors (Microsoft and Apple) absolutely know when your laptop drive loses files or even is starting to go bad in some cases, and they NEVER tell the customer. I think an excellent product offering would be a 3rd party piece of software and cloud service which was a "verification service". It wouldn't store your files offsite, it would store the name, size, and SHA1 offsite and periodically check that no bits have been flipped on your local drive unless you intended it. For example, a week after I take a photo, I absolutely never want the photo to change. Ever. Same with music I (legally) download.
I would be curious to see what the chance any given email is read and how much it drops off after the first reading. I would be willing to bet that out of 1000 people who receive a reasonable amount of mail you could probably find missing emails.
The raw number would imply that, but I'm pretty sure the math breaks down when you're storing 10 byte records.
Chunks of your data are going to be stored together, so it's a very small chance of losing a big block of 10 byte files. There's no failure mode that loses just one, and does so often.
I just wish Backblaze would not go "oh man, a lot of your data changed, you should probably check for integrity and start your backup with us again later...."
Oh gosh thanks Backblaze, I'll just dig through several TB of stuff....
The best durability is probably achieved by Amplidata, but it does not matter.
You need to do the same calculation for your meta data, which is probably not erasure coded. If you lose this, you don't lose your data, but you no longer know where you put it.
So you probably add your meta data to your data as well in some kind of recoverable format. That's fine, it means that you can harvest the meta data again.
The Backblaze blog post points to Amazon's CTO (Werner Vogel)'s blog post, in which he states that "...These techniques allow us to design our service for 99.999999999% durability."
(side note: Werner is a great person)
Unfortunately, there is a difference, a huge difference, between a system "designed" for 11 9s of durability, and a system "offering" 11 9s of durability.
I wish Backblaze, or Amazon, or anybody else, would clarify durability using very honest terms.
An example?
"This system offers X 9s of durability over a period of one year, on average. This is a technical paper that describes how we tested that durability", followed by measurements and test specifics.
There's another piece too. Practically, they're a backup service, so it matters how often someone needs to recover their data without having had an opportunity to reset their backup based on their current state. I've used Backblaze for years and only needed it once. (I also back up to a Time Capsule as well, since data recovery is more practical/easier from that.)
Well written, but there are other significant risks like losing access credentials (e.g. a password stored only on one device that is destroyed in the same accident in which its only user, who remembered the password by heart, dies) or being hacked by someone who gains access to cloud storage and intentionally erases or corrupts data.
Specialization is good, but if Backblaze is strictly in the business of storing data on hard disks, who's going to help with designing and maintaining the reliable complete system on top of their service that users actually need?
Off topic but I just want to say how happy I've been as a Backblaze customer. B2 is a fantastic product for my backup needs, and their hard drive stats have always been handy when selecting drives.
One question since I know some of the Backblaze folks respond to these threads:
In addition to calling (and possibly getting blocked/ignored), does your customer service staff send text messages? I suspect that a big percentage of the phone numbers you have are for cell phones these days, and I see a lot less SMS spam than I do telemarketing. SMS would also allow you to get a bit of info visible to recipients (e.g. "Backblaze CC Expired") with more detail once a message is opened.
Yev from Backblaze here -> I believe we do send SMSs in the case of a Cap or Alert getting reached, so yes that could be possible - though I'm not sure if an SMS is part of our billing failure process - that's an interesting question!
Backblaze B2 customer here. My credit card stopped accepting your billing and no SMS for me. Took me a month or so to notice the emails and update my details. I've got SMS alerts active for Caps. Would be worth adding that, as it was a bit scary when I noticed the mail (I think it was the third one you'd sent!).
This article supposes that a meteor impact is more likely than a disaster that renders northern California (their only data centers are in Oakland and Sacramento AFAIK) without power or civil order within ten million years.
As someone else pointed out, it’s overly simplistic. They’re a great low-cost alternative to S3, sure. But keep a backup on another continent if you need your data 100 years from now.
My experience is that it's bullshit. I had a backup (damn I still have one) with Backblaze, when attempting to restore it, maybe 30% of the files survived restore. The rest are lost in smoke.
They don't have any way to detect corruption in the data or if they have, the backup clients are oblivious to it.
>The sub result for 4 simultaneous drive failures in 156 hours = 1.89187284e-13. That means the probability of it NOT happening in 156 hours is (1 – 1.89187284e-13) which equals 0.999999999999810812715 (12 nines).
Minor nitpick, this ignores the possibility of more than 4 failures, although this error only affects the fourth digit after the nines. Much more egregious is the following:
>there are 56 “156 hour intervals” in a given year
This is too simplistic, there are in fact infinitely many 156-hour intervals in a year, some of them just happen to overlap. This overlap can't simply be ignored because even if none of their 56 disjoint intervals contain 4 events this does not rule out the possibility of there being 4 events in some 156 hour interval they didn't take into account. In fact failing to take into account even one of the infinitely many intervals creates a blind spot (consider what happens if the drives happen to fail precisely at the start and end of a particular interval). You can still get a lower bound by e.g. ensuring none of the 56 intervals contain more than 1 failure, or by adding more intervals and ensuring none of them have more than 2 failures etc.
Their binomial calculation contains the same mistake.
A quick improved lower bound can be obtained by calculating the probability that any failure is followed by (at least) 3 other failures within 156 hours. For one failure this probability is given by the Poisson distribution and is
Pc = 1 - Σ_{k<3} e^(-λ) λ^k / k! = 5.18413e-10.
Now we get into some trouble because the failures and the probability of a 'catastrophic' failure are dependent, however the probability that any particular failure turns catastrophic is constant, so the expected number of catastrophic failures can't be greater than the expected number of failures times that constant, this gives a lower bound of
Pc (365·24·λ) = 6.63154e-9
this is a lower bound, but that's still three fewer nines left than their claim.
Anyway let's just hope their data centres are more reliable than their statistics.
Edit: This last calculation can be justified by noting that the probability that 1 critical failure starts in a particular time interval is Pc times the probability of 1 failure in that interval plus some constant times the probability of more than one interval. Similarly the probability of more than one critical failure is at most the probability of more than one failure.
Now the probability of more than one failure in a time interval is dominated by the length of the interval, therefore if you calculate the density those parts fall away and you're left with a density of Pc λ critical failures per hour.
This seems to be an exact expression for the expected number of critical failures, and not just a lower bound. Although it is still a lower bound for the probability of a critical failure, albeit a fairly tight one.
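A small Monte Carlo makes the blind spot visible; the failure rate below is an arbitrary assumption chosen so the effect shows up in a short run, not a number from the article:

```python
# Compare "4 failures inside one of 56 fixed 156-hour buckets" with
# "4 failures inside *some* 156-hour window, wherever it starts".
import random

HOURS_PER_YEAR, WINDOW = 8760, 156
RATE = 45 / HOURS_PER_YEAR              # assumed: ~45 failures per year

def one_year():
    times, t = [], 0.0
    while True:
        t += random.expovariate(RATE)   # Poisson process: exponential gaps
        if t > HOURS_PER_YEAR:
            return times
        times.append(t)

def sliding(times):
    return any(times[i + 3] - times[i] <= WINDOW for i in range(len(times) - 3))

def disjoint(times):
    counts = {}
    for t in times:
        counts[int(t // WINDOW)] = counts.get(int(t // WINDOW), 0) + 1
    return any(c >= 4 for c in counts.values())

years = [one_year() for _ in range(20_000)]
print("sliding windows :", sum(map(sliding, years)) / len(years))
print("disjoint buckets:", sum(map(disjoint, years)) / len(years))
# The sliding-window estimate is consistently higher: the 56 disjoint intervals
# miss clusters that straddle a bucket boundary.
```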
They're not ultimately Gaussian. :) The normal distribution is continuous and has unbounded support, whilst neither the binomial nor the Poisson is continuous and unbounded.