
This was an interesting read, both for the points made about durability and for the in-depth math. However, what stood out to me most was this passage:

Because at these probability levels, it’s far more likely that:

- An armed conflict takes out data center(s).
- Earthquakes / floods / pests / or other events known as "Acts of God" destroy multiple data centers.
- There's a prolonged billing problem and your account data is deleted.

The point is that once you reach a certain level of durability (at least as far as hardware/software is concerned), you're chasing diminishing returns on any further improvement. But the risks that remain (and that have been big issues for people lately) are things like billing problems. I think it's an important point that operational procedures (even in non-technical areas like billing and support) are critical factors in data "durability".




I've posted the math here before but if we assume that an asteroid hits the earth every 65 million years and wipes out the dominant life forms, then this fact alone puts your yearly durability at a maximum of ~8 nines.
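
As a quick back-of-the-envelope check of that figure (a minimal sketch, assuming a Poisson-style one-in-65-million-years event):

    import math

    # Assumed rate: one extinction-level impact per 65 million years.
    rate_per_year = 1 / 65_000_000

    # Yearly durability is the chance of not being hit in a given year;
    # "nines" is -log10 of the annual loss probability.
    durability = 1 - rate_per_year
    nines = -math.log10(rate_per_year)

    print(f"durability = {durability:.10f}")  # ~0.9999999846
    print(f"nines      = {nines:.2f}")        # ~7.8, i.e. roughly 8 nines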

The point about billing is better, though.

My other concern is that a software bug, operator error, or malicious operator deletes your data.


That's why no sane entity would use one earth. Use two and you can quickly recover.


Our goal is N+2: that way, when one Earth is down for planned maintenance, you can endure the unplanned loss of a second Earth.

N, of course, is always 1.


Why build one when you can have two at twice the price?


Recovery isn't quick at all unless you have developed the Ansible tech tree.


If you have two you have one, if you have one you have none.


> if we assume that an asteroid hits the earth every 65 million years and wipes out the dominant life forms, then this fact alone puts your yearly durability at a maximum of ~8 nines

I don't think this is a useful definition of your yearly durability. If your data center is down for maintenance during a period in which it is guaranteed that nobody wants to access it, that doesn't reduce your availability at all -- if your only failure is an asteroid that kills all of your customers, it would be more accurate to say you have 100.000000% availability than 99.999999%.


Isn't the expected lifespan of the company more limiting? Plenty of cloud storage companies go out of business (typically, they run out of money). You can apply Gott's law to this. It's pretty grim.


I guess this is what you're referring to: https://en.wikipedia.org/wiki/J._Richard_Gott#Copernicus_met...


Yes.

   [t/3,  3t] with 50% confidence
   [t/4,  4t] with 60% confidence
   [t/39, 39t] with 95% confidence
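
For reference, here's a small sketch of how those bounds fall out (assuming only that we're observing at a uniformly random point in the total lifespan):

    def gott_interval(confidence):
        """Bounds on remaining lifetime, as multiples of the current age t."""
        lower = (1 - confidence) / (1 + confidence)
        upper = (1 + confidence) / (1 - confidence)
        return lower, upper

    for c in (0.50, 0.60, 0.95):
        lo, hi = gott_interval(c)
        print(f"{c:.0%}: [t/{1 / lo:.0f}, {hi:.0f}t]")
    # 50%: [t/3, 3t]
    # 60%: [t/4, 4t]
    # 95%: [t/39, 39t]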


The event that wiped out the dinosaurs was 65 million years ago, but the Mesozoic era lasted for ~200 million years. Your point stands, though.


The question of how many data points to use is a subtle one, though. I can say that I picked one data point because I was lazy and doing back-of-the-envelope math, which is reasonable because we can be somewhat assured that I didn't choose a number of data points that was convenient for the hypothesis.

But if you're choosing two data points, my question is... why two? If you are choosing whether or not to reply based on whether or not the second data point fits with the first, then you're introducing selection bias. The chance that the second data point disagrees with the first by at least as much as the 200 My interval disagrees with the 1/65 My rate is 1 - (exp(-65/200) - exp(-200/65)) ≈ 0.32, which is not especially high.
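
A quick numerical check of that number (a sketch, assuming exponentially distributed gaps between impacts with a 65 My mean):

    import math

    mean_gap = 65.0    # My, assumed mean time between extinction events
    observed = 200.0   # My, rough length of the Mesozoic era

    # Two-sided "at least this surprising" probability for an exponential gap:
    # either a gap >= 200 My, or one shorter than 65 My by the same ratio (~21 My).
    ratio = observed / mean_gap
    p_long = math.exp(-ratio)            # P(gap >= 200 My) = exp(-200/65)
    p_short = 1 - math.exp(-1 / ratio)   # P(gap <= 65^2/200) = 1 - exp(-65/200)
    print(p_long + p_short)              # ~0.32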


I wonder if there's a general term in engineering for the case where a particular risk has been reduced well below the likelihood of more serious but exotic risks. I've heard about this most in cryptography where we can sometimes say that the risk of, say, a system failing due to an unexpected hash collision is drastically less than the risk of the system failing due to cosmic radiation or various natural disasters. At that point it doesn't seem important or worthwhile to consider this risk, because it's dwarfed by the others.

This seems like a form of the same argument, and I wonder where else this arises and how people describe it.


I think in other contexts, you might say that these signals are "below the noise floor."


As an audio engineer (as well as a dev), I definitely agree with this use of that phrase.




It’s not a perfect match, but Rumsfeldian “unknown unknowns” come to mind.

Specifically: every X-nines durability design will be compromised by some failure mode you didn’t think of.


I don't think unknown unknowns are what I'm thinking of here. In this case the argument involves a very specific risk, and sometimes a very specific lower bound for its probability.

For example, in the hash collision case the argument says that it's not worth worrying about the (known) probability of a software error due to an unexpected hash collision because it's dominated by the (known) probability of a comparable error due to cosmic radiation. (The former probability can be calculated using the birthday paradox formula, and the latter has been characterized experimentally in different kinds of semiconductor chips.)
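
For the hash side of that comparison, a minimal sketch of the birthday bound (the cosmic-ray side comes from published measurements rather than a formula; the one-billion-object, 256-bit figures below are just an illustration):

    import math

    def collision_probability(num_items, hash_bits):
        # Approximate birthday bound: P(collision) ~= 1 - exp(-n(n-1) / 2^(b+1)).
        space = 2.0 ** hash_bits
        return 1 - math.exp(-num_items * (num_items - 1) / (2 * space))

    # One billion objects under a 256-bit hash: about 4e-60 analytically,
    # small enough that it underflows to 0.0 in double precision.
    print(collision_probability(10**9, 256))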

This kind of argument doesn't rely on the idea that there are other risks that we can't identify or quantify. It's about comparing two failure modes that we did think of, in order to argue that one of them is acceptable or at least not worth further attempts to mitigate.


Although Rumsfeld often gets credit for this statement, it had been around for a long time before him: https://en.wikipedia.org/wiki/There_are_known_knowns


I think Rumsfeld gets credit for it because he was using it in the most degenerate, disingenuous form possible. Rather than guarding against legitimate concerns and pursuing actual handling of potential issues, he was just trying to rationalize continuing policies that were demonstrably counterproductive. It's one thing to say there might be factors we don't know. It's another to say that, simply because there might be such things, we should dedicate significant resources and lives to blindly flailing away under the assumption that it will help. Normally, a presumption of unknown unknowns puts you in a position of not acting. There is no way to know you aren't making a problem worse if you know that little.


I tend to assume, pessimistically, that the durability design will itself cause a problem. Redundant switches to survive a hardware failure, for example, strike me as inviting trouble.


Indeed. I once had critical systems routed via a large redundant Cisco switch which claimed to be 1+1. Turns out there was a single "supervisor" component which failed (after just a year or two) and made the pair of switches useless. Apparently the designer worked in a team where nobody does anything when the boss is out.


For a consumer, the cheapest and easiest way to back up important documents or files is to encrypt them and store them across multiple storage providers, e.g. Dropbox and Google Drive.

They usually give you a reasonable amount of free storage, and it's unlikely all accounts would be terminated or locked at the same time.

And of course, you should always have your local backups as well.
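
To illustrate the "encrypt first" step, a minimal sketch using Python's cryptography package (the file name and sync-folder paths are just placeholders):

    import os
    from cryptography.fernet import Fernet  # pip install cryptography

    # Generate a key once and store it somewhere safe, NOT alongside the backups.
    key = Fernet.generate_key()
    fernet = Fernet(key)

    with open("documents.tar", "rb") as f:
        ciphertext = fernet.encrypt(f.read())

    # Drop one encrypted copy into each provider's sync folder.
    for sync_dir in ("~/Dropbox/backup", "~/Google Drive/backup"):
        path = os.path.join(os.path.expanduser(sync_dir), "documents.tar.enc")
        with open(path, "wb") as out:
            out.write(ciphertext)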


Assume you want access to your files over the next 20 years. What are the odds Google will have bought Dropbox in that time, and what are the odds an automated monitoring system at Google AdWords will have disabled your Google account in that span?

Replace with Amazon and/or CrashPlan as appropriate.


Backblaze suggests a "3-2-1" backup strategy: three copies of your data, on two different types of media, with one copy off site. You should always have at least one backup on site, and if a remote backup becomes inaccessible, you can move over to another remote provider.



