Ask HN: Theory of Backups
10 points by Xen9 3 months ago | 21 comments
Most discussions of backups focus on consumer-level, practical matters.

There seems to exist, buried underneath the superficial & the common-sense advice, a theory of how to do backups well.

I've found two elements upon which a better theory concerning rotations & other details (e.g. hash-verification scheduling, the number of different devices) can be built.

The first is the Tower of Hanoi scheduling scheme, which we will abbreviate TOH.

The second is the Incremental-Differential-Full backups concept, which we will abbreviate IDF.

The best available resource seems to be the Acronis website's illustrated docs: http://acronis-backup-recovery.helpmax.net/en/understanding-acronis-backup-recovery-10/tower-of-hanoi-backup-scheme/. I ask that you in good faith ignore the fact that Acronis is a company selling commercial Windows software; feel free to post better links in the comments if you find better information elsewhere.

We end up with a scheme we can call IDF-TOH. In it we have three types of backups:

- Incremental, at L_0 ("Level A" in the linked resource), the most frequent level, capturing only changes made since the last backup.

- Differential, at each intermediate level between L_0 and L_n, capturing changes made since the last full backup.

- Full, at L_n, the least frequent level, capturing the whole system to be backed up.
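
To make the three types concrete, here is a minimal sketch (the session numbering and names are my own illustration, not from the linked docs) of which archives a restore needs under IDF: the newest full, then the newest differential taken after it, then every incremental taken after that, applied oldest-first.

  from dataclasses import dataclass

  @dataclass
  class Backup:
      kind: str     # "full", "diff", or "inc"
      session: int  # monotonically increasing backup session number

  def restore_chain(backups: list[Backup]) -> list[Backup]:
      """Archives needed for a restore, oldest first: the newest full,
      the newest differential taken after it (if any), and every
      incremental taken after that."""
      fulls = [b for b in backups if b.kind == "full"]
      if not fulls:
          raise ValueError("no full backup to restore from")
      base = max(fulls, key=lambda b: b.session)
      chain, start = [base], base.session
      diffs = [b for b in backups if b.kind == "diff" and b.session > start]
      if diffs:
          last_diff = max(diffs, key=lambda b: b.session)
          chain.append(last_diff)
          start = last_diff.session
      incs = [b for b in backups if b.kind == "inc" and b.session > start]
      return chain + sorted(incs, key=lambda b: b.session)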

So now, what can we do? At least the following directions could be taken in further developing a Theory of Backups:

0 (backup scheduling): The frequencies can be chosen in many ways, and I am not sure which is optimal. In the Tower of Hanoi scheme the period of level L_a, for a in the closed interval 0 ... n, scales as 2^a. Frame-Stewart may or may not be of any use in this.
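
As a minimal sketch of that schedule (the 1-based session numbering and the exact level indexing are my assumptions): the level used at session s is the number of trailing zero bits of s, capped at the top level n, which is the "ruler sequence" underlying the Tower of Hanoi.

  def toh_level(session: int, n: int) -> int:
      """Backup level to use at 1-based session number `session`, in an
      (n+1)-level Tower of Hanoi rotation (levels 0..n).

      Level 0 runs every other session, level 1 every 4th, and so on;
      the count of trailing zero bits of the session number gives the
      level, capped at n so the top level holds the full backups."""
      level = 0
      while session % 2 == 0 and level < n:
          session //= 2
          level += 1
      return level

  # Example: with n = 3, sessions 1..8 map to levels 0,1,0,2,0,1,0,3
  assert [toh_level(s, 3) for s in range(1, 9)] == [0, 1, 0, 2, 0, 1, 0, 3]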

1 (rotations): IDF-TOH does not address the problems of rotations, i.e. if you make a backup that corrupts your previous data and then repeat the mistake, you get into trouble quickly. It's ALSO noteworthy that certain media may better fit certain layers in IDF-TOH & the future schemes. For example, adrian_b wrote three days before this:

"... Of all the optical discs that have been available commercially, those with the longest archival time were the pressed CD-ROM with gold mirrors, where the only degradation mechanism is the depolymerization of the polycarbonate, which could make them fragile, but when kept at reasonable temperatures and humidities that should require many centuries..."

Consequently these would be the best for the Full backups, while Solid State Drives may work for the Incremental ones.

2 (perfecting IDF): The IDF scheme may not be perfect either & can probably be refined further.

3 (hashing): Verifying the backups matters & should be a part of a complete scheme.
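
A minimal sketch of what that could look like (the manifest layout and paths are hypothetical): record a SHA-256 digest per archive when it is written, then re-check on a schedule and report anything that no longer matches.

  import hashlib
  from pathlib import Path

  def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
      """Stream a file through SHA-256 without loading it into memory."""
      h = hashlib.sha256()
      with path.open("rb") as f:
          while block := f.read(chunk):
              h.update(block)
      return h.hexdigest()

  def verify(manifest: dict[str, str], root: Path) -> list[str]:
      """Return the archives whose current hash no longer matches the manifest."""
      return [name for name, digest in manifest.items()
              if sha256_of(root / name) != digest]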

---

This may not be valuable for all businesses, but most individuals already using rsync or borg would probably prefer to use the best available scheme if it reduces the probability of incidental data loss at minimal effort. The task of translating the best possible scheme into a config program with a humane interface is an undertaking of its own.




My practically minded friend has an interesting scheme for backups. On a periodic basis he uses Clonezilla or similar products to clone a machine's existing drives onto new replacements, then puts the old ones in a safe place.

The other, normal backups are usually managed by someone else; most of the time he just does the hardware.

His backups are tested by experience.


This is the type of scheme that would be improved hugely by adapting IDF: instead of full copies he'd make the majority incremental or differential, saving time.

I realized today that, given multiple devices, a mechanism where in an IDF scheme one does not always back up from "target" to "drive" but occasionally from "drive" to "drive" could unlock new possibilities.

Regardless, I would mention IDF to your friend, with the note that he'd have to use backup software that does its own encryption, because encrypted LUKS images don't reveal their contents and thus can't be used for IDF. This actually leads to the last point about interfaces and tools in my post: it seems that to gain the efficiency of IDF, a tool must at least offer encryption.

Consequently, we get yet another theoretical question: How should the cryptography in IDF-TOH be implemented?

I mean: making F 16 words, D 8 subwords of F, and I 4 subwords of F would hugely improve security!
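
For comparison, a more conventional construction than sharing subwords across levels would be to derive an independent key per level from a single master secret. A minimal sketch (the label scheme is my own, and HMAC-SHA-256 stands in here for a proper KDF such as HKDF):

  import hashlib
  import hmac

  def level_key(master: bytes, level: int) -> bytes:
      """Derive an independent 256-bit key for one backup level from a master
      secret, so that leaking the key of a frequent (incremental) level reveals
      nothing about the keys protecting the differential or full archives."""
      return hmac.new(master, f"idf-toh-level-{level}".encode(), hashlib.sha256).digest()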


I fail to see how involving anything other than replacement drives is advantageous. He's got old bootable systems ready to go.


Rule 0 for backups: Whatever scheme you use, periodically verify your backups by restoring some files off them. Before you have a problem. You'll thank yourself eventually.


This is actually what I mean by saying that lots of the "discussion" is repeating common-sense hymns.


I find your post quite interesting. At the same time, I believe one reason those hymns are repeated so often is that for the majority of businesses, 99% of the value lies in simply following them and not thinking too hard about theories. There will often be a relatively clear path forward prescribed by the software used, by know-how, or (where no software is used) by ease of implementation.


This is precisely the reason why a theory would be valuable. The businesses can then use Legendary Backup™ for Maximal Data Safety Certifications by Us.


I'm missing the threat of ransomware in your explanation. The best concept does not help if everything is always online and ransomware is able to encrypt your backups.

I personally use the following backup strategy:

- Set up an encrypted ZFS storage box on the network (e.g. TrueNAS; in my case it is Proxmox)

- Enable zfs-auto-snapshot for automatically rotated 15-minute snapshots (keep 24 daily, etc.)

- NEVER (!) type in the passwords of users permitted on the ZFS storage on any client that could be affected by ransomware

- Provide a user-authenticated Samba share to store all important data; try to prevent local storage of data

- Sync the ZFS snapshots to an external USB drive every night (I use a Tasmota-flashed Shelly plug and an external USB case to power off the devices when they are not needed)

  # NEW_POOL_SNAP is assumed to expand to "$SRC_POOL@$NEW_SNAP_NAME"
  # create the current snapshot recursively
  zfs snapshot -r "$NEW_POOL_SNAP"

  # first (full) backup: send the whole snapshot tree raw, preserving encryption
  zfs send --raw -R "$SRC_POOL@$NEW_SNAP_NAME" | pv | zfs recv -Fdu "$DST_POOL"

  # incremental backup: send only the snapshots between the two named ones
  zfs send --raw -RI "$BACKUP_FROM_SNAPSHOT" "$BACKUP_UNTIL_SNAPSHOT" | pv | zfs recv -Fdu "$DST_POOL"

- On Windows and macOS, back up the OS to an external drive

- Use restic to keep an additional copy of the local files and folders somewhere else

- Use a Blu-ray burner to back up the most important stuff as a restic repository or encrypted archive (very important documents, the best photo collections of your family, the KeePass database, etc.) and put it in another location

- If cloud storage is affordable for the amount of data you have, consider using restic to store your stuff in the cloud

- From time to time, try to restore a specific file from the backup and check that it worked, and try to restore a full system (onto an additional hard disk).

This may sound overkill, but ransomware is a pretty bad thing these days, even if you think you are not one of its targets.


I believe it was my mistake not to define the scope & assumptions of the backup problem rigorously, because this has confused many commenters.

I believe that any ransomware either (1) exists inside a backup and is thus removable, or (2) would be accounted for by IDF-TOH.

Actually, your comment also confuses me. By the definition of a non-cloud backup, it follows that not everything is always online or accessible by a CPU. Whatever the case may be, your approach would be "practical" rather than "theoretical", in that it does not seem to make any fruitful contribution to the problem I proposed.

A theoretical approach would be to create a general semi-formal model of (1) the problem, e.g. scope, variables, failures of previous solutions; (2) the foundation, i.e. more or less what I call IDF-TOH as the first iteration of a solution; and (3) a separate discussion preparing for future practical implementations of the "final" technical solution, e.g. integrations with what already exists out there, possibly gathering ideas for how to make the interface human-friendly.

Would you think an IRC channel, mailing list, etc. would be a good place – in a change of tactics – to stimulate discussion & research on this?


Well, then I totally misunderstood the problem. I'm not a researcher, so I don't think I can contribute much to your topic.

However, ransomware these days can:

  Lurk on your system for a pretty long time before taking action (14 days or longer)
  Encrypt USB disks as they are attached
  Encrypt network drives
  Infect other systems connected to your network
  Delete existing backups on many online targets (including cloud)

So even if you do a "cloud" backup and have the credentials on your system to do "convenient" backups, it may just delete them.

A way to overcome this is to have an "append only" backup, so that the older backups are secure. I'm not sure you covered this, but as I already stated: I'm not a researcher.


I believe that the most good would be done for the world by first working out the theory & then developing an applet, e.g. a long-lasting website, that will automatically determine the best possible "real-world" backup solution.

This would use Bayesian heuristics, openly available data, and the user's personal requirements.

A non-technical granny can press the buttons, and it gives her a printable PDF sheet that says: keep three copies of this; these are your best possible backup rules.

For a good design, the problem of changing requirements must be addressed. Mapping every scenario imaginable and implementing support for the transitions between them would increase usability significantly.

This'd answer your problem – I'd call it paranoia – of ransomware. By the way, Qubes OS can make exact or near-exact VM image backups at arbitrary times and move files from one VM to another without shutting down, mitigating every non-targeted piece of ransom malware & the majority of non-state-backed targeted ransom malware.


Everything depends on the security compliance needs.

Regarding the backup scheduler – sometimes companies need frequent backups due to their RPOs and RTOs, for example if they operate in a highly regulated industry. If someone can tolerate losing two hours of data, then they need a backup performed every 2 hours; if we are speaking about 8 hours (a working day), then why not have backups on a daily basis?

Regarding rotations – everything depends on the backup solution; if it provides immutable backups, the data as a whole won't be corrupted. Thus, the faster someone notices the mistake, the faster they can restore their copy. IDF helps more with deciding the storage issue – not overloading it (deduplication and compression are also worth mentioning here).


We should define the problem as "wanting to store most-typically-incrementally mutating data in perpetuity." I did not put all the definitions & assumptions into the post because this is not an academic paper and most of it can be figured out (if we assume we care about compliance, then everything is defined and there is no interesting problem).

IDF-TOH will improve security if implemented in this manner: https://news.ycombinator.com/item?id=41179989

Writing this, I realized that it absolutely makes sense to have different media do different jobs, since the I and D drives will also be consumed at different rates. This begs for further analysis, because though we can obviously say that SSDs will be better for something that's accessed constantly, the problem of which computer hardware to use given these usage parameters is ~solved.


A few things that I find don't get considered when contemplating backup procedures are:

1. How long should you keep backups for? Is the content of your backup covered by privacy laws that require you not to have copies of it after a certain period of time? Is there a point where the content of your backup is so old that it's the logical equivalent of not having made a backup in the first place?

2. How much does your backup process cost? If it costs more to back up a system than it would cost you to lose it, then you've got the backup process wrong (interestingly, this can be affected by economies of scale).

3. What do you need to restore a backup? Does your system require bespoke hardware that might have been lost in whatever disaster you're trying to recover from?


Yes, it's worth paying attention to these questions, as they are very important. Retention is mostly related to the compliance requirements of the industry: some data should be kept for 5 years and some for 20, so it's good if a backup solution provides long-term or unlimited retention.

Regarding the cost, I think that a backup solution is always much cheaper than the cost of data loss. So it's better to have a backup than not.

And, of course, restore and recovery checks are also important. Moreover, a backup solution should foresee different disaster scenarios, for example like those described in this article: https://gitprotect.io/blog/github-restore-and-github-disaste...


Assuming we want to store most-typically-incrementally mutating data in perpetuity, then TOH or a derivation of it probably provides the optimal scheduling scheme. The question is which frequencies you give each "layer", which parameters define those, et cetera.


If I want to learn about operating systems, I know which books to read. What's the equivalent for backups? It seems to me I merely rely on blog posts, and that's something I'm not comfortable with. There are some books out there that perhaps dedicate at most one chapter to backups, but those are usually outdated and do not contain much practical information.


I believe none currently exists, because the problem has no real theoretical solution, because almost no one has given it thought. Lots of "related" but information-wise not very useful texts exist on "how to do X", but these are not concerned with what the X should be.


YAGNI is among my theories for personal backups. Data is not precious. It is a burden. When something matters, I print it. Or put it on Facebook or Youtube.

…but I never delete, because the more copies of the same thing there are, the more likely it is to survive. If I do in fact need something, the time spent searching is far shorter than a tedious backup procedure.

In addition, if I have to recreate something version 2 will be better because I keep getting better at the things I do.

But that is me not you. Good luck.


Backups are like prayers, only if you make backups you won't need prayers.


Just wait till your company depends on restoring that backup :D



