One of my old employers dutifully did tape backups every day, and one of the bosses took them home.
During an upgrade it was revealed that the drive wasn't working, and hadn't been working for at least a year (I think that's how far back the backups went).
But I'd love to hear how people practice restoring all their backups on a regular basis when it requires an enormous amount of space. Practically, I just restore a few backups every month, and if those work, I assume the others work as well (which I know is a false assumption, but what are you going to do?)
These would happen outside heavy operation hours, and if successful, the restored servers would become the live servers in most instances.
These were done using scripted vmware instances, mostly restoring from StorageCraft Shadow Protect backups.
The file server was the most resource-intensive and the most prone to complaints about downtime, but usually the trade-off was fine.
The DB server was usually fine due to the out-of-hours restore and the ability to do a master-slave replicate-and-handover.
The mail server was usually transparent: you could set up SMTP transfer rules for incoming mail, and outgoing mail didn't really care.
The trickier ones were things like IBM iSeries servers, which usually require periodic full system saves, restores and tests. These couldn't be done automatically without excessive licensing fees and much manual babysitting.
If you get to the point where you have automatic scheduled DR, you can be pretty confident your backups are working.
If the automated DR fails, you roll back, learn whatever lesson the failure taught you, and change the system so it doesn't happen again.
Summary: schedule DR, not backups.
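Even a minimal version of this is worth automating. As a rough sketch (untested; the paths, the database names, and the `orders` sanity-check table are all made up for illustration), a nightly cron job could restore the newest dump into a scratch database and fail loudly if it looks wrong:

    #!/bin/sh
    # restore-test.sh -- run nightly from cron; restores the newest dump
    # into a throwaway database and sanity-checks the result
    set -e
    latest=$(ls -1t /backups/db/*.dump | head -n 1)
    dropdb --if-exists restore_test
    createdb restore_test
    pg_restore --dbname=restore_test "$latest"
    # a restore that "succeeds" but comes back empty is still a failure
    rows=$(psql -tA -d restore_test -c 'SELECT count(*) FROM orders')
    [ "$rows" -gt 0 ] || { echo "restore test FAILED for $latest" >&2; exit 1; }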
However, we have around 30 different databases, each of which takes more than a day to restore. We back up the databases every X days, upload them to S3, and delete the previous ones, keeping only the last 3 or so.
So if it takes us 30 days to restore all the databases, those backups might already be gone from S3 as part of the cleanup operation anyway.
It just gets complicated and non-trivial when you have multiple databases, each terabytes in size.
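One thing that might at least remove the race between restore tests and cleanup: let an S3 lifecycle rule do the expiry on a fixed clock instead of a delete-after-upload script. A sketch (bucket name, prefix, and retention period are made up):

    aws s3api put-bucket-lifecycle-configuration \
      --bucket example-db-backups \
      --lifecycle-configuration '{
        "Rules": [{
          "ID": "expire-old-dumps",
          "Status": "Enabled",
          "Filter": {"Prefix": "dumps/"},
          "Expiration": {"Days": 90}
        }]
      }'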
The solution is to have a tape or disk library and have another machine run the restore. If you have TBs of data, you can afford a library and another PC to run the recovery test.
Nobody said it was. This subthread is about the importance of testing backups in general.
> If you have TBs of data you can afford a library and another PC to run the recovery test.
They already explained how that's not enough. Yes, they can add more, but you're missing some info there.
I've tried various options over the years, but I haven't found a simple OTS non-tape offline backup solution. There's a lot of potential here.
You're definitely right, though; there's gotta be a better way. SSDs look promising, though not necessarily for long-term storage (for that, I suspect it'd be hard to beat optical, aside perhaps from printing out a bunch of QR codes or something).
Usually those conditions aren't met properly.
Learned that lesson early in my career - the fourth and final tape did restore.
NB The backups weren't my responsibility, but restoring the payroll database my shell scripting error had nuked most definitely was.
Edit: Why is this getting downvoted?
I'm not saying this is a "real" backup solution. Call it an archive if you want. It's still better than not having any copy of your data in offline storage off site. Defense in depth, people: there's no reason you can't both have a real backup system and shove old disks in a storage tote.
If you haven't restored from a backup before, you don't know if it's possible to restore from that backup.
> As long as you know the 1s and 0s were properly written ...
How do you know? Did it actually write properly? If so, was the data you chose to write the data you actually need to fully restore?
You might be able to do certain tests or checks to determine this to a high degree of probability, but there's only one test that will tell you with certainty, and that's actually performing a restore successfully.
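The cheap end of that spectrum is verifying the bits themselves, e.g. a checksum manifest written at backup time and re-verified from the backup media later (paths are illustrative):

    sha256sum /backups/2019-02-11/* > /backups/2019-02-11/SHA256SUMS
    # later, reading back from the backup media:
    sha256sum -c /backups/2019-02-11/SHA256SUMS

A clean pass only proves the media still yields the bytes you wrote, not that those bytes restore into a working system.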
* If you back up the whole file system while the database is running, you may end up with an inconsistent (maybe unrecoverable?) backup. You need to use a file system with snapshots like ZFS or BTRFS, or use PostgreSQL's own tools (see the sketch below).
* Similarly, if the underlying file system experiences corruption, it will also seem to keep working for a while, and after that you may be left with a bad database and a bad backup. You have to be able to restore multiple points in time, or otherwise verify that your backup is good.
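For the first point, either route is only a couple of commands. A sketch assuming PostgreSQL, with made-up dataset and database names:

    # option 1: crash-consistent filesystem snapshot of the data directory
    # (recovery replays the WAL on startup, same as after a power cut)
    zfs snapshot tank/pgdata@nightly
    # option 2: logical dump using PostgreSQL's own tooling
    pg_dump --format=custom --file=/backups/mydb.dump mydb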
This is true for any sort of mounted database. Always dump the database separately from any FS backups.
Another good idea (but not a replacement for backups) is to replicate the DB to a read-only slave which you can either make the master or recreate your master from.
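In PostgreSQL terms, a standby like that is cheap to stand up. A sketch (hostname, replication user, and paths are invented):

    # clone the primary and configure streaming replication in one step
    pg_basebackup -h primary.example.com -U replicator \
        -D /var/lib/postgresql/data -R
    # if the primary is lost, promote the standby to be the new master
    pg_ctl promote -D /var/lib/postgresql/data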
This gives you everything you need to recover to any point between the start and stop of the backup operation, and it's guaranteed to be consistent assuming the data itself was copied successfully. This is because any inconsistent files modified during the copy will be recovered from the archived WAL files to attain a consistent state.
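That's the base-backup-plus-WAL-archiving approach; the moving parts are roughly (the archive destination is made up):

    # postgresql.conf: ship each completed WAL segment off the box
    archive_mode = on
    archive_command = 'rsync -a %p backuphost:/wal-archive/%f'

    # plus a periodic base backup to anchor the WAL stream
    pg_basebackup -D /backups/base -Ft -z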
Sure, that's better than nothing. People should start the process of making backups. But when the backups are important (as we're discussing here), you need to verify that a backup can be restored - otherwise you're ignoring a huge risk.
Practicing restores is mandatory to confirm integrity. Any other form of backup is ceremony.
This all seems like an inside job, and thus even offline backups are at risk if a disgruntled employee wanted to go after them.
I'd balance this (and arethuza's reply) with the basic fact that we have to weigh all backup advice against the practicalities of a given individual's or organization's situation too. A backup is also no good if it's not done consistently, or if someone is told "if you don't do X you might as well not even bother" and gets discouraged because X requires more capital expenditure or expertise than they actually have. Any decent backup is in fact better than nothing and can help with at least some failure scenarios. More expansive backups can deal with a wider range of threats, but tend to be more expensive too. I don't want anyone to ever be discouraged from getting the ball rolling. Even just using a simple direct-to-cloud solution or a plug-in USB drive is objectively better than nothing.
There is also room for hybrids, and for considering where, exactly, the offline bit should come in. It doesn't necessarily have to be the physical storage. There are cloud storage providers that offer WORM capability, or that can separate delete/overwrite permissions from append. So it's possible to have a system that backs up to that independent service through a restricted-permission account, with all management done from a minimal dedicated set of notebooks that are generally kept locked in safes. Another dedicated, isolated system could be responsible for pulling a sample of backup files down and doing some simple watchdogging (even just a tripwire or entropy check). The backup is automatic and offsite, but it'll still have some resistance to remote attacks on the main site by virtue of system independence, even if it's not tapes in a vault. And it's more practical for a small or less technically sophisticated org.
I don't deny the value of tapes in a vault or similar; I just recognize the capex and the difficulty many places have in keeping such a process sufficiently consistent. Resisting the finger-wagging urge is the main thing, at least without first considering the cost/value equation for the specific situation.
Edit: to be clear, in this specific case there is definitely a reasonable argument that after decades, and with a reasonable number of users, they should have built up to something heavier duty. It apparently did have paying users, after all. Even if it took a few years to save up for a better backup system, that should have been done. But then again, even just using Amazon Glacier and creating some custom IAM policies for API restriction, along with Vault Locks, may have been enough to prevent this particular attack.
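For the IAM-restriction piece, the idea is just a write-only policy on the backup principal, something like this (bucket name is made up; pair it with bucket versioning so an overwrite only adds a new version rather than destroying data):

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "BackupWriterCanOnlyAddObjects",
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::example-backups/*"
      }]
    }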
> What is your backup strategy / data retention policy?
> VFEmail feels it's important to provide a long-term, stable, environment for our users. In that effort, we perform nightly backups to an offsite host from all on-site and off-site mail storage locations. This backup runs at 12am CST (-0600) and contains all user data.
> 3rd party storage of user data is generally not wanted by privacy-conscious users. If you fall into that category, you will want to use POP3 and download your mail daily. Our backup is on a daily/weekly rotation, initiated by a snapshot. If you do receive mail between your last POP and the snapshot at 12am, it will exist on backup for a week - unless it's on Saturday night, then it's a year. You should set your POP program to download every 5-10 minutes in order to avoid having your mail caught on backup.
You can harden a backup host against most every conceivable method of destroying the backup. You can lock down network access to prevent any connection except the one to the file-retrieving service itself, and prevent operations such as deletes or overwrites. You can use an append-only filesystem. You can use a SAN or NAS to give the backup host access to one particular mount to dump files on, and a different host can then move files off that mount onto a different mount the backup host has no access to. Or you could even have "rotating backup hosts" that each go effectively offline for a week at a time, giving you several weeks' worth of offline backups in case one backup server got compromised.
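The "can dump files but can't delete or read back" part can be done with a forced SSH command. A sketch using rrsync, which ships with rsync (the write-only -wo flag exists in newer rrsync builds; the path varies by distro, and the key and directory here are placeholders):

    # ~backup/.ssh/authorized_keys on the backup host
    command="/usr/bin/rrsync -wo /srv/backups/incoming",restrict ssh-ed25519 AAAA... backup@web01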
But probably you should just do offsite backups.
Offsite: protects against physical damage to a single site.
Offline: protects against network-based attacks.
You should have offline backups and offsite backups, which can be the same or separate. E.g. back up to disk/tape and send it somewhere for storage, or back up to disk/tape kept offline locally, combined with online backups to a server offsite.
Ok, so that is pretty harsh, but it's part of any disaster recovery scenario where the datacenter explodes (or burns to the ground, or gets flooded, etc). And while LTO tapes are slow and nobody likes maintaining those cranky tape library machines, they will save your bacon when the unthinkable happens. Of course, if it is a really nefarious kind of thing, you might find that the storage unit where the tapes were stored is also torched. There's no good financial reason to protect against that; it's an unlikely scenario.
That said, when you "catch someone reformatting drives" you turn off the switches that give them access to your machines. Or at least you start logging into machines and halting them until you can be physically present to preserve them.
I'm not 100% certain offline backups would have prevented going out of business either. A mass exodus of customers from the extensive downtime, plus the inevitable loss of recent data, is catastrophic for a mature business that's in maintenance mode more than growth mode.
At one of my early sysadmin jobs decades ago, a few weeks into the position an ex-employee broke into the network and wiped the partition tables and whatever else was in the first megabyte of the disks - usually some of the root fs, on all the ~1000 shared hosting servers.
That was a very long week of headaches, and the business never fully recovered. There were offline tape backups, but they were of filesystems, not disk images, so we didn't have a backup of the MBR/partition tables. There also wasn't a high-throughput restore mechanism. It was a minimal setup: a single enterprise-class tape drive with a small robot changer holding 10 tapes at a time. Sufficient for asynchronous backups and selective restores over NFS, but not at all sufficient for restoring the entire datacenter from tape in a timely fashion.
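Cheap insurance against exactly that gap, for anyone reading along: dump the partition table and boot sector alongside the file-level backups (device name is illustrative):

    sfdisk --dump /dev/sda > /backups/sda.partitions   # restore with: sfdisk /dev/sda < sda.partitions
    dd if=/dev/sda of=/backups/sda-mbr.bin bs=512 count=1   # MBR, incl. boot code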
Worst case, you rent a remote host as a temporary runtime environment and, if needed, contact your DNS or other hosting provider and your CA face to face to reset credentials or sign new keys. Given airplanes, this is relatively fast, but tedious.
It is quite cheap... As long as you have the data backed up offline. They didn't.
The most promising option I've come across so far is Borg running in append-only mode with a client-push type model.
I imagine it wouldn't help in this case, if the attacker has the creds and access to run `dd' on the backup machine directly.
Anyone had any good/bad experiences with Borg append-only (or have other suggestions?)
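The usual setup is to pin the client's SSH key to an append-only borg serve on the backup host, so even a fully compromised client can only add data (repo path and key are placeholders):

    # ~/.ssh/authorized_keys on the backup server
    command="borg serve --append-only --restrict-to-repository /srv/borg/web01",restrict ssh-ed25519 AAAA... borg@web01

Prunes then have to run from a separate trusted machine. And as noted above, none of this helps if the attacker can get a shell on the backup box itself.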
Biggest concern I have with that is that to get around file permissions, the backup user needs to be effectively root.
It's much more of a problem if the backup server gets compromised, because now the attacker also has root-level creds to every target machine.
I'm not sure if something like AppArmor or SELinux could allow for some sort of 'read-only root' type user, and whether that would actually be safe in these circumstances.
There are a number of ways to do this, including a backup client that only sends data but doesn't write, a read-only bind remount, or an LVM read-only snapshot.
The last one is best, I think. Make the new LV device readable to the backup user and have it mount it, then copy the data.
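Roughly like this (VG/LV names and the destination are made up; for ext4 you may need ro,noload to skip journal replay on the snapshot):

    lvcreate --snapshot --size 5G --name backupsnap /dev/vg0/root
    mount -o ro /dev/vg0/backupsnap /mnt/backupsnap
    # the unprivileged backup user only ever sees the frozen, read-only copy
    sudo -u backup rsync -a /mnt/backupsnap/ backuphost:/srv/backups/web01/
    umount /mnt/backupsnap
    lvremove -f /dev/vg0/backupsnap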
On the upside, the tooling around infrastructure has improved so much since vfemail launched in 2001 that there's a lot more I can do. With AWS, I can do automatic database backups, S3 bucket delete versioning, IAM auditing, whitelist firewalls, etc.
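E.g. versioning is a single API call (bucket name made up), and you can layer MFA-delete on top of it so a stolen access key can't quietly purge the history:

    aws s3api put-bucket-versioning --bucket example-mail-backups \
        --versioning-configuration Status=Enabled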
Can you remotely delete all the data and backups? (It's an honest question; I'm not intimately familiar with AWS, but from glancing at the documentation it seems that even Glacier archives can be deleted without delay.)
Because if you can, then so can an attacker who has sufficiently compromised your business.
edit: I see some stuff about "vault locking" which might do the trick. Are you using that to protect your data?
And then keeping root user credentials on a cold storage laptop.
Vault lock seems to be for Glacier, but it'd still be worth looking into for cold storage backups. More layers of defense are always good.
"S3 Object Lock can be configured in one of two modes. When deployed in Governance mode, AWS accounts with specific IAM permissions are able to remove object locks from objects. If you require stronger immutability to comply with regulations, you can use Compliance Mode. In Compliance Mode, the protection cannot be removed by any user, including the root account."
No idea what's considered significant.