“Catastrophic” hack wipes out email provider’s entire infrastructure (arstechnica.com)
116 points by ToFab123 8 days ago | 71 comments

I will continue to stand on my soapbox and proclaim that backup has to include an offline component (preferably one that's verified independently). An attacker has to be willing to bring "kinetic" means to bear if you have offline backup media.

Also, a backup that you don't fully restore somewhere to check it actually fully works isn't really much of a backup at all.

Ha ha yes!

One of my old employers dutifully did tape backups every day, and one of the bosses took them home.

During an upgrade it was revealed that the drive wasn't working, and hadn't been working for at least a year (I think that's how far back the backups went).

I feel like these are self-contradictory unless you're really going to go through the effort of periodically restoring each of your offline tapes to a server (which I'm sure some organizations end up doing, to be fair, but that doesn't make it any less of a pain in the ass).

I also hear people say that all the time. Usually they're managing measly 1-100 GB databases, not multiple databases over 1 TB in size. Restoring them on a regular basis is easy in that case.

But I'd love to hear how people practice restoring all their backups on a regular basis when it requires having an enormous amount of space. Practically, I just restore a few backups every month, and if it works, I just assume the others work as well (which I know is a false assumption, but what are you going to do?)
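One low-effort improvement on purely random sampling is to track when each backup was last test-restored and always pick the stalest ones, so every backup eventually gets its turn. A minimal sketch (the state-file name and sample size are made up for illustration):

```python
import json
import random
from pathlib import Path

STATE_FILE = Path("restore_test_state.json")  # hypothetical bookkeeping file

def pick_backups_to_test(all_backups, sample_size=3):
    """Prefer the backups that have gone longest without a test restore,
    so every backup eventually gets checked instead of relying on luck."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    # Sort by last-tested timestamp (never-tested first); break ties randomly.
    ranked = sorted(all_backups, key=lambda b: (state.get(b, 0), random.random()))
    return ranked[:sample_size]

def record_tested(backups, when):
    """Mark a set of backups as test-restored at the given timestamp."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    for b in backups:
        state[b] = when
    STATE_FILE.write_text(json.dumps(state))
```

With 30 databases and 3 restores a month, this at least guarantees each one gets exercised roughly every ten months instead of never.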

My DR procedure was to automatically restore the mail server, DB server and file server once a month with automated fail back to the live servers if they failed.

These would happen outside heavy operation hours, and if successful, the restored servers would become the live servers in most instances.

These were done using scripted vmware instances, mostly restoring from StorageCraft Shadow Protect backups.

File server was the most resource intensive and the most prone to complaints about downtime, but usually the trade off was fine.

DB server was usually fine due to the out of hours restore, and the ability to master - slave - replicate - handover.

Mail server was usually transparent, as you could set up SMTP transfer rules for incoming, and outgoing didn't really care.

The trickier ones were things like IBM iSeries servers, which usually require periodic full system saves, restores, and tests. These couldn't be done automatically without excessive licensing fees and a lot of manual babysitting.

If you get to the point where you have automatic scheduled DR, you can be pretty confident your backups are working.

If the automated DR fails, you roll back, learn whatever lesson the failure taught you, and change the system so it doesn't happen again.

Summary: schedule DR, not backups.

I'm confused. Why does the number of restores affect how much space you need?

You're right, we could restore each, one by one.

However, we have around 30 different databases, each of which takes more than a day to restore. We back up the databases every X days, upload them to S3, and delete the previous ones, keeping only the last 3 or so.

So if it takes us 30 days to restore all the databases, those backups might already be gone from S3 as part of the cleanup operation anyway.

It just gets complicated and non-trivial when you have multiple databases, each multiple TB in size.

If you already have the DBs in S3, it seems obvious that you could bring up EC2 instance(s) with gigantic EBS drives attached, restore a DB to it, and run verification measures on it. Parallelize as your budget allows.

EC2 is not an airgapped machine. That's known as cloud backup, not offline backup.

The solution is to have a tape or disk library and have the other machine take over for restore. If you have TBs of data you can afford a library and another PC to run the recovery test.

> EC2 is not an airgapped machine. That's known as cloud backup not offline backup.

Nobody said it was. This subthread is about the importance of testing backups in general.

> If you have TBs of data you can afford a library and another PC to run the recovery test.

They already explained how that's not enough. Yes, they can add more, but you're missing some info there.

There has to be something better than tape, I know I'm just a single data point, but I've had _terrible_ luck with tapes over the years.

I've tried various options over the years but I haven't found a simple OTS non-tape offline backup solution. There's a lot of potential here for a solution.

I only mentioned tapes specifically because they're probably the most common choice for offline backups (I say "probably" because I don't know the relative numbers between tapes and, say, optical media).

You're definitely right, though; there's gotta be a better way. SSDs look promising, though not necessarily for long-term storage (for that, I suspect it'd be hard to beat optical, aside perhaps from printing out a bunch of QR codes or something).

SSDs are considered a fail for backup, since they have to be powered on to do maintenance in order to survive years of storage. However, since you're scheduling recovery tests anyway, you might as well run a full offline SMART test on the drive then to be sure. More often than once a year is good.

"Offline" doesn't necessarily mean "unpowered", though. An airgapped system that iterates through a carousel of SSDs and runs drive/filesystem checks would be pretty neat.

Since you're talking about tapes, I'm guessing there's more data than can be put on an external hard drive?

LTO-8 tapes can store 12TB (uncompressed) for about $300/ea, which is about 30% less than a disk, with supposedly a 30-year life. The drives start at around $5000.

30-year life when the tapes are properly stored in certain environmental conditions.

Usually those conditions aren't met properly.

You need to test your backups. It's a pain, but it'll save you when you need it.

Regular tests are fine. Restoring every single backup is unwarranted paranoia. It doesn't prove that your system is failproof, because nothing is.

It may seem like a pain in the ass - but wait until you try restoring a system from a backup tape and the first tape doesn't work, neither does the second, or the third .....

Learned that lesson early in my career - the fourth and final tape did restore.

NB The backups weren't my responsibility, but restoring the payroll database my shell scripting error had nuked most definitely was.

That's the thing about ass-pains, though: humans are prone to simply not do them. That's why the trick to getting humans to Do The Right Thing™ is to make it as easy as possible to do so.

You try to restore the backup to an airgapped machine separately set out for that. This tests the worst case: server burns down, new machine is in.

People don't want backups. People want restores.

As long as you know the 1s and 0s were properly written to a disk that's now sitting in a storage locker somewhere that's infinitely better than nothing. If shit really hits the fan you can figure out how to restore from it but the data actually needs to be there in the first place. Sure, practice restoring is ideal but anything is better than nothing. It's the difference between not being able to do anything for a week and probably having to close your business entirely.

Edit: Why is this getting downvoted?

I'm not saying this is a "real" backup solution. Call it an archive if you want. It's still better than not having any copy of your data in offline storage off site. Defense in depth, people; there's no reason you can't both have a real backup system and shove old disks in a storage tote.

You are getting downvoted because there is more to it than just "practice".

If you haven't restored from a backup before, you don't know if it's possible to restore from that backup.

> As long as you know the 1s and 0s were properly written ...

How do you know? Did it actually write properly? If so, was the data you chose to write the data you actually need to fully restore?

You might be able to do certain tests or checks to determine this to a high degree of probability, but there's only one test that will tell you with certainty, and that's actually performing a restore successfully.
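As a sketch of what one of those weaker-but-cheap checks can look like: a checksum manifest written at backup time will catch bit rot and truncated copies, though it says nothing about whether the dump actually restores. File names here are illustrative:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large dumps don't blow out memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(backup_dir: Path) -> None:
    """At backup time: record a digest for every file in the set."""
    lines = [
        f"{sha256_of(p)}  {p.name}"
        for p in sorted(backup_dir.iterdir())
        if p.is_file() and p.name != "MANIFEST"
    ]
    (backup_dir / "MANIFEST").write_text("\n".join(lines))

def verify_manifest(backup_dir: Path) -> bool:
    """Later: re-hash and compare. Catches corruption, not restorability."""
    for line in (backup_dir / "MANIFEST").read_text().splitlines():
        digest, name = line.split("  ", 1)
        if sha256_of(backup_dir / name) != digest:
            return False
    return True
```

Running the verify step from a separate machine also tells you the copy you'd actually restore from is intact, not just the one on the backup host.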

My experience may shed some light on this: I use pg_dump to make backups of an important PostgreSQL server, and I thought I was done. Then the server died. I then used the standard import... but it failed. This was a while ago, but pg_dump didn't preserve/escape some of the weird quoting in my strings, or (most likely) I didn't understand how to use pg_dump properly. I had to manually figure this out and fix the strings in the file before it would import. If I had tested the full restore, I would not have had to panic. Thankfully pg_dump didn't just throw up its hands and refuse to emit the strings at all - that would have been a decent and reasonably sane design choice as well. I got lucky.

You'd have to test every backup though, because initially, your DB might not contain any strings that trigger the problem.

It’s good to test PostgreSQL backups (or backups of basically anything that is constantly doing reads and writes)... some pitfalls from my small experience with an RPi and a cheap SD card:

* If you backup the whole file system while the database is running, you may end up with an inconsistent (maybe unrecoverable?) backup. You need to use a file system with snapshots like ZFS or BTRFS, or use PostgreSQL’s own tools.

* Similarly, if the underlying file system experiences corruption, it will also seem to keep working for a while before failing, and after that you may be left with a bad database and a bad backup. You have to be able to restore multiple points in time, or otherwise ensure your backup is good.

> * If you backup the whole file system while the database is running, you may end up with an inconsistent (maybe unrecoverable?) backup. You need to use a file system with snapshots like ZFS or BTRFS, or use PostgreSQL’s own tools.

This is true for any sort of mounted database. Always dump the database separately from any FS backups.

Another good idea (but not a replacement for backups) is to replicate the DB to a read-only slave which you can either make the master or recreate your master from.

You don't need a SQL dump to backup a PostgreSQL database. It has a mechanism (pg_start_backup) where it will only write WAL (journal) files and will not remove them while you use your filesystem backup (eg. rsync) as usual until you signal completion (pg_stop_backup), after which you also need to back up your WAL archive.

This gives you everything you need to recover to any point between the start and stop of the backup operation, and it's guaranteed to be consistent assuming the data itself was copied successfully. This is because any inconsistent files modified during the copy will be recovered from the archived WAL files to attain a consistent state.

How did you know whether the pg_dump finished running? In my experience, there doesn't seem to be any way to find out if it finished or exited prematurely. The only hack I came up with was to write the output to a log and then grep the log for the last table alphabetically; it being in the log usually meant all the tables were backed up.

pg_dump provides return codes just like any other executable: zero means no error, and any non-zero value should be treated as an error.
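For example, a small wrapper that fails loudly on any non-zero exit is enough; the sketch below uses a stand-in command where a real script would invoke pg_dump:

```python
import subprocess
import sys

def run_dump(cmd, out_path):
    """Run a dump command, write its stdout to a file, and raise on any
    non-zero exit code instead of grepping logs for the last table."""
    with open(out_path, "wb") as out:
        result = subprocess.run(cmd, stdout=out)
    if result.returncode != 0:
        raise RuntimeError(f"dump failed with exit code {result.returncode}")

# Real usage would look something like (connection details illustrative):
#   run_dump(["pg_dump", "-Fc", "mydb"], "mydb.dump")
```

One classic trap: in a shell pipeline like `pg_dump mydb | gzip > out.gz`, the pipeline's exit status is gzip's, not pg_dump's, unless you set `set -o pipefail` first.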

IIRC mysqldump had a similar issue back in the day, before the --quote-names option became enabled by default.

I can't downvote so I can't say why. My first impression though is that your post doesn't offer much to the conversation.

Sure, that's better than nothing. People should start the process of making backups. But when the backups are important (as we're discussing here), you need to verify that a backup can be restored - otherwise you're ignoring a huge risk.

I guess you are getting downvoted because it would be really, really hard to know if all of the right 1s and 0s you need are there without restoring.

> Sure, practice restoring is ideal but anything is better than nothing.

Practicing restores is mandatory to confirm integrity. Any other form of backup is ceremony.

How do we know that this isn't the case here and said hacker went after those records as well?

This all seems like an inside job and thus even offline backups are at risk if a disgruntled employee wanted to go after it.

>that backup has to include an offline component

I'd balance this (and arethuza's reply) with the basic fact that all backup advice has to be weighed against the practicalities of a given individual's or organization's situation. A backup is also no good if it's not used consistently, or if someone is told "if you don't do X you might as well not bother" and gets discouraged because X requires more capital expenditure or expertise than they're actually capable of. Any decent backup is in fact better than nothing and can help with at least some failure scenarios. More expansive backups can deal with a wider range of threats, but tend to be more expensive too. I don't want anyone to ever be discouraged from getting the ball rolling. Even just using a simple direct-to-cloud solution or a plug-in USB drive is objectively better than nothing.

There is also room for hybrids, and for considering where, exactly, the offline bit should come in. It doesn't necessarily have to be the physical storage. There are cloud storage providers that offer WORM capability, or that can separate delete/overwrite permissions from append. So it's possible to have a system that backs up to that independent service through a restricted-permission account, with all management done from a minimal dedicated set of notebooks that are generally kept locked in safes. Another dedicated, isolated system could be responsible for pulling a sample of backup files down and doing some simple watchdogging (even just a tripwire or entropy check). The backup is automatic and offsite, but it still has some additional resistance to remote attacks on the main site by virtue of system independence, even if it's not tapes in a vault. And it's more practical for a small or less technically competent org.

I don't deny the value of tapes in a vault or similar; I just recognize the capex, and the difficulty many places have in keeping it sufficiently consistent. The main thing is to resist the finger-wagging urge, at least without first considering the cost/value equation for the specific situation.

Edit: to be clear, in this specific case there is definitely a reasonable argument that after decades and with a reasonable number of users they should have built up to something heavier duty. It did have paying users apparently after all. Even if it took a few years to save up for a better backup system that should have been done. But then again even just using Amazon Glacier and creating some custom IAM policies for API restriction along with Vault locks may have been enough to prevent this particular attack.
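As a sketch of the IAM-restriction idea: a policy attached to the backup-writing credentials could allow appends while explicitly denying destructive calls. The bucket name is illustrative, and this is only one layer - bucket versioning, MFA delete, or Object Lock would still be needed to stop overwrites of existing keys:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppendOnlyBackups",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::example-backup-bucket/*"
    },
    {
      "Sid": "DenyDestructiveOps",
      "Effect": "Deny",
      "Action": [
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:PutLifecycleConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::example-backup-bucket",
        "arn:aws:s3:::example-backup-bucket/*"
      ]
    }
  ]
}
```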

From their FAQ page[0]:

> What is your backup strategy / data retention policy?

> VFEmail feels it's important to provide a long-term, stable, environment for our users. In that effort, we perform nightly backups to an offsite host from all on-site and off-site mail storage locations. This backup runs at 12am CST (-0600) and contains all user data.

> 3rd party storage of user data is generally not wanted by privacy-conscious users. If you fall into that category, you will want to use POP3 and download your mail daily. Our backup is on a daily/weekly rotation, initiated by a snapshot. If you do recieve mail between your last POP and the snapshot at 12am, it will exist on backup for a week - unless it's on Saturday night, then it's a year. You should set your POP program to download every 5-10 minutes in order to avoid having your mail caught on backup.

[0] https://www.vfemail.net/faq.php

Good chance the SSH private key for that offsite host was on one of the servers the attacker compromised.

You should have an offsite backup. If for some reason you don't, you have options to secure a backup host.

You can harden a backup host to prevent almost every conceivable method of destroying the backup. You can lock down network access to prevent any connection but the one to the file-retrieving service itself, and prevent operations such as deletes or overwrites. You can use an append-only filesystem. You can use a SAN or NAS to give the backup host access to one particular mount to dump files, and a different host can then move files off that mount onto another mount the backup host has no access to. Or you could even have "rotating backup hosts" that go effectively offline for a week at a time each, giving you several weeks' worth of offline backups in case one backup server gets compromised.

But probably you should just do offsite backups.

A nit, but it is important to differentiate between offsite and offline backups. VFEmail did have offsite backups, but they were online, and the attacker reformatted those disks as well.

offsite: prevents against physical damage to a single site.

offline: prevents against network-based attacks.

You should have offline backups and offsite backups, which can be the same, or separate. Eg backup to disk/tape, and send it somewhere for storage, or backup to disk/tape kept offline locally, combined with online backups to a server offsite.

Good point! I used to assume offsite implied offline, since I'm used to offsite tape backups.

"A tale of woe and destruction where our hero learns the value of offline tape backups..."

Ok, so that is pretty harsh, but it's part of any disaster recovery scenario where the datacenter explodes (or burns to the ground, or gets flooded, etc). And while LTO tapes are slow and nobody likes maintaining those cranky tape library machines, they will save your bacon when the unthinkable happens. Of course, if it is a really nefarious kind of thing, you might find that the storage unit where the tapes were kept is also torched. There's no good financial reason to protect against that; it's an unlikely scenario.

That said, when you "catch someone reformatting drives" you turn off the switches that give them access to your machines. Or at least you start logging into machines and halting them until you can be physically present to preserve them.

Turning off a switch isn't going to kill any existing sessions or dd commands that were in progress - at least not until the SSH session times out and hangs up the commands it was running.

I know some people might only ever think about "switch" as meaning a piece of network hardware, but there are other kinds. Remote power off is a thing, and damn well will kill any existing sessions etc.

Exactly. When you've got active destruction going on, power down everything first.

Just speculating here, but perhaps there was a party very interested in destroying data on their infrastructure. Maybe to erase evidence of wrongdoing, or something else entirely.

That appears to be the case. The post says that hacking into all of these systems required access to multiple passwords, and because of the speed with which the systems were destroyed, I would guess the attacker had them all in hand to begin with.

that's where my thoughts went as well... Seems it could have been a revenge attack or coverup for something/someone else.

There's always a risk of disgruntled employees doing things like this.

I'm not 100% certain offline backups would prevent going out of business either. A mass exodus of customers from the extensive downtime, plus the inevitable loss of recent data, is catastrophic for a mature business in more maintenance than growth mode.

At one of my early sysadmin jobs decades ago, a few weeks into the position, an ex-employee broke into the network and wiped the partition tables and whatever else was in the first megabyte of the disks - usually some of the root fs - on all ~1000 shared hosting servers.

That was a very long week of headaches, and the business never fully recovered. There were offline tape backups, but they were of filesystems not disk images so we didn't have a backup of the MBR/partition tables. There also wasn't a high throughput restore mechanism. It was a minimal single enterprise-class tape drive with a small robot changer holding 10 tapes at a time. Sufficient for asynchronous backups and selective restores over NFS but not at all sufficient for restoring the entire datacenter from tapes in a timely fashion.

Restoring even a huge offline backup should take no longer than 24 to 48 hours... The only situation that can take longer is a police seizure, since they will seize the offline backups too.

Worst case, you buy a remote host as a temporary runtime and, if needed, contact your DNS or other host and your CA face to face to reset or sign new keys. Given airplanes, this is relatively fast, but tedious.

It is quite cheap... As long as you have the data backed up offline. They didn't.

The service owner is sloppy enough with his online presence that his personal website advertised in the Twitter profile is pointing to an SEO landing page. Would make me think twice before entrusting any of my data to his company.

I've occasionally wondered what the best way to secure your backup host from the individual clients, and possibly clients from a compromised backup host would be.

The most promising option I've come across so far is Borg running in append-only mode[1] with a client-push type model.

I imagine it wouldn't help in this case, if the attacker has the creds and access to run `dd' on the backup machine directly.

Anyone had any good/bad experiences with Borg append-only (or have other suggestions?)

[1] https://borgbackup.readthedocs.io/en/stable/usage/notes.html...

I personally like the model of burp[1]. The clients can be configured to not have delete or even restore access to their backups and the backup server is responsible for rotating backups.

[1] https://burp.grke.org/

This was the main reason why I built BorgBase[1] to host backups: You can set individual SSH keys to append-only. Then this setting can be protected by 2FA. This should give good protection against such cases. Many other providers still allow SFTP access, which makes append-only mode useless.

1: https://www.borgbase.com/

What about the backup server being a client? That should do it.

as in, the backup host connects to the target machines and pulls them back to itself?

Biggest concern I have with that is that to get around file permissions, the backup user needs to be effectively root.

It's much more of a problem if the backup server gets compromised, because now they also have root-level creds to every target machine.

I'm not sure if something like AppArmor or SELinux could allow for some sort of "read-only root" type user, or whether that would actually be safe in the circumstances.

It only needs root read access, not write.

There are a number of ways to do this, including a backup client that only sends data but doesn't write, a read only bind remount, or an lvm read only snapshot.

The last one is best I think. Make the new lv device readable to the backup user and have it mount it, then copy the data.

Wow, as someone working on an email service provider startup… this is basically my worst nightmare.

On the upside, the tooling around infrastructure has improved so much since vfemail launched in 2001 that there's a lot more I can do. With AWS, I can do automatic database backups, S3 bucket delete versioning, IAM auditing, whitelist firewalls, etc.

> With AWS, I can do...

Can you remotely delete all the data and backups? (It's an honest question; I'm not intimately familiar with AWS but from glancing at the documentation it seems that even glacier archives can be deleted without delay)

Because if you can, then so can an attacker who has sufficiently compromised your business.

edit: I see some stuff about "vault locking" which might do the trick. Are you using that to protect your data?

Yep, AWS doesn't seem to have any time-lock delete mechanisms as far as I know, which is a shame. I still have to research this, but as far as I can tell the best practice seems to be using MFA delete:


And then keeping root user credentials on a cold storage laptop.

Vault lock seems to be for Glacier, but it'd still be worth looking into for cold storage backups. More layers of defense are always good.

You can also use S3 legal hold in compliance mode, although, I'm not sure how you would eventually delete it.

"S3 Object Lock can be configured in one of two modes. When deployed in Governance mode, AWS accounts with specific IAM permissions are able to remove object locks from objects. If you require stronger immutability to comply with regulations, you can use Compliance Mode. In Compliance Mode, the protection cannot be removed by any user, including the root account."


I've run efficient email hosting before. They didn't have backups; they had snapshots. Anything that isn't physically offline with a vaulted, off-site provider isn't a backup.

So many duplicated posts about this..

I haven't seen one that got significant attention on HN yet, which is the standard we apply for dupes (https://news.ycombinator.com/newsfaq.html). Did we miss it?

I've seen at least 4 or 5 posts about the VFEMail thing just today so I dinged this one when I saw it.

No idea what's considered significant.

Significant is just a question of whether it got many points and/or comments.

If you are a target, you will be hacked.

...and sooner or later you will be a target.
