
How to do cheap backups - suhail
http://code.mixpanel.com/2012/02/21/how-to-do-cheap-backups/
======
vlucas
The quoted numbers per GB are honest calculations, but they can be a little
misleading because they don't reflect the REAL costs you are going to end up
paying PER MONTH and UP FRONT. As with anything, you always have to run the
calculations first.

An illustrative example: My co-founders and I recently looked at a bunch of
new office spaces and were doing similar comparisons with costs per square
foot for each space. We ended up in a situation where option A (a much larger
space) looked MUCH cheaper at $17/sq.ft, but we ended up going with option B
(a small space) at nearly $40/sq.ft. because we just didn't need all the space
in option A, and option B was in an office facility where we wouldn't have to
buy any additional furniture or appliances like couches, chairs, a coffee
maker, fridge, etc. (they were supplied to all tenants in a large common area
as part of the cost). So the REAL cost difference to us PER MONTH was about
$400 LESS with option B (the smaller more "expensive" space) and came with a
lot of extra conveniences to boot.

So in the example given in the article, if you're not using the full 45TB of
backup storage space (24 x 2TB in RAID-6), you could actually end up paying
significantly more per GB for what you have stored (what you're actually
using) - ESPECIALLY when you include the UP FRONT costs of buying and co-
locating the server and the maintenance costs that go along with it.

Moral of the story: Just because something LOOKS more expensive per unit
doesn't mean it's actually GOING TO BE when it comes to cashflow. ALWAYS do
the math for your own situation before making decisions like these.

~~~
lusr
Not sure how honest these calculations are because I couldn't repeat them and
the article gives no REAL detail.

Cheapest matching configuration I could find at SoftLayer was $559 for a dual
processor Xeon 5504 with 12GB RAM ("Speciality: Mass Storage"). Each 2TB drive
costs an additional $60. I'm assuming the RAID controllers come free if you
can plug in that many drives so not adding that onto the build. Total for
build with 24 drives: $1999. (Assuming a free OS, and that you don't need to
upgrade the network port speed.)

24x2TB in RAID-6 gives you 45TB usable capacity (one array across all 24
drives, madness) or 20TB usable capacity (two mirrored 12-drive RAID-6
arrays, probably more sane).

45TB @ $1999/mo is ~$0.045/GB, but if more than 2 disks fail you're screwed
(quite likely to happen in a 24 disk configuration!)

20TB @ $1999/mo is ~$0.1/GB, and you can stand to lose at most 4 disks (at
most 2 in each of the mirrored arrays).

Compare with the $0.11/GB Amazon S3 is charging you to store data in the TB
range (excluding data transfer costs, which are free for incoming data, which
ought to be the bulk of data) and I'm not really sure I follow their argument
given that maintaining this backup system is going to be a PITA and the risks
don't seem worth it.
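For what it's worth, the per-GB arithmetic above does check out from the stated numbers; a quick sketch (1TB treated as 1000GB, matching the rounding in the figures):

```shell
#!/bin/sh
# Recompute the quoted $/GB/month rates from $1999/mo and the two
# usable-capacity figures (45TB and 20TB), with 1TB taken as 1000GB.
RATE45=$(LC_ALL=C awk 'BEGIN { printf "%.3f", 1999 / 45000 }')
RATE20=$(LC_ALL=C awk 'BEGIN { printf "%.3f", 1999 / 20000 }')
echo "45TB usable: \$$RATE45/GB/mo"   # matches the ~$0.045 above
echo "20TB usable: \$$RATE20/GB/mo"   # matches the ~$0.1 above
```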

------
sdfjkl
You need to hire a sysadmin (or operations engineer or whatever they're called
these days). You're talking about developers spending time on implementing
backup, which is wrong. That's not what developers should be doing. Sysadmins
may occasionally write software to solve a problem, but they're not software
developers. Software developers may occasionally do sysadmin type work, but in
my experience most of them are notoriously bad at it (setting up mod_python
because they have a blog post with step-by-step instructions from 2004
bookmarked where that was the best way to do stuff - or leaving gigantic
virtualenv turds with complete python binaries in your SCM because they don't
understand what virtualenv is for).

Sysadmins also have toolkits (you know, screwdrivers, torx bits, zip-ties) and
have scars on their arms that prove they aren't afraid of sharp-edged hardware
stuff that sometimes starts smoking for no discernible reason and doesn't turn
on any blinkenlights when you push the button (many software developers panic
at this point). This comes in very handy when you just "left the cloud" and
you're experiencing first-hand the reasons why people moved into the cloud in
the first place (hardware sucks).

Sysadmins also know about this backup stuff and will tell you to shut up when
you start talking about doing it with cobbled together shell scripts. They'll
probably recommend using something like Amanda (or a commercial equivalent),
that makes sure your backups happen regularly, are complete and actually
contain the stuff you needed to backup. Good ones may even know to test the
backup occasionally by restoring a server just to see if it actually works
afterwards.

(Apologies to any software developers who know their sysadmin stuff.)

------
dexen
Cheap backups? _Use de-duplication._

I have a cronjob with daily dumps of several MySQL databases, in the usual
textual format. One day's worth of dumps is about 470MB now. Two years ago we
started at about 20MB/day, and it has been growing ever since.

Each dump is committed into one common Git repo. After two years (that's just
over 700 dumps), this whole Git repo is about 180MB.

Yep, much less than one daily dump. Git performs thorough de-duplication and
delta-compression.

Cheap backups? _Use de-duplication._
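A toy version of this setup (a sketch, not dexen's actual script; the `seq' line stands in for the real mysqldump call):

```shell
#!/bin/sh
# Commit near-identical daily "dumps" into one git repo and let git's
# delta compression de-duplicate across days.
set -e
REPO=$(mktemp -d)
export GIT_AUTHOR_NAME=backup GIT_AUTHOR_EMAIL=backup@example.com
export GIT_COMMITTER_NAME=backup GIT_COMMITTER_EMAIL=backup@example.com
git init -q "$REPO"
for day in 1 2 3; do
    seq 1 100000 > "$REPO/db.sql"        # stand-in for: mysqldump ... > db.sql
    echo "-- new rows for day $day" >> "$REPO/db.sql"
    git -C "$REPO" add db.sql
    git -C "$REPO" commit -qm "dump day $day"
done
git -C "$REPO" gc -q                     # repack with delta compression
du -sh "$REPO/.git"                      # typically well under one raw dump's size
```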

~~~
fullmoon
Very cool. What options do you pass to `mysqldump' for best diffability?

~~~
dexen
Straight from the dump scripts:

    
    
      --complete-insert --hex-blob --skip-add-drop-table --single-transaction
      --order-by-primary --skip-dump-date
      --force            # so mysqldump does not bail out on an invalid view
      --no-create-info   # do not emit CREATE TABLE: its AUTO_INCREMENT changes often and would create unnecessary differences
    
    

where the `order-by-primary' is probably the most important, and `skip-dump-
date' sure helps.

Also, I make a big deal out of spreading every large table into a set of
smaller dumps, each with a fixed number of rows, sorted by record ID. For
various reasons, most of our tables are usually appended to, and changes
(UPDATE, DELETE) are less common. Thanks to this, changes are usually confined
to the last file of a set (the one with the newest records), and the other
files stay mostly unchanged -- and so they pack the best.

    
    
      --where="_rowid >= $FROM AND _rowid < $TO"
    

I try to keep individual files down to about 8...16MB, 32MB max, so git's
repack (upon pull/push/automatic gc) doesn't take too much time or RAM.

~~~
rmc
How do you split the files? Is that part of mysqldump (if so, how), or is it a
handrolled thing?

~~~
dexen
For now, handrolled. The idea is to be able to do either `cat *.sql' or just
`cat LAST-PART.sql'. I run mysqldump once per large table with a
_\--where="_rowid >= $FROM AND _rowid < $TO"_ argument, calling it in a loop
with consecutive _$FROM_ and _$TO_ values. It works, it gets the job done, but
it's not transaction safe.

That `_rowid' is a reserved symbol in MySQL. Refers to table's PRIMARY KEY
(but only if it's single INT). In the usual case, the script doesn't have to
know table's PRIMARY KEY.
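The loop might look roughly like this, as a dry run (a sketch; the database, table and step size are invented, and the commands are printed rather than executed):

```shell
#!/bin/sh
# Dry-run sketch of the handrolled splitter: one mysqldump invocation per
# fixed-size _rowid range, collected into $OUT instead of being run.
set -e
DB=mydb TABLE=events STEP=100000
MAX=250000                       # in real use: SELECT MAX(id) FROM the table
OUT=$(mktemp)
FROM=0 PART=0
while [ "$FROM" -le "$MAX" ]; do
    TO=$((FROM + STEP))
    printf 'mysqldump %s %s --where="_rowid >= %s AND _rowid < %s" > %s-part%03d.sql\n' \
        "$DB" "$TABLE" "$FROM" "$TO" "$TABLE" "$PART" >> "$OUT"
    FROM=$TO
    PART=$((PART + 1))
done
cat "$OUT"                       # three chunk commands covering rows 0..300000
```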

Another way would be to use a `rolling checksum' to split the files; the
concept is described in <http://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/>.
But you could end up with dump files split in the middle of an SQL statement,
which is not very cool.

------
brucehart
I looked into backup options for my company and ended up rolling our own
solution using an open-source program called Duplicity:
<http://duplicity.nongnu.org/> . I've been really impressed with Duplicity.
Incremental backups are fast, the data is encrypted and you can target many
different source/destination types (local file system, ssh, Amazon S3, ftp).
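For the curious, typical Duplicity invocations look roughly like this (a sketch; the bucket name, host and paths are invented, and URL schemes vary between versions):

```shell
# Encrypted incremental backup to S3, with a full backup forced weekly:
duplicity --full-if-older-than 7D /var/lib/app s3+http://mybucket/app

# The same data to a different target just by changing the URL:
duplicity /var/lib/app sftp://backup@host//backups/app

# Verify against the source, or restore elsewhere:
duplicity verify s3+http://mybucket/app /var/lib/app
duplicity restore s3+http://mybucket/app /tmp/app-restored
```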

~~~
rickmb
The main reason for me to choose Duplicity is the multiple-protocol support.
I've had so many issues in the past where switching to other storage
facilities meant completely changing the backup process. Plus I'm using
multiple storage sites for really important data.

Duplicity can be a bit finicky though, especially when it comes to cleaning up
after itself, even more so if the backup was somehow interrupted.

~~~
ars
Duplicity is also unable to incrementally remove old backups.

A backup set starts with a full backup, then has incrementals after that. You
can remove the incrementals, but if you remove the (old) full backup, you have
to do a full backup again.

A typical strategy is incrementals every day, and a full once a week.
However this is really hard on home connections - it's just too much data to
do a full backup every week.

One of its pros is that everything is completely encrypted from when it leaves
your machine - so you can backup to a completely untrusted repository without
worry.

However, in order to do that it locally stores block-checksums of every file
it backs up. (So it can detect differences.) These files can get large.

Another option from the same source is rdiff-backup. This has none of those
limitations - it uses a CVS style reverse diff (so old incrementals can simply
be deleted, and the most current is simply stored as a file, which makes
restores very easy). However it's not encrypted from the source - so you have
to trust the repository at least enough to create a local encrypted volume on
it.

------
ars
Security reminder:

The live server should not have write access to the backup machine.

Instead the backup machine should have read access to the live server.

This prevents disaster in case of hacks.
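One common way to wire up the pull-only arrangement (a sketch, not from the thread; the rrsync helper ships with rsync, but its install path and all names here vary by system):

```shell
# On the LIVE server: restrict the backup machine's SSH key to a
# read-only rsync rooted at the data directory. One line in
# ~backup/.ssh/authorized_keys:
#
#   command="/usr/share/rsync/scripts/rrsync -ro /var/lib/app",no-pty,no-port-forwarding ssh-rsa AAAA... backup@backuphost
#
# On the BACKUP machine, a cron job pulls. The live server holds no
# credentials for the backup machine at all, so a compromised live
# server cannot touch existing backups:
rsync -az backup@live.example.com:/ /backups/app/
```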

~~~
cperciva
That's one option, but I prefer doing it the opposite way around: Have the
live server push data to the backup server via an append-only interface. This
is much simpler in terms of access control if you want to back up some of the
live server's data but not all of it.

~~~
redslazer
I think you missed his point. He meant it's safer to give the live server no
access, because if it gets hacked or infected it cannot impact the backup
server.

~~~
cperciva
I think you missed my point. I was suggesting that the live server should
access the backup server _via an append-only interface_ , i.e., one which
doesn't allow it to delete backups or modify them.

~~~
snowwrestler
Now the security of your backups is completely dependent on the construction
of the append-only interface. Are you 100% certain it can't be compromised or
permission-escalated?

~~~
cperciva
It's much easier to build an append-only interface than a read-only-and-only-
read-some-files-not-others interface.

------
gleb
Tarsnap pricing is for deduplicated data. You need to divide it by orders of
magnitude for a proper comparison.

Apples and oranges - what you are doing is on site backup, Tarsnap is offsite.
Both are needed.

~~~
jerf
Dedupe isn't magic. If your data isn't duplicated on the block level, it isn't
going to do anything for you. It does wonders on backing up 10,000 windows
machines, it doesn't do anything at all for a server's data store.

~~~
jacques_chester
I disagree. The deduping enables low-cost snapshot semantics which has
drastically simplified my life.

Every 24 hours, I dump a database and let tarsnap work out what to actually
send to S3. And it does such a good job, for such a low price, that it boggles
the mind.

~~~
jerf
Fair enough, but I'm reacting to the unfounded claim that you can just divide
it by "orders of magnitude" when the original post does not contain enough
information to support that, and indeed contains enough information to suggest
it's probably not true. If you have 100GB of server-type service data, you're
not looking at "orders [plural] of magnitude" less space taken up on the
backup service. The huge multipliers that dedupe is sometimes credited with
apply to certain datasets; you do not get "orders of magnitude" shrinking in
the general case.

~~~
gleb
The claim is based on personal experience.

With Mixpanel I'd expect even better results since their data is append-only
by nature. I.e. think about deduplication when backing up a large append-only
log file on a daily basis.

Tarsnap has weaknesses but cost of storage for Mixpanel type of workload is
not one of them.

------
PanMan
Anybody else disappointed that an article titled "How to do cheap backups"
didn't at ALL describe how they, well, actually do the backups? I was
expecting some smart copy algorithm, not a post about the price of hardware.
Also, they compare their hardware at full capacity against the scaling pricing
models of AWS and others. Their first GB will cost a lot more than the
price/GB stated here.

------
ck2
It takes about 25 lines of bash to do rotating, encrypted S3 backups with
Timkay's awesome aws script.

I actually had to throttle our servers when sending to amazon as they seem to
be able to receive at impossible maximum speeds and eat the whole pipe!
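The rough shape of such a script, sketched below as a dry run (this is NOT ck2's actual code; `s3put' stands in for Timkay's aws script or any other S3 uploader, and the paths and key are invented):

```shell
#!/bin/sh
# Rotating, encrypted S3 backup skeleton. Rotation by weekday gives
# seven slots that overwrite themselves each week.
KEEP=7
SLOT=$(( $(date +%u) % KEEP ))   # 0..6, one slot per weekday
CMD="tar czf - /var/lib/app | gpg -e -r backup@example.com | s3put s3://mybucket/backup-$SLOT.tar.gz.gpg"
echo "$CMD"                      # dry run: print the pipeline instead of running it
```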

~~~
teach
Is this the one? <http://timkay.com/aws/>

I somehow hadn't seen it before; thanks!

------
underwater
I hope the backup machine is connecting to the live machine and they're sealed
off from each other. I've heard of cases where hackers have managed to get
into a machine and then access backup machines to completely wipe all copies
of a database.

~~~
GoodIntentions
I think there's an argument for periodic off-line, "hold it in your hands"
backups.

Good luck wiping that...

------
rmc
If all your servers are on Amazon EC2, and your backups are in S3, then you
have all your eggs in one basket. One billing dispute with Amazon and your
servers _and_ backups are gone.

Backups are there to protect you in case the worst happens.

~~~
lusr
Make a second account?

~~~
rmc
Does the Amazon EC2 terms allow that? If not, you might give Amazon (or a
zealous front line fraud watcher) an excuse to lock out both your accounts,
before you get into any dispute.

~~~
cperciva
Yes, you can have multiple AWS accounts, as long as you're not doing so for
bad reasons. According to the AWS Service Terms, _"You may not access or use
the Services in a way intended to avoid any additional terms, restrictions, or
limitations (e.g., establishing multiple AWS accounts in order to receive
additional benefits under a Special Pricing Program)"_ but that's really just
a "don't try to cheat" clause.

Prior to the creation of IAM, the standard way of creating restricted-
privilege access keys was to have multiple accounts; I have at least 5 AWS
accounts (I say "at least" because it's possible I've forgotten some which I'm
not using any more...), lots of people at Amazon know about this, and nobody
has ever suggested that there is anything wrong with it.

On the other hand, if Amazon decided to close your account, they would
probably look to see if there were any other accounts owned by the same
person. On the gripping hand, I've _never_ heard about anyone _ever_ having a
billing dispute with Amazon Web Services, which at their scale tells me that
they're very reasonable people and not prone to Paypalesque random account
closing.

------
sehugg
I use a nice OS X app called Arq which does encrypted backups to my own S3
bucket. Since I only have a few GB of stuff that warrants offsite backup (git
repos, etc.), the ~$0.25/month storage fee is well worth the convenience.

~~~
BadassFractal
Any good Windows/Linux alternatives?

~~~
brucehart
For Windows, Duplicati is a good option: <http://code.google.com/p/duplicati/>

For Linux, try Duplicity (which I mentioned in another comment):
<http://duplicity.nongnu.org/>

Both are open source and have similar features to Arq (including Amazon S3
support).

------
nico_h
It's nice, but they still need someone to keep an eye on the backup machine.

Also, S3 is expensive because it keeps many copies of your data (though they
appear as one) and checks them for corruption, so it should be more reliable
than a single backup machine.

------
wazoox
One important feature they didn't mention: if you have a decently powerful
backup server hosting all of your data, it may be relatively easy in case of
an emergency to use it to serve production data directly. For instance, you
could start a mysql instance with the backup data directly from the backup
server, if your production server (or datacenter) is fried.

There is no easy way to achieve that from S3 or tarsnap.

~~~
cperciva
Mirroring is not backup. If you keep your "backup" server in sync with your
live server so that you can promote it to production quickly, it's also going
to be in sync with any loss which occurs through sysadmin error or deliberate
attack.

Ideally you should have both a mirror on standby and a backup server with
multiple earlier copies of your data.

~~~
sbov
What they said doesn't necessarily have anything to do with mirroring.

When we take a backup of MySQL to our backup server, we also create a my.cnf
file configured to use said backup. As in, we have multiple config files, each
created at time of backup, specifically referencing that individual backup.
Each can be easily used to start MySQL up using that X day old backup. Then I
can shut it down, and start it up again using a completely different backup
just by specifying that backup's my.cnf file, all without modifying either
backup.

This is also nice because it makes it simple to make sure your backup actually
works.
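A sketch of how such per-backup config files might be generated (paths and port are invented, not sbov's actual setup):

```shell
#!/bin/sh
# Each dated backup directory carries its own my.cnf pointing mysqld at
# that copy, so any snapshot can be booted on its own without edits.
set -e
STAMP=$(date +%Y-%m-%d)
DIR=$(mktemp -d)/backup-$STAMP   # real version: /backups/mysql/backup-$STAMP
mkdir -p "$DIR/data"
# ... the actual backup (raw datadir copy or restored dump) goes in $DIR/data ...
cat > "$DIR/my.cnf" <<EOF
[mysqld]
datadir = $DIR/data
socket  = $DIR/mysqld.sock
port    = 3307
EOF
echo "start it with: mysqld --defaults-file=$DIR/my.cnf"
```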

------
suhail
Small plug but we're always looking for awesome people to join our ops team!
Link: [http://mixpanel.theresumator.com/apply/Xm0tLy/Software-
Engin...](http://mixpanel.theresumator.com/apply/Xm0tLy/Software-Engineer-
Operations.html)

------
latch
And SoftLayer is by no means the cheapest provider of dedicated hosting in
the US (though there are few comparable providers, in terms of price and
quality, with multiple regions).

