Backup your Database in Git (viget.com)
50 points by tortilla on May 8, 2009 | 18 comments

I find myself saying this a lot lately:

Just because technology A can do action B doesn't mean it's the most appropriate path to take.

I know what you mean. Yeah, I could do this, or I could just rely on automated iSCSI snapshots to seamlessly and automatically back up the entire machine whenever I want.

Are those reliable for a database, though?

I haven't done enough at such a low level to really understand how that would work. I mean, the database application layer typically promises to be atomic, sure. But at any given point in time are the bits on disk truly atomic? Does a database not sometimes take 2+ writes to fully change data? And does an iSCSI backup really copy the entire disk at a single point in time? I thought that was magic specific to something like ZFS. Because if it doesn't, couldn't you have a database backup that isn't really usable?

I guess what I'm trying to say is that application level DB dumps exist for a reason - and I'm no expert so I'd love someone to explain how full disk backups work re: databases.

DB dumps exist for a reason

Indeed they do, and that is the "faultless" way of creating a backup. Depending on your RDBMS (and luck, perhaps), snapshot backups can be fine or they can be unusable. In my experience, MySQL databases backed up via snapshots almost always end up corrupted. On the other hand, I've had good success with PostgreSQL. Sure, occasionally the internal indices will be out of sync and will need to be recreated, but it has never been a big deal.

Still, I wasn't decrying DB-level backups since, obviously, they're far better (and sometimes necessary). But you can still set up a cronjob to write the backup and let the snapshots take care of the rest. My point is, you already get heavy-duty compression, incremental algorithms, and restoration, not to mention the ability to restore the complete system state in the event of a catastrophe.

In general, you are supposed to lock the database for a couple seconds, take the snapshot, and then unlock it. This prevents that kind of data corruption.
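One subtlety with the lock-snapshot-unlock sequence: MySQL releases FLUSH TABLES WITH READ LOCK the moment the client session disconnects, so the snapshot has to be taken while that session is still open. A minimal sketch using the mysql client's `system` command to stay connected — the LVM volume name and credentials are assumptions:

```shell
# Hold the read lock in a single mysql session and take the snapshot
# from inside it via the client's `system` command. Running three
# separate `mysql -e` invocations would release the lock too early.
# The volume (/dev/vg0/mysql) and credentials are hypothetical.
mysql -u root -p <<'SQL'
FLUSH TABLES WITH READ LOCK;
system lvcreate --snapshot --size 1G --name mysql-snap /dev/vg0/mysql
UNLOCK TABLES;
SQL
```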

Before I could finish saying "Just because you can doesn't ..." I read your comment. Voted up.

Too bad git is not good for large files - http://kerneltrap.org/mailarchive/git/2006/2/8/200591

Well, the extent to which this matters depends on how often you run it. Even if hashing a 10 GB database dump takes ten minutes,* it will still get backed up ten minutes after the hour, every hour. There are very likely better solutions, but it would be easy to set up, and better than nothing.

(Besides, any solution that has to deal with a complete db dump is going to have to hash, diff, or otherwise process a large volume of data.)

* As a single data point, sha1 just ran for me at approx. 800 MB/min.
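That data point is easy to reproduce on your own hardware; a rough throughput check (numbers will vary by machine and sha1 implementation):

```shell
# Hash 100 MB of zeros and time it. SHA-1 speed barely depends on
# the input content, so real dump data hashes at a similar rate.
dd if=/dev/zero bs=1M count=100 2>/dev/null | time sha1sum
```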

A better fix is to mysqldump every table individually, and then git-commit those as distinct .sql files. This way, less-often-updated data (reference tables, etc.) tends not to need updating, and your hash generation is typically smaller. Plus, you can skip log/session-type tables if you're so inclined. :)

That is exactly what one client of mine does for Postgres.

Looks like the same emails I was pulling up after switching from SVN and discovering my gigabyte-plus census files were exhausting Git's memory. Sadly, in the game industry many of us are stuck with Perforce just because it handles gigantic files (like a DVD image), has a visual client and is marginally better than Microsoft's offering.

This git backup approach is crap. Pure crap.

Sorry to lower the tone on HN, and I'll get to the _reasons_ soon, but my brain overloaded at how over-engineered this problem is.


Deterministic SQL Output ::

For the delta set in git not to jump around, the ordering of the statements MUST BE THE SAME every time. git is designed for source code, not for machine-generated dumps whose row order can shift between runs.

Trivial Amounts of Space Saving ::

Just creating a well-known bz2 of the file is going to give you excellent compression, and then you can save it to S3, Dropbox, or your home computer.
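The compress-and-copy alternative described here is only a couple of lines; the output path is an assumption:

```shell
# Timestamped, compressed full dump -- trivially copied to S3,
# Dropbox, or another machine. Paths and credentials are hypothetical.
STAMP=$(date +%Y%m%d-%H%M)
mysqldump --all-databases | bzip2 -9 > "/var/backups/db-$STAMP.sql.bz2"
```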


An ALTER TABLE or any sweeping UPDATE is going to yield a line-by-line delta, effectively checking in the whole DB/table again. Given that this is app-specific, why not just compress the whole dump?

Backup Complexity ::

You've written a cron script and validated that the whole DB is being dumped; HDD space is cheap, so why involve another step that can go wrong?

Duplicating a tar.bz2 file is a lot easier to validate than a git commit. Much, much faster too.

MySQL usually does not reorder the data on disk, so in the OP's case this might actually work for him. I do agree, though, that just moving around bz2 files is much easier. And it scales better (as in: continues to work) once you begin accumulating REAL data.

A common size for a complete dump of one of the platforms I'm working with is 400 MB bzip2-compressed. You certainly would not want that amount of data uncompressed in a git repository.

Should you decide to compress before gitting, you of course lose all the advantages of using git for this in the first place.

This feels like a non-solution to me.

All well and good, except that git doesn't explicitly store delta sets.

Which doesn't mean storing SQL dumps in git is or isn't a good idea, but the git repository itself probably won't bloat up too much just from moving stuff around in a file, especially if the initial commits were small.
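This is observable: git writes each commit's blobs as whole zlib-compressed loose objects, and only delta-compresses similar blobs when it repacks. A quick way to watch it happen, assuming git is installed (repo name and file contents are arbitrary):

```shell
# Commit two near-identical versions of a file, then repack and
# compare object storage before and after.
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name demo
seq 1 10000 > dump.sql
git add dump.sql && git commit -qm "first dump"
seq 1 10001 > dump.sql
git add dump.sql && git commit -qm "second dump"
git count-objects -v    # loose objects, no pack yet
git gc -q               # repack: similar blobs become small deltas
git count-objects -v    # objects now live in a pack file
```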

Backing up MySQL's binary log is probably a better solution than doing periodic mysqldumps. Also the ordering of the inserts in the dump isn't likely to be deterministic, screwing up some of the benefit of this method.

Yep. I've tried tracking mysql dumps, and lots of little things can cause spurious diffs.

I wrote a short Ruby script that does a similar operation with Subversion - it's somewhat Rails specific in that it takes the username/password from RoR config files, but that can be easily modified.

The code is (ironically) here: http://github.com/niryariv/dbbackup/tree/master and some background here: http://niryariv.wordpress.com/2009/03/23/backing-up-mysql-da...

Happy to see someone confirm this can be done. It also made me think of http://www.liquibase.org/
