Just because technology A can do action B does not mean it's the most appropriate path to take.
I haven't done enough at such a low level to really understand how that would work. I mean, the database application layer typically promises to be atomic, sure. But at any given point in time are the bits on disk truly atomic? Does a database not sometimes take 2+ writes to fully change data? And does an iSCSI backup really copy the entire disk at a single point in time? I thought that was magic specific to something like ZFS. Because if it doesn't, couldn't you have a database backup that isn't really usable?
I guess what I'm trying to say is that application-level DB dumps exist for a reason - and I'm no expert, so I'd love someone to explain how full-disk backups work with regard to databases.
Indeed they do, and that is the "faultless" way of creating a backup. Depending on your RDBMS (and perhaps luck), the backups can be fine or they can be unusable. In my experience, MySQL databases backed up this way almost always end up corrupted. On the other hand, I've had good success with PostgreSQL. Sure, occasionally the internal indexes will be out of sync and need to be rebuilt, but it has never been a big deal.
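For what it's worth, rebuilding them has only ever been a one-liner for me - roughly along these lines, assuming the indexes just need recreating and with "mydb" as a placeholder database name:

    # Rough sketch: rebuild every index in the database after restoring
    # a filesystem-level copy. "mydb" is a placeholder name.
    psql -d mydb -c "REINDEX DATABASE mydb;"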
Even so, I wasn't decrying DB-level backups since, obviously, they're far better (and sometimes necessary). But you can still run a cronjob to write the backup and let the snapshots take care of the rest. My point is, with snapshots you already get heavy-duty compression, incremental storage, and restoration, not to mention the ability to restore the complete system state in the event of a catastrophe.
(Besides, any solution that has to deal with a complete db dump is going to have to hash, diff, or otherwise process a large volume of data.)
* As a single data point, sha1 just ran for me at approx. 800 MB/min.
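To make the cron-plus-snapshots idea concrete, here's a rough sketch of the sort of script I mean (mysqldump plus ZFS snapshots purely as an example; the paths and dataset name are placeholders, adjust for your own setup):

    #!/bin/sh
    # Nightly logical dump; the snapshot layer below handles versioning,
    # compression and incremental storage. Run from cron, e.g.:
    #   0 3 * * * /usr/local/bin/db-backup.sh
    DUMP_DIR=/var/backups/mysql
    DATASET=tank/backups              # ZFS dataset containing $DUMP_DIR

    # Overwrite the same dump file each night.
    mysqldump --all-databases --single-transaction > "$DUMP_DIR/all.sql"

    # Snapshot the dataset; only the blocks that changed are stored.
    zfs snapshot "$DATASET@$(date +%Y-%m-%d)"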
Sorry to lower the tone on HN, and I'll get to the _reasons_ soon, but my brain overloaded at how over-engineered this solution is.
Deterministic SQL Output ::
For the delta set in git not to jump around, the ordering of the statements MUST BE THE SAME every time. git is designed for source code, not for dump files whose line ordering can shift between runs.
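If you insist on going this route, mysqldump at least has flags that make successive dumps line-stable (assuming MySQL; "mydb" is a placeholder):

    # One INSERT per row, rows ordered by primary key, no embedded
    # timestamp -- keeps successive dumps comparable line by line.
    mysqldump --skip-extended-insert --order-by-primary --skip-dump-date \
        mydb > mydb.sql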
Trivial Amounts of Space Saving ::
Just creating a plain bz2 of the file is going to give you excellent compression, and then you can save it to S3, Dropbox, or your home computer.
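i.e. something as simple as this (s3cmd is just one example of an off-site copy; the bucket name is a placeholder):

    # Dump, compress, and ship one dated file off-site.
    FILE="backup-$(date +%F).sql.bz2"
    mysqldump --all-databases | bzip2 > "$FILE"
    s3cmd put "$FILE" s3://my-backup-bucket/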
ALTER TABLE ::
ALTER TABLE, or any UPDATE that touches every row, is going to yield a line-by-line delta... effectively checking in the whole DB/table again.
Given that this is app-specific, WHY not just compress the whole dump?
Backup Complexity ::
You've written a cron script and validated that the whole DB is being dumped; HDD space is cheap, so why involve another step that can go wrong?
Duplicating a tar.bz2 file is a lot easier to validate than a git commit. Much, much faster too.
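By "validate" I mean nothing fancier than this kind of thing (filenames are placeholders):

    sha1sum backup.tar.bz2 > backup.tar.bz2.sha1   # record a checksum with the original
    sha1sum -c backup.tar.bz2.sha1                 # verify the copy against it
    bzip2 -t backup.tar.bz2                        # test archive integrity without extracting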
A common size for a complete dump of one of the platforms I'm working with is 400 MB bzip2 compressed. You certainly would not want to have that amount of data uncompressed in a git repository.
Should you decide to compress before committing, you of course lose all the advantages of using git for this in the first place.
This feels like a non-solution to me
Which doesn't mean storing SQL dumps in git is or isn't a good idea, but the git repository itself probably won't bloat up too much just from moving stuff around in a file, especially if the initial commits were small.
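If you want to check that on a real repository, git will tell you how well the objects are packing (rough sketch):

    # Repack and then inspect the pack size; git stores deltas, so
    # shuffling lines within a file shouldn't grow it by the full file size.
    git gc --aggressive
    git count-objects -v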
The code is (ironically) here: http://github.com/niryariv/dbbackup/tree/master and some background here: http://niryariv.wordpress.com/2009/03/23/backing-up-mysql-da...