Interested to hear more about your "system image". How do you do it and what do you store in it? Documents / preferences / Library etc? How would you use it in a recovery situation?
Restoration is downloading the vault to a new SSD and shoving it in a computer. The "root" user is included in the backup as an encrypted volume. My setup's a little strange in that the OS disks on my computers are FileVaulted, and each user's homedir is also in an encrypted volume only they have the password to, mounted to the correct /Users/ folder via fstab.
(It'd probably be helpful to note that this is done on a hackintosh with 10 internal bays, one of which holds this SSD, with a clone of it kept elsewhere.)
There is one question I've had about it. I keep everything in Glacier (about 700GB, I think). My understanding is that nothing is deleted from Glacier. If I deleted a whole backup set and then re-added it, would it upload the content again?
Also, I'm keeping an eye on http://www.filosync.com and I'm planning on trying it out soon. We currently run AeroFS, which is mostly really good - but sometimes we get into a situation of our own making (we run it on an AWS spot instance that can vanish) and sorting it out is hard work (and time consuming). The most frustrating aspect is that we don't know how it works, so we don't know what to expect when we perform certain operations. Sometimes it doesn't work how we expect and that bites us (I'm currently waiting for 12GB to sync - even though all the machines already have all the content).
Given my (really positive) experiences with Arq I'd like to try a product that was a bit more open about how it worked.
My main system: Hackintosh -- Core i5 / 32GB RAM / GTX 680, 2 x 250GB SSD, 2 x 60GB SSD, 6 x 3TB 7.2K drives. (The handy thing here is my hackintosh build runs the vanilla kernel and I patch all my kexts at runtime, so all the binaries are stock -- thus the backups are portable.) This is my workstation and it's primarily a single-user system, but there is the occasional remote user login. I use this same setup on all my macs though, and this is how it works:
1 60GB SSD for the OS, applications, and caches
X number of SSDs (usually 256GB), one for each user
alias maroch='diskutil cs unlockVolume <logicalvolumeid>'  ## mount aroch's home directory
/Volumes/aroch /Users/aroch hfs rw,bind 0 0
The 60GB backup SSD (in slot #4): an install of (now) OS X Mavericks that includes my standard set of homebrew installs, Xcode, and the ~20 applications I use daily, plus Arq. It also has a backup of my fstab and the root account DMG. There's a daily cron that copies my personal dotfiles to it as well. There's a weekly cron that reboots the workstation into the backup SSD, and a startup script on the backup SSD that checks for system/MAS updates, applies them, runs an Arq backup (usually takes ~10 mins), and then reboots back into the main OS SSD. The files total about 17GB right now but the delta is usually less than 100MB.
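The daily dotfile cron is nothing fancy - roughly this (the mount point and file list here are illustrative, not my exact setup):

0 3 * * * rsync -a ~/.bash_profile ~/.gitconfig ~/.vimrc /Volumes/BackupSSD/dotfiles/   # copy dotfiles to the backup SSD every night at 03:00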
If needed, the OS and settings can be restored with Arq by using the backup SSD's vault
Features I'm interested in:
- must use S3 or similar, as long as it's private and "unlimited"
- must support multiple clients, even though I don't need real time sync
- should encrypt on the client without needing separate software
I've been using Dropbox + CloudFogger so far, but Dropbox doesn't scale anymore with the quantity of data I have, mostly due to the number of files (and I don't like depending on two different pieces of software for one task).
It's not for "I may need this file next week" (that's what your local NAS / Time Machine is for). Glacier is for "what happens if this place burns down?" At that point, a week and £120 is not high on the list of your concerns...
I'll reply back from my personal email account.
You can cut down on requests by moving to daily backups (what I do) and narrowing down what you're backing up. Backing up your applications folders will take a ton of requests (some apps have upwards of 1000 assets whose bless values may change if you open the application, and this triggers Arq).
So for example, if you retrieve 100 GB in one hour, that's $720. If you retrieve 10 TB at the rate of 100 GB/hr for 100 hours, that's also $720.
Translated into bandwidth (assuming you're pulling at this rate for at least an hour), it's about $3 per Mbps. So if you limit your retrieval to 10 Mbps, you'll pay $30, regardless of how much data you retrieve. This makes it fairly easy to cap your expenditure, if you have a client that can stagger its requests appropriately (you need to throttle the Glacier retrieval requests, possibly using Glacier's range requests if you have very large files, not throttle at the network level).
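To spell out the arithmetic behind those numbers (this assumes the old peak-hourly-rate formula - peak GB retrieved in any one hour x 720 hours in the month x $0.01/GB - and ignores the free allowance):

  100 GB/hr peak:          100  x 720 x $0.01 = $720
  1 Mbps  (~0.45 GB/hr):   0.45 x 720 x $0.01 ≈ $3.24   (the "$3 per Mbps")
  10 Mbps (~4.5 GB/hr):    4.5  x 720 x $0.01 ≈ $32     (roughly the "$30" above)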
Storage:  $0.010 per GB-month    x 2,112.146 GB-Mo  = $21.12
Requests: $0.050 per 1,000 reqs  x 40,447 requests  = $2.02
Glacier takes hours just to get the listing of files, or start the file download.
Their pricing is very different. Crashplan charges per computer. Glacier charges per GB of storage and per GB of transfer out.
Think an archive of family videos and photos spanning 30 years, or all the tax documents for a large enterprise.
For my personal home machines, I was recently researching backup systems. I decided against cloud backups and opted for the simple solution of a USB 3.0 drive. It's cheap, and if I ever need to get my backup quickly, I just pick it up and take it. For critical documents, I can't imagine most people have more than a few megabytes anyway. That stuff can be archived/encrypted periodically and sent to cloud email (Gmail, Yahoo) much more cheaply and easily than bothering with S3.
For the actual backups, I'm using Linux so I went with rsync. See here for roughly how it's done: http://www.mikerubel.org/computers/rsync_snapshots/
The USB drive is encrypted; I then tell rsync to use hard links and it does the incremental backup. The nice thing about hard links is you can get at any file in any backup while preserving space. Files in new backups that exist in prior backups are simply hard links. Then you just "rm" older backups when no longer needed. It's an incredibly elegant solution on Linux, especially if you combine it with LVM snapshots.
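If you don't want to follow that article exactly, the core of it is only a few commands; a minimal sketch (the paths and retention count are made up, and I'm using rsync's --link-dest rather than the article's cp -al variant):

rm -rf /mnt/backup/daily.7                                                        # drop the oldest snapshot
for i in 6 5 4 3 2 1 0; do [ -d /mnt/backup/daily.$i ] && mv /mnt/backup/daily.$i /mnt/backup/daily.$((i+1)); done   # rotate the rest
rsync -a --delete --link-dest=/mnt/backup/daily.1 /home/ /mnt/backup/daily.0/     # new snapshot; unchanged files become hard links into daily.1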
I also recently did a bunch of research on the best way to back up around 500GB of data (too expensive for my taste to put on something like S3, and I also needed a NAS). The conclusion I walked away with was that at least two drives and ZFS is a must. Silent drive corruption is a real problem. ZFS takes care of that much more intelligently than any hardware or software RAID, and sure as hell better than no redundancy at all.
IMO, the ideal setup for a home backup system is two servers with ECC RAM, with 2-3 drives each running ZFS, connected over at least a gigabit link. One would be the primary NAS server, the other the backup. All important documents - your Documents folder, tax returns, home videos, pictures of kids, etc. - go on the primary server. The backup retains months of copies of the primary server and exposes the files over SMB and AFP as read-only (unless you are all Linux, don't bother with NFS, as nothing else supports it properly, even Macs). The backup server also has raw space for system-level backups, such as Time Machine, etc. - basically, space that's not backed up twice.
My current setup is simpler: just one machine with no ECC (this makes me very nervous), with two drives running ZFS in a mirror setup. Since I am treating this array as reliable, I am using it for both primary storage and backup storage, though I will be changing this at some point soon when I get the time/money to build things out a bit. My biggest worry is physically separating primary and backup servers to different locations, at least within the house.
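For the curious, the mirror itself is only a couple of commands; a rough sketch, with made-up pool/dataset names and device paths (ideally you'd use /dev/disk/by-id):

zpool create tank mirror /dev/sda /dev/sdb    # two-disk mirror; ZFS checksums every block
zfs create tank/documents                     # one dataset per share works well
zpool scrub tank                              # run periodically (cron) to detect silent corruption and repair it from the good copy
zfs snapshot tank/documents@$(date +%F)       # cheap point-in-time snapshots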
As for software, I use netatalk for AFP (Mac file sharing), Samba for SMB (Windows), and rsnapshot for backups. rsnapshot is particularly nice because it's the least pain in the neck to browse backups. It is similar to Time Machine but FOSS.
* Small-ish type stuff on SpiderOak and Dropbox (both on free tiers). I assume both of these are "compromised".
* Data that I want to keep /private/ goes into an encrypted USB flash drive and onto Tarsnap. I doubt this will ever get larger than 10-20MB.
* I have ~5 full HDD images of Linux boxes that I need to keep for archival purposes and storing that on a cloud backup service would be a pain to upload and download. Those are on a Synology 213+ and on a larger capacity USB3 drive.
I'm still evaluating git-annex, but right now I can't seem to find a workflow to my liking. I feel like it /could/ be something I'd use to replace most of the above, but I can't see it yet.
I thought git-annex would be a good solution but I can't seem to get it to stop locking files I'm working on without using its 'unsafe' mode. I can't seem to figure out the workflow it was designed to support.
I was reading about these just last weekend, and was surprised to hear that several of the current big-name-brand USB hard drives seem to come with "clever" software, not necessarily easily removable, that means they don't work like a vanilla disk any more. The result is that operating systems without supplied drivers, such as Linux, won't necessarily be able to access the drives properly.
I have no way to tell how much of this is real and how much is down to misunderstanding, but it was a recurring theme and applied to multiple major vendors, so as appealing as this simple approach is, I don't really feel like trusting my critical back-ups to any of these drives until I've got some facts.
WD drives present a wd_ses device as well as the drive itself, and I think that gives me LED control (great for turning off blinkenlights at night).
My main problem with backup drives at the moment is getting hdparm/power management settings to stick so the drives will spin down after an amount of time. I've had more luck with USB than ESATA in this case.
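For reference, what I'm trying to make stick is basically this (the device name is an example; some USB bridges seem to just ignore these settings, which is the problem):

hdparm -S 120 /dev/sdX    # spin down after 120 x 5 s = 10 minutes idle
hdparm -B 127 /dev/sdX    # APM level that still permits spin-down (128+ disables it on most drives)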
I'd recommend AGAINST any ESATA/USB "docks" - the USB to SATA bridges seem to have a very high failure rate
(https://bugzilla.redhat.com/show_bug.cgi?id=895085 for the first one I saw on Google)
It's a Seagate Backup Plus 3TB. I believe Newegg had it on sale for $100 when I picked it up. Been using it for maybe 3 months now. So far so good.
The drive seems to have no problem spinning down. But I typically unmount and unplug when I'm not using it, just to be on the safe side.
I also looked into NAS, which is what I originally wanted to go with. But my concern was none of the cheap ones seemed capable or fast enough to do encryption.
Of course I've got some setup around creating datestamped snapshots and keeping the last N (http://sprunge.us/OKHb?bash for the bash function that does this), but in essence it's just
$ rsync -a --delete src/ trg/current/                          # refresh the working copy
$ btrfs subvolume snapshot -r trg/current trg/snapshot-$(date +%F)   # read-only, datestamped snapshot
I also forgot to mention that hashdeep/md5deep should probably be integrated into this sort of backup system (http://md5deep.sourceforge.net/). What hashdeep does is generate a list of checksums for a given backup. You would store this list with the backup, and then use hashdeep to "audit" the backup when you restore it, or to check whether hardware is failing.
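A minimal sketch of what that looks like (paths are examples):

hashdeep -r /mnt/backup/daily.0 > /mnt/backup/daily.0.hashes       # record checksums at backup time
hashdeep -r -a -k /mnt/backup/daily.0.hashes /mnt/backup/daily.0   # audit mode: report anything changed or missing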
One is that it could play more nicely within the Python ecosystem; it doesn't install with the usual setup.py mechanism and is also very resistant to any kind of API/in-process usage; its various features are locked up tight within its command-line interface, so writing frontends means you pretty much have to pipe to a process to work with it.
Another is that it needs a feature known as "synthetic backup". If you read up on how people use duplicity, a common theme is the need to "run a full backup" periodically. This is because you don't want months worth of backups as an endless stream of incremental files; it becomes unwieldy and difficult to restore from. In my case, I'm backing up files from my laptop, and a full backup takes hours - I'd rather have some process that can directly squash a series of incremental backup files into a full backup and write them back out to S3. I'm actually doing this now, though in a less direct way; my front-end includes a synthetic backup feature which I run on my Synology box - it restores from the incremental backup to a temp directory, then pushes everything back up as a new, full backup. My laptop doesn't need to remain open; it only needs to push small incremental files, and I get a nice full backup of everything nightly.
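Under the hood that feature is just a wrapper around two stock duplicity operations; a rough sketch of what runs on the Synology box (the bucket URL and paths are placeholders):

duplicity restore s3+http://my-bucket/laptop /tmp/synthetic-restore    # rebuild current state from the full + incremental chain
duplicity full /tmp/synthetic-restore s3+http://my-bucket/laptop       # push it back up as a brand-new full backup
rm -rf /tmp/synthetic-restore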
I would much rather use an incremental backup that records the diffs in reverse, so that the latest state is stored explicitly, and the earlier versions are recovered through a chain of reverse diffs.
Duplicity's security goals make it hard to implement this kind of reverse diff structure.
So if you want to expire old backups after a while you must do a full backup, then delete the entire full+incrementals chain.
rdiff-backup stores diffs in reverse, so you can always easily delete the oldest ones.
Duplicity archives and encrypts all its files, so it really doesn't have much choice in the matter. rdiff-backup doesn't encrypt anything - it can't - it needs the full file on the other end to use for rsync.
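To make the expiry point concrete, the usual duplicity dance looks something like this (the target URL is a placeholder):

duplicity full /home/me s3+http://my-bucket/home                       # start a fresh chain
duplicity remove-all-but-n-full 2 --force s3+http://my-bucket/home     # only now can the older full + its incrementals be deleted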
Price: It costs only $3.96 a month if you pay for two years up front. ($4.17 a month if you pay for only 1 year up front.) They offer unlimited storage of everything on internal and external HDDs. 30 days of file revisions too. I think I have close to 1TB of data backed up with their service. (Using Duplicity w/ S3 would cost me $90 a MONTH in comparison.)
Time: I've always been using Windows File History to backup onto a external USB HDD, but I wanted an off-site backup. I initially started backing up to S3 myself and even dabbled with Glacier. However, maintaining this setup proved to be quite messy. Dealing with encryption and decryption was painful as well. Having a simple foolproof interface was worth it.
Our servers periodically run rdiff-backup of all important data to a /backup partition on the same server. Then this /backup partition -- with the backup version history and metadata -- is rsynced to various dumb backup storage locations.
We've tried many backup solutions, and rdiff-backup is by far the fastest and most robust backup program we know.
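In case it's useful, the core of that scheme is roughly the following (paths and hostname are examples):

rdiff-backup /var/www /backup/www                          # versioned backup; reverse diffs kept under rdiff-backup-data/
rsync -a --delete /backup/ dumbhost:/srv/offsite/backup/   # plain copy of the whole /backup partition to dumb storage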
Rsync requires a program on the other end. True, it's a more commonly installed program than rdiff-backup, but it's not a dumb backup target.
Tarsnap: easier to set up and use
Duplicity: fully open source, and I think the restore speed was significantly faster ... but I've not properly benchmarked it apples-to-apples.
If you are backing up a lot of frequently changing files (eg. database backups) then tarsnap's bandwidth charges (probably caused by the underlying EC2 charges) can swamp the storage cost. Running duplicity against rsync.net avoids this.
You can host your own backups if you want (it supports ftp, sftp, ... check the docs: http://duplicity.nongnu.org/duplicity.1.html), whereas Tarsnap is a hosted solution.
In the end, that client decided that it was easier to use a commercial package, given their typical use case. Tried Carbonite (didn't work, support unhelpful) but settled on JungleDisk. YMMV.
Also it's easy to use: http://www.memset.com/blog/backup-your-windows-server-memsto...
You can use some sort of rsync alternative to get the content from the windows boxes to the linux machine (cwrsync is recommended - though I've never used it).
For a 2GB Linode, we're talking about backing up 96GB of storage. More than that, three backups are made (daily, weekly and 2-weekly). 288GB of active backup storage. Handled for you. For $10 a month.
Amazon is over twice that before you even consider the hassle of setting it up and restoring backups.
I don't trust hosting my server and backups with the same company. I backup to s3 with s3 credentials that forbid removal; files are expired out of the duplicity-related s3 buckets using native amazon functionality, not duplicity. It does mean there are some stale incrementals, but that's a small price to pay.
I currently prefer http://liw.fi/obnam/ which does dedupe and every backup is a snapshot. Only downside is that encrypted backups are broken if you use the latest gpg.
duplicity's simple yet pretty awesome.
But for system backups, you need something simple and easy. Just get arq/etc.
Nothing against forward deltas, but just consider that:
- restoring time increases with increment count
- risk of corruption of a single archive will make all further increments worthless.
These are very important things to consider when doing a backup.
Currently, you have to limit the chain length by performing a full backup every N increments (with N being as short as possible), which defeats the purpose of efficient increments.
I have requested the ability to specify manually the ancestor of the increments (by time or count), so that one could implement a non-linear hierarchy with multiple levels like one normally does with tar/timestamps, but the request was dubbed as "unnecessary given the delta efficiency" (despite the fact that efficiency is just one variable). Having a 3 level backup (daily, weekly, monthly) would make duplicity much more space efficient, reduce the number of full backups needed, and shorten the chains to the level that would make restores of "last year" actually _possible_.
I sent several patches to fix duplicity's behavior with files larger than 1GB (by limiting block counts), which got integrated, but they're still a far cry from making duplicity work decently as a whole-system backup & restore solution. It's just too slow. And like you said, several bugs afflicted duplicity in the past that would make restores fail in many circumstances. I also debugged my share of issues, which led me to think that very few people have actually tried to restore from arbitrary increments with duplicity and/or used it to archive a large system.
Many got fixed, but I won't consider duplicity again until I can control the incremental ancestor and reduce the chain lengths (and it's silly to think that "rdiffdir", distributed with duplicity, would allow for that easily).
Nowadays I use rdiff-backup, and use a second script to tar/encrypt the deltas after the backup has completed.
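The second script is nothing special; roughly this, with made-up paths and symmetric gpg just for illustration:

rdiff-backup /home/me /backup/home                                                               # normal rdiff-backup run
tar -C /backup/home -czf - rdiff-backup-data | gpg -c > /offsite/deltas-$(date +%F).tar.gz.gpg   # tar and encrypt the delta/metadata directory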
I'm keeping an eye on "bup" (https://github.com/bup/bup), but I cannot backup "forever", thus without the ability of purging old versions it's only useful in a limited set of cases.
I wrote ddar before I knew about bup. It doesn't have this limitation; you can arbitrarily remove any archive. However, it does not do encryption, so I wouldn't recommend using it to store on S3 directly.
bup currently has no features that prune away old backups.
Because of the way the packfile system works, backups become "entangled" in weird ways and it's not actually possible to delete one pack (corresponding approximately to one backup) without risking screwing up other backups.