A small sample of the other options considered:
* EncFS + rsnapshot/rdiff-backup
  * Weak cryptography in EncFS
  * Uses forward snapshots, so recovery takes longer with each successive snapshot, unless you upload a "fresh" copy of the data
* ext4 on LVM on LUKS on a loopback device provided by s3backer
  * OK, I'll admit this one is pretty far out there, but it actually works decently well (ignoring the massive FUSE overhead)
  * Downside: a slight hiccup in network connectivity guarantees data corruption
  * Known data corruption issues with large datasets
* RAID1 on all data. This isn't a backup - it's for end-to-end detection of bit flips in (some of) the storage path (and at rest). This needs a better solution (ZFS?)
* zbackup of all selected files to an external hard disk, overnight. This handles the de-duplication.
* duplicity of the zbackup dataset to S3, immediately following. This is usually a small upload - zbackup diffs are tiny and it doesn't touch files it doesn't need to.
  * Rate limited so a full backup takes about a week; incrementals are usually only a few MB.
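A hedged sketch of those two nightly stages. The zbackup and duplicity invocations follow their documented usage, but the repo path, bucket name, and date-stamped backup names are placeholders:

```python
# Sketch of the zbackup -> duplicity two-stage pipeline described above.
# Command shapes come from the zbackup and duplicity docs; paths, the
# bucket name, and the repo layout are hypothetical placeholders.
import datetime
import subprocess

EXTERNAL_REPO = "/mnt/external/zbackup-repo"   # assumption: external disk mount
S3_TARGET = "s3://my-backup-bucket/zbackup"    # assumption: bucket name

def stage_commands(source_dir: str) -> list[list[str]]:
    """Build the two stages: zbackup to the external disk, then
    duplicity of the zbackup repo (not the raw data) up to S3."""
    stamp = datetime.date.today().isoformat()
    return [
        # Stage 1: tar the data and stream it into zbackup's dedup store.
        ["sh", "-c",
         f"tar c {source_dir} | zbackup backup {EXTERNAL_REPO}/backups/{stamp}"],
        # Stage 2: push the (usually tiny) zbackup diffs to S3.
        ["duplicity", EXTERNAL_REPO, S3_TARGET],
    ]

if __name__ == "__main__":
    for cmd in stage_commands("/home/me/data"):
        subprocess.run(cmd, check=True)
```

Because duplicity only uploads files that changed inside the zbackup repo, the second stage stays small for incremental runs.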
It seems to work well so far, and I'm prepared to do a fresh full backup every year, swapping disks periodically. The general idea is to keep data in 3 physical places: local, external and remote. Local can be recovered from external. External can be thrown away and recreated from local. Remote can be recreated the same way.
Wish there was an all-in-one solution with all the checkboxes checked. It feels silly to have to get there with a bunch of scripts.
RAID1 will speed up reading by spreading reads across both disks, but data returned by the array comes from one disk or the other, not both. This means there is no opportunity to detect if a bit flip occurred.
Adding a checksumming filesystem that handles the RAID1 in software would solve this problem. (If it were hardware RAID, as opposed to ZFS's RAID-Z, ZFS could detect a bit flip on the hardware RAID1, but it couldn't do anything about it, since it isn't aware of the two physical disks as a redundant source of data.)
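The difference can be shown with a toy sketch: given two mirror copies plus an independently stored checksum, a filesystem that can see both copies can tell which one is good, whereas plain RAID1 just returns whichever copy it happens to read. (A minimal illustration of the idea, not how ZFS is actually implemented.)

```python
# Toy model of checksum-arbitrated mirroring: with a stored checksum and
# access to BOTH copies, a single at-rest bit flip is detectable and
# attributable to one copy, so it can be repaired from the other.
import hashlib

def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def read_mirrored(copy_a: bytes, copy_b: bytes, expected: str) -> bytes:
    """Return whichever mirror copy matches the stored checksum."""
    if sha(copy_a) == expected:
        return copy_a
    if sha(copy_b) == expected:
        return copy_b  # copy_a was corrupt; a scrub could now rewrite it
    raise IOError("both mirror copies fail the checksum")

good = b"precious data"
stored_checksum = sha(good)
corrupt = bytes([good[0] ^ 0x01]) + good[1:]  # simulate an at-rest bit flip
```

Plain RAID1 has no `expected` value to compare against, which is exactly the gap described above.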
* The intention is to detect at-rest bit flips before they progressively pollute everywhere. I don't mind if it's not instantaneously detected on-access. I perform a nightly full scan - so there's up to 24 hours where an at-rest bit-flip may lie undetected, but it won't progress past that.
* I use ECC at all cache hierarchy levels possible, and ECC DDR with active background scrub, so bit flips here occur only with vanishingly small probability.
* I use software RAID1 only. The path between DDR, through the CPU, and to the storage controller is unlikely to see a bit flip. From there onwards, the data is essentially written twice, so there is a vanishingly low probability of both copies having a bit flip, short of a systematic failure on that bit pattern across two writes to two different disks and two different controllers.
So there's still some places in the stack where errors can be introduced, but the most common areas of fault are either duplicated or covered by detection mechanisms.
My goal is to detect, but not correct, transport and at-rest bit flips. I'll discard everything when an error is detected, and use backups.
For my next setup I'll probably switch to a filesystem with a better end-to-end error detection story, such as ZFS, and ditch RAID altogether.
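A detect-only nightly scan like the one described above can be as simple as re-hashing everything against a saved manifest. Sketch only; a real run would persist the manifest between nights and handle added/removed files:

```python
# Minimal "detect, don't correct" scrub: hash every file and report any
# whose hash disagrees with the manifest recorded when it was known-good.
import hashlib
from pathlib import Path

def hash_file(path: Path) -> str:
    """SHA-256 of a file, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: Path) -> dict[str, str]:
    """Relative path -> hash for every file under root."""
    return {str(p.relative_to(root)): hash_file(p)
            for p in root.rglob("*") if p.is_file()}

def scrub(root: Path, manifest: dict[str, str]) -> list[str]:
    """Names of files whose current hash disagrees with the manifest."""
    return [rel for rel, expected in manifest.items()
            if hash_file(root / rel) != expected]
```

On detection you would restore from backup rather than attempt repair, matching the stated goal.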
Conversely, Btrfs scrubs will report a corrupt file path if there's a bit flip in a single copy. If there's another copy available then there's just a kernel note that there was a data csum error and the problem was fixed, no file name path.
Yes, a checksumming filesystem would be better, but at the time of creation (and even now?) none of the filesystem choices were mature, or proven to be sufficiently reliable. Ext4 and software RAID1 are both mature and proven reliable.
Yeah, btrfs or zfs are the better choice for this than RAID1 in the modern world I think.
ZFS anywhere it runs natively, btrfs on Linux.
(My experience with btrfs is that it’s fine on a server, but on laptops I’ve ended up with unmountable, unfixable filesystems a number of times. To be fair to btrfs I’ve always been able to recover the data though.)
I had some problems with free space, though, when using Docker intensively: in some situations btrfs thought the free space was exhausted and I had to manually rebuild the metadata.
Docker could use BTRFS / ZFS snapshots instead of using whole filesystem image files though, so this ought to get better.
"Uses forward snapshots, so recovery takes longer with each successive snapshot, unless you upload a "fresh" copy of the data"
(unless you have a relatively slow network connection)
duplicity will interoperate with plain old SFTP, so we have supported it since it was released. We have also contributed funds to the development of duplicity. In these 10 or so years that it has been around, we've never seen any complaints from customers or evidence of things breaking - it is quite solid.
That being said, we do really want to support attic since so many people ask for it. It requires a server-side component to be run, and we try to keep our environment squeaky clean and dead simple - but I think we can cx_Freeze it and run it as a binary executable on our end...
rsync.net, with an HN-readers discount. Just email.
There are no interpreters in the jail that each account is locked in, so we can't run a Python component on our end.
Files (or chunks, if it's enabled) are gpg-encrypted (to a key or using a symmetric passphrase) and could be stored in nearly anything that could store named blobs. Supports various modes (archival, backup, RAID-like). Performance is fair for me.
Downside is that it's not a "real" filesystem, but a git repo that tracks metadata, plus blob storage. The assistant daemon saves a lot of keystrokes, though. There is also a FUSE-based implementation, but I haven't tried it. Getting started is easy, so I suggest taking a look and toying around for half an hour. Another downside is that it - being a git repo - may have problems storing other repositories inside. Haven't tried this, though.
Joey and I put together a special discount rate for git-annex folks that want to point at rsync.net:
I think restic is going for that. I don't think it's "ready" yet, though.
7.9MB Attic etc backup -> 22MB Restic backup.
43GB Attic homes backup -> 71GB Restic backup.
And of course you can't do it after the fact because you can't turn off the crypto.
We do real-time compression, encryption (on the wire and at rest) and deduplication for object storage. We currently support an S3-compatible API (including a full policy engine) on the front end, and on the backend we can store to anything that exposes an S3 API (S3, Glacier etc.). Because of the S3-compatible interface we work with any existing client tools that work with S3.
We pride ourselves on our speed and scale. We can do 600 MB/s sustained throughput and easily scale to multi-petabyte datasets. We typically see 95%-97% dedupe ratios on backup data. We support high-availability clustering and replication (for example, replicating between regions for DR).
We don't currently support snapshotting but it's something we can implement relatively easily if people need it.
Our deployment model is based on a virtual appliance and can be deployed in the cloud or on premise. We can also do things like an on-premise writer (that only uploads unique, deduped data over the network), and a reader in the cloud to support cloud workloads or DR.
We have a real focus on backup to cloud in addition to supporting real time big-data use cases in the cloud.
Disclaimer: I work here - if you would like to contact me, feel free: firstname.lastname@example.org.
Best of luck though.
Disclaimer: I'm co-founder of Skylable
Additionally we use HTTPS between clients and our server, and our server and the storage provider (e.g. S3), as well as being able to enable server-side encryption for S3.
Also check out Atticmatic, a wrapper script for both Attic and Borg that makes them much nicer to configure and use: https://torsion.org/atticmatic/
Here's a gist showing how to configure an encrypted store on s3: https://gist.github.com/cknave/3ddae29cc466663cb40e
Another solution I've seen people using is attic as a centralised backup for multiple machines and then running s3ql to push that over to s3.
He must have read Practical Cryptography by Ferguson and Schneier. In any case, it's not a big issue. Encrypt-then-MAC is easier to get wrong.
More to the point, if you're fielding a new MAC-then-encrypt design, then I don't trust your crypto, because your crypto knowledge is over a decade old. Krawczyk's paper was published in 2001: https://eprint.iacr.org/2001/045
With AEZ, the "MAC" is actually just a block of zeros.
Essentially all 'misuse resistant' modes are some form of MtE, since in those modes one necessarily needs to process the entire message before beginning encryption. AEZ is MtE in the loosest sense, basically another way to say it's a two-pass/offline mode.
(That said, AEZ is really interesting.)
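For reference, the encrypt-then-MAC shape being advocated looks like this sketch. The standard library has no AES, so the "cipher" below is a toy SHA-256 keystream XOR standing in for a real one; the part that matters is the ordering: the HMAC covers the ciphertext, and it is verified before any decryption happens.

```python
# Encrypt-then-MAC sketch. The keystream construction is a TOY, not a
# real cipher -- it only exists so the example is stdlib-only and runnable.
import hashlib
import hmac
import os

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_then_mac(enc_key: bytes, mac_key: bytes, plaintext: bytes):
    nonce = os.urandom(16)
    ct = bytes(a ^ b for a, b in
               zip(plaintext, keystream(enc_key, nonce, len(plaintext))))
    # MAC is computed over the CIPHERTEXT (plus nonce), not the plaintext.
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return nonce, ct, tag

def decrypt(enc_key: bytes, mac_key: bytes, nonce: bytes, ct: bytes, tag: bytes):
    # Verify the tag FIRST; never run the cipher over unauthenticated data.
    expected = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("MAC check failed")
    return bytes(a ^ b for a, b in zip(ct, keystream(enc_key, nonce, len(ct))))
```

In MAC-then-encrypt the tag is computed over the plaintext and encrypted along with it, so the receiver must decrypt before it can authenticate, which is the source of the padding-oracle class of attacks discussed above.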
Two things jump out at me from a quick glance at the site:
1. "S3QL splits files in blocks of a configurable size (default: 10 MB)" <-- this is quite a large block size and will result in significantly less deduplication than tarsnap's variable-length average 64kB blocks. (On the other hand, large blocks significantly reduce the amount of work and RAM needed; S3QL's tuning probably makes sense for a "live" filesystem.)
2. "all data can AES encrypted with a 256 bit key" <-- leaving aside the grammatical problem, I can't find anything beyond this about key management. At best this means that anyone who can decrypt data can encrypt it, and vice versa; but the usual rule about crypto documentation applies here: If people don't realize that the details matter, they're probably doing something wrong. (Unfortunately it's 1AM and I'm allergic to snakes, or else I would dig into the code to see exactly how their crypto works.)
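Point 1 can be illustrated concretely: with fixed-size blocks, inserting a single byte shifts every subsequent block boundary and destroys dedup against the previous snapshot, while content-defined boundaries resynchronize after the edit. The rolling hash here is a simplified stand-in, not tarsnap's or S3QL's actual algorithm:

```python
# Fixed-size vs content-defined chunking, and what one inserted byte does
# to deduplication between two versions of the same data.
import random

def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data: bytes, mask: int = 0x3F) -> list[bytes]:
    """Cut wherever a byte-driven rolling hash hits a boundary pattern;
    boundaries depend on local content, so they survive insertions."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF
        if (h & mask) == 0 and i + 1 - start >= 16:  # min chunk 16 bytes
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    chunks.append(data[start:])
    return chunks

def shared(old: list[bytes], new: list[bytes]) -> int:
    """How many chunks of `new` already exist in `old` (i.e. dedupe away)."""
    seen = set(old)
    return sum(1 for c in new if c in seen)

random.seed(0)  # deterministic pseudo-random "file"
data = bytes(random.getrandbits(8) for _ in range(5000))
edited = b"X" + data  # a single byte inserted at the front
```

Running `shared` on both chunkings of `data` vs `edited` shows the fixed-block scheme loses essentially everything while the content-defined one keeps most chunks; larger blocks (like S3QL's 10 MB default) just make each miss more expensive.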
Additionally, S3QL would periodically issue new releases that didn't support old versions of the file format, or only supported them a set number of releases back. So if you didn't upgrade frequently enough, you'd find yourself with a release that refused to read your existing gigabytes of backup data. And then at that point, you have to do a binary search to find and recompile old releases in the vain attempt to resurrect your data and avoid having to do a full backup from scratch.
Bottom line: Stay far away from S3QL and instead use Attic, Borg, or Tarsnap.
Also, if you select Attic or Borg, check out Atticmatic: https://torsion.org/atticmatic/
I used s3ql to back up my stuff on servers where I had shell access and disk space but didn't trust root.
By contrast, from what I understand, tarsnap requires you to use the official server.
Main difference appears to be that it supports concurrent mounts, while S3QL can be mounted by one client at a time.
S3QL is FOSS.
That said, I still use it as it's easier & less error-prone than setting up my own backup system with tarsnap.
The software hasn't been updated since May 2011, when they released version 3.16.
How best to monitor that something doesn't happen (e.g. a scheduled, remote job silently failing to run) is something I've yet to solve.
It could probably be improved (subject to the nuances of S3 which I'm not fully familiar with). One way to fix it would be to copy the concept of SQLite's WAL mode. Use an appended write operation on S3 (if it supports it) to append to an existing file that contains the transaction log. Then at certain intervals (say every few thousand transactions) one can finally flush that log to be stored in the main database file.
This would substantially reduce the number of times the database would need to be re-uploaded in full.
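An in-memory sketch of that WAL idea, with a dict standing in for the object store. (Note that S3 itself has no true append operation, as the comment above suspects; a real version would write small numbered log objects instead of appending to one.)

```python
# WAL-style batching: append transactions to a cheap small log object and
# only rewrite the large main database object every FLUSH_EVERY writes.
import json

FLUSH_EVERY = 3  # flush threshold; a real setting would be thousands

class WalStore:
    def __init__(self):
        self.objects = {"db": json.dumps({}), "wal": ""}  # fake object store
        self.pending = 0
        self.uploads = {"db": 0, "wal": 0}  # count of (expensive) uploads

    def write(self, key, value):
        # Append to the log (small upload) rather than rewriting "db".
        self.objects["wal"] += json.dumps({key: value}) + "\n"
        self.uploads["wal"] += 1
        self.pending += 1
        if self.pending >= FLUSH_EVERY:
            self.flush()

    def flush(self):
        # Replay the log into the main database and rewrite it once.
        db = json.loads(self.objects["db"])
        for line in self.objects["wal"].splitlines():
            db.update(json.loads(line))
        self.objects["db"] = json.dumps(db)  # the one big upload
        self.objects["wal"] = ""
        self.uploads["db"] += 1
        self.pending = 0

    def read(self, key):
        # Readers see the main db with the unflushed log overlaid.
        db = json.loads(self.objects["db"])
        for line in self.objects["wal"].splitlines():
            db.update(json.loads(line))
        return db.get(key)
```

With a flush interval of N, the full database is re-uploaded once per N transactions instead of once per transaction, which is the saving described above.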
"""In principle, both VPN and NFS/CIFS-alike functionality could be integrated into S3QL to allow simultaneous mounting using mount.s3ql directly. However, consensus among the S3QL developers is that this is not worth the increased complexity."""
(And https://leastauthority.com/how_it_works )
This approach allows use of other tools like s3cmd and the Amazon web console but prevents advanced features like deduplication and snapshotting.
The documentation is just somewhat messed up...
That would be perfect for things we don't want to ever change...like container images. Or configuration files.
The thing that pops out of this is TRUST. We need to be able to trust an application we are running is the one we want to run, who was responsible for writing it, who's responsible for running it, and all the bits in-between.
You would still do versioned backups and then make one a month immutable. Or whatever your strategy is.
Basically, use S3 the same way you use an in-house SAN.
Also, if I remember correctly, S3 integration is per account, and only if you enable the external file storage app. It wouldn't let you store all data on S3. So, mounting a bucket as the OwnCloud data folder would be the way I'd get that.