
S3QL: an S3 FS with encryption, de-dup, immutable trees and snapshotting - tambourine_man
https://bitbucket.org/nikratio/s3ql
======
indiv0
I was recently looking into options for offsite, encrypted (at-rest & in-
transit), deduplicated, snapshotted, backups to S3. S3QL was one of the
options I investigated, but unfortunately their crypto is MAC-then-encrypt
[0], which is no good.

A small sample of the other options considered:

    
    
        * EncFS + rsnapshot/rdiff-backup
            * Weak cryptography in EncFS [1]
        * Duplicity
            * Uses forward snapshots, so recovery takes longer with each successive snapshot, unless you upload a "fresh" copy of the data
        * ext4 on LVM on LUKS on a loopback device provided by s3backer [2]
            * OK I'll admit this one is pretty far out there, but it actually works decently well (ignoring the massive FUSE overhead)
            * Downside: a slight hiccup in network connectivity guarantees data corruption
        * Attic
            * Known data corruption issues with large datasets [3]
    

Does anyone know of anything that can provide the above requirements? At this
point it's really more of a thought experiment, as I've decided to go with
Tarsnap [4] for my use-case.
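
To illustrate the Duplicity point above (restore time growing with each successive snapshot), here is a toy model of a forward-incremental chain. It is only a sketch under stated assumptions: the `restore` function, the dict-based "backups", and the use of `None` to mark deletions are all made up for illustration and are not Duplicity's actual format or API.

```python
def restore(full_backup: dict, increments: list, snapshot_index: int):
    """Rebuild snapshot `snapshot_index` (0 = the full backup itself)."""
    state = dict(full_backup)
    applied = 0
    # Every increment between the full backup and the target must be replayed.
    for delta in increments[:snapshot_index]:
        for path, content in delta.items():
            if content is None:      # None marks a deleted file in this toy model
                state.pop(path, None)
            else:
                state[path] = content
        applied += 1
    return state, applied


full = {"a.txt": "v1", "b.txt": "v1"}
increments = [
    {"a.txt": "v2"},   # snapshot 1: a.txt changed
    {"b.txt": None},   # snapshot 2: b.txt deleted
    {"c.txt": "v1"},   # snapshot 3: c.txt added
]
latest, steps = restore(full, increments, len(increments))
```

In this model the number of deltas replayed equals the snapshot's distance from the last full backup, which is why uploading a fresh full copy resets the cost.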

[0]: [http://www.rath.org/s3ql-docs/impl_details.html](http://www.rath.org/s3ql-docs/impl_details.html)

[1]: [https://defuse.ca/audits/encfs.htm](https://defuse.ca/audits/encfs.htm)

[2]:
[https://github.com/archiecobbs/s3backer](https://github.com/archiecobbs/s3backer)

[3]: [http://librelist.com/browser/attic/2015/3/31/comparison-of-a...](http://librelist.com/browser/attic/2015/3/31/comparison-of-attic-vs-bup-vs-obnam/#cbbe599389a20c787a74b137dc78fb1a)

[4]: [https://www.tarsnap.com/](https://www.tarsnap.com/)

~~~
nabla9
>MAC-then-encrypt [0], which is no good.

He must have read Practical Cryptography by Ferguson and Schneier. In any
case, it's not a big issue. Encrypt-then-MAC is easier to get wrong.

~~~
tptacek
I don't know about the constructions and attack scenarios pertaining to this
app, but MAC-then-encrypt is generally _insecure_; it's not "easier to get
wrong", but is itself a flaw (typically: of exposing the cipher to malicious
ciphertext, rather than screening it out with a MAC, which is designed to
distinguish between valid and invalid ciphertext).
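
A minimal sketch of the "screen it out with a MAC first" idea, i.e. encrypt-then-MAC, using only the Python standard library. Loud assumptions: the keystream below (HMAC-SHA256 in counter mode) is a toy stand-in for a real cipher so the example runs without third-party packages, and the names `seal` and `open_sealed` are made up. This is not S3QL's scheme and not something to deploy.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream (HMAC-SHA256 in counter mode); a stand-in for a real cipher.
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def seal(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    nonce = os.urandom(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, stream))
    # The MAC covers the nonce and the ciphertext, not the plaintext.
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def open_sealed(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    if len(blob) < 16 + 32:
        raise ValueError("message too short")
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    # Reject forged or modified input before the cipher ever touches it.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    stream = _keystream(enc_key, nonce, len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, stream))
```

The point of the ordering is in `open_sealed`: decryption code only runs on input that has already passed the constant-time tag check.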

~~~
zx2c4
Rogaway's bringing it back with a vengeance though:

[http://web.cs.ucdavis.edu/~rogaway/aez/aez.pdf](http://web.cs.ucdavis.edu/~rogaway/aez/aez.pdf)

With AEZ, the "MAC" is actually just a block of zeros.

~~~
pbsd
Rogaway brought it back with a vengeance back in 2006, with SIV:
[http://web.cs.ucdavis.edu/~rogaway/papers/keywrap.pdf](http://web.cs.ucdavis.edu/~rogaway/papers/keywrap.pdf)

Essentially all 'misuse resistant' modes are some form of MtE, since in those
modes one necessarily needs to process the entire message before beginning
encryption. AEZ is MtE in the loosest sense, basically another way to say it's
a two-pass/offline mode.

~~~
tptacek
Is there a distinction to be drawn here between "generic composition" and
"whatever the term is for the opposite of generic composition"?

~~~
pbsd
Not sure I understand precisely what you're asking. 'Generic composition' is a
tricky term, because it tends to hide the assumptions underlying what we're
actually composing together. Bellare and Namprempre's paper, which is the one
most often cited in this sort of discussion, takes a probabilistic encryption
scheme (read: something that takes a random IV) and a MAC, and results in a
probabilistic AE scheme. This is obviously not the only way to model
encryption and authentication schemes (SIV above transforms those same
primitives into a nonce-based AE scheme instead, or DAE with a fixed nonce),
but this tends to be overlooked in GC discussions. In some models, MtE and
friends are OK. But the truth is that EtM is safe in the widest range of
assumptions, while the other ones are more brittle.

------
aorth
I'm curious to see how this compares to tarsnap for encrypted backups. One
cool thing that people might not know about tarsnap is that you can generate
read-only subkeys of your main key!

[https://www.tarsnap.com/man-tarsnap-keymgmt.1.html](https://www.tarsnap.com/man-tarsnap-keymgmt.1.html)

~~~
witten
Having used both S3QL and Tarsnap: in my experience, S3QL's use of remote
mounted filesystems is fundamentally unsuited to unattended backups. On many
occasions, S3QL's mounted filesystem would break mid-backup due to network
issues, and stay mounted and unable to recover. Then, from that point forward,
no backups would ever work again without manual intervention because the mount
point was already open.

Additionally, S3QL would periodically issue new releases that didn't support
old versions of the file format, or only supported them a set number of
releases back. So if you didn't upgrade frequently enough, you'd find yourself
with a release that refused to read your existing gigabytes of backup data.
And then at that point, you have to do a binary search to find and recompile
old releases in the vain attempt to resurrect your data and avoid having to do
a full backup from scratch.

Bottom line: Stay far away from S3QL and instead use Attic, Borg, or Tarsnap.

Also, if you select Attic or Borg, check out Atticmatic:
[https://torsion.org/atticmatic/](https://torsion.org/atticmatic/)

~~~
Nikratio
Regarding the network issues: please file a bug report so that they can be
fixed.

------
stubish
The closest thing I've seen to this is Jungle Disk, which is closed source,
offers a less fully featured file system, but does have Windows and Mac
clients.

~~~
porker
JungleDisk works, but the clients are buggy PoS that crash/hang whenever they
choose. I've also had JungleDisk's server daemon silently quit working & lock
up on some servers, which is scary when you go to restore a file and realise
no backups have happened in a while. It doesn't engender trust.

That said, I still use it as it's easier & less error-prone than setting up my
own backup system with tarsnap.

~~~
lobster_johnson
Seconding this. I used Jungle for a while on my laptop, and it actually runs a
local WebDAV server that acts as a caching proxy, which it then mounts as a
local volume -- and it's horrible. Slow, unstable and prone to mysterious
hangs and errors. This was a couple of years ago, but I wouldn't trust it
unless they have completely rewritten it since then.

~~~
porker
> This was a couple of years ago, but I wouldn't trust it unless they have
> completely rewritten it since then.

The software hasn't been updated since May 2011, when they released version
3.16.

------
spydum
Not sure why it isn't listed in the overview, but this is based on SQLite
(hence the name). Haven't looked into the details yet, but it looks pretty fun.

~~~
eloff
I think SQLite is only used for the metadata, which seems like a reasonable
choice. The speed of metadata operations won't matter next to the slowness of
the read/write operations (slow relative to local SSDs.)

~~~
atmosx
Isn't that kind of limited? How exactly are two different clients going to access
the metadata at the same time, if SQLite3 is used?!

~~~
nbevans
SQLite supports multiple readers when Write Ahead Logging (WAL) is enabled.
This mode is so good, and transforms the whole operation of SQLite so much,
that it is a shame that 1) it isn't the default, and 2) more people don't
know about it.
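
For anyone who hasn't tried it, a small sketch of turning WAL on from Python's built-in `sqlite3` module. The file name and the `inodes` table are made up for the example; the only real piece is `PRAGMA journal_mode=WAL`, which returns the journal mode actually in effect.

```python
import os
import sqlite3
import tempfile

# WAL needs a real file; an in-memory database would not switch modes.
db_path = os.path.join(tempfile.mkdtemp(), "metadata.db")

writer = sqlite3.connect(db_path)
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, name TEXT)")
writer.execute("INSERT INTO inodes (name) VALUES ('a.txt')")
writer.commit()

reader = sqlite3.connect(db_path)

# Start a write and leave it uncommitted for a moment.
writer.execute("INSERT INTO inodes (name) VALUES ('b.txt')")

# The reader is not blocked and sees the last committed state.
rows_during_write = reader.execute("SELECT COUNT(*) FROM inodes").fetchone()[0]

writer.commit()
rows_after_commit = reader.execute("SELECT COUNT(*) FROM inodes").fetchone()[0]

reader.close()
writer.close()
```

Note that this is about several connections on one machine sharing one database file; it says nothing about clients on different machines.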

~~~
aidos
The way s3ql works is that it copies the db from s3 to your local machine. You
make changes and then write the new version back to s3 later with the data. So
it won't help in this case.

~~~
nbevans
I see. Out of interest how does it write back the new version of the SQLite
database file with only the differences? Does S3 support binary patching?

~~~
aidos
I believe it uploads the whole thing each time. I think it may even upload it
with a new index counter in the name to version it (but I can't find that in
the docs now).

[http://www.rath.org/s3ql-docs/impl_details.html#metadata-
sto...](http://www.rath.org/s3ql-docs/impl_details.html#metadata-storage)

~~~
tobias3
I think you found the Achilles heel. This won't scale, so it is only suitable
for small file systems.

~~~
nbevans
Indeed, I'm good at that ;)

It could probably be improved (subject to the nuances of S3 which I'm not
fully familiar with). One way to fix it would be to copy the concept of
SQLite's WAL mode. Use an appended write operation on S3 (if it supports it)
to append to an existing file that contains the transaction log. Then at
certain intervals (say every few thousand transactions) one can finally flush
that log to be stored in the main database file.

This would substantially reduce the number of times the database would need to
be re-uploaded in full.
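
A rough sketch of what I mean, with a plain dict standing in for the bucket. Everything here is hypothetical: the class, the key names, and the JSON encoding are invented for illustration, and none of it reflects how S3QL stores metadata. To avoid depending on an append operation, each batch of changes is written as its own small numbered object instead of being appended to one log file.

```python
import json


class LoggedMetadataStore:
    """Toy model: one full snapshot object plus small numbered log objects."""

    def __init__(self, bucket: dict, checkpoint_every: int = 3):
        self.bucket = bucket                  # stand-in for an object store
        self.checkpoint_every = checkpoint_every
        self.state = {}
        self.seq = 0                          # total commits so far
        self.pending = 0                      # log objects since last full upload
        self.full_uploads = 0
        self._checkpoint()

    def _checkpoint(self):
        # Upload the full metadata, then drop the log objects it now covers.
        self.bucket["metadata/full"] = json.dumps(self.state).encode()
        for key in [k for k in self.bucket if k.startswith("metadata/log-")]:
            del self.bucket[key]
        self.pending = 0
        self.full_uploads += 1

    def commit(self, changes: dict):
        # Normal path: upload only the changes, not the whole database.
        self.state.update(changes)
        self.seq += 1
        self.bucket[f"metadata/log-{self.seq:08d}"] = json.dumps(changes).encode()
        self.pending += 1
        if self.pending >= self.checkpoint_every:
            self._checkpoint()


def recover(bucket: dict) -> dict:
    # Load the last full snapshot, then replay log objects in key order.
    state = json.loads(bucket["metadata/full"])
    for key in sorted(k for k in bucket if k.startswith("metadata/log-")):
        state.update(json.loads(bucket[key]))
    return state
```

With `checkpoint_every=3`, five commits cost one extra full upload instead of five; the trade-off is that recovery has to fetch and replay the outstanding log objects.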

------
timf
If this is interesting, recommend looking at Tahoe-LAFS too:
[https://www.tahoe-lafs.org/trac/tahoe-lafs](https://www.tahoe-lafs.org/trac/tahoe-lafs)

(And
[https://leastauthority.com/how_it_works](https://leastauthority.com/how_it_works)
)

------
gaul
s3fs also layers a filesystem on top of S3 but preserves the native object
format:

[https://github.com/s3fs-fuse/s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse)

This approach allows use of other tools like s3cmd and the Amazon web console
but prevents advanced features like deduplication and snapshotting.

------
IgorPartola
This looks cool and I am going to give this a try. The problem for me is, as
is usually the case with such projects, the packaging. If this thing is
production-ready, then why must I check for installed dependencies by running
random commands [1]? If it's a Python project, why isn't it distributed on
PyPI? I don't want to download stuff from BitBucket manually and install it by
executing setup.py. I understand that the project supports multiple OS's.
That's great. But there are simple steps that can be taken to make installing
this thing via automated tools (Puppet, Chef, Ansible, etc.) easier than how
it's set up now. A Debian package would be _so nice_ for Ubuntu/Debian.

[1] [http://www.rath.org/s3ql-docs/installation.html#dependencies](http://www.rath.org/s3ql-docs/installation.html#dependencies)

~~~
repomies691
There is plenty of packaging available:
[https://bitbucket.org/nikratio/s3ql/wiki/Installation](https://bitbucket.org/nikratio/s3ql/wiki/Installation)

The documentation is just somewhat messed up...

~~~
IgorPartola
Never mind the things I said above. Thank you for pointing this out. This is
exactly what I was looking for.

------
kordless
> Immutable Trees. Directory trees can be made immutable, so that their
> contents can no longer be changed in any way whatsoever. This can be used to
> ensure that backups can not be modified after they have been made.

That would be perfect for things we don't want to ever change...like container
images. Or configuration files.

~~~
porker
What is the benefit of immutable trees over versioned backups - or to put it
another way: why wouldn't you keep a change history in your backups?

~~~
kordless
Backups are for data generated by people using applications running on
services provided by infrastructure. What I'm referring to is the
configuration portion of launching and running a service, not the data
generated after a human uses it.

The thing that pops out of this is TRUST. We need to be able to trust an
application we are running is the one we want to run, who was responsible for
writing it, who's responsible for running it, and all the bits in-between.

------
jerrac
How well would this work as the storage backend for something like OwnCloud? Most
of my cloud experience is with an internal vmware cluster, so I'm not sure how
to evaluate something like this.

~~~
btgeekboy
I'm not sure what the point would be - OwnCloud already knows how to work with
S3 directly.

~~~
jerrac
Just as an example app.

Basically, use S3 the same way you use an in house SAN.

Also, if I remember correctly, S3 integration is per account, and only if you
enable the external file storage app. It wouldn't let you store all data on
S3. So, mounting a bucket as the OwnCloud data folder would be the way I'd get
that.

------
DavideNL
So would it be possible to mount this as a Volume in OS X and create
(encrypted) Time Machine backups to Amazon S3?

