I created nFreezer initially for my own needs, because when doing remote backups (especially to servers to which we never have physical access), it's hard to trust the destination server 100%.
Being curious: how do you usually do remote backups of your important files?
--> With the usual solutions, even if you use SSH/SFTP and an encrypted partition on the destination, there is a short window during which the data sits unencrypted on the remote server, after it arrives and before it reaches the encrypted filesystem layer and is written to disk.
Hence this software, nFreezer: the data is never decrypted on the remote server.
Now that I'm discovering it, borg does indeed seem pretty good for this use case.
Still, there is at least one thing about nFreezer that may interest some people: it is very simple and does the job in only 249 lines of code. You can therefore read the FULL source code in a few hours, to see if you trust it or not.
See here: https://github.com/josephernest/nfreezer/blob/master/nfreeze...
If I wanted to do this with the source code of the tool you mentioned, I would have to spend at least one full week (which is normal: that program has 100 times more features).
The key point is: if you're looking for a solution where you don't have to trust the remote server, then you probably don't want to trust the backup tool of a random internet person either, and you probably want to read the source code of the program.
So having only < 300 lines of code to read in a single .py can be an advantage.
"But still there is one thing (at least) that can be of interest to some people with nFreezer: it is very simple, it does the job in only 249 lines of code. You can then read the FULL source code in a few hours, to see if you trust it or not. See here: https://github.com/josephernest/nfreezer/blob/master/nfreeze..."
I really appreciate this and find this very interesting.
I would be happy to give you a free dev/test account at rsync.net if that would help you continue this development.
I don't see the key point: the two things (trusting the remote storage and trusting the tool) are rather independent for me. I already trust probably billions of lines of code which handle my data (basically, every program installed on my system). I don't trust them because I checked all of them, but because I know that the same lines are used by many other people, some of whom have actually read parts of them. There is a global trust repository to which every programmer contributes a little bit and from which every user benefits a little bit. For this reason, I find a project already used and developed by many people, like borg, more trustworthy. I didn't just take that for granted, of course: I read the documentation (which, in the case of borg, is very clean and extensive), I looked at how it works, I even read part of the source code, and I was satisfied. And all of this has nothing to do with where I am putting that data in the end.
On the other hand, dropping features in the name of small code size is not a great advantage if those missing features make my backups less reliable, fast, or compact. And maybe I end up backing up less stuff because (say) the simple tool does not handle deduplication or compression as effectively as the complicated one, and so is not viable for me. That would be a net loss.
Mind you, I am just giving my perspective. You have, of course, every right to decide what your priorities are and which tool is better for you.
[...]
with open(f2, 'wb') as f, src_cm.open(chunkid.hex(), 'rb') as g:
    decrypt(g, pwd=encryptionpwd, out=f)
[...]
def decrypt(f=None, s=None, pwd=None, out=None):
    [...]
    while True:
        block = f.read(BLOCKSIZE)
        if not block:
            break
        out.write(cipher.decrypt(block))
    try:
        cipher.verify(tag)
    except ValueError:
        print('Incorrect key or file corrupted.')
So, basically, you decrypt the whole file and write the result before checking the tag. You're using a block cipher in a streaming fashion, and as has already been said before (see https://www.imperialviolet.org/2015/05/16/aeads.html, "AEADs with large plaintexts"), this is dangerous if you don't do it correctly. Your data may be garbage, but it's too late: it's already written to disk before you know it's bad, it's not deleted, and you won't know which file it was.
As some HN crypto celebrity said some time ago, if you write "AES" in your code then you're wrong. You MUST use misuse-resistant libraries unless you know exactly what you're doing.
TL;DR: your crypto is broken, use NaCl instead of doing it yourself.
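For what it's worth, here is a minimal sketch (not nFreezer's code, just an illustration assuming PyNaCl is installed) of what a misuse-resistant API looks like: SecretBox never hands back unverified plaintext, decrypt() either returns the whole verified message or raises.

import nacl.secret
import nacl.utils
from nacl.exceptions import CryptoError

key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
box = nacl.secret.SecretBox(key)

ciphertext = box.encrypt(b'backup chunk contents')  # nonce is handled for you

try:
    plaintext = box.decrypt(ciphertext)  # verifies before returning anything
except CryptoError:
    plaintext = None  # tampering detected: nothing was ever written anywhere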
except ValueError:
    print('Incorrect key or file corrupted.')
So if there is a tag problem, this is clearly logged, and you know something is wrong. The good thing is that you can easily edit it and have instead (pass the filename fn to the function, to be able to log):
print('Incorrect key or file corrupted, will be deleted:', fn)
os.remove(fn)
exit(...)
Real question: is it possible to `.verify(tag)` before having decrypted the whole file? I doubt it is. So one option could be to write the file to a temporary place and then, only when the tag is verified, move it to the right place; delete it if it is not verified. Another option would be to do a first pass of decrypt() without writing anything to disk, then get the tag, verify it, and then, if OK, redo the whole decryption, writing to disk this time. The latter might be a bit too extreme and halves the performance.
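For the first option, a rough sketch of what I have in mind (assuming a pycryptodome GCM cipher; restore_chunk and its arguments are made-up names, not the actual nFreezer code):

import os, tempfile
from Crypto.Cipher import AES

BLOCKSIZE = 1024 * 1024

def restore_chunk(src, key, nonce, tag, dest_path):
    """Decrypt src into a temp file; move it to dest_path only if the tag verifies."""
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path) or '.')
    try:
        with os.fdopen(fd, 'wb') as out:
            while True:
                block = src.read(BLOCKSIZE)
                if not block:
                    break
                out.write(cipher.decrypt(block))
        cipher.verify(tag)               # raises ValueError if corrupted/tampered
        os.replace(tmp_path, dest_path)  # atomic rename: only verified data lands here
    except ValueError:
        os.remove(tmp_path)              # bad tag: the partial plaintext is deleted
        raise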
> So if there is a tag problem, this is clearly logged, and you know something is wrong [...] The good thing is that you can easily edit it and have instead (pass the filename fn to the function, to be able to log)
That's the thing: the script in its current version is incorrect, and even doing that won't be a perfect solution. That's why other people are saying that other software, widely used and able to do more than nFreezer, should be considered before trying to do it your own way.
It's good not to rely on anyone else, but crypto is the one domain where you can't have "good enough" -- it's either correct, or it's not.
> Another option would be to do a first pass of decrypt() without writing anything to disk, then get the tag, verify it, and then, if OK, redo the whole decryption, writing to disk this time
Yep, that's the way: do the decrypting in memory, or in /tmp, verify the tag, and only after that put the file where it belongs. I just checked the API of the crypto module, and there's a `decrypt_and_verify` that should do it properly.
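Something along these lines (a sketch, assuming pycryptodome; key/nonce/tag handling simplified):

from Crypto.Cipher import AES

def decrypt_in_memory(ciphertext, tag, key, nonce):
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    # Raises ValueError before returning anything if the tag doesn't match.
    return cipher.decrypt_and_verify(ciphertext, tag)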
Of course that's problematic especially for big files, so what you want to do is chunk the files, encrypt the chunks separately and store the file as a list of such chunks.
The step after that is to use Content-Defined Chunking, i.e. chunking based on the content of the file. This way, when a big file is modified, only the chunk around the modification changes; the rest of the file is chunked exactly the same way. So you don't need to store the full content of each version of the file, just a small-ish diff.
That's not a novel system, bup (https://github.com/bup/bup) kinda pioneered it... and as others have advised, restic, borg-backup and tarsnap do exactly that.
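To illustrate the idea (a toy chunker, not the algorithm bup/restic/borg actually use; they use a rolling hash such as buzhash or Rabin fingerprints instead of rehashing a window at every byte):

import hashlib

def cdc_chunks(data, window=64, avg_bits=13, min_size=2048, max_size=65536):
    """Cut a chunk wherever the hash of the trailing `window` bytes has its
    low `avg_bits` bits equal to zero, giving ~2**avg_bits-byte chunks on
    average. Since cut points depend only on a small sliding window, an edit
    in a big file only moves the boundaries near the edit; the other chunks
    (and therefore their hashes) stay identical."""
    mask = (1 << avg_bits) - 1
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size >= max_size:
            yield data[start:i + 1]
            start = i + 1
            continue
        if size < min_size or i + 1 < window:
            continue
        h = int.from_bytes(hashlib.blake2b(data[i + 1 - window:i + 1],
                                           digest_size=8).digest(), 'big')
        if h & mask == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]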
Please try it with pycryptodome, you will see it is.
You might find this naming unfortunate, and .init(), .update(), etc. might have been better names to emphasize this.
So this shows that, in its current state, the chunking is just a "RAM-efficient" way to encrypt, but it writes exactly the same encrypted content, as if you did encrypt(...) in one pass. So as long as the file is under ~2^39 bits, it is fine (see https://csrc.nist.gov/publications/detail/sp/800-38d/final).
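A quick way to check this (a sketch assuming pycryptodome; not code from nFreezer): encrypting in chunks produces byte-for-byte the same ciphertext and tag as a single encrypt() call.

from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

key, nonce = get_random_bytes(32), get_random_bytes(16)
data = get_random_bytes(1_000_000)

c1 = AES.new(key, AES.MODE_GCM, nonce=nonce)
whole = c1.encrypt(data)                          # one-pass encryption
tag1 = c1.digest()

c2 = AES.new(key, AES.MODE_GCM, nonce=nonce)
chunked = b''.join(c2.encrypt(data[i:i + 4096])   # "RAM-efficient" chunked encryption
                   for i in range(0, len(data), 4096))
tag2 = c2.digest()

assert whole == chunked and tag1 == tag2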
___
Then, there is another layer of chunking that would be possible, and that would add many benefits: even better deduplication, and avoiding re-encrypting a whole 10 GB file when only a few bytes have changed.
This would be an interesting addition, it's on the Todo list, and I know some other programs do it, of course.
But to clarify: this "content-chunking" is independent of the "RAM-efficient" one I use here.
If you want to continue the discussion (more convenient than here), you're very welcome to post a Github issue.
My setup depends only on crappy home-made bash scripts and standard utilities, and never leaks encryption keys or filesystem metadata off the local machine. I rsync my home directory into a LUKS-encrypted filesystem image, split the image file into fixed-size chunks using the standard GNU split utility, and then rsync the encrypted split-up chunks with a directory on a remote server. The remote server periodically syncs the encrypted image chunks with S3 using the usual aws cli tools, and runs a script to "cat" the chunks back into a remote copy of the whole image and display the overall md5 sum. Another machine off site keeps Amazon honest by syncing the S3 bucket with a local copy, reassembling the encrypted filesystem image, and computing the md5 sum. An extra benefit is that I can use the filesystem image on the remote server over sshfs as the backing device for a locally loopback-mounted LUKS partition, to access or modify individual files without downloading the whole image or corrupting it, provided no more than one session is open at a time.
Back when I was more paranoid, I used only a raw dm-crypt image rather than a LUKS-encrypted image in this setup, so that there would be no LUKS header to give it away and I could plausibly claim that it was one-time-pad encrypted data, for which I could construct a key to yield any plaintext of my choice. (Yes, I know I won't be such a smartass when it comes to a rubber hose attack.)
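If anyone wants to replicate the split-and-checksum part without GNU split, here is a rough Python equivalent (illustration only; the chunk size and names are arbitrary):

import hashlib

CHUNK = 512 * 1024 * 1024  # arbitrary 512 MiB pieces, like `split -b 512M`

def split_and_checksum(image_path, out_prefix):
    """Write fixed-size pieces of the encrypted image and return the md5 of
    the whole image, i.e. what `split` plus `cat ... | md5sum` give you."""
    md5 = hashlib.md5()
    index = 0
    with open(image_path, 'rb') as img:
        while True:
            piece = img.read(CHUNK)
            if not piece:
                break
            md5.update(piece)
            with open('%s.%05d' % (out_prefix, index), 'wb') as out:
                out.write(piece)
            index += 1
    return md5.hexdigest()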
Personally, I use zfs on my home server. All my other machines rsync to a backups folder on there, and once a month a scheduled email reminds me to grab my backup usb hard drive and do a zfs send | recv to sync for an offline backup.
If you were using an untrusted server as a zfs send target, as of a few months ago, you can actually do an encrypted send to a remote without unlocking on the remote, which is pretty cool.
Thanks for asking the question; you're right, there are thousands of backup programs already.
The "key thing" that lead me to code my own is that with nearly all the solutions I tried, data was resent over network when a big file is moved to another dir+renamed, WHEN used in encrypted-at-rest mode.
It's the case for rsync (the --fuzzy only helps when renamed but stays in same dir), duplicity, and even rclone when in encrypted mode (see https://forum.rclone.org/t/can-not-use-track-renames-when-sy...: "Can not use –track-renames when syncing to and from Crypt remotes").
> The "key thing" that lead me to code my own is that with nearly all the solutions I tried, data was resent over network when a big file is moved to another dir+renamed, WHEN used in encrypted-at-rest mode.
restic (and I suppose borg, which is similar) solves this problem and goes even further by chunking your files, then hashing and deduping the chunks - chunks with the same hash aren't re-sent. Great e.g. for backing up VM images or encrypted containers where only a small part of the file changes - only that small part will be re-sent between snapshots. The chunking algorithm is "content-defined", so it can probabilistically and quite efficiently detect shifted chunks and duplicated chunks across different files.
(Naturally, all this machinery will also handle the simple cases of renamed and duplicated files on your filesystem)
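The dedup part boils down to indexing chunks by content hash; a hedged sketch of the idea (remote_has and upload are placeholders for the storage backend, not restic's or borg's real API):

import hashlib

def backup_chunks(chunks, remote_has, upload):
    """Send only chunks whose content hash the destination doesn't already
    store; a file is then just an ordered list of chunk hashes."""
    manifest = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if not remote_has(digest):
            upload(digest, chunk)   # new content: this is the only data transferred
        manifest.append(digest)
    return manifest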
>Being curious: how do you usually do remote backups of your important files?
I keep my important files in a ~50GB veracrypt container then backup that to multiple locations including a local usb disk every day or so with a shell script.
Dropbox nuked its copy once, but it worked fine again after a name change (version number).
I use dar[0] to create incremental, compressed, encrypted backup archives, and then upload those to a GCS bucket. GCS is dirt cheap even for terabytes of data, so as an added bonus this gives me a history of all my files (e.g., I still have backup archives from 2013). Both calls are wrapped in a shell script that is regularly scheduled via a cronjob on my local server. Within my network I sync between laptop, PC and server via syncthing. Works well for me, even though it doesn't handle file renames gracefully.
Do you think there is a way to avoid re-transferring files that have been renamed/moved with this method?
This is the use case I mentioned: I often move files from one dir to another when working on audio- or video-production projects, and it can be dozens of gigabytes.
Within my home network, syncthing handles file renames gracefully [0]. For backups to cloud storage, this might require changes to dar, because according to [1] it's unclear how that tool handles renames. Since I only do these off-site backups to cloud storage once a month, I'm okay with it using a bit more bandwidth/storage (as long as it runs overnight and doesn't get in anybody's way).
Nice work. I personally use Arq, which supports SFTP. And like you, I backup to a remote server over which I have no control. Arq is really desktop software, however.
nFreezer differs from rclone and Duplicity on one very important point (for me, because I often move big multimedia projects): if you move `/path/to/10GB_file` to `/anotherpath/subdir/10GB_file_renamed`, no data is re-transferred over the network, thus saving 10 GB of data transfer.
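The general idea (an illustration, not the actual nFreezer code; plan_upload and remote_hashes are made-up names) is that files are addressed by content hash, so a moved or renamed file maps to a hash the destination already has and only a metadata entry changes:

import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()

def plan_upload(local_files, remote_hashes):
    """remote_hashes: set of content hashes already stored (encrypted) on the
    destination. Renamed/moved files hash to the same value, so they trigger
    a metadata update only, not a 10 GB re-transfer."""
    to_send, metadata = [], {}
    for path in local_files:
        digest = sha256_of(path)
        metadata[path] = digest
        if digest not in remote_hashes:
            to_send.append(path)
    return to_send, metadata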
Duplicacy is a separate backup tool, but has a confusingly similar name to duplicity. I believe its chunking algorithm means large renamed files aren't re-transferred.
You will probably be surprised, but I wasn't aware of this one when starting my project! In a way, this is fortunate, because it was an interesting journey to code this :)
I'm currently writing my own backup tool because of one feature I want that nothing else seems to have. In my opinion, it's not enough to distrust the server where the backups are stored. The client should also not need to keep around a key that can be used to decrypt the backups for the purpose of encrypting new backups.
This is quite easy to accomplish with public key cryptography - just use a new, random AES key for each backup job, then encrypt that with a public key you keep around. The private key can be stored offline in cold storage.
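A minimal sketch of that scheme (assuming PyNaCl; SecretBox stands in for the AES part purely for brevity): the backup machine only ever holds the public key and fresh per-job keys, while the private key stays in cold storage.

from nacl.public import PrivateKey, PublicKey, SealedBox
from nacl.secret import SecretBox
from nacl.utils import random as nacl_random

# Done once, offline: keep recovery_key in cold storage, ship only backup_pub.
recovery_key = PrivateKey.generate()
backup_pub = recovery_key.public_key

def encrypt_backup(data, backup_pub):
    job_key = nacl_random(SecretBox.KEY_SIZE)             # fresh key per backup job
    blob = SecretBox(job_key).encrypt(data)               # bulk encryption
    wrapped_key = SealedBox(backup_pub).encrypt(job_key)  # needs only the public key
    return wrapped_key, blob

def decrypt_backup(wrapped_key, blob, recovery_key):
    job_key = SealedBox(recovery_key).decrypt(wrapped_key)  # only at restore time
    return SecretBox(job_key).decrypt(blob)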
Cloud backends aren't really needed IMHO if you use something like rclone. Native cloud backends can make the transfer order more intelligent, but rclone does deletes last, meaning you won't lose data if the transfer dies mid-transfer.
I use Borg to a local home server and then Duplicity on the Borg repository to Google Drive (where I happen to have a lot of space for free thanks to my university). This way I am protected both from loss or breakage of the local server and from revocation of the remote account (as Google is known to do). Also, backups are quicker on the local network.
My current solution for backups with local encryption (files encrypted locally before syncing) is a gocryptfs setup. Specifically, I create an overlay with gocryptfs in which the overlay exposes the unencrypted filesystem, while what's physically written to disk are the encrypted files.
Then, I create another mount of the physical disk and simply sync _that_ mount to multiple remote destinations. Of course, this solution is fairly simple and does not provide features like handling file moves. Further, this approach potentially "leaks" some information, such as how many files I have, the approximate size of each file, etc.
This was set up a while ago; I will definitely take a look at nFreezer to see if it's time for a refresh of my setup.
Don't forget to mention rdedup (https://github.com/dpc/rdedup), which is an impressive piece of software. The advantage it has over all the other solutions I came across is that it can do UNATTENDED backups without providing the encryption key... on the other hand, it does not have a file scanner, which is a real pity.
I thought there were a lot of options for this already (perhaps even any decent backup tool?)
For example, restic with the SFTP backend. I thought restic encrypts data client-side before the SSH transport. Am I missing something?
Other tools: duplicacy, rclone, Borg (this one might run on the server), ... Duplicity does asymmetric and symmetric crypto, but no dedup, so if you change a file path, it "might" be re-encrypted.
Remember, encryption isn't a magic method of preventing hackers from getting your data remotely.
A security problem is still a security problem: encryption at rest is about physical security, and encryption in transit is about moving data securely between locations.
Your applications still need to be secure, otherwise encryption won't save you.
> With the usual solutions, even if you use SSH/SFTP and an encrypted partition on the destination, there is a short window during which the data sits unencrypted on the remote server [...] Hence this software, nFreezer: the data is never decrypted on the remote server.
How do you work with this?