I created nFreezer initially for my own needs, because when doing remote backups (especially to servers to which we never have physical access), it's hard to trust the destination server 100%.
Being curious: how do you usually do remote backups of your important files?
--> With the usual solutions, even if you use SSH/SFTP and an encrypted partition on the destination, there is a short window during which the data sits unencrypted on the remote server, after it arrives and before it reaches the encrypted filesystem layer and is written to disk.
Hence this software, nFreezer: the data is never decrypted on the remote server.
Now that I'm discovering it, borg does indeed seem pretty good for this use case.
Still, there is at least one thing about nFreezer that may interest some people: it is very simple and does the job in only 249 lines of code. You can therefore read the FULL source code in a few hours, to see if you trust it or not.
See here: https://github.com/josephernest/nfreezer/blob/master/nfreeze...
If I wanted to do this with the source code of the tool you mentioned, I would have to spend at least one full week (which is normal: that program has 100 times more features).
The key point is: if you're looking for a solution where you don't have to trust the remote server, then you probably don't want to trust the backup tool of a random internet person either, and you probably want to read the source code of the program.
So having only < 300 lines of code to read in a single .py can be an advantage.
"But still there is one thing (at least) that can be of interest to some people with nFreezer: it is very simple, it does the job in only 249 lines of code. You can then read the FULL source code in a few hours, to see if you trust it or not. See here: https://github.com/josephernest/nfreezer/blob/master/nfreeze..."
I really appreciate this and find this very interesting.
I would be happy to give you a free dev/test account at rsync.net if that would help you continue this development.
I don't see the key point: the two things (trusting the remote storage and trusting the tool) are rather independent for me. I already trust probably billions of lines of code which handle my data (basically, every program installed on my system). I don't trust them because I checked all of them, but because I know that the same lines are used by many other people, some of whom have actually read parts of them. There is a global trust repository to which every programmer contributes a little bit and from which every user benefits a little bit. For this reason, I find a project already used and developed by many people, like borg, more trustworthy. I didn't just take that for granted, of course: I read the documentation (which, in the case of borg, is very clean and extensive), I looked at how it works, I even read part of the source code, and I was satisfied. And all of this has nothing to do with where I am putting that data in the end.
On the other hand, dropping features in the name of small code size is not a great advantage if those missing features make my backups less reliable, fast, or compact. And maybe I end up backing up less stuff because (say) the simple tool does not handle deduplication or compression as effectively as the complicated one, and so is not viable for me. That would be a net loss.
Mind you, I am just giving my perspective. You have, of course, every right to decide what your priorities are and which tool is better for you.
[...]
with open(f2, 'wb') as f, src_cm.open(chunkid.hex(), 'rb') as g:
    decrypt(g, pwd=encryptionpwd, out=f)
[...]
def decrypt(f=None, s=None, pwd=None, out=None):
    [...]
    while True:
        block = f.read(BLOCKSIZE)
        if not block:
            break
        out.write(cipher.decrypt(block))
    try:
        cipher.verify(tag)
    except ValueError:
        print('Incorrect key or file corrupted.')
So, basically, you decrypt the whole file and write the result before checking the tag. You're using a block cipher in a streaming fashion, and as has already been said before (see https://www.imperialviolet.org/2015/05/16/aeads.html, "AEADs with large plaintexts"), this is dangerous if you don't do it correctly. Your data may be garbage, but it's too late: it's already written to disk before you know it's bad, it's not deleted, and you won't know which file it was.
As some HN crypto celebrity said some time ago, if you write "AES" in your code then you're wrong. You MUST use misuse-resistant libraries unless you know exactly what you're doing.
TL;DR: your crypto is broken, use NaCl instead of doing it yourself.
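For what it's worth, here is a minimal sketch (not nFreezer's code, just an illustration assuming PyNaCl is installed) of what a misuse-resistant API looks like: SecretBox never hands back unverified plaintext, decrypt() either returns the whole verified message or raises.

import nacl.secret
import nacl.utils
from nacl.exceptions import CryptoError

key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
box = nacl.secret.SecretBox(key)

ciphertext = box.encrypt(b'backup chunk contents')  # nonce is handled for you

try:
    plaintext = box.decrypt(ciphertext)  # verifies before returning anything
except CryptoError:
    plaintext = None  # tampering detected: nothing was ever written anywhere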
except ValueError:
    print('Incorrect key or file corrupted.')
So if there is a tag problem, this is clearly logged, and you know something is wrong. The good thing is that you can easily edit it and have instead (pass the filename fn to the function, to be able to log):
print('Incorrect key or file corrupted, will be deleted:', fn)
os.remove(fn)
exit(...)
Real question: is it possible to `.verify(tag)` before having decrypted the whole file? I doubt it is. So one option could be to write the file to a temporary place and then, only when the tag is verified, move it to the right place; delete it if it is not verified. Another option would be to do a first pass of decrypt() without writing anything to disk, then get the tag, verify it, and then, if OK, redo the whole decryption, writing to disk this time. The latter might be a bit too extreme and halves the performance.
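For the first option, a rough sketch of what I have in mind (assuming a pycryptodome GCM cipher; restore_chunk and its arguments are made-up names, not the actual nFreezer code):

import os, tempfile
from Crypto.Cipher import AES

BLOCKSIZE = 1024 * 1024

def restore_chunk(src, key, nonce, tag, dest_path):
    """Decrypt src into a temp file; move it to dest_path only if the tag verifies."""
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path) or '.')
    try:
        with os.fdopen(fd, 'wb') as out:
            while True:
                block = src.read(BLOCKSIZE)
                if not block:
                    break
                out.write(cipher.decrypt(block))
        cipher.verify(tag)               # raises ValueError if corrupted/tampered
        os.replace(tmp_path, dest_path)  # atomic rename: only verified data lands here
    except ValueError:
        os.remove(tmp_path)              # bad tag: the partial plaintext is deleted
        raise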
> So if there is a tag problem, this is clearly logged, and you know something is wrong [...] The good thing is that you can easily edit it and have instead (pass the filename fn to the function, to be able to log)
That's the thing: the script in its current version is incorrect, and even doing that won't be a perfect solution. That's why other people are saying that other software, widely used and able to do more than nFreezer, should be considered before trying to do it your own way.
It's good not to rely on anyone else, but crypto is the one domain where you can't have "good enough" -- it's either correct, or it's not.
> Another option would be to do a first pass of decrypt() without writing anything to disk, then get the tag, verify it, and then, if OK, redo the whole decryption, writing to disk this time
Yep, that's the way: do the decrypting in memory, or in /tmp, verify the tag, and only after that put the file where it belongs. I just checked the API of the crypto module, and there's a `decrypt_and_verify` that should do it properly.
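Something along these lines (a sketch, assuming pycryptodome; key/nonce/tag handling simplified):

from Crypto.Cipher import AES

def decrypt_in_memory(ciphertext, tag, key, nonce):
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    # Raises ValueError before returning anything if the tag doesn't match.
    return cipher.decrypt_and_verify(ciphertext, tag)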
Of course that's problematic especially for big files, so what you want to do is chunk the files, encrypt the chunks separately and store the file as a list of such chunks.
The step after that is to use Content-Defined Chunking, i.e. chunking based on the content of the file. This way, when a big file is modified, only the chunk around the modification changes; the rest of the file is chunked exactly the same way. So you don't need to store the full content of each version of the file, just a small-ish diff.
That's not a novel system, bup (https://github.com/bup/bup) kinda pioneered it... and as others have advised, restic, borg-backup and tarsnap do exactly that.
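To illustrate the idea (a toy chunker, not the algorithm bup/restic/borg actually use; they use a rolling hash such as buzhash or Rabin fingerprints instead of rehashing a window at every byte):

import hashlib

def cdc_chunks(data, window=64, avg_bits=13, min_size=2048, max_size=65536):
    """Cut a chunk wherever the hash of the trailing `window` bytes has its
    low `avg_bits` bits equal to zero, giving ~2**avg_bits-byte chunks on
    average. Since cut points depend only on a small sliding window, an edit
    in a big file only moves the boundaries near the edit; the other chunks
    (and therefore their hashes) stay identical."""
    mask = (1 << avg_bits) - 1
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size >= max_size:
            yield data[start:i + 1]
            start = i + 1
            continue
        if size < min_size or i + 1 < window:
            continue
        h = int.from_bytes(hashlib.blake2b(data[i + 1 - window:i + 1],
                                           digest_size=8).digest(), 'big')
        if h & mask == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]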
Please try it with pycryptodome, you will see it is.
You might find this naming unfortunate, and .init(), .update(), etc. might have been better names to emphasize this.
So this shows that, in its current state, the chunking is just a "RAM-efficient" way to encrypt, but it writes exactly the same encrypted content, as if you did encrypt(...) in one pass. So as long as the file is under ~2^39 bits, it is fine (see https://csrc.nist.gov/publications/detail/sp/800-38d/final).
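A quick way to check this (a sketch assuming pycryptodome; not code from nFreezer): encrypting in chunks produces byte-for-byte the same ciphertext and tag as a single encrypt() call.

from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

key, nonce = get_random_bytes(32), get_random_bytes(16)
data = get_random_bytes(1_000_000)

c1 = AES.new(key, AES.MODE_GCM, nonce=nonce)
whole = c1.encrypt(data)                          # one-pass encryption
tag1 = c1.digest()

c2 = AES.new(key, AES.MODE_GCM, nonce=nonce)
chunked = b''.join(c2.encrypt(data[i:i + 4096])   # "RAM-efficient" chunked encryption
                   for i in range(0, len(data), 4096))
tag2 = c2.digest()

assert whole == chunked and tag1 == tag2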
___
Then, there is another layer of chunking that would be possible, and that would add many benefits: even better deduplication, and avoiding re-encrypting a whole 10 GB file when only a few bytes have changed.
This would be an interesting addition, it's on the Todo list, and I know some other programs do it, of course.
But to clarify: this "content-chunking" is independent of the "RAM-efficient" one I use here.
If you want to continue the discussion (more convenient than here), you're very welcome to post a Github issue.
My setup depends only on crappy home-made bash scripts and standard utilities, and never leaks encryption keys or filesystem metadata off the local machine. I rsync my home directory into a LUKS-encrypted filesystem image, split the image file into fixed-size chunks using the standard GNU split utility, and then rsync the encrypted split-up chunks with a directory on a remote server. The remote server periodically syncs the encrypted image chunks with S3 using the usual aws cli tools, and runs a script to "cat" the chunks back into a remote copy of the whole image and display the overall md5 sum. Another machine off site keeps Amazon honest by syncing the S3 bucket with a local copy, reassembling the encrypted filesystem image, and computing the md5 sum. An extra benefit is that I can use the filesystem image on the remote server over sshfs as the backing device for a locally loopback-mounted LUKS partition, to access or modify individual files without downloading the whole image or corrupting it, provided no more than one session is open at a time.
Back when I was more paranoid, I used only a raw dm-crypt image rather than a LUKS-encrypted image in this setup, so that there would be no LUKS header to give it away and I could plausibly claim that it was one-time-pad encrypted data, for which I could construct a key to yield any plaintext of my choice. (Yes, I know I won't be such a smartass when it comes to a rubber hose attack.)
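If anyone wants to replicate the split-and-checksum part without GNU split, here is a rough Python equivalent (illustration only; the chunk size and names are arbitrary):

import hashlib

CHUNK = 512 * 1024 * 1024  # arbitrary 512 MiB pieces, like `split -b 512M`

def split_and_checksum(image_path, out_prefix):
    """Write fixed-size pieces of the encrypted image and return the md5 of
    the whole image, i.e. what `split` plus `cat ... | md5sum` give you."""
    md5 = hashlib.md5()
    index = 0
    with open(image_path, 'rb') as img:
        while True:
            piece = img.read(CHUNK)
            if not piece:
                break
            md5.update(piece)
            with open('%s.%05d' % (out_prefix, index), 'wb') as out:
                out.write(piece)
            index += 1
    return md5.hexdigest()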
Personally, I use zfs on my home server. All my other machines rsync to a backups folder on there, and once a month a scheduled email reminds me to grab my backup usb hard drive and do a zfs send | recv to sync for an offline backup.
If you were using an untrusted server as a zfs send target, as of a few months ago, you can actually do an encrypted send to a remote without unlocking on the remote, which is pretty cool.
Thanks for asking the question; you're right, there are thousands of backup programs already.
The "key thing" that lead me to code my own is that with nearly all the solutions I tried, data was resent over network when a big file is moved to another dir+renamed, WHEN used in encrypted-at-rest mode.
It's the case for rsync (the --fuzzy only helps when renamed but stays in same dir), duplicity, and even rclone when in encrypted mode (see https://forum.rclone.org/t/can-not-use-track-renames-when-sy...: "Can not use –track-renames when syncing to and from Crypt remotes").
> The "key thing" that lead me to code my own is that with nearly all the solutions I tried, data was resent over network when a big file is moved to another dir+renamed, WHEN used in encrypted-at-rest mode.
restic (and I suppose borg, which is similar) solves this problem and goes even further by chunking your files, then hashing and deduping the chunks - chunks with the same hash aren't re-sent. Great e.g. for backing up VM images or encrypted containers where only a small part of the file changes - only that small part will be re-sent between snapshots. The chunking algorithm is "content-defined", so it can probabilistically and quite efficiently detect shifted chunks and duplicated chunks across different files.
(Naturally, all this machinery will also handle the simple cases of renamed and duplicated files on your filesystem)
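The dedup part boils down to indexing chunks by content hash; a hedged sketch of the idea (remote_has and upload are placeholders for the storage backend, not restic's or borg's real API):

import hashlib

def backup_chunks(chunks, remote_has, upload):
    """Send only chunks whose content hash the destination doesn't already
    store; a file is then just an ordered list of chunk hashes."""
    manifest = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if not remote_has(digest):
            upload(digest, chunk)   # new content: this is the only data transferred
        manifest.append(digest)
    return manifest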
>Being curious: how do you usually do remote backups of your important files?
I keep my important files in a ~50GB veracrypt container then backup that to multiple locations including a local usb disk every day or so with a shell script.
Dropbox nuked its copy once, but it worked fine again after a name change (version number).
I use dar[0] to create incremental, compressed, encrypted backup archives, and then upload those to a GCS bucket. GCS is dirt cheap even for terabytes of data, so as an added bonus this gives me a history of all my files (e.g., I still have backup archives from 2013). Both calls are wrapped in a shell script that is regularly scheduled via a cronjob on my local server. Within my network I sync between laptop, PC and server via syncthing. Works well for me, even though it doesn't handle file renames gracefully.
Do you think there is a way to avoid re-transferring files that have been renamed/moved with this method?
This is the use case I mentioned: I often move files from one dir to another when working on audio- or video-production projects, and it can be dozens of gigabytes.
Within my home network, syncthing handles file renames gracefully [0]. For backups to cloud storage, this might require changes to dar, because according to [1] it's unclear how that tool handles renames. Since I only do these off-site backups to cloud storage once a month, I'm okay with it using a bit more bandwidth/storage (as long as it runs overnight and doesn't get in anybody's way).
Nice work. I personally use Arq, which supports SFTP. And like you, I backup to a remote server over which I have no control. Arq is really desktop software, however.
nFreezer differs from rclone and Duplicity on one very important point (for me, because I often move big multimedia projects): if you move `/path/to/10GB_file` to `/anotherpath/subdir/10GB_file_renamed`, no data is re-transferred over the network, thus saving 10 GB of data transfer.
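The general idea (an illustration, not the actual nFreezer code; plan_upload and remote_hashes are made-up names) is that files are addressed by content hash, so a moved or renamed file maps to a hash the destination already has and only a metadata entry changes:

import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()

def plan_upload(local_files, remote_hashes):
    """remote_hashes: set of content hashes already stored (encrypted) on the
    destination. Renamed/moved files hash to the same value, so they trigger
    a metadata update only, not a 10 GB re-transfer."""
    to_send, metadata = [], {}
    for path in local_files:
        digest = sha256_of(path)
        metadata[path] = digest
        if digest not in remote_hashes:
            to_send.append(path)
    return to_send, metadata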
Duplicacy is a separate backup tool, but has a confusingly similar name to duplicity. I believe its chunking algorithm means large renamed files aren't re-transferred.
You will probably be surprised, but I wasn't aware of this one when starting my project! In a way, this is fortunate, because it was an interesting journey to code this :)
I'm currently writing my own backup tool because of one feature I want that nothing else seems to have. In my opinion, it's not enough to distrust the server where the backups are stored. The client should also not need to keep around a key that can be used to decrypt the backups for the purpose of encrypting new backups.
This is quite easy to accomplish with public key cryptography - just use a new, random AES key for each backup job, then encrypt that with a public key you keep around. The private key can be stored offline in cold storage.
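A minimal sketch of that scheme (assuming PyNaCl; SecretBox stands in for the AES part purely for brevity): the backup machine only ever holds the public key and fresh per-job keys, while the private key stays in cold storage.

from nacl.public import PrivateKey, PublicKey, SealedBox
from nacl.secret import SecretBox
from nacl.utils import random as nacl_random

# Done once, offline: keep recovery_key in cold storage, ship only backup_pub.
recovery_key = PrivateKey.generate()
backup_pub = recovery_key.public_key

def encrypt_backup(data, backup_pub):
    job_key = nacl_random(SecretBox.KEY_SIZE)             # fresh key per backup job
    blob = SecretBox(job_key).encrypt(data)               # bulk encryption
    wrapped_key = SealedBox(backup_pub).encrypt(job_key)  # needs only the public key
    return wrapped_key, blob

def decrypt_backup(wrapped_key, blob, recovery_key):
    job_key = SealedBox(recovery_key).decrypt(wrapped_key)  # only at restore time
    return SecretBox(job_key).decrypt(blob)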
Cloud backends aren't really needed IMHO if you use something like rclone. Native cloud backends can make the transfer order more intelligent, but rclone does deletes last, meaning you won't lose data if the transfer dies mid-transfer.
I use Borg to a local home server and then Duplicity on the Borg repository to Google Drive (where I happen to have a lot of space for free thanks to my university). This way I am protected both from loss or breakage of the local server and from revocation of the remote account (as Google is known to do). Also, backups are quicker on the local network.
My current solution for backups with local encryption (files encrypted locally before syncing) is a gocryptfs setup. Specifically, I create an overlay with gocryptfs in which the overlay exposes the unencrypted filesystem, while what's physically written to disk are the encrypted files.
Then, I create another mount of the physical disk and simply sync _that_ mount to multiple remote destinations. Of course, this solution is fairly simple and does not provide features like handling file moves. Further, this approach potentially "leaks" some information, such as how many files I have, the approximate size of each file, etc.
This was set up a while ago; I will definitely take a look at nFreezer to see if it's time for a refresh of my setup.
Don't forget to mention rdedup (https://github.com/dpc/rdedup), which is an impressive piece of software. The advantage it has over all the other solutions I came across is that it can do UNATTENDED backups without providing the encryption key... on the other hand, it does not have a file scanner, which is a real pity.
I thought there were a lot of options for this already (perhaps even any decent backup tool?)
For example, restic with the SFTP backend. I thought restic encrypts data client-side before the SSH transport. Am I missing something?
Other tools: duplicacy, rclone, Borg (this one might run on the server), ... Duplicity does asymmetric and symmetric crypto, but no dedup, so if you change a file path, it "might" be re-encrypted.
Remember, encryption isn't a magic method of preventing hackers from getting your data remotely.
A security problem is still a security problem: encryption at rest is about physical security, and encryption in transit is about moving data securely between locations.
Your applications still need to be secure, otherwise encryption won't save you.
> With the usual solutions, even if you use SSH/SFTP and an encrypted partition on the destination, there is a short window during which the data sits unencrypted on the remote server [...] Hence this software, nFreezer: the data is never decrypted on the remote server.
How do you work with this?