
borg[1] has become the de facto standard for this use-case.[2]

It can run over SSH with the borg binary on the remote server, or in SFTP mode with nothing installed on the destination.

[1] https://www.borgbackup.org/

[2] https://www.stavros.io/posts/holy-grail-backups/




Now that I'm discovering it, borg does indeed seem to be pretty good for this use case.

But there is still one thing (at least) about nFreezer that may interest some people: it is very simple and does the job in only 249 lines of code. You can therefore read the FULL source code in a few hours and decide whether you trust it or not. See here: https://github.com/josephernest/nfreezer/blob/master/nfreeze...

If I wanted to do this with the source code of the tool you mentioned, I would have to spend at least one full week (which is normal: that program has 100 times more features).

The key point is: if you're looking for a solution where you don't want to trust the remote server, then you probably don't want to trust the backup tool of a random internet person either, and you probably want to read the program's source code.

So having fewer than 300 lines of code to read, in a single .py file, can be an advantage.


"But still there is one thing (at least) that can be of interest to some people with nFreezer: it is very simple, it does the job in only 249 lines of code. You can then read the FULL source code in a few hours, to see if you trust it or not. See here: https://github.com/josephernest/nfreezer/blob/master/nfreeze..."

I really appreciate this and find it very interesting.

I would be happy to give you a free dev/test account at rsync.net if that would help you continue this development.


I don't see the key point: the two things (trusting the remote storage and trusting the tool) are rather independent for me. I already trust probably billions of lines of code which handle my data (basically, every program installed on my system). I don't trust them because I checked all of them, but because I know that the same lines are used by many other people, some of whom have actually read parts of them. There is a global trust repository to which every programmer contributes a little bit and from which every user benefits a little bit.

For this reason, I find a project already used and developed by many people, like borg, more trustworthy. I didn't just take that for granted, of course: I read the documentation (which, in the case of borg, is very clean and extensive), I looked at how it works, I even read part of the source code, and I was satisfied. And all of this has nothing to do with where I am putting the data in the end.

On the other hand, missing features in the name of small code size are not a great advantage if those missing features make my backup less reliable, fast, or compact. Maybe I end up backing up less stuff because (say) the simple tool does not handle deduplication or compression as effectively as the complicated one, and so is not viable for me. That would be a net loss.

To be clear, this is just my perspective. You have, of course, every right to decide what your priorities are and which tool is better for you.


From https://github.com/josephernest/nfreezer/blob/master/nfreeze..., this is how you decrypt files:

    [...]
    with open(f2, 'wb') as f, src_cm.open(chunkid.hex(), 'rb') as g:
        decrypt(g, pwd=encryptionpwd, out=f)
    [...]

    def decrypt(f=None, s=None, pwd=None, out=None):
        [...]
        while True:
            block = f.read(BLOCKSIZE)
            if not block:
                break
            out.write(cipher.decrypt(block))
        try:
            cipher.verify(tag)
        except ValueError:
            print('Incorrect key or file corrupted.')
So, basically, you decrypt the whole file and write the result before checking the tag. You're using a block cipher in a streaming fashion, and as has already been said before (see https://www.imperialviolet.org/2015/05/16/aeads.html, "AEADs with large plaintexts"), this is dangerous if you don't do it correctly. Your data may be garbage, but by the time you find out it's too late: it's already written to disk, it's not deleted, and you won't know which file it was.
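
To make the failure mode concrete, here is a minimal standalone sketch (using pycryptodome, as nFreezer does) showing that decrypt() happily returns garbage for tampered ciphertext, and that only the final verify() catches it:

    from Crypto.Cipher import AES
    from Crypto.Random import get_random_bytes

    key, nonce = get_random_bytes(32), get_random_bytes(16)

    # Encrypt some data and compute the authentication tag.
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    ciphertext, tag = cipher.encrypt_and_digest(b'precious backup data')

    # A single flipped bit in storage or transit...
    tampered = bytearray(ciphertext)
    tampered[5] ^= 0xFF

    # ...decrypts without any error; nothing signals corruption
    # until verify() is called at the very end.
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    garbage = cipher.decrypt(bytes(tampered))  # in nFreezer, already written to disk
    try:
        cipher.verify(tag)
    except ValueError:
        print('Tag check failed -- but the garbage was already written out')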

As some HN crypto celebrity said some time ago, if you write "AES" in your code then you're doing it wrong. You MUST use misuse-resistant libraries unless you know exactly what you're doing.

TL;DR: your crypto is broken, use NaCl instead of doing it yourself.


    except ValueError:
        print('Incorrect key or file corrupted.')
So if there is a tag problem, it is clearly logged and you know something is wrong. The good thing is that you can easily edit it to do this instead (passing the filename fn to the function so it can be logged):

    print('Incorrect key or file corrupted, will be deleted:', fn)
    os.remove(fn)
    exit(...)
Real question: is it possible to `.verify(tag)` before having decrypted the whole file? I doubt it is. So one option could be to write the file to a temporary place and, only once the tag is verified, move it to its final location (and delete it if verification fails). Another option would be to do a first pass of decrypt() without writing anything to disk, get the tag and verify it, and then, if it's OK, redo the whole decryption, writing to disk this time. The latter might be a bit too extreme, as it halves the performance.
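
A quick sketch of the temp-file option (the function name and signature are hypothetical, this is not the current nFreezer code):

    import os, tempfile
    from Crypto.Cipher import AES

    BLOCKSIZE = 16 * 1024 * 1024

    def decrypt_to_file(src, dest, key, nonce, tag):
        """Stream-decrypt src into a temp file; move it to dest only if
        the GCM tag verifies, otherwise delete it and re-raise."""
        cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
        fd, tmppath = tempfile.mkstemp(dir=os.path.dirname(dest) or '.')
        try:
            with os.fdopen(fd, 'wb') as out:
                while True:
                    block = src.read(BLOCKSIZE)
                    if not block:
                        break
                    out.write(cipher.decrypt(block))
            cipher.verify(tag)         # raises ValueError if key/data is bad
            os.replace(tmppath, dest)  # atomic: dest appears only if valid
        except ValueError:
            os.remove(tmppath)
            raise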


> So if there is a tag problem, it is clearly logged and you know something is wrong [...] The good thing is that you can easily edit it to do this instead (passing the filename fn to the function so it can be logged)

That's the thing: the script in its current version is incorrect, and even doing that won't be a perfect solution. That's why other people are saying that other software, widely used and able to do more than nFreezer does, should be looked at before trying to do it your own way.

It's good not to rely on anyone else, but crypto is the one domain where you can't have "good enough" -- it's either correct, or it's not.

> Another option would be to do a first pass of decrypt() without writing anything to disk, get the tag and verify it, and then, if it's OK, redo the whole decryption, writing to disk this time

Yep, that's the way: do the decryption in memory, or in /tmp, verify the tag, and only then put the file where it belongs. I just checked the API of the crypto module, and there's a `decrypt_and_verify` method that should do it properly.
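
For small files that fit in memory, it is essentially a one-liner (standalone sketch):

    from Crypto.Cipher import AES
    from Crypto.Random import get_random_bytes

    key, nonce = get_random_bytes(32), get_random_bytes(16)
    ciphertext, tag = AES.new(key, AES.MODE_GCM, nonce=nonce).encrypt_and_digest(b'file contents')

    # Raises ValueError *before* returning any plaintext if the tag is wrong,
    # so nothing ever reaches the disk on corruption.
    plaintext = AES.new(key, AES.MODE_GCM, nonce=nonce).decrypt_and_verify(ciphertext, tag)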

Of course that's problematic, especially for big files, so what you want to do is chunk the files, encrypt the chunks separately, and store the file as a list of such chunks.
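
A sketch of that idea (helper names are hypothetical, not code from any of the tools mentioned): encrypt fixed-size chunks independently, each with a fresh nonce and its own tag, and bind the chunk index as associated data so chunks can't be reordered:

    from Crypto.Cipher import AES
    from Crypto.Random import get_random_bytes

    CHUNK = 4 * 1024 * 1024  # 4 MiB of plaintext per chunk

    def encrypt_chunks(src, key):
        """Yield (nonce, tag, ciphertext) triples, one per chunk."""
        index = 0
        while True:
            block = src.read(CHUNK)
            if not block:
                break
            nonce = get_random_bytes(12)
            cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
            cipher.update(index.to_bytes(8, 'big'))  # AAD: chunk position
            ciphertext, tag = cipher.encrypt_and_digest(block)
            yield nonce, tag, ciphertext
            index += 1

    def decrypt_chunks(chunks, key, out):
        for index, (nonce, tag, ciphertext) in enumerate(chunks):
            cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
            cipher.update(index.to_bytes(8, 'big'))
            # Each chunk is verified before anything is written, with bounded memory.
            out.write(cipher.decrypt_and_verify(ciphertext, tag))

The overhead is only 28 bytes per chunk (12-byte nonce + 16-byte tag), and a corrupted chunk is detected before its plaintext ever touches the disk.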

The step after that is to use Content-Defined Chunking, i.e. chunking based on the content of the file. This way, when a big file is modified, only the chunk around the modification changes; the rest of the file is chunked exactly the same way. So you don't need to store the full content of each version of the file, just a small-ish diff.
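
As a toy illustration of the principle (real tools use tuned rolling hashes such as buzhash or Rabin fingerprints), here is a content-defined chunker based on a Rabin-Karp-style rolling hash:

    WINDOW = 64                       # rolling-hash window, in bytes
    MASK = (1 << 13) - 1              # cut when low 13 bits match: ~8 KiB average chunks
    MIN_CHUNK, MAX_CHUNK = 2048, 65536
    BASE, MOD = 257, (1 << 31) - 1
    POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window

    def cdc_chunks(data):
        """Yield content-defined chunks of data (bytes). Boundaries depend only
        on the surrounding bytes, so an insertion early in the file only
        changes the chunks near the edit."""
        start, h = 0, 0
        window = bytearray()
        for i, byte in enumerate(data):
            if len(window) == WINDOW:
                h = (h - window.pop(0) * POW) % MOD  # roll the oldest byte out
            window.append(byte)
            h = (h * BASE + byte) % MOD              # roll the new byte in
            size = i + 1 - start
            if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

Hash each chunk (say with SHA-256), and you only need to upload chunks whose hash the destination has never seen.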

That's not a novel system, bup (https://github.com/bup/bup) kinda pioneered it... and as others have advised, restic, borg-backup and tarsnap do exactly that.


bup (https://github.com/bup/bup) kinda pioneered it... and as others have advised, restic, borg-backup and tarsnap do exactly that.

According to wikipedia, bup was released in 2010, 3 years after Tarsnap started doing this. (And Tarsnap wasn't the first either.)


To clarify: if you start from a given nonce and key:

    cipher = AES.new(key, AES.MODE_GCM, nonce)
    while True:
        block = f.read(16*1024*1024)
        if not block:
            break
        out.write(cipher.encrypt(block))
you get exactly the same result as if you did it in one pass (given enough RAM, i.e. more than the file size):

    cipher = AES.new(key, AES.MODE_GCM, nonce)
    out.write(cipher.encrypt(f.read()))
Please try it with pycryptodome; you will see that it is.
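
For instance, this standalone check (with in-memory buffers instead of real files) passes:

    import io
    from Crypto.Cipher import AES
    from Crypto.Random import get_random_bytes

    key, nonce = get_random_bytes(32), get_random_bytes(16)
    data = get_random_bytes(50 * 1024 * 1024)  # 50 MiB of test data

    # One pass.
    one_shot = AES.new(key, AES.MODE_GCM, nonce=nonce).encrypt(data)

    # Block by block, same key and nonce.
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    f, out = io.BytesIO(data), io.BytesIO()
    while True:
        block = f.read(16 * 1024 * 1024)
        if not block:
            break
        out.write(cipher.encrypt(block))

    assert out.getvalue() == one_shot  # identical ciphertext, byte for byte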

You might find the naming unfortunate; .init(), .update(), etc. might have been better names to emphasize this.

So this shows that, in its current state, the chunking is just a "RAM-efficient" way to encrypt: it writes exactly the same encrypted content as if you did encrypt(...) in one pass. So as long as the file is under ~2^39 bits (about 64 GiB), it is fine (see https://csrc.nist.gov/publications/detail/sp/800-38d/final).

___

Then there is another layer of chunking that would be possible, and it would add many benefits: even better deduplication, and avoiding re-encrypting a whole 10 GB file when only a few bytes have changed. This would be an interesting addition, it's on the todo list, and I know some other programs do it, of course.

But to clarify: this "content-chunking" is independent of the "RAM-efficient" chunking I use here.

If you want to continue the discussion (more convenient than here), you're very welcome to open a GitHub issue.

Thanks for your remarks, I appreciate them.


Very well said. Sometimes simplicity carries immense value and reassurance.


Don’t forget restic - it also works great with S3-compatible services!

https://restic.net


Yes, restic would be my pick: easy to deploy and use, well-maintained.

If you want some assurance, CERN are deploying restic:

https://indico.cern.ch/event/862873/contributions/3724442/


Yes, restic is great! No Python, no dependencies (unlike borg): just a single Go executable and that's it.


I'll have a look too.

When using it with SFTP, does the restic binary need to be installed on both the local and remote machines, or just the local one?


Also want to plug an old HN favourite - tarsnap.

https://www.tarsnap.com/



