Hacker News
Dropbox clone with git, ssh, EncFS and rsync.net
54 points by rsync on Oct 15, 2013 | 28 comments
We've been trying to solve this problem for longer than Dropbox has existed: how can you maintain plaintext locally and ciphertext remotely, easily, with efficient changes-only (a la rsync) sync?

We created a lot of half-working Truecrypt/Filevault/encfs schemes and fiddled with a lot of --partial --progress --whatever switches to rsync, but eventually we just told people to use duplicity[1][2] and call it a day.

Then Mr. Raymii came along and dropped this in our lap:


The tl;dr there is:

"This article describes my truly secure, encrypted file synchronization service. It uses EncFS and dvcs-autosync, which lets me share only the encrypted data and mount that locally to get the plaintext."

... and we couldn't be happier. Finally we (and anyone using rsync.net, or any ssh host with git on it) have an elegant way to sidestep the issues of trust and authority[3] over remote data on systems you don't control.

Enjoy! [4]

[1] http://duplicity.nongnu.org/

[2] ... and it's still a very good solution.

[3] http://www.rsync.net/resources/notices/canary.txt

[4] HN discount is 10c/GB/mo. Just email us.

Does this make any attempt to handle conflicts on files when changed on two offline hosts? I assume since it's based on Git it will just try to do merges that way, but how does it handle conflicts that can't be automatically resolved?

Ideally git-annex should be able to cover this in the future.

How much does encryption defeat the rsync algorithm? I don't see anything in the article that addresses that.

That is, say you change one line in a 1 MB text file. If you rsync the plaintext, it will transfer a handful of bytes. If you encrypt and then rsync the ciphertext, then presumably a lot more bytes will be scrambled and you will have a bigger diff to transfer. I guess it depends on the block size.

It depends.

If the resulting file is truly random vs. the last time you mounted it, then you (of course) need to retransmit everything.

However, imagine if every time you closed your TC volume or unmounted FileVault it had to rewrite an entirely new multi-GB file. Of course that would take forever and would be difficult to use.

So most encryption image software organizes the image internally such that newly written data only affects portions of the file. In theory, rsync with options like --partial --inplace --whatever will then transfer it efficiently.

BUT, our experience is that this seldom works properly. The amount of internally changed data varies dramatically from usage to usage, and often has very little to do with the amount of actual data you changed. We just never got it to work well, consistently.
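The bandwidth difference being discussed can be illustrated with a toy block-matching sketch in Python. This is not the real rsync algorithm (which also uses a rolling weak checksum to find matches at arbitrary offsets); it only compares aligned fixed-size blocks, which is enough to show the effect:

```python
import hashlib, os

BLOCK = 4096

def block_hashes(data):
    """Hashes of each fixed-size block, as the receiver would advertise them."""
    return {hashlib.sha1(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

def bytes_to_send(old, new):
    """Bytes a naive block-level delta sync would transfer (aligned blocks only)."""
    known = block_hashes(old)
    return sum(len(new[i:i + BLOCK])
               for i in range(0, len(new), BLOCK)
               if hashlib.sha1(new[i:i + BLOCK]).digest() not in known)

# One byte changed in a 1 MB plaintext file: only the containing block is resent.
plain_old = b"x" * 1_000_000
plain_new = bytearray(plain_old)
plain_new[500_000] = ord("y")
print(bytes_to_send(plain_old, bytes(plain_new)))   # -> 4096

# A file re-encrypted from scratch (fresh key/IV) shares no blocks with its
# previous ciphertext, so everything must be resent:
enc_old, enc_new = os.urandom(1_000_000), os.urandom(1_000_000)
print(bytes_to_send(enc_old, enc_new))              # -> 1000000
```

Encrypted containers that rewrite only touched regions fall somewhere between these two extremes, which is the inconsistency described above.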

Does this mean you guys found that a large portion of the encrypted file had to be retransmitted as compared with the number of bits actually changed?

How does one consistently store encrypted data using rsync.net as a backend, whilst maintaining the benefits of the rsync protocol?

This seems to rule out the benefits of using rsync w/ encrypted files.

Is there something I'm missing?

The thing is, to stay secure you need to re-encrypt the whole file. Otherwise you will get this: https://en.wikipedia.org/wiki/Block_cipher_modes_of_operatio... That's why EncFS uses the second mode in the list (chaining).
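The cost of chaining is easy to see with a toy sketch of CBC-style propagation. The "cipher" below is just a keyed hash (so this toy cannot be decrypted, and it is not EncFS's actual construction); the point is only that ciphertext block i depends on every block before it, so one flipped bit changes every later block:

```python
import hashlib, os

BLOCK = 16

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_like(key, iv, data):
    """Chain each block into the next, CBC-style, using a keyed hash as
    a stand-in for the block cipher."""
    prev, out = iv, []
    for i in range(0, len(data), BLOCK):
        prev = hashlib.sha256(key + xor(data[i:i + BLOCK], prev)).digest()[:BLOCK]
        out.append(prev)
    return out

key, iv = b"k" * 16, b"\x00" * 16
msg = bytearray(os.urandom(64 * BLOCK))   # 64 plaintext blocks
c1 = cbc_like(key, iv, bytes(msg))
msg[32 * BLOCK] ^= 1                      # flip one bit in block 32
c2 = cbc_like(key, iv, bytes(msg))

changed = [i for i in range(64) if c1[i] != c2[i]]
print(changed[0], len(changed))           # -> 32 32: block 32 and everything after
```

Blocks 0-31 are untouched, which is why a mid-file change still forces roughly half the ciphertext to be retransmitted on average.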

Could you expand a bit? I've been thinking of building a bandwidth efficient encrypted syncing system, and this article (almost) hits the spot.

What is the best way to get this done, weighing security considerations against bandwidth?

I'm not highly knowledgeable in security, but the gist is: if a block cipher uses the simplest mode of operation (ECB), each block is encrypted independently - good for sync bandwidth, but an attacker can derive the structure of the file from it, which helps recover it, or it can be used to check whether a known file is present in the encrypted data (watermark attack [1]). More advanced block cipher modes "blend" blocks to some extent, hiding file structure information (but if you change a single bit, a bunch (or all) of the blocks change). There are also stream ciphers, which don't use blocks at all: if you change 1 bit, a stream cipher changes 1 bit in its output. The problem is that it's insecure to use the same key twice with stream ciphers (reused key attack [2]) - e.g. if files with known text are encrypted (think English text, XML files, JPEG headers). Of course this can be overcome by using an initialization vector (a "seed" used along with the key), but that IV should be different for each file and you have to store it somewhere to be able to decrypt.

[1] https://en.wikipedia.org/wiki/Watermarking_attack [2] https://en.wikipedia.org/wiki/Stream_cipher_attack
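The reused-key attack in [2] takes only a few lines to demonstrate: XOR two ciphertexts that share a keystream and the key cancels out, leaving the XOR of the plaintexts. A minimal sketch with a toy XOR "stream cipher" and hypothetical messages:

```python
import os

def xor(a, b):
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))

keystream = os.urandom(64)                  # same keystream reused for two files
p1 = b"Attack at dawn on the west gate."
p2 = b"Retreat at dusk via the tunnel.."
c1, c2 = xor(p1, keystream), xor(p2, keystream)

# An attacker who sees both ciphertexts learns p1 XOR p2 without the key,
# and if p1 is known (a standard file header, say), recovers p2 outright:
print(xor(c1, c2) == xor(p1, p2))           # -> True
print(xor(xor(c1, c2), p1))                 # -> b'Retreat at dusk via the tunnel..'
```

This is exactly why a per-file IV (or nonce) is non-negotiable: it guarantees the keystream is never reused across files.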

Try doing the Matasano challenges :). I completed the first two and it was my first exposure to block cipher weaknesses. I had almost zero crypto knowledge before that. All you need to know is programming and basic arithmetic.

YES!! This is exactly what I've been looking to do!! Going to replicate this to RamNode host where I get 50GB for $1.38/month.

I see price there starting from $24/year.

Sorry, forgot to mention the coupon code from serverbear.com: SB31

31% off for life - 24*0.69/12=1.38


Edit: Not to mention the other uses I get out of my VPS - voice chat server, VPN tunneling for my Roku (in Canada), game server, etc. Of course, if you are storing important information, I wouldn't suggest loading up your VPS with services, increasing the risk of hacking.

Thank you!

... someone on irc asked about encfs for windows, which appears to be here:


We have never used this and cannot vouch for it in any way, but there it is ...

Could Tahoe-LAFS be used for this purpose?

A few comments about that...

To be a Tahoe-LAFS "target" you do need to be running their code on the server side ... and it's Python.

We make a point to keep our environment as simple and sanitized as possible, which implies having no interpreters, so at this time you can't use rsync.net as a Tahoe-LAFS target.

BUT! We've always been very excited about Tahoe-LAFS and are well acquainted with Zooko and his team, etc., and so we are experimenting with a frozen[1] implementation of it that we can place into our environment as a binary executable[2].

The two solutions aren't really that related, as Tahoe-LAFS (sort of) implies that rsync.net would be just one of many (perhaps ten) remote containers out there, whereas this solution is targeted to just one remote host... but since you asked ...

[1] http://cx-freeze.sourceforge.net/

[2] We already do this with rdiff-backup, which is how we are able to support that ...

That's exciting, and something I might be interested in. My question and concern, however, is that Tahoe-LAFS seems better suited for distributed storage among unreliable nodes. I consider rsync.net to be fairly durable storage, especially since the way I use it is as backup and not the only storage location for a file. I have the same question about using S3 as a tahoe backend, which is another thing I've seen.

Of course, you could just use Tahoe-LAFS to store everything on one or two nodes when they're reliable and durable, but then why not just use gpg or encfs, which don't require custom clients or gateway/introducer nodes?

I'm not sure how easy it would be to store git repos on Tahoe-LAFS, but there are two other options - 1) duplicity has native support for Tahoe-LAFS as a backend, or 2) use git-annex with a special remote [1].

[1] http://git-annex.branchable.com/special_remotes/

I believe so. If I understand Tahoe-LAFS, it can be used in a system where you don't trust the host of the data. The other benefit of using it is that it has data duplication in case one server fails.

> IMPORTANT! Make sure to remove the .encfs files from the secure folder before syncing. IF THESE FILES ARE IN THE SYNCED FOLDER, YOUR FILES ARE MUCH MORE EASIER TO BE CRACKED.

What?! No, don't do it! If you lose your .encfs6.xml file, you've pretty much lost everything; you can't decrypt your files. Of course you don't have to sync it - you can back it up by other means - but you can't just remove it.

I think ZFS is more appropriate than git in this context


High Five. And with PEFS, there's no comparison: https://github.com/glk/pefs

List of cryptographic filesystems: http://en.wikipedia.org/wiki/List_of_cryptographic_file_syst...

I just use SSH or stunnel w/ a SQL database with procedures built in.

Umm... git-annex does this without it taking half an hour, plus it takes very little effort to set up in multiple locations.

git-annex is explicitly designed to avoid putting file contents under version control because git doesn't handle large files (like videos) very well. It seems like this system is running headlong into that problem. I'm probably going to be sticking with duplicity for now.

@prirun is HashBackup open source? Or closed source and filled with backdoors?

You can do this with HashBackup too (I'm the author). There are sites using HB to back up terabytes of data. One of the larger sites has 25M files they back up every day.

  1. Copy ssh public key to rsync.net server:

  [jim@mb hbdev]$ scp ~/.ssh/id_rsa.pub XXXX@usw-s002.rsync.net:.ssh/authorized_keys
			 100%  392     0.4KB/s   00:00    

  2. Create a local HashBackup backup directory:

  [jim@mb hbdev]$ hb init -c hb
  HashBackup build 1070 Copyright 2009-2013 HashBackup, LLC
  Backup directory: /Users/jim/hbdev/hb
  Permissions set for owner access only
  Created key file /Users/jim/hbdev/hb/key.conf
  Key file set to read-only
  Setting include/exclude defaults: /Users/jim/hbdev/hb/inex.conf

  VERY IMPORTANT: your backup is encrypted and can only be accessed with
  the encryption key, stored in the file:
  You MUST make copies of this file and store them in a secure location,
  separate from your computer and backup data.  If your hard drive fails, 
  you will need this key to restore your files.  If you setup any
  remote destinations in dest.conf, that file should be copied too.

  Backup directory initialized

  3. Create a dest.conf file in the backup directory:

  [jim@mb hbdev]$ cat - >hb/dest.conf
  destname rsyncnet
  type rsync
  dir NNNN@usw-s002.rsync.net:
  password XYZZY

  4. Enable dedup using up to 1GB of memory:

  [jim@mb hbdev]$ hb config -c hb dedup-mem 1g
  HashBackup build 1070 Copyright 2009-2013 HashBackup, LLC
  Backup directory: /Users/jim/hbdev/hb
  Current config version: 0
  Showing current config

  Set dedup-mem to 1g (was 0)

  5. Create a test file, 10 x 10k random blocks

  [jim@mb hbdev]$ dd if=/dev/urandom of=ran10k bs=10k count=1
  1+0 records in
  1+0 records out
  10240 bytes transferred in 0.001602 secs (6392272 bytes/sec)

  [jim@mb hbdev]$ cat ran10k ran10k ran10k ran10k ran10k ran10k ran10k ran10k ran10k ran10k >test100k

  [jim@mb hbdev]$ ls -l test100k
  -rw-r--r--  1 jim  staff  102400 Oct 15 14:20 test100k

  6. Backup the test file:

  [jim@mb hbdev]$ hb backup -c hb test100k
  HashBackup build 1070 Copyright 2009-2013 HashBackup, LLC
  Backup directory: /Users/jim/hbdev/hb
  Using destinations in dest.conf
  This is backup version: 0
  Dedup enabled, 0% of current, 0% of max
  Writing archive 0.0
  Copied arc.0.0 to rsyncnet (11 KB 2s 5.0 KB/s)
  Copied hb.db.0 to rsyncnet (4.3 KB 2s 2.1 KB/s)
  Copied dest.db to rsyncnet (340 B 2s 162 B/s)

  Time: 4.8s
  Wait: 6.5s
  Checked: 5 paths, 125418 bytes, 125 KB
  Saved: 5 paths, 125418 bytes, 125 KB
  Excluded: 0
  Dupbytes: 0
  Compression: 87%, 8.1:1
  Space: 15 KB, 15 KB total
  No errors

  7. Create a different test file, with junk at the beginning, middle, and end:

  [jim@mb hbdev]$ echo abc>junk

  [jim@mb hbdev]$ cat junk test100k junk test100k junk >test200k
  [jim@mb hbdev]$ ls -l test200k
  -rw-r--r--  1 jim  staff  204812 Oct 15 14:24 test200k

  8. Backup the new file:

  [jim@mb hbdev]$ hb backup -c hb test200k
  HashBackup build 1070 Copyright 2009-2013 HashBackup, LLC
  Backup directory: /Users/jim/hbdev/hb
  Using destinations in dest.conf
  This is backup version: 1
  Dedup enabled, 0% of current, 0% of max
  Writing archive 1.0
  Copied arc.1.0 to rsyncnet (42 KB 2s 15 KB/s)
  Copied hb.db.1 to rsyncnet (4.6 KB 1s 2.3 KB/s)
  Copied dest.db to rsyncnet (388 B 1s 207 B/s)

  Time: 3.0s
  Wait: 6.7s
  Checked: 5 paths, 227898 bytes, 227 KB
  Saved: 5 paths, 227898 bytes, 227 KB
  Excluded: 0
  Dupbytes: 122880, 122 KB, 53%
  Compression: 79%, 4.8:1
  Space: 47 KB, 62 KB total
  No errors

  9. What do the stats look like?

  [jim@mb hbdev]$ hb stats -c hb
  HashBackup build 1070 Copyright 2009-2013 HashBackup, LLC
  Backup directory: /Users/jim/hbdev/hb

		     2 completed backups
		353 KB file bytes checked since initial backup
		353 KB file bytes saved since initial backup
		    0s total backup time
		176 KB average file bytes checked per backup in last 2 backups
		176 KB average file bytes saved per backup in last 2 backups
		  100% average changed data percentage per backup in last 2 backups
		    0s average backup time for last 2 backups
		353 KB file bytes currently stored
		     2 archives
		 53 KB archive space
		 53 KB active archive bytes - 100%
		   5:1 industry standard dedup ratio
		 26 KB average archive space per backup for last 2 backups
		   6:1 reduction ratio of backed up files for last 2 backups
		6.2 MB dedup table current size
		     4 dedup table entries
		    0% dedup table utilization at current size
		     2 files
		     6 paths
		    12 blocks
		     6 unique blocks
		16,386 average variable-block length (bytes)

  10. How much space are we using on the rsync server?

  [jim@mb hbdev]$ ssh XXXX@usw-s002.rsync.net ls -l
  total 309
  -rw-r--r--  1 XXXX  XXXX     33 Oct 15 18:22 DESTID
  -rw-r--r--  1 XXXX  XXXX  11216 Oct 15 18:22 arc.0.0
  -rw-r--r--  1 XXXX  XXXX  42416 Oct 15 18:25 arc.1.0
  -rw-r--r--  1 XXXX  XXXX    388 Oct 15 18:25 dest.db
  -rw-r--r--  1 XXXX  XXXX   4308 Oct 15 18:22 hb.db.0
  -rw-r--r--  1 XXXX  XXXX   4604 Oct 15 18:25 hb.db.1

  11. By default, HB creates a local backup too.  Delete the
      local arc files (this is the file backup data)

  [jim@mb hbdev]$ hb config -c hb cache-size-limit 0
  HashBackup build 1070 Copyright 2009-2013 HashBackup, LLC
  Backup directory: /Users/jim/hbdev/hb
  Current config version: 2
  Showing current config

  Set cache-size-limit to 0 (was -1)

  [jim@mb hbdev]$ hb backup -c hb /dev/null
  HashBackup build 1070 Copyright 2009-2013 HashBackup, LLC
  Backup directory: /Users/jim/hbdev/hb
  Using destinations in dest.conf
  This is backup version: 2
  Dedup enabled, 0% of current, 0% of max
  Removing archive 2.0
  Copied hb.db.2 to rsyncnet (4.0 KB 1s 2.0 KB/s)
  Copied dest.db to rsyncnet (404 B 7s 57 B/s)

  Time: 2.5s
  Wait: 9.2s
  Checked: 3 paths, 5692 bytes, 5.6 KB
  Saved: 3 paths, 5692 bytes, 5.6 KB
  Excluded: 0
  Dupbytes: 0
  Compression: 29%, 1.4:1
  Space: 4.0 KB, 66 KB total
  No errors

  [jim@mb hbdev]$ ls -l hb
  total 30600
  -rw-r--r--  1 jim  staff       65 Oct 15 14:03 HBID
  -rw-r--r--  1 jim  staff       76 Oct 15 14:05 dest.conf
  -rw-r--r--  1 jim  staff     4096 Oct 15 14:31 dest.db
  -rw-r--r--  1 jim  staff  6291716 Oct 15 14:31 hash.db
  -rwxr-xr-x  1 jim  staff  9182624 Aug 10 09:45 hb
  -rw-r--r--  1 jim  staff   139264 Oct 15 14:31 hb.db
  -rw-r--r--  1 jim  staff     4308 Oct 15 14:22 hb.db.0
  -rw-r--r--  1 jim  staff     4604 Oct 15 14:25 hb.db.1
  -rw-r--r--  1 jim  staff     4012 Oct 15 14:31 hb.db.2
  -rw-r--r--  1 jim  staff        6 Oct 15 14:31 hb.lock
  -rw-r--r--  1 jim  staff      412 Oct 15 14:31 hb.sig
  -rw-r--r--  1 jim  staff      511 Oct 15 14:03 inex.conf
  -r--------  1 jim  staff      333 Oct 15 14:03 key.conf

  12. Mount the backup as a FUSE filesystem:

  [jim@mb hbdev]$ hb mount -c hb mnt >mount.log 2>&1 &
  [1] 62772

  13. What's in the mounted filesystem?

  [jim@mb hbdev]$ ls -l mnt
  total 8
  drwx------  1 jim  staff  1 Oct 15 14:22 2013-10-15-1422-r0
  drwx------  1 jim  staff  1 Oct 15 14:25 2013-10-15-1425-r1
  drwx------  1 jim  staff  1 Oct 15 14:31 2013-10-15-1431-r2
  drwx------  1 jim  staff  1 Oct 15 14:31 latest
  [jim@mb hbdev]$ find mnt/latest

  14. All file attributes are correct in the mounted filesystem:

  [jim@mb hbdev]$ ls -l mnt/latest/Users/jim/hbdev
  total 602
  -rw-r--r--  1 jim  staff  102400 Oct 15 14:20 test100k
  -rw-r--r--  1 jim  staff  204812 Oct 15 14:24 test200k

  [jim@mb hbdev]$ ls -l test*k
  -rw-r--r--  1 jim  staff  102400 Oct 15 14:20 test100k
  -rw-r--r--  1 jim  staff  204812 Oct 15 14:24 test200k

  15. Test the backup: are the remote and local files equal?

  [jim@mb hbdev]$ time cmp test100k mnt/latest/Users/jim/hbdev/test100k

  [jim@mb hbdev]$ time cmp test200k mnt/latest/Users/jim/hbdev/test200k

  [jim@mb hbdev]$ time cmp ran10k mnt/latest/Users/jim/hbdev/test100k
  cmp: EOF on ran10k

  (ran10k and remote test100k are equal for 1st 10k, which is correct)
Beta site is: http://www.hashbackup.com

There is a security doc on the site that has been reviewed by Bruce Schneier.
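The Dupbytes figures in the transcript come from variable-block (content-defined) deduplication, which the "average variable-block length" stat hints at. Below is a toy sketch of the general idea in Python; this is not HashBackup's actual algorithm, and the parameters and numbers are illustrative only. The key property is that chunk boundaries follow content rather than fixed offsets, so a few inserted bytes don't shift every subsequent block out of alignment:

```python
import hashlib, random

def chunks(data, window=16, mask=0x3FF, min_len=512):
    """Cut a chunk wherever a rolling sum over the last 16 bytes hits a
    magic value, so boundaries follow content and survive insertions."""
    out, start, s = [], 0, 0
    for i in range(len(data)):
        s += data[i]
        if i - start >= window:
            s -= data[i - window]          # maintain rolling window sum
        if i - start + 1 >= min_len and (s & mask) == mask:
            out.append(data[start:i + 1])
            start, s = i + 1, 0
    if start < len(data):
        out.append(data[start:])           # final partial chunk
    return out

store = set()                              # chunk hashes already "on the server"

def backup(data):
    """Back up one file; return (new_bytes, dup_bytes)."""
    new = dup = 0
    for c in chunks(data):
        h = hashlib.sha256(c).digest()
        if h in store:
            dup += len(c)
        else:
            store.add(h)
            new += len(c)
    return new, dup

random.seed(0)
test100k = bytes(random.getrandbits(8) for _ in range(100 * 1024))
junk = b"abc\n"
test200k = junk + test100k + junk          # junk shifts everything by 4 bytes

r1 = backup(test100k)
print(r1)                                  # -> (102400, 0): first backup is all new
r2 = backup(test200k)
print(r2)                                  # most of test200k dedups despite the shift
```

With fixed-size blocks instead, the 4-byte prefix would misalign every block and dedup would find nothing.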

If someone is looking for an alternative to a closed-source product, I don't see why another closed-source product would be the answer (according to HashBackup's FAQ it won't be open source). Usually the goal of replacing a closed-source product is to have something that you can fully control / tweak / audit ...

