
Asynchronous filesystem replication with FUSE and rsync - Lethalman
https://github.com/immobiliare/sfs
======
jewel
FUSE is definitely the way to go with file synchronization, since the system
will never miss a write, and can lazily load data for reads. It means that the
entire filesystem doesn't have to fit on every machine that's synced. For a
more sophisticated FUSE sync filesystem, be sure to check out orifs:

[http://ori.scs.stanford.edu/](http://ori.scs.stanford.edu/)

The best introduction to orifs is their paper, which is linked from the above
site.

~~~
nona
Very interesting. I always wanted a solution for asynchronously replicating
hundreds of GBs over long distances.

I looked longingly at InterMezzo/Coda a long time ago, but they never went
anywhere; I played with block-level replication (lvmsync), but it doesn't allow
concurrent use; in the end, the only solutions I could fall back on were rsync
(which needs to iterate over the entire directory structure, which is crazy
expensive) and git and/or unison (both of which can't cope with many GBs).
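The full-tree iteration cost is easy to demonstrate: a scanning syncer like plain rsync has to stat every entry on every pass, even when nothing has changed. A minimal Python sketch, with a tiny tree standing in for the millions of entries involved:

```python
import os
import tempfile

# Build a small tree: 10 directories x 10 files each
# (a stand-in for the huge trees discussed here).
root = tempfile.mkdtemp()
for d in range(10):
    sub = os.path.join(root, "dir%d" % d)
    os.mkdir(sub)
    for f in range(10):
        with open(os.path.join(sub, "file%d" % f), "w") as fh:
            fh.write("x")

def scan_cost(path):
    """Count the stat() calls a scanning syncer makes per pass,
    whether or not anything changed."""
    stats = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for name in dirnames + filenames:
            os.stat(os.path.join(dirpath, name))
            stats += 1
    return stats

print(scan_cost(root))  # 110: every entry examined, zero changes
```

The cost grows linearly with tree size no matter how little changed, which is what makes frequent rsyncs of huge trees so expensive.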

I'm going to give orifs a try.

~~~
jewel
Just in case you're not aware, here are some other options:

btsync ([http://getsync.com](http://getsync.com)) works well with at least
500GB of photo-sized files.

git-annex works well for large collections. I am using this for my own photo
collection, which is about 450GB. I like it because you can do partial
checkout.

I haven't tried it with a large collection but git-annex assistant should work
well too, if you're interested in automatic sync.

~~~
icebraining
Another git-annex user with a few hundred GBs here. I have my laptop, a VPS, a
Nexus tablet and an S3 account being synced without any issues (with a
different subset of files on each).

------
pmoriarty
I wonder if this could be used as a replacement for Vagrant's rsync
feature.[1]

From Vagrant's documentation:

    Vagrant can use rsync as a mechanism to sync a folder to the
    guest machine. This synced folder type is useful primarily in
    situations where other synced folder mechanisms are not
    available, such as when NFS or VirtualBox shared folders aren't
    available in the guest machine.

    The rsync synced folder does a one-time one-way sync from the
    machine running to the machine being started by Vagrant.

The disadvantage of the above is that it's a one-time, one-way sync. SFS would
overcome this limitation, if I'm not mistaken.

[1] [http://docs.vagrantup.com/v2/synced-folders/rsync.html](http://docs.vagrantup.com/v2/synced-folders/rsync.html)

~~~
geerlingguy
It's one-time right now (though there's an open ticket to make it
bidirectional [1]), but you can use `vagrant rsync-auto` to watch for changes
and continuously sync. I posted an article a few weeks ago highlighting one of
the reasons I use rsync rather than NFS shares with Vagrant [2].

Though I would love to see FUSE+rsync (like this article mentions) as a
default/standard option in Vagrant!

[1]
[https://github.com/mitchellh/vagrant/issues/3062](https://github.com/mitchellh/vagrant/issues/3062)

[2] [http://www.midwesternmac.com/blogs/jeff-geerling/nfs-rsync-a...](http://www.midwesternmac.com/blogs/jeff-geerling/nfs-rsync-and-shared-folder)

------
0x0
Wouldn't the inotify API be a better way to detect file writes rather than
writing a full FUSE wrapper?

~~~
Lethalman
Yes, we thought a lot about using inotify; our first prototype used it too.

- Our system needs to cope with millions of directories. Millions of
directories for inotify mean a lot of structures in the kernel; for large
numbers it can also mean gigabytes of RAM. Add to that the userspace mapping
of watch descriptors needed to handle rename operations.

- Using inotify would take a lot of time at startup to recurse into millions
of subdirectories.

- inotify will not automatically watch new directories. You have to list all
the files right after the creation of a directory, watch the subdirectories,
and so on. Not a problem at all, but way simpler with FUSE.

- If your system cannot keep up with inotify events, you miss events, because
the kernel cannot buffer them all. For us it's better to slow the system down
than to ever miss an event.

- inotify is attached to the original filesystem. That means it's hard (not
impossible) to handle loops in an active/active replication setup, whereas
with SFS the replication writes go straight to the underlying filesystem,
bypassing the mountpoint.

- If an inotify-based application crashes, you lose events, because software
keeps writing to the filesystem. If SFS crashes, the mountpoint becomes
unwritable; the application sees an error and can switch to a different
storage.

As you can read, some of these choices depend on the requirements, and ours
were not met by inotify.
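To put the kernel-memory point in rough numbers, here is a back-of-envelope sketch. The per-watch cost is an assumed ballpark (roughly 1 KiB per watch on 64-bit kernels is a commonly cited figure), not a number from this thread:

```python
# Assumed ballpark: ~1 KiB of kernel memory per inotify watch.
BYTES_PER_WATCH = 1024

def watch_memory_gib(num_directories):
    """Estimated kernel memory for one inotify watch per directory."""
    return num_directories * BYTES_PER_WATCH / 2.0 ** 30

print(watch_memory_gib(5_000_000))  # roughly 4.8 GiB for 5M directories
```

At these scales the watch state alone approaches the "gigabytes of RAM" mentioned above, before counting the userspace bookkeeping.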

~~~
rwmj
There's a new thing (fanotify). LWN did a short series on different filesystem
notification methods in July:

[https://lwn.net/Articles/604686/](https://lwn.net/Articles/604686/)

[https://lwn.net/Articles/605128/](https://lwn.net/Articles/605128/)

Unfortunately the fanotify article is yet to come!

(Having said that, I certainly appreciate your pain trying to use inotify to
track write events on an entire filesystem)

~~~
mopo2000
fanotify is super limited, and not useful for this app. It's from 2009, but
has seen little use.

------
kyledrake
This is neat. I like that it's using stable off-the-shelf unix components.

I'm putting together the new [https://neocities.org](https://neocities.org)
fileserver stuff right now, so I'll definitely be looking into this.

The current plan is to use hourly rsyncs, and then implement this (or some
flavor of it):
[http://code.google.com/p/lsyncd/](http://code.google.com/p/lsyncd/)

RE inotify vs FUSE: the former delivers events through a notification API,
while the latter, I believe, intercepts the filesystem operations themselves.
Which one is better here is entirely debatable. Gluster uses a similar
approach to this one, I believe. I'm not an expert on unix file APIs, so take
all of this with a grain of salt.

The biggest reason we can't use Gluster replication is that if you request a
file while replicating, it asks all the servers whether the file is on them,
instead of just failing because it's not on the local system. That's fine for
many things, but it's an instant DDoS if someone runs siege on the server and
just blasts you with random bunk 404 requests. You can't cache your way out of
that one. Apparently performance for lots of small-file requests can be pretty
slow too.
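The amplification can be made concrete with a toy model: a store that broadcasts lookups for missing files multiplies every bogus request by the number of peers, while a local-only check costs exactly one lookup. The numbers are illustrative; this is not Gluster's actual lookup protocol:

```python
def broadcast_lookups(requests, peers):
    """Missing file: every peer is asked before a 404 is returned."""
    return requests * peers

def local_lookups(requests):
    """Missing file: one local stat, then an immediate 404."""
    return requests

bogus = 10_000  # random 404 requests from an attacker running siege
print(broadcast_lookups(bogus, peers=8))  # 80000 backend lookups
print(local_lookups(bogus))               # 10000 backend lookups
```

Since the requests are for files that don't exist, there is nothing to cache, so the multiplier applies to every single request.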

SSHFS (and rsync using SSH) blow S3 out of the water on performance for remote
filesystem work. The difference is pretty insane.

~~~
pmoriarty
_" it's an instant DDoS if someone runs siege on the server and just blasts
you with random bunk 404 requests"_

Would you run your fileservers on a publicly accessible network? If not, then
the above attack would require the attackers to have access to your private,
internal network, and at that point it could be argued that you've already
lost.

Of course, it's good to design for defense-in-depth, so that you could survive
a hit on your internal network as well. But I'm not sure how much sleep I'd
lose over the possibility of a DDoS attack against a private network.

On the other hand, if your servers are on a private network, I'm not sure why
you'd use SFS over something like NFS.

~~~
emeraldd
In this context a web server is just a fancy interface to a file server, so
you have a bunch of read-only "file servers" sitting on the public web talking
to a private read/write back end.

------
patrickg_zill
I have been looking into this, and my current idea is tending towards using
VMs running DragonFly BSD.

VM-1 is a local NFS server running DragonFly BSD and the HAMMER filesystem,
with many nice features (auto snapshots, etc.). It will be fast, especially if
VM-1 is on the same physical host node as my worker VMs.

VM-2 is remote and receives the DragonFly BSD filesystems I send it. All
snapshots etc. from the original FS are retained. If the connection is
interrupted, the HAMMER sync code will figure it out and restart from the
latest successful transaction.

~~~
senorsmile
Sounds really cool. Have you actually implemented this as a prototype yet?

------
andyidsinga
For newly built applications, why not set up a Ceph or Swift cluster and then
use an S3 interface for access to files/objects? Total and minimum replica
counts can be configured, so you get something like eventual consistency if
you use a minimum smaller than the total.

~~~
Lethalman
That's true. I've tried Ceph, Swift and Riak CS.

Our requirements were two servers with several terabytes of storage, and
losing data because of an unknown bug in the filesystem wasn't an option.

Such filesystems are quorum-based and certainly need more than two servers for
terabytes of data, which is more than we could afford. Also, if you lose
quorum, the system hangs to ensure atomicity. But we don't need such
properties.

Also, after some testing, those filesystems turned out to be about 5-10%
slower than traditional filesystems, with few nodes of course. They are
designed to scale across many nodes.

~~~
andyidsinga
Thank you

------
Lethalman
I'm an employee at Immobiliare.it, and yesterday we released some internal
software to the public on GitHub for the first time. Just wanted to share our
work :)

~~~
jjviana
I wonder how your system deals with write conflicts. The documentation is not
clear about that...

~~~
icebraining
Sounds clear to me:

 _Because of no locks, the last write wins according to rsync update semantics
based on file timestamp + checksum. In case two files have the same mtime,
rsync compares the checksum to decide which one wins._

[https://github.com/immobiliare/sfs#consistency-model](https://github.com/immobiliare/sfs#consistency-model)
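The quoted rule can be sketched in Python. This illustrates the described semantics (newer mtime wins; a checksum comparison breaks ties), not SFS's actual code, and the ordering used for the checksum tie-break is an assumption:

```python
import hashlib
import os
import tempfile

def file_md5(path):
    """Content checksum used only for the mtime tie-break."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

def winner(path_a, path_b):
    """Last write wins: newer mtime takes the file; on an mtime tie,
    fall back to comparing content checksums (the tie-break ordering
    here is an assumption for illustration)."""
    ma = os.stat(path_a).st_mtime
    mb = os.stat(path_b).st_mtime
    if ma != mb:
        return path_a if ma > mb else path_b
    return path_a if file_md5(path_a) >= file_md5(path_b) else path_b

# Demo: the replica written later wins the conflict.
d = tempfile.mkdtemp()
a, b = os.path.join(d, "a"), os.path.join(d, "b")
for p, text in ((a, "from server 1"), (b, "from server 2")):
    with open(p, "w") as fh:
        fh.write(text)
os.utime(a, (1000, 1000))  # older write
os.utime(b, (2000, 2000))  # newer write
print(winner(a, b) == b)   # True
```

Note that with no locks, either side's content can survive a true simultaneous write; the rule only guarantees that both replicas converge on the same pick.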

------
hrez
Since it's using FUSE, it would be nice if it had lazy initial syncing; i.e.,
on an empty slave filesystem it would fetch and serve accessed files even if
they haven't been synced yet, at least until the full sync completes. Millions
of files or GBs of data can take a long time for an initial sync, and lazy
syncing would allow serving clients with no delay.

~~~
Lethalman
It's already like this: there's no initial sync.

------
minaguib
This is very interesting indeed for the described use case. AFAIK the other
alternative is lsyncd.

~~~
Lethalman
lsyncd uses inotify. Also, AFAIK it rsyncs (or uses other backends on) the
whole tree instead of single files, though that's of course not a showstopper,
as it would only need a few changes to implement.

In addition, I don't think lsyncd has been designed for an active/active
setup, rather for a master-slave setup.

~~~
gnur
Lsyncd does not sync the entire tree every time. It depends on the situation;
most of the time in our use case it starts an rsync for 5 specific files. Only
on startup, or when the system is overloaded, does it sync the entire tree.

~~~
Lethalman
Good, thanks for clarifying.

------
vtemian
Did you check
[https://news.ycombinator.com/item?id=8735937](https://news.ycombinator.com/item?id=8735937)?

~~~
Lethalman
No, I just read about it on HN today. It's interesting. I guess with millions
of files you run out of inodes quickly if you don't cut the history
periodically. But it's a good idea.

------
illumen
Would this work with subfolders being synced from different machines onto a
single machine?

Even better, could it grab data as required from other machines?

~~~
Lethalman
You mean only one way from N machines to a single machine? Yes. When other
machines add/change any file it will be uploaded to that single machine.

------
fit2rule
Poor name for a filesystem - there are already a couple of filesystems named
"SFS"...

------
anon4
If this could be ported to windows, mac, android and ios, I'd drop dropbox in
a heartbeat.

~~~
emcrazyone
I too am basically looking for a Windows port of rsync. I'm aware of a project
that runs on top of Cygwin (DeltaCopy?) but it hasn't kept up with the latest
rsync protocol, last I checked.

Any recommendations?

~~~
beagle3
rsync runs very well in Cygwin (or on its own without it - just copy rsync.exe
and cygwin1.dll from a Cygwin installation and you're good to go, provided you
have some ssh; if you want rsync to use Cygwin's ssh, you'll also need to copy
ssh.exe and a handful of other .dlls).

There doesn't seem to be anything else close to rsync's robustness or
completeness of functionality - on Windows or Linux.

