
LeoFS: a highly available, distributed, eventually consistent object/blob store - e_proxus
https://github.com/leo-project/leofs
======
majewsky
The feature list sounds extremely similar to OpenStack Swift [1]. In fact, the
first sentence on there is nearly identical to the submission title: "Swift is
a highly available, distributed, eventually consistent object/blob store."

Advantages of LeoFS over Swift seem to be NFS support (I think there were some
efforts to expose Swift containers as filesystems, but I didn't see anything
becoming stable) and regional replication (which is tedious with Swift).

I quickly skimmed the LeoFS docs, and I'm a bit confused: they talk about
master nodes and slave nodes while at the same time claiming that there is no
SPOF. So I would guess there is some sort of leader election going on. Swift
doesn't do anything like that; the cluster shares a ring file that determines
where each file goes, and if you have N replicas, there are at most 2N
locations where those replicas can be stored (N regular locations and N
fallback locations).
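
A toy sketch of that placement idea (this is not Swift's actual ring
implementation, just an illustration of "N primaries plus N fallbacks"; all
names here are made up):

```python
import hashlib

def ring_locations(obj_name, disks, n_replicas):
    """Toy ring: hash the object name to pick a starting disk, then take
    the next n_replicas disks as primary locations and the n_replicas
    after those as fallbacks. Real Swift uses a precomputed ring file
    with partitions, but the candidate set is similarly bounded."""
    start = int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % len(disks)
    primaries = [disks[(start + i) % len(disks)] for i in range(n_replicas)]
    fallbacks = [disks[(start + n_replicas + i) % len(disks)]
                 for i in range(n_replicas)]
    return primaries, fallbacks

disks = ["disk%d" % i for i in range(8)]
primaries, fallbacks = ring_locations("photos/cat.jpg", disks, 3)
# at most 2 * n_replicas candidate locations in total
```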

Another thing that concerns me is that the documentation implies that the
servers need to come up in the correct order (see [2], section "Method of
server launch"). One thing that I like about Swift is that individual nodes
and even individual services on each node can be operated completely
independently, which keeps my Kubernetes setup and Helm charts simple.

[1] [http://docs.openstack.org/developer/swift/](http://docs.openstack.org/developer/swift/)

[2] [http://leo-project.net/leofs/docs/getting_started/getting_st...](http://leo-project.net/leofs/docs/getting_started/getting_started_2.html)

~~~
dan353hehe
> They talk about master nodes and slave nodes, while at the same moment
> claiming that there is no SPOF.

If I remember right (it's been a while), the master and slave nodes mentioned
are only in the manager, which is in control of the ring. Every other
component can keep operating with nodes failing all over the place.

------
dan353hehe
I remember evaluating this project several years ago (maybe 2 or 3?) as a
possible S3 replacement, and it actually worked fairly well. We ended up not
using it because our requirements changed. Yosuke Hara, the main developer on
the project, was very helpful. We needed a way for clients that didn't speak
the S3 API to access the files, and after talking with him we prototyped an
NFS endpoint in a few days that kind of worked, while he worked on getting an
actual one added in.

------
corobo
Any functionality or plans for deduplication? For example, if I used this for
an image host and a handful of users uploaded the same image, the image would
only get stored ${replication} times instead of ${replication}*${uploads}
times.

I had been using Skylable's SX server for this, but unfortunately it looks
like they're ceasing operations, so I'm in the market for a decent
replacement.

~~~
zenlikethat
Why not hash the image on upload and store it under a name derived from the
hash? That way, if a copy already exists, it will just get overwritten on the
next upload.
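
A minimal sketch of that content-addressed approach (assuming a simple
key/value store interface; the names are made up):

```python
import hashlib

def store_by_hash(store, data):
    """Name the blob by its SHA-256 digest; identical uploads collapse
    into a single stored object (content addressing)."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data  # a re-upload of the same bytes overwrites in place
    return key

store = {}
k1 = store_by_hash(store, b"same image bytes")
k2 = store_by_hash(store, b"same image bytes")
# both uploads map to the same key, so the bytes are stored once
```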

~~~
aquark
Someone would still need to manage object lifetimes in the presence of
deletes. If the storage layer does it, then it is one less thing to handle at
the application level.

If you upload by hash, then you need your own reference counting for deletes
(or just don't support delete ... storage is cheap, after all!)
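
A rough sketch of what that reference counting could look like on top of the
content-addressed store (toy code, assuming in-memory dicts stand in for the
real storage layer):

```python
import hashlib

class HashedStore:
    """Content-addressed store with reference counting, so a blob is
    only physically removed when the last uploader deletes it."""
    def __init__(self):
        self.blobs = {}  # digest -> bytes
        self.refs = {}   # digest -> reference count

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        self.refs[key] = self.refs.get(key, 0) + 1
        return key

    def delete(self, key):
        self.refs[key] -= 1
        if self.refs[key] == 0:  # last reference gone: reclaim the blob
            del self.blobs[key]
            del self.refs[key]

s = HashedStore()
k = s.put(b"img")
s.put(b"img")   # a second user uploads the same image
s.delete(k)     # first delete: blob survives
s.delete(k)     # second delete: blob is reclaimed
```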

~~~
Spivak
Wouldn't any account-based system have an indirect reference counter
basically for free? Images are associated with accounts, so you can safely
delete an image when no accounts own it.

------
gcoguiec
Another alternative to S3 cloud storage is Minio
([https://www.minio.io](https://www.minio.io)); it works pretty well.

~~~
Royalaid
I am currently using Minio in a side project and was going to ask how LeoFS
compares.

~~~
nine_k
Minio is small; it can run as a single node in a container on your dev
laptop.

LeoFS is a large, distributed, no-SPOF thing that only makes sense if you
have multiple boxes and want high reliability at the cost of some redundancy.

Both happen to speak S3, but that's about it.

------
windkithk
Thanks for your interest in the project.

I would position LeoFS as an in-house S3-compatible storage, so you can
access your objects quickly within your own environment.

One of the strengths of LeoFS is handling small objects (e.g. images, logs,
...); you can find more at the benchmark report page.

[https://github.com/leo-project/notes/tree/master/leofs/bench...](https://github.com/leo-project/notes/tree/master/leofs/benchmark/leofs/1.3/1.3.1/front)

Over the past few months we have focused on improving the stability and
performance of LeoFS.

Please feel free to test it and report any issues you encounter on GitHub; we
highly appreciate user feedback.

------
rdtsc
I see good things there:

* S3 compatibility

* NFS support (this is really neat)

* Proper distribution and fault tolerance built in

I like the benchmark numbers:

[https://github.com/leo-project/notes/tree/master/leofs/bench...](https://github.com/leo-project/notes/tree/master/leofs/benchmark/leofs/1.3/1.3.1/front/20161216_image_f4m_load_eleveldb_1)

Seems to be used in production at Rakuten.

------
zigsandzags
On the note of distributed systems, are there systems that perform well when
a node is not connected? E.g., I'm designing my own "distributed FS" _(though
my needs are very different)_, mainly because I don't see this in many
places. Figured it's on topic here.

E.g., I want a single point of contact, my local node _(laptop, etc.)_, which
can behave normally whether it is connected to all of the network, part of
the network, or none of it. The availability of the data at large obviously
depends on how much of the network you're connected to, but that's fine for
me.

This gives me the UX of being able to write to the filestore anywhere, rather
than losing the connection to a self-hosted S3 when I'm out of cell signal
range or whatever. Then later, when I'm back and connected to the network,
the nodes configured to own portions of the data can pull it as needed.

Are there other distributed offline-able filesystems like this?

~~~
ergl
The performance of such a system really depends on what you expect from it.

If your system is distributed without full replication (and no overlap
between nodes), then having offline nodes is easy: you just access your local
data, and the rest of the network can't access it.

If you want to support full (or even partial) replication, you'll want to
detect (and maybe resolve) conflicts, and depending on how long you think a
node can be offline, you'll need different mechanisms. If you expect nodes to
be offline for really long periods of time, you'll need to keep a lot of
state around to be able to detect conflicts later, which makes the system
perform worse (higher latency while replicating data).
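
One common mechanism for that kind of conflict detection is a version vector
per object. A minimal sketch (real systems also need tombstones and pruning
of old entries; the node IDs here are made up):

```python
def compare(vv_a, vv_b):
    """Compare two version vectors ({node_id: counter}).
    Returns 'a_newer', 'b_newer', 'equal', or 'conflict' when neither
    history contains the other (i.e. concurrent offline writes)."""
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

# the laptop edits a file offline while the server also takes a write:
# neither history contains the other, so this is a conflict to resolve
result = compare({"laptop": 2, "server": 1}, {"laptop": 1, "server": 2})
```

The longer a node stays offline, the more of these per-object vectors (and
deleted-object tombstones) everyone has to retain, which is the extra state
mentioned above.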

------
tannhaeuser
I'm all for options in the storage-service space, but "FS" signals (to me)
that this acts as a (POSIX) filesystem. Does it expose NFS RPC? If not, what
does an S3-like storage API expose that basic HTTP/1.1 with e.g. range
requests doesn't?

~~~
nine_k
The page says that NFS support is present, but is beta quality.

------
colanderman
This sounds a lot like Ceph [1], which is much more widely used (and has the
backing of Red Hat). Why would one use LeoFS over Ceph?

[1] [http://ceph.com/](http://ceph.com/)

~~~
mocchira
LeoFS is used not only on Linux but also on FreeBSD[1] and SmartOS through
Project FIFO[2].

[1]
[https://www.freshports.org/databases/leofs](https://www.freshports.org/databases/leofs)

[2] [https://docs.project-fifo.net/docs/leofs-overview](https://docs.project-fifo.net/docs/leofs-overview)

------
the_duke
Apparently written in Erlang, which I reckon to be a good fit.

But 56% of the code is "Shell code". What's up with that?

~~~
dan353hehe
The linked repo doesn't really contain all the Erlang code; it's mostly
designed to pull and build the actual projects, hence the "Shell code".

~~~
the_duke
That makes sense, thanks for the info.

------
mgalka
Sounds intriguing. But how does storing the objects as blobs perform better
than storing them as standard files?

~~~
majewsky
Seeing how they emphasize using XFS for the backing disks [1], I'm pretty sure
that the objects/blobs end up as actual files on the disk (same as with
OpenStack Swift).

So in the end, an "object/blob" is the same as a "file". The different term
emphasizes that the user does not know (or care) which disk actually holds the
file, or how many replicas exist. That's abstracted away by the cluster.

[1] [http://leo-project.net/leofs/docs/installation/install_2.htm...](http://leo-project.net/leofs/docs/installation/install_2.html)

~~~
mocchira
Actually, we have adopted a file format based on Haystack[1][2], the object
store originally developed by Facebook for their photo storage platform.

[1] [https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf)

[2] [https://code.facebook.com/posts/685565858139515/needle-in-a-...](https://code.facebook.com/posts/685565858139515/needle-in-a-haystack-efficient-storage-of-billions-of-photos/)

------
illumin8
Looks interesting, but S3 gives me my choice of eventual or read-after-write
consistency, as well as 11 nines of durability. Why would I use this instead?

~~~
devty
For one, this project is open source and can be installed on your own set of
machines.

~~~
joneholland
There is no way that's cheaper than s3 though.

~~~
kikoreis
Are you serious? It's definitely possible, and almost simple, to build a
limited-scale service that is cheaper than S3. S3 has the Amazon footprint
behind it, which you can't compete with, but the price is not hard to beat.

~~~
matt_wulfeck
Yes, but who will be on call to manage, monitor, and fix it when it breaks?
Answer: you! S3 is one of those things that's so cheap and works so well that
I'd never think of leaving it.

If it came down to it, you can lease DX (Direct Connect) lines from AWS to
your datacenter switch and get 10 Gbps to S3 pretty cheap.

~~~
sroussey
It is quite affordable until you have many files / objects.

