
GlusterFS 3.5 released - conductor
http://www.gluster.org/2014/05/glusterfs-3-5-unveiled/
======
jacquesm
Gluster is one of those projects that seem underexposed. In the early days
(when I was following the project more closely) I got to know some of the core
developers quite well, and I've always been impressed with the overall
architecture and the goal set.

If you feel like you're a good C programmer have a look at the codebase and
prepare to have your mind expanded.

It's a very nice piece of software and very strong proof that not all
successful development is done in SV (almost all of GlusterFS before the
buyout was written in India; not sure what the current situation is).

~~~
amatix
Red Hat seems to be investing a lot of (needed) effort and resources into it,
and it's certainly being actively improved. The RH "product" version of
Gluster is Red Hat Storage Server:
[http://www.redhat.com/products/storage-server/](http://www.redhat.com/products/storage-server/)

------
axanoeychron
I use GlusterFS for storage across my LAN. Each machine has its own brick and
uses replication for a bit of safety. It works really well when you want to go
a level below 'file sync' tools like Dropbox and use the network as the file
system without having a dedicated NAS - just use your existing computers.

~~~
rwmj
The thing that stopped me from setting this up is the incompatibility between
versions. With so many different versions of Linux on my LAN I just cannot
ensure that they would all have the same version of gluster installed, so
gluster is a non-starter for me.

~~~
strikerz
Have you tried more recent versions of GlusterFS? Starting with GlusterFS 3.3,
all major versions are compatible. Even the recently released GlusterFS 3.5 is
compatible with GlusterFS 3.3.

------
rdl
I tried using Gluster a couple of years ago, but gave up and went with pairs
of systems with DRBD and OCFS2 (which I'm incredibly happy with), on Ubuntu.
My backing store is RAID6, so it's probably overkill -- I'm going to switch to
RAID1 on pairs of 4-6TB drives in the next hardware rev, on Atom 2758 or
low-end Xeon systems with 32GB RAM.

The main failure I was getting involved VMs. I know I could do something with
OpenStack, but this is mainly a platform to experiment with, so I wanted to be
able to support any kind of VM.

~~~
Nux
With DRBD you are stuck with 2 nodes, and OCFS2 has given me nothing but
headaches, at least in trying to get it to work on CentOS 6. Apparently Oracle
is "forcing" people to use their RHEL clone, or at least their kernel, to get
OCFS2. When I had to use DRBD I went with CLVM.

Having said that, starting with v3.4.0 gfapi can be used if your Qemu supports
it (as is the case with CentOS 6.4+ or Ubuntu 14.04), thereby bypassing the
FUSE layer. This should give you improved performance.
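
To get a feel for what that looks like from an application's side, here is a
rough libgfapi sketch (the volume name, server and path are made up; this is
roughly the same API Qemu and Samba link against):

    /* Minimal sketch: talk to a Gluster volume through libgfapi instead of a
     * FUSE mount. "server1", "myvol" and "/hello.txt" are placeholders.
     * Build, roughly: gcc gfapi_demo.c -lgfapi */
    #include <glusterfs/api/glfs.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        glfs_t *fs = glfs_new("myvol");                   /* volume name */
        if (!fs)
            return 1;
        glfs_set_volfile_server(fs, "tcp", "server1", 24007);
        if (glfs_init(fs) != 0) {                         /* fetch volfile, connect */
            glfs_fini(fs);
            return 1;
        }

        /* Write and read back a file, entirely in user space -- no FUSE. */
        glfs_fd_t *fd = glfs_creat(fs, "/hello.txt", O_RDWR, 0644);
        if (fd) {
            const char msg[] = "hello via gfapi\n";
            glfs_write(fd, msg, strlen(msg), 0);
            glfs_close(fd);
        }

        char buf[64] = {0};
        fd = glfs_open(fs, "/hello.txt", O_RDONLY);
        if (fd) {
            glfs_read(fd, buf, sizeof(buf) - 1, 0);
            glfs_close(fd);
            printf("read back: %s", buf);
        }

        glfs_fini(fs);
        return 0;
    }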

~~~
wazoox
OCFS2 is extremely easy to set up and use with Debian (basically just apt-get
install ocfs2-console, launch it and let it guide you). On the other hand,
RedHat (and I suppose its clones) makes it easier to use GFS2, while GFS2 is
incredibly complex to set up on Debian.

------
mike-cardwell
Is anyone here using GlusterFS in production? I'd be interested to hear about
how people have recovered from critical hardware failure when using it...

~~~
syrneus
We run hundreds of GlusterFS clusters in production on EC2. We're currently on
3.0 and in the process of fully migrating to 3.4 (and maybe 3.5 one day).

Our primary use case for Gluster is in serving persistent filesystems for
Drupal. Our customers store potentially millions of files on their GlusterFS
clusters.

We've built a number of tools/processes to help protect Gluster against
failures in EC2 (for instance fencing network traffic at the iptables layer to
help protect GlusterFS clients from hanging while talking to down nodes), as
well as to help our team perform common tasks (resizing clusters, moving
customers from cluster to cluster, etc.). We haven't necessarily hit blocker
issues recovering from underlying hardware failures, but our team is
definitely very experienced with many possible failure modes.

Overall GlusterFS has been very reliable over the years and our research has
shown it is the best option out there for when our customers can't use
something such as S3 directly.

If you want more details or would love to hack on an 8000+ node EC2 cluster
running things such as GlusterFS, feel free to ping me.

~~~
orf
Just out of interest, why don't you use S3 for this? Amazon provides a few
options for scalable storage; is it cheaper to roll your own on top of EC2?

~~~
syrneus
Many of our customers do use S3 and we make use of S3 extensively ourselves.
However, Drupal often expects to operate on a POSIX compatible filesystem.
Drupal 7 does support PHP file streams which can be configured to use S3, but
not every Drupal module follows the best practices. Plus, we support every
flavor of Drupal under the sun (including custom code).

All of our enterprise customers receive a highly available setup running on
multiple nodes--thus, we have the need for a persistent filesystem attached to
multiple EC2 instances. We utilize GlusterFS to ensure all of our clients have
the filesystem capabilities their apps may need.

------
read
I have some questions on GlusterFS:

(1) How do you secure GlusterFS traffic? Does GlusterFS use some kind of
encryption on the wire or do you have to manually set up a VPN?

(2) How well does GlusterFS work if you host bricks on web servers? If data
are sharded per user, can you make a web server always have fast access to
that user data (can you guarantee a copy of the data is always hosted locally
on that web server?) -- or is this something that only happens indirectly
through the OS cache?

(3) Do you need something more from the underlying filesystem to guarantee
data integrity, like running GlusterFS on ZFS? Or is GlusterFS enough?

~~~
Nux
> (1) How do you secure GlusterFS traffic? Does GlusterFS use some kind of
> encryption on the wire or do you have to manually set up a VPN?

Although I have not used the feature, there is support for SSL; look it up. I
have also heard about people running it over OpenVPN, though I'm not sure how
successful they were.

> (2) How well does GlusterFS work if you host bricks on web servers? If data
> are sharded per user, can you make a web server always have fast access to
> that user data (can you guarantee a copy of the data is always hosted
> locally on that web server?) -- or is this something that only happens
> indirectly through the OS cache?

Don't have first-hand experience with this, but I can recommend Joe Julian's
blog, e.g.
[http://joejulian.name/blog/optimizing-web-performance-with-glusterfs/](http://joejulian.name/blog/optimizing-web-performance-with-glusterfs/)

> (3) Do you need something more from the underlying filesystem to guarantee
> data integrity, like running GlusterFS on ZFS? Or is GlusterFS enough?

GlusterFS relies on the underlying filesystem to do its job; the recommended
one is XFS with a 512-byte inode size.

------
kiyoto
It looks like ClassMethod (a dev shop based in Japan) is using it:
[http://dev.classmethod.jp/cloud/aws/glusterfs-with-fluentd/](http://dev.classmethod.jp/cloud/aws/glusterfs-with-fluentd/)

------
nwmcsween
There are issues with GlusterFS, specifically that it's userspace and uses
FUSE; small IO kills performance due to context switches.

~~~
Nux
FUSE is being bypassed more and more with the introduction of gfapi. Qemu and
Samba can now both "talk" it, so you don't need FUSE for those any more, for
example.

The native client still uses it, though, and this is unlikely to change any
time soon, afaik.

The performance hit is disputable; there aren't really better alternatives.
Most distributed filesystems out there are using FUSE too (XtreemFS, MooseFS),
and I am not touching Lustre.

There are issues with everything; dismissing it like that is not really fair
or constructive.

~~~
colin_mccabe
_FUSE is being bypassed more and more with the introduction of gfapi. Qemu and
Samba can now both "talk" it, so you don't need FUSE for those any more, for
example._

It should be possible to avoid the userspace to kernelspace transition in the
client by using LD_PRELOAD with a shim library as well.
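
Roughly the mechanism in question, as a sketch (SPECIAL_PREFIX and the
diversion are made up here; a real shim would have to cover dozens of libc
calls, not just open()):

    /* Sketch of an LD_PRELOAD shim: open() calls for paths under a chosen
     * prefix could be handled by a user-space client library instead of going
     * to the kernel; everything else falls through to the real libc open().
     * Build: gcc -shared -fPIC -o shim.so shim.c -ldl
     * Use:   LD_PRELOAD=./shim.so some_program */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    #define SPECIAL_PREFIX "/mnt/cluster/"

    int open(const char *path, int flags, ...)
    {
        mode_t mode = 0;
        if (flags & O_CREAT) {
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t) va_arg(ap, int);
            va_end(ap);
        }

        if (strncmp(path, SPECIAL_PREFIX, strlen(SPECIAL_PREFIX)) == 0) {
            /* A real shim would hand the request to the user-space client
             * (e.g. something gfapi-based) and return its fd here, never
             * entering the kernel for the data path. This sketch only logs. */
            fprintf(stderr, "shim: would divert open(%s)\n", path);
        }

        /* Fall through to the real libc open(). */
        static int (*real_open)(const char *, int, ...);
        if (!real_open)
            real_open = (int (*)(const char *, int, ...))
                dlsym(RTLD_NEXT, "open");
        return real_open(path, flags, mode);
    }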

 _The performance hit is disputable; there aren't really better alternatives.
Most distributed filesystems out there are using FUSE too (XtreemFS, MooseFS),
and I am not touching Lustre._

The Ceph distributed filesystem has an in-kernel client which is upstream.
XtreemFS and MooseFS are designed for a different use-case (WAN filesystem)
and aren't really directly comparable.

Lustre runs the majority of the world's supercomputers. It seems a little
unfair to dismiss them without at least giving a reason why. They have made
some progress towards getting their kernel client upstream recently.

I was a Ceph developer for a while. I read about Red Hat's recent acquisition
of Inktank with interest. (Inktank was founded by Sage Weil, the creator of
Ceph, to commercialize the filesystem.) Since Red Hat previously acquired the
main company behind Gluster, this makes things a little-- how shall we say?--
interesting. It's unclear to me whether Red Hat will want to support two
distributed filesystems going forward, or whether they will try to streamline
things.

~~~
notacoward
"It should be possible to avoid the userspace to kernelspace transition in the
client by using LD_PRELOAD with a shim library as well."

We (I'm a GlusterFS developer) actually did this for a while once, and people
have recently started talking about doing it again. With GFAPI it shouldn't
even be that complicated, but there are still problems e.g. around fork(2).

"The Ceph distributed filesystem has an in-kernel client which is upstream."

...and it's probably not a coincidence that the file system component is the
one piece of Ceph that still hasn't reached production readiness. Development
velocity counts. The fact that it uses FUSE has rarely been the cause of
performance problems in GlusterFS. More often than not, the real culprit is
(relative) lack of caching, which is fixable in user space.

"XtreemFS and MooseFS are designed for a different use-case (WAN filesystem)"

Pretty true for XtreemFS - for which I have the utmost respect and which I
promote at every opportunity - but MooseFS targets _exactly_ the same use case
as GlusterFS. OK, a subset, because they don't have all the features and
integrations we do. ;) I'd also add PVFS/OrangeFS, which is contemporaneous
with Lustre. It doesn't use FUSE, but has its own user-space "interceptor"
which is equivalent.

"Lustre runs the majority of the world's supercomputers."

It runs a lot of the world's _biggest_ supercomputers, because those people
can afford to keep full-time Lustre developers on staff to baby-sit it. Not
ops staff - _developers_ to apply the latest patches, add their own, etc. I
spent over two years at SiCortex trying to make Lustre usable for our
customers. At that time, I believe LLNL had four Lustre developers. ORNL had
slightly fewer. Cray, DDN, etc. each had their own as well. When it works,
Lustre can be great. On the other hand, few users can afford to devote that
level of staff to running a distributed file system. Those that can't will
find themselves in the weeds with MDS meltdowns and "@@@ processing error"
messages _all the time_.

Because of this, I'd say it's Lustre that's not really targeting the same
market or use case as GlusterFS. We _never_ encounter them head to head, in
either the corporate or community context. The "performance at any cost"
market is a hard place to make a living, and it barely overlaps at all with
the "performance plus usability" market.

"It's unclear to me whether Red Hat will want to support two distributed
filesystems going forward"

Why not? They've supported multiple local file systems for a long time, and
there's an even bigger overlap there. When you look at both Ceph and GlusterFS
in terms of distributed _data_ systems rather than just file systems
specifically, maybe things will look a bit different. Now we're talking about
block and object as well as files. Maybe we're talking about integrating
distributed storage with distributed computation in ways not covered by any of
those interfaces. We're certainly talking about users having their own
preferences in each of these areas. If there are enough APIs, and enough users
who prefer one over the other for a certain kind of data or vertical market,
then it makes quite a bit of sense to continue maintaining two separate
systems. On the other hand, _of course_ we want to reduce the number of
components we'll have to maintain, and there will be plenty of technology
sharing back and forth. Only time will tell which way the balance shifts.

~~~
colin_mccabe
I'm curious what your experience was with the LD_PRELOAD thing. By "problems
with fork," you mean the possibility of forking without that environment
variable set via execvpe or similar?

 _MooseFS targets exactly the same use case as GlusterFS._

Yeah, actually you are quite right. I was getting MooseFS confused with a
different filesystem. MooseFS looks like it has a GFS heritage. Kind of like
what I am working on now... HDFS.

 _...and it's probably not a coincidence that the file system component is
the one piece of Ceph that still hasn't reached production readiness._

Ceph's filesystem component is reasonably stable. It's the multi-MDS use case
that still had problems (at least a few years ago, when I was working on the
project). The challenge was coordinating multiple metadata servers to do
dynamic subtree partitioning and other distributed algorithms. Ceph has a FUSE
client which you can use if you don't want to install kernel modules, and a
library API.

It seems that Lustre has an in-kernel server. This might have led to some of
the maintenance difficulties people seem to have with it. (I never worked on
Lustre myself.) I don't think this discredits the idea of in-kernel _clients_,
which are different beasts entirely. Especially when the client is in the
mainline kernel, rather than an out-of-tree patch like with Lustre in the
past.

It's a tough market out there for clustered filesystems. HPC is shrinking due
to government sequesters and budget freezes. The growth area seems to be
Hadoop, but most users prefer to just run HDFS, since it has a proven track
record and is integrated well into the rest of the project. Moving into the
object store world is one option, but that is also a very crowded space. It
will be interesting to see how things unfold.

~~~
notacoward

      > By "problems with fork," you mean the possibility of
      > forking without that environment variable set via execvpe
      > or similar?
    

That might be a problem, but it's not the one I was thinking of. The problem
with LD_PRELOAD hacks is that they need to maintain some state about the
special fds on the special file system. That state immediately becomes stale
when you fork (because copy on write) and gets blown away when you exec.
Therefore you always end up having to store the state in shared memory
(nowadays probably an mmap'ed file) with extra grot to manage that. Even then,
exec and dlopen won't work on things that aren't real files. So it's not
impossible, but it gets pretty tedious (especially when you have to re-do all
this for the 50+ calls you need to intercept) and there are always some
awkward limitations.
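
To make the usual workaround concrete, a small sketch (the struct, field names
and backing path are all made up): keep the shim's table in a file-backed
MAP_SHARED mapping, so a forked child and its parent keep seeing each other's
updates, and a re-exec'ed process can re-attach by mapping the same file.

    /* Sketch: interceptor state in a file-backed MAP_SHARED region, so it
     * survives fork() without diverging and can be re-attached after exec(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct shim_state {
        char paths[1024][256];   /* path behind each "special" fd, indexed by fd */
    };

    static struct shim_state *attach_state(const char *backing_file)
    {
        int fd = open(backing_file, O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, sizeof(struct shim_state)) != 0) {
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, sizeof(struct shim_state),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);               /* the mapping keeps the state accessible */
        return p == MAP_FAILED ? NULL : p;
    }

    int main(void)
    {
        struct shim_state *st = attach_state("/tmp/shim-state");
        if (!st)
            return 1;
        strcpy(st->paths[42], "/cluster/some/file");
        if (fork() == 0) {
            /* Updates made in the child are visible to the parent and vice
             * versa -- a plain copy-on-write table would silently diverge. */
            strcpy(st->paths[43], "/cluster/opened-in-child");
            _exit(0);
        }
        wait(NULL);
        printf("parent sees child's entry: %s\n", st->paths[43]);
        return 0;
    }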

With regard to in-kernel clients, I'm not trying to discredit them, but
they're not the only viable alternative. User-space clients have their place
too, as I'm sure you know if you work on HDFS. Every criticism of FUSE clients
goes double for JVM clients, but both can actually work and perform well. It
seemed like some people were dissing GlusterFS because of something they'd
heard from someone else, who'd heard something from someone else, based on
versions of FUSE before 2.8 (when reverse invalidations and a lot of
performance improvements went in). This being HN, it seemed like they were
just repeating FUD uncritically, so I sought to correct it. The fact that
GlusterFS uses FUSE is simply not a big issue affecting its performance.

