
Google and Red Hat announce cloud-based scalable file servers - Sami_Lehtinen
http://googlecloudplatform.blogspot.com/2016/02/Google-and-Red-Hat-announce-cloud-based-scalable-file-servers.html
======
justinclift
Heyas,

Ex-GlusterFS person here (used to work at Red Hat on the project side, leaving
mid last year).

"Small file access", and "lots of files in a directory" have been a pain point
with GlusterFS for ages. The 3.7.0 release had some important improvements in
it, specificially designed to help fix that:

[https://www.gluster.org/community/documentation/index.php/Fe...](https://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf)

The latest Gluster release is 3.7.8 (the same series as 3.7.0), and is worth
looking at if you need a good distributed file system. If you have something
like 1 million files in a single directory though... hrmmm... NFS or other
technologies might still be a better idea. ;)

~~~
DannoHung
What makes having a lot of files in a directory a hard problem?

~~~
notacoward
Hi. I'm one of the GlusterFS developers. Hi, Justin. ;)

The basic answer to your question is that _networks are slow_ compared to
local storage. In order to get decent performance, you must either avoid
network round trips or amortize their cost over many operations. We - not just
GlusterFS but distributed filesystems in general - can do this pretty well for
some operations. We can batch, buffer, cache, etc. This works great for plain
old reads and writes to large files (for example). It doesn't work so well for
operations that have to touch many small files. For those, in order to ensure
the required level of consistency/currency, we have two choices.

(1) Send a request per file to get current metadata.

(2) Cache metadata, and participate in some sort of consistency protocol to
make sure we don't serve stale cached information.

Both approaches have workloads where they perform better and workloads where
they perform worse. In addition, the second approach adds _a lot_ of
complexity, especially in a system where failures are common and a loosely
coordinated set of servers must respond (systems with a single master server
have an easier time here but are less resilient). The inherent difficulty of
this approach is why e.g. CephFS has taken so long to mature.

GlusterFS took the first route instead. It does mean that "ZOT" (Zillions Of
Tiny files) workloads will perform poorly. I won't deny that. On the other
hand, it's easier to test or prove correct, and the time not spent on solving
the hard version of that problem - often for little eventual benefit - can
instead be spent on other kinds of improvements. Some people are happy with
that. Some are not. Some spread FUD. Some try to implement the practical
equivalent of a distributed filesystem on top of some alternative (e.g. object
stores) with their own even more serious limitations, and experience even more
pain as a result. Some are initially unhappy with these tradeoffs, but work
with us and learn to work more effectively with these limitations to enjoy
other benefits. That's life in the big city.
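
To put rough numbers on the first route (one request per file), here's a
back-of-the-envelope sketch; the round-trip time, file count, and link speed
are illustrative assumptions, not GlusterFS measurements. It shows why per-file
metadata round trips dominate for tiny files:

    RTT = 0.0005                  # assumed 0.5 ms LAN round trip
    N_FILES = 100_000             # a "zillions of tiny files" directory
    FILE_SIZE = 4 * 1024          # assumed 4 KiB per file
    BANDWIDTH = 10e9 / 8          # assumed 10 Gb/s link, in bytes per second

    # Route (1): one metadata round trip per file.
    lookup_time = N_FILES * RTT                      # ~50 seconds
    # Moving the actual data is cheap by comparison.
    transfer_time = N_FILES * FILE_SIZE / BANDWIDTH  # ~0.3 seconds

    print(f"lookups: {lookup_time:.0f} s, raw transfer: {transfer_time:.2f} s")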

~~~
justinclift
Thanks for clearing that up, that's awesome. :)

------
gamegod
I worked with a GlusterFS deployment in production about 2 years ago, and it
was such a nightmare that I both feel compelled to write about it and never
touch anything made by that team ever again.

It was the whole shebang: kernel panics, inconsistent views, data loss, very
slow performance, split-brain problems all the time. Our setup IIRC was very
simple: two bricks in a replicated volume. It worked so poorly that we had to
take it out of production. Some of our experience can be explained by
GlusterFS performing poorly under network partitions, but nothing could
justify kernel panics. It blew my mind that Red Hat acquired that company and
product.

Edit: I hope there's been a big improvement to the reliability and performance
of GlusterFS. Can anyone with more recent experience running it in production
comment?

~~~
illumin8
I'm not a GlusterFS expert, and haven't used it before, but you should know
that most consensus algorithms (Paxos, Raft, etc) only function reliably with
an odd number of nodes. I have to wonder if your problems were mostly self-
inflicted from having 2 nodes. Of course, any network partition in a 2-node
cluster has a huge potential for data corruption, as each node now thinks it
is the master (split-brain).

In a 3-node cluster, any system with a decent consensus algorithm (to be
clear, I'm not sure if GlusterFS has one) would know that during a partition
the cluster can only continue to operate if at least 2 nodes can communicate
with each other to elect a new master.

~~~
oh_sigh
But then in a split with a 3-node cluster, there is now a 2-node cluster in
charge, and what happens if another partition happens?

~~~
illumin8
Typically, clusters without a majority (2 out of 3, 3 out of 5, 4 out of 7,
etc) of nodes present will shut themselves down to prevent data corruption.
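
A minimal sketch of that majority rule (generic quorum logic, not tied to any
particular GlusterFS mechanism):

    def has_quorum(reachable, cluster_size):
        # A partition may keep serving only if it holds a strict majority.
        return reachable > cluster_size // 2

    print(has_quorum(1, 2))  # False: in a 1/1 split neither side has a majority
    print(has_quorum(2, 3))  # True:  the 2-node side of a 3-node cluster keeps serving
    print(has_quorum(1, 3))  # False: the isolated node shuts itself down

If the surviving 2-node side splits again, each remaining node sees only 1 of
the original 3, so neither has quorum and both stop serving rather than risk
corruption.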

------
pilif
Last time I tried GlusterFS was in 2012. The way it worked was very impressive
back then and I would have loved to actually put it into production.

Unfortunately, I hit a roadblock in relation to enumeration of huge
directories: Even with just 5K files in a directory, performance started to
drop really badly to the point where enumerating a directory containing 10K
files would take longer than 5 minutes.

Yes, you're not supposed to store many files in a directory, but this was
about giving third parties FTP upload access for product pictures, and I can't
possibly ask them to follow any schema for file and folder naming. These
people want a directory to put stuff into with their GUI FTP client, and they
want their client to be able to skip uploading files if the target already
exists. So having all files in one directory was a huge improvement UX-wise.

So in the end, I had to move to NFS on top of DRBD to provide shared backend
storage. Enumerating 20K files over NFS still isn't fast, but it completes
within 2 seconds instead of more than 5 minutes.

Of course, now that we're talking about GlusterFS, I wonder whether this has
been fixed since?

~~~
mikaelj
Couldn't your FTP server have handled this in a clever way? For example, sort
files into directories by first letter, then by first two letters, while still
presenting a virtual flat view to the FTP client user. It'd be a simple
mapping away.
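
A minimal sketch of that mapping (the function and path names here are made
up for illustration):

    import os

    def sharded_path(root, filename):
        # Flat client-visible name -> nested prefix directories,
        # e.g. 'product123.jpg' -> root/p/pr/product123.jpg
        name = filename.lower()
        return os.path.join(root, name[:1], name[:2], filename)

    def flat_name(path):
        # Reverse direction: the FTP layer shows the client only the basename.
        return os.path.basename(path)

    print(sharded_path("/srv/pictures", "product123.jpg"))
    # /srv/pictures/p/pr/product123.jpg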

~~~
pilif
You assume I wanted to write an FTP server.

I don't. I am using stock vsftpd with a PAM module that allows authentication
against our web application.

------
prohor
I'm not sure what is being announced here. GlusterFS has been around for a few
years already (version 3.6 now), and the article doesn't mention that any
managed service based on it has been launched. It reads more like a reminder
that you can set up a distributed file system on your cloud servers using
Gluster. There isn't even a step-by-step tutorial on how to do that.

~~~
baldfat
> Google Cloud Platform and Red Hat are proud to announce the availability of
> Red Hat Gluster Storage on Google Compute Engine.

They're announcing that Gluster is available, not that this is a new
technology.

~~~
prohor
How this availability will be implemented then? I can install it now, so it is
already available for me. Will they provide a service like Amazon Elastic File
System? Or there will be some pre-configured images?

~~~
bonzini
On AWS you get pre-configured AMIs, I suppose it's similar.

------
goodcjw2
Basically, GlusterFS is trying to solve a hard problem: making a
distributed/remote filesystem feel like a local filesystem for applications
built on top of it. For the client, you can choose from NFS, SMB, or its
homemade FUSE client, which makes the remote system accessible as if
everything were on the local file system. I used to build similar systems in
house and found them extremely painful to design and maintain; we did lots of
custom hacks to make our system suit our needs. GlusterFS, as a general
solution, won't have that much flexibility and may or may not suit your custom
needs.

Overall, I feel AWS S3 is a better (or at least simpler) approach: just
acknowledge that files are not locally stored and use them as is. AWS is
experimenting with EFS as well, which we also found less than desirable.

Edit: I am not saying that you cannot make GlusterFS or EFS perform great. My
point is that it's hard to do so, and it might not be worth the effort to
develop such a system given that S3 can serve most needs of distributed file
storage.

~~~
spydum
Aren't you comparing apples to oranges? S3 is an object store (and non-POSIX;
also only eventually consistent). GlusterFS is neither of those. They simply
address different problem spaces.

~~~
ej_campbell
The parent is saying that it might be simpler for many applications to forgo
the file system abstraction and just use S3's more limited API directly. That
way, you don't get bitten by thorny edge cases like an "ls" of 10K files
taking orders of magnitude more time than you'd expect given you're used to
the speed of a local disk.
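
For instance, a rough sketch of that approach with boto3 (the bucket and key
names are made up): the application addresses objects by key rather than
walking a mounted directory tree.

    import boto3

    s3 = boto3.client("s3")

    # Write: no directory semantics, just a key in a bucket.
    s3.put_object(Bucket="example-bucket",
                  Key="uploads/product123.jpg",
                  Body=b"...image bytes...")

    # Read it back by key; no need to list 10K entries to find it.
    obj = s3.get_object(Bucket="example-bucket", Key="uploads/product123.jpg")
    data = obj["Body"].read()

    # Listing is explicitly paginated, so its cost is visible up front.
    page = s3.list_objects_v2(Bucket="example-bucket",
                              Prefix="uploads/", MaxKeys=1000)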

------
Beldur
Red Hat announcement: [http://www.redhat.com/en/about/press-releases/red-hat-unveil...](http://www.redhat.com/en/about/press-releases/red-hat-unveils-flexible-and-portable-cloud-storage-red-hat-gluster-storage-google-cloud-platform)

------
chatmasta
I needed a shared volume across multiple EC2 instances in a VPC. My use case
is that multiple "ingress" boxes write files to the shared volume, and then a
single "worker" box processes those files. This is a somewhat unusual use case
in that it means one box is responsible for 99% of I/O-heavy operations, and
the other boxes are responsible only for writing to the volume, with no
latency requirements.

My solution was to mount an EBS volume on the "worker" box and run an NFS
server there. Each "ingress" box runs an NFS client that connects to the
server via its internal VPC IP address and mounts the NFS volume to a local
directory. It works wonderfully. In three months of running this setup, I've
had no downtime or issues, not even minor ones. Granted, I don't need any kind
of extreme I/O performance, so I haven't measured it, but this system took
less than an hour to set up and fit my needs perfectly.

~~~
helper
Amazon now has an NFS service (in preview release) called EFS:
[https://aws.amazon.com/efs/](https://aws.amazon.com/efs/)

------
Nux
Gluster is not for the faint of heart, but as far as distributed filesystems
go it's probably the easiest to set up and deal with.

We've been using it in production for a few years now and having a single
namespace that can basically grow ad infinitum has been pretty neat.

If you want a trouble-free Gluster experience, stay away from MANY small files
and replicated volumes.

------
cgarrigue
It's a bit light for a press release. Considering Red Hat is officially
promoting AWS on their website, it would have been helpful to provide more
information letting people know whether the offering on Google Cloud will be
better or merely similar.

~~~
justinclift
Yeah, it seems to be missing the "click here for next steps / further
information" bit, which is less than optimal. :(

~~~
milesward
Disclaimer: I work on GCP. Yup, we goofed. Here are the next-step bits:
[https://cloud.google.com/solutions/filers-on-compute-engine](https://cloud.google.com/solutions/filers-on-compute-engine)

~~~
justinclift
Ugh. So it's literally still RH Gluster, needing manual setup.

Also, there's a sentence at the end of the GlusterFS section there saying:

"If you want to deploy a Red Hat Gluster Storage cluster on Compute Engine,
see this white paper for instructions on how to provision a multi-node cluster
that includes cross-zone and cross-region replication:"

Apart from the typo (pedant alert!), the URL on "white paper" goes to a non-
public document that only Red Hat subscribers have access to. That should
probably be fixed, so non-Red-Hat-subscribers can read the doc and know what
they'll need to do up front.

If people need to subscribe to RH in order to get that info... just to know
what they need to do... that's probably going to hinder adoption. Potentially
by a lot. ;)

------
jqueryin
Before reading the article, I was going to ask if it solves the "high read
access of many small files" I/O problem, but alas, it's on GlusterFS, so only
insofar as Gluster has been making improvements over these last few minor
releases.

Is anyone here running a GlusterFS setup with high read/write volume on small
files successfully? If so, what's your secret?

------
objectivefs
If you are looking for a POSIX-compatible file system for GCE or EC2, we think
our ObjectiveFS[1] is the easiest way to get started with and use. It is a
log-structured filesystem that uses GCS or S3 for backend storage, with a
ZFS-like interface to manage your filesystems.

[1] [https://objectivefs.com](https://objectivefs.com)

------
godzillabrennus
Glad to see Gluster is still making waves. I was an early customer. It's
impressive when a brand survives acquisition much less a transition into a new
type of offering like this. Kudos to everyone who helped make Gluster special!

------
melted
I wonder why they even did this. They already have a state-of-the-art
distributed filesystem (Colossus) which doesn't have any scalability problems
at all, since they use it for everything.

[http://www.highlyscalablesystems.com/3202/colossus-successor...](http://www.highlyscalablesystems.com/3202/colossus-successor-to-google-file-system-gfs/)

~~~
wmf
Can you mount Colossus? Does it have POSIX semantics?

~~~
melted
I'm sure this could be retrofitted even if it does not.

------
profeta
You can see why Google is such a good marketing company: none of the links in
the article point to anything other than their own products.

So, here is the link to the star of the show:
[https://www.redhat.com/en/technologies/storage/gluster](https://www.redhat.com/en/technologies/storage/gluster)

------
amelius
I'm wondering if there exist better (simpler) solutions than GlusterFS for the
case where files are strictly write-once.

------
jpgvm
GlusterFS is basically synonymous with pain. Use at your own peril.

------
stevenking86
nice

------
secopdev
pricing?

~~~
milesward
Disclaimer: I work on GCP. GlusterFS works best on RHEL, and consumes normal
GCP resources like GCE and PD-SSD storage. To host a rocking fast, best
practices, HA, 3TB all-SSD filer, it'd be less than $900 on GCP:
[https://cloud.google.com/products/calculator/#id=e76e9a5a-bf...](https://cloud.google.com/products/calculator/#id=e76e9a5a-bf21-4ccd-9cab-5590c77394db)

------
thrownaway2424
I see that the GlusterFS FAQ says it is fully POSIX compliant. That's a pretty
good trick. Ten years ago or so I had a suite of compliance tests I would use
to embarrass salesmen from iBrix and Panasas. The only actually
POSIX-compliant distributed filesystem I could find in those days was Lustre
(unrelated to Gluster, despite the naming). Lustre works well but it is almost
impossible to install and operate.

~~~
batbomb
In HPC/HTC, Lustre is very common, especially in DOE labs. I've never tried to
install it, but I don't think my colleagues who have are some kind of special
genius.

~~~
jpgvm
Most HPC labs run distributions that go to quite a lot of trouble to ensure
Lustre works well.

The issue with Lustre is that it usually requires kernel patches if you are
not running the aforementioned distributions that are designed for HPC.
Specifically, the Lustre OSD used a modified version of ext2, among other
issues.

That could be getting better now with the new ZFS-based OSD, though I wouldn't
put money on it being "easy" to install.

Lustre also doesn't provide any means of replication. If you want to achieve
HA with Lustre you need to make each OSD individually HA. This can be done
with multi-pathed SAS arrays and a ton of scripting but it's still not exactly
a walk in the park.

Hopefully one day we will see a real high-performance distributed filesystem
that also bundles replication, tiering, and some semblance of POSIX
compatibility. I doubt Ceph is going to be it, so we are probably still 5-10
years from a solution to the problem.

~~~
batbomb
Well, most HPC installations I know of are just running RHEL6/7 (except for
NERSC, which I think is still using Cray Linux based on SUSE), which I
wouldn't peg as particularly exotic. The HA thing is known; typically Lustre
is just one piece of the file system, along with local disks and GPFS (and now
a move towards Ceph), and HPSS/tape (which can be a tier of GPFS).

I'm not so sure the hope for "one filesystem to rule them all" will ever
really work out, but Ceph is the best positioned.

------
cdnsteve
File storage in 2016? Why not just use S3? Is file storage even a problem
anymore?

~~~
epistasis
Anytime you use the word "just", you're sweeping a huge number of assumptions
under the rug.

For example, what if you have to run code that depends on a POSIX interface?
That "just" becomes a massive rewrite project.

