

Real Parallel Filesystems - wmf
http://pl.atyp.us/wordpress/?p=2662

======
jart
Interesting stuff. I read a few more articles here and walked away with more
questions than answers. CassFS definitely looks like a fun learning project.

What I'd really love to see is an article that explains the
goals/niche/maturity/strengths/weaknesses of various distributed storage
solutions.

Can anyone here offer a recommendation? (For example, here's something
similar:
[http://www.metabrew.com/article/anti-rdbms-a-list-of-distrib...](http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/))

I've been casually researching this topic for months and I still feel so
overwhelmed! Not many people are giving a clear picture of what they're trying
to accomplish, when/how their system should be used, and how they compare to
others.

I'm mostly asking because I recently ended up writing my own. It's just a
Python script that sends a UDP message to all our servers asking who has the
file. The first server to respond wins, and the file is downloaded via nginx.
Replication is accomplished with durable message queues which are consumed
during off-peak hours. It took days to code and suits our needs perfectly.
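
A minimal sketch of the lookup step described above, not the actual script. The `WHO_HAS`/`HAVE` message format and the names `serve_lookups` and `locate` are assumptions made up for illustration; the nginx download and the queue-based replication are left out.

```python
import socket
import time


def serve_lookups(sock, have_file, stop):
    """Answer "WHO_HAS <name>" datagrams with "HAVE <name>" when we hold the file.

    `sock` is an already-bound UDP socket, `have_file` is a callable taking a
    file name, and `stop` is a threading.Event used to end the loop.
    """
    sock.settimeout(0.1)  # wake up regularly so the stop flag is noticed
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(1024)
        except socket.timeout:
            continue
        parts = data.decode("utf-8", "replace").split(" ", 1)
        if len(parts) == 2 and parts[0] == "WHO_HAS" and have_file(parts[1]):
            sock.sendto(("HAVE " + parts[1]).encode("utf-8"), addr)


def locate(name, servers, timeout=1.0):
    """Ask every server who has `name`; return the first responder's address.

    `servers` is a list of (host, port) tuples. Returns None if nobody
    answers before the timeout expires.
    """
    expected = ("HAVE " + name).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for addr in servers:
            sock.sendto(("WHO_HAS " + name).encode("utf-8"), addr)
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            sock.settimeout(remaining)
            try:
                data, addr = sock.recvfrom(1024)
            except socket.timeout:
                return None
            if data == expected:  # ignore stray or stale replies
                return addr
```

Once `locate` returns an address, the client would fetch the file from that host over HTTP (nginx in the setup described here). Servers that lack the file simply stay silent, so a missing file shows up as a timeout rather than an error.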

I learned from this project that reliable, scalable, distributed data storage
is very simple, if you focus on solving a very specific use-case scenario.

------
skorgu
Some notes on the production-quality filesystems listed in the article:

PVFS2 requires shared storage [1] (i.e. a SAN/DRBD/GNBD); that alone makes it
fall out of the "parallel filesystem" bucket for me.

Lustre won't have distributed metadata until 2.2 according to their roadmap
[2]. Considering 2.0 is still in alpha...

GlusterFS looks promising; I haven't found any major red flags in a quick
perusal of the site, anyway.

[1]
[http://www.pvfs.org/cvs/pvfs-2-7-branch.build/doc/pvfs2-ha-h...](http://www.pvfs.org/cvs/pvfs-2-7-branch.build/doc/pvfs2-ha-heartbeat-v2/pvfs2-ha-heartbeat-v2.php#SECTION00031000000000000000)

[2] <http://wiki.lustre.org/index.php/Lustre_Roadmap>

~~~
CPlatypus
Some notes from the OP author. ;)

(1) Neither PVFS2 nor Lustre requires shared storage except to do hot failover
for a failed server. I've personally run both in configurations where the only
storage was node-local RAM. Disk failures can be avoided using host-based
RAID, while node failures can be handled either by software disk mirroring
(e.g. DRBD) or by physically moving the disks to another server. Are these
approaches optimal, or even acceptable in a typical production environment? Of
course not; there's a severe performance hit for the first and an even worse
availability hit for the second. I do believe this is a positive
differentiator for GlusterFS (though the AFR translator does exact a high
performance toll) but it's not quite accurate to say or imply Lustre or PVFS2
can't be used without shared storage and it's a bit unfair to mention that for
one but not the other.

(2) It's only somewhat surprising to me that Lustre still doesn't have
distributed metadata. I knew they were making a very big push for it in 1.8.
Then I left SiCortex, and after two years dealing with Lustre lameness I
actively avoided it thereafter. Adding distributed metadata after the fact is
Very Hard, so the fact that they failed (for at least the third time) isn't a
surprise.

~~~
skorgu
Coping automatically with hardware failures is feature #1 to me, so I compared
them all on that assumption; apologies if that wasn't clear. I couldn't
quickly find an affirmative reference to shared storage for Lustre, so I gave
it the benefit of the doubt.

I'm not convinced that getting acceptable performance out of a distributed
system that maintains POSIX or near-POSIX semantics is even possible in the
real world; as you say, it's decidedly nontrivial. The path of least pain in
practice seems to be glorified key/value blob stores, e.g. MogileFS.

Thanks for the writeup. I feel like I do the same roundup every few years
when someone suggests "so just run $LEGACYAPP on top of a distributed
filesystem!" and I have to reiterate in detail why that's usually a Bad Idea.

------
Kaya
www.isilon.com makes the best clustered filesystem around. It's not open
source, however.

