Real Parallel Filesystems

jart · on Feb 13, 2010

Interesting stuff. I read a few more articles here and walked away with more questions than answers. CassFS definitely looks fun learning project.

What I'd really love to see, is an article that explains the goals/niche/maturity/strengths/weaknesses of various distributed storage solutions.

Can anyone here offer a recommendation? (For example, here's something similar: http://www.metabrew.com/article/anti-rdbms-a-list-of-distrib... )

I've been casually researching this topic for months and I still feel so overwhelmed! Not many people are giving a clear picture of what they're trying to accomplish, when/how their system should be used, and how they compare to others.

I'm mostly asking because I recently ended up writing my own. It's just a Python script that sends a UDP message to all our servers asking who has the file. First person to respond wins and the file is downloaded via nginx. Replication is accomplished with durable message queues which are consumed during off-peak hours. It took days to code and suits our needs perfectly.

I learned from this project that reliable, scalable, distributed data storage is very simple, if you focus on solving a very specific use-case scenario.

skorgu · on Feb 13, 2010

Some notes on the production-quality filesystems listed in the article:

PVFS2 requires shared storage [1] (i.e. a SAN/DRBD/GNBD), that alone makes it fall out of the "parallel filesystem" bucket for me.

Lustre won't have distributed metadata until 2.2 according to their roadmap [2]. Considering 2.0 is still in alpha...

GlusterFS looks promising, I haven't found any major red flags in a quick perusal of the site anyway.

[1] http://www.pvfs.org/cvs/pvfs-2-7-branch.build/doc/pvfs2-ha-h... [2] http://wiki.lustre.org/index.php/Lustre_Roadmap

CPlatypus · on Feb 13, 2010

Some notes from the OP author. ;)

(1) Neither PVFS2 nor Lustre requires shared storage except to do hot failover for a failed server. I've personally run both in configurations where the only storage was node-local RAM. Disk failures can be avoided using host-based RAID, while node failures can be handled either by software disk mirroring (e.g. DRBD) or by physically moving the disks to another server. Are these approaches optimal, or even acceptable in a typical production environment? Of course not; there's a severe performance hit for the first and an even worse availability hit for the second. I do believe this is a positive differentiator for GlusterFS (though the AFR translator does exact a high performance toll) but it's not quite accurate to say or imply Lustre or PVFS2 can't be used without shared storage and it's a bit unfair to mention that for one but not the other.

(2) It's only somewhat surprising to me that Lustre still doesn't have distributed metadata. I knew they were making a very big push for it in 1.8. Then I left SiCortex, and after two years dealing with Lustre lameness I actively avoided it thereafter. Adding distributed metadata after the fact is Very Hard, so the fact that they failed (for at least the third time) isn't a surprise.

skorgu · on Feb 13, 2010

Coping automatically with hardware failures is feature #1 to me so I compared them all on that assumption, apologies if that wasn't clear. I couldn't quickly find an affirmative reference to shared storage on Lustre's so I gave it the benefit of the doubt.

I'm not convinced that getting acceptable performance out of a distributed system that maintains POSIX or near-POSIX semantics is even possible in the real world; as you say it's decidedly nontrivial. The path of least pain in the real world seems to be glorified key/value blob stores i.e. MogileFS.

Thanks for the writeup, I feel like I do the same round up every few years when someone suggests "so just run $LEGACYAPP on top of a distributed filesystem!" and I have to reiterate in detail why that's usually a Bad Idea.

on Feb 13, 2010

[deleted]

on Feb 13, 2010

[deleted]

thaumaturgy · on Feb 13, 2010

> I don't see what this has to do with parallel filesystems at all

Yeah, you're right, sorry. I read the article but came back and flew off the handle. I'll delete my comment, it's just noise.

patrickgzill · on Feb 13, 2010

OpenSolaris and ZFS is a great combination for that.

Advocacy aside, change the requirement for USB to Compact Flash with an IDE adapter that makes it look like an IDE drive, and your problems will be solved.

Kaya · on Feb 13, 2010

www.isilon.com makes the best clustered filesystem around. It's not open source, however.