But if you do know how to ride it, you have a track to ride it on, and it decides to let you ride, you are in for a real treat. It is a rush that can't be beat, even though after that ride your body will ache from the beating it's been given.
For the right workloads Lustre can't be beat. I've seen it completely transform large HPC clusters from dogs into Kentucky Derby winners. But as many have mentioned, it is a complex beast that needs the right people, the right hardware, and the right workloads to make it run.
We used it for many years for large geospatial big-data workloads, streaming huge images to many thousands of HPC nodes. It worked great. But it was highly sensitive to the technical staff running it, and ultimately it became too "cranky" to use in production without Gandalf and his wizard army to keep it from eating itself.
So we now run on NFS over IB and happily give up some performance for 24*365 uptime.
OT: I think you meant pure-bred. :) Reminds me of this section from Orwell's «Politics and the English Language»:
> Some metaphors now current have been twisted out of their original meaning without those who use them even being aware of the fact. For example, toe the line is sometimes written as tow the line. Another example is the hammer and the anvil, now always used with the implication that the anvil gets the worst of it. In real life it is always the anvil that breaks the hammer, never the other way about: a writer who stopped to think what he was saying would avoid perverting the original phrase.
That said, I'd have chosen PVFS^WOrangeFS at my previous site if I'd had the choice, but I've not been able to make or find a useful performance comparison.
Care to be more specific?
Lustre is for large HPC clusters. It's not for your desktop; it's not for your video editing suite. It is the only way to provide a filesystem that scales to support, for instance, a compute job of 100k ranks, all writing checkpoints and snapshots periodically. No, Hadoop doesn't do it. Ceph is for performance-and-scaling-insensitive cloud installs.
Lustre is for when you expect to saturate 100 100Gb IB links to storage. It works remarkably well for its use case (though even on HPC clusters, MDS performance can be a problem).
So does GPFS ... er ... Spectrum Scale by IBM.
There are a few others that fit in this, but Lustre is not alone here.
I'm not aware of MPI-IO support for BeeGFS. Are there any installations of it comparable with, say, the filesystems of the big HPC sites like Tri-labs?
1.25 TByte/sec isn't a particularly large amount of data. (BTW: it makes more sense to saturate 1000 10Gb Ethernet links, economically speaking, at this point.)
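For scale, the arithmetic behind that comparison is easy to check; a back-of-the-envelope sketch (link counts taken from the comments above, protocol overhead ignored):

```python
GBIT = 1e9  # bits per second in "1 Gb"

def aggregate_tbyte_per_sec(links: int, gbit_per_link: float) -> float:
    """Aggregate raw throughput in TByte/sec for `links` identical links."""
    return links * gbit_per_link * GBIT / 8 / 1e12

ib  = aggregate_tbyte_per_sec(100, 100)   # 100 x 100Gb InfiniBand
eth = aggregate_tbyte_per_sec(1000, 10)   # 1000 x 10Gb Ethernet

print(ib, eth)  # both work out to 1.25 TByte/sec aggregate
```

Same aggregate number either way; the economics argument is about the per-port cost, not the total bandwidth.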
Rewriting large existing codebases to use a different paradigm for storage and distribution may make economic sense, but it's a risky proposition. And when you get one shot at building a Top 500 cluster, taking that risk and delivering a cluster that can't run the code you need it to run may mean that all that spending was wasted.
Ultimately, rewriting your application to not require low latency is not a risky proposition. It's in fact the least risky.
There's literature on improved performance running HDFS over Infiniband and Hadoop over Lustre and PVFS at least.
And from up the thread, for a discussion of file systems for high performance checkpointing of MPI jobs (which is what's relevant to HPC) see, for instance http://nowlab.cse.ohio-state.edu/static/media/publications/a...
The improved performance of HDFS over Infiniband comes at a price (the aggregation switches are much more expensive and challenging to configure; there are far fewer network engineers who understand them).
As for the paper you linked to, I see numerous problems. First is the fact that HPC codes spend 60% of their time checkpointing to deal with MTBFs. This is not a problem in data warehouse applications: failures of components do not affect the job's progress (except in the form of slower progress). Second, the paper has the checkpoint data buffered in local RAM and then spilled out to local or remote storage. That's fine, except that you're still in danger if the node crashes before the data is committed to true backing storage (or has a way to recover the spill process after repair/reboot).
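The durability concern in that second point is the classic write-then-commit problem. A minimal sketch of what "committed to true backing storage" entails, in plain Python rather than anything from the paper:

```python
import os
import tempfile

def write_checkpoint(path: str, data: bytes) -> None:
    """Crash-safe checkpoint: write a temp file, fsync it, atomically
    rename it over the old checkpoint, then fsync the directory so the
    rename itself is durable. A crash leaves either the old or the new
    checkpoint on disk, never a torn one."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".ckpt-")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # bytes are on backing storage
    os.rename(tmp, path)              # atomic within one filesystem
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)                 # persist the directory entry too
    finally:
        os.close(dfd)
```

A RAM-buffered spill that skips the fsync steps is exactly the window the comment describes: fast, but lost if the node dies before the spill completes.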
We have the InfiniBand (or similar) fabric, and we understand it -- in my case better than the Ethernet one we'd be expected to understand as well, and we're typically not "network engineers" doing this stuff. I disagree that InfiniBand is more challenging anyway, having had and known about bad problems with cluster Ethernet. And this 10G Ethernet gets a whole 10% extra bandwidth on top of current HPC fabrics while presumably adding jitter.
HPC codes don't generally spend 60% checkpointing, and "data warehouse applications" typically aren't of interest to us, but we can do mapreduce-type things fast. See the bits about redundancy and the advantage of RDMA in the checkpointing filesystem.
Take, for example, my favorite storage system, Ceph. As I understand it, it was originally going to be CephFS, with multiple metadata servers and lots of distributed POSIX goodness. However, in the 10+ years it's been in development, the parts that have gotten tons of traction and widespread use are seemingly one-off side projects from the underlying storage system: object storage and the RBD block device interfaces. Only in the past 12 months is CephFS becoming production ready. But only with a single metadata server; multiple metadata servers are still being debugged.
With Ceph, part of the timing issue is that the markets for object storage and network-based block devices dwarf the market for distributed POSIX. But I bring it up to point out that distributed POSIX is also just a really, really hard problem, with limited use cases. It's super convenient for getting an existing Unix executable to run on lots of machines at once. But that convenience may not be worth the challenges it imposes on the infrastructure.
CephFS simply didn't gain as much traction because it made sense to just store objects in many cases, and let something else worry about what is stored and where. A massive distributed file system is not nearly as necessary as people make it out to be for a lot of different workloads.
RBD works so well because it can replace iSCSI/FC storage and provide failover and redundancy, and as you scale servers it gets faster. We had 20 Gbit/sec of bandwidth for our storage network (to each hypervisor), with 20 Gbit/sec for the front-end Ceph access network, and Ceph RBD easily saturated it on the hypervisor side. Spinning up VMs in OpenStack was a pleasure and soooooo quick (RBD-backed images too).
We looked at increasing the hypervisors to 40 Gbit/sec, but decided against it for cost reasons; in reality we had very few workloads that actually needed that much storage IO.
None of the internet-facing services actually needs a strongly consistent POSIX filesystem that scales across multiple datacenters, and a huge chunk of them won't put up with the corresponding latencies of something like that for mere convenience. So the products are not really flawed; they just don't need to do those things.
Also, the only standard distributed protocol for POSIX, NFS, has always had a lot of design and implementation issues. v4 is complex and tries to address some of the consistency issues, but I don't know how successful it is in practice given the limited usage.
Treating blobs as immutable, addressable objects and using other systems for coordination avoids a lot of the pain with caching, consistency, metadata performance, etc. You can layer those things... it's a good cut for large-scale distributed systems.
I have (somewhat esoteric corner cases, but still) benchmarks that will cause failures within seconds when run against NFS.
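One well-known class of such corner cases: POSIX requires O_APPEND writes to land atomically at end-of-file, while NFS clients compute the append offset themselves. The sketch below is a hypothetical benchmark in that spirit (not the commenter's actual test): several concurrent appenders to one file, then a byte-count check.

```python
import os
import threading

RECORD, WRITES, WORKERS = 64, 2000, 4  # arbitrary benchmark parameters

def appender(path: str) -> None:
    # O_APPEND: POSIX requires each write to land atomically at EOF.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        for _ in range(WRITES):
            os.write(fd, b"x" * RECORD)
    finally:
        os.close(fd)

def run(path: str) -> int:
    """Run WORKERS concurrent appenders and return the final file size."""
    threads = [threading.Thread(target=appender, args=(path,))
               for _ in range(WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return os.path.getsize(path)

# On a local filesystem, run(path) == RECORD * WRITES * WORKERS: no bytes
# are lost. Over NFS, client-side append-offset handling lets concurrent
# appenders clobber each other, so a check like this can fail quickly.
```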
But its weakness is that it is a complex beast to set up and maintain. If you think that configuring and maintaining a Hadoop cluster is hard, Ceph is twice as hard. Ceph has really good documentation, but the number of knobs you have to read about and play around with is still too much.
I prefer running a few ZFS servers: very easy to set up and maintain, and much better performance. But if we reach petabyte scale I think I would need something like Ceph.
I haven't actually had the occasion to use either of them in a project but very curious to hear about whether they have successfully enabled some useful legacy->cloud migration scenarios.
So I think it's a question of the correct development focus in the correct order.
The goal is to have a single CephFS across the globe; at the current pace they will get there, eventually!
The real question however is not where Intel does not see its business today but where it sees its future business.
Intel seems to be going back to being a hardware only company.
1999: Lustre is started as a research project at Carnegie Mellon University
2001: Cluster File Systems, Inc. founded to commercialize Lustre
2007: Sun acquires Cluster File Systems
2010: Oracle buys Sun Microsystems, acquires Lustre assets
2010: Eric Barton leaves Oracle and founds Whamcloud, continues Lustre dev there
2012: Intel buys Whamcloud for its Lustre assets
2013: Oracle sells Lustre-related assets to Xyratex
2013: Intel tries to extend Lustre to Hadoop use-cases
2014: Seagate buys Xyratex
2017: Intel discontinues its Lustre Filesystem business
I seem to remember IBM and HP being involved at some stage, but I'm having trouble finding it online now.
The only serious open-source competitors to Lustre are Ceph and GlusterFS. But Ceph is too unstable, and GlusterFS 3.0 is based on distributed hash tables and so is not strongly consistent.
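On the DHT point: elastic hashing places a file on a brick purely by hashing its name, which is what removes the central metadata server, but it also removes any single authority that could serialize conflicting updates. A toy sketch of the idea (brick names and hash choice are made up here; Gluster's actual scheme uses a Davies-Meyer hash with per-directory ranges):

```python
import hashlib

BRICKS = ["brick-a", "brick-b", "brick-c", "brick-d"]  # hypothetical bricks

def place(filename: str, bricks=BRICKS) -> str:
    """Pick a brick by hashing the file name. Every client computes the
    same answer with no metadata server to consult -- and no single
    point that could enforce a strongly consistent global order."""
    h = int.from_bytes(hashlib.sha1(filename.encode()).digest()[:8], "big")
    return bricks[h % len(bricks)]
```

Placement is deterministic and coordination-free, which is exactly the trade the comment is pointing at.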
These days, Ceph/Gluster might work, but Lustre's proven.
Lustre is a monstrosity - badly designed, poorly implemented, and very hard to configure, keep running, or even get adequate performance from outside a single limited use case.
Good riddance to bad rubbish.
From the article: "In the core HPC market, Lustre has 75 of the Top 100 systems on the Top 500 rankings, and is used on nine of the top ten systems."
That is a strong argument for it being performant. That 'single limited use case' you mention is... high performance computing.
And no, the "single limited use case" I mean is a relatively small number of very large files that need to be sequentially streamed out/in as fast as possible. That's a small subset of HPC.
GPFS is the gold standard but it's expensive.
Lustre is more a byproduct of the HPC vendors trying to get more margin by using open source. They've drunk from the poisoned chalice in that the amount of time, money and effort required to get Lustre to acceptability (limited as that may be) has been far more than GPFS license/support cost. Even from the brain-eating zombie that is today's IBM.
Confirming! And going on a small tangent (ha!).
Our previous configuration was Lustre and XFS/NFS; the former was the scratch file system for HPC applications and the latter for home directories and what not.
Lustre was definitely a beast (in a good sense), but we'd occasionally get bit by a workload high in metadata operations - which would bring Lustre to its knees due to latency.
XFS/NFS was great for its purpose (no HPC workloads), but we'd also get bit by a user or users reading seemingly tiny, innocuous files (inadvertently in parallel) which would cause load averages to spike; surprisingly the latency wasn't as bad as the Lustre workload mentioned above.
Not to drink the GPFS kool-aid here, but it's definitely solved both problems above. It has its issues, but it definitely handles the common I/O patterns seen on our cluster.
At least the Lustre developers (pre-Intel) had the foresight to enable extremely good debugging - you could simply enable a few procfs settings and easily find the offenders.
It's still amusing to me however, that the biggest offenders of Lustre slow-downs were single core processes. I'd check the RPC counts per node, find the violator, and then check per PID statistics; it was always (95%+) a single core application performing thousands of calls.
We never did experiment with the 2.x branch of the software. I recall one of our co-workers stating that even the developers did not believe the dual MDS set-up was production ready at that time.
It definitely served a need that nothing else did, but from an armchair position we could collectively do better.
Ceph and Gluster can't achieve the kind of performance required for HPC scratch. Hadoop can, but if you are managing a general-purpose HPC cluster, nearly all software out there expects to write to a POSIX filesystem. Much of this software is built out of old C and FORTRAN that is not going to be rewritten. Researchers are not software engineers. You need a "lowest common denominator" of storage that you can read/write at high speed in parallel. There are few technologies for this particular use case.
The others that I know of (but haven't tried) are:
- OrangeFS (http://www.orangefs.org/)
- BeeGFS (http://www.beegfs.com/)
- OpenAFS/Arla (ancient, and weird, it is one of the big reasons Scientific Linux exists as a distro: http://www.openafs.org/, http://www.stacken.kth.se/project/arla/)
- Scality (http://www.scality.com/)
- GPFS (Spectrum now: https://www-03.ibm.com/systems/storage/spectrum/scale/)
- Qumulo (they say they can hit the numbers, but I have doubts: http://qumulo.com/)
There are probably many more, but these are the ones I know about.
I run two Gluster clusters for more boring enterprisey reliability.
It is still kind of a mess - there are rough edges all over. Log messages make zero sense by themselves; you either need to learn what they're indicative of or read the code. Frequently, typoing a configuration command (configuration is mostly imperative and command-line driven, not via configuration files) so that it fails actually does make changes, and you have to unwind what it did before you can try again.
All that said, Gluster is pretty solid for our use - distributed, fault-tolerant POSIX storage. But I wouldn't use it in an HPC environment.
MooseFS seems to do its job well; the only problem is that by default it lacks HA. The master server is very important since it keeps track of where all data is located, so if it is gone, your data is gone. You can set up one or more metalogger nodes, whose sole purpose is to back up the metadata. You can also manually promote a metalogger to master.
They have a commercial offering that includes HA, but I've never used it, so it's hard to comment on.
You could also set up HA yourself using Corosync and Pacemaker, but it's a somewhat challenging task.
Unfortunately, its prior licensing scheme prevented larger initial adoption. A good chunk of the large-scale setups I've seen have adopted Ceph, but I cannot comment on performance.
Any knucklehead can install it and it works transparently with excellent performance. It's currently not enterprise-grade fault tolerant - that's the worst thing I can say about it.
It's getting there quickly. BeeGFS v6 has metadata mirroring via buddy groups, and data mirroring the same way. A few things are still not there, but it is rapidly getting closer.
Being open source helps as well.
Definitely recommend it, and have deployed it and supported it many places in my previous life: http://scalableinformatics.com .
I'll get my coat.