Hacker News new | comments | show | ask | jobs | submit login
Intel Shuts Down Lustre File System Business (nextplatform.com)
108 points by arcanus 155 days ago | hide | past | web | 64 comments | favorite

Lustre is a pure-bread racehorse. It is not something the general masses should ever take a ride on; or even get up close and personal with. It will bite you; kick you; piss on you; crap on you; stomp on you; and generally try to kill you. It is cranky. It is not user friendly. It doesn't play nice with others.

But if you do know how to ride it; you have track to ride it on; and it decides to let you ride; you are in for a real treat. It is a rush that can't be beat. Even though after that ride your body will ache from the beating it's been given.

In the right workloads Lustre can't be beat. I've seen it completely transform large HPC clusters from dogs to kentucky derby winners. But as mentioned by many it is a complex beast that needs the right people; right hardware; and right workloads to make it run.

We used it for many years for large geospatial big data workloads. Huge images to many 1000's of HPC nodes. Worked great. But it was highly sensitive to the technical staff running it; and ultimately it became too "cranky" to use in production mode without Gandalf and his wizard army to keep it from eating itself.

So we now run on NFS over IB and happily give up some performance for 24*365 uptime.

> pure-bread

OT: I think you meant pure-bred. :) Reminds me of this section from Orwell's «Politics and the English Language»:

> Some metaphors now current have been twisted out of their original meaning without those who use them even being aware of the fact. For example, toe the line is sometimes written as tow the line. Another example is the hammer and the anvil, now always used with the implication that the anvil gets the worst of it. In real life it is always the anvil that breaks the hammer, never the other way about: a writer who stopped to think what he was saying would avoid perverting the original phrase.

It's not uncommon to see anvils with their horns broken off from too much hammering.

Is it possible multiple hammers were involved?

Such a wonderfully insightful and well written comment and you choose to belabor a typo? sigh

In my experience, Lustre on a typical medium-size (several hundred node) cluster has been more reliable than NFS (at least once the cluster vendor was out of the way), and you can actually use it for MPI-IO, which is crucial. It's been quite straightforward to operate, at least with a simple LNET setup. I don't understand this stuff about racehorses. NFS has been easily flattened in various circumstances, too.

That said, I'd have chosen PVFS^WOrangeFS at my previous site if I'd had the choice, but I've not been able to make or find a useful performance comparison.

>>> But it was highly sensitive to the technical staff running it;

Care to be more specific?

I'd be very interested in how much "some performance" is :) I'm assuming good old nfsv3? No encrypted transport etc?

Actually Lustre gives good "out of the box" performance... I've used a few HPC clusters and have almost never felt the need to issue commands to configure striping, etc.

if your alternative to Lustre is NFS over IB, you didnt need Lustre in the first place

ironic to see comments here criticizing Lustre, from people who don't seem to understand it at all.

Lustre is for large HPC clusters. It's not for your desktop; it's not for your video editing suite. It is the only way to provide a filesystem that scales to support, for instance, a compute job of 100k ranks, all writing checkpoints and snapshots periodically. No, Hadoop doesn't do it. Ceph is for performance-and-scaling-insensitive cloud installs.

Lustre is for when you expect to saturage 100 100Gb IB links to storage. it works remarkably well for its use case (though even on HPC clusters, MDS performance can be a problem).

Actually Lustre isn't the only game in town for this. BeeGFS (http://beegfs.com) does a very good job at this as well, has better small to large scaling, understandable by mere-mortal error messages, doesn't require a specific (ancient) kernel or distro for the server ...

So does GPFS ... er ... Spectrum Scale by IBM.

There are a few others that fit in this, but Lustre is not alone here.

I'm not sure why the filesytem server kernel is relevant or how RHEL 7.3's is ancient.

I'm not aware of MPI-IO support for BeeGFS. Are there any installations of it comparable with, say, the filesystems of the big HPC sites like Tri-labs?

Also Panasas (http://www.panasas.com/).

To my knowledge GPFS is IBM only.

It's an IBM commercial product, if that's what you mean by "IBM only". You can install it on any hardware though, we run it on a combination of SuperMicro, Dell, and DDN currently.

actually, Hadoop-like systems do precisely this (a compute job of 100k ranks, all writing checkpoints and snapshots periodically).

1.25Tbyte/sec isn't a particularly large amount of data. (btw: makes more sense to saturate 1000 10Gb ethernet, economically speaking, at this point)

1000 10Gb ethernet doesn't usually make sense in the situations that are using Lustre and Infiniband; they actually need the reduced latency for inter-process communication. It may be fine for ethernet in front of the disks, but they still need to get the bits to the processors that are connected by infiniband.

Rewriting large existing codebases to use a different paradigm for storage and distribution may make economic sense, but it's a risky proposition. And when you get one shot at building a Top 500 cluster, taking that risk and delivering a cluster that can't run the code you need it to may mean that all that economic spend was wasted.

The 10Gb's are for node to storage. The IB is for low latency. Having both in the same machine is a win. Wasting IB for storage is dumb, because storage times are orders of magnitude higher than the IB latency.

Ultimately, rewriting your application to not require low-latency is not a risky proposition. it's in fact the least risky.

That makes no sense to an HPC person (who has actually been evaluating low-latency Ethernet recently). How do you even separate the MPI-IO traffic over a different fabric from other MPI in, say, Open MPI, and how could maybe doubling network cost and complexity be a win if you could? And then we have to re-write the applications somehow?

There's literature on improved performance running HDFS over Infiniband and Hadoop over Lustre and PVFS at least.

And from up the thread, for a discussion of file systems for high performance checkpointing of MPI jobs (which is what's relevant to HPC) see, for instance http://nowlab.cse.ohio-state.edu/static/media/publications/a...

An MPI implementation is free to spread traffic over available transports; there's nothing in the design of MPI that prevents such a configuration.

The improved performance of HDFS over Infiniband comes at a price (the aggregation switches are much more expensive and challenging to configure; there are far fewer network engineers who understand them).

As for the paper you linked to- I see numerous problems. First is the fact that HPC codes spend 60% of their time checkpointing to deal with MTBFs. This is not a problem in data warehouse applications- failures of components do not affect the job's progress (except in the form of slower progress). Second, the paper has the checkpoint data buffered in local RAM then spilled out to local or remote storage. That's fine, except that you're still at danger if the node crashes before the data is committed to true backing storage (or has a way to recover the spill process after repair/reboot).

The question was how you'd configure a concrete MPI implementation to separate MPI-IO traffic from the rest if it was a good idea.

We have the Infiniband (or similar) fabric, and we understand it -- in my case better than the Ethernet one we're proposed to have to understand as well, and we're typically not "network engineers" doing this stuff. I disagree that Infiniband is more challenging anyway, having had and known about bad problems with cluster Ethernet. Then this 10G Ethernet get a whole 10% extra bandwidth on top of current HPC fabrics while presumably adding jitter.

HPC codes don't generally spend 60% checkpointing, and "data warehouse applications" typically aren't of interest to us, but we can do mapreduce-type things fast. See the bits about redundancy and the advantage of RDMA in the checkpointing filesystem.

Hadoop is embarrassingly parallel IO, Lustre can do that, but it does much more than that specific workload.

This entire space is littered with hard to use and/or flawed products. It's extremely difficult to get right. And even things like HDFS, which redefine the problem into something much much more manageable and with better semantics for distributed computing, have had their own issues.

Take, for example, my favorite storage system Ceph. As I understand it was originally going to be CephFS, with multiple metadata servers and lots of distributed POSIX goodness. However, in the 10+ years its been in development, the parts that have gotten tons of traction and have widespread use are seemingly one-off side projects from the underlying storage system: object storage and the RBD block device interfaces. Only in the past 12 months is CephFS becoming production ready. But only with a single metadata server, and the multiple metadata servers are still being debugged.

With Ceph, some of these timing issues are that the market for object store and network-based block devices are dwarf the market for distributed POSIX. But I bring it up to point out that distributed POSIX is also just a really really hard problem, with limited use cases. It's super convenient for getting an existing Unix executable to run on lots of machines at once. But that convenience may not be worth the challenges it imposes on the infrastructure.

Object Storage and layered on top of that RBD are much easier to get right.

CephFS simply didn't gain as much traction because it made sense to just store objects in many cases, and let something else worry about what is stored and where. A massive distributed file system is not nearly as necessary as people make it out to be for a lot of different workloads.

RBD works so well because it can replace iSCSI/FC storage and provide failover and redundancy, and as you scale servers it gets faster. We had 20 Gbit/sec bandwidth for our storage network (to each hypervisor), with 20 Gbit/sec for the front-end Ceph access network, and Ceph RBD easily saturated it on the hypervisor side. Spinning up VM's in OpenStack was a pleasure and soooooo quick (RBD backed images too).

We looked at increasing the hypervisor to 40 Gbit/sec, but decided against it for cost reasons, and in reality we had very little workloads that actually needed that much storage IO.

"This entire space is littered with hard to use and/or flawed products."

None of the internet facing services actually need a strongly consistent POSIX filesystem that scales across multiple datacenters and a huge chunk of them won't put up with corresponding latencies of something like that for mere convinience. So the products are not really flawed, they just don't need to do those things.

this is right. posix for posix sake shouldn't really be relevant anymore. look at the success of write-once s3. posix semantics were designed to address issues of concurrent access to the same bytes.

also, the only standard distributed protocol for posix, nfs, has always had a lot of design and implementation issues. v4 is complex and tries to address some of the consistency issues, but I don't know how successful it is in practice given the limited usage.

treating blobs as immutable addressable objects and using other systems for coordination avoids a lot of the pain with caching, consistency, metadata performance, etc. you can layer those things...its a good cut for large scale distributed systems

Well, NFS isn't really cache coherent, and hence not POSIX compliant. Lustre is, but pays for it with an amazingly complicated architecture.

I have (somewhat esoteric corner cases, but still) benchmarks that will cause failures within seconds when run against NFS.

I've played around with Ceph, I really like its features and strong consistency.

But its weakness is that it is a complex beast to setup and maintain. If you think that configuring and maintaining a Hadoop cluster is hard, then Ceph is twice as hard. Ceph has really good documentation, but all the knobs you have to read and play around with is still too much.

I prefer running a few ZFS servers, very easy to setup and maintain and much better performance. But if we reach Petabyte scale I think I would need something like Ceph.

I'll echo that, we had a project that at one time, even with a couple experienced Ceph admins, suffered meltdown after meltdown on Ceph until we just took the time to move the project and workload over to HDFS. Our admins learned to setup and administer HDFS from scratch and have had far fewer issues.

When you have an object store available (Ceph, Cleversafe, S3, etc), and need a POSIX file system you can also use ObjectiveFS[1]. By using an existing object store for the backend storage, it separates the concerns of reliable storage from the file system semantics which makes it (we think) easier to set up and use. With object and block storage working great for their use cases, file systems can provide some extra functionality, e.g. point-in-time snapshots of the whole file system, which makes it easy to create a consistent backup of your data.

[1]: https://objectivefs.com

I generally agree with your assessment, but for what it's worth more recently Amazon has shipped its Elastic Filesystem (which exposes NFSv4) and Microsoft offers Azure File Storage (SMB 3.0). AIUI neither is truly POSIX compliant but both are probably good enough for realizing most of the dream?

I haven't actually had the occasion to use either of them in a project but very curious to hear about whether they have successfully enabled some useful legacy->cloud migration scenarios.

Actually, the reason why rbd and rgw was prioritised was because of them being in popular demand. There were also a lot of cash and movement behind them too. I'm not saying it is an easy problem, just that their focus has been on other stuff along the way.

So I think it's a question of the correct development focus in the correct order.

The goal is to have a single CephFS across the globe, with the current pace they will get there, eventually!

I use Tahoe-LAFS for all of my distributed FS stuff, and I really do love it. One minor downside is that the introducer server is a SPOF, but they can be backed up/spun up and distributed introducers are on the roadmap for the next year.

Intel seems to be cleaning its house from a number of long ongoing efforts. The most high profile one was the IDF, then OpenStack and this one here fits the pattern as well. Looks like at the beginning of the year they did some hard thinking where the company should be heading and decided to change direction. And it looks like they mean it.

The real question however is not where Intel does not see its business today but where it sees its future business.

They've also spanoff Intel Security back into McAffee as a separate company which Intel only has a stake in.

Intel seems to be going back to being a hardware only company.

As a counter-example, they bought back Itseez which had spun off to develop OpenCV.

As well, they've gutted the Intel XDK, soon to be stopping cloud build services and removing several other features.

A lot of commenters here are treating this announcement like it is the end of Lustre. It is not. Lustre has a LONG history and a lot of deployments. A partial history:

1999: Lustre is started as a research project at Carnegie Mellon University 2001: Cluster File Systems, Inc. founded to commercialize Lustre 2007: Sun acquires Cluster File Systems 2010: Oracle buys Sun Microsystems, acquires Lustre assets 2010: Eric Barton leaves Oracle and founds Whamcloud, continues Lustre dev there 2012: Intel buys Whamcloud for its Lustre assets 2013: Oracle sells Lustre-related assets to Xyratex 2013: Intel tries to extend Lustre to Hadoop use-cases 2014: Seagate buys Xyratex 2017: Intel discontinues its Lustre Filesystem business

I seem to remember IBM and HP being involved at some stage, but I'm having trouble finding it online now.

The only serious open-source competitors to Lustre are Ceph and glusterfs. But Ceph is too unstable, and glusterfs 3.0 is based off of distributed hash tables and so is not strongly consistent.

Whatever happened to parallel NFS/pNFS? I know Panasas basically uses this + some of their own special sauce, but hasn't been clear to me if there is an open source implementation of pNFS that could approximate performance of Panasas on the right non-proprietary hardware.

IBM operates an internal company-wide storage system called GSA or Global Storage Architecture, which is GPFS on the servers, with various and sundry methods of access through automounted NFS, Samba, and a basic web interface. I wonder if such a system could be constructed with Lustre, one that wouldn't require a Lustre wizard with a Unix beard to keep it happy.

We used Lustre, pNFS and GPFS at Stanford on HPC gear (DDN, Panasas and some enterprise COTS). Luster has a lot of moving parts and config. Most folks tend to use Puppet/Chef and/or Rocks distro to deploy clusters in a somewhat sane manner. (Sometimes AFS too but not much.)

These days, Ceph/Gluster might work, but Lustre's proven.

It's been amusing to see all the apologia and simple ass-covering following this.

Lustre is a monstrosity - badly designed, poorly implemented, very hard to configure and keep running or even get adequate performance under other than a single limited use case.

Good riddance to bad rubbish.

> or even get adequate performance under other than a single limited use case.


From the article: " In the core HPC market, Lustre has 75 of the Top 100 systems on the Top 500 rankings, and is used on nine of the top ten systems."

That is a strong argument for it being performant. That 'single limited use case' you mention is... high performance computing.

That's an artifact of the limited choices available. Plus the core HPC market is both notoriously inbred and tolerant of low system reliability and high delicacy that the enterprise wouldn't let in the door.

And no, "single limited use case" I mean, is a relatively small number of very large files that need to be sequentially streamed out/in as fast as possible. That's a small subset of HPC.

GPFS is the gold standard but it's expensive.

Lustre is more a byproduct of the HPC vendors trying to get more margin by using open source. They've drunk from the poisoned chalice in that the amount of time, money and effort required to get Lustre to acceptability (limited as that may be) has been far more than GPFS license/support cost. Even from the brain-eating zombie that is today's IBM.

> GPFS is the gold standard but it's expensive.

Confirming! And going on a small tangent (ha!).

Our previous configuration was Lustre and XFS/NFS; the former was the scratch file system for HPC applications and the latter for home directories and what not.

Lustre was definitely a beast (in a good sense), but we'd occasionally get bit by a work load high in metadata operations - which would bring Lustre to its knees due to latency.

XFS/NFS was great for its purpose (no HPC workloads), but we'd also get bit by a user or users reading seemingly tiny, innocuous files (inadvertently in parallel) which would cause load averages to spike; surprisingly the latency wasn't as bad as the Lustre workload mentioned above.

Not too drink the GPFS kool aid here, but it's definitely solved both problems above. It has its issues, but definitely handles the common I/O patterns seen on our cluster.

If I walked you through the Lustre metadata state machine you wouldn't be surprised by Lustre latency any longer. It's a veritable Rube Goldberg machine without the amusement.

I believe it.

At least the Lustre developers (pre-Intel) had the foresight to enable extremely good debugging - you could simply enable a few procfs settings and easily find the offenders.

It's still amusing to me however, that the biggest offenders of Lustre slow-downs were single core processes. I'd check the RPC counts per node, find the violator, and then check per PID statistics; it was always (95%+) a single core application performing thousands of calls.

We never did experiment with the 2.x branch of the software. I recall one of our co-workers stating that even the developers did not believe the dual MDS set-up was production ready at that time.

right, so they took a vertical that no one else wanted and served it kind of badly. I don't know about the last few years, but Lustre used to be famous for eating up a huge amount of time in the field, even given that its only supposed to be relevant for really large files.

it definitely served a need that nothing else did, but from an armchair position we could collectively do better

In your opinion, what would be better than Lustre? Ceph, Gluster, HDFS?

Those choices are apples and oranges. Particularly HDFS as you can't really use it as a native fs.


Any recommendations for alternatives, and what their relative weak points are in your experience?

I only have experience running Lustre and Gluster. Gluster was a mess, to be honest, though that was several years ago. Lustre (and we run Intel Enterprise Lustre) has been pretty solid. Most HPC outfits run lustre over ZFS, actually, so you get the benefits of both.

Ceph and Gluster can't achieve the kind of performance required for HPC scratch. Hadoop can, but if you are managing a general purpose HPC cluster, nearly all software out there is expecting to write to a posix filesystem. Much of this software is built out of old C and FORTRAN that is not going to be rewritten. Researchers are not software engineers. You need a "lowest common denominator" of storage that you can read/write at high speed in parallel. There are few technologies for this particular use case.

The others that I know of (but haven't tried) are:

- OrangeFS (http://www.orangefs.org/) - BeeGFS (http://www.beegfs.com/) - OpenAFS/Arla (ancient, and weird, it is one of the big reasons Scientific Linux exists as a distro: http://www.openafs.org/, http://www.stacken.kth.se/project/arla/) - Scality (http://www.scality.com/) - GPFS (Spectrum now: https://www-03.ibm.com/systems/storage/spectrum/scale/) - Qumulo (they say they can hit the numbers, but I have doubts: http://qumulo.com/

There are probably many more, but these are the ones I know about.

People may use Gluster for HPC; I don't know. But it really isn't suited.

I run two Gluster clusters for more boring enterprisey reliability.

It is still kind of a mess - there are rough edges all over. Log messages make zero sense by themselves - you either need to learn what they're indicative of or read the code. Frequently, typoing configuration commands (configuration is mostly imperative command line driven, not via configuration files) so that it fails, actually does make changes, such that you'll have to unwind what it did before you can try again, etc.

All that said, Gluster is pretty solid for our sue - distributed, fault-tolerant POSIX storage. But I wouldn't use it in an HPC environment.

Qumulo can't hit the numbers. It's just NFS, running on ho-hum hardware with an Ethernet cluster interconnect. It's like slow Isilon if cheaper and with less functionality.

So far I heard a lot about GlusterFS and Ceph being great but so far I only seen MooseFS in production.

MooseFS seems to do its job well, the only problem is that by default it lacks HA. The master server is very important since it keeps track of where all data is located, so if it is gone your data is gone. You can set up one or more metalogger nodes, which sole purpose is to backup the metadata information. You can also manually promote metalogger to master.

They have commercial offering that includes HA, but I never used so hard to comment about it.

You could also set up HA yourself using corosync and peacemaker, but it's a bit challenging task.

Oh yeah, MooseFS. I forgot about that one.

Any opinion on BeeGFS?

We used BeeGFS since it was called FhGFS on a cluster of 32 nodes for a total capacity ranging from 256tb-1Pb. BeeGFS is very easy to configure and maintain, outperforming Lustre for every task we threw it at. In fact, we never transitioned lustre into production. Lustre was often slower than plain NFS for the cases where all storage could be mounted by a single node(!).

Unfortunately, it's prior licensing scheme prevented larger initial adoption. A good chunk of large scale setups I've seen have adopted Ceph, but I cannot comment on performance.

BeeGFS is great. I originally evaluated it when it was FhGFS and am still involved with a large production BeeGFS cluster deployment.

Any knucklehead can install it and it works transparently with excellent performance. It's currently not enterprise-grade fault tolerant - that's the worst thing I can say about it.

> It's currently not enterprise-grade fault tolerant - that's the worst thing I can say about it.

Its getting there quickly. BeeGFS v6 has metadata mirroring via buddy groups, and data mirroring the same way. A few things are still not there, but it is rapidly getting closer.

Being open source helps as well.

Definitely recommend it, and have deployed it and supported it many places in my previous life: http://scalableinformatics.com .

openvstorage and sheepdog probably don't come close to the scale of ceph or lustre?

So now they lack lustre?

I'll get my coat.

Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact