Hacker News

This entire space is littered with hard to use and/or flawed products. It's extremely difficult to get right. And even things like HDFS, which redefine the problem into something much much more manageable and with better semantics for distributed computing, have had their own issues.

Take, for example, my favorite storage system, Ceph. As I understand it, it was originally going to be CephFS, with multiple metadata servers and lots of distributed POSIX goodness. However, in the 10+ years it's been in development, the parts that have gotten tons of traction and widespread use are seemingly one-off side projects built on the underlying storage system: object storage and the RBD block device interface. Only in the past 12 months has CephFS become production ready, and only with a single metadata server; multiple metadata servers are still being debugged.

With Ceph, part of the timing issue is that the markets for object storage and network-based block devices dwarf the market for distributed POSIX. But I bring it up to point out that distributed POSIX is also just a really, really hard problem, with limited use cases. It's super convenient for getting an existing Unix executable to run on lots of machines at once. But that convenience may not be worth the challenges it imposes on the infrastructure.

Object storage, and RBD layered on top of it, are much easier to get right.

CephFS simply didn't gain as much traction because it made sense to just store objects in many cases, and let something else worry about what is stored and where. A massive distributed file system is not nearly as necessary as people make it out to be for a lot of different workloads.
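To make the contrast concrete, here's a minimal sketch (hypothetical, not Ceph's actual API) of how small the object-store contract is compared to POSIX: just put/get/delete on opaque keys, with no rename, no partial writes, and no directory tree to keep consistent across nodes.

```python
# Hypothetical sketch of the object-store contract -- not real Ceph/RADOS
# code.  Whole-object operations on opaque keys are all there is, which is
# why "let something else worry about what is stored and where" works.

class ObjectStore:
    def __init__(self):
        self._objects = {}  # key -> bytes; a real store shards this across servers

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data  # whole-object write, atomic by construction

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        del self._objects[key]


store = ObjectStore()
store.put("images/vm-base.qcow2", b"\x00" * 16)
assert store.get("images/vm-base.qcow2") == b"\x00" * 16
```

Everything POSIX adds on top of this (byte-range writes, rename, permissions, coherent concurrent access) is exactly the hard distributed part.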

RBD works so well because it can replace iSCSI/FC storage and provide failover and redundancy, and it gets faster as you scale servers. We had 20 Gbit/sec of bandwidth for our storage network (to each hypervisor), with 20 Gbit/sec for the front-end Ceph access network, and Ceph RBD easily saturated it on the hypervisor side. Spinning up VMs in OpenStack was a pleasure and soooooo quick (RBD-backed images too).
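The "gets faster as you scale servers" part follows from how RBD stripes an image into fixed-size backing objects (4 MiB by default). A rough sketch of the offset-to-object mapping, simplified from the real layout:

```python
# Rough sketch of mapping a block-device byte offset onto backing objects,
# in the spirit of RBD striping.  The 4 MiB object size is RBD's default;
# the layout below is a simplification, not Ceph's actual code.

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB

def map_offset(offset: int):
    """Return (object_index, offset_within_object) for a byte offset."""
    return offset // OBJECT_SIZE, offset % OBJECT_SIZE

# A large sequential write touches many consecutive objects, which can
# live on different servers -- so throughput scales with the cluster.
print(map_offset(0))                # (0, 0)
print(map_offset(5 * 1024 * 1024))  # (1, 1048576)
```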

We looked at increasing the hypervisor links to 40 Gbit/sec, but decided against it for cost reasons; in reality we had very few workloads that actually needed that much storage IO.

"This entire space is littered with hard to use and/or flawed products."

None of the internet-facing services actually need a strongly consistent POSIX filesystem that scales across multiple datacenters, and a huge chunk of them won't put up with the corresponding latencies of something like that for mere convenience. So the products are not really flawed; they just don't need to do those things.

this is right. posix for posix's sake shouldn't really be relevant anymore. look at the success of write-once s3. posix semantics were designed to address issues of concurrent access to the same bytes.

also, the only standard distributed protocol for posix, nfs, has always had a lot of design and implementation issues. v4 is complex and tries to address some of the consistency issues, but I don't know how successful it is in practice given its limited usage.

treating blobs as immutable addressable objects and using other systems for coordination avoids a lot of the pain with caching, consistency, metadata performance, etc. you can layer those things... it's a good cut for large-scale distributed systems
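A tiny sketch of the "immutable addressable objects" idea (my own illustration, not any particular system): if the key is the hash of the content, a blob can never change under a reader, so caching needs no invalidation protocol at all, and mutation becomes "write a new blob, update a pointer in some separate coordination system."

```python
# Content-addressed blob store sketch (illustrative only).  Keys are
# content hashes, so blobs are immutable by construction and any cache
# of them can be kept forever without a coherence protocol.

import hashlib

class BlobStore:
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data   # idempotent: same content, same key
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]   # safe to cache indefinitely


blobs = BlobStore()
k1 = blobs.put(b"version 1 of the file")
k2 = blobs.put(b"version 2 of the file")
# "Mutation" is a new blob plus a pointer update held elsewhere
# (a database, ZooKeeper, etc.) -- the coordination layer.
assert blobs.get(k1) == b"version 1 of the file"
assert k1 != k2
```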

Well, NFS isn't really cache coherent, and hence not POSIX compliant. Lustre is, but pays for it with an amazingly complicated architecture.

I have (somewhat esoteric corner cases, but still) benchmarks that will cause failures within seconds when run against NFS.
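The kind of failure involved is easy to reproduce in miniature. NFS clients typically guarantee only close-to-open consistency: cached data is revalidated at open(), not on every read, so a client holding a file open can keep serving stale bytes after another client writes. A toy simulation of that behavior (not real NFS, just the caching model):

```python
# Toy simulation of NFS-style close-to-open consistency -- NOT real NFS.
# A client snapshots file contents into its cache at open() and serves
# reads from that cache; it does not see other clients' writes until it
# reopens.  A POSIX-coherent filesystem would make every read see "new".

class Server:
    def __init__(self):
        self.files = {}

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def open(self, path):
        self.cache[path] = self.server.files.get(path)  # revalidate only here

    def read(self, path):
        return self.cache[path]          # served from cache while open

    def write(self, path, data):
        self.server.files[path] = data   # write-through, for simplicity
        self.cache[path] = data


srv = Server()
a, b = Client(srv), Client(srv)
a.write("/f", b"old")
b.open("/f")
a.write("/f", b"new")
print(b.read("/f"))   # b'old' -- stale read; this is what breaks benchmarks
b.open("/f")
print(b.read("/f"))   # b'new' -- only visible after reopen
```

Any benchmark that assumes writes are immediately visible to other clients will hit this within seconds.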

I've played around with Ceph, I really like its features and strong consistency.

But its weakness is that it is a complex beast to set up and maintain. If you think configuring and maintaining a Hadoop cluster is hard, Ceph is twice as hard. Ceph has really good documentation, but the number of knobs you have to read about and play around with is still too much.
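Even a minimal ceph.conf hints at the surface area. The fragment below is illustrative only (placeholder fsid, hostnames, and networks); these few options are just the start of the knobs a real deployment ends up tuning:

```ini
; Illustrative minimal ceph.conf -- placeholder values throughout.
[global]
fsid = 00000000-0000-0000-0000-000000000000
mon_initial_members = mon-a, mon-b, mon-c
mon_host = 10.0.0.1, 10.0.0.2, 10.0.0.3
public_network = 10.0.0.0/24
cluster_network = 10.1.0.0/24
osd_pool_default_size = 3
osd_pool_default_min_size = 2
```

And that is before touching the OSD, CRUSH, and cache-tuning options that deployments usually end up reading about.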

I prefer running a few ZFS servers: very easy to set up and maintain, with much better performance. But if we reach petabyte scale, I think I would need something like Ceph.

I'll echo that. We had a project that at one time, even with a couple of experienced Ceph admins, suffered meltdown after meltdown on Ceph, until we just took the time to move the project and workload over to HDFS. Our admins learned to set up and administer HDFS from scratch and have had far fewer issues.

When you have an object store available (Ceph, Cleversafe, S3, etc), and need a POSIX file system you can also use ObjectiveFS[1]. By using an existing object store for the backend storage, it separates the concerns of reliable storage from the file system semantics which makes it (we think) easier to set up and use. With object and block storage working great for their use cases, file systems can provide some extra functionality, e.g. point-in-time snapshots of the whole file system, which makes it easy to create a consistent backup of your data.
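A sketch of that layering idea (my own illustration, not ObjectiveFS internals): file data lives as immutable blobs in the object store, while the namespace is just a map from path to blob key. A point-in-time snapshot then only has to copy the metadata map, never the data, which is what makes whole-filesystem snapshots cheap.

```python
# Illustrative sketch of a file namespace layered over an object store --
# not ObjectiveFS's actual design.  Data blobs are content-addressed and
# immutable; a snapshot copies only the path->key map and shares blobs.

import hashlib

class LayeredFS:
    def __init__(self):
        self.blobs = {}      # stands in for the backing object store
        self.namespace = {}  # path -> blob key (the filesystem metadata)
        self.snapshots = {}  # snapshot name -> frozen copy of namespace

    def write(self, path: str, data: bytes) -> None:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        self.namespace[path] = key

    def read(self, path: str) -> bytes:
        return self.blobs[self.namespace[path]]

    def snapshot(self, name: str) -> None:
        self.snapshots[name] = dict(self.namespace)  # O(#files), shares blobs

    def read_snapshot(self, name: str, path: str) -> bytes:
        return self.blobs[self.snapshots[name][path]]


fs = LayeredFS()
fs.write("/etc/app.conf", b"v1")
fs.snapshot("backup")
fs.write("/etc/app.conf", b"v2")
print(fs.read("/etc/app.conf"))                      # b'v2' -- live view
print(fs.read_snapshot("backup", "/etc/app.conf"))   # b'v1' -- frozen view
```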

[1]: https://objectivefs.com

I generally agree with your assessment, but for what it's worth more recently Amazon has shipped its Elastic Filesystem (which exposes NFSv4) and Microsoft offers Azure File Storage (SMB 3.0). AIUI neither is truly POSIX compliant but both are probably good enough for realizing most of the dream?

I haven't actually had the occasion to use either of them in a project, but I'm very curious to hear whether they have successfully enabled some useful legacy-to-cloud migration scenarios.

Actually, the reason rbd and rgw were prioritised was that they were in popular demand. There was also a lot of cash and movement behind them. I'm not saying it is an easy problem, just that the focus has been on other stuff along the way.

So I think it's a question of the correct development focus in the correct order.

The goal is to have a single CephFS across the globe; at the current pace they will get there, eventually!

I use Tahoe-LAFS for all of my distributed FS stuff, and I really do love it. One minor downside is that the introducer server is a SPOF, but it can be backed up and spun back up, and distributed introducers are on the roadmap for the next year.
