Take, for example, my favorite storage system, Ceph. As I understand it, the original goal was CephFS, with multiple metadata servers and lots of distributed POSIX goodness. However, in the 10+ years it's been in development, the parts that have gotten tons of traction and widespread use are seemingly one-off side projects on top of the underlying storage system: object storage and the RBD block device interface. Only in the past 12 months has CephFS become production ready, and only with a single metadata server; multiple metadata servers are still being debugged.
With Ceph, part of the timing issue is that the markets for object storage and network-based block devices dwarf the market for distributed POSIX. But I bring it up to point out that distributed POSIX is also just a really, really hard problem, with limited use cases. It's super convenient for getting an existing Unix executable to run on lots of machines at once. But that convenience may not be worth the challenges it imposes on the infrastructure.
CephFS simply didn't gain as much traction because it made sense to just store objects in many cases, and let something else worry about what is stored and where. A massive distributed file system is not nearly as necessary as people make it out to be for a lot of different workloads.
RBD works so well because it can replace iSCSI/FC storage and provide failover and redundancy, and as you scale out servers it gets faster. We had 20 Gbit/sec of bandwidth for our storage network (to each hypervisor), with another 20 Gbit/sec for the front-end Ceph access network, and Ceph RBD easily saturated it on the hypervisor side. Spinning up VMs in OpenStack (with RBD-backed images, too) was a pleasure and so quick.
We looked at increasing the hypervisor links to 40 Gbit/sec, but decided against it for cost reasons; in reality we had very few workloads that actually needed that much storage IO.
None of the internet-facing services actually needs a strongly consistent POSIX filesystem that scales across multiple datacenters, and a huge chunk of them won't put up with the corresponding latencies of something like that for mere convenience. So the products are not really flawed; they just don't need to do those things.
also, the only standard distributed protocol for POSIX, NFS, has always had a lot of design and implementation issues. v4 is complex and tries to address some of the consistency issues, but I don't know how successful it is in practice given the limited usage.
treating blobs as immutable, addressable objects and using other systems for coordination avoids a lot of the pain with caching, consistency, metadata performance, etc. you can layer those things on top... it's a good cut for large-scale distributed systems
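To make the idea concrete, here's a minimal sketch of a content-addressed blob store (the `BlobStore` class and its dict backing are hypothetical, just for illustration): because a blob's key is the hash of its content, blobs are immutable by construction, writes are idempotent, and reads are self-verifying, so caching and replication need no invalidation protocol.

```python
import hashlib

class BlobStore:
    """Toy content-addressed store: blobs are immutable, keyed by SHA-256 digest."""

    def __init__(self):
        self._blobs = {}  # stand-in for the real storage backend

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Writing the same content twice is a no-op, so there is no
        # write-write conflict to coordinate and nothing to invalidate
        # in downstream caches.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Self-verifying read: any replica can serve the blob, and the
        # client can check integrity without trusting the server.
        if hashlib.sha256(data).hexdigest() != key:
            raise IOError("corrupt blob for key " + key)
        return data

store = BlobStore()
k = store.put(b"hello")
assert store.get(k) == b"hello"
assert k == store.put(b"hello")  # idempotent: same content, same key
```

The coordination layer (which key means what, right now) then lives in a separate, much smaller system, which is exactly the "layering" mentioned above.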
I have (somewhat esoteric corner cases, but still) benchmarks that will cause failures within seconds when run against NFS.
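I don't know what those benchmarks look like, but here's a minimal sketch of the general kind of test that trips NFS up: several processes appending to one file with O_APPEND. On a local POSIX filesystem each append is atomic and the final size adds up; over NFS, the client implements append as "seek to what I think is EOF, then write", so concurrent appenders can silently clobber each other.

```python
import os
import tempfile
from multiprocessing import Process

RECORD = 32  # fixed record size so lost/torn writes show up as a short file

def writer(path, tag, count):
    # O_APPEND is atomic on local POSIX filesystems; NFS clients emulate
    # it with a racy seek-to-EOF-then-write, which is what this test hits.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    line = (tag * (RECORD - 1) + "\n").encode()
    for _ in range(count):
        os.write(fd, line)
    os.close(fd)

def run(path, writers=4, count=1000):
    open(path, "w").close()
    procs = [Process(target=writer, args=(path, str(i), count))
             for i in range(writers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # If any append was lost or torn, the file comes up short.
    return os.path.getsize(path) == writers * count * RECORD

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "append-test")
    print(run(path))  # expect True on a local fs; over NFS it tends to fail fast
```

Point it at an NFS mount shared by real concurrent clients and the size check typically fails within seconds, which matches the "failures within seconds" experience.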
But its weakness is that it is a complex beast to set up and maintain. If you think configuring and maintaining a Hadoop cluster is hard, Ceph is twice as hard. Ceph has really good documentation, but there are still too many knobs to read up on and play around with.
I prefer running a few ZFS servers; they're very easy to set up and maintain, with much better performance. But if we reach petabyte scale I think I would need something like Ceph.
I haven't actually had the occasion to use either of them in a project, but I'm very curious to hear whether they have successfully enabled some useful legacy->cloud migration scenarios.
So I think it's a question of the correct development focus in the correct order.
The goal is to have a single CephFS across the globe; at the current pace they will get there, eventually!