Man, reading that made me wish Clustrix (YC '06) had open sourced their database (https://www.clustrix.com/). They had a MySQL-compatible scale-out DB nearly 10 years ago: wire-protocol compatible with MySQL without using any MySQL code, and able to participate in a replication cluster alongside normal MySQL servers (which made migration easy). It was scale-out shared-nothing, so writes scaled linearly as you added nodes, unlike POLARDB, which is shared-everything with a single master. It used RDMA 10 years ago, and custom PCIe devices because NVMe didn't exist yet.
But they didn't open source it, so only a small handful of companies get to use it. Sad.
The hardest part of a distributed file system (and I mean file system here) is managing the metadata: where a file is, where the directory is, who last did something to it.
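For concreteness, here's a toy sketch (mine, not taken from any of these systems) of the kind of per-file record that has to be kept consistent somewhere:

    from dataclasses import dataclass

    @dataclass
    class FileMeta:
        # Illustrative field names only; every distributed FS slices this
        # differently, but some version of each item has to live somewhere.
        inode: int            # unique file id
        parent_dir: int       # inode of the containing directory
        chunk_locations: dict # chunk index -> list of servers holding it
        size: int             # current length in bytes
        mtime: float          # last modification time
        last_writer: str      # node/client that last touched it

Every open/stat/rename has to consult and update records like this, which is why whatever owns them tends to become the bottleneck.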
Lustre, GFS2 and GPFS all have centralised metadata stores, which is both a boon and a drawback.
What I can't figure out is what they've done here. It appears that metadata is stored in a special shared partition (the "journal")? But there is a control process as well.
It's become very stable over the years. It's somehow never penetrated the Silicon Valley echo chamber, probably due to instability in its earlier years. Operationally, it's got most things you need: online add node, rolling upgrades, monitoring, etc.
We use NDB both to store the metadata in-memory and to store small files on NVMe disks. We had talks with lots of other DB vendors, but, frankly, none of them has high-performance support for cross-partition transactions, which is needed. DBs like VoltDB, MemSQL, and NuoDB all have promise, but they serialize cross-shard operations.
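As a concrete (hypothetical) example of what that buys you, here's the shape of a cross-partition transaction against NDB; the table and column names are made up, and it assumes a MySQL Cluster SQL node on localhost:

    # Requires mysql-connector-python; `inodes` and `dir_stats` are
    # hypothetical tables partitioned by KEY, so the rows touched below
    # live on different data nodes.
    import mysql.connector

    conn = mysql.connector.connect(host="127.0.0.1", user="app",
                                   password="secret", database="metadata")
    conn.start_transaction()
    cur = conn.cursor()
    # A rename: move inode 1001 from directory 7 to directory 42. The rows
    # hash to different partitions, so this is a cross-partition
    # transaction; NDB commits it atomically via its internal two-phase
    # commit instead of serializing it behind all other cross-shard work.
    cur.execute("UPDATE inodes SET parent_id = 42 WHERE id = 1001")
    cur.execute("UPDATE dir_stats SET entries = entries - 1 WHERE dir_id = 7")
    cur.execute("UPDATE dir_stats SET entries = entries + 1 WHERE dir_id = 42")
    conn.commit()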
The community is very helpful over on the forums (https://forums.foundationdb.org), so if you do end up having time to check it out, come on over! I'm sure the team at Apple (I'm not affiliated with them) would love to talk about it.
With respect to metadata, the premise of PVFS^WOrangeFS is that providing full POSIX semantics (specifically locking?) makes the issues even worse, so it doesn't try, to decent effect. That may rule it out for some applications like databases, though.
Lustre has DNE (Distributed Namespace) for distributed metadata now. Presumably multi-tenancy would be important in this sort of application.
There's no planning, no communication, tons of underqualified middle management, tons of politics, and a lot of really bad ideas pushed by HiPPOs (the highest-paid person's opinion).
China is in general about 10-15 years behind on software development. Currently Alibaba's big internal push is a framework that essentially resembles EJB 2.1 stateful beans, but built on Spring.
I guess you are frustrated, but most of it sounds like everywhere else. Also, I think you are exaggerating: they are surely not 10-15 years behind. There are some fine ideas and implementations, for instance the Pouch container and P2P distribution system, even from Alibaba. Hyper.sh is Chinese, too. OpenResty is nice. PingCAP, the makers of TiDB, are Chinese as well.
Well, half the time things went down, it was because the data center hit capacity and someone exploited something in the console to shut down your workloads, which made for a bunch of fun.
I am rather sleep deprived, so I may have misread things, but this doesn't seem to me to be the best benchmark to evaluate Ceph for database work.
From what I understand, best practice in Ceph for databases is to make an RBD image and format that with your filesystem of choice, I believe. The RBD stripe size should be tuned with your database's writes in mind.
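Something like this, I think (a sketch using the python-rbd bindings; the stripe_unit/stripe_count arguments and the 16 KiB stripe unit for 16K DB pages are my assumptions, so check against your bindings' version):

    # Assumes python3-rados / python3-rbd and a reachable cluster.
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")           # target pool
    try:
        rbd.RBD().create(
            ioctx, "db-volume", 100 * 1024**3,  # 100 GiB image
            stripe_unit=16 * 1024,              # match the DB page size
            stripe_count=8,                     # fan writes out across 8 objects
        )
    finally:
        ioctx.close()
        cluster.shutdown()

Then map it, mkfs, and point the database's data directory at the mount.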
I believe Ceph RBD supports RDMA, but I cannot find many current details about it.
How not? You can map an RBD image from multiple VMs/servers at the same time; it's just that if you want to do so, you need a multi-client-aware filesystem. Otherwise, if you just need something attached to a single host, you format the volume ext4/xfs/whatever, map it, and throw your database on that.
I think the most interesting part is PolarFS taking full advantage of emerging techniques like RDMA, NVMe, and SPDK, plus the ParallelRaft consensus algorithm.
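For anyone skimming: my reading of the paper is that ParallelRaft's trick is letting log entries be acknowledged and applied out of order as long as they don't touch overlapping LBA ranges, tracked via a small look-behind window. A toy illustration of just that conflict check (structure and names are mine, not the paper's):

    # Toy model: each entry records the LBAs it writes; an entry may proceed
    # ahead of its N predecessors only if their LBA sets don't overlap.
    N = 2  # look-behind window size

    log = [
        {"index": 1, "lbas": {100, 101}},
        {"index": 2, "lbas": {200}},   # disjoint from entry 1 -> can go early
        {"index": 3, "lbas": {100}},   # overlaps entry 1 -> must wait
    ]

    for i, entry in enumerate(log):
        window = [log[j]["lbas"] for j in range(max(0, i - N), i)]
        preds = set().union(*window)   # union over an empty window is empty
        ok = entry["lbas"].isdisjoint(preds)
        print(f"entry {entry['index']}: out-of-order OK = {ok}")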