But they didn't open source it, so only a small handful of companies get to use it. Sad.
Lustre, GFS2 and GPFS all have centralised metadata stores, which is both a boon and a drawback.
What I can't figure out is what they've done here. It appears that metadata is stored in a special partition ("journal") which is shared? But there is a control process as well.
HopsFS (a derivative of HDFS) distributes its metadata, enabling 16X throughput improvements over HDFS - https://www.usenix.org/conference/fast17/technical-sessions/...
Technically, the hard part is correct concurrent and consistent operations across shards (partitions).
Disclaimer: one of the designers of HopsFS.
We use NDB both to store the metadata in memory and to store small files on NVMe disks. We had talks with lots of other DB vendors, but, frankly, none of them has high-performance support for cross-partition transactions, which is needed. DBs like VoltDB, MemSQL and NuoDB all have promise, but they serialize cross-shard operations.
Also, you can hash-shard many tables so that most hot-path transactions stay inside a single server, which you can't guarantee with range-sharding.
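A rough sketch of the hash vs. range point, in case it helps. Everything here is made up for illustration: the shard count, the range boundaries, and in particular the choice of the parent inode id as the partition key are assumptions, not a description of the actual HopsFS or NDB schema. It only shows that hashing on a shared key keeps one directory's rows on one shard, while range-sharding on the row's own id can scatter them.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def hash_shard(partition_key: int) -> int:
    """Map a partition key to a shard with a stable hash."""
    digest = hashlib.sha256(str(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Hypothetical range boundaries on inode id:
# ids < 1000 -> shard 0, < 2000 -> shard 1, < 3000 -> shard 2, rest -> shard 3
RANGE_BOUNDS = [1000, 2000, 3000]


def range_shard(inode_id: int) -> int:
    """Map an inode id to a shard by the range it falls into."""
    return bisect.bisect_right(RANGE_BOUNDS, inode_id)


# (inode_id, parent_id, name): children of directory 5 with scattered ids
inodes = [(17, 5, "a"), (1500, 5, "b"), (2999, 5, "c"), (3500, 5, "d")]


def shards_touched_hash(rows):
    """Shards a directory listing hits when rows are hashed on parent_id."""
    return {hash_shard(parent_id) for _, parent_id, _ in rows}


def shards_touched_range(rows):
    """Shards the same listing hits when rows are range-sharded on inode_id."""
    return {range_shard(inode_id) for inode_id, _, _ in rows}


if __name__ == "__main__":
    print("hash on parent_id:", len(shards_touched_hash(inodes)), "shard(s)")
    print("range on inode_id:", len(shards_touched_range(inodes)), "shard(s)")
```

Operations that span directories (a rename, say) can still cross shards under this scheme, which is where the cross-partition transaction support comes in.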
Lustre has DNE for distributed metadata now. Presumably multi-tenancy would be important in this sort of application.
But given how often the Alibaba cloud fails in production, I won't hold my breath.
I was curious: could you compare and contrast it with what I imagine are its competitors? HDFS, Ceph, GlusterFS etc.
Have you replaced any of those existing systems internally yet?
The main competitors at this scale would be Lustre & GPFS.
Disclaimer: not an Alibaba employee.
There's no planning, no communication, tons of underqualified middle management, tons of politics, and a lot of really bad ideas are pushed by HIPPOs.
China is in general about 10-15 years behind on software development. Currently Alibaba's big push internally is a framework that essentially resembles EJB 2.1 stateful beans, but built on Spring.
https://whatis.techtarget.com/definition/HiPPOs-highest-paid...: "HiPPO (highest paid person's opinion, highest paid person in the office)"
Still, this is all very interesting.
From what I understand, best practice in Ceph for databases is to make an RBD image and format that with your filesystem of choice. The RBD stripe size should be tuned with your database writes in mind.
I believe Ceph RBD supports RDMA, but I cannot find many current details about it.
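To make the stripe size point a bit more concrete, here is a small sketch of how a byte offset in a striped image maps onto backing objects. The mapping follows my reading of the Ceph striping description (stripe unit, stripe count, object size), so treat it as an assumption rather than the exact librbd behaviour; the function names and the 16 KiB page size are made up for illustration.

```python
def locate(offset: int, stripe_unit: int, stripe_count: int, object_size: int):
    """Return (object_number, offset_in_object) for a byte offset in the image."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit            # which stripe unit overall
    stripe_no = block_no // stripe_count        # which row of stripe units
    stripe_pos = block_no % stripe_count        # which column (object in the set)
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


def objects_touched(offset, length, stripe_unit, stripe_count, object_size):
    """Set of object numbers that a write of `length` bytes at `offset` lands on."""
    touched = set()
    pos, end = offset, offset + length
    while pos < end:
        obj, _ = locate(pos, stripe_unit, stripe_count, object_size)
        touched.add(obj)
        # jump to the start of the next stripe unit
        pos = (pos // stripe_unit + 1) * stripe_unit
    return touched


if __name__ == "__main__":
    KiB, MiB = 1024, 1024 * 1024
    page = 16 * KiB  # hypothetical database page size
    # Stripe unit smaller than the page: one page write fans out over several objects
    print(objects_touched(0, page, 4 * KiB, 4, 4 * MiB))
    # Stripe unit at least as large as the page: an aligned page write stays in one object
    print(objects_touched(0, page, 64 * KiB, 4, 4 * MiB))
```

Whether fanning a page out over several objects helps or hurts depends on the workload, which is why the stripe settings are worth matching to the database's write size.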