Man, reading that made me wish Clustrix (YC '06) had open sourced their database (https://www.clustrix.com/). They had a MySQL-compatible scale-out DB nearly 10 years ago: wire-protocol compatible with MySQL without using any MySQL code, and able to participate in a replication cluster alongside normal MySQL servers (which made migration easy). It was scale-out shared-nothing, so writes scaled linearly as you added nodes, unlike POLARDB, which is shared-everything with a single master. It used RDMA 10 years ago, and custom PCIe devices because NVMe didn't exist yet.
But they didn't open source it, so only a small handful of companies get to use it. Sad.
The hardest part of a distributed file system (and I mean file system here) is managing the metadata: where a file is, where the directory is, who last did something to it.
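For concreteness, here's a toy sketch (mine, not taken from any of these systems) of the kind of per-file record that has to be kept consistent somewhere:

    from dataclasses import dataclass

    @dataclass
    class FileMeta:
        # Illustrative field names only; every distributed FS slices this
        # differently, but some version of each item has to live somewhere.
        inode: int            # unique file id
        parent_dir: int       # inode of the containing directory
        chunk_locations: dict # chunk index -> list of servers holding it
        size: int             # current length in bytes
        mtime: float          # last modification time
        last_writer: str      # node/client that last touched it

Every open/stat/rename has to consult and update records like this, which is why whatever owns them tends to become the bottleneck.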
Lustre, GFS2 and GPFS all have centralised metadata stores, which is both a boon and a drawback.
What I can't figure out is what they've done here. It appears that metadata is stored in a special shared partition (the "journal")? But there is a control process as well.
It's become very stable over the years. It's somehow never penetrated the Silicon Valley echo chamber, probably due to instability in its earlier years. Operationally, it's got most things you need: online add node, rolling upgrades, monitoring, etc.
We use NDB both to store the metadata in-memory and to store small files on NVMe disks. We had talks with lots of other DB vendors, but, frankly, none of them has high-performance support for cross-partition transactions, which is needed. DBs like VoltDB, MemSQL, and NuoDB all have promise, but they serialize cross-shard operations.
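As a concrete (hypothetical) example of what that buys you, here's the shape of a cross-partition transaction against NDB; the table and column names are made up, and it assumes a MySQL Cluster SQL node on localhost:

    # Requires mysql-connector-python; `inodes` and `dir_stats` are
    # hypothetical tables partitioned by KEY, so the rows touched below
    # live on different data nodes.
    import mysql.connector

    conn = mysql.connector.connect(host="127.0.0.1", user="app",
                                   password="secret", database="metadata")
    conn.start_transaction()
    cur = conn.cursor()
    # A rename: move inode 1001 from directory 7 to directory 42. The rows
    # hash to different partitions, so this is a cross-partition
    # transaction; NDB commits it atomically via its internal two-phase
    # commit instead of serializing it behind all other cross-shard work.
    cur.execute("UPDATE inodes SET parent_id = 42 WHERE id = 1001")
    cur.execute("UPDATE dir_stats SET entries = entries - 1 WHERE dir_id = 7")
    cur.execute("UPDATE dir_stats SET entries = entries + 1 WHERE dir_id = 42")
    conn.commit()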
The community is very helpful over on the forums (https://forums.foundationdb.org), so if you do end up having time to check it out, come on over! I'm sure the team at Apple (I'm not affiliated with them) would love to talk about it.
With respect to metadata, the premise of PVFS^WOrangeFS is that providing full POSIX semantics (specifically locking?) makes the issues even worse, so it doesn't try, to decent effect. That may rule it out for some applications like databases, though.
Lustre has DNE (Distributed Namespace) for distributed metadata now. Presumably multi-tenancy would be important in this sort of application.
There's no planning, no communication, tons of underqualified middle management, tons of politics, and a lot of really bad ideas pushed by HiPPOs (the highest-paid person's opinion).
China is in general about 10-15 years behind on software development. Currently Alibaba's big internal push is a framework that essentially resembles EJB 2.1 stateful beans, but built on Spring.
I guess you are frustrated, but most of it sounds like everywhere else. Also, I think you are exaggerating: they are surely not 10-15 years behind. There are some fine ideas and implementations, for instance the Pouch container and P2P distribution system, even from Alibaba. Hyper.sh is Chinese, too. OpenResty is nice. PingCAP, the makers of TiDB, are Chinese as well.
Well, half the time things went down, it was because the data center hit capacity and someone exploited something in the console to shut down your workloads, which made for a bunch of fun.
I am rather sleep deprived, so I may have misread things, but this doesn't seem to me to be the best benchmark to evaluate Ceph for database work.
From what I understand, best practice in Ceph for databases is to make an RBD image and format that with your filesystem of choice, I believe. The RBD stripe size should be tuned with your database's writes in mind.
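Something like this, I think (a sketch using the python-rbd bindings; the stripe_unit/stripe_count arguments and the 16 KiB stripe unit for 16K DB pages are my assumptions, so check against your bindings' version):

    # Assumes python3-rados / python3-rbd and a reachable cluster.
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")           # target pool
    try:
        rbd.RBD().create(
            ioctx, "db-volume", 100 * 1024**3,  # 100 GiB image
            stripe_unit=16 * 1024,              # match the DB page size
            stripe_count=8,                     # fan writes out across 8 objects
        )
    finally:
        ioctx.close()
        cluster.shutdown()

Then map it, mkfs, and point the database's data directory at the mount.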
I believe Ceph RBD supports RDMA, but I cannot find many current details about it.
How not? You can map an RBD image from multiple VMs/servers at the same time; it's just that if you want to do so, you need a multi-client-aware filesystem. Otherwise, if you just need something attached to a single host, you format the volume ext4/xfs/whatever, map it, and throw your database on that.
I think the most interesting part is PolarFS taking full advantage of emerging techniques like RDMA, NVMe, and SPDK, plus the ParallelRaft consensus algorithm.
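For anyone skimming: my reading of the paper is that ParallelRaft's trick is letting log entries be acknowledged and applied out of order as long as they don't touch overlapping LBA ranges, tracked via a small look-behind window. A toy illustration of just that conflict check (structure and names are mine, not the paper's):

    # Toy model: each entry records the LBAs it writes; an entry may proceed
    # ahead of its N predecessors only if their LBA sets don't overlap.
    N = 2  # look-behind window size

    log = [
        {"index": 1, "lbas": {100, 101}},
        {"index": 2, "lbas": {200}},   # disjoint from entry 1 -> can go early
        {"index": 3, "lbas": {100}},   # overlaps entry 1 -> must wait
    ]

    for i, entry in enumerate(log):
        window = [log[j]["lbas"] for j in range(max(0, i - N), i)]
        preds = set().union(*window)   # union over an empty window is empty
        ok = entry["lbas"].isdisjoint(preds)
        print(f"entry {entry['index']}: out-of-order OK = {ok}")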