
PolarFS: Alibaba Distributed File System for Shared Storage Cloud Database [pdf] - blopeur
http://www.vldb.org/pvldb/vol11/p1849-cao.pdf
======
antoncohen
Man, reading that made me wish Clustrix (YC '06) open sourced their database
([https://www.clustrix.com/](https://www.clustrix.com/)). They had a MySQL-
compatible scale-out DB nearly 10 years ago: wire-compatible with MySQL
without using any MySQL code, and able to participate in a MySQL replication
cluster with normal MySQL servers (which made migration easy). It was scale-
out shared-nothing, so writes scaled linearly as you added nodes, unlike
POLARDB, which is shared-everything with a single master. It used RDMA 10
years ago, and custom PCIe devices because NVMe didn't exist.

But they didn't open source it, so only a small handful of companies get to
use it. Sad.

~~~
pstuart
Do you think they could have found a model to go open source and still
satisfactorily monetize it?

------
KaiserPro
The hardest part of a distributed file system (and I mean file system here) is
managing the metadata: where a file is, where the directory is, who last did
something to it.

Lustre, GFS2 and GPFS all have centralised metadata stores, which is both a
boon and a drawback.

What I can't figure out is what they've done here. It appears that metadata is
stored in a special partition ("journal") which is shared? But there is a
control process as well.

~~~
jamesblonde
It doesn't look like PolarFS has distributed metadata.

HopsFS (HDFS derivative work) stores distributed metadata, enabling 16X
throughput improvements over HDFS -
[https://www.usenix.org/conference/fast17/technical-sessions/...](https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi)
[http://www.hops.io](http://www.hops.io)

Technically, the hard part is correct concurrent and consistent operations
across shards (partitions).
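The "hard part" above can be sketched as a minimal two-phase commit: every
shard must prepare before any may commit, otherwise a crash mid-transaction
leaves the partitions inconsistent. This is an illustrative toy, not HopsFS
or NDB code; all names here are invented for the example.

```python
# Toy sketch (not HopsFS/NDB code): a two-phase commit coordinator over
# in-memory "shards", illustrating why cross-partition transactions are
# hard -- every shard must prepare before any shard may commit.

class Shard:
    def __init__(self):
        self.data = {}     # committed key -> value
        self.staged = {}   # txid -> pending writes

    def prepare(self, txid, writes):
        # A real system would take locks and fsync a log record here.
        self.staged[txid] = dict(writes)
        return True

    def commit(self, txid):
        self.data.update(self.staged.pop(txid))

    def abort(self, txid):
        self.staged.pop(txid, None)

def cross_shard_tx(shards, txid, writes_per_shard):
    """Phase 1: all shards prepare; Phase 2: commit all, else abort all."""
    prepared = []
    for shard_id, writes in writes_per_shard.items():
        if shards[shard_id].prepare(txid, writes):
            prepared.append(shard_id)
        else:
            for s in prepared:
                shards[s].abort(txid)
            return False
    for shard_id in writes_per_shard:
        shards[shard_id].commit(txid)
    return True

shards = {0: Shard(), 1: Shard()}
ok = cross_shard_tx(shards, "tx1", {0: {"a": 1}, 1: {"b": 2}})
print(ok, shards[0].data, shards[1].data)  # True {'a': 1} {'b': 2}
```

Systems that serialize cross-shard operations effectively run this protocol
one transaction at a time; doing it with high concurrency is the hard bit.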

Disclaimer: one of the designers of HopsFS.

~~~
ddorian43
What do you think are some gotchas/limitations when using mysql-ndb-cluster?

~~~
jamesblonde
It's become very stable over the years. It somehow never penetrated the
Silicon Valley echo chamber, probably due to instability in its earlier years.
Operationally, it's got most things you need: on-line add node, rolling
upgrades, monitoring, etc.

We use NDB to store the metadata in-memory, and also to store small files on
NVMe disks. We had talks with lots of other DB vendors, but, frankly, none of
them has high-performance support for cross-partition transactions, which is
needed. DBs like VoltDB, MemSQL, and NuoDB all have promise, but serialize
cross-shard operations.

~~~
ryanworl
Have you explored using FoundationDB? It doesn't have a SQL interface
(anymore), which might mean it isn't suitable if SQL is required.

~~~
jamesblonde
Not yet, it's only recently been open-sourced. We have a startup now
commercializing Hops, called Logical Clocks (of course), so we're busy with
that.

~~~
ryanworl
The community is very helpful over on the forums
([https://forums.foundationdb.org](https://forums.foundationdb.org)), so if
you do end up having time to check it out, come on over! I'm sure the team at
Apple (I'm not affiliated with them) would love to talk about it.

~~~
ddorian43
They're against coprocessors (having functions in the DB), so you'll always
be somewhat limited. Doing things that need shared memory etc. will be hard.

Also, you can shard many tables by hash, so most hot-path transactions stay
inside a single server, which you can't guarantee with range sharding.

------
j16sdiz
The protocol is interesting.

But given how often the Alibaba cloud fails in production, I won't hold my
breath.

~~~
baotiao
The protocol is interesting, and we will provide the TLA+ proof soon.

~~~
kjeetgill
Are you from the PolarFS team? This paper looks amazing, I'll be sure to give
it a thorough read!

I was curious: could you compare and contrast it with what I imagine are its
competitors? HDFS, Ceph, GlusterFS, etc.

Have you replaced any of those existing systems internally yet?

~~~
KaiserPro
HDFS isn't really a filesystem.

The main competitors at this scale would be Lustre & GPFS.

------
antongribok
Kind of disappointing that they compared Ceph without RDMA vs. PolarFS with
RDMA, unless I misread that part.

Still, this is all very interesting.

------
sekh60
I am rather sleep deprived, so I may have misread things, but this doesn't
seem to me to be the best benchmark to evaluate Ceph for database work.

From what I understand, best practice in Ceph for databases is to make an rbd
image and format it with your filesystem of choice, with the rbd stripe size
tuned with your database writes in mind.

I believe Ceph rbd supports RDMA, but I cannot find many current details
about it.

~~~
daviesliu
An rbd image can't be used as shared storage for a database.

~~~
sekh60
How not? You can map an rbd image from multiple VMs/servers at the same time;
if you want to do that, you just need a multi-client-aware filesystem.
Otherwise, if you just need something attached to a single host, you format
the volume ext4/xfs/whatever, map it, and throw your database on that.

------
RobLach
Most interesting bit is their consensus protocol.

~~~
kjeetgill
I'm on mobile so I haven't been able to really fully read the paper: Can you
highlight what's interesting or novel about it?

~~~
baotiao
I think the most interesting part is that PolarFS takes full advantage of
emerging techniques like RDMA, NVMe, and SPDK. And the ParallelRaft consensus
algorithm.
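As I read the paper, ParallelRaft lets a follower acknowledge a log entry out
of order, despite holes before it, as long as the entry's block range doesn't
conflict with any of the preceding N entries, whose LBA ranges travel with
the entry in a look-behind buffer. A rough sketch of that conflict check
(simplified, and not the paper's actual code):

```python
# Rough sketch of ParallelRaft's out-of-order acknowledgment rule
# (simplified; not the paper's implementation): an entry may be acked
# despite missing earlier entries, provided its LBA range overlaps no
# range in the look-behind buffer for the previous N entries.

def can_ack_out_of_order(entry_range, lookbehind_ranges):
    """Ranges are inclusive (start_lba, end_lba) pairs."""
    s, e = entry_range
    for ps, pe in lookbehind_ranges:
        if s <= pe and ps <= e:  # intervals overlap -> conflict, must wait
            return False
    return True

# An entry writing LBAs 100-107, checked against two earlier entries:
print(can_ack_out_of_order((100, 107), [(0, 7), (200, 207)]))      # True
print(can_ack_out_of_order((100, 107), [(104, 111), (200, 207)]))  # False
```

This works for a block device because non-overlapping writes commute, which
a generic Raft log can't assume.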

------
toolslive
Link is down. Is there an alternative link?

------
cbsmith
RDMA is still an "emerging technology"?

~~~
tinktank
In the datacenter, yes. Not many big deployments around.

~~~
cbsmith
Things I didn't know...

