Hacker News
PolarFS: Alibaba Distributed File System for Shared Storage Cloud Database [pdf] (vldb.org)
137 points by blopeur on Aug 21, 2018 | 33 comments

Man, reading that made me wish Clustrix (YC '06) had open sourced their database (https://www.clustrix.com/). They had a MySQL-compatible scale-out DB nearly 10 years ago: wire-compatible with MySQL without using any MySQL code, and able to participate in a MySQL replication cluster alongside normal MySQL servers (which made migration easy). It was scale-out shared-nothing, and writes scaled linearly as you added nodes, unlike POLARDB, which is shared-everything with a single master. It used RDMA 10 years ago, and custom PCIe devices because NVMe didn't exist yet.

But they didn't open source it, so only a small handful of companies get to use it. Sad.

Do you think they could have found a model to go open source and still satisfactorily monetize it?

The hardest part of a distributed file system (and I mean file system here) is managing the metadata: where a file is, where the directory is, who last did something to it.

Lustre, GFS2 and GPFS all have centralised metadata stores, which is both a boon and a drawback.

What I can't figure out is what they've done here. It appears that metadata is stored in a special partition ("journal") which is shared? But there is a control process as well.

It doesn't look like PolarFS has distributed metadata.

HopsFS (an HDFS derivative work) stores distributed metadata, enabling 16X throughput improvements over HDFS - https://www.usenix.org/conference/fast17/technical-sessions/... and http://www.hops.io. Technically, the hard part is correct, concurrent, and consistent operations across shards (partitions).
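(HopsFS actually partitions inodes by parent inode ID inside NDB; the names and scheme below are my own toy sketch of why cross-shard operations are the hard part.)

```python
import hashlib

NUM_SHARDS = 4

def shard_for(parent_path: str) -> int:
    """Place a file's metadata on the shard of its parent directory,
    so listing one directory touches only one metadata shard."""
    digest = hashlib.md5(parent_path.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def shards_for_rename(src_parent: str, dst_parent: str) -> set[int]:
    """A rename across directories may span two shards, so it needs a
    distributed transaction to stay consistent -- the hard part."""
    return {shard_for(src_parent), shard_for(dst_parent)}
```

Single-directory operations stay on one shard; a cross-directory rename is where you pay for distribution.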

Disclaimer: one of the designers of HopsFS.

What do you think are some gotchas/limitations when using mysql-ndb-cluster ?

It's become very stable over the years. It has somehow never penetrated the Silicon Valley echo chamber, probably due to instability in its earlier years. Operationally, it's got most things you need: online add-node, rolling upgrades, monitoring, etc.

We use NDB both to store the metadata in memory and to store small files on NVMe disks. We had talks with lots of other DB vendors, but, frankly, none of them has high-performance support for cross-partition transactions, which is needed. DBs like VoltDB, MemSQL, and NuoDB all show promise, but serialize cross-shard operations.

Have you explored using FoundationDB? It doesn't have a SQL interface (anymore), which might mean it isn't suitable if SQL is required.

Not yet; it's only recently been open-sourced. We now have a startup commercializing Hops, called Logical Clocks (of course), so we're busy with that.

The community is very helpful over on the forums (https://forums.foundationdb.org), so if you do end up having time to check it out, come on over! I'm sure the team at Apple (I'm not affiliated with them) would love to talk about it.

They're against coprocessors (having functions in the DB), so you'll always be somewhat limited. Doing anything with shared memory, etc. will be hard.

Also, you can shard many tables by hash, so most hot-path transactions stay inside a single server, which you can't guarantee with range sharding.
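A toy illustration of that locality argument (shard counts, ranges, and row IDs are all made up):

```python
NUM_SHARDS = 4

def hash_shard(user_id: int) -> int:
    """Hash sharding on user_id: all of one user's rows co-locate."""
    return user_id % NUM_SHARDS

def range_shard(row_id: int,
                ranges=((0, 1000), (1000, 2000), (2000, 3000), (3000, 10**9))) -> int:
    """Range sharding: placement follows the row ID, not the user."""
    for shard, (lo, hi) in enumerate(ranges):
        if lo <= row_id < hi:
            return shard
    raise ValueError(row_id)

# One user's orders, with order row IDs scattered across ranges:
user, orders = 42, [10, 1500, 2500]
hash_shards = {hash_shard(user) for _ in orders}   # one shard
range_shards = {range_shard(o) for o in orders}    # several shards
```

Under hash sharding by user_id the whole transaction lands on one server; under range sharding by row ID it becomes a cross-shard transaction.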

With respect to metadata, the premise of PVFS^WOrangeFS is that providing full POSIX semantics (specifically locking?) makes the issues even worse, so it doesn't try, to decent effect. That may rule it out for some applications, like databases, though.

Lustre has DNE for distributed metadata now. Presumably multi-tenancy would be important in this sort of application.

The protocol is interesting.

But given how often the Alibaba cloud fails in production, I won't hold my breath.

The protocol is interesting, and we will provide the TLA+ proof soon.

Are you from the PolarFS team? This paper looks amazing, I'll be sure to give it a thorough read!

I was curious: could you compare and contrast it with what I imagine are its competitors? HDFS, Ceph, GlusterFS, etc.

Have you replaced any of those existing systems internally yet?

HDFS isn't really a filesystem.

The main competitors at this scale would be Lustre and GPFS.

Yes, I am from the PolarFS team. As you can read in the paper, we have compared PolarFS with Ceph.

Would appreciate some specific examples.

Disclaimer: Not Alibaba employee

Former Alibaba employee: every day.

There's no planning, no communication, tons of underqualified middle management, tons of politics, and a lot of really bad ideas are pushed by HIPPOs.

China in general is about 10-15 years behind on software development. Currently, Alibaba's big internal push is a framework that essentially resembles EJB 2.1 stateful beans, but built on Spring.

> pushed by HIPPOs

https://whatis.techtarget.com/definition/HiPPOs-highest-paid...: "HiPPO (highest paid person's opinion, highest paid person in the office)"

I guess you are frustrated, but most of it sounds like everywhere else. Also, I think you are exaggerating; they are certainly not 10-15 years behind. There are some fine ideas and implementations, for instance the Pouch container and P2P distribution system, even by Alibaba. Hyper.sh is Chinese, too. OpenResty is nice. PingCAP, the makers of TiDB, are Chinese as well.

Well, half the time things went down, it was because the data center hit capacity and someone exploited something in the console to shut down your workloads, which made for a bunch of fun.

Kind of disappointing that they compared Ceph without RDMA against PolarFS with RDMA, unless I misread that part.

Still, this is all very interesting.

I am rather sleep deprived, so I may have misread things, but this doesn't seem to me to be the best benchmark to evaluate Ceph for database work.

From what I understand, best practice in Ceph for databases is to make an RBD image and format it with your filesystem of choice, I believe. The RBD stripe size should be tuned with your database writes in mind.
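Something like the following, roughly (pool/image names and sizes are made up, and the exact striping flags vary by Ceph version, so treat this as a sketch):

```shell
# Create an RBD image sized for the database, with striping tuned
# toward the database's write size (hypothetical values).
rbd create dbpool/mysql-data --size 100G --stripe-unit 64K --stripe-count 4

# Map it on the database host (appears as e.g. /dev/rbd0),
# then format with your filesystem of choice and mount it.
rbd map dbpool/mysql-data
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /var/lib/mysql
```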

I believe Ceph RBD supports RDMA, but I cannot find many current details about it.

An RBD image can't be used as shared storage for a database.

How not? You can map an RBD image from multiple VMs/servers at the same time; it's just that if you want to do so, you need a multi-client-aware filesystem. Otherwise, if you just need something attached to a single host, you format the volume ext4/XFS/whatever, map it, and throw your database on that.

Most interesting bit is their consensus protocol.

Thank you. We will provide our TLA+ proof soon.

I'm on mobile so I haven't been able to really fully read the paper: Can you highlight what's interesting or novel about it?

I think the most interesting part is that PolarFS takes full advantage of emerging technologies like RDMA, NVMe, and SPDK, along with the ParallelRaft consensus algorithm.
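As I read the paper, the core ParallelRaft idea is that block-device writes to non-overlapping LBA ranges can be acknowledged and applied out of order, with each log entry carrying a look-behind buffer of its immediate predecessors for conflict checks. A minimal sketch of that conflict check (function and variable names are mine, not the paper's):

```python
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Two writes conflict iff their (start_lba, length) ranges overlap."""
    (a_start, a_len), (b_start, b_len) = a, b
    return a_start < b_start + b_len and b_start < a_start + a_len

def can_apply_out_of_order(entry: tuple[int, int],
                           pending_predecessors: list[tuple[int, int]]) -> bool:
    """A follower may apply this entry before earlier entries arrive,
    as long as it doesn't overlap any still-pending preceding write."""
    return all(not overlaps(entry, p) for p in pending_predecessors)
```

Writes to disjoint block ranges proceed in parallel; an overlapping write has to wait for its predecessors, preserving the result a strictly ordered log would give.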

The link is down. Is there an alternative link?

RDMA is still an "emerging technology"?

In the datacenter, yes. There aren't many big deployments around.

Things I didn't know...
