I think they are coming at this problem from the wrong perspective - instead of growing from virtual servers to their own dedicated hardware to get better CephFS performance, they should take a hard look at their application and see whether they can rearchitect it so that it doesn't depend on a complex distributed filesystem presenting a single mount exposed over NFS. At some point in the future it will bite them - not an if, but a when.
In addition, this means that running physical hardware, CephFS and Kubernetes (among other things) now has to become a core competency - I think they are underestimating the long-run cost to GitLab: paying for someone to be a 30-minute drive from the DC 24/7/365 after the first outage, realizing how much spare hardware they'll want on hand, and so on.
As someone who has run large-scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers plus Ops and Hardware teams, and we had to do a lot of work to get it stable. It's not as simple as installing it and walking away.
As someone who administers GitLab for my company, yes please.
Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us. (We've found mounting NFS across different datacenters to be too unreliable and our failover scenarios include loss of a data center.)
It would also be wonderful to be able to run GitLab in any of the PaaSes that only have ephemeral disk, but that's a secondary concern.
What are the alternatives?
I suppose there's MySQL's "stream asynchronously to the standby at the application level."
... which, now that I think about it, should be pretty easy to do with Git, since pushing and pulling discrete units of data are core concepts...
I mean, especially because git's whole model of object storage is content-addressable and immutable, it looks like it's a prime use for generic object storage.
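To make the content-addressable point concrete: a Git object ID is just the SHA-1 of a small type/size header followed by the raw content, which is exactly why objects are immutable and map cleanly onto a key-value object store. A minimal sketch:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the object ID Git assigns to a blob:
    SHA-1 over 'blob <size>\\0' plus the raw content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The well-known ID of the empty blob:
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Since the key is derived from the content, two identical blobs dedupe for free, and a stored object never changes under its key.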
Latency and consistency would be my concerns - S3 doesn't quite have the right semantics for some of this, so you'd have to build a small shim on top to work around it. Ceph's RADOS doesn't have these problems, so it's quite a good contender.
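One shape such a shim could take (a sketch, not any real implementation - `backend` here is a hypothetical dict-like object store standing in for S3): write the object, then read it back before acknowledging, so refs are never published pointing at objects a reader might not see yet.

```python
class ReadAfterWriteShim:
    """Sketch: acknowledge a write only after a read-back confirms
    visibility. On an eventually consistent store you would retry the
    read-back with backoff instead of failing immediately."""

    def __init__(self, backend):
        self.backend = backend  # any dict-like object store (hypothetical)

    def put(self, oid: str, data: bytes) -> None:
        self.backend[oid] = data
        # Verify the object is readable before telling callers it exists.
        if self.backend.get(oid) != data:
            raise IOError("write not yet visible for %s" % oid)

store = {}
shim = ReadAfterWriteShim(store)
shim.put("abc123", b"blob data")
print(store["abc123"])  # b'blob data'
```

The nice property of content-addressed objects is that the shim only has to worry about visibility, never about conflicting writes to the same key.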
And considering the usual compressed size of commits and many text files, you're going to have more HTTP header traffic than actual data if you want to do something like a rev-list.
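Rough back-of-envelope to illustrate that claim - the sizes here are assumptions for illustration, not measurements:

```python
# Illustrative numbers, not measurements.
commit_size = 300        # bytes: a typical zlib-compressed commit object (assumed)
header_overhead = 700    # bytes: request + response headers per HTTP/1.1 GET (assumed)
objects = 1000           # objects touched by a rev-list style walk

payload = commit_size * objects
overhead = header_overhead * objects
print(overhead / payload)  # >2x more header traffic than actual data
```

Under those assumptions, a naive one-GET-per-object walk spends well over twice as many bytes on headers as on Git data.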
I would imagine any implementation that used S3 or similar as a backing store would have to lean heavily on an in-memory cache (exploiting the content-addressability) to avoid repeated lookups of the same objects.
I wonder how optimized an object store's protocol would have to be (HTTP/2 to compress headers? Protobufs?) before it starts converging on something with latency and overhead similar to NFS.
"AWS CodeCommit is built on highly scalable, redundant, and durable AWS services such as Amazon S3 and Amazon DynamoDB."
This is a really good point. That's easily $1M in payroll. You could probably run a decent tiered SAN with 80-95% fewer labor dollars - plus you'd have the benefit of torturing the vendor when you hit scaling hiccups.
Can you give some examples of the problems you ran into?
Something that always seemed to cause nagging issues was that we wanted our cluster to have data encryption at rest. Ceph does not support this out of the box, which means you need to run dm-crypt on top of your partitions and present the encrypted partitions to Ceph. It takes some work to make sure decryption keys are set up properly and that a machine can reboot automatically and remount the right partitions. On top of that, we ran into several issues where device mapper (or something else) would lock an OSD, which would send the entire machine into lockup - messy!
We also had to work pretty hard to build quality monitoring around Ceph - by default there are very few tools that provide fine-grained, at-scale monitoring of the various components. We spent a lot of time just figuring out which metrics we should be tracking.
We also spent a good amount of time reaching out to other people and companies running Ceph at scale to figure out how to tune and tweak it for our workload. The clusters were all-SSD, so there was a ton of work tuning the myriad settings available, in Ceph and on the hosts themselves, to get the best possible performance out of the software.
When you run dozens-to-hundreds of servers with many SSDs in them that are doing constant traffic, you tend to hit every edge case in the hardware and software, and there are a lot of lurking demons. We went through controller upgrades, SSD firmware upgrades, tweaking OP settings, upgrading switches, finding that the write workload on certain SSDs caused problems and much more.
That's just a snapshot of some of the issues that we ran into with Ceph. It was a very fun project, but if you are getting into it for a high-throughput setup with hundreds of OSDs, it can be quite a bit of work.
Happy to chat more w/ anyone that's curious - there is some very interesting and fun stuff going on in the Ceph space.
I've always wondered how automatic reboots are handled with filesystem encryption.
What's the process that happens at reboot?
Where is the key stored?
How is it accessed automatically?