My underlying assumption is that this is a production service with customers depending on it.
1. Don't fuck with networking. Do you have experience operating the same or similar workloads on your Supermicro SDN? Will the CEO of your Supermicro VAR pick up his phone at 2 AM when you call?
My advice: Get a quote from Arista.
2. Don't fuck with storage.
32 file servers for 96TB? Same question as with networking, re: Ceph. What are your failure domains? How much does it cost to maintain the FTEs who can run this thing?
3. What's the service SLA on the servers? Historically, Supermicro VARs have been challenged there.
If I were building this solution, I'd want to understand what the ROI of this science project is compared to the current cloud solution and a converged offering like Cisco/NetApp, HP/3PAR, or something like Nutanix. You're probably saving like 20-25% on hardware.
This sounds to me like pinching pennies on procurement and picking up tens of dollars of cost on the labor side. If you're accustomed to SLAs from AWS, this will be a rude awakening.
I think they are coming at this problem from the wrong perspective. Instead of growing from virtual servers to their own dedicated hardware to get better CephFS performance, they should take a hard look at their application and see whether they can architect it so that it doesn't require a complex distributed filesystem presenting a single mount exposed over NFS. At some point it will bite them. Not an if, but a when.
In addition, this means that running physical hardware, CephFS and Kubernetes (among other things) are now part of their core competencies, and I think they are going to underestimate what that costs GitLab in the long run: paying for someone to be a 30-minute drive from their DC 24/7/365 after the first outage, realizing how much spare hardware they want to keep on hand, and so on.
As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.
As someone who administers GitLab for my company, yes please.
Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us. (We've found mounting NFS across different datacenters to be too unreliable and our failover scenarios include loss of a data center.)
It would also be wonderful to be able to run GitLab in any of the PaaSes that only have ephemeral disk, but that's a secondary concern.
What are the alternatives?
I suppose there's MySQL's "stream asynchronously to the standby at the application level."
... which, now that I think about it, should be pretty easy to do with Git, since pushing and pulling discrete units of data are core concepts...
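A rough sketch of what that application-level replication could look like: a post-receive hook that mirrors every accepted push to a warm standby. The remote name `standby` and the hook itself are hypothetical, not anything GitLab ships; this just shows how little machinery git needs for the MySQL-style async approach.

```python
#!/usr/bin/env python3
"""Hypothetical post-receive hook: asynchronously mirror each push to a
warm standby, in the style of MySQL async replication. Assumes the
standby was configured once with `git remote add standby <url>`."""
import subprocess

STANDBY_REMOTE = "standby"  # assumed remote name, for illustration only

def build_mirror_cmd(remote: str) -> list:
    # --mirror pushes all refs (branches, tags, notes) and propagates
    # deletions, which is exactly the "discrete units of data" property
    # that makes git a natural fit for this.
    return ["git", "push", "--mirror", remote]

def replicate(remote: str = STANDBY_REMOTE) -> None:
    # Fire-and-forget so a slow or dead standby never fails the user's
    # push; a real setup would queue and retry instead of dropping errors.
    subprocess.Popen(build_mirror_cmd(remote),
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
```

The asynchrony is the same trade-off as MySQL streaming replication: a crash can lose the last few pushes that hadn't reached the standby yet.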
I mean, especially because git's whole model of object storage is content-addressable and immutable, it looks like it's a prime use for generic object storage.
Latency and consistency would be my concerns. S3 does not quite have the right semantics for some of this, so you'd have to build a small shim on top to work around it. Ceph's RADOS doesn't have these problems, so it is quite a good contender.
And considering the usual compressed size of commits and many text files, you're going to have more HTTP header traffic than actual data if you want to do something like a rev-list.
I would imagine any implementation that used S3 or similar as a backing store would have to heavily rely on an in-memory cache for it (relying on the content-addressable-ness heavily) to avoid re-looking-up things.
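To make the caching point concrete, here's a toy sketch (names and backend are made up; a dict stands in for S3/RADOS). Because git objects are content-addressed and immutable, a cache entry can never go stale, so the cache needs no invalidation logic at all:

```python
import hashlib
import zlib

class CachedObjectStore:
    """Toy git-style object store over a generic blob backend.
    `backend` stands in for S3/RADOS; here it's just a dict."""

    def __init__(self, backend: dict):
        self.backend = backend
        self.cache = {}
        self.backend_reads = 0  # counts what would be HTTP round-trips

    def put(self, data: bytes) -> str:
        # git keys blobs by SHA-1 of a "blob <len>\0" header plus payload,
        # so identical content always deduplicates to the same key.
        key = hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()
        if key not in self.backend:
            self.backend[key] = zlib.compress(data)
        return key

    def get(self, key: str) -> bytes:
        if key in self.cache:
            return self.cache[key]
        self.backend_reads += 1  # a real backend pays header overhead here
        data = zlib.decompress(self.backend[key])
        self.cache[key] = data  # safe forever: content can never change
        return data
```

Every repeated lookup of a hot object (trees near the root, recent commits) is then served from memory, which is where most of the rev-list traffic would land.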
I wonder how optimized an object store's protocol would have to be (HTTP/2 to compress headers? Protobufs?) before it starts converging on something with latency/overhead similar to NFS.
"AWS CodeCommit is built on highly scalable, redundant, and durable AWS services such as Amazon S3 and Amazon DynamoDB."
This is a really good point. That's easily $1M in payroll. You could probably run a decent tiered SAN with 80-95% fewer labor dollars. Plus have the benefit of torturing the vendor when you hit scaling hiccups.
Can you give some examples of the problems you ran into?
Something that always seemed to cause nagging issues was that we wanted our cluster to have data encryption at rest. Ceph does not support this out of the box, which means that you need to use dmcrypt on top of your partitions and present those encrypted partitions to Ceph. This requires some work to make sure that decryption keys are set up properly, and that the machine can reboot automatically and remount the proper partitions. In addition, we ran into several issues where device mapper (or something else) would lock an OSD and send the entire machine into lockup. Messy!
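For flavor, the "reboot automatically and remount" piece boils down to something like the following config fragment. All device names and key paths here are hypothetical; in a serious deployment the keyfile would be fetched at boot from a separate key service rather than sitting on the local root volume, so a stolen disk and its key aren't colocated:

```
# /etc/crypttab -- one line per dmcrypt-backed OSD disk
# <mapped name>  <underlying device>          <keyfile>                     <options>
osd-0            /dev/disk/by-id/ata-SSD0     /etc/ceph/dmcrypt/osd-0.key   luks

# The OSD consumes the mapped device (/dev/mapper/osd-0), so on reboot
# the init system unlocks via crypttab first, then ceph-osd can start.
```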
We also had to work pretty hard to build quality monitoring around Ceph - by default, there are very few tools that provide fine-grained, at-scale monitoring of the various components. We spent a lot of time figuring out which metrics we should be tracking, etc.
We also spent a good amount of time reaching out to other people and companies running Ceph at scale to figure out how to tune and tweak it to work for us. The clusters were all-SSD, so there was a ton of work to tune the myriad settings available, on Ceph and on the hosts themselves, to make sure we were getting the best possible performance out of the software.
When you run dozens-to-hundreds of servers with many SSDs in them doing constant traffic, you tend to hit every edge case in the hardware and software, and there are a lot of lurking demons. We went through controller upgrades, SSD firmware upgrades, tweaking over-provisioning (OP) settings, upgrading switches, finding that the write workload on certain SSDs caused problems, and much more.
That's just a snapshot of some of the issues that we ran into with Ceph. It was a very fun project, but if you are getting into it for a high-throughput setup with hundreds of OSDs, it can be quite a bit of work.
Happy to chat more w/ anyone that's curious - there is some very interesting and fun stuff going on in the Ceph space.
I've always wondered how automatic reboots are handled with filesystem encryption.
What's the process that happens at reboot?
Where is the key stored?
How is it accessed automatically?
My philosophy today is that if the data is important at all, it's worthwhile going spendy and getting a SAN. Get a good one. I like Nimble a lot right now, but there are other good ones, too. (Don't believe crazy compression numbers or de-duplication quackery; I've told more than one SAN vendor to fuck off after they said they'd get 20:1 on our data, without doing any research on what our data was).
Have everything backed up? Great! How long until you can go live again after water drips on your storage? If you spend a week waiting for a restore, that's indistinguishable from failure. If you wait a month for replacement hardware to be shipped, you might have killed your product.
Perhaps I don't understand the problem domain, but I don't understand why CephFS is being considered for this task. Treating the entire set of files across all repos as a single filesystem is an unnecessary constraint: the I/O activity of one repo/user does not affect the I/O activity of an entirely different user. Skip the one-filesystem idea and shard based on user/location/whatever.
I'd appreciate any comments explaining why I'm wrong, because this doesn't seem to be a productive design to me.
None of the clouds use any of that super-expensive gear, so if you're going for cost savings, you'll need to use the same sort of commodity gear they use.
GitLab is obviously Linux-savvy and comfortable writing automation, so things like Cumulus Linux and minimal-handholding hardware vendors shouldn't cause them any indigestion.
<disclaimer: co-founder of Cumulus Networks, so slightly biased>
My point isn't to knock them down. It takes cojones to be public about stuff like this. My instinct as a grumpy engineering-director type is that there are holes here that need to be filled in.
Putting a major product at risk to save $30k against an Arista switch isn't a decision to make lightly. That means pricing the labor, upside benefit and business risk. If they are going to 100x this environment, Cumulus will save millions. If it will 3x, it will save a few thousand bucks -- who cares.
Arista and Cisco shouldn't cost top dollar, though anyone buying EMC or NetApp for any new build should have their union card revoked. FreeNAS ftw uber alles.
Source: Did it twice.
I've run it in my home a few times out of curiosity, and that was never my impression.
We're... displeased with the current solution that we're using at work for this use case. :)