
I'm happy to see this - I could not agree more with these points.

I think they are coming at this problem from the wrong perspective - instead of growing from virtual servers to their own dedicated hardware to get better CephFS performance, they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS. At some point in the future, it will bite them. Not an if, but a when.

In addition, this means that running physical hardware, CephFS, and Kubernetes (among other things) now becomes part of their core competencies - I think they are going to underestimate the cost to GitLab in the long run: paying for someone to be a 30-minute drive from their DC 24/7/365 after the first outage, realizing how much spare hardware they will want to keep around, and so on.

As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.




> they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS

As someone who administers GitLab for my company, yes please.

Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us. (We've found mounting NFS across different datacenters to be too unreliable and our failover scenarios include loss of a data center.)

It would also be wonderful to be able to run GitLab in any of the PaaSes that only have ephemeral disk, but that's a secondary concern.


>Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us

What are the alternatives?

I suppose there's MySQL's "stream asynchronously to the standby at the application level."

... which, now that I think about it, should be pretty easy to do with Git, since pushing and pulling discrete units of data are core concepts...
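
A post-receive hook on the primary that mirrors every push to the standby would get you most of the way there. A minimal sketch (the standby remote is made up):

    #!/usr/bin/env python3
    # minimal sketch of a post-receive hook on the primary: after every push,
    # mirror all refs to a standby at the application level
    import subprocess
    import sys

    STANDBY = "git@standby.example.com:repo.git"  # hypothetical standby remote

    # --mirror pushes all branches and tags and prunes deleted ones, so the
    # standby converges on the primary's state after each push
    sys.exit(subprocess.run(["git", "push", "--mirror", STANDBY]).returncode)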



I don't see why it would be difficult to do a git implementation that replaces all the FS syscalls with calls to S3 or native Ceph or some other object store. If all they're using NFS for is to store git, it seems like a big win to put in the up-front engineering cost.

I mean, especially because git's whole model of object storage is content-addressable and immutable, it looks like it's a prime use for generic object storage.
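
Right - a loose git object is basically "hash of the content is the key, compressed content is the value", so the mapping onto an object store is almost mechanical. A minimal sketch, assuming S3 via boto3 and a made-up bucket name (native RADOS would look much the same):

    import hashlib
    import zlib

    import boto3  # assumes S3 as the backend for illustration

    s3 = boto3.client("s3")
    BUCKET = "git-objects"  # hypothetical bucket

    def put_blob(content: bytes) -> str:
        # git hashes "blob <size>\0<content>" and uses the digest as the object id
        raw = (b"blob %d\0" % len(content)) + content
        sha = hashlib.sha1(raw).hexdigest()
        # content-addressed and immutable: rewriting the same key is harmless
        s3.put_object(Bucket=BUCKET, Key=sha, Body=zlib.compress(raw))
        return sha

    def get_blob(sha: str) -> bytes:
        raw = zlib.decompress(s3.get_object(Bucket=BUCKET, Key=sha)["Body"].read())
        return raw.split(b"\0", 1)[1]  # strip the "blob <size>" header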


This is precisely how I would recommend doing this if I were asked. Because git is content-addressable, it lends itself very well to being stored in an object storage system (as long as it has the right consistency guarantees). Instead of using CephFS, they could use Ceph's RADOS gateway, which would let them abstract the storage engine away and work against any object storage system.

Latency and consistency would be my concerns - S3 does not quite have the right semantics for some of this, so you'd have to build a small shim on top to work around it. Ceph's RADOS doesn't have these problems, so it is quite a good contender.
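
To make the consistency concern concrete: the immutable objects are the easy part; it's the refs (mutable pointers) that need stronger guarantees. A rough sketch of the interface such a shim might present - all names here are hypothetical:

    from abc import ABC, abstractmethod
    from typing import Optional

    class GitObjectStore(ABC):
        @abstractmethod
        def put_object(self, sha: str, data: bytes) -> None:
            """Write-once and content-addressed, so even an eventually
            consistent store is fine: the same key always holds the same bytes."""

        @abstractmethod
        def get_object(self, sha: str) -> bytes:
            """Fetch an immutable object by its hash."""

        @abstractmethod
        def update_ref(self, name: str, old: Optional[str], new: str) -> bool:
            """Compare-and-swap: move the ref only if it still points at old.
            A plain S3 PUT can't express this, which is what the shim is for;
            RADOS has atomic operations that can."""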


Latency is an issue. Especially when traversing history in operations like log or blame, it is important to have an extremely low latency object store. Usually this means a local disk.


Yuup. Latency is a huge issue for Git. Even hosting Git at non-trivial scales on EBS is a challenge. S3 for individual objects is going to take forever to do even the simplest operations.

And considering the usual compressed size of commits and many text files, you're going to have more HTTP header traffic than actual data if you want to do something like a rev-list.
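
To spell it out: history traversal is a serial pointer chase, so every commit costs a full round trip. A toy sketch (fetch_commit stands in for whatever fetches and parses a commit from the store; the numbers are only rough orders of magnitude):

    # toy sketch, not real git: walking first-parent history one object at a time
    def rev_list(tip_sha, fetch_commit):
        history, sha = [], tip_sha
        while sha is not None:
            commit = fetch_commit(sha)   # one HTTP round trip per commit; the
                                         # request/response headers can outweigh
                                         # the few hundred bytes of commit data
            history.append(sha)
            sha = commit.get("parent")   # the parent is unknown until the child
                                         # has been fetched, so requests can't
                                         # be pipelined
        return history

    # at a rough ~20 ms per S3 GET, 100,000 commits is on the order of half an
    # hour; the same walk against a page-cached local disk is effectively free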


I'm trying to think of the reason why NFS's latency is tolerable but S3's wouldn't be. (Not that I disagree, you're totally right, but why is this true in principle? Just HTTP being inefficient?)

I would imagine any implementation that used S3 or similar as a backing store would have to lean heavily on an in-memory cache (exploiting the content-addressable nature of the objects) to avoid repeated lookups.

I wonder how optimized an object store's protocol would have to be (HTTP/2 to compress headers? Protobufs?) before it starts converging on something with latency/overhead similar to NFS.
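
On the cache point: because the objects are immutable and keyed by their hash, a cache in front of the store never has to worry about invalidation, only eviction. A minimal sketch, with the backend call left as a placeholder:

    from functools import lru_cache

    def fetch_from_store(sha: str) -> bytes:
        # placeholder for the slow backend call (S3 GET, RADOS read, ...)
        raise NotImplementedError

    @lru_cache(maxsize=100_000)
    def get_object(sha: str) -> bytes:
        # a hit can never be stale, because the key is the hash of the content;
        # the only policy the cache needs is eviction when memory fills up
        return fetch_from_store(sha)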


That's how AWS CodeCommit works, which makes it unique amongst GitHub, GitLab, and friends.

Source: https://aws.amazon.com/codecommit/faqs/

"AWS CodeCommit is built on highly scalable, redundant, and durable AWS services such as Amazon S3 and Amazon DynamoDB."


> As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

This is a really good point. That's easily $1M in payroll. You could probably run a decent tiered SAN with 80-95% fewer labor dollars. Plus have the benefit of torturing the vendor when you hit scaling hiccups.


> As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

Can you give some examples of the problems you ran into?


Definitely! Not going to be an exhaustive list, but I can talk about some of the bigger pieces of work.

Something that always seemed to cause nagging issues was that we wanted our cluster to have data encryption at rest. Ceph does not support this out of the box, which means that you need to use dmcrypt on top of your partitions and present those encrypted partitions to Ceph. This requires some work to make sure that decrypt keys are set up properly, and that the machine can reboot automatically and remount the proper partitions. In addition, we ran into several issues where device mapper (or something else) would lock an OSD, which would send the entire machine into lockup - messy!

We also had to work pretty hard to build quality monitoring around Ceph - by default, there are very few tools that provide at-scale, fine-grained monitoring for the various components. We spent a lot of time figuring out what metrics we should be tracking, etc.
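
As a rough illustration of the kind of glue that ends up being written by hand - poll ceph status for JSON and extract a health value (the key layout differs between Ceph releases, so the lookups below try both spellings):

    import json
    import subprocess

    def ceph_health() -> str:
        out = subprocess.check_output(["ceph", "status", "--format", "json"])
        status = json.loads(out)
        # newer releases report health under "status", older ones under
        # "overall_status"; fall back to UNKNOWN if neither is present
        health = status.get("health", {})
        return health.get("status") or health.get("overall_status") or "UNKNOWN"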

We also spent a good amount of time reaching out to other people and companies running Ceph at scale to figure out how to tune and tweak it to work for us. The clusters were all-SSD, so there was a ton of work to tune the myriad settings available on Ceph and the hosts themselves, to make sure we were getting the best possible performance out of the software.

When you run dozens to hundreds of servers with many SSDs in them doing constant traffic, you tend to hit every edge case in the hardware and software, and there are a lot of lurking demons. We went through controller upgrades, SSD firmware upgrades, tweaking over-provisioning (OP) settings, upgrading switches, finding that the write workload on certain SSDs caused problems, and much more.

That's just a snapshot of some of the issues that we ran into with Ceph. It was a very fun project, but if you are getting into it for a high-throughput setup with hundreds of OSDs, it can be quite a bit of work.

Happy to chat more w/ anyone that's curious - there is some very interesting and fun stuff going on in the Ceph space.


> This requires some work to make sure that decrypt keys are set up properly, and that the machine can reboot automatically and remount the proper partitions.

I've always wondered how automatic reboots are handled with filesystem encryption.

What's the process that happens at reboot?

Where is the key stored?

How is it accessed automatically?



