
I'm a cranky old person now; I think this is a crazy approach to take, and I would be having a very challenging conversation with the engineer pitching this to me.

My underlying assumption is that this is a production service with customers depending on it.

1. Don't fuck with networking. Do you have experience operating the same or similar workloads on your Supermicro SDN? Will the CEO of your Supermicro VAR pick up his phone at 2AM when you call?

My advice: Get a quote from Arista.

2. Don't fuck with storage.

32 file servers for 96TB? Same question as with networking, re: Ceph. What are your failure domains? How much does it cost to maintain the FTEs who can run this thing?

3. What's the service SLA on the servers? Historically, Supermicro VARs have struggled with that.

If I were building this solution, I'd want to understand what the ROI of this science project is as compared to the current cloud solution and a converged offering like Cisco/NetApp, HP/3Par or something like Nutanix. You're probably saving like 20-25% on hardware.

This sounds to me like pinching pennies on procurement and picking up tens of dollars of cost on the labor side. If you're accustomed to SLAs from AWS, this will be a rude awakening.




I'm happy to see this - I could not agree more with these points.

I think they are coming at this problem from the wrong perspective - instead of growing from virtual servers to their own dedicated hardware to get better CephFS performance, they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS. At some point in the future, it will bite them. Not an if, but a when.

In addition, this means that running physical hardware, CephFS and Kubernetes (among other things) are now going to be part of their core competencies - I think they are going to underestimate what that costs GitLab in the long run: paying for someone to be a 30-minute drive from their DC 24/7/365 after the first outage, realizing how much spare hardware they are going to want around, and so on.

As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.


> they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS

As someone who administers GitLab for my company, yes please.

Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us. (We've found mounting NFS across different datacenters to be too unreliable and our failover scenarios include loss of a data center.)

It would also be wonderful to be able to run GitLab in any of the PaaSes that only have ephemeral disk, but that's a secondary concern.


>Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us

What are the alternatives?

I suppose there's MySQL's "stream asynchronously to the standby at the application level."

... which, now that I think about it, should be pretty easy to do with Git, since pushing and pulling discrete units of data are core concepts...
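
Rough sketch of what I mean, assuming a bare mirror repo already exists on the standby and SSH access is in place; the standby URL is made up, and a real setup would trigger this from a post-receive hook with queueing and retries:

    import subprocess

    def replicate(repo_path: str, standby_url: str) -> None:
        """Mirror a repository to a standby host, using git itself as the transport.

        standby_url is a hypothetical remote, e.g. 'ssh://standby.example/repos/foo.git'.
        """
        subprocess.run(
            ["git", "-C", repo_path, "push", "--mirror", standby_url],
            check=True,
        )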



I don't see why it would be difficult to do a git implementation that replaces all the FS syscalls with calls to S3 or native Ceph or some other object store. If all they're using NFS for is to store git, it seems like a big win to put in the up-front engineering cost.

I mean, especially because git's whole model of object storage is content-addressable and immutable, it looks like it's a prime use for generic object storage.
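
As a toy sketch of the idea (the bucket name is made up, and boto3/S3 is just standing in for whatever object store; a real implementation would also have to deal with packfiles and refs):

    import hashlib
    import zlib
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "git-objects"  # hypothetical bucket

    def put_git_object(data: bytes, obj_type: str = "blob") -> str:
        # Same scheme git uses: "<type> <len>\0" header + payload, addressed by SHA-1.
        raw = f"{obj_type} {len(data)}\0".encode() + data
        sha = hashlib.sha1(raw).hexdigest()
        # Mirror git's loose-object layout: two-char prefix, rest of the hash as the key.
        s3.put_object(Bucket=BUCKET, Key=f"{sha[:2]}/{sha[2:]}", Body=zlib.compress(raw))
        return sha

    def get_git_object(sha: str) -> bytes:
        resp = s3.get_object(Bucket=BUCKET, Key=f"{sha[:2]}/{sha[2:]}")
        return zlib.decompress(resp["Body"].read())

Objects are the easy part; refs are where you'd actually need consistency guarantees.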


This is precisely how I would recommend doing it if I were asked. Because git is content-addressable, it lends itself very well to being stored in an object storage system (as long as it has the right consistency guarantees). Instead of using CephFS, they could use Ceph's RADOS gateway, which would let them abstract the storage engine behind an interface that works with any object storage system.

Latency and consistency would be my concerns - S3 does not quite have the right semantics for some of this, so you'd have to build a small shim on top to work around it. Ceph's RADOS doesn't have these problems, so it is quite a good contender.
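
Concretely, since the RADOS gateway speaks the S3 API, the same client code can just be pointed at it - the endpoint and credentials here are made up:

    import boto3

    # Hypothetical internal RGW endpoint; the calling code doesn't change,
    # which is the whole point of hiding the store behind an S3-style interface.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.internal.example:7480",
        aws_access_key_id="RGW_ACCESS_KEY",
        aws_secret_access_key="RGW_SECRET_KEY",
    )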


Latency is an issue. Especially when traversing history in operations like log or blame, it is important to have an extremely low latency object store. Usually this means a local disk.


Yuup. Latency is a huge issue for Git. Even hosting Git at non-trivial scales on EBS is a challenge. S3 for individual objects is going to take forever to do even the simplest operations.

And considering the usual compressed size of commits and many text files, you're going to have more HTTP header traffic than actual data if you want to do something like a rev-list.
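
Back-of-the-envelope with assumed numbers (a few hundred bytes per compressed commit, roughly a kilobyte of signed request/response headers per object fetch):

    commits = 50_000        # e.g. walking history for a big rev-list
    object_bytes = 300      # assumed average compressed commit size
    header_bytes = 1_000    # assumed HTTP request+response overhead per object

    print(f"payload: {commits * object_bytes / 1e6:.1f} MB")  # 15.0 MB
    print(f"headers: {commits * header_bytes / 1e6:.1f} MB")  # 50.0 MB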


I'm trying to think of the reason why NFS's latency is tolerable but S3's wouldn't be. (Not that I disagree, you're totally right, but why is this true in principle? Just HTTP being inefficient?)

I would imagine any implementation that used S3 or similar as a backing store would have to lean heavily on an in-memory cache (exploiting the content-addressability) to avoid re-fetching the same objects.

I wonder how optimized an object store's protocol would have to be (http2 to compress headers? Protobufs?) before it starts converging on something that has similar latency/overhead to NFS.
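
On the cache point: since the objects are immutable, the cache never has to worry about invalidation, only eviction. Something like this, with a made-up backend fetch:

    from functools import lru_cache

    def fetch_from_object_store(sha: str) -> bytes:
        """Hypothetical backend call (e.g. an S3/RGW GET keyed by the object's SHA)."""
        raise NotImplementedError

    @lru_cache(maxsize=100_000)
    def fetch_cached(sha: str) -> bytes:
        # Objects never change once written, so a cache hit is always valid;
        # the only policy question is how much memory to spend on the cache.
        return fetch_from_object_store(sha)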


That's how AWS CodeCommit works, which makes it unique amongst GitHub, GitLab and friends.

Source: https://aws.amazon.com/codecommit/faqs/

"AWS CodeCommit is built on highly scalable, redundant, and durable AWS services such as Amazon S3 and Amazon DynamoDB."


> As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

This is a really good point. That's easily $1M in payroll. You could probably run a decent tiered SAN with 80-95% fewer labor dollars. Plus have the benefit of torturing the vendor when you hit scaling hiccups.


> As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

Can you give some examples of the problems you ran into?


Definitely! Not going to be an exhaustive list, but I can talk about some of the bigger pieces of work.

Something that always seemed to cause nagging issues was that we wanted our cluster to have data encryption at rest. Ceph does not support this out of the box, which means that you need to use dmcrypt on top of your partitions and present those encrypted partitions to Ceph. This requires some work to make sure that decrypt keys are set up properly, and that the machine can reboot automatically and remount the proper partitions. In addition, we ran into several issues where device mapper or something else would lock an OSD, which would send the entire machine into lockup - messy!
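
The boot-time flow was conceptually something like this (sketch only; the key service URL and device names are made up, and the init wiring is omitted):

    import subprocess
    import urllib.request

    def unlock_and_mount(device: str, name: str, mountpoint: str, key_url: str) -> None:
        # Fetch the dm-crypt key from an internal key service (hypothetical URL),
        # open the LUKS device, and mount it where the OSD expects its data.
        key = urllib.request.urlopen(key_url).read()
        subprocess.run(
            ["cryptsetup", "open", "--type", "luks", "--key-file=-", device, name],
            input=key, check=True,  # --key-file=- makes cryptsetup read the key from stdin
        )
        subprocess.run(["mount", f"/dev/mapper/{name}", mountpoint], check=True)

    # e.g. unlock_and_mount("/dev/sdb1", "osd-12", "/var/lib/ceph/osd/ceph-12",
    #                       "https://keys.internal.example/osd-12")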

We also had to work pretty hard to build quality monitoring around Ceph - out of the box, there are very few tools that provide at-scale, fine-grained monitoring of the various components. We spent a lot of time figuring out which metrics we should be tracking, etc.
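
Most of what we built boiled down to thin wrappers around the admin commands' JSON output, conceptually like this (the exact keys shift between Ceph releases, so treat them as approximate):

    import json
    import subprocess

    def cluster_snapshot() -> dict:
        # 'ceph -s --format json' dumps cluster status as JSON.
        out = subprocess.run(
            ["ceph", "-s", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        status = json.loads(out)
        return {
            "health": status.get("health", {}),   # HEALTH_OK / WARN / ERR plus details
            "pgmap": status.get("pgmap", {}),     # PG states, client IO rates
            "osdmap": status.get("osdmap", {}),   # how many OSDs are up/in
        }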

We also spent a good amount of time reaching out to other people and companies running Ceph at scale to figure out how to tune and tweak it to work for us. The clusters were all-SSD, so there was a ton of work tuning the myriad settings available, on Ceph and the hosts themselves, to make sure we were getting the best possible performance out of the software.

When you run dozens-to-hundreds of servers with many SSDs in them that are doing constant traffic, you tend to hit every edge case in the hardware and software, and there are a lot of lurking demons. We went through controller upgrades, SSD firmware upgrades, tweaking OP settings, upgrading switches, finding that the write workload on certain SSDs caused problems and much more.

That's just a snapshot of some of the issues that we ran into with Ceph. It was a very fun project, but if you are getting into it for a high-throughput setup with hundreds of OSDs, it can be quite a bit of work.

Happy to chat more w/ anyone that's curious - there is some very interesting and fun stuff going on in the Ceph space.


> This requires some work to make sure that decrypt keys are set up properly, and that the machine can reboot automatically and remount the proper partitions.

I've always wondered how automatic reboots are handled with filesystem encryption.

What's the process that happens at reboot?

Where is the key stored?

How is it accessed automatically?


Couldn't agree more. It seems GitLab is severely underestimating the cost of running similar infrastructure on bare metal compared to their AWS infra. I would probably start by re-architecting their software and replacing Ceph with something better, probably S3, which is much easier to scale up and operate. Their Ruby stack also has a lot of optimisation potential. Starting to optimise on the other end (hardware, networking, etc.) is a much harder job, beginning with staffing questions. AWS, Google and MSFT have the best datacenter engineers you can find, and a huge amount of effort has gone into engineering their datacenters. Not to mention the leverage you have as a small startup vs a cloud vendor when talking to HW vendors. Anyway, in a few years we'll know whether they managed to do this successfully.


I cannot tell you how many weird-ass storage systems I've decommissioned. Typically people go cheap, and get burned. It can be a year after deployment, or three, but one thing all the cheap stuff seems to have in common is that it fails very, very badly just when you need it very, very badly. Usually at 3 in the morning.

My philosophy today is that if the data is important at all, it's worthwhile going spendy and getting a SAN. Get a good one. I like Nimble a lot right now, but there are other good ones, too. (Don't believe crazy compression numbers or de-duplication quackery; I've told more than one SAN vendor to fuck off after they said they'd get 20:1 on our data, without doing any research on what our data was).

Have everything backed up? Great! How long until you can go live again after water drips on your storage? If you spend a week waiting for a restore, that's indistinguishable from failure. If you wait a month for replacement hardware to be shipped, you might have killed your product.


Thank goodness someone said it.

Perhaps I don't understand the problem domain, but I don't understand why CephFS is being considered for this task. You're trying to treat your entire set of files across all repos as a single filesystem, but that's an entirely incorrect assumption. The I/O activity on one repo/user does not affect the I/O activity of an entirely different user. Skip the one filesystem idea, shard based on user/location/whatever.
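
Even something as dumb as hashing the repo path to pick a storage node gets you that isolation - toy illustration with made-up node names; a real setup would want a lookup table or consistent hashing so you can rebalance:

    import hashlib

    STORAGE_NODES = ["file-01", "file-02", "file-03", "file-04"]  # hypothetical shards

    def node_for_repo(repo_path: str) -> str:
        # Route each repository to exactly one storage node based on its path,
        # so one repo's I/O never touches the other shards' disks.
        digest = hashlib.sha1(repo_path.encode()).hexdigest()
        return STORAGE_NODES[int(digest, 16) % len(STORAGE_NODES)]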

I'd appreciate any comments explaining why I'm wrong, because this doesn't seem to be a productive design to me.


Treating the whole thing as one FS is the current architecture GitLab uses, so is more of an existing constraint than a proposed architecture. To get distributed storage you either need to rewrite GitLab to deal with distributed storage, or run it on another layer that presents an illusion of one big FS (whether that's CephFS or a storage appliance).


If you're going to spend top dollar on Arista/Cisco/EMC/NetApp, you might as well stay in the cloud.

None of the clouds use any of that super-expensive gear, so if you're going for cost savings, you'll need to use the same sort of commodity gear they use.

Gitlab is obviously Linux-savvy and comfortable writing automation, so things like Cumulus Linux and minimal-handholding hardware vendors shouldn't cause them any indigestion.

<disclaimer: co-founder of Cumulus Networks, so slightly biased>


But this is a couple of hundred boxes, not AWS. I've been to a Microsoft data center... the scale is infinitely larger and the solutions are different as well.

My point isn't to knock them down. It takes cohones to be public about stuff like this. My instinct as a grumpy engineering director type is that there are holes here that need to be filled in.

Putting a major product at risk to save $30k against an Arista switch isn't a decision to make lightly. That means pricing the labor, upside benefit and business risk. If they are going to 100x this environment, Cumulus will save millions. If it will 3x, it will save a few thousand bucks -- who cares.


cojones


I disagree. With over 300 servers in AWS, you can almost certainly build a redundant data center with hardware at less than 60% of the cost, assuming 3-year depreciation.

Arista and Cisco shouldn't cost top dollar, though anyone buying EMC or NetApp for any new build should have their union card revoked. FreeNAS ftw, uber alles.

Source: Did it twice.


Is FreeNAS something people actually run Serious Business, at-scale production datacenters on?

I've run it in my home a few times out of curiosity, and that was never my impression.


Yes. I know of several billion dollar companies that run it in a web facing production operations capacity.


Any good talks or other resources about this sort of use case that you might recommend?

We're... displeased with the current solution that we're using at work for this use case. :)


I've created a subreddit here [0] for discussion. Ask some questions and I'll give you the information I have that's relevant.

[0] https://www.reddit.com/r/WebScaleFreeNAS/


Agree. As a dev on a large distributed project (including software and hardware), I don't recommend that any team use a distributed storage system in production IF they don't have a lot of experience with it... especially an open-source system.



