Proposed server purchase for GitLab.com (gitlab.com)
356 points by jfreax on Dec 11, 2016 | 320 comments

Think much harder about power and cooling. A few points:

1. Talk to your hosting providers and make sure they can support 32kW (or whatever max number you need) in a single rack, in terms of cooling. At many facilities you will have to leave empty space in the rack just to stay below their W per sq ft cooling capacity.

2. If you're running dual power supplies on your servers, with separate power lines coming into the rack, model what will happen if you lose one of the power lines, and all of the load switches to the other. You don't want to blow the circuit breaker on the other line, and lose the entire rack.

3. Thinking about steady state power is fine, but remember you may need to power the entire rack at full power in the worst case. Possibly from only one power feed. Make sure you have excess capacity for this.
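Point 2 can be modelled roughly by summing the worst-case draw of every server and comparing it against what a single feed can carry continuously. A minimal sketch, assuming single-phase feeds and an 80% continuous derating on the breaker; the function names and the example rack are hypothetical, not taken from the proposal:

```python
def usable_watts(volts: float, amps: float, derate: float = 0.8) -> float:
    """Continuous capacity of one single-phase feed after breaker derating."""
    return volts * amps * derate


def survives_feed_loss(server_peak_watts: list[float], volts: float, amps: float) -> bool:
    """True if one feed alone can carry every server at its peak draw."""
    return sum(server_peak_watts) <= usable_watts(volts, amps)


# Hypothetical rack: 10 servers peaking at 450 W each, on A+B feeds of 208 V / 30 A.
rack = [450.0] * 10
print(survives_feed_loss(rack, volts=208, amps=30))
```

If this returns False for your real numbers, losing one feed would push the other past its continuous rating.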

The first time I made a significant deployment of physical servers into a colo facility, power and cooling was quite literally the last thing I thought about. I'm guessing this is true for you too, based on the number of words you wrote about power. After several years of experience, power/cooling was almost the only thing I thought about.

Something you should do to understand the actual power consumption of a server (AC power, in watts, at its input):

Build one node with the hardware configuration you intend to use. Same CPU, ram, storage.

Put it on a watt meter accurate to 1W.

Install Debian amd64 on it in a basic config and run 256 threads of the 'cpuburn' package while simultaneously running iozone disk bench and memory benchmarks.

This will give the figure of the absolute maximum load of the server, in watts, when it is running at 100% load on all cores, memory IO, and disk IO.

Watts = heat, since all electricity consumed in a data center is either being used to do physical work like spinning a fan, or is going to end up in the air as heat. Laws of physics. As whatever data center you're using will be responsible for cooling, this is not exactly your problem, but you should be aware of it if you're going to try to do something like 12 kilowatts density per rack.

Then multiply wattage of your prototype unit by number of servers. This will tell you how many fully loaded systems will fit in one 208v 30a circuit on the AC side.
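As a rough sketch of that arithmetic, assuming a single-phase 208 V / 30 A circuit and the 80% continuous derating that comes up elsewhere in this thread (the 400 W figure is a made-up measurement, not a real one):

```python
import math


def servers_per_circuit(measured_peak_watts: float, volts: float = 208.0,
                        amps: float = 30.0, derate: float = 0.8) -> int:
    """How many fully loaded servers fit on one circuit after derating."""
    if measured_peak_watts <= 0:
        raise ValueError("measured_peak_watts must be positive")
    capacity_watts = volts * amps * derate  # 208 * 30 * 0.8 = 4992 W
    return math.floor(capacity_watts / measured_peak_watts)


# Hypothetical prototype node that peaked at 400 W on the watt meter.
print(servers_per_circuit(400.0))  # 12
```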

Also, use the system BIOS option for power recovery to stagger bootup times by 10 seconds per server, so that after a total power loss an entire rack of servers does not attempt to power up simultaneously.

This is good advice. I'd consider not having them auto power on, however. This allows them to bring things up in a controlled manner.

That brings me to my next point: GitLab should also be mindful of which services are stored on which hardware - performing heroics to work around circular dependencies is the last thing you want to be doing when recovering from a power outage.

Yes, this is a choice that depends on the facility, and what sort of recovery plan you have for total power failure. In many cases you would want to have everything remain off, and have a remote hands person power up certain network equipment and key services before everything else. On the other hand, you may want to design everything to recover itself, without any button pushing from humans.

Depends a lot on server/HA/software architecture.

> Think much harder about power and cooling. A few points:

I've only ever built desktop machines, and this top comment drew a surprising parallel to most "help me with my desktop build" type posts. Granted, I'm sure as you dig deeper the reasoning may be much different, but being ignorant about a proper server build myself, it was somehow reassuring to see power and cooling at the top!

Nowadays once you get past 10 racks or 50kw, you generally only pay for power - the space/cooling/etc is "free" as your limiting rate is power and the vendor's ability to move thermal. You'll likely want a chimney cabinet like the Chatsworth GlobalFrame [1].

[1] - http://www.chatsworth.com/products/cabinet-and-enclosure-sys...

This does depend a lot on geographical location. You're not going to get free racks in a major carrier exchange point in Manhattan, or downtown Seattle, or close to the urban core of San Francisco. There will definitely be a significant quantity discount, as you increase in numbers, and a lot of your cost will be power and not racks.

Somewhere that is very close (OSI layer 1) topologically to a major traffic interexchange point, you will definitely be paying somehow for the monthly cost of the square footage occupied. For example a colo in 60 Hudson or 25 Broadway in NYC, or in One Wilshire in LA.

Large-scale colocation pricing that is based on power only will be found in locations that are not also major traffic interexchange points. For example Quincy, WA or the many data centers in suburban New Jersey.

I think your point is right for 111 8th or Equinix/SV1/Great Oaks.

But for other sites with high value peering, including EQIX in Ashburn and Coresite, your limiting rate is likely to be the power, not the space. I.e. they'll "give" you 1 cabinet for every 15kw you buy.

So my assertion assumes you're doing a large number of dense cabinets.

If one is doing a large number of dense cabinets, they almost certainly should not do it at a high value peering point, and should backhaul it. You should be able to get diverse metro dark fiber for <$5k/mo if not substantially less. Put a single cabinet (or pair for redundancy) of $2k/mo cabinets in Equinix or Fisher and off you go.

fully agreed here, especially leave yourself room for when you misjudged some resource utilization (and need more). Nothing worse than having a resource crunch (cpu/mem/io) and not being able to resolve it because your rack is out of power/cooling/etc - you'll come to appreciate how easy it was in cloud just clicking the button and turning out your wallet.

Great points. We'll make sure to wire to separate power feeds that can both handle the entire load. Suggestions in how to calculate this? Taking the maximum rated load seems over the top.

I recently had to do this. The server I was putting up was rated for 3kW. To determine the expected load, I put it under a dummy load that I reasonably considered the maximum for what I would expect on the server (this was a dev machine, so I picked compiling the linux kernel as a benchmark). I ran that until the power stabilized (SuperMicro servers can measure power consumption in hardware and expose this via IPMI - very handy), because power consumption may keep creeping up for a few minutes as the fans adjust to the new operating temperature. I then repeated the same exercise with a CPU torture test (Prime95), just to see what the maximum I could possibly get out of the machine was. The numbers turned out to be about 1.8kW for the linux kernel benchmark, and 2.3kW for the torture test. What I ended up doing was to provision for 2kW and use the BIOS' power limiting feature to enforce this. That would kill the machine if it exceeded its designed load, but that's usually better than tripping the breaker and killing the entire circuit. You may also want to talk to your data center provider about overages. Some of the quotes I got wouldn't kill your circuit when you went over, but would just charge you (a crazy amount, but better than losing all your servers).
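The "run until the power stabilized" step can be sketched as a simple check over a series of readings. Where the readings come from (IPMI, a metered PDU, a watt meter) is left out here; the window size, tolerance, and sample values are all assumptions for illustration:

```python
def has_stabilized(readings_watts: list[float], window: int = 5,
                   tolerance_watts: float = 10.0) -> bool:
    """True once the last `window` readings stay within `tolerance_watts` of each other."""
    if len(readings_watts) < window:
        return False
    recent = readings_watts[-window:]
    return max(recent) - min(recent) <= tolerance_watts


# Hypothetical samples: draw creeps up while the fans spin up, then flattens out.
samples = [1500, 1650, 1740, 1790, 1795, 1800, 1802, 1801, 1803]
print(has_stabilized(samples[:5]))  # still climbing
print(has_stabilized(samples))      # flat over the last five readings
```

Only once the readings have flattened out does it make sense to record the figure you provision against.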

Hope that helps. Your deployment is larger than ours, so there may be other techniques, but that's what we did.

I love the idea of killing the server instead of the circuit, thanks!

Might be better to just throttle the CPU/etc when too much power is being used.

Prime95! I haven't seen that since 2000-2001! Solid tool for burning in a box and putting the CPU under maximum load.

Having designed and run colo data centers for many years, my rule of thumb for calculating this for customers was to use the vendor's tools if available, or take 80% of the power supply rating. Keep in mind that most servers will have a peak load during initial power on as the fans and components come online and run through testing. If a rack ever comes up on a single channel and the circuits are not rated right, that breaker will just pop right back offline and you'll have to bring it up by unplugging servers and bringing them up in sets. Also, most breakers are only rated at 80% load so you have to de-rate them for your load. So, e.g., a 20amp x 208 3 phase circuit really only has 8KW of constant power draw available.
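The derating arithmetic can be sketched as follows, assuming the standard balanced three-phase relation P = sqrt(3) * V * I with a power factor of 1 (an assumption; real loads sit somewhat below that). Under those assumptions a 20 A circuit comes out near 5.8 kW continuous and a 30 A circuit near 8.6 kW, so the 8 kW figure above is closer to the 30 A case:

```python
import math


def three_phase_continuous_watts(volts_line_to_line: float, amps: float,
                                 derate: float = 0.8) -> float:
    """Continuous power available from a balanced three-phase circuit after derating."""
    return math.sqrt(3) * volts_line_to_line * amps * derate


for amps in (20, 30):
    watts = three_phase_continuous_watts(208, amps)
    print(f"{amps} A x 208 V three-phase: about {watts / 1000:.1f} kW continuous")
```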

+1. As tempting as it is to say "oh this server only needs full power when it boots" this may come back to bite you. As you grow usage in CPU and Disk, the power used by the server will increase substantially.

note that you can get around the 80% breaker limit by having your DC hardwire the power, if you have enough scale to have them do this for you.

Circuit breakers aren't just there for the amusement value of watching a clustered system go into split-brain mode...

I.e. you would also want to be sure that your wiring was rated for 100% utilization, and that other circuit-breaker-like functions exist.

Fire is an actual thing, and figuring out the best way to recharge a halon system isn't exactly what you want to be doing.

Yes. That's why they hardwire and use a breaker rated at 100%. In most jurisdictions, code requires a breaker at 80% if a plug/receptacle is used. If you are hardwired you can use 100% of your capacity before tripping the breaker. You have incorrectly assumed that I suggested that you ignore breakers.

When you hardwire the circuit the electrical code allows you to use a 100% breaker.

Thank you for adding this detail... what you said makes WAY more sense to me now.

What exactly gets hardwired to what? This is surprisingly hard to search for details on.

Instead of your plug of your PDU going into a receptacle, the wires that would go into the plug are hardwired to a panel circuit breaker.

Historically this has been less common in DCs, but it is becoming more common as folks do 208v 3phase 100A circuits.

This really is only a concern when you are paying a monthly recurring charge (MRC) by the breaker amp with many power drops.

For a deployment of this scale it should be metered power (For example 1 (or more) 3phase a+b drops to each cabinet) where you only pay a Non-Recurring setup Charge (NRC) and then the MRC is based on actual power draw.

3phase also means fewer physical PDU's (uses less space), but more physical breakers. Over-building delivery capability will eliminate any over-draw concerns for startup cycles.

Not really, although I agree with your reasoning. The other issue is capex. When I deploy 240kW pods, if I use 80% breakers, I have to deploy 25% more PDUs than if I have 100% breakers.

Since my cabinet number is usually evenly divisible by N*PDUs, this impacts overall capital.
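The 25% figure follows from 1 / 0.8 = 1.25. A small sketch of the PDU count, assuming a hypothetical PDU rated at 10 kW (the rating is made up for illustration; only the 240 kW pod and the 80%/100% breaker ratings come from the comment above):

```python
def pdus_needed(pod_kw: int, pdu_rated_kw: int, breaker_percent: int) -> int:
    """Number of PDUs required when each PDU may only be loaded to breaker_percent."""
    needed_units = pod_kw * 100
    usable_per_pdu = pdu_rated_kw * breaker_percent
    # Integer ceiling division, so a partial PDU rounds up to a whole one.
    return -(-needed_units // usable_per_pdu)


at_100 = pdus_needed(240, 10, 100)
at_80 = pdus_needed(240, 10, 80)
print(at_100, at_80, at_80 / at_100)  # 24 30 1.25
```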

We are talking about 1 to 2 cab density here so capex doesn't carry that much weight.

Having a little headroom on your power circuits is also incredibly important, and not every facility will sell 100% rated breakers. It may make more sense to be in a facility with 80% rated breakers than 100%, even with the added capex of an extra PDU or two.

Goes back to my previous comment. What is important to you, at the pod / multiple pod level, isn't as important to the 1-2 cab deployment.

That's fair and accurate.

Similarly, ensure spare room in the cabinet for adjustments, that thing you forgot, and small growth. Much better to be 70% full and not need the space than to have no free RU and need the space.

Most DC PDUs will stagger your outlet power on to stagger the initially large power draws.

Some DCs provide PDU's and some don't.

At RS we would burn the servers in and measure the load with a clamp meter. For large scale build-outs (swift, cloud servers) we would burn in entire racks and measure the load reported by the DC power equipment.

I would recommend verifying everything is fault tolerant/HA as expected every step of the way. We ran into issues where the power strips on both sides were plugged into the same circuit (D'oh), wrong SST's, redundant routers getting cabled up to the same power strips, etc., you name it.

After a rack is set up, have people at the DC (your own employees or the DC's techs) help simulate (create) failures in power, networking, and systems to verify everything is set up correctly. It sounds like you have people coming onboard with experience provisioning/delivering physical systems though, so I would expect them to be on the ball with most of this stuff.

Thanks, that is very helpful. And we haven't made these hires yet but we hope we can in the coming months.

You can certainly do some math/estimates based on looking at individual component specifications, but I like using a power meter (built in to some PDUs - which is a feature worth having, or you can buy one, they are quite inexpensive).

A system at idle vs full CPU vs full cpu + all disk will produce very different measurements.

Also keep in mind 80% derating - many electrical codes will state that an X amp circuit should only be used at 0.8X on a continuous basis (and the circuit breakers will be sized accordingly).

For HP gear, use HP Power Advisor utility. For Dell, see the Data Center Capacity Planner. Not sure what SuperMicro has -- check with your VAR.

You want 2*N power feeds where N = 1 or 2. I recommend ServerTech PDUs. You'll want to factor in the maximum load, but with the amount of servers and power you're doing, you should be able to negotiate metered power at under $250/kW/month. Give them a commit that's 1/3 of your TDW/reported power load, and then pay for what you use.
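A rough sketch of how a commit-plus-metered deal might price out. The billing rule here (pay for whichever is larger, the commit or actual usage, at one flat rate) is an assumption for illustration, as are the load numbers; real contracts vary:

```python
def monthly_power_bill(used_kw: float, commit_kw: float,
                       rate_per_kw: float = 250.0) -> float:
    """Assumed model: you are billed for the larger of commit and actual usage."""
    return max(used_kw, commit_kw) * rate_per_kw


reported_load_kw = 30.0            # hypothetical nameplate/reported load
commit_kw = reported_load_kw / 3   # commit to one third, as suggested above
print(monthly_power_bill(used_kw=14.0, commit_kw=commit_kw))  # 3500.0
print(monthly_power_bill(used_kw=6.0, commit_kw=commit_kw))   # 2500.0
```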

I just remembered a blog post I wrote a while back about exactly these points:



These are all points a competent engineer would raise :)

I've found that being considered a 'competent engineer' merely means 'never too arrogant to learn'.

I didn't know half of the stuff in grandparents' post.

I was attempting to confirm that these points are all important. Being too arrogant to learn and being incompetent are two entirely different things.

Sorry, I misunderstood. To me, your response came across as a "well, duh!".

No problem; as I read it back to myself I can see how it came across that way.

I'm a cranky old person now, I think this is a crazy approach to take and I would be having a very challenging conversation with the engineer pitching this to me.

My underlying assumption is that this is a production service with customers depending on it.

1. Don't fuck with networking. Do you have experience operating the same or similar workloads on your Supermicro SDN? Will the CEO of your Supermicro VAR pick up his phone at 2AM when you call?

My advice: Get a quote from Arista.

2. Don't fuck with storage.

32 file servers for 96TB? Same question as with networking re:ceph. What are your failure domains? How much does it cost to maintain the FTEs who can run this thing?

3. What's the service SLA on the servers? Historically, supermicro VARs have been challenged with that.

If I were building this solution, I'd want to understand what the ROI of this science project is as compared to the current cloud solution and a converged offering like Cisco/NetApp, HP/3Par or something like Nutanix. You're probably saving like 20-25% on hardware.

This sounds to me like pinching pennies on procurement and picking up tens of dollars of cost on the labor side. If you're accustomed to SLAs from AWS, this will be a rude awakening.

I'm happy to see this - I could not agree more with these points.

I think they are coming at this problem from the wrong perspective - instead of growing from virtual servers to their own dedicated hardware to get better CephFS performance, they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS. At some point in the future, it will bite them. Not an if, but a when.

In addition, this means that running physical hardware, CephFS and Kubernetes (among other things) are now going to be part of their core competencies - I think they are going to underestimate the cost to GitLab in the long run. When they need to pay for someone to be a 30 minute drive from their DC 24/7/365 after the first outage, when they realize how much spare hardware they are going to want around, etc.

As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

> they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS

As someone who administers GitLab for my company, yes please.

Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us. (We've found mounting NFS across different datacenters to be too unreliable and our failover scenarios include loss of a data center.)

It would also be wonderful to be able to run GitLab in any of the PaaSes that only have ephemeral disk, but that's a secondary concern.

>Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us

What are the alternatives?

I suppose there's MySQL's "stream asynchronously to the standby at the application level."

... which, now that I think about it, should be pretty easy to do with Git, since pushing and pulling discrete units of data are core concepts...

I don't see why it would be difficult to do a git implementation that replaces all the FS syscalls with calls to S3 or native ceph or some other object store. If all they're using NFS for is to store git, it seems like a big win to put in the up front engineering cost.

I mean, especially because git's whole model of object storage is content-addressable and immutable, it looks like it's a prime use for generic object storage.

This is precisely how I would recommend doing this if I were to be asked. Because git is content-addressable, it lends itself very well to being stored in an object storage system (as long as it has the right consistency guarantees). Instead of using CephFS, they could use Ceph's rados gateway which would allow them to abstract the storage engine away to working with any object storage system.
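To make the content-addressable point concrete: git names a blob by the SHA-1 of a small header plus the file contents, so the key is derived entirely from the data. A minimal sketch with an in-memory dict standing in for the object store; the `ObjectStore` class is hypothetical, and it compresses only the content (real git compresses header and content together):

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Object id git assigns to a blob: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy key/value store keyed by object id; objects never change once written."""

    def __init__(self) -> None:
        self._objects: dict = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        # Same content always yields the same key, so rewriting is a no-op.
        self._objects.setdefault(oid, zlib.compress(content))
        return oid

    def get(self, oid: str) -> bytes:
        return zlib.decompress(self._objects[oid])


store = ObjectStore()
oid = store.put(b"hello\n")
print(oid, store.get(oid))
```

Because keys never change meaning, any backend that offers put/get by key could in principle sit behind this interface; latency is a separate question.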

Latency and consistency would be my concerns - S3 does not quite have the right semantics for some of this, so you'd have to build a small shim on top to work around this. Ceph's rados doesn't even have these problems, so it is quite a good contender.

Latency is an issue. Especially when traversing history in operations like log or blame, it is important to have an extremely low latency object store. Usually this means a local disk.

Yuup. Latency is a huge issue for Git. Even hosting Git at non-trivial scales on EBS is a challenge. S3 for individual objects is going to take forever to do even the simplest operations.

And considering the usual compressed size of commits and many text files, you're going to have more HTTP header traffic than actual data if you want to do something like a rev-list.

I'm trying to think of the reason why NFS's latency is tolerable but S3's wouldn't be. (Not that I disagree, you're totally right, but why is this true in principle? Just HTTP being inefficient?)

I would imagine any implementation that used S3 or similar as a backing store would have to heavily rely on an in-memory cache for it (relying on the content-addressable-ness heavily) to avoid re-looking-up things.
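Such a cache is simple precisely because objects are immutable: an entry never goes stale, so there is no invalidation logic, only eviction. A minimal LRU sketch, where `fetch` is a hypothetical stand-in for whatever slow backend call retrieves an object by id:

```python
from collections import OrderedDict
from typing import Callable


class CachedObjectReader:
    """LRU cache in front of a slow object store; safe because objects are immutable."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int = 1024) -> None:
        self._fetch = fetch
        self._capacity = capacity
        self._cache = OrderedDict()
        self.misses = 0

    def get(self, oid: str) -> bytes:
        if oid in self._cache:
            self._cache.move_to_end(oid)  # mark as most recently used
            return self._cache[oid]
        self.misses += 1
        data = self._fetch(oid)  # the expensive round trip
        self._cache[oid] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return data


backend = {"a1": b"tree", "b2": b"blob"}
reader = CachedObjectReader(backend.__getitem__, capacity=2)
reader.get("a1")
reader.get("a1")
print(reader.misses)  # 1
```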

I wonder how optimized an object store's protocol would have to be (http2 to compress headers? Protobufs?) before it starts converging on something that has similar latency/overhead to NFS.

That's how AWS CodeCommit works which makes it unique amongst GitHub, Gitlab and friends.

Source: https://aws.amazon.com/codecommit/faqs/

"AWS CodeCommit is built on highly scalable, redundant, and durable AWS services such as Amazon S3 and Amazon DynamoDB."

> As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

This is a really good point. That's easily $1M in payroll. You could probably run a decent tiered SAN with 80-95% fewer labor dollars. Plus have the benefit of torturing the vendor when you hit scaling hiccups.

> As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

Can you give some examples of the problems you ran into?

Definitely! Not going to be an exhaustive list, but I can talk about some of the bigger pieces of work.

Something that always seemed to cause nagging issues was that we wanted our cluster to have data encryption at rest. Ceph does not support this out of the box, which means that you need to use dmcrypt on top of your partitions, and present those encrypted partitions to Ceph. This requires some work to make sure that decrypt keys are set up properly, and that the machine can reboot automatically and remount the proper partitions. In addition, we ran into several issues where device mapper or otherwise would lock an OSD, which would send the entire machine into lockup, messy!

We also had to work pretty hard to build quality monitoring around Ceph - by default, there are very few tools that provide at-scale fine grained monitoring for the various components. We spent a lot of time figuring out what metrics we should be tracking, etc.

We also spent a good amount of time reaching out to other people and companies running ceph at scale to figure out how to tune and tweak it to work for us. The clusters were all-SSD, so there was a ton of work to tune the myriad of settings available, on ceph and the hosts themselves, to make sure we were getting the best possible performance out of the software.

When you run dozens-to-hundreds of servers with many SSDs in them that are doing constant traffic, you tend to hit every edge case in the hardware and software, and there are a lot of lurking demons. We went through controller upgrades, SSD firmware upgrades, tweaking OP settings, upgrading switches, finding that the write workload on certain SSDs caused problems and much more.

That's just a snapshot of some of the issues that we ran into with Ceph. It was a very fun project, but if you are getting into it for a high-throughput setup with hundreds of OSDs, it can be quite a bit of work.

Happy to chat more w/ anyone that's curious - there is some very interesting and fun stuff going on in the Ceph space.

> This requires some work to make sure that decrypt keys are setup properly, and that the machine can reboot automatically and remount the proper partitions.

I've always wondered how automatic reboots are handled with filesystem encryption.

What's the process that happens at reboot?

Where is the key stored?

How is it accessed automatically?

Couldn't agree more. It seems Gitlab is severely underestimating the cost of running infrastructure on bare metal comparable to their AWS infra. I would probably start by re-architecting their software and replacing Ceph with something better, probably S3. Much easier to scale up and operate. Also, their Ruby stack has a lot of optimisation potential. Starting to optimise at the other end (hardware, networking, etc.) is a much harder job, starting with staffing questions. AWS, Google and MSFT have the best datacenter engineers you can find, and a huge amount of effort went into engineering their datacenters. Not to mention the leverage you have as a small startup vs a cloud vendor when talking to HW vendors. Anyway, in a few years we're going to know if they managed to do this successfully.

I cannot tell you how many weird-ass storage systems I've decommissioned. Typically people go cheap, and are burned. It can be a year after deployment, or three, but one thing all the cheap stuff seems to have in common is that it fails very, very badly just when you need it very, very badly. Usually at 3 in the morning.

My philosophy today is that if the data is important at all, it's worthwhile going spendy and getting a SAN. Get a good one. I like Nimble a lot right now, but there are other good ones, too. (Don't believe crazy compression numbers or de-duplication quackery; I've told more than one SAN vendor to fuck off after they said they'd get 20:1 on our data, without doing any research on what our data was).

Have everything backed up? Great! How long until you can go live again after water drips on your storage? If you spend a week waiting for a restore, that's indistinguishable from failure. If you wait a month for replacement hardware to be shipped, you might have killed your product.

Thank goodness someone said it.

Perhaps I don't understand the problem domain, but I don't understand why CephFS is being considered for this task. You're trying to treat your entire set of files across all repos as a single filesystem, but that's an entirely incorrect assumption. The I/O activity on one repo/user does not affect the I/O activity of an entirely different user. Skip the one filesystem idea, shard based on user/location/whatever.
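A sketch of what sharding by repository could look like: hash the repo path and use it to pick a storage node. The shard names and repo paths are made up, and plain modulo placement is an assumption chosen for brevity; it remaps most repos whenever the shard count changes, so a real system would more likely keep an explicit repo-to-shard lookup table.

```python
import hashlib


def shard_for(repo_path: str, shards: list[str]) -> str:
    """Deterministically map a repository path to one storage shard."""
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Hypothetical file servers and repositories.
shards = ["nfs-01", "nfs-02", "nfs-03", "nfs-04"]
for repo in ("alice/website.git", "bob/kernel.git", "alice/dotfiles.git"):
    print(repo, "->", shard_for(repo, shards))
```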

I'd appreciate any comments explaining why I'm wrong, because this doesn't seem to be a productive design to me.

Treating the whole thing as one FS is the current architecture GitLab uses, so is more of an existing constraint than a proposed architecture. To get distributed storage you either need to rewrite GitLab to deal with distributed storage, or run it on another layer that presents an illusion of one big FS (whether that's CephFS or a storage appliance).

If you're going to spend top dollar on Arista/Cisco/EMC/NetApp, you might as well stay in the cloud.

None of the clouds use any of that super-expensive gear, so if you're going for cost savings, you'll need to use the same sort of commodity gear they use.

Gitlab is obviously Linux-savvy and comfortable writing automation, so things like Cumulus Linux and minimal-handholding hardware vendors shouldn't cause them any indigestion.

<disclaimer: co-founder of Cumulus Networks, so slightly biased>

But this is a couple of hundred boxes, not AWS. I've been to a Microsoft data center... the scale is infinitely larger and the solutions are different as well.

My point isn't to knock them down. It takes cojones to be public about stuff like this. My instinct as a grumpy engineering director type is that there are holes here that need to be filled in.

Putting a major product at risk to save $30k against an Arista switch isn't a decision to make lightly. That means pricing the labor, upside benefit and business risk. If they are going to 100x this environment, Cumulus will save millions. If it will 3x, it will save a few thousand bucks -- who cares.


I disagree. With over 300 servers in AWS, you can almost certainly build a redundant data center with hardware at less than 60% of the cost, assuming 3 year depreciation.

Arista and Cisco shouldn't cost top dollar; though anyone buying EMC or Netapp for any new build should have their union card revoked. FreeNAS ftw uber alles.

Source: Did it twice.

Is FreeNAS something people actually run Serious Business, at-scale production datacenters on?

I've run it in my home a few times out of curiosity, and that was never my impression.

Yes. I know of several billion dollar companies that run it in a web facing production operations capacity.

Any good talks or other resources about this sort of use case that you might recommend?

We're... displeased with the current solution that we're using at work for this use case. :)

I've created a subreddit here [0] for discussion. Ask some questions and I'll give you the information I have that's relevant.

[0] https://www.reddit.com/r/WebScaleFreeNAS/

Agree. As a dev on a large distributed project (including software and hardware), I don't recommend that any team use a distributed storage system in production IF they don't have lots of experience with it, especially an open source system.

I have built out a few racks of Supermicro twins. In general I would suggest hiring your ops people first and then letting them buy what they are comfortable with.

C2: The Dell equivalent is C6320.

CPU: Calculate the price/performance of the server, not the processor alone. This may lead you towards fewer nodes with 14-core or 18-core CPUs.

Disk: I would use 2.5" PMR (there is a different chassis that gives 6x2.5" per node) to get more spindles/TB, but it is more expensive.

Memory: A different server (e.g. FC630) would give you 24 slots instead of 16. 24x32GB is 768GB and still affordable.

Network: I would not use 10GBase-T since it's designed for desktop use. I suggest ideally 25G SFP28 (AOC-MH25G-m2S2TM) but 10G SFP+ (AOC-MTG-i4S) is OK. The speed and type of the switch needs to match the NIC (you linked to an SFP+ switch that isn't compatible with your proposed 10GBase-T NICs).

N1: A pair of 128-port switches (e.g. SSE-C3632S or SN2700) is going to be better than three 48-port. Cumulus is a good choice if you are more familiar with Linux than Cisco. Be sure to buy the Cumulus training if your people aren't already trained.

N2: MLAG sucks, but the alternatives are probably worse.

N4: No one agrees on what SDN is, so... mu.

N5: SSE-G3648B if you want to stick with Supermicro. The Arctica 4804IP-RMP is probably cheaper.

Hosting: This rack is a great ball of fire. Verify that the data center can handle the power and heat density you are proposing.

Wow Wes, these are all awesome suggestions. All of them (CPU, disk, memory, network, hosting) are things we'll consider doing. I added the Dell server with https://gitlab.com/gitlab-com/www-gitlab-com/commit/34bd78d8...

Would you mind if we contact you to discuss?

Go ahead.

I too have built out several racks of SuperMicro twins.

I'm never buying another SuperMicro, for many reasons. The amount of cabling for properly redundant connections is a killer; it's at least three cables per system (five in our case); a rack will have hundreds of wires to manage. Access to the blade is in the back, where the cables are, and you have to think ahead and route things cleverly if you want to be able to remove blades later.

The comment about doing something other than three 48-port switches is bang on. And if you're running Juniper hardware, avoid the temptation to make a Virtual Chassis (because it becomes a single point of failure, honest, and you will hate yourself when you have to do a firmware upgrade).

19KW is still a ton of power, and I'm surprised the datacenter isn't worried (none of the datacenters we use worldwide go much above 16KW usable). Also, you need to make sure you're on redundant circuits, and that things still work with one of the power legs totally off. Make sure you know what kind of PDU you need (two-phase or three-phase), and that you load the phases equally.
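Loading the phases equally can be approximated with a greedy assignment: take the largest loads first and always add to the lightest phase. This is a simplified sketch that treats each load as sitting on exactly one phase (on many three-phase PDUs an outlet is wired across two phases, which this ignores); the wattages are invented:

```python
def balance_phases(load_watts: list[float], phases: int = 3) -> list[list[float]]:
    """Greedy largest-first assignment of loads to the currently lightest phase."""
    buckets = [[] for _ in range(phases)]
    totals = [0.0] * phases
    for load in sorted(load_watts, reverse=True):
        lightest = totals.index(min(totals))
        buckets[lightest].append(load)
        totals[lightest] += load
    return buckets


# Hypothetical per-server peak draws in watts.
plan = balance_phases([500, 500, 400, 400, 300, 300])
print([sum(phase) for phase in plan])  # [800, 800, 800]
```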

Can you elaborate on what deficiencies 10GBase-T has in server applications?

One, category 6A cables actually cost two and a half times as much as basic single mode fiber patch cables. Two, cable diameter. Ordinary LC to LC fiber cables with 2 millimeter diameter duplex fiber are much easier to manage than category 6A. Three, choice of network equipment. There is a great deal more equipment that will take ordinary SFP+ 10 gig transceivers than equipment that has 10 gigabit copper ports. As a medium-sized ISP, I rarely if ever see copper 10 gigabit.

SFP+ 10Gbase-T 3rd party optics started hitting the market this year, but none of the major switch vendors offer them yet, so the coding effectively lies about what cable type it is. Just an option to keep in your back pocket, as onboard 10Gb is typically 10Gbase-T. Thankfully, onboard SFP+ is becoming more common however.

For short distances of known length, twinax cables which are technically copper can be used. They're thinner than regular cat6a but only about the same as thin 6a and thicker than typical unshielded Duplex fiber patch cables. Twinax can be handy if connecting Arista switches to anything else that restricts 3rd party optics, as Arista only restricts other cable types. Twinax is also the cheapest option.

Latency, power and cost.

The PHY has to do a lot of forward error correction and filtering, so it adds latency (for the FEC), power (for all the DSP) and cost (for the silicon area to do all of the above).

The latency is almost certainly immaterial, 0.3uSec v. 2uSec. That's MICROseconds not MILLIseconds. Power draw used to be an issue, but not anymore.

Consider the power pull from copper v fiber listed here [2]. The Arista 7050TX 128 port pulls 507W while the 7056SX 128 port pulls 235W. Yes, copper is more but we're talking half a kW for 2 TOR switches. And for this you get much cheaper cabling, as the SFP+ are much more expensive (go AOC if you do go fiber, BTW) and you have to worry about clean fiber, etc.
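To put the "half a kW" in numbers, here is the arithmetic as a small Python sketch. The two wattages are the ones quoted above; the switch count of two and the always-on assumption (24 hours a day, 365 days a year) are mine.

```python
HOURS_PER_YEAR = 24 * 365


def extra_power(copper_w, fiber_w, switches):
    """Return (extra watts, extra kWh per year) for copper versus fiber switches."""
    extra_w = (copper_w - fiber_w) * switches
    extra_kwh_per_year = extra_w * HOURS_PER_YEAR / 1000
    return extra_w, extra_kwh_per_year


if __name__ == "__main__":
    # 507 W and 235 W are the figures quoted above; 2 TOR switches per rack.
    watts, kwh = extra_power(copper_w=507, fiber_w=235, switches=2)
    print(f"{watts} W extra, about {kwh:.0f} kWh per year")
```

Multiply the kWh figure by whatever your colo charges per kWh to compare it against the cabling and optics savings.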

Where fiber is REALLY nice is with the density, although the 28AWG [3] almost makes that moot.

There are few DC builds that can't do end-to-end copper any more, at least for the initial 50 racks.

[1] - http://www.datacenterknowledge.com/archives/2012/11/27/data-...
[2] - https://www.arista.com/en/products/7050x-series
[3] - https://www.monoprice.com/product?p_id=13510

I figure it's simply the availability of datacenter level switches (and other components).

I politely disagree on the 10GBaseT. I built out hundreds of nodes with TOR 10G using Arista switches and it worked great.

With regards to the switches, I would argue that they should skip SDN switches altogether and get some Cisco Catalysts as TOR switches. 2x 48-port switches for each rack with redundant core routers in spine-leaf. SDN is cool, but at the current scale, it seems like it would be more resource intensive than it is worth.

This is one rack of equipment; no spines are needed. http://blog.ipspace.net/2014/10/all-you-need-are-two-top-of-...

Cumulus isn't really SDN so I'm not sure what you're saying there.

Traditional networking is fine but it's totally different than Linux so you need dedicated netops people to manage it. And Cisco is the most expensive traditional vendor.

I'm not super well-versed in the networking area. I misread the posting. I thought they were looking at 64Us of servers.

I've just gotten the notion that SDN is almost ready for primetime, just not yet.

It's been almost ready for primetime for over a decade now.

Agreed, skip SDN. Skip MC-LAG.

Juniper, Cisco, and Arista all have solutions for this environment.

The raw performance benefits of bare metal vs cloud are incredible, but why does that necessarily mean building & maintaining your own hardware when you can lease (or work out whatever financing you want, but still let the hosting company maintain a lot of the responsibility for HW)? And besides financing, taking on all the HW maint? I'm not sure your needs are so unique as to require custom hardware.

You're talking about only 64 nodes right now. Your storage and IOPS requirements are not huge. A lot of mid-size hosting companies will give you fantastic service at the 10-1000 servers range. If I were you I'd talk to someone like https://www.m5hosting.com/ (note: happy dedicated server customer for many years -- and I'm sure there's similar scale operations on east coast if that's really what you need) who have experience running rock solid hosting for ~100s of dedicated servers per customer.

I suspect you may just be able to get your 5-10X cost/month improvement (and bare metal performance gains) without having to take on the financing and hardware bits yourself.

When I looked at how expensive a rented dedicated server was compared to AWS I expected a big difference, but I was still surprised how cheap you can get dedicated servers today. Not worrying about actually owning the hardware is nice, and you still get vastly more hardware for the same price as in the cloud.

There are many advantages to AWS and similar services, but if you can't really take advantage of all the goodies because e.g. you also need to provide a local version of your software (which is the case for Gitlab, as far as I understand), renting dedicated servers is an order of magnitude cheaper.

Right now that makes sense yes. What about 5 years from now when it doesn't? This way they get to start building a team and culture to support that kind of infrastructure and wean the baby teeth on nice soft furniture.

I wonder if they could go with a hosting provider, hire a sysadmin, then second them to the hosting provider to learn how it's done, in return for a credit against their costs.

In fact, i wonder if that's a business model for hosting providers. Kind of a sysadmin incubator.

Except the really good -> exceptional talent just ends up on staff at the operations incubator and the rest go to the world at large :'(

We looked at providers such as Softlayer but while they guarantee the performance of the servers they typically can't guarantee network latency. Since we're doing this to reduce latency https://about.gitlab.com/2016/11/10/why-choose-bare-metal/ this is essential to us.

We'll be glad to look into alternatives that manage the servers and network for us although the argument in https://news.ycombinator.com/item?id=13153455 that this is the time to build a team that can handle this makes sense to me too.

Been a softlayer customer for 4 years now. Their network is pretty awesome. When hosted in the same datacenter it's sub-ms response time always. If there is an issue they get right on it. You can even ask them to host the stuff in the same rack to get even better response time.

We are also a SL customer -- 4-figures of hosts with them. We have had networking problems in the past (latency and loss far higher than I would expect to see in a well-provisioned DC) and talked to them about it. It ended up being contention with another customer, it got fixed, and our network performance has been great since.

I would encourage you to look at what you can get without trying to do your own colo. You're not at the scale where you should be thinking about that.

Thanks for the suggestion. Any idea if they can offer 40 Gbps networking?

Few will offer 40Gbps without charging a pretty penny, generally the jump will go from 10Gbps to 100Gbps but not until it becomes cost-effective (and not anytime soon).

That said, SoftLayer does provide 20Gbps access within a private rack and 20Gbps access to the public network.

Most large scale operations are (or soon will be) deploying 25GE and/or 50GE in place of 10Gbps Ethernet. 100GE to each node is unnecessary for most workloads & more importantly it's obscenely expensive and likely to remain so for at least 3 more years.

Architecturally I would honestly wonder why you need that. Git, with its heavy reliance on immutable objects, should replicate very well. I would expect to be NIC-bound, but I would also expect it to horizontally scale reasonably well. Storage is obviously a concern -- you will have to make sure you don't end up needing a full copy of your data on every node -- but there are well known solutions to this problem that account for hot keys/objects well enough for you to get really really far. Even reaching as far back as the BigTable paper there is valuable stuff to look at and you can obviously look at Cassandra to see how the OSS world has tackled that problem.
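To make the "immutable objects replicate well" point concrete: git names each blob by the SHA-1 of a short header plus its content, so two nodes can compare object IDs and copy only what is missing, with no conflict resolution needed. A minimal Python sketch follows; `git_blob_id` mirrors what `git hash-object` computes for a blob, while `missing_objects` is a hypothetical helper of mine, not part of git.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Object ID git assigns to a blob: SHA-1 over 'blob <size>\\0' plus the content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def missing_objects(local_ids, remote_ids):
    """IDs the local replica still has to fetch from the remote one.

    Because an ID is derived from the content, an object that exists on both
    sides is guaranteed identical and never needs to be re-sent or merged.
    """
    return set(remote_ids) - set(local_ids)


if __name__ == "__main__":
    a = git_blob_id(b"hello world\n")
    b = git_blob_id(b"another file\n")
    print(missing_objects(local_ids=[a], remote_ids=[a, b]) == {b})
```

Refs (branches, tags) are the mutable part and need more care, but the bulk of the bytes are objects like these.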

Same experience here, going on seven or eight years. They offer a great solution between cloud native and 100% owned/leased/caged bare metal.

Cool, I was not aware you can ask for same rack hosting.

in the past they've spread servers into different rooms that had interconnects that challenged a team I was on to spend extra time writing compression code. is this better now?

None of that would convince me to lift a finger. Not unless you put dollar amounts on one vs the other. Version control isn't something I even care about performance.

I do. We had horrible download\merge\resolve times at my place because of some poor choices (storing many copies of large binary files across all branches in perforce) and it made a huge improvement in development time when those issues were resolved. Granted, this was an extreme case, but VC performance is not irrelevant to the cost of software development.

Version control is the primary tool your developers are using at any org doing software development. Its performance is key to ensure devs aren't wasting their time waiting on version control.

Performance along with reliability are the most important metrics for someone providing VCS as a service.

IDE and local dev environment are primary. I'm currently saddled with a proprietary SVN-based enterprisey monstrosity that costs me 10 minutes a day at the absolute worst. It sucks, but we can easily live with it.

To be sure, I've seen benchmarks (can look up later) that found a 10x+ performance improvement over and above the cost savings, leading to I guess a 100x price/performance improvement.

If you're committed to having a robust architecture (this may not be financially viable immediately) you should study the mistakes that Github have made, e.g. https://news.ycombinator.com/item?id=11029898

Geo-redundancy seems like a luxury, until your entire site comes down due to a datacenter-level outage. (E.g. the power goes down, or someone cuts the internet lines when doing construction work on the street outside).

(This is one of the things that is much easier to achieve with a cloud-native architecture).

Exactly this. I don't think that if you take into account __all__ parameters bare metal is cheaper. My other problem is that they are moving from cloud to bare metal because of performance while using a bunch of software that is notoriously slow and wasteful. I would optimise the hell out of my stack before committing to a change like this. Building your own racks does not deliver business value and it is an extremely error-prone process (been there, done that). There are a lot of vendors where the RMA process is a nightmare. We will see how it turns out for Gitlab.

The RMA process is pretty much a moot point at this scale. It simply doesn't make sense to buy servers with large warranties. Save the money, stock spares, and when a component dies replace it. In the long run it will come out to cost a lot less. I'd much rather pay much less knowing a component may fail here and there and you buy a new one.

As far as moving from cloud to bare metal, another thing to take into consideration (that this was replied to), if you don't architect your AWS (or other cloud solution) to take advantage of multiple geographic regions, the cloud won't benefit you.

I 100% agree that there should be more than one region deployed for this service. As others have said, all it takes is 1 event and the site will be down for days to weeks to months (It may not happen often, but when it does, you go out of business). The size and complexity of this infrastructure will make it nearly impossible to reproduce on short order in a new facility. If I were the lead on this I would have either active / active sites, or an active / replica site.

I would also have both local (fast restores), and off-site backups of all data. A replica site protects against site failure not data loss and point-in-time recovery.

"As far as moving from cloud to bare metal, another thing to take into consideration (that this was replied to), if you don't architect your AWS (or other cloud solution) to take advantage of multiple geographic regions, the cloud won't benefit you."

Yep, this is why scaling starts with scalable distributed design. We were moving a fairly large logging stack from NFS to S3 once, for the same reason Gitlab is trying to move to bare metal now. Moving off cloud was not an option, moving to a TCO efficient service was. NFS did not scale and there was the latency problem. I think moving to bare metal cannot help with scale as much as a good architecture can. We will see how deep the datacenter hole goes. :)

Agreed. Application Architecture is far more important than Cloud vs. Bare Metal. It is just easier and more cost effective to throw more bare metal hardware at the problem than it is cloud instances. For some this does make bare metal the better option.

To add to my previous comment though, AWS (and cloud in general) tends to make much more sense if you are utilizing their features and services (such as Amazon RDS, SQS, etc.), and if you aren't using these services I can absolutely guarantee I can deliver a much lower TCO on bare metal than AWS. (Which is why I offered to consult for them.) I see this all the time. Company moves from bare metal to AWS as bare metal is getting expensive, then they quickly find out AWS can't deliver the performance they need without massive scale (because they aren't using a proper scalable distributed design and can't afford to re-architect their platform).

This is definitely what perked my ears up when they mentioned the US East Coast, especially when you consider the risk that a natural disaster might take out the facility.

There's no reason you can't have a hot cloud site, and operate in lipo mode (or even scale it up temporarily) if you lose your DC. Best of both worlds.

What is lipo mode? Closest I can guess is a typo for 'limp mode'.

Haha, yep... was also autocorrected to limo. :(

Z1: the word "monitoring" does not appear in this document.

You will need to monitor: - ping - latency - temperatures - cpu utilization - ram utilization - disk utilization - disk health - context switches - IP addresses assigned and reaching expected MACs - appropriate ports open and listening - appropriate responses - time to execute queries - processes running - process health - at least something for every bit of infrastructure

once you collect that information, you need to record it, graph it, display it, recognize non-normal conditions, alert on those, page for appropriate alerts, and figure out who answers the pagers and when.
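The "recognize non-normal conditions, alert on those, page for appropriate alerts" step can be sketched in a few lines of Python. This is a toy illustration only: the check names and thresholds are hypothetical, and a real setup would use a proper monitoring system rather than a hand-rolled loop.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    warn: float   # value at or above this raises an alert
    page: float   # value at or above this wakes someone up


def evaluate(check, value):
    """Classify one sample against its thresholds."""
    if value >= check.page:
        return "page"
    if value >= check.warn:
        return "warn"
    return "ok"


def triage(checks, samples):
    """Map every check to ok/warn/page, or 'missing' if no sample arrived.

    A missing sample is reported explicitly, because a silent collector
    looks exactly like a healthy system otherwise.
    """
    return {
        c.name: evaluate(c, samples[c.name]) if c.name in samples else "missing"
        for c in checks
    }


if __name__ == "__main__":
    # Hypothetical thresholds for two of the items in the list above.
    checks = [Check("disk_used_pct", 80, 95), Check("cpu_temp_c", 70, 85)]
    print(triage(checks, {"disk_used_pct": 83}))
```

The "who answers the pager and when" part is an organizational problem and no script solves it.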

Good point, we know monitoring is very important.

Our Infrastructure lead Pablo will do a webinar of our Prometheus monitoring soon https://page.gitlab.com/20161207_PrometheusWebcast_LandingPa...

We're bundling Prometheus with GitLab https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481

Brian Brazil is helping us https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481#not...

On January 9 our Prometheus lead will join us (who was very valuable already helping with this research behind this blog post) and we're hiring Prometheus engineers https://about.gitlab.com/jobs/prometheus-engineer/

In the short term we might send our monitoring and logs to our existing cloud based servers. In the long term we'll host them on our own Kubernetes cluster.

For our monitoring vision see https://gitlab.com/gitlab-org/gitlab-ce/issues/25387

Gitlab package includes too many things! I remember Gitlab sponsors a guy to do proper packaging to include Gitlab in debian last year, not sure what is the status.

Since GitLab is heavily betting on Prometheus now, my guess is that it will be used to cover almost all of those points and more. Yep, lots of work though.

Disclaimer: Prometheus co-founder and just started as a Prometheus contractor for GitLab.

EDIT: ah, didn't reload the page to see that sytse had already responded with this :)

neither the word "IT salary(ies)"

Are you sure about the location? I would go with Frankfurt, Germany. Biggest IX in the world and if you want a "low-latency" solution for all users this is basically the middle of everything. NYC will have a worse connection to Asia and I don't want to begin with India or something. While Frankfurt is basically only 70ms away from NY and around 120ms to the west coast, while even south america should be < 200ms. Russia is 40-50ms and Asia should be between 100ms and 200ms. Australia will probably have the worst ping with around 200-300ms.

Just as a suggestion since you seem to be so certain about the location. I basically know every DC in Frankfurt, so if you need any help or info in that regard feel free to contact me :-)
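If someone wants to turn the figures above into a single comparable number per candidate site, a user-weighted average is one rough way to do it. In this Python sketch the latencies are midpoints of the ranges quoted above for Frankfurt, and the user distribution is entirely hypothetical; you would plug in real traffic data and do the same for each candidate location.

```python
def weighted_latency_ms(latency_ms, user_share):
    """Mean round-trip time across regions, weighted by each region's share of users."""
    total_share = sum(user_share.values())
    weighted = sum(latency_ms[region] * share for region, share in user_share.items())
    return weighted / total_share


# Midpoints of the rough figures quoted above for a Frankfurt site.
FRANKFURT_MS = {
    "us_east": 70,
    "us_west": 120,
    "south_america": 200,
    "russia": 45,
    "asia": 150,
    "australia": 250,
}

# Hypothetical user distribution; replace with real traffic data.
USER_SHARE = {
    "us_east": 0.35,
    "us_west": 0.25,
    "south_america": 0.05,
    "russia": 0.05,
    "asia": 0.25,
    "australia": 0.05,
}

if __name__ == "__main__":
    print(round(weighted_latency_ms(FRANKFURT_MS, USER_SHARE)), "ms")
```

An average hides the tail, so it is worth also looking at the worst region for each site.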

If you can only deploy one place in the world, east coast US is generally the right fit (if you have typically distributed user traffic, of course). You can cover western Europe and the major population centers of eastern North America, and still have reasonable performance in western North America as well. While it's increasingly less true over time, keep in mind that this is the place that other countries all prioritized building connectivity to, because historically this is where early commercial websites lived.

Of course there's the issue of potential legalities for your business, but don't kid yourself that you're safe from prying eyes by deploying in a particular country.

The next step (and particularly important for a business like GitLab) is to land a second site in either western Europe or west coast US. Honestly you should be thinking about this right away, and look to sign leases on both spaces simultaneously with a 3-mo delay built in for the second site. This should help you negotiate price as well, if you're able to go with the same dc provider, but be aware that that itself is a single point of failure as well. Make absolutely sure you negotiate your MSA to the n-th degree, get a good SLA, etc. You can still get burned, but do your legal due diligence now because you won't have a chance to change terms later.

Then continue to optimize by having multiple sites per-region (so that failover doesn't involve a big performance hit), adding APAC / AUNZ regions, and so forth. For a service like GitLab, I wouldn't think that time to sync the repo is hyper important, but responsiveness of the web interface is fairly key. So that may lead to a hub/spoke design where there are a few larger sites storing the bulk of the data and more small sites to handle metadata and such to present the web views.

That's all years down the road though. For a first pass, I can't see a problem with northern Virginia as the first site.

Frankfurt surely has a great IX. Even if Frankfurt would make everyone better off (which I'm not sure about) there is another problem. People in the US are used to lower latencies because most SaaS services are hosted there.

I have less of a concern on latency than I do with who has taps in the lines...

I would assume that the .de lines are saturated with 5 eyes...

I remain skeptical about the existence of non-tapped lines.


EDIT: Funny, every single documented, proven link for the last several years has proven me and GP correct and all you nay-sayers wrong.

I believe the downvotes are due to "+1" not adding anything to the conversation, as opposed to people disagreeing with the sentiment.

Isn't it ironic that a (+1) on HN will get downvotes yet that's basically what every single HNer seeks to see in their requests on git. :-)

So what location would be out of reach of FVEY? And why would you put secret stuff on Gitlab.com?

Why would you care?

What's your email address, SSN and password. If you have nothing to hide, then you shouldn't care why I want to read all your messages.

Where are the customers coming from that are immune to national intelligence apparatus?

I wouldn't make a business decision around a general feeling of paranoia. German spies aren't an improvement over American or British ones.

Why would you care, as in, why would you have cleartext data on the wire?

This is Gitlab. There should never be any data in or out except HTTPS and SSH.

thank you. i didnt read it with that context

> I basically know every DC in Frankfurt, so if you need any help or info in that regard feel free to contact me :-)

I (and others I bet) would be interested in a quick summary of the options (and your opinions of them) for facilities in Frankfurt.

There are basically a shitload of datacenters in Frankfurt so I won't talk about every DC, but more of a general overview. It doesn't matter what your budget is or what your requirements are, you will find a DC in Frankfurt.

You basically have the big brands like Equinix (7 facilities in Frankfurt), Interxion (They are currently building their 11th facility in Frankfurt), Telecity (now part of Equinix) but also some local ones like e-Shelter, First Colo or Accelerated. There is also a DC only meters away from the Interxion campus that is called Interwerk and is basically the cheapest DC in Frankfurt, but it has an awesome price/performance ratio and you have access to all peering/transit options of Interxion (via CWDM) and can get a 1/1 rack for under 200 EUR/mo. I have been in that DC for several years and never had a single issue, so if you don't rely on any certificates this is a cheap option.

I was also colocating in the First Colo DC for some time, they are also in the locality around the Interxion campus. It's a pretty small DC but it is more premium than the Interwerk and also is kinda cheap. I personally wouldn't go with Accelerated, since they had multiple issues in the past.

For the big brands I would definitely go with Interxion, they are great, have all the certificates and are premium while not going crazy with their pricing like Equinix does. DigitalOcean, Twitch, etc. they are all in one of the many Interxion facilities in Frankfurt. If Price isn't a concern I would probably go with Equinix FRA5.

DE-CIX is present in around 7 facilities, 3 of them are Interxion and I think 2 Equinix.

So what about the data privacy in the USA, and patriot act and all that. Also think fin tech where geolocation means more things than just latency.

Why does latency matter for fin tech for a repository / source control?

"more things than just latency"

Some financial institutions require all sensitive* data to be stored/hosted in the same country/state(archaic yes).

It's real hard to actually define sensitive data but the IP* in some code a quant wrote can totally be considered a trade secret by a non-technical person. *Please don't get me started on how stupid I think it is that people consider code to be IP.

>archaic yes

Definitely not archaic. If the FISA courts taught you anything, it should be that the country your things are living in determines which entities can tap your traffic/hardware.

To that end I'd be cautious of facilities in NJ. If you are dead-set on the US, Ashburn, or even better further inland such as Chicago or Dallas, may be better suited if you do not plan on having redundancy in another facility.

Frankfurt may be a better option solely on the basis that it is more "environmentally" stable (in terms of events that may disrupt the operation of the facility).

Australia will be 300-350ms.

It really depends on where most of their high-paying customers are. Which is, very likely, in the US/Canada and some of EU. So US East Coast makes sense.

If the customer base has lots of US enterprise customers, that choice will cost you money. Ex-US data residency is an issue for many compliance standards.

Isn't that the same vice versa? Sure, if the majority of the customers sit in the US _and_ care about that: Fine. Otherwise I felt that the online community rather likes to avoid the US for data if that is an option.

(I'm from Germany, but I couldn't care less about the country per se or GitLab being co-located in Frankfurt)

It's a problem IMO that they are trying to pick 1 site. Geodiversity is necessary for any serious enterprise, and given that, I would suggest 1 site on the US west coast to please users in SF/LA/SEA (and most of Asia) plus 1 site somewhere in Europe (AMS, FRA, etc). Once you can afford 3 sites, add Asia or US east, depending on where the users are.

I'm sure that you guys have done a lot of analysis but flipping to steel is the very last thing I would consider before reviewing the tech in the world today in relationship to TCOS as things move very fast.

You could frame infrastructure cost savings in many different ways though. The most obvious solution may seem to be the spend to move from the cloud to in house bare metal but I feel like you'll have a lot of costs that you haven't accounted for in maintenance, operational staff spend, cost in lost productivity as you make a bunch of mistakes.

Your core competency is code, not infrastructure, so striking out to build all of these new capabilities in your team and organization will come at a cost that you can not predict. Looking at total cost of ownership of cloud vs steel isn't as simple as comparing the hosting costs, hardware and facilities.

You could reduce your operating costs by looking at your architecture and technology first. Where are the bottlenecks in your application? Can you rewrite them or change them to reduce your TCOS? I think you should start with the small easy wins first. If you can tolerate servers disappearing you can cut your operating costs by 1/8th by using preemptible servers in google cloud platform for instance. If you optimize your most expensive services I'm sure you can cut hunks of your infrastructure in half. You'll have made some mistakes along the way that contribute to your TCOS - why not look at those before moving to steel and see what cost savings vs investment you can get there?

Ruby is pretty slow but that's not my area of expertise - I wouldn't build server software on dynamic languages for a few reasons and performance is one of them but that's neither here nor there as you can address some of these issues in your deployment with modern tech I'm sure. Aren't there ways to compile the ruby and run it on the JVM for optimizations sake?

Otherwise do like facebook and just rewrite the slowest portions in scala or go or something static.

Try these angles first - you're a software company so you should be able to solve these problems by getting your smartest people to do a spike and figure out spend vs savings for optimizations.

Yeah, link aggregation doesn't work how they think it does. And not having a separate network for Ceph is going to bite them in the arse.

GitLab is fine software but fuck me, they need to hire someone with actual ops experience (based on this post and their previous "we tried running a clustered file system in the cloud and for some reason it ran like shit" post).

They'd save themselves a whole lot of time, effort, and money if they looked at partitioning their data storage instead of plowing ahead with Ceph. They have customers with individual repositories. There is no need to have one massive filesystem / blast radius.
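For what partitioning could look like in its simplest form: map each repository to one of N independent storage shards, so a failure only affects the repositories on that shard. The Python below is a sketch under my own assumptions (hash-based placement, made-up repository paths); it is not how GitLab stores repositories.

```python
import hashlib


def shard_for(repo_path: str, shard_count: int) -> int:
    """Deterministically pick a storage shard for a repository.

    Uses a stable hash (not Python's per-process randomized hash()) so every
    frontend computes the same answer for the same path.
    """
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for repo in ["alice/website.git", "bob/kernel.git", "carol/notes.git"]:
        print(repo, "-> shard", shard_for(repo, 4))
```

Plain modulo hashing moves most repositories when the shard count changes, so in practice you would more likely keep an explicit repository-to-shard lookup table and only use something like this to place new repositories.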

I don't know much about this stuff, but won't that stop working if they ever decide to expand to multiple geographical sites, to reduce latency to customers in different locations? In that case, different sites can receive requests for the same repositories, and ideally each site would be able to provide read only access without synchronization, with some smarts for maintaining caches, deciding which site should 'own' each file, etc. They could roll their own logic for that, but doesn't that pretty much exactly describe the job of a distributed filesystem? So they'd end up wanting Ceph anyway, so they may as well get experience with it now.

Seems this is one of the goals of the article: "We're hiring production engineers and if you're spotting mistakes in this post we would love to talk to you (and if you didn't spot many mistakes but think you can help us we also want to talk to you)".

My bad, I didn't make it that far down the post before hitting rant mode.

Ceph should get a separate network which is only used for re-replication in case something happens. Consider a node goes down.

Another thing I might recommend is a third network (just a simple 1Gb link and a quality switch) for consensus. Re-replication can max the network out and cause consensus failures, which cause more re-replication, winding everything down ... If that's not possible, add firewall rules to prioritize all consensus-related ports high.

They said very little about how they think link aggregation works. Just that they can send packets on both and continue working with only one port. That's basically the definition of link aggregation. So what's wrong about the post?

If you are looking for performance, do not get the 8TB drives. In my experience, drives above 5TB do not have good response times. I don't have hard numbers, but I built a 10 disk RAID6 array with 5TB disks and 2TB disks and the 2TB disks were a lot more responsive.

From my own personal experience, I would go with a PCIe SSD cache/write buffer, and then a primary SSD tier and a HDD tier. Storage, as it seems you guys have experienced, is something you want to get right on the first try.


N1: Yes, the database servers should be on physically separate enclosures.

D5: Don't add the complexity of PXE booting.

For Network, you need a network engineer.

This is the kind of thing you need to sit down with a VAR for. Find a datacenter and see who they recommend.

Thanks for commenting! You're right that 2TB disks will have more heads per TB and therefore more IO. We want to increase performance of the 8TB drives with Bcache. If we go to smaller drives we won't be able to fit enough capacity in our common server that has only 3 drives per node. In this case we'll have to go to dedicated file servers, reducing the commonality of our setup. We're using JBOD with Ceph instead of RAID.

If the 8TB drives don't offer the IO we need we'll have to leave part of them empty. I assume that if you fill them with 2TB they should perform equally well. Word on the street is that Google offers Nearline to fill all the capacity they can't use because of latency restrictions for their own applications.
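The "leave part of them empty" reasoning is just IOPS-per-TB arithmetic, sketched below in Python. It assumes a spinning drive delivers roughly the same random IOPS regardless of its capacity, which is a simplification (and disputed elsewhere in this thread); the figure of 100 IOPS per drive is a hypothetical round number, not a measurement.

```python
def iops_per_tb(drive_iops, stored_tb):
    """Random IOPS available per TB of data actually stored on the drive."""
    return drive_iops / stored_tb


def max_fill_tb(drive_iops, capacity_tb, required_iops_per_tb):
    """How much data a drive can hold while still meeting an IOPS-per-TB target."""
    return min(capacity_tb, drive_iops / required_iops_per_tb)


if __name__ == "__main__":
    DRIVE_IOPS = 100  # hypothetical figure for one 7200 rpm drive
    print(iops_per_tb(DRIVE_IOPS, 2))      # a full 2TB drive
    print(iops_per_tb(DRIVE_IOPS, 8))      # a full 8TB drive
    print(max_fill_tb(DRIVE_IOPS, 8, 50))  # usable TB on an 8TB drive at 50 IOPS/TB
```

Under that assumption a full 8TB drive has a quarter of the IOPS per TB of a full 2TB drive, and matching the 2TB figure means storing only 2TB on it.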

It is an interesting idea to do SSD cache + SSD storage + HDD instead of just SSD cache + HDD. I'm not sure if Ceph allows for this.

I recommend testing and evaluating Bcache carefully before planning to use it in a production system. I found it unwieldy, opaque, and difficult to manage when testing on a simple load; benefits were mild.

I'm sure it has its uses! But I'd emphatically advise anyone to obtain hands-on experience with Bcache before planning around it! As I'm sure you are already doing or considering of course :)

It doesn't work like that. 2TB drive will always be faster than an 8TB drive. The amount of data has no effect when compared to the physical attributes of the drive. More platters will increase the response time.

Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier while the infrequent data to a slower tier.

> Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier while the infrequent data to a slower tier.

By "Tiering", is this moving data between different drive types? Or by moving the data to different parts of the platter to optimize for rotational physics?

By Tiering, I'm talking about moving blocks of data from slower to faster or vice versa. If I have 10TB of 10k IOPS storage and 100TB of 1k IOPS storage in a tiered setup, data that is frequently accessed would reside in the 10k IOPS tier while less frequently accessed data would be in the 1k tier. In this case, the blocks of popular repositories would be stored in SSD, while the blocks of your side project that you haven't touched in 4 years would be on the HDD. You still have access to it, it might just take a bit longer to clone.

Ceph can probably explain it better than I can. http://docs.ceph.com/docs/jewel/rados/operations/cache-tieri...
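
A toy illustration of the placement rule described above, in case it helps. This is not how Ceph implements cache tiering; it only shows the idea of filling the fast tier with the most frequently accessed blocks first. The block names, access counts, and sizes are made up.

```python
from collections import Counter

def assign_tiers(access_counts, block_size_tb, fast_capacity_tb):
    """Put the hottest blocks in the fast tier until it is full; the rest go to the slow tier."""
    fast, slow = [], []
    used_tb = 0
    # Hottest first; ties are broken by name so the result is deterministic.
    for block, _hits in sorted(access_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if used_tb + block_size_tb <= fast_capacity_tb:
            fast.append(block)
            used_tb += block_size_tb
        else:
            slow.append(block)
    return fast, slow

# Hypothetical access counts over some sampling window.
accesses = Counter({"popular-repo": 9000, "ci-cache": 4000, "side-project": 2, "archive": 0})
fast, slow = assign_tiers(accesses, block_size_tb=5, fast_capacity_tb=10)
print("fast tier:", fast)
print("slow tier:", slow)
```

A real tiering system also has to demote data as access patterns change, which this sketch ignores.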

That is pretty awesome. Should the SSDs for the fast storage be on the OSD nodes that also have the HDDs, or should they be on separate OSD nodes?

The fact that you're asking this question on Hacker News leaves little doubt in my mind that you and your team are not prepared for this (running bare metal).

I read the entire article, and while you talked about having a backup system (in a single server, no less!) that can restore your dataset in a reasonable amount of time, you have no capability for disaster recovery. What happens when a plumbing leak in the datacenter causes water to dump into your server racks? How long did it take you to acquire servers, build the racks and get them ready for you to host customer data? Can your business withstand that amount of downtime (days, weeks, months?) and still operate?

These questions are the ones you need to be asking.

In other words, double your budget because you'll need a DR site in another datacenter.

They don't need another datacenter. AWS has had more issues than NYI's datacenters have had in the past year. There are many companies that are based out of a single datacenter that haven't had any major issues.

Backblaze comes to mind.

That would be something I would test. I don't know Ceph, so I would be taking a shot in the dark. I would guess it would not make much of a difference as everything is block level. I, personally, would do 1x PCIe SSD for cache, 1x 2/4TB SSD, and 2x 4TB HDD for each storage node.

Edit. If Ceph is smart enough, it would be aware of the tiers present on the node, and would tier blocks on that node. So a Block on node A will stay on node A.

Though 2TB of data on an 8TB drive does mean only 1/4 as many requests hit it, right?

I'm not sure I understand your question. A 2TB/2TB disk will have the same number of requests as a 2TB/8TB disk, as they both have the same amount of data.

If you are talking physically, 2TB/8TB would theoretically be faster than a 2TB/2TB disk if the performance attributes were the same. But a 2TB HDD will have a faster average seek time than an 8TB HDD due to physical design. Any performance gains of only partially filling a drive would probably be offset by the slower overall drive.

> a 2TB HDD will have a faster average seek time than an 8TB HDD due to physical design. Any performance gains of only partially filling a drive would probably be offset by the slower overall drive.

I'm skeptical here. So your minimum seek time goes up on the 8TB because it's harder to align. But your maximum seek time should drop tremendously because the drive arm only has to move along the outer fifth of the drive. And your throughput should be great because of the doubled linear density combined with using only the outer edge.

I'm not saying that you wouldn't see any performance gain. 2TB on the outer track will be faster than 8TB on the same 8TB disk, but I'm saying any gains will be lost due to the dense nature of the drives.

A quick google search shows that there are marginal gains on the outer track vs the inner, but that is only on sequential workloads. For something like GitLab, the workloads would be anything but.



Ignore the part about where the partition is, then.

1. If I look at 2TB vs. 8TB HGST drives, their seek times are 8ms and 9ms respectively. But if you're only using a quarter of the 8TB drive, the drive arm needs to move less than a quarter as much. Won't that save at least 1ms?

2. The 8TB drive has a lot more data per inch, and it's spinning at the same speed. Once a read or write begins, it's going to finish a lot faster.

3. Here's a benchmark putting 8TB ahead in every single way http://hdd.userbenchmark.com/Compare/HGST-Ultrastar-He8-Heli...

So, the 8TB drives are fine if you're doing sequential writes, mainly doing write-only workloads, or have a massive buffer in front.

In my experience, the PCIE drives they mention (the DC P3700) are incredible. They're blazing fast (though the 3D X-Point stuff is obviously faster), have big supercaps (in case of power failure), and really low latency for an SSD (150 microseconds or so). They're a pretty suitable alternative to in-memory logbuffers where latency is less crucial. Much cheaper than the equivalent memory too. Having a few of these in front of a RAID 10 made up of 8TB drives will work just fine.

FWIW your experience with a RAID 6 of big disks is unsurprising. Raid 6 is almost always a bad choice nowadays since disks are so big. Terrible recovery times and middling performance.

True about RAID6, I was just commenting on performance. A 2TB will spank an 8TB in any configuration.

I would think that GitLab's workload is mostly random, which would pose a problem for larger drives. The SSDs are a great idea, but I've only seen 8TB drives used when there are 2 to 3 tiers; with 8TB drives being all the way on the bottom. I'm not sure how effective having a single SSD as a cache drive for 24TBs of 8TB disks will be.

Don't have hard numbers you say?



Oops that's reliability, I can jump the gun sometimes, sorry.

I'll be here all day to learn from suggestions. I'm hoping for much feedback so please reference questions with the letter and number: 'Regarding R1'.

Moving to your own hardware will almost certainly improve performance, reduce incidental downtime, and cut costs substantially. Including hiring more engineers, you might expect total costs to be ~40-50% of what you would have spent on cloud-based services over the first 24 months. If your hardware lifecycle is 36-48 months, you will see large savings beyond 24 months.

A few things to watch out for, if your team doesn't have a lot of experience in direct management of dedicated servers/datacenters:

- There is an increased risk of larger/longer very rare outages due to crazy circumstances. Make sure you have a plan to deal with that. (I've had servers hosted at a datacenter that flooded in Hurricane Sandy, and another where an electrical explosion caused a >3 day outage..)

- It's easy to think you'll rely on managed services, but that rarely works out well. It also can become very, very expensive -- possibly more so than cloud-based hosting.

- Specifically, regarding H1, H2: Dedicated hardware is substantially cheaper than cloud-based hosting, but if you rely too much on managed services you negate a large portion of the savings. Consider that most service providers will be both more expensive and less competent than doing it yourselves. Also, having your own team have direct knowledge + their own documentation of the setup will be beneficial.

- I'd recommend budgeting for and ordering some extra parts to keep on hand for replacement, if you are having datacenter ops handle hardware or can have an engineer located relatively close to your datacenters. (A few power supplies, some memory, a couple drives - nothing too crazy.)

- Supermicro's Twin systems are great. In the past I've gone with their 1U models vs. 2U to slightly reduce the impact of unit downtime. (Having to take one 2U Twin down affects four nodes. It sounds like you'll have to decide on balancing that against the increased drive capacity, in your case.)

> reduce incidental downtime

In the long run, probably. Immediately after deploying, unless they hire very experienced people, I expect quite a few "never seen before" issues (which may not result in publicly visible downtime, though). Monitoring for "thermal events", weird and hard-to-debug issues requiring firmware updates, "bad cable" issues, etc. are not things you have to deal with in the cloud.

I'm curious to know if you think this work is within the core competency of GitLab. If so, how did you decide it was? If not, how do you realize the investment over time in something that isn't? Is the GitLab CEO here?

GitLab CEO here. Hardware and hosting are certainly not our core competencies. Hence all the questions in the blog post. And I'm sure we made some wrong assumptions on top of that. But it needs to become a core competency, so we're hiring https://about.gitlab.com/jobs/production-engineer/

I think you're underestimating the number of people required to run your own infrastructure. You need people who can configure networking gear, people swapping out failed NICs/drives at the datacenter, someone managing vendor relationships, and people doing capacity planning.

I think you could probably get the IO performance you're talking about in your blog post from AWS instances or Google Cloud's local NVMe drives, but if you truly need baremetal, I'd recommend Packet or Softlayer. Don't try to run your own infrastructure or in a year you'll be: https://imgflip.com/i/1fs7it

I would challenge your assertion that you need a core competency in bare metal. AWS and GCE are performant enough--you're just not using them correctly. Invest in IaaS expertise and be successful; invest in bare metal in this day and age and live to regret it forever.

You are ruling out providers like OVH because of the lack of a sufficient SLA, but then you are replacing them with your own solution that will have no SLA.

Lower your SLA requirements and go with multiple providers like OVH. Make your site work across multiple datacenters. At the end of the day your users will be much happier.

My 2 cents.

I love that this question baited your well-known introduction out again. Wasn't there a post at some point that searched for 'GitLab CEO here' on this very site, just for kicks and giggles?

> But it needs to become a core competency, so we're hiring


Just a few quick notes. I've experience running ~300TB of usable Ceph storage.

Stay away from the 8TB drives. Performance and recovery will both suck. 4TB drives still give the best cost per GB.

Why are you using fat twins? Honestly, what does that buy you? You need more spindles, and fewer cores and memory. With your current configuration, what are you getting per rack unit?

Consider a 2028u based system. 30 of those with 4TB drives gets you the 1.4PB raw storage you're looking for. 2683v4 processors will give you back your core count, yielding 960 cores (1920 vCPUs) across that entire set. You can add a half terabyte of memory or more per system in that case.

Sebastien Han has written about "hyperconverged" ceph with containers. Ping him for help.

The P3700 is the right choice for Ceph journals. If you wanted to go cheap, maybe run a couple of M.2 NVMe drives on adapters in PCIe slots.

I didn't really need the best price per GB in my setup, so I went with 6TB HGST Deskstar NAS drives. I'm suggesting you use 4TB as you need the IOPS and aren't willing to deploy SSD. Those particular drives have 5 platters and a relatively high areal density, giving them some of the best throughput numbers among spinning disks.

If you can figure out a way to make some 2.5" storage holes in your infrastructure, the Samsung SM863 gives amazing write performance and is way, way cheaper than the P3700. I recently picked up about $500k worth, I liked them so much. They run around $.45/GB. Increase over-provisioning to 28% and they outperform every other SATA SSD on the market (Intel S3710 included).

You'll probably want to use 40GE networking. I've not heard good things about Supermicro's switches. If I were doing this, I'd buy switches from Dell and run Cumulus Linux on them.

Treat your metal deployment as infrastructure as code, just like any cloud deployment. Everything in git, including the network configs. Ansible seems to be the tool of choice for NetDevOps.

Thanks for the great suggestions.

We're considering the fat twins so we get both a lot of CPU and some disk. GitLab.com is pretty CPU heavy because of ruby and the CI runners that we might transfer in the future. So we wanted the maximum CPU per U.

The 2028u has 2.5" drives. For that I only see 2TB drives on http://www.supermicro.com/support/resources/HDD.cfm for the SS2028U-TR4+. How do you suggest getting to 4TB?

Also whatever you do, don't buy all one kind of disk. That'll be the thing that dies first and most frequently. Buy from different manufacturers and through different vendors to try and get disks from at least a few different batches. That way you don't get hit by some batch of parts being out of spec by 5% instead of 2% and them all failing within a year, all at the same time.

If you do somehow manage to pick the perfect disk sure having everything from a single batch would be the best since that'll ensure you have the longest MTBF. But how sure are you that you'll be picking the perfect batch simply by blind luck?

I had this problem with the Supermicro SATA DOMs. Had problems with the whole batch.

That said, I bought the same 6TB HGST disk for two years.

As long as you're not buying all the disks at once sticking with one manufacturer and brand should be fine. If you're buying 25% of your total inventory every year it'll all be spread out to just a few percent per month.

But when you're buying 100% of your disk inventory at once there's a serious "all eggs in one basket" risk.

Sorry, I was confused by the part numbers. I was thinking of the 6028u based systems that have 12x3.5" drives. These are what I used for my OSD nodes in my Ceph deployment.

As for CPU density, I still feel like you're going to need more spindles to get the IO you're looking for.
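
For anyone checking the numbers from earlier in this subthread, here is a quick sanity check of the 30-node suggestion. It assumes the corrected 12-bay chassis, 4TB drives, and dual-socket nodes (the socket count is my assumption; the E5-2683 v4 is a 16-core part with Hyper-Threading).

```python
NODES = 30
DRIVES_PER_NODE = 12    # 12x3.5" bays, per the part-number correction
DRIVE_TB = 4
SOCKETS_PER_NODE = 2    # assumption: dual-socket systems
CORES_PER_CPU = 16      # Xeon E5-2683 v4
THREADS_PER_CORE = 2    # Hyper-Threading

raw_tb = NODES * DRIVES_PER_NODE * DRIVE_TB
cores = NODES * SOCKETS_PER_NODE * CORES_PER_CPU
vcpus = cores * THREADS_PER_CORE

print(f"raw storage: {raw_tb} TB (~{raw_tb / 1000:.2f} PB)")
print(f"cores: {cores}, vCPUs: {vcpus}")
```

That lines up with the 1.4PB raw, 960 cores, and 1920 vCPUs quoted above. Usable capacity after Ceph replication will be a fraction of the raw figure.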

The estimate of 19kW gives a rough estimated requirement of ~90A @ 208V.

4 x 208v 30A circuits gives a total of 120A -- of which they can only be used at 80% capacity, so that gives you 96A usable -- without redundancy.

My initial feelings (eyeballing it) are you should be looking for 3 full racks, 2x208V@30A in each.
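
The arithmetic behind those figures, as a rough sketch. It treats the load as watts divided by volts (single-phase, power factor of 1, which is a simplification) and applies the 80% continuous-load limit mentioned above.

```python
def required_amps(load_watts, volts):
    """Current drawn by a load, ignoring power factor."""
    return load_watts / volts

def usable_amps(circuits, breaker_amps, derate=0.8):
    """Total continuous current available across circuits at the given derating."""
    return circuits * breaker_amps * derate

need = required_amps(19_000, 208)
four_circuits = usable_amps(4, 30)
three_racks = usable_amps(3 * 2, 30)   # 3 racks with 2 x 30A circuits each

print(f"required: {need:.1f} A")
print(f"4 x 30A circuits: {four_circuits:.0f} A usable")
print(f"3 racks x 2 x 30A: {three_racks:.0f} A usable")
```

None of these totals account for redundancy: if the two circuits in a rack are meant to be A/B feeds, only one of them should be counted toward the load.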

As a Juniper shop, we implemented 2x QFX5100-48T (in virtual chassis) + EX4300 for remote access per rack. This will be a decision based on your local expertise though.

I also looked hard into the twin boxes -- but power to the rack in the end ruled the day, and it made not much of a difference to use the 1U boxes.

Don't forget about out of band (serial for the switching gear). We've been using OpenGear.com for this stuff with 4G-LTE builtin.

VPN access to access console devices?

Any site-to-site VPN needs?

Also, I would consider not pre-purchasing N times your required horsepower/disk if you can avoid it, but rather adding it in pre-planned yearly or biannual stages.

As the CEO of a cloud hosting & server management company -- I have much more to say about this if you would like to chat via phone or email anytime.

There are two major pitfalls to crowd-sourced consulting such as this.

1) Contributors have not been vetted - Some responses are based on real world experience, and some are conjecture from armchair quarterbacks. (A simple example: nobody has mentioned that with any of the SuperMicro 2U Twins you have to be cautious about the PDU models and outlet locations of 0U PDUs so they don't block node service/replacement in the rack.)

2) There are multiple ways to skin a cat - There are many viable solutions in this thread, but you can't simply take a little bit of this, a little bit of that, and piece together a new platform that "just works." -- Better to go with a known working solution than a little of this and a little of that. Multiple drivers tend to be less effective.

I am the owner of a bespoke Infrastructure as a Service provider that delivers solutions to sites of similar metrics, and speak with plenty of real-world experience.

Larger Providers - We find a lot of clients move away from AWS, Softlayer, Rackspace, et al. as the larger providers aren't nearly as interested in working with the less-than-standard configurations. They want highly-repeatable business, not one-off solutions.

I'd love to talk in more depth with you about how we can deliver exactly what you need based on years of experience in delivering highly customized solutions. We'll save you money and headache.

HN crowdsourcing is a pretty reasonable strategy for entities that cannot reliably identify & hire 1+ rockstar employee(s) and/or VARs to cover the compute, networking, storage, electrical, environmental, etc. If you can rationally evaluate the HN comments you should get pretty close to the best, cutting-edge advice. Whereas when you are small and you listen to 1 or 2 VARs and/or 1-2 internal employees you can expect, on average, to get average advice. Or advice that was excellent 2-3 years ago but is now out-of-date due to HW/SW progress that the employee/VAR is unaware of.

Cutting-edge and availability aren't really always best friends. For example cutting edge routing and switching (the latest products, fabric, SDN's, MC-LAG, etc) are notorious for failures and outages.

Average is not the appropriate word, but I'll use your word. I, like every enterprise out there, would rather have average advice based on tried and true solutions that costs 10% more with 99.99 to 99.999% availability than cutting-edge advice that saves 10-20% with 99.8% availability. The downtime alone can (and does) kill the reputations of sites like GitLab.com (and others).
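
To make those availability percentages concrete, this converts each one into expected downtime per year (assuming a 365-day year and that downtime is simply the unavailable fraction of the year).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(availability_percent):
    """Hours per year a service is down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

for pct in (99.999, 99.99, 99.8):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

That works out to roughly 5 minutes, 53 minutes, and 17.5 hours per year respectively.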

I wish more companies were open like this, pretty cool to see the decision making and reasoning behind operational stuff like this.

It'd be interesting to see more application-level solutions to scale rather than just adding hardware. Like extending git so objects can be fetched via a remote object-store rather than dealing with a locally mounted POSIX file system. This would allow you to use native cloud object stores and might simplify your latency requirements to the point where you could consider staying in the cloud.

This definitely increases software complexity but going the other way increases other complexities (ops, capex, capacity).

I thought the same and looked into this years ago but the short story is that for many git operations it needs block access to the repository. With an object store it becomes very slow.

Z2 (no, Z is not on the list you have)

You need at least two nodes that do DNS, DHCP, NTP, and other miscellaneous services that you absolutely want to have but do not seem to have mentioned. You want them to be permanently assigned, so that you never have to search for them, and you want them to fail over each service as needed, preferably both operating at once. Three nodes would be better. Consider doing some basic monitoring on these nodes, too, as a backup to your main monitoring system.

Absolutely agree. Typically I deploy about 4-6 "tools" hosts in a site. From a computational standpoint, you could make do with fewer hosts, but there are some things that I prefer to separate out:

1&2. Authoritative DNS (internal)

1&2. NTP

1&2. Caching DNS resolver

3&4. Outbound HTTP proxy (if necessary)

3&4. PXE / installer / dhcp

3&4. Local software mirror (apt / yum / etc.)

5&6. SSH jump hosts.

Or something like that.

Make sure the second host is not just in a separate chassis from the first host, but also in a separate rack.

For external authoritative DNS, don't do it yourself -- pay for someone to run that for you (Route53, ns1, etc.). For e-mail, if you can possibly not deal with it then don't -- use Mailchimp or something.

Sidenote: We are running coredns.io in production as authoritative internal DNS and as hidden master with NOTIFY to a secondary DNS provider (currently DNSmadeEasy).

The internal DNS records are served using the kubernetes middleware (basically serving the service records). The external records are pulled in from a git repository hosting our zones as BIND files. If need be, zones are split into subzones per team/project. Same permission system as our code, via MRs using GitLab.

Our recommendation is to build on open standards (BIND, AXFR) and use services on top of these.

I agree that using an external mail provider is usually a good idea. It mostly is your fallback communication channel and is usually easy to switch (doing replication to an offsite mail storage needs to be done to make switching easy/possible/fast). MX records \o/

Good points, thanks!

Since hosting git repositories is core to your business, you should take the time to do it right (https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/fi...) instead of using vanilla git and relying on a magic filesystem and vertical scaling to solve your issues.

The presentation you linked is something we've considered. But it is built on top of Google's distributed filesystem. So we consider our move to Ceph a first step in that journey.

You can swap in any other distributed storage system in jgit or libgit2.

N1/N2: I think you are making a mistake by concentrating on the advantage of having a single node type. Databases really aren't like other systems.

SuperMicro has a 4 node system available in which each node gets 6 disks, 2028-TP-xx-yyy. Get two of these, populate 2 nodes on each as your databases, and you can grow into the other spaces later. Run your databases on all-SSD; store backups on spinners elsewhere.

Having two node types is not a calamity compared to having just one.

Thanks, I agree that multiple nodes could be acceptable if needed.

So the proposal now is to have one chassis (excluding the backup system) and two types of nodes (normal and database).

Regarding the 2028 please see the conversation in https://news.ycombinator.com/item?id=13153853

I'm surprised to see that GitLab is using Unicorn. Isn't Unicorn grossly inefficient, because each of the worker processes can only handle one request at a time? Are web application processes actually CPU-bound these days?

I don't know much about the Ruby web server landscape, but might Puma (http://puma.io/) be better?

Puma is something we want to eventually run, but right now it's not really a priority. Instead we're focusing on response timings and memory usage. See the following links for more information:



For "grossly inefficient" I imagine it depends. If you're loading most of the app prefork then the memory overhead for more processes is pretty low. (Disclaimer: Dunno much about ruby deployments)

But the number of Unicorn processes that GitLab runs is "virtual cores + 1". It seems to me that this only makes sense if web application request handling is actually CPU bound. Maybe it is if you have a really good caching implementation and hardly ever have to query the DB.
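
A back-of-the-envelope model of that point, assuming each worker handles exactly one request at a time. The 32 virtual cores and the per-request timings are hypothetical, not GitLab measurements.

```python
def max_requests_per_second(workers, response_time_s):
    """Upper bound on throughput when every worker serves one request at a time."""
    return workers / response_time_s

def cpu_utilization(workers, cores, cpu_time_s, response_time_s):
    """Fraction of total CPU used when all workers are kept busy."""
    rps = max_requests_per_second(workers, response_time_s)
    return min(1.0, rps * cpu_time_s / cores)

cores = 32            # hypothetical virtual core count
workers = cores + 1   # the "virtual cores + 1" rule

# A 200 ms request that spends 50 ms on CPU and the rest waiting on the DB or disk.
io_bound = cpu_utilization(workers, cores, cpu_time_s=0.05, response_time_s=0.2)
# A 200 ms request that spends 190 ms on CPU.
cpu_bound = cpu_utilization(workers, cores, cpu_time_s=0.19, response_time_s=0.2)

print(f"mostly waiting: {io_bound:.0%} CPU used")
print(f"mostly computing: {cpu_bound:.0%} CPU used")
```

Under these assumptions, "cores + 1" workers only saturate the CPUs when requests spend nearly all of their time computing; otherwise most of the CPU sits idle while workers wait.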

I was wondering the same. For my projects Puma has been awesome and better than Unicorn.

Puma is great. We tried it years ago and multithreading caused problems. This is likely due to problems in our application and its dependencies, not Puma. We reverted it in https://github.com/gitlabhq/gitlabhq/commit/3bc4845874112242...

A few things that immediately jump out at me.

The 2U fat twins are very dense. You'll need to make sure that the hosting company can actually cool them effectively.

I'd look again at your storage strategy. You'll save a lot of money, and it'll be much easier to debug if you dump ceph.

You have a clustered FS when you appear to not really need it. By the looks of it, it's all going to be hosted in the same place. So what you have is lots of CPU to manage not very much disk space. The overhead of Ceph for such a tiny amount of data all in the same place seems pointless. You can easily fit your data set on one 90 disk shelf 4 times over with 100% redundancy.

First things first, File servers serve files and nothing else. This means that most of your ram is going to be a file server cache. Putting applications on there is just going to mean that you're hitting the disk more often, and stealing resources from other apps in a non transparent way.

Get four 90-drive superchassis and connect them to four servers (via SAS, direct connect) with as many CPUs and as much RAM as possible. JBOD them, and either use ZFS or hit up IBM for GPFS (well, IBM Spectrum Scale). You can invest in a decent SAS controller and RAID 6 across 8-10 disk stripes, but rebuild time will be very high. What's your strategy for disk replacement? 400 disks means about one will fail every four weeks.
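
The failure estimate above implies a particular annual failure rate, which is worth making explicit. A small sketch, assuming failures are spread evenly over a 52-week year (real failures cluster by batch and age):

```python
WEEKS_PER_YEAR = 52

def implied_afr(disks, weeks_between_failures):
    """Annual failure rate implied by seeing one failure every N weeks."""
    return (WEEKS_PER_YEAR / weeks_between_failures) / disks

def weeks_between_failures(disks, annual_failure_rate):
    """Average gap between failures for a fleet with the given annual failure rate."""
    return WEEKS_PER_YEAR / (disks * annual_failure_rate)

print(f"implied AFR: {implied_afr(400, 4):.2%}")
for afr in (0.01, 0.02, 0.05):
    gap = weeks_between_failures(400, afr)
    print(f"AFR {afr:.0%}: one failure every {gap:.1f} weeks")
```

One failure every four weeks across 400 disks corresponds to an AFR of about 3.25%; plug in the rate you actually expect for your drives to size the spares shelf.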

What is your workload like on the disks? Is it write heavy? Read heavy? Lots of streaming IO? Or random? Can it be cached?

Your network should be simple. Don't use Cat6 10gig: it's expensive and not very good in dense racks. Just use copper cables with built-in SFPs; they are cheap and reliable.

Don't use Super micro switches. They are almost certainly re-badged.

Your network should be fairly simple. A storage VLAN, an application VLAN, and a DMZ. (Out of band will need a strongly protected VLAN as well.)

On the buying side, you need a reseller who is there to get you the best deal. You'll need to audition them. They will also help with design. But you need to be harsh with them: they are not your friends, they are out to make money.

If the plan is still to build a huge Ceph-backed filesystem to store your git repos on, you are doomed.

Redhat, if you're out there: Now would be a good time to chime in about the limits of Ceph and the reasonable size of a filesystem.

We'll have a Ceph expert review our configuration.

How about an expert in enterprise and/or cloud storage in general?

I know Stack Overflow and all their related services run on bare metal and have done so for a long time. They have written a very detailed series of blog posts about their whole infrastructure and hardware which I really recommend that you read if you haven't already.

Here's Part 1 of the series: http://nickcraver.com/blog/2016/02/17/stack-overflow-the-arc... (the other parts are linked from there)

> Each of the two physical network connections will connect to a different top of rack router. We want to get a Software Defined Networking (SDN) compatible router so we have flexibility there. We're considering the 10/40GbE SDN SuperSwitch (SSE-X3648S/SSE-X3648SR) that can switch 1440 Gbps.

Noooooo. Do not go for Supermicro for anything where the operating system matters. Their software quality leaves a lot to be desired.

You might want something Cumulus Linux supports, since it seems like you have a lot more Linux experience than networking...

Thanks, Cumulus Linux seems to be the way to go.

All of your data and IOPs could be handled by a single Pure Storage FlashBlade. This would significantly simplify almost everything about your system. (I'm an investor, but generally wouldn't point this out unless someone is seriously considering on premise deployment.)

> We are attempting to build a fault-tolerant and performant CephFS cluster

Ohhh boy. I hope this works, and look forward to hearing updates.

from CephFS website:

"Important: CephFS currently lacks a robust ‘fsck’ check and repair function. Please use caution when storing important data as the disaster recovery tools are still under development. For more information about using CephFS today, see CephFS for early adopters."

I'm getting the same feeling I get when I'm watching those "Hold my beer and watch this..." videos.

From my vantage point they look to have 0 experience in building and running infrastructure... and are asking for advice on HN. They might as well post an Ask Slashdot thread if they want armchair advice. Genuinely, I think they've crunched some numbers and think they can run their stuff cheaper and faster in-house... but probably underestimated the human-experience angle.

For just 10-20 physical servers, this is going to be either extremely expensive (if they hire right) or extremely painful (if they don't).

We certainly don't have any experience hosting our own hardware.

We're not doing this to save money, we're doing it to increase performance https://about.gitlab.com/2016/11/10/why-choose-bare-metal/

Then prototype first. Rent a server or two somewhere for 3-6 months and run a shadow first. Once you're confident that you understand all the "other 80%" stuff that is involved running your own infrastructure and don't lose data, then think about doing it yourself.

A service provider's biggest responsibilities to its customers are security, durability, availability and performance -- in that order. You guys are vastly underestimating the complexity involved in getting the first 3 right.

It seems they have Ceph experience, considering they've made it work on AWS.

For such a big expenditure, some prototyping first? Maybe buy a mainboard or two and some CPUs/HDDs/SSDs and benchmark them on your specific workloads. Also look into using something like bcache if going all-SSDs is too expensive.

We're in a rush because we need GitLab.com to get fast now. We might shoot ourselves in the foot by not prototyping first but we're taking the risk. We're hiring more consultants to help us with the move.

Bcache is mentioned in the article under the disk header.

It seems based on this post that all the hardware is going to end up in one datacenter. That seems like a big risk if that datacenter ends up having issues.

Perhaps look into how to have the nodes split between 2 or 3 different datacenters?

We've considered having part of the servers in a different location. But it is a big risk to see if Ceph can handle the latency. We're also considering using GitLab Geo to have a secondary installation. It seems that data centers can have intermittent network or power issues but they are less likely to go down for days (for example Backblaze is in one DC). At some point we'll likely have multiple datacenters but our first focus is to make GitLab.com fast as soon as possible.

You will have to move off Ceph and build a middleware to route the git operations to the right shard. Take a look at libgit2[0] for that.

It is pretty easy to build such middleware, both for git over ssh (a simple script performs a lookup of where the shard is, then connects to that shard to operate there) and, with just a little more work, for the http part. At the webapp level, you will have a kind of RPC for the git-related operations, which connects to the right shard to run them.
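
A minimal sketch of what the ssh side of such a lookup script could look like. Everything specific here is an assumption for illustration: the shard hostnames, the pinned-repository table, and the hash fallback are hypothetical, and a real deployment would keep the mapping in a database and validate repository paths much more strictly. The input is the command string that OpenSSH exposes as SSH_ORIGINAL_COMMAND when a forced command is configured.

```python
import hashlib
import shlex

# Hypothetical shard hosts.
SHARDS = ["git-shard-01.internal", "git-shard-02.internal", "git-shard-03.internal"]
# Hypothetical explicit placements, e.g. for repositories that were moved by hand.
PINNED = {"gitlab-org/gitlab-ce.git": "git-shard-01.internal"}
# The git services a client may request over ssh.
ALLOWED = {"git-upload-pack", "git-receive-pack", "git-upload-archive"}

def shard_for(repo_path):
    """Look up the shard for a repository: pinned table first, stable hash otherwise."""
    if repo_path in PINNED:
        return PINNED[repo_path]
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

def route(original_command):
    """Turn a command like "git-upload-pack 'group/repo.git'" into an ssh argv for the right shard."""
    parts = shlex.split(original_command)
    if len(parts) != 2 or parts[0] not in ALLOWED:
        raise ValueError("unsupported git command")
    command, repo = parts
    repo = repo.lstrip("/")
    # The remote side runs the arguments through a shell, so quote the path.
    return ["ssh", shard_for(repo), command, shlex.quote(repo)]

print(route("git-upload-pack 'gitlab-org/gitlab-ce.git'"))
```

The forced-command script would then exec the returned argv (for example with `os.execvp`). Hashing is only a fallback here; a lookup table makes it possible to move a hot repository to a different shard without rehashing everything.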

When you use Ceph you are basically running a huge FS at the full scale of your GitLab installation, but practically, you have many independent datasets within your GitLab installation and you do not need to pay the cost for the global consistency of Ceph. You have many islands of data.

[0]: https://github.com/libgit2/

Edit: Typos/missing words.

Do you have a disaster recovery plan that starts with "A meteor has destroyed our primary data center."?

I do; that's my default scenario. If you can survive that, you can survive all sorts of smaller issues like network congestion, data center power problems, grid power problems, and zombie plagues (or flu, which is more likely.)

Depends on what you mean by 'survive'. I'd call a backup in google nearline sufficient for the meteor scenario, but that's going to be very slow and unpleasant to depend on for milder problems.

It's not sufficient. How quickly could you procure new hardware, install that in a datacenter, make it fully functional, and restore your backups? The answer is likely weeks/months. Could your business survive being offline that long? It sounds unlikely.

In an emergency you don't need new hardware. You can get cloud servers in minutes. If people have been practicing restores then it should not take particularly long to get the containers working again. A couple days to get things working-ish. That should be survivable while everyone focuses on the news coverage of the meteor.

But that's a true emergency situation. Don't go offline for multiple days for something that's reasonably likely to happen.

The cost of bandwidth between sites would kill this idea.

As someone already mentioned "all the other services" are missing (dns, ntp, monitoring, etc.), but also:

- Shouldn't there be a puppet / chef / whatever deployment coordinator in there somewhere?

- There's no mention of a virtualisation environment. While it's not really a hardware issue, all the extra services mentioned before will not take a whole server each, and you'll want to colocate some of them (maybe even some of the main services too, if the resource usage on real hardware turns out wildly different from the estimates). If the choice is KVM, great. But if it's VMware, you want to include that in the cost (and the network model).

- The "staging x 1" is interesting... so what happens when you need to test a new version of postgres before deployment? You can't pretend that a test on one server (or 4 virtual ones) will be comparable to actual deployment, especially if you need to verify the real data performance.

- "Backing up 480TB of data [...] with a price of $0.01 per GB per month means that for $4800 we don't have to worry about much." - This makes the really bad assumption that X bytes of data produce X bytes of backup. That's a mirror, not a backup - it won't protect you from human errors if you overwrite your only mirror. For a backup you need to have an actual system for restoring the data and a plan for how many past versions you want to store. On the other hand, GitLab data seems to be mostly text files - that should compress amazingly well, so that should also help with the restore speed.

- Network doesn't seem to mention any out of band access (iLO, or similar). That's another port per node required.

- Because tech is tech, I expect both the staging and spare servers to be repurposed soon to be used for "that one role we forgot about". One each is really not enough. (seen it happen so many times...)
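To put rough numbers on the backup bullet above, here is a back-of-the-envelope sketch. The 480TB and $0.01 per GB per month figures come from the quoted article; the number of retained versions and the compression ratio are made-up placeholders, not measurements, and the model assumes full copies rather than incrementals.

```python
def monthly_backup_cost(data_tb, price_per_gb_month, versions=1, compression_ratio=1.0):
    """Monthly storage cost in dollars for `versions` full copies of the data.

    compression_ratio is original size / stored size (1.0 means no compression).
    Assumes every version is a full copy; incremental backups would cost less.
    """
    stored_gb = data_tb * 1000 * versions / compression_ratio
    return stored_gb * price_per_gb_month


# The article's assumption: a single uncompressed copy.
print(monthly_backup_cost(480, 0.01))

# Hypothetical retention plan: 4 full versions, no extra compression.
print(monthly_backup_cost(480, 0.01, versions=4))
```

The point is only that the bill scales with the number of versions you keep, so the retention plan has to be decided before the price is known.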

- We would continue to use our existing cloud hosted Chef server for now.

- We want to use Kubernetes instead of virtualization.

- We provisioned 2 spare database servers for that reason.

- Git already compresses files with zlib so I'm not sure we can compress it much further https://git-scm.com/book/uz/v2/Git-Internals-Packfiles

- The servers have a separate management port and "Apart from those routers we'll have a separate router for a 1Gbps management network."

- I agree we'll probably need more spare servers.

> The servers have a separate management port and "Apart from those routers we'll have a separate router for a 1Gbps management network."

It was followed with "For example to make STONITH reliable when there is a lot of traffic on the normal network". That sounded like you want the heartbeats on it, not out of band management.

I probably made a mistake here. I assumed that since the management network was never congested we could run the heartbeats there. But it might be that the server can't address the management port from the operating system.

I hope to god you are not using Kubernetes for your databases.

Why not? Google Cloud SQL runs on Kubernetes.

Docker and Kubernetes are still rapidly evolving technologies, and there are too many horror stories of them crashing and burning in production for it to be worth it, in my view. Databases are not immutable, and even though Docker has persistent volumes it is still new and evolving. If the database breaks in some way, I don't want to have to worry about the new container having a slightly different patch level and that messing up the database. I don't want to have to worry about rebuilding my database cluster to upgrade it.

Google might do some really cool shit with Kubernetes, but they are Google. I don't think using them as a good example for anything but infrastructure on a massive scale is correct. They are years ahead of everyone else, and if they are doing something in production, they have been testing it for years. If shit hits the fan, they have thousands of employees to throw at the problem. GitLab is building a team, and therefore does not have the experience to know these systems inside and out. In my view, using Docker/Kubernetes is adding unnecessary complexity to the database fabric for minimal tradeoffs.

And considering that they are speccing out dedicated database servers, it makes no sense to add an unneeded layer of abstraction when in all likelihood the container will be bound to a node.

Kudos to GitLab for being so open about this. As an outsider I find it very interesting to observe and learn from this transition.

It's refreshing to see that a few quite important players, e.g. Dropbox, are moving away from having everything in public clouds, in contrast to others such as Netflix that go all in. Looks like on-premise is not dead yet.

Good luck!

A bit confused why you are trying to roll your own storage solution. Ceph is great for a lot of applications, but I am not sure it really fits the bill for what you describe, especially when you indicate you are going to use spinning disk behind it. Have you looked at any of the storage arrays on the market? Your TCO is likely to be much lower and your performance/resiliency much higher if you buy something with hundreds of millions of dollars of R&D behind it rather than all the hours and costs of rolling your own. Not saying it's impossible to make it work, but it just sounds like something that will be a PITA going forward.

I think a storage appliance (NetApp, etc.) makes a lot of sense in the short term. The TCO is lower since we'll spend a lot of time making Ceph work.

In the longer term the storage appliance will lock us in and will get very expensive. I've heard pretty bad stories of companies betting on it. Especially with many small files like us (IOPS heavy).

And one goal of GitLab.com is to gain experience that we can reuse at our customers. Most of our customers use a storage appliance now but are interested in switching to something open source.

Disregarding the obvious signs that there is severe lack of experience from gitlab staff on the physical DC build, they completely underestimate how difficult it is to run ceph at scale with a reliable SLA.

My initial gut feeling is that you are moving out of the cloud for the wrong reasons. Any performance gain you get with bare metal will be erased with the complexity of running a hybrid environment, namely moving data back and forth between your datacenter and the cloud, and also the mental overhead of programming to that model.

You also now need to gain internal expertise in networking, security, datacenter operations, and people who can rack and stack well.

Here's a crazy but interesting suggestion:

> Backup

> ...Even with RAID overhead it should be possible to have 480TB of usable storage (66%).

Quoting https://code.facebook.com/posts/1433093613662262/-under-the-...:

> Fortunately, with a technique called erasure coding, we can. Reed-Solomon error correction codes are a popular and highly effective method of breaking up data into small pieces and being able to easily detect and correct errors. As an example, if we take a 1 GB file and break it up into 10 chunks of 100 MB each, through Reed-Solomon coding, we can generate an additional set of blocks, say four, that function similar to parity bits. As a result, you can reconstruct the original file using any 10 of those final 14 blocks. So, as long as you store those 14 chunks on different failure domains, you have a statistically high chance of recovering your original data if one of those domains fails.

Facebook didn't release the system they used to do this. I can see two reasons why not to: desire for competitive edge; or the implementation not being a general-purpose solution.

Considering Facebook's general openness, I say get in touch, just in case! It's quite possible that you might be able to figure out something interesting.

I suspect the reason the system wasn't released was due to the latter case - it seems to be technically quite simple and easily achievable for a[ny] company full of algorithms Ph.Ds.

Any halfway experienced storage engineer is fluent in ECC. You don't need secret sauce from Facebook. That said, because of the first statement, a lot of today's storage solutions will use ECC on the backend if they present you with a logical FS. So you may not (should not?) need to reinvent this wheel. At Facebook's scale this absolutely makes sense, but they're not particularly breaking new ground here. Look at RAID 6 if you need more evidence.
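For anyone following along, here is a small sketch of the arithmetic behind the 10-of-14 example quoted above, plus a toy erasure code. The toy is plain single-parity XOR (the RAID 5 idea), not Reed-Solomon: it can rebuild only one lost chunk, whereas the scheme in the quote survives four. The chunk contents are made up.

```python
from functools import reduce


def storage_overhead(data_chunks, parity_chunks):
    """Raw bytes stored per byte of user data."""
    return (data_chunks + parity_chunks) / data_chunks


def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# 10 data chunks + 4 parity chunks, as in the quoted example.
print(storage_overhead(10, 4))  # 1.4

# Toy single-parity code: 3 data chunks + 1 XOR parity chunk.
data = [b"git ", b"pack", b"file"]
parity = xor_bytes(data)

# Pretend chunk 1 is lost; rebuild it from the survivors plus the parity chunk.
rebuilt = xor_bytes([data[0], data[2], parity])
assert rebuilt == data[1]
```

For comparison, 1.4x raw storage is noticeably cheaper than keeping three full replicas (3.0x), which is the usual reason to look at erasure coding for bulk data.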

I'm still figuring all of this out, thanks for the headsup. I realize Facebook don't have anything particularly interesting, but now I understand just how standard this is, huh.

Thanks again.

I recently built some data center processing using SuperMicro Twin-Nodes^2 [0] and MicroBlades [1].

We're setup for 38kw [a] possible gross wattage using the dual-node blades using the Xeon D-1541 [2]. The amazing thing about the D-1541 blades is we get around 100W per server, with 8 hyperthreaded cores, and a 3.84T SSD. With the 6U chassis, you have 28 blades with 2 nodes each - 56 medium sized servers for 5.6kw in 6U - under a kw per RU.
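The density arithmetic in that paragraph can be checked in a few lines. All inputs are the figures quoted in the comment; the 100W per node is the commenter's approximate measurement, not a spec-sheet value.

```python
blades_per_chassis = 28
nodes_per_blade = 2
watts_per_node = 100      # "around 100W per server", from the comment
chassis_height_u = 6

nodes = blades_per_chassis * nodes_per_blade
chassis_kw = nodes * watts_per_node / 1000
kw_per_u = chassis_kw / chassis_height_u

print(f"{nodes} nodes, {chassis_kw} kW per chassis, {kw_per_u:.2f} kW per rack unit")
```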

For your 70 some server workloads, I'd recommend using the microblades.

For your higher level workloads, I think the SuperMicro TwinNode makes sense.

Be very very careful about Ceph hype. Ceph is good at redundancy and throughput, but not at iops, and rados iops are poor. We couldn't get over 60k randrw iops across a 120 OSD cluster with 120 SSDs.

For your NFS server, I'd recommend a large FreeNAS system, put a big brain in it and throw spinning platters.

Datacenters can/will do your 30kw

[0] - https://www.supermicro.com/products/nfo/2UTwin2.cfm

[1] - https://www.supermicro.com/products/MicroBlade/

[2] - https://www.supermicro.com/products/MicroBlade/module/MBI-62...

[a] - Although we have 38kw of possible power there, it's practically well under the 27kw we'll get with 4x208v@60A PDUs at 80%.

If not Ceph, what are you using for storage?

2RU SuperMicro server with 24xSSD with FreeNAS raid10 or raidz2 exporting volumes via iSCSI.

Separately I'm using a FreeNAS controller with 4 SAS HBAs supporting 3 JBODs with 45 8TB HGST He8 near line SATA disks each (135 disks or ~1PB) for backups and slow data.

Coming from a complex hosting background - the move to metal is always a good one, especially when growing. Still, I recommend looking into using a hypervisor management system like ProxMox where possible.

There is a great deal of pain in moving from one piece of metal to the next, and there is nothing wrong with underpinning your metal with a tech where you can at any time move your architecture to be any combination of a private, public or hybrid cloud, storage aside.

This looks to be a really interesting project, I hope you can continue to blog about it in detail.

This undertaking seems like a huge investment in time & overhead for only 64 servers. My guess is that with this move, performance will go up and availability will go down.

I can understand the need for performance but if it were my business I would have taken a significantly different approach.

It will likely not be 64 servers for long.

Why go with 3.5" disks? You usually want 2.5" and you can fit 24 of them in 2U. I recommend using a Supermicro chassis for those. 2 cache SSDs and 20 SAS disks (+ 2 spares) in combination with a RAID controller will give you quite fast storage at a sweet price point.

What disk would you recommend for that setup?

I didn't see it mentioned, but what are your plans for the network strategy? Are you planning to run dual-stack IPv4/IPv6? IPv4 only? Internal IPv6 only with NAT64 to the public stuff?

Hopefully IPv6 shows up somewhere in the stack. It's sad to see big players not using it yet.

Probably IPv4 on a /24 block but we'll open up a vacancy for a network engineer.

Love how they did some number crunching and decided that rent vs own, own won. I think that if more places looked they would find that out also. There must be a margin in it since the big players are making money at it.

I'm interested to see what they end up with in the end.

Thanks! The decision to move to metal was because of performance problems https://about.gitlab.com/2016/11/10/why-choose-bare-metal/

It is nice that we'll save on costs but we anticipate a lot of extra complexity that will slow us down. So if it wasn't needed we would have stayed in the cloud. But it is interesting that both our competitors (GitHub.com and BitBucket.org) also moved to metal.

Have you considered hosting with Packet.net? You'd be on bare metal, thus solving your performance problems, but you'd still be renting by the hour as you are now, and you wouldn't have to deal with buying your own hardware and all the complexity that comes with that.

I looked at their site and they talk about bring your own block, anycast, and IPv6. But I can't find any information about networking speeds. What if we end up needing 40 Gbps between the CephFS servers?

They provide dual 10Gb as standard. But talk to them about options.

Would love to chat with you to see if a switch to GCP might solve your performance and pricing issues. It's also always interesting to see how GCP vs AWS vs bare metal fares.

email me at bookman@google.com and I'll be sure to get you in contact with the right people at google.

I've been following the technical discussions around this move, and I'm wondering if you guys looked at making architectural changes to shard your data into more manageable chunks?

Naively it seems like you should be able to reduce your peak filesystem iops by sharding the data at the application layer. That does introduce application complexity, but it might shake out as being less work than the operational complexity of running your own metal.

Of course, easier said than done -- I just didn't spot any discussion of this option, and it seemed like the design choice of having one filesystem served by Ceph was taken for granted.
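As a rough illustration of what sharding at the application layer can mean, here is a minimal sketch that maps a repository path to one of several storage servers with a stable hash. This is a generic technique, not GitLab's actual implementation; the server names and repository paths are hypothetical.

```python
import hashlib


def shard_for(repo_path, shards):
    """Pick a shard for a repository using a stable hash of its path."""
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


shards = ["nfs-01", "nfs-02", "nfs-03", "nfs-04"]  # hypothetical server names
for repo in ["gitlab-org/gitlab-ce.git", "example/demo.git"]:
    print(repo, "->", shard_for(repo, shards))
```

One caveat with plain modulo hashing: changing the number of shards remaps most repositories, so a real system would more likely store an explicit repo-to-shard mapping or use consistent hashing.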

We have sharding on the application layer in GitLab right now https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7273 and we're using it heavily to split the load among NFS servers.

Then we have to think about redundancy. The simple solution is to have a secondary NFS server and use DRBD. For the shortcomings of that read http://githubengineering.com/introducing-dgit/

The next step is introducing more granular redundancy, failover, and rebalancing. For this you have to be good at distributed computing. This is not something we are good at now, so we'd rather outsource it to the experts that make CephFS.

The problem with CephFS is that each file needs to be tracked. If we did it ourselves we could do it on the repository level. But we'd rather reuse a project that many people have already made better than go through the pain of making all the mistakes ourselves. It could be that using CephFS will not solve our latency problems and we have to do application sharding anyway.

That's a fair comment RE: outsourcing, but at my company I'd bias towards bringing some distributed computing knowledge in-house rather than bringing ops expertise plus maintenance burden in-house; sounds like you're going to have to add new expertise to your team either way.

Worth investigating if you can bolt on a distributed datastore like etcd or ZooKeeper to store the cluster membership and data locations; this might not be as complex as it sounds at first. etcd gives you some very powerful primitives to work with.

(For example, etcd has the concept of expiring keys, so you can keep an up-to-date list of live nodes in your network. And you can use those same primitives to keep a strongly consistent prioritized list of repos and their backed up locations. The reconciliation component might just have to listen for node keys expiring and create and register new data copies in response.)
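A minimal sketch of that pattern, assuming nothing about etcd's client API: the class below is an in-memory stand-in for a store with expiring keys (in etcd v3 the equivalent mechanism is leases with a TTL), and `reconcile` is a hypothetical helper showing the bookkeeping a reconciliation component would do. Node and repo names are made up, and actually copying data is left out.

```python
class ExpiringKeys:
    """In-memory stand-in for a key-value store with expiring keys (not etcd's API)."""

    def __init__(self):
        self._expiry = {}  # key -> absolute expiry time

    def put(self, key, now, ttl):
        """Set or refresh a key so that it expires at now + ttl."""
        self._expiry[key] = now + ttl

    def live(self, now):
        """Return the set of keys that have not expired as of `now`."""
        return {k for k, t in self._expiry.items() if t > now}


def reconcile(placements, live_nodes, copies):
    """Drop dead nodes from each repo's location list and top up to `copies`."""
    result = {}
    for repo, nodes in placements.items():
        kept = [n for n in nodes if n in live_nodes]
        spare = sorted(live_nodes - set(kept))
        result[repo] = kept + spare[: max(0, copies - len(kept))]
    return result


store = ExpiringKeys()
for node in ("node-a", "node-b", "node-c"):
    store.put(node, now=0, ttl=10)
store.put("node-a", now=8, ttl=10)  # node-a and node-c keep heartbeating,
store.put("node-c", now=8, ttl=10)  # node-b goes silent and its key expires

placements = {"repo-1": ["node-a", "node-b"]}
print(reconcile(placements, store.live(now=12), copies=2))
```

The hard parts a real system adds on top are exactly the ones the thread is debating: copying the data safely, and keeping the placement list strongly consistent while nodes come and go.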

I agree we need to add new expertise to our team anyway. But I think adding bare metal expertise is easier than distributed system expertise.

Etcd is indeed very interesting. I'm thinking about using it for active active replication in https://gitlab.com/gitlab-org/gitlab-ee/issues/1381

Think about it this way: your EE customers probably have the easier bare metal knowledge, but would be willing to pay for you to solve the distributed system problems for them :)

> Love how they did some number crunching and decided that rent vs own, own won. I think that if more places looked they would find that out also

I wouldn't be so quick to jump to that conclusion. It's not just the cost of owning and renewing the hardware, it's everything else that comes with it. Designing your network, performance tuning and debugging everything. Suddenly you have a capacity issue, now what b/c you're not likely to have a spare 100 servers racked and ready to go, or be able to spin them up in 2m? Autoscaling?

Companies spend enormous amounts of engineering hours to maintain their on-premise solutions. And sometimes that's fine b/c you have requirements that you can't easily do in the cloud (think of high frequency trading for example). However, once you tally all that up, plus all the value added services you can buy in the cloud (just take a look at the AWS portfolio for example) the price might well be worth it. That's not to say you won't need engineers to help you with cloud stuff, but you'll probably need less and they'll be able to focus on solving a different class of problems for you.

> There must be a margin in it since the big players are making money at it.

From what I've seen the players aren't making (lots of) money on providing compute power. They're basically racing against each other to the bottom. What they're making money on is all the value added services, the rest of the portfolio AWS/Google Cloud Platform/Azure offers.

So in GitLab's case, they have a load that they can monitor and predict. They are looking at 60+ processors, so they can plan to add 10% (6 procs at a time) and grow. They know their load, so a sudden need to spin up 150% of current capacity isn't something in their plan. I'll give you that there are companies with erratic loads that are hard to predict; for them it makes sense to be somewhere that can grow 100% on an email.

At big companies, most servers have a pretty stable load; it's unlikely that things like internal email, SharePoint, or ERP/MAP systems will spike. It's only things like front-end order processing that take the hit.

There are lots of businesses that make sense and some that don't

I like the concept of "racing to the bottom" but they are still making money. But let's take your comment about the value-added services other than the ability to spin up capacity. What's the cost to GitLab to pull this together and keep it running? There is an overflow every day on HN of articles about operations monitors, containers, network monitoring. The tools are there; it's an effort to glue them together, but then they are there.

So I'll still posit there are cases where the dollars to own are less than the dollars to rent. And I'll agree that your case for renting because of capacity blowouts is key. The issue is: is your ops team savvy enough to figure out what to keep/own and what to rent?

I have a Cisco c220 m4 (http://www.cisco.com/c/dam/en/us/products/collateral/servers...). I was surprised at the complexity of the configuration options. For example, it has 24 memory slots, but if you fill them all, memory speed is 2133MHz, whereas if you only use 16 memory slots, max memory speed is 2400MHz.

I've ordered a pair of Intel 750 Series 1.2TB NVMe SSDs for it, can't wait to try that out. Still waiting for the SSDs to arrive.

Goddamn, this thread is worth a few tens of thousands of dollars in consultation fees.

Yes, I think it is, thanks everyone!

I'd like to see more about how they plan to increase the engineering and on-call staff to keep watch over the thing. The increased headcount and stress is a pretty tough hidden cost in my opinion.

H1: I don't see a discussion of getting an ASN and doing BGP with a number of upstreams. You say you want a carrier-neutral facility, but that won't buy you much unless you have your own AS.

Some ISPs will let you announce on a /24, which shouldn't be too hard for GitLab to set up for themselves.

@sytse: If this (BGP + ASN + /24 IPv4 + /n IPv6) is your plan, I'd encourage you to get started on the process now.

Go ahead and apply for your ASN now and an IPv6 allocation. Then start working on the paperwork for an IPv4 allocation. Because there is no more IPv4 to allocate you'll have to go through the auction process and then the subsequent transfer process.

You'll easily be able to find a provider that can give you a /24 if you buy transit from them, but you don't wanna go through the trouble of renumbering into your own IP space if you can avoid it.

Regarding D2: If you're going to go with bcache, make sure you're using a kernel >= 4.5, since that's when a bunch of stability patches landed (https://lkml.org/lkml/2015/12/5/38). Alternatively, if you're building your own kernel, you should be able to apply those patches yourself.

Is there a reason you don't just put out an RFP and purchase a complete supported solution from a VAR or OEM vendor?

I'm not sure you have the in-house expertise to maintain a production service of this kind. That's not an attack; this has not been your focus in the past, so it might be wise to have a third party provide some assistance and support as an intermediate step toward doing everything in-house.

M1: Each CPU has four memory channels, and each node has 16 DIMM slots (8/CPU, 2 DIMMs/channel), which means that you can use 1 DIMM per channel -> maximum memory bandwidth and speed with RDIMMs. Using 64GB or 128GB LRDIMMs with E5v4 CPUs won't affect your bandwidth or speed as long as you populate all the channels.[0]

Your memory options are:

1TB - 16x64GB / 8x128GB

2TB - 16x128GB

[0] - SuperMicro X10DRT-PT Motherboard manual page 35(2-13).
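The options above follow from simple slot arithmetic. A small sketch, using only the figures from the comment (2 CPUs, 4 channels per CPU, 16 slots) and assuming DIMMs are spread evenly across all channels; `describe` is just an illustrative helper name.

```python
CPUS = 2
CHANNELS_PER_CPU = 4  # from the comment
DIMM_SLOTS = 16       # 8 per CPU, 2 per channel


def describe(dimm_count, dimm_gb):
    """Return (total GB, DIMMs per channel) for an evenly spread population."""
    channels = CPUS * CHANNELS_PER_CPU
    if dimm_count > DIMM_SLOTS or dimm_count % channels:
        raise ValueError("population must fill every channel evenly")
    return dimm_count * dimm_gb, dimm_count // channels


for count, size in [(16, 64), (8, 128), (16, 128)]:
    total, per_channel = describe(count, size)
    print(f"{count} x {size}GB = {total} GB, {per_channel} DIMM(s) per channel")
```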

I may be way off base, but is this a solvable problem by just sharding/placing customer repositories over various storage systems?

I have never run a deployment this large, but I wonder: only 1 staging server for 64 servers? I have usually tried to at least stay within the same order of magnitude when testing architecture changes: sure, it works on my laptop, but how will this change work on 10 servers? Isn't it common to have a full staging setup, with similar dataset sizes?

> We want to dual bound the network connections to increase performance and reliability. This will allow us to take routers out of service during low traffic times, for example to restart them after a software upgrade.

does not really agree with

> Each of the two physical network connections will connect to a different top of rack router.

Sure, you can do it with something like MLAG, but that's really just moving your SPOF to somewhere else (the router software running MLAG). Router software being super buggy, I wouldn't rely on MLAG being up at all times.

> N1 Which router should we purchase?

Pick your favorite. For what you're looking for here, everything is largely using the same silicon (broadcom chipsets).

> N2 How do we interconnect the routers while keeping the network simple and fast?

Don't fall into the trap of extending vlans everywhere. You should definitely be routing (not switching) between different routers. You can read through http://blog.ipspace.net/ for some info on layer 3 only datacenter networks.

You'd want to use something like OSPF or BGP between routers.

> N3 Should we have a separate network for Ceph traffic?

Yes, if you want your Ceph cluster to remain usable during rebuilds. Ceph will peg the internal network during any sort of rebuild event.

> N4 Do we need an SDN compatible router or can we purchase something more affordable?

You probably don't need SDN unless you actually have a SDN use case in mind. I'd bet you can get away with simpler gear.

> N5 What router should we use for the management network?

Doesn't really matter, gigabit routers are pretty robust/cheap/similar. I'd suggest the same vendor as whatever you go for with your public network routers.

Also, consider another standalone network for IPMI. I can tell you that the Supermicro IPMI controllers are significantly more reliable if you use the dedicated IPMI ports and isolate them. You can use shitty 100mbit switches for this; the IPMI controllers don't support anything higher.

> D5 Is it a good idea to have a boot drive or should we use PXE boot every time it starts?

PXE booting at every boot is cool, but can end up sucking up a lot of time. If you have not already designed your systems to do this, and don't have experience with PXE, then don't.

> The default rack height seems to be 45U nowadays (42U used to be the standard).

You may not have accounted for PDUs here. Some racks will support 'zero-U' PDUs, but you'd need to confirm this before moving on.

> H3 How can we minimize installation costs? Should we ask to configure the servers to PXE boot?

Assume remote hands is dumb. Provide stupidly detailed instructions for them. Server hardware will PXE by default, so that's not really a concern. IPMI controllers come up via DHCP too, so once you've got access to those you shouldn't need remote hands anymore.

> D2 Should we use Bcache to improve latency on on the Ceph OSD servers with SSD?

Did you consider just putting your Ceph journals on the SSD? That's a lot more standard config than somehow using bcache with OSD drives.

> Each of the two physical network connections will connect to a different top of rack router.

> Sure, you can do it with something like MLAG, but that's really just moving your SPOF to somewhere else (the router software running MLAG). Router software being super buggy, I wouldn't rely on MLAG being up at all times.

I would strongly consider doing this via pure L3 routing. This is a scale at which the benefits of L2 fabric switching vs L3 multihomed routing (yes, routing decisions on every node) begin to be interesting decisions.

Thanks for the suggestions.

We're already planning a separate router for the management network ("Apart from those routers we'll have a separate router for a 1Gbps management network.").

All Ceph journals will be on SSD too. I've added a question about combining this with bcache in https://gitlab.com/gitlab-com/www-gitlab-com/commit/a9cc9aad...

If you're doing this right, the management network will not actually be accessible via the normal operating system. Most IPMI controllers support sharing a normal nic and management (meaning both the IPMI controller and host OS can access it), but I wouldn't recommend doing this.

> > N1 Which router should we purchase?

> Pick your favorite. For what you're looking for here, everything is largely using the same silicon (broadcom chipsets).

For switches, yes. Many of the switches share the same merchant silicon (Broadcom Trident-II, Tomahawk, et al.); however, there are switches like the Juniper EX9200 which aren't based on merchant silicon. Routers (N1) are also not typically based on merchant silicon (Juniper Trio-3D for example).

C1: The SuperMicro offers blade-server densities without the ridiculous prices that proprietary blade-server chassis come with. It's a pretty good system. Please make sure you have assessed your power requirements correctly and communicated that to your datacenter.

If you were larger, Open Compute Platform might be a way to go. Maybe next generation.

We moved 10 racks of servers from New York hosting to a custom-built small hosting facility in Tallinn, Estonia, and we were able to save about €13K a month on hosting. We use free side cooling to save on electricity. We have been here for 3 years now, no major problems so far...

D5: You want a local boot drive, and you want it to fall back to PXE booting if the local drive is unavailable. Your PXE image should default to the last known working image, and have a boot-time menu with options for a rescue image and an installer for your distribution of choice.

It's better to have it the other way around. Attempt to boot off the network and fall back to local drive. That will allow you to reimage a node without having to fiddle with the BIOS. For regular boot there should be no image configured. Network boot will fail and the node will boot off disk. However if an install image is configured for that node, you can reimage it at will. The install image should reset to having no image for that node, once it's done.
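The network-first policy described above boils down to a small decision on the boot server side. A minimal sketch, with hypothetical node and image names; for brevity it clears the install flag when the image is handed out, whereas the comment suggests the installer should clear it once it has finished.

```python
def boot_node(node, install_queue):
    """Decide how a node boots under a network-first, disk-fallback policy.

    install_queue maps node name -> image name for nodes flagged for reimaging.
    """
    # One-shot flag: removed here at hand-out so the next boot goes to disk.
    image = install_queue.pop(node, None)
    if image is not None:
        return f"install {image}"
    # No image is offered, so the network boot fails and the node uses its disk.
    return "boot from local disk"


queue = {"web-07": "debian-installer"}  # hypothetical node and image names
print(boot_node("web-07", queue))  # install debian-installer
print(boot_node("web-07", queue))  # boot from local disk
```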

You'll need to maintain an association between - dns, ip, mac address, ssh keys etc.

Hardware break-fix workflow is usually ignored by most production engineers. You'll be doing that a lot. You want to get your hw back into use as fast as possible.

Have you thought about how many spares (CPU, RAM, disks) you'll have to keep at your datacenters?

Thanks, added to the article with https://gitlab.com/gitlab-com/www-gitlab-com/commit/ef3c7e1c... (will take some minutes to roll out).

And memtest86.

What is the amount of money you are expecting to spend on staff vs the performance return you get?

Running your own boxes can be done, but usually at great cost and usually by blowing up your SLA. Given your inexperience with this, some other options might be politically cheaper.

D2: Ceph has built in support for cache tiering.


I would say you should start with hiring 2 full-time senior ops and 2 senior network admins.

Then give them freedom to do all hardware picks AND hiring some more intermediate/junior ops/admin staff.

You can do with 3-4 ops and maybe 3 network admins in total.

Hard for me to accept that the price differences server vs desktop components are legit and not some kind of scam.

Also, assuming you are getting jacked by AWS like most people, have you looked into Linode, Digital Ocean or anyone else?

I've done MMO setups that are simpler. You do not need this large cluster. Spread it out with nodes plus backup nodes near customers. Separate customers. Scaling will be as easy as adding more servers one case at a time.

Have you looked into using a CDN? A good CDN configuration can pay for itself in cost savings, regardless of whether you go self hosted or remain in the cloud.

Have you considered Rackspace's OnMetal products or other IaaS providers that run bare metal, such as Joyent Public Cloud? If you are in such a rush, I'd suggest factoring in both the migration risk and the time to deliver said hardware. For example, the Joyent Public Cloud will allow you to have a mixed private/public cloud, as their IaaS software, Triton, is open source and you can run it on your own hardware. Note: I'm not familiar with anybody running Ceph on illumos LX containers, though.

In my experience bcache ssd + spinning rust performed much worse than ZFS with its caching layer. ZFS did a significantly better job.

I'm sorry to hear that but it makes sense. Unfortunately I don't think it is an option to run ZFS below Ceph.

For server hardware, the Supermicro 2U Twins are a reasonable choice, but I prefer their 4U FatTwin chassis. The engineering quality is a little better IMO, and the cost increase isn't too big. Absolutely do not buy their 1U Twin systems, they are hot garbage.

The FatTwin chassis has similar density, and can support either 1U half width or 2U half width systems in a particular chassis. Typically I use 1U's for app / web servers and 2U's for lower end database / storage. Separate 2U servers for higher end database and 4U servers for bulk storage.

HPE has the Apollo 2600 / XL170r 2U chassis, which I think is somewhat inferior but still a reasonable choice. Dell sells the same thing as the C6300. I really prefer a 4U chassis though from a cooling and power supply perspective, but Dell or HPE may have a better international story for you.

You absolutely should not buy the 2630v4 CPU. I say that because the lower-end Intel CPUs do not support maximum throughput for memory and QPI. The 2630v4 is 8.0GT/s QPI and 2133MHz DDR4. A better low-watt part is the 2650Lv4 (9.6GT/s QPI and 2400MHz DDR4). I have a guide that I created (and use myself) to determine comparative $/performance of Intel CPUs based on SPEC numbers [1]. If you can go up to 105W the 2660v4 is probably your best bet. Presuming that you're targeting 12-15kW per rack, a 105W part should allow you to deploy between 60 and 80 hosts per rack.

Also, don't use a W-series CPU that draws 160W. That's crazy power draw per-socket. If you want a super high-end CPU for your database server, I suggest a 2698v4 -- but normally I would go with 2680v4 or 2683v4 depending on the part cost.

In terms of hard drives, absolutely you should specify HGST over Seagate. At some point you may want to dual source this, but if you're only going with one vendor right now HGST is the best option. He8 or He10 8TB are your best bet in terms of cost and availability right now, although start thinking about He10 10T drives. The newly announced He12 drives shouldn't be on your radar until Q2 2017. Stock spares, maybe 2-3% of your total drives deployed, but at least 5-6 drives per site. You don't want to get caught out if there is a supply shortage when you need it most. Your business depends on ready access to considerable quantities of storage.
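The spares rule of thumb above (a percentage of deployed drives, with a per-site floor) can be sketched as follows. The defaults of 3% and 6 drives are taken from the upper ends of the ranges in the comment; the function name is hypothetical.

```python
import math


def spare_drives(deployed: int, percent: int = 3, site_minimum: int = 6) -> int:
    """Spares to stock at one site: `percent` of deployed drives, never below the floor."""
    by_percent = math.ceil(deployed * percent / 100)
    return max(by_percent, site_minimum)


for deployed in (100, 250, 1000):
    print(deployed, "drives deployed ->", spare_drives(deployed), "spares")
```

For small sites the floor dominates; the percentage only starts to matter above roughly 200 drives.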

The P-series Intel SSDs are probably not going to be cost effective for your use case. But they are considerably better in terms of IOPS and remove the need for an HBA or RAID controller. Consider a Supermicro 2U chassis with 2.5" NVMe support, which will allow you to go considerably denser than the PCIe form factor. However, I think it's too early to go with NVMe unless you truly need beyond-SSD performance.

Don't PXE boot every time you boot. This creates a single point of failure (even if your architecture is redundant), and you will regret this at some point. However, DO PXE boot to install the OS.

Don't use 128GB DIMMs. They are not cost effective today.

There is only one solution for database scaling: shard. You'll either shard it today, or you'll shard it tomorrow when the problem is much harder. Scale up each host to what is easily achievable with today's hardware, and if push comes to shove retrofit to get over a hump that arises in the future, but know that you MUST shard in order to keep up with demand. Scaling up simply does not work.

There's a lot more to say, but without doing my job for free in a HN comment, the best advice I can offer is:

1. Simplify what you hope to accomplish in the first round. This is a lot to achieve at once. I think you'll have a hard time achieving the fanciness you want from a software perspective while also forklifting the entire stack over to physical hardware. It's perfectly fine to have something be good enough for now.

2. Find people who have done this before and get their advice. Find a VAR you can trust.

3. Plan, plan, plan, plan. Don't commit until you have a plan, make sure the plan is flexible enough to change course without tossing everything out, and plan to do a good enough job to survive long enough to figure out a better plan next time.

4. Get eval gear, qualify and test things, and make sure that what you think will work does work.

[1] https://docs.google.com/spreadsheets/d/1bbbeMCmqt5pZCb_x2QMW...

And if you want to upgrade later, I think you should estimate about ~3 years before 128GB DDR4 LR-DIMMs are cost effective, right?

It's hard to say with absolute certainty, but I think 2-3 years is a reasonable guess. 64GB DIMMs have only recently become semi-reasonable, and I still use a lot of 16GB or 32GB DIMMs on smaller deployments.

Basically whatever the top-of-the-line DIMM option may be (and this applies for CPUs and HDDs and other stuff too), you want to avoid being in a situation where you HAVE to use it. Vendors price these parts accordingly: you pay a premium for top-of-the-line because you must have it. If you can avoid that, do so.

Thanks! But I don't see the 2630v4 in that sheet, only older versions.

The 2630v4 row was hidden because of the QPI / memory concerns I mentioned (all such models were hidden on the sheet because I don't consider them for my purchases). I re-exposed those rows just now. FYI, the v4 rows are towards the bottom, under Broadwell. I've been tracking this stuff for a while.

The $/perf of 2630v4 is pretty decent ($2.24), but I would personally be leery for the reasons I mentioned. That said, I have used it for bulk storage servers, where CPU performance was not that important. So it's not like it will blow up your machine or something.

To obtain the perf number, I'm averaging single core and multi core fp and int SPEC numbers. If your workload isn't heavily parallelizable, that might not make sense. I'm not too worried about single core performance myself these days and have been tempted to remove it entirely.
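A minimal sketch of that $/perf calculation, assuming a plain unweighted mean of the four SPEC figures (the spreadsheet may normalize them differently, which isn't described here). The price and scores below are hypothetical placeholders, not real SPEC results.

```python
from statistics import mean


def dollars_per_perf(list_price: float, int_single: float, int_multi: float,
                     fp_single: float, fp_multi: float) -> float:
    """List price divided by the average of single/multi core int and fp scores."""
    perf = mean([int_single, int_multi, fp_single, fp_multi])
    return list_price / perf


# Hypothetical CPU: $1000 list price and made-up benchmark scores.
print(round(dollars_per_perf(1000.0, 50.0, 700.0, 70.0, 580.0), 2))
```

Dropping the single core terms, as considered above, would just mean averaging the two multi core scores instead.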

One other thing I forgot to mention: v5 Xeon CPUs will be shipping in quantity early 2017, so you may want to consider holding off and looking for better deals on v4 CPUs then.

Likewise, you might be able to obtain a better deal today on v3 CPUs, particularly if you aren't using a large vendor like Dell or HPE. All of my pricing is list (I don't pay these prices), so the math changes significantly if you can get a disproportionate deal on a particular model. I use it as a place to start the conversation with my VAR, and then go with what makes sense in the market right now.

The network piece seems kind of incomplete and outdated when compared to what's being discussed in terms of compute and storage. Most of the new networking fabrics being built at this point are 100G, with emerging server connectivity focusing on nx25G for in-rack. Longer-distance 25G is awaiting implementations of FEC but the currently shipping gear can do 3-5M - which is about ideal for top-of-rack. The economics and development cycles of these link-types tends to dictate that 40GE doesn't make sense for a new installation.

The call for "SDN" is incredibly nebulous - to the point of being almost meaningless, IMO. What the big guys tend to be after is a way to control the fabric via standardized API calls - so capability for YANG/NETCONF or some mechanism for direct access to SDE calls. The other thing that's not addressed is how to efficiently get information about the network out of the network. Traditional polled mechanisms (SNMP/RMON, et al) have been shown to lack both scale and adequate resolution, while legacy approaches to push telemetry (sFlow/Netflow) miss the mark in terms of level of detail and compatibility with large-scale data processing needs.

The next point is the selection of topology and the integration with multi-site planning. There's a lot of cool stuff happening in this regard and there seems to still be a pretty big disconnect between what the systems folks seem to understand and what's happening in the network industry, which is a shame because there's probably more opportunity for neat stuff (read: scalability, performance, fault resistance, manageability) than seems to be discussed (at least on HN).

Finally - there's a certain conventional wisdom among the systems and some sections of the programmer crowd that network control planes are just another mostly simple bit of software to be implemented. They're not. It's a hard problem and is the manifest reason why only a handful of organizations have been able to produce software that runs a meaningful percentage of large-scale L3 infrastructure in the world (hint: Arista is a great company but isn't included in this number quite yet). Truly rugged, useful/usefully-featured and performant network code is hard. Making that code work in the context of 30+ years of protocol implementations, morphing standards and a world of bad/clueful actors is REALLY hard. There's an inverse relationship between the amount of money spent on a solution and the amount of specialized expertise you need on staff. A more traditional commercial solution might be more expensive but it also relieves you of the need to keep some relatively rare, likely expensive and almost certainly non-revenue producing skill-sets on staff.

Can someone fill me in why renting a couple of root servers wouldn't solve the problem?

Good luck. I think moving from cloud is a mistake. I hope you prove me wrong.

We hope so too ;)

The first consideration should always be the DC, and the very last one is the software, after hardware, network, power and cooling.

Where will my DC be? What kind of DC is it? What services do they provide? How long do I want it to take for an employee to get there, whether or not they have 24/7 remote hands? What kind of power resiliency do they provide? What will power cost? What kind of power do they provide per cage and rack? How will their uplinks affect my traffic needs? Etc etc.

Cooling I didn't deal with directly, but suffice to say you will always need more cooling, and its efficacy will determine if your hardware stays alive or not. I've seen 11 foot racks with only 6 feet of hardware because they simply couldn't cool the racks at full height. Learn how to look for properly designed hot/cold racks and keep your racks well organized to make cooling efficient.

Power is pretty obvious, except that it isn't. Eventually you will draw too much power and you'll need to shunt machines into different racks and monitor your power trends. So one of the things to consider, besides dual drops, is how many extra racks do I have for when I need to spread out my power OR cooling into new racks? Will they make me use a rack on the other side of the building, or am I going to pay for some in reserve next to the current ones? Get PDUs that aren't a pain in the ass to automate (APC sucks balls).

Network: i'm not a neteng, don't listen to me, but obviously it should be managed with nice fat switching fabric bandwidth, good forwarding rate and big uplink module support. 48-port switches don't always have the same bandwidth ratios as 24-port switches, and uplinks are much easier to manage on a 24-port than a 48.

Hardware: you don't seem to need anything special, so you need to determine if a support contract is necessary, and if not, buy the cheapest pieces of shit you can and then rely on remote hands or a local employee to change out broken shit all the time. If space, power, cooling are at a premium, a blade chassis can be handy. But if you can spare the space, power, and cooling, 0.5U and 1U shitboxes are fine for most purposes. Don't get wrapped up in the details unless your application design requires specific hardware performance guarantees.

Looked at iSCSI SANs? Could make WAN sync easier, reduce overhead from NFS, but probably depends on how well your OS supports it and the features of the SAN. Oh, and an OOB terminal server can be a godsend when combined with a good PDU.

Go find all the industry datacenter design papers out there (there are tons) to bone up on the design considerations. Remember that you can always replace machines, but you can't replace rack, cooling or power design.

>Network: i'm not a neteng, don't listen to me, but obviously it should be managed with nice fat switching fabric bandwidth, good forwarding rate and big uplink module support. 48-port switches don't always have the same bandwidth ratios as 24-port switches, and uplinks are much easier to manage on a 24-port than a 48.

With current gen ASICs and switches, this isn't generally true anymore. A $1000 48 port 1Gbps switch is fully non-blocking with almost 1:1 10G uplinks (48x1G in, 4x10G out)
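The "almost 1:1" figure is just the ratio of downlink to uplink capacity. A quick sketch using the port counts from the comment (the function name is illustrative):

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Ratio of total server-facing bandwidth to total uplink bandwidth."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


ratio = oversubscription(48, 1, 4, 10)  # 48x1G in, 4x10G out
print(f"{ratio:.1f}:1")
```

Anything at or below 1.0 means the uplinks can carry every downlink at line rate simultaneously.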

agreed. There is absolutely zero reason to even consider 24 port switches in this environment.


C1: Have a look at the FatTwin Line for more Disks per U. More PCIe Slots too.


D3: Check the measurements, having it not fit is painful

D4: More, smaller drives. Make sure you go PMR not SMR if you do go for 8TB


N1: The "SDN" aspect of the supermicro one is not really any different than any other. Look at https://bm-switch.com/ and get one that supports Cumulus Linux. Buy one with an x86 CPU. If you want to do "SDN" things or run custom monitoring, not dealing with PPC is great.

N3: Probably not needed, but not terribly expensive if it provides benefit.

N4: see N1, no.

N5: Cheap 1G switch that supports cumulus, x86, probably broadcom \ helix4

Networking General:

- I wouldn't advise using the 10GbE Copper. Go for 25GbE with DAC; it's basically the same price, and Mellanox NICs are small/cheap

- Transit is cheap, you can get 500Mbps on a 10G port for $325/mo from Cogent

- If your bandwidth needs to scale up, data center locations matter more than you think

25GbE adapter -- minimal additional cost for 2.5x the perf, lower latency as well: http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=2...

a 32 port 100GbE switch is about $7000-12,000 -- you can break that down to 128x 25GbE, and use the 25GbE ports running at 10GbE mode for your carrier uplinks. Could even do 100GbE to your Ceph nodes if you wanted, but be aware of PCIe bottlenecks -- x8 is about 64Gbps, x16 will do 125Gbps. Dual port 40G on x8 or dual port 100G on x16 will not provide more than those numbers.
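The PCIe ceilings quoted above can be reproduced from the link parameters. This sketch assumes PCIe 3.0 (8 GT/s per lane with 128b/130b encoding), which the comment doesn't state explicitly, and it ignores packet-level protocol overhead, so real throughput will be somewhat lower. Function names are illustrative.

```python
GT_PER_LANE = 8.0              # PCIe 3.0 raw rate per lane in GT/s (assumed generation)
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def pcie3_gbps(lanes: int) -> float:
    """Approximate usable bandwidth of a PCIe 3.0 slot, per direction, in Gbps."""
    return lanes * GT_PER_LANE * ENCODING_EFFICIENCY


def nic_ceiling_gbps(ports: int, port_gbps: float, lanes: int) -> float:
    """A NIC can't move more than the smaller of its port total and its slot."""
    return min(ports * port_gbps, pcie3_gbps(lanes))


print(f"x8:  {pcie3_gbps(8):.1f} Gbps")
print(f"x16: {pcie3_gbps(16):.1f} Gbps")
print(f"dual 40G on x8:   {nic_ceiling_gbps(2, 40, 8):.1f} Gbps")
print(f"dual 100G on x16: {nic_ceiling_gbps(2, 100, 16):.1f} Gbps")
print("25GbE ports from a 32 port 100GbE switch:", 32 * 4)
```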

Consider Supermicro NVMe servers (Ultra series) for DBs, and 2.5in NVMe SSDs instead of PCIe.

Rack: Don't assume 45U; 40-48U is common. Consider buying 2 racks.


19kW seems high for a single rack; you will need a good datacenter to support that density. Density costs money, and more racks are generally cheaper.

208v * 30A = 5000W usable per feed, unless they are talking about 208v 3-phase, which gets you 8600W usable per feed, which again is only 17kW and you need 18-20kW.

helpful reference: http://www.raritan.com/blog/detail/3-phase-208v-power-strips...

You can only use 80% of your power provided.
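The usable-power numbers above follow from volts times amps with the 80% derating applied, plus a factor of sqrt(3) for 3-phase. A small sketch, assuming 30A circuits at 208V as in the comment (function names are illustrative):

```python
import math

DERATE = 0.8  # only 80% of the circuit rating is usable for continuous load


def single_phase_watts(volts: float, amps: float) -> float:
    """Usable continuous power on a single-phase feed."""
    return volts * amps * DERATE


def three_phase_watts(volts: float, amps: float) -> float:
    """Usable continuous power on a 3-phase feed (line-to-line voltage)."""
    return math.sqrt(3) * volts * amps * DERATE


single = single_phase_watts(208, 30)
three = three_phase_watts(208, 30)
print(f"single-phase feed: {single:.0f} W")
print(f"3-phase feed:      {three:.0f} W")
print(f"two 3-phase feeds: {2 * three / 1000:.1f} kW")
```

Note that if the two feeds are meant to be redundant, the rack has to fit within one feed's usable power, not the sum of both.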

You also need rack PDUs, and higher density PDUs cost more money. Consider buying port-switchable PDUs. Raritan makes good ones.

Ask Supermicro (or your reseller) for a "Power Sheet"; it will tell you almost exactly how much power your server will use. I've had good luck with ThinkMate.


H1: Yes, too many to mention

H2: Do it yourselves

H4: yes

Hosting general: Cross connects cost money, and this can add up. A number of facilities offer free xconnects.

Other notes: - You want a small toolbox in the data center

- Buy more cables than you think you'll need

- You'll always forget something

- There are a number of companies that will lease you servers for pretty decent rates

Agree that bare metal is effective and doable at your scale, and if done right can give a better SLA and much, much better control (especially of Internet-facing network performance) than public cloud, or combos thereof.

We are running a 50%+ gross margin mid-stage venture-backed startup in Equinix facilities (but started there vs. cloud), and have no people near our facilities, and have had 0 issues service-wise related to doing management remotely. Yes, people go out to set up cabs, etc, but we hired our ops folks as generalists who had some network experience, and our CEO and CTO do as well, though AFAIK I don't have network logins active right now.

2 high-level thoughts I'd share:

1) Try not to use Ceph unless you're committed to having 2 people with deep experience at the code level.

2) I'd use Juniper QFX or EX, or Aristas. You don't seem to be running at a scale or functionality where SDN magic is needed, and there is a large community of QFX, EX, and Arista users your folks can reach out to when problems happen.

The other comments are more tuning and FYI on what we do HW-wise:

Specifically re: HW, at Kentik we run tens of worker nodes + flow ingest servers, all SM 1us w a few SSD and 256-384gb RAM. 48 logical cores, 2 x E5-2650v4.

We run approaching 1PB of storage, and while we still have some 4u 36-disk 3.5" boxes, those are phasing out and all we buy now is 2u SuperMicros w/ 24x2TB Samsung Evo 850ss. Procs are 72 logical core, 2 x E5-2697v4.

The Evo SSDs have been great - but our workload is largely appends or create/writes - largely but not all sequential, with high read IOPS. Before Samsung I was a big fan of Intel but we have no data on the modern Intels - slower for sure, but a focus on reliability is great...

We use JBOD and ZFS on the storage nodes; the LSI 9300-8i. Have things tested so we can do TRIM.

They do make SuperServers for roughly those configs, but we go with SM resellers who assemble and burn-in for +10-15%. I had 50+ SuperServers that were great at my Usenet company, but we'd rather have our ops folks work on things other than burn-in.

Happy to explain why we went to SSD vs. spinning at 2x the cost, but basically it made enough of a difference at 95th and 99th percentile in our query times, and we had access to venture debt on great terms (which you should too and happy to discuss, since we're both funded by August).

Last note re: gear - when we were doing spinning, we found a screaming deal on new 2TB enterprise SATA (Hitachi, I think) for $50 and took the power/space hit for the +IOPS and extra compute we got for firing up the additional machines. Not sure if those are still out there, or the IOPS of this kind of approach would be needed.

First of all, massive kudos for the Stack Exchange-like technical transparency. Definitely consider a massive upgrade album like http://blog.serverfault.com/2015/03/05/how-we-upgrade-a-live... and http://imgur.com/a/X1HoY!

GitLab is awesome. I'm really sad that in the past two or three months I've only found one GitLab link on HN to click. There really needs to be more. (I'm not sure if this is because I'm browsing in AEDT or if GitLab isn't used a lot on here.)

I wondered about how you guys might do advertising to get more mindshare, and then I realized one possible explanation about why you're doing this: getting technical advice from the community means everyone's had a part to play, and they're likely to remember that. Good move ;)


In my case, I have little (okay, 0) practical experience; a lot of the following is mentioned experimentally, to see how these ideas would handle the described environment. It's pretty much all stuff I've read online.

I welcome replies that shoot down any of these ideas.

> Disk

> Disks can be slow so we looked at improving latency. Higher RPM hard drives typically come in GB instead of TB sizes. Going all SSD is too expensive. To improve latency we plan to fit every server with an SSD card. On the fileservers this will be used as a cache. We're thinking about using Bcache for this.

There's already been another brief comment (https://news.ycombinator.com/item?id=13153317) about ZFS.

So, I'll ask. Why not ZFS? You don't have to run FreeBSD anymore to get a stable implementation.

You can put both the L2ARC and ZIL on SSDs. You can even use striping with them. Don't quote me on this but I think there MAY be some recovery capabilities built into these layers for if the power goes out (either it didn't use to be possible and now it is, or it's architecturally impossible, I hilariously cannot remember which).


> In general 1GB of memory per TB of raw ZFS disk space is recommended.

This is ONLY if you have dedupe switched on. If you have dedupe off you can run systems in just 4GB. A lot of home server enthusiasts do this.

There are a lot of unfortunate and widespread misconceptions about ZFS.


(This bit's somewhat anecdotal and is more informational than actionable. It's worth noting if you're interested in disks.)

> Every node can fit 3 larger (3.5") harddrives. We plan to purchase the largest one available, a 8TB Seagate with 6Gb/s SATA and 7.2K RPM.

Technically, the largest one available (on Amazon and presumably elsewhere) right now is 10TB, but its price/capacity ratio is atrocious compared to the rest of the market ($450-$520 per disk).

I've heard that Seagate Enterprise Capacity drives either die within the first 2-4 weeks or last 20 years. They have 5 year warranties in any case. I haven't heard anything else about other disks.

Very interestingly, 8TB seems to be the current market leader. Here are a bunch of prices I took straight off Amazon, as guides:

#3: 4TB: $170 (13 disks for 52TB = $2040)

#2: 5TB: $200 (10 disks for 50TB = $2000)

#4: 6TB: $239 (9 disks for 54TB = $2151)

#1: 8TB: $360 (7 disks for 56TB = $1673)

#5: 10TB: $450 (5 disks for 50TB = $2250)
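Lists like this are easy to recompute from the per-disk prices. The sketch below assumes a target of at least 50TB raw, which reproduces the disk counts above, and uses only the listed prices. Recomputed this way, the 4TB and 8TB totals come out to $2210 and $2520 rather than the figures listed, so the ranking is worth double-checking against current prices.

```python
import math

TARGET_TB = 50  # assumed raw capacity target; it reproduces the disk counts above

# Per-disk prices in USD, keyed by capacity in TB, as listed in the comment.
PRICES = {4: 170, 5: 200, 6: 239, 8: 360, 10: 450}


def cost_for_capacity(size_tb: int, price: int, target_tb: int = TARGET_TB):
    """Return (disk count, total cost) to reach at least `target_tb` of raw space."""
    disks = math.ceil(target_tb / size_tb)
    return disks, disks * price


for size_tb, price in PRICES.items():
    disks, total = cost_for_capacity(size_tb, price)
    per_tb = total / (disks * size_tb)
    print(f"{size_tb}TB: {disks} disks, ${total} total, ${per_tb:.2f}/TB")
```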

A little while ago 5TB was the leader, and I was going to argue for more disks.


(Hitting add comment now instead of waiting so I can keep up with the discussion)
