1. Talk to your hosting providers and make sure they can support 32kW (or whatever max number you need) in a single rack, in terms of cooling. At many facilities you will have to leave empty space in the rack just to stay below their W per sq ft cooling capacity.
2. If you're running dual power supplies on your servers, with separate power lines coming into the rack, model what will happen if you lose one of the power lines, and all of the load switches to the other. You don't want to blow the circuit breaker on the other line, and lose the entire rack.
3. Thinking about steady state power is fine, but remember you may need to power the entire rack at full power in the worst case. Possibly from only one power feed. Make sure you have excess capacity for this.
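A rough sketch of the check in points 2 and 3, in Python. Everything numeric here is an assumption for illustration (400 W per server at full load, a single-phase 208 V feed, a 30 A breaker held to 80% for continuous load), and `single_feed_check` is a made-up helper name, not something from a real tool:

```python
def single_feed_check(server_watts, servers, volts=208.0,
                      breaker_amps=30.0, continuous_fraction=0.8):
    """Model an A+B fed rack after one feed is lost.

    Normally each feed carries about half the load; after a failure the
    surviving feed carries all of it. Single-phase math only.
    Returns (ok, failover_amps, limit_amps).
    """
    failover_amps = server_watts * servers / volts
    limit_amps = breaker_amps * continuous_fraction
    return failover_amps <= limit_amps, failover_amps, limit_amps


if __name__ == "__main__":
    # Hypothetical rack sizes, worst-case wattage per server.
    for count in (10, 20):
        ok, amps, limit = single_feed_check(400, count)
        print(f"{count} servers: {amps:.1f} A on the surviving feed, "
              f"limit {limit:.1f} A, ok={ok}")
```

With these assumed numbers, 10 servers fit on one feed and 20 do not, even though 20 would look fine when the load is split across both feeds.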
The first time I made a significant deployment of physical servers into a colo facility, power and cooling were quite literally the last thing I thought about. I'm guessing this is true for you too, based on the number of words you wrote about power. After several years of experience, power/cooling was almost the only thing I thought about.
Build one node with the hardware configuration you intend to use. Same CPU, RAM, storage.
Put it on a watt meter accurate to 1W.
Install Debian amd64 on it in a basic config and run 256 threads of the 'cpuburn' package while simultaneously running iozone disk bench and memory benchmarks.
This will give the figure of the absolute maximum load of the server, in watts, when it is running at 100% load on all cores, memory IO, and disk IO.
Watts = heat, since all electricity consumed in a data center is either being used to do physical work like spinning a fan, or is going to end up in the air as heat. Laws of physics. As whatever data center you're using will be responsible for cooling, this is not exactly your problem, but you should be aware of it if you're going to try to do something like 12 kilowatts density per rack.
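To put the "watts = heat" point in the units cooling people tend to quote, here is a small conversion sketch. The conversion factors are the standard ones (1 W is about 3.412 BTU/hr, one ton of refrigeration is 12,000 BTU/hr); the function name is just for illustration:

```python
BTU_PER_HR_PER_WATT = 3.412   # 1 W of electrical load is about 3.412 BTU/hr of heat
BTU_PER_HR_PER_TON = 12_000   # one "ton" of refrigeration


def heat_load(watts):
    """Return (btu_per_hour, tons_of_cooling) for a given electrical load."""
    btu_per_hour = watts * BTU_PER_HR_PER_WATT
    return btu_per_hour, btu_per_hour / BTU_PER_HR_PER_TON


if __name__ == "__main__":
    btu, tons = heat_load(12_000)  # the 12 kW rack mentioned above
    print(f"12 kW is roughly {btu:.0f} BTU/hr, or {tons:.2f} tons of cooling")
```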
Then multiply the wattage of your prototype unit by the number of servers and compare that against the circuit capacity. This will tell you how many fully loaded systems will fit in one 208V 30A circuit on the AC side.
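That arithmetic, sketched in Python. The 450 W figure is a placeholder for whatever your watt meter actually reads, and the 0.8 factor assumes the breaker is only rated for 80% continuous load (set it to 1.0 for a 100% rated breaker). Single-phase math only:

```python
import math


def servers_per_circuit(max_server_watts, volts=208, amps=30, derate=0.8):
    """How many fully loaded servers fit on one circuit.

    max_server_watts should be the measured worst case, not the idle draw.
    """
    usable_watts = volts * amps * derate
    return math.floor(usable_watts / max_server_watts)


if __name__ == "__main__":
    measured = 450  # hypothetical watt meter reading under full load
    print(servers_per_circuit(measured))              # 80% rated breaker
    print(servers_per_circuit(measured, derate=1.0))  # 100% rated breaker
```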
Also, use the system BIOS option for power recovery to stagger bootup times by 10 seconds per server, so that after a total power loss an entire rack of servers does not attempt to power up simultaneously.
That brings me to my next point: GitLab should also be mindful of which services are stored on which hardware - performing heroics to work around circular dependencies is the last thing you want to be doing when recovering from a power outage.
Depends a lot on server/HA/software architecture.
I've only ever built desktop machines, and this top comment drew a surprising parallel to most "help me with my desktop build" type posts. Granted, I'm sure as you dig deeper, the reasoning may be much different, but myself being ignorant about a proper server build, it was somehow reassuring to see power and cooling at the top!
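For what it's worth, the stagger schedule itself is trivial to generate. This sketch only computes the per-server delay values; how you actually push them into the BIOS or BMC is vendor-specific and not shown. The hostnames are made up:

```python
def power_on_delays(hostnames, step_seconds=10):
    """Map each host to a power-on delay, one step apart in rack order."""
    return {host: index * step_seconds for index, host in enumerate(hostnames)}


if __name__ == "__main__":
    rack = [f"node{n:02d}" for n in range(1, 5)]  # hypothetical hostnames
    for host, delay in power_on_delays(rack).items():
        print(f"{host}: power on after {delay} s")
```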
 - http://www.chatsworth.com/products/cabinet-and-enclosure-sys...
In somewhere that is very close (OSI layer 1) topologically to a major traffic interexchange point, you will definitely be paying somehow for the monthly cost of the square footage occupied. For example a colo in 60 Hudson or 25 Broadway in NYC, or in One Wilshire in LA.
Large-scale colocation pricing that is based on power only will be found in locations that are not also major traffic interexchange points. For example Quincy, WA or the many data centers in suburban New Jersey.
But for other sites with high value peering, including EQIX in Ashburn, Coresite, your limiting factor is likely to be the power, not the space. I.e. they'll "give" you 1 cabinet for every 15kW you buy.
So my assertion assumes you're doing large number of dense cabinets.
If one is doing a large number of dense cabinets, they almost certainly should not do it at a high value peering point, and should backhaul it. You should be able to get diverse metro dark fiber for <$5k/mo if not substantially less. Put a single cabinet (or pair for redundancy) of $2k/mo cabinets in Equinix or Fisher and off you go.
Hope that helps. Your deployment is larger than ours, so there may be other techniques, but that's what we did.
I.e. you would also want to be sure that your wiring was rated for 100% utilization, and that other circuit-breaker-like functions exist.
Fire is an actual thing, and figuring out the best way to recharge a halon system isn't exactly what you want to be doing.
When you hardwire the circuit the electrical code allows you to use a 100% breaker.
This is less common in DCs historically but more and more as folks do 208v 3phase 100A circuits.
For a deployment of this scale it should be metered power (For example 1 (or more) 3phase a+b drops to each cabinet) where you only pay a Non-Recurring setup Charge (NRC) and then the MRC is based on actual power draw.
3phase also means fewer physical PDUs (uses less space), but more physical breakers. Over-building delivery capability will eliminate any over-draw concerns for startup cycles.
Since my cabinet number is usually evenly divisible by N*PDUs, this impacts overall capital.
Having a little headroom on your power circuits is also incredibly important, and not every facility will sell 100% rated breakers. It may make more sense to be in a facility with 80% rated breakers than 100%, even with the added capex of an extra PDU or two.
Goes back to my previous comment. What is important to you, at the pod / multiple pod level, isn't as important to the 1-2 cab deployment.
Similarly, ensure spare room in the cabinet for adjustments, that thing you forgot, and small growth. Much better to have 70% full and not need the space than to have no free RU and need the space.
I would recommend verifying everything is fault tolerant/HA as expected every step of the way. We ran into issues where the power strips on both sides were plugged into the same circuit (D'oh), wrong SST's, redundant routers getting cabled up to the same power strips, etc., you name it.
After a rack is set up, have people at the DC (your own employees or the DC's techs) help simulate (create) failures in power, networking, and systems to verify everything is set up correctly. It sounds like you have people coming onboard with experience provisioning/delivering physical systems though, so I would expect them to be on the ball with most of this stuff.
A system at idle vs full CPU vs full cpu + all disk will produce very different measurements.
Also keep in mind 80% derating - many electrical codes will state that an X amp circuit should only be used at 0.8X on a continuous basis (and the circuit breakers will be sized accordingly).
I didn't know half of the stuff in the grandparent's post.
My underlying assumption is that this is a production service with customers depending on it.
1. Don't fuck with networking. Do you have experience operating the same or similar workloads on your Supermicro SDN? Will the CEO of your Supermicro VAR pick up his phone at 2AM when you call?
My advice: Get a quote from Arista.
2. Don't fuck with storage.
32 file servers for 96TB? Same question as with networking re:ceph. What are your failure domains? How much does it cost to maintain the FTEs who can run this thing?
3. What's the service SLA on the servers? Historically, supermicro VARs have been challenged with that.
If I were building this solution, I'd want to understand what the ROI of this science project is as compared to the current cloud solution and a converged offering like Cisco/NetApp, HP/3Par or something like Nutanix. You're probably saving like 20-25% on hardware.
This sounds to me like pinching pennies on procurement and picking up tens of dollars of cost on the labor side. If you're accustomed to SLAs from AWS, this will be a rude awakening.
I think they are coming at this problem from the wrong perspective - instead of growing from virtual servers to their own dedicated hardware to get better CephFS performance, they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS. At some point in the future, it will bite them. Not an if, but a when.
In addition, this means that running physical hardware, CephFS and Kubernetes (among other things) are now going to be part of their core competencies - I think they are going to underestimate the cost to GitLab in the long run. When they need to pay for someone to be a 30 minute drive from their DC 24/7/365 after the first outage, when they realize how much spare hardware they are going to want around, etc.
As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.
As someone who administers GitLab for my company, yes please.
Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us. (We've found mounting NFS across different datacenters to be too unreliable and our failover scenarios include loss of a data center.)
It would also be wonderful to be able to run GitLab in any of the PaaSes that only have ephemeral disk, but that's a secondary concern.
What are the alternatives?
I suppose there's MySQL's "stream asynchronously to the standby at the application level."
... which, now that I think about it, should be pretty easy to do with Git, since pushing and pulling discrete units of data are core concepts...
I mean, especially because git's whole model of object storage is content-addressable and immutable, it looks like it's a prime use for generic object storage.
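To make the content-addressable point concrete: a Git blob's id is the SHA-1 of a small header plus the file contents, so the key is derived from the data and never changes. A minimal sketch; `ObjectStore` is a toy in-memory stand-in for a generic object store, not any real API:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the id Git assigns to a blob: sha1(b"blob <size>\\0" + content)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy key/value store keyed by object id (hypothetical, in-memory only)."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self._objects[oid] = content  # same content always lands on the same key
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]


if __name__ == "__main__":
    store = ObjectStore()
    oid = store.put(b"hello world\n")
    print(oid)  # matches `echo 'hello world' | git hash-object --stdin`
```

Real Git also zlib-compresses objects on disk and has trees, commits, and packfiles, all of which this leaves out.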
Latency and consistency would be my concerns - S3 does not quite have the right semantics for some of this, so you'd have to build a small shim on top to work around this. Ceph's rados doesn't even have these problems, so it is quite a good contender.
And considering the usual compressed size of commits and many text files, you're going to have more HTTP header traffic than actual data if you want to do something like a rev-list.
I would imagine any implementation that used S3 or similar as a backing store would have to heavily rely on an in-memory cache for it (relying on the content-addressable-ness heavily) to avoid re-looking-up things.
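A sketch of what such a cache could look like. Because objects are immutable and keyed by their hash, a cached entry never goes stale, so there is no invalidation logic at all, only eviction. `CachedObjectReader` and the `fetch` callable are illustrative names; `fetch` stands in for whatever slow call would hit S3 or a similar backend:

```python
from collections import OrderedDict


class CachedObjectReader:
    """LRU cache in front of a slow, content-addressed backend."""

    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch          # callable: object id -> bytes (assumed slow)
        self._capacity = capacity
        self._cache = OrderedDict()
        self.backend_reads = 0       # counts round trips to the backend

    def get(self, oid):
        if oid in self._cache:
            self._cache.move_to_end(oid)     # mark as most recently used
            return self._cache[oid]
        data = self._fetch(oid)
        self.backend_reads += 1
        self._cache[oid] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return data


if __name__ == "__main__":
    backend = {"aaa": b"one", "bbb": b"two"}  # stand-in for the object store
    reader = CachedObjectReader(backend.__getitem__, capacity=2)
    reader.get("aaa")
    reader.get("aaa")
    print(reader.backend_reads)  # the second read was served from memory
```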
I wonder how optimized an object store's protocol would have to be (http2 to compress headers? Protobufs?) before it starts converging on something that has similar latency/overhead to NFS.
"AWS CodeCommit is built on highly scalable, redundant, and durable AWS services such as Amazon S3 and Amazon DynamoDB."
This is a really good point. That's easily $1M in payroll. You could probably run a decent tiered SAN with 80-95% fewer labor dollars. Plus have the benefit of torturing the vendor when you hit scaling hiccups.
Can you give some examples of the problems you ran into?
Something that always seemed to cause nagging issues was that we wanted our cluster to have data encryption at rest. Ceph does not support this out of the box, which means that you need to use dmcrypt on top of your partitions, and present those encrypted partitions to Ceph. This requires some work to make sure that decrypt keys are setup properly, and that the machine can reboot automatically and remount the proper partitions. In addition, we ran into several issues where device mapper or otherwise would lock an OSD, which would send the entire machine into lockup, messy!
We also had to work pretty hard to build quality monitoring around Ceph - by default, there are very few tools that provide at-scale fine grained monitoring for the various components. We spent a lot of time figuring out what metrics we should be tracking, etc.
We also spent a good amount of time reaching out to other people and companies running ceph at scale to figure out how to tune and tweak it to work for us. The clusters were all-SSD, so there was a ton of work to tune the myriad of settings available, on ceph and the hosts themselves, to make sure we were getting the best possible performance out of the software.
When you run dozens-to-hundreds of servers with many SSDs in them that are doing constant traffic, you tend to hit every edge case in the hardware and software, and there are a lot of lurking demons. We went through controller upgrades, SSD firmware upgrades, tweaking OP settings, upgrading switches, finding that the write workload on certain SSDs caused problems and much more.
That's just a snapshot of some of the issues that we ran into with Ceph. It was a very fun project, but if you are getting into it for a high-throughput setup with hundreds of OSDs, it can be quite a bit of work.
Happy to chat more w/ anyone that's curious - there is some very interesting and fun stuff going on in the Ceph space.
I've always wondered how automatic reboots are handled with filesystem encryption.
What's the process that happens at reboot?
Where is the key stored?
How is it accessed automatically?
My philosophy today is that if the data is important at all, it's worthwhile going spendy and getting a SAN. Get a good one. I like Nimble a lot right now, but there are other good ones, too. (Don't believe crazy compression numbers or de-duplication quackery; I've told more than one SAN vendor to fuck off after they said they'd get 20:1 on our data, without doing any research on what our data was).
Have everything backed up? Great! How long until you can go live again after water drips on your storage? If you spend a week waiting for a restore, that's indistinguishable from failure. If you wait a month for replacement hardware to be shipped, you might have killed your product.
Perhaps I don't understand the problem domain, but I don't understand why CephFS is being considered for this task. You're trying to treat your entire set of files across all repos as a single filesystem, but that's an entirely incorrect assumption. The I/O activity on one repo/user does not affect the I/O activity of an entirely different user. Skip the one filesystem idea, shard based on user/location/whatever.
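A minimal sketch of the sharding idea, assuming repositories are independent and can be routed by path. The shard names are invented, and plain modulo hashing is only the simplest option: it reshuffles most repos when the shard list changes, so a lookup table or consistent hashing would be the usual alternatives.

```python
import hashlib


def shard_for(repo_path, shards):
    """Pick a storage shard for a repository from a stable hash of its path."""
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    shards = ["fs01", "fs02", "fs03", "fs04"]  # hypothetical file servers
    for repo in ("alice/website.git", "bob/kernel.git"):
        print(repo, "->", shard_for(repo, shards))
```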
I'd appreciate any comments explaining why I'm wrong, because this doesn't seem to be a productive design to me.
None of the clouds use any of that super-expensive gear, so if you're going for cost savings, you'll need to use the same sort of commodity gear they use.
Gitlab is obviously Linux-savvy and comfortable writing automation, so things like Cumulus Linux and minimal-handholding hardware vendors shouldn't cause them any indigestion.
<disclaimer: co-founder of Cumulus Networks, so slightly biased>
My point isn't to knock them down. It takes cojones to be public about stuff like this. My instinct as a grumpy engineering director type is that there are holes here that need to be filled in.
Putting a major product at risk to save $30k against an Arista switch isn't a decision to make lightly. That means pricing the labor, upside benefit and business risk. If they are going to 100x this environment, Cumulus will save millions. If it will 3x, it will save a few thousand bucks -- who cares.
Arista and Cisco shouldn't cost top dollar; though anyone buying EMC or Netapp for any new build should have their union card revoked. FreeNAS ftw uber alles.
Source: Did it twice.
I've run it in my home a few times out of curiosity, and that was never my impression.
We're... displeased with the current solution that we're using at work for this use case. :)
C2: The Dell equivalent is C6320.
CPU: Calculate the price/performance of the server, not the processor alone. This may lead you towards fewer nodes with 14-core or 18-core CPUs.
Disk: I would use 2.5" PMR (there is a different chassis that gives 6x2.5" per node) to get more spindles/TB, but it is more expensive.
Memory: A different server (e.g. FC630) would give you 24 slots instead of 16. 24x32GB is 768GB and still affordable.
Network: I would not use 10GBase-T since it's designed for desktop use. I suggest ideally 25G SFP28 (AOC-MH25G-m2S2TM) but 10G SFP+ (AOC-MTG-i4S) is OK. The speed and type of the switch needs to match the NIC (you linked to an SFP+ switch that isn't compatible with your proposed 10GBase-T NICs).
N1: A pair of 128-port switches (e.g. SSE-C3632S or SN2700) is going to be better than three 48-port. Cumulus is a good choice if you are more familiar with Linux than Cisco. Be sure to buy the Cumulus training if your people aren't already trained.
N2: MLAG sucks, but the alternatives are probably worse.
N4: No one agrees on what SDN is, so... mu.
N5: SSE-G3648B if you want to stick with Supermicro. The Arctica 4804IP-RMP is probably cheaper.
Hosting: This rack is a great ball of fire. Verify that the data center can handle the power and heat density you are proposing.
Would you mind if we contact you to discuss?
I'm never buying another SuperMicro, for many reasons. The amount of cabling for properly redundant connections is a killer; it's at least three cables per system (five in our case); a rack will have hundreds of wires to manage. Access to the blade is in the back, where the cables are, and you have to think ahead and route things cleverly if you want to be able to remove blades later.
The comment about doing something other than three 48-port switches is bang on. And if you're running Juniper hardware, avoid the temptation to make a Virtual Chassis (because it becomes a single point of failure, honest, and you will hate yourself when you have to do a firmware upgrade).
19kW is still a ton of power, and I'm surprised the datacenter isn't worried (none of the datacenters we use worldwide go much above 16kW usable). Also, you need to make sure you're on redundant circuits, and that things still work with one of the power legs totally off. Make sure you know what kind of PDU you need (two-phase or three-phase), and that you load the phases equally.
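On loading the phases equally, a simple greedy assignment gets you most of the way: sort by draw, then always plug the next device into the lightest phase. This is a simplified sketch that treats each phase as an independent bucket of watts; on a real three-phase PDU with line-to-line loads the current math is more involved. Device names and wattages are made up:

```python
def balance_phases(loads_watts, phases=("L1", "L2", "L3")):
    """Greedily assign loads to phases, largest first, lightest phase next.

    loads_watts: dict of device name -> worst-case watts.
    Returns (assignment, totals).
    """
    totals = {phase: 0 for phase in phases}
    assignment = {phase: [] for phase in phases}
    ordered = sorted(loads_watts.items(), key=lambda item: item[1], reverse=True)
    for name, watts in ordered:
        lightest = min(phases, key=lambda phase: totals[phase])
        totals[lightest] += watts
        assignment[lightest].append(name)
    return assignment, totals


if __name__ == "__main__":
    rack = {"chassis-a": 500, "chassis-b": 400, "chassis-c": 300,
            "chassis-d": 300, "switch-1": 200, "switch-2": 100}
    assignment, totals = balance_phases(rack)
    print(assignment)
    print(totals)
```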
For short distances of known length, twinax cables which are technically copper can be used. They're thinner than regular cat6a but only about the same as thin 6a and thicker than typical unshielded Duplex fiber patch cables. Twinax can be handy if connecting Arista switches to anything else that restricts 3rd party optics, as Arista only restricts other cable types. Twinax is also the cheapest option.
The PHY has to do a lot of forward error correction and filtering, so it adds latency (for the FEC), power (for all the DSP) and cost (for the silicon area to do all of the above).
Consider the power pull from copper vs. fiber listed in the links below. The Arista 7050TX 128 port pulls 507W while the 7056SX 128 port pulls 235W. Yes, copper is more but we're talking half a kW for 2 TOR switches. And for this you get much cheaper cabling, as the SFP+ are much more expensive (go AOC if you do go fiber, BTW) and you have to worry about clean fiber, etc.
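Spelling out the "half a kW" arithmetic with the two wattage figures quoted above (nothing else here is measured, and the annual figure simply assumes the switches run flat out all year):

```python
COPPER_SWITCH_WATTS = 507  # 10GBase-T switch figure quoted in the comment
FIBER_SWITCH_WATTS = 235   # SFP+ switch figure quoted in the comment
HOURS_PER_YEAR = 24 * 365


def extra_copper_watts(switch_count):
    """Extra draw from choosing the copper switch over the SFP+ one."""
    return (COPPER_SWITCH_WATTS - FIBER_SWITCH_WATTS) * switch_count


if __name__ == "__main__":
    extra = extra_copper_watts(2)  # a pair of top-of-rack switches
    print(f"{extra} W extra, about {extra * HOURS_PER_YEAR / 1000:.0f} kWh per year")
```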
Where fiber is REALLY nice is with the density, although the 28AWG almost makes that moot.
There are few DC builds that can't do end-to-end copper any more, at least for the initial 50 racks.
 - http://www.datacenterknowledge.com/archives/2012/11/27/data-...
 - https://www.arista.com/en/products/7050x-series
 - https://www.monoprice.com/product?p_id=13510
Cumulus isn't really SDN so I'm not sure what you're saying there.
Traditional networking is fine but it's totally different than Linux so you need dedicated netops people to manage it. And Cisco is the most expensive traditional vendor.
I've just gotten the notion that SDN is almost ready for primetime, just not yet.
Juniper, Cisco, and Arista all have solutions for this environment.
You're talking about only 64 nodes right now. Your storage and IOPS requirements are not huge. A lot of mid-size hosting companies will give you fantastic service at the 10-1000 servers range. If I were you I'd talk to someone like https://www.m5hosting.com/ (note: happy dedicated server customer for many years -- and I'm sure there's similar scale operations on east coast if that's really what you need) who have experience running rock solid hosting for ~100s of dedicated servers per customer.
I suspect you may just be able to get your 5-10X cost/month improvement (and bare metal performance gains) without having to take on the financing and hardware bits yourself.
There are many advantages to AWS and similar services, but if you can't really take advantage of all the goodies because e.g. you also need to provide a local version of your software (which is the case for Gitlab, as far as I understand), renting dedicated servers is an order of magnitude cheaper.
In fact, I wonder if that's a business model for hosting providers. Kind of a sysadmin incubator.
We'll be glad to look into alternatives that manage the servers and network for us although the argument in https://news.ycombinator.com/item?id=13153455 that this is the time to build a team that can handle this makes sense to me too.
I would encourage you to look at what you can get without trying to do your own colo. You're not at the scale where you should be thinking about that.
That said, SoftLayer does provide 20Gbps access within a private rack and 20Gbps access to the public network.
Performance along with reliability are the most important metrics for someone providing VCS as a service.
Geo-redundancy seems like a luxury, until your entire site comes down due to a datacenter-level outage. (E.g. the power goes down, or someone cuts the internet lines when doing construction work on the street outside).
(This is one of the things that is much easier to achieve with a cloud-native architecture).
As far as moving from cloud to bare metal goes, another thing to take into consideration (which the comment this replies to touches on): if you don't architect your AWS (or other cloud) deployment to take advantage of multiple geographic regions, the cloud won't benefit you.
I 100% agree that there should be more than one region deployed for this service. As others have said, all it takes is 1 event and the site will be down for days to weeks to months (It may not happen often, but when it does, you go out of business). The size and complexity of this infrastructure will make it nearly impossible to reproduce on short order in a new facility. If I were the lead on this I would have either active / active sites, or an active / replica site.
I would also have both local (fast restores) and off-site backups of all data. A replica site protects against site failure, not against data loss, and it does not give you point-in-time recovery.
Yep, this is why scaling starts with scalable distributed design. We were moving a fairly large logging stack from NFS to S3 once, for the same reason Gitlab is trying to move to bare metal now. Moving off cloud was not an option, moving to a TCO efficient service was. NFS did not scale and there was the latency problem. I think moving to bare metal cannot help with scale as much as a good architecture can. We will see how deep the datacenter hole goes. :)
To add to my previous comment though, AWS (and cloud in general) tends to make much more sense if you are utilizing their features and services (such as Amazon RDS, SQS, etc.), and if you aren't using these services I can absolutely guarantee I can deliver a much lower TCO on bare metal than AWS. (Which is why I offered to consult for them.) I see this all the time. Company moves from bare metal to AWS as bare metal is getting expensive, then they quickly find out AWS can't deliver the performance they need without massive scale (because they aren't using a proper scalable distributed design and can't afford to re-architect their platform).
You will need to monitor:
- cpu utilization
- ram utilization
- disk utilization
- disk health
- context switches
- IP addresses assigned and reaching expected MACs
- appropriate ports open and listening
- appropriate responses
- time to execute queries
- processes running
- process health
- at least something for every bit of infrastructure
Once you collect that information, you need to record it, graph it, display it, recognize non-normal conditions, alert on those, page for appropriate alerts, and figure out who answers the pagers and when.
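The "recognize non-normal conditions" step is the part people tend to hand-wave. One very basic approach, sketched here as an assumption rather than a recommendation, is to flag a sample that sits more than a few standard deviations away from recent history. Real monitoring stacks have their own rule languages for this; the function name and numbers below are purely illustrative:

```python
from statistics import mean, stdev


def is_abnormal(history, latest, sigmas=3.0):
    """Flag `latest` if it is more than `sigmas` standard deviations from
    the mean of `history` (a list of recent samples of the same metric)."""
    if len(history) < 2:
        return False  # not enough data to say what normal looks like
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != average
    return abs(latest - average) > sigmas * spread


if __name__ == "__main__":
    cpu_percent = [50, 52, 49, 51, 50]   # hypothetical recent CPU utilization
    print(is_abnormal(cpu_percent, 51))  # within the usual range
    print(is_abnormal(cpu_percent, 95))  # worth an alert
```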
Our Infrastructure lead Pablo will do a webinar of our Prometheus monitoring soon https://page.gitlab.com/20161207_PrometheusWebcast_LandingPa...
We're bundling Prometheus with GitLab https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481
Brian Brazil is helping us https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481#not...
On January 9 our Prometheus lead will join us (who was very valuable already helping with this research behind this blog post) and we're hiring Prometheus engineers https://about.gitlab.com/jobs/prometheus-engineer/
In the short term we might send our monitoring and logs to our existing cloud based servers. In the long term we'll host them on our own Kubernetes cluster.
For our monitoring vision see https://gitlab.com/gitlab-org/gitlab-ce/issues/25387
Disclaimer: Prometheus co-founder and just started as a Prometheus contractor for GitLab.
EDIT: ah, didn't reload the page to see that sytse had already responded with this :)
Just as a suggestion since you seem to be so certain about the location. I basically know every DC in Frankfurt, so if you need any help or info in that regard feel free to contact me :-)
Of course there's the issue of potential legalities for your business, but don't kid yourself that you're safe from prying eyes by deploying in a particular country.
The next step (and particularly important for a business like GitLab) is to land a second site in either western Europe or west coast US. Honestly you should be thinking about this right away, and look to sign leases on both spaces simultaneously with a 3-mo delay built in for the second site. This should help you negotiate price as well, if you're able to go with the same dc provider, but be aware that that itself is a single point of failure as well. Make absolutely sure you negotiate your MSA to the n-th degree, get a good SLA, etc. You can still get burned, but do your legal due diligence now because you won't have a chance to change terms later.
Then continue to optimize by having multiple sites per-region (so that failover doesn't involve a big performance hit), adding APAC / AUNZ regions, and so forth. For a service like GitLab, I wouldn't think that time to sync the repo is hyper important, but responsiveness of the web interface is fairly key. So that may lead to a hub/spoke design where there are a few larger sites storing the bulk of the data and more small sites to handle metadata and such to present the web views.
That's all years down the road though. For a first pass, I can't see a problem with northern Virginia as the first site.
I would assume that the .de lines are saturated with 5 eyes...
EDIT: Funny, every single documented, proven link for the last several years has proven me and GP correct and all you nay-sayers wrong.
I wouldn't make a business decision around a general feeling of paranoia. German spies aren't an improvement over American or British ones.
This is Gitlab. There should never be any data in or out except HTTPS and SSH.
I (and others I bet) would be interested in a quick summary of the options (and your opinions of them) for facilities in Frankfurt.
You basically have the big brands like Equinix (7 facilities in Frankfurt), Interxion (They are currently building their 11th facility in Frankfurt), Telecity (now part of Equinix) but also some local ones like e-Shelter, First Colo or Accelerated. There is also a DC only meters away from the Interxion campus that is called Interwerk and is basically the cheapest DC in Frankfurt, but it has an awesome price/performance ratio and you have access to all peering/transit options of Interxion (via CWDM) and can get a 1/1 rack for under 200 EUR/mo. I have been in that DC for several years and never had a single issue, so if you don't rely on any certificates this is a cheap option.
I was also colocating in the First Colo DC for some time, they are also in the locality around the Interxion campus. It's a pretty small DC but it is more premium than the Interwerk and also is kinda cheap. I personally wouldn't go with Accelerated, since they had multiple issues in the past.
For the big brands I would definitely go with Interxion, they are great, have all the certificates and are premium while not going crazy with their pricing like Equinix does. DigitalOcean, Twitch, etc. are all in one of the many Interxion facilities in Frankfurt. If price isn't a concern I would probably go with Equinix FRA5.
DE-CIX is present in around 7 facilities, 3 of them are Interxion and I think 2 Equinix.
Some financial institutions require all sensitive* data to be stored/hosted in the same country/state(archaic yes).
It's real hard to actually define sensitive data but the IP* in some code a quant wrote can totally be considered a trade secret by a non technical person
Please don't get me started on how stupid I think it is that people consider code to be IP
Definitely not archaic. If the FISA courts taught you anything, it should be that the country your things are living in determines which entities can tap your traffic/hardware.
Frankfurt may be a better option solely on the basis that it is more "environmentally" stable (in terms of events that may disrupt the operation of the facility).
(I'm from Germany, but I couldn't care less about the country per se or GitLab being co-located in Frankfurt)
You could frame infrastructure cost savings in many different ways though. The most obvious solution may seem to be the spend to move from the cloud to in house bare metal but I feel like you'll have a lot of costs that you haven't accounted for in maintenance, operational staff spend, cost in lost productivity as you make a bunch of mistakes.
Your core competency is code, not infrastructure, so striking out to build all of these new capabilities in your team and organization will come at a cost that you can not predict. Looking at total cost of ownership of cloud vs steel isn't as simple as comparing the hosting costs, hardware and facilities.
You could reduce your operating costs by looking at your architecture and technology first. Where are the bottlenecks in your application? Can you rewrite them or change them to reduce your TCO? I think you should start with the small easy wins first. If you can tolerate servers disappearing you can cut your operating costs by 1/8th by using preemptible servers in Google Cloud Platform, for instance. If you optimize your most expensive services I'm sure you can cut hunks of your infrastructure in half. You'll have made some mistakes along the way that contribute to your TCO - why not look at those before moving to steel and see what cost savings vs investment you can get there?
Ruby is pretty slow but that's not my area of expertise - I wouldn't build server software on dynamic languages for a few reasons and performance is one of them, but that's neither here nor there as you can address some of these issues in your deployment with modern tech, I'm sure. Aren't there ways to compile the Ruby and run it on the JVM for optimization's sake?
Otherwise do like Facebook and just rewrite the slowest portions in Scala or Go or something static.
Try these angles first - you're a software company so you should be able to solve these problems by getting your smartest people to do a spike and figure out spend vs savings for optimizations.
GitLab is fine software but fuck me, they need to hire someone with actual ops experience (based on this post and their previous "we tried running a clustered file system in the cloud and for some reason it ran like shit" post).
Another thing I might recommend is a third network (just a simple 1Gb network and a quality switch) for consensus. Re-replication can max the network out and further cause consensus failures, which causes more re-replication, winding down everything ... If that's not possible, add firewall rules to prioritize all consensus related ports high.
From my own personal experience, I would go with a PCIe SSD cache/write buffer, and then a primary SSD tier and a HDD tier. Storage, as it seems you guys have experienced, is something you want to get right on the first try.
N1: Yes, the database servers should be on physically separate enclosures.
D5: Don't add the complexity of PXE booting.
For Network, you need a network engineer.
This is the kind of thing you need to sit down with a VAR for. Find a datacenter and see who they recommend.
If the 8TB drives don't offer the IO we need we'll have to leave part of them empty. I assume that if you fill them with 2TB they should perform equally well. Word on the street is that Google offers Nearline to fill all the capacity they can't use because of latency restrictions for their own applications.
It is an interesting idea to do SSD cache + SSD storage + HDD instead of just SSD cache + HDD. I'm not sure if Ceph allows for this.
I'm sure it has its uses! But I'd emphatically advise anyone to obtain hands-on experience with Bcache before planning around it! As I'm sure you are already doing or considering, of course :)
Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier and infrequently accessed data to a slower tier.
By "Tiering", is this moving data between different drive types? Or by moving the data to different parts of the platter to optimize for rotational physics?
Ceph can probably explain it better than I can. http://docs.ceph.com/docs/jewel/rados/operations/cache-tieri...
I read the entire article, and while you talked about having a backup system (in a single server, no less!) that can restore your dataset in a reasonable amount of time, you have no capability for disaster recovery. What happens when a plumbing leak in the datacenter causes water to dump into your server racks? How long did it take you to acquire servers, build the racks and get them ready for you to host customer data? Can your business withstand that amount of downtime (days, weeks, months?) and still operate?
These questions are the ones you need to be asking.
In other words, double your budget because you'll need a DR site in another datacenter.
Edit. If Ceph is smart enough, it would be aware of the tiers present on the node, and would tier blocks on that node. So a Block on node A will stay on node A.
If you are talking physically, 2TB/8TB would theoretically be faster than a 2TB/2TB disk if the performance attributes were the same. But a 2TB HDD will have a faster average seek time than an 8TB HDD due to physical design. Any performance gains of only partially filling a drive would probably be offset by the slower overall drive.
I'm skeptical here. So your minimum seek time goes up on the 8TB because it's harder to align. But your maximum seek time should drop tremendously because the drive arm only has to move along the outer fifth of the drive. And your throughput should be great because of the doubled linear density combined with using only the outer edge.
A quick google search shows that there are marginal gains on the outer track vs the inner, but that is only on sequential workloads. For something like GitLab, the workloads would be anything but.
1. If I look at 2TB vs. 8TB HGST drives, their seek times are 8ms and 9ms respectively. But if you're only using a quarter of the 8TB drive, the drive arm needs to move less than a quarter as much. Won't that save at least 1ms?
2. The 8TB drive has a lot more data per inch, and it's spinning at the same speed. Once a read or write begins, it's going to finish a lot faster.
3. Here's a benchmark putting 8TB ahead in every single way http://hdd.userbenchmark.com/Compare/HGST-Ultrastar-He8-Heli...
In my experience, the PCIE drives they mention (the DC P3700) are incredible. They're blazing fast (though the 3D X-Point stuff is obviously faster), have big supercaps (in case of power failure), and really low latency for an SSD (150 microseconds or so). They're a pretty suitable alternative to in-memory logbuffers where latency is less crucial. Much cheaper than the equivalent memory too. Having a few of these in front of a RAID 10 made up of 8TB drives will work just fine.
FWIW your experience with a RAID 6 of big disks is unsurprising. Raid 6 is almost always a bad choice nowadays since disks are so big. Terrible recovery times and middling performance.
I would think that GitLab's workload is mostly random, which would pose a problem for larger drives. The SSDs are a great idea, but I've only seen 8TB drives used when there are 2 to 3 tiers; with 8TB drives being all the way on the bottom. I'm not sure how effective having a single SSD as a cache drive for 24TBs of 8TB disks will be.
Oops that's reliability, I can jump the gun sometimes, sorry.
A few things to watch out for, if your team doesn't have a lot of experience in direct management of dedicated servers/datacenters:
- There is an increased risk of larger/longer very rare outages due to crazy circumstances. Make sure you have a plan to deal with that. (I've had servers hosted at a datacenter that flooded in Hurricane Sandy, and another where an electrical explosion caused a >3 day outage..)
- It's easy to think you'll rely on managed services, but that rarely works out well. It also can become very, very expensive -- possibly more so than cloud-based hosting.
- Specifically, regarding H1, H2: Dedicated hardware is substantially cheaper than cloud-based hosting, but if you rely too much on managed services you negate a large portion of the savings. Consider that most service providers will be both more expensive and less competent than doing it yourselves. Also, having your own team have direct knowledge + their own documentation of the setup will be beneficial.
- I'd recommend budgeting for and ordering some extra parts to keep on hand for replacement, if you are having datacenter ops handle hardware or can have an engineer located relatively close to your datacenters. (A few power supplies, some memory, a couple drives - nothing too crazy.)
- Supermicro's Twin systems are great. In the past I've gone with their 1U models vs. 2U to slightly reduce the impact of unit downtime. (Having to take one 2U Twin down affects four nodes. It sounds like you'll have to decide on balancing that against the increased drive capacity, in your case.)
In the long run, probably. Immediately after deploying, unless they hire very experienced people, I expect quite a few "never seen before" issues (which may not result in publicly visible downtime though). Monitoring for "thermal events", weird and hard-to-debug issues requiring firmware updates, "bad cable" issues, etc. are not things you have to deal with in the cloud.
I think you could probably get the IO performance you're talking about in your blog post from AWS instances or Google Cloud's local NVMe drives, but if you truly need baremetal, I'd recommend Packet or Softlayer. Don't try to run your own infrastructure or in a year you'll be: https://imgflip.com/i/1fs7it
Lower your SLA requirements and go with multiple providers like OVH. Make your site work at multi datacenters. At the end of the day your users will be much happier.
My 2 cents.
Stay away from the 8TB drives. Performance and recovery will both suck. 4TB drives still give the best cost per GB.
Why are you using fat twins? Honestly, what does that buy you? You need more spindles, and fewer cores and memory. With your current configuration, what are you getting per rack unit?
Consider a 2028u based system. 30 of those with 4TB drives gets you the 1.4PB raw storage you're looking for. 2683v4 processors will give you back your core count, yielding 960 cores (1920 vCPUs) across that entire set. You can add a half terabyte of memory or more per system in that case.
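The arithmetic behind those totals can be checked with a rough sketch. The two-socket layout and 16 cores per 2683v4 are assumptions chosen because they reproduce the 960-core figure; the drives-per-system number is derived from the 1.4PB target, not quoted from any spec sheet.

```python
import math

def fleet_cores(systems, sockets_per_system, cores_per_cpu, threads_per_core=2):
    """Return (physical cores, hardware threads) across the whole fleet."""
    cores = systems * sockets_per_system * cores_per_cpu
    return cores, cores * threads_per_core

def drives_per_system(target_raw_tb, systems, drive_tb):
    """Smallest whole number of drives per system that reaches the raw target."""
    return math.ceil(target_raw_tb / (systems * drive_tb))

# Assumed: 2 sockets per system, 16 cores per CPU, 2 threads per core.
cores, vcpus = fleet_cores(systems=30, sockets_per_system=2, cores_per_cpu=16)

# 1.4PB raw taken as 1400TB (decimal units).
drives = drives_per_system(target_raw_tb=1400, systems=30, drive_tb=4)

print(cores, vcpus, drives)
```

Changing `drive_tb` or `systems` shows quickly how the spindle count moves if you pick a different drive size.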
Sebastien Han has written about "hyperconverged" ceph with containers. Ping him for help.
The P3700 is the right choice for Ceph journals. If you wanted to go cheap, maybe run a couple of M.2 NVMe drives on adapters in PCIe slots.
I didn't really need the best price per GB in my setup, so I went with 6TB HGST Deskstar NAS drives. I'm suggesting you use 4TB as you need the IOPS and aren't willing to deploy SSD. Those particular drives have 5 platters and a relatively high areal density, giving them some of the best throughput numbers among spinning disks.
If you can figure out a way to make some 2.5" storage holes in your infrastructure, the Samsung SM863 gives amazing write performance and is way, way cheaper than the P3700. I recently picked up about $500k worth, I liked them so much. They run around $.45/GB. Increase over-provisioning to 28% and they outperform every other SATA SSD on the market (Intel S3710 included).
You'll probably want to use 40GE networking. I've not heard good things about Supermicro's switches. If I were doing this, I'd buy switches from Dell and run Cumulus linux on them.
Treat your metal deployment like IaaC just like any cloud deployment. Everything in git, including the network configs. Ansible seems to be the tool of choice for NetDevOps.
We're considering the fat twins so we get both a lot of CPU and some disk. GitLab.com is pretty CPU heavy because of ruby and the CI runners that we might transfer in the future. So we wanted the maximum CPU per U.
The 2028u has 2.5" drives. For that I only see 2TB drives on http://www.supermicro.com/support/resources/HDD.cfm for the SS2028U-TR4+. How do you suggest getting to 4TB?
If you do somehow manage to pick the perfect disk, sure, having everything from a single batch would be best, since that'll ensure you have the longest MTBF. But how sure are you that you'll be picking the perfect batch simply by blind luck?
That said, I bought the same 6TB HGST disk for two years.
But when you're buying 100% of your disk inventory at once there's a serious "all eggs in one basket" risk.
As for CPU density, I still feel like you're going to need more spindles to get the IO you're looking for.
4 x 208v 30A circuits gives a total of 120A -- of which they can only be used at 80% capacity, so that gives you 96A usable -- without redundancy.
My initial feelings (eyeballing it) are you should be looking for 3 full racks, 2x208V@30A in each.
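To turn the 4 x 208V 30A figure into watts and a server count, here is a minimal sketch. It assumes single-phase circuits and the 80% continuous-load derating mentioned above; the 450W per-server draw is a made-up placeholder that you would replace with a measured worst-case number.

```python
def usable_watts(volts, breaker_amps, circuits=1, derate=0.8):
    """Continuous power available from single-phase circuits after derating."""
    return volts * breaker_amps * derate * circuits

def servers_that_fit(server_watts, volts, breaker_amps, circuits=1, derate=0.8):
    """Whole servers that fit at worst-case draw."""
    return int(usable_watts(volts, breaker_amps, circuits, derate) // server_watts)

# Four circuits, no redundancy: 120A nominal, 96A usable.
total_w = usable_watts(208, 30, circuits=4)

# A/B feeds: size the load so that ONE feed can carry everything if the other is lost.
PLACEHOLDER_SERVER_WATTS = 450  # hypothetical measured peak per server
per_rack_redundant = servers_that_fit(PLACEHOLDER_SERVER_WATTS, 208, 30, circuits=1)

print(round(total_w), per_rack_redundant)
```

With two 208V 30A feeds per rack and full redundancy, only one feed's worth of capacity counts, which is why the redundant case above uses `circuits=1`.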
As a juniper shop, we implemented the 2xQFX 5100 48T (in virtual chassis) + ex4300 for remote access per rack. This will be a decision based on your local expertise though.
I also looked hard into the twin boxes -- but power to the rack in the end ruled the day, and it made not much of a difference to use the 1U boxes.
Don't forget about out of band (serial for the switching gear). We've been using OpenGear.com for this stuff with 4G-LTE builtin.
VPN access to access console devices?
Any site-to-site VPN needs?
Also, I would consider not pre-purchasing N times your required horsepower/disk if you can avoid it, but rather add in pre-planned yearly, or biannually stages.
As the CEO of a cloud hosting & server management company -- I have much more to say about this if you would like to chat via phone or email anytime.
1) Contributors have not been vetted - Some responses are based on real world experience, and some are conjecture from arm-chair quarterbacks. (A simple example would be that nobody has mentioned with any of the SuperMicro 2U Twins that you have to be cautious about the PDU models and outlet locations of 0U PDUs to not block node service/replacement in the rack)
2) There are multiple ways to skin a cat - There are many viable solutions in this thread, but you can't simply take a little bit of this, a little bit of that, and piece together a new platform that "just works." -- Better to go with a known working solution than a little of this and a little of that. Multiple drivers tend to be less effective.
I am the owner of a bespoke Infrastructure as a Service provider that delivers solutions to sites of similar metrics, and speak with plenty of real-world experience.
Larger Providers - We find a lot of clients move away from AWS, Softlayer, Rackspace, et al. as the larger providers aren't nearly as interested in working with the less-than-standard configurations. They want highly-repeatable business, not one-off solutions.
I'd love to talk in more depth with you about how we can deliver exactly what you need based on years of experience in delivering highly customized solutions. We'll save you money and headache.
Average is not the appropriate word, but I'll use your word. I, like every enterprise out there, would rather have average advice based on tried and true solutions that cost 10% more with 99.99 to 99.999% availability than a cutting-edge setup that saves 10-20% with 99.8% availability. The downtime alone can (and does) kill the reputations of sites like GitLab.com (and others).
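For a sense of scale, those availability percentages translate into yearly downtime as follows. This is plain arithmetic over a 365-day year, nothing more.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year for a given availability percentage."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR

for pct in (99.8, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} min/year ({minutes / 60:.2f} hours)")
```

The gap is large: 99.8% allows roughly 17.5 hours of downtime a year, 99.99% about 53 minutes, and 99.999% just over 5 minutes.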
This definitely increases software complexity but going the other way increases other complexities (ops, capex, capacity).
You need at least two nodes that do DNS, DHCP, NTP, and other miscellaneous services that you absolutely want to have but do not seem to have mentioned. You want them to be permanently assigned, so that you never have to search for them, and you want them to fail over each service as needed, preferably both operating at once. Three nodes would be better. Consider doing some basic monitoring on these nodes, too, as a backup to your main monitoring system.
1&2. Authoritative DNS (internal)
1&2. Caching DNS resolver
3&4. Outbound HTTP proxy (if necessary)
3&4. PXE / installer / dhcp
3&4. Local software mirror (apt / yum / etc.)
5&6. SSH jump hosts.
Or something like that.
Make sure the second host is not just in a separate chassis from the first host, but also in a separate rack.
For external authoritative DNS, don't do it yourself -- pay for someone to run that for you (Route53, ns1, etc.). For e-mail, if you can possibly not deal with it then don't -- use Mailchimp or something.
The DNS records for the internal records are done using the kubernetes middleware (basically serving the service records).
The external records are pulled in from a git repository hosting our zones as bind files. If need be zones are split into subzones per team/project. Same permission system as our code via MRs using Gitlab.
Our recommendation is build on open standards (BIND, AXFR) and use services on top of these.
I agree that using an external mail provider is usually a good idea. It mostly is your fallback communication channel and is usually easy to switch (doing replication to an offsite mail storage needs to be done to make switching easy/possible/fast). MX records \o/
SuperMicro has a 4 node system available in which each node gets 6 disks, 2028-TP-xx-yyy. Get two of these, populate 2 nodes on each as your databases, and you can grow into the other spaces later. Run your databases on all-SSD; store backups on spinners elsewhere.
Having two node types is not a calamity compared to having just one.
So the proposal now is to have one chassis (excluding the backup system) and two types of nodes (normal and database).
Regarding the 2028 please see the conversation in https://news.ycombinator.com/item?id=13153853
I don't know much about the Ruby web server landscape, but might Puma (http://puma.io/) be better?
The 2U fat twins are very dense. You'll need to make sure that the hosting company can actually cool them effectively.
I'd look again at your storage strategy. You'll save a lot of money, and it'll be much easier to debug if you dump ceph.
You have a clustered FS when you appear to not really need it. By the looks of it, it's all going to be hosted in the same place. So what you have is lots of CPU to manage not very much disk space. The overhead of Ceph for such a tiny amount of data all in the same place seems pointless. You can easily fit your data set on one 90 disk shelf 4 times over with 100% redundancy.
First things first, File servers serve files and nothing else. This means that most of your ram is going to be a file server cache. Putting applications on there is just going to mean that you're hitting the disk more often, and stealing resources from other apps in a non transparent way.
Get four 90-drive superchassis and connect them to four servers (via SAS, direct connect) with as many CPUs and as much RAM as possible. JBOD them, then either use ZFS or hit up IBM for GPFS (well, IBM Spectrum Scale). You can invest in a decent SAS controller and run RAID 6 across 8-10 disk stripes, but rebuild time will be very high. What's your strategy for disk replacement? 400 disks means about 1 every four weeks will be failing.
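The "1 every four weeks" estimate implies a particular annualized failure rate (AFR). A rough sketch of that relationship, assuming failures are spread evenly over the year (real fleets tend to see them clustered by batch and age):

```python
WEEKS_PER_YEAR = 52

def implied_afr(disks, weeks_between_failures):
    """Annualized failure rate implied by one failure every N weeks."""
    return (WEEKS_PER_YEAR / weeks_between_failures) / disks

def weeks_between_failures(disks, afr):
    """Average gap between failures for a fleet with the given AFR."""
    return WEEKS_PER_YEAR / (disks * afr)

afr = implied_afr(disks=400, weeks_between_failures=4)
print(f"implied AFR: {afr:.2%}")

# Same AFR applied to four 90-drive shelves (360 disks).
print(f"360 disks: one failure every {weeks_between_failures(360, afr):.1f} weeks")
```

Plugging in the AFR your drive vendor or your own history suggests gives a replacement cadence to plan spares and remote-hands visits around.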
What is your workload like on the disks? is it write heavy? read heavy? lots of streaming IO? or random? can it be cached?
Your network should be simple. Don't use Cat6 10gig. It's expensive and not very good in dense racks. Just use copper cables with inbuilt SFPs; they are cheap and reliable.
Don't use Super micro switches. They are almost certainly re-badged.
Your network should be fairly simple. A storage VLAN, an application VLAN, and a DMZ. (Out of band will need a strongly protected VLAN as well.)
On the buying side, you need a reseller who is there to get the best deal. You'll need to audition them. They will also help with design. But you need to be harsh with them: they are not your friends, they are out to make money.
Redhat, if you're out there: Now would be a good time to chime in about the limits of Ceph and the reasonable size of a filesystem.
Here's Part 1 of the series: http://nickcraver.com/blog/2016/02/17/stack-overflow-the-arc... (the other parts are linked from there)
Noooooo. Do not go for Supermicro for anything where the operating system matters. Their software quality leaves a lot to be desired.
You might want something Cumulus Linux supports, since it seems like you have a lot more Linux experience than networking...
Bcache is mentioned in the article under the disk header.
Perhaps look into how to have the nodes split between 2 or 3 different datacenters?
It is pretty easy to build such middleware both for the git over ssh (a simple script performing a lookup of where the shard is and then you connect to shard to operate there) and just a little bit more for the http part. At the webapp level, you will have a kind of RPC to run the git related operations which will connect to the right shard to run the operations.
When you use Ceph you are basically running a huge FS at the full scale of your GitLab installation, but practically, you have many independent datasets within your GitLab installation and you do not need to pay the cost for the global consistency of Ceph. You have many islands of data.
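A minimal sketch of that kind of lookup middleware, to make the idea concrete. Everything here is hypothetical: the shard hostnames, the `ShardDirectory` class, and the in-memory assignment table (a real deployment would keep the mapping in a database so it survives restarts and allows repositories to be moved).

```python
import hashlib

# Hypothetical shard hosts; each one holds an independent island of repositories.
SHARDS = ["git-shard-01.internal", "git-shard-02.internal", "git-shard-03.internal"]

class ShardDirectory:
    """Maps a repository path to the shard that stores it."""

    def __init__(self, shards):
        self._shards = list(shards)
        self._assignments = {}  # repo path -> shard host

    def locate(self, repo_path):
        # New repositories get a stable hash-based placement, recorded on first use.
        if repo_path not in self._assignments:
            digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self._shards)
            self._assignments[repo_path] = self._shards[index]
        return self._assignments[repo_path]

    def move(self, repo_path, shard):
        # Rebalancing only updates the directory; copying the data is out of scope here.
        if shard not in self._shards:
            raise ValueError(f"unknown shard: {shard}")
        self._assignments[repo_path] = shard

def ssh_command_for(directory, repo_path, git_command):
    """Build the argv an SSH front end could exec to forward a git operation."""
    return ["ssh", directory.locate(repo_path), git_command, repo_path]

directory = ShardDirectory(SHARDS)
print(ssh_command_for(directory, "group/project.git", "git-upload-pack"))
```

Because the assignment is recorded rather than recomputed, adding a shard later does not silently relocate existing repositories; only an explicit `move` does.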
Edit: Typos/missing words.
I do; that's my default scenario. If you can survive that, you can survive all sorts of smaller issues like network congestion, data center power problems, grid power problems, and zombie plagues (or flu, which is more likely.)
But that's a true emergency situation. Don't go offline for multiple days for something that's reasonably likely to happen.
- Shouldn't there be a puppet / chef / whatever deployment coordinator in there somewhere?
- There's no mention of a virtualisation environment. While it's not a hardware issue really, all the extra services mentioned before will not take the whole server and you'll want to collocate some of them. (maybe even some of the main services too, if the resource usage on real hardware turns out wildly different than the estimates) If the choice is KVM, great. But if it's VMWare, you want to include that in the cost. (and the network model)
- The "staging x 1" is interesting... so what happens when you need to test a new version of postgres before deployment? You can't pretend that a test on one server (or 4 virtual ones) will be comparable to actual deployment, especially if you need to verify the real data performance.
- "Backing up 480TB of data [...] with a price of $0.01 per GB per month means that for $4800 we don't have to worry about much." - This makes a really bad assumption of X B of data produces X B of backup. That's a mirror, not a backup - it won't protect you from human errors if you overwrite your only mirror. For a backup you need to have actual system of restoring the data and a plan for how many past versions you want to store. On the other hand, gitlab data seems to be mostly text files - that should compress amazingly well, so that should also help with the restore speed.
- Network doesn't seem to mention any out of band access (iLO, or similar). That's another port per node required.
- Because tech is tech, I expect both the staging and spare servers to be repurposed soon to be used for "that one role we forgot about". One each is really not enough. (seen it happen so many times...)
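On the backup point in the list above: the quoted $4800 figure is what a single copy costs. A rough sketch of how retention changes the bill, where the retention policy (two fulls plus 30 daily incrementals at a 1% daily change rate) is an invented example and the compression ratio is left at 1.0 unless you have measured otherwise:

```python
def monthly_backup_cost(dataset_tb, price_per_gb_month, retained_fulls=1,
                        incrementals=0, daily_change_fraction=0.0,
                        compression_ratio=1.0):
    """Monthly storage cost in dollars for fulls plus incrementals (decimal TB/GB)."""
    dataset_gb = dataset_tb * 1000
    fulls_gb = dataset_gb * retained_fulls
    incrementals_gb = dataset_gb * daily_change_fraction * incrementals
    return (fulls_gb + incrementals_gb) / compression_ratio * price_per_gb_month

# Single mirror, as in the quoted estimate.
mirror = monthly_backup_cost(480, 0.01)

# Hypothetical policy: 2 fulls + 30 daily incrementals, 1% of data changing per day.
versioned = monthly_backup_cost(480, 0.01, retained_fulls=2,
                                incrementals=30, daily_change_fraction=0.01)

print(round(mirror), round(versioned))
```

Under those assumed numbers the versioned policy costs a bit more than twice the mirror, before counting restore bandwidth or request fees.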
- We want to use Kubernetes instead of virtualization.
- We provisioned 2 spare database servers for that reason.
- Git already compresses files with zlib so I'm not sure we can compress it much further https://git-scm.com/book/uz/v2/Git-Internals-Packfiles
- The servers have a separate management port and "Apart from those routers we'll have a separate router for a 1Gbps management network."
- I agree we'll probably need more spare servers.
It was followed with "For example to make STONITH reliable when there is a lot of traffic on the normal network". That sounded like you want the heartbeats on it, not out of band management.
Google might do some really cool shit with Kubernetes, but they are Google. I don't think using them as a good example for anything but infrastructure on a massive scale is correct. They are years ahead of everyone else, and if they are doing something in production, they have been testing it for years. If shit hits the fan, they have thousands of employees to throw at the problem. GitLab is building a team, and therefore does not have the experience to know these systems inside and out. In my view, using Docker/Kubernetes is adding unnecessary complexity to the database fabric for minimal tradeoffs.
And considering that they are speccing out dedicated database servers, it makes no sense to add an unneeded layer of abstraction when in all likelihood the container will be bound to a node.
It's refreshing to see that a few quite important players, e.g. Dropbox, are moving away from having everything in public clouds, in contrast to others such as Netflix that go all in. Looks like on-premise is not dead yet.
In the longer term the storage appliance will lock us in and will get very expensive. I've heard pretty bad stories of companies betting on it. Especially with many small files like us (IOPS heavy).
And one goal of GitLab.com is to gain experience that we can reuse at our customers. Most of our customers use a storage appliance now but are interested in switching to something open source.
You also now need to gain internal expertise in networking, security, datacenter operations, and people who can rack and stack well.
> ...Even with RAID overhead it should be possible to have 480TB of usable storage (66%).
> Fortunately, with a technique called erasure coding, we can. Reed-Solomon error correction codes are a popular and highly effective method of breaking up data into small pieces and being able to easily detect and correct errors. As an example, if we take a 1 GB file and break it up into 10 chunks of 100 MB each, through Reed-Solomon coding, we can generate an additional set of blocks, say four, that function similar to parity bits. As a result, you can reconstruct the original file using any 10 of those final 14 blocks. So, as long as you store those 14 chunks on different failure domains, you have a statistically high chance of recovering your original data if one of those domains fails.
Facebook didn't release the system they used to do this. I can see two reasons why not to: desire for competitive edge; or the implementation not being a general-purpose solution.
Considering Facebook's general openness, I say get in touch, just in case! It's quite possible that you might be able to figure out something interesting.
I suspect the reason the system wasn't released was due to the latter case - it seems to be technically quite simple and easily achievable for a[ny] company full of algorithms Ph.Ds.
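To see why the 10-of-14 scheme quoted above is attractive, compare its raw storage overhead with plain replication. This sketch only does the capacity arithmetic; it does not implement Reed-Solomon coding, and treating 3x replication as "1 data chunk + 2 extra copies" is just a convenient way to reuse the same formula.

```python
def raw_tb_needed(usable_tb, data_chunks, parity_chunks):
    """Raw capacity required to store usable_tb with k data + m redundancy chunks."""
    return usable_tb * (data_chunks + parity_chunks) / data_chunks

def efficiency(data_chunks, parity_chunks):
    """Fraction of raw capacity that holds user data."""
    return data_chunks / (data_chunks + parity_chunks)

USABLE_TB = 480  # the usable target quoted above

schemes = {
    "erasure 10+4": (10, 4),     # survives loss of any 4 chunks
    "3x replication": (1, 2),    # survives loss of any 2 copies
}
for name, (k, m) in schemes.items():
    print(f"{name}: {raw_tb_needed(USABLE_TB, k, m):.0f} TB raw, "
          f"{efficiency(k, m):.0%} efficient, tolerates {m} losses")
```

The trade-off not shown here is cost at read and repair time: rebuilding a lost chunk means reading from 10 other failure domains, which matters for an IOPS-heavy workload.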
We're set up for 38kw [a] of possible gross wattage using the dual-node blades with the Xeon D-1541. The amazing thing about the D-1541 blades is we get around 100W per server, with 8 hyperthreaded cores and a 3.84T SSD. With the 6U chassis, you have 28 blades with 2 nodes each - 56 medium sized servers for 5.6kw in 6U - under a kw per RU.
For your 70 some server workloads, I'd recommend using the microblades.
For your higher level workloads, I think the SuperMicro TwinNode makes sense.
Be very very careful about Ceph hype. Ceph is good at redundancy and throughput, but not at iops, and rados iops are poor. We couldn't get over 60k randrw iops across a 120 OSD cluster with 120 SSDs.
For your NFS server, I'd recommend a large FreeNAS system, put a big brain in it and throw spinning platters.
Datacenters can/will do your 30kw
 - https://www.supermicro.com/products/nfo/2UTwin2.cfm
 - https://www.supermicro.com/products/MicroBlade/
 - https://www.supermicro.com/products/MicroBlade/module/MBI-62...
[a] - Although we have 38kw of possible power there, it's practically well under the 27kw we'll get with 4x208v@60A PDUs at 80%.
Separately I'm using a FreeNAS controller with 4 SAS HBAs supporting 3 JBODs with 45 8TB HGST He8 near line SATA disks each (135 disks or ~1PB) for backups and slow data.
I can understand the need for performance but if it were my business I would have taken a significantly different approach.
Hopefully IPv6 shows up somewhere in the stack. It's sad to see big players not using it yet.