1. Talk to your hosting providers and make sure they can support 32kW (or whatever max number you need) in a single rack, in terms of cooling. At many facilities you will have to leave empty space in the rack just to stay below their W per sq ft cooling capacity.
2. If you're running dual power supplies on your servers, with separate power lines coming into the rack, model what will happen if you lose one of the power lines, and all of the load switches to the other. You don't want to blow the circuit breaker on the other line, and lose the entire rack.
3. Thinking about steady state power is fine, but remember you may need to power the entire rack at full power in the worst case. Possibly from only one power feed. Make sure you have excess capacity for this.
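The failover scenario in point 2 is easy to sanity-check with a few lines of code. A minimal sketch, assuming illustrative wattages and the common 80% continuous-load breaker rule (verify the actual rule with your facility and local electrical code):

```python
# Sketch: can one power feed carry the whole rack if the other fails?
# All numbers here are illustrative examples, not recommendations.

def feed_survives_failover(per_server_watts, server_count,
                           feed_volts=208, feed_amps=30,
                           continuous_derate=0.8):
    """Return True if a single feed can carry the full rack load.

    continuous_derate reflects the common 80% rule for breakers
    on continuous loads (check your local electrical code).
    """
    usable_watts = feed_volts * feed_amps * continuous_derate
    worst_case_watts = per_server_watts * server_count
    return worst_case_watts <= usable_watts

# Example: 20 servers at a measured 350W worst-case draw each
# against a single 208V/30A feed (4992W usable after derating).
print(feed_survives_failover(350, 20))  # False: 7000W > 4992W
```

If this returns False, losing one feed trips the breaker on the other and takes down the entire rack.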
The first time I made a significant deployment of physical servers into a colo facility, power and cooling was quite literally the last thing I thought about. I'm guessing this is true for you too, based on the number of words you wrote about power. After several years of experience, power/cooling was almost the only thing I thought about.
Build one node with the hardware configuration you intend to use. Same CPU, ram, storage.
Put it on a watt meter accurate to 1W.
Install Debian amd64 on it in a basic config and run 256 threads of the 'cpuburn' package while simultaneously running iozone disk bench and memory benchmarks.
This gives you the absolute maximum load of the server, in watts, when it is running at 100% on all cores, memory I/O, and disk I/O.
Watts = heat, since all electricity consumed in a data center is either being used to do physical work like spinning a fan, or is going to end up in the air as heat. Laws of physics. As whatever data center you're using will be responsible for cooling, this is not exactly your problem, but you should be aware of it if you're going to try for something like 12 kW of density per rack.
Then multiply the wattage of your prototype unit by the number of servers to get the total rack load, and divide the capacity of a 208V 30A circuit by the per-unit wattage to see how many fully loaded systems will fit on one circuit on the AC side.
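That arithmetic can be sketched directly; the 350W per-server figure below is a made-up example standing in for the number you measured on the watt meter:

```python
# Sketch: how many fully loaded servers fit on one 208V/30A AC circuit.
# The measured per-server wattage would come from the watt-meter burn-in
# test described above; 350W is an invented example.

CIRCUIT_VOLTS = 208
CIRCUIT_AMPS = 30
DERATE = 0.8  # common 80% continuous-load rule; verify with your facility

def servers_per_circuit(measured_peak_watts):
    usable = CIRCUIT_VOLTS * CIRCUIT_AMPS * DERATE
    return int(usable // measured_peak_watts)

print(servers_per_circuit(350))  # 4992W usable -> 14 servers
```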
Also, use the BIOS power-recovery option to stagger boot times by 10 seconds per server, so that after a total power loss an entire rack of servers does not attempt to power up simultaneously.
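The effect of that stagger can be modeled; the 30-second inrush duration below is an assumption for illustration, not a measured figure:

```python
# Sketch: staggered power-on delays, and the worst-case number of
# servers drawing inrush power at once given that stagger. The
# inrush duration is an assumed example, not a measurement.

def boot_delays(server_count, step_seconds=10):
    """Delay (in seconds) to program into each server's BIOS."""
    return [i * step_seconds for i in range(server_count)]

def max_concurrent_startups(server_count, step_seconds=10,
                            inrush_seconds=30):
    """Worst-case number of servers in their inrush window at once."""
    # Only servers started within the last inrush_seconds overlap.
    return min(server_count, -(-inrush_seconds // step_seconds))

print(boot_delays(5))                       # [0, 10, 20, 30, 40]
print(max_concurrent_startups(40, 10, 30))  # 3
```

Without the stagger (step of 0), all 40 servers hit their inrush peak simultaneously; with it, only a handful overlap at any moment.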
That brings me to my next point: GitLab should also be mindful of which services are stored on which hardware; performing heroics to work around circular dependencies is the last thing you want to be doing when recovering from a power outage.
Depends a lot on your server/HA/software architecture.
I've only ever built desktop machines, and this top comment drew a surprising parallel to most "help me with my desktop build" type posts. Granted, I'm sure as you dig deeper the reasoning may be much different, but being ignorant about a proper server build myself, it was somehow reassuring to see power and cooling at the top!
 - http://www.chatsworth.com/products/cabinet-and-enclosure-sys...
Anywhere that is topologically very close (at OSI layer 1) to a major traffic interexchange point, you will definitely be paying somehow for the monthly cost of the square footage occupied. For example, a colo in 60 Hudson or 25 Broadway in NYC, or in One Wilshire in LA.
Large-scale colocation pricing that is based on power only will be found in locations that are not also major traffic interexchange points. For example, Quincy, WA, or the many data centers in suburban New Jersey.
But for other sites with high-value peering, including EQIX in Ashburn and Coresite, your limiting factor is likely to be the power, not the space. I.e. they'll "give" you 1 cabinet for every 15kW you buy.
So my assertion assumes you're doing large number of dense cabinets.
If one is doing a large number of dense cabinets, they almost certainly should not do it at a high value peering point, and should backhaul it. You should be able to get diverse metro dark fiber for <$5k/mo if not substantially less. Put a single cabinet (or pair for redundancy) of $2k/mo cabinets in Equinix or Fisher and off you go.
Hope that helps. Your deployment is larger than ours, so there may be other techniques, but that's what we did.
I.e. you would also want to be sure that your wiring was rated for 100% utilization, and that other circuit-breaker-like functions exist.
Fire is an actual thing, and figuring out the best way to recharge a halon system isn't exactly what you want to be doing.
When you hardwire the circuit the electrical code allows you to use a 100% breaker.
This is less common in DCs historically, but increasingly so as folks do 208V 3-phase 100A circuits.
For a deployment of this scale it should be metered power (for example, one or more 3-phase A+B drops to each cabinet) where you only pay a non-recurring setup charge (NRC) and the MRC is based on actual power draw.
3-phase also means fewer physical PDUs (less space used), but more physical breakers. Over-building delivery capability will eliminate any over-draw concerns for startup cycles.
Since my cabinet number is usually evenly divisible by N*PDUs, this impacts overall capital.
Having a little headroom on your power circuits is also incredibly important, and not every facility will sell 100%-rated breakers. It may make more sense to be in a facility with 80%-rated breakers than 100%, even with the added capex of an extra PDU or two.
Goes back to my previous comment. What is important to you, at the pod / multiple pod level, isn't as important to the 1-2 cab deployment.
Similarly, ensure spare room in the cabinet for adjustments, that thing you forgot, and small growth. Much better to be 70% full and not need the space than to have no free RU and need it.
I would recommend verifying everything is fault tolerant/HA as expected every step of the way. We ran into issues where the power strips on both sides were plugged into the same circuit (d'oh), wrong SSTs, redundant routers cabled up to the same power strips, and so on; you name it.
After a rack is set up, have people at the DC (your own employees or the DC's techs) help simulate (create) failures in power, networking, and systems to verify everything is set up correctly. It sounds like you have people coming onboard with experience provisioning/delivering physical systems though, so I would expect them to be on the ball with most of this stuff.
A system at idle vs. full CPU vs. full CPU + all disks will produce very different measurements.
Also keep in mind 80% derating - many electrical codes will state that an X amp circuit should only be used at 0.8X on a continuous basis (and the circuit breakers will be sized accordingly).
I didn't know half of the stuff in the grandparent's post.
My underlying assumption is that this is a production service with customers depending on it.
1. Don't fuck with networking. Do you have experience operating the same or similar workloads on your Supermicro SDN? Will the CEO of your Supermicro VAR pick up his phone at 2 AM when you call?
My advice: Get a quote from Arista.
2. Don't fuck with storage.
32 file servers for 96TB? Same question as with networking re:ceph. What are your failure domains? How much does it cost to maintain the FTEs who can run this thing?
3. What's the service SLA on the servers? Historically, supermicro VARs have been challenged with that.
If I were building this solution, I'd want to understand what the ROI of this science project is as compared to the current cloud solution and a converged offering like Cisco/NetApp, HP/3Par or something like Nutanix. You're probably saving like 20-25% on hardware.
This sounds to me like pinching pennies on procurement and picking up tens of dollars of cost on the labor side. If you're accustomed to SLAs from AWS, this will be a rude awakening.
I think they are coming at this problem from the wrong perspective - instead of growing from virtual servers to their own dedicated hardware to get better CephFS performance, they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS. At some point in the future, it will bite them. Not an if, but a when.
In addition, this means that running physical hardware, CephFS and Kubernetes (among other things) are now going to be part of their core competencies - I think they are going to underestimate the cost to GitLab in the long run. When they need to pay for someone to be a 30 minute drive from their DC 24/7/365 after the first outage, when they realize how much spare hardware they are going to want around, etc.
As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.
As someone who administers GitLab for my company, yes please.
Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us. (We've found mounting NFS across different datacenters to be too unreliable and our failover scenarios include loss of a data center.)
It would also be wonderful to be able to run GitLab in any of the PaaSes that only have ephemeral disk, but that's a secondary concern.
What are the alternatives?
I suppose there's MySQL's "stream asynchronously to the standby at the application level."
... which, now that I think about it, should be pretty easy to do with Git, since pushing and pulling discrete units of data are core concepts...
I mean, especially because git's whole model of object storage is content-addressable and immutable, it looks like it's a prime use for generic object storage.
Latency and consistency would be my concerns; S3 does not quite have the right semantics for some of this, so you'd have to build a small shim on top to work around it. Ceph's RADOS doesn't even have these problems, so it is quite a good contender.
And considering the usual compressed size of commits and many text files, you're going to have more HTTP header traffic than actual data if you want to do something like a rev-list.
I would imagine any implementation that used S3 or similar as a backing store would have to heavily rely on an in-memory cache for it (relying on the content-addressable-ness heavily) to avoid re-looking-up things.
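A read-through cache over a content-addressed store is pleasantly simple precisely because the objects are immutable. A minimal sketch; `fetch_from_store` is a hypothetical stand-in for an S3/RADOS GET, not a real client call:

```python
import hashlib

# Sketch: a read-through cache over a content-addressed object store.
# Because git objects are immutable and keyed by content hash, a cached
# entry can never go stale, so eviction is the only hard problem left.

class ObjectCache:
    def __init__(self, fetch_from_store):
        # fetch_from_store(key) -> bytes stands in for an S3/RADOS GET.
        self._fetch = fetch_from_store
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
        return self._cache[key]

    @staticmethod
    def key_for(data: bytes) -> str:
        # git actually hashes a "type length\0" header plus the content;
        # plain SHA-1 over the bytes is shown here for simplicity.
        return hashlib.sha1(data).hexdigest()

# Example with an in-memory dict playing the role of the object store:
store = {ObjectCache.key_for(b"hello"): b"hello"}
cache = ObjectCache(store.__getitem__)
k = ObjectCache.key_for(b"hello")
print(cache.get(k) == b"hello")  # True
```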
I wonder how optimized an object store's protocol would have to be (http2 to compress headers? Protobufs?) before it starts converging on something that has similar latency/overhead to NFS.
"AWS CodeCommit is built on highly scalable, redundant, and durable AWS services such as Amazon S3 and Amazon DynamoDB."
This is a really good point. That's easily $1M in payroll. You could probably run a decent tiered SAN with 80-95% fewer labor dollars. Plus have the benefit of torturing the vendor when you hit scaling hiccups.
Can you give some examples of the problems you ran into?
Something that always seemed to cause nagging issues was that we wanted our cluster to have data encryption at rest. Ceph does not support this out of the box, which means that you need to use dm-crypt on top of your partitions and present those encrypted partitions to Ceph. This requires some work to make sure that decrypt keys are set up properly, and that the machine can reboot automatically and remount the proper partitions. In addition, we ran into several issues where device mapper or something else would lock an OSD, which would send the entire machine into lockup. Messy!
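The unattended-reboot flow amounts to remapping each encrypted partition before the OSD can start. A sketch of that step; the device paths, key directory, and naming scheme are all hypothetical, and a real deployment would execute these via its init/udev machinery rather than printing them:

```python
import os

# Sketch: generate the cryptsetup commands needed to remap encrypted
# OSD partitions after an unattended reboot. Paths and the naming
# convention are invented for illustration; a real setup would run
# these commands with the keys provisioned per OSD, not print them.

def luks_open_commands(devices, key_dir="/etc/ceph/dmcrypt-keys"):
    cmds = []
    for dev in devices:
        name = "osd-" + os.path.basename(dev)   # e.g. osd-sdb1
        keyfile = os.path.join(key_dir, name + ".key")
        cmds.append(
            f"cryptsetup luksOpen --key-file {keyfile} {dev} {name}"
        )
    return cmds

for cmd in luks_open_commands(["/dev/sdb1", "/dev/sdc1"]):
    print(cmd)
```

The hard operational questions, as noted above, are where the keyfiles live and how they become readable at boot without a human typing a passphrase.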
We also had to work pretty hard to build quality monitoring around Ceph; by default, there are very few tools that provide at-scale, fine-grained monitoring of the various components. We spent a lot of time figuring out what metrics we should be tracking, etc.
We also spent a good amount of time reaching out to other people and companies running ceph at scale to figure out how to tune and tweak it to work for us. The clusters were all-SSD, so there was a ton of work to tune the myriad of settings available, on ceph and the hosts themselves, to make sure we were getting the best possible performance out of the software.
When you run dozens-to-hundreds of servers with many SSDs in them that are doing constant traffic, you tend to hit every edge case in the hardware and software, and there are a lot of lurking demons. We went through controller upgrades, SSD firmware upgrades, tweaking OP settings, upgrading switches, finding that the write workload on certain SSDs caused problems and much more.
That's just a snapshot of some of the issues that we ran into with Ceph. It was a very fun project, but if you are getting into it for a high-throughput setup with hundreds of OSDs, it can be quite a bit of work.
Happy to chat more w/ anyone that's curious - there is some very interesting and fun stuff going on in the Ceph space.
I've always wondered how automatic reboots are handled with filesystem encryption.
What's the process that happens at reboot?
Where is the key stored?
How is it accessed automatically?
My philosophy today is that if the data is important at all, it's worthwhile going spendy and getting a SAN. Get a good one. I like Nimble a lot right now, but there are other good ones, too. (Don't believe crazy compression numbers or de-duplication quackery; I've told more than one SAN vendor to fuck off after they said they'd get 20:1 on our data, without doing any research on what our data was).
Have everything backed up? Great! How long until you can go live again after water drips on your storage? If you spend a week waiting for a restore, that's indistinguishable from failure. If you wait a month for replacement hardware to be shipped, you might have killed your product.
Perhaps I don't understand the problem domain, but I don't understand why CephFS is being considered for this task. You're trying to treat your entire set of files across all repos as a single filesystem, but that's an entirely incorrect assumption. The I/O activity on one repo/user does not affect the I/O activity of an entirely different user. Skip the one filesystem idea, shard based on user/location/whatever.
I'd appreciate any comments explaining why I'm wrong, because this doesn't seem to be a productive design to me.
None of the clouds use any of that super-expensive gear, so if you're going for cost savings, you'll need to use the same sort of commodity gear they use.
Gitlab is obviously Linux-savvy and comfortable writing automation, so things like Cumulus Linux and minimal-handholding hardware vendors shouldn't cause them any indigestion.
<disclaimer: co-founder of Cumulus Networks, so slightly biased>
My point isn't to knock them down. It takes cojones to be public about stuff like this. My instinct as a grumpy engineering-director type is that there are holes here that need to be filled in.
Putting a major product at risk to save $30k against an Arista switch isn't a decision to make lightly. That means pricing the labor, upside benefit and business risk. If they are going to 100x this environment, Cumulus will save millions. If it will 3x, it will save a few thousand bucks -- who cares.
Arista and Cisco shouldn't cost top dollar; though anyone buying EMC or Netapp for any new build should have their union card revoked. FreeNAS ftw uber alles.
Source: Did it twice.
I've run it in my home a few times out of curiosity, and that was never my impression.
We're... displeased with the current solution that we're using at work for this use case. :)
C2: The Dell equivalent is C6320.
CPU: Calculate the price/performance of the server, not the processor alone. This may lead you towards fewer nodes with 14-core or 18-core CPUs.
Disk: I would use 2.5" PMR (there is a different chassis that gives 6x2.5" per node) to get more spindles/TB, but it is more expensive.
Memory: A different server (e.g. FC630) would give you 24 slots instead of 16. 24x32GB is 768GB and still affordable.
Network: I would not use 10GBase-T since it's designed for desktop use. I suggest ideally 25G SFP28 (AOC-MH25G-m2S2TM) but 10G SFP+ (AOC-MTG-i4S) is OK. The speed and type of the switch needs to match the NIC (you linked to an SFP+ switch that isn't compatible with your proposed 10GBase-T NICs).
N1: A pair of 128-port switches (e.g. SSE-C3632S or SN2700) is going to be better than three 48-port. Cumulus is a good choice if you are more familiar with Linux than Cisco. Be sure to buy the Cumulus training if your people aren't already trained.
N2: MLAG sucks, but the alternatives are probably worse.
N4: No one agrees on what SDN is, so... mu.
N5: SSE-G3648B if you want to stick with Supermicro. The Arctica 4804IP-RMP is probably cheaper.
Hosting: This rack is a great ball of fire. Verify that the data center can handle the power and heat density you are proposing.
Would you mind if we contact you to discuss?
I'm never buying another SuperMicro, for many reasons. The amount of cabling for properly redundant connections is a killer; it's at least three cables per system (five in our case); a rack will have hundreds of wires to manage. Access to the blade is in the back, where the cables are, and you have to think ahead and route things cleverly if you want be able to remove blades later.
The comment about doing something other than three 48-port switches is bang on. And if you're running Juniper hardware, avoid the temptation to make a Virtual Chassis (because it becomes a single point of failure, honest, and you will hate yourself when you have to do a firmware upgrade).
19kW is still a ton of power, and I'm surprised the datacenter isn't worried (none of the datacenters we use worldwide go much above 16kW usable). Also, you need to make sure you're on redundant circuits, and that things still work with one of the power legs totally off. Make sure you know what kind of PDU you need (two-phase or three-phase), and that you load the phases equally.
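Loading the phases equally is essentially a small bin-packing problem; a greedy "put the next server on the lightest phase" pass is usually good enough at rack scale. A toy sketch with invented wattages:

```python
# Sketch: assign servers to the three phases of a PDU so the phases
# stay roughly balanced. Greedy assignment to the lightest phase is
# usually adequate at rack scale. Wattages are illustrative.

def balance_phases(server_watts, phases=3):
    loads = [0] * phases
    assignment = []
    for w in sorted(server_watts, reverse=True):
        i = loads.index(min(loads))  # lightest phase so far
        loads[i] += w
        assignment.append((w, i))
    return loads, assignment

loads, _ = balance_phases([350, 350, 300, 300, 250, 250])
print(loads)  # [600, 600, 600]
```

An unbalanced assignment wastes breaker headroom on the heavy phase and can trip it even though the rack total is well within budget.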
For short distances of known length, twinax cables (which are technically copper) can be used. They're thinner than regular Cat6a, about the same as thin Cat6a, and thicker than typical unshielded duplex fiber patch cables. Twinax can be handy when connecting Arista switches to anything else that restricts third-party optics, since Arista only restricts other cable types. Twinax is also the cheapest option.
The PHY has to do a lot of forward error correction and filtering, so it adds latency (for the FEC), power (for all the DSP) and cost (for the silicon area to do all of the above).
Consider the power pull from copper vs. fiber listed here. The Arista 7050TX 128-port pulls 507W while the 7050SX 128-port pulls 235W. Yes, copper draws more, but we're talking about half a kW for two ToR switches. And for this you get much cheaper cabling, since SFP+ optics are much more expensive (go AOC if you do go fiber, BTW) and you have to worry about keeping fiber clean, etc.
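The yearly cost of that extra draw is easy to put a rough number on; the electricity price below is an assumed example, and the switch wattages are the TX/SX figures quoted above:

```python
# Sketch: yearly electricity cost of the extra draw of copper vs.
# fiber ToR switches, using the wattages quoted above. The $/kWh
# rate is an illustrative assumption.

COPPER_W = 507        # TX-series figure quoted above
FIBER_W = 235         # SX-series figure quoted above
SWITCHES = 2
PRICE_PER_KWH = 0.10  # assumed rate

extra_kw = SWITCHES * (COPPER_W - FIBER_W) / 1000
yearly_cost = extra_kw * 24 * 365 * PRICE_PER_KWH
print(f"{extra_kw:.3f} kW extra, ~${yearly_cost:.0f}/yr")
```

At these assumptions the delta is a few hundred dollars a year, which is why the cabling cost difference tends to dominate the decision.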
Where fiber is REALLY nice is density, although 28AWG copper almost makes that moot.
There are few DC builds that can't do end-to-end copper any more, at least for the initial 50 racks.
 - http://www.datacenterknowledge.com/archives/2012/11/27/data-...
 - https://www.arista.com/en/products/7050x-series
 - https://www.monoprice.com/product?p_id=13510
Cumulus isn't really SDN so I'm not sure what you're saying there.
Traditional networking is fine but it's totally different than Linux so you need dedicated netops people to manage it. And Cisco is the most expensive traditional vendor.
I've just gotten the notion that SDN is almost ready for primetime, just not yet.
Juniper, Cisco, and Arista all have solutions for this environment.
You're talking about only 64 nodes right now. Your storage and IOPS requirements are not huge. A lot of mid-size hosting companies will give you fantastic service at the 10-1000 servers range. If I were you I'd talk to someone like https://www.m5hosting.com/ (note: happy dedicated server customer for many years -- and I'm sure there's similar scale operations on east coast if that's really what you need) who have experience running rock solid hosting for ~100s of dedicated servers per customer.
I suspect you may just be able to get your 5-10X cost/month improvement (and bare metal performance gains) without having to take on the financing and hardware bits yourself.
There are many advantages to AWS and similar services, but if you can't really take advantage of all the goodies because e.g. you also need to provide a local version of your software (which is the case for Gitlab, as far as I understand), renting dedicated servers is an order of magnitude cheaper.
In fact, I wonder if that's a business model for hosting providers. Kind of a sysadmin incubator.
We'll be glad to look into alternatives that manage the servers and network for us although the argument in https://news.ycombinator.com/item?id=13153455 that this is the time to build a team that can handle this makes sense to me too.
I would encourage you to look at what you can get without trying to do your own colo. You're not at the scale where you should be thinking about that.
That said, SoftLayer does provide 20Gbps access within a private rack and 20Gbps access to the public network.
Performance along with reliability are the most important metrics for someone providing VCS as a service.
Geo-redundancy seems like a luxury, until your entire site comes down due to a datacenter-level outage. (E.g. the power goes down, or someone cuts the internet lines when doing construction work on the street outside).
(This is one of the things that is much easier to achieve with a cloud-native architecture).
As far as moving from cloud to bare metal, another thing to take into consideration: if you don't architect your AWS (or other cloud) deployment to take advantage of multiple geographic regions, the cloud won't benefit you.
I 100% agree that there should be more than one region deployed for this service. As others have said, all it takes is 1 event and the site will be down for days to weeks to months (It may not happen often, but when it does, you go out of business). The size and complexity of this infrastructure will make it nearly impossible to reproduce on short order in a new facility. If I were the lead on this I would have either active / active sites, or an active / replica site.
I would also have both local (fast restores), and off-site backups of all data. A replica site protects against site failure not data loss and point-in-time recovery.
Yep, this is why scaling starts with scalable distributed design. We were moving a fairly large logging stack from NFS to S3 once, for the same reason Gitlab is trying to move to bare metal now. Moving off cloud was not an option, moving to a TCO efficient service was. NFS did not scale and there was the latency problem. I think moving to bare metal cannot help with scale as much as a good architecture can. We will see how deep the datacenter hole goes. :)
To add to my previous comment though, AWS (and cloud in general) tends to make much more sense if you are utilizing their features and services (such as Amazon RDS, SQS, etc.). If you aren't using those services, I can absolutely guarantee I can deliver a much lower TCO on bare metal than AWS (which is why I offered to consult for them). I see this all the time: a company moves from bare metal to AWS because bare metal is getting expensive, then quickly finds out AWS can't deliver the performance they need without massive scale (because they aren't using a proper scalable distributed design and can't afford to re-architect their platform).
You will need to monitor:
- cpu utilization
- ram utilization
- disk utilization
- disk health
- context switches
- IP addresses assigned and reaching expected MACs
- appropriate ports open and listening
- appropriate responses
- time to execute queries
- processes running
- process health
- at least something for every bit of infrastructure
Once you collect that information, you need to record it, graph it, display it, recognize non-normal conditions, alert on those, page for appropriate alerts, and figure out who answers the pagers and when.
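The evaluate-and-alert step of that loop can be sketched minimally; the metric names and thresholds below are invented, and a real deployment would use a proper system (Prometheus/Alertmanager or similar) rather than hand-rolled checks:

```python
# Sketch: the skeleton of the evaluate -> alert step from the
# checklist above. Metric names and thresholds are invented;
# real deployments would use Prometheus/Alertmanager or similar.

THRESHOLDS = {
    "cpu_utilization": 0.90,
    "disk_utilization": 0.85,
    "ram_utilization": 0.95,
}

def evaluate(samples):
    """samples: {metric_name: fraction in [0,1]} -> list of alerts."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = samples.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric} at {value:.0%} exceeds {limit:.0%}")
    return alerts

print(evaluate({"cpu_utilization": 0.97, "disk_utilization": 0.40}))
```

The hard parts are everything around this function: defining "non-normal" per metric, routing pages, and deciding who answers them.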
Our Infrastructure lead Pablo will do a webinar of our Prometheus monitoring soon https://page.gitlab.com/20161207_PrometheusWebcast_LandingPa...
We're bundling Prometheus with GitLab https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481
Brian Brazil is helping us https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481#not...
On January 9 our Prometheus lead will join us (he already provided valuable help with the research behind this blog post), and we're hiring Prometheus engineers https://about.gitlab.com/jobs/prometheus-engineer/
In the short term we might send our monitoring and logs to our existing cloud based servers. In the long term we'll host them on our own Kubernetes cluster.
For our monitoring vision see https://gitlab.com/gitlab-org/gitlab-ce/issues/25387
Disclaimer: Prometheus co-founder and just started as a Prometheus contractor for GitLab.
EDIT: ah, didn't reload the page to see that sytse had already responded with this :)
Just as a suggestion since you seem to be so certain about the location. I basically know every DC in Frankfurt, so if you need any help or info in that regard feel free to contact me :-)
Of course there's the issue of potential legalities for your business, but don't kid yourself that you're safe from prying eyes by deploying in a particular country.
The next step (and particularly important for a business like GitLab) is to land a second site in either western Europe or west coast US. Honestly you should be thinking about this right away, and look to sign leases on both spaces simultaneously with a 3-mo delay built in for the second site. This should help you negotiate price as well, if you're able to go with the same dc provider, but be aware that that itself is a single point of failure as well. Make absolutely sure you negotiate your MSA to the n-th degree, get a good SLA, etc. You can still get burned, but do your legal due diligence now because you won't have a chance to change terms later.
Then continue to optimize by having multiple sites per-region (so that failover doesn't involve a big performance hit), adding APAC / AUNZ regions, and so forth. For a service like GitLab, I wouldn't think that time to sync the repo is hyper important, but responsiveness of the web interface is fairly key. So that may lead to a hub/spoke design where there are a few larger sites storing the bulk of the data and more small sites to handle metadata and such to present the web views.
That's all years down the road though. For a first pass, I can't see a problem with northern Virginia as the first site.
I would assume that the .de lines are saturated with 5 eyes...
EDIT: Funny, every single documented, proven link for the last several years has proven me and GP correct and all you nay-sayers wrong.
I wouldn't make a business decision around a general feeling of paranoia. German spies aren't an improvement over American or British ones.
This is Gitlab. There should never be any data in or out except HTTPS and SSH.
I (and others I bet) would be interested in a quick summary of the options (and your opinions of them) for facilities in Frankfurt.
You basically have the big brands like Equinix (7 facilities in Frankfurt), Interxion (currently building their 11th facility in Frankfurt), and Telecity (now part of Equinix), but also some local ones like e-Shelter, First Colo or Accelerated. There is also a DC only meters away from the Interxion campus called Interwerk, which is basically the cheapest DC in Frankfurt, but it has an awesome price/performance ratio, and you have access to all the peering/transit options of Interxion (via CWDM) and can get a full rack for under 200 EUR/mo. I was in that DC for several years and never had a single issue, so if you don't rely on any certifications this is a cheap option.
I was also colocating in the First Colo DC for some time, they are also in the locality around the Interxion campus. It's a pretty small DC but it is more premium than the Interwerk and also is kinda cheap. I personally wouldn't go with Accelerated, since they had multiple issues in the past.
For the big brands I would definitely go with Interxion: they are great, have all the certifications, and are premium without going crazy with their pricing like Equinix does. DigitalOcean, Twitch, etc. are all in one of the many Interxion facilities in Frankfurt. If price isn't a concern I would probably go with Equinix FRA5.
DE-CIX is present in around 7 facilities, 3 of them are Interxion and I think 2 Equinix.
Some financial institutions require all sensitive* data to be stored/hosted in the same country/state (archaic, yes).
It's really hard to actually define sensitive data, but the IP* in some code a quant wrote can totally be considered a trade secret by a non-technical person.
Please don't get me started on how stupid I think it is that people consider code to be IP
Definitely not archaic. If the FISA courts taught you anything, it should be that the country your things are living in determines which entities can tap your traffic/hardware.
Frankfurt may be a better option solely on the basis that it is more "environmentally" stable (in terms of events that may disrupt the operation of the facility).
(I'm from Germany, but I couldn't care less about the country per se or GitLab being co-located in Frankfurt)
You could frame infrastructure cost savings in many different ways, though. Moving from the cloud to in-house bare metal may seem like the obvious way to cut spend, but I feel like you'll have a lot of costs you haven't accounted for: maintenance, operational staff, and lost productivity as you make a bunch of mistakes.
Your core competency is code, not infrastructure, so striking out to build all of these new capabilities in your team and organization will come at a cost that you can not predict. Looking at total cost of ownership of cloud vs steel isn't as simple as comparing the hosting costs, hardware and facilities.
You could reduce your operating costs by looking at your architecture and technology first. Where are the bottlenecks in your application? Can you rewrite or change them to reduce your TCO? I think you should start with the small, easy wins first. If you can tolerate servers disappearing, you can cut your operating costs by 1/8th by using preemptible servers in Google Cloud Platform, for instance. If you optimize your most expensive services, I'm sure you can cut chunks of your infrastructure in half. You'll have made some mistakes along the way that contribute to your TCO; why not look at those before moving to steel and see what cost savings vs. investment you can get there?
Ruby is pretty slow, but that's not my area of expertise. I wouldn't build server software on dynamic languages for a few reasons, and performance is one of them, but that's neither here nor there, since I'm sure you can address some of these issues in your deployment with modern tech. Aren't there ways to compile the Ruby and run it on the JVM for optimization's sake?
Otherwise, do like Facebook and just rewrite the slowest portions in Scala or Go or something static.
Try these angles first - you're a software company so you should be able to solve these problems by getting your smartest people to do a spike and figure out spend vs savings for optimizations.
GitLab is fine software but fuck me, they need to hire someone with actual ops experience (based on this post and their previous "we tried running a clustered file system in the cloud and for some reason it ran like shit" post).
Another thing I might recommend is a third network (just simple 1GbE and a quality switch) for consensus. Re-replication can max out the network and cause further consensus failures, which causes more re-replication, winding everything down... If that's not possible, add firewall rules to give all consensus-related ports high priority.
From my own personal experience, I would go with a PCIe SSD cache/write buffer, and then a primary SSD tier and a HDD tier. Storage, as it seems you guys have experienced, is something you want to get right on the first try.
N1: Yes, the database servers should be on physically separate enclosures.
D5: Don't add the complexity of PXE booting.
For Network, you need a network engineer.
This is the kind of thing you need to sit down with a VAR for. Find a datacenter and see who they recommend.
If the 8TB drives don't offer the IO we need we'll have to leave part of them empty. I assume that if you fill them with 2TB they should perform equally well. Word on the street is that Google offers Nearline to fill all the capacity they can't use because of latency restrictions for their own applications.
It is an interesting idea to do SSD cache + SSD storage + HDD instead of just SSD cache + HDD. I'm not sure if Ceph allows for this.
I'm sure it has its uses! But I'd emphatically advise anyone to obtain hands-on experience with Bcache before planning around it! As I'm sure you are already doing or considering, of course :)
Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier while the infrequent data to a slower tier.
By "Tiering", is this moving data between different drive types? Or by moving the data to different parts of the platter to optimize for rotational physics?
Ceph can probably explain it better than I can. http://docs.ceph.com/docs/jewel/rados/operations/cache-tieri...
I read the entire article, and while you talked about having a backup system (in a single server, no less!) that can restore your dataset in a reasonable amount of time, you have no capability for disaster recovery. What happens when a plumbing leak in the datacenter causes water to dump into your server racks? How long did it take you to acquire servers, build the racks and get them ready for you to host customer data? Can your business withstand that amount of downtime (days, weeks, months?) and still operate?
These questions are the ones you need to be asking.
In other words, double your budget because you'll need a DR site in another datacenter.
Edit: If Ceph is smart enough, it would be aware of the tiers present on the node, and would tier blocks on that node. So a block on node A will stay on node A.
If you are talking physically, 2TB/8TB would theoretically be faster than a 2TB/2TB disk if the performance attributes were the same. But a 2TB HDD will have a faster average seek time than an 8TB HDD due to physical design. Any performance gains of only partially filling a drive would probably be offset by the slower overall drive.
I'm skeptical here. So your minimum seek time goes up on the 8TB because it's harder to align. But your maximum seek time should drop tremendously because the drive arm only has to move along the outer fifth of the drive. And your throughput should be great because of the doubled linear density combined with using only the outer edge.
A quick google search shows that there are marginal gains on the outer track vs the inner, but that is only on sequential workloads. For something like GitLab, the workloads would be anything but.
1. If I look at 2TB vs. 8TB HGST drives, their seek times are 8ms and 9ms respectively. But if you're only using a quarter of the 8TB drive, the drive arm needs to move less than a quarter as much. Won't that save at least 1ms?
2. The 8TB drive has a lot more data per inch, and it's spinning at the same speed. Once a read or write begins, it's going to finish a lot faster.
3. Here's a benchmark putting 8TB ahead in every single way http://hdd.userbenchmark.com/Compare/HGST-Ultrastar-He8-Heli...
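The short-stroking argument above can be made concrete with a toy model. A common rule of thumb is that average seek time grows roughly with the square root of seek distance (settle time dominates short seeks); the numbers below are illustrative assumptions, not vendor specs.

```python
# Toy model of "short-stroking" an 8TB drive: restrict data to the outer
# fraction of the platter and estimate the effect on average seek time.
# The sqrt scaling and the 9 ms figure are assumptions for illustration.
import math

def avg_seek_ms(full_stroke_avg_ms, fraction_of_platter):
    # Crude approximation: seek time ~ sqrt(seek distance).
    return full_stroke_avg_ms * math.sqrt(fraction_of_platter)

full_drive = avg_seek_ms(9.0, 1.0)     # 8TB drive, fully used
short_stroked = avg_seek_ms(9.0, 0.25) # only 2TB of the 8TB in use

print(f"full drive: {full_drive:.1f} ms, outer quarter: {short_stroked:.1f} ms")
```

Under these assumptions the partially filled 8TB drive halves its average seek, which is consistent with the intuition in point 1 above, though real drives also gain from the higher outer-track linear density mentioned in point 2.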
In my experience, the PCIe drives they mention (the DC P3700) are incredible. They're blazing fast (though the 3D XPoint stuff is obviously faster), have big supercaps (in case of power failure), and really low latency for an SSD (150 microseconds or so). They're a pretty suitable alternative to in-memory log buffers where latency is less crucial. Much cheaper than the equivalent memory too. Having a few of these in front of a RAID 10 made up of 8TB drives will work just fine.
FWIW your experience with a RAID 6 of big disks is unsurprising. RAID 6 is almost always a bad choice nowadays since disks are so big. Terrible recovery times and middling performance.
I would think that GitLab's workload is mostly random, which would pose a problem for larger drives. The SSDs are a great idea, but I've only seen 8TB drives used when there are 2 to 3 tiers; with 8TB drives being all the way on the bottom. I'm not sure how effective having a single SSD as a cache drive for 24TBs of 8TB disks will be.
Oops that's reliability, I can jump the gun sometimes, sorry.
A few things to watch out for, if your team doesn't have a lot of experience in direct management of dedicated servers/datacenters:
- There is an increased risk of larger/longer very rare outages due to crazy circumstances. Make sure you have a plan to deal with that. (I've had servers hosted at a datacenter that flooded in Hurricane Sandy, and another where an electrical explosion caused a >3 day outage..)
- It's easy to think you'll rely on managed services, but that rarely works out well. It also can become very, very expensive -- possibly more so than cloud-based hosting.
- Specifically, regarding H1, H2: Dedicated hardware is substantially cheaper than cloud-based hosting, but if you rely too much on managed services you negate a large portion of the savings. Consider that most service providers will be both more expensive and less competent than doing it yourselves. Also, having your own team have direct knowledge + their own documentation of the setup will be beneficial.
- I'd recommend budgeting for and ordering some extra parts to keep on hand for replacement, if you are having datacenter ops handle hardware or can have an engineer located relatively close to your datacenters. (A few power supplies, some memory, a couple drives - nothing too crazy.)
- Supermicro's twins systems are great. In the past I've gone with their 1U models vs. 2U to slightly reduce the impact of unit downtime. (Having to take one 2U Twin down affects four nodes. It sounds like you'll have to decide on balancing that against the increased drive capacity, in your case.)
In the long run, probably. Immediately after deploying, unless they hire very experienced people, I expect quite a few "never seen before" issues (they may not result in publicly visible downtime though). Monitoring for "thermal events", weird and hard-to-debug issues requiring firmware updates, "bad cable" issues, etc. are not what you have to deal with in the cloud.
I think you could probably get the IO performance you're talking about in your blog post from AWS instances or Google Cloud's local NVMe drives, but if you truly need baremetal, I'd recommend Packet or Softlayer. Don't try to run your own infrastructure or in a year you'll be: https://imgflip.com/i/1fs7it
Lower your SLA requirements and go with multiple providers like OVH. Make your site work across multiple datacenters. At the end of the day your users will be much happier.
My 2 cents.
Stay away from the 8TB drives. Performance and recovery will both suck. 4TB drives still give the best cost per GB.
Why are you using fat twins? Honestly, what does that buy you? You need more spindles, and fewer cores and memory. With your current configuration, what are you getting per rack unit?
Consider a 2028u based system. 30 of those with 4TB drives gets you the 1.4PB raw storage you're looking for. 2683v4 processors will give you back your core count, yielding 960 cores (1920 vCPUs) across that entire set. You can add a half terabyte of memory or more per system in that case.
Sebastien Han has written about "hyperconverged" ceph with containers. Ping him for help.
The P3700 is the right choice for Ceph journals. If you wanted to go cheap, maybe run a couple of M.2 NVMe drives on adapters in PCIe slots.
I didn't really need the best price per GB in my setup, so I went with 6TB HGST Deskstar NAS drives. I'm suggesting you use 4TB as you need the IOPS and aren't willing to deploy SSD. Those particular drives have 5 platters and a relatively high areal density, giving them some of the best throughput numbers among spinning disks.
If you can figure out a way to make some 2.5" storage holes in your infrastructure, the Samsung SM863 gives amazing write performance and is way, way cheaper than the P3700. I recently picked up about $500k worth, I liked them so much. They run around $.45/GB. Increase over-provisioning to 28% and they outperform every other SATA SSD on the market (Intel S3710 included).
You'll probably want to use 40GE networking. I've not heard good things about Supermicro's switches. If I were doing this, I'd buy switches from Dell and run Cumulus linux on them.
Treat your metal deployment as infrastructure-as-code, just like any cloud deployment. Everything in git, including the network configs. Ansible seems to be the tool of choice for NetDevOps.
We're considering the fat twins so we get both a lot of CPU and some disk. GitLab.com is pretty CPU heavy because of ruby and the CI runners that we might transfer in the future. So we wanted the maximum CPU per U.
The 2028u has 2.5" drives. For that I only see 2TB drives on http://www.supermicro.com/support/resources/HDD.cfm for the SS2028U-TR4+. How do you suggest getting to 4TB?
If you do somehow manage to pick the perfect disk sure having everything from a single batch would be the best since that'll ensure you have the longest MTBF. But how sure are you that you'll be picking the perfect batch simply by blind luck?
That said, I bought the same 6TB HGST disk for two years.
But when you're buying 100% of your disk inventory at once there's a serious "all eggs in one basket" risk.
As for CPU density, I still feel like you're going to need more spindles to get the IO you're looking for.
4 x 208V 30A circuits give a total of 120A -- which can only be used at 80% capacity, so that leaves you 96A usable -- without redundancy.
My initial feelings (eyeballing it) are you should be looking for 3 full racks, 2x208V@30A in each.
As a juniper shop, we implemented the 2xQFX 5100 48T (in virtual chassis) + ex4300 for remote access per rack. This will be a decision based on your local expertise though.
I also looked hard into the twin boxes -- but power to the rack in the end ruled the day, and it made not much of a difference to use the 1U boxes.
Don't forget about out of band (serial for the switching gear). We've been using OpenGear.com for this stuff with 4G-LTE builtin.
VPN access to access console devices?
Any site-to-site VPN needs?
Also, I would consider not pre-purchasing N times your required horsepower/disk if you can avoid it, but rather adding capacity in pre-planned yearly or biannual stages.
As the CEO of a cloud hosting & server management company -- I have much more to say about this if you would like to chat via phone or email anytime.
1) Contributors have not been vetted - Some responses are based on real-world experience, and some are conjecture from armchair quarterbacks. (A simple example would be that nobody has mentioned, with any of the SuperMicro 2U Twins, that you have to be cautious about the PDU models and outlet locations of 0U PDUs so as not to block node service/replacement in the rack.)
2) There are multiple ways to skin a cat - There are many viable solutions in this thread, but you can't simply take a little bit of this, a little bit of that, and piece together a new platform that "just works." Better to go with a known working solution than a little of this and a little of that. Multiple drivers tend to be less effective.
I am the owner of a bespoke Infrastructure as a Service provider that delivers solutions to sites of similar metrics, and speak with plenty of real-world experience.
Larger Providers - We find a lot of clients move away from AWS, Softlayer, Rackspace, et al. as the larger providers aren't nearly as interested in working with the less-than-standard configurations. They want highly-repeatable business, not one-off solutions.
I'd love to talk in more depth with you about how we can deliver exactly what you need based on years of experience in delivering highly customized solutions. We'll save you money and headache.
Average is not the appropriate word; however, I'll use your word. I'd rather have, and so would every enterprise out there, average advice based on tried-and-true solutions that costs 10% more with 99.99 to 99.999% availability than cutting edge that saves 10-20% with 99.8% availability. The downtime alone can (and does) kill the reputations of sites like GitLab.com (and others).
This definitely increases software complexity but going the other way increases other complexities (ops, capex, capacity).
You need at least two nodes that do DNS, DHCP, NTP, and other miscellaneous services that you absolutely want to have but do not seem to have mentioned. You want them to be permanently assigned, so that you never have to search for them, and you want them to fail over each service as needed, preferably both operating at once. Three nodes would be better. Consider doing some basic monitoring on these nodes, too, as a backup to your main monitoring system.
1&2. Authoritative DNS (internal)
1&2. Caching DNS resolver
3&4. Outbound HTTP proxy (if necessary)
3&4. PXE / installer / dhcp
3&4. Local software mirror (apt / yum / etc.)
5&6. SSH jump hosts.
Or something like that.
Make sure the second host is not just in a separate chassis from the first host, but also in a separate rack.
For external authoritative DNS, don't do it yourself -- pay for someone to run that for you (Route53, ns1, etc.). For e-mail, if you can possibly not deal with it then don't -- use Mailchimp or something.
The DNS records for the internal records are done using the kubernetes middleware (basically serving the service records).
The external records are pulled in from a git repository hosting our zones as bind files. If need be zones are split into subzones per team/project. Same permission system as our code via MRs using Gitlab.
Our recommendation is build on open standards (BIND, AXFR) and use services on top of these.
I agree that using an external mail provider is usually a good idea. It mostly is your fallback communication channel and is usually easy to switch (doing replication to an offsite mail storage needs to be done to make switching easy/possible/fast). MX records \o/
SuperMicro has a 4 node system available in which each node gets 6 disks, 2028-TP-xx-yyy. Get two of these, populate 2 nodes on each as your databases, and you can grow into the other spaces later. Run your databases on all-SSD; store backups on spinners elsewhere.
Having two node types is not a calamity compared to having just one.
So the proposal now is to have one chassis (excluding the backup system) and two types of nodes (normal and database).
Regarding the 2028 please see the conversation in https://news.ycombinator.com/item?id=13153853
I don't know much about the Ruby web server landscape, but might Puma (http://puma.io/) be better?
The 2U fat twins are very dense. You'll need to make sure that the hosting company can actually cool them effectively.
I'd look again at your storage strategy. You'll save a lot of money, and it'll be much easier to debug if you dump ceph.
You have a clustered FS when you appear to not really need it. By the looks of it, it's all going to be hosted in the same place. So what you have is lots of CPU to manage not very much disk space. The overhead of Ceph for such a tiny amount of data, all in the same place, seems pointless. You can easily fit your data set on one 90-disk shelf four times over with 100% redundancy.
First things first: file servers serve files and nothing else. This means that most of your RAM is going to be a file server cache. Putting applications on there is just going to mean that you're hitting the disk more often, and stealing resources from other apps in a non-transparent way.
Get four 90-drive SuperChassis and connect them to four servers (via SAS, direct connect) with as many CPUs and as much RAM as possible. JBOD them and either use ZFS, or hit up IBM for GPFS (well, IBM Spectrum Scale). You can invest in a decent SAS controller and RAID 6 in 8-10 disk stripes, but rebuild time will be very high. What's your strategy for disk replacement? 400 disks means about one will be failing every four weeks.
What is your workload like on the disks? is it write heavy? read heavy? lots of streaming IO? or random? can it be cached?
Your network should be simple. Don't use Cat6 10gig. It's expensive and not very good in dense racks; just use copper cables with built-in SFPs (DACs) - they are cheap and reliable.
Don't use Supermicro switches. They are almost certainly re-badged.
Your network should be fairly simple: a storage VLAN, an application VLAN, and a DMZ. (Out-of-band will need a strongly protected VLAN as well.)
On the buying side, you need a reseller, who is there to get the best deal. You'll need to audition them. They will also help with design. But you need to be harsh with them; they are not your friends, they are out to make money.
Redhat, if you're out there: Now would be a good time to chime in about the limits of Ceph and the reasonable size of a filesystem.
Here's Part 1 of the series: http://nickcraver.com/blog/2016/02/17/stack-overflow-the-arc... (the other parts are linked from there)
Noooooo. Do not go for Supermicro for anything where the operating system matters. Their software quality leaves a lot to be desired.
You might want something Cumulus Linux supports, since it seems like you have a lot more Linux experience than networking...
Ohhh boy. I hope this works, and look forward to hearing updates.
"Important: CephFS currently lacks a robust ‘fsck’ check and repair function. Please use caution when storing important data as the disaster recovery tools are still under development. For more information about using CephFS today, see CephFS for early adopters."
I'm getting the same feeling I get when I'm watching those "Hold my beer and watch this..." videos.
From my vantage point they look to have zero experience in building and running infrastructure... and they're asking for advice on HN. They might as well post an Ask Slashdot thread if they want armchair advice. Genuinely, I think they've crunched some numbers and think they can run their stuff cheaper and faster in-house... but probably underestimated the human-experience angle.
For just 10-20 physical servers, this is going to be either extremely expensive (if they hire right) or extremely painful (if they don't).
We're not doing this to save money, we're doing it to increase performance https://about.gitlab.com/2016/11/10/why-choose-bare-metal/
A service provider's biggest responsibilities to its customers are security, durability, availability, and performance -- in that order. You guys are vastly underestimating the complexity involved in getting the first 3 right.
Bcache is mentioned in the article under the disk header.
Perhaps look into how to have the nodes split between 2 or 3 different datacenters?
It is pretty easy to build such middleware, both for git over SSH (a simple script performing a lookup of where the shard is, after which you connect to that shard to operate there) and, with a little more work, for the HTTP part. At the webapp level, you will have a kind of RPC to run the git-related operations, which will connect to the right shard to run them.
When you use Ceph you are basically running a huge FS at the full scale of your GitLab installation, but practically, you have many independent datasets within your GitLab installation and you do not need to pay the cost for the global consistency of Ceph. You have many islands of data.
Edit: Typos/missing words.
I do; that's my default scenario. If you can survive that, you can survive all sorts of smaller issues like network congestion, data center power problems, grid power problems, and zombie plagues (or flu, which is more likely.)
But that's a true emergency situation. Don't go offline for multiple days for something that's reasonably likely to happen.
- Shouldn't there be a puppet / chef / whatever deployment coordinator in there somewhere?
- There's no mention of a virtualisation environment. While it's not a hardware issue really, all the extra services mentioned before will not take up a whole server and you'll want to colocate some of them. (Maybe even some of the main services too, if the resource usage on real hardware turns out wildly different than the estimates.) If the choice is KVM, great. But if it's VMware, you want to include that in the cost. (And the network model.)
- The "staging x 1" is interesting... so what happens when you need to test a new version of postgres before deployment? You can't pretend that a test on one server (or 4 virtual ones) will be comparable to actual deployment, especially if you need to verify the real data performance.
- "Backing up 480TB of data [...] with a price of $0.01 per GB per month means that for $4800 we don't have to worry about much." - This makes a really bad assumption of X B of data produces X B of backup. That's a mirror, not a backup - it won't protect you from human errors if you overwrite your only mirror. For a backup you need to have actual system of restoring the data and a plan for how many past versions you want to store. On the other hand, gitlab data seems to be mostly text files - that should compress amazingly well, so that should also help with the restore speed.
- Network doesn't seem to mention any out of band access (iLO, or similar). That's another port per node required.
- Because tech is tech, I expect both the staging and spare servers to be repurposed soon to be used for "that one role we forgot about". One each is really not enough. (seen it happen so many times...)
- We want to use Kubernetes instead of virtualization.
- We provisioned 2 spare database servers for that reason.
- Git already compresses files with zlib so I'm not sure we can compress it much further https://git-scm.com/book/uz/v2/Git-Internals-Packfiles
- The servers have a separate management port and "Apart from those routers we'll have a separate router for a 1Gbps management network."
- I agree we'll probably need more spare servers.
It was followed with "For example to make STONITH reliable when there is a lot of traffic on the normal network". That sounded like you want the heartbeats on it, not out of band management.
Google might do some really cool shit with Kubernetes, but they are Google. I don't think using them as a good example for anything but infrastructure on a massive scale is correct. They are years ahead of everyone else, and if they are doing something in production, they have been testing it for years. If shit hits the fan, they have thousands of employees to throw at the problem. GitLab is building a team, and therefore does not have the experience to know these systems inside and out. In my view, using Docker/Kubernetes is adding unnecessary complexity to the database fabric for minimal tradeoffs.
And considering that they are talking about dedicated database servers, it makes no sense to add an unneeded layer of abstraction when in all likelihood the container will be bound to a node.
It's refreshing to see that a few quite important players, e.g. Dropbox, are moving away from having everything in public clouds, in contrast to others such as Netflix that go all in. Looks like on-premise is not dead yet.
In the longer term the storage appliance will lock us in and will get very expensive. I've heard pretty bad stories of companies betting on it. Especially with many small files like us (IOPS heavy).
And one goal of GitLab.com is to gain experience that we can reuse at our customers. Most of our customers use a storage appliance now but are interested in switching to something open source.
You also now need to gain internal expertise in networking, security, datacenter operations, and people who can rack and stack well.
> ...Even with RAID overhead it should be possible to have 480TB of usable storage (66%).
> Fortunately, with a technique called erasure coding, we can. Reed-Solomon error correction codes are a popular and highly effective method of breaking up data into small pieces and being able to easily detect and correct errors. As an example, if we take a 1 GB file and break it up into 10 chunks of 100 MB each, through Reed-Solomon coding, we can generate an additional set of blocks, say four, that function similar to parity bits. As a result, you can reconstruct the original file using any 10 of those final 14 blocks. So, as long as you store those 14 chunks on different failure domains, you have a statistically high chance of recovering your original data if one of those domains fails.
Facebook didn't release the system they used to do this. I can see two reasons why not to: desire for competitive edge; or the implementation not being a general-purpose solution.
Considering Facebook's general openness, I say get in touch, just in case! It's quite possible that you might be able to figure out something interesting.
I suspect the reason the system wasn't released was due to the latter case - it seems to be technically quite simple and easily achievable for a[ny] company full of algorithms Ph.Ds.
We're set up for 38kW [a] possible gross wattage using the dual-node blades with the Xeon D-1541. The amazing thing about the D-1541 blades is we get around 100W per server, with 8 hyperthreaded cores and a 3.84TB SSD. With the 6U chassis, you have 28 blades with 2 nodes each - 56 medium-sized servers for 5.6kW in 6U - under a kW per RU.
For your 70 some server workloads, I'd recommend using the microblades.
For your heavier workloads, I think the SuperMicro TwinNode makes sense.
Be very very careful about Ceph hype. Ceph is good at redundancy and throughput, but not at iops, and rados iops are poor. We couldn't get over 60k randrw iops across a 120 OSD cluster with 120 SSDs.
For your NFS server, I'd recommend a large FreeNAS system: put a big brain in it and throw spinning platters at it.
Datacenters can/will do your 30kw
 - https://www.supermicro.com/products/nfo/2UTwin2.cfm
 - https://www.supermicro.com/products/MicroBlade/
 - https://www.supermicro.com/products/MicroBlade/module/MBI-62...
[a] - Although we have 38kW of possible power there, practically it's well under the 27kW we'll get with 4x208V@60A PDUs at 80%.
Separately I'm using a FreeNAS controller with 4 SAS HBAs supporting 3 JBODs with 45 8TB HGST He8 near line SATA disks each (135 disks or ~1PB) for backups and slow data.
There is a great deal of pain in moving from one piece of metal to the next, and there is nothing wrong with underpinning your metal with a tech where you can at any time move your architecture to be any combination of a private, public or hybrid cloud, storage aside.
This looks to be a really interesting project, I hope you can continue to blog about it in detail.
I can understand the need for performance but if it were my business I would have taken a significantly different approach.
Hopefully IPv6 shows up somewhere in the stack. It's sad to see big players not using it yet.
I'm interested to see what they end up with in the end.
It is nice that we'll save on costs but we anticipate a lot of extra complexity that will slow us down. So if it wasn't needed we would have stayed in the cloud. But it is interesting that both our competitors (GitHub.com and BitBucket.org) also moved to metal.
email me at firstname.lastname@example.org and I'll be sure to get you in contact with the right people at google.
Naively, it seems like you should be able to reduce your peak filesystem IOPS by sharding the data at the application layer. That does introduce application complexity, but it might shake out as being less work than the operational complexity of running your own metal.
Of course, easier said than done -- I just didn't spot any discussion of this option, and it seemed like the design choice of having one filesystem served by Ceph was taken for granted.
Then we have to think about redundancy. The simple solution is to have an secondary NFS server and use DRBD. For the shortcomings of that read http://githubengineering.com/introducing-dgit/
The next step is introducing more granular redundancy, failover, and rebalancing. For this you have to be good in distributed computing. This is not something we are now so we rather outsource it to the experts that make CephFS.
The problem with CephFS is that each file needs to be tracked. If we did it ourselves we could do it at the repository level. But we'd rather reuse a project that many people have already made better than go through the pain of making all the mistakes ourselves. It could be that using CephFS will not solve our latency problems and we have to do application sharding anyway.
Worth investigating if you can bolt on a distributed datastore like etcd or ZooKeeper to store the cluster membership and data locations; this might not be as complex as it sounds at first. etcd gives you some very powerful primitives to work with.
(For example, etcd has the concept of expiring keys, so you can keep an up-to-date list of live nodes in your network. And you can use those same primitives to keep a strongly consistent prioritized list of repos and their backed up locations. The reconciliation component might just have to listen for node keys expiring and create and register new data copies in response.)
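The expiring-key liveness pattern described above can be modeled in a few lines. This is a toy, in-process stand-in for what etcd leases provide: each node keeps a TTL'd key alive, and watchers treat expiry as node death. Class and node names are made up.

```python
# Toy model of the etcd expiring-key pattern: nodes heartbeat a lease,
# and a node whose lease lapses drops out of the live set. In production
# etcd's lease grant/keep-alive primitives do this for you.
class LeaseTable:
    def __init__(self):
        self._expires = {}  # node name -> expiry timestamp

    def heartbeat(self, node: str, ttl: float, now: float):
        # The node refreshes its key; it stays live until now + ttl.
        self._expires[node] = now + ttl

    def live_nodes(self, now: float):
        return sorted(n for n, exp in self._expires.items() if exp > now)

table = LeaseTable()
table.heartbeat("node-a", ttl=10, now=0)
table.heartbeat("node-b", ttl=10, now=0)
table.heartbeat("node-a", ttl=10, now=8)  # node-a keeps refreshing

print(table.live_nodes(now=12))  # node-b's lease lapsed -> ['node-a']
```

A reconciliation component would watch for these expiries and re-register new data copies in response, as the parent comment describes.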
Etcd is indeed very interesting. I'm thinking about using it for active active replication in https://gitlab.com/gitlab-org/gitlab-ee/issues/1381
I wouldn't be so quick to jump to that conclusion. It's not just the cost of owning and renewing the hardware, it's everything else that comes with it: designing your network, performance tuning, and debugging everything. Suddenly you have a capacity issue - now what? You're not likely to have a spare 100 servers racked and ready to go, or be able to spin them up in two minutes. Autoscaling?
Companies spend enormous amounts of engineering hours to maintain their on-premise solutions. And sometimes that's fine b/c you have requirements that you can't easily do in the cloud (think of high frequency trading for example). However, once you tally all that up, plus all the value added services you can buy in the cloud (just take a look at the AWS portfolio for example) the price might well be worth it. That's not to say you won't need engineers to help you with cloud stuff, but you'll probably need less and they'll be able to focus on solving a different class of problems for you.
> There must be a margin in it since the big players are making money at it.
From what I've seen the players aren't making (lots of) money on providing compute power. They're basically racing against each other to the bottom. What they're making money on is all the value added services, the rest of the portfolio AWS/Google Cloud Platform/Azure offers.
Big companies, most of their servers have a pretty stable load; it's unlikely things like internal email, SharePoint, or ERP/MRP systems will take a spike. It's only things like front-end order processing that take the hit.
There are lots of businesses that make sense and some that don't
I like the concept of "racing to the bottom" but they are still making money. But lets take your comment of the Value Added Services other than the ability to spin up capacity. What's the cost to Gitlab to pull this together and keep it running? There is an overflow every day on HN articles about operations monitors, containers, network monitoring. The tools are there, its an effort to glue them together, but then they are there.
So I'll still posit there is cases that the dollars to own are less than the dollars to rent. And I'll agree with your cases of rent because of capacity blowouts is key. The issue, is your ops team savvy enough to figure out what to keep/own, what to rent?
I've ordered a pair of Intel 750 Series 1.2TB NVMe SSDs for it; can't wait to try that out. Still waiting for the drives to arrive.
Go ahead and apply for your ASN now and an IPv6 allocation. Then start working on the paperwork for an IPv4 allocation. Because there is no more IPv4 to allocate you'll have to go through the auction process and then the subsequent transfer process.
You'll easily be able to find a provider that can give you a /24 if you buy transit from them, but you don't want to go through the trouble of renumbering into your own IP space later if you can avoid it.
I'm not sure you have the in-house expertise to maintain a production service of this kind. That's not an attack; this has not been your focus in the past, so it might be wise to have a third party provide some assistance and support as an intermediate step toward doing everything in-house.
Your memory options are:
1TB - 16x64GB / 8x128GB
2TB - 16x128GB
 - SuperMicro X10DRT-PT Motherboard manual page 35(2-13).
does not really agree with
> Each of the two physical network connections will connect to a different top of rack router.
Sure, you can do it with something like MLAG, but that's really just moving your SPOF to somewhere else (the router software running MLAG). Router software being super buggy, I wouldn't rely on MLAG being up at all times.
> N1 Which router should we purchase?
Pick your favorite. For what you're looking for here, everything is largely using the same silicon (broadcom chipsets).
> N2 How do we interconnect the routers while keeping the network simple and fast?
Don't fall into the trap of extending vlans everywhere. You should definitely be routing (not switching) between different routers. You can read through http://blog.ipspace.net/ for some info on layer 3 only datacenter networks.
You'd want to use something like OSPF or BGP between routers.
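To make that concrete, here is a minimal sketch of what BGP between top-of-rack devices can look like in FRR/Cumulus-style syntax. The ASN, router ID, and interface names are made up for illustration; "BGP unnumbered" peering over interfaces avoids assigning per-link addresses:

```
! Hypothetical per-ToR fragment (FRR syntax); swp49/swp50 are uplinks to spines
router bgp 65101
 bgp router-id 10.0.0.11
 neighbor swp49 interface remote-as external
 neighbor swp50 interface remote-as external
 address-family ipv4 unicast
  redistribute connected
```

Each device only needs to know its own ASN and which interfaces face a neighbor, which is part of why L3-only designs scale so cleanly.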
> N3 Should we have a separate network for Ceph traffic?
Yes, if you want your Ceph cluster to remain usable during rebuilds. Ceph will peg the internal network during any sort of rebuild event.
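If you do split it, the split itself is just a ceph.conf setting; once a second network exists, replication and rebuild traffic moves off the client-facing network. The subnets below are placeholders:

```
[global]
# client <-> OSD traffic
public_network = 10.10.0.0/24
# replication and rebuild traffic between OSDs
cluster_network = 10.20.0.0/24
```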
> N4 Do we need an SDN compatible router or can we purchase something more affordable?
You probably don't need SDN unless you actually have a SDN use case in mind. I'd bet you can get away with simpler gear.
> N5 What router should we use for the management network?
Doesn't really matter; gigabit routers are pretty robust/cheap/similar. I'd suggest the same vendor as whatever you go with for your public network routers.
Also, consider another standalone network for IPMI. I can tell you that the Supermicro IPMI controllers are significantly more reliable if you use the dedicated IPMI ports and isolate them. You can use shitty 100mbit switches for this; the IPMI controllers don't support anything faster.
> D5 Is it a good idea to have a boot drive or should we use PXE boot every time it starts?
PXE booting at every boot is cool, but can end up sucking up a lot of time. If you have not already designed your systems to do this, and have experience with PXE, then don't.
> The default rack height seems to be 45U nowadays (42U used to be the standard).
You may not have accounted for PDUs here. Some racks will support 'zero-U' PDUs, but you'd need to confirm this before moving on.
> H3 How can we minimize installation costs? Should we ask to configure the servers to PXE boot?
Assume remote hands is dumb. Provide stupidly detailed instructions for them. Server hardware will PXE by default, so that's not really a concern. IPMI controllers come up via DHCP too, so once you've got access to those you shouldn't need remote hands anymore.
> D2 Should we use Bcache to improve latency on the Ceph OSD servers with SSD?
Did you consider just putting your Ceph journals on the SSD? That's a much more standard config than somehow using bcache with OSD drives.
I would strongly consider doing this via pure L3 routing. This is a scale at which the benefits of L2 fabric switching vs L3 multihomed routing (yes, routing decisions on every node) begin to be interesting decisions.
We're already planning a separate router for the management network ("Apart from those routers we'll have a separate router for a 1Gbps management network.").
All Ceph journals will be on SSD too. I've added a question about combining this with bcache in https://gitlab.com/gitlab-com/www-gitlab-com/commit/a9cc9aad...
For switches, yes. Many of the switches share the same merchant silicon (Broadcom Trident-II, Tomahawk, et al.); however, there are switches like the Juniper EX9200 which aren't based on merchant silicon. Routers (N1) are also not typically based on merchant silicon (Juniper Trio-3D, for example).
If you were larger, Open Compute Platform might be a way to go. Maybe next generation.
You'll need to maintain an association between DNS names, IPs, MAC addresses, SSH keys, etc.
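Even a minimal asset record goes a long way here. This is a hypothetical sketch (field names and values are made up, not any particular CMDB's schema) showing the kind of association you end up needing, e.g. mapping a PXE/DHCP request's MAC back to a known host:

```python
# Minimal hardware inventory: one dict per host, keyed by asset tag.
# All fields and values below are made up for illustration.
inventory = {
    "sv-0001": {
        "hostname": "worker01.example.com",
        "mgmt_ip": "10.0.10.11",
        "ipmi_ip": "10.0.20.11",
        "mac": "0c:c4:7a:00:00:01",
        "rack": "A3",
        "rack_unit": 12,
        "ssh_host_key_sha256": "record-at-first-boot-to-detect-swaps",
    }
}

def find_by_mac(inv, mac):
    """Map a DHCP/PXE request's MAC address back to an asset tag."""
    mac = mac.lower()
    return next((tag for tag, host in inv.items() if host["mac"] == mac), None)

print(find_by_mac(inventory, "0C:C4:7A:00:00:01"))  # -> sv-0001
```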
Hardware break-fix workflow is usually ignored by most production engineers. You'll be doing a lot of it, and you want to get your hardware back into use as fast as possible.
Have you thought about how many spares (CPU, RAM, disks) you'll have to keep at your datacenters?
Running your own boxes can be done, but usually at great cost, and usually by blowing up your SLA. Given your inexperience at this, some other options might be politically cheaper.
Then give them freedom to do all hardware picks AND hiring some more intermediate/junior ops/admin staff.
You can do it with 3-4 ops people and maybe 3 network admins in total.
Also, assuming you are getting jacked by AWS like most people, have you looked into Linode, Digital Ocean or anyone else?
The FatTwin chassis has similar density, and can support either 1U half width or 2U half width systems in a particular chassis. Typically I use 1U's for app / web servers and 2U's for lower end database / storage. Separate 2U servers for higher end database and 4U servers for bulk storage.
HPE has the Apollo 2600 / XL170r 2U chassis, which I think is somewhat inferior but still a reasonable choice. Dell sells the same thing as the C6300. I really prefer a 4U chassis though from a cooling and power supply perspective, but Dell or HPE may have a better international story for you.
You absolutely should not buy the 2630v4 CPU. I say that because the lower-end Intel CPUs do not support maximum throughput for memory and QPI: the 2630v4 is 8.0GT/s QPI and 2133MHz DDR4. A better low-watt part is the 2650Lv4 (9.6GT/s QPI and 2400MHz DDR4). I have a guide that I created (and use myself) to determine comparative $/performance of Intel CPUs based on SPEC numbers. If you can go up to 105W, the 2660v4 is probably your best bet. Presuming that you're targeting 12-15kW per rack, a 105W part should allow you to deploy between 60-80 hosts per rack.
Also, don't use a W-series CPU that draws 160W. That's crazy power draw per-socket. If you want a super high-end CPU for your database server, I suggest a 2698v4 -- but normally I would go with 2680v4 or 2683v4 depending on the part cost.
In terms of hard drives, absolutely you should specify HGST over Seagate. At some point you may want to dual source this, but if you're only going with one vendor right now HGST is the best option. He8 or He10 8TB are your best bet in terms of cost and availability right now, although start thinking about He10 10TB drives. The newly announced He12 drives shouldn't be on your radar until Q2 2017. Stock spares, maybe 2-3% of your total drives deployed, but at least 5-6 drives per site. You don't want to get caught out if there is a supply shortage when you need it most. Your business depends on ready access to considerable quantities of storage.
The P-series Intel SSDs are probably not going to be cost effective for your use case. But they are considerably better in terms of IOPS and remove the need for an HBA or RAID controller. Consider a Supermicro 2U chassis with 2.5" NVMe support, which will allow you to go considerably denser than the PCIe form factor. However, I think it's too early to go with NVMe unless you truly need beyond-SSD performance.
Don't PXE boot every time you boot. This creates a single point of failure (even if your architecture is redundant), and you will regret this at some point. However, DO PXE boot to install the OS.
Don't use 128GB DIMMs. They are not cost effective today.
There is only one solution for database scaling: shard. You'll either shard it today, or you'll shard it tomorrow when the problem is much harder. Scale up each host to what is easily achievable with today's hardware, and if push comes to shove retrofit to get over a hump that arises in the future, but know that you MUST shard in order to keep up with demand. Scaling up simply does not work.
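However sophisticated the scheme becomes later, the heart of sharding is a stable key-to-shard mapping. A minimal sketch (the shard count and key format are hypothetical; note that changing NUM_SHARDS with plain modulo moves nearly every key, which is exactly why this decision hurts more tomorrow than today):

```python
import hashlib

NUM_SHARDS = 16  # hypothetical fixed shard count, chosen up front

def shard_for(key: str) -> int:
    """Deterministically map a key (e.g. a project path) to a shard index."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The mapping must be stable across processes, hosts, and releases:
assert shard_for("some/project") == shard_for("some/project")
assert 0 <= shard_for("some/project") < NUM_SHARDS
```

Consistent hashing is the usual refinement once you need to grow the shard count without a mass migration.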
There's a lot more to say, but without doing my job for free in a HN comment, the best advice I can offer is:
1. Simplify what you hope to accomplish in the first round. This is a lot to achieve at once. I think you'll have a hard time achieving the fanciness you want from a software perspective while also forklifting the entire stack over to physical hardware. It's perfectly fine to have something be good enough for now.
2. Find people who have done this before and get their advice. Find a VAR you can trust.
3. Plan, plan, plan, plan. Don't commit until you have a plan, make sure the plan is flexible enough to change course without tossing everything out, and plan to do a good enough job to survive long enough to figure out a better plan next time.
4. Get eval gear, qualify and test things, and make sure that what you think will work does work.
Basically whatever the top-of-the-line DIMM option may be (and this applies for CPUs and HDDs and other stuff too), you want to avoid being in a situation where you HAVE to use it. Vendors price these parts accordingly: you pay a premium for top-of-the-line because you must have it. If you can avoid that, do so.
The $/perf of 2630v4 is pretty decent ($2.24), but I would personally be leery for the reasons I mentioned. That said, I have used it for bulk storage servers, where CPU performance was not that important. So it's not like it will blow up your machine or something.
To obtain the perf number, I'm averaging single core and multi core fp and int SPEC numbers. If your workload isn't heavily parallelizable, that might not make sense. I'm not too worried about single core performance myself these days and have been tempted to remove it entirely.
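In code, the averaging described here amounts to something like the following. All SPEC figures and the list price below are placeholders for illustration, not real benchmark results:

```python
def perf_score(int_1c, fp_1c, int_nc, fp_nc):
    """Average single-core and multi-core int/fp SPEC figures into one number."""
    return (int_1c + fp_1c + int_nc + fp_nc) / 4

def dollars_per_perf(list_price, scores):
    """Lower is better: list price divided by the averaged SPEC score."""
    return list_price / perf_score(*scores)

# Hypothetical part: $700 list price, made-up SPEC figures
print(round(dollars_per_perf(700, (70, 95, 600, 450)), 2))  # -> 2.3
```

Dropping the single-core terms, as suggested, just means averaging the last two inputs instead of all four.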
One other thing I forgot to mention: v5 Xeon CPUs will be shipping in quantity early 2017, so you may want to consider holding off and looking for better deals on v4 CPUs then.
Likewise, you might be able to obtain a better deal today on v3 CPUs, particularly if you aren't using a large vendor like Dell or HPE. All of my pricing is list (I don't pay these prices), so the math changes significantly if you can get a disproportionate deal on a particular model. I use it as a place to start the conversation with my VAR, and then go with what makes sense in the market right now.
The call for "SDN" is incredibly nebulous - to the point of being almost meaningless, IMO. What the big guys tend to be after is a way to control the fabric via standardized API calls - so capability for YANG/NETCONF or some mechanism for direct access to SDE calls. The other thing that's not addressed is how to efficiently get information about the network out of the network. Traditional polled mechanisms (SNMP/RMON, et al.) have been shown to lack both scale and adequate resolution, while legacy approaches to push telemetry (sFlow/Netflow) miss the mark in terms of level of detail and compatibility with large-scale data processing needs.
The next point is the selection of topology and the integration with multi-site planning. There's a lot of cool stuff happening in this regard and there seems to still be a pretty big disconnect between what the systems folks seem to understand and what's happening in the network industry, which is a shame because there's probably more opportunity for neat stuff (read: scalability, performance, fault resistance, manageability) than seems to be discussed (at least on HN).
Finally - there's a certain conventional wisdom among the systems and some sections of the programmer crowd that network control planes are just another mostly simple bit of software to be implemented. It's not. It's a hard problem, and that is the manifest reason why only a handful of organizations have been able to produce software that runs a meaningful percentage of large-scale L3 infrastructure in the world (hint: Arista is a great company but isn't included in this number quite yet). Truly rugged, usefully-featured, and performant network code is hard. Making that code work in the context of 30+ years of protocol implementations, morphing standards, and a world of bad/clueful actors is REALLY hard. There's an inverse relationship between the amount of money spent on a solution and the amount of specialized expertise you need on staff. A more traditional commercial solution might be more expensive, but it also relieves you of the need to keep some relatively rare, likely expensive, and almost certainly non-revenue-producing skill-sets on staff.
Where will my DC be? What kind of DC is it? What services do they provide? How long do I want it to take for an employee to get there, whether or not they have 24/7 remote hands? What kind of power resiliency do they provide? What will power cost? What kind of power do they provide per cage and rack? How will their uplinks affect my traffic needs? Etc etc.
Cooling I didn't deal with directly, but suffice to say you will always need more cooling, and its efficacy will determine if your hardware stays alive or not. I've seen 11 foot racks with only 6 feet of hardware because they simply couldn't cool the racks at full height. Learn how to look for properly designed hot/cold aisles and keep your racks well organized to make cooling efficient.
Power is pretty obvious, except that it isn't. Eventually you will draw too much power and you'll need to shunt machines into different racks and monitor your power trends. So one of the things to consider, besides dual drops, is how many extra racks do I have for when I need to spread out my power OR cooling into new racks? Will they make me use a rack on the other side of the building, or am I going to pay for some in reserve next to the current ones? Get PDUs that aren't a pain in the ass to automate (APC sucks balls).
Network: i'm not a neteng, don't listen to me, but obviously it should be managed with nice fat switching fabric bandwidth, good forwarding rate and big uplink module support. 48-port switches don't always have the same bandwidth ratios as 24-port switches, and uplinks are much easier to manage on a 24-port than a 48.
Hardware: you don't seem to need anything special, so you need to determine if a support contract is necessary, and if not, buy the cheapest pieces of shit you can and then rely on remote hands or a local employee to change out broken shit all the time. If space, power, cooling are at a premium, a blade chassis can be handy. But if you can spare the space, power, and cooling, 0.5U and 1U shitboxes are fine for most purposes. Don't get wrapped up in the details unless your application design requires specific hardware performance guarantees.
Looked at iSCSI SANs? Could make WAN sync easier, reduce overhead from NFS, but probably depends on how well your OS supports it and the features of the SAN. Oh, and an OOB terminal server can be a godsend when combined with a good PDU.
Go find all the industry datacenter design papers out there (there are tons) to bone up on the design considerations. Remember that you can always replace machines, but you can't replace rack, cooling or power design.
With current gen ASICs and switches, this isn't generally true anymore. A $1000 48 port 1Gbps switch is fully non-blocking with almost 1:1 10G uplinks (48x1G in, 4x10G out)
C1: Have a look at the FatTwin Line for more Disks per U. More PCIe Slots too.
D3: Check the measurements, having it not fit is painful
D4: More, smaller drives. Make sure you go PMR, not SMR, if you do go for 8TB
N1: The "SDN" aspect of the Supermicro one is not really any different than any other. Look at https://bm-switch.com/ and get one that supports Cumulus Linux. Buy one with an x86 CPU. If you want to do "SDN" things or run custom monitoring, not dealing with PPC is great.
N3: Probably not needed, but not terribly expensive if it provides benefit.
N4: see N1, no.
N5: Cheap 1G switch that supports Cumulus, x86, probably Broadcom.
- I wouldn't advise using the 10GbE Copper --Go for 25GbE with DAC, it's basically the same price, Mellanox NICs are small/cheap
- Transit is cheap, you can get 500Mbps on a 10G port for $325/mo from Cogent
- If your bandwidth needs to scale up, data center locations matter more than you think
25GbE adapters: minimal additional cost for 2.5x the perf, and lower latency as well. A 32 port 100GbE switch is about $7,000-12,000; you can break that down to 128x 25GbE, and use the 25GbE ports running at 10GbE mode for your carrier uplinks. You could even do 100GbE to your Ceph nodes if you wanted, but be aware of PCIe bottlenecks: x8 is about 64Gbps, x16 will do about 126Gbps. Dual port 40G on x8 or dual port 100G on x16 will not provide more than those numbers.
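Those ceilings fall directly out of the PCIe 3.0 line rate: 8 GT/s per lane with 128b/130b encoding. A quick sanity check:

```python
def pcie3_gbps(lanes: int) -> float:
    """Raw usable PCIe 3.0 bandwidth per direction, before protocol overhead."""
    line_rate = 8.0       # GT/s per lane
    encoding = 128 / 130  # 128b/130b encoding efficiency
    return lanes * line_rate * encoding

print(round(pcie3_gbps(8)))   # -> 63, the "x8 is about 64Gbps" figure
print(round(pcie3_gbps(16)))  # -> 126, the x16 figure
```

Transaction-layer overhead shaves a few more percent off in practice, so a dual-port 100G NIC on x16 genuinely cannot run both ports at line rate.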
Consider Supermicro NVMe servers (Ultra series) for DBs, and 2.5in NVMe SSDs instead of PCIe.
Don't assume 45U; 40-48U is common. Consider buying 2 racks.
19kW seems high for a single rack; you will need a good datacenter to support that density. Density costs money - more racks is generally cheaper.
208V * 30A = 6240VA, and you can only use 80% of the power provided (breakers are derated for continuous load), so figure ~5000W usable per feed - unless they are talking about 208V 3-phase, which gets you ~8600W usable per feed. Even two 3-phase feeds is only ~17kW, and you need 18-20kW.
Helpful reference: http://www.raritan.com/blog/detail/3-phase-208v-power-strips...
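The arithmetic behind these feed numbers is easy to sketch: 80% is the standard continuous-load breaker derating, and sqrt(3) is the 3-phase factor.

```python
import math

def usable_watts(volts, amps, three_phase=False):
    """Usable power per feed after the 80% continuous-load derating."""
    factor = math.sqrt(3) if three_phase else 1.0
    return volts * amps * factor * 0.8

print(round(usable_watts(208, 30)))        # -> 4992, i.e. ~5000W single-phase
print(round(usable_watts(208, 30, True)))  # -> 8646, i.e. ~8600W 3-phase
```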
You also need rack PDUs, and higher density PDUs cost more money. Consider buying port-switchable PDUs; Raritan makes good ones.
Ask Supermicro (or your reseller) for a "Power Sheet"; it will tell you almost exactly how much power your server will use. I've had good luck with ThinkMate.
H1: Yes, too many to mention
H2: Do it yourselves
Hosting general: Cross connects cost money; a number of facilities offer free xconnects, and this can add up.
- You want a small toolbox in the data center
- Buy more cables than you think you'll need
- You'll always forget something
- There are a number of companies that will lease you servers for pretty decent rates
We are running a 50%+ gross margin mid-stage venture-backed startup in Equinix facilities (but started there vs. cloud), and have no people near our facilities, and have had 0 issues service-wise related to doing management remotely. Yes, people go out to set up cabs, etc, but we hired our ops folks as generalists who had some network experience, and our CEO and CTO do as well, though AFAIK I don't have network logins active right now.
2 high-level thoughts I'd share:
1) Try not to use Ceph unless you're committed to having 2 people with deep experience at the code level.
2) I'd use Juniper QFX or EX, or Aristas. You don't seem to be running at a scale or functionality level where SDN magic is needed, and there is a large community of QFX, EX, and Arista users your folks can reach out to when problems happen.
The other comments are more tuning and FYI on what we do HW-wise:
Specifically re: HW, at Kentik we run tens of worker nodes + flow ingest servers, all SM 1Us with a few SSDs and 256-384GB RAM. 48 logical cores, 2 x E5-2650v4.
We run approaching 1PB of storage, and while we still have some 4U 36-disk 3.5" boxes, those are phasing out and all we buy now is 2U SuperMicros with 24x 2TB Samsung 850 Evos. Procs are 72 logical cores, 2 x E5-2697v4.
The Evo SSDs have been great - but our workload is largely appends or create/writes - largely but not all sequential, with high read IOPS. Before Samsung I was a big fan of Intel but we have no data on the modern Intels - slower for sure, but a focus on reliability is great...
We use JBOD and ZFS on the storage nodes, with LSI 9300-8i HBAs. We have things tested so we can do TRIM.
They do make SuperServers for roughly those configs, but we go with SM resellers who assemble and burn-in for +10-15%. I had 50+ SuperServers that were great at my Usenet company, but we'd rather have our ops folks work on things other than burn-in.
Happy to explain why we went to SSD vs. spinning at 2x the cost, but basically it made enough of a difference at the 95th and 99th percentile in our query times, and we had access to venture debt on great terms (which you should too, and happy to discuss, since we're both funded by August).
Last note re: gear - when we were doing spinning, we found a screaming deal on new 2TB enterprise SATA (Hitachi, I think) for $50 and took the power/space hit for the +IOPS and extra compute we got for firing up the additional machines. Not sure if those are still out there, or the IOPS of this kind of approach would be needed.
GitLab is awesome. I'm really sad that in the past two or three months I've only found one GitLab link on HN to click. There really needs to be more. (I'm not sure if this is because I'm browsing in AEDT or if GitLab isn't used a lot on here.)
I wondered about how you guys might do advertising to get more mindshare, and then I realized one possible explanation about why you're doing this: getting technical advice from the community means everyone's had a part to play, and they're likely to remember that. Good move ;)
In my case, I have little (okay, 0) practical experience; a lot of the following is mentioned experimentally, to see how these ideas would handle the described environment. It's pretty much all stuff I've read online.
I welcome replies that shoot down any of these ideas.
Disks can be slow so we looked at improving latency. Higher RPM hard drives typically come in GB instead of TB sizes. Going all SSD is too expensive. To improve latency we plan to fit every server with an SSD card. On the fileservers this will be used as a cache. We're thinking about using Bcache for this.
There's already been another brief comment (https://news.ycombinator.com/item?id=13153317) about ZFS.
So, I'll ask. Why not ZFS? You don't have to run FreeBSD anymore to get a stable implementation.
You can put both the L2ARC and ZIL on SSDs. You can even use striping with them. Don't quote me on this, but I think there MAY be some recovery capabilities built into these layers for when the power goes out (either it didn't use to be possible and now it is, or it's architecturally impossible; I hilariously cannot remember which).
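For reference, putting the ZIL (as a dedicated SLOG) and L2ARC on SSDs is done at pool-creation time. A sketch with made-up device names (note that `log` vdevs can be mirrored, while `cache` vdevs can only be striped):

```
# Hypothetical pool: three mirrored data vdevs, a mirrored SLOG, striped L2ARC
zpool create tank \
  mirror sda sdb  mirror sdc sdd  mirror sde sdf \
  log mirror nvme0n1p1 nvme1n1p1 \
  cache nvme0n1p2 nvme1n1p2
```

On the power-loss question: the ZIL is replayed on pool import after a crash, and modern ZFS can import a pool even if an unmirrored log device has died; L2ARC contents are simply discarded on power loss, which is harmless since it's only a read cache.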
> In general 1GB of memory per TB of raw ZFS disk space is recommended.
This is ONLY if you have dedupe switched on. If you have dedupe off you can run systems in just 4GB. A lot of home server enthusiasts do this.
There are a lot of unfortunate and widespread misconceptions about ZFS.
(This bit's somewhat anecdotal and is more informational than actionable. It's worth noting if you're interested in disks.)
> Every node can fit 3 larger (3.5") harddrives. We plan to purchase the largest one available, a 8TB Seagate with 6Gb/s SATA and 7.2K RPM.
Technically, the largest one available (on Amazon and presumably elsewhere) right now is 10TB, but its price/capacity ratio is atrocious compared to the rest of the market ($450-$520 per disk).
I've heard that Seagate Enterprise Capacity drives either die within the first 2-4 weeks or last 20 years. They have 5 year warranties in any case. I haven't heard anything else about other disks.
Very interestingly, 8TB seems to be the current market leader. Here are a bunch of prices I took straight off Amazon, as guides:
#4: 4TB: $170 (13 disks for 52TB = $2210)
#2: 5TB: $200 (10 disks for 50TB = $2000)
#3: 6TB: $239 (9 disks for 54TB = $2151)
#1: 8TB: $239 (7 disks for 56TB = $1673)
#5: 10TB: $450 (5 disks for 50TB = $2250)
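The ranking here is by total cost to reach roughly 50TB of raw capacity, which is easy to recompute as street prices move. A small sketch with the per-disk prices passed in as a parameter (the example uses only two of the price points quoted in the list):

```python
import math

def rank_by_total_cost(tb_target, prices):
    """Rank disk sizes by total cost to reach tb_target raw TB.

    prices maps disk size (TB) -> USD per disk.
    Returns (size_tb, disk_count, total_usd) tuples, cheapest first.
    """
    options = []
    for size, price in prices.items():
        n = math.ceil(tb_target / size)
        options.append((size, n, n * price))
    return sorted(options, key=lambda opt: opt[2])

# Using the 5TB and 10TB price points from the list:
print(rank_by_total_cost(50, {5: 200, 10: 450}))
# -> [(5, 10, 2000), (10, 5, 2250)]
```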
A little while ago 5TB was the leader, and I was going to argue for more disks.
(Hitting add comment now instead of waiting so I can keep up with the discussion)