
Proposed server purchase for GitLab.com - jfreax
https://about.gitlab.com/2016/12/11/proposed-server-purchase-for-gitlab-com/
======
gr2020
Think much harder about power and cooling. A few points:

1\. Talk to your hosting providers and make sure they can support 32kW (or
whatever max number you need) in a single rack, in terms of cooling. At many
facilities you will have to leave empty space in the rack just to stay below
their W per sq ft cooling capacity.

2\. If you're running dual power supplies on your servers, with separate power
lines coming into the rack, model what will happen if you lose one of the
power lines, and all of the load switches to the other. You don't want to blow
the circuit breaker on the other line, and lose the entire rack.

3\. Thinking about steady state power is fine, but remember you may need to
power the entire rack at full power in the worst case. Possibly from only one
power feed. Make sure you have excess capacity for this.

The first time I made a significant deployment of physical servers into a colo
facility, power and cooling was quite literally the last thing I thought
about. I'm guessing this is true for you too, based on the number of words you
wrote about power. After several years of experience, power/cooling was almost
the only thing I thought about.

~~~
walrus01
Something you should do to understand the actual power consumption of a server
(AC power, in watts, at its input):

Build one node with the hardware configuration you intend to use. Same CPU,
ram, storage.

Put it on a watt meter accurate to 1W.

Install Debian amd64 on it in a basic config and run 256 threads of the
'cpuburn' package while simultaneously running iozone disk bench and memory
benchmarks.

This will give the figure of the absolute maximum load of the server, in
watts, when it is running at 100% load on all cores, memory IO, and disk IO.

Watts = heat, since all electricity consumed in a data center is either being
used to do physical work like spinning a fan, or is going to end up in the air
as heat. Laws of physics. As whatever data center you're using will be
responsible for cooling, this is not exactly your problem, but you should be
aware of it if you're going to try to do something like 12 kilowatts density
per rack.

Then multiply the wattage of your prototype unit by the number of servers, and
compare that against circuit capacity to see how many fully loaded systems will
fit on one 208V 30A circuit on the AC side.
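
A back-of-envelope version of that calculation, with a made-up per-node figure standing in for whatever the watt meter actually shows:

```python
# Rough sizing: how many fully loaded nodes fit on one 208V/30A circuit.
# 400 W is a placeholder for the measured peak draw under cpuburn + iozone.
measured_peak_watts = 400        # hypothetical per-node draw at 100% load
circuit_volts = 208
circuit_amps = 30
derating = 0.8                   # keep continuous loads at 80% of the breaker rating

usable_watts = circuit_volts * circuit_amps * derating    # 4992 W
nodes_per_circuit = int(usable_watts // measured_peak_watts)
print(f"{usable_watts:.0f} W usable -> {nodes_per_circuit} nodes per circuit")
```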

Also, use the system BIOS power recovery option to stagger boot times by 10
seconds per server, so that in the event of a total power loss an entire rack
of servers does not attempt to power up simultaneously.

~~~
btgeekboy
This is good advice. I'd consider _not_ having them auto power on, however.
This allows them to bring things up in a controlled manner.

That brings me to my next point: GitLab should also be mindful of which
services are stored on which hardware - performing heroics to work around
circular dependencies is the last thing you want to be doing when recovering
from a power outage.

~~~
walrus01
Yes, this is a choice that depends on the facility, and what sort of recovery
plan you have for total power failure. In many cases you would want to have
everything remain off, and have a remote hands person power up certain network
equipment and key services before everything else. On the other hand, you may
want to design everything to recover itself, without any button pushing from
humans.

Depends a lot on the server/HA/software architecture.

------
Spooky23
I'm a cranky old person now, but I think this is a crazy approach to take, and
I would be having a very challenging conversation with the engineer pitching
this to me.

My underlying assumption is that this is a production service with customers
depending on it.

1\. Don't fuck with networking. Do you have experience operating the same or
similar workloads on your Supermicro SDN? Will the CEO of your Supermicro VAR
pick up his phone at 2AM when you call?

My advice: Get a quote from Arista.

2\. Don't fuck with storage.

32 file servers for 96TB? Same question as with networking re:ceph. What are
your failure domains? How much does it cost to maintain the FTEs who can run
this thing?

3\. What's the service SLA on the servers? Historically, supermicro VARs have
been challenged with that.

If I were building this solution, I'd want to understand what the ROI of this
science project is as compared to the current cloud solution and a converged
offering like Cisco/NetApp, HP/3Par or something like Nutanix. You're probably
saving like 20-25% on hardware.

This sounds to me like pinching pennies on procurement and picking up tens of
dollars of cost on the labor side. If you're accustomed to SLAs from AWS, this
will be a rude awakening.

~~~
manacit
I'm happy to see this - I could not agree more with these points.

I think they are coming at this problem from the wrong perspective - instead
of growing from virtual servers to their own dedicated hardware to get better
CephFS performance, they should take a hard look at their application and see
if they can architect it in a way that does not require a complex distributed
filesystem to present a single mount that they can expose over NFS. At some
point in the future, it _will_ bite them. Not an if, but a when.

In addition, this means that running physical hardware, CephFS and Kubernetes
(among other things) are now going to be part of their core competencies - I
think they are going to underestimate the cost to GitLab in the long run. When
they need to pay for someone to be a 30 minute drive from their DC 24/7/365
after the first outage, when they realize how much spare hardware they are
going to want around, etc.

As someone who has run large scale Ceph before (though not CephFS,
thankfully), it's not easy to run at scale. We had a team of 5 software
engineers as well as an Ops and Hardware team, and we had to do a lot to get
it stable. It's not as easy as installing it and walking away.

~~~
ninkendo
I don't see why it wouldn't be feasible to do a git implementation that
replaces all the FS syscalls with calls to S3 or native Ceph or some other
object store. If all they're using NFS for is to store git data, it seems like
a big win to put in the up-front engineering cost.

I mean, especially because git's whole model of object storage is content-
addressable and immutable, it looks like it's a prime use for generic object
storage.
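
For a sense of how mechanical that mapping would be: a git loose object is just a small header plus content, SHA-1 hashed and zlib-compressed, so deriving an object-store key from it is trivial. A minimal sketch, where the bucket/key layout is purely hypothetical:

```python
import hashlib
import zlib

def git_blob_oid(data: bytes) -> str:
    """Object ID git assigns to a blob: SHA-1 over 'blob <len>\\0' + content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

def object_store_key(oid: str) -> str:
    # Hypothetical key layout, mirroring the .git/objects/ab/cdef... fan-out.
    return f"git-objects/{oid[:2]}/{oid[2:]}"

data = b"hello gitlab\n"
oid = git_blob_oid(data)
loose_object = zlib.compress(b"blob %d\x00" % len(data) + data)  # what a loose object holds on disk
print(oid, object_store_key(oid), len(loose_object))
```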

~~~
charliesome
Latency is an issue. Especially when traversing history in operations like log
or blame, it is important to have an extremely low latency object store.
Usually this means a local disk.

~~~
cmn
Yuup. Latency is a huge issue for Git. Even hosting Git at non-trivial scales
on EBS is a challenge. S3 for individual objects is going to take forever to
do even the simplest operations.

And considering the usual compressed size of commits and many text files,
you're going to have more HTTP header traffic than actual data if you want to
do something like a rev-list.

~~~
ninkendo
I'm trying to think of the reason why NFS's latency is tolerable but S3's
wouldn't be. (Not that I disagree, you're totally right, but why is this true
in principle? Just HTTP being inefficient?)

I would imagine any implementation that used S3 or similar as a backing store
would have to rely _heavily_ on an in-memory cache (exploiting the content-
addressable-ness) to avoid re-looking things up.

I wonder how optimized an object store's protocol would have to be (http2 to
compress headers? Protobufs?) before it starts converging on something that
has similar latency/overhead to NFS.
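
One nice property here: because the objects are content-addressed and immutable, a cache keyed by object ID never needs to be invalidated, only sized and evicted. A toy sketch of that idea, with the fetch function as a hypothetical stand-in for an S3/RADOS GET:

```python
from functools import lru_cache

def fetch_from_object_store(oid: str) -> bytes:
    raise NotImplementedError("hypothetical stand-in for an S3/RADOS GET")

# Content-addressed objects are immutable, so a cache keyed by object ID never
# needs invalidation; the only concerns are sizing and eviction (note lru_cache
# counts entries, not bytes, so a real cache would track memory instead).
@lru_cache(maxsize=100_000)
def get_object(oid: str) -> bytes:
    return fetch_from_object_store(oid)
```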

------
wmf
I have built out a few racks of Supermicro twins. In general I would suggest
hiring your ops people first and then letting them buy what they are
comfortable with.

C2: The Dell equivalent is C6320.

CPU: Calculate the price/performance of the _server_, not the processor
alone. This may lead you towards fewer nodes with 14-core or 18-core CPUs.

Disk: I would use 2.5" PMR (there is a different chassis that gives 6x2.5" per
node) to get more spindles/TB, but it is more expensive.

Memory: A different server (e.g. FC630) would give you 24 slots instead of 16.
24x32GB is 768GB and still affordable.

Network: I would not use 10GBase-T since it's designed for desktop use. I
suggest ideally 25G SFP28 (AOC-MH25G-m2S2TM) but 10G SFP+ (AOC-MTG-i4S) is OK.
The speed and type of the switch needs to match the NIC (you linked to an SFP+
switch that isn't compatible with your proposed 10GBase-T NICs).

N1: A pair of 128-port switches (e.g. SSE-C3632S or SN2700) is going to be
better than three 48-port. Cumulus is a good choice if you are more familiar
with Linux than Cisco. Be sure to buy the Cumulus training if your people
aren't already trained.

N2: MLAG sucks, but the alternatives are probably worse.

N4: No one agrees on what SDN is, so... mu.

N5: SSE-G3648B if you want to stick with Supermicro. The Arctica 4804IP-RMP is
probably cheaper.

Hosting: This rack is a great ball of fire. Verify that the data center can
handle the power and heat density you are proposing.

~~~
MichaelGG
Can you elaborate on what deficiencies 10GBase-T has in server applications?

~~~
walrus01
One, category 6A cables actually cost two and a half times as much as basic
single mode fiber patch cables. Two, cable diameter. Ordinary LC to LC fiber
cables with 2 millimeter duplex fiber are much easier to manage than category
6A. Three, choice of network equipment. There is a great deal more equipment
that will take ordinary SFP+ 10 gig transceivers than equipment that has 10
gigabit copper ports. As a medium-sized ISP, I rarely if ever see copper 10
gigabit.

~~~
hhw
SFP+ 10Gbase-T 3rd party optics started hitting the market this year, but none
of the major switch vendors offer them yet, so the coding effectively lies
about what cable type it is. Just an option to keep in your back pocket, as
onboard 10Gb is typically 10Gbase-T. Thankfully, onboard SFP+ is becoming more
common however.

For short distances of known length, twinax cables (which are technically
copper) can be used. They're thinner than regular cat6a, only about the same
as thin 6a, and thicker than typical unshielded duplex fiber patch cables.
Twinax can be handy if connecting Arista switches to anything else that
restricts 3rd party optics, as Arista only restricts other cable types. Twinax
is also the cheapest option.

------
compumike
The raw performance benefits of bare metal vs cloud are incredible, but why
does that necessarily mean building & maintaining your own hardware when you
can lease (or work out whatever financing you want, but still let the hosting
company maintain a lot of the responsibility for HW)? And besides financing,
taking on all the HW maint? I'm not sure your needs are so unique as to
require custom hardware.

You're talking about only 64 nodes right now. Your storage and IOPS
requirements are not huge. A lot of mid-size hosting companies will give you
fantastic service at the 10-1000 servers range. If I were you I'd talk to
someone like [https://www.m5hosting.com/](https://www.m5hosting.com/) (note:
happy dedicated server customer for many years -- and I'm sure there's similar
scale operations on east coast if that's really what you need) who have
experience running rock solid hosting for ~100s of dedicated servers per
customer.

I suspect you may just be able to get your 5-10X cost/month improvement (and
bare metal performance gains) without having to take on the financing and
hardware bits yourself.

~~~
sytse
We looked at providers such as Softlayer, but while they guarantee the
performance of the servers, they typically can't guarantee network latency.
Since we're doing this to reduce latency
[https://about.gitlab.com/2016/11/10/why-choose-bare-
metal/](https://about.gitlab.com/2016/11/10/why-choose-bare-metal/) this is
essential to us.

We'll be glad to look into alternatives that manage the servers and network
for us although the argument in
[https://news.ycombinator.com/item?id=13153455](https://news.ycombinator.com/item?id=13153455)
that this is the time to build a team that can handle this makes sense to me
too.

~~~
pyrox420
Been a softlayer customer for 4 years now. Their network is pretty awesome.
When hosted in the same datacenter it's sub-ms response time always. If there
is an issue they get right on it. You can even ask them to host the stuff in
the same rack to get even better response time.

~~~
sulam
We are also a SL customer -- 4-figures of hosts with them. We have had
networking problems in the past (latency and loss far higher than I would
expect to see in a well-provisioned DC) and talked to them about it. It ended
up being contention with another customer, it got fixed, and our network
performance has been great since.

I would encourage you to look at what you can get without trying to do your
own colo. You're not at the scale where you should be thinking about that.

~~~
sytse
Thanks for the suggestion. Any idea if they can offer 40 Gbps networking?

~~~
ihsw
Few will offer 40Gbps without charging a pretty penny; generally the jump will
go from 10Gbps to 100Gbps, but not until it becomes cost-effective (and not
anytime soon).

That said, SoftLayer does provide 20Gbps access within a private rack and
20Gbps access to the public network.

~~~
rphlx
Most large scale operations are (or soon will be) deploying 25GE and/or 50GE
in place of 10Gbps Ethernet. 100GE to each node is unnecessary for most
workloads & more importantly it's obscenely expensive and likely to remain so
for at least 3 more years.

------
theptip
If you're committed to having a robust architecture (this may not be
financially viable immediately) you should study the mistakes that Github have
made, e.g.
[https://news.ycombinator.com/item?id=11029898](https://news.ycombinator.com/item?id=11029898)

Geo-redundancy seems like a luxury, until your entire site comes down due to a
datacenter-level outage. (E.g. the power goes down, or someone cuts the
internet lines when doing construction work on the street outside).

(This is one of the things that is much easier to achieve with a cloud-native
architecture).

~~~
StreamBright
Exactly this. I don't think bare metal is cheaper once you take __all__
parameters into account. My other problem is that they are moving from cloud to
bare metal because of performance while using a bunch of software that is
notoriously slow and wasteful. I would optimise the hell out of my stack
before committing to a change like this. Building your own racks does not
deliver business value and it is an extremely error-prone process (been there,
done that). There are a lot of vendors where the RMA process is a nightmare. We
will see how it turns out for Gitlab.

~~~
mohctp
The RMA process is pretty much a moot point at this scale. It simply doesn't
make sense to buy servers with large warranties. Save the money, stock spares,
and when a component dies replace it. In the long run it will come out to cost
a lot less. I'd much rather pay much less, knowing a component may fail here
and there, and just buy a new one when it does.

As far as moving from cloud to bare metal, another thing to take into
consideration (as the comment this replied to notes): if you don't architect
your AWS (or other cloud solution) to take advantage of multiple geographic
regions, the cloud won't benefit you.

I 100% agree that there should be more than one region deployed for this
service. As others have said, all it takes is 1 event and the site will be
down for days to weeks to months (It may not happen often, but when it does,
you go out of business). The size and complexity of this infrastructure will
make it nearly impossible to reproduce on short order in a new facility. If I
were the lead on this I would have either active / active sites, or an active
/ replica site.

I would also have both local (fast restores), and off-site backups of all
data. A replica site protects against site failure not data loss and point-in-
time recovery.

~~~
StreamBright
"As far as moving from cloud to bare metal, another thing to take into
consideration (that this was replied to), if you don't architect your AWS (or
other cloud solution) to take advantage of multiple geographic regions, the
cloud won't benefit you."

Yep, this is why scaling starts with scalable distributed design. We were
moving a fairly large logging stack from NFS to S3 once, for the same reason
Gitlab is trying to move to bare metal now. Moving off cloud was not an
option, moving to a TCO efficient service was. NFS did not scale and there was
the latency problem. I think moving to bare metal cannot help with scale as
much as a good architecture can. We will see how deep the datacenter hole
goes. :)

~~~
mohctp
Agreed. Application architecture is far more important than cloud vs. bare
metal. It is just easier and more cost effective to throw more bare metal
hardware at the problem than it is cloud instances. For some this does make
bare metal the better option.

To add to my previous comment though, AWS (and cloud in general) tends to make
much more sense if you are utilizing their features and services (such as
Amazon RDS, SQS, etc.). If you aren't using these services, I can absolutely
guarantee I can deliver a much lower TCO on bare metal than AWS (which is why I
offered to consult for them). I see this all the time: a company moves from
bare metal to AWS as bare metal is getting expensive, then quickly finds out
AWS can't deliver the performance they need without massive scale (because they
aren't using a properly scalable distributed design and can't afford to
re-architect their platform).

------
dsr_
Z1: the word "monitoring" does not appear in this document.

You will need to monitor:

\- ping
\- latency
\- temperatures
\- cpu utilization
\- ram utilization
\- disk utilization
\- disk health
\- context switches
\- IP addresses assigned and reaching expected MACs
\- appropriate ports open and listening
\- appropriate responses
\- time to execute queries
\- processes running
\- process health
\- at least something for every bit of infrastructure

Once you collect that information, you need to record it, graph it, display
it, recognize non-normal conditions, alert on those, page for appropriate
alerts, and figure out who answers the pagers and when.
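
As one tiny example of the "appropriate ports open and listening / appropriate responses" items, a check can be as simple as opening the port and looking at the banner; the host, port, and expected banner below are illustrative, not anything from GitLab's setup:

```python
import socket

def check_port(host: str, port: int, expect: bytes = b"", timeout: float = 3.0) -> bool:
    """Open a TCP connection and, optionally, verify the service banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            if not expect:
                return True
            s.settimeout(timeout)
            banner = s.recv(64)
            return banner.startswith(expect)
    except OSError:
        return False

# e.g. an SSH daemon should greet with "SSH-2.0".
print(check_port("127.0.0.1", 22, expect=b"SSH-2.0"))
```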

~~~
sytse
Good point, we know monitoring is very important.

Our Infrastructure lead Pablo will do a webinar of our Prometheus monitoring
soon
[https://page.gitlab.com/20161207_PrometheusWebcast_LandingPa...](https://page.gitlab.com/20161207_PrometheusWebcast_LandingPage.html)

We're bundling Prometheus with GitLab [https://gitlab.com/gitlab-org/omnibus-
gitlab/issues/1481](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481)

Brian Brazil is helping us [https://gitlab.com/gitlab-org/omnibus-
gitlab/issues/1481#not...](https://gitlab.com/gitlab-org/omnibus-
gitlab/issues/1481#note_19428808)

On January 9 our Prometheus lead will join us (he was already very valuable in
helping with the research behind this blog post) and we're hiring Prometheus
engineers [https://about.gitlab.com/jobs/prometheus-engineer/](https://about.gitlab.com/jobs/prometheus-engineer/)

In the short term we might send our monitoring and logs to our existing cloud
based servers. In the long term we'll host them on our own Kubernetes cluster.

For our monitoring vision see [https://gitlab.com/gitlab-org/gitlab-
ce/issues/25387](https://gitlab.com/gitlab-org/gitlab-ce/issues/25387)

~~~
favadi
The GitLab package includes too many things! I remember GitLab sponsored
someone last year to do proper packaging so GitLab could be included in Debian;
not sure what the status is.

------
nik736
Are you sure about the location? I would go with Frankfurt, Germany. It has the
biggest IX in the world, and if you want a "low-latency" solution for all users
it is basically in the middle of everything. NYC will have a worse connection
to Asia, and I don't even want to get started on India. Frankfurt is basically
only 70ms away from NY and around 120ms from the west coast, while even South
America should be < 200ms. Russia is 40-50ms and Asia should be between 100ms
and 200ms. Australia will probably have the worst ping with around 200-300ms.

Just as a suggestion since you seem to be so certain about the location. I
basically know every DC in Frankfurt, so if you need any help or info in that
regard feel free to contact me :-)

~~~
sytse
Frankfurt surely has a great IX. But even if Frankfurt made everyone better
off (which I'm not sure about), there is another problem: people in the US are
used to lower latencies because most SaaS services are hosted there.

~~~
samstave
I have less of a concern on latency than I do with who has taps in the
lines...

I would assume that the .de lines are saturated with 5 eyes...

~~~
nikanj
I remain skeptical about the existence of non-tapped lines.

~~~
samstave
+1

EDIT: Funny, every single documented, proven link for the last several _years_
has proven me and GP correct and all you nay-sayers wrong.

~~~
predakanga
I believe the downvotes are due to "+1" not adding anything to the
conversation, as opposed to people disagreeing with the sentiment.

~~~
samstave
Isn't it ironic that a (+1) on HN will get downvotes, yet that's basically what
every single HNer seeks to see in their requests on git. :-)

------
ninjakeyboard
I'm sure that you guys have done a lot of analysis, but flipping to steel is
the very last thing I would consider before reviewing the tech in the world
today in relation to TCO, as things move very fast.

You could frame infrastructure cost savings in many different ways though. The
most obvious solution may seem to be spending to move from the cloud to
in-house bare metal, but I feel like you'll have a lot of costs that you
haven't accounted for in maintenance, operational staff spend, and lost
productivity as you make a bunch of mistakes.

Your core competency is code, not infrastructure, so striking out to build all
of these new capabilities in your team and organization will come at a cost
that you can not predict. Looking at total cost of ownership of cloud vs steel
isn't as simple as comparing the hosting costs, hardware and facilities.

You could reduce your operating costs by looking at your architecture and
technology first. Where are the bottlenecks in your application? Can you
rewrite them or change them to reduce your TCO? I think you should start with
the small easy wins first. If you can tolerate servers disappearing, you can
cut your operating costs by 1/8th by using preemptible servers in Google Cloud
Platform, for instance. If you optimize your most expensive services I'm sure
you can cut hunks of your infrastructure in half. You'll have made some
mistakes along the way that contribute to your TCO - why not look at those
before moving to steel and see what cost savings vs investment you can get
there?

Ruby is pretty slow, but that's not my area of expertise - I wouldn't build
server software on dynamic languages for a few reasons, and performance is one
of them, but that's neither here nor there, as you can address some of these
issues in your deployment with modern tech, I'm sure. Aren't there ways to
compile the Ruby and run it on the JVM for optimization's sake?

Otherwise do like facebook and just rewrite the slowest portions in scala or
go or something static.

Try these angles first - you're a software company so you should be able to
solve these problems by getting your smartest people to do a spike and figure
out spend vs savings for optimizations.

------
deadbunny
Yeah, link aggregation doesn't work how they think it does. And not having a
separate network for Ceph is going to bite them in the arse.

GitLab is fine software but fuck me, they need to hire someone with actual ops
experience (based on this post and their previous "we tried running a
clustered file system in the cloud and for some reason it ran like shit"
post).

~~~
btgeekboy
They'd save themselves a whole lot of time, effort, and money if they looked
at partitioning their data storage instead of plowing ahead with Ceph. They
have customers with individual repositories. There is no need to have one
massive filesystem / blast radius.

~~~
comex
I don't know much about this stuff, but won't that stop working if they ever
decide to expand to multiple geographical sites, to reduce latency to
customers in different locations? In that case, different sites can receive
requests for the same repositories, and ideally each site would be able to
provide read only access without synchronization, with some smarts for
maintaining caches, deciding which site should 'own' each file, etc. They
could roll their own logic for that, but doesn't that pretty much exactly
describe the job of a distributed filesystem? So they'd end up wanting Ceph
anyway, so they may as well get experience with it now.

------
lykron
If you are looking for performance, do not get the 8TB drives. In my
experience, drives above 5TB do not have good response times. I don't have
hard numbers, but I built a 10 disk RAID6 array with 5TB disks and 2TB disks
and the 2TB disks were a lot more responsive.

From my own personal experience, I would go with a PCIe SSD cache/write
buffer, and then a primary SSD tier and a HDD tier. Storage, as it seems you
guys have experienced, is something you want to get right on the first try.

Edit:

N1: Yes, the database servers should be on physically separate enclosures.

D5: Don't add the complexity of PXE booting.

For Network, you need a network engineer.

This is the kind of thing you need to sit down with a VAR for. Find a
datacenter and see who they recommend.

~~~
sytse
Thanks for commenting! You're right that 2TB disks will have more heads per TB
and therefore more IO. We want to increase performance of the 8TB drives with
Bcache. If we go to smaller drives we won't be able to fit enough capacity in
our common server that has only 3 drives per node. In this case we'll have to
go to dedicated file servers, reducing the commonality of our setup. We're
using JBOD with Ceph instead of RAID.

If the 8TB drives don't offer the IO we need we'll have to leave part of them
empty. I assume that if you fill them with 2TB they should perform equally
well. Word on the street is that Google offers Nearline to fill all the
capacity they can't use because of latency restrictions for their own
applications.

It is an interesting idea to do SSD cache + SSD storage + HDD instead of just
SSD cache + HDD. I'm not sure if Ceph allows for this.

~~~
lykron
It doesn't work like that. A 2TB drive will always be faster than an 8TB drive.
The amount of data stored has no effect compared to the physical attributes of
the drive. More platters will increase the response time.

Ceph seems to offer Tiering, which would move frequently accessed data into a
faster tier while the infrequent data to a slower tier.

~~~
toomuchtodo
> Ceph seems to offer Tiering, which would move frequently accessed data into
> a faster tier while the infrequent data to a slower tier.

By "Tiering", is this moving data between different drive types? Or by moving
the data to different parts of the platter to optimize for rotational physics?

~~~
lykron
By Tiering, I'm talking about moving blocks of data from slower to faster
storage or vice versa. If I have 10TB of 10k IOPS storage and 100TB of 1k IOPS storage in
a tiered setup, data that is frequently accessed would reside in the 10k IOPS
tier while less frequently accessed data would be in the 1k tier. In this
case, the blocks of popular repositories would be stored in SSD, while the
blocks of your side project that you haven't touched in 4 years would be on
the HDD. You still have access to it, it might just take a bit longer to
clone.

Ceph can probably explain it better than I can.
[http://docs.ceph.com/docs/jewel/rados/operations/cache-
tieri...](http://docs.ceph.com/docs/jewel/rados/operations/cache-tiering/)
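
To make the promote-on-access idea concrete, here is a deliberately toy sketch; Ceph's actual cache tiering agent (hit sets, flush/evict thresholds, writeback vs. read-only modes) is far more involved than this:

```python
class TieredStore:
    """Toy promote-on-read sketch of cache tiering; not how Ceph implements it."""

    def __init__(self, hot_capacity):
        self.hot = {}            # fast tier, e.g. an SSD pool
        self.cold = {}           # slow tier, e.g. an HDD pool
        self.hot_capacity = hot_capacity

    def write(self, key, value):
        self.cold[key] = value   # in writeback mode, writes would land in the hot tier first

    def read(self, key):
        if key in self.hot:      # hot hit: served from the fast tier
            return self.hot[key]
        value = self.cold[key]   # cold hit: fetch from the slow tier...
        if len(self.hot) >= self.hot_capacity:
            self.hot.pop(next(iter(self.hot)))   # crude eviction to make room
        self.hot[key] = value    # ...and promote it on access
        return value
```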

~~~
sytse
That is pretty awesome. Should the SSDs for the fast storage be on the OSD
nodes that also have the HDDs, or should they be on separate OSD nodes?

~~~
illumin8
The fact that you're asking this question on Hacker News leaves little doubt
in my mind that you and your team are not prepared for this (running bare
metal).

I read the entire article, and while you talked about having a backup system
(in a single server, no less!) that can restore your dataset in a reasonable
amount of time, you have no capability for disaster recovery. What happens
when a plumbing leak in the datacenter causes water to dump into your server
racks? How long would it take you to acquire servers, build the racks and get
them ready to host customer data again? Can your business withstand that
amount of downtime (days, weeks, months?) and still operate?

These questions are the ones you need to be asking.

In other words, double your budget because you'll need a DR site in another
datacenter.

~~~
lykron
They don't need another datacenter. AWS has had more issues in the past year
than NYI's datacenters have had. There are many companies that are based
out of a single datacenter that haven't had any major issues.

~~~
toomuchtodo
Backblaze comes to mind.

------
sytse
I'll be here all day to learn from suggestions. I'm hoping for a lot of
feedback, so please reference questions by letter and number: 'Regarding R1'.

~~~
neom
I'm curious to know if you think this work is within the core competency of
GitLab. If so, how did you decide it was? If not, how do you realize the
investment over time in something that isn't? Is the GitLab CEO here?

~~~
sytse
GitLab CEO here. Hardware and hosting are certainly not our core competencies.
Hence all the questions in the blog post. And I'm sure we made some wrong
assumptions on top of that. But it needs to become a core competency, so we're
hiring [https://about.gitlab.com/jobs/production-
engineer/](https://about.gitlab.com/jobs/production-engineer/)

~~~
thebyrd
I think you're underestimating the number of people required to run your own
infrastructure. You need people who can configure networking gear, people
swapping out failed NICs/drives at the datacenter, someone managing vendor
relationships, and people doing capacity planning.

I think you could probably get the IO performance you're talking about in your
blog post from AWS instances or Google Cloud's local NVMe drives, but if you
truly need baremetal, I'd recommend Packet or Softlayer. Don't try to run your
own infrastructure or in a year you'll be:
[https://imgflip.com/i/1fs7it](https://imgflip.com/i/1fs7it)

------
halbritt
Just a few quick notes. I have experience running ~300TB of usable Ceph storage.

Stay away from the 8TB drives. Performance and recovery will both suck. 4TB
drives still give the best cost per GB.

Why are you using fat twins? Honestly, what does that buy you? You need more
spindles, and fewer cores and memory. With your current configuration, what
are you getting per rack unit?

Consider a 2028u based system. 30 of those with 4TB drives gets you the 1.4PB
raw storage you're looking for. 2683v4 processors will give you back your core
count, yielding 960 cores (1920 vCPUs) across that entire set. You can add a
half terabyte of memory or more per system in that case.

Sebastien Han has written about "hyperconverged" ceph with containers. Ping
him for help.

The P3700 is the right choice for Ceph journals. If you wanted to go cheap,
maybe run a couple of M.2 NVMe drives on adapters in PCI slots.

I didn't really need the best price per GB in my setup, so I went with 6TB
HGST Deskstar NAS drives. I'm suggesting you use 4TB as you need the IOPS and
aren't willing to deploy SSD. Those particular drives have 5 platters and a
relatively high areal density, giving them some of the best throughput numbers
among spinning disks.

If you can figure out a way to make some 2.5" storage holes in your
infrastructure, the Samsung SM863 gives amazing write performance and is way,
way cheaper than the P3700. I recently picked up about $500k worth, I liked
them so much. They run around $.45/GB. Increase over-provisioning to 28% and
they outperform every other SATA SSD on the market (Intel S3710 included).

You'll probably want to use 40GE networking. I've not heard good things about
Supermicro's switches. If I were doing this, I'd buy switches from Dell and
run Cumulus linux on them.

Treat your metal deployment as infrastructure as code, just like any cloud
deployment. Everything in git, including the network configs. Ansible seems to
be the tool of choice for NetDevOps.

~~~
sytse
Thanks for the great suggestions.

We're considering the fat twins so we get both a lot of CPU and some disk.
GitLab.com is pretty CPU heavy because of ruby and the CI runners that we
might transfer in the future. So we wanted the maximum CPU per U.

The 2028u has 2.5" drives. For that I only see 2TB drives on
[http://www.supermicro.com/support/resources/HDD.cfm](http://www.supermicro.com/support/resources/HDD.cfm)
for the SS2028U-TR4+. How do you suggest getting to 4TB?

~~~
msandford
Also whatever you do, don't buy all one kind of disk. That'll be the thing
that dies first and most frequently. Buy from different manufacturers and
through different vendors to try and get disks from at least a few different
batches. That way you don't get hit by some batch of parts being out of spec
by 5% instead of 2% and them all failing within a year, all at the same time.

If you do somehow manage to pick the perfect disk, then sure, having everything
from a single batch would be best, since that'll ensure you have the longest
MTBF. But how sure are you that you'll be picking the perfect batch simply by
blind luck?

~~~
halbritt
I had this problem with the Supermicro SATA DOMs. Had problems with the whole
batch.

That said, I bought the same 6TB HGST disk for two years.

~~~
msandford
As long as you're not buying all the disks at once, sticking with one
manufacturer and brand should be fine. If you're buying 25% of your total
inventory every year, it'll all be spread out to just a few percent per month.

But when you're buying 100% of your disk inventory at once there's a serious
"all eggs in one basket" risk.

------
nwilkens
The estimate of 19kW gives a rough requirement of ~90A @ 208V.

4 x 208V 30A circuits gives a total of 120A -- which can only be used at 80%
capacity, so that gives you 96A usable -- without redundancy.

My initial feeling (eyeballing it) is that you should be looking for 3 full
racks, 2x208V@30A in each.
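
The same arithmetic spelled out, including what happens if one of two feeds drops; the figures are from the paragraph above, and the redundancy split is an assumption:

```python
# Rack power sanity check; the redundant-feed split is an assumption.
rack_kw = 19.0
volts = 208
amps_needed = rack_kw * 1000 / volts                  # ~91 A
circuits, breaker_amps, derating = 4, 30, 0.8
usable_amps = circuits * breaker_amps * derating      # 96 A at the 80% continuous-load rule

# If the four circuits are two redundant feeds and one feed is lost:
surviving_amps = (circuits // 2) * breaker_amps * derating   # 48 A -- not enough for ~91 A

print(f"need ~{amps_needed:.0f} A, have {usable_amps:.0f} A usable, "
      f"{surviving_amps:.0f} A if one feed drops")
```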

As a juniper shop, we implemented the 2xQFX 5100 48T (in virtual chassis) +
ex4300 for remote access per rack. This will be a decision based on your local
expertise though.

I also looked hard into the twin boxes -- but power to the rack in the end
ruled the day, and it made not much of a difference to use the 1U boxes.

Don't forget about out of band (serial for the switching gear). We've been
using OpenGear.com for this stuff with 4G-LTE builtin.

VPN access to access console devices?

Any site-to-site VPN needs?

Also, I would consider not pre-purchasing N times your required
horsepower/disk if you can avoid it, but rather adding capacity in pre-planned
yearly or biannual stages.

As the CEO of a cloud hosting & server management company -- I have much more
to say about this if you would like to chat via phone or email anytime.

------
mohctp
There are two major pitfalls to crowd-sourced consulting such as this.

1) Contributors have not been vetted - Some responses are based on real world
experience, and some are conjecture from armchair quarterbacks. (A simple
example: nobody has mentioned that with any of the SuperMicro 2U Twins you
have to be cautious about the PDU models and outlet locations of 0U PDUs so as
not to block node service/replacement in the rack.)

2) There are multiple ways to skin a cat - There are many viable solutions in
this thread, but you can't simply take a little bit of this, a little bit of
that, and piece together a new platform that "just works." \-- Better to go
with a known working solution than a little of this and a little of that.
Multiple drivers tend to be less effective.

I am the owner of a bespoke Infrastructure as a Service provider that delivers
solutions to sites of similar metrics, and speak with plenty of real-world
experience.

Larger Providers - We find a lot of clients move away from AWS, Softlayer,
Rackspace, et al. as the larger providers aren't nearly as interested in
working with the less-than-standard configurations. They want highly-
repeatable business, not one-off solutions.

I'd love to talk in more depth with you about how we can deliver exactly what
you need based on years of experience in delivering highly customized
solutions. We'll save you money and headache.

~~~
rphlx
HN crowdsourcing is a pretty reasonable strategy for entities that cannot
reliably identify & hire 1+ rockstar employee(s) and/or VARs to cover the
compute, networking, storage, electrical, environmental, etc. If you can
rationally evaluate the HN comments you should get pretty close to the best,
cutting-edge advice. Whereas when you are small and you listen to 1 or 2 VARs
and/or 1-2 internal employees you can expect, on average, to get average
advice. Or advice that was excellent 2-3 years ago but is now out-of-date due
to HW/SW progress that the employee/VAR is unaware of.

~~~
mohctp
Cutting-edge and availability aren't really always best friends. For example
cutting edge routing and switching (the latest products, fabric, SDN's, MC-
LAG, etc) are notorious for failures and outages.

Average is not the appropriate word, however I'll use your word. I'd rather
have (and so would every enterprise out there) average advice based on tried
and true solutions that costs 10% more with 99.99 to 99.999% availability than
cutting-edge advice that saves 10-20% with 99.8% availability. The downtime
alone can (and does) kill the reputations of sites like GitLab.com (and
others).

------
hxegon
I wish more companies were this open; it's pretty cool to see the decision
making and reasoning behind operational stuff like this.

------
siliconc0w
It'd be interesting to see more application-level solutions to scale rather
than just adding hardware. Like extending git so objects can be fetched via a
remote object-store rather than dealing with a locally mounted POSIX file
system. This would allow you to use native cloud object stores and might
simplify your latency requirements to the point where you could consider
staying in the cloud.

This definitely increases software complexity but going the other way
increases other complexities (ops, capex, capacity).

~~~
sytse
I thought the same and looked into this years ago, but the short story is that
many git operations need block access to the repository. With an object store
they become very slow.

------
dsr_
Z2 (no, Z is not on the list you have)

You need at least two nodes that do DNS, DHCP, NTP, and other miscellaneous
services that you absolutely want to have but do not seem to have mentioned.
You want them to be permanently assigned, so that you never have to search for
them, and you want them to fail over each service as needed, preferably both
operating at once. Three nodes would be better. Consider doing some basic
monitoring on these nodes, too, as a backup to your main monitoring system.

~~~
skuhn
Absolutely agree. Typically I deploy about 4-6 "tools" hosts in a site. From a
computational standpoint, you could make do with fewer hosts, but there are
some things that I prefer to separate out:

1&2\. Authoritative DNS (internal)

1&2\. NTP

1&2\. Caching DNS resolver

3&4\. Outbound HTTP proxy (if necessary)

3&4\. PXE / installer / dhcp

3&4\. Local software mirror (apt / yum / etc.)

5&6\. SSH jump hosts.

Or something like that.

Make sure the second host is not just in a separate chassis from the first
host, but also in a separate rack.

For external authoritative DNS, don't do it yourself -- pay for someone to run
that for you (Route53, ns1, etc.). For e-mail, if you can possibly not deal
with it then don't -- use Mailchimp or something.

~~~
stp-ip
Sidenote: We are running coredns.io in production as authoritative internal
DNS and as hidden master with NOTIFY to a secondary DNS provider (currently
DNSmadeEasy).

The DNS records for the internal records are done using the kubernetes
middleware (basically serving the service records). The external records are
pulled in from a git repository hosting our zones as bind files. If need be
zones are split into subzones per team/project. Same permission system as our
code via MRs using Gitlab.

Our recommendation is to build on open standards (BIND, AXFR) and use services
on top of those.

I agree that using an external mail provider is usually a good idea. It mostly
is your fallback communication channel and is usually easy to switch (doing
replication to an offsite mail storage needs to be done to make switching
easy/possible/fast). MX records \o/

------
lern_too_spel
Since hosting git repositories is core to your business, you should take the
time to do it right
([https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/fi...](https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/files/Scaling%20Up%20JGit%20-%20EclipseCon%202013.pdf))
instead of using vanilla git and relying on magic filesystems and vertical
scaling to solve your issues.

~~~
sytse
The presentation you linked is something we've considered. But it is built on
top of Google's distributed filesystem. So we consider our move to Ceph a
first step in that journey.

~~~
lern_too_spel
You can swap in any other distributed storage system in jgit or libgit2.

------
dsr_
N1/N2: I think you are making a mistake by concentrating on the advantage of
having a single node type. Databases really aren't like other systems.

SuperMicro has a 4 node system available in which each node gets 6 disks,
2028-TP-xx-yyy. Get two of these, populate 2 nodes on each as your databases,
and you can grow into the other spaces later. Run your databases on all-SSD;
store backups on spinners elsewhere.

Having two node types is not a calamity compared to having just one.

~~~
sytse
Thanks, I agree that multiple nodes could be acceptable if needed.

So the proposal now is to have one chassis (excluding the backup system) and
two types of nodes (normal and database).

Regarding the 2028 please see the conversation in
[https://news.ycombinator.com/item?id=13153853](https://news.ycombinator.com/item?id=13153853)

------
mwcampbell
I'm surprised to see that GitLab is using Unicorn. Isn't Unicorn grossly
inefficient, given that each of the worker processes can only handle one
request at a time? Are web application processes actually CPU-bound these days?

I don't know much about the Ruby web server landscape, but might Puma
([http://puma.io/](http://puma.io/)) be better?

~~~
bpicolo
For "grossly inefficient" I imagine it depends. If you're loading most of the
app prefork then the memory overhead for more processes is pretty low.
(Disclaimer: Dunno much about ruby deployments)

~~~
mwcampbell
But the number of Unicorn processes that GitLab runs is "virtual cores + 1".
It seems to me that this only makes sense if web application request handling
is actually CPU bound. Maybe it is if you have a really good caching
implementation and hardly ever have to query the DB.
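
A rough way to see why "cores + 1" only pencils out for CPU-bound requests; the 50% CPU fraction below is an invented number, not a GitLab measurement:

```python
# If each worker handles one request at a time, the workers needed to keep the
# CPUs busy scale with the fraction of wall time a request actually spends on CPU.
cores = 36                   # hypothetical vCPU count
cpu_fraction = 0.5           # hypothetical: half of each request is DB/network wait

workers = cores + 1                                  # the "virtual cores + 1" rule
busy_cores = workers * cpu_fraction                  # ~18.5 cores actually working
workers_to_saturate = round(cores / cpu_fraction)    # 72 workers to keep 36 cores busy

print(f"{workers} workers keep ~{busy_cores:.1f} cores busy; "
      f"need ~{workers_to_saturate} to saturate")
```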

------
KaiserPro
A few things that Immediately jump out at me.

The 2U fat twins are very dense. You'll need to make sure that the hosting
company can actually cool them effectively.

I'd look again at your storage strategy. You'll save a lot of money, and it'll
be much easier to debug if you dump ceph.

You have a clustered FS when you appear to not really need it. By the looks
of it, it's all going to be hosted in the same place. So what you have is lots
of CPU to manage not very much disk space. The overhead of Ceph for such a
tiny amount of data, all in the same place, seems pointless. You can easily fit
your data set on one 90-disk shelf four times over with 100% redundancy.

First things first: file servers serve files and nothing else. This means that
most of your RAM is going to be a file server cache. Putting applications on
there is just going to mean that you're hitting the disk more often and
stealing resources from other apps in a non-transparent way.

Get four 90-drive superchassis and connect them to four servers (via SAS,
direct connect) with as many CPUs and as much RAM as possible. JBOD them and
either use ZFS, or hit up IBM for GPFS (well, IBM Spectrum Scale). You can
invest in a decent SAS controller and RAID 6 with 8-10 disk stripes, but
rebuild time will be very high. What's your strategy for disk replacement? 400
disks means about one will be failing every four weeks.
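
The "one failure every four weeks" figure falls out of ordinary annualized failure rates; a quick check under assumed AFRs (the AFR values are assumptions, not vendor numbers):

```python
# Expected drive replacements for a 400-disk fleet, assuming an annualized
# failure rate in the 2-4% range.
disks = 400
for afr in (0.02, 0.04):
    failures_per_year = disks * afr
    weeks_between = 52 / failures_per_year
    print(f"AFR {afr:.0%}: ~{failures_per_year:.0f} failures/year, "
          f"roughly one every {weeks_between:.1f} weeks")
```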

What is your workload like on the disks? is it write heavy? read heavy? lots
of streaming IO? or random? can it be cached?

Your network should be simple. Don't use cat6 10gig: it's expensive and not
very good in dense racks. Just use copper cables with built-in SFPs (twinax);
they are cheap and reliable.

Don't use Super micro switches. They are almost certainly re-badged.

Your network should be fairly simple: a storage VLAN, an application VLAN, and
a DMZ. (Out-of-band will need a strongly protected VLAN as well.)

On the buying side, you need a reseller, who is there to get the best deal.
You'll need to audition them. They will help with design as well. But you need
to be harsh with them; they are not your friends, they are out to make money.

------
alrs
If the plan is still to build a huge Ceph-backed filesystem to store your git
repos on, you are doomed.

Redhat, if you're out there: Now would be a good time to chime in about the
limits of Ceph and the reasonable size of a filesystem.

~~~
sytse
We'll have a Ceph expert review our configuration.

~~~
discodave
How about an expert in enterprise and/or cloud storage in general?

------
Svenskunganka
I know Stackoverflow and all their related services are running bare-metal and
has done so for a long time. They have written a very detailed series of blog
posts about their whole infrastructure and hardware which I really recommend
that you read if you haven't already.

Here's Part 1 of the series: [http://nickcraver.com/blog/2016/02/17/stack-
overflow-the-arc...](http://nickcraver.com/blog/2016/02/17/stack-overflow-the-
architecture-2016-edition/) (the other parts are linked from there)

------
devicenull
> Each of the two physical network connections will connect to a different top
> of rack router. We want to get a Software Defined Networking (SDN)
> compatible router so we have flexibility there. We're considering the
> 10/40GbE SDN SuperSwitch (SSE-X3648S/SSE-X3648SR) that can switch 1440 Gbps.

Noooooo. Do not go for Supermicro for anything where the operating system
matters. Their software quality leaves a _lot_ to be desired.

You might want something Cumulus Linux supports, since it seems like you have
a lot more Linux experience than networking...

~~~
sytse
Thanks, Cumulus Linux seems to be the way to go.

------
spullara
All of your data and IOPs could be handled by a single Pure Storage
FlashBlade. This would significantly simplify almost everything about your
system. (I'm an investor, but generally wouldn't point this out unless someone
is seriously considering on premise deployment.)

------
peller
For such a big expenditure, some prototyping first? Maybe buy a mainboard or
two and some CPUs/HDDs/SSDs and benchmark them on your specific workloads.
Also look into using something like bcache if going all-SSDs is too expensive.

~~~
sytse
We're in a rush because we need GitLab.com to get fast now. We might shoot
ourselves in the foot by not prototyping first but we're taking the risk.
We're hiring more consultants to help us with the move.

Bcache is mentioned in the article under the disk header.

------
octo_t
It seems based on this post that all the hardware is going to end up in one
datacenter. That seems like a big risk if that datacenter ends up having
issues.

Perhaps look into how to have the nodes split between 2 or 3 different
datacenters?

~~~
sytse
We've considered having part of the servers in a different location. But it is
a big risk to see if Ceph can handle the latency. We're also considering using
GitLab Geo to have a secondary installation. It seems that data centers can
have intermittent network or power issues but they are less likely to go down
for days (for example Backblaze is in one DC). At some point we'll likely have
multiple datacenters but our first focus is to make GitLab.com fast as soon as
possible.

~~~
dsr_
Do you have a disaster recovery plan that starts with "A meteor has destroyed
our primary data center."?

I do; that's my default scenario. If you can survive that, you can survive all
sorts of smaller issues like network congestion, data center power problems,
grid power problems, and zombie plagues (or flu, which is more likely.)

~~~
Dylan16807
Depends on what you mean by 'survive'. I'd call a backup in google nearline
sufficient for the meteor scenario, but that's going to be very slow and
unpleasant to depend on for milder problems.

~~~
illumin8
It's not sufficient. How quickly could you procure new hardware, install that
in a datacenter, make it fully functional, and restore your backups? The
answer is likely weeks/months. Could your business survive being offline that
long? It sounds unlikely.

~~~
Dylan16807
In an emergency you don't need new hardware. You can get cloud servers in
minutes. If people have been practicing restores then it should not take
particularly long to get the containers working again. A couple days to get
things working-ish. That should be survivable while everyone focuses on the
news coverage of the meteor.

But that's a true emergency situation. Don't go offline for multiple days for
something that's reasonably likely to happen.

------
viraptor
As someone already mentioned "all the other services" are missing (dns, ntp,
monitoring, etc.), but also:

\- Shouldn't there be a puppet / chef / whatever deployment coordinator in
there somewhere?

\- There's no mention of a virtualisation environment. While it's not a
hardware issue really, all the extra services mentioned before will not take
the whole server and you'll want to collocate some of them. (maybe even some
of the main services too, if the resource usage on real hardware turns out
wildly different than the estimates) If the choice is KVM, great. But if it's
VMWare, you want to include that in the cost. (and the network model)

\- The "staging x 1" is interesting... so what happens when you need to test a
new version of postgres before deployment? You can't pretend that a test on
one server (or 4 virtual ones) will be comparable to actual deployment,
especially if you need to verify the real data performance.

\- "Backing up 480TB of data [...] with a price of $0.01 per GB per month
means that for $4800 we don't have to worry about much." \- This makes a
really bad assumption of X B of data produces X B of backup. That's a mirror,
not a backup - it won't protect you from human errors if you overwrite your
only mirror. For a backup you need to have actual system of restoring the data
and a plan for how many past versions you want to store. On the other hand,
gitlab data seems to be mostly text files - that should compress amazingly
well, so that should also help with the restore speed.

\- Network doesn't seem to mention any out of band access (iLO, or similar).
That's another port per node required.

\- Because tech is tech, I expect both the staging and spare servers to be
repurposed soon to be used for "that one role we forgot about". One each is
really not enough. (seen it happen so many times...)

~~~
sytse
\- We would continue to use our existing cloud hosted Chef server for now.

\- We want to use Kubernetes instead of virtualization.

\- We provisioned 2 spare database servers for that reason.

\- Git already compresses files with zlib so I'm not sure we can compress it
much further [https://git-scm.com/book/uz/v2/Git-Internals-
Packfiles](https://git-scm.com/book/uz/v2/Git-Internals-Packfiles)

\- The servers have a separate management port and "Apart from those routers
we'll have a separate router for a 1Gbps management network."

\- I agree we'll probably need more spare servers.

~~~
viraptor
> The servers have a separate management port and "Apart from those routers
> we'll have a separate router for a 1Gbps management network."

It was followed with "For example to make STONITH reliable when there is a lot
of traffic on the normal network". That sounded like you want the heartbeats
on it, not out of band management.

~~~
sytse
I probably made a mistake here. I assumed that since the management network
was never congested we could run the heartbeats there. But it might be that
the server can't address the management port from the operating system.

------
binaryanomaly
Kudos to GitLab for being so open about this. As an outsider I find it very
interesting to observe and learn from this transition.

It's refreshing to see that a few quite important players, e.g. Dropbox, are
moving away from having everything in public clouds, in contrast to others
such as Netflix that go all in. Looks like on-premise is not dead yet.

Good luck!

------
jscott0918
A bit confused why you are trying to roll your own storage solution. Ceph is
great for a lot of applications, but I am not sure it really fits the bill for
what you describe. Especially when you indicate you are going to use spinning
disk behind it. Have you looked at any of the storage arrays on the market?
Your TCO is likely to be much lower and your performance/resiliency much
higher if you buy something with $100's of millions of R&D behind it rather
than all the hours and costs of rolling your own. Not saying it's impossible to
make it work, but it just sounds like something that will be a PITA going
forward.

~~~
sytse
I think a storage appliance (NetApp, etc.) makes a lot of sense in the short
term. The TCO is lower since we'll spend a lot of time making Ceph work.

In the longer term the storage appliance will lock us in and will get very
expensive. I've heard pretty bad stories of companies betting on it.
Especially with many small files like us (IOPS heavy).

And one goal of GitLab.com is to gain experience that we can reuse at our
customers. Most of our customers use a storage appliance now but are
interested in switching to something open source.

------
joneholland
Disregarding the obvious signs that there is a severe lack of experience among
GitLab staff on the physical DC build, they completely underestimate how
difficult it is to run Ceph at scale with a reliable SLA.

------
jedberg
My initial gut feeling is that you are moving out of the cloud for the wrong
reasons. Any performance gain you get with bare metal will be erased with the
complexity of running a hybrid environment, namely moving data back and forth
between your datacenter and the cloud, and also the mental overhead of
programming to that model.

You also now need to gain internal expertise in networking, security,
datacenter operations, and people who can rack and stack well.

------
i336_
Here's a crazy but interesting suggestion:

> _Backup_

> _...Even with RAID overhead it should be possible to have 480TB of usable
> storage (66%)._

Quoting [https://code.facebook.com/posts/1433093613662262/-under-
the-...](https://code.facebook.com/posts/1433093613662262/-under-the-hood-
facebook-s-cold-storage-system-/):

> _Fortunately, with a technique called erasure coding, we can. Reed-Solomon
> error correction codes are a popular and highly effective method of breaking
> up data into small pieces and being able to easily detect and correct
> errors. As an example, if we take a 1 GB file and break it up into 10 chunks
> of 100 MB each, through Reed-Solomon coding, we can generate an additional
> set of blocks, say four, that function similar to parity bits. As a result,
> you can reconstruct the original file using any 10 of those final 14 blocks.
> So, as long as you store those 14 chunks on different failure domains, you
> have a statistically high chance of recovering your original data if one of
> those domains fails._

Facebook didn't release the system they used to do this. I can see two reasons
why not: desire for a competitive edge, or the implementation not being a
general-purpose solution.

Considering Facebook's general openness, I say get in touch, just in case!
It's quite possible that you might be able to figure out something
interesting.

I suspect the reason the system wasn't released was the latter - it seems to
be technically quite simple and easily achievable for a[ny] company full of
algorithms Ph.D.s.

~~~
sulam
Any halfway experienced storage engineer is fluent in ECC. You don't need
secret sauce from Facebook. That said, because of the first statement, a lot
of today's storage solutions will use ECC on the backend if they present you
with a logical FS. So you may not (should not?) need to reinvent this wheel.
At Facebook's scale this absolutely makes sense, but they're not particularly
breaking new ground here. Look at RAID 6 if you need more evidence.

~~~
i336_
I'm still figuring all of this out, thanks for the heads-up. I realize
Facebook doesn't have anything particularly interesting here, but now I
understand just how standard this is, huh.

Thanks again.

------
late2part
I recently built some data center processing using SuperMicro Twin-Nodes^2 [0]
and MicroBlades [1].

We're set up for 38kW [a] possible gross wattage using the dual-node blades
with the Xeon D-1541 [2]. The amazing thing about the D-1541 blades is we get
around 100W per server, with 8 hyperthreaded cores and a 3.84TB SSD. With the
6U chassis, you have 28 blades with 2 nodes each - 56 medium-sized servers for
5.6kW in 6U - under a kW per RU.

For your 70-some server workloads, I'd recommend using the MicroBlades.

For your higher-level workloads, I think the SuperMicro TwinNode makes sense.

Be very, very careful about Ceph hype. Ceph is good at redundancy and
throughput, but not at IOPS, and RADOS IOPS are poor. We couldn't get over 60k
randrw IOPS across a 120-OSD cluster with 120 SSDs.

For your NFS server, I'd recommend a large FreeNAS system; put a big brain in
it and throw spinning platters behind it.

Datacenters can/will do your 30kW.

[0] -
[https://www.supermicro.com/products/nfo/2UTwin2.cfm](https://www.supermicro.com/products/nfo/2UTwin2.cfm)
[1] -
[https://www.supermicro.com/products/MicroBlade/](https://www.supermicro.com/products/MicroBlade/)
[2] -
[https://www.supermicro.com/products/MicroBlade/module/MBI-62...](https://www.supermicro.com/products/MicroBlade/module/MBI-6218G-T41X.cfm)
[a] - Although we have 38kW of possible power there, in practice we'll stay
well under the 27kW we'll get with 4x208V@60A PDUs at 80%.

~~~
etcet
If not Ceph, what are you using for storage?

~~~
late2part
2RU SuperMicro server with 24x SSDs running FreeNAS, RAID10 or RAID-Z2,
exporting volumes via iSCSI.

Separately I'm using a FreeNAS controller with 4 SAS HBAs supporting 3 JBODs
with 45 8TB HGST He8 nearline SATA disks each (135 disks or ~1PB) for backups
and slow data.

------
dgemm
This undertaking seems like a huge investment in time & overhead for only 64
servers. My guess is that with this move, performance will go up and
availability will go down.

I can understand the need for performance but if it were my business I would
have taken a significantly different approach.

~~~
sytse
It likely won't be 64 servers for long.

------
Jabdoa
Why go with 3.5" disks? You usually want 2.5", and you can fit 24 of them in
2U. I recommend doing that with a Supermicro cabinet. 2 cache SSDs and 20 SAS
disks (+ 2 spares) in combination with a RAID controller will give you a lot
of quite fast storage at a sweet spot on price.

~~~
sytse
What disk would you recommend for that setup?

------
tomschlick
I didn't see it mentioned, but what are your plans for the network strategy?
Are you planning to run dual-stack IPv4/IPv6? IPv4 only? Internal IPv6 only
with NAT64 to the public stuff?

Hopefully IPv6 shows up somewhere in the stack. It's sad to see big players
not using it yet.

~~~
sytse
Probably IPv4 on a /24 block but we'll open up a vacancy for a network
engineer.

------
AstroJetson
Love how they did some number crunching and decided that rent vs own, own won.
I think that if more places looked they would find that out also. There must
be a margin in it since the big players are making money at it.

I'm interested to see what they end up with.

~~~
sytse
Thanks! The decision to move to metal was because of performance problems
[https://about.gitlab.com/2016/11/10/why-choose-bare-
metal/](https://about.gitlab.com/2016/11/10/why-choose-bare-metal/)

It is nice that we'll save on costs but we anticipate a lot of extra
complexity that will slow us down. So if it wasn't needed we would have stayed
in the cloud. But it is interesting that both our competitors (GitHub.com and
BitBucket.org) also moved to metal.

~~~
mwcampbell
Have you considered hosting with Packet.net? You'd be on bare metal, thus
solving your performance problems, but you'd still be renting by the hour as
you are now, and you wouldn't have to deal with buying your own hardware and
all the complexity that comes with that.

~~~
sytse
I looked at their site and they talk about bring your own block, anycast, and
IPv6. But I can't find any information about networking speeds. What if we end
up needing 40 Gbps between the CephFS servers?

~~~
justincormack
They provide dual 10Gb as standard. But talk to them about options.

------
gsylvie
I have a Cisco c220 m4
([http://www.cisco.com/c/dam/en/us/products/collateral/servers...](http://www.cisco.com/c/dam/en/us/products/collateral/servers-
unified-computing/ucs-c-series-rack-servers/c220m4-sff-spec-sheet.pdf)). I was
surprised at the complexity of the configuration options. For example, it has
24 memory slots, but if you fill them all, memory speed is 2133MHz, whereas if
you only use 16 memory slots, max memory speed is 2400MHz.

I've ordered a pair of Intel 750 Series 1.2TB NVMe SSDs for it; can't wait to
try them out. Still waiting for the drives to arrive.

------
StavrosK
Goddamn, this thread is worth a few tens of thousands of dollars in
consultation fees.

~~~
sytse
Yes, I think it is, thanks everyone!

------
j45
Coming from a complex hosting background - the move to metal is always a good
one, especially when growing. Still, I recommend looking into using a
hypervisor management system like Proxmox where possible.

There is a great deal of pain in moving from one piece of metal to the next,
and there is nothing wrong with underpinning your metal with a tech where you
can at any time move your architecture to be any combination of a private,
public or hybrid cloud, storage aside.

This looks to be a really interesting project, I hope you can continue to blog
about it in detail.

------
dsr_
H1: I don't see a discussion of getting an ASN and doing BGP with a number of
upstreams. You say you want a carrier-neutral facility, but that won't buy you
much unless you have your own AS.

~~~
rhizome
Some ISPs will let you announce on a /24, which shouldn't be too hard for
GitLab to set up for themselves.

~~~
jlgaddis
@sytse: If this (BGP + ASN + /24 IPv4 + /n IPv6) is your plan, I'd encourage
you to get started on the process now.

Go ahead and apply for your ASN now and an IPv6 allocation. Then start working
on the paperwork for an IPv4 allocation. Because there is no more IPv4 to
allocate you'll have to go through the auction process and then the subsequent
transfer process.

You'll easily be able to find a provider that can give you a /24 if you buy
transit from them, but you don't want to go through the trouble of later
renumbering into your own IP space if you can avoid it.

------
matt_wulfeck
I'd like to see more about how they plan to increase the engineering and on-
call staff to keep watch over the thing. The increased headcount and stress
are a pretty tough hidden cost, in my opinion.

------
intrasight
CPU-wise, I would look instead at the Xeon D series. Either the D-1557 or
D-1587

[https://www.servethehome.com/supermicro-superserver-
sys-1018...](https://www.servethehome.com/supermicro-superserver-
sys-1018d-frn8t-review/)

[https://www.servethehome.com/supermicro-x10sdv-12c-tln4f-rev...](https://www.servethehome.com/supermicro-x10sdv-12c-tln4f-review-12c-xeon-
d-with-sfp-10gbe/)

------
pipeep
Regarding D2: If you're going to go with bcache, make sure you're using a
kernel >= 4.5, since that's when a bunch of stability patches landed
([https://lkml.org/lkml/2015/12/5/38](https://lkml.org/lkml/2015/12/5/38)).
Alternatively, if you're building your own kernel, you should be able to apply
those patches yourself.

~~~
sytse
Thanks, added in [https://gitlab.com/gitlab-com/www-gitlab-
com/commit/da003427...](https://gitlab.com/gitlab-com/www-gitlab-
com/commit/da0034272d9b7dfa952ea8e451907f98654efa9b)

------
stonogo
Is there a reason you don't just put out an RFP and purchase a complete
supported solution from a VAR or OEM vendor?

I'm not sure you have the in-house expertise to maintain a production service
of this kind. That's not an attack; this has not been your focus in the past,
so it might be wise to have a third party provide some assistance and support
as an intermediate step toward doing everything in-house.

------
erhardm
M1: Each CPU has four memory channels and each node has 16 DIMM slots (8/CPU,
2 DIMMs/channel), which means you can use 1 DIMM per channel -> maximum memory
bandwidth and speed with RDIMMs. Using 64GB or 128GB LRDIMMs with E5v4 CPUs
won't affect your bandwidth or speed as long as you populate all the
channels.[0]

Your memory options are:

1TB - 16x64GB / 8x128GB

2TB - 16x128GB

[0] - SuperMicro X10DRT-PT Motherboard manual page 35(2-13).
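
(A quick sketch of those population options; the slot and channel counts are
the ones given above:)

    # DIMM population options for a 2-socket node with 4 channels per CPU and
    # 2 DIMM slots per channel (16 slots total), per the comment above.
    channels = 2 * 4
    options = [
        ("16 x 64GB LRDIMM", 16, 64),
        ("8 x 128GB LRDIMM", 8, 128),     # 1 DIMM per channel, all channels populated
        ("16 x 128GB LRDIMM", 16, 128),
    ]
    for name, count, size_gb in options:
        total_tb = count * size_gb / 1024
        print(f"{name}: {total_tb:.0f} TB total, {count / channels:.0f} DIMM(s) per channel")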

------
plasma
I may be way off base, but is this a solvable problem by just sharding/placing
customer repositories over various storage systems?

~~~
sytse
See
[https://news.ycombinator.com/item?id=13153354](https://news.ycombinator.com/item?id=13153354)

------
devicenull
> We want to dual bound the network connections to increase performance and
> reliability. This will allow us to take routers out of service during low
> traffic times, for example to restart them after a software upgrade.

does not really agree with

> Each of the two physical network connections will connect to a different top
> of rack router.

Sure, you can do it with something like MLAG, but that's really just moving
your SPOF to somewhere else (the router software running MLAG). Router
software being super buggy, I wouldn't rely on MLAG being up at all times.

> N1 Which router should we purchase?

Pick your favorite. For what you're looking for here, everything is largely
using the same silicon (broadcom chipsets).

> N2 How do we interconnect the routers while keeping the network simple and
> fast?

Don't fall into the trap of extending vlans everywhere. You should definitely
be routing (not switching) between different routers. You can read through
[http://blog.ipspace.net/](http://blog.ipspace.net/) for some info on layer 3
only datacenter networks.

You'd want to use something like OSPF or BGP between routers.

> N3 Should we have a separate network for Ceph traffic?

Yes, if you want your Ceph cluster to remain usable during rebuilds. Ceph will
peg the internal network during any sort of rebuild event.

> N4 Do we need an SDN compatible router or can we purchase something more
> affordable?

You probably don't need SDN unless you actually have a SDN use case in mind.
I'd bet you can get away with simpler gear.

> N5 What router should we use for the management network?

Doesn't really matter; gigabit routers are pretty robust/cheap/similar. I'd
suggest the same vendor as whatever you go with for your public network routers.

Also, consider another standalone network for IPMI. I can tell you that the
Supermicro IPMI controllers are significantly more reliable if you use the
dedicated IPMI ports and isolate them. You can use shitty 100Mbit switches for
this; the IPMI controllers don't support anything higher.

> D5 Is it a good idea to have a boot drive or should we use PXE boot every
> time it starts?

PXE booting at every boot is cool, but can end up sucking up a lot of time. If
you have not already designed your systems to do this, and have experience
with PXE, then don't.

> The default rack height seems to be 45U nowadays (42U used to be the
> standard).

You may not have accounted for PDUs here. Some racks will support 'zero-U'
PDUs, but you'd need to confirm this before moving on.

> H3 How can we minimize installation costs? Should we ask to configure the
> servers to PXE boot?

Assume remote hands is dumb. Provide stupidly detailed instructions for them.
Server hardware will PXE by default, so that's not really a concern. IPMI
controllers come up via DHCP too, so once you've got access to those you
shouldn't need remote hands anymore.

> D2 Should we use Bcache to improve latency on on the Ceph OSD servers with
> SSD?

Did you consider just putting your Ceph journals on the SSD? That's a much
more standard config than somehow using bcache with OSD drives.

~~~
sytse
Thanks for the suggestions.

We're already planning a separate router for the management network ("Apart
from those routers we'll have a separate router for a 1Gbps management
network.").

All Ceph journals will be on SSD too. I've added a question about combining
this with bcache in [https://gitlab.com/gitlab-com/www-gitlab-
com/commit/a9cc9aad...](https://gitlab.com/gitlab-com/www-gitlab-
com/commit/a9cc9aade393d5f190d5ace93b7027140b42676b)

~~~
devicenull
If you're doing this right, the management network will not actually be
accessible via the normal operating system. Most IPMI controllers _support_
sharing a normal nic and management (meaning both the IPMI controller and host
OS can access it), but I wouldn't recommend doing this.

------
PanMan
I have never run a deployment this large, but I wonder: only 1 staging server
for 64 servers? I have usually tried to have at least the same order of
magnitude when testing architecture changes: sure, it works on my laptop, but
how will this change work on 10 servers? Isn't it common to have a full
staging setup, with similar dataset sizes?

------
dsr_
C1: The SuperMicro offers blade-server densities without the ridiculous prices
that proprietary blade-server chassis come with. It's a pretty good system.
Please make sure you have assessed your power requirements correctly and
communicated that to your datacenter.

If you were larger, the Open Compute Project might be a way to go. Maybe next
generation.

------
dsr_
D5: You want a local boot drive, and you want it to fall back to PXE booting
if the local drive is unavailable. Your PXE image should default to the last
known working image, and have a boot-time menu with options for a rescue image
and an installer for your distribution of choice.

~~~
kkrs
It's better to have it the other way around. Attempt to boot off the network
and fall back to local drive. That will allow you to reimage a node without
having to fiddle with the BIOS. For regular boot there should be no image
configured. Network boot will fail and the node will boot off disk. However if
an install image is configured for that node, you can reimage it at will. The
install image should reset to having no image for that node, once it's done.
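
(A minimal sketch of that flow; the image store and MAC below are made-up
placeholders for illustration, not a real provisioning tool:)

    # Hypothetical sketch of the "network boot first, fall back to disk" flow
    # described above.
    assigned_image = {}   # MAC address -> install image; empty means "boot from disk"

    def pxe_response(mac):
        # One-shot: clear the image as soon as it is served (a simplification;
        # the workflow above resets it once the install actually finishes).
        image = assigned_image.pop(mac, None)
        # None: PXE offers nothing, so the node falls back to its local disk.
        # Otherwise: the node boots the installer and reimages itself.
        return image

    assigned_image["0c:c4:7a:00:00:01"] = "debian-installer"
    print(pxe_response("0c:c4:7a:00:00:01"))   # "debian-installer" -> reimage
    print(pxe_response("0c:c4:7a:00:00:01"))   # None -> normal boot from local disk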

You'll need to maintain an association between DNS, IP, MAC address, SSH keys,
etc.

Hardware break-fix workflow is usually ignored by most production engineers.
You'll be doing that a lot. You want to get your hardware back into use as
fast as possible.

Have you thought about how many spares (CPU, RAM, disks) you'll have to keep
at your datacenters ?

------
ddon
We moved 10 racks of servers from a New York hosting provider to a custom-built
small hosting facility in Tallinn, Estonia, and we were able to save about €13K
a month on hosting. We use free air-side cooling to save on electricity. We
have been here for 3 years now, no major problems so far...

------
jorblumesea
What is the amount of money you are expecting to spend on staff vs. the
performance return you get?

Running your own boxes can be done, but usually at great cost and usually by
blowing up your SLA. Given your inexperience at this, some other options might
be politically cheaper.

------
hatsunearu
D2: Ceph has built in support for cache tiering.

[http://docs.ceph.com/docs/jewel/rados/operations/cache-
tieri...](http://docs.ceph.com/docs/jewel/rados/operations/cache-tiering/)

------
AnonNo15
I would say you should start with hiring 2 full-time senior ops and 2 senior
network admins.

Then give them freedom to do all hardware picks AND hiring some more
intermediate/junior ops/admin staff.

You can do with 3-4 ops people and maybe 3 network admins in total.

------
z3t4
I've done MMO setups that are simpler than this. You do not need a cluster this
large. Spread it out with nodes plus backup nodes near customers. Separate
customers. Scaling will be as easy as adding more servers, one case at a time.

------
ilaksh
It's hard for me to accept that the price differences between server and
desktop components are legit and not some kind of scam.

Also, assuming you are getting jacked by AWS like most people, have you looked
into Linode, DigitalOcean, or anyone else?

------
falconed
Have you looked into using a CDN? A good CDN configuration can pay for itself
in cost savings, regardless of whether you go self hosted or remain in the
cloud.

------
tnorgaard
Have you considered Rackspace's OnMetal products or other IaaS providers that
run bare metal, such as Joyent Public Cloud? If you are in such a rush, I'll
suggest to factor in both the migration risk and the time to deliver said
hardware. Fx the Joyent Public Cloud will allow you to have a mixed
private/public cloud as their IaaS software, Triton, is open source and you
can run it in your own hardware. Note: I'm not familiar with anybody running
Ceph on illumos LX containers although.

------
bfrog
In my experience bcache ssd + spinning rust performed much worse than ZFS with
its caching layer. ZFS did a significantly better job.

~~~
sytse
I'm sorry to hear that but it makes sense. Unfortunately I don't think it is
an option to run ZFS below Ceph.

------
skuhn
For server hardware, the Supermicro 2U Twins are a reasonable choice, but I
prefer their 4U FatTwin chassis. The engineering quality is a little better
IMO, and the cost increase isn't too big. Absolutely do not buy their 1U Twin
systems, they are hot garbage.

The FatTwin chassis has similar density, and can support either 1U half width
or 2U half width systems in a particular chassis. Typically I use 1U's for app
/ web servers and 2U's for lower end database / storage. Separate 2U servers
for higher end database and 4U servers for bulk storage.

HPE has the Apollo 2600 / XL170r 2U chassis, which I think is somewhat
inferior but still a reasonable choice. Dell sells the same thing as the
C6300. I really prefer a 4U chassis though from a cooling and power supply
perspective, but Dell or HPE may have a better international story for you.

You absolutely should not buy the 2630v4 CPU. I say that because the lower-end
Intel CPUs do not support maximum throughput for memory and QPI. The 2630v4 is
8.0GT/s QPI and 2133MHz DDR4. A better low-watt part is the 2650Lv4
(9.6GT/s QPI and 2400MHz DDR4). I have a guide that I created (and use myself)
to determine comparative $/performance of Intel CPUs based on SPEC numbers
[1]. If you can go up to 105W the 2660v4 is probably your best bet. Presuming
that you're targeting 12-15kW per rack, a 105W part should allow you to deploy
between 60-80 hosts per rack.
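
(A rough way to sanity-check that hosts-per-rack figure, working backwards
from the rack budget; the per-host draw is an estimate, not a measurement:)

    # Sanity check of the 60-80 hosts-per-rack figure above: work backwards
    # from the rack power budget to a per-host power budget.
    for rack_kw in (12, 15):
        for hosts in (60, 80):
            per_host_w = rack_kw * 1000 / hosts
            print(f"{rack_kw} kW rack / {hosts} hosts -> {per_host_w:.0f} W per host")
    # 150-250 W per host is roughly what a dual-socket node with two 105 W CPUs
    # plus RAM, drives and fans can be expected to draw (an estimate, not a measurement).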

Also, don't use a W-series CPU that draws 160W. That's crazy power draw per-
socket. If you want a super high-end CPU for your database server, I suggest a
2698v4 -- but normally I would go with 2680v4 or 2683v4 depending on the part
cost.

In terms of hard drives, absolutely you should specify HGST over Seagate. At
some point you may want to dual source this, but if you're only going with one
vendor right now HGST is the best option. He8 or He10 8TB are your best bet in
terms of cost and availability right now, although start thinking about He10
10T drives. The newly announced He12 drives shouldn't be on your radar until
Q2 2017. Stock spares, maybe 2-3% of your total drives deployed, but at least
5-6 drives per site. You don't want to get caught out if there is a supply
shortage when you need it most. Your business depends on ready access to
considerable quantities of storage.

The P-series Intel SSDs are probably not going to be cost effective for your
use case. But they are considerably better in terms of IOPS and remove the
need for an HBA or RAID controller. Consider a Supermicro 2U chassis with 2.5"
NVMe support, which will allow you to go considerably denser than the PCIe
form factor. However, I think it's too early to go with NVMe unless you truly
need beyond-SSD performance.

Don't PXE boot every time you boot. This creates a single point of failure
(even if your architecture is redundant), and you will regret this at some
point. However, DO PXE boot to install the OS.

Don't use 128GB DIMMs. They are not cost effective today.

There is only one solution for database scaling: shard. You'll either shard it
today, or you'll shard it tomorrow when the problem is much harder. Scale up
each host to what is easily achievable with today's hardware, and if push
comes to shove retrofit to get over a hump that arises in the future, but know
that you MUST shard in order to keep up with demand. Scaling up simply does
not work.

There's a lot more to say, but without doing my job for free in a HN comment,
the best advice I can offer is:

1\. Simplify what you hope to accomplish in the first round. This is a lot to
achieve at once. I think you'll have a hard time achieving the fanciness you
want from a software perspective while also forklifting the entire stack over
to physical hardware. It's perfectly fine to have something be good enough for
now.

2\. Find people who have done this before and get their advice. Find a VAR you
can trust.

3\. Plan, plan, plan, plan. Don't commit until you have a plan, make sure the
plan is flexible enough to change course without tossing everything out, and
plan to do a good enough job to survive long enough to figure out a better
plan next time.

4\. Get eval gear, qualify and test things, and make sure that what you think
will work does work.

[1]
[https://docs.google.com/spreadsheets/d/1bbbeMCmqt5pZCb_x2QMW...](https://docs.google.com/spreadsheets/d/1bbbeMCmqt5pZCb_x2QMWVSL9_e8YGovZthHNGqr5H1A/edit?usp=sharing)

~~~
yuhong
And if you want to upgrade later, I think you should estimate about ~3 years
before 128GB DDR4 LR-DIMMs are cost effective, right?

~~~
skuhn
It's hard to say with absolute certainty, but I think 2-3 years is a
reasonable guess. 64GB DIMMs have only recently become semi-reasonable, and I
still use a lot of 16GB or 32GB DIMMs on smaller deployments.

Basically whatever the top-of-the-line DIMM option may be (and this applies
for CPUs and HDDs and other stuff too), you want to avoid being in a situation
where you HAVE to use it. Vendors price these parts accordingly: you pay a
premium for top-of-the-line because you must have it. If you can avoid that,
do so.

------
rnxrx
The network piece seems kind of incomplete and outdated when compared to
what's being discussed in terms of compute and storage. Most of the new
networking fabrics being built at this point are 100G, with emerging server
connectivity focusing on nx25G for in-rack. Longer-distance 25G is awaiting
implementations of FEC, but the currently shipping gear can do 3-5m - which is
about ideal for top-of-rack. The economics and development cycles of these
link-types tends to dictate that 40GE doesn't make sense for a new
installation.

The call for "SDN" is incredibly nebulous - to the point of being almost
meaningless, IMO. What the big guys tend to be after is a way to control the
fabric via standardized API calls - so capability for YANG/NETCONF or some
mechanism for direct access to SDE calls. The other thing that's not addressed
is how to efficiently get information about the network _out_ of the network.
Traditional polled mechanisms (SNMP/RMON, et al.) have been shown to lack both
scale and adequate resolution, while legacy approaches to push telemetry
(sFlow/NetFlow) miss the mark in terms of level of detail and compatibility
with large-scale data processing needs.

The next point is the selection of topology and the integration with multi-
site planning. There's a lot of cool stuff happening in this regard and there
seems to still be a pretty big disconnect between what the systems folks seem
to understand and what's happening in the network industry, which is a shame
because there's probably more opportunity for neat stuff (read: scalability,
performance, fault resistance, manageability) than seems to be discussed (at
least on HN).

Finally - there's a certain conventional wisdom among the systems and some
sections of the programmer crowd that network control planes are just another
mostly simple bit of software to be implemented. It's not. It's a hard problem
and is the manifest reason why only a handful of organizations have been able
to produce software that runs a meaningful percentage of large-scale L3
infrastructure in the world (hint: Arista is a great company but isn't
included in this number quite yet). Truly rugged, useful/usefully-featured and
performant network code is hard. Making that code work in the context of 30+
years of protocol implementations, morphing standards and a world of
bad/clueful actors is _REALLY_ hard. There's an inverse relationship between
the amount of money spent on a solution and the amount of specialized
expertise you have on staff. A more traditional commercial solution might be
more expensive, but it also relieves you of the need to keep some relatively
rare, likely expensive and almost certainly non-revenue-producing skill-sets
on staff.

------
jmakov
Can someone fill me in why renting a couple of root servers wouldn't solve the
problem?

------
foxhop
Good luck. I think moving off the cloud is a mistake. I hope you prove me wrong.

~~~
connorshea
We hope so too ;)

------
peterwwillis
The first consideration should always be the DC, and the very last one is the
software, after hardware, network, power and cooling.

Where will my DC be? What kind of DC is it? What services do they provide? How
long do I want it to take for an employee to get there, whether or not they
have 24/7 remote hands? What kind of power resiliency do they provide? What
will power cost? What kind of power do they provide per cage and rack? How
will their uplinks affect my traffic needs? Etc etc.

Cooling I didn't deal with directly, but suffice to say you will always need
more cooling, and its efficacy will determine if your hardware stays alive or
not. I've seen 11 foot racks with only 6 feet of hardware because they simply
couldn't cool the racks at full height. Learn how to look for properly
designed hot/cold racks and keep your racks well organized to make cooling
efficient.

Power is pretty obvious, except that it isn't. Eventually you will draw too
much power and you'll need to shunt machines into different racks and monitor
your power trends. So one of the things to consider, besides dual drops, is
how many extra racks do I have for when I need to spread out my power OR
cooling into new racks? Will they make me use a rack on the other side of the
building, or am I going to pay for some in reserve next to the current ones?
Get PDUs that aren't a pain in the ass to automate (APC sucks balls).

Network: i'm not a neteng, don't listen to me, but obviously it should be
managed with nice fat switching fabric bandwidth, good forwarding rate and big
uplink module support. 48-port switches don't always have the same bandwidth
ratios as 24-port switches, and uplinks are much easier to manage on a 24-port
than a 48.

Hardware: you don't seem to need anything special, so you need to determine if
a support contract is necessary, and if not, buy the cheapest pieces of shit
you can and then rely on remote hands or a local employee to change out broken
shit all the time. If space, power, cooling are at a premium, a blade chassis
can be handy. But if you can spare the space, power, and cooling, 0.5U and 1U
shitboxes are fine for most purposes. Don't get wrapped up in the details
unless your application design requires specific hardware performance
guarantees.

Looked at iSCSI SANs? Could make WAN sync easier, reduce overhead from NFS,
but probably depends on how well your OS supports it and the features of the
SAN. Oh, and an OOB terminal server can be a godsend when combined with a good
PDU.

Go find all the industry datacenter design papers out there (there are tons)
to bone up on the design considerations. Remember that you can always replace
machines, but you can't replace rack, cooling or power design.

~~~
virtuallynathan
>Network: i'm not a neteng, don't listen to me, but obviously it should be
managed with nice fat switching fabric bandwidth, good forwarding rate and big
uplink module support. 48-port switches don't always have the same bandwidth
ratios as 24-port switches, and uplinks are much easier to manage on a 24-port
than a 48.

With current gen ASICs and switches, this isn't generally true anymore. A
$1000 48 port 1Gbps switch is fully non-blocking with almost 1:1 10G uplinks
(48x1G in, 4x10G out)

~~~
mohctp
Agreed. There is absolutely zero reason to even consider 24-port switches in
this environment.

------
networkguy6
Servers:

C1: Have a look at the FatTwin line for more disks per U. More PCIe slots too.

Disks

D3: Check the measurements; having it not fit is painful

D4: More, smaller drives. Make sure you go PMR, not SMR, if you do go for 8TB

Network

N1: The "SDN" aspect of the supermicro one is not really any different than
any other. Look at [https://bm-switch.com/](https://bm-switch.com/) and get
one that supports Cumulus Linux. Buy one with an x86 CPU. If you want to do
"SDN" things or run custom monitoring, not dealing with PPC is great.

N3: Probably not needed, but not terribly expensive if it provides benefit.

N4: see N1, no.

N5: Cheap 1G switch that supports cumulus, x86, probably broadcom \ helix4

Networking General:

\- I wouldn't advise using the 10GbE copper -- go for 25GbE with DAC; it's
basically the same price, and Mellanox NICs are small/cheap

\- Transit is cheap, you can get 500Mbps on a 10G port for $325/mo from Cogent

\- If your bandwidth needs to scale up, data center locations matter more than
you think

25GbE adapter -- minimal additional cost for 2.5x the perf, lower latency as
well:
[http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=2...](http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=2816&idcategory=6)

a 32 port 100GbE switch is about $7000-12,000 -- you can break that down to
128x 25GbE, and use the 25GbE ports running at 10GbE mode for your carrier
uplinks. Could even do 100GbE to your Ceph nodes if you wanted, but be aware
of PCIe bottlenecks -- x8 is about 64Gbps, x16 will do 125Gbps. Dual port 40G
on x8 or dual port 100G on x16 will not provide more than those numbers.
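
(Where those x8/x16 numbers come from, assuming PCIe 3.0 at 8 GT/s per lane
with 128b/130b encoding:)

    # PCIe 3.0 usable bandwidth per link width: 8 GT/s per lane, 128b/130b coding.
    GT_PER_LANE = 8.0
    ENCODING = 128 / 130

    for lanes in (8, 16):
        gbps = lanes * GT_PER_LANE * ENCODING
        print(f"x{lanes}: ~{gbps:.0f} Gbps usable")   # x8 ~63 Gbps, x16 ~126 Gbps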

Consider Supermicro NVMe servers (Ultra series) for DBs, and 2.5in NVMe SSDs
instead of PCIe.

Rack: Don't assume 45U; 40-48U is common. Consider buying 2 racks.

Power:

19kW seems high for a single rack; you will need a good datacenter to support
that density. Density costs money; more racks is generally cheaper.

208V * 30A = 5000W usable per feed, unless they are talking about 208V
3-phase, which gets you 8600W usable per feed -- which across two feeds is
still only ~17kW, and you need 18-20kW.

helpful reference: [http://www.raritan.com/blog/detail/3-phase-208v-power-
strips...](http://www.raritan.com/blog/detail/3-phase-208v-power-strips-rack-
pdus-demystified-part-ii-understanding-capac)

You can only use 80% of the power provided.
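
(The feed math above worked out; the 80% factor is the usual continuous-load
derating:)

    import math

    # Usable power per feed after the usual 80% continuous-load derating.
    def usable_watts(volts, amps, phases=1, derate=0.8):
        watts = volts * amps * (math.sqrt(3) if phases == 3 else 1)
        return watts * derate

    print(f"208V/30A single-phase: {usable_watts(208, 30):.0f} W usable")            # ~5000 W
    print(f"208V/30A three-phase:  {usable_watts(208, 30, phases=3):.0f} W usable")  # ~8600 W
    # Two redundant three-phase feeds give roughly 17 kW against an 18-20 kW target.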

You also need rack PDUs; higher-density PDUs cost more money. Consider buying
port-switchable PDUs. Raritan makes good ones.

Ask Supermicro (or your reseller) for a "Power Sheet"; it will tell you almost
exactly how much power your server will use. I've had good luck with ThinkMate.

Hosting

H1: Yes, too many to mention

H2: Do it yourselves

H4: yes

Hosting general: Cross-connects cost money; a number of facilities offer free
cross-connects, and this can add up.

Other notes:

\- You want a small toolbox in the data center

\- Buy more cables than you think you'll need

\- You'll always forget something

\- There are a number of companies that will lease you servers for pretty
decent rates

------
avifreedman
Agree that bare metal is effective and doable at your scale, and can, if done
right, give a better SLA and much, much better control - especially of
Internet-facing network performance - than public cloud (or combos thereof).

We are running a 50%+ gross margin mid-stage venture-backed startup in Equinix
facilities (but started there vs. cloud), and have no people near our
facilities, and have had 0 issues service-wise related to doing management
remotely. Yes, people go out to set up cabs, etc, but we hired our ops folks
as generalists who had some network experience, and our CEO and CTO do as
well, though AFAIK I don't have network logins active right now.

2 high-level thoughts I'd share:

1) Try not to use Ceph unless you're committed to having 2 people with deep
experience at the code level.

2) I'd use Juniper QFX or EX, or Aristas. You don't seem to be running at a
scale or functionality level where SDN magic is needed, and there is a large
community of QFX, EX, and Arista users your folks can reach out to when
problems happen.

The other comments are more tuning and FYI on what we do HW-wise:

Specifically re: HW, at Kentik we run tens of worker nodes + flow ingest
servers, all SM 1Us with a few SSDs and 256-384GB RAM. 48 logical cores, 2 x
E5-2650v4.

We run approaching 1PB of storage, and while we still have some 4U 36-disk
3.5" boxes, those are phasing out and all we buy now is 2U SuperMicros w/
24x2TB Samsung Evo 850s. Procs are 72 logical cores, 2 x E5-2697v4.

The Evo SSDs have been great - but our workload is largely appends or
create/writes - largely but not all sequential, with high read IOPS. Before
Samsung I was a big fan of Intel but we have no data on the modern Intels -
slower for sure, but a focus on reliability is great...

We use JBOD and ZFS on the storage nodes, with LSI 9300-8i HBAs. We have things
tested so we can do TRIM.

They do make SuperServers for roughly those configs, but we go with SM
resellers who assemble and burn-in for +10-15%. I had 50+ SuperServers that
were great at my Usenet company, but we'd rather have our ops folks work on
things other than burn-in.

Happy to explain why we went to SSD vs. spinning at 2x the cost, but basically
it made enough of a difference at the 95th and 99th percentiles in our query
times, and we had access to venture debt on great terms (which you should too,
and I'm happy to discuss, since we're both funded by August).

Last note re: gear - when we were doing spinning, we found a screaming deal on
new 2TB enterprise SATA (Hitachi, I think) for $50 and took the power/space
hit for the +IOPS and extra compute we got for firing up the additional
machines. Not sure if those are still out there, or the IOPS of this kind of
approach would be needed.

------
i336_
First of all, massive kudos for the Stack Exchange-like technical
transparency. Definitely consider a massive upgrade album like
[http://blog.serverfault.com/2015/03/05/how-we-upgrade-a-
live...](http://blog.serverfault.com/2015/03/05/how-we-upgrade-a-live-data-
center/) and [http://imgur.com/a/X1HoY](http://imgur.com/a/X1HoY)!

GitLab is awesome. I'm really sad that in the past two or three months I've
only found one GitLab link on HN to click. There really needs to be more. (I'm
not sure if this is because I'm browsing in AEDT or if GitLab isn't used a lot
on here.)

I wondered about how you guys might do advertising to get more mindshare, and
then I realized one possible explanation about why you're doing this: getting
technical advice from the community means everyone's had a part to play, and
they're likely to remember that. Good move ;)

\---

In my case, I have little (okay, 0) practical experience; a lot of the
following is mentioned experimentally, to see how these ideas would handle the
described environment. It's pretty much all stuff I've read online.

I welcome replies that shoot down any of these ideas.

> _Disk_

 _Disks can be slow so we looked at improving latency. Higher RPM hard drives
typically come in GB instead of TB sizes. Going all SSD is too expensive. To
improve latency we plan to fit every server with an SSD card. On the
fileservers this will be used as a cache. We're thinking about using Bcache
for this._

There's already been another brief comment
([https://news.ycombinator.com/item?id=13153317](https://news.ycombinator.com/item?id=13153317))
about ZFS.

So, I'll ask. Why not ZFS? You don't have to run FreeBSD anymore to get a
stable implementation.

You can put both the L2ARC and ZIL on SSDs. You can even use striping with
them. Don't quote me on this but I think there _MAY_ be some recovery
capabilities built into these layers for if the power goes out (either it
didn't use to be possible and now it is, or it's architecturally impossible, I
hilariously cannot remember which).

\---

> _In general 1GB of memory per TB of raw ZFS disk space is recommended._

This is ONLY if you have dedupe switched on. If you have dedupe off you can
run systems in just 4GB. A lot of home server enthusiasts do this.

There are a lot of unfortunate and widespread misconceptions about ZFS.

\---

(This bit's somewhat anecdotal and is more informational than actionable. It's
worth noting if you're interested in disks.)

> _Every node can fit 3 larger (3.5 ") harddrives. We plan to purchase the
> largest one available, a 8TB Seagate with 6Gb/s SATA and 7.2K RPM._

 _Technically_, the largest one available (on Amazon and presumably
elsewhere) right now is 10TB, but its price/capacity ratio is atrocious
compared to the rest of the market ($450-$520 _per disk_).

I've heard that Seagate Enterprise Capacity drives either die within the first
2-4 weeks or last 20 years. They have 5 year warranties in any case. I haven't
heard anything else about other disks.

Very interestingly, 8TB seems to be the current market leader. Here are a
bunch of prices I took straight off Amazon, as guides:

#3: 4TB: $170 (13 disks for 52TB = $2040)

#2: 5TB: $200 (10 disks for 50TB = $2000)

#4: 6TB: $239 (9 disks for 54TB = $2151)

#1: 8TB: $360 (7 disks for 56TB = $1673)

#5: 10TB: $450 (5 disks for 50TB = $2250)

A little while ago 5TB was the leader, and I was going to argue for more
disks.

\---

(Hitting _add comment_ now instead of waiting so I can keep up with the
discussion)

------
dimino
> We are attempting to build a fault-tolerant and performant CephFS cluster

Ohhh boy. I hope this works, and look forward to hearing updates.

~~~
mritun
From the CephFS website:

"Important: CephFS currently lacks a robust ‘fsck’ check and repair function.
Please use caution when storing important data as the disaster recovery tools
are still under development. For more information about using CephFS today,
see CephFS for early adopters."

I'm getting the same feeling I get when I'm watching those "Hold my beer and
watch this..." videos.

From my vantage point they look to have 0 experience in building and running
infrastructure... and they're asking for advice on HN. They might as well post
an Ask Slashdot thread if they want armchair advice. Genuinely, I think they've
crunched some numbers and think they can run their stuff cheaper and faster
in-house... but they've probably underestimated the human-experience angle.

For just 10-20 physical servers, this is going to be either extremely
expensive (if they hire right) or extremely painful (if they don't).

~~~
sytse
We certainly don't have any experience hosting our own hardware.

We're not doing this to save money, we're doing it to increase performance
[https://about.gitlab.com/2016/11/10/why-choose-bare-
metal/](https://about.gitlab.com/2016/11/10/why-choose-bare-metal/)

~~~
mritun
Then prototype first. Rent a server or two somewhere for 3-6 months and run a
shadow deployment. Once you're confident that you understand all the "other
80%" stuff that is involved in running your own infrastructure, and that you
won't lose data, then think about doing it yourself.

A service provider's biggest responsibilities to its customers are security,
durability, availability and performance -- in that order. You guys are vastly
underestimating the complexity involved in getting the first 3 right.

