
Why we are not leaving the cloud - happy-go-lucky
https://about.gitlab.com/2017/03/02/why-we-are-not-leaving-the-cloud/
======
Dangeranger
It's interesting that the first half of the explanation is largely quotes from
a prior HN post's responses.

Does anyone else feel a bit weird seeing offhand comments getting quoted in
the explanation for a business decision? I guess we should all get more
accustomed to our public input carrying weight in the zeitgeist.

Gitlab's "develop in the open" nature really shows through here. I am not
saying that's bad, it's just so different from most startups and established
businesses.

~~~
gtaylor
I don't think they were using this as justification or explanation in their
eventual reversal. I think they provided these quotes as interesting tidbits.
Kind of like pull-out quotes in an article. You can ignore them entirely and
still get the gist of the post.

On a more general note: It's got to be incredibly hard to do what GitLab does
with their extreme transparency. I feel like we have to be careful about
reading too deeply into things and nitpicking their culture or process. HN is
full of "expert" advice, much of it being terrible. They weighed their
options, invited feedback, then made a decision.

I appreciate what GitLab does in being so transparent. None of us are owed
explanations or insight into how they operate, yet they go out of their way to
provide it. Kudos to sytse and his team!

~~~
sytse
Thanks. The most useful advice was shared in private: people coming to our
office to share war stories of regretting a move to bare metal. I also
received dozens of direct messages via Twitter from people who couldn't share
their stories publicly. Some of the best advice is stuff we can't share
publicly. For example, a major company going bare-metal and then spending a
lot of time setting up an authorization system that you get for free with AWS
IAM.

~~~
dustinmoris
Did you have to sign an NDA? Because if people send you random advice via
email after reading a blog post, then this sounds to me very much like
"publicly sharing" their experience...

~~~
shshhdhs
I disagree. It's a private email. Gitlab can ask permission to post, but it
would be impolite and perhaps unethical to assume a private email is now
"public". If they wanted it public, they could have chosen to provide a
comment on HN, Twitter, etc.

~~~
sytse
Exactly, everything is private unless it was posted in public by the author or
explicit permission was given. We have a transparency value, but we understand
that other organizations are different. And even we assume that our private
communication will stay private.

------
dx034
As much as I like gitlab, that decision seems weird. I'd expect a business to
come up with numbers for different scenarios, not just quotes (even if those
weren't the only deciding factor).

What I mean is, instead of writing about 8TB disks vs 2TB disks, get hard
numbers for those and compare the costs of the different configurations to
the costs in the cloud.

Instead of saying that engineers are expensive, look at market data to
determine how much you have to pay to hire these people.

Instead of writing about the potential of full cabinets, get quotes from data
center providers of what they can provide and how they would deal with this.
I'm sure Equinix has solutions for that kind of problem.

And in addition to that, it seems that managed dedicated wasn't considered at
all. Why hire someone to swap disks for you? Why not just rent server
capacity and have the data center provider take care of hardware issues? They
operate at bigger scale and can do this with little overhead cost. You're
still much cheaper than the cloud but don't need to drive somewhere to swap a
server (you just switch to a spare one until the data center fixes the
problem). At GitLab's scale, it shouldn't be too hard to get a decent bespoke
quote from companies like OVH.

~~~
discordianfish
They also miss out on the dogfooding aspect. They target enterprise customers
who probably run GitLab on bare metal. By doing the same, you experience what
your customers do and have much closer feedback loops on any issues that
might especially affect bare metal.

Regarding managed dedicated... I had a very bad experience with Rackspace in
that regard, which is why we moved to AWS. At another job we had our own DC
suite but had a company take care of initial cabling, plus remote hands from
the datacenter. If I ever do bare metal again, that's how I would do it.

~~~
sytse
I agree that dogfooding is important. Our users and customers run GitLab both
on bare metal and in the cloud. GitLab tends to run better on bare metal
because of the better IO performance. So by running it in the cloud ourselves,
we are optimizing where it is needed most.

------
keepper
Why was hybrid not considered? With providers like Megaport offering really
inexpensive direct connect, this is almost a no-brainer.

This may seem harsh, but relying on random commenters shows a huge flaw in how
you guys went about this.

Physical environments DO work well, but they do require experience to run
them. I have used AWS for as long as it's been around, but nothing beats
physical environments for "known workloads", as long as you run them EXACTLY
like you would run a cloud environment. This is why the hybrid approach is
such a great thing. Run what you know well on physical and reap great
savings; run what you don't know well in the cloud (as well as take advantage
of the analytics products), and reap the speed ;)

- This means thinking of physical servers as individual units.
- This means redundancy at every level.
- This means architecting for failure (servers and switches do die).
- This means no shared storage for performance-critical parts (shared storage
as in below the OS level).
- This means object stores/sharding/etc. as your storage layer.
- This means real engineering.
- And this means exactly the same whether you're physical or not.

I've managed environments of 50 VMs and environments of 50,000 physical
servers. The methodology is always the same.

And yes, this means you can save some monthly cost and apply it to staff who
can do a lot more than just maintain this infrastructure.

PS: for those who think that showing up to a datacenter is required, you're
doing it wrong. Pretty much any datacenter has a remote-hands service, and
with the right redundancy, hardware replacement is something you can do at a
slower pace.

PPS: I'm sure someone will nitpick some of my points. The reality is, there
are real money savings here. For example, Snapchat spends MORE on cloud
infrastructure per year in 2016/2017 than ALL OF GOOGLE did in 2012... think
about that for a second... even Netflix runs Open Connect to push bits.

~~~
dasil003
> _This may seem harsh, but relying on random commenters shows a huge flaw in
> how you guys went about this._

Your other points and experience are well taken, but this is definitely too
harsh. Think about the nature of GitLab as an open company building dev tools
and paying below-market salaries. They have a huge amount of developer
mindshare relative to their internal head count. They are not very competitive
for hiring people with your particular skillset.

Given these facts, I think an open decision making process serves them better
than trying to make decisions behind closed doors. It may come off a little
ham-fisted, but in actuality I think it's far _less_ ham-fisted than a huge
number (maybe even majority?) of engineering decisions made every day in
companies that just don't have the expertise to make the right decision and
often don't even realize they are getting burned by what they don't know they
don't know.

~~~
extrovert
> _They are not very competitive for hiring people with your particular
> skillset._

There is plenty of ops talent willing to work at a high-profile startup. I
have experience building and running PB-scale Ceph clusters, private and
public clouds, HA infrastructure covering everything from DB failover to BGP,
and interviewed at Gitlab when they announced going on-prem. I asked for
_half_ the market rate (because I was more interested in solving the scaling
problem than money), but did not get it.

The HR person at the first (non-technical!) interview asked a bunch of
programming-related questions from a script, which caught me completely off
guard. One question went along the lines of "tell me about a difficult
problem that you had a particularly elegant solution for". I think they've
learned by now that ops problems usually can't be solved by a simple
algorithm.

In retrospect I'm glad I didn't get it, but they seem to be having issues with
more than infrastructure.

As a side note, if there are other startups wanting to move to their own
hardware but struggling to find the right people, I would love to know where
to find you. Non-AWS ops jobs are few and far between these days.

~~~
skuhn
It has definitely become less common to see companies build much outside of
cloud environments. And the talent has dried up to an extent as a result.
Expertise and interest in building outside of cloudland is important to find,
but there's also a challenge in finding the right operational mindset as well.

It seems more common with more mature "startups" that have done the cost /
benefit analysis and realized that the only way to achieve economies of scale
for their business is to exit AWS. Dropbox, GitHub, Uber and so forth. Or
businesses that simply must operate their own hardware, like CDNs.

There are some smaller places that see value in building a low-cloud service
from the beginning. One of those companies is where I work, e-mail in profile.

------
101km
Gitlab is hosted on Azure and like many startups received six figures worth of
free credits to be in the cloud.

As the credits run out, the architecture reverts to its natural state -
architectures tend to be a derivative of a company's organizational
structure, code base, and processes.

Through this lens, these decisions start making more sense. It is a reactive
process. Sometimes when you start small you bounce around like that, and if
you're lucky you keep growing for years and years until you hit a wall.

------
AgentK20
A bit disappointed in the article, as the bulk of it is simply quoting various
viewpoints, some of which disagree with each other. I'd be much more
interested in the thought process that occurred for them to reach the
conclusion they did. What internal doubts did they raise? What possible
solutions to those doubts did they consider? Things like that.

~~~
Ajedi32
The reasoning behind their decision is linked near the top of the article,
under "Sid and the team decided":
[https://gitlab.com/gitlab-com/infrastructure/issues/727#note_20044060](https://gitlab.com/gitlab-com/infrastructure/issues/727#note_20044060)

------
ben_pr
I'm a little shocked at this decision. The issues faced are all solved
problems, from PB-scale storage to HA server clusters, even across data
centers. Good solutions do have upfront costs, but the math that cloud
hosting is 5-10x more expensive than company-owned or leased colocation is
still in the ballpark. It sounds like they may need an architect with
enterprise experience to help them out rather than random comments on HN.

~~~
djsumdog
Yea, every growing company I've been at has been in the process of moving off
of hosted solutions and onto their own hardware to cut costs.

Two such companies made the mistake of doing that with OpenStack, which is
terrible and should die in a fire.

But one later switched to DC/OS and containers, and it has worked really
well. They've been migrating apps running on EC2 instances into Docker
containers that can run on Marathon in our local data center, and the savings
are pretty substantial (even adding in the cost of the teams needed to
maintain our own platforms).

Managed solutions are great for startups. There is a lot of value in not
having to set up, maintain and manage your own hardware... but that does
reach a limit, and companies need to be prepared for that transition and
avoid lock-in.

------
user5994461
I love all the comments trying to explain that managing your own hardware,
storage, network and backup cannot take that much of your time.

Not only does it take a huge amount of time, it takes so much time that it
requires multiple dedicated, highly skilled people.

Last but not least, the last article about GitLab was about a major outage
where they had as many as 6 out of 6 backups unusable! The fact that they
still exist today is down to the sheer luck that an unexpected manual copy
was made before the disaster.

That speaks volumes about how that 100-person company is struggling to have
the resources and the qualifications it needs to execute (what startup
isn't?). Let's not send them on a suicide mission to manage their own
hardware on top of everything they already have to achieve. ;)

~~~
nailk
Why do you think that not using a cloud service means you have to own the
hardware? You don't.

Dedicated server providers will manage your hardware, that's it.

------
pestaa
What about managed dedicated servers? You don't need to go all-in on bare
metal and be responsible for networking and swapping disks and all that.

I know it's more than a handful of deployment recipes, but it's not like they
don't manage their cloud instances either...

This article is a false dichotomy with a seemingly rushed decision.

~~~
dx034
I was wondering the same. Why not manage servers without managing the
hardware? There are several hosting providers with enough capacity and enough
data centers.

You'll still need people managing servers, but no one has to live close to a
datacenter and you can emulate at least some cloud-like features (e.g.
spinning up a new server if one server fails and dealing with the failing
server later).

The only case where that doesn't work is when you need physical security for
your data (locked cabinets that no one from another company can enter), but
that's no concern if you already had your data in the cloud before.

------
dylanha
Where is the original thought? There were few hard numbers, way too much
quoted material, and no GitLab commentary on it. It looks like they made the
post based on feelings instead of a metric like time or dollars.

~~~
timanglade
Point well taken on the blogpost leaving some important info out — sorry about
that. Having been involved in the conversation, I can guarantee you that we
came to this conclusion primarily through an analysis of the numbers (both
cost & effort). I’ll see what we can do about publishing them.

------
frabbit
Buried towards the bottom is the interesting decision to dump CephFS and go
with NFS instead.

~~~
Rezo
I wonder if Amazon's new managed EFS service would make sense? It's exposed as
a NFS mount to the OS. The claims are:

- Up to thousands of Amazon EC2 instances, from multiple AZs, can connect
concurrently to a file system.

- Data is stored redundantly across multiple AZs.

- Low, consistent latency.

- Multiple GBs per second of throughput.

~~~
sytse
EFS is pretty great. But for our use case there are a few drawbacks:

1. It doesn't scale to the size we need for all repositories, so we still
need sharding.

2. It is expensive: $300/TB per month.

3. You're still sending all traffic over the network; we prefer the latency
of local storage, which we can achieve by developing Gitaly:
[https://gitlab.com/gitlab-org/gitaly](https://gitlab.com/gitlab-org/gitaly)
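To put the second drawback in perspective, here is the quoted $300/TB-month figure multiplied out at a few storage sizes. The sizes are illustrative assumptions, not GitLab's actual repository footprint:

```python
# Multiply out the $300/TB-month EFS figure quoted above.
# Storage sizes are illustrative assumptions, not GitLab's real numbers.
EFS_PRICE_PER_TB_MONTH = 300  # USD, as quoted in the comment

for tb in (10, 100, 1000):  # 10 TB, 100 TB, 1 PB
    monthly = tb * EFS_PRICE_PER_TB_MONTH
    print(f"{tb:>5} TB -> ${monthly:>9,}/month, ${monthly * 12:>11,}/year")
```

At petabyte scale the bill runs into millions per year, which makes sharding over cheaper local storage (points 1 and 3) look attractive.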

~~~
jamesmiller5
Why develop Gitaly instead of using consensus (etcd, zookeeper, etc.) for
writes to fast storage like ssds and the normal git binary? As far as I can
tell it's adding a lot of complexity by caching high level RPC calls but still
doesn't really address multi-master read after write consistency or
coordination.

------
nik736
So, GitLab based a crucial part of their company on some random comments that
were just quoted and could just as easily have gone the other way? Where is
your own thinking and your own decision making?

Bare metal doesn't mean colocation. It also doesn't mean going bare metal
only, you could go with a hybrid approach and save the base costs while
maintaining the cloud scalability.

~~~
timanglade
See my comment clarifying how we came to this decision [0].

And for what it’s worth we considered colocation and hybrid models as well
before coming to this conclusion.

[0]:
[https://news.ycombinator.com/item?id=13776735](https://news.ycombinator.com/item?id=13776735)

------
tgtweak
Use your bare metal for your low-water mark and "scale over" to the cloud.

Setting up a hybrid cloud like this is not overly complicated and AWS and
other providers will offer you a direct connect network connection in many
carrier neutral colocation datacenters so you can sit on your cloud vpc with a
physical connection in your datacenter of choice.

Get the best of both worlds. Providers like OVH can get very close to the
cost of buying your own metal, with a TCO break-even at over 18 months, not
counting network and power equipment costs. Using EC2 or GCE for a large
number of persistent VMs without a heavy discount seems... wasteful.
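The break-even arithmetic alluded to above can be sketched as follows. Every figure here is a placeholder assumption, not a real OVH or AWS quote, and network/power gear is excluded as the comment notes:

```python
# Back-of-the-envelope TCO break-even between renting a dedicated server
# and buying the metal outright. All figures are placeholder assumptions;
# substitute real quotes before drawing conclusions.
RENT_PER_MONTH = 140   # managed dedicated-server fee (assumed)
PURCHASE_PRICE = 2000  # upfront hardware cost (assumed)
OWNED_OPEX = 30        # monthly colo space, power, remote hands (assumed)

def breakeven_month(rent, capex, opex):
    """First month where cumulative owned cost drops below cumulative rent."""
    if opex >= rent:
        return None  # owning never breaks even
    month = 0
    while capex + opex * month >= rent * month:
        month += 1
    return month

print(breakeven_month(RENT_PER_MONTH, PURCHASE_PRICE, OWNED_OPEX))
```

With these made-up numbers, owning pulls ahead around month 19, in the same ballpark as the "over 18 months" figure above; real break-evens depend heavily on discounts and the hardware refresh cycle.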

------
mschuster91
Hmm. Nearly all the commenters seem to be "extreme", i.e. either full-cloud or
full-bare-metal.

Why not go hybrid? Do your baseline load with bare-metal and your spike load
with AWS, GCE or Azure Cloud.

~~~
detaro
Cloudbursting either a database server or their git file storage doesn't
sound all that viable, and I'd expect those to be the things that matter/are
the bottleneck for GitLab?

------
StavrosK
I'm really glad you guys decided to listen to the experienced commenters'
advice and change your mind, rather than obstinately plowing on with what
seems to be a bad decision.

------
theptip
As a potential Gitlab customer (I would be running an instance in the cloud if
I moved off of their hosted offering), this removes the main concern I had
with migrating to their platform.

Fundamentally the performance problems they had could be solved by either
software or hardware, and I have more faith in the team's ability to learn the
distributed computing / caching techniques required than to learn how to run
their own metal.

------
jstoja
Why didn't they choose to split it? You keep your bare-metal infrastructure,
and if you hit tremendous growth and need more capacity, you can use the
public cloud to handle the load until you plan a good bare-metal solution.

Sure it's some work, the setup is sometimes not ideal, but it's a huge gain
imho.

In addition, many people bring up the argument that infrastructure is not
their core skill. That is true, but do they need to match the level of the
public clouds to run their own infrastructure? I mean, even with sub-optimal
hardware choices but smart choices about architecture, you can still have
huge savings and not so many risks by doing it in-house.

------
jtwaleson
Nitpick for the author, you put twaleson for my quote, should be jtwaleson.

~~~
connorshea
Whoops, sorry. Opened a merge request to fix that.

[https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5175](https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5175)

~~~
sytse
And it's merged and deployed.

------
tabeth
Couldn't they have easily just drawn the opposite conclusion? I don't see a
strong connection between the article's contents and its conclusion.

~~~
Rezo
I read it as: commenters raised a bazillion legit questions that we had not
necessarily considered (unknown unknowns), and now the whole endeavor seems
more risky and the TCO questionable. Let's instead gradually re-architect our
application, so it fits better in the cloud, which also happens to align with
what most of our enterprise customers are going to need anyway.

------
impappl
This commit really does tell the story of the disproportionate influence HN
comments had on this decision:

[https://gitlab.com/gitlab-com/www-gitlab-com/commit/8f7d618d14b91c338e9d4bfca96f86c3c6e5766b](https://gitlab.com/gitlab-com/www-gitlab-com/commit/8f7d618d14b91c338e9d4bfca96f86c3c6e5766b)

------
drinchev
HN is a great way to get opinions about a decision, although sometimes the
criticism far outweighs the positivity and support.

A post saying "We use X and it's great" will usually attract lots of "X is
bad, use Y" comments.

I really hope that GitLab made a well-informed decision based on arguments
and counter-arguments.

------
mahyarm
In my observing-from-a-distance experience, going bare metal makes sense once
you are a certain size and have predictable baseline load to maintain. Many
companies that have their own data centers still use cloud services to auto
scale and deal with spikes.

------
lightedman
As long as you prove yourselves incapable of handling things without the aid
of third parties, I will never be using your services. Sorry, I'm just picky
like that and demand true competence.

------
bernardlunn
TL;DR = people cost more than servers = stay in the cloud

------
tbrowbdidnso
So everyone says the cloud is the future. I get it. But is this the truth, or
what all the tech giants want people to believe? Don your foil hats for a
moment and listen to me.

All the original internet companies run their own hardware. They rent out
excess production capacity to us peons in the form of cloud services.

These companies that all run their own hardware exclusively are telling
everyone that it's stupid to run your own hardware... Why are we listening?

Look to the newer tech giants for a preview of what's to come. Netflix, for
example, is completely at the mercy of Amazon. They might as well be "Netflix,
brought to you by Amazon". Their edge is in software alone, nothing that
creates a huge barrier to entry like custom hardware. This makes them much
easier to dethrone. Hilariously, they rent all their hardware from a direct
competitor who has access to all their software secrets. Does anyone else see
something wrong with this?

The cloud as a money saving venture is and always has been a damn lie.

It's the same tactic as when automotive companies paid off local governments
to destroy America's public transport many years ago. All the big tech
companies have their hands in the cookie jar besides Facebook, who remains
mostly silent but runs their own hardware as well.

The major tech companies have every incentive to make you think running your
own servers is nigh impossible. Don't drink the fucking Kool-Aid. If the
industry continues to consolidate rapidly, AmaGooSoft will be the ONLY places
you can have web servers in ten years.

~~~
dsr_
The cloud isn't a money-saving tactic. It enables you to test demand for a
service for a very low absolute cost. Spin up a tiny machine for $60/year?
And you can pay for it by the month, or the hour? That's great. You can find
out if anybody is going to buy your thing at all. You can probably get to
ramen-profitable.

As soon as you can confidently predict sufficient demand, the economically
rational decision is to hire a good ops team and run real hardware. But to do
so, you need to either have money or get money - and the costs of cloud
service are zooming up, potentially leaving you without the margin to invest.

The question is what that level of sufficient demand is.

~~~
deegles
> Hire a good ops team

I feel this is one of those "easier said than done" statements...

~~~
tw04
I'd argue that's only easier said than done if you insist on doing a
roll-your-own open-source science project. If you default to established
enterprise vendors, finding ops people isn't _THAT_ difficult.

~~~
tobltobs
But then the cloud might be cheaper again.

~~~
tw04
It isn't, and it's not even close. Unless you're running at less than 40%
efficiency, the cloud is more expensive for infrastructure running 24/7. If
you're a shop with 1 server and no IT guy? Sure, it probably makes sense to
use the cloud - although I'd argue you'd be better off with managed hosting
so you actually have a number to call if something breaks.

An organization of even medium size? Not a chance.
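The 40% figure falls out of simple utilization arithmetic. The hourly prices below are assumptions chosen for illustration, not real AWS or hardware quotes:

```python
# Utilization break-even behind the "40% efficiency" claim. An owned box
# costs its full amortized rate 24/7 whether busy or idle; cloud capacity
# can (ideally) be paid for only while in use. Prices are assumed.
CLOUD_PER_HOUR = 0.50  # on-demand price for an equivalent instance (assumed)
OWNED_PER_HOUR = 0.20  # amortized hardware + colo + ops per hour (assumed)

# Owning wins whenever average utilization exceeds the price ratio.
breakeven = OWNED_PER_HOUR / CLOUD_PER_HOUR
print(f"break-even utilization: {breakeven:.0%}")  # -> break-even utilization: 40%
```

If the owned stack costs 40% of the cloud rate per hour (the ratio assumed here), cloud only wins below 40% average utilization, which matches the claim above; a different price ratio moves the threshold accordingly.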

~~~
dragonwriter
> Unless you're running at less than 40% efficiency the cloud is more
> expensive for infrastructure running 24/7.

Sure, paying extra for a dynamically scalable service is inefficient if your
resource needs are flat 24/7/365, but is that realistically the case?

(OTOH, one must not forget that there is a cost to engineer systems to take
advantage of dynamic scaling to minimize costs while meeting requirements.)

~~~
hueving
> Sure, paying extra for a dynamically scalable service is inefficient if
> your resource needs are flat 24/7/365, but is that realistically the case?

Yes, the vast majority of apps are not 12-factor architectures designed for
scaling, so nodes can't be brought online and killed at the flip of a switch.

------
winteriscoming
Every time I see a post from GitLab, I worry for them. I haven't used GitLab,
but I have checked it out a few times. On the other hand, I use GitHub
regularly. But this isn't about GitHub vs GitLab.

Instead, this is about GitLab's definition, or rather level, of transparency -
something they seem to be proud of and are even appreciated for very often
here on HN. I feel that GitLab has taken the transparency thing to an
extreme, to the extent that they even named the engineer who caused a recent
downtime. I don't think it matters that the engineer didn't object to it.
Then there are posts about why they are choosing some UI framework, why they
are moving away from the cloud, and now this. It's fine to be transparent,
but in a professional world it's also important to be private about certain
things. Trying to please people by being transparent for the sake of it will
take a toll on the team eventually. It's much more effective to just get
things done instead of constantly being transparent to the external crowd
about every single detail.

