Why we are not leaving the cloud (gitlab.com)
350 points by happy-go-lucky on Mar 2, 2017 | 212 comments

It's interesting that the first half of the explanation is largely quotes from a prior HN post's responses.

Does anyone else feel a bit weird seeing offhand comments getting quoted in the explanation for a business decision? I guess we should all get more accustomed to our public input carrying weight in the zeitgeist.

Gitlab's "develop in the open" nature really shows through here. I am not saying that's bad, it's just so different than most startups and established businesses.

I don't think they were using this as justification or explanation in their eventual reversal. I think they provided these quotes as interesting tidbits. Kind of like pull-out quotes on an article. You can ignore them entirely and still get the gist of the post.

On a more general note: It's got to be incredibly hard to do what GitLab does with their extreme transparency. I feel like we have to be careful about reading too deeply into things and nitpicking their culture or process. HN is full of "expert" advice, much of it being terrible. They weighed their options, invited feedback, then made a decision.

I appreciate what GitLab does in being so transparent. None of us are owed explanations or insight into how they operate, yet they go out of their way to provide it. Kudos to syste and his team!

Thanks. The most useful advice was shared in private. People came to our office to share war stories of regretting moving to bare metal. I also received direct messages via Twitter from dozens of people who couldn't share their stories publicly. Some of the best advice is stuff that we can't share publicly. For example, a major company going bare-metal and then spending a lot of time setting up an authorization system that you get for free with AWS IAM.

Did you have to sign an NDA? Because if people send you random advice via email after reading a blog post then this sounds to me very much like "publicly sharing" their experience...

I disagree. It's a private email. Gitlab can ask permission to post, but it would be impolite and perhaps unethical to assume a private email is now "public". If they wanted it public, they could have chosen to provide a comment on HN, Twitter, etc.

Exactly, everything is private unless it was posted in public by the author or explicit permission was given. We have a transparency value, but we understand that other organizations are different. And even we assume that our private communication will stay private.

Apparently, according to syste, some of the best arguments came from emails, so he could have put those arguments, anonymised and trimmed down to the main points, into the blog post. That would have respected the company's privacy while still sharing good arguments, unless there was an NDA that prohibited it.

So I disagree with you, because you look at this very black and white while the reality is grey.

We quote a lot of responses from HN but that doesn’t mean it’s the only thing that led us to this decision. I was personally involved in a lot of private conversations about this, with the executive team, with investors and with people who had gone through this exercise before, and they all had as much if not more impact on our final decision than the HN thread did. It’s just harder to quote them than referencing additional (& similar) opinions noted in a public forum.

That is good to hear. My comment was not meant as a negative criticism of your process, although I can see how it could be interpreted that way. It was intended more as an observation about the openness that Gitlab brings to the process of design and decision.

No worries! I didn’t perceive your comment as negative at all — I just took it as an opportunity to clarify something that we should have made more clear in the blogpost itself really. Thanks for bringing this up and for doing so in a constructive way!

Well... Here is something very interesting. Myself and others were "cautioning" very heavily against moving to self-hosted gear. It wasn't until their CTO talked to somebody in person he had respect for that they flipped over to staying in the cloud.

This to me calls into question the entire idea of crowd-sourcing decisions and plans like this. It was a single person who moved the needle, and now there is retroactive agreement (and quoting) with the people who were against it from the start. If it had gone the other way there would be plenty of quotes to pull from people supporting the move to dedicated.

Recommending someone to build in the cloud is an easy and safe recommendation to make. Therefore, more people will go that route when asked what you should do.

Pretend for a second you plucked someone from a royal family that never prepared their own food, and they asked you how to make food. Are you going to recommend they purchase land, start a farm, grow their own vegetables, and raise their own cattle? Hell no. You tell them to go to the grocery store (cloud) and if they are desperate, they can purchase some pre-made food at restaurants (SaaS providers).

When you don't know anything about a person's exact needs or abilities, making the safest recommendation is easiest, and therefore it proliferates.

I hate the cloud but I know how to do what I want on my own without it. You don't know that about other people.

I don't think the cloud is the grocery store. It doesn't give you a pre-packaged solution that you can just rip open and immediately access.

Rather, the cloud is like leasing the land from an owner. You still have to till, plant, maintain, and harvest, but you have to pay an annual rent to do so. There are a few things that the landowner is responsible to provide, but most of the day-to-day maintenance responsibility is on the lessee. People generally recognize that while renting is sometimes necessary, it's much nicer to be an owner.

We really need to stop this belief that the cloud is a magic thing that can fix your problems. It splits out the responsibility for actual, real-life hardware maintenance to a hosting company, but other than that, the responsibilities, including the responsibility to take backups, set up security, and design a fault-tolerant application, remain in the cloud user's court.

Cloud is just a way to rent a virtual server; you still have to make the server work correctly.

I think most cloud providers go beyond that. For example, creating/restoring snapshots with a UI, monitoring/automation tools, large scale DB services, etc.

I rent dedicated servers from a large-scale provider. They're really good at hardware and networking, but that's all they do (well, that and some OpenStack S3 equivalent). I have to do my own Icinga monitoring, Ansible automation, etc. For the type of product I run, it's fairly simple. It's maybe not as cheap as owning the hardware, but I think it's a good compromise.

Not to mention that when everyone is using the same tool from Amazon, at some point it gets big and clunky and its development stalls because people rely on it never changing, and then other options out there eventually leapfrog it. (I strongly dislike the google-cloud UI; I waste so much time when I have to use it for one of my clients that has stuff there.)

Snapshots can be created and restored with bare metal if you install the right software. Only basic metrics are monitored by AWS by default (you have to pay more for more detailed monitoring) and alarms have to be specifically configured, as I recall. Note that "bare metal" doesn't necessarily mean non-virtualized; the end user can be running VMware or Xen on owned hardware and get "cloud-style" flexibility and goodies like automatic monitoring.

DB services are a different class of problem than bare metal v. cloud application servers, and cloud DB services can be used whether you run your application on bare metal or cloud instances (but, for the record, the same type of things are mostly applicable here, in that the user still has to do their own security and configuration).

This echo chamber has a profound and mostly unwarranted impact on people, myself included.

When I evaluate any tech project for use in production, I always give HN a cursory search for materials. It's not always fruitful, but in the cases where I stumble across something big in the HN space, I pick up valuable intel here...

I still try to research elsewhere too, just to cover my bases, but HN is quite helpful as an archive of knowledge. Especially knowledge about running something in production.

No surprise here. Honestly, their team seems to be lower skill or less experienced than I'd have thought.

I would never approve a production system as expansive as Gitlab's to only have two databases in a cluster. That is asking for trouble, and any {sys,db}admin worth their salt will tell you the same. As soon as you need to do anything on one database, you've just lost your cluster policy.

Then there's the lack of automation, especially around validating db backups, failover (not having the failover process scripted and tested is _begging_ to have a nightmare at 2am where you're reading documentation on how to fail over a db), etc.

The simple thing of having your hostname / $PS1 say the machine's purpose could have stopped this. All prod machines have a bright red PS1 and a clear name of <type>-<service>-<prod/dev/etc>-<region>-<dc>.corpnet in my setups.
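As a rough illustration of that convention, the sketch below derives the environment from a hostname shaped like <type>-<service>-<prod/dev/etc>-<region>-<dc>.corpnet and picks a bash prompt string from it. The function names, the example hostname, and the colour codes are assumptions made for this example, not the commenter's actual setup, and the parsing assumes the type and service fields contain no dashes of their own.

```shell
# Return the environment field (third dash-separated part) of a hostname
# such as db-gitlab-prod-use1-dc2.corpnet (a made-up example name).
env_from_hostname() {
  short_name=${1%%.*}                    # drop the domain part
  printf '%s\n' "$short_name" | cut -d- -f3
}

# Print a bash PS1 string: white-on-red for prod, plain green otherwise.
# \[ \] wrap non-printing sequences; \h, \w and \$ are bash prompt escapes.
prompt_for_host() {
  host_env=$(env_from_hostname "$1")
  case "$host_env" in
    prod) printf '%s' '\[\e[1;37;41m\][PROD \h]\[\e[0m\] \w \$ ' ;;
    *)    printf '%s' '\[\e[0;32m\][\h]\[\e[0m\] \w \$ ' ;;
  esac
}
```

In a ~/.bashrc this could be wired up with something like PS1=$(prompt_for_host "$(hostname -f)"), so that anyone opening a shell on a prod box gets the red prompt before typing a command.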

All of this is reflected in their discussion style, meeting style, etc. Ad-hoc, not very carefully designed, random off-hand comments. Obviously a young team with a lot to learn. Nothing wrong with that, but a lot of customers are relying on their skills! Learn quick!

> I would never approve a production system as expansive as Gitlab's to only have two databases in a cluster. That is asking for trouble, and any {sys,db}admin worth their salt will tell you the same. As soon as you need to do anything on one database, you've just lost your cluster policy.

I do agree with you it reflects poorly on GitLab for only having a primary and replica, with broken backups.

BUT, they are a startup, and they need to be laser focused on growing the company and securing funding for the next quarter so they can keep the lights on.

This kind of pants on fire growth in a startup often (always?) comes at the cost of redundancy and best practices. If you stop to make your platform bulletproof instead of the new features you promised customers/investors, you die.

I'm not saying this is an excuse for them to permanently shrug off making their platform more redundant and engaging in best practices. But, as with many startups, they're focused on delivering features as fast as possible to grow their user base.

I think, and hope, that their recent outage has been the experience they needed to prioritize their shift toward more redundancy and best practices.

"BUT, they are a startup, and they need to be laser focused on growing the company and securing funding for the next quarter so they can keep the lights on."

Having run and sold a few startups and turnkey operations, let me tell you - if you don't focus on your core product first and foremost and demonstrate the utmost competency in it, you're just pissing money away and are likely to fail.

> if you don't focus on your core product first and foremost and demonstrate the utmost competency in it, you're just pissing money away and are likely to fail.

I don't have direct experience running/selling a startup, but I have worked for several as an employee.

In my experience, the product is important, but not necessarily the deciding factor in whether the company succeeds. Most of the success is down to the management team (e.g. CEO, CTO, CFO) being able to sell the company (and product) to investors during funding rounds.

Doesn't matter how good the product is, if management can't pitch it to get funding you're dead.

Obviously it is important to have a good product, but just look at Uber for example. Losing billions of dollars per year, with no clear path to profitability, and yet they're valued at something like $60B. Great sales/marketing by their management to investors!

"Doesn't matter how good the product is, if management can't pitch it to get funding you're dead."

Good products literally fund themselves and don't need outside investors. Just like drugs sell themselves, a good product will sell itself without the need for anything like marketing or investors. The money will naturally come.

For sure it caused us to prioritize more redundancy, for more information please see https://about.gitlab.com/2017/02/10/postmortem-of-database-o...

As much as I like gitlab, that decision seems weird. I'd expect a business to come up with numbers for different scenarios, not just quotes (even if those weren't the only deciding factor).

What I mean is, instead of writing about 8TB disks vs 2TB disks, get hard numbers for those and compare the costs of the different versions to the costs in the cloud.

Instead of saying that engineers are expensive, look at market data to determine how much you have to pay to hire these people.

Instead of writing about the potential of full cabinets, get quotes from data center providers of what they can provide and how they would deal with this. I'm sure Equinix has solutions for that kind of problem.

And in addition to that, it seems that managed dedicated wasn't considered at all. Why hire someone to swap disks for you? Why not just rent server capacity and have the data center provider take care of hardware issues? They operate at bigger scale and can do this with little overhead cost. You're still much cheaper than the cloud but don't need to drive somewhere to swap a server (you just switch to a spare one until the data center has fixed the problem). With Gitlab's scale, it shouldn't be too hard to get a decent bespoke quote from companies like OVH.

"What are the facts? Again and again and again – what are the facts? Shun wishful thinking, ignore divine revelation, forget what “the stars foretell,” avoid opinion, care not what the neighbors think, never mind the unguessable “verdict of history” – what are the facts, and to how many decimal places? You pilot always into an unknown future; facts are your single clue. Get the facts!"

-- Robert A. Heinlein

They also miss out on the dogfooding aspect. They target enterprise customers who probably run gitlab on bare metal. By doing the same, you experience what your customers do and have much closer feedback loops on any issues that might especially affect bare metal.

Regarding managed dedicated: I had a very bad experience with rackspace in that regard, which is why we moved to AWS. At another job we had our own DC suite but had some company taking care of initial cabling as well as remote hands from the datacenter. If I ever do bare metal again, that's what I would do.

I agree that dogfooding is important. Our users and customers run GitLab both on bare-metal and in the cloud. GitLab runs better on bare-metal because it tends to have better IO performance. So by running it in the cloud we are optimizing where it is needed.

So everyone says the cloud is the future. I get it. But is this the truth, or what all the tech giants want people to believe? Don your foil hats for a moment and listen to me.

All the original internet companies run their own hardware. They rent out excess production capacity to us peons in the form of cloud services.

These companies that all run their own hardware exclusively are telling everyone that it's stupid to run your own hardware... Why are we listening?

Look to the newer tech giants as a preview of what's to come. Netflix, for example, is completely at the mercy of Amazon. They might as well be "Netflix brought to you by Amazon". Their edge is in software alone, nothing that causes a huge barrier to entry like custom hardware. This makes them much easier to dethrone. Hilariously, they rent all their hardware from a direct competitor who has access to all their software secrets. Does anyone else see something wrong with this?

The cloud as a money saving venture is and always has been a damn lie.

It's the same tactic as when automotive companies paid off local governments to destroy America's public transport many years ago. All the big tech companies have their hands in the cookie jar besides Facebook, who remains mostly silent but runs their own hardware as well.

The major tech companies have every incentive to make you think running your own servers is nigh impossible. Don't drink the fucking koolaid. If the industry continues to consolidate rapidly, AmaGooSoft will be the ONLY places you can have web servers in ten years.

The actual streaming video from Netflix does not come from Amazon's systems. From https://arstechnica.com/information-technology/2016/02/netfl...,

"Netflix operates its own content delivery network (CDN) called Open Connect. Netflix manages Open Connect from Amazon, but the storage boxes holding videos that stream to your house or mobile device are all located in data centers within Internet service providers' networks or at Internet exchange points, facilities where major network operators exchange traffic. Netflix distributes traffic directly to Comcast, Verizon, AT&T, and other big network operators at these exchange points."

Everything else - such as the Netflix UI, your account management, billing, etc - is on Amazon. But Netflix has an enormous amount of hardware distributed around the world. If you wanted to compete with Netflix, you would need to do a lot more than use Amazon's cloud services.

> If you wanted to compete with Netflix, you would need to do a lot more than use Amazon's cloud services.

Yes you'd need multi-million dollar contracts with all of the major studios. Netflix isn't as much a technical problem as it is a business problem.

HBO has a far worse network, UI, and technology in general but it's still competitive simply because it has Game of Thrones and other shows.

Scratch business problem, replace with "legal problem". Copyright law makes it virtually impossible to compete with Netflix. Netflix was only allowed to start because they wedged their way in as a rent-by-mail service.

After pioneering streaming and realizing that the studios were asking themselves why they don't just load up their own computers with movies and get paid to stream them out, Netflix was forced to start its own content division to keep subscribers onboard and prevent the studios from poaching all of their customers by failing to renew contracts for content.

We have a similar problem in online services. It's impossible to compete with Facebook because the law allows them to hold data hostage.

We badly need copyright reform. The current incarnation is impractical in the digital age.

I don't understand what this has to do with copyright laws. This has to do with content licensing and studios not wanting to license content...

The copyright laws are what give studios the rights to license that content. Right now, most things published commercially after 1923 are copyrighted and require permission from their rightsholder to distribute. That's practically the entire history of cinema locked up tight.

Copyright should be adjusted so that people can make money off of their creations without allowing the culture to be held hostage to corporations. If the copyright term was shortened to, say, 10 years, a Netflix-with-only-content-from-10-years-ago would be totally viable without having to ask for anyone's permission.

Copyright is a government-granted monopoly. It's gotten totally out of control and severely restricts competition in this type of space.

That implies the studios don't own the copyright which means they shouldn't be suing for piracy.

The movie industry has long spent its years fighting piracy instead of providing a solution to a problem. Now there are solutions to the problem and the industry is refusing to adapt and continues to hold the content for itself instead of licensing it to the likes of Netflix.

We know the studios have the right to license content because that's exactly what they do when they sell licenses overseas to broadcasters in other countries to air content.

This isn't about copyright. It's about the studios' lack of ability to license content.

This is why in 2017 piracy is still a major issue.

This is a very naive view of the issue. There are many hands in the production of a film, and many rights associated with it are assigned at the time it is made: screening rights, distribution rights, periods for when rights expire, etc. When new technology, such as streaming on computers, comes about, the licensing of new rights needs to be established all over again with all the rights holders. A single holder that can't be contacted or established (due to death, closure, acquisition, etc) can effectively put any new licensing into limbo.

This isn't even getting into exclusivity licensing that the studios seem to be real keen on doing with Netflix, Amazon, Crackle, VUDU and a jillion other streaming services. You know those shows that are branded "Netflix Original" on Netflix? Those aren't literally made by Netflix, they've just paid for some/all of the production and licensed them exclusively for some period of time. They're still produced by all the same studios that have been producing shows and movies for years.

It wouldn't matter to Netflix. The really old titles are cheap and Netflix always carries them. The problem Netflix had/has is access to the latest releases. When it comes to movies this is the bulk of what people want to watch. Since they couldn't guarantee access to new content they had to pay to produce their own new content.

I think it'd matter because right now, there is not really a market for only public domain video content, because such content is so sparse. If we made copyright expire after 10 years, I think there are a lot of people who'd be content with the "classics", and it'd allow innovation and competition to occur between "classics" providers, which is what will give customers the best experience.

Right now, competition is hamstrung because people are prevented from entering the market with anything that the consumer would find desirable, barring permission from the rightsholders.

Competition is not real competition if the same data cannot be accessed. For example, Chrome and Firefox compete because they both access the same internet, and whichever browser provides the better experience for that internet will win out. That kind of thing is not possible on walled gardens like Facebook because the law allows them to obliterate anyone who would create an alternate "Facebook browser", and it's not possible with things whose primary offerings fall under copyright protection, like Netflix.

We need to open these things up so that people can compete and provide the consumer with the best experience without needing anyone else's permission. It's true that most people will want access to new releases, but if copyright expired after 10 years, you'd at least have a secondary service that people would actually want to use, which would compete with Netflix et al for viewership on a lot of the content.

I'd like to see Rambo and The Terminator on Netflix.

I don't know that I'd say Netflix is completely at the mercy of Amazon. Sure, if AWS goes down, Netflix goes down. But that would be true enough if they ran all of their own stuff, too... "Netflix is completely at the mercy of their ops team".

One of the big points of the cloud is that it gives one the ability to do things that one would have trouble doing without the cloud. A small business with a single part-time developer can deploy an app to Heroku, and not need an ops person. That sure beats the heck out of the small business not being able to deploy software at all. As a former small biz owner, I did not want to have to manage a bare metal server in a colocation rack with a not-cheap data pipe. I just wanted a CRUD app with a minimal feature set.

When Heroku went down, I sat around wishing it would come back up, and eventually it did. That was way better than having to service a physical box when it failed.

Heroku cost me a few hundred a month, and freed me up to earn much more than that. In my eyes, it paid for itself many times over.

>But that would be true enough if they ran all of their own stuff, too... "Netflix is completely at the mercy of their ops team".

They employ their ops team and their ops team has equity stake in the company's success as well as their career riding on the line.

If a chunk of Amazon goes down and takes out your website, all you can do is wait and hope Amazon is prioritizing the fix.

>When Heroku went down, I sat around wishing it would come back up, and eventually it did.

This method of recovery just doesn't cut it in businesses with lots of transactions 24/7. Just sitting around can be flushing thousands per minute down the drain. In those deployments, you need to build across multiple cloud providers; and once you've reached that abstraction, you might as well have your own hardware as well and just burst to providers.

A company like Netflix can get by because they have their own hardware for streaming so an outage only affects people browsing for a movie at the time. And on top of that, their only interactive transaction is signing up for the service, which is extremely infrequent compared to the rest of the traffic. So essentially it's safe for Netflix to use AWS because they don't need high reliability/control of their own destiny.

>A small business with a single part-time developer can deploy an app to Heroku, and not need an ops person.

This use case I understand. However, anything at the scale that is frequently discussed in these contexts (gitlab, etc), they need ops people.

> Sure, if AWS goes down, Netflix goes down.

Well. AWS went down (part of it), but Netflix stayed alive.

The cloud isn't a money-saving tactic. It enables you to test demand for a service for a very low absolute cost. Spin up a tiny machine for $60/year? And you can pay for it by the month, or the hour? That's great. You can find out if anybody is going to buy your thing at all. You can probably get to ramen-profitable.

As soon as you can confidently predict sufficient demand, the economically rational decision is to hire a good ops team and run real hardware. But to do so, you need to either have money or get money - and the costs of cloud service are zooming up, potentially leaving you without the margin to invest.

The question is what that level of sufficient demand is.

> Hire a good ops team

I feel this is one of those "easier said than done" statements...

I'd argue that's only easier said than done if you insist on doing a roll-your-own open-source science project. If you default to established enterprise vendors, finding ops people isn't THAT difficult.

But then the cloud might be cheaper again.

It isn't, and it's not even close. Unless you're running at less than 40% efficiency the cloud is more expensive for infrastructure running 24/7. If you're a shop with 1 server and no IT guy? Sure, probably makes sense to use the cloud - although I'd argue you'd be better off with managed hosting so you actually have a number to call if something breaks.

An organization of even medium size? Not a chance.
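To make the break-even reasoning concrete, here is a small sketch. It assumes pay-per-use cloud cost scales linearly with utilization while owned hardware costs the same whether busy or idle; the dollar figures are hypothetical and chosen only so that the threshold lands at the 40% mentioned above.

```shell
# Hypothetical monthly figures in USD, for illustration only.
OWNED_MONTHLY=4000        # fixed cost of owned hardware, busy or idle
CLOUD_FULL_MONTHLY=10000  # cloud bill if the same capacity ran 24/7

# Utilization (in percent) at which both options cost the same.
breakeven_utilization_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Print the cheaper option for a given utilization percentage.
# Arguments: utilization_pct owned_monthly cloud_full_monthly
cheaper_option() {
  cloud_cost=$(( $1 * $3 / 100 ))
  if [ "$cloud_cost" -lt "$2" ]; then
    echo cloud
  else
    echo owned
  fi
}

echo "Break-even at $(breakeven_utilization_pct "$OWNED_MONTHLY" "$CLOUD_FULL_MONTHLY")% utilization"
echo "At 90% utilization: $(cheaper_option 90 "$OWNED_MONTHLY" "$CLOUD_FULL_MONTHLY")"
```

With these made-up inputs the cloud only wins below 40% utilization. Real numbers would also have to account for staffing and for the engineering effort needed to scale down in the first place.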

> Unless you're running at less than 40% efficiency the cloud is more expensive for infrastructure running 24/7.

Sure, paying extra for a dynamically scalable service is inefficient if your resource needs are flat 24/7/365, but is that realistically the case?

(OTOH, one must not forget that there is a cost to engineer systems to take advantage of dynamic scaling to minimize costs while meeting requirements.)

>Sure, paying extra for a dynamically scalable service is inefficient if your resource needs are flat 24/7/365, but is that realistically the case?

Yes, the vast majority of apps are not 12 factor architectures designed for scaling so nodes can't be brought online and killed at the flip of a switch.

In most enterprises yes. They've got a consistent flat workload 90% of the time. I'm willing to bet gitlab is the exact same way other than a tiny portion of the environment. And, as you mentioned, writing an app from scratch to take advantage of instant scaling is extremely difficult, and sometimes impossible depending on what statefulness is required.

Established enterprise vendors are increasingly promoting cloud offerings, and their pricing reflects that.

It can also be very expensive. If another IT devops type costs 100k per year, how much will that buy in cloud services and time saved?

150 TB/month of data transfer on AWS will cost you 15k/month; the same thing on a dedicated provider will be 500/month. And so on.
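Taking those two figures at face value (they are the commenter's numbers, not verified prices), the gap works out as follows. The function names are just for this sketch.

```shell
# Figures quoted in the comment above, USD per month for 150 TB of transfer.
AWS_MONTHLY=15000
DEDICATED_MONTHLY=500

# Yearly difference between two monthly prices.
annual_difference() {
  echo $(( ($1 - $2) * 12 ))
}

# How many times more expensive the first price is (integer ratio).
price_ratio() {
  echo $(( $1 / $2 ))
}

echo "Annual difference: \$$(annual_difference "$AWS_MONTHLY" "$DEDICATED_MONTHLY")"
echo "Price ratio: $(price_ratio "$AWS_MONTHLY" "$DEDICATED_MONTHLY")x"
```

That is a difference of $174,000 per year, or a factor of 30, on bandwidth alone.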

That's about the fully loaded cost of one engineer, and you may easily need multiple additional people to run your own hardware instead of using AWS.

That's fairly modest traffic, and it does not include the cost difference for the compute and persistence layers. In the case of GitLab, their dev and ops office is in Ukraine, so that difference in traffic cost alone would pay for 3 good ops people.

> The cloud isn't a money-saving tactic.

Most tactics in business are money-saving tactics, and cloud is clearly one: dynamic scaling means only spending now for what you need now rather than spending now based on anticipated future need, and someone-else's-server means economies of scale in ops.

I'd also argue that "sufficient demand" is a threshold that's been rising very fast over the last couple of years and likely to continue doing so.

This also explains why older tech companies disproportionately use their own hardware, even those that aren't all that big. They all made that decision at a time when it made sense and the switching cost is sufficiently high to prevent them from turning back now.

I doubt there are many rational cases now for newer companies to use anything but the cloud. Even at the scale of Snapchat (>3B$ infrastructure spend over next 5 years) and Netflix, it appears the public cloud is still the rational choice.

You should remember that the early cloud unicorns are getting super low rates for their name clout as well. Their discounts are passed as extra costs to everyone else.

So while it may be worthwhile if you're huge enough to get a big discount, that doesn't mean it applies to small or medium size companies.

> The cloud isn't a money-saving tactic.

If it requires no or a smaller ops team then it seems like it might be.

Generally you're hitting the cheap part of your ops team. The rack and stack, and dc technicians. You start seeing the real savings when you move up to using higher services like managed databases, and analytics engines.

I think that's right. We tend to see fragmentation w/ our hybrid approach. Teams doing DevOps type stuff with VMware/vRA on internal hardware and also on AWS/Azure. "Customers" of each environment also tend to be different so priority conflicts arise.

Could be solved by more staffing, I suppose...

Anecdotally, I used to be on a two-man team and switching to Azure made life insanely better.

> If it requires no or a smaller ops team then it seems like it might be.

If you needed a team to manage it before, you still need Ops with "the cloud"

> and the costs of cloud service are zooming up, potentially leaving you without the margin to invest.

What do you mean by this, because it sounds like a straight up lie?

Amazon and Google are in an ongoing price war and cut costs regularly.

>What do you mean by this, because it sounds like a straight up lie?

There is no economy of scale with the cloud. As you scale up your service, Amazon/Google happily scale your bill right up with it. When you roll your own datacenter, things get much cheaper per unit (bandwidth, storage, etc) as you get larger.

Possibly outgoing bandwidth. These costs are directly proportional to traffic and regularly 20-30x the unmetered rate for a colocated box.

Yes, cloud bandwidth is expensive. I don't disagree with that. But it's not increasing in price.

Obviously if you're growing your costs will grow, but that's basically a tautology. Bare metal costs also rise as you grow.

Will you stop calling me a liar if I correct that from "the costs" to "your costs"? I am assuming that you are growing, and I think there's a provable assumption that for high levels of service, hiring someone else to do it is more expensive than in-house.

That's an entirely different statement than your original one. Saying "the costs of cloud service are zooming up" disingenuously makes it sound like the price of cloud services is going up.

Thanks for the charitable interpretation of my words.

> hiring someone else to do it is more expensive than in-house.

Except you're gonna need between 10 and 100 people to recreate the service in-house, not just one.

Trying to re-do something from scratch in-house is always more expensive than just buying the product.

I get the sentiment, but it really depends: For some services, it's very true. For other services, careful problem analysis will show that you only need one or two features from the full package, which then can be replaced by a very small shell script.

Can you give an example where a shell script would solve the problem? Curious to know

Last month, I needed to add a periodic job to a simple Rails app to snapshot some data and send it to another system. I discussed the design with a colleague, a Rails expert, who went into the pros and cons of various worker-queue configurations. In the end, I added a hidden API call, and deployed a shell script along the lines of

  # Hit the hidden endpoint once an hour.
  while true; do
    curl https://rails.app/do/the/thing
    sleep 3600
  done

The actual version is slightly larger because of error checking etc., but that's the gist. So that's one example of replacing (or rather: avoiding) a distributed system (an asynchronous job queue) with a small shell script.

I don't mean to say that shell scripts always can replace more complex systems, but sometimes they're enough.
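For anyone curious what the "error checking etc." part might look like, here is a rough sketch of the same hourly loop in Python. This is not the script I actually deployed; `trigger` and `run_forever` are names made up for illustration, and the `max_runs`, `call` and `sleep` parameters exist only so the loop can be exercised without waiting an hour.

```python
import time
import urllib.request


def trigger(url, timeout=30):
    """Call the endpoint once; return True on an HTTP 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError as exc:  # URLError, HTTPError and socket timeouts are all OSError
        print(f"trigger failed: {exc}")
        return False


def run_forever(url, interval=3600, max_runs=None, call=trigger, sleep=time.sleep):
    """Hit `url` every `interval` seconds; a failure is logged and retried next cycle.

    Returns the number of failed calls (only reachable when max_runs is set).
    """
    runs = 0
    failures = 0
    while max_runs is None or runs < max_runs:
        if not call(url):
            failures += 1
        runs += 1
        sleep(interval)
    return failures
```

In production you would call `run_forever("https://rails.app/do/the/thing")` under whatever process supervisor you already have. It is still one file and no queue.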

> [Netflix's] edge is in software alone, nothing that causes a huge barrier to entry like custom hardware.

Actually, they are a media company. Their "edge" is in content creation and licencing deals. They might have been first-to-market for mass-scale streaming, but streaming video technology is a commodity, and they know that they will live or die by the desirability of their media catalog.

> Netflix for example, is completely at the mercy of Amazon.

Amazon and Netflix were symbiotic for years, as Netflix was a headline corporate customer, and I'm sure they received preferential pricing. And given the quality of Netflix's engineering team, I would be very surprised if they don't have a plan to migrate by the expiration of their contract if necessary.

> "Actually, [Netflix is] a media company. Their "edge" is in content creation and licencing deals..."

That's only partly true. If you consider that a lot of other "media companies" haven't been able to compete with Netflix's ease of use, then you have to ask how Netflix is different from any other media company. The answer is that they are also a software company, whose success is primarily attributed to making DRM video in the browser easy and seamless for average folk.

This answer is in the same category as "Actually, Google is an ads company". That is to say, it makes a plausible case but is ultimately wrong.

The original success of Netflix depended on being the first company with a fixed-price-no-ads streaming business model in addition to really good user experience across a wide variety of platforms. I remember being astonished at watching a Netflix movie on my iPhone (when I was still an iPhone user) while riding a train in 2011-ish. It was miles ahead of any other competitor in providing solid, usable clients on multiple platforms combined with reliable streaming in most cases and networks. This was as much a triumph of technology and product design as it was of the content available. If it was only about the content, it would be Hulu (at least in the US) instead of Netflix who would be in pole position today.

> The original success of Netflix depended on being the first company with a fixed-price-no-ads streaming business model

Actually, the original success involved putting DVDs in mailboxes.

> I remember being astonished at watching a Netflix movie on my iPhone (when I was still an iPhone user) while riding a train in 2011-ish.

Right, but you're going to be equally astonished in 2017 if you don't get the same level of service from Prime Video, Hulu, HBO GO, YouTube Red, et al. And in 2019 a YC company will offer a one-click media publishing service that will allow any random joe on the planet to publish a DRM-protected movie available through a world-wide CDN with a pay-per-view payment system.

In 2011 tech was a major differentiator, today not so much. Netflix is moving with the times, that's why they are offering original content instead of being content to provide a good platform to stream other people's IP.

> This answer is in the same category as "Actually, Google is an ads company". That is to say, it makes a plausible case but is ultimately wrong.

https://qz.com/607378/were-live-charting-googles-first-alpha... (tl;dr: the overwhelming majority of google's revenue comes from ads)

> equally astonished in 2017

> Prime Video, Hulu, HBO GO, YouTube Red,

So Netflix has a six-year lead and attendant network effects in its favor. If you don't believe me, I created a Facebook clone in PHP last weekend which works just as well as (or better than) the original Facebook. Do you want to join it?

> So Netflix has a six-year lead and attendant network effects in its favor.

I completely agree with you. I personally have a Netflix account, and do not subscribe to other services.

My point is that right now, today, this very instant in time, Netflix is not primarily differentiated on technology, but on having content that their subscribers want; that online video distribution is now a commodity.

> but on having content that their subscribers want

This is where we disagree. I don't even know what content Amazon or HBO or Hulu has at this point. Netflix's product and user experience over the past 10 years or so have already made me such a loyal customer that it'll take an earth-shaking difference in content for me to switch to (or add) a second monthly bill for content. Netflix is part of the social fabric (at least in the US). All of this was won by relentless focus on an excellent product and user experience... and perhaps only a middling content catalog.

I don't think it's in doubt that it makes sense to own everything from the land to the silicon when you're the size of one of these "original internet companies."

The question is, does it make sense for businesses with anything from a handful of servers running their office environment to a few thousand servers running production for a moderately popular web service to pay consultants to drive out and tend the servers in their office closets/colo racks, rather than piggy-back on someone else's (mostly) well-oiled machine with economies of scale?

If you can build an in-house equivalent of the AWS team, do it. If you're going to pay someone to do it for you, AWS might be a better deal than your neighborhood business IT firm.

Running on rented hardware is the equivalent of a traditional product company renting all of their factories. Almost any company of reasonable size will want vertical integration of their supply chain and web companies are no different.

Using Netflix as an example again, if they're pitted against a company that runs their own hardware but is otherwise equal, they will lose. A portion of their profit is siphoned off as Amazon's profit. What's already happened is Netflix is directly funding development of Amazon Prime Video.

> Running on rented hardware is the equivalent of a traditional product company renting all of their factories.

Or a traditional business renting a storefront? Something that happens all the time. Even huge companies like Apple use OEMs. Samsung manufactured Apple chips for years despite putting up a direct competitor to Apple in the Galaxy S.

Systems like this allow people who do not have capital to scale up. Just like developers build malls and lease space to retailers or OEMs build factories and lease them to companies to make equipment.

>Almost any company of reasonable size will want vertical integration of their supply chain

Meanwhile, the most successful hardware vendors are all farming production out to third party manufacturers...

I would think running on 'owned hardware' in rented colo space is very close to 'rented hardware' in the cloud.

The only clear differentiator then is actually owning a datacenter. And the Total Cost of Ownership of that isn't easy to work out. For someone who doesn't want to be in the datacenter-ownership business, opting in to cloud would make sense. Privacy and government intervention are valid concerns though.

> Running on rented hardware is the equivalent of a traditional product company renting all of their factories.


Traditional product companies very rarely own all their own factories.

Do you honestly think Netflix management is incompetent? They've run the numbers on this countless times. They have way more insight into their spend and operations than a random anonymous internet commentator.

Netflix is a classic case of very bursty demand, e.g. when they regularly re-encode all their video, or even just see huge audience swings when people get home from work and turn the TV on.

Most companies at even 1k servers will get enormous cost savings from O&O.

A former employer, then in the 600-ish box range, costed out moving to ec2. Our ops actually brought up our internal hadoop cluster on ec2 as well as ran parallel workloads for a week. Amazon at the time offered price deals to attempt to get them as a marquee customer, as well as a bunch of engineering time. Even so, ec2 cost 3-4x as much for 1/3 the throughput. ec2's prices have since fallen, but so has the price of bare hardware.

People also overestimate the cost of hands-on; this company contracted for 4 hours / week with a local person. If you get serious about puppet/chef you can successfully run lots of servers in a remote datacenter.

Netflix runs their most expensive part (pushing those terabits of traffic) with their own hardware and network. https://openconnect.netflix.com/en/

so, you were saying?

Or do you think Facebook, Google, AOL, TripAdvisor and pretty much anyone besides HotNewStartup are incompetent?

More than likely so that the video feed is as close to you as possible, rather than only because of cost efficiency.

The video is the most important part of their product, so keeping it at your ISP or a local exchange not only lowers latency, but most likely also gives better throughput since it only has to travel through your ISP's network or local exchange links without having to go through the rest of the internet backbone.

It's not "more than likely", it is the real reason. CDNs could not keep up with their volume AS WELL as their pricing structure.

Latency (30ms vs 300ms) doesn't matter much in video playback, only throughput. Once you hit play, it isn't bi-directional.

But you're just nitpicking. This IS their core business, and they realized it's "too core" to give to someone else.

So yes, you do "outgrow the cloud", and well, part of "outgrowing" can be cost structure. GitLab offers a free service; their main competitor, GitHub, runs their own datacenter. Why do you think that is?

Right, but that alone shows that the reason they didn't go for the cloud for those has nothing to do with the capabilities of it, just that the locations are not close enough to their userbase to make sense.

Put another way, if there was a cloud DC in each major city and Netflix hadn't already started using their own edge nodes, then there's nothing stopping them from spinning them up in each cloud DC as opposed to their own hardware. There's nothing inherently special about their setup except location.

Nothing about this situation says that Netflix "realized it's "too core" to give to someone else." I don't quite understand how you're making that leap. Instead, they had a very specific requirement (that most companies don't have, mind you) that current cloud hosts can't provide. Nothing else.

Even Netflix uses AWS for the autoscaling enabled workloads.

Besides, here is a list of companies using AWS:


You think Google and Microsoft are incompetent when they offer cloud services to their customers? There are a lot of companies that do not want to own any datacenter-related infrastructure because it is irrelevant to their business and there is no in-house expertise, while there are other companies that can afford to hire staff for running large DC operations. I thought this was pretty obvious to everybody.

I know Netflix runs their own edge nodes, which is precisely why OP sounds incompetent and uninformed. Netflix has put lots of time and effort into analyzing and building the architecture which is optimized for them.

> Or do you think Facebook, Google, AOL, TripAdvisor and pretty much anyone besides HotNewStartup are incompetent?

Far from it, and that's precisely my point. At Facebook scale, it certainly makes sense to run your own data centers. At HotNewStartup, it obviously doesn't. Competence is finding the solution which works best for your scale and organization, not following some generic rule.

I'd wager Netflix and Snapchat are at Facebook scale. Certainly $500M/year in cloud commitments for the next 4 years seems to indicate so.

I'm not sure, but I am sure that someone internally is doing those calculations.

Does Apple own any factories?

Actually they own a lot of very expensive equipment that their subcontractors operate for them. E.g. all those CNC milling machines for their phones/laptops/watches. These capital expenditures were a bit steep for their subcontractors, so this creates a kind of symbiotic relationship.

If they used their own hardware a good portion of their profit would go to maintain that very same hardware. And because they would build that infrastructure from scratch it would cost even more to set up.

The article has some really good points on why being in the cloud is a good idea. GitLab's core competencies are in software, not hardware, similar to many startups. Sure, they could hire up an infrastructure team, but that's at an opportunity cost of their software team.

The cloud offers convenience that all the "original internet companies" wish they had when they were starting out. There may be value in switching when you hit a certain scale; companies like Netflix will have already done cost-benefit analysis to that degree. But it's a massive undertaking that introduces substantial risk for long periods of time. Many companies are just not in a position to do that.

I think their recent cock up is a great example of why they need an ops team regardless of whether or not they're in the cloud.

> Sure, they could hire up an infrastructure team

Their original proposal on bare metal was for 64 servers. They need an ops team regardless of whether they control the hardware or not.

I mean, sure, you might not save money being in the "cloud" but running hardware is a different skillset than building software. If you can pay someone else to run your hardware, someone else who wants to be really good at running hardware, you might be able to focus more on your software, and get really good at making that?

There is nothing hard about running hardware. The current VM trend gives you bare metal access. The only difference is you have to plug in your machines, and you get to look at them every once in a while.

Really, what software dev hasn't put together a desktop PC and plugged in some Ethernet cables? That's all you need to colocate.

Your response echoes what cloud providers want everyone to think. Hardware is too hard for us, let's pay someone to do it and make 50% profit margins on us.

As a professional that came from software engineering let me tell you... that plugging in one machine is easy.

Plugging in 150 machines is significantly more challenging. Providing disaster response, backups, stable power, and VLANs to secure services quickly ratchets up the cost. And nobody wants to pay one engineer per machine to babysit one machine, x 150. That wouldn't even be a good idea, because engineers get bored and start mucking with stuff that needs to be left alone.

Data center administration is a first class professional grouping. Software engineers who don't know that are a real hazard to any business.

First, most of the time it's not a question of paying someone to run an entire data center, but rather some racks in someone else's data center. The data center has data center staff who handle a lot of this. It's unfair to pretend that going bare metal entails starting with an empty plot of land and building/managing a whole data center. While that may be the case for the largest companies, even moderately-sized companies won't do that.

When you're a business, you have to pay people to handle hardware no matter what. The question is not whether you need to pay someone to administer the hardware -- the question is who you are going to pay. It can be an employee, contractor, or third-party service company like AWS or remote hands at a colo.

I've seen multiple companies move to cloud merely because they were annoyed at having to go down to the rack to deal with hardware, often making oblique justifications about difficulty when they really meant that they don't like driving down to the colocation center.

Those companies have spent millions more paying AWS than they would've if they just hired a couple of hardware jockeys.

The argument basically becomes a question of whether you know how to hire someone who knows how to deal with hardware. If you don't, it's better to go with Amazon even though they're going to rake you over the coals cost-wise. If you do, then practically speaking, the differences should be small. Your hardware people will automatically fix the hardware issues just like Amazon's hardware people.

The bulk of the work of running servers is at the OS admin level, which is still the customer's responsibility with cloud servers. These arguments about difficulty of administration would work for something like Heroku, but they don't work for Amazon.

I was about to make a comment similar to this, but this is articulated so much better than what I was able to get out.

I'm in agreement. After the hardware deploy, you're pretty much left with what you would have in a cloud environment: a bunch of VMs (or in this case OS instances) that need care and feeding, and only rarely, much more rarely than many would think, do you have to replace a disk, or recable because requirements changed.

> As a professional that came from software engineering let me tell you... that plugging in one machine is easy.

Till you're the guy in your 20-person startup who has to do it and crawl around the racks with 100kg of equipment to assemble, cable and power up everything.

I'm sorry, but your comments in this thread are incredibly out of touch.

It's absolutely hard to run your own hardware. There's a reason that companies which run their own hardware have entire teams devoted to it.

You seem to be thinking like an engineer, not a business person. For most businesses, cloud providers offer a substantial cost savings.

I know it's a controversial opinion. And probably not even true in a lot of cases... but I'm glad it sounds like I'm out in left field.

The echo chamber needs a devil's advocate and sometimes the resulting discussion gets interesting enough that I question my original beliefs.

In this case I believe owning my hardware is cheaper for me because I already do. I might be a rarity though, or just old fashioned and wrong.

Your opinion isn't really that original. There are plenty of old-school sysadmins who will insist that the cloud is a hoax. They'll fight tooth and nail to avoid it. Eventually, someone fires them and moves to the cloud with significant savings.

There are some scenarios where it makes sense to run bare metal, but they're few and far between. If you've evaluated the (capital, operational, and human) costs of cloud vs. bare metal and bare metal came out on top, that's fine. But don't assume that other mature businesses haven't also done so.

For companies with AWS bills less than $10k/m, it'd be ludicrous to even consider bare metal.

I believe the truth is somewhere in the middle. The hype will eventually die down and we'll see some kind of equilibrium.

I've been experimenting with "cloud over" where I only use cloud servers for spare capacity. With docker it's getting easier. My baseline CoLo has the same performance as a super high end AWS instance and costs a third as much after my fixed investment.

I still have the cloud if my box goes down or takes a big traffic hit, but I'm not paying for a bunch of cloud servers all the time either.
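If it helps to picture the "cloud over" idea, the sizing decision boils down to something like the sketch below. This is only an illustration: the function name and every number in the usage note are hypothetical, and the real thing would obviously sit behind whatever monitoring and orchestration you use.

```python
import math


def cloud_instances_needed(demand_rps, baseline_rps, instance_rps):
    """Return how many cloud instances to rent for load the colo box cannot absorb.

    demand_rps   -- current request rate
    baseline_rps -- what the owned colo hardware can serve (0 if it is down)
    instance_rps -- what one rented cloud instance can serve
    """
    overflow = max(0.0, demand_rps - baseline_rps)
    return math.ceil(overflow / instance_rps)
```

With a made-up baseline of 1000 req/s and cloud instances good for 250 req/s each, normal traffic rents nothing, a spike to 1600 req/s rents three instances, and a dead colo box (baseline 0) at 1500 req/s rents six. You only pay for the cloud while the overflow exists.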

I've seen businesses move to the cloud and cost themselves millions of dollars in doing so. I'm not sure where you're coming from believing that cloud always offers a massive savings. It's possible that in some circumstances, cloud will be cheaper, but I actually think that more often than not, it's more expensive.

It's ludicrous that you suggest that anyone who is not willing to dive head-first into 100% cloud everything should be fired. The cloud has a place, but there's no reason that cloud should be the only option, as you appear to suggest.

I have not once said that cloud is the only option. You're arguing with a straw man.

If you're spending millions on infrastructure, you should absolutely do a rigorous analysis of whether the cloud is better or worse.

Yeah, it's not a direct quote. However, that is how I interpret the sentiment that those who fight the cloud will be fired. You suggest that's an unfair interpretation? You can say "cloud may not be the only option" all you want, but "those who oppose cloud will be fired" ensures that somehow, everyone in your organization will believe the cloud to be a supremely good option. :)

Because I do fully believe that the cloud is the superior solution for most businesses, especially those employing sysadmins.

I've seen way too many companies with full-time sysadmins maintaining only a few dozen servers. The significant cost savings come from eliminating those positions.

Who administers the servers after you eliminate those positions? Sysadmins mostly do operating system administration, not hardware maintenance. AWS boots you into an empty operating system that still needs to be configured, secured, backed up, etc. You still need sysadmins! In fact, cloud tends to require more of them because people spawn a lot of instances.

If you have a bunch of idle sysadmins because you thought you'd have to replace a system's drive every week and hired 4 extra hardware-only sysadmins, you can probably fire them without moving to the cloud and save even more money, right?

If you can treat your instances as ephemeral, it's a lot easier to manage them.

>Eventually, someone fires them and moves to the cloud with significant savings.

Significant savings is not something you get from the cloud unless you were massively over-provisioned with your hardware. In nearly every case I've seen, moving an equivalent workload to the cloud has resulted in more costs for the same size workload. But the company still made the decision to do it because it was less stuff to worry about and it made expanding that much quicker.

You pay a premium for the flexibility of the cloud.

> There are plenty of old-school sysadmins who will insist that the cloud is a hoax. They'll fight tooth and nail to avoid it. Eventually, someone fires them and moves to the cloud with significant savings.

Those "old-school sysadmins..." get hired by large cloud companies. Amazon has plenty. The cloud is just somebody else's computers.

Sure, and that's precisely the operational efficiency which comes from cloud providers.

It really really depends how you structure your application. Depending on your architecture having a few huge boxes is very much worth it, and that scales a lot better with bare metal.

You're comparing a home desktop (or a single data center) to a battle tested globally accessible infrastructure. With experts running it.

Yes that's a loaded statement but so is comparing "the cloud" to "a desktop".

Google ran on desktop computers for a long time. If anything, the cloud is less reliable than individual machines once were. You have no control over when or why one of your VMs goes down, and there's nothing you can do but wait and hope everything comes back in short order.

I've had degraded instances screw my customers' websites many, many times in AWS. On our VMware cluster I never have any such issues because I make sure not to touch anything running production sites.

> You have no control over when or why one of your VMs goes down, and there's nothing you can do but wait and hope everything comes back in short order.

Except that most cloud providers offer replication and load balancing, integrated directly into their other products already. They spent a lot of time making sure that it works so you don't have to.

Yes, shit happens. But shit happens everywhere, and it is far more likely to happen when you run your own bare metal if you don't pump sufficient time and resources into it, than at the company that has hired thousands of engineers and ops people specifically to provide these services.

This belief is the problem. Most cloud services are not automatically redundant.

They have stuff that will allow you to make your stuff redundant if you know how to architect, configure, and deploy it, but you don't get these benefits just by becoming an AWS customer. You have to know how to set up an ELB just the same as you have to know how to set up haproxy on bare metal.

Is it easier to configure an ELB than a haproxy instance? Probably, but it's not that much harder, and configuration is mostly a one-time cost. Pay someone on staff to spend a couple of hours testing and configuring haproxy or pay 3x more than you should every month to rent VMs on Amazon? Which one is cheaper over "the long term"?

All you're doing is renting some space in a server. The configuration is still mostly up to the user. Hardware requires maintenance, yes, but people are really hamming up both the frequency and difficulty of hardware maintenance.

>Except that most cloud providers offer replication and load balancing, integrated directly into their other products already. They spent a lot of time making sure that it works so you don't have to.

Have you ever used EC2 with traditional apps hoisted onto the cloud? Just because you can click a button for ELB doesn't mean anything when the system wasn't horizontally scalable to begin with. EC2 is just VMs, it doesn't provide any high availability at the VM level.

With local solutions (e.g. vmware/kvm/openstack), you can live migrate these annoying (but critical) pets around as you phase out hardware with very limited traffic interruption. With EC2, the thing eventually just shits the bed and you get a notice saying the instance will be retired even though it's already in a worthless state.

Google that ran on desktop computers isn't the same Google of today. Times change. Technology advances.

Glad to hear the vmware cluster is working out for you. Anecdotal but I've seen people having issues without touching vmware too. One personal experience doesn't diminish the cloud. Maybe different business needs? One sysadmin maintaining customer websites vs multi-person team managing a web app?

I'm not against running your own servers. Bare metal is fun, interesting, and possible. I'm just against blanket statements that "it's bare metal or nothing!".

This is incredibly misguided.

> The cloud as a money saving venture is and always has been a damn lie.

I personally have conducted migration savings assessments for companies going from data centers (owned/leased hardware) to cloud services. I can tell you with 100% certainty that the savings are there and have seen financial proof of the savings.

It's either that, or I'm a liar.

Okay, help me out here. Here's my math: I move about 10 TB of bandwidth a month, I currently am in a DC that offers me 30TB on gigabit for $90/mo for my 1U box that hosts my small business and a few development systems - my server cost me a total of $1500 with 72GB of RAM and 4x3TB drives in RAID10, so I have 6TB of disk that's about at 50% capacity.

To go to Amazon EC2 let's see... 64 GB of RAM for $0.862/hr should do the job there, but just for that I'm at $620/mo. Then let's add bandwidth, 10TB out at $0.01/GB = $100/mo. Now for disk I have to use EBS which runs (let's say st1 will do the job) $0.045/GB/mo × 3TB = $135/mo. And that's before any IO.

So to move from bare metal to amazon I go from $1.5k upfront + $90/mo to a total of $855/mo on Amazon.

I can buy a new server every few months for the same cost of hosting that at Amazon - unless I'm severely missing something. Maybe it's worth it if you're a very small installation or one with very unpredictable traffic? But I just don't see it.
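For anyone who wants to check or tweak the arithmetic, here it is as a small script. The prices are just the ones quoted above, not current AWS list prices; I'm assuming the $0.862 is an hourly rate, a 30-day month, and 1 TB = 1000 GB.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month

# Figures as quoted in the comment above
ec2_hourly = 0.862       # $/hr for the 64 GB instance
egress_gb = 10_000       # 10 TB out per month
egress_rate = 0.01       # $/GB as quoted
ebs_gb = 3_000           # 3 TB of st1
ebs_rate = 0.045         # $/GB/month

colo_monthly = 90.0      # $/mo for the 1U colo slot
server_upfront = 1500.0  # one-off hardware cost


def aws_monthly():
    """Sum the compute, bandwidth and storage lines of the AWS estimate."""
    compute = ec2_hourly * HOURS_PER_MONTH
    bandwidth = egress_gb * egress_rate
    storage = ebs_gb * ebs_rate
    return compute + bandwidth + storage


def months_to_pay_off_server():
    """Months of AWS-minus-colo spend that equal the price of one server."""
    return server_upfront / (aws_monthly() - colo_monthly)


print(f"AWS: ${aws_monthly():.2f}/mo vs colo: ${colo_monthly:.2f}/mo")
print(f"Extra spend buys a new server every {months_to_pay_off_server():.1f} months")
```

On these inputs the AWS total lands at roughly $855/mo, and the difference against the $90/mo colo bill covers the $1500 server in about two months.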

3 things I think you should consider in this situation.

1. Reserved instances. If you've got a running business and are expecting to be around for 1-3 years, you get a 40-60% discount on that 4xlarge.

2. Do you want to move the service exactly as it is? Maybe you don't need the large EBS volume? Maybe you can rewrite the storage to S3 instead, which is much cheaper? Do you have heavy, sporadic tasks that you can move out of your main system and into lambda+queue and use a smaller instance?

3. How much do you spend on people to monitor the hardware, source and replace the disks, do firmware updates, etc. ? How much on external services to monitor your box which could be replaced with integrated AWS solutions at free tier?

Basically what I'm trying to say is: if you just lift your current system and move it to AWS, you're ignoring lots of opportunities. You need to consider much more than a 1:1 hardware requirements migration.
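To put a rough number on point 1 alone, here is a sketch that applies the 40-60% reserved discount to the compute line of the parent's estimate and leaves bandwidth and EBS untouched. It reuses the parent's quoted prices and assumes a 30-day month; points 2 and 3 are not modelled because they depend entirely on the workload.

```python
def monthly_with_reservation(compute, other, discount):
    """Apply a reserved-instance discount to the compute line only."""
    return compute * (1.0 - discount) + other


compute = 0.862 * 24 * 30  # on-demand compute from the parent comment, 30-day month assumed
other = 100.0 + 135.0      # bandwidth + EBS as quoted, not discounted

low = monthly_with_reservation(compute, other, 0.60)   # best case: 60% off compute
high = monthly_with_reservation(compute, other, 0.40)  # worst case: 40% off compute

print(f"With a reserved instance: ${low:.2f} to ${high:.2f} per month")
```

That brings the bill down to somewhere around $480-$610/mo on those inputs, which is why I say the reservation is only one of the things to look at and not the whole answer.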

Why should he re-architect his system and introduce some Amazon-specific dependencies like S3 just so he can give Amazon money? How much development time and money will he spend trying to make his setup work on AWS? How many new bugs will he accidentally introduce in the process? How badly will moving to cloud storage via S3 affect his performance v. having the files on a local disk? When S3 outages like the one that happened two days ago occur, will his customers be understanding of his decision to move into the cloud for funsies?

It's really amazing that, when faced with clear evidence of the expense involved in moving to AWS, you suggest that he prepay for a year of service up-front and redesign his application just so Amazon's bill doesn't look so egregious anymore.

My experience aligns with his. We moved from racks with 20-something boxes to EC2 with 100+ instances. Our monthly bill was 80% of the cost of the hardware in our data center.

How did we solve this problem? Why, move to Docker and Kubernetes of course! Over a year of manpower has been devoted to that task. What kind of savage would ever return to bare metal in this enlightened age of expending millions of manhours redoing stuff that was already working perfectly well?

If you want to autoscale, autoscale on the cloud and keep your primary nodes on bare metal. There's no need to start forking over millions of extra dollars to cloud providers to host all of your infrastructure.

Cloud has some unique benefits, but it should be used for those unique benefits only. There's no reason everything has to be moved there.

I'm trying to compare costs in a better way than "lift app from here and move it there". Sure, you can do it that way, but you're moving an app which was written with your current architecture in mind. It's not surprising it will be more expensive after the move.

I'm not advocating to rewrite just to give Amazon money - you're creating straw man arguments here.

I'm just saying that if we're comparing different ways of hosting, we shouldn't pretend they work the same and have the same tradeoffs. Maybe one server approach is optimal. Maybe the cost would be smaller if the service would be split into various components. That's all part of a proper comparison. Including the cost of getting from one arch to the other.

Specific example:

> How badly will moving to cloud storage via S3 affect his performance v. having the files on a local disk?

Depends what they do with the data and how the users interact with it. Maybe there's no data that could be migrated (all records need to be available in memory for the web app), maybe it would be slower but have a trade-off of using smaller instance, maybe it would improve the performance considerably because file-like blobs are not stored in the database anymore and users can get them quicker via local cloudfront caches. There's no generic answer!

As for the outage... S3 was down for a few hours. And for many people it happened when they were sleeping. If your single server goes down in a data centre - what's your cost and time of recovery? S3 going down once a year still gives many companies better availability than they could ever achieve on their own.

If the need to reassess the architecture doesn't convince you this way, think about migrating from AWS to own hardware. If you were using S3, sqs, lambda and other services, are you going to plan for standing up highly available replacements of them on separate physical hosts? (Omg, we need so much hardware!) Or will you consider if it can be all replaced with just redis and cron if you have relatively little data?

>Why should he re-architect his system and introduce some Amazon-specific dependencies like S3 just so he can give Amazon money?

You could make the same argument going the other way though - why should GitLab spend money recruiting/hiring people, leasing space, setting up monitoring, etc, if their solution today works? Why should anyone re-architect their system to give $COMPANY money? When your bare metal's RAID controller craps out and you have to order another, will his customers be understanding of his decision to move to bare metal?

It doesn't make sense to factor in fixed costs of such a migration.

>You could make the same argument going the other way though - why should GitLab spend money recruiting/hiring people, leasing space, setting up monitoring, etc, if their solution today works? Why should anyone re-architect their system to give $COMPANY money?

Well, that reason would be, at a minimum, a halving of their hosting expenses.

It also doesn't necessarily take much refactoring to move either way. Even if you're heavily dependent on cloud storage, etc., you can access that from an external application server.

The person I replied to was suggesting that the parent refactor in order to make AWS costs less egregious without articulating any particular reason that the original commenter should do so.

> When your bare metal's RAID controller craps out and you have to order another, will his customers be understanding of his decision to move to bare metal?

His customers won't need to know, assuming he has a standby that can take over. Even if he doesn't, he can rush down and install one and move the disks over. With an AWS outage, you can't do anything but say "I hope Amazon fixes it soon". The bare metal equivalent is a power or connectivity loss at the DC, which is much rarer than AWS outages.

> rewrite the storage to S3 instead which is much cheaper

So, pay for the privilege of being vendor-locked to Amazon. What a wonderful idea.

That's how business choices work. You gain x for y. If you value no lock-in more than other gain, then that's your choice to make.

(There are multiple services providing S3 interfaces BTW)

Let me clarify - it isn't unequivocal that every move from data centers to the cloud saves money. I wouldn't have to provide an assessment if it _always_ saved money.

Definitely, but I'm just wondering what the circumstances were where it was actually worth it - I thought my deployment was quite small, but it still didn't seem worth it. I guess due to high bandwidth and storage usage, maybe without those I could get away with it.

You're not including the time costs of operating that server.

Also, most businesses don't have demand that steady. Nor is it advisable in a cloud architecture to rely on a single large server.

Yeah, but I'm just trying to ballpark it. I figured the larger server would be a cost savings over many smaller ones.

Most of it does need to remain online at all times including several large databases. Just running my logstash server alone eats 8 GB of RAM and 200+ GB of disk.

In bandwidth and storage alone I'm past the cost of the server - so even if I could split this up and keep it mostly offline it still wouldn't be close to worth it unless I'm missing something more significant?

> You're not including the time costs of operating that server.

Do you mean swapping the drives about once every 3 years with free DC remote hands? Operational costs there are almost nothing - maybe 10 hours a year of my time. Surely much less than it would take to get everything running on a cloud architecture. And that cloud architecture would still require some level of maintenance and monitoring I'm sure.

> I figured the larger server would be a cost savings over many smaller ones.

Nope, you pay more for a single large machine than multiple small ones.

If your service is truly so stable that you only need to spend 10 hours a year maintaining it (and demand isn't growing), then maybe the cloud isn't right for you. But that's not true for most companies.

Actually pricing at both aws and gcp is linear.

For a single host that you know you'll want in 3 years it probably isn't worth it. If demand grows 10x in the next year though, how does that look? What if the host has a sudden failure? What if the business dies and you want to get rid of the host?

What about the personnel costs? Also, what about the instantaneous bandwidth. If you have huge demand spikes where you need a lot of bandwidth at one time, can your system handle it?

"What if we grow 20x overnight" is basically magical thinking, that's about what it'd take to really cause problems, the odds of it occurring are dirt low. Most of the time it'll lead you to waste money. Certainly not enough to account for nearly that price increase. Especially given that you can still rent dedicated or cloud servers temporarily to accommodate in the event that something like that does happen.

As for personnel, if I had to, I'd hire mostly devops people or pay freelancers for jobs as needed until hitting a limit where you have to dedicate a large portion of someone's time to it and then repeat the math - I fully suspect that this would mean sticking with bare metal. I suspect a 2 or 3 person group dedicating their time to managing only hardware and basic infrastructure for something like Kubernetes could probably handle hundreds of servers without issue. Even if that averages out to an additional cost of $1000 per server per year, it'd only be approximately $80 per server per month, nothing like the increase of going to EC2. And renting entire racks gets cheaper than the individual colocation I'm paying for at the moment. It'd surely be interesting
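The per-server arithmetic in the comment above can be checked with a quick sketch. The $1000 per server per year figure comes from the comment; the team size, salary and server count in the second call are made-up numbers for illustration only, and the function names are hypothetical.

```python
def monthly_overhead_per_server(annual_cost: float) -> float:
    """Convert a yearly per-server staffing cost into a monthly one."""
    return annual_cost / 12


def annual_staff_cost_per_server(team_size: int, salary: float, servers: int) -> float:
    """Spread a hardware team's total salary across the servers it manages."""
    if servers <= 0:
        raise ValueError("servers must be positive")
    return team_size * salary / servers


# Figure from the comment: $1000 per server per year.
print(round(monthly_overhead_per_server(1000), 2))

# Hypothetical team: 3 people at $120k each looking after 360 servers.
per_server = annual_staff_cost_per_server(3, 120_000, 360)
print(round(monthly_overhead_per_server(per_server), 2))
```

Both calls land a little above $83 per server per month, which is consistent with the "approximately $80" estimate.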

> What about the personnel costs?

I have no idea where people came from when they moved to AWS/etc, but I have always picked up the phone (and still do) and got someone on the line who usually fixes whatever the issue is while I'm still on the call.

As far as demand spikes - even smaller hosting companies will have hardware they can spin up pretty quick.

Actually, unfortunately I think that the price is more like 9 cents per GB, so the calculation looks even worse. The $0.01/GB you saw must be across zones within AWS?

So ~$900/month for bandwidth alone.
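For anyone who wants to redo the bandwidth math with their own numbers, here is a minimal sketch. It assumes a flat per-GB rate with no tiers or free allowance, which is a simplification; the traffic volume is not stated in the comment and is only backed out from the two figures that are (9 cents per GB and roughly $900 per month). The function names are hypothetical.

```python
def monthly_egress_cost(gb_per_month: float, usd_per_gb: float) -> float:
    """Bandwidth bill for one month at a flat per-GB rate."""
    return gb_per_month * usd_per_gb


def implied_traffic_gb(monthly_bill: float, usd_per_gb: float) -> float:
    """Back out the monthly traffic volume from a bill and a per-GB rate."""
    if usd_per_gb <= 0:
        raise ValueError("usd_per_gb must be positive")
    return monthly_bill / usd_per_gb


# $900/month at $0.09/GB implies roughly this many GB of egress per month.
traffic_gb = implied_traffic_gb(900, 0.09)
print(round(traffic_gb))

# The same traffic priced at the $0.01/GB rate mentioned in the comment.
print(round(monthly_egress_cost(traffic_gb, 0.01), 2))
```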

The disk and CPU costs for me are worth it, much better to pay $600/month and let that be someone else's headache. But the bandwidth makes this totally a nonstarter.

Out of curiosity, what are you moving?

I'm also not sure how these smaller datacenters somehow charge so much less for bandwidth.

Care to share more details?

On paper, the cost differences between renting/colocating traditional dedicated servers and using cloud service with similar performance/capacity is huge, so it's pretty unusual to actually save money by migrating to cloud. I'd love to hear where the savings actually come from.

My company has done some research on that, in the webscale bubble. It is wrong to think of the "Cloud" vs "Bare Metal" divide as a simple VM-for-server swap. Disregarding up-front investment, you're comparing rented VMs in the cloud with bare metal boxes, datacenter costs, and most important, manpower.

At a small scale, you're paying the cloud less than what 2 or 3 competent admins with datacenter and networking experience would cost in salary. In that situation, you're saving money, because you're saving the hardware ops team. You're paying more dollars per iop, core and GB of ram, but the alternate method of obtaining these resources has more surrounding costs than "ze cloud".

On the other hand, once you're shoveling several hundred kilo-dollars per month to a cloud provider, it makes sense to throw a million or 10 at dell and hire those operators, because it will save money within a year or two. This only makes sense if you need the resources, but if you need those resources, it helps being more efficient.
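The break-even reasoning in the two paragraphs above can be sketched roughly as follows. All dollar amounts in the example calls are hypothetical placeholders chosen to match the orders of magnitude in the comment (hundreds of thousands per month to a cloud provider, a few million in hardware); the function name and the simplification that on-prem running costs are one flat monthly number are assumptions too.

```python
import math
from typing import Optional


def months_to_break_even(capex: float, cloud_monthly: float,
                         onprem_monthly: float) -> Optional[int]:
    """Months until the monthly saving of running on-prem has paid back the
    up-front hardware spend. Returns None if on-prem is never cheaper."""
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(capex / monthly_saving)


# Hypothetical large deployment: $5M of hardware, $500k/month cloud bill,
# $200k/month for operators, datacenter space, power and network.
print(months_to_break_even(5_000_000, 500_000, 200_000))

# Hypothetical small deployment: the ops team alone costs more than the cloud bill.
print(months_to_break_even(100_000, 10_000, 40_000))
```

With these made-up inputs the large case pays back in under two years and the small case never does, which is the shape of the argument being made rather than a measured result.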

Cloud may save from hardware/driver issues - by providing an already tested environment, that doesn't (normally) trip on some weirdness in, say, network drivers.

But if the problem is in the software stack, one would still need competent sysadmins/system engineers with skills to diagnose and resolve the issue. When (just a random example) oom-killer wreaks havoc and free(1) insists there's more than half of physical memory still available, it doesn't matter whether one's in the cloud or not.

I believe, skilled system engineers are still a requirement for any large project, be it in cloud or not.

> I believe, skilled system engineers are still a requirement for any large project, be it in cloud or not.

And that's why cloud will win every single time, forever, on all metrics.

Because the cloud requires fewer engineers to achieve the same work, as it takes care of the low level hardware work.

How is that different from hosting provider engineers taking care of setting up a dedicated server and even pre-installing a tested OS image? They also usually handle all the hardware-related issues you may encounter.

That used to work well before the "cloud" era, and still does.

This is why I said hardware op.

If you use some VM-based cloud infrastructure you will still need a bunch of good system/linux admins to make your application work on the linux in the cloud. Without these, you're toast.

Bare metal requires you to have both good system and linux admins, and on top of that, good hardware admins, networks and datacenter hands. These are different skill sets.

So again: If you do the bare metal thing right at the scale it pays for itself, you'll probably end up paying both the skillset for the cloud deployment, and add the hardware skillset on top.

Are hardware sysadmins a different caste? Unless you mean having your own networking (like a router/ASA in addition to the server, or even your own private fiber), of course, and those CCNAs etc. Or those experts on some specialized hardware, like giant FC SANs or whatever one might fancy. If, when talking about "bare metal", we go to those extremes, then, sure, cloud is unbeatable.

I believe, usually, "bare metal" hosting means you order the hardware, get it installed, but networking/cooling/power supply/etc are done by the datacenter people, not your own staff (your own staff may be not even permitted to enter the server room). No less-common specialized hardware to deal with, either.

System engineers must know OS internals well. If they do, it's unlikely they can debug, say, a kernel memory leak (which can be a thing in a VM), but not a lockup in a network card driver's interrupt handler (which is close to impossible in a cloud VM, but I saw this on bare metal). Maybe I'm wrong, but that would be, like, too specialized and just a weird sort of specialization. Am I wrong?

Hey, you are a liar. He is very sure that the cloud saving money has always been a damn lie! To hell with your data.


what data?

"I can tell you with 100% certainty that the savings are there and have seen financial proof of the savings.".

My attempted joke wouldn't make sense outside of this context

> The cloud as a money saving venture is and always has been a damn lie.

When you are in business, saving money is not always the most important thing. In reality, you usually spend 95% of your revenue. The question is not "How much money can we save", but "What investment will bring the greatest return".

It's a bit like buying the building that you are working in rather than renting it. Or hiring full time cleaning staff rather than hiring a firm to take care of it. The second choice is more expensive, but requires less capital/time up front. If you can use that money and time to better effect then it is a no brainer. As companies get bigger, they have fewer opportunities to make large returns on their money/time, so it often makes sense to start focusing on those smaller returns.

As an aside, this is also why you shouldn't go to your CTO and say, "I can save $X by refactoring/rewriting/testing our code". The CTO can almost always find things that appear to be better ROI opportunities and so you get into a situation where refactoring/rewriting/testing becomes forbidden. Rather you should always attempt to use appropriate techniques to maximise throughput in development and to keep your stakeholders happy.

So if I understand you correctly, if I need an appendectomy I shouldn't consult a doctor, but rather open up an anatomy book and do the procedure myself?

Netflix has edge caching deployed at "thousands of ISPs worldwide."


Don't forget to take the tinfoil off when you're done ;-)

What are some examples of companies that started in the cloud then switched to their own metal? Surely there are some blog posts about costs before & after?

Dropbox switched from S3 to their own in-house hardware:


The thing is that at the small to medium scale running your own servers just isn't cost-effective. It's the same principle that leads smaller companies to outsource services like IT.

I agree with your overall point, but it was my understanding that Netflix does use AWS but runs its own CDN with FreeBSD machines, some of which are located at local ISPs.

>> All the original internet companies run their own hardware

Exactly and most of the companies are not original internet companies, and they use the cloud.

It's a conspiracy, got you.

Why was hybrid not considered? With providers like megaport providing really inexpensive direct connect, this is almost a no brainer.

This may seem harsh, but relying on random commenters shows a huge flaw in how you guys went about this.

Physical environments DO work well, but they do require experience to run them. I have used AWS for as long as it's been around, but nothing beats physical environments for "known workloads", as long as you run them EXACTLY like you would run a cloud environment. This is why the hybrid approach is such a great thing. Run what you know well on physical and reap great savings; run what you don't know well in the cloud (as well as take advantage of the analytics products), and reap the speed ;)

- This means thinking of physical servers as individual units.

- This means redundancy at every level.

- This means architecting for failure (servers and switches do die).

- This means no shared storage for performance critical parts (shared storage as below the OS level).

- This means object stores/sharding/etc as your storage layer.

- This means real engineering.

- And this means exactly the same whether you're physical or not.

I've managed environments of 50 vm's and environments of 50000 physical servers. The methodology is always the same.

and yes, this means you can save some monthly cost, and apply it to staff, that can do a lot more than just maintain this infrastructure.

PS: for those that think that showing up to a datacenter is required, you're doing it wrong. Pretty much any datacenter has hot-hands service, and with the right redundancy, hardware replacement is something you can do at a slower pace.

PPS: I'm sure someone will nitpick some of my points. The reality is, there are real money savings here. For example, Snapchat spends MORE in cloud infrastructure in 2016/2017 per year, than ALL OF GOOGLE did in 2012... think about that for a second... even Netflix runs Open Connect to push bits.

> This may seem harsh, but relying on random commenters shows a huge flaw in how you guys went about this.

Your other points and experience are well taken, but this is definitely too harsh. Think about the nature of GitLab as an open company building dev tools and paying below-market salaries. They have a huge amount of developer mindshare relative to their internal head count. They are not very competitive for hiring people with your particular skillset.

Given these facts, I think an open decision making process serves them better than trying to make decisions behind closed doors. It may come off a little ham-fisted, but in actuality I think it's far less ham-fisted than a huge number (maybe even majority?) of engineering decisions made every day in companies that just don't have the expertise to make the right decision and often don't even realize they are getting burned by what they don't know they don't know.

> They are not very competitive for hiring people with your particular skillset.

There is plenty of ops talent willing to work at a high-profile startup. I have experience building and running PB-scale Ceph clusters, private and public clouds, HA infrastructure covering everything from DB failover to BGP, and interviewed at Gitlab when they announced going on-prem. I asked for half the market rate (because I was more interested in solving the scaling problem than money), but did not get it.

The HR person at the first (non-technical!) interview asked a bunch of programming related questions from a script which caught me completely off guard. One question went along the lines of "tell me about a difficult problem that you had a particularly elegant solution for". I think they've learned by now that ops problems usually can't be solved by a simple algorithm.

In retrospect I'm glad I didn't get it, but they seem to be having issues with more than infrastructure.

As a side note, if there are other startups wanting to move to their own hardware but struggling to find the right people, I would love to know where to find you. Non-AWS ops jobs are few and far between these days.

It has definitely become less common to see companies build much outside of cloud environments. And the talent has dried up to an extent as a result. Expertise and interest in building outside of cloudland is important to find, but there's also a challenge in finding the right operational mindset as well.

It seems more common with more mature "startups" that have done the cost / benefit analysis and realized that the only way to achieve economies of scale for their business is to exit AWS. Dropbox, GitHub, Uber and so forth. Or businesses that simply must operate their own hardware, like CDNs.

There are some smaller places that see value in building a low-cloud service from the beginning. One of those companies is where I work, e-mail in profile.

It's not just too harsh, it's not true. See the comments from those involved in the decision in this thread.

Why is it not true? Their main competitor, GitHub, runs their own datacenter. Why is that?

What's not true is that Gitlab relied on random commenters to influence its decision.

> Snapchat spends MORE in cloud infrastructure in 2016/2017 per year, than ALL OF GOOGLE did in 2012

Not 2012, but in 2013 Google spent 7.3 billion dollars on data centers [1]. Snap has agreed to spend a minimum of 400 million dollars per year on Google Cloud according to their S-1 [2]. Where did you get your figures?

[1]: http://www.datacenterknowledge.com/archives/2014/02/03/googl...

[2]: https://www.sec.gov/Archives/edgar/data/1564408/000119312517...

Gitlab is hosted on Azure and like many startups received six figures worth of free credits to be in the cloud.

As the credits run out the architecture reverts to the natural state - architectures tend to be the derivative of organizational structure, code base, and processes of a company.

Through this lens these decisions start making more sense. It is a reactionary process. Sometimes when you start small you bounce around like that and are lucky enough to grow for years and years until you hit a wall.

A bit disappointed in the article, as the bulk of it is simply quoting various viewpoints, some of which disagree with each other. I'd be much more interested in the thought process that occurred for them to reach the conclusion they did. What internal doubts did they raise? What possible solutions to those doubts did they consider? Things like that.

The reasoning behind their decision is linked near the top of the article, under "Sid and the team decided": https://gitlab.com/gitlab-com/infrastructure/issues/727#note...

Every time I see a post from Gitlab, I worry for them. I haven't used gitlab but have checked it out a few times. On the other hand I use github regularly. But this isn't about github vs gitlab.

Instead this is about gitlab's definition, or rather level, of transparency, something they seem to be proud of and are even appreciated for very often here at HN. I feel that gitlab have taken the transparency thing to an extreme, to the extent that they even named the engineer that caused a recent downtime. I don't think it matters that the engineer didn't object to it. Then there are posts about why they are choosing some UI framework and why they are moving away from cloud and now this. It's fine to be transparent, but in a professional world it's also important to be private about certain things. Trying to please people by being transparent for the sake of it will take a toll on the team eventually. It's much more effective to just get things done instead of constantly being transparent to the external crowd about every single detail.

I love all the comments trying to explain that managing your own hardware, storage, network and backup cannot take that much of your time.

Not only does it take a huge amount of time, but it takes so much time that it requires multiple dedicated highly-skilled people.

Last but not least, the last article about Gitlab was about a major outage where they had as many as 6 out of 6 backups unusable! The fact that they still exist today is due to the sheer luck that there was an unexpected copy made manually before the disaster.

That speaks at length on how that 100 people company is struggling to have the resources and the qualifications it needs to execute (what startup isn't?). Let's not send them on a suicide mission to manage their own hardware on top of everything they already have to achieve. ;)

Why do you think that if you don't use a cloud service you have to own hardware? You don't have to do it.

Dedicated server providers will manage your hardware, that's it.

I'm a little shocked at this decision. The issues faced are all solved problems from PB storage to HA on server clusters even across Data centers. Good solutions do have upfront costs but the math that cloud hosting is 5-10X more than co-location company owned/leased is still in the ball park. It sounds like they may need an architect with enterprise experience to help them out rather than random comments on HN.

Yea, every growing company I've been at has been in the process of moving off of hosted solutions and onto their own hardware to cut costs.

Two such companies made the mistake of doing that with OpenStack, which is terrible and should die in a fire.

But one later switched to DC/OS and containers and it has worked really well. They've been migrating apps running on EC2 instances into docker containers that can run on marathon in our local data center and the savings are pretty substantial (even adding in the cost of the teams needed to maintain our own platforms).

Managed solutions are great for startups. There is a lot of value in not having to set up, maintain and manage your own hardware .. but that does reach a limit and companies need to be prepared for that transition and avoid lock-in.

What about managed dedicated servers? You don't need to go all in full metal, and be responsible for networking and swapping disks and all that.

I know it's more than a handful of deployment recipes, but it's not like they don't manage their cloud instances either...

This article is a false dichotomy with a seemingly rushed decision.

I was wondering the same. Why not manage servers without managing the hardware? There are several hosting providers with enough data centers to provide the capacity needed.

You'll still need people managing servers, but no one has to live close to a datacenter and you can emulate at least some cloud-like features (e.g. spinning up a new server if one server fails and dealing with the failing server later).

The only case where that doesn't work is when you need physical security for your data (locked cabinets that no one from another company can enter), but that's no concern if you already had your data in the cloud before.

Where is the original thought? There were few hard numbers. There was way too much quoted material and no Gitlab thoughts on it. It looks like they made the post based on feelings instead of a metric like time or dollars.

Point well taken on the blogpost leaving some important info out — sorry about that. Having been involved in the conversation, I can guarantee you that we came to this conclusion primarily through an analysis of the numbers (both cost & effort). I’ll see what we can do about publishing them.

Buried towards the bottom is the interesting decision to dump CephFS and go with NFS instead.

I wonder if Amazon's new managed EFS service would make sense? It's exposed as a NFS mount to the OS. The claims are:

- Up to thousands of Amazon EC2 instances, from multiple AZs, can connect concurrently to a file system.

- Data is stored redundantly across multiple AZs.

- Low, consistent latency.

- Multiple GBs per second.

EFS is pretty great. But for our use case there are a few drawbacks:

1. It doesn't scale to the size we need for all repositories, so we still need sharding.

2. It is expensive, $300/TB-month

3. You're still sending all traffic over the network, we prefer the latency of local storage we can achieve by developing Gitaly https://gitlab.com/gitlab-org/gitaly

Why develop Gitaly instead of using consensus (etcd, zookeeper, etc.) for writes to fast storage like ssds and the normal git binary? As far as I can tell it's adding a lot of complexity by caching high level RPC calls but still doesn't really address multi-master read after write consistency or coordination.

I find that interesting as well. Furthermore, while looking into deploying containers with cheap cloud hosters, I was curious how to share data across various nodes, so I looked at how Google and Amazon do it. Turns out it's still NFS. I've never looked into networked FS before, but I thought they'd move past NFS to something more advanced...and newer I suppose.

Fundamentally not much different, though. N shards holding the data, you have to know which shard is which and maintain connections to it. Split brains could happen if that sharding is dynamic etc.
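To make the "you have to know which shard is which" point concrete, here is a minimal sketch of static, hash-based sharding of repositories across storage mounts. This is not how GitLab or Gitaly does it; the shard names, the repository path and the `shard_for` helper are all hypothetical, and a real system would more likely keep an explicit lookup table so repositories can be moved.

```python
import hashlib
from typing import Sequence


def shard_for(repo_path: str, shards: Sequence[str]) -> str:
    """Pick a shard deterministically from the repository path."""
    if not shards:
        raise ValueError("at least one shard is required")
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer and map it onto the shard list.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


SHARDS = ["nfs-01", "nfs-02", "nfs-03", "nfs-04"]  # hypothetical mount names

# The same path always maps to the same shard as long as the list is unchanged.
print(shard_for("group/project.git", SHARDS))
```

Note the weakness of this naive scheme: changing the length of the shard list remaps most repositories, which is one reason dynamic sharding needs coordination of the kind the comment worries about.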

So, GitLab based a crucial part of their company on some random comments that were just quoted and could be said the other way around as well? Where is your own thinking and your own decision making?

Bare metal doesn't mean colocation. It also doesn't mean going bare metal only, you could go with a hybrid approach and save the base costs while maintaining the cloud scalability.

See my comment clarifying how we came to this decision [0].

And for what it’s worth we considered colocation and hybrid models as well before coming to this conclusion.

[0]: https://news.ycombinator.com/item?id=13776735

Use your bare metal for your low-water mark and "scale over" to the cloud.

Setting up a hybrid cloud like this is not overly complicated and AWS and other providers will offer you a direct connect network connection in many carrier neutral colocation datacenters so you can sit on your cloud vpc with a physical connection in your datacenter of choice.

Get the best of both worlds. Providers like ovh can get very close to the cost of buying your own metal with a tco break-even at over 18 months, not counting network and power equipment costs. Using ec2 or GCE for a large number of persistent vms without a heavy discount seems... Wasteful.
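A rough way to see what "bare metal for the low-water mark, cloud for the rest" buys you is to price an hourly load profile both ways. This is only a sketch: the load profile and the per-server-hour prices are invented for illustration, and it ignores the direct connect link, power, network gear and staffing, all of which a real comparison would need.

```python
from typing import Dict, Sequence


def hybrid_vs_cloud(hourly_load: Sequence[int], metal_price: float,
                    cloud_price: float) -> Dict[str, float]:
    """Compare an all-cloud bill with bare metal sized at the low-water mark
    plus cloud capacity for everything above it.

    hourly_load: servers needed in each hour; prices are per server-hour.
    """
    if not hourly_load:
        raise ValueError("hourly_load must not be empty")
    baseline = min(hourly_load)                      # the low-water mark
    burst_server_hours = sum(n - baseline for n in hourly_load)
    all_cloud = sum(hourly_load) * cloud_price
    hybrid = baseline * len(hourly_load) * metal_price + burst_server_hours * cloud_price
    return {"baseline": baseline, "all_cloud": all_cloud, "hybrid": hybrid}


# Hypothetical day: 10 servers for 20 hours, 25 servers during a 4-hour peak.
load = [10] * 20 + [25] * 4
costs = hybrid_vs_cloud(load, metal_price=0.05, cloud_price=0.20)
print({name: round(value, 2) for name, value in costs.items()})
```

The flatter the load, the more of the bill sits in the baseline term and the more the cheaper per-hour metal price matters; a very spiky profile shrinks the gap.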

Hmm. Nearly all the commenters seem to be "extreme", i.e. either full-cloud or full-bare-metal.

Why not go hybrid? Do your baseline load with bare-metal and your spike load with AWS, GCE or Azure Cloud.

Cloudbursting either a database server or their git file storage doesn't sound all that viable, and I'd expect that to be the things that matter/are the bottleneck for Gitlab?

You are adding more points of failure and complexity. To what, save some money? Which is fine, but then most of the arguments against pure on-prem apply. You save money but add a lot of complexity. If your on-prem dies, most hybrid cloud setups will die.

For most use cases that will complicate the system. Keeping everything the same and simple is preferable.

I'm really glad you guys decided to listen to the experienced commenters' advice and change your mind, rather than obstinately plowing on with what seems to be a bad decision.

As a potential Gitlab customer (I would be running an instance in the cloud if I moved off of their hosted offering), this removes the main concern I had with migrating to their platform.

Fundamentally the performance problems they had could be solved by either software or hardware, and I have more faith in the team's ability to learn the distributed computing / caching techniques required than to learn how to run their own metal.

Why didn't they choose to split it? You always have your bare-metal infrastructure, and if you see tremendous growth and need more IT infra, you can use public cloud to handle the load until you find a good bare-metal solution.

Sure it's some work, the setup is sometimes not ideal, but it's a huge gain imho.

In addition, many people bring the argument that infrastructure is not their core skill. It is true, but do they need to get to the level of public clouds to run their own infrastructure? I mean, even with sub-optimal choices on hardware but smart choices on how you handle your architecture, you can still have huge savings and not so many risks by doing it in-house.

Nitpick for the author, you put twaleson for my quote, should be jtwaleson.

Whoops, sorry. Opened a merge request to fix that.


And it's merged and deployed.

Couldn't they have easily just drawn the opposite conclusion? I don't see a strong connection between the article's contents and its conclusion.

I read it as: commenters raised a bazillion legit questions that we had not necessarily considered (unknown unknowns), and now the whole endeavor seems more risky and the TCO questionable. Let's instead gradually re-architect our application, so it fits better in the cloud, which also happens to align with what most of our enterprise customers are going to need anyway.

This commit really does tell the story of the disproportionate influence HN comments had on this decision:


HN is a great way to get opinions about a decision, although sometimes the criticism far outweighs the positivity and support.

A post saying "We use X and it is great" will usually attract lots of "X is bad, use Y" comments.

I really hope that GitLab took a well informed decision based on arguments and counter-arguments.

In my observing-from-a-distance experience, going bare metal makes sense once you are a certain size and have predictable baseline load to maintain. Many companies that have their own data centers still use cloud services to auto scale and deal with spikes.

As long as you prove yourselves incapable of handling things without the aid of third parties, I will never be using your services. Sorry, I'm just picky like that and demand true competence.

TL;DR = people cost more than servers = stay in cloud
