Rolling your own servers with Kubernetes (gravitational.com)
409 points by old-gregg on May 6, 2019 | hide | past | favorite | 203 comments

AWS employee here--thoughts and opinions are my own.

Prior to AWS, I was in IT Operations at a large financial services company. I saw the writing on the wall that over time, companies would not want to manage this part of their IT infrastructure themselves. Keep in mind, I was someone who was responsible for keeping the lights on for a decent number of Linux servers.

For an individual company, there really isn't much value in having to maintain firmware levels on all your hardware, patch hypervisors (and try to coordinate all of the maintenance around a fixed pool of hardware), perform months-long evaluation of new hardware before purchasing, test and validate configurations on new hardware, etc. I used to do all of this. I don't miss it either.

Yes, the items above are important, but doing them right is really table-stakes for any reliable IT Operations department. You can choose to spend time getting these right, or delegate that responsibility to a service provider whose main job is to get that stuff right (and recoups that cost across a much larger customer base).

What seems to often go unsaid in these discussions is that the choice isn't between cloud and colo. There's a third, hugely popular and mature option: dedicated providers - which address all of your issues.

It's convenient for cloud vendors to have people believe the choice is between them or having to deal with hardware.

But aren't dedicated providers a subset of cloud providers? I can see how having a focused provider with a narrow mission might be beneficial in some cases, but I can't say it's that much of an advantage compared to the ecosystem & convenience of a cloud provider.

what's an example of a dedicated provider?

I also worked in ITOps at a medium-ish company, and we were moving our colo to Azure when I left.

There are thousands (tens of thousands?). OVH is probably the biggest by server count. Softlayer arguably had the most potential (prior to its IBM acquisition). Hivelocity, ReliableSite, WebNX, Hetzner, LeaseWeb, Online.net, DataPacket, QuadraNet, PhoenixNap, ...

Hetzner, scaleway and packet.net come to mind

i3d.net will give you a machine, with as hands-on support as you want up to even patching the OS.


The guys that are selling managed services on AWS?

They sell all of those things. RS has their fingers in many pies, even though some of them appear to conflict at first glance.

Yeah, I think the other factor that seems to be missed here is access to security patches for the hardware/firmware and OS. If you look at recent history, with even the CPU attacks over the last few years, Amazon and MS had access to the issue and vendor workarounds months before even other large cloud players did. DigitalOcean and other very large players were left holding the bag when the announcements were made, with very short windows to get their systems up to speed. Consumer-level on-prem was sometimes waiting months for the patches/firmware and software to be available.

Not saying at all that is how it SHOULD BE, but if you are planning on pulling back to on-prem (or colo), it should be a concern, as it is a hard-to-mitigate risk.

Correct me if I'm wrong, but I thought basically all recent exploits (including Spectre/Meltdown) were only really viable on shared hypervisors?

So while yes, there weren't any fixes for your on-prem virtualizers -- there also wasn't any immediate danger, as the attacker had to compromise one of your nodes before actually being able to use these attack vectors...

What is missing from this discussion is level of control and tradeoffs.

First, having a full level of control can be desirable.

Second, you trade off maintaining firmware and software levels on your own hardware for managing the way cloud providers build networking and servers themselves (and for cost control as well).

Maintaining your own hardware and keeping it up to date really isn't as bad as people make it out to be. It's not hard to get right either, assuming you can hire any sort of decent system engineer.

>Should you roll your own servers?

If you are not certain, the answer is most likely “no”. The staggering growth of AWS happened for a reason.

Funny how for many decades companies and people were running their own servers. The hardware was getting cheaper each year. More software became available via open source. Then several ubercorporations entered the hosting/cloud business, and suddenly no one seems to be able to afford their own infrastructure.

>The hardware was getting cheaper each year. More software became available via open source. Then several ubercorporations entered the hosting/cloud business, and suddenly no one seems to be able to afford their own infrastructure.

The decision framework the executives use isn't just the "hardware+software" -- it's the whole "IT organization".

In other words, it's not "in-house cpu" vs "Amazon's cpu". It's in-house IT employees' speed of tech innovation vs Amazon's engineers'. An example of this disparity was Guardian's disaster with its in-house organization trying to use OpenStack.[1]

For many non-tech companies where IT computing is a cost center, their employees won't be able to match the iteration speed of Google's engineers constantly improving on GCP or Amazon's employees enhancing the features of AWS.

We've all heard the stories where a company's project submits a requisition to internal IT department for 2 development servers for their programmers -- and then the IT bureaucracy tells them that it will take 2 weeks. Over time, the internal IT dept treats the other departments as adversaries instead of customers. Executives get fed up with slow IT departments and get excited when a few clicks on AWS dashboard gets them servers spun up in 10 minutes. It's not just a cpu+hardware comparison.

Companies outsource to AWS/GCP/Azure because it's quicker turnaround with more datacenter features than their internal IT teams can deliver. Most companies are not like Facebook or Dropbox that can maintain an internal IT organization at a high level equivalent to AWS.

[1] https://www.computerworlduk.com/cloud-computing/guardian-goe...

>Most companies are not like Facebook or Dropbox that can maintain an internal IT organization at a high level equivalent to AWS.

Let's try this with different phrasing. In 200X, a lot of companies were able to maintain their own infrastructure, just like Facebook and Amazon did at the time. Fast-forward 10-13 years. We have cheaper hardware. We have an extra 10+ years of development in open-source software. And yet the list of self-hosting companies has shrunk by a huge degree. Doesn't that seem interesting?

>In 200X lots of companies were able to maintain their own infrastructure, just like Facebook and Amazon did at the time.

But my point is that companies' IT departments did not maintain infrastructure just like Amazon did.

In ~2005 when companies were first experimenting with AWS cloud, they might start with dev & test servers. They click a few buttons and are amazed when new servers get spun up in minutes and their programmers are productive immediately. The natural question that company execs ask is, "why can't our own internal IT department spin up servers for us in 10 minutes?!? Why does it take them so damn long?!?"

They wouldn't have asked those hard questions if their internal IT capability was equivalent to AWS. Eventually, their improved experiences with AWS on Dev&Test&QA convinced them to migrate mission-critical Production workloads to AWS as well.

>We have cheaper hardware. We have (supposedly) better software.

You're still focusing on hardware+software and not considering the IT employees' speed of execution, which is how company executives compare the situation.

Even Netflix as a tech company maintained their own datacenters for over 10 years but ended up migrating to AWS. Their "Guardian" moment was a big database corruption in 2008. They also had ambitious plans to expand into countries outside of USA. Those were some of the factors that convinced them they didn't want to invest anymore in their own datacenters and preferred AWS take care of it. Amazon employees iterated on datacenter capabilities faster than Netflix engineers could do it.

Netflix does not depend on AWS to deliver their video bits. They do it themselves, based on a big network of core and edge PNI (private network interconnects) and caches. This infra that Netflix has built, in datacenters, is a big competitive advantage. It would be worse performance and much higher cost to do this over something like AWS CloudFront.

>this infra, that netflix has built, in datacenters,

Unless things have changed since 2015, reports say Netflix eliminated their last datacenter already.[1]

The Netflix "edge appliances" for CDN streaming are located in others' datacenters owned by Verizon,Comcast,ISPs,etc.

[1] https://arstechnica.com/information-technology/2015/08/netfl...

That's still in datacenters that aren't Amazon's, Google's, etc. It's the most critical component of netflix (the actual video delivery) and it's not "in the cloud".

>That's still in datacenters

The point isn't that they are still _in_ datacenters. Yes, of course, they are. Even the "cloud" ultimately resolves down to somebody's datacenter somewhere. The point is that Reed Hastings & Netflix wanted to get out of managing their own datacenters.

Putting their Netflix appliances inside of ISP-owned datacenters still lets them avoid managing their own datacenters. Their critical user account signup, monthly billing, analytics, etc. workloads are at AWS. And as the article mentions, even updating the cache on the Netflix appliances is coordinated through AWS. The combination of those strategies keeps them out of the "datacenter business" and lets them stay focused on their core competency of "video content".

The distinction that matters is fault domains and whose data centers they are.

If you are a Comcast customer and your internet goes down, and Netflix is unavailable, who do you complain at? If you're smart enough to notice that both are down, the answer is almost certainly not Netflix. That does not make the Netflix workloads in Comcast data center any less critical, they are core business functions. But they are well aligned with Comcast, who also depends on the proper functioning of those datacenters.

It makes sense for Netflix appliances to be in Comcast datacenters then, especially given that Comcast cannot outsource their data centers any more than the Pentagon can reasonably do so.

Joe Company from off the street can outsource their data centers and derives no competitive advantage from maintaining their own private data centers. Netflix in that sense is closer to Joe Company than they are to Comcast, I guess. I'm not sure what all we can learn from this, but it's interesting.

Netflix employees still walk inside those datacenters.

I’ve avoided my own datacenter by using Equinix, for example. I wouldn’t call that “cloud” though.

The 'web' and browsing part is on AWS, the CDN is their own not on AWS.

The Netflix example is a poor one. They have a friggin fleet of CDN boxes around the world that are as close as possible to eyeballs. To say they are "100%" on AWS is disingenuous - they still have their own hardware deployed all around the world - just closer to the edge.

AWS is used for control-plane / management-plane tasking - but the bits are pushed from locally-peered CDN boxes at all kinds of ISPs.

>In 200X a lot of companies were able to maintain their own infrastructure, just like Facebook and Amazon did at the time.

I worked at a decently large tech-ish company at that time. It took 3 months to provision a handful of dev servers in the data center, after filling out forms in triplicate. We asked for extras since we had almost zero ability to manage them (reboot, etc.) or get them re-built if something broke. No backups of any kind as far as I knew. We once had one break at night, and apparently someone had to physically drive there at 3am to reboot the thing after we filed a ticket. It used some sort of in-house distribution of Linux that had its own quirks.

I'll take the cloud any day of the week.

That sounds like a problem with process, not technology.

I can provision a handful of production servers in 5 minutes.

> In 200X a lot of companies were able to maintain their own infrastructure, just like Facebook and Amazon did at the time.

I would question that equivalence. Speaking as someone who was trying to get what we now call devops going at the time, there were a _few_ companies seeing big wins from automation, and a ton of places that were content pouring huge amounts of human time into doing it the hard way, with slow metrics for time to patch, provision new services, recover after failures, etc. When they had automation, it was known-bad practices like cloning VMs, with accordingly greater cost and lower benefits.

The companies in the latter group were the ones faced with either spending a huge amount of time and money catching up, or cutting a cloud contract and not having to do a large percentage of that remediation work at all, while being able to deliver results immediately.

Ok, I will try to counter that. The stack grows higher, and more specialized employees become more effective. In 200X you just needed a Linux admin. Now, on top of being a good Linux admin, you also need to know Docker, Kubernetes (the admin part, not the user part), etc. If you don't work with that daily, because you also need to maintain some apps, you will be less professional than a cloud provider.

Of course it’s a trade-off.

If you’re large, with constant high resource needs and benefit from custom optimization, you may be better off with your own team/infrastructure.

If you’re small (2 it guys), fast changing, normal speed requirements / easy to scale out then the cloud should be interesting to you.

Really good points. Maintaining your own infrastructure may be "boring" with "old" technologies, but it's a known process. It's really easy.

But with the cloud it's a whole different domain. It may be easy too, but I don't think it's as easy as hosting your own.

And your last two lines are great. It really comes down to the company about which tradeoff is appropriate.

I worked at a couple of those self-hosted companies in the 200Xs. For me, as a developer, it was painful and way, way, way worse than AWS or GCP. Planning for big expected spikes of traffic was extremely painful, and once the cloud vendors started adding new services we felt continually behind the times.

I can only think it was "interesting" if you've forgotten how difficult and expensive managing your own infrastructure can be.

It's a false dichotomy (or at least a false equivalence).

So, there were a lot of, and still are a lot of, corporations running their own stuff. Some of these are still running their own stuff well. Some got big, some stayed small in the past decade (some downsized).

It's not necessarily hard to run your own stuff either.

Of course, of course, managed platforms are easy (or easier) from a lot of aspects. (That's their value offering after all.) And the IT expertise market was never exceptionally great in a lot of places. Meaning it was hard to find good sysadmins, good ops people, good devops folks, good programmers, leads, tech PMs, POs, etc. who were able to work together and effectively run their shit, build their product/service, run with the flow (of open source or whatever vendor they used), and so on.

AWS/GCP/Azure is a big paradigm change. No longer do you need to argue about what and how; at best you pick one of the three, and that's it. No longer do you need to think about hardware, colo, uplink, switches, peering, multi-homing, picking the right dedicated server provider, and so on. (Sure, you need network guys who at least have some basic understanding of VPC - overlay networking - but the basics are easy, and the rest is YAGNI anyway.) And this is due to their size. They are big enough that their price premium - compared even to whatever random small hosting provider you might find on the 'net - is irrelevant, because they are usually more efficient, more secure, more geo-available than the small ones.

But of course, while generalizing about all self-hosted companies (saying that all self-hosting is a pain) leads to a falsehood, on the margin, especially during the IT consolidation of the past years, most companies moved away from self-hosting. And developers usually rejoiced. Similarly, generalizing the problems of self-hosting from your own experiences also leads to a flawed conclusion (that it's necessarily hard and expensive to manage your infra); on average, outsourcing a cost center in a world of ever-increasing complexity (that is, potentially unbounded costs) and focusing on profit centers is a sane decision.

> it was hard to find good [key job roles] who were able to work together and effectively run their shit

Having had to interact with sysadmins and others, first at at a university and then in business, "work together" is too frequently overlooked.

The crusty, angry and rude stereotype proved right. I've never met a bigger group of downright bastards. Those who were nice people usually couldn't get the servers/software to work reliably. So you were forced back to group 1.

Spiky usage patterns are great for cloud.

Otherwise, I did some tests years ago, and what I spent over 5 years would have gotten me only a few months on AWS. And I think I only walked into the data center once in those five years.

When I did tests and studies for our new infrastructure, AWS was at least 2x as expensive per year. And in the 4-5 year range, it was 3x+ as expensive.

His whole comment was about how it's not about hardware cost, but staff cost.

> It's in-house IT employees' speed of tech innovation vs Amazon's engineers'

The cloud shrunk the cost of IT team. Sure hardware is cheaper, but that cost is marginal versus employee cost.

> For many non-tech companies where IT computing is a cost center, their employees won't be able to match the iteration speed of Google's engineers constantly improving on GCP or Amazon's employees enhancing the features of AWS.

Exactly this. As an ops person, this is exactly how I explain it. Sure we can build something fairly competitive on-prem or in a colo, but we won't have an entire team of top-flight experts bent on improving it as fast as possible. That per-GB cost is buying a lot more than just bandwidth and drive space.

> Over time, the internal IT dept treats the other departments as adversaries instead of customers. Executives get fed up with slow IT departments and get excited when a few clicks on AWS dashboard gets them servers spun up in 10 minutes.

This tends to be more of a problem with organizational mandates & processes. If those processes aren't addressed, then you'll end up with all the same problems. Perhaps they will have different labels on them, but underneath it will be the same issues, delays, outages, and recriminations.

>>> the IT bureaucracy tells them that it will take 2 weeks

You're not in touch with reality. Try 2 months at a bare minimum.

I worked in a shop that had dedicated lab equipment and resources.

Suddenly buying new equipment that we would use for years became like pulling teeth, no money they said. They couldn't explain what changed.

Then, just as suddenly, they were willing to approve spending more than our one-time equipment purchase per month in monthly cloud costs, on some poorly thought-out cloud deployments...


-1 Dedicated person to manage the lab.

+2 new, way more expensive devops guys.

-X Occasional new equipment costs / service contracts.

+Y A metric ton of recurring monthly costs.

Sure, but the accountants got to move it from CapEx to OpEx, and then everyone went out for a beer.

And the boss gets to control a bigger budget, instead of having to ask for the money every time. It would probably be much nicer to work for a company where all execs and middle management were replaced by some AI mastermind that gave compiler-like answers: "Sorry, we can't order this equipment because the inquiry does not end with a newline character." Replace HR too, and the answer to your C++ job application would be "Sorry, we could not find c++ in your resume" (case sensitive, it has to be in lower-case). Think of how much money and grief that would save.

Adds c++ to resume.

"job added!"

This is awesome!

Also someone got to put it on their resume... probably someone who didn't have to deal with the headaches ;)

> The hardware was getting cheaper each year. More software became available via open source. Then several ubecorporations entered the hosting/cloud business, and suddenly no one seems to be able to afford their own infrastructure.

That conspiracy theory depends on not asking about the major cost areas you left out (e.g. ops, security, reliability), which are the most significant until you're at a fairly large scale. The places I saw switching to cloud providers did so after comparing the quality & cost of the options available, not because they were following the cool kids.

Ops, security, and reliability don’t go away when using EC2 either, which is what led the massive cloud explosion.

Places switched because they didn’t like having to forecast capacity so hourly billed resources turned on and off at the drop of a hat were super attractive.

> Ops, security, and reliability don’t go away when using EC2 either

They don't go away, but you have to spend considerably fewer resources on it because Amazon does a lot of it for you.

Edit: This effect is even more pronounced when you use Amazon's more specialized managed services, like S3, RDS, Lambda, etc. Then the things you mentioned almost completely go away, as you're letting Amazon completely maintain your infrastructure.

What used to be behind firewalls oftentimes became accidentally public through the many, many blunders of publicly available S3 buckets that people started to use because they were frustrated with AWS IAM. This is ironic because IAM was supposed to help you secure your data and such, not make things worse for your security profile. Even with all the tooling that exists to help people write better IAM policies and figure things out, cross-account IAM permissions and roles, S3 bucket policies, etc. are a nightmare for most people beyond fairly seasoned engineers in AWS and that's a problem. Secure-by-default is not quite 100% true with most AWS services in practice unfortunately.

It's extremely easy to prevent publicly available S3 buckets, and there are even built-in policies for it.

It's actually fairly difficult to make a bucket public without going through a few hoops even without that.

The only reason this is happening is sheer laziness and lack of any true change control processes. It's not because it's hard.
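For reference, the built-in mechanism alluded to above is S3's "Block Public Access" settings. A minimal boto3 sketch (the bucket name is a placeholder, and actually applying it requires boto3 installed plus AWS credentials):

```python
# The four S3 "Block Public Access" settings. Turning all four on blocks
# both new and pre-existing public ACLs/policies from taking effect.
BLOCK_ALL_PUBLIC = {
    "BlockPublicAcls": True,       # reject PUTs that include public ACLs
    "IgnorePublicAcls": True,      # ignore any public ACLs already present
    "BlockPublicPolicy": True,     # reject bucket policies granting public access
    "RestrictPublicBuckets": True, # restrict access to buckets with public policies
}

def block_public_access(bucket_name: str) -> None:
    """Apply the settings to one bucket (hypothetical name, e.g. "my-bucket")."""
    import boto3  # lazy import: needs boto3 + AWS credentials to actually run
    boto3.client("s3").put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=BLOCK_ALL_PUBLIC,
    )
```

The same settings can also be applied account-wide from the console, which is usually the simpler route.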

Having witnessed what people do, it’s easy to prevent it now for the most part but badly written / enforced access controls and laziness (in the form of overcommit of engineers to projects) are the norm for most large companies. Most of the compromised buckets were launched years ago before a lot of safeguards were put in, and object level permissions can override bucket policies anyway. Getting sharing of objects across a Byzantine bureaucracy in internal IT is a great way to increase the chance some engineer desperate to get their work done will mark something public and forget about it.

S3 based URLs to get cheap web hosting for low traffic sites is exactly what leads to bad permissions as well. I’ve seen plenty of S3 objects that are made public so that they can be viewed from a web browser and are just a badly targeted script run away from being on the latest tech blog about how some other institution leaked PII.

They don't go away but you are generally focused a lot more on your apps and you have better tools for many problems, especially because you're getting out of the lower-level stuff which is harder to work with and full of pitfalls. If you're running your own datacenter, you're going to pay for the team of people supporting and debugging basic things — power & cooling, network infrastructure, storage, routing & load-balancing, server hardware & firmware, etc. creating APIs for all of that for your application teams to use, performance and security testing all of the above, etc. in multiple locations. If you haven't done that before, it's really easy to underestimate how many little things will soak up expensive staff time, especially because so much enterprise hardware was designed on the theory that you'd have humans doing all of the work with automation either an afterthought or a perceived chance to sell $$$ add-ons.

Beyond a certain level of scale you can find efficiencies which pay for all of that, but many places aren't at the level where it's an unquestioned win, and doing it yourself means you have to start paying the full cost immediately in the hopes that it'll become a win in the future.

> Ops, security, and reliability don’t go away when using EC2 either, which is what led the massive cloud explosion.

I'd argue that in many cases they do get significantly easier in a cloud environment especially for lightweight setups.

For example, if you run a managed SQL database in AWS, backups are largely handled for you. Upgrades take a handful of clicks and ~5 mins. You don't have to think about patching the underlying operating system.

Broadly speaking, cloud providers give you lots of tools to make things that were fiddly much easier.
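The "handful of clicks" above corresponds to a single API call. A hedged boto3 sketch (the instance identifier and 7-day retention are assumptions, and running it requires boto3 plus AWS credentials):

```python
# Days of automated backups RDS should retain; 0 would disable backups.
BACKUP_RETENTION_DAYS = 7

def enable_automated_backups(instance_id: str = "app-db") -> None:
    """Turn on automated backups for a (hypothetical) RDS instance."""
    import boto3  # lazy import: needs boto3 + AWS credentials to actually run
    boto3.client("rds").modify_db_instance(
        DBInstanceIdentifier=instance_id,
        BackupRetentionPeriod=BACKUP_RETENTION_DAYS,
        ApplyImmediately=True,  # apply now instead of the next maintenance window
    )
```

RDS then handles the nightly snapshots and point-in-time recovery windows itself, which is exactly the fiddly work the comment above is describing.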

A lot of areas of expertise do go away though. I don't need to worry about the electrical features of the building where I'm hosting servers. No worries about whether I set my automated backups for the prod DB right because RDS makes it trivial. No worries about physical security and having to be anal about site access management or physical segregation between hardware and users. And so on and so forth.

Physical server management is in general a liability and cost center.

It's kinda annoying how the valid criticism of cloud solutions is getting drowned out by criticism on strawmen.

Who exactly is saying that no one can afford their own infra anymore?

What some people say is that cloud is usually cheaper than running your own infra. They may be wrong (I'm doubtful myself), but it's a wholly different argument.

It usually takes the form of "ugh then you have to hire dedicated people to manage it."

Which is a valid argument, isn't it? Engineers are expensive! Whether they're more expensive than AWS depends entirely on what you're doing, but the answer is almost always "yes" at a small enough scale.

See my other comment.

At least link it

You don't just have to hire dedicated people. You also have to manage those people effectively.

That tends to lead to the typical old-style IT organization which is nowhere near as responsive to the needs of the business as it could be, and this is a competitive disadvantage.

The value proposition of cloud is much more than just technical. Engineers who don't understand why cloud is "winning" are focusing on the wrong factors. The technical aspect is only a part of the picture.

There's no conspiracy: operating your own hardware in a datacenter requires expertise, start-up costs, and additional time. That power, control and (later) cost-savings doesn't come without a few drawbacks.

Short-term budgets and time windows to meet deliverables shrank along with the costs. That's why so many are on the cloud.

I dunno, you can get a lot of dedicated power for a small amount of money from great hosting companies such as Linode.com.

I have been working with them for the last few years and they've been great. They even roll out new hardware to existing customers for no additional cost.

Blows bad hosting companies out of the water, for example Media Temple, who never upgrade their hardware for existing customers.

The hardware isn’t the expensive part, it’s the people to run it that is.

Using publicly available costing data from last November, I see that an m4.xlarge RDS reserved instance of SQL Server Enterprise Edition with multi-AZ failover is $39,000 / year. If I have 4 DBAs at a cost of $150K / yr salary + benefits, that's the equivalent of around 15 smallish, managed SQL Server instances. My current SQL footprint is on the order of 600 instances, which works out to be a bit over $23M a year. It is true that I've saved the $600K on the expensive personnel, though...
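For what it's worth, the back-of-the-envelope arithmetic in the comment above does work out as stated (every figure here is the commenter's assumption, not authoritative AWS pricing or salary data):

```python
# Sanity check of the comment's figures.
RDS_COST_PER_YEAR = 39_000   # m4.xlarge RDS SQL Server EE, multi-AZ, reserved
DBA_COST_PER_YEAR = 150_000  # assumed salary + benefits per DBA
NUM_DBAS = 4
NUM_INSTANCES = 600

staff_cost = NUM_DBAS * DBA_COST_PER_YEAR            # $600K of personnel
instances_covered = staff_cost / RDS_COST_PER_YEAR   # ~15 managed instances
fleet_cost = NUM_INSTANCES * RDS_COST_PER_YEAR       # ~$23.4M/year on RDS

print(staff_cost, round(instances_covered, 1), fleet_cost)
```

So at 600 instances, the DBA payroll is under 3% of the managed-service bill, which is the crux of the comment.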

As your operation gets bigger, the scale generally starts to tip in favor of running your own hardware. But 600 SQL Server instances is fairly large—when that number is in the single digits, the cost of DBAs/DevOps dwarfs that of the hardware. So the scale of your business matters a lot, and it's not useful to generalize this to all companies.

Edit: Your math seems off as well. Don't forget that you'd have to pay the SQL Server license regardless of where it's hosted, as well as the cost of dedicated hardware if you go that route, so the difference between the two options will be much smaller than $39k per instance. And I don't know if that salary number is real, but $150k including overhead seems really low for a good DBA (at least in the US)—I would expect at least $200k.

One last thing to consider, even with a small team, is that without at least one DBA you may end up burning money by deploying non-optimized, non-performant database workloads, and using Amazon's very lucrative "throw money at the problem" service to buy your way out.

This is very true, but my point still stands: For a smaller team, the overhead of maintaining your own infrastructure is very significant compared to the resources you have. If you have, say, 10 engineers, having 3 of them doing DevOps fulltime means 30% of your engineering spending is going towards your infrastructure, even before we factor in capital costs and all of the other overhead—not to mention room for error—that comes with it (security, etc). And if you're small enough to only be able to afford 10 engineers, you're most likely not going to spend 3 engineers' salaries on the equivalent AWS services. So now you only need maybe a single DevOps person at most, and you get another engineer or maybe two for "free" that can work on your core product. I'm not saying it doesn't come with its own disadvantages, but there are a lot of benefits to going "all in" on cloud services.

And obviously there are no absolutes—there are certain workloads that really do benefit more from dedicated hardware, even at smaller scales, and there are huge companies that are better off with cloud services (ex: Netflix).

I should also have added - SQL on RDS, the license cost is baked in. You have to re-purchase the license every year. You don't have to do that with enterprise licensing, which is a one-time upfront cost. You also can't overprovision, and you can't use alternate licensing models like MSDN.

Well of course it wasn't strictly apples-to-apples, because I did leave off the cost of the DCs, the hardware, and the initial license acquisition. But those costs are shared across multiple teams, amortized over long lifespans, etc., and we also get volume discounts. Even taking all of that into account: for $24M we can, and have, bought about a PB of storage, multiple blades and enclosures, licensed the OS, and accommodated the salaries of the teams we need to run it all. I'd hazard a guess that $24M is a pretty significant portion of the entire IT budget, in fact.

Also you can get bulk discounts for most services if you're running a lot of instances. That probably includes RDS.

Do you have multi-AZ failover now? And probably many of your 600 instances would be fine on a lower tier? On Azure (as an example) one gets 100 of the lowest managed SQL DB tier per 150K/year developer.

On tier 1 stuff, yeah. We certainly have a number of instances that would be fine on a lower tier, like dev instances etc that we can license under MSDN subscriptions, or that honestly would be better off as containers on individual devs' machines. But AWS doesn't offer that, and from those same numbers I see that a single-AZ instance is still $19K a year, every year. That's actually new cost because we can't BYOL for RDS.

How much is the license for SQL enterprise nowadays? still 5k per node or does Microsoft charge per socket now?

Depends on your organization, the licensing model you pick, etc. They generally license per core now, and you have to run at least 4 cores. Without giving too much away I'd say the cost of licensing a 4-core box running Enterprise is on the order of $20K, which you pay one time, up front. Then you just pay for your support contract. You can also do stuff like license an entire VMware farm or failover cluster, and then you can overprovision services for extra cost savings.

Here is an awesome presentation from Rancher about how cloud providers are not providing resources but convenience and want people to buy more and more services. https://m.youtube.com/watch?v=mnu477057W0 full slides etc are here after registration if interested https://info.rancher.com/single-node-standalone-servers-scal...

I think one of the issues is availability across distributed data centers. It's indeed cheaper to buy hardware and set up a server, but ensuring you have data centers in many locations is the problem.

Back then you basically had to do it, so everyone did. These days you have to compete with companies that make use of the flexible offerings of AWS/GoogleCloud/Azure. Which for small companies is a large advantage, since it makes it possible to start up with almost no money, and scale up smoothly, while letting your employees be focused on the product you offer for your customers.

Yeah, and how long did it take to update software back then? Weeks, months, years? Now changes can reliably get pushed to production in hours or even minutes, which has incredible business value.

That's the difference between comparative advantage and absolute advantage.

Yep, and we were running bare metal servers with pretty bad behaviour security-wise.

Verio was a harbinger.

So before, we had an IT team that would maintain the bare metal servers. Now we need "Cloud Engineers" to keep the cloud infrastructure working properly. I don't know if the argument for externalizing server maintenance is valid, since the complexity of cloud services is just increasing every day.

True for those that go for Kubernetes etc straight away. If you just have a few apps to run and put it on a PaaS like Heroku, things are pretty manageable as a side-thing for the regular software developers. No dedicated team, or even person, needed.

When your infrastructure is too simple, you have simple regrets. When your infrastructure is too exotic, you get exciting, bleeding edge regrets.

This is my view based on the work I've done. I like the ability to dispense with capacity planning or dealing with power supplies and fire suppression you get from cloud hosting, but when in any doubt I set up what I need using VMs (be those droplets or EC2 instances or whatever).

I like to imagine that I've avoided wasting weeks by wasting a few hours here and there.

Last physical datacenter my team at the time ran our app in we spent weeks troubleshooting some hardware network driver issues that caused the network to drop. Was an enormous distraction and Dell and VMware support were useless at resolving it for us. I’m glad to have others deal with those low level “oh it must just be your setup” issues.

Anecdotal evidence. I've also seen colleagues spend weeks arguing with AWS support while debugging a weird performance degradation issue, that would have been straight-forward to investigate in a bare metal deployment with full control over everything.

It's not like the cloud is a magical place where no unexpected issues ever happen. Cloud providers can be surprisingly buggy, especially AWS, particularly at scale when you start hitting the implicit scaling limits their docs conveniently forgot to mention.

Each technology has their unique challenges.

> spend weeks arguing with AWS support

Yup. We've run into that repeatedly. The "we didn't notice anything on our side, please send more screenshots and logs" gets really old when working with "managed services".

Network packet losses/truncations, EKS control plane failures, CloudFormation stacks getting stuck in really weird states, inconsistent CloudFormation implementations for new and existing services, ENI weirdness in containers...

Managed services just feel like they aren't - more and more every day. Amazon (or any other cloud) will never have the same investment in your availability and infrastructure as you will.

My experience with AWS support for performance issues is similar. My ticket for an Aurora RDS performance issue ended up in a kind of purgatory where they continued to ask for more information and more tests on my side, ad infinitum. Once one support engineer was satisfied with my results, the ticket would go dark for months, and when I would follow up, another (new) engineer would ask for a whole new round of tests, logs, etc., typically for the latest release.

After a few rounds of back-and-forth, I eventually gave up.

(I even gave them permission to copy test snapshot data, at their request, and sample queries to reproduce the issue. They could have done all the testing they wanted in-house. But that might have actually required them to spend time on their side to diagnose and fix something, I guess.)

It wasn't a critical issue, exactly, and we worked around it in code. I mostly just wanted to report a problem.

Running into one of those exact issues (cloudformation stacks getting stuck in really weird states) right now. A real pain in the ass, because I have zero control or visibility into it.

Is switching to Terraform an option?

Based on our own research - probably. But it requires tooling to integrate into a CI/CD pipeline, conversion of existing CF stack definitions to Terraform definitions, and a lossy/downtime laden migration.

For a team with the goals of IAC, hands-off, and high-availability infrastructure, migrating is a pain.

With Terraform's import functionality, you might be able to migrate into Terraform without any downtime. I believe there are even utilities to translate CloudFormation code straight into Terraform, though I've never tried them.

It would definitely be an annoying migration, even if import works for your use-case, but honestly, dealing with CloudFormation is so irritating on a day-to-day basis for me that I'd consider it worth it.
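As a rough sketch of that workflow (the resource names and IDs below are made up for illustration), you first declare the existing resource in Terraform, then attach the live AWS object to it:

```hcl
# Terraform definition matching a security group that already exists in AWS.
# The attributes must mirror the real resource, or a later `terraform plan`
# will propose changes after the import.
resource "aws_security_group" "app" {
  name   = "app-sg"       # hypothetical name
  vpc_id = "vpc-0abc1234" # hypothetical VPC ID
}
```

Running `terraform import aws_security_group.app sg-0def5678` then records the live resource in Terraform's state without recreating it, and a follow-up `terraform plan` reveals any drift between the code and reality.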

Cloudformation is a tragedy in itself.

The idea behind CloudFormation is great, though. Platform-native IAC with promised first party support for all future projects, and backfilling existing services. Plus, it supports deep integrations with first-party supported configuration management services.

The problem is that the reality has not lived up to the promise, and "first party support" means "only the first party can support".

CloudFormation is a declarative state management language and framework just like Terraform. The problem is that CF abstracts said state away from you to the degree that you can't hack around it and you wind up forcibly deleting resources if you try to use it like a configuration management framework. With CF Custom Resources you can add all sorts of other stuff and that's pretty cool at least.

It's infinitely better than what Azure provides, IMO.

I am a Terraform convert, I agree.

You need to treat a commodity like a commodity. It's perfectly OK to run business-critical processes in someone's cloud, as long as you have a fully automated process that will deploy your entire environment from the latest backup to any set of cloud nodes at any cloud provider you point it to. Unless you're doing that, you're just shifting risks, not eliminating them.

Yup, the famous "leaky abstractions"[0]. You can make some reliable assumptions of the data rate of a 40Gbps link in your DC between this and that physical server, and how to get the best out of it. On AWS? Not so much.

0: https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...

I've seen absurd bugs at AWS, GCP, and Azure. I love dynamic capacity, but clouds are super flaky, especially AWS. Azure networks are the worst. GCP I haven't used in over a year, so I won't talk...

I’ve had performance issues with GCP, however I can switch cloud providers in that case without too much effort. With physical hardware that’s much harder.

If you have a well-built and automated bare-metal deployment, it's no harder than switching cloud providers. At the scale where it makes sense to go bare metal, either would be a lot of effort.

An extra abstraction layer like k8s makes it a lot easier, which is exciting. It's also precisely why IBM bought Red Hat - a well-built k8s distribution like OpenShift is one of the very few real alternative to public clouds for many companies.

> An extra abstraction layer like k8s makes it a lot easier, which is exciting. It's also precisely why IBM bought Red Hat - a well-built k8s distribution like OpenShift is one of the very few real alternative to public clouds for many companies.

OpenShift is built on a prescriptive mentality. They choose everything for you from OS through pipeline and many deviations are unsupported. As you mentioned there are others, but depending on your environment OpenShift is either a very good fit or square-peg-round-hole.

OpenShift is a superset of k8s, so you're free to customize it (you just won't get support if it breaks).

Openshift doesn't choose the pipeline. You can deploy to it however you want. Openshift is literally the same exact thing as kubernetes, with a simpler api and a nice ui. Openshift is extremely flexible and you can use it however you want. Don't know why you think it's so restrictive.

It has the same API as k8s. It's literally the same thing, OpenShift just adds some PaaS feature on top of k8s.

Which is precisely why I've run it for over 2 years now. The devs on my team love throwing a git URL into OpenShift and have a project go live in mere minutes, previously a more senior developer would need to setup a CI pipeline and deployment automation and even with experience it was a chore.

It's not like I'm personally afraid of k8s or building docker images myself either, but anything to help my team be more productive without needing to hold hands or do gruntwork on their behalf improves the lives of everyone.

Exactly right. I don't understand all those teams reinventing the wheel building custom k8s distributions, rather than using something like OpenShift and focusing on their core business.

Running custom k8s in production is the 2019 equivalent of using Gentoo.

You're forgetting the network needs to be replaced too; there are lead times on getting hardware; there's the whole racking and cabling; and even then, if this is due to a more complex networking issue, I can't swap out the network, since there are too many components and they're shared by other teams. In the cloud I can do this quickly, though.

You wait until you have to debug something like that with a VPC. We had to spend a couple of days convincing our cloud support that the issue was on their end, during a full site outage.

Counterpoint: These issues still exist, you just lose visibility into them by putting someone else in between. And usually, the someone else is big enough that you're not a priority.

With limited resources, you have to balance stability against more control over your infra. IMO, small companies can't afford both, but they also likely don't need both at that size.

B.S. I was solo IT for a 50+ppl company, on a three-person team with a CIO/DBA and a data manager. I supported a full Windows/Exchange LAN with heavy printing needs, tape backups, internal/external infrastructural services and webservers with LOB java webapps, remote users, etc. Through all this I learned Perl from scratch, rearchitected the business systems (hardware) and reduced their production time by like 90% (two weeks to two days) with zero downtime (primordial blue-green, as it turns out), and wrote a completely new ETL flow with my new Perl skills. From four servers under a table to four racks, SaaS architecture is exceedingly simple next to all this, and the company spent probably around $10K/mo all told, maybe up to $20k while I was buying hardware.

Maybe you simply haven't seen this kind of operation in action before?

At the scale where owning infrastructure makes sense, that's not a distraction but some team's full-time mission.

Even before that tipping point, you're likely to have a team dedicated to working with infrastructure hosted by a cloud provider. Based off past (hardware) and current (AWS) experience, the team sizes aren't that different.

We did actually have a team who handles hardware and a networking department, and they couldn’t figure it out. So lots of people had to get involved to do a binary search for the problem. I should mention this is a company with considerable resources.

In my last team, we suffered from a CPU bug and an SSD bug that caused painful debugging nights. We used multithreading heavily, and it could have been library/kernel issues or bugs in our own code, which left a massive amount of potential target code to sort through initially. I don't think small companies can afford such a gigantic debugging effort, so they either have to rely on luck (or flaky tests) or on someone else who can afford it (a cloud/infra service provider, or a startup that depends on you).

How is that related to bare metal vs. clouds? That could have happened on either.

Big cloud providers and some big VC-backed companies have the resources to figure out these issues by wrestling with the gigantic and unyielding vendors, or by working around them with their engineering resources. Not many teams have the time to even want to understand the issues. This is not an inherent problem of bare metal, but it is one aspect to consider if you were to choose that path. But you are right, cloud providers have their own quirks.

> Big cloud providers or some big VC backed companies have the resource to figure out the issues by wrestling with the gigantic and unyielding vendors, or by working around it with their engineering resource.

This is all operating on the assumption that those entities will care enough about your problem enough to do something meaningful to fix it. As others have stated here, at least in the context of "Big cloud providers", that frequently is not the case. Often you just get the runaround (continuous delays, requests for "more information", other attempts to stall in first level support).

Add a third party "managing" some piece of your company's infrastructure via a cloud provider into the mix and it often gets even worse.

It is true that there are real costs to having your own hardware/software onsite and people who know how to manage it. However, the promised reductions in cost/hassle of moving things offsite are frequently offset or even exceeded by the costs/hassles you get by not having control of things yourself.

Except they won't. Not for an esoteric or exotic bug that doesn't affect 99.9% of their customers. Unless your business's contract costs more than that team of technical and engineering resources (e.g. you're Netflix et al.), it's not worth their time to bother looking into it.

All that assumes they even admit the issue is their hardware not your code, which is a mighty big assumption itself.

Cloud providers also tend to run much more exotic configs for pretty much everything (since tenant isolation is a top priority), so such issues are more likely to happen in the first place.

Because when you're caught between those who manage the hardware and those who manage the K8s cluster, you get ping-ponged between them in the blame shifting. It's very annoying.

Until you hit this can not happen we will not spend time looking into it issue in a cloud :)

This is basically the "on-prem/hybrid cloud" business which everyone's after (IBM, Google, AWS, MS). The market seems to be pushing towards this model. The old businesses were naturally not moving to "cloud" so the cloud wants to come to their house (which makes sense).

The next-to-last line cracked me up: "The costs can be much lower… or much higher!"

The cost of shiny new tech. "Hey look at all these cool things you could / can do that would lower costs and do other cool things."

Weeks or months later you're still learning about how you do the thing that will maybe lower costs and do cool things ....

Meanwhile the market and mindshare has moved on to the next shiny new tech.

Kubernetes has been here for a while. I'm sure we can agree that it ceased to be a candidate for a meteoric fad for a while now.

I am not a Kubernetes expert, but I feel like Kubernetes still has such a high rate of change that it's still in that phase where you just don't know what you might get out of it... maybe not "just a fad," but still variable enough that who knows where we'll be, or whether in two years it's all different.

I've probably wasted more than three weeks installing Kubernetes on 3 nodes. I've tried Rancher, RKE, kubeadm, kubespray, and doing everything myself. I always failed, and the way it failed was completely opaque; I didn't feel like I could even understand what was going wrong.

Don't install k8s. Use it managed. It's best that way. Too many moving parts for a small op to manage.

Openstack had similar hype (though more on the systems side than the developers), and it seems to finally be fizzling out.

I used to enjoy having a cheap desktop under my table or in a closet serving traffic to people across the internet. Computers have gotten faster, software has gotten better, networks have gotten faster, things have gotten cheaper.

Sadly, instead of seeing more of these, most of this is now outsourced to cloud providers; we have bought and drunk the kool-aid that they can do it better and cheap. Which is not true. Perhaps this is why it's so hard to find good Unix and network admins these days.

As an ex big iron commercial UNIX sysadmin, let me tell you; they can do it way better than most sysadmins.

Developing IT policies for physical security, server security, update policies, purchasing, wiring, and so on takes a huge amount of knowledge, and the possibility that a middle-of-the-road sysadmin is not only competent, but excellent, in all those areas is practically zero.

Cloud providers have world-class experts in each area whose sole responsibility is providing for their specific area. It's economies of scale at its best. It doesn't mean that it's always the best choice, but having to have someone do full-time server maintenance at no additional value to yourself or customers is just a huge drag.

> As an ex big iron commercial UNIX sysadmin, let me tell you; they can do it way better than most sysadmins.

I'm pretty sure that is true. However, the parent said "they can do it better and cheap", and cheap it isn't.

I don't think cost is the benefit of cloud. For tech companies the benefit is scale (you won't know how popular a product is until you deploy). For non-tech companies the benefit is not having to deal with things that aren't related to your core business.

I think you can argue that tech is always related to your core business, and that you only need a few servers to handle a million simultaneous users, but those kinds of arguments usually aren't good enough for businessmen or entrepreneurs.

Yeah, you don't know how big you'll need to scale, but I think that's a bit of a boogeyman for a lot of companies. Servers have a relatively massive amount of cores and memory these days, along with screaming fast IO, and you can handle quite a lot of web traffic with only a few dual socket servers if you don't bog them down with incredibly inefficient code.

> we have bought and drank the kool-aid that they can do it better and cheap

The few times I've done the analysis, cloud servers tend to cost about as much as self-hosting, when you factor in the cost of electricity.

There are reasons one may argue against cloud servers, on the basis of the internet's health, user freedom, etc. But for me and for many people they are at least both better and cheaper.
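A back-of-envelope version of that analysis (every number below is an assumption, not a real quote) just amortizes the hardware over its lifespan and adds electricity:

```python
# Rough monthly cost of a self-hosted box vs. a cloud instance.
# All figures are illustrative assumptions.

def self_hosted_monthly(hw_cost, lifespan_months, watts, kwh_price):
    """Amortized hardware cost plus electricity per month (~720 hours)."""
    electricity = (watts / 1000) * 720 * kwh_price
    return hw_cost / lifespan_months + electricity

server = self_hosted_monthly(hw_cost=1500, lifespan_months=36,
                             watts=120, kwh_price=0.15)
cloud = 55.0  # assumed price of a comparable cloud instance per month

print(f"self-hosted: ${server:.2f}/mo vs cloud: ${cloud:.2f}/mo")
```

With these particular assumptions the two land within a dollar of each other (around $55/month either way), which matches the "costs about as much" experience; colo fees, bandwidth, and admin time would shift the result in either direction.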

If Knative matures in a couple of years, then with it on top of k8s we may see a slow migration back to self-hosting where it makes sense.

How does it work with the consumer ISP? Can you elaborate on any special considerations? Obviously not for customers, but I'm interested for it as a hobby thing.

Some thoughts:

1) Do you really need to invest a million dollars in bin-packing containerized stateless microservices?

2) You don't have to use K8s to get the benefits of rolling your own servers. In fact, I'd argue you should do the latter well before you do the former.

3) Definitely hire someone who has done it before. You will save so much time and money your head will spin.

4) Do not just build a rack full of random commodity gear. Make sure it is suited specifically for your purposes, and then weigh the cost of service contracts and managed colo against a $100k+ cage monkey on call 24/7.

5) Do not fall for the "We've got <insert tech hype>, we don't need redundant hardware!" lie. The more parts of your system rely on lots of hardware, the more fragile your system becomes. Distributed decentralized services become a PITA when the underlying gear is flaky, and centralized services require it. Do not underestimate the shittiness of your colo; always design for the most redundancy you can get for what you have. If you can run dual power, do it. Dual network stacks, do it. Redundant disks, do it. Remote management, do it. Outside modem to a management port on the router, do it. Always be postponing entropy. Later, when you become a FAANG (yeah right) you'll have the time and money to automate away some of the redundancy issues.

6) It's sometimes harder to upgrade disk or bandwidth on an existing machine than it is to just buy another machine, but the more machines you have, the more problems (and overhead costs), and there is an upward limit on the scaling of most services without rebuilding them. So buy big on local disk and network, and buy new machines to expand cpu and memory. The redundancy and performance issues inherent to SAN/NAS often make local disks a better choice, as they are unlikely to impact your whole network at once, you can fix/upgrade them piecemeal, and they are less difficult to manage/operate.

7) Don't DIY critical components such as data replication; it's surprisingly hard. In fact, don't even believe vendors if they claim they can solve X for you with some magic software or hardware. Get them to show you a live demo on your own network.

8) Don't forget that in 3 years you'll be replacing it all.

There are also kubespray (Ansible provisioning of Kubernetes), kops, and kubeadm.

Rolling Kubernetes on your own is quite hard, though, especially the networking part, so there should be a market for a company helping out with provisioning.

Using the public cloud can be very expensive for some workloads that were not written for the cloud, due to higher resource usage. It is still quite hard to forecast cloud costs due to the many moving parts and micro-billing for each item.

With traditional service providers it can be easier to budget for the service expenses.

I did a post on installing DC/OS a few years back. I was going to follow up with a K8s post, but between the official Ansible scripts, the (at the time) very alpha kubeadm, broken pod logging and WeaveNet that would constantly fail with no logs, I ended up giving up on that post:


I remember the manual-ish server provisioning days of Autoyast and RedHat's Kickstart. It's not that difficult to bootstrap physical servers to get them to the point of being nodes in a Docker scheduling system. Rancher is another great tool for setting up K8s (which you can run against physical nodes once you get them past the initial install/bootstrap).

I worked at one place where they took a hybrid approach. They had DC/OS running in the local VMWare data center (which we were migrating to from AWS to save costs), but there were nodes that could scale to AWS as well and you could label your deployments for which data center you needed them to run in. A lot of high volume loads we were able to move back to our data center, but we also had a kick-ass platform team and our AWS bill was well over $150k/month.

The trouble is it's almost impossible to do a real cost comparison. There are just too many factors, and when you start going AWS, GCE or Azure, you start using all their other managed services that may not have self-hosted replacements, not to mention off-setting the costs of setting up master/slaves, backups, snapshots, redundancy and other operations tasks that come by default or with an additional price from a hosted solution.

From my limited experience, you'd save a lot on the cloud if you replaced the default IngressController. Google Cloud, for instance, will spin up a fresh load balancer for every service by default and those cost money.

Of course, it's not an easy task as a beginner to use nginx or traefik in its place, and to handle the complexity of that deploy.
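As a sketch of what that consolidation looks like (hostnames and service names here are hypothetical), a single nginx-class Ingress using the 2019-era `extensions/v1beta1` API can route several services through one load balancer instead of provisioning one per service:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: shared-ingress
  annotations:
    # Handled by an in-cluster nginx controller rather than a cloud LB
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        backend:
          serviceName: api-svc   # hypothetical backing services
          servicePort: 80
      - path: /web
        backend:
          serviceName: web-svc
          servicePort: 80
```

Only the nginx controller's own Service of type LoadBalancer costs money; everything behind it is plain in-cluster routing.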

Approximately 100% of the GKE users I know about use a reverse proxy behind the GCE Ingress. FWIW I'm not especially ops-y and configuring ambassador took me about a day.

It is better now with kubeadm. There are still lots of things to learn and lots of breakage that can happen, but after months of learning I now have my own cluster and it works perfectly. It mounts an NFS store, also uses local storage, and I plan to play with OpenEBS. Nginx ingress + certbot (Let's Encrypt auto-renewal) works fine. Logging can be done with Fluentd and Graylog, Prometheus for performance measures, MetalLB as a load-balancing replacement on L2, and Calico for networking and NetworkPolicies. Cloud providers definitely have some added benefits and less friction, but "do-it-yourself" is well possible as well, and it helps you understand the individual pieces K8s is made of and how to debug them.
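For reference, MetalLB's L2 mode was, at the time, configured with a single ConfigMap (the address pool below is a made-up home-LAN range):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.1.240-192.168.1.250   # IPs MetalLB may hand out
```

Services of type LoadBalancer then get an IP from that pool, announced via ARP on the local network, standing in for the load balancers a cloud provider would otherwise supply.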

isn't this what they all sell nowadays? A prettier UI on top of kubernetes?

- Kubermatic by loodse

- rancheros

- even mesosphere is kinda that now, except the first two are open source, but definitely not the only solutions

I personally deploy kubespray here and there, but I would still recommend smartos for normal humans, unless they want to setup an elk + kafka cluster or whatever else you might want to do.

upvoted for triton+smartos - hear hear

The article seems to focus on K8s with reference to micro-services. How well does K8s do if you're running a monolith?

All of the following assumes you plan on running more than a single instance of your monolithic application. If that's not the case, then ignore Kubernetes, and be glad you don't have the problems it was designed to solve.

If you consider what it takes to manage the end-to-end lifecycle of a single application, monolith or micro-service, you need a solution for the following items: deployments, application configuration, high availability, log and metrics aggregation, autoscaling, and load balancing across multiple application instances.

Kubernetes provides an opinionated way of doing all of those things. For example, Kubernetes leverages container images and declarative configs for packaging and deploying applications. For many people this approach is much simpler than what Puppet, Chef, and Ansible bring to the table in terms of managing applications.

When it comes to high availability Kubernetes provides an orchestration layer across multiple machines, grouped in clusters, that deals with distributing applications based on resource requirements and automatically responding to node and application failures. When applications crash, Kubernetes restarts them. When nodes fail, Kubernetes reschedules the applications to healthy nodes, and avoids the failed nodes in the future.

Many of the patterns for managing applications, even monoliths across a handful of nodes, Kubernetes provides out of the box. In essence, Kubernetes is the sum of all the bash scripts and best practices that most system administrators would cobble together over time, presented as a single system behind a declarative set of APIs.
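As a hedged illustration of that declarative approach (the image name, port, and probe path are placeholders), a minimal Deployment for a monolith gets restarts, rescheduling, and spreading across nodes out of the box:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monolith
spec:
  replicas: 3                 # run three copies for availability
  selector:
    matchLabels:
      app: monolith
  template:
    metadata:
      labels:
        app: monolith
    spec:
      containers:
      - name: app
        image: registry.example.com/monolith:1.0   # hypothetical image
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m          # used by the scheduler for bin-packing
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /healthz     # restart the container if this check fails
            port: 8080
```

Kubernetes keeps the three replicas running, restarting crashed containers and rescheduling them off failed nodes, which is exactly the "sum of all the bash scripts" behavior described above.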

One other major caveat to all of this.

Just like I would not recommend standing up OpenStack from the ground up in order to deploy your monolithic application across a set of virtual machines, I don't recommend rolling your own Kubernetes cluster either. You should strongly consider leveraging a fully managed Kubernetes offering such as Google Kubernetes Engine, Digital Ocean's Managed Kubernetes, or Azure Kubernetes Service.

> Just like I would not recommend standing up OpenStack from the ground up in order to deploy your monolithic application across a set of virtual machines, I don't recommend rolling your own Kubernetes cluster either. You should strongly consider leveraging a fully managed Kubernetes offering such as Google Kubernetes Engine, Digital Ocean's Managed Kubernetes, or Azure Kubernetes Service.

The rest seems reasonable, but I disagree strongly with that claim. If it's worth using the tool then it's also worth learning how it works.

Even just kubespraying a cluster myself helped me build a much stronger mental model of how Kubernetes works than trying to take over a colleague's black box Kops setup. GKE or another managed service would have been even worse.

Setting up a small cluster isn't that hard, and it will teach you a lot about the internals and how things can go south (and what to do when that inevitably happens).

FYI: the person you are responding to is the author of Kubernetes The Hard Way [1], which is effectively a tutorial of learning how all the K8s pieces work together. He also co-authored the first book on it [2]. He's also a Google employee, but I would trust his opinion more than others just because he's probably seen more use cases than anyone else.

[1] https://github.com/kelseyhightower/kubernetes-the-hard-way

[2] https://www.amazon.com/Kubernetes-Running-Dive-Future-Infras...

"I don't recommend rolling your own Kubernetes cluster either. You should strongly consider leveraging a fully managed Kubernetes offering such as Google Kubernetes Engine, Digital Ocean's Managed Kubernetes, or Azure Kubernetes Service."

Working for one of those vendors. There is no cloud, just someone else's computer.

You might find this slide deck I wrote useful: "Migrating Legacy Monoliths to Cloud Native Microservices Architectures on Kubernetes"


Using kubernetes without microservices is like using hadoop on a 1000 row csv file

I don't know if that's a fair comparison. If you deploy a monolith, you have to set up a custom multi-data-center deploy and manage all the load balancing, disaster recovery, and volume management (databases) yourself. Kubernetes isn't that hard to set up and gives you out-of-the-box solutions to all of that.

It also gives more flexibility on adding additional services around your monolith. Elk stack, kafka, that kind of stuff. Also gives you a standard api to deploy against. I don't think your metaphor holds up.

This is a poor analogy. To quote a sibling comment:

"If you consider what it takes to manage the end-to-end lifecycle of a single application, monolith or micro-service, you need a solution for the following items: deployments, application configuration, high availability, log and metrics aggregation, autoscaling, and load balancing across multiple application instances.

Kubernetes provides an opinionated way of doing all of those things. For example, Kubernetes leverages container images and declarative configs for packaging and deploying applications. For many people this approach is much simpler than what Puppet, Chef, and Ansible bring to the table in terms of managing applications."
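To give a flavor of that declarative style, a minimal Deployment manifest for a monolith can be as small as this (the image name and port are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monolith
spec:
  replicas: 3              # HA: schedule three copies across nodes
  selector:
    matchLabels:
      app: monolith
  template:
    metadata:
      labels:
        app: monolith
    spec:
      containers:
      - name: monolith
        image: registry.example.com/monolith:1.0.0  # placeholder image
        ports:
        - containerPort: 8080
```

Everything else (rollouts, rescheduling on node failure, load balancing via a Service) falls out of this one declaration.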

Using Kubernetes even with a monolith can be pretty nice, if (big if here) Kubernetes behaves.

Ah, as someone not well versed with K8s, thanks for that analogy.

I’ve done this for small projects using AWS and Scaleway.

The setup is simple and it just works, but the maintenance and security aspects were more worrying. At one point I found out that one of my nodes had failed to automatically apply security patches from my distro, and I didn't detect it sooner because my monitoring was lacking. The overhead escalates fast.

If I start an effort like this again today, I will make sure to get monitoring right from day 1. Possibly a combination of Prometheus with Grafana, or something along the lines of Elastic APM.
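For what it's worth, getting basic host monitoring going is only a few lines of Prometheus config; a bare-bones scrape config for node_exporter might look like this (the target hostnames are placeholders):

```yaml
# prometheus.yml — scrape node_exporter on each host
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node1:9100', 'node2:9100']  # placeholder hosts
```

With that plus an alert rule on missed patch/reboot metrics, the failed-update scenario above would have paged instead of lingering.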

So what is Gravity? kops/kubespray without an MIT license?

Without MIT... but with Apache2! :)


In all seriousness though, Gravity is an open-core toolkit for packaging and delivering complicated sets of microservices into air-gapped and restricted environments as a virtual installable appliance.

It takes care of both application and Kubernetes lifecycle, software distribution and licensing workflows.

If you get to build your own servers, why torment yourself with a 2-socket solution? It's just harder to program, with no benefits. Single-socket UMA nodes will be more than large enough, and a lot easier to program and manage.

I once heard that the reason some people build their own servers is that they don't trust service providers on the internet; they fear the data stored there is too easily breached.

Thanks to these guys, they just lower the unemployment rate a little more!

> A “starter pack” 15amp cabinet with a gigabit connection can be rented for as little as $400 per month.

Can't speak about the power given on their $400/mo cabinet deal.

Yeah, this isn't possible for an entire rack; maybe 1U, if it includes network and power, in any decent datacenter.

How much would a 48U rack like the one described in the article cost per month? I would guess thousands, if not tens of thousands...

It mostly goes by power, not rack units; if you have 48U and only 15 amps, you're not going to fill it up. 15A is not a lot.

15A is more than you think these days. You'd still probably need more for a full rack, but a modern server can draw around 150W under full load. Hell, my entire rack at home hangs around ~3A during normal usage (2x Dell R210 IIs, a Dell R320, a Dell R520, a Lenovo SA120 with 4 bays filled, a Juniper EX2200-48T, and a Mikrotik CRS317).
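Rough back-of-envelope for that 15A circuit, assuming a 120V feed and the usual 80% continuous-load derate (your colo's voltage and contract terms may differ):

```python
# How many ~150W servers fit on a derated 15A/120V circuit?
volts = 120              # assumed feed; many DCs supply 208V instead
amps = 15
derate = 0.8             # breakers are typically run at 80% of rated load
watts_per_server = 150   # the full-load figure mentioned above

usable_watts = volts * amps * derate
servers = int(usable_watts // watts_per_server)
print(usable_watts, servers)  # 1440.0 W usable, 9 servers
```

So even a modest cabinet feed supports a fair number of efficient 1U boxes before power, not space, becomes the constraint.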

Can I ask you what do you use all those servers for? Looks like a not-so-small office rack config

FreeNAS for media and VM storage on the R320+SA120.

The R520 runs Proxmox and hosts whatever stuff I’m playing with at any given time along with Kubernetes, FreeIPA, Graylog2, probably my mailserver again soon, etc.

One R210 II runs Sophos XG to handle router+security duties, the other runs Windows Server Essentials 2016 for AD and NPS.

The CRS317 switch provides 10Gb networking for storage on a couple of servers, and everything else is connected to the EX2200.

So, $400/mo is overkill power-wise. What would the bandwidth bill look like? And how much of an up-front investment is 48U worth of hardware? Seems expensive...

I'm spending $375/mo for 8U, 12A, and 300Mbit unlimited (3 x 100Mbit bonded). The 8U gives me a GbE private network, 120 cores or so, 350GB of RAM, and 20TB or so of storage.

The servers cost me $1,200-$2,200 each, and I would buy a new one every quarter until I got to where I am today. I haven't added new hardware in almost 2 years, but it's more than capable of handling my workloads (mostly web apps, plus lots of servers satisfying SQL, LDAP, etc. dependencies). I burst (and fail over) to all the clouds today.


The problem is not running Kubernetes on "bare metal". That's the somewhat easy and cheap part. The problem is the supply chain to spec, purchase, rack, validate, provision, and keep up to date your servers, their drivers, and firmware. It's your capacity planning. It's keeping up with the rate of hardware evolution. What happens when you have to go from 10G to 25G networking? Oh, and you also need a solid, scalable L4 load-balancing solution (Facebook has a nice open-source data plane based on eBPF).

You can build / assemble most of the classic IaaS bits (compute/storage/network LB) from open-source pieces. It's doable, and the open-source solutions are pretty good. But unless you have 10k compute nodes, how can you bear the cost of keeping the expert knowledge needed to debug all this on staff?

Hardware sucks and is always on fire. Once you take this into account, the economic equation changes dramatically.

I used to specialize in 2-10 rack build-outs, but it's been nearly a decade since those gigs evaporated. I wonder if k8s creates a new niche for it.

These kinds of gigs are definitely nowhere near as common as they once were, but they haven't evaporated completely. I work in emergency services infrastructure, currently in the middle of a 10-rack build, and that is something I don't see changing in the next 10 years, over which we will do a refresh once, possibly twice more.

The complexity in our case is in the backend transmission network, and unless we control everything (hardware, network) end to end as much as possible, we can't come close to meeting our SLAs for our customers. They wouldn't allow us to run our system in the cloud even if we wanted to.

I am curious to see how they recommend handling storage on bare metal k8s.

In many environments, you don't actually need distributed storage:

- Application servers are stateless.

- All state lives in stateful databases.

- Those databases handle high availability at the application layer by replicating data, so you can use local storage and deploy them using StatefulSets. This has the extra benefit of avoiding the network storage latency penalty.
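As a sketch of that pattern, a replicated database can pin its data to fast local disks via a StatefulSet plus a local-volume StorageClass (the image, StorageClass name, and sizes are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 3              # replication/failover handled by the DB itself
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: postgres:11  # placeholder; any app-level-replicating DB works
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: local-nvme   # hypothetical local-volume StorageClass
      resources:
        requests:
          storage: 100Gi
```

Each replica gets its own PersistentVolumeClaim bound to a disk on the node it lands on, so no distributed storage layer is involved.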

But StatefulSets require storage...someplace.

We're dealing with that exact issue right now. Our K8s cluster is not playing nicely with a NetApp NAS. We are exploring other options such as local iSCSI storage volumes ($$$) or an external state-store DB (slow).

The point is that with native HA, it doesn't have to be distributed storage. You can use locally attached SSDs or NVMe drives, which are straightforward to manage and fast.

Importantly, they also have independent failure modes. I’ve never known a SAN to be within an order of magnitude of the advertised uptime — and some interesting failures have been caused by HA features. Local storage avoids whole classes of problem and has much greater aggregate performance if you can make it work for your app.

I highly recommend Rook [1], which is based on Ceph, provides PersistentVolumes to k8s workloads, and itself runs on k8s.

[1] - https://rook.io/
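For reference, the heart of a Rook deployment is a single CephCluster custom resource; a minimal sketch (the image tag and device filter are illustrative, so check the Rook docs for your version) looks roughly like:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v14.2       # Nautilus, per the 1.0 release notes
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                     # three monitors for quorum
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^sd[b-z]"     # hypothetical: claim all non-boot disks
```

The Rook operator then turns the raw disks matching the filter into Ceph OSDs and exposes them through StorageClasses.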

I worked on Colossus at Google and Ceph is the closest thing out there. Gregory Farnum gave a great talk about it at the Open Source Summit 2017.

I heard from someone at a large company, though, that it's not getting a lot of love from Red Hat nowadays, even if you're a large paying customer. Now I'm curious.

I'm working on Rook+Ceph at Red Hat. Rook 1.0 was just released last week which added support for the very latest Nautilus release of Ceph.

Great to hear. I couldn't get more details from them, but it was something about both rebalancing and data recovery. They picked SwiftStack, which is at least in part open source. Maybe by now you have figured out who that company is.

My team is still very interested in Ceph and Rook, for the record.

I'm curious also. I don't have any special insight into RedHat, but rook went 1.0 just last week: https://blog.rook.io/rook-v1-0-a-major-milestone-689ca4c7550...

That announcement gave me the impression the project is making active progress.

Really curious about Rook. The stakes of running something like that on K8S successfully seem quite a bit higher than stateless apps, and commensurately the rewards as well.

Big fan of Rook myself. We took several nodes loaded with 16 disk drives each and used Rook to build a Ceph and Minio cluster out of them. Great storage solution.

Have you done any disaster recovery exercises? We've started to play with it, but haven't gone that far yet.

Me too.

These days I usually recommend that any data an app wants to persist be written to an object store or database.

And that any app that needs to write critical data to a disk should be provisioned with config management on bare metal (or pods statically assigned to machines with attached storage if K8s is a must).

This is mostly due to how tough the problem of storage is on K8s, and how much is at risk when it goes south.

I have been using OpenEBS (https://www.openebs.io/) at work and it's been pretty stable thus far.

link is broken

Yup, 404. Even manually browsing their blog, I can't find the entry anymore. I had bookmarked it because it was supposed to be an ongoing series with multiple parts. Weird provider. Probably best to avoid.

I have a feeling they deleted the blog post. There is an archived copy here: https://web.archive.org/web/20190508123542/https://gravitati...

You can colo your servers and roll your own k8s clusters. But you can't afford the luxury of high-throughput networking, EBS, etc., and don't forget HA options (multi-AZ) for your clusters.

Untrue. I purchased a 48 port 10GbE Arista 7148 the other month on eBay for $320.

Nice! OT, but have you seen any reasonably priced RJ45 10GbE gear out there? Preferably without screaming fans :-)

You really do not want RJ45 / 10GBase-T at higher network speeds. It consumes a lot more power and has significantly higher latency than SFP+/SFP28 with fiber cables.

I have a Brocade ICX 6450 with 4 10G ports in my homelab and replaced the fan with a whisper-silent one. I can't hear it, and it's right next to me. The Brocade ICX 6610 has 16 10G ports and 2 40G ports; not super quiet, but not that noisy either.

The only issue is that I have three 10GbE devices, all of them 10GBase-T, which makes the SFP option less attractive.

That said, I'm very much a networking noob, so maybe there's a way to make that work with some sort of copper to SFP bridge or something. Not sure if that would kill the advantages, even if it existed.

Good idea on replacing the fan rather than hunting for a quiet router!

I haven't. I've also not heard that great of things about it, but we run it in our DCs so I assume it works well enough.

Ah, yeah, looking for something for my home office (hence the preference for no screaming 1U fans), so I wouldn't have very long runs. But I have heard that 10 GbE RJ45 can be finicky.
