"On June 6, 2020, your Google Kubernetes Engine (GKE) clusters will start accruing a management fee, with an exemption for all Anthos GKE clusters and one free zonal cluster.
We’re making some changes to the way we offer Google Kubernetes Engine (GKE). Starting June 6, 2020, your GKE clusters will accrue a management fee of $0.10 per cluster per hour, irrespective of cluster size and topology. We’re also introducing a Service Level Agreement (SLA) that’s financially backed with a guaranteed availability of 99.95% for regional clusters and 99.5% for zonal clusters running a version of GKE available through the Stable release channel. Below, you’ll find additional details about the new SLA and information to help you reduce your costs."
Oh, and per their docs, the three-control-planes decision is not reversible - I cannot in fact shut two of those down without shutting down my production cluster and starting a new one. https://cloud.google.com/kubernetes-engine/docs/concepts/typ...
Awful. Just so awful.
Edit: To answer some questions below - we have a single-tenant model where we run an instance of our async discussion tool (https://aether.app) per customer for better isolation, that’s why we had bought into Kubernetes / GCP. Since we have our own hypervisor inside the clusters, it makes me wonder whether we can just deploy multiple hypervisors into the same cluster, or remove the Kubernetes dependency and run this on a Docker runtime in a more classical environment.
There's also one completely free zonal cluster for hobbyist projects.
The main issue is that a free control plane and a paid control plane lead to two very different Kubernetes architectures, and as per your docs, those decisions made at the start are very much set in stone. You cannot change your cluster from a regional cluster to a single-zone cluster, for example. So you have customers who built their stacks around your free control plane, and you’re turning the screws by adding a cost for it — but they cannot change the type of their cluster to optimise their spend, since, per your docs, those decisions are set in stone. That’s entrapment.
You should keep existing clusters in the pricing model they’ve been built in, and apply this change for clusters created after today.
That said, many of us made a bet on GCP. For us in particular, we made that bet to the point that our SQL servers are on AWS, but we still switched to GCP for ‘better’ Kubernetes and because it didn’t nickel-and-dime us, since AWS had a charge that looked like it was designed to convey that they’d much rather have you use their own stuff than Kubernetes. It is a relatively trivial amount, but it makes a world of difference in how it feels, and you guys know more than anyone how many of these GCP vs AWS decisions are made not on data sheets but on the ‘general feel’, for lack of a better word.
AWS’ message is that they’re the staid, sort of old fashioned, but reliable business partner. GCP’s message, as of this morning, is stop using GCP.
These changes won't take effect until June - customers won't start getting billed immediately. I'm sorry that you feel trapped, that's not our intention.
> You should keep existing clusters in the pricing model they’ve been built in, and apply this change for clusters created after today.
This is great feedback, but clusters should be treated like cattle, not pets. I'd love to learn more about why your clusters must be so static.
What’s inside our clusters are indeed cattle, but the clusters themselves do carry a lot of config that is set via GCP UI for trivial things like firewall rules. Of course we could script it and automate, but your CLI tool also changes fast enough that it becomes an ongoing maintenance burden shifted from DevOps to engineers to track. In other words, it will likely incur downtime due to unforeseen small issues.
It’s also in you guys’ interest that we don’t do this and clusters are as static as possible right now, since if we are risking downtime and moving clusters, we’re definitely moving that cluster back to AWS.
I'm sure there are tools out there to help with cluster migrations, but it is far from trivial.
You provide a web interface, so it’s reasonable to assume people will use it.
Meanwhile most of my large government customers have a couple of tiny VMs for every website. That's it. That's already massive overkill because they see 10% max load, so they're wasting money on the rest of the resources that are there only for redundancy. Taking things to the next level would be almost absurd, but turning things off unnecessarily is still an outage.
This is why I don't believe any of the Cloud providers are ready for enterprise customers. None of you get it.
Get your data centers all running VMWare, then VMDK import to AWS AMIs, then wrap them all in autoscaling groups, figure out where the SPOFs have moved to, and only then start moving to containers.
In the mean time, all new development happens on serverless.
Don't let anything new connect to a legacy database directly, only via API at worst, or preferably via events.
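As a rough illustration of the import step (bucket, key, and names here are hypothetical, and this assumes the vmimport service role is already set up):

    # Register an exported VMDK as an AMI (hypothetical bucket/key).
    aws ec2 import-image \
      --description "legacy-app-01" \
      --disk-containers "Format=vmdk,UserBucket={S3Bucket=my-vm-exports,S3Key=legacy-app-01.vmdk}"

    # Poll until the task completes, then wrap the resulting AMI in a
    # launch template + autoscaling group.
    aws ec2 describe-import-image-tasks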
They rejected it because it's "too complex to do internal chargebacks" in a shared cluster model.
This is what I mean: The cloud is for orgs with one main application, like Netflix. It's not ready for enterprises where the biggest concern is internal bureaucracy.
(Or, really, I'd rather just run my workloads directly on some scale-free multitenant k8s cluster that resembled Borg itself—giving me something resembling a PaaS, but with my own custom resource controllers running in it. Y'know, the k8s equivalent to BigTable.)
Multiple clusters lets us easily firewall off communication to compute instances running in our account based on the allocated IP ranges for our various clusters (all our traffic is default-deny and has to be whitelisted). Multiple clusters lets us have a separate cluster for untrusted workloads that have no secrets/privileges/service accounts with access to gcloud.
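For illustration, the per-cluster whitelisting looks roughly like this (network names, tags, and CIDRs below are made up):

    # Default-deny ingress, then allow only the untrusted cluster's pod range
    # to reach the tagged instances on 443. All values are placeholders.
    gcloud compute firewall-rules create deny-all-ingress \
      --network=prod-vpc --direction=INGRESS --action=DENY --rules=all --priority=65000

    gcloud compute firewall-rules create allow-untrusted-cluster \
      --network=prod-vpc --direction=INGRESS --action=ALLOW --rules=tcp:443 \
      --source-ranges=10.8.0.0/14 --target-tags=ingest-api --priority=1000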
Starting in June our monthly bill is going to go up by thousands. All regional clusters.
GKE has the ability to run pods with gVisor, which prevents the pod from communicating with the host kernel, even maliciously. (I think they call these sandboxes.)
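For reference, the sandbox is requested per pod via a RuntimeClass; a minimal sketch (the image is a placeholder):

    # Placeholder pod; runtimeClassName: gvisor is what requests the sandbox
    # on node pools that have GKE Sandbox enabled.
    apiVersion: v1
    kind: Pod
    metadata:
      name: untrusted-workload
    spec:
      runtimeClassName: gvisor
      containers:
      - name: app
        image: example.com/untrusted:latest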
The only reason to use multiple clusters is if you want CPU isolation without the drawbacks of cgroups limits (i.e., awful 99%-ile latency when an app is being throttled), or you suspect bugs in the Linux kernel, gVisor, or the CNI. (Remember that you're in the cloud, and someone can easily have a hypervisor 0-day, and then you have no isolation from untrusted workloads.)
Cluster-scoped (non-namespaced) resources are also a problem, though not too prevalent.
Overall, the biggest problem I see with using multiple clusters is that you end up wasting a lot of resources because you can't pack pods as efficiently.
We're okay with the waste as long as our software & deployment practices can treat any hosted Kubernetes service as essentially the same.
Ultimately, all isolation has its limits. Traditional VMs suffer from hypervisor exploits. Dedicated machines suffer from network-level exploits (network card firmware bugs, ARP floods, malicious BGP "misconfigurations"), etc. You can spend an infinite amount of money while still not bringing the risk to zero, so you have to deploy your resources wisely.
Engineering is about balancing cost and benefit. It's not worth paying a team of CPU engineers to develop a new CPU for you because you're worried about Apache interfering with MySQL; the benefit is near-zero and the cost is astronomical. Similarly, it doesn't make sense to run the two applications in two separate Kubernetes clusters. It's going to cost you thousands of dollars a month in wasted CPUs sitting around, control plane costs, and management, while only protecting you against the very rare case of someone compromising Apache because they found a bug in MySQL that lets them escape the sandbox.
Meanwhile, people are sitting around writing IP whitelists for separate virtual machines because they haven't bothered to read the documentation for Istio or Linkerd which they get for free and actually adds security, observability, and protection against misconfiguration.
Everyone on Hacker News is that 1% with an uncommon workload and an unlimited budget, but 99% of people are going to have a more enjoyable experience by just sharing a pool of machines and enforcing policy at the Kubernetes level.
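Even without a mesh, a plain NetworkPolicy (core Kubernetes, not an Istio/Linkerd feature) replaces a lot of hand-written IP whitelists; a minimal sketch with made-up namespaces and labels:

    # Only pods labeled app=frontend may reach the api pods on 8080;
    # all other ingress to them is dropped. Names/labels are placeholders.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: api-allow-frontend
      namespace: prod
    spec:
      podSelector:
        matchLabels:
          app: api
      policyTypes: ["Ingress"]
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: frontend
        ports:
        - port: 8080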
I agree with most of what you’ve said about gVisor providing sufficient security, but it’s not just about security: noisy neighbors are a big issue in large clusters.
More importantly, this dialogue speaks volumes to Google’s stubbornness. Seth’s/Google’s position is: do it the Google way, sorry-not-sorry to all those that don’t fit into our model.
Like we haven’t heard of infrastructure as code? That can’t paper over basics like being unable to change your K8s cluster control plane. This is precisely the attitude that lands GCP as a distant #3 behind AWS and Azure.
AWS has the complete opposite model.
I already know too many people who are stuck at a certain k8s version. Do not let that happen!
Off-topic, but is this really how people do k8s these days? Years ago when I was at Google, each physical datacenter had at most several "clusters", which would have fifty thousand cores and run every job from every team. A single k8s cluster is already a task management system (with a lot of complexity), so what do people gain by having many clusters, other than more complexity?
People had exactly the same experiences with Mesos and OpenStack, but k8s has decent tooling for turning up many clusters, so there is an easy workaround
I mean, if people aren't smart enough to run a large shared infrastructure, how can I trust them to run a large number of clusters, even if each cluster is small? The final scale is still the same.
Clusters are probably still pets to most orgs, but the lessons about how to manage complexity still apply. Each of my terraform state files is a pet and I treat it like such... but I also use change-control to assure that even though I don't regularly recreate it from scratch, I understand all that was there.
* Fully reproducible cluster builds and deployments.
* The type of cluster (can be) an implementation detail, making it easy to move between e.g Minikube, Kops, EKS, etc. After all, K8s is just a runtime.
* Developers can create temporary dev environments or replicas of other clusters
* Promote code through multiple environments from local Minikube clusters to cloud environments
* Version your applications and dependent infrastructure code together
* Simplify upgrades by launching a brand new cluster, migrating traffic and tearing the old one down (blue/green)
* Test in-place upgrades by launching a replica of an existing cluster to test the upgrade before repeating it in production
* Increase agility by making it easier to rearchitect your systems - if you have a pet, modifying the overall architecture can be painful
* Frequently test your disaster recovery processes as a by-product for no extra effort (sans data)
* Reduced blast radius
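To make the first bullet concrete: even a plain script pins the decisions that otherwise live only in the console. A hedged sketch (names, zone, and channel are placeholders; in practice you'd likely wrap this in Terraform or similar):

    # Recreate the cluster with the same knobs every time.
    gcloud container clusters create replica-cluster \
      --zone us-central1-a \
      --num-nodes 3 \
      --release-channel stable \
      --enable-ip-alias

    # Point kubectl at it and re-apply the declarative config.
    gcloud container clusters get-credentials replica-cluster --zone us-central1-a
    kubectl apply -f manifests/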
Heh... how many teams actually treat their clusters like cattle, though? Every time I advocate automation around cluster management, people start complaining that "you don't have to do that anymore, we have Kubernetes!"
Some people get it, yes, but even of that group, few have the political will/strength to make sure that automation is set up on the cluster level—especially to a point where you could migrate running production workloads between clusters without a potentially large outage / maintenance window.
Sugarkube is designed to do exactly that.
I don't know GCP though. In the past I've seen kube cluster architectures which are very, very fragile as they spin up. If that's the case with GCP, I can see why you wouldn't do the above and would rather hand-hold their creation.
Ha. They should, but they are absolutely not. Customers typically ask "why should we spend time on automating cluster deployment when we are going to do it just once?" and when I explain that it's for when the cluster goes away, if it goes away, they say it's an acceptable risk.
The truth of the matter is, even in some huge international companies, they don't have the resources to keep up with development of tools to have completely phoenix servers. They just want to write automation and have it work for the next 10 years, and that's definitely not the case.
Clusters often are not "cattle". If your operation is big enough, then yes, they might be. Usually they aren't, they are named systems and represent mostly static entity, even if the components of said entity change every hour.
Personally, I'm running in production a cluster that has by now witnessed upgrades from 1.3 to 1.15, in GKE, with some deployments running nearly unchanged since then.
Treating it as cattle makes no sense, especially since on API level, the clusters aren't volatile elements.
For us, Clusters are a promise to our developers. We can’t just spin up a new cluster because we feel like it. I must be missing something or maybe our culture is just different.
For every 1 competent person who can develop a solution to fully automate everything, there are 99 others who can automate most of that, maybe minus a cluster or a DB or two, and another 500 who cannot do either, but can run a CentOS box at a reasonable service level.
You get to use great tools and your vast knowledge of k8s each day to do all that, and you have the support of your org, but those other folks may not have the tools, support, knowledge, or sometimes even the capability to attain the knowledge to do that. That doesn't mean they're useless to anyone, to be cast off at will!
The type of thinking that leads to, "get new engineers/developers/designers/architects if yours aren't perfect" needs to die, and needs to be replaced with, "let's do what we can to train and support our current employees to do a great job" because, frankly, there aren't enough "superstar" people who have your skills to do that at every org.
We need to work on accepting people for who they are-- helping them to strive to be a bit better each day of course-- and utilizing those skills in the right place, rather than trying to make everyone the same person with the same skills doing the same things.
Some applications don't need clusters which can be rebuilt and destroyed at will, so let's not make that the bar for every project.
P.S. I consider "pet vs cattle" to be a horrible metaphor that should be taken behind the barn and shot like the diseased plague-bearer it is.
Please don't do this.
You can apologize for your actions and work to improve in the future, but you cannot apologize for how someone feels as a result of your actions.
Also, intent doesn't matter unless you plan to change your behavior to undo or mitigate the unintended result.
So my understanding is that the official k8s way to upgrade your cluster is also to throw it away and start a new one (with some cloud-provider-proprietary alternatives).
Let's say there is something actually important, stateful, single-source-of-truth in my k8s cluster, like a relational DB that must not lose data. I don't want downtime for readers or writers, and I want at least one synchronous slave at all times (since the data is important). I also don't want to eat non-trivial latency overheads from setting up layers of indirection.
What's the recommended way of doing this?
I am no expert, but K8s handles task replication and will either spawn a new task instance or route requests to another instance somewhere else. However, the application logic itself must handle fault tolerance (by handling its state through transactions or something else) should an instance fail. K8s doesn't do that for you.
You need to instantiate a secondary DB replica somewhere else and start the DB migration. Since there will be "two instances" of the same DB running, you will also need to set up a (temporary) proxy for routing and handling the DB requests w/ something like this:
1) If the data being requested has already been migrated, the request is handled by the (new) secondary replica.
2) Otherwise, the primary instance handles the request, and
2.1) the requested data should be migrated to the secondary replica (asynchronously, but note that a repeated request may invalidate a migration).
Turn the proxy router off once the whole state of your primary DB instance is fully migrated, making the secondary replica the primary one.
That's really just a napkin recipe for completing a live migration, though.
* We are now getting into the distributed-transaction world, because you can never be 100% sure that writes to 2 databases will succeed or fail at the same time. There is a talk from a guy who deals with a similar problem to yours: http://www.aviransplace.com/2015/12/15/safe-database-migrati...
This is a pretty good in-depth explanation, but at a high level: if your server dies and you are extremely upset about it (similar to if your pet died), you are putting too many eggs in that single basket, with no secondary plan. Conversely, if you build your infra in such a way that your server dying is something you see as no worse than how a farmer sees one of his cattle dying (which are raised to be killed), you are much better prepared for the inevitable downtime of your server and can recover very easily.
That about sums up most things Google does for developers.
My problem is that this fee doesn't look very "cloud" friendly. Sure the folks with big clusters won't even notice it, but others will sweat it.
The appeal of cloud is that costs increase as you go, and flat rates are typically there to add predictability (see BigQuery flat rate). This fee does the opposite.
This cost increases rapidly for those scenarios.
Or, better yet, don't use k8s. You don't need it, especially as a startup on a shoestring budget. You can migrate later if you decide you really need to, but just a plain LAMP gets you 99% of the way.
Around the time "Google Cloud Platform" became a thing, Google changed GAE from an encapsulated bubble into a basic frontend management system that interacts with normal services through public APIs (either inside or outside GCP). It's more expensive than GCE, but it's fully managed and lets you skip the devops team.
This makes it seem like Google Cloud for Startups is aimed at startups that aren't really on a shoestring budget.
So I guess the big question in my mind is how do you run containerized apps in the major clouds besides K8s if it's a bulldozer and you just need a cargo bike? Is there something simpler?
However this means we are paying for egress on both sides. This was something we chose to eat due to GCP Kubernetes, but considering today’s changes, it probably no longer makes sense.
What about clusters that are used for lumpy work loads? Like data science pipelines? For example, our org has a few dozen clusters being used like that.
Each pipeline gets its own cluster instance as a way to enforce rough-and-ready isolation. Most of the time the clusters sit unused. To keep them alive we keep a small, cheap, preemptible node on each idle cluster. When a new batch of data comes in, we fire up kube jobs, which then trigger GKE autoscaling to process the workload.
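For context, the batch side of that is just a Job whose resource requests exceed what the keepalive node can hold, so the autoscaler adds nodes and scales them back down afterwards. Roughly (name, image, and sizes are invented):

    # Invented example: the large request forces GKE's cluster autoscaler
    # to add nodes; they are removed again after the Job finishes.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nightly-pipeline
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: example.com/pipeline:latest
            resources:
              requests:
                cpu: "8"
                memory: 16Gi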
This pricing change means we're looking at thousands of dollar more in billing per month. Without any tangible improvement in service. (The keepalive node hack only costs $5 a month per cluster.) We could consolidate the segmented cluster instances into a single cluster with separate namespaces, but that would also cost thousands in valuable developer time.
I don't know how common our use pattern is, but I think we would be a lot better served by a discounted management fee when the cluster is just being kept alive and not actually using any resources. At $0.01, maybe even $0.02, per hour we could justify it. But paying $0.10 to keep empty clusters alive is just egregious.
It will be interesting in any case to see if DigitalOcean and Azure are going to follow suit! I'd be very surprised if they do, (but I've also been wrong before, recently too.)
We don't recommend using node pools for isolation.
PTAL at https://www.youtube.com/watch?v=6rMGRvcjvKc
Please give us feedback there in case you hit any issue!
https://github.com/rcarmo/azure-k3s-cluster (this is an Azure template that I use precisely for testing that kind of workloads - spinning up one of these, master included, takes a couple of minutes at most).
(full disclosure: I work at Microsoft - Azure Kubernetes Service works fine, but I built the above because I wanted full control over scaling and a very very simple shared filesystem)
I guess they realized they couldn't make cluster management MT.
(As a security person: ugh.)
More likely though, AWS or OpenShift running on bare metal on a beefy ATX tower in the office. We want to have production and staging as close to each other as possible, so this is an additional reason and a p0 flag on reducing the dependency on Google-specific bits of Kubernetes as much as possible, hopefully also useful for our exit strategy as well.
I'll use helm to install metallb for the load balancer, which you can then tie into whatever ingress controller you like to use.
For persistent storage a simple NFS server is the bees' knees. Works very well, and an NFS provisioner is a helm install away. Very nice, especially over 10GbE. Do NOT dismiss NFSv4. It's actually very nice for this sort of thing. I just use a small separate Linux box with software RAID on it for that.
If you want to have the cluster self-host storage or need high availability then GlusterFS works great, but it's more overhead to manage.
Then you just use normal helm install routines to install and setup logging, dashboards, and all that.
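For what it's worth, that bootstrap really is just a few chart installs, something like the following (chart repos, names, and values drift over time, so treat these as approximate):

    # Approximate commands; verify current chart names and repos first.
    helm repo add metallb https://metallb.github.io/metallb
    helm install metallb metallb/metallb --namespace metallb-system --create-namespace

    # An NFS client provisioner pointed at the small storage box mentioned above
    # (server IP and export path are placeholders).
    helm repo add nfs-subdir-external-provisioner \
      https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
    helm install nfs-provisioner \
      nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
      --set nfs.server=10.0.0.5 --set nfs.path=/srv/nfs/k8s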
OpenShift is going to be a lot better for people who want to do multi-tenant stuff in a corporate enterprise environment, like when you have different teams of people, each with their own realm of responsibility. OpenShift's UI and general approach are pretty good about allowing groups to self-manage without impacting one another. The additional security is a double-edged sword: fantastic if you need it, but an annoying barrier to entry for users if you don't.
As far as AWS goes... EKS recently lowered their cost from 20 cents per hour to 10 cents. So the cost for the cluster is on par with what Google is charging.
Azure doesn't charge for cluster management (yet), IIRC.
For example you could use NFS for 90% of the storage needs for logging and sharing files between pods. Then use local storage, FCOE, or iSCSI-backed PVs for databases.
If you are doing bare hardware and your latency requirements are not too stringent, then not hosting databases in the cluster is also a good approach. Just use dedicated systems.
If you can get state out of the cluster then that makes things easier.
All of this depends on a huge number of other factors, of course.
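For the NFS side of that split, a statically provisioned volume is about as simple as storage gets in Kubernetes; a sketch with placeholder server, path, and size:

    # NFS-backed PV for shared, non-database data; databases would instead
    # use local or iSCSI-backed PVs. All values are placeholders.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: shared-logs
    spec:
      capacity:
        storage: 100Gi
      accessModes: ["ReadWriteMany"]
      nfs:
        server: 10.0.0.5
        path: /srv/nfs/logs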
I think NFS is heavily underrated. It's a good match for things like hosting VM images on a cluster and for Kubernetes.
In the past I really wanted to use things like iSCSI for hosting VM images and such, but I've found that NFS is actually a lot faster for a lot of things. There are complications to NFS, of course, but they haven't caused me problems.
I would be happy to use it in production, and have recommended it, but it's not unconditional. It depends on a number of different factors.
The only problem with NFS is: how do you manage the actual NFS infrastructure? How much experience does your org have with NFS? Do you already have an existing file storage solution in production you can expand and use with Kubernetes?
Like if your organization already has a lot of servers running ZFS, then that is a nice thing to leverage for NFS persistent storage. Since you already have expertise in-house it would be a mistake not to take advantage of it. I wouldn't recommend this approach for people not already doing it, though.
If you can afford some sort of enterprise-grade storage appliance that takes care of dedupe, checksums, failovers, and all that happy stuff, then that's great. Use that and it'll solve your problems, especially if there is some sort of NFS provisioner that Kubernetes supports.
The only place where I would say it's a 'Hard No' is if you have some sort of high-scalability requirements, like if you wanted to start a web hosting company or needed to have hundreds of nodes in a cluster. In that case distributed file systems are what you need... self-hosted storage, aka "Hyper-Converged Infrastructure". The cost and overhead of managing these things is then relatively small compared to the size of the cluster and what you are trying to do.
It's scary to me to have a cluster self-host storage, because storage can use a huge amount of RAM and CPU at the worst times. You can go from a happy low-resource cluster, then a node fails or another component takes a shit, and while everything is recovering and checksumming (and lord knows what) the resource usage goes through the roof right during a critical time. The 'perfect storm' scenarios.
However, if you can tolerate client side failure then go for it.
And if that is your central infrastructure, shouldn't it be worth the money?
I do get the issue with having cheap and beefy hardware somewhere else, I do that as well, but only for private projects. My hourly salary spent (or wasted) on stuff like that costs the company more than just paying for an additional cluster with the same settings but perhaps far fewer nodes.
If more than one person is using it, the multiplied effect of suddenly unproductive people is much higher. That also decreases the per-head cost.
Do you want to mention anything more about what you're hoping to get out of hierarchy? Is it just a management tool, is it for access control, metering/observability, etc...?
Since there are no character limits to worry about here unlike Twitter, better to put up the full URL so the community can decide for themselves if the domain linked to is worth clicking through or not.
Friendly docs: https://docs.google.com/document/d/1R4rwTweYBWYDTC9UC-qThaMk...
It's a living product which comes with Terraform modules. We introduced various features to enable multi-tenancy as well (and more are on the way!)
We do recommend robust configurations for a production setup (e.g. dev, staging and production); however, you can certainly squash and skip that if it's not necessary.
Thanks for the feedback though. We'll consider adding such notes explicitly.
I am trying to do that. Where would you suggest? Throughout the comments on this post, when people suggest namespace, pod or node level separation you ask them to PTAL and read the link which suggests the single-tenant cluster-per-project approach (that is under the Multi-tenant cluster, confusingly.) The link you sent talks about cluster-per-project, which is not multi-tenancy as I understand it. Perhaps a different name would be less confusing (robust federated cluster administration?)
This "multiple" multi-tenant clusters part isn't coming through. Please do jump into "Securing the cluster" section to cut corners and learn what to do in a single cluster. We're fixing the sections to avoid the confusions. Thanks for the feedback!
 https://github.com/kubernetes/minikube/issues/3207, https://github.com/docker/for-mac/issues/3065, https://github.com/docker/for-mac/issues/3539, etc; there must be dozens and dozens of these
Swarm, in comparison, is much friendlier (and you can use it for dev/test across multiple machines just fine)
What kind of system have you seen where this isn't true?
(My apologies if we've chatted about this before in another venue, I'm losing track of whom I've already talked to)
That said they do make it damn hard. Our k8s cluster is as basic as it comes, no databases, simple deployments, but we do still have a dependency on Google Cloud Loadbalancer (which we hate).
If pricing goes up too much from this we'll move, but the GCL dependency will be a PITA :/
I'm guessing 99% of workloads won't notice either of these issues, but it is an actual issue.
Notably, we may choose to make breaking changes to our API specification (i.e. the Issuer, ClusterIssuer and Certificate resources) in new minor releases.
It's not dark magic; it might make building integrations off of it prohibitive, but they have done a great job making sure users can upgrade from one release to the next.
It is a little bit of a treadmill, but it certainly beats manually renewing certificates!
It's no different than running your own load balancer like HAProxy pointed at the nodes which forward to a node-port service.
There's also MetalLB if you're running your own hardware: https://metallb.universe.tf/
Pretty much anyone who has worked in ops for a while understood from the get-go that it's impossible to be totally provider-agnostic. K8s is just a nice API on top of the provider API that still requires provider-specific configuration.
If you're going to run on bare-metal or in your own VMs, OpenShift is very much worth a look. There are hundreds, maybe thousands of ways to shoot yourself in the foot, and OpenShift puts up guard rails for you (which you can bypass if you want to). OpenShift 4 runs on top of RHCOS which makes node management much simpler, and allows you to scale nodes quickly and easily. Works on bare metal or in the cloud (or both, but make sure you have super low latency between data centers if you are going to do that). It's also pretty valuable to be able to call Red Hat support if something goes wrong. (I still shake my head over the number of days I spent debugging arcane networking issues on EKS before moving to OpenShift, which would have paid for a year or more of support just by itself).
Don't get me wrong, I appreciate RHAT's code contributions very much, they have done a lot for k8s! But running OKD on one's own is a bad idea, while paying for IBM support makes you as much a hostage as anything Google will do to you. Better to just stick with a distribution with better community support and wait for RHAT's useful innovations to be merged upstream (while avoiding the pitfalls of their failed experiments...)
* Yes it's open source, but the community version (okd) isn't really supported, nor is it widely used, so if you're serious about running this you're doing so for the Enterprise support and you're going to be writing those checks
Your concern is valid, and I agree with you that OKD is not supported enough. I have my own theories as to why, but I will keep my criticism "in the family" (but do know there are people that want to see OKD be a first-class citizen, and know we are falling short right now). We had some challenges supporting OKD 4.x because the masters in 4.x now require Red Hat CoreOS (and for nodes it is highly recommended), but RHCOS was not yet freely available. This is obviously a big problem. Now that Fedora CoreOS is out, there is a freely distributable host OS on which to build OKD, so it will be better supported and usable. FWIW I have a personal goal to have a production-ish OKD cluster running for myself by the end of the year.
I'll admit I am a little offended at being called a "proprietary IBM K8s distribution," but I don't think you meant to be offensive. IBM has nothing to do with OpenShift, beyond the fact that they are becoming customers of it. Every bit of our OpenShift code is open source and available. You are right that it's not in a production-usable state (although there are people using it) but it's a lot better than you'll get from other vendors. We are at least trying to get it to a usable state, unlike many of them. We are strapped for resources like everyone else, and K8s runs a mile a minute and requires significant effort to stay ahead of. This space is still really young and really hot, and I am confident we'll get the open source in a good, usable state, much like Fedora and CentOS are. I also don't think OpenShift is really all that expensive considering the enormous value it provides. The value really does shine at scale, and may not be there for smaller shops.
I don't blame you for waiting, I probably would too. Our current offering is made for enterprise scale, so isn't tailored to everyone. I've heard OpenShift online has gotten better, but haven't tried it myself. Eventually I plan to run all my personal apps on OKD (I have maybe a dozen, mostly with the only users being me and my family), but until then I've been using podman pods with systemd, which will be trivial to port to OKD once it's in a good state.
It's not that openshift is bad per se, I just don't imagine it solves many problems an org that is fretting about lock in or gcp pricing will have. Such an org is probably cost sensitive and looking for flexibility, but openshift is expensive, and if you adopt its differentiating features you are de facto locking yourself in. And if you do not leverage those features out of a desire to avoid lock in, you are effectively paying a whole lot just for k8s support...
And I really should say, for certain orgs (especially bigcos) this may well be worth it, I just don't think it is a good option for anybody worried primarily about avoiding vendor lock in and keeping costs in check.
Again, I think the idea of OS is great, but you've lost us, and likely other big customers because of that restriction. Having old kernels is just not an option for some people.
If you are like me and are old school and think "huh, yeah that makes me nervous" I completely understand that, but we've seen some serious success with it. I'm a skeptical person, and telling me I can't SSH to my node freaks me out a bit, but I'm becoming a convert.
I would also note if you buy OpenShift you get the infrastructure nodes (masters, and some workers for running openshift operator pods) for free (typically, but I'm not a salesperson so don't hold me to that if I've misspoke :-P), so you aren't paying for the super locked in OS. I suppose you do have to pay for RHEL8 or RHCOS on the worker nodes running your pods, and we don't support other distros (because we expect a very specific selinux config, CRI-O config (container runtime), among other things), so I guess there's some dependence there, although I recommend RHCOS for all your nodes and then just use the Machine API if you need it.
Yes, RH did a lot for k8s, but killing off a working distribution without a direct migration path (it's basically "start again") will make your customers angry.
Also, I think the OpenShift terminology is way too much, and OpenShift should be a much thinner layer on top of k8s.
Sure it's opinionated, but at this point, which flavor of k8s isn't? Even k3s (which I play around with on https://github.com/rcarmo/azure-k3s-cluster) has its quirks.
Everyone who's targeting the enterprise market has a "value-added" brew of k8s at this point, so kudos to RH for the engineering effort :)
Because of that major change, the cluster upgrade path is a little more involved than usual. It's a complete reinstall of the OS and rebuild of the cluster. There are tools to help tho, and as the path is tread things will get easier and better supported. If you go through the same wave as me, you'll be annoyed at first but then once it is done and you have a 4.x cluster you'll be really happy with it (especially when you can manage everything as an operator).
Luckily from an app perspective very little will change since Kubernetes is the API. A sibling comment linked to some helpful documentation. I can't be specific right now but I can tell you that your need is known, and some very smart people are working on it. If you want to email me (check my HN bio page) or jump in the Keybase group called "openshift_okd," I'm happy to chat more about it (not in an official Red Hat support capacity, just as friend to friend :-) ). I haven't done a migration myself yet but I know people who have, and I plan to get into it personally soon as well.
It's not a rolling upgrade, but it allows you to move the apps, PVs, policies in as granular a manner as you wish.
If you want something production-grade (i.e. it doesn't say "beta" on the tin) then I think Calico should solve most of the same problems too (it does BGP peering to your ToR switch).
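The Calico side of that is just a BGPPeer resource pointed at the switch; a rough sketch (peer IP and AS number are placeholders):

    # Global BGP peer toward the ToR switch; values are placeholders.
    apiVersion: projectcalico.org/v3
    kind: BGPPeer
    metadata:
      name: tor-switch
    spec:
      peerIP: 192.168.0.1
      asNumber: 64512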
Does MetalLB do something extra I'm missing?
So either way you end up in a situation where you require a dedicated devops team or dedicated team members to keep up with changing requirements.
Google will ignore it like all of the tickets I file. The fact is that Google is in the business of making money and they are focused on enterprise users. Enterprise users are not sensitive to integration difficulty since they can just throw people at any problems. So eventually everything Google makes will become extremely time-consuming to learn and difficult to use. They're becoming another Oracle.
With this fixed-fee model, the change will barely make a difference (== Google revenue) for the large customers who can spare the money, but it will create a significant entry barrier for that side project / super-early-stage startup that is considering getting hooked on GCP, specifically GKE.
Then again, not my decision to make.
There is no billing protection (which could make you very poor very fast), and every service has a certain cost and quality which is just not feasible in the beginning.
Even GKE with its free Kubernetes master reserves a lot of resources on the nodes: https://cloud.google.com/kubernetes-engine/docs/concepts/clu...
There are also a ton of great features on GKE you will probably never use if you are too small. It is so much cheaper to just get cheap hardware somewhere and put your own k8s onto it if you have more time than money.
Even on Digital Ocean you have the load balancer problem: you need to use the provided, and also 'costly', LoadBalancer service. The only hacky way to avoid it is exposing your ingress on the host and mapping that one IP, but then you lose all the self-healing and load-balancing capability.
You can set billing alerts that will project your monthly budget every hour, and send you an alert when it's projected to be exceeded.
IMHO your claim (that there is an entry cost barrier) is the opposite of reality. AWS and Google have brought incredible power and choice to developers starting at zero initial cost.
The fundamental issue with setting a limit is it's technically infeasible to decide what to do when it's exceeded. They have no way of knowing what assets to terminate. The way to avoid what you describe is to shut off access to APIs that you don't want to use, and keep your credentials safe.
Then again, I know enough about AWS and how to control my cost.
I mean, yes? You can either build your own for “free”, or pay for the value-add features DO provides that you have described. I don’t see where the problem is.
>"The EKS control plane is the easiest to understand with a fixed cost of $0.20 per hour."
https://aws.amazon.com/eks/pricing/ is the up to date page. https://aws.amazon.com/blogs/aws/eks-price-reduction/ is the announcement of the price cut.
After seeing what they did to Google Maps, and Api.AI / Dialogflow jumping from free to $5k overnight - I just can't trust them.
There is a billion dollar company on the horizon for whoever can best commoditize bare metal with an apple-esque usability model.
How is Google different from other cloud providers in terms of vendor lock-in?
I used to recommend GKE as the best way to get started with K8S. With this price change, that advantage is gone. This price change kills off any advantage Google had for getting started in the cloud. Their enterprise-unfriendly habits have already killed off any advantages Google had for established, larger customers of cloud computing.
AWS _still_ has SimpleDB kicking around. That has been retired as a product for more than 8 years now.
I haven't seen Azure do similar, either, but I haven't paid as much attention to them.
For AWS or Azure I developed way more trust over time - could be subjective - but also could be that there is a reason Google is distant 3rd in the game.
The magic quadrant isn't a good housekeeping seal of approval. It's a screener for an architect in Fortune 500 or .Gov to show social proof that their product selection isn't insane.
The "cautions" for GCP are about the nascent state of their field sales and solution architects, enterprise relationship management, and limited partner community.
The "cautions" for Azure are poor reliability, poor support, and sticker shock.
My takeaway was very different from yours. When you read the analysis, it was reflective of a mature, dominant player (AWS) and two highly capable challengers with different issues.
Google is a newer business that is missing some services (ie. anything user facing) and is transitioning from a weird sales model to a more conventional enterprise one. Microsoft has an established business process and best in class sales org, but they tend to use sticks of various types to force adoption and are organizationally poorly equipped to support customers.
Also lack of SLA / shady SLA does not help.
Ps. Talking as someone with hands on experience.
Ps2. Azure support is terrible and their response times are constantly breaking SLA..
Honestly, I've been pleasantly surprised.
The “best” support I had the displeasure of working with didn’t even know how the service I was having an issue with even worked... And it was a “technical” ticket.
Seems like after the last round of poaching of the best support engineers by AWS, Azure is left with only an outsourced mob.
Another - constant disconnections of PVs.
Another - 1 in 3 times the newly provisioned node in a VMSS has a broken kubelet and doesn’t successfully register with the control plane. I was literally shocked when it happened twice in a single day. The response from support was that we are supposed to monitor that ourselves and drain unsuccessfully provisioned nodes (we already were, and it was mentioned in the opening ticket) - makes scaling horizontally REALLY PAINFUL.
The CNI default reservation of IPs (30) - cannot be less - so if you have a service node running and you want only a few pods to run on it for HA - well, sucks to be you.
Kubenet not working until recently with anything - for example AG, though AG is a disaster of a service by itself.
Various API failures related to networking - sometimes the control plane lost connection to the AKS subnet for some time (fixed itself, but still...)
AliCloud and Rackspace are very close to GCP as well.
That being said, if you're planning on running Kubernetes, I'd choose GCP over any other offering - the tooling and support just seems better, in my entirely subjective opinion.
It has more features, yes. How well those features work is another matter entirely.
I loosely follow AWS, GCP and Azure but I always get mixed opinions on them, especially the last two
To pick a random example that I'm familiar with: Azure DNS Zones.
When I used AWS Route 53, the main issue I had with it was that I thought the cutesy marketing name was stupid. That's about it. By reading through the docs, I learned a little bit about DNS I didn't know, and I got to learn about the clever engineering that AWS did to work around issues with the DNS protocol itself. In the end, it had more features than I needed, and the basic stuff Just Worked.
When I tried to use Azure DNS, their import tool shredded my data. I then wrote a custom PowerShell import tool, but it took hours to import a mere few thousand records. The next day my account was locked out for "too many API calls" because I simply had the console web gui open. Not used. Open. The GUI showed entries different to the console tools. The GUI string limits don't match the console tool. The perf monitor graph was broken, and is still broken. Basic features were missing, broken, or "coming soon".
You would think DNS would be one of those services that "just works", but nope. Bug city.
Now mind you, most of those issues are fixed now, and they're adding more features and fixing the issues those new features are introducing.
But ask yourself: Why are buggy features being rolled out in production? Did nobody test this shit? Did they ever do a load test? Did they even try basic things like "have the console open with more than 10 records"? Why am I discovering this? Do they not have thousands of customers who have battle-tested this stuff?
Clearly they're just throwing things over the fence and letting support tickets be their QA feedback.
PS: It's even how they use DNS themselves that's just wrong. E.g.: If you use Azure CDN you end up with like 6 CNAME redirects in a row. The DNS standard says CNAMEs shouldn't point to other CNAMEs! At a minimum this is slower than it needs to be, but it's also less reliable because there's more points of failure...
My experience is that once AWS offers a new service that gets attention, a few months later Azure also offers it - and vice versa.
I also think of them as the second, inferior cloud, but they're almost certainly the better k8s hoster. If you're serious about running k8s on AWS, there's a good chance you're doing something like CloudPosse's Terraform-based setup, not EKS.
I'm sorry, but first choice is AWS, second is Azure (mostly because retailers don't like AWS), and third is actually GCP.
For cluster per customer architecture, would you be able to look into https://cloud.google.com/kubernetes-engine/docs/best-practic... to see if there is anything useful? We understand changing the architecture isn't easy at all and we'd love to know how we can help.
What I can do is to lower the bar for aggregating clusters with the investments we've done so far. Hope that makes sense.
Aren't they the third choice, after AWS and Azure?
I'm not sure what your use case is such that you would choose GKE and yet be worried about $300 per month in infra costs.
For corp we use GKE. For private I use self-hosted k3s, and for our startup a super cheap Digital Ocean cluster.
That feels like the wrong attitude.
If your Kubernetes cluster is part of your core infrastructure, then $300 more or less should not be an issue at all (not to say that I think $300 is nothing).
That should not mean that you should waste money, but often enough, if you buy cheap and your hardware breaks, your time & material costs much more than what better hardware would have cost, and then you wasted money by buying cheap.
Unfortunately with IT products there are certain things which are not directly visible, like how secure your product is. GCP offers 2FA, Digital Ocean does not. How much money is it worth to you to have your whole infrastructure protected by 2FA? For me, in a business context, no 2FA would be a no-go.
Digital Ocean definitely supports 2FA
They are 3rd for me, I will take AWS or Azure long before I would take GCP. Hell for some projects I would even take Digital Ocean or Linode over GCP.
The amount of management they make you do is amazing. Need to upgrade the kernel on your nodes? Fill out a five page CloudFormation workflow, and if you copy-pasted anything wrong, your cluster will just randomly work strangely for no reason. This is what they consider "fully managed"! (There are also lots of neat bugs, like if you create a cluster in the web interface while logged in with SSO, all credentials necessary to access the cluster will disappear in an hour. By design, apparently. So even though they have a web interface, you have to create a role account and use the command-line.)
It's really a wondrous product and shows how complacent you can be when you're in first place.
That's because they are aware of their position as a distant fourth place. They just (well, last quarter) allocated an additional $2 billion to try and get a foothold in the enterprise, shook up the management team, and are on a spending spree as well.
Because google search is so shitty I couldn't find the $2B announcement (though it was on HN at the time) but here's a recent article on the general subject https://www.fool.com/investing/2020/02/27/google-buying-way-...
That's a concerning take on business decisions on two levels.
If I'm running a small cluster, I can use a small VM for the master node(s) and get my control plane costs down much lower than $73/month.
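For what it's worth, a small self-hosted control plane is nearly a one-liner with something like k3s (the server address and token below are placeholders):

    # On the small control-plane VM:
    curl -sfL https://get.k3s.io | sh -

    # On each worker, join using the server address and its node token
    # (found at /var/lib/rancher/k3s/server/node-token on the server):
    curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.2:6443 K3S_TOKEN=<node-token> sh -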
1). CEO with delusions of grandeur who thought that Amazon was a direct competitor to their business and should not be given money.
2). Projects that used Kubernetes.
Two is the only type of project which doesn't result in tens of millions of dollars wasted.
I mean, look at Netflix.
In fact, Netflix helped build AWS. There were daily long-term scheduled meetings with AWS developers to review new features and versions.
In addition, the policy from mgmt. was that they were fine being mono-cloud, since if they changed their mind, they had the engineering resources to migrate whenever it made sense.
Source: worked at Netflix.
Everyone's having EKS cost problems while I'm just sitting over here paying nothing for ECS control planes.
Turned out to be irrelevant. Following their instructions I couldn't get EC2 runtime hosts to attach after a couple hours of fucking around (which is my standard for 'is this mature enough to use'), while with ECS I hit one button and was up and running. Wasn't a hard choice when we started dockerizing (especially since I could simplify most jobs even further by having dev use Fargate, albeit at a premium).
EKS struck me as a feature parity product, not something you'd actually use.