I can tell you, as I'm sure anyone on my team can, that Azure is one big alpha-stage amalgamation of half-baked services. I would never ever recommend Azure to literally any organization, no matter the size. Seeing our customers struggle with it, us struggle with it, and even MS folks struggle with even the most basic tasks gets tiring really fast. We have so many workarounds in our software for inconsistency, unavailability, questionable security, and general quirks in Azure that it's not even funny anymore.
There are days when random parts of Azure completely fail, like customers not being able to view resources, role assignments, or even their directory config.
An automated integration test for one of our apps, which makes heavy use of the Azure Resource Manager APIs, fails dozens of times a week, not because we have a bug, but because state within Azure (RBAC changes, resource properties) doesn't propagate within a timeout of more than 15 minutes!
Two weeks back, the same test reproducibly put Azure into a state that completely disabled the Azure Portal resource view. All "blades" in the portal just displayed "unable to access data". Only an ultra-specific sequence of UI interactions and API calls could restore things (while uncovering a lot of other issues along the way).
That is the norm, not the exception. In 1.5 years' worth of development, there has never been a single week without an Azure issue robbing us of hours of work just debugging their systems and writing workarounds.
On topic though, we've had good experiences with these k8s runtimes:
- Rancher + DO
- IBM Cloud k8s (yeah, I know!)
Regarding Azure in general: Azure Websites is c*. Having used Heroku and App Engine for some time before, this feels like a joke. Deployments sometimes work, sometimes they don't. Have to deal with node-gyp? Don't, just don't. If you are ever forced to use Azure Websites (free startup package? ;)), learn Ansible as soon as possible and convince your team to switch to VMs.
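For what it's worth, the Ansible route doesn't have to start big. A minimal sketch, assuming a plain Debian/Ubuntu VM and a systemd unit for the app; the host group, paths, and unit name here are all made up:

    # playbook.yml (hypothetical): push a node app to plain VMs
    - hosts: webservers
      become: true
      tasks:
        - name: Install Node.js from the distro packages
          apt:
            name: nodejs
            state: present
        - name: Copy the application release
          copy:
            src: ./dist/
            dest: /opt/app/
        - name: Restart the app's systemd unit
          systemd:
            name: myapp        # hypothetical unit name
            state: restarted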
The VMs are okay; you can't do much wrong with them. I don't really know where the complexity of Azure Websites comes from, maybe from the fact that it runs on Windows, but that can't be the full explanation. I have seen people work with node on Windows (even without Ubuntu on Windows) and they were fine. For anyone interested, this is the Azure Websites backend: https://github.com/projectkudu/kudu
Disclaimer: my long adventure with it was years ago; maybe the service has changed 100%, but I doubt it.
You're much better off trying to find some backchannels via MVPs on Twitter or through blogs, or figure out the developers or evangelists that give talks on this kind of stuff and contact them directly.
No crashes, ever. Way more reliable than AWS ever was. (GCP is our failover.)
So it seems that your experience is, from my POV, the exception. Maybe there's something wrong with the way you guys have Azure set up?
The moment I removed the last ASM bits, my entire infrastructure became reliably versioned and deployable.
We missed having more instance types to choose from, but it was a nice experience.
FWIW we're running a simple, custom cluster made of Debian droplets set up using kubeadm.
Have you never used any other Microsoft software?
I mean it is the same software company that made Windows ME, Vista, 7, and 10, along with countless other chocolate covered turds.
It also doesn't seem to address a number of the points, especially how support was handled. I think it's bad form to only respond to the one thing that can be rebutted, ignoring the rest. And personally, I would have apologized for the bad experience here.
In Microsoft's shoes, I would have strongly avoided anything that sounded like customer blame. E.g.: "We really regret the bad experience they had here. They were using the platform in a way we didn't expect, which led to an obviously unacceptable failure mode. We appreciate their bringing it to our attention; we've made sure it won't happen going forward. We also agree that some of the responses from support weren't what they should have been and will be looking how to improve that for all Azure users."
The goal with a public statement like this isn't to be "right". It isn't even to convince the customer. It's to convince everybody else that their experience will be much better than what is hopefully a bad outlier. The impression I'm left with is that a) Azure isn't really owning their failures, and b) if I use their service in a way that seems "wrong" to them, I shouldn't expect much in the way of support.
I think this is the main reason for AWS's lead. They simply treat customers right (well, better than G and MS anyway).
(SPLA is the licensing service providers pay so they can lease infrastructure running Microsoft products to their customers. So if you pay some VPS or server provider $30/mo or whatever they charge for Server 2012, and they turn around and send $28 of it to MS, MS reports that $28 as cloud revenue.)
From his perspective he was using it right; from Azure's apparently he was using it wrong. A difference in perspective isn't bad faith.
I note that you don't say your comments here are just your perspective as you trash-talk him. Does that mean you're pursuing a smear campaign and not acting in good faith? Why should he be held to a standard you yourself aren't willing to follow?
I'm a bit confused why the cluster nodes don't come configured like this out of the box... Kubernetes users aren't supposed to have to worry about OOM on the underlying system killing ops-side processes, are they?
P.S. I work at Google.
Note: it also needs to match the node configuration (specifically, how cgroups are set up), so I doubt this works well on EKS, which is BYO node. Maybe that's the issue with AKS too; I don't know enough about how it works...
My experience with Azure has been reasonably positive, but even I've seen some weird stuff where things randomly don't work (AAD) or the dashboard just refuses to show anything for a while.
That this is a widespread endemic problem in Azure seems entirely plausible...
Scheduling is only really going to work well if you set limits, requests and quotas for containers. Please do this if you're running containers in production. I know it's a pain, as it's non-trivial to figure out how much resource your containers need, but the payoff is you avoid the issues described in the article.
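For anyone who hasn't done this yet, a minimal sketch of what it looks like; all names and numbers below are invented and need to be sized to your actual workload:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app              # hypothetical
    spec:
      containers:
        - name: app
          image: example/app:1.0     # hypothetical image
          resources:
            requests:                # what the scheduler reserves on a node
              cpu: 100m
              memory: 128Mi
            limits:                  # hard caps enforced via cgroups
              cpu: 500m
              memory: 256Mi
    ---
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-quota               # hypothetical, caps a whole namespace
      namespace: team-a
    spec:
      hard:
        requests.cpu: "4"
        requests.memory: 8Gi
        limits.cpu: "8"
        limits.memory: 16Gi

A LimitRange in the namespace can also supply default requests/limits for containers that don't set their own.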
Note that a service like AKS also draws in new customers who may not yet have years of Kubernetes experience. I'm one of those, for example: I created an AKS cluster so we could deploy short-lived environments for branches of our product. We're using GitLab and its 'Review Apps' integration with Kubernetes.
The instability experienced by the author of this article is something I experienced as well, and I spent a lot of time draining, rebooting, and scaling nodes to try to find out what was happening. I would not have been able to guess that the absence of resource limits could kill a node.
Fortunately these instabilities disappeared a couple of weeks ago after a redeployment of the AKS instance, and it has been stable ever since. I guess the system reservation change was included there? From my perspective that was also the moment AKS truly started feeling like a GA product.
Thanks for posting this here. It would be cool if there were a way to hold application users to account without needing to chase viral Internet posts and do your best to pin some accurate reporting on them slightly after the fact. A tricky general problem.
If there's one thing I miss with Azure (and AWS), it's the perpetually-free 600MB RAM KVM VM GCloud gives everyone to play with. It only has 1GB outbound, but inbound bandwidth is free, and I can do pretty much whatever I want with it. But anyways...
I think this has to do with cgroups and ensuring the OOM killer doesn't target what is essentially the `init` process of a Kubernetes cluster - the docker daemon or kubelet.
Sounds like a bunch of people have just learned about the OOM killer for the first time. I mean, production systems with overcommit enabled and the OOM killer running loose, and I bet without swap... And they blame the customer. Sounds like a PaaS MVP quickly slapped together by an alpha-stage startup. You may want to look into the man pages, in particular OOM scoring and the value -17 (the legacy oom_adj setting that disables OOM killing for a process).
I see OOMScoreAdjust=-999 being used for the kubelet, but why not -1000? -999 seems like it would be just as likely to be picked as -998, unless the for_each_process(p) macro always walks processes first to last?
>unless the for_each_process(p) macro always walks processes first to last?
It seems it would usually start from the first processes (see the macro below), i.e. it would get to the "top" processes like kubelet, docker, etc. before the containers.
#define next_task(p) \
list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
#define for_each_process(p) \
for (p = &init_task ; (p = next_task(p)) != &init_task ; )
Given that "chosen" is updated only "if (points > chosen_points)" it seems that the first listed process with score 1 will stay the "chosen" in that situation, ie. it will be one of the top processes like the [-999] kubelet, not a [-998] container.
From a provider of Azure's class, I'd have expected that they wouldn't rely on that machinery and would instead disable the killer for the top processes outright (oom_score_adj of -1000, i.e. OOM_SCORE_ADJ_MIN, exempts a process entirely).
If you absolutely need managed Kubernetes, stick to GCP for now.
While the engineers and PMs would complain a lot about quality issues, management wanted to prioritize more features. It was a running joke at Microsoft: no one gets promoted for improving existing things; if you want a quick promo, build a new thing.
So when you see a bazillion half-baked things in Azure, that's because someone got promoted for building each of those half-baked things and moving on to the next big thing.
Going from 0 to 90% is the same amount of work as going from 90 to 99%, and the same amount of work again as going from 99.0 to 99.99%. Making things insanely great is hard and requires a lot of dedicated focus and a commitment to set a higher bar for yourself.
1. Deploy your Linux service on k8s with redundant nodes
2. Create a k8s VolumeClaim and mount it on your nodes to give your application some long-lived or shared disk storage, e.g. for processing user-uploaded files (see the sketch after this list).
3. Wait until the subtle bugs start to appear in your app.
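For reference, step 2 above is roughly this kind of claim. A sketch: the "azurefile" storage class name is an assumption, and a ReadWriteMany claim on AKS typically ends up backed by Azure Files:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: uploads                  # hypothetical
    spec:
      accessModes:
        - ReadWriteMany              # shared across nodes
      storageClassName: azurefile    # assumption: an Azure Files-backed class exists
      resources:
        requests:
          storage: 10Gi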
Because persistent k8s volumes on Azure are provided by the Azure disk storage service behind the scenes, lots of weird Windows-isms apply. And this goes beyond stuff like case insensitivity for file names.
For example, if a user tries to upload a file called "COM1" or "PRN1", it will blow up with a disk write error.
Yes, that's right, Azure is the only cloud vendor that is 100% compatible with enforcing DOS 1.0 reserved filenames - on your Linux server in 2018!
>> An Azure disk can only be mounted with Access mode type ReadWriteOnce, which makes it available to only a single AKS node. If needing to share a persistent volume across multiple nodes, consider using Azure Files.
So you must be sharing that volume across multiple nodes using Azure Files, which is an SMB file share service with its own restrictions, as described in the (arguably hard to find) docs: https://docs.microsoft.com/en-us/rest/api/storageservices/na...
>> Directory and file names are case-preserving and case-insensitive.
>> The following file names are not allowed: LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, PRN, AUX, NUL, CON, CLOCK$, dot character (.), and two dot characters (..).
>Yes, that's right, Azure is the only cloud vendor that is 100% compatible with enforcing DOS 1.0 reserved filenames - on your Linux server in 2018!
This is hyperbolic bordering on flatly false. This is more reasonable and accurate:
"Azure is the only cloud vendor that serves their Samba product from Windows boxes, and thus leak Win/NTFS-isms into their Samba shares [that shouldn't be used anyway]."
How would an ext4 filesystem, mounted under Linux, attached as a block device to a VM, be subjected to Windows-isms? What you're implying doesn't even make sense.
1. Most Azure VM types have very stringent limits on the number of attached disks; a K8s worker can easily blow past them.
2. You have tremendous complexity to deal with: pick Azure managed disks vs unmanaged disks on storage accounts (you can't mix them on the same cluster). You have to understand the trade-off of standard vs premium storage and how they bill (premium rounds up and charges by capacity, not consumption). And you need the right VM types for premium.
3. Managed disks each create a resource object in your resource group. A resource group, last I checked, had hard limits on the number of resources (like 4000?). Each VM is a minimum of 3 to 4 resources (with a NIC, image, and disk)... at scale this gets difficult.
4. Azure disks require significant time to create, mount, and remount. A StatefulSet pod failure will sometimes take 3-5 minutes for its PV to move to a different worker, and worse when your Azure region has allocation problems. Azure Files volumes unmount/remount near-instantaneously.
5. Azure disks are block storage and thus ReadWriteOnce only. Azure Files volumes are ReadWriteMany.
So, sure, if you're running a clustered database with dedicated per-node PVs and limited expected redeployments... use Azure disks. If you need a PV for any other reason... especially for application tiers that churn frequently... use Azure Files.
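To make the disks-vs-files choice concrete, here's a sketch of the two kinds of storage classes being compared. The in-tree provisioner names are real; the class names and parameter values are illustrative:

    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: managed-premium              # block storage, ReadWriteOnce only
    provisioner: kubernetes.io/azure-disk
    parameters:
      kind: Managed
      storageaccounttype: Premium_LRS
    ---
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: azurefile                    # SMB share, ReadWriteMany, fast remounts
    provisioner: kubernetes.io/azure-file
    parameters:
      skuName: Standard_LRS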
Maybe Azure Files performance has improved to the point where it's more usable for storage scenarios. I suppose it probably comes down to the use case and application behavior.
It would be good if Azure had someone testing out these scenarios and interfacing with the larger k8s community, maybe through the SIG, for these sorts of musings and questions.
Azure - never again. Company moved to AWS within a quarter.
glibc has pretty slow retries (a 5s timeout, I think) by default, and until 1.11 hits you can't easily set up resolver configs, though you can inject an envvar separately into each pod. And musl-based distros like Alpine don't even support some of glibc's resolver options, iirc.
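The envvar in question is RES_OPTIONS, which glibc's resolver reads at runtime; musl-based images like Alpine ignore at least some of these options, so this only helps the glibc containers. A sketch, with the image and option values just as an example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: dns-tuned-app              # hypothetical
    spec:
      containers:
        - name: app
          image: example/app:1.0       # hypothetical glibc-based image
          env:
            - name: RES_OPTIONS        # same syntax as the resolv.conf "options" line
              value: "timeout:1 attempts:3 single-request-reopen"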
We ended up scaling KubeDNS up to 2 replicas and moving them to a dedicated node pool just to make sure they weren't competing with other workloads. That fixed our issues for now.
Only a matter of time before GCP becomes the #1/2 cloud provider.
Doesn't surprise me. Cosmos was too good to be true:
- Infinite Scalability
- Mongo API or Gremlin API or SQL API
It's obvious that it can't live up to all its promises.
Microsoft has some great people working on Azure, but I do feel like AKS was released to GA too soon. Without a published roadmap and scant acknowledgment of issues, I'm not sure I could recommend it to my clients or employer. It's disappointing, because I've had few issues with other Azure services.
Full disclosure: I receive a monthly credit through a Microsoft program for Azure.
I can't speak to EKS but we've been running production workloads on GKE for over a year with very good results. There have been a very few really troublesome "growing pains" type issues (an early example: loadbalancer services getting stuck with a pending external IP assignment for days) but Google has been awesome about support, even to the extent of getting Tim Hockin and Brendan Burns on the phone with us at various times to gather information about stuff like the example I gave above. I give them high marks and would recommend the service without hesitation.
Hope I don't have to move over to Google cloud.
I can understand being upset about their consumer products but it doesn’t really apply here.
The more services you put under one banner, the more the stink of one disaster is going to linger, and hinder adoption of the successes.
To me, a far simpler proposition would be a new brand & share issuance for each new sub-company (e.g. Waymo), with existing Alphabet shareholders getting pro-rata shares in the new company.
In general AKS is a vanilla k8s cluster and expects that you know what you're doing. MS arguably should enforce some opinions about how things like system services get reservations, etc., but none of that is vanilla. The trouble is that K8s defaults are pretty poor from a security perspective (no seccomp or AppArmor/SELinux profiles) and a performance perspective (no reservations for key system DaemonSets).
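To make "reservations" concrete, the knobs in question live in the kubelet config. Field names per the KubeletConfiguration API; the sizes here are placeholders that have to be tuned per node type:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    systemReserved:              # carve-out for OS daemons (sshd, systemd, ...)
      cpu: 500m
      memory: 500Mi
    kubeReserved:                # carve-out for the kubelet and container runtime
      cpu: 500m
      memory: 500Mi
    evictionHard:                # evict pods before the node itself runs out of memory
      memory.available: "200Mi"

The same values can also be passed as the --system-reserved, --kube-reserved, and --eviction-hard kubelet flags.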
We've had this interesting industry pendulum swing between the extreme poles of "we hate opinionated platforms! Give me all the knobs!" and "this is too hard, we need opinions and guard rails!". I think the success of K8s is exposing people to the complexity of supplying all of the config details yourself, and we will see a new breed of opinionated platforms on top of it very shortly. It reminds me of the early Slackware, SLS, and Debian days, when people traded X11 configs and window manager configs like they were treasured artifacts, before Red Hat, Gnome and KDE, SuSE, and eventually Ubuntu started to force opinions.
Because it's designed entirely for enterprise customers. If you have a startup, you have very little reason to choose OpenShift over Heroku or AWS, honestly.
I still love Red Hat, though.
However, I do share the sentiment that Azure has released a lot of half-baked features and services lately (the last 1.5 to 2 years). I hope this trend does not continue.
1. What version of docker / container runtime is being used?
2. What base image for your containers is being used? e.g. Alpine has known DNS issues