I've been at two companies that attempted to go down the OpenStack route. One wanted to start a cloud offering to their clients and hemorrhaged tons of money trying to just keep OpenStack stable. We couldn't even run our basic Logstash offerings on our OpenStack cluster without them having bizarre performance issues.
We had a really good manager too who had accounts on every other provider (Rackspace, RedHat, Canonical .. all the big ones) and time and time again he was like, "What is this? How are they doing this.." and we just figured they used a ton of specialized proprietary plugins they just weren't open sourcing or a ton of special patch sets.
Second shop had tried moving onto an OpenStack cluster to save on AWS prices. It could never run anything reliably and they scrapped the entire project and re-purposed all the servers for DC/OS, which was super nice and reliable and which every team migrated hundreds of services onto.
My employer runs 5-6 complete OpenStack environments and those things have never had an unplanned outage that I'm aware of. My stuff hasn't ever gone down, I know that.
Back in 2013 we had to evaluate existing cloud/VM platforms to replace plain KVM/libvirt and support our growth. oVirt was garbage (missing installation ISOs, randomly broken install process, cluster nodes not communicating etc.), OpenNebula was buggy, OpenStack seemed quite hard to grasp, Hyper-V was Windows-only and VMware was expensive as hell (even now the TCO Calculator gives us 4000+ EUR/VM - this must be a joke).
We run several VMs with Docker and our apps, manage dedicated servers and their networks (VLANs as provider networks in OpenStack), provide IPsec VPNs for tenants, and run Kubernetes clusters on OpenStack. We also manage several dedicated servers that are not managed by OpenStack for historical reasons, and we will hopefully migrate them to the cloud.
If OpenStack makes our heads hurt, it is due to the lack of documented design patterns. After all these years, the documentation is good for the initial deployment and IMHO for developers (either API consumers or contributors), but not so much for network engineers or system architects.
Some design choices are pretty crucial upfront, and you will pay the price if you change the design later. We ended up modifying database records several times and then slowly rolling the changes out to the compute nodes. Recently the Open vSwitch flow tables were populated nondeterministically after some network changes; we had to inspect the sources, and even then we did not understand why we were experiencing the issues.
But we never encountered stability issues that were not caused by the wild actions of an administrator.
I had this conversation with a colleague who offers VMware-managed Windows VMs to his clients, and he told me a similar thing. On the other hand, he was shocked by the prices of our hardware (approx. $6k per server) and was seriously considering migrating to OpenStack.
A 2U or 4U server is $10k to $20k. You're going to fit all you can in the box, including a minimum of 512GB of memory.
It's not just about license costs though, it's about hardware costs and capacity management. You want as few servers as possible for a given capacity; it's easier to manage and cheaper. You need VMware to abstract the hardware, a bit like AWS: you work with virtual machines and it packs them onto the hosts.
The last time I bought it, which was a few years ago, VMware was $5000 per node for the full package. There was a free edition limited to about 100 GB of memory, but without cluster management or live migration of running VMs (vMotion).
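To put rough numbers on that, here is a back-of-the-envelope sketch using only the figures quoted in this comment ($10k to $20k per server, 512 GB of memory, $5000 per node for the license). The 8 GB VM size and the "no memory overcommit" rule are my own assumptions for illustration, and the function name is made up. It ignores power, hosting, storage, and staff.

```python
def cost_per_vm(server_usd, license_usd, host_ram_gb, vm_ram_gb):
    """Hardware plus license cost per VM when hosts are packed by memory.

    Assumes no memory overcommit: each VM gets its full RAM reserved.
    """
    vms_per_host = host_ram_gb // vm_ram_gb
    return (server_usd + license_usd) / vms_per_host


# Server price range and license price are from the comment above;
# the 8 GB VM size is a hypothetical example.
for server_usd in (10_000, 20_000):
    per_vm = cost_per_vm(server_usd, 5_000, host_ram_gb=512, vm_ram_gb=8)
    print(f"${server_usd} server: ${per_vm:.2f} per VM")
```

Under these assumptions the license is a third of the per-VM cost on the cheaper box and a fifth on the more expensive one, which is why packing density matters as much as the license price.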
If the system was implemented correctly the first time, resource use never exceeds capacity, maintenance always works properly, versions are always up to date, and the infrastructure (power, network, host, storage, cooling) never has problems, then the system appears perfectly stable. But introduce changes and errors with increasing frequency and you quickly find out how robust it actually is.
Also, OpenStack was open to fake vendor openness, where vendors could ship a compatible API with their own extensions. This doesn’t help the system integrator in the long run.
Your performance issues are only really going to be related to the VM tech, the overlay network, the storage layer, the orchestrator settings, or Logstash itself.
Given that you can switch these in and out, you can isolate the problem and replace the broken part. You can also trace the app to see which syscalls are taking so long.
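For the syscall part, a minimal sketch of what I mean: capture a trace with `strace -T` (which appends the time spent in each call in angle brackets) and add up the time per syscall. The sample trace below is made up for illustration, and the parser only handles plain single-process lines, not the pid-prefixed or "unfinished/resumed" lines you get with `-f`.

```python
import re
from collections import defaultdict

# Matches lines such as: fsync(4) = 0 <0.215000>
LINE_RE = re.compile(r"^(?P<name>\w+)\(.*\)\s+=\s+.*<(?P<secs>\d+\.\d+)>$")


def slowest_syscalls(trace_text, top=3):
    """Return (syscall, total_seconds) pairs, slowest first."""
    totals = defaultdict(float)
    for line in trace_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # silently skip anything that is not a completed call
            totals[match.group("name")] += float(match.group("secs"))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical excerpt in the format produced by `strace -T`.
SAMPLE = """\
openat(AT_FDCWD, "/var/log/app.log", O_RDONLY) = 3 <0.000045>
read(3, "line one", 4096) = 8 <0.000020>
fsync(4) = 0 <0.215000>
write(4, "event", 5) = 5 <0.000310>
fsync(4) = 0 <0.180000>
"""

for name, secs in slowest_syscalls(SAMPLE):
    print(f"{name}: {secs:.6f}s")
```

If something like `fsync` dominates, you look at the storage layer; if it is `connect` or `recvfrom`, you look at the network. `strace -c` gives a similar summary table directly, this just shows the idea.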
You can have similar issues in pretty much any environment if your team can't debug that. If that's the case, you should probably go for a popular vendor-supported solution, but you'll be in a sad place when the vendor doesn't have the staff to debug their solution, so pick carefully.
This post isn't supposed to sound insulting to you or your ex colleagues, just pointing out that there is a gulf of knowledge between the guys who can get things to work, and the guys who can tell you why something doesn't work, and this gulf only really presents itself when shit hits the fan.
I'd really be interested in post-mortems. As long as you're not using SDN/overlay networks/weird plugins for Cinder instead of plain NFS, many components of OpenStack are nothing more than config generators and deployers for core Linux iptables/bridges/KVM virtualization.
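To illustrate what I mean by "config generator", here is a toy sketch that turns a security-group-style rule into a plain iptables command. This is not OpenStack's actual code: the `IngressRule` type, the `to_iptables` function and the chain name are all hypothetical, and the real thing manages its own chains and many more cases. The iptables flags themselves (`-A`, `-p`, `--dport`, `-s`, `-j`) are the standard ones.

```python
from dataclasses import dataclass


@dataclass
class IngressRule:
    """Hypothetical, simplified stand-in for a security group rule."""
    protocol: str      # "tcp" or "udp"
    port: int
    source_cidr: str


def to_iptables(rule, chain="example-vm-ingress"):
    """Render one rule as an iptables argv list (chain name is made up)."""
    return [
        "iptables", "-A", chain,
        "-p", rule.protocol,
        "--dport", str(rule.port),
        "-s", rule.source_cidr,
        "-j", "ACCEPT",
    ]


ssh_rule = IngressRule(protocol="tcp", port=22, source_cidr="10.0.0.0/24")
print(" ".join(to_iptables(ssh_rule)))
```

The point is that when something breaks, the end result is still ordinary iptables rules, bridges and KVM processes on the host that you can inspect with the usual tools.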
> It could never run anything reliably and they scrapped the entire project and re-purposed all the servers for DC/OS, which was super nice and reliable and which every team migrated hundreds of services onto.
We're moving our stuff away from DC/OS as we're sick of the instabilities and especially the UI and configuration changing every release. It's been two and a half years of banana-ware for us.
Our biggest pain point, next to the tendency of amok-running deployments leading to disks filled up with useless logs (leading once to a totally corrupted master after a weekend), was/is that the "official" Jenkins package is the ultimate PITA to upgrade, massively lags behind despite security issues (current: 2.150.3 - mesosphere/jenkins: 2.150.1!) and you can't even run Jenkins outside of DC/OS because it needs the Marathon shared library to work.
Another thing that we dearly missed was the ability to "drain" a node - for example if I want to perform maintenance on a node, but cannot shut it down right now as a service on the node is being used... then I'd like to at least prevent new jobs from being spawned on that node. Other problems: during system upgrades, stopping the resolvconf generator does not restore the original resolv.conf, leading to broken DNS; and when NTP servers were specified by name, they could not be resolved at boot time (as the resolv.conf still referred to the weird DCOS-round-robin-DNS), leading to DC/OS not wanting to start because the clock was out of sync,...
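To be clear about the "drain" behaviour I'm asking for, here is a toy sketch of it. It is not DC/OS or Mesos code; the `Node` type and `place` function are made up. A draining node keeps whatever is already running on it but is skipped when new jobs are placed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node for illustration only."""
    name: str
    draining: bool = False
    jobs: list = field(default_factory=list)


def place(job, nodes):
    """Put the job on the least-loaded node that is not draining."""
    candidates = [n for n in nodes if not n.draining]
    if not candidates:
        raise RuntimeError("no schedulable nodes")
    target = min(candidates, key=lambda n: len(n.jobs))
    target.jobs.append(job)
    return target.name


a = Node("a", jobs=["long-running-service"])
b = Node("b")
a.draining = True           # maintenance planned, but the service stays up
print(place("build-1", [a, b]))
print(place("build-2", [a, b]))
```

Both new jobs land on node b while the existing service on node a keeps running, so a can be shut down once that service is no longer in use.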
No cloud for that project, contractual prohibition - everything must be kept in-house.