I'd really be interested in post-mortems. As long as you're not using SDN/overlay networks/weird plugins for Cinder instead of plain NFS, many components of OpenStack are nothing more than a config generator and deployer for core Linux primitives: iptables, bridges, and KVM virtualization.
> It could never run anything reliably and they scrapped the entire project and re-purposed all the servers for DC/OS, which was super nice and reliable and every team migrate hundreds of services onto.
We're moving our stuff away from DC/OS as we're sick of the instabilities and especially the UI and configuration changing every release. It's been two and a half years of banana-ware for us.
Our biggest pain point, next to runaway deployments filling disks with useless logs (which once left us with a completely corrupted master after a weekend), was/is that the "official" Jenkins package is the ultimate PITA to upgrade and massively lags behind despite known security issues (current upstream: 2.150.3 - mesosphere/jenkins: 2.150.1!) - and you can't even run that Jenkins outside of DC/OS, because it needs the Marathon shared library to work.
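For the disk-filling part: Mesos does ship a logrotate-based container logger module that caps each task's stdout/stderr, which would likely have saved that weekend. A rough sketch of the agent-side config - the env-var flag spelling follows upstream Mesos conventions, but whether/where DC/OS lets you wire this in is an assumption on my part:

```
# Hypothetical agent env config: enable Mesos' logrotate container logger
# so a runaway task can't fill the disk with its stdout/stderr.
# Module name is the upstream one; the module library must also be loaded
# via the agent's --modules flag (preloaded in DC/OS builds, I believe).
MESOS_CONTAINER_LOGGER=org_apache_mesos_LogrotateContainerLogger
```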
Another thing we dearly missed was the ability to "drain" a node: if I want to perform maintenance on a node but can't shut it down right away because a service on it is still in use, I'd at least like to prevent new jobs from being scheduled there. System upgrades were similarly rough: stopping the resolvconf generator didn't restore the original resolv.conf, leaving us with broken DNS; and when NTP servers were specified by name, they couldn't be resolved at boot (resolv.conf still pointed at the weird DC/OS round-robin DNS), so DC/OS refused to start because the clock was out of sync...
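For what it's worth, Mesos itself has had maintenance primitives since ~0.25 that express exactly this "drain" semantic via the master's operator HTTP API - DC/OS just never surfaced them nicely back then. A rough sketch, where the master address, the node IP, and the open-ended maintenance window are all assumptions:

```shell
# Sketch: drain an agent via Mesos maintenance primitives.
# MASTER and NODE are placeholders - adjust for your cluster.
MASTER="leader.mesos:5050"
NODE="10.0.0.12"

# 1. Schedule a maintenance window starting now. The master stops offering
#    this node's resources to frameworks, so no new tasks land on it,
#    while already-running tasks keep running.
SCHEDULE=$(cat <<EOF
{
  "windows": [{
    "machine_ids": [{"hostname": "$NODE", "ip": "$NODE"}],
    "unavailability": {"start": {"nanoseconds": $(( $(date +%s) * 1000000000 ))}}
  }]
}
EOF
)
echo "$SCHEDULE"
# curl -X POST "http://$MASTER/maintenance/schedule" \
#      -H 'Content-Type: application/json' -d "$SCHEDULE"

# 2. Once the remaining tasks are gone, take the machine down for real,
#    and bring it back with /machine/up after maintenance:
# curl -X POST "http://$MASTER/machine/down" \
#      -H 'Content-Type: application/json' \
#      -d "[{\"hostname\": \"$NODE\", \"ip\": \"$NODE\"}]"
```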
No cloud for that project, contractual prohibition - everything must be kept in-house.