
How We Failed at OpenStack - jsnell
https://www.packet.net/blog/how-we-failed-at-openstack
======
jpgvm
Doesn't surprise me in the slightest, to be honest. Having worked on a
customized fork of OpenStack that used a pure L3 networking model, I know you
are set for pain the moment you don't want to run everything on a single
Ethernet segment.

It doesn't help that the Neutron data model at the time I was working on it
(say, 12 months ago or so) was terrible and basically impossible to scale or
make performant.

Inevitably you were then stuck with the deprecated and janky nova-network
interface, which, while efficient and fast, was also old and missing tons of
stuff - meaning more monkey patching and janking around. Not to mention the
fact that, because of its deprecation, many completely ridiculous bugs befell
it in later releases (Grizzly onwards, basically).

TBH I am so disillusioned with the project I hope I don't have to work in or
around it again.

~~~
ewindisch
> TBH I am so disillusioned with the project I hope I don't have to work in or
> around it again.

You're not the first I've heard this from, nor, I suspect, will you be the last.

The problem isn't that the code is bad as much as it is that the climate often
makes it impossible to fix it. Review queues are weeks or months long. The
article makes a good point about the necessary man hours to work on OpenStack.
I've seen code removed not because it didn't have a maintainer, but because
200 lines of code didn't have 3-5 full time developers. Insanity persists and
money talks.

Looking back, I'd say that OpenStack Nova in the beginning was never this bad.
It may not have been the best thing ever, because it wasn't, but no code needs
to be terribly great in the beginning. The beginning of a project needs good
process more than it needs good code, and OpenStack didn't establish this well
enough, early enough.

OpenStack never had a solid, centralized architectural vision. Anyone who
attempted to contribute architecturally was essentially ejected. Those that
flushed millions into controlling the process and millions more into building
ad-hoc features got their way. I mistakenly advocated early on for wresting
control from Rackspace. The increased influence gained by individual
contributors was quickly dwarfed by large corporate influences.

I'm still involved with OpenStack, but far less than I had been in the past.
Mostly, I prefer to see myself peripherally involved where I might improve the
lives of those trapped in that ecosystem, either to help them deal with the
pains they've inflicted upon themselves, or to escape them entirely.

~~~
nl
_The problem isn't that the code is bad as much as it is that the climate
often makes it impossible to fix it._

Lots of the code isn't great either.

------
bhaisaab
Not sure if people _know_ about Apache CloudStack or not. It has all of those
IaaS features, and it just works with various basic to advanced networking
models.

~~~
andyidsinga
Apparently CloudStack "lost" to OpenStack (see:
[http://www.infoworld.com/article/2608995/openstack/cloudstac...](http://www.infoworld.com/article/2608995/openstack/cloudstack--losing-to-openstack--takes-its-ball-and-goes-home.html)).
I've also heard this sentiment in my admittedly OpenStack-biased circles.

That said, I'm not suggesting it should not be considered.

~~~
nl
It lost out on getting _vendor_ support from lots of different big vendors.
OTOH, many hosting environments find it more useful. (It used to be a Citrix
product, which might explain its more unified architecture.)

------
pm90
I agree with the author's observation that a lot of vendor-specific changes
are needed on top of OpenStack before it is production ready. This struck me
as slightly alarming when I first started working with the project: my
previous experience with FOSS had been Linux, GCC and the like, which were
good to go from the start. To his credit, it does seem like the author made a
serious effort to understand how to get Neutron to do what he wanted...

I'm guessing that a lot of people make the same mistake of thinking OpenStack
is just as easy as Linux to get running. It's really not. But it does provide
95% of the groundwork to get you started; often that remaining 5% is either
your secret sauce or security overhead. And unfortunately, the details of how
to do _that_ are not open to the public... yet.

Also, the slow pace of getting changes into OpenStack makes many projects keep
their changes as custom patches, and once it's working there really isn't much
incentive to push them upstream.

------
marktangotango
Sounds like these guys are doubling down on the IaaS model, "premium bare
metal"? Certainly there are a lot of people who'd like to run on bare metal,
with a more configurable network, but how realistic is it at this time?

>>You see, physical switch operating systems leave a lot to be desired in
terms of supporting modern automation and API interaction (Juniper’s
forthcoming 14.2 JUNOS updates offer some refreshing REST API’s!).

This. Network hardware vendors have no incentive to make their devices more
easily automated, and in fact face disincentives to do so.
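
For a sense of what that API interaction looks like when a vendor does expose
it, here's a minimal sketch of calling a REST RPC on a JUNOS 14.2-style device
with Python's requests library; the hostname, port, credentials and chosen RPC
are illustrative assumptions rather than anything from the article:

    import requests

    # Minimal sketch, assuming a JUNOS device with its REST API enabled.
    # Host, port and credentials are placeholders for illustration.
    SWITCH = "http://tor-switch.example.net:3000"

    resp = requests.get(
        SWITCH + "/rpc/get-interface-information",
        headers={"Accept": "application/xml"},
        auth=("netops", "secret"),
    )
    resp.raise_for_status()
    # Structured RPC output instead of screen-scraped CLI text.
    print(resp.text)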

Does anyone remember the excitement and promise around Google App Engine when
it was first announced, and before they changed the pricing model to
per-instance? The ability to put your app on the cloud, scale up within the
free tier, then out of the free tier on a paid plan if that's what you needed.

That model entirely disappeared. I miss it. Is anyone doing that now?

~~~
thomseddon
>> >> You see, physical switch operating systems leave a lot to be desired in
terms of supporting modern automation and API interaction (Juniper’s
forthcoming 14.2 JUNOS updates offer some refreshing REST API’s!).

>> This. Network hardware vendors have no incentive to make their devices more
easily automated, and in fact face disincentives to do so.

There is actually a relatively established roadmap for the solution to this in
"bare metal" / "white box" switches that essentially just talk OpenFlow to a
controller. Google moved their entire international internal backbone (which
carries more traffic than their public-facing network) to this model[1].

The issue at the moment is that there aren't many OS options, and consequently
very little hardware support. Google developed their own hardware (despite
preferring to have bought it[2]), and my understanding is they wrote their own
software too.

[1] [http://www.opennetsummit.org/archives/apr12/hoelzle-tue-open...](http://www.opennetsummit.org/archives/apr12/hoelzle-tue-openflow.pdf)
[2]
[http://youtu.be/VLHJUfgxEO4?t=39m20s](http://youtu.be/VLHJUfgxEO4?t=39m20s)
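
To make the controller side of that model concrete, here's a minimal sketch of
an OpenFlow 1.3 app written against the Ryu controller framework (Ryu is just
one controller choice, picked here for illustration). It installs a table-miss
flow on every switch that connects, which is the usual first step before
pushing real forwarding rules:

    from ryu.base import app_manager
    from ryu.controller import ofp_event
    from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
    from ryu.ofproto import ofproto_v1_3

    class TableMissApp(app_manager.RyuApp):
        OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

        @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
        def switch_features_handler(self, ev):
            datapath = ev.msg.datapath
            ofproto = datapath.ofproto
            parser = datapath.ofproto_parser

            # Table-miss flow: anything without a more specific rule is
            # punted to the controller for a decision.
            match = parser.OFPMatch()
            actions = [parser.OFPActionOutput(ofproto.OFPP_CONTROLLER,
                                              ofproto.OFPCML_NO_BUFFER)]
            inst = [parser.OFPInstructionActions(ofproto.OFPIT_APPLY_ACTIONS,
                                                 actions)]
            datapath.send_msg(parser.OFPFlowMod(datapath=datapath, priority=0,
                                                match=match, instructions=inst))

Run it with `ryu-manager` and point a white-box switch (or Open vSwitch in a
lab) at the controller's address; the same app can then push per-path flow
entries, which is the centralized model being described.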

~~~
donavanm
+1, talking about REST endpoints hosted by JUNOS is missing the forest for the
trees. Protocols like OpenFlow, and whatever the Contrail version was called,
seem to be where network automation is headed: centralized state and
modelling, and pushing specific paths/updates out to the edge.

With regards to "bare metal" virtualization, I'd expect to see a lot more in
the next 12-18 months. On the network you need dynamic path configuration and
traffic encapsulation/isolation; that should be "OpenFlow" and VXLAN/NVGRE. On
the host hardware you'll want I/O virtualization (SR-IOV/MR-IOV) and possibly
hardware encap as well. Substantial progress is being made on both fronts.

Edit: although it's great to have two encap options, I think they're
incomplete at best. All of the hard work has been punted to the centralized
controllers, and the RFCs have nothing useful to contribute there. Some of the
RFC behavior is also insane/laughable; multicast for broadcast and MAC/tunnel
endpoint discovery, ORLY? I'll be very surprised if there are any large
VXLAN/NVGRE deployments which aren't bespoke.
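
For reference, the multicast behaviour being criticised is what you get from a
stock Linux VXLAN device. A minimal sketch using pyroute2 (the library,
interface names and VNI are my own illustrative choices, and it needs root)
that creates exactly that kind of multicast-discovered tunnel:

    from pyroute2 import IPRoute

    ipr = IPRoute()

    # Physical NIC the tunnel rides on; "eth0" is an assumed name.
    phys = ipr.link_lookup(ifname="eth0")[0]

    # VXLAN device with VNI 42. Binding it to a multicast group means
    # broadcast and unknown MAC/tunnel-endpoint discovery are handled by
    # flooding that group - the RFC behaviour criticised above.
    ipr.link("add", ifname="vxlan42", kind="vxlan",
             vxlan_id=42, vxlan_group="239.1.1.1",
             vxlan_link=phys, vxlan_port=4789)

    ipr.link("set", index=ipr.link_lookup(ifname="vxlan42")[0], state="up")
    ipr.close()

Real deployments tend to replace the multicast flooding with a controller that
programs unicast forwarding entries, which is exactly the "hard work punted to
the centralized controllers" point.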

~~~
thomseddon
The opportunity extends beyond x86 devices, in that more traditional hardware
switches etc. should also work with OpenFlow.

I've found there's lots of OSS around the controller and virtual switches for
testing/lab use, but the only serious OpenFlow agent designed for hardware
switches I've found is Big Switch's Indigo[1], and it has very limited
hardware support.

I see experimental support in OpenWRT - this is very interesting as it opens
up a shed load of hardware options.

[1]
[http://www.projectfloodlight.org/indigo/](http://www.projectfloodlight.org/indigo/)

------
parasubvert
This seems to be more of a lesson on the failures of community building vs.
secrecy in the face of a presumed pile of money and the resulting vendor
politics, than "OpenStack sucks".

An OSS project isn't really supposed to be about "I can freeload on the works
of others for some investment in comprehension and customization", which is
how I felt the author framed his situation at times.

The underlying failure seems to be that the author decided it was easier to
maintain his own proprietary platform than to modify OpenStack for their needs
and contribute back to the community. The latter would have let others pick up
their work down the road, potentially reducing the maintenance burden (at the
expense of exposing any secret sauce you feel you might have).

The deeper failure is in the incentives for Rackspace to withhold key commits
on Ironic from the community because they feel it is secret sauce. (I am
taking the OP's version of the tale at face value.) They're one of the
flagship supporters of OpenStack, and their behavior is perceptibly a big
reason for its failures to date.

The limitations of Neutron without a product like VMware NSX underneath are
well known. Production-grade virtual networking at scale is hard, and also
mostly secret sauce (for now).

OpenStack seems to have effectively become the OMG and CORBA 1.0 with a
reference implementation - it's cloud-vendor kabuki instead of
distributed-objects square dancing. You need vendor help to get going, and the
portability is very limited; you'll get some value out of what's been done,
but at great effort. It does seem to be a useful commons for network and
storage vendors to help drive interoperability through the side modules
(Cinder and Neutron). If anything, OpenStack is how the industry is
desperately brute-force learning what Amazon Web Services has accomplished
before AWS swallows the universe, which is valuable but messy.

OpenStack seems to be the only "I want to run a general purpose cloud" game in
town today - CloudStack exists but doesn't seem to have a lot of momentum.
Google, Azure and DigitalOcean are the only competitors to AWS of note, and
they don't open-source their stuff. CoreOS on PXE or Ubuntu MAAS might work,
but needs much more mature cluster schedulers, network and volume management.
Or perhaps the real next generation will be "none of the above".

~~~
jroll
> The deeper failure is in the incentives for Rackspace to withhold key
> commits on Ironic from the community because they feel it is secret sauce.
> (I am taking the OP's version of the tale at face value.) They're one of the
> flagship supporters of OpenStack, and their behavior is perceptibly a big
> reason for its failures to date.

I'm an Ironic core reviewer and work on OnMetal at Rackspace.

At Rackspace, we run ahead of Ironic trunk. It's true that we haven't been
super vigilant about upstreaming our patches into Ironic; this is not because
it's "secret sauce", nor because we don't care. Priorities are hard, both
upstream and downstream.

OpenStack moves slowly compared to a team developing proprietary software.
This is a well-known fact. We do our best to upstream our patches as quickly
as the project allows, but they often need to be improved to work with other
hardware/drivers/etc.

For example, when we launched in July, we already had support for "cleaning" a
server - erasing disks, flashing firmware, etc. The "spec" for the new feature
was first posted upstream on June 25, 2014.[0] That spec finally landed on
January 16, 2015.

Our work on improving network support in Ironic has been similar; the project
hasn't been ready for it (again, priorities). It's been done in the open[1],
but the code is not in Ironic trunk yet.

We've been extremely open about what we're doing since we joined the Ironic
project almost a year ago; I'm curious which patches the article has in mind.

As an Ironic developer, this article bums me out a bit, but it's a good
pointer as to what we're doing poorly. /me starts writing better docs

[0]
[https://review.openstack.org/#/c/102685/](https://review.openstack.org/#/c/102685/)
[1] [https://etherpad.openstack.org/p/ironic-neutron-bonding](https://etherpad.openstack.org/p/ironic-neutron-bonding)

~~~
dlaube
I just wanted to chime in here to say that although there were several
situations where our questions couldn't be answered, we probably wouldn't have
made it as far with our testing if it weren't for the answers that _were_
received from the OpenStack Ironic developers. I should also point out that
I've always found the OpenStack Ironic devs to be kind and professional. Be
that as it may, it is unfortunate that there are some conflicting priorities,
but I certainly do not blame the devs.

------
bkeroack
My experience agrees with the general tone of the article (although I didn't
dig as deep into the code as OP). I implemented an OpenStack private cloud for
testing/QA purposes but never felt comfortable enough with it to migrate
production (this was Icehouse, so pretty recent).

It was too easy to break core functionality--for example, I literally never
saw resizing an instance work properly. It does this crazy hack where under
the hood it SCPs the VM image to another host and then tries to bring it up.
It could have been a quirk of our installation but it would break every time.
I saw similar breakage with Cinder operations where volumes would get "stuck"
on VMs. Again, it could be a bad installation but it goes to show you how easy
it is to break OpenStack if you aren't an expert in the codebase.
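
For context, the resize being described is a two-step API operation. A rough
sketch of what it looks like through python-novaclient (credentials, endpoints
and names are placeholders, using the old Icehouse-era username/password auth
style):

    from novaclient import client

    # Placeholders throughout; adjust to your Keystone endpoint and tenant.
    nova = client.Client("2", "demo", "secret", "demo-tenant",
                         "http://controller:5000/v2.0")

    server = nova.servers.find(name="test-vm")
    flavor = nova.flavors.find(name="m1.medium")

    # Resize cold-migrates the instance; with the stock libvirt driver the
    # disk is copied to the target host (the scp hack mentioned above) and
    # the instance is left in VERIFY_RESIZE.
    nova.servers.resize(server, flavor)

    # The resize then has to be confirmed (or reverted) explicitly.
    nova.servers.confirm_resize(server)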

My current thinking is that a container-centric (as opposed to VM-centric)
infrastructure is the way to go--that way I can just throw CoreOS or whatever
on the bare metal nodes and migrate containers as needed.

~~~
hueving
Sure, container-centric is great if you are running containers. Until it is
shown that containers are actually secure, people are going to push for
virtualization.

------
Luyt
The author writes:

 _" As we finalize our installation setup for CoreOS this next week (after
plowing through Ubuntu, Debian and CentOS)"_

Pity he doesn't elaborate on that. I understand that CoreOS is his choice, but
it would be nice to know why the other distros aren't.

~~~
mratzloff
Our guys tried CentOS for a while with OpenStack, but gave up because so much
of it is clearly maintained with Ubuntu in mind. So they switched to Ubuntu,
and things have been mostly smooth sailing since then.

~~~
AndyNemmity
Same for us. So much of the documentation is really only for Ubuntu, and you
quickly run into gotchas where you have to decide: do I want to work out and
fix all the issues to get this distro working, or do I want it to just work so
I can move on to the other gotchas?

------
snarfy
Can someone explain 'bare metal' to me? Is it a better hypervisor or
something? Why would it be better than all the development effort put into
something like Linux? Doesn't the Linux kernel run on 'bare metal'?

A fellow developer tried to get me into OpenStack a little over three years
ago, and when I looked, it was far too enterprise for my tastes, but I care
more about code than about devops and managing servers.

~~~
andrewstuart2
Bare metal just means not virtualized. There's no hypervisor between the OS
and the hardware (the hardware is made of metal, therefore "bare metal"). If
you buy a computer with Windows on it, Windows is running on bare metal.
Hypervisors run on bare metal.

So yeah, as you suggested, if you install Linux on your computer, the Linux
Kernel is running on the bare metal.

~~~
chris_wot
So really, this is still virtualisation - but with less overhead.

------
andyidsinga
It would be cool if the author could elaborate on this conversation: "As the
conversation developed, I eventually agreed that many of the public cloud
services were not user friendly and had an overly high barrier to usage"

...as I read through the article it sounds like it was probably around bare-
metal needs - still, elaboration would be nice here :)

------
AndyNemmity
I am doing this same work for the company I work for. I think it's going well,
but it's taken far longer than anything I had estimated.

The Ironic guys are amazing, really great people to work with. The guys in IRC
are good at working with us.

Just hope we can provide some value to the project as well to return the
favor.

~~~
zsmith928
@AndyNemmity I totally agree - the Ironic guys (particularly jroll) are
awesome and were very helpful.

------
Rapzid
Shouldn't take more than an evening for somebody experienced with hosting to
pick up these red flags reading through the OpenStack documentation/source.

From what I recall, the documentation left a ton to be desired. Just trying to
figure out how Neutron and their "VPC" equivalent were supposed to be
implemented left more questions than answers :|

~~~
mdekkers
Yeah, I looked at OpenStack for a few days - including a small test
deployment - and came away with "not for a few more years".

------
kordless
> Premium Bare Metal

Given that's the offering, it doesn't surprise me a bit they didn't go with
OpenStack. That said, I guess they think running containers on bare metal is a
better way to roll.

~~~
thinkingkong
It would be, if Docker containers were actually root-safe. Currently, you'll
probably have to rent an entire physical machine to run your containers on,
which will be fast, but not necessarily great for Packet as they grow.

OpenStack really isn't appropriate for this type of scenario, unless their
original goal was to use KVM machines to add some extra security /
multi-tenancy.

~~~
ewindisch
They were looking to use Ironic, which is specifically built for renting
entire physical machines. I agree, however, that contributing and working
upstream in OpenStack is challenging. I do not doubt that they would find it
easier to build new infrastructure than contribute to OpenStack.

OpenStack no longer behaves like a nimble startup and may no longer be the
right option for someone looking for a quick, iterative development process.
I'd question if any startup should really be a consumer of OpenStack at this
point.

~~~
Daviey
Eric, I think that is a bit of a leap. If we look back to the mission
statement, it still fulfills that role IMO. Ironic is without doubt the
immature stepchild, which to me really only makes sense if you wanted to do
virtualization - but also offer bare metal under the same API.

~~~
ewindisch
To put it another way, I question whether any startup should be using
OpenStack today if OpenStack does not immediately solve the needs that startup
expects to solve in the future. That's especially true for DIY. I'm speaking
of consumers of OpenStack, of course, not of companies building value on it.

If OpenStack doesn't solve the startup's future needs right now, those future
needs will arrive sooner than the corresponding features will land in
OpenStack. Contributing upstream will have too great an opportunity cost. The
only legitimate options for such companies are to not use OpenStack, or to
maintain their own fork.

Right now, given the rate of innovation and improvement in OpenStack and the
processes necessary for participating in the community, I'd argue that if a
startup _consuming_ OpenStack has resources to dedicate toward upstream
development and babysitting that process, they're either A) not a startup, or
B) a failing startup.

~~~
geoffarnold
s/OpenStack/customLinuxKernel/

No startup should be rolling their own cloud, any more than they should be
putting together their own Linux kernel. Go public, or if you MUST be on your
own metal, use a turnkey solution like Metacloud or Nebula (and let them
manage it for you).

