Hacker News
How We Failed at OpenStack (packet.net)
113 points by jsnell on Jan 19, 2015 | 57 comments



Doesn't surprise me in the slightest, to be honest. Having worked on a customized fork of OpenStack that used a pure L3 networking model, I know that you are set for pain the moment you don't want to run everything on a single Ethernet segment.

It doesn't help that the Neutron data model at the time I was working on it (say, 12 months ago or so) was terrible and basically impossible to scale or make performant.

Inevitably you were then stuck with the deprecated and janky nova-network interface, which, while efficient and fast, was also old and missing tons of features, meaning more monkey-patching and janking around. Not to mention that, because of its deprecation, many completely ridiculous bugs befell it in later releases (Grizzly onwards, basically).

TBH I am so disillusioned with the project I hope I don't have to work in or around it again.


> TBH I am so disillusioned with the project I hope I don't have to work in or around it again.

You're not the first I've heard this from, nor, I suspect, the last.

The problem isn't that the code is bad as much as it is that the climate often makes it impossible to fix it. Review queues are weeks or months long. The article makes a good point about the necessary man hours to work on OpenStack. I've seen code removed not because it didn't have a maintainer, but because 200 lines of code didn't have 3-5 full time developers. Insanity persists and money talks.

Looking back, I'd say that OpenStack Nova in the beginning was never this bad. It may not have been the best thing ever, because it wasn't, but no code needs to be terribly great in the beginning. The beginning of a project needs good process more than it needs good code, and OpenStack didn't establish this well enough, early enough.

OpenStack never had a solid, centralized architectural vision. Anyone who attempted to contribute architecturally was essentially ejected. Those who flushed millions into controlling the process and millions more into building ad hoc features got their way. I mistakenly advocated early for wrangling control from Rackspace. The increased influence gained by individual contributors was quickly dwarfed by large corporate influences.

I'm still involved with OpenStack, but far less than I had been in the past. Mostly, I prefer to see myself peripherally involved where I might improve the lives of those trapped in that ecosystem, either to help them deal with the pains they've inflicted upon themselves, or to escape them entirely.


> The problem isn't that the code is bad as much as it is that the climate often makes it impossible to fix it.

Lots of the code isn't great either.


The networking situation has improved in Juno and Kilo, but yeah: One often gets the impression with OpenStack that there's so much attention being paid to new stuff that none of the existing stuff (even recently new) is ever brought to a state of stability and usefulness.


Considering that the Neutron core is entirely L2 abstractions with an L3 plugin on top, it's not really surprising that you had issues with a pure L3 model.


Not sure if people know about Apache CloudStack, but it has all those IaaS features and just works with various networking models, from basic to advanced.


Apparently CloudStack "lost" to OpenStack (see: http://www.infoworld.com/article/2608995/openstack/cloudstac... ). I've also heard this sentiment in my admittedly OpenStack-biased circles.

That said, I'm not suggesting it should not be considered.


It lost out in getting vendor support from lots of different big vendors. OTOH, many hosting environments find it more useful. (It used to be a Citrix product, which might explain its more unified architecture.)



It looks like they have a pretty good user list going:

http://cloudstack.apache.org/users.html


I agree with the author's observation that a lot of vendor-specific changes are needed on top of OpenStack before it is production-ready. This struck me as slightly alarming when I first started working with the project: my previous experience with FOSS had been Linux, GCC and the like, which were good to go from the start. To his credit, it does seem like the author made a serious effort to understand how to get Neutron to do what he wanted...

I'm guessing that a lot of people make the same mistake of thinking OpenStack is just as easy as Linux to get running. It's really not. But it does provide 95% of the groundwork to get you started; often that remaining 5% is either your secret sauce or security overheads. And unfortunately, the details of how to do that are not open to the public... yet.

Also, the slow pace of getting changes into OpenStack leads many projects to keep their changes as custom patches, and once something is working there really isn't much incentive to push it upstream.


Sounds like these guys are doubling down on the IaaS model, 'premium bare metal'? Certainly there are a lot of people who'd like to run on bare metal with a more configurable network, but how realistic is it at this time?

> You see, physical switch operating systems leave a lot to be desired in terms of supporting modern automation and API interaction (Juniper’s forthcoming 14.2 JUNOS updates offer some refreshing REST API’s!).

This. Network hardware vendors have no incentive to make their devices more easily automated, and in fact face disincentives to do so.

Does anyone remember the excitement and promise around Google App Engine when it was first announced, and before they changed the pricing model to per-instance? The ability to put your app on the cloud, scale up to the free tier, then out from the free tier on a paid plan if that's what you needed.

That model entirely disappeared. I miss it. Is anyone doing that now?


>> >> You see, physical switch operating systems leave a lot to be desired in terms of supporting modern automation and API interaction (Juniper’s forthcoming 14.2 JUNOS updates offer some refreshing REST API’s!).

>> This. Network hardware vendors have no incentive to make their devices more easily automated, and in fact face disincentive not to.

There is actually a relatively established roadmap for the solution to this in "bare metal" / "white box" switches that essentially just talk OpenFlow to a controller. Google moved their entire international internal backbone (more traffic than public facing) to this model[1].

The issue at the moment is that there aren't many OS options, and consequently very little hardware support. Google developed their own hardware (despite preferring to have bought it[2]), and my understanding is they wrote their own software too.

[1] http://www.opennetsummit.org/archives/apr12/hoelzle-tue-open... [2] http://youtu.be/VLHJUfgxEO4?t=39m20s


+1. Talking about REST endpoints hosted by JUNOS is missing the forest for the trees. Protocols like OpenFlow, and whatever the Contrail version was called, seem to be where network automation is headed: centralized state and modelling, and pushing specific paths/updates out to the edge.

With regards to "bare metal" virtualization, I'd expect to see a lot more in the next 12-18 months. On the network you need dynamic path configuration and traffic encapsulation/isolation. That should be OpenFlow and VXLAN/NVGRE. On the host hardware you'll want I/O virtualization (SR-IOV/MR-IOV) and possibly hardware encap as well. Substantial progress is being made on both fronts.

Edit: although it's great to have two encap options, I think they're incomplete at best. All of the hard work has been punted to the centralized controllers, and the RFCs have nothing useful to contribute there. Some of the RFC behavior is also insane/laughable; multicast for broadcast and MAC/tunnel-endpoint discovery, ORLY? I'll be very surprised if there are any large VXLAN/NVGRE deployments which aren't bespoke.
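For context on how thin the encap itself is: the entire header VXLAN adds (per RFC 7348) is 8 bytes: a flags byte, reserved bits, and a 24-bit VNI. Everything hard, as noted above, lives in the control plane. A quick illustrative sketch in Python (purely the on-the-wire format, not tied to any vendor implementation):

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header from RFC 7348:
    1 flags byte (I bit 0x08 set = VNI is valid),
    3 reserved bytes, 24-bit VNI, 1 reserved byte."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08
    # Pack the VNI into the top 3 bytes of the final 32-bit word.
    return struct.pack("!B3xI", flags, vni << 8)

hdr = vxlan_header(5001)
assert len(hdr) == 8
assert hdr[0] == 0x08                               # I bit set
assert int.from_bytes(hdr[4:7], "big") == 5001      # VNI round-trips
```

The header carries no learning or discovery machinery at all, which is exactly why the RFC falls back on multicast for endpoint discovery and real deployments end up pairing it with a bespoke controller.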


The opportunity extends beyond support for x86 devices, in that more traditional hardware switches etc. should also work with OpenFlow.

I've found there's lots of OSS around controllers and virtual switches for testing/lab use, but the only serious OpenFlow agent designed for hardware switches I've found is Big Switch's Indigo[1], and it has very limited hardware support.

I see experimental support in OpenWRT; this is very interesting, as it opens up a shed load of hardware options.

[1] http://www.projectfloodlight.org/indigo/


As someone who works with OpenFlow (a lot), I have my doubts whether the tech will pan out. Look at what Facebook did to their network using traditional technologies. Look at Cisco's ACI and Juniper's Contrail. About the only thing OpenFlow has going for it is that it runs on multiple vendor platforms (assuming you ignore all the switch-to-controller interop problems).


> Does anyone remember the excitement and promise around Google App Engine when it was first announced, and before they changed the pricing model to per-instance? The ability to put your app on the cloud, scale up to the free tier, then out from the free tier on a paid plan if that's what you needed.

> That model entirely disappeared. I miss it. Is anyone doing that now?

Just about every PaaS (including App Engine) does this[1][2][3][4]. What am I missing?

[1] https://www.heroku.com/pricing

[2] https://cloud.google.com/appengine/docs/quotas

[3] https://www.openshift.com/products/pricing

[4] https://appharbor.com/pricing


The difference, in my estimation, is pricing per instance (whatever an instance is) versus pricing per resource used, past the free tier. And auto-scaling. I could have been clearer; apologies.

Edit: further, in the original GAE pricing model, the customer paid for specific services, usually by volume. Maybe the accounting was prohibitive?


You can do auto-scaling on OpenShift, as part of the platform, and on Heroku, with HireFire:

http://hirefire.io


GAE still uses the same model; the prices are just higher than they were previously. See my previous link and [1] for details. It even explicitly gives you an instance/hour cost (above the free quota).

Plenty of PaaSes do auto-scaling.

[1] https://cloud.google.com/appengine/pricing


Don't forget BlueMix!


It might not be a bad area at all to be solving problems in.

I was a little incredulous when I read that they started writing IP manager code, but then I remembered this article about Amazon AWS's scale:

See the section titled "The Network Is A Bigger Pain Point Than Servers": http://www.enterprisetech.com/2014/11/14/rare-peek-massive-s...


> how realistic is it at this time?

Quite realistic, as there are many bare-metal offerings at this time, as a quick Google search will attest to. What exactly the packet.net people mean when they say 'premium', however, is unclear.

> That model entirely disappeared. I miss it. Is anyone doing that now?

Heroku still offers a free-tier to start.


This seems to be more of a lesson on the failures of community-building vs. secrecy in the face of a presumed pile of money, and the resulting vendor politics, than on "OpenStack sucks".

An OSS project isn't really supposed to be about "I can freeload on the works of others for some investment in comprehension and customization", which is how I felt the author framed his situation at times.

The underlying failure seems to be that the author decided it was easier to maintain his own proprietary platform than to modify OpenStack for their needs and contribute back to the community. That would have let others pick up their work down the road, potentially reducing the maintenance burden (at the expense of exposing any secret sauce you feel you might have).

The deeper failure is in the incentives for Rackspace to withhold key commits on Ironic from the community because they feel it is secret sauce. (I am taking the OP's version of the tale at face value.) They're one of the flagship supporters of OpenStack, and their behavior is perceptibly a big reason for its failures to date.

The limitations of Neutron without a product like VMware NSX underneath are well known. Production-grade virtual networking at scale is hard, and also mostly secret sauce (for now).

OpenStack seems to effectively have become the OMG and CORBA 1.0 with a reference implementation: it's cloud-vendor kabuki instead of distributed-objects square dancing. You need vendor help to get going, the portability is very limited, and you'll get some value out of what's been done, but at great effort. It also seems to be a useful commons for network and storage vendors to help drive interoperability with the side modules (Cinder and Neutron). If anything, OpenStack is how the industry is desperately brute-force learning what Amazon Web Services has accomplished before AWS swallows the universe, which is valuable but messy.

OpenStack seems the only "I want to run a general-purpose cloud" game in town today; CloudStack exists but doesn't seem to have a lot of momentum. Google, Azure and DigitalOcean are the only competitors to AWS of note, and they don't open source their stuff. CoreOS on PXE or Ubuntu MAAS might work but needs much more mature cluster scheduling, network and volume management. Or perhaps the real next generation will be "none of the above".


> The deeper failure is in the incentives for Rackspace to withhold key commits on Ironic from the community because they feel it is secret sauce. (I am taking the OP's version of the tale at face value.) They're one of the flagship supporters of OpenStack, and their behavior is perceptibly a big reason for its failures to date.

I'm an Ironic core reviewer and work on OnMetal at Rackspace.

At Rackspace, we run ahead of Ironic trunk. It's true that we haven't been super vigilant about upstreaming our patches into Ironic; this is not because it's "secret sauce", nor because we don't care. Priorities are hard, both upstream and downstream.

OpenStack moves slowly compared to a team developing proprietary software. This is a well-known fact. We do our best to upstream our patches as quickly as the project allows, but they often need to be improved to work with other hardware/drivers/etc.

For example, when we launched in July, we already had support for "cleaning" a server: erasing disks, flashing firmware, etc. The "spec" for the new feature was first posted upstream on June 25, 2014.[0] The spec finally landed on January 16, 2015.

Our work on improving network support in Ironic has been similar; the project hasn't been ready for it (again, priorities). It's been done in the open[1], but the code is not in Ironic trunk yet.

We've been extremely open about what we're doing since we joined the Ironic project almost a year ago; I'm curious which patches the article has in mind.

As an Ironic developer, this article bums me out a bit, but it's a good pointer as to what we're doing poorly. /me starts writing better docs

[0] https://review.openstack.org/#/c/102685/
[1] https://etherpad.openstack.org/p/ironic-neutron-bonding


I just wanted to chime in here to say that although there were several situations where our questions couldn't be answered, we probably wouldn't have made it as far with our testing if it wasn't for the answers we received from the OpenStack Ironic developers. I should also point out that I've always found the OpenStack Ironic devs to be kind and professional. Be that as it may, it is unfortunate that there are some conflicting priorities, but I certainly do not blame the devs.


This is excellent information, thanks for sharing it, and correcting my assumptions above.


" 'I can freeload on the works of others for some investment in comprehension and customization', which is how I felt the author framed his situation at times. "

That was a huge, unnecessary leap on your part. I did not read it like that. To me, it was more like: "we were going to leverage the existing projects, add to them, and give back (as they said they would), but we could not, because the underlying projects are not mature, so right now it is more work to fix than to start from scratch."

I don't know if what they did is advisable or not. All I am saying is that yours is an unnecessarily aggressive conclusion, attacking someone who just spent a lot of time warning the community about a lot of the issues under discussion.


I wasn't making any conclusions. What's aggressive or unnecessary about explaining how I felt when I read his article? You felt differently than I did reading it.

Also, freeloading is, by far, how most people and organizations use open source, so it's not exactly a unique situation.


SmartDataCenter[1] was recently open sourced and is built on a lot of cool tech (ZFS, DTrace, Node.js, Illumos).

[1]: https://github.com/joyent/sdc/


Good point, I have to remember that. I wish the world was more open to Illumos.


> it's cloud vendor kabuki instead of distributed objects square dancing

This is a beautiful and evocative metaphor.


My experience agrees with the general tone of the article (although I didn't dig as deep into the code as OP). I implemented an OpenStack private cloud for testing/QA purposes but never felt comfortable enough with it to migrate production (this was Icehouse, so pretty recent).

It was too easy to break core functionality. For example, I literally never saw resizing an instance work properly. It does this crazy hack where, under the hood, it SCPs the VM image to another host and then tries to bring it up. It could have been a quirk of our installation, but it would break every time. I saw similar breakage with Cinder operations, where volumes would get "stuck" on VMs. Again, it could be a bad installation, but it goes to show how easy it is to break OpenStack if you aren't an expert in the codebase.

My current thinking is that a container-centric (as opposed to VM-centric) infrastructure is the way to go: that way I can just throw CoreOS or whatever on the bare-metal nodes and migrate containers as needed.


Sure, container-centric is great if you are running containers. Until it is shown that containers are actually secure, people are going to push for virtualization.


The author writes:

"As we finalize our installation setup for CoreOS this next week (after plowing through Ubuntu, Debian and CentOS)"

Pity he doesn't elaborate on that. I understand that CoreOS is his choice, but it would be nice to know why the other distros aren't.


I think the author is saying they already finished setup for the other three.


Our guys tried CentOS for a while with OpenStack, but gave up because so much of it is clearly maintained with Ubuntu in mind. So they switched to Ubuntu, and things have been mostly smooth sailing since.


Same for us. So much of the documentation is really only for Ubuntu, and you quickly run into gotchas where you have to decide: do I want to work out and fix all the issues to get this distro working, or do I want it to just work so I can move on to the other gotchas?


Can someone explain 'bare metal' to me? Is it a better hypervisor or something? Why would it be better than all the development effort put into something like Linux? Doesn't the Linux kernel run on 'bare metal'?

A fellow developer tried to get me into openstack a little over three years ago, and when I looked, it was far too enterprise for my tastes, but I care more about code than the devops and managing servers.


Bare metal just means not virtualized. There's no hypervisor between the OS and the hardware (the hardware is made of metal, therefore "bare metal"). If you buy a computer with Windows on it, Windows is running on bare metal. Hypervisors run on bare metal.

So yeah, as you suggested, if you install Linux on your computer, the Linux Kernel is running on the bare metal.
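One quick way to check which side of that line a given Linux box is on: on x86, the kernel advertises a "hypervisor" CPU flag in /proc/cpuinfo when it detects it is running under one. A rough heuristic sketch (illustrative only; note that containers share the host kernel, so this says nothing about container isolation):

```python
def on_bare_metal() -> bool:
    """Heuristic: on x86 Linux, /proc/cpuinfo lists a 'hypervisor'
    flag when the kernel is running under a hypervisor (KVM, Xen,
    VMware, Hyper-V...). Absence of the flag suggests bare metal."""
    try:
        with open("/proc/cpuinfo") as f:
            return not any(
                line.startswith("flags") and "hypervisor" in line
                for line in f
            )
    except OSError:  # no procfs (non-Linux): can't tell
        return False

print("bare metal" if on_bare_metal() else "virtualized (or unknown)")
```

Tools like `systemd-detect-virt` do a more thorough version of the same check.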


So really, this is still virtualisation - but with less overhead.


If you want to give your users instances that have full hardware access, bare metal instances allow you to still manage that using OpenStack. This might be to allow access to GPUs or other specialized hardware.

Another example is deploying hypervisors: Test suites for OpenStack are run against many different versions by using OpenStack to deploy the systems to test. HP's OpenStack distribution uses it as a deployment mechanism, taking over and managing the nodes of the OpenStack cluster from a small initial cluster.


It means owning and controlling your own hardware. It's really, really nice to be able to reason about real hardware when doing performance optimization or deep bug analysis. If you're stuck in the public cloud ghetto there's only so deep you can go before throwing up your hands and saying, "eh, I guess Amazon is having a bad day..."


I'm pretty certain that's the point: you get a dedicated server. They seem to guarantee one will be set up for you in four minutes!

Someone tell me if I'm wrong :-)

This would mean he couldn't give two hoots about virtualisation; I guess his concern would be automated deployment and network allocation, along with monitoring.


I tend to define it more like andrewstuart2 above: "not virtualized". As far as I know it doesn't really mean dedicated, though (although maybe it should?).

Edit: why not dedicated? Because you can have containers running on "bare metal".


Containers are a form of virtualization, as are zones or jails.

Bare metal implies dedicated hardware.


It would be cool if the author could elaborate on this conversation: "As the conversation developed, I eventually agreed that many of the public cloud services were not user friendly and had an overly high barrier to usage."

...as I read through the article, it sounds like it was probably around bare-metal needs. Still, elaboration would be nice here :)


I am doing this same work for the company I work for. I think it's going well, but it's taken far longer than any of my estimates.

The Ironic guys are amazing, really great people to work with. The guys in IRC are good at working with us.

Just hope we can provide some value to the project as well to return the favor.


@AndyNemmity I totally agree - the Ironic guys (particularly jroll) are awesome and were very helpful.


Shouldn't take more than an evening for somebody experienced with hosting to pick up these red flags reading through the OpenStack documentation/source.

From what I recall the documentation left a ton to be desired. Just trying to figure out how Neutron and their "VPC" equivalent was supposed to be implemented left more questions than answers :|


Yeah, I looked at openstack for a few days - including a small test deployment, and came away with "not for a few more years"


> Premium Bare Metal

Given that's the offering, it doesn't surprise me a bit they didn't go with OpenStack. That said, I guess they think running containers on bare metal is a better way to roll.


It would be, if Docker containers were actually root-safe. Currently, you'll probably have to rent an entire physical machine to run your containers on, which will be fast, but not necessarily great for Packet as they grow.

OpenStack really isn't appropriate for this type of scenario, unless their original goal was to use KVM machines to add some extra security/multi-tenancy.


They were looking to use Ironic, which is specifically built for renting entire physical machines. I agree, however, that contributing and working upstream in OpenStack is challenging. I do not doubt that they would find it easier to build new infrastructure than contribute to OpenStack.

OpenStack no longer behaves like a nimble startup and may no longer be the right option for someone looking for a quick, iterative development process. I'd question if any startup should really be a consumer of OpenStack at this point.


Eric, I think that's a bit of a leap. If we look back to the mission statement, it still fulfills that role, IMO. Ironic is without doubt the immature stepchild, which to me really only makes sense if you wanted to do virtualization but also offer bare metal under the same API.


To put it another way, I question whether any startup should be using OpenStack today if OpenStack does not immediately solve the needs that startup expects to have in the future. That's especially true for DIY. I'm speaking of consumers of OpenStack, of course, not of companies building value on it.

If OpenStack doesn't solve the startup's future needs right now, the startup's future needs will arrive sooner than the features needed in OpenStack. Contributing upstream will have too great an opportunity cost. The only legitimate options for such companies are to not use OpenStack or to maintain their own fork.

Right now, at the rate of innovation and improvement currently in OpenStack and the processes necessary for participating in the community, I'd argue that if a startup consuming OpenStack has resources to dedicate toward upstream development and baby-sitting that process, that they're either A) Not a startup, or B) a failing startup.


s/OpenStack/customLinuxKernel/

No startup should be rolling their own cloud, any more than they should be putting together their own Linux kernel. Go public, or if you MUST be on your own metal, use a turnkey solution like Metacloud or Nebula (and let them manage it for you).



