
1000 nodes and beyond: updates to Kubernetes performance and scalability - boulos
http://blog.kubernetes.io/2016/03/1000-nodes-and-beyond-updates-to-Kubernetes-performance-and-scalability-in-12.html
======
whalesalad
We are running Kubernetes in production at FarmLogs and LOVE it. We're a very
small team with a ton of operational work to do in other facets of the company
as we prep for the season to begin, but once we've got some free time there
will be an in-depth blog post describing our migration and roll-out. We've
also built some really neat tooling that we would like to share with the
world.

Upgrading to 1.2 has been incredible. Deployments are faster and pods get
scheduled almost instantly now. Our master nodes are down to roughly a quarter
of their previous CPU usage.

We're really excited to be ridin' on kubelets!

~~~
xur17
I would be very interested to hear your process!

Our team is looking at using it, but we haven't found a great way to do
automated deployments with our current build system (Bamboo). The best we've
come up with is a series of bash scripts as the deployment step, but I'm not
fully comfortable with how that would handle failed deployments yet.
Basically, we need a way to handle automated deployments, see the status of
our currently deployed systems, and promote environments.

If anyone is using Kubernetes in production, I'd love to hear what your
deployment process looks like.
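
For illustration, a deployment step along these lines might be a starting
point; this is only a sketch (hypothetical deployment/image names, kubectl
assumed to be configured on the build agent, exact flags and failure behavior
vary by kubectl version):

    # Hypothetical CI deployment step: update a Deployment's image, wait for
    # the rollout, and roll back if it does not converge. Assumes kubectl is
    # on PATH and already pointed at the target cluster.
    import subprocess
    import sys

    DEPLOYMENT = "web"                                 # made-up name
    IMAGE = "registry.example.com/web:" + sys.argv[1]  # build tag from CI

    def run(*args):
        print("+ " + " ".join(args))
        return subprocess.call(args)

    # Push the new image tag into the Deployment spec.
    if run("kubectl", "set", "image",
           "deployment/" + DEPLOYMENT, "web=" + IMAGE) != 0:
        sys.exit(1)

    # Block until the rolling update finishes; on failure, revert and exit
    # non-zero so the build system marks the deployment step as failed.
    if run("kubectl", "rollout", "status", "deployment/" + DEPLOYMENT) != 0:
        run("kubectl", "rollout", "undo", "deployment/" + DEPLOYMENT)
        sys.exit(1)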

~~~
meddlepal
We're looking at Spinnaker.io + Kubernetes right now for just this reason. K8s
support was just added.

~~~
TheIronYuppie
Disclosure: I work at Google on Kubernetes.

Please feel free to email me (aronchick (at) google) if you'd like to discuss
this further (either the parent or grandparent post). We've seen a lot of
this, and would love to help you out!

~~~
meddlepal
Awesome, thanks for the offer! I'll reach out in the next day or two! :)

------
boulos
It's exciting to see that Kubernetes is ready for basically any scale. You're
more likely to run out of quota (on your cloud provider, particularly IPs) or
some other resource (on-prem) before you can't schedule a container quickly
enough.

Disclaimer: I work on Compute Engine and chat with the Kubernetes folks a lot.

~~~
tzaman
That resource being money. I'm deploying a fairly simple app on GKE and things
get out of hand quickly due to confusing pricing. Or maybe I just don't know
where to look.

~~~
TheIronYuppie
Disclaimer: I work at Google on Kubernetes.

Can you say more? Did you just spin up too many nodes?

~~~
vetinari
I would say that there's an impedance mismatch between GKE pricing and unclear
requirements: how many resources, in what shape, you will actually need.

I was looking at the Kubernetes tutorials and couldn't even begin to figure
out how much it would cost to run them. (Well, I didn't try too hard; it
wasn't that important.)

------
jbeda
I just want to give a public shout out to Wojtek on this blog post. It shows
scalability for an actual scenario at levels that most users won't need
(10M req/s!). Beyond that, there is a clear methodology with lots of hard
data, along with a list of the work it took to get there. Very good post!

Disclaimer: I co-founded Kubernetes and help to coordinate the k8s Scalability
SIG, although I'm no longer at Google. I didn't see this before it was
published, though.

------
Rapzid
1.2 has a lot of really nice additions, such as infrastructure containers, the
new ConfigMap API, service draining for node replacements, and more.

Unfortunately I would be running it on AWS, where HA still hasn't been worked
out and manual setup is a bear.

~~~
justinsb
1.2 includes multi-zone support, so your nodes can be in multiple AZs. This
means that a failure of a single zone shouldn't interrupt your apps:
[http://kubernetes.io/docs/admin/multiple-
zones/](http://kubernetes.io/docs/admin/multiple-zones/)

What is not yet in 1.2, but is planned for 1.3, is HA Master - so that failure
of the zone which contains your master won't interrupt the control plane.
(i.e. you will be able to update your apps even as zones are failing).
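
As a quick sanity check that nodes actually landed in more than one zone,
here's a small sketch with the official Kubernetes Python client (the zone
label name is the one used by multi-zone support per the docs above; adjust
for your cluster):

    # Group cluster nodes by the zone label applied by multi-zone support,
    # to confirm the cluster really spans several AZs.
    from collections import defaultdict
    from kubernetes import client, config

    ZONE_LABEL = "failure-domain.beta.kubernetes.io/zone"

    config.load_kube_config()  # reads ~/.kube/config
    nodes_by_zone = defaultdict(list)
    for node in client.CoreV1Api().list_node().items:
        zone = (node.metadata.labels or {}).get(ZONE_LABEL, "<unlabeled>")
        nodes_by_zone[zone].append(node.metadata.name)

    for zone, names in sorted(nodes_by_zone.items()):
        print(zone + ": " + str(len(names)) + " node(s)")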

~~~
Rapzid
Ah, nice! That wasn't super clear to me, but now that you mention it, perhaps
it should have been.

~~~
justinsb
Not your fault - I was a little slow on getting the docs written up!

------
deepanchor
Curious to know why they chose protobuf for intra-cluster communication as
opposed to zero-copy protocols like Cap'n Proto or FlatBuffers.

No doubt protobufs are much more battle-tested in Google-scale environments,
but are there any other clear benefits if the goal is to reduce CPU time spent
encoding/decoding messages?

Especially in SOA deployments where many small services need to communicate
with one another, I would think that the ability to quickly read any field
from a message and pass it on (without first having to decode the entire
message) would be a very desirable trait.
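
For what it's worth, the model I mean is the usual parse-everything-up-front
one, e.g. in Python (just an illustration with protobuf's bundled Struct
well-known type, not how Kubernetes actually encodes its API objects):

    # Illustration of protobuf's parse-the-whole-message model.
    from google.protobuf.struct_pb2 import Struct

    msg = Struct()
    msg.update({"service": "api", "qps": 1000})
    wire = msg.SerializeToString()  # encoding walks the whole message

    decoded = Struct()
    decoded.ParseFromString(wire)   # decoding parses the whole message up front
    print(decoded["service"])       # only then can a single field be read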

~~~
kentonv
Protobuf is the Google standard, used by basically every single server at
Google for the last 15 years. They have built an internal ecosystem of tools
around the format. For a Google project to use something different would be
weird and would face lots of internal push-back, for good reasons.

Even though FlatBuffers is technically from Google, it's from a sub-team of
Android working on tools aimed at Android games. The idea was that you'd store
your assets in this format. IIRC the initial release didn't do bounds checking
so was totally vulnerable to malicious input (but it wasn't intended for such
use cases anyhow). I doubt it is widely used on Google's servers.

Cap'n Proto is not from Google and there's simply no way they'd choose to use
it. To be fair, its support for languages other than C++ remains weak, largely
because Sandstorm.io doesn't currently have the resources to build it out.

FWIW the ability to read a single field from a message is less important in
networking situations because sending/receiving the message is already O(n)
and the messages are small-ish, so parsing in O(n) is not a huge deal. Random-
access parsing really shines when the input is a massive file on disk.

(I'm the author of Cap'n Proto and also of Protobuf v2 (the first version
Google open sourced).)

------
tsenart
For those wondering what is being used for load generation in the demo:
[https://github.com/tsenart/vegeta](https://github.com/tsenart/vegeta)

~~~
roberthbailey
+1. Thanks tsenart for the awesome load generator!

------
rodionos
The frame at 2:37 shows an avg response time of 1.75 ms at 10M QPS. Which API
call was measured? I'm looking at the bar charts under "Metrics from Kubernetes
1.2" and the latencies graphed there appear to be different/higher.

~~~
boulos
That's the _nginx_ response time. You can see that when he scales up the
loadbots but not the backends and says that the "tail latency has gotten quite
high" (about 1min in).

~~~
roberthbailey
Correct. In addition, the source code used to run the demo is available on
github at [https://github.com/kubernetes/contrib/tree/master/scale-
demo](https://github.com/kubernetes/contrib/tree/master/scale-demo)

~~~
rodionos
Thanks, all clear now.

------
Dangeranger
Does anyone have a good experience to share with a hosted Kubernetes provider
outside of GCE and Tectonic? I am primarily comparing using Kubernetes to
alternatives such as Rancher or Nomad.

~~~
TheIronYuppie
Disclaimer: I work at Google on Kubernetes.

Do you mean GKE?

~~~
Dangeranger
My 'GCE' reference was to Google Container Engine with Kubernetes as the
cluster manager, yes.

~~~
jbeda
Funny story -- Google Container Engine was the obvious name for that product
but the TLA for it (GCE) conflicted with Google Compute Engine. We broke the
tie by deciding the TLA for Container Engine would be GKE. The 'K' is a nod
toward the Kubernetes underpinnings.

Google Compute Engine itself was difficult to name. There were those who were
pushing for Google Compute Cluster, but I vetoed it, as the TLA would have been
GCC or GC2. Both would have been awful.

Naming is hard.

~~~
AndrewWright
Why not go with Alphabet Cloud, or ABC?

~~~
thesandlord
Hahaha this is awesome! Unfortunately, Alphabet wasn't a thing back then.

------
atemerev
And still no way to make a simple 2-node cluster in 2 different availability
zones. What if one AZ fails completely? Happens quite often.

I tried to read HA documentation on Kubernetes and it all starts with warnings
like "this is fairly advanced stuff, requiring intimate knowledge of
Kubernetes inner workings", and going on with pages and pages of setup
process.

Basic HA is not "fairly advanced stuff"; it is a commonplace requirement in
any production environment. Why do I need a 1000-node cluster if all 1000
nodes are in the same AZ, which can have an outage at any time?

~~~
SEJeff
Conceptually speaking, having two nodes is not high availability, it is
failover / fault tolerance. High availability is generally N + 2 where N is >=
1.

This is a better explanation than I would write on this:

[https://www.quora.com/What-is-the-difference-between-a-
highl...](https://www.quora.com/What-is-the-difference-between-a-highly-fault-
tolerant-and-a-highly-available-system)

~~~
atemerev
You are right, of course. Still, I don't understand why it is such a low
priority in container orchestration platforms, or how it is even possible to
live without it in production.

~~~
beeps
It's not low on the priority list at all. These are the same people who worked
on borg (I'm a contributor, but didn't work on borg); they get stateful
applications and understand that it needs to be done RIGHT. No second chances.
Nailing this for 1.0 or 1.1 would have consumed a significant portion of the
team, but rest assured it _will_ work, soon.

------
grandinj
Wasn't there an article recently about how 99.9th-percentile measurements can
still hide a lot of bad stuff with high-volume services?

I seem to remember the article noting that 99.9995 was more useful.
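
To put rough numbers on that with the scale from this post (back-of-the-envelope
only, not a measurement):

    # At the demo's 10M req/s, a 99.9th-percentile latency figure says nothing
    # about the slowest slice of traffic, and that slice is still large.
    qps = 10 * 1000 * 1000
    for percentile in (99.9, 99.9995):
        tail = qps * (1 - percentile / 100.0)
        print("p" + str(percentile) + ": ~" + str(int(round(tail))) + " req/s in the tail")
    # p99.9:    ~10000 req/s above the reported latency
    # p99.9995: ~50 req/s above the reported latency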

