
Scaling Kubernetes to 2,500 Nodes - stanzheng
https://blog.openai.com/scaling-kubernetes-to-2500-nodes/
======
SEJeff
This is a really fantastic general "how to tune kubernetes and its various
components for large clusters" guide. Thanks for writing this up!

------
drewrobb
I'm surprised that the scaling story of k8s (+etcd?) is still so far behind
mesos/zk. There have been mesos clusters with over 10k nodes for several years
now.

I have never personally needed more than a few hundred mesos agents, but these
have been added without any noticeable impact on our extremely modestly
provisioned (and multi-purpose) zk cluster or any other components.

Has anyone used both systems and can speak to any advantages of k8s for these
types of workloads?

Also is anyone using some kind of torrent approach as a more reasonable
solution to avoid network bottlenecks when distributing big docker images to a
large number of nodes?

~~~
jo909
A lot of the issues were kind of "external": while worth thinking about for
every deployment, they're not really something the k8s project can do much
about other than warn about in the documentation.

    
    
      - disk latency
      - monitoring queries
      - homemade autoscaler killing all etcd nodes
      - custom scheduling policy moving many kubedns processes to the same node
      - unusually large docker images
      - "sharing" gcr.io request quotas because of Azure NAT IPs
    

That's not to say Mesos doesn't indeed scale better or more easily; I don't
know enough about Mesos to judge.

------
merb
What I find amazing about k8s is that it's one of the first solutions that is
relatively simple for a small cluster (HA, while scheduling stuff on the
masters), but can scale amazingly well even for a big cluster. You can start
with 3 nodes with like 8GB per machine (or less; I guess even 2GB is feasible
if you only want to use like 1-1.5GB of memory per machine). (A non-HA setup
can of course be smaller.)

~~~
scarface74
The Nomad executable is self-contained and less than 15MB if I remember
correctly. It can be used with Docker containers, shell scripts, or just raw
executables.

[https://www.hashicorp.com/c1m.html](https://www.hashicorp.com/c1m.html)

It was dead simple to install and use compared to my brief experience with
k8s.

------
roscoebeezie
As a person who doesn’t understand containers, where do I go to learn the
basics?

~~~
nickjj
If you want to avoid doing a bunch of research on your own, I've put together
an up-to-date, self-paced video course at
[https://diveintodocker.com/](https://diveintodocker.com/).

It covers everything from "What is Docker?" to learning how to apply it to
your own projects. There's a tiny bit of theory, followed by lots of guided
labs and examples.

In case you're curious, I've been using Docker in development and production
since 2014 and am also a Docker Captain (TL;DR is Docker reached out to me to
join their team as a trusted content provider).

------
djb_hackernews
350TB of memory, and 50,000 cores, nice.

ARP caching seems to be a common issue in cloud environments. AWS recommends
turning it off and does so itself in their Amazon Linux distro.
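
For reference, the usual fix on Linux is raising the kernel's neighbor (ARP)
table garbage-collection thresholds so the table can hold an entry per
pod/peer. A sketch of the sysctl settings involved, with illustrative values
(not necessarily the numbers AWS or OpenAI use):

    # /etc/sysctl.d/98-arp.conf -- illustrative values; size to your cluster
    net.ipv4.neigh.default.gc_thresh1 = 80000   # no GC below this many entries
    net.ipv4.neigh.default.gc_thresh2 = 90000   # soft limit; GC kicks in above it
    net.ipv4.neigh.default.gc_thresh3 = 100000  # hard cap on table size

The telltale symptom of hitting the default cap is a "neighbour table
overflow" message in dmesg.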

------
myrandomcomment
Ran into the ARP scale issues when trying to put 1000 containers on a single
system for scale testing over a year ago. strace helped figure out where the
issue was and what settings to change. I guess I should have sent an email to
the mailing list. At that time, searching for how to scale to 1000 docker
containers on one machine turned up nothing; everything was "hey, here is how
I scaled to 1000 containers over X number of nodes". No one was crazy enough
to try to get 1000 on a single machine.

------
eggie5
Does OpenAI train w/ GPUs on k8s clusters?

~~~
thesandlord
According to the article, they are using NC24 VMs, which have 4 K80s attached.
So yes, I would assume they are using GPUs.

Check out
[https://github.com/google/kubeflow](https://github.com/google/kubeflow) if
you are interested in doing the same.

(Disclaimer: I work for GCP doing K8s stuff, I know GKE clusters support GPUs
and Kubeflow, not 100% sure if AKS supports it or if you need to set up your
own cluster like OpenAI did.)
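
For the curious, once a GPU device plugin is running on the cluster,
requesting a GPU from k8s is just a resource limit. A minimal pod sketch
(pod name and image tag are illustrative, and this assumes the NVIDIA device
plugin is installed):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test          # hypothetical name
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:9.0-base   # illustrative image
        command: ["nvidia-smi"]       # just lists the attached GPUs
        resources:
          limits:
            nvidia.com/gpu: 1         # scheduler places this on a GPU node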

~~~
frakkingcylons
I have a somewhat off-topic question as a complete TensorFlow beginner and it
seems like you'd be in the know:

If I want to train a TF model distributed over many machines in GCP, it seems
like I could use Cloud ML Engine or deploy Kubeflow to a K8s cluster running
in GKE and train it there.

What should I consider when choosing between these two options? Is there
another option I should consider?

~~~
henningpeters
RiseML provides a higher-level abstraction than Kubeflow that is more similar
to Google Cloud ML. I would love to get your feedback on our solution:
[https://riseml.com](https://riseml.com)

Btw: we are currently preparing an open-source release

Disclaimer: I am co-founder at RiseML

------
EDevil
Isn't it a problem to have etcd store its state on a non-persistent volume?

How do they recover it after a restart? I suppose it's not a manual process.

~~~
justingood
The replacement machine will start pulling its data from the remaining nodes
when it joins the cluster. However, it's recommended to migrate the failed
node's data first if it's greater than 50MB:
[https://github.com/coreos/etcd/blob/master/Documentation/op-...](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#replace-a-failed-machine)
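
For reference, the runtime reconfiguration that doc describes boils down to a
few etcdctl calls; a sketch (the member ID, node name, and peer URL here are
made up, and this obviously needs a live cluster):

    # 1. find the dead member's ID from the surviving nodes
    etcdctl member list

    # 2. remove it so the cluster stops expecting its vote
    etcdctl member remove 8211f1d0f64f3269        # made-up member ID

    # 3. register the replacement before starting it
    etcdctl member add node4 --peer-urls=https://10.0.0.4:2380

    # then start etcd on the new machine with --initial-cluster-state=existing
    # so it joins and replicates from peers instead of bootstrapping fresh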

------
bdburns
(Azure containers lead here) Awesome to see OpenAI scale Kubernetes on Azure!

~~~
Findeton
(In Azure) Are VMs still tied to a specific machine, and thus if the machine
goes down the VMs need to be restarted?

~~~
bdburns
VMs in all clouds are always tied to specific machines. If that machine fails
unexpectedly then those VMs will restart. If it is a controlled reboot (e.g.
host update) then they may not restart...

~~~
brianwawok
Well, at least in Google Cloud, for planned updates you can get your VM
live-migrated to another host and not lose a node due to planned maintenance.
I am not aware whether Azure supports this, but my guess is they do not.

~~~
rarudduck
They (Azure) do for most operations.

[https://docs.microsoft.com/en-us/azure/virtual-machines/wind...](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/manage-availability)

I've found actual reboots to be rare - the exception being the recent Spectre
/ Meltdown patching.

