
It's not for everyone and it has significant maintenance overhead if you want to keep it up to date _and_ can't re-create the cluster with a new version every time. This is something most people at Google are completely insulated from in the case of Borg, because SREs make infrastructure "just work". I wish there was something drastically simpler. I don't need three dozen persistent volume providers, or the ability to e.g. replace my network plugin, DNS provider, or load balancer. I want a sane set of defaults built in. I want easy access to persistent data (currently a bit of a nightmare to set up in your own cluster). I want a configuration setup that can take command-line params without futzing with templating and the like. As horrible and inconsistent as Borg's BCL is, it's, IMO, an improvement over what K8S uses.

Most importantly: I want a lot fewer moving parts than it currently has. Being "extensible" is a noble goal, but at some point cognitive overhead begins to dominate. Learn to say "no" to good ideas.

Unfortunately there are a lot of K8S configs and a lot of K8S-specific software already written, so people are unlikely to switch to something more manageable. Fortunately, if complexity continues to proliferate, it may collapse under its own weight, leaving no option but to move somewhere else.



In places I've worked we usually had a VMware cluster, a load balancer, NFS for shared data when necessary, and DNS set up (e.g. through Consul).

This setup is very, very simple and scalable. There is very little to gain, IMO, in moving to Kubernetes.

Consul, vSphere and load balancers have APIs, and you can write tools to do everything that K8s does.


How do you load balance? I mean, load balance the "public IP"?

In some networks DNS failover is really not that great, so at least a virtual IP needs to be used.


We use haproxy. I wrote code [1] that configures it based on Consul to do load balancing. It has been running in production for 2 years without issue (tested with Consul 1.0.6).

For people wondering why not use consul-template: this has the benefit of understanding haproxy, and it minimizes the number of restarts needed to apply changes.

Using haproxy this way also has the benefit that if Winkle or Consul goes down, things continue to work, just without updates.

[1] https://github.com/takeda/winkle
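
For the curious, the general pattern (not Winkle's actual code; the service name, ports and paths below are made up) looks roughly like this in Python: ask the local Consul agent for healthy instances over its HTTP API, render an haproxy config with a localhost listener, and only reload haproxy when the rendered config actually changed, which is what keeps the number of restarts down.

    # Sketch of the Consul -> haproxy pattern (illustrative, not Winkle itself).
    # Assumes a local Consul agent on 127.0.0.1:8500 and haproxy under systemd.
    import subprocess
    import requests

    CONSUL = "http://127.0.0.1:8500"
    SERVICE = "my-api"                      # hypothetical service name
    CFG_PATH = "/etc/haproxy/haproxy.cfg"   # hypothetical config path

    def render_config():
        # Healthy instances of the service, straight from Consul's health API.
        entries = requests.get(
            f"{CONSUL}/v1/health/service/{SERVICE}", params={"passing": "true"}
        ).json()
        lines = [
            "defaults",
            "    mode http",
            "    timeout connect 5s",
            "    timeout client 30s",
            "    timeout server 30s",
            "",
            # Apps talk to haproxy on localhost; haproxy fans out to real nodes.
            f"listen {SERVICE}",
            "    bind 127.0.0.1:8080",
        ]
        for i, entry in enumerate(entries):
            svc = entry["Service"]
            addr = svc["Address"] or entry["Node"]["Address"]
            lines.append(f"    server srv{i} {addr}:{svc['Port']} check")
        return "\n".join(lines) + "\n"

    def apply(cfg):
        # Only touch haproxy when the config actually changed.
        try:
            with open(CFG_PATH) as f:
                if f.read() == cfg:
                    return
        except FileNotFoundError:
            pass
        with open(CFG_PATH, "w") as f:
            f.write(cfg)
        subprocess.run(["systemctl", "reload", "haproxy"], check=True)

    if __name__ == "__main__":
        apply(render_config())

Run something like this from cron or a Consul watch; if Consul goes away, the last rendered config keeps serving traffic, it just stops getting updates.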


Haproxy only solves a single part of the problem. If you do DNS-based failover you should really check how clients behave when one node goes down. Without a floating IP or a cloud LB, some stuff will be troublesome.


The haproxy method doesn't rely on DNS at all so I'm a bit confused.


Well, either it uses DNS for failover, or you have IPVS (LVS, keepalived) enabled, or worse: if the machine with haproxy crashes you're basically dead. Of course there is also BGP and anycast, but that is not "cheap".


That's not how it works. It is very similar to the sidecar approach that various service discovery solutions have. You have haproxy running locally on localhost and you communicate with it; haproxy then routes the request to the right nodes. No DNS, no LVS, and no keepalived.


Well, I was talking about edge load balancing.


Scalable NFS, riiite.


If you have some time to read "How Google Works" you would be surprised by how long the company ran on NFS. I assume there are lots of workloads running on Borg to this day on top of NFS. If that isn't enough for you, have a look at the client list of Isilon and see what kind of work they do; if you ever attend SIGGRAPH, most of what you see is built on top of NFS, so, essentially, all of the computer graphics you see in movies. At my last job our NFS cluster did 300,000 IOPS with 82 Gb/s of throughput.


82gb/s (assuming you mean gigabit) is _per-node_ throughput at Google (or FB, or I assume Amazon/Microsoft -- they all use 100GbE networks now). 300K IOPS is probably per-node, too, at this point. :-)


Having a 100 Gbps NIC in a node isn't the same thing as doing storage at that speed in an HA cluster.

Also, don't confuse these with 100 GbE networks where the spine links are 100 but the node links are only bonded 10s (much more common at $FANG).


Nope. It's all 100GbE throughout, as far as I know. And people do work really hard to saturate that bandwidth, as it is by no means trivial to do through the usual, naive means without RDMA and Verbs. Years ago when I was there it was (IIRC) 40Gbps to each node straight up.

It's a necessity really. All storage at Google has been remote and distributed for at least the past decade. That puts serious demands on network throughput if you want your CPUs to actually do work and not just sit there and wait for data.

Here's some detail as of 2012: https://storage.googleapis.com/pub-tools-public-publication-.... Note that host speed is 40Gbps. And here's FB talking about migrating from 40Gbps to 100Gbps in 2016: https://code.fb.com/data-center-engineering/introducing-back...


Sorry, I don't have to read it, because I was a Borg SRE for 6 years and I know how (the server part of) it works. You assume wrong.

I know there are a lot of companies that try to put some lipstick on the NFS pig and call it reliable/scalable/etc. As long as their clients don't actually try to run it at scale, or don't complain too publicly when they try and can't, they are able to get away with it.


Your concept of scale looks very different from mine; in my experience NFS does a very good job for in-datacenter workloads. CG rendering, oil/gas and others usually take this approach for HPC, as far as I've seen. I consider this "scale". Close to 100k procs sharing the NFS is the biggest cluster I've worked on.

Of course over longer networks it isn't suitable, as the round trips have too much latency; other than that, is your experience with NFS much different?


What you consider ‘scale’ is a high watermark used by cloud providers that is irrelevant to 99.999% of the industry.

Supporting all of a Fortune 500’s business operations is very reasonable to call ‘scale’ in the normal world.

Your comment is like a billionaire claiming that somebody who managed to hit 30 million isn't rich.


worked at a company with 4 Petabytes on NFS ... FWIW


> leaving no option but to move somewhere else

Many of the major infrastructure/platform vendors are rolling out their own distribution of Kubernetes, either as a cloud service (e.g. AWS, Azure, GCP) or on premises (e.g. Red Hat).

So I suspect they are going to try and differentiate on features and ease of use and make it as hard as possible to move anywhere else.


k8s is meant to be hard to use. You're supposed to rent space on a k8s cluster from Google. Google has been pumping millions into marketing k8s as a mechanism to improve GCP adoption and establish a foothold in the cloud provider space.


I'm not exactly sure what point you're trying to make here. k8s is not meant to be a PaaS, but no one is trying to make k8s harder to use.

I work at Google on a large team of engineers dedicated to making it as easy as possible to use.


[Disclaimer: this is pure conjecture and represents my opinion only; I'm a Google outsider.]

For a while, "AWS" and "the cloud" were practical synonyms; it was very rare that anyone meant not-AWS. In my opinion, Kubernetes is a major piece of Google's strategy to turn that tide and improve their marketshare.

Does throwing "a large team of engineers" at a problem typically result in something that's "as easy as possible to use"? "Design-by-committee" is not a term of endearment.

A sibling comment at https://news.ycombinator.com/item?id=18958077 notes that "there are a ton of nicer UIs for Kubernetes", but that they're sold separately as proprietary PaaS platforms. Even if we pretend like GCP/GKE isn't one of them, Kubernetes-The-Platform will be impacted by the interests of its primary vendors and advocates, whether or not certain teams at Google are keen to admit that.


The fact that it takes a team of highly skilled engineers from the top of the talent pool to try to make it easy should perhaps tell you the design is wrong?


Or that running a secured, highly available platform with the type of features k8s provides is non-trivial beyond the basics.


What would make it easy is a good non-GCP GUI.


I agree, some better UI for end users would be awesome: Kubernetes Dashboard kind of works, but is pretty limited and more a "kubectl in the browser". There are a ton of nicer UIs for Kubernetes, but they are all part of the value-add of proprietary platforms AFAIK (think about all the managed K8s offerings out there).


It took me a while to get comfortable in Borg (and in general with the fact that your binary can take hundreds of verbosely written command-line arguments; coming from gamedev, I was in a bit of a state of shock for a while)... But then I got used to it. Still, I felt I could never fully internalize the evaluation rules, but the other tooling (diffing) really helped in that respect.

One thing I really appreciated was how one could enable/disable things based on the binary version rolled out, and if it's rolled back, the state goes back.

Basically something like this:

    {
       new_exp_feature = (binary_compiled_after_changelist( 123456789 ) || binary_compiled_with_cherrypicks( { 123456795, 1234567899 } ))
    }
Since Piper is changelist-based (like Perforce/SVN), each "CL" number goes up atomically, so you can use this to say: this specific flag should get turned ON only if my binary has been compiled at base CL > 123456789, or, if it was compiled at an earlier one, had these cherry-picks (i.e. individual changelists) built into it.

But this was heavily integrated with the whole system - e.g. each binary would basically be built at some @base_cl with additional @{cherry_pick_cl1, cherry_pick_cl2, ...} possibly applied. For example, the team decides to release at version @base_cl, but during the release bugs are found, and rather than rolling to a new @base_cl, just individual cherry-picks may be pushed - so you can then control (in your configuration) how to act (configuration could be pushed independently of your binary, though some systems would bundle them together)... And then if you have to roll back, borgcfg would re-evaluate all this and decide to flip the switch back (that switch would simply emit something like --new_exp_feature=true or --new_exp_feature=false (or --no-new_exp_feature; it was a long time ago so I could be wrong)).

With git/hg you no longer have such a monotonic order - but then, that monotonic order worked best with monorepos anyway (or maybe I'm just too narrow-sighted here)...


From this comment thread I’m beginning to think I’m one of the few people on HN that hasn’t used Borg.

All of this seems way more complicated than the tools we use at my company. Is there a specialized need here I’m not seeing?


You seem to confuse Borg and borgcfg.

The evaluation rules are merely a borgcfg artifact.

Disclaimer: I maintain borgcfg.


I found https://jsonnet.org/ to fix several issues with the Borg configuration language.

https://github.com/ksonnet/kubecfg is an attempt to reboot the borgcfg experience with k8s + jsonnet


There is also https://github.com/dhall-lang/dhall-kubernetes, which is built on top of Dhall, another competitor to Jsonnet.


There is no such thing as a borgcfg experience. It's just a configuration DSL applied to producing Borg specs.


You might take it for granted, and thus not experience it as an "experience", but if you use other tools that are popular in the k8s world (such as helm) you might feel a tinge of nostalgia.

For example, {borg,kube}cfg allow you to import an existing config and override it so you can adapt it to another scenario (different things in different clusters, like prod vs staging, or a cluster has a new feature while another one doesn't etc).

Furthermore, the overrides are described with the same "shape" as the things they override, and can override things that weren't necessarily marked as overridable.

Compare this with the current state of affairs with Helm, where the only way for users to inject e.g. the resource requests and limits into a pod spec is for the original template author to have foreseen that need and explicitly added a hook in the template, so that values from another file (the values.yaml) can be injected into it:

        spec:
          imagePullSecrets:
            - name: {{ .Values.image.pullSecret }}
          containers:
            - name: {{ template "mycontainer.name" . }}
              ......
              resources:
    {{ toYaml .Values.resources | indent 12 }}
          volumes:
              ......
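
To illustrate the "same shape" point outside of any particular tool, here is a minimal Python sketch of the idea (not kubecfg's or borgcfg's actual mechanism; the config below is simplified and the values are made up): the override mirrors the structure of the base and is deep-merged over it, with no pre-planned hooks.

    # Illustrative only: the "override has the same shape as the base" idea,
    # expressed as a plain deep merge (kubecfg/borgcfg do this natively).
    import copy

    def deep_merge(base, override):
        """Lay `override` on top of `base` recursively, without mutating either."""
        merged = copy.deepcopy(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged

    # Simplified, made-up base config; nothing here is marked as "overridable".
    base = {
        "metadata": {"name": "my-api", "labels": {"app": "my-api"}},
        "spec": {"replicas": 1, "resources": {"limits": {"cpu": "500m"}}},
    }

    # The prod override is written in the same shape as the thing it changes.
    prod = deep_merge(base, {
        "spec": {"replicas": 5,
                 "resources": {"limits": {"cpu": "2", "memory": "4Gi"}}},
    })

    print(prod["spec"])  # replicas bumped, limits overridden, rest inherited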


Sure yes - I meant borgcfg, not borg - stupid me...


An open source project having the complexity of an enterprise monster - this is what generates the half-million-plus salaries. An old enterprise software trick. Simplifying it would serve the interests of nobody in a position to do the simplification.


I'm going to voice a contrarian viewpoint here and say that "half million plus" salaries are actually a good thing. A rising tide lifts all boats, and since a lot of techies live in areas with exorbitant cost of living, that money re-enters the economy at a rapid clip anyway. But such salaries are only good if commensurate value is being delivered for the money. Which in a large IT shop it might be, but as a small business owner K8S is a hard slog, hence my suggestion to simplify. I'm pretty sure the 80/20 breakdown still applies, and 80% of K8S complexity could be removed without affecting anything much. One might suggest that I use GKE and bypass the problem entirely, but I need to run a lot of GPUs 24x7, and the pricing on those in any cloud is insane.


"a lot of techies live in areas with exorbitant cost of living,"

And what do you think is the primary driver of said cost of living?

I believe there is a ton of arbitrary complexity in the system, and though it's not created deliberately, it definitely grows if it's not kept in check, and the entities with power have no reason to check it.

Google, Oracle, MS, Governments, Banks - have very little incentive to clear the weeds, usually just the opposite.

Wherever there is steady profit, there are layers of cruft.


>> And what do you think is the primary driver of said cost of living?

Mostly NIMBY-driven refusal to build more housing and transportation infrastructure. It's not like the US is lacking for land. There's no reason a dilapidated teardown-ready shack should cost $2M, no matter where it is.


No, there is NIMBYism in most places; the reason Valley prices are sky-high is the salaries.

" There's no reason a dilapidated teardown-ready shack should cost $2M, no matter where it is."

The cost is not the shack, it's the land it's sitting on.

The higher the salaries in the valley, the more that dilapidated shack will cost.


You make it sound like people enjoy paying millions for dilapidated shacks, which I can assure you is not the case. The reason housing costs so much is limited supply coupled with high demand. There could be a 10-story, 40-apartment building in place of that one shack. Place a few thousand of those strategically throughout the Bay Area, and the price per square foot would come down big time even if the cost of land stays high. But, NIMBY. Can't reduce the "value" of all that (mostly dilapidated) real estate people already own.


" much is limited supply coupled with high demand. "

No, it's just 'supply and demand'; neither is necessarily 'high' or 'low'.

SV has quite high wages, that's a huge driver of demand.

The residents of SV do not want to be like NY or Hong Kong, that is their choice. It's the choice many, many places make as well. Zurich, Paris, even London, they don't live in high rises.

The attractiveness of Cali in many ways is that it's not entirely flooded/urban like NYC.

You can't just create more homes in arbitrary ways; doing so has an effect on the situation.


It is both. There is more money buying a relatively fixed number of homes, so the price goes up. But if you could buy a piece of land with one home and put 10 units on it, then that could be offset.


Simplified Kubernetes is a thing that exists. OpenShift (and the open source version, OKD) jumps out as the immediate example. There are other non-k8s tools that cover some of the same territory, like Docker Swarm or Cloud Foundry.

There's still a learning curve, but it's much more humane than Kubernetes.


I think you meant to write "(and the upstream community version, OKD)", because OpenShift is also fully open source.


Yes, many thanks for the correction.


Hmm, I think you and many others do not get how complex a general purpose infrastructure can and should be.

Kubernetes is very simple. And it will become much more complex with the growing hardware, network, and applications it's trying to manage.

What's missing is that there is a layer of complexity on top of k8s that is still left to figure out. And I think the operator pattern is the right abstraction for service jobs. Some kind of framework is still needed to handle batch/offline workloads, though.


Quite the opposite: I want it to be flexible and pluggable for use cases other than the most simple. I've gotten a lot of benefit from adding custom features.


I'm not sure but would something like docker swarm qualify?


Re configuration: ksonnet is an option (although I personally find jsonnet a “lipstick on a pig” kind of solution).

There’s some work going on to have something more user-friendly (think Google’s Piccolo) - https://github.com/stripe/skycfg (disclaimer - I contributed to this project)


There's also Kubecfg [1], which uses Jsonnet, but has a much smaller surface area than Ksonnet.

[1] https://github.com/ksonnet/kubecfg


Could you describe Piccolo a bit? Can't find anything on it.


Not going into too much detail - it was a Python-esque DSL equivalent to BCL. You still had to learn the Borg abstractions, but at least you didn't have to fight the language as much if you wanted to implement DRY in your configs.


Piccolo is very similar to Pystachio.


> I wish there was something drastically simpler

Have you tried Nomad?

https://www.nomadproject.io


> _and_ can't re-create the cluster with a new version every time

Actually, I used kubeadm, and the higher the version got, the better it worked for major upgrades.

At the moment, with the new master upgrade methods, I have not had any problems so far, on two clusters.

Sadly I created my cluster with an "external" etcd (even though it is actually internal) and also tried to maintain my own certificates, which is now a PITA (at the time, cert handling wasn't as good in kubeadm as it is now).

Also, I have a CloudConfig/Ignition config creator which can generate all the configs necessary to bootstrap a kubeadm cluster on Container Linux/Flatcar Linux. So if I really have time I can just create a new cluster and move everything over. (The only thing that is problematic in "moving" over is the database created with KubeDB.)

Also you can use keepalived as your kubeadm load balancer.


Nomad is drastically simpler than Kubernetes. All you need is Consul and Nomad to get a running cluster.


I think Istio (https://istio.io) is a nice effort to create both an abstraction on top of k8s and to package a set of commonly needed functionality out of the box. Unsure of its production status or overhead though.

Also, I'd only go with a managed k8s solution, and I'm not sure I'd consider k8s for older or non-microservice/containerized architectures. In the latter case, though, I don't think there's anything better out there in terms of orchestration.


I have pretty mixed feelings about Istio. It's trying to solve a lot of fundamental problems by introducing yet another layer of stuff. It's basically the middleware box all over again.


Lots of magic for me; I broke a (dev) k8s cluster by installing Istio via the GitLab k8s integration. The overhead appeared to be non-negligible, but I noped out of there pretty quickly, so I don't have the data to back that up.


Hi bvm, GitLab PM here. Sorry to hear your dev cluster broke. Would like to offer any help we can provide and if possible learn more about the failure so we can take corrective action to avoid this in the future. Thanks.


Hi drugesso - Thanks for getting back to me. I actually signed up to premium to get support for this issue. Got one email back, replied and then never heard back :(

It would be really great if there is a human I could speak to at GitLab about this. I've put my email in my profile.


Hello Tom, thanks for reaching out about this. I've forwarded your request to the team internally, please let us know when everything is sorted out :-).


We used to maintain our own k8s cluster and it was a pain in the ass given that we have no dedicated ops. The cluster crashed every month or two and we never tried keeping it up to date.

I suggest every startup use a hosted k8s solution, which takes care of most things like authentication, networking, monitoring, updating, etc.

Also, keep away from templating systems such as jsonnet, which are huge overkill. You will end up writing a lot of code you will hate to read later. Instead, write your own YAML builder in CI, together with parts that do Docker image building and code that deploys the microservices.
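
For what it's worth, such a builder can stay tiny. A minimal sketch of the approach (assumes the PyYAML package; image, names and registry are made up):

    # Minimal "YAML builder" sketch for CI (assumes PyYAML is installed).
    import yaml

    def deployment(name, image, replicas=2, env=None):
        """Build a plain Kubernetes Deployment manifest as a dict."""
        container = {"name": name, "image": image}
        if env:
            container["env"] = [{"name": k, "value": v} for k, v in env.items()]
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": name},
            "spec": {
                "replicas": replicas,
                "selector": {"matchLabels": {"app": name}},
                "template": {
                    "metadata": {"labels": {"app": name}},
                    "spec": {"containers": [container]},
                },
            },
        }

    if __name__ == "__main__":
        # In CI, the image tag would come from the build that just ran.
        manifest = deployment("my-api", "registry.example.com/my-api:abc123",
                              replicas=3, env={"LOG_LEVEL": "info"})
        print(yaml.safe_dump(manifest, sort_keys=False))
        # ...then pipe the output to `kubectl apply -f -` in the deploy step.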

IMO Google made a really smart move by open sourcing k8s as a latecomer among cloud providers. Now the infrastructure becomes insignificant, since everything runs on Docker and pods.


Rancher 1.6 with Cattle was the sweet spot for us. Rancher 2 went full Kubernetes, which probably makes sense for their customers. We're looking for a replacement in that sweet spot.


Very true! There is no greater culprit responsible for "complex systems" than the pursuit of "extensible/future-proof" software design!


I think that the Unix philosophy of focused and relatively simple tools that are easy to glue together is a better way to future-proof. Yet to do that you need a stable substratum to provide the basis for composition. In k8s' case it seems that k8s _is_ the basis upon which the composition is to happen.


In conversations, I often compare the Kubernetes API to the Linux Kernel API (as analogy) - both provide primitives we kind of "agreed on" in the industry. I hope the Kubernetes ecosystem will flourish in the same way as the Linux base.



