While it is still a stretch, the title should be "Google's promo committee is killing Kubernetes", since no other FAANG-equivalent is using or contributing to Kubernetes in a meaningful way.
The core point:
> It's too indirect, fixing a bug in kube-apiserver might retain a GCP customer or avoid a costly Apple services outage, but can you put a dollar value on that? How much is CI stability worth? Or community happiness?
While correct, the same is true for most projects at these companies. Very little work that a single engineer does can be assigned a provable revenue number. How does someone working on an internal build tool get promoted? Or a new training model? A large-scale refactor of a legacy codebase? (All of these are examples of very senior promos at my own FAANG company).
Kubernetes is no different. An engineer has to show how the work they did first and foremost aligns with their team's goals. If it doesn't, then well they shouldn't have been working on it in the first place. While working on open-source projects might be tolerated, even on company time, it makes sense that it won't put you on a path to promotion unless there is direct benefit to the company from it.
> While it is still a stretch, the title should be "Google's promo committee is killing Kubernetes", since no other FAANG-equivalent is using or contributing to Kubernetes in a meaningful way.
> Very little work that a single engineer does can be assigned a provable revenue number. How does someone working on an internal build tool get promoted?
While I generally agree, I've made changes in our internal build tools that I can easily put a number on, although it's not a revenue number. Make build process 10s faster, 1000 engineers run this 10x a day, you're saving more than an engineer day per day of time. That's a definite impact right there, assuming your management is on board with accepting it :)
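For anyone who wants to check that arithmetic, here is a minimal sketch. The function name and the 8-hour workday are my own assumptions, not anything from the comment above; swap in whatever your org considers an engineer-day.

```python
def engineer_days_saved(seconds_saved_per_build, engineers, builds_per_day,
                        workday_hours=8.0):
    """Engineer-days of waiting removed per calendar day.

    workday_hours is an assumption (8h); pass 24 for the most
    conservative reading of "an engineer day".
    """
    total_seconds = seconds_saved_per_build * engineers * builds_per_day
    return total_seconds / 3600 / workday_hours


# Numbers from the comment: 10s faster, 1000 engineers, 10 builds a day.
saved = engineer_days_saved(10, 1000, 10)
print(f"{saved:.2f} engineer-days saved per day")
```

Even if you count a full 24 hours as one engineer-day, the result stays above 1, so the "more than an engineer day per day" claim holds under either reading.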
Apple is #32 in contributor commits[1], between SAP and Glassdoor, with ~7000 commits. You need to start adding zeroes and then some to get to Google's number of commits.
We shouldn't just measure in commits. I'd be interested to hear the case made. Where has Apple led in Kubernetes? What have been their major initiatives? That'd be interesting to hear.
There is currently an Apple employee on the k8s steering committee, so they could mean that.
Or, since the sentence applies to a list of projects rather than to k8s specifically, it could just be a misparse, and they're not claiming to lead each of the listed projects.
If you don't want the pointy-haired bosses to start measuring productivity by number of commits, start with not doing so yourself.
Commit count is largely meaningless and a low-effort way to fling around "contribution" numbers. For example, large commit numbers could be a reflection of a specific company's internal conventions around making a larger number of small changes as their own commit.
Now it's certainly plausible to me that Google is a much more major contributor to Kubernetes than Apple, but you need better numbers for showing that. For example, which companies have contributed to designing and implementing the major features in the last 5 releases?
> Make build process 10s faster, 1000 engineers run this 10x a day, you're saving more than an engineer day per day of time. That's a definite impact right there
Agreed, and you can measure and include similar statements regarding your Kubernetes contributions. However, if what you have to show for it is making life easier for a hundred random startups, why exactly should a Google promo committee care? If making that impact is a specific goal of your business unit, though, then the work is meaningful again.
> However, if what you have to show for it is making the life of a hundred random startups easier, well why exactly should a Google promo committee care?
He is actually talking about this in his tweets: "fixing a bug in kube-apiserver might retain a GCP customer", now he can put revenue associated with that customer into his perf packet?
The problem is that's a very tenuous argument to make (unless he could find an email saying that a customer was going to leave unless the bug was fixed) vs. another engineer who found a new way of compressing emails and saved the company X exabytes of storage space.
Keeping the ship running seems to be continuously undervalued at these companies compared to how they value building a new wing.
FAANG manager here, with experience in other industries. I have definitely seen that promo committees here value maintenance work less than other industries do. I think the problem is that the committee is made up of individuals from other functions, so they really have very little idea about the services a candidate supports; they're just looking at the accomplishments written in the packet. In other industries, that's typically not the case - if you run a service that has a reputation for being unreliable, you're not even getting considered.
You still have to do the above and beyond work elsewhere, but in FAANG some people will get into the conversation by ditching maintenance and focusing solely on the stretch work, and they tend to have success getting promoted.
Could be, could not be; hence the hard-to-measure part.
I've stopped using many tools because they created just too many issues. And in that case you will not write these emails as a customer, you just leave (as long as there are better options).
There are a whole lot of companies who can look at MTTR in terms of dollars, especially eCommerce. Rollback failures and CPU throttling can translate very directly to dollars when each request is a potential transaction.
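As a rough illustration of how MTTR turns into dollars for a shop like that, here is a back-of-the-envelope sketch. Every number in it (traffic, conversion rate, order value, outage lengths) is made up for the example, and the model assumes every request during the outage is lost and never retried, which overstates the loss for most real sites.

```python
def outage_revenue_loss(minutes_down, requests_per_minute,
                        conversion_rate, average_order_value):
    """Naive revenue lost while the site is down.

    Assumes all traffic during the outage is lost for good
    (no retries, no customers coming back later).
    """
    lost_requests = minutes_down * requests_per_minute
    return lost_requests * conversion_rate * average_order_value


# Hypothetical store: 2000 req/min, 2% convert, $50 average order.
slow_recovery = outage_revenue_loss(30, 2000, 0.02, 50.0)  # 30 min MTTR
fast_recovery = outage_revenue_loss(10, 2000, 0.02, 50.0)  # 10 min MTTR
print(f"Saved per incident: ${slow_recovery - fast_recovery:,.0f}")
```

The point is only that the inputs are things an eCommerce company already tracks, so the person who fixed rollbacks can put a defensible number per incident in front of a committee.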
> Make build process 10s faster, 1000 engineers run this 10x a day
Depends where you are starting from. Taking the build from 100 to 90 seconds probably won't make a difference, since people will be used to doing something else while waiting for the build to finish in the background. Taking it from 10.1 to 0.1 seconds will make a huge difference, since you can change your entire development process when builds are 'instant' and you can get constant feedback on any changes you make. Of course the second one is a hell of a lot harder.
> Make build process 10s faster, 1000 engineers run this 10x a day, you're saving more than an engineer day per day of time.
I upgraded my home internet to one 2x as fast as the old service, so that I would spend only half as much time online, but it didn't work. Speed up the build process and the build will just get more complicated and the test suite slower, until it once again slows to the point of being barely tolerable. ISTM that part of Go's early popularity was from its fast builds.
I think the tweet is a bit too much. I worked for GCP for several years before. Impact in promo is defined within the same group and aligned with the goals of the group. If improving the stability of k8s is among the goals, then no problem at all.
OSS projects are not built on top of anyone's mercy. For infra projects like k8s, companies just contribute what they need and Google is no exception.
> I worked for GCP for several years before. Impact in promo is defined within the same group
Did you personally go through promo from L6 to L7 or L7 to L8 (or see the process from sufficiently close), since that's what the OP thread is about?
I can see how your description of impact & promo would work when there are enough people inside the group to compare against (i.e. other L4s, L5s, L6s), but when going for L7 & L8 there might not be enough people in the group. Plus the folks at that level would be the ones defining the goals, so it seems logical that they can't get promoted on those criteria.
Up through 7 is defined within the same group for Cloud, afaik, which is almost certainly the relevant portion of the company.
> Plus the folks at that level would be the one defining the goals, so it seems logical that they can't get promoted on that criteria.
Fwiw you start defining goals at L6 (or even 5); that's one of the defining characteristics of 6 vs. 5. You can't get to 6 from execution alone; you need to be setting roadmaps that have impact beyond your team. And at 7 and 8, the scope of both people and technology gets larger. A strong L7/just-promo'd L8 would probably be expected to exert direct influence over an organization of around 100 full-time engineers (caveat: I'm saying this from below, but it fits what I've seen).
Which is to say that the positions in k8s that qualify for L7/L8 promo-worthy work are to a first approximation, just the steering committee. And that looks like it's working as intended: the steering committee is full of people who are Principal or Distinguished (or more in some cases) engineers and manager equivalents from a variety of companies.
> Which is to say that the positions in k8s that qualify for L7/L8 promo-worthy work are to a first approximation, just the steering committee.
So you are basically agreeing that an engineer who wants to advance to that level has to go and start a shiny new project (open source or not) so they are, by default, on the steering committee or the equivalent.
The k8s steering committee is a rotating set of 7 people with elections of 3-4 members every year. You need a sustained record of contributions and leadership to k8s, but no you don't need to have been one of the original creators.
The same thing that happens when any other L7 or 8 completes a major initiative. They find something else to work on. Many of the steering group members also lead multiple (in some cases, a dozen or more) k8s working-groups, but they presumably also have responsibilities related to their company (and possibly how their company uses k8s), so they'd focus more intently on those things, or perhaps they started some initiative while on the steering committee and continue to lead that particular thing even once off of steering.
No. Leadership is not a frozen thing. And leading a tiny nascent project doesn’t have remotely the scope required to hit these levels. It is absolutely not the case that “start greenfield projects” and “grow to 7/8” are connected.
And yeah, growth to these levels is hard. The expectation at Google is that the majority of engineers never hit 6 in their entire career.
Here it’s focused on k8s, but it’s not out of line if we generically apply it to AMZN. How many services look cool, have a great re:Invent kick-off, and then go KTLO a year later? I’m going to lay that at the feet of the AMZN promo process, which prioritizes creating something new over real and needed maintenance.
I’m going to further reduce this to lazy management. It’s easier to point to splashy announcements to show impact than it is to put in the work and help your direct reports quantify unglamorous maintenance work.
Not sure how long you've been at or spent at Amazon, but the promotion process doesn't prioritize building new things at all. It's true that people think complexity = build new things, but promotion criteria specifically encourage people to stick to simple things and build on top of what exists rather than invent something new for promotion's sake.
When selling your work to a promo committee, building a new shiny thing lowers the communication hurdle.
Personal anecdote, at Amazon, my wife led a very challenging refactor of legacy code and didn’t get promoted, I built a shiny thing barely anyone is using yet, and got promoted.
> Personal anecdote, at Amazon, my wife led a very challenging refactor of legacy code and didn’t get promoted, I built a shiny thing barely anyone is using yet, and got promoted.
How much of that is the effect of shiny vs. dull, and how much is male vs. female (apologies if my assuming you are male was mistaken, but it seemed the way to bet)?
That's an odd bet to make. At Google promo committees were desperate to promote women. There were entire schemes set up to encourage them to apply more, to give them special help men didn't get etc and that was years ago. Probably it's worse (less fair) now.
It wasn't just process, either. I saw severely unfair decisions go the other way. A woman in my team who everyone hated because she was dishonest, staggeringly unproductive and had a habit of upsetting other teams, sailed through promotion several times into management. On the other hand team members who created entirely new products from scratch or were the backbone of their team got bounced. The patronage of another (female) boss higher up in the management chain seemed the most likely culprit, along with a pervasive "everyone's gotta help women get ahead" attitude.
Even if you can prove a revenue number, it doesn't matter if they don't want to promote you. I've seen it explicitly stated at more than one company, that revenue impact is irrelevant and that you need to show a "large enough design and execution" to be promoted as a reason to just not give it out.
Not only that, but the "promo packet" and "promo committee" is fairly unique to Google. Other companies do things very differently, e.g. FB's twice-yearly (now yearly) manager-driven PSC+"calibration" cycle, which has fairly different incentives than Google-style committees of unknown engineering peers with arbitrary periodization.
Google's system is well known (at least, in my circle of colleagues!) to not promote healthy long-term maintainership, although it does incentivize point-in-time brilliance and solving deep complexity.
Google promos everywhere up to a certain level are in-org (IE not unknown engineering peers). They are done alongside rating calibrations.
Google promos everywhere up to an even higher level are basically in-PA.
For cloud, they are in-org to even a higher level. So the stuff talked about in this tweet would have been evaluated not just by cloud engineers, but almost certainly fairly local cloud engineers.
My experience, having been on promo committees at Google for 15 years, is that for every case in my org I see of Google incentivizing the wrong thing (or disincentivizing the right thing), I see probably 5-6 cases of it doing the right thing.
I see a lot more cases that are like:
1. Person builds new shiny thing for ... questionable reasons.
2. They make a mess of the world by not making it easy to migrate people to it and get rid of the old thing
3. Things are in a bad half-state, where both things have to be supported and the new thing does not provide obvious higher value.
4. Google doesn't promote them, they get pissed off about it and post on twitter about how Google didn't care about product excellence or something.
Than cases like:
1. Person toils away on the right thing forever
2. Person tries to get promoted for doing the right thing and having impact
3. Google turns them down
4. They get pissed off and post on twitter about how Google didn't care about product excellence or something.
Externally, of course, people can't distinguish these two cases because they look the same (angry people on twitter).
I have been promoted from SWE III to Senior Director at Google, working only on open source until i was an L7.
Agreed. I work in the PP's PA and was promoted on my first try for doing unsexy but worthwhile maintenance work on an important tool. It can happen. In my estimation the best thing to do if you want to get promoted is 0) have a positive relationship with your manager 1) read the SWE ladder for the next level 2) give your manager a document that describes your work using the language of the ladder as closely as possible. Said ladder does not say you need to deliver a shiny new thing, despite what is commonly assumed in this forum.
It's good to know the system I'm talking about is mostly dead; the committee of unknown engineers always felt bonkers to me. That being said, a ratio of 1:5 or 1:6 of bad incentives at promo committees still sounds pretty suboptimal to me — when I was on calibration committees at FB, it was super rare for me to come away with the feeling that the wrong thing was being incentivized or rewarded at an engineering level (oh boy was product management a different story, but that's a different post). There were a couple times where I disagreed with an outcome, but that was like 1:150 or better.
> While working on open-source projects might be tolerated, even on company time, it makes sense that it won't put you on a path to promotion unless there is direct benefit to the company from it.
The thing is - this is all theater. It's mostly made-up storytelling. An individual engineer's self-review doesn't actually speak to their direct impact on the company's success.
It's fallacy to act like you can map commits to dollars. But there's a lot of money to be made by HNers for trying, so I'm sure they'll act like they can do the impossible.
Define "killing kubernetes"? It's still pretty successful and the adoption hasn't slowed in any way I can measure. I promise you that some site you used TODAY is running, at least part of it, on Kubernetes.
Google employees regularly get promoted, at all levels, based on their OSS work - Kubernetes and other projects. We have dozens of people who work on Kubernetes, in one area or another, at varying degrees of depth. Is that "killing" the project?
Of course it is never ENOUGH. I'd happily consume hundreds more people. :)
> While correct, the same is true for most projects at these companies. Very little work that a single engineer does can be assigned a provable revenue number. How does someone working on an internal build tool get promoted? Or a new training model? A large-scale refactor of a legacy codebase? (All of these are examples of very senior promos at my own FAANG company).
Which FAANG though? Google is usually singled out for this criticism more than the rest of FAANG combined. It’s possible that your experience at a non-Google FAANG might not be relevant for Google specifically.
(Note: I wouldn’t know first hand, I’ve never worked at a FAANG).
This thread is ridiculous. At FAANG, and everywhere really, promos are decided based on impact and influence. FAANG are just more likely to be working on open source projects than employees at other companies.
All impact is difficult to quantify, not just contributions to OSS. The only easily quantifiable achievements for an engineer are delivered projects that directly generate revenue. In the teams I’ve worked in, I’d say that covers perhaps a third of projects.
Let's pretend Linus Torvalds worked for Google, would he be able to get promoted for creating and maintaining Linux? As far as the bottom line goes it brings in zero dollars and costs the company his total compensation or more to maintain.
How would Linus quantify his revenue impact in his promotion packet and compete against another hypothetical engineer who improved ad targeting by 0.0X% and can directly measure an increase in revenue for the company in the millions of dollars?
I don't know why people here think you have to quantify your revenue impact in your promo packet. This is literally not a thing, and any number you put in there will be ignored. No one up the engineering chain cares about dollar figures when it comes to evaluating ICs.
As for the Linus question, plenty of engineers with similar profiles have worked for Google, and they don't do so as junior developers their entire career. Junio Hamano, who has been maintaining Git while at Google for the last 15 years, is the perfect example. Sundar Pichai started his career working on products that had zero or negligible revenue (Chrome, ChromeOS, Google Drive) and is now the CEO.
The most Sr and influential ppl tended to come from web search and core infra, with no link to revenue in sight.
If people on engineering teams that don't have an attributable impact on revenue believe they'd be rejected because they can't show an individual impact on the bottom line then they won't submit a packet. That could be why you didn't see it - people who think this is necessary don't go for promotion. As the author of the Twitter thread states, they change team, or leave their role, instead. You'd need to look at why people move to revenue impacting teams, or why people leave the company, to understand this.
Looking at it from the perspective of the promo board obviously isn't going to work. "absence of evidence is not evidence of absence"...
People within the company are going to have a better idea than some random outsider. Anecdotally I've literally never heard of revenue impact being a requirement, and I'd expect few packets through L6 contain mention of revenue, and still a minority beyond that (but idk for sure).
Maybe this particular org is weird, but across Google this perception doesn't exist.
He wouldn't quantify his direct revenue impact, much as engineers who work on Borg and other infra don't. Or, he might point to a systems performance working group he helped chair and show that that working group improved performance of some syscall by n% (or added a new one that was n% faster), which when applied across all compute at Google has a wide revenue impact. (Or really he'd be at a leadership level where he'd need to show that the working group he led did that a dozen times, but you get the idea).
When you have difficulty quantifying, you start qualifying, and that leads to some options for quantification as a second order effect.
"Led the development and maintenance of custom OS underpinning our entire enterprise's technical stack. Worked with stakeholders across every domain within engineering to improve stability, performance, and feature set that enabled every part of the engineering org to better deliver" - "Examples of revenue visible effect are the debugging and patching of (performance improvement to the underlying OS) that reduced CPU utilization across the org, enabling a 3% reduction in fleet costs, allowing the company to save an estimated (estimate 3% reduction in fleet cost from before it was introduced)"
I can't speak to Google, but a good manager can work with their team to really highlight their achievements in a way the promo committee can understand. Admittedly, it still means learning and playing the system, but that's true regardless of the system you use.
Remember, too, that any impact on engineer time -also- has a direct provable reduction in cost. "Took 10 seconds off (process people wait on, build, test, whatever), saving an estimated (###) of engineering hours per year based on the number of teams using (part of process you affected)" should still be evaluated equivalent to revenue. If it isn't, you call it out, "leading to a reduction of engineering operating costs and saving the company roughly (estimate number of man hours * average cost of man hours)". So something that starts "saves effort" can be translated to "saves time" which can be translated to "saves money".
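To make that "saves effort, saves time, saves money" chain concrete, here is one way the estimate could be worked out. The run count, the 230 working days, and the $100/hour loaded cost are placeholder assumptions for illustration only; plug in your own org's figures.

```python
def annual_cost_saved(seconds_saved_per_run, runs_per_day,
                      working_days_per_year, hourly_cost):
    """Return (engineering hours saved per year, dollar value of them).

    hourly_cost is a fully loaded cost per engineering hour; the
    value used below is a placeholder, not a real figure.
    """
    hours = seconds_saved_per_run * runs_per_day * working_days_per_year / 3600
    return hours, hours * hourly_cost


# Hypothetical: 10s saved on a step run 10,000 times a day across all teams.
hours, dollars = annual_cost_saved(10, 10_000, 230, 100.0)
print(f"~{hours:,.0f} engineering hours/year, roughly ${dollars:,.0f}")
```

Stating the inputs alongside the result matters more than the final figure, since a reviewer can then argue with an assumption instead of dismissing the whole number.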
Also it's Google, so we should be looking to promote operating system authors who aren't white or male anyway.
Expecting hate for this, but I've been on the receiving end of company notifications (not at Google) where it was stated explicitly that the newly-opened distinguished engineer role was exclusively available to not-men.
(I'm generally-supportive of DEI efforts, but at sane companies those efforts wouldn't block qualified people from being promoted to their proper roles. Among the FAANGs, I have a lot more confidence in Facebook and Amazon's ability to operate sanely than I do Google. Prior evidence and all.)
Brendan Burns, the Kubernetes co-founder who currently works at Microsoft Azure as a Corporate Vice President, disagrees with the "killing Kubernetes" part.[1][2]
But yes, Big Tech promo committees' short-sighted interests do NOT align well with the long-term prosperity of multi-vendor sponsored open source projects. And it's no secret that open source contributors are overworked. This doesn't only affect K8s; it affects projects across the board.[3]
> going L6->7 at Google is worth ~200k/year, 7->8 is ~400. Similar patterns at other places. People have kids, mortgages, student loans
Honestly this is a bit ridiculous. If making 500k as an L6 isn't good enough for you maybe you should be working on projects that provably bring your company revenue?
Exactly. If you are an L7 and making an average L7 salary, nobody should have to squint and tie themselves in knots to figure out the connection between your contribution and your employer’s revenue. You are a Ferrari that is purchased new each year.
There’s a place where you can work on things that don’t generate revenue but are morally/technically interesting: it’s called “academia”.
What is a professor's job other than revenue? Instead of selling a product, they have to beg for money instead. Alternatively, they sell their research. Publish or perish. Raise enough money for the school or you're out.
The Bay Area is unaffordable because most cities haven't allowed anywhere near enough housing to be built by land owners over the last five decades: https://techcrunch.com/2014/04/14/sf-housing/.
Return building rights to land owners and the problem will be solved. The insane thing is the way coalitions of existing owners conspire to reduce supply and thus increase the values of their assets, and the rest of us just go, "Oh, okay, that's cool, that you're conspiring to embeggar the rest of us."
Indeed. If lots of people with money want to live in a place and building more housing is illegal the price of existing housing will go up a lot. If you want the problem to go away you can either build more housing or make it so the people with money don't want to live there.
The problem is not the tech people, it is the boomers who lived here before us. They have manipulated the system, voted down every attempt to build more housing, and leveraged their good luck to engage in rent seeking and siphon off half of the "ridiculous tech salaries."
My old landlords had average non-tech jobs in the 80s, bought a house for under 100k. At some point they used it to buy a second house, and another, and another. Now they own 30+ houses and rent each for 4-7k/mo.
Old next door neighbor's house was owned by her dad, he was a public works employee and owns 6 houses in the bay area now.
Neighbors on the other side were an older couple with adult kids and multiple rental properties in Fremont. They bought a house for their adult son outright before we moved.
Market forces are moving in your favor. Most European companies have completely saturated their target markets for outsourcing (Barcelona, Estonia, Czech Republic) and Russia & Ukraine are suddenly unavailable options due to a war. Also the local markets have much more demand for engineers than are available to fill it.
There has been a large push in the last two years to recruit Americans and Canadians to fill their unmet needs and salaries have been climbing rapidly. Salaries for new job postings in Berlin are about where they were at in New York 3-4 years ago which is EXCELLENT for that market (seriously, two years ago salaries were maybe half of what they are now).
If you're in Europe and you haven't been interviewing you should look around and maybe even use a recruiter. My own recruiter has been feeding me opportunities across Europe with pretty much equal money to what I'm getting here in the States.
It's likely that if you haven't changed jobs in the last two years that you're leaving significant amounts of money on the table. That said, I know that chasing after money isn't fashionable or socially acceptable to most Europeans...but the new hires are going to get it -- why shouldn't you?
There's over 4 million (and counting) Ukrainian refugees in Europe right now, and quite a few of those people are software engineers. Then there's over 100k (and counting) Russians that rushed to leave the country before the new iron curtain goes down - and I wouldn't be surprised if most of those are software engineers.
Ukraine and Russia certainly do vastly better in gender balance in programming than western countries, but that's counterbalanced by the fact that Ukrainian refugees are almost all women, children and the elderly. Men of fighting age aren't allowed to leave. And even in Ukraine, most programmers are men, so it may not lead to an influx of skilled people as you'd imagine.
You do know that refugees legally aren't allowed to work, right? You must fully complete your asylum-seeking process before you are allowed to seek employment. That often takes years if not decades.
And because of various financial embargos it's historically very hard to hire Russians in European countries unless they have bank accounts in those countries, which without EU citizenship or long-term residency are very, very hard to get.
Neither of these countries' citizens are allowed to freely travel and work within the EU the way that EU members can.
The EU's Temporary Protection Directive grants a residency permit (including the right to work) to Ukrainian refugees (for at least a year, to be extended as needed as the situation develops).
> Neither of these countries' citizens are allowed to freely travel and work within the EU the way that EU members can.
True, but they still have a temporary residency permit. And I suspect that their refugee status won’t take that long to be worked out. And then, from a realpolitik perspective, most EU countries cannot afford to have hundreds of thousands of additional people using infrastructure and resources whilst not working (I mean they could, but the whole continent is shifting to the right, so that won’t be a stable political situation). All of this to say, I am pretty sure most of them will be able to get residency permits, which include the right to work in a country and freedom of movement to the others, for at least a couple of years.
Google created Kubernetes so potential competitors are dragged down with immense technical debt, so of course they’re not going to reward internal advancement of it.
Having participated in full on rearchitectures to Kubernetes several times at this point I can say that it wasn't justified a single time, even at the 10^4 microservice scale. Now having participated in several fairly-effortless rollouts of Nomad and arrived at a better place, it's funny to watch the rest of the industry cargo cult.
I actually don't believe that either of these solutions are going to be a long-term settled end state (though Nomad with Firecracker, or really any other Firecracker-centric solution that isn't k8s will have some legs) the industry falls on, but I agree wholeheartedly that Kubernetes is purely a distraction having seen the pain and the gnashing of teeth.
Every painless Kubernetes story that I've seen is at a scale where Kubernetes wasn't even necessary/justifiable and another solution would have been even simpler. But at least it's good for the resume.
And Helm charts are akin to K8s' own borgcfg. We're doomed to repeat our mistakes, it seems.
To counter your argument. I did one full migration from AWS to GCP Kubernetes (GKE). The project was a huge success, simplifying our stack, deployments, logging, etc, etc.
We reduced our costs by 2/3rds, saving millions per year. Teams have been able to move onto feature work instead of maintaining custom deployment tooling. The Ops team is 1/2 the size that it was before but is able to handle twice as many customers.
I've seen small startups waste their time with things like self-hosted Kubernetes in a quarter-rack of colo space for workloads they could have instead hosted with KVM or on cloud instances with 1/100th of the management overhead. Scale that up to a half-dozen racks of servers and you're really still telling the same story. Even OpenStack of all things is easier to manage.
This covers the scale of the actual operations of 99% of companies. Probably some more 9's there afterwards. Go and read about how StackOverflow's infrastructure has developed over the years and how damned simple and effective it is.
If you aren't Fortune 100 or you don't have extremely specific performance needs for your (re)deployments, then it's highly likely that rolling out Kubernetes infrastructure is akin to driving screws with a sledgehammer.
I think part of this is the downside of capital being far too cheap for too long. Companies have way overbuilt their infrastructure for reasons that aren't really moving the needle forward. Many barely make an effort to control their costs. At most companies my expectation going in is to look at their infrastructure and see somewhere between 0.02 and 10% hardware utilization across everything they have. Even the companies running Kubernetes hardly seem to be doing a better job because very rarely is Kubernetes running 100% of their infrastructure, if even 10%.
Right, but they can achieve the same with managed kubernetes.
I am not sure what management overhead you are referring to. I.e. what do you "manage" (as a human) if you choose a managed kubernetes offering [e.g. DigitalOcean or Linode] (vs. OpenStack)?
Also, I am not sure that hardware utilization issues have any relation to kubernetes. I.e. you would have this problem regardless?
In general, my own motto is that if you do not use kubernetes, you would end up writing it.
Managed Kubernetes isn't without its own overhead.
Let's take the most popular option, GKE. You're going to be on a release channel and you need to understand that Google is going to upgrade your clusters/node pools frequently and it's your responsibility to keep your workloads ahead of deprecated features. If you pay attention to your emails, okay, but if there's ever a critical security issue in K8s, Google will force migrate you to the latest patched version immediately. Even in Stable. Even if some Beta features you depend on aren't actually working in that new version (this actually happened when Workload Identity was still in Beta... because "you shouldn't be using Beta features in Production" even though that's what Google's documentation and your TAM said explicitly to do). Good luck!
Then there's the fact that Google doesn't give you any overt knobs to manage the memory of their workloads in the kube-system namespace. They spec these with a miserly amount of memory that will cause their metadata server and monitoring stack to crashloopbackoff if you have even a moderate amount of jobs that log to stdout/stderr. Some of these you can add a nanny resource to expand the memory, but you will have to reapply this every time the node pool gets upgraded/replaced, which can and might happen during any maintenance window (weekly, or at any time). Some others you have to edit Google's deployment and reapply your changes every time Google deploys a new version. This means that you need to monitor and manage Google's own workloads in your "managed" clusters.
Setting up your own custom ingress to work within the parameters of their loadbalancers is non-trivial. If you use rpc/gRPC, you _will have to do this_. GKE's ingress controller will not serve your needs.
Setting up the VPC and required subnets properly is equally non-trivial, but luckily you just have to figure out the proper operating parameters once. Oh and remember Google Cloud is deny by default. Your developers will need to know some things about how to expose their workloads with proper firewall/Cloud Armor rules. Unless they have prior cloud experience they likely won't have a clue as to what's required. Congrats, ongoing training of teams of engineers is now part of your role.
Enjoy the misery of operating regional clusters and all of your node pools suddenly becoming non-functional the minute one of your region's AZs goes down, which on average happens to at least one of any given cloud's regions twice per year. Hopefully not your region. And you thought that operating regional clusters would give you the kind of HA to avoid these situations but when running K8s on top it's the control plane itself that can't handle that kind of outage...
Oh and if as a requirement to run you have jobs that require changing a kernel parameter that GKE doesn't expose to you to set on the node pools (because they're not yet "safe" sysctls in k8s), such as fs.file-max, you don't have any recourse.
There are numerous tickets open for these issues and they have been open for years. Google isn't forthcoming with solutions, other than they prefer you scale up rather than out when K8s is advertised to be a solution that favors scale out (but for current deficiencies in K8s' software stack the reality is that scale up is truly the preferred option anyway).
Managed Kubernetes isn't a panacea if you have a reasonable scale of work that you're throwing at the clusters. You _will_ have to work with your cloud providers' support team to tune your clusters beyond the defaults and knobs that they expose to regular everyday customers. You'd better have an Enterprise support account that you're paying $35,000/50,000 + %ofSpend a month for.
> Managed Kubernetes isn't a panacea if you have a reasonable scale of work that you're throwing at the clusters.
And Nomad is?
I’ve seen nomad, particularly nomad and consul setups fail spectacularly. One only has to search for the recent Roblox outage for a high profile example.
Nomad requires just as much engineering talent to run as k8s, it is just packaged differently - and has less community support, buy in. Nomad plus Consul probably requires even MORE engineering talent to run. How many times have you had to troubleshoot etcd within a k8s cluster?
Come to think of it as a practitioner of both I can’t say k8s has ever failed me as spectacularly as nomad has.
I say all of this as a fan of Nomad.
Moreover: managed nomad is basically not even a thing, and if HashiCorp or some partnership were to offer it, it would probably be ludicrously expensive (looking at vault and consul here).
Roblox's outage was uniquely due to the fact that they tried to do two things you should explicitly not do when operating production consul clusters. They even admitted as much.
1) They were using Consul for multiple workloads -- as both their service discovery layer and as a high-performance kv store. These should have been isolated
2) They introduced high read _and_ write load without monitoring the underlying raft performance. Essentially they exceeded the operating parameters of the protocol/boltdb.
well, and they turned on a new feature without proper testing. There's a difference between doing something stupid and the services being inherently flawed. Kubernetes has equally as many if not more footguns to shoot yourself with.
As far as your etcd comment, I know several people at different companies whose literal fulltime job has devolved to tuning etcd for their k8s clusters.
Ask anyone who works/worked at Squarespace 2-3 years ago. They had a full team of 30+ devops folks fulltime dedicated to Kubernetes and their production clusters were failing almost _daily_ for a full year.
k8s and Nomad are both complex systems. My point is that Nomad isn’t some sort of magic bullet that cures the k8s illness. I can’t really speak to k8s 2-3 years ago because I wasn’t working with it. It has been a dream for me and my teams these days. But it still has its issues. Like anything else.
Transitioning from k8s to nomad for simplicity reasons doesn’t make a lick of sense. Nomad is going to fail in _at least_ the same ways and there is going to be exponentially less information as to how to fix it out there.
Wow. Thanks for writing that. I've had friends preaching the miracles of k8s or managed k8s to me for years but reading your post I'm glad I went with a rinky-dinky systemd+debs setup for my current servers. You can't even adjust the FD limit?! Force upgrades without notice that remove features in production with a blame-the-user mentality?
Google's approach to all of their services is an upgrade treadmill. They take the opposite approach of AWS who operate things like an actual service provider.
Google will change APIs and deprecate things with little to no advanced notice. Anywhere except Ads, really. The progress of code at Google is a sacred cow.
You may have found random APIs of theirs barely in KTLO status -- like the last time I used their Geocoding API 5 or 6 years ago, not a single one of their own client libraries worked according to how they were documented. Specifically the way you do auth was wildly different from the documentation and seemed to be wildly different between client libraries as well. It seemed to me like they went through three or four rounds of iteration w/ that API and abandoned different client libraries at different stages. Extremely bizarre.
If you can operate like that, then using Google's stuff is fine. You just have to be ready and willing to drop everything you're doing to fix things at any time because Google decided to break something that you rely on.
This is not really just a problem at Google. We might as well claim that promotion processes in big companies promote promotion-oriented work (yes, I intentionally use the word promotion this many times). The result is title inflation, lots of good employees leaving, and superfluous yet mediocre projects. I wouldn't say only unqualified people get promoted, though. Big companies have very different dynamics than small ones. Some people are just good at navigating big companies and are capable of aligning multiple organizations, for good or for bad. Big companies need such talent as well. As for individuals, the reward of getting promoted is not worth the effort, at least statistically. A much better alternative is focusing on solving truly interesting and meaningful problems in a blow-out company. The financial return and title bumps will follow naturally.
That's the most ridiculous bubble valley thing I read in a while, coming from a non-k8s using non-FAANG-employed European.
Which isn't meant to be dismissive of the problem at hand. I can totally see how it's happening and how it might be bad for the people involved in the project.
To phrase it differently - if the project is dying because it hinges on people working on a promotion that gives them a /bigger paycheck increase than my yearly income/ - well, good riddance. I'm staying on my side of the fence and continue to not buy a new Tesla every year.
> if the project is dying because it hinges on people working on a promotion that gives them a /bigger paycheck increase than my yearly income/ - well, good riddance. I'm staying on my side of the fence and continue to not buy a new Tesla every year.
An understandable position but this is simply how competitive environments work.
It’s a lot like the college application mania. There’s tons of gifted high school students who clearly can succeed without needing to volunteer in a foreign country every summer while playing oboe and competing in lacrosse at the state level.
But seats at Harvard are competitive, and those same talented and well-to-do high school seniors often only have college admissions as their ruler to measure their growth and self-worth. So they compete on every angle to try and get into Harvard rather than merely Dartmouth or heaven forbid BU.
It’s reasonable to think working adults should have measures of personal growth that don’t rely on climbing the corporate ladder, but conversely all those who *do* obsess with climbing the ladder will naturally be clustered around places of wealth/prestige, because that’s what they are optimizing for. Much like the Harvard application pool, you’ll find these types doing what it takes to stand out and land that 700k/yr L7 position at Google.
So I guess the answer is that the most stable and secure position for an open source project to be in, is one that does not rely on development being coupled to big tech. I imagine most open source folks could have told you that anyways :)
Kubernetes' biggest problem by far is its mess of accumulated technical debt. FAANG promo committees very likely undervalue FLOSS contributions, but this hits all of FLOSS pretty much equally; it hardly explains why k8s specifically would be in so much trouble.
I would go as far as to say it's not just limited to FLOSS contributors. In my experience at Google, not every service is profitable or brings in revenue, so it can be hard to justify "impact" on those teams. I certainly remember beating my head on how to spin my contributions as having "impact."
Kubernetes doesn't get patches from hyperscale companies because kubernetes is not really hyperscale software. If I recall correctly (probably don't), clusters can handle tens of thousands of container hosts. That's not manageable at large scale unless you're selling people individual kubernetes instances directly (which, to be fair, google and amazon are).
In my experience the thing that kills open source at big companies is the divergence of needs: The company needs massive scale software, but the community wants ease-of-use and features. Most open source models are a big drag on productivity (you can't just refactor all the callsites to do a migration!), and it's hard to change directions when the company's need does (since the other contributors will likely want the original software).
The tailscale people have been arguing this as well (with more data and better reasoning), but fundamentally you need different software when you're a scrappy startup or a medium size company or a giant behemoth.
What's preventing k8s from being useful in these larger scale deployments? GP was not very clear about that, they just said that it's not scaling enough.
K8s doesn't do anything to help you with managing the fleet of k8s hosts. At 10^5 host machines you have a much bigger problem on your hands than the one that k8s is solving for.
If you have and are solving that hyperscale infrastructure problem then whether or not you're running kubernetes doesn't actually make a damn bit of difference to how you allocate your human/financial resources.
Or summarized, at the scale where Kubernetes is advertised to make the biggest difference, it's actually irrelevant/an implementation detail.
10^5 is also on the low end of hyperscale. I'd bet that google/amazon have high single-digit millions of machines.
So yes, big companies can afford to pay people to customize their systems, and do so. The infra teams at these places are enormous, even just counting job orchestration.
The argument upthread (mine) is that the companies do so, and that kubernetes is the wrong choice for them. As a result, they don't really contribute to it.
No but that's not the point. The scale of the underlying operational problem is so much bigger that whether you're running kubernetes or some other solution on top of it doesn't actually make any material difference (kubernetes or any other option are effectively equivalent).
Thus hyperscale companies have no reason to contribute to Kubernetes. It does not affect their bottom line. Hence all of the memes about companies chasing Kubernetes and being distracted from their real goals.
A lot of k8s is just not optimized for beyond a relatively small scale (10k services etc.). In particular many of the components rely on a primary node to handle all the traffic rather than sharding work across a group of nodes.
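To make the primary-versus-sharded distinction concrete, here is a minimal sketch of spreading keys across a group of nodes with a stable hash instead of routing everything through one primary. It is not how any Kubernetes component is implemented; the node names and the `shard_for` and `load_per_node` helpers are made up for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key: str, nodes: list[str]) -> str:
    """Pick a node for a key using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def load_per_node(keys: list[str], nodes: list[str]) -> Counter:
    """Count how many keys land on each node."""
    return Counter(shard_for(key, nodes) for key in keys)


if __name__ == "__main__":
    services = [f"service-{i}" for i in range(10_000)]
    # One primary: every key lands on the same node.
    print(load_per_node(services, ["primary"]))
    # Four shards: the same keys are spread across the group.
    print(load_per_node(services, ["node-a", "node-b", "node-c", "node-d"]))
```

Plain modulo hashing like this reshuffles most keys whenever the node count changes, so a real system would more likely use consistent hashing or explicit shard assignment.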
Completely true. Woe unto those who learned this the hard way and have suddenly found themselves in the fulltime position of "tuning etcd for kubernetes".
Maybe I’m misunderstanding but why would you want to run one giant kubernetes cluster at hyper scale? I would think at hyper scale you would be running micro services with one kubernetes cluster per service per datacenter/region.
At hyper scale, efficiency matters a little. Small gains multiplied by big numbers become big gains. The overhead of running a cluster is not small, but more importantly - the overhead of managing nodes between services is real.
No, you would not want to run a cluster per service; Kubernetes is somewhat multitenant. You'd get better efficiency from running one cluster per datacenter, with all those teams in their own namespaces. Kubernetes can stack many containers per physical host, with widely varied workloads cooperatively sharing.
There’s overhead in both of these models. When you try to scale centralized infrastructure at massive scale across a large number of services and teams, you run into many scaling problems that your central infra teams will need to solve. In addition, this model gives less flexibility to your service teams and adds certain centralized failure scenarios that might be undesirable at the scale of such a company.
Instead of this you can have central infrastructure and platform teams have partial ownership of infrastructure while service teams or departments have partial ownership. In that model a service team or department owns their own cluster with infrastructure code partially provided by the platform team running on compute infra maintained by the infrastructure team.
This model has seen much success at the scale of Amazon while maintaining SLAs and controlling costs. There are of course a number of drawbacks to this model at scale that you can ask any former Amazon engineer about.
I would think a similar model can be mapped to hyperscaling kubernetes where operators, cross cluster infra, and base kubernetes configs are maintained by the platform team while departments or service teams (depending on scale of team size) maintain their own clusters at whatever granularity fits the company’s scale (e.g region, datacenter).
This is also where cloud can help alleviate some pain for your platform and infra teams by using managed solutions to solve some of these problems.
In my opinion both are viable models at scale; it just depends on the needs specific to your company.
> Kubernetes can stack many containers per physical host, with widely varied workloads cooperatively sharing.
Until you learn the downsides of this approach at hyper scale, namely that containers and a shared kernel mean that all of your workloads are sharing the same kernel parameters, including things like network limits and timeouts and file handle limits. Multitenancy and containers actually ends up working against you and creates new problems and knobs in your individual jobs that you have to configure -- to the point that it's almost worth just having different types of jobs run on different isolated node pools and eliminating your multitenancy issue anyway.
Companies that scaled on KVM never had to learn about these limitations and just focused on what their hardware was capable of in aggregate.
At hyper scale and with multitenancy, microVMs are always going to be the end state -- and while there's k8s support for this, it's far from the default or even most convenient option.
Network limits and timeouts aren't different between kubernetes hosts and non-Kubernetes hosts. Network resources are a real resource, and you may need to implement quality of service or custom resources (a new feature [1], and one that is late to the party).
File handle limits are something no sane workload ever encounters. They are technically a shared resource, but in a sensible kubernetes configuration, it is impossible to hit because the ulimits on each process are low enough. A very small number of teams may need an exception, with good reason, and will typically be cordoned on to their own node classes that are specially tainted.
Yes, fleet management via taints offers nothing over the fleet management that you've already got. This is a good thing. Fleet management tools are a damage to your reliability. They mean that your machines are non-fungible. Kubernetes' great innovation is making machines, units of compute, fungible.
There are workloads and architectures that will never be suitable for kubernetes. There are HPC Clusters that heavily rely on things like rack-locality that Kubernetes views as damage. Getting rid of them is a net win for humanity.
If your web crawler is using a hundred thousand filehandles, you've got a problem. You shouldn't need that many. You can support ten thousand open web requests, for sure, but you don't need ten filehandles for each, just a few hundred connections to intermediate processors and databases where you store the scraped data.
High performance template rendering has as many filehandles as open requests - Maybe 10,000. If it's actually high performance, the templates underneath aren't files anymore by the time you're processing, they're stored in memory.
Databases are almost an exception, but you shouldn't be running "Large DB" on a shared host on K8s. You should taint and dedicate those machines. K8s is still useful as a common management plane, but I'm roughly on the fence of "Just run those machines as a special tier" and "Run them on k8s with dedicated taints", because both have advantages. Smaller databases run just fine. Postgres is using ~10k filehandles.
There was a time for scheduling specifically for "filehandleful" jobs. It's long gone. Modern linux systems set the filehandle limit to something obscene, because it's no longer a limiting factor, and it hasn't been on these workloads for 5 years.
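If you want to check what the limits actually are on a given host or inside a given container, a rough sketch like the following works on Linux. It only uses the standard library `resource` module and reads `/proc/sys/fs/file-max`; the helper names are made up, and the numbers you get will depend entirely on your distro and runtime configuration.

```python
import resource
from typing import Optional


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) per-process open-file limits (RLIMIT_NOFILE)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def system_file_max(path: str = "/proc/sys/fs/file-max") -> Optional[int]:
    """Return the system-wide fs.file-max value, or None if it cannot be read.

    The /proc path is Linux-specific; on other systems this returns None.
    """
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"per-process RLIMIT_NOFILE: soft={soft} hard={hard}")
    print(f"system-wide fs.file-max: {system_file_max()}")
```

The per-process ulimit and the kernel-wide `fs.file-max` are different knobs: the first is what an individual workload hits, the second is the shared ceiling the upthread sysctl complaint is about.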
At hyper scale you don't need to worry about sharing as much because the important services are far bigger than one machine. That sidesteps the problem: you can apply whatever sysctls or configs you need to do before starting the container.
"Multitenancy" here means "I have a giant pool of machines and I run a bunch of jobs across them" not "I have a pool of giant machines and I stack jobs on them".
You wouldn't want to run one giant cluster, but at hyper scale you're talking about running thousands or tens of thousands of kubernetes clusters. That's the part that doesn't scale well, for a couple of reasons.
The biggest one is just mechanical: with that many clusters it will be hard to move capacity between clusters, and locality gets baked into everything you do (people do try to build around this, but it's awkward).
If each service or team runs their own kubernetes that's a lot of overhead: kubernetes will need something like 6-7 machines for the cluster (I don't have production experience with kube, spitballing here), so small teams or jobs will have terrible efficiency. Big teams will have to spend a lot of operational effort to manage their fleets.
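As a back-of-the-envelope illustration of that efficiency point, the sketch below computes what share of a cluster's machines is fixed overhead. The figure of 6 overhead machines is just the spitballed number from the comment above, not a measured requirement, and `overhead_fraction` is a made-up helper.

```python
def overhead_fraction(overhead_machines: int, worker_machines: int) -> float:
    """Share of a cluster's machines that run no team workload."""
    total = overhead_machines + worker_machines
    if total <= 0:
        raise ValueError("cluster must have at least one machine")
    return overhead_machines / total


if __name__ == "__main__":
    # Assumed fixed cost per cluster, taken from the rough estimate above.
    overhead = 6
    for workers in (3, 30, 300, 3000):
        share = overhead_fraction(overhead, workers)
        print(f"{workers:>5} workers + {overhead} overhead machines -> {share:.1%} overhead")
```

Under that assumption a team with 3 worker machines spends two thirds of its fleet on overhead, while the same fixed cost is a rounding error for a team with thousands of workers.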
It's worth noting that at hyperscale there will be individual jobs in a datacenter that are bigger than kubernetes handles comfortably. Handling this efficiently becomes very important, it's literally billions of dollars worth of hardware.
This is just the law of inertia at play. Big projects slow down. K8s isn’t just the K8s core, it’s a massive ecosystem, and when taken as a whole it’s unbelievably vast and growing at an amazing pace. As it does, the core of it moves slower, there are more stakeholders, more arguments. Small changes make for big problems and not everyone can be happy. Patches will keep coming, major changes will eventually become impossible. The death of Kubernetes makes for great clickbait, but most of us won’t see it in our lifetimes. Not when our banks run on 60 year old code. Kubernetes is too big to fail at this point.
As far as FAANG promo committees go, let them value what they value. Kubernetes is a direct revenue driver for many companies; its health is tied to billions of dollars in investments. Just because one cohort of contributors age out, cash out, or fall in love with someone doesn’t mean there is no one to take their place. The new people won’t do it the same way and that is okay, even great. I’m grateful for the vision and effort that have made K8s the platform it is, and if I can translate that into a contribution in the future I will.
Finally, I’ll say that people who really love working on the project may not get support to work on it full time, but then may find themselves able to retire sooner than many and have ample opportunity to contribute.
I basically can't stand the comment section of Kubernetes threads. Lots of big teams are getting big things done with k8s, and you would not know it from here.
Not sure why this says FAANG because it's talking about Google.
Calibration, promo committees, feedback and the like are intended to create equivalent expectations on impact across different orgs. The dirty little secret however is that at best it has limited success.
The best advice I can give anyone in such an organization is to be liked by your manager and their manager. If that's true, good things will tend to happen. If it's not, good things will be a lot less frequent.
Put another way: you can take the exact same set of objective facts and use them to say a person did a good job or a bad job. There's a popular meme about feedback at Google that goes something like:
> This project would've failed without this person. It failed anyway but it definitely would've failed without them.
The difference ultimately boils down to whether or not they like you.
Here are a few Google-specific tidbits worth knowing:
1. Ratings are fit to a curve across a sufficiently large pool, typically at the director level and usually over 100-150+ people. This means there will be a percentage range of people who get Meets All, Exceeds Expectations, Greatly Exceeds, etc. This is intended to stop ratings inflation;
2. A consequence of (1) is that ultimately you are competing with people in your org for those better ratings. This can create some perverse incentives and a toxic environment;
3. It is almost always better to let something blow up and come and fix it rather than preventing that from ever happening. The first will get you a lot of recognition. The latter will get you almost none;
4. Promos at Google are stack-ranked. Each committee gets 10-15 packets that will be for a particular level. The committee will rank those packets. After that the promotion target will come into play. This is set by management and was allegedly cut as a cost-saving measure when Ruth Porat came on board. If it's 20% then the top 20% from that ranking process will be promoted.
You will find people who serve on those committees who say this isn't how it works and they'll argue they're evaluating if someone is operating at the next level or not. This is partially true. These packets will be divided between promote, don't promote and on-the-bubble. The on-the-bubble group will be sufficiently large to allow for the promotion target to be met;
5. For SWEs, L5 is the "terminal" level, meaning there is an expectation of growth to that level. L3->L4 and L4->L5 once went through promo committee but now don't. Management within orgs decide this. These too have target percentages and there have been cases where the promotion rate has been too "high" and orgs have been told to cut back on promotions to meet the targets;
6. There is a massive backup at L5->L6. Because of the low target percentages the impact required keeps going up and really you need your management to really push for this to happen. There are limited slots so you may be waiting years and again this is why them liking you matters so much. Google is full of L6s who got promoted 5-10+ years ago that would never make the grade by today's standard. For really old cases you can find archives of why they were promoted and you'll find cases like "promoted unit testing".
I say all this because the author of this thread seems to fundamentally misunderstand how this process works.
The promo stack rank goes the other way. You decide on the promote vs no promote part first and then rank based on confidence. I’ve absolutely been in sessions where 100% of people were promoted. You don’t rank and then apply a bar based on a target promo rate.
There are certain classes of work that are completely ignored by most companies, such as build, automation, developer tooling and infrastructure cost cutting. These are not flashy features that you can show off to customers but they can pay dividends year over year.
(Personally, I have saved millions a year in infrastructure costs and my promotions/compensation were not close to what a sales person would get if they brought in millions of recurring revenue)
K8s is a great example of what murders open source: monolithic platforms that exist unto themselves, but have some seemingly benign model allowing arbitrary "plugins" to pretend it's not really a monolith in sheep's clothing. It looks like open source and has the right license, but it ends up just mirroring corporate interests rather than those of the Commons that the open source movement was created to support.
If you're going to have monolithic platforms anyway (and you will, because designing for modularity is hard, whereas hacking together some random monolithic code is easier) they should at least be open source rather than proprietary.
Actually I disagree. Any time there's an Open Source tool that does everything for free, it's extremely hard to justify paying for a proprietary product that does it much better, because companies are cheap. There's many open source projects like this that are just terrible but a company will always use them first because they're free.
The answer isn't modularity, it's composability, which is a significant difference. A modular program requires "integration" (tight coupling of APIs/ABIs) whereas a composable program has loose interfaces which require virtually no "integration".
Drone.io is one example; you can implement any "plugin" purely by creating a container with an initial CMD entry that reads environment variables, and that program can interface back with Drone a number of ways (STDIN/STDOUT, REST API, database). Another is any application that just reads in or spits out simple line-by-line instructions or a JSON blob. Unix pipe based programs are the ultimate example. The dumber the interface, the easier it is to compose.
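As a rough sketch of what such a "dumb" interface looks like in practice, the program below takes its configuration from environment variables, reads plain lines, and emits one JSON object per line. It is not an actual Drone plugin; the `PLUGIN_PREFIX` setting name and the `run` function are invented for illustration.

```python
import io
import json
import os
import sys


def run(lines, env, out) -> int:
    """Prefix each non-empty input line and emit it as a JSON line.

    `lines` is any iterable of text lines, `env` any mapping of settings,
    and `out` any writable text stream, so the same function works with
    real stdin/stdout or with in-memory buffers.
    """
    prefix = env.get("PLUGIN_PREFIX", "")  # hypothetical setting name
    count = 0
    for line in lines:
        text = line.rstrip("\n")
        if not text:
            continue
        out.write(json.dumps({"message": prefix + text}) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Demo with in-memory input; a container entrypoint would instead call
    # run(sys.stdin, os.environ, sys.stdout).
    sample = io.StringIO("build started\nbuild finished\n")
    run(sample, {"PLUGIN_PREFIX": "ci: "}, sys.stdout)
```

Because the only contract is "environment in, lines in, JSON lines out", anything that can set variables and pipe text can drive it, with no shared library or API version to keep in sync.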
This is the case with most major products from most major tech companies. There is a reason you don't (often) see steady incremental progress on products, but rather big flashy releases that never quite get their bugs cleaned up unless they are wildly* successful.
* Where wildly is defined as a success level which would leave any rational founder very wealthy, and very content.
Hopefully Kubernetes does die. Most companies write shit slow software and then think they're big smart boys because they have to buy $20,000/month of servers to make it not be unusable. In many shops it could and should be replaced by a handful of efficient servers that do not suck shit.
Oh man, another rant that boils down to "big evil corporations are KILLING open source software by not supporting it in the specific ways and in the specific amounts that I want."
This seems like a strictly Google only problem. And by extension this is the problem with promotion committees that are removed from the day to day of the candidate getting promoted. People are ridiculously bad at measuring growth based on some obscure company wide criteria - which is abstract at best and useless at worst at a company as large as Google.
I'm at AWS and the promotion process here is largely contained within the candidate's org which is at best two levels higher but in the same business. So if your team's business relies on contributions to FOSS, its impact is measurable and leaders and the 'committee' can easily tell you if your contributions are at the next level.
Go figure, contributions to a multi-vendor consortium that just barely makes a marginal profit for an already marginal business unit (GCP) are not as highly valued as things that can be directly connected to revenues. This is not surprising in the least.
Perhaps obviously, Mr. Kantrowitz hasn't quit, or been poached by a competitor, so I would say Google's (cynical) comp and promo policy is working out just great for them. Whatever personal interest keeps Mr. Kantrowitz going is working out for them!