Hey, author here, I think you ask a good question and I think you frame it well....

felixgallo · 2024-08-08T20:18:05 1723148285

I have a constructive recommendation for you and your engineering management for future cases such as this.

First, when some team says "we want to use helm and etcd for some reason and we haven't been able to figure out how to get that working on our existing platform," start by asking them what their actual goal is. It is obscenely unlikely that helm (of all things) is a fundamental requirement to their work. Installing temporal, for example, doesn't require helm and is actually simple, if it turns out that temporal is the best workflow orchestrator for the job and that none of the probably 590 other options will do.

Second, once you have figured out what the actual goal is, and have a buffet of options available, price them out. Doing some napkin math on how many people were involved and how much work had to go into it, it looks to me that what you have spent to completely rearchitect your stack and operations and retrain everyone -- completely discounting opportunity cost -- is likely not to break even in even my most generous estimate of increased productivity for about five years. More likely, the increased cost of the platform switch, the lack of likely actual velocity accrual, and the opportunity cost make this a net-net bad move except for the resumes of all of those involved.

Spivak · 2024-08-08T20:25:15 1723148715

> we can't support etcd and helm and you will need to find other ways to work around this limitation

So am I reading this right that either downstream platform teams or devs wanted to leverage existing helm templates to provision infrastructure and being on ECS locked you out of those and the water eventually boiled over. If so that's a pretty strong statement about the platform effect of k8s.

vouwfietsman · 2024-08-09T07:24:12 1723188252

Hi! Thanks for the thoughtful reply.

I understand what you're saying, the thing that worries me though is that the input you get from other technical teams is very hard to verify. Do you intend to measure the development velocity of the teams before and after the platform change takes effect?

In my experience it is extremely hard to measure the real development velocity (in terms of value-add, not arbitrary story points) of a single team, not to mention a group of teams over time, not to mention as a result of a change.

This is not necessarily criticism of Figma, as much as it is criticism of the entire industry maybe.

Do you have an approach for measuring these things?

felixgallo · 2024-08-09T15:14:22 1723216462

You're right that the input from other technical teams is hard to verify. On the other hand, that's fundamental table stakes, especially for a platform team that has a broad impact on an organization. The purpose of the platform is to delight the paying customer, and every change should have a clear and well documented and narrated line of sight to either increasing that delight or decreasing the frustration.

The canonical way to do that is to ensure that the incoming demand comes with both the ask and also the solid justification. Even at top tier organizations, frequently asks are good ideas, sensible ideas, nice ideas, probably correct ideas -- but none of that is good enough/acceptable enough. The proportion of good/sensible/nice/probably correct ideas that are justifiable is about 5% in my lived experience of 38 years in the industry. The onus is on the asking team to provide that full true and complete justification with sufficiently detailed data and in the manner and form that convinces the platform team's leadership. The bar needs to be high and again, has to provide a clear line of sight to improving the life of the paying customer. The platform team has the authority and agency necessary to defend the customer, operations and their time, and can (and often should) say no. It is not the responsibility of the platform team to try to prove or disprove something that someone wants, and it's not 'pushing back' or 'bureaucracy', it's basic sober purpose-of-the-company fundamentals. Time and money are not unlimited. Nothing is free.

Frequently the process of trying to put together the justification reveals to the asking team that they do not in fact have the justification, and they stop there and a disaster is correctly averted.

Sometimes, the asking team is probably right but doesn't have the data to justify the ask. Things like 'Let's move to K8s because it'll be better' are possibly true but also possibly not. Vibes/hacker news/reddit/etc are beguiling to juniors but do not necessarily delight paying customers. The platform team has a bunch of options if they receive something of that form. "No" is valid, but also so is "Maybe" along with a pilot test to perform A/B testing measurements and to try to get the missing data; or even "Yes, but" with a plan to revert the situation if it turns out to be too expensive or ineffective after an incrementally structured phase 1. A lot depends on the judgement of the management and the available bandwidth, opportunity cost, how one-way-door the decision is, etc.

At the end of the day, though, if you are not making a data-driven decision (or the very closest you can get to one) and doing it off naked/unsupported asks/vibes/resume enhancement/reddit/hn/etc, you're putting your paying customer at risk. At best you'll be accidentally correct. Being accidentally correct is the absolute worst kind of correct, because inevitably there will come a time when your luck runs out and you just killed your team/organization/company because you made a wrong choice, your paying customers got a worse/slower-to-improve/etc experience, and they deserted you for a more soberly run competitor.