When feature flags do and don’t make sense (2019) (rajivprab.com)
98 points by nomdep on Sept 22, 2020 | 40 comments

I found this post to be a great complement to my hobby of collecting/documenting all the feature flags used in the Windows operating system for the past three years. [1] It's proven to be a reliable source of what's to come in future builds of Windows, sometimes to the chagrin of teams unfamiliar with the public facing artifacts generated by their experimentation <grin>.

Windows (what ships inbox, that is), as of last week (Sep 14) [2], has roughly 2500 feature flags. Some are permanently jammed into the on position, some off position, and the rest are configurable by its experimentation frameworks and hackers. (Apps are a separate beast, have their own experimentation tooling, their own flags, etc.)

I don't understand why the jammed-on features still exist in the OS. I'd imagine there will (eventually?) be a measurable impact to leaving all this trash behind. I suspect, like the author noted, it's non-zero risk work that no one wants to complete.

[1] https://github.com/riverar/mach2/tree/master/features

[2] https://github.com/riverar/mach2/blob/master/features/20215....

I loved the testimonials section. What do you do with that? Do you keep turning them on and off to see how they work? How did you originally find out how you can extract and set/unset feature flags in Windows?

Most of the time, I just document what's new, maybe flip it on for a screenshot, and move on. Sometimes it ends up in the news. [1] Microsoft has a rich history of hiding stuff [2] in the OS though.

[1] https://www.bleepingcomputer.com/news/microsoft/windows-10-i...

[2] https://www.zdnet.com/pictures/windows-7-the-blue-badge-expe...

Solid article. I've worked with complex feature flags / gating systems for years at multiple large internet companies and have largely come to the same conclusions.

That feature flags let one version of the software run in a combinatorial number of modes is both their superpower and kryptonite. Use them wisely and clean them up as soon as possible.

One problem I've seen happen over and over again is when people pile feature flags on top of feature flags. With the systems I've used, all the flags can be independently enabled by the gating service, but that doesn't mean that every combination of flags is a valid state for the program to run in. If your system will break if flag A is active while flag B is not, then it's worth the effort to write an abstraction that checks both flags and fails to a valid state.
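A minimal sketch of such an abstraction (the flag names `new_checkout_ui` and `new_checkout_api` are hypothetical): if one flag depends on another, validate the pair together and fail to a valid state by disabling the dependent flag rather than running in a broken combination.

```python
def flags_consistent(flags: dict) -> bool:
    """new_checkout_ui (flag B) requires new_checkout_api (flag A)."""
    if flags.get("new_checkout_ui") and not flags.get("new_checkout_api"):
        return False
    return True


def effective_flags(flags: dict) -> dict:
    """Fail to a valid state: if the combination is invalid,
    disable the dependent flag instead of crashing or misbehaving."""
    if not flags_consistent(flags):
        flags = {**flags, "new_checkout_ui": False}
    return flags
```

The point is that callers never see the raw flag pair; they only see the sanitized combination, so the invalid state is unrepresentable downstream.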

> combinatorial number

My experience is that the marketing department think of the complexity as 1 + 1 + 1 + 1 + 1 + 1 + 1 = 7 versions, when it really is 2^7 = 128.

It might be better estimated by the binomial coefficient C(7,2) = 21, since each patch has to be backported to every other version. Still kind of exponential.

More quadratic than exponential
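The arithmetic in this exchange checks out: seven independent flags give 2^7 on/off combinations, while pairwise interactions (each patch backported across each pair of versions) grow as C(7,2), which is quadratic in the number of flags, not exponential.

```python
import math

n = 7  # number of independent feature flags
print(2 ** n)           # 128: all on/off combinations (exponential)
print(math.comb(n, 2))  # 21: pairwise interactions (quadratic in n)
```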

In many cases you are better off using a kill switch than a feature flag. This may seem pedantic, but the way your system fails (on vs off) can protect you from disaster when your flag setting framework has a bug.

On a large codebase it is easy to forget to clean these things up, and a flag that hasn’t been set to off in a year can be masking a major regression. At my last job we had two major outages in as many years from defunct flags defaulting to “off” when the feature flag system failed to return flag states.

Failing to “on” is a simple design choice to protect you from your tech debt. There are better, more expensive fixes (e.g. automated enforcement of removing flags from the codebase), but none as easy to implement.
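A minimal sketch of a fail-open flag lookup, assuming a hypothetical `flag_service` client with a `get(name)` method: any error or missing state falls back to a default of “on”, so a defunct-but-launched feature keeps running when the flag system itself fails.

```python
def is_enabled(flag_service, name: str, default: bool = True) -> bool:
    """Fail open: if the flag service is unreachable or has no record
    of this flag, fall back to the launched ('on') behaviour instead
    of silently reverting to a long-dead code path."""
    try:
        state = flag_service.get(name)  # assumed flag-service API
    except Exception:
        return default
    return default if state is None else state
```

For a true kill switch you would pass `default=False` explicitly; the key is that the default is a per-flag decision, not an accident of the outage mode.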

Another tech-debt thing to watch out for is reusing old flags left over from before your time [0].

[0] - https://en.wikipedia.org/wiki/Knight_Capital_Group#2012_stoc...

I always push back or ignore people who complain about how you shouldn’t have to “burn in” new code. That if it’s not ready to go you shouldn’t commit the change at all.

Knight Capital would likely not have lost hundreds of millions of dollars if someone had deleted that flag months before they needed the slot again. Someone in that time would have likely noticed the extra server that was running really old code.

Flags defaulted to off on first deploy are, IMO, just another form of refactoring. Prove that you can do nothing before you prove you can do something.

To be clear, I agree that it often makes sense for the code to be “off” on initial deploy. By “default”, I mean what does the code do in absence of feature flag data?

Put another way, my specific concern is “what happens when the flag system fails?” If you accidentally drop your feature flag database, do all your features turn off, or do they all turn on?

Superficially it might seem safer for them all to turn off in this failure mode. After all, flags are for experiments, and what’s wrong with disabling experiments? The problem is that companies I have worked for (and those of friends I talk to) do not purge 100% of flags corresponding to launched features. Once code has been built on top of these unpurged flags, bugs are almost guaranteed if they get turned off via system error.

On the system I’m most familiar with, the system won’t start. But as it’s sourcing its data from a Raft-protocol data store, that server wasn’t going to start anyway. The source of that data could fall behind, though.

So yes, you do want to generally be careful and occasionally use an antonym of the most obvious flag to guarantee that true is on and all other states mean off.

I'm not sure I agree completely with this. Defaulting to "on" and only rarely going to "off" would, at least in the teams I've worked, result in the "off" path being much less tested and more likely to contain regressions.

If, then, the "off" state is meant as an emergency oops-this-new-code-didnt-work-well-at-all-lets-revert-to-a-safe-state then having it be the less tested state seems like a problem waiting to happen.

In other words, by defaulting to "on", you get two states that are likely to contain bugs: the on state, because it contains new code, and the off state, because it's the less rigorously exercised configuration.

The nice thing about defaulting to off is that in that case, the off state will contain both old, known code, and be rigorously exercised. The on state will be the brittle one, but it would be anyway on account of running the new code.

I guess we can both agree defaults may seem pedantic but matter a lot.

In my experience the “off” path is only well tested the day the feature is released. Once the feature is out in the world the old tests go stale, and manual testing in effect never happens. It gets worse when a new ungated feature is built on top of something that was behind a flag (but always set to on). Suddenly code that was expected to always run doesn’t, and all kind of expectations get broken.

I’m most concerned about “What happens when your flag manager fails to return data?” This should be a rare event. In my previous organization (~500 devs) it happened twice, and was catastrophic for the ~4 hours it took to get everything fixed. If instead some unpolished features had leaked for four hours, no one would have cared.

If your organization has excellent hygiene at cleaning up flags, then off may be the sensible default, since you won’t have 10 years of cruft blowing up at once. But building a culture with that kind of discipline is hard, and automating it is non-trivial/often difficult to get developer resources for. Defaulting to “on” can be a pragmatic choice when you can’t invest in paying off all the tech debt.

That might very well be true. I'm fortunate enough to work in an organisation that takes tech debt somewhat seriously and we never jam a flag "on". If we're at the point where we're ready to turn it on for everyone, we just remove it from the code.

It's either on for a subset of users, or it's not in there at all.

Usually you remove flags after they have been on for some time. I find that in an “every few weeks” release cycle it works well to remove the flags of the last release (which by then have been live for a few weeks).

Feature flags are awesome: beyond the obvious A/B testing and kill switches, they allow us to merge code constantly and toggle it on when it is ready (as the author suggests).

However, don't forget to do some house-cleaning from time to time. Experiments end (successfully or not), features get permanently rolled out or killed, and that code will become debt very fast unless you regularly clean up your flags and all the code related to them.

I've seen a lot of people mention the advantage of feature flags over feature branches. I've never worked with feature flags on any reasonable scale[0], and I can't help but wonder how they can possibly work while keeping your code well-organised.

A lot of features touch a lot of different parts of the code. Some features require parts of the code to be refactored, or it will turn into a mess of warty exceptions and if-statements all over the place, which to me sounds like it would make your code hard to read, hard to maintain, and hard to test.

So how does this work? Are there good frameworks that abstract this mess away in some easily readable and maintainable manner? Do you postpone refactoring the code until after a successful launch? But then how do you launch the refactored code? No matter how I think about feature flags, I can't help but conclude that it would turn your code into a hard-to-maintain mess.

So how would this work? How do you keep your code clean? I'm sure someone here can explain this or point me towards a good explanation of how to actually implement feature flags the correct way.

[0] In my current project we've got a login system that needs to behave differently in different environments, controlled by environment variables, and it's by far the ugliest part of our code. I have no idea what's going on there, which makes it very hard to address bugs.

See: BranchByAbstraction.com

And: TrunkBasedDevelopment.com

Essentially you go up in the code path until you find a single place where you can introduce an abstraction, and make that the single toggle.

It works brilliantly. It is more overall effort. But it keeps the velocity higher as well because you no longer have head of line blocking.
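A minimal branch-by-abstraction sketch (the `Pricing` names and the `new_pricing` flag are hypothetical): all flag checking is funneled through one factory at a single point in the code path, so the rest of the codebase depends only on the abstraction and never sees the flag.

```python
from abc import ABC, abstractmethod


class Pricing(ABC):
    """The single abstraction that hides old vs. new behaviour."""
    @abstractmethod
    def price(self, item: dict) -> float: ...


class LegacyPricing(Pricing):
    def price(self, item: dict) -> float:
        return item["base"]


class NewPricing(Pricing):
    def price(self, item: dict) -> float:
        return item["base"] * 0.9  # hypothetical new rule behind the flag


def make_pricing(flags: dict) -> Pricing:
    # The only place in the codebase where the flag is consulted.
    return NewPricing() if flags.get("new_pricing") else LegacyPricing()
```

Removing the flag later means deleting one implementation and the conditional in the factory, rather than hunting if-statements across the codebase.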

You don't need to apply the flags to every part of the application where the feature is implemented, only in the user-facing code (so the UI and the (exposed, documented) API). The rest can just be left alone and developed (in trunk/master). The main thing is not to expose a feature until it's done, which still allows you to do all the implementation, refactoring, testing, etc.

There are multiple ways you can do this. Yes, a lot of if statements are messy.

What we did, and what worked very well for us, was to develop whole modules until all the features were ready, and only at the end “wire in” the module, which meant an "application.register(new_module)" call or something like that. This needs a very well-organized and modular code base.

I'm going to respond point-by-point, and you'll tell me if that helps at all.

> I can't help but wonder how they can possibly work while keeping your code well-organised.

They don't keep your code well-organised. They are an intentional complexity debt you pay for their other benefits.

> A lot of features touch a lot of different parts of the code.

This is a sign of high coupling and/or low cohesion. It's a code smell. The ideal is to first refactor these features and make them loosely coupled and highly cohesive. Then they will have only one or very few connection points with the rest of the code.

Feature flags are really hard to use in bad code (and in my experience, most code is bad until the first feature flag that touches that part of the code.) So step one when feature flagging is to refactor and improve the code to the point where there's a logical place to switch between the behaviours.

> Some features require parts of the code to be refactored

Yes. You don't need feature flags to refactor. Refactoring does not change any external behaviour, so you can safely do that guided by your compiler and test suite.

> Are there good frameworks that abstract this mess a way in some easily readable and maintainable manner?

There are frameworks that help with this, but they aren't really necessary. In essence, a feature flag is a single if/else branch somewhere, that plugs in either this feature or that feature (or no feature at all.)

> Do you postpone refactoring the code until after a successful launch?

Opposite! You frequently have to do the refactoring first, because the code was not written to be extendable/subsettable – which is a requirement for feature flagging to be really useful.

(In a way, feature flags are like tests in that both force you to write better code that is more loosely coupled and more cohesive.)

> But then how do you launch the refactored code?

I'm probably sounding like a broken record now, but if it's pure refactoring, just ship it. (Assuming it passes the build and tests.)

> No matter how I think about feature flags, I can't help but conclude that it would turn your code into a hard-to-maintain mess.

Feature flags are meant to be temporary, so we excuse whatever additional maintenance cost they come with. I'm more worried about maintenance a year down the line than I am for the next few weeks or months.

If you're using feature flags, you sort of have to first refactor your codebase to a better state where the feature you want to toggle is extendable/subsettable, and a year down the line when the feature flag is removed – you're still getting the maintenance improvement from the refactoring you had to do.

So in that sense, feature flags lead to even better code in the long run, because you can't "cheat" and just swap some code out for different code. You have to actually make the code properly designed and architected first.

> Death by Flags

This is something I have spoken[1] (slides[2]) and written about quite a lot - and my solution has usually been to add a monitoring system to any feature flag system used.

If a flag hasn't changed state in a period of time, or hasn't been queried in a period of time, then an issue is filed against the relevant repositories. The time period is different based on teams and services themselves, and there are also exclusions for flags which should be kept around.
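A sketch of that staleness check (function name, the 90-day threshold, and the flag names are illustrative): given each flag's last state-change or query time, return the flags worth filing an issue for, skipping the explicit exclusions.

```python
from datetime import datetime, timedelta


def stale_flags(flags: dict, now: datetime,
                max_age_days: int = 90,
                exclusions: set = frozenset()) -> list:
    """flags maps flag name -> datetime of last state change or query.
    Returns the names that should get an issue filed against their repo."""
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, last_used in flags.items()
            if name not in exclusions and last_used < cutoff]
```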

I'd like to improve the system to do something like an automatic PR to remove a flag, but at this point, it seems more effort than it's worth.

[1]: https://www.youtube.com/watch?v=LZgQBSr36p8

[2]: https://andydote.co.uk/presentations/index.html?feature-togg...

A built-in feature on LaunchDarkly.com

They are extremely awesome for the price.

> Sure, and we should also not allow our tech debt to accumulate and we should follow every single best-practice religiously. Unfortunately, this never happens in any corporate environment. Even in great teams, tech debt often gets de-prioritized in the face of new requests. Newcomers to the team or those on their way out, aren’t always disciplined enough to clean up their flags after a successful rollout. And sometimes, these tasks simply slip through the cracks and get forgotten.

Yeah, you can make the same argument about any practice. If the team doesn't care, you cannot fix anything by doing or not doing any practice. A good team can suffer from a bad practice, but no practice can overcome a bad team.

Clean up your feature flags and definitely don't use a framework for it unless you are doing continuous A/B testing.

> the recommendation at places like Google is to rollback first and investigate the problem later

I always have the same question regarding this advice: how are you supposed to handle data corruption?

When a buggy new version creates lasting problems in the persistent data, reverting to the last known good version may not fix the problem, as the correct code now has to deal with incorrect data, which leads to different, possibly incorrect behavior compared to before the buggy deployment.

Reverting the data itself is often not possible as you usually can’t say to your customers « woops, we deleted all of your bank transactions / incoming emails for the last two weeks because of a rollback on our side ».

So in my experience you still have to roll forward in a lot of cases, to fix both the code and data together.

How do you guys handle that in practice?

A few things: you usually roll back quickly. After minutes, not weeks.

Defense in depth. Backups, architecture that supports replay, etc.

Dogfooding. No one will complain if you're the person who wrote the code that lost your emails.

Make changes backwards compatible when possible. Add a new column. Wait a while. Write data to the column. You can rollback cleanly without issue at each step. Naturally backwards compatible schemas (protos) help with this.

Dual writing. Similar to the above, if you're replacing a with b, dual write a and b for a while. Validate everything. Start relying only on b, but keep writing a. Eventually stop writing a after a while. At every point, you can roll back to the previous state without issue, so if you can validate everything functions at each interim state, you're always able to roll back.
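A dual-write sketch along those lines (the class and store names are illustrative): every write goes to both the old and new stores, reads come from whichever is currently authoritative, and because the old store keeps receiving writes in every phase, flipping `read_from_new` back is the entire rollback.

```python
class DualWriteStore:
    """Migrate from an old store 'a' to a new store 'b' in
    individually rollback-safe phases."""

    def __init__(self, old: dict, new: dict, read_from_new: bool = False):
        self.old = old
        self.new = new
        self.read_from_new = read_from_new

    def write(self, key, value):
        self.old[key] = value  # keep writing 'a' so rollback stays clean
        self.new[key] = value

    def read(self, key):
        return (self.new if self.read_from_new else self.old)[key]
```

Only after validating the new store over a full cycle would you stop writing to the old one, at which point rollback is no longer free and the migration is committed.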

I'm sure there's more I'm forgetting

Thank you for the very relevant advice.

All of these suggestions are certainly feasible but taken together they do amount to a non-negligible cost in infrastructure and development velocity though.

Where I live I've yet to find an employer that can be convinced these costs are worth it.

This is one of those "slow is smooth; smooth is fast" kinds of situations.

Frequent shipping is correlated with high developer performance (for reasons I haven't time to get into, but it boils down to feedback and motivation), and it is only possible when taking tiny steps at a time.

In the end, you save time and money by doing it that way. If your employer does not understand that, it is part of your professional responsibility to try to convince them of it.

This is actually all very good advice and succinctly put at that. If you think of more, please do add it to your comment!

Rollbacks are easy and simple to do, “solve” a large number of problems, and go a long way toward bisecting the problem. They are not expected to solve every problem.

What is different between rollback and manual data fixing or roll forward and manual data fixing? Well, rollback is usually faster and safer.

In some cases you do not have data corruption and can roll back safely. If not, you can probably fix the data faster than you can find the solution for the underlying problem.

Sort of off topic:

I had fun writing a feature flag system for Java/Spring at work. We rolled a lot of our own stuff for tighter integration, and this was no exception. You could define an interface, slap on a special annotation, and Spring would provide an instance backed by a proxy. And since this was a web app, we made the toggles configurable in the main app by admins. The UI for targeted rollouts was pretty simple but powerful. I never got around to adding mandatory cleanup dates, but it wouldn't be very hard. Adding automatic A/B tests would certainly have been a lot harder.

We did find them very nice for fresh implementations of common workflow screens. It was a big app, and invariably some functionality would be missed, so we could just turn off the migrated section until we fixed it. We definitely had to do more QA work, but it shortened the turnaround time and reduced the stress quite a bit because patches weren't as urgent.

The article highlights some great points. But I think feature flags bring value in that we don't need to roll back code if it's failing. Also, failure is not so straightforward in most cases where we enable a feature and it fails. Sometimes it degrades performance over time, sometimes you want to enable it slowly to understand its behavior under production traffic, sometimes you want to enable something around a launch date, etc.

For sure, it brings extra complexity in the code because of leftover flags that are not cleaned up. Not only the complexity added while writing the flag, but later new code is also added for all these conditions to ensure compatibility. Cleanup should be managed like tech debt, but unfortunately it isn't.

I like to call this “a light switch per light bulb” to drive home the point with customers about toggles for everything. If there is not a clear point behind it, it can become a bit obnoxious.

When working on a system that is, or might become, multi-tenant, we tend to add light switches per light bulb.

I think the article omits an important distinction: feature flags that disable some code at compile time versus flags that are read by the program at runtime to decide which behaviour to choose.

I strongly prefer feature flags of the second type, because that way you at least have the guarantee that the program compiles with any combination of feature flags; and therefore you can be somewhat sure that any such combination produces correct behaviour (as much as you can be sure of that in general with your codebase: obviously, results may vary from C to Haskell).

"Software engineering is primarily an exercise in managing complexity."

That is so true. Unfortunately many engineers do not manage complexity but simply increase it until it becomes unmanageable. Then they do a big rewrite and (if the project survived this, often it fails at that point) the cycle starts again.

I am always baffled by how much disorienting crap there is on the Amazon desktop UI. What their A/B tests do not show them is me doing the checkout for family members who just give up.

Funny, I was just doing some feature flag cleanup. The article makes good points, but I think I'll just clean up once a year or so and not worry too much about adding the flags :)

> For example, consider the Facebook Android app, which contains code contributed by hundreds of different teams

You don't say
