Windows (what ships inbox, that is), as of last week (Sep 14), has roughly 2,500 feature flags. Some are permanently jammed into the on position, some into the off position, and the rest are configurable via its experimentation frameworks and by hackers. (Apps are a separate beast: they have their own experimentation tooling, their own flags, etc.)
I don't understand why the jammed-on features still exist in the OS. I'd imagine there will (eventually?) be a measurable impact to leaving all this trash behind. I suspect, like the author noted, it's non-zero risk work that no one wants to complete.
That feature flags let one version of the software run in a combinatorial number of modes is both their superpower and kryptonite. Use them wisely and clean them up as soon as possible.
One problem I've seen happen over and over again is when people pile feature flags on top of feature flags. With the systems I've used, all the flags can be independently enabled by the gating service, but that doesn't mean that every combination of flags is a valid state for the program to run in. If your system will break if flag A is active while flag B is not, then it's worth the effort to write an abstraction that checks both flags and fails to a valid state.
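A minimal sketch of that idea, with hypothetical flag names (`new_checkout_ui` as flag A, `new_checkout_backend` as flag B): instead of checking the flags independently all over the code, one function collapses them into a single valid mode and falls back to the known-good state when the combination is invalid.

```python
from enum import Enum

class CheckoutMode(Enum):
    LEGACY = "legacy"
    NEW = "new"

def checkout_mode(flags: dict) -> CheckoutMode:
    """Collapse two interdependent flags into one valid state.

    Hypothetical setup: 'new_checkout_ui' (flag A) only works when
    'new_checkout_backend' (flag B) is also active.
    """
    ui = flags.get("new_checkout_ui", False)
    backend = flags.get("new_checkout_backend", False)
    if ui and not backend:
        # Invalid combination: fail to a known-good state instead of breaking.
        return CheckoutMode.LEGACY
    return CheckoutMode.NEW if ui and backend else CheckoutMode.LEGACY
```

Callers then branch on the single `CheckoutMode`, so the invalid A-without-B combination simply cannot reach the rest of the program.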
My experience is that the marketing department thinks of the complexity as 1 + 1 + 1 + 1 + 1 + 1 + 1 = 7 versions, when it really is 2^7 = 128.
On a large codebase it is easy to forget to clean these things up, and a flag that hasn’t been set to off in a year can be masking a major regression. At my last job we had two major outages in as many years from defunct flags defaulting to “off” when the feature flag system failed to return flag states.
Failing to “on” is a simple design choice that protects you from your tech debt. There are more expensive, better fixes (e.g. automated enforcement of removing flags from the codebase), but none as easy to implement.
 - https://en.wikipedia.org/wiki/Knight_Capital_Group#2012_stoc...
Knight Capital would likely not have lost hundreds of millions of dollars if someone had deleted that flag months before they needed the slot again. Someone in that time would have likely noticed the extra server that was running really old code.
Flags defaulted to off on first deploy are, IMO, just another form of refactoring. Prove that you can do nothing before you prove you can do something.
Put another way, my specific concern is “what happens when the flag system fails?” If you accidentally drop your feature flag database, do all your features turn off, or do they all turn on?
Superficially it might seem safer for them all to turn off in this failure mode. After all, flags are for experiments, and what’s wrong with disabling experiments? The problem is that companies I have worked for (and those of friends I talk to) do not purge 100% of flags corresponding to launched features. Once code has been built on top of these unpurged flags, bugs are almost guaranteed if they get turned off via system error.
So yes, you do generally want to be careful, and occasionally use an antonym of the most obvious flag name, so that only an explicit true toggles the feature off and every other state (false, missing, error) leaves it on.
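A tiny sketch of the naming trick, with made-up flag names: when the flag store fails and returns nothing, a flag read comes back false, so an "enable_x" flag fails off while its antonym "disable_x" fails on.

```python
def lookup(flags: dict, name: str) -> bool:
    # Simulate a flag service: missing data (outage, dropped DB) reads as False.
    return flags.get(name, False)

# Obvious naming: "enable_search" fails OFF when the flag store is down.
def search_enabled(flags: dict) -> bool:
    return lookup(flags, "enable_search")

# Antonym naming: "disable_search" fails ON when the flag store is down,
# so a flag-system outage leaves the launched feature running.
def search_enabled_failopen(flags: dict) -> bool:
    return not lookup(flags, "disable_search")
```

Same feature, same service, but the antonym version rides out a flag-manager outage with the feature still on.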
If, then, the "off" state is meant as an emergency oops-this-new-code-didnt-work-well-at-all-lets-revert-to-a-safe-state then having it be the less tested state seems like a problem waiting to happen.
In other words, by defaulting to "on", you get two states that are likely to contain bugs: the on state, because it contains new code, and the off state, because it's the less rigorously exercised configuration.
The nice thing about defaulting to off is that in that case, the off state will contain both old, known code, and be rigorously exercised. The on state will be the brittle one, but it would be anyway on account of running the new code.
I guess we can both agree defaults may seem pedantic but matter a lot.
I’m most concerned about “What happens when your flag manager fails to return data?” This should be a rare event. In my previous organization (~500 devs) it happened twice, and it was catastrophic for the ~4 hours it took to get everything fixed. If instead some unpolished features had leaked for four hours, no one would have cared.
If your organization has excellent hygiene at cleaning up flags, then off may be the sensible default, since you won’t have 10 years of cruft blowing up at once. But building a culture with that kind of discipline is hard, and automating it is non-trivial/often difficult to get developer resources for. Defaulting to “on” can be a pragmatic choice when you can’t invest in paying off all the tech debt.
It's either on for a subset of users, or it's not in there at all.
However, don't forget to do some house-cleaning from time to time. Experiments end (successfully or not), features get permanently rolled out or killed, and that code will become debt very fast unless you regularly clean up your flags and all the code related to them.
A lot of features touch a lot of different parts of the code. Some features require parts of the code to be refactored, or it will turn into a mess of warty exceptions and if-statements all over the place, which to me sounds like it would make your code hard to read, hard to maintain, and hard to test.
So how does this work? Are there good frameworks that abstract this mess away in some easily readable and maintainable manner? Do you postpone refactoring the code until after a successful launch? But then how do you launch the refactored code? No matter how I think about feature flags, I can't help but conclude that they would turn your code into a hard-to-maintain mess.
So how would this work? How do you keep your code clean? I'm sure someone here can explain this or point me towards a good explanation of how to actually implement feature flags the correct way.
 In my current project we've got a login system that needs to behave differently in different environments, controlled by environment variables, and it's by far the ugliest part of our code. I have no idea what's going on there, which makes it very hard to address bugs.
Essentially you go up in the code path until you find a single place where you can introduce an abstraction, and make that the single toggle.
It works brilliantly. It is more overall effort, but it also keeps velocity higher because you no longer have head-of-line blocking.
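A sketch of "go up the code path until you find a single place for the abstraction", with illustrative names: instead of sprinkling `if flag:` checks through rendering, pricing, and persistence code, pick one seam high in the call path and branch exactly once.

```python
# Everything below the single toggle in process_order() is flag-free.
# The pipeline names and the discount logic are made up for illustration.

def legacy_pipeline(order: dict) -> dict:
    return {"total": order["price"], "pipeline": "legacy"}

def new_pipeline(order: dict) -> dict:
    # Hypothetical new behaviour, e.g. revised discount logic.
    return {"total": order["price"] - 10, "pipeline": "new"}

def process_order(order: dict, flags: dict) -> dict:
    # The single toggle: one if/else at a high-level seam.
    pipeline = new_pipeline if flags.get("new_pricing") else legacy_pipeline
    return pipeline(order)
```

When the flag is retired, you delete one line and one function, rather than hunting down scattered conditionals.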
What we did, and what worked very well for us, was to develop whole modules until all the features were ready, and only at the end "wire in" the module, which meant an "application.register(new_module)" call or something like that. This needs a very well-organized and modular code base.
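A minimal sketch of that wiring pattern (all names hypothetical): the module's code is merged and shipped the whole time, but it stays dark until the single `register()` call lands.

```python
# Hypothetical minimal version of "develop the whole module, wire it in at
# the end": until the one register() line is committed, the module is dead code.

class Application:
    def __init__(self):
        self.routes = {}

    def register(self, module):
        self.routes.update(module.routes())

class ReportsModule:
    def routes(self):
        return {"/reports": lambda: "reports page"}

app = Application()
# This one line is the entire "launch"; reverting it is a one-line rollback.
app.register(ReportsModule())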
> I can't help but wonder how they can possibly work while keeping your code well-organised.
They don't keep your code well-organised. They are an intentional complexity debt you pay for their other benefits.
> A lot of features touch a lot of different parts of the code.
This is a sign of high coupling and/or low cohesion. It's a code smell. The ideal is to first refactor these features and make them loosely coupled and highly cohesive. Then they will have only one or very few connection points with the rest of the code.
Feature flags are really hard to use in bad code (and in my experience, most code is bad until the first feature flag that touches that part of the code.) So step one when feature flagging is to refactor and improve the code to the point where there's a logical place to switch between the behaviours.
> Some features require parts of the code to be refactored
Yes. You don't need feature flags to refactor. Refactoring does not change any external behaviour, so you can safely do that guided by your compiler and test suite.
> Are there good frameworks that abstract this mess away in some easily readable and maintainable manner?
There are frameworks that help with this, but they aren't really necessary. In essence, a feature flag is a single if/else branch somewhere, that plugs in either this feature or that feature (or no feature at all.)
> Do you postpone refactoring the code until after a successful launch?
Opposite! You frequently have to do the refactoring first, because the code was not written to be extendable/subsettable – which is a requirement for feature flagging to be really useful.
(In a way, feature flags are like tests in that both force you to write better code that is more loosely coupled and more cohesive.)
> But then how do you launch the refactored code?
I'm probably sounding like a broken record now, but if it's pure refactoring, just ship it. (Assuming it passes the build and tests.)
> No matter how I think about feature flags, I can't help but conclude that it would turn your code into a hard-to-maintain mess.
Feature flags are meant to be temporary, so we excuse whatever additional maintenance cost they come with. I'm more worried about maintenance a year down the line than I am for the next few weeks or months.
If you're using feature flags, you sort of have to first refactor your codebase to a better state where the feature you want to toggle is extendable/subsettable, and a year down the line when the feature flag is removed – you're still getting the maintenance improvement from the refactoring you had to do.
So in that sense, feature flags lead to even better code in the long run, because you can't "cheat" and just swap some code out for different code. You have to actually make the code properly designed and architected first.
This is something I have spoken (slides) and written about quite a lot - and my solution has usually been to add a monitoring system to any feature flag system used.
If a flag hasn't changed state in a period of time, or hasn't been queried in a period of time, then an issue is filed against the relevant repositories. The time period is different based on teams and services themselves, and there are also exclusions for flags which should be kept around.
I'd like to improve the system to do something like an automatic PR to remove a flag, but at this point, it seems more effort than it's worth.
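The monitoring rule above can be sketched in a few lines. This is an illustrative shape, not the real system's API: assume each flag records when it last changed state and when it was last queried, with a per-team idle threshold and an exclusion list for flags that should be kept around.

```python
from datetime import datetime, timedelta

def stale_flags(flag_stats, now, max_idle=timedelta(days=90),
                exclusions=frozenset()):
    """Return flag names that should get an issue filed against their repo.

    flag_stats maps name -> (last_state_change, last_queried). The 90-day
    threshold and the data shape are assumptions for illustration.
    """
    stale = []
    for name, (last_change, last_query) in flag_stats.items():
        if name in exclusions:
            continue  # long-lived kill switches and similar exceptions
        if now - last_change > max_idle or now - last_query > max_idle:
            stale.append(name)
    return stale
```

A cron job over this list that opens one issue per flag gets you most of the benefit; the auto-PR step is the genuinely hard part.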
They are extremely awesome for the price.
Yeah, you can make the same argument about any practice. If the team doesn't care, you cannot fix anything by doing or not doing any practice. A good team can suffer from a bad practice, but no practice can overcome a bad team.
Clean up your feature flags and definitely don't use a framework for it unless you are doing continuous A/B testing.
I always have the same question regarding this advice: how are you supposed to handle data corruption?
When a buggy new version creates lasting problems in the persistent data, reverting to the last known good version may not fix the problem, as the correct code now has to deal with incorrect data, which leads to different, possibly incorrect behaviour compared to before the buggy deployment.
Reverting the data itself is often not possible as you usually can’t say to your customers « woops, we deleted all of your bank transactions / incoming emails for the last two weeks because of a rollback on our side ».
So in my experience you still have to roll forward in a lot of cases, to fix both the code and data together.
How do you guys handle that in practice?
Defense in depth. Backups, an architecture that supports replay, etc.
Dogfooding. No one will complain if you're the person who wrote the code that lost your emails.
Make changes backwards compatible when possible. Add a new column. Wait a while. Write data to the column. You can rollback cleanly without issue at each step. Naturally backwards compatible schemas (protos) help with this.
Dual writing. Similar to the above, if you're replacing a with b, dual write a and b for a while. Validate everything. Start relying only on b, but keep writing a. Eventually stop writing a after a while. At every point, you can roll back to the previous state without issue, so if you can validate everything functions at each interim state, you're always able to roll back.
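The dual-write migration above can be sketched as a wrapper over the old store `a` and the new store `b` (names and phase labels are made up here). Each phase is individually rollback-safe: at every step, the previous phase still has all the data it needs.

```python
# Illustrative dual-write wrapper for replacing store A with store B.
# Phase order: a_only -> dual_read_a -> dual_read_b -> b_only.

class DualWriteStore:
    PHASES = ("a_only", "dual_read_a", "dual_read_b", "b_only")

    def __init__(self, a, b, phase="a_only"):
        assert phase in self.PHASES
        self.a, self.b, self.phase = a, b, phase

    def write(self, key, value):
        if self.phase != "b_only":
            self.a[key] = value   # keep writing A until B is fully trusted
        if self.phase != "a_only":
            self.b[key] = value   # start writing B before anyone reads it

    def read(self, key):
        primary = self.a if self.phase in ("a_only", "dual_read_a") else self.b
        return primary.get(key)
```

You advance one phase at a time, validate in each interim state, and any phase can be stepped back without data loss because the previous primary is still being written.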
I'm sure there's more I'm forgetting
All of these suggestions are certainly feasible, but taken together they amount to a non-negligible cost in infrastructure and development velocity.
Where I live I've yet to find an employer that can be convinced these costs are worth it.
Frequent shipments are correlated with high developer performance (for reasons I haven't time to get into, but it boils down to feedback and motivation). Frequent shipments are only possible when taking tiny steps at a time.
In the end, you save time and money by doing it that way. If your employer does not understand that, it is part of your professional responsibility to try to convince them of it.
In some cases you do not have data corruption and can roll back safely. If not, it is probably quicker to fix the data than to find the solution for the underlying problem.
I had fun writing a feature flag system for Java/Spring at work. We rolled a lot of our own stuff for tighter integration, and this was no exception. You could define an interface, slap on a special annotation, and Spring would provide an instance backed by a proxy. This was for a web app, so we made the toggles configurable in the main app by admins. The UI for targeted rollouts was pretty simple but powerful. I never got around to adding mandatory cleanup dates, but it wouldn't be very hard. Adding automatic A/B tests would certainly have been a lot harder.
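A rough Python analogue of that annotation-backed-proxy idea (all names invented for illustration): the caller holds a proxy, and each call is routed to whichever implementation the toggle store currently selects, so admins can flip it at runtime.

```python
# Toggle store stands in for the admin-editable config in the real system.
TOGGLES = {"payment_impl": "legacy"}

def feature_toggled(name, implementations):
    """Return a proxy that resolves the active implementation on every call."""
    class Proxy:
        def __getattr__(self, attr):
            impl = implementations[TOGGLES[name]]
            return getattr(impl, attr)
    return Proxy()

class LegacyPayments:
    def charge(self, cents):
        return f"legacy charged {cents}"

class NewPayments:
    def charge(self, cents):
        return f"new charged {cents}"

payments = feature_toggled("payment_impl",
                           {"legacy": LegacyPayments(), "new": NewPayments()})
```

Because the proxy resolves the implementation per call rather than at construction time, flipping `TOGGLES["payment_impl"]` takes effect immediately, which is what makes admin-controlled rollouts work.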
We did find them very nice for fresh implementations of common workflow screens. It was a big app, and invariably some functionality would be missed, so we could just turn off the migrated section until we fixed it. We definitely had to do more QA work, but it shortened the turnaround time and reduced the stress quite a bit because patches weren't as urgent.
For sure, it brings extra complexity to the code because of leftover flags that are not cleaned up. Not only the complexity added while writing the flag, but later new code is also added around all these conditions to ensure compatibility. Cleanup should be managed like tech debt, but unfortunately it's not.
When working on a system that is, or might become, multi-tenant, we tend to add light switches per light bulb.
I strongly prefer feature flags of the second type, because that way you at least have the guarantee that the program compiles with any combination of feature flags; and therefore you can be somewhat sure that any such combination produces correct behaviour (as much as you can be sure of that in general with your codebase: obviously, results may vary from C to Haskell).
That is so true. Unfortunately many engineers do not manage complexity but simply increase it until it becomes unmanageable. Then they do a big rewrite (if the project survives this; often it fails at that point) and the cycle starts again.
You don't say