I'll highlight something I've learned from both succeeding and failing at this: when rewriting something, you should generally strive for a drop-in replacement that does the same thing, in some cases even matching it bug-for-bug, or, as in the article, taking a very close look at the new bugs versus the old ones.
It's tempting to throw away the old thing and write a brand new bright shiny thing with a new API and new data models and generally NEW ALL THE THINGS!, but that is a high-risk approach that usually comes without correspondingly high payoffs. The closer you can get to a drop-in replacement, the happier you will be. You can then separate the risk of the deployment itself from the risk of the new shiny features/bug fixes you want to deploy, and since risks tend to multiply rather than add, anything you can do to cut a risk into two halves is almost always a big win, even if the "total risk" is in some sense still the same.
Took me a lot of years to learn this. (Currently paying for the fact that I just sorta failed to do a correct drop-in replacement, because I was drop-in replacing a system with no test coverage, no official semantics, and not even necessarily agreement among all of its consumers about what it was and how it worked, let alone how it should work.)
This is probably very context dependent, because I've learned the opposite.
For example, I was rewriting/consolidating a corner of the local search logic for Google that was spread throughout multiple servers in the stack. Some of the implementation decisions were clearly made because they were convenient to make in a particular server. But when consolidating the code into a single server, the data structures and partial results available were not the same, so reproducing the exact same logic and behavior would have been hard. Realizing which parts of the initial implementation were there out of convenience, and which were there for product concerns, let me implement something much simpler that still satisfied the product demands, even if the output was not bitwise identical.
I didn't read the parent comment as advocating reproducing the exact same logic perfectly, but more as defining the interface between the external code and the part being replaced, and matching that interface closely with the replacement.
This isn't always possible but seems like a reasonable objective given my experience.
There's a simpler method than this that provides even more surety, used by e.g. LibreSSL:
1. Start writing your new implementation (or heavily refactoring your old implementation, whichever), but in parallel, for each legacy function you remove, write an equivalent "legacy wrapper" function that implements the old API (and ABI; you have to return the same structs and all) in terms of the new API. (There's a sketch of this after the list.)
2. As you develop the new code, continue to run the old code's tests. (This shouldn't require any work; as far as the tests can tell, the codebase containing all of {the new code, what's left of the old code, and the legacy wrapper} presents exactly the same ABI as the old codebase.) The old tests should still all pass, at every step.
3. Once you're finished developing the new code, and all the old code's tests are passing, rewrite the tests in terms of the new API.
4. Split off all the legacy-wrapper code into a new, second library project; give it the new "core" library as a dependency. Copy all the old tests—from a commit before you rewrote them—into this project, too. This wrapper library can now be consumed in place of the original legacy library. Keeping this wrapper library up-to-date acts to ensure that your new code remains ABI-compatible; the old tests are now regression-tests on whether a change to the new "core" library breaks the legacy-ABI-wrapper library.
5. Document and release your new core as a separate, new library, and encourage devs to adopt it in place of the legacy library; release the legacy-wrapper (with its new-core dependency) as the next major version of the old library.
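To make steps 1-2 concrete, here is a minimal C sketch of the wrapper pattern. Every name in it (newcore_*, LEGACY_*) is invented for illustration; this is not LibreSSL's actual API, just the shape of the idea:

    /* Hypothetical sketch: a "new core" API, plus a legacy wrapper that
     * re-implements the old entry points purely in terms of the new core.
     * All names here are made up for illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    /* ---- new core ---- */
    typedef struct { char host[256]; int port; int open; } newcore_conn;

    newcore_conn *newcore_open(const char *host, int port) {
        newcore_conn *c = calloc(1, sizeof *c);
        if (!c) return NULL;
        snprintf(c->host, sizeof c->host, "%s", host);
        c->port = port;
        c->open = 1;                     /* pretend we connected */
        return c;
    }
    void newcore_close(newcore_conn *c) { free(c); }

    /* ---- legacy wrapper: the old entry points, delegating to the core ---- */
    typedef struct { newcore_conn *impl; } LEGACY_CTX;

    LEGACY_CTX *LEGACY_connect(const char *host, int port) {
        LEGACY_CTX *ctx = calloc(1, sizeof *ctx);
        if (!ctx) return NULL;
        ctx->impl = newcore_open(host, port);
        if (!ctx->impl) { free(ctx); return NULL; }
        return ctx;
    }
    int LEGACY_is_connected(const LEGACY_CTX *ctx) {
        return ctx && ctx->impl && ctx->impl->open;
    }
    void LEGACY_free(LEGACY_CTX *ctx) {
        if (ctx) { newcore_close(ctx->impl); free(ctx); }
    }

    /* An "old test" keeps passing untouched, because all it can see is
     * the legacy API. */
    int main(void) {
        LEGACY_CTX *ctx = LEGACY_connect("example.com", 443);
        if (!LEGACY_is_connected(ctx)) { fprintf(stderr, "FAIL\n"); return 1; }
        LEGACY_free(ctx);
        puts("legacy test still passes");
        return 0;
    }

The point is that the wrapper is thin and mechanical: once it exists, the old test suite exercises the new core for free, and later it can be split out into its own library (step 4) without touching the core.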
When all-or-nearly-all downstream devs have transitioned from the legacy wrapper to the new core, you can stop supporting/updating the legacy wrapper and stop worrying about your updates breaking it. You're free!
In LibreSSL, if you're wondering, the "new core" from above is called libtls, and the "legacy wrapper" from above is called libssl—which is, of course, the same linker name as OpenSSL's library, with a new major version.
This is a textbook example of 'How to Kill an OSS project'
#1, #4, and #5 are completely unnecessary.
If it's desirable to 'kill with fire' the old codebase, you could always create a fork and merge the breaking changes on the next major release.
Creating a second library is a terrible idea for 2 reasons...
You lose the legacy source control history which is (arguably) more valuable than the current source because it can be used to research solutions to old problems.
You split the community, which is devastating to the culture and survivability of an OSS project. Even something as simple as a name change will have massive negative impacts on the level of contributions. Only the most popular and actively developed projects can get away with forking the community.
LibreSSL will likely survive because everything in the world that requires crypto uses OpenSSL. Even then, that was the absolutely wrong way to go about things.
The only solid justification for a rename and complete rewrite is if there are license/copyright issues.
> You lose the legacy source control history which is (arguably) more valuable than the current source because it can be used to research solutions to old problems.
No reason for that. Both projects—the wrapper and the new core—can branch off the original. Create a commit that removes one half of the files on one side, and the other half of the files on the other, and make each new commit "master" of its repo, and now you've got two projects with one ancestor. Project mitosis.
> You split the community
How so? I'm presuming a scenario here where either 1. you were the sole maintainer for the old code, and it's become such a Big Ball of Mud that nothing's getting done; or 2. the maintainer of the old code is someone else who is really, really bad at their job, and you're "forking to replace" with community buy-in (OpenSSL, Node.js leading to the io.js fork, gcc circa 1997 with the egcs fork, MySQL leading to MariaDB, etc.).
In both scenarios, development of the old code has already basically slowed to a halt. There is no active community contributing to it; or if there is, it is with great disgust and trepidation, mostly just engineers at large companies that have to fix upstream bugs to get their own code working (i.e. "I'm doing it because they pay me.") There are a lot of privately-maintained forks internal to companies, too, sharing around patches the upstream just won't accept for some reason. The ecosystem around the project is unhealthy†.
When you release the new legacy wrapper, it replaces the old library—the legacy wrapper is now the only supported "release" of the old library. It's there as a stopgap for Enterprise Clients with effectively-dead projects which have ossified around the old library's ABI, so these projects can continue to be kept current with security updates et al. It's not there for anyone to choose as a target for their new project! No new features will ever be added to the wrapper. It's a permanent Long-Term Support release, with (elegant, automatic) backporting of security updates, and that's it. Nobody starting a project would decide to build on it any more than they'd build on e.g. Apache 1.3, or Ubuntu 12.04.
> Even something as simple as a name change will have massive negative impacts on the level of contributions.
Names are IP, obviously (so if you're a third party, you have to rename the project), but they're more than that—names are associated in our brains with reflexes and conventions for how we build things.
The reason Perl 6, Python 3, etc. have so much trouble with adoption is that people come into them expecting to be able to reuse the muscle memory of the Perl 5/Python 2 APIs. They'd have been much better off marketed as completely new languages that happened to be package-ecosystem-compatible with the previous language, like Elixir is to Erlang or Clojure is to Java.
If these releases were accompanied by their creators saying "Python/Perl is dead, long live _____!" then there'd have been a much more dramatic switchover to the new APIs. Managers understand "the upstream is dead and we have to switch" much more easily than they understand "the upstream has a new somewhat-incompatible major version with some great benefits."
One good example: there's a reason Swift wasn't released as "Objective-C 3.0". As it is, ObjC is "obviously dead" (even though Apple hasn't said anything to that effect!) and Swift is "the thing everyone will be using from here on, so we'd better move over to it." In a parallel reality, we'd have this very slow shift from ObjC2 to ObjC3 that would never fully complete.
---
† If the ecosystem were healthy, obviously you don't need the legacy wrapper. As you say, just release the new library as the new major version of the old library—or call the new library "foo2", as many projects have done—and tell people to switch, and they will.
It's easy to find healthy projects like this when you live close enough to the cutting-edge that all your downstream consumers are still in active development, possibly pre-1.0 development. The Node, Elixir, Go and Rust communities look a lot like this right now; any project can just "restart" and that doesn't trouble anybody. Everyone rewrites bits and pieces of their code all the time to track their upstreams' fresh-new-hotness APIs. That's a lot of what people mean when they talk about using a "hip new language": the fact that they won't have to deal with stupid APIs for very long, because stupid APIs get replaced.
But imagine trying to do the same thing to, say, C#, or Java, or any other language with Enterprise barnacles. Imagine trying to tell people consuming Java's DateTime library that "the version of DateTime in Java 9 is now JodaTime, and everyone has to rewrite their date-handling code to use the JodaTime API." While the end results would probably have 10x fewer bugs, because JodaTime is an excellent API whose UX makes the pertinent questions obvious and gives devs the right intuitions... a rewrite like that just ain't gonna happen. Java 9 needs a DateTime that looks and acts like DateTime.
> ...that is a high-risk approach that is usually without correspondingly high payoffs
That's probably true, but it's also true that over a long enough timescale (100 years, to trigger a reductio ad absurdum) there is a very high risk that not replacing or rewriting that code will sink your technology and possibly your organization.
Just because the risk will be realized in the long run doesn't mean it's not a risk. And if the worst-case scenario is death of the entire organization, then the math could very well add up to a full rewrite. Most business managers are not prepared to think strategically about long-term technical debt. It's the duty of engineers to let them know the difference between "not now" and "never". And the difference between "urgent" and "low priority".
That means even if your new pretty-shiny is eventually going to have a new API, it is usually worth it to also offer a shim with the old interface. It may seem like extra, useless work, but it reduces risk.
Far be it from me to suggest rewrites are never useful. Half my career could be characterized as "rewriting the things nobody thought could be rewritten because they were too messy and entrenched". It's personally a bit risky (you end up owning not only the bugs you created, but also the bugs you failed to reproduce... double whammy! better be good at unit testing so at least the new code is defensible), but the payoff is pretty significant, too, both for your career and for your organization.
Right. I'm saying (poorly, I guess) that that doesn't always work. If the technical debt is sprinkled throughout your system or your organization is sunk deeply into a broken paradigm, you're in trouble. Sometimes you need to provide a completely new technology that serves many of the same use cases in a different way.
I'm not saying we should reach for rewriting things as our first instinct.
I am saying that your COM interface will need to be replaced some day. Or your COBOL business logic will become a liability. Or your frames-based web API will cost you business.
Incremental change leads you to a new local maximum, but sometimes that local maximum stunts your growth or even proves deadly.
One strategy that I've seen work for this kind of deep architectural change is to write the new system, then write a shim that provides a compatibility layer (however hacky and ugly) to emulate the old system. This lets you test the new system without also having to test everything the system interacts with. And then, start replacing usages of the shim with direct interaction with the newer, prettier system.
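As a rough illustration of that shim strategy (all names here, inventory_* and old_*, are hypothetical; assume an old global-state API being replaced by a newer handle-based one), a C sketch might look like this:

    /* Hypothetical sketch: the new system is written first, a thin
     * compatibility shim emulates the old interface (however hacky), and
     * call sites are migrated off the shim one at a time. */
    #include <stdio.h>

    /* ---- the new, prettier system ---- */
    typedef struct {
        int items[16];
        int n;
    } inventory;

    void inventory_init(inventory *inv) { inv->n = 0; }
    void inventory_add(inventory *inv, int item) {
        if (inv->n < 16) inv->items[inv->n++] = item;
    }
    int inventory_count(const inventory *inv) { return inv->n; }

    /* ---- the shim: emulates the old global-state interface purely in
     *      terms of the new system ---- */
    static inventory g_legacy_inventory;    /* the old API assumed a global,
                                               so the shim keeps one around */
    void old_add_item(int item) { inventory_add(&g_legacy_inventory, item); }
    int  old_count_items(void)  { return inventory_count(&g_legacy_inventory); }

    int main(void) {
        inventory_init(&g_legacy_inventory);

        /* un-migrated call site: still talks to the old interface,
         * but exercises the new system underneath */
        old_add_item(42);
        printf("via shim: %d item(s)\n", old_count_items());

        /* migrated call site: talks to the new system directly */
        inventory inv;
        inventory_init(&inv);
        inventory_add(&inv, 7);
        printf("direct:   %d item(s)\n", inventory_count(&inv));
        return 0;
    }

The shim is allowed to be ugly, because it's scaffolding: it lets the existing tests and callers keep working while the new system bakes, and it shrinks away as call sites are migrated.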
>but it's also true that over a long enough timescale (100 years, to trigger a reductio ad absurdum) there is a very high risk that not replacing or rewriting that code will sink your technology and possibly your organization.
Could you elaborate on what basis you claim this as a truth?
It's largely conjecture, admittedly. But there are very few pieces of software that last 40 years. Most companies don't plan to be out of business in the next 40 years. And many tech companies have gone out of business in even the last ten years because they weren't agile enough to adapt quickly to new technological changes.
I don't think that argument holds. Usually it's true that if something hasn't lasted a long time, it likely won't last a long time in the future. But we live at a quasi-special time, close to a boundary condition.
There exists no software that has existed for 100 years, because software was invented sometime in the past hundred years. You can reductio ad absurdum that argument by saying, the week after the first software was written, that no pieces of software have lasted more than a week, and therefore very few pieces of software will last more than a week.
So very few pieces of software have lasted 40 years, because very few pieces of software were written at least 40 years ago.
While I agree that in the abstract, technical debt can catch up to you, I'm just not sure that its impact is necessarily such that it cannot be contained or mitigated.
I mean, UNIX has been around for 30-odd years. It hasn't sunk. It carries tremendous technical debt, in terms of bad design, in terms of implementation bugs that need to be carried forward, etc.
My sense is that the more organizations that depend on a piece of tech, the more chance there is that it is going to age well, warts and all.
I've learned a variation on this theme, which is more specific:
You have some code that, for whatever reason, you think is not very good. There is a reason why you have ended up with code that is not very good.
If your action is to sit down and write it all again, you should anticipate getting a very similar result (and this has been the outcome of every "big rewrite" effort I've ever seen: they successfully reproduced all the major problems of the old system).
The reasons why this happens probably have something to do with the way you're doing development work, and not with the codebase that you're stuck with. Until you learn how to address those problems, you should not anticipate a better outcome. Once you have learned how to address those problems, you are likely to be able to correct the problem without doing a "big rewrite" (most commonly by fixing them one piece at a time).
Sometimes I see people attempt a "big rewrite" after replacing all of the people, thinking that the new people can do a better job. The outcome, as far as I can tell, is invariably that the second team, trying to build the system with no real experience of it, ends up following a very similar path to the first team that did the same thing (guided by the map that the first team left them, and again reproducing all the same problems).
From these observations I draw one key conclusion: the important thing that you get from taking smaller steps is that you amplify your ability to learn from things that have already been done, and avoid repeating the mistakes that were made the last time. The smaller the step, the easier it becomes to really understand how this went wrong last time and what to do instead. Yes, the old codebase is terrible, but it still contains vitally important knowledge: how to not do that again. Neither writing new code from scratch without touching the old, nor petting the old code for years without attempting to fix it, is an effective way to extract that knowledge. The only approach I've ever really seen work is some form of "take it apart, one piece at a time, understand it, and then change it".
I think what works even better is to have "permission" to gradually change both the old and the new. It can drastically simplify the process of creating a replacement, if you're only replacing a slightly more sane version of the original instead of the actual original.