When to Rewrite from Scratch – Autopsy of Failed Software (codeahoy.com)
240 points by perseus323 on Apr 23, 2016 | 106 comments

Unfortunately this is a lesson that often has to be learned the hard way when you have a lead or team that lacks experience/wisdom. If you can survive that educational expense, you end up much better for it but it's painful nonetheless.

Aesthetics are never a legitimate reason to replace a functioning system with something else. I do not care how it offends the eye, the mind, the ear, or any other sensation you have... if it does not have a fundamental functional problem your best option is to maintain or plan an incremental rolling replacement that is tightly scoped at each step.

For large systems that do present functional problems, identify the egregious offenders in that system, abstract the entities/behaviors most responsible, and replace-in-place in a Ship of Theseus fashion. However familiar you believe you are with a monolithic or complex inter-dependent system, you will discover how much you don't actually understand about how it works as a whole if you try to re-implement it in its entirety.
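To make the replace-in-place idea concrete, here is a minimal sketch in Python. Everything in it (the `ReplaceInPlaceFacade` class, the `LegacyBilling` stand-in and its method names) is hypothetical and only for illustration: callers talk to one facade, and individual operations are swapped out one at a time while everything else still falls through to the old system.

```python
class ReplaceInPlaceFacade:
    """Routes each operation to its replacement if one exists, else to the legacy system."""

    def __init__(self, legacy):
        self._legacy = legacy
        self._replacements = {}

    def replace(self, operation, handler):
        # Swap out one tightly scoped piece; all other operations are untouched.
        self._replacements[operation] = handler

    def handle(self, operation, *args):
        handler = self._replacements.get(operation)
        if handler is not None:
            return handler(*args)
        # Not replaced yet: fall through to the legacy implementation.
        return getattr(self._legacy, operation)(*args)


class LegacyBilling:
    """Hypothetical legacy system, standing in for the real hairball."""

    def invoice_total(self, amounts):
        total = 0
        for amount in amounts:
            total += amount
        return total

    def format_invoice(self, total):
        return "TOTAL: " + str(total)


facade = ReplaceInPlaceFacade(LegacyBilling())
# Replace only the worst offender; format_invoice still runs the old code.
facade.replace("invoice_total", lambda amounts: sum(amounts))

print(facade.handle("invoice_total", [1, 2, 3]))
print(facade.handle("format_invoice", 6))
```

The point of the facade is that each swap is small enough to review and revert on its own, which is what keeps every step tightly scoped.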

I perpetually love the references to Ship of Theseus, and find it particularly applicable to this problem.

You mention fundamental functional problems, and I'd like to add something: sometimes it's not a functional problem, but a changeability problem. The code functions fine, but the process of adding a new feature is incredibly painful.

I've done big-bang partial rewrites of systems before, quite successfully, but I've got a Ship of Theseus rule of my own that I follow: no new features during the rewrite, and no missing features. The first example that comes to mind was a rather complicated front-end application that had become a total spaghetti disaster. It had been written using Backbone, and from my research it fit the Angular model of things quite well.

I took a pound of coffee with me out to the cabin for a weekend, and rewrote the whole frontend. Side-by-side. I started with the first screen, and walked through every code path to populate a queue of screens that were reachable from there. Implement a screen, capture its outbound links, repeat.
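The walk described above is essentially a breadth-first traversal of the screen graph. A rough sketch, assuming the outbound links of each screen can be written down as a simple mapping (the screen names below are made up):

```python
from collections import deque


def plan_rewrite_order(start, outbound_links):
    """Return screens in the order they would be reimplemented.

    `outbound_links` maps a screen name to the screens reachable from it.
    """
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        screen = queue.popleft()
        order.append(screen)  # "implement a screen"
        for target in outbound_links.get(screen, []):  # "capture its outbound links"
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical screen graph, for illustration only.
links = {
    "login": ["dashboard"],
    "dashboard": ["settings", "reports"],
    "reports": ["dashboard", "report_detail"],
}

print(plan_rewrite_order("login", links))
# ['login', 'dashboard', 'settings', 'reports', 'report_detail']
```

The `seen` set is what guarantees that no screen is done twice, and an empty queue is the signal that every reachable screen has been covered.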

Nothing changed, but everything changed. The stylesheet was gnarly too, but I left it 100% untouched. That comes later. By keeping the stylesheet intact, I (somewhat) ensured that the new implementation used identical markup to the old system. The markup was gnarly too, but keeping it identical helped to ensure that nothing got missed.

48-72ish hours later, I emerged a bit scraggly and the rest of the team started going over it. They found 1 or 2 minor things, but within a week or so we did a full cut-over. The best part? Unlike the article, clients had no outward indication that anything had changed. There was no outcry, even though about a quarter of the code in the system had been thrown out.

A full rewrite absolutely should not be taken lightly (if ever). It's very much a last resort and something that requires deliberation and a clear path to success. You're spot on with your rules - nothing new; nothing lost.

I had a similar experience, sans the cabin. I was hired onto a startup that already had a functioning app in the wild. The API was written by one of the founders, who is not a developer in the professional sense, and holds my undying respect. As an early-stage startup with lots of ideas, we needed to move fast, and I wasn't going to be able to do so with the existing code base.

I stocked the fridge, locked myself into my tiny Brooklyn apartment, and got to work. I started by logging all requests to the API in order to ensure I had all the necessary endpoints covered. Then I wrote integration tests - acting as an HTTP client - for the entire API.

About a week or so later, once the rewrite was finished, I added automated tests that compared the output between the two APIs, and once those matched perfectly, ran it live beside the original API (sending requests to both) and compared results from real requests to ensure there were no discrepancies.
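For anyone wanting to try the same comparison step, here is a minimal sketch of the idea in Python. The two handlers are hypothetical in-process stand-ins; a real setup would replay the logged requests over HTTP against both APIs.

```python
def find_discrepancies(recorded_requests, old_handler, new_handler):
    """Replay each request against both implementations and collect mismatches."""
    mismatches = []
    for request in recorded_requests:
        old_response = old_handler(request)
        new_response = new_handler(request)
        if old_response != new_response:
            mismatches.append((request, old_response, new_response))
    return mismatches


# Hypothetical stand-ins for the original and rewritten APIs.
def old_api(request):
    method, path = request
    if path == "/users/1":
        return (200, {"id": 1, "name": "Ada"})
    return (404, {"error": "not found"})


def new_api(request):
    method, path = request
    users = {"/users/1": {"id": 1, "name": "Ada"}}
    if path in users:
        return (200, users[path])
    return (404, {"error": "Not Found"})  # deliberate difference in wording


# Requests as they might have been captured from the request log.
recorded = [("GET", "/users/1"), ("GET", "/users/2")]

for request, old, new in find_discrepancies(recorded, old_api, new_api):
    print("MISMATCH", request, old, new)
```

An empty result over a large enough sample of real traffic is the evidence that makes a cut-over feel safe.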

Besides a couple very small bugs after the switch, it went very well. The user base was none-the-wiser, besides the sudden uptick in features after the rewrite. The startup was relatively successful (acquired), and I still work with those guys from time to time.

The thing that struck me is that they decided to rewrite the system without talking to the customer. I believe that if they kind of sold this first, they might have gone a different path.

I love both your story and the parent's :-) great work!

I have similar experience: the "incremental rewrite" is usually the most effective tactic to apply. It reduces risk, cuts "time to market" and lets you apply Pareto's rule - making the process efficient.

Very rarely a rewrite is the answer: for example if the product is functioning poorly, where even low risk changes cause random regressions. Where large system-level changes are needed to make the system work - say because the original developers didn't implement authorization checks after the login.html page. Where none of the original developers are available and no-one knows the requirements. Where the system is a hodgepodge of 4 different frameworks including one custom one (whose developer is long gone, leaving 0 documentation or tests behind).

In cases like that the software artifacts are a liability; it's better to use the assets you've built up (domain knowledge) and develop a new product in parallel. Put the old one in zombie mode and spend two years building the new product with feature parity.

There is one other situation where a full rewrite is good: if v1 is a throwaway prototype. However that would never have been put into production for any significant number of users.

>In cases like that the software artifacts are a liability; it's better to use the assets you've built up (domain knowledge) and develop a new product in parallel. Put the old one in zombie mode and spend two years building the new product with feature parity.

In two years your devs should be able to understand the existing code, fix the auth problems, and excise the worst of the code and hodgepodge framework mess. If they can't, then they certainly can't maintain the old system while building a replacement.

If you can't understand the old code, you can't replace it. If you don't understand the requirements, I'm not sure you can even maintain the existing system. It shouldn't take two years to add auth, or remove an unsupported framework.

This is the textbook case of when a rewrite will fail, because the scope is not just too big, but unknown (because you don't even understand the requirements). The choice to rewrite in this situation is not logical. It's emotional. The existing system is a mess, and the fix seems so difficult. But the rewrite estimate is likely poor because you don't actually understand the system you're rebuilding, and even if it's an accurate estimate you lose two years guaranteed just to ship feature parity. You can make a whole lot of improvements to a codebase in two years while also shipping features.

> There is one other situation where a full rewrite is good: if v1 is a throwaway prototype. However that would never have been put into production for any significant number of users.

Tell that to my manager. Since I joined this company a little over two years ago, all of our tools have been put into protoduction, despite my warnings and protests.

>cabin for a weekend, and rewrote the whole frontend.

The problems with big bang rewrites don't manifest in rewrites that take 2-3 days. Even if your rewrite burns a full week, if it fails you've only lost a week. Rewrites are problematic when they are expected to take months or longer. That's when the amount of code is high, the complexity is high, and the estimates tend to be bad. Losing many months of forward progress to chase a rewrite can kill a company. If a single lost week can kill your company, you're probably doomed anyway.

I've successfully done the "big bang" rewrite myself for things that needed a week or so of rewriting. I don't believe for a moment that this experience is relevant for large scale rewrites. I've only ever seen those fail spectacularly.

Anyone who tells you that a 1-week rewrite is never appropriate is just cargo culting and probably not worth listening to in general. A one week rewrite to avoid weeks of refactoring can be a very good tradeoff. A one year rewrite on the other hand is likely to end in disaster, not least because it guarantees a year of lost forward progress.

That's why I targeted a specific subsystem. Redoing the whole system, frontend and backend, would have definitely taken much longer than the 48 of 72 hours I put in over 3 days. I took the piece of the system that was the gnarliest, rewrote it, and bought us a quick win. Down the road, other vertical chunks of the backend were ripped out and replaced in a similar fashion, once they became the now-worst piece of the system.

I guess I'm a little unclear about your argument here. You used the term "big bang" rewrite, but you're describing incremental rewrite.

Only a pound of coffee? I'd be terrified of running out just before I finished. Stay safe, man. Always bring lots of extra coffee to your cabin in the woods.

Hah, funny enough... Not on that project, but on a different messy one, I packed up and went out there, only to discover at 9am on the first day of the project that I had accidentally brought a pound of decaf. The nearest town is about a half hour drive one-way, and the best I could find at the tiny grocery store there was a large can of Folgers of questionable vintage, and a box of Red Rose tea.

Surprisingly, if you use pre-ground cheap coffee in an Aeropress, it still turns out... sort of OK. Better than I expected, worse than I'd hoped!

Friends don't let friends drink decaf.

That is precisely what I did with a set of Qlikview dashboards at my previous firm. They were always very dodgy, but kind of worked. My new boss walked in and started fiddling with the underlying scripts and buggered up everything.

Eventually another colleague and I were tasked with fixing the mess. What I did was to look at the sources, then work out what each graph was trying to do. I then set up a new script that extracted the data into a cleaner data model, and then I (rather painfully) copied and pasted entire sheets into this new Qlikview dashboard. From here all I needed to do was to hook up the graphs to the new model by changing the source fields, expressions and calculations.

After a bit of UAT from internal departments I had fixed the reports and no client was really any the wiser. If I hadn't done it this way, there would have been awkward questions and I just know I would have been chasing my own tail modifying the rewrite to get it back to the way the old system looked.

About 500 of the Fortune 500 would like you to show them how to do that with each of their 500 worst hairballs.

While I generally agree with this, there is one situation when scorched earth re-write should at least be kicked around as an option.

If you are dealing with a very bad system that has not been maintained and has tons of candidates for those tightly-scoped incremental fixes, but at the same time, you are embedded in a giant, faceless conglomerate sort of company where there is effectively 0% chance that any of those tightly-scoped incremental fixes will ever be greenlighted and everyone in the team knows it.

Then you should consider the scorched earth approach, because it is the only way that the incremental fixes will ever happen. Bureaucrats will always find an excuse why this particular short term time frame is not the right one for the incremental fix that slightly slows productivity and gets them dinged on their bonus. And they will always find a way to pin the blame regarding underinvestment in critical incremental fixes on the development staff.

So sometimes all you can do is deny them that option, even if you know how painful it will be. The aggregated long-run pain will be less, though it won't feel that way for a long time.

I just want to reiterate that I mostly agree with you. It's especially bad to turn your nose up at legacy code that actually has valuable tests, because the tests make the maintenance and incremental fixes so much better. When there are solid tests, you should almost never throw it away.

Nonetheless, sometimes you have to torch it all to deny the bureaucrats the chance to slowly suck the life out of it (and you).

I dunno, after you do the rewrite, then what? If the bureaucrats won't ever let you make incremental fixes/cleanup, then even your rewritten version will eventually rot.

I think the idea of building software "the right way, once and for all" is mostly just wishful thinking on our part. Even the ideal rewrite will have ongoing maintenance that needs to be done.

Therefore, my only two options in that situation would be to a) make those incremental changes as part of features without telling them (estimated as one chunk), or b) look for a new job.

At my old job, we built, maintained, and operated a huge customer-facing system. We had 2-3 big releases per year, and in every release cycle we devoted around 20% of our time to maintenance, whatever time was necessary for critical bugs (generally from the new code released in the last cycle) and the rest to new development. We (the dev team) put this split into our estimates and development plans, and we had official approval from the project management team and c-level execs. That 20% time was ours to use as we saw fit, and we generally used it for refactoring code we felt was hampering productivity on new development. But we also used it for side projects; I built a performance/metrics gathering and reporting system that was incredibly useful to me, and ultimately became incorporated in marketing materials and the ceo's presentations.

We were lucky to be able to use this 20% time openly, but if we weren't we would have increased our new dev estimates to get the time.

Michael Feathers considers legacy code to be any code without tests. [1] I tend to agree - only I think even code with tests can be candidates for the "legacy" moniker.

Bottom line is: you may do a rewrite, but that codebase may be considered to be a hairball by the next guy.

1. http://www.netobjectives.com/system/files/WorkingEffectively...

All b) really does is start the loop over again at the initialization step.

One of the cool things about doing replace-in-place is that you can A/B test the old and new versions. We are doing that on our current Form Builder rewrite at JotForm. We rewrite one piece and make it live to 50% of users. Then we receive daily morning emails about each test. If some metric is in red, we discuss (or watch fullstory, or talk to users) what might be causing it and improve the new version.
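A 50% split like this is often done by hashing a stable user identifier, so that each user keeps seeing the same version between visits. The sketch below is a generic illustration of that technique, not a description of JotForm's actual setup; the function name, experiment name and user IDs are all made up.

```python
import hashlib


def use_new_version(user_id, experiment, rollout_percent=50):
    """Deterministically assign a user to the new version for a given experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent


# The same user always lands in the same bucket for the same experiment.
print(use_new_version("user-42", "form-builder-rewrite"))

# Across many users the split comes out close to the requested percentage.
users = [f"user-{i}" for i in range(10000)]
share = sum(use_new_version(u, "form-builder-rewrite") for u in users) / len(users)
print(round(share, 3))
```

Including the experiment name in the hash keeps the buckets of one test independent from the buckets of the next, so the same users are not always the guinea pigs.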

Here is a fresh example: we released the new version of the PayPal Payments Pro/Express integration, and the success rates stayed in red. The old version was beating the new version. It was 3x better even though almost everything was the same. After some head scratching, we found that the old version had a link to a direct PayPal page where users can get their API credentials, and the new version was missing it. From there, the fix was easy and things turned green.

This is a story that has happened over and over again. When you rewrite software, you lose all those hundreds of tiny things which were added for really good reasons. Don't do it blindly.

Usually these problems come down to the information architecture, and I wonder sometimes if the skills necessary to do a replace-in-place to fix those sorts of problem might be better off, at least in general, being applied to some other problem.

If people want broken stuff, or things that violate the laws of physics, information theory, and common decency, no amount of heroics is gonna keep the wheels on forever. Maybe writing software to cure cancer or improve food distribution (ie, cure hunger) would be a better use of your energy.

Funny because that's exactly the choice I just made.

Why should I - after spending 20 years turning myself into a top tier developer - spend all of that talent and investment fixing the horrible mess that two generations of management have created for me?

I can earn twice as much freelancing or moving into management.

And, honestly, I'm beginning to think that we should just let market forces do their thing. These are the software equivalent of the subprime mortgages - if I spend an enormous amount of capital I can stop it failing but I simultaneously remove any downside. So these "management" types can make the same errors all over again.

Nah, starting for myself is much more interesting. That seems to be the Valley way (although I live on the other side of the ocean).

Our product is used by 2 million people and some of them are literally curing cancer and hunger. That's actually a great motivational reason for us to continuously improve it.

Yes, it's a great feeling and strong motivation to work on something like that, and it gives you the will to stick to it for the long run.

I've been working on a system for improving health and preventing diabetes since 2005 [1], which is complicated by the fact that it's designed from the start to support controlled clinical trials, followup questionnaires, and analytics to measure how well it works. That's enabled us to measure how well it works (which literally involved drawing blood), publish papers proving its effectiveness, and feed that learning back into many changes to improve the system over the years.

Parts of the system really needed to be rewritten, but were originally intertwined all throughout, and there was no time for a rewrite. So over the years I've been incrementally refactoring and isolating those parts to make it easier to remove them in the future, even when we were under crunch and there was no time to rewrite.

I finally found the time to completely replace the bad parts, and it turned out to be much easier (and more satisfying) than I expected, significantly reduced the complexity and lines of code, and made it possible for us to hire contractors to work on the code without their heads exploding. But I think it went so well because I'd been thinking about it and chipping away at it for years.

[1] http://turnaroundhealth.com

For anyone who has to look up the "Ship of Theseus", here it is: https://en.m.wikipedia.org/wiki/Ship_of_Theseus

Or Fowler's "Strangler Pattern," a similar approach.

...this is a good reminder that sometimes it's a good idea to hire for certain positions based on prior experience.

Indeed, but then the organization has to know what it does not know... which presumes a certain wisdom in the first place. ;)

I no longer believe in rewrites.

I used to believe that every company gets one rewrite, but only because I have seen that most places have the patience and the stamina for a little bit less than a single rewrite, but I was on the fence about whether they were a good idea anyway.

Trouble is, I could never put my finger on why, other than that it never seemed to fix the problem. It was a bit like moving to a new city to start over and finding out you brought all your problems with you.

In the last couple of years I have begun to figure out what I know in a way I can articulate. The people who have the skills and discipline to take advantage of a rewrite don't need a rewrite. It's a short distance from that skill set to being able to break the problem down and fix it piece by piece. They just need permission to fix key bits, one bite at a time, and they've probably already insisted on the space to do it, although they may be clever enough never to have said it out loud. They just do it.

Then you have the people who don't have the skills and discipline to continuously improve their code. What the rewrite buys them is a year of nobody yelling at them about how crappy the code is. They have all the goodwill and the hopes of the organization riding on their magical rewrite. They've reset the Animosity Clock and get a do-over. However, as the time starts to run down, they will make all of the same mistakes because they couldn't ever decompose big problems in the first place, and they lack the patience to stick with their course and not chicken out. They go back to creating messes as they go, if they ever actually stopped to begin with. Some of them wouldn't recognize the mess until after it's done anyway.

In short, people who ask for a rewrite don't deserve a rewrite, and would squander it if they did. It's a stall tactic and an expensive dream. The ones who can handle a rewrite already are. They never stopped rewriting.

Now, I've left out all of the problems that can come from the business side, and in many cases the blame lies squarely with them (and the pattern isn't all that different). In that case it might take the rewrite for everyone to see how inflexible, unreasonable and demanding the business side is, and so Garbage In, Garbage Out rules the day. But the people with the self respect to not put up with it have moved on, or gotten jaded and bitter in the process.

This comment inspired me to recount a story from my time at Amazon: https://storify.com/jrauser/on-the-big-rewrite-and-bezos-as-...

I'm a highly biased former Amazonian.

I was expecting some real wisdom in this story, but I don't think this qualifies him as a technical leader in any way. I don't even think this needed the cargo-cult-y "leaving the room and coming back with a paper that changes everyone's mind." Couldn't he have just asked for a plan? And the ROI? That seems like management 101.

I have no love for him, but he is a genius. This story doesn't do him any justice (nor is the paper particularly compelling).

My point was not to qualify Jeff as a technical leader (though he is an excellent technical leader, IMO). Rather, my goal was to give people a sense for his leadership style, and what it's like to be in a high level meeting at Amazon.

I'm sorry you find the Rickover memo uncompelling. Speaking as an experienced engineer, I happen to think it's among the best pieces of engineering writing ever produced.

I have successfully done a full rewrite. 60kLOC of VBScript/ASP to 20kLOC python/django. Added a ton of missing features that wouldn't have been possible in the old, clipboard inheritance style. There was one page on the old site that I had tried to fix up a couple of times and failed completely at. It was ~2500 lines. In the new system it was nothing special: 25 lines in the view and a big but manageable template.

I did this whole rewrite in about 9 months including all the work to do the database migration which took a bunch of dry runs and then the actual switchover which took nearly 12 hours to run.

What's the catch? I had been working at the company for about 7 years at that point as the sole caretaker of the application so I knew the business needs inside and out, I knew where all the problem areas were and I had plenty of ideas about what features we desperately wanted to add but couldn't. I didn't necessarily add all of them up-front but I was able to make good schema decisions to support those later features.

Rewrites aren't impossible by any means, but I suspect my story is more of the exception than the rule.

I've done two rewrites at the startup I joined 2.5 years ago of codebases that were our core competitive advantage and that were no longer maintained when I joined. Both of the rewrites were successful, but they were small codebases (5k & 10k LoC) and I was the expert on what they were meant to be doing. The first one was our client-side analytics product that I ported from a single 5k LoC CoffeeScript file to TypeScript, the second was a Node.JS/CoffeeScript application that I ported to Java/Groovy.

I think the move to TypeScript was probably the one with the least clear benefit; our lack of experience with CoffeeScript and the compiler's lax semantics were causing us problems, but we could have moved to pure JavaScript. I like TypeScript and I feel more confident refactoring and merging with a statically typed language, but it's hard to quantify. This took about 6 weeks, though most of my attention was going to fighting customer-facing fires during this period. Launching this was excruciatingly stressful though because there was a bug we could only tell existed through aggregate A/B test statistics.

The second one from Node.JS/CoffeeScript to Java/Groovy has seen more clear benefits from being on the JVM since our team has more experience with it overall, both in Dev & Ops, and a direct port boosted perf 3x and we've benefited from the more extensive library support and interoperability with other JVM languages we use elsewhere in our system. This took about 2 months to go from nothing to prod. It was also useful organizationally since we'd talked about rewriting it for so long that we'd been treating the existing system as legacy and not wanting to invest any extra time in that codebase for that reason. Releasing this was comparatively easy since it essentially acted as a streaming transformer, taking JSON in from one queue and emitting JSON out another queue without any state and I could confirm that the new system had only irrelevant changes. Still, half of this time was spent on testing.
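A stateless JSON-in/JSON-out component lends itself to a simple equivalence check: parse both outputs and compare them while ignoring whatever differences have been judged irrelevant. A minimal sketch of that idea follows; the field names are invented, and which differences count as "irrelevant" is an assumption that depends entirely on the system.

```python
import json


def normalize(message, ignored_keys=frozenset()):
    """Parse a JSON string and drop keys whose differences are considered irrelevant."""

    def strip(value):
        if isinstance(value, dict):
            return {k: strip(v) for k, v in value.items() if k not in ignored_keys}
        if isinstance(value, list):
            return [strip(v) for v in value]
        return value

    return strip(json.loads(message))


def equivalent(old_message, new_message, ignored_keys=frozenset()):
    """True if both outputs match once key order and ignored keys are set aside."""
    return normalize(old_message, ignored_keys) == normalize(new_message, ignored_keys)


# Hypothetical outputs from the old and new systems for the same input message.
old_out = '{"user": 7, "event": "click", "processed_at": "2016-04-23T10:00:00Z"}'
new_out = '{"event": "click", "user": 7, "processed_at": "2016-04-23T10:00:01Z"}'

print(equivalent(old_out, new_out, ignored_keys={"processed_at"}))  # True
print(equivalent(old_out, new_out))  # False
```

Comparing parsed values rather than raw strings means key order and whitespace never show up as false alarms.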

I definitely agree

A rewrite is more useful when the original project is irrevocably doomed (bad/obsolete framework, needs a new language, etc)

Most of the time you can refactor the original code

I'm currently at my first job, working at a budding unicorn company in Silicon Valley. I've been here 1.75 years and am already in the midst of a 3rd rewrite - although not my decision (from higher ups).

The last paragraph really hits home for me and I've 100% gotten very jaded and very very bitter.

"While parts of code were bad, we could have easily fixed them with refactoring if we had taken time to read and understand the source code that was written by other people."

This is the thing. Many people just HATE and REFUSE to read other people's code.

Maybe they just don't teach code reading and reviewing in computer science classes, but they should.

Printing it out on paper and reading it away from the computer is helpful, but a lot of people write code for wide screens that doesn't print out well, or they just don't give a shit and it looks terrible even on the screen. And some people just don't have the stomach for reading through a huge pile of pages, or think that's below their station in life.

But reading code is one of the best, most essential ways to learn how to program, how to use libraries, how to write code that fits into existing practices, how to learn cool tricks and techniques you would have never thought of on your own, and how to gain perspective on code you wrote yourself, just as musicians should constantly listen to other people's music as well as their own.

I just don't feel safe using a library I haven't at least skimmed over to get an idea of how it's doing what it promises to do.

There are ways to make reading code less painful and time consuming, like running it under a debugger and reading it while it's running in the context of a stack and environment, looking at live objects, setting breakpoints and tracing through the execution.

Code reading and reviewing is so important, and it should be one of the first ways people learn to program, and something experienced programmers do all the time.

And of course as the old saying goes, code should be written first and foremost for other people (preferably psychopaths who have your home address, photos of your children, and your social security number) to read, and only incidentally for the computer to execute.

>Maybe they just don't teach that in computer science classes, but they should.

Should they? This seems like a classic conflation of computer science with programming. Perhaps the reason this important craft skill is not taught in CS is because it has nothing to do with complexity classes, automata theory, etc.

Programming != CS

It may be the case that programming is applied CS, and that a CS grad might therefore not actually be a decent programmer. However, employers are considering a CS degree as proof of competency in programming. So, if CS classes don't teach programming, then they're failing to provide what the expectation is that people have of that degree.

And they're also failing to educate their CS grads as well as they should. I think every CS grad should have a firm grasp of programming, and I'd consider a CS grad who doesn't know how to program to have a gaping hole in their education.

This notion of considering programming a "craft" that's below you gets at the attitude problem I think a lot of people (not just pure theoretical computer science graduate students) have towards reading other people's code.

Just to be clear: I don't consider it a craft that's below me, I'm a working programmer. I don't consider the use of the word craft to be in any way derogatory either. Quite the opposite. I have a knowledge base from CS and a knowledge base of craft skills that is derived from the practical professional application of the former. I consider these to be separate domains. The latter is not in any way a prerequisite for the former. The obverse may not be true. YMMV.

Given that all the things you list that are important parts of a Computer Science course are tools to create better software, why do you consider the ability to analyse and understand code as separate from CompSci?

When you are working out the complexity of a program, you need to be able to read and understand it. It seems to me that if you are taught the fundamentals of CompSci, reading code and understanding good coding practices are naturally part of the subject matter you should be taught!

The academy should not exist to serve short term corporate interests (i.e. outsourcing employee training). Also university students who want to work in corporations should intern, or if for some reason they can't get an internship, volunteer to build real world apps. I've trained several new grads who had trouble implementing even the most basic tasks. This is a failing of the students and the places that hire them, not the schools. It would be like blaming schools for underemployed English grads.

How do the schools advertise the CS degree? Do they advertise it as a purely academic pursuit with little practical benefit, or do they promote it as the start of a career? If the latter, it is very much the schools who are to blame if they fail to teach the things needed to have that career.

They aren't the same, but they aren't mutually exclusive either.

I think all computer scientists should learn to program, and learning to program requires getting over your distaste of reading other people's code.

I doubt a pure academic "computer scientist" who doesn't like to get their hands dirty reading or writing code because they look down on it as a "craft" is going to be of much use on the types of projects we're discussing that require rewriting or refactoring existing code.

TIL some people have seriously negative connotations of the word 'craft', which appears to have made it a poor word choice, please see my explanatory comment above. I'd be interested to hear if there's some context I'm missing. Both responses that express negative interpretations of the word have mentioned computer scientists' attitudes, but I don't hang out with any IRL. Have I missed out on some historic flame war?

No, you just missed out on some bad professors. ;)

I spend a lot of time looking at a profiler view of my application in work, it's a great way to see what the most "important" areas of the code are.

You get call stacks for the hottest code paths, can read the code with annotations about how expensive a certain line is, etc. It's a good way to look at code I've found!

Understanding others' code can be difficult, frustrating, or just plain tedious. But who wouldn't have to do it? When software isn't working, it seems to be a normal part of figuring out where things are going wrong, which is a normal part of fixing bugs, etc.

Of course, it's agonizingly ironic to discover that one's own code written long ago may be no easier to decipher. It's a humbling experience when it happens, but what would ever teach the lesson more effectively about writing code that can be read by mere humans?

I know it makes me think about the variable names I use, the merit of "one-liners", how much white space to leave, and the clarity of explanatory comments, among other things. It's a form of art to know when to be verbose and when not to be.

Some programmers may be superior "code artists", but every programmer should be able to write well enough.

Software is re-written most often for political reasons, namely for new developers to leave a mark and put a good line on their resume. Everyone would rather read "designed and implemented" on a resume than "maintained and fixed annoying bugs here and there". We know what looks better, managers know who they'd move to the next round, and so on.

The only thing is, even if there is no software to be written, software developers will always find a reason to write software. Because they get paid full time, usually above what other professions get paid, to, well... write software.

I think other reasons such as "it is slow, outdated, not written in <latest-language-fad>, bad UI", are often brought up and used as excuses. But underlying reasons are a lot more personal and subjective.

Ill-fated code rewrite crusades often seem to be linked with the Second System Effect (https://umich.instructure.com/files/884126/download?download... ... page 55).


An architect's first work is apt to be spare and clean. He knows he doesn't know what he's doing, so he does it carefully and with great restraint. As he designs the first work, frill after frill and embellishment after embellishment occur to him. These get stored away to be used "next time." Sooner or later the first system is finished, and the architect, with firm confidence and a demonstrated mastery of that class of systems, is ready to build a second system. This second is the most dangerous system a man ever designs. When he does his third and later ones, his prior experiences will confirm each other as to the general characteristics of such systems, and their differences will identify those parts of his experience that are particular and not generalizable.

And from the OP, frill by frill :-)

I wanted to try out new, shinny technologies like Apache Cassandra, Virtualization, Binary Protocols, Service Oriented Architecture, etc.

To OP's credit, he took a step back and learned from his mistakes.

Also related... Second System Effect may not only apply for your second system ever. It may also bite you designing the second system in a new problem domain you're unfamiliar with. I can admit I've built a few second systems in my time ;-)

I hope/like-to-think that the second-system that focuses on architectural-features is better than the second-system that focuses on new user-features.

Stuff like "no global variables" or "actually use database transactions" or "support UTF-8"...

"Remember that time zones are a thing and not everyone is in America/Los_Angeles..."

That’s why I personally love most European companies, as most just assume UTC.

The gold standard is software designed to be run near the North and South Poles, like snowmobile and ICBM navigation systems, which need to be able to switch between all time zones quickly.

Why would an ICBM need to use local timezones at all?

I can't imagine any teams up there using more than one timezone anyways. Perhaps UTC + their home zone?

After a 15-year career programming, I'm pretty sure the only sane way to handle timestamps is everything UTC server side, and let the UI convert to localtime on display.
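A minimal sketch of that split, assuming Java's standard `java.time` API. The class and method names (`UtcTimestamps`, `store`, `display`) are made up for illustration; the point is only that the server keeps an absolute UTC instant and the viewer's zone is applied at display time.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: names are illustrative, not from any real codebase.
final class UtcTimestamps {
    // Server side: persist an absolute point in time as ISO-8601 UTC text.
    static String store(Instant when) {
        return when.toString(); // Instant.toString() is always in UTC ("...Z")
    }

    // UI side: apply the viewer's time zone only when rendering.
    static String display(String stored, ZoneId viewerZone) {
        ZonedDateTime local = Instant.parse(stored).atZone(viewerZone);
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm").format(local);
    }
}
```

The same stored value then renders differently per viewer without the server ever knowing a local zone: `display(stored, ZoneId.of("America/Los_Angeles"))` and `display(stored, ZoneId.of("Europe/Berlin"))` read the same instant.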

Yup. That’s also my solution. Sadly, we’ve had to deal with doing other things recently, too, as our software was run on servers with the timezone set to UTC, but the hardware clock being 3 hours ahead of UTC (this is common especially on old Windows installs).

The result was that the displayed time on the server was local time, but if we sent it to the client, we got 3h offset there as well.

>The development officially began in Summer of 2012 and we set end of January, 2013 as the release date. Because the vision was so grand, we needed even more people. I hired consultants and couple of remote developers in India.

Well, that's already a warning sign. If you don't have the talent to do the rewrite, consultants and remote developers in India are only going to make things worse.

> If you don't have the talent to do the rewrite, consultants and remote developers in India are only going to make things worse.

I don't think this is necessarily a given, but I see where you're coming from with the overall perspective.

To get closer to the point of the post: unless there are truly solid and unavoidable blockers, e.g. substantial architectural defects that just can't be resolved without a major rework, a rewrite of any capacity should probably be off the table.

Although I'm sure there are others that I have no insight into, the only "successful" commercial rewrite I can think of would be Windows Vista, and that only made that version of Windows somewhat releasable. Even then, that wasn't a total reboot even of the project; it was a partial rewrite off of a different codebase with as much code salvaged from the original project as possible (the initial Longhorn builds on top of the XP codebase were a total trainwreck, driving the partial reboot on top of Server 2003's codebase instead). They still needed two service packs to make Vista stable, and it took an entire subsequent major release (7) to right the ship, so to speak.

There would be nothing preventing a rewrite of core subsystems. Vista, despite it being a disaster for Microsoft, was only problematic because of changes made to the UX.

Actually Vista is interesting. For all its problems, it's actually an impressive upgrade - whilst there was development chaos due to a lack of focus, a new management team was able to significantly focus in on what needed to be done and hundreds of improvements were made under the hood. Microsoft rewrote their TCP/IP stack from scratch and included important system administration tools like an improved logging system, performance and reliability tools, a vastly improved desktop window manager, a new graphics driver model called WDDM and a raft of new kernel security measures.

These rewrites weren't rewrites of the entirety of Windows, but instead Windows was increasingly modularised and entire subsystems were rewritten. Whilst Vista was a disaster and had some real problems I don't think it was because of rewriting subsystems, I think it was due to UX changes.

Interestingly, Windows 7 was an amazing success and in fact built on Windows Vista, but again focused on modularisation and leveraging the work done in Vista.

My hypothesis that major UX changes caused Microsoft to stumble in Windows is borne out by Windows 8 - which was even more stable and performed better than Windows 7, but was rejected in the market because of insane UX decisions.

So a major rework can be done, but if you are going to do this then I suggest that major changes to UX should be done after the rework.

I don't think you can avoid major rework though. Without some of the things I highlight above, there is no way Microsoft would be able to improve their code or implement features that actually make sense!

This is basically the story of each of the last few generations of Windows - the core of the operating system has gotten considerably more stable and better at each iteration, only to be stymied by the UX changes which are undesired and generate considerable hate.

Worst of all is when those UX changes are spotty and inconsistent. Windows 10 is a fantastic upgrade, so far as the core operating system goes. But the UI is a complete trainwreck lava-layer, particularly in the control panel and settings dialogs. Bits of it are essentially unchanged from Windows 2000, other parts are Vista-era, others from the Windows 7 generation, yet others in the Metro Windows 8 or whatever they are calling the Windows 10 style. Some things can only be managed through the older UI, some only through the new UI, some in both, and some in both, but with subtle and non-obvious differences.

If you start from the desktop and go through the graphic properties layer you basically traverse all generations of panels with more details each time, down to the Windows 95 advanced appearance tool https://www.google.it/search?q=windows+advanced+appearance&r...

The biggest problems with Vista were

a) Lockdown of the "Program Files" directory and UAC, which meant existing applications had tricky-to-fix issues (sortable by IT folks, not by end users). Microsoft had been saying for years that many of these practices weren't allowed as part of the Windows certification program (I'm sure I did most of this work for Win2K certification)

b) Requiring new drivers for things like printers / scanners. Hardware vendors didn't want to do rewrites for old hardware (there's no money in it for them) so peripherals that used to work got "broken" by Vista.

Windows 7 was better because UAC was toned down and both software and hardware vendors had done the security work a few years earlier for Vista. Folks who did the XP->Win7 upgrade had a much easier time.

From a security PoV Vista was needed.

> Vista, despite it being a disaster for Microsoft, was only problematic because of changes made to the UX.

The only major regression that comes to mind here is UAC. I don't think Vista is considered a failure because of its UX decisions.

In my experience, Vista was incredibly slow (compared to XP) and unstable, and got even worse the longer you used it. Anecdotally, even PCs that shipped with Vista were often unable to run it smoothly.

I don't think Vista would have been a hit with an unchanged Windows XP user interface.

We began early April and we had more FT in-house developers than remote resources. The only reason we got remote workers was because it was next to impossible to find good developers.

We initially had a terrible experience with remote workers and it didn't work out at all. But we learned from our mistakes and made it work later on. When I left the company, 100% of its development (i.e. maintenance) was being done offshore.

Wasn't the rewrite a technical success despite warning signs?

The customer simply didn't want to adopt the shiny new software.

The rewrite was a technical success. But we missed real business opportunities and completely underestimated the effort of building a new system while maintaining the legacy one in parallel. In short, we didn't prioritize and miscalculated big time :-)

The customer's confidence in us had eroded because we weren't responding to their new feature requests in the legacy system. Plus their management didn't want to take any risks since the new system offered no new features to them and only benefited us.

Could you have made major changes to subsystems in the original system and then gradually worked towards a total rewrite incrementally?

You would have had to work very carefully and have an extremely solid test framework and methodology but I'm wondering if it might not have allowed for adding the features the client wanted and prevented a situation where you had to do a major cutover on code that wasn't tested in the field.

What about responding to their feature requests... in the new system?!

Then they'd have the motivation to make the switch.

We tried that and that's how we eventually got them to switch - 2 years later. The client required a feature to start 'live sessions' with mobile subscribers whenever they were active on the network (made or received a call) and support for multi-level menus. The original architecture was transactional and they understood its limitations. The rewrite used SEDA and we were having latency, GC and memory issues handling live sessions with SEDA. So we had to do another upgrade to switch from SEDA to an Actor-based model that worked. The next upgrade was small and incremental compared to the first one, and we had the system ready by the time the contract was finalized.

That's the problem. From my reading of the article, their client was, probably legitimately, concerned with the risk of a cutover to a new and unproven system.

To implement a new system the client always needs to do UAT. That's actually a drain on their resources, and if they have a working system it is far better to have incremental changes made and a regular set of bug fixes than to launch a new system, then find a raft of new bugs or unexpected changes that need to be fixed anyway.

Existing systems allow enterprises to operate effectively. New systems almost always cause unexpected disruption to a business regardless of efficiencies gained from the new system. And most IT systems are there to support the objectives of the business, they aren't the actual objective of the business. A software development company can lose sight of this because it IS the objective of their business :-)

Good point. The UAT was definitely one of the reasons. It was several Excel spreadsheets long and scheduling it was a pain :-)

Also, it was a hosted service and the client saw no value to them - just the pain. There was a risk of service outage during the upgrade that would have resulted in the loss of revenue. Also, the client had to provision extra resources to run Cassandra/SOA servers. In short, it was a lot of work for them with no benefit to them.

There is no way to know if it was a technical success since it was never used by real customers. It might be full of bugs and bad decisions. In fact, by default unused code should be assumed to be full of bugs and problems.

That's the reason behind the success of continuous integration and deployment. At least bad code stays hidden for a very short time and needs to be fixed before other things can be built on top of it.

"The operation was a success, but the patient died"

This is one of the reasons I love working on LibreOffice. I think most people acknowledge that LibreOffice is a pretty decent office suite. But it is just over 30 years old now and it has accreted massive amounts of legacy code and some questionable design decisions, and trying to track its history in version control can be a nightmare if you go back far enough.

But it would definitely be a mistake to rewrite it from scratch. There are millions of users who rely on it, and it's a massive code base that runs on three major operating systems - four if you include Android.

Instead, the LibreOffice developers are slowly working through refactoring the code base. They are doing this quietly and carefully, bit by bit. Massive changes are being done in a strategic fashion. Here are some things:

* changed the build system from a custom dmake based system to the GNU make based gbuild - leading to reduced build times and improving development efficiency

* the replacement of custom C++ containers and iterators with ones from the C++ Standard Template Library

* changing UI descriptor files from the proprietary .src/.hrc format to Glade .ui files

* proper reference counting for widgets in the VCL module

* an OpenGL backend in the VCL module

* removal of mountains of unneeded code

* cleanup of comments, in particular a real push to translate the huge number of German comments to English

* use of Coverity's static analysis tool to locate thousands and thousands of potential bugs most of which have been resolved

* a huge amount of unit tests to catch regressions quickly

This has allowed LibreOffice to make massive improvements in just about every area. It has also allowed new features like tiled rendering - used for Android and LibreOffice Online, the VCL demo which helps improve VCL performance, filter improvements which have increased Office compatibility and allowed importing older formats, a redesign of dialogue boxes, menus and toolbars which has streamlined the workflow in ways that make sense and are non-disruptive to existing users, and many more improvements which I probably don't know about or which have slipped my mind.

But my point is - if LibreOffice had decided to do a rewrite it would have been a failure. It is far better to refactor a code base that is widely used than to rewrite it, for all the reasons the article says.

True story. Similar start. A couple of enterprise customers, kludgy old code. Changeability was the problem. Decision made for total rewrite. Budget no probs, team of 10 allocated. I was the guy who sold the deals. Wanted to assess if we had a disaster on our hands. Asked CTO to explain how team was constructed. Here is how the conversation went:

Me: tell me about the team

CTO: well it is actually all being built by two guys.

Me: That's crazy. We have a 10 person team.

CTO: Actually I lied, it is actually all being built by one guy. Before you get me fired, let me explain. One guy just builds prototypes to keep all those idiotic requests from management.

Me: I knew the guy. Fastest damn developer on the planet

CTO: we just throw his code away. The other guy who we don't let you meet actually writes the code that works. You see nothing for weeks, then it emerges and works beautifully.

Me: what do you do with the other 8 guys?

CTO: give them make work to keep them out of the way.

The system was a market success for a long time.

That's fascinating, was everyone involved in on the deal, or how did they manage to keep that under wraps?

I rewrote from scratch once, after much deliberation and was successful. In my case, I had some extremely good reasons:

The original codebase was built using good old waterfall, where neither programmers nor testers had ever met a user in their life. So all the things that are good about working software kind of go out the window: There are a lot of abstractions in the code, but they didn't match the user's mental model, or help in any way.

My users didn't like the software at all. It might have met their requirements in theory, but not in practice.

The team that had built the application was pretty much gone, nobody that had any responsibility in any design decision was still in the company.

I actually had managed to get access to people that actually used the software, and people that supported those users, so relying on them, I was able to get a model of the world far better than the original developers ever did.

300K lines of Java, written over three years by 10+ developers were rewritten by a team of two in three months. All the functionality that anyone used was still there, but actually implemented in ways that made sense for user workflows.

I'd not have dared a rewrite had the situation been better though: Working code people want to use trumps everything.

I have always felt that if you need more than 3 developers and 12 months then the software is too big and the scope needs to be cut down. Better a small program that does one thing well than a monolith that does everything badly - why do we never learn from the UNIX philosophy?

I think what you're referring to is what the kids call "microservices".

Three months = 60 working days.

Two devs, so 120 working days.

300k lines of code. So, 2500 lines of production code rewritten per day. Sustained for three months. No holiday. No days off.

And this code was tested to be equivalent to the old code. So, let's say, being conservative, another 2500 lines of test code per day. (It's a lot more than that, of course, in most cases.)

So 5000 lines of working, tested code per day.
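For anyone who wants to plug in their own numbers, the back-of-envelope calculation above can be written out as a tiny sketch. The class and method names are made up, and the 60 working days assumes 20 per month, as in the comment.

```java
// Hypothetical helper for the estimate above; not taken from any real project.
final class RewriteThroughput {
    // Lines each developer must produce per working day to reach the total.
    static long linesPerDevDay(long totalLines, int devs, int workingDays) {
        long devDays = (long) devs * workingDays; // 2 devs * 60 days = 120 dev-days
        return totalLines / devDays;
    }
}
```

With the figures from the comment, `linesPerDevDay(300_000, 2, 60)` gives 2500 lines of production code per developer per day, and matching it one-for-one with test code doubles that to 5000.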

Plus, the team of two were checking in with the users to make sure everything was hunky dory. And the new team was designing a new "model of the world far better than the original developers ever did".

I'd have gone along with that bullshit until the final sentence: Working code people want to use trumps everything. No, no, young ninja, rock star, Jedi: Working maintainable code trumps everything.

Survived a complete rewrite for a client just now.

The previous team blindly relied on a single method from Spring framework's RestTemplate for every REST service call to the SAP backend.

    ResponseEntity<T> exchange(String url,
        HttpMethod method,
        HttpEntity<?> requestEntity,
        Class<T> responseType,
        Object... uriVariables)
        throws RestClientException
(ref: https://docs.spring.io/spring/docs/current/javadoc-api/org/s...)

This forced them to create object models structured to match the complicated and inconsistent request and response JSON strings that SAP used, instead of creating objects modeled after the business domains. The result, to no surprise, was a disaster: the client wasted five months with them, and I had to redo and complete the project, still on time.
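One common way to avoid that trap (not necessarily what was done on this project) is to keep the wire-level shape and the domain object separate, with a single mapper at the boundary. A minimal sketch in plain Java follows; every name in it, including the `ORDER_NO` and `NET_AMT` fields, is hypothetical and only stands in for whatever the backend really sends.

```java
import java.math.BigDecimal;
import java.util.Map;

// Hypothetical wire-level shape: mirrors the backend payload, quirks and all.
final class SapOrderDto {
    final Map<String, String> fields;
    SapOrderDto(Map<String, String> fields) { this.fields = fields; }
}

// Domain object modeled after the business concept, not after the JSON.
final class Order {
    final String id;
    final long totalCents;
    Order(String id, long totalCents) { this.id = id; this.totalCents = totalCents; }
}

// The only class that knows the backend's field names and formatting.
final class OrderMapper {
    static Order toDomain(SapOrderDto dto) {
        String id = dto.fields.get("ORDER_NO").trim();       // assumed field name
        long cents = new BigDecimal(dto.fields.get("NET_AMT")) // assumed field name
                .movePointRight(2)                              // 12.50 -> 1250
                .longValueExact();
        return new Order(id, cents);
    }
}
```

The rest of the application then depends only on `Order`, so a change in the backend's payload touches the DTO and the mapper and nothing else.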

I still see many teams make their function signatures in Java so convoluted like below and don't know what hit them.

    interface ToDoService {
        ListToDoResponse listToDo(ListToDoRequest request) throws ListToDoException;
        AddToDoResponse addToDo(AddToDoRequest request) throws AddToDoException;
        // ... and so on ...
    }

Can you shed some light on how to best solve this problem? I'm pretty much in the same boat with a restful service I depend on.

That guy really has done a rewrite and then got it back to the point where it is used. That is what I felt when reading the article. Many people just rewrite and then complain that their rewrite never gets used. So they don't draw the important conclusions and the next time they will try it again. If you really push yourself, your team and the users to using the rewrite and getting it back on track with the original solution, then you have to go through all the disappointments of your rewrite not being better than the original, the real pains you induce in your users, and solve so many problems you've never signed up for solving because you didn't know that the old, ugly solution actually solved these as well.

The pain of reading, understanding and fixing/improving the old, ugly software is never that painful. And the value is much more immediate since your users already use and know the old software.

I have incrementally refactored large codebases by committing to a guide that spells out which constructs to replace with what and what the final goal is, and by writing test cases. Once the other developers saw the guide as beneficial (or rather, once they were forced to use it), they were surprised how well it worked. All of this happened behind the scenes from the customer's point of view, while updates and bugfixes kept going out. A rewrite may be the answer sometimes, but many times a well-thought-out incremental refactoring guide is an easier path.

I'm facing this at work and wanting to rewrite, updates are too slow when one small change needs 1000 other changes in spaghetti code in piles of scripts...

Difference is we can roll our users over as they renew and in a year after release all new customers and old will have the same software.

Only about 2000 users, not as big a scale as the story here.

Jeff Meyerson's interview with David Heinemeier Hansson [DHH] includes rewriting software. It discusses the reasons rewrites have a bad reputation and the difference between rewriting because "it's not my code/language/idiom" and declaring "technical bankruptcy" as an extension of the technical debt metaphor.

Agree with DHH or not, he's thought deeply about a lot of software development tropes.


I listen to every DHH interview I can but his view on a rewrite is different than others that do rewrites. There is no "second system" per se in his world. He leaves the old system up for people that still think it works just fine. Instead of messing up the old versions with upgrades, they leave them up and if a customer wants to go to the new version that is "rewritten" they can.

What are everyone's thoughts on rewriting in order to simplify your developer overhead with the likes of React?

We are a small shop and have a non-game app for html, ios, and android. The non-public (used by less than 30 people) backend has two management interfaces in, you might want to sit down for this, Adobe Flex. There are 2 full time developers (1 backend/flex/html and 1 ios/html) and 1-3 part time (android/ios) contractors depending on workload/deadlines. Everyone is pretty siloed with mostly 1 additional cross over app.

I have read articles similar to this as well as worked with some greybeards that carry this philosophy of staying the course and righting the ship rather than jumping ship. However, this siren's song of React is luring me in with its philosophy of "learn once, write anywhere"[0]. Instead of siloed developers we'd simply have developers that would feel comfortable in [management interface/html/ios/android]

[0]: https://facebook.github.io/react/blog/2015/03/26/introducing...

Sometimes, you've got to dump the old stuff because it is just too painful to make changes on. I've rewritten more than a couple of projects that I inherited from less than competent outsourced contractors... Number one, these projects mostly didn't work, and had so many bugs that I'm not sure how it was ever shipped software that somebody paid for. Number two, it was all the worst kind of WebForms and Visual Basic monstrosity. I'm sure it's possible to write non-terrible code with WebForms that looks good, but I've never seen it, and it was a matter of a couple weeks to burn it down and build a proper MVC application instead. My only regret there is that I waited too long to rewrite, and wasted a few months battling the manure pile of the old version and trying to improve it incrementally.

I want to try a re-write project at least once in my career. I've been thinking about how I might tackle it:

The legacy system components are unlikely to have ideal boundaries (unless you're insanely lucky) due to organic growth. So step 1, identify key boundaries and implement them as libraries / micro services / whatever makes sense.

Step 2, identify the highest value component to refactor - that might not be the part that makes the system go better for the users, it might be tackling the bit generating most support noise so you can free up more dev resource.

Begin the refactoring with a release, the new component begins life as a shim between the component interface and the legacy component, calling out to the legacy component for all features initially but as functionality is migrated to the new component it gradually uses the legacy component less and less.

You can always release at any time and you'll have a system using new code where it's written and falling back to old code where not.
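A rough sketch of that shim idea, with made-up names (`FeatureComponent`, `ShimComponent`, `migrate`) and string-in/string-out features purely to keep it short. It assumes the legacy component can already be called through the same interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical component boundary shared by the old and the new code.
interface FeatureComponent {
    String handle(String feature, String input);
}

// Shim: serves migrated features itself and falls back to legacy for the rest.
final class ShimComponent implements FeatureComponent {
    private final FeatureComponent legacy;
    private final Map<String, Function<String, String>> migrated = new HashMap<>();

    ShimComponent(FeatureComponent legacy) { this.legacy = legacy; }

    // Called whenever a feature has been reimplemented in the new code.
    void migrate(String feature, Function<String, String> newImpl) {
        migrated.put(feature, newImpl);
    }

    @Override
    public String handle(String feature, String input) {
        Function<String, String> impl = migrated.get(feature);
        // New code where it exists, old code where it does not.
        return impl != null ? impl.apply(input) : legacy.handle(feature, input);
    }
}
```

On day one the `migrated` map is empty and everything goes to the legacy component, so the shim can ship immediately; each later release just registers more features.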

Lather, rinse, repeat.

Key challenges I see during the refactoring are managing state during the refactoring of complex behaviors of the original component. It may lead to a private interface which is more granular but only exposed to the new replacement component.

Interesting stuff for sure.

The only cases where I've seen rewrites work are when the software fails to consistently complete the task it was written to do. A lot of software is garbage but gets the job done. When you encounter hot spots it's best to slowly refactor until you get the system where it needs to be. Lots of little changes mitigate risk, and it is easier to track down bugs if you make a mistake.

Part of me was really hoping that some startup had written an MVP in Scratch and was seeing how far it could scale.

I don't believe in rewrites. But I believe in the evolution of the understanding, what a system is supposed to do and how to do it.

A customer might have a lofty web-scale vision, the first version is built with all scaling and no maintenance in mind. You are then faced with a decision: carry along the visionary nightmare and get almost no real work done, because the system itself is in the way, or rewrite the system with a fresh, minimal approach, focusing on a valuable target.

The point being: Software requirements evolve and sometimes, writing a first version of the software is the spec itself. Just as with specs, you can try to scavenge all good things from the previous version for the next one - and actually get better over time, despite being in a rewrite mode.

The new Software Development Manager always wants to rewrite the code from scratch. That's the way it's always been. I like that the guy who wrote this post had the courage to admit that he was motivated partly by the desire to play with new tech.

I recently faced the decision whether to rewrite.

I took over a rails + angular 1 codebase from another developer who left the company. I would be working on the project for a few months.

I rewrote the entire app with Meteor + React in a few weeks. Best decision I ever made. Even though the app has the same features, customers say the app is faster, has fewer bugs, and overall feels more solid. Security is also better because of the pub/sub system. It also gave me a chance to fully understand the ins and outs of the database and how the app works.

I can also develop features much faster with Meteor + React than with Rails + Angular 1. So the time spent re-writing is recouped by faster feature development.

> The development officially began in Summer of 2012 and we set end of January, 2013 as the release date.

This is the problem right there. In my experience, successful rewrites only happen when they get done quickly. If you know exactly what you're re-writing, you can get it done within a few days. If you have no idea what you're doing and think it'll take more than a week, it will probably end in failure. The truth is, I re-write code almost every day in my programming life. I usually don't realize I'm rewriting something until after the fact.

"Incremental change is always better for testability and validating the existing features"

- from 97 things every programmer should know

Interestingly, I've hardly seen this 'rewrite' vs 'no rewrite' debate with regard to microservices, with all the talk about them.

In theory it shouldn't be more work (as they are supposed to be so small) than some refactoring in your old (big) piece of software, and you can even rewrite it in a different language.

I think there are only two reasons to rewrite.

1) When you need to support a different platform, i.e. we have a Windows Desktop app and want a Web, Mobile or Mac version.

2) When the code base is so bad that adding any new feature creates many times more bugs.

A lack of design skills and code-cleaning skills can mislead you into thinking "the code needs to be rewritten from scratch".

However, as the new code grows, you're going to badly need these very skills anyways.

This is why all developers should build their own framework from scratch in their spare time... It scratches the itch and helps one appreciate existing software.

"and changed our protocols to binary, all at the same time. "

What does the author mean by that statement?
