Hacker News
The Birth of Legacy Software – How Change Aversion Feeds on Itself (rajivprab.com)
113 points by whack on Nov 26, 2019 | 66 comments

I quit my last job because my team was in this doom spiral. And management's solution to the problem? Work harder.

As far as I know, everyone else still works there. I bet some of them sleep at their desks trying to keep adding more knots to that awful legacy system.

And it didn't get that way WHILE I was there. That project had been around since 1998. They were so afraid of change that they were still using a C++98 compiler. So I didn't even leave that job with relevant (read: modern) C++ knowledge.

Funny thing: The server architect at my current job was one of the founding members on that project. He had the same kind of view towards it. I feel like that project has been in a doom spiral from the beginning. So 21 years at this point.

It blows my mind that there are these legacy code sweat shops out there barely holding software together like this. What a miserable existence.

>And it didn't get that way WHILE I was there. That project had been around since 1998. They were so afraid of change that they were still using a C++98 compiler. So I didn't even leave that job with relevant (read: modern) C++ knowledge.

This happened to me too!

I understand the reasons for that: it was highly specialized (scientific) software used in computer chip manufacturing. Most of the people on the team had a PhD in Engineering or Science, but few had extensive Computer Science training before joining the company.

The hardware industry moves slowly, because you don't just upgrade a fab. Some of the clients were on RH5, so that was our build target.

Part of the codebase was in Fortran (for good reasons, too; the language is common in the HPC/scientific scene).

All this resulted in a codebase that was part brilliant, part byzantine, poorly documented, and not even compiling with a modern C++ compiler (hence reluctance to start using C++11 or newer).

Thankfully, the automated test suite was holding it all together; but that was about it. As people left, they took systems knowledge with them that was written down nowhere, and nobody wanted to do a deep dive and document the still known parts.

Predictably, the priority has shifted to "quality" - that is, fixing bugs instead of innovating on the core functionality.

That was not the only reason I left for better opportunities, but it was a big factor. It wasn't the legacy system itself that was scary - no scarier than a hairy prototype that gets the job done, really - it's that nobody was going to put in the effort and take the risk to start moving past a decades-old "prototype" stage.

I believe that things finally started moving when the people there realized that there is no way but forward; I hope they are using modern C++ now. But that train was set in motion after I left.

Isn't "modern C++" pretty much an oxymoron, and pretty much follows the above definition of "legacy software" ?

It's... not?

The definition of "legacy software" in this discussion is: software that grew so many "temporary" fixes and workarounds instead of necessary architectural changes that it's in permanent maintenance mode, and entire parts of it are untouchable because nobody understands what they do due to the exponential increase in complexity stemming from the abundance of these hacks, exacerbated by system knowledge evaporating with engineer turnover.

That applies neither to the C++ language and compilers (which have evolved rapidly in the past decade), nor, generally, to projects written in it (that's not a property of the language).

On that note, we've integrated some Fortran code written in the 80s that I wouldn't call "legacy" under this definition: the algorithm was clear, the implementation was documented well enough that modifying it, if necessary, would not have been hard, and using it in our flow was very simple (it was one of the flavors of gradient descent algorithms, and it converged much better than several others).

> software that grew so many "temporary" fixes and workarounds instead of necessary architectural changes that it's in permanent maintenance mode, and entire parts of it are untouchable because nobody understands what they do due to the exponential increase in complexity stemming from the abundance of these hacks, exacerbated by system knowledge evaporating with engineer turnover

Pretty much this. What else am I supposed to think when I read this? https://stackoverflow.com/questions/17103925/how-well-is-uni...

We aren't on the ARPANET anymore - I expect even low-level programming languages to use native Unicode. The way C equates "char" with both character and byte is fundamentally broken.
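A minimal illustration of that mismatch (a sketch in C++, with an invented helper name): `strlen` counts bytes, while counting UTF-8 code points requires skipping continuation bytes.

```cpp
#include <cstddef>
#include <cstring>

// Sketch: count Unicode code points in a UTF-8 string. Any byte of
// the form 10xxxxxx is a continuation byte, so it does not start a
// new code point and is skipped.
std::size_t codepoint_count(const char* s) {
    std::size_t n = 0;
    for (; *s; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the UTF-8 string "naïve" (where ï is encoded as two bytes), `strlen` reports 6 while the code-point count is 5. Any code treating one `char` as one character is off by exactly that difference.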

(This is also/more(?) an issue with Unicode - IMHO we should have increased byte size to something like 32 bits (during the transition to 64-bit words ?) to be able to fit one character per byte - the increased hardware cost for text storage would have been quickly compensated by the decrease in developer costs. But here we are.)

This sounds a lot like the stories I have heard coming from ASML. Especially the part of people with a PhD in engineering writing code.

On the good side of things, their business must be doing really well, because they have the money to support that monstrosity and the motivation to: it brings them a lot of cash.

I've seen plenty of projects blown up by massive, fearless refactors to do things "the right way".

Careful with that axe, Eugene.

In the open office next door, people have been refactoring a C++98 monolith into more independent components, to get a better test suite, better CI integration, and a better deployment story. That sounds about right.

Well, the issue is that they have done it a bit sneakily: they removed all the legacy code they didn't understand. So the code is much more elegant, it has been moved to C++11 or 14, it ticks every good-practice box. There is only one slight issue: it does not work. It somewhat works, but it is not reliable and fails regularly in unexpected ways. And they started five years ago, and haven't delivered any business value since then.

At the beginning, it was okay because they had some leeway but now they are blocking the release of new products, and our market share is in free fall.

Heads have started to roll.

To be fair, a few years down the line, their team will likely be more productive and efficient, but I am still not sure the cost of the rewrite was justified. Still, the article is very on point about the risk of not paying down your technical debt.

> they removed all the legacy code they didn't understand.

This is exactly why rewrites and huge refactors typically fail. The new programmer sees code, doesn't understand why it's there, thinks the last programmer was an idiot, and deletes said code. Unknown to the new programmer, that code handles some weird edge case.

A rewrite should spend 80% of the time understanding the old code and 20% writing the new. But, that's no fun for most programmers who just want to code in the latest shiny so the new code ends up broken.


Chesterton's fence is the principle that reforms should not be made until the reasoning behind the existing state of affairs is understood. The quotation is from G. K. Chesterton's 1929 book The Thing, in the chapter entitled "The Drift from Domesticity":

In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, "I don't see the use of this; let us clear it away." To which the more intelligent type of reformer will do well to answer: "If you don't see the use of it, I certainly won't let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it."

I've taken part in a "rewrite" like what you describe. It was not the classic "Write a whole new system in parallel, then migrate everyone over." It was the Ship of Theseus executed in software, slowly, over the course of a couple years.

The lights stayed on the whole time. We retained the ability to switch back to the old system by setting a feature flag until everyone (including the business users) was confident that the new system was working as well as - preferably better than - the old one.
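For anyone unfamiliar with the mechanism, a toy sketch of such a flag (names invented, not from our actual system): each call site dispatches on the flag, so switching back is a configuration change rather than a rollback deploy.

```cpp
// Toy feature-flag dispatch: the new implementation is reachable only
// when the flag is set, and the legacy path stays intact until both
// paths agree and the business signs off on the cut-over.
struct PricingService {
    bool use_new_impl = false;  // the feature flag

    int quote_legacy(int units) const { return units * 100; }
    int quote_new(int units) const { return units * 100; }  // must match before cut-over

    int quote(int units) const {
        return use_new_impl ? quote_new(units) : quote_legacy(units);
    }
};
```

The legacy path is deleted only after the flag has been on for everyone, for long enough, with no complaints.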

Throwing out code we didn't understand was simply unthinkable, because we weren't allowed to have that 6 months of time fantasizing that everything would be great before launch day - whatever you were working on now would be expected to go into production within a week or two, without upsetting the rest of the business in the process.

No, it wasn't very fun. But it was satisfying work. I think it was my boss's boss who observed that making all the team members happy won't make a project successful, but everyone's ultimately happy to see a project successfully executed.

Which is why it is sometimes crucial to add comments to your code. I know certain code basically documents itself, but that depends on the person reading it. In my eyes, it is precisely edge cases that might or might not be known that profit from decent explanatory comments.

As an avid writer of comments, I am convinced they also help me to form thoughts and remember them later, so at times I will write the comments before writing the actual function.

If that is too bold, commenting while the thing is still in your head makes sense anyway — it saves you time later and helps everybody else who will look at your code.

And please write proper commit messages!

A short summary, preferably including a hint at which subsystem is impacted by the change. Then explain in detail the context of the change: root cause of a bug and the gist of the solution, use case(s) behind a new feature and how it can be used, ...
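For instance, a made-up message following that shape (subsystem prefix, short summary, then self-contained context; the ticket ID and all details here are invented):

```
db: fix deadlock in concurrent order import

Two importers could deadlock because rows were locked in insertion
order rather than primary-key order. Sort each batch by primary key
before the update loop so all writers acquire locks in the same order.

Root cause analysis in PROJ-1234 (gist repeated here in case the
tracker ever goes away).
```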

Yes, often there's a bug/project tracking tool in use and the commit message contains a reference to the relevant entry there. But from experience I know these tools tend to change: the old one gets decommissioned, data gets migrated, what was once the primary identifier is now a mere field or comment in the new system, access rights get messed up, ... Trying to understand the history then turns into an archeological expedition through various eras long gone... unless the commit messages are sufficiently self-contained.

The issue seems to be that that tracking tool does not have its data backed up in the version control system?

This seems like it could be mitigated with an integration that watches the issue tracker for updates and then commits those updates to a docs folder on the appropriate branch of the appropriate repo.

This would have saved quite a bit of headache at my last job actually.

Or just use a version control system with issue tracker support, like Fossil ?

Switching an existing project to a new VCS loses history or at minimum causes a new layer of "identifiers in the old system don't match the new one", no?

In general, I've found value in figuring out how to improve existing systems where feasible rather than trying to migrate to a new system, since the existing system probably has advantages the new one won't. At minimum, people are already familiar with the existing system.

Of course - but I'm not sure why you assumed that I would suggest that, instead of using a different VCS for a new project.

Ah. I've spent a lot more time working on projects I inherited rather than ones where I made the initial technical decisions - if you're starting a new project, you should use the most capable tools available.

I agree. I also use git blame and track down when and why code was added. But this is tedious work that I find many people don't like to do for some reason. Me, I like the investigative work of tracking something down.

I also personally believe, and this isn't just for code but for any project you'll pass on to another person, in a connected README file explaining the high-level reasoning and other thoughts, including a personal changelog and roadmap, and some instructions for use and editing.

Deleting all code you don't understand is also stretching the term "refactoring" a bit beyond its usual meaning :-)

Deleting legacy code you don't understand is the best way to remove fixes for bugs you didn't know you had.

I learned from Microsoft that rewrites are hard! They tried it with Word, and I think the story of how it went is written up somewhere. They rolled back later on and have kept modernizing piece by piece since.

That, I think, is the right way.

To play the devil's advocate: it might not be a one size fits all solution.

When you're at Microsoft and can just walk up to the best researchers and programmers in the world, maybe. When you're at some corporation where you have to spend half a day on the phone to get your computer unlocked by desktop support, and a request to change a config on a web server becomes a ten-foot-long email chain about whose fault it is that we need this change, I don't think people have any motivation to modernize piece by piece.

Then there is the issue that you'll have to explain why part of the application is in .NET Core and part is in .NET Framework 3.5...

> ...and I think the story of how it went is written up somewhere...

Maybe this? https://blogs.msdn.microsoft.com/rick_schaut/2004/02/26/mac-...

Couldn't find any other reference, nice reading!

The legacy code should have been running in parallel with the new code, at least until it reached feature parity (or reduced features, with other features removed because they were not being used).

It is definitely a textbook example of things not to do, but what is painful is that they started with good intentions. (People with good intentions are the most dangerous ones, though.)

One thing to watch out for, I think, is people who want to do this massive refactor (an oxymoron as another comment notes) and also have a work history consisting entirely of short employments. These people never get to experience the actual result of their refactoring.

Anyone who tells you they’re doing a massive refactor is full of shit. The point at which you should stop trusting them is the moment that phrase escapes their lips or is used to describe them.

And I say that as someone who has committed a few, in both senses of the word.

The point is to be deterministic, and it can only be deterministic if it is small.

For real. Software development can be like building and maintaining a house.

Except none of the contractors agree on what materials to use. So one section is steel, another is wood, and yet another is brick. Meanwhile a third party outside is attempting to load the whole place onto a truck and ship it somewhere else.

I think that the problem is most often something along the lines of: we have a house, it took 5 years to build and the family is very happy with it. But now, they also want a chip fab, and there are many common parts useful to both, so we don't think it's worth it to start from scratch - we need to modify the home to also house their chip fab, but make sure it's also still a good house to live in.

"home to also house their chip fab, but make sure it's also still a good house to live in."

And then someone else comes along and asks, "This does fly, doesn't it?"

"No, but we could add flying behavior to it, and it would definitely be easier than throwing away the home and the chip fab that we spent so long on"

Travelling like a submarine underwater is the same as flying isn't it?

And if it’s a web app, they want the fab, a gym, and a jacuzzi, and oh can you do that without adding on any rooms? We have enough already and it is confusing people. Just put the chip fab in the living room.

Is a massive refactoring an oxymoron?

'a series of small behavior-preserving transformations, each of which "too small to be worth doing"' https://martinfowler.com/books/refactoring.html
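A toy example of one such transformation, in the spirit of that definition (code invented for illustration): extracting the discount rule is individually "too small to be worth doing", yet it leaves behavior unchanged and the rule is now named and testable on its own.

```cpp
#include <vector>

// Before: the discount rule is buried inside the totaling loop.
int total_before(const std::vector<int>& prices) {
    int sum = 0;
    for (int p : prices) sum += p;
    if (sum > 1000) sum -= sum / 10;  // 10% bulk discount
    return sum;
}

// After one small, behavior-preserving step (Extract Function):
// the discount rule has a name and can be tested in isolation.
int apply_bulk_discount(int sum) {
    return sum > 1000 ? sum - sum / 10 : sum;
}

int total_after(const std::vector<int>& prices) {
    int sum = 0;
    for (int p : prices) sum += p;
    return apply_bulk_discount(sum);
}
```

The existing test suite is what certifies each step: every input must produce the same total before and after.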

The solution from the book mentioned in the article (one of the must-read books for devs, IMO) is to have tests for your functionality. Those kinds of tests help a lot when refactoring.

Several years of wasted development effort have I seen.

It doesn’t have to be that way. I’ve worked on plenty of code where the code at the end is barely the code we started with. Ship of Theseus style.

I have noticed that database changes on a live system are the hardest / most risky to implement. Also they are most likely to get to the heart of the problem. I learned what the article is talking about through experience - avoiding the "harder" changes and piling fix on top of other fixes until things become a mess.

As always, there is more than one valid perspective on this problem.

Given the very first example from TFA, the problem is solved (i.e. the new requirement is satisfied) by adding only a few lines of code. In fact, a massive refactoring was not required. I would suggest that this is actually A Good Thing(tm) and may even be indicative of a Good Design(tm). If every new requirement requires the system to be overhauled, you're definitely in a worse situation.

The lack of tests/testing is a wholly separate issue.

If the new requirement comes with new tests, and even better, tests for the old behavior and the new behavior, all the better: you can refactor the system at a later time to make it cleaner and simpler and still meet the requirements, which is even better, since you have a fuller picture of what the actual requirements are.

Refactoring in the face of every new requirement smacks of poor initial design.

“Make the change easy, then make the easy change” is one of the tenets of refactoring. You do, in fact, refactor the code with every requirement.

I really, really hated YAGNI about ten years ago but have come around a bit. Our users are empirically insane. You cannot guess what an insane person will do next, and trying will only make you insane, too.

I’ve gotten a lot of good mileage and a lot less stress by following the Rule of Three (architecting on the third example). I’ve learned to spot bullshitters versus sincere YAGNI folks (but many more if the former). The critical factor is identifying which decisions are reversible, not investing much energy in them, do everything you can to delay irreversible ones, and failing that make people pay attention to what they are choosing.

> Refactoring in the face of every new requirement smacks of poor initial design.

This is true.

The idea that a system is well-designed if it allows changes/unexpected new requirements via small, mostly-additive, easily testable changes is also true.

But it is also true that systems remain well-designed until they aren't any more, because too many things have been changed/added. And identifying when that watershed has occurred (or, for extra seniority points, identifying in advance when it is likely to occur) is critical to good engineering over the long term. The point at which "you can refactor the system at a later time" becomes "that used to be the case, but now we need to actually pay down the debt, the tradeoffs have gotten too bad" is the most valuable to identify.

> Because no one in the team has a great understanding of the system

Isn't that the main issue? I understand to not want to do invasive changes when you're not familiar enough with a system, but the solution seems to be:

1. get familiar with the system (doing small non-intrusive changes is a good way to start to become familiar with it)

2. do more invasive refactoring

For sure don't start directly with the invasive changes, but at some point you need to get a good understanding of the system you are working on, or are responsible to maintain.

The problem with "having a great understanding of a system" is that it's hard to pin down, to define what that actually means. You can always use that as an excuse, if you are gatekeeping a poorly constructed system and resisting change.

If you require "full understanding of our system" for me to add some new functionality in a module, then chances are pretty good that your system has a bunch of problematic dependencies, no?

The article seems to be about "invasive and risky changes" (that's from the article). In this case I expect some level of gatekeeping by people familiar with the system. If the people doing the gatekeeping aren't familiar with whatever they keep, then that's the first thing to work on.

> If you require "full understanding of our system"

I wrote "familiar enough", and "good understanding", not "full understanding". "enough" and "good" will of course depends on the context.

Considering this:

> then chances are pretty good that your system has a bunch of problematic dependencies, no?

In practice you need to understand some level of the context in which your module exist, hopefully not all of it, though of course it would be better to be able to just focus on the module itself. By "system" I don't necessarily mean the whole, complete infrastructure, it can be the module.

My point was that if people are blocking changes because of lack of understanding of something, the solution would be to actually get some level of understanding.

Edit: also, I assume good faith from gatekeepers.

One of my biggest failures was while managing a team like this. In this organization I was a fairly well-liked leader who had had both successful and failed projects, and I was brought in to stabilize a core team that was suffering rampant attrition, had a two-year project that had practically stalled, and felt somewhat under siege from the rest of the company.

Interestingly, the whole situation was almost exactly what the article describes, and a full year later I had failed to correct it (and had probably made it worse). Here's what I did, so that others may learn from it:

* Since the team had stalled on that project for two years, I pretty much sidelined it and attempted to get us in a habit of getting wins so that we'd feel comfortable with changes. Result: We just accelerated the muddying of the codebase. The team wasn't constantly upset anymore but the size of the improvements you could make was decreasing - a sign that you're in the doom spiral.

* Set aside time for us to perform the refactors, but allowed the team to identify the primary pain points and focus on them. Result: The muddied code-base was like an underwater cave, the silt was in established places. The attempted refactors were just more muddying. With two years of living with an on-call rotation, the team's primary focus was in trying to remove the things that caused them direct pain, not the underlying causes of them, because they felt that the time we had was insufficient to do anything meaningful.

* I championed automated testing, replicable builds, and atomic changes, but failed to make the argument stick, since the net takeaway was "doing it that way would be great but it would be too slow". Perhaps I should have enforced that approach rather than trying to convince people of it. We got somewhere with the testing, but it was a big effort, and it cost me a little bit on the culture side since it had to come top-down from me.

* I tried 'leading from the trenches' so to speak, but that was unsustainable. Things worked while I was there, working as an engineer along with everyone else, but then I had to sacrifice other crucial things that were necessary to retain the state.

In the end, things got somewhat better there in terms of attrition and big system failures, but the doom spiral was still there. So it was like adding x years to failure rather than putting us on the path to success. I'd like to think that now, with some distance, I have a much better picture of what to do.

To be honest, I really don't think that adding a slew of manual testers is really the solution. The iteration speed will plummet because now there is one more hinge to the arm controlling output.

The annoying bit about the whole thing? None of that is useful to me now. When you're trying to build a startup you wish you had problems like this because it means you are already successful. Ugh. Maybe some day it will be of use.

> I'd like to think that now, with some distance, I have a much better picture of what to do.

OK, what, please share!! You told us what didn't work, in retrospect with distance, what would you have tried differently?

I think I would have trouble expressing the nuance in a HN comment given how much time I'm willing to spend on it.

I think the best thing to do is use the existing system as a really, extremely detailed interactive requirements doc for a complete rewrite with modern practices and good documentation, then run both in production until you can switch the old one off.

Complete rewrites seldom work out well. Using a current version's behaviour as a "requirements doc", rather than the code or an actual up-to-date document, probably misses a good deal of functionality that is rarely used but critical in some conditions. The thing about legacy software is that it has encoded in it years or even decades of changes to functional and technical requirements, workarounds for edge cases that weren't foreseen, bug fixes, and optimizations. Each of those can look like a mess to the observer, but it is foolish to think a new system will not run into similar issues. The idea of starting over to "do it right" is a fallacy because perfection does not exist. This is why the software industry is increasingly focused on processes that reap the benefits of quick turnaround and make change easier to deal with.

Believe me when I say the day you switch the old system off will never come in the vast majority of scenarios like this. The result is either wasted development time or, probably worse, you now have two systems to maintain and keep running.

That is basically an example of the sunk cost fallacy. Yes, those bug fixes are important, but it's easier to find and fix them again in a tool with a proper architecture. The idea is to run the original and the redesigned code in parallel, find the differences in behaviour, and fix them.
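That parallel run can be as simple as a differential harness (a sketch, not anyone's actual tooling): replay real inputs through both implementations and collect every input on which they disagree, and only retire the old system once the diff list stays empty.

```cpp
#include <vector>

// Differential testing sketch: run legacy and rewritten code on the
// same inputs and return the inputs where their outputs differ.
template <typename In, typename Legacy, typename Rewrite>
std::vector<In> divergences(const std::vector<In>& inputs,
                            Legacy legacy, Rewrite rewrite) {
    std::vector<In> diffs;
    for (const In& x : inputs) {
        if (legacy(x) != rewrite(x)) diffs.push_back(x);
    }
    return diffs;
}
```

Every divergence is either a bug in the rewrite or an undocumented requirement the old system was quietly meeting; both are worth knowing before the cut-over.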

We recently moved from a heap of Matlab code - which started as a student project in 2001 and is a huge tool used in the industry today - to a new implementation in C++ and Python. It has been a huge success with our customers.

It's not a sunk cost fallacy at all; sunk cost is throwing further money at something that has already failed. If the code is being used, it obviously succeeded. The product works. Otherwise it wouldn't even be worth rewriting.

The reason why people warn against rewriting is that it's a risk, a gamble, and often a conceit by the programmers. Programmers will also often spectacularly underestimate how hard a full rewrite will actually be.

You're taking something that works, and attempting to recreate it. You can find lots of examples where rewrite projects went spectacularly wrong. A commonly cited example was the Netscape rewrite (which killed a hugely successful company).

Your gamble paid off, but it's almost always the worst decision you can make. There are even examples in this thread of rewriting going wrong.

Of course it depends on several factors. In our case, we had a mess of spaghetti code which crashed at important moments, and there was no easy way to locate the issues. Sometimes, when your code requires prayers and sacrifices to appease the gods and devils, maybe it's time for a rewrite. However, there are programmers who want to rewrite just because they think the old architecture is bad, when it is mostly fine.

But the forces which caused the current code to end up in such a sorry state will still be in force and cause the rewrite to end up in the same place.

If you have reason to think you can do better this time - e.g. the team has learned how to avoid unexplainable crashes - then you could apply this knowledge to fix the issues in the current code, which would be much less risky and take less time.

In our scenario, the tool had no requirements. A student project did something a team found useful and it saved money. More students were given similar projects. Then an engineer comfortable with coding glued together the code from all these students. This went on for 17+ years (converting several person-months of work into a few hours). The team lead got promoted to a very high position in the firm, and he hired a software team to redo everything with coherent code.

For us, the first project acted as the requirements analysis. In most cases bad software is mainly the result of a lack of proper requirements. In hindsight, it's easier to make a complex tool coherent.

It is not the "sunk cost fallacy" to not want to fix the same bugs twice!

I think the word "bug" is not correct in this context. If a rewrite of the software contains the exact same bug, then it means that the requirements were not well defined.

There are certainly situations where that can work well, but there are also many instances where running two solutions in parallel is anything but trivial and adds another layer of complexity - possibly to the point of multiplying the cost of the rewrite. I've been part of projects where that exact scenario occurred.

A complete rewrite is rarely a good idea. Joel Spolsky summed it up in one of his great blog posts: https://www.joelonsoftware.com/2000/04/06/things-you-should-...

But why not just write good documentation for the existing system? Then you have the best of both worlds: Good documentation and a system which actually works.

I believe this type of rewrite is a lot less risky that what you're suggesting: https://martinfowler.com/bliki/StranglerFigApplication.html
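The core of the strangler-fig approach is a facade that routes migrated functionality to new code while everything else falls through to the legacy system; the migrated set grows until the old system sits idle. A toy sketch (all names invented):

```cpp
#include <map>
#include <string>

// Strangler-fig facade: routes that have been migrated are handled by
// new code; everything else still falls through to the legacy system.
struct StranglerFacade {
    using Handler = std::string (*)(const std::string&);
    std::map<std::string, Handler> migrated;

    static std::string legacy_handle(const std::string& req) {
        return "legacy:" + req;  // stand-in for the old monolith
    }

    std::string handle(const std::string& route, const std::string& req) const {
        auto it = migrated.find(route);
        return it != migrated.end() ? it->second(req) : legacy_handle(req);
    }
};
```

Migration then means adding one route at a time to `migrated`, with the legacy path still available for instant rollback.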

Don't do a rewrite until you have an effective grasp on what the requirements on both the old and the new system are/were. A running system is better than nothing, but it is no substitute for requirements engineering!

So go from a well tested production system to a completely untested (in the real world) system?
