How to Improve a Legacy Codebase (jacquesmattheij.com)
653 points by darwhy 236 days ago | 293 comments

> Do not fall into the trap of improving both the maintainability of the code or the platform it runs on at the same time as adding new features or fixing bugs.

I don't disagree at all, but I think the more valuable advice would be to explain how this can be done at a typical company.

In my experience, "feature freeze" is unacceptable to the business stakeholders, even if it only has to last for a few weeks. And for larger-sized codebases, it will usually be months. So the problem becomes explaining why you have to do the freeze, and you usually end up "compromising" and allowing only really important, high-priority changes to be made (i.e. all of them).

I have found that focusing on bugs and performance is a good way to sell a "freeze". So you want feature X added to system Y? Well, system Y has had 20 bugs in the past 6 months, and logging in to that system takes 10+ seconds. So if we implement feature X we can predict it will be slow and full of bugs. What we should do is spend one month refactoring the parts of the system which will surround feature X, and then we can build the feature.

In this way you avoid ever "freezing" anything. Instead you are explicitly elongating project estimates in order to account for refactoring. Refactor the parts around X, implement X. Refactor the parts around Z, implement Z. The only thing the stakeholders notice is that development pace slows down, which you told them would happen and explained the reason for.

And frankly, if you can't point to bugs or performance issues, it's likely you don't need to be refactoring in the first place!

From personal experience, a good way of approaching the sell to business stakeholders is getting them involved in the bug triage and tracking process.

You need to make the invisible (refactoring and code quality) visible (tracking) so they can see what the current state is and map the future.

The biggest reason business stakeholders push back against this is that developers tend to communicate this in terms of "You don't need to know anything about this. But we've decided it needs to be done." Which annoys someone when they're paying the hours.

I've had decent success with bringing up underlying issues on roadmaps, even to the generalness of "this feature / component has issues." It's a lot easier conversation if it's adding "That thing that we've had on our to-do list for a couple months" vs "This new thing that I never told you about."

And as far as pitching, if the code is at all modular, you can usually get away with "new feature in code section A" + "fixes and performance improvements in unrelated section B" in the same release.

PS: I love the simple counter-based bookkeeping perspective from the linked post. (And think someone else suggested something similar in a previous performance / debugging front page article)

I've tried this "getting them involved" approach and it failed miserably for me. I've tried explaining why module A had to be decoupled from module B to stakeholders. I've tried explaining why we need to set up a CI server. I've tried explaining why technology B needs to be isolated and eliminated.

In almost all cases they nod and feign interest and understanding and their eyes glaze over. And why should they be interested? The stories are almost always abstract and the ROI is even more abstract. It's all implementation details to them. These stories usually languish near the bottom of the task list and you often need to sneak it in somehow to get it done at all.

I think the only real way of dealing with this problem is to allocate time for developers to reflect on what in the code needs dealing with (what problems caused everybody the most pain in the last week?), then time to plan refactoring and tooling stories, and time to do those stories alongside features and bugs.

Stakeholders do need to assess what level of quality they are happy with (and if it's low, developers should accept that or leave), but that should be limited to telling you how much time to devote to these kinds of stories, not what stories to work on and not what order to do them in.

I don't see why they shouldn't have visibility into this process but there's no way they should be allowed to micromanage it any more than they should be dictating your code style guidelines.

This is, IMO, the single worst feature of SCRUM - one backlog, 100% determined by the product owner whom you have to plead or lobby if you want to set up a CI server.

> In almost all cases they nod and feign interest and understanding and their eyes glaze over. And why should they be interested? The stories are almost always abstract and the ROI is even more abstract. It's all implementation details to them.

If you're explaining it in terms of internals and implementation details, then you're always going to get this response.

Your job as a business-facing developer is to translate the technical details (in as honest a way as is possible) into a business outcome.

I'm not naive. We've all worked with stakeholders that make stupid choices and can't seem to grasp a point dangled right in front of them.

But. Even more often than that I've seen (especially in-house) IT talk down to the business, push an agenda through the way they summarize an issue or need, and try to use technical merits to subvert corporate decision making.

Ultimately, you're in it together with business stakeholders. Either you trust each other, or you don't. And "the business can't be trusted to make decisions that have technical impacts" is the first step towards a decay of trust on both sides.

>If you're explaining it in terms of internals and implementation details, then you're always going to get this response.

You're also going to get this response if you explain in terms of a business case.

The business case for literally every refactoring/tooling story is this, btw:

This story will cut down the number of bugs and speed up development. By how much will they speed up development? I don't know. How many bugs and of what severity? Some bugs and at multiple levels of severity and you're not going to notice it when it happens because nobody notices bugs that don't happen. By when? I don't know, but you won't see any impact straight away.

The benefits are vague and abstract. The time until expected payoff is long. Vague, long term business cases don't get prioritized unless the prioritizer understands the gory details, which, as we both know, they won't.

The features and bugfixes - user stories - are not vague. They get prioritized.

>I'm not naive. We've all worked with stakeholders that make stupid choices

I am not complaining about stakeholders in general. I've worked with smart stakeholders and dumb stakeholders. I've never worked with a stakeholder that could appropriately compare the relative importance of my "refactor module B" story and "feature X which the business needs". All I've worked with are stakeholders who trusted me to do that part myself (which paid off for them) and stakeholders who insisted on doing it for the team because that's what SCRUM dictated (which ended badly for them).

>Ultimately, you're in it together with business stakeholders. Either you trust each other, or you don't. And "the business can't be trusted to make decisions that have technical impacts" is the first step towards a decay of trust on both sides.

No, the first (and indeed, only) step is not delivering.

> This story will cut down the number of bugs and speed up development. By how much will they speed up development? I don't know. How many bugs and of what severity? Some bugs and at multiple levels of severity and you're not going to notice it when it happens because nobody notices bugs that don't happen. By when? I don't know, but you won't see any impact straight away.

Point to historical data where possible. SWAG where appropriate.

"We've probably spent over 100 hours fixing bugs in this janky ass system for every 10 hours of real honest-to-god implementation of features work. That outage on Friday? Missing our last milestone by a week? All avoidable. We've been flying blind because we have no instrumentation, and changes are painful. Proper tooling would've shown us exactly what was wrong, easily halving our fix time, even if nothing else about this system changed. A week's worth of investment would've already paid itself off."

Frankly, I'm way better at estimating this kind of impact than how long it'll take to implement feature X.

"Currently, every time we want to build a release of the software in order to test it before deployment, __ developers need to stop working on features and maintenance while we go through the build process, which takes __ hours/days. There are a lot of manual steps involved, and we found that we make an average of __ errors in the process each time, which takes an additional __ hours/days to resolve. We go through all of this __ times a year.

We've determined that we can automate the entire process by setting up a Continuous Integration (CI) server. There's some work involved in setting it up; we estimate it will take __ days/weeks to get it running. But once it's running, (we'll always have a build running __ minutes after each code change)|(we can click on a button in the CI's GUI and we'll have a build running __ minutes later), and we'll be saving __ hours/days of effort per build/year."

Plug in your numbers. If the time to deploy the CI server exceeds the savings, the business would be justified in telling you not to do it. (You'd have to make a case based on quality and reproducibility, which is tougher.) If the cost is less than the savings, the business should see this as a no-brainer, and the only restraint would be scheduling a time to get it done. (Not having it might cost more, but it might not cost as much as failing to get other necessary work done.)
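The fill-in-the-blanks template above boils down to a small break-even calculation. Here is a minimal sketch in Python, assuming everything is measured in person-hours and that a build on a working CI server costs roughly zero developer time. The function name and every sample number are hypothetical placeholders, not figures from this thread:

```python
def ci_break_even(devs_blocked, build_hours, errors_per_build,
                  hours_per_error, builds_per_year, setup_hours):
    """Return (yearly_manual_cost, net_first_year_saving) in person-hours.

    Assumption: once CI is running, a build costs about zero developer
    time, so the whole manual cost counts as a saving.
    """
    # Cost of one manual build: blocked developers plus time fixing mistakes.
    per_build = devs_blocked * build_hours + errors_per_build * hours_per_error
    yearly = per_build * builds_per_year
    return yearly, yearly - setup_hours


# Hypothetical numbers: 2 devs blocked for 3 hours, 1 error costing 4 hours,
# 12 builds a year, 40 hours to set up the CI server.
yearly_cost, net_saving = ci_break_even(2, 3, 1, 4, 12, 40)
print(yearly_cost, net_saving)  # 120 80
```

A negative second value means the setup would not pay for itself in the first year, which is the case where you'd have to argue from quality and reproducibility instead.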

> If the cost is less than the savings, the business should see this as a no-brainer, and the only restraint would be scheduling a time to get it done. (Not having it might cost more, but it might not cost as much as failing to get other necessary work done.)

And that's the crux of the problem. The business invariably mistakenly believes that piling more features onto the steaming pile of crap that is the codebase is the better solution. Add to that that some mid-level PM promised feature X to the C-level in M months, where M is such short notice that even an engineering team with cloning and time machines would be short-staffed, and was chosen without even asking the engineering staff what their estimate of such work would be.

To the business, the short term gains of good engineering practices are essentially zero. The next feature is non-zero. The long-term is never considered.

I've had multiple PMs balk at estimates I've given them. "How could internationalizing the entire product take so long? We just need to add a few translations!" No, we need to add support for having translations at all, we need to dig ourselves out from under enough of our own crap to even add that support, we need to figure out what text actually exists, and needs translating, actually add those translations, and we need to survey and double-check a whole host of non-text assets because you mistakenly believe that "internationalization" only applies to text. Next comes the conversation about "wait, you can't just magic me a list of strings that need translating? I need that for the translators tomorrow!" No, they're mixed in with all the other strings that don't need translating, like the hard-coded IPv5 address of the gremlin that lives in the boiler room eating our stack traces.

Then, later, we'll lose a week of time because the translation files that engineering provided were turned into Word documents by PMs. One word doc, with every string from every team, and then those Word docs got translated. So now we have French.docx, but that of course only has the French. So now engineers are learning enough French to map the French back to the English so they know what translations correspond to what messages.

Depending on how convoluted the case is, you don't know what the end result would save in costs. "we'll be saving __ hours/days of effort per build/year" is a complete unknown.

If the expected value of a task is a complete unknown, then there is NO business justification for doing the task. As an engineer with the responsibility (or desire) to get business buy-in for a task, you must learn to quantify its value in terms that are meaningful to the business.

It doesn't have to be cost, that just happens to be easiest because it can be opinion-free. You can also express value in terms of business risks or opportunities, but the impact can be seen as an opinion, and you can be challenged by someone with different opinions.

That's the thing: the research and cost assessment itself would take days of work. So not many would care, and one might argue why /should/ you care to begin with. That's how it ends up being stalled at the idea stage.

I can conjure a scientific wild ass guess on the spot.

"I wasted around 10 hours last week thanks to inadvertently pulling broken builds because we don't have a CI server. I spent 4 hours manually deploying things because we don't have a CI server."

"When can I move the new CI server I already setup on my workstation - because fuck wasting half my week to that nonsense, and I had nothing better to do while the devs who broke the build fixed it - to a proper server where everyone can benefit?"

Extrapolating, that's what - 4 months per year of potential savings?

Sure, 14 hours might not be enough time to automate your entire build process, but it should be enough to automate some of it, get some low hanging fruit and start seeing immediate gains. Incrementally improve it for more gains when you're waiting for devs to fix the build for stuff the CI server didn't catch.
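The "4 months per year" extrapolation above can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes a 40-hour work week and 52 working weeks a year with no holidays; neither assumption is stated in the comment:

```python
HOURS_PER_WEEK = 40         # assumed full-time week
WEEKS_PER_YEAR = 52         # assumed; ignores holidays and vacation
WEEKS_PER_MONTH = 52 / 12   # about 4.33


def months_lost_per_year(hours_wasted_per_week):
    """Convert a weekly time loss into work-months per year."""
    hours_per_year = hours_wasted_per_week * WEEKS_PER_YEAR
    work_weeks = hours_per_year / HOURS_PER_WEEK
    return work_weeks / WEEKS_PER_MONTH


# 10 hours lost to broken builds + 4 hours of manual deploys per week
print(round(months_lost_per_year(10 + 4), 1))  # 4.2
```

Under those assumptions the guess holds up: 14 of 40 hours is 35% of the week, or a bit over 4 work-months a year.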

Thank you for the insight. If the deployment indeed takes hours and there is a high chance of pulling wrong builds, then the time costs are indeed very tangible. But if manual deployment takes at most 10 to 20 minutes and is relatively straightforward, then there might be less of a case. How often you deploy also impacts this greatly. I guess in certain cases the ROI is just not extremely high, and that greatly reduces the appeal of such an investment.

No problem, hope it's useful :). You're right about deploy frequency - that 10-20 minute manual deploy done 3-6x a day already adds up to my 4 hours a week deploying things. 10% of the work week right there!
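To see how deploy frequency drives the total, here is a quick sketch of the range implied by "10-20 minutes, 3-6x a day". The 5-day week and the 40-hour baseline are assumptions, and the function name is made up for illustration:

```python
def weekly_deploy_hours(minutes_per_deploy, deploys_per_day, days_per_week=5):
    """Hours per week spent on manual deploys (assumes a 5-day week)."""
    return minutes_per_deploy * deploys_per_day * days_per_week / 60


low = weekly_deploy_hours(10, 3)    # quick deploys, done rarely
high = weekly_deploy_hours(20, 6)   # slow deploys, done often
print(low, high)                    # 2.5 10.0

# The 4 hours a week from the comment, as a share of an assumed 40-hour week
print(4 / 40)                       # 0.1
```

The 4-hours-a-week figure sits inside the 2.5 to 10 hour range, so the "10% of the work week" claim is plausible rather than a worst case.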

On the other end of the spectrum, a lot of my personal projects don't warrant even the small effort of configuring an existing build server. I'm the only contributor, nobody else will be breaking anything or blocking me, builds are super fast... even if "it will pay off in the long run", there are other higher impact things I could do that will pay off even better in the long run.

In the middle, I've put off automating some build stuff for our occasional (~monthly) "package" builds for our publisher - especially the brittle, easy to fix by hand, rightly suspected to be hard to automate stuff. I was generally asked to summarize VCS history for QA/PMs anyways - can't automate that.

When we started doing daily package builds near a release, however, it ate up enough of my time that non-technical management actually noticed and prodded me before I thought to prod them. Started by offloading my checklist to our internal QA (an interesting alternative to fully automating things) and eventually automated all the parts QA would forget, not know how to handle, or waste too much attention on.

Even then, some steps remained manual - e.g. uploading to our publisher's FTP server. Tended to run out of quota, occasionally full despite having quota available, sometimes unreachable, or uploading too slow thanks to internet issues - at which point someone would have to transfer by sneakernet instead anyways. Not much of a point trying to make the CI server handle all that.

No, the time/effort to make a build with a working CI system is zero, plus any activities that remain manual by design (e.g. installations that require physical access to the server). The uncertainty is only about the feasibility and cost of implementing CI: this is one of the rare cases in which the benefits of software can be measured objectively, easily and in advance.

I would disagree: you still have the build time. And even if the engineer doesn't do anything during that time, they are still occupied. No one would switch tasks during a 10-minute build; there's just nothing you can do in such a short timeframe. In that case, in terms of business costs, it doesn't matter whether the engineer is busy or not during that period: they are still not gonna be doing more.

The alternative to spending X minutes waiting for a CI build is spending slightly more than X minutes executing a manual build with a nonzero chance of time-wasting mistakes, not doing nothing.

Not waiting for a build means not testing it, in which case a manual non-build of an uninteresting software configuration seems attractively elegant but a CI system still provides value by recording that a certain configuration compiles and by making the built application available for later use in case it's needed.

That is indeed true, CI is way less error prone, thanks. Having a build ready to go is quite handy too.

What do you mean by "not testing it" "seems attractively elegant"? Testing a build is still a must, although that usually ends up being manual testing (unit tests don't assure much, and integration tests take a lot of engineering effort to set up and write, especially if they were not taken care of from the start).

Who the fuck writes a fully costed business case on whether or not to spend a day setting up a CI server?

I'm trying to get some fucking work done, not convince investors I need a series A.

Ah, I see the problem. You have no interest in understanding why your business makes the decisions it makes; you just expect them to give you permission to do whatever you say you want to do.

You said: I've tried explaining why we need to set up a CI server. ... In almost all cases they nod and feign interest and understanding and their eyes glaze over.

The reason you've failed to make a convincing case, I believe, is because you're talking in your language instead of theirs. Perhaps they've tried to explain to you, in their language, why they won't prioritize your CI server, and you nodded and feigned interest while your eyes glazed over.

The quote I gave you expresses your request and justification for a CI server into terms the business needs: what problem does it solve, what does it cost, how does it affect on-going costs, what are the risks of doing it and not doing it, and what impact does it have on other activities if it is done and if it is not done. This is not a "fully costed business case" or "convincing investors you need a series A". If you've given any thought at all to why you want a CI server beyond "I want it" you should have no problem filling in the blanks in my quote. And if you haven't bothered to think that much about it, your business is doing the right thing by giving your requests a low priority, because they shouldn't give your ideas any more attention than you're giving them yourself.

You're making good points, but there is a lot of truth to your parent's sense that making a business case for every little thing is deeply inefficient. The hard part is striking a good balance between one extreme of arrogant engineers who never think about the business case for the things they are working on and the other extreme of having technical decisions micromanaged by non-technical managers.

Yes, it can be deeply inefficient, but so is not getting approval to do necessary work. You have to start making progress somewhere, even if it's not as fast as you'd like it to be. If you're successful with this, you gain credibility and over time your recommendation will be sufficient to get approval for smaller tasks, and the business case will only need to be made for bigger tasks.

If you're not successful with this approach, and can't get approval despite showing that it's in the business' best interests using the business' own criteria, then your business is too dysfunctional and toxic to fix. Time to move on.

>Yes, it can be deeply inefficient, but so is not getting approval to do necessary work.

No, actually not needing approval to do necessary work is very efficient.

>You have to start making progress somewhere, even if it's not as fast as you'd like it to be. If you're successful with this, you gain credibility and over time your recommendation will be sufficient to get approval for smaller tasks, and the business case will only need to be made for bigger tasks.

There's no point in working to gain enough credibility to be able to do your own job effectively when you can simply leave and go and work somewhere else that doesn't expect you to prove to it that you can do your job after they've hired you.

Even if you manage to prevent the company from shooting itself in the foot as far as you're concerned by "proving your worth", it'll probably only go and shoot itself in the foot somewhere else and that will also ultimately become your problem.

In any case, this process tends to feed upon itself. Failures in delivery lead to a lack of trust, which leads to micromanagement, which leads to failures in delivery. It's not that you can't escape that vicious cycle, it's that it typically has a terrible payoff matrix.

I meant not getting approval for necessary work, and therefore not being able to do the necessary work, is inefficient. Not needing approval for necessary work is great; we agree on that.

You're right about having to make a choice between fixing the place you're at or finding a new place to be. There are many factors to consider, and sometimes trying to fix the place you're at can be worth the effort.

> If you're successful with this, you gain credibility and over time your recommendation will be sufficient to get approval for smaller tasks, and the business case will only need to be made for bigger tasks.

Maybe! Alternatively: if you give a mouse a cookie, it will want a glass of milk. It might be worthwhile to establish early on that the technical leadership needs to be trusted to make their own decisions about trivial things.

"Ah, I see the problem. You have no interest in understanding why your business makes the decisions it makes"

No, the problem is that you believe that micromanagement is effective.

"The reason you've failed to make a convincing case, I believe, is because you're talking in your language instead of theirs."

No, the reason is because the ROI is vague and not easily costable and the time until expected return is usually months. By contrast, feature X gets customer Y who is willing to pay $10k for a licence on Tuesday.

This hyperfocus on the short term and visceral ROI over the long term and vague ROI isn't limited to software development, incidentally. It is a very, very common business dysfunction in all manner of industries - from agriculture to health care to manufacturing. Companies that manage to get over this dysfunction by hiring executives who have a deep understanding of their business and are willing to make long term investments often end up doing very, very well compared to the companies that chase next quarter's earnings figures with easy wins.

This is also why companies that are run by actual experts instead of MBA drones inevitably end up doing better (ask any doctor about this). It's not the fault of the people beneath them for not speaking the MBA's language. It's the fault of MBAs for being unqualified to run businesses.

Now, fortunately, product managers don't have to understand development because they can choose not to have to make decisions that require them to. However, if they insist on making decisions that require them to understand development then they will damage their own interests.

"The quote I gave you expresses your request and justification for a CI server into terms the business needs: what problem does it solve, what does it cost, how does it affect on-going costs, what are the risks of doing it and not doing it, and what impact does it have on other activities if it is done and if it is not done."

How low level are you willing to take this? Would you agree to make a business case for why you are using your particular text editor? Would you provide an estimate of the risks of not providing you with a second monitor? Where's the cut off point if it's not a day's work? Perhaps you are costing the company money with those decisions, after all.

At some point, dysfunctional management can't be overcome. I'm not really talking about that extreme case; I'm talking about the more common case where engineers don't understand management priorities because they're not aware of the business' non-technical concerns that are part of the prioritization decisions.

If you want to spend a day on a CI server, it'll cost the company a day of your time (say, $1k) and will save maybe 5x that over the year by saving an hour of your time dealing with each build. That's great and worth doing. But, if it means that your company will miss out on $10k of revenue Tuesday, it's a net loss. And if missing that revenue means payroll can't be made on Friday, the company is screwed. The hyperfocus on short-term may be dysfunction, or it may be a sign that the company is in serious trouble. Jumping ship might be the best choice.
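The trade-off in the paragraph above is simple enough to write down. A minimal sketch, assuming the $10k is lost outright rather than merely delayed; the function and parameter names are hypothetical:

```python
def net_value(setup_cost, yearly_saving, revenue_lost=0):
    """Net first-year value of an internal improvement, in dollars.

    Assumption: revenue_lost is gone for good, not just pushed back.
    """
    return yearly_saving - setup_cost - revenue_lost


# One day of developer time ($1k) that saves about 5x that over the year
print(net_value(1_000, 5_000))          # 4000
# The same day, but it costs the $10k deal on Tuesday
print(net_value(1_000, 5_000, 10_000))  # -6000
```

The CI server is the same investment in both calls; only the opportunity cost changes, and the engineer often can't see that term.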

"Speaking the MBA's language" isn't really about terminology, it's about a different point of view with different concerns and priorities. A PM choosing your text editor sure sounds like micro-management of a technical decision that the PM doesn't understand, but maybe the text editor you want to use has licensing costs for business use that you're not aware of because you always used it personally, and the PM's decision is based on that business concern rather than the technical merits. Same topic, same choice to make, different point of view.

>I'm talking about the more common case where engineers don't understand management priorities because they're not aware of the business' non-technical concerns that are part of the prioritization decisions.

Ok, so assuming:

* All user stories are prioritized by management.

* Management determines the exact % of time spent on refactoring stories.

* Refactoring stories are prioritized by devs and slotted alongside user stories (according to the % above).

What kind of hypothetical non-technical concerns that are part of management's prioritization decisions would become a problem?

Because, as far as I can see, in such a case, it wouldn't matter if the devs are not aware of the non-technical concerns because those concerns would still be reflected by the prioritization.

If all three of your assumptions are true, then you're correct. The trick is getting the second two assumptions to be true. Management and devs have to agree on a % time split, and management has to agree with intermingled priorities instead of "dev % goes at the end".

Negotiating those agreements is where having a common ground on business concerns helps. And yes, it sure does help when the managers can also see things from the dev's point of view too. In my experience, it's easier for devs to understand business concerns than the other way around, so that's the way I lean.

Yes. This sort of cost-benefit analysis also ignores some intangibles such as:

"When we're interviewing people and they find out just how backwards our CI system is, the smart ones will laugh at us and work somewhere else and we'll be left with just the dumb ones."

As @DougWebb said, immediate cost saved is the easiest way to sell CI, especially if the savings are large.

He didn't say it was the only way. Nor that you can't add more arguments if cost savings alone isn't convincing enough.

CI's cost savings will not be immediate, large or easy to measure. That's why creating this "sales process" to make them happen is such a toxic mistake.

I worked somewhere once that forced me to spend political capital to make these kinds of things happen and it was a terrible waste.

Nobody notices the disasters that don't happen and when somebody is 2x faster and develops code with fewer bugs, that tends to reflect well upon them even if they were building upon your work.

> Who the fuck writes a fully costed business case on whether or not to spend a day setting up a CI server?

A lesson I learned the hard way is that if the business doesn't care then neither should you. It's just not worth fighting uphill battles like this. The only way to measure what a business cares about (distinct from what they say they care about) is by looking at what they're willing to spend money on.

If building software is annoying for you personally then you can automate much of it, maybe even set up a CI server on your own machine.

>A lesson I learned the hard way is that if the business doesn't care then neither should you. It's just not worth fighting uphill battles like this.

Absolutely. I used to work for a company where the battles were uphill and constant. I quit and now work for a company with no battles. One had bad financials and the other has very good financials.

The first company did teach me how to deal with very extreme technical debt, though (they'd been digging their hole for a while), which actually is a useful thing to know.

>If building software is annoying for you personally then you can automate much of it, maybe even set up a CI server on your own machine.

The solution is GTFO.

Agree completely, just note that you can only pull the GTFO card a couple of times in a row, then it hurts your ability to pay rent, no matter how true.

It's a trap actually, sometimes it's the shit companies like this that are the only ones hiring.

The tone of this post is a little flippant, but ultimately, I have to agree.

It comes down to how much trust the executive sponsors have in the face of the engineering org, and how the business views technology, as responsible professionals or children who have to be closely supervised and monitored.

Nurturing that relationship is one of the most important jobs of an executive/C-level engineering manager.

> I've tried this "getting them involved" approach and it failed miserably for me. I've tried explaining why module A had to be decoupled from module B to stakeholders. I've tried explaining why we need to set up a CI server. I've tried explaining why technology B needs to be isolated and eliminated.

Because that's too technical. You have to frame the problems in terms that impact them or their employees in terms of user stories/case studies.

Notice the OP said that users have long login times due to various issues and he can solve them by doing X, and not "TCP/IP timeouts and improper caching policies are causing back pressure leading to stalls in the login pipeline..."

Note that I said "I've tried explaining WHY", not "I've tried explaining what a TCP/IP stack is".

The explaining "why" in and of itself isn't particularly hard - refactoring and development tooling will speed up development in the future and reduce bugs. It's getting it prioritized that's hard - and that's because the business case of 'potentially reducing the likelihood of bugs in the future' and 'a story 3 months from now might take 2 days instead of 3' isn't a particularly compelling one - not because it's not important - but because it's not visceral and concrete enough.

In practice I've seen what this does. The process of introducing transaction costs (having to 'sell' the case of refactoring code is exactly that) simply stops it from happening.

If, as a business, you want to introduce this transaction cost into your development process you will end up paying more dearly for it in the long run as you deal with the effects of compounding technical debt.

Setting up a CI server is not a user story. It doesn't deliver any value to the customer on its own, and thus is not really something that should be in the customer backlog. It should be rolled into the first story done on the project, as it's a part of setting up the development environment. Similarly, you probably didn't have a story for creating the git repository, nor one for installing your text editor.

You work with your customer to decide what end-user bugfixes and features to prioritize, but it's your job to make technical decisions. That's why they hired you. Don't push those decisions back onto them.

The problem isn't so much setting up a CI server as the fact that the task requires some time and resources from potentially other people. It might need approval for a simple VM with some disks, for example. But more importantly, use of such a server oftentimes means you need some time from other engineers to set up CI tasks including QA, security, etc. Anything that requires team consensus potentially requires meetings and some formalization.

But really, managers who don't keep up conceptually with the trends of software engineering management are low performers in the same way as engineers who refuse to learn how to improve their code even if it doesn't directly impact their immediate codebase (functional programming patterns for an embedded software engineer come to mind).

I've worked with people where every task was a user story. E.g.:

Title: Review code for feature X

Description: As an OUR-APP product manager I want to understand how feature X is implemented.

Acceptance Criteria: AC1: Feature X is documented in the wiki in context Y. AC2: ...

> PS: I love the simple counter-based bookkeeping perspective from the linked post. (And think someone else suggested something similar in a previous performance / debugging front page article)

If you're short on budget and need to sell the whole package to management just do that one, it will make all that invisible stuff visible in more detail than they likely have the stomach for and you'll be granted budget in no time because there is nothing that spells lost business better than a fair sized gap between customers entering on the left and only a trickle coming out on the right.

In effect this is funnel visualization for the internals of an application in all the gory detail.
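For anyone who hasn't seen the idea, here is a minimal sketch of counter-based bookkeeping as a funnel. The stage names and the `Funnel` class are made up for illustration, not taken from the linked post: you bump a counter at each stage and report how many items were lost between consecutive stages.

```python
from collections import Counter


class Funnel:
    """Counts how many items reach each named stage of a pipeline."""

    def __init__(self, stages):
        self.stages = list(stages)
        self.counts = Counter()

    def hit(self, stage):
        """Record that one item reached the given stage."""
        if stage not in self.stages:
            raise ValueError(f"unknown stage: {stage}")
        self.counts[stage] += 1

    def report(self):
        """Return (stage, count, lost_since_previous_stage) rows in stage order."""
        rows = []
        previous = None
        for stage in self.stages:
            count = self.counts[stage]
            lost = 0 if previous is None else previous - count
            rows.append((stage, count, lost))
            previous = count
        return rows
```

In a real system the `hit` calls would sit at the boundaries you care about (request received, validated, stored, and so on), and the "lost" column is the gap management gets to look at.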

In my personal experience, the codebase is not the problem, but the people and the culture who made the codebase.

We sent people to the moon with the computing power of a calculator, and with enough good people and effort, and version control, you can rewrite any legacy codebase to meet rigorous standards and meet the performance needs of the users.

Out of all the things humanity is trying and has accomplished, this is not unachievable.

If the people don't want it, and the culture does not lend itself to high standards, often people will not see the same pain a developer will experience when they see poor code performing poorly. They may complain about the output and all the other things wrong about the platform, but that doesn't mean the company is going to support a legacy reboot, that just means the company has an accepted culture of low performance and complaining.

I say this in the context of a non-software-development company relying on a lot of software.

My most recent form of personal torture has been watching my IT department take a 45-year-old black-box piece of software from a very old, outdated engineering firm that never specialized in software, and actually try to port it to AWS, thinking it will speed up performance AND save costs. They have no idea whether the kernel has any capability to exploit concurrency in the algorithms on the inside.

What they do know is that there was a bug in the code, embedded in 1500 lines, and it took 8 months to fix; it had no API, and all the developers were dead or no longer employed by the original company. They pay millions of dollars annually for this license and additionally millions more for an "HPC" to run it on.

They would never consider rewriting it, or contracting a new firm with a timeline, performance standards, needs and competitive cost recruiting. They don't know how. They don't understand how.

This is the way of the world outside of software development companies.


If you're wondering how I exist in such a painful environment, I'm an Electrical Engineer, and I do not work for a software company. I get to mentor under some of the most brilliant and game changing Engineers in my industry, but it has very little to do with software development.

It could have a lot to do with it, but the engineers have no interest in taking advantage of software. They have to be able to first understand the advantages software can provide, but... I mean, some of the engineers I work with don't know what a GPU is. Never heard of it.

I write all my own code for my own work from scratch.

Your comment reflects what I heard last week from a friend. He's a PE (Professional Engineer) working with a large power utility company, and his company uses software that, in his words, "really sucks".

He was telling me this because he got a job offer from a startup, where they wanted him to be the subject matter expert for an application they were developing for power utilities.

It's a good offer, and he's tempted, but he's not sure if he can fit into the software/startup culture. In addition, he felt that the developers were looking to use him like a reference book - he got the impression that the founders saw software as the answer to everything, and that they didn't see power engineering as a particularly hard domain. There was (in his words) a distinct whiff of "developers are the cool guys".

This turned him off somewhat. So he's not sure if he'll take the offer, and I'd say he's leaning no.

Just an anecdote to illustrate the clash of cultures.


I am getting the same kind of offers. There are a lot of subsidies and VC investment in "clean" energy and "smart grid" so developers left and right have a new market to apply their software development skills to.

In my experience, I am not being used as a reference book, and my long term investment in coding/Minor in Compsci and personal projects make me able to translate engineering speak to software speak. I have not had the experience that the developers think software is the answer and that power is trivial.

I think the real clash of cultures is the culture of software development realizing how much bureaucracy surrounds the power grid, and how much resistance there is to change within the industry, because the industry really does not understand how 95% of their day manually editing Excel workbooks that output Fortran run files from the 80s could be deleted and improved, and of course, most of them don't want to improve since there is a 30yr generational gap in the power industry.

Half of the people I work with don't believe in climate change and think Elon Musk is taking their jobs. This industry is ripe for disruption and I don't think the mentality that software/smart grid stuff can improve it is incorrect, but I do think the naivete that everyone in every industry is as openminded and continually putting in effort to learn and produce working products, which is a mindset required by successful growing software firms, is proving to be a big barrier and wake up call to software companies trying to come in and help.

I honestly blame our industry over software developers, but yeah, there's also a gold rush in "smart grid" and "clean energy" stuff and everyone wants to be a part of it.

Definitely as a Power Engineer it leaves you getting lots of opportunities and having to sort through who is willing to invest in understanding the complexities of innovating on the power grid, and who is going to cop out once they realize you can't whip up an app and make money off users the way snapchat will.

There are also, and for good reason, lots of cybersecurity policies surrounding software running on the power grid, because hacking the grid has detrimental effects that can quickly translate to coast-wide blackouts etc. That also means the newest GitHub release of that multiplatform CoffeeScript spinoff is not going to be allowed in a lot of the grid-side applications, and there is more work involved with vetting development.

All of these things really quickly weed out the devs who are looking for quick stardom and the next easiest cashcow, and landing on the latest buzzwords related to clean energy. It can be frustrating weeding out the companies who try to hire you from that perspective.

It doesn't mean the grid doesn't need better software, it just means people, even developers who want to cash in on hot finance markets, are going to take the path of least resistance, and the power grid is not the path of least resistance (no pun intended) when it comes to quick cash and unicorn apps.

You have to actually CARE about innovating on the grid, and not just pretend to care because there's billions sitting around in funding waiting to fund good smart energy innovation. Regardless, because this money is there now, and there's the backing of social reinforcement, being able to herald your startup as bleeding-edge, world-saving technology that's going to stop society's impending doom from global warming is a very compelling emotional appeal that makes marketing and justifying your product easy.

It's hot right now, and in the next ten years we will see who is around to grab quick cash and feel good about being the poster child for saving the climate, and who is willing to invest in truly renovating the power grid and enabling clean energy as a sustainable long-term solution that is economically viable without startup subsidies to cover the cost of initial investments.

Eventually these companies have to show a profit...

It is frustrating to be in the industry as an Engineer under the age of 35, and also have exposure to Compsci and friends working at Amazon and Google, and having to explain a hyperlink to a coworker 3 levels above you.

It's also frustrating to take graduate level classes and do research and R&D and have spent years designing power grid and actually being on the power grid and seeing construction and putting in hard work to learn Electrical Power before it was "cool" and then have an insurmountable rush of developers who want you to help them change the world. They are the CEO, you are the reference book engineer.

I get that, but it's important to look past that and see there is true benefit to the innovation. And for the most part they are right, this industry has been sitting in static mode riding comfortably for a long time in many ways when it comes to trying to stay technologically relevant, and maintain sustainable infrastructure that allows for growth. So it is frustrating, but ultimately the software/tech community is in the right. It's time for a change and this industry needs to admit it kind of sucked at changing itself for decades.

It also helps that I have 8 years of coding experience I put in on my own time to help ease this barrier but I am an exception to the rule as I have been told by recruiters doing smart grid dev.

These problems tend to be systemic, not just tech problems, and usually by the time we reach this stage management is a little more amenable to things like feature freezes than what the regular crew would be dealing with. There is a reason you get to that stage.

So I can see how we have (much) more freedom when it comes to setting the time table and more diplomacy and better salesmanship might be required at an earlier stage. But then you can point to this comment here and suggest that it is probably much cheaper to do this in house than to hire a bunch of consultants to do it by the time the water is sloshing over the dikes.

Extreme planned refactoring perhaps:

"Many teams schedule refactoring as part of their planned work, using a mechanism such as "refactoring stories". Teams use these to fix larger areas on problematic code that need dedicated attention. Planned refactoring is a necessary element of most teams' approach - however it's also a sign that the team hasn't done enough refactoring using the other workflows."


>I don't disagree at all, but I think the more valuable advice would be to explain how this can be done at a typical company.

I resolved to try this if I ever ran into the same problem again after a whole bunch of arguments at a previous few companies:

* Set up a (paper) slider with 0-100% on it and put it somewhere prominent on the wall. Set it at 70%. That's the % of time you spend on features vs. the % of time you spend on refactoring (what that entails should be the development team's prerogative).

* Explain to the PM (or their boss) that they can change it at any time.

* Explain that it's ok to have it at 100% for a short while but if they keep it up for too long (e.g. weeks) they are asking for a precipitous decline in quality.

* Track all the changes and maintain a running average.
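The bookkeeping for that last bullet is tiny. A sketch, assuming you log each slider setting together with how many days it stayed there (the function name and the data layout are my own invention):

```python
def weighted_average(settings):
    """Time-weighted average of the slider.

    settings: list of (feature_percent, days_held) tuples, one per slider change.
    """
    total_days = sum(days for _, days in settings)
    if total_days == 0:
        raise ValueError("no time recorded")
    return sum(percent * days for percent, days in settings) / total_days


# Example: 10 days at 70%, then 5 days at 100%, then 5 days at 50%.
history = [(70, 10), (100, 5), (50, 5)]
average = weighted_average(history)  # (700 + 500 + 250) / 20 = 72.5
```

Weighting by days held is a choice, not a requirement. A plain average over the changes would also work, but it would overstate settings that were only held briefly.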

I think a lot of people suspect that management would just put it at 100% and leave it but I suspect that wouldn't happen. Most managers' "cover their ass" instincts will kick in given how simple, objective and difficult to bullshit the metric is once it's explicit.

I'd call it "Maintaining existing code" rather than "Refactoring".

To a non-technical manager, the former sounds like pretty much what it is, and won't raise many questions. (If they do question it, ask them if they maintain their car while it's still running ok, or just wait until it breaks down before they do anything to care for it.)

Refactoring, on the other hand, sounds like a buzz word, and if they look it up they'll get "rewriting code that's already working so that it continues working the same way". They probably won't get the nuances about why that's a useful thing to do, so it'll sound like busywork and they won't be happy with letting your team do it. They also won't be able to justify it to their management if they're questioned about it, which is critical for getting buy-in from your managers.

That's a good one, I will definitely steal that. Thank you!

Freeze is a business decision, NOT a technical decision. How much risk is acceptable to them is the question. If they want low risk then they need to freeze early, if they can stand risk then they can freeze later. If they want the best of both worlds then they need to invest in automation (build and test) up until the point where the costs of automation exceed the value of lower risk with a late freeze date.

Remember you need to work in their terms. Risk is something they understand. The risk is they ship as soon as the last feature is done, without discovering that the last feature broke everything else. From there they move back: the last feature is done, so we do a 30 second sanity test - increase that to 30 minutes, 1 week, 1 month... They should have charts (if not, create them) showing how long after a bug is introduced it is discovered on average; use those charts to help guide the decision. If the freeze time frame is too long then they allocate budget to fix it, or otherwise plan around this.

There are a lot of options, but they are not technical.

If a change is hard to make, first make the changes that will make it easy. Recurse.

> And frankly, if you can't point to bugs or performance issues, it's likely you don't need to be refactoring in the first place!

I feel this is a lack of clarity around the word refactoring. Improving the code in a way that fixes bugs is "bug fixing", in a way that makes it do its job faster is "optimisation" and in a way that improves the design is "refactoring".

Of course one can do several of them at the same time. And add features, at least in the small.

Refactoring can be a valuable activity for bits of a code base where the cost of change could be usefully reduced. It's useful to have a word that can be used to describe that activity that isn't commonly conflated with bug-fixing or optimisation.

Sound advice.

re: Write Your Tests

I've never been successful with this. Sure, write (backfill) as many tests as you can.

But the legacy stuff I've adopted / resurrected have been complete unknowns.

My go-to strategy has been blackbox (comparison) testing. Capture as much input & output as I can. Then use automation to diff output.

I wouldn't bother to write unit tests etc for code that is likely to be culled, replaced.
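A rough sketch of what I mean by "use automation to diff output", assuming the captured traffic is available as (input, output) text pairs and the replacement can be called as a plain function. Both of those are simplifications, and all the names are hypothetical:

```python
import difflib


def compare_outputs(recorded_cases, new_impl):
    """Replay recorded inputs through new_impl and diff against the old output.

    recorded_cases: iterable of (input_text, expected_output_text) pairs
    captured from the legacy system.
    Returns a dict mapping each failing input to a unified diff.
    """
    failures = {}
    for input_text, expected in recorded_cases:
        actual = new_impl(input_text)
        if actual != expected:
            diff = difflib.unified_diff(
                expected.splitlines(),
                actual.splitlines(),
                fromfile="legacy",
                tofile="new",
                lineterm="",
            )
            failures[input_text] = "\n".join(diff)
    return failures
```

An empty result means the new code reproduces everything that was captured. It says nothing about inputs you never captured.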

re: Proxy

I've recently started doing shadow testing, where the proxy is a T-split router, sending mirror traffic to both old and new. This can take the place of blackbox (comparison) testing.
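Stripped of the networking, the T-split boils down to something like the sketch below. The handlers are stand-ins for the old and new services, and the mismatch list would be a log or a queue in practice:

```python
def shadow_call(request, old_handler, new_handler, mismatches):
    """Answer from old_handler; mirror the request to new_handler and record differences."""
    old_response = old_handler(request)
    try:
        new_response = new_handler(request)
    except Exception as exc:
        # The new path must never break live traffic, so record and move on.
        mismatches.append((request, old_response, repr(exc)))
    else:
        if new_response != old_response:
            mismatches.append((request, old_response, new_response))
    # Callers only ever see the old system's answer.
    return old_response
```

The important property is that the caller gets the old response no matter what the new code does. This only works cleanly when the mirrored request has no side effects, or when the new system writes to its own sandbox.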

re: Build numbers

First step to any project is to add build numbers. Semver is marketing, not engineering. Just enumerate every build attempt, successful or not. Then automate the builds, testing, deploys, etc.

Build numbers can really help defect tracking, differential debugging. Every ticket gets fields for "found" "fixed" and "verified". Caveat: I don't know if my old school QA/test methods still apply in this new "agile" DevOps (aka "winging it") world.
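The simplest possible version of "enumerate every build attempt" is a counter in a file that the build script bumps before doing anything else. A sketch (the file name is arbitrary, and this is not safe against two builds starting at the same moment):

```python
from pathlib import Path


def next_build_number(counter_file):
    """Increment and return the build counter stored in counter_file."""
    path = Path(counter_file)
    current = int(path.read_text().strip()) if path.exists() else 0
    number = current + 1
    path.write_text(f"{number}\n")
    return number
```

Because the number is taken before the build runs, failed attempts consume a number too, which is the point: every ticket can name the exact attempt where a defect was found, fixed and verified.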

> Semver is marketing, not engineering.

I agree with many of your points, but that casual dig at semver is unwarranted and reveals a misunderstanding of the motivation behind it [1]. Semver defines a contract between library authors and their clients, and is not meant for deployed applications of the kind being discussed here. Indeed, the semver spec [2] begins by stating:

> 1. Software using Semantic Versioning MUST declare a public API.

It has become fashionable to criticize semver at every turn. We as a community should be more mindful about off-the-cuff criticism in general, as this is exactly what perpetuates misconceptions over time.

[1]: https://news.ycombinator.com/item?id=13378637

[2]: http://semver.org/

Build numbers, internal accounting process, engineering.

Semver, outsiders view, marketing.

Two different things, conflating them causes heartache. Keep them separate.

> re: Write Your Tests, I've never been successful with this ... I wouldn't bother to write unit tests etc for code that is likely to be culled, replaced.

I think you misread the author. He says "Before you make any changes at all write as many end-to-end and integration tests as you can." (emphasis mine)

> My go-to strategy has been blackbox (comparison) testing. Capture as much input & output as I can. Then use automation to diff output.

That's an interesting strategy! Similar to the event logs OP proposes?

Sounds like approval testing: http://approvaltests.com/

You capture the initial output from the original code, then treat this canonical version as the expected result until something changes.
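To make that concrete, a hand-rolled sketch of the idea follows. This is not the ApprovalTests API, the file naming is just an assumption, and unlike a real approval workflow it silently accepts the first output it sees:

```python
from pathlib import Path


def verify(name, received, approved_dir):
    """Compare received text with the approved file, recording it on first run.

    Returns True when the output matches (or was just recorded), False otherwise.
    """
    approved = Path(approved_dir) / f"{name}.approved.txt"
    if not approved.exists():
        # First run: treat whatever the legacy code produced as canonical.
        approved.write_text(received)
        return True
    if approved.read_text() == received:
        return True
    # Keep the new output next to the approved one for manual review.
    (Path(approved_dir) / f"{name}.received.txt").write_text(received)
    return False
```

When a difference is intentional, you review the received file and promote it to approved by hand.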

The thing about end-to-end and integration tests is that at some point, your test has to assert something about the code, which requires knowing what the correct output even is. E.g., let's say I've inherited a "micro"service; it has some endpoints. The documentation essentially states that "they take JSON" and "they return JSON" (well, okay, that's at least one test) — that's it!

The next three months are spent learning what anything in the giant input blob even means, and the same for the output blob, and realizing that a certain output in the output comes directly from the sql of `SELECT … NULL as column_name …` and now you're silently wondering if some downstream consumer is even using that.

Belated reply, sorry. Been chewing.

Methinks I've prioritized writing of tests, of any kind, based on perceived (or acknowledged) risks.

Hmmm, not really like event logs. More of a data processing view of the world. Input, processing, output. When/if possible, decouple the data (protocol and payloads) from the transport.

First example, my team inherited some PostScript processing software. So to start we greedily found all the test reference files we could, captured the output, called those the test suite. Capturing input and output requires lots of manual inspection upfront.

Second sorta example, whenever I inherit an HTTP based something (WSDL, SOAP, REST), I capture validated requests and generated responses.

Pinning tests can be helpful for scary legacy code! http://rick.engineer/Pinning-tests/

Comparison testing should probably be the preferred means of testing (it solves the oracle problem). A combinatoric tester of the quickcheck variety can be invaluable here, and can be used from the unit test level all the way to external service level tests. Copy the (preferably small) sections of code that are the fix or functionality target, compare the old and copied paths with the combinatoric tester, modify the copied path, understand any differences, then remove the old code path (keep the combinatoric test asserting any invariants or properties).
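A toy version of that workflow, using exhaustive combinations over small input domains rather than a real quickcheck-style random generator. The two `clamp` functions are invented stand-ins for the old path and the modified copy:

```python
import itertools


def legacy_clamp(value, low, high):
    # Stand-in for the old code path (hypothetical).
    if value < low:
        return low
    if value > high:
        return high
    return value


def new_clamp(value, low, high):
    # Stand-in for the copied-and-modified path (hypothetical).
    return max(low, min(value, high))


def find_differences(old, new, domains):
    """Run both paths over every combination of inputs and return the mismatches."""
    mismatches = []
    for args in itertools.product(*domains):
        old_result, new_result = old(*args), new(*args)
        if old_result != new_result:
            mismatches.append((args, old_result, new_result))
    return mismatches


values = range(-3, 4)
differences = find_differences(legacy_clamp, new_clamp, [values, values, values])
```

Here the two versions agree whenever `low <= high` and disagree only on some inputs where `low > high`. That is exactly the kind of difference you then have to understand: is it a bug in the copy, or behaviour nobody relied on?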

Some other important points:

- Inst. and Logging: Also add an assert() function that throws or terminates in development and testing, but logs in production. Sprinkle it around when you're working on the code base. If an assert fires, your assumptions were wrong and now you know a bit more about what the code does. The asserts are also your documentation, and nothing says correct documentation like a silent assert.

Fix bugs - Yes, and fix bugs causing errors first. Make it a priority every morning to review the logs, and fix the cause of error messages until the application runs quiet. Once it's established that the app does not generate errors unless something is wrong, it will be very obvious when code starts being edited and mistakes start being made.

One thing at a time - And minimal fixes only. Before starting a fix ask what is the minimal change that will accomplish the objective. Once in the midst of a code tragedy many other things will call out to be fixed. Ignore the other things. Accomplish the minimal goal. Minimal changes are easy to validate for correctness. Rabbit holes run deep and deepness is hard to validate.
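Something along these lines, as a sketch. The `production` flag would normally come from configuration or an environment variable, and the logger name is arbitrary:

```python
import logging

logger = logging.getLogger("soft_assert")


def soft_assert(condition, message, production=False):
    """Raise in development and testing; log an error and carry on in production."""
    if condition:
        return True
    if production:
        logger.error("assertion failed: %s", message)
        return False
    raise AssertionError(message)
```

Unlike the built-in `assert` statement, this does not disappear when Python runs with `-O`, so the production logging path stays active.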

Release - Also almost the first thing to do on a poorly done project is validate build and release scripts (if they exist). Validate generated build artifacts against a copy of the build artifact on the production machine. Use the Unix diff utility to match for files and content or you will miss something small but important. For deployment, make sure you have a rollback scheme in place or % staged rollout scheme because, at some point, mistakes will be made. Release often because the smaller the deploy the less change and the less that can go wrong.
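If you'd rather script the artifact check than eyeball `diff` output, here is a sketch that hashes every file in two directory trees and reports what differs. The function names are hypothetical, and it assumes both the fresh build and the production copy are unpacked locally:

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path relative to root to the SHA-256 of its content."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_artifacts(built_dir, production_dir):
    """Return files missing from either side and files whose content differs."""
    built = file_digests(built_dir)
    prod = file_digests(production_dir)
    return {
        "only_in_build": sorted(built.keys() - prod.keys()),
        "only_in_production": sorted(prod.keys() - built.keys()),
        "different": sorted(k for k in built.keys() & prod.keys() if built[k] != prod[k]),
    }
```

Three empty lists mean the build scripts reproduce what is actually running. Anything else is worth explaining before you change a single line.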

To help others with this strategy of blackbox/comparison testing, it's also often called "characterization" testing [1]. (In case you want to read more about this strategy.)

[1] https://en.wikipedia.org/wiki/Characterization_test

>My go-to strategy has been blackbox (comparison) testing. Capture as much input & output as I can. Then use automation to diff output.

Same here - you have an oracle, it would be a waste not to use it. You can probably also think of some test cases that are not likely to show up often in the live data, but I would contend that until you know the implementation thoroughly, you are more likely to find input that tests significant corner cases in the live data, rather than by analysis.

> My go-to strategy has been blackbox (comparison) testing. Capture as much input & output as I can. Then use automation to diff output. I wouldn't bother to write unit tests etc for code that is likely to be culled, replaced.

I think that is precisely what the article advocates - although the definition of what end-to-end and integration tests are varies wildly from place to place.

> First step to any project is to add build numbers. Semver is marketing, not engineering. Just enumerate every build attempt, successful or not. Then automate the builds, testing, deploys, etc.

A thousand times this. And get to a point where the build process is reproducible, with all dependencies checked in (or if you trust your package manager to keep things around...). You should be able to pull down any commit and build it.

That's absolutely true, I totally wrote that under the assumption that you at least have some kind of build process and that it actually works. I will add another section to the post.

> write your tests.

From my point of view, this is always key. The moment you have testable components is the moment you can begin to decompose the old system into parts. Once you begin with decomposition, it's easier to first pick the low-hanging fruit to show that you are advancing, and then transition to the difficult parts.

PS: I've spent my whole career maintaining & refactoring others' code. I've never had any problem taking on orphan systems or refactoring old ones, and I kind of enjoy it.

If you have any such old & horrible legacy systems, send them my way :D.

Interesting read for those who don't understand our fancy for legacy code: http://typicalprogrammer.com/the-joys-of-maintenance-program...

The article is a description of my career! Thanks for sharing.

+1 for the split testing + diff approach. We've successfully used this several times to replace old components with new implementations.

I'd add a prerequisite to the top of this list:

- Get a local build running first.

Often, a complete local build is not possible. There are tons of dependencies, such as databases, websites, services, etc. and every developer has a part of it on their machine. Releases are hard to do.

I once worked for a telco company in the UK where the deployment of the system looked like this: (Context: Java Portal Development) One dev would open a zip file and pack all the .class files he had generated into it, and email it to his colleague, who would then do the same. The last person in the chain would rename the file to .jar and then upload it to the server. Obviously, this process was error prone and deployments happened rarely.

I would argue that getting everything to build on a central system (some sort of CI) is useful as well, but before changing, testing, db freezing, or anything else is possible, you should try to have everything you need on each developer's machine.

This might be obvious to some, but I have seen this ignored every once in a while. When you can't even build the system locally, freezing anything, testing anything, or changing anything will be a tedious and error prone process...

> I would argue that getting everything to build on a central system (some sort of CI) is useful as well, but before changing, testing, db freezing, or anything else is possible, you should try to have everything you need on each developer's machine.

I'd extend this and say that the CI server should be very naive as well. Its only job is to pull in source code and execute the same script (makefile, whatever) that the developers do. Maybe with different configuration options or permissions, but the developers should be able to do everything the CI server does in theory.

A big anti pattern I see is build steps that can only be done by the CI server and/or relying on features of the CI server software.

Added, thank you.

Also added a bit about the very obvious backup that you need to make before starting any work at all. Just in case...

This is a good high-level overview of the process. I highly recommend that engineers working in the weeds read "Working Effectively with Legacy Code" [1], as it has a ton of patterns in it that you can implement, and more detailed strategies on how to do some of the code changes hinted at in this article.

[1] https://www.safaribooksonline.com/library/view/working-effec...

Second this, this is one of the best coding books I've read.

edit: it also gives a lot of similar advice to the article, big-bang rewrites often impossible, drawing a line somewhere in the application to do input-output diffing tests when you make a change

I mostly agree with this - bite-sized chunks is really the main ingredient to success with complex code base reformations.

FWIW, if you want to have a look at a reasonably complex code base being broken up into maintainable modules of modernized code, I rewrote Knockout.js with a view to creating version 4.0 with modern tooling. It is now in alpha, maintained as a monorepo of ES6 packages at https://github.com/knockout/tko

You can see the rough transition strategy here: https://github.com/knockout/tko/issues/1

In retrospect it would've been much faster to just rewrite Knockout from scratch. That said, we've kept almost all the unit tests, so there's a reasonable expectation of backwards compatibility with KO 3.x.

> In retrospect it would've been much faster to just rewrite Knockout from scratch.

That's most likely not true, but looking backwards it often feels that way. The problem is that you're now a lot wiser about that codebase than you were at the beginning and if you had done that rewrite there could have easily been fatalities.

But of course it feels as if the rewrite would be faster and cleaner. How bad could it be, right? ;)

And then you suddenly have two systems to maintain, one that is not yet feature complete and broken in unexpected ways and one that is servicing real users who can't wait until you're done with your big-bang effort. And then you start missing deadlines and so on.

It's funny in a way that even after a successful incremental project that itch still will not go away.

> The problem is that you're now a lot wiser about that codebase than you were at the beginning

That may not be true in this case, if the rewriter is also the original author and has remained active in the codebase over the years.

Yes, but that's an entirely different situation than the one I'm targeting in the article. But yes, in that case you have better chances.

Even so, there is the Netscape story as evidence to the contrary.

And I'm sure Netscape is far from alone in that category ;-)

But (disclaimer) as someone who has advocated for big-bang rewrites before, I'm still under the impression that there are situations where they can be net-better.

Factors may include:

- there is no database involved, just code. Even more helpful if the existing code is "pure".

- a single developer can hold the functionality in their head.

- there are few bugs-as-features, tricky edge cases that must stay backwards-compatible, etc.

- as stated above, it's the primary author.

- much of the existing functionality is poor, and the path for building, launching, and shifting to a "replacement product" is relatively clear.

Advocating to never rewrite can be harmful, and make things harder for people for whom that actually would be the best approach.

Yes, but those are special cases. For every rule there is an exception, and of course if the parts above apply you are fully in control and are well able to judge whether you should rewrite or not.

But the situation that I'm describing is not ticking any of those boxes and I think I made that quite clear in the pre-amble.

> the situation that I'm describing is not ticking any of those boxes

Oh, there's no doubt in my mind about that!

Some people may read this and extrapolate too far regarding their own situation (there's a reason this is a specialty field, it's hard stuff).

One thing that bothers me is that people tend to expect miracles. I usually tell them it will take as long as it took to fuck it up to fix it. But that doesn't mean that you can't have some initial results to point the way in a short time. It's more about establishing a process and showing that there is a way out of the swamp than that it is something super tricky or difficult. Just follow the recipe, don't let yourself be distracted (this can be really hard, some management just can't seem to get out of the way) and keep moving.

> In retrospect it would've been much faster to just rewrite Knockout from scratch.

You're getting a bit of pushback on this sentiment, so I'll play devil's advocate a bit here.

I've tried gradual refactors in the past, with poor results, because unfocused technical teams and employee turnover can really kill velocity on long-term goals that take gradual but detailed work.

That is, replacing all those v1 API calls with the v2 API calls over five months seems fine, but there's risk that it actually takes several years after unexpected bugs and/or "urgent" feature releases come into play. And by that time, you might have employee turnover costs, retraining costs, etc.

I'm just saying the risk equation isn't as cut and dried as it seems. There is survivor bias in play in both the "rewrite it" and the "gradually migrate it" camps.

The rewrite only works - in my experience, YMMV - if the team is already 100% familiar with the codebase as it is and the task is a relatively simple one and there is a nice set of tests and docs to go with the whole package.

Outside that boundary you're set up for failure.

The one caveat is that there are times when the business realizes that their old workflows and features aren't what they now need. The rewrite becomes a new project competing with the old rather than a functional rewrite.

This is also fraught with peril. However, it is a different set of problems. In an ideal world, you have engineers who can make reasoned decisions.

However, if the company culture allowed one application to devolve into chaos, what will make the second application better?

You raise an excellent point and usually in tandem we educate management (not the tech people) on how they failed in their oversight and guidance role.

The real problem of course is to let things slide this far in the first place. But that's an entirely different subject, for sure the two go hand-in-hand and often what you touch on is the major reason the original talent has long ago left the company. By the time we get called in it is 11:58 or thereabouts.

At some point they'll junk the in-house program and buy something off the shelf.

Assuming something off the shelf is available, yes. In fact, if something off the shelf is available we'll be happy to make that recommendation, too many companies that aren't software houses suddenly feel that they need to write everything from the ground up. And even companies that are software houses suffer from NIH more often than not. (Though, I have to say that in my experience in the last couple of years or so this is improving, it used to be that every company had their own in-house developed framework but now we see more and more standardization.)

I agree about the YMMV part. The same caveats, small scope and developers with expertise, apply in the gradual migration plans as well in my experience. It's clearly true in the extreme cases (python2 -> python3) and I've seen the same patterns happen inside companies as well.

Looks like your goal was too ambitious. Your rewrite would suffer from even more unexpected bugs and the same urgent features, only worse, because you would have to fix them in two different systems. When your organization won't help you, you have to do less.

> bite-sized chunks is really the main ingredient to success with complex code base reformations.

An excellent talk about this is "The Scandalous Story of the Dreadful Code Written by the Best of Us" by Katrina Owen [0]

[0] http://www.kytrinyx.com/talks/scandalous-story/

Is anyone else flabbergasted by the amount of effort required to mock a function call in Go, as described by this talk?

Like, when at 3:20 the presenter says there's a thing you can do that makes it utterly trivial to test this feature, I immediately assumed she'll just have to write some mocks for the `comm` package, and plug that in. Cool, I guess she'll talk about a nice mocking library or something, or there's some business complexity involved where the comm package is particularly stateful and so difficult to mock.

But no. The big difficulty seems to be that the language doesn't allow you to mock package-level functions; and so before you can mock anything you have to introduce an indirection - add an interface through which the notify package has to call things, move the code in the comm package into methods on that interface, correct all code to pass around this interface and call methods on it.

Why would you choose to work in a language that makes the most common testing action so painful?

It shouldn't be 'the most common testing action'. In my mind, the number of mocks required for a test is usually inversely proportional to the quality of the code; if you need to mock out 20 random implementations to test something, you've either got an integration test masquerading as a unit test, or you've got very tightly coupled code. Mocks that need to be injected via monkey patching are worse than 'normal', dependency injected mocks. `quality = mock_count^-1 + monkey_patched_mock_count^-2`

Monkey patching is a sign of bad code in 99% of cases. In that 1% of cases where it might be justified, you can restructure your code to use indirection and dependency injection, and avoid having to use monkey patching. It might not be as nice as monkey patching in that 1% of cases. But I'd rather work in a language without monkey patching, precisely because it makes it incredibly obvious when you've coupled your shit.

Working in Go changed how I write my JS code. I don't know if you write much JS, but to my mind, `sinon` is mocking. `proxyquire` and `rewire` are monkey patching; monkey patching with the aim of helping mocking, but monkey patching none the less. My JS tests now don't use proxyquire or rewire, though they might use sinon. I find this produces easier to read code.
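To make the distinction concrete, here is a minimal sketch in Python (not Go or JS, so take it as an illustration of the two styles only). All names here (`send_email`, `notify_hardwired`, `notify_injected`) are hypothetical and not taken from the talk. The first test has to rebind a module-level name before it can observe the call; the second just hands in a fake.

```python
from unittest import mock


def send_email(address, body):
    """Stand-in for a real, side-effecting call (hypothetical)."""
    raise RuntimeError("would contact a real mail server")


def notify_hardwired(address):
    """Style 1: the dependency is a fixed module-level name."""
    send_email(address, "hello")
    return "sent"


def notify_injected(address, sender):
    """Style 2: the dependency is passed in by the caller."""
    sender(address, "hello")
    return "sent"


def test_hardwired():
    fake = mock.Mock()
    # Monkey patching: temporarily rebind the module-level name.
    # patch.dict on globals() keeps this sketch to a single file.
    with mock.patch.dict(globals(), {"send_email": fake}):
        assert notify_hardwired("a@example.com") == "sent"
    fake.assert_called_once_with("a@example.com", "hello")


def test_injected():
    calls = []
    # No patching: the test hands in a plain function that records its arguments.
    result = notify_injected(
        "a@example.com", lambda addr, body: calls.append((addr, body))
    )
    assert result == "sent"
    assert calls == [("a@example.com", "hello")]
```

Both tests pass; the difference is that the injected version makes the dependency visible in the function signature, which is the property being argued for above.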

Well, every external call is 'coupling' in your code. Whether it happens on an interface passed as an argument or by resolving the name in some other fashion doesn't really change how tightly coupled your code is.

To me, having to change a function into a method on a singleton interface just to be able to mock it for tests seems like working around inadequacies of the language. And I'm not sure why `module.Interface.method` is easier to read than `module.function`.

That really is an excellent talk, thanks for sharing.

> In retrospect it would've been much faster to just rewrite Knockout from scratch.

Why do you say that? The idea one could get it right writing from scratch is one of those seductive thoughts, but in my experience it never works out that way.

> Why do you say that? The idea one could get it right writing from scratch is one of those seductive thoughts, but in my experience it never works out that way.

Of course the alternate route – rewriting - is just a hypothetical so we can only suppose how it would've turned out.

That said, rewriting from scratch would've been pretty straightforward, since the design is pretty much set.

The real value of the existing code resides in the unit tests that Steve Sanderson, Ryan Niemeyer, and Michael Best created – since they illuminated a lot of weird and deceptive edge cases that would've likely been missed if we had rewritten from scratch.

So I suspect you are right, that it's just a seductive thought.

Amen. By the time (if ever) the rewrite has reached feature parity with the original, it is as bad as the original.

Also second system syndrome.

Nice work on Knockout refactor. We still are actively using KO in our core product, and it's nice to see some legs left in the framework.

KnockoutJS is hands down my favourite JS library of all time (it's a large part of why I build things in a better, more structured way; I was/am primarily a backend dev). It's awesome to see that it has a modern future, since I have quite a few projects using it and 'porting' will be a lot easier, so thanks for the amazing work you are doing :).

Can I still install and use it via a nuget package? It looks like it's integrated with all those crazy npm tools now, but I'm not sure if that's just for development or for usage too.

Sorry, @flukus, the alpha is not yet on nuget.

How does one get better if they only ever work in code bases that are steaming piles of manure? So far I've worked at two places and the code bases have been in this state to an extreme. I feel like I've been in this mode since the very beginning of my career and am worried that my skill growth has been negatively impacted by this.

I work on my own side projects, read lots of other people's code on github and am always looking to improve myself in my craft outside of work, but I worry it's not enough.

You can certainly improve some of your skills working on terrible code bases. For instance, you should become much better at debugging. You will have to learn debugging techniques and tools that you may never have had to use in other code bases.

Also, here is a paradox: take someone who has only ever seen terrible code bases and someone who has only ever seen very good code bases. How can they know? They might take a guess based on how well the software works, but that's probably not very reliable.

I think a good software engineer is someone who has seen a lot of different things, good and bad; someone who knows what design choices work and what will plunge software into the depths of Hell; probably someone who has made mistakes themselves and lived through the consequences.

But yeah, when working on such a code base, do read some code outside of it now and then, never forget there are better ways to do things. And if you are starting to feel burnt out by the quality of the code base you work on, you should probably make a change.

I think it's pretty common, and I think you're lucky.

I was surprised to see the article say "It happens at least once in the lifetime of every programmer,". I think if you work on greenfield projects your whole career you're likely the one who's creating these 'steaming piles of manure'.

By working on bad legacy projects you learn an awful lot of things about what works and what is a problem to maintain - it will make you a better developer.

The only issue is if you always work on legacy stuff and never get to write greenfield you might get typecast as such. Whether that is a problem or not is up to you. Sounds like you care enough that you can change when/if you want to.

I think you're setting up a false dichotomy. There are codebases other than just legacy and greenfield projects: high-quality, well-structured and well-maintained code.

I would agree that if all you work on is greenfield you're probably making the messes others are cleaning up, but I don't think that means developers are bound to either make messes or clean them up. There are plenty of good, long-lived projects out there.

Not every old project is legacy.

This is what I've been wondering about. I don't care if the stack isn't the newest, or the tech is the shiniest. I'm just more interested in working on code that was _engineered_. That is code that was designed and then built. That's the problem I have with most of the code I'm working in.

At my current place of work, we're not even using xmlhttprequest. We're using an antiquated xml library that's been hand rolled (xajax + major changes) to emulate our ajax requests. It's insanity to me that we're still in this mode.

> I think if you work on greenfield projects your whole career you're likely the one who's creating these 'steaming piles of manure'.

Eggzactly, well stated.

I generally clear my head by reading mailinglists and looking at how projects of my interests do things and keep their commits in order, especially around bugfixes. OpenBSD is a fun one to read through as well as others. I also go to/watch talks about people managing their own piles of manure and change processes.

As long as you keep your eyes open to other people doing what your organization is struggling with right the first time it gives sufficient motivation to approach every problem with 'why is this here and how could we do this better.' The great thing about the state of F/OSS right now is that you have codebases that have to change because of things like large amounts of RAM being so cheap- that very well understood algorithm designed to only do things in 64MB so as to not swap out no longer makes sense and so there are intelligent motions to fix it. I've been planning on reading the Postgres 9.6 changes for parallel queries to understand how they did the magic in a sane and controlled manner and shipped a working feature.

> I've been planning on reading the Postgres 9.6 changes for parallel queries to understand how they did the magic in a sane and controlled manner and shipped a working feature.

Very incrementally - we've been adding more and more infrastructure since PostgreSQL 9.4, which finally became user visible with some basic parallelism in 9.6, and which will be greatly expanded in 10. There are some things that we'd have done differently if we'd started in a green field, that we had to do less optimally to avoid breaking the world...

Thanks for the response, this makes me feel a bit more positive.

It sounds like you at least have a good feel for what's bad and what's worse (which is good).

I think one thing you can do is attempt to isolate the code surrounding the next chunk you work on. Do as much as you reasonably can of the things the article mentions. This may only be writing tests and adding logging, but if it's an improvement over what's there, you'll improve the experience of the next person involved with that code.

I'd warn you against jumping ship in hopes of finding a "clean" code base. Most code is somewhere on a spectrum of "maintainable enough" and something... grimmer.

If you really are unhappy and don't feel like you're growing or have the ability to grow, maybe try out contributing to a well-maintained OSS project. If you find yourself immensely happier, dust off your resume ;)

While starting out, knowing what not to do and precisely why is nearly as important as knowing what worked. In the case of good codebases and bad codebases though, you still need to be careful not to cargo-cult wholesale the architecture that worked before, and conversely not to do everything different from the last horror you worked in. View it as a learning opportunity as you debug: some things they will have gotten right, sounds like many things have been gotten wrong, but the process of reasoning them apart is still valuable.

All that being said, certainly do not hesitate to look around if you feel like you aren't growing as fast as you could be. Life is short and it's a seller's market for engineering labor in most places I have seen.

Yeah, right now I haven't worked at my current place long enough to leave. I want to put in enough effort on my part to warrant feeling like I haven't grown enough too. I've decided that the best thing I can do is focus on what I'm not doing well enough or consistently enough until I feel like I've covered all my bases/can't learn any more on my own.

The main problem I have is how to structure what it is I'm trying to improve upon. I also want more external perspective to help guide me towards becoming better in the web development field, but I don't feel like the company I'm at has developers with a modern web development skillset to offer that guidance.

Unfortunately I work in an area where the web developer talent is pretty shallow. The general programmer talent pool is deep, but I still feel like the specialization towards webdev and modern practices just aren't here.

IMO, it really depends on the context. If you are working with people who share your assessment of the current situation (both business and technical folks) and want to improve it, you'll have a great chance to learn from others' (and your) mistakes.

However, constantly putting out fires, under the gun, in horrible code bases, is probably not a good way to learn how to design software... It's a good way to learn how to debug and reason about problems, which is also a valuable skill to develop, though.

Start your own company? But even that I think is futile.

The causes of manure code are usually out of your control - tight deadlines; new devs touching stuff without properly understanding the whole; organization prioritizing short-term reward over long-term sustainability.

You also have to consider the inherent survivorship bias - only successful businesses live long enough that their codebase has time to grow into a big mess. Any company that lives more than a few years inevitably ends up with "manure". You'd have to be in the extremely rare position where you are profitable and have no pressure to keep growing (investors) in order to invest enough time into technical craft to not end up with manure code.

You can learn a lot from mopping up steaming piles of manure. Recognizing what manure is and the thought processes/business incentives that produce it will be helpful to you in not making your own.

Also, even if your current codebases are manure, that doesn't mean everyone in your company makes manure. Find people on your team who don't write it, and learn from them.

If nobody is like that in your company, then maybe you should change jobs if you've been there more than two years. Cleaning up manure helps with interviewing because you can share your war stories with the interviewer.

I'd never trust a developer that's only worked on green field projects, they're oblivious to the mess they leave because they aren't there long enough to feel the pain of their design decisions. So you've got one up on a lot of people there.

Aside from your own projects, look for opportunities for other projects at work where you can start with a fresh technology stack. Some of these projects might be taking over the non-core functions of the main app. For instance, chances are a lot of the UI is sub-optimal (generic crud based) for some specific users. You might be able to create a slicker interface that makes it easier for them to do specific tasks that feed that data into the main database.

Frankly, I never quite understood the importance of clear documentation until I found one such code base smoldering on my porch.

At the very least, write a doc that explains how to build the product, including where to find the parts in source control, what the dependencies are, what servers it'll get installed on, and so on.

The goal being to increase your shop's "Bus Factor"


> At the very least, write a doc that explains how to build the product, including where to find the parts in source control, what the dependencies are, what servers it'll get installed on, and so on.

... in the form of a Jenkins build configuration. (If possible; if the system requires legacy compilers that only run on old Windows versions or a proprietary compiler for an embedded target, good luck.)

The legacy projects are the ones where a doc with all that info would be the most useful. :)

Something to be cognizant of when creating this "keystone" doc - not losing it on some "share" that nobody can find anymore.

Thus, the use of README in the root directory of a project.

Since it would contain the location of the root directory, putting it there would be circular. Hopefully the organization has a central location for their documentation that is somewhat organized (via SharePoint, or even a network share with folders). Reducing the number of things that a new hire would need to "just know" to a minimum should be a goal.

> Hopefully the organization has a central location for their documentation that is somewhat organized...

My office uses a combination of Redmine, Slack, email, gitlab, network drives, google docs, dropbox, some pdfs floating around, and a readme in the root of each repo...

Exactly. Because...

It started out on...

That Novell drive

... but then...

That Win95 share

... Samba

... Wiki

... Sharepoint

... nah, let's start using Confluence now


If you have it, it's nice to have a build/CI server that has a UI showing all the projects in a dept/work-group and where they come from in source control.

I love the notion of bus factor. Whenever a bunch of devs go out drinking I think this every time we cross the street. :)

I've learned an enormous amount from fixing terrible bugs in terrible code.

One tip is that when you've finally found the actual line(s) with the bug, always try to understand why the programmer made that mistake.

This has taught me much about what constructs are error prone.

This is an interesting one. I'll have to think about this while I'm taking notes about bugs.

I hate to say it, but I think the answer is "with difficulty".

From my own experience, it's really hard to know what's bad, what's good and what's an acceptable workaround if you've never seen anything different. Myself, I got lucky and ended up working on a project after the start of my career with someone who could explain the whats and (more importantly) the whys of bad/good/ugly code bases.

Generally, try and get some skill in being able to view a codebase from a high level. Draw it out on a whiteboard in boxes. Perhaps do this on other, pet projects first as it's nearly impossible to do this with a spaghetti-code project. If you can't pick out modular parts, then you have a big ball of mud. If you can, try and work on making and keeping them uncoupled. If you can, try and work on finding the natural boundaries of the other code you couldn't break up, and make those less coupled (you don't need to solve the coupling problems all at once!).

Is there a mix of architectural patterns in the code? This is pretty common when you're working on a legacy project. It's what happens when you get someone who doesn't really know how to architect, or there were a bunch of folks throughout the history of the project who (probably) had the right intentions, but didn't get it finished. Or, and this is the worst, you had two or more team members trying to bend the project to their own preferences without communicating with each other. If this is the case, talk to your team, agree on one and then you can work towards getting the style consistent. You don't even need to pick the best one. Getting a project into a consistent state is better than having an ugly mix and match.

Are there a bunch of mixed up design patterns floating around? Try and refactor those out as much as possible. Design patterns are great, and you should use them where appropriate. But if you find a lot of them nested within each other, it's not a good sign and probably indicates someone at some point swallowed a design pattern book and thought it would be a good idea to implement them. All of them. Nested patterns can more than likely be refactored out to simplify the code. Though again, make sure you understand what they are there for first. Otherwise you may be unpicking something intentionally complex that needs to exist to remove complexity elsewhere.

What does the DB look like? Is it designed around the project's business logic? Is this sensible for your project? Personally, I dislike putting any business logic into the data storage layer but it might be sensible for your particular project, so YMMV. If business logic in the DB is causing nasty workarounds, then you may have something else to refactor there, though this may not be possible.

Never refactor just for the sake of it! If you don't have buy-in for your ideas on how to improve a code-base from the rest of your team, you're going to be creating problems. You may also be missing critical information that your tech-lead knows about and made design decisions based on it. There have been several times I've tried to make things better as a Junior dev, only to find out I'd made some bad assumptions and created a mess.

Don't refactor without tests either. The system may be reliant on strange code, so make passing tests before changing things. That way you at least know the behaviour hasn't changed.
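One way to "make passing tests before changing things" is a characterization test: record what the code does today, quirks included, and require the refactored version to reproduce it. A rough sketch in Python, where `legacy_price`, `refactored_price`, and the input cases are all made up for illustration:

```python
def legacy_price(quantity, unit_price):
    """Hypothetical legacy function with quirks that callers may rely on."""
    total = quantity * unit_price
    if quantity > 10:
        total = total - total // 10  # integer division, so the discount rounds down
    if quantity == 0:
        return -1  # odd sentinel value; something downstream might depend on it
    return total


# Inputs chosen to hit every branch, including the strange ones.
CASES = [(0, 5), (1, 5), (10, 5), (11, 5), (11, 7)]


def record(func):
    """Capture what the code does today, right or wrong."""
    return {case: func(*case) for case in CASES}


def check(func, baseline):
    """Return the cases where func no longer matches the recorded behaviour."""
    return [case for case in CASES if func(*case) != baseline[case]]


def refactored_price(quantity, unit_price):
    """Cleaned-up version that must behave identically, quirks included."""
    if quantity == 0:
        return -1
    total = quantity * unit_price
    discount = total // 10 if quantity > 10 else 0
    return total - discount


baseline = record(legacy_price)
assert check(refactored_price, baseline) == []
```

The baseline says nothing about whether the old behaviour is correct; it only tells you when a change has altered it.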

> Do not ever even attempt a big-bang rewrite

I'd love to hear a more balanced view on this. I think this idea is preached as the gospel when dealing with legacy systems. I absolutely understand that the big rewrite has many disadvantages. Surely there is a code base that has features such that a rewrite is better. I'm going to go against the common wisdom and wisdom I've practiced until now, and rewrite a program I maintain that is

1. Reasonably small (10k loc with large parts duplicated or with minor variables changed).

2. Barely working. Most users cannot get the program working because of the numerous bugs. I often can't reproduce their bugs, because I get bugs even earlier in the process.

3. No test suite.

4. Plenty of very large security holes.

5. I can deprecate the old version.

I've spent time refactoring this (maybe 50 hours) but that seems crazy because it's still a pile of crap, and even at 200 hours I don't think it would look that different. I doubt it would take 150 hours for a full rewrite.

Kindly welcoming dissenting opinions.

What tends to happen as you refactor bad code is that you gain some intuition about the way the code needs to flow. The longer you spend grinding away at the existing code, the more likely it is that rewriting it will work, because you'll have pent-up "architectural energy" waiting to be used, and good, already-debugged code from the previous version that can be copied in.

The most likely trigger for crossing the threshold from refactor to rewrite, while steering clear of the "big bang rewrite", is that you have to ship a feature that triggers an end-run around some of the existing architecture. So you ship both new architecture and the new feature, and then it works so well that you can deprecate the old one almost immediately, eliminating entire modules that proved redundant.

Edit: And if you don't really know where to start when refactoring, start by inlining more of the code so that it runs straightline and has copy-pasted elements(you can use a comment to note this: "inlined from foo()"). This will surface the biggest redundancies at a minimum of effort.

10k loc is very very minor league. You can do anything you want on a base that size, it won't matter.

100's of thousands to millions of loc is a lot more problematic, many moving parts and weird interplay is to be expected.

I understand that it likely "won't matter". My point was to ask if it was worth talking about outliers to the Never Rewrite law.

eg it's assumed when talking about refactoring over rewriting that a large portion of features is working. There should be some percentage where it's worth rewriting over refactoring. Or perhaps a size where it's small enough to easily rewrite.

Yes, that's definitely a discussion worth having.

To me you can rewrite anything that:

(1) you fully understand (and you'd better be right about that)

(2) you have total control over already

(3) is small enough for (1) and (2) to be possible

(this is where I think a lot of people over-estimate their capabilities)

(4) where you have the ability to absorb a catastrophic mistake

(which usually isn't the pay-grade of the programmers)

and finally

(5) where you have a 'plan-B' in case the rewrite against all odds fails anyway

None of these are absolutes, if there is no business riding on the result then you can of course do anything you want. The history of IT is littered with spectacular failures of teams that figured they could do much better by tossing out the old and setting a date for the deploy of the shiny new system. Whatever you do make sure that your work won't add to that pile.

The older, larger, more poorly documented, and worse tested the system is, the bigger the chance that it is not fully understood.

Joel Spolsky has a pretty good rundown. The biggest takeaway for me was that legacy apps usually don't have clear reproducible requirements. All the corner cases are written down in one place: the old code. Throwing that out means you'll recreate most of the bugs that were already fixed in the old system.

It is painful to look at and work with the old code, so we want to avoid it. But some things worth doing are painful, like exercise, or getting a cavity filled.

[edit] https://www.joelonsoftware.com/2000/04/06/things-you-should-...

Your example is much smaller than what people are usually talking about in terms of big-bang rewrites. So maybe you will be successful.

Even so, you're better off doing a step-by-step rewrite, where the new stuff and the old stuff coexist in a single application. That way your users can continue getting incremental benefits over time even if the rewrite takes dramatically longer than your optimistic estimate.

If you can't figure out how to manage the complexity of a piecemeal rewrite, consider that you may not actually understand the system well enough to avoid making version 2 just as bad as version 1.

Most people overestimate their ability to act differently than they've acted in the past. It's like the unjustified optimism of a New Year's resolution that this time you're actually going to exercise every day. To get a better result than last time, you need to impose some very clear rules on yourself that cause you to work differently.

Not attempting a full rewrite of a significant codebase is excellent advice because it's usually the right advice.

That's not to say that it can never be successful, just that the circumstances in which it will are sufficiently rare that it's usually worth discounting relatively early on.

In >20 years of dev experience, I can only think of one occasion where I successfully did a big bang rewrite i.e. tore down an application and restarted it with an equivalent system that had approx zero common code.

In that case, it was a C++ program that wouldn't actually build from clean. A lot of the code was redundant as the use cases had morphed over time (and/or weren't ever required but were coded anyway) and most changes were stuffed into base classes as it was effectively impossible to work out how objects interacted. Releases took about 3 months for about 2 weeks' worth of dev.

Initially, I didn't plan to rewrite it. When I realised I couldn't understand what it was doing, I took a step back and worked out what it should have been doing, assuming that I could map one to the other. What I found was that, at heart, it should have been doing something fairly simple but that the original "designers" had thrown the kitchen sink at it and its core function was lost in the morass.

I also came up with a way of making it easy to show that the new system was correct more deeply than just tests. This gave me, and folks I needed to convince, a lot more confidence that a rewrite made sense than would normally be the case.

In summary, it was quite a rare set of events that led me to the conclusion that a rewrite was the right direction: the existing system being a complete basket case, my happening to have a lot of domain expertise, the problem space turning out to be relatively simple and finding a way to "prove" correctness, all contributed. I doubt I would have made the same decision if any of them were different.

There are cases where you can do a rewrite, but still avoid the big-bang cutover, by exposing the new app only to some subset of customers or transactions. That isn't possible with every app, of course.
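As a rough illustration of exposing the new app to a subset of customers, here is a sketch in Python. It assumes requests can be keyed by a customer ID string; `old_handler`, `new_handler`, and the 100-bucket scheme are hypothetical choices, not something from the article.

```python
import hashlib


def in_rollout(customer_id, percent):
    """Deterministically place a customer in or out of the first `percent` of 100 buckets."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def old_handler(customer_id):
    """Stand-in for the existing application."""
    return f"old:{customer_id}"


def new_handler(customer_id):
    """Stand-in for the rewrite."""
    return f"new:{customer_id}"


def handle(customer_id, percent):
    """Route a fixed subset of customers to the rewrite, everyone else to the old app."""
    handler = new_handler if in_rollout(customer_id, percent) else old_handler
    return handler(customer_id)
```

Because the bucket depends only on the customer ID, a given customer stays on the same side between requests, and raising `percent` only ever moves customers from old to new. Setting it back to 0 is the fallback.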

I think the gospel view is when you have to do both...rewrite and big bang cutover. Especially when there is no obvious fallback.

Not a dissenting opinion but I'd love to see some case studies on rewrites. As a consultant this is a frequent request and will probably be big business in the future as people migrate off of expensive legacy mainframe or other applications from the 80's, 90's, and possibly 2000's.

It's not "rewrite" that's bad, it's thinking you can cut over to a new system in a "big bang".

Rewrites are definitely common and beneficial, but the successful ones always run the new code and the old code side-by-side for an extended period of time. Which means you're still tending and caring about the old code, even as you strive to direct most of your effort into the new code.
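A rough TypeScript sketch of what running side-by-side can look like, assuming both implementations are plain functions of the same input. `runSideBySide`, `Mismatch` and `report` are hypothetical names, and comparing via JSON is a simplification that only suits plain, serializable results:

```typescript
// Hypothetical sketch of a parallel run: the old code stays the source of truth
// while the new code is exercised on the same input and compared against it.

type Mismatch<I, O> = {
  input: I;
  legacy: O;
  candidate: O | undefined;
  error?: string;
};

function runSideBySide<I, O>(
  legacy: (input: I) => O,
  candidate: (input: I) => O,
  report: (mismatch: Mismatch<I, O>) => void
): (input: I) => O {
  return (input: I): O => {
    const expected = legacy(input);
    try {
      const actual = candidate(input);
      // Assumption: results are plain data, so a JSON comparison is good enough here.
      if (JSON.stringify(actual) !== JSON.stringify(expected)) {
        report({ input, legacy: expected, candidate: actual });
      }
    } catch (e) {
      // A crash in the new code is reported but never reaches the caller.
      report({ input, legacy: expected, candidate: undefined, error: String(e) });
    }
    // Callers always get the legacy answer until the new code has earned trust.
    return expected;
  };
}
```

The reported mismatches tend to be exactly the edge cases the old code handles and the new code has not learned yet.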

There's the Sivers CD baby to rails (fail) and back to PHP (success) case https://sivers.org/rails2php

The application may be a steaming pile of crap, but you probably don't have as much knowledge of the problem domain as the creators did. You will get there, over time. Starting a complete rewrite throws away the bad parts, but it also throws away accumulated knowledge.

I'd say as long as you've read and understand nearly every line of code in the old system, you're good to rewrite it from scratch.

And if you write the test suite first, you're in a much better position to do this successfully.

I also disagreed with that part in the article. Big-bang rewrites can be just fine - but usually there are reasons it's not possible.

> Before you make any changes at all write as many end-to-end and integration tests as you can.

I don't agree with this. People can't write proper coverage for a code base that they 'fully understand'. You will most likely end up writing tests for very obvious things or low-hanging fruit; the unknowns will still seep through at one point or another.

Forget about refactoring code just to comply with your tests and breaking the rest of the architecture in the process. It will pass your 'test' but will fail in production.

What you should be doing is:

1. Perform architecture discovery and documentation (helps you with remembering things).

2. Look over last N commits/deliverables to understand how things are integrating with each other. It's very helpful to know how code evolved over time.

3. Identify your roadmap and what sort of impact it will have on the legacy code.

4. Commit to the roadmap. Understand the scope of the impact for anything you add/remove. Account for code, integrations, caching, database, and documentation.

5. Don't forget about things like jobs and anything that might be pulling data from your systems.

Identifying what will be changing and adjusting your discovery to accommodate those changes as you go is a better approach from my point of view.

By the time you reach the development phase that touches 5% of architecture, your knowledge of 95% of design will be useless, and in six months you will forget it anyways.

You don't cut a tree with a knife to break a branch.

I agree with you, I don't know why you were downvoted. In my experience the first and biggest problem when taking over legacy codebases is the lack of knowledge of what features the code is supposed to support. Just starting out with writing integration tests carries a risk that you end up with even more meaningless code to maintain.

Actually, contrary to the advice of the writer, I like to start out with fixing some bugs. I find it a great way to gain some knowledge and it has the added benefit of keeping business stakeholders happy. And while fixing those bugs you can start writing the first integration and unit tests.

I gasped when I saw this article at the top of HN due to the relevance of it right now in my life. I am currently working on a real monolithic jambalaya that suffers from a lack of documentation, architecture, extreme abstraction, rampant tight coupling and no previous source control.

Your point on performing architecture discovery and documentation is spot on. It has really helped me to strip away the mess and understand the flow of the logic and maybe even shine some light on the parts of code that are valuable.

This article is painfully relevant for me. I just reviewed a code base with zero tests, no documentation, no inheritance, and rampant duplication.

It's a simple event tracking system and yet there are 75 models, and over 80 controllers. This was outsourced to a team which coincidentally appears to have close to that many devs working there. The good news is that according to the client "it pretty much works". I know better than to suggest a Big Bang - though it seems so appealing.

Documentation and code freeze are my next steps and implementing end to end testing.

> I gasped when I saw this article at the top of HN due to the relevance of it right now in my life.

You are not the only one :)

How do people handle this in dynamic languages like JavaScript? I have done a lot of incremental refactoring in C++ and C# and there the compiler usually helped to find problems.

I am now working on a node.js app and I find it really hard to make any changes. Even typos when renaming a variable often go undetected unless you have perfect test coverage.

This is not even a large code base and I find it already hard to manage. Maybe I have been using typed languages for so long that my instincts don't apply to dynamic languages, but I seriously wonder how one could maintain a large JavaScript codebase.

I think you just captured the essence of why microservices are so popular. Dynamic languages just don't scale to large codebases, so there's enormous pressure to decompose software into chunks that can be digested more easily.

Some amount of this is good, but it often forces the chunk boundaries to be smaller than the "natural" clumping of data and behavior in a distributed system. IMHO this is a much worse problem than a messy monolith; you can refactor a monolithic codebase to be more modular, but refactoring hundreds of microservices is a herculean endeavor.

My problem with microservices is the word micro.

> Dynamic languages just don't scale to large codebases

You mean "popular" dynamic languages due to their lack of tooling. Dynamic languages like Smalltalk scale up just fine, but Smalltalk has automated refactoring tools. In other words it's a tool support problem, not a dynamic language problem.

> Dynamic languages just don't scale to large codebases

Static languages scale to large codebases. There's no app that a static language (and those who insist on static types) can't turn into a much larger codebase :-)

I love the imagery of "mountains of dirt": http://steve-yegge.blogspot.com/2007/12/codes-worst-enemy.ht...

"hundreds of microservices"

I can't imagine a scenario where you need hundreds although I don't doubt that people will create such a system.

Do not underestimate architecture astronauts, ever.

I was reading "Building Microservices" by Sam Newman, he mentioned that some of his clients moved from monolith to 300+ microservices without going into details, so yeah, that made me wonder about it as well. (it was a decent book otherwise).


Or non-generalized, custom hard-coded static typed end-points for every single reference/option list and workflow state transition.

Welcome to our little "full 'big bang' rewrite" Frankenstein 4.0 :-(

Not my idea...

Try TypeScript.

Though I wouldn't think that test coverage needs to be "perfect" to catch a bad variable name, but maybe that's why there's so much obsessive tooling when it comes to coverage in the JavaScript world.

Don't forget that JS is often in a UI, doing asynchronous event/IO handling, so testing timing is important, not just spelling. (great, that's exactly the property names that object would have had, if it existed yet)

That, and it's often reading in data (JSON or XML) from another system, and it is what it is, so see if it quacks or not.

From the people that brought you SOAP, it's (drum roll) TYPE SCRIPT!

It's not really solving my problems, just making more work.

So because Microsoft made SOAP and also made TypeScript then TypeScript must be bad? That's nonsense.

Also, I'm not a frontend guy, and the comment I was replying to was talking about node.js, but having to put a setTimeout or something in your tests just seems wrong.

On the one hand, argumentum ad hominem is a logical fallacy.

On the other hand, expecting not to be bit after the last dozen times seems kinda stupid. I'm not a fan of MS.

re: timeout. Yeah, waiting for one (or many!) other async operation(s) to complete in response to an event is a nuisance, but that's how it works, in particular if you don't want a UI to freeze up.

Full stack is hard, at least if somebody wants you to swap in and out of levels several times a week. But that's another rant about ruining projects...

Let me try again.

Don't rename properties that "escape" from a given context. Sorry, but it's not going to be a good use of your time. Do document (JSDoc or similar) what the property is used for and why (as far as you can tell)

It's OK to rename local variables and parameters (the "root" local identifier, not the properties), though.

It might not be Smalltalk (I wouldn't know), but the JetBrains IDE support for JS is pretty good in terms of type inference, "where defined" lookups, "show documentation" support, duplicate / undefined symbol detection and other stuff I'm probably forgetting at the moment.

Seriously, though, avoid the traditional class/constructor/prototype setup (rather than short lived object literals as parameter objects and return values). It makes things too widely visible, and harder to safely change later. And it's more work, anyway.

Learn how to refactor a nested function which uses closure values into a reusable function with a longer argument list on which you can use partial function application as a form of dependency injection - or the other way around, for something used only one place.
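To make that refactoring a bit more concrete, here is a small TypeScript sketch. It is only one way to do it, and `makeReporterBefore`, `formatLine` and `formatTotal` are made-up names for illustration:

```typescript
type Logger = (msg: string) => void;

// Before: `format` is nested and quietly reads `prefix` and `log` from the closure.
function makeReporterBefore(prefix: string, log: Logger): (value: number) => string {
  function format(value: number): string {
    const line = `${prefix}: ${value}`;
    log(line);
    return line;
  }
  return format;
}

// After: the same logic as a reusable top-level function,
// with every former closure value spelled out as an argument.
function formatLine(prefix: string, log: Logger, value: number): string {
  const line = `${prefix}: ${value}`;
  log(line);
  return line;
}

// Partial application fixes the dependencies once. Production code would pass a
// real logger, while a test can inject a fake one without touching any globals.
const formatTotal = (log: Logger) => (value: number): string =>
  formatLine("total", log, value);
```

Going the other way (folding `formatLine` back into a nested function) is the same steps in reverse, and is reasonable when there is only one caller.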

An important lesson in managing code in a dynamic language is to limit the scope of everything as much as possible. Software designed as a cluster of many mutable singletons is going to hurt.
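As a tiny illustration of limiting scope, compare a function that reads a mutable module-level object with one that receives a short-lived, read-only value. This is only a sketch; `globalSettings`, `priceLabelGlobal`, `priceLabel` and `PriceOptions` are hypothetical names:

```typescript
// Before (sketch): a module-level mutable singleton. Any code anywhere can change it,
// so the result of priceLabelGlobal depends on whoever touched it last.
const globalSettings = { currency: "EUR" };

function priceLabelGlobal(amount: number): string {
  return `${amount.toFixed(2)} ${globalSettings.currency}`;
}

// After: the data the function needs is passed in as a short-lived, read-only value.
// Finding every place that influences the output becomes a matter of following callers.
type PriceOptions = { readonly currency: string };

function priceLabel(amount: number, options: PriceOptions): string {
  return `${amount.toFixed(2)} ${options.currency}`;
}
```

With the second form, a rename or a change of behaviour only needs the call sites checked, not every piece of code that could reach the singleton.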

OOP was the hotness in the 80s. It's time to learn other paradigms, too (move from the '60s to the '70s), even if IDE designers have to update how "intellisense" (aka auto-complete) works :-)

I've found integration testing to be very useful when dealing with JavaScript web stuff. If the desired output looks correct, you can usually work with the understanding that the JavaScript did its job.

See Selenium. http://docs.seleniumhq.org/

It's not exactly a general solution fit for all situations and persons and purposes, but there's always TypeScript.

TypeScript's increasing ability to type check JS code without modification (especially if it already has JSDoc comments or is already using npm-installed libraries with type information) is making it a better fit as a solution for more situations.

Contributing factor: thinking in terms of Simula67 / C++ / Java / C#. (stop doing so)

Since property names are dynamic, avoid making data global (singletons, et al) at all costs, to limit the amount of string searching and informed "inferences" you have to make. Using a more functional programming style that tracks data flow of short lived data works better than trying the "COBOL with namespaces" approach of mutable data everywhere that gets whacked on at will.

Sorta ironic: monstrous, so called, "self documenting" identifier names are not a good idea in a dynamic language. A short (NOT single letter, long enough to be a memorable mnemonic) identifiable name is more likely to be typed and eye-ball checked correctly.

There is no "self documenting" code - literate programming is your friend, or at least JSDoc is. It's not practical to put "why" something is into its name.

Of course, if you inherited some hot mess written by a hard-core Java / C# programmer, yeah, life is gonna suck :-(

Disclaimer: I've been doing a lot of Angular the last couple of years, which is overly reliant on long-lived, widely visible, mutable data. I would rather go the route of something like Redux than TypeScript, though. (I suppose you could do both, but I want to NOT do TypeScript if I can help it)

I've also worked with a number of languages that had runtime types and/or that allowed some kind of "string interpolation" for identifiers here and there since the 80s. No biggie.

Buh, buh, buh, TYPESSSS!!! Yeah, so. Let's talk about excessive temporal coupling, (mutable) OOP (only) folks...

<RANT ENDS> (for now)

You'll find these issues with pretty much any dynamic or scripting language once the codebase becomes large enough.

For JavaScript, your best bet would be to integrate external tooling (such as JSHint) into your gulpfile or grunt.

I used to work on a messy legacy codebase. I managed to clean it, little by little, even though most of my colleagues and the management were a bit afraid of refactoring. It wasn't perfect but things kinda worked, and I had hope for this codebase.

Then the upper management appointed a random guy to do a "Big Bang" refactor: it has been failing miserably (it is still going on, doing way more harm than good). Then it all started to go really bad... and I quit and found a better job!

All of this seems to focus on the code, after glossing over the career management implications in the first paragraph.

I've done this sort of work quite a number of times and I've made mistakes and learned what works there.

It's actually the most difficult part to navigate successfully. If you already have management's trust (i.e., you have the political power in your organization to push a deadline or halt work), you're golden and all of the things mentioned in the OP are achievable. If not, you're going to have to make huge compromises. Front-load high-visibility deliverables and make sure they get done. Prove that it's possible.

Scenario 1) I came in as a sub-contractor to help spread the workload (from 2 to 3) building out a very early-stage application for dealing with medical records. I came in and saw the codebase was an absolute wretched mess. DB schema full of junk, wide tables, broken and leaking API routes. I spent the first two weeks just bulletproofing the whole application backend and whipping it into shape before adding new features for a little while and being fired shortly afterwards.

Lesson: Someone else was paying the bills and there wasn't enough visibility/show-off factor for the work I was doing so they couldn't justify continuing to pay me. It doesn't really matter that they couldn't add new features until I fixed things. It only matters that the client couldn't visibly see the work I did.

Scenario 2) I was hired on as a web developer to a company and it immediately came to my attention that a huge, business-critical ETL project was very behind schedule. The development component had a due date three weeks preceding my start date and they didn't have anyone working on it. I asked to take that on, worked like a dog on it and knocked it out of the park. The first three months of my work there immediately saved the company about a half-million dollars. Overall we launched on time and I became point person in the organization for anything related to its data.

Lesson: Come in and kick ass right away and you'll earn a ton of trust in your organization to do the right things the right way.

Big bang rewrites are needed in order to move forward faster.

A huge issue with sticking to an old codebase for such a long time is that it gets older and older. You get new talent that doesn't want to manage it and leaves, so you're stuck with the same old people that implemented the codebase in the first place. Sure, they were smart, knowledgeable people in the year 2000, but think of how fast technology changes. Change, adapt, or die.

A big bang rewrite will nine out of ten times slow you down, it will not accelerate things, and the most likely outcome is that not only will it be slower, it might fail entirely.

It's a complete fallacy to think that you're going to do much better than the previous crew if you are not prepared to absorb the lessons they left behind in that old crusty code.

So you'll have to learn them all over again.

> Change, adapt, or die.

Die it is then.

It's not a given that legacy code means "no people still around, no docs and no tests". I'm on a rewrite project and I'm 10 years in, and the whole crew from the last project (also around 15 years) is still in this project too. That helps.

The causes of the big bang rewrite are usually not just "this code smells, let's rewrite it" but rather that the old product reached some technical dead end. Perhaps it can't scale. Perhaps it's a desktop product written in a UI framework that doesn't support high DPI screens and suddenly all the customers have high DPI screens. Obviously in that situation you'd aim to just replace a layer of the application (a persistence layer, a UI layer) but as we all know that's not how it works. The cost of a rewrite shouldn't be underestimated - as you said there is no reason to believe that if it took 50 man-years for the last team then the new team will take 50 too. But that is in itself not a reason to not do it.

Fair enough. So the real lesson then is 'it depends', as with everything else. But the kind of jobs where the cleanup crews get called in are on the verge of hopeless and it is not rare that we do these on a 'no-cure, no pay' basis.

Great to see you be part of such a long-lived team; that's a rarity these days. That's got to be a fantastic company to work for. Usually even relatively modest turnover (say 15% per year) is enough to effectively replace all the original players within a couple of years, and most software projects long outlive their creators' presence at the companies they were founded in. Add in some acquisitions or spin-outs and it gets to the point where nobody even knows who wrote the software to begin with.

Software is only as good as the people that write it. In an ideal world, you'll have a team that specializes in this sort of thing, can understand the business needs, and can get it done.

There are always risks with every action taken. You can't be scared to take a big risk for a bigger payout versus sucking it up and doing things the way they've been done for 15 years.

All software is written three times.

First, to learn the problem. Second, to learn the solution. Thrice, to do it right.

Skip a step at your own peril.

Incrementalism and do-over both have their place.

If you're resurrecting legacy code, I can't imagine successfully rewriting it until after you understand both the problem and solution. Alternately, change the business (processes), so that the legacy can be retired / mooted.

Anything that fails isn't moving anything anywhere.

And in the type of place that has a dysfunctional, legacy software system running core business operations, don't count on all the other ducks being in a row (anything resembling agile, ability to release to prod on a reasonable cadence, ability to provision sufficient test data, working test systems, etc).

If it's an internal system that you've been working on and maintaining... for 10 years... maybe (just maybe). If you're a consultant stepping in, I wouldn't touch that option for love or money.

This might be true (big bang rewrite) for small web sites or non business critical utility software, but if your cash flow depends on the software, you do not want failing software to stop that cash flow.

Correct- rewrites should be done in tandem with maintaining legacy systems until the new system is finished.

There should be some sort of overlap before completely sunsetting the old system.

I think this mostly applies to newer startups who have changed teams with no or little hand off.

Fighting technical debt is hard. Fighting it with a blindfold is harder. Fighting it with 0 frame of reference is daunting. Fighting it while the rest of the company is demanding new features right now is a recipe for stagnation, bugs, and burnout.

The problem with big bang rewrites is you end up falling in every trap the original developers fell into.

It is amazing how much of our profession's knowledge ends up as an odd if statement buried deep in the code of some method or stored procedure, dealing with an edge case that gets missed in the big bang rewrite. It's also amazing how much money the failure to preserve that knowledge can cost.

I wonder if it's time for professional software archeologists?

> I wonder if it's time for professional software archeologists?

No, but it is time to make a real effort to teach the lessons learned to newcomers. I really feel that as an industry we completely fail at that. Blog posts such as these are my feeble attempt at trying to make a contribution to solving this problem.

Writing software is still "creative" and less "engineering". There aren't many ways to build a bridge but many ways to express yourself in language. Natural language or computer language that is.

Add this to "business requirements" and you get the big pile of manure we walk in every day. Like how does the knowledge of IEEE754 help me if the requirement is to sum up some value of the last three days, unless the last three days are on a weekend or holiday. (ok, stupid example) The point is domain language does not translate to computer language very well and a programmer is not a domain expert. He is .. just a programmer, a creative programmer, and we are millions each doing their thing a little different.

Strangely, at all the big corp jobs I've been at, the good programmers have become domain experts as part of the job. How else could they have a real feel for what the business needed and if the code was correct?

Oh yes, I totally agree with you on this. But it is a long road to become a domain expert. Don't forget people eventually leave jobs and you get new hires. Leaves lot of room for errors.

The issue to think about is - if you don't know enough to "upgrade/replace in place" - then you probably won't know enough to rewrite from scratch.

That's why you need product owners/managers to cover the business logic and design a system that incorporates it. There need to be tiers of coordination to make sure a system is built to spec. TDD plays a big part in rebuilding legacy codebases.

The OP has so much reasonable, smart-sounding advice that doesn't work in the real world.

1) "Do not fall into the trap of improving both the maintainability of the code or the platform it runs on at the same time as adding new features or fixing bugs."

Thanks. However, in many situations this is simply not possible because the business is not there yet so you need to keep adding new features and fix bugs. And still, the code base has to be improved. Impossible? Almost, but we're paid for solving hard problems.

2) "Before you make any changes at all write as many end-to-end and integration tests as you can."

Sounds cool, except in many cases you have no idea how the code is supposed to work. Writing tests for new features and bugfixes is good advice (but that goes against other points the OP makes).

3) "A big-bang rewrite is the kind of project that is pretty much guaranteed to fail."

No, it's not. Especially if you're rewriting parts of it at a time as separate modules.

My problem with the OP is really that it tells you how to improve a legacy codebase given no business and time pressure.

On the contrary, we do this work under extreme business and time pressure, sometimes existential pressure (as in: fail and the company fails).

That's exactly why this list is set up the way it is: you will get results fast and they will be good results.

If you want to play the 'I'm doing a sloppy job because I'm under pressure' card then consider this: the more pressure the less room there is for mistakes.

Here is a much more play-by-play account of one of these jobs where management gave me permission to do a write-up as part of the deal:


(For obvious reasons management usually does not give such permission, nobody wants to admit they let it get that far on their watch, I did my best to obscure which company this is about.)

> That's exactly why this list is set up the way it is: you will get results fast and they will be good results.

What do you mean by 'fast'? If you can get meaningful improvements in a few months' time, then you're just working with smaller code base than what I thought of. If you're talking about stopping for a year, then .. well, that's the problem I'm talking about.

> If you want to play the 'I'm doing a sloppy job because I'm under pressure' card

No, I just wanted to share my opinion that I disagree with the overly generalized suggestions you're making.

> What do you mean by 'fast'?

Much faster than by going the rewrite route (assuming that is even possible, which I am convinced for anything but the most trivial problems it isn't). Preferably to first deploy within a few days and incremental changeover to the new situation starting within two weeks or so of the starting gun being fired.

> If you can get meaningful improvements in a few months' time, then you're just working with smaller code base than what I thought of.


> If you're talking about stopping for a year, then .. well, that's the problem I'm talking about.

Who said so?

All I said is that you should only do one thing at a time. Do not attempt to achieve two results with one release.

> No, I just wanted to share my opinion that I disagree with the overly generalized suggestions you're making.

You are very welcome to your own opinion about my 'overly generalized suggestions'; it's just that they are a lot more than suggestions. They are things that I (and others, see this thread for evidence) have used countless times and that simply work.

All you do is a bunch of naysaying without offering up anything concrete as an alternative that would work better or evidence that anything posted would not work in practice. It does and it pays my bills.

> > What do you mean by 'fast'?

> deploy within a few days and incremental changeover to the new situation starting within two weeks or so

I'm going to take this as confirmation that you're working on very, very small projects. This would be an extraordinarily unrealistic timeframe for large projects, which take vastly larger quantities of time to apply the steps you've outlined - which, in turn, renders those steps useless in a competitive business context as far as large applications are concerned.

No, it just means that I have a crew for jobs like these that knows its stuff.

500K lines is 'small' by our standards and if we are not moving within two weeks that translates into one very unhappy customer. That's something a typical team of 5 to 10 people has produced in a few years.

Note that I wrote 'incremental' and 'starting'. That doesn't mean the job is finished at that point in time. But we should have a very solid grasp of the situation, which parts are bleeding the hardest and what needs to be done to begin to plug those holes. That the whole thing in the end can become a multi-year project is obvious, we're not miracle workers, merely hard workers.

In a way the size of the codebase is not even relevant. What is most important is that you get the whole team and the management aligned behind a single purpose and then follow through on that. Those first couple of weeks are crucial; they are tremendously hard work even for a seasoned team that has worked together on jobs like these multiple times.

The one case I wrote about here was roughly that size (so small by my standards), within 30 days the situation was under control. We're now two years later and they are still working on the project but what was done in that short period is the foundation they are still using today.

If a project is much larger than that then obviously it will take more time. Just the discovery process can take a few weeks to months, but in that case I would recommend splitting the project up into several smaller ones that can be operated on independently, with 'frozen interfaces' wherever they can be found.

That way you can parallelize a good part of the effort without stepping on each other's toes all the time.

The problem is not that you can't tackle big IT projects well. The problem is that big IT projects translate into big budgets and that in turn attracts all kinds of overhead that does not contribute to the end result.

If you strip away that overhead you can do a lot with a (relatively) small crew.

If you're going to tackle a code base in excess of something like 10M LOC in this way you will again run into all kinds of roadblocks. For those situations it would likely pay off to spend a few months on the plan of attack alone.

If a project that large came my way I would refuse, it would tie us down for way too long.

But that's out of scope for the article afaic, we're talking about medium to large projects, say 50 man-years' worth of original work that has become unmaintainable for some reason or other (mass walk-out, technical debt out of control or something to that effect).

If those are 'very very small projects' by your standards then so be it.

> 50 manyears worth of original work that has become unmaintainable for some reason or other (mass walk-out, technical debt out of control or something to that effect)

That's the scale I'm talking about, so at least we're on the same page there.

It sounds to me like your specialty routinely puts you in situations where the client has reached the end of the line and is in Hail Mary Mode, where they're amenable to having a consultant do Whatever It Takes to turn things around. To me, that sounds like just about the best case scenario for addressing the issues with legacy software, and pretty far removed from the Usual Case.

In my mind, the Usual Case is legacy software that's in obvious decline but still has significant utility, and for which there is still a significant portion of the market that can be attracted with added features. That's the long tail for a huge swath of the industry. In those cases, it's unthinkable to halt development for _any_ significant stretch of time. It's dog eat dog out here, and when your competitors aren't pausing for breath, you can't either - it's just a totally different world, and I think you're inappropriately pushing the wisdom from your own corner of it out into spaces where it's just not applicable.

In a similar vein, I think your opinions on rewrites are a bit skewed by the fact that the _only_ ones you encounter in your specialty are ones that have failed miserably (or at the very least, they're seriously overrepresented).

You clearly have a very solid and proven game plan for the constraints you're used to, but I think many of the extrapolations aren't valid.

I'd be more than happy to believe you if the comments in this thread weren't for the most part confirming my experience. On the other hand I'm more than willing to believe that there are plenty of places where none of this applies (though, I haven't seen them) and where with some slight variation you could get a lot of mileage out of these methods.

Because if the only extra constraint would be 'you can't halt development' then that's easy enough: simply iterate on smaller pieces and slip in the occasional roadmap item to grease the wheels. But that does assume that development had not yet ground to a halt in the first place.

The biggest difference between your experience and my experience I think is that our little band of friends is external, so we get to negotiate up front about what the constraints are and if we put two scenarios on the table, one of which is ~70% cheaper because we temporarily halt development completely then that is the most likely option for the customer to take.

But.. all these things do work in the real world. Keeping refactorings to a separate commit is important as when you break something, you can just roll it back. Countless times I've seen new people come in and try refactoring things to their liking as they work on some other feature. Then when it later breaks some component they didn't know about, we have to unpick their commit and manually separate out the crap 'refactoring' they did and the actual feature they were working on.

End to end tests are great to test the system actually works as expected as per the requirements spec. You should know how to write this, else.. how are you even testing your feature to begin with after you've written it?

And big rewrites always take longer than people think, which can sink a business if they're not careful with their resources and don't manage their time appropriately. All in all, these points you've mentioned all seem actually very reasonable to me.

A "big-bang rewrite" means rewriting the entire app from scratch. So if you're rewriting pieces as modules you are by definition not doing a big-bang rewrite.

While the advice sounds cool, doing one feature at a time is often impractical. I recently moved an old app with spaghetti jQuery to Vue. The paradigm is completely different. What ended up happening is I had a base that worked, then moved a set of features at a time from the old to the new. This is more like rebuilding the git history from a new base than making the incremental, one-change-at-a-time edits the article advocates.

>3) "A big-bang rewrite is the kind of project that is pretty much guaranteed to fail."

>No, it's not. Especially if you're rewriting parts of it at a time as separate modules

I guess it depends on what he considers to be a "big bang rewrite." I don't think any of the incremental approaches you mention count as one.

My preferred definition here, is that a big bang rewrite is a monolithic rewrite so big it goes bang (fails), pushing things from "pretty much guaranteed to fail", to "is by definition a failure".

You might end up rewriting the entire codebase through an incremental approach ala the Ship of Theseus through a series of smaller rewrites, but that's something very different and distinct from a "big bang rewrite" to me.

I can vouch that the approach suggested by OP works from my own experience. Incremental refactoring backed by confidence of safety-nets (tests and ability to fast-fail and revert) helped us stabilize a legacy codebase and then improve it. Depending on how bad the state of code is, adding new features may be even accelerated by minor refactoring.

I would argue that if you have no idea how the codebase works, adding new features without breaking anything else is going to be sheer luck most of the time.

That's ok, we'll be more than happy to charge his boss $something terrible K/month per person to bail them out at some point :)

Nah, that's condescending, really. We might have different experience because of the codebases we've worked with; but I don't think there's a need for this kind of sarcasm.

Well, I've been in this line of business for 30+ years, if you think there is some aspect of it that you structurally encounter that I don't then I'm all ears.

Keep in mind that almost everybody that we end up cleaning up after has the exact attitude that you display and the only reason they feel that way is because they leave before the bill is due.

I don't mind, it keeps me employed.

Ad hominem and appeal to authority are not the best way to argue about technology, so let's leave this.

(It feels really odd that someone tells me that he's making a living after cleaning up after people like me while so far I've thought I am paid for scaling small, poorly written systems up to enterprise levels, but well. I'd have appreciated more if we could have talked about specifics instead of "this works because I know it works").

We can talk about specifics, I linked one very particular job here (the one that I was fortunate enough to be allowed to write about, which is normally not the case) and I've invited you to give your own stories.

As you can see elsewhere in this thread I'm more than willing to change my tune and/or update the post if there is relevant information.

But your initial tone of voice + your categorical denial that these things are valuable makes it a bit harder to find common ground.

If you are scaling small poorly written systems then that already gives one very important data point that is divergent with the situation I've written about. The systems you start with are small, the systems I start with are usually large to very large and are running a mid to large enterprise and are - if you're lucky - a decade old or even much older. Either that or they are recent - and totally botched - rewrites.

If you are happy in your groove then more power to you but chances are that sooner or later you too will be handed a pile of manure without a shovel to go with it and maybe then you'll find some useful tips in that blogpost.

And the bit about the bookkeeping applies to your situation just as much as it does to larger and older systems.

> Ad hominem and appeal to authority are not the best way to argue about technology, so let's leave this.

It seems to be de rigueur when dealing with some of the top ranked people on HN, who all too often seem to have long ago forgotten the rules that they apparently think no longer apply to them because they have more karma than god.

My hat is off to you in how well you handled this.

Michele, I know you don't like me and I know you feel the need to insert your $0.02 wherever you feel you can stick the knife in. What drives you to do this I have absolutely no clue, but I'm sure you have your reasons. Enjoy.

I don't dislike you, I just dislike the degree to which you have zero respect for me. I think you are full well aware of the reasons I do the things I do.

I don't look to stick anyone with a knife. It is shocking how comfortable people are with casually watching my suffering and doing nothing about it while acting like any push back I give against the systemic issues that help keep me trapped in dire poverty somehow makes me evil incarnate.

I have zero respect for you because of how you treat others.

Promising them the moon to only yank the rug out from under them when it matters. You know full well what I'm talking about here and that's something that I will not forgive you for.

Your circumstances have nothing to do with any of this.

> Promising them the moon to only yank the rug out from under them when it matters. You know full well what I'm talking about here and that's something that I will not forgive you for.

If I am guessing correctly as to what you are talking about, you have that backwards. That person abandoned me. I did not abandon them.

#1 is just refactoring.

Make the change simple, then make the simple change.

They only know it took you three days to implement the feature. They don't need to know how you spent the first 2 days.

Paying down that debt allows the team to scale larger and maintain velocity longer. If they don't like your rate of delivery now, how are they going to like it when the code calcifies and everything takes twice as long?

We used the same strategy OP describes to bring a legacy emergency firefighter dispatch system under control and it worked well.

It's my turn to disagree with something in the article.

> Before you make any changes at all write as many end-to-end and integration tests as you can.

I'm beginning to see this as a failure mode in and of itself. Once you give people E2E tests it's the only kind of test they want to write. It takes about 18 months for the wheels to fall off, so it can look like a successful strategy. What they need to do is learn to write unit tests, but for that you have to break the code up into little chunks. It doesn't match their aesthetic sense and so it feels juvenile and contrived. The ego kicks in and you think you're smart enough that you don't have to eat your proverbial vegetables.

The other problem is e2e tests are slow, they're flaky, and nobody wants to think about how much they cost in the long run because it's too painful to look at. How often have you seen two people huddled over a broken E2E test? Multiply the cost of rework by 2.

It is great to see more people sharing their strategies for managing legacy codebases. However, I thought it might be worth commenting on the suggestion about incrementing database counters:

> "add a single function to increment these counters based on the name of the event"

While the sentiment is a good one, I would warn against introducing counters in the database like this and incrementing them on every execution of a function. If transaction volumes are high, then depending on your database's locking strategy, this could lead to blocking. Operations that could previously execute in parallel, independently, now have to compete for a write lock on the shared counter, which could reduce throughput. In the worst case, if two counters can be incremented inside different transactions in different orders (not inconceivable in legacy code), you could introduce deadlocks.

Adding database writes to a legacy codebase is not without risk.

If volumes are low you might get away with it for a long time, but a better strategy would probably be just to log the events to a file and aggregate them when you need them.
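
A minimal sketch of the log-to-a-file idea above, in Python. The function names (`log_event`, `aggregate`) and the one-event-per-line format are made up for illustration, and it assumes short appends to a local file are safe enough on your platform, which is worth checking before relying on it:

```python
from collections import Counter


def log_event(path: str, name: str) -> None:
    """Append one event name per line. No shared database row is locked."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(name + "\n")


def aggregate(path: str) -> Counter:
    """Count events on demand by reading the log back."""
    with open(path, encoding="utf-8") as f:
        return Counter(line.strip() for line in f if line.strip())
```

The write path is a plain append, so the counting cost moves to whenever somebody actually asks for the numbers.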

Are there businesses building automation and tooling for working with legacy codebases? It seems like a really good "niche" for a startup. The target market grows faster every year :)

Semantic Designs[0] is one of several companies that sells software for working with legacy codebases and programming language translation. [1] is a SO post by one of their founders that describes some of the difficulties in programming language translation.

[0] http://www.semdesigns.com/

[1] https://stackoverflow.com/a/3460977/3465526

Interesting, thanks! Sounds like it's a really hard problem.

Tools like NDepend (for .NET) help me a bit with modularizing.

I doubt we'll ever see automation beyond what we do today in this space.

> I doubt we'll ever see automation beyond what we do today in this space.

Really? That's pretty pessimistic, considering what DeepMind is doing.

Yeah I'm very very pessimistic in that area. Effectively I think cleaning up bad/tangled OO code is such a difficult problem that the level of AI required is beyond not just what we can achieve but beyond what we can imagine. For example I believe it's much harder than coding entirely new applications from text descriptions of their features. That would limit the usefulness of an AI that can untangle existing code...

Specifically to do what?

- Help developers build a high level understanding of the code and relationships between modules (with millions of lines, this is extremely hard)

- Automate refactoring code to reduce complexity and cross-dependencies

- Automate rewriting parts of code in more modern languages and replacing it with some mediation layer (protobuf etc)

I think industries like finance would welcome with open arms something that can do this. And it could go for a high price if it's still saving them money on countless hours of developer time. It's a growing cost every year to maintain legacy code that was written 3+ developer generations ago, and it's dangerous in cases where people's lives depend on the code being bug-free (infrastructure, medical).

I started thinking about this problem a few days ago in a thread about AI https://news.ycombinator.com/item?id=14430652

I think some of this already exists using APM and code analysis. The only issue is existing toolsets often display ugly diagrams in like ER format or some flow diagram. Hard to read.

Agreed about the pre-requirements: Adding some tests, reproducible builds, logs, basic instrumentations.

Highly disagree about the order of coding. That guy wants to change the platform, redo the architecture, refactor everything, before he starts to fix bugs. That's a recipe for disaster.

It's not possible to refactor anything while you have no clue about the system. You will change things you don't understand, only to break the features and add new bugs.

You should start by fixing bugs, with a preference for long-standing simple issues, like "adding a validation on that form, so the app doesn't crash when the user gives a name instead of a number". Check with users for a history of simple issues.

That delivers immediate value. It quickly earns you credit with the stakeholders and the users. You learn the internals by doing, before you attempt any refactoring.

Sometimes your inner desires to rewrite it from scratch can be overwhelming.


> add instrumentation. Do this in a completely new database table, add a simple counter for every event that you can think of and add a single function to increment these counters based on the name of the event.

The idea is a good one but the specific suggested implementation .. hasn't he heard of statsd or kibana?

Not available on all platforms. Think: mainframes, platforms no longer with the times, non-unix and so on.

If you have access to a tool like that by all means use it, the specific implementation is not relevant, the article merely tries to show a simplest way to implement this very useful functionality that will work without limitation on just about anything that I can think of.
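
For what it's worth, here is a rough sketch of that counter trick in Python with the standard-library `sqlite3` module. The table name `event_counters`, the column names, and the function names are my own assumptions, not anything from the article:

```python
import sqlite3


def init_counters(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) counter table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event_counters ("
        "name TEXT PRIMARY KEY, hits INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def increment(conn: sqlite3.Connection, name: str) -> None:
    """The single function: bump the counter for a named event."""
    # Two plain statements rather than an upsert, so old SQLite builds work too.
    conn.execute(
        "INSERT OR IGNORE INTO event_counters (name, hits) VALUES (?, 0)", (name,)
    )
    conn.execute(
        "UPDATE event_counters SET hits = hits + 1 WHERE name = ?", (name,)
    )
    conn.commit()


def read_counters(conn: sqlite3.Connection) -> dict:
    """Return all counters as a {name: hits} dict."""
    return dict(conn.execute("SELECT name, hits FROM event_counters"))
```

Calling `increment(conn, "invoice_created")` at the interesting points in the code is all the instrumentation there is; any other database with a two-column table would work the same way.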

> Not available on all platforms. Think: mainframes, platforms no longer with the times, non-unix and so on.

YMMV, though I would steer people towards an off-the-shelf solution over rolling your own.

Does "non-unix" mean Windows? My experience there has been that you can find a statsd client for your language of choice, and a way to plug whatever logging tool you have into kibana.
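
To show how little a statsd client needs, here is a hedged Python sketch that sends one counter over UDP using only the standard library. The metric name is invented; `name:value|c` is the statsd counter line format and 8125 its usual default port, but check what your collector actually expects:

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Build a statsd counter line, e.g. b"legacy.login:1|c"."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Fire-and-forget: send one counter datagram and return the payload."""
    payload = format_counter(name, value)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload
```

Because it is UDP, a missing or slow collector does not block the application, which is most of the appeal on a system you do not yet trust.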

Quite often there is an embedded component in the mix somewhere, or even a machine that is not networked in any present day sense of the word.

The whole reason these jobs exist is because modern tooling and the luxury that comes with them is unavailable. But I've yet to find a platform where that counter trick did not work, even on embedded platforms you can usually get away with a couple of incs and a way to read out the counters.

If the timing isn't too close to failure.

One interesting case involved a complex real life multi-player game with wearable computers. In the end we got it to work but only by making all the software run twice as fast as it did before so we could use the odd cycles for the stats collection without the rest of the system noticing. That was a bit of a hack. And the best bit: after making it work we then used all the freed up time to send extra packets to give the system some redundancy and this greatly improved reliability.

That system was running 8051 micro controllers and the guy that wrote the original said that 'this couldn't be done'. Fun times :)

The server side portion of that particular project got completely re-written as well roughly along the lines presented in the article, that wasn't a huge project (500K lines or so) but I was very happy that it wasn't my first large technical debt mitigation project or I would have likely stranded.

Serious question - what platforms are you working on that you can't send a udp packet that looks like


Hehe. If you can't imagine that then you have a sheltered and probably very happy life. I don't care if it speaks ethernet, arcnet, twinax, X.25 or nothing at all, we'll find a way. By the time you can start sending UDP packets you are already on very solid footing.

Be happy if your dev environment does not include an emulated version of the real hardware that mysteriously does not seem to be 100% representative of the real thing.

Telling someone they lead a sheltered life isn't the same as actually answering the question.

What actual systems have you worked on that were connected to a database, but couldn't send UDP?

Anything embedded running sqlite for instance.

Anything on mainframes or older systems that do not have ethernet.

Anything running Netware or equivalent (true, there you could probably hack some kind of interface but whether it would be reliable or not is another matter).

Healthcare.gov is a good example, although not a legacy codebase. Anyway, I think fixing small bugs and writing tests are the best way to learn how to work with a legacy system. This allows me to see which components are easier to rewrite, refactor, or extend with more logging and instrumentation. The business cannot wait months for a bug to be fixed just for the sake of a better codebase. But I agree that database changes should be kept minimal to none as much as possible. Also, overcommunicate with the downstream customers of your legacy system. They may be using your interface in an unexpected manner.

I have done a number of serious refactorings myself, and good tests do me a huge favor, even though I have to grit my teeth for a few days to a few weeks.

This should be one of the first tasks that any aspiring career programmer has. It's an essential experience in making a professional.

Great advice. Writing integration tests or unit tests around existing functionality is extremely important but unfortunately might not always be feasible given the time, budget, or complexity of the code base. I just completed a new feature for an existing and complex code base but was given the time to write an extensive set of end-to-end integration tests covering most scenarios before starting my coding. This proved invaluable once I started adding my features to give me confidence I wasn't breaking anything and helped find a few existing bugs no one had caught before!

> Writing integration tests or unit tests around existing functionality is extremely important but unfortunately might not always be feasible given the time, budget, or complexity of the code base.

Bottom line: If the project cannot afford to properly maintain the code, it's a failure of the business model. Projects can be maintained indefinitely, but it costs money. And that means the project has to bring in enough money to pay for those maintenance costs.

The options, as I see them:

1. Accept that this particular project, and those that intimately depend on it, has a lifecycle and will eventually die, either slowly or quickly. Prepare for that fact, staying ahead of the reaper by quitting, transferring to another project, etc.

2. Build a case to leadership that the project is underfunded long-term. This takes communication skills, persuasion skills, technical skills, and political skills. You'll need to go to all the stakeholders in their frame of reference and explain the risk involved in fundamentally depending on legacy code.

Anyway, engineers tend to see the "legacy code" problem as a technical one. It is in the sense it takes technical work to fix it. But the root cause is a misallocation of resources. If the needed resources aren't there in the first place, the problem is a bad business model.

Alternatively, teams should be organized around products, not around projects. The idea that you can move developers around to new projects is wrong. A large organization with this mindset will end up with a lot of unmaintainable and unmaintained code.

I would argue that refactoring a legacy code base without tests, is not refactoring.

Russian Roulette?

Yeah, I've done this. It's frustrating and easy to burn out doing it because progress seems so arbitrary. Legacy upgrades are usually driven by large problems or the desire to add new features. Getting a grip on the code base while deflecting those desires can be hard.

This type of situation is usually a red flag that the company's management doesn't understand the value of maintaining software until they absolutely have to. That, in itself, is an indicator of what they think of their employees.

> This type of situation is usually a red flag that the company's management doesn't understand the value of maintaining software until they absolutely have to.

Recent conversation with the manager of a company: "I've yet to see anybody give me a good reason why we need to maintain the software we already built if it works."

No kidding.

That's just a poor job of surfacing the consequences of not maintaining software by whoever built it.... unless their software is bug free... and we all know there is so much bug free software out there.

Sadly, I think this is more of a rule than an exception...

First and foremost, do not assume that everyone who ever worked on the code before is a bumbling idiot. Assume the opposite.

If it's code that has been running successfully in production for years, be humble.

Bugfixes, shortcuts, constraints: all are real life and prevent perfect code and documentation under pressure.

The team at Salesforce.com is doing a massive re-platforming right now with their switch to Lightning. Should provide a few good stories, switching over millions of paying users, not fucking up billions in revenue.

WRT architecture: In my experience, you would be lucky if you are free to change the higher level structure of the code without having to dive deeply into the low-level code. Usually, the low-level code is a tangle of pathological dependencies, and you can't do any architectural refactoring without diving in and rooting them out one at a time (I was pulling up ivy this weekend, so I was primed to make this comment!)

> ...you would be lucky if you are free to change the higher level structure of the code without having to dive deeply into the low-level code.

The problem, in my mind, is that code can't be accurately modeled on one axis from "low level" to "high level". You can slice a system in many ways:

- network traffic

- database interactions

- build time dependencies

- run time dependencies

- hardware dependencies

- application level abstractions

...and certainly more. On top of that, the dimensions are not orthogonal. You might need to bump the major version of a library to support a new wire format, for example. Anyway, since there are many ways to slice a project, what is "high level" from one perspective can be "low level" from another. And vice versa.

That's a good point, I'll update the post. Thank you.

I was in this situation more than once.

My actions are usually these:

* Fix the build system, automate build process and produce regular builds that get deployed to production. It's incredible that some people still don't understand the value of the repeatable, reliable build. In one project, in order to build the system you had to know which makefiles to patch and disable the parts of the project which were broken at that particular time. And then they deployed it and didn't touch it for months. Next time you needed to build/deploy it was impossible to know what's changed or if you even built the same thing.

* Fix all warnings. Usually there are thousands of them, and they get ignored because "hey, the code builds, what else do you want." The warning-fixing step lets you see how fucked up some of the code is.

* Start writing unit tests for things you change, fix or document. Fix existing tests (as they are usually unmaintained and broken).

* Fix the VCS and enforce sensible review process and history maintenance. Otherwise nobody has a way of knowing what changed, when and why. Actually, not even all parts of the project may be in the VCS. The code, configs, scripts can be lying around on individual dev machines, which is impossible to find without the repeatable build process. Also, there are usually a bunch of branches with various degrees of staleness which were used to deploy code to production. The codebase may have diverged significantly. It needs to be merged back into the mainline and the development process needs to be enforced that prevents this from happening in the future.

Worst of all is that in the end very few people would appreciate this work. But at least I get to keep my sanity.

I've always found it remarkably quick to fix warnings too, tends to be the same mistakes over and over.

This says, near the end, "Do not ever even attempt a big-bang rewrite", but aren't a LOT of legacy in-house projects completely blown out of the water by well-maintained libraries of popular, modern languages, that already exist? (In some cases these might be commercial solutions, but for which a business case could be made.)

I'm loath to give examples so as not to constrain your thinking, but, for example, imagine a bunch of hairy Perl had been built to crawl web sites as part of whatever they're doing, and it just so happens that these days curl or wget do more, and better, and with fewer bugs, than everything they had built. (Think of your own examples here, anything from machine vision to algebraic computation, whatever you want.)

In fact isn't this the case for lots and lots of domains?

For this reason I'm kind of surprised that the "big bang rewrite" is written off so easily.

Sometimes you get an entire septic tank full of...

A code base that is non-existent, because the previous attempts were done with MS BI (SSIS) tools (for all the things SSIS is not for) and/or SQL stored procedures, with no consistency in coding style or documentation, over 200 databases (sometimes 3 per process that exist only to house a handful of stored procedures), a complete developer turnover about every 2 years, and senior leadership in the organization clueless about any technology.

As you look at ~6000 lines in a single stored procedure, you fight the urge to light the match and give it some TLC (Torch it, Level it, Cart it away) to start over with something new.

Moral of the story: as you build and replace things, stress to everyone to "Concentrate on getting it Right, instead of Getting it Done!" so you don't add to the steaming pile.

Can you convince management that development in this situation is horrible and expensive and that there are better architectures?

Regarding instrumentation and logging - this can also be used to identify areas of the codebase that can possibly be retired. If it is a legacy application, there are likely areas that aren't used any longer. Don't focus on tests or anything in these areas and possibly deprecate them.
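
A tiny sketch of how that could look once you have counters, in Python. The event names and the idea of a hand-maintained "known events" list are assumptions for the example:

```python
def retirement_candidates(known_events, observed_counts, threshold=0):
    """Return known events seen at most `threshold` times, sorted by name.

    known_events: every instrumented event name in the codebase.
    observed_counts: {event name: hits} collected over the observation period.
    """
    return sorted(
        name for name in known_events
        if observed_counts.get(name, 0) <= threshold
    )
```

A zero only means "not seen during the observation period", so year-end or rarely triggered paths need a long enough window before anything is deprecated.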

From what I've seen, the most common mistake when starting work on a new codebase is to not read it all before making any change.

I really mean it, a whole lot of programmers simply don't read the codebase before starting a task. Guess the result, especially in terms of frustration.

Sometimes the code is so horribly written we have nothing else to try but to poke at it with a stick in different ways until it breaks.

> Before you make any changes at all write as many end-to-end and integration tests as you can.

^ Yes and no. That might take forever and the company might be struggling with cash. I would instead consider adding a metrics dashboard. Basically, find the key points: payments sent, payments cleared, new user, returning user, store opened, etc. This isn't as good as a nice integration suite, but if a client is short on cash and needs help, it can be set up in hours. With this in place, after adding or editing code you can reassure investors and CEOs. Alternatively, if it's a larger corp it will be time-strapped, so push for the same thing :)

I think instead of "as many as you can" it's "as many as you can afford."

Any advice on what steps to take when the legacy codebase is incredibly difficult to test?

I completely agree with the sentiment that scoping the existing functionality and writing a comprehensive test suite is important - but how should you proceed when the codebase is structured in such a way that it's almost impossible to test specific units in isolation, or when the system is hardcoded throughout to e.g. connect to a remote database? As far as I can see it'll take a lot of work to get the codebase into a state where you can start doing these tests, and surely there's a risk of breaking stuff in the process?

An after-the-fact test suite is a different beast than one written concurrently with the app. It's not worth trying to force one to be the other.

Work from the outside in, keeping most of the system as a black box. Start with testing the highest-level behaviors that the business/users care about.
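
One way to sketch the outside-in approach is a characterization ("golden master") test: record what the system does today for a set of inputs, then compare after every change. In this Python sketch `legacy_quote` is a made-up stand-in for whatever opaque entry point you have, and the helper names are hypothetical:

```python
import json


def legacy_quote(quantity: int, unit_price: float) -> float:
    """Hypothetical stand-in for an opaque legacy entry point."""
    total = quantity * unit_price
    if quantity >= 10:
        total *= 0.9  # undocumented bulk discount we only found by observing
    return round(total, 2)


def record_golden(func, cases):
    """Run func over the cases and store current outputs keyed by the inputs."""
    return {json.dumps(list(case)): func(*case) for case in cases}


def check_against_golden(func, golden):
    """Return (args, expected, actual) for every case whose output changed."""
    mismatches = []
    for key, expected in golden.items():
        args = json.loads(key)
        actual = func(*args)
        if actual != expected:
            mismatches.append((args, expected, actual))
    return mismatches
```

The golden outputs are not claimed to be correct, only current. That is the point: the suite flags behavior changes without requiring you to understand the internals first.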

I've been a part of several successful big-bang rewrites, and several unsuccessful ones, and saying that if you're smart they're not on the table is just flat out wrong.

The key is an engaged business unit, clear requirements, and time on the schedule. Obviously if one or more of these things sounds ridiculous then the odds of success are greatly diminished. It is much easier if you can launch on the new platform a copy of the current system, not a copy + enhancements, but I've been on successful projects where we launched with new functionality.

I've yet to see a large system with lots of subsystems rewritten in one go, but I'm more than open to being convinced that it can be done so if you could please do a write-up of how such a project was managed.

The ones I have seen - and this is actually one of the major reasons the clean-up crew gets called in the first place - is big bang rewrite projects gone astray.

One huge problem with rewrites of old code is that the requirements are no longer known or even misunderstood.

The biggest problem with "the new system" is that it's rarely a rewrite of the second system. Obviously someone liked the old system otherwise it wouldn't be rewritten. But the business case for the new system isn't just lower maintenance cost, higher performance, a modern look etc. It's always going to be all those new features. That's what sinks the new project.

Can't say I agree with the big bang rewrite part necessarily - at my last job, I found myself having to do significant refactors. The reason was that each view had its own concept of a model for interacting with various objects, which resulted in a lot of different bugs from one off implementations. My refactor had some near term pain of having to fix various regressions I created, but ultimately it led to much better long term maintenance.

I agree with most of this, though I think it doesn't dive into the main problem:

Freezing a whole system is practically impossible. What you usually get is a "piecewise" freeze. As in: you get to have a small portion of the system to not change for a given period.

The real challenge is: how can you split your project in pieces of functionalities that are reasonably sized and replaceable independently from each other.

There is definitely no silver bullet for how to do this.

I could probably do a better job of making that clear in the article. The whole point is to iterate and to lock and release parts selectively so you are never working on more than one thing at a time.

> How to Improve a Legacy Codebase When You Have Full Control Over the Project, Infinite Time and Money, and Top-Tier Developers

edit: I'm being a little snarky here, but the assumptions here are just too much. This is all best-case scenario stuff that doesn't translate very well to the vast majority of situations it's ostensibly aimed at.

Genuinely would like to know how anyone has managed to do both of:

> write as many end-to-end and integration tests as you can


> make sure your tests run fast enough to run the full set of tests after every commit

>Use proxies to your advantage

At my last gig we used this exact strategy to replace a large ecommerce site piece by piece. Being able to slowly replace small pieces and AB test every change was great. We were able to sort out all of the "started as a bug, is now a feature" issues with low risk to overall sales.
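
A rough sketch of the routing decision such a proxy makes, in Python. The path prefixes, percentages, and function names are invented for illustration and this is not the setup described above, just one plausible shape for it:

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in 0..99 so A/B assignment is sticky."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_backend(path: str, user_id: str, migrated: dict) -> str:
    """Pick "new" or "legacy" for a request.

    migrated maps a path prefix to the percentage of users (0-100)
    that should be served by the replacement for that prefix.
    """
    for prefix, percent in migrated.items():
        if path.startswith(prefix) and bucket(user_id) < percent:
            return "new"
    return "legacy"
```

Rolling a piece back is then a config change (set its percentage to 0) rather than a deploy, which is what keeps the risk low.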

> Do not ever even attempt a big-bang rewrite

Really? Are there no circumstances under which this would be appropriate? It seems to me this makes assumptions about the baseline quality of the existing codebase. Surely sometimes buying a new car makes more sense than trying to fix up an old one?

For what the OP is talking about, I would say to never attempt a rewrite.

The only caveat is if you have spent the time to truly understand the codebase, then maybe you can do it. Most people advocate a rewrite because they don't WANT to understand the codebase. Even if you understand the codebase, it's pretty dangerous, but at least you have some idea of what you're saying you will rewrite.

So yeah, it can happen, but if you are in the situation that you have the knowledge and experience to override that rule, then you have the knowledge and experience to know that you CAN override that rule. It sounds a little circular, but it's how I tend to aim my broadly-given advice. If someone knows what they're doing, they should be able to recognize when they can ignore your advice. Anything else would have to be tailored to each specific instance, which isn't plausible in a blog post.

Your car buying analogy is flawed. When you buy a new car, someone has built it for you. It's cost effective because the manufacturer builds a great number of them. You can be fairly certain that it works and if it doesn't you'll have a guarantee.

When you rewrite a software system, you do it yourself. You don't know whether you'll succeed. You might end up with worse end-results. The assumption here is that no off-the-shelf software can be used to replace it. Hence rewrite.

Another thing you can do is start recording all requests that cause changes to the system in an event store (a la event sourcing). Once you have this in place, you can use the event stream to project a new read model (e.g. a new, coherent database structure).
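A rough sketch of what that could look like, assuming an in-memory store and three invented event kinds (`customer_created`, `email_changed`, `customer_deleted`). A real system would persist the events durably and capture them at the request boundary; this only shows the append-then-project shape.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str      # hypothetical event name, e.g. "customer_created"
    payload: dict  # the data carried by the state-changing request


@dataclass
class EventStore:
    """Append-only log of state-changing requests (in-memory for the sketch)."""
    events: list = field(default_factory=list)

    def append(self, kind: str, payload: dict) -> None:
        # Copy the payload so later mutation by the caller cannot alter history.
        self.events.append(Event(kind, dict(payload)))


def project_customers(events) -> dict:
    """Replay the event stream into a new read model keyed by customer id."""
    customers = {}
    for event in events:
        cid = event.payload["id"]
        if event.kind == "customer_created":
            customers[cid] = {"email": event.payload["email"]}
        elif event.kind == "email_changed":
            customers[cid]["email"] = event.payload["email"]
        elif event.kind == "customer_deleted":
            customers.pop(cid, None)
    return customers
```

Because the read model is derived purely from the log, you can throw it away and rebuild it with a different structure as many times as you need.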

The biggest problem in improving a legacy codebase is that the people involved with it have been there too long and rely entirely on old techniques. As a new developer you cannot change them; they will change you, which means it's hard to improve anything.

> Yes, but all this will take too much time!

I'm actually quite curious; how long does this process typically take you?

What are the most relevant factors on which it scales? Messiness of existing code? Number of modules/LOC? Existing test coverage?

Good questions.

How long it takes depends on the mandate given by management. Sometimes it's 30 days to get from zero to something stable and incrementally improvable at which point we hand back to the company with maybe a transition period where we still manage the project. Sometimes it is just a feasibility study in which case it can be even shorter. But if it is boots-in-the-mud (which is where the real money is) then it can be up to a year.

It scales just fine provided you have the people and this is more often than not a huge problem. It's happened that we had to leave people in place for months or even years after the project was in essence done simply because as soon as our backs were turned it was back to the usual methods. That's actually really frustrating when it happens.

Existing test coverage can speed things up, but if the tests are brittle or otherwise not helpful they can actually make things much worse.

As for number of modules or LOC: if you're doing a platform switch that can really eat up time, if it is just to bring things under control then it does not really matter much.

One you did not mention, but which can greatly impact the speed with which you can move, is the quality of existing documentation. If there is anything at all, especially up-to-date requirements documentation that can serve as a tie breaker between a suspected bug and a feature, it can make a huge difference.

Very interesting, thanks!

Thanks for posting, some excellent high-level advice.

Stick around that startup long enough and this is a good set of things to do with your own code.

I agree with everything said, but I think they assumed a well-maintained and highly functional legacy codebase. In my experience, there are a few steps before any of those.


1. Find out which functionality is still used and which functionality is critical

Management will always say "all of it". The problem is that what they're aware of is usually the tip of the iceberg in terms of what functionality is supported. In most large legacy codebases, you'll have major sections of the application that have sat unused or disabled for a couple of decades. Find out what users and management actually think the application does and why they're looking to resurrect it. The key is to make sure you know what is business critical functionality vs "nice to have". That may happen to be the portions of the application that are currently deliberately disabled.

Next, figure out who the users are. Are there any? Do you have any way to tell? If not, if it's an internal application, find someone who used it in the past. It's often illuminating to find out what people are actually using the application for. It may not be the application's original/primary purpose.
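If the application leaves any kind of request log behind, one cheap way to start on step 1 is to count what actually gets hit. This is only a sketch: it assumes a hypothetical log format where each line begins with `METHOD path`, and the route names are invented. Real logs will need their own parsing, and a route with zero hits in the sample period is a question to ask users, not proof that it is dead.

```python
from collections import Counter


def count_routes(log_lines) -> Counter:
    """Count requests per (method, route), ignoring query strings."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        method, path = parts[0], parts[1]
        route = path.split("?", 1)[0]
        counts[(method, route)] += 1
    return counts


def unused_routes(known_routes, counts: Counter) -> list:
    """Return the known routes that never appear in the log sample."""
    return sorted(route for route in known_routes if counts[route] == 0)
```

Feeding it the list of routes the code claims to support gives you a first cut of "supported but apparently never used" to bring to management and users.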


2. Is the project under version control? If not, get something in place before you change anything.

This one is obvious, but you'd be surprised how often it comes up. Particularly at large, non-tech companies, it's common for developers to not use version control. I've inherited multi-million line code bases that did not use version control at all. I know of several others in the wild at big corporations. Hopefully you'll never run into these, but if we're talking about legacy systems, it's important to take a step back.

One other note: If it's under any version control at all, resist the urge to change what it's under. CVS is rudimentary, but it's functional. SVN is a lot nicer than people think it is. Hold off on moving things to git/whatever just because you're more comfortable with it. Whatever history is there is valuable, and you invariably lose more than you think you will when migrating to a new version control system. (This isn't to say don't move, it's just to say put that off until you know the history of the codebase in more detail.)


3. Is there a clear build and deployment process? If not, set one up.

Once again, hopefully this isn't an issue.

I've seen large projects that did not have a unified build system, just a scattered mix of shell scripts and isolated makefiles. If there's no way to build the entire project, it's an immediate pain point. If that's the case, focus on the build system first, before touching the rest of the codebase. Even for a project with excellent processes in place, reviewing the build system in detail is not a bad way to start learning the overall architecture of the system.

More commonly, deployment is a cumbersome process. Sometimes cumbersome deployment may be an organizational issue, and not something that has a technical solution. In that case, make sure you have a painless way to deploy to an isolated development environment of some sort. Make sure you can run things in a sandboxed environment. If there are organizational issues around deploying to a development setup, those are battles you need to fight immediately.

I don't completely understand your warning to stick with the existing version control environment. Just because you switch development to git doesn't mean you delete the old CVS archive. Isn't consulting the old archive sufficient whenever you're doing a significant historical investigation?

There are a couple of reasons I'd argue it's best to avoid switching version control environments early on.

1. Integration with whatever build/issue tracking systems are present is worth preserving until you have the time to recreate it properly.

Duplicating what's already there under the new environment is always more problematic than it looks like at first glance. This is especially true when you're dealing with any in-house components (which usually manage to show up somewhere).

2. A clean break where you leave the old VCS behind and archived is tempting, but it's rarely ideal in the long-term.

The old archive is likely to wind up being deleted/lost/bitrotted/etc after a year or two. Invariably, you wind up in a spot a few years down the line where it would be useful to have the full commit history, and the old VCS winds up being inaccessible. Ideally, you'd want to preserve as much history as possible when migrating. However, trying to correctly preserve commit history (and associated issue tracker info, etc) is always a time-sink, in my experience. It's easy for simple projects, and a real pain for complex projects with a weird, long history. Choose the time that you attempt it wisely.


Again, I'm not saying don't move, I'm just saying that it almost always winds up taking a lot of time and effort. I'd argue you're better off spending that time and effort on other portions of the project early on.

Also, things like git-svn can be real lifesavers in some of these cases, though they do add an extra layer of complexity. If you do want to use a different VCS, I'd take the git-svn/etc approach until you're sure there are no extra integration problems.

All that said, yeah, if there's no history and no integration with other systems/tools, go straight for something modern!

Refactor toward the practices you should have known at the time, not the brand-new-fangled way of doing things; that way each new way fades into the next.

Delete it...

(Speaking from experience from work)

Is there anyone who left the legacy code in place and found that beneficial?

In my work experience, it was more beneficial to delete the legacy code and provide only the necessary functions when renewing the system.
