I don't disagree at all, but I think the more valuable advice would be to explain how this can be done at a typical company.
In my experience, "feature freeze" is unacceptable to the business stakeholders, even if it only has to last for a few weeks. And for larger-sized codebases, it will usually be months. So the problem becomes explaining why you have to do the freeze, and you usually end up "compromising" and allowing only really important, high-priority changes to be made (i.e. all of them).
I have found that focusing on bugs and performance is a good way to sell a "freeze". So you want feature X added to system Y? Well, system Y has had 20 bugs in the past 6 months, and logging in to that system takes 10+ seconds. So if we implement feature X we can predict it will be slow and full of bugs. What we should do is spend one month refactoring the parts of the system which will surround feature X, and then we can build the feature.
In this way you avoid ever "freezing" anything. Instead you are explicitly elongating project estimates in order to account for refactoring. Refactor the parts around X, implement X. Refactor the parts around Z, implement Z. The only thing the stakeholders notice is that development pace slows down, which you told them would happen and explained the reason for.
And frankly, if you can't point to bugs or performance issues, it's likely you don't need to be refactoring in the first place!
You need to make the invisible (refactoring and code quality) visible (tracking) so they can see what the current state is and map the future.
The biggest reason business stakeholders push back against this is that developers tend to communicate this in terms of "You don't need to know anything about this. But we've decided it needs to be done." Which annoys someone when they're paying the hours.
I've had decent success with bringing up underlying issues on roadmaps, even at the general level of "this feature / component has issues." It's a much easier conversation if it's "that thing that's been on our to-do list for a couple of months" vs "this new thing that I never told you about."
And as far as pitching, if the code is at all modular, you can usually get away with "new feature in code section A" + "fixes and performance improvements in unrelated section B" in the same release.
PS: I love the simple counter-based bookkeeping perspective from the linked post. (And think someone else suggested something similar in a previous performance / debugging front page article)
In almost all cases they nod and feign interest and understanding and their eyes glaze over. And why should they be interested? The stories are almost always abstract and the ROI is even more abstract. It's all implementation details to them. These stories usually languish near the bottom of the task list and you often need to sneak it in somehow to get it done at all.
I think the only real way of dealing with this problem is to allocate time for developers to retrospect on what needs dealing with in the code (what problems caused everybody the most pain in the last week?), then time to plan refactoring and tooling stories, and time to do those stories alongside features and bugs.
Stakeholders do need to assess what level of quality they are happy with (and if it's low, developers should accept that or leave), but that should be limited to telling you how much time to devote to these kinds of stories, not what stories to work on and not what order to do them in.
I don't see why they shouldn't have visibility into this process but there's no way they should be allowed to micromanage it any more than they should be dictating your code style guidelines.
This is, IMO, the single worst feature of SCRUM - one backlog, 100% determined by the product owner whom you have to plead or lobby if you want to set up a CI server.
If you're explaining it in terms of internals and implementation details, then you're always going to get this response.
Your job as a business-facing developer is to translate the technical details (in as honest a way as is possible) into a business outcome.
I'm not naive. We've all worked with stakeholders that make stupid choices and can't seem to grasp a point dangled right in front of them.
But. Even more often than that I've seen (especially in-house) IT talk down to the business, push an agenda through the way they summarize an issue or need, and try to use technical merits to subvert corporate decision making.
Ultimately, you're in it together with business stakeholders. Either you trust each other, or you don't. And "the business can't be trusted to make decisions that have technical impacts" is the first step towards a decay of trust on both sides.
You're also going to get this response if you explain in terms of a business case.
The business case for literally every refactoring/tooling story is this, btw:
This story will cut down the number of bugs and speed up development. By how much will it speed up development? I don't know. How many bugs, and of what severity? Some bugs, at multiple levels of severity, and you're not going to notice it when it happens, because nobody notices bugs that don't happen. By when? I don't know, but you won't see any impact straight away.
The benefits are vague and abstract. The time until expected payoff is long. Vague, long term business cases don't get prioritized unless the prioritizer understands the gory details, which, as we both know, they won't.
The features and bugfixes - user stories - are not vague. They get prioritized.
>I'm not naive. We've all worked with stakeholders that make stupid choices
I am not complaining about stakeholders in general. I've worked with smart stakeholders and dumb stakeholders. I've never worked with a stakeholder that could appropriately compare the relative importance of my "refactor module B" story and "feature X which the business needs". All I've worked with are stakeholders who trusted me to do that part myself (which paid off for them) and stakeholders who insisted on doing it for the team because that's what SCRUM dictated (which ended badly for them).
>Ultimately, you're in it together with business stakeholders. Either you trust each other, or you don't. And "the business can't be trusted to make decisions that have technical impacts" is the first step towards a decay of trust on both sides.
No, the first (and indeed, only) step is not delivering.
Point to historical data where possible. SWAG where appropriate.
"We've probably spent over 100 hours fixing bugs in this janky ass system for every 10 hours of real honest-to-god implementation of features work. That outage on Friday? Missing our last milestone by a week? All avoidable. We've been flying blind because we have no instrumentation, and changes are painful. Proper tooling would've shown us exactly what was wrong, easily halving our fix time, even if nothing else about this system changed. A week's worth of investment would've already paid itself off."
Frankly, I'm way better at estimating this kind of impact than how long it'll take to implement feature X.
"We've determined that we can automate the entire process by setting up a Continuous Integration (CI) server. There's some work involved in setting it up; we estimate it will take __ days/weeks to get it running. But once it's running, (we'll always have a build running __ minutes after each code change)|(we can click a button in the CI's GUI and we'll have a build running __ minutes later), and we'll be saving __ hours/days of effort per build/year."
Plug in your numbers. If the time to deploy the CI server exceeds the savings, the business would be justified in telling you not to do it. (You'd have to make a case based on quality and reproducibility, which is tougher.) If the cost is less than the savings, the business should see this as a no-brainer, and the only restraint would be scheduling a time to get it done. (Not having it might cost more, but it might not cost as much as failing to get other necessary work done.)
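To make the arithmetic concrete, here's a minimal sketch of that plug-in-your-numbers calculation. Every figure below is a placeholder, not a number taken from any comment in this thread:

```python
# Back-of-envelope ROI check for the CI server pitch above.
# All numbers are placeholders; substitute your own.

setup_days = 5             # one-time cost to stand up the CI server
dev_day_cost = 1_000       # fully loaded cost of a developer-day, in dollars
builds_per_year = 200      # how often you currently build/deploy by hand
hours_saved_per_build = 1  # manual effort eliminated per build

setup_cost = setup_days * dev_day_cost
yearly_savings = builds_per_year * hours_saved_per_build * (dev_day_cost / 8)

print(f"one-time cost:  ${setup_cost:,.0f}")
print(f"yearly savings: ${yearly_savings:,.0f}")
print(f"payback in {setup_cost / yearly_savings:.1f} years")
```

With these placeholder numbers the server pays for itself in about a fifth of a year; if your own numbers come out the other way, that's the business telling you something too.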
And that's the crux of the problem. The business invariably, mistakenly believes that piling more features onto the steaming pile of crap that is the codebase is the better solution. Add to that that some mid-level PM promised feature X to the C-level in M months, where M is such short notice that even an engineering team with cloning and time machines would be short-staffed, and was chosen without even asking the engineering staff for their estimate of the work.
To the business, the short term gains of good engineering practices are essentially zero. The next feature is non-zero. The long-term is never considered.
I've had multiple PMs balk at estimates I've given them. "How could internationalizing the entire product take so long? We just need to add a few translations!" No, we need to add support for having translations at all, we need to dig ourselves out from under enough of our own crap to even add that support, we need to figure out what text actually exists, and needs translating, actually add those translations, and we need to survey and double-check a whole host of non-text assets because you mistakenly believe that "internationalization" only applies to text. Next comes the conversation about "wait, you can't just magic me a list of strings that need translating? I need that for the translators tomorrow!" No, they're mixed in with all the other strings that don't need translating, like the hard-coded IPv5 address of the gremlin that lives in the boiler room eating our stack traces.
Then, later, we'll lose a week of time because the translation files that engineering provided were turned into Word documents by PMs. One word doc, with every string from every team, and then those Word docs got translated. So now we have French.docx, but that of course only has the French. So now engineers are learning enough French to map the French back to the English so they know what translations correspond to what messages.
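For what it's worth, the "magic me a list of strings" problem is exactly what string-marking conventions exist for. Here's a minimal sketch using Python's stdlib gettext; the actual codebase and stack in the story above are unknown, so the function names are invented for illustration:

```python
# Sketch: translatable strings can't be extracted mechanically until they're
# marked at the source. The gettext convention is to wrap them in _(), which
# extraction tools (e.g. xgettext) then find for you.
import gettext

_ = gettext.gettext  # with no catalog installed, returns the string unchanged

def greet(user):
    # User-facing, so it's wrapped; an extractor can list it mechanically.
    return _("Welcome back, %s") % user

def boiler_room_gremlin():
    # Not user-facing, so it stays unwrapped and never reaches the translators.
    return "10.0.0.66"

print(greet("Ada"))
```

Once strings are wrapped, producing "the list for the translators tomorrow" is a tool run rather than an archaeology project, and the translations come back in a structured catalog instead of French.docx.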
It doesn't have to be cost, that just happens to be easiest because it can be opinion-free. You can also express value in terms of business risks or opportunities, but the impact can be seen as an opinion, and you can be challenged by someone with different opinions.
"I wasted around 10 hours last week thanks to inadvertently pulling broken builds because we don't have a CI server. I spent 4 hours manually deploying things because we don't have a CI server."
"When can I move the new CI server I already setup on my workstation - because fuck wasting half my week to that nonsense, and I had nothing better to do while the devs who broke the build fixed it - to a proper server where everyone can benefit?"
Extrapolating, that's what - 4 months per year of potential savings?
Sure, 14 hours might not be enough time to automate your entire build process, but it should be enough to automate some of it, get some low hanging fruit and start seeing immediate gains. Incrementally improve it for more gains when you're waiting for devs to fix the build for stuff the CI server didn't catch.
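As a concrete example of that low-hanging fruit, the first increment can be as small as chaining the manual build steps into one script that stops at the first failure. A sketch, with placeholder commands standing in for whatever you currently type by hand:

```python
# Sketch: automate just the low-hanging fruit by chaining the manual build
# steps into one script. The echo commands are placeholders; replace them
# with your real fetch/compile/package commands.
import subprocess
import time

STEPS = [
    ["echo", "fetching dependencies"],
    ["echo", "compiling"],
    ["echo", "packaging"],
]

def run_steps(steps):
    """Run each step in order, stopping at the first failure.

    Returns (ok, seconds_elapsed)."""
    start = time.monotonic()
    for cmd in steps:
        if subprocess.run(cmd).returncode != 0:
            return False, time.monotonic() - start
    return True, time.monotonic() - start

ok, elapsed = run_steps(STEPS)
print("OK" if ok else "FAILED", f"({elapsed:.1f}s)")
```

From there you can incrementally bolt on the next-most-painful step (tests, packaging, upload) each time you find yourself waiting on something else anyway.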
On the other end of the spectrum, a lot of my personal projects don't warrant even the small effort of configuring an existing build server. I'm the only contributor, nobody else will be breaking anything or blocking me, builds are super fast... even if "it will pay off in the long run", there are other higher impact things I could do that will pay off even better in the long run.
In the middle, I've put off automating some build stuff for our occasional (~monthly) "package" builds for our publisher - especially the brittle, easy to fix by hand, rightly suspected to be hard to automate stuff. I was generally asked to summarize VCS history for QA/PMs anyways - can't automate that.
When we started doing daily package builds near a release, however, it ate up enough of my time that non-technical management actually noticed and prodded me before I thought to prod them. Started by offloading my checklist to our internal QA (an interesting alternative to fully automating things) and eventually automated all the parts QA would forget, not know how to handle, or waste too much attention on.
Even then, some steps remained manual - e.g. uploading to our publisher's FTP server. Tended to run out of quota, occasionally full despite having quota available, sometimes unreachable, or uploading too slow thanks to internet issues - at which point someone would have to transfer by sneakernet instead anyways. Not much of a point trying to make the CI server handle all that.
Not waiting for a build means not testing it, in which case a manual non-build of an uninteresting software configuration seems attractively elegant. But a CI system still provides value by recording that a certain configuration compiles, and by making the built application available for later use in case it's needed.
What do you mean by "not testing it" "seems attractively elegant"? Testing a build is still a must, although that usually ends up being manual testing (unit tests don't assure much, and integration tests take a lot of engineering effort to set up and write, especially if they weren't taken care of from the start).
I'm trying to get some fucking work done, not convince investors I need a series A.
You said: I've tried explaining why we need to set up a CI server. ... In almost all cases they nod and feign interest and understanding and their eyes glaze over.
The reason you've failed to make a convincing case, I believe, is because you're talking in your language instead of theirs. Perhaps they've tried to explain to you, in their language, why they won't prioritize your CI server, and you nodded and feigned interest while your eyes glazed over.
The quote I gave you translates your request and justification for a CI server into terms the business needs: what problem does it solve, what does it cost, how does it affect ongoing costs, what are the risks of doing it and not doing it, and what impact does it have on other activities if it is done and if it is not done. This is not a "fully costed business case" or "convincing investors you need a series A". If you've given any thought at all to why you want a CI server beyond "I want it", you should have no problem filling in the blanks in my quote. And if you haven't bothered to think that much about it, your business is doing the right thing by giving your requests a low priority, because they shouldn't give your ideas any more attention than you're giving them yourself.
If you're not successful with this approach, and can't get approval despite showing that it's in the business' best interests using the business' own criteria, then your business is too dysfunctional and toxic to fix. Time to move on.
No, actually not needing approval to do necessary work is very efficient.
>You have to start making progress somewhere, even if it's not as fast as you'd like it to be. If you're successful with this, you gain credibility, and over time your recommendation will be sufficient to get approval for smaller tasks; the business case will only need to be made for bigger tasks.
There's no point in working to gain enough credibility to be able to do your own job effectively when you can simply leave and go and work somewhere else that doesn't expect you to prove to it that you can do your job after they've hired you.
Even if you manage to prevent the company from shooting itself in the foot as far as you're concerned by "proving your worth", it'll probably only go and shoot itself in the foot somewhere else and that will also ultimately become your problem.
In any case, this process tends to feed upon itself. Failures in delivery lead to a lack of trust, which leads to micromanagement, which leads to failures in delivery. It's not that you can't escape that vicious cycle; it's that it typically has a terrible payoff matrix.
You're right about having to make a choice between fixing the place you're at or finding a new place to be. There are many factors to consider, and sometimes trying to fix the place you're at can be worth the effort.
Maybe! Alternatively: if you give a mouse a cookie, it will want a glass of milk. It might be worthwhile to establish early on that the technical leadership needs to be trusted to make their own decisions about trivial things.
No, the problem is that you believe that micromanagement is effective.
"The reason you've failed to make a convincing case, I believe, is because you're talking in your language instead of theirs."
No, the reason is because the ROI is vague and not easily costable and the time until expected return is usually months. By contrast, feature X gets customer Y who is willing to pay $10k for a licence on Tuesday.
This hyperfocus on the short term and visceral ROI over the long term and vague ROI isn't limited to software development, incidentally. It is a very, very common business dysfunction in all manner of industries - from agriculture to health care to manufacturing. Companies that manage to get over this dysfunction by hiring executives who have a deep understanding of their business and are willing to make long term investments often end up doing very, very well compared to the companies that chase next quarter's earnings figures with easy wins.
This is also why companies that are run by actual experts instead of MBA drones inevitably end up doing better (ask any doctor about this). It's not the fault of the people beneath them for not speaking the MBA's language. It's the fault of MBAs for being unqualified to run businesses.
Now, fortunately, product managers don't have to understand development because they can choose not to have to make decisions that require them to. However, if they insist on making decisions that require them to understand development then they will damage their own interests.
"The quote I gave you expresses your request and justification for a CI server into terms the business needs: what problem does it solve, what does it cost, how does it affect on-going costs, what are the risks of doing it and not doing it, and what impact does it have on other activities if it is done and if it is not done."
How low level are you willing to take this? Would you agree to make a business case for why you are using your particular text editor? Would you provide an estimate of the risks of not providing you with a second monitor? Where's the cut off point if it's not a day's work? Perhaps you are costing the company money with those decisions, after all.
If you want to spend a day on a CI server, it'll cost the company a day of your time (say, $1k) and will save maybe 5x that over the year by saving an hour of your time dealing with each build. That's great and worth doing. But, if it means that your company will miss out on $10k of revenue Tuesday, it's a net loss. And if missing that revenue means payroll can't be made on Friday, the company is screwed. The hyperfocus on short-term may be dysfunction, or it may be a sign that the company is in serious trouble. Jumping ship might be the best choice.
"Speaking the MBA's language" isn't really about terminology, it's about a different point of view with different concerns and priorities. A PM choosing your text editor sure sounds like micro-management of a technical decision that the PM doesn't understand, but maybe the text editor you want to use has licensing costs for business use that you're not aware of because you always used it personally, and the PM's decision is based on that business concern rather than the technical merits. Same topic, same choice to make, different point of view.
Ok, so assuming:
* All user stories are prioritized by management.
* Management determines the exact % of time spent on refactoring stories.
* Refactoring stories are prioritized by devs and slotted alongside user stories (according to the % above).
What kind of hypothetical non-technical concerns that are part of managment's prioritization decisions would become a problem?
Because, as far as I can see, in such a case, it wouldn't matter if the devs are not aware of the non-technical concerns because those concerns would still be reflected by the prioritization.
Negotiating those agreements is where having a common ground on business concerns helps. And yes, it sure does help when the managers can also see things from the dev's point of view too. In my experience, it's easier for devs to understand business concerns than the other way around, so that's the way I lean.
"When we're interviewing people and they find out just how backwards our CI system is, the smart ones will laugh at us and work somewhere else and we'll be left with just the dumb ones."
He didn't say it was the only way. Nor that you can't add more arguments if cost savings alone isn't convincing enough.
I worked somewhere once that forced me to spend political capital to make these kinds of things happen and it was a terrible waste.
Nobody notices the disasters that don't happen, and when somebody is 2x faster and develops code with fewer bugs, that tends to reflect well upon them, even if they were building upon your work.
A lesson I learned the hard way is that if the business doesn't care, then neither should you. It's just not worth fighting uphill battles like this. The only way to measure what a business cares about (as distinct from what they say they care about) is by looking at what they're willing to spend money on.
If building software is annoying for you personally, then you can automate much of it, maybe even set up a CI server on your own machine.
Absolutely. I used to work for a company where the battles were uphill and constant. I quit and now work for a company with no battles. One had bad financials and the other has very good financials.
The first company did teach me how to deal with very extreme technical debt, though (they'd been digging their hole for a while), which actually is a useful thing to know.
>If building software is annoying for you personally, then you can automate much of it, maybe even set up a CI server on your own machine.
The solution is GTFO.
It's a trap, actually; sometimes shit companies like this are the only ones hiring.
It comes down to how much trust the executive sponsors have in the engineering org, and how the business views its technologists: as responsible professionals, or as children who have to be closely supervised and monitored.
Nurturing that relationship is one of the most important jobs of an executive/C-level engineering manager.
Notice the OP said that users have long login times due to various issues and he can solve them by doing X, and not "TCP/IP timeouts and improper caching policies are causing back pressure leading to stalls in the login pipeline..."
Explaining the "why" in and of itself isn't particularly hard: refactoring and development tooling will speed up development in the future and reduce bugs. It's getting it prioritized that's hard, and that's because the business case of "potentially reducing the likelihood of bugs in the future" and "a story 3 months from now might take 3 days instead of 2" isn't a particularly compelling one. Not because it isn't important, but because it isn't visceral and concrete enough.
In practice I've seen what this does. The process of introducing transaction costs (having to 'sell' the case of refactoring code is exactly that) simply stops it from happening.
If, as a business, you want to introduce this transaction cost into your development process, you will end up paying more dearly for it in the long run as you deal with the effects of compounding technical debt.
You work with your customer to decide what end-user bugfixes and features to prioritize, but it's your job to make technical decisions. That's why they hired you. Don't push those decisions back onto them.
But really, managers who don't keep up conceptually with the business trends of software engineering management are low performers, in the same way as engineers who refuse to learn how to improve their code even when it doesn't directly impact their immediate codebase (functional programming patterns as an embedded software engineer come to mind).
Title: Review code for feature X
Description: As an OUR-APP product manager, I want to understand how feature X is implemented.
AC1: Feature X is documented in the wiki in context Y.
If you're short on budget and need to sell the whole package to management, just do that one. It will make all that invisible stuff visible in more detail than they likely have the stomach for, and you'll be granted budget in no time, because nothing spells lost business better than a fair-sized gap between customers entering on the left and only a trickle coming out on the right.
In effect this is funnel visualization for the internals of an application in all the gory detail.
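A minimal sketch of what that counter-based funnel instrumentation can look like in practice; the stage names and drop-off rates here are invented for illustration:

```python
# Sketch of counter-based funnel bookkeeping: bump a counter at each internal
# stage of a flow, then compare stage-to-stage drop-off. The stages and the
# simulated failure rates below are made up for illustration.
from collections import Counter

counters = Counter()

def stage(name):
    counters[name] += 1

def handle_signup(email, valid=True, payment_ok=True):
    stage("entered")
    if not valid:
        return          # drop-off: failed validation
    stage("validated")
    if not payment_ok:
        return          # drop-off: failed payment
    stage("paid")

# Simulated traffic: 100 signups with some failing each stage.
for i in range(100):
    handle_signup(f"user{i}@example.com",
                  valid=(i % 4 != 0),        # 25% fail validation
                  payment_ok=(i % 10 != 1))  # some fail payment

for name in ["entered", "validated", "paid"]:
    print(name, counters[name])
```

The gap between adjacent counters is exactly the "customers entering on the left, trickle coming out on the right" picture, except measured at each internal stage instead of only at the ends.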
We sent people to the moon with the computing power of a calculator. With enough good people, effort, and version control, you can rewrite any legacy codebase to meet rigorous standards and the performance needs of its users.
Out of all the things humanity is trying and has accomplished, this is not unachievable.
If the people don't want it, and the culture does not lend itself to high standards, they often will not see the same pain a developer experiences watching poor code perform poorly. They may complain about the output and everything else wrong with the platform, but that doesn't mean the company is going to support a legacy reboot; it just means the company has an accepted culture of low performance and complaining.
I say this in the context of a non-software-development company relying on a lot of software.
My most recent form of personal torture has been watching my IT department take a 45-year-old black-box piece of software from a very old, outdated engineering firm that never specialized in software, and actually try to port it to AWS, thinking it will speed up performance AND save costs. They have no idea whether the kernel can exploit any concurrency in the algorithms on the inside.
What they do know is that there was a bug in the code, and it took 8 months to fix, buried in 1,500 lines of code; it had no API, and all the original developers were dead or no longer employed by the company. They pay millions of dollars annually for the license, and millions more for an "HPC" to run it on.
They would never consider rewriting it, or contracting a new firm with a timeline, performance standards, needs and competitive cost recruiting. They don't know how. They don't understand how.
This is the way of the world outside of software development companies.
If you're wondering how I exist in such a painful environment: I'm an Electrical Engineer, and I do not work for a software company. I get to mentor under some of the most brilliant and game-changing engineers in my industry, but it has very little to do with software development.
It could have a lot to do with it, but the engineers have no interest in taking advantage of software. They'd have to first understand the advantages software can provide, but... I mean, some of the engineers I work with don't know what a GPU is. Never heard of it.
I write all my own code for my own work from scratch.
He was telling me this because he got a job offer from a startup, where they wanted him to be the subject matter expert for an application they were developing for power utilities.
It's a good offer, and he's tempted, but he's not sure if he can fit into the software/startup culture. In addition, he felt that the developers were looking to use him like a reference book - he got the impression that the founders saw software as the answer to everything, and that they didn't see power engineering as a particularly hard domain. There was (in his words) a distinct whiff of "developers are the cool guys".
This turned him off somewhat. So he's not sure if he'll take the offer, and I'd say he's leaning no.
Just an anecdote to illustrate the clash of cultures.
I am getting the same kind of offers. There's a lot of subsidy money and VC investment in "clean" energy and "smart grid", so developers left and right have a new market to apply their software development skills to.
In my experience, I am not being used as a reference book, and my long-term investment in coding (a minor in CompSci and personal projects) makes me able to translate engineering speak into software speak. I have not had the experience that the developers think software is the answer and that power is trivial.
I think the real clash of cultures is software development culture realizing how much bureaucracy surrounds the power grid, and how much resistance to change there is within the industry. The industry really does not understand that 95% of their day spent manually editing Excel workbooks that output Fortran run files from the 80s could be deleted and improved, and of course most of them don't want to improve, since there's a 30-year generational gap in the power industry.
Half of the people I work with don't believe in climate change and think Elon Musk is taking their jobs. This industry is ripe for disruption, and I don't think the mentality that software/smart grid stuff can improve it is incorrect. But the naive assumption that everyone in every industry is as open-minded and as continually invested in learning and producing working products (a mindset required by successful, growing software firms) is proving to be a big barrier and a wake-up call to software companies trying to come in and help.
I honestly blame our industry more than software developers, but yeah, there's also a gold rush in "smart grid" / "clean energy" stuff, and everyone wants to be a part of it.
As a power engineer, it definitely leaves you with lots of opportunities, and with having to sort out who is willing to invest in understanding the complexities of innovating on the power grid, and who is going to cop out once they realize you can't whip up an app and make money off users the way Snapchat does.
There are also (for good reason) lots of cybersecurity policies surrounding software running on the power grid, because hacking the grid has detrimental effects that can quickly translate into coast-wide blackouts. That also means the newest GitHub release of that multiplatform CoffeeScript spinoff is not going to be allowed in a lot of grid-side applications, and there is more work involved in vetting development.
All of these things quickly weed out the devs who are looking for quick stardom and the next easy cash cow, landing on the latest buzzwords related to clean energy. It can be frustrating weeding out the companies that try to hire you from that perspective.
It doesn't mean the grid doesn't need better software; it just means people, even developers who want to cash in on hot finance markets, are going to take the path of least resistance, and the power grid is not the path of least resistance (no pun intended) when it comes to quick cash and unicorn apps.
You have to actually CARE about innovating on the grid, and not just pretend to care because there are billions in funding sitting around waiting to back good smart-energy innovation. Regardless, because that money is there now, and because you get the social reinforcement of heralding your startup as bleeding-edge, world-saving technology that's going to stop society's impending doom from global warming, there's a very compelling emotional appeal that makes marketing and justifying your product easy.
It's hot right now, and in the next ten years we will see who was around to grab quick cash and feel good about being the poster child for saving the climate, and who is willing to invest in truly renovating the power grid and enabling clean energy as a sustainable, long-term solution that is economically viable without startup subsidies covering the cost of initial investments.
Eventually these companies have to show a profit...
It is frustrating to be in the industry as an engineer under the age of 35, with a CS background and friends working at Amazon and Google, and to have to explain a hyperlink to a coworker three levels above you.
It's also frustrating to take graduate-level classes, do research and R&D, spend years designing the power grid, actually be out on it watching construction, and put in the hard work to learn electrical power before it was "cool", only to face an insurmountable rush of developers who want you to help them change the world. They are the CEO; you are the reference-book engineer.
I get that, but it's important to look past it and see there is true benefit to the innovation. And for the most part they are right: this industry has been sitting in static mode, riding comfortably for a long time, when it comes to staying technologically relevant and maintaining sustainable infrastructure that allows for growth. So it is frustrating, but ultimately the software/tech community is in the right. It's time for a change, and this industry needs to admit it has been bad at changing itself for decades.
It also helps that I have 8 years of coding experience, put in on my own time, to ease this barrier, but recruiters doing smart-grid dev have told me I'm an exception to the rule.
So I can see how we have (much) more freedom when it comes to setting the timetable, and that more diplomacy and better salesmanship might be required at an earlier stage. But then you can point to this comment here and suggest that it is probably much cheaper to do this in-house than to hire a bunch of consultants to do it by the time the water is sloshing over the dikes.
"Many teams schedule refactoring as part of their planned work, using a mechanism such as "refactoring stories". Teams use these to fix larger areas of problematic code that need dedicated attention.
Planned refactoring is a necessary element of most teams' approach - however it's also a sign that the team hasn't done enough refactoring using the other workflows."
I resolved to try this if I ever ran into the same problem again after a whole bunch of arguments at a previous few companies:
* Set up a (paper) slider with 0-100% on it and put it somewhere prominent on the wall. Set it at 70%. That's the % of time you spend on features vs. the % of time you spend on refactoring (what that entails should be the development team's prerogative).
* Explain to the PM (or their boss) that they can change it at any time.
* Explain that it's ok to have it at 100% for a short while but if they keep it up for too long (e.g. weeks) they are asking for a precipitous decline in quality.
* Track all the changes and maintain a running average.
I think a lot of people suspect that management would just put it at 100% and leave it, but I doubt that would happen. Most managers' "cover their ass" instincts will kick in, given how simple, objective, and difficult to bullshit the metric is once it's explicit.
To a non-technical manager, the former sounds like pretty much what it is, and won't raise many questions. (If they do question it, ask them if they maintain their car while it's still running ok, or just wait until it breaks down before they do anything to care for it.)
Refactoring, on the other hand, sounds like a buzz word, and if they look it up they'll get "rewriting code that's already working so that it continues working the same way". They probably won't get the nuances about why that's a useful thing to do, so it'll sound like busywork and they won't be happy with letting your team do it. They also won't be able to justify it to their management if they're questioned about it, which is critical for getting buy-in from your managers.
Remember you need to work in their terms. Risk is something they understand. The risk is that they ship as soon as the last feature is done, without discovering that the last feature broke everything else. From there they work backwards: the last feature is done, so we do a 30-second sanity test; increase that to 30 minutes, a week, a month... They should have charts (if not, create them) showing how long, on average, a bug goes undiscovered after it is introduced; use those charts to help guide the decision. If the freeze time frame is too long, then they allocate budget to fix it, or otherwise plan around it.
There are a lot of options, but they are not technical.
I feel this is a lack of clarity around the word refactoring. Improving the code in a way that fixes bugs is "bug fixing", in a way that makes it do its job faster is "optimisation" and in a way that improves the design is "refactoring".
Of course one can do several of them at the same time. And add features, at least in the small.
Refactoring can be a valuable activity for bits of a code base where the cost of change could be usefully reduced. It's useful to have a word that can be used to describe that activity that isn't commonly conflated with bug-fixing or optimisation.
re: Write Your Tests
I've never been successful with this. Sure, write (backfill) as many tests as you can.
But the legacy stuff I've adopted / resurrected have been complete unknowns.
My go-to strategy has been blackbox (comparison) testing. Capture as much input & output as I can. Then use automation to diff output.
I wouldn't bother to write unit tests etc for code that is likely to be culled, replaced.
I've recently started doing shadow testing, where the proxy is a T-split router, sending mirror traffic to both old and new. This can take the place of blackbox (comparison) testing.
re: Build numbers
First step to any project is to add build numbers. Semver is marketing, not engineering. Just enumerate every build attempt, successful or not. Then automate the builds, testing, deploys, etc.
Build numbers can really help defect tracking, differential debugging. Every ticket gets fields for "found" "fixed" and "verified". Caveat: I don't know if my old school QA/test methods still apply in this new "agile" DevOps (aka "winging it") world.
I agree with many of your points, but that casual dig at semver is unwarranted and reveals a misunderstanding of the motivation behind it. Semver defines a contract between library authors and their clients, and is not meant for deployed applications of the kind being discussed here. Indeed, the semver spec begins by stating:
> 1. Software using Semantic Versioning MUST declare a public API.
It has become fashionable to criticize semver at every turn. We as a community should be more mindful about off-the-cuff criticism in general, as this is exactly what perpetuates misconceptions over time.
Semver is the outsider's view, i.e. marketing.
Two different things; conflating them causes heartache. Keep them separate.
I think you misread the author. He says "Before you make any changes at all write as many end-to-end and integration tests as you can." (emphasis mine)
> My go-to strategy has been blackbox (comparison) testing.
> Capture as much input & output as I can. Then use automation to diff output.
That's an interesting strategy! Similar to the event logs OP proposes?
You capture the initial output from the original code, then treat this canonical version as the expected result until something changes.
The next three months are spent learning what anything in the giant input blob even means, and the same for the output blob, and realizing that a certain value in the output comes directly from the SQL `SELECT … NULL as column_name …`, and now you're silently wondering if some downstream consumer is even using that.
Methinks I've prioritized writing of tests, of any kind, based on perceived (or acknowledged) risks.
Hmmm, not really like event logs. More of a data processing view of the world. Input, processing, output. When/if possible, decouple the data (protocol and payloads) from the transport.
First example: my team inherited some PostScript processing software. So to start we greedily found all the test reference files we could, captured the output, and called those the test suite. Capturing input and output requires lots of manual inspection upfront.
Second sorta example, whenever I inherit an HTTP based something (WSDL, SOAP, REST), I capture validated requests and generated responses.
Some other important points:
- Instrumentation and Logging: Also add an assert() function that throws or terminates in development and testing, but logs in production. Sprinkle it around when you're working on the code base. If an assert fires, your assumptions were wrong, and now you know a bit more about what the code does. The asserts also serve as documentation, and nothing says correct documentation like a silent assert.
- Fix bugs: Yes, and fix bugs causing errors first. Make it a priority every morning to review the logs and fix the causes of error messages until the application runs quiet. Once it's established that the app does not generate errors unless something is wrong, it will be very obvious when code starts being edited and mistakes start being made.
- One thing at a time: And minimal fixes only. Before starting a fix, ask what the minimal change is that will accomplish the objective. Once in the midst of a code tragedy, many other things will call out to be fixed. Ignore them. Accomplish the minimal goal. Minimal changes are easy to validate for correctness; rabbit holes run deep, and depth is hard to validate.
- Release: Also, almost the first thing to do on a poorly run project is validate the build and release scripts (if they exist). Validate generated build artifacts against a copy of the build artifact on the production machine. Use the Unix diff utility to compare files and content, or you will miss something small but important. For deployment, make sure you have a rollback scheme or a staged-percentage rollout scheme in place because, at some point, mistakes will be made. Release often, because the smaller the deploy, the less change and the less that can go wrong.
Same here - you have an oracle, it would be a waste not to use it. You can probably also think of some test cases that are not likely to show up often in the live data, but I would contend that until you know the implementation thoroughly, you are more likely to find input that tests significant corner cases in the live data, rather than by analysis.
I think that is precisely what the article advocates - although the definition of what end-to-end and integration tests are varies wildly from place to place.
> First step to any project is to add build numbers. Semver is marketing, not engineering. Just enumerate every build attempt, successful or not. Then automate the builds, testing, deploys, etc.
A thousand times this. And get to a point where the build process is reproducible, with all dependencies checked in (or if you trust your package manager to keep things around...). You should be able to pull down any commit and build it.
From my point of view, this is always key. The moment you have testable components is the moment you can begin to decompose the old system into parts. Once you begin with decomposition, it's easier to first pick the low-hanging fruit to show that you are advancing, and then transition to the difficult parts.
PS: I've spent my whole career maintaining & refactoring other people's code. I've never had any problem taking on orphan systems or refactoring old ones, and I kind of enjoy it.
If you have one of those old & horrible legacy systems, send it my way :D.
- Get a local build running first.
Often, a complete local build is not possible. There are tons of dependencies (databases, websites, services, etc.), and every developer has only a part of the system on their machine. Releases are hard to do.
I once worked for a telco company in the UK where the deployment of the system looked like this: (Context: Java Portal Development) One dev would open a zip file and pack all the .class files he had generated into it, and email it to his colleague, who would then do the same. The last person in the chain would rename the file to .jar and then upload it to the server. Obviously, this process was error prone and deployments happened rarely.
I would argue that getting everything to build on a central system (some sort of CI) is useful as well, but before changing, testing, DB freezing, or anything else is possible, you should try to have everything you need on each developer's machine.
This might be obvious to some, but I have seen this ignored every once in a while. When you can't even build the system locally, freezing anything, testing anything, or changing anything will be a tedious and error prone process...
I'd extend this and say that the CI server should be very naive as well. Its only job is to pull in source code and execute the same script (makefile, whatever) that the developers do. Maybe with different configuration options or permissions, but the developers should in principle be able to do everything the CI server does.
A big anti-pattern I see is build steps that can only be done by the CI server and/or rely on features of the CI server software.
Also added a bit about the very obvious backup that you need to make before starting any work at all. Just in case...
Edit: it also gives a lot of advice similar to the article's: big-bang rewrites are often impossible; draw a line somewhere in the application and do input-output diffing tests when you make a change.
FWIW, if you want to have a look at a reasonably complex code base being broken up into maintainable modules of modernized code, I rewrote Knockout.js with a view to creating version 4.0 with modern tooling. It is now in alpha, maintained as a monorepo of ES6 packages at https://github.com/knockout/tko
You can see the rough transition strategy here: https://github.com/knockout/tko/issues/1
In retrospect it would've been much faster to just rewrite Knockout from scratch. That said, we've kept almost all the unit tests, so there's a reasonable expectation of backwards compatibility with KO 3.x.
That's most likely not true, but looking backwards it often feels that way. The problem is that you're now a lot wiser about that codebase than you were at the beginning and if you had done that rewrite there could have easily been fatalities.
But of course it feels as if the rewrite would be faster and cleaner. How bad could it be, right? ;)
And then you suddenly have two systems to maintain, one that is not yet feature complete and broken in unexpected ways and one that is servicing real users who can't wait until you're done with your big-bang effort. And then you start missing deadlines and so on.
It's funny in a way that even after a successful incremental project that itch still will not go away.
That may not be true in this case, if the rewriter is also the original author and has remained active in the codebase over the years.
Even so, there is the Netscape story as evidence to the contrary.
But (disclaimer) as someone who has advocated for big-bang rewrites before, I'm still under the impression that there are situations where they can be net-better.
Factors may include:
- there is no database involved, just code. Even more helpful if the existing code is "pure".
- a single developer can hold the functionality in their head.
- there are few bugs-as-features, tricky edge cases that must be kept backwards-compatible, etc.
- as stated above, it's the primary author.
- much of the existing functionality is poor, and the path for building, launching, and shifting to a "replacement product" is relatively clear.
Advocating to never rewrite can be harmful, and make things harder for people for whom that actually would be the best approach.
But the situation I'm describing doesn't tick any of those boxes, and I think I made that quite clear in the preamble.
Oh, there's no doubt in my mind about that!
Some people may read this and extrapolate too far regarding their own situation (there's a reason this is a specialty field, it's hard stuff).
You're getting a bit of pushback on this sentiment, so I'll play devil's advocate a bit here.
I've tried gradual refactors in the past, with poor results, because unfocused technical teams and employee turnover can really kill velocity on long-term goals that take gradual but detailed work.
That is, replacing all those v1 API calls with the v2 API calls over five months seems fine, but there's risk that it actually takes several years after unexpected bugs and/or "urgent" feature releases come into play. And by that time, you might have employee turnover costs, retraining costs, etc.
I'm just saying the risk equation isn't as cut and dried as it seems. There is survivorship bias in play in both the "rewrite it" and the "gradually migrate it" camps.
Outside that boundary you're set up for failure.
This is also fraught with peril. However, it is a different set of problems. In an ideal world, you have engineers who can make reasoned decisions.
However, if the company culture allowed one application to devolve into chaos, what will make the second application better?
The real problem of course is to let things slide this far in the first place. But that's an entirely different subject, for sure the two go hand-in-hand and often what you touch on is the major reason the original talent has long ago left the company. By the time we get called in it is 11:58 or thereabouts.
An excellent talk about this is "The Scandalous Story of the Dreadful Code Written by the Best of Us" by Katrina Owen.
Like, when at 3:20 the presenter says there's a thing you can do that makes it utterly trivial to test this feature, I immediately assumed she'll just have to write some mocks for the `comm` package, and plug that in. Cool, I guess she'll talk about a nice mocking library or something, or there's some business complexity involved where the comm package is particularly stateful and so difficult to mock.
But no. The big difficulty seems to be that the language doesn't allow you to mock package-level functions; and so before you can mock anything you have to introduce an indirection - add an interface through which the notify package has to call things, move the code in the comm package into methods on that interface, correct all code to pass around this interface and call methods on it.
Why would you choose to work in a language that makes the most common testing action so painful?
Monkey patching is a sign of bad code in 99% of cases. In that 1% of cases where it might be justified, you can restructure your code to use indirection and dependency injection, and avoid having to use monkey patching. It might not be as nice as monkey patching in that 1% of cases. But I'd rather work in a language without monkey patching, precisely because it makes it incredibly obvious when you've coupled your shit.
Working in Go changed how I write my JS code. I don't know if you write much JS, but to my mind, `sinon` is mocking. `proxyquire` and `rewire` are monkey patching; monkey patching with the aim of helping mocking, but monkey patching none the less. My JS tests now don't use proxyquire or rewire, though they might use sinon. I find this produces easier to read code.
To me, having to change a function into a method on a singleton interface just to be able to mock it for tests seems like working around inadequacies of the language. And I'm not sure why `module.Interface.method` is easier to read than `module.function`.
Why do you say that? The idea one could get it right writing from scratch is one of those seductive thoughts, but in my experience it never works out that way.
Of course the alternate route – rewriting - is just a hypothetical so we can only suppose how it would've turned out.
That said, rewriting from scratch would've been pretty straightforward, since the design is pretty much set.
The real value of the existing code resides in the unit tests that Steve Sanderson, Ryan Niemeyer, and Michael Best created – since they illuminated a lot of weird and deceptive edge cases that would've likely been missed if we had rewritten from scratch.
So I suspect you are right, that it's just a seductive thought.
Also second system syndrome.
I work on my own side projects, read lots of other people's code on github and am always looking to improve myself in my craft outside of work, but I worry it's not enough.
Also, here is a paradox: take someone who has only ever seen terrible code bases and someone who has only ever seen very good code bases. How can they know? They might take a guess based on how well the software works, but that's probably not very reliable.
I think a good software engineer is someone who has seen a lot of different things, good and bad; someone who knows what design choices work and what will plunge software into the depths of Hell; probably someone who has made mistakes themselves and lived through the consequences.
But yeah, when working on such a code base, do read some code outside of it now and then, never forget there are better ways to do things. And if you are starting to feel burnt out by the quality of the code base you work on, you should probably make a change.
I was surprised to see the article say "It happens at least once in the lifetime of every programmer." I think if you work on greenfield projects your whole career, you're likely the one creating these 'steaming piles of manure'.
By working on bad legacy projects you learn an awful lot of things about what works and what is a problem to maintain - it will make you a better developer.
The only issue is that if you always work on legacy stuff and never get to write greenfield, you might get typecast as such. Whether that is a problem or not is up to you. It sounds like you care enough that you can change when/if you want to.
I would agree that if all you work on is greenfield you're probably making the messes others are cleaning up, but I don't think that means developers are bound to either make messes or clean them up. There are plenty of good, long-lived projects out there.
Not every old project is legacy.
At my current place of work, we're not even using XMLHttpRequest. We're using an antiquated XML library that's been hand-rolled (xajax plus major changes) to emulate our Ajax requests. It's insanity to me that we're still in this mode.
Eggzactly, well stated.
As long as you keep your eyes open to other people doing, right the first time, what your organization is struggling with, you have sufficient motivation to approach every problem with "why is this here, and how could we do this better?" The great thing about the state of F/OSS right now is that you have codebases that must change because of things like large amounts of RAM becoming so cheap: that very well understood algorithm designed to only do things in 64MB so as not to swap out no longer makes sense, and so there are intelligent motions to fix it. I've been planning on reading the Postgres 9.6 changes for parallel queries to understand how they did the magic in a sane and controlled manner and shipped a working feature.
Very incrementally - we've been adding more and more infrastructure since PostgreSQL 9.4. That finally became user-visible with some basic parallelism in 9.6, which will be greatly expanded in 10. There are some things we'd have done differently if we'd started in a green field, but that we had to do less optimally to avoid breaking the world...
I think one thing you can do is attempt to isolate the code surrounding the next chunk you work on. Do as much as you reasonably can of the things the article mentions. This may only be writing tests and adding logging, but if it's an improvement over what's there, you'll improve the experience of the next person involved with that code.
I'd warn you against jumping ship in hopes of finding a "clean" code base. Most code is somewhere on a spectrum of "maintainable enough" and something... grimmer.
If you really are unhappy and don't feel like you're growing or have the ability to grow, maybe try out contributing to a well-maintained OSS project. If you find yourself immensely happier, dust off your resume ;)
All that being said, certainly do not hesitate to look around if you feel like you aren't growing as fast as you could be. Life is short, and it's a seller's market for engineering labor in most places I have seen.
The main problem I have is how to structure what it is I'm trying to improve upon. I also want more external perspective to help guide me towards becoming better in the web development field, but I don't feel like the company I'm at has developers with a modern web development skillset to offer that guidance.
Unfortunately I work in an area where the web developer talent is pretty shallow. The general programmer talent pool is deep, but I still feel like the specialization towards webdev and modern practices just aren't here.
However, constantly putting out fires, under the gun, in horrible code bases, is probably not a good way to learn how to design software... It is a good way to learn how to debug and reason about problems, which is also a valuable skill to develop, though.
The causes of manure code are usually out of your control - tight deadlines; new devs touching stuff without properly understanding the whole; organization prioritizing short-term reward over long-term sustainability.
You also have to consider the inherent survivorship bias - only successful businesses live long enough that their codebase has time to grow into a big mess. Any company that lives more than a few years inevitably ends up with "manure". You'd have to be in the extremely rare position where you are profitable and have no pressure to keep growing (investors) in order to invest enough time into technical craft to not end up with manure code.
Also, even if your current codebases are manure, that doesn't mean everyone in your company makes manure. Find people on your team who don't write it, and learn from them.
If nobody is like that in your company, then maybe you should change jobs if you've been there more than two years. Cleaning up manure helps with interviewing because you can share your war stories with the interviewer.
Aside from your own projects, look for opportunities for other projects at work where you can start with a fresh technology stack. Some of these projects might be taking over the non-core functions of the main app. For instance, chances are a lot of the UI is sub-optimal (generic crud based) for some specific users. You might be able to create a slicker interface that makes it easier for them to do specific tasks that feed that data into the main database.
The goal being to increase your shop's "Bus Factor".
... in the form of a Jenkins build configuration. (If possible; if the system requires legacy compilers that only run on old Windows versions or a proprietary compiler for an embedded target, good luck.)
Thus, the use of README in the root directory of a project.
My office uses a combination of Redmine, Slack, email, gitlab, network drives, google docs, dropbox, some pdfs floating around, and a readme in the root of each repo...
It started out on...
That Novell drive
... but then...
That Win95 share
... nah, let's start using Confluence now
One tip is that when you've finally found the actual line(s) with the bug, always try to understand why the programmer made that mistake.
This has taught me much about what constructs are error prone.
From my own experience, it's really hard to know what's bad, what's good and what's an acceptable workaround if you've never seen anything different. Myself, I got lucky and ended up working on a project after the start of my career with someone who could explain the whats and (more importantly) the whys of bad/good/ugly code bases.
Generally, try and get some skill in being able to view a codebase from a high level. Draw it out on a whiteboard in boxes. Perhaps do this on other, pet projects first as it's nearly impossible to do this with a spaghetti-code project. If you can't pick out modular parts, then you have a big ball of mud. If you can, try and work on making and keeping them uncoupled. If you can, try and work on finding the natural boundaries of the other code you couldn't break up, and make those less coupled (you don't need to solve the coupling problems all at once!).
Are there a mix of architectural patterns in the code? This is pretty common when you're working on a legacy project. It's what happens when you get someone who doesn't really know how to architect, or there were a bunch of folks throughout the history of the project who (probably) had the right intentions but didn't finish. Or, and this is the worst, you had two or more team members trying to bend the project to their own preferences without communicating with each other. If this is the case, talk to your team, agree on one, and then you can work towards making the style consistent. You don't even need to pick the best one. Getting a project into a consistent state is better than having an ugly mix and match.
Are there a bunch of mixed-up design patterns floating around? Try and refactor those out as much as possible. Design patterns are great, and you should use them where appropriate. But if you find a lot of them nested within each other, it's not a good sign; it probably indicates someone at some point swallowed a design pattern book and thought it would be a good idea to implement them. All of them. Nested patterns can more than likely be refactored out to simplify the code. Though again, make sure you understand what they are there for first. Otherwise you may be unpicking something intentionally complex that needs to exist to remove complexity elsewhere.
What does the DB look like? Is it designed around the project's business logic? Is this sensible for your project? Personally, I dislike putting any business logic into the data storage layer, but it might be sensible for your particular project, so YMMV. If business logic in the DB is causing nasty workarounds, then you may have something else to refactor there, though this may not be possible.
Never refactor just for the sake of it! If you don't have buy-in for your ideas on how to improve a code-base from the rest of your team, you're going to be creating problems. You may also be missing critical information that your tech-lead knows about and made design decisions based on it. There have been several times I've tried to make things better as a Junior dev, only to find out I'd made some bad assumptions and created a mess.
Don't refactor without tests either. The system may be reliant on strange code, so get tests passing before changing things. That way you at least know the behaviour hasn't changed.
I'd love to hear a more balanced view on this. I think this idea is preached as the gospel when dealing with legacy systems. I absolutely understand that the big rewrite has many disadvantages. Surely there is a code base that has features such that a rewrite is better. I'm going to go against the common wisdom and wisdom I've practiced until now, and rewrite a program I maintain that is
1. Reasonably small (10k LOC, with large parts duplicated or repeated with only minor variable changes).
2. Barely working. Most users cannot get the program working because of the numerous bugs. I often can't reproduce their bugs, because I get bugs even earlier in the process.
3. No test suite.
4. Plenty of very large security holes.
5. I can deprecate the old version.
I've spent time refactoring this (maybe 50 hours), but that seems crazy because it's still a pile of crap, and at 200 hours I don't think it would look that different. I doubt a full rewrite would take even 150 hours.
Kindly welcoming dissenting opinions.
The most likely way to cross the threshold from refactor to rewrite, while steering clear of the "big bang rewrite", is that you have to ship a feature that triggers an end-run around some of the existing architecture. So you ship both the new architecture and the new feature, and then it works so well that you can deprecate the old one almost immediately, eliminating entire modules that proved redundant.
Edit: And if you don't really know where to start when refactoring, start by inlining more of the code so that it runs straightline and has copy-pasted elements (you can use a comment to note this: "inlined from foo()"). This will surface the biggest redundancies at a minimum of effort.
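A toy sketch of that inlining move (the function names here are invented for illustration): inline the small helpers into one straightline body, annotate where each piece came from, and the duplication and real data flow become visible.

```javascript
// Before: the logic is hidden behind small indirections.
function applyDiscount(price) { return price * 0.9; }
function addTax(price) { return price * 1.2; }
function checkout(price) { return addTax(applyDiscount(price)); }

// After: the same behaviour, inlined and annotated so redundancies
// can be spotted in one place.
function checkoutInlined(price) {
  const discounted = price * 0.9; // inlined from applyDiscount()
  return discounted * 1.2;        // inlined from addTax()
}
```

Once the straightline version has surfaced the redundancies, you can extract better-shaped functions from it.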
Hundreds of thousands to millions of LOC is a lot more problematic; many moving parts and weird interplay are to be expected.
E.g. it's assumed when talking about refactoring over rewriting that a large portion of the features is working. There should be some percentage below which it's worth rewriting over refactoring. Or perhaps a size where it's small enough to easily rewrite.
To me you can rewrite anything that:
(1) you fully understand (and you'd better be right about that)
(2) you have total control over already
(3) is small enough for (1) and (2) to be possible
(this is where I think a lot of people over-estimate their capabilities)
(4) where you have the ability to absorb a catastrophic mistake
(a call that is usually above the pay grade of the programmers)
(5) where you have a 'plan-B' in case the rewrite against all odds fails anyway
None of these are absolutes, if there is no business riding on the result then you can of course do anything you want. The history of IT is littered with spectacular failures of teams that figured they could do much better by tossing out the old and setting a date for the deploy of the shiny new system. Whatever you do make sure that your work won't add to that pile.
The older, larger, more poorly documented, and worse tested the system is, the bigger the chance that it is not fully understood.
It is painful to look at and work with the old code, so we want to avoid it. But some things worth doing are painful, like exercise, or getting a cavity filled.
Even so, you're better off doing a step-by-step rewrite, where the new stuff and the old stuff coexist in a single application. That way your users can continue getting incremental benefits over time even if the rewrite takes dramatically longer than your optimistic estimate.
If you can't figure out how to manage the complexity of a piecemeal rewrite, consider that you may not actually understand the system well enough to avoid making version 2 just as bad as version 1.
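One way of making old and new code coexist is a single entry point dispatching to both implementations. The sketch below is hypothetical (the function names and the `useNewQuote` flag are invented), but it shows the shape: the rewritten path only takes traffic once it provably agrees with the legacy one.

```javascript
// Legacy and rewritten implementations live side by side.
function legacyQuote(order) {
  return order.items * 10 + 5; // old logic, warts and all
}

function newQuote(order) {
  return order.items * 10 + 5; // rewritten module; must agree before cutover
}

// Single entry point: a flag (or a percentage rollout) decides which
// path runs, so users keep getting incremental benefits mid-rewrite.
function quote(order, flags) {
  return flags.useNewQuote ? newQuote(order) : legacyQuote(order);
}
```

When the flag has been fully on for a while with no regressions, the legacy path can be deleted one module at a time.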
Most people overestimate their ability to act differently than they've acted in the past. It's like the unjustified optimism of a New Year's resolution that this time you're actually going to exercise every day. To get a better result than last time, you need to impose some very clear rules on yourself that cause you to work differently.
That's not to say that it can never be successful, just that the circumstances in which it will are sufficiently rare that it's usually worth discounting relatively early on.
In >20 years of dev experience, I can only think of one occasion where I successfully did a big bang rewrite i.e. tore down an application and restarted it with an equivalent system that had approx zero common code.
In that case, it was a C++ program that wouldn't actually build from clean. A lot of the code was redundant as the use cases had morphed over time (and/or weren't ever required but were coded anyway) and most changes were stuffed into base classes as it was effectively impossible work out how objects interacted. Releases took about 3 months for about 2 weeks worth of dev.
Initially, I didn't plan to rewrite it. When I realised I couldn't understand what it was doing, I took a step back and worked out what it should have been doing, assuming that I could map one to the other. What I found was that, at heart, it should have been doing something fairly simple but that the original "designers" had thrown the kitchen sink at it and its core function was lost in the morass.
I also came up with a way of making it easy to show that the new system was correct more deeply than just tests. This gave me, and folks I needed to convince, a lot more confidence that a rewrite made sense than would normally be the case.
In summary, it was quite a rare set of events that led me to the conclusion that a rewrite was the right direction: the existing system being a complete basket case, my happening to have a lot of domain expertise, the problem space turning out to be relatively simple and finding a way to "prove" correctness, all contributed. I doubt I would have made the same decision if any of them were different.
I think the gospel view is when you have to do both...rewrite and big bang cutover. Especially when there is no obvious fallback.
Rewrites are definitely common and beneficial, but the successful ones always run the new code and the old code side-by-side for an extended period of time. Which means you're still tending and caring about the old code, even as you strive to direct most of your effort into the new code.
I don't agree with this. People can't write proper coverage even for a codebase that they 'fully understand'. You will most likely end up writing tests for very obvious things or low-hanging fruit; the unknowns will still seep through at one point or another.
Forget about refactoring code just to comply with your tests and breaking the rest of the architecture in the process. It will pass your 'test' but will fail in production.
What you should be doing is:
1. Perform architecture discovery and documentation (helps you with remembering things).
2. Look over last N commits/deliverables to understand how things are integrating with each other. It's very helpful to know how code evolved over time.
3. Identify your roadmap and what sort of impact it will have on the legacy code.
4. Commit to the roadmap. Understand the scope of the impact of anything you add/remove. Account for code, integrations, caching, database, and documentation.
5. Don't forget about things like jobs and anything that might be pulling data from your systems.
Identifying what will be changing and adjusting your discovery to accommodate those changes as you go is a better approach from my point of view.
By the time you reach the development phase that touches 5% of the architecture, your knowledge of the other 95% of the design will be useless, and in six months you will forget it anyway.
You don't cut a tree with a knife to break a branch.
Actually, contrary to the advice of the writer, I like to start out by fixing some bugs. I find it a great way to gain some knowledge, and it has the added benefit of keeping business stakeholders happy. And while fixing those bugs you can start writing the first integration and unit tests.
Your point on performing architecture discovery and documentation is spot on. It has really helped me to strip away the mess and understand the flow of the logic and maybe even shine some light on the parts of code that are valuable.
It's a simple event tracking system and yet there are 75 models, and over 80 controllers. This was outsourced to a team which coincidentally appears to have close to that many devs working there. The good news is that according to the client "it pretty much works". I know better than to suggest a Big Bang - though it seems so appealing.
Documentation and a code freeze are my next steps, along with implementing end-to-end testing.
You are not the only one :)
I am now working on a node.js app and I find it really hard to make any changes. Even typos when renaming a variable often go undetected unless you have perfect test coverage.
Some amount of this is good, but it often forces the chunk boundaries to be smaller than the "natural" clumping of data and behavior in a distributed system. IMHO this is a much worse problem than a messy monolith; you can refactor a monolithic codebase to be more modular, but refactoring hundreds of microservices is a herculean endeavor.
My problem with microservices is the word micro.
You mean "popular" dynamic languages due to their lack of tooling. Dynamic languages like Smalltalk scale up just fine, but Smalltalk has automated refactoring tools. In other words it's a tool support problem, not a dynamic language problem.
Static languages scale to large codebases. There's no app that a static language (and those who insist on static types) can't turn into a much larger codebase :-)
I love the imagery of "mountains of dirt": http://steve-yegge.blogspot.com/2007/12/codes-worst-enemy.ht...
I can't imagine a scenario where you need hundreds although I don't doubt that people will create such a system.
Or non-generalized, custom hard-coded static typed end-points for every single reference/option list and workflow state transition.
Welcome to our little "full 'big bang' rewrite" Frankenstein 4.0 :-(
Not my idea...
That, and it's often reading in data (JSON or XML) from another system, and it is what it is, so see if it quacks or not.
From the people that brought you SOAP, it's (drum roll) TYPE SCRIPT!
It's not really solving my problems, just making more work.
Also, I'm not a frontend guy, and the comment I was replying to was talking about node.js, but having to put a setTimeout or something in your tests just seems wrong.
On the other hand, expecting not to be bit after the last dozen times seems kinda stupid. I'm not a fan of MS.
Full stack is hard, at least if somebody wants you to swap in and out of levels several times a week. But that's another rant about ruining projects...
Don't rename properties that "escape" from a given context. Sorry, but it's not going to be a good use of your time. Do document (JSDoc or similar) what the property is used for and why (as far as you can tell).
It's OK to rename local variables and parameters (the "root" local identifier, not the properties), though.
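As a sketch of what that documentation might look like (the property names below are invented, not from any real schema): JSDoc records why an ugly escaping name must stay, while local variables are free to be readable.

```javascript
/**
 * A record that "escapes" this module: it is serialized to JSON and read
 * by other systems, so its property names must not be renamed.
 * @typedef {Object} Order
 * @property {string} cust_nm - Customer name; downstream consumers read this key.
 * @property {number} qty     - Line quantity; keep the legacy abbreviation.
 */

// Local variables, by contrast, are safe to rename for readability:
function describeOrder(order) {
  const customerName = order.cust_nm; // readable local name, property untouched
  return customerName + " x" + order.qty;
}
```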
It might not be Smalltalk (I wouldn't know), but the JetBrains IDE support for JS is pretty good in terms of type inference, "where defined" lookups, "show documentation" support, duplicate / undefined symbol detection and other stuff I'm probably forgetting at the moment.
Seriously, though, avoid the traditional class/constructor/prototype setup (rather than short lived object literals as parameter objects and return values). It makes things too widely visible, and harder to safely change later. And it's more work, anyway.
Learn how to refactor a nested function which uses closure values into a reusable function with a longer argument list, on which you can use partial function application as a form of dependency injection; or go the other way around, for something used in only one place.
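A minimal sketch of that refactoring (the `db` dependency and its `query` method are hypothetical): the closure-captured value becomes an explicit argument, and partial application restores the convenient zero-argument shape.

```javascript
// Before: fetchUsers captures `db` from the enclosing closure,
// so it can't be reused or tested on its own.
function makeReport(db) {
  function fetchUsers() { return db.query("users"); }
  return fetchUsers().length;
}

// After: the dependency is an explicit parameter...
function fetchUsersFrom(db) {
  return db.query("users");
}

// ...and partial application injects it, recreating the old shape.
function makeReport2(db) {
  const fetchUsers = () => fetchUsersFrom(db); // partial application as DI
  return fetchUsers().length;
}

// A fake `db` now makes either version trivial to exercise:
const fakeDb = { query: (table) => (table === "users" ? ["a", "b"] : []) };
```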
An important lesson in managing code in a dynamic language is to limit the scope of everything as much as possible. Software designed as a cluster of many mutable singletons is going to hurt.
OOP was the hotness in the 80s. It's time to learn other paradigms, too (move from the '60s to the '70s), even if IDE designers have to update how "intellisense" (aka auto-complete) works :-)
See Selenium. http://docs.seleniumhq.org/
Since property names are dynamic, avoid making data global (singletons, et al) at all costs, to limit the amount of string searching and informed "inferences" you have to make. Using a more functional programming style that tracks data flow of short lived data works better than trying the "COBOL with namespaces" approach of mutable data everywhere that gets whacked on at will.
Sorta ironic: monstrous, so-called "self-documenting" identifier names are not a good idea in a dynamic language. A short (NOT single-letter, but long enough to be a memorable mnemonic) identifier name is more likely to be typed and eyeball-checked correctly.
There is no "self-documenting" code; literate programming is your friend, or at least JSDoc is. It's not practical to put the "why" of something into its name.
Of course, if you inherited some hot mess written by a hard-core Java / C# programmer, yeah, life is gonna suck :-(
Disclaimer: I've been doing a lot of Angular the last couple of years, which is over reliant on long lived, widely visible, mutable data. I would rather go the route of something like Redux than Type Script, though. (I suppose you could do both, but I want to NOT do Type Script if I can help it)
I've also worked with a number of languages that had runtime types and/or that allowed some kind of "string interpolation" for identifiers here and there since the 80s. No biggie.
Buh, buh, buh, TYPESSSS!!! Yeah, so. Let's talk about excessive temporal coupling, (mutable) OOP (only) folks...
<RANT ENDS> (for now)
Then the upper management appointed a random guy to do a "Big Bang" refactor: it has been failing miserably (it is still going on, doing way more harm than good). Then it all started to go really bad... and I quit and found a better job!
I've done this sort of work quite a number of times and I've made mistakes and learned what works there.
It's actually the most difficult part to navigate successfully. If you already have management's trust (i.e., you have the political power in your organization to push a deadline or halt work), you're golden and all of the things mentioned in the OP are achievable. If not, you're going to have to make huge compromises. Front-load high-visibility deliverables and make sure they get done. Prove that it's possible.
Scenario 1) I came in as a sub-contractor to help spread the workload (from 2 to 3) building out a very early-stage application for dealing with medical records. I came in and saw the codebase was an absolute wretched mess. DB schema full of junk, wide tables, broken and leaking API routes. I spent the first two weeks just bulletproofing the whole application backend and whipping it into shape before adding new features for a little while and being fired shortly afterwards.
Lesson: Someone else was paying the bills and there wasn't enough visibility/show-off factor for the work I was doing so they couldn't justify continuing to pay me. It doesn't really matter that they couldn't add new features until I fixed things. It only matters that the client couldn't visibly see the work I did.
Scenario 2) I was hired on as a web developer to a company and it immediately came to my attention that a huge, business-critical ETL project was very behind schedule. The development component had a due date three weeks preceding my start date and they didn't have anyone working on it. I asked to take that on, worked like a dog on it and knocked it out of the park. The first three months of my work there immediately saved the company about a half-million dollars. Overall we launched on time and I became point person in the organization for anything related to its data.
Lesson: Come in and kick ass right away and you'll earn a ton of trust in your organization to do the right things the right way.
A huge issue with sticking to an old codebase for such a long time is that it gets older and older. You get new talent that doesn't want to manage it and leaves, so you're stuck with the same old people that implemented the codebase in the first place. Sure, they were smart, knowledgeable people in the year 2000, but think of how fast technology changes. Change, adapt, or die.
It's a complete fallacy to think that you're going to do much better than the previous crew if you are not prepared to absorb the lessons they left behind in that old crusty code.
So you'll have to learn them all over again.
> Change, adapt, or die.
Die it is then.
The causes of the big bang rewrite are usually not just "this code smells, let's rewrite it", but rather that the old product reached some technical dead end. Perhaps it can't scale. Perhaps it's a desktop product written in a UI framework that doesn't support high-DPI screens, and suddenly all the customers have high-DPI screens. Obviously in that situation you'd aim to just replace a layer of the application (a persistence layer, a UI layer), but as we all know that's not how it works. The cost of a rewrite shouldn't be underestimated - as you said, there is no reason to believe that if it took the last team 50 man-years, the new team will need any less. But that is in itself not a reason to not do it.
Great to see you be part of such a long lived team, that's a rarity these days. That's got to be a fantastic company to work for. Usually even relatively modest turnover (say 15% per year) is enough to effectively replace all the original players within a couple of years, most software projects long outlive their creators presence at the companies they were founded in. Add in some acquisitions or spin-outs and it gets to the point where nobody even knows who wrote the software to begin with.
There are always risks with every action taken. You can't be scared to take a big risk for a bigger payout versus sucking it up and doing things the way they've been done for 15 years.
First, to learn the problem. Second, to learn the solution. Third, to do it right.
Skip a step at your own peril.
Incrementalism and do-over both have their place.
If you're resurrecting legacy code, I can't imagine successfully rewriting it until after you understand both the problem and solution. Alternately, change the business (processes), so that the legacy can be retired / mooted.
And in the type of place that has a dysfunctional, legacy software system running core business operations, don't count on all the other ducks being in a row (anything resembling agile, ability to release to prod on a reasonable cadence, ability to provision sufficient test data, working test systems, etc).
If it's an internal system that you've been working on and maintaining... for 10 years... maybe (just maybe). If you're a consultant stepping in, I wouldn't touch that option for love or money.
There should be some sort of overlap before completely sunsetting the old system.
Fighting technical debt is hard. Fighting it with a blindfold is harder. Fighting it with 0 frame of reference is daunting. Fighting it the rest of the company is demanding new features right now is a recipe for stagnation, bugs, and burn out.
I wonder if it's time for professional software archeologists?
No, but it is time to make a real effort to teach the lessons learned to newcomers. I really feel that as an industry we completely fail at that. Blog posts such as these are my feeble attempt at trying to make a contribution to solving this problem.
Add this to "business requirements" and you get the big pile of manure we walk in every day. Like, how does knowledge of IEEE 754 help me if the requirement is to sum up some value over the last three days, unless the last three days fall on a weekend or holiday? (OK, stupid example.) The point is that domain language does not translate to computer language very well, and a programmer is not a domain expert. He is just a programmer, a creative programmer, and we are millions, each doing their thing a little differently.
1) "Do not fall into the trap of improving both the maintainability of the code or the platform it runs on at the same time as adding new features or fixing bugs."
Thanks. However, in many situations this is simply not possible because the business is not there yet so you need to keep adding new features and fix bugs. And still, the code base has to be improved. Impossible? Almost, but we're paid for solving hard problems.
2) "Before you make any changes at all write as many end-to-end and integration tests as you can."
Sounds cool, except in many cases you have no idea how the code is supposed to work. Writing tests for new features and bugfixes is good advice (but that goes against other points the OP makes).
3) "A big-bang rewrite is the kind of project that is pretty much guaranteed to fail."
No, it's not. Especially if you're rewriting parts of it at a time, as separate modules.
My problem with the OP is really that it tells you how to improve a legacy codebase given no business and time pressure.
That's exactly why this list is set up the way it is: you will get results fast and they will be good results.
If you want to play the 'I'm doing a sloppy job because I'm under pressure' card then consider this: the more pressure the less room there is for mistakes.
Here is a much more play-by-play account of one of these jobs where management gave me permission to do a write-up as part of the deal:
(For obvious reasons management usually does not give such permission, nobody wants to admit they let it get that far on their watch, I did my best to obscure which company this is about.)
What do you mean by 'fast'? If you can get meaningful improvements in a few months' time, then you're just working with smaller code base than what I thought of. If you're talking about stopping for a year, then .. well, that's the problem I'm talking about.
> If you want to play the 'I'm doing a sloppy job because I'm under pressure' card
No, I just wanted to share my opinion that I disagree with the overly generalized suggestions you're making.
Much faster than by going the rewrite route (assuming that is even possible, which I am convinced it isn't for anything but the most trivial problems). Preferably the first deploy happens within a few days, with the incremental changeover to the new situation starting within two weeks or so of the starting gun being fired.
> If you can get meaningful improvements in a few months' time, then you're just working with smaller code base than what I thought of.
> If you're talking about stopping for a year, then .. well, that's the problem I'm talking about.
Who said so?
All I said is that you should only do one thing at a time. Do not attempt to achieve two results with one release.
> No, I just wanted to share my opinion that I disagree with the overly generalized suggestions you're making.
You are very welcome to your own opinion about my 'overly generalized suggestions'; it's just that they are a lot more than suggestions, they are things that I (and others, see this thread for evidence) have used countless times and that simply work.
All you do is a bunch of naysaying without offering up anything concrete as an alternative that would work better or evidence that anything posted would not work in practice. It does and it pays my bills.
> deploy within a few days and incremental changeover to the new situation starting within two weeks or so
I'm going to take this as confirmation that you're working on very, very small projects. This would be an extraordinarily unrealistic timeframe for large projects, which take vastly larger quantities of time to apply the steps you've outlined - which, in turn, renders those steps useless in a competitive business context as far as large applications are concerned.
500K lines is 'small' by our standards and if we are not moving within two weeks that translates into one very unhappy customer. That's something a typical team of 5 to 10 people has produced in a few years.
Note that I wrote 'incremental' and 'starting'. That doesn't mean the job is finished at that point in time. But we should have a very solid grasp of the situation, which parts are bleeding the hardest and what needs to be done to begin to plug those holes. That the whole thing in the end can become a multi-year project is obvious, we're not miracle workers, merely hard workers.
In a way the size of the codebase is not even relevant. What is most important is that you get the whole team and the management aligned behind a single purpose and then follow through on that. Those first couple of weeks are crucial; they are tremendously hard work even for a seasoned team that has worked together on jobs like these multiple times.
The one case I wrote about here was roughly that size (so small by my standards), within 30 days the situation was under control. We're now two years later and they are still working on the project but what was done in that short period is the foundation they are still using today.
If a project is much larger than that then obviously it will take more time. Just the discovery process can take a few weeks to months, but in that case I would recommend splitting the project up into several smaller ones that can be operated on independently, with 'frozen interfaces' wherever they can be found.
That way you can parallelize a good part of the effort without stepping on each other's toes all the time.
The problem is not that you can't tackle big IT projects well. The problem is that big IT projects translate into big budgets and that in turn attracts all kinds of overhead that does not contribute to the end result.
If you strip away that overhead you can do a lot with a (relatively) small crew.
If you're going to tackle a codebase in excess of something like 10M LOC in this way you will again run into all kinds of roadblocks. For those situations it would likely pay off to spend a few months on the plan of attack alone.
If a project that large came my way I would refuse, it would tie us down for way too long.
But that's out of scope for the article afaic; we're talking about medium to large projects, say 50 man-years worth of original work that has become unmaintainable for some reason or other (mass walk-out, technical debt out of control, or something to that effect).
If those are 'very very small projects' by your standards then so be it.
That's the scale I'm talking about, so at least we're on the same page there.
It sounds to me like your specialty routinely puts you in situations where the client has reached the end of the line and is in Hail Mary Mode, where they're amenable to having a consultant do Whatever It Takes to turn things around. To me, that sounds like just about the best case scenario for addressing the issues with legacy software, and pretty far removed from the Usual Case.
In my mind, the Usual Case is legacy software that's in obvious decline but still has significant utility, and for which there is still a significant portion of the market that can be attracted with added features. That's the long tail for a huge swath of the industry. In those cases, it's unthinkable to halt development for _any_ significant stretch of time. It's dog eat dog out here, and when your competitors aren't pausing for breath, you can't either - it's just a totally different world, and I think you're inappropriately pushing the wisdom from your own corner of it out into spaces where it's just not applicable.
In a similar vein, I think your opinions on rewrites are a bit skewed by the fact that the _only_ ones you encounter in your specialty are ones that have failed miserably (or at the very least, they're seriously overrepresented).
You clearly have a very solid and proven game plan for the constraints you're used to, but I think many of the extrapolations aren't valid.
Because if the only extra constraint would be 'you can't halt development' then that's easy enough: simply iterate on smaller pieces and slip in the occasional roadmap item to grease the wheels. But that does assume that development had not yet ground to a halt in the first place.
The biggest difference between your experience and my experience I think is that our little band of friends is external, so we get to negotiate up front about what the constraints are and if we put two scenarios on the table, one of which is ~70% cheaper because we temporarily halt development completely then that is the most likely option for the customer to take.
End-to-end tests are great to verify that the system actually works as expected per the requirements spec. You should know how to write these; otherwise, how are you even testing your feature to begin with after you've written it?
And big rewrites always take longer than people think, which can sink a business if they're not careful with their resources and don't manage their time appropriately. All in all, these points you've mentioned all seem actually very reasonable to me.
>No, it's not. Especially if you're rewriting parts of it at a time as separate modules
I guess it depends on what he considers to be a "big bang rewrite." I don't think any of the incremental approach you mention counts as one.
You might end up rewriting the entire codebase through an incremental approach ala the Ship of Theseus through a series of smaller rewrites, but that's something very different and distinct from a "big bang rewrite" to me.
I would argue that if you have no idea how the codebase works, adding new features without breaking anything else is going to be sheer luck most of the time.
Keep in mind that almost everybody that we end up cleaning up after has the exact attitude that you display and the only reason they feel that way is because they leave before the bill is due.
I don't mind, it keeps me employed.
(It feels really odd that someone tells me that he's making a living after cleaning up after people like me while so far I've thought I am paid for scaling small, poorly written systems up to enterprise levels, but well. I'd have appreciated more if we could have talked about specifics instead of "this works because I know it works").
As you can see elsewhere in this thread I'm more than willing to change my tune and/or update the post if there is relevant information.
But your initial tone of voice + your categorical denial that these things are valuable makes it a bit harder to find common ground.
If you are scaling small poorly written systems then that already gives one very important data point that is divergent with the situation I've written about. The systems you start with are small, the systems I start with are usually large to very large and are running a mid to large enterprise and are - if you're lucky - a decade old or even much older. Either that or they are recent - and totally botched - rewrites.
If you are happy in your groove then more power to you but chances are that sooner or later you too will be handed a pile of manure without a shovel to go with it and maybe then you'll find some useful tips in that blogpost.
And the bit about the bookkeeping applies to your situation just as much as it does to larger and older systems.
It seems to be de rigueur when dealing with some of the top-ranked people on HN, who all too often seem to have long ago forgotten the rules that they apparently think no longer apply to them because they have more karma than god.
My hat is off to you in how well you handled this.
I don't look to stick anyone with a knife. It is shocking how comfortable people are with casually watching my suffering and doing nothing about it while acting like any push back I give against the systemic issues that help keep me trapped in dire poverty somehow makes me evil incarnate.
Promising them the moon to only yank the rug out from under them when it matters. You know full well what I'm talking about here and that's something that I will not forgive you for.
Your circumstances have nothing to do with any of this.
If I am guessing correctly as to what you are talking about, you have that backwards. That person abandoned me. I did not abandon them.
Make the change simple, then make the simple change.
They only know it took you three days to implement the feature. They don't need to know how you spent the first 2 days.
Paying down that debt allows the team to scale larger and maintain velocity longer. If they don't like your rate of delivery now, how are they going to like it when the code calcifies and everything takes twice as long?
> Before you make any changes at all write as many end-to-end and integration tests as you can.
I'm beginning to see this as a failure mode in and of itself. Once you give people E2E tests, it's the only kind of test they want to write. It takes about 18 months for the wheels to fall off, so it can look like a successful strategy. What they need to do is learn to write unit tests, but for that you have to break the code up into little chunks. That doesn't match their aesthetic sense, so it feels juvenile and contrived. The ego kicks in and you think you're smart enough that you don't have to eat your proverbial vegetables.
The other problem is that E2E tests are slow, they're flaky, and nobody wants to think about how much they cost in the long run because it's too painful to look at. How often have you seen two people huddled over a broken E2E test? Multiply the cost of rework by 2.
> "add a single function to increment these counters based on the name of the event"
While the sentiment is a good one, I would warn against introducing counters in the database like this and incrementing them on every execution of a function. If transaction volumes are high, then depending on the locking strategy in your database, this could lead to lock contention and blocking. Operations that could previously execute independently in parallel now have to compete for a write lock on this shared counter, which could reduce throughput. In the worst case, if there are scenarios where two counters can be incremented inside different transactions but in different orders (not inconceivable in legacy code), then you could introduce deadlocks.
Adding database writes to a legacy codebase is not without risk.
If volumes are low you might get away with it for a long time, but a better strategy would probably be just to log the events to a file and aggregate them when you need them.
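A minimal sketch of that file-based alternative, assuming nothing about the surrounding system (the function and file names here are illustrative, not from the article): an append of one short line takes no database locks, and the tallying happens offline, only when you want the numbers.

```python
import collections

EVENT_LOG = "events.log"  # hypothetical path; pick whatever fits your deployment

def record_event(name, log_path=EVENT_LOG):
    """Append a single event name to a plain-text log.

    Appending one line is cheap and involves no shared database row,
    so hot code paths don't contend with each other.
    """
    with open(log_path, "a") as f:
        f.write(name + "\n")

def aggregate_events(log_path=EVENT_LOG):
    """Tally events offline, whenever you actually need the counts."""
    counts = collections.Counter()
    with open(log_path) as f:
        for line in f:
            counts[line.strip()] += 1
    return counts
```

The trade-off is that the counts are eventually consistent rather than transactional, which for "is this code path still used?" instrumentation is usually fine.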
I doubt we'll ever see automation beyond what we do today in this space.
Really? That's pretty pessimistic, considering what DeepMind is doing.
- Automate refactoring code to reduce complexity and cross-dependencies
- Automate rewriting parts of the code in more modern languages and replacing it with some mediation layer (protobuf etc.)
I think industries like finance would welcome with open arms something that can do this. And it could go for a high price if it's still saving them money on countless hours of developer time. It's a growing cost every year to maintain legacy code that was written 3+ developer generations ago, and it's dangerous in cases where people's lives depend on the code being bug-free (infrastructure, medical).
I started thinking about this problem a few days ago in a thread about AI https://news.ycombinator.com/item?id=14430652
Highly disagree about the order of coding. That guy wants to change the platform, redo the architecture, refactor everything, before he starts to fix bugs. That's a recipe for disaster.
It's not possible to refactor anything while you have no clue about the system. You will change things you don't understand, only to break the features and add new bugs.
You should start by fixing bugs. With a preference toward long standing simple issues, like "adding a validation on that form, so the app doesn't crash when the user gives a name instead of a number". See with users for a history of simple issues.
That delivers immediate value. This will quickly earn you credit with the stakeholders and the users. You learn the internals by doing, before you attempt any refactoring.
The idea is a good one, but the specific suggested implementation… hasn't he heard of statsd or Kibana?
If you have access to a tool like that, by all means use it; the specific implementation is not relevant. The article merely tries to show the simplest way to implement this very useful functionality, one that will work without limitation on just about anything I can think of.
YMMV, though I would steer people towards an off-the-shelf solution over rolling your own.
Does "non-unix" mean Windows? My experience there has been that you can find a statsd client for your language of choice, and a way to plug whatever logging tool you have into kibana.
The whole reason these jobs exist is that modern tooling and the luxury that comes with it are unavailable. But I've yet to find a platform where that counter trick did not work; even on embedded platforms you can usually get away with a couple of incs and a way to read out the counters.
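For what it's worth, the trick amounts to nothing more than a table of named counters bumped in the code paths you care about, plus some read-out mechanism. A minimal sketch of that idea in Python (the counter names are made up; on an embedded target the same shape is an array of ints and a dump routine):

```python
# The whole trick: a table of counters, one increment per code path,
# and a dump function as the read-out mechanism.
COUNTERS = {}

def bump(name):
    """Increment a named counter: an in-memory inc, no I/O, no locks."""
    COUNTERS[name] = COUNTERS.get(name, 0) + 1

def dump():
    """The read-out: a sorted snapshot of every counter seen so far."""
    return dict(sorted(COUNTERS.items()))
```

Sprinkle `bump("parse_ok")`-style calls into the branches you want visibility on, and read `dump()` out however the platform allows, even if that's just printing it on shutdown.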
If the timing isn't too close to failure.
One interesting case involved a complex real-life multi-player game with wearable computers. In the end we got it to work, but only by making all the software run twice as fast as it did before so we could use the odd cycles for the stats collection without the rest of the system noticing. That was a bit of a hack. And the best bit: after making it work we used all the freed-up time to send extra packets to give the system some redundancy, which greatly improved reliability.
That system was running 8051 micro controllers and the guy that wrote the original said that 'this couldn't be done'. Fun times :)
The server-side portion of that particular project got completely re-written as well, roughly along the lines presented in the article. That wasn't a huge project (500K lines or so), but I was very happy it wasn't my first large technical-debt mitigation project, or I would likely have foundered.
Be happy if your dev environment does not include an emulated version of the real hardware that mysteriously does not seem to be 100% representative of the real thing.
What actual systems have you worked on that were connected to a database, but couldn't send UDP?
Anything on mainframes or older systems that do not have ethernet.
Anything running Netware or equivalent (true, there you could probably hack some kind of interface but whether it would be reliable or not is another matter).
I have done a number of serious refactorings myself, and good tests do me a huge favor, even if I have to grit my teeth for a few days to a few weeks to write them.
Bottom line: If the project cannot afford to properly maintain the code, it's a failure of the business model. Projects can be maintained indefinitely, but it costs money. And that means the project has to bring in enough money to pay for those maintenance costs.
The options, as I see them:
1. Accept that this particular project, and those that intimately depend on it, has a lifecycle and will eventually die, either slowly or quickly. Prepare for that fact, staying ahead of the reaper by quitting, transferring to another project, etc.
2. Build a case to leadership that the project is underfunded long-term. This takes communication skills, persuasion skills, technical skills, and political skills. You'll need to go to all the stakeholders in their frame of reference and explain the risk involved in fundamentally depending on legacy code.
Anyway, engineers tend to see the "legacy code" problem as a technical one. It is in the sense it takes technical work to fix it. But the root cause is a misallocation of resources. If the needed resources aren't there in the first place, the problem is a bad business model.
This type of situation is usually a red flag that the company's management doesn't understand the value of maintaining software until they absolutely have to. That, in itself, is an indicator of what they think of their employees.
Recent conversation with the manager of a company: "I've yet to see anybody give me a good reason why we need to maintain the software we already built if it works."
If it's code that has been running successfully in production for years, be humble.
Bugfixes, shortcuts, constraints - all are real life and prevent perfect code and documentation under pressure.
The team at Salesforce.com is doing a massive re-platforming right now with their switch to Lightning. Should provide a few good stories, switching over millions of paying users, not fucking up billions in revenue.
The problem, in my mind, is that code can't be accurately modeled on one axis from "low level" to "high level". You can slice a system in many ways:
- network traffic
- database interactions
- build time dependencies
- run time dependencies
- hardware dependencies
- application level abstractions
...and certainly more. On top of that, the dimensions are not orthogonal. You might need to bump the major version of a library to support a new wire format, for example. Anyway, since there are many ways to slice a project, what is "high level" from one perspective can be "low level" from another. And vice versa.
My actions are usually these:
* Fix the build system, automate build process and produce regular builds that get deployed to production. It's incredible that some people still don't understand the value of the repeatable, reliable build. In one project, in order to build the system you had to know which makefiles to patch and disable the parts of the project which were broken at that particular time. And then they deployed it and didn't touch it for months. Next time you needed to build/deploy it was impossible to know what's changed or if you even built the same thing.
* Fix all warnings. Usually there are thousands of them, and they get ignored because "hey, the code builds, what else do you want." The warning-fixing step lets you see how fucked up some of the code is.
* Start writing unit tests for things you change, fix or document. Fix existing tests (as they are usually unmaintained and broken).
* Fix the VCS and enforce sensible review process and history maintenance. Otherwise nobody has a way of knowing what changed, when and why. Actually, not even all parts of the project may be in the VCS. The code, configs, scripts can be lying around on individual dev machines, which is impossible to find without the repeatable build process. Also, there are usually a bunch of branches with various degrees of staleness which were used to deploy code to production. The codebase may have diverged significantly. It needs to be merged back into the mainline and the development process needs to be enforced that prevents this from happening in the future.
Worst of all is that in the end very few people would appreciate this work. But at least I get to keep my sanity.
I'm loath to give examples so as not to constrain your thinking, but, for example, imagine a bunch of hairy Perl had been built to crawl web sites as part of whatever they're doing, and it just so happens that these days curl or wget do more, do it better, and with fewer bugs than everything they had built. (Think of your own examples here, anything from machine vision to algebraic computation, whatever you want.)
In fact isn't this the case for lots and lots of domains?
For this reason I'm kind of surprised the "big bang rewrite" is written off so easily.
A code base that is non-existent, as the previous attempts were done with MS BI (SSIS) tools (for all the things SSIS is not for) and/or SQL stored procedures, with no consistency in coding style or documentation, over 200 databases (sometimes 3 per process, existing only to house a handful of stored procedures), a complete developer turnover about every 2 years, and senior leadership in the organization clueless about any technology.
As you look at ~6000 lines in a single stored procedure, you fight the urge to light the match and give it some TLC (Torch it, Level it, Cart it away) to start over with something new.
Moral of the story: as you build and replace things, stress to everyone to "concentrate on getting it right, instead of getting it done!" so you don't add to the steaming pile.
I really mean it: a whole lot of programmers simply don't read the codebase before starting a task. Guess the result, especially in terms of frustration.
^ Yes and no. That might take forever and the company might be struggling with cash. I would instead consider adding a metrics dashboard. Basically, find the key points: payments sent, payments cleared, new user, returning user, store opened, etc. This isn't as good as a nice integration suite, but if a client is short on cash and needs help, it can be set up in hours. With this in place, after adding or editing code you can calm investors/CEOs. Alternatively, if it's a larger corp it will be time-strapped - then push for the same thing :)
I completely agree with the sentiment that scoping the existing functionality and writing a comprehensive test suite is important - but how should you proceed when the codebase is structured in such a way that it's almost impossible to test specific units in isolation, or when the system is hardcoded throughout to e.g. connect to a remote database? As far as I can see it'll take a lot of work to get the codebase into a state where you can start doing these tests, and surely there's a risk of breaking stuff in the process?
Work from the outside in, keeping most of the system as a black box. Start with testing the highest-level behaviors that the business/users care about.
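One concrete way to start that outside-in approach is "golden master" (characterization) testing: record what the black box currently does for a set of representative inputs, bugs and all, and pin those observations down as the oracle for later refactoring. A sketch under stated assumptions, with `legacy_price_quote` as a hypothetical stand-in for the real entry point (in reality you would call the system from the outside, via HTTP, a CLI, or a stored procedure):

```python
def legacy_price_quote(quantity, customer_type):
    """Hypothetical stand-in for a legacy entry point you do NOT modify."""
    price = quantity * 10
    if customer_type == "wholesale":
        price = int(price * 0.9)  # whatever the system currently does is "correct"
    return price

def characterize(func, inputs):
    """Record the system's current observable behavior for a set of inputs."""
    return {args: func(*args) for args in inputs}

# Pin current behavior before touching anything; these recorded
# outputs become the test oracle for the refactored system.
INPUTS = [(1, "retail"), (10, "retail"), (10, "wholesale")]
GOLDEN = characterize(legacy_price_quote, INPUTS)

def check_against_golden(new_func):
    """After refactoring, every recorded case must still match."""
    return all(new_func(*args) == expected for args, expected in GOLDEN.items())
```

The point is that you need zero understanding of the internals to build this safety net; understanding can come later, behind it.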
The key is an engaged business unit, clear requirements, and time on the schedule. Obviously if one or more of these things sounds ridiculous then the odds of success are greatly diminished. It is much easier if you can launch on the new platform a copy of the current system, not a copy + enhancements, but I've been on successful projects where we launched with new functionality.
The ones I have seen - and this is actually one of the major reasons the clean-up crew gets called in the first place - is big bang rewrite projects gone astray.
One huge problem with rewrites of old code is that the requirements are no longer known or even misunderstood.
Freezing a whole system is practically impossible. What you usually get is a "piecewise" freeze. As in: you get a small portion of the system that does not change for a given period.
The real challenge is: how can you split your project in pieces of functionalities that are reasonably sized and replaceable independently from each other.
There is definitely no silver bullet for how to do this.
edit: I'm being a little snarky here, but the assumptions here are just too much. This is all best-case scenario stuff that doesn't translate very well to the vast majority of situations it's ostensibly aimed at.
> write as many end-to-end and integration tests as you can
> make sure your tests run fast enough to run the full set of tests after every commit
At my last gig we used this exact strategy to replace a large ecommerce site piece by piece. Being able to slowly replace small pieces and AB test every change was great. We were able to sort out all of the "started as a bug, is now a feature" issues with low risk to overall sales.
Really? Are there no circumstances under which this would be appropriate? It seems to me this makes assumptions about the baseline quality of the existing codebase. Surely sometimes buying a new car makes more sense than trying to fix up an old one?
The only caveat is if you have spent the time to truly understand the codebase, then maybe you can do it. Most people advocate a rewrite because they don't WANT to understand the codebase. Even if you understand the codebase, it's pretty dangerous, but at least you have some idea of what you're saying you will rewrite.
So yeah, it can happen, but if you are in the situation that you have the knowledge and experience to override that rule, then you have the knowledge and experience to know that you CAN override that rule. It sounds a little circular, but it's how I tend to aim my broadly-given advice. If someone knows what they're doing, they should be able to recognize when they can ignore your advice. Anything else would have to be tailored to each specific instance, which isn't plausible in a blog post.
When you rewrite a software system, you do it yourself. You don't know whether you'll succeed. You might end up with worse end-results. The assumption here is that no off-the-shelf software can be used to replace it. Hence rewrite.
I'm actually quite curious; how long does this process typically take you?
What are the most relevant factors on which it scales? Messiness of existing code? Number of modules/LOC? Existing test coverage?
How long it takes depends on the mandate given by management. Sometimes it's 30 days to get from zero to something stable and incrementally improvable at which point we hand back to the company with maybe a transition period where we still manage the project. Sometimes it is just a feasibility study in which case it can be even shorter. But if it is boots-in-the-mud (which is where the real money is) then it can be up to a year.
It scales just fine provided you have the people and this is more often than not a huge problem. It's happened that we had to leave people in place for months or even years after the project was in essence done simply because as soon as our backs were turned it was back to the usual methods. That's actually really frustrating when it happens.
Existing test coverage can speed things up but if the tests are brittle or otherwise not helpful can actually make things much worse.
As for number of modules or LOC: if you're doing a platform switch that can really eat up time, if it is just to bring things under control then it does not really matter much.
One you did not mention, but which can greatly impact the speed with which you can move is the quality of existing documentation. If there is anything at all, especially up to date requirements documentation that can serve as a tie breaker between a suspected bug or a feature it can make a huge difference.
1. Find out which functionality is still used and which functionality is critical
Management will always say "all of it". The problem is that what they're aware of is usually the tip of the iceberg in terms of what functionality is supported. In most large legacy codebases, you'll have major sections of the application that have sat unused or disabled for a couple of decades. Find out what users and management actually think the application does and why they're looking to resurrect it. The key is to make sure you know what is business critical functionality vs "nice to have". That may happen to be the portions of the application that are currently deliberately disabled.
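One cheap way to answer the "what is actually used?" question, assuming you can touch the entry points at all: wrap them with something that records each call, let that run in production for a while, then look at what never fired. A sketch (the decorator name and the two example entry points are invented for illustration):

```python
import functools
import time

# entry-point name -> (call count, timestamp of last call)
USAGE = {}

def track_usage(func):
    """Record that an entry point was exercised; behavior is unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        count, _ = USAGE.get(func.__name__, (0, None))
        USAGE[func.__name__] = (count + 1, time.time())
        return func(*args, **kwargs)
    return wrapper

@track_usage
def generate_invoice():   # hypothetical entry point that is still in use
    return "invoice"

@track_usage
def export_to_fax():      # hypothetical suspected-dead feature
    return "fax"
```

After a few weeks, entry points that are absent from `USAGE` (or whose last-call timestamp is stale) are your candidates for the "sat unused for decades" pile, and hard numbers beat asking management, who will always say "all of it."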
Next, figure out who the users are. Are there any? Do you have any way to tell? If not, if it's an internal application, find someone who used it in the past. It's often illuminating to find out what people are actually using the application for. It may not be the application's original/primary purpose.
2. Is the project under version control? If not, get something in place before you change anything.
This one is obvious, but you'd be surprised how often it comes up. Particularly at large, non-tech companies, it's common for developers to not use version control. I've inherited multi-million line code bases that did not use version control at all. I know of several others in the wild at big corporations. Hopefully you'll never run into these, but if we're talking about legacy systems, it's important to take a step back.
One other note: If it's under any version control at all, resist the urge to change what it's under. CVS is rudimentary, but it's functional. SVN is a lot nicer than people think it is. Hold off on moving things to git/whatever just because you're more comfortable with it. Whatever history is there is valuable, and you invariably lose more than you think you will when migrating to a new version control system. (This isn't to say don't move, it's just to say put that off until you know the history of the codebase in more detail.)
3. Is there a clear build and deployment process? If not, set one up.
Once again, hopefully this isn't an issue.
I've seen large projects that did not have a unified build system, just a scattered mix of shell scripts and isolated makefiles. If there's no way to build the entire project, it's an immediate pain point. If that's the case, focus on the build system first, before touching the rest of the codebase. Even for a project with excellent processes in place, reviewing the build system in detail is not a bad way to start learning the overall architecture of the system.
More commonly, deployment is a cumbersome process. Sometimes cumbersome deployment may be an organizational issue, and not something that has a technical solution. In that case, make sure you have a painless way to deploy to an isolated development environment of some sort. Make sure you can run things in a sandboxed environment. If there are organizational issues around deploying to a development setup, those are battles you need to fight immediately.
1. Integration with whatever build/issue tracking systems are present is worth preserving until you have the time to recreate it properly.
Duplicating what's already there under the new environment is always more problematic than it looks at first glance. This is especially true when you're dealing with any in-house components (which usually manage to show up somewhere).
2. A clean break where you leave the old VCS behind and archived is tempting, but it's rarely ideal in the long-term.
The old archive is likely to wind up being deleted/lost/bitrotted/etc after a year or two. Invariably, you wind up in a spot a few years down the line where it would be useful to have the full commit history, and the old VCS winds up being inaccessible. Ideally, you'd want to preserve as much history as possible when migrating. However, trying to correctly preserve commit history (and associated issue tracker info, etc) is always a time-sink, in my experience. It's easy for simple projects, and a real pain for complex projects with a weird, long history. Choose the time that you attempt it wisely.
Again, I'm not saying don't move, I'm just saying that it almost always winds up taking a lot of time and effort. I'd argue you're better off spending that time and effort on other portions of the project early on.
Also, things like git-svn can be real lifesavers in some of these cases, though they do add an extra layer of complexity. If you do want to use a different VCS, I'd take the git-svn/etc. approach until you're sure there are no extra integration problems.
All that said, yeah, if there's no history and no integration with other systems/tools, go straight for something modern!
(Speaking from work experience)
In my work experience, when renewing the system it was more beneficial to delete the legacy code and provide only the necessary functions.