Ask HN: Inherited the worst code and tech team I have ever seen. How to fix it?
557 points by whattodochange on Sept 18, 2022 | 676 comments
I have to find a strategy to fix this development team without managing them directly. Here is an overview:

- this code generates more than 20 million dollars a year of revenue

- it runs on PHP

- it has been developed for 12 years directly on production with no source control (hello index-new_2021-test-john_v2.php)

- it doesn't use composer or any dependency management. It's all require_once.

- it doesn't use any framework

- the routing is managed exclusively as rewrites in NGINX (the NGINX config is around 10,000 lines)

- no code has ever been deleted. Things are just added. I gather that's because it was developed directly on production and deleting things is too risky.

- the database structure is the same mess: no migrations, etc. When adding a column, because of the volume of data, they add a new table with a join.

- JS and CSS are the same. Multiple versions of jQuery fighting each other depending on which page you are on, or even on the same page.

- no MVC pattern of course, or whatever pattern. No templating library. It's PHP 2003 style.

- In many places I see controller-like files making curl requests to the app's own REST API (via domain name, not localhost), doing OAuth authorizations, etc... just to get the menu items or the list of products...

- no caching (well, there is memcached, but it's only used for sessions...)

- team is 3 people, quite junior. One backend, one front, one iOS/android. Resistance to change is huge.

- productivity is abysmal which is understandable. The mess is just too huge to be able to build anything.

This business unit has a pretty aggressive roadmap, as management and HQ have no real understanding of these blockers. And post-COVID, the budget is really tight.

I know a full rewrite is necessary, but how to balance it?

First off, no, a full rewrite is not only not necessary, but probably the worst possible approach. Do a piece at a time. You will eventually have re-written all the code, but do not ever fall into the trap of a "full re-write". It doesn't work.

But before you re-write one line of code - get some testing in place. Or, a lot of testing. If you have end-to-end tests that run through every feature that is currently used by your customer base, then you have a baseline to safely make changes. You can delete code as long as the tests pass. You can change code as long as the tests pass.
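Even a shell script poking a staging copy with curl gives you that baseline. A minimal sketch -- the routes and expected strings here are made-up examples, and BASE_URL should point at a staging copy, never at production:

```shell
#!/bin/sh
# Minimal smoke test: fetch a page and assert that a known string is present.
check() {
  path="$1"; expected="$2"
  body=$(curl -fsS "$BASE_URL$path") \
    || { echo "FAIL: $path did not return 2xx"; return 1; }
  case "$body" in
    *"$expected"*) echo "ok: $path" ;;
    *) echo "FAIL: $path missing '$expected'"; return 1 ;;
  esac
}

# Example run against a staging copy (routes/strings are made up):
#   BASE_URL=http://staging.internal sh smoke.sh
#   check "/"         "<title>"
#   check "/products" "Add to cart"
```

Run it before and after every change; the moment a check fails you know exactly which change to revert.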

Once you are at that point, start picking off pieces to modernize and improve.

Also, respect the team. Maybe they aren't doing what you would, but they are keeping this beast alive, and probably have invaluable knowledge of how to do so. Don't come in pushing for change... come in embracing that this beast of a codebase makes 20 million a year. So talk about how the team can improve it, and modernize their skills at the same time.

Because if you walk in saying, "This all sucks, and so do you, let's throw it out", do you really have to wonder why you are hitting resistance?

I fully agree with this, but I think it misses a key step:

As the team’s manager, it’s your job to get buy-in from the executives to gradually fix the mess. You don’t need to tell the team exactly how to fix it, but you gotta get buy-in for space to fix it.

One approach is just to say “every Friday goes to adding tests!” (And then when there’s some reasonable test coverage, make Fridays go to refactorings that are easy with the new tests, and so on.)

But this often fails because when Friday comes, something is on fire and management asks to please quickly squeeze this one thing in first.

The only other approach I know of is to get buy in for shipping every change slightly slower, and making the code touched by that change better. Eg they want to add feature X, ok add a test for adjacent existing functionality Y, then maybe make Y a little better, just so adding X will be easier, then build X, also with tests. Enthusiastically celebrate that not only X got shipped but Y also got made better.

If the team is change averse, it’s because they’re risk averse. Likely with good reason, ask for anecdotes to figure out where it comes from. They need to see that risk can be reduced and that execs can be reasonable.

You need the buy-in, both from the execs and the team. Things will go slightly slower in the beginning and it’s worth it. Only you can sell this. The metaphor of “paying off technical debt” is useful here since the interest is sky high and you want to bring it under control.

Before anything else, getting buy-in for any kind of major change from the execs is key. Explain the situation and the effects. Have everything in writing, complete with date and signatures. Push back hard every time this commitment gets sabotaged because something is supposedly on fire. Get a guaranteed budget for external trainings and workshops, again in writing. Then talk to the team.

If you cannot get those commitments in writing, or later on get ignored multiple times: run. Your energy and sanity is better spent elsewhere. No need to fight an uphill battle alone – and for what? The company just revealed itself for what it is and you have no future there.

First I’d do that, then think about the engineering part.

To be fair if I was an exec at a company and the new IT lead wants me to commit, in writing, to XYZ, I’d not keep them around long. You can’t run a company on that kind of deep mistrust.

Nothing in the OP suggests abusive management. Incompetence, maybe, but I see no reason to assume that they’ll backtrack on agreements, and a new management hire who immediately starts sowing mistrust is not someone I’d trust to get things to a higher level.

Clearly you've been in a very positive bubble. I envy you but that's not an experience shared by many.

As a programmer contractor and a guy who sometimes gets called to save small businesses due to stalled development (happened 6 times in my 20y career) I'm absolutely not even opening my laptop anymore -- before I see a written commitment from execs (email is enough; I tag/label those and make sure I can easily find them in the future).

Reasons are extremely simple and self-defensive in nature: execs can and do backtrack from agreements all the time. At the time we arrive at an oral agreement they have made 20 other invisible assumptions they never told me about, and when one of them turns out to be untrue (example: they thought you can get onboarded in 2 days into a system with 5000+ source files and be productive as a full-blown team member on day #3) they start backtracking faster than you can say "that's not professional".

I don't dispute your positive experience. But please be aware that it's not the norm. Most execs out there treat programmers as slaves with big salaries and nothing more, and we get exactly the treatment you might expect when they have that mindset.

Sorry not sorry but I have to save my own arse first; I've been bound to extremely awful contracts when I've been much younger and stupider and I am not allowing that ever again.

I can single-handedly make a business succeed with technology, and I have done so. I am not staying anywhere where execs hand-wave everything away with "should be simple and quick, right? k thx bye".

Thanks, that’s what I was aiming for. It’s kind of a litmus test what kind of professionalism you can expect in a place – if any. Especially when they have shown prior incompetence, as in OP’s example.

In all honesty, given that example, if I didn’t get immediate buy-in, I’d throw the towel right then. Over 15 years of experience show that train wrecks only ever get fixed when they are recognized as such from the start.

> You can’t run a company on that kind of deep mistrust.

Trust has to be earned in some ways (but you can expect some base-level). But I want to argue on another point: as an exec, you can use this kind of writing to also get commitment from the team, to balance things out. But ofc for that there needs to be a fair discussion of priorities and once you have that, there is usually no reason to contractify the outcome.

>if I was an exec at a company and the new IT lead wants me to commit, in writing, to XYZ, I’d not keep them around long. You can’t run a company on that kind of deep mistrust.

Emails are writing, if you're imagining the IT lead walking in with a paper contract I see why you would say that.

That's essentially what the GP was implying, "Have everything in writing, complete with date and signatures."

Nowhere were contracts mentioned. A proper proposal, for example, always has a date, and signing it if agreed to is just professional conduct. I’d be wary of any exec not willing to do that. Instant red flag.

That’s what a proposal is too, it’s not necessarily a demand

That's fair. I've worked at more established places with formal design doc/RFC and sign off processes and it can work well.

After reading the description of the SOP at this shop, the idea that the OP would be able to introduce an additional layer of process requiring multiple stakeholders and management seemed like a bridge too far in my mind :).

Do leads not write proposals or RFCs? I’m not sure why you wouldn’t keep them around long if they laid out their plans in a clear way, and then pitched it to others

"Have everything in writing, complete with date and signatures"

It is possible that the executives won't take it well to all of the formality here (writing and signatures). How would you convince them that this is necessary?

"Have everything in writing" is a bad mindset and is not going to save you.

Executives are looking to you as the expert to deliver a good outcome. Which means making good decisions, managing expectations and keeping everyone in the loop.

Generally, if it gets to the point of having to dig up who signed off on what, you've already failed. Often you won't even get the chance to dig up those emails, because delivering a bad outcome is enough for execs to write you off without even needing to hear your excuses.

> because delivering a bad outcome is enough for execs to write you off without even needing to hear your excuses.

What makes you think they are excuses? Constantly chasing moving targets and not having even one of them agreed upon in writing is heaven for bad execs. I've seen it happen a good amount of times, my colleagues too.

I don't view the "you changed requirements 20 times the last month and I can't keep up with your impossible imagined schedule" statement as an excuse.

If the goal is to remove bad execs, then a document trail can help, although I'd suggest starting with some statistics like "over the last 3 months, we moved the goalpost 8 times, which led to an effective throughput of 4 weeks of work being done rather than the expected 12 weeks. How do you think we could improve these conditions?" Collaboration first.

Keeping email threads for reference is probably plenty of data, btw; "signatures" sounds like the wrong approach. Maybe even just summarize the direction given in a wiki document with a change log with time stamps and requesting person, which you can review once in a while, and the sheer length of it might be enough to bring the point across.

Thank you -- good advice to put collaboration first. I sometimes have a problem that I assume the worst right away. But I've met some true villains in my life and career so maybe that's why. I'll do my best to implement your advice.

> and the sheer length of it might be enough to bring the point across.

This one sadly hasn't been true -- I tried it but I get blank stares and sometimes grumbling about making people read long stuff that I can just summarize to them. Maybe there's a way out of this conundrum as well.

Your job is to deliver what the execs consider to be a good outcome.

That includes helping the stakeholders come up with a stable set of requirements. Most of the time when teams are dealing with a lot of requirements change, it's because they never captured the true requirements which usually change at a much slower rate.

Secondly, your job is also to manage expectations, so that execs know what the impact of any changes will be when they request them.

Changes aren't an excuse to deliver late or over budget. These parameters are flexible and new targets should have been agreed when the requirements change was requested.

Execs will usually assess your performance without discussion. There is no venue to bring your cache of documents to prove your innocence after the fact.

We all know the ideal theory. I am talking about execs that constantly change requirements, refuse to sign off on any stable requirements, think everything is "quick and easy", and take offense when you try to manage their expectations.

Reasonable people I easily work with. It's the rest who are the problem.

Sounds like you haven't worked in an environment where this happens. You get regarded as 'the expert to deliver a good outcome', sure. But you're ALSO expected to deliver an aggressive roadmap of a whole load of other stuff that people already committed to. Something's got to give.

Dates and signatures are theatrical overkill.

I've yet to work at a place where meeting minutes, sent out to all attendees post-meeting, aren't sufficient for the same purpose (ass covering & continued adherence to The Plan as originally agreed).

I'm sure signature and date places do exist... but, I'd probably be looking for a new job if I worked at one.

The dates and signatures bit is nonsense, but it does help to have things in writing to ensure everyone's on the same page. That just means that when you're discussing things not in writing, you send a written follow up to everyone that's involved immediately afterwards. If it's a meeting, take detailed notes and send them around afterwards. If it's a one on one conversation, just send a follow up email that says something like, "Hi x, I just wanted to memorialize our conversation - here are the main notes that I took. Please let me know if any of this sounds off to you. Thank you."

That doesn't preclude them from not reading that email and later telling you they said something completely different, but at that point you should probably be heading for the door anyway.

Having stuff in writing is essential, for accountability on all sides. The exact format does not matter, neither does what passes as a signature in a company. My example was for broadest possible applicability. The point is the willingness to commit to something in writing and to take the time to reflect on the implications of doing so. If you cannot get that, you’ve already lost. There will be moving targets.

It’s interesting to see how all responses focus on the signature part as problematic due to its supposed formality. Is this an American work culture thing? I see signing off on an agreement as a signal of professional conduct and reliability.

> But this often fails because when Friday comes, something is on fire and management asks to please quickly squeeze this one thing in first.

There's a solution to this problem: nothing goes live on Fridays.

> and making the code touched by that change better.

Getting buy-in from management on this has always seemed weird to me. The alternative is a codebase that can only ever get worse over time. So you either gotta gold-plate everything, which will take way longer than allowing for some after-the-fact improvement as needed, or your codebase turns into a pile of shit and your velocity grinds to a halt very quickly.

> Getting buy-in from management on this always appeared to me as weird. The alternative is a codebase that can only ever get worse over time.

Well that's just the thing: they have no notion of a "bad code base". To them that's an excuse and a negotiation leverage by the programmer to ask for more money. They judge others by themselves I guess.

It just feels like an amateur hour thing.

If my plumber came to me to ask if he can just dry assemble the pipes and leave them that way I'm gonna get a new plumber.

That’s assuming you know something about plumbing. If you don’t, you’ll just nod your head and say ok, that sounds good. The same thing is happening in these businesses. The business owners generally don’t know programming. Terms like “refactoring” mean nothing to them at best and sounds like “rewrite from scratch” at worst.

It's scary out there, man. A lot of people on HN judge by US companies and startups, but I've only been in that bubble once for a few months; the rest of my 20-year career has been everywhere else. And it's insanely bad in many places.

Do not waste time with a company that is going to collapse unless they are willing to do whatever it takes.

They are going to collapse making 20M a year, sure.

Revenue is not profit

Not the team’s manager. OP says so in the first line.

Yeah, there's a process. It's something that I've done a bunch of times for a bunch of clients.

There's so much low-hanging fruit there that's so easy to fix _right now_. No version control? Good news! `git init` is free! PHPCS/PHP-CS-fixer can normalise a lot, and is generally pretty safe (especially when you have git now). Yeah, it's overwhelming, but OP said that the software is already making millions - you don't wanna fuck with that.
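As a sketch of how cheap that first step is (the commit identity below is a placeholder, and the php-cs-fixer commands are left commented out because you'd want to review its --dry-run diff before letting it touch anything):

```shell
# Take a snapshot of exactly what is live before changing a single byte.
snapshot_repo() {
  dir="$1"
  (
    cd "$dir" || exit 1
    git init -q
    git add -A
    # Placeholder identity so the commit works even without global git config.
    git -c user.name="snapshot" -c user.email="snapshot@example.com" \
        commit -q -m "Initial snapshot: exactly what runs in production"
  )
}

# Later, style-only cleanups, one file at a time and reviewed first:
#   php-cs-fixer fix --dry-run --diff index.php
#   php-cs-fixer fix index.php
```

From that first commit onward, every experiment is reversible, which is the whole point.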

I've done it, I've written about it, I've given conference talks about it. The real bonus for OP is that the team is small, so there are only a few people to win over. It's pretty easy to show how things will be better, but remember that the team are going to resist deleting code not because they're unaware that it's bad, but because they are afraid to jeopardise whatever stability they've found.

Personally, I would never run a linter of any kind on a full codebase that doesn't have tests. After having been bitten by all kinds of bugs over the years, I wouldn't suggest auto-linting any file that you aren't actively working on.

It's rare that linting will actually make the code work better. Granted, it could catch some security bugs. But they can - and will - introduce new bugs. You just have to ask if it's worth the risk.

This. It's so tempting when a linter warns "This code is misleading; it would be clearer to do it this other way" to think "Easy fix: change it the way the linter suggests." But, make the change, and you may discover (hopefully before delivery) that the code functionality depends on the confusing behavior.

And also, starting by fixing the JS/CSS/HTML front end is likely the safest, as it won't corrupt any customer data and it will be visible when something breaks. That is probably the next best candidate for a major overhaul. I'd also hope that a $20M/year project can afford to hire someone senior in addition to these 3 juniors?
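Before touching the front end it helps to know how many jQuery copies you're actually fighting. A rough grep-based inventory -- the jquery-&lt;version&gt;.js naming pattern is an assumption about how the bundles are named in this codebase:

```shell
# Count how many distinct jQuery bundles the templates reference.
# The jquery-<version>.js naming is an assumption; adjust to the codebase.
list_jquery_versions() {
  root="$1"
  grep -rhoE --include='*.php' --include='*.html' \
       'jquery[-.0-9]*(\.min)?\.js' "$root" | sort | uniq -c | sort -rn
}
```

The counts tell you which version is the de facto standard and which pages are the stragglers to migrate first.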

> hope that a $20M/year project can afford to hire someone senior

Never underestimate the ability of management to look a gift horse in the mouth while shooting it in the foot.

why would someone senior even want to join this team? Especially someone senior enough to fix this. The productivity is horrible and there's no kudos for fixing something that's lived for 12 years like this.

I was added to a team because, to quote the VP, "they're good but they need some adult supervision"

Mixing skill levels in a team is healthy.

> why would someone senior even want to join this team?

Well, money. Why would someone even want to join any team?

Theoretically a company that's making $20m/year on this can afford to make it worth someone's while to come in and fix it. The problem isn't finding someone who will do it, it's that the company assumes they can continue to get by indefinitely on paying too little.

For the love of refactoring.

git init seems like job #1 because at least then you can delete every commented out line and start a little cleaner.

To lose all the comments? :D That would make it even harder to read.

In a project without version control (or one that doesn't trust it enough) there are always whole sub-programs made up of dead code. It's usually some combination of commented-out blocks and functions that are only called from within those commented-out blocks. Removing commented code (not real, descriptive comments) is the first step to eliminating all this dead code, and eliminating dead code buys a ton more flexibility in what you can change safely.
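One rough way to find those candidates: list PHP files whose filename never appears in any other PHP file. This is only a heuristic -- dynamic includes and the nginx rewrites OP mentions can still reach such files, so treat hits as things to investigate, not to delete:

```shell
# Heuristic: list PHP files whose filename never appears in any other PHP
# file, i.e. nothing seems to require/include them. Dynamic includes and
# nginx rewrites can still reach such files, so verify before deleting.
list_unreferenced_php() {
  root="$1"
  find "$root" -name '*.php' | while read -r f; do
    base=$(basename "$f")
    grep -rqF --include='*.php' --exclude="$base" "$base" "$root" \
      || echo "unreferenced: $f"
  done
}
```

Entry points that only nginx knows about will show up here too, which is exactly why the output needs a human review pass.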

Fully agreed. I was tasked with using an old library and my first order of business was an analysis of dead code branches. The git commit removed 17 out of 80 files and about 10-11% of the code in some other files (that were not deleted), and the library works 100% the same -- confirmed by tests that I painstakingly added over the last few weeks.

Less code, less confusion.

> It doesn't work.

That's simply not true. I've inherited something just as bad as this. We did a full rewrite and it was quite successful and the company went on to triple the revenue.

> get some testing in place

Writing tests for something that is already not functional, will be a waste of time. How do you fix the things that the test prove are broken? It is better to spend the time figuring out what all the features are, document them and then rewrite, with tests.

The problem with people new to the company starting a rewrite from scratch is that they often are poorly informed on why things were the way they were before. If you start big, you can have bad outcomes where the new system might be objectively worse than the old one... but you are stuck trying to get the new thing out for the next 5 years because too many people sunk too much political capital into it.

As an example, I worked at an ad-tech startup that swapped its tech team out when it had ~100 million in revenue (via acqui-hire shenanigans). The new tech team immediately committed to rewriting the code base into ruby micro-services and were struck by strange old tech decisions like "why does our tracking pixel return a purple image?". The team went so far as to stop anyone from committing to the main service for several years in a vain attempt to speed up the rewrite/architecture migration.

These refactors inevitably failed to produce a meaningful impact to revenue, as a matter of fact the company's revenue had begun to decline. The company eventually did another house cleaning on the tech team and had some minor future successes - but this whole adventure effectively cost their entire Series D round along with 3 years of product development.

You're making a silent assumption that the original team is well informed about why the things are like they are and that they know what they are doing. I think it is not always the case.

I was once on a project where the mess in the original system was the result of the original team not knowing what they were doing and just doing permutation-based programming - applying random changes until it kinda worked. The situation was very similar to that described by the OP. They even chose J2EE just because the CTO heard other companies were using it, despite not having a single engineer who knew J2EE. Overall, after a year of development the original system barely even worked (it required manual intervention a few times per day to keep running!), and even an imperfect rewrite done by a student was already better after 2 weeks of coding.

So I believe the level you're starting the rewrite from is quite an important factor.

Then of course there is a whole world of difference between "they don't know what they are doing" and "I don't like their tech stack and want to master <insert a shiny new toy here>". The former can be recognized objectively by:

- very high amount of broken functionality

- abysmal pace at which new features are added

The original team may not have been the best at the task, but they still managed to deliver 100 MM in revenue. Sometimes the things they leave behind/ignore simply don’t matter to the business/useful tech.

Particular to ad tech, the lifespan of any particular piece of software is lower than you’d expect (unless you’re Google/Facebook). Technology that pays out big one year will become pretty meh within 3 years. In the case above I’d argue that the new tech team didn’t really understand this dynamic, and so they focused on the wrong things, such as rewriting functionality that didn’t matter for the future, or making big bets on aspects of the product which were irrelevant.

To the OP: we don’t know that the lifespan of any of these PHP files is greater than an individual contract. If the business can be modeled as “solve a contract by putting a PHP file on prod”, rewriting may be entirely worthless, as the code can be “write once, read never”.

Revenue is a crazy KPI for technical excellence. You should never let high revenue rely on extremely bad code.

We (my good friend and I, who both have 20+ years of experience) were brought in specifically to do the rewrite. We were new to the company. We actually had to rebuild the entire IT department while we were at it as well.

> new tech team immediately committed to rewriting the code base into ruby micro-services

well... sigh.

> These refactors inevitably failed to produce a meaningful impact to revenue

It sounds like less about the refactor itself and more about the skills of the team doing the refactor. You certainly can't expect a refactor to go well if the team makes poor decisions to begin with.

> We were brought in specifically to do the rewrite.

That's the key difference. The stakeholders should always be in on the rewrite.

> It sounds like less about the refactor itself and more about the skills of the team doing the refactor. You certainly can't expect a refactor to go well if the team makes poor decisions to begin with.

This has been my biggest struggle with rewrites where I’m currently working. We have several large, messy old codebases that everyone agrees “needs a rewrite” to (1) correct for all the early assumptions in business needs that turned out wrong, (2) deal with old PHP code that is very prone to breakage with every major new PHP version released, and (3) add much needed architectural patterns specific to our needs.

I’ve seen rewrites of portions of the project work when they involve myself and one other mid-level dev who has a grasp on solid sw engineering practices, but when the rest of the (more senior) team get involved on the bigger “full rewrite”, they end up quickly making all the same mistakes that led to the previous project being the mess that it is.

Sure, it will be using fancy new PHP 8 features, and our Laravel framework will force some level of dependency injection, but then you start seeing giant God classes being injected over here, and duplicated code copy-pasted over there, all done by “senior” devs you feel you can’t question too strongly.

To that end, an open and collaborative culture in which you start the rewrite with some agreed upon principles, group code reviews and egos kept in check, are all necessary for this to work.

You have a great experience and did a great job indeed. My only question is how does one get 20 years of such experience without horrific flashbacks of “let’s just rewrite it” decisions. Do you do rewrites/redesigns often? What’s your success rate?

I've done what I would consider as four rewrites that I can remember as large events in my life (although not fully what you'd expect). But all are good stories in my opinion.

First one was the above example. It was for the largest hardcore porn company on the planet. Myself and my good friend Jeff rebuilt an already successful business IT department from the ground up and made it even more successful. Ever heard of 'the armory in sf'?

Second was that Jeff and I were hired as contractors by Rob @ Pivotal Labs (CEO) to help the CloudFoundry team rewrite itself after he had bought the team and trimmed it down to only the good people. That one was a huge mess. We spent a lot of time deep in shitty ruby code using print statements trying to figure out what 'type' of an object something was and, of course, backfilling tests. It was a fun project and both Jeff and I learned the Pivotal way, which was probably the most enlightening thing I had ever learned about how to develop software correctly from a PM perspective. If you want to improve your skills beyond just slinging code, spend some time figuring their methodology out. Much of it is documented in Pivotal Tracker help documentation and blog posts.

Third one was not really a rewrite, but the original two founders, who were not technical, had tried to hire a guy and got burned because the guy couldn't finish the job. Sadly, they had already paid the person a ton of money and got really nothing functional out of it. We (Jeff and I again!) just started over. We did an MVP in 3 months (to the exact date, because we both know how to write stories using Pivotal Tracker and do proper estimates) and ended up doing $80m in revenue in our first year with an initial team of 5 people.

Fourth one was three guys (who were also not technical) I kind of randomly met after I moved to Vietnam. They were deploying litecoin asic miners into a giant Vietnamese military telco (technically, they are all military). They had hired another guy to do the IT work and he was messing it all up. They invited me out to help install machines, I came out, rebuilt their networking layout and then proceeded to eventually fix their machines because the software 'firmware' that was on them was horrible. I also added monitoring with prometheus so that I could 'see' issues with all these machines. That first day on the job, they fired the other guy and made me CTO. We ended up deploying in another datacenter as well. It was a really wild experience with a ton more stories.

Life has been, um, interesting. Thanks for reading this far.

Please tell me that you've retired now due to your incredible billing rates and track record of success.

Not everything has been a success. For example, unless you're stupid rich and can afford years of losses, never start/own a night club or you might end up working for the rest of your life to pay off your debts.

The problem is that most developers are crap and self centered on working with the tech they like.

You need to work with someone who doesn't care about filling up their CV with "ruby microservices" and get stuff done.

If I went into a business to do a rewrite and decided to use $shinyNewTech because I want to build up rust experience I'd probably end up wasting years with little results.

The existing app was a large rails monolith. This wasn’t a small 10 person team but a 50 person org. Groups can get funny ideas sometimes.

> why does our tracking pixel return a purple image?

Now I'm really curious, is there some exciting non-obvious reason for a tracking pixel to be purple? Was it #FF00FF or more like #6600DD?

This definitely needs an answer.

In fact, until OP can give us the right answer, we immediately need even wrong answers!

You reading this. Yes, you. Give your best wrong answer below.

My best wrong answer is that there were different colored pixels for different front-end versions, and the app had some radically different responses depending on the version. Maybe MENA would return white, SE Asia green, people who signed up during a sale would return blue, whatever. After a while, the other pixels were removed and only one shade of purple were used for everyone, but the code for processing them was not removed. So now, if the tracking pixel is not a precise shade of purple, some unexpected shenanigans ensue.

I worked on an app once that used two different, equally ancient libraries to a) generate thumbnails and b) create a png from a pdf. While modifying part of this process I started realizing that there were conditions where you'd get a PDF thumbnail at the end, but its output had a red tint to it.

Input looked fine and invoking each step manually worked fine as well.

Come to find out that certain PDFs contained color calibration information that, combined with how we were calling it, would treat ARGB as RGB. The input would have transparency info defined and the thumbnail generator would happily repurpose the alpha channel as the red channel instead.
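To illustrate the failure mode: a packed 32-bit ARGB pixel interpreted with the wrong channel order puts the alpha byte where the red byte should be, so fully opaque pixels (alpha 0xFF) come out with maximum red. A toy sketch (the packed-integer layouts are the standard conventions, not the actual library's API):

```javascript
// Correct reading of a packed 0xAARRGGBB pixel.
function argbToChannels(pixel) {
  return {
    a: (pixel >>> 24) & 0xff,
    r: (pixel >>> 16) & 0xff,
    g: (pixel >>> 8) & 0xff,
    b: pixel & 0xff,
  };
}

// The bug: same 32 bits read with an RGBA layout, so the alpha byte
// lands in the red channel and opaque images pick up a red tint.
function misreadAsRgba(pixel) {
  return {
    r: (pixel >>> 24) & 0xff,
    g: (pixel >>> 16) & 0xff,
    b: (pixel >>> 8) & 0xff,
    a: pixel & 0xff,
  };
}
```

For an opaque pixel like 0xFF102030, the correct read gives a modest red of 0x10, while the misread reports red as 0xFF.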

The tracking pixel was made by scaling the company logo down to a 1x1 image.

That's brilliant! That way nobody could accuse you of spying. "It's just our logo. What's all the fuss about?"

Obviously it's because mauve has the most RAM.

Page background where pixel displayed was purple

Accessibility. Protanopia affects cones perceiving red color.

Obviously !! The anti-doppler shift trick /s :)

I did a rewrite of a 30 year old bit of perl/php2 over the last year. Not knowing why things were the way they were was really useful for the younger team members and me to get familiar with the codebase and the business context.

Anecdotal: I asked people why they keep incorrectly using jQuery methods and producing ambiguous, difficult-to-maintain code in the year 2022. (We still have jQuery as a dependency for legacy code.) The response was that they were not aware that native counterparts like document.querySelectorAll exist in the browser. They just copied the old jQuery code, modified it, and it worked.

I am pretty sure this kind of thing exists in any large legacy codebase.
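For reference, most everyday jQuery calls have direct native equivalents in any 2022 browser. A sketch of thin migration helpers (names like `$all` are made up here, not a real library):

```javascript
// $(selector) -> querySelectorAll; returned as a real Array so
// .map/.filter/.forEach work as expected.
function $all(selector, root = document) {
  return Array.from(root.querySelectorAll(selector));
}

// $el.addClass('x') / $el.removeClass('x') -> classList
function addClass(el, name) { el.classList.add(name); }
function removeClass(el, name) { el.classList.remove(name); }

// $el.hide() / $el.show() -> the hidden property
function hide(el) { el.hidden = true; }
function show(el) { el.hidden = false; }

// $.getJSON(url) -> fetch (returns a Promise in modern browsers)
async function getJSON(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}
```

Helpers like these let old call sites keep a familiar shape while the implementation becomes plain DOM, which eases an incremental migration off the duelling jQuery versions.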

You don't need comprehensive tests for tests to start delivering value.

Figure out the single most important flow in the application - user registration and checkout in an e-commerce app, for example.

Write an automated end-to-end test for that. You could go with full browser automation using something like Playwright, or you could use code that exercises HTTP endpoints without browser automation. Either is fine.
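As a sketch of the non-browser variant: a script that walks the critical pages over plain HTTP and reports anything that breaks. All paths here are hypothetical placeholders, and the request function is injected so the same checks can run against production, a staging copy, or a stub:

```javascript
// Smoke test for the revenue-critical flow. `fetchFn` is any function
// taking a URL and resolving to an object with a `status` field
// (e.g. the global fetch in Node 18+ or a stub in unit tests).
async function smokeTestCheckout(baseUrl, fetchFn) {
  const failures = [];

  async function step(name, path, expectedStatus = 200) {
    const res = await fetchFn(baseUrl + path);
    if (res.status !== expectedStatus) {
      failures.push(`${name}: expected ${expectedStatus}, got ${res.status}`);
    }
  }

  // The flow that makes the money, in order.
  await step("home page", "/");
  await step("product page", "/product/example-sku");
  await step("cart", "/cart");
  await step("checkout", "/checkout");

  return failures; // empty means the money-making flow is alive
}
```

Run it on a schedule (or in CI) and alert on a non-empty result; that alone gives you the early-warning system.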

Get those running in GitHub Actions (after setting up the git scraping trick I described here: https://news.ycombinator.com/item?id=32884305 )

The value provided here is immense. You now have an early warning system for if someone breaks the flow that makes the money!

You also now have the beginnings of a larger test suite. Adding tests to an existing test suite is massively easier than starting a new test suite from scratch.

You're assuming the existing flow is working perfectly and I agree with you that testing is a godsend. I constantly yell that testing is great. Heck, I even worked for Pivotal Labs that does TDD and pair development, and loved it.

Let's say you start to write tests and start to see issues crop up. Now what? How do you fix those things?

Github actions!? They don't even have source control to begin with. There are so many steps necessary to just get to that point, why bother?

If the existing code base already has extremely slow movement and people are unwilling to touch anything for fear of breaking it... you're never going to get past that. Let's say you do even fix that one thing... how do you know it isn't breaking something else?

It is a rat's nest of compounding issues and all you are doing is putting a band-aid on a gushing open wound. Time to bring in a couple of talented developers and start over. Define an MVP that does what they've learned their customers actually need from their 'v1' and go from there. Focus on adding features (with tests) instead of trying to repair a car that doesn't pass the smog test.

> Let's say you start to write tests and start to see issues crop up. Now what? How do you fix those things?

I assumed the tests wouldn't be for correctness, but for compatibility. If issues crop up, you reproduce the issues exactly in the rewrite until you can prove no one depends on them (Chesterton's fence and all).

The backwards-compatibility-at-all-costs approach makes sense if the product has downstream integrations that depend on the current interface. If your product is self-contained, then you're free to take the clean slate approach.

> I assumed the tests wouldn't be for correctness, but for compatibility.

You're assuming that the people coming in to write these tests can even make that distinction. How do you even know what the compatibility should be without really diving deep into the code itself? Given how screwed up the codebase already is, it could be multiple layers of things working against each other. OP mentioned multiple versions of jQuery on the same page as an example.

Writing tests for something like that is really a waste of time. Better to just figure out what's correct and rewrite correct code. Then write tests for that correct code... that's what moves things forward.

> How do you even know what the compatibility should be without really diving deep into the code itself?

You can pretty much black-box the code and only deep dive when there are differences. Here's what I've done in the past for a rewrite of an over-the-network service:

1. Grab happy-path results from prod (wireshark pcap, HTTP Archive, etc), write end-to-end tests based on these to enable development-time tests that would catch the most blatant of regressions.

2. Add a lot of logging to the old system and in corresponding places in the new system. If you have sufficient volumes, you can compare statistical anomalies between the two systems.

3. Get production traffic from a port mirror, compare the responses of your rewritten service against the old service one route at a time. Log any discrepancies and fix them before going live; this is how you catch hard-to-test compat issues.

4. Optionally perform a phased rollout, with the option to roll back.

5. Monitor the rollout for an acceptable period; if successful, delete the old code/route and move on to the next one.

The above makes sense when backwards compatibility is absolutely necessary. However, the upside is that once you've set up the tooling and the processes, subsequent changes are faster.
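Step 3 in the list above hinges on a response comparator. A minimal sketch (the ignored-header list is an assumption to adapt — timestamps and request ids will always differ between the two systems):

```javascript
// Headers expected to differ legitimately between old and new services.
const IGNORED_HEADERS = new Set(["date", "x-request-id", "set-cookie"]);

// Compare one mirrored request's responses from the old and new systems.
// Returns a list of human-readable discrepancies; empty means compatible.
function diffResponses(oldRes, newRes) {
  const diffs = [];
  if (oldRes.status !== newRes.status) {
    diffs.push(`status: ${oldRes.status} -> ${newRes.status}`);
  }
  const oldHeaders = oldRes.headers || {};
  const newHeaders = newRes.headers || {};
  for (const [name, value] of Object.entries(oldHeaders)) {
    if (IGNORED_HEADERS.has(name.toLowerCase())) continue;
    if (newHeaders[name] !== value) {
      diffs.push(`header ${name}: "${value}" -> "${newHeaders[name]}"`);
    }
  }
  if (oldRes.body !== newRes.body) {
    diffs.push("body mismatch");
  }
  return diffs; // log non-empty results; fix them before going live
}
```

Feed it pairs of responses from the port mirror and aggregate the diffs per route; routes that stay clean for a while are candidates for cutover.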

All of that, while technically correct and possible, is vastly more complicated and time-intensive than a rewrite, given the OP's description of the codebase.

Yes, it absolutely is - but the trade-off is a far lower risk of introducing breaking changes. Depending on the industry/market/clients, it may be the right trade-off.

In my eyes, a rewrite won't be introducing breaking changes. It would be to figure out what functionality makes money, then replicate that functionality as best as possible so that the company can continue to make money as well as build upon the product to make even more money.

We're talking about a webapp here, not rocket science.

The biggest problem isn't even the codebase in this situation.

When you keep finding bugs like that while refactoring and making things better, it will demoralise you. The productivity will stop when that happens.

It also requires above-average engineers to fix the mess and own it, with not much benefit in it for them.

Your refactoring broke things? Now it's your turn to fix it while also shipping the deliverables you were originally hired for, and to get paged for things that weren't your problem.

If I were a manager assigning this kind of refactoring work, I would attach a significant bonus; otherwise I know my engineers will start thinking of switching to other places unless we pay big-tech salaries.

People keep quoting Joel's post about why refactoring is better than rewrite but if your refactor is essentially a rewrite and your team is small or inexperienced - it's not clear which is better.

Parallel construction and slowly replacing things is a lot of unpaid work. The sheer complexity of doing it bit by bit for each piece is untenable for a 3-person team where the other two most likely won't want to get into it.

> It also requires above-average engineers to fix the mess and own it, with not much benefit in it for them.

That's not true; it doesn't require above-average engineers. It requires a tech lead with the desire and backing to make a change, and engineers willing to listen and change. It doesn't take a 10x engineer to start using version control, or to tell their team to start using version control, for example.

Source control seems like a straightforward first step, regardless of what approach is going to be taken going forward

One would think, but how do you go from source control to deployment on the production server though? If they were editing files on the server directly, there could be a whole mess of symlinks and whatever else on there. Even worse, how do you even test things to see if you break anything?

It is a can of worms.

Just start somewhere. These guys are making changes, actual functional changes and bug fixes, in that environment, meaning they already have all the problems you imagine are going to get in the way of fixing this mess. So stop fretting and just start small with one tiny thing. It doesn't really matter with what. You don't even need automated tests necessarily. It's a small, simple flow that takes 10 minutes to test by hand? Write the steps down and run them manually, I don't care. Just do it.

Been there, done that. Slightly different: they had a test server and a prod server, so already better, except one day I made a change and copied it to prod. Yes, it was manual, just scp the files over to prod. And stuff broke. Turned out someone had fixed a bug directly in prod but never made the change on the test server.

First thing I did was introduce version control and create a script to make deployment automatic, so deploying was just a version-control update on prod (also scripting languages here). Magically, we never had an issue with bugs reappearing after that.

Pretty simple change and you can go from there.

The above code base was over 20 years old and made use of various different scripting languages and technologies including some of the business logic being in stored procedures. Zero test coverage anywhere. You just 'hide' small incremental changes to make things better in everything you do. Gotta touch this part because they want a change? Well it could break anyhow so make it better and if it breaks, it breaks and you fix it. It needs judgment though. Don't rewrite an entire module when the ask was adding a field somewhere. Make it proportional to the change you need to make and sometimes it's not going to be worth it to make something better. Just leave it.

Not sure the little hammer will fix much. And making folks use a method in new code pisses them off. "You say it's important I do this your way this time, even though there are 1000 examples of doing it the other way. I feel persecuted and your way is pointless, because it doesn't fix everything anyway. And it's slowing me down and making me look bad."

Not rational but folks don't have to explain their feelings. You will be hated.

The little hammer definitely fixes things. It does it the same way water cut the Grand Canyon. The beauty is that it works over time.

Now as for how to get the other devs on board, I agree with you that you can't just barge in and tell them everything they are doing is wrong etc. I never said to do that and I'm replying to a specific comment in the thread not the original Ask HN.

I.e. when I write about what I've done in the past, I got buyin from my boss and my colleagues on what I was going to do. But I didn't just sit there and kept doing what they had done over the past years. I changed lots of other little things too in the same manner.

So if we do want to talk about the original Ask HN and how to get the existing employees not to hate you, you can start by letting them tell you what they think the problems are. What are their pain points? They might just not know what to do about them but actually see them as problems too. Maybe they've tried things already but failed or got shot down by others in the company. Maybe they did try to introduce version control but their non-tech boss shot them down.

Of course it may not work out. Some people really are just stupid and won't listen even if you try to help them and make them part of the solution.

Startups have limited runway and can die when big-company processes are forced upon them. It can sink them.

I'm not sure where you're pulling that from. There's no mention of startup here. Neither in the original (actually the opposite I'd say, 12 years and just a business unit).

None of what I said is a big-company process in any way. If in your book using source control is a big-company process that will sink a startup, then be my guest, and I will just hope we never have to work together. Source control is a no-brainer that I even use just for myself; I've used it in teams of two and in teams of dozens to hundreds. The amount of process around it is what scales with the kind of company. Source control is useful by itself at every single company size.

Source control is necessary and simple, yes.

Code review, coding standards, required tests for everything, multiple stages of deployment - are not simple and can stall development. Done wrong they can sink a company.

It's easy to read the worst possible construction on what other people write here. It's never a good idea.

Btw I worked at a startup for 8 years. It was still a startup, depending on new investment to meet the monthly. In any case the described dev group was behaving in a way that used to be typical of startups. And even business units in larger organizations have runway.

Yeah, a lot of worms... and if things break while refactoring, you are on the hook for scanning through that complex monster at 3 am, finding the issue, and fixing it for no additional pay in most cases.

They can literally copy the whole directory from their local machine to production as a first step for all I care.

How do they test things on production? If there’s a bug how do they revert to the previous version? There are way more issues without source control than with.

Doesn't Git support symlinks? Empty directories could be trouble though. One would have to put a .GITKEEP into every directory before checkin, and a step at deployment time to remove them again.

"Github actions!? They don't even have source control to begin with."

Right: no point in adding any tests until you've got source control in place. Hence my suggestion for a shortcut to doing that here: https://news.ycombinator.com/item?id=32884305

How do 2 junior devs manage to rewrite the entire product while also meeting the ongoing goals of the business?

You're trying to spec features on a moving target.

Even if they were able to do 50% time on the rewrite you'll never actually get to feature parity.

The only viable plan, unless the company has an appetite to triple the dev headcount, is to set an expectation that features will have an increased dev time, then as you spec new features you also spec out how far you will go into the codebase refactoring what the new features touch.

But it is functional. Grandparent post is suggesting that all the currently used functionality should have tests written for it. It makes sense, as that way they can gather the requirements of a rewrite at the same time.

We don't know that it is functional... maybe the company is only making $20m and should be making $60m. Like I said, we tripled the revenue with a rewrite.

What we did was make the case that we could increase revenue by being able to add valuable features more easily/quickly. We started with a super MVP rewrite that kept the basic valuable features, launched, then spent the rest of our time adding features (with tests). Hugely successful.

The key, of course, will be to get 1-2 top notch developers in place to set things up correctly from the beginning. You're never going to be effective with a few jr's who don't have that level of experience.

> We don't know that it is functional... maybe the company is only making $20m and should be making $60m. Like I said, we tripled the revenue with a rewrite.

It's $20m functional. It's possible it could be better but unless this is the kind of huge org where 20m is nothing (doesn't sound like it) you really need the behaviors documented before you start screwing with it. It's very likely this thing has some pretty complex business logic that is absolutely critical to maintain.

> you really need the behaviors documented before you start screwing with it. It's very likely this thing has some pretty complex business logic that is absolutely critical to maintain.

Nothing I said suggested otherwise. Absolutely critical for whomever is doing a rewrite to understand everything they can about the application and the business, before writing a single line of code.

You sound frustrated that you've joined a company with an absolute stinker of a codebase, because you're confident you could deliver much better results having refactored it first. You're managing a group of people probably enormously under-productive because of the weight of the technical debt they're under. Every change takes months. It's riddled with hard-to-fix bugs. It's insecure. There are serious bus factor problems.

Many of us have been in this exact position before, multiple times. Many of us have seen somebody say "our only choice is a full rewrite" - some of us were the one making that decision. Many of us have seen that decision go disastrously wrong.

For me, the problem was my inability to do what I'm good at: write tests, write implementations that pass that test, etc. Every time I suggested doing something, somebody would have a reason why that would fail because of some unclear piece of the code. So rather than continuously getting blocked, I tried to step into my comfort zone of writing greenfield code. I built a working application that was a much nicer codebase, but it didn't match the original "spec" from customer expectations, so I spent months trying to adjust to that. I basically gave up managing the team because I was so busy writing the code. In the end, I left and the company threw away the rewritten code. They're still in business using the shitty old codebase, with the same development team working on it.

If you really want to do the rewrite, accept how massively risky and stressful it will be. The existing team will spend the whole time trying to prove you were wrong and they were right, so you need to get them to buy into that decision. You need to upskill them in order to apply the patterns you want. And you need to tease apart those bits of the codebase which are genuinely awful from those that for you are merely unfamiliar.

Personally, I would suggest a course for you like https://www.jbrains.ca/training/course/surviving-legacy-code, which gives you a wider range of patterns to apply to this problem.

Maybe this was meant as a reply to the main post?

“I won the lottery, you can too. If you don’t buy a ticket, you’re never gonna win right…?”

There is a lot of evidence rewrites are hard to do well, and especially prone to failure.

…you might pull it off, it’s not impossible, sure. …but are you seriously saying it’s the approach everyone should take because it worked for you once?

Here's my $0.02 of meaningless anecdotal evidence: I've done a rewrite twice; it was a disaster once and went fine the second time. A 50% strike rate, for me, personally, on a team of 8.

What’s your rate? How big was your team, how big was the project? What was the budget? Did you do it on time and on budget? It’s pretty easy to say, oh yeah, I rewrote some piece of crap that was a few hundred lines in my spare time.

…but the OP is dealing with a poorly documented system that’s very big, and very important and basically working fine. You’re dishing out bad advice here, because you happened to get lucky once.

Poor form.

Good advice: play it safe, use boring technology and migrate things piece by piece.

Big, high risk high reward plays are for things you do when a) things are on fire, or b) the cost of failure is very low, or c) you’re prepared to walk away and get a new job when they don’t work out.

> How do you fix the things that the test prove are broken?

Uhm. The tests don't do any such thing.

> It is better to spend the time figuring out what all the features are, document them

Yes. And the tests you should write are executable documentation showing how things are. It is like taking a plaster cast of a fossil. You don't go "I think this is how a brachiosaurus fibula should look" and then try to force the bones into that shape. You mould the plaster cast (your tests) to the shape of the fossil (the code running in production). Then if during excavation (the rewrite) something changes or gets jostled you will know immediately that it happened, because the cast (the tests) no longer fits.

> We did a full rewrite and it was quite successful and the company went on to triple the revenue.

Which sure beats some other company coming along and "rewriting" the same or similar functionality in a competing product and killing your own revenue. But it does come down to how big the codebase is and how long it would take for an MVP to be a realistic replacement. If there are parts that are complex but unlikely to need changing soon you can usually find ways to hide them behind some extra layer. Is there any reason you couldn't just introduce proper processes (source control, PRs, CI/CD etc.) around the existing code though?

Kudos to you for successfully delivering in a similar situation. That said, I think your advice is a bit cavalier. The industry is littered with the carcasses of failed rewrites. The fact that you have done it in one context does not mean that this team can pull it off in another.

I'll also say there's a lot of semantics at play here. What is a "rewrite", what is a "test" vs a "document", what is "functional"? I read your main point being that one should avoid sunk-cost fallacy and find the right places to cut bait and write off unsalvageable pieces. The art of major tech debt cleanup is how big of pieces can you bite off without overwhelming the team or breaking the product.

> Writing tests for something that is already not functional, will be a waste of time.

This is not TDD; it's writing tests to confirm the features that work now. Then, when you make changes, you can get an early warning if something starts going south.

Of course a full rewrite can be successful. This is the problem when people base their entire critical thinking on blog posts. They then go on to preach it everywhere as well!

The blog posts are warnings about what not to do. People naturally want to rebuild when they don't fully understand something or can't grasp its complexity, because writing something is also how we come to understand what we are building. But it's a trap: what you've rewritten will never be the same as before, and therein lie the footguns.

The blogs are plainly stating, "even though you feel you should rewrite, you probably shouldn't."

Or some of us have experienced failed rewrites. It can be a potentially expensive mistake.

> get some testing in place

What is really needed (and almost definitely doesn’t exist) is some kind of spec for the software.

Exactly. If they write tests, they will just be doing TDD, where the specification becomes a problem in itself.

It is a 12 year old legacy product. What specification exists other than, "Yesterday it did X when I clicked the button, but now it does not do that anymore."

This is the point: I don't do TDD, but I am a big fan of tests. In this case the incorrect spec can be flagged, but all the other incorrect specs will also be there. If your fix doesn't break a spec, great; but if it does, you can check whether that spec was correct. It's a back and forth between code and business requirements.

You must have missed the part where it makes 20M revenue per year.

I gotta love hacker news, people who think the fact a backend is written in horrid PHP means it is "already not functional" while they spend their days learning something like Haskell that makes them negative revenue per year.

Who knows if that 20m revenue should be 60m? They could be held back greatly by the fact that the developers are not motivated to change anything.

I also don't know Haskell and have no desire to learn it. I prefer to build products in static compiled languages where I can more easily hire developers.


It's also a juggling job from hell so keep a cool head and seek support and resources for what needs to be done.

A big first step is to duplicate and isolate, as much as possible, a "working copy" of the production working code.

You now need to maintain the production version, as requests go on, while also carving out feasible chunks to "replace" with better modules.

Obviously you work against the copy, test, test again, and then slide in a replacement to the live production monolith .. with bated breath and a "in case of fubar" plan in the wings.

If it's any consolation, and no, no it isn't, this scenario is surprisingly common in thriving businesses.

This approach is a trap.

Management needs to know that this needs a rewrite and a more capable team, and that pursuing an aggressive roadmap while things are this bad is impossible.

If they say no, and you try to muddle your way through it anyway, you are setting yourself up to fail.

If they say yes, ask for the extra resources necessary to incrementally rewrite. I would bring in new resources to do this with modern approaches and leave the existing team to support the shrinking legacy codebase.

Why would the existing team stick around knowing their jobs would be slowly rewritten into oblivion by others?

Where else are they going to go if they prefer this mess?

Why would they need to be replaced if they’re ultimately convinced to enter the 21st century?

Your suggestion sounds like the strangler fig pattern. While a valuable strategy in some cases, it does present the risk of duplicating poor architecture choices into the new code.

I would normally opt for your suggested approach too. However, based on the description given, I’d most likely recommend a complete rewrite in this case. The architecture appears to be quite poor and the risk of infecting new code with previous bad decision-making may be too great.

Yeah, I agree, a full rewrite from scratch is almost never the right approach. It starts a tunnel where you cannot ship anything useful to production for months, you have no idea when you can finally ship the whole thing, and when you do, it will be very risky.

Do things progressively. Read the code, figure out the dependencies, find the leaves and start by refactoring those. Add tests before changing anything, to make sure you'll know if you change any existing behavior.

Figuring out such a codebase as a whole might be overwhelming, but remember that it probably looks much more complicated than it actually is.

In a team with only two people working on the monster it seems reasonable that they’d be able to manage two development streams at the same time.

This is the correct answer.

For an additional perspective see this classic: https://dhemery.com/articles/resistance_as_a_resource/

All good points, but…

This is a clear case where he needs to look for another job IMMEDIATELY.

Here’s why…

1. The problems listed are too technical and almost impossible to communicate to a non-technical audience meaning business or c-suite.

2. The fixes will not result (any time soon) in a change that’s meaningful to business like increased revenue or speed to market. Business will not reward you if you are successful or provide resources to get the job done unless the value is apparent to them (See #1).

Employment is a game. Winning that game means knowing when to get out and when to stay.

It’s time to plan your exit both for your own sanity and the good of your family.

+1. Also, start by adding git and setting up a test env.

A new person who complains about existing code and proposes "rewrite everything" in week one will not be met with respect.

+1. came here to say this! it's in prod, making money; bring up the discussion of full rewrite with the management at your own peril. learn to tame the beast by pruning one dead/redundant function at a time, that's the best you can do, both for the project and for yourself!

My first instinct was "get some testing in place" too. That served me well in recent projects where I was in a similar situation. I was wondering if anyone has any advice on how to make sure your tests are... comprehensive? I was fortunate enough to have full flow tests in place from the beginning and a great team which knew the intricacies of the subject matter. We made lists of usecases and then tried to find orthogonal test cases. But that was my naive approach wondering if there are better methods out there. Especially if there is zero testing.

One more thing I’d add; for the love of all that is holy make sure the tests run lightning quick.

What you want to do first is reduce the cost and risk of making changes to as close to zero as possible.

Then, come up with a broad system design that defines higher levels of abstraction. Your goal is not to redesign the system from scratch but to specify the existing hierarchies which are currently implicit in the code. Are there different modules that naturally emerge? Ok, what are they?

Once you have a sense of what the destination will look like, make tiny changes to get just one module done. Move in little bits at a time, to build up evidence that things can work.

The way to change a culture is to set such a strong positive example that people naturally want to follow. Telling other people their work sucks is not that example, but first pitching in to speed up development cycles can make everyone happy.

And lastly you have at least some responsibility to inform management of the risk they aren’t aware of. Things will go much better for you if you tell your manager that the codebase was built in a way that makes future changes expensive and risky, and this is fine for where the business was but at some point it makes sense to invest in shifting the development velocity/risk curve of the business.

> The way to change a culture is to set such a strong positive example that people naturally want to follow. Telling other people their work sucks is not that example, but first pitching in to speed up development cycles can make everyone happy.

This is the part I'm having the most trouble with. What if you are at a place which is not software minded? Any tips on making them understand?

“Never rewrite” is a popular cargo-cult that sprang from a well known blog article that made the rounds some years ago. The urge to rewrite can be a naive impulse for sure, but there are LOTS of cases where new and better technology can result in tremendous gains, or where a code base is simply too far gone to redeem. The biggest successes of my career have almost all been ground up rewrites of existing products using new technology or techniques that resulted in orders-of-magnitude improvements in performance and ROI. If you can make incremental improvements that’s great, but sometimes it’s just not possible to rewrite “a piece at a time” because there are no pieces, just one big ball of mud. To the original author: If you don’t rewrite this mess, your competitors will. I’d say: lay out the case for an overhaul, stand your ground, don’t implement any new features until you’ve got a clear path to reducing technical debt, and if you can’t get buy-in to an overhaul just leave. What you’re describing sounds like a textbook scenario for burnout and there are lots of other opportunities where you can work on things in ways that you’ll actually enjoy.

This. So much.

I'd argue that the first order of business is getting the code committed to SCM. Then you can coach the team on new branches (features/bugs), and build the culture of using the SCM. Do this before going to the execs and giving the 10,000 meter view.

Go to the execs and get buy-in on the scope of what you need. I'd recommend articulating it in terms of risk reduction. You have a $20M revenue stream, and little control/testing over the machinery that generates this. You'll work on implementing a plan to get this under control (have an outline of this ready, and note that you need to assess more to fill in the details). You need space/time/resources to get this done.

Then get the testing in place. Make this part of the culture of SCM use. Reward the team for developing sanity/functionality tests. Get CI/CD going (a simple one, that just works). From this you can articulate (with the team's input), coding/testing standards to be adhered to.
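A CI that "just works" can start as nothing more than a syntax check on every push. A hypothetical GitHub Actions sketch (adjust to whatever forge the company already uses; the workflow name and job are invented):

```yaml
# Hypothetical .github/workflows/ci.yml for a no-framework PHP tree.
# The lint job IS the whole test suite on day one; grow it from there.
name: ci
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Syntax-check every PHP file
        run: |
          find . -name '*.php' -print0 | xargs -0 -n1 -P4 php -l
```

Even this trivial gate catches the "edited live, forgot a semicolon" class of outage, and it gives the team a green checkmark to build a testing habit around.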

After all this, start identifying the problematic low hanging fruit. Work each problem (have a clear problem statement, a limited scope for the problem, and a desired solution). You are not there to boil the ocean (rewrite the entire thing). You are there to make their engineering processes better, and move them to a more productive environment. Any low hanging fruit will have a specific need/risk attached to it. Like "we drop tables/columns regularly using user input." Based upon the culture you created with SCM/testing, you can have the team develop the tests for expected and various corner cases. From that, you can replace the low hanging fruit.

Keep doing that until the fruit is no longer low hanging. Prove to the execs that you can manage/solve problems. Once you have that done, you can make a longer term roadmap appeal, that is, start looking at what version 2.0 (or whatever number) would look like, and what internal bits you need to change to get there.

Basically, evolution, not revolution. Easier for execs under pressure to deliver results to swallow. Explain in terms of risks/costs and benefits, though in the near term, most of the focus sounds like it should be on risk reduction.

Not good advice. I have been in OP's shoes: I inherited a project that was a clusterf, and did a full re-write. It was a lot of work (more than anticipated), but eventually it was very successful.

The original code was just not salvageable. (It was quickly done as a fast hack, and it would break left and right, causing outages).

The OP just needs to make sure they understand what the OG system is trying to do, and what it will take to re-write it into something sane. Don't start before understanding all the caveats of the system/project you are trying to re-write.

Do it in small pieces and you'll be there forever - it'll never get done.

Map out the functionality related to the (hard) requirements and kick off replacing the product(s) with something modern and boring.

> Also, respect the team. Maybe they aren't doing what you would, but they are keeping this beast alive, and probably have invaluable knowledge of how to do so. Don't come in pushing for change...

Yes, 3 people creating a revenue of $20 million/year is impressive.

But what if 1, let alone 2 of them quit and/or fall ill? That's way too much risk for this type of revenue.

If a new team member needs a year to just understand how the code is organized, then a well structured and documented rewrite certainly is necessary.

Something this messy is highly likely to have many security vulnerabilities. Maybe start with a scan or pentest and use that as additional justification to get things in order. 20M a year also means that this company can't afford for this application to be compromised.

The strangler pattern of rewriting individual pieces is also what leads to 3-4 incompatible versions of jQuery. You could start with one key page and rewrite it in React or whatever your preference, but if you never manage to kill one of the old dependencies you are just making even more of a tangled web.

I would try to identify how entangled some of the dependencies are and start my rewrite with the goal of getting rid of them. But yeah, I agree that version control and testing are going to be key here, as any backsliding will probably result in the idea of future refactoring being viewed negatively.

This sounds like solid advice. A rewrite would be a world of hurt, particularly if you don’t have buy-in from the existing team.

Regarding the team: junior they may be, as he says, but they're rolling with a multi-million dollar product. If they're keeping the product going and continuing to add business value, then they're doing something right. Their engineering practices might be questionable, but they seem to have a solid product.

However, getting testing in place is going to be a challenge. I’ve encountered systems that sound similar to this one (perfectly functional, zero discernible architecture, not remotely designed with any kind of testing in mind.) It’ll be difficult to convince the suits that introducing testing has any real value when you’re starting from zero.

The first thing that comes to mind is the strangler fig pattern. Sounds like a useful idea in this instance.

> …an alternative [to a re-write] is to gradually create a new system around the edges of the old, letting it grow slowly over several years until the old system is strangled.[0]

[0] https://martinfowler.com/bliki/StranglerFigApplication.html
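One concrete way to start here: since the routing already lives in nginx, the 10,000-line config is also a ready-made seam for the strangler fig. A hypothetical sketch (upstream names, ports, and the /menu route are invented for illustration):

```nginx
# Strangler-fig seam in nginx: peel one route at a time off to the
# rewritten service; everything else keeps hitting legacy PHP untouched.
upstream legacy_php { server 127.0.0.1:9000; }
upstream new_app    { server 127.0.0.1:8080; }  # the rewritten service

server {
    listen 80;

    # First route migrated to the new code path:
    location /menu { proxy_pass http://new_app; }

    # Everything else falls through to the existing app, unchanged:
    location / { proxy_pass http://legacy_php; }
}
```

Each migrated route is one `location` block: small, reviewable, and instantly revertible by pointing it back at the legacy upstream.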

This is exactly the right advice. Full rewrite might look good on the resume but will be a late error prone disaster.

Start with tests. Can't emphasize this enough.

> You can delete code as long as the tests pass.

It's true that poorly maintained code contains a lot of pieces which should be deleted, but if tests were added post hoc it is hard to be sure that they cover all use cases.

After adding basic tests I would suggest improving logging to get a good understanding of how the software is used. Better to store the logs in a database which allows quick queries over all the data you have (I'd personally use ClickHouse, but there are other options). But even with good logs you need to wait and collect enough data, otherwise you can miss rare but important use cases, e.g. something which happens only during the tax season.

Basically every time I decided for a full rewrite I ended up thinking "thank god I made that decision, the new architecture is much simpler" (and no, it didn't just seem simpler to me).

The big rewrite works - but only if you have a team you can trust. You need a new team of seniors to pair with the current team, promise a promotion to the current team at the end of the task.

Committing to an iterative approach is what I do when I don't have enough authority/ political tokens and I can't afford a rewrite.

Over time it gets less and less priority from the business and you end up with half a codebase being crap and half codebase being ok and maintaining stuff is even harder.

Agreed, full rewrite is a horrible idea. Source: worked on a rewrite of a project that was like this: PHP from 2003, 7 figures in revenue, written by someone who was not a developer, no version control or testing. And it failed horribly.

I have tactical suggestions, but the strategy is simple: move toward more modern software practices, one step at a time.

But first, the elephant in the room. You say you need to help the project

> without managing [the team] directly

Who does? How can you help them?

Because you don't have direct authority, all the tactics and suggestions mentioned here won't be as helpful as they would if you were the manager in charge. And it's hard to offer concrete advice without knowing exactly how you are connected. A principal in the same company and want to help? A peer of the manager? A peer of the team members? Each of these would have different approaches.

And how much time do you have to help? Is this something you are doing in the shadows? Part of your job? Your entire job?

With that said, here's my list of what to try to influence the team to implement. Don't worry about best of breed for the tools, just pick what the company uses. If the tool isn't in use at the company, pick something you and the team are familiar with. If there is nothing in that set, pick the industry standard (which I try to supply).

1. version control. Git if you don't have any existing solution. GitHub or GitLab are great places to store your git repos

2. bug tracker. You have to have a place to keep track of issues. GitHub issues is adequate, but there are a ton of options. This would be an awesome place to try to get buy-in from the team about whichever one they like, because the truth it is doesn't matter which particular bug tracker you use, just that you use one.

3. a build tool so you have one-click deploys. A SaaS tool like CircleCI or GitHub Actions is fine. If you require "on prem", Jenkins is a fine place to start. But you want to be able to deploy quickly.

4. a staging environment. This is a great place to manually test things and debug issues without affecting production. Building this will also give you confidence that you understand how the system is deployed, and can wrap that into the build tool config.

5. testing. As the parent comment mentions, end-to-end testing can give you so much confidence. It can be easy to get overwhelmed when adding testing to an existing large, crufty codebase. I'd focus on two things: unit testing some of the weird logic, which is a relatively quick win, and setting up at least 1-2 end-to-end tests through core flows (login, purchase path, etc). In my experience, setting up the first one of each of these is the toughest; then it gets progressively easier. I don't know what the industry standard for unit testing in PHP is any more, but I have used PHPUnit in the past. Not sure about end-to-end testing either.

6. Documentation. This might be higher, depending on what your relationship with the team is, but few teams will say no to someone helping out with doc. You can document high level arch, deployment processes, key APIs, interfaces, data stores, and more. Capture this in google docs or a wiki.

7. data migrations. Having some way to automatically roll database changes forward and back is a huge help for moving faster. This looks like a viable PHP option: https://laravel.com/docs/9.x/migrations which might let you also introduce a framework "via the side door". This is last because it is least important and possibly more intrusive.
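On point 5, the cheapest first end-to-end tests for a codebase like this are characterization ("golden master") tests: record what key pages return today, and fail if that ever changes. A minimal sketch, where the `fetch` function is a deterministic stand-in for `curl -s` against a staging URL (swap in the real fetcher for actual use):

```shell
# Fixed scratch dir so the sketch is self-contained and re-runnable.
rm -rf /tmp/golden-demo && mkdir -p /tmp/golden-demo/golden && cd /tmp/golden-demo

# Stand-in; for real use: fetch() { curl -s "https://staging.example$1"; }
fetch() { printf 'menu: home, products, checkout\n'; }

record_page() { fetch "$1" > "golden/$(echo "$1" | tr '/:' '_').txt"; }
check_page()  {
    fetch "$1" > current.txt
    diff -u "golden/$(echo "$1" | tr '/:' '_').txt" current.txt \
        && echo "PASS: $1 unchanged"
}

record_page "/menu"   # run once to approve today's behaviour as the baseline
check_page  "/menu"   # run on every change; prints "PASS: /menu unchanged"
```

It's not a spec, but it pins down current behaviour, so the team can delete and refactor with the golden files acting as a tripwire.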

None of these are about changing the code (except maybe the last one), but they all wrap the code in a blanket of safety. There's the added bonus that it might not trigger sensitivities of the team because you aren't touching "their code". After implementing, the team should be able to move faster and with more confidence.

Since you are not the direct manager, you want to help the team get better through your influence and through small steps. That will build trust and allow you to suggest bigger ones, such as bringing in a framework or building abstraction layers.

Agree with this approach 100%

Yes same. Sometimes you see a frankenstein code and devs get all emotional and wants a full rewrite or die attitude. Maybe take a step back and migrate piece by piece.

> First off, no, a full rewrite is not only not necessary, but probably the worst possible approach. Do a piece at a time. You will eventually have re-written all the code, but do not ever fall into the trap of a "full re-write". It doesn't work.

I've seen systems where the entirety of the codebase is such a mess, but is so tightly coupled with the business domain, that a rewrite feels impossible in the first place. Furthermore, because these systems are often already working, as opposed to some hypothetical new rewrite, new features also get added on top of the old systems, meaning that even if you could rewrite them, by the time you finished, the rewrite would already be out of date and wouldn't do everything the old system does (the alternative being to make every development task 2x larger due to needing to implement things in both the old and new versions, the new one perhaps still not having all of the building blocks in place).

At the same time, these legacy systems are often a pain to maintain, have scalability and stability challenges and absolutely should not be viewed as a "live" codebase that can have new features added on top of it, because at that point you're essentially digging your own grave deeper and deeper, waiting for the complexity to come crumbling down. I say that as someone who has been pulled into such projects, to help and fix production environments after new functionality crippled the entire system, and nobody else knew what to do.

I'd say there is no winning here. A full rewrite is often impossible, a gradual migration oftentimes is too complex and not viable, whereas building on top of the legacy codebase is asking for trouble.

> But before you re-write one line of code - get some testing in place. Or, a lot of testing. If you have end-to-end tests that run through every feature that is currently used by your customer base, then you have a baseline to safely make changes. You can delete code as long as the tests pass. You can change code as long as the tests pass.

This is an excellent point, though! Testing is definitely what you should begin with when inheriting a legacy codebase, regardless of whether you want to rewrite it or not. It should help you catch new changes breaking old functionality and be more confident in your own code's impact on the project as a whole.

But once again, oftentimes you cannot really test a system.

What if you have a service that calls 10 other services, which interact with the database or other external integrations, with tight coupling between all of the different parts? You might try mocking everything, but at that point you're spending more time making sure that the mocking framework works as expected, rather than testing your live code. Furthermore, eventually your mocked data structures will drift out of sync to what the application actually does.

Well, you might try going the full integration test approach, where you'd have an environment that would get tests run against it. But what if you cannot easily create such an environment? If there are no database migrations in place, your only option for a new environment will be cloning an existing one. Provided that there is a test environment to do it from (that is close enough to prod) or that you can sufficiently anonymize production data if you absolutely need to use it as the initial dump source, you might just run into issues with reproducibility regardless. What if you have multiple features that you need to work on and test simultaneously, some of which might alter the schema?

If you go for the integration testing approach, you might run into a situation where you'll need multiple environments, each of which will need their own tests, which might cause significant issues in regards to infrastructure expenses and/or software licensing costs/management, especially if it's not built on FOSS. Integration tests are still good, they are also reasonably easy to do in many of the modern projects (just launch a few containers for CI, migrate and seed the database, do your tests, tear everything down afterwards), but that's hard to do in legacy projects.
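For reference, the "few containers" version in a modern project can be as small as this hypothetical compose file (image names, credentials, and the seed/test scripts are all placeholders):

```yaml
# Hypothetical docker-compose.ci.yml: a throwaway per-run test environment.
services:
  db:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: ci-only
      MYSQL_DATABASE: app_test
  app:
    image: php:8-apache
    volumes:
      - .:/var/www/html
    depends_on: [db]
# CI then runs roughly:
#   docker compose -f docker-compose.ci.yml up -d
#   ./seed-test-data.sh && ./run-tests.sh     # placeholder scripts
#   docker compose -f docker-compose.ci.yml down -v
```

The hard part in a legacy project is exactly what the comment above describes: without migrations and seed scripts, there is no repeatable way to produce that `db` container's contents in the first place.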

Not only that, but you might not even be fully aware how to write the tests for all of your old functionality - either you need to study the whole system in depth (which might not be conceivable), or you might miss out on certain bits that need to be tested and therefore have spotty test coverage, letting bugs slip through.

> Once you are at that point, start picking off pieces to modernize and improve.

It helps to be optimistic, but for a plethora of reasons, many won't get that far. Ideally this is what people should strive for and it should be doable, but in these older projects typically the companies maintaining them have other issues in regards to development practices and reluctance to introduce tools/approaches that might help them improve things, simply because they view that currently things are working "good enough", given that the system is still generating profits.

Essentially, be aware of the fact that attempts to improve the system might make things worse in the short term, before they'll get better in the long term, which might reflect negatively upon you, unless you have sufficient buy-in to do this. Furthermore, expect turnover to be a problem, unless there's a few developers who are comfortable maintaining the system as is (which might present a different set of challenges).

Ideally, start with documentation about how things should work, typical use cases, edge cases etc.

Then move on to tests, possibly focusing on unit tests at first and only working with integration tests when you have the proper CI/environment setup for this (vs having tests that randomly fail or are useless).

After that, consider breaking the system up into modules and routing certain requests to the new system. Many won't get this far and I wouldn't fault you for exploring work in environments that set you up for success, instead of ones where failure is a looming possibility.

i'd do it that way too,

- tests to cement interfaces

- gradually write module supporting this interface

- replace module on test clone and bench / retest it

when this module is ok, do another

Huh. You are literally saying do a full rewrite. But it's also the worst idea?

Edit: A full rewrite always meant replacing every part of a system. Whether you do it gradually doesn't really matter.

"Whether you do it gradually doesn't really matter."

It absolutely DOES matter. A gradual rewrite is much more likely to work than a stop-the-press rewrite.

It's still a rewrite. The crux of the statement I made.

The problem with a classic full rewrite is that the existing system is thrown away immediately. All the existing features are not available in production until the rewrite adds them back in. Often incomplete, buggy, changed beyond all recognition, or a combination of all of these. That obviously sucks and is the reason the classic rewrite is rarely done. However, it is clear that something must happen.

"Full rewrite" is a description of the end state, not the process.

The best way to do a full rewrite is incrementally, with test support and consideration for natural separation of internal subsystems.

The best way to do a rewrite may be incremental, but the terminology of "full rewrite" doesn't usually refer to an incremental rewrite, it refers to starting from scratch.

I don't think that's true -- a "full" rewrite is used in contrast to a "partial" rewrite, where only part of the system is replaced. It's called a "full" rewrite because the goal from the start is to fully replace the system with new code.

Consider that if this were not true, then there would be no way to describe an incremental full rewrite, nor any way to describe a from-scratch replacement of a subsystem.

I've written on this topic before, for example https://increment.com/software-architecture/exit-the-haunted...

He’s saying to Ship of Theseus the codebase. Don’t build a new ship and then burn down the old ship. Replace the old ship piece by piece in place.

That only works if the new pieces correspond to old pieces. If there's no good structure to build on, the units to be replaced will constrain the architecture of the new ship.

At some point you end up trying to change a pumpkin boat into an aircraft carrier, and there's no obvious way you can do that one piece at a time.

> If there's no good structure to build on, the units to be replaced will constrain the architecture of the new ship.

Which is why you do it in stages: add scaffolding until local rewrites are possible, then rewrite the business logic, then tear the scaffolding down.

That's a good analogy actually. Scaffolding is a kind of temporary test structure that you can use to maintain function while you figure out something better.

Maybe there are some underlying architectural problems that need to be addressed, but it would be impossible to make those changes from the current situation. It sounds like it is impossible to even know what code is live vs sitting on the server. How do you even know you have a firm grasp on the current architecture when it is unclear what code is even running the product?

A lot of low hanging fruit to be addressed that will likely lead to meaningful improvements. Once the code is in better shape and some unfortunate legacy pattern is identified, then it can be considered time to re-tool the architecture.

Agreed. The first thing to do is figure out WTF is going on. This is perhaps the hardest kind of thing to do as a developer.

Full rewrite generally means stop the presses we are gonna migrate this whole thing from here to there and no new features until it's done (hint it never gets done).

I’ve only ever witnessed ship-of-Theseus style migrations and those also never get done.

Does not compute... Ship of Theseus is just regular old development of course it never gets done but new features aren't put on hold.

I mean like “we want to replace X with Y”. Y incrementally starts replacing X, but 100% migration is never achieved, meaning double the API surface area exists indefinitely.

Because the migration doesn’t block new features, that means the org gets tired and reallocates the effort elsewhere before it’s ever done, with no immediate consequences. Rinse and repeat.

I think you've not witnessed Ship of Theseus, but "build Ship2 next to Ship1 and start using Ship2 while Ship1 is still being used and keep saying you're going to migrate to Ship2 eventually but meanwhile Ship1 and Ship2 diverge and now you have 2 ships".

I recently witnessed this mess and it is an enormous mess. Don't build Ship2 in the first place. Instead, replace Ship1's mast and sails, and rudder etc until you've replaced all the parts in Ship1. That's the SoT approach.

Right but how do you replace the masts? Don’t you have to build mast2 and then tear down mast1 if you want to have continuous propulsion?

Yes, can you see how that's quite different from building a second ship?

In my comment, X and Y are different masts, not different ships.

I understand now.

A "full rewrite" means that after the completion of the rewrite, the old code has been fully replaced by new code.

What you're describing is a "stop-the-world" rewrite.

I think you are being needlessly pedantic. Everyone understands that "full rewrite" means "restarting from scratch" in this context, especially since the poster was very clear that eventually everything will be touched.

They’re saying to do it, eventually, incrementally and not all at once.

And critically… never completing an incremental rewrite doesn’t matter. Everything remained working the whole time and continues to make the company money. And as a bonus, you were also able to make feature changes that the business wanted at the same time. It’s classic XP, when the money runs out, the system still works!

> this code generates more than 20 million dollars a year of revenue

From a business perspective, nothing is broken. In fact, they laid a golden goose.

> team is 3 people, quite junior. One backend, one front, one iOS/android. Resistance to change is huge.

My mistake, they didn't lay a golden goose--they built a money printer. The ROI here is insane.

> productivity is abysmal which is understandable. The mess is just too huge to be able to build anything.

But you just told me they built a $20M revenue product with 3 bozos. That sounds unbelievably productive.

> This business unit has a pretty aggressive roadmap as management and HQ has no real understanding of these blockers

You should consider quitting your job.

As far as the business is concerned, there are no problems... because well... they have a money printer, and your team seems not to care enough to advocate for change. Business people don't give a damn about code quality. They give a damn about value. If 2003 style PHP code does that, so be it. Forget a rewrite, why waste time and effort doing simple refactoring? To them, even that has negative financial value.

From their perspective, you're not being paid to make code easy to work with, you're being paid to ship product in a rat's nest. Maybe you could make a business case for why it's valuable to use source control, dependency management, a framework, routing outside of nginx, and so on... but it doesn't sound like any of that mattered on the road to $20M a year, so it will be very difficult to convince them otherwise, especially if your teammates resist.

This, again, is why you should consider leaving.

Some developers don't mind spaghetti, cowboy coding. You do. Don't subject yourself to a work environment and work style that's incompatible with you, especially when your teammates don't care either. I guarantee you will hate your job.

In my opinion OP should seriously consider this advice.

I really mean nothing patronizing here, but I suspect OP does not have the corporate experience to handle this situation. This is a corporate equivalent of a double-black diamond downhill route. OP was hired by people who have little understanding of tech and already came in with guns blazing. I might almost wonder if OP's a sacrificial lamb.

But, the tech advice of not doing a rewrite, making tests, soothing any hurt feelings, creating local instances will help. Make the everyday coding experiences of the tech team nicer. Source control, local instances, and unit/integration/E2E tests are a gimme.

The old rule of thumb applies: pick only 2 of speed, cost or quality. You cannot have 3.

> I suspect OP does not have the corporate experience to handle this situation.

I agree with this. OP doesn't say, but reading between the lines, corporate at best doesn't understand the ramifications, but corporate doesn't care about the ramifications.

They're getting 20 million in revenue from 3 cheap devs. Things are going great, according to corporate. They're not going to learn, and OP is going to get blamed when things can't get done.

I just quit because I was placed in a similar situation. The CEO, who does have a CS background, albeit ancient, insisted there was nothing wrong with the tech stack that couldn't be solved by vertically scaling and then horizontally scaling. We were at the limits of the former, and the architecture made many important parts impossible for the latter, but that's another discussion.

The problem wasn't tech scaling, it was process scaling. We really couldn't divide work easily because there were often conflicts. People would join, see the horrible code, then leave. We specifically had to hire off-shore junior devs who didn't know any better and snowball them. I felt the last part was unethical and didn't want to be part of it any longer.

OP is not doing any favors for themselves, and especially not for the junior devs on the team. This job is going to set back the career for the junior devs. They're wasting their time on ancient methods and technologies.

>snowball them

Could you define this phrase and what English dialect it's from?

> what English dialect it's from?

I assume American English.

Prior to the sexual slang made popular by the movie Clerks, snowballing in the context I used basically means to blindside or con someone.

One definition on Urban Dictionary:

"A situation where a criminal has found themselves in possession of an easy target and proceeds to rob them and leave them mortally wounded for fun, a synonym for getting iced."

There's also snowballing meaning a problem getting bigger and bigger when unaddressed. I'm probably using it in an older, not-exactly-mainstream way.

In my dialect, which is some sort of American English, trying to snow someone means trying to BS or con them.

Blindsiding someone means taking them unawares - hitting them when they aren't looking, physically or metaphorically.

A speedball has cocaine in it, which is sometimes known as snow.

"Iced" can mean killed, but the Urban Dictionary definition is oddly specific

I agree about "snowballing" meaning increasing in size or momentum, but it doesn't have to refer to a problem.

Quite literally. I assume you never got a snowball in your eyes?

> OP does not have the corporate experience to handle this situation

Even before that, s/he doesn't even seem to understand any measure of business. $20m/year with 3 people is BIG. Any disturbance to whatever makes that happen will hurt the business greatly. When a full rewrite shakes the boat and causes a few millions of revenue loss or the loss of potential market share or opportunities, they will rightly fire him.

Best route would be to improve things without hurting anything. It doesn't matter whether things are 'properly done'. It matters whether the organization survives and prospers. What's 'proper' is redefined every 3-5 years, most of the time without objective basis anyway. So there is no need to hurt the business to force the current software paradigm.

Is the entirety of the whole company just 3 people? It doesn't sound like it. That 3 person team is just 'tech' it seems - there may be 5-10 managers/sales/support/etc people. And... $20m is revenue, not profit. If the cost of their sales is, say, $15m... and there's 15 people working at the company, that's quite healthy, but it's not some money-printing goldmine at this stage.

> Is the entirety of the whole company just 3 people?

From what is told in the summary, it seems like the software stack and these 3 people constitute the core of the business that is going on. The business may be something totally different. But it seems to be running on that software.

> And... $20m is revenue

Even if they have lower margins than what you yourself imagined, $20m revenue/year is still a gigantic amount. You can improve upon whatever margin or inefficiency there is in the process and increase the net profit - optimize vendors, costs, sales pipeline.

The difficulty in modern Internet business is getting to some place at which you can command $20 m revenue from any market. Finding the customers and the users. Not the margins.

I’ve worked at a $100 million revenue online company that didn’t make any profit. OP says money is tight after covid, so seems like it the business is connected to a (physical) market that was affected by the pandemic. He says there is an extensive roadmap by a management team that are not familiar with the technical details, so it seems like the technology is not the core business.

It’s not difficult getting to a high revenue fueled by aggressive and expensive acquisition using money from investors who gets dazzled by growth numbers, but if your customer lifetime value is low and you’re not a pure SAAS business which means margins won’t automatically improve with scale, turning that company profitable can prove very difficult.

Additional reason to switch jobs: it is working perfectly for now. As far as the company is concerned, the three-man team is doing a tremendous job, and the team has convinced itself of the same. Now they have to ship additional stuff, and think they are perfectly able to do so. You are there to make sure they ship; most likely they won't, because things work until they don't anymore. When they don't, and the team isn't shipping, guess who gets the blame? The new team member, the one thing that changed. And after that, the mess will only grow.

Edit: Typos. I suffer from severe typing dyslexia more often than not...

Best answer. If the money is ok, and the environment not too toxic/stressful, you might just see it as a challenge to secretly improve a codebase without anybody noticing, while still delivering what the higher-ups want to see. Or maybe just scratch the first part and try to see how much further you can push that turd with every coding crime imaginable. One-up the juniors in ugly hack Olympics. Ship a feature and put it on your CV before leaving.

Otherwise, walk away immediately.

> a challenge to secretly improve a codebase without anybody noticing

That is how it should be done in any case anyway. Improvements should get slowly rolled out without disturbing the users and the business.

> That is how it should be done in any case anyway.

Not exactly. IT management should be always telling people stuff like "did you notice that the integration with XYZ that never worked well stopped failing?" or "did you notice that we delivered those few last features at record time?" and explaining why.

That is assuming that things are failing. $20m/year with 3 people does not look like anything is failing to me.

That is assuming things are improving. If you aren't improving anything, then yeah, you don't have anything to say.

This is very insightful. I learned it kind of the hard way. The business world is a mess. Requirements are a mess and always changing. This leads to messy code that requires a lot of time to clean up. You don't have time for that as long as there are always more customer wishes and projects coming in. As long as the business keeps working there's always something of top priority coming in. The pain starts growing but the steaming pile of code just doesn't collapse. It just kind of keeps on working while you are adding more and more code. Sure, the pain is big and progress is quite slow, but what's the alternative?

My advice would be to listen to the developers, to understand them and the business. To understand what they really need. What a viable path forward would be. A complete rewrite, a second system for new developments, many more developers, or something else. Or maybe it is the optimum solution right now because the whole company is so messy and your job is not to change the company structure. Then maybe you could support them by slowly enhancing their skill set and accept what you can't change. Doesn't sound like fun? Then leave soon, staying won't do you any good.

This 100%. This line is an immediate nail in the coffin:

> This business unit has a pretty aggressive roadmap as management and HQ has no real understanding of these blockers. And post COVID, budget is really tight.

I've been at a company not unlike this... several MLOC of VB WinForms that was total spaghetti, but a highly successful app that brought in a lot of revenue. In our case the majority of the dev team was in agreement that the situation wasn't sustainable, and at first (meaning when I joined the company) we had engineering leadership who mostly agreed. They brought in several rounds of consultants to evaluate the code base and announced plans for a major modernization effort. But the consultants largely agreed with the dev team that the code was in such bad shape that it effectively needed a rewrite. At one point we did a prioritization and t-shirt sizing exercise and came up with _30 man years_ worth of items in the "critical tech debt" bucket. Apparently engineering leadership was not aligned with the C level suite about how much money they were willing to spend on this thing, because within the next year there was 100% turnover of management in the engineering org. A couple of (talented!) people who had been hired for the modernization effort left first, then the other engineers who knew the code base well followed. Last I heard the company was limping along relying on Eastern European and Indian contractors to keep the lights on.

In short OP, you can probably get some wins; maybe some major process improvements like using source control, maybe more minor things like introducing patterns or frameworks so that net new code isn't a total mess. But there is zero chance that you're going to do anything like a rewrite or major refactoring without leadership understanding that they're in an unsustainable situation and a willingness from them to invest time and money to fix it.

Seems so many of us have been hired at one of these companies. Same here. No source control. No bug tracker. No tests. There was no formal system for producing builds--releases to customers were simply copied from the workstation of whichever engineer could build it successfully today. There were so many bugs and crashes, it was hard to even get through basic customer use cases. There was no spec. There was no product roadmap or plan. Sales would sell something, then run downstairs and say "We just sold XYZ, you need to implement XYZ in the software blob somewhere!"

The CEO/Founder wouldn't even consider a refactoring or cleanup session, let alone a full or partial rewrite. Only features drive sales, and we've sold so many things we don't have, so all he wanted was feature cram. Every so often, a VIP customer complained about some major use case that simply didn't work, so only in those cases was bug fixing permitted. And in those cases, the VIP customer would get a custom bespoke build made with the bug fixed. I was hired because the last person of my seniority could not cram features in fast enough and gave up in disgust.

They only got source control because I came in on a weekend, unpaid, to do it. I lasted a little over a year. Bootstrapped founder (and sole shareholder) eventually sold the company for ~$150M. Sometimes it seems there is no justice in the world :)

This is why we as a software world need some minimum standards for stuff that deals with sensitive information of users.

Don't get me wrong, if it works it works, but the question is for how long and who will suffer when it doesn't?

Also from a business perspective: If I were the CEO of that company I'd probably like to know that there is something built on sand and carrying huge technical debt. It is a cash cow now, but I'd like to ensure it still can be one in the future. And for this some level of maintenance would be expected.

Same thing for reliability. If as a CEO I knew the entire functioning of the thing that brings the cash hinges on one developer remembering whether index_john-final.php or index_peter-final-final.php is the latest code, I would probably have a hard time sleeping.

That means the minimum OP should do is explain the situation neutrally, and your point of view is certainly something he should weave into this. In the end the higher-ups need to know this, and what they decide to do is their call, but OP needs to make them aware of why this could potentially endanger the service in the future. If they then decide to take that risk, so be it.

It wasn't necessarily written by those 3 devs. They're just the current team. Granted, they probably have been that for a long time because of the resistance to change, but the brightest minds are probably long gone.

I'd bet it's B2B and has an expensive sales division that top management believes (tbh maybe even rightly) is the real revenue driver.

Good answer.

Some developers seem to think that their job is to engineer nice and beautiful systems. It's not.

As a developer, you're getting paid (FYI, the minimum needed so you don't leave the company) in order to maximise total shareholder returns. That's it.

The business doesn't care if the codebase is garbage, with massive technical debt, nor if you struggle working with it. That's literally not even a problem as far as it is concerned.

Nice and beautiful systems aren't the job, but a system that will continue working in the future while still allowing people to keep adding features and that won't suffer massive security breaches really is. It sounds like the current system is one bad feature implementation away from hosing the database, one hardware fault away from no longer existing, and one interested hacker away from a complete compromise.

You are right and obviously good engineering practices are not incompatible with the company’s interests.

I was more reacting to OP's attitude, where he asks how to fix everything without it even being clear whether he is managing the team or officially in charge of the product.

I may be mistaken but the impression I have is that the management is perfectly happy with a bad system and expects the engineers to simply go along. This is then a management decision. It may blow up later and sink the business (or not, it can just hold for the lifetime of the product).

As far as OP is concerned, this situation means he should probably leave the ship now, as staying would likely result in: 1. no marketable skill development (being stuck with a garbage codebase); 2. burn-out as he takes on the Herculean task of cleaning it up; 3. being blamed if something bad happens.

Usually nothing changes for me as a developer if shareholders get bigger returns on their investments. So I don't really care.

What I care about is the code quality because good code makes my job easier. I inherited the code, not the business.

Well it is the management’s job to make your work aligned with the company’s best interests.

It is possible that you’re already doing what you’re getting paid for. In that case you shouldn’t go out of your way to increase the company’s profits (and nobody expects you to).

(In my original post I didn’t say that as a developer you should code thinking of quarterly results, I just stated the obvious that you are employed for the shareholders to get money back)

So there was a physicist, an engineer, and a business guy, and they were discussing God.

The physicist said that God must be a physicist, because He had to know about matter and energy and so on.

The engineer said that God must be an engineer, because He had to do something useful with the matter and energy - turning chaos into order.

And the business guy asked, "Where do you think all that chaos was coming from?"

Only 3 developers maintaining a horrible codebase is a massive business risk. In this market, they could easily leave for better jobs within months of each other. Especially if they're junior and not company lifers. The money printer will print money until one day it suddenly doesn't.

Exactly. I often think of scenarios such as this in evolutionary terms. Imagine a doppelgänger competitor in this space, exactly the same, except they're pretty content with their crappy tech ecosystem and focus on delivering features (albeit, more slowly than they might with a more modern system.) And so...

1. You focus on the advice here, adding tests (which IMHO is actually pretty difficult on a legacy poorly architected/documented system), source control, refactoring, etc.... with any time remaining in the schedule devoted to adding features.

2. Your doppelgänger chugs along and releases several more features than you do, giving them a market edge.

What happens next I suppose depends upon what difference those extra features make. If the delta is small, you may be able to pull out ahead, in the long run. But if it's large, then your company may start to lose revenue / customers. Then, the screws will likely tighten even harder, and you'll be forced to sideline refactoring efforts and double down on delivering features. And then those refactoring/cleanup efforts will bit rot and you'll find yourself back roughly where you started, except now you're behind your competitor.

There's a quote I can't seem to find atm that summarizes this. It was something along the lines of "With php our product is already up and gaining market share meanwhile our competitor is still futzing with their rails configurations." (If you know the actual quote I'd love to see it.)

> But you just told me they built a $20M revenue product with 3 bozos. That sounds unbelievably productive.

This doesn't indicate profitability, and it seems to ignore that management/owners might have had something to do with it. A well-connected industry player is better poised to start/found/build/grow a company to that level than someone with 0 experience.

And... yes... quitting to something which matches the OPs expectations will likely be better all around than trying to 'fix' something people aren't asking to be fixed (it seems).

$20M revenue is not the same as $20M profit.

It’s not the same, but if that $20M is primarily generated by the software, then it’s those 3 people who contribute to the top line. The rest, like sales and marketing, are irrelevant: fire them and the product will keep generating revenue off the existing customer base. It will stop doing so, however, if the product breaks. So, the post above is right to an extent, this is the golden goose. ))

Unless the revenue is for products ordered on the site and shipped to paying customers. Believe it or don't, this is still done at some sites that are not Amazon.

Not if they're spending $19m in google ads to make $20m in revenue.

> team is 3 people, quite junior.

Even at FAANG salaries this wouldn't be that much compared with $20M

You don't know what the costs are though. The site could have huge costs of content acquisition or any number of reasons to not be making anywhere near $20 million profit.

Revenue isn't the same as Earnings Before Salaries either.

E.g. maybe it's an e-wholesaler or widget reseller, bought $19M goods and sold $20M. Or maybe it was much slower than expected, they actually bought $25M goods and are burning 500k/month on warehousing. Or whatever.

Yes, but the parent is saying this could be an e-commerce website, or construction company, etc. But having an iOS app feels unnecessary for most of these businesses

a lot of field operations these days are done on phone or tablet apps, so I can totally see an ios developer being vital for internal processes

The engineering team might be 3 people but not the whole company.

Assuming profitability is even a problem, if $20M in revenue is coming from just 3 devs, the driving cost of the company isn't the tech. It's other parts of the company. That would be another red flag against the leadership.

>> team is 3 people, quite junior. One backend, one front, one iOS/android. Resistance to change is huge.

> My mistake, they didn't lay a golden goose--they built a money printer. The ROI here is insane.

We can not make that conclusion. Presumably the business still needs salespeople, support, etc.

P.S. I am not saying that ROI is bad, either. Just that we can't say very much either way.

I think the thing is that the codebase after 5 years of development was already capable of making $19M/year, and the past 7 extra years have just added $1M/year. The next year of development will not add any more because the thing is collapsing under its own weight.

so it will just be 20 mil a year with 3 jr devs

sounds ok?

Only in absolute terms. Kinda feels like a missed chance if you actually have a $100M market.

That's what I was thinking as well. Run for the hills.

It's likely a losing uphill battle.

Great answer. Developers are gonna hate it.

THIS. A million times this. The world is about solving real user problems. And obviously this code solves a huge $20M USD problem. And the code is obviously so good that even 3 "bozos" (to use your language) can manage and maintain it for over a decade. This is the holy grail. I wouldn't change a thing and instead ask: how can you make this generate $40M USD a year? This is how you will add real value to the company, and management will love you.

If the mess generates $20m a year, that's great and I agree with you!

If the mess generated $20m last year and it's projected to generate $20m next year, that's a problem.

If the second case is true, I believe it's somewhat the responsibility of the OP to sell solving this long-term problem to the _rest of the_ business. If they hired him as an expert in that area, they should listen to him.

If that fails, leave.

Sheesh, I had to look this far to find this comment. Half of the top comments were discussing how much testing OP should do and in what manner he should approach the act of refactoring things, while ignoring that the "monster" generates a boatload of money for the company. I love Hacker News a lot, but people seriously need to get out of their developer box and look at the big picture.

Smells like executive stupidity to me.

There is some consensus that you can't fix stupid.

Even if you are very talented at repairing airplane engines in flight.

The profit at all costs, just make the next quarter's numbers (so we get our bonus), attitude is what leads to disasters like Twitter.

Their pursuit of profits above all else have likely gotten people killed. They represent a clear and present danger to US National Security.

This is the sanest answer. No amount of leadership is going to help an incompetent team. A codebase with massive technical debt, tight coupling, and accidental complexity will be hard to improve incrementally. Impossible without competent engineers.

I agree it's the sane answer. But I don't think these engineers are incompetent. They lacked direction, accidentally followed worst practices, and _still_ came out on top. I would say they are good engineers but perhaps bad project managers / architects.

You don't not use source control because nobody directed you to, and you didn't 'accidentally'... what, forget about it?

You don't use it because you haven't heard of it, which means: not competent.

If someone incompetent can make me $20mil a year in revenue, bring on incompetence!

I find it hard to imagine you’d never heard of source control by now. You’d have to have been living under a rock for the past 15 years.

Or be a bona fide 'script kiddy', learnt some WordPress PHP or whatever and got a job as 'webmaster' or something straight out of school (UK-sense, I specifically mean no university), no formal CS/software eng. training, never properly an intern/junior trained by people who know what they're doing.

I'm sure it happens. And then you get the next job with 5y PHP experience or whatever, employer doesn't mind no formal training (not that I'm saying they should in general - but if they're non-technical hiring someone to 'do it', or first hire to build the team or whatever, then they probably should as a reasonable proxy!), rinse and repeat.

If such a team of 3 people comprised of script kiddies and 5y PHP coders is going to create a $20m/year product, you can be sure that they will take precedence over anyone who was 'properly' educated in CS when it comes to hiring.

> I'm sure it happens

Yeah it does happen. While using the Internet, quite frequently, you are looking at such products developed by such teams, making millions of dollars a year. Even as the good engineering that is being done at FAANG is now being questioned over profitability, with even Google talking about 'inefficiency'.

The shitty software probably isn't the product. It could be some sales/inventory management tool or whatever, that before they got some 'script kiddies' in was just some forms in Microsoft Access (is that what it's called... the forms-on-top-of-a-database tool we had to learn in ICT at school) or whatever.

I think many people here are reacting to $20M forgetting not everything's a SaaS/in the business of selling software (but mostly still has some (in-house) software somewhere).

> The shitty software probably isn't the product

The shitty software is what sells the product, from the description. Even if the shitty software is a sales/inventory management tool or 'whatever', from the description it is obvious that it is vital to whatever business they are doing.

It doesn't matter whether it was built with Microsoft Access and Excel files. If it's contributing a major part of that $20m/year, it's not shitty, it's golden.

Anyone who understands the trials of modern business, including any tech lead who has had to deal with even merely stakeholders and low-level business decisions, would prefer to have a $20m/year pile of sh*t over a well-crafted, 'properly built' architecture. The difficult thing is getting to that $20m/year. The difficulty of rearchitecting or maintaining things pales in comparison to that.

> I think many people here are reacting to $20M forgetting not everything's a SaaS/in the business of selling software (but mostly still has some (in-house) software somewhere).

Everyone is aware of that. Many are also aware that getting to $20m/year in WHATEVER form is more difficult than architecting a 'great' stack & infra.

Well, I don't agree. You'd struggle to do it without any software at all these days, but you can certainly do it without anything written in-house.

My point about Access (or Excel or whatever as you say) was that that would be the very early days of something starting to happen in-house, that wouldn't even be the hypothetical 'script kiddies'.

> but you can certainly do it without anything written in-house

Nope. Not really. Your average SV startup idea in which the end users will do some simple, but catchy things with your app - yeah, go all no-code if you want to get it started.

But, in real business, in which there are inventories, sales, vendors, shipping companies, deliveries, contracts, quotas, FIFO and LIFO queues and all kinds of weird stuff, things don't work that way. You may end up having to code something specific in order to be able to work with just one vendor or a big customer even. They may even be using Excel. You do it without blinking because millions of dollars of ongoing revenue depend on such stuff.

Or having drunk too much of the "move fast and break things" Kool-Aid for all of the 5 days of your career.

Not a developer, but I was involved, and am again involved, in some crucial dev projects on which the future success of my employer depends. Any developer who deploys to production without testing, or worse, develops directly in production is by every definition at least incompetent. If not an incompetent wannabe rockstar ninja cowboy without even realizing it. And those devs are dangerous.

Therefore...the people mucking about with production should not be called developers! Problem solved, next?

A lovely knot to unravel!

First, get everything in source control!

Next, make it possible to spin the service up locally, pointing at the production DB.

Then, get the DB running locally.

Then get another server and set up CD to that server, including creating the DB, schema, and sample data.

Then add tests run on each PR, then code review, then auto-deploy to the new server.

This should stop the bleeding… no more index-new_2021-test-john_v2.php

Add tests and start deleting code.

Spin up a production server, load balance to it. When confident it works, blow away the old one and redeploy to it. Use the new server for blue/green deployments.

Write more tests for pages, clean up more code.

Pick a framework and use it for new pages, rewrite old pages only when major functionality changes. Don’t worry about multiple jquery versions on a page, lack of mvc, lack of framework, unless overhauling that page.

I largely agree with this approach, but with 2 important changes:

1) "Next, make it possible to spin the service up locally, pointing at the production DB."

Do this, but NOT pointing at production DB. Why? You don't know if just spinning up the service causes updates to the database. And if it does, you don't want to risk corruption of production DB. This is too risky. Instead, make a COPY of the production DB and spin up locally against the COPY.

2) OP mentions team of 3 with all of them being junior. Given the huge mess of things, get at least 1 more experienced engineer on the team (even if it's from another team on loan). If not, hire an experienced consultant with a proven track record on your technology stack. What? No budget? How would things look when your house of cards comes crashing down and production goes offline? OP needs to communicate how dire the risk is to upper management and get their backing to start fixing it immediately.
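That copy step can be a small script. A sketch, assuming MySQL with credentials in `~/.my.cnf`; the host and database names (`prod-db`, `staging-db`, `appdb`) and the cron path are hypothetical placeholders:

```shell
#!/bin/sh
# refresh_staging: dump the production DB and load it into a separate
# staging/local DB, so nothing experimental ever touches production.
refresh_staging() {
  prod_host="$1"; staging_host="$2"; db="$3"
  # --single-transaction takes a consistent snapshot without locking
  # InnoDB tables, so production keeps serving traffic during the dump.
  mysqldump --single-transaction -h "$prod_host" "$db" \
    | mysql -h "$staging_host" "$db"
}

# Hypothetical nightly cron entry (3 AM):
# 0 3 * * * /usr/local/bin/refresh_staging prod-db staging-db appdb
```

Scrubbing PII from the copy (as mentioned elsewhere in the thread) would slot in between the dump and the load.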

Yeah, having the experimental code base pointing at the production data base sounds like fun. I did that. We had a backup. I'm still alive.

This is the right way to think about it. My only disagreement is that I'd do the local DB before the local service. A bunch of local versions of the service pointing at the production DB sounds like a time bomb.

And it's definitely worth emphasizing that having no framework, MVC, or templating library is not a real problem. Those things are nice if you're familiar with them, but if the team is familiar with 2003 vintage PHP, you should meet them there. That's still a thing you can write a website in.

> if the team is familiar with 2003 vintage PHP, you should meet them there. That's still a thing you can write a website in.

You can write a website in it, but you cannot test it for shit.

If this is true, OP can consider writing tests of the website using a frontend test suite like Cypress, especially with access to local instances connected to local databases.

There's no value to retroactive unit testing. Retroactive tests should be end-to-end or integration level, which you certainly can do without a framework.

Frameworks are not needed to test. I've been testing and validating my code since way back, in C. Not because I was an early adopter (I'm still not), but because I needed to debug it, so... faster.
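For those retroactive end-to-end checks, even a shell script around curl goes a long way before any test framework is introduced. A minimal sketch; the default base URL and the example paths are hypothetical, and it should only ever point at a local copy, never production:

```shell
#!/bin/sh
# Tiny smoke test: assert HTTP status codes for critical routes.
# BASE must point at a local instance backed by a copied DB.
BASE="${BASE:-http://localhost:8080}"

check() {
  path="$1"; expect="$2"
  code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE$path")
  if [ "$code" = "$expect" ]; then
    echo "OK   $path -> $code"
  else
    echo "FAIL $path -> $code (expected $expect)" >&2
    exit 1
  fi
}

# Replace with the site's actual critical paths, e.g.:
# check /             200
# check /products     200
# check /no-such-page 404
```

Run it before and after every change; it catches "the page now 500s" regressions that no unit test on this codebase ever would.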

Good strategy. I would suggest not hooking it up to the prod DB at the start. Rather, script something to restore prod DB backups nightly to a staging env. That way you can hook up non-prod instances to it and keep testing as the other engineers continue with what they do, until you can do a flip-over as suggested. Key here is always having a somewhat up-to-date DB that matches prod but isn't prod, so you don't step on toes and have time to figure this out.

Note that going from no source control to a first CD instance in prod is going to take time, so assume you need a roll-out strategy that won't block the other engineers.

Considering what sounds like reluctance to change, the switch to source control is also going to be hard. You might want to consider scripting something that takes the prod code and dumps it into source control automatically, until you have prod CD going. After that, the engineers switch over to your garden-variety commit-based reviews and manually triggered prod deploys.
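That auto-snapshot script can be tiny. A sketch, assuming git is installed on the box; the web root and cron schedule are placeholders:

```shell
#!/bin/sh
# snapshot_prod: commit the current on-disk state of a directory into git.
# Run from cron so every manual production edit is at least captured with
# a timestamp, even before the team adopts a real workflow.
snapshot_prod() {
  dir="$1"
  cd "$dir" || return 1
  [ -d .git ] || git init -q
  git add -A
  # Commit only when something actually changed on disk.
  if ! git diff --cached --quiet 2>/dev/null; then
    git -c user.name=prod-snapshot -c user.email=ops@example.invalid \
      commit -q -m "prod snapshot $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  fi
}

# Hypothetical cron entry, every 15 minutes:
# */15 * * * * /usr/local/bin/snapshot_prod /var/www/html
```

It doesn't change anyone's workflow, but from day one you get history, blame, and the ability to diff "what changed before the site broke."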

Good luck! It sounds like a interesting problem

> Next, make it possible to spin service up locally, pointing at production DB.

I think this is bad advice, just skip it.

I would make a fresh copy of the production DB, remove PII if/where necessary and then work from a local DB. Make sure your DB server version is the same as on prod, same env etc.

You never know what type of routines you trigger when testing out things - and you do not want to hit the prod DB with this.

I am inclined to agree. The other advice was excellent, but pointing local instances to production databases is a footgun.

I've kind of reconsidered this a bit. Right now, the only way to test that the database and frontend interact properly is to visit the website and enter data and see it reflected either in the database or in the frontend.

It's less terrible to have a local instance that does the same thing. As long as the immediate next step is setting up and running a local database.

But the thing is, you have no idea whether even a single GET request fires off an internal batch job to do X on the DB.

I mean, there are plenty of systems in place that do this (WordPress cron, I think), so it's not unheard of.

For me, still a nope: do not run against the prod DB, especially if the live system accounts for $20M yearly revenue.

Agree with this approach. You have nginx in front of it already so you can replace one page at a time without replacing everything.
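A sketch of what that looks like in the nginx config; the upstream, port, and route names are hypothetical, but the idea is the strangler fig pattern, peeling off one route at a time:

```nginx
# New service (framework-based, in source control) runs alongside the old PHP.
upstream new_app {
    server 127.0.0.1:9000;
}

server {
    listen 80;
    server_name example.com;

    # Routes already migrated go to the new service...
    location /products {
        proxy_pass http://new_app;
        proxy_set_header Host $host;
    }

    # ...everything else keeps hitting the legacy PHP exactly as before.
    location / {
        include legacy-rewrites.conf;  # the existing rewrite rules, untouched
        root /var/www/legacy;
        index index.php;
    }
}
```

Each migrated route is a small, reversible change: if the new page misbehaves, delete one `location` block and traffic falls back to the legacy code.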

One thing I haven’t seen mentioned here is introducing SSO on top of the existing stack, if it’s not there. SSO gives you heaps of flexibility in terms of where and how new pages can be developed. If you can get the old system to speak the new SSO, that can make it much easier to start writing new pages.

Ultimately, a complete rewrite is a huge risk; you can spend a year or 2 or more on it, and have it fail on launch, or just never finish. Smaller changes are less exciting, but (a) you find out quickly if it isn’t going to work out, and (b) once it’s started, the whole team knows how to do it; success doesn’t require you to stick around for 5 years. An evolutionary change is harder to kick off, but much more likely to succeed, since all the risk is up front.

Good luck.

I think "SSO" here maybe doesn't mean "Single-sign on"? Something else?

No, I meant single sign on.

In my experience, if you can get SSO working for (or at least in parallel with) the old codebase, it makes it much easier to introduce a new codebase because you can bounce the user outside of the legacy nginx context for new functionality, which lets the new code become a lot more independent of the old infra.

I mean there are obviously ways to continue using the old auth infra/session, but if the point is to replace the old system from the outside (strangler fig pattern) then the auth layer is pretty fundamental.

I faced a similar situation: I needed to come up with ways to ensure the new code was legacy-free, and SSO turned out to be a big one. But of course YMMV.

I'd add putting a static code analysis tool in there, because that will give you a number for how bad it is (the total number of issues at level 1 will do). That number can be given to upper management, and then, whilst doing all the above, you can show that the number is going down.

There is significant danger that management will use these metrics to micromanage your efforts. They will refuse changes that temporarily drive that number up, and force you to drive it down just to satisfy the tool.

For example, it is easy to see that low code coverage is a problem. The correct takeaway from that is to identify spots where coverage is weakest, rank them by business impact and actual risk (judged by code quality and expected or past changes) and add tests there. Iterate until satisfied.

The wrong approach would be to set something above 80% coverage as a strict goal, and force inconsequential and laborious test suites on to old code.

Many tools allow you to set the existing output as a baseline. That's your 0 or 100 or whatever. You can track new changes from that, and only look for changes that bring your number over some threshold. You can't necessarily fix all the existing issues, but you can track whether you introduce new ones.
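PHPStan is one concrete example for PHP: it can freeze every existing issue into a baseline file so that only newly introduced issues are reported. A hypothetical minimal config (the `src` path is a placeholder for wherever the code actually lives):

```neon
# phpstan.neon -- generate the baseline once with:
#   vendor/bin/phpstan analyse --generate-baseline
includes:
    - phpstan-baseline.neon

parameters:
    level: 0
    paths:
        - src
```

Start at level 0 and ratchet the level up over time; the baseline count itself is the single number that can be reported upward.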

The results might also be overwhelming in the beginning.

Solid advice. I did 2 full rewrites with great success, and to add to this list I would also make sure you are communicating with executives (possibly gently at first, depending on the situation), really learning about the domain and the requirements (it takes time to understand the application), and investing in your team (or changing resources; caution: not right away and all at once, since there is a knowledge gap here). The rewrite will have massive benefits to the business. In our case: stability (fewer bugs), the capability to add new features faster and cheaper, scalability, better user experience, etc. This can get exciting to executives depending on the lifecycle of the company. Getting them excited and behind it is one of the core tasks. Don't embark on this right away, as you need more information, but this will matter.

Among the things I'd prioritize is to make a map of all services/APIs/site structure and how everything falls into place. This would help you make informed decisions when adding new features or assessing which part of the monolith is most prone to failure.

Best advice so far.

This is the way.

This hacker agiles.
