I recently became CTO of a company with two moderately successful SaaS products. The second is a fork of the first, and has been maintained by more competent people than the original. Both are still monuments to technical debt.
About 4 million lines of PHP code, written by underpaid, sometimes not well-meaning, freelancers and students over the span of 8 years. The CEO wrote a large part of it, but stopped learning new techniques around 2004.
I'm bringing competent, well-paid people in through my network and trying my best to give them as much freedom as possible. I allow and encourage greenfield modules/services that run on separate, new infrastructure for anything that can realistically be rewritten in the available timeframes. But the larger part of the job is still mind-numbing to my team, and that makes me question my wisdom.
If anyone has tips on how to steer such a ship in a direction where the work is less frustrating for my devs, I'm very open to advice.
I've run across this situation many times (I'm a "senior" team lead, meaning I've been working for 25+ years). I've witnessed companies that overcame tech debt, and seen companies fail because of it. There are basically three approaches people can take.
One is the "big rewrite": start a new code base and try to develop it in parallel. It takes a very long time, and the team has to maintain two solutions at once. It's a big-bang approach, and it often fails or drags on for years.
The second is massive refactoring. This strategy requires that the teams focus intensely on testing and best practices. However, often the testing culture is not there, which is why the code became unmanageable in the first place. It's kind of like starting over, and a new focus and discipline on testing is hard for teams to sustain without strong leadership, training, or new talent.
The last, and most effective in my opinion, is a service-based, incremental approach. If the code base is not already using services, APIs must be built. Frontends/apps must be decoupled from the legacy components. A clear domain model has to be agreed upon, and then parts of the legacy codebase are put behind APIs and decoupled from the rest. Over time, sections are refactored independently, and the APIs hide the legacy away. Maybe the legacy parts are refactored and replaced, or maybe they stay around for a while. But the key is that this approach allows multiple people or teams to work in parallel and focus on their areas. This is domain-driven design in action, and it works. New features can actually be developed sooner, even though the legacy is not yet replaced.
In the end, overcoming tech debt is about people. And on larger code bases, it's more of an organizational problem than a code problem. Developers need to be able to move forward without having to navigate too much code or too many different people.
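A minimal sketch of that incremental, API-behind-a-facade approach, in Python for illustration (all class and method names here are hypothetical, not from any real codebase): callers only ever see one stable entry point, and whether a given operation is served by the legacy code or a rewrite is an internal routing detail, so sections can be migrated one at a time.

```python
# Hypothetical sketch: put legacy code behind a stable API, then migrate
# one operation at a time (incremental / "strangler fig" replacement).

class LegacyBilling:
    """Stands in for an old, tangled module."""
    def invoice_total(self, items):
        total = 0
        for price, qty in items:
            total += price * qty
        return total

class NewBilling:
    """Rewritten implementation of the same operation."""
    def invoice_total(self, items):
        return sum(price * qty for price, qty in items)

class BillingAPI:
    """The only entry point callers see; routing is an internal detail."""
    def __init__(self, use_new=False):
        self._impl = NewBilling() if use_new else LegacyBilling()

    def invoice_total(self, items):
        return self._impl.invoice_total(items)

# Both paths must agree before the legacy one is deleted.
items = [(10.0, 2), (5.0, 3)]
assert BillingAPI(use_new=False).invoice_total(items) == \
       BillingAPI(use_new=True).invoice_total(items) == 35.0
```

The flag here is a stand-in for whatever migration mechanism you use (config, feature flags, per-tenant routing); the point is that callers never know or care which implementation answered.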
Do you subscribe to the microservice philosophy? I ask because you come from the era of shared objects, and I personally still consider libraries with well defined APIs to be much simpler than dealing with multiple processes possibly running across different hardware.
I do break my infrastructure apart, but far less aggressively than some advocates.
I'm curious what 25 years have led you to believe.
I'm not the OP and I don't have 25 years of experience, but I'm a manager/architect with 18 years behind me. My take:
I very much believe in microservices. I've repeatedly seen library-based approaches fail, because one of the key and near-universal truths of tech debt, IME, is tightly coupled code. When you force an API interaction to happen via an outside protocol, you force a clean contract and a culture of coding to a contract. Decoupling the code allows each team to move faster and more independently.
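To illustrate the "coding to a contract" point, here is a hedged Python sketch (the payload fields are made up): even in-process, round-tripping every call through a serialized payload means the only thing the two sides share is the contract itself; there are no shared mutable objects or reach-ins to the other side's internals.

```python
# Hypothetical sketch: an outside protocol (here, JSON over a function call)
# forces a clean contract between caller and service.
import json

def handle_request(raw: str) -> str:
    """Service side: accepts and returns only serialized payloads."""
    req = json.loads(raw)
    # Made-up business rule, just to have observable behavior.
    result = {"user_id": req["user_id"], "active": req["user_id"] % 2 == 0}
    return json.dumps(result)

# Client side: builds the payload, never touches the service's internals.
response = json.loads(handle_request(json.dumps({"user_id": 42})))
assert response == {"user_id": 42, "active": True}
```

In a real system the transport would be HTTP/gRPC/queues rather than a direct call, but the discipline it imposes is the same.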
>The last, and most effective in my opinion, is to go with a service-based, incremental approach.
I was once handed a project where the original architect drank the microservices kool aid. It didn't stop the different services from being tightly coupled to one another - it just made the pain of that happening worse.
It made testing a pain - you needed to set up and run 11 different services on your machine to test anything.
It made debugging a pain - you had to trace calls over multiple different services with code often written in different languages. Debugging became more and more like detective work.
It created a multiplicity of irritating edge cases. The 'calculation server' could time out if it took too long - and it sometimes did. Serialization/deserialization was also an area rife with bugs.
The code quality got worse due to this approach, exacerbated by the team lead at one point giving people 'responsibility' for different services to different people.
I think microservices, where they've "worked", have typically been a path of least resistance to realizing Conway's law: a tacit acknowledgement that different corporate fiefdoms want to write and deploy code in their own way and won't communicate effectively with one another. In that respect the approach is effective, because it's easier to draw up REST API contracts between disparate, often different-language-speaking teams than it is to draw up library API contracts.
Surrounding the code with integration tests and incrementally refactoring (decoupling modules, deduplicating code, adding assertions) is the only way to approach technical debt.
>In the end, overcoming tech debt is about people.
No. It's a technical problem. People problems definitely exacerbate it - deadlines, politics, etc. but it's still a technical problem in the end.
> It made testing a pain - you needed to set up and run 11 different services on your machine to test anything.
That is explicitly not the "microservices kool aid." The first thing a microservice needs to be is independently testable, so it can be independently developed.
That's what we do, and it's working reasonably well. Using microservices instead of libs makes working in independent teams easier and faster, and we can decouple better (different internal domain models, each managing its own persistence).
It definitely is a drag, but there's just no way around that.
I look at services as being an admission that we haven't really evolved language design into the Eli Whitney era. It looked for a time like we were moving that way with VBX, workflow engines, and mobile agent style divisions. Everything we do seems external to the languages we use.
I'm finding your service/API suggestion interesting, in light of Steve Yegge's somewhat infamous post about Amazon's initiative to accomplish this some years back.
I think one important step is to recognize that bugs can and will happen in the process of cleaning up this tech debt. There are 2 things you need to do, to handle this.
1. Tell your team that cleaning up the tech debt is a major company priority, and that even if some mistakes are made along the way, that's an acceptable price to pay for the benefits involved. People shouldn't let the fear of breaking something dissuade them from cleaning up the mess.
2. Have your team invest heavily in building test/QA infrastructure, so that if they do break something, it gets automatically caught and flagged before it reaches production. If the advice in bullet 1 scares you, then you need to double down here to make up for it.
I've been in teams where pull requests that significantly cleaned up the code base were literally rejected, because people were paranoid that any change at all could break something in unforeseen ways. People just bunkered up into a "if it ain't broke, don't touch it" mentality, which meant that the tech debt problem never ever improved. Ultimately, the only way to get yourself out of this hole, is by encouraging people to take risks even if it involves them sometimes failing, and building better safety nets to catch them on the occasions when they do fail.
>I've been in teams where pull requests that significantly cleaned up the code base were literally rejected, because people were paranoid that any change at all could break something in unforeseen ways. People just bunkered up into a "if it ain't broke, don't touch it" mentality, which meant that the tech debt problem never ever improved.
Was this in a project with or without unit tests? Usually you can win over people like this by writing a test harness around the affected area, getting that past their review, and then pushing for change.
If even that is rejected, take it upstairs and make an ultimatum, or straight up quit and pat yourself on the back for a job well done, regardless of which one you picked.
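One common shape for that harness is a characterization test: pin down what the legacy code does today, right or wrong, and check every refactor against the recorded behavior. A minimal Python sketch, with an entirely made-up legacy function:

```python
# Hypothetical characterization test: capture current behavior first,
# then refactor with the recordings as a safety net.

def legacy_discount(order_total):
    # Stands in for the scary legacy function nobody wants to touch.
    if order_total > 100:
        return order_total - order_total // 10
    return order_total

# Input/output pairs recorded from the code as it behaves today --
# including any quirks (note the boundary: 100 gets no discount).
RECORDED = [(50, 50), (100, 100), (101, 91), (200, 180)]

def refactored_discount(order_total):
    """The cleaned-up version; must reproduce recorded behavior exactly."""
    discount = order_total // 10 if order_total > 100 else 0
    return order_total - discount

for given, expected in RECORDED:
    assert legacy_discount(given) == expected
    assert refactored_discount(given) == expected
```

The recordings deliberately encode what the system *does*, not what anyone thinks it *should* do; behavior changes become separate, deliberate commits.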
There are other ways to test legacy code than unit testing. If no testing is present at all, unit testing is likely not the correct first approach; functional/integration testing (using PhantomJS, for example) might be more suitable.
None of the options being discussed are "simple". There isn't a simple way out of technical debt. If there were, it wouldn't be debt.
Unit tests rarely contribute much to the technical debt, because in the worst case I've generally seen, you can just throw them away. I've seen that: unit tests so bad they weren't usable for much. I have yet to see a code base destroyed by trying to make it testable, whereas I've seen a ton of codebases that were made quite powerful and flexible, yet also fairly pleasant to use, because they were built to be testable.
Also, in the end I'm far more interested in "automated" testing than anybody's pedantic definitions of what "unit" or "integration" or whatever testing is. That's not the point. The point is that I ought to be able to set up an automated built server and run useful, meaningful tests on it.
Tests are great as long as there's a strict discipline (you usually need to enforce this with tooling) of never committing a change that causes any tests to fail.
Then anyone who writes a test has to make it pass, and anyone who makes a change that breaks a test has to fix it. No rotting tests that passed long ago, have been failing for months, and get ignored by everyone.
I once encountered old code where many of the functions had an argument "bool isUnitTest". Counting the number of things that have to be wrong with the workplace in order for this to occur and to be acceptable is left as an exercise for the reader.
Worst codebase I have ever seen. The incumbents had zero interest in paying down technical debt. Unit tests were just another thing that could be used to game metrics and justify budgets.
Metaphorically speaking, the group issued predatory high-interest payday loans to keep the customer paying increasing amounts of interest on a growing amount of technical debt, forever.
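The `bool isUnitTest` flag mentioned above usually disappears once the thing the flag was switching (a clock, a mailer, a database) is injected as a parameter instead. A hypothetical before/after sketch in Python:

```python
# Hypothetical before/after for the "bool isUnitTest" anti-pattern.
import datetime

# Before: the function knows it is being tested and branches on it.
def expiry_before(days, is_unit_test):
    now = datetime.date(2020, 1, 1) if is_unit_test else datetime.date.today()
    return now + datetime.timedelta(days=days)

# After: the clock is an injected dependency; tests pass a fixed one,
# production code uses the default real one. No test knowledge in the code.
def expiry_after(days, today=datetime.date.today):
    return today() + datetime.timedelta(days=days)

# Production call: expiry_after(30)
# Test call with a deterministic clock:
fixed = lambda: datetime.date(2020, 1, 1)
assert expiry_after(30, today=fixed) == datetime.date(2020, 1, 31)
```

The same move works for any collaborator: inject it, and the test-only branch (and the flag threaded through every signature) goes away.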
I have been on such a project (nuclear waste recycling). All refactoring had been postponed until the Y2K project, for which all the tests had to be performed anyway. That project should have been boring (very little of the code involved dates), but because of this huge refactoring, it was interesting. We removed almost all the technical debt.
>I've been in teams where pull requests that significantly cleaned up the code base were literally rejected, because people were paranoid that any change at all could break something in unforeseen ways.
That's why you need to surround the code with realistic integration tests first before changing anything.
>Ultimately, the only way to get yourself out of this hole, is by encouraging people to take risks
No, that's absolutely wrong. You need to first de-risk changes by creating a test harness that catches the bugs that terrify your developers.
Don't be afraid to break stuff, yeah, as long as breaking something is a deliberate decision. Many times the root source of your technical debt is the design of your public API; it might be so bad that it's hard even to create a wrapper for it. To ever get rid of that debt, you must remove the entry point to it, which means unhappy customers. Make sure you have a new API ready, and a transition path, before deprecating the old one.
Unfortunately big test suites won't help for this type of refactoring though, on the contrary they might even be in your way.
Check out the book "Working Effectively with Legacy Code", by Michael Feathers[0].
I believe the basic approach is to write tests to capture the current behaviour at the system boundaries - for a web application, this might take the form of automated end-to-end tests (Selenium WebDriver) - then, progressively refactor and unit test components and code paths. By the end of the process, you'll end up with a comprehensive regression suite, giving developers the confidence to make changes with impunity - whether that's refactoring to eliminate more technical debt and speed up development, or adding features to fulfill business needs.
This way, you can take a gradual, iterative approach to cleaning up the system, which should boost morale (a little bit of progress made every iteration), and minimises risk (you're not replacing an entire system at once).
I've used this approach to rewrite a Node.js API that was tightly coupled to MongoDB, and migrated it to PostgreSQL.
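The boundary-capture idea can also be done as a golden-master test: record the output produced at a system boundary once, then fail any later run that drifts from it. A minimal in-memory Python sketch (a real suite would store the snapshots as files checked into the repo, and the boundary would be an HTTP response or a rendered page):

```python
# Hypothetical golden-master sketch: first run records the boundary output,
# later runs diff against the recording.

def render_report(data):
    # Stands in for a system boundary (an HTTP response, a generated page...).
    return "\n".join(f"{k}: {v}" for k, v in sorted(data.items()))

snapshots = {}  # in a real suite: snapshot files under version control

def check_golden(name, output):
    if name not in snapshots:          # first run: record the golden master
        snapshots[name] = output
        return True
    return snapshots[name] == output   # later runs: compare against it

assert check_golden("report", render_report({"b": 2, "a": 1}))
assert check_golden("report", render_report({"a": 1, "b": 2}))      # unchanged
assert not check_golden("report", render_report({"a": 1, "b": 3}))  # drift caught
```

A drift isn't automatically a bug; it's a prompt to either fix the regression or deliberately re-record the master, which is exactly the review conversation you want.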
The key point for me is the fact that you're allowing your developers to actually fix the issues. Most developers I know love improving old code, even if the end-users will never notice.
Personally I don't mind having to deal with technical debt, but I absolutely detest being forced into creating technical debt. This is especially true if I have not yet been allowed to deal with whatever debt already exists in a given system.
The worst thing to hear in a meeting with managers or project leads is "We'll deal with 'that' issue after launch" or "Yeah, we need to hit the deadline, so let's just get this thing working and deal with the fallout later". Dealing with problems retroactively is always going to be more expensive, and some problems simply aren't fixable after a product launch. If you're going into production with known bugs or defects, at least let the technical people choose which bugs.
Make sure there is time dedicated to cleaning up the bad stuff. All the platitudes about code quality are worthless without dedicating time to improve things.
have the team come up with a list of achievable, meaningful milestones (e.g. 'eliminate use of nasty obsolete library X'), and ensure some time is spared to progress them; it'll become clear whether the team is net paying off or accruing debt
also, find someone who thrives on eliminating crap and let them get stuck in
Also make sure that person's work is recognized as being important.
I really like to eliminate crap, but I also try to avoid doing it as much as possible because it's something that never earns any recognition or praise. Historically I've had to really push to even get my boss to recognize it as an accomplishment during my quarterly reviews. Maybe management says they don't like technical debt, but that doesn't mean anything if incentive structures are designed in ways that encourage me to pile it on ever higher.
This is a tough spot to be in. From the perspective of a tech lead / individual contributor (I don't consider myself in the company of a Fellow or C-level exec), I've witnessed this kind of situation before and learned a few lessons from it, which I'd like to share here. (Bear in mind, I use the phrase "lessons learned" here loosely, as I could have drawn incorrect conclusions from my experiences, so please pay more attention to the explanations of the points more than the takeaways.)
- Don't bite off more than you can chew. You could "replatform" and try to replace everything that currently works with new tech. In my experience, while the result is an admirable amount of sophisticated technology, the value for the business is unrealizable for a significant amount of time (which usually results in "bad things," like your stock going down, employees being unhappy/leaving due to thinking that it's not going to work out, sales not hitting targets because they don't believe in the product they're selling, customers having more "strength" during contract/sales negotiations, etc). I have observed ~4 years of significant (maybe over a hundred engineers) investment in an effort to replace the entire technology of a relatively young company. I have read many stories of companies doing this and going belly-up as what appears to be a direct result (few companies survive the process; Uber would be an example of a company that is doing this and will survive[1] -- Steve Blank has some particularly appropriate reading material[2]). The problem is the business must continue to grow and sell its product during this time (stable business is important, but growth is critical -- and you can't focus on growth when you're rewriting your technology from the ground up). This seems obvious, but the moment I hear someone say "greenfield" when they also have significant tech debt, I raise an eyebrow of suspicion.
- So, following that, do bite off very small chunks and slowly decompose your tech debt into whatever your well-paid, highly-competent team lead(s) recommend in terms of architecture. For example, if you have a huge legacy application, slowly separate each logical component into its own microservice (assuming your team leads believe microservices are the best architecture for your use-case, etc).
- Freelancers/students/interns/contractors are great! But don't let them design anything. I say this not as a jab to anyone in this category. I work with people in these categories daily. However, it's critical that their work is only implementation and that it is written in a way that has absolute minimal cognitive overhead. The reason is that if you hire someone temporarily to produce a hoozit that does Thing, then in the absolute best case, they will produce exactly that, but you (and your well-paid, highly-competent staff) will have absolutely no idea or understanding as to how or why the hoozit does Thing, or how to modify the hoozit to be a whatsit, or make the hoozit do Otherthing. Sure, you could figure it out. But I posit the cost of doing so is greater than the cost of having done it yourself, even if it takes longer. And now that I think of it, this point is supported strongly by your very own experience already: temp workers always produce technical debt, even in the absolute best case (an example of a worse case is that they produce an unmaintainable/incomprehensible/unmodifiable hoozit that does not do Thing, or only does Thing in some conditions, a.k.a. is riddled with bugs). Competency or compensation has little or no effect. My belief is that this is often because temp workers know their position is temporary and will strive to achieve exactly Thing when building your hoozit -- they are economically motivated to do so. They are not economically motivated to build your hoozit in such a way that it can become a whatsit later on.
- Remember to deliver value -- in particular, drive growth. This is super important in the technology industry. You should always have some amount of your engineering muscle focused on delivering new value. Sometimes that is reducing technical debt, or decomposing your legacy application, or building new products. Sometimes it's integrating a third-party's product with yours in some way, as "uncool" as that may sound. Sometimes you are in a bind when you can't produce new value without dealing with technical debt. So deal with the technical debt in the most sensible way (i.e., not producing additional technical debt, but also not foregoing any work to reduce your technical debt in the pure interest of improving the top line). If you think of your legacy application as delivering some value, but you don't ever add new features to it, consider your competition and whether your business will lose critically due to lack of innovation.
- I believe that the best engineers care a lot more about understanding the business and how their work aligns with it. If you ask your team, I expect they will tell you that they would rather work on the mind-numbing effort, in some way, if it is the best thing for the business. Junior software developers will always prefer greenfield. More senior software developers will seek the optimal solution. (Similarly to how junior engineers will prefer Shiny New Tech X, whereas senior engineers will only use Shiny New Tech X when it really, really makes sense to do so and there is a strong alignment with the business and/or low risk to doing so -- like in a new product/microservice that is less critical than your core product offerings, for instance.)
- Don't fall into the trap of believing that the only way to grow the business is to abandon/migrate away from legacy applications. At the least, move your users/customers as you decompose/replace legacy software/services. Avoid trying to make big migrations, especially wholesale. (OK, maybe I'm repeating my first point here...)
Separately, I'm curious why the CEO having written any of the code matters if he is no longer contributing code (I'm assuming he is not, since you mentioned he stopped learning new techniques). Or did you mean that he stopped having free time to learn new techniques and therefore stopped contributing code? I would think the CEO needs to stop contributing code as soon as you have enough engineers to meet your minimum desired sprint points (or whatever metric you use for productivity). I certainly don't think that not having learned new techniques necessitates the cessation of contribution, although learning new techniques is a likely byproduct of contribution (due to experience, research, implementation itself, code reads, etc). But, I'm digressing from the topic.
Anyway, hopefully my commentary/experience is helpful to you. Best of luck!