A Taxonomy of Technical Debt (riotgames.com)
711 points by edroche 10 months ago | 113 comments

This is a fantastic article.

Contagion is a really great term. I've seen my poor abstractions be replicated by others on my team, to my horror -- "don't they see why I did that in this particular case, and not in this other case?" Of course, that's entirely, 100% my fault. I picked a poor abstraction, I put it in the code, I didn't document it well enough, and of COURSE other programmers are going to look to it when solving similar problems. They should!

That said... Sometimes I spend a bunch of time finding the right abstraction for a feature that we end up not expanding. And then it feels bad that I spent all this extra time coming up with the "right" solution, instead of just hacking out something that works. Hmm...

One team I was part of kept a separate backlog of technical debt and experiments. It was nice to have a place to say, "in 30 days, look at this hacky thing and see if it's worth making better". Or, "I noticed this is a mess, here's how I might clean it up." We'd occasionally talk over the backlog and prioritize it, which helped communicate both the general make-things-better spirit and specific issues like you mention. I really liked it.

One thing that made it work is that we worked on it in small slices all the time, without involving the product manager. It was still visible, so there'd be the occasional question, but as long as we kept delivering user value, nobody worried too much about our mysterious code concerns.

Funnily enough, at most companies I worked for, the rule I had to follow was "You can refactor if the PM doesn't catch you spending those precious minutes on it".

Only once did we have every Friday set aside to improve the codebase. Two months later it became every second Friday, though.

I'm really pissed that technical debt is considered as "Hey the dev guys are complaining again".

> I'm really pissed that technical debt is considered as "Hey the dev guys are complaining again".

That's because it's very opaque to anyone other than the engineers working on a project.

I've had a limited amount of success by making this more transparent. Flagging every time a feature would take longer because of a piece of technical debt the team wanted to fix got that fix prioritized before we implemented the 4th and 5th affected features.

How is "technical debt" handled in meatspace?

Don't the bean counters at Ford Motor Company (for example) nark on the assembly line workers, industrial engineers, and QA/QC folks when they let work pile up, leave broken machines lying around, or don't clean up trash?

It's risk/reward to the people who want to decide how their money is spent, isn't it?

In your example, the worst-case scenario is that someone could die, and that tends to spur on investors to discover the probity within themselves to spend some money avoiding an expensive lawsuit.

But when the devs are complaining about the old code being terrible and making their lives hard, it never seems to hinder them that much to management. They keep banging out new features and fixing bugs, and nothing bad seems to happen. But the drip-drip-drip of bugs keeps increasing, and the new features take a little longer each time, and nobody dies at least, but the thing becomes a haunted moneypit that nobody wants to touch, and you're stuck with it now unless you rewrite it all at huge expense, etc., etc.

Maybe everyone should just treat a piece of software as they would a life. I bet we've all seen some codebases where if it were a friend, you probably would have staged an intervention by now. Your software baby needs absolute care from the get-go until the very end, or it will get sick and probably die, and most likely in a very prolonged and painful way.

The place I used to work in has been hiring (junior) people like crazy. Part of the reason they need so many is the crushing foundational technical debt at the core. When they hired someone capable of improving that, they were unable to merge the changes due to fear, and management couldn't see the business value of doing so. They've had a few nasty outages recently too. I believe the insides of the Atlassian kit are similarly riddled with technical debt.

An important difference being that in your Ford example, you can just throw new people at the problem while in software it generally needs to be handled in the responsible team.

I’ve found it helps to measure “how long does it take to get a thing of size x done”: if you can measure the results of your improvement (like how long it takes to get a new design implemented), it’s an easier sell. Eventually, it becomes known throughout the company that things are going faster, regardless of the metric results. Of course, if you go around making high-risk changes for low reward, they’ll see the artifacts: more bugs and less system reliability.

I've had some success building technical debt into my estimates. While I'm working on a new feature or a bug, I'll tidy up in the area around it - the tidying up is just part of the work necessary to complete the task.

The really cool thing is that eventually you're able to deliver large, complex tasks in very brief times and then spring on the PM/management that you're able to do this _because_ you've been refactoring. That's made a believer out of at least one of my PMs.

Obviously this doesn't work in all circumstances - it's not always feasible to get the really systemic, contagious debt cleaned up as part of feature work, and if the PM catches on then it makes this tactic difficult to continue.

The bigger obstacle I've had, though, is other developers who haven't fully bought into a culture of continuous improvement. Fear of breakages causes refactor paralysis, which makes it easier to break things when working on them, which increases fear, and so forth. I'm not really sure of the best way to deal with that aside from adding a bunch of unit tests (which I still sometimes get pushback on).

Pushback on increasing test coverage?

In this case, a JavaScript front-end that had no unit tests previously. I also wasn't able to get NPM through the firewall, so I used Jasmine standalone and kept its files and copies of our third-party framework files in a "Frameworks" folder within a separate "Test" folder

The pushback I received was that keeping the framework code in Source Control would result in it being caught in the JS build/minification script, as well as my spec files. The individual that pushed back was also concerned about JS exceptions since we were up against a release, which speaks to a need for training about how unit test files work. Ultimately I .gitignored the framework folder but wouldn't budge on leaving the test files in, since .gitignoring unit tests defeats the purpose. Then I learned that the build script wouldn't grab those files anyway. :)

Ugh, hate that mindset.

My boss at my last job had the mindset of "refactoring only makes it different, not better". I asked him if I could spend some time refactoring our build system. He said no. I eventually did it anyway a few months later, spotted a bug due to the changes, and all of a sudden, build times were cut in half, or by a factor of 10 in many instances.

Same story for a pretty nasty hunk of code we had for handling sparse arrays. Asked if I could refactor, got told no, did it anyway a while later, and all of a sudden a problem that had been considered borderline infeasible takes like 1 day of work.

Refactoring isn't always the right decision; a good boss/lead needs to carefully weigh the pros & cons.

There is always some risk that refactoring makes code not only different, but worse. Corner-cases are often there for a reason, and refactoring sometimes misses them, especially when there isn't complete unit test coverage. Since it's often easier to get the core logic right, this likely leads to issues that are discovered in production.


A good boss knows to get out of the way, clear a path if necessary.

If boss can't trust the minions to do the right thing, someone's got the wrong job.

There is no "right" thing. There may be an optimal thing from the development perspective and an optimal thing from the business perspective. Since the two pieces cannot exist without each other, both parties have to communicate effectively and trust each other to find the optimal decision for the combined problem space which may be sub-optimal when considered separately.

> If boss can't trust the minions to do the right thing, someone's got the wrong job.

There are many people who have wrong jobs.

More importantly, there are many people who are good, but not perfect. They do some aspects of their work very well and other aspects less well. A good boss has some idea about that and is able to work with people who are not super great.

Last but not least, even very good people often disagree about many things, including whether refactoring is needed or not, or what kind of refactoring to do. Even if the boss trusted everyone and listened to everyone, he would still be told plenty of contradictory opinions.

The only time my improvements have ever been noticed was when the pointy-haired boss said "Well, you should have thought of that sooner. What am I paying you for?"

That's the kind of joke that's great to chuckle over while you update your resume.

One of the things that I almost always insist on in a dev/PM feedback cycle is the concept of "chores." The Devs (usually via eng lead) get to schedule chores in the backlog, full stop. PM can have a convo with eng lead to say "hey, will this chore take a super long time? Can you possibly reschedule it?" but if it's work that is a pure refactor (no product implications) the PM doesn't get to block it, period.

Of course, this only works well on teams where your PM and eng lead don't have a fundamentally adversarial relationship. I like to think this is most teams, but it does take some getting used to in terms of eng lead and PM communicating priorities and needs, between product momentum and code quality.

That's how I do it. It took some time to build up the trust relationship, but most of the time, our stakeholders and I can keep a good balance of maintenance and features. And this balance doesn't have to be rigid. I want my maintenance tasks done, but it's fine to prioritize deliverables for a sprint or two - we'll have a sprint or two of maintenance then. And that might be fine, or even beneficial, because then you have a bigger block of time to do some bigger cleanup tasks.

On most of my projects we have enlightened PMs who make allowances for paying down tech debt. For example, on my most recent rotation (an RoR app front-end to manage cloud orchestration software), the PM and tech lead worked out an arrangement where, for the four weeks following "feature freeze", half the dev time was spent paying down tech debt and other chores (the other half was spent fixing bugs).


Yeah, trusting developers to use their time wisely given a high-level alignment on the big goals can be very powerful. One of our struggles on the individual level is the uncertainty of "is this the little feature that will take the champion from good to great?" that leads to slow and steady feature creep. It's tough to weigh those against tech debt cleanup even though we have the autonomy to work on "mysterious code concerns" when we choose to.

I would like to suggest that there is a fourth dimension that might be called 'interest' as we are using a debt analogy - the tendency for the cost to increase over the time elapsed since the debt was incurred.

When an item of debt is first created, the people making it are often well aware of what they have done and are therefore in a relatively good position to fix it, but that knowledge quickly dissipates, to the point where it is often forgotten that there is a specific issue there. Furthermore, there is a tendency for it to be made less obvious as further changes are layered on top and around (this is distinct from contagion, as it can occur if the later changes are themselves debt-free, or at least independent of the decisions that created the debt and their consequences.)

One place I worked addressed this by having a mandatory post-deploy monitoring / patch day. We’d all do a deploy and keep an ear to support / logs while going ahead and improving things we knew needed a little clean up. If we saw anything come in from the release, we fixed it immediately.

An entire day is excessive in a CD setup, but for a two week release cycle it worked well. Kept the rough edges out of customer view very well.

The top comment under the article uses the height of the interest rate to describe the level of contagion http://disq.us/p/1ros2o9

'tl;dr "contagion" is the most important attribute because its properties are similar to interest rates. Having a small loan (small impact/fix cost) but high interest rate (high contagion) can quickly dwarf large loan small interest rate.'
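To put rough numbers on that tl;dr, here is a minimal sketch that treats contagion as a compound interest rate. The loan sizes and rates are made up purely for illustration; nothing here comes from the article or the linked comment.

```python
def balance(principal: float, rate: float, periods: int) -> float:
    """Compound `principal` at `rate` per period for `periods` periods."""
    return principal * (1 + rate) ** periods


# Hypothetical numbers: a small, highly contagious debt versus a large,
# well-contained one.
small_high = [balance(10, 0.50, n) for n in range(13)]
large_low = [balance(100, 0.05, n) for n in range(13)]

# First period in which the small debt has grown past the large one.
crossover = next(n for n in range(13) if small_high[n] > large_low[n])
print(crossover)
```

With these assumed rates the small loan overtakes the large one after a handful of periods, which is the point the comment is making.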

I found contagion to be a great clarifying concept too; it's something that I've been looking at in my codebase as the team expands.

My gut feel is that it's not necessarily about what you write in the first place, but what you refactor -- sometimes you can get away with a gradual replacement strategy (like std::string => AString from the article), but if the original pattern is contagious and bad, then you might have to take a more aggressive one-shot refactoring approach.

I've definitely seen this where a localized refactor is made to try to find a better way of doing something, we decide that we like the new way, and then don't find the time to replace the rest of the usages, resulting in a confusing state of affairs where you need to know which is the "blessed"/"correct" way of doing things.

I think that "contagion" is a good lens to use when assessing what the refactoring strategy should be for a given change to the codebase.

I really enjoyed the Lava Layer antipattern for incremental refactors that never complete. Having learned to recognize it, I think I'm more aware of the cost/benefit of introducing a new pattern, even if it's better in some way.


That article has changed my behavior in some places as well. Sometimes it's indeed better to sit down and replace the entire old solution, instead of going incrementally. It's a bigger immediate pain, but less pain afterwards.

I've also seen bad pattern replication, and had a difficult time explaining to other teams why it was a problem.

I used to write a lot of app-wide Javascript at a previous job that would get consumed by multiple teams. If I didn't encapsulate something well enough or if I left a private open, I'd later find a code review with someone exploiting it.

The worst offender was a team that once used the prototype of a shared class as a mixin, duplicated/mocked just enough of my implementation logic to get three or four methods working, and then left it at that. Of course, the next time I changed any of my code, even in the constructor, their page broke.

My experience has been that when other teams see these patterns, they see a single page or feature that's working at the moment and assume "this must be fine." They don't see the three or four frantic show-stopping bugs that got logged last month.

When I would confront teams about this, often the response that I would get was "Well, if it's good enough as a quick fix for them, why can't we do the same thing? Why are we the only team that has to fix this?"

Of course, when teams don't want to be the first one to break from a bad pattern, the end result is that nobody changes anything.

I have found the closer I am to the product and the clients that will be affected, and the more thoroughly I understand the usecase from the client’s perspective, the better I am at understanding how much effort to spend on “getting it right” in this way. Still wrong sometimes though!

Contagion is why I want a VCS tool that allows me to keep code review comments with the code. Just because someone senior did something bad two years ago doesn’t mean you have carte blanche to make new code that behaves the same way!

Would gitlens help? I love it. https://github.com/eamodio/vscode-gitlens

If I need to have the author explain something in the codebase as part of a PR, I usually make them write it down as a proper comment.

“Comments are for human context, code is for computers”

Interesting how you point to a slightly different kind of contagion, the replication of code patterns, while the article seems to discuss the kind that is inevitably forced on whoever depends on the code.

This is my number one concern in my current team.

I have implemented a bunch of things that, while helpful short term, had clunky hacks to make up for either a lack of tooling or time constraints. And then the solutions get replicated verbatim, because "they work". The more time passes, the worse they become.

The whole tech debt concept might be the wrong abstraction.

This reminds me of the following, from the book Team Geek[1], chapter "Offensive" Versus "Defensive" Work:

[...] After this bad experience, Ben began to categorize all work as either “offensive” or “defensive.” Offensive work is typically effort toward new user-visible features—shiny things that are easy to show outsiders and get them excited about, or things that noticeably advance the sexiness of a product (e.g., improved UI, speed, or interoperability). Defensive work is effort aimed at the long-term health of a product (e.g., code refactoring, feature rewrites, schema changes, data migration, or improved emergency monitoring). Defensive activities make the product more maintainable, stable, and reliable. And yet, despite the fact that they’re absolutely critical, you get no political credit for doing them. If you spend all your time on them, people perceive your product as holding still. And to make wordplay on an old maxim: “Perception is nine-tenths of the law.”

We now have a handy rule we live by: a team should never spend more than one-third to one-half of its time and energy on defensive work, no matter how much technical debt there is. Any more time spent is a recipe for political suicide.

[1] http://shop.oreilly.com/product/0636920018025.do

The XP guys had it right. Amortize all defensive work across EVERY piece of offensive work.

In the tech debt parlance most people are making interest-only payments instead of paying against the principal. Every check you write should do both (extra payments are good but they aren’t good enough).

Oooh, I like that a lot. Thanks!

It's a great article, but I do have one quibble.

> A hilariously stupid piece of real world foundational debt is the measurement system referred to as United States Customary Units. Having grown up in the US, my brain is filled with useless conversions, like that 5,280 feet are in a mile, and 2 pints are in a quart, while 4 quarts are in a gallon. The US government has considered switching to metric multiple times, but we remain one of seven countries that haven’t adopted Système International as the official measurement system. This debt is baked into road signs, recipes, elementary schools, and human minds.

A not-so-hilariously stupid mistake is to think that the traditional measurement system is stupid. His picture illustrates one of its virtues: the entire liquid-measurement system is based on doubling & halving, which are easy to perform with liquids. The French Revolutionary system, OTOH, requires multiplying & dividing by 10, which is easy to do on paper or with graduated containers, but extremely difficult to do with concrete quantities (proof: with one full litre container and two empty containers, none graduated, attempt to divide the litre into decilitres).

The real foundational debt is that we use a base-10 system for counting, due to the number of fingers & thumbs on our hands, rather than something better-suited to the task. If we fixed that problem, then suddenly all sorts of numeric troubles would vanish. There's actually a lot to be said about the Babylonian base-60 system, to be honest.

That's an... interesting point I haven't seen brought up before. Makes me appreciate the "traditional" system more.

Still, I guess we aren't going to drop base-10 any time soon, so I believe the US should just accept the "traditional" measurement system as something that used to be very practical, but no longer is due to progress of technology, and switch to SI.

Agreed; this is a really interesting perspective. It points to how different applications yield different optimizations. Base 60 is fucking cool. I really like musing on how we arrived at the duration of a second.

I stand by the assertion that being one of 7 countries that only sometimes uses SI has very real costs. https://www.jpl.nasa.gov/missions/mars-climate-orbiter/

> Base 60 is fucking cool.

It really is! The number of digits might be a bit much for normal use, so perhaps base-12 is more realistic. If we're going to upend tradition, might as well do it for good, well-founded reasons …

> I stand by the assertion that being one of 7 countries that only sometimes uses SI has very real costs. https://www.jpl.nasa.gov/missions/mars-climate-orbiter/

Of course, that would have been equally a problem had one team been using kilogramme-metre-seconds and the other gramme-metre-seconds, and could have been avoided by standardising on customary or on French Revolutionary units!
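Either way, the failure mode is a bare number crossing a team boundary without its unit attached. A minimal sketch of carrying the unit with the value, assuming a hypothetical `Impulse` type (the class and the unit labels are invented for illustration; the conversion factor follows from the definition of the pound-force):

```python
from dataclasses import dataclass

# 1 lbf = 0.45359237 kg * 9.80665 m/s^2
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A value that always travels with its unit (hypothetical type)."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        # Refuse to guess: an unknown unit is an error, not a default.
        raise ValueError(f"unknown unit: {self.unit}")


print(Impulse(1.0, "lbf*s").to_newton_seconds())
```

The design choice is that the consumer never reads `value` directly, so a mismatch is either converted explicitly or fails loudly, whichever unit system the teams standardise on.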

What's better about a base-60 system compared to a base-10 system?

Probably the same benefits as a base-12 system compared to base-10. More divisible factors.
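For anyone who wants to see the "more divisible" point concretely, a quick sketch that just lists the whole-number divisors of each base:

```python
def divisors(n: int) -> list[int]:
    """Return every positive integer that divides n evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


for base in (10, 12, 60):
    print(base, divisors(base))
```

Base 10 only splits evenly into halves and fifths, base 12 adds thirds, quarters and sixths, and base 60 covers all of those at once.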

Great article, loved how the examples were presented.

In my time as an engineer, I've found that thinking of tech debt as financial debt also helps. There is the initial convenience (borrowed money) of using the debt-ed approach. Then there is fix cost, as Bill Clark names it, i.e. how much it takes to pay back the debt if it were money. The impact is akin to the amortization schedule, i.e. what the cost is every time. For normal money, the amortization schedule is over time, but for tech debt it is over usage. The amortization schedule of tech debt is discounted over time; as with money, _now_ is more important than _later_.

Contagion is a great concept, and I think it is a better name than interest rate, as the debt will spread through the system, and not just linearly with time.

Tech debt is also multi-dimensional and not fungible like money, which makes it a harder thing to reason about.

But the good news is, in my opinion, that sometimes it is perfectly fine to default on some tech debt, and never pay it back, delete the code. Then taking that tech debt was a win, if the convenience was more than the amortized payments.
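A back-of-the-envelope version of that break-even, as a minimal sketch. The function name, the units (days) and all the numbers are assumptions for illustration, and discounting over time is ignored to keep it short:

```python
def net_value(convenience: float, impact_per_use: float, uses: int,
              fix_cost: float = 0.0) -> float:
    """Days gained by taking the shortcut, minus what it cost afterwards.

    fix_cost stays 0 when the code is deleted without ever being fixed
    (the "default on the debt" case).
    """
    return convenience - impact_per_use * uses - fix_cost


# Shortcut saved 5 days, costs half a day per use, deleted after 4 uses.
print(net_value(5, 0.5, 4))
# Same shortcut, kept for 20 uses and then fixed at a cost of 3 days.
print(net_value(5, 0.5, 20, fix_cost=3))
```

A positive result means the debt was a win; a negative one means the payments outran the original convenience.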

I think the main difference is that technical debt is not fungible, i.e. you can’t necessarily easily choose to pay off the highest-interest technical debts first like you would for your personal financial debt.

Put another way: you can have one item that is 5 days of work but really critical and another that’s 2 days of work but way less critical. If you have 2 days to work on tech debt, you are basically forced to do the 2-day one, especially since you are evaluated on what you finish, not on how much you worked towards some long-term goal.
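The 5-day versus 2-day situation can be sketched as a tiny scheduling problem. The `DebtItem` type, the criticality scale and the greedy rule are all assumptions made for illustration, not a recommended planning method:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    name: str
    days: int
    criticality: int  # higher = more important; made-up scale


def finishable(items: list[DebtItem], budget_days: int) -> list[DebtItem]:
    """Pick the most critical items that can be fully finished in the budget."""
    chosen = []
    remaining = budget_days
    for item in sorted(items, key=lambda i: i.criticality, reverse=True):
        if item.days <= remaining:
            chosen.append(item)
            remaining -= item.days
    return chosen


items = [DebtItem("critical", days=5, criticality=9),
         DebtItem("minor", days=2, criticality=3)]

print([i.name for i in finishable(items, budget_days=2)])
print([i.name for i in finishable(items, budget_days=7)])
```

With a 2-day budget only the minor item can be completed, even though the critical one ranks first; money would let you make a partial payment, finished work does not.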

As financial analogy, I've seen a piece (linked on HN a few years ago) comparing technical debt to unhedged options, meaning you can get a benefit and you might or might not get bitten by it.

In data warehousing and BI, it's MacGyver and data technical debt all the way down. MacGyver because of all the "urgent" reports whipped up for the CEO, duplicate copies of data and the reports done by consultants who barely understand the industry. Data debt because of all the bugs and changes passed down as data from the source system.

It's practically the definition of data warehousing that its whole purpose in life is to deal with everyone else's bullshit. If you want to combine data from different sources, you have to retroactively fix all the mistakes that the data owners made that don't cause issues for them but do cause issues for you.

Story of my life in data science, right there. We work hard to build a culture of shared data ownership, where the data producers have ownership and responsibility of the data they generate, rather than just lobbing garbage over the wall for us to deal with. But it'll always be hard-- as the ultimate end users of the data, data science/analytics/business ops are always going to care most about its quality.

Do any programming paradigms protect better against data debt? The only way that I can imagine to significantly protect against this would be if there was some way to generate data migrations based on type changes.

I don’t know where it is now, but early versions of Angular encouraged you to isolate all your data debt to the service layer. With all the kludges in one place you had a better idea of how bad it was and it was easier to pick a block and insist that it now be handled on the backend.

I don't think there is any technical obstacle or pattern which can prevent dumbasses from shitting things up, humans are just too creative. As soon as you allow any extensibility, someone is going to start shoving integers in as strings, "Y/N" strings as booleans, etc.

It would help a lot if there was a well-formed, unambiguous specification for both sides to hold to. Something like the IETF terminology, in terms of MAY/SHALL, specifying things like "true/false" vs "Y/N", etc. Providing sample responses with decent coverage of the possible options is good as well.

Then you at least have the leverage to say "aha but the spec says it should be like this, why are you doing it wrong".
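As a rough illustration of holding both sides to a spec, here is a minimal sketch of a strict record check. The field names and the `SPEC` table are hypothetical, and a real setup would more likely use a proper schema language; this only shows the idea of rejecting "Y/N" strings and stringly-typed integers at the boundary:

```python
# Hypothetical spec: field name -> the exact type the record MUST carry.
SPEC = {"id": int, "active": bool, "name": str}


def violations(record: dict) -> list[str]:
    """List every way `record` departs from SPEC (empty list = conforms)."""
    problems = []
    for field, expected in SPEC.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # Exact type comparison, so True is not accepted where an int is
        # required (bool is a subclass of int in Python).
        if type(value) is not expected:
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


print(violations({"id": "42", "active": "Y", "name": "x"}))
print(violations({"id": 42, "active": True, "name": "x"}))
```

The first record produces two complaints you can point at the spec for; the second passes cleanly.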

> It would help a lot if there was a well-formed, unambiguous specification for both sides to hold to.

It does sound like you're describing the schema language part of GraphQL. I think that GraphQL is a great tool for making sure that the right stuff goes in and out. Although it's far from solving all input validation problems. Hmm, perhaps you're describing a different problem.

After having worked with GraphQL, user input validation at least seems like a manageable problem. It still seems like there should be even better methods for handling contagion problems in the data of historical mistakes, though.

I don't think there is any programming paradigm that protects against data debt. It is a question of whether the IT team is willing/able to fix the data, with requisite logging of change. This is always the best solution. It often turns into a political nightmare, with the business insisting that data needs to be corrected, IT not having the resources or time, and the change being too risky.

I definitely agree that keeping data migration in mind at a foundational level can be very helpful. The ability to run scripts/regexes easily against the data can make it easier to reason about the consequences of your data, too.

You're triggering me real hard. 11th hour "I don't care how just make the system get to this number" regardless of the garbage number dumped in. Then you're stuck with a permanent bandaid in the core code that will inevitably screw things up in the future all because of one due date that probably didn't even matter anyway.

What about "fear"?

The most pernicious thing about technical debt, in my opinion, is that it creates fear in the sense of "I don't want to touch that module".

Even if you try to be objective and use hard facts to overcome the fear, it doesn't matter, because fear destroys creativity, so you've already lost.

Your tests should reduce that.

I might have missed it, but missing from the taxonomy: "Pay In Full" Debt.

In this debt, you pay the entire cost until the last use of it is cleaned up.

This kind of debt is especially insidious because there is no incremental benefit to cleaning it up.

I'd be curious to hear an example of that (I don't believe I have personally seen one that fits that pattern in the wild yet).

I used to work at a large software company. They had one giant monolith application, about a hundred different modules that did everything from financial registration to displaying media. 30 years old, ported from Clipper to Delphi to .NET. Lots of technical debt, but one issue fits the type of "pay-in-full". They only had a maximum of 3 gigs of RAM to work with because the monolith was a 32-bit application. That was OK for most modules, but some did rather complicated stuff that occasionally required more memory. It caused infrequent out-of-memory exceptions. The cause was one 32-bit library that was very hard to replace. It was used everywhere and stored reports in a proprietary format; the company that made that library went out of business at some point, and no source was available. The company prided itself on backwards compatibility, so it couldn't just dump the library without porting all the reports over.

As far as I'm aware, it's still a 32-bit application.

I'm not sure if it's "technical debt" but backward compatibility between versions can feel like this. Once you decide you stop supporting a previous version, you can rip out all that code (or strange compatibility code paths). Until then you're stuck with the whole thing.

removing a library? you can remove 100 usages of it but not until every single one is gone can you remove it.

porting to a backwards-incompatible language version? you can't use most of Python 3's new features while some part of your codebase is in Python 2

Reminds me of risk analysis: Impact times Probability equals Risk.

Contagion seems like a probability factor. Impact is the cost of leaving things unchanged. Fix cost is the cost of fixing the problem.

Risk management in this context then means comparing Impact cost to Fix cost in terms of impact for the business.

The one difference is that contagion is multiplicative over time (potentially logarithmically, linearly, or exponentially—probably a reasonable definition for 1/5, 3/5, and 5/5 respectively).
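A minimal sketch of what those three growth shapes might look like. The mapping from contagion rating to curve follows the comment above, but the curves themselves (natural log, straight line, doubling per period) and every number are assumptions chosen only to show the difference in scale:

```python
import math

# Assumed mapping from the article's 1-5 contagion rating to a growth shape.
SHAPES = {1: "logarithmic", 3: "linear", 5: "exponential"}


def impact_at(initial_impact: float, periods: int, contagion: int) -> float:
    """Impact after `periods` of leaving the debt alone (illustrative curves)."""
    shape = SHAPES[contagion]
    if shape == "logarithmic":
        return initial_impact * (1 + math.log1p(periods))
    if shape == "linear":
        return initial_impact * (1 + periods)
    return initial_impact * 2 ** periods  # exponential: doubles each period


def worth_fixing(initial_impact: float, periods: int, contagion: int,
                 fix_cost: float) -> bool:
    """Compare projected impact against fix cost, as in the risk framing."""
    return impact_at(initial_impact, periods, contagion) > fix_cost


for rating in (1, 3, 5):
    print(rating, impact_at(1.0, 10, rating))
```

All three start at the same impact, but after ten periods they differ by orders of magnitude, which is why the same fix cost can be a bad deal at 1/5 and an obvious one at 5/5.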

Somewhat aside, but the brain having to "flip" visual information because it's "upside down" seems suspect to me. Turn it sideways while maintaining all the connections it has to the rest of the body, and what changes? Is it getting visual information sideways that it has to rotate now? Probably not.

Moreover the idea that the collection of neurons that your retina connects to has any concept of "orientation" is nonsense to begin with IMO. It's not that "there's an upside-down image that your brain has to fix", it's just that your brain interprets signals from your retina as a picture in your mind, full stop.

Rods/cones in the top of your retina connect to your brain through neurons, so do the ones at the bottom. But to say that "this 'top' retinal cone should really connect to a 'top' neuron in your brain", doesn't even make sense to me. Since when do the locations of the neurons interpreting the input even matter?

It would be the same with hearing too... you have a left and right ear, but if for some reason those were swapped and your left fed things to the right half of your brain and vice-versa, your brain wouldn't be "flipping it back", because how could the absolute location of the neurons interpreting the sounds even matter?

This is the right way to look at it. In fact, your brain is plastic enough that if you wear glasses that flip your vision upside down for several days it will eventually relearn the mapping of retinal cells to neurons so that you see things normally while wearing them. This was studied in the 1890s by a guy called George Stratton.

"Since when do the locations of the neurons interpreting the input even matter?"

Incidentally, these neurons theoretically could go anywhere (as long as they're connected correctly), but in practice they end up arranged retinotopically (https://en.wikipedia.org/wiki/Retinotopy).

I agree, this is a perfect example of a homunculus fallacy (https://en.wikipedia.org/wiki/Homunculus_argument).

Point taken. I maintain that the blind spot is still local debt. ;-)

People don't really understand what the terms 'up' or 'down' mean, in general. Down is the direction of the sum of the gravitational forces. That's why things fall down. The brain doesn't care about the up or down location of nerves, because it is not affected much by gravity; it is held in place. If a nerve is above or below another, the brain doesn't care. On a side note, you should also realize that the direction 'down' is dependent on your location, unless you are on the Discworld.

What if our eyes invert both vertically and horizontally?

> light, coming from your right, hits a cone on the left of your retina. Light coming from above, hits a cone on the bottom of the retina.

[0] https://www.quora.com/Why-does-the-brain-reverse-the-eyes-in...

I would love the idea of a technical credit score. For example, if you’re the kind of dev that racks up technical debt and never pays it down, you should have a shitty technical credit score, and be considered a poor hire. Whereas someone with great credit, would be a great asset to bring onto the team.

Tracking a "credit" score sounds like a good idea, but I would not go as far as assuming that people with bad credit scores are poor hires.

Maybe a person who creates tech debt is really great at prototyping, at fixing urgent issues with unconventional methods (aka MacGyver), or at doing other tasks you find boring. While this person's credit score will be low, such people are also great assets to the team.

In general, this metric could be useful in the same way as tracking the number of pull requests, lines of code, and so on: to spot anomalies and investigate. Maybe that person is suddenly blocked by something, or is overwhelmed and needs help, or just works differently or on different tasks, and the anomalous metric is OK.

A metric like that would also discourage people from actually documenting the debt.

“When a measure becomes a target, it ceases to be a good measure.”

There are writers who just ooze technical depth of understanding. I think it's something to do with trying to explain something at a layperson's level, but leaving many assumptions just there for the reader to follow. It's almost the opposite of baffling with bullshit.

Good read and a really useful concept

Cool write-up and classification system. A category that affects us is VendorDebt: things that are inflicted on us by external vendors. Classifying them in a similar manner might help us decide which vendors to dump.

I am a senior product manager for a large financial technology company.

Over the years I have learnt to become comfortable with allowing my engineering teams to refactor code whilst delivering new functionality.

This has been a process and largely one of trust between me and the engineering leads.

It has also helped that I have seen payback from the investment made in reducing the debt, in terms of us delivering new functionality quicker and with less error-prone code. This payback can take a while to see, though (6+ months, which is a long time for a product person operating in a competitive space!).

Most of my managers don't get this, or if they do, they are so blinded by immediate KPIs from further above that they can't justify it. So in most cases I just tell the engineering guys to add a spread to their estimates to cover the paydown of the debt.

Over the years this has definitely helped me build tighter relationships with engineers, which as any product manager knows can have huge benefits.

I find it's always worth asking "will this get better over time, or worse?" for everything, ever. Folks just fail to see past the next few months; having at least one person in the room asking this question makes them at least ignore it intentionally rather than out of complacency.

"I’ve rarely encountered discussions of contagion."

This surprised me: contagion is a good metaphor because it is a compounding measure of the growth of the problem. Just like an interest rate (a compounding measure of the growth of debt).

Most senior developers I've met have considered the interest rate of the debt, which seems like it has been renamed here as contagion. Maybe I've been lucky to just know smart people!

From the point of view of explaining these concepts, I'd suggest keeping the metaphors consistent. Tech debt should have an amount owed and an interest rate, tech infection (?) should have a potency and a contagion level.

At pretty much every game studio there is an epic internal battle of standard libs vs. custom. std::string vs. [some custom string class] (here it is AString) is usually the spark. A constant of internal game development is that teams think they can always build better strings, lists, dictionaries, collections, etc. than the standard lib, basically assuming the standard lib is as it was in the 90s and that all the work that has gone into it since is bunk. In some cases, if you are really pushing memory and not writing custom allocators or using something like boost, then yes. But in most cases the custom classes, written internally by an ancient from generations ago, are the bigger technical debt.

> One of the best examples of MacGyver debt in the LoL codebase is the use of C++’s std::string vs. our custom AString class. Both are ways to store, modify, and pass around strings of characters. In general, we’ve found that std::string leads to lots of “hidden” memory allocations and performance costs, and makes it easy to write code that does bad things. AString is specifically designed with thoughtful memory management in mind. Our strategy for replacing std::string with AString was to allow both to exist in the codebase and provide conversions between the two (via .c_str() and .Get() respectively). We gave AString a number of ease-of-use improvements that make it easier to work with and encouraged engineers to replace std::string at their leisure as they change code. Thus, we’re slowly phasing std::string out and the “duct tape” interface between the two systems slowly shrinks as we tidy up more of our code.

So now there are two string classes, and that is technical debt. One of them should be consolidated on. The arguments against std::string are sometimes valid, but you can also write custom allocators or use better iterations of the standard lib.

EA even rewrote the whole standard lib as EASTL [1] to address some of these issues, e.g. fragmented memory. Some games require it; for some it is pure ego on the game development team. Game development teams have the highest ego-driven development (EDD) I have ever seen, and lots of tricks that take five minutes to write (but add 2-3 months to testing) and are more spaghetti than templates that write templates.

The one problem that comes with writing your own standard lib, or thinking you are better than boost or similar, is that the learning curve of the internal lib replacements adds technical debt and start-up costs, and the original guy who wrote them is usually long gone. Also, portability suffers in the end, as there are invariably 3-4 versions of the internal libs.

Developers have to weigh the technical debt of custom classes outside the standard libs against the memory issues that may arise. Today most machines are not as affected by memory fragmentation and there is more CPU/memory to go around, and where they are affected you can write custom allocators for std/STL or use something like boost.

I do love Riot Games and all game development teams; it's just that I have never worked in or with one that doesn't have the standard lib vs. custom battle and doesn't waste lots of time when one isn't standardized on or when custom isn't necessary. Some games and game engines require it, and where they do you should fully commit one way or the other. Going custom, though, leads to slowdowns in coding for new devs, and invariably there will be multiple versions of those internal libs over time that add up in the debt department.

[1] https://github.com/electronicarts/EASTL

One of the biggest problems of this is the tribal knowledge that develops around it. I worked at a studio that had something very similar to EASTL, but had joined the studio after an exodus of senior people.

It meant I had no idea how to use the custom libs. No documentation, no one left in the office to tell me how it's used, no Stack Overflow to answer even trivial questions.

I left after less than a year. The studio closed down 2 months after I left.

EASTL mostly doesn't have this problem because it is, as far as possible, a compliant implementation of the STL with a few specific extensions, mostly around memory management. It's not a library that provides STL-like functionality with a different API. Much of the custom allocator stuff has finally now been superseded by polymorphic allocators in C++17.
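For anyone who hasn't used them, here is a minimal sketch of the C++17 polymorphic allocators mentioned above, using only the standard `<memory_resource>` header and nothing from EASTL. `CountingResource` and `upstream_allocations_for` are hypothetical names made up for this illustration. The idea is that a `std::pmr::string` draws from a small stack buffer first and only falls through to the upstream (heap) resource when that buffer runs out.

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>

// Hypothetical upstream resource that counts how often the heap is hit.
class CountingResource : public std::pmr::memory_resource {
public:
    std::size_t allocations = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        ++allocations;
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Builds a string of `length` characters out of a 256-byte stack arena and
// reports how many times the arena had to fall back to the upstream heap.
std::size_t upstream_allocations_for(std::size_t length) {
    CountingResource upstream;
    std::byte buffer[256];
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer), &upstream);
    std::pmr::string text(length, 'x', &arena);  // allocates through the arena
    return upstream.allocations;
}
```

Calling `upstream_allocations_for` with a length that fits in the 256-byte buffer should never touch the heap, while a length of a few thousand characters has to go upstream. The container type stays a standard one, which is the point being made about not needing a different API.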

Source: former maintainer of EASTL (not the original author).

I would imagine that being a larger organization, there are dedicated resources for at least some documentation/POC's for questions/changes/etc.

The smaller the company, the fewer resources you have for maintenance, and the more issues you're going to run into.

There's absolutely a "problem domain debt". If you are a game company, your problem domain that gets you paid is your game(s). Time spent rebuilding standard libraries is time not working directly on the game, and maybe time not getting properly "paid". Meanwhile, there are already people paid to work on the standard libraries, and it's their job to make those work and continue to improve them.

Certainly there are tradeoffs where you may have to know the standard libraries well enough to know their performance characteristics, or how best to mitigate worst-case scenarios, but if the people paid to build standard libraries are doing their jobs (that you pay them for when you buy that compiler), it should be less debt to work around an existing solution than to build one from scratch.

If I had a nickel for every jackass who thought he understood URL parsing well enough to do it by hand instead of using the goddamned builtin library like a sane person and get all of its sanity checks and corner cases handling for free, I could retire (and would be a lot less bitter).

It’s not just game studios who rewrote the standard library. I’ve seen non game companies with their own containers too. For those cases, it’s always just very old legacy code that nobody wants to touch, but if they were to do it again fresh today, they’d just use what comes with the language.

For places that have their own legacy containers and actively try to move more code to them—I dunno! I think at some point back in the 90’s the standard library got the reputation of being junk (perhaps rightfully) among game programmers, and this belief has been cargo culted all the way into 201x. Who knows.

You're missing the point of reimplementing some or all of the standard libs. Similarly, disabling C++ exceptions and RTTI is a very common practice in gamedev.

Sometimes you reimplement a certain standard class (vector, string...) to adapt it to the very needs and usage patterns you have. Standard libs tend to be too general, plagued with allocations and other useless (in this specific context) behaviors that may negatively impact your performance/cache friendliness/memory fragmentation...

I agree a simple tiny game doesn't need all of these but when you need to squeeze all the performance you can there's no other option.

So please, do not just dismiss all the gamedev wisdom like that.

> You're missing the point of reimplementing some or all of the standard libs. Similarly, disabling C++ exceptions and RTTI is a very common practice in gamedev.

I clearly stated there are good reasons to do so and some games do require it. Mostly though they don't.

> disabling C++ exceptions

throw is pretty lightweight unless you use the exception objects, you can simply not catch exceptions, and you can also pass -fno-exceptions.


RTTI merely helps with casting. Usually none of that is going on at runtime, as game loops need to be clean and perform zero allocations if possible; everything should already be loaded in memory, and the architecture of the game loop and game can remove this concern. You can also disable RTTI with -fno-rtti and enable it per class with virtual void nortti(); per class, or on the MS compiler with __declspec(novtable) per class.

Rarely do exceptions or RTTI affect the game loop and framerate as most of that should not be needed during runtime game loops.

Usually the complaints that are valid are about allocations/fragmentation (though you can also write custom allocators and other solutions) and, like you mentioned, about the code style/API style. Not using the STL can also be a simplification, but usually the custom libs start to grow until they re-implement much of the same functionality.

>> EA even rewrote the whole standard lib EASTL [1] to adjust for some of these issues i.e. fragmented memory. Some games require it, some it is pure ego in game development teams.

In engineering there has to be a GOOD REASON(s) to start maintaining buckets of new code and libs. There are also ways to do it that still allow for most of the standard and promote documentation and understanding to it.

EASTL is a great way to go about it and I linked to it to demonstrate that.

I was mainly calling out using both standard and custom; that seems like more technical debt. If you truly do need custom libs then go all in, because having both leads to more problems. But understand that there is weight/debt to going custom and it isn't always better.

> So please, do not just dismiss all the gamedev wisdom like that.

In no way did I dismiss it, I just said there is a constant of this battle (stl/boost/others vs custom) in all gamedev studios and many times it is unnecessary bike shedding and yak shaving that doesn't have a runtime difference on the game or make the game better.

Find me a game studio that doesn't have a stl/standards vs custom battle and I say... wait for it...

You would think the approach taken with this would be to just use the standard libs until they're actually contributing to a bottleneck and then worry about optimizations. Doing it prematurely does seem to be a problem. Though I don't think having custom alternatives to parts of the stdlib is bad if you're actually making a meaningful optimization.

Premature and/or misplaced optimization. It’s kind of funny that they worried so much about std::string’s performance that they rolled their own, yet have this big honking lua layer thunked on top of everything for game logic. Wow!

While we're not perfect, I'm happy to defend this decision. ;-)

We did a ton of tracing and perf/memory captures to identify that string allocations were a significant drain in many locations in the code. We don't see those issues with our other uses of std:: (vector, unordered_map, set, etc.), just with std::string. So it was a logical place to do targeted optimization.

We did that optimization before lua because of the fact that there's a very clean way to make the foundational debt into MacGyver debt, since there's a trivial conversion between std::string and AString. Sadly we haven't been able to come up with any bite-sized moves that we can do to phase out the wasteful use of lua as kvp storage buckets. It's an all-or-nothing problem that makes it a much bigger chunk of work to undertake.

Great answer, thanks for the insight. I'm really puzzled at how aggressively many people respond when someone refuses to use standard libraries (or practices).

We were in a similar situation and using our own string implementation improved performance and reduced memory fragmentation.

Some people refuse to accept that certain software cannot rely on general-purpose libraries and needs to roll its own solution adapted to its specific needs.

Thanks for both your answers. It was not my intention to come off as aggressive there. Key takeaway is to measure rather than take it on faith that the library is the bottleneck. Sadly not everyone does measure.

Not only that, if you want good framerates and 60fps you aren't allocating at runtime; anyone who is doing that at a game dev studio is taken out back and either shot or sent to work on the cow clicker.

Usually standard lib vs. custom arguments end up in the weeds like tabs vs. spaces at game companies, but ultimately they have almost nothing to do with framerate or runtime. Largely it is about that EGO. Why maintain a standard lib instead of improving gameplay and networking? Well, some want to be a lord in their fiefdom, where they are the controller of the code and the ring.

Riot Games has std::string and AString, but what happens when player two enters the game and you've got BString? Then BString invites its friends and you've got CString and DString. Now your 'standard' has many standards and is more standardy, with warring lords within internal factions like a Game of Thrones.

My company's string container is definitely MacGyver debt. At some point in the distant past we had to worry about having both Pascal-style and C-style strings...

This seems far too focused on dev tech debt, which has a very narrow scope. I like the article, so I'm not knocking it, just offering a little perspective. As a senior sysadmin in the past my primary issues have been technical debt across the entire board, number one being too few hires for too much workload due to cheap or nearsighted execs, but I would definitely agree that contagion is a great term for how techdebt grows faster the longer it's left alone.

It's worth remembering the CTO and senior sysadmin and a few others are dealing with all the tech debt of the entire company and IT department of which dev is only a subset (of course this depends on the company, but on HN sometimes I see convos like this where it feels devs are just talking at each other and not receiving much outside feedback.)

I'm not surprised at all to hear that I have blind spots outside of "dev". I've been working on shipping games for a decade, so I'm very fixated on the types of stuff I run into day-to-day in that dev process.

Nothing wrong with that at all. Has that all been on LoL? You guys did a lot right with it from the start. Unfortunately I burnt myself out on mobas due to HoN.

I've been binge listening to Software Engineering Radio for the past few months. I am currently listening to an episode where they are talking about technical debt.

He has the opinion that clean code is not as important as shipping code - ship the code first and then refactor as needed after you get customers.


Definitely. I build a lot of MVPs. So I focus on "make it work, then make it work well".

Shipped code is so much more valuable than unshipped code :-)

In my experience often the term 'technical debt' will be hijacked by product-oriented folks resulting in feature debt being presented as tech debt.

“We gave AString a number of ease-of-use improvements that make it easier to work with and encouraged engineers to replace std::string”

Are you absolutely sure this itself won’t become Foundational technical debt? You seem overly confident, given the metrics, that replacing std::string is a good decision.

We can't know for certain. But we've had a significant, measurable reduction in CPU cost due to "hidden" memory allocations from things like passing a char* into a function that takes a std::string and stuff like that. (I may be being mildly inaccurate, as I wasn't the guy doing the perf captures etc. I just talked to him about it).
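As a rough illustration of that kind of "hidden" allocation (a sketch, not Riot's actual measurement setup): passing a `const char*` to a function that takes a string by const reference builds a temporary string, which allocates once the text is longer than the small-string buffer. `CountingAllocator`, `CountedString`, and the helper functions below are hypothetical names; the counting allocator stands in for the default one purely so the allocation becomes visible. `std::string_view` is shown as one standard C++17 way to avoid the temporary, which is not necessarily what AString does.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>

inline std::size_t g_allocations = 0;  // bumped on every allocate() call

// Minimal allocator that counts allocations and otherwise defers to
// std::allocator. Illustration only.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>{}.allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>{}.deallocate(p, n);
    }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;

// Takes an owning string: a const char* argument is converted to a temporary.
std::size_t length_by_string(const CountedString& s) { return s.size(); }
// Takes a non-owning view: no temporary string is created.
std::size_t length_by_view(std::string_view s) { return s.size(); }

// Deliberately longer than typical small-string buffers.
constexpr const char* kLongName =
    "a player name that is longer than any small-string buffer";

std::size_t allocations_via_string() {
    const std::size_t before = g_allocations;
    length_by_string(kLongName);  // implicit temporary, allocates
    return g_allocations - before;
}

std::size_t allocations_via_view() {
    const std::size_t before = g_allocations;
    length_by_view(kLongName);  // just a pointer and a length
    return g_allocations - before;
}
```

The call site looks identical in both cases, which is why this cost tends to show up only in a profiler.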

I'm particularly impressed by AStackString, which is a subclass that has initial memory allocated on the stack, but automatically converts to dynamic allocation if you exceed that space. So we get quick stack allocation by default, but it will safely handle when it needs to expand.
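The stack-first idea can be sketched in a few lines. To be clear, this is not AStackString, whose implementation isn't shown anywhere in the article or this thread; `StackFirstString` is a hypothetical, append-only stand-in that keeps its characters in an inline buffer and moves them to the heap only when they no longer fit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical illustration of "stack storage first, heap on overflow".
template <std::size_t InlineCapacity>
class StackFirstString {
public:
    StackFirstString() { inline_[0] = '\0'; }
    // Non-copyable and non-movable to keep the sketch short.
    StackFirstString(const StackFirstString&) = delete;
    StackFirstString& operator=(const StackFirstString&) = delete;

    // Appends a NUL-terminated string. `text` must not point into this
    // object's own storage, since growing may replace that storage.
    void append(const char* text) {
        assert(text != nullptr);
        const std::size_t extra = std::strlen(text);
        reserve(size_ + extra);
        std::memcpy(data() + size_, text, extra + 1);  // includes terminator
        size_ += extra;
    }

    const char* c_str() const { return heap_ ? heap_.get() : inline_; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return heap_ != nullptr; }

private:
    char* data() { return heap_ ? heap_.get() : inline_; }

    // Ensures room for `needed` characters plus the terminator.
    void reserve(std::size_t needed) {
        if (needed <= capacity_) return;
        std::size_t grown = capacity_ * 2;
        if (grown < needed) grown = needed;
        auto bigger = std::make_unique<char[]>(grown + 1);
        std::memcpy(bigger.get(), data(), size_ + 1);  // carry over contents
        heap_ = std::move(bigger);
        capacity_ = grown;
    }

    char inline_[InlineCapacity + 1];  // lives wherever the object lives
    std::unique_ptr<char[]> heap_;     // null until the inline buffer overflows
    std::size_t size_ = 0;
    std::size_t capacity_ = InlineCapacity;
};
```

A local `StackFirstString<64>` would handle most short strings without touching the allocator, and `on_heap()` makes it easy to check in a test when the fallback kicked in.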

Most of the quality-of-life stuff is around having built-in support for printf-style formatting and string searching (including case-insensitive).

I love this article. Quickly breaks down the types of debt.

(MacGyver's name is Angus!?)
