Although it describes the issue pertaining to statistics + machine learning, this is also exactly what ends up happening with a large codebase that has no clear requirements or test cases, where people just make incremental, piecemeal changes over time. You end up with an application that has been trained on (overfitted to) historical data and use cases, but that breaks easily on new variations that differ only trivially from anything the system has handled before, variations that a better-designed, cleaner, more abstract system would be able to deal with.
Given how much poor coding practices resemble machine learning (albeit in slow motion), it's hard to hold too much hope about what happens when you automate the process.
I especially noticed this in libraries/packages that were "community owned" in a company--instead of one team owning the package and being the authority on deciding the long term roadmap and communicating with other teams about feature requests, deprecations, documentation, bug fixes, etc, the community at large, where "community" was very broadly defined as a team that for whatever reason had an interest in using/maintaining/adding onto the package, would collectively own the package.
Naturally, the result was exactly the scenario you described. Each team hacked on their own bit of functionality for their specific purpose, while doing their best to not affect or break the increasingly precarious tightrope of backwards compatibility. There was no long term architectural vision, so there was a definite need for refactoring--and yet no team had the incentive to invest the amount of time needed to do that. The documentation was woefully incomplete as well, and few people understood how the entire thing worked, since each team would only interact with their small fraction of the code.
1. Don't fear the refactor.
2. If you don't want to rebuild your entire application from scratch, don't worry, a competitor will do it for you.
There's nothing wrong with creating something in increments. It's the fear of revisiting something that destroys a code base.
Technical debt, much like regular debt, can also be used as leverage to quickly gain a competitive advantage. While your competitors are busy refactoring/rebuilding perfect applications while hardly creating any more customer value, the scrappy startup that writes piles of spaghetti code might be building exactly what customers want.
Code quality != business value.
This is one of the reasons why you must not fear the refactor. Sometimes you need to get that code out the door because the business requires it. Then you need to pay back the debt -- by refactoring that mess every time you touch it in the future.
There is no such thing as "technical inflation" to magically wipe away our debt. It's important to have good lines of communication so that the business doesn't get used to squeezing development in order to eat pizza (because, why not? It's free!)
Much like regular debt, if you don't repay it, you go out of business and end up penniless.
I don't think that's a given: in some circumstances code quality absolutely is business value. It might be better to say code quality can be, but isn't always, business value. As ever, context is the deciding factor.
So yeah, technical debt can be used as a tool, but it doesn't come for free.
Add to this that the business people may have a poor grasp of the true cost of the hack, and the developers little insight into its business value, and you get the current situation.
Unlike regular debt, technical debt is extremely hard to quantify.
You can't balance a business strategy if you can't estimate how much you're going to pay.
There is the story by Robert C Martin about the company that made a really good C debugger back in the day. Then C++ came out and the company promised to make a version for it. Well, months came and went and eventually they went out of business. Because the first version of the debugger they wrote was awful code, changes were really hard to make, and so they couldn't adapt to the changing market.
Like most things in life, there is a balance. I have argued against large refactors many times. Often wanting to do a refactor is just a thinly disguised excuse to use some new technology (I'm as guilty of this as anyone else). Anytime a refactor comes up my goal is to figure out why:
1) What will the refactor fix?
2) What will the refactor potentially break? Are there tests around critical functionality?
3) Does the group proposing the refactor really understand the ins and outs of the application? When new people come into a system they often want to change it to fit their mental model of the problem, and miss subtleties of why the system is a certain way.
That being said, I evaluate small refactors anytime I have to touch a piece of code.
I often _redesign_ old code to meet new requirements and to support new features, but I would not call it refactoring.
I always strive to leave the code better than when I found it. But I would not name it refactoring.
Fixing might be a presentation, tests, documentation, refactoring, rewrite, deprecation, whatever. Just don't let it languish and the fear grow.
I've expanded on these thoughts before on my blog about technical debt inflation if anyone is interested https://scalabilitysolved.com/technical-debt-inflation/
I suspect this is something "software engineering" researchers might study.
It maybe doesn't fit the metaphor quite as well, but as an operations person, I've frequently run into the "underfitting" problem. For example, we run Chef to manage our physical and virtual infrastructure. There are a ton of community-authored Chef cookbooks available. Which at first blush, sounds great. But often, they have grown over time to become these awful hydras that try to be all things to all people. PR after PR has added support for the specific use case of every organization that wants to run the cookbook in their own special way. The "Getting Started" section of the README eventually becomes a dumping ground of 900 attributes you need to set correctly, and yet somehow it still doesn't quite perform how you'd like.
In many cases, we've tried to use community cookbooks and even merge our own customizations back upstream. Only to eventually give up and write our own version that's 50 lines of Chef DSL/Ruby instead of 5,000 but does exactly what we need, the way we need, and no more. It's very possible to make a system too generic and configurable, to the point where it loses all meaning.
Glad to hear we're not the only ones who found the community ones not perfect for every need.
Welcome to software development! Not as easy as it looks is it :)
EDIT you may find these articles helpful (or at the very least food for thought):
But when the “goal” of the system is just “arbitrary short term desires of management” you can easily point out the problems, but there is no agreement on what constraints you can use to trade-off against it.
Especially for extensibility, where you can get carried away easily with making a system extensible for future changes, many of which turn out to be wasted effort because you did not end up needing that flexibility anyway, and everything changed after Q2 earnings were announced, etc.
In those cases, it can actually be more effective engineering to “overfit” to just what the management wants right now, and just accept that you have to pay the pain of hacking extensibility in on a case by case basis. This definitely reduces wasted effort from a YAGNI point of view.
The closest thing I could think of to the same idea of “regularizing” software complexity would be Netflix’s ChaosMonkey, which is basically like Dropout but for deployed service networks instead of neural networks.
Extending this idea to actual software would be quite cool. Something like the QuickCheck library for Haskell, but which somehow randomly samples extensibility needs and penalizes some notion of how hard the code would be to extend to that case. Not even sure how it would work...
ChaosMonkey: https://github.com/Netflix/chaosmonkey
Dropout: https://en.m.wikipedia.org/wiki/Dropout_(neural_networks)
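To make the Dropout analogy a bit more concrete, here is a minimal sketch of chaos-style fault injection. It is not Chaos Monkey's actual API: `Replica`, `call_with_failover`, and `chaos_round` are hypothetical names made up for illustration, and the "service" is just three in-memory objects. Each round disables one randomly chosen replica, checks that a request still succeeds, then restores it.

```python
import random


class Replica:
    """Hypothetical stand-in for one instance of a deployed service."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all replicas are down")


def chaos_round(replicas, rng):
    """Disable one random replica, send a request, then restore the replica."""
    victim = rng.choice(replicas)
    victim.alive = False
    try:
        return call_with_failover(replicas, "ping")
    finally:
        victim.alive = True


rng = random.Random(42)
replicas = [Replica(f"replica-{i}") for i in range(3)]
results = [chaos_round(replicas, rng) for _ in range(20)]
print(f"{len(results)} chaos rounds survived")
```

A system that only works when every replica is up would fail the first round in which its preferred replica is the victim, which is roughly the kind of brittleness Dropout is meant to discourage in a network.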
To use a real world example, financial models on mortgage backed securities were the root cause of the financial crisis, because they were based on decades of mortgages that were fundamentally different than the ones they were actually trying to model. Even if someone was constructing a model by training on data from say, 1957-1996, and validating using 1997-2006, they would have failed to accurately predict the collapse because the underlying factors that caused the recession (the housing bubble, prevalence of adjustable rate mortgages, lack of verification in applications) were essentially unseen in the decades of data prior to that.
Validation protects against overfitting only to a certain degree, and only to the extent that the underlying data generating phenomena don't ever change, which, in the real world, is generally a terrible assumption.
It's not about outliers. Let's say you're at a startup and you fit some model to your first 30 customers. It works great for your next 10 customers, but fails dramatically for your first enterprise client. Why? Because the enterprise client was fundamentally different from your previous 40 customers. If you fit your model on a population in which the relationship looks one way, then try to apply your model to a population with a different relationship, it will fail.
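A toy version of that startup scenario, as a sketch only: all the numbers are invented, and `fit_line` and `mean_abs_error` are hypothetical helpers. A straight line is fitted to the first 30 small customers, checked on the next 10 from the same population, and then applied to enterprise clients whose spend follows a different relationship.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_abs_error(model, points):
    a, b = model
    return sum(abs(a * x + b - y) for x, y in points) / len(points)


rng = random.Random(1)

# Assumed small-customer population: spend is about 2 units per seat, plus noise.
small = [(seats, 2 * seats + rng.gauss(0, 1)) for seats in range(1, 41)]
rng.shuffle(small)
first_30, next_10 = small[:30], small[30:]

# Assumed enterprise population: flat fee plus a heavily discounted per-seat rate.
enterprise = [(seats, 300 + 0.5 * seats) for seats in (500, 1000, 2000)]

model = fit_line(first_30)
val_error = mean_abs_error(model, next_10)
enterprise_error = mean_abs_error(model, enterprise)
print(f"error on next 10 customers: {val_error:.1f}")
print(f"error on enterprise clients: {enterprise_error:.1f}")
```

The held-out small customers look fine because they come from the same population as the training data. The enterprise error is large not because of noise or outliers but because the relationship itself is different.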
Machine learning and statistics are both applications of the same principles of probability and information theory. They work (for the most part) by modeling the world through the relationships between random variables. A random variable can be any natural process that we can't express in precise terms, so we express it in probabilistic terms.
This is the same principle underlying the premise that "past results do not guarantee future success." The relationships between random variables in the world that affect success in anything -- stock market performance, legal outcomes, etc. -- might not be the same tomorrow as they are today.
And that's not even a matter of overfitting. That's just your ever-present real-world threat of having all your modeling work invalidated by forces outside your control. Overfitting happens when you, the data scientist, fit your model to random noise in the training data. An overfitted model will have bad generalization performance on held-out samples, even from the same population. It's not always easy or possible to detect overfitting, especially with small training sets.
Or at least understand that you're entering a new market and budget appropriately for development. Usually, if you're switching from prosumer to enterprise, you are very, very lucky if the sum total of changes you need to make is training a new machine learning model. To start out with, you usually need to get used to sales cycles that take 6-18 months, hiring a dedicated sales guy to manage the relationship, and handling custom development requests.
Hopefully you also listen to them.
Even if you KNOW that your model is not-wrong in the right direction and within acceptable orders of magnitude, how do you fit the parameters for that structural model? You need some kind of data, even if you're just using anecdata to pick magic constants.
Fortunately models like these are often testable across many contexts, amenable to metastudies, available for calibration, etc.
When building a model, you divide your data into two parts, the training set and the testing set. The training set is usually larger (~80% of your original data set, although this can vary), and is used to fit your model. Then you take the remaining data you set aside as the testing set, use your model to generate predictions for it, and compare those predictions to the actual values.
You can then compare the accuracy of the model on the training and testing sets to get an idea of whether your model generalizes well to the real world. If, for example, you find that your model has an accuracy of 95% on the training data but 60% on your testing data, that means your model is overly tuned to features of the data used to build it that may not actually be helpful for prediction in the real world.
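As a minimal sketch of that procedure, using only the Python standard library: the data is synthetic (an assumed threshold rule with 20% of labels flipped), and the "model" is a 1-nearest-neighbour lookup chosen deliberately because it memorizes its training set. `make_data`, `predict_1nn`, and `accuracy` are hypothetical helper names.

```python
import random


def make_data(n, noise, rng):
    """Points on [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the single closest training point (pure memorization)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, samples):
    hits = sum(predict_1nn(train, x) == label for x, label in samples)
    return hits / len(samples)


rng = random.Random(0)
data = make_data(200, noise=0.2, rng=rng)
rng.shuffle(data)

split = int(0.8 * len(data))  # ~80% for training, the rest held out for testing
train, test = data[:split], data[split:]

train_acc = accuracy(train, train)  # every point is its own nearest neighbour
test_acc = accuracy(train, test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the model has memorized the noisy labels, it scores perfectly on the training set while the held-out score is the honest estimate. The gap between the two numbers is the signal to look for.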
Just some random thoughts in no particular order - curious what you make of them:
- On the subject of incremental piecemeal changes over time with no requirements: don't you all find that in your workflows (when you're doing something for yourself), it is hard to step back and "architect" something? It is easier to just let it evolve.
- Likewise it takes real work and thought to organize something as simple as a spice rack. (I just keep opened packages of spices in the cupboard.) The knowledge that company is coming is one of the few pushes. But it kind of feels like it's being done for show.
- It's hard to add architecture when you know there's no team that is coding against it as an API. It's just you. It feels like that extra power is kind of wasteful.
- The other thing is that it may be the case that you know there is some deeper level of architecture. In the case of my spices, for example, most of the opened spice packets I mentioned are actually mixes. (Such as grilled chicken spice mix.)
- If I had to architect my own spice rack, I should start by learning which spices I'm actually using more of. And since what I'm doing works, I don't actually care. Plus, it would be a step down: the first time I mixed my own spices, I would probably end up with a worse dish than pouring some out of a premixed packet.
- The first time you architect a "proper" framework rather than let your machine learning algorithm "overfit", the result is probably demonstrably worse.
- That is a lot of pressure toward not architecting, and just continuing to (over)-fit.
The lifehack is to throw all your spices in a box and only pull them out when you need them and then leave them on the rack. Then throw away any spice you haven't used in n months and add it to a blacklist. The ones you use frequently should be prominently displayed and treated with extra care and possibly set up for autorenewal from the grocery.
Only introduce new spices when there's a recipe, and buy just the amount you need.
So too with code. Log your code paths, prune little used features, optimize the hell out of the most frequently used ones, introduce features sparingly and with purpose...
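One rough way to get the "which spices do I actually use" data for code is to count calls. A sketch, assuming an in-process counter is good enough; `tracked`, `call_counts`, `rarely_used`, and the two export functions are all made-up names for illustration.

```python
from collections import Counter
from functools import wraps

call_counts = Counter()


def tracked(func):
    """Decorator that counts how often a feature's entry point is called."""
    call_counts[func.__name__] += 0  # register the feature even if never called

    @wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1
        return func(*args, **kwargs)

    return wrapper


@tracked
def export_csv(rows):
    return ",".join(rows)


@tracked
def export_xml(rows):
    return "".join(f"<row>{row}</row>" for row in rows)


def rarely_used(threshold):
    """Names of tracked features called fewer than `threshold` times."""
    return sorted(name for name, count in call_counts.items() if count < threshold)


for _ in range(5):
    export_csv(["a", "b"])

print(call_counts.most_common(1))  # candidates for optimization
print(rarely_used(threshold=1))    # candidates for pruning
```

In a real system the counts would need to go to logs or metrics and be collected over the "n months" window before anything is removed.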
I like this spice metaphor, thanks for it.
The tricky bit is mostly that you need a new theory of the data to have a better abstraction.
Models generated by DL lack even a paradigm or theory or abstraction.
One of the problems I’ve seen in research into technical debt is the lack of a good definition. This insight could form the basis of one.
Your whole argument seems to be based on your personal experiences. Perhaps it is also thus vulnerable to some sort of overfitting :)
Pruning is also the common ML practice to prevent statistical overfitting.
Previous HN Discussions
"The Morning Paper" has a nice summary
I feel like there are a few of these frequent flyers... is there any way to figure out what they are?
Perhaps the best overall wisdom this paper tries to impart is this: build awareness, culture, and tooling around your ML systems, both upstream and downstream. Never stop exploring and improving. Relentlessly try to slim down your models, simplify your pipelines, and bring people together to talk about all kinds of dependencies.
If you are smart, then what you're doing is probably easily transformable to the set of rules you had before. At that point you can compare why exactly it's so good at the metric you're measuring so there's no "cheating".
Sadly, most of the ML consultants just take example code from one of the tutorials and then show you the metric it generates after running it.
There is no reason why the consequences of false positives and false negatives cannot be incorporated in the model itself. In fact for certain kinds of systems such as 'alarms', or 'imbalanced classes' this is pretty standard.
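A minimal sketch of one standard way to do this, assuming the model outputs a calibrated probability and that the two error costs can be put on a common scale. `should_alarm` and `cost_threshold` are hypothetical names and the costs are invented.

```python
def should_alarm(p_event, cost_false_positive, cost_false_negative):
    """Alarm when staying silent has the higher expected cost."""
    expected_cost_if_silent = p_event * cost_false_negative
    expected_cost_if_alarm = (1 - p_event) * cost_false_positive
    return expected_cost_if_silent > expected_cost_if_alarm


def cost_threshold(cost_false_positive, cost_false_negative):
    """Probability above which alarming is the cheaper decision."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


# Assumed costs: a missed event is ten times worse than a false alarm.
print(cost_threshold(1, 10))
print(should_alarm(0.2, cost_false_positive=1, cost_false_negative=10))
print(should_alarm(0.2, cost_false_positive=1, cost_false_negative=1))
```

With equal costs the threshold is the usual 0.5. Making misses more expensive pushes the threshold down, so the same 20% prediction flips from "ignore" to "alarm". Weighting classes or examples during training is another common route to the same goal.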
Anyway, the misclassifications are much the same as with the original system, in fact on the same parts, only with far lower incidence. So to me it looks as if the ML system simply managed to extract (automatically) a lot more features than I would have time to do by hand. On top of that, it adapts more easily to new, previously unseen content, because I don't need to come up with a bunch of (reliable!) rules to tell those parts apart from the previous ones (this does require a complete retraining of the net).
For some subset of the problems available ML works very well indeed, for others it may be a marginal improvement and in many cases ML is just dragged in to a project even though it has no place there. If you're in the first category: consider yourself very lucky and reap the benefits.
How to recognize which problems are well-suited for ML? Are there any rules of thumb for (relative) laymen already?
After having built and run a rule-based system for a while, you always get tremendous subject matter expertise, a feel for what works.
Any rewrite of the system at that point will lead to much improved accuracy. The clarity is reflected in a better choice of the input signals, features, data preprocessing, metrics, workflows…
A "magic ML" (without domain understanding) beating well-tuned SME rules is a dangerous fantasy, in any non-trivial endeavour. In other words, without that clarity, you're better off gaining it first through simple iterations of rules, figuring out what matters.