
An interesting thing about this article, and one that's potentially misleading to beginners trying to understand various machine learning and stats techniques, is that, despite what the article says, it is not at all apparent that the polynomial model of degree 3 '"sticks" to the data but does not describe the underlying relationship of the data points'.

On the contrary... for this toy example, doesn't it look pretty good? There's really not enough information here to decide whether the model is actually over-fitting or not, and this can easily mislead the beginner into wondering "just why the hell are we taking that awesome model and doing some regularisation thingy to choose a worse model... which is then... better?"
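
For what it's worth, the only way to actually settle that question is to hold data out: fit on one part of the sample, score on the part the fit never saw. A minimal sketch of that check (synthetic data, since the article's points aren't reproduced here, and scikit-learn purely for convenience):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)

    # Hypothetical stand-in for the article's data: a cubic-ish trend plus noise.
    x = rng.uniform(-3, 3, size=40).reshape(-1, 1)
    y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.normal(scale=2.0, size=40)

    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

    for degree in (1, 3, 9):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_tr, y_tr)
        print(degree,
              mean_squared_error(y_tr, model.predict(x_tr)),   # in-sample error
              mean_squared_error(y_te, model.predict(x_te)))   # held-out error

If the held-out error for degree 3 is in the same ballpark as its in-sample error, it isn't over-fitting in any sense that matters; you can't read that off the picture alone.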

To truly understand, you've got to tackle:

1. What is overfitting?

2. Why/when are too many parameters a problem?

Now... I don't know how intuitive this is for others, but I like to tell people that over-fitting is a fancy word for "projecting too much about the general from the specific".

So why does that matter and what does it have to do with too many parameters?

Well, let's say you've got a sample of men and women, and in this case you're trying to predict underlying rates of breast and testicular cancer (I'm assuming these are primarily gender-related for my example), and the "real" relationship is indeed just gender: whether the person is male or female determines the basic underlying risks of these cancers. That's not very many variables for your model. But let's say that, in your sample, several of the people with testicular cancer are named "Bob" and several of the people with breast cancer are named "Mary", so you add more variables, binary variables, which indicate whether a person is called "Bob" and whether a person is called "Mary", and suddenly your model's prediction of cancer within your sample goes through the roof... and yet when you apply it to the population at large, not only does it not predict cancer any better, but suddenly there are all these angry letters from Bobs and Marys who were told they might have cancer. In fact, it's doing worse than if you hadn't included those variables at all. What's going on? You overfit.
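
If you want to see that anecdote in numbers, here's a rough sketch with made-up data: risk depends only on sex, the "name" columns are pure noise, and the fit with the extra columns looks great in-sample but no better (typically worse) on the wider population. The rates and feature names are invented for illustration, and scikit-learn is used purely for brevity:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)

    def make_data(n):
        # The "real" relationship: risk depends only on sex (toy rates, not real ones).
        male = rng.integers(0, 2, size=n)
        risk = np.where(male == 1, 0.30, 0.02)
        cancer = rng.random(n) < risk
        # 50 made-up name indicators ("Bob", "Mary", ...): pure noise.
        names = rng.integers(0, 2, size=(n, 50))
        return male.reshape(-1, 1), names, cancer

    male_tr, names_tr, y_tr = make_data(60)      # the small sample you fitted on
    male_te, names_te, y_te = make_data(20000)   # "the population at large"

    sex_only = LogisticRegression(max_iter=1000).fit(male_tr, y_tr)
    with_names = LogisticRegression(max_iter=1000, C=1e6).fit(
        np.hstack([male_tr, names_tr]), y_tr)    # C=1e6 ~ effectively no regularisation

    print("sex only:  ", sex_only.score(male_tr, y_tr), sex_only.score(male_te, y_te))
    print("sex+names: ", with_names.score(np.hstack([male_tr, names_tr]), y_tr),
          with_names.score(np.hstack([male_te, names_te]), y_te))

The exact numbers move around from run to run, but the pattern is the one described above: the extra columns buy in-sample accuracy and nothing (or less than nothing) out of sample.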

So you see, in many models, adding in more and more variables can lead you to do better in your sample, but at some point can actually make your model worse. Why does this happen?

Actually, amongst many machine learning and statistical algorithms, there's a pretty intuitive explanation... once it's been explained to you.

Let's say that your model only had variables to indicate gender at first, and you come along and you throw in a handful more. You're judging your model's performance on its prediction in your sample population. What could many machine learning algorithms do here? Well, for each new variable you introduce, one option is to do absolutely nothing. And if the algorithm chooses to do nothing, what you've actually got is your original gender indicators: you've gained nothing, but you've lost nothing (well, aside from adding more parameters and algorithmic inefficiency/complexity). But most (almost all) methods are not that precise or accurate. So what else could happen? Well, each parameter you add has a small random and statistical chance of increasing your model's predictiveness in your sample. We used the example of "Bob" and "Mary", but the people with cancer in your sample could have all sorts of qualities, and as you just throw more variables/features at your algorithm, it will eventually hit some that, although having no explanatory power in the population at large, do correlate with statistical quirks of your sample. "Blue eyes", "four toes", "bad breath", "got a paycheck last week", that sort of thing. It's a form of data-dredging, and it's far more widespread professionally than I'd like :P And if you keep throwing variables at it, eventually, many algorithms will naively keep those characteristics that are overly specific to describing your sample but don't describe the population at large.
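
You can watch that mechanism in isolation: regress an outcome on columns of pure noise and the in-sample fit only ever improves as you add columns, even though there is literally nothing to explain. A quick synthetic sketch:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    n = 50
    y = rng.normal(size=n)             # an outcome with nothing to explain

    for k in (1, 10, 25, 45):
        X = rng.normal(size=(n, k))    # k columns of pure noise
        r2 = LinearRegression().fit(X, y).score(X, y)
        print(f"{k:2d} noise features -> in-sample R^2 = {r2:.2f}")

With 45 noise columns and only 50 observations the in-sample R^2 comes out around 0.9, purely by soaking up the quirks of the sample.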

And that's why we might want to "regularise". We want there to be a cost to adding variables to the model, to make the phenomenon of including statistically spurious variables like this far less likely. The hope is that strong, generalisable variables, like male/female, will overcome this cost, while spurious ones added randomly or to game some metric will be less likely to pass that extra hurdle. To use a signal analogy: by imposing a cost for adding more variables, you're filtering out some of the statistical noise to get at the real, louder signal underneath.
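
Concretely, the "cost" is just an extra penalty term added to whatever the fit is minimising; e.g. ridge regression minimises ||y - Xb||^2 + alpha*||b||^2 and the lasso minimises ||y - Xb||^2 + alpha*||b||_1, where a bigger alpha means a steeper hurdle. A tiny lasso sketch on made-up data, where only the first column (think of it as the male/female indicator) carries real signal:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)
    n = 80
    X = rng.normal(size=(n, 20))
    X[:, 0] = rng.choice([-1.0, 1.0], size=n)   # the one real variable: a +/-1 sex indicator
    y = 3.0 * X[:, 0] + rng.normal(size=n)      # outcome depends only on column 0

    fit = Lasso(alpha=0.2).fit(X, y)
    print("signal coefficient:", round(fit.coef_[0], 2))
    print("noise coefficients left non-zero:", int(np.count_nonzero(fit.coef_[1:])))

The strong variable pays the penalty and keeps a clearly non-zero (if slightly shrunken) coefficient, while typically most or all of the noise columns get pushed to exactly zero: the "extra hurdle" in action.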

Now, a personal aside: even though you want to keep models simple (like code, ceteris paribus), and you should be suspicious of any machine learning/AI technique that uses too many variables, I don't actually like regularisation all that much on the whole. In the article's example, it's not actually clear at all that this is a case of over-fitting, so by following its advice you might actually be making your model worse. And in the real world, there are often a number of other techniques that work better (test/train splits, resampling). But like all techniques, it's another arrow in your quiver when the time is right.
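
For completeness, the test/train/resampling idea mentioned above is just: repeatedly fit on one chunk of the data and score on a chunk the fit never saw. A bare-bones k-fold cross-validation sketch (the model and data here are placeholders, scikit-learn only for brevity):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + 0.5 * rng.normal(size=200)) > 0   # only the first feature matters

    # 5-fold CV: every score comes from data the model was not fitted on.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.round(2), "mean:", round(scores.mean(), 2))

If a bigger model only looks better in-sample and not in those held-out scores, that's your over-fitting warning, with or without regularisation.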

And now I've written an essay.




The main problem is that regularisation here is described in a model-fitting context. In actual machine learning context, it is usually called data resynthesis or data hallucination.

The main point to take away is that the input data is modified in some way to hide irrelevant detail. Regularisation does that by injecting a specific kind of noise into the data.


> In actual machine learning context, it is usually called data resynthesis or data hallucination.

Your comment is the fourth-highest search result for "data resynthesis" on Google. These are not common phrases by any stretch.

Also, describing regularization in a model-fitting context is not "problematic"; it's the main application of the concept.


Oh, and sometimes regularisation is confused with normalisation and other techniques aimed at making data scale-invariant in some way.



