
Machine Learning Crash Course: The Bias-Variance Dilemma - Yossi_Frenkel
https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
======
taeric
This seems to ultimately come down to an idea that folks have a hard time
shaking. It is entirely possible that you cannot recover the original signal
using machine learning. This is, fundamentally, what separates this field from
digital sampling.

And this is not unique to machine learning, per se.
[https://fivethirtyeight.com/features/trump-noncitizen-voters...](https://fivethirtyeight.com/features/trump-noncitizen-voters/)
has a great widget that shows that as you get more data, you do not
necessarily decrease inherent noise. In fact, it stays very constant.
(Granted, this is in large part because machine learning has most of its
roots in statistics.)

More explicitly, with ML, you are building probabilistic models. This
contrasts with the analytic models most folks are used to. That is, you run
the calculations for an object moving across the field, and you get something
within the measurement bounds that you expected. With a probabilistic model,
you get something that is within the bounds of being in line with previous
data you have collected.

(None of this is to say this is a bad article. Just a bias to keep in mind as
you are reading it. Hopefully, it helps you challenge it.)

~~~
cromd
If it helps anyone, the FiveThirtyEight article describes a scenario where
people take a survey about immigration status and voting. Most legal citizens
will correctly identify themselves, but some will accidentally check the wrong
box and say they are an illegal immigrant. If a billion citizens and 10
actual illegal immigrants take the survey, and people check the wrong box 1
in 1,000 times, your "percentage of illegal immigrants who vote" statistic
will be about the same as the citizens' rate (because almost all _reported_
illegals will really be citizens). Collecting more data won't help.

It's a very good article, though in the context of deciding how many variables
should be in a model of some complex phenomenon, this example is a little
tougher to wrap your head around. It's not quite a predictive model, but there
were some variables left out. A naive model I suppose is "this data is
generated by infallible respondents", whereas a better model would incorporate
that error rate. There isn't as much of a question about which pieces of
information are relevant, though, like you might encounter when trying to
predict future drug use from household income, race, age, number of books read
as a child, number of pets, and so on.

~~~
corey_moncure
Doesn't this make the assumption that "illegal immigrants" won't check the
wrong box, intentionally or by accident?

~~~
Double_Cast
citizens labeled illegals: 1,000,000,000 × 0.001 = 1,000,000

illegals labeled citizens: 10 × 0.001 = 0.01
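
The thread's numbers can be checked in a couple of lines (toy figures from
the comments above, not real survey data):

```python
# Toy figures from the thread, not real survey data.
citizens = 1_000_000_000   # true citizens taking the survey
illegals = 10              # true non-citizens taking the survey
error_rate = 1 / 1000      # chance of ticking the wrong box

mislabeled_citizens = citizens * error_rate   # citizens reported as illegals
mislabeled_illegals = illegals * error_rate   # illegals reported as citizens

# Of everyone who *reports* being an illegal immigrant, almost all are
# actually citizens who mis-ticked the box, so the group's measured
# behavior looks just like the citizens'.
reported_illegals = mislabeled_citizens + (illegals - mislabeled_illegals)
share_actually_citizens = mislabeled_citizens / reported_illegals
```

No amount of extra respondents changes that share; the mislabeling noise
scales right along with the sample.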

------
rdudekul
Here are parts 1, 2 & 3:

Introduction, Regression/Classification, Cost Functions, and Gradient Descent:

[https://ml.berkeley.edu/blog/2016/11/06/tutorial-1/](https://ml.berkeley.edu/blog/2016/11/06/tutorial-1/)

Perceptrons, Logistic Regression, and SVMs:

[https://ml.berkeley.edu/blog/2016/12/24/tutorial-2/](https://ml.berkeley.edu/blog/2016/12/24/tutorial-2/)

Neural networks & Backpropagation:

[https://ml.berkeley.edu/blog/2017/02/04/tutorial-3/](https://ml.berkeley.edu/blog/2017/02/04/tutorial-3/)

------
amelius
The whole problem of overfitting or underfitting exists because you're not
trying to understand the underlying model, but you're trying to "cheat" by
inventing some formula that happens to work in most cases.

~~~
petters
Yes, this is both uninteresting and true. The whole field of ML exists
precisely because many things are too complex to model directly.

~~~
sgt101
too complex for humans to create the models unassisted?

~~~
ovi256
Definitely.

Look at the ImageNet ILSVRC competition. Hand-built models can't approach the
SoTA results achieved with CNNs.

------
therajiv
Wow, the discussion on the Fukushima civil engineering decision was pretty
interesting. However, I find it surprising that the engineers simply
overlooked the linearity of the law and used a nonlinear model. I wonder if
there were any economic / other incentives at play, and the model shown was
just used to justify the decision?

Regardless, that post was a great read.

~~~
blah9874
In my opinion, the real problem in that case was not the overfitting, but that
they _extrapolated_ from that data. They didn't have anything above Magnitude
8.
([https://ml.berkeley.edu/blog/assets/tutorials/4/earthquake-f...](https://ml.berkeley.edu/blog/assets/tutorials/4/earthquake-fit-analysis.png))

You should never, ever extrapolate. It doesn't matter what your model is, it
won't work.

On a side note, it could be that there is a breakpoint at Magnitude 7.25,
where the slope of the line _really_ changes, and a segmented linear
regression is appropriate
([https://en.wikipedia.org/wiki/Segmented_regression](https://en.wikipedia.org/wiki/Segmented_regression)).
But we would need more data to be sure, anyway.
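
For what it's worth, a broken-stick fit with a known breakpoint is just
ordinary least squares with a hinge feature. A sketch on made-up data (the
7.25 breakpoint and both slopes are invented for illustration, not taken
from real earthquake records):

```python
import numpy as np

# Hypothetical magnitude vs. log-frequency data with a slope change at 7.25.
rng = np.random.default_rng(1)
mag = np.linspace(5.0, 8.0, 40)
log_freq = np.where(mag < 7.25,
                    4.0 - 1.0 * mag,
                    4.0 - 1.0 * 7.25 - 1.8 * (mag - 7.25))
log_freq += 0.1 * rng.standard_normal(mag.size)

# Segmented regression with a known breakpoint b:
#   y = c + a1*x + a2*max(x - b, 0), continuous at x = b.
b = 7.25
X = np.column_stack([np.ones_like(mag), mag, np.maximum(mag - b, 0.0)])
coef, *_ = np.linalg.lstsq(X, log_freq, rcond=None)
c, a1, a2 = coef   # slope below b is a1; above b it is a1 + a2
```

If the breakpoint itself is unknown, you'd grid-search over b and compare
residuals, which is where the "need more data" caveat really bites.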

~~~
nshepperd
Not extrapolating isn't really an option in cases like this. You have to give
_some_ prediction for earthquakes of magnitude 9. Ultimately you must make a
decision on whether to design for such an event.

But a sensible thing to do would be to draw _many_ samples from the posterior
distribution, instead of just using the maximum likelihood estimate. That way
the prediction accurately represents the uncertainty resulting from not having
any data above magnitude 8 as well as, perhaps, your background knowledge that
earthquakes of magnitude 15 never happen.
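
As a sketch of that idea, assuming a made-up Gutenberg-Richter-style dataset
and a simple conjugate Gaussian linear model (all numbers invented for
illustration): the spread of posterior predictions widens as you move past
the observed magnitudes.

```python
import numpy as np

# Hypothetical data: log10(frequency) vs magnitude, observed only up to M8.
rng = np.random.default_rng(0)
mag = np.linspace(5.0, 8.0, 25)
y = 4.0 - 1.0 * mag + 0.15 * rng.standard_normal(mag.size)

# Conjugate Bayesian linear regression (known noise variance, wide prior).
X = np.column_stack([np.ones_like(mag), mag])
sigma2, tau2 = 0.15 ** 2, 10.0 ** 2            # noise var, prior var
prec = X.T @ X / sigma2 + np.eye(2) / tau2     # posterior precision
cov = np.linalg.inv(prec)
mean = cov @ (X.T @ y) / sigma2

# Draw many parameter samples and compare predictions at M7 vs M9.
samples = rng.multivariate_normal(mean, cov, size=5000)
pred7 = samples @ np.array([1.0, 7.0])   # inside the data range
pred9 = samples @ np.array([1.0, 9.0])   # extrapolation
```

The sample spread at M9 exceeds the spread at M7, which is exactly the
uncertainty a single maximum-likelihood line hides.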

------
eggie5
I've always liked this visualization of the Bias-Variance tradeoff:
[http://www.eggie5.com/110-bias-variance-tradeoff](http://www.eggie5.com/110-bias-variance-tradeoff)

~~~
culturedsystems
That's OK as a visualisation of what bias and variance _are_, but it's a bad
visualisation of the bias-variance _tradeoff_, because in that image there is
no tradeoff - it shows a case where bias and variance are independent of one
another. An illustration like this one genuinely confused me when I was first
introduced to bias and variance: I couldn't understand why the lecturer was
claiming there is a tradeoff while showing a picture of a case where there is
no tradeoff. I eventually figured out what was going on, but I think I would
have got it quicker if it had been explained more like the linked post, and
less like that diagram.
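
One way to see an actual tradeoff, rather than the independent-quantities
picture, is to refit models of increasing complexity to many resampled toy
datasets and track bias² and variance at a fixed point (the sine signal,
noise level, and degrees here are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
f = lambda x: np.sin(2 * np.pi * x)   # hypothetical "true" signal
x0, noise = 0.3, 0.3                  # evaluation point, noise std

def bias2_and_var(degree, n_sets=200, n_pts=30):
    # Fit the same model class to many fresh datasets and collect
    # its predictions at x0.
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(0, 1, n_pts)
        y = f(x) + noise * rng.standard_normal(n_pts)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    return (preds.mean() - f(x0)) ** 2, preds.var()

b1, v1 = bias2_and_var(1)   # underfit: high bias, low variance
b3, v3 = bias2_and_var(3)   # about right
b9, v9 = bias2_and_var(9)   # overfit: low bias, high variance
```

Bias² falls and variance rises as the degree goes up, which is the tradeoff
the static four-dartboard picture can't show.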

------
plg
like many things in science and engineering, (and life in general) it comes
down to this: what is signal, what is noise?

most of the time there is no a priori way of determining this

you come to the problem with your own assumptions (or you inherit them) and
that guides you (or misguides you)

------
CuriouslyC
One good way to solve the bias-variance problem is to use Gaussian processes
(GPs). With GPs you build a probabilistic model of the covariance structure of
your data. Locally complex, high variance models produce poor objective
scores, so hyperparameter optimization favors "simpler" models.

Even better, you can put priors on the parameters of your model and give it
the full Bayesian treatment via MCMC. This avoids overfitting, and gives you
information about how strongly your data specifies the model.
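
A minimal GP-regression sketch in plain NumPy (RBF kernel with fixed
hyperparameters; the data and length-scale are invented, and real use would
tune them by maximizing the marginal likelihood, as described above):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Toy data: noisy samples of a smooth function.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30)
y = np.sin(X) + 0.2 * rng.standard_normal(30)

noise = 0.2 ** 2
K = rbf_kernel(X, X) + noise * np.eye(len(X))

# GP posterior mean and variance at a new point x*.
x_star = np.array([5.0])
k_star = rbf_kernel(X, x_star)
alpha = np.linalg.solve(K, y)
mean = k_star.T @ alpha
var = rbf_kernel(x_star, x_star) - k_star.T @ np.linalg.solve(K, k_star)
```

The `var` term is what the comment is getting at: the model reports how
strongly the data pins down the prediction, not just a point estimate.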

------
ehsquared
Welch Labs has a great 15-part series, where they gradually build up a
decision tree model that counts the number of fingers in an image. Part 9 in
the series explains the bias-variance spectrum really well:
[https://youtu.be/yLwZEuybaqE?list=PLiaHhY2iBX9ihLasvE8BKnS2X...](https://youtu.be/yLwZEuybaqE?list=PLiaHhY2iBX9ihLasvE8BKnS2Xg8AhY6iV)

------
gpawl
Statistics is the science of making decisions under uncertainty.

It is far too frequently misunderstood as the science of making certainty from
uncertainty.

------
known
Brilliant post, thank you.

------
Pogba666
Wow, nice. Now I have something to read on my flight.

