
I Got More Data, My Model Is More Refined, but My Estimator Is Getting Worse [pdf] - charlysl
https://statmodeling.stat.columbia.edu/wp-content/uploads/2012/07/timeseries06272012.pdf
======
charlysl
I am the OP. For what it's worth, I barely understand the paper (it's way too
technical for me), so let me give some context as to why I posted this.

It was referred to in a course I've just started self-studying, Berkeley's
Principles and Techniques of Data Science (aka Data 100). The lecturer wanted
to highlight the crucial importance of data acquisition in the Data Science
lifecycle, sampling in particular, given that mistakes at this stage will
propagate to all the later stages.

My understanding of his example (~1h10 into
[https://youtu.be/JtwBwogRZkI](https://youtu.be/JtwBwogRZkI)) is that a
probability sample of size 400 (a small random sample from the whole
population) is often better than an administrative sample of millions (of the
kind you can download from government sites). The reason is that an arbitrary
sample (as opposed to a random one) is very likely to be biased, and, if it is
large enough, its confidence interval (which doesn't really make sense except
for probability samples anyway) will be so narrow that, because of the bias,
it will rarely, if ever, include the true value we are trying to estimate. The
small random sample, on the other hand, is very likely to include the true
value in its (wider) confidence interval.
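
To make this concrete, here is a minimal simulation sketch of the lecture's
point (my own construction, not from the course; the population, the bias
mechanism, and the sample sizes are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of one million values with a known true mean.
population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)
true_mean = population.mean()

# Small probability sample: 400 units drawn uniformly at random.
random_sample = rng.choice(population, size=400, replace=False)

# Large "administrative" sample: 100,000 units, but the chance of being
# included grows with the value itself, i.e. the selection is biased.
weights = np.exp(population / 50.0)
weights /= weights.sum()
admin_sample = rng.choice(population, size=100_000, replace=True, p=weights)

for name, s in (("random n=400   ", random_sample),
                ("biased n=100000", admin_sample)):
    se = s.std(ddof=1) / np.sqrt(len(s))
    lo, hi = s.mean() - 1.96 * se, s.mean() + 1.96 * se
    print(f"{name}: 95% CI = ({lo:.2f}, {hi:.2f}), "
          f"contains true mean {true_mean:.2f}? {lo <= true_mean <= hi}")
```

The biased sample's interval is extremely narrow, but it is centered in the
wrong place and misses the true mean; the random sample's interval is much
wider yet covers it.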

The paper seems to elaborate on this, which I find interesting, and I was
hoping for someone to explain it [EDIT: the paper, that is, I think I
understand the lecture].

~~~
smallnamespace
The paper is talking about a closely related but rather different topic from
the video's (going by your summary; I did not watch it).

The paper is mostly concerned with _model misspecification_, in the sense
that your model is not actually capable of describing reality (no matter what
parameters you set it to). The first example uses the OLS model, e.g.

Y = βX + ε

where the implicit assumption is that your observations (Y) are purely a
linear function of your input(s) (X) plus an _independent, equal-variance,
normally distributed_ error.

Let's say your data really does fit a straight line, but the later data points
you collect have much higher variance. Intuitively, including those later
points will make your estimated parameters noisier than if you had simply
thrown them out.

So in this sense, 'more data is worse', because your model assumes all the
data points are the same, but the later points you collected are actually
worse (noisier).
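
Here is a minimal simulation sketch of that scenario (all numbers assumed for
illustration: true slope 2, noise sd 1 on the early points and 25 on the late
ones, model forced through the origin to match Y = βX + ε):

```python
import numpy as np

rng = np.random.default_rng(1)
beta, n_sims = 2.0, 2000

def fit_slope(x, y):
    # OLS slope for a through-the-origin model: beta_hat = sum(xy) / sum(x^2)
    return (x * y).sum() / (x * x).sum()

early_x = np.linspace(1, 10, 100)    # first 100 points: noise sd = 1
late_x = np.linspace(10.1, 20, 100)  # next 100 points: noise sd = 25

slopes_early, slopes_all = [], []
for _ in range(n_sims):
    y_early = beta * early_x + rng.normal(0, 1, early_x.size)
    y_late = beta * late_x + rng.normal(0, 25, late_x.size)
    slopes_early.append(fit_slope(early_x, y_early))
    slopes_all.append(fit_slope(np.concatenate([early_x, late_x]),
                                np.concatenate([y_early, y_late])))

# Both estimators are centered on the true slope, but the one that uses
# the extra (noisy) data has a much wider sampling distribution.
print("early only: mean", np.mean(slopes_early), "sd", np.std(slopes_early))
print("all points: mean", np.mean(slopes_all), "sd", np.std(slopes_all))
```

With these numbers, throwing in the 100 extra (perfectly genuine) observations
inflates the spread of the slope estimate by roughly an order of magnitude.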

There are other, more obvious ways this happens. For example, if you assume
that wealth is normally distributed and then try to estimate the
distribution's parameters, any small sample that happens to include a
billionaire will give you nonsense: your model will be very intolerant of
sampling noise compared to a better model.
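
A toy illustration of that fragility (all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# 99 "ordinary" net worths around $100k, plus one billionaire.
wealth = np.append(rng.normal(100_000, 30_000, 99), 1_000_000_000)

# Fitting a normal by maximum likelihood just means taking the sample
# mean and standard deviation:
mu_hat, sigma_hat = wealth.mean(), wealth.std()
print(f"fitted normal: mean ${mu_hat:,.0f}, sd ${sigma_hat:,.0f}")
print(f"sample median:      ${np.median(wealth):,.0f}")
```

The fitted mean lands around $10M, a figure that describes essentially nobody
in the sample; a heavy-tailed model (a log-normal, say) would be far less
fragile under the same draw.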

The video's observation only deals with data: you might prefer quality over
quantity.

E.g. compare a wider cone centered at zero (the true average) vs. a narrow
cone that's centered away from zero - depending on what we care about, we may
prefer some (hopefully uncorrelated) noise in our model rather than learning
some systematic untruth.

So these are two different ways more data can be counterproductive: either
your model is _biased_ in a way that more data can be hurtful, or your
sampling procedure for larger sets is increasingly biased.

Note that older stats/econometric techniques are more vulnerable to model
misspecification because they have many fewer parameters than, say, a DNN;
conversely, more modern ML techniques are able to closely replicate input
data. This helps explain the difference in focus.

~~~
em500
> Let's say your data really does fit a straight line, but the later data
> points you collect have much higher variance. Intuitively, including those
> later points will make your estimated parameters noisier than if you had
> simply thrown them out.

> So in this sense, 'more data is worse', because your model assumes all the
> data points are the same, but the later points you collected are actually
> worse (noisier).

The first example in the paper makes this less than obvious by stating, in
formula form, that the error variance increases with time, but what's
described is essentially this picture:
[https://en.wikipedia.org/wiki/Heteroscedasticity#/media/File:Hsked_residual_compare.svg](https://en.wikipedia.org/wiki/Heteroscedasticity#/media/File:Hsked_residual_compare.svg)
So it's intuitively easy to grok that if the more recent variances are bad
enough, it's probably better not to use those points at all (or to properly
downweight them, as with MLE or GLS).
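
A minimal sketch of that downweighting, assuming (unrealistically) that the
true noise variances are known exactly; WLS with inverse-variance weights
stands in here for the general GLS idea:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(1, 20, 200)
sigma = 0.5 * x**2                      # assumed: noise sd grows with x
y = 2.0 * x + rng.normal(0, sigma)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit(cov_type="HC1")  # heteroskedasticity-robust SEs
wls = sm.WLS(y, X, weights=1.0 / sigma**2).fit()  # weight = 1 / variance

print("OLS slope:", ols.params[1], "se:", ols.bse[1])
print("WLS slope:", wls.params[1], "se:", wls.bse[1])
```

Both slopes are unbiased, but the WLS one comes with a much smaller standard
error because the high-variance observations are downweighted instead of
being treated as equals.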

The described result is a bit more subtle, though. The classical results are
that with heteroskedasticity, OLS is not an _efficient_ estimator [1] (you
could do better with, for example, GLS/WLS), but it's still _unbiased_ [2]
(the estimated betas are centered around the true betas) and _consistent_ [3]
(the probability limit of the OLS beta is the true beta, by the Law of Large
Numbers). These are what the paper calls the "robustness" of OLS in the first
example. The consistency of OLS even with heteroskedastic errors is what makes
it surprising that adding more (true!) observations can make your estimates
worse. Intuitively, most people expect consistent estimators to get better
with more observations.

[1]
[https://en.wikipedia.org/wiki/Efficient_estimator](https://en.wikipedia.org/wiki/Efficient_estimator)

[2]
[https://en.wikipedia.org/wiki/Bias_of_an_estimator](https://en.wikipedia.org/wiki/Bias_of_an_estimator)

[3]
[https://en.wikipedia.org/wiki/Consistent_estimator](https://en.wikipedia.org/wiki/Consistent_estimator)

------
ovi256
It is well known among practitioners that not all models have equal learning
capacity and that additional data is wasted on those with lower capacity: if
you plot a performance metric vs. dataset size, you can see that it simply
saturates and stops increasing. The point where it saturates is a proxy for
the model's learning capacity.
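
As an illustration of that saturation (a hypothetical setup using
scikit-learn; the data, models, and sample sizes are all invented), a
low-capacity linear model flattens out almost immediately on a nonlinear
target, while a higher-capacity model keeps improving with more data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)

def make_data(n):
    x = rng.uniform(-3, 3, (n, 1))
    y = np.sin(2 * x[:, 0]) + rng.normal(0, 0.1, n)  # nonlinear target
    return x, y

x_test, y_test = make_data(5_000)
for n in (100, 1_000, 10_000):
    x_tr, y_tr = make_data(n)
    lin = LinearRegression().fit(x_tr, y_tr)      # low capacity
    forest = RandomForestRegressor(n_estimators=50,
                                   random_state=0).fit(x_tr, y_tr)
    print(f"n={n:6d}  linear R^2 = {r2_score(y_test, lin.predict(x_test)):.2f}"
          f"  forest R^2 = {r2_score(y_test, forest.predict(x_test)):.2f}")
```

The linear model's score is a proxy for its capacity ceiling; extra data past
that point is wasted on it.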

One of the selling points of very big models such as deep networks was that
they never seem to saturate in this manner: more data always gave better
results, because there are ways to keep increasing the model's learning
capacity.

We've recently learned that deep nets work in deep overfitting mode, beyond
the usual U-shaped regime of bias-variance tradeoff that can be seen in the
empirical risk curve. See this post for a good overview:

[https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neural-networks-dramatically-overfitted.html#modern-risk-curve-for-deep-learning](https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neural-networks-dramatically-overfitted.html#modern-risk-curve-for-deep-learning)

It was a surprise to me that adding more data could lower a performance
metric, though; that's counter-intuitive.

~~~
conjectures
> It is well known among practitioners that not all models have equal learning
> capacity and that additional data is wasted on those with lower capacity

Not just practitioners; IIRC there's a whole bunch of theory about this under
the name of the Vapnik-Chervonenkis dimension.

------
tastroder
(2013) - edit: yeah, not criticizing OP, it's just useful metadata

~~~
charlysl
I would indeed have liked to append the date to the title, but it was too long.

~~~
geoalchimista
The link is a preprint. Here is the published version:
[https://www.tandfonline.com/doi/full/10.1080/07474938.2013.808567](https://www.tandfonline.com/doi/full/10.1080/07474938.2013.808567).

------
baazaa
Huh, the bit about just plugging in the real values for variables being a bad
idea is very interesting. Pretty sure I've done that in the past, without
really thinking too much about it.

Does anyone know of books that cover stuff like that? There are plenty of
books explaining Simpson's paradox to laymen, but I want something that
explains common statistical pitfalls to people who already have some education
in the area.

------
thedudeabides5
If adding data AND variables to your model is making it less accurate, this is
probably good evidence that your first model was cheating.

