Hacker News
I Got More Data, My Model Is More Refined, but My Estimator Is Getting Worse [pdf] (columbia.edu)
129 points by charlysl 11 days ago | 17 comments





I am the OP. For what it's worth, I barely understand the paper; it's way too technical for me. So let me give some context as to why I posted it.

It was referred to in a course I've just started self-studying, Berkeley's Principles and Techniques of Data Science (aka Data 100). The lecturer wanted to highlight the crucial importance of data acquisition in the Data Science lifecycle, sampling in particular, given that mistakes at this stage will propagate to the later stages. My understanding of his example (~1h10 into https://youtu.be/JtwBwogRZkI) is that a probability sample of size 400 (a small random sample from the whole population) is often better than an administrative sample of millions (of the kind you can download from government sites). The reason is that an arbitrary sample (as opposed to a random one) is very likely to be biased, and, if it is large enough, its confidence interval (which strictly speaking only makes sense for probability samples) will be so narrow that, because of the bias, it will rarely, if ever, contain the true value we are trying to estimate. The small random sample, on the other hand, is very likely to contain the true value in its (wider) confidence interval.
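A toy simulation of that contrast (all numbers invented: a normal population stands in for the real one, and a value-dependent inclusion rule stands in for administrative self-selection) might look like this:

```python
import random
import statistics

random.seed(0)

# Hypothetical population of 200,000 values; the true mean is the target.
population = [random.gauss(50.0, 10.0) for _ in range(200_000)]
true_mean = statistics.fmean(population)

# Small probability sample: 400 units drawn uniformly at random.
srs = random.sample(population, 400)
srs_mean = statistics.fmean(srs)
srs_half_width = 1.96 * statistics.stdev(srs) / 400 ** 0.5

# Huge non-random sample: units with larger values are more likely to be
# recorded (a crude stand-in for self-selection bias).
admin = [x for x in population if x > 45]
admin_mean = statistics.fmean(admin)
admin_half_width = 1.96 * statistics.stdev(admin) / len(admin) ** 0.5

print(f"true mean      : {true_mean:.2f}")
print(f"400-unit SRS   : {srs_mean:.2f} +/- {srs_half_width:.2f}")
print(f"{len(admin)}-unit biased: {admin_mean:.2f} +/- {admin_half_width:.2f}")
```

The biased sample's interval is razor thin but centered in the wrong place, so it misses the truth entirely; the interval around the small random sample is much wider but, with roughly 95% probability, covers it.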

The paper seems to elaborate on this, which I find interesting, and I was hoping for someone to explain it [EDIT: the paper, that is, I think I understand the lecture].


Your lecturer must have meant to refer to this paper, also by Xiao-Li Meng: https://statistics.fas.harvard.edu/files/statistics-2/files/...

It directly answers the question of whether a small probability sample or large but biased sample is better, using only modestly technical results (at least for the first part). It’s a very nice paper.


Wow, thanks, it does indeed seem far more relevant to the lecture, and, as you said, easier to follow.

EDIT: well, as it happens, the course's online book has a link (https://youtu.be/yz3jOIHLYhU) to a youtube lecture based on the very paper you mentioned.

EDIT: slides relevant to paper (not in youtube lecture): https://www.bnl.gov/nysds18/files/talks/session2/Meng-nysds1...


The paper is talking about a closely related but rather different topic from the video's (relying on your summary; I did not watch it).

The paper is mostly concerned with model misspecification, in the sense that your model is not actually capable of describing reality (no matter what parameters you set it to). The first example uses the ordinary least squares (OLS) model:

Y = βX + ε

where the implicit assumption is that your observations (Y) are purely a linear function of your input(s) (X) plus an independent, equal-variance, normally distributed error.

Let's say your data really does fit a straight line, but the later data points you get are really high variance. Intuitively, including the later data points will make your estimated parameters more noisy than if we had simply thrown them out.

So in this sense, 'more data is worse', because your model assumes all the data points are the same, but the later points you collected are actually worse (noisier).
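A quick simulation of this effect (all numbers made up: true slope 2, early noise sd 1, late noise sd 25) shows the estimator's spread growing when the noisy tail is kept:

```python
import random
import statistics

random.seed(1)

TRUE_BETA = 2.0
N = 100                                          # points arrive in time order
xs = [(i + 1) / N for i in range(N)]
sigmas = [1.0] * (N // 2) + [25.0] * (N // 2)    # later points are much noisier

def ols_slope(x, y):
    # No-intercept OLS, matching the model Y = beta * X + eps.
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

full_estimates, half_estimates = [], []
for _ in range(500):
    ys = [TRUE_BETA * x + random.gauss(0.0, s) for x, s in zip(xs, sigmas)]
    full_estimates.append(ols_slope(xs, ys))                      # use everything
    half_estimates.append(ols_slope(xs[: N // 2], ys[: N // 2]))  # drop noisy tail

print("sd of slope using all points :", statistics.stdev(full_estimates))
print("sd of slope dropping the tail:", statistics.stdev(half_estimates))
```

Under these made-up numbers, throwing away half of the (perfectly genuine) data gives a far more stable slope estimate.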

There are other, more obvious ways this can happen. For example, if you assume that wealth is normally distributed and then try to estimate the distribution's parameters, any small sample plus any billionaire will get you nonsense: your model will be very intolerant of sampling noise compared to a better model.
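A two-line illustration (with hypothetical incomes) of how a single extreme value wrecks the normal model's location estimate, while a more robust summary shrugs it off:

```python
import statistics

# 99 hypothetical ordinary incomes, plus one billionaire.
incomes = [40_000 + 5_000 * i for i in range(99)]
sample = incomes + [50_000_000_000]

mean = statistics.fmean(sample)      # the normal model's location estimate
median = statistics.median(sample)   # a robust alternative

print(f"mean  : {mean:,.0f}")
print(f"median: {median:,.0f}")
```

The mean lands hundreds of millions above every ordinary income in the sample; the median stays where most of the data is.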

The video's observation only deals with data: you might prefer quality over quantity.

E.g. compare a wider cone centered at zero (the true average) vs. a narrow cone that's centered away from zero - depending on what we care about, we may prefer some (hopefully uncorrelated) noise in our model rather than learning some systematic untruth.
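One way to make the cone comparison concrete is mean squared error, which decomposes as variance plus squared bias. With invented numbers (an unbiased estimator with sd 2 vs. an estimator biased by 3 with sd 0.1), the "precise but off-center" one loses:

```python
import random
import statistics

random.seed(2)

TRUTH = 0.0
REPS = 20_000

def mse(estimates):
    return statistics.fmean((e - TRUTH) ** 2 for e in estimates)

noisy_unbiased = [random.gauss(TRUTH, 2.0) for _ in range(REPS)]        # wide cone at zero
precise_biased = [random.gauss(TRUTH + 3.0, 0.1) for _ in range(REPS)]  # narrow cone, off-center

print("MSE, noisy but unbiased :", mse(noisy_unbiased))   # ~ variance = 4
print("MSE, precise but biased :", mse(precise_biased))   # ~ bias^2 = 9
```

Of course, if the loss isn't squared error, or the bias is small relative to the noise, the ranking can flip; that's the "depending on what we care about" part.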

So these are two different ways more data can be counterproductive: either your model is misspecified in a way that makes more data hurtful, or your sampling procedure becomes increasingly biased as the sample grows.

Note that older stats/econometric techniques are more vulnerable to model misspecification because they have many fewer parameters than, say, a DNN; conversely, more modern ML techniques are able to closely replicate input data. This helps explain the difference in focus.


> Let's say your data really does fit a straight line, but the later data points you get are really high variance. Intuitively, including the later data points will make your estimated parameters more noisy than if we had simply thrown them out.

> So in this sense, 'more data is worse', because your model assumes all the data points are the same, but the later points you collected are actually worse (noisier).

The first example in the paper makes it less than obvious by writing in formula form that the error-noise variance increases with time, but what's described is essentially this picture: https://en.wikipedia.org/wiki/Heteroscedasticity#/media/File... So it's intuitively easy to grok that if the more recent variances are bad enough, it's probably better not to use those points at all (or to downweight them properly, as with MLE or GLS).
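The downweighting option can be sketched with made-up numbers (no-intercept line, true slope 2, per-point noise sds assumed known): weighted least squares with inverse-variance weights is noticeably more stable than plain OLS:

```python
import random
import statistics

random.seed(3)

TRUE_BETA = 2.0
xs = [(i + 1) / 100 for i in range(100)]
sigmas = [1.0] * 50 + [10.0] * 50      # heteroskedastic: later points noisier

def slope(x, y, w):
    # Weighted no-intercept least squares: beta = sum(w x y) / sum(w x^2).
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

ols_w = [1.0] * 100                     # OLS = equal weights
wls_w = [1.0 / s ** 2 for s in sigmas]  # WLS = inverse-variance weights

ols_est, wls_est = [], []
for _ in range(500):
    ys = [TRUE_BETA * x + random.gauss(0.0, s) for x, s in zip(xs, sigmas)]
    ols_est.append(slope(xs, ys, ols_w))
    wls_est.append(slope(xs, ys, wls_w))

print("sd of OLS slope:", statistics.stdev(ols_est))
print("sd of WLS slope:", statistics.stdev(wls_est))
```

Here both estimators stay centered on the true slope; WLS just uses the noisy tail without letting it dominate.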

The described result is a bit more subtle though. The classical results are that with heteroskedasticity, OLS is not an efficient estimator [1] (you could do better with, for example, GLS/WLS), but it's still unbiased [2] (the estimated betas are centered around the true betas) and consistent [3] (the probability limit of the OLS beta is the true beta, by the Law of Large Numbers). These are what the paper calls "robustness" of OLS in the first example. The consistency of OLS even with heteroskedastic errors is what makes it surprising that adding more (true!) observations can make your estimates worse. Intuitively, most people expect consistent estimators to get better with more observations.
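That robustness is easy to see in a small simulation (invented numbers: no-intercept line, true slope 2, noise sd jumping from 1 to 5 halfway through). The OLS estimates stay centered on the truth even though each one is noisy:

```python
import random
import statistics

random.seed(4)

TRUE_BETA = 2.0
xs = [(i + 1) / 100 for i in range(100)]
sigmas = [1.0] * 50 + [5.0] * 50   # heteroskedastic errors

estimates = []
for _ in range(2000):
    ys = [TRUE_BETA * x + random.gauss(0.0, s) for x, s in zip(xs, sigmas)]
    # No-intercept OLS slope.
    estimates.append(sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs))

print("mean of estimates:", statistics.fmean(estimates))  # centered near 2.0
print("sd of estimates  :", statistics.stdev(estimates))
```

Unbiasedness says nothing about efficiency, though, which is exactly the gap the paper's example lives in.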

[1] https://en.wikipedia.org/wiki/Efficient_estimator

[2] https://en.wikipedia.org/wiki/Bias_of_an_estimator

[3] https://en.wikipedia.org/wiki/Consistent_estimator


Thank you for your patient explanation. If I understood correctly, then, the paper isn't relevant to what was being explained in the lecture (data acquisition/sampling), but rather to a later stage of the Data Science lifecycle (prediction/modeling).

The course you refer to and the paper are somewhat different. The course simply says that for a given problem, there are good datasets and bad datasets, and the bad data may be bad no matter how big it is. Bad means biased here.

The paper considers a somewhat different situation. There you have a single dataset, and you know in advance[1] that some points in it (and you know exactly which) are noisier than others. The question is then whether to use those points, and how.

[1] But you know this only because you assume a model (a particular time-series model, in this case). It's not necessarily in the data itself.


Thanks for pointing this out. It looks like the lecturer put a screenshot of the paper in the slides because its title is clickbaity (to liven things up), not because it was relevant to the subject at hand.

It is well known among practitioners that not all models have equal learning capacity and that additional data is wasted on those with lower capacity: if you plot a performance metric vs dataset size, the curve simply saturates and stops improving. The point where it saturates is a proxy for the model's learning capacity.

One of the selling points of very big models, such as deep networks, is that they never seem to saturate in this manner: more data always gives better results, because there are ways to keep increasing the model's learning capacity.

We've recently learned that deep nets can work in a heavily overfitted regime, beyond the usual U-shaped bias-variance tradeoff that can be seen in the empirical risk curve. See this post for a good overview:

https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neu...

It was a surprise to me that adding more data could lower a performance metric, though; that's counter-intuitive.


> It is well known among practitioners that not all models have equal learning capacity and that additional data is wasted on those with lower capacity

Not just practitioners; IIRC there's a whole body of theory under the heading of Vapnik-Chervonenkis dimension.


(2013) - edit: yeah, not criticizing OP, it's just useful metadata

I would indeed have liked to append the date to the title, but it was too long.

The link is a preprint. Here is the published version: https://www.tandfonline.com/doi/full/10.1080/07474938.2013.8....

So? I just finished it and it was still edifying.

It is customary for the vintage of old articles to be added in the title to make it clear that they may not be current. GP is attempting to bring the vintage of this article to the attention of a moderator to make such a change.

Huh, the bit about just plugging in the real values for variables being a bad idea is very interesting. Pretty sure I've done that in the past, without really thinking too much about it.

Does anyone know of books that cover stuff like that? There's plenty of books explaining Simpson's paradox to laymen, but I want something that explains common statistical pitfalls for people with already some education in the area.


If adding data AND variables to your model is making it less accurate, this is probably good evidence that your first model was cheating.
