I Got More Data, My Model Is More Refined, but My Estimator Is Getting Worse [pdf] 129 points by charlysl 11 days ago | 17 comments

 I am the OP. For what it's worth, I barely understand the paper (it's way too technical for me), so let me give some context as to why I posted this. It was referred to in a course I've just started self-learning, Berkeley's Principles and Techniques of Data Science (aka Data 100), and the lecturer wanted to highlight the crucial importance of data acquisition in the Data Science lifecycle, sampling in particular, given that mistakes at this stage will propagate to the later stages. My understanding of his example (~1h10 into https://youtu.be/JtwBwogRZkI) is that a probability sample of size 400 (a small random sample from the whole population) is often better than an administrative sample with millions of records (of the kind you can download from gov sites). The reason is that an arbitrary sample (as opposed to a random one) is very likely to be biased, and, if it is large enough, its confidence interval (which doesn't really make sense except for probability samples) will be so narrow that, because of the bias, it will rarely, if ever, include the true value we are trying to estimate. On the other hand, the small random sample's (wider) confidence interval is very likely to include the true value. The paper seems to elaborate on this, which I find interesting, and I was hoping for someone to explain it [EDIT: the paper, that is, I think I understand the lecture].
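 For anyone who wants to see the confidence-interval argument above in numbers, here is a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from the lecture or the paper: the true proportion (0.50), the proportion among units that end up in the large sample (0.52), the sample sizes, and the use of the normal-approximation interval.

```python
import math
import random


def wald_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def coverage(p_sampled, n, true_p, trials, rng):
    """Fraction of trials whose interval contains true_p.

    p_sampled is the proportion among the units that land in the sample:
    equal to true_p for a probability sample, different under selection bias.
    """
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < p_sampled for _ in range(n))
        lo, hi = wald_ci(successes, n)
        hits += lo <= true_p <= hi
    return hits / trials


rng = random.Random(0)
TRUE_P = 0.50    # assumed true population proportion
BIASED_P = 0.52  # assumed proportion among units in the large, biased sample

small_random = coverage(TRUE_P, 400, TRUE_P, trials=100, rng=rng)
large_biased = coverage(BIASED_P, 50_000, TRUE_P, trials=100, rng=rng)

print(f"n=400 probability sample, CI coverage: {small_random:.2f}")
print(f"n=50,000 biased sample, CI coverage:   {large_biased:.2f}")
```

 With these made-up numbers the bias (0.02) is several times larger than the half-width of the large sample's interval, so that interval almost never contains 0.50, while the wide interval from the 400-unit random sample usually does.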
 Your lecturer must have meant to refer to this paper, also by Xiao-Li Meng: https://statistics.fas.harvard.edu/files/statistics-2/files/... It directly answers the question of whether a small probability sample or a large but biased sample is better, using only modestly technical results (at least for the first part). It’s a very nice paper.
 Wow, thanks, it does indeed seem far more relevant to the lecture, and, as you said, easier to follow. EDIT: well, as it happens, the course's online book has a link (https://youtu.be/yz3jOIHLYhU) to a YouTube lecture based on the very paper you mentioned. EDIT: slides relevant to the paper (not in the YouTube lecture): https://www.bnl.gov/nysds18/files/talks/session2/Meng-nysds1...
 > Let's say your data really does fit a straight line, but the later data points you get are really high variance. Intuitively, including the later data points will make your estimated parameters more noisy than if we had simply thrown them out. > So in this sense, 'more data is worse', because your model assumes all the data points are the same, but the later points you collected are actually worse (noisier). The first example in the paper makes this less than obvious by writing, in formula form, that the error variance increases with time, but what's described is essentially this picture: https://en.wikipedia.org/wiki/Heteroscedasticity#/media/File... So it's intuitively easy to grok that if the more recent variances are bad enough, it's probably better not to use those points at all (or to downweight them properly, as MLE or GLS would). The described result is a bit more subtle, though. The classical results are that under heteroskedasticity OLS is not an efficient estimator [1] (you could do better with, for example, GLS/WLS), but it is still unbiased [2] (the estimated betas are centered around the true betas) and consistent [3] (the probability limit of the OLS beta is the true beta, by the Law of Large Numbers). These properties are what the paper calls the "robustness" of OLS in the first example. The consistency of OLS even with heteroskedastic errors is what makes it surprising that adding more (true!) observations can make your estimates worse. Intuitively, most people expect consistent estimators to get better with more observations.
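 To make the "noisier later points can hurt OLS" intuition concrete, here is a small sketch that uses the exact variance formulas rather than a simulation. The setup is hypothetical and is not the paper's example: a line through the origin, regressor x_t = t for t = 1..100, independent errors, and an error standard deviation of 1 for the first 50 points and 10 for the last 50.

```python
def ols_slope_variance(xs, sigmas):
    """Exact variance of the no-intercept OLS slope when Var(e_t) = sigma_t**2.

    b = sum(x*y) / sum(x*x), so Var(b) = sum(x^2 * sigma^2) / (sum(x^2))^2.
    """
    sxx = sum(x * x for x in xs)
    return sum((x * s) ** 2 for x, s in zip(xs, sigmas)) / sxx ** 2


def wls_slope_variance(xs, sigmas):
    """Exact variance of the inverse-variance-weighted (WLS/GLS) slope."""
    return 1.0 / sum((x / s) ** 2 for x, s in zip(xs, sigmas))


T = 100
xs = [float(t) for t in range(1, T + 1)]
# Assumed noise pattern: quiet early points, noisy late points.
sigmas = [1.0 if t <= 50 else 10.0 for t in range(1, T + 1)]

var_early = ols_slope_variance(xs[:50], sigmas[:50])  # OLS, first 50 points only
var_all = ols_slope_variance(xs, sigmas)              # OLS, all 100 points
var_wls = wls_slope_variance(xs, sigmas)              # WLS, all 100 points

print(f"OLS, first 50 points: {var_early:.3e}")
print(f"OLS, all 100 points:  {var_all:.3e}")
print(f"WLS, all 100 points:  {var_wls:.3e}")
```

 Under these assumptions, unweighted OLS on all the data has a larger slope variance than OLS on the early half alone, while properly weighting the noisy points gives the smallest variance of the three, which is the "not efficient, but you can do better with GLS/WLS" point.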
 Thank you for your patient explanation. If I understood correctly, the paper isn't relevant to what was being explained in the lecture (data acquisition/sampling), but rather to a later stage of the Data Science lifecycle, prediction/modeling.
 The course you refer to and the paper are somewhat different. The course simply says that, for a given problem, there are good datasets and bad datasets, and the bad data may be bad no matter how big it is. Bad means biased here. The paper considers a somewhat different situation. There you have a single dataset, for which you know in advance [1] that some points in it, and you know exactly which, are noisier than others. The question is then whether to use these points, and how. [1] But you know this because you assume a model, a particular time-series model in this case. It's not necessarily in the data itself.
 Thanks for pointing this out. It looks like the lecturer put a screenshot of the paper in the slides because the title is clickbaity, to liven things up, not because it was relevant to the subject at hand.
 It is well known among practitioners that not all models have equal learning capacity and that additional data is wasted on those with lower capacity: if you plot a performance metric against dataset size, you can see it saturate and stop increasing. The point where it saturates is a proxy for the model's learning capacity. One of the pros of very big models such as deep networks is that they never seem to saturate in this manner: more data always gives better results, because there are ways to increase the model's learning capacity. We've recently learned that deep nets work in a deep overfitting mode, beyond the usual U-shaped regime of the bias-variance tradeoff that can be seen in the empirical risk curve. See this post for a good overview: https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neu... It was a surprise to me that adding more data could lower a performance metric, though; that's counter-intuitive.
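 The saturation described above can be reproduced with a toy example. This is only a sketch under assumed settings chosen for illustration: the true relationship is y = x**2 on [-1, 1] with Gaussian noise of standard deviation 0.1, and the low-capacity model is a straight line fitted by least squares.

```python
import random


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def holdout_mse(a, b, grid):
    """Mean squared error of the fitted line against the noiseless truth x**2."""
    return sum((a + b * x - x * x) ** 2 for x in grid) / len(grid)


rng = random.Random(0)
grid = [-1 + 2 * i / 1000 for i in range(1001)]  # evaluation points on [-1, 1]
sizes = [20, 200, 2000, 20000]
errors = []
for n in sizes:
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 0.1) for x in xs]  # assumed noise level
    a, b = fit_line(xs, ys)
    errors.append(holdout_mse(a, b, grid))
    print(n, round(errors[-1], 4))
```

 In this setup the error flattens out near 4/45 (about 0.089), the approximation error of the best possible straight line for x**2 on [-1, 1], and going from 2,000 to 20,000 points barely changes it. A model with more capacity (for instance, one with a quadratic term) would not hit that floor.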
 > It is well known among practitioners that not all models have equal learning capacity and that additional data is wasted on those with lower capacity Not just practitioners: IIRC there's a whole body of theory on this under the name of Vapnik-Chervonenkis dimension.
 (2013) - edit: yeah, not criticizing OP, it's just useful metadata
 I would indeed have liked to append the date to the title, but Too Long
 The link is a preprint. Here is the published version: https://www.tandfonline.com/doi/full/10.1080/07474938.2013.8....
 So? I just finished it and it was still edifying.
 It is customary for the vintage of old articles to be added in the title to make it clear that they may not be current. GP is attempting to bring the vintage of this article to the attention of a moderator to make such a change.
 Huh, the bit about just plugging in the real values for variables being a bad idea is very interesting. Pretty sure I've done that in the past, without really thinking too much about it. Does anyone know of books that cover stuff like that? There are plenty of books explaining Simpson's paradox to laymen, but I want something that explains common statistical pitfalls for people who already have some education in the area.
 If adding data AND variables to your model is making it less accurate, this is probably good evidence that your first model was cheating.
