It was referred to in a course I've just started self-studying, Berkeley's Principles and Techniques of Data Science (aka Data 100). The lecturer wanted to highlight the crucial importance of data acquisition in the Data Science lifecycle, sampling in particular, given that mistakes at this stage propagate to all the later ones. My understanding of his example (~1h10 into https://youtu.be/JtwBwogRZkI) is that a probability sample of size 400 (a small random sample from the whole population) is often better than an administrative sample of millions (the kind you can download from government sites). The reason is that an arbitrary sample (as opposed to a random one) is very likely to be biased, and, if it's large enough, its confidence interval (which strictly speaking only makes sense for probability samples) will be so narrow that, because of the bias, it will rarely if ever include the true value we're trying to estimate. The small random sample, on the other hand, is very likely to include the true value in its (wider) confidence interval.
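To make that concrete, here's a quick simulation (my own sketch with made-up numbers: true mean 0, and a 0.05 selection bias baked into the big sample; I capped the "administrative" sample at 100,000 per trial just to keep the loop fast):

    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, n_trials = 0.0, 1000

    def ci_covers(sample, target):
        # rough 95% normal-approximation CI for the mean
        m = sample.mean()
        se = sample.std(ddof=1) / np.sqrt(len(sample))
        return m - 1.96 * se <= target <= m + 1.96 * se

    small_hits = big_hits = 0
    for _ in range(n_trials):
        small = rng.normal(true_mean, 1.0, size=400)            # random sample
        big = rng.normal(true_mean + 0.05, 1.0, size=100_000)   # biased sample
        small_hits += ci_covers(small, true_mean)
        big_hits += ci_covers(big, true_mean)

    print(f"small random sample covers truth: {small_hits / n_trials:.0%}")  # ~95%
    print(f"large biased sample covers truth: {big_hits / n_trials:.0%}")    # ~0%

The big sample's interval is extremely tight around the wrong value, so it essentially never covers the truth, while the 400-point random sample covers it about 95% of the time; at millions of points the effect only gets worse.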
The paper seems to elaborate on this, which I find interesting, and I was hoping someone could explain it [EDIT: the paper, that is - I think I understand the lecture].
It directly answers the question of whether a small probability sample or large but biased sample is better, using only modestly technical results (at least for the first part). It’s a very nice paper.
EDIT: well, as it happens, the course's online book has a link (https://youtu.be/yz3jOIHLYhU) to a YouTube lecture based on the very paper you mentioned.
EDIT: slides relevant to paper (not in youtube lecture): https://www.bnl.gov/nysds18/files/talks/session2/Meng-nysds1...
The paper is mostly concerned with model misspecification, in the sense that your model is not actually capable of describing reality (no matter what parameters you give it). The first example uses the OLS model,
Y = βX + ε
where the implicit assumption is that your observations (Y) are purely a linear function of your input(s) (X) plus an independent, equal-variance, normally distributed error (ε).
Let's say your data really does fit a straight line, but the later data points you get have really high variance. Intuitively, including those later points will make your estimated parameters noisier than if you had simply thrown them out.
So in this sense, 'more data is worse', because your model assumes all the data points are the same, but the later points you collected are actually worse (noisier).
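Here's a toy Monte Carlo of that (my own sketch, not from the paper: true slope 2, with the "later" half of the x-range given 10x the noise):

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 2.0
    x_clean = np.linspace(0, 1, 50)   # early, low-noise points
    x_noisy = np.linspace(1, 2, 50)   # later, high-noise points

    slopes_clean, slopes_all = [], []
    for _ in range(2000):
        y_clean = beta * x_clean + rng.normal(0, 0.5, 50)
        y_noisy = beta * x_noisy + rng.normal(0, 5.0, 50)   # 10x the noise
        # OLS slope through the origin: sum(x*y) / sum(x^2)
        slopes_clean.append((x_clean @ y_clean) / (x_clean @ x_clean))
        x_all = np.concatenate([x_clean, x_noisy])
        y_all = np.concatenate([y_clean, y_noisy])
        slopes_all.append((x_all @ y_all) / (x_all @ x_all))

    for name, s in [("clean half only", slopes_clean), ("all points", slopes_all)]:
        s = np.asarray(s)
        print(f"{name}: mean = {s.mean():.3f}, sd = {s.std():.3f}")

Both estimates stay centered on the true slope (OLS is still unbiased), but the spread more than triples when the noisy points are included - exactly the "more data is worse" effect.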
There are other, more obvious ways this can happen - for example, if you assume that wealth is normally distributed and then try to estimate the distribution's parameters, any small sample plus a single billionaire will get you nonsense: your model will be far less tolerant of sampling noise than a better-specified one.
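A quick illustration of the billionaire problem (hypothetical numbers: 99 lognormal incomes plus one billionaire):

    import numpy as np

    rng = np.random.default_rng(0)
    incomes = rng.lognormal(mean=10.5, sigma=0.6, size=99)  # median income ~$36k
    sample = np.append(incomes, 1e9)                        # add one billionaire

    # the normal fit returns absurd parameters; a robust summary barely moves
    print(f"normal fit: mean = {sample.mean():,.0f}, sd = {sample.std():,.0f}")
    print(f"median: {np.median(sample):,.0f}")

The fitted "mean" comes out around ten million dollars with a standard deviation near a hundred million - nonsense as a description of the other 99 people, while the median stays put.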
The video's observation only deals with data: you might prefer quality over quantity.
E.g. compare a wide cone centered at zero (the true average) with a narrow cone centered away from zero - depending on what we care about, we may prefer some (hopefully uncorrelated) noise in our estimates to learning some systematic untruth.
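In numbers, the usual squared-error decomposition MSE = bias² + variance makes the tradeoff explicit (made-up figures: per-observation sd of 1, a bias of 0.1 in the big sample):

    # MSE = bias^2 + variance; variance shrinks with n, bias doesn't
    sigma = 1.0
    small_random = 0.0**2 + sigma**2 / 400        # unbiased mean of n = 400
    big_biased = 0.1**2 + sigma**2 / 1_000_000    # bias 0.1, n = 1,000,000
    print(small_random, big_biased)               # 0.0025 vs ~0.0100

Past a certain size the extra data stops helping: the variance term is already negligible, and the bias term never shrinks.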
So these are two different ways more data can be counterproductive: either your model is biased in a way that more data can be hurtful, or your sampling procedure for larger sets is increasingly biased.
Note that older stats/econometric techniques are more vulnerable to model misspecification because they have many fewer parameters than, say, a DNN; conversely, more modern ML techniques are able to closely replicate input data. This helps explain the difference in focus.
> So in this sense, 'more data is worse', because your model assumes all the data points are the same, but the later points you collected are actually worse (noisier).
The first example in the paper makes this less than obvious by writing it in formula form (the error variance increases with time), but what's described is essentially this picture: https://en.wikipedia.org/wiki/Heteroscedasticity#/media/File...
So it's intuitively easy to grok that if the variance of the more recent points is bad enough, it's probably better not to use them at all (or to downweight them properly, as with MLE or GLS).
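A sketch of the downweighting (my own hand-rolled inverse-variance weighted least squares, on a toy setup like the one upthread):

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 2.0
    x = np.linspace(0, 2, 100)
    sigma = np.where(x <= 1, 0.5, 5.0)    # later points are 10x noisier

    ols, wls = [], []
    for _ in range(2000):
        y = beta * x + rng.normal(0, sigma)
        w = 1 / sigma**2                  # WLS weight = inverse variance
        ols.append((x @ y) / (x @ x))
        wls.append(((w * x) @ y) / ((w * x) @ x))

    print(f"OLS slope sd: {np.std(ols):.3f}")   # unbiased but inefficient
    print(f"WLS slope sd: {np.std(wls):.3f}")   # downweighting restores efficiency

If you'd rather not hand-roll it, statsmodels' WLS takes the same inverse-variance weights.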
The described result is a bit more subtle, though. The classical results are that under heteroskedasticity OLS is not an efficient estimator (you could do better with, for example, GLS/WLS), but it is still unbiased (the estimated betas are centered around the true betas) and consistent (the probability limit of the OLS beta is the true beta, by the Law of Large Numbers). These are what the paper calls the "robustness" of OLS in the first example. The consistency of OLS even with heteroskedastic errors is what makes it surprising that adding more (true!) observations can make your estimates worse: intuitively, most people expect consistent estimators to get better with more observations.
The paper considers a somewhat different situation: you have a single dataset in which you know in advance that some points (and you know exactly which ones) are noisier than others. The question is then whether to use those points at all, and how.
But you only know this because you assume a model - a particular time-series model in this case. It's not necessarily in the data itself.
One of the selling points of very big models such as deep networks was that they never seem to saturate in this manner - more data always gave better results, because there are always ways to increase the model's learning capacity.
We've recently learned that deep nets can work in a heavily overfitted (interpolation) regime, beyond the usual U-shaped bias-variance-tradeoff region of the empirical risk curve. See this post for a good overview:
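You can also see the second descent without a GPU. Here's a minimal sketch (my own construction, not from any post: minimum-norm least squares on random ReLU features; the spike near width ≈ number of training points is the interpolation threshold, and exact numbers vary with the seed):

    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, d = 40, 500, 10

    def avg_test_mse(width, trials=30):
        errs = []
        for _ in range(trials):
            w_true = rng.normal(size=d)
            X = rng.normal(size=(n_train, d))
            Xt = rng.normal(size=(n_test, d))
            y = X @ w_true + rng.normal(0, 0.3, n_train)
            W = rng.normal(size=(d, width))              # random ReLU features
            F, Ft = np.maximum(X @ W, 0), np.maximum(Xt @ W, 0)
            coef = np.linalg.lstsq(F, y, rcond=None)[0]  # min-norm solution
            errs.append(np.mean((Ft @ coef - Xt @ w_true) ** 2))
        return np.mean(errs)

    for width in [5, 20, 40, 80, 320]:   # peak expected near width == n_train
        print(f"width {width:4d}: test MSE = {avg_test_mse(width):.2f}")

Test error first falls, spikes right where the model can just barely interpolate the training set, then falls again as width grows well past it - the "double descent" curve.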
It was a surprise to me that adding more data could lower a performance metric though - that's counter-intuitive.
Not just practitioners: IIRC there's a whole body of theory under the heading of Vapnik-Chervonenkis dimension.
Does anyone know of books that cover stuff like that? There are plenty of books explaining Simpson's paradox to laymen, but I want something that explains common statistical pitfalls to people who already have some education in the area.