
More Data Is Not Better and Machine Learning Is a Grind - sonabinu
https://scm.ncsu.edu/scm-articles/article/more-data-is-not-better-and-machine-learning-is-a-grind-just-ask-amazon
======
digitalzombie
More data is better.

You can reduce it via PCA, one of the many techniques in multivariate
statistics.

You can do ANOVA to select your predictors.

In general, you can use a subset of it with the tools that statistics has
provided.
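
A minimal sketch of both ideas, assuming scikit-learn and a synthetic labelled
dataset (all sizes and names here are invented for illustration):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = make_classification(n_samples=500, n_features=50,
                               n_informative=5, random_state=0)

    # PCA: compress 50 correlated features into 10 components
    X_pca = PCA(n_components=10).fit_transform(X)

    # ANOVA F-test: keep the 5 predictors most associated with the label
    X_anova = SelectKBest(f_classif, k=5).fit_transform(X, y)

    print(X_pca.shape, X_anova.shape)  # (500, 10) (500, 5)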

Complaining about messy data... welcome to the real world. As for complaining
about non-reproducible models: choose reproducible ones. I've mostly done
statistical models and forest-based algorithms, and they're all reproducible.

All I see in this post is complaints and no real solutions. The solution
that's given is what? Have less data?

> The results were consistent with asymptotic theory (central limit theorem)
> that predicts that more data has diminishing returns

CLT talks about sampling from the population infinitely. It doesn't say
anything about diminishing returns. I don't get how you go from sampling
infinitely to diminishing returns.

~~~
YeGoblynQueenne
>> All I see in this post is complaints and no real solutions. The solution
that's given is what? Have less data?

The solution is to direct research effort towards learning algorithms that
generalise well from few examples.

Don't expect the industry to lead this effort, though. The industry sees the
reliance on large datasets as something to be exploited for a competitive
advantage.

>> You can reduce it via PCA, one of the many techniques in multivariate
statistics.

PCA is a dimensionality reduction technique. It reduces the number of features
required to learn. It doesn't do anything about the number of examples that
are needed to guarantee good performance. The article is addressing the need
for more _examples_, not more _features_.

~~~
nimithryn
>>>Don't expect the industry to lead this effort, though. The industry sees
the reliance on large datasets as something to be exploited for a competitive
advantage.

This is only true for the Facebooks and Googles of the world. There are
definitely small companies (like the one I work for) trying very hard to
figure out how to build models that use less data because we don't have access
to those large datasets.

The industry is larger than just the Big N.

~~~
YeGoblynQueenne
Btw, if you have relational data and a few good people with strong computer
science backgrounds, rather than statisticians or mathematicians, have a look
at Inductive Logic Programming. ILP is a family of machine learning techniques
that learn logic programs from examples and background knowledge, themselves
expressed as logic programs. The sample efficiency is in a class of its own
and it generalises robustly from very little data[1].

I study ILP algorithms for my PhD. My research group has recently developed a
new technique, Meta-Interpretive Learning. Its canonical implementation is
Metagol:

[https://github.com/metagol/metagol](https://github.com/metagol/metagol)

Please feel free to email me if you need more details. My address is in my
profile.

___________________

[1] As a source of this claim I always quote this DeepMind paper where Metagol
is compared to the authors' own system (which is itself an ILP system, but
using a deep neural net):

[https://arxiv.org/abs/1711.04574](https://arxiv.org/abs/1711.04574)

 _ILP has a number of appealing features. First, the learned program is an
explicit symbolic structure that can be inspected, understood, and verified.
Second, ILP systems tend to be impressively data-efficient, able to generalise
well from a small handful of examples. The reason for this data-efficiency is
that ILP imposes a strong language bias on the sorts of programs that can be
learned: a short general program will be preferred to a program consisting of
a large number of special-case ad-hoc rules that happen to cover the training
data. Third, ILP systems support continual and transfer learning. The program
learned in one training session, being declarative and free of side-effects,
can be copied and pasted into the knowledge base before the next training
session, providing an economical way of storing learned knowledge._

~~~
nimithryn
Ah yes I am very familiar with ILP - thanks for sending these references!

~~~
YeGoblynQueenne
You're welcome, and what a pleasant surprise, it's rare to find people who
know about ILP in the industry :)

------
tedd4u
In image recognition, Google asserted in 2017 that they were unable to find
decreasing returns: "Performance increases logarithmically based on volume
of training data." Maybe for supply-chain and regression models there is a
limit, but for deep neural nets it seems the answer may be different.

Blog: [https://ai.googleblog.com/2017/07/revisiting-unreasonable-
ef...](https://ai.googleblog.com/2017/07/revisiting-unreasonable-
effectiveness.html)

Paper: [https://arxiv.org/abs/1707.02968](https://arxiv.org/abs/1707.02968)

~~~
FPGAhacker
Isn’t that the very definition of diminishing returns?

~~~
throwawaymath
Yes, precisely. The rate of logarithmic growth asymptotically decays toward
zero - it's the inverse of exponential growth. This is clearly illustrated in
typical graphs depicting logarithmic growth: [https://jamesclear.com/wp-
content/uploads/2015/04/logarithmi...](https://jamesclear.com/wp-
content/uploads/2015/04/logarithmic-growth-curve-1200x800.jpg)

So in point of fact, according to the cited paper Google is asserting there
_are_ diminishing returns to increasing the volume of data.

~~~
nostrademons
The two of you are asserting different hypotheses than the OP presented:
diminishing returns != decreasing returns. The Google paper found that
increased amounts of data always improved performance, but did so at a lower
_rate_ the more data that had already been provided. The distinction is the
first derivative vs. the second: with diminishing returns the first stays
positive while the second is negative.

The headline of this article is "More data is not better", which is a stronger
claim than diminishing returns - it's neutral or negative returns.
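
A toy illustration of the distinction, assuming a hypothetical log-linear
accuracy curve (the constants are invented):

    import numpy as np

    n = np.array([1e3, 1e4, 1e5, 1e6])  # training-set sizes
    acc = 0.5 + 0.05 * np.log10(n)      # hypothetical accuracy curve
    print(np.diff(acc))                 # always positive: more data never hurts here
    print(0.05 / (n * np.log(10)))      # d(acc)/dn: positive but shrinking, i.e. diminishing returns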

~~~
YeGoblynQueenne
>> The headline of this article is "More data is not better", which is a
stronger claim than diminishing returns - it's neutral or negative returns.

Well, if I was paying $10,000 for 10,000 examples (to collect, clean up,
process, train with, etc.), getting 90% accuracy and making $90,000 from the
trained model, and now I'm paying $10,000,000 for 10,000,000 examples,
getting 91% accuracy and making $91,000 from the trained model, I'm losing
money where before I was making some. That's "not better".

~~~
nostrademons
Few big tech companies pay for training data - it usually arises organically
out of usage data from their products. You need to collect this anyway for
product function / business metrics / abuse prevention, and build the cleaning
& processing pipelines. So the only marginal cost of feeding more training
data into your machine learning pipeline is the computational cost of training
it, which is usually tiny fractions of a penny per sample.

~~~
YeGoblynQueenne
If it were so simple to collect, process and train with (very) large amounts
of data, everyone would be doing it. Instead, it's just a few very large
companies that can do that: Google, Facebook et al.

Anyway, the cost per example doesn't have to be astronomical. If you need a
few millions of those, you can pay a fraction of a penny each and still have a
big black hole in your budget, unless you can significantly improve
performance.

~~~
nostrademons
Only a few big companies can do it because there's a bootstrapping problem. To
get large amounts of virtually free data, you need lots of users who have
signed up for giving you their data in exchange for a useful service. This was
much easier to achieve for companies started between 1995-2005, when the web
was young, because the Internet was such a huge leap forwards over what came
before it. Existing startups now have to compete with the products of these
giants, many of which have been enhanced by years of machine learning. That's
challenging.

To give you a sense of how cheap computing power is, my startup regularly
processes roughly 2B webpages with some complicated algorithms that need to go
node-by-node over the whole DOM tree. That's roughly 77TB of (gzipped) data,
and around 100 trillion nodes. It costs me a few hundred bucks of AWS time.
That's a rounding error for a big corp; a single data scientist's salary for
one day will run you around that much.

------
csours
My limited opinion: The Right Data is better than Big Data, if you can get it.

There was a push to get data out of app dbs and into big data repositories.
But then no one could use the big data because it made no sense. So then ML?

But if you already know what it means in the app db, just make it available in
a sensible format.

~~~
erikb
I think you can have both situations. I still remember a talk by a Google
person, maybe around 2010, where he showed clearly that if you try to make the
data "better" but in general have far too few data points, it can't be useful
either.

It's just that now, after almost a decade of pushing towards more IoT and more
Big Data, many companies have huge data lakes that they don't know how to make
use of.

So instead of applying one of these lessons it's probably best to see where
one is lacking (quantity or quality) and work on resolving that specific
problem accordingly.

------
judge2020
See the "Law of large numbers" \-
[https://en.wikipedia.org/wiki/Law_of_large_numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers)

Saying you need more data is like saying you need to flip a quarter 500
million times to get a better percentage estimate of heads vs. tails compared
to 1 million coin flips. After a certain point, having more data only helps
with identifying outliers and changes in behavior over time (when dealing with
human/natural data).
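
To put numbers on the coin-flip example (the standard error of the
heads-fraction estimate for a fair coin is sqrt(0.25/n)):

    from math import sqrt

    for n in (1_000_000, 500_000_000):
        print(n, sqrt(0.25 / n))  # ~0.0005 vs ~0.000022: 500x the flips, only ~22x the precision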

~~~
Scene_Cast2
You're using low-dimensional intuition. If your data has a non-linear
relationship between 1000 features (or more), then 500 million samples is
quantifiably better than 1 million samples. If these are boolean features, you
have 2^1000 possible data points. Your samples are _extremely_ sparse.

The current state of the art of ML assumes nonlinear relationships between all
parameters. It can't assume simpler & reasonable models, and therefore it
can't extrapolate easily with reduced data.
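
A back-of-the-envelope sketch of that sparsity:

    from math import log2

    features = 1000                  # boolean features -> 2**1000 possible points
    samples = 500_000_000
    print(log2(samples))             # ~28.9: 500M samples cover at most ~2**29 distinct points
    print(log2(samples) - features)  # log2 of the fraction of the space covered: about -971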

~~~
throwawayjava
_> The current state of the art of ML assumes nonlinear relationships between
all parameters. It can't assume simpler & reasonable models, and therefore it
can't extrapolate easily with reduced data._

I'm not really sure what "low-dimensional intuition" means, but I pretty
regularly build models that do not "assume nonlinear relationships between all
parameters".

~~~
gnulinux
As the dimension d goes to infinity, if you sample points from a d-dimensional
multivariate normal distribution, your samples will collect near a d-sphere
(of radius sqrt(d); the unit sphere after rescaling), whereas in low
dimensions they'd collect near the origin. This is because as d goes to
infinity, your samples become farther and farther apart. This is a geometric
intuition for why higher dimensions don't work like lower dimensions.
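
A quick simulation of that concentration, assuming numpy (the norm of a
standard d-dimensional Gaussian sample concentrates around sqrt(d)):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 1000):
        x = rng.standard_normal((10_000, d))
        r = np.linalg.norm(x, axis=1)
        # the mean norm grows like sqrt(d) while the spread stays O(1),
        # so the *relative* spread shrinks as d grows
        print(d, round(r.mean(), 2), round(r.std(), 2))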

~~~
DoctorOetker
I reserve the possibility that I am completely mistaken, in which case I
apologize in advance, on the condition that you refer me to an actual
calculation or derivation (not just a quote in the same spirit) in a paper or
text or textbook.

I have seen this insinuation multiple times: that samples "collect" near the
unit d-sphere when they are drawn from a normal distribution in high
dimension d.

From a physics perspective this is very familiar to me - not as a fact, but as
a misinterpretation.

I do believe the observation is very useful in an educational sense, as long
as it is pointed out as being a paradoxical _illusion_. In this sense I can
appreciate (even encourage) a professor or a TA showing this phenomenon, on
the condition that they finish up by explaining why this _seems_ to be the
case but is nevertheless a misinterpretation, and that they make the
connection with Jacobian determinants etc. clear.

Consider a normal distribution of any dimension (as high as you want), but I
will showcase the phenomenon even with low dimensions (here merely d=3) to
illustrate this has nothing to do with high dimensions.

Clearly the probability density is maximal in the center of the distribution.

In computer processing of data points, we typically loop over points,
calculate some hopefully interesting function on each sample, and then plot
the samples say by binning with equal bin sizes. A programmer typically
disregards transformation properties like the Jacobian determinant. Suppose
the value we calculate for each data point is the absolute length or distance
from the center. The further we go from the center the smaller the probability
density of the normal distribution becomes... _but the larger the volume of a
shell of radius r!_

Since we are binning with equal bin sizes (equal length intervals per bin),
then even though the actual probability density is highest near the center, we
will get relatively few samples at small lengths, because the volume under
consideration is small compared to that of a shell of equal thickness at a
larger radius (the area of a sphere grows quadratically with radius). However,
at even larger distances the exponential decay of the normal distribution
dominates, and the number of samples in the highest-radius bins decreases
again. So in between there will be a peak.

This explains the _fact_ that [ the probability density of [ the absolute
distance from the center over [ the sample points ] ] ] has a peak at some
non-zero length.

But it is a conceptual mistake to interpret this as if those sample points in
the original d-dimensional space form a dense shell on some "unit sphere"...
This is a pure illusion, and it illustrates that the interpreter is not
familiar with Jacobian determinants etcetera.

Consider volume (triple) integrals over some volume element dx dy dz, where
for symmetry you prefer integrating in a spherically symmetric coordinate
system (theta, phi, r): you cannot simply replace dx dy dz with
dtheta dphi dr; you need to use dV = dx dy dz = r^2 sin(phi) dtheta dphi dr.

It is this necessary factor that is ignored when processing sample by sample,
and that causes this illusion in the AI community. I did not follow
conventional machine learning courses, but given the learned language used
whenever I see statements to the effect of samples lying near the unit sphere
in high dimensions, I can only conclude it has its origins in 1) direct
observation or experience of plotting, in bins, the length of the vector,
without guidance in interpretation; or 2) guidance during education, where the
phenomenon is shown and the origin of the paradoxical illusion adequately
explained, with the illusory nature subsequently forgotten; or 3) a teaching
assistant having gone through 2) and showing the phenomenon to students
without emphasizing the illusory nature of the misinterpretation.

But perhaps I am wrong, and the normal distribution in high dimensions
actually has a higher probability density near its "unit sphere" if the
dimension is beyond some critical dimension d_c... but again, I'd like to see
a derivation showing it :)

EDIT: For more precise language: the samples do "collect" (reach a peak) at a
certain non-zero length or distance, but they do _not_ collect on a sphere of
that radius in the original space!
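
To make the radial picture concrete (a quick numerical sketch, numpy assumed):
the Jacobian contributes a factor r^(d-1), so the density of the norm is
proportional to r^(d-1) exp(-r^2/2), which peaks at r = sqrt(d-1) - exactly
the Maxwell-Boltzmann-style peak:

    import numpy as np

    d = 3
    r = np.linspace(0.01, 5, 10_000)
    radial = r**(d - 1) * np.exp(-r**2 / 2)    # Gaussian density times the Jacobian factor r^(d-1)
    print(r[radial.argmax()], np.sqrt(d - 1))  # both ~1.41 for d=3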

~~~
yorwba
It seems you're mostly arguing with the description of "collection", because
to you it implies high density in the full n-dimensional space. But the point
of the curse of dimensionality is very much that most points do _not_ lie in
regions of high density, because regions of low but still non-negligible
density are so much larger. If you prefer, you can say that they are "close"
to the surface of a sphere, without implying high density.

There are also descriptions of the curse that do not involve spatial analogies
at all. Assume that the data is independently identically distributed along
each dimension and is an outlier if it's sufficiently far along one dimension,
which happens with probability _p_ in the one-dimensional case. Then in _n_
dimensions, the proportion of outliers is 1 - (1 - _p_ )^n -> 1 for _n_ to
infinity. Most points are outliers along at least one dimension.
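
For example, with a hypothetical per-dimension outlier probability p = 0.01:

    p = 0.01  # hypothetical per-dimension outlier probability
    for n in (1, 10, 100, 1000):
        print(n, 1 - (1 - p)**n)  # 0.01, ~0.10, ~0.63, ~0.99996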

~~~
DoctorOetker
I am _not_ arguing with the word "collect", as it can be interpreted in best
faith.

I DO argue against the "collect _near the unit d-sphere_" if it does not come
with an explicit pointer or reference to the explanation of this illusion. I
don't care if one points to the Maxwell-Boltzmann speed vs. velocity vector
distribution, or to the Jacobian determinant, but one should point to
_something_, or else communication makes no sense. We communicate to teach and
learn. Only when explaining why there _appears_ to be a sphere in the higher
dimension, and how this illusion arises, is communicating about this
pseudosphere justified. The unit d-sphere makes no sense on the 1-dimensional
axis of absolute length onto which we project the samples. A reference to a
"unit d-sphere" only makes sense as residing in the original d-dimensional
sample space. But in _that_ space there is absolutely no packing of samples
near the peak radius as it _appears_ in the distribution of lengths.

I was not responding to the "curse of dimensionality". I show that this effect
already exists in low dimensions, and physicists are well acquainted with it,
because within their first years of university study they get drilled in 1)
Jacobians for non-linear coordinate transformations and 2) the velocity
distribution of molecules in the kinetic theory of gases, where there is a
similar plot for the absolute speed (Maxwell-Boltzmann) distribution [0]
showing a peak at a non-zero speed, even though the _velocity vector
distribution_ is a normal distribution... Every physicist worth his/her salt
immediately recognizes the phenomenon as relating to the Jacobian determinant,
sees that it has nothing to do with velocities aggregating on some sphere of
non-zero radius, and that this is a misinterpretation of the magnitude
distribution plot...

Clearly in this physics example d=3. Would you say d=3 already shows the curse
of dimensionality? I call bollocks, and suspect a misinterpretation of the
magnitude plot...

Again, the peak in the magnitude plot is very real; any reference to a sphere
with the radius of that peak, residing in the original d-dimensional space, is
purely imaginary!

Of course the person with simultaneously average height, avg weight, avg
income, avg capital, avg age, avg ... is very rare... but _less rare_ than a
similarly accurately specified person with _non-average_ weight but all other
variables still average...

[0] [https://en.wikipedia.org/wiki/Maxwell-
Boltzmann_distribution...](https://en.wikipedia.org/wiki/Maxwell-
Boltzmann_distribution#Derivation_and_related_distributions)

see the last 2 sections on velocity vector distribution, and speed
distribution...

------
rv-de
> The results were consistent with asymptotic theory (central limit theorem)
> that predicts that more data has diminishing returns.

That's not asymptotic theory or the central limit theorem. It's a modest
proposition that you learn to prove in your first semester studying math: a
bounded and monotonic function always converges.
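
(For reference, the nearest asymptotic statement to "diminishing returns" is
the sqrt(n) rate: the standard error of a sample mean shrinks like
sigma/sqrt(n), so each extra example buys less precision than the last. A
quick simulation, numpy assumed:)

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (100, 10_000):
        means = rng.standard_normal((500, n)).mean(axis=1)
        print(n, means.std())  # ~0.1 then ~0.01: 100x the data, only 10x less error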

> Higher velocity data does not improve percentage accuracy and makes accuracy
> levels worse!

What on earth is this supposed to mean? An ML algorithm doesn't care about how
fast data "arrives". "Higher velocity" is marketing lingo related to their
Kinesis Data Streams service. But I assume he is referring to the usual loss
of performance observed with online-learning algorithms: feeding in examples
one by one will mostly just overwrite learned generalizations. Or maybe he is
indicating that they are compromising algorithm quality for faster execution.

> It is important to pick a single metric to improve, even if it is not
> perfect, but to use it as the basis for measuring performance improvement.

I think that is wrong. Single metrics never capture complex behavior well, and
they will lead to distortions if fed back into the system. I mean, it's normal
to use one metric for the error. But presenting this as the real deal sounds
ridiculous to me, as it is being done most of the time anyway and will
probably change in the future. Backprop only works with one metric at the
moment - that's just a fact.

> Pat noted that improvement and learning is often very slow – sort of like a
> slow weight loss program, where you lose weight very slowly. Processes may
> only be improving by 20 basis points a quarter, or 80 basis points a year.
> That isn’t a lot, but over a decade, it really makes a difference.

Now he's contradicting himself as he gives a reason for why big data _is_
beneficial.

> His final word of advice – students should be broad in their knowledge of a
> lot of things, but need to be very deep in one area.

And again he is contradicting himself, because if you draw an analogy from a
student to a learning algorithm, he is now giving TWO orthogonal metrics to
optimize for.

~~~
darawk
> What on earth is this supposed to mean? An ML algorithm doesn't care about
> how fast data "arrives".

My guess is that he's referring to the resolution of a time series. E.g. going
from monthly to weekly data points is 'more' data, but can make your models
worse.
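
(A minimal illustration of changing resolution, assuming pandas and a made-up
daily series:)

    import pandas as pd

    daily = pd.Series(range(365),
                      index=pd.date_range("2019-01-01", periods=365, freq="D"))
    weekly = daily.resample("W").mean()   # ~52 points
    monthly = daily.resample("M").mean()  # 12 points: fewer but smoother observations
    print(len(daily), len(weekly), len(monthly))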

~~~
throwawaymath
Good point. If I retrofit the term "resolution" in place of "velocity", the
author's statement does make a lot more sense. If that's what was intended
they really should have used that terminology, because I was similarly
confused by the term "velocity."

Whenever I've worked with time series data I've always referred to the
granularity of the time dimension as its "resolution" - this is also common
in geospatial data. I don't think (but am happy to be corrected) that
"velocity" is a term of art in time series analysis.

~~~
darawk
Ya I was confused by it at first too, it definitely could have been more
clear. But it's also been my experience with time series that more resolution
isn't always better.

------
ausbah
Parts of this piece read to me as evidence for the importance of
basic-to-mid-level classes/foundations in statistics, probability, and
econometrics for machine learning. Learning to build a model doesn't mean you
can always use it well.

~~~
bunderbunder
Indeed. The first one, for example, reads to me as a dead ringer for a case
where someone didn't bother to think about ecological validity until _after_
they started having problems in production.

------
SemiTom
How and where data gets scrubbed will have significant consequences. While
clean data is more valuable than dirty data (the mass of raw data collected by
sensors), there is hidden value in those large masses of data: they can show
broad trends and patterns that are not obvious in clean data. Missing those is
the modern equivalent of not seeing the forest for the trees.
[https://semiengineering.com/data-vs-
physics/](https://semiengineering.com/data-vs-physics/)

~~~
jmatthews
This may be a bit elementary for this crowd, but regarding the balance of data
cost vs. capturing the most significant features: we use a simple decision
tree as a significance cluster and optimize data munging around these
clusters.

On some level it is anti-diversity, but given real-world constraints it has
yielded the best results. Any thoughts or links regarding this topic would be
appreciated.
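
(One way to read this, as a hedged sketch with scikit-learn and synthetic
data: rank features by a shallow tree's importances and spend the data-munging
budget there.)

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1_000, n_features=20,
                               n_informative=4, random_state=0)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    top = tree.feature_importances_.argsort()[::-1][:4]
    print(top)  # indices of the features worth the munging effort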

------
YeGoblynQueenne
>> A model is now running in production but not producing the same results (or
at the same level of accuracy) as what was demonstrated during
experimentation… no one knows why

The simplest explanation is that the training sample (meaning the entire
dataset; not just the training partition in cross-validation) was not drawn
from the same distribution as the distribution of the data that is being
processed in production.

Machine learning is guaranteed some performance bounds under PAC learning,
assuming that the training dataset and the unseen data, on which the trained
model will be used, come from the same distribution. Absent this, performance
cannot be predicted. You might as well classify stuff by throwing a bunch of
dice.

Unfortunately, this assumption, that we're representing the real-world
distribution in our training dataset, cannot be justified as long as we don't
know the ground truth in the real world. Which is most of the time.

Essentially, there's no way to know for sure that a model that has performed
very well in experiments will continue to do so once it's deployed in
production.
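
(For the flavour of those guarantees, a minimal sketch of the classic
sample-complexity bound for a finite hypothesis class in the realizable case:
m >= (ln|H| + ln(1/delta)) / epsilon examples suffice for error <= epsilon
with probability >= 1 - delta, provided training and deployment distributions
match. The numbers below are illustrative only.)

    import math

    def pac_sample_bound(hypothesis_count: int, eps: float, delta: float) -> int:
        # Finite hypothesis class, realizable case:
        # m >= (ln|H| + ln(1/delta)) / eps
        return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / eps)

    print(pac_sample_bound(10**6, eps=0.01, delta=0.05))  # 1682 examples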

~~~
Scene_Cast2
I agree with the intuition, but I'm not sure I agree with the takeaways (in
particular the lack of knowledge of the ground truth). I mean, it _is_ a
problem, but there are bigger, more practical mechanisms at work too.

In terms of sample distribution - let's say that you have an online model
serving traffic, and the outcomes of that traffic are logged as your training
& holdout data. Then, when you train a wildly different model, it may start
picking other things to serve that the original model never served - things
for which there is no training data. This is a pretty fundamental and hard
problem.

Second, funny thing, but sometimes you can't use the same metric for training
as you do for evaluating your model online. I don't want to (can't) get into
details, but it's also a pretty fundamental and hard problem.

Last, production traffic always shifts. Given how much current models are
"reactive" and "compression-like", as opposed to truly generalizing, they
perform worse on fresh traffic that has changed slightly from, e.g., 5 days
ago. If your training data is "day -10 to day -3" and your holdout is "day -2
to day 0", models will likely perform worse on the holdout than pure
overfitting theories would have you assume (mind you, still plenty well enough
to have a ton of value), but when you launch on day 1, they will perform worse
still, as day 1 is different from days -10 to -3.

I haven't done the analysis, but I'd assume that non-historical models, where
you don't need to structure your holdout data to be from the "future", would
perform better when you first launch them online.

~~~
YeGoblynQueenne
>> In terms of sample distribution - let's say that you have an online model
serving traffic, and the outcomes of that traffic are logged as your training
& holdout data. Then, when you train a wildly different model, it may start
picking other things to serve that the original model never served - things
for which there is no training data. This is a pretty fundamental and hard
problem.

I think we're talking about the same thing here: that the real world can be
very different from your training sample. Sorry, I think I contracted the
jargon flu this week :)

------
sgt101
Statistically they must be right: most of the statistical improvement by an
ML system is provided by the first "large", "good" dataset; after that, gains
are sure to be marginal. But a lot of problems aren't really statistical in
nature; they are scientific, in the sense that if you can discover a new,
better theory and use it, then you can catch the few cases where you often do
the right thing without knowing why, and sort those out. For example, you may
be learning to diagnose a disease with one common driver, while there is a
rare disease that responds to the same treatment but has a slightly different
presentation. You often treat these patients because of false positives, but
if you learn to properly classify them then you can almost always catch them;
the statistical gain is negligible, but the value is tangible.

~~~
sgt101
I should say that the above is an example; I am very cautious about using ML
for medical diagnosis.

------
superconformist
After I collect a dataset I save duplicate entries with the text reversed.
More data is good data. Except now my machine learning robot is dyslexic.

~~~
why_only_15
Of course you're joking here, but for images people really do mirror the image
to augment their training data, and this has been shown to be beneficial.
There are other augmentations you can do as well; collectively these are
called "data augmentation", or "synthetic data" when the extra examples are
generated outright.

~~~
gnulinux
This is not surprising: if image A is an apple, its mirror image A' is also an
apple. Adding A' to your dataset is just plain ol' regularization; it is meant
to lower variance, to prevent overfitting.

~~~
jzwinck
It will also make the flag of Côte d’Ivoire be perceived as that of Ireland.

------
myWindoonn
Title was editorialized from "More Data is Not Better and Machine Learning is
a Grind…. Just Ask Amazon"

Was it the ellipsis or the reference to the employer of the lecturers? Either
way, please don't change titles needlessly. Thanks!

~~~
ggggtez
Disagree. "...Just ask Amazon" is needless, clickbaity extra words. Ultimately
the opinion is that of the writer.

~~~
killjoywashere
He was writing a summary of a talk by an Amazon exec. I suppose "per Amazon
exec" would have been more exact, but are we just not allowed to have tone
anymore? Is style out?

------
paradoxparalax
Not speaking untrue generalizations is better. Not that the opposite is true,
by any means: generalizations are extremely rarely found to be true.

------
__coaxialcabal
As it turns out, more data is a far second to doing anything at all.

------
known
Diversity in data is better.

