You can reduce it via PCA, one of the many techniques in multivariate statistics.
You can do an ANOVA to select your predictors.
In general, you can use a subset of it using the tools that statistics has provided.
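For anyone who wants to try this, here is a minimal sketch of the ANOVA-style selection with scikit-learn (the toy data and the choice of k=10 predictors are placeholders; PCA from sklearn.decomposition would slot in the same way):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    # toy data: 1000 examples, 50 features, binary target driven by the first two features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

    # ANOVA F-test: keep the 10 predictors most associated with the target
    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)

    print(X_selected.shape)                    # (1000, 10)
    print(selector.get_support(indices=True))  # indices of the kept predictors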
Complaining about messy data... welcome to the real world. As for complaining about non-reproducible models, choose reproducible ones. I've mostly done statistical models and forest-based algorithms, and they're all reproducible.
All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
> The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns
CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.
The solution is to direct research effort towards learning algorithms that generalise well from few examples.
Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.
>> You can reduce it via PCA one of the many techniques in multivariate statistic.
PCA is a dimensionality reduction technique. It reduces the number of features required to learn. It doesn't do anything about the number of examples that are needed to guarantee good performance. The article is addressing the need for more examples, not more features.
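To make the distinction concrete, a small sketch (the digits dataset and the choice of 8 components are arbitrary): PCA shrinks the number of columns (features), while the number of rows (examples), which is what the article is about, stays exactly the same.

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)              # 1797 examples, 64 features
    X_reduced = PCA(n_components=8).fit_transform(X)

    print(X.shape)          # (1797, 64)
    print(X_reduced.shape)  # (1797, 8) -- fewer features, same number of examples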
This is only true for the Facebooks and Googles of the world. There are definitely small companies (like the one I work for) trying very hard to figure out how to build models that use less data because we don't have access to those large datasets.
The industry is larger than just the Big N.
I study ILP algorithms for my PhD. My research group has recently developed a new technique, Meta Interpretive Learning. Its canonical implementation is Metagol:
Please feel free to email me if you need more details. My address is in my profile.
 As a source of this claim I always quote this DeepMind paper where Metagol is compared to the authors' own system (which is itself an ILP system, but using a deep neural net):
> ILP has a number of appealing features. First, the learned program is an explicit symbolic structure that can be inspected, understood, and verified. Second, ILP systems tend to be impressively data-efficient, able to generalise well from a small handful of examples. The reason for this data-efficiency is that ILP imposes a strong language bias on the sorts of programs that can be learned: a short general program will be preferred to a program consisting of a large number of special-case ad-hoc rules that happen to cover the training data. Third, ILP systems support continual and transfer learning. The program learned in one training session, being declarative and free of side-effects, can be copied and pasted into the knowledge base before the next training session, providing an economical way of storing learned knowledge.
As to the Big N (good one), what I meant to say is that I don't see them trying very hard to undo their own advantage by spending much effort developing machine learning techniques that rely on, well, little data. That would truly democratise machine learning - much more so than the release of their tools for free, etc. But then, if everyone could do machine learning as well as Google and Facebook et al, where would that leave them?
Yes it does. It even implies it in the name 'limit'. In the limit of infinitely many samples, we approximate a normal distribution. This approximation has diminishing returns.
>All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
It's fine to point out problems without giving solutions. You seem very aggravated.
Also, if by forests you meant random forests... those aren't especially reproducible. Understanding what's going on is not always easy, and most people seem to misinterpret the idea of "variable importance" when you have a mix of categorical and numeric features. Decision trees and linear regressions are nice and reproducible.
I mean, that's the crux: if you have bad data you will have bad results. Data cleanup/transformation is key for anything (reporting, etc.) and not just limited to ML because it's sexy these days.
So in point of fact, according to the cited paper Google is asserting there are diminishing returns to increasing the volume of data.
The headline of this article is "More data is not better", which is a stronger claim than diminishing returns - it's neutral or negative returns.
* As the amount of product data grew, the benefits were negligible
* More observations per product were important
* The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns.
In any case, this seems to be a case of "diminishing returns."
Well, if I was paying $10,000 for 10,000 examples (to collect, clean up, process, train with, etc.), getting 90% accuracy and making $90,000 from the trained model, and now I'm paying $10,000,000 for 10,000,000 examples, getting 91% accuracy and making $91,000 from the trained model, I'm losing money where before I was making some. That's "not better".
Anyway, the cost per example doesn't have to be astronomical. If you need a few millions of those, you can pay a fraction of a penny per example and still have a big black hole in your budget, unless you can significantly improve performance.
To give you a sense of how cheap computing power is, my startup regularly processes roughly 2B webpages with some complicated algorithms that need to go node-by-node over the whole DOM tree. That's roughly 77TB of (gzipped) data, and around 100 trillion nodes. It costs me a few hundred bucks of AWS time. That's a rounding error for a big corp; a single data scientist's salary for one day will run you around that much.
Logarithmic growth slows down, but not asymptotically. Think about it: what would the asymptote be? (There is none.)
Well, in that context every kind of (monotonic) growth is asymptotic, so the word has no meaning.
Since you made the correction you're probably already familiar with these, but for the benefit of others: https://en.wikipedia.org/wiki/Slowly_varying_function
In fact, unless the cost of adding and using training data decreases exponentially over time, it's a mathematical certainty that performance which only grows logarithmically in n will quickly run into expensive, diminishing returns for using more data.
So in the context of this Google paper, you could conceive of a situation where training data actually becomes easier to load (albeit subexponentially) and still becomes too expensive to use relatively quickly.
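To spell that out with a toy model (the log form, the flat per-example cost, and a fixed dollar value per accuracy point are all assumptions for illustration, not anything from the paper): if the model's value is v * acc(n) with acc(n) = a + b ln n, and the data costs c * n, then

    \frac{d}{dn}\bigl(v\,\mathrm{acc}(n)\bigr) = \frac{v\,b}{n},
    \qquad
    \frac{d}{dn}\bigl(c\,n\bigr) = c,

so the marginal example stops paying for itself as soon as n > v*b/c, and every example after that point is a net loss.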
If a 5% increase in accuracy is worth more than the cost of data storage and computation, then it's a pretty clear win.
It would be interesting if there was some corresponding law for more flexible models which did indeed give you logarithmic scaling.
There was a push to get data out of app dbs and into big data repositories. But then no one could use the big data because it made no sense. So then ML?
But if you already know what it means in the app db, just make it available in a sensible format.
It's just that now, after almost a decade of pushing towards more IoT and more Big Data, many companies have huge data lakes that they don't know how to make use of.
So instead of applying one of these lessons it's probably best to see where one is lacking (quantity or quality) and work on resolving that specific problem accordingly.
I've seen some naive approaches struggle to get good results when all that was needed was beating some sense into the data. This typically requires domain knowledge. IMHO most of the engineering effort in ML is not tuning the algorithms but moving data around.
A little practical example: consider a dataset containing POIs and a ML based search ranking algorithm that is struggling with correctly ranking airports. Do you 1) spend a year trying to get your algorithm to work better using examples of 'good' results, or 2) figure out a better source for the limited number of airports that exist in the world with better metadata. Turns out a ~day of sorting this out with a few good open data sources gets you a lot further than months of trying out different ways to extract features that aren't there.
We had a whole team of ML PhDs on this and they couldn't get it done. A single engineering intern came up with the obvious analysis: this data is shit, fixing that is easy, let's fix that. Problem solved.
The insights you're looking for could be cutting-edge research problems or near trivial, depending on the condition of the app db.
I have become fond of saying that the opportunities of ML and data science are just more rewards for getting your database schema right.
Saying you need more data is like saying you need to flip a quarter 500 million times to get a better percentage estimate of heads vs. tails compared to 1 million coin flips. After a certain point, having more data only helps with identifying outliers and changes in behavior over time (when dealing with human/natural data).
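Putting rough numbers on the coin example (standard binomial standard error, assuming a roughly fair coin):

    \mathrm{SE}(\hat p) = \sqrt{p(1-p)/n} \approx 0.5/\sqrt{n}:
    \quad n = 10^6 \Rightarrow \mathrm{SE} \approx 0.05\%,
    \quad n = 5 \times 10^8 \Rightarrow \mathrm{SE} \approx 0.0022\%

Five hundred times the flips buys only about a 22x (roughly sqrt(500)) tighter estimate.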
The current state of the art of ML assumes nonlinear relationships between all parameters. It can't assume simpler & reasonable models, and therefore it can't extrapolate easily with reduced data.
I'm not really sure what "low-dimensional intuition" means, but I pretty regularly build models that do not "assume nonlinear relationships between all parameters".
The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
I have seen this insinuation multiple times: that samples "collect" near the unit d-sphere when drawn in a high number of dimensions d.
From a physics perspective this is very familiar to me, but not in the sense of fact, but in the sense of misinterpretation.
I do believe the observation is very useful in the educational sense as long as it is pointed out as being a paradoxical illusion. In this sense I can appreciate (even encourage) a professor or a TA showing this phenomenon, on the condition that they finish up by explaining why this seems to be the case but is nevertheless a misinterpretation, and that they make sure the connection with Jacobian determinants etc. is made clear.
Consider a normal distribution of any dimension (as high as you want), but I will showcase the phenomenon even with low dimensions (here merely d=3) to illustrate this has nothing to do with high dimensions.
Clearly the probability density is maximal in the center of the distribution.
In computer processing of data points, we typically loop over points, calculate some hopefully interesting function on each sample, and then plot the samples say by binning with equal bin sizes. A programmer typically disregards transformation properties like the Jacobian determinant. Suppose the value we calculate for each data point is the absolute length or distance from the center. The further we go from the center the smaller the probability density of the normal distribution becomes... but the larger the volume of a shell of radius r!
Since we are binning with equal bin sizes (equal length intervals per bin), at small lengths we will get relatively few samples even though the actual probability density is highest near the center, because the volume under consideration is small compared to the volume of a shell of larger radius and equal thickness (the area of a sphere grows quadratically with radius). However, for even larger distances the exponential decay of the normal distribution dominates, and the number of samples in the highest-radius bins decreases again. So in between there will be a peak.
This explains the fact that the probability density of the absolute distance from the center, taken over the sample points, has a peak at some non-zero length.
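A quick numerical sketch of this (plain NumPy, d=3 on purpose, nothing high-dimensional about it): the density is maximal at the origin, yet the histogram of distances peaks well away from zero, simply because an equal-width bin at radius r covers a shell whose volume grows like r^2.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 3, 1_000_000

    # samples from a standard d-dimensional normal; the density is maximal at the origin
    x = rng.normal(size=(n, d))
    r = np.linalg.norm(x, axis=1)          # distance of each sample from the center

    # equal-width bins over the distance, as a naive processing loop would produce
    counts, edges = np.histogram(r, bins=50, range=(0.0, 5.0))
    peak = edges[np.argmax(counts)]
    print(f"most populated distance bin starts at r ~ {peak:.2f}")  # well above 0 (near sqrt(2) for d=3)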
But it is a conceptual mistake to interpret this as if those sample points in the original d-dimensional space form a dense shell on some "unit sphere"... This is a pure illusion, which illustrates that the interpreter is not familiar with Jacobian determinants etcetera.
Consider a volume (triple) integral over some volume element dx dy dz. If, for symmetry, you prefer to integrate in a spherically symmetric coordinate system (theta, phi, r), then you cannot simply replace dx dy dz with dtheta dphi dr; you need to use dV = dx dy dz = r^2 sin(phi) dtheta dphi dr.
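Concretely, for a normal distribution in d=3 with standard deviation sigma, carrying that Jacobian factor along turns the radially symmetric density into the Maxwell-Boltzmann-shaped distribution of the distance r:

    p(r)\,dr = (2\pi\sigma^2)^{-3/2}\, e^{-r^2/(2\sigma^2)} \cdot 4\pi r^2\, dr,

which is zero at r = 0 and peaks at r = sqrt(2)*sigma, even though the density in the original 3-dimensional space is largest at the origin.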
It is this necessary factor that is ignored when processing sample by sample, and that causes this illusion in the AI community. I did not follow conventional machine learning courses, but given the language used, whenever I see statements to the effect of samples lying near the unit sphere in high dimensions, I can only conclude it has its origins in 1) direct observation or experience of plotting the length of the vector in bins, without guidance in interpretation; or 2) guidance during education, where the phenomenon is shown and the origin of the paradoxical illusion or confusion adequately explained, with the illusory nature subsequently forgotten; or 3) a teaching assistant having gone through 2) and showing the phenomenon to students without emphasizing the illusory nature of the misinterpretation.
But perhaps I am wrong, and the normal distribution in high dimensions actually has a higher probability density near its "unit sphere" if the dimension is beyond some critical dimension d_c... but again, I'd like to see a derivation showing it :)
EDIT: For more precise language: they do "collect" (reach a peak) at a certain non-zero length or distance, but they do not collect to a unit sphere of such radius in the original space!
There are also descriptions of the curse that do not involve spatial analogies at all. Assume that the data is independently identically distributed along each dimension and is an outlier if it's sufficiently far along one dimension, which happens with probability p in the one-dimensional case. Then in n dimensions, the proportion of outliers is 1 - (1 - p)^n -> 1 for n to infinity. Most points are outliers along at least one dimension.
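A two-line check of that arithmetic (p = 0.01 per dimension is just an illustrative choice):

    # probability that a point is an outlier along at least one of n i.i.d. dimensions
    p = 0.01
    for n in (1, 10, 100, 1000):
        print(n, 1 - (1 - p) ** n)   # 0.01, ~0.096, ~0.634, ~0.99996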
I DO argue against the "collect near the unit d-sphere" framing if it does not come with an explicit pointer or reference to the explanation of this illusion. I don't care if one points to Maxwell-Boltzmann speed vs velocity vector distributions, or to Jacobian determinants, but one should point to something - otherwise the communication makes no sense. We communicate to teach and learn. Only when explaining why there appears to be a sphere in the higher dimension, and how this illusion arises, is communicating about this pseudosphere justified. The unit d-sphere makes no sense on the 1-dimensional axis of absolute length onto which we project the samples. A reference to a "unit d-sphere" only makes sense as residing in the original d-dimensional sample space. But in that space there is absolutely no packing of samples near the peak radius as it appears in the distribution of lengths.
I was not responding to the "curse of dimensionality". I show that this effect already exists at low dimensions, and physicists are very well acquainted with it, because within their first years of university study they get drilled in 1) Jacobians for non-linear coordinate transformations and 2) the velocity distribution of molecules in the kinetic theory of gases, where there is a similar plot for the absolute speed (Maxwell-Boltzmann) distribution showing a peak at a non-zero speed, even though the velocity vector distribution is a normal distribution. Every physicist worth his or her salt immediately recognizes the phenomenon as relating to the Jacobian determinant, recognizes that this has nothing to do with velocities aggregating on some sphere of non-zero radius, and sees that this is a misinterpretation of the magnitude distribution plot...
Clearly in this physics example d=3. Would you say d=3 already shows the curse of dimensionality? I call bollocks, and suspect a misinterpretation of the magnitude plot...
Again, the peak in the magnitude plot is very real; any reference to a sphere with the radius of that peak, residing in the original d-dimensional space, is purely imaginary!
Of course the person with simultaneously average height, avg weight, avg income, avg capital, avg age, avg ... is very rare, ... but less rare than a similarly accurately specified person with non-average weight but all other variables still average...
see the last 2 sections on velocity vector distribution, and speed distribution...
You seem to be referencing the fact that repeatedly convolving uniform distributions approaches a Gaussian distribution.
I can give you a beautiful and concise explanation of clouds too: as we all know, objects fall back to earth, or more properly speaking they follow orbits; close to the surface they seem to follow parabolas, but in fact they follow ellipses. The moon orbits the earth in a nearly circular ellipse, as do geosynchronous satellites. When water evaporates from the ocean, the droplets are in fact launched into an elliptical orbit; however, each elliptical orbit that intersects the surface of the earth will intersect it again. This explains rain: the individual water drops fly for a while until they fall back down...
Beautiful, and relatively concise in comparison with real cloud physics, but totally wrong of course...
One of the more extreme examples of this situation is Big Data, where everything is a convenience sample, and therefore nothing is particularly trustworthy. Combine it with standard statistical hypothesis testing, whose p-values, being designed for the size of experiment that one could realistically manage in a lab setting circa 1910, are as much a reflection of your sample size as of anything else, and you're at great risk of leading yourself into a tar pit.
Now imagine that you want to measure that bias across different types of coins and year of minting. Does the bias between coins of the same type vary more than across types?
Curse of dimensionality starts to raise its ugly head.
That's not asymptotic theory or the central limit theorem. It's a meek proposition that you learn to prove in the first semester when studying math: a bounded and monotone function will always converge.
> Higher velocity data does not improve percentage accuracy and makes accuracy levels worse!
What on earth is this supposed to mean? An ML algorithm doesn't care about how fast data "arrives". "Higher velocity" is marketing lingo related to their Kinesis Data Streams service. But I assume that he refers to the usual loss of performance observed with all online-learning algos. Feeding knowledge one by one will mostly just overwrite learned generalizations. Or maybe he is indicating that they are compromising algo quality for faster execution.
> It is important to pick a single metric to improve, even if it is not perfect, but to use it as the basis for measuring performance improvement.
I think that is wrong. A single metric never captures complex behavior well, and will lead to distortions if fed back into the system. I mean, it's normal to use one metric for the error. But presenting this as the real deal sounds ridiculous to me, as this is what's being done most of the time anyway, and it will probably change in the future. Backprop only works with one metric at the moment - that's just a fact.
> Pat noted that improvement and learning is often very slow – sort of like a slow weight loss program, where you lose weight very slowly. Processes may only be improving by 20 basis points a quarter, or 80 basis points a year. That isn’t a lot, but over a decade, it really makes a difference.
Now he's contradicting himself as he gives a reason for why big data is beneficial.
> His final word of advice – students should be broad in their knowledge of a lot of things, but need to be very deep in one area.
And again he is contradicting himself. Because if you make an analogy from student to a learning algorithm he now gives TWO orthogonal metrics to optimize for.
My guess is that he's referring to the resolution of a time series. E.g. Going from months to weeks is 'more' data, but makes your models worse.
Whenever I've worked with time series data I've always referred to the granularity of the time dimension as its "resolution" - this is also common in geospatial data. I don't think (but am happy to be corrected) that "velocity" is a term of art in timeseries analysis.
For instance the "velocity data" and "but over a decade it makes a difference" parts. From a technical perspective it's awesome if you know that thanks to the data science you can outcompete others over a decade. But in business terms thinking is quarter based (i.e. three month) for short term and yearly for long term. The current leader that has to spend his budget on data science or other stuff has to find an advantage in it in the next few weeks after implementation, because usually making it happen takes already a quarter or two.
You might think they are idiots for thinking that short term, but their goals are also set in this way. So if they invest heavily in a topic and don't see any results for 3 quarters they might be replaced.
So if you say algorithm X will cost him 80% of his budget to implement but only shows results ten years later, for him that's the same as "no results". It's just the game he has to play to stay in the game.
I personally think this is part of why companies will not be able to make a drastic change in their DNA and instead will be replaced by the next generation of companies that come out of start-ups, or by companies that already actively participate in the market in other areas and have the right DNA to take on the market. For instance, Amazon is already more data-driven than the traditional supermarket chains, so they can now attack that space with their modern technology.
> And again he is contradicting himself. Because if you make an analogy from student to a learning algorithm he now gives TWO orthogonal metrics to optimize for.
Loving that part. I bet few people have combined his advice for student learning with his earlier advice on how to do ML.
On some levels it is anti-diversity but given real world constraints it has yielded the best results. Any thoughts or links regarding this topic would be appreciated.
The simplest explanation is that the training sample (meaning the entire dataset, not just the training partition in cross-validation) was not drawn from the same distribution as the data that is being processed in production.
Machine learning is guaranteed some performance bounds under PAC learning, assuming that the training dataset and the unseen data on which the trained model will be used come from the same distribution. Absent this, performance cannot be predicted. You might as well classify stuff by throwing a bunch of dice.
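For reference, the textbook finite-hypothesis-class bound makes the dependence explicit (this is the standard PAC result, not something from the posted article): a consistent learner needs roughly

    m \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

examples drawn i.i.d. from the same distribution as the unseen data to guarantee error at most epsilon with probability at least 1 - delta. Drop the same-distribution assumption and the guarantee evaporates.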
Unfortunately, this assumption, that we're representing the real-world distribution in our training dataset, cannot be justified as long as we don't know the ground truth in the real world. Which is most of the time. Essentially, there's no way to know for sure that a model that has performed very well in experiments will continue to do so once it's deployed in production.
In terms of sample distribution - let's say that you have an online model serving traffic (and the outcomes of that traffic are logged as your training & holdout data). Then, when you train a wildly different model, it may start picking other things to serve that the original model never served - and therefore, there was no training data. This is a pretty fundamental and hard problem.
Second, funny thing, but sometimes you can't use the same metric for training as you do for evaluating your model online. I don't want to (can't) get into details, but it's also a pretty fundamental and hard problem.
Last, the production traffic always shifts. Using historical data, given how much current models are "reactive" and "compression-like", as opposed to true generalization - they perform worse given fresh traffic that changed slightly from e.g. 5 days ago. If your training data is "day -10 to day -3", your holdout is "day -2 to day 0" - models will likely perform worse on holdout than pure overfit theories would have you assume (mind you, still plenty enough to have a ton of value), but when you launch it on day 1, it will perform worse still - as day 1 is different from days -10 to day -3.
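For concreteness, that kind of temporal split looks something like this (the column name and the "days ago" cutoffs just mirror the example above; everything else is made up):

    import pandas as pd

    def temporal_split(df: pd.DataFrame, now: pd.Timestamp):
        """Train on days -10..-3, hold out days -2..0, relative to `now`."""
        age_days = (now - df["timestamp"]).dt.days
        train = df[(age_days >= 3) & (age_days <= 10)]
        holdout = df[(age_days >= 0) & (age_days <= 2)]
        return train, holdout

    # usage: train, holdout = temporal_split(events, now=pd.Timestamp.now())
    # whatever is deployed on "day 1" then sees data newer than anything it was evaluated on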
I haven't done the analysis, but I'd assume that for non-historical models where you don't need to structure your holdout data to be from the "future" - that they'd perform better when you first launch them online.
I think we're talking about the same thing here, that the real world can be very different from your training sample. Sorry, I think I contracted the jargon flu this week :)
In particular for computer vision tasks, creating a perturbed variant of your input images (slightly warped, flipped, mirror, what have you...) can do wonders for the generalization performance.
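A minimal sketch of that kind of augmentation with torchvision (the specific transforms and parameters are common choices, not anything the parent comment prescribes):

    from torchvision import transforms

    # random perturbations applied on the fly to each training image
    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=10),                   # slight warp
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
        transforms.ToTensor(),
    ])

    # applied inside the Dataset's __getitem__, e.g.: tensor = augment(pil_image)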
Was it the ellipsis or the reference to the employer of the lecturers? Either way, please don't change titles needlessly. Thanks!