More Data Is Not Better and Machine Learning Is a Grind (ncsu.edu)
202 points by sonabinu 38 days ago | 88 comments



More data is better.

You can reduce it via PCA, one of the many techniques in multivariate statistics.

You can use ANOVA to select your predictors.

In general you can work with a subset of it using the tools that statistics provides.

Complaining about messy data... welcome to the real world. As for complaining about non-reproducible models, choose reproducible ones. I've mostly used statistical models and forest-based algorithms, and they're all reproducible.

All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

> The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns

CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.


>> All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

The solution is to direct research effort towards learning algorithms that generalise well from few examples.

Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.

>> You can reduce it via PCA one of the many techniques in multivariate statistic.

PCA is a dimensionality reduction technique. It reduces the number of features required to learn. It doesn't do anything about the number of examples that are needed to guarantee good performance. The article is addressing the need for more examples, not more features.
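To make the distinction concrete, here's a toy sketch (entirely made-up data, first principal component of a 2x2 covariance matrix computed by hand): projecting onto the top component drops a feature, but every example survives.

```python
import math, random

random.seed(0)

# Toy dataset: 200 examples, 2 correlated features (made-up numbers).
xs = [random.gauss(0, 1) for _ in range(200)]
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in xs]

# First principal component of the 2x2 covariance matrix, by hand:
n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n
sxx = sum((p[0] - mx) ** 2 for p in data) / n
syy = sum((p[1] - my) ** 2 for p in data) / n
sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / n
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # direction of max variance
pc = (math.cos(theta), math.sin(theta))

# Project: each example now has ONE feature instead of two...
reduced = [(p[0] - mx) * pc[0] + (p[1] - my) * pc[1] for p in data]

# ...but the number of examples is unchanged.
assert len(reduced) == len(data) == 200
```

So PCA shrinks the feature axis, never the example axis, which is why it doesn't address the article's complaint.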


>>>Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.

This is only true for the Facebooks and Googles of the world. There are definitely small companies (like the one I work for) trying very hard to figure out how to build models that use less data because we don't have access to those large datasets.

The industry is larger than just the Big N.


Btw, if you have relational data and a few good people with strong computer science backgrounds rather than statisticians or mathematicians, have a look at Inductive Logic Programming. ILP is a set of machine learning techniques that learn logic programs from examples and background knowledge, themselves expressed as logic programs. The sample efficiency is in a class of its own and it generalises robustly from very little data[1].

I study ILP algorithms for my PhD. My research group has recently developed a new technique, Meta-Interpretive Learning. Its canonical implementation is Metagol:

https://github.com/metagol/metagol

Please feel free to email me if you need more details. My address is in my profile.

___________________

[1] As a source of this claim I always quote this DeepMind paper where Metagol is compared to the authors' own system (which is itself an ILP system, but using a deep neural net):

https://arxiv.org/abs/1711.04574

ILP has a number of appealing features. First, the learned program is an explicit symbolic structure that can be inspected, understood, and verified. Second, ILP systems tend to be impressively data-efficient, able to generalise well from a small handful of examples. The reason for this data-efficiency is that ILP imposes a strong language bias on the sorts of programs that can be learned: a short general program will be preferred to a program consisting of a large number of special-case ad-hoc rules that happen to cover the training data. Third, ILP systems support continual and transfer learning. The program learned in one training session, being declarative and free of side-effects, can be copied and pasted into the knowledge base before the next training session, providing an economical way of storing learned knowledge.


Ah yes I am very familiar with ILP - thanks for sending these references!


You're welcome, and what a pleasant surprise, it's rare to find people who know about ILP in the industry :)


You're absolutely right and I appreciate that very much. On the other hand, there's an incredible amount of hype around Big Data and deep learning, exactly because the large corporations are doing it. So now everyone wants to do it, whether they have the data for it or not, whether it really adds anything to their products or not.

As to the Big N (good one), what I meant to say is that I don't see them trying very hard to undo their own advantage by spending much effort developing machine learning techniques that rely on, well, little data. That would truly democratise machine learning - much more so than the release of their tools for free, etc. But then, if everyone could do machine learning as well as Google and Facebook et al, where would that leave them?


>CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.

Yes it does. It even implies it in the name 'limit'. In the limit of infinitely many samples, the sampling distribution approaches a normal distribution, and the approximation error shrinks like 1/sqrt(n). That approximation has diminishing returns.
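For what it's worth, the connection in one line: the standard error of a sample mean is sigma/sqrt(n), so quadrupling the data only ever halves the error.

```python
import math

# Standard error of the sample mean: sigma / sqrt(n).
# Each 4x increase in data buys the same 2x reduction in error --
# that sqrt(n) rate is exactly the "diminishing returns".
sigma = 1.0
se = {n: sigma / math.sqrt(n) for n in (1_000, 4_000, 16_000)}
for n, e in se.items():
    print(f"n={n:>6}: standard error = {e:.5f}")
```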

>All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

It's fine to point out problems without giving solutions. You seem very aggravated.


PCA has specific use cases. It’s not a catch-all dimensionality reduction technique. You can’t use it effectively, for example, if things are not linearly correlated. There are of course many tools for addressing many problems, but as the title states, this is often a grind. For any practical problem, exclusive of huge black-box neural nets where you don’t need to understand the model, you are probably better off starting with a smaller set of reasonable-sounding features and then slowly growing your model to incorporate others.

Also if you meant random forest by forests... those aren’t especially reproducible. Understanding what’s going on is not always easy, and most people seem to misinterpret the idea of “variable importance” when you have a mix of categorical and numeric features. Decision trees and linear regressions are nice and reproducible.


> Complaining about messy data... welcome to the real world.

I mean, that's the crux: if you have bad data you will have bad results. Data cleanup/transformation is key for anything (reporting, etc...) and not just limited to ML because it's sexy these days.


Nice to see a statistician weighing in on this post


Thank you, I'm not crazy. I was reading HN very confused.


In image recognition, Google asserted in 2017 that they were unable to find decreasing returns -- "Performance increases logarithmically based on volume of training data." Maybe for supply-chain and regression models there is a limit but it seems in deep neural nets maybe there's a different answer.

Blog: https://ai.googleblog.com/2017/07/revisiting-unreasonable-ef...

Paper: https://arxiv.org/abs/1707.02968


Isn’t that the very definition of diminishing returns?


Yes, precisely. Logarithmic growth asymptotically decreases - it's the definitional opposite of exponential growth. This is clearly illustrated in typical graphs depicting logarithmic growth: https://jamesclear.com/wp-content/uploads/2015/04/logarithmi...

So in point of fact, according to the cited paper Google is asserting there are diminishing returns to increasing the volume of data.


The two of you are asserting different hypotheses than the OP presented: diminishing returns != decreasing returns. The Google paper found that increased amounts of data always improved performance, but did so at a lower rate the more data that had already been provided. First derivatives vs. second.
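The distinction is easy to check numerically: on a logarithmic curve, every extra batch of data still helps (first differences stay positive), but each batch helps less than the last (the differences shrink). A sketch with made-up constants, since the paper only reports the logarithmic shape:

```python
import math

# Hypothetical log-shaped performance curve (constants invented here):
def acc(n):
    return 0.5 + 0.05 * math.log10(n)

# Marginal gain from the SAME extra batch of 10k examples,
# starting from progressively larger datasets:
marginal = [acc(n + 10_000) - acc(n) for n in (10_000, 100_000, 1_000_000)]
print(marginal)

assert all(m > 0 for m in marginal)            # returns never decrease...
assert marginal[0] > marginal[1] > marginal[2] # ...but they do diminish
```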

The headline of this article is "More data is not better", which is a stronger claim than diminishing returns - it's neutral or negative returns.


I'm only talking about the Google paper brought up in this thread, because logarithmic growth does asymptotically decrease over time. I didn't say that the Google paper asserts absolute decreases over time, such that more training data actually makes results worse. Just that more data becomes too expensive to be worth it in a logarithmic context.


The headline isn't consistent with the article itself, which states (somewhat confusingly):

* As the number of product data grows, the benefits were negligible

* More observations per product was important

* The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns.

In any case, this seems to be a case of "diminishing returns."


>> The headline of this article is "More data is not better", which is a stronger claim than diminishing returns - it's neutral or negative returns.

Well, if I was paying $10,000 for 10,000 examples (to collect, cleanup, process, train with, etc), getting 90% accuracy and making $90,000 from the trained model, and now I'm paying $10,000,000 for 10,000,000 examples, getting 91% accuracy and making $91,000 from the trained model, I'm losing money where before I was making some. That's "not better".


Few big tech companies pay for training data - it usually arises organically out of usage data from their products. You need to collect this anyway for product function / business metrics / abuse prevention, and build the cleaning & processing pipelines. So the only marginal cost of feeding more training data into your machine learning pipeline is the computational cost of training it, which is usually tiny fractions of a penny per sample.


If it were so simple to collect, process and train with (very) large amounts of data, everyone would be doing it. Instead, it's just a few very large companies that can do that: Google, Facebook et al.

Anyway, the cost per example doesn't have to be astronomical. If you need a few million of those, you can pay a fraction of a penny each and still have a big black hole in your budget, unless you can significantly improve performance.


Only a few big companies can do it because there's a bootstrapping problem. To get large amounts of virtually free data, you need lots of users who have signed up for giving you their data in exchange for a useful service. This was much easier to achieve for companies started between 1995-2005, when the web was young, because the Internet was such a huge leap forwards over what came before it. Existing startups now have to compete with the products of these giants, many of which have been enhanced by years of machine learning. That's challenging.

To give you a sense of how cheap computing power is, my startup regularly processes roughly 2B webpages with some complicated algorithms that need to go node-by-node over the whole DOM tree. That's roughly 77TB of (gzipped) data, and around 100 trillion nodes. It costs me a few hundred bucks of AWS time. That's a rounding error for a big corp; a single data scientist's salary for one day will run you around that much.


> Logarithmic growth asymptotically decreases

Logarithmic growth slows down, but not asymptotically. Think about it: what would the asymptote be? (There is none.)


100 % accuracy is what the algorithm wants to achieve (in the most basic case), so that's your asymptote. If your precision increases logarithmically e.g. as 1-1/log(x) you get arbitrarily close to 1, which is the definition of an asymptote I think.
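A quick check of that example curve: 1 - 1/log(x) climbs toward 1 but never reaches it, with 100% accuracy as the horizontal asymptote.

```python
import math

# precision(x) = 1 - 1/log(x): strictly increasing, bounded above by 1.
def precision(x):
    return 1 - 1 / math.log(x)

vals = [precision(10 ** k) for k in (2, 6, 12)]
print(vals)

# monotone and always strictly below the asymptote at 1
assert vals[0] < vals[1] < vals[2] < 1
```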


> 100 % accuracy is what the algorithm wants to achieve

Well, in that context every kind of (monotonic) growth is asymptotic, so the word has no meaning.


Okay, yeah you make a good point. Strictly speaking a logarithmic function does not imply a convergent series. To be more precise, what I'm talking about are slowly varying functions. These are functions which don't converge but which have similar characteristics to convergent functions (namely, extremely slow rate of change).

Since you made the correction you're probably already familiar with these, but for the benefit of others: https://en.wikipedia.org/wiki/Slowly_varying_function


It really slows down though. Log(n) indeed goes to infinity but if you numbered all the atoms in the universe and pass them through the function, the last one would be ~80.


That may be true, but that is a far cry from asymptotic


Mathematically, yes. For all real-world intents and purposes, it's bounded. I sometimes use this very example to illustrate the profound difference between the two which can be easily under-estimated. (That or log(log(n)) which also goes to infinity but is real-world bounded at <3)


That is in log_10; in log_e it would be about 184.
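Easy to verify, taking the usual ~10^80 estimate for the number of atoms in the observable universe:

```python
import math

atoms = 10 ** 80
print(math.log10(atoms))  # ~80
print(math.log(atoms))    # ~184.2 (= 80 * ln 10)
```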


As a technical term, "diminishing" means the derivative approaches 0 (or goes negative) fast enough that the integral (or series sum) is bounded. The logarithm, with derivative 1/x, is the well-behaved function with the slowest non-diminishing return (ignoring constant factors and trivial higher-order terms): its returns flatten, but too slowly for its value to be bounded.


Perhaps it would be more accurate to say that marginal return is materially positive for all feasible n


A logarithmic increase in performance is positive for all n, but not necessarily _materially_ positive for all n. In fact there will be an n where the marginal gain from adding more data will _not_ be worth it, if the cost of adding more data scales linearly but the benefit scales logarithmically.
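That break-even point is easy to exhibit. A sketch with entirely made-up cost and benefit functions (a penny per example, log-shaped benefit):

```python
import math

# Hypothetical economics: benefit grows logarithmically, cost linearly.
def benefit(n):
    return 10_000 * math.log10(n)   # dollars, invented scale

def cost(n):
    return 0.01 * n                 # a penny per example

def marginal_value(n, step=10_000):
    # net value of the NEXT batch of `step` examples
    return (benefit(n + step) - benefit(n)) - (cost(n + step) - cost(n))

print(marginal_value(10_000))        # early data pays for itself
print(marginal_value(100_000_000))   # eventually it doesn't
```

Past the crossover, every additional batch costs more than it earns, even though accuracy is still (logarithmically) improving.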


Right. If the cost of adding more training data remains flat or increases, a logarithmically increasing n will eventually reach a point such that it's more expensive to continue increasing training volume.

In fact, unless the cost of adding and using training data exponentially decreases over time, it's a mathematical certainty that a logarithmically increasing n will quickly incur expensive, diminishing returns for using more data.

So in the context of this Google paper, you could conceive of a situation where training data actually becomes easier to load (albeit subexponentially) and still becomes too expensive to use relatively quickly.


With diminishing returns, it's interesting to think about this as a purely economic question: when does it make sense to index more data?

If a 5% increase in accuracy is worth more than the cost of data storage and computation, then it's a pretty clear win.


I know it's an empirical result, but it would be interesting if it was truly logarithmic. Classic statistics gives various asymptotic results based on variations of the central limit theorem, which generally imply sqrt(n) scaling, but these assume various regularity conditions (typically smoothness and fixed dimensionality).

It would be interesting if there was some corresponding law for more flexible models which did indeed give you logarithmic scaling.


If the model is an ideal encoding, then the description length of every input grows logarithmically. Perhaps the performance is gated on the number of "features" in the ideal encoding.


My limited opinion: The Right Data is better than Big Data, if you can get it.

There was a push to get data out of app dbs and into big data repositories. But then no one could use the big data because it made no sense. So then ML?

But if you already know what it means in the app db, just make it available in a sensible format.


I think you can have both situations. I still remember a talk by a Google person, maybe around 2010, where he showed clearly that if you try to make the data "better" but have way too few data points overall, then it can't be useful either.

It's just that now, after almost a decade of pushing towards more IoT and more Big Data, many companies have huge data lakes that they don't know how to make use of.

So instead of applying one of these lessons it's probably best to see where one is lacking (quantity or quality) and work on resolving that specific problem accordingly.


ML definitely works better on good data. If you have bad data, putting some effort into turning it into good data can get you decent results very quickly. Whenever you hear ML experts talk about 'feature extraction', they are talking about turning unstructured data into more structured data. With unsupervised learning, you rely on this happening automagically.

I've seen some naive approaches struggle to get good results when all that was needed was beating some sense into the data. This typically requires domain knowledge. IMHO most of the engineering effort in ML is not tuning the algorithms but moving data around.

A little practical example: consider a dataset containing POIs and a ML based search ranking algorithm that is struggling with correctly ranking airports. Do you 1) spend a year trying to get your algorithm to work better using examples of 'good' results, or 2) figure out a better source for the limited number of airports that exist in the world with better metadata. Turns out a ~day of sorting this out with a few good open data sources gets you a lot further than months of trying out different ways to extract features that aren't there.

We had a whole team of ML PhDs on this and they couldn't get it done. A single engineering intern came up with the obvious analysis: this data is shit, fixing it is easy, let's fix it. Problem solved.


There is a very good talk, "The Unreasonable Effectiveness of Structure", that details how just adding this structural information instead of dropping it can give great, low-hanging rewards: https://www.youtube.com/watch?v=t4k5LKCpboc


One of the most interesting HN comments I saw last year claimed that many programming trends that burn out made false promises about not needing to worry about database design anymore.

The insights you're looking for could be cutting-edge research problems or near trivial, depending on the condition of the app db.

I have become fond of saying that the opportunities of ML and data science are just more rewards for getting your database schema right.


See the "Law of large numbers" - https://en.wikipedia.org/wiki/Law_of_large_numbers

Saying you need more data is like saying you need to flip a quarter 500 million times to get a better percentage estimate of heads vs. tails compared to 1 million coin flips. After a certain point, having more data only helps with identifying outliers and changes in behavior over time (when dealing with human/natural data).
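To put numbers on the coin-flip example: 500x the flips buys only about 22x the precision.

```python
import math

# Standard error of the estimated heads proportion: sqrt(p(1-p)/n).
def heads_se(n, p=0.5):
    return math.sqrt(p * (1 - p) / n)

print(heads_se(1_000_000))     # ~0.0005
print(heads_se(500_000_000))   # ~0.000022, i.e. sqrt(500) ~ 22x better
```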


You're using low-dimensional intuition. If your data has a non-linear relationship between 1000 features (or more), then 500 million samples is quantifiably better than 1 million samples. If these are boolean features, you have 2^1000 possible data points. Your samples are extremely sparse.
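The sparsity here is easy to quantify: even half a billion samples cover a vanishing fraction of a 1000-dimensional boolean space.

```python
# With d boolean features there are 2**d possible inputs.
d = 1000
samples = 500_000_000
coverage = samples / 2 ** d   # 2**1000 is ~1.07e301
print(coverage)               # effectively zero
```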

The current state of the art of ML assumes nonlinear relationships between all parameters. It can't assume simpler & reasonable models, and therefore it can't extrapolate easily with reduced data.


> The current state of the art of ML assumes nonlinear relationships between all parameters. It can't assume simpler & reasonable models, and therefore it can't extrapolate easily with reduced data.

I'm not really sure what "low-dimensional intuition" means, but I pretty regularly build models that do not "assume nonlinear relationships between all parameters".


I think the Wikipedia article on the Curse covers it pretty well.

https://en.m.wikipedia.org/wiki/Curse_of_dimensionality

The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.


As the dimension d goes to infinity, if you sample points from a d-dimensional multivariate normal distribution, your samples will collect near the unit d-sphere; whereas in low dimensions they'd collect near the origin. This is because as d goes to infinity, your samples become more and more far apart. This is a topological intuition of why higher dimensions don't work like lower dimensions.
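A quick stdlib-only simulation of the norms (which, on its own, says nothing about where the density peaks in the original space): the average distance from the origin of a standard Gaussian sample grows like sqrt(d), and the approximation tightens as d grows.

```python
import math, random

random.seed(0)

def mean_norm(d, trials=200):
    # average Euclidean norm of standard-normal samples in d dimensions
    total = 0.0
    for _ in range(trials):
        total += math.sqrt(sum(random.gauss(0, 1) ** 2 for _ in range(d)))
    return total / trials

for d in (2, 100, 10_000):
    print(d, round(mean_norm(d), 2), "vs sqrt(d) =", round(math.sqrt(d), 2))
```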


I reserve the possibility that I am completely mistaken, in which case I apologize in advance, on the condition that you refer me to an actual calculation or derivation (not just a quote in the same spirit) in a paper or text or textbook.

I have seen such insinuation multiple times, that samples "collect" near the unit d-sphere when samples are drawn from the unit d-sphere in high d dimensions.

From a physics perspective this is very familiar to me, but not in the sense of fact, but in the sense of misinterpretation.

I do believe the observation is very useful in the educational sense as long as it is pointed out as being a paradoxical illusion. In this sense I can appreciate (even encourage) a professor or a TA showing this phenomenon, on the condition they finish up by explaining why this seems to be the case, but is nevertheless a misinterpretation, and they should make sure the connection with Jacobian determinants etc are made clear.

Consider a normal distribution of any dimension (as high as you want), but I will showcase the phenomenon even with low dimensions (here merely d=3) to illustrate this has nothing to do with high dimensions.

Clearly the probability density is maximal in the center of the distribution.

In computer processing of data points, we typically loop over points, calculate some hopefully interesting function on each sample, and then plot the samples say by binning with equal bin sizes. A programmer typically disregards transformation properties like the Jacobian determinant. Suppose the value we calculate for each data point is the absolute length or distance from the center. The further we go from the center the smaller the probability density of the normal distribution becomes... but the larger the volume of a shell of radius r!

Since we are binning with equal bin sizes (equal length intervals per bin), for small lengths, then even though the actual probability density is highest near the center, we will get relatively few samples because the volume under consideration is small compared to the volume under consideration for a shell of a larger radius of equal thickness (area of a sphere grows quadratically with radius). However for even larger distances, the exponential decay of the normal distribution dominates and the number of samples in highest radius bins will decrease again. So in between there will be a peak.

This explains the fact that [ the probability density of [ the absolute distance from the center over [ the sample points ] ] has a peak at some non-zero length.

But it is a conceptual mistake to interpret this as if those sample points in the original d-dimensional space form a dense shell on some "unit sphere"... This is a pure illusion, which illustrates that the interpreter is not familiar with Jacobian determinants etcetera.

Consider volume or triple integrals over some volume element dx dy dz and for symmetry you prefer integrating in a spherically symmetric coordinate system, theta,phi,r then you can not simply replace dx dy dz with dtheta dphi dr, you need to use dV = dx dy dz = r^2 sin( phi ) dtheta dphi dr.

It is this necessary factor that is ignored when processing sample by sample, causing this illusion in the AI community. I did not follow conventional machine learning courses, but given the language used whenever I see statements to the effect of samples lying near the unit sphere in high dimensions, I can only conclude it has its origins in 1) direct observation of plots binned by the length of the vector, without guidance in interpretation; or 2) education where the phenomenon was shown and the origin of the paradoxical illusion adequately explained, but the illusory nature subsequently forgotten; or 3) a teaching assistant having gone through 2) and showing the phenomenon to students without emphasizing the illusory nature of the misinterpretation.

But perhaps I am wrong, and the normal distribution in high dimensions actually has a higher probability density near its "unit sphere" if the dimension is beyond some critical dimension d_c... but again, I'd like to see a derivation showing it :)

EDIT: For more precise language: they do "collect" (reach a peak) at a certain non-zero length or distance, but they do not collect on a sphere of that radius in the original space!
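The argument above in one formula: it is the r^(d-1) Jacobian factor, not any shell of samples, that produces the peak in the magnitude plot. A minimal check, differentiating the radial log-density:

```python
import math

# Radial density of a standard d-dim Gaussian (up to normalisation):
#   f(r) = r^(d-1) * exp(-r^2 / 2)
# Setting d/dr log f = (d-1)/r - r = 0 gives the peak at r = sqrt(d-1),
# even though the density in the ORIGINAL space is maximal at the origin.
def radial_logpdf(r, d):
    return (d - 1) * math.log(r) - r * r / 2

for d in (3, 100):
    peak = math.sqrt(d - 1)
    # the radial density really is maximal there (check both neighbours)
    assert radial_logpdf(peak, d) > radial_logpdf(0.9 * peak, d)
    assert radial_logpdf(peak, d) > radial_logpdf(1.1 * peak, d)
    print(f"d={d}: radial density peaks at r = {peak:.3f}")
```

Note the peak already exists at d=3 (the Maxwell-Boltzmann speed distribution), so it is not a high-dimensional phenomenon per se.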


It seems you're mostly arguing with the description of "collection", because to you it implies high density in the full n-dimensional space. But the point of the curse of dimensionality is very much that most points do not lie in regions of high density, because regions of low but still non-negligible density are so much larger. If you prefer, you can say that they are "close" to the surface of a sphere, without implying high density.

There are also descriptions of the curse that do not involve spatial analogies at all. Assume that the data is independently identically distributed along each dimension and is an outlier if it's sufficiently far along one dimension, which happens with probability p in the one-dimensional case. Then in n dimensions, the proportion of outliers is 1 - (1 - p)^n -> 1 for n to infinity. Most points are outliers along at least one dimension.
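That last formula, evaluated for a small per-dimension outlier probability:

```python
# Proportion of points that are an outlier along at least one dimension,
# if each dimension independently flags an outlier with probability p:
p = 0.01
for n in (1, 10, 100, 1000):
    print(n, 1 - (1 - p) ** n)
# at n=1000, almost every point is an outlier somewhere
```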


I am not arguing with the word "collect", as it can be interpreted in best faith.

I DO argue against "collect near the unit d-sphere" if it does not come with an explicit pointer to the explanation of this illusion. I don't care if one points to the Maxwell-Boltzmann speed vs. velocity-vector distributions, or to Jacobian determinants, but one should point to something, else the communication makes no sense. We communicate to teach and learn. Only when explaining why there appears to be a sphere in the higher dimension, and how this illusion arises, is communicating about this pseudosphere justified. The unit d-sphere makes no sense on the 1-dimensional axis of absolute length onto which we project the samples. A reference to a "unit d-sphere" only makes sense as residing in the original d-dimensional sample space. But in that space there is absolutely no packing of samples near the peak radius as it appears in the distribution of lengths.

I was not responding to the "curse of dimensionality". I show that this effect already exists at low dimensions, and physicists are very acquainted with it because within their first years of university study they get drilled in 1) jacobians for non-linear coordinate transformations 2) the velocity distribution of molecules in the kinetic theory of gases, where there is a similar plot for the absolute speed (Maxwell-Boltzmann) distribution [0] showing a peak at a non-zero speed, even though the velocity vector distribution is a normal distribution... Every physicist worth his / her salt, immediately recognizes the phenomenon as relating to the Jacobian determinant, that this has nothing to do with velocities aggregating to some sphere of non-zero radius, and that this is a misinterpretation of magnitude distribution plot...

Clearly in this physics example d=3. Would you say d=3 already shows the curse of dimensionality? I call bollocks, and suspect a misinterpretation of the magnitude plot...

Again the peak in the magnitude plot is very real, any reference to a sphere with radius of the peak in the magnitude plot and residing in the original d-dimensional space is purely imaginary!

Of course the person with simultaneously average height, avg weight, avg income, avg capital, avg age, avg ... is very rare, ... but less rare than a similarly accurately specified person with non-average weight but all other variables still average...

[0] https://en.wikipedia.org/wiki/Maxwell-Boltzmann_distribution...

see the last 2 sections on velocity vector distribution, and speed distribution...


I think most of the time when the "if you sample points [at random] from an n-dimensional sphere" line is written/told, "using uniform distribution, where each unit volume has equal chance of being picked" is substituted in the place of "at random", and not normal distribution. I think with normal distribution you are right, and GP is mistaken, but with uniform distribution it isn't just an illusion.


That's just a d-dimensional cube of uniform density, with zero probability outside of it; there is no meaningful d-dimensional sphere though...

You seem to be referencing the fact that repeatedly convolving uniform distributions approaches a Gaussian distribution.


Beautifully and concisely explained!


I agree it's beautiful and concise, but it is not correctly explained.

I can give you a beautiful and concise explanation of clouds too: as we all know, objects fall back to earth, or more properly speaking they follow orbits; close to the surface they seem to follow parabolas, but in fact they follow ellipses. The moon orbits the earth in a nearly circular ellipse, as do geosynchronous satellites. When water evaporates from the ocean it is in fact launched into an elliptical orbit; however, each elliptical orbit that intersects the surface of the earth will intersect it again. This explains rain: the individual water drops fly for a while until they fall back down...

beautiful, and relatively concise in comparison with real cloud physics, but totally wrong of course...


My understanding is: It's a fancy way of saying "lots of variables", because a lot of these problems are converted into linear algebra representations (which can have geometric analogies, hence: dimensions).


Meaning that you can run PCA and the first few components will explain a vast majority of the variance. Some things may have so many moving pieces that it's just not possible to grok how all the variables interact to get the results we see.


I think, more to the point, is that more data really is not better data. Oftentimes it's worse -- the larger the volume of data you collect, the harder (and more expensive) it can be to manage sources of systematic bias in your data collection.

One of the more extreme examples of this situation is Big Data, where everything is a convenience sample, and therefore nothing is particularly trustworthy. Combine it with standard statistical hypothesis testing, whose p-values, being designed for the size of experiment that one could realistically manage in a lab setting circa 1910, are as much a reflection of your sample size as of anything else, and you're at great risk of leading yourself into a tar pit.


Depends what you’re looking for. If you’re trying to measure a very slight bias then more observations can help.

Now imagine that you want to measure that bias across different types of coins and year of minting. Does the bias between coins of the same type vary more than across types?

Curse of dimensionality starts to raise its ugly head.


I think this intuition is fundamentally wrong and misleading. This intuition really only works in low dimensions. It is not necessarily true that more data is better but you cannot blindly invoke the law of large numbers here. As someone here pointed out in high dim data, diversity is more important than numbers.


> The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns.

That's not asymptotic theory or the central limit theorem. It's a meek proposition that you learn to prove in your first semester studying math: a bounded, monotone function always converges.
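Spelled out (standard first-semester analysis, nothing from the article): if accuracy $a_n$ is non-decreasing in the sample size $n$ and bounded above by 1, then

```latex
a_1 \le a_2 \le \cdots \le 1
\quad\Longrightarrow\quad
a_n \to \sup_n a_n
\quad\Longrightarrow\quad
a_{n+1} - a_n \to 0 ,
```

i.e. the per-example gains must eventually vanish. Diminishing returns, with no probability theory needed.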

> Higher velocity data does not improve percentage accuracy and makes accuracy levels worse!

What on earth is this supposed to mean? An ML algorithm doesn't care about how fast data "arrives". "Higher velocity" is marketing lingo related to their Kinesis Data Streams service. But I assume he's referring to the usual loss of performance observed with online-learning algos: feeding in examples one at a time tends to overwrite learned generalizations. Or maybe he's indicating that they are compromising algorithm quality for faster execution.

> It is important to pick a single metric to improve, even if it is not perfect, but to use it as the basis for measuring performance improvement.

I think that is wrong. A single metric never captures complex behavior well and will lead to distortions if fed back into the system. I mean, it's normal to use one metric for the error, but presenting this as the real deal sounds ridiculous to me, since it's what's done most of the time anyway and will probably change in the future. Backprop only works with one metric at the moment; that's just a fact.

> Pat noted that improvement and learning is often very slow – sort of like a slow weight loss program, where you lose weight very slowly. Processes may only be improving by 20 basis points a quarter, or 80 basis points a year. That isn’t a lot, but over a decade, it really makes a difference.

Now he's contradicting himself as he gives a reason for why big data is beneficial.

> His final word of advice – students should be broad in their knowledge of a lot of things, but need to be very deep in one area.

And again he contradicts himself, because if you carry the analogy from student to learning algorithm, he now gives TWO orthogonal metrics to optimize for.


> What on earth is this supposed to mean? An ML algorithm doesn't care about how fast data "arrives".

My guess is that he's referring to the resolution of a time series. E.g. Going from months to weeks is 'more' data, but makes your models worse.
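A toy illustration of that trade-off with a synthetic series (stdlib only): the finer-grained series has "more" data points, but each one is far noisier around the true level.

```python
import random
import statistics

random.seed(0)
# daily observations: a flat signal of 10 plus heavy noise
daily = [10 + random.gauss(0, 5) for _ in range(360)]
# the coarser "monthly" view: the mean of each 30-day block
monthly = [statistics.mean(daily[i:i + 30]) for i in range(0, 360, 30)]

# the monthly series scatters far less around the true level of 10
print(statistics.stdev(daily), statistics.stdev(monthly))
```

Whether that averaging helps or hurts depends on whether the fine-grained variation was signal or noise for your model.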


Good point. If I retrofit the term "resolution" in place of "velocity", the author's statement does make a lot more sense. If that's what was intended, they really should have used that terminology, because I was similarly confused by the term "velocity".

Whenever I've worked with time series data I've always referred to the granularity of the time dimension as its "resolution" - this is also common in geospatial data. I don't think (but am happy to be corrected) that "velocity" is a term of art in timeseries analysis.


Ya I was confused by it at first too, it definitely could have been more clear. But it's also been my experience with time series that more resolution isn't always better.


It seems like the dissonance you felt from reading comes from the article being written more in a business style of writing and thinking rather than a technical style which you are more used to.

For instance, the "velocity data" and "but over a decade it makes a difference" parts. From a technical perspective it's awesome if you know that, thanks to data science, you can outcompete others over a decade. But in business terms, thinking is quarter-based (i.e. three months) for the short term and yearly for the long term. The current leader who has to spend his budget on data science or other things has to find an advantage in it within a few weeks of implementation, because making it happen usually already takes a quarter or two.

You might think they are idiots for thinking that short term, but their goals are also set in this way. So if they invest heavily in a topic and don't see any results for 3 quarters they might be replaced.

So if you say algorithm X will cost him 80% of his budget to implement but only shows results ten years later, for him that's the same as "no results". It's just the game he has to play to stay in the game.

I personally think this is part of why companies will not be able to make a drastic change in their DNA, and will instead be replaced by the next generation of companies coming out of start-ups, or by companies that already actively participate in the market in other areas and have the right DNA to take it on. For instance, Amazon is already more data-driven than the traditional supermarket chains, so they can now attack that space with their modern technology.

> And again he is contradicting himself. Because if you make an analogy from student to a learning algorithm he now gives TWO orthogonal metrics to optimize for.

Loving that part. I bet few people have combined his advice on student learning with his earlier advice on how to do ML.


Parts of this piece read to me as evidence for the importance of basic-to-mid-level classes / foundations in statistics, probability, and econometrics for machine learning. Learning to build a model doesn't mean you can always use it well.


Indeed. The first one, for example, reads to me as a dead ringer for a case where someone didn't bother to think about ecological validity until after they started having problems in production.


How and where data gets scrubbed will have significant consequences. While clean data is more valuable than dirty data (the mass of raw data collected by sensors), there is hidden value in those large masses of data. They can show broad trends and patterns that are not obvious in clean data, which is the modern equivalent of not seeing the forest for the trees https://semiengineering.com/data-vs-physics/


This may be a bit elementary for this crowd, but regarding the balance of data cost vs. capturing the most significant features: we use a simple decision tree as a significance cluster and optimize data munging around these clusters.

On some levels it is anti-diversity but given real world constraints it has yielded the best results. Any thoughts or links regarding this topic would be appreciated.


>> A model is now running in production but not producing the same results (or at the same level of accuracy) as what was demonstrated during experimentation… no one knows why

The simplest explanation is that the training sample (meaning the entire dataset; not just the training partition in cross-validation) was not drawn from the same distribution as the distribution of the data that is being processed in production.

Machine learning is guaranteed certain performance bounds under PAC learning, assuming that the training dataset and the unseen data on which the trained model will be used come from the same distribution. Absent this, performance cannot be predicted; you might as well classify stuff by throwing a bunch of dice.
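For reference, the textbook realizable-case bound for a finite hypothesis class $H$ (standard PAC learning, not from the article): with probability at least $1-\delta$, any hypothesis consistent with the training sample has true error below $\varepsilon$ once

```latex
m \;\ge\; \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```

examples have been seen. Every step of the proof assumes the $m$ examples and the future data are drawn i.i.d. from the same distribution; drop that assumption and the bound says nothing.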

Unfortunately, this assumption, that we're representing the real-world distribution in our training dataset, cannot be justified as long as we don't know the ground truth in the real world. Which is most of the time.

Essentially, there's no way to know for sure that a model that has performed very well in experiments will continue to do so once it's deployed in production.


I agree with the intuition, but I'm not sure I agree with the takeaways (in particular the lack of knowledge of the ground truth). I mean, it is a problem, but there are bigger, more practical mechanisms at work too.

In terms of sample distribution - let's say that you have an online model serving traffic (and the outcomes of that traffic are logged as your training & holdout data). Then, when you train a wildly different model, it may start picking other things to serve that the original model never served - and therefore, there was no training data. This is a pretty fundamental and hard problem.

Second, funny thing, but sometimes you can't use the same metric for training as you do for evaluating your model online. I don't want to (can't) get into details, but it's also a pretty fundamental and hard problem.

Last, production traffic always shifts. Given how much current models are "reactive" and "compression-like", as opposed to truly generalizing, they perform worse on fresh traffic that has changed slightly from, e.g., 5 days ago. If your training data is "day -10 to day -3" and your holdout is "day -2 to day 0", models will likely perform worse on the holdout than pure overfit theories would have you assume (mind you, still well enough to have a ton of value), but when you launch on day 1 it will perform worse still, as day 1 is different from days -10 to -3.
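For concreteness, a minimal sketch of the kind of split I mean (the helper and the row format are hypothetical, not any particular library):

```python
def temporal_split(rows, train_days=(-10, -3), holdout_days=(-2, 0)):
    """Split (day_offset, payload) rows the way described above:
    train on days -10..-3, hold out days -2..0. Anything after
    day 0 is live traffic that neither set ever saw."""
    train = [r for r in rows if train_days[0] <= r[0] <= train_days[1]]
    holdout = [r for r in rows if holdout_days[0] <= r[0] <= holdout_days[1]]
    return train, holdout

rows = [(d, f"traffic@{d}") for d in range(-10, 2)]  # offsets -10 .. +1
train, holdout = temporal_split(rows)
print(len(train), len(holdout))  # 8 3 -- day +1 belongs to neither set
```

The holdout is already "the future" relative to training, so it underestimates how stale the model will look on day 1 and beyond.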

I haven't done the analysis, but I'd assume that for non-historical models where you don't need to structure your holdout data to be from the "future" - that they'd perform better when you first launch them online.


>> In terms of sample distribution - let's say that you have an online model serving traffic (and the outcomes of that traffic are logged as your training & holdout data). Then, when you train a wildly different model, it may start picking other things to serve that the original model never served - and therefore, there was no training data. This is a pretty fundamental and hard problem.

I think we're talking about the same thing here: that the real world can be very different from your training sample. Sorry, I think I contracted the jargon flu this week :)


Statistically they must be right: most of the statistical improvement in an ML system comes from the first "large", "good" dataset, and after that gains are sure to be marginal. But a lot of problems aren't really statistical in nature; they are scientific, in that if you can discover a new, better theory and use it, you can catch a few cases where you often do the right thing without knowing why, and sort those out. For example, you may be learning to diagnose a disease with one common driver, while there is a rare disease that responds to the same treatment but has a slightly different presentation. You often treat these patients because of false positives, but if you learn to properly classify them you can almost always catch them; the statistical gain is negligible, the value is tangible.


I should say that above is an example, I am very cautious about using ML for medical diagnosis.


After I collect a dataset I save duplicate entries with the text reversed. More data is good data. Except now my machine learning robot is dyslexic.


Of course you're joking here, but for images people do reverse the image to enhance the output of their models, and this has been shown to be beneficial. There are other augmentations you can do as well; collectively the results are called "synthetic data".


This is not surprising, because if image A is an apple, its mirror image A' is also an apple. Adding A' to your dataset is just plain ol' regularization: it's meant to lower variance, to prevent overfitting.


It will also make the flag of Côte d’Ivoire be perceived as that of Ireland.


It's not surprising, but highly necessary because you might otherwise learn too little about structural features, and too much about unnecessary things such as the background colour of some objects.


The other day I was thinking that if I create a black cube in a simulation with a white background and generate a number of PNG images of the cube in different positions, then the weights my network learns (once it classifies successfully in further simulations, applying the learned weights rather than still adjusting them) could also be used to classify a real-life black die on a real-life white table, if the camera is well positioned and the lighting is right. In that case the learning would "generalize" from the simulation to real life. Maybe the simulated data used to adjust the weights could also be called "synthetic data", "simulated data", or just "artificially generated data".


You see, I'm speaking only the truth, merely referring to synthetic data. And to think that they're downvoting me, poor fools.


You may be joking about it, but ML researchers use all kinds of strategies like this one to perform what they call _data augmentation_.

In particular for computer vision tasks, creating a perturbed variant of your input images (slightly warped, flipped, mirror, what have you...) can do wonders for the generalization performance.
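A minimal sketch of the flip variant, with nested lists standing in for pixel arrays:

```python
def augment_with_flips(images):
    # each image (a list of pixel rows) contributes itself plus its mirror
    out = []
    for img in images:
        out.append(img)
        out.append([row[::-1] for row in img])
    return out

apple = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]
augmented = augment_with_flips([apple])
print(len(augmented))  # 2: the original plus its horizontal flip
```

A mirrored apple is still an apple, so the label carries over for free; the same idea applies to small rotations, crops, and warps, as long as the transformation preserves the label.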


Title was editorialized from "More Data is Not Better and Machine Learning is a Grind…. Just Ask Amazon"

Was it the ellipsis or the reference to the employer of the lecturers? Either way, please don't change titles needlessly. Thanks!


Disagree. "...Just ask Amazon" is needless clickbaity extra words. Ultimately the opinion is that of the writer.


He was writing a summary of a talk from an Amazon exec. I suppose "per Amazon exec" would have been more exact, but are we just not allowed to have tone anymore? Style is out?


"More Data is Not Better" is also clickbaity and needless


Not speaking in untrue generalizations is better. Not that the opposite is true, by any means. Generalizations are extremely rarely found to be true.


As it turns out, more data is a far second to doing anything at all.


Diversity in data is better.



