The Mixed Track Record of Machine Learning Algorithms (bloomberg.com)
52 points by briatx 5 months ago | 22 comments

A good ML system has to be architected to exploit the known structure of the task being attempted.

CNNs exploit spatial locality and LSTMs exploit temporal locality. The SOTA models are architected with even stronger assumptions about the nature of the task. Methods like neural networks, random forests, and SVMs, when used as unconstrained universal function approximators on unstructured data, only learn some non-linear polynomial/exponential/logarithmic combination of the data itself, without much nuance.

It is critical to help a model out by constraining the space of models it searches over to find the right answer. I think that unless we figure out a way to constrain architectures to exploit specific traits of the task they are trying to solve, (universal-function-approximator-style) ML won't succeed elsewhere in the same way that it has in vision / language.
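
For a rough sense of what "constraining the search space" buys you, here is a back-of-the-envelope sketch (the 28x28 input and 3x3 kernel are just illustrative numbers, not from the article) comparing the parameter count of a fully connected layer with that of a convolution that bakes in spatial locality:

```python
# Parameter-count arithmetic behind "constrain the search space":
# a fully connected layer from a 28x28 input to a 28x28 output must
# learn every pixel-to-pixel weight, while a convolution assumes
# spatial locality and shares one small kernel across all locations.
input_pixels = 28 * 28
dense_params = input_pixels * input_pixels  # 614_656 weights
conv_params = 3 * 3                         # 9 weights, reused everywhere

# The conv layer searches a space tens of thousands of times smaller --
# that is the "known structure" (locality, translation invariance) baked in.
ratio = dense_params // conv_params
```

The smaller hypothesis space is exactly why the constrained architecture can learn from far less data.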

As of now, the alternative is to use PGMs, where the model is fully interpretable as a graph-structured combination of explicitly parameterized random variables. PGMs work well with little data and give really good uncertainty estimates with which to evaluate the quality of a model. PGMs of course suffer from being excruciatingly slow on large datasets, and they require a decent amount of prior knowledge about the problem to explicitly define the graph structure and random variables we are going to use.
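
As a toy illustration of those uncertainty estimates, here is a minimal sketch (a single-node model with a made-up coin-flip dataset, using the closed-form beta-binomial conjugate update) of how a probabilistic model stays honest about what it doesn't know when data is scarce:

```python
import math

def beta_binomial_posterior(heads, flips, a_prior=1.0, b_prior=1.0):
    """Conjugate update for a single Bernoulli-parameter node:
    Beta(a, b) prior + binomial likelihood -> Beta(a', b') posterior."""
    a = a_prior + heads
    b = b_prior + (flips - heads)
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

# Same 80% head rate observed, but with different amounts of data.
mean_small, std_small = beta_binomial_posterior(8, 10)    # little data
mean_large, std_large = beta_binomial_posterior(80, 100)  # 10x the data

# With little data, the posterior is wide: std_small is roughly
# three times std_large, flagging how unsure the model still is.
```

A full PGM composes many such explicitly parameterized nodes into a graph, but the uncertainty bookkeeping works the same way.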

I think ML is most certainly capable of solving this problem, but the community is probably waiting for another breakthrough along the lines of AlexNet/LSTMs before that is the case.

> “Machine learning algorithms will always identify a pattern, even if there is none,” he says.

I think that's perfectly said. Humans are prone to the same thing, but we've developed better coping mechanisms.

It’s a bit foolish though, because “regression” will always identify a pattern too, as will many other simplistic models.

In ML, techniques to avoid overfitting or reporting spurious relationships are a first-order, 101 topic, especially among the type of ML engineer a hedge fund might hire (they are not hiring data science hacks).

On the flip side, I worked in a quant finance firm before that mostly did factor investing with some twists, and the overall statistical rigor was embarrassing. Even with simple regressions, nobody was asking basic robustness questions, p-hacking was daily life, directly comparing t-stats from different univariate model fits was considered “advanced feature selection.”

If a firm is going to do bad stats, they don’t need machine learning for that.

I guess what I'm trying to say is that algorithms can't tell whether they're fooling themselves. Someone has to apply the 101 techniques for testing fit, etc. Humans at least have the opportunity, though, as you point out, they don't always take it.

Maybe it's just me, but to me it seems like 99%+ of humans don't check whether they're fooling themselves, even when using statistics.

Have you ever known anyone to check if the central limit theorem applies before taking an average? I mean, we did it once when learning what it was and why you might want to check, but ...

The problem with statistics is that, in theory, they don't work in the real world. For instance, if you check a statistical variable, great. Now you fix something in the real world and recheck your variable. BZZT, wrong! You can't measure a variable after you've tried to influence it, because obviously you're not measuring the same thing anymore. So there is (potentially) no relationship whatsoever between the measurement after the change and the measurement before. So ... statistics CANNOT correctly be used to improve things in the real world.

But ... have you ever known anyone to use statistics any other way? Also: we don't actually have anything better.

The thing is ... it mostly works in practice, though you can come up with examples where it doesn't.

And of course you can do things very wrong, as you're just adding, multiplying and so on. That works on any set of numbers.

The thing about machine learning is that a well-designed machine learning algorithm encodes far fewer details about the problem than a statistical model does. So people far less versed in the problem being analyzed can improve things more using machine learning than by using statistics. But the maximum improvement you could ever hope to make is higher with statistics. Compare a second-degree regression to an LSTM for a time series. ASSUMING the statistical model applies at all, it'll beat the crap out of the LSTM. But the LSTM will sort-of succeed in nearly all cases. So if the variable fits the information you stuck into your statistical model (in this case, that the data is generated by a second-degree process with a not-too-close-to-zero determinant), there's no beating that model.
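
A quick sketch of the statistical side of that comparison (the quadratic process, its coefficients, and the noise level are all invented for illustration): when the model's structure matches the generating process, three parameters and a couple hundred points are enough.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical time series generated by a second-degree process plus noise.
t = np.linspace(0.0, 1.0, 200)
true_coeffs = [2.0, -3.0, 1.0]              # y = 2t^2 - 3t + 1
y = np.polyval(true_coeffs, t) + rng.normal(0.0, 0.1, t.shape)

# A model whose structure matches the generating process recovers the
# coefficients closely from 200 noisy points -- no LSTM required.
fit = np.polyfit(t, y, deg=2)               # [a2, a1, a0], highest degree first
```

An LSTM given the same 200 points would still produce a usable forecast, but nothing close to this precision; that is the trade the comment describes.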

Issue for the future is that all interesting problems are beyond the comprehension of any human, so ... machine learning will win. This means humans can't make statistical models for them either. It'll win not because it is always the best solution, but because for so many problems (you might as well say all problems) we will never find anything remotely optimal, or understand enough to even figure out how to apply statistics.

> Issue for the future is that all interesting problems are beyond the comprehension of any human, so ... machine learning will win.

I agree that there's a set of problems that are both beyond human comprehension and interesting to humans. Specifying them, measuring the results of algorithms to solve them, and paying for the results will probably have to remain within human analytical capability, or you wind up with Skynet (unlikely), or some analog to the 'gray goo' problem, where machines are optimizing with unintended consequences.

Most of machine "learning" in real world use seems to be humans fiddling with weights until they get the magic number they want.

That sentence is not true in the presence of good regularization. And doing proper regularization is a major part of machine learning as a discipline, just like having test and training sets to see if you overfit the data. So “Machine learning algorithms will always identify a pattern, even if there is none” is only true if you don't follow the basic best practices of machine learning.
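
To make that concrete, here is a minimal sketch (made-up sizes, pure-noise data) of regularization refusing to find a pattern that an unconstrained fit happily memorizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure-noise "dataset" with more features than samples: an
# unregularized linear model can memorize it perfectly.
n, d = 40, 80
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

w_ols = np.linalg.pinv(X) @ y      # minimum-norm interpolating fit
w_ridge = np.linalg.solve(X.T @ X + 10.0 * np.eye(d), X.T @ y)  # L2 penalty

# The interpolator "identifies a pattern" in noise (zero training error);
# the ridge penalty shrinks the weights toward zero instead.
train_acc_ols = np.mean(np.sign(X @ w_ols) == y)
```

The ridge weights have strictly smaller norm than the interpolating ones, which is precisely the regularizer declining to chase noise.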

“Machine learning algorithms will always identify a pattern, even if there is none”

Is this partly a problem with interpretation? Let's say I do a binary (supervised) classification with an algorithm that is also capable of assessing probabilities. If I generate a data set consisting of a randomized bag of words, and randomly assign them to 0 and 1 categories, and run it through a supervised ML classifier, then yeah, everything in the test set will get assigned to something.

But if you look at the probability estimates resulting from the ML, you'd almost certainly see something that indicates a high degree of randomness in the assignments (various techniques such as cross validation, or probabilities that indicate a high degree of uncertainty for almost all of the predictions).
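
Something like the following minimal sketch shows that effect (random count features standing in for a bag of words, labels assigned by coin flip, and a hand-rolled logistic regression so no library behavior is assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "bag of words": random counts, labels assigned at random.
n, d = 300, 30
X = rng.poisson(1.0, size=(n, d)).astype(float)
y = rng.integers(0, 2, size=n).astype(float)
X_tr, y_tr, X_te = X[:200], y[:200], X[200:]

# Plain logistic regression fit by gradient descent.
w = np.zeros(d)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X_tr @ w)))
    w -= 0.01 * X_tr.T @ (p - y_tr) / len(y_tr)

# The classifier still assigns every held-out point a class, but its own
# probability estimates hover around 0.5, flagging the calls as coin flips.
p_test = 1.0 / (1.0 + np.exp(-(X_te @ w)))
```

So the hard 0/1 outputs look decisive, while the probabilities tell you there was nothing to learn.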

I'm not sure this is a problem with the algorithm itself, because the output from many of these algorithms does indicate low predictive value.

Check out this classic paper on deep learning with randomized labels: https://arxiv.org/abs/1611.03530

Spoiler: the neural net thinks it's doing a really good job!

Thank you for the link. I'll read this paper. I'm hoping to reply but the thread may be stale by the time I do. Right now, my thoughts are: if it is easily fitting, what are the assignment probabilities? Are we getting 90%+, or is it fitting easily, but to much lower probabilities. Also, is there a big difference between neural nets and other algorithms like RF?

The paper certainly does appear to address the question of categorizing completely randomized input:

From the abstract

"...our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise."

I'm going off a first pass through the paper, but it appears that what this paper shows is that the training error can be 0 on an entirely randomized data set, while the generalization error (the difference between the error on the test set and the training set) does increase dramatically as label corruption increases.
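
That effect can be reproduced in miniature (all data invented; a linear model with more parameters than training points stands in for the over-parameterized conv net):

```python
import numpy as np

rng = np.random.default_rng(3)

# Miniature version of the paper's setup: random inputs, random labels,
# and a model with enough capacity to interpolate the training set.
n_train, n_test, d = 50, 400, 100
X_train = rng.normal(size=(n_train, d))
y_train = rng.choice([-1.0, 1.0], size=n_train)

w = np.linalg.pinv(X_train) @ y_train                  # fits train exactly

train_err = np.mean(np.sign(X_train @ w) != y_train)   # 0.0

# Fresh random data with fresh random labels: accuracy collapses to chance,
# so the generalization gap is roughly 50 percentage points.
X_test = rng.normal(size=(n_test, d))
y_test = rng.choice([-1.0, 1.0], size=n_test)
test_err = np.mean(np.sign(X_test @ w) != y_test)
```

Zero training error with chance-level test error is exactly the "fits anything, generalizes nothing" regime the paper measures at scale.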

My understanding is that cross validation does multiple combinations of splitting the input data into test and training sets... so if cross validation measures the generalization error, wouldn't this catch the low predictive value resulting from randomization of labels or input?

I'm not saying the paper doesn't have value, but I think it's more about the fact that neural nets can obtain a training error of zero on randomized data, not a testing error (or generalization error, which represents the difference between training error and testing error, as far as I can tell).

To be clear, I'm not an expert, and this is just what I gleaned from a first pass over the paper.

All true. The interesting thing here is that the neural network has /no idea/ that it sucks at generalization, though. Yes, we can do extra work to calibrate outputs, but it would be much better to have some idea of uncertainty from the network itself.

(Added as an edit) Also keep in mind that datasets themselves often fail to generalize - overfitting to a particular set makes for domain error when moving to slightly different data. Cross validation won't help with that, but more "self aware" algorithms might.

But... isn't that the entire point of splitting your initial training data into a training set and a separate testing set? Why is it better to have an idea of uncertainty from the model itself when you can get the generalization error through cross validation, or by setting aside a testing set?

It's interesting to see that a neural net will reach a training error of zero on randomized data, and it's a worthwhile contribution to the literature to demonstrate this, test it, and measure it... but the outcome here doesn't surprise me. From experience I know that random forests will also show nearly 100% accuracy on a training set but show far lower accuracy for a testing set, so while I think it's great to measure it, the conclusion in this paper is not surprising.

In no way is that a knock on the paper, people weren't surprised that Fermat's last theorem turned out to be true, but that doesn't make the proof any less of an accomplishment!

First, because the dataset itself might be biased in subtle ways, in which case your cross validation won't help. This happens All. The. Time. For example, your training set for speech recognition might use a nice microphone uniformly, and everything goes to hell once you deploy to cell phones, because the microphones have different characteristics.

Or, in the case of financial markets, the future might not look like the past. And the present might not look like the past... So you get datasets that are very time-specific, and thus prone to overfitting to local conditions and/or noise.

Secondly, you can absolutely overfit your cross validation set, same as p-hacking. Run experiments until you have a slight positive, statistically significant result, then tell your managers you've got some crazy new sliver of alpha. And then when it hits really new data, it falls to pieces, because repeated experiments on noise will eventually produce a statistically significant result.
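
That selection effect is easy to simulate (all numbers invented; a "strategy" here is just a vector of random daily returns with zero true edge):

```python
import numpy as np

rng = np.random.default_rng(4)

# 500 hypothetical strategies that are pure noise, each "backtested"
# over one year of daily returns.
n_strategies, n_days = 500, 252
backtest = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

def sharpe(returns):
    """Annualized Sharpe ratio of a daily return series."""
    return returns.mean() / returns.std() * np.sqrt(252)

# Pick the best performer on the backtest -- classic selection bias.
in_sample = np.array([sharpe(r) for r in backtest])
best_in_sample = in_sample.max()           # typically > 2: "crazy new alpha"

# The winning strategy's future is just more noise, so its edge evaporates.
live = rng.normal(0.0, 0.01, size=n_days)
out_of_sample = sharpe(live)
```

Nothing here overfit a model in the usual sense; the data mining alone manufactured the "statistically significant" result.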

It's like the old saw about freshmen who don't know that they don't know... Our current ML models tend to be like freshmen, or freshmen with bandaids...

I'm certainly convinced that sampling from a narrow dataset can lead to problems down the line. You don't remove possible bias in your data by randomly setting aside a testing set and using it for cross validation. It's an important practice, but yes, I agree that you can end up over training on structures that exist in both the testing and training set, since they're drawn from the same, narrow source.

As another example, you might train a model to identify positive and negative movie reviews (a pretty common example in intro tutorials). Your original data set might just be reviews by Roger Ebert. Your model, based on a training set, thinks it's at 100% accuracy. Cross validation on the test set reveals it's at 85%. Not bad! Then you apply it to reviews from 100 film critics. It's down to 70%. Then you apply it to random reviews left anonymously on the internet by 1000s of people. It drops all the way down to 55%.

That's a good argument in favor of a robust data set, drawn from multiple independent sources.

However, here's where I'm not convinced. How would a low training error based on a narrow data set be any less misleading than a low cross validation error based on a narrow training and testing set? If you aren't drawing from a robust data source, it seems the problem would be just as bad either way.

Very true! But part of knowing that you don't know is recognizing when you're in unfamiliar territory. Being able to say "this example doesn't look like what I've seen before, and therefore I'm not as sure of my response." Algorithms which can express uncertainty /should be/ more robust to domain errors.

That's because ML is given a model to fit to the data, so it'll find the best fit, even if the model doesn't represent the data.

Well, right. A binary classification system must assign a 0 or 1. But cross validation or other methods may reveal that it isn't a good fit, just a fit.

As mlthoughts pointed out in a different comment, any kind of regression technique faces issues about goodness of fit. The thing is, there are techniques to show you that the fit isn't very good. A simple linear regression will fit randomized noise, but there are outputs that can show you that the fit isn't good and the regression may not be reliable.
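
For example (pure-noise data, invented sizes), ordinary least squares will happily produce a slope, while R^2 reveals that the fit explains nothing:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simple linear regression on pure noise: y has no relationship to x.
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)

slope, intercept = np.polyfit(x, y, deg=1)   # a "fit" always comes back
residuals = y - (slope * x + intercept)
r_squared = 1.0 - residuals.var() / y.var()

# r_squared is close to 0: "a fit" is not "a good fit".
```

The regression never refuses to answer; the goodness-of-fit diagnostic is what tells you the answer is worthless.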

The question I have here is whether ML techniques are failing in a different way - fitting to randomized noise while appearing by various tests to be a very strong fit. If they're failing the same way that regression would (i.e., someone applies it and fails to do basic tests for goodness of fit), that's a problem I suppose, but is it really a unique failing of ML or neural nets? It sounds more like a standard misapplication of predictive modeling...

Most of those replies sound like "we are experts, we know better," stating badly outdated facts that probably passed through the marketing department on the way to those experts. I hope no progressive person wants to work for them, but rather to compete with them and drive them out of business - a typical pattern that repeats whenever somebody gets too cocky about their abilities.

Why are we seeing so many stories from bloomberg.com? What tier is Bloomberg's credibility, and that of their sources? What part of technology or business is their journalism known for? I am new to the US, but I think my questions are not senseless. Thanks.

I find this article moot. They are basically saying that blindly stamping ML onto your data can yield unsatisfying results. That ought to be obvious, and the point seems unnecessarily repetitive.
