Hyperparameter tuning is not as much of an issue with deep neural networks anymore. Thanks to BatchNorm and more robust optimization algorithms, most of the time you can simply use Adam with its default learning rate of 0.001 and do pretty well. Dropout isn't even necessary with many models that use BatchNorm nowadays, so tuning there is generally a non-issue too. Stacking many layers of 3x3 conv with stride 1 is still magical.
Basically: deep NNs can work pretty well with little to no tuning these days. The defaults just work.
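To be concrete, here's the "defaults just work" recipe as a minimal Keras sketch. The toy architecture and input shape are my own invention; only Adam's default 0.001 learning rate, BatchNorm, and the 3x3 stride-1 convs come from the claim above:

    from tensorflow import keras
    from tensorflow.keras import layers

    # Stacked 3x3 convs, stride 1, each followed by BatchNorm
    model = keras.Sequential([
        layers.Conv2D(64, 3, strides=1, padding="same", activation="relu",
                      input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.Conv2D(64, 3, strides=1, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])
    # Adam at its default learning rate: the "no tuning" setup
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])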
(Just poking fun. :P)
Deep in the field, it's fine for machine learning experts to say "everything just works" [if you've mastered X, Y, Q esoteric subfields and tuning methods], since they're welcome to humblebrag as much as they want. But when this gets in the way of figuring out what really "just works", it's more of a problem.
The momentum ranges from 0 to 1. If it's close to 1, which the default of 0.99 is, the EMA of the batch mean/variance will change slowly across batches. If it's close to 0, the EMA will be close to the mean/variance of the current batch.
The EMA acts as a low-pass filter. With a momentum close to 1, the EMA changes slowly, filtering out high frequencies and leaving only frequencies close to DC. Note that this is opposite to what grandparent says: 0.99 has a lower frequency cutoff than 0.6 does. So I'm not really sure what they're getting at there.
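For concreteness, this is the update being discussed, as a small sketch. I'm using the convention where momentum weights the old running average (as in Keras; PyTorch's momentum argument is defined the other way around):

    import numpy as np

    def update_running_stats(running_mean, running_var, batch, momentum=0.99):
        # Momentum near 1: running stats drift slowly (low cutoff frequency).
        # Momentum near 0: running stats track the current batch almost exactly.
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * batch_mean
        running_var = momentum * running_var + (1 - momentum) * batch_var
        return running_mean, running_var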
One of the problems I see is that people abuse deep neural networks to no end. One doesn't need to train a deep NN to recognize structured objects like a Coke can in a fridge. Simple HOG/SIFT/other feature engineering may be a faster and better bet for small-scale object recognition. However, expecting SIFT to outperform a deep neural net on ImageNet is out of the question. So when it comes to deploying systems in a short time frame, one should keep an open mind.
I disagree. Sure, you don't need a NN to recognize one Coke can in one fridge for your toy robot project. If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product? You're going to need a huge dataset of all the various designs of Coke cans out there, in all the different kinds of refrigerators, and your toy feature engineered approach is going to lose to a NN on that kind of varied dataset.
Trying to do it from images with a NN that doesn't comprehend 3D space is just silly.
Which you can solve?
Because the problem is silly.
What if I say: "I will give you $10m to solve it, and if you fail, I will kill this very kind old monkey"?
Why? I'm not sure, but I'm guessing because it is hard/inaccurate to do with just NNs and parameter/architecture tweaking. Possibly also because benchmarks with single mono images are much easier to make.
Just because it is hard with method A, and is harder to make benchmarks, doesn't mean method B isn't better.
Sure, if you are building a robot and I say "use this camera and a deep network" and you say "it'll work better with stereo" -- well... yes, super, do that!
But if we are working with mono images I don't understand how the observation helps?
> If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product?
If you're stuck with a mono dataset, post-collection, then sure, use a NN and call it a day. But even if you only have mono video, you can do 3D reconstruction just from baseline movement (structure from motion). You won't know scale, so you can't differentiate between big Coke cans and little Coke cans, but at least you can rule out pictures of Coke cans.
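For illustration, here's roughly what that looks like with OpenCV, two frames of mono video, and known camera intrinsics K. Function and variable names are mine, and this is a sketch, not a full SfM pipeline:

    import cv2
    import numpy as np

    def relative_pose(frame1, frame2, K):
        # Match ORB features between the two frames
        orb = cv2.ORB_create()
        kp1, des1 = orb.detectAndCompute(frame1, None)
        kp2, des2 = orb.detectAndCompute(frame2, None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
        pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
        # Essential matrix from the camera baseline, then decompose it
        E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
        return R, t  # t has unit norm: absolute scale is unrecoverable

From R and t you can triangulate the matched points (cv2.triangulatePoints) into a scale-free point cloud, which is exactly why the big cans and the little cans stay indistinguishable.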
Also, the training is very asymmetric, since there are many more things NOT coke cans than there are coke cans.
Not if your training set is representative. And this is just as true of feature-engineered approaches; the only difference is that dealing with real-world variation requires a lot less work with NNs, because once you add the variation to your dataset you're done. With feature engineering that's only the first step, because now you have to figure out where the new variation is breaking your features and how to modify them to fix it.
And herein lies a prominent failure mode of a huge amount of this sort of work that I've seen: it's hard to just "add the variation to your dataset" when your dataset is one or more orders of magnitude too small to contain it. At that point all that remains is handwaving.
The right response to insufficient data is usually simplifying the modeling.
EDIT: to clarify, the thing in parentheses means "j/k" ;-)
Sometimes GANs converge or not depending on the random number seed, even with the same hyperparameters.
XGBoost did not change much from the "Greedy function approximation: A gradient boosting machine" paper, but it uses a few tricks to be much, much faster, allowing for better tuning.
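A minimal usage sketch (the dataset and parameter values here are purely illustrative, not tuned):

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = xgb.XGBClassifier(
        n_estimators=500,
        learning_rate=0.05,   # shrinkage, straight out of Friedman's paper
        max_depth=6,
        tree_method="hist",   # histogram-based split finding: a big speed win
    )
    model.fit(X_tr, y_tr)
    print(model.score(X_te, y_te))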
XGBoost is popular for structured data competitions on Kaggle. Even there: The winner is often an ensemble of XGBoost and Keras. And some structured data competitions are won by neural nets alone: http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it... and https://www.kaggle.com/c/higgs-boson/discussion/10425 (Neural nets won the Higgs Boson Detection Challenge, where XGBoost was introduced)
I'd say TensorFlow/Keras can handle a wider variety of problems than tree-based methods can, with the same or better accuracy. NNs do well on structured problems (the domain of tree-based methods), but they also own computer vision and, increasingly, NLP. I agree on the pitfalls of applying neural nets vs. forests.
It is true that tree-based methods are academically a bit out of vogue: The exciting stuff is happening in the neural network space. You have more chance of getting published with deep learning (this used to be the other way around).
All of this leads to ease of implementation being overlooked. Especially on NLP, you see a lot of overengineering with deep neural nets (where the feature engineering is hidden inside the architecture). These models are hard to implement/reuse.
But yeah: academia/theoretical machine learning creates the very tools for applied machine learning.
For general-purpose ML, without much time to rig an optimal solution, RF/XGBoost will perform better. But in many problems, e.g. vision, DL is vastly superior.
The other important point is that there are many new opportunities for researchers in DL, whereas the statistical-ML approach to supervised problems is a much more established field.
On the other side of the CNN coin is the image recognition that's getting a lot of hype from the self-driving car crowd. I think any data scientist worth their salt understands how the different algorithms stack up against each other. You wouldn't use XGBoost for a computer vision problem, just like you wouldn't use a CNN for a tabular data problem.
https://www.youtube.com/watch?v=wPqtzj5VZus Trevor Hastie - Gradient Boosting Machine Learning
https://www.youtube.com/watch?v=sRktKszFmSk Ensembles (3): Gradient Boosting, Ihler
Problem: "robust" && "optimized" are very vague terms in ML.
It also seems each layer of the deep forest just concatenates a class distribution to the original feature vector, so it doesn't appear to get the same "hierarchy of features" benefit that you get in large-scale CNNs and DNNs.
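As I read it, the cascade works roughly like this toy sketch (names are mine; note the actual paper generates the augmented features with k-fold cross-validation to avoid overfitting, which I skip here):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def cascade_predict(X_train, y_train, X_test, n_levels=3):
        aug_train, aug_test = X_train, X_test
        for _ in range(n_levels):
            rf = RandomForestClassifier(n_estimators=100, random_state=0)
            rf.fit(aug_train, y_train)
            # Each level appends class probabilities to the ORIGINAL features,
            # rather than building a deeper representation on top of them
            aug_train = np.hstack([X_train, rf.predict_proba(aug_train)])
            aug_test = np.hstack([X_test, rf.predict_proba(aug_test)])
        final = RandomForestClassifier(n_estimators=100, random_state=0)
        final.fit(aug_train, y_train)
        return final.predict(aug_test)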
That's generally true for DNNs, which is a good place to be if you have lots of data. This typically isn't true for tree based approaches, which is why they fell out of fashion in some problem domains; they don't generalize as well. This paper doesn't seem to change what we already know in this respect.
"if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems"
You can read more in: Antipredictable Sequences: Harder to Predict Than Random Sequences by Huaiyu Zhu and Wolfgang Kinzel
That's a very misleading TL;DR. NFLT certainly _applies_ -- deep networks are not immune to NFLT -- it's just that NFLT isn't very useful because we can't use it as a basis for decisions. You can't detect that your algorithm's performance is being limited as a consequence of NFLT; and even if it were consequential, you would just see that the algorithm wasn't working very well, and there are _much_ more likely causes for that than NFLT.
In other words, for any pair of machine learning algorithms there is an algorithm which only performs marginally worse than either of the two on any given problem, and which may perform arbitrarily better. NFL equalises these two cases by smearing tiny losses over vast regions of possibility-space that are vanishingly unlikely.
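The construction behind that claim is plain interleaving: run both algorithms in lockstep and stop when either one succeeds. A toy sketch (the generator interface and is_solution callback are made up for illustration):

    def portfolio(algo_a, algo_b, problem, is_solution):
        # Alternate steps of the two algorithms. On any problem this takes
        # at most ~2x the steps of whichever algorithm is better there,
        # while it can beat the worse algorithm by an arbitrary margin.
        for cand_a, cand_b in zip(algo_a(problem), algo_b(problem)):
            for cand in (cand_a, cand_b):
                if is_solution(cand):
                    return cand
        return None

NFL still holds: the portfolio pays a tax of one extra step per round on every problem, and averaged over the whole problem space those tiny losses exactly offset its gains.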
Is there a resource I can refer to, that is clear and explicit about the reasoning?
The subtle elements of the theorems (both for inference and search) have had a huge impact on the field of optimization theory. They touch upon (Kolmogorov) complexity, incompleteness, halting problems, and later related work on the physical limits of inference (one cannot know everything about the universe if one is a part of it) and on philosophy (is the universe itself a computer?).
That different algorithms and search strategies will succeed on different tasks does not imply that we have to try them all: we can use prior knowledge of what worked on related tasks. Also, the tasks we care about are only a very small (explorable) subset of all possible tasks.
If we didn't care about resource usage then we could just pick a sufficiently expressive model class (e.g. all algorithms) to mix over, and perform Bayesian inference. You can look up AIXI and Solomonoff induction for more thoughts along these lines.
NFLT assumes that all machine learning problems have infinite information content: infinite Kolmogorov complexity. That all positive and negative examples are labelled completely at random, without any underlying pattern or reason. Which is obviously untrue.
True, but I'm surprised more work isn't being done on "searching the space of programs that produce the data". There's very little research on this topic other than a few papers on minimum description length (MDL). I feel that this is probably the eventual route to AGI. We know what the "optimal" predictor is in theory; now work out the best time/memory approximation to it for practical purposes.
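As a toy illustration of what "searching the space of programs that produce the data" means, here's brute-force enumeration over a made-up expression DSL, preferring shorter programs (an MDL-flavored bias; everything here is invented for illustration):

    # Tiny, hand-picked program space; a real system would enumerate programs
    OPS = ["x", "x + 1", "x * 2", "x * x", "x * x + 1"]

    def fits(program, pairs):
        return all(eval(program, {"x": x}) == y for x, y in pairs)

    def shortest_program(pairs):
        # Shortest description that reproduces the data wins (MDL spirit)
        for program in sorted(OPS, key=len):
            if fits(program, pairs):
                return program
        return None

    print(shortest_program([(1, 1), (2, 4), (3, 9)]))  # -> "x * x"

The hard part is that real program spaces explode combinatorially and are full of non-halting candidates.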
The reason you don't (in general) hear much about them if you don't go looking is that the state of the art hasn't budged much for the past couple of decades. It's the same graph algorithms and searching-and-sorting toy problems. The search space over programs is difficult to traverse, and it remains to be seen what the added compute power plus gradients gets us.
On a more practical level, the learning-to-search paradigm can be viewed as also searching for a particular program, under certain strict constraints that make the search tractable. Probabilistic programming, where the priors and likelihoods are themselves complex programs instead of simple distributions from the exponential family, is effectively also a search for programs.
Deep neural networks can be thought of as ensembles of smaller neural networks, though of course each member of the ensemble is going to share some degree of algorithmic bias. This suggests that perhaps deep neural networks with heterogeneous activation functions and branching structures will perform better than homogeneous networks.
The reason it's not actually all that useful for real-world applications is that the problems we value are a small subset of "all possible problems", for want of a better explanation. Problems we care about are typically smooth, that is, a small change in input variables results in a small change in 'fitness', which already limits how applicable the NFL theorem is.
[disclaimer, based on lectures 10 years ago now]
Look I'm just kidding around. Every algorithm has its advocates, and the arguments get tedious after a while.
This is a surprisingly commonly held fallacy in some AI circles. It's the idea that humans are mathematically perfect. When you phrase it that way, it's fairly obviously false, but you still see a lot of people argue things like "NFL doesn't apply to ensembles because humans..." or "machines can never be as intelligent as humans because...".
The reality is that humans are subject to the same mathematical laws as machines. It's far more likely that my brain can't solve an NP-hard problem in polynomial time either. My brain can't beat random search on the set of all possible problems.
The second point is that NFL is often interpreted in a weird way, where we only think about "interesting problems". It is defined on the set of all search or optimization problems. It says if your algorithm is better than exhaustive search over some subset of problems, it must be worse on the complement of that set.
What does it mean for an algorithm to be efficient? Well, roughly speaking, it's the number of steps it needs to take (assuming each step is the same amount of work, blah blah). OK, so an "efficient" algorithm must, by definition, prioritize some steps over others -- it's picking the "best" steps to take each time it has the choice. OK, so I'll just make up some instances of the problem that are custom-tailored so that what the algorithm thinks are the "best" steps always lead it in the wrong direction. Your algorithm will then be worse than exhaustive search on my set of problems, precisely because it's choosing to avoid the steps I know to be good -- I defined the problem to make that happen.
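Here's that construction in miniature, in the spirit of the "antipredictable sequences" paper cited upthread: given any deterministic predictor, you can build a bit sequence it always gets wrong, while random guessing would still be right about half the time:

    def majority_predictor(history):
        # An arbitrary "bias": predict the majority bit seen so far
        return int(sum(history) * 2 >= len(history))

    def antipredictable_sequence(predictor, length=20):
        history = []
        for _ in range(length):
            bit = 1 - predictor(history)  # always the opposite of the guess
            history.append(bit)
        return history

    seq = antipredictable_sequence(majority_predictor)
    # majority_predictor scores 0% on seq; a coin flip averages ~50%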
That is true of any algorithm you can conceive of. There will be problems for which the bias that makes the algorithm good on the problems you intended it to work on will be exactly the wrong bias.
It doesn't matter if you say, "Aha, I'll let my algorithm generate new algorithms! Gotcha!" I'll just design a set of problems for which your algorithm generating algorithm will generate the wrong algorithms.
Search is always about bias. Without bias, you have random search -- that's literally the textbook machine learning definition of bias. All NFL says is that if you have to worry about every possible problem, any bias you choose will be worse than random sometimes. There's no escape clause here. Your algorithm generating algorithm is still a search algorithm with its own biases, and it will still be worse than random on some subset of all possible problems.
They claim that deep neural approaches get 98.75% or 99.05% by referencing obsolete, decade-old results, while in fact the state of the art exceeds 99.8% (i.e. a 0.2% error rate, five times lower than the 1.0% error rate reported in this paper). I have seen MNIST given as a homework exercise for undergraduate students in an ML/NN class, and getting 99.0% would indicate that your code has some serious bugs; a decent undergrad with no prior experience can get 99.7% on MNIST after a few lectures introducing the basics and a dozen hours of homework coding practice.
"If we had stronger computational facilities, we would like to try big data and deeper forest, which is left for future work." and that:
"As a seminar study, we have only explored a little in this direction."
Not saying that the paper has no reason to exist; I think it is generally well written, and decision trees certainly deserve attention. If they can do representation learning at a high level, this is certainly something to look into. But it shouldn't claim to be an alternative to state-of-the-art deep learning if there is no data for this comparison. Everyone can solve MNIST (or even CIFAR).
A real emerging area of opportunity is having systems train new systems. This has numerous applications, including assisting DSEs in the construction of new systems or allowing expert systems to learn more over time and even integrate new techniques into a currently deployed system.
I'm not an expert here, but I'd like to be, so I'm definitely going to ask my expert friends more about this.
Looking forward to trying the code (especially on CIFAR or ImageNet). Zhi-Hua Zhou, one of the authors, said they are going to publish it soon.
For the rest of the (non-image) datasets, it's already common knowledge that boosting methods are competitive with neural nets.
1. R code implementation (could probably write this myself but would make things easier)
2. How to get feature importance? Otherwise it's difficult to use in a business context. (See the sketch after this list.)
3. Better benchmarks
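On point 2: for plain random forests, scikit-learn already exposes impurity-based importances, and something similar could presumably be averaged across a cascade's forests. A minimal sketch of the single-forest case:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(data.data, data.target)
    # Impurity-based importances, one value per input feature
    for name, imp in zip(data.feature_names, rf.feature_importances_):
        print(f"{name}: {imp:.3f}")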