What is the difference between deep learning and usual machine learning? (github.com)
228 points by hunglee2 on June 5, 2016 | 124 comments

Feature Learning.

Deep Neural Networks can learn features from essentially raw data. Usual machine learning starts with features engineered manually.

DNNs also learn to predict from the features they learn, so you could say (very roughly) "DNN = usual machine learning + feature learning".

In practice manually engineering features is a time-consuming "guess-and-check" process which benefits from domain expertise. Feature Learning, otoh, is more automatic and benefits from data, computing resources, and optimization algorithms.

No. In all supervised learning, the algorithm learns a model from the data and generalises it to unseen data. Better generalisation means a better model. In unsupervised learning (mostly clustering, density estimation etc.) the same thing happens, but we don't tell the algorithm what to learn.

Deep learning is not machine learning plus something else. It is a collection of techniques that overcomes the scalability problem of feed forward neural networks. NNs are very difficult to scale over number of layers. Standard training method of back propagation can't handle many layers because of vanishing gradient and the computational infeasibility brought on by the explosive growth of connections.
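A rough sketch of the vanishing-gradient problem mentioned above, assuming a toy "network" that is just a chain of one-unit sigmoid layers (the function names, the weight of 1.0 and the input of 0.5 are made up for illustration, not taken from any paper). The sigmoid's derivative is at most 0.25, so with this weight each layer shrinks the gradient by at least a factor of four:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, weight=1.0, x=0.5):
    """Gradient of the last activation w.r.t. the input for a chain of
    `depth` one-unit sigmoid layers (a toy stand-in for a deep network)."""
    activation, grad = x, 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: d sigmoid(w * a) / da = w * s * (1 - s)
        grad *= weight * activation * (1.0 - activation)
    return grad


for depth in (1, 5, 20):
    print(depth, input_gradient(depth))
```

The printed gradient drops by orders of magnitude as the depth grows, which is why plain backprop struggles to update the early layers of a deep sigmoid network.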

NNs are very difficult to scale with additional classification targets you may require (for example, you have a classifier for categorising 10 classes, but to scale it up to 20, requires a lot of topological changes and qualitative analysis.)

Deep learning addresses the scaling over layers with various techniques coupled with hardware acceleration (GPUs). Currently this stands at about 150 layers.

Even experts often use the "feature learning" analogy. I don't think it's wrong, or at least a bad way of explaining it.

The difference between (deep) neural networks and shallow machine learning, is that NNs can learn arbitrary features. Yes clustering doesn't require feature learning. But it is also super limited in the kinds of features it can learn. Neural nets can learn arbitrary circuits, and other types of functions.

Gaussian Processes can learn any arbitrary function. Is it "shallow" machine learning?

I think the point that the parent makes is valid: most of the advantages of deep learning when using a "simple" feed-forward topology are advances related to scaling learning and solving problems encountered with difficult tasks like image recognition, etc.

I do not know enough about neural nets to say if that is all there is to it, but one thing is sure: it's not just about "learning features", although it was shown that the output at every layer abstracts some sort of higher-level features (in the case of image recognition)

>> it's not just about "learning features"

So, I'm in no position to prove this, but my intuition is that any machine learning algorithm can be configured in a semi-supervised learning set-up, like deep nets have. You could train a decision forest classifier for instance to learn in an unsupervised manner. An algorithm I'm developing for my MSc dissertation is essentially unsupervised recursive partitioning, a.k.a. decision trees (only, first-order rather than propositional).

Well, possibly not _any_ algorithm. But I get the feeling that many classifiers in particular could be adapted to unsupervised learning with a bit of elbow grease, at which point you could connect them to their own input and, voila, semi-supervised learning.

But like I say, I don't reckon I'll be in a position to prove this any time soon.

GPs require exponentially many parameters though. They can't learn arbitrary functions, they just stupidly memorize a lookup table.

The way I know it goes along the lines of: "multilayer perceptrons with no more than three layers can learn any function to arbitrary precision given a large enough number of hidden units".

But like halflings says, neural nets are not alone in this. Decision Trees can learn any binary decision diagram I guess (they can encode arbitrary disjunctions of conjunctions). I'm pretty sure there are similar results for other algorithms also.

In any case, you can represent a function as a set-theoretical relation and enumerate its parameters- and there you go, learning done with arbitrary precision. That's not what makes neural nets impressive. So what is it?

"Shallow machine learning" is a worrying neologism. "Shallow" and "deep" only apply to neural networks, really. You couldn't very well distinguish between shallow and deep K-NN classifiers, say. Or shallow and deep k-means clustering. I mean, what the hell?

Clustering also isn't a supervised learning technique. Even though you might say DNNs can be unsupervised (autoencoders), it generally is not the case in practical systems. So it's not a good comparison at all.

I don't care about the supervised/unsupervised distinction. I'm saying they can learn features automatically (with or without supervision.)

Well, feature learning is unsupervised by necessity, otherwise you're not learning features, you're learning a mapping between features and labels.

Deep nets used in the way you say are first trained unsupervised to extract features, then the features are used in supervised learning, to learn a mapping from those new features to labels.

You can also do this "by hand" using unsupervised learning techniques like clustering, Principal Component Analysis etc: you make your own features then, and train a classifier afterwards, on the features you extracted in that way.

Deep nets just sort of automate the process.

Holy cr*p on a cracker!

150 layers? It boggles the mind.

How do you even start propagating over 150 layers? Do you assign specific functions / targets to some of the inner layers?

Deep Residual Learning for Image Recognition https://arxiv.org/abs/1512.03385

Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

And some good answers here: https://www.quora.com/How-does-deep-residual-learning-work
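For what it's worth, the core idea in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's architecture: `residual_block`, `stack` and the single ReLU layer standing in for F are hypothetical. Each block outputs x + F(x), so when F contributes nothing the block passes its input through unchanged, however many blocks are stacked:

```python
def residual_block(x, weights):
    """Return x + F(x), where F is one ReLU layer (a toy stand-in for F)."""
    fx = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return [xi + fi for xi, fi in zip(x, fx)]


def stack(x, weights, depth):
    """Apply `depth` residual blocks that share one weight matrix."""
    for _ in range(depth):
        x = residual_block(x, weights)
    return x


x = [1.0, -2.0, 0.5]
zero_weights = [[0.0] * 3 for _ in range(3)]

# With F(x) == 0 every block is the identity, even 152 blocks deep.
print(stack(x, zero_weights, 152))
```

So extra layers start out as "do nothing" and only have to learn a correction on top of their input, which is roughly the intuition behind why such deep stacks stay trainable.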

Also, very deep NN without residuals: https://arxiv.org/abs/1605.07648

Highway layers http://arxiv.org/abs/1505.00387 help with this propagation.

I agree (except for the first word), however I read the question with emphasis on "usual", as in, "What makes DNNs special?"

There's pure performance (ex., in a Kaggle competition [http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it...] or on a standard data set [http://yann.lecun.com/exdb/mnist/], [http://blogs.microsoft.com/next/2015/12/10/microsoft-researc...] ), but that's what makes any ML method better than another.

I think the deeper awesomeness is that DNNs are so good at Feature Learning from raw data. On vision, NLP, and speech problems [nice overview by Andrew Ng: https://m.youtube.com/watch?v=W15K9PegQt0] DNNs have achieved superior performance to the combination of expertly-engineered features + some usual ML algorithm.

Where a "usual ML" pipeline might look like (1) engineer features through manual effort by studying raw data and the problem domain, (2) apply ML to those features, a new DNN pipeline might look like (1) Apply DNN to raw data.

First off, removing the feature engineering step could be a huge savings in human time spent. Second, there's the potential to get a better answer (!) when you're done.

But more than that, the DNN pipeline holds the promise of more regular, systematic improvement. We (as engineers) don't have to wait for a bright idea about how to construct a feature from the data. Instead, we can focus on (1) collecting more and better data, (2) improving the optimization algorithms, and (3) acquiring more computing resources.

These latter tasks, I suspect, are easier to define and evaluate than the task "discover a new feature".

You might not call it feature engineering, but let's face it - most DNN models vary dramatically in structure based on the problem at hand.

Yep. Have a look at DNNs for image recognition, or LSTM RNN. They're the results of some furious architectural work by researchers and not at all simple to come up with (though they may be simple enough to understand now someone's created them).

> Deep Neural Networks can learn features from essentially raw data. Usual machine learning starts with features engineered manually.

What does this mean? What is "raw data" and what is a "feature engineered manually"?

Tens of thousands of engineers (audio, vision, linguists etc.) spent millions of hours for billions of dollars in the past 30 years to invent algorithms that reliably tell us something about a bunch of data. For example, a corner feature algorithm (such as SIFT) can extract the locations of corners in an image and characterize them. This is essential to many kinds of information processing tasks because we want to apply the same algorithm to different data (generalization), so we kind of need an interface to the data. This interface is called a feature (or feature algorithm, feature extractor or feature-descriptor).

All of this work (some of these papers have on the order of tens of thousands of citations) is now obsolete because you can start with a random initialization of the weights of a neural network and iteratively improve the weights using backprop for any kind of task. All you need is a measure of improvement that is relatively smooth and differentiable with respect to the network weights. What is surprising is that the circuits and programs within reach of backprop training of fully connected neural networks are actually astonishingly good at what they do. But ultimately, this is maybe not so surprising given that our brains do something similar all the time.

>All of this work (some of these papers have on the order of tens of thousands of citations) is now obsolete because you can start with a random initialization of the weights of a neural network and iteratively improve the weights using backprop for any kind of task

Hardly correct. You can't magically learn any kind of task. You can't add arbitrary number of layers and hope for the back prop to do its magic. It is difficult. Deep learning techniques are what makes it somewhat feasible.

SIFT is not obsolete because of NNs. They all have their pros and cons. You have to select the right tool for the job. BTW SIFT is not an edge detector (that's the Canny Transform). It describes images using salient features in a scale-invariant manner.

Typos fixed. "All" was hyperbole of course, but I think it definitely does not look good for the majority of the work done on features. SIFT was recently outperformed by PN-Net for example.

SIFT is also quite old. It's amazing a single technique has retained so much value. Isn't it curious that modern convnets use convolution? On top of that, they do convolutions at multiple scales (pooling). Starting to sound very familiar...

Actually the neural net approaches are older than SIFT.

Neural nets learn the distribution and even causal factors in the data. To me it seems that this distribution is often just too complex for it to be robustly captured by something that doesn't learn. Learning causal factors critically depends on learning along the depth of the network of latent variables which is a particularly opaque process, but this is what MLPs seem to do quite canonically (convnet being just a restricted special case of MLPs). I mean discerning causal factors is pretty much canonically the act of accumulating evidence with priors (weighted summation), deciding whether it is sufficient evidence and signaling how much it is (non-linearity).

Some of the approaches are, some aren't. SIFT itself builds upon knowledge that is much older than it. Either way it doesn't matter. The OP was arguing that the many years of man effort put into SIFT was a complete waste. I am saying that this is very shortsighted, as non-machine learning vision techniques have heavily influenced how we approach and think about vision problems even when using ML.

>> SIFT is not obsolete because of NNs. They all have their pros and cons.

Case in point- DNNs for image recogn. use Sobel edge detectors and other "obsolete" filters to do their magickal magic.

>> we kind of need an interface to the data. This interface is called a feature (or feature algorithm, feature extractor or feature-descriptor).

Excellent in-a-nutshell explanation of features and thank you for a definition I really hadn't thought of.

This though:

>> this is maybe not so surprising given that our brains do something similar all the time.

Is just so much fantasy, sorry to say. Neural nets (and machine learning in general) learn in ways that are completely unlike the human. They need huge, dense datasets, we can make do with scraps of sparse data. They need huge amounts of computational power, and time, we learn in the blink of an eye. They learn one thing at a time and can't generalise knowledge to even neighbouring domains, we can, oh yes indeed. An infant that can recognise images at the level of AlexNet, can at the same time tie its own shoelaces, speak rudimentary language and protect itself from danger etc. AlexNet can only map images to labels. It does that very well, but it's a one trick pony and so are all machine learning algorithms, fearsomely effective but heart-breakingly limited. Human minds are generalisation machines of the highest order and we are nowhere near figuring out how they (we) do it.

Think of it this way: it took a few dozen researchers a few decades to come up with backprop. It took evolution billions of years to come up with a human mind. Which one do you think is the more optimised, and how much hubris does it take to convince oneself that they are pretty much the same in capabilities?

One example is this recent paper on learning (somewhat) high-level attributes of text from character streams alone (i.e., without telling the convolutional networks that things like words and punctuation exist).


>> We show that temporal ConvNets can achieve astonishing performance

Yay, astonishing performance! I'm totally gonna waste half an hour of my life to read about what awesome badassery convnets are! Because that sounds so objective!


The Neural Network Playground is great for understanding this[0]!

The default example is classification of a circle of one class surrounded by a donut of another. There are two features x_1 and x_2 (this is the "raw data").

One solution to this problem is to use a single layer and a single neuron but engineer features manually. These manually engineered features are x_1*x_2, x_1^2, x_2^2, sin(x_1) and sin(x_2). Here's a link to this model (long url)[1].

This model performs very well at learning to classify the data just by combining these manual features with a single neuron. The problem is a human needs to figure out these features. Try removing some and observe the different performance given different manual features. You'll see how important it is to engineer the correct ones.

Alternatively you can have 2 layers of 4 neurons [2]. In nearly the same number of iterations this network also learns to classify the data correctly. This is because the non-linear interactions between neurons are actually transforming the inputs in the appropriate ways. That is to say the network is learning to engineer the features itself. Try removing layers/nodes and you'll find that a simpler network will have a harder and harder time at this.

I recommend playing around with the various tradeoffs between manually engineered features and network complexity. The interesting thing you will observe is that in some cases a simpler model with manual features learns much faster than the network. The big issue comes up when we can't simply "see" the problem in 2D, so we have no idea what features may and may not be useful.

[0] http://playground.tensorflow.org/

[1] http://playground.tensorflow.org/#activation=tanh&batchSize=...

[2]. http://playground.tensorflow.org/#activation=tanh&batchSize=...
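If you'd rather see the manual-feature effect outside the browser, here is a minimal sketch under some assumptions of my own: the ring data is generated on a fixed grid rather than sampled like the Playground does, and the single neuron is trained with the classic perceptron rule instead of the Playground's gradient descent. All names (`make_rings`, `train_perceptron`, `accuracy`) are hypothetical. One linear unit cannot separate the classes on raw x_1, x_2, but it can on x_1^2, x_2^2:

```python
import math


def make_rings():
    """Inner disc (label +1) surrounded by a ring (label -1)."""
    points = []
    for k in range(16):
        angle = 2 * math.pi * k / 16
        for radius, label in ((0.5, 1), (1.0, 1), (2.0, -1), (2.5, -1)):
            point = (radius * math.cos(angle), radius * math.sin(angle))
            points.append((point, label))
    return points


def train_perceptron(samples, epochs=1000):
    """Single linear unit trained with the classic perceptron rule."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side
            break
    return w, b


def accuracy(samples, w, b):
    hits = sum(1 for x, y in samples if y * (w[0] * x[0] + w[1] * x[1] + b) > 0)
    return hits / len(samples)


raw = make_rings()
# Hand-engineered features: replace (x_1, x_2) with (x_1^2, x_2^2).
squared = [((x[0] ** 2, x[1] ** 2), y) for x, y in raw]

acc_raw = accuracy(raw, *train_perceptron(raw))
acc_squared = accuracy(squared, *train_perceptron(squared))
print("raw x_1, x_2:", acc_raw)
print("x_1^2, x_2^2:", acc_squared)
```

In the squared feature space the classes are split by the line x_1^2 + x_2^2 = const, so the single unit gets everything right, while on the raw inputs no straight line can do it.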

Some of the prominent achievements in deep learning, like AlphaGo, used manually specified features.

Have you got a reference?

How about the AlphaGo paper itself:


Bottom of page 23 and appendix tables around page 31 and 32.

Edit: The stated big "next attempts" for them will be to learn these features via an algorithm rather than handcoding them, and to learn based on self play rather than a database of master games.

There are lots of other unsupervised learning methods in machine learning.


I think the article does a good job at explaining the major points. But what you have described in here can also be said about MLPs, nothing deep about them on their own. For example XOR function with MLP combines features to come up with more complex features.
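To make the XOR point concrete, here is a small sketch with hand-picked weights (nothing is learned here, and the threshold units and weight values are my own choice for illustration). The hidden layer computes two intermediate features, OR and AND, and the output unit combines them into XOR:

```python
def step(z):
    """Threshold activation: fires when the weighted sum is positive."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    # Hidden layer: two features built from the raw inputs.
    h_or = step(x1 + x2 - 0.5)   # 1 if at least one input is 1
    h_and = step(x1 + x2 - 1.5)  # 1 only if both inputs are 1
    # Output layer: "OR but not AND" is XOR.
    return step(h_or - h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))
```

No single threshold unit on the raw inputs can compute XOR, so even this two-layer net is already "combining features into more complex features".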

A ray of light. MLPs can approximate any nonlinear function in the domain they have been trained on. What is it about the depth that makes DNNs more tractable to train than shallow networks? Is it that the particular tricks that have been developed for DNNs haven't been generalized to work at arbitrary depths? Is it that it is easier for humans to design the abstractions that are used when they are layered? Are you aware of any theoretical work in this direction?

>MLPs can approximate any nonlinear function..

Theoretically yes. But the drama is when you have to actually do it. DNNs are not more tractable on their own, they are made feasible by the current set of techniques.

>Is it that it is easier for humans to design the abstractions..

You could argue that activation maps generated in convolutional layers by the filters are feature engineering, as those filters are manually created. These are problem dependent, and we know more about the problem than the algos. That's why feature engineering hasn't gone away completely.

Oops, accidentally flagged GP. Sorry for that, hope the unflag works :/.

I'm going out on a limb here but I'm guessing if this explanation makes sense to you, you don't need to be told the difference between deep and usual machine learning. Could be wrong!

It sounds like you're implying, but don't want to state, that the explanation is not clear enough for outsiders (such as yourself?)

If I have read your comment correctly, I'll say it for you: as an outsider, I read this carefully until I gave up because it was too technical, which happened right at the very top, in the third paragraph:

>Those hidden layers normally have some sort of sigmoid activation function (log-sigmoid or the hyperbolic tangent etc.). For example, think of a log-sigmoid unit in our network as a logistic regression unit that returns continuous values outputs in the range 0-1

All this implies I know all about multi-layer perceptrons - and I don't. I can't follow the instructions to "think of a log-sigmoid unit in our network as a logistic regression unit" because I don't know what those terms mean.

Just as I would give up on a recipe if I got to an instruction I didn't know. For example, if I read:

>Glaze the meringue with a blow torch, or briefly in a hot oven.

Yeah, uh, no... I don't even know what glazing means, or what is "briefly in a hot oven". So I just stop reading. When I'm instructed to do something I can't, I go look at something else unless I'm feeling very adventurous.[1]

This blog post isn't written at my level.


[1] as a last hoorah I'll open a tab and Google https://www.google.com/search?q=what+is+glazing - likewise I tried https://www.google.com/search?q=what+is+a+multilayer+percept... but decided after reading the Wikipedia link that it was too "deep" for me.

A month ago I was in the same place- I would start reading a short blog posting on RNNs/ConvNets/etc., and within 2-3 paragraphs my eyes would glaze over from the math and other foreign terminology. Frustrating. To try and fix this I am "auditing" the Stanford course on ConvNets: http://cs231n.stanford.edu/syllabus.html

I'm about 2/3 done with the homeworks, and I understand this stuff now. I'll never be a data scientist, but I know enough to implement these networks on my own, and to understand blog posts like this. It's a lot of work for one course, much more than I remember from my own undergrad years. I had to revisit Calculus & Linear Algebra too. But if you're genuinely interested in this stuff you can pick it up.

"I had to revisit Calculus & Linear Algebra too" - what resource would you recommend for this? After being a web developer for a couple of years I find myself rusty and unable to find good resources for this. Trying to get into machine learning but I've forgotten most of the math.

Nope. I have a 1990s understanding of perceptrons; this helped show me a bit of what the current rage is about.

Not true. I know quite a bit about three-layer networks and backprop but had been puzzled about how people were training networks with more layers. This article was helpful.

To me anyway, this short explanation was extremely useful. I had played with simple neural nets just once in the past.

This may come as a shock to the layperson, but there's more to artificial intelligence than neural nets, and many non-NN AI approaches could arguably be said to "learn", and are thus "machine learning" too, depending on your definition.

I'm thinking of evolutionary algorithms, various other biologically inspired computation techniques (of which NN's are but one example), more traditional AI techniques such as expert systems, and a whole host of stochastic, non-biologically inspired algorithms.

I'm not super familiar with NN's myself, so I can't say whether the gigantically disproportionate attention from the media and the research community is deserved based on actual superiority in effectiveness of NN's compared to other techniques, or whether they're used and talked about mostly because that's what most people know.

It would be interesting to hear the thoughts of a non-NN AI researcher regarding this.

The answer to the question which technique will lead to AI depends on who you ask.

Geoffrey Hinton and Jürgen Schmidhuber are convinced that neural nets trained with backprop will likely lead to AI with minimal additional fixed-function structure. Schmidhuber's intuition is that recurrent neural networks (RNNs) are capable of universal computation, so given sufficient computational resources the RNN can learn any task, including generalization, goal and action selection and everything else that we associate with intelligence. Hinton's intuition is that RNNs essentially accumulate representations in their hidden state vectors that are very similar to human thoughts ("thought vectors"). Hinton thinks that thought vectors will naturally lead to the kind of reasoning that humans are capable of. In his view the human brain is basically a reinforcement-modulated recurrent network of stacks of autoencoders (needed for unsupervised representation learning).

Pedro Domingos sees such universality, which may lead to AI, in all major fields of AI: (1) Symbolists have inverse deduction (finding a general rule for a set of observations), (2) Connectionists have backprop and RNNs (see above), (3) Evolutionists have genetic programming (improvement of programs via selection, mutation and crossover), (4) Bayesianists have incremental integration of evidence using Bayes' theorem, e.g. dynamic Bayes networks, and finally (5) Analogizers have kernel machines and algorithms to compare things and create new concepts.

Domingos's hypothesis is basically that only a combination of several of these approaches will lead to AI.

It's also possible that all these approaches (1-5) are different viewpoints of some more fundamental, underlying concept/description of intelligence and will converge at some point (as happened to Turing Machines vs. Lambda Calculus). This would be very interesting.

Is there a proof that generalization or goal and action selection or other things associated with intelligence are actually computable?

If the Strong Free Will theorem is right, there are distinct limits on the ability of computational devices to simulate our physics.

For one, anything a computable machine does "on its own" will be a function of the past history of the universe (deterministic), whereas the SFWT says that elementary particles are free to act without regard to the entirety of history.

E: even a one word reply conveys so much more than an anonymous downmod.

I didn't downvote you but I suppose the criticism would be that our machines are built from the same elementary particles that a brain is, so why would our machines suddenly be so deterministic as to disallow strong ai but a brain is not.

Furthermore an AI does not compute a physics model any more than a brain does, so the criticism does not apply.

But the theory of the machines is entirely deterministic. Any non-deterministic behavior of a machine would be considered an error under any approaches being discussed here.

The point is that the universe can do things that functions cannot. If your only tool is functions (only thing that algorithms can compute), then you will necessarily be unable to handle all possibilities (by which point, I question your intelligence).

Functions routinely use random number generators. Pretty much all AI/Machine Learning techniques use non-determinism as part of their design.

But even so, it's confusing to phrase intelligence in terms of non-determinism. It's easy to come up with a non-deterministic answer to an arbitrary question. It's hard to come up with a correct answer to an arbitrary question. If unpredictability is a component of sound reasoning, it's because we humans are so bad at reasoning.

Pseudo-random generators, for sure, which are entirely deterministic. It does not even matter if you had a source of truly random numbers. Given the same sequence of numbers, an algorithm will return the same result. That is what it means to be a function.

The problem is not that, "my answer must be 'unpredictable'", but, "the actual answer may not be computable" (and so, no algorithm may ever derive it).
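The determinism point is easy to check for yourself. A minimal sketch using Python's standard `random` module (the helper name `draw` and the seed values are arbitrary choices for illustration): the same seed always reproduces the same "random" sequence, so a function that takes the seed as an input is still a function in the strict sense.

```python
import random


def draw(seed, count=5):
    """Return `count` pseudo-random digits from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(count)]


first = draw(42)
second = draw(42)

# Same seed, same sequence: the output is fully determined by the input.
print(first == second)
```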

That is why a random generator has a seed, which is sometimes derived from non-deterministic factors such as hardware noise and environmental noise.

I think if the one-word reply had been offered, it would say: "Dualism".

Dualism is a very lonely position around HN. You and I may be the only dualists here.

I think none of the things I've mentioned rule dualism completely out, but they likely restrict the possibilities of where to find it.

I did not downvote (and you're clear now) but your post is not a relevant argument. The determinism that the SFWT is arguing against is that of certain hidden variable theories of quantum mechanics. It states that if the humans are free to choose particular configurations for an experiment measuring this or that spin, then bounded by relativity and experimentally verified aspects of quantum mechanics, the behaviors of the particles cannot be dependent on the past history of the universe. The main characters are the particles, people are incidental.

> "Our argument combines the well-known consequence of relativity theory, that the time order of space-like separated events is not absolute, with the EPR paradox discovered by Einstein, Podolsky, and Rosen in 1935, and the Kochen-Specker Paradox of 1967"

So as far as I can tell, it takes for granted the humans' ability to choose the configurations freely, which though suspect in and of itself doesn't matter so much to their argument as it's not really an argument for free will, it's a discussion of how inherent to quantum mechanics non-determinism is.

> "To be precise, we mean that the choice an experimenter makes is not a function of the past."

> "We have supposed that the experimenters’ choices of directions from the Peres configuration are totally free and independent."

> "It is the experimenters’ free will that allows the free and independent choices of x, y, z, and w ."

It is actually, if anything, in favor of no distinction between humans and computers (more precisely, it is not dependent on humans, only a "free chooser") as they argue that though the humans can be replaced by pseudo random number generators, the generators need to be chosen by something with "free choice" so as to escape objections by pedants that the PRNG's path was set at the beginning of time.

> The humans who choose x, y, z, and w may of course be replaced by a computer program containing a pseudo-random number generator.

> "However, as we remark in [1], free will would still be needed to choose the random number generator, since a determined determinist could maintain that this choice was fixed from the dawn of time."

There is nothing whatsoever in the paper that stops an AI from having whatever ability to choose freely humans have. The way you're using determinism is more akin to precision and reliability: the human brain has tolerances but it too requires some amount of reliability to function correctly, even if not as much as computers do. In performing its tasks, though the brain is tolerant to noise and stochasticity, it still requires that those tasks happen in a very specific way. Besides, the paper is not an argument for randomness or stochasticity.

> "In the present state of knowledge, it is certainly beyond our capabilities to understand the connection between the free decisions of particles and humans, but the free will of neither of these is accounted for by mere randomness."

If an AI is an algorithm, then it will be unable to produce "answers" to what we observe. That is the relevance. One would need to show a contradictory example to the theorem to ignore it.

>There is nothing whatsoever in the paper that stops an AI from having whatever ability to choose freely humans have.

There is if an AI is dependent on deterministic methods. I agree that AI is not a well-defined term, but all proposals I have seen are algorithms, which are entirely deterministic. This is entirely at odds with the human conception of free choice. An algorithm will always produce the same choice given the same input. Any other behavior is an error.

The SFWT says that observations can be made that cannot be replicated through deterministic means, which would seem (I agree there is a very slight leap in logic here) to rule out any AI from ever being able to understand at least some aspects of our reality (and also reveals them to be simple, logical machines, with no choice).

Algorithms are not by definition deterministic, which seems to be one of your key points. Probabilistic algorithms exist. They may or may not be used in machine learning currently, but they do exist.

Can you provide an example? All probabilistic algorithms I have seen either rely on a pseudo-random generator or rely on an external source of numbers. I have argued elsewhere in these comments that both cases may be considered deterministic.
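To make the pseudo-random case concrete, here is a minimal Python sketch. The function name `sample_choices` and the seed value are illustrative assumptions, not something from the thread. It only shows that a seeded pseudo-random generator replays the same "choices" for the same seed; it says nothing about whether an external entropy source should count as deterministic.

```python
import random

def sample_choices(seed, n=20):
    """Draw n pseudo-random digits from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return [rng.randint(0, 9) for _ in range(n)]

first_run = sample_choices(seed=42)
second_run = sample_choices(seed=42)

# Same seed, same sequence: the "random" choices are a pure function of the seed.
print(first_run == second_run)  # prints True
```

Seen this way, the seed is just one more input to the algorithm, which is the sense in which the comment above calls such algorithms deterministic.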

NN are being talked about because they have been excessively effective at advancing the state of the art in areas that were previously in a dead end.

Computer vision is such a field. The results provided by conventional approaches had been plateauing for years and then one day, Yann LeCun submitted a paper describing the work that he and his team had done in the area with neural networks and which blew away the previous results on image recognition.

Interestingly, this paper was rejected and it took the CV community a full year to finally turn around and accept NN as a valid approach to this problem. I believe no one questions that it's the best methodology we have today.

Speech recognition, automatic translation and natural language processing in general are other areas that have benefited immensely from neural networks.

So this is what confuses me about what is 'new' with respect to deep learning. Neural nets are not new - I was aware of the existence of, and some of the basic ideas behind, neural nets as a technology in the 1990s and I wasn't even involved in computer science, so I assume that means that even then they were a mainstream AI technique.

When I read content about 'deep learning' neural nets today I don't see anything especially different to what my (admittedly shallow) understanding of neural nets was back then. So what I'm missing mainly is - what changed? Is it just that advances in compute power mean that problems for which neural nets were impractical have now become practical? Is there something different about the way neural nets are employed in 'deep learning' that is different than the neural nets that were discussed in the past?

Previously, neural networks were trained by taking single steps down the direction of sharpest gradient for the network. However, in deep networks with lots of layers, the backprop algorithm (which I assume you already know about) got stuck in local minima.

Deep learning got started when Hinton observed that a certain way of training restricted Boltzmann machines wouldn't get stuck as easily, and hence by pretraining the network as if it were an RBM and then switching to backprop, it wouldn't get stuck in a local minimum as early.

As I understand it, nowadays the best method looks something like a generalization of Newton's Method, wherein the direction you move takes into account the second differential and not just the first differential or direction of sharpest descent. You move furthest in the directions that curve the least, and move the least in the directions that are most sharply curved. It turns out that this (plus some other tricks) makes it way easier to follow continuous gradients in big weird parameter spaces, so now it's possible to train deep nets, which are a kind of continuous gradient in a big weird parameter space.

TL;DR: People figured out how to move better through the parameter space of neural networks, by taking into account the second differential plus some other tricks. So now we can train deeper nets.
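As a rough illustration of the first-order versus second-order idea described above, here is a toy sketch on a two-parameter quadratic bowl. The objective, the curvature values, and the learning rate are assumptions chosen for illustration; real deep nets are not quadratic, and this is not a claim about what any production optimizer does.

```python
# Toy objective: f(x, y) = 0.5 * (1 * x**2 + 100 * y**2).
# It is gently curved along x and sharply curved along y.
CURVATURE = (1.0, 100.0)  # diagonal of the Hessian (second derivatives)

def gradient(point):
    """First derivatives of the toy objective at `point`."""
    return tuple(c * p for c, p in zip(CURVATURE, point))

def gradient_descent(point, learning_rate, steps):
    """Plain first-order descent: the same step scale in every direction."""
    for _ in range(steps):
        g = gradient(point)
        point = tuple(p - learning_rate * gi for p, gi in zip(point, g))
    return point

def newton_step(point):
    """Second-order step: divide each gradient component by its curvature,
    so flat directions get long steps and sharp directions get short ones."""
    g = gradient(point)
    return tuple(p - gi / c for p, gi, c in zip(point, g, CURVATURE))

start = (1.0, 1.0)
gd_result = gradient_descent(start, learning_rate=0.01, steps=100)
newton_result = newton_step(start)
print("gradient descent after 100 steps:", gd_result)
print("one Newton step:", newton_result)
```

On this toy problem the learning rate has to stay small to remain stable along the sharply curved direction, so plain descent crawls along the flat one, while the curvature-scaled step lands on the minimum at once.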

While there are newer algorithms (almost all first order btw), one of the most commonly used for deep convnets, SGD with momentum, is from the 80s. It really is mostly about computing power - one GTX 1080 has more TFLOPS than the world's fastest supercomputer until 2001 [0]. The actual speed difference is probably an order of magnitude larger due to an absence of communication overhead and latency inherent in sharing work across thousands of separate CPUs and lots of slow RAM. That would make one GTX 1080 equivalent - for neural net training purposes - to a supercomputer from 2004.

[0] https://en.wikipedia.org/wiki/History_of_supercomputing#Hist...

More specifically, techniques such as momentum https://www.willamette.edu/~gorr/classes/cs449/momrate.html (great article visualizing the problem space discussed above) and things like RMSProp http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slid... are used.
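For readers who have not seen these update rules, here is a minimal sketch of one common textbook formulation of each. The function names, default hyperparameters, and the toy objective are assumptions for illustration; library implementations differ in details.

```python
import math

def momentum_step(w, velocity, grad, lr=0.1, momentum=0.9):
    """Classical momentum: keep a decaying running sum of past gradients."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def rmsprop_step(w, mean_square, grad, lr=0.1, decay=0.9, eps=1e-8):
    """RMSProp: divide the step by a running RMS of recent gradients."""
    mean_square = decay * mean_square + (1.0 - decay) * grad * grad
    return w - lr * grad / (math.sqrt(mean_square) + eps), mean_square

# Toy objective f(w) = 0.5 * w**2, whose gradient is simply w.
w, velocity = 5.0, 0.0
for _ in range(200):
    w, velocity = momentum_step(w, velocity, grad=w)
print("w after 200 momentum steps:", w)
```

Both rules are first order: they only ever look at the gradient, which matches the point made above that most newer algorithms do not use second derivatives.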

The rectified linear unit also converges about 75% faster than the classic sigmoid or tanh. The ReLU is F(x) = max(0, x), so the gradient propagated is 1 for positive inputs (and 0 otherwise) rather than a more "curvy" value.
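A small sketch of the gradients being compared here, using the standard definitions (the helper names are just for illustration):

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Slope is exactly 1 for positive inputs and 0 for negative ones.
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Peaks at 0.25 when x = 0 and shrinks quickly as |x| grows.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-5.0, 0.5, 5.0):
    print(x, relu_grad(x), sigmoid_grad(x))
```

The sigmoid's slope never exceeds 0.25, so every sigmoid layer scales the backpropagated signal down, whereas an active ReLU passes it through unchanged.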

Thank you! That makes a lot of sense.

Thanks, this explanation made a lot of sense to me.

>> Speech recognition, automatic translation and natural language processing in general are other areas that have benefited immensely from neural networks.

In NLP in particular, I'm not so sure about the "immense" benefit of ANNs. LSTM-RNNs are very nice for language modelling but we haven't seen the 10-percentile jump in performance that the image processing community has. Stanford I believe uses ANNs in their dependency parser (to learn transitions) and Google of course uses ANNs everywhere, but the gains against good old HMMs are nowhere near the gains in image processing and it's not clear what exactly is the difference. Maybe teams have better hardware and more data now. Rather, they definitely do, so maybe that's the reason for any increases in performance.

With translation in particular (it's a branch of NLP) it's very hard to know what's going on. Companies who sell machine translation products still use different techniques, including rule-based software. The problem is that machine translation needs some way to deal with semantics, else it's a bit shit. Google again- they do use ANNs (apparently- it's hard to tell because they don't) but is their translation any better than others'? That's almost impossible to know, because there are no good measures of machine translation performance. I'll repeat that: we don't have any good measures of machine translation performance. So anyone can come up with a solution and drum it up as "state-of-the-art", and if their user base is large enough it'll be impossible for anyone (even themselves) to know for sure whether that means anything.

Speech recognition- I don't know much about that, but I think it's similar to NLP. You can do a lot with a lot of data and hardware, if you're lucky to have them and at that point it doesn't really make a real difference to your end-user if you're using an ANN model or an HMM model.

DNNs have led to massive improvements in speech, specifically in the acoustic model. RNN-based language models interpolated with modified Kneser-Ney smoothed n-grams (often plus some other cool stuff) are also the state of the art, although they're often too slow to use in practice.

Thank you - I stand corrected, although I think the "too slow to use in practice" bit is possibly what I had in mind.

Which may just go to show that Hinton, Schmidhuber and the rest of the connectionist gang are right about the value of more data and faster computers.

What justification does a journal have for rejecting an objectively groundbreaking result, just for using an unfashionable technology?

Here is Yann's summary of the situation back then:


I seem to remember this article as having a lot of very interesting comments, including from the chairman of the conference. The exchange was cordial but showed some clear antagonism between the two parties. Somehow, I don't see these comments any more on Google+.

That was four years ago, things are a lot better now.

Things are better for neural nets because they've become fashionable. But nothing changed to make the peer review process tolerate unfashionable technologies.

>> more traditional AI techniques such as expert systems

I believe you'll be hard pressed to find anyone that describes expert systems as "learning" in any way. They're typically considered as the antithesis of machine learning: they're knowledge-driven but you have to enter all the knowledge "by hand". Indeed some machine learning algorithms were proposed as ways to automate the creation of the large rule-bases typical of expert systems (the most prominent such algorithm being Decision Trees).

>> It would be interesting to hear the thoughts of a non-NN AI researcher regarding this.

Researchers are rarely as proscriptive as that. You may find folks who specialise in ANNs to the exclusion of other techniques (but it's true it's a vast field) but you'll probably not find anyone who wouldn't try ANNs a couple of times, and maybe publish a paper or two on them. Same goes for other trendy algorithms also.

I've been looking to learn more about non-machine learning AI research for the past couple of months (essentially, something that is the spiritual successor to the GOFAI/expert system path instead of the currently popular statistical machine learning path). For some reason it's quite hard. I'd appreciate it if anyone could point me to the state of the research in those areas (conferences, research groups' names, etc.).

Classic non-ML AI research these days sometimes tends to be baked into the background of systems where deep learning is the star-- e.g., all of the search techniques used in AlphaGo are interesting in their own right, but aren't really where the innovation is there.

There are researchers actively looking at more holistic AI systems that aren't just the core learning part (often called cognitive architectures). For instance, take a look at Soar: http://soar.eecs.umich.edu/

It is under active development, and has been since the 80s. Soar is very much a spiritual successor to GOFAI (or maybe still is GOFAI), but does incorporate statistical machine learning-- much of the recent work has been around integrating reinforcement learning with symbolic decision-making.

ACT-R is another cognitive architecture under active development, but is more used for cognitive modeling (i.e. psychology research): http://act-r.psy.cmu.edu/

You can find other related systems by searching for "cognitive architecture", but it is unfortunately a field that attracts a lot of proposals/'designs' for things that never get properly implemented (the above being two notable exceptions).

If you're looking for symbolic approaches within machine learning, Luc De Raedt's textbook has the works:


Lise Getoor from UC Santa Cruz is also doing a lot of work with statistical relational learning in particular, which is kind of a cross between GOFAI (the "relational" bit) and nouveau AI (the "statistical" part):


In Germany I know of Kristian Kersting who does statistical relational learning research, at Dortmund:


I think this is not exactly what you were asking for but I think you might find it interesting, particularly Inductive Logic Programming (of which there's oodles in De Raedt's book).

Personally, I started an MSc in AI because I was interested in all the symbolic stuff, and I was a bit disappointed to find there was almost none of that left in the curriculum (at my school anyway, the University of Sussex). I guess I kind of see the point though- knowledge engineering is pretty hard and costly and learning from data is very attractive. I really don't like the idea of doing it all with statistical black-boxes, though. So relational learning is a kind of halfway house for me, where I can still use mathematical logic principles and not have to eat the statisticians' dust, so to speak. For me relational learning is the spiritual successor of GOFAI that I was looking for, so I think you may find something of interest in the links above.

Edit: there's also the probabilistic logic programming community, forgot about them; here:


That's from the University of Leuven, where De Raedt is from. The name says it all: Horn clauses with probabilistic weighting. So that's not machine learning (or any kind of learning, it's just what it says on the tin). You will find many more here:


Although that's not all _logic_ programming (but there's quite a bit of it).

In fact, the recent massive success of Deep Neural Nets has led to the terms 'Artificial Intelligence', 'Machine Learning' and 'Deep Learning' being used interchangeably. This was not the case till early 2015. In order to inform the masses about the breakthroughs, the media started generalising DNNs as AI, partly because this was the only AI technique to show such results.

A state-of-the-art handwriting recognition algorithm was published late last year, based on a new approach the authors termed Bayesian Program Learning: http://science.sciencemag.org/content/350/6266/1332

There is no such thing as "usual machine learning". There are many machine learning techniques, neural nets are one of them, and deep learning is a very powerful way of training deep neural networks.

Deep Learning is absolutely interesting and powerful, but it's not like all the other techniques are all the same, or necessarily less powerful, for that matter. Different techniques have different applications.

The words "deep learning" don't refer to a way of training neural nets.

It's a marketing term for neural nets that have "a lot" of layers. Neural nets have been around for over 30 years, calling them deep learning was a smart re-branding move.

From what little I've read, deep learning is a new way of training nets. Deep neural nets have also been around for 30 years, but backpropagation, their traditional learning algorithm, isn't as efficient in training the middle layers.

You should keep reading. Backprop is still the best, and pretty much the only way to train neural nets (deep or not). Other ways exist (e.g. weight perturbation), but no one uses them. EDIT: I have to clarify: backprop is just one half of the training algorithm, you also need gradient descent, and there are many variants of it.

Man, I've been out of Neural Nets a long time. I studied them back in the early 1990s. Perceptrons and backpropagation. Didn't really keep up with the state of the art, I'm afraid. Maybe I should catch up.

Deep learning is machine learning involving neural nets with more than one hidden layer.

Most definitions of deep learning don't include the words "neural network". The term is synonymous with multi-layer neural network techniques for now, but not necessarily in the future.

For example:

>Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.

Deep Learning, Nature Volume 521 issue 7553, 2015, LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey

To do increasingly fancy things with feed forward neural networks, we require increasingly many layers. But with many layers, it becomes increasingly difficult to train them (vanishing gradient, explosive growth of connections, etc). But Hinton et al. found a few tricks to go past those obstacles and that's what 'deep learning' encompasses. Current state of the art of deep learning is at 150-odd layers with a new trick called 'residual learning' by Microsoft Research China.
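To give a feel for why depth causes the vanishing gradient and why a skip connection helps, here is a deliberately tiny scalar sketch. It assumes each "layer" is a single sigmoid with no weights, and models a residual layer as x -> x + sigmoid(x). This is an illustration of the chain rule only, not the actual residual network architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def plain_chain_gradient(x, depth):
    """d(output)/d(input) when x -> sigmoid(x) is applied `depth` times."""
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid_grad(x)  # chain rule: multiply local slopes (<= 0.25)
        x = sigmoid(x)
    return grad

def residual_chain_gradient(x, depth):
    """Same, but each layer is x -> x + sigmoid(x), i.e. it has a skip path."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + sigmoid_grad(x)  # the skip path contributes the 1
        x = x + sigmoid(x)
    return grad

plain = plain_chain_gradient(0.5, depth=150)
residual = residual_chain_gradient(0.5, depth=150)
print("plain 150-layer gradient:", plain)
print("residual 150-layer gradient:", residual)
```

In the plain chain every factor is at most 0.25, so the product collapses toward zero with depth; with the skip path every factor is at least 1, so the signal reaching the first layer cannot vanish in this toy setting.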

I didn't say it was easy to do, if anything I was just taking a pot shot at the laziness of using the term as a label for all of modern machine learning.

I've been studying up on deep neural nets recently though and there are a bunch of very cool tricks (and I don't mean that pejoratively, I think real intelligence consists of just such a grab-bag of tricks), which make it effective in a wide range of cases.

... and a method to train the hidden layers efficiently.

That's arguable; most successful applications of deep learning aren't particularly efficient on the training side, which is why they come out of research groups that have Google-sized compute clusters available, typically with big arrays of recent GPUs. I suppose they're efficient in the sense of not requiring more computation than exists on earth.

There have been a bunch of algorithmic advances as well, but some advances have simply been those increases in compute power: there are techniques that didn't really work in 1985, where the technique or a close variant now works, mainly because we didn't have the big arrays of GPUs in 1985 that we do now. As these get scaled up, the algorithmic advances are being found to not even be necessary in some cases. For a period it was thought that autoencoder pretraining was the big algorithmic breakthrough that made it feasible to train many-layered neural networks, but a number of recent applications no longer bother with the autoencoder pretraining.

Well, therein lies the rub, yes.

It has an unusual use of "we" instead of "us" in a few places. "... that helps we to detect features ...". I wonder if the author originally used "you" and "your" then did a search and replace to "we" and "our" because 2nd person sounds too informal? I notice the former two words don't appear anywhere.

Re-reading it with "you" and "your" and I realize that it works just as well. I guess it's really just a fashion which person you use. It used to be 3rd for academic writing. Now it's 1st, but I guess the most informal way is still 2nd.

Could they be ESL?

I don't buy it. Even DL researchers will point to representation learning systems like word2vec, a shallow NN, as examples of the success of DL approaches.

My take: "Deep Learning" is performative (https://en.wikipedia.org/wiki/Performative_utterance). An approach falls under the header of "Deep Learning" when used or developed by someone who identifies as a Deep Learning Researcher.

None of these techniques are good in the context of human-style learning, where a child (or an adult) can learn how to do something new from only a few cases of trial and error, whereas a machine using any modern technique requires an immense number of examples to see first (often millions) before it can do the equivalent task but with less accuracy.

NN and others are totally awesome, but we are so far away from true learning that is normally meant in the human context.

People can't learn things which are truly new from only a few cases of trial and error either. For example, it takes a long time to learn a foreign language or become an expert at chess. Humans mainly learn things better because our training data is usually larger than that of the model for the task at hand.

Suppose you want a computer vision model to start recognizing a particular object (and its subtle variations) in photos, but it has never seen this object before. How many of these objects must the model be trained on before it can generalize and recognize the object again? Now show a child some bizarre object just once, and see if it can find the object in any photos, even if the object is at different angles or subtly varied. The child will certainly do better, off just one training example.

Humans have a semi-hardwired architecture that evolved for classic human needs but is less general, less powerful, and less useful than artificial neural nets.

>less general, less powerful, and less useful than artificial neural nets

yet still much more capable in most cases. Handwriting recognition, for example. How many examples must a NN see to start accurately arriving at valid outputs for new handwritten cases? Now, how many examples must a human see to learn a particular letter before she starts recognizing it with 100% accuracy?

Slightly OT, but what exactly is the difference between machine learning and statistics?

  Machine learning                 Statistics
  network, graphs                  model
  weights                          parameters
  learning                         fitting
  generalization                   test set performance
  supervised learning              regression/classification
  unsupervised learning            density estimation, clustering
  large grant = $1,000,000         large grant = $50,000
  nice place to have a meeting:    nice place to have a meeting:
  Snowbird, Utah, French Alps      Las Vegas in August

You may enjoy Leo Breiman's famous article "Statistical Modeling: The Two Cultures": http://projecteuclid.org/euclid.ss/1009213726

What he calls "algorithmic modeling" is what I see as machine learning style thinking.

Naturally there's a lot of overlap and people in stats and ML often do related work (our statistics department, at CMU, collaborates a lot with the machine learning department), but there's a basic mindset difference.

Really interesting article. After reading it, it seems that statistics is about devising a model (like y=ax+b+epsilon) and then fitting it on the data, while machine learning algorithms are about letting the data make their own model.

I think a statistical view of that dichotomy is "parametric statistics" vs. "non-parametric statistics": https://en.wikipedia.org/wiki/Nonparametric_statistics

I see "machine learning" as exploring computational approaches to non-parametric statistics. The idea of data-driven model-estimation is not outside the scope of "statistics".
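The parametric versus non-parametric contrast discussed above can be sketched in a few lines. The data and function names below are made up for illustration: the first function assumes the fixed form y = a*x + b and estimates its two parameters, while the second assumes no form at all and answers directly from the stored data.

```python
def fit_line(xs, ys):
    """Parametric: ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def nearest_neighbour_predict(xs, ys, query):
    """Non-parametric: answer with the y of the closest training x."""
    index = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return ys[index]

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1, no noise, to keep it readable

a, b = fit_line(xs, ys)
line_prediction = a * 2.4 + b
neighbour_prediction = nearest_neighbour_predict(xs, ys, 2.4)
print(a, b, line_prediction, neighbour_prediction)
```

The fitted line is summarised by two numbers and can be thrown away from the data afterwards; the nearest-neighbour predictor has to keep the whole data set, which is one concrete reading of "letting the data make their own model".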


Can someone say a bit about minimum number of layers and neurons needed for structure to be called artificial neural network? I have seen papers with something like 10 neurons (which is very very tiny but authors claim it does the job), while on the other hand there are Google sized ANNs. Thanks :)

You can call it an artificial neural network (ANN) as soon as you have two or more artificial neurons connected to each other. As simple as that.

A small number of neurons might already solve some problem you're having. The XOR problem can be learned by 4 neurons connected to each other.

When you want raw images or similar as an input and have it be classified into 100 classes (e.g. look up CIFAR-10 or CIFAR-100), you will need an architecture with many more neurons.

After all, ANNs are simply a tool. Depending on the task, that tool might need to be more elaborate. And when you have all those different possible architectures, you want a common way of naming them. Labels such as Deep Learning are simply nomenclature for talking about certain groups of artificial neural networks.

ANN is a description about structure, not size.

XOR (an input layer of 2 neurons, 1 hidden layer of usually 2 neurons, and an output layer of 1 neuron) is a classic toy example of an ANN.

The minimal would probably be something like a NOT gate - input layer with 1 neuron connected directly to the output layer with 1 neuron.
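Here is a minimal sketch of both toy networks mentioned above. The weights and biases are picked by hand rather than learned, and the hard threshold activation is an assumption made to keep the arithmetic obvious.

```python
def step(z):
    """Hard threshold activation: fire (1) if the weighted input is positive."""
    return 1 if z > 0 else 0

def neuron(inputs, weights, bias):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_net(x1, x2):
    """2 inputs -> 2 hidden neurons -> 1 output neuron."""
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)    # fires if at least one input is 1
    h_and = neuron((x1, x2), (1.0, 1.0), -1.5)   # fires only if both inputs are 1
    return neuron((h_or, h_and), (1.0, -1.0), -0.5)  # OR but not AND

def not_net(x):
    """1 input connected directly to 1 output neuron."""
    return neuron((x,), (-1.0,), 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
print([not_net(x) for x in (0, 1)])
```

XOR needs the hidden layer because no single threshold neuron can separate its outputs, whereas NOT can be done with one connection.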

In practice you either use whatever size your problem needs (for small problems) or match the network size to the RAM amount in the video cards you're using.

Thanks for all replies. For some reason I felt that ANN has to be large to perform something useful, and small ~10 neurons ANNs seemed like a toy compared to ~1e6 nets. But, as you pointed out, if it does the job then it is good enough, no matter the count of neurons.

Technically, you can say that even a single perceptron unit is an artificial neural network -- a single-layer ANN. Multi-layer neural networks (such as a multi-layer perceptron) have at least one input layer, one hidden layer, and one output layer.

Trying to explain deep learning in a single FAQ response is silly. I submitted a PR with an improved question and answer: https://github.com/rasbt/python-machine-learning-book/pull/1... (trying not to change the author's original content much).

I think one of the problems is that articles like this (not just about ML, but about advanced CS and science concepts in general) answer a question at a complexity well above that of any person who would really ask it. By that I mean, if you can follow the casual explanation and the complex concepts the reader is assumed to grasp, you probably already know the difference between deep learning and ML, and so you read it out of interest, to compare your view or to add to your core knowledge of the topics.

That is my 0.02. I couldn't follow this article without doing ancillary research/Wikipedia lookups at least.

Hi @vonklaus, thanks for the feedback. This was actually not intended to be an "article" but more like an answer to a targeted question (I am the author of this little write-up). Basically, someone asked me this specific question some time ago (I think via email), and I answered it with this person's background in mind. Then, I generalized it a bit more and added it to the FAQ section in the GitHub repo in hope that it is also helpful to others. It's really more like a quick overview, idea, explanation in contrast to a fully fleshed-out blog article :)

Hey, thanks for the response. I am sure it is quite well done for your target audience. Do you know of any higher-level material that provides a good conceptual outline but only assumes really general knowledge? This is a bit above my comfort zone :). I do really like, I think, Joel Grus's writing. His FizzBuzz with TensorFlow was hilarious. I am unfortunately not particularly gifted in mathematics, so libraries like that will likely be the furthest I'll go into ML/DL. Oh, btw, what's the difference ;). jk


You mean material specific to deep learning that is more general and less math heavy? Hm, that's a good question; the resources I'd reference are all a bit math heavy. However, don't be afraid of diving into TensorFlow. It's really a nice library that takes care of all the tedious mathematical details. E.g., in contrast/addition to NumPy (leaving the computational efficiency part out of the discussion for now), it already implements several optimization algorithms, so you wouldn't have to worry about implementing backpropagation from scratch. Sure, it still requires a bit of linear algebra, but it's really more straightforward than it seems at first glance :). Maybe you'd be interested in Keras (http://keras.io); it's a wrapper around Theano and TensorFlow which provides a really intuitive interface for building neural nets! Haha, btw, I really enjoyed Joel Grus's post ;)

great thanks. ill check out keras.io seems perf. cheers

For a good conceptual outline you should definitely check out neuralnetworksanddeeplearning.com by Michael Nielsen. Then if you want to go deeper, look for Machine Learning course on Coursera by Pedro Domingos.

"We then connect those "receptive fields" (for example of the size of 5x5 pixel) with 1 unit in the next layer"

I felt this could have been expanded on.. still not sure how the sliding windows map into units.

I agree with you, I could surely expand this answer (I am the person who wrote this little answer). Since this was originally just an answer to a question I answered via email (if I remember correctly), I didn't go into too much depth regarding ConvNets, because I just wanted to answer this question "briefly" in this mail. But if there's a demand for that (if it's useful to others) I may end up writing a tutorial on ConvNets one day :). However, there's already an excellent one out there, I highly recommend Dumoulin & Visin's "A guide to convolution arithmetic for deep learning" at https://arxiv.org/abs/1603.07285
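In the meantime, here is a rough sketch of how a sliding 5x5 window maps onto units in the next layer. The 6x6 toy image, the all-ones kernel, and the function name are assumptions for illustration; there is no padding, the stride is 1, and (as in most deep learning libraries) the kernel is not flipped.

```python
def convolve_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists); no padding, stride 1."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for top in range(out_rows):
        row = []
        for left in range(out_cols):
            # One output unit = weighted sum over its own receptive field,
            # i.e. the k_rows x k_cols patch whose corner is (top, left).
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output

image = [[r * 6 + c for c in range(6)] for r in range(6)]  # 6x6 toy "image"
kernel = [[1] * 5 for _ in range(5)]  # 5x5 weights shared by every position
feature_map = convolve_valid(image, kernel)
print(feature_map)  # prints [[350, 375], [500, 525]]
```

A 6x6 input with a 5x5 window leaves four window positions, so the next layer has 2x2 units, and all four reuse the same 25 weights.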

Deep learning is learning-from-learning (stacking machine learning). And it works because of backprop. That's the basic difference.

There is no difference.

No, the hole they dug now is deeper. Better spades.

It will also take 30 years to get out of, as usual.

Winter is coming.

Both are facile bullshit designed to eat up CPU cycles. Which, incidentally is what major corporations are basing their business models on selling. I've had clients come to me and say they want to use machine learning. When I ask what for, it becomes very awkward.

Because five years ago you could Google search for "picture of the priest from fifth element" and get relevant results, instead of pictures of priests, photo frames, and boron. Right?

Bullshit? Erm. Would you elaborate?
