Deep Neural Networks can learn features from essentially raw data. Usual machine learning starts with features engineered manually.
DNNs also learn to predict from the features they learn, so you could say (very roughly) "DNN = usual machine learning + feature learning".
In practice manually engineering features is a time-consuming "guess-and-check" process which benefits from domain expertise. Feature Learning, otoh, is more automatic and benefits from data, computing resources, and optimization algorithms.
Deep learning is not machine learning plus something else. It is a collection of techniques that overcomes the scalability problem of feed-forward neural networks. NNs are very difficult to scale in the number of layers. The standard training method, backpropagation, can't handle many layers because of vanishing gradients and the computational infeasibility brought on by the explosive growth of connections.
NNs are also very difficult to scale with additional classification targets you may require (for example, if you have a classifier for categorising 10 classes, scaling it up to 20 requires a lot of topological changes and qualitative analysis).
Deep learning addresses the scaling over layers with various techniques coupled with hardware acceleration (GPUs). Currently this stands at about 150 layers.
The difference between (deep) neural networks and shallow machine learning is that NNs can learn arbitrary features. Yes, clustering doesn't require feature learning. But it is also super limited in the kinds of features it can learn. Neural nets can learn arbitrary circuits and other types of functions.
I think the point that the parent makes is valid: most of the advantages of deep learning when using a "simple" feed-forward topology are advances related to scaling learning and solving problems encountered with difficult tasks like image recognition, etc.
I do not know enough about neural nets to say if that is all there is to it, but one thing is sure: it's not just about "learning features", although it was shown that the output at every layer abstracts some sort of higher-level features (in the case of image recognition)
So, I'm in no position to prove this, but my intuition is that any machine learning algorithm can be configured in a semi-supervised learning set-up, like deep nets have. You could train a decision forest classifier for instance to learn in an unsupervised manner. An algorithm I'm developing for my MSc dissertation is essentially unsupervised recursive partitioning, a.k.a. decision trees (only, first-order rather than propositional).
Well, possibly not _any_ algorithm. But I get the feeling that many classifiers in particular could be adapted to unsupervised learning with a bit of elbow grease, at which point you could connect them to their own input and, voila, semi-supervised learning.
But like I say, I don't reckon I'll be in a position to prove this any time soon.
But like halflings say, neural nets are not alone in this. Decision Trees can learn any binary decision diagram I guess (they can encode arbitrary disjunctions of conjunctions). I'm pretty sure there are similar results for other algorithms also.
In any case, you can represent a function as a set-theoretical relation and enumerate its parameters, and there you go: learning done with arbitrary precision. That's not what makes neural nets impressive. So what is it?
"Shallow machine learning" is a worrying neologism. "Shallow" and "deep" only apply to neural networks, really. You couldn't very well distinguish between shallow and deep K-NN classifiers, say. Or shallow and deep k-means clustering. I mean, what the hell?
Deep nets used in the way you say are first trained unsupervised to extract features, then the features are used in supervised learning, to learn a mapping from those new features to labels.
You can also do this "by hand" using unsupervised learning techniques like clustering, Principal Component Analysis etc: you make your own features then, and train a classifier afterwards, on the features you extracted in that way.
Deep nets just sort of automate the process.
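That "by hand" pipeline can be sketched in a few lines of numpy; the synthetic data, the single-component PCA, and the midpoint-threshold classifier below are all made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes of 100-dimensional points that differ along one hidden direction.
direction = rng.normal(size=100)
direction /= np.linalg.norm(direction)
X0 = rng.normal(size=(200, 100)) * 0.5                    # class 0
X1 = rng.normal(size=(200, 100)) * 0.5 + 3.0 * direction  # class 1, shifted
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# Step 1, unsupervised feature extraction "by hand":
# project onto the first principal component.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
feature = Xc @ Vt[0]          # one extracted feature per example

# Step 2, supervised learning on the extracted feature:
# a trivial classifier thresholding at the midpoint of the class means.
m0, m1 = feature[y == 0].mean(), feature[y == 1].mean()
if m1 < m0:                   # PC sign is arbitrary; orient class 1 upward
    feature, m0, m1 = -feature, -m0, -m1
threshold = (m0 + m1) / 2
pred = (feature > threshold).astype(int)
accuracy = (pred == y).mean()
```

The point is just the two-stage shape of the pipeline: unsupervised extraction first, then a supervised model on the new features.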
150 layers? It boggles the mind.
How do you even start propagating over 150 layers? Do you assign specific functions / targets to some of the inner layers?
> Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
> The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
And some good answers here:
There's pure performance (ex., in a Kaggle competition [http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it...] or on a standard data set [http://yann.lecun.com/exdb/mnist/], [http://blogs.microsoft.com/next/2015/12/10/microsoft-researc...] ), but that's what makes any ML method better than another.
I think the deeper awesomeness is that DNNs are so good at feature learning from raw data. On vision, NLP, and speech problems [nice overview by Andrew Ng: https://m.youtube.com/watch?v=W15K9PegQt0] DNNs have achieved superior performance to the combination of expertly-engineered features + some usual ML algorithm.
Where a "usual ML" pipeline might look like (1) engineer features through manual effort by studying raw data and the problem domain, (2) apply ML to those features, a new DNN pipeline might look like (1) Apply DNN to raw data.
First off, removing the feature engineering step could be a huge savings in human time spent. Second, there's the potential to get a better answer (!) when you're done.
But more than that, the DNN pipeline holds the promise of more regular, systematic improvement. We (as engineers) don't have to wait for a bright idea about how to construct a feature from the data. Instead, we can focus on (1) collecting more and better data, (2) improving the optimization algorithms, and (3) acquiring more computing resources.
These latter tasks, I suspect, are easier to define and evaluate than the task "discover a new feature".
What does this mean? What is "raw data" and what is a "feature engineered manually"?
All of this work (some of these papers have on the order of ten thousand citations) is now obsolete, because you can start with a random initialization of the weights of a neural network and iteratively improve the weights using backprop for any kind of task. All you need is a measure of improvement that is relatively smooth and differentiable with respect to the network weights. What is surprising is that the circuits and programs within reach of backprop training of fully connected neural networks are actually astonishingly good at what they do. But ultimately, this is maybe not so surprising given that our brains do something similar all the time.
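The loop described here (random initialization, then iterative improvement of a smooth, differentiable loss) can be sketched on a toy problem; the linear fit below stands in for the network purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from y = 2x + 1 plus noise; the "task" is to recover w and b.
x = rng.uniform(-1, 1, size=200)
y = 2 * x + 1 + 0.05 * rng.normal(size=200)

def loss(w, b):
    # Mean squared error: smooth and differentiable in w and b.
    return np.mean((w * x + b - y) ** 2)

# Start from a random initialization of the weights...
w, b = rng.normal(size=2)
loss_start = loss(w, b)

lr = 0.1
for _ in range(500):
    err = w * x + b - y
    # ...and iteratively improve them by following the gradient of the loss.
    w -= lr * np.mean(2 * err * x)
    b -= lr * np.mean(2 * err)
loss_end = loss(w, b)   # w, b end up close to the true (2, 1)
```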
Hardly correct. You can't magically learn any kind of task. You can't add an arbitrary number of layers and hope for backprop to do its magic. It is difficult. Deep learning techniques are what make it somewhat feasible.
SIFT is not obsolete because of NNs. They all have their pros and cons. You have to select the right tool for the job. BTW, SIFT is not an edge detector (that's the Canny edge detector). It describes images using salient features in a scale-invariant manner.
Neural nets learn the distribution and even causal factors in the data. To me it seems that this distribution is often just too complex for it to be robustly captured by something that doesn't learn. Learning causal factors critically depends on learning along the depth of the network of latent variables which is a particularly opaque process, but this is what MLPs seem to do quite canonically (convnet being just a restricted special case of MLPs). I mean discerning causal factors is pretty much canonically the act of accumulating evidence with priors (weighted summation), deciding whether it is sufficient evidence and signaling how much it is (non-linearity).
Case in point: DNNs for image recognition use Sobel edge detectors and other "obsolete" filters to do their magickal magic.
Excellent in-a-nutshell explanation of features and thank you for a definition I really hadn't thought of.
>> this is maybe not so surprising given that our brains do something similar all the time.
That is just so much fantasy, sorry to say. Neural nets (and machine learning in general) learn in ways that are completely unlike humans. They need huge, dense datasets; we can make do with scraps of sparse data. They need huge amounts of computational power, and time; we learn in the blink of an eye. They learn one thing at a time and can't generalise knowledge to even neighbouring domains; we can, oh yes indeed. An infant that can recognise images at the level of AlexNet can at the same time tie its own shoelaces, speak rudimentary language, protect itself from danger, etc. AlexNet can only map images to labels. It does that very well, but it's a one-trick pony, and so are all machine learning algorithms: fearsomely effective but heart-breakingly limited. Human minds are generalisation machines of the highest order and we are nowhere near figuring out how they (we) do it.
Think of it this way: it took a few dozen researchers a few decades to come up with backprop. It took evolution billions of years to come up with a human mind. Which one do you think is the more optimised, and how much hubris does it take to convince oneself that they are pretty much the same in capabilities?
Yay, astonishing performance! I'm totally gonna waste half an hour of my life to read about what awesome badassery convnets are! Because that sounds so objective!
The default example is classification of a circle of one class surrounded by a donut of another. There are two features x_1 and x_2 (this is the "raw data").
One solution to this problem is to use a single layer and a single neuron but engineer features manually. These manually engineered features are x_1*x_2, x_1^2, x_2^2, sin(x_1), and sin(x_2). Here's a link to this model (long url).
This model performs very well at learning to classify the data just by combining these manual features with a single neuron. The problem is a human needs to figure out these features. Try removing some and observe the different performance given different manual features. You'll see how important it is to engineer the correct ones.
Alternatively you can have 2 layers of 4 neurons. In nearly the exact same number of iterations this network also learns to classify the data correctly. This is because the non-linear interactions between neurons are transforming the inputs in the appropriate ways. That is to say, the network is learning to engineer the features itself. Try removing layers/nodes and you'll find that a simpler network will have a harder and harder time at this.
I recommend playing around with the various tradeoffs between manually engineered features and network complexity. The interesting thing you will observe is that in some cases a simpler model with manual features learns much faster than the network. The big issue comes up when we can't simply "see" the problem in 2D, so we have no idea what features may and may not be useful.
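For anyone who can't load the playground, the circle-in-a-donut setup is easy to reproduce. The sketch below (synthetic data with made-up radii) shows that the single engineered feature x_1^2 + x_2^2 makes the classes separable with one threshold, i.e. one neuron's worth of work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner circle (class 0) and surrounding donut (class 1) in raw (x1, x2) space.
n = 500
r0 = rng.uniform(0, 1, n)            # radii of the inner class
r1 = rng.uniform(2, 3, n)            # radii of the donut
t = rng.uniform(0, 2 * np.pi, 2 * n) # random angles
r = np.concatenate([r0, r1])
x1, x2 = r * np.cos(t), r * np.sin(t)
y = np.array([0] * n + [1] * n)

# Raw features x1, x2 are not linearly separable: any line cuts the donut.
# The engineered feature x1^2 + x2^2 (squared radius) is.
feature = x1**2 + x2**2
pred = (feature > 1.5**2).astype(int)  # a single threshold, i.e. one "neuron"
accuracy = (pred == y).mean()
```

With this one feature the problem is trivially linear; without it, the network has to build an equivalent of the squared radius out of its hidden layers.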
Bottom of page 23 and appendix tables around page 31 and 32.
Edit: The stated big "next attempts" for them will be to learn these kinds of features via an algorithm rather than handcoding them, and to learn based on self-play rather than a database of master games.
Theoretically, yes. But the drama is when you have to actually do it. DNNs are not more tractable on their own; they are made feasible by the current set of techniques.
>Is it that it is easier for humans to design the abstractions..
You could argue that activation maps generated in convolutional layers by the filters are feature engineering, as those filters are manually created. These are problem dependent, and we know more about the problem than the algos. That's why feature engineering hasn't gone away completely.
If I have read your comment correctly, I'll say it for you: as an outsider, I read this carefully until I gave up because it was too technical, which happened right at the very top, in the third paragraph:
>Those hidden layers normally have some sort of sigmoid activation function (log-sigmoid or the hyperbolic tangent etc.). For example, think of a log-sigmoid unit in our network as a logistic regression unit that returns continuous values outputs in the range 0-1
All this implies I know all about multi-layer perceptrons - and I don't. I can't follow the instructions to "think of a log-sigmoid unit in our network as a logistic regression unit" because I don't know what those terms mean.
Just as I would give up on a recipe if I got to an instruction I didn't know. For example, if I read:
>Glaze the meringue with a blow torch, or briefly in a hot oven.
Yeah, uh, no... I don't even know what glazing means, or what is "briefly in a hot oven". So I just stop reading. When I'm instructed to do something I can't, I go look at something else unless I'm feeling very adventurous.
This blog post isn't written at my level.
As a last hurrah I'll open a tab and Google https://www.google.com/search?q=what+is+glazing - likewise I tried https://www.google.com/search?q=what+is+a+multilayer+percept... but decided after reading the Wikipedia link that it was too "deep" for me.
I'm about 2/3 done with the homeworks, and I understand this stuff now. I'll never be a data scientist, but I know enough to implement these networks on my own, and to understand blog posts like this. It's a lot of work for one course, much more than I remember from my own undergrad years. I had to revisit Calculus & Linear Algebra too. But if you're genuinely interested in this stuff you can pick it up.
I'm thinking of evolutionary algorithms, various other biologically inspired computation techniques (of which NN's are but one example), more traditional AI techniques such as expert systems, and a whole host of stochastic, non-biologically inspired algorithms.
I'm not super familiar with NN's myself, so I can't say whether the gigantically disproportionate attention from the media and the research community is deserved based on actual superiority in effectiveness of NN's compared to other techniques, or whether they're used and talked about mostly because that's what most people know.
It would be interesting to hear the thoughts of a non-NN AI researcher regarding this.
Geoffrey Hinton and Jürgen Schmidhuber are convinced that neural nets trained with backprop will likely lead to AI with minimal additional fixed-function structure. Schmidhuber's intuition is that recurrent neural networks (RNNs) are capable of universal computation, so given sufficient computational resources the RNN can learn any task, including generalization, goal and action selection and everything else that we associate with intelligence. Hinton's intuition is that RNNs essentially accumulate representations in their hidden state vectors that are very similar to human thoughts ("thought vectors"). Hinton thinks that thought vectors will naturally lead to the kind of reasoning that humans are capable of. In his view the human brain is basically a reinforcement-modulated recurrent network of stacks of autoencoders (needed for unsupervised representation learning).
Pedro Domingos sees such universal learners that may lead to AI in all the major fields of AI: (1) Symbolists have inverse deduction (finding a general rule for a set of observations), (2) Connectionists have backprop and RNNs (see above), (3) Evolutionists have genetic programming (improvement of programs via selection, mutation and cross-over), (4) Bayesians have incremental integration of evidence using Bayes' theorem, e.g. dynamic Bayes networks, and finally (5) Analogizers have kernel machines and algorithms to compare things and create new concepts.
Domingos's hypothesis is basically that only a combination of several of these approaches will lead to AI.
If the Strong Free Will theorem is right, there are distinct limits on the ability of computational devices to simulate our physics.
For one, anything a computing machine does "on its own" will be a function of the past history of the universe (deterministic), whereas the SFWT says that elementary particles are free to act without regard to the entirety of history.
E: even a one word reply conveys so much more than an anonymous downmod.
Furthermore, an AI does not compute a physics model any more than a brain does, so the criticism does not apply.
The point is that the universe can do things that functions cannot. If your only tool is functions (only thing that algorithms can compute), then you will necessarily be unable to handle all possibilities (by which point, I question your intelligence).
But even so, it's confusing to phrase intelligence in terms of non-determinism. It's easy to come up with a non-deterministic answer to an arbitrary question. It's hard to come up with a correct answer to an arbitrary question. If unpredictability is a component of sound reasoning, it's because we humans are so bad at reasoning.
The problem is not that, "my answer must be 'unpredictable'", but, "the actual answer may not be computable" (and so, no algorithm may ever derive it).
Dualism is a very lonely position around HN. You and I may be the only dualists here.
> "Our argument combines the well-known consequence of relativity theory, that the time order of space-like separated events is not absolute, with the EPR paradox discovered by Einstein, Podolsky, and Rosen in 1935, and the Kochen-Specker Paradox of 1967"
So as far as I can tell, it takes for granted the humans' ability to choose the configurations freely. Though suspect in and of itself, that doesn't matter so much to their argument, as it's not really an argument for free will; it's a discussion of how inherent non-determinism is to quantum mechanics.
> "To be precise, we mean that the choice an experimenter makes is not a function of the past."
> "We have supposed that the experimenters’ choices of directions from the Peres configuration are totally free and independent."
> "It is the experimenters’ free will that allows the free and independent choices of x, y, z, and w ."
It is actually, if anything, in favor of no distinction between humans and computers (more precisely, it is not dependent on humans, only on a "free chooser"), as they argue that though the humans can be replaced by pseudo-random number generators, the generators need to be chosen by something with "free choice", so as to escape objections by pedants that the PRNG's path was set at the beginning of time.
> The humans who choose x, y, z, and w may of course be replaced by a computer program containing a pseudo-random number generator.
> "However, as we remark in , free will would still be needed to choose the random number generator, since a determined determinist could maintain that this choice was fixed from the dawn of time."
There is nothing whatsoever in the paper that stops an AI from having whatever ability to choose freely humans have. The way you're using determinism is more akin to precision and reliability: the human brain has tolerances, but it too requires some amount of reliability to function correctly, even if not as much as computers do. In performing its tasks, though the brain is tolerant to noise and stochasticity, it still requires that those tasks happen in a very specific way. Besides, the paper is not an argument for randomness or stochasticity.
> "In the present state of knowledge, it is certainly beyond our capabilities to understand the connection between the free decisions of particles and humans, but the free will of neither of these is accounted for by mere randomness."
>There is nothing whatsoever in the paper that stops an AI from having whatever ability to choose freely humans have.
There is if an AI is dependent on deterministic methods. I agree that AI is not a well-defined term, but all proposals I have seen are algorithms, which are entirely deterministic. This is entirely at odds with the human conception of free choice. An algorithm will always produce the same choice given the same input. Any other behavior is an error.
The SFWT says that observations can be made that cannot be replicated through deterministic means, which would seem (I agree there is a very slight leap in logic here) to rule out any AI from ever being able to understand at least some aspects of our reality (and also reveals them to be simple, logical machines, with no choice).
Computer vision is such a field. The results provided by conventional approaches had been plateauing for years, and then one day Yann LeCun submitted a paper describing the work that he and his team had done in the area with neural networks, which blew away the previous results on image recognition.
Interestingly, this paper was rejected, and it took the CV community a full year to finally turn around and accept NNs as a valid approach to this problem. I believe no one today questions that it's the best methodology we have.
Speech recognition, automatic translation and natural language processing in general are other areas that have benefited immensely from neural networks.
When I read content about 'deep learning' neural nets today I don't see anything especially different from what my (admittedly shallow) understanding of neural nets was back then. So what I'm mainly missing is: what changed? Is it just that advances in compute power mean that problems for which neural nets were impractical have now become practical? Is there something different about the way neural nets are employed in 'deep learning' compared to the neural nets that were discussed in the past?
Deep learning got started when Hinton observed that a certain way of training restricted Boltzmann machines wouldn't get stuck as easily, and hence by pretraining the network as if it were an RBM and then switching to backprop, it wouldn't get stuck in a local minimum as early.
As I understand it, nowadays the best method looks something like a generalization of Newton's Method, wherein the direction you move takes into account the second differential and not just the first differential or direction of steepest descent. You move furthest in the directions that curve the least, and move the least in the directions that are most sharply curved. It turns out that this (plus some other tricks) makes it way easier to follow continuous gradients in big weird parameter spaces, so now it's possible to train deep nets, which live in big weird parameter spaces.
Tl/dr: People figured out how to move better through the parameter space of neural networks, by taking into account the second differential plus some other tricks. So now we can train deeper nets.
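As a rough illustration of "move less in sharply curved directions", here is an Adam-style update, which scales each parameter's step by a running estimate of its gradient magnitude. This is a generic sketch of one such trick, not necessarily the exact method meant above:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Running averages of the gradient and its square; the squared average
    # acts as a cheap per-parameter curvature proxy that shrinks steps
    # along directions where gradients are consistently large.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)          # bias correction for the warm-up phase
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize a badly scaled quadratic, 50*x^2 + 0.5*y^2: very steep in x,
# nearly flat in y. Plain gradient descent struggles with this mismatch;
# the per-parameter scaling equalizes progress along both directions.
theta = np.array([5.0, 5.0])
m = v = np.zeros(2)
for t in range(1, 2001):
    grad = np.array([100.0 * theta[0], 1.0 * theta[1]])
    theta, m, v = adam_step(theta, grad, m, v, t)
# theta ends up close to the minimum at (0, 0)
```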
The rectified linear unit also converges about 75% faster than the classic sigmoid or tanh. The ReLU is F(x) = max(0, x), so the gradient propagated is 1 rather than a more "curvy" value.
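A quick numeric check of that claim (plain Python, illustration only): the sigmoid's slope collapses for large inputs, while the ReLU's gradient is exactly 1 wherever the unit is active:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)          # peaks at 0.25 at x=0, vanishes for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # constant 1 wherever the unit is active

grad_sig = sigmoid_grad(5.0)    # already tiny at a moderate input
grad_relu = relu_grad(5.0)      # exactly 1
```

Stack many sigmoid layers and those sub-0.25 factors multiply together, which is the vanishing-gradient problem in miniature; the ReLU's factor of 1 sidesteps it on active paths.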
In NLP in particular, I'm not so sure about the "immense" benefit of ANNs. LSTM-RNNs are very nice for language modelling but we haven't seen the 10-percentile jump in performance that the image processing community has. Stanford I believe uses ANNs in their dependency parser (to learn transitions) and Google of course uses ANNs everywhere, but the gains against good old HMMs are nowhere near the gains in image processing, and it's not clear what exactly is the difference. Maybe teams have better hardware and more data now. Or rather, they definitely do, so maybe that's the reason for any increases in performance.
With translation in particular (it's a branch of NLP) it's very hard to know what's going on. Companies who sell machine translation products still use different techniques, including rule-based software. The problem is that machine translation needs some way to deal with semantics, else it's a bit shit. Google again: they do use ANNs (apparently; it's hard to tell because they don't say), but is their translation any better than others'? That's almost impossible to know, because there are no good measures of machine translation performance. I'll repeat that: we don't have any good measures of machine translation performance. So anyone can come up with a solution and drum it up as "state-of-the-art", and if their user base is large enough it'll be impossible for anyone (even themselves) to know for sure whether that means anything.
Speech recognition- I don't know much about that, but I think it's similar to NLP. You can do a lot with a lot of data and hardware, if you're lucky to have them and at that point it doesn't really make a real difference to your end-user if you're using an ANN model or an HMM model.
Which may just go to show that Hinton, Schmidhuber and the rest of the connectionist gang are right about the value of more data and faster computers.
I seem to remember this article as having a lot of very interesting comments, including from the chairman of the conference. The exchange was cordial but showed some clear antagonism between the two parties. Somehow, I don't see these comments any more on Google+.
That was four years ago, things are a lot better now.
I believe you'll be hard pressed to find anyone that describes expert systems as "learning" in any way. They're typically considered as the antithesis of machine learning: they're knowledge-driven but you have to enter all the knowledge "by hand". Indeed some machine learning algorithms were proposed as ways to automate the creation of the large rule-bases typical of expert systems (the most prominent such algorithm being Decision Trees).
>> It would be interesting to hear the thoughts of a non-NN AI researcher regarding this.
Researchers are rarely as proscriptive as that. You may find folks who specialise in ANNs to the exclusion of other techniques (but it's true it's a vast field) but you'll probably not find anyone who wouldn't try ANNs a couple of times, and maybe publish a paper or two on them. Same goes for other trendy algorithms also.
There are researchers actively looking at more holistic AI systems that aren't just the core learning part (often called cognitive architectures). For instance, take a look at Soar:
It's under active development, and has been since the 80s. Soar is very much a spiritual successor to GOFAI (or maybe still is GOFAI), but does incorporate statistical machine learning: much of the recent work has been around integrating reinforcement learning with symbolic decision-making.
ACT-R is another cognitive architecture under active development, but is more used for cognitive modeling (i.e. psychology research): http://act-r.psy.cmu.edu/
You can find other related systems by searching for "cognitive architecture", but it is unfortunately a field that attracts a lot of proposals/'designs' for things that never get properly implemented (the above being two notable exceptions).
Lise Getoor, at UC Santa Cruz, is also doing a lot of work with statistical relational learning in particular, which is kind of a cross between GOFAI (the "relational" bit) and nouveau AI (the "statistical" part):
In the UK I know of Kristian Kersting who does statistical relational learning research, at Dortmund:
I think this is not exactly what you were asking for but I think you might find it interesting, particularly Inductive Logic Programming (of which there's oodles in De Raedt's book).
Personally, I started an MSc in AI because I was interested in all the symbolic stuff, and I was a bit disappointed to find there was almost none of that left in the curriculum (at my school anyway, the University of Sussex). I guess I kind of see the point though: knowledge engineering is pretty hard and costly, and learning from data is very attractive. I really don't like the idea of doing it all with statistical black-boxes, though. So relational learning is a kind of halfway house for me, where I can still use mathematical logic principles and not have to eat the statisticians' dust, so to speak. For me relational learning is the spiritual successor of GOFAI that I was looking for, so I think you may find something of interest in the links above.
Edit: there's also the probabilistic logic programming community, forgot about them; here:
That's from the University of Leuven, where De Raedt is from. The name says it all: Horn clauses with probabilistic weighting. So that's not machine learning (or any kind of learning, it's just what it says on the tin). You will find many more here:
Although that's not all _logic_ programming (but there's quite a bit of it).
Deep Learning is absolutely interesting and powerful, but it's not like all the other techniques are all the same, or necessarily less powerful, for that matter. Different techniques have different applications.
It's a marketing term for neural nets that have "a lot" of layers. Neural nets have been around for over 30 years, calling them deep learning was a smart re-branding move.
>Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.
Deep Learning, Nature, Volume 521, Issue 7553, 2015. LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey.
I've been studying up on deep neural nets recently though and there are a bunch of very cool tricks (and I don't mean that pejoratively, I think real intelligence consists of just such a grab-bag of tricks), which make it effective in a wide range of cases.
There have been a bunch of algorithmic advances as well, but some advances have simply been those increases in compute power: there are techniques that didn't really work in 1985, where the technique or a close variant now works, mainly because we didn't have the big arrays of GPUs in 1985 that we do now. As these get scaled up, the algorithmic advances are being found to not even be necessary in some cases. For a period it was thought that autoencoder pretraining was the big algorithmic breakthrough that made it feasible to train many-layered neural networks, but a number of recent applications no longer bother with the autoencoder pretraining.
Re-reading it with "you" and "your" and I realize that it works just as well. I guess it's really just a fashion which person you use. It used to be 3rd for academic writing. Now it's 1st, but I guess the most informal way is still 2nd.
My take: "Deep Learning" is performative (https://en.wikipedia.org/wiki/Performative_utterance). An approach falls under the header of "Deep Learning" when used or developed by someone who identifies as a Deep Learning Researcher.
NN and others are totally awesome, but we are still very far from true learning as normally meant in the human context.
Yet humans are still much more capable in most cases. Handwriting recognition, for example: how many examples must a NN see before it starts accurately producing valid outputs for new handwritten cases? Now, how many examples must a human see to learn a particular letter before she starts recognizing it with 100% accuracy?
Machine learning              | Statistics
------------------------------|-------------------------------
network, graphs               | model
generalization                | test set performance
supervised learning           | regression/classification
unsupervised learning         | density estimation, clustering
large grant = $1,000,000      | large grant = $50,000
nice place to have a meeting: | nice place to have a meeting:
Snowbird, Utah, French Alps   | Las Vegas in August
What he calls "algorithmic modeling" is what I see as machine learning style thinking.
Naturally there's a lot of overlap and people in stats and ML often do related work (our statistics department, at CMU, collaborates a lot with the machine learning department), but there's a basic mindset difference.
I see "machine learning" as exploring computational approaches to non-parametric statistics. The idea of data-driven model-estimation is not outside the scope of "statistics".
A small number of neurons might already solve some problem you're having. The XOR problem can be learned by 4 neurons connected to each other.
When you want to take raw images or similar as input and have them classified into 100 classes (e.g. look up CIFAR-10 or CIFAR-100), you will need an architecture with many more neurons.
After all, ANNs are simply a tool. Depending on the task, that tool might need to be more elaborate. And when you have all those different possible architectures, you want a common way of naming them. Labels such as Deep Learning are simply nomenclature for talking about certain groups of artificial neural networks.
XOR (an input layer of 2 neurons, 1 hidden layer of usually 2 neurons, and an output layer of 1 neuron) is a classic toy example of an ANN.
The minimal network would probably be something like a NOT gate: an input layer with 1 neuron connected directly to an output layer with 1 neuron.
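To make the XOR example above concrete, here is the 2-2-1 net with hand-picked weights: the hidden units compute OR and NAND, and the output unit ANDs them. Step activations are chosen for readability; an actual trained net would use sigmoids and learn the weights via backprop:

```python
def step(z):
    # Hard threshold activation: fires (1) when the weighted sum is positive.
    return 1 if z > 0 else 0

def xor_net(a, b):
    # Hidden layer: two neurons computing OR and NAND of the inputs.
    h_or = step(a + b - 0.5)        # fires if at least one input is 1
    h_nand = step(-a - b + 1.5)     # fires unless both inputs are 1
    # Output neuron: AND of the two hidden activations.
    return step(h_or + h_nand - 1.5)

outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# outputs is [0, 1, 1, 0], the XOR truth table
```

No single neuron can do this (XOR is not linearly separable), which is exactly why the hidden layer is needed.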
In practice you either use whatever size your problem needs (for small problems) or size the network to fit the amount of RAM in the video cards you're using.
That is my 0.02; I couldn't follow this article without doing ancillary research/Wikipedia lookups at least.
I felt this could have been expanded on... still not sure how the sliding windows map onto units.
It will also take 30 years to get out of, as usual.
Winter is coming.