
Machine learning works spectacularly well, but mathematicians aren’t sure why - retupmoc01
https://www.quantamagazine.org/20151203-big-datas-mathematical-mysteries/
======
j2kun
I can chime in for the theoretical computer scientists. Deep learning (in its
simplest form) corresponds to the class of circuits whose gates are linear
threshold functions. Our primary goal with such functions is not to show what
problems can be solved by small circuits using linear threshold gates, but
what problems _cannot_ be solved with such circuits.

Until last week [1], it was an open problem whether every function with n
inputs that is computable in _nondeterministic_ time 2^O(n) could also be
computed with a two-layer circuit using only O(n) gates (that is, a deep net
with just one hidden layer with a linear number of gates). This is an
embarrassing state of knowledge.

Now we know the following slightly less embarrassing thing: there is an
explicit function, computable in linear time, that needs at least n^(3/2) gates
in a two-layer deep net (modulo some log(n) factors), and another function
that needs n^(3/2) gates in a three-layer deep net (with the additional
restriction that the output gate is a majority vote of the previous layer).

This is still a long way away from truly understanding the representational
power of deep nets, but it's the first general progress that has been made
since the early 90's.

[1]: [http://eccc.hpi-web.de/report/2015/188/](http://eccc.hpi-
web.de/report/2015/188/)

~~~
eli_gottlieb
Actually, there's been a few papers showing solid theoretical progress on
understanding what structure in data deep networks are learning to represent.

[https://github.com/gregversteeg/CorEx](https://github.com/gregversteeg/CorEx)

[http://arxiv.org/abs/1406.1222](http://arxiv.org/abs/1406.1222)

[http://arxiv.org/abs/1410.7404](http://arxiv.org/abs/1410.7404)

Disclaimer: I reinvented the idea about a month or two ago, and ran into the
existing papers when googling for joint entropy estimators after making some
interesting graphs with hierarchical probabilistic programs. I only wish I
could've been the first!

~~~
Xcelerate
Those papers are very interesting. I was particularly surprised that the
method exactly recovers the Big-5 components of personality from the results
of the survey questions.

The whole idea of "simplest model that explains most of the data" has always
been very appealing to me. The concept is closely tied to reproducing kernel
Hilbert spaces, which have recently experienced a revival in interest due to
the representer theorem.

~~~
eli_gottlieb
>I was particularly surprised that the method exactly recovers the Big-5
components of personality from the results of the survey questions.

Why? Is that a standard dataset? They could have just cherrypicked tasks.

I also noticed that they seem to only have the actual computation working for
discrete random variables (integer features) right now, which limits its
applicability. They also seem to use mass-function estimators, which again can
work well for discrete data while becoming statistically intractable when
dealing with continuous random variables.

>The whole idea of "simplest model that explains most of the data" has always
been very appealing to me.

In this case it's a bit more like, "The set of latent variables that best
screen off the observables from each other."

------
rdlecler1
One problem with understanding ANNs is that the weight matrix carries a lot of
spurious interactions. Running perturbation analysis, you can see that many of
the interactions do not contribute to the information processing of the
circuit. The same is true for gene regulatory networks. I wrote a paper,
published in Nature's _Systems Biology_, entitled "Survival of the Sparsest:
Gene Networks Are Parsimonious". It's been cited ~130 times. There I present an
algorithm to evolve the connectivity of the network. What you find is that a
network will tend to remove spurious interactions if the system is allowed to
evolve. Because there will be very few network topologies that are both
sparsely connected and functionally equivalent (think of how many ways you
could create a minimally complex 8-bit adder), there is likely only a small
handful of non-isomorphic network topologies for any given function. With
these sparse networks we should get a better grasp on the functional circuits
that drive them. When the networks appear fully connected, at least in each
layer, that circuit does not reveal itself.

~~~
algomanic
Could you explain weight matrix "spurious interactions"?

~~~
rdlecler1
In short, a spurious interaction is a non-zero entry in the weight matrix that
makes no (positive) contribution to the network's function: if you remove that
interaction, the network will perform at least as well as it did with the
interaction.

Most neural network models assume that all neurons in one level are fully
connected to all neurons in the next level. This leads to confusion about how
ANNs work.

From my research I'd argue that MOST interactions in these networks are
spurious. Once you remove them it reveals the (visual) topology (circuit
diagram) that's driving the function of that network.

In the paper I wrote, I evolved gene regulatory networks (which have the same
mathematical representation as ANNs) such that interactions between any two
nodes could be deleted, created (if W_ij = 0), or modified according to
probabilities of deletion, creation, and modification. Given these
probabilities, you can calculate the number of interactions that should result
when the network reaches equilibrium; however, what I found was that the
network evolves fewer interactions than you would expect from the equilibrium
calculation. This suggests that, all things being equal, a network pays a
price for spurious interactions and that these will be removed in an
evolutionary environment. Basically, each interaction needs to pay its way;
otherwise it leads to unnecessary complexity that reduces the fitness of the
network.
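
Roughly, the perturbation idea looks like this in code (a toy sketch of the
general approach, not the paper's actual algorithm; `evaluate` here is a
stand-in for whatever fitness or accuracy measure you care about):

    import numpy as np

    def prune_spurious(weights, evaluate, tol=1e-6):
        # Zero out each non-zero weight in turn and keep the deletion whenever
        # the network performs at least as well without it (i.e. it was spurious).
        W = weights.copy()
        baseline = evaluate(W)
        for i, j in np.argwhere(W != 0):
            saved = W[i, j]
            W[i, j] = 0.0
            score = evaluate(W)
            if score + tol < baseline:   # removal hurt: restore the interaction
                W[i, j] = saved
            else:                        # spurious: keep it pruned
                baseline = score
        return W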

~~~
algomanic
Interesting, thank you.

------
vonklaus
There was a great article called _The Space Doctor's Big Idea_, published in
The New Yorker a few months back. It explained Einstein's theories using only
the 1,000 most commonly used English words.

While some might wave their hands and call this an ELI5, it was quite well
done, and between that and some xkcd comics I was able to learn enough to
understand how orbits work and a few other awesome facts.

I am interested in finding something similar for machine learning. I am not
embarrassed to admit that _Super-Linear Gate and Super-Quadratic Wire Lower
Bounds for Depth-Two and Depth-Three Threshold Circuits_ is just way too
complex for my casual interest, but I would like to know more about how AI/ML
works conceptually. Clearly the discipline underpins a bunch of services I use
regularly, and as computing and techniques improve it will start playing a more
conspicuous role.

[http://www.newyorker.com/tech/elements/the-space-doctors-
big...](http://www.newyorker.com/tech/elements/the-space-doctors-big-idea-
einstein-general-relativity)

~~~
HockeyPlayer
You may enjoy The Master Algorithm: [http://www.amazon.com/Master-Algorithm-
Ultimate-Learning-Mac...](http://www.amazon.com/Master-Algorithm-Ultimate-
Learning-Machine/dp/0465065708)

~~~
deepnet
Google Talk on The Master Algorithm by the author, Pedro Domingos.

[https://www.youtube.com/watch?v=B8J4uefCQMc](https://www.youtube.com/watch?v=B8J4uefCQMc)

------
dave_sullivan
In addition to "We need a cluster for deep learning", the second most popular
mostly untrue thing I hear is "We have no idea how neural networks learn".

However, there are many papers that explore various ways to make a network
learn, and they keep improving on performance, suggesting they're on to
_something_. There are also many papers that discuss possible theoretical
implications of experimental results.

But what does Knowing Why The Network Works mean exactly? "It works because of
universal approximation and gradient descent", but that's not a very
satisfactory answer. Nor is "it works because it starts at a general solution
and, over the course of many iterations, takes many small steps in an ever-
changing direction defined by a gradient approximation, generated by looking at
the difference between an average error and a target output (which should trend
towards 0)".

What would a satisfactory "why" even look like exactly? As in, what form might
it take compared to some other scientific discipline where we _do_ know what's
going on?

Personally, I think the whole thing is a red herring -- people in the field
have _some idea_ of how neural nets work, and there are many disciplines
considered by many to be mature sciences that are far from settled on a grand
theoretical scale.

That said, the theory I'm most interested in is recent attempts to connect a
memory module to neural networks so they can "learn" to store
important/complex/distributed information that can be recalled with high
accuracy later. That will make it easier to do things like ask a neural
network to remember your name, or where you left your keys, or whatever.

~~~
TTPrograms
The "why" is largely: why does gradient descent converge to a good answer
instead of getting stuck in a local minimum?

~~~
modeless
Because our intuition about local minima is wrong in extremely high
dimensional spaces. In two and three dimensions, local minima are common. In a
million dimensions, local minima are rare. The intuitive explanation is that
for a local minimum to exist, the function must be curving up (first
derivative = 0, second derivative >= 0) simultaneously in _every_ dimension.
It makes sense that as you add more dimensions this becomes less and less
likely, and for a million dimensions it's vanishingly unlikely.

What you get instead of local minima are saddle points, where some dimensions
are curving up and some are curving down. Saddle points can also be
problematic for optimization but they can be dealt with using fancy
optimization techniques. For a more rigorous explanation, see
[http://arxiv.org/abs/1406.2572](http://arxiv.org/abs/1406.2572)
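
A quick back-of-the-envelope illustration of that intuition, with random
symmetric matrices standing in for Hessians at critical points (only a
caricature; real loss surfaces don't have Hessians distributed like this):

    import numpy as np

    rng = np.random.default_rng(0)
    trials = 2000
    for d in (2, 5, 10, 20):
        hits = 0
        for _ in range(trials):
            A = rng.normal(size=(d, d))
            H = (A + A.T) / 2                       # random symmetric "Hessian"
            if np.all(np.linalg.eigvalsh(H) >= 0):  # curving up in every direction?
                hits += 1
        print(d, hits / trials)
    # the fraction collapses toward zero as d grows: critical points of such
    # surfaces are overwhelmingly saddles rather than minima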

~~~
bendbro
Wouldn't it still be likely to settle on a local minimum when the deciding
factors for the existence of a local minimum are limited to those that
contribute to the likelihood of a single output category, and not whether all
functions are curving up?

An example I can think of would be an absurd million input neural network,
where one of the inputs only has a pronounced effect on one of the outputs. It
seems like it would be possible for the path of the input to output to be
dragged downhill in the context of all outputs, but uphill in the context of
the single output it affects.

Is what I've described not likely, or am I just completely off base?

~~~
modeless
An interesting question. A million input neural network isn't necessarily
absurd. Images are very high dimensional. One could easily imagine a one
megapixel input to a neural net. But in natural images no single pixel is
indicative of any single image characteristic by itself. I think this isn't a
coincidence but a common characteristic of "natural" high dimensional data, on
which neural nets tend to work well. So yes, I'd say what you've described is
not likely for a large category of "natural" high dimensional data which
probably includes most of the data we care about in the real world.

------
wsxcde
To take the discussion on a slight tangent, how uncommon is this phenomenon?
An applied tool works really well but nobody knows why. I can give another
example from the domain of formal verification: SAT solvers which are at the
core of most modern verification/synthesis tools.

You can download open source SAT solvers today that work spectacularly well on
"real" SAT instances with millions and millions of variables and clauses. Yes,
SAT is the quintessential NP-complete problem and in fact it is pretty easy to
come up with a SAT instance with only a few tens of variables/clauses that
would kill these solvers. But somehow these "hard" instances almost never
occur in practical problems generated in hardware/software
verification/synthesis, as well as a ton of other applications (planning,
constraint programming etc.)

So there must be some characteristic of the problems that we generate in
practice that makes them "easy", but we don't have a good understanding of what
this characteristic is. All we know, for now, is that we've somehow stumbled
upon a near-perfect set of heuristics that work amazingly well on the SAT
instances we encounter in practice.
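
For anyone curious what's inside the box: modern solvers are still
recognizably DPLL-style backtracking search, plus clause learning, restarts
and very good branching heuristics on top. A bare-bones sketch (clauses as
lists of signed integers in the usual DIMACS convention; no clause learning or
heuristics here):

    def simplify(clauses, assignment):
        # Apply a partial assignment; return the reduced clause list,
        # or None if some clause became empty (a conflict).
        out = []
        for clause in clauses:
            new_clause, satisfied = [], False
            for lit in clause:
                if abs(lit) in assignment:
                    if (lit > 0) == assignment[abs(lit)]:
                        satisfied = True
                        break
                else:
                    new_clause.append(lit)
            if satisfied:
                continue
            if not new_clause:
                return None
            out.append(new_clause)
        return out

    def dpll(clauses, assignment=None):
        # Return a satisfying assignment {var: bool}, or None if unsatisfiable.
        assignment = assignment or {}
        clauses = simplify(clauses, assignment)
        if clauses is None:
            return None
        if not clauses:
            return assignment
        for clause in clauses:              # unit propagation: forced assignments
            if len(clause) == 1:
                lit = clause[0]
                return dpll(clauses, {**assignment, abs(lit): lit > 0})
        var = abs(clauses[0][0])            # branch on some unassigned variable
        for value in (True, False):
            result = dpll(clauses, {**assignment, var: value})
            if result is not None:
                return result
        return None

    # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
    print(dpll([[1, 2], [-1, 3], [-2, -3]]))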

~~~
bdhe
> So there must be some characteristic of the problems that we generate in
> practice that makes them "easy", but we don't have a good understanding of
> what this characteristic is. All we know, for now, is that we've somehow
> stumbled upon a near-perfect set of heuristics that work amazingly well on
> the SAT instances we encounter in practice.

There's also the intriguing possibility that "natural instances" however one
might define them arise through some complicated process that ends up sampling
from easy SAT instances. I think of it as analogous to the Benford's law (see:
[https://en.wikipedia.org/wiki/Benford%27s_law#Explanations](https://en.wikipedia.org/wiki/Benford%27s_law#Explanations))
except that we do not have nearly as good explanations for SAT solvers.

It is also highly likely that MOST instances of SAT are easy but that doesn't
preclude it from being NP hard, nor does it stop us from easily constructing
hard instances. (Such things are absolutely necessary for cryptography as an
example.)

~~~
sdenton4
The 'other' possibility is that hard instances have 'measure zero' in the
space of all instances... So when you pick some random problem in the real
world, it ends up being relatively easy to solve with high probability.

My feeling is there's a bit of each; the problem space has a bit of unknowable
quasi-structure, and also tends to miss the perverse corner cases and
counterexamples that we like to think about as mathematicians.

------
cdavid
Note that the writer is the famed Ingrid Daubechies, who was one of the main
drivers behind wavelets and early work on sparse representations.

~~~
fitzwatermellow
Noticed that as well. She still writes with such reverence for the power
mathematics has in unveiling the mysteries of nature. Particularly enjoyed the
moniker "distinguished differential geometer" to describe Calabi! I guess it's
only a matter of time before "wavelet nets" are re-discovered and resurrected
as the "new innovation" in machine learning ;)

~~~
murbard2
Have you seen Mallat's recent work on the scattering transform?

[http://www.di.ens.fr/data/scattering/](http://www.di.ens.fr/data/scattering/)

[https://www.youtube.com/watch?v=wHhYvtnY2zI](https://www.youtube.com/watch?v=wHhYvtnY2zI)

------
algomanic
The fact that the class of NN functions is universal is almost vacuous: the
basic idea is that if you allow a "neuron" for every point in your input space,
then you can mimic any function you like (i.e. each neuron handles a single
input point). Obviously such representations become arbitrarily large.

Which almost immediately suggests a solution to why NN learning works: the
processes that produce the types of datasets humans are interested in are
themselves (effectively) poly-sized networks, and you can probably say things
like "the probability of recovering a poly-size function from poly-size
samples is high".
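
To make the "one neuron per point" caricature concrete: a single enormously
wide layer of narrow Gaussian bumps, one centred on each training input, just
memorises the data. Nothing below is trained; it's the lookup-table
construction behind the universality argument:

    import numpy as np

    def memorise(x_train, y_train, width=1e-5):
        # one "neuron" (narrow bump) per training point; the output layer simply
        # reads off the stored target of whichever bump the query lands on
        def predict(x):
            A = np.exp(-((x[:, None] - x_train[None, :]) ** 2) / width)
            A /= A.sum(axis=1, keepdims=True)
            return A @ y_train
        return predict

    x = np.linspace(0, 1, 50)
    y = np.sin(12 * x)                      # any target function you like
    net = memorise(x, y)
    print(np.max(np.abs(net(x) - y)))       # ~0: the "network" reproduces it exactly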

~~~
akavi
My gut response was, I think, a less rigorous version of this: Humans care
about tasks our own brains are good at. If NNs are in fact decent
approximations of the way our brains work, then it makes sense they will be
anomalously effective on this subset of the "actual" problem space.

~~~
sawwit
You seem to be saying that ANNs are good at pattern recognition because BNNs
are. That just defers it to the question why BNNs are good at it.

OTOH, the parent comment suggested we should draw our attention to the
processes that produce the data and find correspondences to how NNs decompose
it.

~~~
akavi
No, I'm saying that what we perceive as "being good at pattern recognition" is
_defined_ by the sort of pattern recognition BNNs are good at.

~~~
eli_gottlieb
Which still doesn't explain precisely what sort of statistical inference BNNs
are performing.

------
abrichr
> In practice, neural networks use only two or three layers...

The famous AlexNet [1] that blew away the ImageNet competition in 2012
contained 8 layers; more recent networks have even more.

[1]
[http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf](http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf)

~~~
ilzmastr
Yeah, it's more like the opposite: in theory you only need one hidden layer to
get the same capacity in the network as with more hidden layers, but in
practice (since autoencoders came along) many hidden layers gradually
decreasing in size make it easier to get from local energy minima to globally
low energy minima.

~~~
crazypyro
Yes, people took the universal approximation theorem[0] as evidence that they
only need one layer, but there is zero guarantee of efficiency. A single hidden
layer may mean many orders of magnitude more neurons than an n-hidden-layer
network (n > 1), which could cause an unrealistic training time. Having
multiple layers can reduce this training time with an optimal structure.

[0][https://en.wikipedia.org/wiki/Universal_approximation_theore...](https://en.wikipedia.org/wiki/Universal_approximation_theorem)

~~~
argonaut
Nobody ever took that theorem seriously. Deep nets were around since the 90s.

~~~
crazypyro
I've seen people with passing knowledge of NNs throw it around sometimes, and
it's also referred to in a lot of the literature as one reason for needing to
parallelize neural networks (which I've been reading a lot about, due to a
project), even if it's not hugely important.

~~~
argonaut
It's not hugely important because it tells us little of practical use. I mean,
k-nearest neighbors, given infinite data, can model any function as well. In
practice, single-layer neural nets are not very useful and don't do a good job
of learning feature representations.

------
jacquesm
I love her definition of big data.

Another bit that stood out for me:

> You’re awarded an extremely generous grant that allows you to give 200,000
> people a 500-question personality test, with answers that vary on a scale
> from one to 10.

Dating sites like OkCupid should have such data.

------
amelius
This is the reason I am not so much interested in this field. It is too much
"let's try this and see what happens" rather than really engineering a
solution.

I guess more people feel that way, and I guess that is a good thing, otherwise
everybody would now be working on AI, since it is such a promising field.

~~~
tjr
Some AI pioneers seem to feel that rule-based symbolic solutions still have a
place, e.g.

[http://dustycloud.org/blog/sussman-on-
ai/](http://dustycloud.org/blog/sussman-on-ai/)

If we are depending on AI systems to make important decisions, possibly life-
or-death decisions, shouldn't we understand how they come to the conclusions
that they do?

I cannot deny the tremendous advances in neural networks recently, but I find
Sussman's point compelling.

------
princeb
there are a number of features in machine learning that seem to have
counterparts in statistics - stuff like M-estimators as kernels (I think -
correct me if wrong), information criteria for feature selection - that have
existed for a long time.

maybe it's the combination of sheer computing power and availability of data
that allows better models. maybe we've never looked at algorithmically
generated models because we also want a narrative (commonsensical explanation)
for the model, not just a matter of algorithmically finding correlations
between jelly beans and acne, say.

part of me thinks, ok the machines found something. now can we actually use
that to understand the world, rather than build more recommendation
algorithms? (haha). I'm not sure what an advisor will say about a doc student
who says, let's just throw reams of data at a machine until we find a
meaningful correlation, and then let's reason from the correlations (my guess
is 'no', that's not really the scientific method is it).

I'm hopeful, and I don't think the answer to the question will come from
academia.

~~~
nl
_I 'm not sure what an advisor will say about a doc student who says, let's
just throw reams of data at a machine until we find a meaningful correlation,
and then let's reason from the correlations_

Currently they are saying _yes, please_ , and desperately trying to source
ever bigger volumes of data.

Norvig et al.'s article "The Unreasonable Effectiveness of Data"[1] has been
cited over 400 times, which is a lot for something with no formula.

[1]
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf)

------
tacos
As for "spectacularly well" \-- well, the person behind the curtain wiggling
the levers retains a lot of influence. Garbage in, garbage out, remember?

I'm reminded of the time Google Translate autodetected "Gesundheit" as
Spanish. And Gmail kindly offering to translate "hahaha" from Portuguese,
putting an ad for coconuts next to it.

Data science is improving, but you might be surprised how slowly. Especially
in the consumer space, because the metrics on effectiveness are so warped.

Voice recognition of numbers only, over a phone connection, can be below 40%
accuracy! Much of the perceived success of these systems comes not from the
core machine algorithm, but from clever human tweaks around it. Also end-users
who are happy with what they get, not quite realizing how goofy it all is if
they were to get a glimpse of the raw data.

~~~
nl
I'm not sure how old you are, but what exactly are your expectations when you
state _Data science is improving, but you might be surprised how slowly._

We have machines that can categorise pictures better than humans. In 2011 that
seemed completely impossible.

~~~
tacos
Age is unrelated to wisdom and I'm talking about the full experience.

A Google Image search for "Wonder Wheel" (the famous Coney Island Ferris
Wheel) shows this spoked diagram within the first page of results:

[http://searchengineland.com/figz/wp-
content/seloads/2011/07/...](http://searchengineland.com/figz/wp-
content/seloads/2011/07/google-wonder-wheel.png)

Also this year, Google Photos classified black people as gorillas.

[http://www.usatoday.com/story/tech/2015/07/01/google-
apologi...](http://www.usatoday.com/story/tech/2015/07/01/google-apologizes-
after-photos-identify-black-people-as-gorillas/29567465/)

Consumers are rarely exposed to the raw machine output -- for good reason. My
experience building these sorts of systems is that they're pretty goofy and
they fail unexpectedly. After chasing audio and video problems using custom
software as well as three major toolkits, I find myself hyper-aware of the
flaws in public systems.

Also common sense dictates that it's more about the data scientist on the way
in and the UX person on the way out than the machine.

~~~
gnaritas
> Age is unrelated to wisdom and I'm talking about the full experience.

That's not true; age and wisdom are quite related, just not directly causal.
They are correlated. Older people are generally wiser, it's just not a
guarantee, nor is it impossible for young people to be wise, but it's
certainly far less common. With age comes experience and with experience,
wisdom has fertile ground to grow, though it doesn't always.

~~~
tacos
Notwithstanding the fact that "age is unrelated to wisdom" was a gentle
contextual device used to dismiss the commenter's odd assumption about my age,
a search for "age is unrelated to wisdom" turns up actual research papers.

"neither general nor personal wisdom have a positive linear relationship to
age... age is not only not related to personal wisdom (as is the case for
general wisdom) but even negatively related..."

[Mickler & Staudinger (2008), Sneed & Whitbourne (2003)]

Source: The Scientific Study of Personal Wisdom: From Contemplative Traditions
to Neuroscience

[http://www.amazon.com/Scientific-Study-Personal-Wisdom-
Conte...](http://www.amazon.com/Scientific-Study-Personal-Wisdom-
Contemplative-ebook/dp/B00HRNLC5U)

~~~
gnaritas
If you actually searched for that, then you know there are far more studies
showing the opposite. You can find a research paper supporting just about any
position, that doesn't make it the majority opinion of the field. You actually
had to ignore a lot of stuff saying the opposite, to find something saying age
and wisdom weren't related. That's OK, you're young, you'll get wiser with
age.

------
shas3
I think another interesting analogy (in terms of field-development trajectory)
to deep learning is the whole idea of l1-regularized regression or LASSO or
sparsity priors or sparse signal processing ([1] goes by many names depending
on which field you work in). The whole idea is that by penalizing 'dense'
solutions to a regression problem, you can 'promote' 'sparse' solutions like
the ones that occur in many many applications. This had been used by various
communities with some theoretical justification for years, at least since the
1970s. However, the real theoretical breakthroughs framing this problem in
something close to the rigor and usefulness of Shannon's sampling theorem,
came in 2004 from two papers independently: David Donoho; and Emmanuel Candes,
Justin Romberg, Terence Tao. Theoretical CS community also got close to the
answer in their work in the late 1990s and early 2000s in many papers on
random projections and sketching. There are many connections between deep
learning and ideas in the broad area of sparse regression. But one
epistemological point of intersection is how the theoretical results pertain to
asymptotically large cases, yet in applications the methods work very well even
on much smaller systems.

[1]
[https://en.wikipedia.org/wiki/Compressed_sensing](https://en.wikipedia.org/wiki/Compressed_sensing)
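
A tiny numerical illustration of the l1 idea, using plain ISTA (gradient steps
followed by soft-thresholding) on toy data; a sketch of the generic method,
not any particular paper's setup:

    import numpy as np

    def soft_threshold(v, t):
        # proximal operator of the l1 norm
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=500):
        # minimise 0.5*||y - Xw||^2 + lam*||w||_1 by iterative soft-thresholding
        L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = X.T @ (X @ w - y)
            w = soft_threshold(w - grad / L, lam / L)
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))
    w_true = np.zeros(50)
    w_true[[3, 17, 41]] = [2.0, -1.5, 1.0]     # only 3 of 50 coefficients matter
    y = X @ w_true + 0.1 * rng.normal(size=100)
    w_hat = lasso_ista(X, y, lam=5.0)
    print(np.nonzero(np.abs(w_hat) > 0.1)[0])  # typically recovers [ 3 17 41 ]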

------
the_duck
>In the last 15 years or so, researchers have created a number of tools to
probe the geometry of these hidden structures. For example, you might build a
model of the surface by first zooming in at many different points. At each
point, you would place a drop of virtual ink on the surface and watch how it
spread out.

It sounds to me like this 'ink drop' is a metaphor to explain some state-of-
the-art dimensionality reduction technique. Does anyone know the common name
of this technique?

~~~
chestervonwinch
There are diffusion-based methods for unsupervised learning based on the graph
Laplacian, or heat kernel [0]. Maybe this is what she was referring to? I
would guess so, because Coifman (who is also closely tied to wavelets [1])
introduced the diffusion maps [2] method. On the other hand, I have not seen
any unsupervised method based on, say, Navier-Stokes, which is the only other
thing that comes to mind regarding ink drops diffusing.

[0]:
[http://www.mit.edu/~9.520/spring11/slides/class20_misha.pdf](http://www.mit.edu/~9.520/spring11/slides/class20_misha.pdf)

[1]:
[https://en.wikipedia.org/wiki/Coiflet](https://en.wikipedia.org/wiki/Coiflet)

[2]:
[https://en.wikipedia.org/wiki/Diffusion_map](https://en.wikipedia.org/wiki/Diffusion_map)
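
The core of diffusion maps fits in a few lines, for anyone who wants the "ink
drop" made literal. A bare sketch: heat-kernel affinities, row-normalised into
a Markov (diffusion) operator, then its leading eigenvectors; the full method
also corrects for sampling density, which is omitted here:

    import numpy as np

    def diffusion_map(X, eps=1.0, n_components=2, t=1):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        K = np.exp(-sq / eps)                                 # heat-kernel affinities
        P = K / K.sum(axis=1, keepdims=True)                  # row-stochastic diffusion operator
        vals, vecs = np.linalg.eig(P)                         # P is not symmetric; keep real parts
        idx = np.argsort(-vals.real)
        vals, vecs = vals.real[idx], vecs.real[:, idx]
        # drop the trivial constant eigenvector; scale by eigenvalue^t ("diffusion time")
        return vecs[:, 1:n_components + 1] * vals[1:n_components + 1] ** t

    # e.g. points on a noisy circle embed onto a clean ring
    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 300)
    X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(300, 2))
    print(diffusion_map(X, eps=0.1).shape)                    # (300, 2)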

------
joakleaf
Neural networks are advanced curve fitting -- that's why.

It isn't really all that magical or mysterious.

~~~
nl
It really kinda is.

In the Ilya Sutskever _Talking Machines_ podcast[1] he describes exactly how
it is "magic". He talks about how there is no theoretical basis to think that
a deep neural network _should_ converge, and prior to around 2006 the accepted
wisdom was that networks deep enough to outperform other methods of machine
learning were useless because they couldn't be trained.

And then they discovered how to initialize the random starting values in such
a way that they _do_ converge - for reasons no one really understands.

I've heard a lot of people claim there is no magic, but I tend to think
Sutskever[2] knows a little bit more than most people about deep neural
networks.

[1] [http://www.thetalkingmachines.com/blog/2015/1/15/machine-
lea...](http://www.thetalkingmachines.com/blog/2015/1/15/machine-learning-and-
magical-thinking)

[2] [http://www.cs.toronto.edu/~ilya/](http://www.cs.toronto.edu/~ilya/)

~~~
dasboth
I was just listening to that podcast yesterday and loved the way he presented
this information. I know next to nothing about NNs but I could pretty much
follow everything that he said (from an intuitive sense, anyway, if not a
technical one).

This "magic" of converging networks reminds me of how ensemble methods, such
as random forests, are effective but people aren't sure why. There's certainly
(AFAIK) no theoretical grounds to say that a bunch of random decision trees
should yield universally good results.

~~~
mziel
There are. It's all about variance reduction. See Breiman's paper on bagging.
There's nothing special about random forests, though (apart from the fact that
a decision tree is a good base learner, because of the nonlinearities, for
example); you can use ensemble learning with any "basic" learner.
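
A quick toy demonstration of the variance-reduction effect, using
1-nearest-neighbour regression as a deliberately high-variance base learner
and averaging it over bootstrap resamples (nothing random-forest-specific,
just bagging):

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.sort(rng.uniform(0, 1, 60))
    y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=60)
    x_test = np.linspace(0, 1, 200)
    y_true = np.sin(2 * np.pi * x_test)

    def one_nn(xq, xs, ys):
        # high-variance base learner: 1-nearest-neighbour regression
        return ys[np.abs(xq[:, None] - xs[None, :]).argmin(axis=1)]

    single = one_nn(x_test, x_train, y_train)

    preds = []                       # bagging: average over bootstrap resamples
    for _ in range(200):
        idx = rng.integers(0, len(x_train), len(x_train))
        preds.append(one_nn(x_test, x_train[idx], y_train[idx]))
    bagged = np.mean(preds, axis=0)

    print("single 1-NN MSE:", np.mean((single - y_true) ** 2))
    print("bagged 1-NN MSE:", np.mean((bagged - y_true) ** 2))
    # the ensemble is typically noticeably better: that's the variance reduction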

~~~
dasboth
I read Breiman's random forest paper, but I'll give his others a read too.
Does the variance reduction effect apply to any ensemble method then?

------
powera
My super-hand-waving explanation of why machine learning works:

MNIST is a handwritten digit database. Each 784-pixel image (28x28)
corresponds to a digit from 0 to 9. As a pure mathematical construction
(treating each pixel as binary), there are at most 2^784 possible inputs, and a
small number of possible outputs.

So if you have 784 completely different ways of analyzing the image, and you
combine them in the right ways, you will get roughly an approximation of an
answer. This is a tautology if the 784 ways are "the value of each pixel" and
the combinations are "magic", but if you have more "intelligent" combinations
you should have combinations that are less magic. And in this case, since
humans can generally determine the digit value from a seven-segment display, it
seems reasonable that there exists some way to have "intelligent" combinations
such that they combine to form a neural network that solves the problem of
digit identification.

And that (still hand-wavy) explanation can also plausibly capture how a human
would describe identifying a number. If I ask you "why is this a 1 and not a
3", you might say "because it's straight" or "because it's narrow" or "because
it doesn't have a point in the middle" or any number of other descriptions of
the object. So you can envision a 2-layer network where the middle layer
calculates this (and due to the structure of images, in practice it might work
better as a 3- or 4-layer network; but the important point is that the search
algorithm doesn't rely on it, or on you, knowing what these middle layers are
ahead of time).

Which only leaves the question of how "neural network learning" is supposed to
find this. And there are a few heuristics which combine to be (in practice) a
very effective search. We have back-propagation (which is much easier with
automatic differentiation), so we can adjust the entirety of the network based
on the output (and it's an axiom that if you have a lot of things, they will
be similar in the ways they are the same, and different [hopefully in some
regular way] in the ways they are not the same). We have dropout, where we
randomly disable connections during training so the network doesn't come to
rely on any single one. We can add new connections to see if they are relevant.
We can do any number of hill-climbing algorithms on the output of the fitness
function. And, as a valid search algorithm, it tends to converge to a valid
result.

Obviously none of this is at all rigorous, but if you know the math here
enough I don't think you're asking the questions in this article.
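
For concreteness, here's roughly what that machinery looks like end to end: a
tiny two-layer net (784 pixel inputs, one hidden layer, 10 digit outputs) with
plain backpropagation. The X/Y arrays below are random stand-ins for MNIST
batches, just to show that the mechanics run; this is a sketch, not a tuned
model:

    import numpy as np

    rng = np.random.default_rng(0)

    def train_step(X, Y, W1, b1, W2, b2, lr=0.1):
        # forward pass
        H = np.maximum(0, X @ W1 + b1)                  # hidden layer, ReLU
        logits = H @ W2 + b2
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)               # softmax over the 10 digits
        # backward pass (cross-entropy loss), i.e. back-propagation
        dlogits = (P - Y) / len(X)
        dW2, db2 = H.T @ dlogits, dlogits.sum(axis=0)
        dH = dlogits @ W2.T
        dH[H <= 0] = 0                                  # ReLU gradient
        dW1, db1 = X.T @ dH, dH.sum(axis=0)
        # one small step in the direction that reduces the error
        return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

    X = rng.random((32, 784))                           # fake "pixels"
    Y = np.eye(10)[rng.integers(0, 10, 32)]             # fake one-hot digit labels
    W1, b1 = rng.normal(0, 0.01, (784, 128)), np.zeros(128)
    W2, b2 = rng.normal(0, 0.01, (128, 10)), np.zeros(10)
    W1, b1, W2, b2 = train_step(X, Y, W1, b1, W2, b2)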

~~~
gone35
_So if you have 784 completely different ways of analyzing the image[...]_

Interesting intuitions; but if by "ways of analyzing the image" you mean
functions, there are _an awful lot_ more than 784...

~~~
powera
I mean, there are zillions of different functions, but no more than 784
linearly independent ones. (I mean, not technically, since the functions
aren't linear, but at any specific point on the functions, the linear
approximations span no more than 784 independent directions.)

------
zepolen
Thought about how cool it would be to use machine learning to generate machine
learning algorithms and how that could be the basis of artificial "life".

Maybe start with a minimal implementation 'bootstrap protozoa' that can evolve
to highly complicated forms.

Then realised that since machine learning processes take a long time, that
would take forever, making auto-evolving computer life forms a long way away.

Although if it were possible, in the same way we became sentient based on very
simple inputs, that could also mean computers, with a sufficiently complex
process, could too...

~~~
obsim
An LSTM recurrent neural network was used to learn a learning algorithm:
[http://www.bioinf.jku.at/publications/older/1504.pdf](http://www.bioinf.jku.at/publications/older/1504.pdf)

------
tianlins
it seems like a general phenomenon that our scientific understanding
decreases with the complexity of the system. in classical physics, we can
have very clean equations capturing the dynamics of a system. this is not true
any more in chemistry/biology/social science. the modern deep learning
pipeline, compared to its shallow counterparts, gains more complexity, so it's
not surprising at all to me that we are not able to understand it well at the
moment.

------
cschmidt
Can anyone give me a pointer on what techniques the author is talking about
for unsupervised learning at the end of the article?

~~~
StephenAshmore
Sure. These techniques are called Dimensionality Reduction or manifold
learning. The wikipedia article actually has a good description:
[https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduc...](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction)
Some of the more interesting ones: locally linear embedding, isomap, MLLE,
Maximum Variance Unfolding.

------
jheriko
maybe i am missing something here... i really don't know.

it might not be rigorous, but for standard, deep neural networks it's
intuitively obvious enough that you can reinvent the idea from scratch in your
bedroom, in a time before the internet, just by getting a vague description of
the idea and thinking about how it /could/ work.

proving convergence may be difficult, but it's not particularly challenging to
see why it happens imo. :/

a lot of the things the article points at are utterly irrelevant to the
subject. e.g. sigmoid functions, depth of networks.

i think there are some bold baseless claims here - for instance, linearly
interpolating data points is pretty simple as a way to approximate a function,
and it's not hard to see how neurons can provide this with linear activation
functions and some very naive back propagation (e.g. evenly dividing error
and correcting weights towards the correct result).

if this didn't converge /that/ would be surprising.

------
deepnet
Deep Neural Networks learn layers of linear functions - represented by
adjustable 'learning' weighted connections.

Each layer of neuron activations is a new representation of the data -
produced by the weighted connections.

Function compositions of linear functions are still only a linear function.

Each neuron in a layer sums its weighted inputs and then passes the sum through
a non-linear activation function; it is this non-linearity that makes composing
layers worthwhile - function composition.

This is famously expressed in Minsky and Papert's 1969 _Perceptrons_: a single
layer of network weights is incapable of learning XOR.
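
To see the XOR point concretely: a single layer of weights cannot fit it, but
one small hidden layer with a non-linear activation learns it easily. A toy
sketch (depending on the random seed it may need more iterations):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)     # XOR targets

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)       # one hidden layer of 4 units
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    for _ in range(10000):
        H = sigmoid(X @ W1 + b1)                        # the crucial non-linearity
        out = sigmoid(H @ W2 + b2)
        d_out = (out - y) * out * (1 - out)             # backprop, squared error
        d_H = (d_out @ W2.T) * H * (1 - H)
        W2 -= H.T @ d_out; b2 -= d_out.sum(axis=0)
        W1 -= X.T @ d_H;   b1 -= d_H.sum(axis=0)

    print(out.round(2))                                 # typically close to [[0],[1],[1],[0]]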

One analysis is that neural nets transform the shape of the data, which lies on
a twisty high-dimensional manifold, until a single linear cut produces the
desired distinction.
[http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)

A single-hidden-layer network is a universal approximator, and a net can be
trained or distilled from another net - but deep nets are overwhelmingly better
at the initial learning and discovery.

Neural nets have been related to Kadanoff's spin glasses, suggesting learning
is akin to phase transitions or sandpile collapse, where small local changes
can produce profound global changes.
[https://charlesmartin14.wordpress.com/2015/04/01/why-deep-
le...](https://charlesmartin14.wordpress.com/2015/04/01/why-deep-learning-
works-ii-the-renormalization-group/)

Generally, when training nets, the network initially learns a lot very
quickly; most of the error vanishes at the start.

Word2Vec demonstrates that nets learn very powerful representations; the fact
that word2vec vector algebra is semantic points to unexpectedly powerful
representations.

Similar semantic vector math can be performed on images and the same vectors
can translate modalities, e.g. text to images.
[https://github.com/Newmu/dcgan_code#arithmetic-on-
faces](https://github.com/Newmu/dcgan_code#arithmetic-on-faces)

Natural Evolution produces efficient solutions.

I propose the successes of deep learning so far are partially explicable
because they are working within human culture and perception, relearning our
very efficient solutions - akin to distilling a deepnet into a shallow one.

This hypothesis will be tested if embodied deep neural nets using
reinforcement learning discover their own efficient solutions to performing
tasks in the real world - robotics.

IMHO Pieter Abbeel's deep net, learning to robustly output robot motor torques
directly from camera pixels, will show whether embodied deep nets can do
discovery rather than relearning what we know.
[http://www.cs.berkeley.edu/~pabbeel/research_rll.html](http://www.cs.berkeley.edu/~pabbeel/research_rll.html)

------
sonabinu
It would be interesting to see what the purists arrive at

~~~
stdbrouw
Well, a good example would be linear regression. We know pretty much all there
is to know about its asymptotic behavior, and can prove that its results are
"maximum likelihood estimates."

Similarly, Bayesian statistics (a.k.a. probabilistic programming, graphical
models, Bayes networks etc.) has a rock solid mathematical foundation – both
the statistics behind it and the sampling algorithms (MCMC).

Of course, many mathematical statistical methods started off as hacks too. The
invention of the L2 loss function, I imagine, must've gone a little like
"alright, let's try to find the pattern here, let's draw the line through this
series of points that minimizes the sum of the distance of each point to the
line... hm, I don't like how this looks, let's try squaring the distances
instead and then minimize that sum so we punish large deviations from our line
a bit more... oh, that looks better."

~~~
yummyfajitas
Actually the squaring was for a more practical reason. Let's try to minimize
the sum of the distances; oops, |y-ax| is not differentiable so we can't use
calculus; let's square it instead.

We later realized this worked so well because it's the maximum likelihood of
the model y = ax + b + e, where e is a Gaussian error term. But that came much
later.
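
Spelled out (assuming independent Gaussian errors with known, fixed variance),
the connection is just:

    y_i = a x_i + b + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)

    \log L(a, b) = -\frac{n}{2}\log(2\pi\sigma^2)
                   - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - a x_i - b)^2

so maximizing the likelihood over (a, b) is exactly minimizing the sum of
squared residuals.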

~~~
tel
In grad school I asked that so many times trying to figure out why it was the
"right" solution and this historical cheat caused so much cognitive
dissonance.

Then again, on the other hand, "calculus works nicely" and "it's simple"
probably are good spooky directors toward useful models anyway.

~~~
senekerim
It's not really a cheat, as it produces the MLE (i.e. most likely) estimate
under certain assumptions (e.g. errors are normally distributed, which occurs
naturally if the errors are sums of many unrelated measurement errors).

~~~
tel
Oh, I'm aware. It's just hard to always justify it and especially hard to do
so in historical context.

~~~
alexmlamb
I don't think that this is quite the right explanation.

|y - wx| isn't differentiable but it is subdifferentiable. So the derivative
is defined at every point except where y = wx, and in that case you're more or
less fine if you just pretend that the derivative is zero at that point.

One reason why squaring is preferred is that the derivative has a nice closed
form, which can be used to find closed form solutions.

Another is that people like to fit their models with gradient descent, and
gradient descent has better guarantees for strongly convex loss functions. It
also works better in practice. Intuitively: if you try to minimize x^2 by
gradient descent, you have the largest gradients when you're far from the
minimum. If you try to minimize |x|, then your gradient is always either +1 or
-1, making it harder to converge around the minimum. See the Huber loss for a
smooth relaxation of the absolute value function.
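
The last point is easy to see numerically; a toy comparison with a fixed step
size (nothing tuned, just the shapes of the two gradients):

    import numpy as np

    def descend(grad, x0=2.05, lr=0.1, steps=50):
        x, trace = x0, []
        for _ in range(steps):
            x -= lr * grad(x)
            trace.append(x)
        return np.array(trace)

    sq = descend(lambda x: 2 * x)   # gradient of x^2: shrinks as you near the minimum
    ab = descend(np.sign)           # a subgradient of |x|: always +/-1 (0 at exactly 0)
    print(sq[-5:])                  # smoothly approaching 0
    print(ab[-5:])                  # bounces between about +0.05 and -0.05 forever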

------
DyslexicAtheist
whatever you do, __don't__ try to figure out why it works [0].

[0]
[https://en.wiktionary.org/wiki/schroedinbug](https://en.wiktionary.org/wiki/schroedinbug)

------
calewis
Skynet is coming.

------
darkhorn
As far as I know, no mathematician has been able to graduate from the
statistics department as a second major or (second) minor (whatever you call it
in English) - I mean at my university. I hope you get the idea: it looks like
statistics is hard for mathematicians, at least for most of them. Thus, I
don't understand why they discuss this statistics topic with mathematicians
but not with statisticians. Yeah, maybe they should ask a statistician! Duh!

