Hacker News new | comments | ask | show | jobs | submit login
Machine learning works spectacularly well, but mathematicians aren’t sure why (quantamagazine.org)
403 points by retupmoc01 on Dec 4, 2015 | hide | past | web | favorite | 132 comments

I can chime in for the theoretical computer scientists. Deep learning (in its simplest form) corresponds to the class of circuits whose gates are linear threshold functions. Our primary goal with such functions is not to show what problems can be solved by small circuits using linear threshold gates, but what problems cannot be solved with such circuits.

Until last week [1], it was an open problem whether every function with n inputs that is computable in nondeterministic time 2^O(n) could also be computed with a two-layer circuit using only O(n) gates (that is, a deep net with just one hidden layer with a linear number of gates). This is an embarrassing state of knowledge.

Now we know the following slightly less embarrassing thing: there is an explicit function, computable in linear time, that needs at least n^3/2 gates in a two-layer deep net (modulo some log(n) factors), and another function that needs n^3/2 gates in a three-layer deep net (with the additional restriction that the output gate is a majority vote of the previous layer).

This is still a long way away from truly understanding the representational power of deep nets, but it's the first general progress that has been made since the early 90's.

[1]: http://eccc.hpi-web.de/report/2015/188/

Actually, there's been a few papers showing solid theoretical progress on understanding what structure in data deep networks are learning to represent.




Disclaimer: I reinvented the idea about a month or two ago, and ran into the existing papers when googling for joint entropy estimators after making some interesting graphs with hierarchical probabilistic programs. I only wish I could've been the first!

An interesting and early (1992) demonstration of deep nets learning latent structure in the visual domain was presented by Usui et al. They presented a deep autoencoder with the spectral values of color chips and show that the hidden layer representation corresponds to the Munsell color space where each dimension corresponds to a psychological/intuitive color attribute. Pretty amazing. Even cooler that they showed this in 1992.

10.1364/JOSAA.9.000516 Shiro Usui, Shigeki Nakauchi, and Masae Nakano, "Reconstruction of Munsell color space by a five-layer neural network," J. Opt. Soc. Am. A 9, 516-520 (1992)

Those papers are very interesting. I was particularly surprised that the method exactly recovers the Big-5 components of personality from the results of the survey questions.

The whole idea of "simplest model that explains most of the data" has always been very appealing to me. The concept is closely tied to reproducing kernel Hilbert spaces, which have recently experienced a revival in interest due to the representer theorem.

>I was particularly surprised that the method exactly recovers the Big-5 components of personality from the results of the survey questions.

Why? Is that a standard dataset? They could have just cherrypicked tasks.

I also noticed that they seem to only have the actual computation working for discrete random variables (integer features) right now, which limits its applicability. They also seem to use mass-function estimators, which again can work well for discrete data while becoming statistically intractable when dealing with continuous random variables.

>The whole idea of "simplest model that explains most of the data" has always been very appealing to me.

In this case it's a bit more like, "The set of latent variables that best screen off the observables from each-other."

You will probably be interested in Kolmogrov complexity, which studies this using an information theoretic lens. However it is just an abstract concept and not really something you can measure precisely.

I meant "deep nets" in terms of boolean circuits, which seems like a different beast from these papers.

> reinvented the idea about a month or two ago, and ran into the existing papers

this happens all the time. i'm thankful that you took the time to post the results, because a discovery that's not communicated is not really a discovery. indeed, i suspect that what's going on in the field is as much a lack of communication as much as a lack of understanding.

i wish the author of the original article had included links to at least a couple of of the "number of tools to probe the geometry of these hidden structures."

That's really interesting, especially with respect to the personality trait recovery. I'm particularly amazed given that I've never been able to replicate the Big5 model in its entirety against particular datasets.

Only commenting on that portion, as it's what I know. The author doesn't appear to give any indication of which test was used, and where the sample was taken from. This makes me a little suspicious, but its probably just a disciplinary thing.

Additionally, many tests reduce the pool of available questions for a scale after testing it against against other scales + theory. This means that what the author is recovering, is the scale imposed on the items by the original authors.

Nonetheless, this has forced me to reinvestigate the Big5 model, so it's not all bad :).

Massive appreciation for the links.

Possibly also of interest: Recursive Neural Networks Can Learn Logical Semantics[1]

Probably not as theoretical as the work you referenced, but interesting to me because of the deeply practical outcomes in NLP.

[1] http://www.aclweb.org/anthology/W/W15/W15-40.pdf#page=22

This is still a long way away from truly understanding the representational power of deep nets, but it's the first general progress that has been made since the early 90's.

Exactly. Someone should tell Yann LeCun this --see e.g. Section 3.2 in [1], or pretty much every time he brings up circuit complexity theory to automagically imply that "most functions representable compactly with a deep architecture would require a very large number of components if represented with a shallow one." [ibid, p.14]

[1] http://yann.lecun.com/exdb/publis/pdf/bengio-lecun-07.pdf

Related: "On the Expressive Power of Deep Learning: A Tensor Analysis". http://arxiv.org/abs/1509.05009

One problem with understanding ANNs is that the weight matrix carries a lot of spurious interactions. Running perturbation analysis you can see that many of the interactions do not contribute to the information processing of the circuit. This is the same for Gene Regulatory Networks. I wrote a paper published in Nature's "Systems Biology" entitled Survival of the Sparsest Gene Networks are Parsimonious. It's been cited ~130 times. There I represent an algorithm to evolve the connectivity of the network. What you find if that a network will tend to remove spurious interactions if the system is allowed to evolve. Because there will be very few network topologies that are both sparsely connected and functionally equivalent (think of how many ways you could create a minimally complex 8-bit added) there is likely only a small handful of non-isomorphic network topologies for any given function. With these sparse networks we should get a better grasp on the functional circuits that drive them. When the networks appear fully connected, at least in each layer, that circuitt does not reveal itself.

NVIDIA NIPS paper doing just that for NN.


Not sure if it tells us more about why the network works, but they sure as heck get rid of a lot of connections.

Is perturbation analysis one of the things the mathematicians don't know about or don't accept?

I don't think so - it is fairly common in simulations and statistical analysis as a way to explore the robustness of your results given small changes within the tolerances you expect for your model. Usually it is applied to the input data, but training neural networks can be expensive so tinkering with the weights is much cheaper (which is the input to some gradient descent algorithm looking for a local minimum and gives insight into the stability of your local minimum).

Isn't this what dropout does?

Could you explain weight matrix "spurious interactions"?

In short, a spurious interaction is a non-zero entry in the weight matrix that has no (positive) contribution to the network function, and that removing that interaction the network will perform at least as good as it would with the interaction.

Most neural network models assume that all neurons in one level are fully connected to all neurons in the next level. This leads to confusion about how ANNs work.

From my research I'd argue that MOST interactions in these networks are spurious. Once you remove them it reveals the (visual) topology (circuit diagram) that's driving the function of that network.

In the paper I wrote, I evolved gene regulatory networks (of ANNs have the same mathematical representation) such that interactions between any two nodes could be deleted, created (if W_ij = 0), or modified according to probabilities of deletion, creation, and modification. Given these probabilities, you can calculate the number of interactions that should result when the network reaches equilibrium, however what I found was that the network evolves less interactions than you would expect from the equilibrium calculation. This says that all things being equal, a network is paying a price for spurious interactions and that these will be removed in an evolutionary environment. Basically, each interaction needs to pay it's way otherwise it leads to unnecessary complexity that reduced the fitness of the network.

Interesting, thank you.

Very, very roughly a neural network consists of a bunch of connected nodes that you propagate signals through. Each of those connections carries a weight that affects how the signal gets propagated through the rest of the network.

Let's say a specific problem has only one very specific set of connections that matters. You'll eventually add up with weights that reflects that, but that doesn't prevent a lot of other connections from having weights set during training, but that may end up being cancelled out or reduced enough to have no meaningful impact whatsoever on the end result.

Is that why sparsity terms have been introduced into objective functions?

noise, biases, collisions, etc. due to imperfect or insufficient data, representational space issues, etc.

There was a great article called The Space Doctor's Big Idea, published in the New Yorker a few months back. It explained Einstein's theories in the top 1000s English words used in America.

While sone might wave their hands and call this an eli5, it was quite well done and between that and some xkcd comics I was able to learn enough to ubderstand how orbit works and a few other awesome facts.

I am interested in finding something similar for machine learning. I am not embarassed to admit that, Super-Linear Gate and Super-Quadratic Wire Lower Bounds for Depth-Two and Depth-Three Threshold Circuits, is just way too complex for my casual interest but would like to know more about how AI/ML works conceptually. Clearly the discipline underpins a bunch of services I use regularly and as computing and techniques improve it will start playing a more conspiscious role.


You may have noticed this, but that New Yorker article was written by Randall Munroe, the creator of XKCD.

Which makes sense, considering his last book is "Thing Explainer", and explains stuff like the "Up Goer Five" (Saturn V): https://xkcd.com/1133/

This deserves to be put here:


Google Talk on the Master Algorithm by the author Domingas.


thanks, looks super interesting.

In addition to "We need a cluster for deep learning", the second most popular mostly untrue thing I hear is "We have no idea how neural networks learn".

However, there are many papers that explore various ways to make a network learn, and they keep improving on performance, suggesting they're on to something. There are also many papers that discuss possible theoretical implications of experimental results.

But what does Knowing Why The Network Works mean exactly? "It works because universal approximation and gradient descent", but that's not a very satisfactory answer. "It works because it starts at a general solution and, over the course of many iterations, takes many small steps in an ever changing direction defined by a gradient approximation generated by looking at the difference between an average error and a target output (which should trend towards 0)".

What would a satisfactory "why" even look like exactly? As in, what form might it take compared to some other scientific discipline where we do know what's going on?

Personally, I think the whole thing is a red herring -- people in the field have some idea of how neural nets work, and there are many disciplines considered by many to be mature sciences that are far from settled on a grand theoretical scale.

That said, the theory I'm most interested in is recent attempts to connect a memory module to neural networks so they can "learn" to store important/complex/distributed information that can be recalled with high accuracy later. That will make it easier to do things like ask a neural network to remember your name, or where you left your keys, or whatever.

> What would a satisfactory "why" even look like exactly? As in, what form might it take compared to some other scientific discipline where we do know what's going on?

To me this is obvious: a proof is an adequate answer. All of the explanations given in this area are heuristic reasons. They're nice, but we can't know if they're correct or if we're being fooled by our intuition without a proof.

The why is largely why does gradient descent converge to a good answer instead of getting stuck in a local minima.

Because our intuition about local minima is wrong in extremely high dimensional spaces. In two and three dimensions, local minima are common. In a million dimensions, local minima are rare. The intuitive explanation is that for a local minimum to exist, the function must be curving up (first derivative = 0, second derivative >= 0) simultaneously in every dimension. It makes sense that as you add more dimensions this becomes less and less likely, and for a million dimensions it's vanishingly unlikely.

What you get instead of local minima are saddle points, where some dimensions are curving up and some are curving down. Saddle points can also be problematic for optimization but they can be dealt with using fancy optimization techniques. For a more rigorous explanation, see http://arxiv.org/abs/1406.2572

This intuition is not correct. For a random point in a high-dimensional space--yes--I totally agree it's highly unlikely all the eigenvalues will have the same sign (as is needed for a local minimum). The problem with Surya's claim is that the set needs to be restricted to just the critical points (where the derivative is zero). It's very hard to describe the spectral properties generally for this set, and there has been results for only 2 and 3 dimensions as far as I'm aware. Maybe the measure is the same, at least for Deep Nets, and thus a critical point is as unlikely to be a local minima as a random point is. But no work has shown that. Furthermore, we know the number of critical points to be exponential for broad classes of functions so the subset cannot be ignored.

ELI5 version: One can ask, what's the probability that a random person drawn from Earth's population (points in high dimensions) owns a Bugatti (is a local minimum)? It's very small, obviously. But that doesn't tell us anything about the probability of Bugatti ownership among select subsets of people (critical points).

Wouldn't it still be likely to settle on a local minima when the deciding factors for the existence of a local minima are limited to those that contribute to the likelihood of a single output category, and not whether all functions are curving up?

An example I can think of would be an absurd million input neural network, where one of the inputs only has a pronounced effect on one of the outputs. It seems like it would be possible for the path of the input to output to be dragged downhill in the context of all outputs, but uphill in the context of the single output it affects.

Is what I've described not likely, or am I just completely off base?

An interesting question. A million input neural network isn't necessarily absurd. Images are very high dimensional. One could easily imagine a one megapixel input to a neural net. But in natural images no single pixel is indicative of any single image characteristic by itself. I think this isn't a coincidence but a common characteristic of "natural" high dimensional data, on which neural nets tend to work well. So yes, I'd say what you've described is not likely for a large category of "natural" high dimensional data which probably includes most of the data we care about in the real world.

By that argument gradient descent should work for any classifier with a sufficiently high number of independent parameters.

Do you have any good examples of where it is likely to fail?

My thinking is that if this line of reasoning is true it suggests that we may as well be using kernel methods or other feature generation systems with high dimensionality with better algebraic properties that be might be able to train quicker or better interpret. My understanding is that those other techniques haven't been especially competitive with deep neural nets on large data sets these days, so that implies that the high dimension argument for gradient descent is insufficient to explain the success of deep neural nets.

Take any smooth function f and consider the multi-dimensional g(x,y,z...) = f(x) + f(y) + f(z) + ... If f had m local minima, then the n-dimensional version of g has m^n local minima.

You are correct in that there will be m^n minima, but since your surface is highly symmetric, local minima are trivially easy to find. Are you simplifying the statement by skipping over the importance of intervals?

If the intervals spanned by x,y,z do not overlap, then your statement does not hold unless local minima are evenly distributed over the entire region, which defeats the purpose of searching for minima in the first place. If the intervals spanned by x,y,z are identical, then gradient descent can be reduced to just searching over f(x) by symmetry. You are right that there are m^n minima, but they are found in (nm)^1 steps. If the intervals spanned by x,y,z partially overlap, you can further reduce the search space to just f(x) and the union of the intervals.

I don't see how your example shows that gradient descent is likely to fail for any classifier with a sufficiently high number of independent parameters. Am I missing something?

I was just tossing out an example of a higher dimensional function and counting the local minima. The reasoning that critical points tend to be saddle points, not minima, seems irrelevant if critical points are increasing exponentially for a typical higher dimension function.

This is a very cool insight for the uninitiated. Thanks.

Mainly because in high-dimensional space saddle points are much more common and many local minima have very close fitness. Look for example http://arxiv.org/abs/1405.4604 and following papers.

Because the solution space is convex if you've chosen your representation well.

This is definitely not the case for general neural networks, though.

If gradient descent is working reliably, the problem is convex. See the sibling comments for the intuition for large dimensional spaces.

"The problem is convex" and "the algorithm is unlikely to get stuck in a local minimum on realistic problems" are very different things.

Right. Hence the word 'reliably'.

> What would a satisfactory "why" even look like exactly?

Before deep learning happened, neural networks used to be popular for regression and interpolation problems. For a long time, our understanding of how they worked wasn't much better than our current understanding of how deep learning works.

In 1994, Radford Neal showed [1] that weight-decay neural networks were in fact an approximation to Bayesian inference on a Gaussian process prior over the space of possible functions. Amongst other benefits, this allowed better approximations to be utilised, and essentially (along with the advent of SVMs) marked the end the first neural network era.

Something like that is what I would consider a satisfactory "why"

[1] http://www.cs.toronto.edu/~radford/pin.abstract.html

But what does Knowing Why The Network Works mean exactly?

I would say it means: Why do these algorithms seem to be doing "well" despite their well-known theoretical intractability? Is it something about the instances? Or is it simply a matter of scale (so the local optima they are getting stuck in are not as obvious)? Or is there something "deep" we do not understand yet about these algorithms (or the theory)?

Either way, there is something going on here for sure.

Heck, I want that for the neural net in my head.

To take the discussion on a slight tangent, how uncommon is this phenomenon? An applied tool works really well but nobody knows why. I can give another example from the domain of formal verification: SAT solvers which are at the core of most modern verification/synthesis tools.

You can download open source SAT solvers today that work spectacularly well on "real" SAT instances with millions and millions of variables and clauses. Yes, SAT is the quintessential NP-complete problem and in fact it is pretty easy to come up with a SAT instance with only a few tens of variables/clauses that would kill these solvers. But somehow these "hard" instances almost never occur in practical problems generated in hardware/software verification/synthesis as well as ton of other applications (planning, constraint programming etc.)

So this must there is some characteristic of the problems that we generate in practice that makes them "easy" but we don't have a good understanding of what this characteristic is. All we know, for now, is that we've somehow stumbled upon a near-perfect set of heuristics that work amazingly well on the SAT instances we encounter in practice.

> So this must there is some characteristic of the problems that we generate in practice that makes them "easy" but we don't have a good understanding of what this characteristic is. All we know, for now, is that we've somehow stumbled upon a near-perfect set of heuristics that work amazingly well on the SAT instances we encounter in practice.

There's also the intriguing possibility that "natural instances" however one might define them arise through some complicated process that ends up sampling from easy SAT instances. I think of it as analogous to the Benford's law (see: https://en.wikipedia.org/wiki/Benford%27s_law#Explanations) except that we do not have nearly as good explanations for SAT solvers.

It is also highly likely that MOST instances of SAT are easy but that doesn't preclude it from being NP hard, nor does it stop us from easily constructing hard instances. (Such things are absolutely necessary for cryptography as an example.)

The 'other' possibility is that hard instances have 'measure zero' in the space of all instances... So when you pick some random problem in the real world, it ends up being relatively easy to solve with high probability.

My feeling is there's a bit of each; the problem space has a bit of unknowable quasi-structure, and also tends to miss the perverse corner cases and counterexamples that we like to think about as mathematicians.

My understanding is that in SAT world, random instances are typically easily, even for large numbers of variables. Hard instances only exist in a very narrow band of instance space pertaining to the #constraints/#variables ratio. And yes, it seems like there should be some explanation for this, or maybe it's simply "your intuition for how hard problems should be is wrong".

Can you recommend a good introduction tutorial on SAT theory?

This should be an accessible introduction to SAT: http://www.georg.weissenbacher.name/papers/mod12.pdf.

If you want something more "high-level", and I'll look for something appropriate.

Good look, homey. Thanks.

Note that the writer is the famed Ingrid Daubechies, who was one of the main driver behind wavelets and early works on sparse representations.

Noticed that as well. Still writes with such reverence for the power mathematics has in unveiling the mysteries of nature. Particularly enjoyed the moniker "distinguished differential geometer" to describe Calabi! I guess its only a matter of time before "wavelet nets" are re-discovered and resurrected as the "new innovation" in machine learning ;)

Have you seen Mallat's recent work on the scattering transform?



The fact that the class of NN functions is universal is almost vacuous: the basic idea is that if you allow a "neuron" for every point in your input space then you can mimic any function you like (i.e. each neuron handles a single input). Obviously such representations become arbitrarily large.

Which almost immediately suggests a solution to why NN learning works: the processes that produce the types of datasets humans are interested in are produced by (effectively) polysized networks and you can probably say things like the probability of recovering a polysize function from polysize samples is high.

My gut response was, I think, a less rigorous version of this: Humans care about tasks our own brains are good at. If NNs are in fact decent approximations of the way our brains work, then it makes sense they will be anomalously effective on this subset of the "actual" problem space.

You seem to be saying that ANNs are good at pattern recognition because BNNs are. That just defers it to the question why BNNs are good at it.

OTOH, the parent comment suggested we should draw our attention to the processes that produce the data and find correspondences to how NNs decompose it.

No, I'm saying that what we perceive as "being good at pattern recognition" is defined by the sort of pattern recognition BNNs are good at.

Which still doesn't explain precisely what sort of statistical inference BNNs are performing.

> In practice, neural networks use only two or three layers...

The famous AlexNet [1] that blew away the ImageNet competition in 2012 contained 8 layers; more recent networks have even more.

[1] http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

Yea it's more like the opposite, in theory you only need 1 hidden layer to get the same capacity in the network as with more hidden layers, but in practice, since auto encoders, many hidden layers gradually decreasing in size are easier to get to from local energy minimums to globally low energy minimums

Yes, people took the Universal approximation theorem[0] as evidence that they only need 1 layer, but there is zero guarantee of efficiency. A single hidden layer may mean many magnitudes more neurons needed over a n-hidden layer, n > 1 network, which could cause an unrealistic training time. Having multiple layers can reduce this training time with an optimal structure.


Nobody ever took that theorem seriously. Deep nets were around since the 90s.

I've seen people with passing knowledge of NNs throw it around sometimes and its also referred to in a lot of literature as one reason for needing to parallelize neural networks (which I've been reading a lot on, due to a project), even if its not hugely important.

It's not hugely important because it tells us little of practice use. I mean, k nearest neighbors, given infinite data, can model any function as well. In practice, single layer neural nets are not very useful and don't do a good job of learning feature representations.

I wrote a comment on the article's page, saying this. They have now added a correction.

I understand it as _classic_ neural networks as opposed to deep networks. Not a good choice of words, though.

That's mostly tautological: the networks you call "deep" are the ones that use more layers. Unless you need funding or media attention, in which case you call everything "deep".

But you're making it sound like shallow networks are a thing of the past. I would compare this to the field of NLP, where it seems we don't have a good general idea what to do with deep networks, and the things that work so far are mostly shallow.

word2vec is one layer plus a softmax. GloVe is similar to a one-layer network. char-rnn seems to do as well with two layers as it does with three. All the gradient-descent classifiers out there are equivalent to one-layer networks.

I love her definition of big data.

Another bit that stood out for me:

> You’re awarded an extremely generous grant that allows you to give 200,000 people a 500-question personality test, with answers that vary on a scale from one to 10.

Dating sites like OkCupid should have such data.

This is the reason I am not so much interested in this field. It is too much "let's try this and see what happens" rather than really engineering a solution.

I guess more people feel that way, and I guess that is a good thing, otherwise everybody would now be working on AI, since it is such a promising field.

Some AI pioneers seem to feel that rule-based symbolic solutions still have a place, e.g.


If we are depending on AI systems to make important decisions, possibly life-or-death decisions, shouldn't we understand how they come to the conclusions that they do?

I cannot deny the tremendous advances in neural networks recently, but I find Sussman's point compelling.

there are a number of features in machine learning that seem to have counterparts in statistics - stuff like m-like estimators as kernels (I think - correct me if wrong), information criteria for feature selection, that have existed a long time ago.

maybe it's the combination of sheer computing power and availability of data that allows better models. maybe we've never looked at algorithmically generated models because we also want a narrative (commonsensical explanation) to the model, not just a matter of algorithmically finding correlations between jelly beans and acne, say.

part of me thinks, ok the machines found something. now can we actually use that to understand the world, rather than build more recommendation algorithms? (haha). I'm not sure what an advisor will say about a doc student who says, let's just throw reams of data at a machine until we find a meaningful correlation, and then let's reason from the correlations (my guess is 'no', that's not really the scientific method is it).

I'm hopeful, and I don't think the answer to the question will come from academia.

I'm not sure what an advisor will say about a doc student who says, let's just throw reams of data at a machine until we find a meaningful correlation, and then let's reason from the correlations

Currently they are saying yes, please, and desperately trying to source ever bigger volumes of data.

Norvig et. al's article "The unreasonable effectiveness of data"[1] has been cited over 400 times which is a lot for something with no formula.

[1] http://static.googleusercontent.com/media/research.google.co...

> maybe it's the combination of sheer computing power and availability of data that allows better models. maybe we've never looked at algorithmically generated models because we also want a narrative (commonsensical explanation)

The word you want is "interpretable". Powerful blackbox models have been known for a while, but some industries prefer interpretable models like linear and decision trees, sometimes you're even forbidden from using others due to regulation.

As for "spectacularly well" -- well, the person behind the curtain wiggling the levers retains a lot of influence. Garbage in, garbage out, remember?

I'm reminded of the time Google Translate autodetected "Gesundheit" as Spanish. And Gmail kindly offering to translate "hahaha" from Portuguese, putting an ad for coconuts next to it.

Data science is improving, but you might be surprised how slowly. Especially in the consumer space, because the metrics on effectiveness are so warped.

Voice recognition of numbers only, over a phone connection, can be below 40% accuracy! Much of the perceived success of these systems comes not from the core machine algorithm, but from clever human tweaks around it. Also end-users who are happy with what they get, not quite realizing how goofy it all is if they were to get a glimpse of the raw data.

I'm not sure how old you are, but what exactly are you expectations when you state Data science is improving, but you might be surprised how slowly.

We have machines that can categorise pictures better that humans. In 2011 that seemed completely impossible.

Age is unrelated to wisdom and I'm talking about the full experience.

A Google Image search for "Wonder Wheel" (the famous Coney Island Ferris Wheel) shows this spoked diagram within the first page of results:


Also this year, Google Photos classified black people as gorillas.


Consumers are rarely exposed to the raw machine output -- for good reason. My experience building these sorts of systems is that they're pretty goofy and they fail unexpectedly. After chasing audio and video problems using custom software as well as three major toolkits, I find myself hyper-aware of the flaws in public systems.

Also common sense dictates that it's more about the data scientist on the way in and the UX person on the way out than the machine.

I'm not commenting at all on wisdom, merely on how quickly time passes for people of different ages. 5 years seems like a long time for a 15 year old, and hardly anything for a 50 year old.

The first result for me is https://upload.wikimedia.org/wikipedia/commons/2/27/Wonder_W...

Looks good to me?

As for the gorilla incident, I don't think anyone is claiming that errors don't occur, and it's very fair to say that particular error was very embarrassing for Google. It's interesting how children make the same kind of embarrassing mistakes, eg: https://www.reddit.com/r/Parenting/comments/24me24/embarrass...

> Age is unrelated to wisdom and I'm talking about the full experience.

That's not true; age and wisdom are quite related, just not directly causal. They are correlated. Older people are generally wiser, it's just not a guarantee, nor is it impossible for young people to be wise, but it's certainly far less common. With age comes experience and with experience, wisdom has fertile ground to grow, though it doesn't always.

Notwithstanding the fact that "age is unrelated to wisdom" was a gentle contextual device used to dismiss the commenter's odd assumption about my age, a search for "age is unrelated to wisdom" turns up actual research papers.

"neither general nor personal wisdom have a positive linear relationship to age... age is not only not related to personal wisdom (as is the case for general wisdom) but even negatively related..."

[Mickler & Staudinger (2008), Sneed & Whitbourne (2003)]

Source: The Scientific Study of Personal Wisdom: From Contemplative Traditions to Neuroscience


If you actually searched for that, then you know there's far more studies showing the opposite. You can find a research paper supporting just about any position, that doesn't make it the majority opinion of the field. You actually had to ignore a lot of stuff saying the opposite, to find something saying age and wisdom weren't related. That's OK, you're young, you'll get wiser with age.

That spoked diagram's image is named "google-wonder-wheel" and is linked from an article referring to Google's Wonder Wheel. That doesn't seem like a spurious result to me...

It's clearly an outlier. Try an image search for "Google Wonder Wheel" and you'll see a few Ferris wheels among lots of spoked diagrams -- but not the Wonder Wheel.

The gorilla example is more to the point. Google pushed a quick fix and couldn't fix it. So they blocked the gorilla tag altogether. Why? Because a data scientist couldn't figure out how to fix the problem properly and deploy a solution. To me, that says quite a bit about the current realities of machine learning systems.

We also have history books. The history of AI is funny.

And for the first fifty years of AI research, the process more or less goes:

1) An AI researcher decides that problem X is hard enough that solving it represents intelligence 2) Researcher develops an algorithm to solve X 3) Now that we have that algorithm, it's no longer an AI problem

When you teach a computer to play chess, it's just depth first search with a few heuristics added in. When you write an expert system to evaluate mortgage loans, it's just following a if-then script.

Machine learning techniques though, seem to to have stayed solidly in the AI camp.

I think another interesting analogy (in terms of field-development trajectory) to deep learning is the whole idea of l1-regularized regression or LASSO or sparsity priors or sparse signal processing ([1] goes by many names depending on which field you work in). The whole idea is that by penalizing 'dense' solutions to a regression problem, you can 'promote' 'sparse' solutions like the ones that occur in many many applications. This had been used by various communities with some theoretical justification for years, at least since the 1970s. However, the real theoretical breakthroughs framing this problem in something close to the rigor and usefulness of Shannon's sampling theorem, came in 2004 from two papers independently: David Donoho; and Emmanuel Candes, Justin Romberg, Terence Tao. Theoretical CS community also got close to the answer in their work in the late 1990s and early 2000s in many papers on random projections and sketching. There are many connections between deep learning and ideas in the broad area of sparse regression. But one epistemological point of intersection is how theoretical results pertain to asymptotically large cases, but in applications work very well even with much smaller systems.

[1] https://en.wikipedia.org/wiki/Compressed_sensing

>In the last 15 years or so, researchers have created a number of tools to probe the geometry of these hidden structures. For example, you might build a model of the surface by first zooming in at many different points. At each point, you would place a drop of virtual ink on the surface and watch how it spread out.

It sounds to me like this 'ink drop' is a metaphor to explain some state-of-the-art dimensionality reduction technique. Does anyone know the common name of this technique?

There are diffusion based methods for unsupervised learning based on the graph laplacian, or heat kernel [0]. Maybe this is what she was referring to? I would guess so, because Coifman (who is also closely tied to wavelets [1]) introduced the diffusion maps [2] method. On the other hand, I have not seen any unsupervised method based on say, Navier-Stokes, which is the only other thing that comes to mind for regarding ink drops diffusing.

[0]: http://www.mit.edu/~9.520/spring11/slides/class20_misha.pdf

[1]: https://en.wikipedia.org/wiki/Coiflet

[2]: https://en.wikipedia.org/wiki/Diffusion_map

This reminded me of this paper on using deep autoencoders as a generative model: http://papers.nips.cc/paper/5023-generalized-denoising-auto-... I don't think it's exactly a right analogy, but there's something there about the local flow of MCMC locally describing the manifold structure, and thus generating it globally.

I thought it just simply had something to do with gradient descent.

Neural networks is advanced curve fitting -- That's why.

It isn't really all that magical or mysterious.

It really kinda is.

In the Ilya Sutskever Talking Machines podcast[1] he describes exactly how it is "magic". He talks about how there is no theoretical basis to think that a deep neural network should converge, and prior to around 2006 the accepted wisdom was that networks deep enough to outperform other methods of machine learning were useless because they couldn't be trained.

And then they discovered how to initialize the random starting values in such a way that they do converge - for reasons no one really understands.

I've heard a lot of people claim there is no magic, but I tend to think Sutskever[2] knows a little bit more than most people about deep neural networks.

[1] http://www.thetalkingmachines.com/blog/2015/1/15/machine-lea...

[2] http://www.cs.toronto.edu/~ilya/

Reasons like these are why I get frustrated when people claim the only reason why NNs are popular now is because of computational advances. This isn't true - there is lots of little learnings that have been developed in recent time to make training these networks reliable in production. Large companies (and universities) have had access to high performance computers and clusters to train interesting networks (say char-rnn) for some time.

I was just listening to that podcast yesterday and loved the way he presented this information. I know next to nothing about NNs but I could pretty much follow everything that he said (from an intuitive sense, anyway, if not a technical one).

This "magic" of converging networks reminds me of how ensemble methods, such as random forests, are effective but people aren't sure why. There's certainly (AFAIK) no theoretical grounds to say that a bunch of random decision trees should yield universally good results.

There are. It's all about variance reduction. See the Breiman's paper on Bagging. There's nothing special about random forest though (apart from the fact that a decision tree is a good learner, because of the nonlinearities for example), you can you ensemble learning with any "basic" learner.

I read Breiman's random forest paper, but I'll give his others a read too. Does the variance reduction effect apply to any ensemble method then?

That isn't saying much. Every supervised learning algorithm can be cast as a problem of advanced curve fitting, because at the end of the day, your 'curve' is the mapping that takes you from your 'x-axis' of data to the 'y-axis' of labels.

My super-hand-waving explanation of why machine learning works:

MNIST is a handwritten digit database. Each 784-pixel image (28x28) corresponds to a digit from 0 to 9. As a pure mathematical construction, there are at most 2^784 inputs possible, and a small number of possible outputs.

So if you have 784 completely different ways of analyzing the image, and you combine them in the right ways, you will get roughly an approximation of an answer. This is a tautology if the 784 ways are "the value of each pixel" and the combinations are "magic", but if you have more "intelligent" combinations you should have combinations that are less magic. And in this case, since humans can generally determine the digit value from a 7-light display, it seems reasonable that there exists some way to have "intelligent" combinations such that they combine to form a neural network that solves the problem of digit identification.

And that (still hand-wavy) explanation can also plausibly describe how a human would describe identifying a number. If I ask you "why is this a 1 and not a 3", you might say "because it's straight" or "because it's narrow" or "because it doesn't have a point in the middle" or any number of other descriptions of the object. So you can envision a 2-layer network where the middle layer calculates this (and due to the structure of images, in practice it might better be a 3 or 4-layer network. but the important point is that the search algorithms don't rely on it or you knowing what these middle layers are ahead of time)

Which only leaves the question of how "neural network learning" is supposed to find this. And there are a few heuristics which combine to (in practice) be a very effective search. We have back-propagation (which is much easier with automatic differentiation), so we can adjust the entirety of the network based on the output. (and it's an axiom that if you have a lot of things, they will be similar in the ways they are the same, and different [hopefully in some regular way] in the ways they are not the same). We have drop-off, where we attempt to prune connections that are irrelevant. We can add new connections to see if they are relevant. We can do any number of hill-climbing algorithms on the output of the fitness function. And, as a valid search algorithm, it tends to converge to a valid result.

Obviously none of this is at all rigorous, but if you know the math here enough I don't think you're asking the questions in this article.

So if you have 784 completely different ways of analyzing the image[...]

Interesting intuitions; but if by "ways of analyzing the image" you mean functions, there are an awful lot more than 784...

I mean, there are zillions of different functions, but no more than 784 linearly independent versions. (I mean, not technically since the functions aren't linear, but at any specific point on the functions, the linear approximations have no more than 784 independent bases).

Thought about how cool it would be to use machine learning to generate machine learning algorithms and how that could be the basis of artificial "life".

Maybe start with a minimal implementation 'bootstrap protozoa' that can evolve to highly complicated forms.

Then realised that since Machine Learning processes takes a long time that would take forever making auto evolving computer life forms a long way away.

Although if it were possible, in the same way we became sentient based on very simple inputs, that could also mean computers, with a sufficiently complex process, could too...

A LSTM recurrent neural network was used to learn a learning algorithm: http://www.bioinf.jku.at/publications/older/1504.pdf

it seems like a general phenomenon that our scientific understanding decreasing with the complexity of the system. in classic physics, we could have very clean equations capturing the dynamics of a system. this is not true any more in chemistry/biology/social science. modern deep learning pipeline, compared to its shallow counterparts, gains more complexity, so it's not surprising at all to me that we are not able to understand it well at the moment.

Can anyone give me a pointer on what techniques the author is talking about for unsupervised learning at the end of the article?

Sure. These techniques are called Dimensionality Reduction or manifold learning. The wikipedia article actually has a good description: https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduc... Some of the more interesting ones: locally linear embedding, isomap, MLLE, Maximum Variance Unfolding.

I think one of the seminal papers is by Belkin and Niyogi:


maybe i am missing something here... i really don't know.

it might not be rigourous, but for standard, deep neural networks its intuitively obvious enough that you can reinvent the idea from scratch in your bedroom in a time before the internet just by getting a vague description of the idea and thinking about how it /could/ work.

proving convergence may be difficult, but its not particularly challenging to see why it happens imo. :/

a lot of the things the article points at are utterly irrelevant to the subject. e.g. sigmoid functions, depth of networks.

i think there are some bold baseless claims here - for instance linearly interpolating data points is pretty simple as a way to approximate a function, and its not hard to see how neurons can provide this with linear activation functions and some very naive back propagation. (e.g. evenly dividing error and correcting weights towards the correct result)

if this didn't converge /that/ would be surprising.

Deep Neural Networks learn layers of linear functions - represented by adjustable 'learning' weighted connections.

Each layer of neuron activations is a new representation of the data - produced by the weighted connections.

Function compositions of linear functions are still only a linear function.

Each neuron in a layer sums it's weighted inputs this summation is a non-linearity that allows layers to be composed - function composition.

This is famously expressed in Minsky and Papert's 1968 'Perceptron': a single layer of network weights is incapable of learning XOR.

One analysis is that Neural nets transform the shape of the data until a single line on a twisty high dimensional manifold produces the desired distinction. http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

A single layer network is a universal approximator and a net can be trained or distilled from another net - but deep nets are overwhelmingly better at the initial learning and discovery.

Neural nets have been related to Kadanof's spin glasses, suggesting learning is alike to phase transitions or sand pile collapse where small local changes can produce profound global changes. https://charlesmartin14.wordpress.com/2015/04/01/why-deep-le...

Generally when training nets the learning initially learns a lot very quickly, most of the error vanishes at the start.

Word2Vec demonstrates that nets learn very powerful representations, that word2vec vector algebra is semantic points to unexpectedly powerful representations.

Similar semantic vector math can be performed on images and the same vectors can translate modalities, e.g. text to images. https://github.com/Newmu/dcgan_code#arithmetic-on-faces

Natural Evolution produces efficient solutions.

I propose the successes of deep learning so far are partially explicable because they are working within human culture and perception, relearning our very efficient solutions - akin to distilling a deepnet into a shallow one.

This hypothesis will be tested if embodied deep neural nets using re-inforcement learning discover their own efficient solutions to performing tasks in the real world - robotics.

IMHO Peter Abeel's deep net, learning to robustly output robot motor torques directly from camera pixels will show if embodied deep nets can do discovery rather than relearning what we know. http://www.cs.berkeley.edu/~pabbeel/research_rll.html

It would be interesting to see what the purists arrive at

Well, a good example would be linear regression. We know pretty much all there is to know about its asymptotic behavior, and can prove that its results are "maximum likelihood estimates."

Similarly, Bayesian statistics (a.k.a. probabilistic programming, graphical models, Bayes networks etc.) has a rock solid mathematical foundation – both the statistics behind it and the sampling algorithms (MCMC).

Of course, many mathematical statistical methods started off as hacks too. The invention of the L2 loss function, I imagine, must've gone a little like "alright, let's try to find the pattern here, let's draw the line through this series of points that minimizes the sum of the distance of each point to the line... hm, I don't like how this looks, let's try squaring the distances instead and then minimize that sum so we punish large deviations from our line a bit more... oh, that looks better."

Actually the squaring was for a more practical reason. Lets try to minimize the sum of the distance, oops |y-ax| is not differentiable so we can't use calculus, lets square it instead.

We later realized this worked so well because it's max likelihood of the model y=ax+b+e where e is a gaussian error function. But that came much later.

In grad school I asked that so many times trying to figure out why it was the "right" solution and this historical cheat caused so much cognitive dissonance.

Then again, on the other hand, "calculus works nicely" and "it's simple" probably are good spooky directors toward useful models anyway.

It's not really a cheat, as it produces the MLE (i.e. most likely) estimate under certain assumptions (e.g. errors are normally distributed, which occurs naturally if the errors are large sums of many unrelated measurements errors).

Oh, I'm aware. It's just hard to always justify it and especially hard to do so in historical context.

I don't think that this is quite the right explanation.

|y - wx| isn't differentiable but it is subdifferentiable. So the derivative is defined at every point except where y = wx, and in that case you're more or less fine if you just pretend that the derivative is zero at that point.

One reason why squaring is preferred is that the derivative has a nice closed form, which can be used to find closed form solutions.

Another is that people like to fit their models with gradient descent, and gradient descent has better guarantees for strongly convex loss functions. It also works better in practice. Intuitively: if you try to minimize x^2 by gradient descent, you have the largest gradients when you're far from the minimum. If you try to minimize |x|, then your gradient is always either +1 or -1, making it harder to converge around the minimum. See the Huber loss for a strongly convex relaxation of the absolute value function.

Regarding L2, I think it is just the most convenient way to do the maths from a set of observations (differentiability). Later one, the link with maximum likelihood-based methods was made, by Gauss.

In 'machine learning', structured learning by Vapnik and co (theory behind SVM), has a beautiful and strong mathematical underpinning. But in general, it is true we don't really have a good understanding of why learning algorithms work. The very notion of generalization to unobserved data is not well understood (I like D. Wolpert papers on that topic).

>The very notion of generalization to unobserved data is not well understood (I like D. Wolpert papers on that topic).

Which papers?

IIRC, that one was a decent overview, though he has worked on similar issues in older papers: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

It has been a while since I read those papers, there may better references.

1.) Dropout.

Gal and Ghahramani "We show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process" http://arxiv.org/pdf/1506.02142.pdf

2.) Deep.

There have been many hierarchical networks in Bayesian statistics. Hierarchical Dirichlet Process, hierarchical beta process, Pitman-Yor process, Gaussian process, nested, dependent, translated, generalized versions of the Chinese Restaurant process, or the Indian Buffet one. The deep part isn't new.

3.) Auto-encoders.

People seem to talk about auto-encoders a lot. Its very definition introduces hidden variables about which very little is defined. It corresponds with some kind of prior on what representation should be used by the network. This is studied in Bayesian compressive sensing. http://machinelearning.wustl.edu/mlpapers/paper_files/icml20...

4.) Layered training.

Layers are trained one by one. Doesn't really ring a bell. Maybe it corresponds to some MCMC method I'm not aware of. There are often inner and outer loops in MCMC and I guess something fundamental can be told about the right moments to switch from inner to outer. However, that does not correspond to learning low-level features first.

> let's try squaring the distances instead and then minimize that sum so we punish large deviations from our line a bit more

More like so we can differentiate easy. Then if anyone asks, you wave your hands and mumble something along these lines :)

whatever you do, don't try to figure out why it works[0].

[0] https://en.wiktionary.org/wiki/schroedinbug

Skynet is coming.

As far as I know no mathematician was able to graduate from statistics department as second major or (second) minor (whatever you call it in English). I mean in my university. I hope you get the idea; it looks like statistics is hard for mathematicians, at least for most of them. Thus, I don't understand why they talk about this statistics topic with mathematicians but not statisticians. Yeah, may be should ask to a statistician! Duh!

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact