“The Unreasonable Effectiveness of Deep Learning Representations”

kujaomega · on July 6, 2018

I see a good explanation of the problem and a good evolution of the done steps. But I see a problem in the approach. When you are getting the most similar result, you are supposed to compute high cosine similarity between all the embeddings. If you have more than a billion of embeddings and the embeddings have 1k dimensions, it will take a lot of time. How would you solve this problem? Clustering the embeddings?

bunderbunder · on July 6, 2018

There are off-the-shelf libraries like ANNOY and nmslib that index the vectors in a way that allows for fast (possibly approximate) nearest neighbors searches.

bglazer · on July 6, 2018

This is generally called the k-Nearest Neighbors problem. You should check out the various data structures for doing this, like the ball tree and kd tree.

JoeAltmaier · on July 6, 2018

Looking at pictures and categorizing them by appearance has very limited application. The human context can't be in grasped that way, can it?

For instance, 'wedding pictures'. A cake being cut; a cute kid throwing flower petals; a black-clad clergyman; a hand with a ring on it. Any human could categorize a pile of pictures into those that are in the 'wedding' category, and those that aren't. But no strategy based on weighting pixels is ever going to get there.

Likewise, 'cute' or 'scary' or 'funny'. And on and on.

jchw · on July 6, 2018

There are definitely networks in our brain that classify what we're seeing. An uneducated guess, though, is that the networks in our brains have many, many intermediate representations and don't go directly from image -> words, but rather go to abstract classifiers that can go back to words.

robius · on July 6, 2018

What I find unreasonable is doing all this without knowing what the model is doing. It's blind with no way to steer and correct it.

That is what feed forward networks and back propagation do for us. So why do we keep using them?

Then there's the statistics of it all.. what are we actually modeling? 'The real world' you say? Think again.

Data has to be changed and manipulated into i.i.d. form, or the algorithms won't work. How does an independent set of random variables give us a model of the actual dataset which is a very limited representation of the real world? It doesn't. It's modeling something else.

Okay, why don't we take dependence into account? Surely that would represent the real world better. Good question! (Shirley has nothing to do with it.)

It's because there is no formal definition of dependence in statistics. Let that sink in for a minute.

So the math needs work, statistics needs a revolution, and then we can begin to change AI enough for it to finally start making sense. Focus on explainable algorithms and actual ability to validate that what models generate make sense and will not be unlawfully biased or have outliers that will cause harm.

There appears to be only one company who has something like this. But few actually care.

nerdponx · on July 6, 2018

It's because there is no formal definition of dependence in statistics. Let that sink in for a minute.

What? Statistical dependence (of random variables) is defined clearly and precisely.

Data has to be changed and manipulated into i.i.d. form, or the algorithms won't work

Neural networks don't use the iid assumption.

I downvoted you because it seems like you don't really know what you're talking about and you're currently the top post in the thread. Please don't spread misinformation.

srean · on July 6, 2018

Strongly agreed. It seems robius really is clue less when he/she's talking about modeling independence or modeling the lack of independence.

bjourne · on July 6, 2018

All statistical learning, which neural nets are a form of, use the iid assumption. See https://stats.stackexchange.com/questions/213464/on-the-impo...

bunderbunder · on July 6, 2018

But they use it in different ways. For example, an ARMA model is specifically looking for dependencies among the data points, so there assuming iid among them would be an absurdity. In time series analysis, you're looking for the model's residuals, not the source data, to be independent and identically distributed.

Also, in real-world statistical modeling, there's nuance. Just like for any assumption of a parametric model, the data not being iid doesn't mean that the model is 100% crap, it means that you can't draw specific conclusions about the quality of the model.

Which is fine, because maybe you don't care to draw those conclusions, anyway. One of the key differences between machine learning and traditional statistical analysis is that you aren't so worried about developing parsimonious models with well-defined parameters. You're typically just empirically interested in the model's predictive or descriptive utility. This difference isn't a result of one school being more principled and the other being more lackadaisical. It's reflective of differing goals: One approach was developed for use in scientific hypothesis testing, where your primary deliverable is (in the case of something like regression, anyway) the model's parameters, and its estimates are a means to evaluate those parameters. The other approach is used for modeling processes, where the primary deliverable is the estimates, and the parameters are a means to get those estimates.

yorwba · on July 6, 2018

That kind of iid assumption could be summarized as "the training data is representative of the data we want to apply the model to", and if it doesn't hold, that's indeed a problem.

But "Data has to be changed and manipulated into i.i.d. form, or the algorithms won't work. How does an independent set of random variables give us a model of the actual dataset which is a very limited representation of the real world?" strongly implies that the data itself should be decomposed into iid variables. While whitening ("manipulating into iid form") is a common preprocessing technique because it's simple and effective, that doesn't mean that learning algorithms wouldn't work without it. They'd just take a bit longer to arrive at the same result.

avaku · on July 6, 2018

Agree. I can't downvote, so I just agree :)

etaioinshrdlu · on July 6, 2018

I would try to remember... Just because you don't like it, doesn't mean it doesn't work. Deep learning is creating an awful lot of actual value right now, and I think we're just getting started.

kornish · on July 6, 2018

Which company, and what do they have?

nerdponx · on July 6, 2018

https://cran.r-project.org/web/packages/datarobot/vignettes/...

robius · on July 6, 2018

The company is a small startup with an amazing breakthrough called Optimizing Mind.

They have magical ways of 'explaining' black box models.

But it's not what DARPA is pushing (box remains black), rather the opposite, illuminating what's inside the box, making it a transparent open box. So much so, that the models they make you can edit by hand, since they make sense (to mere humans). Has rather immense implications.

Here's their crappy website: https://optimizingmind.com

yorwba · on July 6, 2018

I guess by DARPA you mean https://www.darpa.mil/program/explainable-artificial-intelli... ?

It might be their crappy website, but I don't feel like Optimizing Mind is likely to create a better solution than that DARPA project. Their single "Static Demo" shows the importance of various factors in a linear regression model ... but that isn't exactly revolutionary. There might be some value in nicely packaging this for decision makers who use linear regression models but don't know how they work, but I doubt that it scales to much larger models.

wadkar · on July 6, 2018

> So the math needs work

Finally! I thought I was alone (and stupid) for thinking like this.

Is there any literature or any meta-work that discusses the notion of probability itself? What is expectation? What is dependence?

soVeryTired · on July 6, 2018

> What is expectation?

There is a formal mathematical definition:

Let (\Omega, \mathcal{F}, P) be a probability space, and let X: \Omega -> S be a random variable taking values in some measurable space (S, \mathcal{S}).

Then the expectation is \int X(\omegs)dP

In computer science terms, do an experiment with every possible random seed and average the outcome (set \Omega to be the set of all seeds, and set P to be the uniform measure on them).

nerdponx · on July 6, 2018

Any probability textbook would answer that.

I would be surprised if Khan Academy didn't cover at least expectation.

meikos · on July 6, 2018

what do you mean by the notion of probability itself?

probability was mastered far before computers were a thing

akvadrako · on July 6, 2018

Probability is far from clear. Very briefly, there are two main camps:

1. Bayesian probability is about degrees of belief. But that's always subjective and belief about what, if not probability? It's circular.

2. Frequentist probability is about, after X >> 1 runs of an experiment, an outcome with odds of Y occurs Y/X times. But it's only exact with an infinite number of runs, which never happens. And what's the odds of exactly Y x 1000 outcomes after 1000 runs? Again, that's circular.

My favourite way to think about probability is the multiverse kind:

3. Assuming there are an infinite number of fungible identical worlds, if a coin flip has 50% of heads, it means observers in exactly half the worlds see heads. However, this isn't actually probability at all - from a god's eye view it's objectively certain what happens.

chriselrod · on July 6, 2018

Your "3" is a Bayesian view. Specifically, from the Jaynesian school, which views probability as ignorance. When we can't calculate which of those world's we're in, we express our remaining uncertainty with probability. The connection to subjective "beliefs" is recognizing that these probabilities are all in our own heads. Believing otherwise is the "mind projection fallacy"; in reality -- as you noted -- these things are certain from the god's eye view, and we fall somewhere in between that and total ignorance/entropy. (I'm not a physicist, but I know some use the Many Worlds interpretation to apply this determinism even to quantum physics.)

E.T. Jaynes fleshes out his worldview in "Probability Theory: The Logic of Science", which was published posthumously in 2003.

kgwgk · on July 6, 2018

> Your "3" is a Bayesian view.

Except for the infinite number of universes nonsense :-)

kgwgk · on July 6, 2018

Dear downvoters: Even if we assume an infinite number of (real or imaginary) universes, what does “in exactly half of the worlds" mean? This definition doesn’t seem at all less problematic that the usual ones.

perl4ever · on July 7, 2018

I think "infinite" is just sloppy language. If every possible universe exists in some sense, that is a large number, but not infinite - because nothing about a universe has infinite precision. Thus, "half" would still mean something.

wadkar · on July 6, 2018

Lets not forget the non Kolmogorovian notion of probability - or quantum mechanics. I personally believe that we would want to accommodate a more generalized notion of probability to significantly improve our statistical models of the world. You certainly hint at it in #3

reitanqild · on July 6, 2018

> Probability is far from clear. Very briefly, there are two main camps:

Isn't this a bit like saying there are two main camps when it comes to coins:

1. "heads"

2. and "tails"

?

At least to me it felt like the different forms of statistics where only different techniques.

perl4ever · on July 7, 2018

I don't even understand how the frequentist view is a valid alternative. It always seemed to me like either you are honest about your priors, and use Bayesian logic to take them into account, or you sweep it under the rug. Lying to yourself always produces bad results, is my overriding heuristic. But I'm not good at math.

akvadrako · on July 6, 2018

As far as we know, there is no underlying theory of probability for them to be techniques of. So maybe they are equivalent in some sense, but on the face of it, they are separate ideas.

eli_gottlieb · on July 6, 2018

Probability is just measure theory where total measure norms to 1.0 ;-).

breatheoften · on July 6, 2018

Is there a way to confirm this experimentally?

beagle3 · on July 6, 2018

Probability was understood long before computers, true, but it waited for Kolmogorov’s axiomatic formulation to actually become the coherent field of math that it is today, rather than a hodge lodge of definitions, tricks and theorems.

And that only happened in 1933, which is around the time that computers became a thing. Not general purpose ones yet - I agree it was before computers were widespread, but definitely not far before they were a thing.

robius · on July 6, 2018

There is very little, but what I have found is good work from The Math Citadel folks: https://www.themathcitadel.com/category/spire/probability-th... ,, also one in the stats category.

ssivark · on July 6, 2018

Er... maybe not that much reworking?