The Limitations of Deep Learning (keras.io)
794 points by olivercameron on July 17, 2017 | hide | past | favorite | 260 comments

As someone primarily interested in interpretation of deep models, I strongly resonate with this warning against anthropomorphization of neural networks. Deep learning isn't special; deep models tend to be more accurate than other methods, but fundamentally they aren't much closer to working like the human brain than e.g. gradient boosting models.

I think a lot of the issue stems from layman explanations of neural networks. Pretty much every time DL is covered by media, there has to be some contrived comparison to human brains; these descriptions frequently extend to DL tutorials as well. It's important for that idea to be dispelled when people actually start applying deep models. The model's intuition doesn't work like a human's, and that can often lead to unsatisfying conclusions (e.g. the panda --> gibbon example that Francois presents).

Unrelatedly, if people were more cautious about anthropomorphization, we'd probably have to deal a lot less with the irresponsible AI fearmongering that seems to dominate public opinion of the field. (I'm not trying to undermine the danger of AI models here, I just take issue with how most of the populace views the field.)

I don't have an ML or deep learning background (no Master's or PhD), so I'm adding a comment from my experience with backtesting trading systems. We collect market data and design algorithms that seem to produce the kind of outcomes we want, then test them on other data sets to which the algorithms have never been applied. Many iterations later, you can get a decently profitable algorithm. And if the 'holy grail' algo is run in the market long enough, eventually there will be a severe drawdown and it will go bust. The quality of the algo and I assume the deep learning model lies in the quality (breadth and depth) of the data, and how honest with himself the person choose to model it. Time and again there will be new 'black swan' or edge events (remember LTCM), because using machine learning is like using the past to predict the future.
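
The "test on data the algorithm has never seen" step can be sketched as a walk-forward split. This is only an illustration under assumed names (`walk_forward_splits`, the window sizes), not the commenter's actual setup: each test window sits strictly after its training window in time.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs in chronological order.

    Each test window starts right after its training window, so the
    model is never evaluated on data older than what it was fit on.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size  # slide forward by one test window


# Hypothetical sizes, chosen only to keep the printout short.
splits = list(walk_forward_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print("train", train_idx, "test", test_idx)
```

Note that a split like this does not protect against iterating on the same held-out windows many times; after enough iterations the "unseen" data has effectively been fit too, which is where the honesty point comes in.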

I guess as long as the users' expectations are correct, it can be useful in some very specific areas. Take the AlphaGo games last year: I had been a Go player for more than a decade, and yet AlphaGo's weird moves inspired new insights that break the conventional thinking framework of a Go player. From that angle, I do think that even though DL is somewhat of a black box, humans can pick up new insights from it, because it explores areas that a human with 'common sense' would normally find ridiculous to explore.

> The quality of the algo and I assume the deep learning model lies in the quality (breadth and depth) of the data, and how honest with himself the person choose to model it.

I've only dabbled with machine-learning here and there for the past 10 years or so, but if there's one thing I've learned so far is that the data behind your ML code (and the way it is structured) is responsible for almost all the success or failure of any given ML algorithm. I have a younger colleague at work who I've started tutoring, and he seems really interested in doing ML work (maybe because of all of the recent hype).

I've tried to emphasize to him several times that ML algorithms come and go and that he should focus a lot of his time on the data itself (from where he intends to collect it? how is it structured? is it reliable? is it "enough"? etc), but it looks that my data-related advice falls on deaf ears every time, he's only interested in me pointing to him the latest cool ML algorithm. I guess he'll live and learn, so to speak.

> I've learned so far is that the data behind your ML code (and the way it is structured) is responsible for almost all the success or failure of any given ML algorithm

Data is indeed a necessary condition but certainly not sufficient. You require a good marriage between engineering features and data to have a good success rate. Learning curves [0] are a good way to understand if your ML algorithm requires more data or better feature engineering.

[0] http://mlwiki.org/index.php/Learning_Curves
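
A rough sketch of what a learning curve computes, in plain Python. Everything here is an assumption made for illustration: the synthetic data, the helper names and the choice of a least-squares line as the model. The idea is to refit on growing prefixes of the training set and compare training error with validation error.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(train, val, sizes):
    """Return (size, train_error, val_error) for each training-set size."""
    curve = []
    for n in sizes:
        xs, ys = train[0][:n], train[1][:n]
        model = fit_line(xs, ys)
        curve.append((n, mse(model, xs, ys), mse(model, *val)))
    return curve


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def make_data(n):
    """Synthetic data: a noisy line (purely illustrative)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


train_set = make_data(200)
val_set = make_data(200)
curve = learning_curve(train_set, val_set, sizes=[2, 5, 20, 50, 200])
for n, train_err, val_err in curve:
    print(f"n={n:3d}  train_mse={train_err:.3f}  val_mse={val_err:.3f}")
```

The usual reading: if the two errors have converged to a similar but unsatisfying value, more data is unlikely to help and the features or model need work; if there is still a wide gap between them at the largest size, more data may still pay off.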

Much of the programming with ML has moved towards cleaning, extrapolating and generating the data.

But this type of programming is, miraculously, bug-free. We never hear of data conversion gone wrong, data corrupted or data mining without conclusive results here. Obviously such bugs lack the glamour of security bugs.

It's also very difficult to catch these errors. Your trained model just doesn't work as well as it could, but how would you be able to tell?

> focus a lot of his time on the data itself... from where he intends to collect it? how is it structured? is it reliable? is it "enough"?

What are the best books on this subject? I suppose it's a very broad topic and thus more difficult to talk about than a single "neural network" algorithm.

Interested in what part of that you feel needs to be explained in more depth? Not sure reading several books is necessary for explaining data collection and data munging...to me it's definitely something best learned by doing.

work in data analysis/stats

Lots of things are best learned by doing. I just noticed there are dozens of books about machine learning algorithms but none on how to gather data. Of course, both those things can be learned independently, but I think there's room for at least a few books about data gathering considering it's so important for good machine learning results.

Here at Manning (we're publishing Francois' book) we have something in our early access program on this now - https://www.manning.com/books/the-art-of-data-usability

This is the domain of statistics, isn't it?

Agreed. AFAIK, only statistics has addressed the question of info sufficiency in data and discriminative power of method. Personally, I think the former is an enormously important subject that isn't addressed well in most ML texts. How much data is necessary to answer a given question in practice? How do you know if your data or method are "good enough"?

From what I've seen, statistics addresses these questions better than CS-taught ML does. CS-based ML is no different from algorithm analysis; it suffers from sensitivity to limits inherent in the data. But ML courses often don't address these limits very rigorously. Yet knowing those limits is all important when effectively mining information at a professional level.

If you can't tell the decision maker what you know and what you don't, your inference/prediction really isn't useful. From what I've seen, statistics addresses this best.

Thanks for sharing your experience. I'm happy that my previous exposure to trading algorithms at least helped me understand more of what the experts here are talking about. I believe the output model is only as good as the data (at least for the deep learning branch of ML). If the dataset does not cover data points which exist in a wider space but in the same domain of the problem, or which don't yet have a precedent, then we really can't simply assume that it is the algo/model that needs tweaking when shit hits the fan.

This is incredibly true, even with crappy old algorithms you can do A LOT if you have great data.

Recent experience with a company that is building some models based on... a few guys recording a few hours of audio and annotating it. I still can't get over the fact that otherwise smart people think this is going to work at all.

> but it looks that my data-related advice falls on deaf ears every time, he's only interested in me pointing to him the latest cool ML algorithm.

So, it seems their learning/planning algorithm fails, even when it is given the right data. That's unfortunate.

Sorry, I can't help but notice that you aren't happy with their brain's algorithm, while talking about importance of data. I don't say that data doesn't matter or anything. Just random observation.

Could actually be their data, right? Imagine if you had only had experience with software engineering. The only data you use when engineering software are the data you learn when using the product or writing tests, it's all the algorithms behind it that's important. So to them, they just don't have data on situations where the data are important.

Wow that's confusing wording. I hope it makes sense.

It does, but the algorithm doesn't seem to be state-of-the-art; it's more like current ML algorithms, which need lots of data to work successfully in each new domain. Well, there's a lot of room for improvement, at least.

The data processing inequality says processing data does not increase its information content.
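
For reference, the usual textbook statement (not something spelled out in the thread): if $X \to Y \to Z$ form a Markov chain, meaning $Z$ depends on $X$ only through $Y$, then

```latex
I(X; Z) \le I(X; Y)
```

where $I(\cdot\,;\cdot)$ is mutual information. In the ML reading assumed here, $X$ is whatever you want to learn about, $Y$ is the raw data and $Z$ is any processed version of it.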

But processing does increase the "obviousness" of the information content.

E.g. projecting the data onto independent dimensions doesn't change the information it contains, but it highlights that those dimensions are indeed independent. Decomposing a multimodal distribution into a mixture of unimodal distributions gives more insight than just viewing it as a bunch of data mushed together. And so on.

I think there should be a branch of information theory that quantifies the obviousness of information and how it is changed by various data processing methods.

The "creative" moves may very well come from the search part of the AlphaGo algorithm, though of course the networks have done their jobs of pruning the search space.

I see... That's true. Though credit still goes to the algo for choosing that particular weird move out of the entire search space (it's just 'weird' and something you would think was a move made by a total newbie to the game). I remember that for that whole week during lunchtime I would watch the broadcast live on YouTube. How devastated I was to see Lee Sedol losing match after match. It was a moment I will never forget: in my mind the computer had crossed an imaginary threshold and won. I know ML/DL experts will say it is only for a very specific area. But what's stopping mastery of enough 'specific' areas that the mastery will be broad enough to pass Turing tests?

Careful, that's the sort of thinking that led to the last 'AI Winter': assuming that if enough rule-based expert systems were built, general-purpose systems could be assembled from them and/or enough could be learned to build general-purpose systems.

Now, it is worth noting that DL models are already being assembled together (often with a coordinating DL model to switch between them). This can have the advantage of the smaller models being reusable to some extent (certainly more than expert systems ever were) but is not a panacea. The results are still essentially bespoke models rather than general purpose ones.

Deep Learning obviously has a lot more mileage left in it, given that much human mental labor is 'just' training and using our general-purpose intellects for what amount to a series of rather narrowly defined tasks, but it won't surprise me if there is a wall of some sort lurking just over the horizon that will require a different approach (albeit one that may still be called 'deep learning') to cross.

OTOH, it does seem as though the folks at DeepMind are fairly aggressively pursuing whatever is on the other side of that particular horizon:




We can debate, but I don't think another AI winter will happen again in my lifetime. AI work is just earning way too much money for its funding to get cut, and a lot of funding is currently private too.

I wasn't arguing for another AI Winter per-se. My warning was more along the lines of pointing out a potential personal "career winter".

I'd be surprised to see inductive learning anytime soon. But I definitely see the next generation of AI systems, robots and their implementation across industry. But that will rapidly fill out and then we will still be left with self determination.

My understanding is that innovation comes from reinforcement learning during self-play (rather than supervised learning of pro games), and thus goes against the best moves suggested by AlphaGo's policy network, in turn pushing it towards new options.

In a sense, it seems innovation arises when the value network forces the policy network to expand the search space because an apparently unlikely move leads to downstream positions deemed favorable.

It's not that simple. The creativity is that the combination of rollouts, policy and value networks allows for more efficient traversal of the search space. That gets you better exploration of possible paths, meaning more options than a human considered and therefore more creativity.
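
For readers who want the interaction in symbols: as best I recall from DeepMind's published description of AlphaGo, the tree search picks moves by combining a value estimate with a prior-driven exploration bonus, roughly

```latex
a_t = \arg\max_a \bigl( Q(s_t, a) + u(s_t, a) \bigr),
\qquad
u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}
```

Here $P(s, a)$ is the policy network's prior for move $a$, $N(s, a)$ is how often the search has already visited it, and $Q(s, a)$ is the value estimated from the value network and rollouts. Treat this as a sketch from memory rather than the exact rule: a move with a low prior can still win out once its $Q$ grows, which matches the "unlikely move leads to favorable positions" reading above.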

> Pretty much every time DL is covered by media, there has to be some contrived comparison to human brains

Well, what we've done so far is emulate maybe 1 mm^3 of brain matter - some isolated, very specialized functional blocks in the greater architecture of the brain. They behave as expected - are experts on very narrow topics, but of course fail to integrate their functioning with a larger body of knowledge, because that body just isn't there (yet).

The strength of the human mind is that it has this profusion of little subject matter experts all over the place, covering an enormous array of topics - and then it has an intricate superstructure that integrates the outputs of these narrow expert machines, tweaks their functioning, even subtly alters their inputs, providing coherence to the global output according to the capabilities of the whole system.

We're still far from that complex high level architecture.

> Well, what we've done so far is emulate maybe 1 mm^3 of brain matter - some isolated, very specialized functional blocks in the greater architecture of the brain. They behave as expected - are experts on very narrow topics, but of course fail to integrate their functioning with a larger body of knowledge, because that body just isn't there (yet).

I think you're falling into the same anthropomorphism trap that the GP is talking about. We haven't even broached the most important topic: neural plasticity - a brain's ability to rewire itself based on a complex feedback loop driven by environmental inputs (which are, at this point in human development, an almost infinitely more complex system of culture built up over tens of thousands of years). From my work in neuroscience, it seems that the computational complexity of the state of the art DL algorithms barely registers when compared to a network of a few hundred biological neurons like the nervous system of Caenorhabditis elegans, which is itself far less capable of self reorganization than even the simplest mammalian brain. Hell, even the most basic potentiation that you'd find in decades old research on addiction is far outside the scope of modern machine learning research and we don't yet have any clean mathematical theories that can emulate plasticity like back propagation or gradient descent can with simple learning.

The current hype around neural networks is the equivalent of saying that we've analytically solved the n-body problem when all we've done is solve a system of equations with two linear variables. The domains are connected but only in the trivial sense that both have variables named "x" and "y."

I think you're far too eager to look for and criticize anthropomorphism - hence you see it where it's not.

You said "what we've done so far is emulate maybe 1 mm^3 of brain matter," comparing computational neural networks to us, a biological system - that's literally anthropomorphising.

You seem to be under the assumption that a typical feedforward DNN is anywhere close to operating like the brain, just on a smaller scale. But that assumption is not correct.

Both the brain and artificial neural networks are connectivist, but that's about where the similarities end. The brain uses completely unknown algorithms and mechanisms that are almost certainly very different from our (current) ANNs. So it's not just a matter of increasing the scale.

That is nowhere near what I am saying.

I think it would help a lot if we brought random forests and SVMs to the same level of performance as DNNs. Demonstrating that more "mechanical" algorithms can be as efficient would dispel some of the anthropomorphism and allow for better analysis of why certain things work.

I also believe that researchers have a responsibility to outline the limits of their own algorithms in research papers. (For example, presenting examples that aren't recognized or data sets on which the approach doesn't work at all.) That is valuable information and they almost certainly have it at the time of publication.

Not possible, unfortunately

I've occasionally found that SVMs work great for one shot learning if you have good features and a nicely labelled dataset. CNNs are really good at extracting features. Once you've extracted features that are generic, using an SVM as the last layer to train while keeping the CNN parameters intact yields great accuracy.
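
A toy sketch of that "frozen features, SVM on top" pattern in plain Python. To be clear about the assumptions: `frozen_features` is a hand-written, hypothetical stand-in for a pre-trained CNN's feature layer (no real CNN is involved), and the SVM is a linear one trained by stochastic sub-gradient descent on the hinge loss, in the style of Pegasos, rather than any particular library's solver.

```python
import random


def frozen_features(x):
    """Hypothetical stand-in for a pre-trained CNN's penultimate layer.

    It is fixed: nothing below ever updates it, which mirrors keeping
    the CNN parameters intact while only the last layer is trained.
    """
    x1, x2 = x
    return [x1, x2, x1 * x2, 1.0]  # last entry acts as a bias term


def train_linear_svm(features, labels, lam=0.01, epochs=200, seed=0):
    """Stochastic sub-gradient descent on the L2-regularised hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    order = list(range(len(features)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            lr = 1.0 / (lam * t)  # decaying step size
            margin = labels[i] * sum(wj * fj for wj, fj in zip(w, features[i]))
            # Shrink step coming from the regulariser.
            w = [(1.0 - lr * lam) * wj for wj in w]
            # Hinge-loss step, only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * labels[i] * fj for wj, fj in zip(w, features[i])]
    return w


def predict(w, x):
    """Classify a raw input by scoring its frozen features."""
    score = sum(wj * fj for wj, fj in zip(w, frozen_features(x)))
    return 1 if score >= 0 else -1


# XOR-like labels: not linearly separable in the raw inputs,
# but separable once the fixed features are applied.
inputs = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]

weights = train_linear_svm([frozen_features(x) for x in inputs], labels)
predictions = [predict(weights, x) for x in inputs]
print(predictions)
```

The design point is the same as in the comment: the classifier on top is simple and cheap to train, so its accuracy depends almost entirely on how good the frozen features are.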

I think that's where we are really headed. A combination of deep learning, boosted trees, SVMs, evolutionary algos, knowledge graphs, etc., all stitched together to build stronger AI systems.

Remember our aeroplanes don't flap wings but still carry tonnes of weight and fly half way around the world. Once we discovered fundamentals of aerodynamics a lot of supernatural things were possible.

Same with intelligence, once we discover the essentials of intelligence and mathematically formulate it, supernatural intelligence is very possible. This is the thing that really scares people. I have no idea how close we are to it, but I'm sure it will change society the way internet and mobile phones changed the world.

Wow, I had never considered superintelligence that wasn't at least at some level modeled after the human brain. That is crazy to think about. We could be at the very low end of the spectrum of intelligence I guess.

Homo sapiens is the dumbest creature able to spawn a civilization that evolution could produce.


That's a good comment, and yes, SVMs are very powerful in themselves. They might not be "deep learning", but they're more powerful than linear learning and good for a lot of cases (as a last layer, as you mentioned, it's a good use case).

Yes, we'll have GAs building CNN architectures, or a mix of several techniques, I'm enthusiastic for what the future holds

> I'm sure it will change society the way internet and mobile phones changed the world.

It will change the entire world the way humans changed the world. And that's scary.

Kaggle has already proven hundreds of times over that deep learning is not a silver bullet.

Thanks, I'm familiar with Kaggle and how most of the time a Random Forest (or XGBoost, or something like Vowpal Wabbit) will solve your problem

True - until some clever guy proves us all wrong and finds ways to train some multidimensional/complex/deep/... kernel/forest/swarm/... that can learn those nonlinearities that currently only deep nets can be trained to detect (essentially, due to their relative simplicity, I'd say) :-)

I don't think we'll see a deep svm, but if we see one I think we'll have something very powerful

Same for a deep decision tree (forest?). Or maybe a combination of several techniques, etc

Probably comes down to whether the model can be trained with gradient descent (at least in the short term).

A general pre-trained RL guided architecture search (#1) together with more choices of nonlinearity (#2), feature extraction (#3), pooling and memory augmentation (#4) and other tricks (#5) could be very powerful across many domains. Make it able to accept multiple pre-trained models as priors and we're well on our way to general AI or at least a place where most data-scientists could be automated away.

(#1 deepmind had a demo a year back or so that was quite novel) (#2 vaguely remember someone training decision trees with gradient descent; could definitely see a 'randomforest' layer appearing in the middle of deep nets) (#3 just convolutions + tricks really). (#4 neural turing machine etc) (#5 any attention mechanism/any sequence mechanism (rnn/lstm etc)/ any graph relational understanding like the recent deepmind paper).

One of the greatest clear and present dangers of AI is that various existing algorithms are called just that, rather than what they are: statistical analysis algorithms, or, in short, statistics. Statistics used to be what we called the worst kind of lie; now it's becoming associated with intelligence, hinting at the ability to expose some great hidden truth. The problem lies not only with the algorithms, but with the models they learn (which are indirectly shaped by the algorithms' limitations) that are simplistic to begin with. E.g., they are trained to predict behavior based on a snapshot of statistical data, using either a constant model (which assumes behavior doesn't change over time) or some simplistic first-order model of change. They certainly aren't usually trained to take into account long-term changes or how their own recommendations impact behavior. The result is a powerful yet completely unjustified boost to the public image of statistical data with simplistic change models.

This. I still cannot forget the disappointment of my parents and some family friends, all retired scientists or MDs, when I explained to them how deep learning and natural language processing work a few years ago. They were truly upset that all this was "nothing more than clever accounting and statistics" at the end of the day, and no trace of the "advertised intelligence" - with Hinton's RBMs maybe coming closest, but by the time I was explaining how you use MCMC to train a Boltzmann machine, they again were complaining that even this is just modeling "statistical likelihoods, not true intelligence"...

In essence, we are only modeling patterns and their transformations, even if rather complex ones. But even the most basic prokaryote can model patterns, that has nothing to do with intelligence or consciousness per se. (And please don't get me started on swarm intelligence now... :-))

Perhaps the problem simply lies in calling them neural networks.

This terminology goes back to McCulloch and Pitts in 1943, who said they were making an analogy or model based on the behavior of biological neurons.


There are many things that are inexact about this analogy or model, and many of them were known to be inexact in 1943, but that was the direct inspiration.

Apparently there are lots of different mathematical models available about biological neuron behavior:


turns out it's very hard to model a thing that we don't know how it actually works

To be fair, we do understand how neurons work, at least on a singular level. Perceptrons model that quite well.

Implementing a basic perceptron classifier is an undergrad homework assignment. Biological modeling of neurons is a work of decades:



McCulloch's argument was that perhaps the gross behaviour of a NN as layers of simple transfer functions is where the real action is, and the rest of the details are just gravy.

The fact we now give this to undergrads as homework suggests that there was some value to this idea.
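
For anyone who hasn't done that homework, a minimal sketch of a perceptron follows. It implements the classic perceptron update rule on a toy problem (logical AND); the function names and the choice of AND are mine, purely for illustration, and nothing here claims biological fidelity.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: update weights only on misclassified samples."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            predicted = 1 if activation > 0 else 0  # step transfer function
            error = y - predicted
            if error != 0:
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
                mistakes += 1
        if mistakes == 0:  # a full clean pass means training has converged
            break
    return w, b


def classify(w, b, x):
    """Apply the trained weights with the same step function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Logical AND is linearly separable, so the perceptron rule converges on it.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]

w, b = train_perceptron(and_inputs, and_labels)
and_outputs = [classify(w, b, x) for x in and_inputs]
print(and_outputs)
```

The whole "neuron" is a weighted sum followed by a threshold, which is about as far as the analogy goes.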

Students in computer science may implement a perceptron as a homework problem. Students in biology don't do that, nor do they use perceptrons to learn about brains, because perceptrons bear only faint resemblance to biological neurons. Reproducing important biological features of real neurons requires much more complicated software.

I'm not denigrating perceptrons or other neuro-inspired approaches to classification. I'm just pointing out that perceptrons are not a faithful model of neurons.

But it turns out that they don't have to be. We know that radically different low-level implementations can approximate the same higher-level functions given a large enough network and enough training (eg. half-precision floating point, integer, or even binary ANNs, not to mention the wide variety of activation functions such as relu, sigmoid, tanh, maxout, softmax, etc.), and we've seen increasingly varied ANN architectures applied to the same tasks with good results, so I would expect this to continue to hold true for ever more sophisticated tasks.

I am certain, BTW, that further study of biological neurons will continue to yield insights for the design of ANNs, but it does not at all follow that ANN design will become more similar to biological NNs as a result. Given the completely different substrates, simulating a biologically plausible NN in order to perform a task (for purposes other than gaining further understanding of biological NNs, that is) would be incredibly wasteful and unnecessary, even if your goal is to create an AGI of some sort.

I was disagreeing with someone who wrote that we understand how neurons work and that perceptrons model them "quite well." They do not model biological neurons well at all. I agree that biological fidelity is not important for building useful ANNs.

I presented (a vulgar summary of) McCulloch's hypothesis, not my own. And since I didn't use the words "quite well", you are not entitled to put them in quotes.

philipkglass was referring to curiousgal's comment upthread: https://news.ycombinator.com/item?id=14790965

OK, thanks. Too late to edit. Adjust flames accordingly. Of course a perceptron is not an accurate model of a biological neuron. But as a reduction to a minimal model it's still pretty darn interesting.

But how does a neuron decide to grow new axons or how to change input weights? Biological neurons do this when solving tasks and not just during training. Isn't it possible that human-like intelligence depends on the network being dynamic? For example, when you play a game for the first time a lot of things suddenly start to click; couldn't that be the result of new connections forming or at least some weights being changed? If this is true then it would be impossible to create a general game playing AI with human-like performance using our current model.

Biological neurons are fundamentally different from the models used in deep learning. They can have multiple outputs, can span the whole brain and do local protein-based computations we don't really understand yet. What we have in the perceptron is just a very simple model based on what we observed using rudimentary electricity detectors.

Don't fully connected layers do exactly what you describe?

As well as the title "artificial intelligence".

One can say that the human mind consists of millions of not very special parts. It's the aggregate, the complexity with which they interact, that makes it special.

Once you start to connect all these seemingly non-special abilities in deep learning the "magic" starts to happen. You get something that is more than the sum of its parts. Of course it's not DL in itself that's interesting but the potential emergent complex relationships.

That's just another version of the trap GP spoke about. About a decade ago everybody was expecting emergent complex behavior from all kinds of evolutionary, intelligent ("swarm") systems. Didn't happen, seen that.


About a decade ago winning at Go or self-driving cars were seen as pipe dreams many decades away. Yet here we are.

The author is making the mistake of thinking that just because he can show some areas where we aren't as far along as we thought, he has made an argument against AI.

That's not how it works. We don't get to decide what the right metrics are. All we can see is that we keep making progress, sometimes in large leaps, sometimes slowly.

I always find it fascinating that we have no problem accepting the idea that human consciousness evolved from basically nothing but the most elementary building blocks of the universe and once we became complex enough we ended up being conscious yet somehow the idea of technology going through the same just in a different media seems to many impossible.

I know where my bet is at least, and I haven't seen anything to counter that, not in the OP's essay either.

The fallacy there is glorifying consciousness. Full consciousness as in omniscience is an unachievable ideal. If we ascribe consciousness to ourselves, depending on the individual theory of conscious thought, that's likely faulty in some respect already.

I don't see anyone glorifying consciousness, especially not as some omniscient ideal. In fact I only see people arguing that consciousness isn't really the goal or the focus here but rather that you can't talk with any certainty about whether or not it's possible. You can however point to the fact that we are making progress towards more and more complex relationships and that this looks very much like how we became conscious. That's all really.

>Didn't happen,

...yet. See my comment here: https://news.ycombinator.com/item?id=14770230

"Never" is a strong prediction. But yes, ANNs have nothing in common with BNNs (biological ... :-)) at all, other than taking them as a very rough abstraction for teaching the basic intuition of the chained up tensor transformations.

The hard thing is to predict the when, or even if, of AI. If it will happen, it will be a sudden, light-switch like moment. I don't think AI can happen gradually. At least the first artificially sentient entity will be a moment much like a singularity some love to predict in the near future...

But as to when that moment will occur, or even if, I think we have no real data that shows we are any closer today than say 10 or 30 years ago. Pattern matching, no matter how complex, isn't "all there is" to intelligence and consciousness.

EDIT: OP changed his reply from "will never happen" to "hasn't happened yet" while I was replying, explaining why mine might read a bit strange now... :-)

> If it will happen, it will be a sudden, light-switch like moment. I don't think AI can happen gradually.

But our own intelligence happened gradually.

Human intelligence at the individual level evolved pretty gradually, but there hasn't been enough time for biology to explain our advancement in the last 10,000 years or 500 years. Culture and social organization are the essential nurturing factors there.

Every human genius would be out foraging for roots, perhaps reinventing the wheel or the lever, if it grew up without the benefit and influence of a society that makes greater achievement possible. Modern science and high technology that we attribute to human intelligence are really the products of a superintelligence (not to be conflated with consciousness) acting through us as appendages.

I think it's entirely possible (even likely) that all of the components of a new computational superintelligence already exist, but they are still "hunting and gathering" in the halls of academia or the stock market or biotech or defense...

Has anyone been able to do this? Is anyone working on it?

I only follow the field as a hobby, but as far as I can tell we are nowhere near getting to this point. I think the ability to combine all these parts in a way that the sum is greater than its parts is going to require many many breakthroughs still.

The thing is that it's most likely not something anyone does per se but something that happens with enough complexity.

If you happen to believe evolutionary theory is the most convincing then we weren't built either but a byproduct of emergent complexity.

It is my belief that humans are pattern recognizing feedback loops and carriers of information. We externalized some of that into books and built libraries to be able to keep even more than humans can remember as individuals and now have technology to save even more information and even manipulate it in ways impossible up until 80 years ago or so.

I am fairly certain that technology is part of nature and that technology-based consciousness is nothing like our limited consciousness but something rather different. The end result will not be like humans, just better; it will be nothing like humans at all, but much better at the carrying-of-information part.

And so with that (my personal belief) perspective in mind, no one is going to be able to do it; it will happen as a by-product.

Please keep in mind that I am saying "we externalized" in the same way we say "selfish genes": it's not a conscious effort as such but rather something which happens to be favored in the game of life.

Why that is I have no idea but I am fairly certain humans aren't the last species. But yes it's all very speculative I just haven't been able to find better explanations for now.

The problem is that we don't really know for sure. We kind of predict things by extrapolating what we know and what we have, but we can never be sure there won't be any sudden breakthroughs.

I agree with your position. But I want to add a warning against the humanization of the brain. Many parts of it are complex in unknown ways, but some parts are truly mechanical.

The parts of your central nervous system that respond to reflexes, that locate the source of sound or parse the color of retinal input are far more similar to deep learning algorithms than they are to what we think of as human consciousness.

Because that has nothing to do with consciousness... Every living cell can perceive such inputs, even the simplest of prokaryotes can "sniff" out their food sources.

> Many parts of it are complex in unknown ways, but some parts are truly mechanical.

I feel like this is a bit of a false dichotomy. We've never encountered any spooky non-mechanical non-physical part of the brain, and we've been looking since Cartesian dualism was in vogue.

What we think of as human consciousness is likely just a bunch of feedback loops allowing the brain to analyze some of its own state as if it were an external entity.

The same oversimplification could have been made of the visual system before we became aware of specialized cortical units and their federated/hierarchical arrangement.

In time I suspect we'll yet discover that much of the brain is inhomogeneous in unexpected ways and peculiarly interconnected. If it were not, we'd understand more about how it works by now.

It isn't anthropomorphizing. There are undeniable architectural similarities between ANNs and biological neural networks. We don't understand either very well yet but the parts we do understand have led to a lot of cross pollination. I don't think computational intelligence will ever match biological networks detail by detail due to the different substrates and resource usage tradeoffs, and they don't need to match. Intelligence can develop in different ways and we are learning about the universal aspects of it.

This is exactly my point - the danger of "anthropomorphization" lies in taking the brain analogy too far. That is, there shouldn't necessarily be a link between research in neuroscience and advances that make deep learning models more accurate. The tasks are completely different (human learning vs. minimizing a loss function), and it's important for researchers in both fields - neuroscience and AI - to keep that in mind.

However, there definitely are analogies! E.g. early work in convnets was inspired by the architecture of cat brains.

I think the fields have useful things to say to each other, but we're getting over a (maybe justified) taboo in talking about machine learning methods being biologically inspired.

The origins of that analogy are very flimsy:

1) Hubel and Wiesel discover simple and complex cells in cat's V1 in the 60's. They came up with an ad hoc explanation that somehow the complex cells "pool" among many simple cells of the same orientation. No one to date knows how such pooling would be accomplished (that selects exactly simple cells of similar orientation and different phase, not vice versa), or whether that pooling is only on V1 or elsewhere in the cortex.

2) Fukushima expanded that ad hoc model into the neocognitron in the 80's, though there is exactly zero evidence for similar "pooling" in higher cortical areas. In fact, higher cortical areas are essentially impossible to disentangle and characterize even today.

3) Yann LeCun took the neocognitron and made a convnet which worked OK for MNIST in the late 80's. Afterward the thing was forgotten for many years.

4) Some few years ago Hinton and some dude who could write good GPU code (Alex Krizhevsky) took the convnet and won ImageNet. That is when the current wave of "AI" started.

In summary, convnets are very loosely based on an ad hoc explanation of Hubel and Wiesel's findings in primary visual cortex, which today in neuroscience are regarded as "incomplete" to say the least (more likely completely wrong). Now this stuff works to a degree, but really all these biological inspirations are very minimal.

How do you know your brain's not minimizing a loss function?

For the analogy to hold, it's more of a question of whether or not ML algorithms operate in the same way as the brain. Right now, ML models use algorithms from continuous optimization that require certain structure. Namely, we require a Hilbert space, so that we can define things like derivatives and gradients. This puts certain requirements on the kinds of functions that we can minimize and the kinds of spaces that we can work with. These are requirements for which it is difficult to find precise analogies in biology. What does it mean to have an inner product in the brain? What does twice continuously differentiable mean in the context of a neuron? Even if there is a minimization principle, which I am not sure there is or is not, if ML uses algorithms that are fundamentally not realizable in biology, how can we say it replicates the brain?

Based on what goes on in every cell in our bodies when it comes to the information processing involved with DNA, I don't think there is any such algorithm which is fundamentally not realizable in biology. I'll grant you, I don't think biological neurons are calculating derivatives across connection strengths, but there must be some analogous process to control neural connection strengths.

That may very well be and I think it's a fantastic area to do research on. Namely, can we accurately model the body with an algorithmic process and what does this process look like? However, unless ML directly mirrors the algorithms involved in the body, the models used by the body, and the misfit function used by the body, which together already assume that the body really does operate on a strict minimization principle, then I contend it's improper to anthropomorphize the algorithms. They're good algorithms, but a better name would be empirical modeling since we're creating models from empirical data.

You might find a slide of my talk interesting:


You have to read it from left to right with a twinkle in your eye, of course ;)

In your slide - why is backpropagation a further stretch from a true bio-NN than an ANN without backpropagation?

An ANN still resembles major features of a bio-NN.

1. A network

2. Flow of information is mainly unidirectional through a node

3. Multiple inputs, but one output, which is connected to the inputs of other neurons.

4. The connection strength between 2 neurons can be changed.

5. Non-linear behavior.

After all, I think, this is not such a bad first approximation. Hence the picture in the middle.
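To make the five features above concrete, here is a minimal sketch of such an artificial neuron and a two-layer network built from it. The function names, weights, and the choice of a sigmoid non-linearity are assumptions made for illustration; nothing here comes from the slide or from any particular library.

```python
import numpy as np

def sigmoid(z):
    """Smooth non-linearity squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    """Multiple inputs, one output: a weighted sum followed by a non-linearity."""
    return sigmoid(np.dot(weights, inputs) + bias)

def tiny_network(x, w_hidden, b_hidden, w_out, b_out):
    """Unidirectional flow: hidden outputs become the inputs of the output neuron."""
    hidden = np.array([neuron(x, w, b) for w, b in zip(w_hidden, b_hidden)])
    return neuron(hidden, w_out, b_out)

# Hypothetical inputs and connection strengths, chosen arbitrarily.
x = np.array([0.5, -1.0, 2.0])
w_hidden = np.array([[0.1, 0.4, -0.3], [-0.2, 0.3, 0.8]])
b_hidden = np.array([0.0, 0.1])
w_out = np.array([1.5, -2.0])
b_out = 0.2

y_before = tiny_network(x, w_hidden, b_hidden, w_out, b_out)

# Feature 4: strengthening one connection changes the network's output.
w_out_changed = w_out.copy()
w_out_changed[0] = 3.0
y_after = tiny_network(x, w_hidden, b_hidden, w_out_changed, b_out)
```

Note that nothing in this sketch says how the connection strengths should be updated; that is exactly the part the next paragraph objects to.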

But I cannot believe that we learn by comparing thousands or millions of input and output patterns and backpropagating the error through the network to perform a gradient descent at the neurons. That is simply not what our brain does.

When there is feedback in neurons, what do you think that conveys?

I agree it is not some simple error correction like what is propagated backwards, but it happens often and I presume it's something useful or it wouldn't be there.

Top down predictions are likely mediated by feedback connections from higher to lower areas. Functions include possibly encoding a generative prior for prediction, speeding up inference. They also play an important role in coding more informative error signals than simple derivatives and are part of how the brain learns even as it predicts.

This is only true because we don't know how the brain actually works. But the NN architecture is not unreasonable, it maps structures seen in the brain. Backpropagation is also reasonable to abstract the changes in gene and protein regulation (e.g. how learning could be encoded).

Well said. It's just curve fitting.

Maybe everything is "curve fitting." -- Note: I think it's more hierarchical than that but curve fitting is certainly one of the important capabilities of biological systems.

I don't think so. There's an incredibly important art and science to model selection that is not encapsulated in curve fitting. For example, say we observe a boy throwing a ball and we want to predict where the ball will land. From basic physics, we know the model is `y = 0.5 a t^2 + v0 t + y0` where `a` is the acceleration due to gravity, `v0` is the initial velocity, and `y0` is the initial height. After observing one or two thrown balls, even with error, we can estimate the parameters `a`, `v0`, and `y0` relatively well. Alternatively, we could apply a generic machine learning model to this problem. Eventually, it will work, but how much more data do we need? How many additional parameters do we need? Do the machine learning parameters have physical meaning like those in the original model? In this case, I contend the original model is superior.
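As a rough illustration of how little data the physical model needs, here is a sketch that fits `a`, `v0`, and `y0` from eight noisy samples of a single throw. The "true" values, the noise level, and the helper names are assumptions invented for the example; only the formula comes from the comment.

```python
import numpy as np

# Hypothetical ground truth, used only to synthesize observations.
A_TRUE, V0_TRUE, Y0_TRUE = -9.81, 12.0, 1.5

def height(t, a, v0, y0):
    """Physical model from the comment: y = 0.5 a t^2 + v0 t + y0."""
    return 0.5 * a * t**2 + v0 * t + y0

def fit_throw(t, y):
    """Least-squares estimate of (a, v0, y0) from timed height samples."""
    # The model is linear in its parameters, so one linear solve is enough:
    # the design matrix has columns [0.5 t^2, t, 1].
    design = np.column_stack([0.5 * t**2, t, np.ones_like(t)])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params

rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 2.0, 8)                  # eight samples of one throw
noise = rng.normal(0.0, 0.05, t_obs.size)         # assumed 5 cm measurement error
y_obs = height(t_obs, A_TRUE, V0_TRUE, Y0_TRUE) + noise

a_hat, v0_hat, y0_hat = fit_throw(t_obs, y_obs)
print(a_hat, v0_hat, y0_hat)
```

Three parameters, each with a physical meaning, are recovered from a handful of points. A generic model would have to learn the parabolic shape itself before it could say anything about gravity.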

Now, certainly, there are cases where we don't have a good or known model and machine learning is an extremely important tool for analyzing these cases. However, the process of making this determination and choosing what model to use is not solved by curve fitting or machine learning. This is a decision made by a person. Perhaps some day that will change, and that will be a major advance in intelligent systems, but we don't have that now and it's not clear to me how extending existing methods will lead us there.

Basically, I agree with the sentiment of the grandparent post. Machine learning is largely just curve fitting. How and when to apply a machine learning model vs another model is currently a decision left up to the user.

You're talking about the complexity of the model. If you take a purely input-output view of the world (which by the way, even classical Physics does), every problem _is_ curve fitting in a sufficiently high dimensional space. There is no _conceptual_ problem here. There is perhaps a complexity problem, but that's why I wrote that "I think it's more hierarchical than that."

I disagree. Many problem spaces are not continuous and can involve incomplete information that makes a continuous model like a curve useless.

For instance, a linguistic model that lacks definitions for some words, or which allows too much ambiguity can leave sentences unparsable or uninterpretable. Disruptions to word order in sentences can lose sufficient information that no curve or fitment can recover it. A curve has to capture sufficient information for fitting it to be useful. I think not all concepts or relations are amenable to N-dimensional cartesian representation. (Though I'd like to see a reference confirming this.)

And hidden interdependence between dimensions can make any curve drawn in that coordinate space a misrepresentation of the actual info space, and any curve fit in it, dysfunctional.

Any mapping of info onto a cartesian coordinate space presumes constraints that limit the utility of any function that ranges across that space. So no curve is guaranteed to be meaningful in "the real world" unless those assumptions are conserved upon reentry from the abstract world.

George Box's "All models are wrong, but some are useful" suggests that while fitting curves in wrong models may be possible, it well may be form without function.

>If you take a purely input-output view of the world (which by the way, even classical Physics does), every problem _is_ curve fitting in a sufficiently high dimensional space.

Not all spaces are Euclidean, and "purely input-output" still contains a lot of room for counterfactuals that ML models fail to capture.

What do you mean by counterfactuals? NNs are function approximation algorithms, in any geometry. No ifs ands or buts about it.

Oh, I agree that neural networks are function approximators with respect to some geometry. When I say "counterfactuals", I'm talking about typical Bayes-net style counterfactuals, but as also used in cognitive psychology. We know that human minds evaluate counterfactual statements in order to test and infer causal structure. We thus know that neural networks are insufficient for "real" cognition.

You seem to have replied on a tangent: how is what you describe not just "curve fitting"?

Humans didn't magic that model up: you're ignoring the huge amount of human effort over thousands of years that it took to arrive at that model. If we gave an ML algorithm a similar amount of time and asked it to construct a simple model of the situation, it might very well hand back the formula you presented.

Your entire post basically begs the question: it supposes that humans are doing something that isn't "curve fitting", and then uses that to argue that they do more.

What, specifically, are you supposing can't be done by "curve fitting"?

I believe the process for deriving fundamental physical models differs from the techniques used in ML. For example, say we want to use the principle of least action to derive an expression for energy similar to what Landau and Lifshitz derive in their book Mechanics. Here, we assume that the motion of a particle is defined by its position and velocity. We assume that the motion of the particle is defined by an optimization principle. We assume Galilean invariance. We assume that space and time are homogeneous and isotropic. Then, putting this all together, we can derive the expression for energy `E = 0.5 m v^2`. At this point, we can validate our model with a series of experiments that curve fit this expression to the results.
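For readers who want the steps spelled out, a compressed sketch of that textbook argument for a free particle follows. It is a summary of the standard derivation under the assumptions listed above, with `L` denoting the Lagrangian and `\epsilon` a small velocity shift (both symbols introduced here), not a quotation from the book.

```latex
\begin{align*}
  % Homogeneity of space and time: no explicit dependence on position or time.
  % Isotropy of space: dependence on the velocity only through its magnitude.
  L &= L(v^2) \\
  % Galilean invariance: under v -> v + \epsilon the Lagrangian may change
  % only by a total time derivative, which forces L to be proportional to v^2.
  L &= \tfrac{1}{2} m v^2 \\
  % Energy is the quantity conserved because time is homogeneous.
  E &= \mathbf{v} \cdot \frac{\partial L}{\partial \mathbf{v}} - L
     = m v^2 - \tfrac{1}{2} m v^2
     = \tfrac{1}{2} m v^2
\end{align*}
```

Every line is tied to one of the stated assumptions, which is the point being made: the parabola is a consequence of symmetry requirements, not of fitting data.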

Alternatively, we could just run a bunch of experiments on data using ML models. Eventually, someone may have a wonderful idea and realize that we can just reduce the ML model into a parabola. Of course, this is due to intuition and not the ML model. Nevertheless, even though we end up at the same result, I contend the first result is different. It has a huge amount of information embedded into it about the assumptions we made about how the world works. When those assumptions are no longer satisfied, we have a rubric for constructing a fix. For example, if Galilean invariance no longer holds, we can fix the above model using the same sort of derivations to obtain relativistic expressions. Again, we could just throw more data at this new problem and fit an ML model to it, and perhaps someone would stare at this new model and realize that `E = m c^2`. However, I think that's discounting the embedded information in deriving these models and I don't think this information is present in ML models. ML models are generic. Our most powerful physical models are not.

Now, sure, once we have the models, we're just going to fit them to the data and it's all just curve fitting. Other fields call this parameter estimation, parameter identification, or a variety of other names. At that point it's all curve fitting. However, again, I contend the process for determining a new model is not.

Of course. "What do I fit this curve to" is a prerequisite to "what is the shape of this curve?"

You shouldn't feel the need to defend theory-based modeling against some imagined incursion from arrogant deep learning researchers. NNs work tremendously well in a few specific problem domains that we had no way to approach otherwise. Elsewhere, they're not much better than any other prediction algorithm. By the way XGBoost is curve-fitting, too.

I very much agree! Barring some kind of special intuition into the problem, I think ML models are a fantastic tool for building models from empirical data. Even with intuition, sometimes they work as well. My core argument is that anthropomorphizing the algorithms has led to a great deal of confusion as to when we should or should not use these models. I often do computational modeling work with engineers and many of them are starting to eschew good, foundationally sound models for ML, not because they work better (in fact, on many of these problems they work far, far worse), but because good computational modeling is hard and it sounds like all they have to do with ML is teach the algorithms how physics works and how to be an engineer. Since they're good teachers, they should be able to teach the algorithm, right? In reality, it's still dirty, grinding computational modeling work. If we just called these models what they really are, empirical models, I think there'd be far less confusion as to when they should be used.

You haven't explained how the first case isn't "curve fitting": the agents performing the compilation of those facts into the new fact are just spitting out the "best" fit string of symbols based on learned rules, etc etc. Something computers can (theoretically) do, and which fits the description "curve fitting" just fine. School (and other education) is training the model they're using to do that compilation, but it's still just "curve fitting" based on reward/punishment signals.

What part of that can't an ML agent learn to do?

From my perspective, you're just describing the "higher order" layers of the network and pretending that humans aren't actually running those functions embedded on deep networks, then proclaiming that deep networks can't do it.

Alright, so from my perspective, curve fitting consists of three things

1. Definition of a model. ML models like multilayer perceptrons used a superposition of sigmoids, but newer models have superpositions of other functions and more nested hierarchies.

2. A metric to define misfit. Most of the time we use least squares because it's differentiable, but other metrics are possible.

3. An optimization algorithm to minimize misfit. Backpropagation is a combination of an unglobalized steepest descent with an automatic-differentiation-like algorithm to obtain the derivatives. However, there is a small crowd that uses Newton methods.

Literally, this means curve fitting is something like the problem

min_{params} 0.5 sum_i || model(params, input_i) - output_i ||^2
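The three ingredients can be sketched in a few lines. This is only an illustration under stated assumptions: the model is a superposition of two sigmoids, the misfit is the least-squares sum above, and the optimizer is fixed-step steepest descent with hand-derived gradients standing in for automatic differentiation. The data, starting point, and step size are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(c, w, b, x, y):
    """Misfit 0.5 * sum_i (model(x_i) - y_i)^2 and its gradient.

    Model (ingredient 1): model(x) = sum_k c_k * sigmoid(w_k * x + b_k).
    """
    s = sigmoid(np.outer(x, w) + b)            # shape (N, K)
    residual = s @ c - y                       # shape (N,)
    loss = 0.5 * np.sum(residual**2)           # ingredient 2: least squares
    # Chain rule by hand; an autodiff tool would produce the same numbers.
    common = residual[:, None] * s * (1.0 - s) * c
    grad_c = s.T @ residual
    grad_w = (common * x[:, None]).sum(axis=0)
    grad_b = common.sum(axis=0)
    return loss, grad_c, grad_w, grad_b

# Hypothetical data: samples of tanh on [-2, 2].
x = np.linspace(-2.0, 2.0, 20)
y = np.tanh(x)

# Arbitrary starting parameters for K = 2 sigmoids.
c = np.array([0.5, -0.5])
w = np.array([1.0, -1.0])
b = np.zeros(2)

step = 0.01                                    # fixed step, no line search
initial_loss, *_ = loss_and_grad(c, w, b, x, y)
for _ in range(500):                           # ingredient 3: steepest descent
    _, g_c, g_w, g_b = loss_and_grad(c, w, b, x, y)
    c, w, b = c - step * g_c, w - step * g_w, b - step * g_b
final_loss, *_ = loss_and_grad(c, w, b, x, y)
print(initial_loss, final_loss)
```

"Unglobalized" here means there is no line search or trust region guarding the step, so convergence depends on the fixed step size being small enough for this particular problem.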

Of course, there's also a huge number of assumptions in this. First, optimization requires a metric space since we typically want to make sure we're lower than all the points surrounding it. Though, this isn't all that helpful from an algorithmic point of view, so we really need a complete inner product space in order to derive optimality conditions like the gradient of the objective being zero. Alright, fine, that means if we want to do what you say then we need to figure out how to compile these facts into a Hilbert space. Maybe that's possible and it raises some interesting questions. For example, Hilbert spaces have the property that `alpha x + y` also lies in the vector space. If `x` is an assumption like Galilean invariance and `y` is an assumption that time and space are isotropic, I'm not sure what the linear combination would be, but perhaps it's interesting. Hilbert spaces also require inner products to be well defined and I'm not sure what the inner product between these two assumptions is either. Of course, we don't technically need a Hilbert or Banach space to optimize. Certainly, we lose gradients and derivatives, but there may be something else we can do. Of course, that would involve creating an entire new field of computational optimization theory that's not dependent on derivatives and calculus, which would be amazing, but we don't currently have one.

From a philosophical point of view, there may be a reasonable argument that everything in life is mapping inputs to outputs. From a practical point of view, this is hard and the foundation upon which ML is cast is based on certain assumptions like the three components above, which have assumptions on the structures we can deal with. Until that changes, I continue to contend that, no, ML does not provide a mechanism for deriving new fundamental physical models.

What do you think about a bayesian interpretation of the above as MAP/MLE?
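For context, the textbook correspondence behind this question can be sketched as follows. It assumes independent Gaussian noise of variance `\sigma^2` on the outputs and, for the MAP case, a zero-mean Gaussian prior of variance `\tau^2` on the parameters; `\theta`, `x_i`, `y_i`, and `f` are shorthand introduced here for `params`, `input_i`, `output_i`, and `model`.

```latex
\begin{align*}
  % Likelihood under y_i = f(\theta, x_i) + noise, noise ~ N(0, \sigma^2 I)
  -\log p(y \mid \theta)
    &= \frac{1}{2\sigma^2} \sum_i \lVert f(\theta, x_i) - y_i \rVert^2
       + \text{const} \\
  % Maximum likelihood is therefore the least-squares problem above
  \hat{\theta}_{\mathrm{MLE}}
    &= \arg\min_\theta \; \tfrac{1}{2} \sum_i \lVert f(\theta, x_i) - y_i \rVert^2 \\
  % A Gaussian prior \theta ~ N(0, \tau^2 I) adds a quadratic penalty
  \hat{\theta}_{\mathrm{MAP}}
    &= \arg\min_\theta \; \frac{1}{2\sigma^2} \sum_i \lVert f(\theta, x_i) - y_i \rVert^2
       + \frac{1}{2\tau^2} \lVert \theta \rVert^2
\end{align*}
```

Under these assumptions the Bayesian reading changes the interpretation of the misfit, not the kind of optimization problem being solved.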


Unless I'm missing something, and I likely am, the linked paper is still based on the fundamental assumptions behind curve fitting that I listed above. Namely, their optimization algorithms, metrics, and models are still based on Hilbert spaces even though they've added stochastic elements and more sophisticated models.

Interesting abstract. I love Bayesian stats so hopefully this will be a fun commute read. Thanks!

I think you're reading way too far into my post. I was just pointing out that our amazing AI revolution is really just a new type of function approximation that has magical-seeming results.

I can't think of a succinct way to describe my response, but I'm not sure we disagree, so much as we're talking about slightly different things.

Regardless, I wanted to thank you for the detailed replies -- having a back and forth helped me ponder my thoughts on the matter.

Have a good one. (:

Thanks for chatting!

> Eventually, it will work, but how much more data do we need?

For a model that small, with so little variance (assume you measure correctly where the ball lands) it would be enough to do just a few throws to fit the parameters.

I hope Elon Musk understands that.

I am sure he does.

His public statements would indicate otherwise.

Consider that his public statements are made on the advice of his publicist, and that encouraging the AI hype is self-serving.

Or finding eigenvalues.

Is what you do not "just curve fitting"?

The anthropomorphization was done by academic researchers to gain/increase funding for themselves and the field. You can read the papers and see. This is commonly done for marketing purposes and is important since the pool of research money can be limited.

I always counter that the intelligence is not in the machine, but in the builder. Anthropomorphism is in line with that, because it projects human qualities onto the machine, because, in a broad sense, they are modelled after those. Egoistic as we are, that's the only way to understand anything: to remove the schism between animate and inanimate objects. Just like a fishing rod is just the extension of an arm.

> The model's intuition doesn't work like a human's

The model doesn't have intuition, it is just a series of computations.

There is some good information in there and I agree with the limitations he states, but his conclusion is completely made up.

"To lift some of these limitations and start competing with human brains, we need to move away from straightforward input-to-output mappings, and on to reasoning and abstraction."

There are tens of thousands of scientists and researchers who are studying the brain at every level and we are making tiny dents in understanding it. We have no idea what the key ingredient is, nor whether it is one or many ingredients that will take us to the next level. Look at deep learning: we have had the techniques for it since the 70's, yet it is only now that we can start to exploit it. Some people think the next thing is the connectome, time, forgetting neurons, oscillations, number counting, embodied cognition, emotions, etc. No one really knows and it is very hard to test; the only "smart beings" we know of are ourselves and we can't really do experiments on humans because of laws and ethical reasons. Computer scientists like many of us here like to theorize on how AI could work, but very little of it is tested out. I wish we had a faster way to test out more competing theories and models.

>I wish we had a faster way to test out more competing theories and models.

Luckily, the state of actual cognitive science and neuroscience is fairly far ahead of, "Gosh there's all these things and we just don't know." Unfortunately, MIT-style cogsci hasn't generated New Jerseyan fast-though-wrong algorithms for Silicon Valley to hype up, so the popular press keeps spreading the myth of our total ignorance.

Besides which, we do know what's missing from deep learning: the ability to express anything other than a trivial Euclidean-space topological structure. We know that real data is sampled from a world subject to cause-and-effect, and that any manifold describing the data should carry the causal structure in its own topology.

Hardly a "made-up" conclusion -- just a teaser for the next post, which deals with how we can achieve "extreme generalization" via abstraction and reasoning, and how we can concretely implement those in machine learning models.

> we can't really do experiments on humans because of laws and ethical reasons.

Ethics and laws constrain but do not forbid experimenting on humans. We do experiments on humans all the time, including experiments on how people learn and reason. There are numerous academic journals devoted to these topics.

I don't think he's talking about literally mimicking how the human brain works. It seems like he's just talking about making neural nets more effective in certain tasks by allowing for more types of abstraction, just like a human brain has more types of abstraction than Artificial Neural Networks do.

This article is a bit misleading. I believe NNs are a lot like the human brain. But just the lowest level of our brain. What psychologists might call "procedural knowledge".

Example: learning to ride a bike. You have no idea how you do it. You can't explain it in words. It requires tons of trial and error. You can give a bike to a physicist that has a perfect deep understanding of the laws of physics. And they won't be any better at riding than a kid.

And after you learn to ride, change the bike. Take one where the handlebar is reversed, so that turning it right turns the wheel left. No matter how good you are at riding a normal bike, no matter how easy it seems it should be, it's very hard. It requires relearning how to ride basically from scratch. And when you are done, you will even have trouble going back to a normal bike. This sounds similar to the problems of deep reinforcement learning, right?

If you use only the parts of the brain you use to ride a bike, would you be able to do any of the tasks described in the article? E.g. learn to guide spacecraft trajectories with little training, through purely analog controls and muscle memory? Can you even sort a list in your head without the use of pencil and paper?

Similarly recognizing a toothbrush as a baseball bat isn't as bizarre as you think. Most NNs get one pass over an image. Imagine you were flashed that image for just a millisecond. And given no time to process it. No time to even scan it with your eyes! You certain you wouldn't make any mistakes?

But we can augment NNs with attention, with feedback to lower layers from higher layers, and other tricks that might make them more like human vision. It's just very expensive.

And that's another limitation. Our largest networks are incredibly tiny compared to the human brain. It's amazing they can do anything at all. It's unrealistic to expect them to be flawless.

It's a good article in a lot of ways, and provides some warnings that many neural net evangelists should take to heart, but I agree it has some problems.

It's a bit unclear whether Fchollet is asserting that (A) Deep Learning has fundamental theoretical limitations on what it can achieve, or rather (B) that we have yet to discover ways of extracting human-like performance from it.

Certainly I agree with (B) that the current generation of models are little more than 'pattern matching', and the SOTA CNNs are, at best, something like small pieces of visual cortex or insect brains. But rather than deriding this limitation I'm more impressed at the range of tasks "mere" pattern matching is able to do so well - that's my takeaway.

But I also disagree with the distinction he makes between "local" and "extreme" generalization, or at least would contend that it's not a hard, or particularly meaningful, epistemic distinction. It is totally unsurprising that high-level planning and abstract reasoning capabilities are lacking in neural nets because the tasks we set them are so narrowly focused in scope. A neural net doesn't have a childhood, a desire/need to sustain itself, it doesn't grapple with its identity and mortality, set life goals for itself, forge relationships with others, or ponder the cosmos. And these types of quintessentially human activities are what I believe our capacities for high-level planning, reasoning with formal logic etc. arose to service. For this reason it's not obvious to me that a deep-learning-like system (with sufficient conception of causality, scarcity of resources, sanctity of life and so forth) would ALWAYS have to expend 1000s of fruitless trials crashing the rocket into the moon. It's conceivable that a system could know to develop an internal model of celestial mechanics and use it as a kind of staging area to plan trajectories.

I think there's a danger of questionable philosophy of mind assertions creeping into the discussion here (I've already read several poor or irrelevant expositions of Searle's Chinese Room in the comments). The high-level planning, and "true understanding" stuff sounds very much like what was debated for the last 25 years in philosophy of mind circles, under the rubric of "systematicity" in connectionist computational theories of mind. While I don't want to attempt a single-sentence exposition of this complicated debate, I will say that the requirement for "real understanding" (read systematicity) in AI systems, beyond mechanistic manipulation of tokens, is one that has been often criticised as ill-posed and potentially lacking even in human thought; leading to many movements of the goalposts vis-à-vis what "real understanding" actually is.

It's not clear to me that "real understanding" is not, or at least cannot be legitimately conceptualized as, some kind of geometric transformation from inputs to outputs - not least because vector spaces and their morphisms are pretty general mathematical objects.

EDIT: a word

I similarly find myself frustrated with philosophy of mind "contributions" to conversations on deep learning/consciousness/AI. There seems to be a lot of equivocation between the things you label as (a) and (b) above, and a lot of apathy toward distinguishing between them. But (a) and (b) are completely different things, and too often it seems like critics of computers doing smart things treat arguments for one like they are arguments for the other.

Probably the most famous AI critic, Hubert Dreyfus, said "current claims and hopes for progress in models for making computers intelligent are like the belief that someone climbing a tree is making progress toward reaching the moon." But it is progress. Because by climbing a tree I've gained much more than height. I actually did move toward the moon. I've gained the insight that I'm using the right principle.

Surely we shouldn't rush to anthropomorphize neural networks, but we'd be ignoring the obvious if we didn't at least note that neural networks do seem to share some structural similarities with our own brains, at least at a very low level, and that they seem to do well with a lot of pattern-recognition problems that we've traditionally considered to be co-incident with brains rather than logical systems.

The article notes, "Machine learning models have no access to such experiences and thus cannot "understand" their inputs in any human-relatable way". But this ignores a lot of the subtlety in psychological models of human consciousness. In particular, I'm thinking of Dual Process Theory as typified by Kahneman's "System 1" and "System 2". System 1 is described as a tireless but largely unconscious and heavily biased pattern recognizer - subject to strange fallacies and working on heuristics and cribs, it reacts to its environment when it believes that it recognizes stimuli, and notifies the more conscious "System 2" when it doesn't.

At the very least it seems like neural networks have a lot in common with Kahneman's "System 1".

>In particular, I'm thinking of Dual Process Theory

Which has been at least partly debunked as psychology's replication crisis went on, and has been called into question on the neuroscientific angle as well.

Partly, yes - especially with ego depletion on the ropes. I'm not sure that dual process theory needs to be thrown out along with ego depletion, though.

I can see three reasons to "throw it out":

1) Replication failure, plain and simple.

2) Overfitting. There are dozens to hundreds of "cognitive biases" on lists: https://en.wikipedia.org/wiki/List_of_cognitive_biases. When you have hundreds of individual points, you really ought to draw some principles, and the principle should not be, "The system generating all this is rigid and inflexible."

3) Imprecision! Again, dozens to hundreds of cognitive biases. What possible behavior or cognitive performance can't be assimilated into the heuristics and biases theory? What can falsify it overall, even after so many of its individual supporting experiments and predictions have fallen down?

It looks like a mere taxonomy of observations, not a substantive theory.

>1) Replication failure, plain and simple.

How many meta-analyses have been conducted as of 2017 showing one result or the other? I don't think ego depletion itself has been thoroughly "debunked" yet. If it is a real effect, it's probably quite small - but I don't think that ego depletion has been thrown in the bin just yet.

>2) Overfitting. There are dozens to hundreds of "cognitive biases" on lists: https://en.wikipedia.org/wiki/List_of_cognitive_biases. When you have hundreds of individual points, you really ought to draw some principles, and the principle should not be, "The system generating all this is rigid and inflexible."

>3) Imprecision! Again, dozens to hundreds of cognitive biases. What possible behavior or cognitive performance can't be assimilated into the heuristics and biases theory? What can falsify it overall, even after so many of its individual supporting experiments and predictions have fallen down?

Wait a second - has anyone ever tried to explain the "IKEA Effect" using Dual Process Theory? What does a laundry-list of supposed cognitive biases have to do with the theory? Is anyone really trying to explain/predict all this almanac-of-cognitive-failings with Dual Process?

>Is anyone really trying to explain/predict all this almanac-of-cognitive-failings with Dual Process?

To my understanding, yes. That's basically what Dual Process theories exist for: to separate the brain into heuristic/bias processing as one process, and computationally expensive model-based cause-and-effect reasoning as another process. Various known cognitive processes or results are then sort of classified on one side of the line or another.

When you apply Dual Process paradigms to specific corners of cognition, they can be useful. For example, I've seen papers purporting to show that measured uncertainty allows model-free and model-based reinforcement learning algorithms to trade off decision-making "authority". This is less elegant than an explicitly precision-measuring free-energy counterpart, but it's still a viable hypothesis about how the brain can implement a form of bounded rationality when bounded in both sample data and compute power.

But when you scale Dual Processes up to a whole-brain theory, it's just too good at describing anything that involves dichotomizing into a "fast-and-frugal" form of processing and another expensive, reconstructive form of processing. One of the big issues here is that besides the potentially false original evidence for Dual Processes, we don't necessarily have reason to believe there exists any dichotomy, rather than a more continuous tradeoff between frugal heuristic processing and difficult reconstructive processing. The precision-weighting model-selection theory actually makes much more sense here.

This is a fantastic answer - thank you, Eli. So what do you think of the original article?

>This is a fantastic answer - thank you, Eli.

Thanks! I've been doing a lot of amateur reading in cog-sci and theoretical neurosci. The subject enthuses me enough that I'm applying to PhD programs in it this upcoming season.

>So what do you think of the original article?

Thorough and accurate. I'll give a little expansion via my own thought. One thing taught in every theoretically-focused ML class is the No Free Lunch Theorem. In colloquial terms it says, "If you don't make some simplifying assumptions about the function you're trying to learn (and the distribution noising your data), you can't reliably learn."
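A toy way to see the colloquial statement above, offered as a sketch rather than the theorem itself: if every possible labeling of a tiny input space is treated as equally likely (that uniform assumption, the 3-bit inputs, and both learners below are made up for illustration), then any learner averages exactly 50% accuracy on the points it has not seen.

```python
from itertools import product

# All 3-bit inputs; the first 5 are "seen" in training, the last 3 are held out.
inputs = list(product([0, 1], repeat=3))
train_x, test_x = inputs[:5], inputs[5:]

def majority_learner(train_labels):
    """Always predict the most common training label (ties go to 0)."""
    guess = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return lambda x: guess

def nearest_neighbor_learner(train_labels):
    """Predict the label of the closest training point in Hamming distance."""
    def predict(x):
        dists = [sum(a != b for a, b in zip(x, t)) for t in train_x]
        return train_labels[dists.index(min(dists))]
    return predict

def average_test_accuracy(make_learner):
    """Average held-out accuracy over ALL 256 possible target functions."""
    correct = total = 0
    for labels in product([0, 1], repeat=len(inputs)):
        target = dict(zip(inputs, labels))
        model = make_learner([target[x] for x in train_x])
        for x in test_x:
            correct += model(x) == target[x]
            total += 1
    return correct / total

print(average_test_accuracy(majority_learner))
print(average_test_accuracy(nearest_neighbor_learner))
```

Both learners come out the same because, with no assumption about the target function, the training labels carry no information about the held-out labels. Any "niceness" assumption amounts to weighting some of those 256 functions more heavily than others.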

I think experts learn this, appreciate it as a point of theory, and then often forget to really bring it back up and rethink it where it's applicable. All statistical learning takes place subject to assumptions of "niceness". Which assumptions, though?

Seems to me like:

* If you make certain "niceness" assumptions about the functions in your hypothesis space, but few to none about the distribution, you're a Machine Learner.

* If you make niceness assumptions about your distribution, but don't quite care about the generating function itself, you're an Applied Statistician.

* If you make niceness assumptions about your data, that it was generated from some family of distributions on which you can make inferences, you're a fully frequentist or Bayesian statistician.

* If you want to make almost no assumptions about the generating process yielding the data, but still want just enough assumptions to make reasoning possible, you may be working in the vicinity of any of cognitive science, neuroscience, or artificial intelligence.

The key thing you always have to remind yourself is: you are making assumptions. The question is: which ones? The original article reminds us of a whole lot of the assumptions behind current deep learning:

* The "layers" we care about are compositions of a continuous nonlinear function with a linear transform.

* The functions we care about are compositions of "layers".

* The transforms we care about are probably convolutions or just linear-and-rectified, or just linear-and-sigmoid.

* Composing layers enables gradient information to "fan out" from the loss function to wider and wider places in the early layers.

* The data spaces we care about are usually Euclidean.
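The first two assumptions in the list above can be written down in a few lines. This is a minimal sketch in plain Python; the weights, the layer sizes, and the helper names (`linear`, `layer`, `compose`) are made up for illustration and are not taken from the article or from any library.

```python
import math

def linear(weights, bias, x):
    """Linear transform: one output per weight row, W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    """Linear-and-rectified nonlinearity."""
    return [max(0.0, a) for a in v]

def sigmoid(v):
    """Linear-and-sigmoid nonlinearity."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def layer(weights, bias, activation):
    """A 'layer': a continuous nonlinear function composed with a linear transform."""
    return lambda x: activation(linear(weights, bias, x))

def compose(*layers):
    """The function we care about: a composition of layers."""
    def model(x):
        for f in layers:
            x = f(x)
        return x
    return model

# Hypothetical hand-picked weights, chosen only so the numbers are easy to follow.
hidden = layer([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu)
output = layer([[1.0, 1.0]], [-0.5], sigmoid)
model = compose(hidden, output)

print(hidden([1.0, 0.0]))  # rectified linear features
print(model([1.0, 0.0]))   # a single value between 0 and 1
```

Everything a model of this shape can express is fixed by these choices, which is the point of listing them as assumptions.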

These are things every expert knows, but which most people only question when it's time to look at the limitations of current methods. The author of the original article appears well-versed in everything, and I'm really excited to see what they've got for the next part.

"Machine learning models have no access to such experiences and thus cannot "understand" their inputs in any human-relatable way"

It may be that distinctions like the one you're describing here are useful to make, but I don't think this claim refutes the possibility of ML "fitting a particular piece within a larger, yet unarticulated model."

I think the assertion is more that our current ways of representing elements of human experience are necessarily very lossy - or that there's some aspect of the situation that you can't describe/implement in terms of models of neural nets.

The problem with neural nets is that they have a fixed input type - tensors or sequences. For example, imagine the task is to count objects in an image and say if the number of red objects is equal to the number of green objects. You make a net that solves this situation. Then you want to change the colors, or add an extra color, and it will fail. Why - because it learns a fixed input representation.

What neural nets need is to change their data format from plain tensors to object-relation graphs. The input of the network is represented as a set of objects that have relations among them, and the network has to be invariant to permutations of the order of presentation. An implementation is Graph Convolutional Nets. They learn to compose concepts in new ways, and once they have learned to count, compare, and select by color, they can solve any combination of those concepts as well. That way the nets generalize better and transfer knowledge from one problem to the next.
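To make "a set of objects" and "invariant to the order of presentation" concrete, here is a small hand-written sketch. It is not a graph convolutional net and nothing in it is learned; the object format and the function names are assumptions for illustration. It only shows that encoding each object separately and pooling with a sum gives the same answer for any ordering, and that a new color needs no change to the input format.

```python
from collections import Counter
import random

def encode(obj):
    """Per-object encoding: a count of 1 for the object's color."""
    return Counter({obj["color"]: 1})

def pool(objects):
    """Sum-pool the per-object encodings; a sum does not depend on order."""
    total = Counter()
    for obj in objects:
        total += encode(obj)
    return total

def same_count(objects, color_a, color_b):
    """Answer 'are there as many color_a objects as color_b objects?'."""
    counts = pool(objects)
    return counts[color_a] == counts[color_b]

scene = [{"color": "red"}, {"color": "green"}, {"color": "red"},
         {"color": "green"}, {"color": "blue"}]
shuffled = scene[:]
random.shuffle(shuffled)

print(pool(scene) == pool(shuffled))        # same result for any ordering
print(same_count(scene, "red", "green"))
print(same_count(scene, "red", "blue"))     # an extra color needs no new input format
```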

Graphs are able to reduce the complexity of learning a neural net that can perform flexible tasks. But in order to get to even better results, it is necessary to add simulation to the mix. By equipping neural nets with simulators, we can simplify the learning problem (because the net doesn't have to learn the dynamics of the environment as well, just the task at hand). Examples of simulators used in DL are AlphaGo, the Reinforcement Learning applications on Atari Games, protein/drug property prediction, generative adversarial networks (in a way).

The interesting thing is that graphs are natural for simulation. They can represent objects as vertices and relations as edges, and by signal propagation the graph works like a circuit, a simulator, producing the answer. My bet is on graphs + simulators. That's how we get to the next level (abstraction and reasoning). DeepMind seems to be particularly focused on RL, games and recently, relation networks. There is also work on making dynamic routing in neural nets, in fact applying graphs implicitly inside the net, by multiple attention heads.

There is actually work being done on this problem, at least to some extent. A DNC, for instance, can accept variable-structure inputs by storing each piece in its external memory bank. This is illustrated in the original Nature paper by feeding in an arbitrary graph definition piece by piece, then feeding in a query about the graph.

This doesn't necessarily address all the nuances of your post, but I do believe it's a step in the right direction. It pushes networks from

"learn how to solve this completely statically defined problem via sophisticated pattern matching"

to

"learn how to interpret a query, drawn from some restricted class of possible queries; accept variable-structure input to the query; strategize about techniques for answering the query; and finally compute the answer, possibly over multiple time-steps"

A neat technique to help 'explain' models is LIME: https://www.oreilly.com/learning/introduction-to-local-inter...

There is a video here https://www.youtube.com/watch?v=hUnRCxnydCc

I think this has some better examples than the Panda vs Gibbon example in the OP if you want to 'see' why a model may classify a tree-frog as a tree-frog vs a billiard (for example). IMO this suggests some level of anthropomorphizing is useful for understanding and building models, as the pixels the model picks up aren't really too dissimilar to what I imagine a naive, simple mind might use (i.e. the tree-frog's goofy face). We like to look at faces for lots of reasons, but one of them is probably that they're usually more distinct, which is roughly the same reason why the model likes the face. This is interesting (to me at least) even if it's just matrix multiplication (or uncrumpling high dimensional manifolds) underneath the hood.

I think the requirement for a large amount of data is the biggest objection to the reflex "AI will replace [insert your profession here] soon" that many techies, in particular on HN, have.

There are many professions where there is very little data available to learn from. In some cases (self-driving), companies will invest large amounts of money to build this data, by running lots of test self-driving cars, or paying people to create the data, and it is viable given the size of the market behind it. But the typical high-value intellectual profession is often a niche market with a handful of specialists in the world. Think of a trader of financial institutions bonds, or a lawyer specialized in cross-border mining acquisitions, a physician specialist of a rare disease or a salesperson for aviation parts. What data are you going to train your algorithm with?

The second objection, probably equally important, also applies to "software will replace [insert your boring repetitive mindless profession here]", even after 30 years of broad adoption of computers. If you decide to automate some repetitive mundane tasks, you can spare the salary of the guys who did these tasks, but now you need to pay the salary of a full team of AI specialists / software developers. Now for many tasks (CAD, accounting, mailings, etc), the market is big enough to justify a software company making this investment. But there is a huge number of professions where you are never going to break even, and where humans are still paid to do stupid tasks that software could easily do today (even in VBA), and will keep doing so until the cost of developing and maintaining software or AI has dropped to zero.

I don't see that happening in my life. In fact I am not even sure we are training that many more computer science specialists than 10 years ago. Again, didn't happen with software for very basic things, why would it happen with AI for more complicated things.

I'd say on the contrary, the problem with experts is that they are so expensive to train and so rare. It is easier to collect data, train the AI and then equip doctors all over the world with it than to have thousands of experts in that particular field.

A doctor that treats patients all day long doesn't have time to keep up with the research and state of the art. A researcher that is on the cutting edge of medicine doesn't have time to treat the patients. We need to equip doctors with AIs to keep them up to date with the best practices.

well, you are mentioning an example where:

- there is data

- there is a wide market that could justify large investments in AI

With this combination, yeah I can see AI being used. In fact medicine is one of the few professions that never industrialised. But there are loads of other professions where only one or neither of the conditions above is met.

If you are talking about a doctor specialised in a rare disease, where there is very little data, and very few patients to cure, how do you think AI will replace that?

> If you are talking about a doctor specialised in a rare disease, where there is very little data, and very few patients to cure, how do you think AI will replace that?

Well, since transfer learning is a thing, you would start with a general purpose medical system and then train it on what little data you do have on the rare disease to produce an appropriate model (which isn't too different from the way a human expert is produced). In fact, I would assume that the first such systems will be created and used by the researchers focusing on rare diseases.

Yes, experts will use AI to build better tools that will be accessible via computer and by less-expert physicians to augment their skill base in diagnosis. Trying to fully automate the physician promises to be so difficult and expensive (and legally fraught) that the value it adds will be nowhere near worth the investment. No doubt a few direct-to-patient AI-based apps and services will arise in this space, and maybe AI will allow generalists to extend their reach further into the space of specialists. But robot doctors will remain the stuff of fantasy for many decades yet, I suspect.

> and will keep doing so until the cost of developing and maintaining software or AI has dropped to zero.

I have no idea about the progress of AI, but normal software will get an order of magnitude cheaper to develop as we slowly wake up from the Unix/worse-is-better/everything-is-text mindset and abandon the dynamically typed and imperative languages, broken systems abstractions, etc. that hold us back.

I think it's a more fundamental problem than the choice of languages (though I share your griefs!).

To the vast majority of the educated population, software is very much a black art and people would have no idea of how to do even the most basic things. That's of course true for more senior people, but I find that it is as true for the generation who graduates today. They can do incredible things with their smartphone that I didn't suspect was possible, but wouldn't know where to start to code something.

Until this skill gap changes dramatically, and everyone gets out of high school with basic knowledge of programming, like they have basic knowledge of maths, biology, physics or history, this gap will never close.

May God will it that the heathens should see the light and come to Haskell ;-).

I sincerely would like to know what you think the alternatives are?

Sounds something like Haskell with a Smalltalk environment. Functional, statically typed with powerful type extensions, but with an image instead of text files that you modify.

From just using Jupyter Notebooks, I can see the appeal of working with a live environment, and it's just a fancy REPL, not a full Lisp or Smalltalk environment.

If it has to be a general public language, I'm afraid it will have to be light on special characters and abbreviations or acronyms that made sense 30 years ago. I'd say a Basic or Python-like language, but modernised, and with strong typing to enable the IDE to help a lot the users with auto-completion and error checking.

But if you think about it, most business users are even intimidated by VBA. So it will have to be very fluffy, and I don't think you can spare the mandatory coding 101 teaching at school.

Programming is describing a solution space, and we describe things with words. I don't see how anything but text/speech would map to that aspect of programming.

I'm not holding my breath for that development

Correct me if I'm wrong but I don't see that with 'deep learning' we have answered/solved any of the philosophical problems of AI that existed 25 years ago (stopped paying attention about then).

Yes we have engineered better NN implementations and have more compute power, and thus can solve a broader set of engineering problems with this tool, but is that it?

Yep. The whole machine learning craze is just fueled by the fact that it's now feasible to create models for handwriting/voice/image recognition that actually work reliably. But in terms of the underlying technology, we haven't had some "breakthrough" that explained how the brain works or anything even close to that.

I 90% agree, but if that were all there was to it, we should have been able to achieve the same level of success decades earlier, just by running our ML code longer. Given enough time, old slow AI software should have produced the same analytical results as deep learning does today.

But that's not the case. Deep nets can model vastly more information / state than any other AI/ML method. Once Hinton (and others) showed how to train NNs with more than three layers (ca. 2006) it was finally possible to learn and store all that state. Then with the rise of GPGPUs soon after, deep nets became efficient as well. Thereafter several tasks that had been infeasible even using curated information became amenable to mostly brute force learning strategies driven only by labeled examples -- just lots of 'em.

The question now is how far we can extend DL's tools and examples. Are they sufficient to build higher level cognitive AI agents? Must AGI employ many thousands of deep nets? Or can all those specific-skill nets be folded together somehow into one unified "deep mind"?

Like you, I'm doubtful that today's very specific successes in DL will lead to higher level cognition in the foreseeable future. That path isn't at all clear to me.

This is totally true, but I think it's still important to note that while something like Artificial General Intelligence is still way beyond the state of the art, the state of the art still has a huge impact on the world. A tiny slice of that can be seen in autonomous vehicles and the impact that they seem poised to have.

Don't underestimate the self fulfilling prophecy effect. Quite possible that the massive influx into the field right now will move the needle.

Hmm, sometimes I think that we won't get super close to AGI until we can actually model something the size of a human brain (in terms of neurons or synapses). The human brain has 100B+ neurons, or 4Qu+ synapses. So that's 12.5 GB all at once to deal with, if you're representing each neuron as either 0 or 1. However, in reality neurons are much more complicated, and you could only treat them as binary if you had a neuromorphic computer. So we would need to deal with many many times that many GB at once, even if we had really efficient ways of storing the data.
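The 12.5 GB figure works out if one assumes roughly 100 billion neurons stored at one bit each. A quick back-of-the-envelope check follows; the 100 billion count is a commonly quoted round number, and the 32-bit case is just an illustrative assumption for "more complicated than binary".

```python
BITS_PER_BYTE = 8
NEURONS = 100 * 10**9  # assumed round number for the human brain

def storage_gb(units, bits_per_unit):
    """Storage in gigabytes (10**9 bytes) for `units` items of `bits_per_unit` bits."""
    return units * bits_per_unit / BITS_PER_BYTE / 1e9

print(storage_gb(NEURONS, 1))   # one bit per neuron
print(storage_gb(NEURONS, 32))  # one 32-bit float of state per neuron
```

This only counts the state itself; weights per synapse and the compute needed for training would add far more.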

That's a lot of data to deal with, especially since you need to train it, running huge computations using each neuron.

I know nothing about hardware, and this is a very crude prediction/estimation of how AGI would happen, but my point is that we might be limited by Hardware for a few more years.

I fully agree with the above responses, and I am optimistic about major breakthroughs. However, like many of you guys, I don't think we should just assume a bit more horsepower and things will magically work.

Yeah, I think the author's just priming the pump for a few more posts in this series that show a new way of abstraction/new NN architecture that solves some new problems.

Doesn't seem like he's trying to claim anything philosophical.

> In short, deep learning models do not have any understanding of their input, at least not in any human sense. Our own understanding of images, sounds, and language, is grounded in our sensorimotor experience as humans—as embodied earthly creatures.

Well maybe we should train systems with all our sensory inputs first, like a newborn learns about the world. Then make these models available open source like we release operating systems so others can build on top of that.

For example we have ImageNet, but we don't have WalkNet, TasteNet, TouchNet, SmellNet, HearNet... or other extremely detailed sensory data recorded for an extended time. And these should be connected to match the experiences. At least I have no idea they are out there :)

Brooks' 'Intelligence Without Representation' (http://people.csail.mit.edu/brooks/papers/representation.pdf) starts with a pretty strong argument imo against the story of 'stick-together' AGI you're describing.

I think Brooks' Cog initiative was an attempt to 'ground' the robot's perceptions of the physical world into forming a rich scalable representation model. But it looks like that line of investigation ended ~2003 with Brooks' retirement. Too bad, given the seeming suitability of using deep nets to implement it.


Thanks for the link to this interesting paper.

I think we're seeing some recapitulation of those arguments WRT 'ensembles of DL models' approaches.

I agree. Google has come out with some papers that are, to put it harshly, basic gluing together of DL models followed by loads of training on their compute resources.

Not just Google. The FractalNet paper comes to mind.

This approach has always interested me. I can train a decent Cats Vs. Dogs classifier in a few minutes. But real human intelligence takes many years of continuous and varied input to develop.

Are there systems out there that are taking influence from newborns being exposed to the world? An unsupervised learning system with a huge array of inputs running for years?

All these current examples of AI and ML are just a very small fraction of what we mean by intelligence so I'm not surprised by the pessimistic posts that hit HN from time to time.

Training systems with rich real world experiences sounds like something that OpenAI should be developing. It's probably not something that you can do over a weekend, plus it takes serious funding and wetware, so that's probably the reason it's not there yet.

See: Kant on a priori notions of space-time.

People doing empirical experiments cannot claim to know the limits of their experimental apparatus.

While the design process of deep networks remains founded in trial and error, and there are no convergence theorems and approximation guarantees, no one can be sure what deep learning can do, and what it could never do.

"Here's what you should remember: the only real success of deep learning so far has been the ability to map space X to space Y using a continuous geometric transform, given large amounts of human-annotated data."

This statement has a few problems - there is no real reason to interpret the transforms as geometric (they are fundamentally just processing a bunch of numbers into other numbers, in what sense is this geometric), and the focus on human-annotated data is not quite right (Deep RL and other things such as representation learning have also achieved impressive results in Deep Learning). More importantly, saying " a deep learning model is "just" a chain of simple, continuous geometric transformations " is pretty misleading; things like the Neural Turing Machine have shown that enough composed simple functions can do pretty surprisingly complex stuff. It's good to point out that most of deep learning is just fancy input->output mappings, but I feel like this post somewhat overstates the limitations.

Just because there's a paper on it, and the model has a name, doesn't mean it works. NTM and deep RL don't work for real problems.

yeah this was my main problem, I guess he is technically right because they are geometric but many of his analogies like the paper crumpling were deeply misleading as they would imply that the transformations are linear. The fact that they are not is fundamental to neural networks working.

Paper-crumpling is nonlinear. Maybe your complaint is rather that paper-crumpling is "only" a topology-preserving diffeomorphism?

Presumably that's why the word "just" is in scare quotes.

This point is very well made: 'local generalization vs. extreme generalization.' Advanced NN's today can locally generalize quite well and there's a lot of research spent to inch their generalization further out. This will probably be done by increasing NN size or increasing the NN building-blocks complexity.

Or maybe increasing NN size/complexity is the 21st century version of adding epicycles to make geocentrism work.


Heh, but it makes geocentrism work better! And we don't yet know what 21st century heliocentrism will look like, while adding epicycles is less daunting.

Yay, I found the rabbit hole - technically, no celestial body rotates purely around the other (thanks mass!). So, perhaps adding epicycles wasn't erroneous after all - just a measurement from a different reference point.

Or "adding a few more rules" to a symbolic system ...

I would really like to hear a definition of what generalization means, because I don't think we have one.

Unless we're talking generalization to arbitrary distributions, which is of course unsolvable.

I tend to think of "generalization", in the general sense (pun not intended), to be information compression.

We are simply trying to answer the question: what is the shortest description (i.e. most informationally compressed/dense version) that fits what we see in this infinite (at least to us mortals) universe of ours? Mathematically the length of such a description can be thought of as the Kolmogorov complexity.

Edit: I should add, the information compression performed when generalizing can be (and often is) lossy.
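Kolmogorov complexity itself is not computable, but the length of a compressed encoding gives a crude upper bound that makes the idea tangible. The sketch below uses zlib purely as a stand-in compressor (an assumption for illustration, not something from the comment above): data with a short description shrinks a lot, while random bytes barely shrink at all.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data: a rough upper bound on description length."""
    return len(zlib.compress(data, 9))

patterned = b"ab" * 500    # 1000 bytes with an obvious short description
noise = os.urandom(1000)   # 1000 bytes with (almost surely) no short description

print(len(patterned), compressed_size(patterned))
print(len(noise), compressed_size(noise))
```

In this picture, "generalizing" from the patterned data means finding the rule "repeat ab", and there is no comparable rule to find in the noise.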

Usually in an ML context, generalization refers to how well a model predicts on unseen inputs (those not in the training set). Usually you do CV (cross-validation) to see how well your model generalizes: you limit the training data it sees, predict held-out inputs whose answers you know, and see how far off the predictions are.
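A minimal sketch of k-fold cross-validation in plain Python, to make the procedure concrete. The toy data (a noiseless line) and the two candidate models are assumptions chosen so the outcome is easy to check; real use would go through a library and real data.

```python
def k_fold_error(xs, ys, fit, k=5):
    """Mean squared error on held-out folds, averaged over k folds."""
    n = len(xs)
    errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        model = fit(train)                 # the model never sees the test fold
        errors.append(sum((model(x) - y) ** 2 for x, y in test) / len(test))
    return sum(errors) / k

def fit_mean(train):
    """Baseline: always predict the mean training target."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def fit_line(train):
    """Ordinary least squares for a single input variable."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x, _ in train))
    return lambda x: my + slope * (x - mx)

xs = [float(i) for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]  # toy data lying exactly on a line

print(k_fold_error(xs, ys, fit_mean))  # large: the baseline generalizes poorly
print(k_fold_error(xs, ys, fit_line))  # near zero: the line generalizes to held-out points
```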

Programmers contemplating the automation of programming:

"To lift some of these limitations and start competing with human brains, we need to move away from straightforward input-to-output mappings, and on to reasoning and abstraction. A likely appropriate substrate for abstract modeling of various situations and concepts is that of computer programs. We have said before (Note: in Deep Learning with Python) that machine learning models could be defined as "learnable programs"; currently we can only learn programs that belong to a very narrow and specific subset of all possible programs. But what if we could learn any program, in a modular and reusable way? Let's see in the next post what the road ahead may look like."

The author said in a Twitter conversation today that he is aware that this phrase is ignoring something essential - namely, that we have systems with memory and attention. That is something different than simple X to y mappings. With memory you can do general computation, recursion, graphs, anything. They work well on some problems such as translation, but still need to become much better in order to match general purpose programming. But at least we're past the X->y phase.

considering they're the author of a python based machine learning library I would sure hope so. Still it seems like a pretty grievous oversight in writing the dang thing at all considering how at least in my fields of research memory-ful networks are increasingly popular.

It was reserved for part 2.

I'm sorry, but I don't understand why wider & deeper networks won't do the job. If it took "sufficiently large" networks and "sufficiently many" examples, I don't understand why it wouldn't just take another order of magnitude of "sufficiency."

If you look at the example with the blue dots on the bottom, would it not just take many more blue dots to fill in what the neural network doesn't know? I understand that adding more blue dots isn't easy - we'll need a huge amount of training data, and huge amounts of compute to follow; but if increasing the scale is what got these to work in the first place, I don't see why we shouldn't try to scale it up even more.

"sufficiently large" could be much more than number of atoms in the universe. You just do not have resources to run computation at such scale.

This is my problem with the thesis that simply scaling deep nets to new heights will ultimately subsume all brain function. If it takes weeks to train a simple object recognizer deep net, how long would it take a grand unified deep net to learn to tie its shoelaces? Puberty?

>But what if we could learn any program, in a modular and reusable way? Let's see in the next post what the road ahead may look like.

I'm really looking forward to this. If it comes out looking like something faster and more usable than Bayesian program induction, RNNs, neural Turing Machines, or Solomonoff Induction, we'll have something really revolutionary on our hands!

Put a lot simpler: Even DL is still only very complex, statistical pattern matching.

While pattern matching can be applied to model the process of cognition, DL cannot really model abstractive intelligence on its own (unless we phrase it as a pattern learning problem, viz. transfer learning, on a very specific abstraction task), and much less can it model consciousness.


Here's how I've been explaining this to non-technical people lately:

"We do not have intelligent machines that can reason. They don't exist yet. What we have today is machines that can learn to recognize patterns at higher levels of abstraction. For example, for image recognition, we have machines that can learn to recognize patterns at the level of pixels as well as at the level of textures, shapes, and objects."

If anyone has a better way of explaining deep learning to non-technical people in a few short sentences, I'd love to see it. Post it here!

I really enjoyed this article. It's the first attempt I've seen to assess deep learning toward the integrated end of human level cognition or AGI.

I found one point especially noteworthy: " So even though a deep learning model can be interpreted as a kind of program, inversely most programs cannot be expressed as deep learning models—for most tasks, either there exists no corresponding practically-sized deep neural network that solves the task, or even if there exists one, it may not be learnable, i.e. the corresponding geometric transform may be far too complex, or there may not be appropriate data available to learn it.

Scaling up current deep learning techniques by stacking more layers and using more training data can only superficially palliate some of these issues. It will not solve the more fundamental problem that deep learning models are very limited in what they can represent, and that most of the programs that one may wish to learn cannot be expressed as a continuous geometric morphing of a data manifold. "

What he seems to be suggesting is that a human level cognition built from deep nets will not be a single unified end-to-end "mind" but a conglomeration of many nets, each with different roles, i.e., a confederation or "society" of deep nets.

I suspect Minsky would have agreed, and then suggested that the interesting part is how one defines, instantiates, and then interconnects the components of this society.

I'm excited to hear about how we bring about abstraction.

I was wondering how a NN would go about discovering F = ma and the laws of motion. As far as I can tell, it has a lot of similarities to how humans would do it. You'd roll balls down slopes like in high school and get a lot of data. And from that you'd find there's a straight line model in there if you do some simple transformations.
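To make the "straight line model" step concrete, here is a minimal sketch in plain Python. It assumes the experiment has already been reduced to pairs of (acceleration, force), which is of course the hard part the thread is debating. The mass value, the noise level, and the helper name `fit_line` are all made up for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

random.seed(0)
TRUE_MASS = 2.0  # kg, hypothetical value used only to generate the fake data

# Simulated measurements: known accelerations, noisy force readings.
accelerations = [0.5 * i for i in range(1, 21)]
forces = [TRUE_MASS * a + random.gauss(0.0, 0.05) for a in accelerations]

slope, intercept = fit_line(accelerations, forces)
print(f"estimated mass (slope): {slope:.3f}, intercept: {intercept:.3f}")
```

The fit recovers the slope easily, but only because a human already decided that force and acceleration were the two columns worth recording.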

But how would you come to hypothesise about what factors matter, and what factors don't? And what about new models of behaviour that weren't in your original set? How would the experimental setup come about in the first place? It doesn't seem likely that people reason simply by jumbling up some models (it's a line / it's inverse distance squared / only mass matters / it matters what color it is / etc), but that may just be education getting in my way.

A machine could of course test these hypotheses, but they'd have to be generated from somewhere, and I suspect there's at least a hint of something aesthetic about it. For instance you have some friction in your ball/slope experiment. The machine finds the model that contains the friction, so it's right in some sense. But the lesson we were trying to learn was a much simpler behaviour, where deviation was something that could be ignored until further study focussed on it.

You can self train, like AlphaGo or that pong playing thing. As for the aesthetics part, there are theories on what aesthetics is, and part of it has to do with parsimony. Machines can certainly constrain themselves on that.

I've had similar thoughts when it comes to recognizing the underlying (potential) simplicity of a phenomenon of interest.

For example, consider a toy experiment where you take dozens of high speed sensors pointed at a rig in order to study basic spring dynamics (i.e. Hooke's law).

You could apply "big data analytics" or ML methods to break apart the dynamics to predict future positions based on past positions.

But hopefully, somewhere along the way, you have some means of recognizing that it is a simple 1D phenomenon and that most of the volume of data that you collected is fairly pointless for that goal.
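One crude way to "recognize that it is 1D" is to check how much of the variance across all sensors lies along a single direction. The sketch below does that with power iteration on the covariance matrix, in plain Python. The sensor model (every sensor sees the same displacement times its own gain, plus small noise) and all constants are assumptions for illustration, not a description of any real rig.

```python
import math
import random

random.seed(1)
NUM_SENSORS = 12
NUM_SAMPLES = 400

# Hypothetical setup: each sensor sees the same spring displacement x(t),
# scaled by its own gain, plus a little independent noise.
gains = [random.uniform(0.5, 2.0) for _ in range(NUM_SENSORS)]
data = []
for k in range(NUM_SAMPLES):
    x = math.cos(0.1 * k)  # undamped oscillation of the mass
    data.append([g * x + random.gauss(0.0, 0.01) for g in gains])

# Center the data and build the sensor-by-sensor covariance matrix.
means = [sum(row[j] for row in data) / NUM_SAMPLES for j in range(NUM_SENSORS)]
centered = [[row[j] - means[j] for j in range(NUM_SENSORS)] for row in data]
cov = [[sum(r[i] * r[j] for r in centered) / NUM_SAMPLES
        for j in range(NUM_SENSORS)] for i in range(NUM_SENSORS)]

# Power iteration converges to the dominant direction of variation.
v = [1.0] * NUM_SENSORS
for _ in range(100):
    w = [sum(cov[i][j] * v[j] for j in range(NUM_SENSORS))
         for i in range(NUM_SENSORS)]
    norm = math.sqrt(sum(c * c for c in w))
    v = [c / norm for c in w]

top_variance = sum(v[i] * cov[i][j] * v[j]
                   for i in range(NUM_SENSORS) for j in range(NUM_SENSORS))
total_variance = sum(cov[i][i] for i in range(NUM_SENSORS))
explained = top_variance / total_variance
print(f"fraction of variance along one direction: {explained:.4f}")
```

If that fraction is close to 1, the dozens of channels are essentially one signal seen many times, which is the hint that a much smaller model would do.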

Almost all deep learning progress is optimization on a scale going from 'incredibly inefficient use of space and time' to 'quite wasteful' to 'optimal'. You're jumping the gap from 'quite wasteful' to 'optimal' in one step because you understand the problem. If you could find a way to do that algorithmically you likely would have created an actual AI.

The interesting part of human intelligence isn't the ability to calculate that F=MA based on measurement, it's the ability to come to a social consensus on the meaning of F,M,A, and =, and to decide that the relationship between force, mass, and acceleration would be a useful or interesting thing to know.

Actually there are quite a few researchers working on applying newer NN research to systems that incorporate sensorimotor input, experience, etc. and more generally, some of them are combining an AGI approach with those new NN techniques. And there has been research coming out with different types of NNs and ways to address problems like overfitting or slow learning/requiring huge datasets, etc. When he says something about abstraction and reasoning, yes that is important but it seems like something NNish may be a necessary part of that because the logical/symbolic approaches to things like reasoning have previously mainly been proven inadequate for real-world complexity and generally the expectations we have for these systems.

Search for things like "Towards Deep Developmental Learning" or "Overcoming catastrophic forgetting in neural networks" or "Feynman Universal Dynamical" or "Wang Emotional NARS". No one seems to have put together everything or totally solved all of the problems but there are lots of exciting developments in the direction of animal/human-like intelligence, with advanced NNs seeming to be an important part (although not necessarily in their most common form, or the only possible approach).

> Doing this well is a game-changer for essentially every industry, but it is still a very long way from human-level AI.

We're still a long way from even insect level "intelligence" (if it could even be called that), hence the harm in calling it AI in the first place. The fact that machine learning performs some particular tasks better than humans means little. That was true of computers since their inception. The question of how much closer we are to human-level AI than to the starting point of machine learning and neural networks over 70 years ago is very much an open question. That after 70 years of research into neural networks in particular and to machine learning in general, we are still far from insect-level intelligence makes anyone suggesting a timeline for human-level AI sound foolish (although hypothetically, the leap from insect-level intelligence to human-level could be technically simple, but we really have no idea).

As a chemical engineer who started learning deep learning after learning regular old regression-based empirical modeling, my interpretation of deep learning is that it's just high-dimensional non-linear interpolation.

If what you're trying to predict can't be represented as some combination of your existing data, it breaks immediately. Data drives everything; all models are wrong, but some are useful. (George Box)
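A toy illustration of the interpolation point, as a sketch only: the "model" below is a piecewise-linear interpolator over samples of sin(x), standing in for a trained network (it is not one). The training range, the query points, and the function name `predict` are arbitrary choices for the example.

```python
import bisect
import math

# Training data: dense samples of sin(x) on [0, pi] only.
train_x = [i * math.pi / 50 for i in range(51)]
train_y = [math.sin(x) for x in train_x]

def predict(x):
    """Piecewise-linear interpolation; outside the data, hold the nearest value."""
    if x <= train_x[0]:
        return train_y[0]
    if x >= train_x[-1]:
        return train_y[-1]
    i = bisect.bisect_right(train_x, x)
    x0, x1 = train_x[i - 1], train_x[i]
    t = (x - x0) / (x1 - x0)
    return (1 - t) * train_y[i - 1] + t * train_y[i]

inside_error = abs(predict(1.0) - math.sin(1.0))
outside_error = abs(predict(1.5 * math.pi) - math.sin(1.5 * math.pi))
print(f"error inside the data range:  {inside_error:.5f}")
print(f"error outside the data range: {outside_error:.5f}")
```

Inside the sampled range the error is tiny; at 1.5*pi, where there was no data, the prediction is off by roughly the full amplitude of the signal.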

Incidentally, humans aren't very good at extrapolation, either, but our ability to generate good hypotheses differentiates us strongly from these models.

"This ability [...] to perform abstraction and reasoning, is arguably the defining characteristic of human cognition."

He's on the right track. Of course, the general thrust goes beyond deep learning. The projection of intelligence onto computers is first and foremost wrong because computers are not able, not even in principle, to engage in abstraction, and claims to the contrary make for notoriously bad, reductionistic philosophy. Ultimately, such claims underestimate what it takes to understand and apprehend reality and overestimate what a desiccated, reductionistic account of mind and the broader world could actually accommodate vis-a-vis the apprehension and intelligibility of the world.

Take your apprehension of the concept "horse". The concept is not a concrete thing in the world. We have concrete instances of things in the world that "embody" the concept, but "horse" is not itself concrete. It is abstract and irreducible. Furthermore, because it is a concept, it has meaning. Computers are devoid of semantics. They are, as Searle has said ad nauseam, purely syntactic machines. Indeed, I'd take that further and say that actual, physical computers (as opposed to abstract, formal constructions like Turing machines) aren't even syntactic machines. They do not even truly compute. They simulate computation.

That being said, computers are a magnificent invention. The ability to simulate computation over formalisms -- which themselves are products of human beings who first formed abstract concepts on which those formalisms are based -- is fantastic. But it is pure science fiction to project intelligence onto them. If deep learning and AI broadly prove anything, it is that in the narrow applications where AI performs spectacularly, it is possible to substitute what amounts to a mechanical process for human intelligence.

The Chinese Room argument is one of the least convincing arguments against AI. Of course the man in the room isn't conscious; neither are the individual neurons in your brain. It's the whole house that becomes conscious.

The reality is that we just don't know.

This is because a deep learning model is "just" a chain of simple, continuous geometric transformations mapping one vector space into another.

Per my understanding - Each vector space represents the full state of that layer. Which is probably why the transformations work for such vector spaces.

A sorting algorithm unfortunately cannot be modeled as a set of vector spaces each representing the full state. For instance, an intermediary state of a quick sort algorithm does not represent the full state. Even if a human was to look at that intermediary step in isolation, they will have no clue as to what that state represents. On the contrary, if you observe the visualized activations of an intermediate layer in VGG , you can understand that the layer represents some elements of an image.

The brain is a dynamic system and (some) neural networks are also dynamic systems, and a three layer neural network can learn to approximate any function. Thus, a neural network can approximate brain function arbitrarily well given time and space. Whether that simulation is conscious is another story.

The Computational Cognitive Neuroscience Lab has been studying this topic for decades and has an online textbook here:


The "emergent" deep learning simulator is focused on using these kinds of models to model the brain:


That's about as interesting as saying that a Taylor series can approximate any analytic function arbitrarily well given time and space. Or that a lookup table can approximate any function arbitrarily well given time and space: see also the Chinese room example.

The first question is whether that neural network is learnable. Sure, some configuration of neurons may exist. Is it possible given enough time and space to discover what that configuration is, given a set of inputs and outputs?

The second question is whether "enough time and space" means "beyond the lifetime and resources of anyone alive," in which case it seems perfectly reasonable to me to call it a limitation. I generally want my software to work within my lifetime.

I like your comment. The real question is whether they are conscious.

The analogy between deep neural networks and the brain has proven to be very fruitful. Other analogies may as well. See our upcoming paper for more info.


I think a lot of people end up mixing being alive with being conscious. Is a tree conscious? Is a self driving car conscious?

If we use the definition "Aware of its surroundings, responding and acting towards a certain goal" then a lot of things fit that definition.

When an AI plays the atari games, learns from it and plays at a human level, I would call it conscious. It's not a human level conscious agent but conscious nonetheless.

Consciousness has a specific meaning - https://en.wikipedia.org/wiki/Qualia

Recurrent models do not simply map from one vector space to another and could very much be interpreted as reasoning about their environment. Of course they are significantly more difficult to train and backprop through time seems a bit of a hack.

Sure they do. The spaces are just augmented with timestep related dimensions.

No they aren't? RNNs have state that gets modified as time goes on. The RNN has to learn what is important to save as state, and how to modify it in response to different inputs. There is no explicit time-stamping.
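A minimal sketch of what "state that gets modified as time goes on" means, in plain Python. The weights are hand-picked placeholders (a real RNN would learn them), and the two-dimensional state is an arbitrary choice. Note that no timestamp is ever fed in; order only matters through the state.

```python
import math

# Hypothetical, hand-picked weights; a real RNN would learn these.
W_STATE = [[0.5, -0.3], [0.2, 0.4]]   # state -> state
W_INPUT = [0.8, -0.6]                 # scalar input -> state

def rnn_step(state, x):
    """One recurrence: new_state = tanh(W_STATE @ state + W_INPUT * x)."""
    return [
        math.tanh(sum(W_STATE[i][j] * state[j] for j in range(2)) + W_INPUT[i] * x)
        for i in range(2)
    ]

def run(sequence):
    state = [0.0, 0.0]          # no timestamp is ever passed in
    for x in sequence:
        state = rnn_step(state, x)
    return state

print(run([1.0, 0.0, 0.0]))            # pulse at the start
print(run([0.0, 0.0, 1.0]))            # same pulse at the end
print(run([1.0, 0.0, 0.0, 1.0, 1.0]))  # longer input, same-sized state
```

The same three values in a different order leave the network in a different state, and a longer sequence still ends in a state of the same size.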

There is an implicit ordering of timesteps ("before" and "after") though, right? If you have that, you can dispense with an explicit time dimension.

Not necessarily. Depending on the usage, RNN-based models are sometimes trained in both directions, i.e. every sample (say, a video) is shown to the network in its natural time direction and then also reversed. Some say the motivation is to eliminate dependence on the specific order of sequences and instead train an integrator.

So, time's arrow can be reversed, and the model can thus extrapolate both forward and backward. Cool!

However, that doesn't actually eliminate the axis/dimension. Eliminating timestamps only makes the dimension a unitless scalar (IOW 'time' tautologically increments at a 'rate' of 'one frame per frame').

If the deep learning network has enough layers, then can't it start incorporating "abstract" ideas common to any learning task? E.g. could we re-use some layers for image/speech recognition & NLP?

This is exactly what happens in transfer learning. A recent paper by Google ( https://research.googleblog.com/2017/07/revisiting-unreasona... ) shows that pre-training on a very large image database leads to improvements in state of the art for several different image problems. This is because the weights required for one image problem are not necessarily all that different from another image problem, especially in the early layers. There may not be as much common ground between images and e.g. NLP. Perhaps at much higher abstraction levels, but we aren't there yet.

Transfer learning has been shown to improve training times in other modes (such as using an image classification model to initialize an NLP model) over randomly initialized values.

When an implementation of AGI comes around (yes, it will come around) it will inevitably involve a number of different neural nets working together in concert as separate subsystems. That's what makes these "Neural Nets Will Never Become Conscious!" articles so hilarious.

But yeah, I could see feeding the output of an array of sub-networks into a parent network. So think one NN for vision, one for hearing, etc, etc, all of those outputs feed into a parent level network that could be your abstraction network that deals with making executive level decisions.

If this article is correct about limitations, couldn't one simply include a Turing machine model into the process to train algorithms?

Some ideas:

- The vectors are Turing tapes, or

- Each point in a tape is a DNN, or

- The "tape" is actually a "tree" each point in the tape is actually a branch point of a tree with probabilities going each way, and the DNN model can "prune this tree" to refine the set of "spanning trees" / programs.

Or, hehe, maybe I'm leading people off track. I know absolutely nothing about DNN ( except I remember some classes on gradient descent and SVMs from bioinformatics ).

You can bolt all kinds of funny structures into some DNN system, but if the system doesn't have well behaved gradients (or if it isn't even differentiable) it won't train.
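A small sketch of why a non-differentiable piece stalls training, using only the standard library. It compares the gradient of a squared-error loss through a hard step versus through a sigmoid, estimated by central differences. The single training example and the starting weight are made up for the example.

```python
import math

def step(z):
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def numerical_grad(loss, w, eps=1e-6):
    """Central-difference estimate of d(loss)/dw."""
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

X, TARGET = 2.0, 1.0  # one hypothetical training example

def loss_step(w):
    return (step(w * X) - TARGET) ** 2

def loss_sigmoid(w):
    return (sigmoid(w * X) - TARGET) ** 2

w = -0.5  # a starting weight that gets the example wrong
grad_step = numerical_grad(loss_step, w)
grad_sigmoid = numerical_grad(loss_sigmoid, w)
print("gradient through hard step:", grad_step)
print("gradient through sigmoid:  ", grad_sigmoid)
```

Through the hard step the gradient is exactly zero, so gradient descent has no idea which way to move the weight; through the sigmoid it is negative, which says "increase w".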

Then people are assuming Deep Learning can be applied to a Self Driving Car System end-to-end! Can you imagine the outcome?!

Yes. Death Race 2000.

My qualm with this article is that it is disappointingly poorly backed up. The author makes claims, but does not justify those claims well enough to convince anyone but people who already agree with him. In that sense, this piece is an opinion piece masquerading as science.

> This is because a deep learning model is "just" a chain of simple, continuous geometric transformations mapping one vector space into another. All it can do is map one data manifold X into another manifold Y, assuming the existence of a learnable continuous transform from X to Y, and the availability of a dense sampling of X:Y to use as training data. So even though a deep learning model can be interpreted as a kind of program, inversely most programs cannot be expressed as deep learning models [why?]—for most tasks, either there exists no corresponding practically-sized deep neural network that solves the task [why?], or even if there exists one, it may not be learnable, i.e. the corresponding geometric transform may be far too complex [???], or there may not be appropriate data available to learn it [like what?].

> Scaling up current deep learning techniques by stacking more layers and using more training data can only superficially palliate some of these issues [why?]. It will not solve the more fundamental problem that deep learning models are very limited in what they can represent, and that most of the programs that one may wish to learn cannot be expressed as a continuous geometric morphing of a data manifold. [really? why?]

I tend to disagree with these opinions, but I think the author's opinions aren't unreasonable; I just wish he would explain them rather than reiterating them.

For one, input and output size has to be fixed. All these NNs doing image transformations or recognition only work on fixed-size images. How would you sort a set of integers of arbitrary size using a neural network? What does "solve with a NN" even mean in that context?

Another problem/limitation I can think of is that in NNs you don't have state. The NN can't push something on a stack, and then iterate. How do you divide and conquer using NNs?

Are NNs Turing complete? I don't see how they possibly could be.

Input and output sizes don't have to be fixed. E.g. speech recognition doesn't work with fixed sized inputs. Natural language processing deals with many different length sequences. seq2seq networks are explicitly designed to deal with problems that have variable length inputs and outputs that are also variable in length and different from the input.

How would you sort integers? using neural turing machines: https://arxiv.org/abs/1410.5401

NTMs and other memory network architectures have explicit memory as state (including stacks!), indeed any recurrent neural net has state.

Are NNs Turing complete? Yes! http://binds.cs.umass.edu/papers/1992_Siegelmann_COLT.pdf

Interesting, thanks! On https://www.tensorflow.org/tutorials/seq2seq I found a link to https://arxiv.org/abs/1406.1078, which says

> One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols.

To me it sounds like they use an RNN to learn a hash function.

Thanks for the NTM link, I'll check it out.

It seems unfair to level the criticism of being incomplete and not fully explaining all the points given that the lead-in to the piece says it's a book excerpt and doesn't explain a lot of stuff that a reader of the book would already have encountered.

This is evergreen:


See also, if you can, the film "Being in the world", which features Dreyfus.

The author raises some valid points, but I don't like the style it is written in. He just makes some elaborate claims about the limitations of Deep Learning, but does not convey why they are limitations. I don't disagree that there are limits to Deep Learning and that many may be impossible to overcome without completely new approaches. I would like to see more emphasis on why things like generating code from descriptions, which are theoretically possible, are out of reach today, without giving the impression that the task itself is impossible (like the halting problem).

This is why Elon Musk is projecting. We are a long way away from AI.

This is why I don't know if it will be possible (at current limitations) to let insect-like brains fully drive our cars. It may never be good enough.

Insects can drive themselves quite well, occasional splatters aside. This is one of the tasks that is, I feel, tractable. However, propose letting insects drive and people will never accept it, but somehow they trust the SV hype men.

This is basically the Chinese Room argument though?

Not really. Deep learning does not give you an Artificial General Intelligence (what the Chinese Room is supposed to be). The author just explains why this is so (admittedly, in a handwavy, not necessarily convincing fashion).

On the limitations of machine learning as in the OP, the OP is correct.

So, right, current approaches to "machine learning" as in the OP have some serious "limitations". But this point is a small, tiny special case of something else much larger and more important: Current approaches to "machine learning" as in the OP are essentially some applied math, and applied math is commonly much more powerful than machine learning as in the OP and has much less severe limitations.

Really, "machine learning" as in the OP is not learning in any significantly meaningful sense at all. Really, apparently, the whole field of "machine learning" is heavily just hype from the deceptive label "machine learning". That hype is deceptive, apparently deliberately so, and unprofessional.

Broadly machine learning as in the OP is a case of old empirical curve fitting where there is a long history with a lot of approaches quite different from what is in the OP. Some of the approaches are under some circumstances much more powerful than what is in the OP.

The attention to machine learning is omitting a huge body of highly polished knowledge usually much more powerful. In a cooking analogy, you are being sold a state fair corn dog, which can be good, instead of everything in Escoffier,

Prosper Montagné, Larousse Gastronomique: The Encyclopedia of Food, Wine, and Cookery, ISBN 0-517-503336, Crown Publishers, New York, 1961.

Essentially, for machine learning as in the OP, if (A) have a LOT of training data, (B) a lot of testing data, (C) by gradient descent or whatever build a model of some kind that fits the training data, and (D) the model also predicts well on the testing data, then (E) may have found something of value.

But the test in (D) is about the only assurance of any value. And the value in (D) needs an assumption: Applications of the model will in some suitable sense, rarely made clear, be close to the training data.

Such fitting goes back at least to

Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone, Classification and Regression Trees, ISBN 0-534-98054-6, Wadsworth & Brooks/Cole, Pacific Grove, California, 1984.

not nearly new. This work is commonly called CART, and there has long been corresponding software.

And CART goes back to versions of regression analysis that go back maybe 100 years.

So, sure, in regression analysis, we are given points on an X-Y coordinate system and want to fit a straight line so that as a function of points on the X axis the line does well approximating the points on the X-Y plot. Being more specific could use some mathematical notation awkward for simple typing and, really, likely not needed here.

Well, to generalize, the X axis can have several dimensions, that is, accommodate several variables. The result is multiple linear regression.

For more, there is a lot with a lot of guarantees. Can find those in short and easy form in

Alexander M. Mood, Franklin A. Graybill, and Duane C. Boes, Introduction to the Theory of Statistics, Third Edition, McGraw-Hill, New York, 1974.

with more detail but still easy form in

N. R. Draper and H. Smith, Applied Regression Analysis, John Wiley and Sons, New York, 1968.

with much more detail and carefully done in

C. Radhakrishna Rao, Linear Statistical Inference and Its Applications: Second Edition, ISBN 0-471-70823-2, John Wiley and Sons, New York, 1967.

Right, this stuff is not nearly new.

So, with some assumptions, get lots of guarantees on the accuracy of the fitted model.

This is all old stuff.

The work in machine learning has added some details to the old issue of over fitting, but, really, the math in old regression takes that into consideration -- a case of over fitting will usually show up in larger estimates for errors.

There is also spline fitting, fitting from Fourier analysis, autoregressive integrated moving average processes,

David R. Brillinger, Time Series Analysis: Data Analysis and Theory, Expanded Edition, ISBN 0-8162-1150-7, Holden-Day, San Francisco, 1981.

and much more.

But, let's see some examples of applied math that totally knocks the socks off model fitting:

(1) Early in civilization, people noticed the stars and the ones that moved in complicated paths, the planets. Well Ptolemy built some empirical models based on epi-cycles that seemed to fit the data well and have good predictive value.

But much better work was from Kepler who discovered that, really, if assume that the sun stays still and the earth moves around the sun, then the paths of planets are just ellipses.

Next Newton invented the second law of motion, the law of gravity, and calculus and used them to explain the ellipses.

So, what Kepler and Newton did was far ahead of what Ptolemy did.

Or, all Ptolemy did was just some empirical fitting, and Kepler and Newton explained what was really going on and, in particular, came up with much better predictive models.

Empirical fitting lost out badly.

Note that once Kepler assumed that the sun stands still and the earth moves around the sun, actually he didn't need much data to determine the ellipses. And Newton needed nearly no data at all except to check his results.

Or, Kepler and Newton had some good ideas, and Ptolemy had only empirical fitting.

(2) The history of physical science is just awash in models derived from scientific principles that are, then, verified by fits to data.

E.g., some first principles derivations show what the acoustic power spectrum of the 3 K background radiation should be, and the fit to the actual data from WMAP, etc. was astoundingly close.

News Flash: Commonly some real science or even just real engineering principles totally knocks the socks off empirical fitting, for much less data.

(3) E.g., here is a fun example I worked up while in a part time job in grad school: I got some useful predictions for an enormously complicated situation out of a little applied math and nearly no data at all.

I was asked to predict what the survivability of the US SSBN fleet would be under a special scenario of global nuclear war limited to sea.

Well, there was a WWII analysis by B. Koopman that showed that in search, say, of a submarine for a surface ship, an airplane for a submarine, etc. the encounter rates were approximately a Poisson process.

So, for all the forces in that war at sea, for the number of forces surviving, with some simplifying assumptions, we have a continuous time, discrete state space Markov process subordinated to a Poisson process. The details of the Markov process are from a little data about detection radii and the probabilities at a detection, one dies, the other dies, both die, or neither die.

That's all there was to the set up of the problem, the model.

Then to evaluate the model, just use Monte Carlo to run off, say, 500 sample paths, average those, appeal to the strong law of large numbers, and presto, bingo, done. Also can easily put up some confidence intervals.
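For readers who want to see the shape of that kind of calculation, here is a rough sketch in plain Python. It is not the original analysis: the two-sided setup, the encounter rate, the outcome probabilities, the fleet sizes, and the time horizon are all made-up placeholders. It simulates a continuous-time, discrete-state Markov chain whose jumps arrive with exponential waiting times, runs 500 sample paths, averages them, and adds a normal-approximation confidence interval.

```python
import math
import random

random.seed(42)

# All numbers below are made-up placeholders, not values from any real study.
ENCOUNTER_RATE = 0.002   # encounters per (blue unit x red unit) per day
P_RED_DIES, P_BLUE_DIES, P_BOTH_DIE = 0.3, 0.2, 0.1   # remainder: neither dies
HORIZON_DAYS = 30.0
NUM_PATHS = 500

def simulate_path(blue, red):
    """One sample path of the Markov chain; returns the number of blue survivors."""
    t = 0.0
    while blue > 0 and red > 0:
        rate = ENCOUNTER_RATE * blue * red
        t += random.expovariate(rate)   # exponential wait at the current rate
        if t > HORIZON_DAYS:
            break
        u = random.random()
        if u < P_RED_DIES:
            red -= 1
        elif u < P_RED_DIES + P_BLUE_DIES:
            blue -= 1
        elif u < P_RED_DIES + P_BLUE_DIES + P_BOTH_DIE:
            blue -= 1
            red -= 1
        # otherwise neither side loses a unit
    return blue

survivors = [simulate_path(blue=40, red=60) for _ in range(NUM_PATHS)]
mean = sum(survivors) / NUM_PATHS
var = sum((s - mean) ** 2 for s in survivors) / (NUM_PATHS - 1)
half_width = 1.96 * math.sqrt(var / NUM_PATHS)  # approximate 95% interval
print(f"expected blue survivors: {mean:.2f} +/- {half_width:.2f}")
```

The only "data" in the sketch are a handful of rates and probabilities; everything else comes from the structure of the model.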

The customers were happy.

Try to do that analysis with big data and machine learning and will be in deep, bubbling, smelly, reeking, flaming, black and orange, toxic sticky stuff.

So, a little applied math, some first principles of physical science, or some solid engineering data commonly totally knocks the socks off machine learning as in the OP.

There is a whole lot of difference between curve fitting and curve fitting with performance guarantees on future data under a distribution free (limited dependence model).

BTW the 'machine learning' term is Russian coinage and its genesis lies in non-parametric statistics. The key result that sparked it all off was Vapnik and Chervonenkis's result that is essentially a much generalized and non-asymptotic version of Glivenko-Cantelli. The other result was that of Stone, who showed not only that universal algorithms that can achieve the Bayes error in the limit exist, but also constructed such an algorithm. This was the first time it was established that 'learning' is possible.

This is a much stricter and better thought-out approach than the OP's; there is no need to consider deep learning alone without generalizing to all possible math models. For example, the OP could mention that the simple x^2 function cannot be approximated well by a deep network of ReLU layers with a small number of nodes, but it can be trivially approximated with a single x^2 layer.
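The x^2 point can be checked numerically. The sketch below builds, by hand, a one-hidden-layer ReLU network with n units that interpolates x^2 at n+1 evenly spaced knots on [0, 1] and measures the worst-case error on a grid. The construction and the function names are mine, chosen for illustration; a trained network would not necessarily land on these exact weights, and this is a shallow network rather than a deep one.

```python
def relu(z):
    return max(0.0, z)

def relu_net(x, n):
    """One hidden layer of n ReLU units interpolating x**2 at n+1 knots on [0, 1]."""
    h = 1.0 / n
    # The k-th chord has slope (2k + 1) * h, so each new unit adds a slope change of 2h.
    total = h * relu(x)
    for k in range(1, n):
        total += 2 * h * relu(x - k * h)
    return total

def max_error(n, grid=10001):
    """Largest absolute gap between the network and x**2 on an even grid over [0, 1]."""
    xs = [i / (grid - 1) for i in range(grid)]
    return max(abs(relu_net(x, n) - x * x) for x in xs)

for n in (2, 4, 8, 16):
    print(n, "ReLU units -> max error", round(max_error(n), 6))
```

For this construction the worst-case error is h^2/4 with h = 1/n, so it only shrinks quadratically in the number of units, whereas a single unit that squares its input has zero error.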

However, the question is, how complex are the "true" models of nature. Gravity law is simple with single equation and one parameter but what if human language law has millions of parameters and not really manageable by human. 500 samples would not be enough then. This is a classical Norvig vs Chomsky argument. Still, for many things the simple laws might exist.


I am sorry but GMO is actually bad for you.... Monsanto tried to spread gmo corn in France, they tested it on rats for a year and the rats developed multiple tumors the size of an egg.

Oops wrong article

DL/ML == Wisdom of Crowds

I don't get it. If reasoning is not an option how does deep learning beat the boardgame go?

Memorisation + small amounts of generalization.

Unlikely. If it's mostly memorisation it couldn't learn from playing itself.

And what you describe is how AI beats chess. The problem with that is that it is a quite inhuman way to play. But AlphaGo plays quite humanly.

1. Imagine infinite compute capability. Exhaustively play all possible games, and use that to figure out best moves at any state. This is essentially what AlphaGo did, but using translation invariance to reduce the search space.

2. There is no contradiction here. We just have to accept that human-like play can emerge from memorization.

Can you explain what from your point of view is the difference between AlphaGo and a Chess AI? Because to me it sounds like the one should have resulted as an evolution from the other if it would be that simple.

Yes. In Chess, it's relatively easy to judge how good a board position is, and thus people have been successful by hand-engineering the board position evaluator (also called value function in RL lingo), and then just doing tree search to take the action which improves board position the most. In Go, evaluating board position is much more difficult, and it's not possible to approximate the value function by hand-engineered code. Thus, AlphaGo approximates the value by simulating the game till win/lose from arbitrary board positions to evaluate its value. This doesn't really require neural networks. You could also do the same with table lookup. What neural networks offer here is some translation invariance generalization, and capability to compress the table into fewer parameters by identifying common input features. It's possible to achieve AlphaGo performance by just having a BIG table of state-values and using some kernel to do nearest neighbour search (such as done by deepmind here: https://arxiv.org/abs/1606.04460)
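To illustrate the "table of state-values from simulated games" idea on something tiny, here is a sketch in plain Python. It is a toy and nothing like AlphaGo's actual pipeline: the game is a made-up subtraction game (take 1 to 3 stones, whoever takes the last stone wins), the rollouts use uniformly random moves, and the resulting values are therefore win rates under random play rather than optimal play.

```python
import random

random.seed(7)
MAX_TAKE = 3  # toy game: remove 1-3 stones; whoever takes the last stone wins

def rollout(pile):
    """Play uniformly random moves to the end; return 1 if the player to move wins."""
    player = 0
    while True:
        pile -= random.randint(1, min(MAX_TAKE, pile))
        if pile == 0:
            return 1 if player == 0 else 0
        player = 1 - player

def estimate_value(pile, num_rollouts=2000):
    """Monte Carlo estimate of the win rate for the player to move."""
    return sum(rollout(pile) for _ in range(num_rollouts)) / num_rollouts

# The "table": one Monte Carlo value estimate per state, no network involved.
value_table = {pile: estimate_value(pile) for pile in range(1, 11)}
for pile, value in value_table.items():
    print(pile, round(value, 3))
```

With ten states a dictionary is enough. The argument above is that a neural network earns its keep only when the table would be far too large to store and similar positions need to share estimates.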
