What does a neural network actually do? (moalquraishi.wordpress.com)
151 points by zercool on May 25, 2014 | 38 comments

"What does a neural network actually do?"

This is a fundamental question: can we really say and predict what a neural network does?

Unlike an engineered, hand-constructed algorithm, a neural network is 'trained'.

Whenever we present a 'known' input pattern, it will respond with a 'learned' response.

This, however, introduces interesting problems: How can we _debug_ a neural network? How can we debug a correlation? Sure, we can tune its parameters, or train it some more until it again shows the desired response. But at that point we have abandoned knowing how the intrinsic algorithm works in favor of just focusing on the result.

Okay - now if we follow this argument, it leads to this: if we simulate the whole brain by simulating the neural network, we won't gain any knowledge about the intrinsic workings of the brain. We won't find any enlightenment about the innermost algorithm _represented_ by the neural network we call our brain.

I think you have hit upon a problem of present day AI.

Neural networks, Support Vector Machines, Hidden Markov Models and other stuff (Markov Networks, etc) do something like linear regression on some huge space - they draw a curve/plane between groups of things on this feature space. The tendency is for this division to make sense and to correspond to our common sense categorization of these things.

The problem is that once that happens, you can't really reason about the division you've drawn. It's just there. You can tweak for various purposes but that's a manual process.

You can categorize animals by shape, by particular adaptation, or by genetic makeup. You can teach one of these algorithms each of these categorizations. But you can't have the thing categorize for one purpose and then tell it to "change its outlook" and categorize for a different purpose.

In this sense, despite seeming impressive, the products of these processes are dead-ends that we can't reason about, that lack the flexible intelligence of a human being.

Your criticism is fair, but you fail to explain how an NN or SVM is any worse than how a human mind actually operates.

In other words, the incomprehensibility of a modern AI model is not a failing of AI, it is a failing of (AI) psychology and (AI) neuroscience.

The artificially constructed intelligence works whether or not we understand how. The frontier of AI science is now open to AI Psychology. Psychologists and Neuroscientists will replace the data scientists! Such fun!

I could hardly contrast the operations of an NN with those of the human mind, because we most definitely don't understand the latter. I have already described apparent properties of human minds that NNs and SVMs definitely don't have. But I'll repeat and expand:

> The artificially constructed intelligence works whether or not we understand how

The operation of human intelligence is very dependent on the fact that we humans have an operational understanding of each other's mental processes. Moreover, it is well known that human beings process language and can re-evaluate past experience in light of present understanding.

Conversely, we know enough about the properties of the various non-linear recognizers to know that these constructs can't do, and won't ever be able to do, these things.

I can't find an example on Google right now, but I've seen demonstrations that it's possible to visualize the intermediate layers of a neural network - for example you can see how an image recognition network is first breaking down an image into horizontal and vertical lines, then combining those into more complex shapes, etc.

But visualizing is quite a ways from debugging.

To debug a program you actually verify that its logic is correct (at least the good kind of debugging).

Consider a spectrum:

1. Natural language - we humans combine fragments of natural language easily and on an ad-hoc basis. We can get a fair amount of use from reusing Shakespeare quotes and neologisms while spending rather little effort.

2. Trained programmers can reuse and combine general purpose libraries - with difficulty and often after considerable debugging.

3. AI algorithms like Neural Networks. These are just plopped in and tweaked; no combining seems possible.

It seems like "intelligent behavior" should be going more towards #1, but the process of Machine Learning seems to move things more towards #3. The "learn once, understand never" approach means that for each significant case, you'll need to do a re-tweaking and re-learning. The potential for things to get harder rather than easier over time might well be there.

Can you debug a human brain? I can't.

Is a human brain intelligent? I believe so.


Admittedly, all this is in a manner of speaking, but still, I would claim that most if not all of the time you debug a program, you are also debugging your mental concept of what the program does. By the fact that we can change our concepts, our minds are very "debuggable."

These are called filters. This is from my deep learning framework: Debugging a net visually: http://deeplearning4j.org/debug.html

An example of doing facial reconstruction: http://deeplearning4j.org/facial-reconstruction-tutorial.htm...

There's an interesting example in one of the coursera courses (Neural Networks for Machine Learning) - you just need to watch through the intro video to see it in action.


I've found that, in practice, traditional neural networks tend to be prone to overfitting and are finicky about their parameters (in particular the topology and number of nodes you choose).

I use the word "traditional" to describe the NN architecture discussed in the article. Recent NN research has been promising [1], but this article strictly discusses traditional NN's. I don't really have much experience with the newer NN algorithms, so I'm not sure to what extent they suffer from the same problems as traditional NN's.

[1] http://en.wikipedia.org/wiki/Neural_network#Recent_improveme...

Hinton's DropOut [1] and Wan's DropConnect [2] have ameliorated some of the overfitting issues present in traditional NN's. In fact, DropConnect in conjunction with deep learning is responsible for new records being set on classical datasets such as MNIST.

[1] http://arxiv.org/pdf/1207.0580.pdf [2] http://cs.nyu.edu/~wanli/dropc/

Dropout is actually a knob on any neural network. These are used in image recognition as well as text and other areas.

The fuzzing creates a very similar effect to convolutional nets where it can learn different poses of an image.
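To make the "knob" idea concrete, here is a minimal sketch of dropout applied to one layer's activations during training. It uses the "inverted dropout" formulation (survivors are scaled up at training time), which is a common variant and not necessarily the exact scheme in the paper linked above. The function name `dropout` and the parameter `p_drop` are made up for illustration.

```python
import random

def dropout(activations, p_drop, rng):
    """Randomly zero each activation with probability p_drop (training time only).

    Survivors are scaled by 1 / (1 - p_drop) so the expected value of each
    activation is unchanged ("inverted dropout", an assumed variant).
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    keep = 1.0 - p_drop
    return [a / keep if rng.random() >= p_drop else 0.0 for a in activations]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
hidden = [0.2, 0.9, 0.5, 0.7]     # hypothetical hidden-layer activations
noisy = dropout(hidden, p_drop=0.5, rng=rng)
untouched = dropout(hidden, p_drop=0.0, rng=rng)
```

With `p_drop=0.0` the layer is passed through unchanged, so the knob can be turned off entirely at test time.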

It's pretty funny, I saw DropConnect described in a stackoverflow answer that predated the paper you reference. It was an incorrect answer on how to do dropout. I shall try to find it tomorrow.

Is it safe to say that in ML, use of NNs is more about writing code that designs NNs, evaluates the results, and modifies the designs to optimize some desired meta-values, like accuracy, efficiency, etc?

Another limit they don't address is that the training normally used is purely local— just a gradient descent. So even when the network can model your function well, there is no guarantee that it will find the solution.

For me ANN's always seem to get stuck on not very helpful local minima - they're not one of the first tools in my bag of tricks by far.

Often I associate them as being the sort of thing that someone who doesn't really know what they're talking about talks about. (Esp. if it's clear that in their minds NN have magical powers. :) maybe they'll also mention something about "genetic algorithms")

> So even when the network can model your function well, there is no guarantee that it will find the solution.

If it models the function over the input domain, then it is properly trained. If it is trained to a local minimum then it doesn't model the underlying function well over the whole input domain. If you have good/representative training and validation sets you will be able to tell.

> Esp. if it's clear that in their minds NN have magical powers

I know that type. When dealing with ANN's you realize quickly (just like in all data science) that all of the "magic" relies on the manual work and thought that goes into washing and adapting the data. Not very sexy work, and work that requires a fair bit of knowledge about the problem domain.

> For me ANN's always seem to get stuck on not very helpful local minima

That isn't the ANN that gets stuck, it's the training algorithm (using gradient descent) that gets stuck :) Training is orthogonal to the operation of the network itself (which is just a nonlinear function in the end!). Gradient descent via error backpropagation is the most common training method for MLP's, but you could imagine doing a random/brute force algorithm that is significantly simpler to implement, but slower. Since a network is often trained once and then used repeatedly, it is often plausible to train it for several weeks if needed! A pure random search is usually not feasible, but adding randomization to gradient descent will help. There are many ways to avoid local minima in gradient descent, if you have time to wait.
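One of the simplest ways to add randomization to gradient descent is random restarts: run plain descent from several random starting points and keep the best result. The sketch below shows this on a toy one-dimensional function with two minima; the function `f`, the learning rate, and the restart count are all arbitrary choices for illustration, not anything specific to neural networks.

```python
import random

def f(w):
    # Toy non-convex "cost": local minimum near w = 1.13, global minimum near w = -1.30.
    return w**4 - 3 * w**2 + w

def grad_f(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, lr=0.01, steps=2000):
    # Plain gradient descent: purely local, follows the slope downhill.
    for _ in range(steps):
        w -= lr * grad_f(w)
    return w

def with_random_restarts(n_restarts=20, seed=0):
    # Run descent from several random starting points and keep the best end point.
    rng = random.Random(seed)
    candidates = [gradient_descent(rng.uniform(-2.0, 2.0)) for _ in range(n_restarts)]
    return min(candidates, key=f)

stuck = gradient_descent(2.0)      # a single run started on the "wrong" side
best = with_random_restarts()      # restarts get a chance to land in the better basin
```

The single run from `w = 2.0` settles in the shallower minimum, while the restart version costs more compute but is much more likely to end up in the deeper one.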

> maybe they'll also mention something about "genetic algorithms"

The simple error backpropagation methods only work well for normal feed-forward networks. Other topologies e.g. recurrent networks require more exotic methods. In my (limited) experience genetic algorithms are rarely efficient as a training method though.

Well, you could use an EA to take a stab at finding better minima :)

And correct me if I'm wrong, but isn't the cost function for a feed-forward neural network that uses a sigmoid activation function convex wrt the parameters being trained, i.e. gradient descent is guaranteed to find the global minimum when a small enough step size is used?

Mostly, no. Hidden units introduce non-convexity to the cost. How about a simple counter-example?

Take a simple classifier network with one input, one hidden unit and one output and no biases. To make things even simpler, tie the two weights, i.e. make the first weight equal to the second. Now, mathematically the output of the network can be written: z=f(w * f(w * x)) where f() is the sigmoid.

Next, consider a dataset with two items: [(x_1, y_1), (x_2, y_2)] where x_i is the input and y_i is the class label, 0 or 1. Take as values: [(0.9, 1), (0.1, 0)]. The cost function (log-likelihood in this case) is:

L(w) = sum_i { y_i * log( f(w * f(w * x_i)) ) + (1-y_i) * log( 1-f(w * f(w * x_i)) ) }

Plugging in the two data points, this becomes:

L(w) = log( f(w * f(w * 0.9)) ) + log( 1-f(w * f(w * 0.1)) )

Plot that last guy replacing f with the sigmoid, and you'll see the result is non-convex - there's a kink near zero.
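If you'd rather check this numerically than plot it, here is a small Python sketch of the two-point example, treating the cost as the negative of L(w). It relies on a standard fact: a differentiable convex function can only have zero slope at a global minimum. The helper names (`cost`, `slope`) are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, data=((0.9, 1), (0.1, 0))):
    # L(w) for the tied-weight network z = f(w * f(w * x)).
    total = 0.0
    for x, y in data:
        z = sigmoid(w * sigmoid(w * x))
        total += y * math.log(z) + (1 - y) * math.log(1.0 - z)
    return total

def cost(w):
    # Cost to minimize: the negative log-likelihood.
    return -log_likelihood(w)

def slope(fn, w, h=1e-5):
    # Central finite-difference estimate of the derivative.
    return (fn(w + h) - fn(w - h)) / (2 * h)

slope_at_zero = slope(cost, 0.0)   # numerically zero: w = 0 is a stationary point
cost_at_zero = cost(0.0)           # equals 2 * ln(2)
cost_at_minus_one = cost(-1.0)     # strictly lower than the cost at w = 0
```

The slope at w = 0 vanishes, yet the cost at w = -1 is lower, so w = 0 is a stationary point that is not a global minimum and the cost cannot be convex in w.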

A less mathy explanation with some real examples: http://neuralnetworksanddeeplearning.com/chap1.html

Coding a digit recognizer using a neural network is an extremely rewarding exercise and there's a lot of help on the web to get you started.

This is a great example of a hello world application. Keep in mind there are several kinds of neural nets that allow you to do this. This includes convolutional RBMs (recognizes parts of an image) and normal RBMs (learns everything at once)

This is a pretty good article, but I'm seeing a lot of confusion in this thread because the article is maybe one step ahead of the basic intuition needed to understand why ANNs are not magical and are not artificial intelligence (at least not feed-forward networks).

Perhaps a simpler way to look at it is to understand that a feed-forward ANN is basically just a really fancy transformation matrix.

OK, so unless you know linear algebra, you're probably now asking what's a transformation matrix? Without the full explanation, the important understanding is why they are so important in 3D graphics: they can perform essentially arbitrary operations (translation, rotation, scaling) on points/vectors. Once you have set up your matrix, it will dutifully perform the same transformations on every point/vector you give it. In graphics programming, we use 4x4 matrices to perform these transformations on 3D points (vertices) but the same principle works in any number of dimensions - you just need a matrix that is one bigger than the number of dimensions in your data*.

Edit: For NNs the matrices don't always have to be square. For instance you might want your output space to have far fewer dimensions than your input. If you want a simple yes/no decision then your output space is one-dimensional. The only reason the matrices are square in 3D graphics is because the vertices are always 3-dimensional.

What a neural network does is take a bunch of "points" (the input data) in some arbitrary, high number of dimensions and perform the same transformation on all of them, so as to distort that space. The reason it does this is so that the points go from being some complex intertwining that might appear random or intractable, into something where the points are linearly separable: i.e., we can now draw a series of planes in between the data that segments it into the classifications we care about.

The only difference between a transformation matrix and a neural network is that a neural network has at least two layers. In other words, it is two (or more) transformation matrices bolted together. For reasons that are a bit too complex to get into here, this allows an NN to perform more complex transformations than a single matrix can. In fact, it turns out that an arbitrarily large NN can perform any polynomial-based transformation on the data.

The reason this is often seen as somewhat magic is that although you can tell what transformations a neural network is doing in trivial cases, NNs are generally used where the number of dimensions is so large that reasoning about what it is doing is difficult. Different training methods can give wildly different networks that seemingly give much the same results, or fairly similar networks that give wildly different results. How easy it is to understand the various convolutions that are taking place rather depends on what the input data represents. In the case of computer vision it can be quite easy to visualise the features that each neuron in the hidden layer is looking for. In cases where the data is more arbitrary, it can be much harder to reason about, so if your training algorithm isn't performing as you'd like, it can be difficult to understand why it isn't working, even if you already understand that the basic principle of a feed-forward network is just a bunch of simple algebra.
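As a concrete toy case of "distort the space until the points are linearly separable", here is a sketch of a two-layer network for XOR with weights picked by hand (not trained). The four XOR points can't be split by a single line in input space, but after the hidden layer they can. The specific weights (20, -10, -30) are just convenient illustrative values, and the sketch includes the sigmoid applied after each layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(x1, x2):
    # First transformation: map the 2-D input into a new 2-D "feature" space.
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)    # roughly "x1 OR x2"
    h_and = sigmoid(20 * x1 + 20 * x2 - 30)   # roughly "x1 AND x2"
    return h_or, h_and

def output_layer(h_or, h_and):
    # Second transformation: a single plane in hidden space, roughly h_or - h_and > 0.5.
    return sigmoid(20 * h_or - 20 * h_and - 10)

def xor_net(x1, x2):
    return output_layer(*hidden_layer(x1, x2))

predictions = {(a, b): round(xor_net(a, b)) for a in (0, 1) for b in (0, 1)}
```

In hidden space, (0,1) and (1,0) both land near (1, 0) while (0,0) and (1,1) land near (0, 0) and (1, 1), so one straight cut now separates the two classes.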

> The only difference between a transformation matrix and a neural network is that a neural network has at least two layers. In other words, it is two (or more) transformation matrices bolted together. For reasons that are a bit too complex to get into here, this allows an NN to perform more complex transformations than a single matrix can. In fact, it turns out that an arbitrarily large NN can perform any polynomial-based transformation on the data.

Nice explanation. I need one clarification, though. Isn't matrix multiplication associative? Isn't thus any transformation defined by two matrices representable by a single matrix that is the product of the two matrices?

I am probably misunderstanding how NNs bolt matrices together.

You apply a non-linear function (usually some sigmoid) on the output vector after each matrix product. Otherwise, you'd be correct and any multi-layer ANN could be expressed as a single layer network.
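A quick way to convince yourself of this is to try it with two small matrices. The sketch below uses made-up 2x2 weights: stacking the two layers without a non-linearity gives exactly the same result as the single product matrix, while putting a sigmoid between them does not.

```python
import math

def matvec(M, v):
    # Matrix-vector product for plain nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    # Matrix-matrix product A @ B for plain nested lists.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

W1 = [[1.0, -2.0], [0.5, 3.0]]    # hypothetical first-layer weights
W2 = [[2.0, 1.0], [-1.0, 4.0]]    # hypothetical second-layer weights
x = [1.0, 2.0]

stacked_linear = matvec(W2, matvec(W1, x))                  # two layers, no activation
single_matrix = matvec(matmul(W2, W1), x)                   # one layer using W2 @ W1
with_nonlinearity = matvec(W2, sigmoid_vec(matvec(W1, x)))  # sigmoid in between
```

The first two results agree because matrix multiplication is associative; the third differs because the sigmoid sits between the two products and can't be folded into either matrix.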

Thanks. It makes sense. The sigmoid is the activation function of the output "neuron". Unfortunately, matrix algebra here is not as useful as in computer graphics.

No problem. Actually, I personally found that a pretty intuitive understanding of linear algebra & vector calculus makes quite a lot of ML straightforward to approach geometrically.


I suspect some kind of transformation could be used to make a two level NN into a one level one. The thing is the resulting one level network might be more complex and less useful than the original two level network. Still, I think this does illustrate the limitations of multilevel networks.

Another way to see this is to notice that NNs and SVMs[1] are (approximately or exactly) equivalent [2] because they both involve the fairly simple linear and non-linear transformations we've been looking at.

[1] http://en.wikipedia.org/wiki/Support_vector_machine [2] http://www.staff.ncl.ac.uk/peter.andras/PAnpl2002.pdf

Interesting to note, though, that even with a linear network that can be represented by a single matrix, training can be faster and easier and converge to better results with multiple layers, because of the different gradient and parameter space that is presented to the optimization algorithm.

A nice, cogent explanation.

It's good to remember that the ANN's input often comes as vector data. The ANN isn't transforming those vectors directly; rather, it transforms the input into a higher-dimensional "feature" space and performs the linear transform there. If you take the separating plane that's drawn in the feature space and reverse the map, you'll see the ANN has drawn a complex surface between the points it wants to recognize and those it rejects.

So it's basically a heuristic and no more intelligent than Taylor's series.

So, IIUC creating a NN basically follows this process:

- define an input vector space (i.e. choose dimensions you want to operate on with input data)

- define your categories in another space (or another basis in the same space?)

- set up a transformation pipeline between the two spaces (with at least two stages)

- devise an algorithm that takes categorised elements and produces new transformation matrices

- train the NN (i.e. feed input and categorise the result so that through some algorithm the transformation matrices converge)
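The steps above can be sketched in a few dozen lines of plain Python. This is only one possible reading of the list: a 2-input, 3-hidden-unit, 1-output network, sigmoid activations, squared error, and full-batch gradient descent via backpropagation on XOR. All names, sizes, and hyperparameters here are assumptions chosen for illustration, and whether a given run fully solves XOR depends on the random starting weights.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    # Two "transformation matrices" (plus biases), started at random.
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(x, p):
    # Stage 1: input space -> hidden space; stage 2: hidden space -> category score.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"])
    return h, y

def mean_squared_error(data, p):
    return sum((forward(x, p)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, p, epochs=5000, lr=0.5):
    # Full-batch gradient descent on 0.5 * sum((y - t)^2), updating p in place.
    n_hidden, n_in = len(p["W1"]), len(p["W1"][0])
    for _ in range(epochs):
        gW1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1 = [0.0] * n_hidden
        gW2 = [0.0] * n_hidden
        gb2 = 0.0
        for x, t in data:
            h, y = forward(x, p)
            dy = (y - t) * y * (1.0 - y)                    # error signal at the output
            gb2 += dy
            for j in range(n_hidden):
                dh = dy * p["W2"][j] * h[j] * (1.0 - h[j])  # error pushed back to hidden unit j
                gW2[j] += dy * h[j]
                gb1[j] += dh
                for i in range(n_in):
                    gW1[j][i] += dh * x[i]
        for j in range(n_hidden):
            p["W2"][j] -= lr * gW2[j]
            p["b1"][j] -= lr * gb1[j]
            for i in range(n_in):
                p["W1"][j][i] -= lr * gW1[j][i]
        p["b2"] -= lr * gb2
    return p

# Categorised elements: inputs paired with their target category.
XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
params = init_params(n_in=2, n_hidden=3, seed=0)
loss_before = mean_squared_error(XOR, params)
train(XOR, params)
loss_after = mean_squared_error(XOR, params)
```

Comparing `loss_before` with `loss_after` shows whether the matrices moved toward the desired categorisation; in practice you would also check the result on data that wasn't used for training.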

http://www.dkriesel.com/en/science/neural_networks is a bilingual ebook of about 200 pages on neural networks, containing lots of illustrations, in case some of you want to read further. There's also a nice Java framework to try out for the coders.

Looks like an interesting text, but to be honest I didn't understand a substantial portion of it.

You're 17; get to work understanding it! The more you learn now, the more you'll be able to do for the rest of your life. Plus, diving into random topics and understanding them more deeply than anyone else is a ton of fun.

Title made me think it would be something for beginners. Either it is not or I am very dim.

They write comments on HN.

Those are Markov-chain based text generators. (yes, I got the joke).

Well, recently a lot of work has gone into using NNs for natural language processing. Typically the network is trained to do something like predict the next word or character in a sequence. Using that you could possibly create a far better generative model than Markov chains, and create more realistic sentences. Perhaps you could even combine them (the NN gets the output of the Markov chain to help make its prediction).
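For reference, the Markov-chain baseline being compared against is tiny. Here is a minimal word-level sketch, where the next word depends only on the current word; the function names and the sample sentence are made up for illustration. An NN-based generator would replace the lookup table with a learned next-word predictor.

```python
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words observed right after it.
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=0):
    # Walk the chain, sampling each next word from the observed followers.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:          # dead end: no word ever followed this one
            break
        out.append(rng.choice(followers))
    return " ".join(out)

chain = build_chain("the cat sat on the mat")
sentence = generate(chain, start="the", length=6)
```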

:( I wish everyone else did.
