I once attempted to build a genetic algorithm for manipulating synapse weights, specifically because of traditional back-propagation's tendency to fall into local minima (unfortunately, some serious shit at work made it fall by the wayside). This RBM approach sounds better than back-propagation, but it also sounds like it would be prone to runaway feedback.
One of the performance problems with neural networks is that the number of cores on a typical machine is far smaller than the number of input and intermediate nodes in the network. The output nodes are less of a concern, since you're trying to distill a lot of data down to a little data, but there's no reason to treat them differently. There are (very few) examples of NNs on GPUs, so that helps, but I've recently been curious to try a different, more hardware-driven approach, just because one could.
Texas Instruments has a cheap microcontroller that you'ns are probably familiar with called the MSP430. It's pretty easy to use, the toolchain is free and fairly easy to set up (especially for a bunch of professional software devs like us, right? right? Well, there's an Arduino-like tool now, too, if not), it costs around 10 cents in bulk for the simplest version, it requires very few external parts to run (something like a power source, 1 cap, and two resistors), and it has a couple of serial communication protocols built in. I'm quite fond of the chip; I've used it to build a number of digital synthesizers and toys.
For about $50 and quite a bit of soldering time, you could build a grid of 100 of these, each running at 16 MHz, and I bet with a clever design you could make them self-programmable, i.e., propagate the program flash over the grid. Load up a simple neural network program, maybe even have each chip simulate more than one node, and interface it with a PC to pump data in one end and draw it out the other. It might not be more useful than the GPGPU approach, but having physical hardware to play with, and visualizing node activity through other hardware, would be a lot of fun.
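Before committing to any soldering, you could mock up the dataflow in software. Here's a rough Python sketch of the idea (all sizes and names are made up for illustration): data enters one edge of the grid, each "chip" computes activations for its handful of nodes, and the result comes out the other edge.

```python
import numpy as np

GRID_COLS = 10        # columns of "chips" the data flows through
CHIPS_PER_COL = 10
NODES_PER_CHIP = 4
WIDTH = CHIPS_PER_COL * NODES_PER_CHIP

rng = np.random.default_rng(0)
# one weight matrix per column, mapping the previous column's outputs
# to this column's nodes (each chip would hold its own 4 rows of this)
weights = [rng.normal(0.0, 0.1, (WIDTH, WIDTH)) for _ in range(GRID_COLS)]

x = rng.random(WIDTH)     # data pumped in one end from the PC
for W in weights:
    x = np.tanh(W @ x)    # each chip computes its nodes' activations
print(x)                  # ...and drawn out the other end
```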
There are many ways to avoid this; for example, have a look at:
In traditional neural networks (I am not talking about the "deep" stuff), optimization is hardly ever the problem, though; most often under- or over-fitting is the issue that produces poor performance.
> One of the performance problems with neural networks is that the number of cores on a typical machine is far smaller than the number of input and intermediate nodes in the network.
This sounds odd. Certainly doing neural networks in hardware is interesting, but it sounds a bit like an imagined problem. I mean, when multiplying 20 numbers, one does not complain that the number of cores is fewer than 20. And most often it is the training of the network that is resource-intensive, not the actual running of it.
Using simulated annealing to guide random initializations can help find a better minimum, but reaching the global minimum with simulated annealing takes an inordinate amount of time.
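For what it's worth, the basic loop is simple. Here's a toy Python sketch of annealing over initial weight vectors before handing off to gradient descent (the loss function, step sizes, and cooling schedule are all placeholders):

```python
import numpy as np

def anneal_init(loss_fn, dim, T0=1.0, cooling=0.95, steps=500, seed=0):
    """Search for a good starting weight vector via simulated annealing."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0, dim)          # current state
    loss = loss_fn(w)
    best_w, best_loss = w, loss
    T = T0
    for _ in range(steps):
        cand = w + rng.normal(0.0, T, dim)  # perturbation shrinks as we cool
        cand_loss = loss_fn(cand)
        # Metropolis rule: always accept improvements, sometimes accept worse moves
        if cand_loss < loss or rng.random() < np.exp((loss - cand_loss) / T):
            w, loss = cand, cand_loss
            if loss < best_loss:
                best_w, best_loss = w, loss
        T *= cooling                        # cool down
    return best_w

# toy usage: a quadratic bowl standing in for a network's training loss
best = anneal_init(lambda w: np.sum((w - 3.0) ** 2), dim=10)
```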
There's an obvious vision of building a "neural circuit" with a specialized processor for each neuron, but my guess is that it gets difficult when you consider the communication fabric required between the layers.
As I said, it is an interesting thing to consider as an alternative computer architecture and whatnot, but I have some doubts that people practically using neural networks right now really run into performance problems because of having too few cores. I don't think this is more true for neural networks than for anything else.
PS: Still a fun project, just harder than you might think to scale.
1. "Vanishing gradients after 2-3 layers"-does this mean that the partial derivatives tend to be smaller on the higher layers, and therefore the network finds local minima that aren't very useful?
2. Step 3 (p. 18) mentions that the outputs are not continuous variables; they're binary. What's the reasoning behind that?
2. It's been a while since I read the paper, but I believe that the justification has to do with the proof of convergence of Gibbs sampling. I haven't tried using continuous values, so I can't give an intuition for what happens in those cases.
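As a quick numerical illustration of the vanishing-gradient effect from question 1 (a toy sketch, not from the paper): the derivative of a sigmoid is at most 0.25, so each sigmoid layer the chain rule passes through multiplies the backpropagated gradient by a small factor. Ignoring the weight terms, you can watch it shrink:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(1, 9):
    z = rng.normal()                        # a typical pre-activation value
    grad *= sigmoid(z) * (1 - sigmoid(z))   # chain rule factor: sigmoid'(z) <= 0.25
    print(f"after {layer} layers: {grad:.2e}")
```

In a real network the weights also multiply in, but unless they're carefully scaled the geometric shrinkage usually wins, which is why the lower layers barely learn.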
On Intelligence has dramatically changed the way I think about thinking. It's an awesome book.
Now, it looks like those Vitamin D people have their own company: http://www.vitamindinc.com/
Even the really early versions of Vitamin D were impressive. Anybody use it for anything interesting now?
Here's a very good tech talk from him about RBMs: http://www.youtube.com/watch?v=AyzOUbkUf3M
That said, both approaches only loosely mirror the function of the brain: real neurons are not simple threshold devices, and neither backpropagation nor the RBM training algorithm has a biophysical equivalent.
After all, a deep belief network starts with an RBM for unsupervised pre-training, but the fine-tuning stage that follows just treats the network as a standard MLP trained with backprop.
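For anyone who hasn't seen it, the pre-training step is surprisingly little code. Here's a minimal sketch of one contrastive-divergence (CD-1) update for a Bernoulli RBM, the building block of a DBN (matrix shapes and the learning rate are my own choices, not anything from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_vis, b_hid, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 update. v0: (batch, n_visible) binary data; W: (n_visible, n_hidden)."""
    h0_prob = sigmoid(v0 @ W + b_hid)                          # up-pass
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)   # sample hidden units
    v1_prob = sigmoid(h0 @ W.T + b_vis)                        # down-pass (reconstruction)
    h1_prob = sigmoid(v1_prob @ W + b_hid)                     # up again
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n       # positive - negative phase
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid

# toy usage: 64 visible units, 100 hidden units, batch of 32 binary vectors
rng = np.random.default_rng(0)
v = (rng.random((32, 64)) > 0.5).astype(float)
W = rng.normal(0, 0.01, (64, 100))
W, bv, bh = cd1_step(v, W, np.zeros(64), np.zeros(100))
```

You stack these: train one RBM on the data, then train the next RBM on the first one's hidden activations, and so on, before the backprop fine-tuning pass.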
Also, you can use an autoencoder instead of an RBM, which I think is getting better results these days? And there are better regularization and training tricks for backprop now: weight decay, momentum, L1/L2 regularization, dropout, and probably more that I'm leaving out.
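Those tricks amount to only a few lines in a training loop. A rough sketch of what I mean (hyperparameter values are arbitrary):

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=1e-4):
    # weight decay pulls weights toward zero; momentum smooths successive updates
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

def dropout(h, p=0.5, rng=np.random.default_rng(0)):
    # training-time dropout: zero units at random, rescale to keep the expectation
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)
```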
The pre-training (RBM or autoencoder) helps you avoid getting stuck in local minima, but there's also interesting research suggesting you're not so much getting stuck in local minima as in low-slope, high-curvature corridors that gradient descent is blind to. So people are looking into second-order methods that take curvature into account, letting you take big steps through these canyons and smaller steps where things are a bit steeper. Or something like that :-)
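A toy illustration of that intuition (my own example, on a 2D quadratic standing in for a loss surface): gradient descent's step size is capped by the steepest direction, so it crawls along the canyon floor, while a Newton step rescales by the inverse curvature and moves freely along it.

```python
import numpy as np

# a quadratic "canyon": curvature 100 across, 0.01 along the floor
H = np.diag([100.0, 0.01])
x = np.array([1.0, 1.0])
grad = H @ x                            # gradient of f(x) = 0.5 * x^T H x

gd = x - 0.01 * grad                    # lr limited to ~1/100 by the steep direction
newton = x - np.linalg.solve(H, grad)   # rescale each direction by inverse curvature

print("gradient descent:", gd)          # [0.0, 0.9999]: barely moves along the canyon
print("newton step:     ", newton)      # [0.0, 0.0]: exact minimum for a quadratic
```

Real second-order methods (Hessian-free, etc.) never form the full Hessian, but this is the effect they're after.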
All that being said, anyone care to weigh in on the pros/cons of RBMs vs. something like a contractive autoencoder? No such thing as a free lunch, so what are the key selling points of RBMs at this point? I keep seeing them pop up, but afaik they don't provide a particular advantage over autoencoder variants.
Great article, though; I'm really glad to see more and more people getting interested in neural networks. They've come a long way, and people are just starting to wake up to that.
For some problems, it may be nice to have a generative model, as offered by RBMs (although Rifai et al. recently published a sampling method for contractive auto-encoders: http://icml.cc/2012/papers/910.pdf). I feel like with RBMs you can design models that incorporate prior knowledge more "easily" (you may end up with pretty complex models...), e.g. the conditional RBM, the mean-covariance RBM, or the spike & slab RBM. Additionally, there's the deep Boltzmann machine, which consists of multiple layers that are jointly trained in an RBM-like fashion.
Auto-encoders are straightforward to understand and implement. With contractive terms or denoising, they are powerful feature extractors as well.
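To back up the "straightforward to implement" claim, here's a minimal sketch of one SGD step for a tied-weight denoising autoencoder with sigmoid units and cross-entropy loss (shapes, corruption level, and learning rate are all illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_step(x, W, b, c, lr=0.1, noise=0.3, rng=np.random.default_rng(0)):
    """One SGD step on one example x in [0,1]^n_visible. W: (n_visible, n_hidden)."""
    x_tilde = x * (rng.random(x.shape) >= noise)   # corrupt: zero out some inputs
    h = sigmoid(x_tilde @ W + b)                   # encode the corrupted input
    y = sigmoid(h @ W.T + c)                       # decode with tied weights
    d_pre_y = y - x                                # reconstruct the *clean* input
    d_pre_h = (d_pre_y @ W) * h * (1.0 - h)        # backprop through the encoder
    W -= lr * (np.outer(d_pre_y, h) + np.outer(x_tilde, d_pre_h))
    b -= lr * d_pre_h
    c -= lr * d_pre_y
    return W, b, c
```

The denoising trick is just that one corruption line: the model has to use structure in the data, not the identity map, to reconstruct what was zeroed out.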
But as you already noted, if you "just" want a good classifier, I think it pretty much boils down to personal preference, since you're going to spend some effort making either technique work well on your problem anyway.
We could also pipe the raw data through an RBM and then slap an SVM or some other classifier on top.
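That combination is a few lines with scikit-learn's BernoulliRBM; here's a sketch with synthetic stand-in data (the component count, learning rate, and dataset are arbitrary):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((200, 64))  # stand-in for real data; BernoulliRBM wants values in [0, 1]
y = (X[:, :32].sum(axis=1) > X[:, 32:].sum(axis=1)).astype(int)

# unsupervised RBM feature extraction, then a linear classifier on top
model = Pipeline([
    ("rbm", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20)),
    ("svm", LinearSVC()),
])
model.fit(X, y)
print(model.score(X, y))
```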
In late 2006/early 2007 I was working a lot with standard two-layer feed-forward neural networks (first for my research and then for my job). Hinton had a great paper on practical deep networks at NIPS 2006 (a big AI/machine learning conference), which sparked my interest in more complex neural networks. I had read Hawkins' book a few years earlier, and my impression of it was somewhat negative; I thought it was a really interesting book, but it was too fluffy and high-level. He hit a lot of points about hierarchies in intelligence that were intriguing but not new or drastically insightful. After NIPS I downloaded some of Numenta's code (Numenta is Hawkins' company), and it was pretty slow on toy problems, so I didn't spend too much time with it - this isn't a judgement of their code, I just didn't have the time to dig deeply into it. My impression at the time, which may be unfair, was that Numenta's approach was ad hoc while Hinton's was principled. I was negatively biased by Hawkins' book and by my professors' opinions of him vs. Hinton.
Neural network techniques that work so well on small, easy, trivial datasets like MNIST do not generalize to more serious datasets, and that's where the "and this is where the magic happens" component is needed.
A recent blog post by Jeff:
And more detailed information on the technology (I would recommend the CLA white paper):
(on the positive side, the Internet made me loads of international friends, and there are truckloads of things I could not have learned without it)
> Now, when I say Artificial Intelligence I’m really only referring to Neural Networks. There are many other kinds of A.I. out there (e.g. Expert Systems, Classifiers and the like) but none of those store information like our brain does (between connections across billions of neurons).
This is a middle-brow dismissal of almost the entire field of A.I. because it does not meet an unnecessarily narrow restriction. (Which, by the way, neural nets don't meet either. Real neurons are analog-in, digital-out, stochastic processes whose behavior is influenced by neural chemistry, physical interconnectivity, and timing, among other things not accurately modeled at all by any neural net. Neural nets model the mechanisms of the brain more closely, but they are far from equivalent, and as a CogSci student you should know that.)
A.I. is the science of building artifacts exhibiting intelligent behavior, intelligence being loosely defined as what human minds do. But in theory and in practice, what human minds do is not the same thing as how they do it.
The human mind does appear to be a pattern matching engine, with components that might indeed be well described as a hidden Markov model or restricted Boltzmann machine. It may be that our brains are nothing more than an amalgamation of some 300 million or so interconnected hidden Markov models. That's Ray Kurzweil's view in How to Create a Mind, at any rate.
However it is a logical fallacy to infer that neural nets are the only or even the best mechanism for implementing all aspects of human-level intelligence. It's merely the first thing evolution was able to come up with through trial and error.
Take the classical opposite of neural nets, for example: symbolic logic. If given a suitable base of facts to work from and appropriate goals, a theorem prover on your cell phone could derive all mathematics known up to the early 20th century (and perhaps beyond), without the possibility of making a single mistake. And do it on a fraction of the energy you spend splitting a bill and calculating tip. A theorem prover alone does not solve the problem of initial learning of ontologies or reasoning about uncertainty in a partially observable and even sometimes inconsistent world. But analyzing memories and perception for new knowledge is a large part of what human minds do (consciously, at least), and if you have a better tool, why not use it?
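To make the symbolic-logic side concrete, here's a toy forward-chaining engine in Python (nothing like a real theorem prover, and the rule encoding is my own simplification, but it shows the flavor: derivations that are mechanical and mistake-free given the facts):

```python
def forward_chain(facts, rules):
    """Derive everything reachable from `facts` using Horn-clause `rules`.
    rules: list of (set_of_premises, conclusion) pairs."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)   # modus ponens: premises hold, so conclude
                changed = True
    return known

# a two-step chain: p -> q, then (q and r) -> s
rules = [({"p"}, "q"), ({"q", "r"}, "s")]
print(forward_chain({"p", "r"}, rules))  # {'p', 'q', 'r', 's'}
```

A real prover adds unification over variables and smarter search, but the core loop of exhaustively, soundly grinding out consequences is exactly this cheap.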
Now I myself am enamored of Hinton-like RBM nets. This sort of unsupervised deep learning is probably a cornerstone of creating a general process for extracting any information from any environment, a central task of artificial general intelligence. However, compared with specialized alternatives, neural nets are hideously inefficient for many things. Doesn't it make sense, then, to use an amalgam of specialized techniques when applicable, and fall back on neural nets for unstructured learning and other non-specialized tasks? Indeed, this integrative approach is taken by OpenCog, although they plan to use DeSTIN deep learning instead of Hinton-esque RBMs, in part because the output of DeSTIN is supposedly more easily stored in their knowledge base and parsed by their symbolic logic engine.
(I wrote it... :-)
It discusses how to choose a learning algorithm, select hyperparameters, pick the number of hidden units, etc.