Before deep learning, people would manually design all these extra features (sin(x_1), x_1^2, etc.) because they thought it was necessary to fit this swiss roll dataset.
So they would use a shallow network with all these features like this: http://imgur.com/H1cvt8d
Then the deep learning guys realized that you don't have to engineer all these extra features: you can just use the basic features x_1 and x_2 and let the network learn more complicated transformations in subsequent layers.
So they would use a deep network with only x_1, x_2 as inputs:
Both these approaches work here (loss < 0.01). The difference is that for the first one you have to manually choose the extra features sin(x_1), x_1^2, ... for each problem, and the more complicated the problem, the harder it is to design good features. People in the computer vision community spent years and years trying to design good features for tasks like object recognition. But finally some people realized that deep networks could learn these features themselves. And that's the main idea in deep learning.
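To make the contrast concrete, here is a rough sketch (illustrative only, not the playground's code) of the two input representations:

    import numpy as np

    # Hand-engineered inputs, roughly the extra features the playground exposes:
    def engineered_inputs(x1, x2):
        return np.array([x1, x2, x1**2, x2**2, x1 * x2, np.sin(x1), np.sin(x2)])

    # Raw inputs for a deep network: just the two coordinates; the hidden layers
    # are left to learn whatever nonlinear transformations they need.
    def raw_inputs(x1, x2):
        return np.array([x1, x2])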
Would it make sense for them to add a gallery of good solutions for each problem, or would they all basically be your second example network (no time to play and see for myself right now)?
It's probably worth pointing out that this is true for ANNs, but there were (and are) other "shallow" classifiers that can handle the swiss roll problem without manual feature encoding. SVMs, for example.
If N levels off, then the network has grasped the concept of a spiral and can generalize to arbitrary size.
If N doesn't level off, then the network isn't really learning the general case.
So, I ported their swiss roll dataset to python and threw together a shallow network trainer with theano:
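A minimal sketch of what such a shallow Theano trainer might look like (illustrative assumptions: tanh hidden layer, softmax output, plain gradient descent; not the actual script):

    import numpy as np
    import theano
    import theano.tensor as T

    rng = np.random.RandomState(0)
    n_in, n_hidden, n_out = 2, 36, 2   # two coordinates in, 36 hidden units, 2 classes

    def init(shape):
        return theano.shared(rng.normal(scale=0.1, size=shape).astype(theano.config.floatX))

    X = T.matrix('X')    # (n_samples, 2): x1, x2
    y = T.ivector('y')   # class labels, 0 or 1

    W1, b1 = init((n_in, n_hidden)), theano.shared(np.zeros(n_hidden, dtype=theano.config.floatX))
    W2, b2 = init((n_hidden, n_out)), theano.shared(np.zeros(n_out, dtype=theano.config.floatX))

    hidden = T.tanh(T.dot(X, W1) + b1)
    p_y = T.nnet.softmax(T.dot(hidden, W2) + b2)
    loss = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])   # cross-entropy

    params = [W1, b1, W2, b2]
    updates = [(p, p - 0.1 * g) for p, g in zip(params, T.grad(loss, params))]

    train = theano.function([X, y], loss, updates=updates)
    # call train(batch_x, batch_y) repeatedly on the spiral data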
Then, I trained a shallow network with 36 hidden units (your deep net has 6 units and 6 layers):
edit: I forgot to mention that the shallow network above takes only the two coordinates (x1 and x2) as input features.
It feels like neurons in the first layer are weaker, because all they can do is a linear separation. Given deep networks, I was wondering if adding neurons to the first layer was better than adding them to the last one, and empirically it seems to be noticeably worse. I wonder if there is a theorem around that.
Correct, but keep in mind that their method appears to use batch descent while mine does not. Batch descent often converges more quickly. There are other differences between my net and the GP's that I can spot as well (e.g., the activation function, the learning rate, and regularization).
Also keep in mind that I threw this together over breakfast, and did not spend much time tweaking parameters :)
Also, how do you know to choose a ReLU instead of a tanh activation?
6 layers is the maximum that this demonstration allows, and they kept it small-ish to show that you don't need that many to have good results.
It may just be that 'batched cumulative learning' (I don't know if there is already a term for this) gets a better fit than just learning from a smaller set of data.
Edit: Did a quick test, regenerating about every 50 and 100 iterations, and convergence does seem faster (at least, when a clear spiral is formed). https://imgur.com/a/OPjXb
In a normal situation, you obtain a list of input/output pairs (say, images as input and a digit as output, for learning handwritten digits). You separate it into training data (which actually improves the net) and testing data (to detect overfitting), and you don't get more data than that.
Here, you can generate more data for free, as we have the function we want to approximate. Having more data will often result in a better result and faster convergence.
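For example, a fresh batch of spiral points could be minted on every iteration with a small generator along these lines (purely illustrative; the playground's actual generator may differ):

    import numpy as np

    def make_spiral(n_per_class, noise=0.1, seed=0):
        """Two interleaved spiral arms, one per class (illustrative only)."""
        rng = np.random.RandomState(seed)
        t = np.sqrt(rng.uniform(0, 1, size=n_per_class)) * 3 * np.pi
        arm = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
        points, labels = [], []
        for label, sign in enumerate([1.0, -1.0]):
            points.append(sign * arm + noise * rng.randn(n_per_class, 2))
            labels.append(np.full(n_per_class, label))
        return np.vstack(points), np.concatenate(labels)

    X, y = make_spiral(200)   # fresh labelled data is free, so this can be re-run any time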
I tried the swiss roll with a shallow network on the demo (and the results are not excellent, but it matches)
"Neural networks" are a really really overloaded term. A ton of stuff referred to as "neural networks" has little to do with the "neural networks" that are used in the machine learning community.
"A Computational Intelligence-Based Genetic Programming Approach for the Simulation of Soil Water Retention Curves"
I also use the term ANNs over just NNs to keep it to the silicon, and not wetware ;) Although, they did hook up a small ANN to a cockroach once, IIRC...
Generally, where it's actually being used, they are a bit quiet about how they go about getting the results they do. While the genetic bit is easy, the secret sauce is in guiding the learning/evolution in ways that work for the particular problem domain.
Gene covers a lot of ground. Somebody has done some transliteration to Elixir too; I use LFE, since staying with Lisp bridges the gap between my GP work and what Gene has done with Erlang, ANNs, and EC. For GP, you really need to be able to create new forms with macros, or it is more in line with GA than GP. To quote an excerpt from Robert Virding, co-designer of Erlang and creator of LFE, addressing Elixir's macros (or messing with Erlang's modules) vs. LFE's or Lisp's macros on HN previously:
"There is syntactic support for making the function calls look less like function calls but the macros you define are basically function calls.
I'm not even joking. Trial and error. Having good "intuition" about past ideas and the basic building blocks to guide that trial and error. Reading research papers and seeing what other people did well with, and using that.
As an aside, this is the principal reason I am skeptical of grandiose claims about deep learning.
I can imagine that the advanced models use many, many machines and only deliver results after a long training time. Genetic programming is not feasible then, if you cannot get a quick grasp of the potential results of a model.
Put another way: with evolution you have to stumble around blindly in parameter space and rely on selection to keep you moving in the right direction. With the gradient descent that neural networks use, you get, essentially for free, knowledge of the (locally) best direction to move in parameter space.
The bigger the models, the more this matters. Modern neural networks have millions or even billions of parameters, and that's been crucial to their expressive power. Good luck learning a program tree with a billion nodes using evolution. It might take 4.54 billion years.
And then only if you have a system powerful enough to accurately simulate a planet full of molecules.
Although I do think there is a balance between GA and structured NN which will lead to faster and better results than the deep NN alone. We already see some of the best deep NNs incorporating specific structures.
I brainstormed for a while about using genetic algorithms to decide the network topology. I'm glad someone else invented that already! Less work for me to do now.
Of course, I wasn't up-to-speed enough to know the right terms to look for, so thanks for sharing. :)
I am curious though... it seems like it would take orders of magnitude more computing power to not only train but evolve and re-train the networks. Is this practical with today's hardware?
As far as I can see the API isn't much like TensorFlow.
On my phone (Safari/iOS 9.3), the default neural network doesn't converge at all even after 300 iterations while it does on the desktop, which is legit weird: https://i.imgur.com/KNaXeHH.png
I saw the play button very clearly when the page loaded, then promptly got distracted by all the dials and knobs. :-P
Something about the high-speed updating makes me think of WOPR, in 'War Games', scoring nuclear-war scenarios.
For someone (like me) who's done a bit of reading but not much implementation, this playground is fantastic!
Add some noise, use all the inputs, and one 8-wide hidden layer
edit: works better with a sigmoid activation, but it converges more slowly
No need to mess with noise or regularization :)
This actually makes the dataset harder to fit to. It is not the same thing here as the "training with noise" method where random noise would be added to each batch, as an alternative means of Tikhonov regularization.
The noise doesn't go far enough to start confusing points between different clusters, but it adds more points.
That said, my knowledge of neural nets is fairly limited.
I don't know if that's a general feature to need fewer neurons with each layer, but that seems to work here.
Think of the whole neural net as a function:
input * weight = output
At each iteration, we feed in the input to the neural net. Then the neural net compares what output it gets to the correct output.
For example, input1 is 5, and the correct output for input1 should have been 2. But the neural net got 3 as the output. So it then decreases the weights slightly so it would get 2.75 next time it has input of 5. Repeat thousands of times. That's the basic idea for machine learning and neural networks.
The algorithm it uses to figure out how much to decrease the weights is called "backpropagation", which uses gradient descent. To explain gradient descent, think of a roller coaster track. Imagine the roller coaster starts off at a random location on the track. Then gravity takes the roller coaster down the track until it ends up at a low point between two hills and stays there. This is the new location of the roller coaster. This new location is nice because it has the lowest energy the roller coaster could find, so it stays there. (We use derivatives to figure out the slope of a curve, which then gives us the direction where the curve goes downhill.)
In neural networks, the roller coaster curve is the "cost function", which basically calculates the amount of difference between the neural net's output and the actual correct output it should have got. The initial weight is the roller coaster's initial position. The new weight is the roller coaster's final position, at the bottom of the cost function curve. This new position thus gives us the lowest cost.
Note that there may be even lower valleys, but when we roll the rollercoaster it stops at its nearest low valley. This is why we randomize the weights at the beginning - to put the roller coaster near possibly even lower valleys.
 - https://www.coursera.org/learn/machine-learning
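As a toy illustration of the idea (one weight, a squared-error "roller coaster" curve, made-up numbers):

    weight = 0.6            # random-ish starting point on the "track"
    learning_rate = 0.01

    for step in range(200):
        x, target = 5.0, 2.0                # input is 5, correct output should be 2
        output = x * weight                 # the whole "network": input * weight = output
        error = output - target             # how far off we are
        gradient = 2 * error * x            # slope of the cost (output - target)**2 w.r.t. the weight
        weight -= learning_rate * gradient  # roll a little way downhill

    print(5.0 * weight)   # close to 2.0 once the "roller coaster" settles in the valley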
It's training a neural network to classify a data set with two classes (orange or blue), and the data has two features (x1 and x2). All the orange and blue dots are the training data. So if you take a dot on the graph with coordinates (-2, 4) and it's blue, that would mean that a data point with x1 = -2 and x2 = 4 has the class blue.
You can think of a neural network as a function that can take in arbitrary features (in this case x1 and x2) and tries to output the correct class. That's what the orange and blue colors in the background are, the neural network's guess at the correct classification for any given point (x1, x2).
When you hit play, it iterates through the training data making adjustments to each neuron in the network so that it gets closer to predicting the right class.
If you want to see how well the neural network performs on data it wasn't trained on, you can click "show test data".
If you're expecting a lesson, you'll likely be disappointed, but I think there's real value in a true playground.
I think the biggest improvement would be if, when hovering over a 'neuron', you get a visual representation of what feeds into it.
> Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.
On each iteration, it calculates how bad the predicted output is, then adjusts the weights between neurons to lessen that value. Google backpropagation for more info
So you see the first neuron's input is just x1. You can see in the little graph at x1 that it's split down the middle with orange on one side and blue on the other. You can think of adjusting the weight on that neuron as adjusting where along the x axis the split occurs. All points on the orange side are classified orange and all on the blue side are classified blue. If you picked a data set like the spiral one or whatever, that neuron alone isn't going to make very many correct classifications. That's because it only gets the x1 value as input and can only affect the output by multiplying x1 by some weight, which would only have the effect of shifting the classification boundary left or right. You can see the same thing happening for the second neuron with input x2, except that now it splits along the y axis. Again, that alone isn't going to match the data very well.
But then you get to the second layer, and the input of each neuron in the second layer is the output of each neuron in the first layer. So these neurons are able to take into consideration both x1 and x2 and are able to divide the data in more complex ways. So you can think of the neurons in each layer of the neural network as being able to consider more and more complex properties of the data in forming its output.
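Here is a purely illustrative sketch of that composition, with hand-picked weights: each first-layer unit can only split on one coordinate, but the second layer sees both splits at once:

    import numpy as np

    def first_layer(x1, x2):
        h1 = np.tanh(3.0 * x1)   # can only split the plane along the x1 axis
        h2 = np.tanh(3.0 * x2)   # can only split the plane along the x2 axis
        return h1, h2

    def second_layer(h1, h2):
        # sees both first-layer outputs, so it can carve out a corner region
        return np.tanh(2.0 * h1 + 2.0 * h2 - 1.0)

    print(second_layer(*first_layer(1.0, 1.0)))    # ~ +1: both splits satisfied
    print(second_layer(*first_layer(-1.0, 1.0)))   # ~ -0.8: only one split satisfied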
The neural network is essentially the nodes in the middle, linked together by various weights. During training, the training data points are fed forward through the network, producing an output. The error in that output is then fed backward using something called "backpropagation", which is used to adjust the weights.
Typically, the more hidden layers or nodes per layer, the more complex the decision boundaries that can be learned. Zero hidden layers essentially gives a linear model that can only be used to split very basic, linearly separable data (drawing a straight line to separate the different types).
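In code terms, the zero-hidden-layer case reduces to something like this (illustrative only; which class ends up on which side depends on the learned weights):

    def predict(x1, x2, w1, w2, b):
        # with no hidden layers the score is linear in the inputs, so the decision
        # boundary w1*x1 + w2*x2 + b = 0 is a straight line in the plane
        return "blue" if w1 * x1 + w2 * x2 + b > 0 else "orange"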
Neural networks have lots of little knobs and levers you can adjust. That's what all these inputs are that you see.
But that technique would not work when you cannot see that it is a "swiss roll", or when the data is in many dimensions.
The 1-node case is especially interesting, because when it converges the single node must learn the whole spiral pattern. Although with noise it can be less reliable, with more jagged edges, and it can take longer to converge (I also bumped the learning rate down), seeing the spiral encoded directly in the 2nd hidden layer is more interesting to me.
I find the pulsating unsightly.
Will N level off, meaning that it will really understand the structure of the spiral?
Eventually it will have to be recognized as a new species of life, so I hope programmers, tinkerers, and everyone else keep that in mind, because all life must be respected.
And this particular form will be our responsibility, we can either embrace it as we continue to merge with our technology, or we can allow ourselves to go extinct like so many other species already have
For the naysayers - ever notice how attached we are to our phones? Many behave as if they are missing a limb without it - it's because they are, the brain adapts rapidly and for many, the brain has adapted to outsourcing our cognition. It used to be books, day runners, journals, diaries - now we have devices and soon they'll be implants or prosthetics
The writers at Marvel who came up with the idea of calling Iron Man's suit a prosthetic were definitely onto something, and suits like that are probably our best chance of successfully colonizing other planets. We'll need AI to be our friend out there, working with us.
In the "swiss cake roll" the circular nature of the classes suggests using a sin or cos function, and the fact that they spiral out suggests also inputting magnitude information. Sure, you can just add more neurons that will end up computing the same thing, but we might as well give the computer a head start when we can.
I'd quite like if you could define your own input patterns and data sets.
And great for challenging your friends in an epic battle of convergence!
To "operate" neural networks (as opposed to writing a framework for them), you need to know the building blocks. There are basic blocks like fully connected layers, convolutions, and nonlinear activations. Beyond those, there are higher level building blocks like LSTMs, gated recurrent units, highway layers, batch normalization, and residual blocks that are made up of simpler blocks. Learning what these do and when it's appropriate to use them requires following current literature.
Operating neural networks requires some systems engineering skill. It takes a long time to train a single network and you'll find yourself trying many different architectures and hyperparameters along the way. Because of this, you'll want to distribute the training across many different systems and be able to easily monitor and deploy jobs on those systems.
A solid grasp of mathematics is useful to effectively debug your networks. You'll frequently find your network doesn't converge or gives totally garbage results, so you need to know how to dig into the network internals and understand how everything works. This is especially true if you're implementing a new building block from a paper.
Finally, know your machine learning and statistics fundamentals. Understand overfitting, model capacity, cross validation, probability, model ensembles, information theory, and so on. Know when a simpler model is more appropriate.
If you've played around with it a bit, I'm sure you have seen that deeper layers are hard to train... You see the dashed lines representing signal in the network become weaker and weaker as the network gets deeper. BatchNorm works wonders with this. It takes statistics from the minibatch of training examples, and tries to normalize it so that the next layer gets input more similar to what it expects, even if the previous layer has changed. In practice you get a much better signal, so the network can learn a lot more efficiently.
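Roughly, the core computation looks like this (a minimal sketch; real implementations also keep running averages of the statistics for use at test time):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: minibatch of activations, shape (batch_size, n_features)
        mean = x.mean(axis=0)                    # per-feature statistics from this minibatch
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)  # normalize to roughly zero mean, unit variance
        return gamma * x_hat + beta              # learned scale/shift keep the layer flexible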
Without BatchNorm, more than two hidden layers is tedious and error-prone to train. With it, you can train 10-12 layers easily. (With another recent advance, residual nets, you can train hundreds!)
Such advances push the limit of what you can train easily, and what still requires GSD ("graduate student descent": figuring out just the right parameters to get something to work through intuition, trial and error). You still have to watch out for overfitting, but the nice thing about that is that more training data helps.
+ Designing the network architecture is a means to instill your knowledge of the problem into the network. For example, using convolutions over images encodes some translational invariance into the network. It makes up for lack of data. I don't think data augmentation alone is enough, either: if you use a "stupid" architecture with heaps of data, the computation will become too expensive or slow.
- The systems engineering part will probably get automated. I bet there are Amazon engineers crying at their desks while working on AWS Elastic Tensorshift right now. So unless you're specifically interested in that side of things, maybe this isn't the best area to focus on.
+ There are always going to be problems, so knowing how to debug is a useful skill.
+ ML/stats fundamentals aren't going away. You need to know what you're trying to do before you can do it.
(Nice, but it's completely unclear what's going on.)