But I really appreciate these kinds of write-ups: he declares his non-expertise up-front, and then proceeds to document his understanding as he goes along. There's something useful about this kind of blog post for non-experts.
I'm working my way through Karpathy's writeup on RNNs (http://karpathy.github.io/2015/05/21/rnn-effectiveness). I've mechanically translated his Python to Go, and even managed to make it work. But I still don't entirely understand the math behind it. Now obviously Karpathy IS an expert, but despite his extremely well-written blog post, a lot of it is still somewhat impenetrable to me ("gradient descent"? I took Linear Algebra oh, about 25 years ago). So sometimes it's nice to see other people who are a bit bewildered by things like tanh(), yet still press on and try to understand the overall process.
And FWIW I had the same reaction as the author when I started toying around with neural nets: it's shocking how small the hidden layer can be and still do useful stuff. It seems like magic, and sometimes you have to run through it step by step to understand it.
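To make that concrete, here's a minimal numpy sketch (my own toy example, not from the post): a network with only two hidden units can learn XOR, which no single linear unit can represent. XOR does have local minima, so an unlucky seed may occasionally need a restart.

    import numpy as np

    rng = np.random.default_rng(0)

    # Four XOR examples: the output is 1 exactly when the two inputs differ.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(size=(2, 2))             # input -> hidden (only 2 hidden units)
    b1 = np.zeros((1, 2))
    W2 = rng.normal(size=(2, 1))             # hidden -> output
    b2 = np.zeros((1, 1))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 0.5
    for _ in range(10000):
        h = np.tanh(X @ W1 + b1)             # forward pass
        p = sigmoid(h @ W2 + b2)
        dp = p - y                           # cross-entropy gradient w.r.t. the logits
        dW2, db2 = h.T @ dp, dp.sum(0, keepdims=True)
        dh = (dp @ W2.T) * (1 - h ** 2)      # backprop through tanh
        dW1, db1 = X.T @ dh, dh.sum(0, keepdims=True)
        for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            param -= lr * grad               # plain gradient descent

    print(np.round(p, 2))                    # usually close to [[0], [1], [1], [0]]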
Also, definitely +1 for not putting down people who write similar posts. I encourage everyone who is trying to learn to do it through blog posts, because it lets you explain and organize your thoughts. I also enjoy reading them quite a bit, because they illustrate the kinds of conceptual problems beginners face (which are not at all obvious once you've been in the area for a few years). And it's also interesting to see many different interpretations of the same concepts, since everyone has a different background and the way they reason through things is usually quite unique. Granted, this one could have been named something more appropriate!
It's really wonderful that all of this is freely available, thank you.
I think this style of teaching has great value. Someone who's learning something themselves is the person most suitable to teach it to others, since they know exactly what a novice user doesn't know. For example, I wanted to write something up for monads the other day, since it's a simple concept that's made super confusing by people who dive into mathematical notation right away. The downside with this approach is that the novice lacks experience, so what they're learning may not be entirely accurate.
I think the best approach is a hybrid: Someone who is learning the material explains it, and someone who already knows it points out mistakes. In this case, HN can serve as the expert, and we all end up with a very informative post.
If we/I ever do it, I'll make sure to send you a link. :)
Gradient descent is basically the approximate solution: computing the exact solution would require inverting large matrices, which apparently isn't yet feasible (it's too slow).
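As a rough illustration of that trade-off (my own toy example, not from the thread): for plain least squares you can compare the exact solve of the normal equations against gradient descent, which only ever needs cheap matrix-vector products.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))
    w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = X @ w_true + 0.01 * rng.normal(size=1000)

    # "Exact" answer: solve the normal equations (a matrix solve/inverse,
    # which gets expensive as the number of parameters grows).
    w_exact = np.linalg.solve(X.T @ X, X.T @ y)

    # Gradient descent: repeated matrix-vector products, no inversion needed.
    w = np.zeros(5)
    lr = 0.1
    for _ in range(500):
        grad = X.T @ (X @ w - y) / len(y)    # gradient of the mean squared error
        w -= lr * grad

    print(np.round(w_exact, 3))
    print(np.round(w, 3))                    # converges to (nearly) the same answer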
I think gradient descent is attractive because it's a memoryless process at the batch level - you can process training data in batches instead of processing the entire dataset in one go, without any explicit tracking of the previous batch history. This is a great feature when the scale of your dataset is mind-boggling. I think this is what you were suggesting?
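A quick sketch of what I mean (hypothetical streaming data, purely for illustration): each update looks only at the current batch, so nothing about previous batches needs to be stored, just the parameters themselves.

    import numpy as np

    rng = np.random.default_rng(2)
    w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])

    def next_batch(size=64):
        """Stand-in for a data stream far too large to hold in memory."""
        Xb = rng.normal(size=(size, 5))
        yb = Xb @ w_true + 0.01 * rng.normal(size=size)
        return Xb, yb

    w = np.zeros(5)        # the only state carried from batch to batch
    lr = 0.05
    for _ in range(2000):
        Xb, yb = next_batch()
        grad = Xb.T @ (Xb @ w - yb) / len(yb)
        w -= lr * grad     # the update uses only the current batch

    print(np.round(w, 2))  # approaches w_true without ever seeing the full dataset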
It covers RNNs in lecture 8.
Some of this you'd expect from any modern language model: your phone keyboard's predictor already generates some grammatical-looking phrases, and there's only room to do so much on a phone. I'm more impressed that (as the author notes) it gets syntax right, including matched pairs over long distances, occasionally makes words up using plausible components, suffixes, etc. ("quanting"), and learns other quirks (e.g. there's Cyrillic after the [[ru:]] link and some Hangul after the [[ko:]] one). Those aren't intuitively things you'd expect from applying arithmetic to a bunch of characters.
Can't find the startup's name though.
I believe it was discussed here a few years back.
I think there are a few mistakes in your maths though. You can learn a 1-to-1 discrete mapping through a single node when you are using a one-hot vector: you just assign a weight to each of the input nodes, and then use a delta function on the other side. If I understood correctly, this is what you are doing.
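If I'm reading that right, here's a tiny sketch of the idea (illustrative numbers only): with one-hot inputs, a single linear node can realise any 1-to-1 discrete mapping, because the weight vector simply stores the target value for each symbol.

    import numpy as np

    targets = np.array([3.0, -1.0, 7.0, 0.5])  # desired output for each of 4 symbols
    w = targets.copy()                          # one weight per input unit

    one_hot = np.eye(4)                         # each row activates exactly one weight
    print(one_hot @ w)                          # [ 3.  -1.   7.   0.5]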
Also, if you use a tanh in your input layer but keep a linear output layer (as you start off with), you are still doing a linear approximation, because you have a rank-H matrix (where H is the hidden layer size) that is trying to linearly approximate your input data. This is done optimally using PCA.
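On the PCA point, a small sketch (again my own, just to illustrate the rank-H claim): for centred data, the best rank-H linear reconstruction uses the top-H principal directions, and any other H-dimensional subspace gives a larger squared error.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))
    X -= X.mean(axis=0)              # centre the data

    H = 3
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    V_H = Vt[:H].T                   # top-H principal directions
    err_pca = np.linalg.norm(X - X @ V_H @ V_H.T) ** 2

    W = rng.normal(size=(10, H))
    P = W @ np.linalg.pinv(W)        # projection onto a random H-dim subspace
    err_rand = np.linalg.norm(X - X @ P) ** 2

    print(err_pca, err_rand)         # err_pca <= err_rand (Eckart-Young theorem)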
I'd second the advice to look into the Coursera courses, or the Nando de Freitas Oxford course on YouTube (which has a really nice derivation of backprop).
His overall point that the field is largely empirical with little solid theory to guide one's choices is, in my opinion, not wrong. It may be that a general theory doesn't exist and different network architectures and hyperparameters are needed for different types and sizes of problems.
(I say "almost" because while I haven't seen a single state of the art result without ReLUs, I'm sure there is probably a random paper out there)
For example: http://arxiv.org/abs/1308.0850
As for how well this works compared to other approaches, I don't know, and this blog post certainly doesn't address the question.
Right now he's a chef who obsessively spent six months thinking about knives. But it's about the meat and the heat, not the chef's knife.
Just pick a problem and start playing with it! If you lean toward this sort of pedantic "I MUST UNDERSTAND ALL FUNDAMENTALS" approach (I do!), I still recommend Matlab (available for $49-$99 in student editions) because the tools are so good and the docs are so damn readable. This post is basically a Matlab doc page minus the links to the stuff that actually works.
Don't pay to torture yourself! Use Python instead! Numpy and scipy are the standard libraries for most scientific code nowadays, and you'll be able to use Theano, Caffe, TensorFlow, PyMC, scikit-learn, pybrain, and loads of other machine-learning and statistics libraries with Python interfaces!
That said, I deploy using Python and several of these libraries (and more interesting unmentioned ones). But often, Matlab for exploratory dev! I re-recommend it for getting started.
I was a kid in 2007, and I'd prefer that you didn't insinuate I developed MVC apps!
Right now I'm trying to get transfer-learning to show up correctly in a Hakaru model translated from a Church model so I can do some extra information-theoretic measurements.
care to share more info?
Once you dig in and try to actually do something, you stumble on the cool stuff. It's often throwaway Python or Matlab spaghetti code in a .edu directory that starts with a tilde. And it disappears when you need it most.
(If it's Python, it's guaranteed to be something wonky that uses 2.7 if you're on 3.x or vice versa, or requires three additional packages that throw obscure build errors on your distro. And if it's Matlab, it inevitably requires you to shell out another $299 for some stupid toolbox just to generate white noise...)
Good times. I won't kill your fun by being overly specific. Dig in!
But more to the point of this discussion: the celebrated online class that teaches the first principles the author of this article flubbed uses Matlab (and Octave).
Andrew Ng's introductory Machine Learning Coursera uses Octave to teach programming a simple Neural Net that recognises handwritten digits (MNIST).
Consensus (I'll use the term loosely) was that you get what you pay for: either spend the bucks on Matlab or, preferably, use Python. I'll dare to provide a stronger opinion: Octave just sucks, and is one of those unfortunate GNU tools that forever seems stuck 15 years in the past. Many people moved to Python, we're all eyeing Julia, and the rest of us are largely stuck in Matlab.
Given the range of finances of a MOOC class of 160,000, Octave's price (free) was vital for some to participate.
Matlab looks really cool, but I feel it unfair to advocate un-Free software if my students can't take it home. I never got a chance to play with Matlab so can't really comment.
As you suggest, I graduated from Octave to Python, sometimes Lua & Torch7, and then to the subsets of C for CUDA & OpenCL GPU parallelism.
Professor Ng's MOOC students managed a good fundraiser for Octave development, so those who could, paid something.
> Matlab looks really cool, but I feel it unfair to advocate un-Free software if my students can't take it home.
Why is it your job to advocate anything? Inform them of what's available. Challenge yourself to make your lessons applicable to a broader range of tools, including ones your students might craft themselves. Let them know they can get the job done with GIMP but there's Photoshop at $9.99/month that might save them time and make them more employable.
$1999 MacBook, $649 phone, but somehow $99 for a critical piece of software is unacceptable. And people wonder why students have debt but can't find jobs.
I can only guess you never learned about sushi. Or any other cuisine in which high-precision cuts are the foundation/hallmark of excellence. Good chefs obsess over their knives. With good reason.
In contrast, enthusiasts spend six months obsessing about different types of knives. And then write a blog post detailing their thought process. Which, like this post, is often meaningless because it's a guy writing about knives who's never cut into anything.
Reading a few of your comments in this thread, it seems like you're heavily involved in this area of technology/research, but it's depressing that you're making comments that range from passive-aggressive to condescending to dismissive. Further, you've yet to actually offer constructive criticism about the post, or even detailed criticism, one way or another.
However, I can offer constructive input by providing broader perspective and alternate viewpoints. This technically weak, mis-titled post sat at the top of HN for most of a day. It's now a top hit on Google when you search "Neural Net Compression" -- ranked above classic research papers from the 80s and 90s. That's the depressing part.
Now create six layers; tag different sets of inputs as distinct to represent different neurochemicals (you'd need several excitatory and several inhibitory ones, plus a couple of very small "master" neurochemicals with major excitatory and inhibitory responses to represent the dopamine network, and a whole separate system for the amygdala); cluster different groups to either respond to inputs or create outputs; and set it loose on an environment. How close would we come to something that behaves as if it were conscious?
The human brain is recursive in nature - the neurons and synapses form a cyclic graph. You can't do that with a simple perceptron.
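Roughly what the cycle buys you, in a minimal sketch (illustrative weights only): a recurrent cell feeds its own previous state back in, so the state after the loop depends on the whole input sequence, while a plain feed-forward perceptron has no such path.

    import numpy as np

    rng = np.random.default_rng(4)
    W_xh = rng.normal(scale=0.1, size=(3, 5))  # input -> hidden
    W_hh = rng.normal(scale=0.1, size=(5, 5))  # hidden -> hidden: the cycle

    h = np.zeros(5)
    for x in rng.normal(size=(10, 3)):         # a sequence of 10 input vectors
        h = np.tanh(x @ W_xh + h @ W_hh)       # new state depends on the old state

    print(h)  # summarises the whole sequence, not just the last input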