Hacker's guide to Neural Networks (2012) (karpathy.github.io)
424 points by catherinezng 11 months ago | 39 comments

I've read so many of these, none of them include the information I need.

If someone wrote a "Hacker's Guide to Tuning Hyperparameters" or a "Hacker's Guide to Building Models for Production", I would read/share the shit out of those.

The problem is, this really is different on each task. It's really hard to state any generic rules of thumb. Everyone has their own default parameters and intuitions but I would say most of them are heavily biased to the tasks you have worked with. For example, I work with deep bidirectional LSTMs on acoustic modeling for speech recognition, and I use Adam with a starting learning rate of 0.0005, Newbob learning rate scheduling, pretraining, dropout of 0.1 in addition to L2 regularization, and so on. I tried to collect these results here: https://www-i6.informatik.rwth-aachen.de/publications/downlo...
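
A minimal sketch of a Newbob-style schedule, assuming the commonly described rule: hold the learning rate until the relative improvement in validation error drops below a threshold, then multiply it by a decay factor after every epoch. The function name, the threshold of 0.01, the decay factor of 0.5 and the validation errors are illustrative assumptions, not values from the linked paper, and the usual stopping criterion is left out.

```python
def newbob_learning_rates(val_errors, start_lr=0.0005, decay=0.5, threshold=0.01):
    """Return the learning rate to use after each validation measurement.

    Simplified, assumed variant of Newbob: once the relative improvement
    falls below `threshold`, the rate is multiplied by `decay` every epoch.
    Validation errors are assumed to be positive.
    """
    lr = start_lr
    decaying = False
    prev = None
    next_lrs = []
    for err in val_errors:
        # Relative improvement compared with the previous epoch
        if prev is not None and (prev - err) / prev < threshold:
            decaying = True
        if decaying:
            lr *= decay
        next_lrs.append(lr)
        prev = err
    return next_lrs


# Made-up validation errors: improvement stalls at the fourth epoch
rates = newbob_learning_rates([0.50, 0.40, 0.35, 0.349, 0.348])
print(rates)
```

With these made-up errors the rate stays at the starting value for three epochs and is then halved twice.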

I know nothing about any of this, but has there been any work into using neural networks to guide and vary the parameter values during training?

There has been some work related to what you are describing:

Learning to learn by gradient descent by gradient descent https://arxiv.org/abs/1606.04474

Related (but less so), there are also some papers about learning neural network architectures:

Designing Neural Network Architectures using Reinforcement Learning https://arxiv.org/abs/1611.02167

Neural Architecture Search with Reinforcement Learning https://arxiv.org/abs/1611.01578

This is great, thank you, added it to tonight's reading list.

Both are tricky subjects that involve a lot of cooking-like thinking (there are popular recipes and general rules, but in practice mastering them takes a lot of non-transferable experience and knowledge).

Some pointers:

* http://rishy.github.io/ml/2017/01/05/how-to-train-your-dnn/

* http://www.alexirpan.com/2017/06/27/hyperparam-spectral.html

At deepsense.io we are developing Neptune to facilitate that process (compare models, version-control them), here: https://go.neptune.deepsense.io (a public release in less than a month!).

Another resource - https://sigopt.com/, in the vein of Whetlab.

e.g. For NLP:

* https://aws.amazon.com/blogs/ai/fast-cnn-tuning-with-aws-gpu...

Disclaimer: they did share free trials with some of our students.

It looks interesting, and I talked with these guys (including the author) at a conference (GPU Tech Conf in San Jose). However, I am kind of anxious about adding a black box to systems that are otherwise much more transparent.

Especially as most neural network tweaking goes well beyond hyperparameter tweaking: one needs to adjust the architecture, set up sensible cross-validation, use good augmentation, etc.

The problem is that these guides would have to be domain specific. I can imagine, e.g., a "Hacker's Guide to Tuning Hyperparameters for NLP", but it would have different recommendations than a similar guide for image processing or financial dataset analysis.

That makes sense. But what I need to know are the thought processes that go into determining the hyperparameters. Most of the time my process amounts to slightly-educated guess and check. Are there smarter ways to go about that? What tools or analyses should I be using to see the effects of my tweaks?

Thought process #1 is to test, measure and document everything. Change one thing at a time; if you have two modifications that you want to try and have the resources, then it'd be best to run (measure) A, B and A+B instead of both of them at the same time. If early in your experiments adding A was a mildly beneficial thing, it may be the case that after extensive modifications it's not anymore, but you can try and check that. This obviously means that you need a simple, mostly automated way to run repeatable experiments and document their results.
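
A rough sketch of what running A, B and A+B as separate, repeatable experiments could look like. `run_experiment` is a hypothetical stand-in for a real training run (it returns a made-up score), and the flag names and seeds are assumptions for illustration only.

```python
import itertools
import json
import random


def run_experiment(config, seed):
    """Hypothetical stand-in for a training run; returns a fake validation score."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    score = 0.70
    if config["A"]:
        score += 0.02
    if config["B"]:
        score += 0.01
    return score + rng.gauss(0, 0.001)


def ablation_grid(flags, seeds):
    """Run every on/off combination of the flags: baseline, A, B, A+B, ..."""
    results = []
    for values in itertools.product([False, True], repeat=len(flags)):
        config = dict(zip(flags, values))
        scores = [run_experiment(config, s) for s in seeds]
        results.append({
            "config": config,
            "seeds": list(seeds),
            "mean_score": sum(scores) / len(scores),
        })
    return results


results = ablation_grid(["A", "B"], seeds=[0, 1, 2])
for row in results:
    print(json.dumps(row))  # one documented line per configuration
```

Averaging over a few seeds per configuration makes it easier to tell a real effect of a modification from run-to-run noise.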

Thought process #2 is to read lots of papers that go into details on solutions for similar problems, see what works and doesn't work for them, try to understand if the factors that make it useful apply to you as well (e.g. size or type of data may mean that your experience is likely to be opposite) - and, of course, try and evaluate.

Thought process #3 is to do error analysis, possibly with tooling to show the relations (e.g. for image analysis tasks). You definitely want to know what kinds of mispredictions you are getting and in what amounts, and that may help you (though not always) to understand why a particular type of misclassification occurs.
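
As a small illustration of counting mispredictions by kind, here is a sketch that tallies (true label, predicted label) pairs for the errors only. The labels and predictions are made-up data, and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs for misclassified examples only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


# Made-up labels and predictions for illustration
y_true = ["cat", "cat", "dog", "dog", "bird", "bird", "cat"]
y_pred = ["cat", "dog", "dog", "cat", "cat", "bird", "dog"]

errors = confusion_counts(y_true, y_pred)
for (true_label, predicted), count in errors.most_common():
    print(f"{true_label} -> {predicted}: {count}")
```

Sorting by count puts the most frequent confusion first, which is usually the one worth inspecting example by example.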

Technical analysis may also come into play, but IMHO that is more useful for debugging why something fails totally and not that useful for getting better accuracy on something that works really well (more useful for getting faster convergence to the same result). There are all kinds of metrics you may measure on your network, e.g. dead neurons for ReLU family, are your early layers stabilizing or not, etc. But again, problems of convergence at all or its speed and problems of converging to a not-good-enough optimum are quite different.
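
One of the metrics mentioned above, dead ReLU units, can be estimated with a few lines. This sketch assumes you have already collected post-ReLU activations for a batch as a list of rows; the data is made up, and a unit that is silent on one batch is only a candidate for being dead on the full dataset.

```python
def dead_relu_fraction(activations):
    """Fraction of units whose post-ReLU output is zero for every example.

    activations: list of rows, one per example, each a list of unit outputs.
    """
    num_units = len(activations[0])
    dead = sum(
        1
        for unit in range(num_units)
        if all(row[unit] == 0.0 for row in activations)
    )
    return dead / num_units


# Made-up activations: 3 examples, 4 units; units 0 and 2 never fire
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.4, 0.0, 0.0],
]
fraction = dead_relu_fraction(batch)
print(fraction)
```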

As far as I know, tuning hyperparameters and building models are mostly intuitive guesswork and experimentation based on some fundamental mathematical principles. I am not sure if that's equivalent to hacking in the canonical sense.

Not sure if this is what you want, but they are aiming to self tune parameters for you. http://www.ml4aad.org/

This has been submitted quite a few times in the past: https://hn.algolia.com/?query=karpathy.github.io%2Fneuralnet...

Aw. And I was hoping Chapter 3 and 4 would be finished sometime soonish. It's the only part of the guide that you can learn from by example.

As the author mentions, the CS231n course notes may be what you want. They build up many of the neural network primitives using just numpy and basic operations, so it's very easy to see exactly how it works at the code level.

I don't think it'll ever be finished.

Yep, Andrej leads Tesla Autopilot now. Doubt he'll be following up here.

yep :(

I guess ancestor commenters need only wait for a weekend now :)

It would be awesome if there was a section in Hacker News called Classics where this could be posted.

What would be Classic and what would not? I guess it would be a combination of votes, times posted, and a minimum age

Total votes would qualify and then perhaps a combination of views and comments that occur at least one month or more later. The goal is to track posts that have an extended relevancy beyond the current month. Ideally, beyond the current 3 months. Posts that are more than 3 years old but still get high votes when submitted should also be given special attention.

A solid grounding in probability theory and multivariate calculus is the first thing you should spend your time on if you want to understand NNs, ML and most of AI once and for all.

These hacker guides only scratch the surface of the subject, which, in part, contributes to creating this aura of black magic that haunts the field. I'm not saying that is a bad thing, but they need to be complementary material, not the way to go.

When it comes to backpropagation, the PyTorch introduction contains some valuable parts: http://pytorch.org/tutorials/beginner/deep_learning_60min_bl...

Static neural networks on Rosetta Code for basic things like Hello World, etc, would do a lot to aid in people's understanding of neural networks. It would be interesting to visualize different trained solutions.

Knew this wasn't for me when he had to introduce what a derivative was with a weird metaphor. I like this approach to teaching things (it's Feynman-y) but half the time I end up hung up on trying to understand a particular author's hand-waving for a concept I already grok.

Thank you for posting this! I hadn't seen it and have been looking for a simple guide like this one.

thanks for sharing, apparently i missed past submits

Hmm, I've just scanned through this, but it seems this gets the concept of stochastic gradient descent (SGD) completely wrong.

The nice part of SGD is that you can backpropagate even functions that are not differentiable.

This is totally missed here.

No, stochastic gradient descent (SGD) needs a differentiable loss function (differentiable with respect to the trainable parameters). The stochastic bit is that you want to optimize the loss for the whole dataset but what you actually do is to do each gradient descent step on a small mini-batch of the dataset.
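
To make the mini-batch idea concrete, here is a minimal sketch of SGD fitting a line with a mean squared error loss, which is differentiable in the parameters `w` and `b`. The data, learning rate, batch size and function name are assumptions chosen for illustration.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # the "stochastic" part: random mini-batches
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the loss over this mini-batch only, not the whole dataset
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each update uses the gradient of only four examples, yet the parameters still move toward the values that minimize the loss over the whole dataset.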

There might be extensions where you have some non-differentiable part and just assume some gradient for it in back-propagation, like Gumbel-Softmax (https://arxiv.org/abs/1611.01144). The reparametrization trick in VAEs (https://arxiv.org/abs/1606.05908) is similar. But those are special cases.

Maybe you are thinking about reinforcement learning?

How do you compute the gradient of a non differentiable function? I'm not an expert, but that contradicts anything I've learned about gradient descent.

Pseudo gradients are a thing. You can pretend it's a continuous function and get gradients that push it in the right direction.
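
A toy sketch of that idea, in the style usually called a straight-through estimator: the forward pass uses a rounding function whose true gradient is zero almost everywhere, and the backward pass pretends it is the identity. The target, starting point, learning rate and function names are assumptions for illustration.

```python
import math


def quantize(w):
    """Round to the nearest integer; piecewise constant, so its true gradient is 0 almost everywhere."""
    return math.floor(w + 0.5)


def train_with_pseudo_gradient(target=3, w=0.2, lr=0.1, steps=50):
    """Minimize (quantize(w) - target)**2 using a pretend gradient for quantize."""
    for _ in range(steps):
        q = quantize(w)
        # Assumption: treat d quantize(w) / d w as 1 instead of its true value 0
        grad = 2 * (q - target) * 1.0
        w -= lr * grad
    return w


w_final = train_with_pseudo_gradient()
print(w_final, quantize(w_final))
```

With the true gradient the update would always be zero and `w` would never move; the pretend gradient is what lets it reach a value that rounds to the target.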

As others have pointed out, SGD requires differentiability. The next best match I can think of is that you're actually thinking of subgradient methods, which mostly see use in convex optimization problems.
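
For comparison, a minimal sketch of a subgradient method on the convex but non-differentiable function f(x) = |x - 3|. The step size rule 1/k, the starting point and the function names are assumptions for illustration.

```python
def subgradient_abs(x, center):
    """A valid subgradient of f(x) = |x - center|; at the kink any value in [-1, 1] works."""
    if x > center:
        return 1.0
    if x < center:
        return -1.0
    return 0.0


def subgradient_descent(x=0.0, center=3.0, steps=200):
    """Minimize |x - center| with diminishing step sizes 1/k."""
    best = x
    for k in range(1, steps + 1):
        x -= (1.0 / k) * subgradient_abs(x, center)
        # Subgradient steps can overshoot and increase f, so keep the best iterate
        if abs(x - center) < abs(best - center):
            best = x
    return best


best_x = subgradient_descent()
print(best_x)
```

Unlike gradient descent, the objective is not guaranteed to decrease at every step, which is why the best iterate seen so far is tracked.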

Can you explain? I thought the entire point of back-propagation was to differentiate and calculate the weights that are contributing more to the error and thus will be "changed" more when you do GD.

maybe you're thinking about simulated annealing?

As someone who is quite new to this field and also a software developer I really look forward to seeing this progress. I write and look at code all day so for me this is much easier to read than the dry math!

Wonderful guide, thanks for sharing!
