I've read so many of these, and none of them include the information I need.
If someone wrote a "Hacker's guide to Tuning Hyperparameters" or "Hacker's guide to building models for production" I would read/share the shit out of those.
The problem is, this really is different for each task. It's really hard to state any generic rules of thumb. Everyone has their own default parameters and intuitions, but I would say most of them are heavily biased toward the tasks that person has worked on. For example, I work with deep bidirectional LSTMs on acoustic modeling for speech recognition, and I use Adam with a starting learning rate of 0.0005, Newbob learning rate scheduling, pretraining, dropout of 0.1 in addition to L2 regularization, and so on. I tried to collect these results here: https://www-i6.informatik.rwth-aachen.de/publications/downlo...
Both are tricky subjects and involve a lot of cooking-like thinking (there are popular recipes and general rules, but in all practical cases mastering it takes a lot of non-transferable experience and knowledge).
At deepsense.io we are developing Neptune to facilitate that process (compare models, version-control them), here: https://go.neptune.deepsense.io (a public release in less than a month!).
It looks interesting, and I talked with these guys (including the author) at a conference (GPU Tech Conf in San Jose). However, given that other systems are much more transparent, I am kind of wary of adding a black box.
Especially as most neural network tweaking goes well beyond hyperparameter tweaking: one needs to adjust the architecture, set up sensible cross-validation, do good augmentation, etc.
The problem is that these guides would have to be domain specific. I can imagine, e.g., a "Hacker's guide to Tuning Hyperparameters for NLP", but it would have different recommendations than a similar guide for image processing or financial dataset analysis.
That makes sense. But what I need to know are the thought processes that go into determining the hyperparameters. Most of the time my process amounts to slightly educated guess-and-check. Are there smarter ways to go about that? What tools or analysis should I be using to see the effects of my tweaks?
Thought process #1 is to test, measure and document everything. Change one thing at a time; if you have two modifications that you want to try and have the resources, then it'd be best to run (measure) A, B and A+B instead of both of them at the same time. If early in your experiments adding A was a mildly beneficial thing, it may be the case that after extensive modifications it's not anymore, but you can try and check that. This obviously means that you need a simple, mostly automated way to run repeatable experiments and document their results.
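To make the "run A, B and A+B" idea concrete, here is a rough sketch of what such a harness could look like. Everything in it is made up for illustration: `run_experiment` is a hypothetical stand-in that returns a fake score where a real training run would go, and the modification names "A" and "B" are just labels.

```python
import itertools
import json
import random


def run_experiment(config, seed=0):
    """Hypothetical stand-in for a real training run; returns a fake score."""
    rng = random.Random(seed)
    score = 0.80
    if config["A"]:
        score += 0.02
    if config["B"]:
        score += 0.01
    if config["A"] and config["B"]:
        score -= 0.015  # pretend the two changes interact badly
    return round(score + rng.uniform(-0.001, 0.001), 4)


def ablation(modifications, seed=0):
    """Run every on/off combination of the modifications and record each result."""
    results = []
    for flags in itertools.product([False, True], repeat=len(modifications)):
        config = dict(zip(modifications, flags))
        results.append(
            {"config": config, "seed": seed, "score": run_experiment(config, seed)}
        )
    return results


results = ablation(["A", "B"])
for row in results:
    # One JSON line per run makes the log easy to diff and to grep later.
    print(json.dumps(row))
```

The fixed seed is what makes the runs repeatable; with a real model you would probably also want several seeds per configuration to see how much of a difference is just noise.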
Thought process #2 is to read lots of papers that go into details on solutions for similar problems, see what works and doesn't work for them, try to understand if the factors that make it useful apply to you as well (e.g. size or type of data may mean that your experience is likely to be opposite) - and, of course, try and evaluate.
Thought process #3 is to do error analysis, possibly with tooling to show the relations (e.g. for image analysis tasks). You definitely want to know what kinds of mispredictions you are getting and in what amounts, and that may help you (though not always) to understand why a particular type of misclassification occurs.
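A minimal sketch of the "what kinds of mispredictions, in what amounts" bookkeeping, assuming you have the true and predicted labels as plain lists. The labels and the helper name `confusion_counts` are invented for the example.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs for the mispredictions only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


# Toy labels, purely illustrative.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird", "cat"]
y_pred = ["cat", "dog", "dog", "cat", "bird", "cat", "dog"]

errors = confusion_counts(y_true, y_pred)
for (true_label, predicted), count in errors.most_common():
    print(f"{true_label} -> {predicted}: {count}")
```

Sorting by frequency puts the most common confusion at the top, which is usually where it pays to look at individual examples first.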
Technical analysis may also come into play, but IMHO that is more useful for debugging why something fails totally and not that useful for getting better accuracy on something that already works really well (it is more useful for getting faster convergence to the same result). There are all kinds of metrics you may measure on your network, e.g. dead neurons for the ReLU family, whether your early layers are stabilizing or not, etc. But again, problems of whether the network converges at all (or how fast) and problems of converging to a not-good-enough optimum are quite different.
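As one example of such a metric, here is a rough sketch of counting dead ReLU units on a batch, assuming you can dump a layer's post-ReLU activations as rows of numbers (one row per sample, one column per unit). The function name and the toy batch are hypothetical; in a real framework you would pull the activations out with whatever hook mechanism it offers.

```python
def dead_relu_fraction(activations):
    """Fraction of units whose post-ReLU output is zero for every sample.

    `activations` is a list of rows, one row per sample, one column per unit.
    """
    num_units = len(activations[0])
    dead = sum(
        1
        for unit in range(num_units)
        if all(row[unit] <= 0.0 for row in activations)
    )
    return dead / num_units


# Toy batch: units 0 and 2 never fire, units 1 and 3 fire on some samples.
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.4, 0.0, 0.0],
]
print(dead_relu_fraction(batch))
```

A unit that is silent on one small batch is not necessarily dead, so in practice you would measure this over a large sample of the data.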
As far as I know, tuning hyperparameters and building models are mostly intuitive guesswork/experimentation based on some fundamental mathematical principles. I'm not sure that's equivalent to hacking in the canonical sense.
As the author mentions, the CS231n course notes may be what you want. They build up many of the neural network primitives using just numpy and basic operations, so it's very easy to see exactly how it works at the code level.
Total votes would qualify and then perhaps a combination of views and comments that occur at least one month or more later. The goal is to track posts that have an extended relevancy beyond the current month. Ideally, beyond the current 3 months. Posts that are more than 3 years old but still get high votes when submitted should also be given special attention.
A solid grounding in probability theory and multivariate calculus is the first thing you should spend your time on if you want to understand NN, ML and most of AI once and for all.
These hacker guides only scratch the surface of the subject, which, in part, contributes to creating this aura of black magic that haunts the field. I'm not saying that is a bad thing, but they need to be complementary material, not the main way to learn.
Static neural networks on Rosetta Code for basic things like Hello World, etc, would do a lot to aid in people's understanding of neural networks. It would be interesting to visualize different trained solutions.
Knew this wasn't for me when he had to introduce what a derivative was with a weird metaphor. I like this approach to teaching things (it's Feynman-y) but half the time I end up hung up on trying to understand a particular author's hand-waving for a concept I already grok.
No, stochastic gradient descent (SGD) needs a differentiable loss function (differentiable with respect to the trainable parameters). The stochastic bit is that you want to optimize the loss for the whole dataset, but what you actually do is take each gradient descent step on a small mini-batch of the dataset.
There might be extensions where you have some non-differentiable part and you just assume some gradient in back-propagation, like Gumbel-Softmax (https://arxiv.org/abs/1611.01144). Similar is the reparametrization trick in VAEs (https://arxiv.org/abs/1606.05908). But those are special cases.
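To illustrate the "stochastic bit", here is a small sketch of mini-batch SGD fitting a line with a squared-error loss, in plain Python. The data, the learning rate and the function name `sgd_linear_fit` are all assumptions chosen for the example, not anything from a particular library.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, epochs=300, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                # Gradient of the batch-mean squared error w.r.t. w and b.
                grad_w += 2.0 * err * xs[i] / len(batch)
                grad_b += 2.0 * err / len(batch)
            # One step per mini-batch, not one per pass over the dataset.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(20)]   # 0.0, 0.1, ..., 1.9
ys = [3.0 * x + 1.0 for x in xs]   # noiseless line with slope 3, intercept 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each update only sees four points, so any single step follows a noisy estimate of the full-dataset gradient; the loss still has to be differentiable in `w` and `b` for those gradients to exist.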
Maybe you are thinking about reinforcement learning?
How do you compute the gradient of a non-differentiable function? I'm not an expert, but that contradicts everything I've learned about gradient descent.
As others have pointed out, SGD requires differentiability. The next best match I can think of is that you're actually thinking of subgradient methods, which mostly see use in convex optimization problems.
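For what a subgradient method looks like, here is a toy sketch on f(x) = |x - target|, which is convex but not differentiable at the minimum. The function names, the starting point and the 1/k step size are choices made for this illustration only.

```python
def subgradient_abs(x, target):
    """Return one valid subgradient of f(x) = |x - target|."""
    if x > target:
        return 1.0
    if x < target:
        return -1.0
    return 0.0  # at the kink any value in [-1, 1] is a valid subgradient


def subgradient_descent(x0, target, steps=1000):
    """Minimize |x - target| with a diminishing step size of 1/k."""
    x = x0
    best = x
    for k in range(1, steps + 1):
        x -= (1.0 / k) * subgradient_abs(x, target)
        # A subgradient step can increase f, so keep the best point seen.
        if abs(x - target) < abs(best - target):
            best = x
    return best


best = subgradient_descent(x0=5.0, target=2.0)
print(best)
```

Unlike gradient descent, the iterates here bounce around the minimum, which is why the step size has to shrink and why the method tracks the best point rather than the last one.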
Can you explain? I thought the entire point of back-propagation was to differentiate and calculate the weights that are contributing more to the error and thus will be "changed" more when you do GD.
As someone who is quite new to this field and also a software developer, I really look forward to seeing this progress. I write and look at code all day, so for me this is much easier to read than the dry math!