If someone wrote a "Hackers guide to Tuning Hyperparameters" or "Hackers guide to building models for production" I would read/share the shit out of those.
Learning to learn by gradient descent by gradient descent
Related (but less so), there are also some papers about learning neural network architectures:
Designing Neural Network Architectures using Reinforcement Learning
Neural Architecture Search with Reinforcement Learning
At deepsense.io we are developing Neptune to facilitate that process (compare models, version-control them), here: https://go.neptune.deepsense.io (a public release in less than a month!).
e.g. For NLP:
Disclaimer: they did share free trials with some of our students.
Especially as most neural network tweaking goes well beyond hyperparameters - one needs to adjust the architecture, set up sensible cross-validation, good augmentation, etc.
Thought process #2 is to read lots of papers that go into detail on solutions for similar problems, see what works and doesn't work for them, try to understand whether the factors that make something useful apply to you as well (e.g. the size or type of your data may mean your experience is likely to be the opposite) - and, of course, try it and evaluate.
Thought process #3 is to do error analysis, possibly with tooling to show the relations (e.g. for image analysis tasks). You definitely want to know what kinds of mispredictions you are getting, and in what amounts; that may help you (though not always) to understand why a particular type of misclassification occurs.
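The error-analysis step can be sketched in a few lines: build a confusion matrix and sort the off-diagonal entries to see which confusions dominate. The labels here are made-up placeholders, not from any real model.

```python
# Minimal error-analysis sketch: tally which misclassifications occur
# and how often, via a confusion matrix. Data below is illustrative only.
import numpy as np

def misprediction_counts(y_true, y_pred, n_classes):
    """Confusion matrix plus a list of confusions sorted by frequency."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    errors = [(cm[i, j], i, j)
              for i in range(n_classes) for j in range(n_classes)
              if i != j and cm[i, j] > 0]
    errors.sort(reverse=True)  # most frequent confusions first
    return cm, errors

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 0]
cm, errors = misprediction_counts(y_true, y_pred, n_classes=3)
print(errors)  # [(2, 2, 0), (1, 0, 1)]: class 2 mistaken for 0 twice
```

Looking at the top of that sorted list (e.g. which class pairs get confused) is usually where the "why" questions start.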
Technical analysis may also come into play, but IMHO that is more useful for debugging why something fails totally and not that useful for getting better accuracy on something that works really well (more useful for getting faster convergence to the same result). There are all kinds of metrics you may measure on your network, e.g. dead neurons for ReLU family, are your early layers stabilizing or not, etc. But again, problems of convergence at all or its speed and problems of converging to a not-good-enough optimum are quite different.
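One of the diagnostics mentioned above (dead neurons in the ReLU family) is easy to sketch: a unit whose pre-activation is non-positive for every example in a batch always outputs zero, so it gets no gradient. This is a NumPy stand-in with synthetic weights/data, not a probe of a real network.

```python
# Sketch of a "dead ReLU" check: a unit is dead on this batch if its
# pre-activation is <= 0 for every example, so relu(z) is always 0.
# The data is synthetic; in practice z would come from a forward pass.
import numpy as np

def dead_relu_fraction(pre_activations):
    """pre_activations: array of shape (batch, units)."""
    dead = np.all(pre_activations <= 0, axis=0)
    return dead.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 100))
z[:, :10] -= 100.0  # force 10 of the 100 units to be always negative
print(dead_relu_fraction(z))  # -> 0.1
```

Tracking this fraction over training (per layer) is one way to tell a "won't converge" problem apart from a "converges to a mediocre optimum" problem.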
These hacker guides only scratch the surface of the subject, which in part contributes to the aura of black magic that haunts the field. I'm not saying that's a bad thing, but they need to be complementary material, not the way to go.
The nice part of SGD is that you can backpropagate even functions that are not differentiable.
This is totally missed here.
There might be extensions where you have some non-differentiable part and you just assume some gradient in back-propagation. Like Gumbel-Softmax (https://arxiv.org/abs/1611.01144). Or similar is the reparametrization trick in VAEs (https://arxiv.org/abs/1606.05908). But those are special cases.
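The Gumbel-Softmax idea from the linked paper can be sketched in a few lines: replace a non-differentiable categorical sample with a softmax over logits plus Gumbel noise, so the "sample" becomes a smooth function of the logits. This is a forward pass only in NumPy; in a real framework the softmax would carry gradients back to the logits.

```python
# Gumbel-Softmax sketch (forward pass only, NumPy): a differentiable
# relaxation of sampling from a categorical distribution. Logits and
# temperature below are arbitrary illustrative values.
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())  # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 0.5])
soft = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
print(soft)  # sums to 1; low temperature pushes it toward one-hot
```

As the temperature goes to zero the output approaches a one-hot argmax sample, which is exactly the non-differentiable operation being approximated.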
Maybe you are thinking about reinforcement learning?