For example, in this line of thought for question answering there is the bAbI dataset, which creates auxiliary problems so you know where the modeling problems are.
By pushing this idea to the extreme (for example, in NLP you can have tasks that consist of repeating a sequence of characters in reverse order, to demonstrate that the architecture is indeed capable of memorizing like a parrot), you can often create trivial problems that take minutes to run on a single machine and help discover most bugs.
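One such trivial task fits in a few lines. This is a hypothetical sketch (the function name and vocabulary are made up for illustration): generate a random string as input and its reverse as the target, so any seq2seq architecture that cannot fit it has a bug somewhere.

```python
import random

def make_reverse_example(length=10, vocab="abcdefgh", seed=None):
    # Toy "parrot" task: the input is a random string, the target is
    # its reverse. Trivial for a correct architecture, cheap to run.
    rng = random.Random(seed)
    src = "".join(rng.choice(vocab) for _ in range(length))
    return src, src[::-1]
```

Training on thousands of these examples costs seconds and quickly tells you whether the model can memorize and re-emit a sequence at all.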
You can also create hierarchies of such problems, so you know in which order to tackle them, and you can build sub-modules and reuse them.
Quite often the code you obtain this way is very explainable, and you know which situations will work and which probably won't. But this network architecture is usually "verbose" and optimizes numerically a little less well on large-scale problems. The trick is then to simplify your network mathematically into something more linear and more general. You can reorder some operations, for example summing along a different dimension first. Semantically this is different, but it will converge better, because for a network to optimize well it needs to work well in both the forward and backward directions, so that the gradient flows.
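One way to read the reordering suggestion, as a hypothetical NumPy sketch (shapes and names are invented): apply a nonlinearity per set element and then sum, versus sum first and apply the nonlinearity once. The two are semantically different functions, but in the second the path from each input to the nonlinearity is purely linear, which tends to let gradients flow more easily.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # 8 set elements, 16 features each
W = rng.normal(size=(16, 4))   # a shared linear layer

# "Verbose" form: nonlinearity per element, then sum over the set.
verbose = np.tanh(x @ W).sum(axis=0)

# Reordered form: sum over the set first, then a single nonlinearity.
# Not the same function, but more linear upstream of the tanh.
reordered = np.tanh(x.sum(axis=0) @ W)
```

Both produce a 4-dimensional output for the set, but they are different functions; whether the trade is worth it is an empirical question per problem.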
Once you have a set of simple problems that encompass your general problem, a good solution architecture is usually a more general mixture of the model architectures of the simple problems.
> One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art”).
This can't be overstated. I can't count the number of times I'm the first person to find a problem with the data. It's incredibly frustrating. Just look at your damned data to sanity check it and understand what's going on. Do the same with your model outputs. Don't just look at aggregates. Look at individual instances. Lots of them.
I don't think "stick with supervised learning" is very good advice, though. Unsupervised techniques sometimes work well for NLP and have worked well in other domains, such as medical records. In particular, anytime you have access to much more unlabeled data than labeled data, it's something you should at least consider.
This blog post is full of the kind of real-world knowledge and how-to details that are not taught in books and often take endless hours to learn the hard way.
If you're interested in deep learning, do yourself a favor and go read this.
It is worth its weight in gold.
Karpathy is amazing. I've gotten so much "mileage" on two projects out of his "Unreasonable Effectiveness of RNNs" article.
Bullshitting and wankery don't come naturally to Karpathy, so the few spots where he was under pressure to do so stood out like a sore thumb.
LOL. Human assisted training at scale is perfectly allowable for mission critical success. Especially if you enjoy an unlimited research budget!
You can follow these instructions to the letter. And the same problems around generalization will arise. It's foundational.
For 30fps camera images, handling new data in real time works fine for 99% of scenarios. But seeking usable convergence rates on petascale data problems, such as NVIDIA's recent work on deep learning for fusion reaction container design, requires a breakthrough, not just in software but in computational architectures as well.
Deep Reinforcement Learning and the Deadly Triad
Identifying and Understanding Deep Learning Phenomena
Can someone translate this to PyTorch for me? Or give a simple example of how one would go about doing this?
It means that if I have a 1:10 ratio in the data, an untrained net should predict positive in roughly 10% of the cases, right?
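Roughly, yes. Here is a plain-Python sanity check of that claim (assuming "1:10" means 1 positive for every 10 negatives, i.e. p = 1/11 ≈ 9%): if you initialize the final-layer bias to the log-odds of the positive rate, a sigmoid classifier predicts positive at the base rate before training, and the initial binary cross-entropy equals the entropy of the label distribution.

```python
import math

p = 1 / 11                        # positive fraction for a 1:10 ratio
bias = math.log(p / (1 - p))      # init the final-layer bias to the log-odds
prob = 1 / (1 + math.exp(-bias))  # sigmoid(bias): the untrained prediction
# prob equals p, so the net predicts positive ~9% of the time at init.

# Expected initial loss = entropy of the label distribution.
init_loss = -(p * math.log(prob) + (1 - p) * math.log(1 - prob))
```

If the measured loss at step 0 is far from `init_loss`, something is wrong with the initialization or the loss computation.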
Also, it seems to me that most of what he says could be distilled into a boilerplate/template structure for any given deep learning framework, from which new projects can be forked. Does this already exist?
>> model = SuperCrossValidator(SuperDuper.fit, your_data, ResNet50, SGDOptimizer)
Under "Neural net training is a leaky abstraction", my first thought was: this IS fastai's API.
What is he talking about here? BERT, GPT, etc. are not unsupervised; they are pretrained on a task that has naturally occurring supervised data (language modelling).
As you said, the test subset should only be used at the very end.
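A minimal sketch of that discipline (the function name and split fractions here are arbitrary choices): shuffle once with a fixed seed, carve off validation and test up front, and only ever touch the test indices for the final evaluation.

```python
import numpy as np

def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    # One fixed permutation, so the test slice never changes between runs.
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]               # touched once, at the very end
    val = idx[n_test:n_test + n_val]  # used for tuning along the way
    train = idx[n_test + n_val:]
    return train, val, test
```

Fixing the seed also guards against accidentally re-drawing the split and leaking test examples into training between experiments.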