Networks try to minimize loss. If you want something to happen less frequently, add it to the loss. Literally addition.
It was a mind-bending “there is no spoon” moment for me.
Also, loss is one of the worst names imaginable. Kerfuffle would’ve been better, because at least it’s mostly meaningless. Whenever you see “loss”, substitute “penalty” and things will become much clearer.
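To make the “literally addition” point concrete, here’s a toy sketch (my own example with made-up names, not from any particular codebase) of adding a penalty term to a PyTorch-style loss:

    import torch
    import torch.nn.functional as F

    # Toy model and batch, just to have something to penalize.
    model = torch.nn.Linear(10, 3)
    x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))

    logits = model(x)
    task_loss = F.cross_entropy(logits, y)
    penalty = model.weight.abs().mean()   # whatever you want less of
    loss = task_loss + 0.1 * penalty      # literally addition
    loss.backward()

The 0.1 is just a knob for how hard you want to punish the behavior.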
Secondary advice: if you don’t have patience, you won’t get anywhere. In the same way that the stock market is a lever for transferring money from the impatient to the patient, neural networks are a lever to transfer advantage to the patient.
By that I mean, I can’t count the number of times I almost wrote off some small tweak as “doesn’t work”, only to leave the network training for another week or so and discover it worked fine. In fact, it was almost always equivalent or had no advantage, i.e. a placebo. It’s not like code; you can do so much fucked-up shit to a neural network, and it will still work. It’s unlike anything you’re used to.
Beyond that, just remember that this stuff is hard. Coding the network is easy. Getting it right is hard. And getting it perfect, well, took me a year. Google’s official BigGAN model at google/compare_gan never achieved the same FID as the real BigGAN. Why? I immersed myself in this mystery, eventually reverse-engineering the official TensorFlow graph. I discovered their implementation was missing a crucial + 1, so their gamma was centered around zero instead of one. And in batchnorm, gamma is a multiplier, so the network was basically being multiplied by zero and no one noticed for years. (Remember how I said you can do a lot to a network without causing problems? Sometimes the problems are so subtle they’ll drive you nuts. You know something is wrong, but you don’t know what or why, and it’s almost impossible to debug.)
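For the curious, here’s roughly what the bug amounts to, written as a hedged pseudo-PyTorch sketch (the class and layer names are mine, not compare_gan’s). In BigGAN-style conditional batchnorm, gamma is predicted from a conditioning vector, so it comes out roughly zero-centered unless you add one:

    import torch

    class CondBatchNorm2d(torch.nn.Module):
        def __init__(self, num_features, cond_dim):
            super().__init__()
            self.bn = torch.nn.BatchNorm2d(num_features, affine=False)
            self.to_gamma = torch.nn.Linear(cond_dim, num_features)
            self.to_beta = torch.nn.Linear(cond_dim, num_features)

        def forward(self, x, cond):
            gamma = self.to_gamma(cond)[:, :, None, None]  # roughly zero at init
            beta = self.to_beta(cond)[:, :, None, None]
            # Correct: the multiplier is centered around 1.
            return self.bn(x) * (1 + gamma) + beta
            # The buggy version effectively did:
            #   return self.bn(x) * gamma + beta   # centered around 0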
However, I believe that coding the network is very challenging unless you’re working on a task that has already been widely explored. For optical flow, there was broad consensus among SOTA papers for some years that convolutional filters, warping of the input data, and a hierarchical structure were the correct approach, e.g. everything descended from FlowNet.
But it turns out a hierarchical structure can NOT correctly represent some movement patterns in the real world, like branches moving on a tree or overhead cables. So now we have a category of AI solutions that all fail in the same way in the same circumstances, plus commercial products (e.g. the Skydio drone) with the exact same issues.
The correct approach seems to be an iterative solver, which has been attempted with RAFT, but nobody has yet managed to design a suitable network architecture that does not require hierarchical undersampling.
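For readers who haven’t seen RAFT: the core idea is to keep the flow field at a single fixed resolution and refine it with a learned update step, rather than predicting it coarse-to-fine through a pyramid. A very rough sketch of that loop, with hypothetical modules standing in for the real feature/correlation/update networks (this is not the actual RAFT code):

    import torch

    def estimate_flow(feat1, feat2, update_net, num_iters=12):
        b, _, h, w = feat1.shape
        flow = torch.zeros(b, 2, h, w, device=feat1.device)  # start from zero flow
        for _ in range(num_iters):
            # Each step looks at both feature maps plus the current flow
            # estimate and predicts a small residual correction.
            delta = update_net(torch.cat([feat1, feat2, flow], dim=1))
            flow = flow + delta
        return flow

    # Toy usage: a single conv layer standing in for the update network.
    feat = torch.randn(1, 8, 64, 64)
    toy_update = torch.nn.Conv2d(8 + 8 + 2, 2, kernel_size=3, padding=1)
    flow = estimate_flow(feat, feat.roll(1, dims=-1), toy_update)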
Just like in your failure story, tiny mistakes in the network can prevent success for good. And you need lots of attention to detail and plenty of experience to avoid those mistakes.
This is a pretty good ELI5 of neural networks.
Could you expand a bit on this? Do you mean that it might miss small, fast-moving objects due to losing fidelity at the coarse resolutions? Or is there actually some sort of movement that the hierarchical structure can't interpret?
The single RGB-stream top-1 goes up to 73.48% with ResNet-50, and up to 74.71% when equipped with non-local blocks. Both are much higher than the original two-stream paper.
Penalty already has a meaning in machine learning, so this substitution just adds more confusion instead of clarifying things. Loss seems descriptive enough to me.
How is loss descriptive? Ah yes, we're losing... something. Our lunch, maybe.
The neural network isn't playing a game, even though people like to phrase GANs that way. There's no "win" condition. Training just ends whenever you decide to end it.
Minimizing loss doesn't bring you closer to winning a game anyway. It's often the worst strategy in certain kinds of games.
Minimizing penalty, on the other hand, is perfectly clear. If you want the neural network to do something less, add a penalty term.
Feynman had a funny story about this (I’m no Feynman, but still): he invented new ways of writing sin, cos, etc. He said he disliked how the standard notation looked, since cos(x) looks like cos multiplied by x. And of course the story ended with the same punchline you outlined: when you want to talk to others, you need shared vocabulary.
But the thing is, it’s extremely easy to remember to say “loss” instead of “penalty” when I’m talking to someone. But it was extremely hard for me to even understand what the heck a loss was. What is it, exactly? What’s it doing and why? How should I think about it — and more importantly, how can I extrapolate that thinking to take advantage of it?
Maybe it’s a personal quirk, but I simply couldn’t understand loss. I know penalty, though. Ditto for learning rate vs step size. So it’s more of “internal advice” rather than me saying that you should rewrite your papers with the new names.
EDIT: By the way, I wasn't proposing that "penalty" be renamed to "regularization". I was under the impression that what the parent comment was calling "penalty" was normally called "regularization", i.e. that regularization was the formal name for it. If that's not true, it's possible my understanding is incomplete -- what is penalty? I haven't heard of it till now, to be honest. And googling for "machine learning penalty" pops up 5 articles on regularization.
So I was proposing two changes: loss -> penalty, and learning rate -> step size.
I guess I'll call loss "punishment." It matches how it feels to make progress in ML anyway.
I have looked into the historical context and the reason for the word (the technique was first popularized in something which "regressed to the mean"), as well as its development, and it still bugs me every time.
I agree with you about the fucked-up shit. I had a strong optimization background and a traditional statistics background, and from that vantage point everything you do with neural networks is just crazy.
Physics suffered from the same problem: "action," "work," and so on are unrelated to their everyday meanings. But we're stuck with them.
Both "loss" and "learning rate" are confusing, and neural networks are so confusing that I think it's worth undoing as much as possible.
I would s/loss/penalty/ and s/learning rate/step size/, after giving it much thought. At least, I haven't thought of better names yet.
The reason "step size" matters is that it describes what's actually going on. You don't increase the learning rate to make it learn faster. You increase the step size to make it take longer steps towards a goal. And when it's close to the goal, it circles around the goal, like water down a drain. You decrease the step size (learning rate) towards the end of training so that it doesn't keep dancing around the bowl, and can finally reach its target in the middle.
Slight modifications like that can give lots of insights. For example, now that you're thinking of learning rate in terms of water spiraling down a drain, you can see why averaging the last N model checkpoints increases accuracy: if you're spinning in a circle around a target, then the average of your last 5 positions must bring you closer to the center. In fact, that's true of any convex shape. Therefore the loss landscape seems mostly convex.
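Here's a toy illustration of that averaging effect (my own example, nothing to do with any particular model): plain gradient descent on a 1-D quadratic with a step size big enough that the iterate keeps overshooting the minimum. The average of the last few iterates lands much closer to the bottom than the final iterate does:

    import numpy as np

    # Gradient descent on f(x) = 0.5 * x^2 (gradient is x) with step
    # size 1.9, so each update overshoots and oscillates around 0.
    x, step = 10.0, 1.9
    trajectory = []
    for _ in range(20):
        x -= step * x
        trajectory.append(x)

    last = np.array(trajectory[-5:])
    print("final iterate:     ", abs(last[-1]))    # ~1.22
    print("average of last 5: ", abs(last.mean())) # ~0.31, much closer to 0

Same intuition as averaging the last N checkpoints: the individual positions circle the target, but their mean sits near the middle.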
And so it goes. It's very much like compound interest. The more you understand, the more you can understand. That's why it's so important to be determined and patient.
Also, ask lots of questions on Twitter. In my opinion it's one of the most crucial resources for learning ML. The ML community there is phenomenal, and I don't know why. All I know is that everyone is super friendly and eager to help you out. Start with @pbaylies, @jonathanfly, @aydaoai, and @arfafax.
One of the nice things about ML (and math, for that matter) is that there are multiple mathematically equivalent ways of looking at a thing.
It’s not much of a writeup. It’s basically saying, hey, this is zero when it should be one.
The results were dramatic. It went from blobs to replicating the biggan paper almost perfectly. I think we’re at a FID of 11 or so on imagenet. Here's a screenshot I just pulled from our current run: https://i.imgur.com/k1RuWEG.png
Stole a year of my life to track it down. But it was a puzzle I couldn’t put down. It haunted my dreams. I was tossing and turning like, but why won’t it work... why won’t it work...
It was both as simple and as hard as that.
Ended up being featured in a few articles.
You can join the ML discord server here (https://github.com/shawwn/tpunicorn#ml-community) if you're looking to toss around ideas for things to do.
My "Deep Learning for Programmers: A Tutorial with CUDA, OpenCL, DNNL, Java, and Clojure" book explains and executes every single line of code interactively, from low level operations to high-level networks that do everything automatically. The code is built on the state of the art performance operations of oneDNN (Intel, CPU) and cuDNN (CUDA, GPU). Very concise readable and understandable by humans.
Here's the open source library built throughout the book:
Some chapters from the beginning of the book are available on my blog, as a tutorial series:
I didn't study ML in general, so what I gained from the course was a deep understanding of the fundamentals and the math behind it. But since I didn't get to familiarize myself with any of the existing libraries (TensorFlow/Keras back then), I had a hard time convincing anyone in industry of my skills in the field :/
Also: why does the book only cover Deep Q-Networks for reinforcement learning? Sure, it's the most notable deep learning milestone in the field, but there are other approaches, such as Actor-Critic and Maximum Entropy RL, that are very relevant too. If the book includes YOLO, ResNet, and newer architectures for computer vision applications, I don't see why the same treatment isn't given to RL.
Arxiv is supposed to be a pre-print scientific publication server: a place to post your nearly finished (or, ideally, already submitted) journal or conference manuscript so you can reference it while it is being reviewed.
Arxiv itself is a patch for an all-too-common, overly long publication process. Now it is often the first place ML research is published and an obligatory source of literature in ML. The lack of peer review or any quality assurance means dubious work appears there that can waste a lot of research time.
Hosting your elementary course there because that's where the researchers are muddies the quality of work on Arxiv further.
My criticism pertains to OP using Arxiv as a PDF host for course material irrelevant to Arxiv's userbase, i.e. expert researchers. It is already hard enough to find the quality manuscripts on Arxiv, hence why Karpathy made Arxiv Sanity Preserver. I would rather not have to wade through pages of tutorial PDFs when searching "neural graph methods for NLP", for instance.
I’m working through ISLR (starting ch8) so I’ll be done in a few weeks.
This topic isn’t covered so another textbook, with exercises, would be ideal.
Just noticed that this paper is a book. Maybe I have a winner?
To me, it's more of a reference book.
But if you want a "Feynman"-type book that describes the underlying structure and workings in a non-academic way, I would recommend:
Michael Nielsen's Neural Networks & Deep Learning
Jeff Heaton's Introduction To The Math Of Neural Networks
Have to get to it soon.
For more fundamental material I like the CS231n course notes and Goodfellow, Bengio, and Courville.
I eventually settled on Python and implemented the NN with the help of the book Make Your Own Neural Network by Tariq Rashid. NumPy is really magic.