Applications of Deep Neural Networks v2 [pdf]

sillysaurusx · on Jan 25, 2021

After 1.5 years of self study in neural networks, my advice would be to internalize the fact that you can train a neural network to do anything that you can encode as a loss function.

Networks try to minimize loss. If you want something to happen less frequently, add it to the loss. Literally addition.
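
To make "literally addition" concrete, here is a minimal sketch on a one-weight toy problem. Everything in it (the functions, the data point, the penalty weight) is a hypothetical illustration, not code from any library.

```python
# Hypothetical toy problem: fit a single weight w so that w * x matches y,
# while also discouraging a large w by adding a penalty term to the loss.

def fit_loss(w, x=2.0, y=6.0):
    """Squared error of the prediction w * x against the target y."""
    return (w * x - y) ** 2

def size_penalty(w):
    """The thing we want less of: a large weight."""
    return w ** 2

def total_loss(w, penalty_weight):
    # "Literally addition": the penalty is simply summed into the loss.
    return fit_loss(w) + penalty_weight * size_penalty(w)

def minimize(penalty_weight, steps=2000, step_size=0.01, eps=1e-6):
    """Plain gradient descent using a central finite-difference gradient."""
    w = 0.0
    for _ in range(steps):
        grad = (total_loss(w + eps, penalty_weight)
                - total_loss(w - eps, penalty_weight)) / (2 * eps)
        w -= step_size * grad
    return w

w_plain = minimize(penalty_weight=0.0)      # only the fit matters
w_penalized = minimize(penalty_weight=4.0)  # fit plus the added penalty
print(w_plain, w_penalized)
```

With no penalty the weight settles where the fit is exact; with the penalty added it settles at a smaller value, because the optimizer now trades some fit for less of the penalized quantity.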

It was a mind-bending “there is no spoon” moment for me.

Also, loss is one of the worst names imaginable. Kerfuffle would’ve been better, because at least it’s mostly meaningless. Whenever you see “loss”, substitute with “penalty” and things will become much clearer.

Secondary advice: if you don’t have patience, you won’t get anywhere. In the same way that the stock market is a lever for transferring money from the impatient to the patient, neural networks are a lever to transfer advantage to the patient.

By that I mean, I can’t count the number of times I almost wrote off some small tweak as “doesn’t work”, only to leave the network training for another week or so and discover it worked fine. In fact, it was almost always equivalent, or had no advantage, i.e. a placebo. It’s not like code; you can do so much fucked-up shit to a neural network, and it will still work. It’s unlike anything you’re used to.

Beyond that, just remember that this stuff is hard. Coding the network is easy. Getting it right is hard. And getting it perfect, well, took me a year. Google’s official biggan model at google/compare_gan never achieved the same FID as real biggan. Why? I immersed myself in this mystery, eventually reverse engineering the official tensorflow graph. I discovered their implementation was missing a crucial + 1, so their gamma was centered around zero instead of one. And in batchnorm, gamma is a multiplier — so the network was basically multiplied by zero and no one noticed for years. (Remember how I said you can do a lot to a network without causing problems? Sometimes the problems are so subtle they’ll drive you nuts. You know something is wrong, but you don’t know what or why, and it’s almost impossible to debug.)
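
A minimal sketch of the failure mode described above, assuming a plain scalar batchnorm. The names (`gamma_offset`, `batch_norm`) are hypothetical and this is not the compare_gan code; it only shows why a multiplier centered on zero instead of one wipes out the signal.

```python
import math

def batch_norm(values, gamma, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

activations = [0.5, -1.0, 2.0, 3.5]

# Hypothetical learned offset, assumed to start at zero.
gamma_offset = 0.0

# Without the "+ 1": gamma starts at zero, so every output is zero.
missing_plus_one = batch_norm(activations, gamma=gamma_offset)

# With the "+ 1": gamma starts at one, so the normalized signal passes through.
with_plus_one = batch_norm(activations, gamma=1.0 + gamma_offset)

print(missing_plus_one)
print(with_plus_one)
```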

fxtentacle · on Jan 25, 2021

I mostly second this. Training a neural network is like training a dog, and the loss function just describes when you'll shout "no!" and how loudly.

However, I believe that coding the network is very challenging unless you do a task that has been widely explored already. For optical flow, there was a wide consensus among SOTA papers for some years that convolutional filters, warping of the input data, and a hierarchical structure were the correct way, e.g. everything descended from FlowNet.

But turns out, a hierarchical structure can NOT correctly represent some movement patterns in the real world, like branches on a tree moving or overhead cables. So now we have a category of AI solutions that all fail in the same way in the same circumstances, plus commercial products (e.g. Skydio Drone) with the exact same issues.

The correct approach seems to be an iterative solver approach, which has been attempted with RAFT, but nobody has yet managed to design a suitable network architecture that does not require hierarchical undersampling.

Just like in your failure story, tiny mistakes in the network can prevent success for good. And you need lots of attention to detail and plenty of experience to avoid those mistakes.

ohnemint · on Jan 25, 2021

> Training a neural network is like training a dog, and the loss function just describes when you'll shout "no!" and how loudly.

This is a pretty good ELI5 neural networks.

thermistokles · on Jan 26, 2021

> But turns out, a hierarchical structure can NOT correctly represent some movement patterns in the real world, like branches on a tree moving or overhead cables.

Could you expand a bit on this? Do you mean that it might miss small fast moving objects due to losing fidelity at the coarse resolutions? Or is there actually some sort of movement that the hierarchical structure can't interpret?

fxtentacle · on Jan 29, 2021

It might miss anything where the average structure size is smaller than the large hierarchical blocks. For most SOTA, that means 32px minimum size. So it'll also miss fences, for example, because the wires are too thin and it'll not treat it as a whole but as separate tiny objects.
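
A rough sketch of why thin structures vanish at the coarse level. The 32px block size comes from the comment above; the use of average pooling and the 128-pixel scanline are assumptions made for illustration, since real networks downsample in various ways.

```python
def average_pool_1d(signal, block):
    """Downsample by averaging non-overlapping blocks of `block` samples."""
    return [sum(signal[i:i + block]) / block
            for i in range(0, len(signal) - block + 1, block)]

# Hypothetical scanline: dark background (0.0) with a 1-pixel-wide bright wire (1.0).
width = 128
scanline = [0.0] * width
scanline[70] = 1.0

coarse = average_pool_1d(scanline, block=32)
print(coarse)
```

At full resolution the wire has full contrast against the background; after 32x averaging it contributes only 1/32 of that to a single coarse cell, which leaves very little for a coarse-level motion estimate to lock onto.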

g_airborne · on Jan 25, 2021

I couldn't agree more, especially with the latter part. I've worked on action recognition with I3D for over a year now, and found that seemingly equivalent implementations in Keras, TensorFlow 2 or PyTorch will produce wildly different results. Worse yet, I found a bunch of papers that will claim SOTA results compared against one of those non-original implementations with just a few percentage-point differences. It makes no sense! It took me hundreds of hours to hunt down the differences between how these frameworks implement their layers before I could come even close to the expected accuracy...

innerlee · on Jan 25, 2021

shameless ad: try mmaction2, where every result is reproducible https://github.com/open-mmlab/mmaction2 . Modelzoo: https://mmaction2.readthedocs.io/en/latest/modelzoo.html

g_airborne · on Jan 25, 2021

This is very cool, I’ll be studying your implementation of I3D. Did you ever attempt to train I3D end-to-end as done in the Quo Vadis paper? And if so, did you get comparable Top1/Top5 accuracy?

innerlee · on Jan 25, 2021

Sure, checkpoints, configs and detailed training logs all are available at modelzoo https://mmaction2.readthedocs.io/en/latest/recognition_model...

The single RGB stream top1 goes up to 73.48% with resnet50, and up to 74.71% equipped with non-local. Both are much higher than the original paper with two-streams.

superbcarrot · on Jan 25, 2021

> Whenever you see “loss”, substitute with “penalty” and things will become much clearer.

Penalty already has a meaning in machine learning so this substitution just adds more confusion instead of clarifying things. Loss seems descriptive enough to me.

sillysaurusx · on Jan 25, 2021

Perhaps, but regularization is a better name for that term anyway.

How is loss descriptive? Ah yes, we're losing... something. Our lunch, maybe.

The neural network isn't playing a game, even though people like to phrase GANs that way. There's no "win" condition. Training just ends whenever you decide to end it.

Minimizing loss doesn't bring you closer to winning a game anyway. It's often the worst strategy in certain kinds of games.

Minimizing penalty, on the other hand, is perfectly clear. If you want the neural network to do something less, add a penalty term.

superbcarrot · on Jan 25, 2021

So far you've mentioned that you want to change three terms (loss, penalty, learning rate), one of them with a term which is already in use. You're basically rewriting the terminology to fit your personal preference. If you need to communicate with people who have experience in the field, all of this will add more confusion than it removes. It's fine if it helps you reason about things of course but it's just important to keep in mind that the rest of the world isn't on board.

sillysaurusx · on Jan 25, 2021

Nah, I’ll bend the world to my way of doing things. It’s better.

Feynman had a funny story about this. I’m no Feynman, but he invented new ways of writing sin, cos, etc. He said he disliked the way it looked, since cos(x) looks like cos multiplied by x. And of course the story ended with the same punchline you outlined: when you want to talk to others, you need shared vocabulary.

But the thing is, it’s extremely easy to remember to say “loss” instead of “penalty” when I’m talking to someone. But it was extremely hard for me to even understand what the heck a loss was. What is it, exactly? What’s it doing and why? How should I think about it — and more importantly, how can I extrapolate that thinking to take advantage of it?

Maybe it’s a personal quirk, but I simply couldn’t understand loss. I know penalty though. Ditto for learning rate vs step size. So it’s more of “internal advice” rather than me saying that you should rewrite your papers with the new names.

EDIT: By the way, I wasn't proposing that "penalty" be renamed to "regularization". I was under the impression that what the parent comment was calling "penalty" was normally called "regularization", i.e. that regularization was the formal name for it. If that's not true, it's possible my understanding is incomplete -- what is penalty? I haven't heard of it till now, to be honest. And googling for "machine learning penalty" pops up 5 articles on regularization.

So I was proposing two changes: loss -> penalty, and learning rate -> step size.

jmmcd · on Jan 25, 2021

Yes, I think you're confused about regularisation. A regularisation is (usually) a component in the overall loss, which has the goal of simplifying or preventing overfitting, as opposed to the main component which has the goal of fitting. It's not another term or a formal term for penalty.

sillysaurusx · on Jan 25, 2021

Thanks for explaining that. It seems that penalty is already used, which is unfortunate. One of the interesting things about ML is that you can learn for a year and a half and still uncover more things you didn't know, which I love.

I guess I'll call loss "punishment." It matches how it feels to make progress in ML anyway.

tpoacher · on Jan 25, 2021

Just because terms become established doesn't mean the way they went about becoming established was through clarity and careful deliberation. In fact I'd go as far as saying that more than half of the terms/notation in such fields sound like they were created as silly placeholder names which then stuck. So much so that we need a translation of their actual meaning each time they're used. Even something as basic as p(x).

pugio · on Jan 25, 2021

I still feel mental friction when contemplating anything to do with "regression" because the word doesn't seem to capture what the technique(s) (e.g. linear, logistic) actually do.

I have looked into the historical context and reason for the use of the word (the technique was first popularized in something which "regressed to the mean"), as well as its development, and it still bugs me any time.

jl2718 · on Jan 26, 2021

“error”? “residual”? “objective”? “cost”? “Penalty function” is commonly used in operations research/optimization, and the parameter correction, the “penalty”. There are some conceptual differences with RL as “penalty” is more like a data input, but I think that should be a participle like “punishment” because it implies action by the trainer.

QuesnayJr · on Jan 25, 2021

What's wrong with the name "loss"? I like the idea of calling it "kerfuffle", but loss seems like a neutral term to me.

I agree with you about fucked-up shit. I had a strong optimization background, and a traditional statistics background, and from that point everything you do with neural networks is just crazy.

sillysaurusx · on Jan 25, 2021

We have an opportunity here to define the terms that our descendants will be using 50 years from now. It won't come again.

Physics suffered from the same problem: "action," "work," and so on, are unrelated to their usage. But we're stuck with them.

Both "loss" and "learning rate" are confusing, and neural networks are so confusing that I think it's worth undoing as much as possible.

I would s/loss/penalty/ and s/learning rate/step size/, after giving it much thought. At least, I haven't thought of better names yet.

The reason "step size" is important is because it represents what's actually going on. You don't increase the learning rate to make it learn faster. You increase the step size to make it take longer steps towards a goal. And when it's close to the goal, it circles around the goal, like water down a drain. You decrease the step size (learning rate) towards the end of training so that it doesn't keep dancing around the bowl, and can finally reach its target in the middle.
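
A small sketch of the step-size picture, assuming the simplest possible bowl, f(x) = x**2. The decay schedule is a hypothetical choice for illustration, not a recommendation for real training.

```python
def descend(step_schedule, start=1.0):
    """Gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = start
    for step_size in step_schedule:
        x -= step_size * 2 * x
    return x

num_steps = 50

# Fixed step size of 1.0: every update jumps to the mirror point across the
# minimum, so the iterate keeps dancing around the bowl and never settles.
x_fixed = descend([1.0] * num_steps)

# Hypothetical decay schedule: shrink the step by 10% after every update.
x_decayed = descend([1.0 * 0.9 ** t for t in range(num_steps)])

print(abs(x_fixed), abs(x_decayed))
```

With the fixed step the distance to the minimum never shrinks; with the decaying step the same starting point ends up very close to it.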

Slight modifications like that can give lots of insights. For example, now that you're thinking of learning rate in terms of water spiraling down a drain, you can see why averaging the last N model checkpoints increases accuracy: if you're spinning in a circle around a target, then the average of your last 5 positions must bring you closer to the center. In fact, that's true of any convex shape. Therefore the loss landscape seems mostly convex.
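
The averaging intuition can be checked on a toy case. This is a sketch under strong assumptions: the "weights" are 2-D points, the target is the origin, and the last 5 checkpoints sit on a circle of radius 1 around it. Real checkpoints are high-dimensional and need not behave this neatly.

```python
import math

def distance(point, target=(0.0, 0.0)):
    """Euclidean distance from a 2-D point to the target."""
    return math.hypot(point[0] - target[0], point[1] - target[1])

# Hypothetical last 5 checkpoints, circling the target at radius 1.
angles_deg = [0, 60, 120, 180, 240]
checkpoints = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
               for a in angles_deg]

# Average the checkpoints coordinate by coordinate.
averaged = (sum(p[0] for p in checkpoints) / len(checkpoints),
            sum(p[1] for p in checkpoints) / len(checkpoints))

print([round(distance(p), 3) for p in checkpoints])
print(round(distance(averaged), 3))
```

Every individual checkpoint is at distance 1 from the target, while their average lands well inside the circle.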

And so it goes. It's very much like compound interest. The more you understand, the more you can understand. That's why it's so important to be determined and patient.

Also, ask lots of questions on Twitter. In my opinion it's one of the most crucial resources for learning ML. The ML community there is phenomenal, and I don't know why. All I know is that everyone is super friendly and eager to help you out. Start with @pbaylies, @jonathanfly, @aydaoai, and @arfafax.

notretarded · on Jan 25, 2021

It's not penalty or step size. It's loss as in amount of information lost (not encoded in your network) compared to one perfectly encoding ground truth. Learning rate, as in what is the maximum amount of delta you are allowed to change your inputs to minimise your information loss, analogous to how quickly you can possibly learn in one experiment.
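
One way to read "information lost" is through cross-entropy, a common loss that measures bits. The sketch below assumes that reading; the three-class distributions are hypothetical, and not every loss function has this interpretation.

```python
import math

def cross_entropy_bits(truth, model):
    """Average bits needed to encode outcomes drawn from `truth` using `model`."""
    return -sum(p * math.log2(q) for p, q in zip(truth, model) if p > 0)

def information_lost_bits(truth, model):
    """KL divergence: the extra bits paid because `model` is not `truth`."""
    return cross_entropy_bits(truth, model) - cross_entropy_bits(truth, truth)

# Hypothetical ground-truth distribution over three classes.
truth = [0.5, 0.25, 0.25]

perfect_model = [0.5, 0.25, 0.25]
uniform_model = [1 / 3, 1 / 3, 1 / 3]

lost_perfect = information_lost_bits(truth, perfect_model)
lost_uniform = information_lost_bits(truth, uniform_model)
print(lost_perfect, lost_uniform)
```

A model that matches the ground truth loses nothing; the uniform model pays a small number of extra bits per outcome, and minimizing cross-entropy drives that excess toward zero.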

sillysaurusx · on Jan 25, 2021

Fair. I’ll keep that in mind. On the other hand, it went way over my head, and I’m not afraid to admit it.

One of the nice things about ML (and math, for that matter) is that there are multiple mathematically equivalent ways of looking at a thing.

rseed42 · on Jan 25, 2021

That's really funny, 50 years from now? Backpropagation is not a law of physics, you know. Thanks for making me laugh ;).

sjg007 · on Jan 25, 2021

I’d be interested in reading your formal write up. It’d be interesting to see the issue as you understand it.

sillysaurusx · on Jan 25, 2021

Sure thing! https://github.com/google/compare_gan/issues/54

It’s not much of a writeup. It’s basically saying, hey, this is zero when it should be one.

The results were dramatic. It went from blobs to replicating the biggan paper almost perfectly. I think we’re at a FID of 11 or so on imagenet. Here's a screenshot I just pulled from our current run: https://i.imgur.com/k1RuWEG.png

Stole a year of my life to track it down. But it was a puzzle I couldn’t put down. It haunted my dreams. I was tossing and turning like, but why won’t it work... why won’t it work...

person_of_color · on Jan 25, 2021

What was your self study resource?

sillysaurusx · on Jan 26, 2021

As glib as it sounds: pick a lot of hard projects and work on them tirelessly. Ask lots of questions on Twitter.

It was both as simple and as hard as that.

person_of_color · on Jan 26, 2021

Where did you get project ideas?

sillysaurusx · on Jan 26, 2021

Mostly from gwern. He's an endless source of ideas.

Ended up being featured in a few articles.

- https://www.newsweek.com/openai-text-generator-gpt-2-video-g...

- https://www.theregister.com/2020/01/10/gpt2_chess/

- https://news.ycombinator.com/item?id=23479257

You can join the ML discord server here (https://github.com/shawwn/tpunicorn#ml-community) if you're looking to toss around ideas for things to do.

dragandj · on Jan 25, 2021

If I may drop in with a bit of shameless self-promotion.

My "Deep Learning for Programmers: A Tutorial with CUDA, OpenCL, DNNL, Java, and Clojure" book explains and executes every single line of code interactively, from low-level operations to high-level networks that do everything automatically. The code is built on the state-of-the-art performance operations of oneDNN (Intel, CPU) and cuDNN (CUDA, GPU). Very concise, readable, and understandable by humans.

https://aiprobook.com/deep-learning-for-programmers/

Here's the open source library built throughout the book:

https://github.com/uncomplicate/deep-diamond

Some chapters from the beginning of the book are available on my blog, as a tutorial series:

https://dragan.rocks

NalNezumi · on Jan 25, 2021

I remember taking my first Deep Learning course in University 5 years ago; we had to implement the neural network, gradient computation, Batch Normalization, Dropout and all other details from scratch without external libraries in either Java/C++/Matlab. Apparently this was the old way DL used to be taught, and even in 2016 the professor insisted everyone had to do it this way.

I didn't study ML in general so what I gained from the course was a deep understanding of the fundamentals & math behind it, but since I didn't get to familiarize myself with any of the existing libraries (Tensorflow/Keras back then) I had a hard time convincing anyone in industry of my skills in the field :/

Also: why does the book only cover Deep Q-Networks for Reinforcement Learning? Sure, it is the most notable deep learning step in the field, but other approaches such as Actor-Critic & Maximum Entropy RL can be very relevant too. If one includes YOLO, ResNet and newer architectures for Computer Vision applications, I don't know why the same isn't done for RL.

gillesjacobs · on Jan 25, 2021

I don't like that OP is using Arxiv to upload his course material.

Arxiv is supposed to be a pre-print scientific publication server. A place to post your nearly finished or ideally submitted journal or conference manuscript so you can reference it while it is being reviewed.

The purpose of Arxiv is in itself a patch for a too common, too long in duration publication process. Now it is often the first place to publish ML research and an obligatory source of literature in ML. No peer-review or any quality assurance makes for dubious work appearing there that can waste a lot of research time.

Hosting your elementary course there because that's where the researchers are, muddies the quality of work on Arxiv further.

cambalache · on Jan 25, 2021

You are around 25 years too late. Since the very beginning Arxiv has been used to publish not only cutting-edge research but also divulgation pieces, reviews, commentaries, workshop's accompanying materials, lecture notes and books.

hoseja · on Jan 25, 2021

Your criticism depends on peer-review actually being functional in the first place.

gillesjacobs · on Jan 25, 2021

I fail to see how it does. The popularity of Arxiv is indeed a sign of a slow-functioning peer-review system, but it has its valid uses.

My criticism pertains to OP using Arxiv as a PDF host for course material irrelevant to Arxiv's userbase, i.e. expert researchers. It is already hard enough to find the quality manuscripts in Arxiv. Hence why Karpathy made Arxiv Sanity Preserver [1]. I would rather not have to trudge through pages of tutorial PDFs when searching "neural graph methods for NLP" for instance.

1. https://www.arxiv-sanity.com

melling · on Jan 25, 2021

What book are people recommending for Deep Neural Networks?

I’m working through ISLR (starting ch8) so I’ll be done in a few weeks.

This topic isn’t covered so another textbook, with exercises, would be ideal.

Just noticed that this paper is a book. Maybe I have a winner?

max_ · on Jan 25, 2021

The Deep Learning book is great & I have a copy, but it's honestly not something that I read cover to cover.

To me, it's more of a reference book.

But if you want a "Feynman" type book that describes the underlying structure & workings in a non-academic way, I would recommend:

Michael Nielsen's Neural Networks & Deep Learning[0]

Jeff Heaton's Introduction To The Math Of Neural Networks[1]

[0]:http://neuralnetworksanddeeplearning.com/index.html

[1]:https://www.amazon.com/Introduction-Math-Neural-Networks-Hea...

manojlds · on Jan 25, 2021

Have been hearing good things about this - http://d2l.ai/

Have to get to it soon.

rsfern · on Jan 25, 2021

This paper seems really interesting, and it’s geared towards applications I guess

For more fundamental material I like the CS231n course notes [0] and Goodfellow, Bengio, and Courville [1]

0: https://cs231n.github.io/

1: https://www.deeplearningbook.org/

visarga · on Jan 25, 2021

The Hundred-Page Machine Learning Book by Andriy Burkov - is quite highly regarded and to the point.

http://themlbook.com/

gnrlst · on Jan 25, 2021

Is there an equivalent book/pdf for PyTorch? In general, any recommended courses/books? I've already done a first pass of fastai's 2020 course which was very eye-opening and a good segue from the introductory courses on Kaggle.

shimonabi · on Jan 25, 2021

I came across Jeff Heaton's books a long time ago, when I was looking for material on how to implement my own NN for my AI class. If I remember correctly, he published C# and Java implementations.

I eventually settled for Python and implemented the NN with the help of the book Make Your Own Neural Network by Tariq Rashid. Numpy is really magic.

jonbaer · on Jan 25, 2021

Same guy? https://www.youtube.com/user/HeatonResearch

tanelpoder · on Jan 25, 2021

Yes, looks like the same person. The Code tab on arxiv page points to the same github account as on the youtube page.

aceon48 · on Jan 25, 2021

I've been really enjoying learning Deep Learning. What are job titles where you would get to work with this? and how much knowledge is expected already?