
Lessons from Optics, the Other Deep Learning - mxwsn
http://www.argmin.net/2018/01/25/optics/
======
twtw
Another field that might be interesting to compare is analog design. There is
a similar stack of theories: lumped element -> transmission line -> Maxwell's
equations. And yet analog IC design depends heavily on inherited mental models
from mentors and modifications of well known topologies. Outsiders think it is
black magic. The physics is all understood (nearly) perfectly, and yet knowing
the details of QM that explains MOSFET operation helps not at all (or very
little) when designing actual useful circuits. The real world considerations
of parasitics, coupling, etc. dominate, and extensive formal analysis is not
terribly useful. The general methodology is to make changes to the design
based on intuition, simple predictive models that give you a direction, and
previous experience, and then simulate to see how you did.

A ton of high-quality engineering is done based on intuition, mental models,
and patterns learned over years of experience. My hunch is that deep learning
will be the same.

EDIT: Just reread, and I want to clarify. I'm not saying that analog design is
at the same stage of development as deep learning, or that it is anywhere near
as ad hoc. Deep learning probably has a long way to go, but it could
potentially end up in a similar state where years of experience is critical
and intuition rules.

~~~
wrycoder
A couple of times, I've heard Gerry Sussman at MIT (SICP author) give a talk
on how bipolar transistor circuits are actually designed. You don't use SPICE
or mesh analysis except in unusual circumstances (e.g. non-linear circuits) or
to fine-tune a completed design.

As an example, a bias design goes something like this: "Let's see, I'll pin
the base at five volts with a resistor divider. The emitter will be 0.6V below
that. Then the emitter current will be (5.0 - 0.6) divided by the emitter
resistor. The collector current will be essentially the same, so I can pick
the collector load resistor to give me an appropriate quiescent point and make
sure the output impedance is less than a tenth of the input impedance of the
following stage (so I can ignore the latter)."
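
To make the arithmetic concrete, here's a rough Python sketch of that
back-of-envelope bias calculation (component values are made up for
illustration, not from the talk):

    # Back-of-envelope bias point for a common-emitter stage, following the
    # reasoning above. All component values are illustrative only.
    V_BASE = 5.0          # base pinned by the resistor divider (V)
    V_BE = 0.6            # base-emitter drop (V)
    R_EMITTER = 1e3       # emitter resistor (ohms)
    R_COLLECTOR = 1e3     # collector load resistor (ohms)
    V_SUPPLY = 15.0       # supply rail (V)

    v_emitter = V_BASE - V_BE          # emitter sits ~0.6 V below the base
    i_emitter = v_emitter / R_EMITTER  # emitter current
    i_collector = i_emitter            # collector current is essentially the same
    v_collector = V_SUPPLY - i_collector * R_COLLECTOR  # quiescent collector voltage

    print(f"I_C ~ {i_collector * 1e3:.1f} mA, quiescent V_C ~ {v_collector:.1f} V")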

~~~
analog31
The classic _Art of Electronics_ by Horowitz and Hill definitely taught this
approach. It's considered to be a hard textbook for undergraduates, but is
also a joy to read and study for pleasure.

~~~
vvanders
Can't upvote enough, that book is a treasure. Expensive but totally worth it.

~~~
amelius
Sorry, but I can't agree. It's basically a cookbook. It doesn't present the
material in an analytical style.

~~~
vvanders
Is that a bad thing? For me the value wasn't in the cookbook aspect but in the
explanation of the practical considerations when it comes to engineering a
design.

Knowing a parameter in detail isn't nearly as important as knowing if that
parameter matters in the scope of the final design.

The book is also pretty honest about that and goes to some lengths to point
readers toward deeper material if there's an area they want to understand in
greater depth.

------
TYPE_FASTER
I worked for a couple years on control systems software without any kind of
experience or formal training. We would get source code, with some initial
tuned parameters, then take our robot out into the field, and re-tune our
control loops based on reality.

Here was my takeaway: an engineer has to understand the domain and the
algorithms involved at a deep level, or they will not be productive. Failing
that, you need both an engineer and somebody with the domain knowledge and
experience.

It doesn't really matter what your problem domain is. If you're an engineer,
and it's your job to make changes to a system, whether code or config, you
need to understand it at a deep level. And your manager needs to understand
this requirement.

Otherwise, you will be guessing at changes, so your productivity will be
horrible or non-existent.

~~~
zwieback
Very true but the post makes a more subtle point: you don't have to have a
model that explains everything as long as your model is predictive for the
problem you're trying to solve. You can build a good telescope without
understanding quantum mechanics. I guess that's the difference between science
and engineering, broadly speaking.

~~~
amelius
> I guess that's the difference between science and engineering, broadly
> speaking.

Science also doesn't need a model that explains everything. If it had one,
there would be nothing left for science to do.

------
metakermit
I like the parallel between optical lens systems and deep learning. I'm also
kind of disappointed by the "arcane lore" status hyper-parameters have in
different ML domains. I think it would be healthier for the community to make
it a habit to explicitly document why a certain topology and layer sizes were
selected. It's like providing documentation with your open source project –
yes, knowledgeable people could use it without it, but it would be much more
difficult and beginner-unfriendly.
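
For example (everything below is invented, purely to illustrate the habit),
that documentation could be as lightweight as a config whose comments record
the rationale:

    # Hypothetical example of recording why hyperparameters were chosen.
    # The values and the reasons are made up for illustration only.
    config = {
        "conv_layers": 4,       # 3 underfit on validation; 5 gave no further gain
        "hidden_units": 256,    # smallest width that matched the 512-unit baseline
        "learning_rate": 1e-3,  # 1e-2 diverged, 1e-4 reached the same loss ~3x slower
        "dropout": 0.5,         # copied from the original paper, not re-tuned
        "batch_size": 64,       # largest size that fit on the GPU we used
    }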

~~~
manux
I wonder how documentable the space of hyperparameters really is (which I
think is what the OP is poking at), both with the current way we conceive of
them and with how experiments currently happen.

Often, people either reuse other people's architectures, or simply try 2 or 3
and stick with the best one, only changing the learning rate and such.

I also wonder if there's a computation issue (training is long, we can only
try so many things), or if it really is that we are working in the wrong
hyperparameter space. Maybe there is another space we could be working in,
where the HPs that we currently use (learning rate, L2 regularization, number
of layers, etc.) are a projection from that other HP space where "things make
more sense".

~~~
azag0
In this regard, it is similar to how the natural sciences are done. The
hyperparameter space of possible experiments is immense and experiments are
expensive, so one has to go with intuition and luck. Reporting this is
difficult.

[edit:] In this analogy, deep learning currently lacks any sort of general
theory (in the sense of theories explaining experiments).

~~~
saguro
A DNN might be more effective at exploring the hyperparameter space than
people are with their intuition and luck. Rumor is Google has achieved this.

~~~
yorwba
Google simply has the computational resources to cover thousands of different
hyperparameter combinations. If you don't have that, you won't ever be able to
do systematic exploration, so you might as well rely on intuition and luck.

~~~
posterboy
This is not accurate. Chess alone is so complex that brute force would still
take an eternity, and they certainly don't have a huge incentive to waste
money just to show off (because that would reflect negatively on them).

But how does it work? Outpacing other implementations is one thing, but the
model even runs on a consumer machine, if I remember correctly.

I have only read a few abstract descriptions and I have no idea about deep
learning specifically. So the following is more musing than summary:

They use the Monte Carlo method to generate a sparse search space. The data
structure is likely highly optimized to begin with. And it's not just a single
network (if you will, any abstract syntax tree is a network, but that's not
the point), but a whole architecture of networks -- modules from different
lines of research pieced together, each probably with different settings. I
would be surprised if that works completely unsupervised; after all, it took
months to go from beating Go to beating chess. They can run it without
training the weights, but likely because the parameters and layouts are
already optimized, and, to the point of the OP, because some optimization is
automatic. I guess what I'm trying to say is, if they extracted features from
their own thought process (i.e. domain knowledge) and mirrored that in code,
then we are back at expert systems.

PS: Instead of letting processors run small networks, take advantage of the
huge neural networks experts have in their heads and guide the artificial
neural network in the right direction. Mostly, information processing follows
insight from other fields and doesn't deliver explanations. The explanations
have to be there already. It would be particularly interesting to hear how the
chess play of the developers involved has evolved since, and how much of the
model they actually understand.

~~~
yorwba
I'm curious how you can tell that my comment is not accurate when you yourself
admit that you have no idea about deep learning.

Note that I'm not saying that Google is doing something stupid or leaving
potential gains on the table. What I'm saying is that their methods make sense
when you are able to perform enough experiments to actually make data-driven
decisions. There is just no way to emulate that when you don't even have the
budget to try more than one value for some hyperparameters.

And since you mentioned chess: The paper
[https://arxiv.org/pdf/1712.01815.pdf](https://arxiv.org/pdf/1712.01815.pdf)
doesn't go into detail about hyperparameter tuning, but does say that they
used Bayesian optimization. Although that's better than brute force, AFAIK its
sample complexity is still exponential in the number of parameters.

~~~
posterboy
Your comment reminded me of myself, so maybe I read a bit too much into it.
Even given Google's resources, I wouldn't be able to "solve" chess any time
soon, and it's a fair guess that this applies to most people (maybe a slightly
smaller share here), so I took the opportunity to provoke informed answers
correcting my assumptions. I did search for papers afterwards, so your link is
appreciated, but it's all lost on me.

> they used Bayesian optimization. Although that's better than brute force,
> AFAIK its sample complexity is still exponential in the number of
> parameters.

I guess the trick is to cull the search tree by making the right moves,
forcing the opponent's hand?

~~~
yorwba
I think you are confused about the thing being optimized.

 _Hyperparameters_ are things like the number of layers in a model, which
activation functions to use, the learning rate, the strength of momentum and
so on. They control the structure of the model and the training process.

This is in contrast to "ordinary" parameters which describe e.g. how strongly
neuron #23 in layer #2 is activated in response to the activation of neuron
#57 in layer #1. The important difference between those parameters and
hyperparameters is that the influence of the latter on the final model quality
is hard to determine, since you need to run the complete training process
before you know it.

To specifically address your chess example, there are actually three different
optimization problems involved. The first is the choice of move to make in a
given chess game to win in the end. That's what the neural network is supposed
to solve.

But then you have a second problem, which is to choose the right parameters
for the neural network to be good at its task. To find these parameters, most
neural network models are trained with some variation of gradient descent.

And then you have the third problem of choosing the correct hyperparameters
for gradient descent to work well. Some choices will just make the training
process take a little longer, and others will cause it to fail completely,
e.g. by getting "stuck" with bad parameters. The best ways we know to choose
hyperparameters are still a combination of rules of thumb and systematic
exploration of possibilities.
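
To make the nesting concrete, here's a tiny numpy sketch (a toy linear model,
nothing like the chess setup): the inner loop fits the parameters by gradient
descent, while the hyperparameter can only be judged by re-running the whole
training process.

    import numpy as np

    # Inner problem: gradient descent finds the *parameters* (the weights).
    # Outer problem: the *hyperparameter* (here, the learning rate) can only be
    # evaluated by running a complete training process and checking the result.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    true_w = rng.normal(size=5)
    y = X @ true_w + 0.1 * rng.normal(size=200)
    X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

    def train(lr, steps=500):
        w = np.zeros(5)
        for _ in range(steps):
            grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
            w -= lr * grad
        return w

    for lr in [1e-4, 1e-3, 1e-2, 1e-1]:   # too-large values make this loop diverge
        w = train(lr)
        val_loss = np.mean((X_val @ w - y_val) ** 2)
        print(f"lr={lr:g}  validation loss={val_loss:.4f}")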

------
mabbo
Scary possibility: what if there is no good formal theory to explain how it
works? What if intelligence, both animal and machine, is purely random trial
and error and "this thing seems to work"?

I don't believe that's necessarily true, but it would sure hamper the author's
hopes.

~~~
wirrbel
This might indeed be true for deep learning in its current shape. However, in
the long run I think we will see (1) models that are more composable and (2)
better approaches to engineering such models; these engineered models will
look a bit different from what your research scientist throws over the fence
today.

At the moment we are in a phase where, to stick with the optics metaphor, we
stack up lenses until we see the object on the screen. This means we end up
with models that are sprawling, instead of models that were engineered.

Another trend that seems to be starting in deep learning is that layers become
more constrained. I expect that in 20 years we will see much more constrained
models and much more generative models.

~~~
m_ke
1. Deep learning models are extremely composable.

2. Hopefully with time we'll have better approaches to engineering all things
that are engineered.

No, at the moment we go for the biggest and shiniest lens that we can get our
hands on and hope that it's capable enough to tackle our problem. If it is, we
can waste time designing a smaller, more constrained lens to ship to
consumers.

~~~
taeric
I'm curious if you two are going to be talking past each other with the first
point. Any chance I could get you both to explore what you mean by composable?

~~~
m_ke
These days any method that uses gradient descent to optimize a computational
graph gets branded as deep learning. It's a very general paradigm that allows
for almost any composition of functions as long as it's differentiable. If
that's not composable then I don't know what is.

There's a reason why LeCun wanted to rebrand deep learning as differentiable
programming.
[https://www.facebook.com/yann.lecun/posts/10155003011462143](https://www.facebook.com/yann.lecun/posts/10155003011462143)

I'm not sure what wirrbel meant.
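
As a concrete (if contrived) illustration of that view, in PyTorch you can mix
modules with arbitrary differentiable Python code and still train the whole
composition end to end; the architecture below is arbitrary:

    import torch
    import torch.nn as nn

    # Arbitrary composition of differentiable pieces, trained end to end.
    encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
    head = nn.Linear(32, 1)

    def model(x):
        h = encoder(x)
        # Ordinary Python / tensor math composed with the modules above.
        return head(h) + 0.1 * torch.sin(h).sum(dim=1, keepdim=True)

    x, y = torch.randn(64, 10), torch.randn(64, 1)
    params = list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.SGD(params, lr=0.01)

    for _ in range(100):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # autograd differentiates through the whole composition
        opt.step()
    print(loss.item())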

~~~
taeric
My guess for what they meant was that you can't compose the trained models.

For example, a classifier that tells you cat or not can't be used with one
that says running or not to get running cat.

The benefit being that you could put together more "off the shelf" models into
products. Instead, you have to train pretty much everything from the ground
up. And we compare against others doing the same.

@wirrbel, that accurate?

~~~
m_ke
You can do that, you just need a good embedding space.
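
One way to read that (just a sketch; the backbone here is a stand-in, not a
real pretrained model): freeze a shared backbone, train the two heads
independently on its embeddings, and compose their outputs.

    import torch
    import torch.nn as nn

    # Pretend this backbone is pretrained and frozen; both heads are trained
    # separately on top of the same embedding space.
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
    for p in backbone.parameters():
        p.requires_grad = False

    cat_head = nn.Linear(128, 1)       # trained on cat / not-cat labels
    running_head = nn.Linear(128, 1)   # trained on running / not-running labels

    def is_running_cat(images):
        z = backbone(images)                       # shared embedding
        p_cat = torch.sigmoid(cat_head(z))
        p_running = torch.sigmoid(running_head(z))
        return p_cat * p_running                   # compose the two detectors

    print(is_running_cat(torch.randn(4, 3, 32, 32)).shape)   # torch.Size([4, 1])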

~~~
taeric
You have good examples? All ways I can think of for doing this are worse than
just training the individual models.

------
nashashmi
Look at how many times this article was submitted and how long ago it was
submitted and never gained traction.

[https://news.ycombinator.com/from?site=argmin.net](https://news.ycombinator.com/from?site=argmin.net)

~~~
Houshalter
HN's algorithm could be improved if new stories were briefly shown on the
front page so a few people could see them and have a chance to vote, like how
new comments start at the top of the thread and fall if they don't get votes.

------
dekhn
If this is interesting to you, it will also be interesting to you that lenses
perform the physical equivalent of an analog Fourier transform, and physicists
exploited this to compute wave spectra well before digital computers existed.
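
The digital analogue is a two-liner: in the far-field (focal-plane) regime,
the diffraction pattern of an aperture is, up to scaling, the squared
magnitude of the aperture's 2-D Fourier transform. A small numpy sketch:

    import numpy as np

    # The focal-plane intensity behind a lens illuminated through an aperture is
    # (up to scaling) |FT(aperture)|^2 -- here mimicked with a digital FFT.
    n = 512
    x = np.linspace(-1, 1, n)
    X, Y = np.meshgrid(x, x)
    aperture = (np.abs(X) < 0.05) & (np.abs(Y) < 0.05)    # small square aperture

    far_field = np.fft.fftshift(np.fft.fft2(aperture))
    intensity = np.abs(far_field) ** 2    # sinc^2-like diffraction pattern
    print(intensity.shape, intensity.argmax())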

~~~
6502nerdface
Similarly, the human cochlea performs a physical Fourier transform of sound
waves, and acoustic engineers used similar principles to create paper
spectrograms back in the 1950s (check out the Kay Electric Co. Sona-Graph).

------
zwieback
Thanks for introducing me to this blog! This is the money quote for me:

 _There’s a mass influx of newcomers to our field and we’re equipping them
with little more than folklore and pre-trained deep nets, then asking them to
innovate._

~~~
tinymollusk
As one of the recent newcomers, should I feel defensive when I read something
like this? I understand there are people with much more knowledge. Isn't this
true of everyone, in every field?

The message I've gotten is "try things out". Innovation isn't necessarily
improving specific techniques, but applying them to new fields. To apply
techniques to more mundane things like data processing in non-AI-focused
companies, you're gonna need bodies who know how to apply these newer
programming techniques to solve problems.

Not every electrician has to understand electrical engineering.

~~~
currymj
I don't think so. The author here also co-wrote the "deep learning is
alchemy" talk that was somewhat controversial at NIPS.

I think this is especially important if you purely want to do applications. We
have a bag of tricks (dropout, batchnorm, different optimizers and learning
rate schedules). We have no real theory for why any of this should work; often
a proposed explanation will later turn out not to make sense.

So the choice of how to train things comes down to "folklore", the community's
collective experience. And there's no guarantee that folklore will generalize
to your new architecture or dataset, and no way to know whether it even
should.

The presentation seems to have struck a nerve, and there are papers and talks
floating around now examining the performance of common architectures in very
simple settings. It's probably worth paying attention to these, at least in
the background, as they will hopefully crystallize into a body of knowledge
that will be useful for someone trying to decide on architectures and
optimization techniques.

------
sgt101
The flaw (geddit?) in this is that deep networks process data from different
domains, with different characteristics (which change in chaotic ways). In ML
the choice was always "the simpler the better" - we used statistics and
information theory to apply Occam's razor. Deep networks don't work that way,
and yet they do work well in some real domains; nature does not always prefer
simple domain theories. If the laws of the universe suddenly changed,
designing optics would suddenly become difficult as well.

------
meri_dian
A comment below makes an important point that I think is worth repeating:

"The power of digital computing is the power of modular expansion of objects.
Analog circuits and computers don't have that. And current trained deep
learning models don't have combinability and modularity either."

A point I'd like to make: the brain exhibits properties of both digital and
analog computers. It also exhibits repeating units in the neocortex which do
vary but are uniform enough that neuroscientists are comfortable classifying
them as discrete units within the brain.

I believe we must look to how the brain implements effective modularity in the
context of analog computation in order to replicate the success of digital
computers with deep nets.

------
gyom
One of the problems with coming up with a good theory is that, at the end of
the day, we're building a system that's particularly suited to a certain kind
of pattern. If you're building a facial-recognition convnet, there is something
about the dataset of faces that is going to influence what works and what
doesn't.

When you're building digital circuits, they're expected not to care about what
the bits mean or which patterns are more likely. They work for all possible
inputs, with equal quality.

There are things in common between how you would process faces and how you would
recognize other visual objects, and that's why there are design patterns such
as "convolutional layers come before fully-connected layers".

In a way, the "no free lunch" theorem says that you are always paying a price
when you specialize to a certain kind of pattern; it comes at the expense of
other patterns. So, any kind of stack of theories on ML/DL is going to be
incomplete unless you say something about the nature of your data/patterns.

(That doesn't mean we can't say anything useful about DL, but it does put a
certain damper on those efforts.)

------
agitator
It could also be that the younger engineers are engineers by training, while
the more senior members of the team are PhDs in the field.

What I'm trying to say is that PhDs come from an academic research
background, while engineers come from a product focused background. The deep
learning field is still dealing with a lot of unknowns, counter-intuitive
responses to modifications, and pure experimentation. The engineers might just
not realize the need for continued experimentation, and, for them, it may just
feel like an undesirable waste of time to fiddle with parameters (as in,
taking away time from developing the actual product).

It's an alternate point of view, but something that I experienced.

------
stochastic_monk
This reminds me of ACDC, a Deep-fried Convnets-like[0] approach by some NVIDIA
employees. [1] See section 1.1, where they state they could perform the
operation in analog.

[0]: [https://arxiv.org/abs/1412.7149](https://arxiv.org/abs/1412.7149)

[1]: [https://arxiv.org/pdf/1511.05946](https://arxiv.org/pdf/1511.05946)

------
eb0la
I really miss having a building-block rationale for all the
perception/classification/segmentation networks out there.

The only thing I've found _really_ useful so far is to add two fully-connected
layers if the classifier doesn't handle classification well... just because
you needed a hidden perceptron layer for the XOR case.
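
The XOR case itself is small enough to check directly; a quick toy sketch (the
layer sizes and optimizer are arbitrary, and a bad initialization can
occasionally still fail):

    import torch
    import torch.nn as nn

    # XOR is not linearly separable, so a purely linear classifier can't fit it,
    # but one hidden layer is enough.
    X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([[0.], [1.], [1.], [0.]])

    model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1), nn.Sigmoid())
    opt = torch.optim.Adam(model.parameters(), lr=0.1)

    for _ in range(2000):
        loss = nn.functional.binary_cross_entropy(model(X), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(model(X).detach().round().flatten())   # expect tensor([0., 1., 1., 0.])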

I hope to find more examples like that. If you know them, please share!!

------
amelius
Deep learning is the new alchemy. So perhaps we can learn from alchemists?

PS: The parallel between DL and optics is (if viewed historically) a bit
misleading, because for building lenses we first had a theory.

------
lexy0202
What the author refers to as "randomization strategies" is in fact
regularisation: a set of techniques to prevent overfitting.
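
For readers who haven't met the term, two of the most common regularisation
techniques look roughly like this (illustrative values only):

    import torch.nn as nn
    import torch.optim as optim

    # Dropout randomly zeroes activations during training; weight decay adds an
    # L2 penalty on the parameters. Both discourage overfitting.
    model = nn.Sequential(
        nn.Linear(100, 64),
        nn.ReLU(),
        nn.Dropout(p=0.5),     # active only in model.train() mode
        nn.Linear(64, 10),
    )
    opt = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)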

------
TeMPOraL
Tangential to the topic, but one of the references in this article is this
amusing paper:

[http://nyus.joshuawise.com/batchnorm.pdf](http://nyus.joshuawise.com/batchnorm.pdf)

... which references an even better one:

[http://pages.cs.wisc.edu/~kovar/hall.html](http://pages.cs.wisc.edu/~kovar/hall.html)

We've been having a solid laughfest in the office for the past 10 minutes or
so.

~~~
fpgaminer
>
> [http://pages.cs.wisc.edu/~kovar/hall.html](http://pages.cs.wisc.edu/~kovar/hall.html)

This reminds me of my bioinformatics class. The final project was to reproduce
the results of a famous paper in the field.

All of us spent _weeks_ trying to do it. Nobody succeeded. The more we dug
into the paper, the more holes appeared. There were variables missing in the
paper, assumptions not covered, datasets not properly specified, etc. It made
reproduction nearly impossible; like winning the lottery. Imagine trying to
recreate the results of a deep learning paper without the paper specifying
_any_ information about the layers used, their sizes, or any hyperparameters.

The professor was equally mystified.

Years later I learned this kind of pseudo-science is rife in the field of
bioinformatics. I felt both a sense of relief in knowing we weren't crazy, and
disappointment. I actually really liked that class; the field of
bioinformatics fascinated me. But realizing what a cesspool it was left me
disappointed.

I'm glad machine learning as a field has taken proactive steps to avoid these
exact kinds of issues. It's now common practice in ML to publish code and
models alongside your papers, and most ML libraries allow deterministic
training. This makes reproduction of results easy. It's a breath of fresh air.
That doesn't obviate all problems. Methodologies and conclusions are still up
for debate in any given paper. But at least the experiments themselves are
reproducible. And if you question the methodology or some aspect of the
experiment, you can go in and augment the experiment yourself.
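
In PyTorch, for instance, the usual seeding boilerplate looks something like
this (exact reproducibility still depends on the ops and hardware involved):

    import random
    import numpy as np
    import torch

    # Pin the seeds used by the common sources of randomness in a training run.
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    # Recent PyTorch versions can additionally enforce deterministic kernels:
    # torch.use_deterministic_algorithms(True)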

~~~
sevenfive
Curious, can you link the paper?

~~~
fpgaminer
The paper was Neuronal Transcriptome of Aplysia: Neuronal Compartments and
Circuitry by Moroz

This was a decade ago. Looking at the paper again I believe we only tried to
reproduce a small portion of it: the phylogeny tree from the paper and its
supplemental material.

------
gaze
The difference is the character of the non-linearity.

