
Finding Waldo Using Semantic Segmentation and Tiramisu - carlosgg
https://hackernoon.com/wheres-waldo-terminator-edition-8b3bd0805741?imm_mid=0f57e6
======
everdev
The premise of "novelty" ML projects like finding Waldo, playing Mario, etc.
has been that the application could be of greater use elsewhere.

Has this been demonstrated? I've heard Watson got really good at Jeopardy, but
in the enterprise world many were unsatisfied with the results.

Is it still considered more efficient to focus on small ML projects and then
apply the lessons to bigger problems? It seems like we have enough algorithms
that we can start making valuable ML products and focus less on novelty
applications.

~~~
ml_thoughts
I think an often-overlooked aspect of deep learning is that it is still not
clear how much of current algorithm performance comes from local "obsession
with detail" versus global "understanding" of the subject matter.

The "tiramisu" layers here are an interesting example of this: they are built
on dilated convolutions and one of their main selling points is that they can
do calculations on a pixel-by-pixel basis, as opposed to standard methods
which use are forced to compress information along the way through
pooling/strided convolutions (basically taking multiple pixels and summarizing
them into fewer features).
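
To make the contrast concrete, here is a rough PyTorch sketch (not the article's code; the image size and channel counts are made up) of how a strided convolution compresses the spatial grid while a dilated convolution keeps one output per input pixel:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 64, 64)  # a dummy 64x64 RGB image

    # Strided convolution: summarizes 2x2 neighborhoods, halving the resolution.
    strided = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
    print(strided(x).shape)  # torch.Size([1, 16, 32, 32]) -- information compressed

    # Dilated convolution: wider receptive field, but still one output per pixel.
    dilated = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=2, dilation=2)
    print(dilated(x).shape)  # torch.Size([1, 16, 64, 64]) -- pixel-by-pixel output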

Even WaveNet, which has had a few posts on HN, is in some sense a compromise:
a few years ago people were obsessed with the idea of forcing RNNs/LSTMs to
summarize the inputs they've seen to date and to learn long-range dependencies
through a hidden layer that would hopefully be interpretable. Mostly, though,
the models seem to be very happy staring at the last few inputs... a recent
paper showed they mostly behave like n-gram models with relatively small n
[0, 1].

The compromise is WaveNet: it can only act on a context of roughly 300 ms at a
time [2], which more or less precludes learning long-range structure, but it
doubles down on this inductive bias and runs that tiny audio context through
many layers and tens of millions of parameters to outdo state-of-the-art
models that need to "lose information" as they process it.

To your point, I would argue that most real-world applications are more
interested in "global" interactions and an ability to "understand" signals
than in expending tremendous resources on every tiny observed detail. I'm not
sure that this is the kind of solution neural networks are going to converge
to.

Partly I think this is motivated by hardware: GPUs are unbelievably powerful
computing machines, and they make convolutions look extremely attractive. Some
researchers have 8 or more of them, so the cost of obsessing over detail stops
being a worry. The other ingredient of the tiramisu models, DenseNets,
basically glues together deep layer after deep layer after deep layer... It's
an obvious idea for an architecture, but from my understanding of GPUs, layer
concatenation is an expensive operation, and people wouldn't have bothered
designing it a few years ago because they wouldn't have been able to run it on
anything other than a Titan X from the future.
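
As a bare-bones sketch of that pattern (channel counts invented, no claim this matches the paper's exact block):

    import torch
    import torch.nn as nn

    growth_rate = 12
    features = [torch.randn(1, 16, 32, 32)]  # the block's input feature map

    # DenseNet-style block: each layer consumes the concatenation of *all*
    # earlier feature maps, so channel counts (and memory traffic) grow with depth.
    for _ in range(4):
        in_channels = sum(f.shape[1] for f in features)
        conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)
        out = torch.relu(conv(torch.cat(features, dim=1)))  # the expensive concat
        features.append(out)

    print(torch.cat(features, dim=1).shape)  # [1, 64, 32, 32]: 16 + 4*12 channels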

Probabilistic models have been floated as a way to increase the global
coherence of what these models learn. In my experience, however, when you try
to combine probabilistic and convolutional components in one model (like [3]),
the network's first-order optimization promotes obsession over details rather
than understanding the data well enough to handle any uncertainty. To some
degree I think this is also what we see in the deep learning community: why
bother with second-order optimization and new paradigms when you have 8 GPUs
and can get a 0.1% improvement on the latest image benchmark?
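
A toy illustration of why the optimization tilts that way: in a convolutional VAE-style loss, the reconstruction term is summed over every pixel while the KL term only touches a small latent vector, so the gradient is dominated by local detail (all the numbers here are invented):

    import torch

    batch, channels, h, w, latent_dim = 32, 3, 64, 64, 100

    per_pixel_nll = 0.05 * torch.ones(batch, channels, h, w)  # pretend reconstruction error
    per_dim_kl = 0.05 * torch.ones(batch, latent_dim)         # pretend KL divergence

    print(per_pixel_nll.sum(dim=(1, 2, 3)).mean())  # ~614: 3*64*64 pixel-level terms
    print(per_dim_kl.sum(dim=1).mean())             # 5: only 100 latent terms
    # A first-order optimizer mostly "sees" the pixel term, i.e. local detail.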

[0] Blog post on "Frustratingly Short Attention":
https://martiansideofthemoon.github.io/2017/06/28/short-attention-iclr-summary.html

[1] From https://arxiv.org/pdf/1703.08864.pdf : "It shows that the memory
acquired by complex LSTM models on language tasks does correlate strongly with
simple weighted bags-of-words. This demystifies the abilities of the LSTM
model to a degree: while some authors have suggested that the LSTM understands
the language and even the thoughts being expressed in sentences (Choudhury,
2015), it is arguable whether this could be said about a model that performs
equally well and is based on representations that are essentially equivalent
to a bag of words."

[2] https://arxiv.org/pdf/1609.03499.pdf

[3] "PixelVAE": https://arxiv.org/pdf/1611.05013.pdf

~~~
mycelium
Thanks very much for the substantive thoughts. As an everyday working
programmer, it's nice to hear higher-level thinking that explains the wider
context of "NNs play Mario". We're looking to include some ML-based features
in our product, and I like hearing intuition from practitioners about what
actually works and wtf these things are actually paying attention to. I've
followed the AlphaGo competition closely in hopes of gaining better insight,
but it still ends up being a ton of woo.

