
Neural Networks, Types, and Functional Programming - burningion
http://colah.github.io/posts/2015-09-NN-Types-FP/
======
ericjang
I really liked the mention of the "three narratives" of neural networks: 1)
human brain 2) nonlinear "squashing and folding" of the input space 3)
probabilistic / latent variables. Seeing different ways to look at something
is always refreshing and opens up deep learning to rich analytical techniques
developed in optimization, type theory, etc.

HN: how do YOU think about NNs? As a matter of pure preference, most of my
daydreaming of NNs comes from my inductive biases of (1) and (2), with a few
ideas from MCMC methods and optimization thrown in the mix.

> Representation theory

One question I've always wondered is whether it is better to get a network to
(1) learn representation transformations A->B->C->D via a series of layers, or
(2) try to learn A->D via a single (highly nonlinear) transformation.

Obviously the former lends itself better to analytical understanding, and this seems to be what emerges when you train a network on ImageNet. However, do researchers have any control over whether the learned net does (1) or (2)? I'm thinking that scattering "skip arcs" that pass gradients through layers (or an LSTM) would cause representations to mix more between layers.

> Each layer is a function, acting on the output of a previous layer. As a
> whole, the network is a chain of composed functions. This chain of composed
> functions is optimized to perform a task.

This is certainly true for the majority of models in use today, but I wonder
if this will remain true for future architectures. In practice, most layer
types store some state, and implement "forward" and "backward" functions that
may utilize that state.
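
Concretely, I'm picturing something like this (a rough sketch of my own; the details are made up for illustration):

```python
import numpy as np

# Minimal sketch of a stateful layer: forward caches whatever backward needs.
class Dense:
    def __init__(self, n_in, n_out):
        self.W = 0.01 * np.random.randn(n_out, n_in)
        self.x = None                          # state saved during forward

    def forward(self, x):
        self.x = x                             # cache the input for the backward pass
        return self.W @ x

    def backward(self, grad_out):
        self.dW = np.outer(grad_out, self.x)   # gradient w.r.t. the weights
        return self.W.T @ grad_out             # gradient w.r.t. the input
```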

One could argue that state is not necessary if we only consider the network to
be defined by its feedforward passes (de-coupled from the optimization
problem, which requires the "backward" functions). But the neuroscience
narrative argues that function and learning cannot be decoupled.

In the case of a network that is supposed to update its own weights during normal operation (e.g. topic discovery), the functional narrative is less clear to me. Thoughts?

~~~
colah3
> Seeing different ways to look at something is always refreshing ... how do
> YOU think about NNs?

A year or two ago, when I was really doing the wandering researcher thing, I made a point of asking lots of people at different groups how they saw deep learning. I think these three narratives capture most of what I heard, although people blended them in different ways.

Certain views are more common in particular groups. For example, people at the
Montreal group seem more inclined to think about things from the
representations narrative than other groups. (Although, I think I have just
about everyone else beat in my extreme version of seeing everything from the
manifold perspective. :P )

I feel like there's some pretty interesting quasi-sociological work to be done on deep learning, because the community has grown so fast and there's a lot of variety in how people think.

>> ... chain of composed functions is optimized ...

> This is certainly true for the majority of models in use today, but I wonder if this will remain true for future architectures.

I suspect there's something very fundamental about chains of composed
functions. I can't formalize my feeling into a strong argument though.

> In the case of a network that is supposed to update its own weights during normal operation (e.g. topic discovery), the functional narrative is less clear to me.

I don't think it matters too much. The functional graph becomes more complicated, but it's just a matter of the output of one function going to multiple places.

~~~
ericjang
Cool, thanks again for your great post. Big fan of your articles.

> The functional graph becomes more complicated

Indeed. I once tried to come up with a "higher order function" that takes in a feedforward network and returns a separate function that computes the backward pass (like Theano's autodiff, but with the abstraction at the layer level rather than at individual ops).
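
Roughly, the toy version I had in mind looks something like this (layer-level rather than op-level; all the functions here are made up for illustration):

```python
import numpy as np

def affine(W):
    fwd = lambda x: W @ x
    bwd = lambda x, g: W.T @ g          # given the layer's input and upstream gradient
    return fwd, bwd

def relu():
    fwd = lambda x: np.maximum(x, 0)
    bwd = lambda x, g: g * (x > 0)
    return fwd, bwd

def backward_network(layers):
    """Higher-order function: take the forward layers, return a function that
    runs the whole backward pass (a second, mirror-image network)."""
    def run_backward(x, grad_out):
        inputs = []
        for fwd, _ in layers:            # replay the forward pass,
            inputs.append(x)             # remembering each layer's input
            x = fwd(x)
        g = grad_out
        for (_, bwd), inp in zip(reversed(layers), reversed(inputs)):
            g = bwd(inp, g)              # fold the gradient back through the layers
        return g
    return run_backward

layers = [affine(np.random.randn(4, 3)), relu(), affine(np.random.randn(2, 4))]
grad_wrt_input = backward_network(layers)(np.random.randn(3), np.ones(2))
```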

Here's a diagram for a simple forward MLP (left to right), with the backward pass network below (right to left). I found this hard to work with, because of the explosion in the size of the computational graph when you try to decouple optimization / function. I notice something similar when trying to unroll an RNN visually across time. [http://imgur.com/ATNwknh](http://imgur.com/ATNwknh)

Let me know if this is way off base from what you were talking about in your
post.

------
kailuowang
Interesting thoughts there.

I am writing a deep neural network lib in Scala as part of my DQN implementation. It's developed in the FP paradigm with some type safety. If anyone is interested, the code is here: [https://github.com/A-Noctua/glaux](https://github.com/A-Noctua/glaux). Right now only feedforward and convolution layers with ReLU are implemented, and there's no documentation yet, but I would love to collaborate with anyone else interested in the area.

------
frigaardj
It seems like the author has realised that there are a few common functional
programming patterns (folds, maps) that are also common ways of combining
information and operating on data structures, and seen the parallel to some
operations that we frequently want to do within neural networks. A 'function'
is simply a thing that takes another thing and produces a third thing. This
doesn't seem that revolutionary or insightful - do these ideas give us any
extra knowledge about neural networks or is this just a nice parallel?
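
To make the parallel concrete: a feedforward net really is just a fold of the input through a list of layer functions (a toy sketch; the layer functions are invented for illustration):

```python
from functools import reduce
import numpy as np

def layer(W):
    return lambda x: np.tanh(W @ x)        # one "function": takes a thing, produces a thing

layers = [layer(np.random.randn(8, 4)),
          layer(np.random.randn(8, 8)),
          layer(np.random.randn(2, 8))]

# The network is a left fold of the input through the list of layers.
network = lambda x: reduce(lambda h, f: f(h), layers, x)
print(network(np.random.randn(4)))
```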

~~~
jeremysalwen
I think that the theory is pretty much already clear to people who have a lot of experience with neural networks. I think the contribution is to explicitly write it down in a clear and concise way for people to whom it wouldn't otherwise have occurred.

~~~
frigaardj
Fair enough.

------
tel
It feels a lot like one might get better bang for their buck using Category Theory here than raw type theory or functional programming.

~~~
colah3
Very possible. It just feels like something from that general area, and I
wanted to point it out.

I played around with fitting this into the Curry-Howard correspondence. For a moment, let's set aside neural networks and just talk about the probability distributions they operate on.

In Curry-Howard, one interprets values of a type as proofs of a theorem. Perhaps we could similarly interpret samples from a distribution as verifying the distribution. This might lead to a kind of fuzzy logic version of Curry-Howard.

I'm not sure if this actually works or can be formalized -- I've only put a
tiny bit of thought into it.

~~~
tel
I've been idly chasing that idea for some time. There's some great literature
out there on categories of distributions and (conditional) random variables as
arrows between them. That would be a good place to start looking more deeply!

(God I wish I had the time to look into this myself!)

------
rybern
The idea of deep learning as "differentiable functional programming" resonates
with me. It makes me wonder if one could build an elegant monadic interface
for deriving networks, where you could bind on the results of previous layers
and use traditional functions to build new ones. That would be powerful.
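
Something like this, maybe (a very rough sketch of the idea; it's closer to a builder than a real monad, and every name here is invented):

```python
class Node:
    """A symbolic layer output; `bind` feeds it into the next layer."""
    def __init__(self, op, parents=()):
        self.op, self.parents = op, parents

    def bind(self, make_layer):
        return make_layer(self)            # make_layer: Node -> Node

def dense(units):
    return lambda prev: Node(("dense", units), (prev,))

def relu():
    return lambda prev: Node(("relu",), (prev,))

# Derive a network with ordinary functions and host-language control flow.
x = Node(("input", 784))
h = x.bind(dense(128)).bind(relu())
for _ in range(2):                          # plain Python decides the depth
    h = h.bind(dense(64)).bind(relu())
y = h.bind(dense(10))
```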

------
rshaban
This is easily the clearest overview of deep learning I've read. Kudos to Olah for his consistent quality.

------
cs702
Regardless of whether this post is right or wrong, it is a showcase for the kind of out-of-the-box thinking that can force people both inside and outside the field to see deep neural nets from a different angle. For that alone, I think this is a fantastic essay.

Some comments:

The author writes: "using multiple copies of a neuron in different places is
the neural network equivalent of using functions. Because there is less to
learn, the model learns more quickly and learns a better model. This technique
– the technical name for it is 'weight tying' – is essential to the phenomenal
results we’ve recently seen from deep learning. Of course, one can't just
arbitrarily put copies of neurons all over the place. For the model to work,
you need to do it in a principled way, exploiting some structure in your data.
In practice, there are a handful of patterns that are widely used, such as
recurrent layers and convolutional layers."

Recurrent and convolutional layers are not just two examples of widely used
"weight tying" patterns; they are THE TWO MAJOR WAYS in which virtually
everyone "ties" weights -- across time or pixel space, usually. With the
exception of tiny fully-connected models used to solve relatively small
problems, every successful deep neural net model of meaningful scale I know of
is either a convnet or an RNN!

If anyone knows of a counterexample, I'd love to hear about it!

I would also have pointed out that "weight tying" not only allows the deep
neural net to learn more quickly; it also _massively_ reduces the "true"
number of parameters required to specify the model, thereby reducing by a
similar proportion the number of samples necessary to have reasonable PAC
learning guarantees.
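
To make the parameter-count point concrete, here is a toy comparison (shapes and numbers invented for illustration): mapping a length-100 signal to a length-100 output with untied weights versus with a tied 1-D convolution.

```python
import numpy as np

n, k = 100, 5
x = np.random.randn(n)

# Untied: every output position has its own weights -> n*n = 10,000 parameters.
W_dense = np.random.randn(n, n)
y_dense = W_dense @ x

# Tied ("weight tying"): one kernel of k = 5 weights reused at every position
# -> 5 parameters, no matter how long the signal is.
w_conv = np.random.randn(k)
y_conv = np.convolve(x, w_conv, mode="same")

print(W_dense.size, w_conv.size)            # 10000 vs 5
```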

~~~
colah3
TreeNets are also pretty popular. Word embeddings are a form of weight tying too, and can't always be described as part of an RNN. If you want, you can think of most attentional mechanisms as involving a form of weight tying, in addition to the RNN tying.

There are also more obscure things, like deep symmetry networks (which are really a close relative of conv nets, with convolution replaced by group convolution).

~~~
cs702
Thank you. It seems to me that every time a new state-of-the-art result in AI
is announced, a deep composition of convolutional or recurrent "neural
functional programs" is involved.

I don't see _deep_ compositions of treenets or word embedding layers, which instead tend to be used stand-alone as simpler models or as preprocessing layers for deep networks. I'd have to think about attentional models.

This is not a criticism. Rather, it's my way of suggesting that we need more experimentation with more interesting compositions using a broader range of "neural functional programs" -- which I believe is also one of your points.

And again, I think your essay is fantastic.

--

Edits: changed my wording to express what I actually meant to write.

------
mjw
It's nice to get people thinking about possible connections and sharing
terminology here, but I'm not sure that many of the connections which the
article manages to make precise are particularly new or deep.

A neural network is just a mathematical function[1] of inputs and parameters, and of course you can build such functions up using higher-order functions and recursion. Libraries like Theano already let you build up a function graph in this way -- see theano.scan[2], for example, which is something like a fancy fixed-point combinator.
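
Roughly, scan gives you something like the following (a plain-Python sketch of the idea, not Theano's actual API): a higher-order function that threads a state through a sequence, which is exactly how an RNN layer gets built.

```python
import numpy as np

def scan(step, sequence, init_state):
    state, outputs = init_state, []
    for x in sequence:                       # thread the state through the sequence
        state = step(x, state)
        outputs.append(state)
    return np.array(outputs), state

# An RNN layer is just `scan` applied to a single time-step function.
W_x, W_h = np.random.randn(8, 4), np.random.randn(8, 8)
rnn_step = lambda x, h: np.tanh(W_x @ x + W_h @ h)
hidden_states, final_h = scan(rnn_step, np.random.randn(10, 4), np.zeros(8))
```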

The idea about types corresponding to representations seems like it would be
hard to make precise, because at a type level everything's pretty much just
R^n or perhaps some simple submanifold of R^n. "Has representation X" isn't a
crisp logical concept, and in most type theories I know of types are crisp
logical properties.

Even if you can accommodate fuzzy/statistical types somehow, I would tend to think of a representation as being the function that does the representing of a particular domain, rather than the range or the distribution of the outputs of that function. Two very different representations might have a similar distribution of outputs but not represent compatible concepts at all.

Still, there is a neat observation here: by combining two different representations before they reach the cost function you're optimising (e.g. by adding their outputs together and using that as input to another layer), you can force them in some sense to conform to the same representation, or at least to be compatible with each other. You could probably formalise this in a way that lets you talk about equivalence classes of representation-compatible intermediate values in a network as types.

It wouldn't really typecheck stuff for you in a useful way -- if you
accidentally add together two things of different types, by definition they
become the same type! But I guess it would do type inference. For most
networks I'm not sure if this would tell you anything you didn't already know
though.
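
For what it's worth, that kind of "type inference" would amount to little more than union-find over intermediate values (a toy sketch; all the names are invented): every value starts in its own representation class, and summing two values merges their classes.

```python
class ReprClasses:
    """Union-find over intermediate values in the network."""
    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:           # walk up to the class representative
            v = self.parent[v]
        return v

    def combine(self, a, b):
        # adding a and b together forces them into the same representation class
        self.parent[self.find(a)] = self.find(b)

types = ReprClasses()
types.combine("conv3_out", "skip_branch")        # summed somewhere in the net
types.combine("skip_branch", "decoder_input")
print(types.find("conv3_out") == types.find("decoder_input"))   # True: same "type"
```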

If you want to find a useful connection between deep learning and typed FP,
you could start by thinking about what problem you'd want your type system to
solve in the context of deep learning. What could it usefully prove, check or
infer about your network architecture?

[1] I was going to say "a differentiable function", but actually they're often
not these days what with ReLUs, max-pooling etc. The non-differentiable points
are swept under the rug in that I don't see anyone bothering with subgradient-
based optimisation methods.

[2]
[http://deeplearning.net/software/theano/library/scan.html](http://deeplearning.net/software/theano/library/scan.html)

------
Xcelerate
Over the past few months, I've been learning group theory, representation
theory, and a little bit about reproducing kernel Hilbert spaces, and I think
these concepts also tie in with the author's vision for the future of neural
networks.

We are often interested in studying a set of objects in a way that is
invariant to some type of transformation. Essentially, we want to retain all
of the information describing an object except for that which is necessary to
distinguish the object from all transformed versions of itself.

As a motivating example, let's say we want to compare a collection of
photographs. One difficulty is that each photograph may have an
arbitrary/unknown rotation applied to it. How, then, can we create a
representation of each photo that does not depend on its rotation? The trivial
solution is to map each photo to the element of a singleton. This solves the
rotation problem but unfortunately removes all rotationally invariant
information from the photos as well. So our goal is to find the boundary that
splits the information describing each photo into "invariant" and "non-
invariant" partitions.

One way to do this is to train a neural network on arbitrarily rotated
versions of many photos. The network will then evolve to extract the
unnecessary information from each photo (compressing them in a sense). The
downside of using a neural network is that the training process can take a
heck of a long time to figure out the invariants, and it isn't even guaranteed
to work that well. And that's just for the preprocessing step; the rest of the
network will take forever to train.

So what can we do? Another idea is to use the Gram matrix (a.k.a. the kernel matrix) corresponding to a set of n points in R^2. The Gram matrix preserves all non-rotational information about the set of points; however, it is also _overcomplete_, meaning that we started with 2n real numbers, and now we have n^2 real numbers (or n(n+1)/2 if we keep just the upper triangular portion of the symmetric Gram matrix, diagonal included). This is less than ideal.
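
(To see why the Gram matrix is rotation-invariant: write the points as the rows of an n-by-2 matrix X and let Q be any rotation; then (X Q^T)(X Q^T)^T = X Q^T Q X^T = X X^T, so rotating every point leaves the Gram matrix X X^T unchanged.)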

Representation theory furnishes a better solution to the problem by providing
a means of identifying the fundamental degrees of freedom that characterize a
set of transformations (i.e., a _group_ ). A _representation_ ρ of a group G
is a homomorphism from G to the general linear group on a vector space V. A
representation is considered _irreducible_ if there exists no nontrivial
proper subspace W ⊂ V for which ρ(g)(w) ∈ W for all w ∈ W, g ∈ G. So assuming
you can find an irreducible representation, the remaining task is to figure
out how to label (index) the mappings. A familiar example of a set of labeled
mappings of an irreducible representation is the set of spherical harmonics:
for each fixed degree, these functions constitute an irreducible representation of the group SO(3) (rotations in 3D space). The functions are indexed by integers l ≥ 0, |m| ≤ l.
Every equivalence class {f(x, y, z) ~ (Qf)(x, y, z) | x, y, z ∈ R, Q ∈ SO(3)}
maps to a unique linear combination of the spherical harmonics (augmented with
a set of radial functions). Thus, the coefficients of these basis functions
contain only the rotationally invariant information describing a function over
R^3, and nothing more (these are what you want to feed into your neural
network!).

In a similar vein, the symmetric group Sn (the group of all permutations of n objects) has irreducible representations that are indexed by Young diagrams (the unique geometric shapes corresponding to integer partitions). So you can isolate the permutationally invariant information that specifies a function as well.

~~~
colah3
I think questions about representations and symmetry are really interesting.

Have you looked at deep symmetry networks? (I think the right way to frame the idea is in terms of group convolutions, which I outlined here: [http://colah.github.io/posts/2014-12-Groups-Convolution/](http://colah.github.io/posts/2014-12-Groups-Convolution/))

