
Operational calculus on programming spaces and generalized tensor networks - hcarvalhoalves
https://arxiv.org/abs/1610.07690
======
hcarvalhoalves
OP here. This is a dense paper since it involves many concepts (abstract
algebras, tensors, automata) and I don't fully grok it (not even close).

The main points seem to be:

\- Realization of the duality between programs and maps over a vector space
(simplification: assumes the real field).

\- Define differentiation for those programs.

\- Expand over current automatic differentiation approaches by having a well
defined algebra, as opposed to a pure algorithmic translation of the chain
rule by forward/backward propagation.

\- Since the differentiation operator is well defined, get higher-order
derivatives "for free".

\- Also, get composition of differentiable programs "for free".

This formalization seems to have a lot of potential IMO by allowing
generalization of techniques researchers currently have to "hand craft" in
neural networks (control structures, recursion, composition).

~~~
kxyvr
I skimmed their paper and don't claim to understand everything, but I am
curious about some of your comments.

As far as higher-order derivatives, there's not really been an issue with
obtaining these derivatives with traditional automatic differentiation
techniques. The hard part is how to compute them efficiently. For example,
Hessian-vector products can be obtained via forward-over-reverse or
reverse-over-forward methods, but one runs significantly faster than the
other. What's not clear to me in this paper is the practical efficiency of
their technique, especially for higher-order derivatives.
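
To make the forward-over-reverse idea concrete, here's a minimal sketch (my
own, not the paper's method); the reverse-mode gradient is hand-written for
brevity where a real AD tool would generate it:

      #include <array>
      #include <iostream>
      
      // Dual number: value and a directional derivative carried forward.
      struct Dual { double v, d; };
      Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
      
      // Hand-derived reverse-mode gradient of f(x) = x0*x0*x1, evaluated on
      // duals so the forward pass carries a direction through the gradient.
      std::array<Dual, 2> grad(std::array<Dual, 2> x) {
          return {Dual{2., 0.} * x[0] * x[1], x[0] * x[0]};
      }
      
      int main() {
          // Seed the direction v = (1, 0); the .d fields of the gradient are
          // then the Hessian-vector product H*v: one pass, no full Hessian.
          std::array<Dual, 2> x = {Dual{3., 1.}, Dual{4., 0.}};
          auto g = grad(x);
          std::cout << g[0].d << " " << g[1].d << std::endl;  // 8 6
      }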

Second, as far as neural networks, it's also not clear to me what's currently
missing from using existing AD tools and what this new paper provides. Right
now, there exist good AD tools that work just fine with control structures,
recursion, and composition. In my opinion, a much better way to implement the
training algorithms for machine learning is to just run an AD package on a
particular model and then feed the resulting derivatives into an appropriate
optimization algorithm. It's accurate, clean, and saves time and effort. As
such, what do we really gain from using this new methodology?

That said, I appreciate that the authors put together some interesting
abstractions that I'd not seen before in this context. I think it's worth a
read even if just for that.

~~~
hcarvalhoalves
> What's not clear to me in this paper is the practical efficiency of their
> technique especially for high-order derivatives.

On section "5.3.1 Order reduction for nested applications":

> We gained the ability of writing a differentiable program acting on
> derivatives of another program, stressed as crucial (but lacking in most
> models) by other authors [18].

See
[http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1370&...](http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1370&context=ecetr)
for [18].

\---

> Right now, there exist good AD tools that work just fine with control
> structures, recursion, and composition.

See their example implementation
([https://github.com/ZigaSajovic/dCpp](https://github.com/ZigaSajovic/dCpp))
to understand what "control structures, recursion and composition" mean in
this context.

~~~
kxyvr
Thanks for the reply. Pearlmutter and Siskind's paper doesn't really talk about
the efficiency of higher-order derivatives either and only has a very simple
run time for a gradient calculation.

I had a look at the code and I can't figure out how their abstractions would
allow for efficient higher-order derivatives. Even on the front page, they
list the examples

    
    
      //  df/dxdx
      std::cout << "df/dxdx = " << f.d(&x).d(&x).id << std::endl;
      //  df/dxdy
      std::cout << "df/dxdy = " << f.d(&x).d(&y).id << std::endl;
    

Generally, I think those abstractions make it hard to compute things like
Hessian-vector products because we really need a reverse pass to generate
either a program or computational tree that we can fly through with a forward
pass. That involves giving an explicit instruction to run a reverse pass; we
don't do it automatically because it's expensive. Also, we generally give an
entire vector to the forward pass at one time since we can overlap a lot of
computation. Finding second-order derivatives one at a time, like `df/dxdy`,
for an entire Hessian tends to be slow. It looks like most of the action is in
var.cpp, but I can't see how this would run fast. Do you have any idea how
this scales up for derivative calculations (gradient and Hess-vec) as we get
at least into the thousands of variables, in terms of derivative time vs.
evaluation time with no AD abstractions?

Certainly, I may be misunderstanding something. That said, if the efficiency
doesn't scale, it seems like the practical significance will be limited.

------
ilaksh
Weird, I was just looking at this last night. They implemented it:
[https://github.com/ZigaSajovic/dCpp](https://github.com/ZigaSajovic/dCpp)

I got onto that based on this fantasy I was having. Basically I keep thinking
that the main advantage deep learning has is probably not actually the neural
nets, but the fact that they are highly parallel and very general computation
techniques.

So if it doesn't really have to be a NN, maybe there could be a way to more
efficiently achieve the same level of results by using something more like
evolved parallel programs. Maybe have a universal virtual sensor grounding.
Then if you find a useful/general procedure/dataset you could distribute it to
improve lots of diverse agents' knowledge/skills immediately.

But to evolve these supposed universal function-approximating parallel
programs that are more efficient and exchangeable than deep NNs, you would need
something like backpropagation, except for some arbitrary program structures.
So when reading about backpropagation on Wikipedia, I saw that the
generalization is
[https://en.m.wikipedia.org/wiki/Automatic_differentiation](https://en.m.wikipedia.org/wiki/Automatic_differentiation)

Anyway, I don't grok all the details fully, but my interpretation of all of
this is that I think things like backpropagation with deep NNs are truly
general purpose and, when combined with grounding such as streams of
virtual/real senses and training across widely varied tasks, can actually
produce general human- and animal-like intelligence. I think the problem is it
just takes a ton of computing power, iterations and time. If we wait 2-10
years that probably won't matter. But if we don't want to wait we can probably
do something similar and more efficient/readily distributable if we can use
algorithms that aren't NNs but are general and highly parallel and can be
adjusted via AD to learn.

Based on this, I personally believe we will see human/animal-like general
intelligence systems before 2020.

~~~
argonaut
You have to be a little careful here. Neural networks are not "very general
computation techniques." A dot product and a rectified linear function (or
some other function of choice) are not "general computation techniques" in the
sense you seem to use. They are a very specific set of operations. And the
fact that you can show that two layers of these operations form a universal
approximator is a red herring: decision trees and k-nearest neighbors are also
universal approximators.

~~~
ilaksh
Sure, but I thought it was clear that I wasn't speaking in a precise or
mathematical way. I was referring to what seem to be amazingly general
abilities of certain successful deep learning-ish systems, and potentially
also the amazingly general capabilities of some imagined system a little like
those, in terms of AD versus backpropagation.

Do you really think these abilities are not or cannot be as general as I
imagine? I just see these systems, like the DeepMind projects, reading pixels
in 2D and 3D, and having this low-level sensory grounding. It seems likely to
me that we will see very exciting and general further results, and then down
the line similar things on perhaps much smaller computers with the non-NN
efficiencies. Do you think this is impossible or unlikely?

~~~
argonaut
If you cannot describe something in a way that would allow for a precise
formulation, then I don't know what you're trying to say. This is the
difference between something that is implementable / testable, and something
that's just a fantasy.

As far as I can tell, your definition of "general computing" is: "it works, so
it must be general." While this is ironically indeed the attitude of a lot of
people in deep learning, this is just faulty logic.

------
m_hack
I spent the last few hours reading this paper. From what I understand, the
paper is not about AD, but something more. AD is "efficient means to
calculating derivatives", as they put it. This paper seems to be about program
analysis through calculus (AD comes into play, as derivatives are needed).

They show how to form equations with programs and calculate with them, kind of
like with "normal" math. This seems to be the reason for constructing the
algebra. They derive some transformation of programs similar to Fourier's that
I do not fully understand yet (they also use this algebra and calculus to
generalize methods like Deep Dream).

They show how to use this algebra to initialize "tensor networks" from some
other sub-optimal program and then train them to obtain better results than
the original program. The transformation is called the "neural tensor series";
it seems to be like an enhanced Taylor series for programs, but again, I do
not fully understand it yet. What I do understand is that through this
transformation you can somehow relate a program to a tensor network, and use
this to study and analyze programs, and vice versa.

This algebra strikes me as interesting and important. They mention parallels
to how physics is being studied. In physics, you often express a system as a
solution to a system of equations, because you do not know the system in
advance. In this frame, a program is like a system. Perhaps this theory allows
a similar approach to programming.

There is another paper listed on github
[https://arxiv.org/abs/1612.02731](https://arxiv.org/abs/1612.02731) . It
seems as though the code on github and this paper are additional material to
the paper in the OP, not a full implementation of the theory, but a bridge for
those familiar with AD to better understand the theory in the paper. I could
be wrong though, it is getting late :)

------
abecedarius
Take the factorial function defined in one of the usual ways (recursive or
iterative). How do you take the derivative of it, using this paper? Do you
magically get the gamma function this way, too?

I just skimmed the paper without getting close to answering these questions --
anyone who does already understand, maybe explaining this concrete example
could help the rest of us?

Thinking about the problem in my naive way earlier this year, I noticed that
the usual if-then-else expression is a special case of linear interpolation,
at least when you're dealing with numbers. ("lerp by t between a and b"
generalizes "if t then b else a", taking false=0 and true=1.) But it's not so
obvious how to handle recursion and nonnumeric data: I guess there's a hint in
that the usual theory of programs has Scott continuity
([https://en.wikipedia.org/wiki/Continuous_function](https://en.wikipedia.org/wiki/Continuous_function))
while differentiable functions are continuous in the reals, but I'm
mathematically undereducated.
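
In code, the lerp observation is just (a sketch of the idea above):

      // With t restricted to {0, 1} this reproduces "if t then b else a";
      // intermediate t makes the branch differentiable in t.
      double soft_if(double t, double a, double b) {
          return a + t * (b - a);  // lerp: t=0 gives a, t=1 gives b
      }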

~~~
ilaksh
See my comment with link to their code.

~~~
abecedarius
That's a helpful link, thanks. At first glance it looks like, in all their
examples, the control flow is oblivious to the values in the vectors. I
suppose I'll have to try it to see if/how it can differentiate factorial(x)
with respect to x. ("You can't do that" would be a perfectly reasonable
answer, but it's unclear to me if that's what it is.)

~~~
Smaug123
It's certainly not obvious how even to define factorial in this model: in
section 2 of the paper, they restrict to real-valued variables.

------
alistproducer2
I could be [very] wrong, but my read of the practical use of this is that it
creates a really complicated, math-y AST that allows for adapting a program to
different forms. The paper notes several times how a program could be adapted
for maximum efficiency on hardware different from what it was originally
written for. I'm certain there's a bunch more to it, but that was all I was
able to grasp.

~~~
kxyvr
Actually, what's funny is that this is probably the most succinct explanation
of an operator overloading approach to AD that I've heard. But, from what I
can tell as well, yes. Most of the action is in var.cpp:

[https://github.com/ZigaSajovic/dCpp/blob/master/src/var.cpp](https://github.com/ZigaSajovic/dCpp/blob/master/src/var.cpp)

It looks like the AST is held by a collection of maps that contain pointers
and other variables.

Outside of this paper, most of my time writing AD codes has been spent
literally just trying to do things on ASTs really, really fast, which is why
your comment is really apt.

~~~
alistproducer2
As a wannabe teacher, I like to think I'm pretty good at condensing
complicated information. It's nice to know I got it :)

------
theideasmith
This looks really interesting but I just don't understand it. Can someone with
more expertise explain in layman's terms what this paper is about?

~~~
imh
It's a bit beyond me too, but here's my take from skimming: It's already known
that you can do differential calculus in some pretty abstract spaces (not just
functions from R^n to R^m). Automatic differentiation is a way of
automatically keeping differentiability and computing derivatives when
sticking other differentiable stuff together. This puts those ideas together,
so you can automatically differentiate some more interesting things stuck
together in some interesting ways related to programming. That was what I got
from my rusty semi-layperson's skim, at least.

~~~
theideasmith
so what does it actually mean to differentiate a program?

~~~
niels_olson
Karpathy's explanation in CS231n is great: for any computational unit that is
a function, you can calculate a derivative. To propagate the derivative
backward through the function, just apply the chain rule. This is much easier
with simpler functions, so break your algorithm down to the smallest
computable units in the graph.
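
A tiny instance of this (my sketch, not from CS231n): for f(x, y) = (x + y) *
x, split into the units q = x + y and f = q * x, then push derivatives
backward with the chain rule:

      #include <iostream>
      
      int main() {
          double x = 3., y = 2.;
          double q = x + y;             // forward: first unit
          double f = q * x;             // forward: second unit
          // backward: local derivatives times the upstream gradient (df/df = 1)
          double dfdq = x;              // d(q*x)/dq
          double dfdx = dfdq * 1. + q;  // chain through q, plus direct d(q*x)/dx
          double dfdy = dfdq * 1.;      // chain through q only
          std::cout << f << " " << dfdx << " " << dfdy << std::endl;  // 15 8 3
      }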

~~~
theideasmith
Does this only work for mathematical programs? I don't understand how this
would work for more general computations, involving loops, writes to files,
etc.

~~~
kxyvr
This paper isn't really a great introduction to the concept of AD. That
said, one way to visualize it is to see how a basic AD implementation works.
There are generally two different varieties, operator overloading and source
transformation. Let's focus on the first.

Say we have a function:

    
    
      double myfun(std::vector <double> x) {
          auto y = double(0.);
          for(int i=0;i<=5;i++)
               y += x[0]*x[1];
          return y;
      }
    

We may want to find the gradient of myfun, or simply the derivatives of myfun
with respect to x[0] and x[1]. What we're going to do is replace double with a
special AD type, which requires us to code things like:

    
    
      template <typename Real>
      Real myfun(std::vector <Real> x) {
          auto y = Real(0.);
          for(int i=0;i<=5;i++)
               y += x[0]*x[1];
          return y;
      }
    

Now, instead of using a double, we'll set Real to be our AD type. In C++,
this'll be something like a class where all of the AD action occurs. What's in
this class depends on the particular AD algorithm, which depends on what
derivatives we want and what kinds of recomputations of the derivative we want
to make. In general, though, it consists of some kind of tree. At its most
basic level, the nodes of this tree consist of the types variable, constant,
unary function, or binary function and they contain information that relates
to the chain rule. When we want to compute a derivative, we traverse the tree
in a very particular order to compute and accumulate the derivative
information. How this is done can vary as well, which has implications for the
efficiency of the algorithm. Most implementations use a tape that predefines
the order of the tree traversal. It's also possible to do a topological sort
to solve for a tree traversal. Anyway, traversing a tree can be expensive for
a really, really large computational tree, so the goal is to only touch every
necessary node once.
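
For illustration, the simplest such AD type is a forward-mode one that carries
a single directional derivative instead of building a tree (a sketch, not any
real library's class; the tape-based reverse mode described above is more
involved):

      // Substituting this for Real in myfun above propagates one directional
      // derivative alongside the value.
      struct Fwd {
          double val, dot;  // value and derivative w.r.t. a chosen seed
          Fwd(double v = 0., double d = 0.) : val(v), dot(d) {}
      };
      Fwd operator+(Fwd a, Fwd b) { return Fwd(a.val + b.val, a.dot + b.dot); }
      Fwd operator*(Fwd a, Fwd b) { return Fwd(a.val * b.val, a.dot * b.val + a.val * b.dot); }
      Fwd& operator+=(Fwd& a, Fwd b) { a = a + b; return a; }

Seeding x[0] with derivative 1, e.g. calling myfun on {Fwd(2., 1.), Fwd(3.)},
returns .val == 36 and .dot == 18: six loop iterations, each contributing
x[1] = 3 to the derivative with respect to x[0].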

Returning to your original question, finding the derivative of a program
involves building up this tree of computation. Control structures don't really
matter in this context because we'll just keep adding to the tree regardless
of loops or conditionals. This, by the way, can be really, really expensive
since the computational tree can be huge for a long running loop. There's some
game in determining how and when to collapse the tree in certain ways. Source
code transformation tools can leave the original (really, modified) control
structures in place, which alleviates a lot of this expense and is partially
why they're faster than operator overloading.

~~~
theideasmith
Thanks for your explanation. So what this really does is enable one to train
code using machine learning techniques such as gradient descent, traversing a
code-space instead of traversing a space of numbers?

~~~
kxyvr
Well, from my perspective, there are three different components to machine
learning techniques like multilayer perceptrons. Basically,

1\. What's the model?

2\. How do we define the error between the model and the data?

3\. What algorithm do we use to fit the model to the data?

In the case of a multilayer perceptron with a single hidden layer, the model
is

m(alpha,W,b,x) = alpha'*sigmoidal(Wx + b)

where

alpha : nhidden x 1

W : nhidden x ninput

b : nhidden

' denotes transpose and sigmoidal is some kind of sigmoidal function like a
logistic function or arc tangent. As far as defining the error, we tend to use
the sum of squared errors since it's differentiable and easy:

J(alpha,W,b) = sum_i (m(alpha,W,b,x[i]) - y[i])^2

Finally, we need to apply an optimization algorithm to the problem

min_{alpha,W,b} J(alpha,W,b)

Most of the time, people use a nonglobalized steepest descent. Honestly, this
is terrible. A matrix-free inexact-Newton method globalized with either a
line-search or trust-region method will work better.

Anyway, all good optimization algorithms require the derivative of J above.
Certainly, we can grind through these derivatives by hand if we're
masochistic. Well, for one layer it's not all that bad. However, as we start
to nest things, it becomes a terrible pain to derive. Rather than do this, we
can just use automatic differentiation to find the gradient and Hessian-vector
products of J. In fact, most AD codes already do this. I can never find a
paper that runs through the details, but backpropagation is basically just a
reverse-mode AD algorithm applied to J in order to find the gradient.
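
Concretely, the model and misfit above as a program (a sketch with
illustrative loops and names; substituting an AD type for double, as in the
myfun example earlier in the thread, is what yields the derivatives):

      #include <cmath>
      #include <vector>
      
      // m(alpha, W, b, x) = alpha' * sigmoidal(W*x + b), logistic sigmoidal.
      double model(const std::vector<double>& alpha,
                   const std::vector<std::vector<double>>& W,
                   const std::vector<double>& b,
                   const std::vector<double>& x) {
          double out = 0.;
          for (std::size_t h = 0; h < alpha.size(); ++h) {  // hidden units
              double z = b[h];
              for (std::size_t j = 0; j < x.size(); ++j)
                  z += W[h][j] * x[j];
              out += alpha[h] / (1. + std::exp(-z));        // logistic function
          }
          return out;
      }
      
      // J(alpha, W, b) = sum_i (m(alpha, W, b, x[i]) - y[i])^2
      double J(const std::vector<double>& alpha,
               const std::vector<std::vector<double>>& W,
               const std::vector<double>& b,
               const std::vector<std::vector<double>>& xs,
               const std::vector<double>& ys) {
          double sum = 0.;
          for (std::size_t i = 0; i < xs.size(); ++i) {
              double r = model(alpha, W, b, xs[i]) - ys[i];
              sum += r * r;
          }
          return sum;
      }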

~~~
nn3
This may be a dumb question:

What is the data that the machine learning algorithm would optimize with,
given an arbitrary program that is differentiated?

Is it just a faster program or something else?

For any changes you would need some kind of perfect, 100%-coverage regression
test that proves the optimized program is still correct and handles all cases,
because the differentiation only recorded one possible path through the
program.

~~~
kxyvr
Alright, so machine learning is a pretty broad term that encompasses a lot of
things. Very specifically above, I was talking about things I call curve
fitters, which include multilayer perceptrons. There are also classifiers,
which people often create by using a curve fitter and then thresholding it.
There are other techniques as well that people call machine learning that
don't fit into these categories.

Anyway, for curve fitters, we literally need a set of points
{(x_i,y_i)}_{i=1}^m given to us and our goal is to map the n-dimensional
vectors x_i to the scalar output y_i. If we want a multidimensional output, we
just do this more than once. Now, theoretically, we can try and match any
function, or program, to this data. Let phi be this model, so that we want
phi(alpha,x)=y. Here, alpha are the parameters we want to tune for the
matching. Also note, these parameters can't be arbitrary. We really need them
to be some kind of floating point number. Now, how we do the matching can
vary, but sum of squared errors is easy, so we solve the optimization problem

min_{alpha} sum_i (phi(alpha,x_i)-y_i)^2

Basically, we want to find a parameter alpha so that the output from program
phi matches our data {(x_i,y_i)}_{i=1}^m as closely as possible in a least-
squares sense. Since all good optimization algorithms need a derivative, we
use automatic differentiation to find the gradient, and probably Hessian-
vector product, of

J(alpha) = sum_i (phi(alpha,x_i)-y_i)^2

Now, if phi is a program, then so is J. It's a program that takes alpha as an
input, runs a for-loop to sum up the squared errors, and then it returns the
result.

Alright, that was long winded, but the key is this. We're not optimizing a
program when we do this. Rather, the optimization occurs when we minimize the
difference between our inputs and our outputs, which were given to us ahead of
time. In this context, we have two programs. One, the program that we're
trying to fit to our data. Two, the program that determines the amount of
misfit between our inputs and our outputs. We need the derivative of the
composition of these two programs, which is why we use automatic
differentiation.
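
As a sketch of that last step, here's the fitting loop with the AD-supplied
gradient stubbed out (grad_J is hypothetical, standing in for whatever the AD
tool produces; as I said above, a real code would use a better optimizer than
steepest descent):

      #include <vector>
      
      // Hypothetical: the gradient of J at alpha, as an AD tool would supply it.
      std::vector<double> grad_J(const std::vector<double>& alpha);
      
      // Plain steepest descent on J(alpha); step size and iteration count are
      // illustrative only.
      std::vector<double> fit(std::vector<double> alpha, double step, int iters) {
          for (int k = 0; k < iters; ++k) {
              auto g = grad_J(alpha);
              for (std::size_t j = 0; j < alpha.size(); ++j)
                  alpha[j] -= step * g[j];
          }
          return alpha;
      }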

------
yequalsx
I'm curious about this claim:

A set of all the possible states of the program’s memory can be modeled by a
finite dimensional real vector space V ≡ R^n.

Isn't the set of possible states of a computer's memory countable? How can one
reasonably model this using a vector space over the reals? The paper claims
any field will do. Well, Z_p isn't a Banach space.

It seems to me if they want a notion of differentiability of a program that
one would work in a field of characteristic 2 and use the module of Kähler
differentials or use discrete differential forms. I certainly don't know
enough about computer science but using a real vector space seems like it
wouldn't be helpful.

~~~
cs702
yequalsx: yes, almost by definition, memory made up of bits should be modeled
with a field of characteristic 2. That seems the most "natural" way to do it.

However, the authors of this paper are thinking about "computers" in a broader
sense.

Take a look at these two papers:

* "Neural Turing Machines:" [https://arxiv.org/abs/1410.5401](https://arxiv.org/abs/1410.5401)

* "Hybrid computing using a neural network with dynamic external memory:" [http://www.nature.com/nature/journal/v538/n7626/full/nature2...](http://www.nature.com/nature/journal/v538/n7626/full/nature20101.html) and a related blog post at [https://deepmind.com/blog/differentiable-neural-computers](https://deepmind.com/blog/differentiable-neural-computers)

The memory space of these "neural" computers is formally modeled as R^n (n
being the total number of parameters and memory addresses), and implemented
_approximately_ with 32-bit floating point numbers in practice (i.e., a memory
space with (2^32)^n possible states). As you read the paper linked to by the OP,
think about these neural computers instead of the one atop your desk.

EDITS: removed sloppy math and lightly edited some sentences.

~~~
yequalsx
Thanks for the links and insight!

~~~
cs702
Glad to be helpful. And sorry for the sloppy math -- my mind is always on a
million other things at the same time! :-)

------
bthornbury
This looks extremely interesting, but I find many of the terms difficult to
map to a mental concept. Does anyone know good background material (books,
other papers, blogs)?

~~~
akssri
The language used can be found in introductory chapters on Differential
Geometry.

~~~
bthornbury
Going to check out the opencourseware
([https://ocw.mit.edu/courses/mathematics/18-950-differential-...](https://ocw.mit.edu/courses/mathematics/18-950-differential-geometry-fall-2008/))

Thanks for the pointer

------
cbsmith
I haven't read this yet. Can anyone describe how this compares to Darcs'
"calculus"?

~~~
mike_hearn
They're totally unrelated.

This paper is rather hard to read and I've put it into my pile of papers to
work through when it's not 1 AM, but basically this is related to machine
learning and neural networks. More generally it's about how to transform
arbitrary imperative programs with branching and control flow into a
"differentiable" form. It means, you can train the program by using algorithms
like gradient descent.

This article may help:

[https://en.wikipedia.org/wiki/Differentiable_neural_computer](https://en.wikipedia.org/wiki/Differentiable_neural_computer)

The awesome/scary thing about this line of research is that it's exploring the
question of how far computer programming can be automated. Simple machine
learning techniques can already be seen as "automatic computer programmers" in
a sense, but their capacity is very limited. DeepMind recently made a
breakthrough with their DNC architecture, described in that wiki page, which
introduces a notion of memory to neural networks, and right now NN research is
going at an incredible speed. The more stuff that becomes encoded in a
differentiable form, the more stuff can be "trained" based on examples using
standard machine learning techniques.

I plan to do some blogging on these topics soon.

~~~
narrator
Isn't there a limit to completely automated programming that has to do more or
less with the halting problem? Like if the program builds an infinite loop? I
guess it will have to time out, but that's going to slow the learning down.
Deep neural networks don't have conditional loops in them, so they don't have
to deal with the halting problem.

~~~
andrewflnr
I'm not sure exactly how it applies here, but in general you can limit the
"language" so it only includes things that are guaranteed to halt. You lose a
lot of expressive power, but hey, no halting problem. The obvious thing to do
is guarantee loops halt by forcing them to have a monotonic countdown, or ban
them entirely.
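
For example, a fuel-style sketch of the countdown idea (the explicit bound is
what provides the guarantee):

      // The loop variable only decreases, so termination is guaranteed by
      // construction; no halting analysis needed.
      double bounded_iterate(double x, int fuel) {
          while (fuel-- > 0)
              x = 0.5 * x + 1.;  // any step function; fuel enforces halting
          return x;
      }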

