Hacker News new | comments | ask | show | jobs | submit login
Linear logic and deep learning [pdf] (therisingsea.org)
170 points by adamnemecek 11 months ago | hide | past | web | favorite | 47 comments

Author here. The theoretical background can be found in:




As neel_k notes, a good way to understand this picture is in terms of differential linear logic (a refinement of simply-typed differential lambda calculus). I did not provide references in the talk as unfortunately I did not understand the subject at the time, but the introduction to the second paper above hopefully serves to remedy that omission. Suffice to day that Ehrhard and Regnier discovered something quite profound in differential lambda calculus, the implications of which are still being explored; this discovery should have significant bearing on what people refer to as differentiable programming, I think.

The third link above is the master’s thesis of my talented student James Clift. We have subsequently extended that work and are finishing a paper which explains the computational content of derivatives of Turing machines, according to differential linear logic. This might be of interest to people here. I will post a link when we put it online.

I find this stuff fascinating, at least in the application of differential calculus to things I thought non-continuous, differential grammars, differential regex, etc.

It seems like operating in the discrete domain was fine for small problems where problems could be brute forced, but if we want to get into an analytical regime for larger problems, differentiation is akin to recursion/induction in that it allows us to make tractable smaller problems. Is that roughly correct?

How would you recommend an undergrad bootstrap themselves on this subject? Are there patterns we can apply to transform discrete problems into differentiable problems?

I think it’s too early to say what differential lambda calculus or linear logic is “good for” in a practical sense. However, I don’t tend to think of it as being about breaking problems into smaller pieces.

I think it is more about error propagation: you can’t estimate which inputs to an algorithm contribute more to the final error, without some notion of derivative of an algorithm with respect to its inputs (even if those inputs are discrete, so that making infinitesimal variations does not obviously make sense).

Linear Logic is amazing-ly useful, http://homepages.inf.ed.ac.uk/wadler/topics/linear-logic.htm...

Rust [0] wouldn't exist without it.

Oddly enough, I think the next discontinuity in programming language design will be in making it easier to construct anytime [1], incremental [2] and succinct [3] programs. There is a tyranny of poor tolerancing that we currently have a very hard time from escaping.

Is it that differentiable programs can be made sloppy in a way that a non-differentiable one can't? Have you looked at the research of using DNN for physics simulations? [4]

[0] https://www.rust-lang.org/en-US/

[1] https://en.wikipedia.org/wiki/Anytime_algorithm

[2] https://en.wikipedia.org/wiki/Incremental_computing

[3] https://en.wikipedia.org/wiki/Succinct_data_structure

[4] https://www.youtube.com/watch?v=iOWamCtnwTc

What do you think of 'Programming with a Differentiable Forth Interpreter'[0] and 'A Neural Forth Abstract Machine'[1]? I don't know if they are related to your work, but both are on my to-read pile.

[0]: https://arxiv.org/abs/1605.06640 [1]: http://people.idsia.ch/~rupesh/rnnsymposium2016/files/bosnja...

I think these papers are fascinating! I would also recommend the Neural Turing Machine paper from DeepMind, which I know better than the papers you mention.

But keep in mind that this is a very young field, and there is no definitive point of view.

Fantastic stuff!

Does your work still keep the non-determinism of Ehrhard and Regnier?

This was the part that 'bothered' me, but it provided evidence of a link with process calculi.

I've also thought Ehrhard and Regnier's work was groundbreaking. It potentially opens up not just insights into differentiable programming, but a theory of concurrent computation and an algebraic theory of computation. My expectation is that it will reveal some deep links with other areas of mathematics, like group theory.

I think you are right, the meaning of the sums that appear in syntactic derivatives is quite subtle, and I don’t claim to have an authoritative answer as to their deep meaning.

Semantically, however, I think things are clearer. The denotation of the syntactic derivative of a proof is a limit, and the terms in the limit have an interpretation as probabilistic processes in a (close to) standard sense. So while the sum-over-linear-substitutions doesn’t have a clear probabilistic interpretation (at least that I can see) it is the limit of computational processes where you run the probability of making a substitution to zero.

I don’t know anything about process calculi, unfortunately!

P.S - Silly question, I can see it does. I responded before reading.

Sorry for the unrelated comment but I wanted to thank you for your nice lecture notes. I found them clear and very helpful during my studies.

Thanks! You’re welcome.

Oh, wow -- thanks especially for that second link! I was wondering about this recently.

I haven't had time to get into the details of this, but I'm very interested in differentiable programming that doesn't repeat the mistakes of imperative programming. I'd like to understand how the model in this work differs from, e.g. http://www.bcl.hamilton.ie/~barak/papers/sound-efficient-ad2...

At a super high level, the central problem is higher-order functions.

If you have a programming language, you can think of differentiation as being a higher-order function that takes a function a and returns a new function as a result. If your language has support for higher-order functions, then (a) you have to extend the definition of the derivative to cope with differentiation at higher type, and (b) ideally you'd like differentiation to be an operator in your programming language.

In general this is hard, so you have to cut the problem somewhere.

Pearlmutter and Siskind's basic idea is to use Reynolds defunctionalization transformation to turn higher-order programs into first-order programs, and then apply automatic differentiation to the resulting first-order program.

The approach in this talk is inspired by "differential lambda calculus", which is a setting where you retain full first-class functions, but restrict how you can use them by imposing certain typing constraints upon them (basically to ensure smoothness/differentiability of definable functions).

There is a mysterious connection between these ideas, since defunctionalization is closely related to the "geometry of interaction" semantics of linear logic, and differential lambda calculus is closely related to the vector space semantics of linear logic.

In addition to these semantic issues, there's also the question of how you hook it up to the actual algorithms people use to do gradient descent -- eg you need to use the Riesz representation theorem to justify representing the gradient as a matrix -- and that constrains how things could work even further. This is simultaneously annoying (because it makes things harder) and helpful (at higher type we have too much freedom in how to define the derivative, and so additional requirements help constrain the solution space).

So you might ask: why do we care so much about higher-order functions? The answer is that "higher-order function" is how semanticists pronounce "linking" and "separate compilation". It would be super great if we could take a bunch of teeny-tiny separately-developed models together, stick them together, and have everything work no problem.

Re GoI and defunctionalization, is "On the relation of interaction semantics and defunctionalization" by Schopp a good reference? I've just started reading the literature regarding applications of GoI to pl theory and would love to add this line of research to my queue.

Thanks for such a fantastic answer!

I should add that Gordon Plotkin gave a keynote at POPL this year, Some Principles of Differentiable Programming, where he discusses this from the angle of probabilistic programming. It's on Youtube at:


It's a really inspiring talk, albeit super intense (you need to be bilingual in denotational semantics and real analysis to follow everything).

I think Pearlmutter and Siskind‘s work is primarily about using ideas from lambda calculus to clean up AD (automatic differentiation) of numeric functions. Whereas differential lambda calculus is not about numeric functions per se: the aim is to define derivatives for arbitrary algorithms.

So, for example, I do not think that in the Pearlmutter-Siskind system that a function of type Int -> Int would have a meaningful derivative, whereas in the differential lambda calculus it does (this is surprising).

I do not recall the details of the Pearlmutter-Siskind system, but some approaches to programming with AD is based on adding formal infinitesimals. There are no infinitesimals explicitly in differential lambda calculus or linear logic, but in the semantics of the latter that we use, derivatives do arise from the coalgebra (k[x]/x^2)^* which is the manifestation of a first-order infinitesimal in algebraic geometry. So in some sense both subjects proceed by adding formal infinitesimals to an existing programming language, but in differential lambda calculus this is implicit.

Thanks for answering. This clears up some things for me.

I've made a first pass through the Pearlmutter-Siskind system (the Lambda the Ultimate Back-propagator paper) and the main idea, IIUC, is to implement backward-mode AD over reals in a compositional manner. So I generally agree with your characterisation of it. Forward-mode AD is usually presented in terms of infinitesimals (dual numbers) but backward-mode is not, but this is a minor point.

>> So I’d like to begin by summarising some of the recent history in the field of artificial intelligence, or machine learning as its now called.

To be more precise, the field is still known as AI, but people outside the field only know (and care) about machine learning, presumably because that's what Googe, Facebook, et al are recruiting for.

This is a bit of a sad situation, really. AI, in its drive to solve major hard problems (that still remain unsolved btw), has always contributed powerful programming techniques and even entire language paradigms to computer science (functional programming, object orientation that basically started with Minsky's frames, all sorts of search algos etc).

Machine learning is just one such technique which has gained popularity outside AI. It's powerful, for restricted problems. But if it somehow replaces AI in peoples' mind, the goose that lays the golden eggs might just die.

> Machine learning is just one such technique which has gained popularity outside AI.

I'd argue the machine learning label can be applied to any AI system that's data-driven in some sense (even self-generating the data using reinforcement learning).

Wikipedia lists the following approaches to Machine Learning. Surely you wouldn't call them all 'one technique'?: Decision tree learning, Association rule learning, Artificial neural networks, Deep learning, Inductive logic programming, Support vector machines, Clustering, Bayesian networks, Reinforcement learning, Representation learning, Similarity and metric learning, Sparse dictionary learning, Genetic algorithms.

AI used to cover Expert Systems, which have nothing we'd recognise as a "learning" phase at all. Hell, it used to cover parsers.

Let's say that it's a class of techniques that learn models (or progams, in the case of ILP) from data.

My main concern is not about defining what machine learning is, however. Rather, I'm worried about the definition of AI shrinking to "it's what we now call machine learning". Which is at the very least unhistorical.

I thought it was common for people to think of AI as largely disjoint from machine learning ... AI covering all the expert/rule based systems, grammars, and search algorithms with ML being pretty strictly referring to techniques for building predictive models from datasets ...

Well, machine learning arose partly from the need to find a better way to create rule-bases than hand-crafting them, painstakingly and at great cost. You can see this more clearly in the work of early pioneers like Ryszard Michalski, or in inductive inference (the field preoccupied with learning automata from examples). All these have always been well in the purview of classical AI.

Then of course there's the connectionist branch that explicitly set out to take inspiration from the human brain. I don't think that anyone ever saw neural networks as anything but AI for the last 50 years or so. In any case, when the connectionist were having arguments, they were having them with other AI folks, not with, say, statisticians, or mathematicians.

Then, if you think about the classic tasks that machine learning systems perform, they're all firmly classic AI tasks: classification, game-playing, machine vision, speech processing, etc etc.

I mean, the fact that models are "predictive" on its own doesn't tell you that machine learning is separate from AI. After all, an intelligent agent must be able to make predictions about its environment and the outcome of its actions on this environment.

Isn't it kind of reaching to say that the field of AI contributed the entire functional paradigm of computation? Can you say more about why you think that's the case?

To a trained mathematician, deep learning is so so far away from the cutting edge. If there's anything that's going to make a massive, revolutionary not evolutionary, change in the deep learning landscape, it's not going to come from engineers walking around on the surface of what's already there, it'll come from pure mathematicians connecting it to the insights ripe for the picking found deep deep down in the theoretical dirt.

Often the trick is not to make the new thing, but to make the old thing comprehensible so someone else can pick it up and run with it.

I agree. The real low hanging fruit is not new research, but to make the easy, basic first or second year graduate math understandable in the context of ML.

Or applying the old thing in a new way to a different audience, like Einstein using Lorentzian manifolds.

I guess that's why pure mathematicians have been at the forefront of all the major developments in neural networks and deep learning? This is the classic applied vs. theoretical debate. Let's not pretend like there's ever a clear winner.

I suspect you're being sarcastic. As a species, engineers are currently way more successful than pure mathematicians because of their higher "frequency dependent fitness" (to use terminology from biology), meaning that because of the huge number of engineers that do the small incremental improvements, looking at all of them as a single organism makes them look more well adapted to the innovation landscape than the (pure) mathematicians. But if you normalize for population size, if you were to pit an equal number of pure mathematicians against engineers, then I suspect the former would prove to be more innovative, at least on a medium to long term horizon, than the latter.

Listen, I respect and love pure mathematics as much as the next guy. However, where's the utility in establishing, with any more clarity, exactly which field should be understood as being responsible for the biggest innovations? This feels like a needless detour into a tired debate which has already been hashed over countless times.

I agree. I was mostly addressing your argument that the output of each field (engineering vs math) is a good marker of ability to innovate (which needn't be true because of their relative population size).

On a similar note, taking a look at a mathematician's coursebooks made me realize how puny our mathematical tools are. Most Computer Science topics only rely on (relatively) basic concepts from logic, algebra, calculus, and/or probability. The hardest course I ever attended was a quantum computing one, and even that does not require much, apart from a generalization of probability theory to complex numbers.

I bet there are lots of brilliant results that a smart mind could apply to our field.

I think the human brain is capable of only grasping so much abstraction. I'm sure every genius working in the deepest areas of advanced mathematics has heard of deep learning. If there was some amazing concept from their field directly applicable to deep learning we would have heard about it by now.

Differentiable learning is a new hot topic and this is research from an (applied) math phd from UCLA. This is valid research that may or may not advance the field - like any other research. What's the issue?

Nobody is discounting that someone could open up a completely different learning paradigm, and it would most likely come from a pure mathematician with computer skills. A few prominent researchers have said something to that effect already.

So according to you, why hasn't it happened yet? Not trying to flame, I'm genuinely curious.

I would think that the economic incentive to do exactly this would be pretty high.

What are some good results coming out from applied mathematics?

"Lively Linear Lisp--'Look Ma, No Garbage!'". ACM Sigplan Notices 27, 8 (August 1992),89-98. How to implement a 'linear' language efficiently through 'behind the scenes' hash consing.


From the abstract:

Linear logic has been proposed as one solution to the problem of garbage collection and providing efficient "update-in-place" capabilities within a more functional language. Linear logic conserves accessibility, and hence provides a mechanical metaphor which is more appropriate for a distributed-memory parallel processor in which copying is explicit. However, linear logic's lack of sharing may introduce significant inefficiencies of its own.

We show an efficient implementation of linear logic called Linear Lisp that runs within a constant factor of non-linear logic. This Linear Lisp allows RPLACX operations, and manages storage as safely as a non-linear Lisp, but does not need a garbage collector. Since it offers assignments but no sharing, it occupies a twilight zone between functional languages and imperative languages. Our Linear Lisp Machine offers many of the same capabilities as combinator/graph reduction machines, but without their copying and garbage collection problems.

Yes, we need a logic-based approach rather than a statistical one for NLP if you want to incorporate things like attention and memory. The full range of non-classical logics should be looked at, including Modal, Fuzzy etc.

We need to extend the logic-based approach to deal with reasoning under uncertainty, so many-valued logic is needed. Also, we need the logic to be able to model discrete, dynamical systems, so we need to look at Temporal Logic, Situation Calculus , Event Calculus etc etc. You can call all this 'Computational Logic'; it may actually be more general than probability theory.

See my wiki-book listing the central concepts of 'Computational Logic' here:


Functional programming is best for handling the logic-based approach, for example Haskell. Functional programming languages work at a higher-level than ordinary imperative and procedural programming, since functional programming deals with the manipulation of knowledge itself rather states of the computer.

I'd also look closely at the pure mathematical origins of computational logic, including classical mathematical logic and category theory. Type theory (a constructive form of category theory) looks like it forms the basis for language rules, thus making it ideal for NLP.

I conjecture that an extension of computational logic leads to a solution to machine psychology (inc. NLP), analogous to how machine learning can be viewed as an extension of probability and statistics.

Probability&Stats >>> Machine Learning

By analogy,

Computational Logic >>> Machine Psychology (inc. NLP)

I don't understand why the type of bint was chosen, and what impact, exactly, does it have on the usability of the network.

Or, asked differently: Given some task that you want to train your network for, how do you decide which type you should choose?

The type bint is an encoding of binary sequences, so it's simple enough as a working example but it's nontrivial enough that you could construe any sort of learning algorithm as operating on it.

The type you choose is part of the model you're training so your question is as difficult as "what kind of network topology should I use?".

Right. In practice one would want to use more complicated types (such as the type of Turing machines, explained in Clift’s thesis) but in the talk I did not have time to explain more interesting examples.

Is it possible to extend this to relevance logic?

I don’t think so, but I haven’t really thought about it.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact