As neel_k notes, a good way to understand this picture is in terms of differential linear logic (a refinement of simply-typed differential lambda calculus). I did not provide references in the talk as unfortunately I did not understand the subject at the time, but the introduction to the second paper above hopefully serves to remedy that omission. Suffice to say that Ehrhard and Regnier discovered something quite profound in differential lambda calculus, the implications of which are still being explored; this discovery should have significant bearing on what people refer to as differentiable programming, I think.
The third link above is the master’s thesis of my talented student James Clift. We have subsequently extended that work and are finishing a paper which explains the computational content of derivatives of Turing machines, according to differential linear logic. This might be of interest to people here. I will post a link when we put it online.
It seems like operating in the discrete domain was fine for small problems that could be brute-forced, but if we want to get into an analytical regime for larger problems, differentiation is akin to recursion/induction in that it allows us to reduce things to tractable smaller problems. Is that roughly correct?
How would you recommend an undergrad bootstrap themselves on this subject? Are there patterns we can apply to transform discrete problems into differentiable problems?
I think it is more about error propagation: you can’t estimate which inputs to an algorithm contribute more to the final error, without some notion of derivative of an algorithm with respect to its inputs (even if those inputs are discrete, so that making infinitesimal variations does not obviously make sense).
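To make the error-propagation point concrete, here is a minimal sketch (not from the talk, just an illustration): approximating the sensitivity of a black-box function to each of its inputs with finite differences. The function `f` and the step size are arbitrary choices for the example.

```python
# Sketch: estimating which inputs contribute most to the output of a
# black-box function f, via one-sided finite differences.

def sensitivities(f, xs, h=1e-6):
    """Approximate |df/dx_i| at the point xs, for each input i."""
    base = f(xs)
    out = []
    for i in range(len(xs)):
        bumped = list(xs)
        bumped[i] += h          # perturb one input at a time
        out.append(abs(f(bumped) - base) / h)
    return out

# This f is far more sensitive to its second input than its first:
f = lambda xs: xs[0] + 100 * xs[1]
print(sensitivities(f, [1.0, 2.0]))  # roughly [1.0, 100.0]
```

For discrete inputs this numerical trick is exactly what breaks down, which is why a more structural notion of derivative is interesting.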
Rust wouldn't exist without it.
Oddly enough, I think the next discontinuity in programming language design will be in making it easier to construct anytime, incremental, and succinct programs. There is a tyranny of poor tolerancing that we currently have a very hard time escaping.
Is it that differentiable programs can be made sloppy in a way that non-differentiable ones can't? Have you looked at the research on using DNNs for physics simulations?
But keep in mind that this is a very young field, and there is no definitive point of view.
Does your work still keep the non-determinism of Ehrhard and Regnier?
This was the part that 'bothered' me, but it provided evidence of a link with process calculi.
I've also thought Ehrhard and Regnier's work was groundbreaking. It potentially opens up not just insights into differentiable programming, but a theory of concurrent computation and an algebraic theory of computation. My expectation is that it will reveal some deep links with other areas of mathematics, like group theory.
Semantically, however, I think things are clearer. The denotation of the syntactic derivative of a proof is a limit, and the terms in the limit have an interpretation as probabilistic processes in a (close to) standard sense. So while the sum-over-linear-substitutions doesn’t have a clear probabilistic interpretation (at least that I can see) it is the limit of computational processes where you run the probability of making a substitution to zero.
I don’t know anything about process calculi, unfortunately!
If you have a programming language, you can think of differentiation as a higher-order function that takes a function and returns a new function as a result. If your language has support for higher-order functions, then (a) you have to extend the definition of the derivative to cope with differentiation at higher type, and (b) ideally you'd like differentiation to be an operator in your programming language.
In general this is hard, so you have to cut the problem somewhere.
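To illustrate the starting point (my example, not the talk's): at first order, differentiation really is just a higher-order function, which a numerical sketch makes obvious. The hard part, as noted above, is what happens at higher type.

```python
# Sketch: differentiation as a higher-order function. `derivative` takes
# a function R -> R and returns (an approximation of) its derivative,
# which is itself a first-class function R -> R.

def derivative(f, h=1e-6):
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)  # central difference

square = lambda x: x * x
dsquare = derivative(square)   # a new function, usable like any other
print(dsquare(3.0))            # approximately 6.0
```

The trouble starts when the argument is itself a function, e.g. differentiating a functional like `F(f) = f(2.0)`: a purely numerical scheme like this no longer applies, and that is where the design space opens up.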
Pearlmutter and Siskind's basic idea is to use Reynolds defunctionalization transformation to turn higher-order programs into first-order programs, and then apply automatic differentiation to the resulting first-order program.
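For readers who haven't seen defunctionalization, here is a toy sketch of the Reynolds transformation (my illustration, not Pearlmutter and Siskind's actual system): every closure in the source program becomes a data constructor, and all calls go through a single first-order `apply` dispatcher.

```python
# Toy defunctionalization in the style of Reynolds.
# Higher-order original:
#   def compose(f, g): return lambda x: f(g(x))
#   inc = lambda x: x + 1
#
# First-order version: each closure becomes a tagged tuple, and the one
# first-order function `apply` interprets the tags.

INC = ("Inc",)
def compose(f, g): return ("Compose", f, g)

def apply(fn, x):
    tag = fn[0]
    if tag == "Inc":
        return x + 1
    if tag == "Compose":
        _, f, g = fn
        return apply(f, apply(g, x))
    raise ValueError(f"unknown closure tag: {tag}")

print(apply(compose(INC, INC), 5))  # 7
```

The resulting program has no first-class functions at all, so first-order AD machinery can be applied to it.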
The approach in this talk is inspired by "differential lambda calculus", which is a setting where you retain full first-class functions, but restrict how you can use them by imposing certain typing constraints upon them (basically to ensure smoothness/differentiability of definable functions).
There is a mysterious connection between these ideas, since defunctionalization is closely related to the "geometry of interaction" semantics of linear logic, and differential lambda calculus is closely related to the vector space semantics of linear logic.
In addition to these semantic issues, there's also the question of how you hook it up to the actual algorithms people use to do gradient descent -- eg you need to use the Riesz representation theorem to justify representing the gradient as a matrix -- and that constrains how things could work even further. This is simultaneously annoying (because it makes things harder) and helpful (at higher type we have too much freedom in how to define the derivative, and so additional requirements help constrain the solution space).
So you might ask: why do we care so much about higher-order functions? The answer is that "higher-order function" is how semanticists pronounce "linking" and "separate compilation". It would be super great if we could take a bunch of teeny-tiny separately-developed models, stick them together, and have everything work, no problem.
It's a really inspiring talk, albeit super intense (you need to be bilingual in denotational semantics and real analysis to follow everything).
So, for example, I do not think that in the Pearlmutter-Siskind system a function of type Int -> Int would have a meaningful derivative, whereas in the differential lambda calculus it does (this is surprising).
I do not recall the details of the Pearlmutter-Siskind system, but some approaches to programming with AD are based on adding formal infinitesimals. There are no infinitesimals explicitly in differential lambda calculus or linear logic, but in the semantics of the latter that we use, derivatives do arise from the coalgebra (k[x]/x^2)^* which is the manifestation of a first-order infinitesimal in algebraic geometry. So in some sense both subjects proceed by adding formal infinitesimals to an existing programming language, but in differential lambda calculus this is implicit.
I've made a first pass through the Pearlmutter-Siskind system (the Lambda the Ultimate Back-propagator paper) and the main idea, IIUC, is to implement backward-mode AD over reals in a compositional manner. So I generally agree with your characterisation of it. Forward-mode AD is usually presented in terms of infinitesimals (dual numbers), whereas backward-mode is not; but this is a minor point.
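For concreteness, here is a minimal forward-mode AD sketch with dual numbers a + b·eps, where eps² = 0 (the programming-language face of the k[x]/x² infinitesimal mentioned above). This is an illustrative toy, supporting only addition and multiplication.

```python
# Minimal forward-mode AD with dual numbers: a + b*eps, with eps^2 = 0.

class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b   # value part and derivative part

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def diff(f, x):
    """Derivative of f at x: seed the eps-coefficient with 1."""
    return f(Dual(x, 1.0)).b

print(diff(lambda x: x * x * x + 2 * x, 2.0))  # 3*2^2 + 2 = 14.0
```

Running a program on `Dual` inputs propagates derivatives alongside values, which is exactly the "adding formal infinitesimals" picture from the previous comment.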
To be more precise, the field is still known as AI, but people outside the field only know (and care) about machine learning, presumably because that's what Google, Facebook, et al are recruiting for.
This is a bit of a sad situation, really. AI, in its drive to solve major hard problems (that still remain unsolved btw), has always contributed powerful programming techniques and even entire language paradigms to computer science (functional programming, object orientation that basically started with Minsky's frames, all sorts of search algos etc).
Machine learning is just one such technique which has gained popularity outside AI. It's powerful, for restricted problems. But if it somehow replaces AI in peoples' mind, the goose that lays the golden eggs might just die.
I'd argue the machine learning label can be applied to any AI system that's data-driven in some sense (even self-generating the data using reinforcement learning).
Wikipedia lists the following approaches to Machine Learning. Surely you wouldn't call them all 'one technique'? Decision tree learning, Association rule learning, Artificial neural networks, Deep learning, Inductive logic programming, Support vector machines, Clustering, Bayesian networks, Reinforcement learning, Representation learning, Similarity and metric learning, Sparse dictionary learning, Genetic algorithms.
My main concern is not about defining what machine learning is, however. Rather, I'm worried about the definition of AI shrinking to "it's what we now call machine learning". Which is at the very least unhistorical.
Then of course there's the connectionist branch that explicitly set out to take inspiration from the human brain. I don't think that anyone ever saw neural networks as anything but AI for the last 50 years or so. In any case, when the connectionists were having arguments, they were having them with other AI folks, not with, say, statisticians, or mathematicians.
Then, if you think about the classic tasks that machine learning systems perform, they're all firmly classic AI tasks: classification, game-playing, machine vision, speech processing, etc etc.
I mean, the fact that models are "predictive" on its own doesn't tell you that machine learning is separate from AI. After all, an intelligent agent must be able to make predictions about its environment and the outcome of its actions on this environment.
I bet there are lots of brilliant results that a smart mind could apply to our field.
Nobody is discounting that someone could open up a completely different learning paradigm, and it would most likely come from a pure mathematician with computer skills. A few prominent researchers have said something to that effect already.
I would think that the economic incentive to do exactly this would be pretty high.
Linear logic has been proposed as one solution to the problem of garbage collection and providing efficient "update-in-place" capabilities within a more functional language. Linear logic conserves accessibility, and hence provides a mechanical metaphor which is more appropriate for a distributed-memory parallel processor in which copying is explicit. However, linear logic's lack of sharing may introduce significant inefficiencies of its own.
We show an efficient implementation of linear logic called Linear Lisp that runs within a constant factor of non-linear logic. This Linear Lisp allows RPLACX operations, and manages storage as safely as a non-linear Lisp, but does not need a garbage collector. Since it offers assignments but no sharing, it occupies a twilight zone between functional languages and imperative languages. Our Linear Lisp Machine offers many of the same capabilities as combinator/graph reduction machines, but without their copying and garbage collection problems.
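To illustrate the "no sharing" discipline in a familiar setting (my sketch, not Linear Lisp itself): a linear value may be consumed exactly once, and because nobody else can hold a reference, its storage can safely be updated in place rather than collected.

```python
# Illustrative sketch of a linear ("use exactly once") discipline.
# Once a value is consumed, reusing it is an error -- so there can be no
# aliases, and in-place mutation of the contents is always safe.

class LinearCell:
    def __init__(self, value):
        self._value = value
        self._alive = True

    def take(self):
        """Consume the cell, yielding its contents exactly once."""
        if not self._alive:
            raise RuntimeError("linear value used twice")
        self._alive = False
        return self._value

cell = LinearCell([1, 2, 3])
xs = cell.take()   # fine: the first (and only) use
xs.append(4)       # safe to mutate in place: no one else can see the list
# cell.take()      # would raise: the value has already been consumed
```

A linear type system enforces this statically rather than at run time, which is what lets Linear Lisp drop the garbage collector entirely.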
We need to extend the logic-based approach to deal with reasoning under uncertainty, so many-valued logic is needed. Also, we need the logic to be able to model discrete dynamical systems, so we need to look at Temporal Logic, the Situation Calculus, the Event Calculus, etc. You can call all this 'Computational Logic'; it may actually be more general than probability theory.
See my wiki-book listing the central concepts of 'Computational Logic' here:
Functional programming is best for handling the logic-based approach, for example Haskell. Functional programming languages work at a higher level than ordinary imperative and procedural programming, since functional programming deals with the manipulation of knowledge itself rather than the states of the computer.
I'd also look closely at the pure mathematical origins of computational logic, including classical mathematical logic and category theory. Type theory (a constructive form of category theory) looks like it forms the basis for language rules, thus making it ideal for NLP.
I conjecture that an extension of computational logic leads to a solution to machine psychology (inc. NLP), analogous to how machine learning can be viewed as an extension of probability and statistics.
Probability&Stats >>> Machine Learning
Computational Logic >>> Machine Psychology (inc. NLP)
Or, asked differently: Given some task that you want to train your network for, how do you decide which type you should choose?
The type you choose is part of the model you're training so your question is as difficult as "what kind of network topology should I use?".