
Deep Learning for Symbolic Mathematics - lucidrains
https://openreview.net/pdf?id=S1eZYeHFDS
======
wwalker3
The authors have shown a very nice and (to me) non-intuitive result. But
they're playing a little fast and loose with their comparison to Mathematica.
They're comparing their algorithm's accuracy (solution correctness vs.
incorrectness) with Mathematica's ability to find the correct solution in
less than 30 seconds. This is a very important distinction! Mathematica will
never silently return an incorrect solution (barring software bugs, of
course). And Mathematica can often take minutes to evaluate what appears to be
a simple integral, so a 30-second timeout is far too short, unless you're
simply trying to compare the computational efficiency of the two approaches.

There may be other subtleties as well. Mathematica works in the complex domain
by default, which makes many operations more difficult, but the authors
discard expressions which contain complex-valued coefficients as "invalid",
which makes me think they're implicitly working in the real domain. Do they
restrict Mathematica to the real domain when they invoke it? Perhaps, but they
don't say one way or the other. And do they try common tricks like invoking
FullSimplify[] on an expression/equation before attempting to operate on it?
I'd like to see more details of their methodology.

~~~
theresistor
> They're comparing their algorithm's accuracy (solution correctness vs.
> incorrectness), with Mathematica's ability to find the correct solution in
> less than 30 seconds. This is a very important distinction!

I had the same initial reaction as you, but then I realized that this is still
extremely useful. In a ton of examples, only one direction of
differentiation/integration is hard while the other direction is easy. You
could build a system that attempted to solve it directly, and failing that
attempted to guess-and-verify using this approach. My intuition is that such
an overall system would be strictly superior to Mathematica's approach as it
exists today.
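Such a pipeline is cheap to sketch: differentiate the model's guessed
antiderivative and compare it against the original integrand. A minimal
numeric version (the function names, sample points, and tolerances here are
illustrative, not from the paper):

```python
import math

def verify_antiderivative(F, f, points=(0.3, 1.1, 2.7), h=1e-6, tol=1e-4):
    """Accept a guessed antiderivative F of f only if a central finite
    difference of F matches f at every sample point."""
    return all(abs((F(x + h) - F(x - h)) / (2 * h) - f(x)) < tol
               for x in points)

# Correct guess: d/dx[-cos(x)] = sin(x)
print(verify_antiderivative(lambda x: -math.cos(x), math.sin))  # True
# Wrong guess: d/dx[cos(x)] = -sin(x), not sin(x)
print(verify_antiderivative(math.cos, math.sin))                # False
```

A CAS would verify symbolically (differentiate, then simplify the
difference to zero), but numeric spot checks like this are enough to reject
most wrong guesses cheaply.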

~~~
wwalker3
That's a good point. Guess-and-verify could be a handy additional heuristic
method if Mathematica's other methods came up empty on a problem. I've also
heard of machine learning being used to choose between internal algorithms
available in formal proof systems, to try to pick the algorithm that's most
likely to work instead of just trying them all sequentially.

~~~
andrepd
The person opposite my desk is working on precisely that! (And I'm making more
algorithms for him to feed to his model :p)

------
Cybiote
This is interesting: the authors encode expression trees in RPN format and use
a vanilla seq2seq transformer to decode solutions to certain integrals and
differential equations.
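For concreteness, flattening such a tree into a token sequence is a simple
postorder walk; the tuple-based tree format below is hypothetical, just to
illustrate the encoding:

```python
def to_rpn(expr):
    """Postorder-walk a tree of (operator, *children) tuples into
    postfix (RPN) tokens; leaves are variable names or numbers."""
    if not isinstance(expr, tuple):
        return [str(expr)]
    op, *args = expr
    tokens = []
    for child in args:
        tokens.extend(to_rpn(child))
    tokens.append(op)
    return tokens

# 3*x + sin(x)  ->  ['3', 'x', 'mul', 'x', 'sin', 'add']
print(to_rpn(('add', ('mul', 3, 'x'), ('sin', 'x'))))
```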

I suspect what's happening, given the constrained space of problems, is the NN
is matching input formats to solution templates. For that space, it would have
learned some patterns of what undoes to what. Perhaps there's even something
that's captured aspects of the product rule buried in those weights.

I would be curious to see how it does on problems outside the training set
format. Or on integrals that require a trick, such as exp(x) * sin(x) or those
with nested composition: A/(1+exp(-k*(x-a))). Is it sensitive to input size?
Would a smaller but slightly tricky integral like 1/sqrt(x^2 - y^4) trip it up?
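For reference, the exp(x) * sin(x) example is the classic
integrate-by-parts-twice trick: after two applications of integration by
parts the original integral reappears on the right, and you can solve for it
algebraically:

```latex
I = \int e^{x}\sin x \,dx
  = e^{x}\sin x - \int e^{x}\cos x \,dx
  = e^{x}\sin x - e^{x}\cos x - I
\;\Longrightarrow\;
I = \tfrac{1}{2}\,e^{x}(\sin x - \cos x) + C
```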

Most importantly, since the bulk of real-world indefinite integrals are not
analytically tractable, it is vital that a practical system knows when to give
up. If I put in exp(-x^2), it better not return an answer. This is my main
worry for methods like this. In a real world setting covering a wider range of
mathematical problems, you can't trust something that's not 100% on what it
knows how to do.

~~~
Iv
They compare it to Mathematica, which can return a good result, fail to do so,
or time out. Apparently they outperform Mathematica significantly (on the
domain tested).

It won't be 100%, but no system (including humans) would be able to deliver a
100% rate of answers on symbolic maths.

~~~
Cybiote
By 100%, I mean that it should fail when it does not know the answer, not that
it needs to be able to answer 100% of provided questions. It should not just
output any answer.

Alternatively, for the problems it knows how to do, a CAS should be as close
to 100% as can be reached. But you are right, even Mathematica is not always
correct so it is useful to be able to query a solver for its steps.

------
andrepd
The results look almost too good to be true. There could be a bias in the way
the expressions are generated; there is no guarantee that this is in any way
representative of real problems, and I don't think this is addressed in the
paper. A good thing to do, in my opinion, is to compile a corpus of real-world,
naturally occurring problems, and then measure performance on that. Then you
would know how it performs on some objectively useful metric, not on a
potentially biased sample.

~~~
knzhou
Mathematica is at a disadvantage since it worries about getting everything
perfectly right, i.e. it cares about branch cuts and domains. I can cook up
plenty of examples engineered to have these issues, which I could solve (naively)
but which Mathematica would choke on (when trying to solve them properly). All
of the examples that Mathematica couldn't solve seem to be of this form. If
you actually want it to stand a chance, you have to put in appropriate
Assumptions, not just toss it in.

That said, I'm still _extremely_ surprised this works that well. I would be
very curious to test this interactively.

~~~
amelius
You could take a 2-step approach, where you first take the "fuzzy" approach,
and then later try to correct for possible oversights (which is just simple
bookkeeping).

~~~
mlevental
how exactly would you do that? how would you know when the fuzzy search has
failed?

~~~
amelius
You plug the result back into the equation, then simplify the equation until
you get a tautology. Of course, the "simplify" step can be difficult, but it's
usually a lot simpler than solving equations.

------
lapink
It is a possibility that there is a natural vector space embedding for
functions in which integration/differentiation is a simple operation. A deep
learning network could find such an embedding.

~~~
mlevental
there is such a vector space; too bad it has an uncountable basis:
[https://en.wikipedia.org/wiki/Laplace_transform](https://en.wikipedia.org/wiki/Laplace_transform)
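to spell the point out: under the Laplace transform, differentiation in t
becomes (up to a boundary term) multiplication by s in the transform domain,
i.e. an embedding in which the operation is simple:

```latex
F(s) = \mathcal{L}\{f\}(s) = \int_{0}^{\infty} e^{-st} f(t)\,dt,
\qquad
\mathcal{L}\{f'\}(s) = s\,F(s) - f(0)
```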

~~~
lapink
There exist many such bases; you could also take all the monomes and
approximate a function by its Taylor expansion. However, that does not mean
that such bases can be efficiently approximated in a reasonable dimension, nor
that conversion from/to a textual representation is easy. A deep learning
network could address both points.

~~~
mlevental
what is a monome?
[https://en.wikipedia.org/wiki/Monome](https://en.wikipedia.org/wiki/Monome) ?

~~~
MauiWarrior
I think he meant monomial.

~~~
lapink
Indeed...

------
turingbike
They give examples of problems their model could solve that Mathematica
couldn't (within a 30 second timeout) - and that's awesome. Destroy
Mathematica. But did anyone notice whether there were problems it couldn't
solve that Mathematica could?

~~~
nfltn
I'm also curious whether there are problems that Mathematica can solve but
this system cannot.

More importantly, I'm curious if there are problems that Mathematica knows it
can't solve but for which this system silently gives wrong answers.

Another interesting extension to the experiments would be a longer timeout --
30 seconds seems a bit arbitrary and quite low for a CAS. However, I suspect
the reason for that timeout is the fact that Mathematica licenses are
insanely expensive. Otherwise the 5,000 (actually, only 500) test problems
could be run for at least a few minutes at pretty trivial cost. Maybe there's
a Mathematica employee here who can suggest Wolfram donate some compute (or at
least limited licenses) for a small evaluation cluster. Especially if the
authors decide to do follow-up work.

In any case, this is really interesting work. I think deep learning for
symbolic mathematics is going to be a super interesting area to watch for at
least the next few years. Good work, anonymous author(s).

~~~
wendyshu
Verifying a candidate solution for these problems is relatively easy so wrong
answers aren't so bad.

~~~
nfltn
I understand.

To explain: the thing that's _super_ interesting to me about this paper (i.e.,
"strong result" vs. "best paper contender") is not integration per se. It's
the possible applications of the method to problems with much, much, much
higher computational complexity than integration. On those problems,
validating the correctness of a solution is also intractable. In those cases,
a _sound_ function approximation approach would be an absolute game changer
for symbolic methods.

(Not that integration isn't interesting as well.)

~~~
wendyshu
How are they going to generate training data if verifying solutions is hard?

~~~
nfltn
Some of these decision problems have thousands of examples because they
correspond to industrially relevant problems. So, not automatically generated
all at once, but gleaned from people who have been using CAS for decades to
solve specific problems.

Still, I fear, the numbers are currently too small to get past the information
bottleneck (mere thousands). We'll see.

~~~
shmageggy
Are these gathered in one place anywhere? I, and probably many others
(including the authors of this paper), would be interested in them as a test
set for models like this.

------
currymj
This paper was literally submitted to the venue yesterday! I generally think
it's good how fast-moving ML is, but this seems excessive.

------
mark_l_watson
I posted a link for this paper on twitter this morning. Amazing results.

I have used sequence models for several character recognition and generation
projects, and sequence-to-sequence RNN/LSTM-style models are surprising in
what they can do in terms of modeling language, building representations,
synthesizing data like JSON as text, etc.

~~~
nextos
But don't you think the real power (AGI) will come when merging these deep
approaches with logic and probability, sort of where deep probabilistic
programming seems to be heading, albeit slowly?

I find deep learning models like this one are trying to do too much in a
single "step" or module.

~~~
mark_l_watson
Well, I think we have not discovered or invented all the technologies needed
for AGI yet. I think true AGI will take a long time, but, new technologies
developed along that path will be awesome.

------
xvilka
Would be nice to have integration with the SageMath suite and/or Maxima.

------
efavdb
Often knowing that a problem has a solution is a big help. My understanding is
that most expressions can't be integrated in closed form. Had the system been
trained on the full set of problems, would it then learn the most likely
answer is "no answer" and report this every time? When Mathematica attacks a
problem, it must presumably have to first address that issue in some sense. Is
this an unfair advantage here?

------
wildermuthn
Very impressive. But is the network distilling mathematical knowledge, or
merely extracting a long list of pattern matching rules? It seems the latter.
AGI will go nowhere until researchers have the gall to experiment with machine
consciousness, and to create an entity capable of knowing anything at all.

------
sabujp
finally a paper i can mostly understand :)

------
JacobiX
Very nice results that challenge the conventional wisdom: DNNs can work with
symbolic data and calculations. I think it says a lot about our poor
theoretical understanding of deep neural networks despite all the practical
advances.

~~~
pfortuny
Notice that the method is essentially pattern-matching, nothing more: it is
even simpler than object recognition. As a matter of fact, that is the way any
"competent" student works with integration and solving differential equations.

I am not trying to understate the value of the work, but those problems are
very much what DNNs are trained to solve, are they not?

The fact that they are "symbolic" is just an artifice of the problem: they are
no more symbolic than a series of colors, if you ask me.

That is why they can be automated (mostly) and why "tricks" to solve families
of them can be developed.

------
mikorym
Correct me please if I am mistaken, but aren't differential equations a solved
problem, in the sense of: this is f(x), calculate this kind of derivative? In
terms of creating new mathematics, I suppose that would be a different topic.

------
amelius
The first job to be replaced by AI: taxi driver or mathematician? Or perhaps
analog designer [1]?

[1]
[https://news.ycombinator.com/item?id=21083173](https://news.ycombinator.com/item?id=21083173)

------
MauiWarrior
I only glanced through it, but there is no GitHub link.

~~~
intuitionist
I also like papers with code, but linking an author’s github in a _blind
conference submission_ wouldn’t make a ton of sense.

~~~
MauiWarrior
What does blind submission mean?

~~~
nicwilson
In this case the review is, mostly (chances are the area chair knows who is
who), double blind, meaning the reviewers don't know who the paper is written
by _and_ the authors of the paper don't know the identity of the reviewers.

~~~
psychoslave
Well, if reviewers want to know, they most likely will be able to make a good
guess, without even using any social engineering trick.

[https://www.schneier.com/blog/archives/2018/07/identifying_p...](https://www.schneier.com/blog/archives/2018/07/identifying_peo_8.html)

------
ShankarWarang
Very debatable outcomes!

------
mlevental
I mean, cool, but I don't see this as doing anything more than just memorizing
answers? at least a cas performs transformations. also the time limit metric is
artificially necessary for being able to perform the comparison in a
statistically significant way in a reasonable amount of time - an apples to
apples comparison would be letting both systems run until completion and then
comparing accuracy, but that would take too long because the algorithms that
fit those transformations often have combinatorial runtimes.

good idea to use the cas to generate though. I'm always trying to think up
tasks that operate on procedurally generated data (e.g. inverse graphics)
because then the labeled data is cheap.
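the generation direction makes this cheap: differentiate random expressions
and treat each (derivative, expression) pair as an (integration problem,
solution) example. a toy sketch over a tiny hypothetical grammar of nested
tuples (not the paper's actual generator):

```python
import random

# Expression grammar: the leaf 'x', or tuples ('const', c), ('add', a, b),
# ('mul', a, b), ('sin', a), ('cos', a), ('exp', a).

def diff(e):
    """Symbolic derivative of an expression tree with respect to x."""
    if e == 'x':
        return ('const', 1)
    op = e[0]
    if op == 'const':
        return ('const', 0)
    if op == 'add':
        return ('add', diff(e[1]), diff(e[2]))
    if op == 'mul':  # product rule
        return ('add', ('mul', diff(e[1]), e[2]), ('mul', e[1], diff(e[2])))
    if op == 'sin':  # chain rule
        return ('mul', ('cos', e[1]), diff(e[1]))
    if op == 'cos':
        return ('mul', ('mul', ('const', -1), ('sin', e[1])), diff(e[1]))
    if op == 'exp':
        return ('mul', e, diff(e[1]))
    raise ValueError(op)

def random_expr(depth):
    """Sample a random expression tree of bounded depth."""
    if depth == 0:
        return random.choice(['x', ('const', random.randint(1, 5))])
    op = random.choice(['add', 'mul', 'sin', 'exp'])
    if op in ('add', 'mul'):
        return (op, random_expr(depth - 1), random_expr(depth - 1))
    return (op, random_expr(depth - 1))

# Each (diff(f), f) pair is a labeled "integrate this" training example.
f = random_expr(3)
problem, solution = diff(f), f
```

note that sampling answers and differentiating biases which integrands show
up, which is the sampling concern raised elsewhere in this thread.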

~~~
eximius
I mean, at the very least I assume that they can generate a function, check
that it wasn't in the training set, then solve it and verify.

Then what would be your complaint?

~~~
mlevental
I don't mean literally memorizing. "memorization" is tongue-in-cheek for
overfitting. I'm saying that the net won't actually generalize to classes it
hasn't seen examples from (while reduction/transformation algorithms will).

