Deep Learning for Symbolic Mathematics (openreview.net)
222 points by lucidrains on Sept 26, 2019 | hide | past | favorite | 62 comments

The authors have shown a very nice and (to me) non-intuitive result. But they're playing a little fast and loose with their comparison to Mathematica. They're comparing their algorithm's accuracy (solution correctness vs. incorrectness), with Mathematica's ability to find the correct solution in less than 30 seconds. This is a very important distinction! Mathematica will never silently return an incorrect solution (barring software bugs, of course). And Mathematica can often take minutes to evaluate what appears to be a simple integral, so a 30-second timeout is far too short, unless you're simply trying to compare the computational efficiency of the two approaches.

There may be other subtleties as well. Mathematica works in the complex domain by default, which makes many operations more difficult, but the authors discard expressions which contain complex-valued coefficients as "invalid", which makes me think they're implicitly working in the real domain. Do they restrict Mathematica to the real domain when they invoke it? Perhaps, but they don't say one way or the other. And do they try common tricks like invoking FullSimplify[] on an expression/equation before attempting to operate on it? I'd like to see more details of their methodology.

> They're comparing their algorithm's accuracy (solution correctness vs. incorrectness), with Mathematica's ability to find the correct solution in less than 30 seconds. This is a very important distinction!

I had the same initial reaction as you, but then I realized that this is still extremely useful. In a ton of examples, only one direction of differentiation/integration is hard while the other direction is easy. You could build a system that attempted to solve it directly, and failing that attempted to guess-and-verify using this approach. My intuition is that such an overall system would be strictly superior to Mathematica's approach as it exists today.
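The guess-and-verify loop is cheap to sketch. Below is a minimal, hypothetical illustration (not the paper's code, and all names are made up): a candidate antiderivative is accepted only if its numerical derivative matches the integrand at a few sample points.

```python
import math

def looks_like_antiderivative(F, f, xs=None, h=1e-6, tol=1e-4):
    """Numerically check that F'(x) ~= f(x) at sample points using a
    central difference -- a cheap 'verify' step for guess-and-verify."""
    if xs is None:
        xs = [0.1, 0.5, 1.0, 2.0, 3.0]
    for x in xs:
        deriv = (F(x + h) - F(x - h)) / (2 * h)
        if abs(deriv - f(x)) > tol * max(1.0, abs(f(x))):
            return False
    return True

# f(x) = exp(x)*sin(x); a known antiderivative is exp(x)*(sin(x)-cos(x))/2
f = lambda x: math.exp(x) * math.sin(x)
F_good = lambda x: math.exp(x) * (math.sin(x) - math.cos(x)) / 2
F_bad = lambda x: math.exp(x) * math.sin(x)  # wrong guess

assert looks_like_antiderivative(F_good, f)
assert not looks_like_antiderivative(F_bad, f)
```

In a real pipeline the symbolic check (differentiate and simplify) would replace the numerical one, but the structure -- propose, then verify independently -- is the same.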

That's a good point. Guess-and-verify could be a handy additional heuristic method if Mathematica's other methods came up empty on a problem. I've also heard of machine learning being used to choose between internal algorithms available in formal proof systems, to try to pick the algorithm that's most likely to work instead of just trying them all sequentially.

The person opposite my desk is working on precisely that! (And I'm making more algorithms for him to feed to his model :p)

This is interesting: the authors encode expression trees in RPN format and use vanilla transformers in a seq2seq setup to decode solutions to certain integrals and differential equations.
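As a rough illustration of the encoding (a reply downthread notes the paper actually uses normal Polish/prefix notation rather than RPN), serializing an expression tree to a flat token sequence might look like this -- the node names here are invented for the sketch:

```python
# Prefix (Polish) serialization of an expression tree: each operator is
# emitted before its children, so no parentheses are needed and the
# result is a flat token sequence suitable for a seq2seq model.

def to_prefix(node):
    """node is either a leaf string ('x', '2', ...) or a tuple
    (op, child1, child2, ...)."""
    if isinstance(node, str):
        return [node]
    op, *children = node
    tokens = [op]
    for child in children:
        tokens.extend(to_prefix(child))
    return tokens

# 3*x + sin(x**2)  ->  ['+', '*', '3', 'x', 'sin', 'pow', 'x', '2']
expr = ('+', ('*', '3', 'x'), ('sin', ('pow', 'x', '2')))
print(to_prefix(expr))
```

Decoding is the inverse: read tokens left to right, consuming as many sub-expressions as each operator's arity requires.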

I suspect what's happening, given the constrained space of problems, is the NN is matching input formats to solution templates. For that space, it would have learned some patterns of what undoes to what. Perhaps there's even something that's captured aspects of the product rule buried in those weights.

I would be curious to see how it does on problems outside the training set format. Or on integrals that require a trick, such as exp(x) * sin(x) or those with nested composition: A/(1+exp(-k*(x-a))). Is it sensitive to input size? Would a smaller but slightly tricky integral like 1/sqrt(x^2 - y^4) trip it?

Most important, since the bulk of real world indefinite integrals are not analytically tractable, it is vital that a practical system knows when to give up. If I put in exp(-x^2), it better not return an answer. This is my main worry for methods like this. In a real world setting covering a wider range of mathematical problems, you can't trust something that's not 100% on what it knows how to do.

They compare it to Mathematica, which can return a good result, fail to do so, or time out. Apparently they outperform Mathematica significantly (on the domain tested).

It won't be 100%, but no system (including humans) would be able to deliver a 100% rate of answers on symbolic maths.

By 100%, I mean that it should fail when it does not know the answer, not that it needs to be able to answer 100% of provided questions. It should not just output any answer.

Alternatively, for the problems it knows how to do, a CAS should be as close to 100% as can be reached. But you are right, even Mathematica is not always correct so it is useful to be able to query a solver for its steps.

There's work on using Cartesian genetic programming to efficiently (vectorizable, with little pointer- or tree-chasing of any significance) search for symbolic equations.

The benefit there is that bias control is relatively easy for genetic programming, because its adaptations are more or less easy to wrap your head around.

I'd like more research into marrying Cartesian genetic programming to something like transformers, in part because genetic generators are still far beyond transformers in the novelty they can generate. It's relevant for e.g. generating interesting melodies that conform to some automated fitness evaluator, specifically because genetic algorithms are such efficient iterative search algorithms for these high-dimensional spaces.

Using multi-hot encoding "piano-roll style", or even just a sequence of (pitch; volume; attack, i.e. how fast the pitch rises to nominal; duration), is not necessarily efficient. Even simple, basic nodes in a typical tree-based representation suffice: double (or repeat n times, where n can be incremented/decremented more easily by mutation than the node itself can be replaced outright), double-but-reverse-the-copy, double-but-shift-the-copy's-pitch, and not much more (I'll link the paper later). These yield evolution efficient enough to train a neural discriminator from zero by supervised online operation.

The NN suggests a ranking by quality (the worst are culled; just normal genetic-programming evolution stepping), and the software has the human whose opinion the NN is to learn correct the ranking, by playing pairs of samples and asking which is better or worse. With some statistics you should be able to get by with very few verification comparisons on top of a quicksort/mergesort approach, to guard against those algorithms playing poorly with a comparison operator that only probabilistically conforms to a partial order, and only asymptotically to a total order. Later it might be possible to use the discriminator's scoring to put more effort into close calls, or to adapt the statistics to the fact that not all comparisons have the same probability split.

It could be great to try this for solving problems whose inverse is easy. "Throw" a couple of mathematicians who specialize in butterfly effects regarding solvability/solutions for the type of equation at it, and let them write rules to guide the trial and error of a genetic-programming suggester. The fancy DNN is trained to recognize particularly promising candidates to submit as feedback to the evolution step. One major benefit is massive parallelism, both in-model and between candidates, as well as the (AFAIK unmatched) ability to efficiently find truly novel solutions to problems.
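For readers unfamiliar with the suggest-and-cull loop being described, here is a toy sketch using plain tree-based GP (not Cartesian GP, and far simpler than the proposal above, with no neural ranker): random expression trees are scored against a target function and the worst half is culled each generation.

```python
import math, random

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}
LEAVES = ['x', 1.0, 2.0]

def rand_tree(depth=3):
    if depth == 0 or random.random() < 0.3:
        return random.choice(LEAVES)
    op = random.choice(list(OPS))
    return (op, rand_tree(depth - 1), rand_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    op, a, b = tree
    return OPS[op](evaluate(a, x), evaluate(b, x))

def mutate(tree):
    # Replace a random subtree with a fresh random one.
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return rand_tree(2)
    op, a, b = tree
    if random.random() < 0.5:
        return (op, mutate(a), b)
    return (op, a, mutate(b))

def fitness(tree, target, xs):
    total = 0.0
    for x in xs:
        try:
            err = (evaluate(tree, x) - target(x)) ** 2
        except OverflowError:
            return float('inf')
        if math.isnan(err) or math.isinf(err):
            return float('inf')
        total += err
    return total

def evolve(target, generations=100, pop_size=40):
    xs = [i / 4 for i in range(-8, 9)]
    pop = [rand_tree() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, target, xs))
        survivors = pop[:pop_size // 2]          # cull the worst half
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=lambda t: fitness(t, target, xs))

random.seed(0)
best = evolve(lambda x: x * x + 1.0)  # search for x^2 + 1
```

In the scheme described above, the fitness function would be the learned discriminator (periodically corrected by human pairwise judgments) instead of a fixed squared error.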

Just remember how good AFL is as long as there is no cryptography/proper hashing involved. And that part can be circumvented by marrying it to an SMT/SAT solver like Z3, to find interesting inputs for evil code like hashes or to fill in problematic fields in the test case to get past them.

If anyone is interested in code for the melody generator, hit me up and I'll ask the original author to license it so I can publish my fixed version. It's a bit over 20 years old, so it was unlisted on the web and didn't compile or run on a modern Linux. It's still 32-bit-only, but I guess a bot could replace all the then-64-bit types with 32-bit ones where there's no risk of truncating results from foreign functions.

Cartesian genetic programming: DOI:10.1007/978-3-642-17310-3

Brad Johanson, Riccardo Poli: GP-Music: An Interactive Genetic Programming System for Music Generation with Automated Fitness Raters

I think they use normal PN, not RPN.

The results look almost too good to be true. There could be a bias in the way the expressions are generated; there is no guarantee that this is in any way representative of real problems, and I don't think this is addressed in the paper. A good thing to do, in my opinion, is compile a corpus of real-world, occurring problems, and then measure performance on that. Then you would know how it performs on some objectively useful metric, not on a potentially biased sample.

Mathematica is at a disadvantage since it worries about getting everything perfectly right, i.e. it cares about branch cuts and domains. I can cook up plenty of examples engineered to have these issues, which I could solve (naively) but which Mathematica would choke on (when trying to solve them properly). All of the examples that Mathematica couldn't solve seem to be of this form. If you actually want it to stand a chance, you have to put in appropriate Assumptions, not just toss it in.

That said, I'm still extremely surprised this works that well. I would be very curious to test this interactively.

You could take a 2-step approach, where you first take the "fuzzy" approach, and then later try to correct for possible oversights (which is just simple bookkeeping).

how exactly would you do that? how would you know when the fuzzy search has failed?

You plug the result back into the equation, then simplify the equation until you get a tautology. Of course, the "simplify" step can be difficult, but it's usually a lot simpler than solving equations.

It's possible that there is a natural vector-space embedding for functions in which integration/differentiation is a simple operation. A deep learning network could find such an embedding.

There is. You can use Fourier or Laplace space.
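For example, under the Laplace transform, differentiation becomes (essentially) multiplication by the transform variable:

```latex
\mathcal{L}\{f\}(s) = \int_0^{\infty} f(t)\, e^{-st}\, dt,
\qquad
\mathcal{L}\{f'\}(s) = s\,\mathcal{L}\{f\}(s) - f(0).
```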

there is such a vector space; it's too bad it has an uncountable basis: https://en.wikipedia.org/wiki/Laplace_transform

There exist many such bases; you could also take all the monomes and approximate a function by its Taylor expansion. However, that does not mean such bases can be efficiently approximated in a reasonable number of dimensions, nor that conversion from/to a textual representation is easy. A deep learning network could address those points.

I think he meant monomial.


I searched the article, but could not find basis or uncountable. Could you point where exactly I should look?

equation 1 (in the formal definition). the basis is e^(-st). if you don't know how that's a basis you need to read a little bit about functional analysis, but just look at the integral as a continuous sum and f(t) as the basis coefficients, and e^(-st) starts to look like a (hilbert space) vector space basis.

I am afraid it is a bit over my head. Would you be able to point me to the source?

Not the most explanatory sources but function spaces essentially are vector spaces:

https://www.youtube.com/watch?v=7ICYxBuS2iw looks like a pretty elementary introduction

Neural ODEs? "Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. " - https://arxiv.org/abs/1806.07366

They give examples of problems their model could solve that Mathematica couldn't (within a 30-second timeout) - and that's awesome. Destroy Mathematica. But did anyone notice if there were problems it couldn't solve that Mathematica could?

I'm also curious whether there are problems that Mathematica can solve but this system cannot.

More importantly, I'm curious if there are problems that Mathematica knows it can't solve but for which this system silently gives wrong answers.

Another interesting extension to the experiments would be a longer timeout -- 30 seconds seems a bit arbitrary and quite low for a CAS. However, I suspect the reason for that timeout is that Mathematica licenses are insanely expensive. Otherwise the 5,000 (actually, only 500) test problems could be run for at least a few minutes at pretty trivial cost. Maybe there's a Mathematica employee here who can suggest Wolfram donate some compute (or at least limited licenses) for a small evaluation cluster, especially if the authors decide to do follow-up work.

In any case, this is really interesting work. I think deep learning for symbolic mathematics is going to be a super interesting area to watch for at least the next few years. Good work, anonymous author(s).

Verifying a candidate solution for these problems is relatively easy so wrong answers aren't so bad.

I understand.

To explain: the thing that's super interesting to me about this paper (i.e., "strong result" vs. "best paper contender") is not integration per se. It's the possible applications of the method to problems with much, much, much higher computational complexity than integration. On those problems, validating the correctness of a solution is also intractable. In those cases, a sound function approximation approach would be an absolute game changer for symbolic methods.

(Not that integration isn't interesting as well.)

How are they going to generate training data if verifying solutions is hard?

Some of these decision problems have thousands of examples because they correspond to industrially relevant problems. So, not automatically generated all at once, but gleaned from people who have been using CAS for decades to solve specific problems.

Still, I fear, the numbers are currently too small to get past the information bottleneck (mere thousands). We'll see.

Are these gathered in one place anywhere? I and probably many others, including the authors of this paper, would be interested in these as a test set for models like this.

Why not just use the wolfram engine for developers? It’s available for the “insanely expensive” cost of $0. (See: https://www.wolfram.com/engine/)

I've had a lot of trouble getting permission to use Wolfram Engine. If the authors are at a BigCorp, the same might be true for them.

"we report the accuracy of our models on the three different tasks, on a holdout test set composed of 5000 equations."

I had trouble finding the test cases they used. Where'd they list them?

There are cases where Mathematica solves the integral by spitting out all kinds of exotic functions (Bessel functions, all kinds of weird elliptic-integral functions, and so on). They don't have those kinds of integrals in their training data.

This paper was literally submitted to the venue yesterday! I generally think it's good how fast-moving ML is, but this seems excessive.

I posted a link for this paper on twitter this morning. Amazing results.

I have used seq models for several character recognition and generation projects, and seq RNN/LSTM style models are surprising in what they can do in terms of modeling language, building representations, synthesizing data like JSON as text data, etc.

But don't you think the real power (AGI) will come when merging these deep approaches with logic and probability, sort of where deep probabilistic programming seems to be heading, albeit slowly?

I find deep learning models like this one are trying to do too much in a single "step" or module.

Well, I think we have not discovered or invented all the technologies needed for AGI yet. I think true AGI will take a long time, but, new technologies developed along that path will be awesome.

Would be nice to have integration with Sage Math suite, and/or Maxima.

Often, knowing that a problem has a solution is a big help. My understanding is that most expressions can't be integrated in closed form. Had the system been trained on the full set of problems, would it then learn that the most likely answer is "no answer" and report this every time? When Mathematica attacks a problem, it must presumably address that issue first in some sense. Is this an unfair advantage here?

Very impressive. But is the network distilling mathematical knowledge, or merely extracting a long list of pattern-matching rules? It seems the latter. AGI will go nowhere until researchers have the gall to experiment with machine consciousness, and to create an entity capable of knowing anything at all.

finally a paper i can mostly understand :)

Very nice results that challenge the conventional wisdom: DNNs can work with symbolic data and calculations. I think it says a lot about our poor theoretical understanding of deep neural networks despite all the practical advances.

Notice that the method is essentially pattern-matching, nothing more: it is even simpler than object recognition. As a matter of fact, that is the way any "competent" student works with integration and solving differential equations.

I am not trying to underestimate the value of the work but those problems are very much what DNN are trained to solve, are they not?

The fact that they are "symbolic" is just an artifice of the problem: they are no more symbolic than a series of colors, if you ask me.

That is why they can be automated (mostly) and why "tricks" to solve families of them can be developed.

Please correct me if I am mistaken, but aren't differential equations a solved problem, in the sense of "this is f(x), calculate this kind of derivative"? Creating new mathematics, I suppose, would be a different topic.

The first job to be replaced by AI: taxi driver or mathematician? Or perhaps analog designer [1]?

[1] https://news.ycombinator.com/item?id=21083173

I only glanced through it, but there is no github link.

it's conventional to leave these out to preserve anonymity during the peer review process; if one is available, it will be added in the final published version if it's accepted.

this article is still unpublished and was only submitted for peer review yesterday!

I also like papers with code, but linking an author’s github in a blind conference submission wouldn’t make a ton of sense.

What does blind submission mean?

In this case the review is, mostly (chances are the area chair knows who is who), double blind, meaning the reviewers don't know who wrote the paper _and_ the authors don't know the identity of the reviewers.

Well, if reviewers want to know, they can most likely make a good guess, without even using any social-engineering tricks.


Very debatable outcomes!

I mean, cool, but I don't see this as doing anything more than just memorizing answers -- at least a CAS performs transformations. Also, the time-limit metric is only necessary for making the comparison statistically significant in a reasonable amount of time. An apples-to-apples comparison would be letting both systems run until completion and then comparing accuracy, but that would take too long, because the algorithms that perform those transformations often have combinatoric runtimes.

good idea to use the cas to generate data though. I'm always trying to think up tasks that operate on procedurally generated data (e.g. inverse graphics), because then the labeled data is cheap.
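The backward trick used for the integration data can be sketched in a few lines (a toy representation, not the paper's): sample a random expression F, differentiate it symbolically, and keep (F', F) as a labeled (problem, answer) pair. Generation is cheap precisely because differentiation is mechanical.

```python
import random

def diff(node):
    """Differentiate w.r.t. 'x'. Nodes: 'x', float constants,
    ('+', a, b), ('*', a, b)."""
    if node == 'x':
        return 1.0
    if isinstance(node, float):
        return 0.0
    op, a, b = node
    if op == '+':
        return ('+', diff(a), diff(b))
    # product rule: (a*b)' = a'*b + a*b'
    return ('+', ('*', diff(a), b), ('*', a, diff(b)))

def rand_expr(depth=2):
    if depth == 0 or random.random() < 0.4:
        return random.choice(['x', 2.0, 3.0])
    return (random.choice(['+', '*']),
            rand_expr(depth - 1), rand_expr(depth - 1))

random.seed(1)
# each pair: (integrand, known antiderivative)
pairs = [(diff(F), F) for F in (rand_expr() for _ in range(3))]
```

A real generator would also simplify the derivative and deduplicate, so the model doesn't just see raw product-rule expansions.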

One potential use for this, even with it just memorizing answers in a clever way, would be to speed up a CAS. For problems that are taking the CAS a minute or two, it could be faster to have the model outlined in this paper propose solutions that the CAS simply has to check.

I mean, at the very least I assume that they can generate a function, check that it wasn't in the training set, then solve it and verify.

Then what would be your complaint?

I don't mean literally memorizing. "Memorization" is tongue-in-cheek for overfitting. I'm saying that the net won't actually generalize to classes it hasn't seen examples from (while reduction/transformation algorithms will).
