Looks neat, but I'm not sure how useful this will be in practice. If your problem domain seems to be linear, Eureqa won't offer any advantages over linear regression. If your problem domain is non-linear, Eureqa won't be able to answer the question of where the heck the sine and cosine terms are coming from, and might induce people to make up fudgy theories to explain their origin. I'm reminded of Arthur Eddington, who explained why the fine-structure constant "must" be 136, then when better measurements were taken explained why it "must" be 137:

http://en.wikipedia.org/wiki/A._S._Eddington#Fundamental_the...
I could see Eureqa being useful 100 years ago, when everyone was scratching their heads over the blackbody radiation data and nobody realized there was a big fat e^nu term in the denominator keeping us all from giving off infinite quantities of gamma rays. (Thanks Planck.)
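To spell out the term in question (this is just standard physics, nothing from Eureqa): Planck's law is

    B_\nu(T) = \frac{2h\nu^3}{c^2} \cdot \frac{1}{e^{h\nu/kT} - 1}

and it's that e^{h\nu/kT} growing in the denominator that kills off the ultraviolet catastrophe the classical Rayleigh-Jeans law predicts.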
Anyway, not to be a hater, but in disciplines where statistical significance is valued, Eureqa is useless, because by cherry-picking models it completely invalidates significance tests.
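A toy simulation of that cherry-picking problem (made-up data, and of course not Eureqa's actual internals):

    # Fit pure noise against many candidate predictors, keep the best one,
    # and its naive p-value looks "significant" despite zero real signal.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, n_candidates = 50, 200
    y = rng.normal(size=n)                        # pure noise: no true effect
    X = rng.normal(size=(n, n_candidates))        # 200 candidate predictors

    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(n_candidates)]
    print("best naive p-value:", min(pvals))      # routinely < 0.05 by luck alone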
In some disciplines I could see it being useful. (Astronomy comes to mind, but only because I happen to be an astronomer.) Oftentimes we need an empirical fit to some data and we don't really care why exactly the fit has the form it does. For instance, you might want to know what the density of a galaxy cluster is as a function of radius. Perhaps you just have an obsession with density profiles, but more likely you need to know what the density profile is for some other purpose (maybe you're looking at the evolution of radio jets in the cluster). In this case you don't really care if your density profile has the correct theoretical functional form that a density profile should have; you just care that the empirical fit you use is a close match to the data.
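For what it's worth, that kind of purely empirical fit is a few lines once you've picked a form; here's a sketch using the usual beta-model profile (the data and parameters are made up):

    # Empirical density-profile fit: pick a standard beta-model form and
    # only ask that it track the data, not why it works.
    import numpy as np
    from scipy.optimize import curve_fit

    def beta_model(r, rho0, rc, beta):
        return rho0 * (1.0 + (r / rc) ** 2) ** (-1.5 * beta)

    rng = np.random.default_rng(0)
    r = np.linspace(0.05, 2.0, 40)                 # radius, arbitrary units
    rho = beta_model(r, 1.0, 0.2, 0.7) * (1 + 0.05 * rng.normal(size=r.size))

    params, _ = curve_fit(beta_model, r, rho, p0=(1.0, 0.1, 0.6))
    print(params)    # (rho0, rc, beta): good enough to feed the jet model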
Who's to say what humans do is significantly different from what Eureqa does? We constantly attack our models with new data, but effectively we are using data to generate and test new models in the same way Eureqa is. It's certainly a simplified process of equation generation, and model comparison is a largely open field of statistics, but the rudimentary procedure isn't especially strange.
Eureqa doesn't help in this regard. As the OP said, you will eventually find a giant model that is a brilliant fit for the data. What you need is some notion of regularization. To answer your comment, it doesn't deal with overfitting.
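By "some notion of regularization" I mean scoring candidate models on fit and size together, not fit alone. A minimal sketch, with polynomial degree standing in for model complexity and AIC as the penalty (made-up data):

    # Score models by fit AND size: AIC picks among polynomial degrees.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 40)
    y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)   # truth: degree 1

    for deg in range(6):
        coeffs = np.polyfit(x, y, deg)
        rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
        k = deg + 1                                      # number of parameters
        aic = x.size * np.log(rss / x.size) + 2 * k      # Gaussian-noise AIC
        print(deg, round(aic, 1))    # bigger models fit better but score worse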
Except that it does. I'm not too familiar with the application itself; however, I know the original research worked by finding Pareto optima based on the predictive power and the simplicity of the equation.
So for the double pendulum it came back with something like 8 equations, one of which was conservation of momentum (not actually accurate, but as close as you can get with that number of terms), and one of which was conservation of energy (actually accurate, but a bit more complicated).
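If I understand the paper right, that selection step is just a Pareto front over (complexity, error) pairs; something like this sketch (the candidate list is a hypothetical stand-in):

    # Keep every candidate equation that no other candidate beats on BOTH
    # complexity and error. Candidates below are hypothetical placeholders.
    candidates = [
        ("theta1' * theta2'",       2, 0.90),   # (expr, complexity, error)
        ("momentum-like expr",      5, 0.30),
        ("redundant 9-term expr",   9, 0.40),   # dominated: drops out
        ("energy-like expr",        9, 0.05),
        ("giant 40-term monster",  40, 0.04),
    ]

    def pareto_front(cands):
        front = []
        for expr, c, e in cands:
            dominated = any(c2 <= c and e2 <= e and (c2, e2) != (c, e)
                            for _, c2, e2 in cands)
            if not dominated:
                front.append((expr, c, e))
        return front

    for expr, c, e in pareto_front(candidates):
        print(f"complexity={c:2d}  error={e:.2f}  {expr}")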
But isn't the purpose of cross-validation to avoid over-fitting and cherry-picking? Asked another way: how does your argument specifically single out Eureqa, instead of all of machine learning?
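For reference, the cross-validation I mean, in miniature (data made up):

    # Choose model complexity by error on held-out folds, not training fit.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 60)
    y = np.sin(2 * x) + 0.2 * rng.normal(size=x.size)

    def cv_error(deg, k=5):
        idx = np.arange(x.size)
        errs = []
        for fold in range(k):
            test = idx % k == fold
            coeffs = np.polyfit(x[~test], y[~test], deg)
            errs.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
        return np.mean(errs)

    for deg in (1, 3, 9, 15):
        print(deg, round(cv_error(deg), 4))   # big degrees win in-sample, lose here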
Eureqa does model selection as part of the optimization, so that would be an advantage over vanilla linear regression. Of course, there are a lot of specialized model selection techniques for linear regression that would probably be better. Using Eureqa for a linear model doesn't really make sense anyway. I think it would be mainly useful for exploratory data analysis.
I played around with this about a year or so ago. It's a neat little program, but I'm not sure how practical it really is for fitting models in most cases. It could be useful for some very light exploratory work, but one would have to follow up with some more rigorous analysis.
That said, Andrew Gelman posted his thoughts on it a while back.
I remember reading about this some time back and being amazed. I forgot about it in the meantime. Now I see it and I'm like, wait, that's "just" genetic programming. A month or so ago I spent about a day playing with genetic programming in F#, so I decided to test it against Gelman's broken dataset in link1 with the same restriction: don't use pow or sqrt. Now, what I have is not as impressive or polished as theirs, but for the time I spent on it (not very much) I am happy with its result.
After 5 mins it got stuck at MSE 0.1 with: x2*x2/(x1 + x2) + (x - sin(2300)). To test it properly I've built another 40 points of data from the correct equation and am going to leave it running overnight, since I'm too tired to wait for its full cycle, which will take more than an hour.
Currently (17 mins in) it's at MSE 0.07 with (x2 + 0.7325/x2) - 0.6839 + x*(x/(x+x2)). Another interesting (roughly equivalent) expression it arrived at was (x2 + 0.7325/x2) - cos(cos(sin(x2))) + x*(x/(x+x2)). I told it to stop at 0.025; now I think it will hit it and I set the bar too low. Maybe.
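For anyone curious what the guts of such a thing look like, here's a stripped-down mutate-and-keep-the-best loop (in Python rather than my F#, and much cruder than either Eureqa or what I actually ran):

    # Toy symbolic regression: random expression trees, scored by MSE,
    # improved by subtree mutation. A (1+1) hill climb, not Eureqa's algorithm.
    import math, random

    OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "sin": lambda a: math.sin(a)}

    def rand_expr(depth=3):
        if depth == 0 or random.random() < 0.3:
            return random.choice(["x", round(random.uniform(-2, 2), 2)])
        name = random.choice(list(OPS))
        arity = 1 if name == "sin" else 2
        return [name] + [rand_expr(depth - 1) for _ in range(arity)]

    def evaluate(e, x):
        if e == "x": return x
        if isinstance(e, float): return e
        return OPS[e[0]](*(evaluate(a, x) for a in e[1:]))

    def mse(e, data):
        try:
            return sum((evaluate(e, x) - y) ** 2 for x, y in data) / len(data)
        except (OverflowError, ValueError):
            return float("inf")

    def mutate(e):
        if random.random() < 0.3 or not isinstance(e, list):
            return rand_expr(2)                  # replace a whole subtree
        new = e[:]
        i = random.randrange(1, len(e))
        new[i] = mutate(new[i])
        return new

    random.seed(0)
    data = [(x / 10, math.sin(x / 10) + x / 10) for x in range(-30, 31)]
    best, best_err = rand_expr(), float("inf")
    for _ in range(20000):
        cand = mutate(best)
        err = mse(cand, data)
        if err <= best_err:
            best, best_err = cand, err
    print(best_err, best)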
The 3 April issue contained two Reports about automated science (“Distilling free-form natural laws from experimental data,” M. Schmidt and H. Lipson, p. 81, and “The automation of science,” R. D. King et al., p. 85). These Reports are seriously mistaken about the nature of the scientific enterprise, particularly regarding what theorists do and the meaning of physical law. As Thomas Kuhn famously argued, what most scientists do most of the time—which he called “normal science” and Rutherford called “stamp collecting”—does not contribute very much to the advancement of knowledge; rather, this normal science simply fleshes out the consequences of the paradigms that have been established by truly revolutionary advances. Even if machines did contribute to normal science, we see no mechanism by which they could create a Kuhnian revolution and thereby establish new physical law.
In the Report by Schmidt and Lipson, a machine deduces the equation behind a sample of chaotic motion. The discovery of deterministic chaos is an example of a true Kuhnian revolution; other examples were its application to unexpected fields like meteorology and population biology. In the constrained problem in the Report, the relevant physical law and variables are known in advance; it is hardly a template for the creative, exploratory nature of true science.
I'll make a guess that it says something along the lines of: "This tool is powerful and easy to use making it dangerous in the hands of mere mortals. It must be stopped."
Close. More like 'when a machine does it it's less useful than when a human does it because blablabla', basically because it doesn't have the "feeling".
The claim that the fit gets to the "physics" of a pendulum is not well supported: The physics involves the law of gravity, Newton's second law, and aerodynamic drag as a function of velocity, and their work is very far from these three basic inputs from physics.
Indeed, a good intermediate step would be just to recover the ordinary differential equation initial value problem for a pendulum, the one that really does come from the physics and that their fit solves; they didn't do that.
Or if not a pendulum, then oscillations in basic A/C circuit theory with resistors, capacitors, and inductors; they didn't do that either.
Uh, data such as they started with ain't necessarily from a pendulum, and to assume that it is is a serious mistake. In particular there is no way that a computer program, or anything else, starting with just the data, can conclude a pendulum instead of an A/C circuit or something else. Sorry guys.
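To be concrete: linearize the damped pendulum and write down the series RLC loop equation, and you get the very same ODE, so trajectory data alone can't tell you which system produced it:

    \ddot{\theta} + \frac{c}{mL^{2}}\,\dot{\theta} + \frac{g}{L}\,\theta = 0   (pendulum, small angles)

    L_{c}\,\ddot{q} + R\,\dot{q} + \frac{1}{C}\,q = 0   (series RLC circuit)

Both are instances of \ddot{x} + 2\zeta\omega_{0}\,\dot{x} + \omega_{0}^{2}\,x = 0; only the names on the constants differ.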
The role of 'complexity' in the expressions very much needs to be made clear. This is especially the case since otherwise one can fit such data with just Lagrange interpolation (a polynomial) or splines.
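E.g., with enough terms you can fit ANY finite dataset exactly, physics or no physics (made-up data):

    # Lagrange interpolation: a degree-9 polynomial through 10 noise points.
    import numpy as np
    from scipy.interpolate import lagrange

    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 10)
    y = rng.normal(size=10)               # pure noise, no law behind it

    poly = lagrange(t, y)                 # passes through every point
    print(np.max(np.abs(poly(t) - y)))    # ~0: a "perfect" fit, zero physics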
The whole idea of the program encounters one of the biggest old problems in science: Starting with the observational data on the motions of the planets, Ptolemy had some complicated mechanical contraptions that fit the data fairly well (we will set aside some of the accusations that he, uh, 'smoothed' his data, that is, had an early use of Kelly's variable constant and Finkel's fudge factor).
The big problem in Ptolemy's work was his complicated contraptions didn't seem to be from any 'physical laws'. Of course, he didn't know that any suitable physical laws existed.
Big, huge steps later were from Copernicus, Kepler, etc. by which time it was clear that, looked at in the right coordinate system, the planets were moving in ellipses with the sun at one focus of each ellipse.
Then Newton assumed his second law and law of gravity, invented calculus, and used the three to derive the ellipses. That's why we think that Newton was one of the greatest of all scientists. The computer program is suggesting that it is hoping to automate such work; ROFL.
The claim that the program can find function f so that x = f(t, x) is not so amazing: Just set f(t,x) = x. Then perfectly x = x. Semi-amazing.
The example was for only one independent variable. More important cases are for several independent variables and several dependent variables. That is, for the real numbers R and positive integers m and n, find function f: R^m --> R^n so that for independent variables x in R^m and dependent variables y in R^n we get f(x) close to y. Pull this off with any generality, execution efficiency, and usefulness, and I'll start to be impressed.
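For calibration, the linear version of that multivariate problem is a one-liner; it's the general non-linear case that's hard (data here is synthetic):

    # Fit linear f: R^m -> R^n by least squares, all n outputs in one solve.
    import numpy as np

    rng = np.random.default_rng(1)
    N, m, n = 200, 5, 3
    X = rng.normal(size=(N, m))                      # independent variables
    B_true = rng.normal(size=(m, n))
    Y = X @ B_true + 0.1 * rng.normal(size=(N, n))   # dependent variables

    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(np.abs(B_hat - B_true).max())              # small: linear is the easy case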
To be more impressive than now, they might just look for linear f: R^m --> R and show why their work is better than old step-wise regression. There they need to address the conclusions now over 40 years old that step-wise regression didn't work very well, and doing all possible regressions took too long and gave a mess.
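Step-wise regression, for anyone who hasn't met it, is just a greedy loop like this, and its greediness is exactly the old complaint (sketch on made-up data):

    # Forward step-wise regression: greedily add the column that most
    # reduces residual error. Famously fragile.
    import numpy as np

    def forward_stepwise(X, y, k):
        chosen = []
        for _ in range(k):
            best, best_err = None, np.inf
            for j in range(X.shape[1]):
                if j in chosen:
                    continue
                cols = X[:, chosen + [j]]
                coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
                err = np.sum((y - cols @ coef) ** 2)
                if err < best_err:
                    best, best_err = j, err
            chosen.append(best)
        return chosen

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 20))
    y = 2 * X[:, 3] - X[:, 7] + rng.normal(size=100)
    print(forward_stepwise(X, y, 2))   # usually [3, 7], but greed can misfire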
Generally they have the cart before the horse: The researcher is supposed to bring some understanding or at least some conjectures to the work and let the numerical work just fill in the details. Doing the numerical work first is bass ackwards.
The idea of a 'complexity measure' to select among different formulas is close to some old moon worship, black magic, or superstition about the magical properties of selected frogs and mushrooms.
Setting aside all the hype and nonsense, where could such work be used? Well, the answer is the same as has been the case for decades: It's a curve fitting program and, thus, could be used where curve fitting programs have been useful:
First, it can do the fitting arithmetic when the researcher brings to the computer some 'form' of the formula and is just looking for some constants and NOT the formula.
Second, if there is some data and you want a simple expression that 'compresses' the data. That can be useful, say, in parts of stochastic dynamic programming; see, e.g., D. Bertsekas at MIT where he used neural nets for such 'compression'. That is, just storing the data might be 10^n bytes for n = 20, 30, 40, ..., but a few formulas might do just as well for big savings. Generally splines and multivariate splines have been among the leading approaches, but I didn't see splines mentioned in their video.
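That 'compression' use in practice looks like this: trade a big table of samples for a handful of spline coefficients (sketch, synthetic data):

    # Replace 100,000 stored samples with a few dozen spline coefficients.
    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(3)
    x = np.linspace(0, 10, 100_000)          # pretend this table is costly to store
    y = np.sin(x) + 0.01 * rng.normal(size=x.size)

    spl = UnivariateSpline(x, y, s=x.size * 0.01 ** 2)   # smoothing spline
    print(len(spl.get_coeffs()), "coefficients instead of", x.size, "samples")
    print(np.max(np.abs(spl(x) - np.sin(x))))            # still close to the curve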
The scary, upsetting, objectionable parts are the claims that they are "detecting equations and hidden mathematical relationships in data" and, as in the video, getting at the "physics". No they are not. Here their hype reeks of the usual in 'artificial intelligence' and 'machine learning' that their software can 'learn' and be 'intelligent', and that's so far, and likely for a long time, nonsense.
Another objectionable part is their theme of setting aside all the assumptions, definitions, theorems, and proofs of statistical estimation of parameters in statistical models, hypothesis tests on the parameters, confidence intervals on the parameters and the predicted values from the fits, etc. How do they do that? They just leave it all out!
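For contrast, here is the most basic version of what they leave out: a fit that reports parameter uncertainty, not just a curve (model and data hypothetical):

    # Parameter estimates WITH approximate 95% confidence intervals.
    import numpy as np
    from scipy.optimize import curve_fit

    def model(t, a, b):
        return a * np.exp(-b * t)

    rng = np.random.default_rng(4)
    t = np.linspace(0, 5, 50)
    y = model(t, 2.0, 0.7) + 0.05 * rng.normal(size=t.size)

    p, cov = curve_fit(model, t, y)
    se = np.sqrt(np.diag(cov))               # standard errors from the covariance
    for name, est, s in zip(("a", "b"), p, se):
        print(f"{name} = {est:.3f} +/- {1.96 * s:.3f}")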
They have an option of 'smoothing' the data: That's a big subject. So, the implication is that researchers might just click on the button that says "Smooth" and report that in their work. Then to 'smooth' becomes some unlabeled bottle of snake oil.
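To see how much rides inside that one button, run the same smoother with three different window sizes on the same data (made-up example):

    # Same "Smooth" button, three different answers.
    import numpy as np
    from scipy.signal import savgol_filter

    rng = np.random.default_rng(5)
    x = np.linspace(0, 4 * np.pi, 200)
    y = np.sin(x) + 0.5 * rng.normal(size=x.size)

    for window in (5, 51, 151):              # the unreported choice
        ys = savgol_filter(y, window_length=window, polyorder=3)
        print(window, round(float(np.max(np.abs(ys - np.sin(x)))), 3))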
The theme here is to omit all the rational and deductive framework of applied math and replace it with GUIs: This is a future of irrationality with no hope.
This work is a special case of the more general situation that 'computer science' is getting a C- in applied math and is out'a gas.
The work fills a much needed gap in the literature and would be illuminating if ignited.
I used to love data mining. Then I formally learned statistics, causal inference, MCMC and other similar methods. I was amazed at how poor the computer science was for anything outside of computer science problems (e.g. search, collaborative filtering, etc.). Looking back I now realise that my amazement was misplaced - tools are good for what they are designed for. How many computer scientists who write the tools run experiments or do real exploratory data analysis with data they have collected?
Agree completely with their model selection criteria; model selection is useful when used to compare evidence for k physically plausible models, but should be treated with extreme caution when exploring an infinite model space.
Personally I hold very little hope for automated tools anymore. Considering how complex a seemingly simple study is to analyse correctly, or how hard it is to model physically realistic processes, I think the future does not bode well for tools like Eureqa as they are currently aimed. At best Eureqa may present some general hypothesis, but it seems unlikely to be able to search the model space for models that are physically plausible, except in the most simple of all cases.
Besides, try Eureqa on some of your real data and prepare to be deflated :)
>Then Newton assumed his second law and law of gravity, invented calculus, and used the three to derive the ellipses. That's why we think that Newton was one of the greatest of all scientists. The computer program is suggesting that it is hoping to automate such work; ROFL.
Exactly. Most of the solution is spoon-fed into the algorithm in the form of the fitness function and the terminals. The actual work done by the learning mechanism is trivial, non-generalizable and scales exponentially.
I don't even think that the researchers are overoptimistic in this case. They are simply being dishonest. The shortcomings of this line of research are obvious to anyone who's spent a few days with evolutionary computation.