
Eureqa - detecting equations and hidden mathematical relationships in data - phreeza
http://creativemachines.cornell.edu/eureqa
======
EvanMiller
Looks neat, but I'm not sure how useful this will be in practice. If your
problem domain seems to be linear, Eureqa won't offer any advantages over
linear regression. If your problem domain is non-linear, Eureqa won't be able
to answer the question of where the heck the sine and cosine terms are coming
from, and might induce people to make up fudgy theories to explain their
origin. I'm reminded of Arthur Eddington, who explained why the fine structure
constant "must" be 136, then, when better measurements were taken, explained why it
"must" be 137:

[http://en.wikipedia.org/wiki/A._S._Eddington#Fundamental_the...](http://en.wikipedia.org/wiki/A._S._Eddington#Fundamental_theory)

I could see Eureqa being useful 100 years ago, when everyone was scratching
their heads over the blackbody radiation data and nobody realized there was a
big fat e^(hν/kT) term in the denominator keeping us all from giving off infinite
quantities of gamma rays. (Thanks Planck.)
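
For reference, the formula Planck landed on (standard physics, nothing Eureqa would have told you the origin of) is

    B(\nu, T) = \frac{2 h \nu^3}{c^2} \cdot \frac{1}{e^{h\nu/kT} - 1}

and it's exactly that e^{hν/kT} in the denominator that keeps the high-frequency tail finite; drop it and you're back to the diverging Rayleigh-Jeans law.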

Anyway, not to be a hater, but in disciplines where statistical significance
is valued, Eureqa is useless, because by cherry-picking models it completely
invalidates significance tests.
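
As a toy illustration of the cherry-picking problem (my own sketch, nothing to do with Eureqa's internals): regress pure noise on many random candidate predictors, keep only the best-looking model, and the nominal 5% false-positive rate is gone.

    # Hypothetical illustration: searching many models on pure noise and then
    # reporting the best p-value wildly inflates the apparent "significance".
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, n_candidates, n_trials = 30, 50, 1000
    false_positives = 0
    for _ in range(n_trials):
        y = rng.normal(size=n)                    # no real relationship at all
        best_p = 1.0
        for _ in range(n_candidates):
            x = rng.normal(size=n)                # one random candidate model
            slope, intercept, r, p, se = stats.linregress(x, y)
            best_p = min(best_p, p)
        if best_p < 0.05:                         # report the cherry-picked fit
            false_positives += 1
    print(false_positives / n_trials)             # roughly 0.9, not 0.05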

~~~
tel
Who's to say what humans do is significantly different from what Eureqa does?
We constantly attack our models with new data, but we're effectively using data
to generate and test new models in the same way Eureqa is. It's certainly a
simplified process of equation generation, and model comparison is still a
largely open field of statistics, but the rudimentary procedure isn't
especially strange.

~~~
yid
Eureqa doesn't help in this regard. As the OP said, you will eventually find a
giant model that is a brilliant fit for the data. What you need is some notion
of regularization. To answer your comment, it doesn't deal with overfitting.

~~~
aterimperator
Except that it does. I'm not too familiar with the application itself, but
I know the original research worked by finding Pareto optima that trade off the
predictive power and the simplicity of each equation.

So for the double pendulum it came back with something like 8 equations, one of which was
conservation of momentum (not actually accurate, but as close as you can get
with that number of terms), and one of which was conservation of energy
(actually accurate, but a bit more complicated).
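
If I understand the paper correctly, that selection step is basically a Pareto filter over (error, complexity) pairs, something like this sketch (the candidate scores here are made up):

    # Keep every candidate equation not dominated by another that is both
    # at least as accurate and at least as simple (strictly better in one).
    def pareto_front(candidates):
        """candidates: list of (error, complexity, expression) tuples."""
        front = []
        for error, complexity, expr in candidates:
            dominated = any(
                e <= error and c <= complexity and (e < error or c < complexity)
                for e, c, _ in candidates
            )
            if not dominated:
                front.append((error, complexity, expr))
        return front

    candidates = [
        (0.50, 1, "x"),                  # crude but very simple
        (0.10, 3, "a*sin(x)"),           # good trade-off
        (0.09, 9, "a*sin(x) + b*x**3"),  # barely better, much more complex
        (0.10, 5, "a*sin(x) + c"),       # dominated, gets filtered out
    ]
    print(pareto_front(candidates))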

------
jamesbkel
I played around with this about a year or so ago. It's a neat little program,
but I'm not sure how practical it really is for fitting models in most cases.
It could be useful for some very light exploratory work, but one would have to
follow up with some more rigorous analysis.

That said, Andrew Gelman posted his thoughts on it a while back.

[http://www.stat.columbia.edu/~cook/movabletype/archives/2009...](http://www.stat.columbia.edu/~cook/movabletype/archives/2009/12/equation_search.html)
[http://www.stat.columbia.edu/~cook/movabletype/archives/2009...](http://www.stat.columbia.edu/~cook/movabletype/archives/2009/12/equation_search_1.html)

~~~
Dn_Ab
I remember reading about this some time back and being amazed. I forgot about
it in the meantime. Now I see it again and realize it's "just" genetic
programming. A month or so ago I spent about a day playing with genetic
programming in F#, so I decided to test it against Gelman's broken dataset in
the first link, with the same restriction: don't use pow or sqrt. What I have is not as
impressive or polished as theirs, but for the time I spent on it (not very
much) I am happy with its results.

After 5 mins it got stuck at MSE 0.1 with: x2 * x2/(x1 + x2) + (x - sin(2300)).
To test it properly I've generated another 40 points from the correct equation
and am going to leave it running overnight, since I'm too tired to wait for its full
cycle, which will take more than an hour.

Currently (17 mins in) it's at MSE 0.07 with (x2 + 0.7325/x2) - 0.6839 + x *
(x/(x+x2)). Another interesting, roughly equivalent expression it arrived at was: (x2 +
0.7325/x2) - cos(cos(sin(x2))) + x * (x/(x+x2)). I told it to stop at 0.025;
now I think it will hit that and I set the bar too low. Maybe.
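
For anyone curious what the core of such a thing looks like, here is a crude Python stand-in for the idea (plain random search over small expression trees rather than real GP with crossover and mutation, and toy data instead of Gelman's actual file):

    # Randomly generate expression trees over {x, x2, constants, +, -, *, /, sin, cos}
    # and keep whichever one has the lowest mean squared error on the data.
    import math, random

    random.seed(0)
    OPS2 = [("+", lambda a, b: a + b), ("-", lambda a, b: a - b),
            ("*", lambda a, b: a * b), ("/", lambda a, b: a / b if abs(b) > 1e-9 else 1e9)]
    OPS1 = [("sin", math.sin), ("cos", math.cos)]

    def random_expr(depth):
        """Return (string, function) for a random expression in the variables x and x2."""
        if depth == 0 or random.random() < 0.3:
            choice = random.choice(["x", "x2", "const"])
            if choice == "const":
                c = round(random.uniform(-2, 2), 2)
                return str(c), lambda x, x2, c=c: c
            return choice, (lambda x, x2: x) if choice == "x" else (lambda x, x2: x2)
        if random.random() < 0.25:
            name, op = random.choice(OPS1)
            s, f = random_expr(depth - 1)
            return name + "(" + s + ")", lambda x, x2, op=op, f=f: op(f(x, x2))
        name, op = random.choice(OPS2)
        s1, f1 = random_expr(depth - 1)
        s2, f2 = random_expr(depth - 1)
        return "(" + s1 + " " + name + " " + s2 + ")", \
               lambda x, x2, op=op, f1=f1, f2=f2: op(f1(x, x2), f2(x, x2))

    def mse(f, data):
        return sum((f(x, x2) - y) ** 2 for x, x2, y in data) / len(data)

    # toy data from an assumed target (x + x2^2), standing in for the real dataset
    data = [(x, x2, x + x2 ** 2) for x in range(1, 6) for x2 in (0.5, 1.0, 1.5)]
    best = min((random_expr(3) for _ in range(20000)), key=lambda sf: mse(sf[1], data))
    print(best[0], mse(best[1], data))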

------
anamax
Sounds like a GUI version of BACON.

<http://www.isle.org/~langley/discovery.html> contains some cites and
discussion.

------
convulsive
A critique: <http://www.sciencemag.org/content/324/5934/1515.3.full>

~~~
Dn_Ab
Mere mortals like myself cannot access that paper. It's behind a paywall, and I could
not find it on Google either.

~~~
marshray
I'll make a guess that it says something along the lines of: "This tool is
powerful and easy to use, making it dangerous in the hands of mere mortals. It
must be stopped."

~~~
wladimir
Close. More like 'when a machine does it, it's less useful than when a human
does it, because blablabla', basically because it doesn't have the "feeling".

------
NY_USA_Hacker
The claim that the fit gets to the "physics" of a pendulum is not well
supported: The physics involves the law of gravity, Newton's second law, and
aerodynamic drag as a function of velocity, and their work is very far from
these three basic inputs from physics.

Indeed, a good intermediate step would be just to recover the ordinary
differential equation initial value problem for a pendulum, the one that really is
from the physics and that their fit implicitly solves; they didn't do that.

Or if not a pendulum, then oscillations in basic A/C circuit theory with
resistors, capacitors, and inductors; they didn't do that either.
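
Concretely, "from the physics" means a second-order initial value problem; schematically (damping coefficients left generic) a damped pendulum and a source-free series RLC circuit look like

    \ddot{\theta} + \gamma \dot{\theta} + \frac{g}{\ell} \sin\theta = 0, \qquad \theta(0) = \theta_0, \; \dot{\theta}(0) = \omega_0

    L \ddot{q} + R \dot{q} + \frac{q}{C} = 0, \qquad q(0) = q_0, \; \dot{q}(0) = i_0

and both produce oscillatory traces that look much the same on a plot, which is exactly why the data alone cannot tell you which system generated it.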

Uh, data such as they started with ain't necessarily from a pendulum, and to
assume that it is is a serious mistake. In particular there is no way that a
computer program, or anything else, starting with just the data, can conclude
a pendulum instead of an A/C circuit or something else. Sorry guys.

The role of 'complexity' in the expressions very much needs to be made clear.
This is especially the case since otherwise one can fit such data exactly with just
Lagrange interpolation (a polynomial) or splines.
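
For the record, the Lagrange interpolating polynomial through points (x_0, y_0), ..., (x_n, y_n) is

    P(x) = \sum_{i=0}^{n} y_i \prod_{j \ne i} \frac{x - x_j}{x_i - x_j}

which passes through any n+1 data points exactly, so a near-zero fitting error by itself says nothing; it's the complexity side of the trade-off that has to carry the weight.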

The whole idea of the program encounters one of the biggest old problems in
science: Starting with the observational data on the motions of the planets,
Ptolemy had some complicated mechanical contraptions that fit the data fairly
well (we will set aside some of the accusations that he, uh, 'smoothed' his
data, that is, had an early use of Kelly's variable constant and Finkel's
fudge factor).

The big problem in Ptolemy's work was his complicated contraptions didn't seem
to be from any 'physical laws'. Of course, he didn't know that any suitable
physical laws existed.

Big, huge steps later were from Copernicus, Kepler, etc. by which time it was
clear that, looked at in the right coordinate system, the planets were moving
in ellipses with the sun at one focus of each ellipse.

Then Newton assumed his second law and law of gravity, invented calculus, and
used the three to derive the ellipses. That's why we think that Newton was one
of the greatest of all scientists. The computer program is suggesting that it
is hoping to automate such work; ROFL.

The claim that the program can find function f so that x = f(t, x) is not so
amazing: Just set f(t,x) = x. Then perfectly x = x. Semi-amazing.

The example was for only one independent variable. More important cases are
for several independent variables and several dependent variables. That is,
for the real numbers R and positive integers m and n, find function f: R^m -->
R^n so that for independent variables x in R^m and dependent variables y in
R^n we get f(x) close to y. Pull this off with any generality, execution
efficiency, and usefulness, and I'll start to be impressed.

To be more impressive than now, they might just look for linear f: R^m --> R
and show why their work is better than old step-wise regression. There they
need to address the conclusions, now over 40 years old, that step-wise
regression didn't work very well and that doing all possible regressions took too
long and gave a mess.
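
For comparison, old step-wise regression is nothing fancier than a greedy loop like this sketch (my own code; the data here are synthetic):

    # Forward step-wise selection: at each step add the column that most reduces
    # the residual sum of squares of an ordinary least-squares fit.
    import numpy as np

    def forward_stepwise(X, y, max_terms=3):
        n, m = X.shape
        selected = []
        for _ in range(max_terms):
            best_j, best_rss = None, np.inf
            for j in (j for j in range(m) if j not in selected):
                A = np.column_stack([np.ones(n)] + [X[:, k] for k in selected + [j]])
                beta, *_ = np.linalg.lstsq(A, y, rcond=None)
                rss = float(np.sum((y - A @ beta) ** 2))
                if rss < best_rss:
                    best_j, best_rss = j, rss
            selected.append(best_j)
        return selected

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 10))
    y = 2 * X[:, 3] - X[:, 7] + rng.normal(scale=0.1, size=100)
    print(forward_stepwise(X, y))   # columns 3 and 7 should be picked first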

Generally they have the cart before the horse: The researcher is supposed to
bring some understanding or at least some conjectures to the work and let the
numerical work just fill in the details. Doing the numerical work first is
bass ackwards.

The idea of a 'complexity measure' to select among different formulas is close
to some old moon worship, black magic, or superstition about the magical
properties of selected frogs and mushrooms.

Setting aside all the hype and nonsense, where could such work be used? Well,
the answer is the same as has been the case for decades: It's a curve fitting
program and, thus, could be used where curve fitting programs have been
useful:

First, it can do the fitting arithmetic when the researcher brings to the
computer some 'form' of the formula and is just looking for some constants and
NOT the formula.

Second, it can help when there is some data and one wants a simple expression that 'compresses'
the data. That can be useful, say, in parts of stochastic dynamic programming;
see, e.g., D. Bertsekas at MIT where he used neural nets for such
'compression'. That is, just storing the data might be 10^n bytes for n = 20,
30, 40, ..., but a few formulas might do just as well for big savings.
Generally splines and multivariate splines have been among the leading
approaches, but I didn't see splines mentioned in their video.
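
The 'compression' use looks roughly like this in practice (toy data; the smoothing parameter is just a guess):

    # Replace 10,000 stored samples with the handful of coefficients of a
    # smoothing spline that reproduces them to within the noise level.
    import numpy as np
    from scipy.interpolate import UnivariateSpline

    x = np.linspace(0, 10, 10_000)
    y = np.sin(x) + 0.01 * np.random.default_rng(0).normal(size=x.size)

    spline = UnivariateSpline(x, y, s=x.size * 0.01 ** 2)   # smoothing spline fit
    print(len(spline.get_coeffs()), "coefficients in place of", x.size, "samples")
    print(float(np.max(np.abs(spline(x) - y))))             # worst-case reconstruction error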

The scary, upsetting, objectionable parts are the claims that they are
"detecting equations and hidden mathematical relationships in data" and, as in
the video, getting at the "physics". No they are not. Here their hype reeks of
the usual in 'artificial intelligence' and 'machine learning' that their
software can 'learn' and be 'intelligent', and that's so far, and likely for a
long time, nonsense.

Another objectionable part is their theme of setting aside all the
assumptions, definitions, theorems, and proofs of statistical estimation of
parameters in statistical models, hypothesis tests on the parameters,
confidence intervals on the parameters and the predicted values from the fits,
etc. How do they do that? They just leave it all out!

They have an option of 'smoothing' the data: That's a big subject. So, the
implication is that researchers might just click on the button that says
"Smooth" and report that in their work. Then to 'smooth' becomes some
unlabeled bottle of snake oil.

The theme here is to omit all the rational and deductive framework of applied
math and replace it with GUIs: This is a future of irrationality with no hope.

This work is a special case of the more general situation that 'computer
science' is getting a C- in applied math and is out'a gas.

The work fills a much needed gap in the literature and would be illuminating
if ignited.

~~~
sciboy
I used to love data mining. Then I formally learned statistics, causal
inference, MCMC and other similar methods. I was amazed at how poor the
computer science tools were for anything outside of computer science problems (e.g.
search, collaborative filtering, etc.). Looking back I now realise that my
amazement was misplaced - tools are good for what they are designed for. How
many computer scientists who write the tools run experiments or do _real_
exploratory data analysis with data they have collected?

Agree completely with their model selection criteria; model selection is
useful when used to compare evidence for k physically plausible models, but
should be treated with extreme caution when exploring an infinite model
space.

Personally I hold very little hope for automated tools anymore. Considering
how complex a seemingly simple study is to analyse correctly, or how hard it
is to model physically realistic processes, I think the future does not bode
well for tools like Eureqa as they are currently aimed. At best Eureqa may
present some general hypotheses, but it seems unlikely to be able to search
the model space for models that are physically plausible, except in the
simplest of cases.

Besides, try Eureqa on some of your real data and prepare to be deflated :)

