
Stupid Data Miner Tricks: Overfitting the S&P 500 - herrherr
http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf
======
rgbrgb
Does interpolation ever work in forecasting?

My gut instinct would be that markets and human systems are chaotic in nature.
Even in the most chaotic systems, if you look at a suitably small sample, you
can see some correlations and patterns between different factors which really
don't exist. These are mirage correlations.

Take the Lorenz attractor as an example. At some points, it will cycle on the
same "wing" of the butterfly many times. But betting that it will do it again
is a really lousy bet.
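
A rough sketch of why that bet is lousy: integrate the Lorenz system twice from starting points that differ by 1e-8 and watch the two runs drift apart. The parameters (sigma=10, rho=28, beta=8/3), the RK4 integrator, the step size and the starting point are all illustrative assumptions, not anything taken from the linked paper.

```python
def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz system by one classical Runge-Kutta (RK4) step."""
    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shifted(s, k, h):
        # s + h * k, component by component
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(shifted(state, k1, dt / 2))
    k3 = deriv(shifted(state, k2, dt / 2))
    k4 = deriv(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_over_time(steps=4000, dt=0.01, eps=1e-8):
    """Distance between two runs whose initial x differs by eps."""
    a = (1.0, 1.0, 1.0)            # assumed starting point
    b = (1.0 + eps, 1.0, 1.0)      # same point, nudged very slightly
    seps = []
    for _ in range(steps):
        seps.append(sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5)
        a = lorenz_step(a, dt)
        b = lorenz_step(b, dt)
    return seps


seps = separation_over_time()
print("initial separation:", seps[0])
print("largest separation seen:", max(seps))
```

The separation starts out tiny and ends up comparable to the size of the attractor itself, so a pattern seen in a short stretch of one run says little about what comes next.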

Polynomial approximation and curve fitting in general work when we're trying
to explicate relationships between variables in a problem space in which we
understand causal linkages very well (and they're constant) - it can be really
useful in engineering.

------
ck2
Via google PDF viewer

[https://docs.google.com/gview?url=http://nerdsonwallstreet.t...](https://docs.google.com/gview?url=http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf&pli=1)

~~~
imurray
Those that want Google PDF viewer links can get them added automatically
wherever they go. This greasemonkey script (of several available I'm sure)
adds a Google viewer icon after all pdf links:
[http://homepages.inf.ed.ac.uk/imurray2/code/user_scripts/goo...](http://homepages.inf.ed.ac.uk/imurray2/code/user_scripts/google_viewer.user.js)

------
imurray
Terrible generalization of polynomials is useful for demonstrating overfitting
(I've done it myself in tutorials). However, responsible tutorials should
mention that the other obvious lesson is that the polynomials (1, x, x², x³,
etc) are a _terrible_ set of basis functions for regression. Don't just watch
for overfitting, but use a sensible regression model! For complicated fits
some methods to consider are: local regression, splines, various artificial
neural nets, or Gaussian processes.
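
To make the point concrete, here is a minimal sketch under some assumptions: it uses exact interpolation (a degree-10 polynomial through 11 equally spaced points) rather than a least-squares regression, picks Runge's function 1/(1 + 25x²) as the test curve, and uses piecewise-linear interpolation as the simplest stand-in for a spline. All of those choices are mine, for illustration only.

```python
def runge(x):
    """Smooth test function (an assumption; any bell-shaped curve would do)."""
    return 1.0 / (1.0 + 25.0 * x * x)


def lagrange_eval(xs, ys, x):
    """Evaluate the unique degree len(xs)-1 polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def piecewise_linear_eval(xs, ys, x):
    """Join neighbouring points with straight lines (a degree-1 spline)."""
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x lies outside the range of the data")


# 11 equally spaced sample points on [-1, 1]
nodes = [-1.0 + 2.0 * i / 10 for i in range(11)]
values = [runge(x) for x in nodes]

# Dense grid on which both fits are compared against the true curve
grid = [-1.0 + 2.0 * k / 400 for k in range(401)]
poly_err = max(abs(lagrange_eval(nodes, values, x) - runge(x)) for x in grid)
line_err = max(abs(piecewise_linear_eval(nodes, values, x) - runge(x)) for x in grid)

print("max error, degree-10 polynomial:", poly_err)
print("max error, piecewise linear:   ", line_err)
```

Both fits pass through the same 11 points, but the global polynomial swings wildly near the ends of the interval while the local fit stays close to the curve everywhere.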

~~~
transphenomenal
So what is a good basis for polynomial regression? I have heard this statement
a few times, but I never heard of a good alternative.

~~~
imurray
I tried not to be _that guy_ and already gave some alternatives for
regression.

“Polynomial regression” implies to me that the basis functions are
polynomials. I’ll assume you meant “good basis for a simple fit, maybe by
least squares”. More local functions like “radial basis functions” can work
well. Or use splines or sigmoidal functions, which saturate to a flat line or
linear trend. In some applications Fourier or wavelet bases might be
appropriate.

Gaussian process regression is a Bayesian treatment of some basis function
models, potentially with an infinite number of basis functions. Artificial
neural nets usually use local or sigmoidal basis functions, potentially in a
more complicated way.
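
For what it's worth, a least-squares fit with radial basis functions can be sketched in a few dozen lines. Everything specific here is an assumption for illustration: Gaussian bumps, seven evenly spaced centres, a width of 0.2, a tiny ridge term to keep the normal equations well behaved, and a sine wave as the data.

```python
import math


def design_row(x, centers, width):
    """Constant term followed by one Gaussian bump per centre."""
    return [1.0] + [math.exp(-((x - c) / width) ** 2) for c in centers]


def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


def fit_least_squares(xs, ys, centers, width, ridge=1e-8):
    """Solve the (slightly regularised) normal equations for the weights."""
    Phi = [design_row(x, centers, width) for x in xs]
    k = len(centers) + 1
    AtA = [[sum(row[i] * row[j] for row in Phi) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    Atb = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(k)]
    return solve_linear(AtA, Atb)


def predict(x, weights, centers, width):
    return sum(w * p for w, p in zip(weights, design_row(x, centers, width)))


xs = [i / 40 for i in range(41)]                 # assumed data: one sine period
ys = [math.sin(2 * math.pi * x) for x in xs]
centers = [i / 6 for i in range(7)]              # assumed centres on [0, 1]
width = 0.2                                      # assumed bump width

weights = fit_least_squares(xs, ys, centers, width)
rms = math.sqrt(sum((predict(x, weights, centers, width) - y) ** 2
                    for x, y in zip(xs, ys)) / len(xs))
far_away = predict(10.0, weights, centers, width)

print("in-sample RMS error:", rms)
print("prediction far from the data:", far_away, "(constant term:", weights[0], ")")
```

Far from the data every bump has decayed to zero, so the prediction falls back to the constant term instead of shooting off to infinity the way a high-degree polynomial would.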

------
tropin
What's with the [scribd] tag when direct linking to a .pdf file? It's becoming
common, but I can't understand it.

~~~
vmind
It's just a convenience link for viewing the PDF on scribd rather than
downloading / using a plugin.

~~~
jimktrains2
But it's a direct link to a pdf, not to the scribd page.

~~~
scottbessler
There is a separate link inside the [scribd] tag.

------
zipstudio
"If the NFL wins, the market goes up, otherwise, it takes a dive. What’s
happened over the last thirty years? Well, most of the time, the NFL wins the
Superbowl"

Standards of editing have really gone down over the years. The "NFL" always
wins the Superbowl...

~~~
dandelany
> The "NFL" always wins the Superbowl...

It does now, but it didn't always. Super Bowls III and IV were won by AFL
teams (Jets & Chiefs) before the AFL/NFL merger in 1970. For several years
after the merger, it was quite common (though technically incorrect) to call
the newly created NFC & AFC "conferences" by their pre-merger acronyms.

------
streptomycin
TLDR: Correlation != causation; if you have high-dimensional data, you can
always find a correlation, but it's probably meaningless; polynomial wiggle is
a bitch, so don't fit high-degree polynomials to your data.
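
A quick sketch of the "you can always find a correlation" part, using nothing but random noise. The sizes (20 observations, 1000 candidate series) and the seed are arbitrary assumptions; none of the series has any real relationship to the target.

```python
import random


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)                 # fixed seed so the run is repeatable
n_samples, n_features = 20, 1000       # few observations, many candidates

# The "market" and every candidate "indicator" are independent noise.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
features = [[rng.gauss(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

best = max(abs(pearson(f, target)) for f in features)
print("strongest |correlation| found in pure noise:", best)
```

With this many candidates and this few observations, the best match looks impressive even though it is meaningless by construction.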

------
waqf
Related question: if, despite the warnings, I fancy my chances at _this sort of
thing_, what sort of historical data can I get? Is free [machine-readable]
stock market data easy to come by, or impossible?

~~~
hackerblues
I think you can get daily summaries quite easily, e.g.:

<http://www.google.com/finance/historical?q=NASDAQ:GOOG>

And from memory, when the markets are open you can see the order book:

<http://finance.yahoo.com/q/ecn?s=GOOG+Order+Book>

But I believe you have to buy the more detailed data from the exchange itself.
I have no idea how you do this but as far as I know it costs maybe a few
thousand per month.

Some more stuff:

<http://www.statslab.cam.ac.uk/~chris/links.html>

