Here is a quote from Segrè's new biography of Fermi: "When Dyson met with him in 1953, Fermi welcomed him politely, but he quickly put aside the graphs he was being shown indicating agreement between theory and experiment. His verdict, as Dyson remembered, was 'There are two ways of doing calculations in theoretical physics. One way, and this is the way I prefer, is to have a clear physical picture of the process you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.' When a stunned Dyson tried to counter by emphasizing the agreement between experiment and the calculations, Fermi asked him how many free parameters he had used to obtain the fit. Smiling after being told 'Four,' Fermi remarked, 'I remember my old friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.' There was little to add." (p. 273)
One can fit an elephant with four free parameters. That's the story of these curve fitting exercises.
This is out-of-sample performance; only four models were compared outside the training domain. Also, I don't see where you see 22 parameters, and in any case the complexity of a model is not defined by its number of parameters.
In regression and curve fitting, you can add extra, often spurious, parameters and get a closer fit. A better fit does not mean you have a better model; it usually means the opposite: you've overfitted the model, and it's useless for forecasting and predictive purposes. This is why in machine learning, researchers are careful to separate training and testing data. This is just physicists roasting each other over the same issue.
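To make the point concrete, here is a minimal sketch (plain NumPy, synthetic data, arbitrarily chosen polynomial degrees, nothing specific to the article) of how adding parameters keeps shrinking the training error while the held-out error eventually gets worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying function.
x = np.linspace(-1, 1, 40)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Hold out every other point as a test set; fit only on the training points.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

for degree in (1, 3, 9, 15):
    # Each extra polynomial degree adds one more free parameter.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Typically the training error keeps falling while the test error rises.
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```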
I would expect to be able to get 100% (on the training data) on this task. Shouldn't any sequence of election victories be easily representable with a sum of sine functions?
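For what it's worth, that intuition checks out: any finite 0/1 sequence can be reproduced exactly by a sum of sines, because the discrete sine vectors form a basis. A quick sketch with a made-up sequence standing in for election results:

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up sequence of 25 binary "election outcomes".
outcomes = rng.integers(0, 2, size=25)
n = outcomes.size

# Sine features sin(pi * t * k / (n + 1)) for t, k = 1..n form an invertible
# (discrete sine transform) matrix, so the system can be solved exactly.
t = np.arange(1, n + 1)
k = np.arange(1, n + 1)
X = np.sin(np.pi * np.outer(t, k) / (n + 1))

# One coefficient per sine term reproduces the training data exactly.
coeffs = np.linalg.solve(X, outcomes.astype(float))
fitted = X @ coeffs

print("max abs error on training data:", np.max(np.abs(fitted - outcomes)))
print("training accuracy:", np.mean((fitted > 0.5) == outcomes))
```

Of course, hitting 100% on the training data this way says nothing about the next election, which is the point several other comments are making.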
I feel like the message I'm supposed to take from this demo is "Look how good TuringBot is! They can automatically find a function to match this data!" But the actual message I'm getting is "Symbolic regression is too hard for TuringBot!"
This is a Pareto optimization with a limit on the size of the formulas. Sure, with a formula of arbitrary size you can fit anything with 100% accuracy, but the task is much harder if what you are looking for is a short model.
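Roughly, that kind of selection keeps only formulas that are not beaten by a simpler-and-at-least-as-accurate alternative. A toy sketch of the rule (the candidate formulas, complexity scores, and errors below are all made up, not TuringBot's actual output):

```python
from typing import List, Tuple

# Hypothetical candidates: (formula, complexity score, training error).
candidates: List[Tuple[str, int, float]] = [
    ("c0", 1, 0.48),
    ("sin(c0*x)", 2, 0.31),
    ("c0*x + c1", 3, 0.35),
    ("c0 + c1*sin(c2*x)", 5, 0.19),
    ("some much longer formula", 22, 0.02),
]

def pareto_front(items: List[Tuple[str, int, float]]) -> List[Tuple[str, int, float]]:
    """Keep formulas not dominated by a simpler-or-equal, at-least-as-accurate one."""
    front = []
    for name, size, err in items:
        dominated = any(
            s <= size and e <= err and (s, e) != (size, err)
            for _, s, e in items
        )
        if not dominated:
            front.append((name, size, err))
    return front

for name, size, err in pareto_front(candidates):
    print(f"complexity {size:2d}  error {err:.2f}  {name}")
```

Here "c0*x + c1" gets dropped because a shorter formula fits at least as well; everything else survives on the front, trading size against error.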
This commits the same fallacy that shows up all over the place with time-series data. It bites almost anybody who tries to use time-series methods to predict equity returns.
Since you have a literal dependency between one data point and the next, you can't train your model using randomized data. So the default people jump to is to segment the data, with earlier data as the training set and later data as the test set.
If your data is the result of a very well-understood and controlled process, this works fabulously (see ARIMA and its variants). But the more the data is driven by significant noise, whether pure randomness or Brownian motion (very likely the case with elections), the more spectacularly that methodology breaks down. All you end up doing is overfitting to your test set instead of your training set.
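For reference, the sequential segmentation being described is what scikit-learn calls a time-series (walk-forward) split; a minimal sketch on synthetic data (a random walk, so mostly noise, as the comment assumes for elections):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(2)

# Synthetic series: a random walk, i.e. dominated by noise rather than structure.
y = np.cumsum(rng.normal(size=200))

# Toy autoregressive setup: predict each value from the previous one.
X = y[:-1].reshape(-1, 1)
target = y[1:]

# Walk-forward evaluation: each test fold lies strictly after its training data.
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    model = LinearRegression().fit(X[train_idx], target[train_idx])
    mse = np.mean((model.predict(X[test_idx]) - target[test_idx]) ** 2)
    print(f"fold {fold}: train size {len(train_idx):3d}, test MSE {mse:.3f}")
```

Every test fold lies strictly after its training data, so nothing leaks from the future; the warning above is that if you keep iterating on model choices against those folds, you end up overfitting them anyway.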
>Since you have a literal dependency between one data point and the next, you can't train your model using randomized data
This is why the train/test split was not randomized, but sequential.
This is an out-of-sample result; four formulas were compared. Sure, it may be overfit, as any machine learning model can be, but the procedure used to generate this formula cannot be so easily dismissed.
As a rule of thumb, you cannot. I have tried to fit the Mona Lisa with the software (brightness as a function of x and y) and could not find anything. It performs poorly when there is no structure in the data.
Maybe the other commenters tearing into this are on a higher plane of irony than I am, but I want to point out that I'm 99% sure the article is meant as a joke.
Well, I really hope so. It doesn't really read as a joke, but on the other hand, they're bringing their impressive software to bear on predicting 25 bits of data, and it can't even do that very well. So I hope so.