[flagged] A mathematical formula that predicts US elections with 87.5% accuracy (turingbotsoftware.com)
8 points by bnveg on June 25, 2020 | 27 comments


Here is a quote from Segre's new biography of Fermi: "When Dyson met with him in 1953, Fermi welcomed him politely, but he quickly put aside the graphs he was being shown indicating agreement between theory and experiment. His verdict, as Dyson remembered, was 'There are two ways of doing calculations in theoretical physics. One way, and this is the way I prefer, is to have a clear physical picture of the process you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.' When a stunned Dyson tried to counter by emphasizing the agreement between experiment and the calculations, Fermi asked him how many free parameters he had used to obtain the fit. Smiling after being told 'Four,' Fermi remarked, 'I remember my old friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.' There was little to add." (p. 273)

One can fit an elephant with four free parameters. That's the story of these curve-fitting exercises.


Am I right in interpreting their screenshot that the best model they found has 22 parameters?!


22 parameters, to predict 25 bits, at 87.5% accuracy! That's 0.875 × 25 = 21.875 bits predicted correctly, just under 22: about one bit per parameter!


This is out-of-sample performance; only 4 models were compared outside the training domain. Also, I don't see where you get 22 parameters from, and the complexity of a model is not defined as its number of parameters.


I ask this in all seriousness: what?


In regression and curve fitting, you can add additional, often spurious, parameters and get a closer fit. A better fit does not mean you have a better model; it usually means the opposite: you've overfitted the model, and it's completely useless for forecasting and predictive purposes. This is why, in machine learning, researchers are careful to separate training and testing data. This is just physicists roasting each other over the same issue.
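
A minimal sketch of the effect in Python (made-up sine data, nothing to do with the article's actual model): as the polynomial degree grows, the training error keeps falling while the held-out error typically climbs.

    # Toy demo: more parameters = better training fit, worse held-out fit.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 40)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

    idx = rng.permutation(x.size)                # random train/test split
    train, test = idx[:30], idx[30:]

    def mse(coeffs, i):
        return np.mean((np.polyval(coeffs, x[i]) - y[i]) ** 2)

    for degree in (1, 3, 8, 12):                 # ever more free parameters
        coeffs = np.polyfit(x[train], y[train], degree)
        print(f"degree {degree:2d}: train {mse(coeffs, train):.3f}, "
              f"test {mse(coeffs, test):.3f}")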


The model shown in the article is cross-validated.


It’s a classic story about models like these (ones with many parameters).


I would expect to be able to get 100% (on the training data) on this task. Shouldn't any sequence of election victories be easily representable as a sum of sine functions?

I feel like the message I'm supposed to take from this demo is "Look how good TuringBot is! They can automatically find a function to match this data!" But the actual message I'm getting is "Symbolic regression is too hard for TuringBot!"


Indeed. Just perform an FFT on the data and encode the resulting coefficients in a closed formula.
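
A sketch of that idea, with a made-up 0/1 sequence standing in for the election outcomes: the inverse DFT is exactly such a closed-form sum of sinusoids, and it reproduces the training data perfectly, since N coefficients can always encode N points.

    # Any finite sequence is exactly a sum of sinusoids via the DFT --
    # 100% "accuracy" on the training data, zero predictive content.
    import numpy as np

    bits = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])   # stand-in outcomes
    coeffs = np.fft.fft(bits)

    def model(n, N=len(bits)):
        """Closed-form reconstruction of sample n (inverse DFT)."""
        k = np.arange(N)
        return np.real(np.sum(coeffs * np.exp(2j * np.pi * k * n / N))) / N

    reconstructed = np.round([model(n) for n in range(len(bits))]).astype(int)
    assert np.array_equal(reconstructed, bits)         # perfect fit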


This is a Pareto optimization with a limit on the size of the formulas. Sure, with a formula of arbitrary size you can fit anything with 100% accuracy, but the task is much harder if what you are looking for is short models.
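
For what it's worth, here is a minimal sketch of what such a Pareto filter over (size, error) pairs looks like; the tool's internals aren't public, so this is only illustrative.

    # Keep a formula only if no other candidate is both smaller and
    # more accurate (the Pareto front of size vs. error).
    def pareto_front(candidates):
        """candidates: iterable of (size, error, formula) tuples."""
        front = []
        for size, error, formula in sorted(candidates):
            if not front or error < front[-1][1]:   # strictly better error
                front.append((size, error, formula))
        return front

    candidates = [(3, 0.40, "f1"), (5, 0.25, "f2"), (5, 0.30, "f3"),
                  (9, 0.26, "f4"), (12, 0.10, "f5")]
    print(pareto_front(candidates))  # f1, f2, f5 survive; f3, f4 are dominated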


That's nothing, I can generate a degree-44 polynomial that predicts US elections with 100% accuracy!
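
That's easy to check: a polynomial of degree n - 1 passes through any n points, so a degree-44 polynomial nails 45 past elections exactly while saying nothing useful about the next one. A sketch with made-up data (the years and winners below are random stand-ins, not real results):

    # Through any 45 points there is an interpolating degree-44 polynomial,
    # so "100% accuracy" on past elections is trivial -- and worthless.
    import numpy as np
    from scipy.interpolate import BarycentricInterpolator

    rng = np.random.default_rng(1)
    years = np.arange(1789, 1789 + 45 * 4, 4)        # stand-in election years
    outcomes = (rng.random(45) > 0.5).astype(float)  # made-up 0/1 winners

    poly = BarycentricInterpolator(years, outcomes)  # the degree-44 polynomial
    print(np.allclose(poly(years), outcomes))        # True: 100% on the past
    print(poly(years[-1] + 4))                       # the next election: garbage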


Now do that with cross-validation and get a better result than the reported one.


This commits the same fallacy that exists all over the place for time-series data. It bites almost anybody who tries using time-series methods to predict equity returns.

Since you have a literal dependency between one data point and the next, you can't train your model on randomized data. So the default that people jump to is to segment their data, with earlier data as the training set and later data as the test set.

If your data is the result of a very well understood and controlled process, this works fabulously (see ARIMA and its variants). But the more the data is driven by significant noise, whether pure randomness or Brownian motion (very likely the case with elections), the more spectacularly that methodology breaks down. All you end up doing is overfitting to your test set instead of your training set.
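
For concreteness, a minimal sketch of the sequential split being described, using scikit-learn's TimeSeriesSplit (the 25 rows are just stand-ins for 25 elections): every training fold strictly precedes its test fold.

    # Sequential splits: train always precedes test, so nothing from the
    # future leaks into the fit. (Repeatedly tuning against the same
    # held-out tail still overfits the test set, per the comment above.)
    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(25).reshape(-1, 1)   # one row per election, oldest first

    for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
        print(f"train 0-{train_idx.max()}, test {test_idx.min()}-{test_idx.max()}")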


> Since you have a literal dependency between one data point and the next, you can't train your model using randomized data

This is why the train/test split was not randomized, but sequential.


Uhhhh... this is the dumbest ad article I've read in a while.


Talk about overfitting...


This is an out-of-sample result; 4 formulas were compared. Sure, it may be overfit, as any machine-learning model can be, but the procedure that was used to generate this formula cannot be so easily dismissed.


I thought this was going to be about Lichtman[0].

[0](https://en.wikipedia.org/wiki/The_Keys_to_the_White_House)


Except that the Democrats of 1940 were very different from the Democrats of 1990, and from the Democrats of 2020.


Everything is a time series.


Now repeat the same procedure on randomized data and see how often you can match or surpass this accuracy.


As a rule of thumb, you cannot. I tried to fit the Mona Lisa with the software (brightness as a function of x and y) and could not find anything. It performs poorly when no structure exists.
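
A minimal sketch of the randomization check suggested above, with a toy majority-class predictor standing in for the real fitting pipeline (the actual software isn't scriptable here): shuffle the labels, redo the whole fit-and-score procedure, and see how often chance alone reaches 87.5%.

    # Permutation test: how often does the same procedure hit 87.5%
    # on shuffled (i.e., structureless) outcome bits?
    import numpy as np

    rng = np.random.default_rng(3)
    outcomes = rng.integers(0, 2, 25)          # stand-in election bits

    def fit_and_score(bits):
        """Toy stand-in for the real pipeline: predict the majority class."""
        train, test = bits[:17], bits[17:]
        guess = int(train.mean() >= 0.5)
        return np.mean(test == guess)

    null_scores = [fit_and_score(rng.permutation(outcomes)) for _ in range(1000)]
    print(np.mean(np.array(null_scores) >= 0.875))   # fraction matching 87.5%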


Maybe the other commenters tearing into this are on a higher plane of irony than I am, but I want to point out that I'm 99% sure the article is meant as a joke.


Well, I really hope so. It doesn’t really read as a joke, but on the other hand, they’re bringing their impressive software to bear to attempt to predict 25 bits of data, and can’t even do that very well. So I hope so.


The article is hosted by a real software product, so it's not likely this is satire.


Nicolas Cage movies are missing



