
Using symbolic regression to predict rare events - turingb
https://turingbotsoftware.com/posts/rare-event-prediction.html
======
goodside
This seems like a guaranteed recipe for overfitting. You’re searching over an
enormous space of analytic functions and reporting fits without (as far as I
can tell) any regularization penalty on model complexity or validation on
holdout data. Your model will fit well to any data you throw at it under those
conditions, and the result is very unlikely to generalize.

I’m sorry if I’ve grossly misunderstood what you’re doing here, but trying to
sell this method in a GUI tool (making it usable by people without a
background in statistics) seems almost negligent.

~~~
deckar01
The previous submission on "Machine learning prediction of the coronavirus
outbreak" indicates to me that they do not understand what problems symbolic
regression can be applied to. They are making predictions about the future
based only on past measurements of one event. They also seem to be revising
their numbers without noting the changes in the article.

[https://web.archive.org/web/*/https://turingbotsoftware.com/...](https://web.archive.org/web/*/https://turingbotsoftware.com/posts/coronavirus-prediction.html)

------
qw3rty01
This seems pretty low, can't we apply Bayes' theorem here?

    
    
      let F = transaction being fraud
      let D = detected as fraud
      
      P(F) = 492/284807 = .17%
      P(D|F) = 80%
      P(~F) = 1-P(F) = 99.83%
      P(D|~F) = 13%
      
      P(F|D) = (P(D|F) * P(F)) / (P(F) * P(D|F) + P(~F) * P(D|~F))
      P(F|D) = ((.8) * (.0017)) / ((.0017) * (.8) + (.9983) * (.13))
      P(F|D) = 0.01 = 1% chance of the transaction being fraud given a positive detection
    

although, to be fair, they're not suggesting this method be used on its own
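The arithmetic above can be checked in a few lines of Python (the 492/284807 prior comes from the dataset; the 80% detection rate and 13% false positive rate are the figures assumed above):

```python
# Sanity check of the Bayes' theorem calculation above.
p_f = 492 / 284807          # P(F): prior probability of fraud
p_d_given_f = 0.80          # P(D|F): detection rate (assumed above)
p_not_f = 1 - p_f           # P(~F)
p_d_given_not_f = 0.13      # P(D|~F): false positive rate (assumed above)

# Bayes' theorem: P(F|D) = P(D|F)P(F) / [P(D|F)P(F) + P(D|~F)P(~F)]
p_f_given_d = (p_d_given_f * p_f) / (
    p_d_given_f * p_f + p_d_given_not_f * p_not_f)
print(round(p_f_given_d, 4))  # -> 0.0105, i.e. about a 1% chance
```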

------
rini17
Did you verify the result with other dataset than the one used for training?

~~~
turingb
No, I think the features of this dataset are unique to it.

~~~
ASpring
Can you update the article by holding out a portion of the training set and
then using it as an unseen test set?

Otherwise it's impossible to make the comparisons at the end of the article to
other results.

~~~
huac
Yeah, was the model trained using train/test splits? Otherwise, the model has
likely been severely overfit.

I wonder how this performance would have compared to a simple random forest or
MLP model.

~~~
jlamberts
I was also curious. Using an out-of-the-box weighted random forest, I got an
f-score of ~.85 using a 75:25 stratified train-test split.
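For reproducibility, the baseline I mean is along these lines. I can't share the exact dataset loading here, so this sketch uses a synthetic imbalanced dataset as a stand-in for the credit card data; the exact F1 will differ:

```python
# Sketch of a class-weighted random forest baseline with a 75:25
# stratified train-test split. The synthetic dataset below mimics the
# heavy class imbalance of the fraud data (~0.2% positives).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=20000, weights=[0.998, 0.002],
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
score = f1_score(y_te, clf.predict(X_te))
print("F1 on held-out test set:", score)
```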

------
verdverm
To better understand the limitations in Symbolic Regression and Genetic
Programming, check out my paper Prioritized Grammar Enumeration

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.394...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.394.140&rep=rep1&type=pdf)

OP misses the point that SR is the problem space, not the algorithm or
solution space.

------
leto_ii
I was not aware of symbolic regression at all, but I'm wondering now whether it
couldn't be replicated by some feature engineering or by playing around with
something like the Box-Cox transformation.

You could afterwards use some feature selection, regularization etc. to retain
features that have explanatory power.

It wasn't clear to me why symbolic regression would be especially good for
highly skewed datasets. Does anybody have an idea?
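Concretely, I mean something like the following sketch: a power transform on the features (sklearn's PowerTransformer generalizes Box-Cox, and its Yeo-Johnson variant also handles non-positive values), followed by L1-regularized logistic regression so that only features with explanatory power keep nonzero weights. The dataset here is synthetic, just to show the mechanics:

```python
# Power-transform features, then use L1 regularization as feature selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)
pipe = make_pipeline(
    PowerTransformer(method="yeo-johnson"),   # Box-Cox-style transform
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
pipe.fit(X, y)

# Features with nonzero coefficients survive the L1 penalty.
coef = pipe[-1].coef_.ravel()
print(int(np.sum(coef != 0)), "of", coef.size, "features retained")
```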

------
cortesoft
Cool math, but this doesn't solve the false positive paradox part of the base
rate fallacy... since fraud is so rare, most transactions marked as fraud are
going to be wrong.

[https://en.m.wikipedia.org/wiki/Base_rate_fallacy#False_posi...](https://en.m.wikipedia.org/wiki/Base_rate_fallacy#False_positive_paradox)

------
chromaton
I have used [http://zunzun.com](http://zunzun.com) for a similar effect in the
past.

------
clircle
Why do machine learners use F1 score instead of a proper scoring rule (MSE,
log-loss, etc.)?

~~~
Tarq0n
F1 score is harder to misrepresent, and relates more closely to business
objectives. MSE is mostly for regression. Log-loss is very useful for training
because it represents a useful signal for how well your model is progressing
in fitting the data, but you'll still want to evaluate precision and recall.

In practice you'll usually want to tune the tradeoff between precision and
recall for the situation. This way you can make a direct connection between
model performance and the costs and benefits as they manifest in practice, as
well as your tolerance for risk.
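A minimal sketch of that tuning step, assuming a probabilistic classifier: sweep the decision threshold over the predicted probabilities and pick the one that maximizes your objective (F1 here, purely for illustration; in practice you'd plug in a cost-aware objective):

```python
# Tune the precision/recall trade-off by sweeping the decision threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]

# precision_recall_curve returns one more (precision, recall) point
# than thresholds; drop the last point before matching them up.
prec, rec, thresholds = precision_recall_curve(y, probs)
f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
best = np.argmax(f1[:-1])
print("best threshold:", thresholds[best], "F1 at that threshold:", f1[best])
```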

~~~
solresol
Given that this is fraud prevention, the correct error function should be
measured in dollars. The thing you want to minimise is money lost.

Is the cost of allowing a fraudulent transaction through much greater than the
cost of blocking a non-fraudulent transaction? Then the error function should
reflect that. Minimising F1/accuracy/log-loss is not necessarily going to save
the most money.
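As a sketch, with made-up cost figures: weight each kind of error by its monetary cost instead of treating all mistakes equally, and minimise that instead.

```python
# Dollar-denominated error function: missed fraud costs far more than a
# false alarm. The cost figures below are made up for illustration.
import numpy as np

COST_MISSED_FRAUD = 500.0   # assumed average loss when fraud slips through
COST_FALSE_ALARM = 5.0      # assumed cost of blocking a good transaction

def dollar_loss(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    missed = np.sum((y_true == 1) & (y_pred == 0))
    false_alarms = np.sum((y_true == 0) & (y_pred == 1))
    return missed * COST_MISSED_FRAUD + false_alarms * COST_FALSE_ALARM

# One missed fraud costs as much as 100 false alarms under these numbers:
print(dollar_loss([1, 0, 0], [0, 1, 1]))  # -> 510.0 (500 + 2 * 5)
```

A model with a worse F1 can easily have a lower dollar loss than one with a better F1, which is the point.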

~~~
Tarq0n
There may also be compliance costs. (Will this model satisfy an auditor?)

------
ttul
So basically this is a neural network architecture discovery process.

~~~
deckar01
It has no neural network. It is a simulated annealing search over a solution
space of symbolic equations. The traditional strategy is to use a genetic
algorithm to construct symbolic equations. I don't see any proof that this
product provides any improvement over the open source tools that are available
for free.
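To illustrate the mechanics (not any particular product): simulated annealing accepts a worse candidate expression with probability exp(-delta/T), cooling T over time. This toy version searches a tiny expression grammar and, for simplicity, proposes fresh random expressions rather than mutating subtrees as real tools do:

```python
# Toy simulated annealing search over a tiny space of symbolic expressions.
import math
import random

random.seed(0)
X = [i / 10 for i in range(-20, 21)]
Y = [x * x + x for x in X]            # target relationship: x^2 + x

def random_expr(depth=2):
    """Build a random expression tree over {x, constants, +, *}."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", round(random.uniform(-2, 2), 1)])
    return (random.choice(["add", "mul"]),
            random_expr(depth - 1), random_expr(depth - 1))

def evaluate(e, x):
    if e == "x":
        return x
    if isinstance(e, (int, float)):
        return e
    op, a, b = e
    a, b = evaluate(a, x), evaluate(b, x)
    return a + b if op == "add" else a * b

def error(e):
    return sum((evaluate(e, x) - y) ** 2 for x, y in zip(X, Y))

current, cur_err, T = random_expr(), None, 10.0
cur_err = error(current)
for step in range(5000):
    candidate = random_expr()         # proposal: fresh random expression
    delta = error(candidate) - cur_err
    # Accept improvements always, worse candidates with prob exp(-delta/T).
    if delta < 0 or random.random() < math.exp(-delta / max(T, 1e-9)):
        current, cur_err = candidate, cur_err + delta
    T *= 0.999                        # cooling schedule
print("best expression:", current, "squared error:", cur_err)
```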

~~~
verdverm
Prioritized Grammar Enumeration is a better algorithm for symbolic regression.

