
How to generate an arbitrarily large amount of test data - skingkong
https://www.pilosa.com/blog/smote/
======
akalmans
The SMOTE algorithm is fairly old - isn't there something newer that may be
more relevant?

~~~
alanbernstein
I'm the author of the post. I've wanted a tool like this for a while, and I
coincidentally discovered the SMOTE paper recently. It's simple enough to
throw together a prototype in a few hours, and it requires very little
understanding of the data set.

I was looking for something with a certain balance between speed/effort and
statistical robustness. I wanted a big data set for testing Pilosa
performance, not for training ML models or anything that _really_ cares about
the statistics. However, hundreds of repeated records can make histograms look
glitchy, so I wanted to avoid that naive approach. Something like SMOTE fit
that need well.
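(For readers unfamiliar with it: SMOTE's core step is to pick a real record, pick one of its k nearest neighbors, and emit a random point on the segment between them, so synthetic rows vary instead of repeating. A minimal NumPy sketch of that idea, not the post's actual implementation, might look like:

```python
import numpy as np

def smote_sample(X, n_new, k=5, rng=None):
    """Generate n_new synthetic rows by interpolating each sampled
    point toward one of its k nearest neighbors (the core SMOTE step)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # Pairwise squared distances; exclude each point from its own neighbors.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]            # k nearest neighbors per row
    base = rng.integers(0, len(X), n_new)         # random base points
    neigh = nn[base, rng.integers(0, k, n_new)]   # random neighbor of each base
    t = rng.random((n_new, 1))                    # interpolation fraction in [0, 1)
    return X[base] + t * (X[neigh] - X[base])

# e.g. synthetic = smote_sample(real_records, 100_000)
```

Because every synthetic row lies on a segment between two real rows, the output fills in around the data's distribution rather than stacking duplicates.)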

~~~
juandes
I agree with you. I have a bit of experience using SMOTE, and one of the things
that makes me keep using it is its simplicity and versatility. Just
like you, a couple of days ago I wrote a small prototype for balancing an
already synthetic dataset and was very, very satisfied with the results. I'll
share it with you in case you are interested:

[https://kite.com/blog/python/smote-python-imbalanced-learn-for-oversampling/](https://kite.com/blog/python/smote-python-imbalanced-learn-for-oversampling/)

