Ask HN: Can a large model be trained directly on a db's binlog?
1 point by martythemaniak 31 days ago | 1 comment
At a high level, traditional ML architectures look something like this:

1. Have some services with a database. Users do stuff, services update the database.

2. The service emits some kind of event describing what changed.

3. The stream of events gets consumed and stored.

4. You write a bunch of aggregations on those events and call them "features".

5. Models get trained on these features.

6. Features get calculated every time there's a new event. You run the trained model on the newest feature values to try to predict something useful (a rough sketch follows this list).

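Concretely, steps 4-6 usually end up as hand-written code along these lines. This is only a rough sketch; the event shape, the feature names, and the model interface are made up for illustration:

  # Sketch of steps 4-6: hand-written aggregations ("features"), recomputed
  # on every new event, then fed to a trained model. All names hypothetical.
  from collections import defaultdict

  state = defaultdict(lambda: {"orders": 0, "total_spend": 0.0})

  def update_features(event):
      # Step 4/6: recompute this user's features when a new event arrives.
      f = state[event["user_id"]]
      if event["type"] == "order_placed":
          f["orders"] += 1                 # in reality, often a windowed count
          f["total_spend"] += event["amount"]
      return f

  def score(model, event):
      # Step 5/6: run the trained model on the freshest feature values.
      f = update_features(event)
      return model.predict_proba([[f["orders"], f["total_spend"]]])[0][1]
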
There's a second path, where you might replicate the db in a warehouse, then run ETLs to produce features, but the outline is very similar. There's basically a lot of manual work done to produce features that models can be trained and run on.

This vaguely reminds me of the way computer vision used to be done - you'd manually run some algorithm to do edge detection, etc., then try to operate on the results. But it turns out that training a large model on a giant pile of images lets the model create and learn its own features from the raw data. So I'm curious - is there an analog here? Can a sufficiently large model be trained directly on the binlog of a database and learn its own features?
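To make that concrete, one naive way to picture "training directly on the binlog" is to serialize each change event into tokens and hand the resulting sequence to a large sequence model, instead of hand-building aggregations. A toy sketch; the event shape and token scheme are invented for illustration and are not any particular database's binlog format:

  # Hypothetical: flatten CDC/binlog-style change events into one token
  # sequence that a sequence model (e.g. a transformer) could train on.
  def event_to_tokens(event):
      toks = [f"<op:{event['op']}>", f"<table:{event['table']}>"]
      for col, val in sorted(event["after"].items()):
          toks += [f"<col:{col}>", str(val)]
      return toks

  events = [
      {"op": "INSERT", "table": "orders", "after": {"user_id": 42, "amount": 19.99}},
      {"op": "UPDATE", "table": "users",  "after": {"user_id": 42, "plan": "pro"}},
  ]

  sequence = [tok for ev in events for tok in event_to_tokens(ev)]
  print(sequence)  # this stream would stand in for the hand-built features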




One issue that’s come up a few times for me is the challenge of developing a clear separation of test and training data, which winds up being specific to the problem.

For instance I downloaded the articles from HN and trained models to predict, based on the title: (a) >10 votes, (b) comments/votes > 0.5, (c) is it dead?

Early on I got test-train splits based on hashing the id. For (a) I got an AUC-ROC (see https://scikit-learn.org/stable/modules/model_evaluation.htm...) around 0.62 or so, (b) got 0.73 or so, and (c) got around 0.98, which is shockingly good.
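For reference, the split and the metric look roughly like this (a minimal sketch; the md5 hashing, the 20% test fraction, and the dummy numbers are assumptions for illustration):

  # Deterministic test/train split keyed on the HN item id, plus AUC-ROC.
  import hashlib
  from sklearn.metrics import roc_auc_score

  def is_test(item_id, test_frac=0.2):
      # Hash the id so a given story always lands on the same side.
      h = int(hashlib.md5(str(item_id).encode()).hexdigest(), 16)
      return (h % 1000) / 1000 < test_frac

  ids = [38512345, 38512346, 38512347, 38512348]
  print({i: is_test(i) for i in ids})

  # AUC-ROC on the held-out rows (dummy labels and scores):
  y_true  = [0, 1, 1, 0]
  y_score = [0.2, 0.8, 0.6, 0.3]
  print(roc_auc_score(y_true, y_score))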

Turns out it was too good to be true: you never see it if you don’t have showdead on, but HN gets spammed repeatedly with the same spam headlines over and over, and they end up [dead] right away. The same headline thus ends up in both the train and test set, which makes it too easy for the algorithm. The same phenomenon also affected my other scores, but it didn’t make them as completely unrepresentative of the real performance as it did for (c).
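One way to plug that leak is to split by headline instead of by id, so a duplicated title can't straddle the boundary. A sketch using scikit-learn's GroupShuffleSplit (the data here is made up):

  # Group-aware split: identical headlines stay on one side of the split.
  from sklearn.model_selection import GroupShuffleSplit

  titles = ["Free crypto!!!", "Show HN: my tool", "Free crypto!!!", "Ask HN: jobs?"]
  labels = [1, 0, 1, 0]  # e.g. target (c): ends up [dead]

  splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
  train_idx, test_idx = next(splitter.split(titles, labels, groups=titles))
  print("train:", [titles[i] for i in train_idx])
  print("test: ", [titles[i] for i in test_idx])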

If I ever go back to try to improve those models, I expect to build “stacked” models that need training and test data for the submodels, more data for a model that aggregates the submodels, and eval data at the end, so even more thinking will go into sampling.
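A sketch of the kind of sampling that implies: one disjoint slice per stage, keyed on the item id so every row lands in exactly one slice (the slice names and fractions are made up):

  # Disjoint partitions for a stacked setup: submodel training, aggregator
  # training, and final eval. Bucket boundaries are arbitrary here.
  import hashlib

  def partition(item_id):
      bucket = int(hashlib.md5(str(item_id).encode()).hexdigest(), 16) % 100
      if bucket < 60:
          return "submodel_train"
      if bucket < 85:
          return "aggregator_train"
      return "eval"

  ids = [38512345, 38512346, 38512347, 38512348, 38512349]
  print({i: partition(i) for i in ids})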

I see other people run into problems like this too.



