Ask HN: Can a large model be trained directly on a db's binlog?
1 point by martythemaniak 31 days ago | 1 comment
At a high level, traditional ML architectures look something like this:

1. Have some services with a database. Users do stuff, services update the database.

2. The service emits some kind of event describing what changed.

3. The stream of events gets consumed and stored.

4. You write a bunch of aggregations on those events and call them "features".

5. Models get trained on these features.

6. Features get calculated every time there's a new event. You run the trained model on the newest feature values to try to predict something useful (a rough sketch follows this list).

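Concretely, steps 4-6 usually end up as hand-written code along these lines. This is only a rough sketch; the event shape, the feature names, and the model interface are made up for illustration:

  # Sketch of steps 4-6: hand-written aggregations ("features"), recomputed
  # on every new event, then fed to a trained model. All names hypothetical.
  from collections import defaultdict

  state = defaultdict(lambda: {"orders": 0, "total_spend": 0.0})

  def update_features(event):
      # Step 4/6: recompute this user's features when a new event arrives.
      f = state[event["user_id"]]
      if event["type"] == "order_placed":
          f["orders"] += 1                 # in reality, often a windowed count
          f["total_spend"] += event["amount"]
      return f

  def score(model, event):
      # Step 5/6: run the trained model on the freshest feature values.
      f = update_features(event)
      return model.predict_proba([[f["orders"], f["total_spend"]]])[0][1]
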
There's a second path, where you might replicate the db in a warehouse, then run ETLs to produce features, but the outline is very similar. There's basically a lot of manual work done to produce features that models can be trained and run on.

This vaguely reminds me of the way computer vision used to be done - you'd manually run some algorithm to do edge detection, etc., then try to operate on the results. But it turns out that training a large model on a giant pile of images lets the model create and learn its own features from the raw data. So I'm curious - is there an analog here? Can a sufficiently large model be trained directly on the binlog of a database and learn its own features?
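To make that concrete, one naive way to picture "training directly on the binlog" is to serialize each change event into tokens and hand the resulting sequence to a large sequence model, instead of hand-building aggregations. A toy sketch; the event shape and token scheme are invented for illustration and are not any particular database's binlog format:

  # Hypothetical: flatten CDC/binlog-style change events into one token
  # sequence that a sequence model (e.g. a transformer) could train on.
  def event_to_tokens(event):
      toks = [f"<op:{event['op']}>", f"<table:{event['table']}>"]
      for col, val in sorted(event["after"].items()):
          toks += [f"<col:{col}>", str(val)]
      return toks

  events = [
      {"op": "INSERT", "table": "orders", "after": {"user_id": 42, "amount": 19.99}},
      {"op": "UPDATE", "table": "users",  "after": {"user_id": 42, "plan": "pro"}},
  ]

  sequence = [tok for ev in events for tok in event_to_tokens(ev)]
  print(sequence)  # this stream would stand in for the hand-built features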




One issue that’s come up a few times for me is the challenge of developing a clear separation of test and training data, which winds up being specific to the problem.

For instance I downloaded the articles from HN and trained models to predict, based on the title: (a) >10 votes, (b) comments/votes > 0.5, (c) is it dead?

Early on I got test-train splits based on hashing the id. For (a) I got an AUC-ROC (see https://scikit-learn.org/stable/modules/model_evaluation.htm...) around 0.62 or so, (b) got 0.73 or so, and (c) got around 0.98, which is shockingly good.
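For reference, the split and the metric look roughly like this (a minimal sketch; the md5 hashing, the 20% test fraction, and the dummy numbers are assumptions for illustration):

  # Deterministic test/train split keyed on the HN item id, plus AUC-ROC.
  import hashlib
  from sklearn.metrics import roc_auc_score

  def is_test(item_id, test_frac=0.2):
      # Hash the id so a given story always lands on the same side.
      h = int(hashlib.md5(str(item_id).encode()).hexdigest(), 16)
      return (h % 1000) / 1000 < test_frac

  ids = [38512345, 38512346, 38512347, 38512348]
  print({i: is_test(i) for i in ids})

  # AUC-ROC on the held-out rows (dummy labels and scores):
  y_true  = [0, 1, 1, 0]
  y_score = [0.2, 0.8, 0.6, 0.3]
  print(roc_auc_score(y_true, y_score))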

Turns out it was too good to be true: you never see it if you don’t have showdead on, but HN gets spammed repeatedly with the same spam headlines over and over, and they end up [dead] right away. The same headline thus ends up in both the train and test set, which makes it too easy for the algorithm. The same phenomenon also affected my other scores, but it didn’t make them as completely unrepresentative of the real performance as it did for (c).
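One way to plug that leak is to split by headline instead of by id, so a duplicated title can't straddle the boundary. A sketch using scikit-learn's GroupShuffleSplit (the data here is made up):

  # Group-aware split: identical headlines stay on one side of the split.
  from sklearn.model_selection import GroupShuffleSplit

  titles = ["Free crypto!!!", "Show HN: my tool", "Free crypto!!!", "Ask HN: jobs?"]
  labels = [1, 0, 1, 0]  # e.g. target (c): ends up [dead]

  splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
  train_idx, test_idx = next(splitter.split(titles, labels, groups=titles))
  print("train:", [titles[i] for i in train_idx])
  print("test: ", [titles[i] for i in test_idx])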

If I ever go back to try to improve those models, I expect to build “stacked” models that need training and test data for the submodels, more data for a model that aggregates the submodels, and eval data at the end, so even more thinking will go into sampling.
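A sketch of the kind of sampling that implies: one disjoint slice per stage, keyed on the item id so every row lands in exactly one slice (the slice names and fractions are made up):

  # Disjoint partitions for a stacked setup: submodel training, aggregator
  # training, and final eval. Bucket boundaries are arbitrary here.
  import hashlib

  def partition(item_id):
      bucket = int(hashlib.md5(str(item_id).encode()).hexdigest(), 16) % 100
      if bucket < 60:
          return "submodel_train"
      if bucket < 85:
          return "aggregator_train"
      return "eval"

  ids = [38512345, 38512346, 38512347, 38512348, 38512349]
  print({i: partition(i) for i in ids})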

I see other people run into problems like this too.



