
Tuna – a simple streaming ETL for machine learning features - Lemaxoxo
https://github.com/MaxHalford/tuna
======
Lemaxoxo
Hello everyone,

I'm currently participating in the "PLAsTiCC Astronomical Classification"
Kaggle competition. The dataset is rather large (~5M rows), so I decided to
write a simple tool to compute aggregate features online. I got really
inspired and tidied things up over the weekend. Maybe this can be of interest
to some of you.
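
To illustrate what computing aggregate features "online" can look like, here is a minimal sketch in plain Python. It is not tuna's actual API: the `RunningStats` class, the `aggregate` function, and the `object_id` and `flux` column names are assumptions made up for this example. It uses Welford's algorithm to keep a running count, mean, and variance per group while reading rows one at a time, so memory grows with the number of groups rather than the number of rows.

```python
import csv
import io
from collections import defaultdict


class RunningStats:
    """Running count, mean, and variance via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def aggregate(rows, key, value):
    """Consume an iterable of dict rows once and return per-key stats."""
    stats = defaultdict(RunningStats)
    for row in rows:
        stats[row[key]].update(float(row[value]))
    return stats


# Hypothetical sample data standing in for a large CSV file on disk.
sample = io.StringIO("object_id,flux\n1,2.0\n1,4.0\n2,10.0\n1,6.0\n")
features = aggregate(csv.DictReader(sample), key="object_id", value="flux")
for object_id, s in features.items():
    print(object_id, s.n, s.mean, s.variance)
```

Swapping the `io.StringIO` object for an open file handle would stream a real CSV the same way, since `csv.DictReader` yields one row at a time.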

PS: I'm aware that there are similar projects out there, such as Spark
Streaming, but they feel too bloated and difficult to grok.

