The model itself appears very flexible:
I'm not affiliated with these guys but they are clearly doing the most interesting work in this area.
Curious, as some of the language is very similar to a series of stealth job postings here over the past ~6 months.
If you already have clean data in tabular form, a single target class to predict, and enough training data, the last step was always sort of easy. Much harder is the fact that people expect Big Data and ML to be fairy dust: just give it my DB password and MAGIC comes out. And instead of a clean two-class classification problem you have some ill-defined problem that is a bit of clustering, a bit of visualisation, and a bit of pure guessing.
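To make the first point concrete: once the data really is clean and tabular with a single target column, the modelling step is a few lines. A minimal sketch with scikit-learn (the CSV file and column names here are made up):

    # Once you have clean tabular data with a single target column,
    # the actual "machine learning" step is short.
    # (Hypothetical file/column names; scikit-learn used for illustration.)
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("clean_data.csv")            # already cleaned, tabular
    X, y = df.drop(columns=["target"]), df["target"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = RandomForestClassifier().fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))

Everything before that point (cleaning, framing the question, picking the target) is where the real work hides.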
Totally agree. Indeed, whenever I train a machine learning model (for a ranker or a classifier) I spend most of the time building the workflow to generate the datasets and extract and compute the features. I haven't yet found a good open source product that takes care of that; the last time I worked on something ML-related I relied on Makefiles and a few Python scripts to distribute the computation across a small cluster.

I needed a more powerful tool, so in my spare time I've tried to build something similar to what I have in mind. I came up with a prototype here: https://bitbucket.org/duilio/streamr . The code was mostly written on the first day; then I did a few commits to try out how it could work in a distributed environment. It is at a very early stage and needs massive refactoring; it is just a proof of concept.

I'd like my workflows to look like https://bitbucket.org/duilio/streamr/src/26937b99e083/tests/... . The tool should take care of distributing the workflow nodes and caching the results, so that you can slightly change your script and avoid recomputing all the data. I hadn't used celery before; maybe much of what I did for this prototype could have been avoided (e.g. the storage system could have been implemented as a celery cache).
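(Not streamr's actual API, but a minimal sketch of the caching idea described above: each workflow node's output is stored on disk keyed by a hash of the step name and its inputs, so a small change to the script only recomputes the affected nodes. The step names and functions are made up.)

    import hashlib
    import pickle
    from pathlib import Path

    CACHE_DIR = Path(".workflow_cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def cached_step(name, func, *inputs):
        # Key the cache entry on the step name and its (picklable) inputs.
        key = hashlib.sha1(pickle.dumps((name, inputs))).hexdigest()
        path = CACHE_DIR / key
        if path.exists():                        # cache hit: skip recomputation
            return pickle.loads(path.read_bytes())
        result = func(*inputs)
        path.write_bytes(pickle.dumps(result))   # cache miss: compute and store
        return result

    # Only steps whose inputs changed get recomputed on the next run.
    raw = cached_step("load", lambda: list(range(1000)))
    feats = cached_step("features", lambda xs: [x * x for x in xs], raw)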
You can, of course, override any of that if you wish. Check the documentation for more details.
Is there any plan to provide some assessment of model accuracy via the API - e.g. K-fold cross validation with respect to some specified loss function?
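(For reference, outside whatever this hosted API exposes, that kind of assessment is a few lines with scikit-learn; the dataset here is just a stand-in:)

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(
        RandomForestClassifier(),
        X, y,
        cv=5,                       # 5-fold cross-validation
        scoring="neg_log_loss",     # the specified loss, negated so higher is better
    )
    print("mean log loss:", -scores.mean())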
Did I mention we're hiring? Someone with the right combination of big data and machine learning skills can make a big impact.
At most, they could be used for entry-level didactic purposes.
(my random forest can beat up your kernel machine any day)