

Machine Learning in Python Has Never Been Easier - jdonaldson
http://blog.bigml.com/2012/05/04/machine-learning-in-python-has-never-been-easier/

======
equark
Prior Knowledge (<http://www.priorknowledge.com>) has a very similar API but a
more interesting underlying model. They model the full joint distribution of
the data, so any variables can be missing not just the outcome. They also are
able to return the joint probability distribution over unknowns, which is
extremely useful in terms of quantifying uncertainty.

The model itself appears very flexible:

<http://blog.priorknowledge.com/blog/beyond-correlation/>

I'm not affiliated with these guys but they are clearly doing the most
interesting work in this area.

~~~
dshah
Looks really interesting. Any chance you know where to get an invite code for
the beta?

~~~
equark
They are pretty responsive from my experience. The beta is definitely open.

~~~
Bootvis
True, I had my invite within a couple of hours.

------
gromgull
This looks fun and certainly simple - but I would guess that for many, the
actual training of the model is not the show-stopper before "automated, data-
driven decisions and data-driven applications are going to change the world."

If you already have clean data in tabular form, with a single target class to
predict, and enough training data, the last step was always sort of easy. Much
harder is the fact that people expect Big Data and ML to be fairy dust: just
give it my DB password and MAGIC comes out. And instead of a clean two-class
classification problem you have some ill-defined problem that is a bit of
clustering, a bit of visualisation, and a bit of pure guessing.

~~~
mau
<quote>This looks fun and certainly simple - but I would guess that for many,
the actual training of the model is not the show-stopper before "automated,
data-driven decisions and data-driven applications are going to change the
world."</quote>

Totally agree. Indeed, whenever I train a machine learning model (for a ranker
or a classifier) I spend most of the time building the workflow to generate
the datasets and extract and compute the features. I haven't yet found a good
open source product that takes care of that; the last time I worked on
ML-related stuff I relied on Makefiles and a few Python scripts to distribute
the computation over a small cluster. I needed a more powerful tool for doing
that, so during my spare time I've tried to build something similar to what I
have in mind. I came up with a prototype here:
<https://bitbucket.org/duilio/streamr> . The code was mostly written on the
first day; then I did a few commits to try out how it could work in a
distributed environment. It is at a very _early stage_ and needs a _massive
refactoring_ ; it is just a proof of concept. I'd like my workflows to look
like
[https://bitbucket.org/duilio/streamr/src/26937b99e083/tests/...](https://bitbucket.org/duilio/streamr/src/26937b99e083/tests/workflows/example2.py)
. The tool should take care of distributing the workflow nodes and caching the
results, so that you can slightly change your script and avoid recomputing all
the data. I hadn't used celery before; maybe much of the work I did for this
prototype could have been avoided (e.g. the storage system could have been
implemented as a celery cache).
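To make the caching idea concrete, here's a minimal sketch of memoizing a workflow node on its inputs so unchanged nodes aren't recomputed (the names are mine, not streamr's, and a real tool would also hash the node's code so edits invalidate stale results):

```python
import pickle

CACHE = {}  # a real tool would use persistent, cluster-shared storage


def cached_node(func):
    """Memoize a workflow node keyed on its name plus its inputs."""
    def wrapper(*args):
        key = (func.__name__, pickle.dumps(args))
        if key not in CACHE:
            CACHE[key] = func(*args)
        return CACHE[key]
    return wrapper


@cached_node
def extract_features(rows):
    # stand-in for an expensive feature-extraction step
    return [len(r) for r in rows]
```

Re-running a slightly edited script then only recomputes the nodes whose inputs actually changed.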

~~~
carterschonwald
Cool project. I'm spending a bit of time myself trying to put together a
compositional tool chain for machine learning tasks. Are there any major
design choices you've thought out for your own project you'd care to expound
upon?

------
joelthelion
I'd rather use the very good scikits.learn and have full control over what I'm
doing.

------
ptg180
What type of algorithm (or at least what general class of algorithm) is
generating the model? I'm just curious because I'm wondering if the input data
has to be linearly separable, or if it is limited to a certain number of
classes (I think that the iris dataset is only linearly separable in one
case)?

~~~
skystorm
Looks like good old decision trees to me (with some discretization for
continuous input values).
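For illustration, the "discretization" part just means picking split thresholds on a continuous feature. A minimal one-node sketch (toy code, not BigML's actual implementation):

```python
def majority(labels):
    """Most common label in a list."""
    return max(set(labels), key=labels.count)


def best_stump(xs, ys):
    """Find the threshold on a continuous feature that minimizes
    misclassifications when each side predicts its majority label.
    Candidate thresholds are the midpoints between sorted values --
    that is the discretization a tree performs."""
    pairs = sorted(zip(xs, ys))
    best_errors, best_t = float("inf"), None
    for i in range(len(pairs) - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        errors = sum(y != majority(left) for y in left) + \
                 sum(y != majority(right) for y in right)
        if errors < best_errors:
            best_errors, best_t = errors, t
    return best_t
```

On iris-like petal lengths this would recover the familiar clean split between setosa and the other species.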

------
dchichkov
Nice, but just like Google Predict you are trying to violate the "no free
lunch" theorem.

~~~
Dn_Ab
Luckily, our universe has exploitable structure. The No Free Lunch theorem is
as much a headache for search as the Halting problem is for programming - not
really one in practice.

------
xaa
Seems a little _too_ simple -- how can you generate predictions without even
specifying which is the class variable and which are the predictor variables?

~~~
jdonaldson
In the absence of any input or objective field arguments, the API assumes that
the final column in a flat file is the objective field, while the rest are
input fields. It will also try to determine the appropriate types for all
fields.

You can, of course, override any of that if you wish. Check the documentation
for more details. <https://github.com/bigmlcom/python>
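In other words, given a flat file the default split looks something like this (a sketch of the convention only; the function name is mine, not our client's API):

```python
import csv
import io


def split_fields(csv_text):
    """Apply the default convention: the last column is the objective
    field, every preceding column is an input field."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    inputs = [dict(zip(header[:-1], row[:-1])) for row in data]
    objectives = [row[-1] for row in data]
    return inputs, objectives
```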

------
hogu
Sincere question: is your only data ingest mechanism uploading gzipped CSVs
or other files? It seems that if people really have big data, then by
definition that approach won't work.

~~~
jdonaldson
We'll be supporting cloud file bucket locations soon: S3, etc. We're also
working on handling streaming data, e.g. logs.

~~~
dshah
Have signed up for the beta. Look forward to checking it out. I've looked at
the Google prediction API, but it doesn't do what I need.

~~~
benhamner
What do you need?

------
manpreets7
How does one get control over the predictive model? What classifier gets used,
for instance. Maybe there is something in the API, but I didn't see it in the
article.

~~~
jdonaldson
We provide basic classification and regression trees for now, and we can
decide which one is appropriate from the objective field type. Once we start
adding in other types of models we will add a model type parameter for the
relevant API method.
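The choice amounts to inspecting the objective field's type. Roughly (a sketch of the idea, not our API):

```python
def model_type(objective_values):
    """Pick a tree flavor from the objective field: a numeric objective
    gets a regression tree, a categorical one a classification tree."""
    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    if all(is_number(v) for v in objective_values):
        return "regression"
    return "classification"
```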

~~~
shoo
Do you plan on exposing parameters that control the fitting process? E.g. loss
function / tree depth / min samples per leaf? Or will the fitting process
always be a black-box automagic call with no user-controllable knobs?

Is there any plan to provide some assessment of model accuracy via the API -
e.g. K-fold cross validation with respect to some specified loss function?
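(For concreteness, by K-fold cross validation I mean splitting the examples into K folds and holding each fold out once, roughly:)

```python
def k_fold_indices(n, k):
    """Split n example indices into k roughly equal folds; yield
    (train, test) index lists with each fold held out once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```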

~~~
jdonaldson
We do a little automagic currently, but we'll expose some of the knobs soon,
probably first via the API. Expressing model confidence and handling loss
functions are being worked on right now.

Did I mention we're hiring? Someone with the right combination of big data and
machine learning skills can make a big impact. <https://bigml.com/team>

------
geldedus
Decision trees are a rudimentary ML technique.

At most they could be used for entry-level didactic purposes.

~~~
iskander
...not when you stick a few hundred of them together and train each on
bootstrap samples...

(my random forest can beat up your kernel machine any day)
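i.e. the bagging recipe, sketched (toy code, not a real forest implementation):

```python
import random


def bootstrap_sample(rows, rng):
    """Sample len(rows) rows with replacement -- each tree in a random
    forest is trained on one such resample."""
    return [rng.choice(rows) for _ in rows]


def forest_predict(trees, x):
    """Majority vote over the individual trees' predictions."""
    votes = [tree(x) for tree in trees]
    return max(set(votes), key=votes.count)
```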

