Machine Learning in Python Has Never Been Easier (bigml.com)
266 points by jdonaldson on May 4, 2012 | 31 comments

Prior Knowledge (http://www.priorknowledge.com) has a very similar API but a more interesting underlying model. They model the full joint distribution of the data, so any variables can be missing, not just the outcome. They can also return the joint probability distribution over the unknowns, which is extremely useful for quantifying uncertainty.
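The "any variable can be missing" idea can be illustrated with a toy joint distribution (the numbers below are made up purely for the example): condition on whatever happens to be observed and read off a distribution over the rest.

```python
# Toy joint distribution over two binary variables (rain, wetness).
# Probabilities are hypothetical, chosen only to make this concrete.
joint = {
    ("rain", "wet"): 0.35,
    ("rain", "dry"): 0.05,
    ("no_rain", "wet"): 0.10,
    ("no_rain", "dry"): 0.50,
}

def condition(joint, observed):
    """Given (index, value) for the observed variable, return the
    normalized distribution over the other variable."""
    idx, value = observed
    matching = {k: p for k, p in joint.items() if k[idx] == value}
    z = sum(matching.values())  # normalizing constant
    other = 1 - idx
    return {k[other]: p / z for k, p in matching.items()}

# Either variable can play the role of the "target":
print(condition(joint, (1, "wet")))   # P(rain | grass is wet)
print(condition(joint, (0, "rain")))  # P(wetness | it rained)
```

Because the full joint is modeled, there is no fixed outcome column; the same table answers queries in either direction.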

The model itself appears very flexible:


I'm not affiliated with these guys but they are clearly doing the most interesting work in this area.

Are they a Y Combinator company, by any chance?

Curious as some of the language is very similar to a series of stealth job postings here over the past ~6 months.

Looks really interesting. Any chance you know where to get an invite code for the beta?

Just submit your email and we'll get you into the beta very quickly:


They are pretty responsive in my experience. The beta is definitely open.

True, I had my invite within a couple of hours.

I'm very curious to see (a) what sort of generative model they're using under the hood, and (b) how they do inference efficiently enough to not dedicate a cluster to each customer.

This looks fun and certainly simple - but I would guess that for many, the actual training of the model is not the show-stopper before "automated, data-driven decisions and data-driven applications are going to change the world."

If you already have clean data in tabular form, with a single target class to predict and enough training data, the last step was always sort of easy. Much harder is the fact that people expect Big Data and ML to be fairy dust: just give it my DB password and MAGIC comes out. And instead of a clean two-class classification problem, you have some ill-defined bit-of-clustering, bit-of-visualisation, bit-of-pure-guessing problem.

<quote>This looks fun and certainly simple - but I would guess that for many, the actual training of the model is not the show-stopper before "automated, data-driven decisions and data-driven applications are going to change the world."</quote>

Totally agree. Whenever I train a machine learning model (for a ranker or a classifier), I spend most of the time building the workflow to generate the datasets and to extract and compute the features. I haven't yet found a good open source product that addresses that; the last time I worked on something ML-related, I relied on Makefiles and a few Python scripts to distribute the computation over a small cluster.

I needed a more powerful tool, so in my spare time I've tried to build something like what I have in mind. I came up with a prototype here: https://bitbucket.org/duilio/streamr . The code was mostly written in the first day; then I did a few commits to try out how it could work in a distributed environment. It is at a very early stage and needs a massive refactoring - it is just a proof of concept. I'd like my workflows to look like https://bitbucket.org/duilio/streamr/src/26937b99e083/tests/... . The tool should take care of distributing the workflow nodes and caching the results, so that you can slightly change your script and avoid recomputing all the data. I hadn't used celery before; maybe much of what I've done for this prototype could have been avoided (e.g. the storage system could have been implemented as a celery cache).

Cool project on your side of things. I'm spending a bit of time putting together a compositional tool chain for machine learning tasks myself. Are there any major design choices you've thought through for your own project that you'd care to expound upon?

I'd rather use the very good scikits.learn and have full control over what I'm doing.

What type of algorithm (or at least what general class of algorithm) is generating the model? I'm just curious because I'm wondering if the input data has to be linearly separable, or if it supports only a limited number of classes (I think the iris dataset is only linearly separable in one case).

Looks like good old decision trees to me (with some discretization for continuous input values).

As skystorm notes, these are decision trees. We have a limit on the number of classes (it's in the hundreds). We'll increase this as we improve the algorithms.
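On the linear-separability question above: decision trees don't require it. A hand-written depth-2 tree (purely illustrative, not BigML's output) classifies XOR, the classic dataset no linear model can separate:

```python
# Depth-2 decision tree for XOR. Each internal node splits one feature
# at 0.5; the leaves carry the class labels.
def tree_predict(x1, x2):
    if x1 < 0.5:
        return 1 if x2 >= 0.5 else 0
    else:
        return 0 if x2 >= 0.5 else 1

# XOR truth table: label is 1 exactly when the inputs differ.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
assert all(tree_predict(a, b) == y for (a, b), y in xor_data)
```

Axis-aligned splits compose into arbitrarily non-linear decision boundaries, which is why separability never comes up for trees.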

Nice, but just like Google Predict you are trying to violate the "no free lunch" theorem.

Luckily, our universe has exploitable structure. The No Free Lunch theorem is as much of a headache for search as the halting problem is for programming - not really one.

Problems like classification come from a very tiny corner of the space of all possible objective functions, so NFL doesn't really apply.

Curious about that. How so?

Seems a little too simple -- how can you generate predictions without even specifying which is the class variable and which are the predictor variables?

In the absence of any input or objective field arguments, the API assumes that the final column in a flat file is the objective field, while the rest are input fields. It will also try to determine the appropriate types for all fields.

You can, of course, override any of that if you wish. Check the documentation for more details. https://github.com/bigmlcom/python
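The default described above can be sketched as follows (my own illustration of the convention, not BigML's implementation): treat the final column as the objective field and guess each column's type from its values.

```python
import csv
import io

def describe_fields(csv_text):
    """Last column is the objective field, the rest are input fields;
    a column is numeric if every value parses as a float."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def guess(col):
        try:
            [float(r[col]) for r in data]
            return "numeric"
        except ValueError:
            return "categorical"

    fields = [(name, guess(i)) for i, name in enumerate(header)]
    return {"inputs": fields[:-1], "objective": fields[-1]}

print(describe_fields("sepal_len,species\n5.1,setosa\n4.9,setosa\n"))
# {'inputs': [('sepal_len', 'numeric')], 'objective': ('species', 'categorical')}
```

With explicit input/objective arguments (see the documentation linked above), these guesses are simply overridden.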

Sincere question: is your only data-ingest mechanism uploading gzipped CSVs or other files? It seems that if people really have big data, then by definition that approach won't work.

We'll be supporting cloud file bucket locations soon: S3, etc. We're also working on handling streaming data, e.g. logs.

Have signed up for the beta. Look forward to checking it out. I've looked at the Google prediction API, but it doesn't do what I need.

What do you need?

Most of my data is too big for CSVs but too small to justify distributed storage. I use HDF5 with chunking and column compression. I think many other people in the sciences and finance also do this (along with using NetCDF).
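For reference, the chunked-and-compressed pattern looks like this with h5py (real library calls; the file name, sizes, and chunk shape are arbitrary choices for the example):

```python
import h5py
import numpy as np

# Write a single column, stored in 10k-row chunks with gzip
# compression applied per chunk.
with h5py.File("example.h5", "w") as f:
    f.create_dataset(
        "prices",
        data=np.arange(100000, dtype="f8"),
        chunks=(10000,),       # I/O granularity
        compression="gzip",    # per-chunk column compression
    )

# Reading a slice only touches (and decompresses) the chunks it covers.
with h5py.File("example.h5", "r") as f:
    chunk = f["prices"][:10000]
```

This is the sweet spot the comment describes: too big for CSV round-trips, too small to justify a distributed store.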

How does one get control over the predictive model? What classifier gets used, for instance. Maybe there is something in the API, but I didn't see it in the article.

We provide basic classification and regression trees for now, and we can decide which one is appropriate from the objective field type. Once we start adding in other types of models we will add a model type parameter for the relevant API method.

Do you plan on exposing parameters that control the fitting process? E.g. loss function / tree depth / min samples per leaf? Or will the fitting process always be a black-box automagic call with no user-controllable knobs?

Is there any plan to provide some assessment of model accuracy via the API - e.g. K-fold cross validation with respect to some specified loss function?

We do a little automagic currently, but we'll expose some of the knobs soon, probably first via the API. Expressing model confidence and handling loss functions are being worked on right now.

Did I mention we're hiring? Someone with the right combination of big data and machine learning skills can make a big impact. https://bigml.com/team
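For comparison, these are exactly the knobs scikit-learn (mentioned upthread) already exposes for trees, along with K-fold cross-validation; a sketch using sklearn's parameter names, not BigML's:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# User-controllable fitting knobs: cap the tree depth and require a
# minimum number of samples in each leaf.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# 5-fold cross-validation as a basic accuracy assessment.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Loss-function choice is the one knob that doesn't map as directly; for classification trees, sklearn exposes the split `criterion` instead.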

decision trees are a rudimentary ML technique.

at most they could be used for entry-level didactic purposes

...not when you stick a few hundred of them together and train each on bootstrap samples...

(my random forest can beat up your kernel machine any day)
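Bagging in miniature, to make the point concrete (a toy sketch of the random-forest idea, not a serious implementation): train many one-split "stumps" on bootstrap samples and take a majority vote.

```python
import random
from collections import Counter

random.seed(0)

# 1-D toy problem: label is 1 when x > 5.
data = [(x, int(x > 5)) for x in range(11)]

def train_stump(sample):
    """Pick the threshold that misclassifies the fewest sample points."""
    best = min(
        range(1, 11),
        key=lambda t: sum(int(x >= t) != y for x, y in sample),
    )
    return lambda x: int(x >= best)

# Each stump sees its own bootstrap sample (draw with replacement).
forest = [
    train_stump([random.choice(data) for _ in data]) for _ in range(25)
]

def predict(x):
    votes = Counter(stump(x) for stump in forest)
    return votes.most_common(1)[0][0]
```

Each individual stump is weak and noisy; the vote averages the noise away, which is the whole trick behind sticking a few hundred trees together.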
