

Benchmarking Random Forest Classification - jsbloom1
http://about.wise.io/blog/2013/07/15/benchmarking-random-forest-part-1

======
tlarkworthy
It's random forests ... each tree is trained on a _subset_ of the data. You
can split the massive dataset into chunks and train the trees independently.
That sidesteps the "big data" hangup.

If you look at the scikit-learn implementation, each tree emits a normalised
probability vector for each prediction, and those vectors are simply
multiplied together to get the aggregate prediction, so it's not very
difficult to do yourself.
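A stdlib-only sketch of that aggregation step as the comment describes it: take the probability vector each independently trained tree emits, combine by elementwise product, and renormalise. (For reference, scikit-learn's `RandomForestClassifier` actually averages the per-tree `predict_proba` outputs rather than multiplying them; the multiplicative rule below is the commenter's variant.)

```python
def combine_tree_probs(prob_vectors):
    """Combine per-tree class-probability vectors by elementwise
    product, then renormalise so the result sums to 1."""
    n_classes = len(prob_vectors[0])
    combined = [1.0] * n_classes
    for probs in prob_vectors:
        for k in range(n_classes):
            combined[k] *= probs[k]
    total = sum(combined)
    return [p / total for p in combined]

# Three trees, each trained independently on its own chunk of the
# data, each emitting a probability vector over two classes:
votes = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]]
print(combine_tree_probs(votes))  # class 0 dominates after combining
```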

Regardless, you are applying a batch learning technique. For big data you
really want an incremental learner.
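The incremental pattern scikit-learn exposes via `partial_fit` (on estimators such as `SGDClassifier`) can be illustrated with a toy stdlib-only online perceptron — a sketch of the idea, not any particular library's implementation:

```python
class OnlinePerceptron:
    """Toy incremental (online) binary classifier: each partial_fit
    call updates the weights from one mini-batch, so the full
    dataset never needs to fit in memory at once."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score >= 0 else -1

    def partial_fit(self, X, y):
        for x, target in zip(X, y):
            if self.predict(x) != target:  # update only on mistakes
                for i, xi in enumerate(x):
                    self.w[i] += self.lr * target * xi
                self.b += self.lr * target

# Stream two chunks instead of loading everything at once:
clf = OnlinePerceptron(n_features=2)
chunk1 = ([[2.0, 1.0], [-1.5, -0.5]], [1, -1])
chunk2 = ([[1.0, 2.0], [-2.0, -1.0]], [1, -1])
for X, y in (chunk1, chunk2):
    clf.partial_fit(X, y)
print(clf.predict([3.0, 1.0]))  # prints 1
```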

~~~
msellout
The training subset for each tree can still be quite large. Note that most of
the implementations failed on their 12 GB dataset.

Although I'm a big believer in streaming/online machine learning, it's not
necessarily the best solution. There are many cases when batch is the better
option, especially for big data. Anything historical, really.

------
glouppe
Any chance you could run your benchmarks on this branch of Scikit-Learn?
https://github.com/glouppe/scikit-learn/tree/trees-v2
It will be shipped soon :)

We have been working hard to reduce computing times and memory footprint
(though there is still a lot of room for improvement on that side).

(Unfortunately, I cannot run your benchmarks myself, because the compiled
version of WiseRF requires a newer version of glibc than the one on my
cluster, and crashes.)

------
bravura
Question: Why do I have to implement hyperparameter selection?

For me, the promise of in-the-cloud machine learning is that I can call a
'train' method and specify one single hyperparameter: the training budget
(i.e. $), and perhaps also the max time before I am returned a trained model.

That's it. Can you do that?
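As a rough sketch of what such a budget-capped 'train' call could do internally — hypothetical objective and parameter space, not an actual wise.io API — here is random hyperparameter search under a wall-clock budget:

```python
import random
import time

def train_with_budget(objective, sample_params, budget_seconds):
    """Random hyperparameter search that stops when the time budget
    is spent; returns the best configuration found so far."""
    deadline = time.monotonic() + budget_seconds
    best_params, best_score = None, float("-inf")
    while time.monotonic() < deadline:
        params = sample_params()
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical objective standing in for cross-validated accuracy:
# best at max_depth == 8.
def objective(params):
    return -abs(params["max_depth"] - 8)

def sample_params():
    return {"max_depth": random.randint(1, 20)}

best, score = train_with_budget(objective, sample_params, budget_seconds=0.1)
print(best, score)
```

In a real service the budget would be money rather than seconds, but the control flow is the same: keep trying configurations until the budget runs out, then return the best model seen.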

~~~
joeyrichar
This is exactly what we're enabling with our ML Platform (currently in private
beta). Such a system needs to be built on top of fast & scalable ML technology
with smart & efficient tuning/optimization.

Would love to hear about your use cases & get you on the beta.

-Joey Richards, Chief Scientist @ wise.io

