

H2O: fast statistical, machine learning and math runtime for big data - ColinWright
https://github.com/0xdata/h2o#readme

======
cpa
The scientific advisory council is made up of 3 big guns of the ML (research)
community: Boyd, Tibshirani, Hastie; so I guess it's not "just another ML
lib"…

~~~
carterschonwald
yeah, those are baller advisors to have. On the flip side, while that means
they will write interesting algorithms /modelling tools correctly, that
doesn't mean they will build the _right_ tools.

For good or for ill, many academic folks are technological curmudgeons.

Let me in fact make a much bolder but still true statement: very, very few of
the folks who are strong both algorithmically and mathematically are also adept
at helping engineer tools that are transformatively better to use. Why? Because
doing any subset of math, CS and engineering well requires deliberate practice
in that entire subset of skills. Deliberate practice of all three together in
tandem does not happen in an academic environment, ever. (Measure-zero
counterexamples exist, admittedly.)

------
srisatish
Thanks for the mention: I work for H2O by 0xdata. We want to bring change to
the world through math. As it turns out, Math is very widely applicable. We
are building a high-scale math library & prediction engine that developers can
extend or embed for gaining actionable insights from their data.

Math has been packaged for way too long and modeling has been sampling driven.
We envision a world where math is free and one does not have to choose between
Big Data or Better Algorithms - get them both.

(answering some of the comments)

\- One of our goals is to continue to be extensible and easy to use. H2O is
extensible today via JSON, R (& Python) or via simple Java (see package hex).
The core platform is very scalable and fast, thanks again to an amazing team
of devoted hackers.

\- We are inspired by prior art and lessons from Mahout & efforts from RHIPE
(Thanks, Saptarshi!) and think of ourselves as the next generation, fulfilling
the promise with one simple stack for ML & math on big data that is open,
useful and production grade (performance & testing).

\- We also believe that great math systems can be built by great systems
engineers surrounded by great math people and domain experts (and by
starting with the end-user experience, especially one for big data science).
Our team reflects some of that thinking. We welcome data analysts, scientists,
math people, distributed systems engineers and domain experts to use, critique
and extend our product.

\- Data ingest into our system can be via SQL, NoSQL, HDFS and the plain old
filesystem. H2O ingests regular CSV, XLS or Hive-delimited files. Almost all
commands are JSON directives and can be easily programmed via Python; see our
test bed in action here - <http://test.0xdata.com>

Above all, we are grateful for the attention from HN and would like to welcome
and nurture a community of users, doers and data enthusiasts who can use,
patch, add to the docs and give feedback from their experiences with data.

A product is not complete without its community. Come join us on the
refreshing journey ahead!

------
memming
It looks promising, but it's still in its early stage of development. At least
the documentation is poor. The google group link doesn't seem to work, and I
have no idea how to format and load my data into this framework.

------
Keyframe
I'm not sure I understand. Is this something (to be) like Wakari.io?

~~~
carterschonwald
Wakari (by the Continuum folks) seems to be more of a web notebook on top of
Python, where you run the Python code.

the 0xdata stuff seems to be more of an "interact with a Hadoop cluster via a
RESTful API via R" kinda thing.

the 0xdata stuff seems to be less easily extensible because of that strong
separation / siloing. (though it looks like they have some really smart
interesting folks on board!)

[edit: I'm working on building some scalable extensible numerical / data
analysis tools myself, and whenever I see that partitioning between the tools
for extending vs the tools for using, it just screams "wrong" to me. That
said, the more everyone else focuses on businesses where that partitioning is
normal, the more left for me :) ]

~~~
Keyframe
Thanks for the clarification! How do they approach interactivity with "big
data"/Hadoop (HBase, I presume)? Apache Drill is still draft/WIP, Impala is not
that fast from what I hear (for interactive)... unless they pre-calculate some
use cases, but how would that constitute (semi-)real-time interaction with
Hadoop? MR by its nature is not very real-timey.

~~~
carterschonwald
I'm not sure which you are referring to. But the best way is to go and read
the source!

------
pwang
How is this different from RHIPE? <http://www.datadr.org/>

Also some "big guns" on that project.

------
djulius
What's new here? Mahout already did it.

~~~
ColinWright
Rather than simply making offhand, disparaging comments, perhaps you could
actually provide some details and/or references.

Thanks.

~~~
nazka
It's just a Green Troll :)

~~~
djulius
I admit my first comment was a bit rough. My apologies.

My point is that this library provides the same features as Apache Mahout
(which has existed for quite a long time), so why duplicate the work and why not
contribute to Mahout?

Duplicate features:

\- Hadoop support

\- RandomForest

\- Generalized Linear Modeling

The list of algorithms supported by Mahout is way bigger than that of 0xdata:
[https://cwiki.apache.org/confluence/display/MAHOUT/Algorithm...](https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms)

I would say the only interesting feature is the R support. So why not put R
availability in Mahout?

No explanation is given, although the existence of Mahout is pretty
obvious to anybody in the field. So without any proper argument, it
seems that the authors deliberately ignore the major competing tool in their
field.

~~~
nazka
Well, Green trolls are fun too, hehe. Thx for this comment; I am looking for
solutions to experiment with ML and Data Mining.

