

How to use R, H2O, and Domino for a Kaggle competition - earino
http://blog.dominoup.com/using-r-h2o-and-domino-for-a-kaggle-competition/

======
izyda
I do not understand startups like Domino. It seems to me like it is
essentially the equivalent of running an AWS instance along with a GitHub
account. AWS does not require any hardware maintenance on your part, and it
takes only a tutorial or two to learn how to install R and run code on it
in parallel / across multiple instances.

Presumably, Domino does not take unparallelized R code and transform it into
parallelized code - so if you have to use the parallel R package (or some
equivalent) anyway to get it to run on multiple cores, what really is the
value add? Am I missing something?
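
To be concrete, the do-it-yourself route with the base parallel package is
only a few lines (a rough sketch; mclapply forks processes, so on Windows
you would need makeCluster/parLapply instead):

    library(parallel)

    # Fit the same model on bootstrap resamples, one task per core.
    n_cores <- detectCores()
    fits <- mclapply(seq_len(100), function(i) {
      idx <- sample(nrow(mtcars), replace = TRUE)
      lm(mpg ~ wt + hp, data = mtcars[idx, ])
    }, mc.cores = n_cores)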

~~~
DominoDataLab
Hi izyda. We get that question a lot, and we're working on our messaging
around this, so I appreciate the feedback. Here are some reasons our customers
find Domino valuable:

- Domino makes it really easy to start and manage multiple runs in parallel
(think a modern, easy-to-use cluster). If you're doing all this directly with
AWS, you quickly run into pain points managing all your instances and
images.

- Domino automatically keeps a versioned history of your work. It supports
large files like data sets (which git doesn't handle well) and it tracks the
results/artifacts of your analysis (which makes it more like git + CI). These
things are critical to analytics workflows, as opposed to pure software
development.

- Domino lets you deploy your analyses as self-service web UI tools, or
deploy them to API endpoints (see the sketch after this list). Doing this on
your own would involve building an entire web stack around your analysis.

- Domino hosts your analysis centrally so you can share and collaborate with
others (yes, this is like GitHub, but on a platform that has all the benefits
above).

- The entire product can be installed on-premise, so companies can use the
functionality described above without going to the cloud if they don't want
to.
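
As a concrete illustration of the API-endpoint point, calling a deployed
model from R would look something like this (the URL and payload shape here
are hypothetical, just to show the idea; the exact request format depends on
your project):

    library(httr)

    # Hypothetical endpoint URL and input fields, not a real project.
    response <- POST(
      "https://app.dominoup.com/v1/your-project/endpoint",
      body = list(depth = 0.5, spectra_mean = 1.2),
      encode = "json"
    )
    content(response)  # parsed prediction returned by the model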

Finally, even for pure infrastructure management, we've found that many data
scientists don't want to spend their time dealing with system administration.
It's true that it's not that hard to start an EC2 instance. But pretty quickly
you're installing packages (perhaps in an environment you aren't used to),
dealing with security groups, file transfer (configuring S3), etc. People use
Domino for the same reason they use Heroku: yes, you could deal with all that,
but it might be a better use of your time to let someone else do it.

~~~
izyda
Thanks for the response - there are some fair points here.

As other commenters pointed out, the fact that you charge by the minute, not
the hour, does in fact make a big difference in price, particularly for those
of us who need to run intensive but sporadic/short tasks.

A few questions about your points that I am trying to find answers to in the
documentation right now, but perhaps you can save me the trouble if you
happen to see this first:

> Domino makes it really easy to start and manage multiple runs in parallel
> (think a modern, easy-to-use cluster). If you're doing all this directly
> with AWS, you quickly run into pain points managing all your instances and
> images.

How so? Does Domino allow you to spin up more cores at will from R? That
would be awesome.

> - Domino lets you deploy your analyses as self-service web UI tools, or
> deploy them to API endpoints. Doing this on your own would involve building
> an entire web stack around your analysis.

This is awesome and definitely useful if you are doing work for clients and
do not want to be bothered with spending too much time building
production-grade stuff. In some sense, is this like yhathq.com? (I understand
you guys do more than they do, in the sense that you provide all these other
features.)

------
dxbydt
Can you please comment on why you need a 50-node, 3-hidden-layer FFNN to do
regression, as opposed to something simpler?

~~~
jofai_chow
The starter code I provided is a basic DNN structure for modelling complex
non-linear relationships between five soil properties and 3000+ predictors.

In practice, I found that some of the properties require an even more
complex DNN structure to achieve better predictive accuracy. The 50-50-50
setup is a solid starting point for readers to begin their own experiments.
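
For reference, the heart of that starter code is a single h2o.deeplearning
call along these lines (a simplified sketch: the file name, epochs value,
and target column names are placeholders based on the competition data, and
argument names differ slightly across h2o versions):

    library(h2o)
    h2o.init(nthreads = -1)  # start a local H2O cluster on all cores

    # "training.csv" is a placeholder for the competition data file.
    train <- h2o.importFile("training.csv")

    # Assumed target columns: the five soil properties; everything else
    # (the 3000+ spectral features) is used as a predictor.
    targets <- c("Ca", "P", "pH", "SOC", "Sand")
    predictors <- setdiff(colnames(train), targets)

    # One model per property, with three hidden layers of 50 neurons each.
    model_ca <- h2o.deeplearning(
      x = predictors,
      y = "Ca",
      training_frame = train,
      hidden = c(50, 50, 50),
      activation = "Rectifier",
      epochs = 100  # placeholder value, tune as needed
    )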

~~~
dxbydt
Thank you. How did you come up with the 50-50-50 setup, or was it purely
empirical? Did you try something simpler first, and how did that simpler
method perform vis-à-vis this DNN? Congratulations on topping the
leaderboard.

~~~
jofai_chow
Thanks! Yes, I always start with much simpler networks like 10, 10-10, and
10-10-10. Unfortunately, the regression problems here are quite complex,
hence bigger networks are required (well, it wouldn't be on Kaggle
otherwise).
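
That search is really just a loop over candidate architectures, comparing
validation error, along these lines (a rough sketch with placeholder file
and column names, in the same spirit as the snippet above):

    library(h2o)
    h2o.init(nthreads = -1)

    # Placeholder file names for a train/validation split.
    train <- h2o.importFile("training.csv")
    valid <- h2o.importFile("validation.csv")
    predictors <- setdiff(colnames(train), c("Ca", "P", "pH", "SOC", "Sand"))

    # Grow the network until validation error stops improving.
    architectures <- list(c(10), c(10, 10), c(10, 10, 10), c(50, 50, 50))
    for (h in architectures) {
      m <- h2o.deeplearning(x = predictors, y = "Ca",
                            training_frame = train, validation_frame = valid,
                            hidden = h, epochs = 100)
      cat(paste(h, collapse = "-"), "validation MSE:",
          h2o.mse(m, valid = TRUE), "\n")
    }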

