Ruby is great for data prep, basic calculations, web app development, and scrapi...

thauck · on Sept 12, 2012

I mostly agree with this, because I do like Ruby even though I don't use it.

The missing link here, and the reason Python gets more love from the data community, is that Python scales down to small data sets as well as it handles big ones. (Not sure if you meant it couldn't, but the distinction you make implies that.)

textminer · on Sept 12, 2012

Python is surprisingly heavy-duty. But my kingdom for a seamlessly distributed or parallelized version of NumPy/SciPy! How nice would it be to just enter "C = A * B", with A living as a sparse CSC across many nodes?
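To make the wish concrete, here is a minimal pure-Python sketch of the simpler matrix-vector case. It assumes the columns of a CSC matrix are split into shards, each shard computes a partial product, and the partials are summed. The shards are only simulated in a loop here; on a real cluster each would live on its own node. All names (`csc_matvec_partial`, `sharded_matvec`, the toy matrix) are hypothetical, and this is not how NumPy/SciPy or any existing library does it.

```python
# CSC layout: the nonzeros of column j are data[indptr[j]:indptr[j + 1]],
# and their row numbers are indices[indptr[j]:indptr[j + 1]].

def csc_matvec_partial(n_rows, data, indices, indptr, x, col_start, col_stop):
    """Partial product using only columns col_start..col_stop-1 (one 'node')."""
    y = [0.0] * n_rows
    for j in range(col_start, col_stop):
        for k in range(indptr[j], indptr[j + 1]):
            y[indices[k]] += data[k] * x[j]
    return y


def sharded_matvec(n_rows, data, indices, indptr, x, n_shards):
    """Split the columns into contiguous shards and sum the partial results."""
    n_cols = len(indptr) - 1
    step = max(1, -(-n_cols // n_shards))  # ceiling division, at least 1
    total = [0.0] * n_rows
    for start in range(0, n_cols, step):
        stop = min(start + step, n_cols)
        partial = csc_matvec_partial(n_rows, data, indices, indptr, x, start, stop)
        # On a cluster this sum would be a reduce step across nodes.
        total = [t + p for t, p in zip(total, partial)]
    return total


# Toy 3x3 matrix (hypothetical data):
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
DATA = [1.0, 4.0, 3.0, 2.0, 5.0]
INDICES = [0, 2, 1, 0, 2]
INDPTR = [0, 2, 3, 5]

result = sharded_matvec(3, DATA, INDICES, INDPTR, [1.0, 2.0, 3.0], n_shards=2)
```

Splitting by column fits CSC naturally, since each shard only touches its own slice of `data` and `indices`. The hard parts of a real distributed version (communication, load balance, sparse-times-sparse products) are exactly what this sketch leaves out.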

msellout · on Sept 12, 2012

Would Disco (http://discoproject.org/) work for you?

cdavid · on Sept 12, 2012

I don't think MR is a good abstraction for implementing linear algebra, and I expect the overhead to be too high (although I don't have numbers to back that up). For large problems (much larger than a couple of machines' worth of RAM), you use big-iron HPC solutions, or you avoid 'exact' linear algebra altogether and focus on one-pass algorithms.

For example, instead of computing an exact SVD, you will use something like a Hebbian algorithm to compute the SVD in a streaming manner (that's what Mahout implements, for example).
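As a rough illustration of the streaming idea, here is a minimal sketch of Oja's rule, the single-component special case of the generalized Hebbian algorithm. It estimates only the top right singular vector of the data, one sample at a time, without ever holding the full matrix. This is not Mahout's implementation; the function names, the toy stream, and the learning rate are assumptions chosen for the example.

```python
import math


def oja_top_component(stream, dim, learning_rate=0.01):
    """Estimate the dominant direction of zero-mean samples in one pass."""
    w = [1.0] + [0.0] * (dim - 1)  # arbitrary unit-length starting guess
    for x in stream:
        y = sum(wi * xi for wi, xi in zip(w, x))  # projection of x onto w
        # Hebbian update with Oja's decay term, which keeps |w| close to 1.
        w = [wi + learning_rate * y * (xi - y * wi) for wi, xi in zip(w, x)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def toy_stream(n_passes):
    """Hypothetical zero-mean data: large spread along (3, 1), small across it."""
    points = [(3.0, 1.0), (-3.0, -1.0), (-0.1, 0.3), (0.1, -0.3)]
    for _ in range(n_passes):
        for point in points:
            yield point


# The estimate should approach (3, 1) / sqrt(10), the dominant direction.
direction = oja_top_component(toy_stream(1000), dim=2)
```

Memory use is one vector of length `dim`, regardless of how many samples go by. A fixed learning rate works on this toy data; on real data it would need tuning or a decay schedule.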

Radim · on Sept 12, 2012

No, the sparse matrix code in SciPy is plain C (not even multi-core, let alone distributed).

EDIT: or did you mean Disco offers distributed sparse CSC operations?

hogu · on Sept 12, 2012

We (http://continuum.io/) are working on this.

EzGraphs · on Sept 12, 2012

Agreed, Python scales down and so is also good for small tasks. What I was saying is that Ruby - as much as I like it - does not generally scale up beyond a certain point.

Both R and Ruby have had issues with large data sets, which have been addressed to some degree in different distributions and more recent releases. Python is ready out of the box for large data sets. So what I meant to communicate is that if you know you are going to be dealing with a large data set, you might as well go straight to Python.

ucsd_surfNerd · on Sept 12, 2012

I totally agree that Ruby + R makes a great data toolbox. I use Ruby + R for almost all of my day-to-day exploratory data analysis work.

As far as Python goes, I think it has become more popular simply because it has more community around data applications. Unfortunately, a lot of people view Ruby as just Rails. I think academia's adoption of Python has also helped it grow into a data analysis language.

Really, any MR you do with Python you could do with Ruby, since it all goes through Hadoop Streaming.
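For anyone who hasn't seen it, the streaming contract is just text lines: the mapper reads raw lines and emits tab-separated key/value lines, the framework sorts them by key, and the reducer reads the sorted lines. A minimal word-count sketch, with the sort step simulated locally, is below. The function names and sample input are made up for illustration; in a real job the mapper and reducer would be separate scripts reading `sys.stdin` and printing to stdout, and the same shape ports directly to Ruby.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per word; assumes the input lines are already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


counts = run_locally(["Ruby and R", "Python and Ruby"])
```

Nothing here is specific to Python beyond the syntax, which is the point: any language that can read and write lines of text fits the same contract.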