
Ruby is great for data prep, basic calculations, web app development, and scraping and aggregating data, and R is great for visualization. I find them to be a joyful combination: great for small data sets, quick estimations, and various small projects.

Larger data sets and performance-intensive operations are better handled in Python (or Java, C++, etc.). Lots of statistical analysis is way below the threshold of Hadoop and company.




I mostly agree with this, because I do like Ruby even though I don't use it.

The missing link here, and the reason Python gets more love from the data community, is that Python scales down to smaller data sets as well as it handles big ones. (Not sure if you meant it couldn't, but the distinction you make implies that.)


Python is surprisingly heavy-duty. But my kingdom for a seamlessly distributed or parallelized version of NumPy/SciPy! How nice would it be to just enter "C = A * B", with A living as a sparse CSC across many nodes?
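For reference, the single-machine version of that expression already works in SciPy. A minimal sketch with made-up 3x3 and 3x2 matrices (the values are purely illustrative, and nothing here is distributed):

```python
import numpy as np
from scipy import sparse

# Two small matrices stored in compressed sparse column (CSC) format.
# The values are made up for illustration.
A = sparse.csc_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 0.0, 3.0],
                                [4.0, 0.0, 0.0]]))
B = sparse.csc_matrix(np.array([[0.0, 1.0],
                                [5.0, 0.0],
                                [0.0, 2.0]]))

# Sparse matrix product on one node. With the scipy.sparse *_matrix classes,
# `A * B` is also a matrix product; `@` is unambiguous.
C = A @ B

print(C.toarray())
```

The wish in the comment is for this same one-liner to keep working when A is too large for one machine, which is the part SciPy does not cover.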


Would Disco (http://discoproject.org/) work for you?


I don't think MapReduce (MR) is a good abstraction for implementing linear algebra, and I expect the overhead to be too high (although I don't have numbers to back that up). For large problems (much more than a couple of machines' worth of RAM), you either use big-iron HPC solutions or avoid 'exact' linear algebra altogether and focus on one-pass algorithms.

For example, instead of computing an exact SVD, you would use something like a Hebbian algorithm to compute the SVD in a streaming manner (that's what Mahout implements, for example).
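To make the one-pass idea concrete, here is a rough sketch of a Hebbian-style update that estimates only the leading right singular vector while seeing each row once. It is an illustration under my own assumptions (synthetic data, a fixed learning rate, the hypothetical function name `streaming_top_singular_vector`), not Mahout's implementation:

```python
import numpy as np

def streaming_top_singular_vector(rows, dim, lr=0.005, seed=0):
    """Estimate the leading right singular vector in a single pass over rows.

    Each row triggers a Hebbian step followed by renormalization.
    `lr` is a hand-picked learning rate (an assumption, not a tuned value).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)
    for x in rows:
        y = w @ x                # projection of the row onto the current estimate
        w += lr * y * x          # Hebbian step: reinforce the direction of x
        w /= np.linalg.norm(w)   # keep the estimate at unit length
    return w

# Synthetic data whose first column has by far the largest variance.
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 3)) * np.array([5.0, 1.0, 0.5])

w = streaming_top_singular_vector(X, dim=3)

# Compare with the exact answer; only possible here because X fits in memory.
v_exact = np.linalg.svd(X, full_matrices=False)[2][0]
alignment = abs(w @ v_exact)     # 1.0 would mean identical direction (up to sign)
print(alignment)
```

The streaming estimate is approximate and keeps jittering with a fixed learning rate, which is the trade-off against an exact decomposition.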


No, the sparse matrix code in SciPy is plain C (not even multi-core, let alone distributed).

EDIT: or did you mean Disco offers distributed sparse CSC operations?


We (http://continuum.io/) are working on this.


Agreed, Python scales down and so is also good for small tasks. What I was saying is that Ruby - as much as I like it - does not generally scale up beyond a certain point.

Both R and Ruby have had issues with large data sets, which have been addressed to some degree in alternative distributions and more recent releases. Python is ready out of the box for large data sets. So what I meant to communicate is that if you know you are going to be dealing with a large data set, you might as well go straight to Python.


I totally agree that Ruby + R makes a great data toolbox. I use Ruby + R for almost all of my day-to-day exploratory data analysis work.

As far as Python goes, I think it has become more popular simply because it has more community around data applications. Unfortunately, a lot of people view Ruby as just Rails. I think academia's adoption of Python has also helped it grow into a data analysis language.

Really, any MR job you do with Python you could also do with Ruby, since it all runs through Hadoop Streaming.
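The reason the language barely matters is that a streaming mapper and reducer only exchange tab-separated text lines. A minimal word-count sketch in Python, with the shuffle step simulated locally by `sorted` (the function names and sample input are made up for illustration):

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; records must arrive grouped by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for map -> shuffle/sort -> reduce; no Hadoop involved.
text = ["Ruby and R", "Python and R"]
result = list(reduce_lines(sorted(map_lines(text))))
print(result)
```

In an actual job, each function would sit in its own script that reads `sys.stdin` and prints to stdout, passed to Hadoop Streaming through its `-mapper` and `-reducer` options. A Ruby script following the same line protocol would slot in the same way.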



