R in a 64 bit world (win-vector.com)
43 points by erehweb on June 9, 2015 | 13 comments



Coincidentally, R on Spark (SparkR) was also announced today: http://databricks.com/blog/2015/06/09/announcing-sparkr-r-on...

It will appear in Spark 1.4, letting you use R on a cluster of machines or on a single multicore machine.


I once encountered a problem in R trying to run a mixture model that depended on some underlying Fortran code; the Fortran code couldn't handle the initial size of the value to be minimized.

The only solution I found was to completely rewrite the code in Python to avoid the problem. I was chuckling for a while about hitting an unsolvable Fortran problem in 2014.


That's a puzzling bug as Python floats are just doubles, which Fortran definitely has. But I identify with the larger point: if you do scientific computing for any amount of time you will run into Fortran code. Old, convoluted, unmaintainable, and often wicked fast and devoid of bugs. A really eye-popping amount of netlib, *pack, etc. being used in production right this very moment either relies on Fortran routines, or is calling C code that was ported from an equivalent Fortran routine. It's the result of some really smart people putting in a lot of time and effort over the past 30 years; if it ain't broke...
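If the old routine used 32-bit REALs, the "initial size" failure is easy to reproduce from Python, where floats are 64-bit doubles. A minimal NumPy sketch (the single-precision assumption about the original Fortran is mine):

```python
import numpy as np

# IEEE single precision tops out around 3.4e38; doubles reach ~1.8e308.
big = 1e39

single = np.float32(big)  # overflows to inf
double = np.float64(big)  # representable without trouble

print(np.isinf(single), np.isinf(double))  # True False
```

So a value that is perfectly ordinary to Python (or to Fortran's double precision) can blow up a routine hard-wired to single precision.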


> wicked fast and devoid of bugs

Having ported a lot of old Fortran to modern Fortran and C, I wish that were the case. More often than not that's the assumption, but it's rarely true. There are some great lower-level libraries like the *packs, but for the most part the performance and bug-freeness of these codes are more dogma than reality. For instance, in MCNP (a large Monte Carlo neutronics package) a coworker of mine replaced the 70s-era hand-rolled FFT function (which looked like someone was trying to write assembly in Fortran) with a modern library for a 500%+ performance gain.
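The shape of that win is easy to sketch. This is not MCNP's actual code, just a hypothetical hand-rolled O(n^2) DFT next to a modern library call:

```python
import numpy as np

def naive_dft(x):
    # Textbook O(n^2) DFT: the kind of direct matrix-times-vector
    # computation an old hand-rolled routine might implement.
    n = len(x)
    k = np.arange(n)
    m = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return m @ x

x = np.random.default_rng(0).standard_normal(256)

# Same answer, but np.fft.fft runs in O(n log n) and carries
# decades of library-level optimization.
assert np.allclose(naive_dft(x), np.fft.fft(x))
```

Matching results on a test vector like this is also the easiest way to gain confidence before swapping out the legacy routine.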


As a friend of mine in academia is discovering...it can be quite hard to tell if it is broke.


About two years ago I encountered a bizarre bug where, if I asked SciPy for the eigenvalues of a particular small matrix that I was using as a test case, it would consistently give a different result on my desktop computer than anywhere else. But when I tried to isolate the test case, it went away. It would only happen if I ran three other test cases in a particular order first. Or if I ran that one test case 12 times in a row, it would fail the twelfth time.

I really wanted to find out what was going on. I looked through the code, from SciPy to ARPACK to the underlying ATLAS calls, at which point it became completely opaque to me.

I still don't know whether it was the fault of ARPACK or ATLAS or what, but I just put the test cases in a different order, they consistently passed in that order just like they passed for everyone else, and a few system upgrades later the problem didn't happen anymore.
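For what it's worth, ARPACK's iteration starts from a random vector unless you supply one, which is a common source of run-to-run differences in SciPy's sparse eigensolvers. A hedged sketch, assuming SciPy is available (the matrix here is a stand-in, not the original test case):

```python
import numpy as np
from scipy.sparse.linalg import eigs

rng = np.random.default_rng(42)
a = rng.standard_normal((20, 20))
a = a + a.T  # symmetric, so the spectrum is real

# Pinning the starting vector v0 makes the ARPACK iteration
# reproducible; omit it and the starting point is random.
vals, _ = eigs(a, k=3, v0=np.ones(20))

# For a small matrix, cross-check against the dense LAPACK path.
dense = np.linalg.eigvalsh(a)
top3 = sorted(dense, key=abs, reverse=True)[:3]
assert np.allclose(sorted(vals.real), sorted(top3))
```

It wouldn't explain an order-dependent failure like the one above, but fixing v0 at least removes one source of irreproducibility.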


ATLAS compiles differently on different machines ("Automatically Tuned Linear Algebra Software") so this doesn't come as a complete surprise. Agree that that's a really annoying bug though. Do you happen to remember the matrix?


A similar thing happened to me: I would get some kind of internal error in a wrapped ARPACK call that would crash the entire Python process. It was reliable in that a given sequence of events would always trigger a crash, but not reliable in that we couldn't really alter those events to minimize a test case. We never bothered reporting it.


I remember once trying to get an older professor on board with the concept of unit testing. He couldn't wrap his head around the idea. What kind of madmen were the Computer Science department hiring that programmers were wasting their time writing code so trivial that you knew what it would return before it ever ran?


I found that the way to sell unit testing to academics is the following: a unit test isn't a proof that the code works on all inputs, it's a demonstration that there exists at least one set of inputs the code works for. As we know, a lot of code fails even that seemingly easy weaker condition.
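A concrete version of that pitch, with a hypothetical sample-variance routine (the function and the test are mine, purely for illustration):

```python
def sample_variance(xs):
    # Sample variance with Bessel's correction (divide by n - 1).
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def test_sample_variance():
    # Not a proof of correctness on all inputs -- a demonstration
    # that at least one known input/output pair works.
    assert sample_variance([1, 2, 3]) == 1.0

test_sample_variance()
```

A failing version of that one assert is exactly how the classic n vs. n - 1 confusion gets caught.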


Fortran 77 code is supposed to also be legal Fortran 95/03/08 code, so you could have just changed all the 32-bit REALs in the F77 code to real(dp) or something.

But maybe recompiling the Fortran library you used would have been tricky for you.


While I'm not a data scientist, I have been doing Euler problems in Julia for fun and the situation seems better there, at least when it comes to the foundations.


Definitely.




