Verbatim: Stats guru Jeff Sagarin (si.com)
32 points by jgarmon 1399 days ago | hide | past | web | 41 comments | favorite



The point of this submission being....?

The fact that someone runs statistics calculations in Fortran (a language whose most recent standard is dated 2008) should be shocking only to the recent generation of coders who have become "code hipsters." The guys looking for the Next Big Thing™ in programming languages.

These are the guys trying to cram Python into every possible goddamned use case, whether that be embedded systems or computational fluid dynamics. Whether it's the right tool for the job or not. The same guys who are implementing a Ruby interpreter in Haskell, running on a VM written in Rust. Why? Because damn it, their language of choice is shiny and new, and therefore it must be the best.

Nope. You use the best tool for the job. If you're already familiar with a language and that language is perfectly well-suited for the task at hand (in this case mathematics) then you use that language.

Fortran is old, which means it's seen a lot of development. The compiler optimizations are very, very efficient and Fortran code runs extremely fast. In fact, Fortran was explicitly designed for running mathematics algorithms on computers.

tl;dr Fortran is actually a great choice for this application and I hate hipsters.


These are the guys trying to cram Javascript into every possible goddamned use case

Fixed that for you.

Yes, Fortran is probably the right tool for the job in this case, but I also think that creative experimentation with programming languages, invention of new languages, new implementations of and enhancements to existing languages, are good and important, and not just a "hipster" thing. All modern programming languages and their implementations have warts and deficiencies, and building a new and improved one requires a level of skill and dedication that goes far beyond hipsterism.


It's a matter of taste. How many verbal languages can you speak and write? Computer languages aren't special, and polyglotism is really only a hallmark of CS-mindedness, which is actually yet another form of hipsterism. It's not a real thing, and the best language is the language the programmer is most productive in.


That really, really depends on what the language is being used for. There are a lot of desirable qualities and features, and no single language has even close to all of them. With the vast, vast number of different programming paradigms that are effective for solving different problems, as well as the various resource constraints of different computing platforms, a certain degree of "polyglotism" is absolutely necessary to be an effective, productive programmer. There's nothing "hipster" about it.


That really, really depends on what the language is being used for.

The work of stats guru Jeff Sagarin.


How many people in the world need to do the work of stats guru Jeff Sagarin? Is everyone else a hipster?


Now that I reread, I see that your comment was less a recommendation for Fortran-using Jeff Sagarin as it was a so-what digression. I took the bait, mea culpa.


Um, people have been doing high performance numeric and scientific computing in python for over 15 years now. The truth is, for a lot of numeric computing, you're better off using python, which calls fortran or c code to do the heavy lifting.

Fortran can run very fast. If all you need to do is numerical calculations, that might be enough. But that's never all you need to do: you need to preprocess your input data, or deserialize it, normalize it, join it or transform it based on some rdbms data, then you do your calculations, and then you need to graph it, serialize it, etc. Most of those tasks are somewhere between excruciating and impossible in fortran, however modern a dialect you use.
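That division of labor is easy to see with NumPy, whose linear-algebra routines dispatch to compiled LAPACK/BLAS under the hood (a minimal sketch; the random system is just illustrative):

```python
import numpy as np

# Build a random linear system Ax = b in Python...
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)

# ...but the actual solve is dispatched to compiled LAPACK
# (historically Fortran) under the hood.
x = np.linalg.solve(A, b)

# The residual is tiny
print(np.allclose(A @ x, b))
```

So "Python for numeric computing" in practice means Python orchestrating compiled number-crunching kernels, not pure-Python loops.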


>Um, people have been doing high performance numeric and scientific computing in python for over 15 years now.

Have I disputed this fact somewhere?

>The truth is, for a lot of numeric computing, you're better off using python, which calls fortran or c code to do the heavy lifting.

Proof that Python can't get the job done in that particular area. Nevertheless there remain Python zealots who will tell you that it's the best for everything.

>Fortran can run very fast. If all you need to do is numerical calculations, that might be enough.

Actually it sounds like that is all he needs to do. Reading and spitting out a .csv file from Fortran is trivial, despite its clunky IO syntax.


Have I disputed this fact somewhere?

Yes, when you were complaining that "These are the guys trying to cram Python into every possible goddamned use case".

Proof that Python can't get the job done in that particular area.

Python can't run at all without C code. Fortunately, Python interfaces really well with C and C++ and fortran. The fact that Python lets you be way more productive using fortran tools without having to know or suffer from fortran's glaring deficiencies seems like a huge plus to me.

Reading and spitting out a .csv file from Fortran is trivial, despite its clunky IO syntax.

Not really. People who do serious analysis need to make graphs. They need to push data back into SQL databases. They need to do all sorts of interactive analyses. They need to present their results and analytics to other people, including reviewers. They need to debug their analysis. All of that stuff is much easier with python than fortran.


People who do serious analysis need to make graphs

Which is where the aforementioned CSV files come in handy. Heard of GNUplot?
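For what it's worth, that CSV-to-gnuplot pipeline might look something like this (a sketch; the file name, columns, and ratings are made up for illustration, and the gnuplot commands are shown as comments):

```python
import csv

# Write results out as CSV from the number-crunching side
# (the ratings here are invented placeholder data).
rows = [("week", "rating"), (1, 85.2), (2, 87.1), (3, 86.4)]
with open("ratings.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# A gnuplot script can then pick the file up, e.g.:
#   set datafile separator ","
#   plot "ratings.csv" every ::1 using 1:2 with lines
```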


I used to work for a physicist who (still) codes in Fortran. I tried to show him Python. He said, "That's nice, use Python if that's what you like." But he always used fortran. It wasn't usually a big deal to translate between the two. He was giving out the big ideas, we were making use of them. To this day, he is still doing physics, still using fortran, still publishing papers.


> you need to preprocess your input data, or deserialize it, normalize it, join it or transform it based on some rdbms data, then you do your calculations, and then you need to graph it, serialize it, etc.

In large scale number crunching, like climate models, numerical weather prediction, the typical case is that input data is conceptually in a regular 2d or 3d grid, and stored in binary format files (like NetCDF or HDF), as that is more efficient and saves space.

Then the heavy lifting number crunching code runs on a cluster as a batch job, reads in the data, crunches the numbers, and writes results out again in NetCDF or HDF files.

The output files are then downloaded to a desktop PC, and graphing is done with Matlab (Python is also getting more popular) or especially in meteorology with some dedicated meteorology graphing software.

The binary format input and output is probably about as efficient as it can be. Also, heavy number crunching scientists probably don't have much use for relational databases.
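The read-crunch-write loop above can be sketched with NumPy's own binary format standing in for NetCDF/HDF (real codes would use libraries like netCDF4 or h5py; the grid values and file names here are illustrative):

```python
import numpy as np

# Simulate a small 3-D input grid landing in a binary file
# (e.g. 4 time steps of a 5x6 spatial field).
grid = np.linspace(250.0, 300.0, 4 * 5 * 6).reshape(4, 5, 6)
np.save("input_grid.npy", grid)

# Batch job: read, crunch (here, a trivial time-mean), write
data = np.load("input_grid.npy")
result = data.mean(axis=0)
np.save("output_grid.npy", result)

print(np.load("output_grid.npy").shape)  # (5, 6)
```

The output file would then be pulled down to a desktop for plotting, as described above.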


Is Fortran actually the best choice for this application? I don't know how heavy the calculations are for sports analytics, but I'm guessing it's less advanced than scientific simulations. Who cares how numerically efficient your code is if you use 10x the lines than a modern language and spend half your time struggling to maintain it.


On the sports analytics question I don't have any knowledge or experience, but I can see operations there turning into multi-matrix statistical correlations and such.
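To be clear, nobody outside knows Sagarin's actual formulas, but a generic least-squares (Massey-style) rating built from score margins gives the flavor of that kind of matrix work (toy, entirely made-up game data):

```python
import numpy as np

# Toy game results: (winner_index, loser_index, point_margin).
# Four teams; the data is invented for illustration.
games = [(0, 1, 7), (0, 2, 3), (1, 3, 10), (2, 3, 6), (2, 1, 2)]
n_teams = 4

# Each game contributes one equation: r_winner - r_loser = margin.
A = np.zeros((len(games) + 1, n_teams))
b = np.zeros(len(games) + 1)
for i, (w, l, margin) in enumerate(games):
    A[i, w], A[i, l], b[i] = 1.0, -1.0, margin
A[-1, :] = 1.0  # pin the ratings: force them to sum to zero

ratings, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.argmax(ratings))  # team 0 comes out on top here
```

Even this toy version is a dense linear-algebra problem, which is exactly the kind of thing Fortran (or LAPACK called from anywhere) eats for breakfast.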

On the subject of being the best language for the job, Fortran does not use 10x more lines than more modern languages. In fact, in many cases it uses less. When I was working on real-time signal processing algorithms we used to prototype algorithms in Fortran[1], then implement them in C. I also used to use Python with SciPy and NumPy for analysis, as well as Matlab. Matlab's huge toolbox library notwithstanding, the Fortran implementations were usually shorter, simpler, and easier to understand than any of the other languages. It was only bias and preconceived notions that kept it from being used in the real system. I would have loved to see Fortran subroutines running the algorithms underneath the C distributed framework.

[1] Until someone in management and the customer team decided to have the scientists write their algorithms directly in the real-time C. That is a rant for another day.


Not sure why code to crunch some statistics would have a huge maintenance cost. There goes one point out the window.

As for the first, a) I touched on the fact that Fortran is a modern language and b) unless you've got examples showing two side-by-side implementations of the same algorithm with Fortran being 10x longer, then it sounds like you're just another of the aforementioned hipsters who is experiencing aversion to something just because it is not new and shiny.


The language is only half the machine. The human is the other half. The question is not if Fortran is the best choice, it's is Fortran the best choice for him.


> Is Fortran actually the best choice for this application?

Fortran is like fast, typed, compiled Matlab (with smaller standard library, but a lot of numerical libraries available all over internet).

But yes, for exploratory data analysis, something interactive like NumPy + Matplotlib, Matlab or R might be better.

But NumPy + matplotlib is kinda painful to set up first to be interactive, R the language is wonky, and Matlab costs a lot of money (and the free alternative, Octave, has much more limited graphics, and they are kind of ugly, too).

Of course, Fortran has no graphing capabilities, so one needs some tool for that anyway.


Python is actually rather old at this point, dating back to 1991. But yes, I see your point: people treating languages like they do cellphones, expecting a new one every 2 years-- confused as to why you still use that 5 year old junker (whose battery lasts a month).


should be shocking only to the recent generation of coders who have become "code hipsters." The guys looking for the Next Big Thing™ in programming languages.

This was written in 1983: http://www.pbm.com/~lindahl/real.programmers.html

It contains among other things the sentiment "Unix is a glorified video game."

Most of the things you are using right now used to be hipster.


I'm guessing the title was chosen because the submitter thought coding in Fortran was stupid, or old, or whatever. The reality is that Fortran has been updated frequently and for what he's doing seems like a reasonable choice (plus he probably has a store of routines that he likes to use for certain analyses).

Old language != Bad language

From TFA:

"SI: Is it true you still code in Fortran?

JS: Yeah, what’s wrong with that? It’s a good language. Fortran is real good for doing mathematics and running it real quickly. I’m not doing Photoshopping or anything like that. I’m just running numbers."


the submitter thought coding in Fortran was stupid, or old, or whatever

I don't think it's stupid or old, but it's certainly interesting: in the world where hopping from one newest technology to another is normal, somebody sticking to what they are used to is noteworthy. Some other examples are Bob Staake using Photoshop 3.0 (http://www.bobstaake.com/pixfix/films.shtml) or OpenBSD using CVS (hehe) and writing their own version of it.

Maybe that's what the submitter thought.

(Yes, Fortran is still updated, but it's certainly not the cutting edge tech. Or maybe it will be, after this HN submission.)


> but it's certainly not the cutting edge tech

In a way, array slicing syntax has been with us since Algol 68, so it's not cutting edge. But still, Wikipedia only lists Algol, Matlab, R, Fortran, Sinclair Basic, Ada, Perl, Python and then coming to modern times, D and Go, as having that feature.

So it kind of feels like cutting edge in times when I'm lucky enough to be writing in a language that supports it.

http://en.wikipedia.org/wiki/Array_slicing
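For anyone who hasn't used it, here's the flavor of array slicing in NumPy terms (Fortran's own `a(2:5)` is analogous, though Fortran is 1-indexed and inclusive at both ends):

```python
import numpy as np

a = np.arange(10)        # [0 1 2 3 4 5 6 7 8 9]

print(a[2:5].tolist())   # [2, 3, 4]  contiguous slice
print(a[::2].tolist())   # [0, 2, 4, 6, 8]  strided slice

# Whole-array arithmetic on slices: first differences, no loop
print((a[1:] - a[:-1]).tolist())
```

Once you have this, loops over indices largely disappear from numeric code.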


in the world where hopping from one newest technology to another is normal

On the assumption that you're talking about programmers, I suspect it isn't. I suspect the majority of the world's programmers stay with the same few tools for years and years, sometimes (but not always) upgrading a version when they're forced to.


Sure, I didn't mean the World, I meant our small world of tech news / startups.


I agree with your observations, especially that the the expert has a built-up library of routines that he understands and that have proven themselves.

My question about using Fortran as the main infrastructure is that this does not sound like a big data problem. It sounds more like an exploratory data analysis and model development problem. In which case a REPL-based environment like Python, Matlab, or R would have a lot to offer: interactivity, integrated plotting, modeling toolboxes.

People in my group do scientific computing with Fortran, large datasets, and big iron, and it works great in that domain. As people in this thread have noted, there is significant support for automated parallelization. I used to snicker at these dinosaurs (their own self-deprecating term), but I have noticed their publication record seems to be piling up faster than mine. I have therefore gotten over it, and hope that the dinosaurs are not snickering back at me.


Fortran is still pretty much the only choice if you want the maximum possible performance for your parallel math computations. The only downside is that you have to vectorize a lot of things by hand if you aren't doing nice flat data parallelism.


The problem isn't with the programming language; it's with his algorithms. Sagarin's rankings are often outliers, frequently in ways that just don't make sense. </opinion>


Well then his algorithms would still be wonky if he wrote them in Python or Haskell.


Well, that's my point. When I read the interview, it sounded like his heart just wasn't in it anymore.


Fortran isn't dead, by far.

There is a lot of scientific and industrial code (in astronomy, aeronautics, ...) that is in Fortran. And I'm not talking about legacy stuff that is only in maintenance mode. These codes are still evolving and get new functionalities. They're not in Fortran only for legacy reasons (though that plays some part). They're in Fortran because Fortran is still the right tool for the job.

Scientific computing, where IO is pretty much never the bottleneck, is a completely different world from web development. Python/NumPy is usable for some tasks, but only because it delegates a lot of stuff to underlying Fortran code.


Yeah, anyone who has installed numpy, matplotlib, etc has probably noticed that you need a Fortran compiler to build the native extensions...


Actually, you don't. A fortran compiler is optional, and used mostly for f2py and I think(?) for the LAPACK bindings. (The fortran bindings are more efficient than the C bindings for reasons I don't remember at the moment.)

Numpy's (and matplotlib's, actually) core is written in C.

Scipy is a different story, however. A lot of scipy's internals are written in fortran.

That having been said, fortran is a _very_ nice language for a significant portion of scientific programming tasks. IO in fortran is rather painful, though, i.m.o. Tools like f2py (which still doesn't really support fully modern fortran, unfortunately) make it very easy to combine fortran and python, for whatever it's worth.


Yeah maybe I was thinking of scipy. I've only built numpy and scipy from source once, and it was to install scikit-learn, which has both of them as a dependency. I had to install gfortran first.


My favourite example is randomForest, probably the most often used implementation of Random Forest, one of the most popular machine learning method. And it is an R package calling C calling Fortran 77.


Absolutely nothing wrong with Fortran. In the 90s I worked on source-level optimizations in the HPF (High-Performance Fortran) compiler at Thinking Machines: not having a numerical processing background (and having only used Fortran for one course in college, which was Microsoft Fortran with each compiler phase on a separate floppy) it was fascinating to work on, and I found Fortran 90 to be pretty compelling in a lot of numeric operations. There is a lot of code out there that was written and debugged years ago and there is no reason to change.


If your emphasis is on crunching numbers Fortran is the best.


Or, rather, if your emphasis is on crunching numbers, and so many of them that performance is always a big issue, then Fortran is pretty much the best.

Not all number crunching is the same.


A lot of financial services academic modelling is still done in Fortran too. The old programs still run, and new results keep coming.

It was interesting that the NCAA doesn't include scores in their methodology, just wins and losses. Is that because they don't want to officially encourage teams to run up the score? It doesn't stop teams from doing that, but perhaps that's because they are trying to impress other audiences.


Like saying someone still codes in C.


So do most of the world's cosmologists, you know, boring stuff like the fabric of the universe, the beginning of time, the hunt for dark matter.

It's parallelisable - if that's a word.

Admittedly the stuff I have seen is all written without knowledge of the maxim "write your code as if the next person to fix your code is a mad axe murderer who knows where you live."

But they still unlock the secrets of the universe with it, even with unusual naming conventions.



