

Goodbye R scripts, hello R packages - RA_Fisher
http://statwonk.github.io/blog/2013/11/03/goodbye-scripts/

======
vikp
I love R, and it was actually the first language I really learned to program
with (for obvious reasons, I wouldn't ever recommend this). I can identify a
lot with the "one-off scripts" problem of R. When I look back at some of my R
code, I find that it is just a giant mess of commands mixed together semi-
randomly.

As I learned to program properly, I solved the reuse and "good code" problem
more by moving towards using Python than by making R packages. I occasionally
use R for data exploration and visualization, but Python has support for
almost all of the machine learning and statistical functions that I need.

I am very interested to hear how other people have solved this problem. Do you
only use R, or R in combination with Python/Julia/Java? If you only use R, are
you using it purely academically?

~~~
chubot
IMO the Unix philosophy of reuse is vital for data analysis. I recommend that
people in this space learn to use the interactive shell and shell scripts
well. I saw a comment that said "Shell is a REPL for C", which I think is
quite pithy.

People think too much about reusing R packages or Python packages. In my
experience you can get things done a lot faster if you factor things into
_programs_ and not libraries. This lets you use multiple languages, and all
real world analysis problems need multiple languages. (If you're only using
one language, then you're likely only working on part of the problem).

Right now my toolset is Python, C++, and R, coordinated with shell scripts. I
still like R for quick plotting, and of course data frames are essential, but
I'm playing with Pandas now, which seems impressive. I don't think Python will
ever catch up to R in terms of statistical functions, but in terms of
plotting/munging/data frames it might.

In the future there will be more languages, not fewer (Julia will add to the
number of languages, not replace any). So being able to decompose a problem
into separate, reasonably generic, programs is an important skill IMO. Most
data analysis pipelines are a huge mess, but they don't have to be.
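
A minimal sketch of what I mean by factoring into programs, with hypothetical
script names - each stage is a standalone program in whatever language fits,
and the shell just wires them together:

    #!/bin/sh
    # Each stage is a separate program, connected by plain files
    # rather than in-process bindings.
    set -e                                      # stop at the first failure
    
    python extract.py raw_logs/ > records.csv   # Python: parse and clean
    ./crunch records.csv > summary.csv          # C++ binary: heavy lifting
    Rscript plot.R summary.csv plots/           # R: data frames and plots

Any stage can be rerun or swapped out independently, which is the point.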

~~~
vikp
That's an interesting way to go about it. Any reason for using shell scripts
to coordinate the flow instead of things like Cython and RPy? (I don't shell
script a lot, so this may be a silly question)

These days, I mostly seem to be able to get away with using single languages
for applied machine learning, like a Python webserver that runs background
machine learning tasks, or an Android app in Java that connects to a webserver
written in Python. But back when I was doing less applied stuff, I was
similar to you in that things were spread across R and Python.

~~~
chubot
In my experience, RPy is annoying to get working, because you're dealing with
Python versions and R versions together. It's brittle and usually unnecessary,
since you can just serialize to CSV or JSON.
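
A minimal sketch of that hand-off, with hypothetical script names - the only
contract between the two sides is the file format:

    # Python writes the data; R reads it back. No shared runtime needed.
    python compute_features.py input.json > features.csv
    Rscript fit_model.R features.csv > predictions.csv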

If you can get away with using a single language that's good, and Python is
probably the only one where that is possible (i.e. you can write both prototype
and production code, and data ingestion and machine learning). But I often
have to use C++ because of the data size, and I think R's plotting is more
convenient than anything in Python now.

The shell scripts have their messiness and sharp edges, but they definitely
save me many many lines of code. There is always some weird thing that needs
to be integrated/automated and shell is almost always the right tool for that.

------
elohesra
Purely out of interest, what's so wrong with Windows as a development
environment? I've seen this claim touted before on HN, but I've never seen an
explanation (not doubting that there is one; I just haven't seen one). I
develop on Windows every day, and I'm always happy to learn a new way to
improve my workflow, so does Ubuntu (or Mac) actually make it easier to
develop software, and if so then how?

I'd also be interested to hear what kinds of software you've found it hard to
develop on Windows.

~~~
CraigJPerry
>> what's so wrong with Windows as a development environment?

Command line.

When you're doing a task on a computer, some paradigms are better suited than
others depending on the task.

E.g. imagine delicately touching up pixels in a photo, guided by your artistic
eye yet instructing the computer by laboriously typing many commands into a
command line. There's no reason this couldn't work; it's just way more natural
to use a pen & digitiser, or a mouse.

When we're developing, we're editing text and invoking commands. Editing text
is covered just fine by any major platform; a lot of popular editors & IDEs
are even cross-platform.

Invoking commands is sometimes covered by the IDE, and sometimes the commands
covered are pretty comprehensive (I'm thinking of Tom Christiansen's quip,
"Emacs is a nice operating system, but I prefer UNIX").

There's a limit to what tasks your IDE can cover, even though some do try
really hard to cover all your needs (Eclipse has a web browser in it!) and
when you hit that limit as we so often do in development, you need a command
line.

At that point, Windows kind of whistles while sheepishly looking off to one
side.

There's a related point about development dependencies: some libraries are
hard and/or time-consuming to build. A great package manager is a brilliant
productivity booster. Mostly, though, it's the point about the command line.

(Just to be clear, I have given PowerShell a good shot; I estimate I've
authored almost 1k lines of it.)

~~~
elohesra
Thanks for the reply.

What sort of command line tasks are you trying to invoke on both systems that
are easier to invoke on Unix? I can't think of any commands off the top of my
head that I've had to invoke recently, other than IIS commands. Maybe it's
just the fact that the C# developer's workflow is so heavily IDE-based. I can
imagine that if you were having to manually invoke the compiler etc., then a
useful command line would be a must.

~~~
CraigJPerry
No problem at all!

Searching is a big one. Looking through my shell history, I frequently use
egrep, although egrep is a bad example here since I could use the IDE to do a
search. The problem is that I frequently want to do something with the results
- maybe make a substitution in each of the result files, or compress the
files, or copy them to another host.
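
For instance, a sketch of the pattern with placeholder names (assumes GNU sed
for in-place editing):

    # make a substitution in every file that matches
    egrep -l 'old_name' src/*.R | xargs sed -i 's/old_name/new_name/g'
    
    # or compress the matches, or ship them to another host
    egrep -l 'old_name' src/*.R | xargs gzip
    egrep -l 'old_name' src/*.R | xargs -I{} scp {} otherhost:backup/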

I overuse the find command; locate would run faster in many cases, but find
is just a reflex. E.g. answering:

    find data/ -type f -mtime -1   # get me the working dataset from today

I see a lot of source control commands. This is something all the IDEs do,
but in my experience it's much more robust from the command line than in an
IDE.

There are a few 3-or-4-line scripts for various tasks I was doing manually.
To give an example, one kicks off a rebuild of my KVM virtual machine. (I
currently write a lot of CFEngine code, so this is a frequent thing when the
unit tests of the CFEngine code don't go to plan and leave the KVM borked!)
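
Roughly the shape of that script (hypothetical guest and image names; assumes
a libvirt setup with the virsh CLI):

    #!/bin/sh
    # Tear down the borked guest and boot again from a known-good image.
    virsh destroy cfengine-test 2>/dev/null || true
    cp /images/cfengine-test.golden.qcow2 \
       /var/lib/libvirt/images/cfengine-test.qcow2
    virsh start cfengine-test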

The biggest use case in my shell history is simple navigation to look at or
operate on various files. Fuzzy search in ST2 is slowly winning my heart here,
right enough, but it doesn't work on remote hosts, or even on local dirs
outside my project.

~~~
elohesra
Very interesting, thanks for the in-depth explanation.

I think that this does appear to be a fundamental difference between Windows
devs and Unix devs. All of the things you've listed there, I'd do through the
GUI on Windows.

Searching, obviously, I'd Windows+E (or Super+E in OS-non-specific keyboard
terms) to open up Explorer, then Tab+Tab to navigate to the search field, then
type the query. I can't think of an easy way to pipe a set of files into some
sort of function (e.g. for substitutions) outside of F#; maybe something
exists for that in PowerShell, but I find PowerShell to be a bit of a
poorly-documented mess.

Source control, again, I'd do through the GUI. With TortoiseHg, I've got my hg
commands available with a right click on a file/folder under a repository. I
can see how this'd seem significantly slower than doing it through the command
line, but the Windows shortcuts actually make it pretty snappy to navigate
through the file system in the GUI.

For kicking off a VM, I'd usually create a shortcut for any command I'm
looking to execute regularly, and then Windows+D my way to the desktop to
execute it.

Windows does seem to have made the decision early on (i.e. back in '95, when I
first started using it) to provide a clean and concise GUI first and foremost,
and then to add programmer/automation-friendly terminal APIs as an
afterthought. I wonder if the issues that Unix users on Windows have are
caused by the fact that Unix systems seemed to go the other way, and that
attempting to replicate a Unix experience on Windows would just lead to
frustration?

This is interesting, because I rarely get a chance to use Unix at work, nor do
I get much of a chance to speak to Unix devs (we're a Microsoft shop -- WPF,
ASP.NET MVC, and SharePoint if we're feeling masochistic). Thanks again for an
interesting snapshot into your workflow.

~~~
CraigJPerry
Likewise. I've just realised that I don't make effective use of the desktop
on Windows, but when you mention Win+D to get to a screen's worth of
cherry-picked shortcuts, it makes sense.

I never really put anything on my desktop. I do use the Win7 taskbar a lot,
but I reckon I'll start using the desktop too.

EDIT: meant to add, for piping in Windows: someone showed me something
similarly useful. If you drag a bunch of selected files onto a program icon,
the program can often make use of those files. E.g. drag some files onto the
Outlook icon and it'll compose a new email with them attached.

~~~
elohesra
With regards to piping, yes this _can sometimes_ be true. It's very dependent
on the actual program. Programs on Windows take a string array as the argument
to their main function, so if the program is written in such a way that it
iterates across every element of the string array argument and then does
something with them, then this'd work. Most of the Microsoft programs behave
sensibly with this, and execute an Open command against each of the files
dragged onto them, but this isn't guaranteed to be true.

I'd still like a nicer, general-purpose way of manipulating multiple files in
Windows. That does seem to be one place where it's definitely lacking, but it
could just be that I don't know how to do it.

------
EpiMath
Interesting, thanks. ( I like your username too, despite him being slightly
denigrated in the linked article. ) I personally think it is good to have a
basic grounding in some kind of fundamental "real" programming language
before you get into S-PLUS ( R ) or any other more domain-specific language.
The main reason being that you will better understand the compromises,
limitations and shortcomings. I like R but as a programmer I've seen some
awful code written in it!

~~~
RA_Fisher
Yes, if you read further on the blog, you'll see I'm more than slightly
denigrating. I recently read a book called The Cult of Statistical
Significance, and it's really turned me against Fisher. Not only do I see his
methods as inferior to Bayesian analysis, but he himself was really mean and
disparaging to those folks. It's interesting: my username is what it is
because he used to be my hero! Just 6 months ago, I really considered him to
be the father of modern science. Now I basically see his work as mostly
enabling mediocre scientists to rise through the ranks. This is what I mean by
that:
http://www.economist.com/blogs/graphicdetail/2013/10/daily-chart-2

~~~
EpiMath
I've spent my career thinking about these same issues. And certainly Fisher
was, well, abrasive to put it mildly. I've read much of his early work and I
think the intended context has mostly been lost. Many modern statisticians (
of the "frequentist" persuasion ) use a strange and awkward combination of
Fisherian and Neyman-Pearson methods. We talk about "p-values" but then
interpret them as hypothesis tests with long-run error probabilities. ( Fisher
disavowed this interpretation of p-values, and insisted they did not have a
long-run probabilistic interpretation but were a measure only of evidence
against the null in the particular experiment. ) I think Fisher gets a bad rap
for a lot of later bastardization of his work. ( I'm sure it did not help that
he was not a likable person, or that he gave valid but misguided testimony
about tobacco. )

Still, I'm sympathetic to your position and can understand how you'd come to
that way of thinking. I'm not convinced even a diehard Bayesian would
completely disagree with Fisher's more restricted and stringent interpretation
of p-values. Or at least they would see it as a big step up from the more
common usage you are referencing.

~~~
baldfat
When did ideas get muddied by personality? I see this in the whole Ender's
Game debate and in this Fisher debate. We need to consider ideas apart from
the personality; THEN we can separate the idea from the person.

In our history (I was a Historical Philosophy/Theology major) you would be
absolutely shocked at the lives of great thinkers, and you would disapprove of
so much in their lives, but MOST people don't know. This is the issue with
open lives: we know so much more about people. It is never just a book or an
idea; maybe we can learn too much???

