
Ask HN: APL family instead of R for data analysis? - wrp
I've read a lot about the APL languages, but never used them because I felt they were too specialized. Recently, I've been roped into some projects using R. I fully understand why R is so popular in statistics departments, but programming in R has been the most migraine-inducing experience I've ever had with a language.

Since R is an array language, I thought maybe an APL family language would work as a replacement. It wouldn't have the multitude of libraries that R has, but would have a clean design. Note that I am not thinking of this as an alternative for the typical R user, who is a non-programmer and for whom R is their first and possibly last language. I'm thinking of a moderately experienced programmer who is faced with a data analysis task.

Some APL derivatives, like J and Dyalog APL, have had their capabilities extended well beyond simple array juggling. How are they, compared to R, as data analysis environments?
======
tlack
I'm having a blast playing with the free 32-bit version of Q/K/Kdb+ these days
in the context of a medium-sized website which requires a good deal of data
analysis.

As you probably know, Q/K are in the lineage of APL languages, with strong
inherent vector-oriented capabilities, but Q uses an SQL-like dialect of words
to replace APL's symbols.

I've always found the idea of APL exciting, but it seemed to be a very
isolated platform. "How do I use this on my site?"

Q speaks SQL natively (using s.k) so it's easy to work into my MySQL-based
flow, though not exactly the same as MySQL's syntax.

It's got great built-in facilities for bulk data loading and storing, which
saves you a lot of time with the boring "getting my data set up" step when
banging out little helper scripts.

I've begun using Q as a cache in place of memcache/redis because I much prefer
being able to ask flexible questions with a dynamic query language. Example: A
MySQL query on a 1m row table that was often showing up in my slow query log
at 1sec+ took less than 50ms in Q.

Obviously Q and MySQL are apples to oranges, but it is still a handy adjunct
when you need serious speed without full loss of flexibility.

And writing Q directly is a really interesting exercise. My PHP and Node code
has gotten better as a result.

~~~
wrp
I have wondered about K/Q, because I gather that K is actually a
simplification relative to APL, to make it more focused and easier to use. I
thought it might be missing useful facilities for data munging. Like, I don't
think it has regexes, does it?

~~~
tlack
It has a limited version of regular expressions available in the "like" and
"ss" functions:
[http://code.kx.com/wiki/Reference/like](http://code.kx.com/wiki/Reference/like)

There's also a port of re2 available:
[http://code.kx.com/wiki/Cookbook/regex](http://code.kx.com/wiki/Cookbook/regex)

The C API seems pretty simple once you get over the insane function (and
macro!) names, so I think a pcre port could be done.

In practice, though, I think Q gods prefer a split-based, vector-oriented
approach when it comes to string chomping. Q's built-in verbs and adverbs
function much like regex operators do, but at an abstraction level tied
fundamentally to your data.
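
As a rough illustration of that split-based approach, here is a sketch in
Python rather than Q. The log format and field names are made up for the
example, and this only approximates what a Q one-liner would do:

```python
# Hypothetical log lines; the format is an assumption for illustration only.
lines = [
    "2015-03-01 GET /index.html 200",
    "2015-03-01 POST /login 302",
    "2015-03-02 GET /about.html 404",
]

# Split every line on spaces, then transpose the rows into columns,
# so each field becomes a vector that can be handled as a whole.
rows = [line.split(" ") for line in lines]
dates, methods, paths, statuses = (list(col) for col in zip(*rows))

# Whole-column operations take the place of per-line pattern matching.
status_codes = [int(s) for s in statuses]
error_paths = [p for p, code in zip(paths, status_codes) if code >= 400]
print(error_paths)
```

The point is that once the text is split into columns, no regex is needed to
ask questions about any one field.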

------
tpetricek
R is an... _interesting_ language. Its main power really comes from the
comprehensive libraries that are available. I think there is an interesting
option if other languages can figure out how to make the R libraries easily
accessible in their environments, which is not that easy, because even the
libraries can use the _interesting_ R design features :-).

That said, in F# (which is a functional-first language with some solid
programming language design background), one can call R functions in a fairly
nice way using a type provider:
[http://bluemountaincapital.github.io/FSharpRProvider/](http://bluemountaincapital.github.io/FSharpRProvider/).

This probably cannot replace R for typical statisticians, but it is a nice
option for programmers...

~~~
mziel
Exactly! There are many good languages for data analysis (Julia is one
example) but no other language (not even Python, which is BTW leaps and bounds
above all the other alternatives) can match the library support and
statistical maturity that R has.

And R is not so bad as a language. Its OO is really bad, but it has functions
as first-class members and great support for NAs. Combined with the Hadleyverse
and the pipe operator (magrittr), it makes for very readable code.

~~~
wrp
I think the value of all the libraries available for R is overrated (even
ignoring the buggy ones). Most of the time, you don't need that kind of
choice.

I just finished reading _Data Analysis_ by Peter Huber. One chapter is about
programming languages for statistics. In the language that he developed (ISP),
he included only a small core of statistical facilities because in his
experience that's all people really need. David Hand has also written
observations along the same line.

~~~
mziel
It really depends on what you're doing. If you're a developer tapping into
data analysis, sure. But I'm building predictive models for a living and I
evaluated quite a few different frameworks and languages.

Sophisticated algorithms for missing data, sparse solutions, latent modelling:
you can find many industrial-grade algorithms in Python, Spark, Julia, Weka
etc., but for stats-heavy data science/machine learning R is unmatched. The
alternative is writing your own implementations based on pseudocode, but this
is really not an effective use of your time for prototyping.

EDIT: And even for simple stuff, the strength of R packages is easy to prove
since they are either directly ported to other languages (ggplot) or heavily
influence the implementation (pandas).

~~~
Lofkin
With PyMC3, you can code a Bayesian generalization of most of those packages
pretty easily. For Bayesian models, it is easier than calling into C++ with
Stan in R.

For everything else, one can call arbitrary R packages with rpy2, albeit with
clunkier syntax.

For agent-based modeling, there is no comparison. Python with its better and
faster class system and Numba is in another league.

Finally, Python has distributed and out-of-core lists, arrays and dataframes:
[http://dask.pydata.org/en/latest/](http://dask.pydata.org/en/latest/)

------
chubot
Are you using Hadley Wickham's libraries in R? I think those go a long way
toward fixing the problems you may have encountered -- i.e. API consistency,
usability, orthogonality, naming, etc.

I also suggest reading some of his papers like "Tidy Data".

R is weird in that you NEED a large set of CRAN libraries to do work, whereas
in Python you can do a fair amount with the standard library, and PyPI is
relatively weak in comparison.

I'm also a programmer who learned R for data analysis. It does take quite
a while and some pain. Part of the difficulty is that R is weird and has warts,
but an even bigger difficulty is that you are learning new concepts and
programming abstractions (fundamental difficulty vs accidental difficulty). R
passes the test of a language which changes the way you think.

R has its warts but is likely the most practical choice. It has the full set
of things you need -- data preparation and cleaning, exploratory analysis,
model building, and visualization.

~~~
keithpeter
May I venture to add that R-based applications may be more acceptable to
clients as they can be sure that there is a large community that can at least
understand how the application links into the environment, even if client
employees might not fully understand the logic of the algorithm.

Disclaimer: I'm a civilian, not a programmer

------
gd1
I use both. R is slow as a dog, Q/KDB+ is (almost) as quick as C. R is ugly,
Q/K is beautiful and elegant. R has a wealth of open-source packages, Q not so
much.

Best solution is to do most of your work in Q and call into R when you want to
use a package. The R integration works well, see
[http://code.kx.com/wiki/Cookbook/IntegratingWithR](http://code.kx.com/wiki/Cookbook/IntegratingWithR).

The reality is that if you are using (and paying for) Q/K, you are likely
doing so because you're dealing with billion+ point datasets. At least. R just
melts and falls apart at that kind of size (without using native code, which
isn't really R then, is it?). So I often end up just translating R functions
into K equivalents, and just using R for the nice-to-have features like
LaTeX/brew/ggplot/etc.

~~~
wrp
Good comments. The way I have viewed the role of APL languages is that you use a
general purpose language to do everything up to the point where you have a
clean data array, then send it to APL for array juggling, then take the result
back into the GP language for all follow-up tasks. Does this sound like the
workflow for Q/K and other APL language users? My reading of IBM's APL2
literature is that they intended this workflow with Tcl as the GP language.

The two big things I'm trying to get here are a comparison of the APL family
with R in that role, and a comparison of how well each reaches beyond that
role to handle the stuff before and after.
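
To make the three-stage workflow concrete, here is a minimal sketch, all in
Python for simplicity. The input data and column names are invented, and the
`column_means` function is only a stand-in for whatever the array language
would actually do in the middle step:

```python
import csv
import io
from statistics import mean

# Hypothetical raw input with stray whitespace and a missing value;
# the column names and data are invented for this sketch.
raw = """name, height, weight
alice, 1.62 , 55
bob,, 80
carol, 1.75, 68
"""

# Step 1 (general-purpose language): clean the records into a rectangular
# numeric array, dropping rows that have a missing field.
reader = csv.reader(io.StringIO(raw))
header = [h.strip() for h in next(reader)]
records = [[f.strip() for f in row] for row in reader if row]
complete = [r for r in records if all(r)]
array = [[float(x) for x in r[1:]] for r in complete]

# Step 2 (stand-in for the array language): operate on whole columns.
def column_means(matrix):
    return [mean(col) for col in zip(*matrix)]

means = column_means(array)

# Step 3 (general-purpose language again): label and report the results.
report = dict(zip(header[1:], means))
print(report)
```

The open question is how much of steps 1 and 3 an APL-family language can
comfortably absorb, compared to R.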

------
BMarkmann
You could just blow everyone's mind and sneak APL _into_ R:

[http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSIp...](http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSIpMjAxMi8wNS8xNy8xNF8yNF8yOV8zNjJfQVBMX2luX1IucGRmBjoGRVQ/APL%20in%20R.pdf)

~~~
wrp
Interesting. He implemented APL array functions in R. Since there is some
mismatch in how APL and R handle things, I wonder if it doesn't make coding
even hairier.

------
GFK_of_xmaspast
Given a choice between 'clean design' (which is certainly not a truth
universally acknowledged) and 'getting stuff done with libraries that somebody
else did the hard work on' I'll choose R any day of the week.

------
avmich
J has a pretty good, though perhaps small, community on the Jsoftware forums.
J libraries are small but ever growing. Are there particular examples of R
libraries which are sorely missing in J?

I should admit I was asking myself the question of this topic.

~~~
qznc
J is also amazingly active on Rosetta Code. Right now at place 4 with 789
implemented challenges. Only Tcl, Racket, and Python are ahead.

~~~
GFK_of_xmaspast
"Almost as popular as tcl and racket" is one hell of a selling point.

~~~
avmich
Here it means "more popular than Lisp, Java, C++..." - so context matters.

~~~
GFK_of_xmaspast
The existence of an index in which J is #4 behind Tcl and Racket should not
lead one to think 'wow, J is really catching on' but rather 'wow, that's an
unusual index.'

~~~
avmich
I think looking at Rosetta Code you shouldn't draw conclusions about selling
points :) . Rather you can use RC as a repository of J programs, which happens
to have many examples compared to the repositories for other languages on the
same RC.

If you're familiar with RC, you know that it encourages participation with
unusual languages, as it's generally hard to find lots of examples of code for
them. RC is a code chrestomathy site, but for Java and the like it's easy to
find many good examples elsewhere. So rare languages comparatively shine on RC.

------
kcl
You might give Kerf a try:
[https://github.com/kevinlawler/kerf](https://github.com/kevinlawler/kerf)

~~~
keenerd
I was going to suggest that Kona would be a better fit, but you would probably
know which is best.

Thanks also for adding a Linux binary today, looking forward to playing with
it. Can't say I'm as excited about the 1 month timer though.

~~~
kcl
If you want an unrestricted version send me an email at k.concerns@gmail.com
(this goes for anyone here)

------
brudgers
The advantage of R is its robust community. The "Array processing languages"
family [APL, J, K] is a radically smaller community.

~~~
wrp
It really doesn't ease the pain of dealing with an awkward language. I think
the general lack of technical sophistication in the R community just adds to
the aggravation. So as far as community experience goes, I might actually rate
the APL-family higher.

~~~
brudgers
As a J fanboi, I can't say I don't sympathize, and J has perhaps the best
documentation of any language I've come across. Yet, R is probably a better
choice for general collaboration simply due to its ubiquity.

~~~
wrp
Oh yeah, I wouldn't even consider APL for collaborating with non-programmers.
I'm thinking of it just as a personal tool.

With R, you can do data preparation, but I find it easier to do it with Perl
or awk and then import the cleaned data into R. Is a standard workflow with J
pretty
much the same? I thought that with PCRE and other libs available in J, it
might be suitable for more than just juggling cleaned data.

------
baconner
There are some other alternatives that I suspect you'll be happier with if
your main issue is the R language. Python + pandas or Julia for instance.
Smaller communities around those but less aggravating languages for sure.

~~~
wrp
Python is definitely an alternative that seems to be taking the hard sciences
by storm. I just didn't mention it because I don't need any help in evaluating
that option.

------
chrisocowan
I did a lot of work with device models and SPC, while working in VLSI Design.
We had SAS and APL2 (with a companion stats package called GraphStat). The
APL2/GraphStat combo was awesome, and I always found myself going to APL2 when
I really needed to do some analysis. But, APL2 withered on the vine long ago.

After some recent forays into Clojure, Haskell, and R, I find myself getting
re-interested in languages in the APL family.

I had read about Q, K and Kdb+ a while ago. Guess I'm going to have to kick
the tires.

~~~
codygman
Sadly Q and K are both proprietary.

------
caseyf7
Really, you're going to call the typical R user a non-programmer? R has become
so popular because there are many phenomenal programmers working in R.

------
msravi
I use Julia ([http://julialang.org/](http://julialang.org/)) and have found
that it brings the best of multiple worlds into one neat platform. There are
the native Julia functions, but there are interfaces to C, Python, and R.
Plus, programming in the language is a pleasure.

~~~
jeo1234
You haven't found it to be a little rough around the edges?

------
Lofkin
For huge datasets, Python has distributed and out-of-core data structures:
[http://dask.pydata.org/en/latest/](http://dask.pydata.org/en/latest/)
[https://github.com/ContinuumIO/blaze](https://github.com/ContinuumIO/blaze)

This is pretty unique, and works better than Spark for out-of-core work on one
machine (and is easier to set up).

For stats not in the statsmodels and scikit-learn packages, you can easily whip
up the Bayesian generalization in PyMC3.

Then if there is another R package you need, you can use rpy2 to call it.

Not sure if this would be relevant to your use case.

------
ZenoArrow
There are probably plenty of good options, but have you considered F#? It's
got some good support for this type of work, including an R Type Provider (and
I'm not talking about something that swallows quarters!). Here's a useful
summary of some of the key features F# has access to for data science work:
[http://fsharp.org/guides/data-science/index.html](http://fsharp.org/guides/data-science/index.html)

~~~
wrp
I am slightly familiar with SML, OCaml, and F#. I like F# and might use it
sometimes if I were interested in developing on .NET/Mono. However, while F#
is good for numerical analysis, it is not so good for the kind of interactive,
exploratory work that Matlab and R are designed for.

------
MageSlayer
Just in case someone is really interested in an APL expert (Dyalog APL) and/or
a remote job, please contact me.

~~~
genieyclo
How? There is no contact information in your profile.

~~~
MageSlayer
Hm. I filled in email and duplicated it in description.

------
davelnewton
No way; use Clojure:
[https://github.com/incanter/incanter](https://github.com/incanter/incanter)

