

Factors are not first-class citizens in R - jmount
http://www.win-vector.com/blog/2014/09/factors-are-not-first-class-citizens-in-r/

======
jzwinck
Here's my favorite way to get factors in your R program without wanting them:
load an HDF5 file.

Python's excellent h5py module and the official HDF5 tools like "h5dump"
understand that a dataset is boolean if it has the type H5T_ENUM with two
values "FALSE" (0) and "TRUE" (1).

But R works a different way: if you save a boolean vector from R to HDF5
(using the rhdf5 package), it will create a dataset of type H5T_STD_I32LE,
which takes 4x the storage space. And if you read the boolean vector as
understood by other tools, you will get factors. Then, the most amazing thing
happens: R uses the string names of the enum ("FALSE" and "TRUE") as factor
levels, but stores them as the integer codes 1 and 2. So if you do
as.integer(h5read(...)) you will get a vector of ones and twos instead of
zeros and ones. And of course R
itself treats 1 and 2 as truthy values, so now your data is well and truly
corrupted: all the elements are truthy!

That is: as.logical(as.integer(h5read(...))) always produces all TRUE values
when the input is boolean written by a program other than R. This is a lovely
source of bugs in real software.
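
The pitfall can be reproduced without HDF5 at all. The sketch below is only a stand-in: it assumes, as described above, that the read returns a factor with levels "FALSE" and "TRUE", and builds that factor by hand (the variable names are made up for illustration).

```r
# Stand-in for the factor the comment says h5read() returns for an
# H5T_ENUM boolean dataset: levels are the enum names.
flags <- factor(c("FALSE", "TRUE", "FALSE"), levels = c("FALSE", "TRUE"))

# as.integer() on a factor returns the 1-based level codes, not 0/1.
codes <- as.integer(flags)

# Every non-zero integer coerces to TRUE, so all elements come out TRUE.
wrong <- as.logical(codes)

# Going through the level labels recovers the intended values.
right <- as.logical(as.character(flags))

print(codes)
print(wrong)
print(right)
```

Converting through as.character() first is the usual way to get at a factor's labels instead of its internal codes.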

~~~
mapcar
Is rhdf5 the best library? I understand there is ncdf4 and h5r, and rgdal. I
understood that R support for HDF5 was limited so I have not used R much for
this purpose.

~~~
jzwinck
h5r has been removed from CRAN "due to licensing problems"[1]. ncdf4 and rgdal
use totally different formats. rhdf5 may not be the best library, but it seems
to be the only library.

I don't think R support for HDF5 is "limited," I think it's just poorly
implemented in a few ways (another is that you need a separate package to get
64-bit integer support, which is a problem with R in general).

[1]
[http://cran.r-project.org/web/packages/h5r/index.html](http://cran.r-project.org/web/packages/h5r/index.html)

------
mapcar
This is really well-written. I was skeptical about the title's claim, but
the author defines what "first-class citizen" means and shows a wide range of
cases where the behavior of factors is inconsistent.

I have gotten used to their behavior, but only because I use factors
sparingly, having set stringsAsFactors=FALSE as my default. But
reshape2::melt() requires a separate argument (factorsAsStrings) if you don't
want automatic conversion, and with plyr::ldply() you can't prevent the index
column conversion to a factor at all. So factors still creep into my data
frames and burn me every now and then.
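
One defensive workaround, sketched in base R only: convert any factor columns back to character after the call that introduced them. The helper name `unfactor` is hypothetical, and the example data frame is built by hand, not produced by melt() or ldply().

```r
# Hypothetical helper: return a copy of df with every factor column
# converted to character; other columns are left untouched.
unfactor <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  df
}

# Build a data frame with a factor column explicitly, so the example
# behaves the same regardless of the session's default.
df <- data.frame(id = c("a", "b"), n = 1:2, stringsAsFactors = TRUE)
clean <- unfactor(df)
str(clean)
```

Assigning into `df[]` keeps the data frame structure and column names while replacing the column contents.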

~~~
craigching
> But reshape2::melt() requires a separate argument (factorsAsStrings) if you
> don't want automatic conversion, and with plyr::ldply() you can't prevent
> the index column conversion to a factor at all. So factors still creep into
> my data frames and burn me every now and then.

Oh, wow, thanks for that. I generally prefer data.table to plyr, but I do use
reshape2 of course so that's great to know.

~~~
rz2k
It's worth taking a look at `dplyr`. You can work with data tables or data
frames (or a database), and the functions are in the same neighborhood as
data.table functions in terms of speed. Also, in addition to `reshape2`, there
is `tidyr`.

~~~
craigching
Yep, I know about dplyr, I plan on spending some more time with that. Thanks
for the 'tidyr' reference, I haven't heard of that one yet and will check it
out!

------
JasonCEC
I can't even describe how much I wish `stringsAsFactors=F` was the default.

~~~
mapcar
They've agreed to do this site-wide at the Mayo Clinic.

[https://stat.ethz.ch/pipermail/r-help/2007-February/125438.h...](https://stat.ethz.ch/pipermail/r-help/2007-February/125438.html)

I have adopted this approach as well, and for all the scripts I distribute I
set options(stringsAsFactors=FALSE) at the top.

~~~
craigching
I'm still a bit of an R neophyte but, yes, stringsAsFactors=FALSE seems so
much more intuitive to me. I tend to assume that's the default, so I write
code that explicitly converts strings to factors where appropriate (which the
real default makes unnecessary), and then get confused trying to figure out
why some columns are factors and some are not.

I'm going to look more into setting this option, but is it consumable by
others? I write R that will be used by customers; presumably I can set
this at the top of my R source files?

~~~
mapcar
Again, if you remember to include options(stringsAsFactors=FALSE) at the top
of your scripts, they are transferable. With other R neophytes who only run
code that I write, I tell them to run

cat("options(stringsAsFactors=FALSE)\n",file="~/.Rprofile",append=TRUE)

at the R console.

~~~
jghn
And I'd tell you no.

And to the GP, adding that to consumable code, particularly a distributed
package, is generally seen as impolite behavior. What happens if I load your
package and then another one which assumes the standard default?

If you absolutely must mess with people's options, the way to do it is to
capture the original state, modify the option, make your affected call and
then restore the original options.

I do agree that for my purposes I generally would prefer that as the default
but even after using R for ~15 years I don't see it as some huge burden to
specify it each time.
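
A minimal sketch of that capture/modify/restore pattern, assuming the affected call is read.csv(); the function name `read_csv_no_factors` is hypothetical. It relies on options() returning the previous values of whatever it sets, and on on.exit() to restore them even if the call fails. (Since R 4.0.0 the default for stringsAsFactors is FALSE anyway, so this mostly matters on older versions.)

```r
# Hypothetical wrapper: read a CSV with strings kept as character,
# without permanently changing the caller's global options.
read_csv_no_factors <- function(path) {
  old <- options(stringsAsFactors = FALSE)  # options() returns the old values
  on.exit(options(old), add = TRUE)         # restore on exit, even on error
  read.csv(path)
}

# Usage with a throwaway file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,n", "a,1", "b,2"), tmp)
before <- getOption("stringsAsFactors")
df <- read_csv_no_factors(tmp)
str(df)
```

Passing stringsAsFactors = FALSE directly to read.csv() is simpler when the call accepts it; the wrapper is only worth it for code that reads the global option and offers no argument.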

------
craigching
An aside, on the sidebar there is a link to "Practical Data Science with R",
presumably this blog belongs to the author? Apologies for my inexperience
here, but can anyone recommend that book? I have "R in Action" and am
subscribed to the MEAP for the next edition, is this book a good companion?
Appreciate any responses!

~~~
jmount
One of the blog/book authors here (John Mount). Yes the book and blog are both
by Nina Zumel and John Mount. Obviously my opinion is biased, but here is a
bit anyway. "R in Action" is one of our favorite R books (that and "The Art of
R Programming"). "Practical Data Science with R" is more about working
examples of data science (lots of the grungy tubing steps, how to think about
some of the statistical tests, what sort of software to use at moderate data
size). So whether you would like the book depends a lot on whether you are
interested in those steps or not. My more hard-sell push of the book is here:
[http://www.win-vector.com/blog/2014/06/how-does-practical-da...](http://www.win-vector.com/blog/2014/06/how-does-practical-data-science-with-r-stand-out/)

~~~
craigching
Yes, that does sound interesting to me, the reason I got into R in the first
place was for machine learning. I will check it out!

~~~
jmount
That is something to be careful about with our book. We largely use machine
learning; we don't tinker with or develop it.

