
Feather: A Fast On-Disk Format for Data Frames for R and Python - revorad
http://blog.rstudio.org/2016/03/29/feather/
======
jzwinck
HDF5 is supported by many languages including C, C++, R, and Python. It has
compression built in. It can read slices easily. It is battle-tested, stable,
and used in production for many years by thousands of people. Pandas even has
integrated support for DataFrames stored in HDF5.

What's the advantage of Feather over HDF5? Couldn't the Feather libraries be
written with the same API but HDF5 as the storage format, if the Feather API
is preferable?

~~~
wesm
Hi, Wes here.

HDF5 is a really great piece of software -- I wrote the first implementation
of pandas's HDF5 integration (pandas.HDFStore) and Jeff Reback really went to
town building out functionality and optimizing it for many different use
cases.

But the HDF5 C libraries are a very heavy dependency. Feather, by comparison,
is an extremely small amount of code (< 2 KLOC in the core library) with a
correspondingly minimal API. It's a simple file format with excellent
performance, and we wanted to make it as easy as possible for people to use
Feather.

There is also the Apache Arrow factor -- integration between the Arrow memory
representation and R and Python tools will have a lot of ecosystem benefits,
so one of the goals of Feather is to reconcile Python's and R's metadata
requirements with the "official" Arrow metadata so that we can move around
data frames with very low overhead.

~~~
jacobolus
Is there a plan to add other language implementations (or C implementation
wrappers)?

I’d love to see a nice format of this type that can easily be written/read
from Javascript in a browser [e.g. to get the data into a D3 visualization]
and from Matlab, in addition to Python.

I looked into trying to implement an HDF5 codec in Javascript, but that looked
like a large task for one person unfamiliar with the format.

~~~
hadley
It's just a matter of someone implementing the protocol. It's not a huge
amount of work for an experienced js/matlab programmer.

~~~
jacobolus
Is there a spec somewhere, or is the existing implementation the spec?

Edit:
[https://github.com/wesm/feather/blob/master/doc/FORMAT.md](https://github.com/wesm/feather/blob/master/doc/FORMAT.md)

Seems a bit sparse/incomplete still (as would be expected for a brand new
project).
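From a skim of FORMAT.md: the file is bracketed by the magic bytes `FEA1` at both ends, with column data and a flatbuffer metadata block in between. A toy sanity check in Python (a sketch against my reading of the spec, not a real reader):

```python
# Sketch based on FORMAT.md: a Feather v1 file starts with the magic
# bytes "FEA1" and ends with a 4-byte metadata size followed by "FEA1"
# again; column data and a flatbuffer metadata block sit in between.
MAGIC = b"FEA1"

def looks_like_feather(buf: bytes) -> bool:
    """Cheap sanity check: magic at both ends, room for the footer."""
    return (
        len(buf) >= 2 * len(MAGIC) + 4
        and buf.startswith(MAGIC)
        and buf.endswith(MAGIC)
    )
```

A full codec would still need a flatbuffers implementation to read the metadata, but the framing itself is simple.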

------
hadley
Both Wes and I (project authors) will be tracking this thread in case you have
questions!

~~~
data_scientist
Great idea! Some questions:

- Both R and Python support strings, factors, and complex objects in a
dataframe. What is NOT supported by feather?

- Feather is "not for long term data storage". Will it be standardized at
some point in the future?

- Do you plan to integrate it into Pandas?

~~~
hadley
Feather currently doesn't support recursive/hierarchical data structures, like
lists in R. That'll be added in the future, though. And we'll definitely
standardise the format so you can feel confident using it in the long term.

I have no plans to integrate it with pandas, but I'm sure Wes does ;)

~~~
infinite8s
Will you take a storage approach for nested data similar to Parquet?

------
peatmoss
My heart is palpitating. I love where this is going. Jake Vanderplas talked
about the desire for a common data frame lib to unite the warring tribes in
his PyCon keynote a year or so ago, and I couldn't have agreed more.

This appears to be "only" a serialization format ("oh, my unicorn _only_ lays
_golden_ eggs"). I really hope this is the start of some common library
infrastructure that can be used for all aspects of in- and out-of-memory data
frames.

Great work, and I hope it is a harbinger of good things to come. Also, I'll
treat this as tangible evidence that the "language wars" are stupid.

~~~
wesm
This is precisely the goal of the Apache Arrow project
([http://arrow.apache.org/](http://arrow.apache.org/)) -- and I've been
working very hard to bring together diverse groups of data system developers
to work on this problem together. Exciting road ahead!

~~~
peatmoss
Ah jeez, I read the feather announce, but not the arrow docs. How did I not
know about this!?

------
jph00
If anyone wants to get this running on Windows, I've made a start. In
cpp\thirdparty, you can run the .sh scripts with cygwin, but in
build_thirdparty you'll need to add to the appropriate section:

    
    
    elif [[ "$OSTYPE" == "cygwin"* ]]; then
      PARALLEL=$NUMBER_OF_PROCESSORS

And you'll need to use msbuild rather than make, e.g.:

    
    
    if [[ "$OSTYPE" == "cygwin"* ]]; then
      msbuild gtest.sln /p:configuration=release

That got the third-party stuff building. But then I hit a snag: building
Python 2.7 modules on Windows requires an old MSVC version that doesn't
support stdint.h, which feather uses in ext.cpp. A simple conditional include
of an appropriate replacement header might be enough to fix that, but I
haven't had time to check today. So hopefully someone else can fix that...

------
mziel
I can see that Wes is the reporter of a Spark issue
([https://issues.apache.org/jira/browse/SPARK-13534](https://issues.apache.org/jira/browse/SPARK-13534)).
What are the plans (if any) for tighter Spark/Python/R DataFrames integration?

~~~
wesm
I'd like to see SPARK-13534 completed so that we can bring Spark-to-Python/R
data access performance to a level we can deem "acceptable". I dug into
this issue a bit here: [http://wesmckinney.com/blog/pandas-and-apache-
arrow/](http://wesmckinney.com/blog/pandas-and-apache-arrow/)

------
sshillo
Are there any plans to support larger-than-RAM datasets, like HDF5 or bcolz
do?

~~~
hadley
The format already supports larger-than-RAM data, but we don't yet have an API
for creating those files or extracting slices from them. That will come in the
future.
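The reason a columnar binary format can support this at all: a fixed-width column can be sliced by pure offset arithmetic, with no full-file scan. A rough Python illustration (the flat float64 file layout here is hypothetical, not Feather's actual on-disk layout):

```python
# Why columnar + fixed-width types make larger-than-RAM access cheap:
# row i of a float64 column lives at byte offset i * 8, so any slice
# can be read without touching the rest of the file. The flat file
# layout below is hypothetical, not Feather's actual layout.
import mmap
import struct

def read_column_slice(path, start, count):
    """Read rows [start, start + count) of a raw little-endian float64 column."""
    width = 8  # sizeof(double)
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        raw = mm[start * width : (start + count) * width]
    return list(struct.unpack(f"<{count}d", raw))
```

With variable-width columns (strings) you need an offsets array first, but the principle is the same.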

~~~
dbyte
Thanks for working on this. Really a great effort. Hope you guys can draw good
inspiration from projects like HDF5, bcolz, and PyTables, with the option of
incrementally adding features while maintaining an open spec.

------
stephenl
What are some practical uses for this if living in a pure R world?

Is this like BigMemory but for data frames?

[https://cran.r-project.org/web/packages/bigmemory/index.html](https://cran.r-project.org/web/packages/bigmemory/index.html)

Thanks.

~~~
hadley
It's often much faster than rds. And in the long term there will be tools
for computing on feather files that don't require loading them into memory.
(In the short term I'll add ways to pull in slices of the full dataset.)

~~~
alsocasey
Faster because it isn't (currently) using compression (which rds uses by
default) or faster period?

Either way, the idea of mixed Python/R pipelines with feather file
intermediates input/outputs is pretty sweet. Learn in scikit, save to feather,
plot in ggplot2... using Make to tie the pieces together?

~~~
hadley
It's usually faster than either compressed or uncompressed RDS - but if you
have heavily duplicated data, compressed RDS can be faster than feather
(depending on some tradeoff between compression speed and disk speed). Feather
will probably gain compression support eventually.
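The tradeoff hadley describes is easy to reproduce, with zlib standing in here for RDS's gzip compression (different codecs, same size behaviour):

```python
# Heavily duplicated data compresses dramatically, so decompression can
# beat raw disk I/O; high-entropy data gains nothing and only adds CPU
# cost. zlib is a stand-in for the gzip compression RDS uses.
import os
import zlib

duplicated = b"2016-03-29,setosa,5.1;" * 50_000   # repetitive column data
random_ish = os.urandom(len(duplicated))           # incompressible noise

small = zlib.compress(duplicated)
big = zlib.compress(random_ish)

print(len(duplicated), len(small), len(big))
```

Whether the compressed path wins overall then comes down to disk throughput versus decompression speed, which is exactly the tradeoff mentioned above.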

------
wallstquant
Now if only there was a Julia package for this too

~~~
andrioni
It shouldn't be hard, given that most of the work is being done by a shared
C++ library.

(I'll probably take a stab at writing one this weekend, though)

------
zzleeper
This looks amazing!

Hadley, Wes, what are your thoughts on how to implement compression? I recall
some open source columnar datastores (e.g. infobright) that achieved very VERY
fast compression rates with just a few tricks:
[https://news.ycombinator.com/item?id=8354416](https://news.ycombinator.com/item?id=8354416)

In particular, compression is extremely fast for columnar datastores (it's the
same type, one value after another). Since a lot of the time the data is
sorted by some ID (date, individual, etc.), you should see large improvements
in both speed and disk space.

~~~
hadley
Compression is on the long term to do list - Wes knows a lot more about it
than me.

------
th0ma5
Any plans to support a pipe-aware POSIX command, or get this into PostgreSQL?

~~~
hadley
Not by us, but we expect that many other projects will add feather support now
that it's used by both R and Python.

~~~
vannevar
Is there any functional distinction between a data frame and a relational
table? Could you implement a persistent frame just by wrapping sqlite?
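A minimal version of that wrapping idea, using only Python's standard-library sqlite3 module (column names and types here are illustrative):

```python
# vannevar's idea in miniature: persist a "data frame" in a SQLite
# table with only the standard library.
import sqlite3

frame = {"id": [1, 2, 3], "score": [0.5, 0.9, 0.1]}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (id INTEGER, score REAL)")
con.executemany("INSERT INTO df VALUES (?, ?)",
                zip(frame["id"], frame["score"]))

# Column-at-a-time read-back, like a data frame accessor:
scores = [row[0] for row in con.execute("SELECT score FROM df ORDER BY id")]
print(scores)  # [0.5, 0.9, 0.1]
```

One functional distinction: SQLite stores rows contiguously, so a whole-column scan touches every row's bytes, whereas a columnar format reads one contiguous strip per column; that, plus type fidelity (factors, R's NA semantics), is presumably part of the answer.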

------
joelthelion
It seems like C++ (C?) is also supported. Why don't you advertise that? In my
(limited) experience, exchanging data with C/C++ is even more painful than
exchanging between Python and R.

------
sandGorgon
Quick request - could you hook into save.image in R and give the ability to
save the entire workspace in R? That would be awesome.

Incidentally, I had filed a feature request to save the entire workspace in
Pandas... but it was rejected as being unpythonic. Oh, and the devs claimed
Apache Arrow was vaporware!

[https://github.com/pydata/pandas/issues/12381](https://github.com/pydata/pandas/issues/12381)

~~~
hadley
I think saving your entire workspace is a bad idea too, sorry!

~~~
sandGorgon
Could you talk about why, other than the convenience factor (and the fact
that R already does it)?

Is it stemming from a fundamental aspect of the data format -- for example,
can you save two data frames to the same file?

Because if you can save two -- why not save two hundred?

~~~
peatmoss
It thwarts reproducibility. Saving your workspace drags a lot of state
from session to session that isn't accounted for. If you share code with
someone else, their workspace won't be the same, and thus the code may
not function the same.

~~~
sandGorgon
Point taken. But we are again delving dangerously close to thou-shalt-not.
From my perspective, it is a quick and convenient way to save all the data
frames in my code. It's a boon for productivity.

If not this, then I pray for Feather to be able to save multiple data frames
in one file.

~~~
peatmoss
I don't see it as a thou-shalt-not. As a file format, feather is lightweight.
If they turned it into a container format, it would be expanding the scope. If
they instrumented it to comb objects in the global namespace and serialize
them to that new container format, it would be heavier still -- all to support
a feature that the authors view as an anti-pattern. That's less a thou-shalt-
not than a prioritization of their own vision.

If you're looking for a container to store lots of tabular data in one file,
I'd suggest SQLite. Using dplyr, you can save those dataframes very easily.
Plus, you can join tables and perform efficient aggregations on datasets too
large to keep in memory.

In a lot of ways, I don't understand what limitations prevent SQLite from
becoming the de facto common data.frame format. There probably are some, I
just don't understand the tradeoffs (especially given how much SQLite gives
you for free)!

~~~
sandGorgon
Actually this is interesting -- why Feather vs. SQLite? I would love to know
the answer!

But coming back to the anti-pattern : well, obviously the authors have the
power to not spend time on something. But I'm trying to figure out why it's an
anti-pattern in general. Snapshotting execution state is probably the ideal
goal, but saving intermediate data structures is a decent convenience feature.

Now if that's restricted by the limitations of the format itself (no multiple
frames in a single file), then we are back to thinking that HDF5/sqlite may
indeed be the better format.

~~~
hadley
Basically because you should be encoding state in code, not data. If you store
data between sessions, it's easy to lose the code you used to create it, and
then later on you can't recreate it.

It is convenient to save your complete workspace but I've seen too many cases
where it's contributed to lack of reproducibility to spend my time working on
it.

~~~
sandGorgon
So there's a use case difference. I create models from remote data sources -
this is incremental on a daily basis and takes quite a bit of time.

So I snapshot the workspace after I do a run and do some experiments. Now -
for me, saving the workspace is a convenience feature, NOT a programming
feature.

This is what I mean by thou-shalt-not. My use case is very well defined and
I'm not stupid. I completely know the pitfalls you're talking about -- but a
philosophical opposition is what hurts me (and lots of devs like me).

~~~
hadley
I hope I didn't come across as "thou shalt not" -- it's just never going to be
high on my priority list.

(And even for your use case, I would think you'd be better off keeping the
models in a list and saving that. Then other random stuff in your env won't
get carried along for the ride.)

~~~
sandGorgon
Oh no you did not! That was polite musing. Thank you for the reply - I still
hope you change your mind. Because people do have genuine, but different needs
;)

------
baldeagle
Any luck in getting it to work with SAS... Maybe in a 10.12 future release? My
daily grind is SAS to CSV to R, because not everyone else has seen the light.

~~~
hadley
If SAS wanted to support it, it would be super easy for them to implement it.

------
eximius
Ignorant question: What does this solve that a csv doesn't? Type information?

~~~
oddthink
Type information is one. CSV is slow, since it has to parse everything on
load. CSV has no random access, since rows can be arbitrary lengths. CSV takes
up a lot of disk space, since an 8-byte double gets expanded into a 15+
character string.
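The size and parsing arguments are easy to make concrete in Python:

```python
# A binary double is always 8 bytes; its shortest round-tripping
# decimal form is typically 15+ characters, and CSV must re-parse
# that text on every load.
import struct

x = 0.1234567890123456
binary = struct.pack("<d", x)   # fixed 8 bytes
text = repr(x)                  # decimal string, as CSV would store it

print(len(binary), len(text))
assert struct.unpack("<d", binary)[0] == x   # exact binary round trip
assert float(text) == x                      # text requires parsing back
```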

~~~
roywiggins
Also there's no single, official CSV standard.

~~~
jabl
There's RFC 4180. Problem is, there's no universal acceptance in actually
following it... :)

