
Moving away from HDF5 - nippoo
http://cyrille.rossant.net/moving-away-hdf5/
======
bhouston
Alembic, a data transfer format for 3D graphics, especially high-end VFX, also
started out on HDF5 but found it to have poor performance and to be a general
bottleneck (especially in multithreaded contexts).

Luckily the authors of Alembic were smart and in their initial design
abstracted out the HDF5 interface and were able to provide an alternative IO
layer based on C++ STL streams. The C++ STL streams-based interface greatly
outperformed the HDF5 layer.

Details on that transition here:

[https://groups.google.com/forum/#!msg/alembic-
discussion/FTG...](https://groups.google.com/forum/#!msg/alembic-
discussion/FTG1HuuO_qA/jUadpxpk3IoJ)

------
ipunchghosts
Reading all these comments that bash HDF5 makes me want to tell how HDF5 has
really worked for my group.

Although the spec is huge, there is plenty of sample code online to get it
working. You do actually have to read it, though, to understand slabs,
hyperslabs, strides, etc. Once you do, it's really versatile.

As far as speed goes, we used it to replace our proprietary data format. We
used to have to provide readers to all the scientists who use our data, and it
was a nightmare: some people want stuff in R, some in Python 2.7, some in
Python 3.4, some in MATLAB, and the list goes on. HDF5 gets rid of all this.

When I'm in the field and the system shits the bed, it's really easy to open
an HDF5 file in HDFView and inspect its contents. I don't always have MATLAB
available when I'm in the field, same with Python. Sometimes I just need to
look at a time series so I can diagnose the problems with the system.

For me, it's silly in 2016 to have any kind of proprietary binary format when
something like HDF5 exists.

Many of the complaints the author had make me think he's the stereotypical
scientist: really smart in one area, but can't program worth beans. I don't
think that's HDF5's fault.

~~~
nippoo
The author (my colleague, and probably the most talented developer I know)
isn't replacing HDF5 with a 'proprietary binary format': in fact, the
transition is as simple as replacing "HDF5 group" with "folder in a
filesystem", "HDF5 dataset" with "binary file on the filesystem" (ie you store
each array item sequentially on disk, exactly as HDF5 or any other format will
store it, which you can memmap trivially with any programming language), and
"HDF5 attribute" with "JSON / XML / INI / text file".

"When in the field and the system shits the bed", to quote you... you can just
open the dataset in Windows Explorer. Or Mac OS Finder. Or Nautilus. Or using
'cd' and 'ls' in Linux. Want to look at an array? Sure, open it in a hex
editor or Python or, heck, FORTRAN88. You can .tar up your folder (or
subfolder, or any arbitrary subset of the dataset) and send it to someone with
no knowledge of the format whatsoever, and they'll be able to make sense of it
in minutes. This isn't anything remotely complex - it's just using the
filesystem rather than creating a filesystem-within-a-filesystem. Want to keep
track of changes in your massive dataset? Sure, just back it up with Mac OS X
Time Machine, a simple rsync script, or even Git LFS.

Most researchers aren't computer scientists; they know how to use Dropbox and
Notepad++ and open text files, and they don't want to have to install a Java-
based HDF5View when they could just use Windows Explorer.

It's not even that HDF5 is _that bad_, it's just that filesystems are, in
many respects, _so much better_.

(If it wasn't clear from the article, we're not just misinformed - we're
making this call after having developed an entire software suite around HDF5,
spent about two years of firefighting HDF5 issues and wasted days of
development time (so many horror stories) - this is actual feedback from
several dozen users, thousands of datasets, and petabytes of data.)

~~~
ipunchghosts
Filesystems are the worst! Try telling a customer to tar up a directory and
send it to you -- things get lost so easily! Most of our customers don't even
know what "tar" means. I think you are asking for trouble going with a
filesystem as a storage mechanism.

HDFView is not the only viewer in town. There are several viewers.

I've used HDF5 weekly for the last 5 years, and so have my associates, and it's
been wonderful. MATLAB even uses it to store its .mat files these days.

I think you are misinformed and I'm sticking to my story!

~~~
nippoo
It depends who your users are and what needs they have. Our users are mostly
MATLAB/Python users, pretty tech-savvy, and being able to edit individual bits
of the dataset with other applications or write their own code to analyse them
is an often-used feature.

It's rare we ever need the whole dataset - in fact, it's really great to be
able to say to the user "don't send us your 100GB dataset: just go to the
"acquisition" subfolder and send me the 10MB file called "oscilloscope.dat"".
With HDF5 this is difficult enough that it's almost always easier to send
99.9% of useless data (i.e. the whole file) when all you want is a single
array within it.

If your users will rarely need to do this, you could just store the entire
folder hierarchy in a .zip and access it using standard tools that most
programming languages have. It's worth noting that the new Microsoft Office
formats do exactly this - in their case, a bunch of XML files inside a .ZIP.
(Rename a .docx to .zip and you'll see!).
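
A minimal sketch of that idea (archive name and member paths are invented):
Python's standard-library zipfile can pull out a single member without
unpacking the rest.

```python
import json
import zipfile

import numpy as np

with zipfile.ZipFile("dataset.zip") as z:
    # read just the metadata and one array; everything else stays untouched
    meta = json.loads(z.read("acquisition/metadata.json"))
    raw = z.read("acquisition/oscilloscope.dat")

arr = np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])
```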

MATLAB has moved from their own custom binary format to HDF5, which is the
lesser of two evils.

~~~
ipunchghosts
Your use case doesn't seem like a good fit for HDF5.

Yes, I do know about docx and zip.

I'm happy with what MATLAB has done. People send me .mat files and I happily
process them in Python. And my plots usually look much nicer too. :)

------
batbomb
I've done more than my fair share of fucking with FITS and ROOT files, HDF5,
SQLite, proprietary struct-based things, etc...

It's easy to get a file format working on one system. It's Herculean getting
it working on all systems bug-free. It's nearly impossible to get something
that is both portable and performant across many systems.

As for simplicity, people start wanting metadata and headers and this and
that, and before you know it you need HDF5 or ROOT again and it's no longer
simple. Maybe if you're lucky you can stick with something that looks like
FITS. If it's tabular, SQLite still can't be beat. Maybe Parquet would work
fine too.

I'd vehemently oppose anyone in the projects I work on trying to standardize
on a new in-house format. I'd maybe be okay if they were just
building on top of MessagePack or Cap'n Proto/thrift etc... but nearly every
disadvantage the OP references about HDF5 will undoubtedly be in anything they
cook up themselves. For example, a "simpler format" that works well on
distributed architectures, well... now you're going to go back to the single
implementation problem.

~~~
nippoo
It all depends what your aims are. We have a well-defined set of data we need
to keep, including intermediate processing steps. We don't need headers,
structured arrays, or any weird esoteric object types. (The author is my
colleague.)

We can get by just fine with:

- N-dimensional arrays stored on disk
- Key-value metadata associated with those arrays
- A hierarchical data structure

We've been very happy so far replacing HDF5 groups with folders (on the
filesystem), HDF5 datasets with flat binary files stored on disk (just as
HDF5/pretty much any other format stores them - each value takes up 1 or 2 or
4 bytes, and your filesize is just n_bytes_per_value * n_values), and
HDF5 attributes with JSON/XML/INI files. If I sent you one of our datasets, zipped
up, you'd be able to make sense of it in a matter of minutes, even with no
prior knowledge of how it was organised.
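
For concreteness, a minimal write-side sketch of that layout (paths, keys, and
sizes are made up, not our actual schema):

```python
import json
import os

import numpy as np

arr = np.random.randn(10000, 100).astype("float32")

os.makedirs("dataset/raw", exist_ok=True)    # "HDF5 group"   -> folder
arr.tofile("dataset/raw/traces.dat")         # "HDF5 dataset" -> flat binary file

# file size really is just n_bytes_per_value * n_values
assert os.path.getsize("dataset/raw/traces.dat") == arr.itemsize * arr.size

# "HDF5 attributes" -> a JSON sidecar so readers know the dtype and shape
with open("dataset/raw/traces.json", "w") as f:
    json.dump({"dtype": str(arr.dtype), "shape": list(arr.shape)}, f)
```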

It is very tricky to build something that works reliably across all systems,
but, thankfully, filesystem designers have done that job for us. And
filesystems are now at a point where they're _very good_ at storing arbitrary
blobs of data (which wasn't the case when HDF was founded). Filesystem
manipulation tools (Windows Explorer / Finder / cd/cat/ls/mkdir/mv/cp/[...])
are also very good and user-friendly.

There isn't really anything we miss about HDF5 at all. Perhaps it would be
different if your project had spectacularly complex data storage requirements
(as to your examples: metadata and headers are easily stored in JSON), but I
don't know of any other project that actually relies on an HDF5-only feature
and couldn't trivially use the filesystem instead.

------
superbatfish
I liked the post (well, as an HDF5 user, I found it depressing...).

My main qualm with it was the claim about 100x worse performance than just
using numpy.memmap(). To the author's credit, he posted his benchmarking code
so we could try it ourselves. (Much appreciated.) But as it turned out, there
were problems with his benchmark. A fair comparison shows a mixed picture --
HDF5 is faster in some cases, and numpy.memmap is faster in others. (You
can read my back-and-forth about the benchmarking code in the blog's
comments.)

One minor complaint about presentation: Once the benchmarking claims were
shown to be bogus, the author should have removed that section from the post,
or added an inline "EDIT:" comment. Instead, he merely revised the text to
remove any specific numbers, and he didn't add any inline text indicating that
the post had been edited.

I think the rest of the post (without performance complaints) is strong enough
to stand on its own. After all, performance isn't everything. In fact, I'd say
it's a minor consideration compared to the other points.

When it comes to performance, I think the main issue is this: When you have to
"roll your own" solution, you become intimately aware of the performance
trade-offs you're making. HDF5 is so configurable and yet so opaque that it's
tough to understand _why_ you're not seeing the performance you expect.

~~~
superbatfish
And one last point. In the blog comments, I wrote this, which I think sums up
my view of the performance discussion:

...it's worth noting that many of the tricky things about tuning hdf5
performance are not unique to HDF5. For storing ND data, there will always be
decisions to make about when to load data into RAM vs. accessing it on demand,
whether or not to store the data in "chunks", what the size of those chunks
should be (based on your anticipated access patterns), whether/how to compress
the data, etc. These are generally hard problems; we can't blame HDF5 for all
of them.
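
To make that concrete, here is roughly where those knobs live in h5py (dataset
name, sizes, and chunk shape are arbitrary) - exactly the decisions you would
otherwise have to make yourself in a hand-rolled format:

```python
import h5py
import numpy as np

data = np.random.randn(10000, 1000).astype("float32")

with h5py.File("recording.h5", "w") as f:
    # chunk shape and compression are the access-pattern decisions above:
    # here, rows are read together, so chunks span whole rows
    f.create_dataset("traces",
                     data=data,
                     chunks=(1000, 1000),
                     compression="gzip",
                     compression_opts=4)
```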

------
ycmbntrthrwaway
One reason to use binary formats like HDF5 is to avoid precision loss when
storing floating-point values. I started using HDF5 once for exactly this
reason, and it was overkill. HDFView requires Java to be installed, and the HDF
library, with its single implementation and complex API, is a problem too.

For simple uses I now use the '%a' printf specifier. It is specifically
designed to avoid losing a single bit of information. And you can easily read
floats stored this way in numpy by using genfromtxt with the optional
converters= argument and the built-in float.fromhex function.
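
A small round-trip sketch (file name invented): float.hex() in Python produces
the same lossless hexadecimal form as C's '%a', and float.fromhex plus a
converters= entry reads it back.

```python
import numpy as np

values = [0.1, 2.0 / 3.0, 1e-300]

# write: one hex float per line, no precision lost
with open("values.txt", "w") as f:
    for v in values:
        f.write(v.hex() + "\n")  # e.g. 0x1.999999999999ap-4

# read: genfromtxt applies the converter to column 0
loaded = np.genfromtxt("values.txt",
                       converters={0: float.fromhex},
                       encoding="utf-8")

assert list(loaded) == values  # bit-exact round trip
```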

~~~
iheartmemcache
It's also really useful if you have a lot of numerical data streaming in that
you want to store and use at a later date. CERN results, I'm sure, use
something similar to HDF5; nearly all of HFT algo trading uses HDF5 for
securities they are going to explore down the road but don't want to waste
KDB+ licenses on; and Google File System's chunking scheme seems to be
somewhat similar to it as well. _"Third, most files are mutated by appending
new data rather than overwriting existing data. ... Once written, the files
are only read, and often only sequentially."_ [1] _That_ is the use case for
HDF5.

The problem is this guy tried to slam a circular peg into a square hole. I'm
in no way an apologist for HDF5, but his complaints are terribly vague.

"Limited support for parallel access": then you go read the source [2] and see
GIL complaints abound. And again, this was meant for an append-only situation
where you shouldn't even have to acquire a lock in the first place, since
there's no possibility of contention!

"Impossibility to explore datasets with standard Unix/Windows tools": right,
but there are plenty of Java tools that perform quite well, even with a cold
JVM.

"Opacity of the development and slow reactivity of the development team":
AFAIK it's an open-source project. This complaint is valid if you're paying a
vendor fees for a product and have a support plan with an SLA, and it's not
valid in the least otherwise.

"High risks of data corruption": I've never once seen this happen when HDF5
was properly used, though I'd love to see a pdb dump of the state of his
program when that occurred. Open offer - I'll fix that bug if it's a fault
with the C lib you're FFI'ing with.

edit: oh, the Java tooling was already mentioned.

[1]
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-
sosp2003.pdf) [2]
[https://github.com/h5py/h5py/blob/master/h5py/_locks.pxi](https://github.com/h5py/h5py/blob/master/h5py/_locks.pxi)

~~~
batbomb
CERN uses ROOT.

~~~
iheartmemcache
Huh! Good to know. The spec, for anyone interested:
ftp://root.cern.ch/root/doc/11InputOutput.pdf [1]. For comparison, see
[https://www.hdfgroup.org/projects/hdf5_aip/aip15.gif](https://www.hdfgroup.org/projects/hdf5_aip/aip15.gif)
versus page 6 of CERN's PDF.

------
ycmbntrthrwaway
> You can't use standard Unix/Windows tools like awk, wc, grep, Windows
> Explorer, text editors, and so on, because the structure of HDF5 files is
> hidden in a binary blob that only the standard libhdf5 understands.

HDF provides command-line tools like h5dump and h5diff, so you can dump an
HDF5 file to text and pipe it into standard and non-standard Unix tools [1].

[1]
[https://www.hdfgroup.org/products/hdf5_tools/index.html#h5di...](https://www.hdfgroup.org/products/hdf5_tools/index.html#h5dist)

~~~
mynewtb
The submission talks about terabytes of data. Copying/transforming is not
viable in such situations.

~~~
ycmbntrthrwaway
With h5dump you can specify which datasets you want to dump. I am sure nobody
is going to use awk, grep and wc to process terabytes of data. As for sanity
checks, like checking that probabilities stored in a dataset sum up to 1.0 and
things like that, dumping one dataset and processing it with awk should be ok.
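
For that kind of sanity check you don't even need to go through text; a
minimal h5py sketch (file and dataset names invented) reads just the one
dataset:

```python
import h5py
import numpy as np

with h5py.File("results.h5", "r") as f:
    probs = f["model/probabilities"][:]  # loads only this dataset, not the whole file

assert np.allclose(probs.sum(axis=-1), 1.0)
```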

~~~
paulmd
Actually, piping data between standard Unix tools can be _extremely_
efficient. This example is working on gigabytes rather than terabytes of data,
but memory utilization is basically limited to just buffers, so you could
definitely scale that to terabytes without a problem.

[http://aadrake.com/command-line-tools-can-be-235x-faster-
tha...](http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-
hadoop-cluster.html)

------
x0x0
The problem -- and I've been burned on both sides of this -- is that you need
either a container file, or you need users to understand that a directory is
essentially a file. Not only does this complicate users lives when they want
to move what they, quite reasonably, view as a single file between different
computers or back it up, but they can and will remove individual pieces. Or
copy it around and have some of the pieces not show up and be very confused
that copy/move operations -- particularly to/from a NAS -- are now nothing
like atomic.

Another thing that will happen is this: if you just use a directory full of
files as a single logical file, you will end up writing code that does the
equivalent of 'rm -rf ${somedir}' because when users choose to overwrite
"files" (really, a directory), you need to clear out any previous data so
experiment runs don't get mixed. You can easily see where this can go bad; you
will have to take extraordinary care.

~~~
nippoo
This is a double-edged sword. For our (reasonably savvy) users, being able to
duplicate and easily modify individual datasets/files is a feature, not a bug:
people can symlink the contents of an entire folder but modify a single array
and easily run their analysis on this slightly different dataset, for example.

While it's true that you lose atomicity with this, it can both burn you and
help you: you can track specific parts of your dataset in revision control,
email subsets of it back and forth, combine datasets easily, or even store it
across several servers and manage it with symlinks, for example.

Our users are aware of this and it isn't really a problem for our use-case.
But if you're worried, there's always the option of having your whole dataset
as a ZIP/TAR file (like all the Microsoft Office file formats are - XML files
within a .ZIP); tools for modifying folders within a ZIP are much better
established than HDFView, and most modern programming languages provide
libraries to read and modify files within archives without unzipping them; you
could make your program agnostic to the files being within an archive (high
portability, lower performance) or directly within the FS (loss of atomicity,
easier/faster to use).

------
zvrba
In my previous job we were evaluating HDF5 for implementing a data-store a
couple of years ago. We had some strict requirements about data corruption
(e.g., if the program crashes amid a write operation in another thread, the
old data must be left intact and readable), as well as multithreaded access.
HDF5 supports parallelism from distinct programs, but its multithreaded story
was (is?) very weak.

I ended up designing a transactional file format (a kind of log-structured
file system with data versioning and garbage collection, all stored in a
single file) from scratch to match our requirements. Works flawlessly with
terabytes of data.

~~~
srean
Might one take a peek at that? In other words, was it open-sourced, or do
plans to that effect exist?

~~~
zvrba
No and no. Strictly proprietary technology which gives a real competitive
advantage. Fun thing is, if you choose your data structures wisely, it's not
even that hard to write; it ended up being under 2k lines of C++ code.

GC was offline though; it was performed at the time the container was
"upgraded" from RO to RW access. I don't think it'd be difficult to make it
online, but there was no need for that.

------
jbverschoor
Am I the only one misreading this as HDFS?

~~~
zer01
Nope, I was very confused for a second. Silly homoglyphs
([https://en.wikipedia.org/wiki/Homoglyph](https://en.wikipedia.org/wiki/Homoglyph))!

------
ajbonkoski
"we have a particular use-case where we have a large contiguous array with,
say, 100,000 lines and 1000 columns"

This is where they lost me. This is NOT a lot of data. Should we be surprised
that memory-mapping works well here?

Below about 100-200 GB you can do everything in memory. You simply don't need
_fancy_ file-systems. These systems are for actual big data sets where you
have several terabytes to several petabytes.

Don't try to use a chainsaw to cut a piece of paper and then complain that
scissors work better. Of course they do...

~~~
rossant
Unfortunately our users can't afford fancy computers with hundreds of GB of
RAM. They often need to process entire datasets on laptops with 16GB of RAM
but 1TB+ of disk space. Of course, with 200 GB of RAM we'd have no problem
at all...

Also, as I said, the 100,000 x 1000 example is quite an optimistic one; we
already have cases with 100,000,000 x 10,000 arrays, and this is only going to
increase in the months to come with the new generation of devices.

------
iraikov
This is a very interesting article, thanks for sharing. I attempted several
times to understand the HDF5 C API and create a custom format for storing
connectivity data for neuroscience models, but each time I found the API
exceedingly complex and bizarre. I am quite impressed that the author managed
to write a substantial piece of software based around HDF and relieved to read
the sections on the excessive complexity and fragility of the format.

------
skynetv2
* High risks of data corruption - HDF is not a simple flat file. It's a complex
file format with a lot of in-memory structures. A crash may result in
corruption, but there is no _high_ risk of corruption. Moreover, if your app
crashed, what good is the data? How can you make sense of the partial file? If
you just need a flat file which can be parsed and the data recovered, then you
didn't need HDF in the first place - so that's the wrong motivation to pick
HDF. On the other hand, one could write a new file for every checkpoint /
iteration / step, which is what most people do; if the app crashed, you just
use the last checkpoint.
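
For illustration only (names, sizes, and the update step are invented), that
checkpoint-per-file pattern might look like this with h5py:

```python
import h5py
import numpy as np

state = np.zeros((1000, 1000))

for step in range(100):
    state += np.random.randn(*state.shape)    # stand-in for one simulation step

    if step % 10 == 0:
        # one self-contained file per checkpoint: a crash can only damage the
        # file currently being written, never the earlier checkpoints
        with h5py.File(f"checkpoint_{step:05d}.h5", "w") as f:
            f.create_dataset("state", data=state)
            f.attrs["step"] = step
```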

Bugs and crashes in the HDF5 library and in the wrappers - sure, every piece
of software has bugs. But in over 15 years of using HDF, I have not seen a bug
that stopped me from doing what I want. And the HDF team is very responsive in
fixing issues or suggesting workarounds.

Poor performance in some situations - yes and no. A well-built library with a
well-designed application should approach POSIX performance. But HDF is not a
simple file format, so expect some overhead.

Limited support for parallel access - Parallel HDF is one of the most popular
libraries for parallel I/O, if not the most popular. Parallel HDF also uses
MPI; if your app is not MPI-based, you can't use Parallel HDF. If the "parallel
access" refers to threading, HDF has a thread-safe feature that you need to
enable when building the code. If "parallel access" refers to access from
multiple processes, then HDF is not the right file format to use - you could do
it for read-only purposes but not for writing. Again, not the right motivation
to pick HDF.

Impossibility to explore datasets with standard Unix/Windows tools - again,
HDF is not a flat file, so how can one expect standard tools to read it? It's
like saying I would like to use standard tools to explore a custom binary file
format I came up with. Wrong expectations.

Hard dependence on a single implementation of the library - AFAIK there is
only one implementation of the spec. The author seems to have known this
before deciding on HDF. Why is this an issue if it's already known?

High complexity of the specification and the implementation -

Opacity of the development and slow reactivity of the development team - slow
reactivity to what? The HDF source is available, so one can go fix or modify
whatever they want.

It seems the author picked HDF with the wrong assumptions.

HDF serves a huge community that has specific requirements, among them
preserving precision, portability, parallel access, being able to read/write
datasets, querying an existing file for information about the data it
contains, multi-dimensional datasets, fitting large amounts of data in a
single file, etc.

[https://www.hdfgroup.org/why_hdf/](https://www.hdfgroup.org/why_hdf/)

~~~
x0x0
A common pattern (that my scientific software used) was this: an initial file
is created from the raw data pulled from the sequencer. After sequencing, you
could run all sorts of analyses. Sometimes the analyses themselves, and
sometimes intermediate results, were very slow to compute and hence cached in
the file.

I think it's reasonable to be very upset if you have a container file and
adding new named chunks to the file has the possibility of causing the old
data to become unreadable. It's fair that, in a crash before the file was
saved, the new chunks might be bad, but the old chunks should be fine.

~~~
skynetv2
Good example - definitely a problem. But that limitation exists now, so the
programmer would have to work around it. Hopefully journaling support will
appear soon.

~~~
ipunchghosts
Agreed, HDF could benefit from journaling. I have to ask, though: why not just
make a cached file of your data and, once it's done, integrate it into the
final HDF file?

------
adolgert
I may not agree with Cyrille, but what about alternatives for storing binary
data that might be structured and play well with newer tools like Spark? ASN.1
and Google Protocol Buffers both specify a binary file format and generate
language-specific encoding and decoding. Is there a set of lightweight binary
data tools we're missing?

~~~
santaclaus
How widely supported are the alternatives in the wider ecosystem? It is
trivial to read and write HDF5 files in Python, MATLAB, Mathematica, etc.

~~~
adolgert
That's a good question. Both ASN.1 and Google's offering have more limited
language coverage (ASN.1 is ancient, but venerable, now in the hands of NCBI),
but maybe we should expand that list. These are tools that serialize buffers
with razor-sharp binary specifications. I, too, use HDF5 for all of its
features, but maybe someone who is rolling their own, for instance, under
Spark, should have a solid binary specification.

------
hzhou321
In the old days, researchers measured the charge of the electron with an oil
drop and figured out gravity with pen and paper. I guess nowadays researchers
have to spend a million dollars on an electron microscope to look at anything
and have to depend on HDF5 to deal with any data.

------
mydpy
We used HDF5 and NETCDF at NASA and it was a constant struggle. I remember
when someone dropped the specification on my desk and said, "should be a good
read. Enjoy!" Glad you found a more suitable alternative.

------
srean
Leaving a few links here. From the discussion that has taken place it seems
these two would be of interest.

[http://cscads.rice.edu/workshops/summer-2012/slides/datavis/...](http://cscads.rice.edu/workshops/summer-2012/slides/datavis/HDF5-CScADS.pdf)
Extreme IO scaling with HDF5

[http://algoholic.eu/sec2j-journalling-for-
hdf5/](http://algoholic.eu/sec2j-journalling-for-hdf5/) HDF5 with a journal

