
Moving Away from HDF5 (2016) - vector_spaces
https://cyrille.rossant.net/moving-away-hdf5/
======
drewm1980
Make sure you read his follow-up article. "I'm not aware of a portable
container format for storing numerical arrays that is not HDF5." If you want
to dump some collection of numerical data out to the most standard file format
you can find, HDF5 is it.

------
stragies
From
[https://en.wikipedia.org/wiki/Hierarchical_Data_Format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format):
"resources in an HDF5 file can be accessed using the POSIX-like syntax
/path/to/resource."

To mount an HDF5 file, you need
[https://github.com/zjttoefs/hdfuse5](https://github.com/zjttoefs/hdfuse5)

Unfortunately, it does not seem to be packaged on (m)any distros (yet)
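
You don't strictly need to mount the file to get path-style access, though:
h5py exposes the same POSIX-like paths programmatically. A minimal sketch
(the file and dataset names here are made up):

    import h5py

    # Intermediate groups ("run01", "sensors") are created automatically.
    with h5py.File("experiment.h5", "w") as f:
        f.create_dataset("/run01/sensors/temperature", data=[20.1, 20.4, 20.3])

    # Read it back by the same POSIX-like path.
    with h5py.File("experiment.h5", "r") as f:
        temps = f["/run01/sensors/temperature"][:]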

------
tkuraku
The only real pain point I've had with HDF5 is writing or reading separate
files in parallel. I can understand problems with writing to the same file in
parallel, but it was a bit of a shock to realize that writing two separate
files in parallel isn't supported.
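
A common workaround, assuming the files are truly independent, is to push
each write into its own OS process, so every worker gets its own copy of the
HDF5 library state. A rough sketch (file names and shapes are illustrative):

    from multiprocessing import Pool

    import h5py
    import numpy as np

    def write_file(idx):
        # Each worker process writes its own, independent file.
        with h5py.File(f"chunk_{idx}.h5", "w") as f:
            f.create_dataset("data", data=np.random.rand(1000, 1000))

    if __name__ == "__main__":
        with Pool(4) as pool:
            pool.map(write_file, range(4))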

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=10858189](https://news.ycombinator.com/item?id=10858189)

------
jostmey
I love HDF5! It is great for large datasets and easy to use, which makes it
great for one-time research projects when you need a quick solution. However,
there is a lot of room for improvement, which is why I understand the author's
decision.

~~~
ktpsns
I think HDF5 strikes a good balance between usability, performance, and
practicality for typical user groups at laboratories or (small?) institutes.
There are flaws and drawbacks which one will eventually notice. For instance,
I developed (yet) a(nother) HDF5 reader/writer for a certain fluid dynamics
code, and I was quite disturbed by the way HDF5 stores metadata (i.e.
attributes). There is no compression -- compression only works on actual
datasets (payload). HDF5 performs badly (in terms of storage size) with many
(thousands of) small datasets. This is an edge case the designers probably
did not have in mind; hardly anything will beat a zipped tarball containing
the same information. According to the blog post, HDF5 was also not designed
for deleting datasets (this was surprising to me). Parallelism is a thing,
too: HDF5 does offer something there, but I never got happy with it.
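
To make the metadata point concrete: in h5py, compression is a
dataset-creation option, while attributes always land uncompressed in the
metadata. A small sketch (all names invented):

    import h5py
    import numpy as np

    with h5py.File("out.h5", "w") as f:
        # Payload: chunked datasets can be gzip-compressed...
        dset = f.create_dataset("field", data=np.zeros((512, 512)),
                                compression="gzip")
        # ...but attributes are plain, uncompressed metadata. Across
        # thousands of tiny datasets this overhead dominates file size.
        dset.attrs["units"] = "m/s"
        dset.attrs["timestep"] = 0.01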

What is actually great about HDF5: if you are an average 2019 data scientist
(not necessarily doing HPC) who works with a decent number of slightly larger
CSV files on the de facto standard Python/NumPy stack, making the move to
HDF5 is a matter of half an hour! The h5py API is a joy to work with, and you
_will_ notice the speedup of not having to wait for long-running
numpy.genfromtxt() calls.
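
The whole migration can look like the following sketch (file and dataset
names are placeholders): parse the CSV once, store it as HDF5, and load the
binary file from then on.

    import h5py
    import numpy as np

    # One-time conversion: the slow text parse happens only once.
    arr = np.genfromtxt("data.csv", delimiter=",")
    with h5py.File("data.h5", "w") as f:
        f.create_dataset("measurements", data=arr, compression="gzip")

    # Every later run loads the binary file instead.
    with h5py.File("data.h5", "r") as f:
        arr = f["measurements"][:]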

~~~
Enginerrrd
Yeah, HDF5 resulted in some amazing speedups for some things I was doing with
decently large datasets. It's very practical for standard use cases. The
authors were really using and abusing HDF5, and given some of the tricks they
used to eke out extra performance by going off the reservation, so to speak,
I'm not surprised they ran into data corruption issues.

That's an issue I've never had with HDF5.

------
seieste
I've been moving towards putting documents in a virtual folder and zipping
them -- you can still use your own extension and the end consumer is none the
wiser.

This seems to be how Microsoft handles docx and pptx files.
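
In Python this container pattern is a few lines with the standard library; a
quick sketch (the extension and file contents are made up):

    import zipfile

    # A "document" is just a zipped folder under a custom extension.
    with zipfile.ZipFile("report.mydoc", "w",
                         compression=zipfile.ZIP_DEFLATED) as z:
        z.writestr("manifest.json", '{"version": 1}')
        z.writestr("content/body.txt", "Hello, world.")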

~~~
stragies
Not only Microsoft; many others do this too: .JAR, .WAR, .ODT, .ODS, .ODP, ...

------
eVeechu7
Is SQLite a reasonable alternative? It seems like it would at least be
resistant to the corruption issues the author complains of.

~~~
ktpsns
It really depends on the use case. All the use cases I actually know involve
storing n-dimensional arrays, and this is something where HDF5 provides quite
a neat interface that is identical to the in-memory representation in C,
Fortran, or e.g. NumPy. This basically allows memory mapping and requires no
transformation on reading/writing, which is important for data-heavy
applications (say anything in data science, high performance computing, or
"big data" in general).

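A sketch of what that array-like interface looks like in h5py (all names are
invented); slicing reads only the requested region from disk:

    import h5py
    import numpy as np

    with h5py.File("sim.h5", "w") as f:
        f.create_dataset("velocity", data=np.random.rand(100, 64, 64))

    with h5py.File("sim.h5", "r") as f:
        dset = f["velocity"]
        print(dset.shape, dset.dtype)  # behaves like a NumPy array
        slab = dset[10:20]             # only these 10 frames are read
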
SQLite has become a de facto standard for serializing relational data. Of
course one can represent higher-dimensional matrices as relational tables,
but that isn't very efficient in storage or CPU time. SQLite may excel at
storing any kind of metadata and structure.
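
For comparison, the obvious relational encoding of a matrix in SQLite stores
one row per element, which is where the storage and CPU overhead comes from.
A rough sketch (table and file names invented):

    import sqlite3

    import numpy as np

    arr = np.arange(12, dtype=np.float64).reshape(3, 4)

    # One (i, j, value) row per element: simple, but far bulkier and
    # slower than dumping the contiguous array directly.
    con = sqlite3.connect("matrix.db")
    con.execute("CREATE TABLE matrix (i INTEGER, j INTEGER, value REAL)")
    con.executemany(
        "INSERT INTO matrix VALUES (?, ?, ?)",
        ((i, j, float(v)) for (i, j), v in np.ndenumerate(arr)),
    )
    con.commit()
    con.close()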

