I've done more than my fair share of fucking with FITS and ROOT files, HDF5, SQL...

nippoo · on Jan 7, 2016

It all depends what your aims are. We have a well-defined set of data we need to keep, including intermediate processing steps. We don't need headers, structured arrays, or any weird esoteric object types. (The author is my colleague.)

We can get by just fine with:

- N-dimensional arrays stored on disk
- Key-value metadata associated with those arrays
- A hierarchical data structure

We've been very happy so far replacing HDF5 groups with folders (on the filesystem), HDF5 datasets with flat binary files stored on disk (just as HDF5/pretty much any other format stores them - each value takes up 1 or 2 or 4 bytes, and your filesize is just n_bytes_per_value * n_values), and attributes by JSON/XML/INI files. If I sent you one of our datasets, zipped up, you'd be able to make sense of it in a matter of minutes, even with no prior knowledge of how it was organised.
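To make that concrete, here is a minimal sketch of what such a layout could look like, using only the Python standard library. The `.dat`/`.json` naming, the function names, and the metadata keys are all assumptions made up for illustration, not our actual format:

```python
import json
import sys
import tempfile
from array import array
from pathlib import Path


def write_dataset(group, name, values, typecode, shape, attrs=None):
    """Write one array as a flat binary file plus a JSON sidecar (hypothetical layout)."""
    group.mkdir(parents=True, exist_ok=True)  # a folder stands in for an HDF5 group
    data = array(typecode, values)
    with open(group / f"{name}.dat", "wb") as f:
        data.tofile(f)  # raw values only, native byte order, no header
    meta = {
        "typecode": typecode,          # array-module type code, e.g. "h" = signed short
        "shape": list(shape),
        "byteorder": sys.byteorder,    # recorded so another machine can interpret the bytes
        "attrs": attrs or {},          # free-form key-value metadata
    }
    (group / f"{name}.json").write_text(json.dumps(meta, indent=2))


def read_dataset(group, name):
    """Read the array back, checking the file size against the declared shape."""
    meta = json.loads((group / f"{name}.json").read_text())
    n_values = 1
    for dim in meta["shape"]:
        n_values *= dim
    data = array(meta["typecode"])
    path = group / f"{name}.dat"
    # The file size should be exactly n_bytes_per_value * n_values.
    if path.stat().st_size != data.itemsize * n_values:
        raise ValueError("file size does not match shape in metadata")
    with open(path, "rb") as f:
        data.fromfile(f, n_values)
    return data, meta


# Usage: a nested "group" is just a nested directory.
root = Path(tempfile.mkdtemp())
group = root / "session1" / "raw"
write_dataset(group, "trace", range(6), "h", (2, 3), {"sample_rate_hz": 20000})
data, meta = read_dataset(group, "trace")
```

The values come back as a flat sequence; the shape in the JSON file tells the reader how to interpret them, and the size check catches truncated files.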

It is very tricky to build something that works reliably across all systems, but, thankfully, filesystem designers have done that job for us. And filesystems are now at a point where they're very good at storing arbitrary blobs of data (which wasn't the case when HDF was first designed). Filesystem manipulation tools (Windows Explorer / Finder / cd/cat/ls/mkdir/mv/cp/[...]) are also very good and user-friendly.

There isn't really anything we miss about HDF5 at all. Things might be different if your project has spectacularly complex data storage requirements (as for your examples, metadata/headers are easily stored in JSON), but I don't know of any other project that actually relies on an HDF5-only feature and couldn't trivially use the filesystem instead.

iheartmemcache · on Jan 7, 2016

I've done one format which was append-only (so a pretty easy problem to solve) on one homogeneous system with metadata that had above-average performance (compared to the commercial and open-source alternatives available at the time), but it sounds like you have me well beaten. These are the war stories that I love to hear. What was your problem domain, what were the recurring implementation problems, and where were the bugs primarily?