
I've done more than my fair share of fucking with FITS and ROOT files, HDF5, SQLite, proprietary struct-based things, etc...

It's easy to get a file format working on one system. It's Herculean getting it working on all systems bug-free. It's nearly impossible to get something to work portably and performantly across many systems.

As for simplicity, people start wanting metadata and headers and this and that, and before you know it you need HDF5 or ROOT again and it's no longer simple. Maybe if you're lucky you can stick with something that looks like FITS. If it's tabular, SQLite still can't be beat. Maybe Parquet would work fine too.

I'd vehemently oppose anyone in the projects I work on trying to standardize on a new in-house format. I'd maybe be okay if they were just building on top of MessagePack or Cap'n Proto/Thrift, etc., but nearly every disadvantage the OP references about HDF5 will undoubtedly be in anything they cook up themselves. For example, a "simpler format" that works well on distributed architectures, well... now you're going to go back to the single-implementation problem.



It all depends what your aims are. We have a well-defined set of data we need to keep, including intermediate processing steps. We don't need headers, structured arrays, or any weird esoteric object types. (The author is my colleague.)

We can get by just fine with:

- N-dimensional arrays stored on disk
- Key-value metadata associated with those arrays
- A hierarchical data structure

We've been very happy so far replacing HDF5 groups with folders (on the filesystem), HDF5 datasets with flat binary files stored on disk (just as HDF5/pretty much any other format stores them - each value takes up 1 or 2 or 4 bytes, and your filesize is just n_bytes_per_value * n_values), and attributes by JSON/XML/INI files. If I sent you one of our datasets, zipped up, you'd be able to make sense of it in a matter of minutes, even with no prior knowledge of how it was organised.
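A minimal sketch of that kind of layout, using only the Python standard library: a folder stands in for a group, a flat binary file for a dataset, and a JSON sidecar for the attributes. The function names (`save_array`, `load_array`), the `.bin`/`.json` naming, and the metadata fields are hypothetical and chosen for illustration; they are not the actual format described here. Fixing the byte order to little-endian is also an assumption, made so the bytes read back the same on any machine.

```python
import json
import struct
import tempfile
from pathlib import Path


def save_array(folder, name, values, shape, code="h", attrs=None):
    """Write a flat binary file plus a JSON sidecar into `folder`.

    `code` is a struct type code (hypothetical convention): "h" = int16,
    "i" = int32, "f" = float32, "d" = float64.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # nested folders act as groups

    n_values = 1
    for dim in shape:
        n_values *= dim
    if n_values != len(values):
        raise ValueError("shape does not match the number of values")

    # "<" forces little-endian with standard sizes and no padding,
    # so file size is exactly n_bytes_per_value * n_values.
    raw = struct.pack(f"<{n_values}{code}", *values)
    (folder / f"{name}.bin").write_bytes(raw)

    meta = {"code": code, "byte_order": "little",
            "shape": list(shape), "attrs": attrs or {}}
    (folder / f"{name}.json").write_text(json.dumps(meta, indent=2))


def load_array(folder, name):
    """Read the values (flattened) and the metadata back."""
    folder = Path(folder)
    meta = json.loads((folder / f"{name}.json").read_text())
    raw = (folder / f"{name}.bin").read_bytes()
    n_values = len(raw) // struct.calcsize("<" + meta["code"])
    values = list(struct.unpack(f"<{n_values}{meta['code']}", raw))
    return values, meta


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        group = Path(root) / "experiment" / "session_01"
        save_array(group, "waveforms", [1, -2, 3, -4, 5, -6],
                   shape=(2, 3), attrs={"sample_rate_hz": 20000})
        values, meta = load_array(group, "waveforms")
        print(values, meta["shape"], meta["attrs"])
```

Because the data file is just packed values and the metadata is plain JSON, anything that can read bytes and parse JSON can open it, which is the point about making sense of a dataset with no prior knowledge.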

It is very tricky to build something that works reliably across all systems, but, thankfully, filesystem designers have done that job for us. And filesystems are now at a point where they're very good at storing arbitrary blobs of data (which wasn't the case when HDF was founded). Filesystem manipulation tools (Windows Explorer / Finder / cd/cat/ls/mkdir/mv/cp/[...]) are also very good and user-friendly.

There isn't really anything we miss about HDF5 at all. Perhaps you would if your project had spectacularly complex data storage requirements (as to your examples: metadata/headers are easily stored in JSON), but there's no other project I know of that actually relies on an HDF5-only feature and couldn't trivially use the filesystem instead.


I've done one format, which was append-only (so a pretty easy problem to solve), on one homogeneous system, with metadata, and it had above-average performance (compared to the commercial and open-source alternatives available at the time), but it sounds like you have me well beaten. These are the war stories that I love to hear. What was your problem domain, what were the recurring implementation problems, and where were the bugs primarily?



