
BTables: A fast, compact format for Machine Learning - thomson
https://medium.com/@framedio/btables-a-fast-compact-disk-format-for-machine-learning-f719692e2619
======
fjordster
HDF5 isn't perfect, but it does this kind of job pretty well. The C and C++
HDF5 APIs are definitely not fun to use, but there are wonderful and intuitive
APIs available in some languages; I'm thinking of Python's h5py here.
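
For what it's worth, a minimal sketch of what the h5py side looks like (file
and dataset names made up):

    import h5py
    import numpy as np

    # Write a small dataset, then read back just a slice.
    with h5py.File("features.h5", "w") as f:
        f.create_dataset("X", data=np.random.rand(1000, 50))

    with h5py.File("features.h5", "r") as f:
        head = f["X"][:10]  # only the requested slice is read from disk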

Let me add that the OP's experience that HDF5 files were less space efficient
than comparable CSV files suggests that something was grossly amiss in his use
of HDF5.

~~~
andrewberls
(I'm the author of the library.) I'm almost certain you're correct: we thought
we had compression enabled on our feature builder but never found the root
cause. Regardless, we're happy with how BTables ended up for the other reasons
detailed. For future use cases we'll definitely be re-evaluating HDF5!

------
icsa
The BTables discussion takes me back to memories of my first college computing
class (in FORTRAN). We were asked how we might store a sparse matrix in less
memory. The solution was exactly the same as BTables. We thought we'd done
something novel when the professor pointed out that it had already been
implemented in the 60s.

Great ideas never fade. They do get reinvented :).
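
For anyone who hasn't run into it: the trick is just storing (column, value)
pairs for the non-zero entries of each row. A toy sketch, not BTables' actual
layout:

    # Keep only the non-zero entries of a row as (column, value) pairs.
    def to_sparse_row(row):
        return [(j, v) for j, v in enumerate(row) if v != 0]

    # Expand the pairs back into a full row of the given width.
    def to_dense_row(pairs, width):
        row = [0] * width
        for j, v in pairs:
            row[j] = v
        return row

    dense = [0, 0, 3.5, 0, 0, 0, 1.2]
    assert to_dense_row(to_sparse_row(dense), len(dense)) == dense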

------
rspeer
I haven't tried the BTables format, but I agree with their criticism of HDF5.
It seems to be an incredibly over-designed format with under-designed APIs.

(Why would I need a directory tree inside a file that only one process can
write to anyway? Why wouldn't I just use the filesystem I already have?)

~~~
felixr
> Why would I need a directory tree inside a file that only one process can
> write to anyway? Why wouldn't I just use the filesystem I already have?

If you have multiple "tables" that belong together and you need one table to
interpret the data in the other table, wouldn't you want them to be grouped
together? If they are separate files on the filesystem there is always the
risk of forgetting something when you share the data with somebody.

If you can put all the data of an experiment into one file, I think that is
very convenient. After all, you don't have to read the complete HDF5 file if
you are interested in just a subset of the data.
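
A rough h5py sketch of what I mean (group and dataset names invented): both
tables travel in one file, and you can still read just a slice of one of them:

    import h5py
    import numpy as np

    # Related tables grouped together inside a single file.
    with h5py.File("experiment.h5", "w") as f:
        grp = f.create_group("run1")
        grp.create_dataset("features", data=np.random.rand(10000, 64))
        grp.create_dataset("labels", data=np.random.randint(0, 2, 10000))
        grp.attrs["description"] = "toy example"

    with h5py.File("experiment.h5", "r") as f:
        labels_head = f["run1/labels"][:100]  # partial read; rest stays on disk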

~~~
glogla
I wonder if it wouldn't be more practical to just use an sqlite file for data
like that. I mean, it's not plaintext, but sqlite is available pretty much
anywhere and provides a convenient interface for data.
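
Something like this, schema invented:

    import sqlite3

    # One portable file, queryable from nearly any language.
    conn = sqlite3.connect("features.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cells "
        "(row INTEGER, col INTEGER, val REAL)")
    conn.executemany(
        "INSERT INTO cells (row, col, val) VALUES (?, ?, ?)",
        [(0, 2, 3.5), (0, 6, 1.2)])
    conn.commit()
    print(conn.execute("SELECT col, val FROM cells WHERE row = 0").fetchall())
    conn.close()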

~~~
felixr
HDF5 is much better suited for a lot of scientific data sets. How would you
store multidimensional data in sqlite? Not everything is a table or matrix.
HDF5 also allows you to pick compression filters that are especially suited
for the data you have. If you are looking to replace a CSV file, then sqlite
is obviously a pragmatic solution.
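
For the filter point, a rough h5py sketch (names made up): a 3-D dataset with
a per-dataset compression filter, which has no direct equivalent in sqlite:

    import h5py
    import numpy as np

    # A 3-D array stored with a compression filter chosen per dataset.
    with h5py.File("cube.h5", "w") as f:
        f.create_dataset("cube",
                         data=np.random.rand(100, 100, 100),
                         compression="gzip",  # "lzf" and third-party filters also exist
                         shuffle=True)        # byte-shuffle often helps floats compress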

------
xaa
It's too bad that this is for sparse data only. ML datasets have differing
degrees of sparsity, and when the sparsity gets low enough, it's more
efficient to use dense matrices, even when there are still missing values.

Also if you have dense data, you can use mmap, which isn't very space
efficient but is very fast. I guess it could also be made to be space
efficient if you use a filesystem with transparent compression.
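
With numpy, that's roughly the following (file name and shape made up):

    import numpy as np

    rows, cols = 1_000_000, 128

    # Create the file-backed dense matrix once...
    X = np.memmap("dense.dat", dtype=np.float32, mode="w+", shape=(rows, cols))
    X.flush()

    # ...later, map it read-only; the OS pages data in lazily.
    X = np.memmap("dense.dat", dtype=np.float32, mode="r", shape=(rows, cols))
    batch = X[1000:2000]  # touches only the pages backing these rows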

~~~
Someone
If you combine mmap with a filesystem that does transparent compression and
want the efficiency of mmap, then mmap will only see the compressed data.

If you want your mmap to magically see the uncompressed data, your filesystem
will have to decompress the data, and that doesn't come for free.

I would aim for compression in the application, as data size will likely be
the bottleneck in reading and writing such files. If your data isn't very
sparse, you could delta-encode the indices of the non-zero columns and use
some variable-length encoding for them. Compressing each row of deltas may
help after the delta encoding (especially if the data is reasonably dense,
because then you expect the deltas to be small).
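
A toy sketch of the delta + varint idea (this is not BTables' actual
encoding):

    def encode_varint(n, out):
        # LEB128-style variable-length encoding: 7 bits per byte,
        # high bit set on all bytes except the last.
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return

    def encode_indices(cols):
        # Delta-encode sorted column indices, then varint each delta.
        out = bytearray()
        prev = 0
        for c in cols:
            encode_varint(c - prev, out)
            prev = c
        return bytes(out)

    # Dense rows produce small deltas, which fit in single bytes.
    print(encode_indices([3, 4, 5, 200]).hex())  # -> '030101c301'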

Once you go down that route, you have sacrificed simplicity, so you might just
as well encode your floats, too.

------
blt
Wondering why they chose row-major storage. I think it's far more common to
only care about a subset of columns than a subset of rows.

~~~
nostrademons
FTA:

"First, we knew we only cared about row-by-row access over the entire file; we
do not need things like random row or column reads."

It sounds like they don't care about subsets of _either_ columns or rows, and
are looking to optimize table size and time for full table scans.

------
zobzu
Interesting how it jumps from CSV to rewriting stuff without just doing SQL
and being done with it. Since CSV did the job almost well enough, it seems
like SQL would be just fine and dandy, while being easier to manage and
implement (minutes, literally).

Note: after reading a little more, I suspect SQL would be faster, in fact.

