
Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5 - itamarst
https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
======
dalke
Given the discussion on Zarr, HDF5 and TileDB, I hope I can get feedback on a
problem I have. I want to switch from a format I made (the "FPB" format
described at [http://chemfp.com/fpb_format/](http://chemfp.com/fpb_format/) )
to one which scales to more than 250 million entries. (See
[https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0...](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8)
for my recent publication on the system, which implements similarity search
for cheminformatics fingerprints.)

I can't figure out if one of those three methods can do what I want. I want to
include a multi-valued lookup table so given a string id I can find the 0 or
more corresponding records. (I can handle single-valued lookup tables, if
multi-valued isn't available.)

I looked at Parquet, but it doesn't seem to handle id->index table lookups. Or
256 byte fields. Do any of Zarr, HDF5 and TileDB offer a better fit?

Background: it's hard to compare molecules directly. Instead, there are
hashing methods to turn a molecular structure into a fixed-length bitstring,
nearly always between 166 and 4096 bits long.

Bitstrings are easy to compare using Jaccard similarity, which gives a decent-
enough and very fast way to compare molecular similarity. With help from
another HN user (who I paid), we were able to get the performance to within a
factor of 2 of the theoretical memory bandwidth.
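
For concreteness, the comparison boils down to something like this (a
deliberately slow NumPy illustration, not the optimized kernel):

    import numpy as np

    def tanimoto(fp_a, fp_b):
        """Jaccard/Tanimoto similarity of two fingerprints packed as uint8 arrays."""
        # popcount of intersection and union via bit unpacking (clear, not fast)
        both = np.unpackbits(fp_a & fp_b).sum()
        either = np.unpackbits(fp_a | fp_b).sum()
        return both / either if either else 0.0

    # e.g. two random 2048-bit fingerprints packed into 256 bytes each
    rng = np.random.default_rng(0)
    a = rng.integers(0, 256, size=256, dtype=np.uint8)
    b = rng.integers(0, 256, size=256, dtype=np.uint8)
    print(tanimoto(a, b))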

Going faster requires compressed RAM->uncompressed cache transfer, better
indexing, and distributed computing. I don't know how to do it using one of
the existing Big Data formats, and developing my own format is hard.

My FPB file format is a FourCC format with several chunks. One chunk stores
all N fingerprints, cache size aligned. These are ordered by popcount, so in
principle some should be easily compressible - I haven't tried, and can't
figure out how BLOSC is supposed to be used.
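
For what it's worth, my best guess at basic python-blosc usage looks like the
following (a sketch; the codec, shuffle and typesize choices are guesses on my
part):

    import blosc
    import numpy as np

    # assume fingerprints is an (N, 256) uint8 array, rows sorted by popcount
    fingerprints = np.zeros((1000, 256), dtype=np.uint8)

    # compress the raw bytes; typesize tells Blosc the element width for shuffling
    packed = blosc.compress(fingerprints.tobytes(), typesize=1,
                            cname="zstd", clevel=5, shuffle=blosc.BITSHUFFLE)

    # decompress back into an array of the original shape
    restored = np.frombuffer(blosc.decompress(packed), dtype=np.uint8)
    restored = restored.reshape(fingerprints.shape)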

Another chunk stores the corresponding fingerprint identifiers, UTF-8/non-
fixed-length strings, in the same order (fingerprint[i] corresponds to id[i]).

A third chunk stores a multi-valued hash table so I can look up fingerprint(s)
by name, instead of by index. The hash table is a variant of djb's "cdb"
hash table. This is a 32-bit hash, which means for large data sets I get many
collisions. (I designed it for <100M entries. Some people now want to handle
>1B entries.)
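
As a toy illustration of what I mean by a multi-valued 32-bit-hash lookup
(this is not the actual FPB/cdb layout):

    import zlib
    from collections import defaultdict

    def h32(key):
        # a stand-in 32-bit hash; cdb uses its own hash function
        return zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF

    # build: map 32-bit hash -> list of (id, record index) pairs
    buckets = defaultdict(list)
    ids = ["aspirin", "caffeine", "aspirin"]   # duplicate ids are allowed
    for index, name in enumerate(ids):
        buckets[h32(name)].append((name, index))

    def lookup(name):
        # collisions force a string comparison on every candidate in the bucket
        return [i for (n, i) in buckets.get(h32(name), []) if n == name]

    print(lookup("aspirin"))   # -> [0, 2]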

I'm hoping to use an existing format rather than roll my own. Again.

~~~
jmuhlich
Hi Andrew! Looking into similar requirements I came across Feather and fst.
They both basically let you efficiently slice into compressed on-disk
DataFrames. Feather already supports Python, but fst is just for C++ and R at
the moment.

[https://blog.rstudio.com/2016/03/29/feather/](https://blog.rstudio.com/2016/03/29/feather/)

[http://www.fstpackage.org/](http://www.fstpackage.org/)
[https://github.com/fstpackage/fstlib](https://github.com/fstpackage/fstlib)

~~~
dalke
Thanks Jeremy. I looked at Feather, and the underlying Arrow format, but
couldn't figure out how to implement a row-id-label-to-row-index dictionary.
The only "dictionary" I saw was for categorical data.

I had the same issue with fstlib - how do I handle id lookups?

Any pointers?

~~~
jmuhlich
Sorry, I thought Feather supported random row access but it turns out it only
supports random column access.

For fst, I only played with the R interface, which would be called like this
to retrieve row 12345 from the "fingerprint" column:

    read_fst("library.fst", columns="fingerprint", from=12345, to=12345)

However fst didn't offer a raw/binary column datatype last I checked, which is
frustrating. It has chr (string) but R can't have embedded NUL bytes in
strings, so that was a dead-end for efficient storage of binary fingerprints
for R. I didn't check if the underlying fstlib structures accept NULs in
string columns.

~~~
dalke
Thanks for confirming that. Looks like I'll need to look elsewhere for the
next chemfp fingerprint file format.

------
jofer
There seems to be a common misconception that HDF5 somehow isn't chunked like
Zarr is... HDF5 will happily chunk things internally and treat them like a
single large dataset. It's an index of chunks under the hood. It's _very_ much
meant to encourage and allow chunking large datasets transparently.

The big advantage of Zarr is when reading/writing to a cloud bucket, where
having a single file containing chunks/indices is impractical and you'd
rather work with direct references to objects in a bucket.

Otherwise, HDF5 offers every single advantage that Zarr has and is much more
mature, stable, better documented, and has better support. If you're working
in a situation where it makes sense to have things on disk or some sort of NFS
share, use HDF5. If you're working with objects in a cloud bucket, you'll
incur additional overhead with HDF5, as you'll have to read its table of
indices, then make range requests to each chunk. Zarr is optimized for the
cloud use case.
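
For example, in h5py a chunked, compressed dataset is a one-liner (a sketch;
the shape and chunk size here are made up):

    import numpy as np
    import h5py

    with h5py.File("fingerprints.h5", "w") as f:
        # HDF5 stores this as an index of 1000-row chunks, but it reads like one dataset
        dset = f.create_dataset("fp", shape=(1_000_000, 256), dtype="uint8",
                                chunks=(1000, 256), compression="gzip")
        dset[:1000] = np.zeros((1000, 256), dtype=np.uint8)

    with h5py.File("fingerprints.h5", "r") as f:
        block = f["fp"][5000:6000]   # only the chunks covering this slice are read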

~~~
MichaelSalib
 _Otherwise, HDF5 offers every single advantage that Zarr has and is much
more mature, stable, better documented, and has better support._

Absolutely not. HDF5 is an awful format with terrible implementations. For
example, try writing a python program with multiple threads where each thread
writes to a different HDF5 file. This should just work -- there's no
concurrent access. And yet it doesn't because HDF5 implementations are piles
of ancient C code that use lots of global state. There's no technical reason
for this; one could easily store all the state needed in a per-file object.
But back in the day, software eng standards were lower (especially for
scientists), and HDF5 changes at a glacial pace.

I've been bitten by this particular bug, but you really have to wonder: given
how poorly it speaks to the software engineering behind HDF5 implementations,
what else is broken in the code or specifications?

 _If you're working in a situation where it makes sense to have things on
disk or some sort of NFS share, use HDF5. If you're working with objects in a
cloud bucket, you'll incur additional overhead with HDF5, as you'll have to
read its table of indices, then make range requests to each chunk. Zarr is
optimized for the cloud use case._

When last I looked, there were no open source HDF5 implementations that were
smart enough to do range requests to cloud hosted hdf files. Has this changed?

~~~
hcrisp
Have you looked at pyfive [0], h5s3 [1], or Kita [2]? What about version 2.9
of h5py [3], which supports file-like object access?

    [0] https://github.com/jjhelmus/pyfive
    [1] https://h5s3.github.io/h5s3/python.html
    [2] https://www.hdfgroup.org/solutions/hdf-kita/
    [3] http://docs.h5py.org/en/stable/high/file.html#python-file-like-objects

~~~
MichaelSalib
Ah, thanks for these! But I see nothing has changed.

* pyfive is interesting but immature and doesn't seem to have any cloud bucket
support

* h5s3 is an abandoned experiment that hasn't been touched in two years

* h5py is fine but again, no cloud support

* Kita is a commercial offering from the HDF Group and -- I cannot stress this
enough -- these people are shockingly incompetent; plus when I last looked at
their system architecture diagram I thought it was a joke (well, I thought it
was an intentional joke)

Efficient access to scientific datasets hosted on S3/GCP is a full-blown
crisis in the scientific computing community. People aren't switching to zarr
for the fun of it, but because zarr is here today, isn't a joke, and is
actually open.

~~~
hcrisp
It's been a while since I worked on it, but I did get pyfive to work reading
from S3 objects, using either BytesIO around the entire byte array read into
memory, or a custom class that implemented peek, seek, etc. against an S3
object (the first method was better if you need to read the majority of a
large file, the second was better for a small subset of it). Note that it
supports read-only access, not writing. Later I heard that I wouldn't have to
use pyfive since h5py now supports file-like objects. So your comments about
no cloud bucket support are not exactly true.
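
For example, something along these lines should work with h5py >= 2.9 plus
s3fs (a sketch; the bucket and key are placeholders, and performance depends
heavily on the access pattern):

    import h5py
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)

    # s3fs file objects implement read/seek, which is all h5py needs for read-only access
    with fs.open("my-bucket/fingerprints.h5", "rb") as remote:
        with h5py.File(remote, "r") as f:
            subset = f["fp"][0:1000]   # each read becomes one or more S3 range requests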

~~~
MichaelSalib
To be clear, our experience using gcsfuse and friends to do basically the same
things was extremely painful and a performance nightmare. The HDF format was
designed for a world where seeks are free, which makes cloud access very
high-latency and very low-throughput.

------
stavrospap
Hi all, we'd love to hear your thoughts about TileDB
([https://github.com/TileDB-Inc/TileDB](https://github.com/TileDB-Inc/TileDB)),
which offers the key advantages of both HDF5 and Zarr, such as chunked dense
arrays, numerous compression filters and native cloud support. But TileDB is
more, as it also supports sparse arrays and integrates with more languages,
databases and data science tools. If you'd like your data to be readable by
multiple languages and tools, or need array versioning, it is worth taking a
look.

Docs: [https://docs.tiledb.com/developer/](https://docs.tiledb.com/developer/)

Website: [https://tiledb.com/](https://tiledb.com/)

Earlier HN post:
[https://news.ycombinator.com/item?id=15547749](https://news.ycombinator.com/item?id=15547749)

Disclosure: I am a member of the TileDB team.

~~~
just_testing
I am replying here with some constructive criticism, because I found it
_awesome_. As far as I can tell, TileDB is like a better SQLite and better
Parquet, is that it?

The webpage does not tell me exactly what TileDB _is_, so I'm having a hard
time understanding what's underneath all the marketing mumbo-jumbo.

But if it is indeed a better SQLite/Parquet, DAMN, I'm so going to spread the
gospel of this to all my students.

~~~
dekhn
The simplest way to think about TileDB and zarr is that they are zip files of
arrays (either dense or sparse). Not like sqlite, not exactly like parquet.

For a lot of what I do, I want a hierarchical containment system - the
equivalent of folders with files. And the files themselves are leaves in the
hierarchy, containing multidimensional array data. Works great when the arrays
are composed of fairly straightforward payloads, like float[x][y][z], but also
works if your array values are structs. Much of the value in zarr and tiledb
comes from specifically how they arrange the arrays, for convenient read
access to slices of the arrays. Access is going to look like: ages =
root["user"]["age"][100:1000:2]
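
To make that concrete in zarr (a sketch using the zarr v2-style API;
group/array names, shapes and chunking are invented):

    import numpy as np
    import zarr

    root = zarr.open_group("example.zarr", mode="w")
    user = root.create_group("user")
    age = user.create_dataset("age", shape=(10_000,), chunks=(1_000,), dtype="i4")
    age[:] = np.random.randint(0, 100, size=10_000)

    # slicing reads only the chunks that overlap the requested range
    ages = root["user"]["age"][100:1000:2]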

Parquet is mostly a column file format, but with nesting. I'd use it to store
large amounts of structured data with a relatively straightforward schema,
although the schema itself can be fairly nested so some records have very
complex structure. Access would often be in a loop over all records: for
record in records: if record.user.has_age(): print("User age:",
record.user.age)

SQLite is a library/CLI that implements a relational database. It has a SQL
interface and stores data using classic relational DB approaches, including
secondary indices, etc., and permitting joins directly within the engine:
SELECT age FROM user WHERE user.country = 'Bulgaria'

~~~
Shelnutt2
TileDB is more than a format. At its core, it is an engine that allows you to
store and access multi-dimensional arrays (dense and sparse) very fast.
Similar to Parquet, its sparse array support can capture dataframes[1]. It is
more general than Parquet in the sense that it can support fast _multi_
-dimensional slicing, by defining a subset of its columns to act as
dimensions. This effectively buys you a _primary_ multi-dimensional index (and
in the future we could add secondary indexes as well). Also, TileDB handles
updates, time traveling and partitioning at the library level, obviating the
need for using extra services like Delta Lake to deal with the numerous
Parquet files you may create.
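
As a rough sketch of the dense side in our Python API (the URI, shapes and
values here are just placeholders):

    import numpy as np
    import tiledb

    # write a dense 2-D array; TileDB tiles ("chunks") it on disk
    data = np.arange(100, dtype=np.int32).reshape(10, 10)
    tiledb.from_numpy("my_dense_array", data)

    # read back just a multi-dimensional slice
    with tiledb.open("my_dense_array") as A:
        block = A[2:5, 0:3]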

The good news is that we offer efficient integrations with MariaDB, PrestoDB
and Spark, so you can directly process SQL queries on TileDB data via those
engines (which work even for dense data). With MariaDB, we even have an
embedded version which allows running SQL queries directly from Python[2].
This combines the ease of use of SQLite with MariaDB's speed and TileDB's fast
access to AWS S3 (and Azure Blob Store in the next version).

[1] [https://docs.tiledb.com/main/use-cases/dataframes](https://docs.tiledb.com/main/use-cases/dataframes)

[2] [https://docs.tiledb.com/developer/api-usage/embedded-sql](https://docs.tiledb.com/developer/api-usage/embedded-sql)

Disclosure: I am a member of the TileDB team.

------
butterthebuddha
mmap is great. I worked on a project a couple months ago where we had ~20
threads concurrently fetching data over the wire and dumping it into a file.
We just ftruncated the file to ~4TB (bigger than we'd ever need it to be) and
mmap'd it all into memory, initializing the first 8 bytes to be an atomic
integer that kept track of the actual size. The data needed to end up on disk
anyway, and we were more than happy to let the OS take care of that for us. We
also had very little random access (mostly sequential writes) so it worked
really well.
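
A much-simplified, single-process Python sketch of that layout (the real thing
used ~20 threads and atomic increments, which this does not attempt):

    import mmap
    import os
    import struct

    FILE_SIZE = 1 << 30   # pretend "bigger than we'd ever need" (1 GiB here, not 4 TB)
    HEADER = 8            # first 8 bytes hold the number of payload bytes written

    fd = os.open("capture.bin", os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, FILE_SIZE)    # creates a sparse file; unwritten pages cost nothing
    buf = mmap.mmap(fd, FILE_SIZE)

    def append(payload):
        used = struct.unpack_from("<Q", buf, 0)[0]
        buf[HEADER + used:HEADER + used + len(payload)] = payload
        struct.pack_into("<Q", buf, 0, used + len(payload))

    append(b"hello")
    append(b"world")
    buf.flush()
    buf.close()
    os.close(fd)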

~~~
beagle3
Did you ever down-truncate it? Did your backup program take forever to crunch
through the non-existent 4TB (and take as much space to restore)?

Atomic access wise, that's fine as long as you don't map it through a network
file system of any sort; also, $DEITY help you recover a consistent state if
the power goes out or the kernel panics.

Mmap is awesome, I use it all the time. But writing into mmaps robustly is
hard. The best example to learn from that I'm familiar with is the Symas LMDB
source code.

~~~
dekhn
Unwritten pages in files that are extended don't actually take up space or
appear as blocks; a good backup program should respect that. See
[https://en.wikipedia.org/wiki/Sparse_file](https://en.wikipedia.org/wiki/Sparse_file)
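
Easy to demonstrate (a sketch; st_blocks is in 512-byte units on most Unix
systems):

    import os

    with open("sparse.bin", "wb") as f:
        f.truncate(4 * 1024**3)       # 4 GiB logical size, no data written

    st = os.stat("sparse.bin")
    print(st.st_size)                  # 4294967296
    print(st.st_blocks * 512)          # a few KiB at most on sparse-aware filesystems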

~~~
beagle3
Indeed, but many programs, backup or otherwise, are unaware of it. IIRC
gzip/bzip/zstd don't handle it (they will spend time reading and writing all
the missing pages) whereas GNU tar does.

~~~
terrelln
Zstd by default will write sparse files when it detects runs of zeros during
decompression, but isn't aware of it when reading a file.

------
godelski
I thought I'd mention another library, since it is a competitor to HDF5. In
HPC, ADIOS2 [0] and HDF5 are commonly used. ADIOS2 helps with streaming data
if you want to do in situ work, but it can also read and write HDF5 files.

[0] [https://github.com/ornladios/ADIOS2](https://github.com/ornladios/ADIOS2)

------
wdobbels
No mention of dask? So far I haven't really had a need for larger-than-memory
arrays, but dask would be my go-to choice.

~~~
itamarst
Dask, while great, doesn't solve this problem at all. You still need some way
to load the data.

And for that Dask supports Zarr, and HDF5, and pre-existing arrays (where
mmap() comes in), and a few other formats:
[https://docs.dask.org/en/latest/array-creation.html#](https://docs.dask.org/en/latest/array-creation.html#)
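
Roughly, the three routes look like this (a sketch; the paths and chunk sizes
are placeholders):

    import dask.array as da
    import h5py
    import numpy as np

    # from Zarr
    z = da.from_zarr("fingerprints.zarr")

    # from HDF5 (keep the file handle open for the lifetime of the dask array)
    f = h5py.File("fingerprints.h5", "r")
    h = da.from_array(f["fp"], chunks=(10_000, 256))

    # from a pre-existing binary file via mmap
    m = np.memmap("fingerprints.bin", dtype=np.uint8, mode="r", shape=(1_000_000, 256))
    d = da.from_array(m, chunks=(10_000, 256))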

------
jszymborski
I wonder how this compares to bcolz [0]. I've been using it for a recent DL
project and it's been a lifesaver... haven't tried zarr or mmap however.

[0] [https://github.com/Blosc/bcolz](https://github.com/Blosc/bcolz)

------
mborg
The article leaves out a pretty important question you should ask yourself:
"What fraction of the data will I need to access?"

If the answer is "much less than the whole file", accessing the data over the
network makes sense. Otherwise, you should almost certainly bring it to a
local disk, where mmap shines.

------
heartbeats
> The data has to be on the filesystem. You can’t load data from a blob store
> like AWS S3.

You sure about this? You can mount S3 locally, and the filesystem should be
transparent to the kernel.

~~~
itamarst
There are two ways of doing this that I know of:

1. FUSE filesystems. This means sending really slow queries to S3, and so
you're much better off using compression to get better performance.

2. Syncing to a filesystem with AWS DataSync. This would work, yes, but it's
an S3-specific feature and not all blob storage systems have it.

~~~
khc
> 1. FUSE filesystems. This means sending really slow queries to S3, and so
> you're much better off using compression to get better performance.

As author of
[https://github.com/kahing/goofys/](https://github.com/kahing/goofys/) I
respectfully disagree :-)

~~~
itamarst
I don't mean the filesystem implementation is necessarily slow: I'm talking
about bandwidth. If you get 3× compression, reads from S3 will be 3× as fast,
but then you can't use mmap().

------
makapuf
I'd much rather a performance-oriented blog post include at least a few
performance data points (even if put into context): simple vs fast... but by
how much?

------
tbenst
PyTables with Zstd compression is likely the fastest option of them all. In my
benchmarks, PyTables is up to 10x faster than h5py.
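
For anyone who wants to try it, PyTables exposes Zstd through its Blosc filter;
a minimal sketch (the shape and compression level are arbitrary):

    import numpy as np
    import tables

    filters = tables.Filters(complevel=5, complib="blosc:zstd", shuffle=True)

    with tables.open_file("fingerprints_pytables.h5", mode="w") as f:
        arr = f.create_carray(f.root, "fp", atom=tables.UInt8Atom(),
                              shape=(1_000_000, 256), filters=filters)
        arr[:1000] = np.zeros((1000, 256), dtype=np.uint8)

    with tables.open_file("fingerprints_pytables.h5", mode="r") as f:
        block = f.root.fp[5000:6000]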

