

BLZ: A data format leveraging the hierarchical memory model - hogu
http://continuum.io/blog/blz-format

======
MichaelGG
I came across a similar solution[1] while creating a packet analyzer. At
speeds like 50Kpps, and needing 10 or so indexes per packet, you end up
needing to do 500K index writes per second. If you plan on making it work on a
single small server and disk, you need to get creative. I used the original
PageRank paper as inspiration, but it turns out to be a rather old and
straightforward problem.

Isn't the key insight just using a couple layers of ISAM? Create ISAM index,
save. Repeat for as many levels as you find useful. Top-level pages are almost
always in RAM, so almost all index lookups require only a single seek. Toss in
a nice compression library and you've got a fairly tight system.
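
Roughly, in Python (the page size, record layout, and names are invented for illustration; nothing here is taken from CiDB or BLZ):

    # Minimal sketch of the layered-ISAM idea: keys are kept sorted, packed
    # into fixed-size leaf pages on disk, and a small top-level index (the
    # first key of every page) stays in RAM, so a lookup is one in-memory
    # binary search plus a single page read.
    import bisect
    import struct

    PAGE_ENTRIES = 512            # (key, value) pairs per on-disk page
    ENTRY = struct.Struct("<qq")  # one 64-bit key, one 64-bit chunk pointer

    def build_index(pairs, path):
        """Write sorted (key, value) pairs as fixed-size pages; return the
        in-RAM top level: the first key of every page."""
        top_level = []
        with open(path, "wb") as f:
            for i in range(0, len(pairs), PAGE_ENTRIES):
                page = pairs[i:i + PAGE_ENTRIES]
                top_level.append(page[0][0])
                for k, v in page:
                    f.write(ENTRY.pack(k, v))
        return top_level

    def lookup(key, top_level, path):
        """One in-memory binary search, then one page read (one seek)."""
        page_no = max(bisect.bisect_right(top_level, key) - 1, 0)
        with open(path, "rb") as f:
            f.seek(page_no * PAGE_ENTRIES * ENTRY.size)
            data = f.read(PAGE_ENTRIES * ENTRY.size)
        for off in range(0, len(data), ENTRY.size):
            k, v = ENTRY.unpack_from(data, off)
            if k == key:
                return v
        return None

    if __name__ == "__main__":
        pairs = sorted((k * 7919 % 100000, k) for k in range(100000))
        top = build_index(pairs, "index.dat")
        print(lookup(pairs[12345][0], top, "index.dat"))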

Not to say it isn't nice to have a standard format and library to implement
such things. It just seems that everything is a variation on things developed
in the 70s.

1: I made each index contain a pair of 64-bit ints, which let me reasonably
index a wide variety of values and point to the likely chunk file. As the keys
and values are ordered, I could get by with easy-to-implement delta
compression only. The indexes only point to a chunk, so many packets share a
single entry. In production, it came out to about 2.8 bits/packet (down from a
few 8-byte hash values).
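
The delta trick on ordered keys looks roughly like this (a generic varint
sketch, not the actual CiDB encoding):

    # Because the keys are stored in order, encoding each one as a varint of
    # its difference from the previous key shrinks 8-byte values down to a
    # byte or two each.
    def encode_deltas(sorted_keys):
        out = bytearray()
        prev = 0
        for k in sorted_keys:
            delta = k - prev
            prev = k
            while True:                      # LEB128-style varint
                byte = delta & 0x7F
                delta >>= 7
                if delta:
                    out.append(byte | 0x80)
                else:
                    out.append(byte)
                    break
        return bytes(out)

    def decode_deltas(data):
        keys, prev, shift, delta = [], 0, 0, 0
        for byte in data:
            delta |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7
            else:
                prev += delta
                keys.append(prev)
                delta, shift = 0, 0
        return keys

    keys = list(range(0, 10_000_000, 137))   # ordered keys, as in the index
    blob = encode_deltas(keys)
    assert decode_deltas(blob) == keys
    print(len(blob) / len(keys), "bytes per key vs 8 raw")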

Here's my admittedly poor KV implementation:
<https://github.com/MichaelGG/cidb/tree/master/CiDB>

~~~
faltet
Blaze is intended to be a much more general solution than a pure key-value
store (although it will be able to tackle this use case too). But yes, the
idea behind BLZ is close to your project, namely, leveraging the resources
available in your computer in the most useful way.

But sorry, I _strongly_ disagree that everything is a variation on things from
the 70s: back in those good old years, the memory hierarchy was far simpler
than it is nowadays, and having to deal with today's hierarchy introduces a
great deal of complexity into libraries that are designed to get the most out
of modern computers.

~~~
MichaelGG
The memory hierarchy can be pretty complicated, but the general design, the
part that makes the rest possible, seems to me to be stacked ISAM. It might
not be a simple stacking; you need to tune the parameters to deal with the
various cache levels. But it's the same principle, isn't it?

I agree the hardware is quite different, vastly more complex.

------
monk_the_dog
I use HDF5 as the file format for my project. I'm not at all an HDF5 expert,
but I know it supports partial reads/writes, extending data sets, and
transparent compression. It's a nice file format and it's very easy to use
from Python (and MATLAB, C++, R, and others).
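
For anyone who hasn't seen it, those three features look roughly like this
with h5py (the file and dataset names here are just placeholders):

    import h5py
    import numpy as np

    with h5py.File("example.h5", "w") as f:
        # transparent compression + an extendable (unlimited) first dimension
        dset = f.create_dataset("ticks", shape=(0,), maxshape=(None,),
                                dtype="i8", chunks=(1024,), compression="gzip")
        # extend the dataset and do a partial write
        dset.resize((10000,))
        dset[:10000] = np.arange(10000)

    with h5py.File("example.h5", "r") as f:
        # partial read: only this slice is decompressed and loaded
        chunk = f["ticks"][2500:2600]
        print(chunk.sum())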

I read the article, and blz looks interesting. What is it aiming to provide
that is missing from HDF5? (Speed?).

~~~
faltet
HDF5 is a very nice format indeed, and in fact BLZ borrows a lot of good
ideas from it. However, HDF5 has its own drawbacks, like not being able to
compress variable-length datasets, the lack of a query/computational kernel,
and its flaky resiliency during updates. Also, its approach to distributing
data among nodes diverges from our goals.

Finally, you are right, speed is pretty important for us, and we think that
our approach can make a better use of current computer architectures.

~~~
radikalus
I use HDF5 for storage/analytics of tick data; my experience has been that
storing large sparse matrices is expensive in both storage space and speed.

I know that a lot of my peers in HFT have an illicit love for column stores,
but a lot of work requires converting to 'wide' format, which can quickly take
a 10-million-row matrix with a few columns to one with a few thousand columns.
(And thus stuff like KX, FastBit, etc. becomes sort of suboptimal.)

The need for massive 'last-of' information for time series leads to basically
abandoning python/pandas/numpy, using C primitives, and doing a lot more
'online' than you'd typically like, but really a lot of this could happen
behind the scenes with intelligent out-of-memory ops.
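
To make the 'wide' blow-up concrete, here's a toy pandas version of both
steps (column and symbol names invented); at real tick-data scale this is
exactly the part that stops fitting in memory:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 200_000
    ticks = pd.DataFrame({
        "ts": pd.Timestamp("2013-01-02") + pd.to_timedelta(
            np.sort(rng.integers(0, 3600, n)), unit="s"),
        "symbol": rng.choice([f"SYM{i:04d}" for i in range(2000)], n),
        "price": rng.lognormal(3, 1, n),
    })

    # long -> wide: a few columns become a few thousand
    wide = ticks.groupby(["ts", "symbol"])["price"].last().unstack("symbol")

    # "last-of" per second, carried forward
    last_of = wide.resample("1s").last().ffill()
    print(ticks.shape, "->", wide.shape, "->", last_of.shape)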

So...I'm pretty excited for innovation in data stores -- I look forward to
seeing more!

------
pbnjay
Sounds really interesting for some of the things I'm working on. I would love
to see some very basic stats on performance before trying it out though.

A fairly trivial example would be to store 8 billion random integers, then
step through them in order and XOR them. That would give a baseline for
comparing monolithic-file performance against Blaze.
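
Something like this for the monolithic-file half (scaled way down here; an
8-billion-int run needs roughly 64 GB of disk), with the BLZ-backed version
as the thing to time against it:

    import numpy as np

    N = 100_000_000          # scale this up toward 8e9 with enough disk
    BLOCK = 4 * 1024 * 1024  # ints per block

    # write random integers into one monolithic file
    data = np.memmap("baseline.bin", dtype=np.uint64, mode="w+", shape=(N,))
    rng = np.random.default_rng(0)
    for i in range(0, N, BLOCK):
        n = len(data[i:i + BLOCK])
        data[i:i + BLOCK] = rng.integers(0, 2**63, size=n, dtype=np.uint64)
    data.flush()

    # stream through it in order and XOR-reduce
    acc = np.uint64(0)
    for i in range(0, N, BLOCK):
        acc ^= np.bitwise_xor.reduce(data[i:i + BLOCK])
    print(hex(int(acc)))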

~~~
csense
This would be a worst-case benchmark. Since random integers aren't
compressible, you'll end up with the overhead of the extra logic for
compression, but none of the benefits since it won't actually reduce the size
of your data.

~~~
faltet
That's right. But even in this case, Blosc, the internal compressor used in
Blaze, can detect pretty early in the compression pipeline whether the data is
compressible or not, and decide to stop compressing and just start copying
(how early that decision is made depends on the compression level).

The good news is that Blosc can still use threads in parallel for doing the
copy, and this normally gives significantly better speed than a non-threaded
memcpy() (the default for many modern systems).
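
You can see both effects from Python with the python-blosc bindings, along
these lines (exact numbers will obviously vary by machine):

    import time
    import numpy as np
    import blosc

    # random 64-bit ints are essentially incompressible
    rng = np.random.default_rng(0)
    data = rng.integers(0, 2**63, size=10_000_000, dtype=np.uint64).tobytes()

    for nthreads in (1, 4):
        blosc.set_nthreads(nthreads)
        t0 = time.time()
        packed = blosc.compress(data, typesize=8, clevel=5)
        t1 = time.time()
        # ratio stays close to 1 (Blosc gives up on compressing), but more
        # threads still speed up the copy-through path
        print(f"{nthreads} thread(s): ratio={len(data)/len(packed):.2f} "
              f"time={t1 - t0:.3f}s")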

~~~
csense
> significantly better speed than a non-threaded memcpy()

Really? I always thought that a single core could always saturate the
available memory bandwidth (unless you have some weird architecture like
NUMA). If you're seeing a multithreaded memcpy with better throughput, maybe
you're just stealing memory bandwidth from other processes (since AFAIK memory
bandwidth is probably allocated on a per-thread basis), or maybe you're
getting more CPU cache allocated to you because you're running on multiple
cores?

This would be interesting to investigate.

~~~
faltet
I'm not sure why, but yes, I'm seeing these speedups. You can see them too at
<http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks>, paying attention to
the results for a compression ratio of 1 (compression disabled). If you find a
good reason why this is happening, I'm all ears.

------
paddy_m
I wonder how long it would take to write out 64Gb of non-compressed zeros.

Disclaimer: I work at Continuum and will ask Francesc the same question.

~~~
freyrs3
mmap'ing /dev/zero is the fastest way I know.
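
On Linux that's only a couple of lines, e.g. (64 MB here just for
illustration, and note that nothing actually gets written to disk):

    import mmap
    import os

    fd = os.open("/dev/zero", os.O_RDWR)
    size = 64 * 1024 * 1024       # zero-filled region, backed by shared pages
    buf = mmap.mmap(fd, size)
    print(buf[:8])                # b'\x00\x00\x00\x00\x00\x00\x00\x00'
    buf.close()
    os.close(fd)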

~~~
pjscott
I'm delighted that mmaping /dev/zero is even allowed, though in retrospect I
suppose this shouldn't be surprising.

------
dman
Could someone from Continuum explain the relationship between DyND and Blaze?

~~~
mwiebe
DyND is being developed as a low-level support library for Blaze.

Blaze is focused on supporting many things like distributed storage,
distributed computation, out of core computation, and high level
representations of computation that can move code to where the data lies.

DyND is focused on data that is purely in memory, and defines a dynamic
multi-dimensional data structure that Blaze can use as a support library. The
development of DyND is also strictly partitioned into a C++ library, which
could be used by other systems that do not want to depend on a Python runtime,
and a Python module which exposes its functionality to Python.

They're still in a relatively experimental state.

~~~
dman
The DyND code looks fantastic btw.

