
TileDB: Storing massive dense and sparse multi-dimensional array data - rajnathani
http://www.tiledb.io/
======
rajnathani
The underlying technology [1] behind TileDB came out of a collaboration
between MIT and Intel.

And TileDB, the company, formed out of it, has recently received funding [2]
from Intel in Intel's latest tranche of investments.

[1]
[https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17...](https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17_TileDB.pdf)

[2] [https://techcrunch.com/2017/10/19/data-is-the-name-of-the-ga...](https://techcrunch.com/2017/10/19/data-is-the-name-of-the-game-as-intel-capital-puts-60m-in-15-startups-566m-in-2017-overall)

~~~
grabcocque
Good to know private shareholders are getting even richer off all that
taxpayer research money that's pumped into MIT.

~~~
mac01021
Given that this is an open-source project, Intel's involvement doesn't seem
too sinister to me.

~~~
ensignavenger
I think it was MITs involvement the above poster was concerned about... but
the fact that it is open source is indeed relevant!

------
sixdimensional
For posterity's sake, it reminds me of MUMPS persistent sparse arrays, but
built to scale much larger. This comment is not meant in any way to take away
from this achievement, but rather to muse on where we were and where we're
going.

It seems like the days when we were stuck in particular or limited ways of
thinking about databases/persistent storage are finally well and truly behind
us! Now we have many awesome tools to choose from; better to have more tools
in the belt than fewer.

~~~
jakebol
Haha, fully agree! It's good to separate MUMPS the language from MUMPS the
storage layer / database which was innovative and pioneering in many ways. I
don't think we need to re-hash MUMPS the language :)

~~~
sixdimensional
Agreed! :)

------
ttd
Looks interesting... Here's the publication that it appears to have formed
from:
[https://dl.acm.org/citation.cfm?id=3025117](https://dl.acm.org/citation.cfm?id=3025117)
(VLDB 2016).

~~~
styx31
Also available directly here: [PDF, VLDB 2017]
[https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17...](https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17_TileDB.pdf)

------
dx034
I would like to see some information on compression ratios. Compression can
often significantly improve performance (less data to load/save, more in
memory or cache), and dense arrays especially should compress easily. They
mention that compression is supported, but I'm not sure if that's done
internally or just applied to the files they create.

~~~
jbooth
Floating-point data compresses poorly, though.

~~~
jakebol
Jake from TileDB, Inc. here: wenc and srean are right. Techniques such as
those used in zfp and fpzip (which wenc mentioned) are also used to compress
real-world LAS file (point cloud) datasets. For the moment we are only focused
on lossless compression (scientists are paranoid about losing data), but there
is definitely room to explore integration with lossy compression as well.
Machine learning applications often do not need full precision, so intelligent
forms of lossy compression are useful.

Another cool research application of TileDB that extends the storage library
with the VP9 codec can be seen here:
[https://homes.cs.washington.edu/~magda/papers/haynes-sigmod1...](https://homes.cs.washington.edu/~magda/papers/haynes-sigmod17-demo.pdf)

~~~
dx034
You can get very good lossless compression with floating-point numbers;
Facebook's Gorilla paper comes to mind. I usually use it for delta-of-delta
encoding, which provides very high compression for time series. While that
won't really help in your case, their floating-point encoding could help
compress matrices quite efficiently.

[http://www.vldb.org/pvldb/vol8/p1816-teller.pdf](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)
[page 5]
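
To illustrate why that encoding works so well, here is a minimal Python sketch
of the Gorilla-style XOR step (function names are mine, and a real encoder
would pack these fields into a bit stream as described on page 5 of the
paper):

```python
import struct

def float_to_bits(x):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def gorilla_xor_stream(values):
    """Yield (xor_bits, leading_zeros, meaningful_bits) per value.

    Neighboring floats in smooth data share sign, exponent, and high
    mantissa bits, so their XOR is mostly zeros and packs into few bits.
    """
    prev = float_to_bits(values[0])
    yield prev, 0, 64                    # first value stored verbatim
    for v in values[1:]:
        cur = float_to_bits(v)
        x = cur ^ prev
        if x == 0:
            yield 0, 64, 0               # identical value: one control bit
        else:
            lead = 64 - x.bit_length()           # leading zero bits
            trail = (x & -x).bit_length() - 1    # trailing zero bits
            yield x, lead, 64 - lead - trail     # only the middle is stored
        prev = cur

# e.g. list(gorilla_xor_stream([12.0, 12.0, 12.5])): the repeated value
# costs one control bit, and 12.0 vs 12.5 differ in a single mantissa bit.
```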

------
sandGorgon
Anyone know how it compares with Apache Arrow (that pandas is moving to) ?

[http://wesmckinney.com/blog/apache-arrow-pandas-internals/](http://wesmckinney.com/blog/apache-arrow-pandas-internals/)

[http://arrow.apache.org/](http://arrow.apache.org/)

~~~
stavrospap
Stavros from TileDB, Inc. here: Arrow employs a columnar format to store
objects like data frames. TileDB is also columnar (so you can sub-select
attributes and perform analytics with very similar optimizations to Arrow),
but the first-class citizens are (the more general) dense and sparse multi-
dimensional arrays. Moreover, TileDB focuses on optimizing for the persistent
storage backend, so that it can handle out-of-core analytics on massive
datasets that cannot fit in main memory (e.g., genomics), or offer the same
performance using less RAM (leading to cost savings in the cloud).
Nevertheless, we are quite fond of Arrow, so we hope to work together at some
point and integrate as seamlessly as possible.
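
As a rough illustration of the columnar part (file names are made up; this is
not TileDB's on-disk layout):

```python
import numpy as np

# Each attribute lives in its own contiguous buffer, so a query that
# needs only "temperature" never reads the bytes of "pressure".
n_cells = 1_000_000
np.random.rand(n_cells).astype(np.float32).tofile("temperature.attr")
np.random.rand(n_cells).astype(np.float32).tofile("pressure.attr")

# Sub-selecting one attribute is a single sequential read; an interleaved
# row-major record layout would read (and discard) half the bytes.
temps = np.fromfile("temperature.attr", dtype=np.float32)
```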

~~~
polskibus
What do you mean by out-of-core?

~~~
stavrospap
In-core algorithms require the entire array(s) in main memory to perform some
computation. Out-of-core algorithms are typically block-based and stream the
array blocks from persistent storage to main memory on demand, working on
parts of the array(s) at a time, thus minimizing the memory requirements. If
this is done asynchronously and carefully, for some CPU-bound algorithms you
may be able to completely hide the storage-to-memory cost, thus saving memory
without losing performance.
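
A minimal sketch of the block-based pattern, assuming the array lives in a raw
binary file too big for RAM (`np.memmap` standing in for a real storage
manager):

```python
import numpy as np

def out_of_core_sum(path, shape, dtype=np.float64, block_rows=10_000):
    """Sum a huge 2-D array one block of rows at a time."""
    arr = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    total = 0.0
    for start in range(0, shape[0], block_rows):
        block = np.asarray(arr[start:start + block_rows])  # stream a block in
        total += block.sum()      # compute on it, then let it be reclaimed
    return total
```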

------
polskibus
How do I slice & dice TileDB with a query? Can TileDB compute aggregations or
do I need to copy cells into main program memory out of TileDB and then
perform all data manipulation?

~~~
stavrospap
Stavros from TileDB, Inc. here: Currently, TileDB is a storage manager, so it
offers only efficient slicing. Support for aggregation queries (e.g., similar
to what you can do with Pandas) is on our roadmap. We plan to work on it as
soon as we ship the Python bindings.

------
_pmf_
> Python, R, Matlab and Excel

Mentioning Matlab and Excel immediately puts the product in the category "they
know what they are doing" as opposed to "another group of sophomores trying to
reinvent data science".

I'm still waiting for a raw data dump with "avoid copies at all costs" access
for very raw, very verbose vehicle and manufacturing data that is never
accessed in 99.9 percent of cases, but must be analyzed when errors are
detected late in the process. I.e., there's practically no transformation to
apply during storage, but upon access, transformation must be done.

If TileDB is kind of like a more structured, more low-level, struct-oriented
Redis, it is a very welcome addition.

~~~
srean
Apart from a diminishing population of old farts and a clutch of national labs
and universities that Mathworks keeps well greased, is MATLAB even relevant
anymore?

~~~
jakebol
Jake from TileDB, Inc. Engineering as a discipline is conservative (for good
reason). The tools and the processes they use change slowly. Matlab is still
entirely relevant both in the sciences and in industry. There are people with
huge amounts of domain knowledge (who may only know Matlab or Excel) that are
increasingly called upon to analyze and interpret larger amounts of data.
These "old farts" are the people engineering, designing, and debugging our
modern world. Empowering people with domain knowledge to answer data-driven
questions is what the democratization of data science is all about. There is
tremendous value in building bridges across communities and across generations
here.

I say this as someone who helped in small ways to develop open source
alternatives to Matlab.

~~~
StanSeltser
Stanislav Seltser, Petacube, here. I agree with Jake. Matlab still has some
unique features that many open source projects (e.g., Python) don't have, such
as FPGA integration and system modeling and simulation. Combine that with a
decent language, a nice debugger, and many strong industry-specific solutions,
and it's a good bet it will be around for a while. The reason people dump
Matlab is not that it's overpriced per se, but the lack of integration with
big data systems and the fact that Matlab license costs become untenable at
large scale. Plus, the number of industries using these unique Matlab features
is relatively small.

------
brootstrap
Anyone have experience using one of these array data stores to handle large
amounts of weather forecast data? At my job we've come up with our own clever
Postgres solution to handle large amounts of gridded binary data. Essentially
we are inserting large 3D arrays (a series of images where each pixel
represents a geographic location and a value, like wind speed or temperature
or rainfall) into our Postgres. Our solution is solid, but I am always on the
lookout for novel approaches.

~~~
jakebol
Jake from TileDB, Inc. here: I think this would be an ideal workload for array
data stores (NetCDF, a standard in this area, uses HDF5 under the hood). You
have N attributes that you want per grid point over time (and you want to
append along the time dimension). If you are ingesting GRIB2 files then you
can take advantage of compression as well. An array data store like TileDB
should offer advantages for fast access, as you can get a pointer directly to
the stored array and do not have to access the (serialized) data over a
socket, especially if you are only interested in a subarray of the dataset.
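
As a rough sketch of that layout in plain NumPy (shapes and file name are
illustrative, and `np.memmap` stands in for the actual array store):

```python
import numpy as np

# Dense 3-D array indexed (time, lat, lon), one slab per forecast hour.
n_time, n_lat, n_lon = 24, 721, 1440    # e.g. a 0.25-degree global grid
windspeed = np.memmap("windspeed.bin", dtype=np.float32, mode="w+",
                      shape=(n_time, n_lat, n_lon))

# Appending along time is just writing the next slab...
windspeed[6] = np.random.rand(n_lat, n_lon).astype(np.float32)

# ...and a region of interest is a subarray slice: no full-file scan,
# no socket round-trip to a server process.
region = windspeed[:, 100:200, 500:600]
```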

~~~
oliviervg1
Hi both, this is exactly something that I’m looking at doing. We’ve got about
10TB of NetCDF data coming in every day, and we’re looking for a
cost-efficient data store to provide fast access to individual grid points. S3
has proven to be too slow.

Any chance I could pick your brains about using either Postgres or TileDB?

Thanks!

~~~
jakebol
Absolutely! Drop us a line at hello@tiledb.io and tell us a little more about
the problem you are trying to solve and we can go from there.

------
Kiro
Would it make sense to store a tilemap for a game in this? Normally I use
NoSQL for that and just store the tilemap as a JSON array. I need to be able
to update and retrieve single cells easily (this is for a backend in a
multiplayer game).

~~~
k__
I don't know; TileDB seems rather low-level, and many databases have
geospatial indexing.

------
kuwze
I wonder how it compares to SciDB[0] which is also used for storing
multidimensional data.

[0]:
[https://en.wikipedia.org/wiki/SciDB](https://en.wikipedia.org/wiki/SciDB)

~~~
jakebol
Jake from TileDB, Inc. here: In addition to the differences pointed out in the
paper, I think SciDB and TileDB are very different philosophically. SciDB is
architected very much like a traditional RDBMS, while TileDB is much more
lightweight. SciDB encourages you to use its own query language (AQL); TileDB
wants to integrate with and extend the high-level tools you already use
(Python, R, etc.) with as little overhead as possible.

------
mrdrozdov
Could I use this for deep learning research? It’s common for my models to be
anywhere from 100 MB to a few GB. That said, I’d find it more useful for
reading batched training data.

~~~
stavrospap
Stavros from TileDB, Inc. here: TileDB could be useful for storing your
training data (in some storage backend of your choice) as well as your
intermediate data (as davedx pointed out). But this would make sense only if
your data are truly large and cannot fit in main memory. In fact, we are
looking forward to seamlessly integrating with systems like TensorFlow, but we
would rather wait until we can bring some value to applications with very
large storage requirements.

------
davedx
Sounds ideal for storing training/intermediate data for machine learning.
Niche competitor for Cassandra in this space?

~~~
speedplane
Most models I've seen have at most a few hundred MB of parameters and complex
connections that can only be modeled as a set of many different 1- or 2-D
arrays with different sizes. Further, this DB stresses that it handles sparse
data, and most ML data is not sparse.

It doesn't seem like the best application. One of their pages mentions that
they ingest BAM records, which are for biological sequences. I'm guessing some
DNA storage applications.

~~~
wenc
> most ML data is not sparse.

This brings up a question: in what fields does one find heavy use of large,
sparse matrices that need to be persisted and queryable?

In my mind, sparse matrices typically occur in the context of
graphs/relationships, e.g. PageRank, logistics networks, adjacency matrices,
etc. They also tend to be a property of Hessian matrices (2nd order derivative
for a multivariate system). But typically these are intermediate quantities
that are discarded after a computation completes.

~~~
jakebol
Jake from TileDB, Inc. here: Genomics is a big field where sparse matrix
storage is needed. Human genomes are stored as a diff off of a reference,
which, as you indicated, forms a graph that can be represented as a sparse
matrix. In other fields of genomics, such as metagenomics, fragments of DNA
also have a graph-like structure when analyzed.

TileDB supports both dense and sparse arrays. It was designed around the
concept of handling sparse arrays, but dense arrays can be thought of as a
degenerate case of sparse array storage in TileDB: for dense arrays, tile
extents are contiguous and we don't materialize the coordinate values. This
way all the concepts are the same and we can capture both use cases. Sparse
annotations to dense array values, such as NA or null handling, can also be
captured as a sparse array fragment layered over a backing dense array.
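
A toy illustration of that layering idea in plain NumPy (not TileDB's actual
format or API):

```python
import numpy as np

# Dense fragment: values only; cell positions are implied by the layout,
# so coordinates are never materialized.
dense = np.arange(16, dtype=np.float32).reshape(4, 4)

# Sparse fragment: explicit (row, col) coordinates plus values.
coords = np.array([[0, 1], [2, 3]])
values = np.array([np.nan, 7.0], dtype=np.float32)

# A read layers the sparse fragment over the backing dense array;
# sparse cells (an NA annotation and an update here) win on overlap.
overlaid = dense.copy()
overlaid[coords[:, 0], coords[:, 1]] = values
```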

I agree with you that for most use cases, storage will be dense. But it is
useful to have one system that can handle both representations efficiently,
with the sparse case not added on as an afterthought (it also makes the system
simpler).

------
lowsenberg
This looks very interesting. Currently we are storing our dense simulation
(and experimental) data in NetCDF/HDF5. Given correct chunking, this seems to
be pretty efficient both performance- and compression-wise. What would we gain
by using TileDB? How does performance compare with HDF5?

~~~
stavrospap
Stavros from TileDB, Inc. here: HDF5 is great software and TileDB was heavily
inspired by it. HDF5 probably works great for your use case. TileDB matches
HDF5's performance in the dense case, but in addition it addresses some
important limitations of HDF5, which may or may not be relevant to your use
case. These include: sparse array support (not relevant to you); multiple
readers and multiple writers through thread- and process-safety (HDF5 does not
have full thread-safety, and it does not support parallel writes with
compression, though I am assuming you are using MPI and a single writer, so
HDF5 should still work well for you); and efficient writes in a log-structured
manner that enables multi-versioning and fault tolerance (HDF5 may suffer from
file corruption upon error and from file fragmentation, but you are probably
not updating, so this is also not very relevant to you). Having said that, and
echoing Jake's comment, we would love to hear from you about how TileDB could
be adapted to serve your case better.

A general comment: TileDB’s vision goes beyond that of HDF5 (or any scientific
format). Considering the quantities of HDF5 data out there (and the fact that
we like the software), we are thinking about building some integration with
HDF5 (and NetCDF). For instance, you may be able to create a TileDB array by
“pointing” to an HDF5 dataset, without unnecessarily ingesting the HDF5 files
but still enjoying the TileDB API and extra features.

------
marco-lavag
I wish they had Python bindings.

~~~
castis
Their Python, R, Matlab, and Excel bindings are in progress.

------
Osmium
Looks exciting. What's the easiest way to get started with this via
Python/NumPy? Looks like this is a design goal, but not currently supported.

~~~
stavrospap
Stavros from TileDB, Inc. here: We have elevated the Python/NumPy bindings to
our top priority and have already started development. We will try to ship
them ASAP. :)

~~~
Osmium
Excellent! Good to hear :)

Also, I saw in your documentation that you're concentrating on lossless
compression right now, which makes complete sense. However, as a scientist, I
just want to put in a vote for lossy compression too: it's not uncommon to
work with large datasets given in float64 (because float64 is used for the
intermediate processing steps), but the actual final precision we need to
store is much lower, so we're stuck with these huge binary files.
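
(A minimal sketch of the kind of precision reduction I mean: zero out low
mantissa bits that carry no real precision, so a generic lossless compressor
finds long runs of zeros. `keep_bits` is a made-up knob here, not a TileDB
parameter.)

```python
import numpy as np

def truncate_mantissa(a, keep_bits=20):
    """Zero the low (52 - keep_bits) mantissa bits of a float64 array.

    Lossy by construction, but the kept bits are exact, and a generic
    lossless compressor then sees long runs of zeros.
    """
    drop = 52 - keep_bits               # float64 carries 52 mantissa bits
    mask = np.uint64((0xFFFFFFFFFFFFFFFF << drop) & 0xFFFFFFFFFFFFFFFF)
    return (a.view(np.uint64) & mask).view(np.float64)
```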

~~~
stavrospap
The way the code is currently architected allows us to add support for pretty
much any compressor (compatible with our MIT license) with minimal effort. So,
please do send us suggestions about your preferred compressor and we will add
it pretty quickly. Thanks!

------
maxpert
Looks nice, still have to read the whole paper. Seems like it's most useful
for sparse arrays. Maybe we will get a golang port :P

------
acidflask
How long until we can have Julia wrappers?

~~~
leethargo
Since jakebol is involved in both TileDB and Julia, this might be up and
coming?

------
ankitrohatgi
I am curious about how the query performance compares to working with JSON
files in Spark for ~100 GB of data.

~~~
jakebol
Jake from TileDB, Inc. here: Depending on the structure of the JSON files you
are querying, you may be able to take advantage of columnar compression and
massively reduce the dataset size (especially if the JSON files contain
numeric data). Also, repeat queries will not have to re-parse the JSON files.
This may speed up queries quite a lot, but it depends on the specifics of your
problem.
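
A sketch of the parse-once, store-columnar idea (the file and field names are
hypothetical):

```python
import json
import numpy as np

# Parse the text once, then store each numeric field as a typed column.
with open("events.jsonl") as f:
    records = [json.loads(line) for line in f]

price = np.array([r["price"] for r in records], dtype=np.float64)
qty = np.array([r["qty"] for r in records], dtype=np.int32)
price.tofile("price.col")   # compact, compressible, mmap-able
qty.tofile("qty.col")

# Repeat queries read the binary columns directly; no JSON re-parsing.
prices = np.fromfile("price.col", dtype=np.float64)
```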

