
TileDB 2.0 and the Future of Data Science - dalke
https://medium.com/tiledb/tiledb-2-0-and-the-future-of-data-science-929cdcfe95ed
======
ipsum2
I don't know if the creators of TileDB visit HN, but if they do - it would be
nice to see a code example that showcases what makes TileDB unique. Right now,
I don't see any benefit versus serializing a data frame in hdf5 and storing it
in Redis or cloud blob storage.

(Reading [https://tiledb.com/developer](https://tiledb.com/developer), the
GitHub repo, or the blog - none of it shows me anything amazing.)

Edit: browsing around the site, I finally found out what TileDB offers, and
I'm still confused (too much fluff language). To save people time:

[https://docs.tiledb.com/main/handling-dataframes](https://docs.tiledb.com/main/handling-dataframes)

Multi-column slicing: By defining any subset of the columns as dimensions and
due to TileDB's tiling flexibility, you can increase the pruning effectiveness
of multi-column slicing, thus leading to better overall read performance.
Essentially, a TileDB array acts as a primary multi-dimensional index on the
columns selected as the array dimensions.

Data updates and versioning: TileDB offers rapid, parallel, cloud-optimized
updates. All the update logic is pushed into the storage engine and is
completely transparent to the user. TileDB also exposes useful time traveling
functionality, such as reading arrays at time snapshots, effectively
implementing data versioning built into a single embeddable library.

Partitioning: TileDB enables balanced partitioning, without limiting each
partition to single column values. Moreover, we will soon expose API functions
for dynamically selecting different partitioning schemes (e.g., on different
subsets of columns with different orders), without the need for
reorganizing/rewriting the array.

Sorting: All sorting is taken care of by TileDB internally with multi-
threading.

~~~
dalke
I heard about TileDB because one of the authors commented at
[https://news.ycombinator.com/item?id=22423632](https://news.ycombinator.com/item?id=22423632)
. That discussion includes what you mentioned, plus a mention that it
integrates with MariaDB, PrestoDB and Spark, including the ability to use an
embedded MariaDB.

~~~
stavrospap
Thanks for the post!

------
willvarfar
So, basically, TileDB is a file format like Parquet, Avro and ORC are
formats?

And it comes with a nice library that makes accessing the data fairly storage-
transparent?

~~~
Shelnutt2
Disclosure: I am a member of the TileDB team.

Yes, you are correct: TileDB introduces a storage format, but it is also a
full-fledged storage engine that implements it. Having the complete storage engine
is what allows us to make accessing the data transparent regardless of the API
or application you use to access the data. We are able to integrate into
different computational tools without having to reimplement or sacrifice
features.

~~~
willvarfar
How do you compare to Iceberg?

~~~
Shelnutt2
Disclosure: I am a member of the TileDB team.

Iceberg, similar to Delta Lake, is another layer you have to add on top of
Parquet, which limits the mobility of your compute. You can currently use it
either with Spark or Presto, but not directly via language APIs or other
compute engines (e.g., pandas, Dask, MariaDB, etc.). For example, pandas'
"read_parquet" will not be able to read your dataframe; you will have to use
PySpark or a Presto SQL query to fetch the data into pandas. TileDB has
updates and data versioning built into its format and storage engine, so
reading an array directly via a TileDB language API, or via Spark, Dask,
PrestoDB or MariaDB, will have the same behavior. You will be able to see the
dataframe after all the changes, or time travel and "activate" only some of
the changes in time.
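
To make that concrete, here is a rough sketch of the difference from Python
(treat the exact tiledb-py calls and the array URI as illustrative):

```python
import pandas as pd
import tiledb

uri = "s3://bucket/my_dataframe"  # hypothetical array URI

# Read the latest committed version directly into pandas; no Spark or
# Presto hop in between:
with tiledb.open(uri) as A:
    df = pd.DataFrame(A[:])

# Time travel: open the array as of an earlier point in time (milliseconds
# since the epoch), seeing only the fragments committed up to then:
with tiledb.open(uri, timestamp=1588000000000) as A_old:
    df_old = pd.DataFrame(A_old[:])
```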

~~~
xyzzy_plugh
This isn't strictly true.

You can't arbitrarily reach into the data from the client with Iceberg or
Delta Lake, true, but that's intentional. The service on top provides a lot of
the functionality TileDB provides. There's nothing preventing you from writing
a pandas integration, for example.

Notably, the most significant difference I can see is that, as far as I can
tell, TileDB will not solve for failures due to S3 consistency, which are
solved by Iceberg and Delta Lake. It's strictly necessary to have a central,
ACID-ish place to record transactions, as probing S3 is not reliable.

At the very least, you are definitely susceptible to stale reads, which means
users will see all sorts of bizarre failures at scale. I'd be pretty terrified
of putting anything serious on TileDB.

~~~
Shelnutt2
Disclosure: I am a member of the TileDB team.

Several important concepts are being conflated here, so I'll elaborate on each
separately.

S3 eventual consistency: TileDB is fully aware of and designed around the
eventual consistency guarantees of cloud object stores [1]. When an array is
opened, only the committed writes (up to the specified timestamp if using time
traveling) are seen by the reader (each write produces a timestamped
"fragment", which is essentially a folder). There will never be partial reads,
or corrupt reads. The array is always in a readable state with committed data.
This works very similarly to Iceberg's method of opening a table at a snapshot
for reads [2]. I don't believe that TileDB is any more affected by stale reads
than Iceberg (or Delta Lake); the user must reopen or re-query a table to see
data from a newer timestamp/snapshot.

Handling S3 write failures: TileDB is designed for a lock-free, multi-writer
scenario. All writes produce a new timestamped fragment, which is immutable
after completion of the write. TileDB performs an atomic write of an "ok"
object, which signals that the fragment is complete. Any fragment that is
missing the "ok" object is ignored by the reader [3]. TileDB handles corrupt or
incomplete fragments by erroring out on the read. In the future, we could
offer a retry mechanism for failed reads, but the important thing is we will
never return corrupt or invalid results. Incomplete fragments can happen
because of S3's eventual consistency; it is possible for the atomic ok file to
show up before an object in the fragment "folder".

Write serializability: TileDB's fragments are written in a (timestamp + uuid)
fashion. In the event of a conflicting write at the same timestamp, the uuid
provides uniqueness and guarantees that there are no errors, by essentially
randomly ordering the conflicting fragments. The end effect is similar
to Iceberg's cancel-and-retry handling of conflicting writes, except that
TileDB does not pay the penalty of retrying. For Iceberg, if there are two
simultaneous conflicting writes, it is effectively random which one would be
accepted and which one would be retried.
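
To make the naming scheme concrete, here is a hypothetical sketch in Python
(illustrative only, not TileDB's actual naming code):

```python
import time
import uuid

def fragment_name():
    # Millisecond write timestamp plus a random uuid; two concurrent
    # writers at the same millisecond still produce distinct names.
    ts = int(time.time() * 1000)
    return f"__{ts}_{ts}_{uuid.uuid4().hex}"

# Readers order fragments by timestamp; the uuid breaks ties between
# simultaneous writes arbitrarily but consistently, with no retries.
fragments = sorted(fragment_name() for _ in range(3))
```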

ACID: TileDB does not support ACID intentionally, as it was not designed to be
a transactional database. TileDB was designed with a lock-free multi-
writer/multi-reader model, as our use cases up until now involved a one-off
massive parallel write followed by multiple concurrent reads. Lack of
transactions does not yield inconsistent data, though; as stated above, the
read/write algorithms are specifically designed with this in mind. That said,
we have recognized that there are workloads where transactions are important,
and we have plans to eventually add transactions to our cloud product where we
have the orchestration layer needed to manage them. It is important to keep
the transactional layer modular and format-agnostic (Delta Lake is blending it
with Parquet) and, therefore, we will build it on top of TileDB, not inside
the storage engine.

Interoperability: Obviously nothing stops someone from writing a pandas
integration for Iceberg or Delta Lake. An important part of our philosophy at
TileDB is extreme integration with existing tools and frameworks, which is why
we strive to support things like returning numpy arrays for the Python results
and doing zero-copying everywhere possible. In other words, it is not trivial
to add a tool integration, just the same as it is not trivial to just add an
ACID layer.

[1] [https://docs.tiledb.com/main/basic-concepts/consistency](https://docs.tiledb.com/main/basic-concepts/consistency)

[2] [https://iceberg.apache.org/reliability/](https://iceberg.apache.org/reliability/)

[3] [https://docs.tiledb.com/main/basic-concepts/physical-storage](https://docs.tiledb.com/main/basic-concepts/physical-storage)

~~~
xyzzy_plugh
Thanks for the detailed response.

This is what I was getting at, mostly:

> Any fragment which is missing the "ok" file, is ignored by the reader [3].

As a client, how do I detect this? Say I just wrote some data and now I want
to generate a new total. Could my data be silently missing?

> TileDB handles corrupt or incomplete fragments by erroring out on the read.
> In the future, we could offer a retry mechanism for failed reads, but the
> important thing is we will never return corrupt or invalid results.

But now I have to handle this everywhere I interact with TileDB. Most users
don't expect essentially random errors. Retries are a stop-gap, as it can take
arbitrary time for consistency to converge. I've observed O(hours) regularly.

> Incomplete fragments can happen because of S3's eventual consistency; it is
> possible for the atomic ok file to show up before an object in the fragment
> "folder".

Indeed! And from what I understand, mostly from conversations with the Iceberg
team last year, they solve this entirely by using a central ACID service
backed by an RDBMS (which, as an aside, I find very amusing!! They never
listen to Stonebraker...).

Indeed, as far as I can tell, the only way to remove consistency issues is to:

- store an index to objects somewhere central

- never list the bucket (or speculatively stat, etc.)

Until either a) you support a central index or b) S3 has a better consistency
model (not that DynamoDB garbage Hadoop uses), I would be very reluctant to
use TileDB if it wasn't using my own disks.

I think for small stuff, it'll work great, but at scale you're going to hit
some roadblocks. I recently encountered splitting a bucket into many buckets
and prefixing UUIDs at the root of the bucket to reduce consistency problems
caused by S3's rebalancing, and it ended up being a very expensive marginal
improvement.

~~~
stavrospap
Folks, apologies, but I think we got a bit sidetracked here; TileDB does not
suffer from the consistency issues mentioned above.

Here is how TileDB performs a new (potentially concurrent with other reads and
writes) write:

- It creates a fragment folder (or "prefix" of a set of objects on S3; there
are no "folders" on S3) which is timestamped and carries a unique UUID. This
fragment is self-contained and represents the entire write (e.g., all cells
and all attribute values).

- It writes all data objects under the fragment prefix. Note that TileDB
never updates; it always writes new immutable objects.

- After all the PUT requests succeed for the data objects, it creates an
empty "ok" object.

Here is how TileDB performs a (potentially concurrent with other reads and
writes) read:

- It lists the array prefix to get the ok objects.

- There are two cases:

1. The ok object is not there for some fragment. That fragment is completely
ignored.

2. The ok object is there. Since TileDB writes the ok object last, all the
data objects it wrote have been committed and are all visible with GET
requests. TileDB reads the data objects only with GET requests (not ListObject
requests). Due to S3's read-after-write consistency model
([https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction...](https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel)),
all those objects will be available for reading (now in all S3 regions) with
GET and there will be no errors.

Therefore, TileDB follows the eventual consistency model of S3 without any
errors or surprises. The user doesn't need to handle anything. Our customers
have been using TileDB in production for a long time, storing hundreds of TBs
of data on S3, and no consistency issue has ever come up.
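
In boto3 terms, the write and read sides boil down to something like the
following sketch (hypothetical code for illustration; the bucket, key names
and layout are made up):

```python
import time
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # hypothetical bucket

def write_fragment(array_prefix, objects):
    # 1. Pick a timestamped, uuid-suffixed prefix for the new fragment.
    frag = f"{array_prefix}/__{int(time.time() * 1000)}_{uuid.uuid4().hex}"
    # 2. PUT every immutable data object under the fragment prefix.
    for name, body in objects.items():
        s3.put_object(Bucket=BUCKET, Key=f"{frag}/{name}", Body=body)
    # 3. Only after all PUTs succeed, commit by writing the empty ok object.
    s3.put_object(Bucket=BUCKET, Key=f"{frag}.ok", Body=b"")

def committed_fragments(array_prefix):
    # A reader lists once for ok objects; a fragment without its ok marker
    # simply does not exist as far as this reader is concerned.
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=array_prefix)
    for obj in listing.get("Contents", []):
        if obj["Key"].endswith(".ok"):
            # Data objects under this prefix are then fetched with plain
            # GETs, which S3's read-after-write consistency serves reliably.
            yield obj["Key"][: -len(".ok")]
```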

Summarizing, what xyzzy_plugh is raising here is that TileDB does not have
ACID guarantees. And that is true (we never claimed the contrary) and
intentional. We are building a transactional layer outside of the storage
engine. The reason is that this transactional layer indeed needs to be a
constantly running distributed service, whereas we want the TileDB storage
engine to be embeddable and used without performance regression even by
applications that do not need ACID (that is, the majority of our data science
applications).

------
xyzzy_plugh
This seems similar to the direction Apache Arrow is going in, but I don't see
the use case for this once Arrow dataframes are prevalent.

~~~
Shelnutt2
Disclosure: I am a member of the TileDB team.

TileDB is designed primarily for persistent storage, similar to HDF5, Zarr and
Parquet. TileDB is a cloud-optimized, dense and sparse multi-dimensional array
storage engine, which is broader than Arrow's Dataframe. In addition, TileDB
handles updates, time traveling and partitioning at the library level,
which are not possible with the Arrow project's current choice of persistent
format, Parquet. TileDB removes the need for using extra services like Delta
Lake for updates or Hive for cataloging and partitioning, as you get all this
functionality in a single, embeddable library.

~~~
missosoup
I feel like you guys need to create some additional engineering-level
documentation to convey all this. The landing page and the few links I
navigated to were all too fluffy.

~~~
MiroF
For things that start as academic work, the original paper is often quite
helpful. I find it an enjoyable read if you have a background in databases (it
was actually a reading for my grad systems class).

[https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17...](https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17_TileDB.pdf)

The basic gist is: TileDB is substantially faster than alternatives for random
writes/reads - this allows for more sophisticated parallel linear algebra
algorithms as well as evaluation of predicates.

------
xiaodai
After reading it, it's not clear it is the future of data science.

------
dalke
The highlighted new features for version 2.0 are: 1) Support for heterogeneous
dimensions in sparse arrays (that is, full dataframes), 2) Support for string
dimensions in sparse arrays, 3) Google Cloud Storage and Azure Blob Storage
support, and 4) Completely revamped R API.

They write: "These features mark an important step in our journey to deliver a
universal data storage engine for all applications and data science tools."

~~~
goodSyntax
I agree, the title and article are mostly hyperbole.

~~~
stavrospap
Stavros from TileDB here. Here is a more verbose explanation. Up until 2.0,
TileDB was already powerful for the main applications we targeted:
geospatial and genomics. The support for both dense and sparse arrays and the
way it handles data versioning made it quite unique vs. HDF5 and Zarr. But we
noticed that most of the data scientists we were working with had a lot of
data beyond genomic variants, LiDAR points and rasters. They had tons of
dataframes. And they were using at least two storage engines: TileDB for
arrays, and Parquet or a relational database for dataframes. If you are in a
large organization, this is a big pain.

In TileDB 2.0 we did a huge refactoring to support something seemingly
simple: dimensions in sparse arrays that can have different types and can
even be strings. This allowed us to model any dataframe as a sparse array,
effectively making TileDB act as a primary multi-dimensional index. In
relational database terms, this means that your data is sorted on disk in an
order that favors your multi-column slicing enormously, so range searches
become rapid.
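
As a rough sketch of what this looks like (illustrative tiledb-py code;
treat the exact argument names as approximate):

```python
import numpy as np
import tiledb

uri = "ticks"  # hypothetical array

# Model a (timestamp, symbol) dataframe as a 2-D sparse array: one int64
# dimension plus one string dimension, which is what 2.0 makes possible.
dom = tiledb.Domain(
    tiledb.Dim(name="ts", domain=(0, 10**12), tile=10**6, dtype=np.int64),
    tiledb.Dim(name="symbol", domain=(None, None), tile=None,
               dtype=np.bytes_),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=True,
    attrs=[tiledb.Attr(name="price", dtype=np.float64)],
)
tiledb.SparseArray.create(uri, schema)

with tiledb.open(uri, "w") as A:
    A[[1, 2, 2], ["AAPL", "AAPL", "MSFT"]] = {"price": [300.0, 301.5, 180.2]}

# Both columns act as a primary multi-dimensional index, so a two-column
# range slice prunes tiles instead of scanning everything:
with tiledb.open(uri) as A:
    result = A.multi_index[1:2, "AAPL":"MSFT"]
```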

Therefore, what we are telling the community with this release is that you can
have dense arrays, sparse arrays, and dataframes in a single embeddable
library, integrated with pretty much every data science tool out there,
so that data scientists never have to worry about backends, files, updates, or
anything other than their scientific analysis. In other words, we believe the
future of data science is more science.

~~~
bionhoward
Show us code usage examples please!

~~~
stavrospap
Full developer docs here:
[https://docs.tiledb.com/main/](https://docs.tiledb.com/main/)

Specific dataframe examples coming up shortly.

------
jiangplus
Looking at the TileDB docs, I like the idea that you can store and manipulate
arrays efficiently. It works like columnar storage, but organized in tiles. I
am not clear on how it could scale to multiple nodes and how to do queries.
Maybe it should be integrated into Postgres.

~~~
Shelnutt2
Disclosure: I am a member of the TileDB team.

TileDB fully supports multi-writer/multi-reader, so scaling compute to
multiple nodes is mostly about choosing the computation layer. We have
integrations with Spark, Dask and PrestoDB for distributed compute. In our
cloud product we also offer serverless UDFs and SQL (via MariaDB) to allow for
scaling out computations elastically without managing your own clusters.
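
For example, with the Dask integration a distributed reduction can look
roughly like this (a sketch assuming a hypothetical dense array on S3):

```python
import dask.array as da

# Each Dask worker reads only its own tiles directly from S3; no central
# server sits between the workers and the array.
x = da.from_tiledb("s3://bucket/my_array")
print(x.mean().compute())  # the reduction runs across the cluster
```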

On the Postgres side, we are looking at eventually adding a Postgres storage
engine, now that the new storage support is in PostgreSQL 11. To start with,
we have built a storage engine for MariaDB, as MariaDB has an excellent
storage engine API, which I have past experience with. We hope to upstream the
storage engine to MariaDB after the MariaDB 10.5 release.

MariaDB integration:
[https://docs.tiledb.com/mariadb/](https://docs.tiledb.com/mariadb/)

Spark integration:
[https://docs.tiledb.com/spark/](https://docs.tiledb.com/spark/)

Dask integration:
[https://docs.tiledb.com/dask/](https://docs.tiledb.com/dask/)

PrestoDB integration:
[https://docs.tiledb.com/prestodb/](https://docs.tiledb.com/prestodb/)

Cloud Docs: [https://docs.tiledb.com/cloud/](https://docs.tiledb.com/cloud/)

