
TileDB closes $15M Series A for universal data engine - k-rus
https://tiledb.com/blog/tiledb-closes-15m-series-a-for-industry-s-first-universal-data-engine-2020-07-14
======
gk1
Congrats on the raise!

Meta comment: The confusion we see in this thread is what happens when a
startup tries to create a new category -- a strategy known as "category
creation." Founders imagine everyone will jump onboard with this category and
run to them as the de facto leader of that category.

In reality, it just creates another point of confusion. Whereas before you had
to explain just one thing, now you have to explain two things: What your
product does, _and_ what the category means.

There are many existing subcategories within the "database" category of
products. Pick the subcategory where you want to compete, and market your
innovation to stand out and win over new or existing customers in that
category. Timeseries, in-memory, data lake, RDBMS, ... Lots to choose from.

This goes for TileDB and any startup founder reading this who's about to
launch something like an "Intra-Terrestrial Data Pipeline Miracle" or
"Middle-Out Machine Learning Capacitor" or "The First Cloud Fog Edge Dew
Platform" or whatever.

------
qeternity
It’s difficult to tell from the landing page, but what exactly is this? It
says a lot of things that it’s not, but that leaves it sounding like it’s an
object store with a tightly coupled map reduce framework allowing the
“pluggable computer layer”.

How wrong is this assessment?

~~~
Shelnutt2
Seth from TileDB here.

The foundational invention is the TileDB universal storage engine based on
dense and sparse multi-dimensional arrays. Genomic variants are 2D sparse
arrays. Images and video are dense arrays. Key-value stores are,
interestingly, sparse arrays as well[1]. TileDB is the first solution that can
efficiently model all data with a single powerful data structure. The TileDB
universal storage engine natively supports cloud object stores as well as
local filesystems, and we natively handle eventual consistency through our
MVCC design.
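To make the dense/sparse distinction above concrete, here is a minimal plain-Python sketch. It is not the TileDB API; the variable names, the sample data, and the choice to treat a key-value store as a 1D sparse array with a string coordinate are all assumptions made for illustration.

```python
# Illustrative stand-ins only; none of these names come from TileDB.

# Dense 2D array: every cell in the domain holds a value (e.g. image pixels).
dense_image = [
    [0, 10, 20],
    [30, 40, 50],
]

# Sparse 2D array: only non-empty cells are stored, keyed by coordinates.
# Hypothetical variant layout: (sample_index, genome_position) -> change.
sparse_variants = {
    (0, 1204): "A>G",
    (0, 98311): "C>T",
    (2, 1204): "A>G",
}

# A key-value store viewed as a 1D sparse array (assumption): the key is the
# single coordinate, and most of the possible key space is empty.
kv_as_sparse = {
    ("user:42",): "alice",
    ("user:97",): "bob",
}


def read_cell(sparse, coords, default=None):
    """Return the value at coords, or default if the cell is empty."""
    return sparse.get(tuple(coords), default)


def slice_sparse(sparse, dim, lo, hi):
    """Return non-empty cells whose coordinate on `dim` lies in [lo, hi]."""
    return {c: v for c, v in sparse.items() if lo <= c[dim] <= hi}
```

The same two helpers work for both the variant data and the key-value data, which is roughly the sense in which one array model can cover several kinds of data.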

On top of the TileDB storage engine we have built and worked on integrations
with a range of existing computation tools and frameworks. Our goal is to make
the computation pluggable and provide fast and efficient (zero-copy or Apache
Arrow where possible) integrations to allow you to continue to use your
favorite computation tools in your favorite language. A few examples are our
integrations with NumPy/Pandas, Spark, MariaDB, GDAL and more.

In addition, we built TileDB Cloud on our storage engine, which has two more
innovations: (1) easy data sharing and logging at global scale (beyond
organizations), and (2) a complete serverless infrastructure for scaling out
compute, very similar to the DAG approach of dask.delayed[2].
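For readers unfamiliar with the DAG style mentioned above, the idea can be sketched with the Python standard library alone. This is a toy local scheduler, not TileDB Cloud's or Dask's API; `run_graph`, the task names, and the data are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_graph(tasks, deps):
    """Run tasks in dependency order and return all results.

    tasks: name -> callable taking the results of its dependencies, in order
    deps:  name -> list of task names it depends on
    """
    results = {}
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        args = [results[d] for d in deps.get(name, [])]
        results[name] = tasks[name](*args)
    return results


# Hypothetical three-node graph: two independent loads feeding one reduction.
tasks = {
    "load_a": lambda: [1, 2, 3],
    "load_b": lambda: [10, 20, 30],
    "total": lambda a, b: sum(a) + sum(b),
}
deps = {"load_a": [], "load_b": [], "total": ["load_a", "load_b"]}

results = run_graph(tasks, deps)
```

A serverless system would presumably dispatch the independent nodes (`load_a`, `load_b`) to separate workers; this sketch runs everything sequentially in one process.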

[1] [https://docs.tiledb.com/main/handling-key-value-stores](https://docs.tiledb.com/main/handling-key-value-stores)

[2] [https://docs.tiledb.com/cloud/client-api/task-graphs](https://docs.tiledb.com/cloud/client-api/task-graphs)

~~~
simonebrunozzi
Seth, congrats on the funding.

> The foundational invention is the TileDB universal storage engine based on
> dense and sparse multi-dimensional arrays.

Want some advice? Find a way to explain TileDB not to sound super intelligent,
but to help readers understand. You will be much more successful if you do
just that.

Added bonus: start with what problem TileDB solves, that current solutions
don't solve.

Edit: 1) scalability for complex data; and 2) deployment; seem to be the
problems you solve. Is that correct?

~~~
qeternity
I'm not even sure this is the problem. For the target customer, this language
is fine (sparse arrays being notoriously difficult).

It's the rest of the language that seems to allude to some sort of
innovation/breakthrough but stops short of explaining that.

------
oxfordmale
Yet Another Database.... As Shelnutt2 states below, its success will depend on
how quickly TileDB can be integrated into other tools.

However, I don't see any benefit in TileDB supporting fast and efficient
updates (and duplicates) of time series data. Time series should be immutable
and should only rarely require updating.

~~~
Shelnutt2
Time series data can vary and I'd agree that most time series is immutable.
There are however use cases in which updates can happen. For instance, at my
previous job before TileDB Inc, we had a case where 99% of our data was
immutable and never updated but a very small amount of data could be updated
if there were late arriving parts. In order to get near-realtime data we
accepted that sometimes some columns might not be available within the window
the datapoint represents. In that case that record might reappear at a later
time with the complete and correct values.

Of course there are trade-offs: we could have forced a longer waiting period
until we were confident the data was finalized. We also could have ignored the
updates. In the end we used a system of staging tables for loading the last 24
hours of data before merging out into a more finalized table. This kept the
load of updating records in the database down, and still allowed us to achieve
our goals.
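The staging approach described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the system that was actually built: the function names, the record layout, and the use of in-memory dicts in place of database tables are all hypothetical.

```python
from datetime import datetime, timedelta


def stage(staging, record_id, event_time, columns):
    """Insert a record into staging, or fill in late-arriving columns."""
    _, existing = staging.get(record_id, (event_time, {}))
    # Later values overwrite earlier ones, so a corrected record wins.
    staging[record_id] = (event_time, {**existing, **columns})


def merge_staging(final, staging, now, window=timedelta(hours=24)):
    """Move staged records older than `window` into the finalized table."""
    cutoff = now - window
    ready = [rid for rid, (ts, _) in staging.items() if ts < cutoff]
    for rid in ready:
        _, columns = staging.pop(rid)
        final[rid] = columns
    return final


# Hypothetical usage: a record arrives incomplete, is completed later,
# and is merged only once it falls outside the 24-hour window.
staging, final = {}, {}
t0 = datetime(2020, 7, 14)
stage(staging, "r1", t0, {"a": 1, "b": None})
stage(staging, "r1", t0, {"b": 2})  # late-arriving column
merge_staging(final, staging, now=t0 + timedelta(hours=12))
still_staged = "r1" in staging
merge_staging(final, staging, now=t0 + timedelta(hours=25))
```

Updates only ever touch the small staging structure, which is the point of the design: the finalized table sees each record once.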

At the time, several years ago, I was not aware of TileDB, or else we would
have considered it instead of a more traditional database vendor.

------
broken_symlink
I've looked at tiledb a few times. I think it would make a lot of sense to use
it as a serialization format for legion.
[https://legion.stanford.edu/](https://legion.stanford.edu/)

It's been on my todo list to try it out for a while. Maybe my next weekend
project.

------
carterklein13
I'm looking at some of the comments and still having a little bit of trouble
understanding what makes the data engine "universal."

I see reference to "universal storage" in some areas, but keep landing on a
multi-dimensional array structure for data storage - and this seems kind of at
odds. Maybe I'm missing something, but isn't specifying the structure of the
data inherently not universal?

I'm relatively shielded in my database knowledge, though - having only worked
with "traditional" tools. If I'm missing something definitely let me know!

------
khazhoux
Seeing the number of customer testimonials on their announcement makes me once
again wonder how brand-new products get traction with customers (who really
serve as guinea pigs). I'd love to hear (and learn) how startups have
successfully gotten their foot in the door with technologies like this.

For context, I spent a bit shy of a year a while back developing a middleware
idea, and severely struggled to get anyone to try it. Friends suggested open-
sourcing it, but even that would have been a struggle, I'm sure.

~~~
k-rus
I believe TileDB had some customers when it was framed as the product.
According to its website: "TileDB, Inc. is a data management company spun out
of Intel Labs and MIT"

------
Bootvis
I couldn't find a nice example of using TileDB together with data frames. From
the documentation, I understand this is one of the uses of TileDB. Is someone
aware of a performance oriented blog post about TileDB and data frames?

------
sdinsn
If anyone from TileDB is reading this thread, the link for "Geospatial" under
Applications in the main page's footer points to the wrong link.

~~~
Shelnutt2
Seth from TileDB here. Thanks for reporting this, we've fixed the incorrect
link.

------
lifeisstillgood
Is this solving "where are my HDFS files" ?

~~~
chmod775
Can you elaborate? I can't figure out what you're referring to with just that,
but I wonder whether it's a silly nickname for a common limitation of
databases?

~~~
ben509
HDFS is the Hadoop distributed file system, but maybe they meant HDF5 files,
which is a format the Pandas library saves in. That makes use of Numpy's
n-dimensional arrays, so it's related to what these guys are working on.

~~~
lifeisstillgood
Yeah, my typo - I was trying to understand TileDB and their marketing pitch on
a limited time budget - they mentioned something like "the problem of knowing
where to keep data was solved by traditional databases, now it's solved by us"

On a dumb level I was just wondering if they are tracking the data sets people
create - it does seem more like a data policy than a product. But I may not
understand it well.

~~~
stavrospap
TileDB Embedded is a storage engine like HDF5, with the following
differentiators: (1) it is cloud-native, (2) it also supports sparse arrays,
(3) it offers rapid updates, (4) it supports data versioning and time
traveling built into its format. TileDB Cloud (our cloud SaaS solution)
further allows you to see which arrays you own in the cloud and which ones you
share with others, along with full access logs. You can also attach arbitrary
descriptions and metadata that you can search on, and even find and access
public datasets posted by you or others.
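One common way to get versioning and time traveling is to keep every write as an immutable, timestamped batch and resolve reads against a chosen timestamp. The thread does not describe TileDB's actual on-disk format, so the sketch below is only an assumption about the general technique; `VersionedArray` and its methods are hypothetical names.

```python
class VersionedArray:
    """Append-only sparse array: each write is kept as a timestamped batch."""

    def __init__(self):
        self._batches = []  # list of (timestamp, {coords: value})

    def write(self, timestamp, cells):
        # Never modify earlier batches; an update is just a newer batch.
        self._batches.append((timestamp, dict(cells)))

    def read(self, at=None):
        """Return the array as of time `at` (latest state if None)."""
        view = {}
        # Apply batches oldest first so that later writes win.
        for ts, cells in sorted(self._batches, key=lambda b: b[0]):
            if at is None or ts <= at:
                view.update(cells)
        return view


# Hypothetical usage: overwrite one cell, then read the old and new states.
arr = VersionedArray()
arr.write(1, {(0, 0): "a"})
arr.write(2, {(0, 0): "b", (1, 1): "c"})
old_view = arr.read(at=1)
latest_view = arr.read()
```

Because writers only append, readers never see a half-applied update, which is one reason this style suits eventually consistent object stores.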

------
sjg007
Seems similar to pilosa... What would the differences between these two be
conceptually?

