
Writing a Time Series Database from Scratch - spektom
https://fabxc.org/blog/2017-04-10-writing-a-tsdb
======
jnordwick
I still can't figure out why people can't even come close to KDB+. It is a
real conundrum. I've been waiting patiently for something to show up, but the
gap seems to keep getting bigger instead of smaller.

Is it that people want to make the problem more complex than it needs to be?
Is it that those who know most about these issues don't share their secrets,
so implementers on the outside often don't have a good understanding of how
to do things properly? If you were to ask the guy behind Prometheus whether
he's looked at the commercial offerings and what he's learned from them,
would he even be able to speak about them intelligently?

There seems to be a huge skills gap on these things that I can't put my finger
on. I'd love to be able to use a real TSDB, even at only half the speed and
usefulness. It would be great for these smaller firms that can't or won't pay
the license fees for a commercial offering until they get larger.

~~~
zeptomu
I have to admit that I did not know time-series databases were a thing; I
only recently realized it because they keep coming up. Unfortunately I do not
have an answer for you, but I know traditional DB systems are hard to build,
as we expect more reliability and guarantees from them compared to other
(daemon) software.

However, I am interested to know why this kind of data and/or problem
requires specific software and why it can't be handled by a traditional
RDBMS. Obviously you could model the domain with a classic database, but
seemingly there exist important queries that can't be satisfied (at least in
a timely manner) by classic systems - what kind of queries are these, and
why do they fail using a
general-purpose DB approach?

~~~
nickpeterson
I'm no expert, but I believe the crux of the issue is how data is naturally
organized and stored. In a row oriented database, most data is stored in pages
that contain rows. There are often indexes with ordering, but unless a
secondary index contains all the values needed (often called a covering
index), the entire row must be retrieved to answer any query that uses that
information.

Most time-series databases are columnar in nature, and often have the
concept of time baked into the ordering of values (think vectors, not sets).
Because they are columnar, it is trivial to retrieve just the data needed by
a query. Suddenly, instead of loading a billion rows and averaging the value
in one column, you're just accessing the column itself to answer the question.
From an IO perspective that's a huge savings.
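To make that contrast concrete, here is a toy sketch in Python (the data and
layout are invented for illustration, not any real TSDB's format): the same
average computed over a row-oriented and a column-oriented representation.

```python
# Toy sketch (invented data, not any real TSDB's layout) contrasting
# row-oriented and column-oriented storage for a query that averages a
# single column.

# Row-oriented: each record is stored together, so averaging one field
# still drags every full row through the I/O path.
rows = [
    {"ts": 0, "host": "a", "cpu": 0.5, "mem": 512},
    {"ts": 1, "host": "a", "cpu": 0.7, "mem": 520},
    {"ts": 2, "host": "a", "cpu": 0.6, "mem": 530},
]
avg_row = sum(r["cpu"] for r in rows) / len(rows)

# Column-oriented: each field is a contiguous vector, so the query reads
# only the "cpu" column and skips everything else.
columns = {
    "ts":   [0, 1, 2],
    "host": ["a", "a", "a"],
    "cpu":  [0.5, 0.7, 0.6],
    "mem":  [512, 520, 530],
}
cpu = columns["cpu"]
avg_col = sum(cpu) / len(cpu)

assert avg_row == avg_col  # same answer, far less data touched per query
```

The answers are identical; the difference is that the columnar path never
reads the `host` or `mem` values at all.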

Now imagine you have special, dark arts for working on compressed data, and a
query optimizer you've been tuning for a decade for demanding clients. It does
not surprise me that kdb is much faster than the open source competitors. And
to be fair, even with excellent traditional databases like Postgres, I bet
Oracle, DB2, and MS SQL Server are still generally faster on most queries.

~~~
bbrazil
> Suddenly instead of loading a billion rows and averaging the value in one
> column, you're just accessing the column itself to answer the question. From
> an IO perspective that's a huge savings.

The other side of this is that writing out data in that form would naively be
an iop per sample, as you're usually appending one sample to 100s of time
series in one request.

A significant part of monitoring TSDB design is buffering up samples and
batching writes in order to reduce that iop rate to something sane.

For example in the right circumstances the 1.x Prometheus design can ingest
250k samples/s on a hard disk that provides ~100 IOPS.
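A minimal sketch of that batching idea (the class and names here are made up
for illustration; this is not Prometheus's actual code): accumulate samples
per series in memory and persist a whole chunk with a single write, so the
write-op rate is a small fraction of the sample rate.

```python
# Hypothetical illustration of write batching: one flush persists a whole
# chunk of samples, so write ops << samples ingested. (250k samples/s on
# ~100 IOPS implies averaging thousands of samples per physical write.)

class BatchingWriter:
    def __init__(self, flush_threshold=128):
        self.buffers = {}             # series id -> list of (ts, value)
        self.flush_threshold = flush_threshold
        self.write_ops = 0            # stand-in for disk iops

    def append(self, series_id, ts, value):
        buf = self.buffers.setdefault(series_id, [])
        buf.append((ts, value))
        if len(buf) >= self.flush_threshold:
            self.flush(series_id)

    def flush(self, series_id):
        # One write op persists the whole buffered chunk at once.
        self.buffers[series_id] = []
        self.write_ops += 1

w = BatchingWriter(flush_threshold=128)
for ts in range(1280):                # 1280 samples for one series
    w.append("cpu{host=a}", ts, 0.5)
print(w.write_ops)                    # 10 writes instead of 1280
```

The tradeoff, as the durability discussion below notes, is that buffered
samples not yet flushed are lost on a crash.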

------
iksaif
You may also want to check
[https://github.com/criteo/biggraphite/wiki/BigGraphite-Announcement](https://github.com/criteo/biggraphite/wiki/BigGraphite-Announcement),
which is also about how to write a TSDB from scratch, but with different
goals.

------
ah-
Exciting times in database land! It certainly seems like the good systems are
converging on very similar storage architectures. This design is so similar to
how Kafka and Kudu work internally.

As the raw storage seems pretty optimal now, I suspect next we'll see a
comeback of indices for more precise queries to get another jump in
performance.

------
nicolaslem
The description of this new storage engine does not explain how it manages the
durability of the data.

When you compare this with the extreme efforts traditional databases take to
ensure that unplugging a server will never ever result in data loss[0], the
silence on this problem makes me wonder.

Is it that at this ingest rate even trying to ensure durability is a vain
effort?

[0]
[https://www.sqlite.org/atomiccommit.html](https://www.sqlite.org/atomiccommit.html)

~~~
bbrazil
Durability is not a requirement in that sense.

Consider that a regular scrape has happened and that data has been accepted by
the DB but not yet flushed to disk.

Whether the database dies just before or just after the scrape produces the
same result: The data for that scrape isn't present when the server restarts.

There are plenty of other ways a scrape might not succeed that we have no control
over (e.g. other end is overloaded, network blip), so there's not much point
obsessing over this particular failure mode.

> Is it that at this ingest rate even trying to ensure durability is a vain
> effort?

It's not in vain, but it'd be a bad engineering tradeoff in terms of
throughput.

------
bogomipz
I had a question about the following statement from the post:

>"Prometheus's storage layer has historically shown outstanding performance,
where a single server is able to ingest up to one million samples per second
as several million time series"

How does one million samples per second equate to several million time
series? Is a single sample not equivalent to a single data point in a time
series db for a particular metric in Prometheus?

~~~
ah-
I think this means that there are e.g. 10 million different time series, that
each get a new sample appended every 10 seconds.
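The arithmetic behind that reading (the numbers come from the comment above,
not from any Prometheus documentation):

```python
# Back-of-the-envelope check: millions of distinct series, each scraped
# every few seconds, yields a much smaller per-second sample rate.
series = 10_000_000        # distinct time series (illustrative figure)
scrape_interval_s = 10     # one new sample per series every 10 seconds
samples_per_second = series // scrape_interval_s
print(samples_per_second)  # 1000000
```

So a server can hold several million series while ingesting "only" one
million samples per second.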

------
bongonewhere
Is everyone creating a time series database from scratch?

~~~
rodionos
The fact that many companies (FB, Uber, Google, Netflix, SO) roll their own
TSDBs for metrics collection suggests that there is a real need. Or maybe
there is not. It could be a way to make boring system monitoring jobs fun and
fancy again.

~~~
metaobject
Perhaps these companies have such varied requirements that none of the
existing TSDB systems fully meet them? We built our own custom time series
DB, along with a suite of tools for accessing, slicing, plotting, etc.,
because (at the time, at least) there was no support for bit-packing data or
for storing/computing certain spatial data operations.

